[mpich-discuss] How to set port range used under LSF? [EXT]

Sendu Bala sb10 at sanger.ac.uk
Tue Mar 9 17:35:10 CST 2021


Hi,

As noted, I had already tried MPIEXEC_PORT_RANGE with no difference.

I don’t know whether it’s normal for blaunch to open its own ports, in which case this has nothing to do with mpich or mpiexec at all. The follow-on issue is that if I ssh to one of the other nodes, nothing is listening on the control port; I only see a process listening on one of these other ports outside my specified range. I think this might be related to my fundamental problem of mpiexec failing to do anything under LSF most of the time at high host counts (see my previous thread on this list).


On 9 Mar 2021, at 16:21, Zhou, Hui <zhouh at anl.gov> wrote:

Hi Sendu,

Could you try `MPIEXEC_PORT_RANGE` instead?

I know it is confusing and the documentation probably needs an update/correction, but `MPICH_PORT_RANGE` is for `MPICH` rather than `MPIEXEC`.
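
As a minimal sketch of that distinction, both variables can be exported in the job script so that the launcher and the MPI library each see a range. The `low:high` value format is taken from the command quoted further down this thread; whether this also constrains the extra blaunch listeners is not established here.

```shell
# Range for the launcher (mpiexec and its proxies).
export MPIEXEC_PORT_RANGE="46107:46140"
# Range for the MPICH library inside the application processes.
export MPICH_PORT_RANGE="46107:46140"

# Then launch as before from within the bsub job, for example:
#   mpiexec mpich/examples/cpi
```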

--
Hui Zhou


From: Sendu Bala via discuss <discuss at mpich.org>
Date: Tuesday, March 9, 2021 at 7:16 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Sendu Bala <sb10 at sanger.ac.uk>
Subject: [mpich-discuss] How to set port range used under LSF?

Via a bsub, I’m doing:

MPICH_PORT_RANGE="46107:46140" mpiexec mpich/examples/cpi

When I ssh to the controlling node, I see it has spawned a set of blaunch processes with `--control-port node-12-3-2:46107` as expected, but:

ss -l -p -n | grep blaunch
tcp  LISTEN  0  128  0.0.0.0:34361  0.0.0.0:*  users:(("blaunch",pid=2823,fd=7))
tcp  LISTEN  0  128  0.0.0.0:46107  0.0.0.0:*  users:(("cpi",pid=2839,fd=5),("blaunch",pid=2837,fd=5),("blaunch",pid=2836,fd=5),("blaunch",pid=2835,fd=5),("blaunch",pid=2834,fd=5),("blaunch",pid=2833,fd=5),("blaunch",pid=2832,fd=5),("blaunch",pid=2831,fd=5),("blaunch",pid=2830,fd=5),("blaunch",pid=2829,fd=5),("blaunch",pid=2828,fd=5),("blaunch",pid=2827,fd=5),("blaunch",pid=2826,fd=5),("blaunch",pid=2825,fd=5),("blaunch",pid=2824,fd=5),("blaunch",pid=2823,fd=5),("hydra_pmi_proxy",pid=2822,fd=5),("mpiexec",pid=2821,fd=5))
tcp  LISTEN  0  128  0.0.0.0:43741  0.0.0.0:*  users:(("blaunch",pid=2825,fd=12))
tcp  LISTEN  0  128  0.0.0.0:41983  0.0.0.0:*  users:(("blaunch",pid=2830,fd=22))
tcp  LISTEN  0  128  0.0.0.0:41215  0.0.0.0:*  users:(("blaunch",pid=2832,fd=26))
tcp  LISTEN  0  128  0.0.0.0:34433  0.0.0.0:*  users:(("blaunch",pid=2831,fd=24))
tcp  LISTEN  0  128  0.0.0.0:33219  0.0.0.0:*  users:(("blaunch",pid=2827,fd=16))
tcp  LISTEN  0  128  0.0.0.0:34405  0.0.0.0:*  users:(("blaunch",pid=2837,fd=36))
tcp  LISTEN  0  128  0.0.0.0:43465  0.0.0.0:*  users:(("blaunch",pid=2836,fd=34))
tcp  LISTEN  0  128  0.0.0.0:39755  0.0.0.0:*  users:(("blaunch",pid=2833,fd=28))
tcp  LISTEN  0  128  0.0.0.0:38095  0.0.0.0:*  users:(("blaunch",pid=2829,fd=20))
tcp  LISTEN  0  128  0.0.0.0:44625  0.0.0.0:*  users:(("blaunch",pid=2834,fd=30))
tcp  LISTEN  0  128  0.0.0.0:35345  0.0.0.0:*  users:(("blaunch",pid=2835,fd=32))
tcp  LISTEN  0  128  0.0.0.0:43827  0.0.0.0:*  users:(("blaunch",pid=2826,fd=14))
tcp  LISTEN  0  128  0.0.0.0:40915  0.0.0.0:*  users:(("blaunch",pid=2828,fd=18))
tcp  LISTEN  0  128  0.0.0.0:42549  0.0.0.0:*  users:(("blaunch",pid=2824,fd=9))

Why are these all listening on ports outside my range? I’ve also tried setting MPIEXEC_PORT_RANGE and MPIR_CVAR_CH3_PORT_RANGE and still have the problem.
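
To pick out just the offending listeners, something like the following filter might help. It is only a sketch: the function name `outside_range` is made up for illustration, and it assumes the column layout of the `ss -l -p -n` output shown above (state in column 2, local address:port in column 5, process list last).

```shell
# Print "port process-list" for every listening socket whose local
# port falls outside the inclusive range LOW..HIGH.
# Reads `ss -l -p -n` style lines on stdin.
outside_range() {
    awk -v low="$1" -v high="$2" '
        $2 == "LISTEN" {
            # The port is whatever follows the last ":" in column 5.
            n = split($5, parts, ":")
            port = parts[n] + 0
            if (port < low || port > high) print port, $NF
        }'
}

# Example use on a node (not run here):
#   ss -l -p -n | grep blaunch | outside_range 46107 46140
```

Run against the listing above, this would leave out the `0.0.0.0:46107` line and keep the single-blaunch listeners.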

Is there any way to fully control the ports used?


Cheers,
Sendu.




--
 The Wellcome Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss



