[mpich-discuss] MPICH configure

Zhou, Hui zhouh at anl.gov
Thu Apr 30 13:25:04 CDT 2020


Hi Bruce,

Could you share your job scripts with us? It would help us understand exactly how you launch your jobs.

--
Hui Zhou


From: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Date: Thursday, April 30, 2020 at 9:25 AM
To: "Zhou, Hui" <zhouh at anl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Hui,

I checked with the system admins and they say there are no firewall restrictions between nodes. According to them, the nodes can talk to any port on any machine.

I’ve also followed up on their suggestion that I link with PMI, and verified with ldd that the PMI libraries show up before MPI in the executables. I still only get 1 process when running with srun. This is a summary of what I see when I run on 2 nodes (after configuring and building with Slurm).

MPICH-3.3.1
Launch with mpiexec: runs okay
Launch with mpirun: runs okay
Link with PMI and Launch with srun: only get 1 process (from MPI_Comm_size on MPI_COMM_WORLD)

MPICH-3.3.2
Launch with mpiexec: hangs
Launch with mpirun: hangs
Link with PMI and Launch with srun: only get 1 process (from MPI_Comm_size on MPI_COMM_WORLD)

For what it’s worth, it looks like the error message

[proxy:0:1@node168.local] HYDU_sock_connect (utils/sock/sock.c:145): unable to connect from "node168.local" to "node100.local" (Connection refused)
[proxy:0:1@node168.local] main (pm/pmiserv/pmip.c:183): unable to connect to server node100.local at port 54762 (check for firewalls!)

doesn’t show up immediately. It seems to appear (if it appears at all) only after the system has been hung for a while.
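
One way to check node-to-node connectivity independently of Hydra is a small TCP probe run from one compute node against another. The following is only a minimal sketch, assuming Python 3 is available on the nodes; the function name `probe` and the command-line usage are illustrative and not part of MPICH or Slurm. Since the port in the message above is chosen per run, the probe would have to target a port where something is known to be listening.

```python
import socket
import sys


def probe(host, port, timeout=3.0):
    """Attempt one TCP connection and classify the outcome (hypothetical helper)."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection on that port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or no route to host.
        return f"error: {exc}"


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (illustrative): python probe.py node100.local 54762
    target_host, target_port = sys.argv[1], int(sys.argv[2])
    print(target_host, target_port, probe(target_host, target_port))
```

A "refused" result means the target host responded but no process accepted the connection on that port, whereas a firewall that silently drops packets would more typically show up as "timeout".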

Bruce

