[mpich-discuss] MPICH configure

Palmer, Bruce J Bruce.Palmer at pnnl.gov
Thu Apr 23 11:07:43 CDT 2020


Hi Hui,

I was launching the jobs with mpirun. I tried launching the jobs with srun and they no longer hang, but it looks like they are returning an MPI_COMM_WORLD with only one process, although I didn’t investigate this extensively. I also tried Pavan’s suggestion and rebuilt 3.3.2 without Slurm and ran it with mpiexec. This also hangs, and the error I was seeing previously reappears:

[proxy:0:1 at node168.local] HYDU_sock_connect (utils/sock/sock.c:145): unable to connect from "node168.local" to "node100.local" (Connection refused)
[proxy:0:1 at node168.local] main (pm/pmiserv/pmip.c:183): unable to connect to server node100.local at port 54762 (check for firewalls!)
srun: error: node168: task 1: Exited with exit code 5
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmstepd: *** STEP 13201325.0 CANCELLED AT 2020-04-23T08:58:02 *** on node100
slurmstepd: *** JOB 13201325 CANCELLED AT 2020-04-23T08:58:02 *** on node100
[mpiexec at node100.local] HYDU_sock_write (utils/sock/sock.c:256): write error (Bad file descriptor)
[mpiexec at node100.local] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:178): unable to write data to proxy
[mpiexec at node100.local] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:77): unable to send signal downstream
[mpiexec at node100.local] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[mpiexec at node100.local] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec at node100.local] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion

I suppose this could be a firewall issue, but something must also have changed between 3.3.1 and 3.3.2; otherwise it would be a problem for all versions of MPICH.
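
To help separate a firewall problem from an MPICH change, a plain TCP probe between the two nodes may be useful. The sketch below is not part of MPICH or Hydra; can_connect is a hypothetical helper built on Python's standard socket.create_connection. Because mpiexec picks its listening port at run time, something has to be listening on the target node at the port being tested, and both the host and the port are values you supply. A refused connection usually means nothing is listening or the connection is being actively rejected, while a timeout more often points to dropped packets.

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Try one TCP connection; return (True, "") or (False, reason)."""
    try:
        # create_connection resolves the host name and applies the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers refused connections, timeouts, and name resolution errors.
        return False, str(exc)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        host, port = sys.argv[1], int(sys.argv[2])
        ok, reason = can_connect(host, port)
        if ok:
            print(f"{host}:{port} reachable")
        else:
            print(f"{host}:{port} failed: {reason}")
    else:
        print("usage: probe.py HOST PORT")
```

Run it from the node that reports the failure (node168 in the log above) against a port on the other node where a test listener is known to be running. The script name probe.py is only a placeholder.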

Bruce

From: "Zhou, Hui" <zhouh at anl.gov>
Date: Wednesday, April 22, 2020 at 10:22 AM
To: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Bruce,

How did you launch your jobs? Since you configured with slurm, you should launch your job with `srun`, right?
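
As a rough aid for answering the launch question, the following Python sketch prints the variables a launcher typically exports to each process. The names used here (PMI_RANK, PMI_SIZE and PMI_FD from a PMI-1 style process manager such as Hydra; SLURM_PROCID and SLURM_NTASKS from srun) are what one would commonly expect to see, but the list is an assumption and should be checked against the local installation. The function names are hypothetical. If Slurm reports several tasks while no PMI size is visible, each process may be initializing on its own, which would be one possible explanation for an MPI_COMM_WORLD of size one.

```python
import os

# Assumed, non-exhaustive list of variables exported by common launchers.
LAUNCHER_VARS = ("PMI_RANK", "PMI_SIZE", "PMI_FD", "SLURM_PROCID", "SLURM_NTASKS")


def launcher_report(env=None):
    """Return the launcher-related variables found in env (None if unset)."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in LAUNCHER_VARS}


def looks_like_singleton(env=None):
    """True when Slurm reports more than one task but no PMI size is visible."""
    report = launcher_report(env)
    ntasks = report["SLURM_NTASKS"]
    return ntasks is not None and int(ntasks) > 1 and report["PMI_SIZE"] is None


if __name__ == "__main__":
    for name, value in launcher_report().items():
        print(f"{name}={value}")
    print("possible singleton init:", looks_like_singleton())
```

Running the script once under srun and once under mpiexec, with two tasks in each case, and comparing the output should show which launcher is passing PMI information to the processes.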

--
Hui Zhou


From: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Date: Wednesday, April 22, 2020 at 11:55 AM
To: "Zhou, Hui" <zhouh at anl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Hui,

It's currently 14.03.8.

Bruce

From: "Zhou, Hui" <zhouh at anl.gov>
Date: Wednesday, April 22, 2020 at 9:36 AM
To: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Bruce,

Thanks for the effort checking these versions. What Slurm version do you have on the cluster?

--
Hui Zhou


From: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Date: Wednesday, April 22, 2020 at 11:02 AM
To: "Zhou, Hui" <zhouh at anl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Hui,

I rebuilt everything from scratch and tried running several versions of MPICH. Release 3.3.1 seems to work okay, but 3.3.2 hangs. Here is a complete summary of the versions I ran:

3.3rc1: Works
3.3: Works
3.3.1: Works
3.3.2: Hangs
3.4a2: Hangs

I’m not seeing the error message from hydra anymore (I have no idea why not), but I attached to one of the hung processes when running with 3.3.2 and got the following backtrace from gdb:

(gdb) where
#0  0x0000003d7ce0e810 in __read_nocancel () from /lib64/libpthread.so.0
#1  0x00002aaaac6e936e in PMIU_readline () from /people/d3g293/mpich/mpich-3.3.2/install/lib/libmpi.so.12
#2  0x00002aaaac6e985b in GetResponse.part.0 () from /people/d3g293/mpich/mpich-3.3.2/install/lib/libmpi.so.12
#3  0x00002aaaac6e4e36 in MPIDU_shm_seg_commit () from /people/d3g293/mpich/mpich-3.3.2/install/lib/libmpi.so.12
#4  0x00002aaaabc541dc in MPIR_Init_thread () from /people/d3g293/mpich/mpich-3.3.2/install/lib/libmpi.so.12
#5  0x00002aaaabc3db7e in PMPI_Init () from /people/d3g293/mpich/mpich-3.3.2/install/lib/libmpi.so.12
#6  0x0000000000408d0e in main ()

Bruce

From: "Zhou, Hui" <zhouh at anl.gov>
Date: Monday, April 20, 2020 at 10:25 AM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Subject: Re: [mpich-discuss] MPICH configure

The error is from `hydra`, which should not have changed much between the versions. Could you verify that 3.3.1 still works for you?

--
Hui Zhou


From: "Palmer, Bruce J via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Thursday, April 16, 2020 at 5:48 PM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Subject: [mpich-discuss] MPICH configure

Hi,

I’ve been building MPICH on our aging InfiniBand cluster using the following formula:


./configure --prefix=/people/d3g293/mpich/mpich-3.3.2/install --with-device=ch4:ofi:sockets --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++

It’s been working pretty well, but I recently tried to build mpich-3.3.2 and mpich-3.4a2, and although the build seems to work okay, I’m having problems actually running anything. If I run on 2 nodes, the code seems to hang in MPI_Init, and it looks like it is producing the error message:


[proxy:0:1 at node013.local] HYDU_sock_connect (utils/sock/sock.c:145): unable to connect from "node013.local" to "node012.local" (Connection refused)
[proxy:0:1 at node013.local] main (pm/pmiserv/pmip.c:183): unable to connect to server node012.local at port 37769 (check for firewalls!)
srun: error: node013: task 1: Exited with exit code 5

If I run on a single node, things seem to work. Any idea what is going on here? I’ve got a working build of mpich-3.3, so things were okay up until recently. Has something in MPICH changed and my configuration formula is no good, or is this more likely to be due to some system modification?

Bruce Palmer
Senior Research Scientist
Pacific Northwest National Laboratory
Richland, WA 99352
(509) 375-3899
