[mpich-discuss] MPICH configure

Zhou, Hui zhouh at anl.gov
Fri Apr 24 15:33:29 CDT 2020


Hi Bruce,

Before I offer a solution (I don’t have one yet), let’s understand what is needed. Do you have firewall rules between the nodes?

--
Hui Zhou


From: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Date: Friday, April 24, 2020 at 2:54 PM
To: "Zhou, Hui" <zhouh at anl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hui,

mpirun is pointing to mpiexec.hydra. I’m using “mpirun -n 6 executable.x” to launch jobs. The system guys have given me some information about linking to the pmi libraries, and I’m going to try that to see if it lets me use srun properly. They also suggested trying to configure MPICH with --add-pmi, although that doesn’t look like an MPICH configuration option. I did see an --enable-pmiport option. Is that something I should try?

Bruce

From: "Zhou, Hui" <zhouh at anl.gov>
Date: Friday, April 24, 2020 at 10:59 AM
To: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Bruce,

mpirun, mpiexec, and mpiexec.hydra are all the same binary; the first two are symbolic links to the last. Please verify this. If they are not, then you have an installation issue. Hydra detects the environment and uses information gathered from, e.g., slurm. For example, it will pick up the number of processes you are launching with. When in doubt, use an explicit command-line option such as `-n <numprocs>`. By the way, what is the complete command line you used to launch jobs in each case?
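One way to check the symbolic links is sketched below. It assumes the MPICH bin directory is first on PATH and that `readlink -f` (GNU coreutils) is available; the function name `resolve_launcher` is only illustrative.

```shell
#!/bin/sh
# Print the file each launcher name resolves to after following symlinks.
# Assumption: the MPICH install's bin directory is on PATH.
resolve_launcher() {
    path=$(command -v "$1" 2>/dev/null) || { echo "$1: not found"; return 1; }
    echo "$1 -> $(readlink -f "$path")"
}

# All three names should report the same target if the install is intact.
for name in mpirun mpiexec mpiexec.hydra; do
    resolve_launcher "$name" || true
done
```

If the three lines do not end in the same path, a different launcher (for example one from another MPI installation) is probably being picked up first on PATH.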

--
Hui Zhou


From: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Date: Friday, April 24, 2020 at 11:49 AM
To: "Zhou, Hui" <zhouh at anl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Hui,

I’m not sure what is going on with srun, but it doesn’t seem to work the way you are describing, at least on our system. I’ve sent an inquiry to our system administrators about it, but generally, when I launch with srun, it looks like MPI_Comm_size returns 1 for the size of MPI_COMM_WORLD no matter how many processors I’m actually running on.
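A size of 1 under srun may mean that each task starts without the PMI information the MPI library expects and so initializes on its own. A rough way to see what a launcher hands to each task is to run a small script in place of the executable. This is only a sketch: PMI_* is assumed to be the prefix a PMI-1 launcher exports, SLURM_PROCID and SLURM_NTASKS are the usual Slurm task variables, and `print_launch_env` is an illustrative name.

```shell
#!/bin/sh
# List the launcher-related variables visible to this process.
# Assumption: PMI launchers export PMI_* variables; Slurm sets SLURM_PROCID
# and SLURM_NTASKS for each task it starts.
print_launch_env() {
    vars=$(env | grep -E '^(PMI_|SLURM_PROCID=|SLURM_NTASKS=)' | sort)
    if [ -n "$vars" ]; then
        printf '%s\n' "$vars"
    else
        echo "no PMI_* or SLURM task variables set"
    fi
}

print_launch_env
```

Saved as, say, show_env.sh (a hypothetical name) and made executable, it can be started with both “srun -n 2 ./show_env.sh” and “mpiexec -n 2 ./show_env.sh” to compare what each launcher provides.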

I’ve tried the following combinations:

mpich-3.3.2 built without slurm
      Launch with mpiexec.hydra: hangs in MPI_Init
      Launch with srun: only get 1 processor
      Launch with mpirun: hangs in MPI_Init
      Launch with mpiexec: hangs in MPI_Init

mpich-3.3.2 built with slurm
      Launch with mpiexec: hangs in MPI_Init
      Launch with srun: only get 1 processor
      Launch with mpirun: hangs in MPI_Init

mpich-3.3.1 built with slurm
      Launch with mpiexec: runs okay
      Launch with srun: only get 1 processor
      Launch with mpirun: runs okay

I was under the impression that MPICH created its own version of mpirun/mpiexec depending on what it found out about the scheduling system during configuration and then built mpirun or mpiexec accordingly. Is this not correct?

Bruce

From: "Zhou, Hui" <zhouh at anl.gov>
Date: Thursday, April 23, 2020 at 1:31 PM
To: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Bruce,

I noticed you are mixing up things a bit, so let’s clear it up first:


  *   If you compile with slurm then you should only launch with srun, not mpirun. Otherwise it won’t work.
  *   If you compile without slurm, then you should launch with mpirun, or more precisely, mpiexec.hydra.

Depending on which you are doing, each may have issues in your environment, but those issues are probably unrelated and should not be discussed in the same context.

What is working for you with 3.3.1? With slurm? If that’s what’s working for you, then let’s focus on compiling and running 3.3.2 with slurm as well.

--
Hui Zhou


From: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Date: Thursday, April 23, 2020 at 11:08 AM
To: "Zhou, Hui" <zhouh at anl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Hui,

I was launching the jobs with mpirun. I tried launching the jobs with srun and they no longer hang, but it looks like they are returning an MPI_COMM_WORLD with only one process, although I didn’t investigate this extensively. I also tried Pavan’s suggestion and rebuilt 3.3.2 without slurm and ran it with mpiexec. This also hangs, and the error I was seeing previously reappears:

[proxy:0:1 at node168.local] HYDU_sock_connect (utils/sock/sock.c:145): unable to connect from "node168.local" to "node100.local" (Connection refused)
[proxy:0:1 at node168.local] main (pm/pmiserv/pmip.c:183): unable to connect to server node100.local at port 54762 (check for firewalls!)
srun: error: node168: task 1: Exited with exit code 5
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmstepd: *** STEP 13201325.0 CANCELLED AT 2020-04-23T08:58:02 *** on node100
slurmstepd: *** JOB 13201325 CANCELLED AT 2020-04-23T08:58:02 *** on node100
[mpiexec at node100.local] HYDU_sock_write (utils/sock/sock.c:256): write error (Bad file descriptor)
[mpiexec at node100.local] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:178): unable to write data to proxy
[mpiexec at node100.local] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:77): unable to send signal downstream
[mpiexec at node100.local] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[mpiexec at node100.local] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec at node100.local] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion

I suppose this could be a firewall issue, but there must also be some changes between 3.3.1 and 3.3.2, otherwise it should be a problem for all versions of mpich.

Bruce

From: "Zhou, Hui" <zhouh at anl.gov>
Date: Wednesday, April 22, 2020 at 10:22 AM
To: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Bruce,

How did you launch your jobs? Since you configured with slurm, you should launch your job with `srun`, right?

--
Hui Zhou


From: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Date: Wednesday, April 22, 2020 at 11:55 AM
To: "Zhou, Hui" <zhouh at anl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Hui,

It’s currently 14.03.8

Bruce

From: "Zhou, Hui" <zhouh at anl.gov>
Date: Wednesday, April 22, 2020 at 9:36 AM
To: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Bruce,

Thanks for the effort in checking these versions. What slurm version do you have on the cluster?

--
Hui Zhou


From: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Date: Wednesday, April 22, 2020 at 11:02 AM
To: "Zhou, Hui" <zhouh at anl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Hui,

I rebuilt everything from scratch and tried running several versions of mpich. Release 3.3.1 seems to work okay but 3.3.2 hangs. Here is a complete summary of the versions I ran:

3.3rc1: Works
3.3: Works
3.3.1: Works
3.3.2: Hangs
3.4a2: Hangs

I’m not seeing the error message from hydra anymore (I have no idea why not), but I logged into one of the hung processes when running with 3.3.2 and got the following listing from gdb

(gdb) where
#0  0x0000003d7ce0e810 in __read_nocancel () from /lib64/libpthread.so.0
#1  0x00002aaaac6e936e in PMIU_readline () from /people/d3g293/mpich/mpich-3.3.2/install/lib/libmpi.so.12
#2  0x00002aaaac6e985b in GetResponse.part.0 () from /people/d3g293/mpich/mpich-3.3.2/install/lib/libmpi.so.12
#3  0x00002aaaac6e4e36 in MPIDU_shm_seg_commit () from /people/d3g293/mpich/mpich-3.3.2/install/lib/libmpi.so.12
#4  0x00002aaaabc541dc in MPIR_Init_thread () from /people/d3g293/mpich/mpich-3.3.2/install/lib/libmpi.so.12
#5  0x00002aaaabc3db7e in PMPI_Init () from /people/d3g293/mpich/mpich-3.3.2/install/lib/libmpi.so.12
#6  0x0000000000408d0e in main ()

Bruce

From: "Zhou, Hui" <zhouh at anl.gov>
Date: Monday, April 20, 2020 at 10:25 AM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Subject: Re: [mpich-discuss] MPICH configure

The error is from `hydra`, which should not have changed much between the versions. Could you verify that 3.3.1 still works for you?

--
Hui Zhou


From: "Palmer, Bruce J via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Thursday, April 16, 2020 at 5:48 PM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Subject: [mpich-discuss] MPICH configure

Hi,

I’ve been building MPICH on our aging InfiniBand cluster using the following formula


./configure --prefix=/people/d3g293/mpich/mpich-3.3.2/install --with-device=ch4:ofi:sockets --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++

It’s been working pretty well but I recently tried to build mpich-3.3.2 and mpich-3.4a2 and although the build seems to work okay, I’m having problems actually running anything. If I run on 2 nodes the code seems to hang on MPI_Init and it looks like it is producing the error message


[proxy:0:1 at node013.local] HYDU_sock_connect (utils/sock/sock.c:145): unable to connect from "node013.local" to "node012.local" (Connection refused)
[proxy:0:1 at node013.local] main (pm/pmiserv/pmip.c:183): unable to connect to server node012.local at port 37769 (check for firewalls!)
srun: error: node013: task 1: Exited with exit code 5

If I run on a single node, things seem to work. Any idea what is going on here? I’ve got a working build of mpich-3.3, so things were okay up until recently. Has something in MPICH changed and my configuration formula is no good, or is this more likely to be due to some system modification?

Bruce Palmer
Senior Research Scientist
Pacific Northwest National Laboratory
Richland, WA 99352
(509) 375-3899
