[mpich-discuss] MPICH configure

Palmer, Bruce J Bruce.Palmer at pnnl.gov
Wed Apr 22 11:55:15 CDT 2020


Hi Hui,

It's currently 14.03.8.

Bruce

From: "Zhou, Hui" <zhouh at anl.gov>
Date: Wednesday, April 22, 2020 at 9:36 AM
To: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Bruce,

Thanks for the effort checking these versions. What is the Slurm version that you have on the cluster?

--
Hui Zhou


From: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Date: Wednesday, April 22, 2020 at 11:02 AM
To: "Zhou, Hui" <zhouh at anl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Hui,

I rebuilt everything from scratch and tried running several versions of mpich. Release 3.3.1 seems to work okay, but 3.3.2 hangs. Here is a complete summary of the versions I ran:

3.3rc1: Works
3.3: Works
3.3.1: Works
3.3.2: Hangs
3.4a2: Hangs

I’m not seeing the error message from hydra anymore (I have no idea why not), but I logged into one of the hung processes when running with 3.3.2 and got the following listing from gdb:

(gdb) where
#0  0x0000003d7ce0e810 in __read_nocancel () from /lib64/libpthread.so.0
#1  0x00002aaaac6e936e in PMIU_readline () from /people/d3g293/mpich/mpich-3.3.2/install/lib/libmpi.so.12
#2  0x00002aaaac6e985b in GetResponse.part.0 () from /people/d3g293/mpich/mpich-3.3.2/install/lib/libmpi.so.12
#3  0x00002aaaac6e4e36 in MPIDU_shm_seg_commit () from /people/d3g293/mpich/mpich-3.3.2/install/lib/libmpi.so.12
#4  0x00002aaaabc541dc in MPIR_Init_thread () from /people/d3g293/mpich/mpich-3.3.2/install/lib/libmpi.so.12
#5  0x00002aaaabc3db7e in PMPI_Init () from /people/d3g293/mpich/mpich-3.3.2/install/lib/libmpi.so.12
#6  0x0000000000408d0e in main ()
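For anyone wanting to collect a trace like the one above, here is a rough sketch of dumping backtraces from every hung rank on a node. This is not from the original thread; the binary name `my_mpi_app` is a placeholder for whatever executable is hanging.

```shell
# Sketch: attach gdb to each hung MPI rank on this node and dump a
# non-interactive backtrace. "my_mpi_app" is a hypothetical binary name.
pids=$(pgrep -x my_mpi_app)
if [ -z "$pids" ]; then
    echo "no matching processes found"
else
    for pid in $pids; do
        echo "=== backtrace for pid $pid ==="
        gdb -p "$pid" -batch -ex "thread apply all bt"
    done
fi
```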

Bruce

From: "Zhou, Hui" <zhouh at anl.gov>
Date: Monday, April 20, 2020 at 10:25 AM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Subject: Re: [mpich-discuss] MPICH configure

The error is from `hydra`, which should not have changed much between the versions. Could you verify that 3.3.1 still works for you?

--
Hui Zhou


From: "Palmer, Bruce J via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Thursday, April 16, 2020 at 5:48 PM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Subject: [mpich-discuss] MPICH configure

Hi,

I’ve been building MPICH on our aging InfiniBand cluster using the following formula:


./configure --prefix=/people/d3g293/mpich/mpich-3.3.2/install --with-device=ch4:ofi:sockets --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++

It’s been working pretty well, but I recently tried to build mpich-3.3.2 and mpich-3.4a2, and although the build seems to work okay, I’m having problems actually running anything. If I run on 2 nodes, the code seems to hang in MPI_Init, and it looks like it is producing the error message:


[proxy:0:1 at node013.local] HYDU_sock_connect (utils/sock/sock.c:145): unable to connect from "node013.local" to "node012.local" (Connection refused)
[proxy:0:1 at node013.local] main (pm/pmiserv/pmip.c:183): unable to connect to server node012.local at port 37769 (check for firewalls!)
srun: error: node013: task 1: Exited with exit code 5
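As a quick way to act on the "check for firewalls" hint, something like the following sketch (not from the original thread; hostname and port are copied from the error message, and bash with coreutils `timeout` is assumed) can confirm whether the proxy port is reachable from the failing node:

```shell
# Sketch: test TCP reachability of the hydra proxy port from node013.
# Hostname and port are taken from the error message above.
if timeout 5 bash -c 'exec 3<>/dev/tcp/node012.local/37769' 2>/dev/null; then
    echo "port reachable"
else
    echo "connection failed"
fi
```

If this reports a failure while the port is actually open on node012, a host firewall (e.g. iptables/firewalld) between the nodes is the usual suspect.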

If I run on a single node, things seem to work. Any idea what is going on here? I’ve got a working build of mpich-3.3, so things were okay until recently. Has something in MPICH changed so that my configuration formula is no good, or is this more likely due to some system modification?

Bruce Palmer
Senior Research Scientist
Pacific Northwest National Laboratory
Richland, WA 99352
(509) 375-3899
