[mpich-discuss] MPICH configure
balaji at anl.gov
Wed Apr 22 11:55:18 CDT 2020
The `--with-slurm` option tells configure to pick up the SLURM libraries, so that you can use SLURM's PMI library with srun. If you don't pass it, MPICH will use its internal PMI library and run with Hydra. Can you try that? Basically, remove `--with-slurm` and try running the program with "mpiexec".
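As a rough sketch of that change: the snippet below takes the configure options quoted further down, drops only `--with-slurm`, and prints the resulting commands for review instead of running them. The program name `./a.out` and the process count are placeholders, not taken from the original report.

```shell
#!/bin/sh
# Options copied from the configure line in the original report.
orig_args="--prefix=/people/d3g293/mpich/mpich-3.3.2/install --with-device=ch4:ofi:sockets --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++"

# Rebuild the list without --with-slurm, so MPICH falls back to its
# internal PMI library and the Hydra process manager.
new_args=""
for arg in $orig_args; do
    if [ "$arg" != "--with-slurm" ]; then
        new_args="$new_args $arg"
    fi
done
new_args=${new_args# }   # strip the leading space left by the loop

# Print the commands rather than executing them.
echo "./configure $new_args"
echo "make && make install"
# ./a.out and -n 2 are placeholders for the real program and process count.
echo "mpiexec -n 2 ./a.out"
```

The point is to launch with the freshly built `mpiexec` instead of `srun`; make sure the `bin` directory under the install prefix comes first in `PATH`.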
The error message below suggests a firewall issue between the nodes. Did you check on that?
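One quick way to check this is to try a plain TCP connection from one node to the other. The helper below is only a sketch: `can_connect` is a made-up name, and it assumes that `bash` and the coreutils `timeout` command are available on the nodes.

```shell
#!/bin/sh
# can_connect HOST PORT
# Returns 0 if a TCP connection to HOST:PORT succeeds within 5 seconds,
# non-zero if it is refused, filtered, or times out.
# Relies on bash's /dev/tcp/HOST/PORT redirection (a bash feature, not a file).
can_connect() {
    timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example use from node013 (host names taken from the error message):
#   can_connect node012.local 22 && echo "reachable" || echo "blocked or closed"
```

The port in the error message (37769) was chosen at launch time, so probing it afterwards will report "refused" simply because nothing is listening there any more. Port 22 only confirms basic reachability, assuming sshd is running on the target. To test the high ports that Hydra uses, start a temporary listener on node012 first and then probe that port from node013.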
> On Apr 16, 2020, at 5:47 PM, Palmer, Bruce J via discuss <discuss at mpich.org> wrote:
> I’ve been building MPICH on our aging Infiniband cluster using the following formula:
> ./configure --prefix=/people/d3g293/mpich/mpich-3.3.2/install --with-device=ch4:ofi:sockets --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++
> It’s been working pretty well but I recently tried to build mpich-3.3.2 and mpich-3.4a2 and although the build seems to work okay, I’m having problems actually running anything. If I run on 2 nodes the code seems to hang on MPI_Init and it looks like it is producing the error message
> [proxy:0:1 at node013.local] HYDU_sock_connect (utils/sock/sock.c:145): unable to connect from "node013.local" to "node012.local" (Connection refused)
> [proxy:0:1 at node013.local] main (pm/pmiserv/pmip.c:183): unable to connect to server node012.local at port 37769 (check for firewalls!)
> srun: error: node013: task 1: Exited with exit code 5
> If I run on a single node, things seem to work. Any idea what is going on here? I’ve got a working build of mpich-3.3, so things were okay up until recently. Has something in MPICH changed and my configuration formula is no good, or is this more likely to be due to some system modification?
> Bruce Palmer
> Senior Research Scientist
> Pacific Northwest National Laboratory
> Richland, WA 99352
> (509) 375-3899