[mpich-discuss] Problems with mpich-3.1.4 and slurm-14.11

Bill Broadley bill at cse.ucdavis.edu
Sat May 23 20:41:58 CDT 2015


I have an Ubuntu 14.04 cluster where slurm-14.11 and OpenMPI 1.8.5 work
well.  I'm trying to get mpich-3.1.4 working.  An OpenMPI + slurm example:
 $ mpicc.openmpi relay.c -o relay
 $ mpicc.openmpi hello.c -o hello
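
(For reference, hello.c is just the standard MPI hello world; a minimal
version is something like the following sketch.)

/* Minimal MPI hello world, matching the "Hello world from process N of M"
 * output shown further down. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}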

A run on a single node:
$ salloc -N 1 -n 2 mpirun ./relay 1
salloc: Granted job allocation 682
c7-13 c7-13
size=     1,  16384 hops,  2 nodes in   0.01 sec (  0.37 us/hop)  10468 KB/sec

A run on 2 nodes:
$ salloc -N 2 -n 2 mpirun ./relay 1
salloc: Granted job allocation 683
c7-13 c7-14
size=     1,  16384 hops,  2 nodes in   0.10 sec (  5.95 us/hop)    657 KB/sec
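
(relay.c is my own micro-benchmark.  The sketch below is a simplified,
hypothetical version, not the exact source, but it shows the idea: two
ranks print their hostnames, then bounce a message of the given size back
and forth over MPI_Send/MPI_Recv and report hops, latency and bandwidth --
the same send/recv path that fails further down.)

/* Simplified sketch of a relay-style ping-pong benchmark (not the exact
 * relay.c source).  Run as: mpiexec ./relay <message size in bytes>      */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, namelen, i, hops = 16384;
    int size = (argc > 1) ? atoi(argv[1]) : 1;   /* message size in bytes */
    char name[MPI_MAX_PROCESSOR_NAME];
    char *buf = calloc(size, 1);
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Get_processor_name(name, &namelen);
    printf("%s ", name);                         /* e.g. "c7-13 c7-14"    */

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < hops / 2; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("\nsize=%6d, %6d hops, %2d nodes in %6.2f sec (%6.2f us/hop) %6.0f KB/sec\n",
               size, hops, nprocs, t1 - t0,
               (t1 - t0) * 1e6 / hops,
               hops * (double)size / 1024.0 / (t1 - t0));

    free(buf);
    MPI_Finalize();
    return 0;
}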

I compiled mpich-3.1.4, setting only the --prefix in ./configure, and used
Ubuntu's default gcc-4.8.2.  From config.log:
  $ ./configure --prefix=/share/apps/mpich-3.1.4/gcc

Now to test:
$ export PATH=/share/apps/mpich-3.1.4/gcc/bin:$PATH
$ export LD_LIBRARY_PATH=/share/apps/mpich-3.1.4/gcc/lib:$LD_LIBRARY_PATH
$ which mpicc
/share/apps/mpich-3.1.4/gcc/bin/mpicc
$ which mpiexec
/share/apps/mpich-3.1.4/gcc/bin/mpiexec
$ mpicc relay.c -o relay
$ mpicc hello.c -o hello
$ salloc -N 1 -n 2 mpiexec ./hello
salloc: Granted job allocation 688
Hello world from process 0 of 2
Hello world from process 1 of 2
$ salloc -N 2 -n 2 mpiexec ./hello
salloc: Granted job allocation 689
Hello world from process 0 of 2
Hello world from process 1 of 2

Great, the basics are working.  But hello.c doesn't do any point-to-point
communication, so this only shows that process launch works.

Communication within a single node seems to work:
$ salloc -N 1 -n 2 mpiexec ./relay 1
salloc: Granted job allocation 690
c7-13 c7-13
size=     1,  16384 hops,  2 nodes in   0.01 sec (  0.55 us/hop)   7074 KB/sec

But communication across nodes fails:
bill at hpc1:~/src/relay$ salloc -N 2 -n 2 mpiexec ./relay 1
salloc: Granted job allocation 691
c7-13 Fatal error in MPI_Recv: Unknown error class, error stack:
MPI_Recv(187)...................: MPI_Recv(buf=0x7ffdc56aa630,
count=129, MPI_CHAR, src=1, tag=1, MPI_COMM_WORLD,
status=0x7ffdc56aa610) failed
MPIDI_CH3U_Recvq_FDU_or_AEP(636): Communication error with rank 1

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 9734 RUNNING AT c7-13
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1 at c7-14] HYD_pmcd_pmip_control_cmd_cb
(pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:1 at c7-14] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1 at c7-14] main (pm/pmiserv/pmip.c:206): demux engine error
waiting for event
srun: error: c7-14: task 1: Exited with exit code 7
[mpiexec at hpc1] HYDT_bscu_wait_for_completion
(tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
badly; aborting
[mpiexec at hpc1] HYDT_bsci_wait_for_completion
(tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
for completion
[mpiexec at hpc1] HYD_pmci_wait_for_completion
(pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for
completion
[mpiexec at hpc1] main (ui/mpich/mpiexec.c:344): process manager error
waiting for completion
salloc: Relinquishing job allocation 691

DNS and ssh seem to be fine:
$ ssh c7-13 "hostname; ssh c7-14 'hostname'"
c7-13
c7-14

Any ideas?