[mpich-discuss] Problems with mpich-3.1.4 and slurm-14.11
Bill Broadley
bill at cse.ucdavis.edu
Sat May 23 20:41:58 CDT 2015
I have an ubuntu 14.04 cluster where slurm-14.11 and OpenMPI 1.8.5 work
well. I'm trying to get mpich-3.1.4 working. An OpenMPI + slurm example:
$ mpicc.openmpi relay.c -o relay
$ mpicc.openmpi hello.c -o hello
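(hello.c is just the standard MPI hello world, roughly:)

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    printf("Hello world from process %d of %d\n", rank, nprocs);
    MPI_Finalize();
    return 0;
}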
A run on a single node:
$ salloc -N 1 -n 2 mpirun ./relay 1
salloc: Granted job allocation 682
c7-13 c7-13
size= 1, 16384 hops, 2 nodes in 0.01 sec ( 0.37 us/hop) 10468 KB/sec
A run on 2 nodes:
$ salloc -N 2 -n 2 mpirun ./relay 1
salloc: Granted job allocation 683
c7-13 c7-14
size= 1, 16384 hops, 2 nodes in 0.10 sec ( 5.95 us/hop) 657 KB/sec
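(relay.c is a small ring-relay timer I use; the real source has a few more details, but it boils down to something like this:)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

#define HOPS 16384

int main(int argc, char **argv)
{
    int rank, nprocs, hop, next, prev;
    int bytes = (argc > 1) ? atoi(argv[1]) : 1;  /* message size from argv */
    char host[64];
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    gethostname(host, sizeof(host));  /* report placement, e.g. "c7-13 c7-14" */
    printf("%s ", host);
    fflush(stdout);

    buf = malloc(bytes);
    memset(buf, 'x', bytes);
    next = (rank + 1) % nprocs;
    prev = (rank + nprocs - 1) % nprocs;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    /* pass the buffer around the ring, HOPS sends in total */
    for (hop = 0; hop < HOPS / nprocs; hop++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, next, 1, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, prev, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, bytes, MPI_CHAR, prev, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, next, 1, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("\nsize=%2d, %d hops, %d nodes in %.2f sec (%5.2f us/hop) "
               "%.0f KB/sec\n",
               bytes, HOPS, nprocs, t1 - t0,
               (t1 - t0) * 1e6 / HOPS,
               (double)bytes * HOPS / (t1 - t0) / 1024.0);

    free(buf);
    MPI_Finalize();
    return 0;
}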
I compiled mpich-3.1.4, passing ./configure only the --prefix, with the
ubuntu default gcc-4.8.2. From config.log:
$ ./configure --prefix=/share/apps/mpich-3.1.4/gcc
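(So I'm using the default Hydra process manager. I believe mpich can
instead be built directly against slurm's PMI, something like
./configure --prefix=/share/apps/mpich-3.1.4/gcc --with-pmi=slurm
--with-pm=none, and then launched with plain srun, but I haven't tried
that here.)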
Now to test:
$ export PATH=/share/apps/mpich-3.1.4/gcc/bin:$PATH
$ export LD_LIBRARY_PATH=/share/apps/mpich-3.1.4/gcc/lib:$LD_LIBRARY_PATH
$ which mpicc
/share/apps/mpich-3.1.4/gcc/bin/mpicc
$ which mpiexec
/share/apps/mpich-3.1.4/gcc/bin/mpiexec
$ mpicc relay.c -o relay
$ mpicc hello.c -o hello
$ salloc -N 1 -n 2 mpiexec ./hello
salloc: Granted job allocation 688
Hello world from process 0 of 2
Hello world from process 1 of 2
$ salloc -N 2 -n 2 mpiexec ./hello
salloc: Granted job allocation 689
Hello world from process 0 of 2
Hello world from process 1 of 2
Great, the basics are working, but hello world doesn't actually exercise
any point-to-point communication. Communication within a single node
seems to work:
$ salloc -N 1 -n 2 mpiexec ./relay 1
salloc: Granted job allocation 690
c7-13 c7-13
size= 1, 16384 hops, 2 nodes in 0.01 sec ( 0.55 us/hop) 7074 KB/sec
But it fails as soon as more than one node is involved:
bill@hpc1:~/src/relay$ salloc -N 2 -n 2 mpiexec ./relay 1
salloc: Granted job allocation 691
c7-13 Fatal error in MPI_Recv: Unknown error class, error stack:
MPI_Recv(187)...................: MPI_Recv(buf=0x7ffdc56aa630, count=129, MPI_CHAR, src=1, tag=1, MPI_COMM_WORLD, status=0x7ffdc56aa610) failed
MPIDI_CH3U_Recvq_FDU_or_AEP(636): Communication error with rank 1
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 9734 RUNNING AT c7-13
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1@c7-14] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:1@c7-14] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@c7-14] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
srun: error: c7-14: task 1: Exited with exit code 7
[mpiexec@hpc1] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@hpc1] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@hpc1] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@hpc1] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
salloc: Relinquishing job allocation 691
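(One thing I haven't ruled out yet is Hydra picking the wrong network
interface on the compute nodes; I understand mpiexec takes an -iface
option, e.g. mpiexec -iface eth0 ./relay 1, where eth0 stands in for
whatever the cluster interconnect interface actually is. I'll try that
next.)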
DNS and ssh seem to be fine:
$ ssh c7-13 "hostname; ssh c7-14 'hostname'"
c7-13
c7-14
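(If it helps narrow things down, I can also take slurm out of the
picture entirely and let Hydra launch over ssh:
$ mpiexec -launcher ssh -hosts c7-13,c7-14 -n 2 ./relay 1
)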
Any ideas?