[mpich-discuss] Problems with mpich-3.1.4 and slurm-14.11

Kenneth Raffenetti raffenet at mcs.anl.gov
Wed May 27 07:33:22 CDT 2015


I wonder if your MPICH build was able to detect and build in Slurm 
support. Can you send the src/pm/hydra/config.log file from your MPICH 
build directory?
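
A quick way to check locally (a sketch, assuming an in-tree build; the 
config.log path differs for out-of-tree builds) is to grep that file and 
the installed mpiexec's build info for Slurm:

  $ grep -i slurm src/pm/hydra/config.log
  $ mpiexec -info | grep -i -e launcher -e "resource management"

If slurm doesn't show up among the available launchers/resource management 
kernels, the build didn't detect it.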

Ken

On 05/23/2015 08:41 PM, Bill Broadley wrote:
> I have an Ubuntu 14.04 cluster where slurm-14.11 and OpenMPI 1.8.5 work
> well.  I'm trying to get mpich-3.1.4 working.  An OpenMPI + Slurm example:
>   $ mpicc.openmpi relay.c -o relay
>   $ mpicc.openmpi hello.c -o hello
>
> A run on a single node:
> $ salloc -N 1 -n 2 mpirun ./relay 1
> salloc: Granted job allocation 682
> c7-13 c7-13
> size=     1,  16384 hops,  2 nodes in   0.01 sec (  0.37 us/hop)  10468
> KB/sec
>
> A run on 2 nodes:
> $ salloc -N 2 -n 2 mpirun ./relay 1
> salloc: Granted job allocation 683
> c7-13 c7-14
> size=     1,  16384 hops,  2 nodes in   0.10 sec (  5.95 us/hop)    657
> KB/sec
>
> I compiled mpich-3.1.4, running ./configure with only the --prefix set.
> I just used the default Ubuntu gcc-4.8.2.  From config.log:
>    $ ./configure --prefix=/share/apps/mpich-3.1.4/gcc
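>
> If Slurm support turns out to be missing, a reconfigure sketch (assuming
> Slurm's headers and libraries live under /usr, which may not match this
> cluster) would be:
>    $ ./configure --prefix=/share/apps/mpich-3.1.4/gcc --with-slurm=/usr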
>
> Now to test:
> $ export PATH=/share/apps/mpich-3.1.4/gcc/bin:$PATH
> $ export LD_LIBRARY_PATH=/share/apps/mpich-3.1.4/gcc/lib:$LD_LIBRARY_PATH
> $ which mpicc
> /share/apps/mpich-3.1.4/gcc/bin/mpicc
> $ which mpiexec
> /share/apps/mpich-3.1.4/gcc/bin/mpiexec
> $ mpicc relay.c -o relay
> $ mpicc hello.c -o hello
> $ salloc -N 1 -n 2 mpiexec ./hello
> salloc: Granted job allocation 688
> Hello world from process 0 of 2
> Hello world from process 1 of 2
> $ salloc -N 2 -n 2 mpiexec ./hello
> salloc: Granted job allocation 689
> Hello world from process 0 of 2
> Hello world from process 1 of 2
>
> Great, the basics are working.  But hello doesn't exercise any actual
> communication.
>
> Communication within a single node seems to work:
> $ salloc -N 1 -n 2 mpiexec ./relay 1
> salloc: Granted job allocation 690
> c7-13 c7-13
> size=     1,  16384 hops,  2 nodes in   0.01 sec (  0.55 us/hop)   7074
> KB/sec
>
> But not across nodes:
> bill@hpc1:~/src/relay$ salloc -N 2 -n 2 mpiexec ./relay 1
> salloc: Granted job allocation 691
> c7-13 Fatal error in MPI_Recv: Unknown error class, error stack:
> MPI_Recv(187)...................: MPI_Recv(buf=0x7ffdc56aa630,
> count=129, MPI_CHAR, src=1, tag=1, MPI_COMM_WORLD,
> status=0x7ffdc56aa610) failed
> MPIDI_CH3U_Recvq_FDU_or_AEP(636): Communication error with rank 1
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 9734 RUNNING AT c7-13
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:1@c7-14] HYD_pmcd_pmip_control_cmd_cb
> (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
> [proxy:0:1@c7-14] HYDT_dmxu_poll_wait_for_event
> (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:1@c7-14] main (pm/pmiserv/pmip.c:206): demux engine error
> waiting for event
> srun: error: c7-14: task 1: Exited with exit code 7
> [mpiexec@hpc1] HYDT_bscu_wait_for_completion
> (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
> badly; aborting
> [mpiexec@hpc1] HYDT_bsci_wait_for_completion
> (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
> for completion
> [mpiexec@hpc1] HYD_pmci_wait_for_completion
> (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for
> completion
> [mpiexec@hpc1] main (ui/mpich/mpiexec.c:344): process manager error
> waiting for completion
> salloc: Relinquishing job allocation 691
>
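> To narrow down whether it's the launch step or the TCP connection itself,
> I could rerun with Hydra's verbose output, or take srun out of the picture
> by forcing the ssh launcher (a sketch; option names as in Hydra's mpiexec):
> $ salloc -N 2 -n 2 mpiexec -verbose ./relay 1
> $ salloc -N 2 -n 2 mpiexec -launcher ssh ./relay 1
>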
> DNS and ssh seem to be fine:
> $ ssh c7-13 "hostname; ssh c7-14 'hostname'"
> c7-13
> c7-14
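>
> If the nodes have more than one network interface, I could also try
> pinning the one MPICH uses (a sketch; "eth0" is a placeholder for
> whatever interface actually connects c7-13 and c7-14):
> $ salloc -N 2 -n 2 mpiexec -iface eth0 ./relay 1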
>
> Any ideas?
>
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

