[mpich-discuss] Communication
Raffenetti, Kenneth J.
raffenet at mcs.anl.gov
Wed Dec 9 10:32:24 CST 2020
This error output is mostly just Hydra (mpiexec) cleaning up after an MPI process crashed. This line:
srun: error: node008: task 0: Exited with exit code 7
indicates that the failed process exited with exit code 7. Does the test program return meaningful exit codes of its own? The value could also correspond to a signal that caused the program to crash (e.g. SIGBUS, which is signal 7 on Linux x86).
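As a rough illustration of the distinction between a plain exit code and a terminating signal, the sketch below launches two throwaway child processes from Python, one that exits with code 7 and one that sends itself SIGBUS, and shows how a parent can tell them apart. The helper names (describe_returncode, run_child) are made up for this example, and it assumes a POSIX system. It only shows how Python's subprocess module reports the two cases; srun and Hydra may present the same information differently.

```python
import signal
import subprocess
import sys


def describe_returncode(rc: int) -> str:
    """Turn a subprocess return code into a readable description.

    On POSIX, subprocess reports a child killed by signal N as -N,
    and a normal exit as the non-negative exit code.
    """
    if rc < 0:
        try:
            name = signal.Signals(-rc).name
        except ValueError:
            # Signal number with no symbolic name on this platform.
            name = f"signal {-rc}"
        return f"killed by {name}"
    return f"exited with code {rc}"


def run_child(snippet: str) -> int:
    """Run a short Python snippet in a child process and return its code."""
    return subprocess.run([sys.executable, "-c", snippet]).returncode


if __name__ == "__main__":
    # Hypothetical stand-ins for a test program that fails in two ways.
    plain_exit = run_child("import sys; sys.exit(7)")
    bus_error = run_child(
        "import os, signal; os.kill(os.getpid(), signal.SIGBUS)"
    )
    print("child 1:", describe_returncode(plain_exit))
    print("child 2:", describe_returncode(bus_error))
```

Running the test program directly on one of the nodes, outside of mpiexec, and checking its status in a similar way may help determine whether the 7 is a code the program chose or the result of a crash.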
Ken
On 12/9/20, 9:50 AM, "Palmer, Bruce J via discuss" <discuss at mpich.org> wrote:
Hi,
I’m currently debugging a variant of the progress ranks runtime used in Global Arrays, and one of the runtime test programs fails when I run on more than one node (it runs successfully on a single node). The failure occurs in a portion of the code that issues a large number of MPI_Isend calls on the sending side and posts MPI_Recv calls with MPI_ANY_SOURCE on the receiving side. The configuration is 6 MPI ranks split evenly between two nodes, and the error is:
[proxy:0:0 at node008.local] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:0 at node008.local] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0 at node008.local] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
srun: error: node008: task 0: Exited with exit code 7
[mpiexec at node008.local] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:75): one of the processes terminated badly; aborting
[mpiexec at node008.local] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:22): launcher returned error waiting for completion
[mpiexec at node008.local] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:215): launcher returned error waiting for completion
[mpiexec at node008.local] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
Can anyone describe in layman’s terms what might be generating this type of error? I’m using the mpich-3.4a2 release, configured with:
./configure --prefix=/people/bjpalmer/mpich/mpich-3.4a2/install --with-device=ch4:ofi:sockets --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++
Bruce Palmer
Senior Research Scientist
Pacific Northwest National Laboratory
Richland, WA 99352
(509) 375-3899