[mpich-discuss] Communication

Raffenetti, Kenneth J. raffenet at mcs.anl.gov
Wed Dec 9 10:32:24 CST 2020


This error output is mostly just Hydra (mpiexec) cleaning up after an MPI process crashed. This line:

    srun: error: node008: task 0: Exited with exit code 7

indicates that the failed process exited with exit code 7. Does the test program return meaningful exit codes of its own? If not, the 7 could instead correspond to a signal that caused the program to crash (e.g. SIGBUS, which is signal number 7 on x86 Linux).
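
If it helps to check which signal, if any, shares a number with a given exit code on the machine where the job ran, a small sketch along these lines could be used. It relies only on Python's standard `signal` module; the helper name `describe_exit_code` is made up for illustration and is not part of MPICH, Hydra or Slurm. A matching number is only a hint, since the program may simply have called exit(7) itself.

```python
import signal


def describe_exit_code(code):
    """Report whether an exit code matches a signal number on this platform.

    Hypothetical helper for illustration only. A match does not prove the
    process was killed by that signal; it only shows the numbers coincide.
    """
    try:
        # signal.Signals is an enum; looking up an unknown number raises ValueError.
        name = signal.Signals(code).name
    except ValueError:
        return f"exit code {code}: no signal with this number on this platform"
    return f"exit code {code}: matches the number of {name} on this platform"


if __name__ == "__main__":
    # 7 is the exit code reported by srun in the output above.
    print(describe_exit_code(7))
```

Signal numbering varies between operating systems and architectures, so this is best run on one of the compute nodes rather than on a workstation.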

Ken

On 12/9/20, 9:50 AM, "Palmer, Bruce J via discuss" <discuss at mpich.org> wrote:

    Hi,
     
    I’m currently debugging a variant of the progress ranks runtime used in Global Arrays, and one of the runtime test programs fails when I run on more than one node (running on a single node succeeds). The failure occurs in a portion of the code that performs a large number of MPI_Isend calls on the sending side and posts MPI_Recv calls with MPI_ANY_SOURCE on the receiving side. The configuration is 6 MPI ranks split evenly between two nodes, and the error is:
     
    [proxy:0:0 at node008.local] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
    [proxy:0:0 at node008.local] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
    [proxy:0:0 at node008.local] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
    srun: error: node008: task 0: Exited with exit code 7
    [mpiexec at node008.local] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:75): one of the processes terminated badly; aborting
    [mpiexec at node008.local] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:22): launcher returned error waiting for completion
    [mpiexec at node008.local] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:215): launcher returned error waiting for completion
    [mpiexec at node008.local] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
     
    Can anyone describe in layman’s terms what might be generating this type of error? I’m using the mpich-3.4a2 release and configuring with
     
    ./configure --prefix=/people/bjpalmer/mpich/mpich-3.4a2/install --with-device=ch4:ofi:sockets --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++
     
     
    Bruce Palmer
    Senior Research Scientist
    Pacific Northwest National Laboratory
    Richland, WA 99352
    (509) 375-3899
     


