[mpich-discuss] Communication

Palmer, Bruce J Bruce.Palmer at pnnl.gov
Wed Dec 9 09:49:12 CST 2020


Hi,

I’m currently debugging a variant of the progress ranks runtime used in Global Arrays, and I’m getting a failure in one of the runtime test programs when I run on more than one node (a single-node run succeeds). The failure occurs in a portion of the code that performs a large number of MPI_Isends on the sending side and posts MPI_Recv from MPI_ANY_SOURCE on the receiving side. The configuration is 6 MPI ranks split evenly between two nodes, and the error is
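
For reference, the communication pattern in question is roughly of the following form (a minimal sketch, not the actual runtime test; the message count and buffers are illustrative):

#include <mpi.h>
#include <stdlib.h>

#define NMSG 1000   /* illustrative number of messages per sender */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {
        /* Senders: post a large number of nonblocking sends to rank 0 */
        int buf = rank;
        MPI_Request *reqs = malloc(NMSG * sizeof(MPI_Request));
        for (int i = 0; i < NMSG; i++)
            MPI_Isend(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &reqs[i]);
        MPI_Waitall(NMSG, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    } else {
        /* Receiver: match every incoming message from any source */
        int val;
        for (int i = 0; i < NMSG * (size - 1); i++)
            MPI_Recv(&val, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}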


[proxy:0:0 at node008.local] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:0 at node008.local] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0 at node008.local] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
srun: error: node008: task 0: Exited with exit code 7
[mpiexec at node008.local] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:75): one of the processes terminated badly; aborting
[mpiexec at node008.local] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:22): launcher returned error waiting for completion
[mpiexec at node008.local] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:215): launcher returned error waiting for completion
[mpiexec at node008.local] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion

Can anyone describe in layman’s terms what might be generating this type of error? I’m using the mpich-3.4a2 release and configuring with


./configure --prefix=/people/bjpalmer/mpich/mpich-3.4a2/install --with-device=ch4:ofi:sockets --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++
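
The failing run is launched roughly like this (illustrative command line; the actual test binary name and job script options are assumptions):

mpiexec -n 6 -ppn 3 ./runtime_test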


Bruce Palmer
Senior Research Scientist
Pacific Northwest National Laboratory
Richland, WA 99352
(509) 375-3899


