[mpich-discuss] hydra_pmi_proxy sending signal 9 to successful processes
Lana Deere
lana.deere at gmail.com
Mon Apr 6 10:23:43 CDT 2026
I've got several MPI programs here. The most complicated of them has
started exiting abnormally: it reports that a process received signal 9
during cleanup, after a run it had reported as successful. Many of the
other MPI processes show truncated output, as if they too received
signal 9. Only that one program has the problem; the others are
unaffected. I tried reducing the big program to a small test case that
reproduces the issue, but was unsuccessful.
I attached gdb to the hydra_pmi_proxy and confirmed that it is the
process sending signal 9 to the various MPI processes:
(gdb) where
#0  0x00007f4b17853d7e in killpg () from /lib64/libc.so.6
#1  0x00000000004053e2 in PMIP_bcast_signal (sig=sig@entry=9) at proxy/pmip_pg.c:259
#2  0x0000000000406e60 in pmi_cb (fd=9, events=<optimized out>, userp=<optimized out>) at proxy/pmip_cb.c:326
#3  0x0000000000421418 in HYDT_dmxu_poll_wait_for_event (wtime=<optimized out>) at lib/tools/demux/demux_poll.c:75
#4  0x0000000000403ff5 in main (argc=<optimized out>, argv=<optimized out>) at proxy/pmip.c:121
I was using mpich 4.3.0 at the time, so I upgraded to 5.0.0 in the
hope that the problem would be resolved, but 5.0.0 shows the same
symptom. All of this is on SUSE Linux 15.5.
On CentOS 7 and Rocky Linux 9 we use mvapich2 2.3.6, so as an
experiment I took the mpirun and hydra_pmi_proxy from 2.3.6 and used
them in place of the versions from the mpich 5.0.0 release. With
those, the program runs without difficulty. All of this suggests that
hydra_pmi_proxy has incorrectly concluded that one of the MPI
processes exited on a signal.
Any suggestions about what's going on?
.. Lana (lana.deere at gmail.com)