[mpich-discuss] hydra_pmi_proxy sending signal 9 to successful processes
Lana Deere
lana.deere at gmail.com
Mon Apr 6 10:23:43 CDT 2026
I've got several MPI programs here. The most complicated of them has
started exiting abnormally: it reports that a process received signal 9
during cleanup, after a run it had reported as successful. Many of the
other MPI processes show truncated output, as if they too received
signal 9. Only that one program has the problem; the others are
unaffected. I tried reducing the big program to a small test case that
reproduces the issue, but was unsuccessful.
I attached gdb to the hydra_pmi_proxy and confirmed that it is the
process sending signal 9 to the various MPI processes:
(gdb) where
#0  0x00007f4b17853d7e in killpg () from /lib64/libc.so.6
#1  0x00000000004053e2 in PMIP_bcast_signal (sig=sig@entry=9) at proxy/pmip_pg.c:259
#2  0x0000000000406e60 in pmi_cb (fd=9, events=<optimized out>, userp=<optimized out>) at proxy/pmip_cb.c:326
#3  0x0000000000421418 in HYDT_dmxu_poll_wait_for_event (wtime=<optimized out>) at lib/tools/demux/demux_poll.c:75
#4  0x0000000000403ff5 in main (argc=<optimized out>, argv=<optimized out>) at proxy/pmip.c:121
I was using mpich 4.3.0 at the time, so I upgraded to 5.0.0 in the
hope that the problem would be resolved, but 5.0.0 shows the same
symptom. All of this is on SUSE Linux 15.5.
On CentOS 7 and Rocky Linux 9 we use mvapich2 2.3.6, so as an
experiment I took the mpirun and hydra_pmi_proxy from 2.3.6 and used
them in place of the versions from the mpich 5.0.0 release. With
those, the program runs without difficulty. All of this suggests that
hydra_pmi_proxy has incorrectly concluded that one of the MPI
processes exited on a signal.
Any suggestions about what's going on?
.. Lana (lana.deere at gmail.com)