[mpich-discuss] hydra_pmi_proxy sending signal 9 to successful processes
Zhou, Hui
zhouh at anl.gov
Mon Apr 6 10:36:51 CDT 2026
Hi Lana,
You can try adding -disable-auto-cleanup to mpiexec to prevent it from killing every other process when one process exits abnormally.
I usually use a .gdbinit script to get a backtrace in such cases. For example, if your program is ./t, then:
mpirun gdb ./t
Example .gdbinit:
```
set $_exitcode = -999
run
if $_exitcode == -999
backtrace
end
quit $_exitcode
```
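Putting the two suggestions together, a full launch line might look like the sketch below. This is a hypothetical example, not the exact command from the thread: the rank count (4) and the program name ./t are assumptions, and because gdb's -batch mode skips init files, the script is passed explicitly with -x.

```shell
# Sketch: run 4 ranks of ./t, each under gdb in batch mode, loading
# the .gdbinit script above explicitly with -x (-batch implies -nx,
# so the script would otherwise be skipped). -disable-auto-cleanup
# keeps hydra from killing the surviving ranks when one rank dies,
# so each gdb gets a chance to print its backtrace.
mpiexec -disable-auto-cleanup -n 4 gdb -batch -x .gdbinit ./t
```

Output from the per-rank gdb instances will be interleaved on stdout, so for more than a few ranks it can help to redirect each rank's output to its own file.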
Hope that helps.
--
Hui Zhou
From: Lana Deere via discuss <discuss at mpich.org>
Date: Monday, April 6, 2026 at 10:24 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Lana Deere <lana.deere at gmail.com>
Subject: [mpich-discuss] hydra_pmi_proxy sending signal 9 to successful processes
I've got several MPI programs here. The most complicated one started exiting with a report that a process received signal 9 while cleaning up after a run it had reported as successful. Many of the other MPI processes showed truncated output, as if they too had received signal 9. Only that one program has this problem; the others do not. I tried to reduce the big program to a small test case that reproduces the issue, but was unsuccessful.
I attached gdb to the hydra_pmi_proxy and discovered that it is the process sending signal 9 to the various MPI processes:
(gdb) where
#0 0x00007f4b17853d7e in killpg () from /lib64/libc.so.6
#1 0x00000000004053e2 in PMIP_bcast_signal (sig=sig at entry=9) at proxy/pmip_pg.c:259
#2 0x0000000000406e60 in pmi_cb (fd=9, events=<optimized out>, userp=<optimized out>)
at proxy/pmip_cb.c:326
#3 0x0000000000421418 in HYDT_dmxu_poll_wait_for_event (wtime=<optimized out>)
at lib/tools/demux/demux_poll.c:75
#4 0x0000000000403ff5 in main (argc=<optimized out>, argv=<optimized out>) at proxy/pmip.c:121
At that time I was using mpich 4.3.0, so I upgraded to 5.0.0 hoping the problem would be resolved, but 5.0.0 still showed the same symptom. This is all happening on SUSE Linux 15.5.
On CentOS 7 and Rocky Linux 9 we use mvapich2 2.3.6, so as an experiment I took the mpirun and hydra_pmi_proxy from 2.3.6 and used them in place of the versions from the mpich 5.0.0 release. With those, the program runs without difficulty. All of this suggests to me that hydra_pmi_proxy is incorrectly determining that one of the MPI processes exited due to a signal. Any suggestions about what's going on?
.. Lana (lana.deere at gmail.com<mailto:lana.deere at gmail.com>)