[mpich-discuss] hydra_pmi_proxy sending signal 9 to successful processes

Lana Deere lana.deere at gmail.com
Tue Apr 7 08:11:05 CDT 2026


Thanks for the suggestion, I will try it.  So far I am not seeing any sign
that any of my processes exited abnormally.  They all record success in
their log files (except the ones killed by -9), and if I 'borrow'
hydra_pmi_proxy from a different MPI they all exit cleanly with exit code
0.  I can experiment with -disable-auto-cleanup while I'm debugging, but in
production the auto cleanup is needed so that if we do crash there aren't
leftover processes hanging on the cluster.

.. Lana (lana.deere at gmail.com)




On Mon, Apr 6, 2026 at 11:36 AM Zhou, Hui <zhouh at anl.gov> wrote:

> Hi Lana,
>
> You can try adding -disable-auto-cleanup to mpiexec to prevent it from
> killing every other process when one process exits abnormally.
>
> I usually use a .gdbinit script to get a backtrace in such cases. For
> example, if your program is ./t, then
>
> mpirun gdb ./t
>
> Example .gdbinit:
> ```
> set $_exitcode = -999
> run
> if $_exitcode == -999
>     backtrace
> end
> exit $_exitcode
> ```
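>
> To run that without attaching a terminal to every rank, gdb's batch mode
> can be used (a sketch only; "-n 2" is an arbitrary rank count, and -x
> just makes the .gdbinit explicit rather than relying on auto-loading):
>
> ```
> mpirun -n 2 gdb -batch -x .gdbinit ./t
> ```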
>
> Hope that helps.
>
> --
> Hui Zhou
>
>
> *From: *Lana Deere via discuss <discuss at mpich.org>
> *Date: *Monday, April 6, 2026 at 10:24 AM
> *To: *discuss at mpich.org <discuss at mpich.org>
> *Cc: *Lana Deere <lana.deere at gmail.com>
> *Subject: *[mpich-discuss] hydra_pmi_proxy sending signal 9 to successful
> processes
>
> I've got several MPI programs here.  The one which is the most complicated
> started exiting, reporting that a process got signal 9 while cleaning up
> after a run it reported was successful. Many of the other MPI processes
> showed truncated outputs as if they too had received a signal 9.   Only
> that one program has this problem, the other programs don't.  I tried
> reducing the big program to a small testcase which reproduces the issue but
> was unsuccessful.
>
> I attached gdb to the hydra_pmi_proxy and discovered that it is the
> process sending signal 9 to the various MPI processes:
>
> (gdb) where
> #0  0x00007f4b17853d7e in killpg () from /lib64/libc.so.6
> #1  0x00000000004053e2 in PMIP_bcast_signal (sig=sig@entry=9) at proxy/pmip_pg.c:259
> #2  0x0000000000406e60 in pmi_cb (fd=9, events=<optimized out>, userp=<optimized out>) at proxy/pmip_cb.c:326
> #3  0x0000000000421418 in HYDT_dmxu_poll_wait_for_event (wtime=<optimized out>) at lib/tools/demux/demux_poll.c:75
> #4  0x0000000000403ff5 in main (argc=<optimized out>, argv=<optimized out>) at proxy/pmip.c:121
>
> At that time I was using mpich 4.3.0, so I upgraded to 5.0.0 hoping the
> problem would be resolved.  5.0.0 still showed the same symptom.  This all
> is happening on SUSE Linux 15.5.
>
> On CentOS7 and Rocky Linux 9 we use mvapich2 2.3.6, so for an experiment I
> took the mpirun and hydra_pmi_proxy from 2.3.6 and used them instead of the
> versions from the mpich 5.0.0 release.  Now the program works without
> difficulty.  All of this suggests to me that the hydra_pmi_proxy has
> incorrectly determined that one of the MPI processes exited with a signal.
> Any suggestions about what's going on?
>
>
>
> .. Lana (lana.deere at gmail.com)
>
>
>