<div dir="ltr"><div>Thanks for the suggestion, I will try it. So far I see no sign that any of my processes exited abnormally. They all record success in their log files (except the ones killed by signal 9), and if I 'borrow' hydra_pmi_proxy from a different MPI they all exit cleanly with exit code 0. I can experiment with -disable-auto-cleanup while I'm debugging, but in production the auto cleanup is needed so that a crash doesn't leave processes hanging on the cluster.</div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><br>.. Lana (<a href="mailto:lana.deere@gmail.com" target="_blank">lana.deere@gmail.com</a>)<br><br><br></div></div><br></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Mon, Apr 6, 2026 at 11:36 AM Zhou, Hui <<a href="mailto:zhouh@anl.gov">zhouh@anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hi Lana,</div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
You can try adding <code>-disable-auto-cleanup</code> to mpiexec to prevent it from killing all the other processes when one process exits abnormally.</div>
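<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
As a rough sketch of limiting the flag to debugging runs, the helper below only prints the command it would use. The <code>DEBUG_MPI</code> variable and the <code>launch_cmd</code> name are hypothetical and not part of MPICH; only <code>-disable-auto-cleanup</code> comes from the suggestion above.</div>
<pre>
```shell
# Hypothetical helper: DEBUG_MPI and launch_cmd are illustrative names,
# not MPICH features. Only -disable-auto-cleanup comes from the thread.
launch_cmd() {
    nprocs=$1
    prog=$2
    if [ "${DEBUG_MPI:-0}" = "1" ]; then
        # Debugging run: ask mpiexec not to kill the remaining processes.
        echo "mpiexec -disable-auto-cleanup -n $nprocs $prog"
    else
        # Normal run: leave auto cleanup enabled.
        echo "mpiexec -n $nprocs $prog"
    fi
}

# Print the command for a 4-process debugging run of ./t.
DEBUG_MPI=1
launch_cmd 4 ./t
```
</pre>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Keeping the flag behind a switch like this means the default launch path still cleans up after a crash.</div>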
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I usually use a <code>.gdbinit</code> script to get a backtrace in such cases. For example, if your program is
<code>./t</code>, then</div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<code>mpirun gdb ./t</code></div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Example <code>.gdbinit</code>:</div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
```</div>
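<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
# Sentinel value: gdb overwrites $_exitcode only when the program exits normally</div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
set $_exitcode = -999</div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
run</div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
# Still the sentinel, so the program did not exit normally: show where it stopped</div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
if $_exitcode == -999</div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
backtrace</div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
end</div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
exit $_exitcode</div>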
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
```</div>
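<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
With several ranks the gdb output can interleave, so one possible variation is a small wrapper that runs gdb in batch mode and writes one log per rank. This is only a sketch: the script name <code>gdb_rank.sh</code>, the command file <code>./bt.gdb</code> (holding the commands from the example <code>.gdbinit</code>) and the log names are hypothetical, and it assumes the launcher exports <code>PMI_RANK</code> to each process.</div>
<pre>
```shell
#!/bin/sh
# gdb_rank.sh: hypothetical per-rank wrapper (all file names are assumptions).
# Assumes the launcher exports PMI_RANK; falls back to "unknown" otherwise.

log_name() {
    echo "gdb.rank${PMI_RANK:-unknown}.log"
}

run_under_gdb() {
    # -batch exits after the command file has been processed,
    # -x names the command file explicitly,
    # --args passes the program and its arguments through unchanged.
    gdb -batch -x ./bt.gdb --args "$@" > "$(log_name)" 2>&1
}

# Only start gdb when a program was given, for example:
#   mpiexec -n 4 ./gdb_rank.sh ./t
if [ "$#" -gt 0 ]; then
    run_under_gdb "$@"
fi
```
</pre>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Depending on gdb's auto-load safe-path setting, a <code>.gdbinit</code> in the working directory may be ignored, which is one reason the sketch passes the command file with <code>-x</code>.</div>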
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="direction:ltr;font-family:Aptos,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hope that helps.</div>
<div id="m_-7765762493745736655ms-outlook-mobile-signature">
<p class="MsoNormal">-- <br>
Hui Zhou</p>
<p class="MsoNormal"> </p>
</div>
<div id="m_-7765762493745736655mail-editor-reference-message-container">
<div style="direction:ltr">
</div>
<div style="text-align:left;padding:3pt 0in 0in;border-width:1pt medium medium;border-style:solid none none;border-color:rgb(181,196,223) currentcolor currentcolor;font-family:Aptos;font-size:12pt;color:black">
<b>From: </b>Lana Deere via discuss <<a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a>><br>
<b>Date: </b>Monday, April 6, 2026 at 10:24 AM<br>
<b>To: </b><a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a> <<a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a>><br>
<b>Cc: </b>Lana Deere <<a href="mailto:lana.deere@gmail.com" target="_blank">lana.deere@gmail.com</a>><br>
<b>Subject: </b>[mpich-discuss] hydra_pmi_proxy sending signal 9 to successful processes<br>
<br>
</div>
<div style="direction:ltr">
I've got several MPI programs here. The most complicated one has started failing at exit: it reports that a process received signal 9 while cleaning up after a run that it had reported as successful. Many of the other MPI processes showed truncated output, as if they too had received signal 9. Only that one program has this problem; the others don't. I tried to reduce the big program to a small testcase that reproduces the issue, but was unsuccessful.</div>
<div style="direction:ltr">
<br>
</div>
<div style="direction:ltr">
I attached gdb to hydra_pmi_proxy and discovered that it is the process sending signal 9 to the various MPI processes:</div>
<div style="direction:ltr">
<br>
</div>
<div style="direction:ltr">
(gdb) where<br>
#0 0x00007f4b17853d7e in killpg () from /lib64/libc.so.6<br>
#1 0x00000000004053e2 in PMIP_bcast_signal (sig=sig@entry=9) at proxy/pmip_pg.c:259<br>
#2 0x0000000000406e60 in pmi_cb (fd=9, events=&lt;optimized out&gt;, userp=&lt;optimized out&gt;) at proxy/pmip_cb.c:326<br>
#3 0x0000000000421418 in HYDT_dmxu_poll_wait_for_event (wtime=&lt;optimized out&gt;) at lib/tools/demux/demux_poll.c:75<br>
#4 0x0000000000403ff5 in main (argc=&lt;optimized out&gt;, argv=&lt;optimized out&gt;) at proxy/pmip.c:121<br>
<br>
</div>
<div style="direction:ltr">
At that time I was using mpich 4.3.0, so I upgraded to 5.0.0 hoping the problem would be resolved, but 5.0.0 shows the same symptom. All of this is happening on SUSE Linux 15.5.</div>
<div style="direction:ltr">
<br>
</div>
<div style="direction:ltr">
On CentOS7 and Rocky Linux 9 we use mvapich2 2.3.6, so as an experiment I took the mpirun and hydra_pmi_proxy from 2.3.6 and used them in place of the versions from the mpich 5.0.0 release. With those, the program works without difficulty. All of this suggests to me that hydra_pmi_proxy has incorrectly determined that one of the MPI processes exited with a signal. Any suggestions about what's going on?</div>
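<div style="direction:ltr">
One way to check independently how a single process actually ended is to look at the status the shell reports for it. The following is a minimal sketch, assuming a bash or dash shell; <code>describe_status</code> is a hypothetical helper name, and the result is only a hint because a program can also return a value above 128 on its own.</div>
<pre>
```shell
# Sketch: classify a shell exit status. In bash and dash a status above 128
# usually means "terminated by signal (status - 128)", but a program can
# also return such a value itself, so treat the result as a hint.
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exited with code $status"
    fi
}

# Demo: a child shell that sends itself SIGKILL yields status 137.
sh -c 'kill -9 $$' || describe_status $?
```
</pre>
<div style="direction:ltr">
Running each MPI process through a check like this could show whether any of them really ended on a signal before the proxy reacted, although a wrapper in the same process group may itself be killed by the broadcast.</div>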
<div style="direction:ltr">
<br>
</div>
<div style="direction:ltr">
<br>
</div>
<div class="gmail_signature" style="direction:ltr"><br>
.. Lana (<a href="mailto:lana.deere@gmail.com" target="_blank">lana.deere@gmail.com</a>)<br>
<br>
<br>
</div>
</div>
</div>
</blockquote></div>