<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
p.msonormal0, li.msonormal0, div.msonormal0
{mso-style-name:msonormal;
mso-margin-top-alt:auto;
margin-right:0in;
mso-margin-bottom-alt:auto;
margin-left:0in;
font-size:12.0pt;
font-family:"Times New Roman",serif;}
span.EmailStyle18
{mso-style-type:personal;
font-family:"Calibri",sans-serif;
color:windowtext;}
span.EmailStyle19
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal"><span style="color:#1F497D">Hui,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">I’d like to trap segfaults so that the process that raised them can be shut down gracefully/finalized without taking down the whole MPI job. Maybe you are right and I shouldn’t trap them.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">However, when I install my signal handlers, I check if MPI already has a signal handler installed for each of them. It isn’t reporting that MPI has done so. Perhaps my code is incorrect?<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">setUpOneHandler(const int signum)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">{<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> struct sigaction sa_new, sa_old;<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> sa_new.sa_handler = mpiSignalHandler;<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> sa_new.sa_flags = 0;<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> sigemptyset(&sa_new.sa_mask);<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> if (sigaction(signum, NULL, &sa_old) < 0)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> {<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> // ERROR: could not query old handler<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> } else if (sa_old.sa_handler == SIG_IGN ||<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> sa_old.sa_handler == SIG_DFL)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> {<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> // MPI hasn't set its own handler for this signal, so we
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> // will install our own.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> if (sigaction(signum, &sa_new, NULL) < 0)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> {<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> // ERROR: could not set new handler<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> }<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> } else<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> {<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> // MPI already has a handler installed for this signal. Do nothing<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> }<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D">}<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"> <o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#1F497D"><o:p> </o:p></span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> Zhou, Hui <zhouh@anl.gov> <br>
<b>Sent:</b> Monday, April 6, 2020 2:08 PM<br>
<b>To:</b> Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall@nasa.gov>; discuss@mpich.org<br>
<b>Subject:</b> [EXTERNAL] Re: [mpich-discuss] Abnormal termination on Linux<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Thanks, Kurt. I think the reason your signal trap didn’t work is because `mpiexec` is trapping it first. Segfault is code error. Why would you want to trap it?<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<div>
<p class="MsoNormal">-- <br>
Hui Zhou<o:p></o:p></p>
</div>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">"Mccall, Kurt E. (MSFC-EV41)" <<a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a>><br>
<b>Date: </b>Monday, April 6, 2020 at 1:52 PM<br>
<b>To: </b>"Zhou, Hui" <<a href="mailto:zhouh@anl.gov">zhouh@anl.gov</a>>, "<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>" <<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>><br>
<b>Subject: </b>RE: [mpich-discuss] Abnormal termination on Linux<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<p class="MsoNormal"><span style="color:#1F497D">Hui,</span><o:p></o:p></p>
<p class="MsoNormal"><span style="color:#1F497D"> </span><o:p></o:p></p>
<p class="MsoNormal"><span style="color:#1F497D">Sorry for not mentioning that. MPICH 3.3.2 compiled with pgc++ 19.5.</span><o:p></o:p></p>
<p class="MsoNormal"><span style="color:#1F497D"> </span><o:p></o:p></p>
<p class="MsoNormal"><span style="color:#1F497D">Kurt </span><o:p></o:p></p>
<p class="MsoNormal"><span style="color:#1F497D"> </span><o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> Zhou, Hui <<a href="mailto:zhouh@anl.gov">zhouh@anl.gov</a>>
<br>
<b>Sent:</b> Monday, April 6, 2020 12:58 PM<br>
<b>To:</b> <a href="mailto:discuss@mpich.org">discuss@mpich.org</a><br>
<b>Cc:</b> Mccall, Kurt E. (MSFC-EV41) <<a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a>><br>
<b>Subject:</b> [EXTERNAL] Re: [mpich-discuss] Abnormal termination on Linux<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Which version of MPICH were you running?<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div>
<div>
<div>
<p class="MsoNormal">-- <br>
Hui Zhou<o:p></o:p></p>
</div>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">"Mccall, Kurt E. (MSFC-EV41) via discuss" <<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>><br>
<b>Reply-To: </b>"<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>" <<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>><br>
<b>Date: </b>Monday, April 6, 2020 at 12:45 PM<br>
<b>To: </b>"<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>" <<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>><br>
<b>Cc: </b>"Mccall, Kurt E. (MSFC-EV41)" <<a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a>><br>
<b>Subject: </b>Re: [mpich-discuss] Abnormal termination on Linux</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<p class="MsoNormal"><span style="color:#1F497D">I should mention that I am unable to predict in which node or process the abnormal termination occurs, so I can’t practically attach a debugger and try to intercept the error.</span><o:p></o:p></p>
<p class="MsoNormal"><span style="color:#1F497D"> </span><o:p></o:p></p>
<p class="MsoNormal"><span style="color:#1F497D">Kurt</span><o:p></o:p></p>
<p class="MsoNormal"><span style="color:#1F497D"> </span><o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> Mccall, Kurt E. (MSFC-EV41) <<a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a>>
<br>
<b>Sent:</b> Monday, April 6, 2020 11:50 AM<br>
<b>To:</b> <a href="mailto:discuss@mpich.org">discuss@mpich.org</a><br>
<b>Cc:</b> Mccall, Kurt E. (MSFC-EV41) <<a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a>><br>
<b>Subject:</b> Abnormal termination on Linux<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">I have a couple of questions about abnormal termination. The EXIT CODE below is 11, which could be signal SIGSEGV, or is it something defined by MPICH? If it is SIGSEGV, it is strange because my signal handler isn’t catching it and cleaning
up properly (the signal handler calls MPI_Finalize()). Is there any way to get more information about the location of the error?<o:p></o:p></p>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES<o:p></o:p></p>
<p class="MsoNormal">= PID 14385 RUNNING AT n020.cluster.com<o:p></o:p></p>
<p class="MsoNormal">= <b><span style="color:red">EXIT CODE: 11</span></b><o:p></o:p></p>
<p class="MsoNormal">= CLEANING UP REMAINING PROCESSES<o:p></o:p></p>
<div style="border:none;border-bottom:double windowtext 2.25pt;padding:0in 0in 22.0pt 0in">
<p class="MsoNormal">= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES<o:p></o:p></p>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal">Thanks,<o:p></o:p></p>
<p class="MsoNormal">Kurt<o:p></o:p></p>
</div>
</body>
</html>