<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
span.EmailStyle18
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal">Hi Junchao,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">What is discussed in this thread is calling `MPI_Abort` from a signal handler. Since `MPI_Abort` is not interrupt-safe (nor are nearly all MPI functions, for that matter), we really shouldn’t call `MPI_Abort` in signal handlers.
<br>
<br>
MPICH’s normal error handling runs in the normal execution context, i.e. not in a signal handler. MPICH functions are thread-safe, so that path generally does not cause problems.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<div>
<p class="MsoNormal">-- <br>
Hui Zhou<o:p></o:p></p>
</div>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">Junchao Zhang <junchao.zhang@gmail.com><br>
<b>Date: </b>Tuesday, April 28, 2020 at 1:05 PM<br>
<b>To: </b>"discuss@mpich.org" <discuss@mpich.org><br>
<b>Cc: </b>"Zhou, Hui" <zhouh@anl.gov>, John Peterson <jwpeterson@gmail.com><br>
<b>Subject: </b>Re: [mpich-discuss] Should MPI_Abort() call exit or _exit?<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Hui, <o:p></o:p></p>
<div>
<p class="MsoNormal">I am a bit confused. I browsed the MPICH source code. The default MPI error handler in MPICH is implemented by MPIR_Handle_fatal_error(), which calls MPID_Abort(), which in turn calls MPL_exit(). So even if PETSc does not take over the error handling, users can still run into the problems reported in this email thread. Am I missing something?<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> Thanks.<o:p></o:p></p>
</div>
<div>
<div>
<div>
<div>
<p class="MsoNormal">--Junchao Zhang<o:p></o:p></p>
</div>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Tue, Apr 21, 2020 at 1:18 PM Zhou, Hui via discuss <<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>> wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Let’s make it clear: inside a signal handler, one is not allowed to call any interrupt-unsafe functions. That covers all MPI functions, including `MPI_Abort`, which means `MPL_exit` should never be called inside a signal handler. Back to the question: since `MPL_exit` is not supposed to run inside a signal handler, it is allowed to call `exit`.<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I don’t think it is possible to do any real cleanup inside a signal handler. The best a signal handler can do is flip some atomic flags; your application should then have checkpoints that check those flags and exit gracefully if necessary.<o:p></o:p></p>
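<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">As a rough illustration of the flag-and-checkpoint idea, here is a minimal sketch in plain C/POSIX. It is not MPICH or PETSc code: `stop_requested`, `on_terminate`, `install_stop_handler` and `run_steps` are hypothetical names, and `raise(SIGTERM)` only stands in for an external `scancel`.<o:p></o:p></p>
<pre>
```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Set by the handler, polled by the main loop. volatile sig_atomic_t is the
   type the C standard allows a signal handler to write to safely. */
static volatile sig_atomic_t stop_requested = 0;

/* The handler only records that the signal arrived:
   no MPI calls, no malloc/free, no stdio. */
static void on_terminate(int signo)
{
    (void)signo;
    stop_requested = 1;
}

/* Hypothetical helper: install the handler for SIGTERM. Returns 0 on success. */
static int install_stop_handler(void)
{
    struct sigaction sa;
    sa.sa_handler = on_terminate;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGTERM, &sa, NULL);
}

/* Hypothetical work loop: runs at most max_steps units of work and checks the
   flag at the top of each iteration (the "checkpoint"). For demonstration it
   sends itself SIGTERM after finishing step signal_at_step; pass -1 to never
   do so. Returns the number of steps completed. */
static int run_steps(int max_steps, int signal_at_step)
{
    int done = 0;
    for (int step = 0; step < max_steps; ++step) {
        if (stop_requested) {
            /* Back in normal context: an MPI application could clean up
               and shut down from here instead of from the handler. */
            break;
        }
        /* ... one unit of real work would go here ... */
        ++done;
        if (step == signal_at_step)
            raise(SIGTERM); /* stand-in for an external scancel */
    }
    return done;
}
```
</pre>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">The branch taken once `stop_requested` is set runs outside the signal handler, so that is where a call such as `MPI_Abort`, or an orderly shutdown, would be acceptable.<o:p></o:p></p>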
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<div>
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">-- <br>
Hui Zhou<o:p></o:p></p>
</div>
</div>
</div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><b><span style="font-size:12.0pt;color:black">From:
</span></b><span style="font-size:12.0pt;color:black">John Peterson via discuss <<a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a>><br>
<b>Reply-To: </b>"<a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a>" <<a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a>><br>
<b>Date: </b>Tuesday, April 21, 2020 at 1:06 PM<br>
<b>To: </b>"<a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a>" <<a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a>><br>
<b>Cc: </b>John Peterson <<a href="mailto:jwpeterson@gmail.com" target="_blank">jwpeterson@gmail.com</a>><br>
<b>Subject: </b>[mpich-discuss] Should MPI_Abort() call exit or _exit?</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;margin-bottom:12.0pt">Hi,<o:p></o:p></p>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I have what is a fairly convoluted question involving several different libraries, so I'll try to make it as succinct as possible. The issue is that we have an MPI job that is canceled
by slurm's "scancel", but instead of exiting cleanly, the job (sometimes) hangs. I think we've tracked it down to the job being canceled while in the middle of a call to "free" and then "free" subsequently being called again from a function called by the signal
handler, which leads to the deadlock. The general rule seems to be that only "asynchronous-safe" functions (abort(), _Exit(), etc.) are allowed to be called in signal handlers [0].<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">In our specific case, the stack trace of one of the hung jobs is:<o:p></o:p></p>
</div>
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">#0 __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95<br>
#1 0x00007ff175223114 in _int_free (have_lock=0, p=0x55cd9120f2a0, av=0x7ff175576c40 <main_arena>) at malloc.c:4266<br>
#2 __GI___libc_free (mem=0x55cd9120f2b0) at malloc.c:3124<br>
#3 0x00007ff1714938aa in H5MM_xfree () <br>
#4 0x00007ff17147e5da in H5L_term_package () <br>
#5 0x00007ff17133a766 in H5_term_library () <br>
#6 0x00007ff1751ce041 in __run_exit_handlers (status=59, listp=0x7ff175576718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108<br>
#7 0x00007ff1751ce13a in __GI_exit (status=<optimized out>) at exit.c:139<br>
#8 0x00007ff176f3c809 in MPL_exit () from /usr/lib/x86_64-linux-gnu/libmpich.so.0<br>
#9 0x00007ff176eefd4c in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.0<br>
#10 0x00007ff176e3ea59 in PMPI_Abort () from /usr/lib/x86_64-linux-gnu/libmpich.so.0<br>
#11 0x00007ff177635531 in PetscSignalHandlerDefault () <br>
#12 0x00007ff177635270 in PetscSignalHandler_Private () <br>
#13 <signal handler called><br>
#14 0x00007ff175222c6f in _int_free (have_lock=0, p=0x55cd9345e360, av=0x7ff175576c40 <main_arena>) at malloc.c:4280<br>
#15 __GI___libc_free (mem=0x55cd9345e370) at malloc.c:3124<o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">To summarize this:<o:p></o:p></p>
</div>
<div>
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">1.) We are in a call to "free"<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">2.) The process receives a SIGTERM signal (SIGKILL cannot be caught), which PETSc handles<br>
3.) PETSc calls MPICH's abort function<br>
4.) MPICH calls "exit", which causes the "atexit" functions to run <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">5.) HDF5 registers an "atexit" function which also calls "free"<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">6.) Deadlock<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">This could definitely be seen as an HDF5 issue: I'm not sure of the wisdom of registering "atexit" functions which free memory -- the program is exiting after all. But, I also wanted
to confirm whether calling exit() from MPL_exit() is a deliberate choice, or if it could perhaps be changed to _exit, which I think would avoid this particular problem.<o:p></o:p></p>
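<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">To illustrate the difference being asked about, here is a small sketch in plain C/POSIX, with no MPI or HDF5 involved. `report_fd`, `cleanup_at_exit` and `atexit_handler_ran` are hypothetical names, and the handler merely stands in for a library cleanup routine. A forked child registers the handler with `atexit` and then terminates with either `exit` or `_exit`; the parent learns through a pipe whether the handler ran.<o:p></o:p></p>
<pre>
```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of a pipe; the atexit handler uses it to report that it ran. */
static int report_fd = -1;

/* Stands in for a library cleanup routine registered via atexit. */
static void cleanup_at_exit(void)
{
    const char mark = 'x';
    if (write(report_fd, &mark, 1) != 1) {
        /* Nothing useful can be done about a failed write at this point. */
    }
}

/* Hypothetical helper: fork a child that registers cleanup_at_exit and then
   terminates via exit() (use_exit != 0) or _exit() (use_exit == 0).
   Returns 1 if the handler ran, 0 if it did not, -1 on error. */
static int atexit_handler_ran(int use_exit)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: keep only the write end and register the handler. */
        close(fds[0]);
        report_fd = fds[1];
        if (atexit(cleanup_at_exit) != 0)
            _exit(2);
        if (use_exit)
            exit(0);  /* runs atexit handlers before terminating */
        _exit(0);     /* terminates immediately, skipping atexit handlers */
    }

    /* Parent: one byte means the handler ran; EOF means it never wrote. */
    close(fds[1]);
    char buf;
    ssize_t n = read(fds[0], &buf, 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);

    if (n == 1)
        return 1;
    return (n == 0) ? 0 : -1;
}
```
</pre>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Under these assumptions, `atexit_handler_ran(1)` reports that the handler ran and `atexit_handler_ran(0)` reports that it did not, which is why switching to `_exit` would sidestep the HDF5 cleanup seen in the trace.<o:p></o:p></p>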
</div>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Thanks, <o:p></o:p></p>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">John<o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">[0]: <a href="https://wiki.sei.cmu.edu/confluence/display/c/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers" target="_blank">https://wiki.sei.cmu.edu/confluence/display/c/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers</a><o:p></o:p></p>
</div>
</div>
</div>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</body>
</html>