[mpich-discuss] Should MPI_Abort() call exit or _exit?
Junchao Zhang
junchao.zhang at gmail.com
Tue Apr 28 15:12:14 CDT 2020
Got it. Thanks a lot.
--Junchao Zhang
On Tue, Apr 28, 2020 at 2:35 PM Zhou, Hui <zhouh at anl.gov> wrote:
> Hi Junchao,
>
>
>
> As far as I know, MPICH does not trap any signals. A segfault will simply
> kill the MPI process. Hydra will see that one of its children was killed by
> a signal and report it. So no, a segfault will not cause MPICH to run
> `MPIR_Handle_fatal_error()`.
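
For illustration, here is a minimal sketch of how a parent process can tell that a child was killed by a signal rather than exiting on its own. It is not Hydra's code; the function name `observe_child_signal` is hypothetical, and it only uses the standard POSIX `fork`/`waitpid` calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that dies from signal `sig`.
 * Returns the signal number the parent observes via waitpid(),
 * 0 if the child exited normally, or -1 on error. */
int observe_child_signal(int sig)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: restore the default disposition so the signal is fatal. */
        signal(sig, SIG_DFL);
        raise(sig);
        _exit(0); /* reached only if the signal did not kill the child */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    /* No handler ran in the child; the parent just reads the wait status. */
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling `observe_child_signal(SIGSEGV)` reports the signal from the parent's side only, which matches the behaviour described above: the dying process itself gets no chance to run an error handler.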
>
>
>
> --
> Hui Zhou
>
>
>
>
>
> *From: *Junchao Zhang <junchao.zhang at gmail.com>
> *Date: *Tuesday, April 28, 2020 at 2:17 PM
> *To: *"Zhou, Hui" <zhouh at anl.gov>
> *Cc: *"discuss at mpich.org" <discuss at mpich.org>
> *Subject: *Re: [mpich-discuss] Should MPI_Abort() call exit or _exit?
>
>
>
> Then, will a segfault in MPICH cause it to eventually execute
> MPIR_Handle_fatal_error()?
>
> --Junchao Zhang
>
>
>
>
>
> On Tue, Apr 28, 2020 at 1:11 PM Zhou, Hui <zhouh at anl.gov> wrote:
>
> Hi Junchao,
>
>
>
> What was discussed in this thread is calling `MPI_Abort` in a signal handler.
> Since `MPI_Abort` is not interrupt safe (and neither are nearly all MPI
> functions, for that matter), we really shouldn’t call it in signal handlers.
>
> The normal error handling in MPICH runs in normal context, i.e. not in a
> signal handler. MPICH functions are thread-safe, so it generally does not
> cause problems.
>
>
>
> --
> Hui Zhou
>
>
>
>
>
> *From: *Junchao Zhang <junchao.zhang at gmail.com>
> *Date: *Tuesday, April 28, 2020 at 1:05 PM
> *To: *"discuss at mpich.org" <discuss at mpich.org>
> *Cc: *"Zhou, Hui" <zhouh at anl.gov>, John Peterson <jwpeterson at gmail.com>
> *Subject: *Re: [mpich-discuss] Should MPI_Abort() call exit or _exit?
>
>
>
> Hui,
>
> I am a bit confused. I browsed the MPICH source code. The default MPI error
> handler in MPICH is implemented by MPIR_Handle_fatal_error(), which calls
> MPID_Abort(), which in turn calls MPL_exit(). So even if PETSc does not take
> over the error handling, users can still run into the problems reported in
> this email thread. Am I missing something?
>
>
>
> Thanks.
>
> --Junchao Zhang
>
>
>
>
>
> On Tue, Apr 21, 2020 at 1:18 PM Zhou, Hui via discuss <discuss at mpich.org>
> wrote:
>
> Let’s be clear: inside a signal handler, one is not allowed to call any
> interrupt-unsafe function. That includes all MPI functions, `MPI_Abort`
> among them, which means `MPL_exit` should never be called inside a signal
> handler. Back to the question: since `MPL_exit` is not meant to run inside a
> signal handler, it is allowed to call `exit`.
>
>
>
> I don’t think it is possible to do any real cleanup inside a signal
> handler. The best a signal handler can do is flip some atomic flags; your
> application should then have checkpoints that check those flags and do a
> graceful exit if necessary.
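
The flag-and-checkpoint approach described above might look like the following sketch. The names `stop_requested`, `install_stop_handler`, and `run_steps` are hypothetical, and the loop body stands in for real application work; it assumes the job is cancelled with SIGTERM.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler. volatile sig_atomic_t is the type the C standard
 * guarantees can be written safely from a signal handler. */
static volatile sig_atomic_t stop_requested = 0;

static void on_terminate(int signo)
{
    (void)signo;
    stop_requested = 1; /* no malloc/free, no stdio, no MPI calls here */
}

/* Install the handler for SIGTERM; returns 0 on success, -1 on error. */
int install_stop_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_terminate;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Run up to max_steps iterations, checking the flag at each checkpoint.
 * Returns the number of steps completed before a stop was requested. */
int run_steps(int max_steps)
{
    int done = 0;
    while (done < max_steps && !stop_requested) {
        /* ... one unit of application work would go here ... */
        done++;
    }
    /* Back in normal context: cleanup, MPI_Abort, or exit() can be
     * called from here by the caller without the signal-handler limits. */
    return done;
}
```

The handler does nothing but record the request, so it cannot re-enter `free` or any other non-reentrant function; all real shutdown work happens after `run_steps` returns.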
>
>
>
> --
> Hui Zhou
>
>
>
>
>
> *From: *John Peterson via discuss <discuss at mpich.org>
> *Reply-To: *"discuss at mpich.org" <discuss at mpich.org>
> *Date: *Tuesday, April 21, 2020 at 1:06 PM
> *To: *"discuss at mpich.org" <discuss at mpich.org>
> *Cc: *John Peterson <jwpeterson at gmail.com>
> *Subject: *[mpich-discuss] Should MPI_Abort() call exit or _exit?
>
>
>
> Hi,
>
> I have what is a fairly convoluted question involving several different
> libraries, so I'll try to make it as succinct as possible. The issue is
> that we have an MPI job that is canceled by slurm's "scancel", but instead
> of exiting cleanly, the job (sometimes) hangs. I think we've tracked it
> down to the job being canceled while in the middle of a call to "free" and
> then "free" subsequently being called again from a function called by the
> signal handler, which leads to the deadlock. The general rule seems to be
> that only "asynchronous-safe" functions (abort(), _Exit(), etc.) are
> allowed to be called in signal handlers [0].
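
As a rough sketch of that rule, a handler that wants to terminate right away can restrict itself to `write()` and `_exit()`, both of which POSIX lists as async-signal-safe. The names `terminate_now` and `exit_status_after_sigterm` are hypothetical, and the `128 + signo` exit status is just the common shell convention, not something PETSc or MPICH does.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Handler restricted to async-signal-safe calls: write() and _exit(). */
static void terminate_now(int signo)
{
    static const char msg[] = "caught signal, exiting immediately\n";
    ssize_t n = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)n;
    _exit(128 + signo); /* skips atexit handlers and stdio flushing */
}

/* Fork a child that installs the handler and signals itself.
 * Returns the child's exit status, or -1 on error. */
int exit_status_after_sigterm(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = terminate_now;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGTERM, &sa, NULL) != 0)
            _exit(1);
        raise(SIGTERM);
        _exit(0); /* not reached if the handler ran */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the handler never touches the heap or any lock-protected library state, it cannot deadlock on a lock that the interrupted code already holds.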
>
>
>
> In our specific case, the stack trace of one of the hung jobs is:
>
>
>
> #0 __lll_lock_wait_private () at
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
> #1 0x00007ff175223114 in _int_free (have_lock=0, p=0x55cd9120f2a0,
> av=0x7ff175576c40 <main_arena>) at malloc.c:4266
> #2 __GI___libc_free (mem=0x55cd9120f2b0) at malloc.c:3124
> #3 0x00007ff1714938aa in H5MM_xfree ()
> #4 0x00007ff17147e5da in H5L_term_package ()
> #5 0x00007ff17133a766 in H5_term_library ()
> #6 0x00007ff1751ce041 in __run_exit_handlers (status=59,
> listp=0x7ff175576718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true,
> run_dtors=run_dtors@entry=true) at exit.c:108
> #7 0x00007ff1751ce13a in __GI_exit (status=<optimized out>) at exit.c:139
> #8 0x00007ff176f3c809 in MPL_exit () from
> /usr/lib/x86_64-linux-gnu/libmpich.so.0
> #9 0x00007ff176eefd4c in ?? () from
> /usr/lib/x86_64-linux-gnu/libmpich.so.0
> #10 0x00007ff176e3ea59 in PMPI_Abort () from
> /usr/lib/x86_64-linux-gnu/libmpich.so.0
> #11 0x00007ff177635531 in PetscSignalHandlerDefault ()
>
> #12 0x00007ff177635270 in PetscSignalHandler_Private ()
> #13 <signal handler called>
> #14 0x00007ff175222c6f in _int_free (have_lock=0, p=0x55cd9345e360,
> av=0x7ff175576c40 <main_arena>) at malloc.c:4280
> #15 __GI___libc_free (mem=0x55cd9345e370) at malloc.c:3124
>
>
>
> To summarize this:
>
> 1.) We are in a call to "free"
>
> 2.) The process receives a SIGTERM signal (SIGKILL cannot be caught), which PETSc handles
> 3.) PETSc calls MPICH's abort function
> 4.) MPICH calls "exit", which causes the "atexit" functions to run
>
> 5.) HDF5 registers an "atexit" function which also calls "free"
>
> 6.) Deadlock
>
>
>
> This could definitely be seen as an HDF5 issue: I'm not sure of the wisdom
> of registering "atexit" functions which free memory -- the program is
> exiting after all. But, I also wanted to confirm whether calling exit()
> from MPL_exit() is a deliberate choice, or if it could perhaps be changed
> to _exit, which I think would avoid this particular problem.
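
The difference between the two calls can be seen in a small sketch: `exit()` runs functions registered with `atexit()`, while `_exit()` terminates the process without running them. The helper names below are hypothetical, and `cleanup_hook` merely stands in for a library cleanup routine such as the HDF5 one in the stack trace.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1; /* write end of the pipe, used by the hook */

/* Stand-in for a library's atexit cleanup (e.g. one that calls free()). */
static void cleanup_hook(void)
{
    const char mark = 'x';
    ssize_t n = write(report_fd, &mark, 1);
    (void)n;
}

/* Returns 1 if the atexit hook ran in the child, 0 if not, -1 on error. */
int atexit_handler_ran(int use_underscore_exit)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    fflush(NULL); /* avoid duplicating buffered output in the child */
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        report_fd = fds[1];
        if (atexit(cleanup_hook) != 0)
            _exit(2);
        if (use_underscore_exit)
            _exit(0); /* terminates immediately: atexit hooks are skipped */
        exit(0);      /* runs atexit hooks, then terminates */
    }
    close(fds[1]);
    char buf;
    ssize_t n = read(fds[0], &buf, 1); /* 0 bytes (EOF) means no hook ran */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n < 0 ? -1 : (n == 1);
}
```

Here `atexit_handler_ran(0)` exercises the `exit()` path and `atexit_handler_ran(1)` the `_exit()` path. Skipping the hooks avoids re-entering `free`, but it also skips any cleanup the application actually relies on, such as flushing buffered files.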
>
>
>
> Thanks,
>
> John
>
>
>
> [0]:
> https://wiki.sei.cmu.edu/confluence/display/c/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers
>
>
>
>
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>
>