[mpich-discuss] Should MPI_Abort() call exit or _exit?
John Peterson
jwpeterson at gmail.com
Tue Apr 21 13:05:19 CDT 2020
Hi,
I have what is a fairly convoluted question involving several different
libraries, so I'll try to make it as succinct as possible. The issue is
that we have an MPI job that is canceled by slurm's "scancel", but instead
of exiting cleanly, the job (sometimes) hangs. I think we've tracked it
down to the job being canceled while in the middle of a call to "free" and
then "free" subsequently being called again from a function called by the
signal handler, which leads to the deadlock. The general rule seems to be
that only "asynchronous-safe" (async-signal-safe) functions such as abort()
and _Exit() may be called from signal handlers [0]; neither free() nor
exit() is on that list.
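For illustration, here is a minimal sketch of the usual way to stay within that rule: the handler only records that the signal arrived, and ordinary code polls the flag and shuts down outside signal context. The names install_term_handler() and term_requested() are hypothetical and do not come from PETSc or MPICH.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Written by the handler, read by normal code. A volatile sig_atomic_t
   object is safe to assign from a signal handler. */
static volatile sig_atomic_t g_term_requested = 0;

static void on_sigterm(int signo)
{
    (void)signo;
    g_term_requested = 1;   /* no malloc/free, no stdio, no exit() here */
}

/* Hypothetical helper: install the SIGTERM handler. Returns 0 on success. */
int install_term_handler(void)
{
    struct sigaction sa;
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGTERM, &sa, NULL);
}

/* Hypothetical helper: polled from the main loop between work items. */
int term_requested(void)
{
    return g_term_requested != 0;
}
```

The main loop would check term_requested() at a point where no allocator call is in progress and only then call its normal shutdown path, so cleanup code never runs on top of an interrupted free().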
In our specific case, the stack trace of one of the hung jobs is:
#0 __lll_lock_wait_private () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007ff175223114 in _int_free (have_lock=0, p=0x55cd9120f2a0,
av=0x7ff175576c40 <main_arena>) at malloc.c:4266
#2 __GI___libc_free (mem=0x55cd9120f2b0) at malloc.c:3124
#3 0x00007ff1714938aa in H5MM_xfree ()
#4 0x00007ff17147e5da in H5L_term_package ()
#5 0x00007ff17133a766 in H5_term_library ()
#6 0x00007ff1751ce041 in __run_exit_handlers (status=59,
listp=0x7ff175576718 <__exit_funcs>,
run_list_atexit=run_list_atexit@entry=true,
run_dtors=run_dtors@entry=true) at exit.c:108
#7 0x00007ff1751ce13a in __GI_exit (status=<optimized out>) at exit.c:139
#8 0x00007ff176f3c809 in MPL_exit () from
/usr/lib/x86_64-linux-gnu/libmpich.so.0
#9 0x00007ff176eefd4c in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.0
#10 0x00007ff176e3ea59 in PMPI_Abort () from
/usr/lib/x86_64-linux-gnu/libmpich.so.0
#11 0x00007ff177635531 in PetscSignalHandlerDefault ()
#12 0x00007ff177635270 in PetscSignalHandler_Private ()
#13 <signal handler called>
#14 0x00007ff175222c6f in _int_free (have_lock=0, p=0x55cd9345e360,
av=0x7ff175576c40 <main_arena>) at malloc.c:4280
#15 __GI___libc_free (mem=0x55cd9345e370) at malloc.c:3124
To summarize this:
1.) We are in a call to "free"
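2.) The process receives SIGTERM (SIGKILL cannot be caught, so it is not
the signal involved here), which PETSc's signal handler catches
3.) PETSc calls MPICH's abort function (PMPI_Abort)
4.) MPICH calls "exit" (via MPL_exit), which causes the "atexit" functions
to run
5.) HDF5 has registered an "atexit" function (H5_term_library) which also
calls "free"
6.) Deadlock: the second "free" waits on the malloc lock that the
interrupted "free" presumably still holds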
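Steps 4 and 5 hinge on the difference between exit() and _exit(). The sketch below shows only that difference, under stated assumptions: fake_library_cleanup() is a stand-in for a library's atexit routine (it is not HDF5 code), cleanup_ran_in_child() is a hypothetical helper, and the status 59 simply mirrors the value in the trace.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int g_pipe_wr = -1;

/* Stand-in for a library cleanup routine registered with atexit(). */
static void fake_library_cleanup(void)
{
    const char mark = 'x';
    if (write(g_pipe_wr, &mark, 1) != 1) {
        /* nothing useful to do in this sketch */
    }
}

/* Fork a child that registers the cleanup routine and then terminates with
   exit() (use_plain_exit != 0) or _exit() (use_plain_exit == 0).
   Returns 1 if the cleanup routine ran, 0 if it did not, -1 on error. */
int cleanup_ran_in_child(int use_plain_exit)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                     /* child */
        close(fds[0]);
        g_pipe_wr = fds[1];
        atexit(fake_library_cleanup);
        if (use_plain_exit)
            exit(59);                   /* runs atexit handlers */
        _exit(59);                      /* skips atexit handlers */
    }

    close(fds[1]);                      /* parent */
    char c;
    ssize_t n = read(fds[0], &c, 1);    /* 0 means EOF: handler never wrote */
    close(fds[0]);

    int status = 0;
    waitpid(pid, &status, 0);
    if (n < 0)
        return -1;
    return n == 1 ? 1 : 0;
}
```

With exit() the child's cleanup routine runs and the helper is expected to return 1; with _exit() the routine is skipped and it is expected to return 0.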
This could definitely be seen as an HDF5 issue: I'm not sure of the wisdom
of registering "atexit" functions which free memory -- the program is
exiting after all. But, I also wanted to confirm whether calling exit()
from MPL_exit() is a deliberate choice, or if it could perhaps be changed
to _exit, which I think would avoid this particular problem.
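To make the suggestion concrete, here is a rough sketch of a handler that terminates with _exit() and touches nothing but async-signal-safe calls. It is not MPICH's or PETSc's code: terminate_now() and exit_status_after_sigterm() are hypothetical names, and unlike MPI_Abort() this makes no attempt to terminate the other ranks of a job.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Handler restricted to async-signal-safe calls: write() and _exit(). */
static void terminate_now(int signo)
{
    static const char msg[] = "caught signal, exiting without cleanup\n";
    (void)signo;
    if (write(STDERR_FILENO, msg, sizeof msg - 1) < 0) {
        /* nothing safe to do about a failed write here */
    }
    _exit(59);                  /* atexit handlers are not run */
}

/* Fork a child, deliver SIGTERM to it, and report how it ended.
   Returns the child's exit status, or -1 on error. */
int exit_status_after_sigterm(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {             /* child */
        struct sigaction sa;
        sa.sa_handler = terminate_now;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        if (sigaction(SIGTERM, &sa, NULL) != 0)
            _exit(1);
        raise(SIGTERM);
        _exit(2);               /* reached only if the handler did not exit */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The trade-off is that _exit() also skips flushing stdio buffers and any other registered cleanup, which may or may not be acceptable for an abort path.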
Thanks,
John
[0]:
https://wiki.sei.cmu.edu/confluence/display/c/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers