[mpich-discuss] Abnormal termination on Linux
zhouh at anl.gov
Tue Apr 7 19:15:50 CDT 2020
I see. So I read a bit more: it appears that `exit()` is not async-signal-safe either. Let's try something basic first: in your signal handler, call `_exit(1)` or `_Exit(1)`, which terminates the process immediately, no fuss. Don't call `MPI_Finalize` or any other non-async-signal-safe function before exiting. Make sure that works for you. Then, to clean up state, you need to work with your actual process and figure out the synchronization between your normal code and your signal handler, or even between multiple invocations of your signal handler -- remember, your code can be interrupted multiple times, at any time, at any place.
Now, once the process terminates abruptly, the process manager will receive a signal, and its default action is to stop all processes. I guess that's why you want to call `MPI_Finalize` -- I think I've talked myself into understanding what you are trying to do. Well, the problem is that `MPI_Finalize` can't run safely at an arbitrary point. Imagine the signal arrives in the middle of an MPI operation that has changed some internal state halfway: when `MPI_Finalize` tries to clean up, it may access memory that has already been freed while the pointer or state has not yet been reset, which will almost certainly result in a segfault.
To make it work, you need some kind of coordination. A typical strategy is to raise a flag in your signal handler and have your main program check that flag frequently, exiting gracefully when the flag is raised. That way, you always run `MPI_Finalize` outside of any MPI function, which should work.
But this won't work with segfaults. A segfault is a bug in your code and won't go away; it will segfault again as soon as your signal handler returns. Hopefully, though, the segfaults you are seeing are caused by the signal handler itself, in which case the synchronized strategy should work.
On 4/7/20, 5:43 PM, "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov> wrote:
Hui, I am definitely calling exit in my signal handler after calling MPI_Finalize. This has worked in the past for small jobs, but as I scale up the size of the job, I've been seeing strange errors.