[mpich-discuss] Abnormal termination on Linux
Mccall, Kurt E. (MSFC-EV41)
kurt.e.mccall at nasa.gov
Tue Apr 7 17:14:22 CDT 2020
I'm using the methods in the paper https://www.mcs.anl.gov/~lusk/papers/fault-tolerance.pdf to make my Monte Carlo jobs fault tolerant (using inter-communicators rather than intra-communicators). The processes have a manager/worker arrangement, with the workers being single instances of a simulation, each with unique randomized inputs. These workers are created using MPI_Comm_spawn(). When a worker finishes, it must exit, and up to now I've been calling MPI_Finalize in the worker before exiting.
From what you've said, this is incorrect because MPI_Finalize is collective over all of the processes in the job? Obviously my managers do not exit when the workers exit. Can you suggest any way to do cleanup before a worker exits, so that the job can continue?
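For concreteness, the sketch below is roughly what I was picturing for the worker's exit path. It assumes the worker can fetch its parent inter-communicator with MPI_Comm_get_parent and disconnect from it before finalizing; I haven't verified that this is sufficient, and the manager would presumably have to make the matching MPI_Comm_disconnect call on its side of the inter-communicator.

/* Sketch only: a spawned worker's exit path.  Assumes the worker was
 * created with MPI_Comm_spawn, so MPI_Comm_get_parent returns a real
 * inter-communicator; the manager must disconnect its copy as well. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    /* ... run one simulation instance, send results to the manager ... */

    if (parent != MPI_COMM_NULL) {
        /* Any outstanding non-blocking communication with the manager
         * should be completed (MPI_Wait/MPI_Waitall) before this point. */

        /* Collective only over the two groups joined by this
         * inter-communicator, not over the whole job. */
        MPI_Comm_disconnect(&parent);
    }

    /* Once the worker is no longer connected to the managers,
     * MPI_Finalize should not have to synchronize with them. */
    MPI_Finalize();
    return 0;
}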
Thanks,
Kurt
-----Original Message-----
From: Joachim Protze <protze at itc.rwth-aachen.de>
Sent: Tuesday, April 7, 2020 3:37 AM
To: discuss at mpich.org; Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Cc: Zhou, Hui <zhouh at anl.gov>
Subject: [EXTERNAL] Re: [mpich-discuss] Abnormal termination on Linux
As long as you use a standard MPI implementation, there is no resilience
and the MPI job will not survive a segfault or any other MPI error.
Probably you should target your question towards "Why does my
application segfault in the first place?"
If only one process segfaults, you can probably use the core file for
debugging. Core file size is usually limited by default on most
clusters, so increase it in your batch job:
ulimit -c unlimited
gdb a.out -c corefile.hostname.timestamp
backtrace
quit
On 07.04.20 at 00:23, Zhou, Hui via discuss wrote:
> I see. I checked, and MPICH doesn't trap any signal 😊. So back to your issue: let's assume your trap did work and the handler called `MPI_Finalize`. After the handler returns, the process would resume at the same code that caused the segfault in the first place, right? Wouldn't that end up in an infinite signal loop, or perhaps the kernel is smart enough to bypass your handler? Just a thought experiment.
>
Calling MPI_Finalize would not help, because MPI_Finalize is a
collective call and will not finish before all other processes have
called MPI_Finalize.
Furthermore, in such an error situation you have no guarantee that
the process has no open blocking or non-blocking
communication. Any such communication would need to be completed before
calling MPI_Finalize.
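As a rough illustration only (the request array and count names here are made up), every outstanding request would have to be driven to completion before finalizing, e.g.:

#include <mpi.h>

/* Illustration: complete all outstanding non-blocking requests before
 * MPI_Finalize.  `reqs` and `nreqs` stand for whatever requests the
 * process still has in flight. */
void complete_pending_then_finalize(MPI_Request *reqs, int nreqs)
{
    /* Blocks until every request has completed.  A pending receive
     * whose matching send will never arrive would first have to be
     * cancelled with MPI_Cancel before it can complete here. */
    MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
}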
- Joachim
> Beside, MPICH is thread-safe but not interrupt safe. So you should not really call MPI functions inside signal handlers.
>
> --
> Hui Zhou
>
>
> From: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov>
> Date: Monday, April 6, 2020 at 4:50 PM
> To: "Zhou, Hui" <zhouh at anl.gov>, "discuss at mpich.org" <discuss at mpich.org>
> Subject: RE: [mpich-discuss] Abnormal termination on Linux
>
> Hui,
>
> I’d like to trap segfaults so that the process that raised them can be shut down gracefully/finalized without taking down the whole MPI job. Maybe you are right and I shouldn’t trap them.
>
> However, when I install my signal handlers, I check if MPI already has a signal handler installed for each of them. It isn’t reporting that MPI has done so. Perhaps my code is incorrect?
>
> // Requires <signal.h>; mpiSignalHandler is the handler defined elsewhere.
> void setUpOneHandler(const int signum)
> {
>     struct sigaction sa_new, sa_old;
>
>     sa_new.sa_handler = mpiSignalHandler;
>     sa_new.sa_flags = 0;
>     sigemptyset(&sa_new.sa_mask);
>
>     if (sigaction(signum, NULL, &sa_old) < 0)
>     {
>         // ERROR: could not query the old handler
>     }
>     else if (sa_old.sa_handler == SIG_IGN ||
>              sa_old.sa_handler == SIG_DFL)
>     {
>         // MPI hasn't set its own handler for this signal, so we
>         // will install our own.
>         if (sigaction(signum, &sa_new, NULL) < 0)
>         {
>             // ERROR: could not set the new handler
>         }
>     }
>     else
>     {
>         // MPI already has a handler installed for this signal. Do nothing.
>     }
> }
>
>
>
>
> From: Zhou, Hui <zhouh at anl.gov>
> Sent: Monday, April 6, 2020 2:08 PM
> To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>; discuss at mpich.org
> Subject: [EXTERNAL] Re: [mpich-discuss] Abnormal termination on Linux
>
> Thanks, Kurt. I think the reason your signal trap didn’t work is that `mpiexec` is trapping it first. A segfault is a coding error. Why would you want to trap it?
>
> --
> Hui Zhou
>
>
> From: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov<mailto:kurt.e.mccall at nasa.gov>>
> Date: Monday, April 6, 2020 at 1:52 PM
> To: "Zhou, Hui" <zhouh at anl.gov<mailto:zhouh at anl.gov>>, "discuss at mpich.org<mailto:discuss at mpich.org>" <discuss at mpich.org<mailto:discuss at mpich.org>>
> Subject: RE: [mpich-discuss] Abnormal termination on Linux
>
> Hui,
>
> Sorry for not mentioning that. MPICH 3.3.2 compiled with pgc++ 19.5.
>
> Kurt
>
> From: Zhou, Hui <zhouh at anl.gov>
> Sent: Monday, April 6, 2020 12:58 PM
> To: discuss at mpich.org
> Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
> Subject: [EXTERNAL] Re: [mpich-discuss] Abnormal termination on Linux
>
> Which version of MPICH were you running?
>
> --
> Hui Zhou
>
>
> From: "Mccall, Kurt E. (MSFC-EV41) via discuss" <discuss at mpich.org<mailto:discuss at mpich.org>>
> Reply-To: "discuss at mpich.org<mailto:discuss at mpich.org>" <discuss at mpich.org<mailto:discuss at mpich.org>>
> Date: Monday, April 6, 2020 at 12:45 PM
> To: "discuss at mpich.org<mailto:discuss at mpich.org>" <discuss at mpich.org<mailto:discuss at mpich.org>>
> Cc: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov<mailto:kurt.e.mccall at nasa.gov>>
> Subject: Re: [mpich-discuss] Abnormal termination on Linux
>
> I should mention that I am unable to predict in which node or process the abnormal termination occurs, so I can’t practically attach a debugger and try to intercept the error.
>
> Kurt
>
> From: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
> Sent: Monday, April 6, 2020 11:50 AM
> To: discuss at mpich.org
> Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
> Subject: Abnormal termination on Linux
>
> I have a couple of questions about abnormal termination. The EXIT CODE below is 11, which could be signal SIGSEGV, or is it something defined by MPICH? If it is SIGSEGV, it is strange that my signal handler isn’t catching it and cleaning up properly (the signal handler calls MPI_Finalize()). Is there any way to get more information about the location of the error?
>
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 14385 RUNNING AT n020.cluster.com
> = EXIT CODE: 11
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
> Thanks,
> Kurt
>
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>
--
Dipl.-Inf. Joachim Protze
IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80-24765
Fax: +49 241 80-624765
protze at itc.rwth-aachen.de
www.itc.rwth-aachen.de