[mpich-discuss] Abnormal termination on Linux

Mccall, Kurt E. (MSFC-EV41) kurt.e.mccall at nasa.gov
Tue Apr 7 17:42:40 CDT 2020

Hui, I am definitely calling exit in my signal handler after calling MPI_Finalize. This has worked in the past for small jobs, but as I scale up the size of the job, I've been seeing strange errors.

-----Original Message-----
From: Zhou, Hui <zhouh at anl.gov> 
Sent: Tuesday, April 7, 2020 5:35 PM
To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>; discuss at mpich.org
Subject: [EXTERNAL] Re: [mpich-discuss] Abnormal termination on Linux

> I'm using the methods in the paper https://www.mcs.anl.gov/~lusk/papers/fault-tolerance.pdf to make my Monte Carlo jobs fault tolerant (using inter-communicators rather than intra-communicators). The processes have a manager/worker arrangement, with each worker being a single instance of a simulation with its own unique randomized inputs. These workers are created using MPI_Comm_spawn(). When a worker finishes, it must exit, and up to now I've been calling MPI_Finalize in the worker before exiting.

> From what you've said, this is incorrect because MPI_Finalize is collective over all of the processes in the job? Obviously my managers do not exit when the workers exit. Can you suggest any way to do cleanup before a worker exits, so that the job can continue?

I think the first question that should be answered is: what happens when you return from your signal handler? `MPI_Finalize` won't exit the process. I believe your process will simply try to continue from where it was interrupted, which is the same place where it segfaults -- or maybe the segfault is caused by your code continuing to run MPI operations after you have "Finalized" the state in the signal handler?
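
To make the point about returning from a handler concrete, here is a minimal sketch in plain C with no MPI calls at all. The names `on_signal` and `handler_returns_to_caller` are made up for illustration, and the flag assignment only stands in for whatever cleanup the real handler does. It shows that when a handler returns without terminating the process, execution picks up right after the point of interruption.

```c
#include <signal.h>

/* Written by the handler. volatile sig_atomic_t is the object type that
   is portably safe to assign from inside a signal handler. */
static volatile sig_atomic_t got_signal = 0;

/* Hypothetical handler: the assignment stands in for "cleanup".
   Note that nothing here terminates the process. */
static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Returns 1 if control came back to the caller after the handler ran,
   0 if the handler did not run, and -1 if it could not be installed. */
int handler_returns_to_caller(void)
{
    got_signal = 0;
    if (signal(SIGUSR1, on_signal) == SIG_ERR)
        return -1;
    raise(SIGUSR1);          /* the handler runs during this call ...   */
    return got_signal == 1;  /* ... and execution resumes on this line. */
}
```

In a real worker the "next line" would be whatever code was interrupted, possibly in the middle of an MPI call, which is why finalizing inside the handler and then returning looks risky to me.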

Since you are going to exit your process, why don't you simply exit? I am not sure whether you are allowed to call exit in a signal handler -- curious to know 😊. But if exit works, you are leaving it to the kernel to clean up your process, which may be sufficient for your goal.
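
On whether exit is allowed in a handler: as far as I know, POSIX does not list `exit()` as async-signal-safe (it runs atexit handlers and flushes stdio buffers), but it does list `_exit()`. The following is only a sketch under that assumption, again in plain C with no MPI involved. The names `terminate_now`, `run_worker_and_get_status` and `WORKER_EXIT_CODE` are hypothetical, and the forked child merely stands in for a worker. It says nothing about how the MPI process manager reacts to a worker that leaves without calling `MPI_Finalize`.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define WORKER_EXIT_CODE 42  /* hypothetical status, chosen for illustration */

/* Hypothetical handler: terminate immediately and make no MPI calls.
   _exit() is async-signal-safe; exit() is not. */
static void terminate_now(int signo)
{
    (void)signo;
    _exit(WORKER_EXIT_CODE);
}

/* Forks a stand-in "worker" that signals itself, then returns the exit
   status seen by the parent, or -1 on error or abnormal termination. */
int run_worker_and_get_status(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        signal(SIGTERM, terminate_now);
        raise(SIGTERM);  /* the handler calls _exit() and never returns */
        _exit(1);        /* not reached when the handler runs */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `run_worker_and_get_status()` from the parent should report the child's exit status once the kernel has torn the child down, which is the "leave it to the kernel" behaviour described above.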

Hui Zhou
