[mpich-discuss] Abnormal termination on Linux
Mccall, Kurt E. (MSFC-EV41)
kurt.e.mccall at nasa.gov
Tue Apr 7 17:42:40 CDT 2020
Hui, I am definitely calling exit in my signal handler after calling MPI_Finalize. This has worked in the past for small jobs, but as I scale up the size of the job, I've been seeing strange errors.
From: Zhou, Hui <zhouh at anl.gov>
Sent: Tuesday, April 7, 2020 5:35 PM
To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>; discuss at mpich.org
Subject: [EXTERNAL] Re: [mpich-discuss] Abnormal termination on Linux
> I'm using the methods in the paper https://www.mcs.anl.gov/~lusk/papers/fault-tolerance.pdf to make my Monte Carlo jobs fault tolerant (using inter-communicators rather than intra-communicators). The processes have a manager/worker arrangement, with the workers being single instances of a simulation, each with unique randomized inputs. These workers are created using MPI_Comm_spawn(). When a worker finishes, it must exit, and up to now I've been calling MPI_Finalize in the worker before exiting.
> From what you've said, this is incorrect because MPI_Finalize is collective over all of the processes in the job? Obviously my managers do not exit when the workers exit. Can you suggest any way to do cleanup before a worker exits, so that the job can continue?
I think the first question that should be answered is: what happens when you exit your signal handler? `MPI_Finalize` won't exit the process. I believe your process will simply try to continue from where it was interrupted, which is the same place where it segfaults -- or maybe the segfault is caused by your code continuing to run MPI operations after you have "Finalized" the state in the interrupt handler?
Since you are going to exit your process, why don't you simply exit? I am not sure whether you are allowed to exit in a signal handler -- curious to know 😊. But if exit works, you are leaving it to the kernel to clean up your process, which may be sufficient for your goal.
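On the question of what is allowed inside a signal handler: as far as I know, POSIX lists `_exit()` as async-signal-safe but not `exit()`, and MPI calls are not on that list either. One common way to sidestep the issue is to have the handler only set a flag and let the worker's normal code path do the cleanup and exit. Below is a minimal sketch of that pattern in plain C, not the code from this thread. The names `stop_requested`, `install_stop_handler`, and `run_worker` are hypothetical, and the MPI cleanup is only indicated by a comment.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler, polled by the worker loop (hypothetical name). */
static volatile sig_atomic_t stop_requested = 0;

/* The handler does nothing except record that a signal arrived. */
static void on_signal(int signo) {
    (void)signo;
    stop_requested = 1;
}

/* Install the handler for signo; returns 0 on success, -1 on failure. */
int install_stop_handler(int signo) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Hypothetical worker loop: runs until max_steps are done or a stop
 * is requested, and returns the number of steps completed. */
int run_worker(int max_steps) {
    assert(max_steps >= 0);
    int done = 0;
    while (done < max_steps && !stop_requested) {
        /* one unit of simulation work would go here */
        done++;
    }
    /* Any MPI cleanup and the final exit would happen after this
     * function returns, in normal (non-handler) context. */
    return done;
}
```

A worker would call `install_stop_handler(SIGTERM)` (or whichever signal is in use) once at startup, call `run_worker(...)`, and then clean up and exit from `main`. The trade-off is that the worker only reacts to the signal between work units, so a long-running step delays shutdown.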