[mpich-discuss] Abnormal termination on Linux

Zhou, Hui zhouh at anl.gov
Tue Apr 7 17:35:19 CDT 2020

>    I'm using the methods in the paper https://www.mcs.anl.gov/~lusk/papers/fault-tolerance.pdf to make my Monte Carlo jobs fault tolerant (using inter-communicators rather than intra-communicators). The processes have a manager/worker arrangement, with the workers being single instances of a simulation, each with unique randomized inputs. These workers are created using MPI_Comm_spawn(). When a worker finishes, it must exit, and up to now I've been calling MPI_Finalize in the worker before exiting.

>   From what you've said, this is incorrect because MPI_Finalize is collective over all of the processes in the job? Obviously my managers do not exit when the workers exit. Can you suggest any way to do cleanup before a worker exits, so that the job can continue?

I think the first question to answer is: what happens when you return from your signal handler? `MPI_Finalize` does not exit the process. I believe your process will simply try to continue from where it was interrupted, which is the same place where it segfaulted. Or maybe the segfault is caused by your code continuing to run MPI operations after you have "finalized" the state in the signal handler?

Since you are going to exit your process, why don't you simply exit? I am not sure whether you are allowed to exit in a signal handler -- curious to know 😊. But if exit works, you are leaving it to the kernel to clean up your process, which may be sufficient for your goal.
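
On the signal handler question: POSIX lists `_exit()` among the async-signal-safe functions, while `exit()` is not on that list (it runs `atexit` handlers and flushes stdio buffers). Below is a minimal sketch of the "simply exit" idea with no MPI calls at all. The names `crash_handler`, `run_crashing_worker`, and `WORKER_CRASH_CODE` are made up for illustration, and a forked child stands in for a spawned worker. How the MPI manager reacts to a worker that disappears this way is not covered here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical exit code used to mark "worker died in the handler". */
#define WORKER_CRASH_CODE 42

/* Handler that terminates the process immediately. _exit() is
 * async-signal-safe; exit() and MPI calls are not assumed to be. */
static void crash_handler(int signo)
{
    (void)signo;
    _exit(WORKER_CRASH_CODE);
}

/* Fork a stand-in "worker", deliver SIGSEGV to it, and return the
 * exit status the parent observes (or -1 on error). */
int run_crashing_worker(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: install the handler, then simulate the fault.
         * raise() is used instead of a real invalid access to
         * keep the sketch free of undefined behavior. */
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = crash_handler;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGSEGV, &sa, NULL) != 0)
            _exit(1);
        raise(SIGSEGV);
        _exit(2); /* not reached: the handler never returns */
    }

    /* Parent: the kernel has cleaned up the child; just reap it. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `run_crashing_worker()` from a small `main` and checking the return value shows the parent seeing a normal exit with the chosen code rather than a signal death. Because the handler never returns, the process does not resume at the faulting instruction.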

Hui Zhou