[mpich-discuss] Abnormal termination on Linux
zhouh at anl.gov
Tue Apr 7 17:35:19 CDT 2020
> I'm using the methods in the paper https://www.mcs.anl.gov/~lusk/papers/fault-tolerance.pdf to make my Monte Carlo jobs fault tolerant (using inter-communicators rather than intra-communicators). The processes have a manager/worker arrangement, with the workers being single instances of a simulation, each with unique randomized inputs. These workers are created using MPI_Comm_spawn(). When a worker finishes, it must exit, and up to now I've been calling MPI_Finalize in the worker before exiting.
> From what you've said, this is incorrect because MPI_Finalize is collective over all of the processes in the job? Obviously my managers do not exit when the workers exit. Can you suggest any way to do cleanup before a worker exits, so that the job can continue?
I think the first question to answer is: what happens when you return from your signal handler? `MPI_Finalize` won't exit the process. I believe your process will simply try to continue from where it was interrupted, which is the same place where it segfaulted. Or maybe the segfault is caused by your code continuing to run MPI operations after you have "finalized" the state in the signal handler?
Since you are going to exit your process anyway, why not simply exit? I am not sure whether you are allowed to exit in a signal handler -- curious to know 😊. But if exit works, you are leaving it to the kernel to clean up your process, which may be sufficient for your goal.