[mpich-discuss] "Graceful" recovery from segmentation fault

Mccall, Kurt E. (MSFC-EV41) kurt.e.mccall at nasa.gov
Thu Nov 5 00:42:46 CST 2020


I'm using MPICH 3.3.2, Torque 5.1.1.

I've discussed something like this with you before: is there any way to have MPICH shut down the
entire job after a seg fault, rather than letting it run forever?   You've said that what really needs to be done
is to locate and fix the bug causing it, which is right, but I'm worried that a non-terminating job would
confuse a user who encounters it if some new, unanticipated seg fault occurs in the future.

I know that I can't put MPI_Finalize() or any other non-async-signal-safe code in my signal handler.   It helps a little
to print a backtrace from the signal handler (I know, I/O is not safe there either).
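
For reference, here is a minimal sketch of a handler restricted to calls that POSIX lists as async-signal-safe. It is an illustration under assumptions, not the handler from my application: the names crash_handler and install_crash_handler are hypothetical. It reports the fault with write() and then lets the process die from SIGSEGV, so the rank at least does not keep running.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Runs on SIGSEGV. Uses only calls POSIX lists as async-signal-safe. */
void crash_handler(int sig)
{
    static const char msg[] = "fatal: caught SIGSEGV, terminating this rank\n";

    /* write() is async-signal-safe; printf() and MPI calls are not. */
    if (write(STDERR_FILENO, msg, sizeof msg - 1) < 0) {
        /* Nothing safe can be done about a failed write here. */
    }

    /* SA_RESETHAND has already restored the default action. SIGSEGV is
     * blocked while this handler runs, so the re-raised signal is delivered
     * when the handler returns and then terminates the process. */
    raise(sig);
}

/* Hypothetical helper: call once per rank, e.g. right after MPI_Init().
 * Returns 0 on success, -1 on failure (as sigaction() does). */
int install_crash_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = crash_handler;
    sa.sa_flags = SA_RESETHAND;   /* one-shot: default action afterwards */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Dying from the signal, rather than returning normally or calling exit(), keeps the abnormal termination visible to whatever launched the process. Whether the launcher then tears down the remaining ranks is a separate matter that this sketch does not address.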

Is there anything else that can be done?   Here are the MPICH error messages written some minutes after the
seg fault occurs, when the job is ending and all of the good processes are calling MPI_Finalize().

Thanks, Kurt

[proxy:0:0 at n001.cluster.com] [proxy:1:0 at n001.cluster.com] [proxy:3:0 at n001.cluster.com] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:4:0 at n001.cluster.com] assert (!closed) failed
assert (!closed) failed
[proxy:0:0 at n001.cluster.com] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): [proxy:1:0 at n001.cluster.com] [proxy:3:0 at n001.cluster.com] [proxy:5:0 at n001.cluster.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): assert (!closed) failed
HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): [proxy:6:0 at n001.cluster.com] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): callback returned error status
[proxy:4:0 at n001.cluster.com] [proxy:8:0 at n001.cluster.com] callback returned error status
callback returned error status
[proxy:9:0 at n001.cluster.com] assert (!closed) failed
HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): [proxy:0:0 at n001.cluster.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): [proxy:1:0 at n001.cluster.com] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): [proxy:3:0 at n001.cluster.com] [proxy:10:0 at n001.cluster.com] [proxy:5:0 at n001.cluster.com] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
main (pm/pmiserv/pmip.c:200): callback returned error status
main (pm/pmiserv/pmip.c:200): assert (!closed) failed
main (pm/pmiserv/pmip.c:200): HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:6:0 at n001.cluster.com] demux engine error waiting for event
[proxy:4:0 at n001.cluster.com] demux engine error waiting for event
[proxy:8:0 at n001.cluster.com] demux engine error waiting for event
callback returned error status
assert (!closed) failed
[proxy:9:0 at n001.cluster.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): main (pm/pmiserv/pmip.c:200): HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): [proxy:5:0 at n001.cluster.com] callback returned error status
[proxy:10:0 at n001.cluster.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
demux engine error waiting for event
main (pm/pmiserv/pmip.c:200): [proxy:8:0 at n001.cluster.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): [proxy:7:0 at n001.cluster.com] callback returned error status
[proxy:6:0 at n001.cluster.com] demux engine error waiting for event
main (pm/pmiserv/pmip.c:200): callback returned error status
[proxy:9:0 at n001.cluster.com] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[proxy:10:0 at n001.cluster.com] main (pm/pmiserv/pmip.c:200): assert (!closed) failed
demux engine error waiting for event
main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[proxy:7:0 at n001.cluster.com] demux engine error waiting for event
HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:7:0 at n001.cluster.com] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event