[mpich-discuss] "Graceful" recovery from segmentation fault

Balaji, Pavan balaji at anl.gov
Thu Nov 5 07:40:48 CST 2020


Hi Kurt,

MPICH should gracefully clean up all of the remaining processes when one process dies.  If it is not doing that, then it might be a bug.  But before we go digging into it, can you try the “mpiexec” from the latest MPICH version?  There have been a few bug fixes.  In fact, we just fixed one more bug a couple of days ago (https://github.com/pmodels/mpich/pull/4862/commits/7f362d74ee), although I think this last bug is probably unrelated to the error that you are seeing.
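
If it helps, a trivial reproducer along these lines (one rank deliberately dereferences NULL while the others idle; the names and the sleep time are just illustrative) should show whether the newer mpiexec tears the whole job down once a process dies:

    #include <mpi.h>
    #include <stddef.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            /* deliberately crash one process with a segfault */
            volatile int *p = NULL;
            *p = 42;
        }

        /* the surviving ranks just idle, so you can watch whether
           mpiexec terminates the whole job after rank 1 dies */
        sleep(60);

        MPI_Finalize();
        return 0;
    }

Launch it with a few ranks; the expected behavior is that the entire job exits shortly after rank 1 segfaults.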

  — Pavan

> On Nov 5, 2020, at 12:42 AM, Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org> wrote:
> 
> I’m using MPICH 3.3.2, Torque 5.1.1.
>  
> I’ve discussed something like this with you guys before – is there any way to have MPICH shut down the
> entire job after a seg fault, rather than letting it run forever?   You’ve said that what really needs to be done
> is to locate and fix the bug that causes the fault, which is right, but I’m worried that a non-terminating job
> would confuse a user who encounters it if some new, unanticipated seg fault turns up in the future.
>  
> I know that I can’t put MPI_Finalize() or any other non-async-signal-safe code in my signal handler.   It helps
> a little to print a backtrace from the signal handler (I know, I/O isn’t async-signal-safe either).
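>  
> For concreteness, here is a rough sketch of the kind of handler I mean (it uses glibc’s backtrace() and
> backtrace_symbols_fd(); the names and buffer size are just illustrative, and as noted it isn’t strictly
> async-signal-safe):
>  
>     #include <execinfo.h>
>     #include <signal.h>
>     #include <string.h>
>     #include <unistd.h>
>  
>     static void segv_handler(int sig)
>     {
>         void *frames[64];
>         int n = backtrace(frames, 64);
>         /* writes straight to the file descriptor, so no stdio buffering;
>            still not formally async-signal-safe, as mentioned above */
>         backtrace_symbols_fd(frames, n, STDERR_FILENO);
>  
>         /* restore the default action and re-raise so the process really
>            dies and mpiexec can notice the failure */
>         signal(sig, SIG_DFL);
>         raise(sig);
>     }
>  
>     int main(int argc, char **argv)
>     {
>         struct sigaction sa;
>         memset(&sa, 0, sizeof(sa));
>         sa.sa_handler = segv_handler;
>         sigaction(SIGSEGV, &sa, NULL);
>         /* ... MPI_Init(), application code, MPI_Finalize() ... */
>         return 0;
>     }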
>  
> Is there anything else that can be done?   Here are the MPICH error messages, written some minutes after the
> seg fault occurs, as the job is ending and all of the surviving processes are calling MPI_Finalize(). 
>  
> Thanks, Kurt
>  
> [Interleaved stderr from proxies 0:0, 1:0, 3:0, 4:0, 5:0, 6:0, 7:0, 8:0, 9:0 and 10:0, all on n001.cluster.com.
>  Each proxy reports the same error chain:]
>  
> [proxy:N:0 at n001.cluster.com] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
> [proxy:N:0 at n001.cluster.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
> [proxy:N:0 at n001.cluster.com] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss


