[mpich-discuss] MPI process killed and SIGUSR1

Wesley Bland wbland at anl.gov
Tue Oct 28 15:59:39 CDT 2014


> On Oct 28, 2014, at 1:53 PM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
> 
> Hi Wesley,
> 
> Please check client.c. There is an explicit "assert" statement, which terminates the client program abnormally. The server is waiting in an MPI_Recv at that moment, and the receive never times out or completes.

You’re right. I had missed that.
> 
> Even if I replace the assert with sleep(100) and then manually kill the client process by process ID, the result is the same.
> 
> Could you please let me know whether there is any way for the server to detect that a client has terminated abnormally?
Support for fault tolerance (FT) with dynamic processes is somewhat unstable in MPICH right now, so it’s possible that it doesn’t work. It’s not something that’s been thoroughly tested on our end yet. All of the FT code is still very experimental and is actively being replaced. In fact, between the version that you’ve been using (3.0.4) and the current master branch, there have been some pretty major changes, so I can’t be sure what the behavior was at that point.
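
One application-level workaround, independent of MPICH's FT support, is for the server to avoid blocking indefinitely: start the receive with MPI_Irecv and poll it with MPI_Test against a deadline, treating a timeout as a hint that the client may be gone. Below is a minimal sketch of just the deadline loop in plain C. The names wait_with_timeout, poll_fn, and countdown_done are hypothetical, the callback only stands in for an MPI_Test wrapper, and a timeout merely suggests a dead client; it does not prove it.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime and nanosleep */
#endif
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns true once the pending operation has
 * completed. In an MPI server this could wrap MPI_Test on a request that
 * was started with MPI_Irecv (assumption, not shown here). */
typedef bool (*poll_fn)(void *ctx);

/* Current time in seconds on a monotonic clock. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Poll until the operation completes or timeout_sec elapses.
 * Returns 0 on completion and -1 on timeout. The callback is always
 * invoked at least once, even with a zero timeout. */
static int wait_with_timeout(poll_fn poll, void *ctx, double timeout_sec)
{
    const struct timespec pause = { 0, 1000000L }; /* 1 ms between polls */
    const double deadline = now_sec() + timeout_sec;

    for (;;) {
        if (poll(ctx))
            return 0;
        if (now_sec() >= deadline)
            return -1;
        nanosleep(&pause, NULL);
    }
}

/* Demo callback (illustrative only): ctx points to an int counter, and the
 * operation counts as complete once the counter has been polled down to 0. */
static bool countdown_done(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0)
        (*remaining)--;
    return *remaining == 0;
}
```

For example, with `int left = 3;`, the call `wait_with_timeout(countdown_done, &left, 1.0)` returns 0 on the third poll, while a timeout of 0.0 with work still pending returns -1 after a single poll. In a real server the timeout path would still have to deal with the outstanding request (for instance with MPI_Cancel), and how 3.0.4 behaves at that point is not something this sketch can promise.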

You’re welcome to try out the latest code from the nightly builds, but obviously those are pretty rough as well. Unfortunately, this is all pretty unexplored territory so I can’t make any guarantees about anything at this point, though I’m happy to hear about any problems you have in order to improve future versions.

Thanks,
Wesley
> 
> Thanks,
> Hirak
> 
> > Ok. I don't see where you're killing a process. 
> 
> 
> 
> > On Oct 27, 2014, at 10:51 PM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
> > 
> > Hi Wesley,
> >  
> > This is related to the FT problem.
> >  
> > Thanks,
> > Hirak
> >  
> >  
> > Is this still related to your FT problems or is this a separate problem?
> >  
> > Thanks,
> > Wesley
> >  
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss