[mpich-discuss] Fault tolerance of an MPI cluster after one node dies

Bland, Wesley B. wbland at anl.gov
Wed Dec 10 10:01:09 CST 2014


We’ve been working specifically on this. The latest MPICH alpha release adds an experimental implementation of the ULFM (User-Level Failure Mitigation) specification, and you’re welcome to try it out. If you’re not familiar with ULFM, I’d recommend reading the documentation at http://www.fault-tolerance.org/, where you can find tutorials along with the specification.

Thanks,
Wesley


> On Dec 10, 2014, at 6:31 AM, YANG Fan <iddmbr at gmail.com> wrote:
> 
> Hi,
> 
> Is it possible for an MPI distributed cluster to continue working if one node dies? I'm not sure if MPICH provides such functionality.
> 
> It seems that MPI_Comm_create requires all processes in the parent (superset) communicator to be alive. Setting an error handler and running with --disable-auto-cleanup does not avoid the issue either, as one process cannot call MPI_Finalize().
> 
> Thanks in advance!
> 
> Best Regards,
> Fan
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss