[mpich-discuss] Mpich + Siesta error

Jeff Hammond jeff.science at gmail.com
Fri Dec 6 19:09:36 CST 2013


I bet you (1) Siesta doesn't check MPI return codes and (2) Siesta has no way to handle node failure. I bet it can't even handle malloc returning NULL. 

If a node fails more than once a month, the hardware is bad and you should buy new stuff. 

Jeff

> On Dec 6, 2013, at 7:04 PM, Julio Henrique <juliohenrique at msn.com> wrote:
> 
> 
> Okay Pavan, I'll try that and report back with the result.
> Thanks.
> Julio.
>
> > From: balaji at mcs.anl.gov
> > Date: Fri, 6 Dec 2013 19:01:08 -0600
> > To: discuss at mpich.org
> > Subject: Re: [mpich-discuss] Mpich + Siesta error
> > 
> > Hi Julio,
> > 
> > There are two steps needed for this:
> > 
> > 1. You need to tell your MPI application to return errors instead of aborting.
> > 
> > 2. Tell the process manager to not clean up your remaining processes when one of the processes dies.
> > 
> > Details on both these steps are listed in the "Fault Tolerance" section of the MPICH README. Please try it out and let us know how it goes.
> > 
> > — Pavan
> > 
> > On Dec 6, 2013, at 6:54 PM, Julio Henrique <juliohenrique at msn.com> wrote:
> > 
> > > 
> > > I am using mpich-3.0.4 on a cluster with 7 nodes, running the latest version of Siesta. My problem is that when one node goes down, Siesta and MPICH stop running and report an error.
> > > How can I make Siesta and MPICH keep running when a node fails?
> > > Thanks.
> > > Julio.
> > > 
> > > 
> > > 
> > > _______________________________________________
> > > discuss mailing list discuss at mpich.org
> > > To manage subscription options or unsubscribe:
> > > https://lists.mpich.org/mailman/listinfo/discuss
> > 
> > --
> > Pavan Balaji
> > http://www.mcs.anl.gov/~balaji
> > 