[mpich-discuss] Crash on one node causes termination of the whole program

Antonio J. Peña antonio.pena at bsc.es
Fri Jan 29 07:40:56 CST 2016


Hi Andris,

Unfortunately, the MPI standard currently leaves the behavior after a 
process failure largely up to the implementation.

With the current implementation, you can pass the flag 
"-disable-auto-cleanup" to mpiexec (Hydra) to prevent it from killing 
the surviving processes. Bear in mind, though, that you should no longer 
use any communicator that included the failed process. So what you 
describe is possible to some extent if you use MPI_Comm_spawn to create 
the slaves and set the error handler MPI_ERRORS_RETURN on the resulting 
intercommunicator.

Again, this is something not defined in the standard, and hence there's 
no guarantee it'll work in all conditions and/or MPI implementations.

Alternatively, take a look at the ULFM (User-Level Failure Mitigation) 
proposal, which targets this kind of fault tolerance.

I hope this helps.

Best,
   Antonio


On 29/01/16 14:29, Andris wrote:
> Hello!
> I'm a newbie in the MPI world.
> I have a question about program execution.
> I built a program (for example, exp1) using mpicc and ran it on 
> multiple hosts:
> mpiexec -f my_hosts -n 8 ./exp1
> Four exp1 processes run on host A (ranks 0-3) and four on host B (ranks 4-7).
> If one of them crashes, all the others terminate too. mpiexec prints:
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 5774 RUNNING AT slave1
> =   EXIT CODE: 11
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
>
> If I understand correctly, my code can only handle errors from MPI functions.
> In my project, if one process terminates, I need all the other processes 
> to keep running. For example, if the slave node loses power, the processes 
> on the master node should keep running. The master node should learn that 
> the processes on the slave node have terminated and, after some time, 
> restart them on the slave node.
> Is this possible? If so, how?
>
> Big thanks!
>
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss

-- 
Antonio J. Peña, PhD
Senior Researcher
Barcelona Supercomputing Center
http://www.bsc.es/about-bsc/staff-directory/pena-antonio


