[mpich-discuss] Fault tolerance after MPI_Comm_connect/accept

Pavan Balaji balaji at mcs.anl.gov
Tue Mar 5 10:01:53 CST 2013


On 03/05/2013 09:33 AM US Central Time, Jim Dinan wrote:
> I created an MPI Forum ticket for this:
> 
> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/365

Jim: iaccept was part of Josh's original MPI-3 proposal, but was voted
down by the Forum because it's collective over the communicator (though
that's true for all nonblocking collectives).  I myself pointed out
several use cases for iaccept (one of which was VOCL, where we have to
use a separate thread for comm_accept).
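
For reference, here is a minimal sketch of that separate-thread
pattern (my own illustration, not VOCL's actual code; it assumes the
implementation provides MPI_THREAD_MULTIPLE):

    #include <mpi.h>
    #include <pthread.h>
    #include <stdio.h>

    static char port_name[MPI_MAX_PORT_NAME];
    static MPI_Comm client_comm = MPI_COMM_NULL;

    /* MPI_Comm_accept blocks until a client connects, so run it in
     * its own thread to keep the main thread free. */
    static void *accept_thread(void *arg)
    {
        (void)arg;
        MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                        &client_comm);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided;
        /* MPI_THREAD_MULTIPLE is needed if other threads make MPI
         * calls while the accept is pending. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        MPI_Open_port(MPI_INFO_NULL, port_name);
        printf("listening on: %s\n", port_name);

        pthread_t tid;
        pthread_create(&tid, NULL, accept_thread, NULL);

        /* ... main thread does useful work here ... */

        pthread_join(tid, NULL);
        MPI_Close_port(port_name);
        MPI_Finalize();
        return 0;
    }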

It might be worthwhile revisiting it for MPI-3.1 or MPI-4.

> In terms of what is guaranteed by the standard, the behavior is
> undefined.  In terms of what MPICH will do, I am not sure, although my
> guess is that current MPICH will be unable to continue working after
> such a failure.  You may need to do some testing or read the code to
> find out.

FWIW, the default behavior in mpich is to clean up all processes if
something goes wrong.  However, you can pass -disable-auto-cleanup to
mpiexec to disable this behavior.  In that case, if one process dies,
the remaining processes are left alone and will return an error on
communication -- you'll obviously need to check the return values of
the MPI functions to detect such failures.
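
As an illustration (a sketch of my own, not tested code), you'd switch
the error handler to MPI_ERRORS_RETURN -- the default
MPI_ERRORS_ARE_FATAL aborts instead of returning -- and then check
return codes:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 42, rc = MPI_SUCCESS;

        MPI_Init(&argc, &argv);
        /* The default handler aborts on error; ask for error codes
         * instead so failures can be detected and handled. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            rc = MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            rc = MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                          MPI_STATUS_IGNORE);

        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "rank %d: error: %s\n", rank, msg);
            /* peer may have died; recover or shut down cleanly */
        }

        MPI_Finalize();
        return 0;
    }

and launch with something like

    mpiexec -n 2 -disable-auto-cleanup ./a.out

Whether a given call returns the error (rather than a later one)
depends on when the failure is actually detected.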

One small gotcha is the interaction of such failures with MPI-3
nonblocking collectives.  Currently, we don't have a way to get these
two working together, though that's planned to be fixed in mpich-3.1,
which will be released in 2014 (preview releases will show up later
this year).

You can set the environment variable MPICH_ENABLE_COLL_FT_RET=1 to
tell MPICH not to hang, and instead return an error, if a process
fails during a collective operation.  But doing this will cause
nonblocking collectives to break.
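
For example, with the Hydra mpiexec (where -genv passes an environment
variable to all launched processes; ./myapp is a placeholder for your
application):

    mpiexec -n 4 -disable-auto-cleanup -genv MPICH_ENABLE_COLL_FT_RET 1 ./myapp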

 -- Pavan

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji


