[mpich-discuss] Fault tolerance MPICH2 hydra vs MPICH3.0.4 hydra

Wesley Bland wbland at anl.gov
Sun May 18 12:40:15 CDT 2014


Hi Anatoly,

I think the problem may be the way that you're aborting. MPICH catches the
system abort call and kills the entire application when it's called.
Instead, I suggest using MPI_Abort(MPI_COMM_WORLD, 1). That's what I use in
my tests, and it works fine. It also seemed to work for your code when I
tried it. I'll attach my modified version of your code. I switched it to C
since I happened to have C++ support disabled on my local install, but that
shouldn't change anything.

Thanks,
Wesley


On Sun, May 18, 2014 at 5:18 AM, Anatoly G <anatolyrishon at gmail.com> wrote:

>  Dear MPICH2,
> Can you please help me understand fault tolerance in MPICH3.0.4?
> I have a simple MPI program:
> Master: calls MPI_Irecv + MPI_Wait in a loop.
> Single slave: calls MPI_Send 5 times, then calls abort.
>
>  When I run the program with MPICH2 hydra, the master process prints a
> message about the slave's failure multiple times. With MPICH3 hydra, I get a
> single message about the slave's failure, and then the master process waits
> endlessly for the next Irecv to complete.
>  In both cases I compiled the program with MPICH3.0.4.
>
>  In other words, with MPICH2 hydra each Irecv completes (even if the slave
> died before the Irecv was executed), but with MPICH3 hydra it does not. This
> causes an endless wait on MPI_Irecv.
>
>  If I compile the same program with MPICH2 and use MPICH2 hydra, I get the
> same result as compiling with MPICH3.0.4 and running with MPICH2 hydra.
>
>  Execution command:
>  mpiexec.hydra -genvall -disable-auto-cleanup -f MpiConfigMachines1.txt
> -launcher=rsh -n 2 mpi_irecv_ft_simple
>
>
>  Both hydras were configured with:
>   $ ./configure --prefix=/space/local/mpich-3.0.4/
> --enable-error-checking=runtime --enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC
> FFLAGS=-fpic --enable-threads=runtime --enable-totalview --enable-static
> --disable-f77 --disable-fc --no-recursion
>
>    $ ./configure --prefix=/space/local/mpich2-1.5b2/
> --enable-error-checking=runtime --enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC
> FFLAGS=-fpic --enable-threads=runtime --enable-totalview --enable-static
> --disable-f77 --disable-fc
>
>  Can you advise, please?
>
>  Regards,
> Anatoly.
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi_irecv_ft_simple.c
Type: text/x-csrc
Size: 1406 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140518/fc0b9cd2/attachment.bin>

