[mpich-discuss] Fault tolerance MPICH2 hydra vs MPICH3.0.4 hydra

Anatoly G anatolyrishon at gmail.com
Sun May 18 05:18:56 CDT 2014


Dear MPICH2,
Can you please help me understand fault tolerance in MPICH 3.0.4?
I have a simple MPI program:
Master: calls MPI_Irecv + MPI_Wait in a loop.
Single slave: calls MPI_Send 5 times, then calls abort.

When I run the program with the MPICH2 hydra, the master process prints a
message about the slave's failure multiple times. With the MPICH3 hydra I get
a single message about the slave's failure, and then the master process waits
forever for the next Irecv to complete.
In both cases I compiled the program with MPICH 3.0.4.

In other words, with the MPICH2 hydra each Irecv completes (even if the slave
died before the Irecv was posted), but with the MPICH3 hydra it does not, so
the wait on MPI_Irecv never returns.

If I compile the same program with MPICH2 and use the MPICH2 hydra, I get the
same result as compiling with MPICH 3.0.4 and running with the MPICH2 hydra.

Execution command:
mpiexec.hydra -genvall -disable-auto-cleanup -f MpiConfigMachines1.txt \
    -launcher=rsh -n 2 mpi_irecv_ft_simple


Both hydras were configured with:
  $ ./configure --prefix=/space/local/mpich-3.0.4/ \
      --enable-error-checking=runtime --enable-g=dbg \
      CFLAGS=-fPIC CXXFLAGS=-fPIC FFLAGS=-fpic \
      --enable-threads=runtime --enable-totalview --enable-static \
      --disable-f77 --disable-fc --no-recursion

  $ ./configure --prefix=/space/local/mpich2-1.5b2/ \
      --enable-error-checking=runtime --enable-g=dbg \
      CFLAGS=-fPIC CXXFLAGS=-fPIC FFLAGS=-fpic \
      --enable-threads=runtime --enable-totalview --enable-static \
      --disable-f77 --disable-fc

Can you please advise?

Regards,
Anatoly.
-------------- next part --------------
172.19.54.37:3


-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi_irecv_ft_simple.cpp
Type: text/x-c++src
Size: 1347 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140518/a8f32a32/attachment.bin>

