[mpich-discuss] Question regarding fault tolerance difference 3.1 & 3.4.2

Anatoly G anatolyrishon at gmail.com
Mon Dec 13 07:18:31 CST 2021


Hi MPICH,
I have a small program that produces different output under MPICH 3.1 and
MPICH 3.4.2.
The program (code attached) does the following:

   1. The master executes a ping-pong with each of the slaves (separately).
   2. Each slave replies to the master when it gets a message from it.
   3. One of the slaves simulates a failure by calling abort().
   4. The master recognizes that the slave has failed and continues to work
   with the surviving slaves only.
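
To make the intended master-side logic concrete, here is a minimal sketch of
the survivor bookkeeping from steps 1 to 4. It is not the attached
ft_ping_pong.cpp. The names PingPong, Result and run_master are hypothetical,
and the MPI transport is replaced by a callback so that the sketch compiles
without MPI. In the real program the callback would presumably wrap
MPI_Send/MPI_Recv on a communicator whose error handler is set to
MPI_ERRORS_RETURN, and report whether the calls returned MPI_SUCCESS.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <vector>

// Hypothetical stand-in for one ping-pong exchange with a slave rank.
// Returns true on success and false if the peer could not be reached.
// Assumption: with real MPI this would wrap MPI_Send/MPI_Recv and compare
// the return codes against MPI_SUCCESS.
using PingPong = std::function<bool(int rank, int iteration)>;

struct Result {
    std::set<int> survivors;     // slave ranks still answering at the end
    std::vector<int> completed;  // completed exchanges per rank (index = rank)
};

// Master loop: ping every surviving slave once per iteration and drop a
// slave from the set as soon as an exchange with it fails.
Result run_master(int num_slaves, int iterations, const PingPong& ping_pong) {
    assert(num_slaves >= 0 && iterations >= 0);
    Result r;
    r.completed.assign(num_slaves + 1, 0);  // index 0 (the master) is unused
    for (int rank = 1; rank <= num_slaves; ++rank) {
        r.survivors.insert(rank);
    }

    for (int it = 0; it < iterations && !r.survivors.empty(); ++it) {
        for (auto p = r.survivors.begin(); p != r.survivors.end();) {
            if (ping_pong(*p, it)) {
                ++r.completed[*p];
                ++p;
            } else {
                p = r.survivors.erase(p);  // failed slave: forget it, go on
            }
        }
    }
    return r;
}
```

With two slaves, 20 iterations, and a callback that starts failing for rank 1
at some iteration, run_master leaves rank 2 as the only survivor with all 20
exchanges completed. That is the behavior I see with MPICH 3.1.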

*I run my program on Ubuntu 18.*
I use TCP as the transport layer.
*Execute command*:
mpiexec.hydra -genvall -disable-auto-cleanup -f MpiConfigMachines.txt
-launcher=ssh -n 3 ft_ping_pong

I expect the ping-pong to continue until all 20 iterations have finished.
With MPICH 3.1 hydra it works as expected: only slave 1 fails, and the
ping-pong continues between the master and slave 2.
With MPICH 3.4.2 hydra, slave 2 fails together with slave 1.

*MPICH 3.1 configuration:*
./configure --prefix="my directory" --enable-error-checking=runtime
--enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC FFLAGS=-fpic
--enable-threads=runtime --enable-totalview --enable-static --disable-f77
--disable-fc

*MPICH 3.4.2 configuration:*
$ ./configure --prefix="my directory" --enable-error-checking=all
--enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC FFLAGS=-fpic
--enable-threads=runtime --enable-totalview --enable-static --disable-f77
--disable-fc --disable-fortran *--with-device=ch3:nemesis*
--enable-error-messages=all

Should I use another device, or was this behavior changed between versions?

Regards,
Anatoly.
-------------- next part --------------
172.19.54.59:1
172.19.54.193:100
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ft_ping_pong.cpp
Type: application/octet-stream
Size: 2417 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20211213/6e7c9846/attachment.obj>

