[mpich-discuss] Question regarding fault tolerance difference 3.1 & 3.4.2

Zhou, Hui zhouh at anl.gov
Wed Dec 15 10:32:01 CST 2021


Hi Anatoly,

I think behavior after a process failure has not yet been standardized by MPI, so what you are seeing falls into the undefined-behavior category.

I suspect the different outcomes with MPICH 3.1 and 3.4.2 come from changes in the process manager (hydra).  Could you try using hydra from 3.1 in your test with 3.4.2?

--
Hui Zhou
________________________________
From: Anatoly G via discuss <discuss at mpich.org>
Sent: Monday, December 13, 2021 7:18 AM
To: mpich-discuss at mcs.anl.gov <mpich-discuss at mcs.anl.gov>
Cc: Anatoly G <anatolyrishon at gmail.com>
Subject: [mpich-discuss] Question regarding fault tolerance difference 3.1 & 3.4.2

Hi MPICH,
I have a small program that produces different output on MPICH 3.1 and MPICH 3.4.2.
The program (code attached):

  1.  Master executes ping-pong with each of the slaves (separately).
  2.  Each slave replies to master when it gets a message from master.
  3.  One of the slaves simulates a failure by calling abort().
  4.  Master recognizes that the slave has failed and continues to work with the surviving slaves only.
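
The attached code is not reproduced here, so the following is only a minimal sketch of the master-side bookkeeping that steps 1-4 describe, not the original program. It uses no MPI calls: the names run_master, exchange_fn and simulated_exchange, the two-slave setup (called workers in the code), and the failure at iteration 5 are all hypothetical. In a real MPI program the exchange callback would presumably wrap an MPI_Send/MPI_Recv pair on a communicator whose error handler is MPI_ERRORS_RETURN, so that a peer failure is reported as a return code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_WORKERS 2       /* hypothetical: ranks 1..NUM_WORKERS, rank 0 is master */
#define NUM_ITERATIONS 20   /* the 20 ping-pong iterations from the test */

/* Hypothetical stand-in for one ping-pong round trip with a worker.
 * Returns 0 on success, nonzero if the peer could not be reached.
 * Assumption: with real MPI this would wrap MPI_Send/MPI_Recv and
 * return their error code (requires MPI_ERRORS_RETURN). */
typedef int (*exchange_fn)(int worker, int iteration);

/* Master loop: ping every live worker once per iteration.
 * alive[w] is cleared the first time worker w fails; a failed worker
 * is never contacted again. Returns the number of successful round trips. */
int run_master(exchange_fn exchange, bool alive[NUM_WORKERS + 1])
{
    assert(exchange != NULL);

    int completed = 0;
    for (int w = 1; w <= NUM_WORKERS; ++w)
        alive[w] = true;

    for (int it = 0; it < NUM_ITERATIONS; ++it) {
        for (int w = 1; w <= NUM_WORKERS; ++w) {
            if (!alive[w])
                continue;           /* step 4: skip workers already marked failed */
            if (exchange(w, it) != 0) {
                alive[w] = false;   /* step 4: stop working with this worker */
                fprintf(stderr, "worker %d failed at iteration %d\n", w, it);
            } else {
                ++completed;        /* steps 1-2: ping sent, pong received */
            }
        }
    }
    return completed;
}

/* Simulated transport: worker 1 "aborts" at iteration 5 (step 3);
 * worker 2 always answers. */
int simulated_exchange(int worker, int iteration)
{
    return (worker == 1 && iteration >= 5) ? 1 : 0;
}
```

With this simulated transport, run_master(simulated_exchange, alive) completes 25 round trips (5 with worker 1 before it fails, 20 with worker 2), which is the outcome I expect from the real program.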

I execute my program on ubuntu18.
I use TCP as a transport layer.
Execute command:
mpiexec.hydra -genvall -disable-auto-cleanup -f MpiConfigMachines.txt -launcher=ssh -n 3 ft_ping_pong

I expect the ping-pong to continue until all 20 iterations have finished.
MPICH 3.1 hydra: works as expected; only slave 1 fails and ping-pong continues between master and slave 2.
MPICH 3.4.2 hydra: the slave 2 process fails together with slave 1.

MPICH 3.1 configuration:
./configure --prefix="my directory" --enable-error-checking=runtime --enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC FFLAGS=-fpic --enable-threads=runtime --enable-totalview --enable-static --disable-f77 --disable-fc

MPICH 3.4.2 configuration:
./configure --prefix="my directory" --enable-error-checking=all --enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC FFLAGS=-fpic --enable-threads=runtime --enable-totalview --enable-static --disable-f77 --disable-fc --disable-fortran --with-device=ch3:nemesis --enable-error-messages=all

Should I use another device, or was the behavior changed between versions?

Regards,
Anatoly.


