[mpich-discuss] Question regarding fault tolerance difference 3.1 & 3.4.2
Zhou, Hui
zhouh at anl.gov
Wed Dec 15 10:32:01 CST 2021
Hi Anatoly,
I believe failure behavior has not yet been standardized in MPI, so what happens after a process fails is effectively undefined behavior.
I suspect the different outcomes with MPICH 3.1 and 3.4.2 are from changes in the process manager. Could you try using hydra from 3.1 in your test with 3.4.2?
--
Hui Zhou
________________________________
From: Anatoly G via discuss <discuss at mpich.org>
Sent: Monday, December 13, 2021 7:18 AM
To: mpich-discuss at mcs.anl.gov <mpich-discuss at mcs.anl.gov>
Cc: Anatoly G <anatolyrishon at gmail.com>
Subject: [mpich-discuss] Question regarding fault tolerance difference 3.1 & 3.4.2
Hi MPICH,
I have a small program that produces different output on MPICH 3.1 and MPICH 3.4.2.
The program (code attached):
1. Master executes ping-pong with each one of the slaves (separately)
2. Each slave replies to master when it gets a message from master
3. One of the slaves simulates a failure by calling abort().
4. Master recognizes that the slave has failed and continues to work with the surviving slaves only.
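
The attached program is not preserved in this archive. As a rough illustration of steps 1-4 only, here is a minimal sketch. It assumes the master sets MPI_ERRORS_RETURN on MPI_COMM_WORLD and treats any non-success return code as a slave failure. The names run_ping_pong, ping_fn, sim_ctx, simulated_ping, mpi_ping and the USE_MPI guard are hypothetical and not taken from the original code. Without -DUSE_MPI only the bookkeeping and a simulated transport are compiled, so the drop-and-continue logic can be checked without an MPI installation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical transport hook: one ping-pong round trip with `rank`.
 * Returns 0 on success, nonzero if the peer appears to have failed. */
typedef int (*ping_fn)(int rank, int iter, void *ctx);

/* Master loop: ping every live slave in each iteration and drop any slave
 * whose round trip fails. Stops early if no slave answered in an iteration.
 * Returns the number of iterations in which at least one slave answered. */
int run_ping_pong(bool *alive, int nprocs, int iters, ping_fn ping, void *ctx)
{
    assert(nprocs >= 1 && iters >= 0);
    int done = 0;
    for (int it = 0; it < iters; it++) {
        int survivors = 0;
        for (int r = 1; r < nprocs; r++) {
            if (!alive[r])
                continue;
            if (ping(r, it, ctx) != 0) {
                alive[r] = false;
                fprintf(stderr, "iteration %d: slave %d failed, dropping it\n", it, r);
            } else {
                survivors++;
            }
        }
        if (survivors == 0)
            break;
        done++;
    }
    return done;
}

/* Simulated transport for trying the logic without MPI:
 * `fail_rank` stops answering from iteration `fail_iter` onwards. */
struct sim_ctx { int fail_rank; int fail_iter; };

int simulated_ping(int rank, int iter, void *ctx)
{
    const struct sim_ctx *s = ctx;
    return (rank == s->fail_rank && iter >= s->fail_iter) ? 1 : 0;
}

#ifdef USE_MPI
#include <mpi.h>

/* MPI transport: relies on MPI_ERRORS_RETURN so that a failed peer shows up
 * as an error code instead of aborting the master. Whether and when an error
 * is actually reported is implementation-dependent. */
int mpi_ping(int rank, int iter, void *ctx)
{
    (void)ctx;
    int reply = -1;
    if (MPI_Send(&iter, 1, MPI_INT, rank, 0, MPI_COMM_WORLD) != MPI_SUCCESS)
        return 1;
    if (MPI_Recv(&reply, 1, MPI_INT, rank, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE) != MPI_SUCCESS)
        return 1;
    return 0;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (rank == 0) {
        bool *alive = malloc((size_t)nprocs * sizeof *alive);
        if (alive == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);
        for (int r = 0; r < nprocs; r++)
            alive[r] = true;
        int done = run_ping_pong(alive, nprocs, 20, mpi_ping, NULL);
        printf("master completed %d iterations\n", done);
        free(alive);
    } else {
        for (int it = 0; it < 20; it++) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (rank == 1 && it == 5)
                abort();            /* slave 1 simulates a failure */
            MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    /* Behavior of MPI_Finalize after a peer failure is not standardized. */
    MPI_Finalize();
    return 0;
}
#endif
```

Built with an MPI compiler wrapper and -DUSE_MPI, this would be launched the same way as the real program. The iteration at which slave 1 aborts (5) is an arbitrary choice for the sketch.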
I run my program on Ubuntu 18.
I use TCP as a transport layer.
Execute command:
mpiexec.hydra -genvall -disable-auto-cleanup -f MpiConfigMachines.txt -launcher=ssh -n 3 ft_ping_pong
I expect the ping-pong to continue until all 20 iterations have finished.
With MPICH 3.1 hydra it works as expected: only slave 1 fails, and the ping-pong continues between master and slave 2.
With MPICH 3.4.2 hydra, the slave 2 process fails together with slave 1.
MPICH 3.1 configuration:
./configure --prefix="my directory" --enable-error-checking=runtime --enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC FFLAGS=-fpic --enable-threads=runtime --enable-totalview --enable-static --disable-f77 --disable-fc
MPICH 3.4.2 configuration:
./configure --prefix="my directory" --enable-error-checking=all --enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC FFLAGS=-fpic --enable-threads=runtime --enable-totalview --enable-static --disable-f77 --disable-fc --disable-fortran --with-device=ch3:nemesis --enable-error-messages=all
Should I use another device, or was the behavior changed between versions?
Regards,
Anatoly.