[mpich-discuss] Fault tolerance MPICH2 hydra vs MPICH3.0.4 hydra

Anatoly G anatolyrishon at gmail.com
Wed May 21 08:40:27 CDT 2014


Thank you, Pavan.
I tried MPICH 3.1, and fault tolerance works well with it.
One more question:
If I run my simulation (from my previous mail) with MPICH 3.1 configured with
--with-device=ch3:sock,
fault tolerance is unstable: sometimes the whole system crashes, and
sometimes it does not.
If I use the default configuration flags (without --with-device=ch3:sock), the
whole system is stable.

Is this expected behavior?

Regards,
Anatoly.



On Mon, May 19, 2014 at 5:03 PM, Balaji, Pavan <balaji at anl.gov> wrote:

>
> Hydra should be compatible between different versions of mpich (or mpich
> derivatives).  However, there’s always a possibility that there was a bug
> in mpich-3.0.4’s hydra that was fixed in mpich-3.1.  So we recommend using
> the latest version.
>
>   — Pavan
>
> On May 19, 2014, at 1:15 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>
> > Hi Wesley.
> > Thank you very much for the quick response.
> > I ran your code. The master can't finish its execution; it stalled
> > in MPI_Wait on iteration 7.
> >
> > But if I use MPICH2 hydra, the master process finishes, reporting
> > the failure of the slave process multiple times.
> >
> > Can you please advise whether it is safe to run a hybrid system, built
> > with MPICH3.0.4 but launched with MPICH2 hydra?
> > Or maybe there is another solution.
> > Does MPICH 3.0.4 include all MPICH2 hydra functionality?
> > Maybe my configuration of MPICH 3.0.4 is wrong?
> >
> > Regards,
> > Anatoly.
> >
> >
> > On Sun, May 18, 2014 at 8:40 PM, Wesley Bland <wbland at anl.gov> wrote:
> > Hi Anatoly,
> >
> > I think the problem may be the way that you're aborting. MPICH catches
> > the system abort call and kills the entire application when it's called.
> > Instead, I suggest using MPI_Abort(MPI_COMM_WORLD, 1); That's what I use in
> > my tests and it works fine. It also seemed to work for your code when I
> > tried. I'll attach my modified version of your code. I switched it to C
> > since I happened to have C++ support disabled on my local install, but that
> > shouldn't change anything.
> >
> > Thanks,
> > Wesley
> >
> >
> > On Sun, May 18, 2014 at 5:18 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
> > Dear MPICH2,
> > Can you please help me understand fault tolerance in MPICH3.0.4?
> > I have a simple MPI program:
> > Master: calls MPI_Irecv + MPI_Wait in a loop.
> > Single slave: calls MPI_Send 5 times, then calls abort.
> >
> > When I run the program with MPICH2 hydra, the master process prints
> > multiple messages about the failure of the slave. With MPICH3 hydra, I get
> > a single message about the failure of the slave, and then the master
> > process waits endlessly for the next Irecv to complete.
> > In both cases I compiled the program with MPICH3.0.4.
> >
> > In other words, with MPICH2 hydra each Irecv completes (even if the slave
> > died before the Irecv was executed), but with MPICH3 hydra it does not,
> > which causes the endless wait on MPI_Irecv.
> >
> > If I compile the same program with MPICH2 and use MPICH2 hydra, I get the
> > same result as compiling with MPICH3.0.4 and running with MPICH2 hydra.
> > Execution command:
> > mpiexec.hydra -genvall -disable-auto-cleanup -f MpiConfigMachines1.txt -launcher=rsh -n 2 mpi_irecv_ft_simple
> >
> >
> > Both hydras were configured with:
> >   $ ./configure --prefix=/space/local/mpich-3.0.4/ \
> >       --enable-error-checking=runtime --enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC \
> >       FFLAGS=-fpic --enable-threads=runtime --enable-totalview --enable-static \
> >       --disable-f77 --disable-fc --no-recursion
> >
> >   $ ./configure --prefix=/space/local/mpich2-1.5b2/ \
> >       --enable-error-checking=runtime --enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC \
> >       FFLAGS=-fpic --enable-threads=runtime --enable-totalview --enable-static \
> >       --disable-f77 --disable-fc
> >
> > Can you please advise?
> >
> > Regards,
> > Anatoly.
> >
> >
> > _______________________________________________
> > discuss mailing list     discuss at mpich.org
> > To manage subscription options or unsubscribe:
> > https://lists.mpich.org/mailman/listinfo/discuss
>