[mpich-discuss] Fault tolerance MPICH2 hydra vs MPICH3.0.4 hydra

Balaji, Pavan balaji at anl.gov
Mon May 19 09:03:32 CDT 2014


Hydra should be compatible between different versions of mpich (or mpich derivatives).  However, there’s always a possibility that there was a bug in mpich-3.0.4’s hydra that was fixed in mpich-3.1.  So we recommend using the latest version.

  — Pavan

On May 19, 2014, at 1:15 AM, Anatoly G <anatolyrishon at gmail.com> wrote:

> Hi Wesley.
> Thank you very much for quick response.
> I executed your code. The master can't finish its execution; it stalls on MPI_Wait at iteration 7.
> 
> But if I use the MPICH2 hydra, the master process finishes executing, reporting the slave process's failure multiple times.
> 
> Can you please advise whether it's safe to make a hybrid system, built with MPICH 3.0.4 but launched with the MPICH2 hydra?
> Or is there perhaps another solution?
> Does MPICH 3.0.4 include all of the MPICH2 hydra functionality?
> Or maybe my configuration of MPICH 3.0.4 is wrong?
> 
> Regards,
> Anatoly.
> 
> 
> On Sun, May 18, 2014 at 8:40 PM, Wesley Bland <wbland at anl.gov> wrote:
> Hi Anatoly,
> 
> I think the problem may be the way that you're aborting. MPICH catches the system abort call and kills the entire application when it's called. Instead, I suggest using MPI_Abort(MPI_COMM_WORLD, 1); That's what I use in my tests and it works fine. It also seemed to work for your code when I tried. I'll attach my modified version of your code. I switched it to C since I happened to have C++ support disabled on my local install, but that shouldn't change anything.
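The MPI_Abort replacement Wesley describes might look like the following in the slave. This is a minimal sketch of the pattern he names, not his actual attachment; the `slave_fail` helper is a hypothetical name for illustration:

```c
/* Sketch of the fix Wesley suggests (hypothetical, not his attachment):
 * replace the C library abort() in the slave with MPI_Abort, so the
 * MPI runtime tears the job down instead of catching a raw abort. */
#include <mpi.h>

static void slave_fail(void)
{
    /* Instead of calling abort(): */
    MPI_Abort(MPI_COMM_WORLD, 1);  /* error code 1 is returned to mpiexec */
}
```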
> 
> Thanks,
> Wesley
> 
> 
> On Sun, May 18, 2014 at 5:18 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
> Dear MPICH2,
> Can you please help me with understanding Fault Tolerance in MPICH3.0.4
> I have a simple MPI program:
> The master calls MPI_Irecv + MPI_Wait in a loop.
> A single slave calls MPI_Send 5 times, then calls abort().
> 
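The program described above might be sketched roughly as follows. This is a reconstruction from the description, not Anatoly's actual source; in particular, the use of MPI_ERRORS_RETURN on the communicator is an assumption about how the master survives the slave's failure under `-disable-auto-cleanup`:

```c
/* Reconstructed sketch (hypothetical, not the original source):
 * master posts MPI_Irecv + MPI_Wait in a loop; the single slave
 * sends 5 messages and then calls the system abort(). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Assumed: have communication errors returned rather than fatal. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == 0) {                       /* master */
        for (int i = 0; ; i++) {
            int buf;
            MPI_Request req;
            MPI_Status stat;
            MPI_Irecv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            int rc = MPI_Wait(&req, &stat);
            if (rc != MPI_SUCCESS)
                printf("master: iteration %d: slave failed (rc=%d)\n", i, rc);
        }
    } else {                               /* slave */
        for (int i = 0; i < 5; i++) {
            int msg = i;
            MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        abort();                           /* the system abort() in question */
    }

    MPI_Finalize();
    return 0;
}
```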
> When I execute the program with the MPICH2 hydra, the master process prints multiple messages about the slave's failure. With the MPICH3 hydra I get a single message about the slave's failure, and then the master process enters an endless wait for the next Irecv completion.
> In both cases I compiled the program with MPICH 3.0.4.
> 
> In other words, with the MPICH2 hydra each Irecv completes (even if the slave died before the Irecv was posted), but with the MPICH3 hydra it does not, causing MPI_Wait on the Irecv to block forever.
> 
> If I compile the same program with MPICH2 and use the MPICH2 hydra, I get the same result as compiling with MPICH 3.0.4 and running with the MPICH2 hydra.
> 
> Execution command:
> mpiexec.hydra -genvall -disable-auto-cleanup -f MpiConfigMachines1.txt -launcher=rsh -n 2 mpi_irecv_ft_simple
> 
> 
> Both hydras were configured with:
>   $ ./configure --prefix=/space/local/mpich-3.0.4/ --enable-error-checking=runtime --enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC FFLAGS=-fpic --enable-threads=runtime --enable-totalview --enable-static --disable-f77 --disable-fc --no-recursion
> 
>   $ ./configure --prefix=/space/local/mpich2-1.5b2/ --enable-error-checking=runtime --enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC FFLAGS=-fpic --enable-threads=runtime --enable-totalview --enable-static --disable-f77 --disable-fc 
> 
> Can you please advise?
> 
> Regards,
> Anatoly.
> 
> 
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
> 
