[mpich-discuss] Fault tolerance MPICH2 hydra vs MPICH3.0.4 hydra

Wesley Bland wbland at anl.gov
Wed May 21 08:42:07 CDT 2014


Yes, this is expected. Fault tolerance is an experimental feature and as such is not implemented in all devices. It is currently only compatible with the TCP device.


Thanks,
Wesley







> On May 21, 2014, at 8:40 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
> 
> 
> Thank you, Pavan.
> I tried MPICH 3.1; it works well with fault tolerance.
> One more question:
> If I execute my simulation (previous mail) with MPICH 3.1 compiled with --with-device=ch3:sock,
> fault tolerance is not stable: sometimes the whole system crashes, sometimes it does not.
> If I use the default configuration flags (without --with-device=ch3:sock), the whole system is stable.
> 
> 
> Is this expected behavior?
> 
> 
> Regards,
> Anatoly.
> 
> 
> 
> On Mon, May 19, 2014 at 5:03 PM, Balaji, Pavan <balaji at anl.gov> wrote:
>> 
>> Hydra should be compatible between different versions of mpich (or mpich derivatives).  However, there’s always a possibility that there was a bug in mpich-3.0.4’s hydra that was fixed in mpich-3.1.  So we recommend using the latest version.
>> 
>>   — Pavan
>> 
>> On May 19, 2014, at 1:15 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>> 
>>> Hi Wesley.
>>> Thank you very much for quick response.
>>> I executed your code. The master can't finish its execution; it stalled on MPI_Wait at iteration 7.
>>>
>>> But if I use the MPICH2 hydra, the master process finishes execution, reporting the failure of the slave process a number of times.
>>>
>>> Can you please advise whether it's safe to make a hybrid system, built with MPICH 3.0.4 but using the MPICH2 hydra?
>>> Or maybe there is another solution.
>>> Does MPICH 3.0.4 include all of the MPICH2 hydra functionality?
>>> Maybe my configuration of MPICH 3.0.4 is wrong?
>>>
>>> Regards,
>>> Anatoly.
>>>
>>>
>>> On Sun, May 18, 2014 at 8:40 PM, Wesley Bland <wbland at anl.gov> wrote:
>>> Hi Anatoly,
>>>
>>> I think the problem may be the way that you're aborting. MPICH catches the system abort call and kills the entire application when it's called. Instead, I suggest using MPI_Abort(MPI_COMM_WORLD, 1). That's what I use in my tests and it works fine. It also seemed to work for your code when I tried. I'll attach my modified version of your code. I switched it to C since I happened to have C++ support disabled on my local install, but that shouldn't change anything.
>>>
>>> Thanks,
>>> Wesley
>>>
>>>
>>> On Sun, May 18, 2014 at 5:18 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>>> Dear MPICH2,
>>> Can you please help me understand fault tolerance in MPICH 3.0.4?
>>> I have a simple MPI program:
>>> The master calls MPI_Irecv + MPI_Wait in a loop.
>>> A single slave calls MPI_Send 5 times, then calls abort.
>>>
>>> When I execute the program with the MPICH2 hydra, the master process prints about the slave failure multiple times. With the MPICH3 hydra I get a single message about the slave failure, and then the master process enters an endless wait for the next Irecv completion.
>>> In both cases I compiled the program with MPICH 3.0.4.
>>>
>>> In other words, with the MPICH2 hydra each Irecv completes (even if the slave died before the Irecv was posted), but with the MPICH3 hydra it does not, causing MPI_Irecv to wait endlessly.
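
A minimal sketch of the test described above (an editor's illustration, not the actual program or Wesley's attached version; the message contents, tag, and the use of MPI_ERRORS_RETURN so the master sees an error code instead of being killed are all assumptions):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf, i, rc;
    MPI_Request req;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Return error codes instead of aborting the whole job, so the
       master has a chance to observe the slave's failure. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == 0) {
        /* Master: MPI_Irecv + MPI_Wait in a loop. */
        for (i = 0; ; i++) {
            MPI_Irecv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            rc = MPI_Wait(&req, &stat);
            if (rc != MPI_SUCCESS) {
                printf("master: slave failure detected at iteration %d\n", i);
                break;
            }
        }
    } else {
        /* Slave: MPI_Send 5 times, then die.  Per Wesley's advice,
           use MPI_Abort rather than the system abort(). */
        for (i = 0; i < 5; i++)
            MPI_Send(&i, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Finalize();
    return 0;
}
```

Launched with -disable-auto-cleanup (as in the mpiexec command below), the intent is that the master keeps running after the slave aborts, with MPI_Wait eventually returning an error once the slave is gone.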
>>>
>>> If I compile the same program with MPICH2 and use the MPICH2 hydra, I get the same result as compiling with MPICH 3.0.4 and running with the MPICH2 hydra.
>>>
>>> Execution command:
>>> mpiexec.hydra -genvall -disable-auto-cleanup -f MpiConfigMachines1.txt -launcher=rsh -n 2 mpi_irecv_ft_simple
>>>
>>>
>>> Both hydras were configured with:
>>>   $ ./configure --prefix=/space/local/mpich-3.0.4/ --enable-error-checking=runtime --enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC FFLAGS=-fpic --enable-threads=runtime --enable-totalview --enable-static --disable-f77 --disable-fc --no-recursion
>>>
>>>   $ ./configure --prefix=/space/local/mpich2-1.5b2/ --enable-error-checking=runtime --enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC FFLAGS=-fpic --enable-threads=runtime --enable-totalview --enable-static --disable-f77 --disable-fc
>>>
>>> Can you please advise?
>>>
>>> Regards,
>>> Anatoly.
>>>
>>>
>>> _______________________________________________
>>> discuss mailing list    discuss at mpich.org
>> 
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss