[mpich-discuss] Fault tolerance MPICH2 hydra vs MPICH3.0.4 hydra

Anatoly G anatolyrishon at gmail.com
Mon May 19 01:15:33 CDT 2014


Hi Wesley.
Thank you very much for quick response.
I executed your code. The master can't finish its execution; it stalls in
MPI_Wait on iteration 7.

But if I use the MPICH2 hydra, the master process finishes its execution,
reporting the slave process's failure multiple times.

Can you please advise whether it's safe to make a hybrid system: built with
MPICH 3.0.4, but using the MPICH2 hydra?
Or is there perhaps another solution?
Does MPICH 3.0.4 include all of the MPICH2 hydra functionality?
Or maybe my configuration of MPICH 3.0.4 is wrong?
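For reference, here is a minimal sketch of the pattern I am testing. This is a reconstruction from the description below, not the exact mpi_irecv_ft_simple source; the iteration count, message contents, and error handling are my assumptions:

```c
/* Minimal master/slave fault-tolerance sketch (reconstruction, not the
 * original mpi_irecv_ft_simple).  Run with e.g.:
 *   mpiexec -disable-auto-cleanup -n 2 ./irecv_ft_sketch
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Ask MPI to return errors instead of aborting the job, so the
     * master has a chance to observe the slave's failure. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == 0) {                     /* master */
        for (i = 0; ; i++) {
            MPI_Request req;
            MPI_Status  stat;
            MPI_Irecv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            int rc = MPI_Wait(&req, &stat);
            /* With fault-tolerance support, a dead peer should complete
             * the request with an error rather than hang forever. */
            if (rc != MPI_SUCCESS || stat.MPI_ERROR != MPI_SUCCESS) {
                printf("master: detected slave failure at iteration %d\n", i);
                break;
            }
            printf("master: received %d (iteration %d)\n", buf, i);
        }
    } else {                             /* slave */
        for (i = 0; i < 5; i++) {
            buf = i;
            MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        /* Use MPI_Abort rather than the system abort(): MPICH intercepts
         * the latter and tears down the whole job. */
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Finalize();
    return 0;
}
```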

Regards,
Anatoly.


On Sun, May 18, 2014 at 8:40 PM, Wesley Bland <wbland at anl.gov> wrote:

> Hi Anatoly,
>
> I think the problem may be the way that you're aborting. MPICH catches the
> system abort call and kills the entire application when it's called.
> Instead, I suggest using MPI_Abort(MPI_COMM_WORLD, 1); That's what I use in
> my tests and it works fine. It also seemed to work for your code when I
> tried. I'll attach my modified version of your code. I switched it to C
> since I happened to have C++ support disabled on my local install, but that
> shouldn't change anything.
>
> Thanks,
> Wesley
>
>
> On Sun, May 18, 2014 at 5:18 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>
>>  Dear MPICH2,
>> Can you please help me understand fault tolerance in MPICH 3.0.4?
>> I have a simple MPI program:
>> The master calls MPI_Irecv + MPI_Wait in a loop.
>> A single slave calls MPI_Send 5 times, then calls abort.
>>
>>  When I execute the program with the MPICH2 hydra, the master process
>> prints multiple messages about the slave's failure. With the MPICH3 hydra I
>> get a single message about the slave's failure, and then the master process
>> enters an endless wait for the next Irecv completion.
>>  In both cases I compiled the program with MPICH 3.0.4.
>>
>>  In other words, with the MPICH2 hydra each Irecv completes (even if the
>> slave died before the Irecv was posted), but with the MPICH3 hydra it does
>> not, which causes MPI_Irecv to wait endlessly.
>>
>>  If I compile the same program with MPICH2 and use the MPICH2 hydra, I
>> get the same result as compiling with MPICH 3.0.4 and running with the
>> MPICH2 hydra.
>>
>>  Execution command:
>>  mpiexec.hydra -genvall -disable-auto-cleanup -f MpiConfigMachines1.txt
>> -launcher=rsh -n 2 mpi_irecv_ft_simple
>>
>>
>>  Both hydras were configured with:
>>   $ ./configure --prefix=/space/local/mpich-3.0.4/
>> --enable-error-checking=runtime --enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC
>> FFLAGS=-fpic --enable-threads=runtime --enable-totalview --enable-static
>> --disable-f77 --disable-fc --no-recursion
>>
>>   $ ./configure --prefix=/space/local/mpich2-1.5b2/
>> --enable-error-checking=runtime --enable-g=dbg CFLAGS=-fPIC CXXFLAGS=-fPIC
>> FFLAGS=-fpic --enable-threads=runtime --enable-totalview --enable-static
>> --disable-f77 --disable-fc
>>
>>  Can you please advise?
>>
>>  Regards,
>> Anatoly.
>>
>
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>