<div dir="ltr">Thank you very much. <div>Issend - is not so good, It can't support me Fault tolerance. If slave process fails, the master stall.<br><div>I tried mpich-3.0.4 with hydra-3.0.4 but my program which uses MPI Fault tolerance doesn't recognize failure of slave process, but recognizes failure with MPICH2. May be you can suggest solution?</div>
I also tried using hydra from MPICH2 while linking my program with MPICH3. This combination does recognize failures, but I'm not sure it is stable enough.
Can you please advise?
Anatoly.
On Sat, Dec 7, 2013 at 5:20 PM, Pavan Balaji <balaji@mcs.anl.gov> wrote:
As much as I hate saying this — some people find it easier to think of it as “MPICH3”.
<span class="HOEnZb"><font color="#888888"><br>
— Pavan<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
On Dec 7, 2013, at 7:37 AM, Wesley Bland <<a href="mailto:wbland@mcs.anl.gov">wbland@mcs.anl.gov</a>> wrote:<br>
<br>
> MPICH is just the new version of MPICH2. We renamed it when we went past version 3.0.
>
> On Dec 7, 2013, at 3:55 AM, Anatoly G <anatolyrishon@gmail.com> wrote:
>
>> Ok. I"ll try both Issend, and next step to upgrade MPICH to 3.0.4.<br>
>> I thought before that MPICH & MPICH2 are two different branches, when MPICH2 partially supports Fault tolerance, but MPICH not. Now I understand, that I was wrong and MPICH2 is just main version of MPICH.<br>
>>
>> Thank you very much,
>> Anatoly.
>>
>>
>> On Thu, Dec 5, 2013 at 11:01 PM, Rajeev Thakur <thakur@mcs.anl.gov> wrote:
>> The master is receiving more incoming messages than it can match quickly enough with Irecvs. Try using MPI_Issend instead of MPI_Isend.
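>>
>> A minimal sketch of that change in the slave's send loop (buf, BUF_SIZE, MASTER, TAG, and data_comm are illustrative names only, not from the original code):
>>
>>     MPI_Request req;
>>     /* MPI_Issend: the send does not complete until the master has started
>>      * receiving (matched) the message, so the MPI_Wait below throttles the
>>      * slave instead of letting unmatched sends pile up in the master's memory. */
>>     MPI_Issend(buf, BUF_SIZE, MPI_BYTE, MASTER, TAG, data_comm, &req);
>>     MPI_Wait(&req, MPI_STATUS_IGNORE);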
>>
>> Rajeev
>>
>> On Dec 5, 2013, at 2:58 AM, Anatoly G <anatolyrishon@gmail.com> wrote:
>>
>> > Hello.
>> > I'm using MPICH2 1.5.
>> > My system contains a master and 16 slaves.
>> > The system uses a number of communicators.
>> > A single communicator is used for the scenario below:
>> > Each slave continuously sends a 2 KB data buffer using MPI_Isend and then waits with MPI_Wait.
>> > The master starts with an MPI_Irecv for each slave,
>> > then, in an endless loop:
>> > MPI_Waitany, followed by MPI_Irecv on the rank returned by MPI_Waitany.
>> >
>> > Another communicator is used for broadcast communication (commands between the master and the slaves),
>> > but it is not used in parallel with the previous communicator,
>> > only before or after the data transfer.
>> >
>> > The system runs on two computers linked by 1 Gbit/s Ethernet.
>> > The master runs on the first computer and all slaves on the other.
>> > Network traffic is ~800 Mbit/s.
>> >
>> > After 1-2 minutes of execution, the master process starts to grow its memory usage and the network traffic drops.
>> > The memory growth and traffic slowdown continue until MPI fails,
>> > without saving a core file.
>> > My program itself doesn't allocate memory. Can you please explain this behaviour?
>> > How can I make MPI stop the slaves from sending when the master can't keep up with the traffic, instead of allocating memory until it fails?
>> >
>> >
>> > Thank you,
>> > Anatoly.
>> >
>> > P.S.
>> > In my stand-alone test I simulate similar behaviour, but with a single thread in each process (master and slaves).
>> > In that test the master stops the slaves until it finishes processing the accumulated data, and MPI doesn't increase its memory usage.
>> > When the master is free, the slaves continue sending data.

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji

</div><div class="HOEnZb"><div class="h5">_______________________________________________<br>
discuss mailing list <a href="mailto:discuss@mpich.org">discuss@mpich.org</a><br>
To manage subscription options or unsubscribe:<br>
<a href="https://lists.mpich.org/mailman/listinfo/discuss" target="_blank">https://lists.mpich.org/mailman/listinfo/discuss</a><br>
</div></div></blockquote></div><br></div>