<div dir="ltr">Thank you very much.<div>I tried to clean functions which use MPI:</div><div>1) mainInit</div><div>2) Rcv_WaitAny</div><div>3) SndSyncSlave</div><div><br></div><div>I added input data verification - which shows that messages are come as expected.</div>
<div>In current configuration I still see that network measurements become low and ends with fail.</div><div>But if I comment out two lines (74,75): </div><div><div><font size="1"> MPI_Recv(RcvBufs[slaveIdx], BUF_SZ, MPI::CHAR, slaveRank, TAG1, MPI_COMM_WORLD, &status);</font></div>
<div><font size="1"> validateInput(fpLog, RcvBufs, SlavesRcvIters, slaveIdx, myStatistics, &SlavesFinished);</font></div></div><div><br></div><div>in Rcv_WaitAny function (validate input needs for verification only no MPI inside), I see stable & full network rate, no degradation, no failure.</div>
<div><br></div><div>I suppose that using MPI_Recv together with MPI_Irecv & MPI_Waitany causes this degradation (unexpected messages) and failure.</div><div>If I set instead of these two lines "sleep for 0.5 second", I see low but stable network rate, no failure. (Senders are flow controlled by master).</div>
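
For reference, here is a minimal sketch of the receive loop in Rcv_WaitAny (simplified, with placeholder values for BUF_SZ, TAG1, the slave count and the finish condition; it is not my real code):

    // Simplified sketch of the Rcv_WaitAny loop: one MPI_Irecv is posted per
    // slave, MPI_Waitany picks a completed one, then the extra MPI_Recv
    // (line 74 above) reads one more message from that same slave before the
    // Irecv is re-posted.
    #include <mpi.h>
    #include <vector>

    enum { BUF_SZ = 1024, TAG1 = 1 };   // placeholder values

    void Rcv_WaitAny_sketch(int numSlaves, long msgsPerSlave)
    {
        std::vector<std::vector<char> > RcvBufs(numSlaves,
                                                std::vector<char>(BUF_SZ));
        std::vector<MPI_Request> reqs(numSlaves);

        // Slaves are ranks 1..numSlaves; post one Irecv per slave.
        for (int i = 0; i < numSlaves; ++i)
            MPI_Irecv(&RcvBufs[i][0], BUF_SZ, MPI_CHAR, i + 1, TAG1,
                      MPI_COMM_WORLD, &reqs[i]);

        long received = 0;
        while (received < msgsPerSlave * numSlaves) {
            int slaveIdx;
            MPI_Status status;
            MPI_Waitany(numSlaves, &reqs[0], &slaveIdx, &status);
            ++received;

            int slaveRank = slaveIdx + 1;
            // The two lines I comment out (74, 75): a blocking receive of the
            // next message from the same slave, then verification (no MPI
            // calls inside the verification itself).
            MPI_Recv(&RcvBufs[slaveIdx][0], BUF_SZ, MPI_CHAR, slaveRank, TAG1,
                     MPI_COMM_WORLD, &status);
            ++received;

            // Re-post the Irecv for this slave before waiting again.
            MPI_Irecv(&RcvBufs[slaveIdx][0], BUF_SZ, MPI_CHAR, slaveRank, TAG1,
                      MPI_COMM_WORLD, &reqs[slaveIdx]);
        }
    }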

Execution command:
mpiexec.hydra -genvall -f MpiConfigMachines1.txt -launcher=rsh -n 10 mpi_rcv_any_multithread 100000 1000000 out

Is my scenario legal? If it is legal, why do I see the degradation?

I have attached out_r0.log for the master and out_r2.log for a slave.
I got this error in the shell:

Assertion failed in file src/mpid/ch3/channels/nemesis/src/ch3_progress.c at line 640: pkt->type >= 0 && pkt->type < MPIDI_NEM_PKT_END
internal ABORT - process 0

Please help.

Regards,
Anatoly.
<br><br><div class="gmail_quote">On Mon, Jan 6, 2014 at 7:23 PM, Pavan Balaji <span dir="ltr"><<a href="mailto:balaji@mcs.anl.gov" target="_blank">balaji@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
There should be automatic flow-control within the network layer to deal with this, unless the receives are not posted, causing unexpected messages. Looking through your code, that seems to be at least one of the problems. You are posting one Irecv for each slave, and waiting for all of them to finish, while the slaves are sending out many messages. So you can get many messages from one slave before getting the first message from all the slaves. This creates unexpected messages.
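
Purely as an illustration (a rough sketch with placeholder names, not a drop-in fix), the usual structure is to re-post each slave's Irecv as soon as it completes and process the data in between, so that, with the senders flow-controlled, each new message finds a receive already posted:

    // Sketch: keep one receive pre-posted per slave at all times and avoid
    // extra blocking receives in between (placeholder names and sizes).
    #include <mpi.h>
    #include <vector>

    void master_loop_sketch(int numSlaves, int bufSz, int tag, long totalMsgs)
    {
        std::vector<std::vector<char> > bufs(numSlaves,
                                             std::vector<char>(bufSz));
        std::vector<MPI_Request> reqs(numSlaves);

        for (int i = 0; i < numSlaves; ++i)
            MPI_Irecv(&bufs[i][0], bufSz, MPI_CHAR, i + 1, tag,
                      MPI_COMM_WORLD, &reqs[i]);

        for (long n = 0; n < totalMsgs; ++n) {
            int idx;
            MPI_Status status;
            MPI_Waitany(numSlaves, &reqs[0], &idx, &status);

            // ... process bufs[idx] here (no additional blocking MPI_Recv) ...

            // Immediately re-post, so the next message from that slave also
            // finds a posted receive.
            MPI_Irecv(&bufs[idx][0], bufSz, MPI_CHAR, idx + 1, tag,
                      MPI_COMM_WORLD, &reqs[idx]);
        }
    }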

Btw, your program can be significantly simplified to a bare-bones 100-line source code, which can make debugging easier. I’d recommend doing that.

— Pavan

On Jan 6, 2014, at 11:14 AM, Jeff Hammond <jeff.science@gmail.com> wrote:

> Are you blasting the server (master) with messages from N clients
> (slaves)? At some point, that will overwhelm the communication
> buffers and fail. Can you turn off eager using the documented
> environment variable? Rendezvous-only should be much slower but not
> fail. Then you can eliminate the pathological usage in your
> application.
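> For example (the exact variable name depends on your MPICH version, so
> please check the documentation; MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE is the
> name in recent MPICH), forcing rendezvous for all message sizes might
> look like:
>
>   mpiexec.hydra -genv MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE 0 -genvall \
>       -f MpiConfigMachines1.txt -launcher=rsh -n 10 mpi_rcv_any_multithread ...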
>
> Jeff
>
> On Sun, Jan 5, 2014 at 10:38 PM, Anatoly G <anatolyrishon@gmail.com> wrote:
>> Hi.
>> I have created an application. This application fails with an MPI error:
>> Assertion failed in file src/mpid/ch3/channels/nemesis/src/ch3_progress.c at
>> line 640: pkt->type >= 0 && pkt->type < MPIDI_NEM_PKT_END
>> internal ABORT - process 0
>>
>> Scenario:
>> Master receives messages from slaves.
>> Each slave sends data using MPI_Send.
>> Master receives using MPI_Irecv and MPI_Recv.
>>
>> There are other errors in the out*.log files.
>> The application doesn't fail with 10 processes, but fails with 20.
>>
>> Execute command:
>> mpiexec.hydra -genvall -f MpiConfigMachines1.txt -launcher=rsh -n 20
>> /home/anatol-g/Grape/release_under_constr_MPI_tests_quantum/bin/linux64/rhe6/g++4.4.6/debug/mpi_rcv_any_multithread
>> 100000 1000000 -1 -1 1 out
>>
>> Please help,
>>
>> Regards,
>> Anatoly.
>>
>
> --
> Jeff Hammond
> jeff.science@gmail.com

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji