[mpich-discuss] MPI fails on 20 processes, but not on 10.

Anatoly G anatolyrishon at gmail.com
Mon Jan 13 01:40:46 CST 2014


Thank you very much.
I tried to clean up the functions which use MPI:
1) mainInit
2) Rcv_WaitAny
3) SndSyncSlave

I added input data verification, which shows that the messages arrive as
expected.
With the current configuration I still see the measured network rate drop,
and the run ends with a failure.
But if I comment out two lines (74,75):
        MPI_Recv(RcvBufs[slaveIdx], BUF_SZ, MPI::CHAR, slaveRank, TAG1,
                 MPI_COMM_WORLD, &status);
        validateInput(fpLog, RcvBufs, SlavesRcvIters, slaveIdx,
                      myStatistics, &SlavesFinished);

in the Rcv_WaitAny function (validateInput is used for verification only and
contains no MPI calls), I see a stable, full network rate: no degradation, no
failure.

I suppose that using MPI_Recv together with MPI_Irecv & MPI_Waitany causes
this degradation (unexpected messages) and the failure.
If I replace these two lines with a 0.5-second sleep, I see a low but stable
network rate and no failure. (The senders are flow-controlled by the master.)
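
For clarity, the receive loop looks roughly like the sketch below. The buffer
and variable names are from my code, but the loop structure here is
simplified; the attached mpi_rcv_any_multithread.cpp is the full version.

// Simplified sketch of the receive pattern described above; the real
// program's re-posting and termination handling are omitted.
#include <mpi.h>
#include <vector>

void Rcv_WaitAny_sketch(int numSlaves, int BUF_SZ, int TAG1)
{
    std::vector<std::vector<char> > RcvBufs(numSlaves,
                                            std::vector<char>(BUF_SZ));
    std::vector<MPI_Request> reqs(numSlaves);

    // One MPI_Irecv is posted per slave (slave i is rank i + 1 here).
    for (int i = 0; i < numSlaves; ++i)
        MPI_Irecv(&RcvBufs[i][0], BUF_SZ, MPI_CHAR, i + 1, TAG1,
                  MPI_COMM_WORLD, &reqs[i]);

    // Wait for the posted receives one at a time.
    for (int completed = 0; completed < numSlaves; ++completed) {
        int slaveIdx;
        MPI_Status status;
        MPI_Waitany(numSlaves, &reqs[0], &slaveIdx, &status);
        int slaveRank = slaveIdx + 1;

        // The two lines quoted above: a *blocking* MPI_Recv for the next
        // message from the same slave, plus verification.  While this call
        // blocks, messages from the other slaves have no matching pre-posted
        // receive and go to the unexpected-message queue.
        MPI_Recv(&RcvBufs[slaveIdx][0], BUF_SZ, MPI_CHAR, slaveRank, TAG1,
                 MPI_COMM_WORLD, &status);
        // validateInput(...);   // verification only, no MPI inside
    }
}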

Execution command:
mpiexec.hydra -genvall -f MpiConfigMachines1.txt -launcher=rsh -n 10
mpi_rcv_any_multithread 100000 1000000 out


Is my scenario legal? If it is, why do I see this degradation?

Attached are out_r0.log for the master and out_r2.log for a slave.
I got this error in the shell:
Assertion failed in file src/mpid/ch3/channels/nemesis/src/ch3_progress.c
at line 640: pkt->type >= 0 && pkt->type < MPIDI_NEM_PKT_END
internal ABORT - process 0



Please help.

Regards,
Anatoly.




On Mon, Jan 6, 2014 at 7:23 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:

>
> There should be automatic flow-control within the network layer to deal
> with this, unless the receives are not posted causing unexpected messages.
>  Looking through your code, that seems to be at least one of the problems.
>  You are posting one Irecv for each slave, and waiting for all of them to
> finish, while the slaves are sending out many messages.  So you can get
> many messages from one slave before getting the first message from all the
> slaves.  This creates unexpected messages.
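
If I understand this correctly, the remedy would be to keep a receive
pre-posted for every slave at all times, for example by re-posting the
MPI_Irecv as soon as MPI_Waitany returns it and dropping the extra blocking
MPI_Recv. A minimal sketch of that idea (the names below are illustrative,
not taken from my code):

// Keep exactly one receive pre-posted per slave at all times.
#include <mpi.h>
#include <vector>

void receive_all(int numSlaves, int bufSz, int tag, long msgsPerSlave)
{
    std::vector<std::vector<char> > bufs(numSlaves,
                                         std::vector<char>(bufSz));
    std::vector<MPI_Request> reqs(numSlaves);
    std::vector<long> left(numSlaves, msgsPerSlave);

    for (int i = 0; i < numSlaves; ++i)
        MPI_Irecv(&bufs[i][0], bufSz, MPI_CHAR, i + 1, tag,
                  MPI_COMM_WORLD, &reqs[i]);

    long remaining = (long)numSlaves * msgsPerSlave;
    while (remaining-- > 0) {
        int idx;
        MPI_Status status;
        MPI_Waitany(numSlaves, &reqs[0], &idx, &status);

        // Process bufs[idx] here (no MPI calls), then immediately re-post
        // the receive so a matching receive is always available for this
        // slave.  A finished slave's request stays MPI_REQUEST_NULL (set by
        // MPI_Waitany) and is ignored on later calls.
        if (--left[idx] > 0)
            MPI_Irecv(&bufs[idx][0], bufSz, MPI_CHAR, idx + 1, tag,
                      MPI_COMM_WORLD, &reqs[idx]);
    }
}

This does not remove unexpected messages entirely (a slave can still send its
next message before the re-post happens), but it avoids the long stretches in
which no receive is posted while a blocking MPI_Recv waits on one particular
slave.
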
>
> Btw, your program can be significantly simplified to a bare-bones 100-line
> source code, which can make debugging easier.  I’d recommend doing that.
>
>   — Pavan
>
> On Jan 6, 2014, at 11:14 AM, Jeff Hammond <jeff.science at gmail.com> wrote:
>
> > Are you blasting the server (master) with messages from N clients
> > (slaves)?  At some point, that will overwhelm the communication
> > buffers and fail.  Can you turn off eager using the documented
> > environment variable?  Rendezvous-only should be much slower but not
> > fail.  Then you can eliminate the pathological usage in your
> > application.
> >
> > Jeff
> >
> > On Sun, Jan 5, 2014 at 10:38 PM, Anatoly G <anatolyrishon at gmail.com>
> wrote:
> >> Hi.
> >> I have created an application. This application fails with an MPI error.
> >> Assertion failed in file
> src/mpid/ch3/channels/nemesis/src/ch3_progress.c at
> >> line 640: pkt->type >= 0 && pkt->type < MPIDI_NEM_PKT_END
> >> internal ABORT - process 0
> >>
> >> Scenario:
> >> Master receives messages from slaves.
> >> Each slave sends data using MPI_Send.
> >> Master receives using MPI_Irecv and MPI_Recv.
> >>
> >> There are other errors in the out*.log files.
> >> The application doesn't fail with 10 processes, but it fails with 20.
> >>
> >> execute command:
> >> mpiexec.hydra -genvall -f MpiConfigMachines1.txt -launcher=rsh -n 20
> >>
> /home/anatol-g/Grape/release_under_constr_MPI_tests_quantum/bin/linux64/rhe6/g++4.4.6/debug/mpi_rcv_any_multithread
> >> 100000 1000000 -1 -1 1 out
> >>
> >> Please help,
> >>
> >> Regards,
> >> Anatoly.
> >>
> >>
> >>
> >
> >
> >
> > --
> > Jeff Hammond
> > jeff.science at gmail.com
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
>
-------------- attachments --------------
Inline text part:
1.1.1.1:1
1.1.1.2:1000

out_r2.log (application/octet-stream, 5503 bytes):
<http://lists.mpich.org/pipermail/discuss/attachments/20140113/b0b42b96/attachment.obj>

out_r0.log (application/octet-stream, 5707 bytes):
<http://lists.mpich.org/pipermail/discuss/attachments/20140113/b0b42b96/attachment-0001.obj>

mpi_rcv_any_multithread.cpp (text/x-c++src, 9261 bytes):
<http://lists.mpich.org/pipermail/discuss/attachments/20140113/b0b42b96/attachment.bin>

