[mpich-discuss] MPI fail on 20 processes, but not on 10.

Pavan Balaji balaji at mcs.anl.gov
Mon Jan 6 11:23:22 CST 2014


There should be automatic flow control within the network layer to deal with this, unless the receives are not posted, which causes unexpected messages.  Looking through your code, that seems to be at least one of the problems.  You are posting one Irecv for each slave and waiting for all of them to finish, while the slaves keep sending out many messages.  So the master can receive many messages from one slave before it gets the first message from every slave, and those extra messages pile up as unexpected messages.
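
Something along the lines of the sketch below is what I have in mind; it is only a bare-bones illustration of the pattern, not your code, and the message sizes, tags and counts are made up.  The idea is to keep exactly one receive pre-posted per slave and to re-post it as soon as it completes, e.g. with MPI_Waitany:

#include <mpi.h>
#include <stdlib.h>

#define MSG_LEN  1024      /* made-up message length */
#define TAG_DATA 0
#define NMSGS    1000      /* made-up number of messages per slave */

static void slave(void)
{
    double buf[MSG_LEN] = {0};
    for (int m = 0; m < NMSGS; m++)
        MPI_Send(buf, MSG_LEN, MPI_DOUBLE, 0, TAG_DATA, MPI_COMM_WORLD);
}

static void master(int nslaves)
{
    double      *buf  = malloc((size_t)nslaves * MSG_LEN * sizeof(double));
    MPI_Request *reqs = malloc((size_t)nslaves * sizeof(MPI_Request));
    int         *left = malloc((size_t)nslaves * sizeof(int));
    int active = nslaves;

    /* Keep exactly one receive pre-posted for every slave. */
    for (int i = 0; i < nslaves; i++) {
        left[i] = NMSGS;
        MPI_Irecv(buf + (size_t)i * MSG_LEN, MSG_LEN, MPI_DOUBLE,
                  i + 1, TAG_DATA, MPI_COMM_WORLD, &reqs[i]);
    }

    while (active > 0) {
        int idx;
        /* Take whichever slave's message arrives first. */
        MPI_Waitany(nslaves, reqs, &idx, MPI_STATUS_IGNORE);
        /* ... process buf + idx * MSG_LEN here ... */

        if (--left[idx] > 0) {
            /* Re-post right away, so the next message from this slave
               finds a matching receive instead of becoming unexpected. */
            MPI_Irecv(buf + (size_t)idx * MSG_LEN, MSG_LEN, MPI_DOUBLE,
                      idx + 1, TAG_DATA, MPI_COMM_WORLD, &reqs[idx]);
        } else {
            active--;   /* this slave is done; reqs[idx] is now MPI_REQUEST_NULL */
        }
    }
    free(buf); free(reqs); free(left);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) master(size - 1);
    else           slave();
    MPI_Finalize();
    return 0;
}

An MPI_Irecv on MPI_ANY_SOURCE that you keep re-posting works just as well; the point is simply that a matching receive is always available before the slaves' sends arrive.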

Btw, your program can be simplified significantly, down to a bare-bones 100-line reproducer, which would make debugging much easier.  I'd recommend doing that.

  — Pavan

On Jan 6, 2014, at 11:14 AM, Jeff Hammond <jeff.science at gmail.com> wrote:

> Are you blasting the server (master) with messages from N clients
> (slaves)?  At some point, that will overwhelm the communication
> buffers and fail.  Can you turn off eager using the documented
> environment variable?  Rendezvous-only should be much slower but not
> fail.  Then you can eliminate the pathological usage in your
> application.
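
(For MPICH 3.1, Jeff's suggestion would look something like the line below.  I am writing the CVAR name from memory, so please check it against the documentation of your MPICH version; there may also be a separate threshold for the intranode shared-memory path.  Setting the threshold to 0 should push every message through the rendezvous path.)

mpiexec.hydra -genv MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE 0 -genvall -f MpiConfigMachines1.txt -launcher=rsh -n 20 <your binary and its arguments>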
> 
> Jeff
> 
> On Sun, Jan 5, 2014 at 10:38 PM, Anatoly G <anatolyrishon at gmail.com> wrote:
>> Hi.
>> I have created an application, and it fails with the following MPI error:
>> Assertion failed in file src/mpid/ch3/channels/nemesis/src/ch3_progress.c at
>> line 640: pkt->type >= 0 && pkt->type < MPIDI_NEM_PKT_END
>> internal ABORT - process 0
>> 
>> Scenario:
>> The master receives messages from the slaves.
>> Each slave sends data using MPI_Send.
>> The master receives using MPI_Irecv and MPI_Recv.
>> 
>> There are other errors in the out*.log files as well.
>> The application doesn't fail with 10 processes, but it fails with 20.
>> 
>> execute command:
>> mpiexec.hydra -genvall -f MpiConfigMachines1.txt -launcher=rsh -n 20
>> /home/anatol-g/Grape/release_under_constr_MPI_tests_quantum/bin/linux64/rhe6/g++4.4.6/debug/mpi_rcv_any_multithread
>> 100000 1000000 -1 -1 1 out
>> 
>> Please help,
>> 
>> Regards,
>> Anatoly.
>> 
>> 
>> 
> 
> 
> 
> -- 
> Jeff Hammond
> jeff.science at gmail.com

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji



