[mpich-discuss] MPI fail on 20 processes, but not on 10.

Jeff Hammond jeff.science at gmail.com
Mon Jan 6 11:14:30 CST 2014


Are you blasting the server (master) with messages from N clients
(slaves)?  At some point that will overwhelm the communication
buffers and the run will fail.  Can you turn off the eager protocol
using the documented environment variable?  Rendezvous-only should be
much slower but should not fail, which will let you isolate the
pathological usage in your application.
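
For example, with a recent MPICH something along these lines should
force rendezvous for every message size (the CVAR name may differ in
older releases, so check the docs for your version):

  # Disable eager entirely: messages of any size go through rendezvous.
  export MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE=0
  # -genvall forwards the environment to all ranks.
  mpiexec.hydra -genvall -f MpiConfigMachines1.txt -n 20 ./your_app

If the failure disappears in rendezvous-only mode, that points at
unthrottled eager sends exhausting internal buffers.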

Jeff

On Sun, Jan 5, 2014 at 10:38 PM, Anatoly G <anatolyrishon at gmail.com> wrote:
> Hi.
> I have created an application that fails with an MPI error:
> Assertion failed in file src/mpid/ch3/channels/nemesis/src/ch3_progress.c at
> line 640: pkt->type >= 0 && pkt->type < MPIDI_NEM_PKT_END
> internal ABORT - process 0
>
> Scenario:
> Master receives messages from slaves.
> Each slave sends data using MPI_Send.
> Master receives using MPI_Irecv and MPI_Recv.
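>
> A reduced sketch of the pattern (illustrative, not the real code: NMSG
> stands in for the message-count argument, and the real code also mixes
> in MPI_Recv, while this sketch uses only MPI_Irecv + MPI_Wait):
>
>   #include <mpi.h>
>
>   enum { NMSG = 100000 };  /* stand-in for the real message count */
>
>   int main(int argc, char **argv) {
>       int rank, size, msg = 0;
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>       if (rank == 0) {
>           /* master: receive every message from every slave */
>           MPI_Request req;
>           for (long i = 0; i < (long)(size - 1) * NMSG; i++) {
>               MPI_Irecv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, 0,
>                         MPI_COMM_WORLD, &req);
>               MPI_Wait(&req, MPI_STATUS_IGNORE);
>           }
>       } else {
>           /* slave: blocking sends back-to-back to the master */
>           for (int i = 0; i < NMSG; i++)
>               MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
>       }
>       MPI_Finalize();
>       return 0;
>   }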
>
> There are other errors in the out*.log files.
> The application doesn't fail with 10 processes, but it fails with 20.
>
> execute command:
> mpiexec.hydra -genvall -f MpiConfigMachines1.txt -launcher=rsh -n 20
> /home/anatol-g/Grape/release_under_constr_MPI_tests_quantum/bin/linux64/rhe6/g++4.4.6/debug/mpi_rcv_any_multithread
> 100000 1000000 -1 -1 1 out
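>
> (MpiConfigMachines1.txt is a standard hydra host file along these
> lines; the host names below are made up:)
>
>   node1:10
>   node2:10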
>
> Please help,
>
> Regards,
> Anatoly.
>



-- 
Jeff Hammond
jeff.science at gmail.com


