[mpich-discuss] Failed to allocate memory for an unexpected message

Halim Amer aamer at anl.gov
Thu Jul 2 16:09:51 CDT 2015


Hi Luiz,

Please use the latest MPICH. The one you are using is very old.

--Halim

Abdelhalim Amer (Halim)
Postdoctoral Appointee
MCS Division
Argonne National Laboratory

On 7/2/15 1:22 PM, Luiz Carlos da Costa Junior wrote:
> Hello all,
>
> In 2013 I had problems regarding the allocation of unexpected messages
> in MPI.
> After your kind assistance, I implemented a "buffer" matrix in the
> receiver process, using MPI_IRECV, MPI_WAITANY and MPI_TESTANY functions
> (the code snippet is attached).
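
[The attached snippet is not reproduced in the archive. Below is a minimal
sketch, in C, of the approach described above, assuming a fixed pool of
pre-posted MPI_Irecv buffers drained with MPI_Waitany; NBUF, MAXLEN, and
the tag-99 "I am done" convention are illustrative guesses, not the
poster's actual code.]

    /* Sketch only: pre-post a pool of nonblocking receives on the writer
     * rank and service them with MPI_Waitany, re-posting each slot after
     * it completes.  NBUF, MAXLEN and tag 99 are made-up placeholders.   */
    #include <mpi.h>
    #include <stdio.h>

    #define NBUF   64            /* number of pre-posted receive buffers */
    #define MAXLEN 1024          /* payload length, in integers          */

    int main(int argc, char **argv)
    {
        int rank, nprocs, i, idx, done = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {                       /* single receiver/writer */
            static int  buf[NBUF][MAXLEN];
            MPI_Request req[NBUF];
            MPI_Status  st;

            /* Pre-post the whole pool so arriving messages match a posted
             * receive instead of piling up on the unexpected queue.      */
            for (i = 0; i < NBUF; i++)
                MPI_Irecv(buf[i], MAXLEN, MPI_INT, MPI_ANY_SOURCE,
                          MPI_ANY_TAG, MPI_COMM_WORLD, &req[i]);

            while (done < nprocs - 1) {        /* until all workers finish */
                MPI_Waitany(NBUF, req, &idx, &st);
                if (st.MPI_TAG == 99)          /* hypothetical "done" tag  */
                    done++;
                else
                    printf("record from rank %d\n", st.MPI_SOURCE);
                /* re-post the slot that just completed */
                MPI_Irecv(buf[idx], MAXLEN, MPI_INT, MPI_ANY_SOURCE,
                          MPI_ANY_TAG, MPI_COMM_WORLD, &req[idx]);
            }
            for (i = 0; i < NBUF; i++) {       /* cancel leftover receives */
                MPI_Cancel(&req[i]);
                MPI_Wait(&req[i], MPI_STATUS_IGNORE);
            }
        } else {                               /* worker ranks */
            static int msg[MAXLEN];
            MPI_Send(msg, MAXLEN, MPI_INT, 0, 1, MPI_COMM_WORLD);  /* output */
            MPI_Send(msg, 1, MPI_INT, 0, 99, MPI_COMM_WORLD);      /* "done" */
        }
        MPI_Finalize();
        return 0;
    }
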
>
> It had been working nicely since then, but recently I ran into the same
> problem again:
>
>     Fatal error in MPI_Recv: Other MPI error, error stack:
>     MPI_Recv(186)......................: MPI_Recv(buf=0x7fffe8dd5974,
>     count=1, MPI_INTEGER, src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD,
>     status=0xd213d0) failed
>     MPIDI_CH3I_Progress(402)...........:
>     MPID_nem_mpich2_blocking_recv(905).:
>     MPID_nem_tcp_connpoll(1838)........:
>     state_commrdy_handler(1676)........:
>     MPID_nem_tcp_recv_handler(1564)....:
>     MPID_nem_handle_pkt(636)...........:
>     MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for
>     an unexpected message. 261895 unexpected messages queued.
>     Fatal error in MPI_Recv: Other MPI error, error stack:
>     MPI_Recv(186).............: MPI_Recv(buf=0x7fffd052b9f4, count=1,
>     MPI_INTEGER, src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD,
>     status=0xd213d0) failed
>     dequeue_and_set_error(596): Communication error with rank 0
>     Fatal error in MPI_Recv: Other MPI error, error stack:
>     MPI_Recv(186).............: MPI_Recv(buf=0x7fff58fe5b74, count=1,
>     MPI_INTEGER, src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD,
>     status=0xd213d0) failed
>     dequeue_and_set_error(596): Communication error with rank 0
>     Fatal error in MPI_Recv: Other MPI error, error stack:
>     MPI_Recv(186).............: MPI_Recv(buf=0x7fff6fae19f4, count=1,
>     MPI_INTEGER, src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD,
>     status=0xd213d0) failed
>     dequeue_and_set_error(596): Communication error with rank 0
>     Fatal error in MPI_Recv: Other MPI error, error stack:
>     MPI_Recv(186).............: MPI_Recv(buf=0x7fff55bc8e74, count=1,
>     MPI_INTEGER, src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD,
>     status=0xd213d0) failed
>     dequeue_and_set_error(596): Communication error with rank 0
>
>
> I'm using MPICH2 1.4.1p1 on a Linux x64 machine (an AWS EC2 instance).
> The last execution with this error had 63 working processes sending all
> the output to just one receiver/writer process.
>
> The program and the number of messages sent/received are pretty much the
> same. The only thing I can imagine is that today's processors are
> proportionally faster, relative to network/IO speed, than the 2013 AWS
> EC2 instances were, so the writer process gets "flooded" with messages
> sooner. Does that make sense?
>
> Could you please give some advice on how to solve this issue?
>
> Best regards,
> Luiz
>
> On 13 March 2014 at 16:01, Luiz Carlos da Costa Junior
> <lcjunior at ufrj.br> wrote:
>
>     Thanks again Kenneth, I was able to solve it using MPI_TESTANY.
>     Regards, Luiz
>
>
>     On 13 March 2014 15:35, Kenneth Raffenetti
>     <raffenet at mcs.anl.gov> wrote:
>
>         On 03/13/2014 12:35 PM, Luiz Carlos da Costa Junior wrote:
>
>             Does anyone have any clue about this?
>
>             Thanks in advance.
>
>
>             On 12 March 2014 14:40, Luiz Carlos da Costa Junior
>             <lcjunior at ufrj.br> wrote:
>
>                  Dear Kenneth,
>
>                  Thanks for your quick reply.
>                  I tested your suggestion and, unfortunately, this
>             approach didn't work.
>
>                  Question: when I call MPI_IPROBE, does it also account
>                  for messages that were already received asynchronously?
>
>
>         That should not be the case. If a message has been matched by a
>         recv/irecv, MPI_Probe should not match it again.
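
[A small self-contained check of that behavior, assuming at least two ranks
and an arbitrary tag: once a message has been matched by an MPI_Irecv and
completed with MPI_Wait, a subsequent MPI_Iprobe for the same source and
tag reports nothing pending.]

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, val = 42, flag = 1;
        MPI_Request req;
        MPI_Status  st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(&val, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Irecv(&val, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, &st);                   /* message is now matched */
            MPI_Iprobe(0, 7, MPI_COMM_WORLD, &flag, &st);
            printf("flag = %d (expected 0)\n", flag); /* nothing left to probe */
        }
        MPI_Finalize();
        return 0;
    }
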
>
>
>
>                  Is there any way to know, for my list of mpi_requests
>                  (from my MPI_IRECV's), which ones are still "open" and
>                  which ones have already received messages?
>
>
>         MPI_Test will take a request as an argument and tell you whether
>         or not that requested operation has been completed.
>
>         Ken
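
[A minimal illustration of Ken's point: the helper name and arguments below
are hypothetical, standing in for the poster's list of MPI_IRECV requests.]

    #include <mpi.h>

    /* Scan a list of MPI_Irecv requests and report, via MPI_Test, which
     * ones have already received a message (flag = 1) and which are still
     * pending (flag = 0).  Names are illustrative only.                  */
    static void check_requests(MPI_Request *reqs, int nreq)
    {
        int i, flag;
        MPI_Status st;

        for (i = 0; i < nreq; i++) {
            if (reqs[i] == MPI_REQUEST_NULL)
                continue;                    /* nothing posted in this slot */
            MPI_Test(&reqs[i], &flag, &st);
            if (flag) {
                /* completed: the buffer tied to reqs[i] holds a message;
                 * st.MPI_SOURCE / st.MPI_TAG identify it; repost as needed */
            } else {
                /* still "open": no message has matched this receive yet   */
            }
        }
    }
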
>
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

