<div dir="ltr">Dear Pavan,<div><br></div><div>Thanks for your reply.</div><div><br></div><div>I noticed that the numbers of 261895 or 261894 unexpected messages appear every time.</div><div>I am using AWS machine with 60GB of RAM memory to run my application. We could see during the execution that there was more than 30 GB available.</div>
<div>Is this an internal limitation of the MPICH implementation? Would it be possible to recompile MPICH with a different parameter?<br></div><div>Do you know where exactly this limitation occurs?</div><div><br></div><div>
Best regards,</div>
</div><div class="gmail_extra"><br><br><div class="gmail_quote">On 24 October 2013 00:19, Pavan Balaji <span dir="ltr"><<a href="mailto:balaji@mcs.anl.gov" target="_blank">balaji@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Luiz,<br>
<br>
You don’t need to know how many messages you will receive. You only need to make sure that whenever a message comes in, it knows which buffer it should go into. One way to fix that is to post N receives with ANY_SOURCE and ANY_TAG:<br>
<br>
MPI_Irecv(.., MPI_ANY_SOURCE, MPI_ANY_TAG, ..);<br>
<br>
Then test (e.g., with MPI_Waitany or MPI_Testany) to see if something came in. Whenever a message comes in, process it and repost another Irecv in its place. In this case, you should not have any unexpected messages (unless you have multiple communicators, which is a different story).<br>
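Something along these lines (a rough sketch in C; the number of pre-posted receives, the maximum record size, the "done" tag, and the function name are placeholders you would adapt to your code):<br>
<br>
#include &lt;mpi.h&gt;<br>
<br>
#define NRECVS   64     /* pre-posted receives; tune to your message rate */<br>
#define MAXLEN   2060   /* must be at least the largest record a worker sends */<br>
#define DONE_TAG 99     /* hypothetical "worker finished" tag */<br>
<br>
/* Writer loop: keep NRECVS receives posted at all times, so incoming<br>
   records always find a matching receive and are never "unexpected". */<br>
void writer_loop(MPI_Comm comm, int nworkers)<br>
{<br>
    static char bufs[NRECVS][MAXLEN];<br>
    MPI_Request reqs[NRECVS];<br>
    MPI_Status status;<br>
    int i, idx, finished = 0;<br>
<br>
    for (i = 0; i < NRECVS; i++)<br>
        MPI_Irecv(bufs[i], MAXLEN, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,<br>
                  comm, &reqs[i]);<br>
<br>
    while (finished < nworkers) {<br>
        MPI_Waitany(NRECVS, reqs, &idx, &status);  /* first completed slot */<br>
<br>
        if (status.MPI_TAG == DONE_TAG) {<br>
            finished++;<br>
        } else {<br>
            /* ... write bufs[idx] to disk here ... */<br>
        }<br>
<br>
        /* repost the slot that just completed */<br>
        MPI_Irecv(bufs[idx], MAXLEN, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,<br>
                  comm, &reqs[idx]);<br>
    }<br>
<br>
    for (i = 0; i < NRECVS; i++) {  /* clean up the leftover receives */<br>
        MPI_Cancel(&reqs[i]);<br>
        MPI_Wait(&reqs[i], MPI_STATUS_IGNORE);<br>
    }<br>
}<br>
<br>
(With one receive slot per worker, as Antonio suggested, the idea is the same; the ANY_SOURCE pool just lets fast and slow workers share the slots.)<br>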
<br>
-- Pavan<br>
<div><div class="h5"><br>
On Oct 23, 2013, at 4:22 PM, Luiz Carlos da Costa Junior <<a href="mailto:lcjunior@ufrj.br">lcjunior@ufrj.br</a>> wrote:<br>
<br>
> Hi Antonio,<br>
><br>
> Thanks for your quick reply.<br>
><br>
> I confess I have to study a bit more about this, but I think I understood your suggestion.<br>
> After a little research, I understood that when I use MPI_Irecv I am performing a non-blocking operation that just informs MPI that I have a message to receive, and then, after MPI_Waitany returns, I can access the data and write it to the file. In this kind of implementation, I inform MPI that I have a bunch of messages to receive from the worker processes and, because of that, I would have to have a buffer in the writer process to receive the messages asynchronously. The higher efficiency is achieved because, while the writer process is performing the I/O operation (writing to the hard disk), MPI can in the meantime transfer the received data into the pre-allocated buffer in my application. Is all this right?<br>
><br>
> But, what if I don't know the number of messages I will receive?<br>
> That's my case, actually... I don't even know which worker process will send me data, and I also don't know how many times each one of them will send me messages. I use a master-slave scheme to distribute tasks, so the number of calculations done in each worker process (and thus the data sent to the writer process) depends on the speed of each worker and how many times it asks for tasks. It is easy to deal with messages coming from an unknown sender, but I don't know how to deal with an unknown number of messages (i.e., how many MPI_Irecv calls do I have to post?). Any idea?<br>
><br>
> As I said before, I considered having more than one writer process, but I can't see how this would solve the problem, which seems to be related to disk I/O speed. In other words, why would it be worth having more than one writer process if, in the end, I have only one hard disk to perform the I/O operations?<br>
><br>
> Thanks again.<br>
><br>
> Regards, Luiz<br>
><br>
><br>
> On 23 October 2013 17:42, Antonio J. Peña <<a href="mailto:apenya@mcs.anl.gov">apenya@mcs.anl.gov</a>> wrote:<br>
><br>
> Hi Luiz,<br>
><br>
> Your error trace indicates that the receiver ran out of memory due to a very large number (261,895) of eagerly received unexpected messages, i.e., small messages received before a matching receive operation was posted. Whenever this happens, the receiver allocates a temporary buffer to hold the received message. This exhausted the available memory on the computer where the receiver was executing.<br>
><br>
> To avoid this, try to pre-post receives before the messages arrive. Indeed, this is far more efficient. Maybe you could post an MPI_Irecv per worker in your writer process, and process them after an MPI_Waitany. You may also consider having multiple writer processes if your use case permits it and the volume of received messages is too high to be processed by a single writer.<br>
><br>
> Antonio<br>
><br>
><br>
> On Wednesday, October 23, 2013 05:27:27 PM Luiz Carlos da Costa Junior wrote:<br>
> Hi,<br>
><br>
> I am getting the following error when running my parallel application:<br>
><br>
> MPI_Recv(186)......................: MPI_Recv(buf=0x125bd840, count=2060, MPI_CHARACTER, src=24, tag=94, comm=0x84000002, status=0x125fcff0) failed<br>
> MPIDI_CH3I_Progress(402)...........:<br>
> MPID_nem_mpich2_blocking_recv(905).:<br>
> MPID_nem_tcp_connpoll(1838)........:<br>
> state_commrdy_handler(1676)........:<br>
> MPID_nem_tcp_recv_handler(1564)....:<br>
> MPID_nem_handle_pkt(636)...........:<br>
> MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an unexpected message. 261895 unexpected messages queued.<br>
> Fatal error in MPI_Send: Other MPI error, error stack:<br>
> MPI_Send(173)..............: MPI_Send(buf=0x765d2e60, count=2060, MPI_CHARACTER, dest=0, tag=94, comm=0x84000004) failed<br>
> MPID_nem_tcp_connpoll(1826): Communication error with rank 1: Connection reset by peer<br>
><br>
> I went to MPICH's FAQ (<a href="http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_Why_am_I_getting_so_many_unexpected_messages.3F" target="_blank">http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_Why_am_I_getting_so_many_unexpected_messages.3F</a>).<br>
> It says that most likely the receiver process cannot keep up with the high number of messages it is receiving.<br>
><br>
> In my application, the worker processes perform a very large number of small computations and, after some computation is complete, they send the data to a special "writer" process that is responsible for writing the output to disk.<br>
> This scheme used to work in a very reasonable fashion, until we faced some new data with larger parameters that caused the problem above.<br>
><br>
> Even if we redesign the application, for example by creating a pool of writer processes, we still have only one hard disk, so the bottleneck would not be solved. Therefore, this doesn't seem to be a good approach.<br>
><br>
> As far as I understood, MPICH saves the content of every MPI_Send in an internal buffer (I don't know where this buffer is located, on the sender or the receiver?) to allow the sender to continue computing asynchronously while the messages are being received.<br>
> The problem is that this buffer has been exhausted due to some resource limitation.<br>
><br>
> It is very useful to have a buffer, but if the buffer in the writer process is close to its limit, the worker processes should stop and wait until some space is freed before resuming sending new data to be written to disk.<br>
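> For instance, something along these lines (a rough sketch; the window size, the tags, and the writer rank are only illustrative) is what I have in mind on the worker side:<br>
><br>
> #include &lt;mpi.h&gt;<br>
><br>
> #define WINDOW      1000  /* records a worker may send before waiting for an ack */<br>
> #define DATA_TAG    94<br>
> #define ACK_TAG     95<br>
> #define WRITER_RANK 0<br>
><br>
> /* Workers call this for every record. After WINDOW sends it blocks until<br>
>    the writer acknowledges, so at most WINDOW records per worker can ever<br>
>    be queued as unexpected messages at the writer. */<br>
> void send_record_throttled(const char *record, int len, MPI_Comm comm)<br>
> {<br>
>     static int in_flight = 0;<br>
>     int dummy;<br>
><br>
>     MPI_Send((void *)record, len, MPI_CHAR, WRITER_RANK, DATA_TAG, comm);<br>
>     if (++in_flight == WINDOW) {<br>
>         MPI_Recv(&dummy, 1, MPI_INT, WRITER_RANK, ACK_TAG, comm,<br>
>                  MPI_STATUS_IGNORE);<br>
>         in_flight = 0;<br>
>     }<br>
> }<br>
><br>
> The writer would then send one small acknowledgement back to a worker after writing every WINDOW records received from that worker.<br>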
><br>
> Is it possible to check this buffer in MPICH? Or is it possible to check the number of messages to be received?<br>
> Can anyone suggest a better (easy to implement) solution?<br>
><br>
> Thanks in advance.<br>
><br>
> Regards,<br>
> Luiz<br>
><br>
><br>
</div></div>> _______________________________________________<br>
> discuss mailing list <a href="mailto:discuss@mpich.org">discuss@mpich.org</a><br>
> To manage subscription options or unsubscribe:<br>
> <a href="https://lists.mpich.org/mailman/listinfo/discuss" target="_blank">https://lists.mpich.org/mailman/listinfo/discuss</a><br>
<br>
--<br>
Pavan Balaji<br>
<a href="http://www.mcs.anl.gov/~balaji" target="_blank">http://www.mcs.anl.gov/~balaji</a><br>
<br>
_______________________________________________<br>
discuss mailing list <a href="mailto:discuss@mpich.org">discuss@mpich.org</a><br>
To manage subscription options or unsubscribe:<br>
<a href="https://lists.mpich.org/mailman/listinfo/discuss" target="_blank">https://lists.mpich.org/mailman/listinfo/discuss</a><br>
</blockquote></div><br></div>