<div dir="ltr">Hi,<div><br></div><div>I am getting the following error when running my parallel application:</div><div><br></div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div><font face="courier new, monospace"><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">MPI_Recv(186).................</span><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">.....: MPI_Recv(buf=0x125bd840, count=2060, MPI_CHARACTE<span style="background-color:rgb(255,255,255)">R, src=24, </span></span><span style="background-color:rgb(255,255,255)"><span class="" style="line-height:20px;text-align:justify">tag</span><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">=</span><span class="" style="line-height:20px;text-align:justify">94</span><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">, comm=0x84000002, status=0x125fcff0) failed</span></span></font></div>
<div><font face="courier new, monospace" style="background-color:rgb(255,255,255)"><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">MPIDI_CH3I_Progress(402)......</span><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">.....: </span></font></div>
<div><font face="courier new, monospace" style="background-color:rgb(255,255,255)"><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">MPID_nem_mpich2_blocking_recv(</span><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">905).: </span></font></div>
<div><font face="courier new, monospace" style="background-color:rgb(255,255,255)"><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">MPID_nem_tcp_connpoll(1838)...</span><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">.....: </span></font></div>
<div><font face="courier new, monospace" style="background-color:rgb(255,255,255)"><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">state_commrdy_handler(1676)...</span><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">.....: </span></font></div>
<div><font face="courier new, monospace" style="background-color:rgb(255,255,255)"><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">MPID_nem_tcp_recv_handler(</span><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">1564)....: </span></font></div>
<div><font face="courier new, monospace" style="background-color:rgb(255,255,255)"><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">MPID_nem_handle_pkt(636)......</span><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">.....: </span></font></div>
<div><font face="courier new, monospace" style="background-color:rgb(255,255,255)"><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">MPIDI_CH3_PktHandler_</span><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">EagerSend(606): Failed to allocate memory for an unexpected message. 261895 unexpected messages queued.</span></font></div>
<div><font face="courier new, monospace"><span style="color:rgb(83,80,80);line-height:20px;text-align:justify;background-color:rgb(255,255,255)">Fatal error in MPI_Send: Other MPI error, error stack:</span></font></div><div>
<font face="courier new, monospace"><span style="background-color:rgb(255,255,255)"><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">MPI_Send(173)..............: MPI_Send(buf=0x765d2e60, count=2060, MPI_CHARACTER, dest=0, </span><span class="" style="line-height:20px;text-align:justify">tag</span><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">=</span><span class="" style="line-height:20px;text-align:justify">94</span></span><span style="color:rgb(83,80,80);line-height:20px;text-align:justify"><span style="background-color:rgb(255,255,255)">, comm=0x8</span>4000004) failed</span></font></div>
<div><font face="courier new, monospace"><span style="color:rgb(83,80,80);line-height:20px;text-align:justify">MPID_nem_tcp_connpoll(1826): Communication error with rank 1: Connection reset by peer</span></font></div></blockquote>
<div><br></div><div>I went to MPICH's FAQ (<a href="http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_Why_am_I_getting_so_many_unexpected_messages.3F">http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_Why_am_I_getting_so_many_unexpected_messages.3F</a>).</div>
<div>It says that, most likely, the receiver process cannot keep up with the high number of messages it is receiving.</div><div><br></div><div>In my application, the worker processes perform a very large number of small computations and, after each computation is complete, they send the data to a special "writer" process that is responsible for writing the output to disk.</div>
<div>This scheme used to work reasonably well, until we faced some new data with larger parameters, which caused the problem above.</div><div><br></div><div>We could redesign the application, for example by creating a pool of writer processes, but since we still have only one hard disk the bottleneck would remain. So this does not seem to be a good approach.</div>
<div><br></div><div>As far as I understand, MPICH saves the content of every MPI_Send in an internal buffer (I don't know where the buffer is located: on the sender or on the receiver?) so that the sender can continue computing asynchronously while the messages are being received.</div>
<div>The problem is that this buffer has been exhausted due to some resource limitation.</div><div><br></div><div>Having a buffer is very useful, but if the buffer in the writer process is close to its limit, the worker processes should stop and wait until some space is freed before they resume sending new data to be written to disk.</div>
<div><br></div><div>Is it possible to inspect this buffer in MPICH? Or is it possible to check the number of messages waiting to be received?</div><div>Can anyone suggest a better (and easy to implement) solution?<br></div><div><br></div>
<div>Thanks in advance.</div><div><br></div><div>Regards,<br></div><div>Luiz</div></div>