[mpich-discuss] process failing...

"Antonio J. Peña" apenya at mcs.anl.gov
Sat May 24 19:19:24 CDT 2014


Hi Ron,

Depending on how the algorithm is structured, it may well be that the 
faster computers are generating messages faster than the slow computer 
is able to process them. As the problem size increases, the amount of 
data held in unprocessed messages may exceed the memory available on 
that computer. A very simple example (note that there are many other 
possible situations):

Hosts A and B:
   for(int i=0; i<n; i++) {
     MPI_Send(buf, count, MPI_INT, host_c, 0, MPI_COMM_WORLD);
     buf += count;
   }

Host C:
   for(int i=0; i<2*n; i++) {
     MPI_Recv(buf, count, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
              &status);
     process_data(buf);
   }

In that case the senders keep sending data to host_c faster than it 
calls MPI_Recv, so MPICH receives the data as unexpected messages: it 
does not yet know where to place the data, so it stores it in temporary 
internal buffers. Note that this situation is not necessarily caused by 
different processing speeds; the culprit may well be the algorithm 
itself. A very simple way to avoid this potential problem could be:

Hosts A and B:
   MPI_Barrier(MPI_COMM_WORLD);
   for(int i=0; i<n; i++) {
     MPI_Send(buf, count, MPI_INT, host_c, 0, MPI_COMM_WORLD);
     buf += count;
   }

Host C:
   for(int i=0; i<2*n; i++) {
     /* post every receive up front, each into its own slot of buf */
     MPI_Irecv(buf+i*count, count, MPI_INT, MPI_ANY_SOURCE, 0,
               MPI_COMM_WORLD, &requests[i]);
   }
   MPI_Barrier(MPI_COMM_WORLD);
   for(int i=0; i<2*n; i++) {
     /* idx is the index of the request that just completed */
     MPI_Waitany(2*n, requests, &idx, &status);
     process_data(buf+idx*count);
   }

Now the data is placed directly in its final destination, since the MPI 
implementation is guaranteed to know that destination before the data 
arrives. This solution assumes that there is enough memory to 
preallocate all the receive buffers, which may not be true. If it is 
not, you may need, for example, a higher-level protocol (such as a 
credit-based one) that synchronizes the sends and receives so that 
there is always room on the receiver for incoming messages.
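
To make the credit idea a bit more concrete, here is a minimal sketch in 
plain C. It is a toy simulation, not MPI code and not a description of 
MPICH internals: the function name peak_queued, the "one message per 
tick" senders, and the recv_period parameter are all made up for 
illustration. Each sender starts with `window` credits, spends one per 
message, and gets it back when the receiver has processed that message. 
The function returns the largest backlog the receiver ever had to hold.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy model of credit-based flow control (no MPI involved).
 * `senders` processes each want to send `n` messages to one receiver.
 * A sender may send at most one message per tick, and only while it
 * holds a credit. The receiver processes one queued message every
 * `recv_period` ticks and then returns that message's credit.
 * Returns the peak number of messages queued at the receiver,
 * or -1 if memory for the simulation could not be allocated. */
int peak_queued(int senders, int n, int window, int recv_period)
{
    assert(senders > 0 && n > 0 && window > 0 && recv_period > 0);

    int total = senders * n;
    int *credits = malloc((size_t)senders * sizeof *credits);
    int *left    = malloc((size_t)senders * sizeof *left);
    int *queue   = malloc((size_t)total * sizeof *queue); /* FIFO of sender ids */
    if (!credits || !left || !queue) {
        free(credits); free(left); free(queue);
        return -1;
    }
    for (int s = 0; s < senders; s++) {
        credits[s] = window;
        left[s] = n;
    }

    int head = 0, tail = 0, peak = 0;
    for (long tick = 0; head < total; tick++) {
        /* Senders: send only if there is work left and a credit in hand. */
        for (int s = 0; s < senders; s++) {
            if (left[s] > 0 && credits[s] > 0) {
                credits[s]--;
                left[s]--;
                queue[tail++] = s;
            }
        }
        if (tail - head > peak)
            peak = tail - head;

        /* Receiver: slower than the senders when recv_period > 1. */
        if (tick % recv_period == 0 && head < tail) {
            int s = queue[head++];
            credits[s]++;                    /* credit goes back to the sender */
            assert(credits[s] <= window);    /* never more credits than granted */
        }
    }

    free(credits); free(left); free(queue);
    return peak;
}
```

For instance, peak_queued(2, 1000, 1000, 4) models the uncontrolled case 
(the window is as large as the whole workload, so senders never wait), 
while peak_queued(2, 1000, 8, 4) can never queue more than 2*8 messages, 
at the cost of the senders sometimes stalling until a credit comes back. 
In a real MPI program the returned credit would typically be a small 
message from the receiver to the sender, and the window would be sized 
to the number of receive buffers the receiver can afford to pre-post.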

I hope this helps.

Best,
   Antonio


On 05/24/2014 02:00 AM, Ron Palmer wrote:
> Antonio, Rajeev and others,
> thanks for your replies and comments on possible causes for the error 
> messages and failure, I have passed them on to the programmers of the 
> underlying application. I must admit I do not understand what 
> unexpected messages are (I am but a mere user), could you perhaps give 
> examples of typical causes of them? Eg, the cluster it runs on 
> consists of 3 dual xeon computers with varying cpu clock rating - 
> could these error messages be due to getting out of synch, expecting 
> results but not getting them from the slower computer? I have 
> re-started the process but excluded the slowest computer (2.27GHz, the 
> other two are running at 2.87 and 3.2) as I was running out of ideas.
>
> For your information, this runs well on smaller problems (few 
> computations).
>
> Thanks,
> Ron
>
> On 24/05/2014 3:10 AM, Rajeev Thakur wrote:
>> Yes. The message below says some process has received 261,895 
>> messages for which no matching receives have been posted yet.
>>
>>
>>
>> Rajeev
>>
>>
>>> It looks like at least one of your processes is receiving too 
>>> many unexpected messages, causing it to run out of memory. 
>>> Unexpected messages are those not matching a posted receive 
>>> on the receiver side. You may check with the application developers 
>>> to make them review the algorithm or look for any possible bug.
>>>
>>>   Antonio
>>
>>
>>
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
>


-- 
Antonio J. Peña
Postdoctoral Appointee
Mathematics and Computer Science Division
Argonne National Laboratory
9700 South Cass Avenue, Bldg. 240, Of. 3148
Argonne, IL 60439-4847
apenya at mcs.anl.gov
www.mcs.anl.gov/~apenya
