[mpich-discuss] process failing...

Ron Palmer ron.palmer at pgcgroup.com.au
Thu Jun 5 19:36:28 CDT 2014


Thanks Antonio, Pavan and others,
your suggestions were forwarded to the programmers and it is now 
working! It took a little while to try it all out but I got a successful 
result this morning.

I have another question about uneven performance, but I will start a 
new thread for that.

Thanks for your assistance.

Regards,
Ron

On 25/05/2014 10:19, "Antonio J. Peña" wrote:
>
> Hi Ron,
>
> Depending on how the algorithm is structured, it may well be that the 
> faster computers are generating messages faster than the slow computer 
> can process them. As the problem size increases, the amount of data 
> sitting in unprocessed messages may grow too large for that computer. 
> Here is a very simple example (but note that there are many other 
> possible situations):
>
> Hosts A and B:
>   for(int i=0; i<n; i++) {
>     MPI_Send(buf, count, MPI_INT, host_c, 0, MPI_COMM_WORLD);
>     buf += count;
>   }
>
> Host C:
>   for(int i=0; i<2*n; i++) {
>     MPI_Recv(buf, count, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
>     process_data(buf);
>   }
>
> In that case the senders keep sending data to host_c faster than it 
> calls MPI_Recv, so the messages are received by MPICH as unexpected 
> messages (MPICH does not yet know where to place that data, so it holds 
> it in temporary internal buffers). Note that this situation is not 
> necessarily caused by different processing speeds; the culprit may well 
> be the algorithm itself. A very simple way to avoid this potential 
> problem could be:
>
> Hosts A and B:
>   MPI_Barrier(MPI_COMM_WORLD);
>   for(int i=0; i<n; i++) {
>     MPI_Send(buf, count, MPI_INT, host_c, 0, MPI_COMM_WORLD);
>     buf += count;
>   }
>
> Host C:
>   for(int i=0; i<2*n; i++) {
>     MPI_Irecv(buf, count, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &requests[i]);
>     buf += count;
>   }
>   MPI_Barrier(MPI_COMM_WORLD);
>   for(int i=0; i<2*n; i++) {
>     MPI_Waitany(2*n, requests, &idx, &status);
>     process_data(buf+idx*count);
>   }
>
> Now the data is placed directly in its final destination, since the 
> MPI implementation is guaranteed to know where each message belongs 
> before it is received. This solution assumes that there is enough 
> memory to preallocate all of the receive buffers, which may not be the 
> case; if not, it may require, for example, a higher-level protocol 
> (such as one based on credits) to synchronize the sends and receives 
> and make sure that there is room on the receiver for them.
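>
> As a rough illustration, such a credit-based scheme might look like the 
> following sketch. Note this is only an illustration under assumed names: 
> CREDITS, host_a and the tag values are placeholders rather than 
> identifiers from your program, and error handling is omitted. Each 
> sender may have at most CREDITS messages in flight, so the receiver 
> never accumulates more than CREDITS unexpected messages per sender:
>
> Hosts A and B:
>   #define CREDITS 16                /* assumed per-sender window size */
>   int credits = CREDITS;
>   for(int i=0; i<n; i++) {
>     if (credits == 0) {
>       /* block until the receiver grants a new batch of credits (tag 1) */
>       MPI_Recv(&credits, 1, MPI_INT, host_c, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>     }
>     MPI_Send(buf, count, MPI_INT, host_c, 0, MPI_COMM_WORLD);
>     buf += count;
>     credits--;
>   }
>
> Host C:
>   int received_from[2] = {0, 0};    /* messages processed per sender */
>   for(int i=0; i<2*n; i++) {
>     MPI_Recv(buf, count, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
>     process_data(buf);
>     int s = (status.MPI_SOURCE == host_a) ? 0 : 1;  /* assumed sender ranks */
>     received_from[s]++;
>     /* grant another CREDITS messages, unless this sender is already done */
>     if (received_from[s] % CREDITS == 0 && received_from[s] < n) {
>       int refill = CREDITS;
>       MPI_Send(&refill, 1, MPI_INT, status.MPI_SOURCE, 1, MPI_COMM_WORLD);
>     }
>   }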
>
> I hope this helps.
>
> Best,
>   Antonio
>
>
> On 05/24/2014 02:00 AM, Ron Palmer wrote:
>> Antonio, Rajeev and others,
>> thanks for your replies and comments on possible causes of the error 
>> messages and the failure; I have passed them on to the programmers of 
>> the underlying application. I must admit I do not understand what 
>> unexpected messages are (I am but a mere user) - could you perhaps 
>> give examples of their typical causes? E.g., the cluster it runs on 
>> consists of 3 dual-Xeon computers with varying CPU clock ratings - 
>> could these error messages be due to getting out of sync, i.e. 
>> expecting results from the slower computer but not getting them? I 
>> have re-started the process but excluded the slowest computer (2.27 
>> GHz; the other two run at 2.87 and 3.2) as I was running out of ideas.
>>
>> For your information, this runs well on smaller problems (few 
>> computations).
>>
>> Thanks,
>> Ron
>>
>> On 24/05/2014 3:10 AM, Rajeev Thakur wrote:
>>> Yes. The message below says some process has received 261,895 
>>> messages for which no matching receives have been posted yet.
>>>
>>>
>>>
>>> Rajeev
>>>
>>>
>>>> It looks like at least one of your processes is receiving too 
>>>> many unexpected messages, causing it to run out of 
>>>> memory. Unexpected messages are those that do not match a receive 
>>>> already posted on the receiver side. You may want to check with the 
>>>> application developers so they can review the algorithm or look for 
>>>> a possible bug.
>>>>
>>>>   Antonio
>>>
>>>
>>>
>>
>>
>>
>
>
> -- 
> Antonio J. Peña
> Postdoctoral Appointee
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 9700 South Cass Avenue, Bldg. 240, Of. 3148
> Argonne, IL 60439-4847
> apenya at mcs.anl.gov
> www.mcs.anl.gov/~apenya
>
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss

-- 

*Ron Palmer* MSc MBA.

Principal Geophysicist

ron.palmer at pgcgroup.com.au

0413 579 099

07 3103 4963

