[mpich-discuss] process failing...
Ron Palmer
ron.palmer at pgcgroup.com.au
Thu Jun 5 19:36:28 CDT 2014
Thanks Antonio, Pavan and others,
your suggestions were forwarded to the programmers and it is now
working! It took a little while to try it all out, but I got a
successful result this morning.
I have another question, as there is uneven performance, but I will
start a new thread for that.
Thanks for your assistance.
Regards,
Ron
On 25/05/2014 10:19, "Antonio J. Peña" wrote:
>
> Hi Ron,
>
> Depending on how the algorithm is structured, it may well be that the
> faster computers are generating messages faster than the slow computer
> is able to process them. As the problem size increases, the amount of
> data in unprocessed messages may get too high for that computer. In a
> very simple example (but note that there are many other possible
> situations):
>
> Hosts A and B:
> for (int i = 0; i < n; i++) {
>     MPI_Send(buf, count, MPI_INT, host_c, 0, MPI_COMM_WORLD);
>     buf += count;
> }
>
> Host C:
> for (int i = 0; i < 2*n; i++) {
>     MPI_Recv(buf, count, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
>              &status);
>     process_data(buf);
> }
>
> In that case the senders keep sending data to host_c faster than it
> calls MPI_Recv, and the data is received by MPICH as unexpected
> messages (MPICH doesn't know where to place that data, so it uses
> temporary internal buffers). Note that this situation is not
> necessarily caused by different processing speeds, since the culprit
> may well be the algorithm itself. A very simple way to avoid this
> potential problem could be:
>
> Hosts A and B:
> MPI_Barrier(MPI_COMM_WORLD);
> for (int i = 0; i < n; i++) {
>     MPI_Send(buf, count, MPI_INT, host_c, 0, MPI_COMM_WORLD);
>     buf += count;
> }
>
> Host C:
> for (int i = 0; i < 2*n; i++) {
>     MPI_Irecv(buf + i*count, count, MPI_INT, MPI_ANY_SOURCE, 0,
>               MPI_COMM_WORLD, &requests[i]);
> }
> MPI_Barrier(MPI_COMM_WORLD);
> for (int i = 0; i < 2*n; i++) {
>     MPI_Waitany(2*n, requests, &idx, &status);
>     process_data(buf + idx*count);
> }
>
> Now the data is placed directly in its final destination, since the
> MPI implementation is guaranteed to know where each message goes
> before receiving it. That solution assumes there is enough memory to
> preallocate all the receive buffers, which may not be true and may
> require, for example, implementing a higher-level protocol (such as
> one based on credits) to synchronize the sends and receives and make
> sure that there is room on the receiver for them.
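>
> As a rough illustration only (the tags, the credit count CREDITS, and
> the buffer handling below are placeholders, not taken from your
> application), a credit-based scheme could look like this: each sender
> starts with a fixed number of credits, spends one per message, and
> blocks waiting for the receiver to return a credit once it runs out,
> so no more than CREDITS data messages per sender can sit unexpected
> at the receiver at any time:
>
> /* hypothetical tags and window size, for illustration only */
> #define DATA_TAG   0
> #define CREDIT_TAG 1
> #define CREDITS    8        /* max in-flight messages per sender */
>
> Hosts A and B:
> int credits = CREDITS;
> for (int i = 0; i < n; i++) {
>     if (credits == 0) {     /* receiver is behind: wait for a credit */
>         int dummy;
>         MPI_Recv(&dummy, 1, MPI_INT, host_c, CREDIT_TAG,
>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         credits++;
>     }
>     MPI_Send(buf, count, MPI_INT, host_c, DATA_TAG, MPI_COMM_WORLD);
>     buf += count;
>     credits--;
> }
> /* drain the credits still in flight so no messages are left unmatched */
> while (credits < CREDITS) {
>     int dummy;
>     MPI_Recv(&dummy, 1, MPI_INT, host_c, CREDIT_TAG,
>              MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>     credits++;
> }
>
> Host C:
> for (int i = 0; i < 2*n; i++) {
>     MPI_Recv(buf, count, MPI_INT, MPI_ANY_SOURCE, DATA_TAG,
>              MPI_COMM_WORLD, &status);
>     process_data(buf);
>     /* return one credit to the sender we just drained */
>     int dummy = 0;
>     MPI_Send(&dummy, 1, MPI_INT, status.MPI_SOURCE, CREDIT_TAG,
>              MPI_COMM_WORLD);
> }
>
> Data messages can still arrive before their receives are posted, but
> at most CREDITS of them per sender can be buffered as unexpected at
> any time, so the memory used for unexpected messages stays bounded.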
>
> I hope this helps.
>
> Best,
> Antonio
>
>
> On 05/24/2014 02:00 AM, Ron Palmer wrote:
>> Antonio, Rajeev and others,
>> thanks for your replies and comments on possible causes for the error
>> messages and failure; I have passed them on to the programmers of the
>> underlying application. I must admit I do not understand what
>> unexpected messages are (I am but a mere user); could you perhaps
>> give examples of typical causes of them? E.g., the cluster it runs on
>> consists of 3 dual-Xeon computers with varying CPU clock ratings -
>> could these error messages be due to getting out of sync, expecting
>> results but not getting them from the slower computer? I have
>> restarted the process but excluded the slowest computer (2.27 GHz;
>> the other two run at 2.87 and 3.2 GHz) as I was running out of ideas.
>>
>> For your information, this runs well on smaller problems (few
>> computations).
>>
>> Thanks,
>> Ron
>>
>> On 24/05/2014 3:10 AM, Rajeev Thakur wrote:
>>> Yes. The message below says some process has received 261,895
>>> messages for which no matching receives have been posted yet.
>>>
>>>
>>>
>>> Rajeev
>>>
>>>
>>>> It looks like at least one of your processes is receiving too
>>>> many unexpected messages, leading it to run out of memory.
>>>> Unexpected messages are those that do not match a receive already
>>>> posted on the receiver side. You may want to check with the
>>>> application developers so they can review the algorithm or look
>>>> for any possible bug.
>>>>
>>>> Antonio
>>>
>>>
>>>
>>
>>
>>
>
>
> --
> Antonio J. Peña
> Postdoctoral Appointee
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 9700 South Cass Avenue, Bldg. 240, Of. 3148
> Argonne, IL 60439-4847
> apenya at mcs.anl.gov
> www.mcs.anl.gov/~apenya
>
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
--
*Ron Palmer* MSc MBA
Principal Geophysicist
ron.palmer at pgcgroup.com.au
0413 579 099
07 3103 4963