[mpich-discuss] Better alternatives of MPI_Allreduce()

hritikesh semwal hritikesh.semwal at gmail.com
Tue May 5 06:16:33 CDT 2020


I want to add two more questions about my solver:
1. I am using MPI_Neighbor_alltoallw() to exchange data over a
communicator with a distributed graph topology. Most of the time my code
works fine, but sometimes I suspect it goes into deadlock (it stops
producing any output). MPI_Neighbor_alltoallw uses MPI_Waitall internally,
so I do not understand why exactly this is happening.
2. Is it possible for the time the processors take to complete the task to
vary from run to run? For example, in one run all processors take around
100 seconds, and in another run all processors take around 110 seconds.

Please help with the above two matters.

On Tue, May 5, 2020 at 4:28 PM hritikesh semwal <hritikesh.semwal at gmail.com>
wrote:

> Thanks for your response.
>
> Yes, you are right. I put a barrier just before the Allreduce, and of
> the total time previously consumed by Allreduce, 79% is now consumed by
> the barrier. But my computational work is balanced. Right now, I have
> distributed 97336 cells among 24 processors, and the maximum and minimum
> cell counts per processor are 4057 and 4055 respectively, which is not
> too bad. Is there any way to get rid of this?
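
As a rough sanity check on the figures quoted above, the arithmetic can be written out as below. This is only a sketch: it assumes that the 79% applies to the whole 27-30% Allreduce share reported earlier in the thread and that every cell costs the same to compute, neither of which the thread confirms. The variable names are illustrative.

```python
# Figures taken from the thread; variable names are illustrative.
cells, ranks = 97336, 24
max_cells, min_cells = 4057, 4055

mean_cells = cells / ranks
# Relative excess of the most loaded rank over the average load
# (assumes uniform cost per cell, which is an assumption).
imbalance = max_cells / mean_cells - 1.0
print(f"mean cells per rank: {mean_cells:.2f}")
print(f"cell-count imbalance: {imbalance:.4%}")

# Split the reported Allreduce share of total run time into
# waiting (now seen in the barrier) and the reduction itself.
barrier_fraction = 0.79
for allreduce_share in (0.27, 0.30):
    waiting = allreduce_share * barrier_fraction
    reduction = allreduce_share - waiting
    print(f"share {allreduce_share:.0%}: "
          f"waiting {waiting:.1%}, reduction {reduction:.1%}")
```

Under these assumptions the cell counts differ by well under 0.1%, while the waiting amounts to roughly 21-24% of the run. The cell count alone therefore does not account for the waiting. Something else would have to, for example uneven cost per cell, time spent in other communication, or system noise.
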
>
> On Tue, May 5, 2020 at 12:30 PM Joachim Protze <protze at itc.rwth-aachen.de>
> wrote:
>
>> Hello,
>>
>> It is important to understand that most of the time you see is not the
>> cost of the allreduce itself, but the cost of synchronization (caused by
>> load imbalance).
>>
>> You can do an easy experiment and add a barrier before the allreduce.
>> Then you will see the actual cost of the allreduce, while the cost of
>> synchronization will go into the barrier.
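
The effect of this experiment can be illustrated with a toy timing model. This is not MPI code. The arrival times and the fixed collective cost are made-up numbers, and the model assumes that the collective cannot finish on any rank before the last rank has entered it.

```python
def time_in_allreduce(arrivals, collective_cost):
    """Per-rank time measured around a blocking allreduce, no barrier.

    Each rank starts its timer when it arrives and stops it once the
    last rank has arrived and the collective itself has run.
    """
    finish = max(arrivals) + collective_cost
    return [finish - t for t in arrivals]


def time_with_barrier(arrivals, collective_cost):
    """Per-rank (barrier, allreduce) times when a barrier comes first."""
    last = max(arrivals)
    return [(last - t, collective_cost) for t in arrivals]


# Hypothetical arrival times in seconds for four ranks.
arrivals = [1.00, 1.02, 1.05, 1.30]
cost = 0.01

print(time_in_allreduce(arrivals, cost))
print(time_with_barrier(arrivals, cost))
```

In this model the early ranks charge their waiting to the allreduce when there is no barrier. With the barrier in place, the waiting is charged to the barrier and the allreduce timing shrinks to the collective cost alone.
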
>>
>> Now, think about the dependencies in your algorithm: do you need the
>> output value immediately? Is that also the point at which the input
>> value is ready?
>> -> If not, use non-blocking communication and perform independent work
>> in between.
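
The start / work / wait pattern suggested here can be sketched without MPI as below. This is a stand-in only: a Python thread plays the role of the reduction, and `slow_global_sum`, `iterate`, and the latency value are hypothetical names and numbers. In an MPI code the analogous calls would be MPI_Iallreduce to start the reduction and MPI_Wait to complete it.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def slow_global_sum(values, latency=0.05):
    """Stand-in for a reduction whose result arrives after some latency."""
    time.sleep(latency)
    return sum(values)


def iterate(local_errors, independent_work):
    """Start the reduction, do unrelated work, then wait for the sum."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the reduction without blocking (cf. MPI_Iallreduce).
        pending = pool.submit(slow_global_sum, local_errors)
        # Work that does not need the global sum runs in the meantime.
        side_result = independent_work()
        # Block only at the point where the sum is needed (cf. MPI_Wait).
        total = pending.result()
    return total, side_result


total, side = iterate([0.1, 0.2, 0.3], lambda: "next-step setup done")
print(total, side)
```

The overlap only helps if there really is work that does not depend on the reduced value. If the very next statement needs the sum, the wait happens immediately and nothing is gained.
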
>>
>> In any case: fix your load imbalance (the root cause of synchronization
>> cost).
>>
>> Best
>> Joachim
>>
>> On 05.05.20 at 07:38, hritikesh semwal via discuss wrote:
>> > Hello all,
>> >
>> > I am working on the development of a parallel CFD solver and I am
>> > using MPI_Allreduce for the global summation of the local errors
>> > calculated on all processes of a group; the sum is then used by all
>> > the processes. My concern is that MPI_Allreduce is taking almost
>> > 27-30% of the total time used, which is a significant amount. So, I
>> > want to ask if anyone can suggest a better alternative to
>> > MPI_Allreduce that would reduce the time consumption.
>> >
>> > Thank you.
>> >
>>
>>
>> --
>> Dipl.-Inf. Joachim Protze
>>
>> IT Center
>> Group: High Performance Computing
>> Division: Computational Science and Engineering
>> RWTH Aachen University
>> Seffenter Weg 23
>> D 52074  Aachen (Germany)
>> Tel: +49 241 80- 24765
>> Fax: +49 241 80-624765
>> protze at itc.rwth-aachen.de
>> www.itc.rwth-aachen.de
>>
>>