[mpich-discuss] Better alternatives of MPI_Allreduce()

Benson Muite benson_muite at emailplus.org
Tue May 5 06:57:39 CDT 2020



On Tue, May 5, 2020, at 2:46 PM, hritikesh semwal wrote:
> 
> 
> On Tue, May 5, 2020 at 4:51 PM Benson Muite <benson_muite at emailplus.org> wrote:
>> 
>> 
>>>> > 
>>>> > Hi Hitesh,
>>>> > 
>>>> > What hardware are you running on and what is the interconnect?
>>> 
>>> Right now I am using a cluster.
>> 
>> What is the interconnect?
> 
> I don't know about this. Is it relevant?

It can affect performance, but I expect it is not the most important factor on 24 processors. The most common interconnect is InfiniBand (https://en.wikipedia.org/wiki/InfiniBand).

> 
>>> 
>>>> > Have you tried changing any of the MPI settings?
>>> 
>>> What do you mean by MPI settings?
>> Given your comment on the barrier, this is probably not so useful at the moment.
>>> 
>>>> > Can the reduction be done asynchronously?
>>> 
>>> I did not get your question.
>> 
>> For example, using a non-blocking all-reduce:
>> https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node135.htm
>> 
> 
> I tried using a non-blocking call, but after this change the code is not working correctly. 

OK. Change back to the blocking call. It is likely you have poor load balancing.
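
If you do want to try the non-blocking call again later, the usual pattern is MPI_Iallreduce followed by MPI_Wait before the result is read. A minimal sketch, where the variable names are only placeholders and not from your code:

/* Sketch only: start the reduction, overlap other work, and only read
   the result after MPI_Wait completes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_residual = 1.0 / (rank + 1);  /* placeholder local value */
    double global_residual = 0.0;
    MPI_Request req;

    /* Start the reduction; do not read global_residual yet. */
    MPI_Iallreduce(&local_residual, &global_residual, 1, MPI_DOUBLE,
                   MPI_SUM, MPI_COMM_WORLD, &req);

    /* ... computation that does not need the result can overlap here ... */

    /* The reduced value is only defined after the wait completes. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("global residual = %f\n", global_residual);

    MPI_Finalize();
    return 0;
}

Reading the result buffer before the wait completes is a common mistake with the non-blocking collectives, and may be why your code stopped working.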

>> 
>>> 
>>>> > 
>>>> > Regards,
>>>> > Benson
>>>> 
>>>> Also, is your work load balanced? One way to check this might be to place a barrier just before the all-reduce call. If the barrier ends up taking most of your time, then it is likely you will need to determine a better way to distribute the computational work.
>>> 
>>>  Thanks for your response.
>>> 
>>> Yes, you are right. I have put a barrier just before Allreduce, and of the total time consumed by Allreduce, 79% is consumed by the barrier. But my computational work is balanced: right now I have distributed 97336 cells among 24 processors, and the maximum and minimum cell counts per processor are 4057 and 4055 respectively, which is not too bad. Is there any solution to get rid of this?
> Please help me in this regard. 

If you cannot profile your code, time the section before the all-reduce on each processor using MPI_Wtime and check whether it is even across all 24 processors. Even though the cell counts are nearly equal, the work per cell may not be, so the measured times can still differ. If using more processors, you will likely want a profiling tool, but if you expect to run on only about 24 processors, setting up a profiling tool (if one is not already available) may take more time than it is worth.
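
A minimal sketch of what I mean, where compute_on_local_cells() is just a placeholder for your own computation:

/* Sketch only: time the work before the all-reduce on each rank with
   MPI_Wtime and reduce the per-rank times to see the spread.
   compute_on_local_cells() is a placeholder for your own routine. */
#include <mpi.h>
#include <stdio.h>

void compute_on_local_cells(void) { /* your per-cell work goes here */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    compute_on_local_cells();              /* the section before the all-reduce */
    double elapsed = MPI_Wtime() - t0;

    /* Collect the fastest and slowest rank times on rank 0. */
    double tmax, tmin;
    MPI_Reduce(&elapsed, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&elapsed, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("compute time before all-reduce: min %f s, max %f s\n", tmin, tmax);

    MPI_Finalize();
    return 0;
}

If the maximum and minimum times differ a lot, the load is not as balanced as the cell counts suggest.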