<!DOCTYPE html><html><head><title></title><style type="text/css">p.MsoNormal,p.MsoNoSpacing{margin:0}</style></head><body><div style="font-family:Arial;"><br></div><blockquote type="cite" id="qt" style=""><div dir="ltr"><div class="qt-gmail_quote"><blockquote class="qt-gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204, 204, 204);padding-left:1ex;"><div> > <br></div><div> > Hi Hitesh,<br></div><div> > <br></div><div> > What hardware are you running on and what is the interconnect?<br></div></blockquote><div> <br></div><div>Right now I am using a cluster.<br></div></div></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">What is the interconnect?</div><blockquote type="cite" id="qt" style=""><div dir="ltr"><div class="qt-gmail_quote"><div> <br></div><blockquote class="qt-gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204, 204, 204);padding-left:1ex;">> Have you tried changing any of the MPI settings?<br></blockquote><div> <br></div><div>What do you mean by MPI settings?<br></div></div></div></blockquote><div style="font-family:Arial;">Given your comment on the barrier, this is probably not so useful at the moment.</div><blockquote type="cite" id="qt" style=""><div dir="ltr"><div class="qt-gmail_quote"><div> <br></div><blockquote class="qt-gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204, 204, 204);padding-left:1ex;">> Can the reduction be done asynchronously?<br></blockquote><div> <br></div><div>I did not understand your question.<br></div></div></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">For example, using a non-blocking allreduce (MPI_Iallreduce):<br></div><div
style="font-family:Arial;"><a href="https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node135.htm">https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node135.htm</a><br></div><div style="font-family:Arial;"><br></div><blockquote type="cite" id="qt" style=""><div dir="ltr"><div class="qt-gmail_quote"><div> <br></div><blockquote class="qt-gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204, 204, 204);padding-left:1ex;"><div>> <br></div><div> > Regards,<br></div><div> > Benson<br></div><div> <br></div><div> Also, is your workload balanced? One way to check this might be to place a barrier just before the all-reduce call. If the barrier ends up taking most of your time, then it is likely you will need to determine a better way to distribute the computational work.<br></div></blockquote><div><br></div><div> Thanks for your response.<br></div><div><br></div><div>Yes, you are right. I have put a barrier just before Allreduce, and out of the total time consumed by Allreduce, 79% is consumed by the barrier. But my computational work is balanced: I have distributed 97336 cells among 24 processors, and the maximum and minimum number of cells on any processor are 4057 and 4055 respectively, which is not too bad. Is there any way to get rid of this?<br></div></div></div></blockquote><div style="font-family:Arial;"><br></div></body></html>
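One common way to put a number on "balanced" is the imbalance metric max/mean - 1, where 0 means perfectly even. The sketch below is a minimal illustration, assuming that metric; the helper names `imbalance` and `imbalance_from_totals` are hypothetical. Applied to the cell counts quoted above (97336 cells, 24 ranks, at most 4057 per rank) it gives roughly 0.03%, so the partition by cell count is essentially even. A large barrier share, however, suggests the ranks still reach the reduction at different times, for instance if the cost per cell differs between ranks. Applying the same metric to measured per-rank compute times, rather than to cell counts, would show this. The timing list in the sketch is made up for illustration and is not measured data.

```python
def imbalance(per_rank):
    """Load imbalance as max/mean - 1 (0.0 means perfectly balanced).

    Hypothetical helper for illustration; works on any per-rank quantity,
    e.g. cell counts or measured compute times in seconds.
    """
    mean = sum(per_rank) / len(per_rank)
    return max(per_rank) / mean - 1.0


def imbalance_from_totals(max_value, total, nranks):
    """Same metric when only the maximum and the total are known."""
    return max_value / (total / nranks) - 1.0


# Figures from the thread: 97336 cells over 24 ranks, at most 4057 per rank.
cells = imbalance_from_totals(4057, 97336, 24)
print(f"cell-count imbalance: {100 * cells:.3f}%")

# Hypothetical per-rank compute times (seconds), NOT measured data:
# equal cell counts can still give unequal times if cost per cell varies.
times = [1.0, 1.0, 1.0, 2.0]
print(f"time imbalance: {100 * imbalance(times):.3f}%")
```

In a real run, the per-rank times could be taken with MPI_Wtime around the compute phase only (excluding the barrier and the Allreduce), then collected on one rank with MPI_Gather, or reduced directly with MPI_Allreduce using MPI_MAX and MPI_SUM.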