[mpich-discuss] Better alternative to MPI_Allreduce() and avoiding deadlock with MPI_Neighbor_alltoallw().

Wed May 6 01:21:12 CDT 2020

On Tue, May 5, 2020, at 10:19 PM, hritikesh semwal via discuss wrote:
> 
> 
> On Tue, 5 May, 2020, 10:30 PM , <discuss-request at mpich.org> wrote:
>> 
>>  > 1. I am using MPI_Neighbor_alltoallw() to exchange data over a distributed graph topology communicator. Most of the time my code works fine, but sometimes it seems to go into deadlock (it stops showing any output). MPI_Neighbor_alltoallw uses MPI_Waitall internally, so I do not understand why exactly this is happening.
>>  >> 
>>  >> You may want to check that you are sending and receiving the correct data. Perhaps also try MPI_Neighbor_alltoallw.
>>  >> 
>>  >> > 2. Is it possible that the time the processors take to complete the task varies from run to run? For example, for one run all processors take around 100 seconds and for another run, all processors take 110 seconds.
>>  >> 
>>  >> There is usually some variability. Do you solve the same system each time? What is the method of solution? If your code is available it can sometimes be easier to give suggestions.
>>  >> 
>>  > Yes, the system of equations is the same. I am using the finite volume method for solving the Navier-Stokes equations. By your first sentence, do you mean that such variation is possible?
>> 
>>  Is the method implicit or explicit?
> 
>> 
> It's an explicit method.

Ok

>> 
>> >
>>  >> > 
>>  >> > Please help in above two matters.
>>  >> > 
>>  >> > On Tue, May 5, 2020 at 4:28 PM hritikesh semwal <hritikesh.semwal at gmail.com> wrote:
>>  >> >> Thanks for your response.
>>  >> >> 
>>  >> >> Yes, you are right. I have put a barrier just before Allreduce, and out of the total time consumed by Allreduce, 79% is consumed by the barrier. But my computational work is balanced. Right now, I have distributed 97336 cells among 24 processors, and the maximum and minimum cell counts among all processors are 4057 and 4055 respectively, which is not too bad. Is there any solution to get rid of this?
>>  >> 
>>  >> Try profiling your code, not just looking at the cell distribution. Are any profiling tools already installed on your cluster?
>>  > 
>>  > gprof and valgrind are there.
>> 
>>  While not ideal, GPROF may be helpful. Perhaps initially try running on 12 processors. With GPROF you will get 12 files to examine. Check if all subroutines take similar times on each processor. You can also time the subroutines individually using MPI_WTIME to get the same information.
> 
>> 
> Yes, I have already timed my code before posting this question. I will try with gprof.
> 

Great. Some documentation on Gprof:
http://shwina.github.io/2014/11/profiling-parallel
https://cluster.earlham.edu/wiki/index.php/Cluster:Gprof
https://portal.tacc.utexas.edu/documents/13601/1041435/29-Overview_of_Profiling.pdf/84359111-d21a-4618-9d90-ca878c1e37ab
https://hpc.llnl.gov/software/development-environment-software/gprof
https://support.pawsey.org.au/documentation/display/US/Profiling+with+gprof
https://stackoverflow.com/questions/39041871/missing-function-from-gprof-output
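
For the check suggested earlier (whether a subroutine takes similar time on each processor), here is a minimal sketch in C of how the per-processor timings could be compared. It assumes each rank has measured the elapsed time of one subroutine with MPI_Wtime() and that the values have been collected on one rank, for example with MPI_Gather. The names timing_summary and summarize_timings are hypothetical, and the MPI calls are left out so that the snippet compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of one timed region across all ranks (hypothetical helper type). */
typedef struct {
    double min_time;   /* fastest rank, in seconds */
    double max_time;   /* slowest rank, in seconds */
    double mean_time;  /* average over all ranks */
    double imbalance;  /* max/mean - 1; 0.0 means perfectly balanced */
} timing_summary;

/* times[i] is assumed to hold the elapsed time measured on rank i. */
timing_summary summarize_timings(const double *times, size_t n)
{
    assert(times != NULL && n > 0);

    timing_summary s = { times[0], times[0], 0.0, 0.0 };
    double sum = 0.0;

    for (size_t i = 0; i < n; ++i) {
        if (times[i] < s.min_time) s.min_time = times[i];
        if (times[i] > s.max_time) s.max_time = times[i];
        sum += times[i];
    }

    s.mean_time = sum / (double)n;
    if (s.mean_time > 0.0) {
        /* Fraction by which the slowest rank exceeds the average. */
        s.imbalance = s.max_time / s.mean_time - 1.0;
    }
    return s;
}
```

If the cell counts are balanced but the imbalance for some subroutine is well above zero, that subroutine may be a candidate for the time the other processors spend waiting at the barrier before the Allreduce.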

>> 
>> Also, try not to reply to the digest, or if you do, change the subject of the message. This is useful in deciding what to read.
> 
>> 
> Is it fine this time? I have changed the subject line. Is that what you want to say?

That is ok if you cannot reply to the message directly.