[mpich-devel] mpich3 error
Thakur, Rajeev
thakur at anl.gov
Sat Jan 16 15:18:05 CST 2021
It uses a different algorithm and communicates much less data. The sum is not computed entirely at the root after a gather.
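
To make the difference concrete, here is a minimal Python model (not MPI code) that counts how many values the root has to receive in each approach. It assumes one value per process and a binomial-tree reduction; the function names are hypothetical, and the algorithm a real MPI library picks for MPI_Reduce or MPI_Gather depends on message size and process count, so treat this as a sketch of the idea rather than a description of MPICH internals.

```python
def gather_then_sum(values):
    """Model of gather + local sum: the root (rank 0) ends up holding
    every other rank's value and does all the additions itself."""
    root_received = len(values) - 1
    return sum(values), root_received


def tree_reduce(values):
    """Model of a binomial-tree sum: in each round, rank r + step sends
    its partial sum to rank r, so partial sums are combined on the way."""
    partial = list(values)
    n = len(partial)
    root_received = 0
    step = 1
    while step < n:
        for r in range(0, n, 2 * step):
            src = r + step
            if src < n:
                partial[r] += partial[src]
                if r == 0:
                    root_received += 1  # one partial sum arrives at the root
        step *= 2
    return partial[0], root_received


if __name__ == "__main__":
    values = list(range(100))       # one value per simulated process
    print(gather_then_sum(values))  # (4950, 99)
    print(tree_reduce(values))      # (4950, 7)
```

With 100 simulated processes, the root receives 99 values in the gather model but only 7 partial sums in the tree model, and the additions are spread across processes instead of all landing on the root.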
From: Brent Morgan <brent.taylormorgan at gmail.com>
Date: Saturday, January 16, 2021 at 2:58 PM
To: "Thakur, Rajeev" <thakur at anl.gov>
Cc: Robert Katona <robert.katona at hotmail.com>, "Zhou, Hui" <zhouh at anl.gov>, "devel at mpich.org" <devel at mpich.org>
Subject: Re: [mpich-devel] mpich3 error
We will try MPI_Reduce, which will improve our code, but it will not solve the underlying problem.
Best,
Brent
On Sat, Jan 16, 2021 at 1:31 PM Thakur, Rajeev <thakur at anl.gov> wrote:
Your mail all the way below says “We are using MPI_Gather collector for merely calculating the sum of the result of N processes”. Why don’t you use MPI_Reduce instead then?
Rajeev
From: Brent Morgan via devel <devel at mpich.org>
Reply-To: "devel at mpich.org" <devel at mpich.org>
Date: Saturday, January 16, 2021 at 1:38 PM
To: "Zhou, Hui" <zhouh at anl.gov>
Cc: Brent Morgan <brent.taylormorgan at gmail.com>, "devel at mpich.org" <devel at mpich.org>, Robert Katona <robert.katona at hotmail.com>
Subject: Re: [mpich-devel] mpich3 error
Hi Hui, Mpich community,
Thanks for the response. You're right, I'll provide a toy program that replicates the code structure (and results). The toy program calculates a sum over the values from each process; the value itself isn't important here. Only the timing matters in this demonstration, and it exactly replicates what we observe in our actual program. This relates directly to MPI functionality, and we can't work out what the issue is.
[Inline image: image001.png]
I have attached the code. Is something wrong with our implementation? It starts with the main() function. Thank you very much for any help,
Best,
Brent
PS: My subscription to discuss at mpich.org is currently pending.
On Fri, Jan 15, 2021 at 10:43 PM Zhou, Hui <zhouh at anl.gov> wrote:
Your description only mentions MPI_Gather. If there is indeed a problem with MPI_Gather, then you should be able to reproduce the issue with a sample program. Share it with us and we can better assist you. If you can’t reproduce the issue with a simple example, then I suspect there are other problems that you are not able to fully describe. We really can’t help much without being able to see the code.
That said, I am not even sure what issue you are describing. A 100-process MPI_Gather will be slower than a 50-process MPI_Gather. And since it is a collective, if one of your processes is delayed by some computation or something else, the whole collective will take longer to finish simply because it has to wait for the late process. You really need to tell us what your program is doing for us to even offer an intelligent guess.
--
Hui Zhou
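
The point about a late process can be sketched with a small Python timing model (again not MPI code). It assumes the root can only finish once the last process has arrived, plus a fixed transfer cost; the function name and the numbers are hypothetical and are not measurements from the program discussed in this thread.

```python
def root_wait_profile(arrival_times, transfer_cost):
    """Return (finish_time, time_to_finish_per_rank) for a collective whose
    root cannot complete before the slowest process has arrived."""
    finish = max(arrival_times) + transfer_cost
    # How long each rank's contribution sits there until the collective ends.
    time_to_finish = [finish - t for t in arrival_times]
    return finish, time_to_finish


if __name__ == "__main__":
    # Three processes are ready at t=0; one is delayed by computation until t=5.
    print(root_wait_profile([0, 0, 0, 5], 1))  # (6, [6, 6, 6, 1])
```

In this model a single straggler sets the completion time for everyone, so timing each process's arrival at the collective separately from the collective itself is one way to tell a slow MPI_Gather apart from load imbalance.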
From: Brent Morgan <brent.taylormorgan at gmail.com>
Date: Friday, January 15, 2021 at 10:42 PM
To: Zhou, Hui <zhouh at anl.gov>, discuss at mpich.org <discuss at mpich.org>
Cc: Robert Katona <robert.katona at hotmail.com>
Subject: Re: [mpich-devel] mpich3 error
Hi MPICH community,
My team downloaded mpich 3.3.2 (using the default ch3 device) and implemented MPI in our code. For a small number of processes (<50), everything worked fine. For >=50 processes, a ch3 error crashed the program after a random number of seconds (sometimes 10 seconds, sometimes 100 seconds). So we compiled mpich 3.3.2 with ch4 (instead of the default ch3) using the '--with-device=ch4:ofi' flag, and this got rid of the error, but for >12 processes the program would suddenly run 2x slower.
Upon Hui's suggestion, we upgraded to mpich 3.4 and compiled with the '--with-device=ch4:ofi' flag (ch4 is the default for mpich 3.4). Everything worked fine until we hit 20 processes; at >=20 processes, the 2x slowdown happens again.
We have tried 1 communicator and multiple communicators in an attempt to make the MPI implementation faster, but there's no significant difference in observations. We are using MPI_Gather collector for merely calculating the sum of the result of N processes, but we can't seem to maintain stability within MPI as we increase N processes. Is there something we are missing that is ultimately causing this error? We are at a loss here, thank you.
Best,
Brent
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 87385 bytes
Desc: image001.png
URL: <http://lists.mpich.org/pipermail/devel/attachments/20210116/2a0bc7cd/attachment-0001.png>