[mpich-devel] mpich3 error

Brent Morgan brent.taylormorgan at gmail.com
Sat Jan 16 19:37:12 CST 2021


Why isn't the sum all computed after doing a gather to the root? Is
there a limitation of 27 processes (7 devices) for gather? We thought a
collective could handle many more.

Best,
Brent

On Sat, Jan 16, 2021 at 2:18 PM Thakur, Rajeev <thakur at anl.gov> wrote:

> It uses a different algorithm and communicates much less data. The sum is
> not all computed after doing a gather to the root.
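>
> As a rough illustration only (this is not literally what MPICH does
> internally), a tree-based sum folds values together along the way, so
> the root receives about log2(P) partial sums instead of P separate
> values:
>
>     /* Sketch: binomial-tree sum of one double onto rank 0 (mpi.h assumed).
>        Each rank sends or receives at most one value per round. */
>     double tree_sum(double local, int rank, int size, MPI_Comm comm)
>     {
>         for (int mask = 1; mask < size; mask <<= 1) {
>             if (rank & mask) {               /* my turn to send, then I'm done */
>                 MPI_Send(&local, 1, MPI_DOUBLE, rank - mask, 0, comm);
>                 break;
>             } else if (rank + mask < size) { /* fold in a partial sum */
>                 double partial;
>                 MPI_Recv(&partial, 1, MPI_DOUBLE, rank + mask, 0, comm,
>                          MPI_STATUS_IGNORE);
>                 local += partial;
>             }
>         }
>         return local;                        /* only rank 0 holds the full sum */
>     }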
>
>
>
> *From: *Brent Morgan <brent.taylormorgan at gmail.com>
> *Date: *Saturday, January 16, 2021 at 2:58 PM
> *To: *"Thakur, Rajeev" <thakur at anl.gov>
> *Cc: *Robert Katona <robert.katona at hotmail.com>, "Zhou, Hui" <
> zhouh at anl.gov>, "devel at mpich.org" <devel at mpich.org>
> *Subject: *Re: [mpich-devel] mpich3 error
>
>
>
> We will try MPI_Reduce, which will improve our code, but it will not
> solve the underlying problem.
>
>
>
> Best,
>
> Brent
>
>
>
> On Sat, Jan 16, 2021 at 1:31 PM Thakur, Rajeev <thakur at anl.gov> wrote:
>
> Your mail all the way below says “We are using the MPI_Gather
> collective merely to calculate the sum of the results of N processes”.
> Why don’t you use MPI_Reduce instead?
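>
> Something along these lines (an untested sketch; it assumes the rank
> and communicator are already set up) replaces the gather plus the
> summing loop at the root:
>
>     /* each process contributes one value; its rank is used as a stand-in */
>     double local = (double)rank, total = 0.0;
>     MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
>     /* 'total' now holds the global sum, but only on rank 0 */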
>
>
>
> Rajeev
>
>
>
>
>
> *From: *Brent Morgan via devel <devel at mpich.org>
> *Reply-To: *"devel at mpich.org" <devel at mpich.org>
> *Date: *Saturday, January 16, 2021 at 1:38 PM
> *To: *"Zhou, Hui" <zhouh at anl.gov>
> *Cc: *Brent Morgan <brent.taylormorgan at gmail.com>, "devel at mpich.org" <
> devel at mpich.org>, Robert Katona <robert.katona at hotmail.com>
> *Subject: *Re: [mpich-devel] mpich3 error
>
>
>
> Hi Hui, Mpich community,
>
>
>
> Thanks for the response.  You're right; I'll provide a toy program that
> replicates the code structure (and results).  The toy program
> calculates a sum value from each process; the value itself isn't
> important for this toy program.  The timing, however, is the only thing
> that matters in our demonstration, and it exactly replicates what we
> are observing in our actual program.  This relates directly to the MPI
> functionality, and we can't figure out what the issue is.
>
>
>
> I have attached the code.  Is something wrong with our implementation?  It
> starts with the main() function.  Thank you very much for any help,
>
>
>
> Best,
>
> Brent
>
> PS: My subscription to discuss at mpich.org is currently pending.
>
>
>
> On Fri, Jan 15, 2021 at 10:43 PM Zhou, Hui <zhouh at anl.gov> wrote:
>
> Your description only mentions MPI_Gather.  If there is indeed a
> problem with MPI_Gather, then you should be able to reproduce the issue
> with a sample program.  Share it with us and we can assist you better.
> If you can't reproduce the issue with a simple example, then I suspect
> there are other problems that you are not able to fully describe.  We
> really can't help much without being able to see the code.
>
>
>
> That said, I am not even sure what issue you are describing.  A
> 100-process MPI_Gather will be slower than a 50-process MPI_Gather.
> And since it is a collective, if one of your processes is delayed by
> some computation or anything else, the whole collective will take
> longer to finish simply because it is waiting for the late process.
> You really need to tell us what your program is doing for us to even
> offer an intelligent guess.
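>
> For example, a toy timing test along these lines (just a sketch) shows
> how one late process stretches the collective for everyone, including
> the root, which does nothing extra itself:
>
>     #include <mpi.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <unistd.h>
>
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         int rank, size;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>         double val = (double)rank, *all = NULL;
>         if (rank == 0)
>             all = malloc(size * sizeof(double));
>         MPI_Barrier(MPI_COMM_WORLD);   /* line everyone up first */
>         double t0 = MPI_Wtime();
>         if (rank == size - 1)
>             sleep(1);                  /* pretend this rank is still computing */
>         MPI_Gather(&val, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0,
>                    MPI_COMM_WORLD);
>         printf("rank %d: %.3f s in/around MPI_Gather\n", rank,
>                MPI_Wtime() - t0);      /* rank 0 reports roughly 1 s too */
>         if (rank == 0)
>             free(all);
>         MPI_Finalize();
>         return 0;
>     }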
>
>
>
> --
> Hui Zhou
>
>
>
>
>
> *From: *Brent Morgan <brent.taylormorgan at gmail.com>
> *Date: *Friday, January 15, 2021 at 10:42 PM
> *To: *Zhou, Hui <zhouh at anl.gov>, discuss at mpich.org <discuss at mpich.org>
> *Cc: *Robert Katona <robert.katona at hotmail.com>
> *Subject: *Re: [mpich-devel] mpich3 error
>
> Hi MPICH community,
>
>
>
> My team downloaded mpich 3.3.2 (using ch3 as the default device) and
> implemented MPI, and for a small number of processes (<50) everything
> worked fine with our MPI implementation.  For >=50 processes, a ch3
> error crashed the program after a random amount of time (sometimes 10
> seconds, sometimes 100 seconds).  So we compiled mpich 3.3.2 with ch4
> (instead of the default ch3) using the `--with-device=ch4:ofi` flag,
> and this got rid of the error, but for >12 processes the speed would
> suddenly drop by 2x.
>
>
>
> At Hui's suggestion, we upgraded to mpich 3.4 and compiled it with the
> `--with-device=ch4:ofi` flag (ch4 is the default device for mpich 3.4).
> Everything worked fine until we hit 20 processes; at >=20 processes,
> the 2x slowdown happens again.
>
>
>
> We have tried one communicator and multiple communicators in an attempt
> to make the MPI implementation faster, but there is no significant
> difference in what we observe.  We are using the MPI_Gather collective
> merely to calculate the sum of the results of N processes (roughly the
> pattern sketched below), but we can't seem to maintain stability in MPI
> as we increase N.  Is there something we are missing that is ultimately
> causing this error?  We are at a loss here, thank you.
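>
> As a simplified sketch (not our actual code; the rank stands in for
> each process's computed result), the pattern is roughly:
>
>     #include <mpi.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         int rank, nprocs;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>         double local = (double)rank;   /* stand-in per-process result */
>         double *all = NULL;
>         if (rank == 0)
>             all = malloc(nprocs * sizeof(double)); /* one slot per process */
>         MPI_Gather(&local, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0,
>                    MPI_COMM_WORLD);
>         if (rank == 0) {
>             double total = 0.0;
>             for (int i = 0; i < nprocs; i++)
>                 total += all[i];       /* sum is computed only at the root */
>             printf("total = %f\n", total);
>             free(all);
>         }
>         MPI_Finalize();
>         return 0;
>     }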
>
>
>
> Best,
>
> Brent
>
>
>
>
>
>