[mpich-devel] mpich3 error

Brent Morgan brent.taylormorgan at gmail.com
Thu Jan 14 23:01:56 CST 2021


Hi Hui Zhou,

Robert and I managed to incorporate multiple communicators: we use
MPI_Gather() of Send/Receives.  However, the issue remains: for a small
number of threads (N<50), the calculations work and seem fine.  For a large
number of threads (N>=50), the issue persists.  We will try compiling for ch4
tonight, but I wonder if we are doing something wrong.  Should we do a
Gather of Gathers?

Our MPI implementation merely finds the sum of the results of the N
processes, where N is large.  Is MPI_Reduce going to be faster?
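
For reference, here is a minimal sketch of the MPI_Reduce pattern we have in
mind; local_result is just a stand-in for whatever each process actually
computes:

/* Minimal sketch: summing one float per rank onto rank 0 with MPI_Reduce.
   local_result is a placeholder for the real per-process result. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float local_result = (float) rank;   /* placeholder value */
    float total = 0.0f;

    /* Combines every rank's local_result with MPI_SUM; only rank 0 gets the total. */
    MPI_Reduce(&local_result, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", total);

    MPI_Finalize();
    return 0;
}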

Best,
Brent



On Thu, Jan 14, 2021 at 2:15 AM Robert Katona <robert.katona at hotmail.com>
wrote:

> Hi Hui Zhou,
>
> Thanks for the findings. You are right, we used one communicator. My
> idea was also to recompile MPICH with ch4.
>
> First I would like to try MPI_Comm_split() and use multiple communicators,
> but I am stuck on the implementation. In the attachment you can see a toy app
> where every node wants to share a float value with world rank 0.
>
> But when I run the app with 20 processes I get the following output:
> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>
> The expected output is:
> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
>
> What am I missing?
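>
> For context, roughly what I am trying to achieve is a two-level gather:
> gather within each split communicator to its local rank 0, then forward
> those blocks to world rank 0. The attachment may differ; the group size
> of 10 and the variable names below are assumptions.
>
> /* Hypothetical sketch of a two-level gather over split communicators. */
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
>
>     int wrank, wsize;
>     MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
>     MPI_Comm_size(MPI_COMM_WORLD, &wsize);
>
>     const int group_size = 10;            /* assumed group size */
>     int color = wrank / group_size;       /* which sub-communicator */
>
>     MPI_Comm subcomm;
>     MPI_Comm_split(MPI_COMM_WORLD, color, wrank, &subcomm);
>
>     int srank, ssize;
>     MPI_Comm_rank(subcomm, &srank);
>     MPI_Comm_size(subcomm, &ssize);
>
>     float value = (float) wrank;          /* each rank's contribution */
>
>     /* Step 1: gather within each sub-communicator to its local rank 0. */
>     float *group_vals = NULL;
>     if (srank == 0)
>         group_vals = malloc(ssize * sizeof(float));
>     MPI_Gather(&value, 1, MPI_FLOAT, group_vals, 1, MPI_FLOAT, 0, subcomm);
>
>     /* Step 2: the local roots forward their blocks to world rank 0.
>        A gather on subcomm alone never delivers the other groups' values
>        to world rank 0, which would explain trailing zeros. */
>     float *all_vals = NULL;
>     if (wrank == 0)
>         all_vals = malloc(wsize * sizeof(float));
>
>     if (srank == 0 && wrank != 0)
>         MPI_Send(group_vals, ssize, MPI_FLOAT, 0, color, MPI_COMM_WORLD);
>
>     if (wrank == 0) {
>         int ngroups = (wsize + group_size - 1) / group_size;
>         for (int i = 0; i < ssize; i++)
>             all_vals[i] = group_vals[i];  /* world rank 0's own group */
>         for (int g = 1; g < ngroups; g++)
>             MPI_Recv(all_vals + g * group_size, group_size, MPI_FLOAT,
>                      MPI_ANY_SOURCE, g, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         for (int i = 0; i < wsize; i++)
>             printf("%g, ", all_vals[i]);
>         printf("\n");
>         free(all_vals);
>     }
>
>     if (srank == 0)
>         free(group_vals);
>     MPI_Comm_free(&subcomm);
>     MPI_Finalize();
>     return 0;
> }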
>
> Regards,
>
> Robert
>
>
>
> ------------------------------
> *From:* Zhou, Hui <zhouh at anl.gov>
> *Sent:* Wednesday, January 13, 2021 17:39
> *To:* Robert Katona <robert.katona at hotmail.com>; devel at mpich.org <
> devel at mpich.org>
> *Cc:* Brent Morgan <brent.taylormorgan at gmail.com>
> *Subject:* Re: [mpich-devel] mpich3 error
>
>
> Hi Robert,
>
>
>
> Were you running MPI_Gather in multiple threads concurrently on the same
> communicator?
>
> That is not allowed. You’ll need at least a different communicator for
> each thread.
>
> If that was not the issue, could you try compiling MPICH with ch4, using
> `--with-device=ch4:ofi`, assuming you are using the latest release?
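>
> A minimal sketch of what I mean, assuming MPI_THREAD_MULTIPLE and pthreads
> (your code will differ): duplicate MPI_COMM_WORLD once per thread with
> MPI_Comm_dup and have each thread use only its own copy.
>
> #include <mpi.h>
> #include <pthread.h>
> #include <stdio.h>
>
> #define NTHREADS 4
>
> static void *worker(void *arg)
> {
>     MPI_Comm comm = *(MPI_Comm *) arg;
>
>     /* Each thread can call collectives (e.g. MPI_Gather, MPI_Allreduce)
>        on its own communicator without racing against the other threads. */
>     float value = 1.0f, sum = 0.0f;
>     MPI_Allreduce(&value, &sum, 1, MPI_FLOAT, MPI_SUM, comm);
>     return NULL;
> }
>
> int main(int argc, char **argv)
> {
>     int provided;
>     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>     if (provided < MPI_THREAD_MULTIPLE) {
>         fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
>         MPI_Abort(MPI_COMM_WORLD, 1);
>     }
>
>     MPI_Comm comms[NTHREADS];
>     pthread_t threads[NTHREADS];
>
>     /* One duplicated communicator per thread; MPI_Comm_dup is collective,
>        so every process creates the same set of duplicates. */
>     for (int i = 0; i < NTHREADS; i++)
>         MPI_Comm_dup(MPI_COMM_WORLD, &comms[i]);
>
>     for (int i = 0; i < NTHREADS; i++)
>         pthread_create(&threads[i], NULL, worker, &comms[i]);
>     for (int i = 0; i < NTHREADS; i++)
>         pthread_join(threads[i], NULL);
>
>     for (int i = 0; i < NTHREADS; i++)
>         MPI_Comm_free(&comms[i]);
>     MPI_Finalize();
>     return 0;
> }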
>
>
>
> --
> Hui Zhou
>
>
>
>
>
> *From: *Robert Katona <robert.katona at hotmail.com>
> *Date: *Wednesday, January 13, 2021 at 10:34 AM
> *To: *Zhou, Hui <zhouh at anl.gov>, devel at mpich.org <devel at mpich.org>
> *Cc: *Brent Morgan <brent.taylormorgan at gmail.com>
> *Subject: *Re: [mpich-devel] mpich3 error
>
> Hello Hui Zhou,
>
>
>
> Can you help us brainstorm?
>
>
>
> Unfortunately we cannot share the code. But I can give you more details.
>
>
>
> A couple of computers are connected to the same network, and we would like
> to run a distributed calculation. During testing we run the MPI application
> with 50 job executor processes/threads and one job evaluator master/root
> thread. Each job calculates two float values, which the master collects. To
> distribute this data we use the MPI_Gather() function. The execution of the
> problem runs fine but randomly stops with the error message that Brent sent
> to you.
>
>
>
> Sometimes it stops after 10 iterative calculations, sometimes after 50 or
> 70. We do the same calculation with the same input, but the occurrence of the
> error is very random. It always stops in the MPI_Gather() function, when the
> master tries to collect the data from all of the jobs.
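>
> For illustration, a minimal hypothetical sketch of that per-iteration gather
> (the loop count and values are invented; the real code is not shown):
>
> #include <mpi.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
>     int rank, size;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     float *all = NULL;
>     if (rank == 0)
>         all = malloc(2 * size * sizeof(float));
>
>     for (int iter = 0; iter < 100; iter++) {
>         float local[2] = { 1.0f * iter, 2.0f * iter };  /* stand-in for the job result */
>
>         /* Two MPI_FLOATs per rank, received contiguously in rank order on rank 0.
>            Every rank must reach this call in every iteration, or the root blocks. */
>         MPI_Gather(local, 2, MPI_FLOAT, all, 2, MPI_FLOAT, 0, MPI_COMM_WORLD);
>     }
>
>     free(all);
>     MPI_Finalize();
>     return 0;
> }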
>
>
>
> If there is a network issue with any of the computers, can it produce this
> error?
>
>
>
> We are using MPICH 3.3.2; the release notes for 3.4 say that the
> network communication layer has changed from ch3 to ch4. With ch4 can we
> expect better behavior?
>
>
>
> Can you give us any hint about where to look and what to check?
>
>
>
> Regards,
>
>
>
> Robert
>
>
>
> Get Outlook for Android <https://aka.ms/ghei36>
>
>
> ------------------------------
>
> *From:* Zhou, Hui <zhouh at anl.gov>
> *Sent:* Wednesday, January 13, 2021 15:51
> *To:* devel at mpich.org
> *Cc:* Brent Morgan; Robert Katona
> *Subject:* Re: [mpich-devel] mpich3 error
>
>
>
> Hi Brent,
>
>
>
> Unfortunately, unless you can provide us with a reproducer, there is
> little we can do to find the issue.
>
>
>
> --
> Hui Zhou
>
>
>
>
>
> *From: *Brent Morgan via devel <devel at mpich.org>
> *Date: *Wednesday, January 13, 2021 at 3:20 AM
> *To: *devel at mpich.org <devel at mpich.org>
> *Cc: *Brent Morgan <brent.taylormorgan at gmail.com>, Robert Katona <
> robert.katona at hotmail.com>
> *Subject: *[mpich-devel] mpich3 error
>
> Hello mpich dev support,
>
> I am receiving the following error in my MPI implementation when I use
> 110+ threads.
>
> Assertion failed in file src/mpid/ch3/channels/nemesis/src/ch3_progress.c
> at line 567: !vc_ch->recv_active
> 0x7f7f870937 ???
> ???:0
> 0x7f7f870957 ???
> ???:0
> 0x7f7f8089cf ???
> ???:0
> 0x7f7f854983 ???
> ???:0
> 0x7f7f75a4f3 ???
> ???:0
> 0x7f7f7cd8bf ???
> ???:0
> 0x7f7f7cde0f ???
> ???:0
> 0x7f7f78d577 ???
> ???:0
> 0x7f7f6ea6cb ???
> ???:0
> 0x7f7f6ea8f3 ???
> ???:0
> 0x7f7f78c7ff ???
> ???:0
> 0x7f7f6e9e6f ???
> ???:0
> 0x7f7f6e9eb7 ???
> ???:0
> 0x7f7f6ea033 ???
> ???:0
> 0x55793fe47b ???
> ???:0
> 0x55793f96b3 ???
> ???:0
> 0x7f7f29908f ???
> ???:0
> 0x55793f9a9b ???
> ???:0
> internal ABORT - process 171
>
> Does this give a clue as to what I may be doing wrong?  Thank you,
>
>
>
> Best,
>
> Brent
>
>
>

