[mpich-devel] mpich3 error

Zhou, Hui zhouh at anl.gov
Thu Jan 14 23:11:09 CST 2021


    “Our MPI implementation is merely finding the sum of the results of the N processes, where N is large. Is MPI_Reduce going to be faster?”

Oh, yeah, if you are doing a reduction, you should call `MPI_Reduce`.
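
Something along these lines (just a sketch, not your code; a stand-in value takes the place of each rank's computed result):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank;   /* stand-in for the rank's computed result */
    double total = 0.0;

    /* Every rank calls MPI_Reduce; only rank 0 receives the sum. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", total);

    MPI_Finalize();
    return 0;
}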

However, I suspect there may be some usage errors involved. Could you post some sample/pseudo code?

--
Hui Zhou


From: Brent Morgan <brent.taylormorgan at gmail.com>
Date: Thursday, January 14, 2021 at 11:02 PM
To: Robert Katona <robert.katona at hotmail.com>
Cc: Zhou, Hui <zhouh at anl.gov>, devel at mpich.org <devel at mpich.org>
Subject: Re: [mpich-devel] mpich3 error
Hi Hui Zhou,

Robert and I managed to incorporate multiple communicators; we use MPI_Gather() of Send/Receives. However, the issue remains: for a small number of threads (N<50), the calculations work and seem fine; for a large number of threads (N>=50), the issue persists. We will try compiling for ch4 tonight, but I wonder if we're doing something wrong. Should we do a Gather of Gathers?

Our MPI implementation is merely finding the sum of the results of the N processes, where N is large.  Is MPI_Reduce going to be faster?

Best,
Brent



On Thu, Jan 14, 2021 at 2:15 AM Robert Katona <robert.katona at hotmail.com> wrote:
Hi Hui Zhou,

Thanks for the findings. You are right, we used one communicator. My idea was also to recompile MPICH with ch4.

First I would like to try MPI_Comm_split() and use multiple communicators, but I am stuck on the implementation. In the attachment you can see a toy app where every node wants to share a float value with world rank 0.

But when I run the app with 20 processes I get the following output:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

But the expected output is:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,

What am I missing?
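
For reference, a minimal gather of one float per rank directly on MPI_COMM_WORLD (a sketch only, not the attached toy app) prints the expected sequence:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float value = (float)rank;                   /* each node's value */
    float *all = NULL;
    if (rank == 0)
        all = malloc(size * sizeof(float));      /* receive buffer, root only */

    /* Every rank sends one float; world rank 0 receives all of them. */
    MPI_Gather(&value, 1, MPI_FLOAT, all, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++)
            printf("%g, ", all[i]);
        printf("\n");
        free(all);
    }

    MPI_Finalize();
    return 0;
}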

Regards,

Robert



________________________________
From: Zhou, Hui <zhouh at anl.gov>
Sent: Wednesday, January 13, 2021 17:39
To: Robert Katona <robert.katona at hotmail.com>; devel at mpich.org <devel at mpich.org>
Cc: Brent Morgan <brent.taylormorgan at gmail.com>
Subject: Re: [mpich-devel] mpich3 error


Hi Robert,



Were you running MPI_Gather in multiple threads concurrently on the same communicator?

That is not allowed. You’ll need at least a different communicator for each thread.

If that was not the issue, could you try compiling MPICH with ch4, configured with `--with-device=ch4:ofi`, assuming you are using the latest release?
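
For example, something like this (a rough sketch, assuming MPI_THREAD_MULTIPLE and pthreads, not based on your code): each thread gets its own duplicate of MPI_COMM_WORLD, so concurrent collectives never share a communicator.

#include <mpi.h>
#include <pthread.h>

#define NTHREADS  4
#define MAX_RANKS 64   /* sketch assumes at most this many ranks */

static void *worker(void *arg)
{
    MPI_Comm comm = *(MPI_Comm *)arg;   /* this thread's private communicator */
    float send = 1.0f, recv[MAX_RANKS];
    /* Concurrent gathers are fine here because each thread uses its own comm. */
    MPI_Gather(&send, 1, MPI_FLOAT, recv, 1, MPI_FLOAT, 0, comm);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    /* real code should verify provided >= MPI_THREAD_MULTIPLE */

    MPI_Comm comms[NTHREADS];
    pthread_t tids[NTHREADS];

    /* MPI_Comm_dup is collective: every rank duplicates in the same order. */
    for (int i = 0; i < NTHREADS; i++)
        MPI_Comm_dup(MPI_COMM_WORLD, &comms[i]);

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, &comms[i]);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);

    for (int i = 0; i < NTHREADS; i++)
        MPI_Comm_free(&comms[i]);
    MPI_Finalize();
    return 0;
}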



--
Hui Zhou





From: Robert Katona <robert.katona at hotmail.com>
Date: Wednesday, January 13, 2021 at 10:34 AM
To: Zhou, Hui <zhouh at anl.gov>, devel at mpich.org <devel at mpich.org>
Cc: Brent Morgan <brent.taylormorgan at gmail.com>
Subject: Re: [mpich-devel] mpich3 error

Hello Hui Zhou,



Can you help us brainstorm?



Unfortunately we cannot share the code. But I can give you more details.



A couple of computers are connected to the same network, and we would like to run a distributed calculation on them. During testing we run the MPI application with 50 job executor processes/threads and one job evaluator master/root thread. Each job calculates two float values, which the master collects. To collect this data we use the MPI_Gather() function. The execution runs fine but randomly stops with the error message Brent sent to you.



Sometimes it stops after 10 iterations, sometimes after 50 or 70. We run the same calculation with the same input, but when the error occurs is very random. It always stops in the MPI_Gather() function, when the master tries to collect the data from all of the jobs.
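
The pattern is roughly the following (a simplified sketch with stand-in values, not our actual code):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float *results = NULL;
    if (rank == 0)
        results = malloc(2 * size * sizeof(float));   /* two floats per job */

    for (int iter = 0; iter < 100; iter++) {
        float job[2] = { (float)rank, (float)iter };  /* stand-in for the job's two results */
        /* The random abort happens inside this call. */
        MPI_Gather(job, 2, MPI_FLOAT, results, 2, MPI_FLOAT, 0, MPI_COMM_WORLD);
        /* rank 0 evaluates results[0 .. 2*size-1] here */
    }

    if (rank == 0)
        free(results);
    MPI_Finalize();
    return 0;
}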



If there is a network issue with any of the computers, could that produce this error?



We are using MPICH 3.3.2; the release notes for 3.4 say the network communication device has changed from ch3 to ch4. Can we expect better behavior with ch4?



Can you give us any hints on where to look and what to check?



Regards,



Robert






________________________________

From: Zhou, Hui <zhouh at anl.gov>
Sent: Wednesday, January 13, 2021 15:51
To: devel at mpich.org
Cc: Brent Morgan; Robert Katona
Subject: Re: [mpich-devel] mpich3 error



Hi Brent,



Unfortunately, unless you can provide us with a reproducer, there is little we can do to find the issue.



--
Hui Zhou





From: Brent Morgan via devel <devel at mpich.org>
Date: Wednesday, January 13, 2021 at 3:20 AM
To: devel at mpich.org <devel at mpich.org>
Cc: Brent Morgan <brent.taylormorgan at gmail.com>, Robert Katona <robert.katona at hotmail.com>
Subject: [mpich-devel] mpich3 error

Hello mpich dev support,

I am receiving the following error from my MPI implementation when I use 110+ threads.

Assertion failed in file src/mpid/ch3/channels/nemesis/src/ch3_progress.c at line 567: !vc_ch->recv_active
0x7f7f870937 ???
???:0
0x7f7f870957 ???
???:0
0x7f7f8089cf ???
???:0
0x7f7f854983 ???
???:0
0x7f7f75a4f3 ???
???:0
0x7f7f7cd8bf ???
???:0
0x7f7f7cde0f ???
???:0
0x7f7f78d577 ???
???:0
0x7f7f6ea6cb ???
???:0
0x7f7f6ea8f3 ???
???:0
0x7f7f78c7ff ???
???:0
0x7f7f6e9e6f ???
???:0
0x7f7f6e9eb7 ???
???:0
0x7f7f6ea033 ???
???:0
0x55793fe47b ???
???:0
0x55793f96b3 ???
???:0
0x7f7f29908f ???
???:0
0x55793f9a9b ???
???:0
internal ABORT - process 171

Does this give a clue as to what I may be doing wrong?  Thank you,



Best,

Brent



