[mpich-devel] mpich3 error
zhouh at anl.gov
Wed Jan 13 10:39:27 CST 2021
Were you running MPI_Gather in multiple threads concurrently on the same communicator?
That is not allowed. You’ll need at least a different communicator for each thread.
If that is not the issue, could you try compiling MPICH with ch4, using `--with-device=ch4:ofi`, assuming you are using the latest release?
From: Robert Katona <robert.katona at hotmail.com>
Date: Wednesday, January 13, 2021 at 10:34 AM
To: Zhou, Hui <zhouh at anl.gov>, devel at mpich.org <devel at mpich.org>
Cc: Brent Morgan <brent.taylormorgan at gmail.com>
Subject: Re: [mpich-devel] mpich3 error
Hello Hui Zhou,
Can you help us with brainstorming?
Unfortunately, we cannot share the code, but I can give you more details.
Several computers are connected to the same network, and we would like to run a distributed calculation on them. During testing we run the MPI application over 50 job executor processes/threads and one job evaluator master/root thread. Each job calculates two float values, which the master collects. To transfer this data we use the MPI_Gather() function. The run proceeds fine but randomly stops with the error message that Brent sent to you.
Sometimes it stops after 10 iterations of the calculation, sometimes after 50 or 70. We do the same calculation with the same input, but the error occurs at random. It always stops in the MPI_Gather() function, when the master tries to collect the data from all of the jobs.
Could a network issue on any of the computers produce this error?
We are using MPICH 3.3.2. The release notes for 3.4 say that the network communication has changed from ch3 to ch4. Can we expect better behavior with ch4?
Can you give us any hints on where to look and what to check?
From: Zhou, Hui <zhouh at anl.gov>
Sent: Wednesday, January 13, 2021, 15:51
To: devel at mpich.org
Cc: Brent Morgan; Robert Katona
Subject: Re: [mpich-devel] mpich3 error
Unfortunately, unless you can provide us with a reproducer, there is little we can do to find the issue.
From: Brent Morgan via devel <devel at mpich.org>
Date: Wednesday, January 13, 2021 at 3:20 AM
To: devel at mpich.org <devel at mpich.org>
Cc: Brent Morgan <brent.taylormorgan at gmail.com>, Robert Katona <robert.katona at hotmail.com>
Subject: [mpich-devel] mpich3 error
Hello mpich dev support,
I am receiving the following error in my MPI implementation, when I use 110+ threads.
Assertion failed in file src/mpid/ch3/channels/nemesis/src/ch3_progress.c at line 567: !vc_ch->recv_active
internal ABORT - process 171
Does this give a clue as to what I may be doing wrong? Thank you,