[mpich-discuss] mpich3 error with ch3
zhouh at anl.gov
Thu May 27 17:45:20 CDT 2021
Thanks for the details. That looks like data corruption, probably because some of the atomics used are not strong enough for this particular CPU architecture. In some places ch3 uses weaker atomics that run fine on the architectures we know of, but you may have just provided an example where the weaker atomics are insufficient. If running ch3 is important for you, please file an issue at https://github.com/pmodels/mpich and we will track it down.
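To illustrate what "strong enough" means here, below is a minimal sketch in C11 of a flag-guarded handoff between two threads. It is not taken from the ch3 source; the `mailbox` struct, its fields, and `run_handoff` are hypothetical names chosen for the example. The producer publishes a length and then sets a flag with release ordering; the consumer waits on the flag with acquire ordering before reading the length. If both flag accesses used `memory_order_relaxed` instead, a weakly ordered CPU would be allowed to let the consumer see the flag set while still reading a stale length. That is the general class of problem meant above, not a diagnosis of the reported assertion.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical mailbox for illustration only; not an MPICH data structure. */
struct mailbox {
    size_t payload_len;   /* plain (non-atomic) data written by the producer */
    size_t seen_len;      /* what the consumer observed */
    atomic_int ready;     /* 0 = empty, 1 = payload_len is valid */
};

static void *producer(void *arg)
{
    struct mailbox *mb = arg;
    mb->payload_len = 64;
    /* Release store: the write to payload_len above is ordered before
     * the flag becomes visible to a thread that acquires it. */
    atomic_store_explicit(&mb->ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    struct mailbox *mb = arg;
    /* Acquire load: once we see ready == 1, the producer's earlier
     * write to payload_len is guaranteed to be visible here. */
    while (atomic_load_explicit(&mb->ready, memory_order_acquire) == 0)
        ;  /* spin until the producer publishes */
    mb->seen_len = mb->payload_len;
    return NULL;
}

/* Runs one handoff and returns the length the consumer observed
 * (0 if a thread could not be created). */
size_t run_handoff(void)
{
    struct mailbox mb = { .payload_len = 0, .seen_len = 0 };
    pthread_t prod, cons;

    atomic_init(&mb.ready, 0);
    if (pthread_create(&cons, NULL, consumer, &mb) != 0)
        return 0;
    if (pthread_create(&prod, NULL, producer, &mb) != 0) {
        /* Unblock the consumer so it can be joined. */
        atomic_store_explicit(&mb.ready, 1, memory_order_release);
        pthread_join(cons, NULL);
        return 0;
    }
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return mb.seen_len;
}
```

Calling `run_handoff()` from a `main` and building with `-pthread` should return the published length. A relaxed variant may appear to work on strongly ordered CPUs and fail only occasionally on weakly ordered ones, which would fit failures that show up on one core type but not another.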
From: Alim Akhtar <alim.akhtar at gmail.com>
Date: Thursday, May 27, 2021 at 9:20 AM
To: Zhou, Hui <zhouh at anl.gov>
Cc: discuss at mpich.org <discuss at mpich.org>
Subject: Re: [mpich-discuss] mpich3 error with ch3
On Thu, May 27, 2021 at 7:28 PM Zhou, Hui <zhouh at anl.gov> wrote:
> Similar issues can have very different causes. Checking the referenced discussion, I wasn’t sure what the original issue was. We suggested trying ch4 to get more data points rather than as a solution. Nevertheless, ch4 is the currently recommended device, as it is more actively developed.
> Ch3 is not broken as far as we know. Could you describe your issue in more detail?
Assertion failed in file
src/mpid/ch3/channels/nemesis/src/ch3_progress.c at line 530:
payload_len >= sizeof (MPIDI_CH3_Pkt_t)
after some number of loops.
Sometimes I see this instead:
Assertion failed in file
src/mpid/ch3/channels/nemesis/src/ch3_progress.c at line 567:
The number of loops that pass depends on the number of CPUs used (the
more CPUs, the more failures).
With one CPU, there is no failure.
> Hui Zhou
> From: Alim Akhtar via discuss <discuss at mpich.org>
> Date: Wednesday, May 26, 2021 at 10:51 PM
> To: discuss at mpich.org <discuss at mpich.org>
> Cc: Alim Akhtar <alim.akhtar at gmail.com>
> Subject: [mpich-discuss] mpich3 error with ch3
> Hi mpich dev team,
> I am facing an issue similar to the one discussed in the thread below.
> Someone on the mailing list suggested recompiling the MPI bench using
> ch4, as below:
> "MPICH with ch4, with `--with-device=ch4:ofi`"
> This actually fixes the failure on this CPU architecture.
> 1. Is this a known issue with MPI bench on recent CPU
> architectures? (I am running on ARM's Cortex-A75), like ch3 is
> 2. Since there is no error with ch4, does this mean the CPU itself is fine?
> Note: ch3 was working fine on our previous CPU (Cortex-A72).
> Any input will be really appreciated.