[mpich-discuss] possible thread bug with MPI_Reduce/MPI_Allreduce

Zhou, Hui zhouh at anl.gov
Fri Jul 14 15:05:36 CDT 2023


Glad I could help.

Hui
________________________________
From: Burlen Loring <burlen.loring at gmail.com>
Sent: Friday, July 14, 2023 12:31 PM
To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org <discuss at mpich.org>
Subject: Re: [mpich-discuss] possible thread bug with MPI_Reduce/MPI_Allreduce

Hi Hui,

The wording in the MPI_Init_thread documentation, specifically the statement that with MPI_THREAD_MULTIPLE "multiple threads may call MPI at once with no restrictions", had me thinking along the wrong lines.

Following your suggestions, I rewrote my code to give each thread exclusive access to one communicator from a small pool of communicators. This is working well and outperformed the Iallreduce/Ireduce approach.
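In case it helps anyone who finds this thread later, the approach boils down to something like the sketch below (illustrative names and sizes, not my actual simulation code). Each duplicated communicator is owned by exactly one thread, so concurrent collectives never share a communicator:

    /* comm_pool.c -- illustrative sketch of the communicator-pool
       approach, not the actual simulation code.
       build: mpicc comm_pool.c -o comm_pool -lpthread */
    #include <mpi.h>
    #include <pthread.h>

    #define NTHREADS 4

    static MPI_Comm comm_pool[NTHREADS];

    static void *reduce_worker(void *arg)
    {
        int i = (int)(size_t)arg;
        double v = 1.0;
        /* safe: comm_pool[i] is used by this thread alone */
        MPI_Allreduce(MPI_IN_PLACE, &v, 1, MPI_DOUBLE, MPI_SUM,
                      comm_pool[i]);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            MPI_Abort(MPI_COMM_WORLD, 1);

        /* MPI_Comm_dup is collective: call it from the main thread
           on every rank before the workers start */
        for (int i = 0; i < NTHREADS; ++i)
            MPI_Comm_dup(MPI_COMM_WORLD, &comm_pool[i]);

        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; ++i)
            pthread_create(&t[i], NULL, reduce_worker, (void *)(size_t)i);
        for (int i = 0; i < NTHREADS; ++i)
            pthread_join(t[i], NULL);

        for (int i = 0; i < NTHREADS; ++i)
            MPI_Comm_free(&comm_pool[i]);
        MPI_Finalize();
        return 0;
    }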

Thanks so much for the help

Burlen


On 7/12/23 11:55, Zhou, Hui wrote:
Hi Burlen,

You are not allowed to issue collectives concurrently on the same communicator. You can use nonblocking collectives, i.e. MPI_Ireduce and MPI_Iallreduce, to achieve the overlapping. You still need to serialize the issuing of MPI_Ireduce/MPI_Iallreduce, but you don't need a mutex to wait for the requests to complete in concurrent threads.
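A sketch of what I mean (untested, illustrative names): hold a lock only while initiating the nonblocking collective, then wait on the request outside the lock. Note that nonblocking collectives on the same communicator must also be initiated in the same order on every rank, so the application still has to guarantee a consistent issue order across ranks:

    #include <mpi.h>
    #include <pthread.h>

    static pthread_mutex_t issue_lock = PTHREAD_MUTEX_INITIALIZER;

    /* reduce v in place across comm; callable from concurrent threads */
    static void threaded_allreduce(double *v, MPI_Comm comm)
    {
        MPI_Request req;

        /* serialize only the *initiation* of the collective */
        pthread_mutex_lock(&issue_lock);
        MPI_Iallreduce(MPI_IN_PLACE, v, 1, MPI_DOUBLE, MPI_SUM,
                       comm, &req);
        pthread_mutex_unlock(&issue_lock);

        /* completion can proceed concurrently, no lock needed */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }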

--
Hui
________________________________
From: Burlen Loring via discuss <discuss at mpich.org>
Sent: Wednesday, July 12, 2023 1:29 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Burlen Loring <burlen.loring at gmail.com>
Subject: [mpich-discuss] possible thread bug with MPI_Reduce/MPI_Allreduce

Hi All,

I'm using MPICH 4.0.2 on Fedora 37, installed from the package manager, for development. From an MPI-parallel simulation I'm spawning a thread that does a number of reductions (MPI_Allreduce and MPI_Reduce); the MPI_IN_PLACE option is used. The results are written with POSIX I/O from rank 0. The simulation continues and can launch the next set of reductions before the previous ones have completed. I have called MPI_Init_thread and requested and received MPI_THREAD_MULTIPLE support.

However, when multiple threads overlap (in test runs, 3-4 threads running concurrently), both MPI_Allreduce and MPI_Reduce can produce incorrect results. If instead I serialize the threads by waiting on them before returning to the simulation, the results are correct. Likewise, if I use a mutex around my MPI_Allreduce/MPI_Reduce sections, the results are correct. I think MPI_Reduce/MPI_Allreduce is not thread safe.
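The pattern reduces to something like this sketch (simplified, not the actual simulation code): several threads concurrently call a blocking collective on the same communicator:

    #include <mpi.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static void *reduce_thread(void *arg)
    {
        double v = 1.0;
        /* all threads reduce on MPI_COMM_WORLD at the same time */
        MPI_Allreduce(MPI_IN_PLACE, &v, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        printf("result: %g\n", v);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            MPI_Abort(MPI_COMM_WORLD, 1);

        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; ++i)
            pthread_create(&t[i], NULL, reduce_thread, NULL);
        for (int i = 0; i < NTHREADS; ++i)
            pthread_join(t[i], NULL);

        MPI_Finalize();
        return 0;
    }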

I was wondering, is this a known issue? Could it be an MPICH build/configure setting not set correctly by the Fedora package maintainer?

Thanks
Burlen


