[mpich-discuss] Overlapping non-blocking collectives leads to deadlock
Joachim Protze
protze at itc.rwth-aachen.de
Mon Nov 18 11:52:34 CST 2019
Hi Mark,
I agree with Giuseppe: this is an issue in your code. You can use
different communicators (e.g., duplicates of MPI_COMM_WORLD) if you want
different threads to execute collective communication concurrently.
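
Something along these lines should work; this is only an untested sketch
with made-up buffer names and a simplified structure, not taken from your
code:

    #include <mpi.h>
    #include <thread>

    int main(int argc, char **argv) {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            // this sketch needs full thread support
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        // one duplicated communicator per thread that issues collectives,
        // so the reduce and the bcast are matched independently
        MPI_Comm reduce_comm, bcast_comm;
        MPI_Comm_dup(MPI_COMM_WORLD, &reduce_comm);
        MPI_Comm_dup(MPI_COMM_WORLD, &bcast_comm);

        int in = 1, out = 0, bval = 42;

        std::thread t0([&] {   // thread doing the reduce
            MPI_Request req;
            MPI_Ireduce(&in, &out, 1, MPI_INT, MPI_SUM, 0, reduce_comm, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        });
        std::thread t1([&] {   // thread doing the bcast
            MPI_Request req;
            MPI_Ibcast(&bval, 1, MPI_INT, 0, bcast_comm, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        });
        t0.join();
        t1.join();

        MPI_Comm_free(&reduce_comm);
        MPI_Comm_free(&bcast_comm);
        MPI_Finalize();
        return 0;
    }
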
- Joachim
On 11/18/19 5:25 PM, Mark Davis via discuss wrote:
> Hi Giuseppe, thank you for the fast response -- that clarifies things for me.
>
> On Mon, Nov 18, 2019 at 10:39 AM Congiu, Giuseppe via discuss
> <discuss at mpich.org> wrote:
>>
>> Hello Mark,
>>
>> I don’t think that is a bug in MPICH; it’s a bug in your code. The MPI standard requires that collectives (non-blocking ones are no exception) be invoked in the same order on all processes. If T0 in process 0 runs first and T1 in process 1 runs first, you have a mismatch and a resulting deadlock.
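>>
>> For illustration, a minimal sketch of the same-order requirement (made-up variable names, assuming comm is an existing communicator such as MPI_COMM_WORLD):
>>
>>     int in = 1, out = 0, bval = 0;
>>     MPI_Request reqs[2];
>>     /* every rank must issue the two collectives on comm in this same order */
>>     MPI_Ireduce(&in, &out, 1, MPI_INT, MPI_SUM, 0, comm, &reqs[0]);
>>     MPI_Ibcast(&bval, 1, MPI_INT, 0, comm, &reqs[1]);
>>     MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
>>
>> If one rank issues the Ibcast before the Ireduce on that communicator, the operations get matched in the wrong order and the program can hang.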
>>
>> —Giuseppe
>>
>>> On Nov 18, 2019, at 8:27 AM, Mark Davis via discuss <discuss at mpich.org> wrote:
>>>
>>> I realized something else relevant: I mentioned above that this
>>> deadlock occurs sometimes but not all of the time; I think I've
>>> narrowed down when it happens. Here's the above example with thread
>>> IDs annotated:
>>>
>>>
>>> PROCESS 0 (root for ireduce and ibcast):
>>> // T0 is always the thread that calls MPI functions
>>> T0: MPI_Ireduce(..., &req)
>>> T0: MPI_Wait(&req); <-- blocking here
>>> ...
>>> T0: MPI_Ibcast(..., &req2);
>>> T0: MPI_Wait(&req2);
>>>
>>> PROCESS 1 (non-root for ireduce and ibcast):
>>> // T0 is the thread that calls the ireduce
>>> T0: MPI_Ireduce(..., &req)
>>> T0: MPI_Wait(&req);
>>> ...
>>> // T1 is the thread that calls the ibcast
>>> T1: MPI_Ibcast(..., &req2);
>>> T1: MPI_Wait(&req2); <-- blocking here
>>>
>>> Note that the non-root process has two different threads, T0 and T1,
>>> and T0 does the Ireduce and T1 does the bcast. I believe the T0 call
>>> to MPI_Ireduce is concurrent with the T1 call to MPI_Ibcast (both as
>>> non-roots).
>>>
>>> So, I believe the question is: is it legal in MPI to have two threads
>>> in a given MPI process call different non-blocking collectives (e.g.,
>>> reduce and bcast) concurrently with MPI_THREAD_MULTIPLE enabled?
>>>
>>> Thank you
>>>
>>> On Mon, Nov 18, 2019 at 10:05 AM Mark Davis <markdavisinboston at gmail.com> wrote:
>>>>
>>>> Hello, I'm experimenting with non-blocking collectives using MPICH in
>>>> a multithreaded C++ program (with MPI_THREAD_MULTIPLE initialization).
>>>>
>>>> I'm currently doing a non-blocking reduce followed by a non-blocking
>>>> broadcast (I realize I can just use an allreduce but for my
>>>> experiment, I need to serialize these operations). I was able to
>>>> produce this bug with only two MPI processes. I see in gdb that the
>>>> root process is stuck trying to execute the MPI_Ireduce in cases where
>>>> the non-root process does the MPI_Ireduce and gets to the MPI_Ibcast
>>>> quickly. That is, process 0 (root) isn't able to complete the
>>>> MPI_Ireduce wait while process 1 is stuck in the MPI_Ibcast wait.
>>>>
>>>> PROCESS 0 (root for ireduce and ibcast):
>>>> MPI_Ireduce(..., &req)
>>>> MPI_Wait(&req); <-- blocking here
>>>> ...
>>>> MPI_Ibcast(..., &req2);
>>>> MPI_Wait(&req2);
>>>>
>>>> PROCESS 1 (non-root for ireduce and ibcast):
>>>> MPI_Ireduce(..., &req)
>>>> MPI_Wait(&req);
>>>> ...
>>>> MPI_Ibcast(..., &req2);
>>>> MPI_Wait(&req2); <-- blocking here
>>>>
>>>> Much of the time, the program deadlocks as shown above; sometimes this
>>>> works fine, though, perhaps due to subtle timing differences. I
>>>> mentioned above that this is a multithreaded program. I'm able to
>>>> reproduce the issue with two threads and two MPI processes. The other
>>>> threads are not calling MPI functions -- they are helping with other
>>>> computation. I've verified that I don't have any TSAN or ASAN errors
>>>> in this program. However, when I only have one thread per process, I
>>>> don't have this issue. I think there's a decent chance, though, that
>>>> this is due to timing differences rather than any change in the MPI
>>>> calls themselves. I have verified that only one thread per process
>>>> is calling the MPI routines in the multithreaded case.
>>>>
>>>> When I change the MPI_Ireduce to a blocking MPI_Reduce and I keep the
>>>> MPI_Ibcast non-blocking, the program runs fine. Only when BOTH
>>>> MPI_Ireduce and MPI_Ibcast happen serially do I see this deadlock
>>>> (again, some of the time).
>>>>
>>>> Unfortunately, this program is part of a very large system and it's
>>>> not straightforward to give a fully working example. So, I'm just
>>>> looking for any ideas anyone has for what sort of thing may be
>>>> happening, any information that may be helpful about how two
>>>> coincident non-blocking requests could interact with each other, etc.
>>>>
>>>> Also, if anyone has tips on how to debug this sort of thing in gdb
>>>> that would be helpful. For example, are there ways to introspect the
>>>> MPI_Request object, etc.?
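>>>>
>>>> For instance, would replacing the MPI_Wait with a polling loop like
>>>> the following (just a sketch, not my real code; rank comes from
>>>> MPI_Comm_rank) be a reasonable way to see which request never
>>>> completes?
>>>>
>>>>     int done = 0;
>>>>     while (!done) {
>>>>         MPI_Test(&req, &done, MPI_STATUS_IGNORE);
>>>>         if (!done)
>>>>             fprintf(stderr, "rank %d: ireduce still pending\n", rank);
>>>>     }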
>>>>
>>>> Thanks
--
Dipl.-Inf. Joachim Protze
IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80-24765
Fax: +49 241 80-624765
protze at itc.rwth-aachen.de
www.itc.rwth-aachen.de