[mpich-discuss] Overlapping non-blocking collectives leads to deadlock

Raffenetti, Kenneth J. raffenet at mcs.anl.gov
Mon Nov 18 13:47:16 CST 2019


Correct. The ordering requirement is per-communicator, so creating 
additional communicators is one way to ensure safety.

Ken
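
The per-communicator rule can be made concrete with a small model. What
follows is a minimal sketch in plain C, not MPI code: the types coll_op and
coll_call, the integer communicator ids, and the function orders_match are
hypothetical names introduced only for illustration. It assumes each
process's collectives are recorded in the order they were issued, and it
checks whether two processes issued the same sequence on every communicator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are not part of the MPI API. */
typedef enum { OP_REDUCE, OP_BCAST } coll_op;

typedef struct {
    int comm;    /* illustrative communicator id, 0 .. ncomms-1 */
    coll_op op;  /* which collective was issued */
} coll_call;

/* Compare only the calls that were issued on communicator `comm`. */
static bool same_order_on_comm(const coll_call *a, size_t na,
                               const coll_call *b, size_t nb, int comm)
{
    size_t i = 0, j = 0;
    for (;;) {
        /* Skip calls that belong to other communicators. */
        while (i < na && a[i].comm != comm) i++;
        while (j < nb && b[j].comm != comm) j++;
        if (i == na || j == nb)
            return i == na && j == nb;  /* both must run out together */
        if (a[i].op != b[j].op)
            return false;               /* mismatched collectives */
        i++;
        j++;
    }
}

/* True if both processes issue the same collectives, in the same order,
 * on each of the `ncomms` communicators. */
bool orders_match(const coll_call *a, size_t na,
                  const coll_call *b, size_t nb, int ncomms)
{
    assert(ncomms > 0);
    for (int c = 0; c < ncomms; c++)
        if (!same_order_on_comm(a, na, b, nb, c))
            return false;
    return true;
}

/* One shared communicator (id 0): on process 1 the thread doing the
 * bcast happens to run before the thread doing the reduce. */
const coll_call shared_p0[] = { {0, OP_REDUCE}, {0, OP_BCAST} };
const coll_call shared_p1[] = { {0, OP_BCAST}, {0, OP_REDUCE} };

/* One communicator per operation (ids 0 and 1): same interleaving. */
const coll_call split_p0[] = { {0, OP_REDUCE}, {1, OP_BCAST} };
const coll_call split_p1[] = { {1, OP_BCAST}, {0, OP_REDUCE} };
```

Under this model, orders_match(shared_p0, 2, shared_p1, 2, 1) is false,
which is the mismatch described in the thread, while
orders_match(split_p0, 2, split_p1, 2, 2) is true. In a real program the
separate ids would correspond to distinct communicators, for example
duplicates of MPI_COMM_WORLD obtained with MPI_Comm_dup, one per thread.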

On 11/18/19 11:52 AM, Joachim Protze via discuss wrote:
> Hi Mark,
> 
> I agree with Giuseppe. This is an issue in your code. You can use 
> different communicators (e.g., duplicates of comm-world) if you want to 
> concurrently execute collective communication by different threads.
> 
> - Joachim
> 
> On 11/18/19 5:25 PM, Mark Davis via discuss wrote:
>> Hi Giuseppe, thank you for the fast response -- that clarifies things 
>> for me.
>>
>> On Mon, Nov 18, 2019 at 10:39 AM Congiu, Giuseppe via discuss
>> <discuss at mpich.org> wrote:
>>>
>>> Hello Mark,
>>>
>>> I don’t think that is a bug in MPICH, it’s a bug in your code. The 
>>> MPI standard requires that collectives (non-blocking ones are no 
>>> exception) must be invoked in the same order in all processes. If T0 
>>> in process 0 runs first and T1 in process 1 runs first you have a 
>>> mismatch and a resulting deadlock.
>>>
>>> —Giuseppe
>>>
>>>> On Nov 18, 2019, at 8:27 AM, Mark Davis via discuss 
>>>> <discuss at mpich.org> wrote:
>>>>
>>>> I realized something else relevant: I mentioned above that this
>>>> deadlock occurs sometimes but not all of the time; I think I've
>>>> narrowed down when it happens. Here's the above example with thread
>>>> IDs annotated in:
>>>>
>>>>
>>>> PROCESS 0 (root for ireduce and ibcast):
>>>> // T0 is always the thread that calls MPI functions
>>>> T0: MPI_Ireduce(..., &req);
>>>> T0: MPI_Wait(&req);  <-- blocking here
>>>> ...
>>>> T0: MPI_Ibcast(..., &req2);
>>>> T0: MPI_Wait(&req2);
>>>>
>>>> PROCESS 1 (non-root for ireduce and ibcast):
>>>> // T0 is the root for the reduce
>>>> T0: MPI_Ireduce(..., &req);
>>>> T0: MPI_Wait(&req);
>>>> ...
>>>> // T1 is the root for the bcast
>>>> T1: MPI_Ibcast(..., &req2);
>>>> T1: MPI_Wait(&req2); <-- blocking here
>>>>
>>>> Note that the non-root process has two different threads, T0 and T1,
>>>> and T0 does the Ireduce and T1 does the bcast. I believe the T0 call
>>>> to MPI_Ireduce is concurrent with the T1 call to MPI_Ibcast (both as
>>>> non-roots).
>>>>
>>>> So, I believe the question is: is it legal in MPI to have two threads
>>>> in a given MPI process call different non-blocking collectives (e.g.,
>>>> reduce and bcast) concurrently with MPI_THREAD_MULTIPLE enabled?
>>>>
>>>> Thank you
>>>>
>>>> On Mon, Nov 18, 2019 at 10:05 AM Mark Davis 
>>>> <markdavisinboston at gmail.com> wrote:
>>>>>
>>>>> Hello, I'm experimenting with non-blocking collectives using MPICH in
>>>>> a multithreaded C++ program (with MPI_THREAD_MULTIPLE initialization).
>>>>>
>>>>> I'm currently doing a non-blocking reduce followed by a non-blocking
>>>>> broadcast (I realize I can just use an allreduce but for my
>>>>> experiment, I need to serialize these operations). I was able to
>>>>> produce this bug with only two MPI processes. I see in gdb that the
>>>>> root process is stuck trying to execute the MPI_Ireduce in cases where
>>>>> the non-root process does the MPI_Ireduce and gets to the MPI_Ibcast
>>>>> quickly. That is, process 0 (root) isn't able to complete the
>>>>> MPI_Ireduce wait while process 1 is stuck in the MPI_Ibcast wait.
>>>>>
>>>>> PROCESS 0 (root for ireduce and ibcast):
>>>>> MPI_Ireduce(..., &req);
>>>>> MPI_Wait(&req);  <-- blocking here
>>>>> ...
>>>>> MPI_Ibcast(..., &req2);
>>>>> MPI_Wait(&req2);
>>>>>
>>>>> PROCESS 1 (non-root for ireduce and ibcast):
>>>>> MPI_Ireduce(..., &req);
>>>>> MPI_Wait(&req);
>>>>> ...
>>>>> MPI_Ibcast(..., &req2);
>>>>> MPI_Wait(&req2); <-- blocking here
>>>>>
>>>>> Much of the time, the program deadlocks as shown above; sometimes this
>>>>> works fine, though, perhaps due to subtle timing differences.  I
>>>>> mentioned above that this is a multithreaded program. I'm able to
>>>>> produce the issue with two threads with two MPI procs. The other
>>>>> threads are not calling MPI functions -- they are helping with other
>>>>> computation. I've verified that I don't have any TSAN or ASAN errors
>>>>> in this program. However, when I only have one thread per process, I
>>>>> don't have this issue. I think there's a decent chance, though, that
>>>>> this has to do with timing differences as opposed to changing anything
>>>>> with the MPI calls. I have verified that only one thread per process
>>>>> is calling the MPI routines in the multithreaded case.
>>>>>
>>>>> When I change the MPI_Ireduce to a blocking MPI_Reduce and I keep the
>>>>> MPI_Ibcast non-blocking, the program runs fine. Only when BOTH
>>>>> MPI_Ireduce and MPI_Ibcast happen serially do I see this deadlock
>>>>> (again, some of the time).
>>>>>
>>>>> Unfortunately, this program is part of a very large system and it's
>>>>> not straightforward to give a fully working example. So, I'm just
>>>>> looking for any ideas anyone has for what sort of thing may be
>>>>> happening, any information that may be helpful about how two
>>>>> coincident non-blocking requests could interact with each other, etc.
>>>>>
>>>>> Also, if anyone has tips on how to debug this sort of thing in gdb
>>>>> that would be helpful. For example, are there ways to introspect the
>>>>> MPI_Request object, etc.?
>>>>>
>>>>> Thanks
>>>> _______________________________________________
>>>> discuss mailing list     discuss at mpich.org
>>>> To manage subscription options or unsubscribe:
>>>> https://lists.mpich.org/mailman/listinfo/discuss