[mpich-discuss] Overlapping non-blocking collectives leads to deadlock

Mark Davis markdavisinboston at gmail.com
Mon Nov 18 10:25:55 CST 2019


Hi Giuseppe, thank you for the fast response -- that clarifies things for me.

On Mon, Nov 18, 2019 at 10:39 AM Congiu, Giuseppe via discuss
<discuss at mpich.org> wrote:
>
> Hello Mark,
>
> I don’t think that is a bug in MPICH; it’s a bug in your code. The MPI standard requires that collectives (non-blocking ones are no exception) be invoked in the same order on all processes. If T0’s ireduce is issued first in process 0 while T1’s ibcast is issued first in process 1, you have a mismatch and a resulting deadlock.
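>
> As an illustration (not your code; the buffers, MPI_INT/MPI_SUM, root 0, and MPI_COMM_WORLD are just assumptions), the rule is satisfied when every process issues the two collectives in the same order, for example driven by a single thread per process:
>
>     #include <mpi.h>
>
>     /* Illustrative sketch only: every rank issues the ireduce before the
>        ibcast, so the collectives match in the same order on all processes. */
>     void reduce_then_bcast(const int *sendbuf, int *redbuf, int *bcastbuf,
>                            int count)
>     {
>         MPI_Request req;
>
>         MPI_Ireduce(sendbuf, redbuf, count, MPI_INT, MPI_SUM,
>                     0 /* root */, MPI_COMM_WORLD, &req);
>         MPI_Wait(&req, MPI_STATUS_IGNORE);     /* reduce completes first */
>
>         MPI_Ibcast(bcastbuf, count, MPI_INT, 0 /* root */,
>                    MPI_COMM_WORLD, &req);
>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>     }
>
> What matters is the per-communicator issue order at each process, even if different threads end up making the calls.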
>
> —Giuseppe
>
> > On Nov 18, 2019, at 8:27 AM, Mark Davis via discuss <discuss at mpich.org> wrote:
> >
> > I realized something else relevant: I mentioned above that this
> > deadlock occurs sometimes but not all of the time; I think I've
> > narrowed down when it happens. Here's the above example with thread
> > IDs annotated in:
> >
> >
> > PROCESS 0 (root for ireduce and ibcast):
> > // T0 is always the thread that calls MPI functions
> > T0: MPI_Ireduce(..., &req)
> > T0: MPI_Wait(&req);  <-- blocking here
> > ...
> > T0: MPI_Ibcast(..., &req2);
> > T0: MPI_Wait(&req2);
> >
> > PROCESS 1 (non-root for ireduce and ibcast):
> > // T0 issues the (non-root) ireduce call
> > T0: MPI_Ireduce(..., &req)
> > T0:MPI_Wait(&req);
> > ...
> > // T1 issues the (non-root) ibcast call
> > T1: MPI_Ibcast(..., &req2);
> > T1: MPI_Wait(&req2); <-- blocking here
> >
> > Note that the non-root process has two different threads, T0 and T1,
> > and T0 does the Ireduce and T1 does the bcast. I believe the T0 call
> > to MPI_Ireduce is concurrent with the T1 call to MPI_Ibcast (both as
> > non-roots).
> >
> > So, I believe the question is: is it legal in MPI to have two threads
> > in a given MPI process call different non-blocking collectives (e.g.,
> > reduce and bcast) concurrently with MPI_THREAD_MULTIPLE enabled?
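> >
> > (A follow-up thought, in case it is relevant: would giving each thread its
> > own duplicated communicator sidestep the ordering question entirely? A
> > rough, hypothetical sketch of what I mean; the comm names are made up:)
> >
> >     #include <mpi.h>
> >
> >     /* Hypothetical: duplicate the communicator once at startup (from a
> >        single thread, since MPI_Comm_dup is itself collective) so that
> >        T0's ireduce and T1's ibcast each use their own communicator and
> >        their issue orders are independent of each other. */
> >     MPI_Comm reduce_comm, bcast_comm;
> >
> >     void setup_comms(void)
> >     {
> >         MPI_Comm_dup(MPI_COMM_WORLD, &reduce_comm);
> >         MPI_Comm_dup(MPI_COMM_WORLD, &bcast_comm);
> >     }
> >
> >     /* T0 would then pass reduce_comm to MPI_Ireduce and T1 would pass
> >        bcast_comm to MPI_Ibcast, each followed by its own MPI_Wait. */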
> >
> > Thank you
> >
> > On Mon, Nov 18, 2019 at 10:05 AM Mark Davis <markdavisinboston at gmail.com> wrote:
> >>
> >> Hello, I'm experimenting with non-blocking collectives using MPICH in
> >> a multithreaded C++ program (with MPI_THREAD_MULTIPLE initialization).
> >>
> >> I'm currently doing a non-blocking reduce followed by a non-blocking
> >> broadcast (I realize I can just use an allreduce but for my
> >> experiment, I need to serialize these operations). I was able to
> >> produce this bug with only two MPI processes. I see in gdb that the
> >> root process is stuck trying to execute the MPI_Ireduce in cases where
> >> the non-root process does the MPI_Ireduce and gets to the MPI_Ibcast
> >> quickly. That is, process 0 (root) isn't able to complete the
> >> MPI_Ireduce wait while process 1 is stuck in the MPI_Ibcast wait.
> >>
> >> PROCESS 0 (root for ireduce and ibcast):
> >> MPI_Ireduce(..., &req)
> >> MPI_Wait(&req);  <-- blocking here
> >> ...
> >> MPI_Ibcast(..., &req2);
> >> MPI_Wait(&req2);
> >>
> >> PROCESS 1 (non-root for ireduce and ibcast):
> >> MPI_Ireduce(..., &req)
> >> MPI_Wait(&req);
> >> ...
> >> MPI_Ibcast(..., &req2);
> >> MPI_Wait(&req2); <-- blocking here
> >>
> >> Much of the time, the program deadlocks as shown above; sometimes this
> >> works fine, though, perhaps due to subtle timing differences.  I
> >> mentioned above that this is a multithreaded program. I'm able to
> >> reproduce the issue with two threads and two MPI procs. The other
> >> threads are not calling MPI functions -- they are helping with other
> >> computation. I've verified that I don't have any TSAN or ASAN errors
> >> in this program. However, when I only have one thread per process, I
> >> don't have this issue. I think there's a decent chance, though, that
> >> this has to do with timing differences as opposed to changing anything
> >> with the MPI calls. I have verified that only one thread per process
> >> is calling the MPI routines in the multithreaded case.
> >>
> >> When I change the MPI_Ireduce to a blocking MPI_Reduce and keep the
> >> MPI_Ibcast non-blocking, the program runs fine. Only when BOTH
> >> operations are non-blocking (MPI_Ireduce followed by MPI_Ibcast) do I
> >> see this deadlock (again, only some of the time).
> >>
> >> Unfortunately, this program is part of a very large system and it's
> >> not straightforward to give a fully working example. So I'm just
> >> looking for any ideas about what sort of thing may be happening, and
> >> for any information about how two concurrent non-blocking collective
> >> requests could interact with each other.
> >>
> >> Also, if anyone has tips on how to debug this sort of thing in gdb
> >> that would be helpful. For example, are there ways to introspect the
> >> MPI_Request object, etc.?
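> >>
> >> (For what it's worth, one crude alternative to introspecting the request
> >> is to replace MPI_Wait with a polling loop that logs when a request is
> >> still pending after some arbitrary timeout; a rough sketch:)
> >>
> >>     #include <mpi.h>
> >>     #include <stdio.h>
> >>
> >>     /* Diagnostic wait: poll the request and complain periodically if it
> >>        has not completed, so the log shows which wait is stuck. */
> >>     void wait_with_log(MPI_Request *req, const char *label, double timeout)
> >>     {
> >>         int flag = 0;
> >>         double start = MPI_Wtime();
> >>         while (!flag) {
> >>             MPI_Test(req, &flag, MPI_STATUS_IGNORE);
> >>             if (!flag && MPI_Wtime() - start > timeout) {
> >>                 fprintf(stderr, "request '%s' still pending after %.0f s\n",
> >>                         label, timeout);
> >>                 start = MPI_Wtime();   /* reset so it nags periodically */
> >>             }
> >>         }
> >>     }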
> >>
> >> Thanks
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss

