[mpich-discuss] MPI_Win RMA with threads

Thu Feb 27 12:46:29 CST 2020

Hello,

I'm experimenting with using MPI + threads with one-sided
communication with PSCW synchronization. I'm running into a deadlock
situation and I want to clarify my understanding of these concepts.

In a simple benchmark, I'm creating 2 MPI processes, each with 8
pthreads. I'm running MPI in MPI_THREAD_MULTIPLE mode. Before I create
the threads, the parent processes each create 8 MPI_Win's in the same
order, each with an allocated and attached buffer on rank 1's side
(and a zero-sized window on rank 0's side). I verified that I'm
creating each of them in the same order, storing the resulting MPI_Win
handle in an 8-element array wins. All windows are created in the same
MPI_Comm (MPI_COMM_WORLD).

My understanding from MPICH 3.3.1 mpid_rma.c:274 is that each MPI_Win
object actually creates its own communicator that's used internally
for this particular window. I mention this because I would have
assumed that this would obviate this issue, but it doesn't seem to.

Then I spawn the 8 threads on each of the two MPI ranks, resulting in
16 threads total in the system across the two MPI ranks. Each thread
is given its own dedicated MPI window; that is, thread i gets wins[i].
For example, wins[0] on rank 0, thread 0 should correspond to wins[0]
on rank 1, thread 0.

Here is the pseudocode that each of the threads runs:

* rank 0 threads are "senders":

    for (iter = 0; iter < 5; ++iter) {
        // i is the thread number
        MPI_Win_start(group_containing_rank_1, 0, wins[i]);
        MPI_Put(buf_, count_, MPI_INT, partner_rank_, 0, count_, MPI_INT,
                wins[i]);
        MPI_Win_complete(wins[i]);
    }

* rank 1 threads are "receivers":

    for (iter = 0; iter < 5; ++iter) {
        // i is the thread number
        MPI_Win_post(group_containing_rank_0, 0, wins[i]);
        MPI_Win_wait(wins[i]);
        // verify correct value in the received buffer, from the
        // proper sender, etc.
    }

where group_containing_rank_1 and group_containing_rank_0 are
single-rank groups.

Note that the loops iterate 5 times; however, I almost always get a
deadlock before it finishes.

Specifically, it deadlocks when a sender thread gets to iter=1, trying to
issue the MPI_Win_start (but blocking) and the corresponding receiver
thread is still stuck on the completion of iteration 0, blocking on
the MPI_Win_wait. Some of the time the program runs correctly (and I
am having the receiver of the PUT verify that the "message" is from
the proper sender by inspecting the payload).

One guess is that the sender thread received a completion ack from
some other thread on rank 1, and that it's not able to differentiate
those. The "completion ack" seems to be sent to all members of the
specified group (in this example, that is just one MPI rank, rank 0)
via an MPID_Isend(&i, 0, MPI_INT, dst, SYNC_POST_TAG, win_comm_ptr,
MPIR_CONTEXT_INTRA_PT2PT, &req_ptr) call in ch3u_rma_sync.c:759
(MPID_Win_post). While this call uses the same tag (SYNC_POST_TAG) for
all posts, they should all be in different communicators, given the
point above about each window's internal communicator (win_comm_ptr).

I have seen other non-threaded MPI code that uses multiple different
windows, each with different parameters and names, so I'm pretty sure
two MPI processes can have multiple windows between them. However,
it's possible that works by ensuring a consistent ordering between the
uses of those windows. I was hoping that the internal communicator
that's created would be sufficient for disambiguation between multiple
windows on a pair of MPI ranks, though.

Some questions:

1. Can you please help me understand what's going on in my program and
clarify any misunderstandings I have about MPI's guarantees about
windows?

2. Does the MPI standard enable my goal of having multiple independent
windows between the same two MPI ranks such that they won't conflict?
Is there any way to enforce that the ith created window on rank 0 will
only communicate with the ith created window on rank 1?

3. Even if I can get this approach to work, I worry about scaling this
to larger numbers of ranks; my understanding is that there's a limit
of around 2000 MPI_Comms in the application. So, if you have an idea
on how I can get the above to work, would it scale to a larger number
of threads? (I'm sure there's a better approach than what I'm doing
here.)

4. I'm also considering moving to passive target synchronization with
MPI_Win_lock and MPI_Win_unlock. Would that approach happen to help here?

Thanks for any light you can shed on this.