[mpich-discuss] MPI_Win RMA with threads

Mark Davis markdavisinboston at gmail.com
Thu Feb 27 14:32:09 CST 2020


Please see the attached. You can run with something like

mpic++ -std=c++17 -g -ggdb -fno-omit-frame-pointer -lpthread -O0
thread_test_2.cpp -o thread_test_2 && mpirun -n 2 thread_test_2 1 5 8

A follow-up question: it seems that MPICH creates a duplicate MPI_Comm
for each window, as I mentioned in my original message below. Is that
something that's required by the standard or just an MPICH
implementation detail?

On Thu, Feb 27, 2020 at 2:47 PM Thakur, Rajeev <thakur at anl.gov> wrote:
>
> Can you send us the whole program?
>
> Rajeev
>
>
> -----Original Message-----
> From: Mark Davis via discuss <discuss at mpich.org>
> Reply-To: "discuss at mpich.org" <discuss at mpich.org>
> Date: Thursday, February 27, 2020 at 12:47 PM
> To: "discuss at mpich.org" <discuss at mpich.org>
> Cc: Mark Davis <markdavisinboston at gmail.com>
> Subject: [mpich-discuss] MPI_Win RMA with threads
>
> Hello,
>
> I'm experimenting with using MPI + threads with one-sided
> communication with PSCW synchronization. I'm running into a deadlock
> situation and I want to clarify my understanding of these concepts.
>
> In a simple benchmark, I'm creating 2 MPI processes, each with 8
> pthreads. I'm running MPI in MPI_THREAD_MULTIPLE mode. Before I create
> the threads, each parent process creates 8 MPI_Win's, each with an
> allocated and attached buffer on rank 1's side (and a zero-sized
> window on rank 0's side). I verified that both ranks create the
> windows in the same order, storing the resulting MPI_Win handles in an
> 8-element array, wins. All windows are created on the same MPI_Comm
> (MPI_COMM_WORLD).
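>
> For reference, here is a minimal sketch of that setup (names like
> NUM_THREADS, rank, count_ and win_bufs are illustrative here, not the
> exact code from the attachment):
>
>     constexpr int NUM_THREADS = 8;
>     MPI_Win wins[NUM_THREADS];
>     int*    win_bufs[NUM_THREADS] = {nullptr};
>
>     for (int i = 0; i < NUM_THREADS; ++i) {
>         if (rank == 1) {
>             // rank 1 (the target of the Puts) exposes a real buffer
>             MPI_Alloc_mem(count_ * sizeof(int), MPI_INFO_NULL, &win_bufs[i]);
>             MPI_Win_create(win_bufs[i], count_ * sizeof(int), sizeof(int),
>                            MPI_INFO_NULL, MPI_COMM_WORLD, &wins[i]);
>         } else {
>             // rank 0 (the origin) creates a zero-sized window
>             MPI_Win_create(nullptr, 0, sizeof(int),
>                            MPI_INFO_NULL, MPI_COMM_WORLD, &wins[i]);
>         }
>     }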
>
> My understanding from MPICH 3.3.1 mpid_rma.c:274 is that each MPI_Win
> object actually creates its own communicator that's used internally
> for that particular window. I mention this because I would have
> assumed that the per-window communicator would prevent the deadlock
> described below, but it doesn't seem to.
>
> Then I spawn the 8 threads on each of the two MPI ranks, resulting in
> 16 threads total across the two ranks. Each thread is given its own
> dedicated MPI window; that is, thread i gets wins[i]. For example,
> wins[0] used by thread 0 on rank 0 should pair with wins[0] used by
> thread 0 on rank 1.
>
> Here is the pseudocode that each of the threads run:
>
> * rank 0 threads are "senders":
>
>     for(iter=0; iter<5; ++iter) {
>          // i is the thread number
>          MPI_Win_start(group_containing_rank_1, 0, wins[i]);
>          MPI_Put(buf_, count_, MPI_INT, partner_rank_, 0, count_,
>                  MPI_INT, wins[i]);
>          MPI_Win_complete(wins[i]);
>     }
>
>
> * rank 1 threads are "receivers":
>
>     for(iter=0; iter<5; ++iter) {
>          // i is the thread number
>          MPI_Win_post(group_containing_rank_0, 0, wins[i]);
>          MPI_Win_wait(wins[i]);
>          // verify correct value in the received buffer, from the
>          // proper sender, etc.
>     }
>
> where group_containing_rank_1 and group_containing_rank_0 are
> single-rank groups.
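>
> For reference, those groups are built roughly like this (again a
> sketch, not the exact code in the attachment):
>
>     MPI_Group world_group, group_containing_rank_0, group_containing_rank_1;
>     MPI_Comm_group(MPI_COMM_WORLD, &world_group);
>     int r0 = 0, r1 = 1;
>     MPI_Group_incl(world_group, 1, &r0, &group_containing_rank_0);
>     MPI_Group_incl(world_group, 1, &r1, &group_containing_rank_1);
>     MPI_Group_free(&world_group);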
>
>
>
> Note that the loops iterate 5 times; however, I almost always get a
> deadlock before it finishes.
>
> Specifically, it deadlocks when a sender thread gets to iter=1, trying
> to issue the MPI_Win_start (but blocking), while the corresponding
> receiver thread is still stuck on the completion of iteration 0,
> blocking in MPI_Win_wait. Some of the time the program runs correctly
> (and I have the receiver of the MPI_Put verify that the "message" is
> from the proper sender by inspecting the payload).
>
> One guess is that the sender thread received a completion ack from
> some other thread on rank 1, and that it's not able to differentiate
> those. The "completion ack" seems to be done via an MPI_Isend to all
> members in the specified group (in this example, that is just one MPI
> rank, rank 0) with an MPID_Isend(&i, 0, MPI_INT, dst, SYNC_POST_TAG,
> win_comm_ptr, MPIR_CONTEXT_INTRA_PT2PT, &req_ptr) call in
> ch3u_rma_sync.c:759 (MPID_Win_post). While this call will use the same
> tag (SYNC_POST_TAG) for all posts, they should all be in different
> communicators given the point above about the internal communicator
> duplicate (win_comm_ptr).
>
>
> I have seen other non-threaded MPI code that uses multiple different
> windows, each with different parameters and names, so I'm pretty sure
> two MPI processes can have multiple windows between them. However,
> it's possible that works by ensuring a consistent ordering between the
> uses of those windows. I was hoping that the internal communicator
> that's created would be sufficient for disambiguation between multiple
> windows on a pair of MPI ranks, though.
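>
> Along those lines, one thing I'm considering trying (just a sketch of
> an idea, reusing the illustrative names from above, not something the
> attached program does) is to dup a communicator per window myself
> before creating each window, in case the disambiguation has to happen
> at my level rather than inside MPICH:
>
>     for (int i = 0; i < NUM_THREADS; ++i) {
>         MPI_Comm win_comm;
>         MPI_Comm_dup(MPI_COMM_WORLD, &win_comm);  // dedicated comm for window i
>         void*    base = nullptr;
>         MPI_Aint size = 0;
>         if (rank == 1) {  // target side exposes the real buffer
>             base = win_bufs[i];
>             size = count_ * sizeof(int);
>         }
>         MPI_Win_create(base, size, sizeof(int), MPI_INFO_NULL,
>                        win_comm, &wins[i]);
>     }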
>
> Some questions:
>
> 1. Can you please help me understand what's going on in my program and
> clarify any misunderstandings I have about MPI's guarantees for
> windows?
>
> 2. Does the MPI standard enable my goal of having multiple independent
> windows between the same two MPI ranks such that they won't conflict?
> Is there any way to enforce that the ith created window on rank 0 will
> only communicate with the ith created window on rank 1?
>
> 3. Even if I can get this approach to work, I worry about scaling this
> to larger numbers of ranks; my understanding is that there's a limit
> of around 2000 MPI_Comms in the application. So, if you have an idea
> on how I can get the above to work, would it scale to a larger number
> of threads? (I'm sure there's a better approach than what I'm doing
> here.)
>
> 4. I'm also considering moving to passive-target synchronization with
> MPI_Win_lock and MPI_Win_unlock. Would that approach happen to help
> here?
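>
> (Roughly what I have in mind for the passive-target variant is the
> following, again just a sketch reusing the illustrative names from
> above, with each sender thread doing:
>
>     for (int iter = 0; iter < 5; ++iter) {
>         MPI_Win_lock(MPI_LOCK_EXCLUSIVE, partner_rank_, 0, wins[i]);
>         MPI_Put(buf_, count_, MPI_INT, partner_rank_, 0, count_,
>                 MPI_INT, wins[i]);
>         // the Put is complete at the target once the unlock returns
>         MPI_Win_unlock(partner_rank_, wins[i]);
>     }
>
> though I realize the receiver side would then need some other way to
> know when the data has arrived.)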
>
>
> Thanks for any light you can shed on this.
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: thread_test_2.cpp
Type: application/octet-stream
Size: 11700 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20200227/942b3c85/attachment-0001.obj>

