[mpich-discuss] Hang in MPI_Win_Flush during scenario in one-sided communication
Rajeev Thakur
thakur at mcs.anl.gov
Mon Sep 22 19:29:58 CDT 2014
For this to work, you need to enable asynchronous progress in MPICH, at least on the target side. You can do that by setting the environment variable MPICH_ASYNC_PROGRESS to 1 before running mpiexec.
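For example (the program name here is just illustrative):

    MPICH_ASYNC_PROGRESS=1 mpiexec -n 1 ./host2_program

With this set, MPICH starts an additional progress thread in each process, so passive-target RMA operations can complete even while the target is blocked in getchar(). The cost is one extra thread per process.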
Rajeev
On Sep 22, 2014, at 4:53 PM, Neil Spruit <nrspruit at gmail.com> wrote:
> Hello,
>
>
> I have been experimenting with MPICH's implementation of one-sided communication with no remote synchronization, introduced in the MPI 3.0 spec, and I have run into a hang during connect/accept MPI scenarios.
>
>
>
> Here is the situation: I have two applications that connect to each other after being launched separately with mpiexec on two different hosts.
>
>
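> For reference, connected_comm is established along these lines (a sketch; in my programs the port name is exchanged out of band):
>
>     /* host2 (server side): open a port and wait for host1 */
>     char port_name[MPI_MAX_PORT_NAME];
>     MPI_Open_port(MPI_INFO_NULL, port_name);
>     MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &connected_comm);
>
>     /* host1 (client side), using the port name obtained from host2 */
>     MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &connected_comm);
>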
> After connecting, both hosts allocate memory, merge the intercommunicator, and call MPI_Win_create as follows:
>
> Host1_program:
>
>     void* buffer = mmap(0, buf_length, PROT_READ | PROT_WRITE,
>                         MAP_SHARED | MAP_POPULATE | MAP_ANON, -1, 0);
>     MPI_Intercomm_merge(connected_comm, 1, &comm);
>     MPI_Comm_rank(comm, &myrank);
>     /* host1 exposes no memory; it is only the origin of the Put */
>     MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, comm, &win);
>
> Host2_program:
>
>     void* buffer = mmap(0, buf_length, PROT_READ | PROT_WRITE,
>                         MAP_SHARED | MAP_POPULATE | MAP_ANON, -1, 0);
>     MPI_Intercomm_merge(connected_comm, 1, &comm);
>     MPI_Comm_rank(comm, &myrank);
>     /* host2 exposes buffer; it is the target of the Put */
>     MPI_Win_create(buffer, buf_length, 1, MPI_INFO_NULL, comm, &win);
>
>
>
> Once the MPI window has been created, I execute the following on host1:
>
> Host1_program (continued):
>
>     err = MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
>     if (err != MPI_SUCCESS)
>         return 1;
>
>     err = MPI_Put(buffer, buf_length, MPI_BYTE, target_rank, 0,
>                   buf_length, MPI_BYTE, win);
>     if (err != MPI_SUCCESS)
>         return 1;
>
>     /* hangs here unless host2 is inside an MPI call */
>     err = MPI_Win_flush(target_rank, win);
>     if (err != MPI_SUCCESS)
>         return 1;
>
>     err = MPI_Win_unlock(target_rank, win);
>     if (err != MPI_SUCCESS)
>         return 1;
>
> While this is running, host2’s program is simply sleeping in a getchar().
>
>
> This should run to completion. However, if the target host has not called another MPI function such as MPI_Recv, and is instead sleeping, blocked in getchar(), etc., the MPI_Win_flush above simply hangs.
>
>
> Why does the one-sided operation hang until another MPI operation is called on the target host? My understanding is that MPI_Win_flush does not require any action by the target host to complete the operation.
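>
> For reference, the flush does complete if host2 blocks in an MPI call instead of getchar(). A minimal sketch of that variant (the zero-byte message is just something for the receive to complete on; host1 sends it after the unlock):
>
>     /* host2, instead of getchar(): blocking in MPI_Recv lets MPICH make progress */
>     MPI_Recv(NULL, 0, MPI_BYTE, MPI_ANY_SOURCE, 0, connected_comm, MPI_STATUS_IGNORE);
>
>     /* host1, after MPI_Win_unlock: */
>     MPI_Send(NULL, 0, MPI_BYTE, 0, 0, connected_comm);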
>
>
> This is a stack trace of where the flush gets stuck:
>
> #0  0x0000003b456dea28 in poll () from /lib64/libc.so.6
> #1  0x00007fe1c08004ee in MPID_nem_tcp_connpoll () from /usr/local/lib/libmpich.so.12
> #2  0x00007fe1c07efab7 in MPIDI_CH3I_Progress () from /usr/local/lib/libmpich.so.12
> #3  0x00007fe1c07d35ff in MPIDI_CH3I_Wait_for_lock_granted () from /usr/local/lib/libmpich.so.12
> #4  0x00007fe1c07d88bf in MPIDI_Win_flush () from /usr/local/lib/libmpich.so.12
> #5  0x00007fe1c08c1b46 in PMPI_Win_flush () from /usr/local/lib/libmpich.so.12
> #6  0x000000000040115c in main ()
>
>
> According to this, it seems the remote host is holding something after MPI_Win_create that, in my scenario, gets released if I call MPI_Recv on the connected_comm from above after MPI_Win_create.
>
>
> Thank you for your time and I look forward to your reply.
>
>
> Respectfully,
>
> Neil Spruit
>
>
>
> P.S. MPICH version info
>
> Version: 3.1
>
> Release Date: Thu Feb 20 11:41:13 CST 2014
>
>
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss