[mpich-discuss] Is it allowed to attach automatic array for remote access with MPI_Win_attach?

Jeff Hammond jeff.science at gmail.com
Wed Apr 20 19:09:34 CDT 2016


On Wed, Apr 20, 2016 at 7:31 AM, Maciej Szpindler <m.szpindler at icm.edu.pl>
wrote:

> Many thanks for the correction and explanation, Jeff. Now it is fixed.
>
> I am wondering, though this is likely not MPICH specific, whether my
> approach is correct. I would like to replace the send-receive halo exchange
> module of a larger application with an RMA PSCW scheme. The basic
> implementation performs poorly and I was looking for an improvement. The
> code structure requires this module to initialize memory windows with
> every halo exchange. I have tried to address this with dynamic windows
> but, as you have pointed out, additional synchronization is then required
> and the potential benefits of the PSCW approach are gone.
>
> Should one expect a good implementation of the PSCW scheme to be
> competitive with message passing?
>
>
If you find that PSCW with dynamic windows beats Send-Recv, I will be
shocked.  I'll spare you the implementation details, but there is
absolutely no reason why this should happen in MPICH, or in any other
normal implementation.  The only way I can see it winning is on a
shared-memory machine where dynamic windows could leverage XPMEM.

This assumes you are not using MPI_Accumulate, by the way; MPI_Accumulate
would likely be more efficient than the equivalent Send-Recv code, even
with a purely active-message implementation of RMA.
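
For illustration only, this is roughly what that would look like for a
halo that sums contributions (the operation, ranks, and buffer names are
placeholders rather than your code):

  ! Add my contribution into the neighbour's halo instead of replacing it.
  CALL MPI_Accumulate(send_buffer, buffer_size, MPI_REAL8,       &
                      target_rank, disp, buffer_size, MPI_REAL8, &
                      MPI_SUM, win, ierror)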

Now, if you can find a way to use MPI_Win_allocate, you might get some
benefit, especially within the node, since in that case RMA will translate
directly to load-store in shared memory.
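
For reference, a minimal sketch of the MPI_Win_allocate route, assuming
the mpi module provides the TYPE(C_PTR) overload (MPI-3 requires it when
the compiler supports C interoperability); the names are illustrative:

  USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
  TYPE(C_PTR) :: base_ptr
  REAL(KIND=8), POINTER :: recv_buffer(:,:,:)
  INTEGER(KIND=MPI_ADDRESS_KIND) :: win_size
  INTEGER :: win, ierror

  win_size = 8_MPI_ADDRESS_KIND * halo_size * rows * levels
  ! Let MPI allocate the window memory; on-node RMA can then be plain
  ! load-store into shared memory.
  CALL MPI_Win_allocate(win_size, 8, MPI_INFO_NULL, MPI_COMM_WORLD, &
                        base_ptr, win, ierror)
  CALL C_F_POINTER(base_ptr, recv_buffer, [halo_size, rows, levels])
  ! Create the window once, reuse it for every halo exchange, and free
  ! it at the end; displacements work just as with MPI_Win_create.

The key point is to hoist the window creation out of the per-exchange
path so the same window can be reused.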

A different way to optimize halo exchange is with neighborhood
collectives.  These may not be optimized in MPICH today, but Torsten has a
paper showing how they can be, and they are more general than most of the
optimizations available for RMA.
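
As a rough sketch of that route for a 1-D decomposition (the Cartesian
layout and the send_halo/recv_halo buffers are assumptions on my part):

  INTEGER :: cart_comm, dims(1), ierror
  LOGICAL :: periods(1)

  dims = 0
  periods = .FALSE.
  CALL MPI_Dims_create(comm_size, 1, dims, ierror)
  CALL MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, .TRUE., &
                       cart_comm, ierror)
  ! send_halo/recv_halo hold one slab of buffer_size elements per
  ! Cartesian neighbour (two in 1-D); the whole exchange is one call.
  CALL MPI_Neighbor_alltoall(send_halo, buffer_size, MPI_REAL8, &
                             recv_halo, buffer_size, MPI_REAL8, &
                             cart_comm, ierror)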

Best,

Jeff


> Regards,
> Maciej
>
> On 19.04.2016 at 17:48, Jeff Hammond wrote:
>
>> When you use dynamic windows, you must use the virtual address of the
>> remote memory as the offset.  That means you must attach a buffer and then
>> get the address with MPI_GET_ADDRESS.  Then you must share that address
>> with any processes that target that memory, perhaps using
>> MPI_SEND/MPI_RECV
>> or MPI_ALLGATHER of an address-sized integer (MPI_AINT is the MPI type
>> corresponding to the MPI_Aint C type).  It appears you are not doing this.
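>>
>> A minimal sketch of that exchange in terms of your recv_buffer, with
>> illustrative scalar neighbour ranks (boundary ranks would use
>> MPI_PROC_NULL) and my_disp/target_disp declared as
>> INTEGER(KIND=MPI_ADDRESS_KIND):
>>
>>   CALL MPI_Win_attach(win, recv_buffer, win_size, ierror)
>>   CALL MPI_Get_address(recv_buffer, my_disp, ierror)
>>   ! Send my buffer's address to the rank that will write into it and
>>   ! receive the address of the buffer I must write into.
>>   CALL MPI_Sendrecv(my_disp, 1, MPI_AINT, origin_rank, 0,     &
>>                     target_disp, 1, MPI_AINT, target_rank, 0, &
>>                     MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierror)
>>   ! The origin then passes target_disp, not 0, as the displacement.
>>   CALL MPI_Put(send_buffer, buffer_size, MPI_REAL8, target_rank, &
>>                target_disp, buffer_size, MPI_REAL8, win, ierror)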
>>
>> This issue should affect you whether you use automatic arrays or heap
>> data...
>>
>> It does not appear to be a problem here, but if you use automatic arrays
>> with RMA, you must guarantee that they remain in scope for as long as
>> they may be accessed remotely.  I think you are doing this sufficiently
>> with a barrier.  However, at the point at which you call a barrier to
>> ensure they stay in scope, you lose all of the benefits of fine-grained
>> synchronization from PSCW.  You might as well just use
>> MPI_Win_fence.
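>>
>> For completeness, a minimal fence-based sketch of the same exchange,
>> assuming a window created with MPI_Win_create on recv_buffer so that a
>> zero displacement is valid (the fence is collective over the window's
>> communicator):
>>
>>   CALL MPI_Win_fence(0, win, ierror)   ! open the epoch on all processes
>>   If (my_rank /= 0) Then
>>     CALL MPI_Put(send_buffer, buffer_size, MPI_REAL8,           &
>>                  my_rank - 1, disp, buffer_size, MPI_REAL8, win, ierror)
>>   End If
>>   CALL MPI_Win_fence(0, win, ierror)   ! complete all RMA; data is visible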
>>
>> There is a sentence in the MPI spec that says that, strictly speaking,
>> using memory not allocated by MPI_Alloc_mem (or MPI_Win_allocate(_shared),
>> of course) in RMA is not portable, but I don't know of any implementation
>> that actually enforces this.  MPICH has an active-message implementation of
>> RMA, which does not care what storage class is involved, up to performance
>> differences (interprocess shared memory is faster in some cases).
>>
>> This is a fairly complicated topic and it is possible that I have been a
>> bit crude in summarizing the MPI standard, so I apologize to any MPI Forum
>> experts who can find fault in what I've written :-)
>>
>> Jeff
>>
>> On Tue, Apr 19, 2016 at 5:51 AM, Maciej Szpindler <m.szpindler at icm.edu.pl>
>> wrote:
>>
>>> This is a simplified version of my routine. It may look odd, but I am
>>> trying to migrate from a send/recv scheme to one-sided PSCW, and that
>>> is the reason for the buffers etc. As long as dynamic windows are not used,
>>> it works fine (I believe). When I switch to dynamic windows, it fails
>>> with a segmentation fault. I would appreciate any comments and suggestions
>>> on how to improve this.
>>>
>>> SUBROUTINE swap_simple_rma(field, row_length, rows, levels, halo_size)
>>>
>>> USE mpi
>>>
>>> IMPLICIT NONE
>>>
>>> INTEGER, INTENT(IN) :: row_length
>>> INTEGER, INTENT(IN) :: rows
>>> INTEGER, INTENT(IN) :: levels
>>> INTEGER, INTENT(IN) :: halo_size
>>> REAL(KIND=8), INTENT(INOUT) :: field(1:row_length, 1:rows, levels)
>>> REAL(KIND=8) :: send_buffer(halo_size, rows, levels)
>>> REAL(KIND=8) :: recv_buffer(halo_size, rows, levels)
>>> INTEGER  :: buffer_size
>>> INTEGER ::  i,j,k
>>> INTEGER(kind=MPI_INTEGER_KIND)  :: ierror
>>> INTEGER(kind=MPI_INTEGER_KIND) :: my_rank, comm_size
>>> Integer(kind=MPI_INTEGER_KIND) :: win, win_info
>>> Integer(kind=MPI_INTEGER_KIND) :: my_group, origin_group, target_group
>>> Integer(kind=MPI_INTEGER_KIND), DIMENSION(1) :: target_rank, origin_rank
>>> Integer(kind=MPI_ADDRESS_KIND) :: win_size, disp
>>>
>>>   CALL MPI_Comm_Rank(MPI_COMM_WORLD, my_rank, ierror)
>>>   CALL MPI_Comm_Size(MPI_COMM_WORLD, comm_size, ierror)
>>>
>>>   buffer_size = halo_size * rows * levels
>>>
>>>   CALL MPI_Info_create(win_info, ierror)
>>>   CALL MPI_Info_set(win_info, "no_locks", "true", ierror)
>>>
>>>   CALL MPI_Comm_group(MPI_COMM_WORLD, my_group, ierror)
>>>
>>>   If (my_rank /= comm_size - 1) Then
>>>     origin_rank = my_rank + 1
>>>     CALL MPI_Group_incl(my_group, 1, origin_rank, origin_group, ierror)
>>>     win_size = 8*buffer_size
>>>   Else
>>>     origin_group = MPI_GROUP_EMPTY
>>>     win_size = 0
>>>   End If
>>>
>>>   CALL MPI_Win_create_dynamic(win_info, MPI_COMM_WORLD, win, ierror)
>>> !! CALL MPI_Win_create(recv_buffer, win_size,      &
>>> !!        8, win_info, MPI_COMM_WORLD, win, ierror)
>>>   CALL MPI_Win_attach(win, recv_buffer, win_size, ierror)
>>>
>>>   CALL MPI_Barrier(MPI_COMM_WORLD, ierror)
>>>
>>>   CALL MPI_Win_post(origin_group, MPI_MODE_NOSTORE, win, ierror)
>>>
>>>   ! Prepare buffer
>>>      DO k=1,levels
>>>        DO j=1,rows
>>>          DO i=1,halo_size
>>>            send_buffer(i,j,k)=field(i,j,k)
>>>          END DO ! I
>>>         END DO ! J
>>>       END DO ! K
>>>
>>>   If (my_rank /= 0 ) Then
>>>      target_rank = my_rank - 1
>>>      CALL MPI_Group_incl(my_group, 1, target_rank, target_group, ierror)
>>>   Else
>>>      target_group = MPI_GROUP_EMPTY
>>>   End If
>>>
>>>   CALL MPI_Win_start(target_group, 0, win, ierror)
>>>
>>>   disp = 0
>>>
>>>   If (my_rank /= 0) Then
>>>     CALL MPI_Put(send_buffer, buffer_size, MPI_REAL8,   &
>>>         my_rank - 1, disp, buffer_size, MPI_REAL8, win, ierror)
>>>   End If
>>>   CALL MPI_Win_complete(win, ierror)
>>>
>>>   CALL MPI_Barrier(MPI_COMM_WORLD, ierror)
>>>   write (0,*) 'Put OK'
>>>   CALL MPI_Barrier(MPI_COMM_WORLD, ierror)
>>>
>>>   CALL MPI_Win_wait(win, ierror)
>>>
>>>   ! Read from buffer
>>>   If (my_rank /= comm_size -1 ) Then
>>>       DO k=1,levels
>>>         DO j=1,rows
>>>           DO i=1,halo_size
>>>             field(row_length+i,j,k) =  recv_buffer(i,j,k)
>>>           END DO
>>>         END DO
>>>       END DO
>>>   End if
>>>
>>>   CALL MPI_Win_detach(win, recv_buffer, ierror)
>>>   CALL MPI_Win_free(win, ierror)
>>>
>>> END SUBROUTINE swap_simple_rma
>>>
>>> Best Regards,
>>> Maciej
>>>
>>> On 14.04.2016 at 19:21, Thakur, Rajeev wrote:
>>>
>>>> After the Win_attach, did you add a barrier or some other form of
>>>> synchronization? The put shouldn’t happen before Win_attach returns.
>>>>
>>>> Rajeev
>>>>
>>>>> On Apr 14, 2016, at 10:56 AM, Maciej Szpindler <m.szpindler at icm.edu.pl>
>>>>> wrote:
>>>>>
>>>>> Dear All,
>>>>>
>>>>> I am trying to use dynamic RMA windows in Fortran. In my case I would
>>>>> like to attach an automatic array to a dynamic window. The question is
>>>>> whether this is correct and allowed in MPICH. I feel that it is not
>>>>> working, at least in cray-mpich/7.3.2.
>>>>>
>>>>> I have a subroutine that uses RMA windows:
>>>>>
>>>>> SUBROUTINE foo(x, y, z , ...)
>>>>>
>>>>> USE mpi
>>>>> ...
>>>>>
>>>>> INTEGER, INTENT(IN) :: x, y, z
>>>>> REAL(KIND=8) :: buffer(x, y, z)
>>>>> INTEGER(kind=MPI_INTEGER_KIND) :: win_info, win, comm
>>>>> INTEGER(kind=MPI_INTEGER_KIND) :: buff_size
>>>>> ...
>>>>>
>>>>> buff_size = x*y*z*8
>>>>>
>>>>> CALL MPI_Info_create(win_info, ierror)
>>>>> CALL MPI_Info_set(win_info, "no_locks", "true", ierror)
>>>>>
>>>>> CALL MPI_Win_create_dynamic(win_info, comm, win, ierror)
>>>>>
>>>>> CALL MPI_Win_attach(win, buffer, buff_size, ierror)
>>>>>
>>>>> ...
>>>>>
>>>>> This produces a segmentation fault when MPI_Put is called on the window,
>>>>> while exactly the same routine with a static MPI_Win_create on the
>>>>> buffer, instead of create_dynamic+attach, works fine. As far as I
>>>>> understand, the buffer is in this case "simply contiguous" in the sense
>>>>> of the MPI Standard. Any help would be appreciated!
>>>>>
>>>>> Best Regards,
>>>>> Maciej



-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

