<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Apr 20, 2016 at 7:31 AM, Maciej Szpindler <span dir="ltr"><<a href="mailto:m.szpindler@icm.edu.pl" target="_blank">m.szpindler@icm.edu.pl</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Many thanks for the correction and explanation, Jeff. Now it is fixed.<br>


<br>


I am wondering, though this is likely not MPICH specific, whether my approach<br>


is correct. I would like to replace the send-receive halo exchange module<br>


of a larger application with an RMA PSCW scheme. A basic implementation<br>


performs poorly and I was looking for an improvement. The code structure<br>


requires this module to initialize memory windows with every halo<br>


exchange. I have tried to address this with dynamic windows but, as<br>


you have pointed out, additional synchronization is then required and<br>


the potential benefits of the PSCW approach are gone.<br>


<br>


Should a good implementation of the PSCW scheme be expected to perform<br>


comparably to message passing?<br>


<br></blockquote><div><br></div><div>If you find that PSCW with dynamic windows beats Send-Recv, I will be shocked.  I'll spare you the implementation details, but there is absolutely no reason why this should happen in MPICH, or in any other normal implementation.  The only way I can see it winning is with a shared memory machine where dynamic windows could leverage XPMEM.</div><div><br></div><div>This assumes you are not using MPI_Accumulate, by the way, since if you were to use MPI_Accumulate, it would likely be more efficient than the equivalent code in Send-Recv, even with a purely active message implementation of RMA.</div><div><br></div><div>Now, if you can find a way to use MPI_Win_allocate, you might get some benefit, especially within the node, since in that case RMA will translate directly to load-store in shared memory.</div><div><br></div><div>A different way to optimize halo exchange is with neighborhood collectives.  These may not be optimized in MPICH today, but Torsten has a paper showing how they can be optimized and these are more general than most of the optimizations for RMA.</div><div><br></div><div>Best,</div><div><br></div><div>Jeff</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


Regards,<br>


Maciej<br>


<br>


On 19.04.2016 at 17:48, Jeff Hammond wrote:<div class="HOEnZb"><div class="h5"><br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


When you use dynamic windows, you must use the virtual address of the<br>


remote memory as the offset.  That means you must attach a buffer and then<br>


get the address with MPI_GET_ADDRESS.  Then you must share that address<br>


with any processes that target that memory, perhaps using MPI_SEND/MPI_RECV<br>


or MPI_ALLGATHER of an address-sized integer (MPI_AINT is the MPI type<br>


corresponding to the MPI_Aint C type).  It appears you are not doing this.<br>
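<br>
A minimal sketch of the steps above, not taken from the posted routine: each rank attaches recv_buffer, gets its address with MPI_Get_address, and passes that address to the neighbour that will target it. The program name dyn_win_sketch, the variables left, right, my_addr and remote_addr, and the buffer length n are illustrative assumptions; MPI_Win_fence is used for synchronization only to keep the example short.<br>
<pre>
PROGRAM dyn_win_sketch
  USE mpi
  IMPLICIT NONE

  INTEGER, PARAMETER :: n = 4          ! illustrative buffer length
  REAL(KIND=8) :: send_buffer(n), recv_buffer(n)
  INTEGER :: ierror, my_rank, comm_size, left, right, win
  INTEGER(KIND=MPI_ADDRESS_KIND) :: win_size, my_addr, remote_addr

  CALL MPI_Init(ierror)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
  CALL MPI_Comm_size(MPI_COMM_WORLD, comm_size, ierror)

  ! Each rank puts into its left neighbour; edge ranks use MPI_PROC_NULL
  left = my_rank - 1
  IF (my_rank == 0) left = MPI_PROC_NULL
  right = my_rank + 1
  IF (my_rank == comm_size - 1) right = MPI_PROC_NULL

  send_buffer = REAL(my_rank, KIND=8)
  recv_buffer = -1.0d0

  ! Create the dynamic window and attach the local receive buffer
  CALL MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, win, ierror)
  win_size = 8 * n
  CALL MPI_Win_attach(win, recv_buffer, win_size, ierror)

  ! The address of the attached buffer is what remote ranks use as offset
  CALL MPI_Get_address(recv_buffer, my_addr, ierror)

  ! Send my address to the rank that targets me (right) and receive
  ! the address of the buffer I will target (left)
  remote_addr = 0
  CALL MPI_Sendrecv(my_addr, 1, MPI_AINT, right, 0,      &
                    remote_addr, 1, MPI_AINT, left, 0,   &
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierror)

  ! The target displacement is the remote address, not 0
  CALL MPI_Win_fence(0, win, ierror)
  CALL MPI_Put(send_buffer, n, MPI_REAL8, left, remote_addr, &
               n, MPI_REAL8, win, ierror)
  CALL MPI_Win_fence(0, win, ierror)

  IF (right /= MPI_PROC_NULL) THEN
    WRITE (*,*) 'rank', my_rank, 'received', recv_buffer(1)
  END IF

  CALL MPI_Win_detach(win, recv_buffer, ierror)
  CALL MPI_Win_free(win, ierror)
  CALL MPI_Finalize(ierror)
END PROGRAM dyn_win_sketch
</pre>
The MPI_Sendrecv also ensures that the neighbour's MPI_Win_attach has returned before any put is issued to it.<br>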


<br>


This issue should affect you whether you use automatic arrays or heap<br>


data...<br>


<br>


It does not appear to be a problem here, but if you use automatic arrays<br>


with RMA, you must guarantee that they remain in scope throughout the<br>


duration of when they will be accessed remotely.  I think you are doing<br>


this sufficiently with a barrier.  However, at the point at which you are<br>


calling barrier to ensure they stay in scope, you lose all of the benefits<br>


of fine-grain synchronization from PSCW.  You might as well just use<br>


MPI_Win_fence.<br>


<br>


There is a sentence in the MPI spec that says that, strictly speaking,<br>


using memory not allocated by MPI_Alloc_mem (or MPI_Win_allocate(_shared),<br>


of course) in RMA is not portable, but I don't know any implementation that<br>


actually behaves this way.  MPICH has an active-message implementation of<br>


RMA, which does not care what storage class is involved, up to performance<br>


differences (interprocess shared memory is faster in some cases).<br>


<br>


This is a fairly complicated topic and it is possible that I have been a<br>


bit crude in summarizing the MPI standard, so I apologize to any MPI Forum<br>


experts who can find fault in what I've written :-)<br>


<br>


Jeff<br>


<br>


On Tue, Apr 19, 2016 at 5:51 AM, Maciej Szpindler <<a href="mailto:m.szpindler@icm.edu.pl" target="_blank">m.szpindler@icm.edu.pl</a>><br>


wrote:<br>


<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


This is a simplified version of my routine. It may look odd but I am<br>


trying to migrate from a send/recv scheme to one-sided PSCW and that<br>


is the reason for buffers etc. As long as dynamic windows are not used,<br>


it works fine (I believe). When I switch to dynamic windows it fails<br>


with a segmentation fault. I would appreciate any comments and suggestions<br>


on how to improve this.<br>


<br>


SUBROUTINE swap_simple_rma(field, row_length, rows, levels, halo_size)<br>


<br>


USE mpi<br>


<br>


IMPLICIT NONE<br>


<br>


INTEGER, INTENT(IN) :: row_length<br>


INTEGER, INTENT(IN) :: rows<br>


INTEGER, INTENT(IN) :: levels<br>


INTEGER, INTENT(IN) :: halo_size<br>


REAL(KIND=8), INTENT(INOUT) :: field(1:row_length, 1:rows, levels)<br>


REAL(KIND=8) :: send_buffer(halo_size, rows, levels)<br>


REAL(KIND=8) :: recv_buffer(halo_size, rows, levels)<br>


INTEGER  :: buffer_size<br>


INTEGER ::  i,j,k<br>


INTEGER(kind=MPI_INTEGER_KIND)  :: ierror<br>


INTEGER(kind=MPI_INTEGER_KIND) :: my_rank, comm_size<br>


Integer(kind=MPI_INTEGER_KIND) :: win, win_info<br>


Integer(kind=MPI_INTEGER_KIND) :: my_group, origin_group, target_group<br>


Integer(kind=MPI_INTEGER_KIND), DIMENSION(1) :: target_rank, origin_rank<br>


Integer(kind=MPI_ADDRESS_KIND) :: win_size, disp<br>


<br>


  CALL MPI_Comm_Rank(MPI_COMM_WORLD, my_rank, ierror)<br>


  CALL MPI_Comm_Size(MPI_COMM_WORLD, comm_size, ierror)<br>


<br>


  buffer_size = halo_size * rows * levels<br>


<br>


  CALL MPI_Info_create(win_info, ierror)<br>


  CALL MPI_Info_set(win_info, "no_locks", "true", ierror)<br>


<br>


  CALL MPI_Comm_group(MPI_COMM_WORLD, my_group, ierror)<br>


<br>


  If (my_rank /= comm_size - 1) Then<br>


    origin_rank = my_rank + 1<br>


    CALL MPI_Group_incl(my_group, 1, origin_rank, origin_group, ierror)<br>


    win_size = 8*buffer_size<br>


  Else<br>


    origin_group = MPI_GROUP_EMPTY<br>


    win_size = 0<br>


  End If<br>


<br>


  CALL MPI_Win_create_dynamic(win_info, MPI_COMM_WORLD, win, ierror)<br>


!! CALL MPI_Win_create(recv_buffer, win_size,      &<br>


!!        8, win_info, MPI_COMM_WORLD, win, ierror)<br>


  CALL MPI_Win_attach(win, recv_buffer, win_size, ierror)<br>


<br>


  CALL MPI_Barrier(MPI_COMM_WORLD, ierror)<br>


<br>


  CALL MPI_Win_post(origin_group, MPI_MODE_NOSTORE, win, ierror)<br>


<br>


  ! Prepare buffer<br>


     DO k=1,levels<br>


       DO j=1,rows<br>


         DO i=1,halo_size<br>


           send_buffer(i,j,k)=field(i,j,k)<br>


         END DO ! I<br>


        END DO ! J<br>


      END DO ! K<br>


<br>


  If (my_rank /= 0 ) Then<br>


     target_rank = my_rank - 1<br>


     CALL MPI_Group_incl(my_group, 1, target_rank, target_group, ierror)<br>


  Else<br>


     target_group = MPI_GROUP_EMPTY<br>


  End If<br>


<br>


  CALL MPI_Win_start(target_group, 0, win, ierror)<br>


<br>


  disp = 0<br>


<br>


  If (my_rank /= 0) Then<br>


    CALL MPI_Put(send_buffer, buffer_size, MPI_REAL8,   &<br>


        my_rank - 1, disp, buffer_size, MPI_REAL8, win, ierror)<br>


  End If<br>


  CALL MPI_Win_complete(win, ierror)<br>


<br>


  CALL MPI_Barrier(MPI_COMM_WORLD, ierror)<br>


  write (0,*) 'Put OK'<br>


  CALL MPI_Barrier(MPI_COMM_WORLD, ierror)<br>


<br>


  CALL MPI_Win_wait(win, ierror)<br>


<br>


  ! Read from buffer<br>


  If (my_rank /= comm_size -1 ) Then<br>


      DO k=1,levels<br>


        DO j=1,rows<br>


          DO i=1,halo_size<br>


            field(row_length+i,j,k) =  recv_buffer(i,j,k)<br>


          END DO<br>


        END DO<br>


      END DO<br>


  End if<br>


<br>


  CALL MPI_Win_detach(win, recv_buffer, ierror)<br>


  CALL MPI_Win_free(win, ierror)<br>


<br>


END SUBROUTINE swap_simple_rma<br>


<br>


Best Regards,<br>


Maciej<br>


<br>


On 14.04.2016 at 19:21, Thakur, Rajeev wrote:<br>


<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


After the Win_attach, did you add a barrier or some other form of<br>


synchronization? The put shouldn’t happen before Win_attach returns.<br>


<br>


Rajeev<br>


<br>


On Apr 14, 2016, at 10:56 AM, Maciej Szpindler <<a href="mailto:m.szpindler@icm.edu.pl" target="_blank">m.szpindler@icm.edu.pl</a>><br>


wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>


Dear All,<br>


<br>


I am trying to use dynamic RMA windows in Fortran. In my case I would<br>


like to attach an automatic array to a dynamic window. The question is whether<br>


this is correct and allowed in MPICH. I feel that it is not working, at<br>


least in cray-mpich/7.3.2.<br>


<br>


I have a subroutine that uses RMA windows:<br>


<br>


SUBROUTINE foo(x, y, z , ...)<br>


<br>


USE mpi<br>


...<br>


<br>


INTEGER, INTENT(IN) :: x, y, z<br>


REAL(KIND=8) :: buffer(x, y, z)<br>


INTEGER(kind=MPI_INTEGER_KIND) :: win_info, win, comm<br>


INTEGER(kind=MPI_INTEGER_KIND) :: buff_size<br>


...<br>


<br>


buff_size = x*y*z*8<br>


<br>


CALL MPI_Info_create(win_info, ierror)<br>


CALL MPI_Info_set(win_info, "no_locks", "true", ierror)<br>


<br>


CALL MPI_Win_create_dynamic(win_info, comm, win, ierror)<br>


<br>


CALL MPI_Win_attach(win, buffer, buff_size, ierror)<br>


<br>


...<br>


<br>


This produces a segmentation fault when MPI_Put is called on the window,<br>


while exactly the same routine code with static MPI_Win_create on<br>


buffer instead of create_dynamic+attach works fine. As far as I<br>


understand, buffer is in this case "simply contiguous" in the sense of<br>


the MPI Standard. Any help would be appreciated!<br>


<br>


Best Regards,<br>


Maciej<br>


_______________________________________________<br>


discuss mailing list     <a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a><br>


To manage subscription options or unsubscribe:<br>


<a href="https://lists.mpich.org/mailman/listinfo/discuss" rel="noreferrer" target="_blank">https://lists.mpich.org/mailman/listinfo/discuss</a><br>


<br>


</blockquote>


<br>




</blockquote>




<br>


</blockquote>


<br>


<br>


<br>


<br>


<br>




<br>


</blockquote>




</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">Jeff Hammond<br><a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br><a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>


</div></div>