<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Apr 20, 2016 at 7:31 AM, Maciej Szpindler <span dir="ltr"><<a href="mailto:m.szpindler@icm.edu.pl" target="_blank">m.szpindler@icm.edu.pl</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Many thanks for the correction and explanation, Jeff. Now it is fixed.<br>
<br>
I am wondering, though this is likely not MPICH specific, whether my approach<br>
is correct. I would like to replace the send-receive halo exchange module<br>
of a larger application with an RMA PSCW scheme. A basic implementation<br>
performs poorly and I was looking for improvements. The code structure<br>
requires this module to initialize memory windows on every halo<br>
exchange. I have tried to address this with dynamic windows but, as<br>
you have pointed out, additional synchronization is then required and<br>
the potential benefits of the PSCW approach are gone.<br>
<br>
Should it be expected that a good implementation of the PSCW scheme is<br>
competitive with message passing?<br>
<br></blockquote><div><br></div><div>If you find that PSCW with dynamic windows beats Send-Recv, I will be shocked. I'll spare you the implementation details, but there is absolutely no reason why this should happen in MPICH, or in any other normal implementation. The only way I can see it winning is on a shared-memory machine where dynamic windows could leverage XPMEM.</div><div><br></div><div>This assumes you are not using MPI_Accumulate, by the way, since if you were to use MPI_Accumulate, it would likely be more efficient than the equivalent code in Send-Recv, even with a purely active-message implementation of RMA.</div><div><br></div><div>Now, if you can find a way to use MPI_Win_allocate, you might get some benefit, especially within the node, since in that case RMA will translate directly to load-store in shared memory.</div><div><br></div><div>A different way to optimize halo exchange is with neighborhood collectives. These may not be optimized in MPICH today, but Torsten has a paper showing how they can be optimized, and they are more general than most of the optimizations for RMA. A rough sketch follows below.</div><div><br></div>
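<div>For what it's worth, a minimal sketch of the neighborhood-collective variant, not the actual module: it assumes a 1-D, non-periodic decomposition and reuses names (halo_size, rows, levels, comm_size, field) from the routine quoted below; everything else is illustrative.</div><div><br></div>
! Sketch only: halo exchange via a Cartesian topology and MPI_Neighbor_alltoall.<br>
INTEGER(kind=MPI_INTEGER_KIND) :: cart_comm, dims(1), ierror<br>
LOGICAL :: periods(1)<br>
REAL(KIND=8) :: send_halo(halo_size*rows*levels, 2)   ! (:,1) left neighbor, (:,2) right neighbor<br>
REAL(KIND=8) :: recv_halo(halo_size*rows*levels, 2)<br>
<br>
dims(1)    = comm_size<br>
periods(1) = .FALSE.<br>
CALL MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, .FALSE., cart_comm, ierror)<br>
<br>
! ... pack the left and right halos of field into send_halo ...<br>
<br>
! One call moves both halos; missing neighbors at the domain ends are<br>
! MPI_PROC_NULL and are skipped automatically.<br>
CALL MPI_Neighbor_alltoall(send_halo, halo_size*rows*levels, MPI_REAL8, &<br>
                           recv_halo, halo_size*rows*levels, MPI_REAL8, &<br>
                           cart_comm, ierror)<br>
<br>
! ... unpack recv_halo into the halo regions of field ...<br>
<div><br></div><div>Best,</div><div><br></div><div>Jeff</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">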
Regards,<br>
Maciej<br>
<br>
On 19.04.2016 at 17:48, Jeff Hammond wrote:<div class="HOEnZb"><div class="h5"><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
When you use dynamic windows, you must use the virtual address of the<br>
remote memory as the offset. That means you must attach a buffer and then<br>
get its address with MPI_GET_ADDRESS. Then you must share that address<br>
with any process that targets that memory, perhaps using MPI_SEND/MPI_RECV<br>
or MPI_ALLGATHER of an address-sized integer (MPI_AINT is the MPI datatype<br>
corresponding to the MPI_Aint C type). It appears you are not doing this.<br>
<br>
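For illustration, a minimal sketch of that address exchange, using MPI_Allgather and<br>
assuming every rank attaches a recv_buffer of the same size (my_addr and all_addr are<br>
made-up names; the rest follows the routine below):<br>
<br>
INTEGER(KIND=MPI_ADDRESS_KIND) :: my_addr<br>
INTEGER(KIND=MPI_ADDRESS_KIND), ALLOCATABLE :: all_addr(:)<br>
...<br>
ALLOCATE(all_addr(0:comm_size-1))<br>
CALL MPI_Win_attach(win, recv_buffer, win_size, ierror)<br>
CALL MPI_Get_address(recv_buffer, my_addr, ierror)<br>
! Every rank learns where every other rank attached its buffer.<br>
CALL MPI_Allgather(my_addr, 1, MPI_AINT, all_addr, 1, MPI_AINT, &<br>
                   MPI_COMM_WORLD, ierror)<br>
! Synchronize so no rank targets memory that is not yet attached.<br>
CALL MPI_Barrier(MPI_COMM_WORLD, ierror)<br>
...<br>
If (my_rank /= 0) Then<br>
   ! With a dynamic window the displacement is the remote virtual address.<br>
   disp = all_addr(my_rank - 1)<br>
   CALL MPI_Put(send_buffer, buffer_size, MPI_REAL8, my_rank - 1, disp, &<br>
                buffer_size, MPI_REAL8, win, ierror)<br>
End If<br>
<br>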
This issue should affect you whether you use automatic arrays or heap<br>
data...<br>
<br>
It does not appear to be a problem here, but if you use automatic arrays<br>
with RMA, you must guarantee that they remain in scope for as long as<br>
they may be accessed remotely. I think you are doing this sufficiently<br>
with a barrier. However, at the point at which you are calling a barrier<br>
to keep them in scope, you lose all of the benefits of fine-grained<br>
synchronization from PSCW. You might as well just use MPI_Win_fence.<br>
<br>
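To be concrete, here is what a fence-synchronized version of the exchange could<br>
look like (a sketch only, reusing the names from the routine below):<br>
<br>
CALL MPI_Win_create(recv_buffer, win_size, 8, win_info, MPI_COMM_WORLD, win, ierror)<br>
...<br>
disp = 0   ! regular window: displacement is counted in disp_units from the window base<br>
CALL MPI_Win_fence(0, win, ierror)<br>
If (my_rank /= 0) Then<br>
   CALL MPI_Put(send_buffer, buffer_size, MPI_REAL8, my_rank - 1, disp, &<br>
                buffer_size, MPI_REAL8, win, ierror)<br>
End If<br>
CALL MPI_Win_fence(0, win, ierror)<br>
! After the second fence, recv_buffer holds the neighbor's halo.<br>
<br>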
There is a sentence in the MPI spec that says that, strictly speaking,<br>
using memory not allocated by MPI_Alloc_mem (or MPI_Win_allocate(_shared),<br>
of course) in RMA is not portable, but I don't know any implementation that<br>
actually behaves this way. MPICH has an active-message implementation of<br>
RMA, which does not care what storage class is involved, up to performance<br>
differences (interprocess shared memory is faster in some cases).<br>
<br>
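For completeness, a sketch of letting MPI allocate the halo buffer with MPI_Win_allocate,<br>
as mentioned above (this assumes the TYPE(C_PTR) variant of the Fortran binding is<br>
available, as it should be with an MPI-3 library; the names are illustrative):<br>
<br>
USE, INTRINSIC :: iso_c_binding, ONLY : c_ptr, c_f_pointer<br>
...<br>
TYPE(c_ptr) :: base<br>
REAL(KIND=8), POINTER :: recv_buffer(:,:,:)<br>
INTEGER(KIND=MPI_ADDRESS_KIND) :: win_size<br>
...<br>
win_size = 8_MPI_ADDRESS_KIND * halo_size * rows * levels<br>
! MPI allocates the window memory; within a node, RMA on such a window<br>
! can turn into plain loads and stores in shared memory.<br>
CALL MPI_Win_allocate(win_size, 8, MPI_INFO_NULL, MPI_COMM_WORLD, base, win, ierror)<br>
CALL c_f_pointer(base, recv_buffer, [halo_size, rows, levels])<br>
! recv_buffer can now be used like the automatic array, and the window<br>
! can be created once and reused for every exchange.<br>
<br>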
This is a fairly complicated topic and it is possible that I have been a<br>
bit crude in summarizing the MPI standard, so I apologize to any MPI Forum<br>
experts who can find fault in what I've written :-)<br>
<br>
Jeff<br>
<br>
On Tue, Apr 19, 2016 at 5:51 AM, Maciej Szpindler <<a href="mailto:m.szpindler@icm.edu.pl" target="_blank">m.szpindler@icm.edu.pl</a>><br>
wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
This is a simplified version of my routine. It may look odd, but I am<br>
trying to migrate from a send/recv scheme to one-sided PSCW, and that<br>
is the reason for the buffers etc. As long as dynamic windows are not used,<br>
it works fine (I believe). When I switch to dynamic windows it fails<br>
with a segmentation fault. I would appreciate any comments and suggestions<br>
on how to improve this.<br>
<br>
SUBROUTINE swap_simple_rma(field, row_length, rows, levels, halo_size)<br>
<br>
USE mpi<br>
<br>
IMPLICIT NONE<br>
<br>
INTEGER, INTENT(IN) :: row_length<br>
INTEGER, INTENT(IN) :: rows<br>
INTEGER, INTENT(IN) :: levels<br>
INTEGER, INTENT(IN) :: halo_size<br>
REAL(KIND=8), INTENT(INOUT) :: field(1:row_length, 1:rows, levels)<br>
REAL(KIND=8) :: send_buffer(halo_size, rows, levels)<br>
REAL(KIND=8) :: recv_buffer(halo_size, rows, levels)<br>
INTEGER :: buffer_size<br>
INTEGER :: i,j,k<br>
INTEGER(kind=MPI_INTEGER_KIND) :: ierror<br>
INTEGER(kind=MPI_INTEGER_KIND) :: my_rank, comm_size<br>
Integer(kind=MPI_INTEGER_KIND) :: win, win_info<br>
Integer(kind=MPI_INTEGER_KIND) :: my_group, origin_group, target_group<br>
Integer(kind=MPI_INTEGER_KIND), DIMENSION(1) :: target_rank, origin_rank<br>
Integer(kind=MPI_ADDRESS_KIND) :: win_size, disp<br>
<br>
CALL MPI_Comm_Rank(MPI_COMM_WORLD, my_rank, ierror)<br>
CALL MPI_Comm_Size(MPI_COMM_WORLD, comm_size, ierror)<br>
<br>
buffer_size = halo_size * rows * levels<br>
<br>
CALL MPI_Info_create(win_info, ierror)<br>
CALL MPI_Info_set(win_info, "no_locks", "true", ierror)<br>
<br>
CALL MPI_Comm_group(MPI_COMM_WORLD, my_group, ierror)<br>
<br>
If (my_rank /= comm_size - 1) Then<br>
origin_rank = my_rank + 1<br>
CALL MPI_Group_incl(my_group, 1, origin_rank, origin_group, ierror)<br>
win_size = 8*buffer_size<br>
Else<br>
origin_group = MPI_GROUP_EMPTY<br>
win_size = 0<br>
End If<br>
<br>
CALL MPI_Win_create_dynamic(win_info, MPI_COMM_WORLD, win, ierror)<br>
!! CALL MPI_Win_create(recv_buffer, win_size, &<br>
!! 8, win_info, MPI_COMM_WORLD, win, ierror)<br>
CALL MPI_Win_attach(win, recv_buffer, win_size, ierror)<br>
<br>
CALL MPI_Barrier(MPI_COMM_WORLD, ierror)<br>
<br>
CALL MPI_Win_post(origin_group, MPI_MODE_NOSTORE, win, ierror)<br>
<br>
! Prepare buffer<br>
DO k=1,levels<br>
DO j=1,rows<br>
DO i=1,halo_size<br>
send_buffer(i,j,k)=field(i,j,k)<br>
END DO ! I<br>
END DO ! J<br>
END DO ! K<br>
<br>
If (my_rank /= 0 ) Then<br>
target_rank = my_rank - 1<br>
CALL MPI_Group_incl(my_group, 1, target_rank, target_group, ierror)<br>
Else<br>
target_group = MPI_GROUP_EMPTY<br>
End If<br>
<br>
CALL MPI_Win_start(target_group, 0, win, ierror)<br>
<br>
disp = 0<br>
<br>
If (my_rank /= 0) Then<br>
CALL MPI_Put(send_buffer, buffer_size, MPI_REAL8, &<br>
my_rank - 1, disp, buffer_size, MPI_REAL8, win, ierror)<br>
End If<br>
CALL MPI_Win_complete(win, ierror)<br>
<br>
CALL MPI_Barrier(MPI_COMM_WORLD, ierror)<br>
write (0,*) 'Put OK'<br>
CALL MPI_Barrier(MPI_COMM_WORLD, ierror)<br>
<br>
CALL MPI_Win_wait(win, ierror)<br>
<br>
! Read from buffer<br>
If (my_rank /= comm_size -1 ) Then<br>
DO k=1,levels<br>
DO j=1,rows<br>
DO i=1,halo_size<br>
field(row_length+i,j,k) = recv_buffer(i,j,k)<br>
END DO<br>
END DO<br>
END DO<br>
End if<br>
<br>
CALL MPI_Win_detach(win, recv_buffer, ierror)<br>
CALL MPI_Win_free(win, ierror)<br>
<br>
END SUBROUTINE swap_simple_rma<br>
<br>
Best Regards,<br>
Maciej<br>
<br>
On 14.04.2016 at 19:21, Thakur, Rajeev wrote:<br>
<br>
After the Win_attach, did you add a barrier or some other form of synchronization? The put shouldn’t happen before Win_attach returns.<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Rajeev<br>
<br>
On Apr 14, 2016, at 10:56 AM, Maciej Szpindler <<a href="mailto:m.szpindler@icm.edu.pl" target="_blank">m.szpindler@icm.edu.pl</a>> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Dear All,<br>
<br>
I am trying to use dynamic RMA windows in Fortran. In my case I would<br>
like to attach an automatic array to a dynamic window. The question is<br>
whether this is correct and allowed in MPICH. I feel that it is not working,<br>
at least in cray-mpich/7.3.2.<br>
<br>
I have a subroutine that uses RMA windows:<br>
<br>
SUBROUTINE foo(x, y, z , ...)<br>
<br>
USE mpi<br>
...<br>
<br>
INTEGER, INTENT(IN) :: x, y, z<br>
REAL(KIND=8) :: buffer(x, y, z)<br>
INTEGER(kind=MPI_INTEGER_KIND) :: win_info, win, comm<br>
INTEGER(kind=MPI_INTEGER_KIND) :: buff_size<br>
...<br>
<br>
buff_size = x*y*z*8<br>
<br>
CALL MPI_Info_create(win_info, ierror)<br>
CALL MPI_Info_set(win_info, "no_locks", "true", ierror)<br>
<br>
CALL MPI_Win_create_dynamic(win_info, comm, win, ierror)<br>
<br>
CALL MPI_Win_attach(win, buffer, buff_size, ierror)<br>
<br>
...<br>
<br>
This produces a segmentation fault when MPI_Put is called on the window,<br>
while exactly the same routine with a static MPI_Win_create on buffer<br>
instead of create_dynamic+attach works fine. As far as I understand,<br>
buffer is in this case "simply contiguous" in the sense of the<br>
MPI standard. Any help would be appreciated!<br>
<br>
Best Regards,<br>
Maciej<br>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
_______________________________________________<br>
discuss mailing list <a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a><br>
To manage subscription options or unsubscribe:<br>
<a href="https://lists.mpich.org/mailman/listinfo/discuss" rel="noreferrer" target="_blank">https://lists.mpich.org/mailman/listinfo/discuss</a><br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">Jeff Hammond<br><a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br><a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>
</div></div>