[mpich-discuss] MPI RMA

Jeff Hammond jhammond at alcf.anl.gov
Fri Feb 8 13:02:01 CST 2013


I completely agree that the following situation sucks for error
checking and that it cries out for fence_start and fence_finish (and
possibly fence_sync, i.e. start+finish).  That is what the assertions
purport to do, but they don't do it sufficiently.
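
To make the point concrete, here is a toy model in plain C of what a
checker could do if fences came as an explicit start/finish pair. It
is only a sketch: it does not use MPI, and every name in it (toy_win,
toy_fence_start, toy_fence_finish, toy_lock, toy_unlock) is
hypothetical. Because the window records whether a fence epoch is
open, a lock attempt can be rejected on the spot, without knowing
anything about future calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in MPI. */
typedef struct {
    bool fence_epoch_open; /* true between toy_fence_start and toy_fence_finish */
    bool locked;           /* true between toy_lock and toy_unlock */
} toy_win;

typedef enum { TOY_OK = 0, TOY_ERR_SYNC = 1 } toy_status;

/* A fresh window has no open epoch of either kind. */
toy_win toy_win_init(void)
{
    toy_win w = { false, false };
    return w;
}

/* Opens a fence epoch; rejected if one is open or the window is locked. */
toy_status toy_fence_start(toy_win *w)
{
    assert(w != NULL);
    if (w->fence_epoch_open || w->locked)
        return TOY_ERR_SYNC;
    w->fence_epoch_open = true;
    return TOY_OK;
}

/* Closes the fence epoch; rejected if none is open. */
toy_status toy_fence_finish(toy_win *w)
{
    assert(w != NULL);
    if (!w->fence_epoch_open)
        return TOY_ERR_SYNC;
    w->fence_epoch_open = false;
    return TOY_OK;
}

/* Locks the window; rejected inside a fence epoch or if already locked. */
toy_status toy_lock(toy_win *w)
{
    assert(w != NULL);
    if (w->fence_epoch_open || w->locked)
        return TOY_ERR_SYNC;
    w->locked = true;
    return TOY_OK;
}

/* Unlocks the window; rejected if it is not locked. */
toy_status toy_unlock(toy_win *w)
{
    assert(w != NULL);
    if (!w->locked)
        return TOY_ERR_SYNC;
    w->locked = false;
    return TOY_OK;
}
```

With a single MPI_Win_fence call the fence_epoch_open flag cannot be
maintained, since whether the call opens an epoch depends on what the
program does afterwards.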

There is an implicit assumption that RMA operations just enqueue until
the next synchronization call, particularly for the fence case, but
that's terrible for performance if one has a decent network (as Cray
does).

I can't remember what assertions/info there are for windows, but one
practical solution would be to accept that error checking is hard to
impossible in the general case but much easier if the user indicates
that a window will only be used in conjunction with one
synchronization mechanism.
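
As a rough illustration of that idea, the sketch below models a window
that is tagged at creation with the one synchronization mechanism it
will use, so that any call belonging to a different mechanism can be
flagged immediately. This is plain C rather than MPI, and the names
(sync_mode, checked_win, checked_win_create, checked_win_allows) are
hypothetical. The closest existing hint I am sure of is the "no_locks"
info key for window creation, which tells the implementation that
passive-target synchronization will not be used on the window; the
model simply generalizes that to a per-window declaration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical and are not MPI API. */
typedef enum { SYNC_FENCE, SYNC_PSCW, SYNC_LOCK } sync_mode;

typedef struct {
    sync_mode declared; /* the one mechanism promised at creation time */
} checked_win;

/* Records the mechanism the user promises to use on this window. */
checked_win checked_win_create(sync_mode declared)
{
    checked_win w = { declared };
    return w;
}

/* Returns 1 if a call of kind `used` matches the declaration, else 0. */
int checked_win_allows(const checked_win *w, sync_mode used)
{
    assert(w != NULL);
    return w->declared == used;
}
```

A window created with checked_win_create(SYNC_FENCE) would then reject
a lock call outright, and the question of whether a fence left an
exposure epoch open never arises.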

Jeff

On Fri, Feb 8, 2013 at 12:45 PM, Nick Radcliffe <nradclif at cray.com> wrote:
> Thanks for the quick response. Just to clarify, what if I did something like this:
>
> MPI_Win_fence(0, win)
> if (rank == src) MPI_Put(dest)
> MPI_Win_fence(0, win)
>
> if (rank == src) {
>     MPI_Win_lock(dest, win)
>     MPI_Put
>     MPI_Win_unlock(dest, win)
> }
>
> if (rank == src) MPI_Put(dest)
> MPI_Win_fence(0, win)
>
> The second call to MPI_Win_fence closes-and-reopens an exposure epoch for the dest rank, because the second call to MPI_Win_fence is followed by another call to MPI_Win_fence, and the dest rank is the target of an RMA operation.
>
> The problem is that there is no way for the call to MPI_Win_lock to know if the previous call to MPI_Win_fence simply ended an exposure epoch, or if it ended-and-reopened an exposure epoch. What I'm trying to understand is how the call to MPI_Win_lock could do error checking to verify that it is not locking a rank in an exposure epoch, since whether MPI_Win_fence closes or closes-and-reopens an exposure epoch seems to depend on whether there are any future calls to MPI_Put/MPI_Win_fence.
>
> -Nick
> ________________________________________
> From: discuss-bounces at mpich.org [discuss-bounces at mpich.org] on behalf of Jeff Hammond [jhammond at alcf.anl.gov]
> Sent: Friday, February 08, 2013 11:58 AM
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] MPI RMA
>
> From MPI-3 11.5.1:
>
> - The MPI call MPI_WIN_FENCE(assert, win) synchronizes RMA calls on win.
> - The call completes an RMA access epoch if it was preceded by another
> fence call and the local process issued RMA communication calls on win
> between these two calls.
> - The call completes an RMA exposure epoch if it was preceded by
> another fence call and the local window was the target of RMA accesses
> between these two calls.
> - The call starts an RMA access epoch if it is followed by another
> fence call and by RMA communication calls issued between these two
> fence calls.
> - The call starts an exposure epoch if it is followed by another fence
> call and the local window is the target of RMA accesses between these
> two fence calls.
>
> MPI_Win_fence may either start or complete an RMA epoch, but it may
> also complete one epoch and start another, depending on the context
> and the assertions.  All of the tests below are valid in my opinion.
>
> What the assertions below in the first example are saying is that the
> first call to MPI_Win_fence need not complete any RMA calls, which is
> true because it is the first sync call.  The last MPI_Win_fence
> asserts the same thing in reverse.  The middle one asserts nothing
> because it is both completing and starting an epoch.
>
> I think that MPI_Win_fence is poorly designed but that ship has sailed
> and I believe that the usage is well-defined in the standard despite
> the confusing properties of this function.
>
> Jeff
>
> On Fri, Feb 8, 2013 at 11:36 AM, Nick Radcliffe <nradclif at cray.com> wrote:
>> Hi,
>>
>> I have a question about MPI RMA, and the ANL regression tests for RMA in particular. The tests mixedsync.c and epochtest.c seem to have contradictory views of fence synchronization.
>>
>> epochtest.c seems to suggest that access/exposure epochs opened by a call to MPI_Win_fence are not closed until a call to MPI_Win_fence with assert==MPI_MODE_NOSUCCEED. The test looks roughly like this:
>>
>> MPI_Win_fence(MPI_MODE_NOPRECEDE, win)
>> if (rank == src) MPI_Put
>> MPI_Win_fence(0, win)
>> if (rank == dest) MPI_Put
>> etc...
>> MPI_Win_fence(MPI_MODE_NOSUCCEED, win)
>>
>> Since there is a call to MPI_Put after the second call to MPI_Win_fence, it would seem that the second call could not have ended the access epoch for dest, or the exposure epoch for src (which is the target of the second Put).
>>
>> On the other hand, the test mixedsync.c looks roughly like this:
>>
>> if (rank == src) {
>>     MPI_Win_lock(...,win)
>>     MPI_Put
>>     MPI_Win_unlock(...,win)
>> }
>>
>> MPI_Win_fence(0, win)
>> if (rank == src) MPI_Put
>> MPI_Win_fence(0, win)
>>
>> if (rank == src) {
>>     MPI_Win_lock(...,win)
>>     MPI_Put
>>     MPI_Win_unlock(...,win)
>> }
>>
>> The problem is that it is erroneous to call MPI_Win_lock on a window while that window is exposed due to a call to MPI_Win_fence. If mixedsync.c is not erroneous, then the second call to MPI_Win_fence must end the exposure epoch on win, contradicting what's implied about fence synchronization by epochtest.c.
>>
>> Sorry for the long post, but if anyone can shed some light on this for me, I would greatly appreciate it.
>>
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
>
>
>
> --
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond


