[mpich-discuss] One-sided semantics

Palmer, Bruce J Bruce.Palmer at pnnl.gov
Tue Mar 8 11:00:27 CST 2016


I reran the tests using 2 processors on a single SMP node of our InfiniBand cluster and everything worked fine. I also built MPICH on my Dell quad-core workstation and ran the tests there; they worked as well. The failures seem to occur only when communication goes off-node and actually uses the InfiniBand network.

Bruce

-----Original Message-----
From: Balaji, Pavan [mailto:balaji at anl.gov] 
Sent: Friday, March 04, 2016 8:02 PM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] One-sided semantics


And, of course, I assume you'll eventually fix the performance shortcomings from doing extra flush/flush_local calls or datatype commits inside the for loop.

We'll be happy to review the Comex port for you, once it's ready, if you like.

  -- Pavan

> On Mar 4, 2016, at 9:59 PM, Balaji, Pavan <balaji at anl.gov> wrote:
> 
> Bruce,
> 
> You are missing an MPI_Barrier on line 81 (after the initialization).  Without this, a remote process might update your buffer while you are still initializing.  The program works with the barrier.
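> 
> To make the point concrete, here is a minimal sketch of the required ordering (illustrative sizes and names, using MPI_Win_allocate and a flush_all/barrier pair for the global synchronization; this is not the attached test):
> 
> #include <mpi.h>
> #include <stdio.h>
> 
> #define N 100                               /* illustrative size */
> 
> int main(int argc, char **argv)
> {
>     int rank, nproc;
>     double *buf, src[N];
>     MPI_Win win;
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &nproc);
>     int right = (rank + 1) % nproc;
> 
>     MPI_Win_allocate(N * sizeof(double), sizeof(double),
>                      MPI_INFO_NULL, MPI_COMM_WORLD, &buf, &win);
>     for (int i = 0; i < N; i++) { src[i] = rank; buf[i] = -1.0; }
> 
>     /* Without this barrier a neighbor may start putting into buf
>        while the initialization loop above is still running. */
>     MPI_Barrier(MPI_COMM_WORLD);
> 
>     MPI_Win_lock_all(0, win);
>     MPI_Put(src, N, MPI_DOUBLE, right, 0, N, MPI_DOUBLE, win);
>     MPI_Win_flush_all(win);                 /* remote completion of the put */
>     MPI_Win_unlock_all(win);
>     MPI_Barrier(MPI_COMM_WORLD);            /* every rank's put has landed */
> 
>     /* each rank's window was written by its left neighbor */
>     if (buf[0] != (double)((rank + nproc - 1) % nproc))
>         printf("rank %d: unexpected value %g\n", rank, buf[0]);
> 
>     MPI_Win_free(&win);
>     MPI_Finalize();
>     return 0;
> }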
> 
>  -- Pavan
> 
>> On Mar 4, 2016, at 6:22 PM, Palmer, Bruce J <Bruce.Palmer at pnnl.gov> wrote:
>> 
>> Hi,
>> 
>> I’ve been working on a thin implementation of the COMEX runtime over MPI-3. The COMEX interface has been used by most of the MPI-based runtimes in GA. One of the COMEX tests has processes writing to and then immediately reading from neighboring processes multiple times. The GA semantics require that consecutive operations between the same pair of processes complete on the remote process in the same order in which they were issued on the originating process. This test frequently fails for the MPI-3-based implementation. I’ve tried testing this independently of GA, but the results are confusing.
>> 
>> The implementation I’ve been working on uses three different strategies to implement one-sided communication calls that follow, or at least come close to, the GA communication semantics:
>> 
>> 1) MPI_Put/MPI_Get/MPI_Accumulate, with each call surrounded by an MPI_Win_lock/MPI_Win_unlock pair immediately before and after the one-sided communication call. My understanding is that this forces completion both locally and remotely.
>> 
>> 2) MPI_Win_lock_all is called on the MPI window immediately after creation and MPI_Win_unlock_all when the window is destroyed, so that the window is always in a passive synchronization epoch. The put/get/accumulate calls are implemented with the request-based calls MPI_Rput/MPI_Rget/MPI_Raccumulate, each followed immediately by a call to MPI_Wait on the request handle. Again, from my understanding, this should force local completion of the operation but not necessarily remote completion.
>> 
>> 3) MPI_Win_lock_all is again used to keep the window in a permanent passive synchronization epoch, put/get/accumulate are implemented with MPI_Put/MPI_Get/MPI_Accumulate, and each operation is followed by MPI_Win_flush_local to force local completion.
>> 
>> The first implementation should require only a barrier to synchronize all processors; the second two combine a call to MPI_Win_flush_all with a barrier to synchronize the data on all processors.
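>> 
>> To make the three schemes concrete, here is a minimal sketch of a single put/get cycle under each of them. This is not the attached testmpi.c; the block size, variable names, use of MPI_Win_allocate, and the mode switch are just illustrative:
>> 
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> 
>> #define N 100   /* illustrative block size, not the 200x200 array of the real test */
>> 
>> /* One put-then-get to the next higher rank, completed three different ways:
>>  *   mode 1: MPI_Win_lock/MPI_Win_unlock around each operation
>>  *   mode 2: lock_all held for the window lifetime, MPI_Rput/MPI_Rget + MPI_Wait
>>  *   mode 3: lock_all held for the window lifetime, MPI_Put/MPI_Get + MPI_Win_flush_local
>>  */
>> int main(int argc, char **argv)
>> {
>>     int rank, nproc, mode = (argc > 1) ? atoi(argv[1]) : 1;
>>     double *win_buf, src[N], chk[N];
>>     MPI_Win win;
>> 
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &nproc);
>>     int right = (rank + 1) % nproc;              /* cyclic neighbor */
>> 
>>     MPI_Win_allocate(N * sizeof(double), sizeof(double),
>>                      MPI_INFO_NULL, MPI_COMM_WORLD, &win_buf, &win);
>>     for (int i = 0; i < N; i++) { src[i] = rank * N + i; win_buf[i] = -1.0; }
>>     MPI_Barrier(MPI_COMM_WORLD);                 /* initialization done everywhere */
>> 
>>     if (mode != 1) MPI_Win_lock_all(0, win);     /* permanent passive-target epoch */
>> 
>>     if (mode == 1) {
>>         MPI_Win_lock(MPI_LOCK_SHARED, right, 0, win);
>>         MPI_Put(src, N, MPI_DOUBLE, right, 0, N, MPI_DOUBLE, win);
>>         MPI_Win_unlock(right, win);              /* local + remote completion */
>> 
>>         MPI_Win_lock(MPI_LOCK_SHARED, right, 0, win);
>>         MPI_Get(chk, N, MPI_DOUBLE, right, 0, N, MPI_DOUBLE, win);
>>         MPI_Win_unlock(right, win);
>>     } else if (mode == 2) {
>>         MPI_Request req;
>>         MPI_Rput(src, N, MPI_DOUBLE, right, 0, N, MPI_DOUBLE, win, &req);
>>         MPI_Wait(&req, MPI_STATUS_IGNORE);       /* local completion only */
>> 
>>         MPI_Rget(chk, N, MPI_DOUBLE, right, 0, N, MPI_DOUBLE, win, &req);
>>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>>     } else {
>>         MPI_Put(src, N, MPI_DOUBLE, right, 0, N, MPI_DOUBLE, win);
>>         MPI_Win_flush_local(right, win);         /* local completion only */
>> 
>>         MPI_Get(chk, N, MPI_DOUBLE, right, 0, N, MPI_DOUBLE, win);
>>         MPI_Win_flush_local(right, win);
>>     }
>> 
>>     /* the "with synchronization" variant adds MPI_Win_flush_all + MPI_Barrier
>>        between the put and the get in modes 2 and 3 */
>> 
>>     for (int i = 0; i < N; i++)
>>         if (chk[i] != src[i])
>>             printf("rank %d: element %d is %g, expected %g\n",
>>                    rank, i, chk[i], src[i]);
>> 
>>     if (mode != 1) MPI_Win_unlock_all(win);
>>     MPI_Barrier(MPI_COMM_WORLD);
>>     MPI_Win_free(&win);
>>     MPI_Finalize();
>>     return 0;
>> }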
>> 
>> I’ve written a small test code that implements all three schemes and attached it to this email. It creates a 200x200 array of doubles, fills the array with unique numbers, writes a portion of the array to the next higher rank using put, and then reads it back using get (cyclic boundary conditions are used for the first and last ranks). This is repeated 2000 times, with each iteration using a slightly different set of numbers from the previous one. I’ve done this for all three implementations, both with synchronization between the put and the get and without it. The code has been run on an InfiniBand cluster using 2 processors on 2 separate SMP nodes. The results are that the request-based implementation and the flush_local_all implementation work fairly consistently without synchronization, while the tests with synchronization all fail. The lock/unlock implementation fails both with and without synchronization. Most failing tests get through at least a few put/get cycles before failing, but they do not complete all 2000 iterations.
>> 
>> I’ve also tried this using Open MPI. In that case, synchronization doesn’t appear to have much of an effect. In addition, the lock/unlock algorithm does not fail consistently, although it fails more frequently than the other two.
>> 
>> Does anyone have a suggestion as to what I’m doing wrong here? From my understanding of the MPI-3 standard, all three implementations should work with synchronization. I’m not completely sure if they should work without synchronization.
>> 
>> Bruce Palmer
>> 
>> <testmpi.c>
> 

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss