<p class="MsoNormal">Hi,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I’ve been working on a thin implementation of the COMEX runtime over MPI-3. The COMEX interface has been used by most of the MPI-based runtimes in GA. One of the COMEX tests has processors writing to and then immediately reading from neighboring
processes multiple times. The GA semantics are that for multiple consecutive operations between the same pair of processes, the operations are ordered on the remote process in the same order as on the originating process. The test for this frequently fails
for the MPI-3 based implementation. I’ve tried testing this independently of GA but the results are confusing.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The implementation I’ve been working on uses three different strategies to implement one-sided communication calls that follow, or are at least close to, the GA communication semantics. The first uses MPI_Put/MPI_Get/MPI_Accumulate and
surrounds these calls by and MPI_Lock and MPI_Unlock pair immediately before and after the one-sided communication call. My understanding is that this forces completion both locally and remotely. The second approach calls MPI_Win_lock_all on the MPI window
immediately after creation and MPI_Win_unlock_all when the window is destroy so that the window is always in a passive synchronization epoch. The put/get/accumulate calls are implemented with the request-based calls MPI_Rput/MPI_Rget/MPI_Raccumulate and followed
immediately by a call to MPI_Wait on the request handle. Again, from my understanding, this should force local completion of the operation but not necessarily remote completion. Finally, the last implementation is to again use the MPI_Win_lock_all to guarantee
that a window is in a permanent passive synchronization epoch, use MPI_Put/MPI_Get/MPI_Accumulate to implement put/get/accumulate and use MPI_Win_flush_local to force completion locally. The first implementation should require only a barrier to force synchronization
between all processors, the second two include a call to MPI_Win_flush_all in conjunction with a barrier to synchronize the data on all processors.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I’ve written a small test code that implements all three schemes and attached it to this email. It creates a 200x200 array of doubles, fills each array with unique numbers, writes a portion of the array to the next higher rank using put
and then reads it back using get (cyclic boundary conditions are used for the first and last ranks). This is repeated 2000 times, with each test using a slightly different set of numbers from the previous test. I’ve done this for all three implementations
using both a synchronization between the put and the get and without synchronization. The code has been run on an Infinband cluster using 2 processors on 2 separate SMP nodes. The results I get are that the request-based implementation and the flush_local_all
implementation without synchronization work pretty consistently while the tests with synchronization all fail. The lock/unlock implementation also fails both with and without synchronization. Most tests that fail get through at least a few put/get cycles before
failing but they don’t do all 2000 iterations.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I’ve also tried this using OpenMPI. In the OpenMPI case, there doesn’t appear to be much of an effect from using synchronization. In addition, the lock/unlock algorithm does not consistently fail, although it fails more frequently than
the other two.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Does anyone have a suggestion as to what I’m doing wrong here? From my understanding of the MPI-3 standard, all three implementations should work with synchronization. I’m not completely sure if they should work without synchronization.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Bruce Palmer<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>