<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><div dir="ltr">Pavan, I have followed your advice using MPI_MODE_NOCHECK and added some flushes, but I still get race conditions sometimes. I suspect that I have not followed your suggestion correctly or that something else is wrong at my end.</div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Mar 13, 2017 at 7:43 PM, Ask Jakobsen <span dir="ltr"><<a href="mailto:afj@qeye-labs.com" target="_blank">afj@qeye-labs.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Thanks Pavan and Halim. You are right: the Fetch_and_op version makes progress without the async progress environment variable. I will try to implement the <span style="font-size:12.8px">MPI_MODE_NOCHECK as you suggested.</span><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">To make matters more complicated:</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">I have discovered that the code from the book in mcs-lock.c deviates from the </span><span style="font-size:12.8px">"High-Performance Distributed RMA Locks" pseudocode </span><span style="font-size:12.8px">(see Listing 3 in the paper) </span><span style="font-size:12.8px">and from the original MCS paper "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors". If I add the following line to the original mcs-lock.c code</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">lmem[nextRank]=-1;</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">before entering the MPI_Win_lock_all in acquire, the code *almost appears* to work! Sort of... with a large number of processes there is still a rare race condition in which a few processes never reach the</span> MPI_Win_free(&win) in main().</div><div><br></div>
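<div>For illustration only, the effect of resetting the next field before enqueuing can be sketched with a shared-memory analogue of the MCS queue lock. This is not the MPI RMA code from mcs-lock.c: it swaps in C11 atomics and pthreads, and every name in it (qnode_t, mcs_acquire, mcs_release, run_mcs_demo, MAX_THREADS, NO_RANK) is hypothetical. Threads stand in for ranks, and -1 marks "no successor", as in the thread above.</div>

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_THREADS 16   /* hypothetical upper bound for this sketch */
#define NO_RANK (-1)     /* "no successor" / "queue empty" marker */

/* Per-thread queue node; loosely plays the role of one rank's lmem[]. */
typedef struct {
    atomic_int next;     /* rank of the successor, or NO_RANK */
    atomic_int blocked;  /* 1 while waiting for the predecessor */
} qnode_t;

static qnode_t nodes[MAX_THREADS];
static atomic_int tail = NO_RANK;  /* most recent requester, or NO_RANK */
static long shared_counter;        /* protected by the lock */
static int iters_per_thread;

static void mcs_acquire(int me) {
    /* Reset our own node BEFORE enqueuing; otherwise a stale successor
       from a previous acquisition could be read in mcs_release. */
    atomic_store(&nodes[me].next, NO_RANK);
    atomic_store(&nodes[me].blocked, 1);
    int pred = atomic_exchange(&tail, me);   /* append ourselves to the queue */
    if (pred != NO_RANK) {
        atomic_store(&nodes[pred].next, me); /* link behind the predecessor */
        while (atomic_load(&nodes[me].blocked))
            sched_yield();                   /* spin on our own flag */
    }
}

static void mcs_release(int me) {
    if (atomic_load(&nodes[me].next) == NO_RANK) {
        int expected = me;
        /* No visible successor: try to mark the queue empty. */
        if (atomic_compare_exchange_strong(&tail, &expected, NO_RANK))
            return;
        /* Someone swapped the tail but has not linked in yet; wait. */
        while (atomic_load(&nodes[me].next) == NO_RANK)
            sched_yield();
    }
    int succ = atomic_load(&nodes[me].next);
    atomic_store(&nodes[succ].blocked, 0);   /* hand the lock over */
}

static void *worker(void *arg) {
    int me = (int)(intptr_t)arg;
    for (int i = 0; i < iters_per_thread; i++) {
        mcs_acquire(me);
        shared_counter++;                    /* critical section */
        mcs_release(me);
    }
    return NULL;
}

/* Runs nthreads workers, each incrementing the counter iters times,
   and returns the final counter value. */
long run_mcs_demo(int nthreads, int iters) {
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    shared_counter = 0;
    iters_per_thread = iters;
    atomic_store(&tail, NO_RANK);
    for (int r = 0; r < nthreads; r++)
        pthread_create(&tid[r], NULL, worker, (void *)(intptr_t)r);
    for (int r = 0; r < nthreads; r++)
        pthread_join(tid[r], NULL);
    return shared_counter;
}
```

<div>If mutual exclusion holds, run_mcs_demo(4, 1000) should return 4000. The reset at the top of mcs_acquire is the shared-memory counterpart of the lmem[nextRank]=-1 line; the RMA version additionally needs the flushes and synchronization discussed in this thread, which this sketch does not model.</div>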


</div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Mar 13, 2017 at 6:15 PM, Halim Amer <span dir="ltr"><<a href="mailto:aamer@anl.gov" target="_blank">aamer@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">To be precise, asynchronous progress is not required for this second implementation because the busy waiting loop is doing a Fetch_and_op. It is required, however, for the first implementation, from the tutorial book, because it busy waits with Win_sync.<br>


<br>


Halim<br>


<a href="http://www.mcs.anl.gov/~aamer" rel="noreferrer" target="_blank">www.mcs.anl.gov/~aamer</a><div class="m_-9117170903988053540HOEnZb"><div class="m_-9117170903988053540h5"><br>


<br>


On 3/13/17 8:46 AM, Balaji, Pavan wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>


I should also point out that I don't think your implementation is assuming asynchronous progress.  You shouldn't have to do any of the asynchronous progress tweaks for it to work correctly.<br>


<br>


  -- Pavan<br>


<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


On Mar 13, 2017, at 8:43 AM, Balaji, Pavan <<a href="mailto:balaji@anl.gov" target="_blank">balaji@anl.gov</a>> wrote:<br>


<br>


<br>


OK, I spent a little more time going through the code.  The algorithm looks correct, except for some minor issues:<br>


<br>


1. mcs-lock-fop.c:72 -- you need a flush or flush_local.  You were lucky that this was working correctly since it's local, but the MPI standard doesn't guarantee it.<br>


<br>


2. You might be able to simplify mcs-lock-fop.c lines 72-90 as follows:<br>


<br>


   do {<br>


     MPI_Fetch_and_op(&dummy, &fetch_nextrank, MPI_INT,<br>


                  myrank, nextRank, MPI_NO_OP, win);<br>


     MPI_Win_flush(myrank, win);<br>


   } while (fetch_nextrank==-1);<br>


<br>


3. Polling on the nextrank value is better than polling on a remote location.  However, you could further simplify this by using send/recv to notify the waiting process rather than RMA.  This allows the MPI implementation the opportunity to block waiting for progress, rather than poll (though in practice, current implementations poll anyway).<br>


<br>


4. Since you are always using the lock in shared mode, you should specify the hint MPI_MODE_NOCHECK in your lock_all epochs.<br>
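<br>
A minimal sketch of points 1, 2 and 4, for illustration only. It is not taken from mcs-lock-fop.c: the window layout (one int per rank) and the variable names are assumptions, and it only shows a lock_all epoch opened with MPI_MODE_NOCHECK plus an atomic read of the local slot completed by a flush.<br>

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int myrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    /* Assumed layout for this sketch: one int per rank. */
    int *lmem;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &lmem, &win);

    /* Initialize the local slot inside an exclusive epoch so the
       store is well defined with respect to later RMA accesses. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, myrank, 0, win);
    lmem[0] = myrank;
    MPI_Win_unlock(myrank, win);
    MPI_Barrier(MPI_COMM_WORLD);

    /* Point 4: only shared access is used from here on, so no
       conflicting lock can exist and NOCHECK is a valid assertion. */
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    /* Points 1 and 2: read the local slot atomically with MPI_NO_OP
       and complete the operation with a flush before using the result. */
    int dummy = 0, fetched = -1;
    MPI_Fetch_and_op(&dummy, &fetched, MPI_INT,
                     myrank, 0, MPI_NO_OP, win);
    MPI_Win_flush(myrank, win);
    printf("rank %d fetched %d from its own slot\n", myrank, fetched);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

In a real lock the fetch and flush would sit inside the polling loop shown in point 2, repeating until the fetched value is no longer -1.<br>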


<br>


Now, coming to your bug, this does seem to be a bug in the MPI implementation.  We can dig into it further.  In the meantime, if you use optimization #4 above, the MPI implementation can bypass the locking checks entirely, which will get you past the bug for now.<br>


<br>


Thanks for reporting the issue.<br>


<br>


 -- Pavan<br>


<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


On Mar 13, 2017, at 3:06 AM, Ask Jakobsen <<a href="mailto:afj@qeye-labs.com" target="_blank">afj@qeye-labs.com</a>> wrote:<br>


<br>


I don't think so. Rank 0 also holds the tail, which identifies the process that most recently requested the mutex.<br>


<br>


On Mon, Mar 13, 2017 at 2:55 AM, Balaji, Pavan <<a href="mailto:balaji@anl.gov" target="_blank">balaji@anl.gov</a>> wrote:<br>


<br>


Shouldn't winsize be 3 integers in your code?  (sorry, I spent only 30 seconds looking at the code, so I might have missed something).<br>


<br>


 -- Pavan<br>


<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


On Mar 12, 2017, at 2:44 PM, Ask Jakobsen <<a href="mailto:afj@qeye-labs.com" target="_blank">afj@qeye-labs.com</a>> wrote:<br>


<br>


Interestingly, the paper you suggested appears to include a similar test in its pseudocode: <a href="https://htor.inf.ethz.ch/publications/img/hpclocks.pdf" rel="noreferrer" target="_blank">https://htor.inf.ethz.ch/publi<wbr>cations/img/hpclocks.pdf</a> (see Listing 3 in the paper).<br>


<br>


Unfortunately, removing the test in the release protocol did not solve the problem. The race condition is now much more difficult to provoke, but I managed to trigger it by setting the size of the communicator to 3 (only tested even sizes so far).<br>


<br>