[mpich-discuss] MCS lock and MPI RMA problem

Ask Jakobsen afj at qeye-labs.com
Mon Mar 13 14:48:10 CDT 2017


Pavan, I have followed your advice, using MPI_MODE_NOCHECK and adding some
flushes, but I still get race conditions occasionally. I suspect that I have
not followed your suggestion correctly, or that something else is wrong at
my end.
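
For concreteness, here is a minimal sketch of what I mean by "NOCHECK plus
flushes" in the acquire path. This is only an illustration, not the exact
attached code; lockTail, blocked, nextRank, lmem and myrank are assumed to
follow the layout of the book's mcs-lock.c / my mcs-lock-fop.c:

  /* Sketch only: acquire with MPI_MODE_NOCHECK and explicit flushes.
     The window layout (lockTail on rank 0, per-rank blocked/nextRank in
     lmem) is an assumption, not the attached code. */
  int predecessor, dummy = 0, fetch_blocked;

  lmem[blocked]  = 1;     /* mark myself as waiting */
  lmem[nextRank] = -1;    /* no successor yet */

  MPI_Win_lock_all(MPI_MODE_NOCHECK, win);   /* hint: shared locking only */
  MPI_Win_sync(win);

  /* atomically append myself to the tail of the MCS queue on rank 0 */
  MPI_Fetch_and_op(&myrank, &predecessor, MPI_INT,
                   0, lockTail, MPI_REPLACE, win);
  MPI_Win_flush(0, win);

  if (predecessor != -1) {
      /* publish my rank in my predecessor's nextRank slot */
      MPI_Accumulate(&myrank, 1, MPI_INT, predecessor, nextRank,
                     1, MPI_INT, MPI_REPLACE, win);
      MPI_Win_flush(predecessor, win);

      /* busy-wait on my own blocked flag through atomic reads */
      do {
          MPI_Fetch_and_op(&dummy, &fetch_blocked, MPI_INT,
                           myrank, blocked, MPI_NO_OP, win);
          MPI_Win_flush(myrank, win);
      } while (fetch_blocked == 1);
  }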

On Mon, Mar 13, 2017 at 7:43 PM, Ask Jakobsen <afj at qeye-labs.com> wrote:

> Thanks Pavan and Halim. You are right: it progresses in the Fetch_and_op
> version without the async-progress environment variable. I will try to
> implement the MPI_MODE_NOCHECK hint as you suggested.
>
> To make matters more complicated:
>
> I have discovered that the code from the book in mcs-lock.c deviates from
> the "High-Performance Distributed RMA Locks" pseudo code (see Listing 3 in
> the paper) and from the original MCS paper, "Algorithms for Scalable
> Synchronization on Shared-Memory Multiprocessors". If I add the line
>
> lmem[nextRank] = -1;
>
> to the original mcs-lock.c code before entering MPI_Win_lock_all in
> acquire, the code *almost appears* to work. Sort of... with a large number
> of processes there is still a rare race condition where a few processes
> never reach MPI_Win_free(&win) in main().
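>
> To be explicit, the change amounts to something like the following
> placement in the acquire routine (a sketch assuming the book's mcs-lock.c
> layout, not a verified fix):
>
>   void MCSLockAcquire(MPI_Win win)
>   {
>       /* added: reset my successor field on every acquire, not only in
>          MCSLockInit(); otherwise a stale nextRank from a previous round
>          can confuse the release protocol */
>       lmem[nextRank] = -1;
>
>       MPI_Win_lock_all(0, win);
>       /* ... rest of the acquire protocol unchanged ... */
>   }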
>
>
> On Mon, Mar 13, 2017 at 6:15 PM, Halim Amer <aamer at anl.gov> wrote:
>
>> To be precise, asynchronous progress is not required for this second
>> implementation because the busy-waiting loop does a Fetch_and_op. It is
>> required, however, for the first implementation, from the tutorial book,
>> because it busy-waits with Win_sync.
>>
>> Halim
>> www.mcs.anl.gov/~aamer
>>
>>
>> On 3/13/17 8:46 AM, Balaji, Pavan wrote:
>>
>>>
>>> I should also point out that I don't think your implementation is
>>> assuming asynchronous progress.  You shouldn't have to do any of the
>>> asynchronous progress tweaks for it to work correctly.
>>>
>>>   -- Pavan
>>>
>>> On Mar 13, 2017, at 8:43 AM, Balaji, Pavan <balaji at anl.gov> wrote:
>>>>
>>>>
>>>> OK, I spent a little more time going through the code.  The algorithm
>>>> looks correct, except for some minor issues:
>>>>
>>>> 1. mcs-lock-fop.c:72 -- you need a flush or flush_local.  You were
>>>> lucky that this was working correctly since it's local, but the MPI
>>>> standard doesn't guarantee it.
>>>>
>>>> 2. You might be able to simplify mcs-lock-fop.c lines 72-90 as follows:
>>>>
>>>>    do {
>>>>        MPI_Fetch_and_op(&dummy, &fetch_nextrank, MPI_INT,
>>>>                         myrank, nextRank, MPI_NO_OP, win);
>>>>        MPI_Win_flush(myrank, win);
>>>>    } while (fetch_nextrank == -1);
>>>>
>>>> 3. Polling on the nextrank value is better than polling on a remote
>>>> location.  However, you could further simplify this by using send/recv to
>>>> notify the waiting process rather than RMA (see the sketch after point 4
>>>> below).  This allows the MPI implementation the opportunity to block
>>>> waiting for progress, rather than poll (though in practice, current
>>>> implementations poll anyway).
>>>>
>>>> 4. Since you are always using the lock in shared mode, you should
>>>> specify the hint MPI_MODE_NOCHECK in your lock_all epochs.
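>>>>
>>>> For illustration of points 3 and 4 (sketches only, not from the attached
>>>> code; names like successor_rank and the tag value 0 are hypothetical):
>>>>
>>>>    /* point 3: notify the successor with a zero-byte message.
>>>>       Releasing process, once it knows its successor's rank: */
>>>>    MPI_Send(NULL, 0, MPI_BYTE, successor_rank, 0, MPI_COMM_WORLD);
>>>>    /* acquiring process, instead of polling its blocked flag: */
>>>>    MPI_Recv(NULL, 0, MPI_BYTE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
>>>>             MPI_STATUS_IGNORE);
>>>>
>>>>    /* point 4: assert that only shared locking is used */
>>>>    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
>>>>    /* ... RMA operations ... */
>>>>    MPI_Win_unlock_all(win);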
>>>>
>>>> Now, coming to your problem: this does seem to be a bug in the MPI
>>>> implementation.  We can dig into it further.  In the meantime, if you use
>>>> optimization #4 above, it will let the MPI implementation bypass the lock
>>>> checks entirely, which should get you past the bug for now.
>>>>
>>>> Thanks for reporting the issue.
>>>>
>>>>  -- Pavan
>>>>
>>>> On Mar 13, 2017, at 3:06 AM, Ask Jakobsen <afj at qeye-labs.com> wrote:
>>>>>
>>>>> I don't think so. Rank 0 also holds the tail, which identifies the
>>>>> process that most recently requested the mutex.
>>>>>
>>>>> On Mon, Mar 13, 2017 at 2:55 AM, Balaji, Pavan <balaji at anl.gov> wrote:
>>>>>
>>>>> Shouldn't winsize be 3 integers in your code?  (sorry, I spent only 30
>>>>> seconds looking at the code, so I might have missed something).
>>>>>
>>>>>  -- Pavan
>>>>>
>>>>> On Mar 12, 2017, at 2:44 PM, Ask Jakobsen <afj at qeye-labs.com> wrote:
>>>>>>
>>>>>> Interestingly, the paper you suggested appears to include a similar
>>>>>> test in its pseudo code:
>>>>>> https://htor.inf.ethz.ch/publications/img/hpclocks.pdf (see Listing 3
>>>>>> in the paper).
>>>>>>
>>>>>> Unfortunately, removing the test in the release protocol did not
>>>>>> solve the problem. The race condition is much more difficult to
>>>>>> provoke, but I managed to trigger it when setting the communicator
>>>>>> size to 3 (I had only tested even sizes so far).
>>>>>>
>>>>>> Following Jeff's suggestion, I have attempted to rewrite the code,
>>>>>> removing local loads and stores in the MPI_Win_lock_all epochs and
>>>>>> using MPI_Fetch_and_op instead (see attached files).
>>>>>>
>>>>>> This version behaves very similarly to the original code and also fails
>>>>>> from time to time. Putting a sleep into the acquire busy loop
>>>>>> (usleep(100)) makes the code "much more robust" (a hack, I know, but it
>>>>>> does point to some underlying race condition?!). Let me know if you see
>>>>>> any problems in the way I am using MPI_Fetch_and_op in a busy loop.
>>>>>> Flushing or syncing is not necessary in this case, right?
>>>>>>
>>>>>> All tests are run with export MPIR_CVAR_ASYNC_PROGRESS=1 on mpich-3.2
>>>>>> and mpich-3.3a2.
>>>>>>
>>>>>> On Wed, Mar 8, 2017 at 4:21 PM, Halim Amer <aamer at anl.gov> wrote:
>>>>>> I cannot claim that I thoroughly verified the correctness of that
>>>>>> code, so take it with a grain of salt. Please keep in mind that it is
>>>>>> test code from a tutorial book; those codes are meant for learning
>>>>>> purposes, not for deployment.
>>>>>>
>>>>>> If your goal is to have a high-performance RMA lock, I suggest you
>>>>>> look into the recent HPDC'16 paper "High-Performance Distributed RMA
>>>>>> Locks".
>>>>>>
>>>>>> Halim
>>>>>> www.mcs.anl.gov/~aamer
>>>>>>
>>>>>> On 3/8/17 3:06 AM, Ask Jakobsen wrote:
>>>>>> You are absolutely correct, Halim. Removing the test
>>>>>> lmem[nextRank] == -1 in release fixes the problem. Great work. Now I
>>>>>> will try to understand why you are right. I hope the authors of the
>>>>>> book will credit you for discovering the bug.
>>>>>>
>>>>>> So, in conclusion, you need to remove the above-mentioned test AND
>>>>>> enable asynchronous progress using the environment variable
>>>>>> MPIR_CVAR_ASYNC_PROGRESS=1 in MPICH (BTW, I still can't get the code
>>>>>> to work in openmpi).
>>>>>>
>>>>>> On Tue, Mar 7, 2017 at 5:37 PM, Halim Amer <aamer at anl.gov> wrote:
>>>>>>
>>>>>> > detect that another process is being, or has already been, enqueued
>>>>>> > in the MCS queue.
>>>>>>
>>>>>> Actually, the problem occurs only when the waiting process has already
>>>>>> enqueued itself, i.e., when the accumulate operation on the nextRank
>>>>>> field has succeeded.
>>>>>>
>>>>>> Halim
>>>>>> www.mcs.anl.gov/~aamer
>>>>>>
>>>>>>
>>>>>> On 3/7/17 10:29 AM, Halim Amer wrote:
>>>>>>
>>>>>> In the Release protocol, try removing this test:
>>>>>>
>>>>>> if (lmem[nextRank] == -1) {
>>>>>>   If-Block;
>>>>>> }
>>>>>>
>>>>>> but keep the If-Block.
>>>>>>
>>>>>> The hang occurs because the process releasing the MCS lock fails to
>>>>>> detect that another process is being, or has already been, enqueued
>>>>>> in the MCS queue.
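>>>>>>
>>>>>> A rough sketch of how the release path might read with that guard
>>>>>> removed (this assumes the variable layout of mcs-lock.c -- lmem,
>>>>>> lockTail, nextRank, blocked, myrank -- and is only an illustration,
>>>>>> not the book's code):
>>>>>>
>>>>>>    void MCSLockRelease(MPI_Win win)
>>>>>>    {
>>>>>>        int nullrank = -1, zero = 0, curtail;
>>>>>>
>>>>>>        MPI_Win_lock_all(0, win);
>>>>>>        /* try to swing the tail pointer back to "empty"; curtail
>>>>>>           receives the old tail value */
>>>>>>        MPI_Compare_and_swap(&nullrank, &myrank, &curtail, MPI_INT,
>>>>>>                             0, lockTail, win);
>>>>>>        if (curtail != myrank) {
>>>>>>            /* someone enqueued behind us: wait until they publish
>>>>>>               their rank in our nextRank slot */
>>>>>>            do {
>>>>>>                MPI_Win_sync(win);
>>>>>>            } while (lmem[nextRank] == -1);
>>>>>>            /* wake the successor by clearing its blocked flag */
>>>>>>            MPI_Accumulate(&zero, 1, MPI_INT, lmem[nextRank], blocked,
>>>>>>                           1, MPI_INT, MPI_REPLACE, win);
>>>>>>        }
>>>>>>        MPI_Win_unlock_all(win);
>>>>>>    }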
>>>>>>
>>>>>> Halim
>>>>>> www.mcs.anl.gov/~aamer
>>>>>>
>>>>>>
>>>>>> On 3/7/17 6:43 AM, Ask Jakobsen wrote:
>>>>>>
>>>>>> Thanks, Halim. I have now enabled asynchronous progress in MPICH (I
>>>>>> can't find anything similar in openmpi), and now all ranks acquire the
>>>>>> lock and the program finishes as expected. However, if I put a
>>>>>> while(1) loop around the acquire-release code in main.c, it fails
>>>>>> again at random and goes into an infinite loop. The simple unfair lock
>>>>>> does not have this problem.
>>>>>>
>>>>>> On Tue, Mar 7, 2017 at 12:44 AM, Halim Amer <aamer at anl.gov> wrote:
>>>>>>
>>>>>> My understanding is that this code assumes asynchronous progress.
>>>>>> An example of when the processes hang is as follows:
>>>>>>
>>>>>> 1) P0 Finishes MCSLockAcquire()
>>>>>> 2) P1 is busy waiting in MCSLockAcquire() at
>>>>>>    do {
>>>>>>        MPI_Win_sync(win);
>>>>>>    } while (lmem[blocked] == 1);
>>>>>> 3) P0 executes MCSLockRelease()
>>>>>> 4) P0 waits on MPI_Win_lock_all() inside MCSLockRelease()
>>>>>>
>>>>>> Hang!
>>>>>>
>>>>>> For P1 to get out of the loop, P0 has to get out of
>>>>>> MPI_Win_lock_all() and execute its Compare_and_swap().
>>>>>>
>>>>>> For P0 to get out of MPI_Win_lock_all(), it needs an ACK from P1 that
>>>>>> it got the lock.
>>>>>>
>>>>>> P1 does not make communication progress because MPI_Win_sync is not
>>>>>> required to do so. It only synchronizes private and public copies.
>>>>>>
>>>>>> For this hang to disappear, one can either trigger progress manually
>>>>>> by
>>>>>> using heavy-duty synchronization calls instead of Win_sync (e.g.,
>>>>>> Win_unlock_all + Win_lock_all), or enable asynchronous progress.
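>>>>>>
>>>>>> A sketch of the first option (illustrative only, applied to the loop
>>>>>> from MCSLockAcquire above):
>>>>>>
>>>>>>    /* cycling the epoch forces the implementation to make progress,
>>>>>>       unlike Win_sync, which only synchronizes the window copies */
>>>>>>    do {
>>>>>>        MPI_Win_unlock_all(win);
>>>>>>        MPI_Win_lock_all(0, win);
>>>>>>    } while (lmem[blocked] == 1);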
>>>>>>
>>>>>> To enable asynchronous progress in MPICH, set the
>>>>>> MPIR_CVAR_ASYNC_PROGRESS
>>>>>> env var to 1.
>>>>>>
>>>>>> Halim
>>>>>> www.mcs.anl.gov/~aamer
>>>>>>
>>>>>>
>>>>>> On 3/6/17 1:11 PM, Ask Jakobsen wrote:
>>>>>>
>>>>>> I am testing on an x86_64 platform.
>>>>>>
>>>>>> I have tried to build both the mpich and the MCS lock code with -O0 to
>>>>>> avoid aggressive optimization. Following your suggestion, I have also
>>>>>> tried making a volatile int *pblocked point to lmem[blocked] in the
>>>>>> MCSLockAcquire function and a volatile int *pnextrank point to
>>>>>> lmem[nextRank] in MCSLockRelease, but it does not appear to make a
>>>>>> difference.
>>>>>>
>>>>>> On a suggestion from Richard Warren, I have also tried building the
>>>>>> code with openmpi-2.0.2, without any luck (although it appears to
>>>>>> acquire the lock a couple of extra times before failing), which I find
>>>>>> troubling.
>>>>>>
>>>>>> I think I will give up on local loads/stores and see if I can rewrite
>>>>>> the code using MPI calls like MPI_Fetch_and_op, as you suggest.
>>>>>> Thanks for your help.
>>>>>>
>>>>>> On Mon, Mar 6, 2017 at 7:20 PM, Jeff Hammond <jeff.science at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> What processor architecture are you testing?
>>>>>>
>>>>>>
>>>>>> Maybe set lmem to volatile or read it with MPI_Fetch_and_op rather
>>>>>> than a load.  MPI_Win_sync cannot prevent the compiler from caching
>>>>>> *lmem in a register.
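>>>>>>
>>>>>> For instance, something along these lines (an illustration, not code
>>>>>> from the book):
>>>>>>
>>>>>>    /* force a real memory read of the blocked flag on every
>>>>>>       iteration instead of letting the compiler cache it */
>>>>>>    volatile int *pblocked = (volatile int *) &lmem[blocked];
>>>>>>    do {
>>>>>>        MPI_Win_sync(win);
>>>>>>    } while (*pblocked == 1);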
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>> On Sat, Mar 4, 2017 at 12:30 AM, Ask Jakobsen <afj at qeye-labs.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> I have downloaded the source code for the MCS lock from the excellent
>>>>>> book "Using Advanced MPI":
>>>>>> http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples-advmpi/rma2/mcs-lock.c
>>>>>>
>>>>>> I have written a very simple piece of test code for the MCS lock, but
>>>>>> it only works sporadically and often never escapes the busy loops in
>>>>>> the acquire and release functions (see attached source code). The code
>>>>>> appears semantically correct to my eyes.
>>>>>>
>>>>>> #include <stdio.h>
>>>>>> #include <mpi.h>
>>>>>> #include "mcs-lock.h"
>>>>>>
>>>>>> int main(int argc, char *argv[])
>>>>>> {
>>>>>>  MPI_Win win;
>>>>>>  MPI_Init( &argc, &argv );
>>>>>>
>>>>>>  MCSLockInit(MPI_COMM_WORLD, &win);
>>>>>>
>>>>>>  int rank, size;
>>>>>>  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>  MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>
>>>>>>  printf("rank: %d, size: %d\n", rank, size);
>>>>>>
>>>>>>
>>>>>>  MCSLockAcquire(win);
>>>>>>  printf("rank %d acquired lock\n", rank);   fflush(stdout);
>>>>>>  MCSLockRelease(win);
>>>>>>
>>>>>>
>>>>>>  MPI_Win_free(&win);
>>>>>>  MPI_Finalize();
>>>>>>  return 0;
>>>>>> }
>>>>>>
>>>>>>
>>>>>> I have tested on several hardware platforms with mpich-3.2 and
>>>>>> mpich-3.3a2, but with no luck.
>>>>>>
>>>>>> It appears that MPI_Win_sync is not "refreshing" the local data, or I
>>>>>> have a bug I can't spot.
>>>>>>
>>>>>> A simple unfair lock like
>>>>>> http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples-advmpi/rma2/ga_mutex1.c
>>>>>> works perfectly.
>>>>>>
>>>>>> Best regards, Ask Jakobsen
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jeff Hammond
>>>>>> jeff.science at gmail.com
>>>>>> http://jeffhammond.github.io/
>>>>>>
>>>>>> <main.c><mcs-lock-fop.c><mcs-lock.h>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
>
>
>



-- 
*Ask Jakobsen*
R&D

Qeye Labs
Lersø Parkallé 107
2100 Copenhagen Ø
Denmark

mobile: +45 2834 6936
email: afj at Qeye-Labs.com