[mpich-discuss] MCS lock and MPI RMA problem

Balaji, Pavan balaji at anl.gov
Mon Mar 13 08:43:00 CDT 2017


OK, I spent a little more time going through the code.  The algorithm looks correct, except for some minor issues:

1. mcs-lock-fop.c:72 -- you need a flush or flush_local.  You were lucky that this was working correctly since it's local, but the MPI standard doesn't guarantee it.

2. You might be able to simplify mcs-lock-fop.c lines 72-90 as follows:

    do {
      MPI_Fetch_and_op(&dummy, &fetch_nextrank, MPI_INT,
                   myrank, nextRank, MPI_NO_OP, win);
      MPI_Win_flush(myrank, win);
    } while (fetch_nextrank==-1);

3. Polling on the nextrank value is better than polling on a remote location.  However, you could further simplify this by using send/recv to notify the waiting process rather than RMA.  This allows the MPI implementation the opportunity to block waiting for progress, rather than poll (though in practice, current implementations poll anyway).

4. Since you are always using the lock in shared mode, you should specify the hint MPI_MODE_NOCHECK in your lock_all epochs.

Now, coming to your bug: this does seem to be a bug in the MPI implementation.  We can dig into it further.  In the meantime, if you apply optimization #4 above, the MPI implementation can bypass its lock checks entirely, which should get you past the bug for now.

Thanks for reporting the issue.

  -- Pavan

> On Mar 13, 2017, at 3:06 AM, Ask Jakobsen <afj at qeye-labs.com> wrote:
> 
> I don't think so. Rank 0 also holds the tail which is the process which most recently requested the mutex.
> 
> On Mon, Mar 13, 2017 at 2:55 AM, Balaji, Pavan <balaji at anl.gov> wrote:
> 
> Shouldn't winsize be 3 integers in your code?  (sorry, I spent only 30 seconds looking at the code, so I might have missed something).
> 
>   -- Pavan
> 
> > On Mar 12, 2017, at 2:44 PM, Ask Jakobsen <afj at qeye-labs.com> wrote:
> >
> > Interestingly, the paper you suggested appears to include a similar test in its pseudocode: https://htor.inf.ethz.ch/publications/img/hpclocks.pdf (see Listing 3 in the paper).
> >
> > Unfortunately, removing the test in the release protocol did not solve the problem. The race condition is much more difficult to provoke, but I managed when setting the size of the communicator to 3 (only tested even sizes so far).
> >
> > Following Jeff's suggestion, I have attempted to rewrite the code, removing local loads and stores in the MPI_Win_lock_all epochs by using MPI_Fetch_and_op (see attached files).
> >
> > This version behaves very similarly to the original code and also fails from time to time. Putting a sleep into the acquire busy loop (usleep(100)) makes the code "much more robust" (a hack, I know, but perhaps it indicates some underlying race condition?!). Let me know if you see any problems in the way I am using MPI_Fetch_and_op in a busy loop. Flushing or syncing is not necessary in this case, right?
> >
> > All work is done with export MPIR_CVAR_ASYNC_PROGRESS=1 on mpich-3.2 and mpich-3.3a2
> >
> > On Wed, Mar 8, 2017 at 4:21 PM, Halim Amer <aamer at anl.gov> wrote:
> > I cannot claim that I thoroughly verified the correctness of that code, so take it with a grain of salt. Please keep in mind that it is a test code from a tutorial book; those codes are meant for learning purposes not for deployment.
> >
> > If your goal is to have a high-performance RMA lock, I suggest you look into the recent HPDC'16 paper: "High-Performance Distributed RMA Locks".
> >
> > Halim
> > www.mcs.anl.gov/~aamer
> >
> > On 3/8/17 3:06 AM, Ask Jakobsen wrote:
> > You are absolutely correct, Halim. Removing the test lmem[nextRank] == -1
> > in release fixes the problem. Great work. Now I will try to understand why
> > you are right. I hope the authors of the book will credit you for
> > discovering the bug.
> >
> > So in conclusion you need to remove the above mentioned test AND enable
> > asynchronous progression using the environment variable
> > MPIR_CVAR_ASYNC_PROGRESS=1 in MPICH (BTW I still can't get the code to work
> > in openmpi).
> >
> > On Tue, Mar 7, 2017 at 5:37 PM, Halim Amer <aamer at anl.gov> wrote:
> >
> > > detect that another process is being or already enqueued in the MCS
> > > queue.
> >
> > Actually the problem occurs only when the waiting process already enqueued
> > itself, i.e., the accumulate operation on the nextRank field succeeded.
> >
> > Halim
> > www.mcs.anl.gov/~aamer
> >
> >
> > On 3/7/17 10:29 AM, Halim Amer wrote:
> >
> > In the Release protocol, try removing this test:
> >
> > if (lmem[nextRank] == -1) {
> >    If-Block;
> > }
> >
> > but keep the If-Block.
> >
> > The hang occurs because the process releasing the MCS lock fails to
> > detect that another process is being or already enqueued in the MCS queue.
> >
> > Halim
> > www.mcs.anl.gov/~aamer
> >
> >
> > On 3/7/17 6:43 AM, Ask Jakobsen wrote:
> >
> > Thanks, Halim. I have now enabled asynchronous progress in MPICH (can't
> > find something similar in openmpi) and now all ranks acquire the lock and
> > the program finishes as expected. However, if I put a while(1) loop around the
> > acquire-release code in main.c it will fail again at random and go into an
> > infinite loop. The simple unfair lock does not have this problem.
> >
> > On Tue, Mar 7, 2017 at 12:44 AM, Halim Amer <aamer at anl.gov> wrote:
> >
> > My understanding is that this code assumes asynchronous progress.
> > An example of when the processes hang is as follows:
> >
> > 1) P0 Finishes MCSLockAcquire()
> > 2) P1 is busy waiting in MCSLockAcquire() at
> > do {
> >       MPI_Win_sync(win);
> >    } while (lmem[blocked] == 1);
> > 3) P0 executes MCSLockRelease()
> > 4) P0 waits on MPI_Win_lock_all() inside MCSLockRelease()
> >
> > Hang!
> >
> > For P1 to get out of the loop, P0 has to get out of MPI_Win_lock_all() and
> > execute its Compare_and_swap().
> >
> > For P0 to get out of MPI_Win_lock_all(), it needs an ACK from P1 that it got
> > the lock.
> >
> > P1 does not make communication progress because MPI_Win_sync is not
> > required to do so. It only synchronizes private and public copies.
> >
> > For this hang to disappear, one can either trigger progress manually by
> > using heavy-duty synchronization calls instead of Win_sync (e.g.,
> > Win_unlock_all + Win_lock_all), or enable asynchronous progress.
> >
> > To enable asynchronous progress in MPICH, set the MPIR_CVAR_ASYNC_PROGRESS
> > env var to 1.
> >
> > Halim
> > www.mcs.anl.gov/~aamer
> >
> >
> > On 3/6/17 1:11 PM, Ask Jakobsen wrote:
> >
> > I am testing on an x86_64 platform.
> >
> > I have tried to build both MPICH and the MCS lock code with -O0 to
> > avoid aggressive optimization. Following your suggestion, I have also tried
> > making volatile int *pblocked point to lmem[blocked] in the MCSLockAcquire
> > function and volatile int *pnextrank point to lmem[nextRank] in
> > MCSLockRelease, but it does not appear to make a difference.
> >
> > At Richard Warren's suggestion I have also tried building the code using
> > openmpi-2.0.2 without any luck (although it appears to acquire the lock a
> > couple of extra times before failing), which I find troubling.
> >
> > I think I will give up on local loads/stores and see if I can figure
> > out how to rewrite it using MPI calls like MPI_Fetch_and_op, as you suggest.
> > Thanks for your help.
> >
> > On Mon, Mar 6, 2017 at 7:20 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
> >
> > What processor architecture are you testing?
> >
> >
> > Maybe set lmem to volatile or read it with MPI_Fetch_and_op rather than a
> > load.  MPI_Win_sync cannot prevent the compiler from caching *lmem in a
> > register.
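
As a small illustration of the volatile suggestion, the sketch below polls a flag through a volatile-qualified pointer so that the compiler has to issue a real load on each iteration. The function name and the max_spins bound are made up for this example. Note that volatile only addresses compiler caching; it does nothing for MPI progress or for the RMA memory model.

```c
#include <stddef.h>

/* Read *flag repeatedly until it differs from `sentinel` or until
   max_spins reads have been made. Because the pointer is volatile,
   the load cannot be hoisted out of the loop. Returns the last value
   read. */
static int poll_volatile(volatile const int *flag, int sentinel,
                         long max_spins)
{
    int seen = *flag;

    for (long i = 1; i < max_spins && seen == sentinel; ++i)
        seen = *flag;

    return seen;
}
```

In the lock code this would correspond to something like poll_volatile(&lmem[nextRank], -1, limit), with MPI_Win_sync or an RMA read still needed between polls.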
> >
> > Jeff
> >
> > On Sat, Mar 4, 2017 at 12:30 AM, Ask Jakobsen <afj at qeye-labs.com> wrote:
> >
> > Hi,
> >
> >
> > I have downloaded the source code for the MCS lock from the excellent
> > book "Using Advanced MPI" from
> > http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples-advmpi/rma2/mcs-lock.c
> >
> > I have written a very simple piece of test code for the MCS lock, but
> > it works only at random and often never escapes the busy loops in the acquire
> > and release functions (see attached source code). The code appears
> > semantically correct to my eyes.
> >
> > #include <stdio.h>
> > #include <mpi.h>
> > #include "mcs-lock.h"
> >
> > int main(int argc, char *argv[])
> > {
> >   MPI_Win win;
> >   MPI_Init( &argc, &argv );
> >
> >   MCSLockInit(MPI_COMM_WORLD, &win);
> >
> >   int rank, size;
> >   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >   MPI_Comm_size(MPI_COMM_WORLD, &size);
> >
> >   printf("rank: %d, size: %d\n", rank, size);
> >
> >
> >   MCSLockAcquire(win);
> >   printf("rank %d acquired lock\n", rank);   fflush(stdout);
> >   MCSLockRelease(win);
> >
> >
> >   MPI_Win_free(&win);
> >   MPI_Finalize();
> >   return 0;
> > }
> >
> >
> > I have tested on several hardware platforms with mpich-3.2 and mpich-3.3a2,
> > but with no luck.
> >
> > It appears that the MPI_Win_sync calls are not "refreshing" the local data, or I
> > have a bug I can't spot.
> >
> > A simple unfair lock like
> > http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples-advmpi/rma2/ga_mutex1.c
> > works perfectly.
> >
> > Best regards, Ask Jakobsen
> >
> >
> > _______________________________________________
> > discuss mailing list     discuss at mpich.org
> > To manage subscription options or unsubscribe:
> > https://lists.mpich.org/mailman/listinfo/discuss
> >
> >
> >
> >
> > --
> > Jeff Hammond
> > jeff.science at gmail.com
> > http://jeffhammond.github.io/
> > <main.c><mcs-lock-fop.c><mcs-lock.h>
> 
> 
> 
> -- 
> Ask Jakobsen
> R&D
> 
> Qeye Labs
> Lersø Parkallé 107
> 2100 Copenhagen Ø 
> Denmark
> 
> mobile: +45 2834 6936
> email: afj at Qeye-Labs.com
