[mpich-discuss] MCS lock and MPI RMA problem

Ask Jakobsen afj at qeye-labs.com
Tue Mar 14 04:58:50 CDT 2017


Attached is the MPI_Fetch_and_op version, which tries to implement Pavan's
ideas as I understand them (no async progress should be necessary, but I
also tested with it enabled). I have also attached the original code
mcs-lock.c with a comment about adding lmem[nextRank] = -1 in the acquire
function (which shows a different race condition behavior, advancing
further).
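
A rough sketch of that change, assuming the nextRank index name from the
book's mcs-lock.c (the exact names in my attached file may differ):

    /* in MCSLockAcquire(), just before the lock_all epoch */
    lmem[nextRank] = -1;            /* no successor known yet */
    MPI_Win_lock_all(0, win);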

I see the same behavior in mpich-3.2 and mpich-3.3a2. The older versions
mpich-3.1 and mpich-3.0.4 freeze immediately no matter which of the
attached codes I run.

On Tue, Mar 14, 2017 at 5:50 AM, Balaji, Pavan <balaji at anl.gov> wrote:

>
> Can you send us the new code?
>
>   -- Pavan
>
> > On Mar 13, 2017, at 2:48 PM, Ask Jakobsen <afj at qeye-labs.com> wrote:
> >
> > Pavan, I have followed your advice using MPI_MODE_NOCHECK and added some
> flushes, but I still get race conditions sometimes. I suspect that I have
> not followed your suggestion correctly or that something else is wrong at
> my end.
> >
> > On Mon, Mar 13, 2017 at 7:43 PM, Ask Jakobsen <afj at qeye-labs.com> wrote:
> > Thanks, Pavan and Halim. You are right, it progresses in the
> > Fetch_and_op version without the async progress environment variable. I
> > will try to implement MPI_MODE_NOCHECK as you suggested.
> >
> > To make matters more complicated:
> >
> > I have discovered that the code from the book in mcs-lock.c deviates
> > from the "High-Performance Distributed RMA Locks" pseudo code (see
> > Listing 3 in the paper) and from the original MCS paper "Algorithms for
> > Scalable Synchronization on Shared-Memory Multiprocessors". If I add to
> > the original mcs-lock.c code
> >
> > lmem[nextRank] = -1;
> >
> > before entering the MPI_Win_lock_all in acquire, the code *almost
> > appears* to work! Sort of... with a large number of processes there is
> > still a rare race condition where a few processes don't get to the
> > MPI_Win_free(&win) in main().
> >
> >
> > On Mon, Mar 13, 2017 at 6:15 PM, Halim Amer <aamer at anl.gov> wrote:
> > To be precise, asynchronous progress is not required for this second
> implementation because the busy waiting loop is doing a Fetch_and_op. It is
> required, however, for the first implementation, from the tutorial book,
> because it busy waits with Win_sync.
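> >
> > Schematically, the difference between the two busy waits (a sketch;
> > dummy and result are placeholder names, not the ones in the attached
> > code):
> >
> >    int dummy = 0, result;
> >
> >    /* book version: Win_sync only synchronizes the private and public
> >       window copies; it is not required to make communication progress */
> >    do {
> >        MPI_Win_sync(win);
> >    } while (lmem[blocked] == 1);
> >
> >    /* Fetch_and_op version: every iteration issues an RMA operation and
> >       a flush, which drive the progress engine */
> >    do {
> >        MPI_Fetch_and_op(&dummy, &result, MPI_INT, myrank, blocked,
> >                         MPI_NO_OP, win);
> >        MPI_Win_flush(myrank, win);
> >    } while (result == 1);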
> >
> > Halim
> > www.mcs.anl.gov/~aamer
> >
> >
> > On 3/13/17 8:46 AM, Balaji, Pavan wrote:
> >
> > I should also point out that I don't think your implementation is
> assuming asynchronous progress.  You shouldn't have to do any of the
> asynchronous progress tweaks for it to work correctly.
> >
> >   -- Pavan
> >
> > On Mar 13, 2017, at 8:43 AM, Balaji, Pavan <balaji at anl.gov> wrote:
> >
> >
> > OK, I spent a little more time going through the code.  The algorithm
> looks correct, except for some minor issues:
> >
> > 1. mcs-lock-fop.c:72 -- you need a flush or flush_local.  You were lucky
> that this was working correctly since it's local, but the MPI standard
> doesn't guarantee it.
> >
> > 2. You might be able to simplify mcs-lock-fop.c lines 72-90 as follows:
> >
> >    do {
> >        MPI_Fetch_and_op(&dummy, &fetch_nextrank, MPI_INT,
> >                         myrank, nextRank, MPI_NO_OP, win);
> >        MPI_Win_flush(myrank, win);
> >    } while (fetch_nextrank == -1);
> >
> > 3. Polling on the nextrank value is better than polling on a remote
> location.  However, you could further simplify this by using send/recv to
> notify the waiting process rather than RMA.  This allows the MPI
> implementation the opportunity to block waiting for progress, rather than
> poll (though in practice, current implementations poll anyway).
> >
> > 4. Since you are always using the lock in shared mode, you should
> specify the hint MPI_MODE_NOCHECK in your lock_all epochs.
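> >
> > For #4, the hint would look like this (a minimal sketch):
> >
> >    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
> >    /* ... RMA epoch; every process only uses lock_all (shared mode) ... */
> >    MPI_Win_unlock_all(win);
> >
> > And for #3, the send/recv notification could look roughly like this
> > (LOCK_TAG, successor and predecessor are placeholder names, not taken
> > from your attached code):
> >
> >    int token = 0;
> >
> >    /* releasing process, once it knows its successor */
> >    MPI_Send(&token, 0, MPI_INT, successor, LOCK_TAG, MPI_COMM_WORLD);
> >
> >    /* waiting process, instead of busy-polling its window memory */
> >    MPI_Recv(&token, 0, MPI_INT, predecessor, LOCK_TAG, MPI_COMM_WORLD,
> >             MPI_STATUS_IGNORE);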
> >
> > Now, coming to your bug: this does seem to be a bug in the MPI
> > implementation.  We can dig into it further.  In the meantime, if you use
> > optimization #4 above, it will allow the MPI implementation to bypass the
> > locking checks entirely, which will get you past the bug for now.
> >
> > Thanks for reporting the issue.
> >
> >  -- Pavan
> >
> > On Mar 13, 2017, at 3:06 AM, Ask Jakobsen <afj at qeye-labs.com> wrote:
> >
> > I don't think so. Rank 0 also holds the tail, i.e., the rank of the
> > process that most recently requested the mutex.
> >
> > On Mon, Mar 13, 2017 at 2:55 AM, Balaji, Pavan <balaji at anl.gov> wrote:
> >
> > Shouldn't winsize be 3 integers in your code?  (sorry, I spent only 30
> seconds looking at the code, so I might have missed something).
> >
> >  -- Pavan
> >
> > On Mar 12, 2017, at 2:44 PM, Ask Jakobsen <afj at qeye-labs.com> wrote:
> >
> > Interestingly, the paper you suggested appears to include a similar test
> > in its pseudo code: https://htor.inf.ethz.ch/publications/img/hpclocks.pdf
> > (see Listing 3 in the paper).
> >
> > Unfortunately, removing the test in the release protocol did not solve
> > the problem. The race condition is much more difficult to provoke, but I
> > managed to trigger it when setting the size of the communicator to 3 (I
> > had only tested even sizes so far).
> >
> > Following Jeff's suggestion, I have attempted to rewrite the code,
> > removing local loads and stores in the MPI_Win_lock_all epochs by using
> > MPI_Fetch_and_op (see attached files).
> >
> > This version behaves very similarly to the original code and also fails
> > from time to time. Putting a sleep into the acquire busy loop
> > (usleep(100)) makes the code "much more robust" (a hack, I know, but it
> > points to some underlying race condition?!). Let me know if you see any
> > problems in the way I am using MPI_Fetch_and_op in a busy loop. Flushing
> > or syncing is not necessary in this case, right?
> >
> > All work is done with export MPIR_CVAR_ASYNC_PROGRESS=1 on mpich-3.2 and
> mpich-3.3a2
> >
> > On Wed, Mar 8, 2017 at 4:21 PM, Halim Amer <aamer at anl.gov> wrote:
> > I cannot claim that I thoroughly verified the correctness of that code,
> so take it with a grain of salt. Please keep in mind that it is a test code
> from a tutorial book; those codes are meant for learning purposes not for
> deployment.
> >
> > If your goal is to have a high-performance RMA lock, I suggest you look
> > into the recent HPDC'16 paper: "High-Performance Distributed RMA
> > Locks".
> >
> > Halim
> > www.mcs.anl.gov/~aamer
> >
> > On 3/8/17 3:06 AM, Ask Jakobsen wrote:
> > You are absolutely correct, Halim. Removing the test lmem[nextRank] == -1
> > in release fixes the problem. Great work. Now I will try to understand
> why
> > you are right. I hope the authors of the book will credit you for
> > discovering the bug.
> >
> > So in conclusion, you need to remove the above-mentioned test AND enable
> > asynchronous progress using the environment variable
> > MPIR_CVAR_ASYNC_PROGRESS=1 in MPICH (BTW, I still can't get the code to
> > work in openmpi).
> >
> > On Tue, Mar 7, 2017 at 5:37 PM, Halim Amer <aamer at anl.gov> wrote:
> >
> > detect that another process is being or already enqueued in the MCS
> > queue.
> >
> > Actually, the problem occurs only when the waiting process has already
> > enqueued itself, i.e., the accumulate operation on the nextRank field
> > succeeded.
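> >
> > By "enqueued" I mean the step where the waiting process writes its rank
> > into its predecessor's nextRank field, roughly (a sketch; predecessor is
> > the rank returned by the fetch-and-op on the tail):
> >
> >    MPI_Accumulate(&myrank, 1, MPI_INT, predecessor, nextRank,
> >                   1, MPI_INT, MPI_REPLACE, win);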
> >
> > Halim
> > www.mcs.anl.gov/~aamer
> >
> >
> > On 3/7/17 10:29 AM, Halim Amer wrote:
> >
> > In the Release protocol, try removing this test:
> >
> > if (lmem[nextRank] == -1) {
> >   If-Block;
> > }
> >
> > but keep the If-Block.
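> >
> > Roughly speaking (a sketch that paraphrases the shape of the release
> > code rather than copying it; variable names are illustrative):
> >
> >    int nullrank = -1, curtail;
> >
> >    /* run the If-Block unconditionally */
> >    MPI_Compare_and_swap(&nullrank, &myrank, &curtail, MPI_INT,
> >                         0, lockTail, win);
> >    if (curtail == myrank) {
> >        /* we were the tail: no successor, nothing more to do */
> >    } else {
> >        /* a successor exists or is enqueuing itself: wait for it to
> >           publish its rank, then clear its blocked flag */
> >    }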
> >
> > The hang occurs because the process releasing the MCS lock fails to
> > detect that another process is being or already enqueued in the MCS
> queue.
> >
> > Halim
> > www.mcs.anl.gov/~aamer
> >
> >
> > On 3/7/17 6:43 AM, Ask Jakobsen wrote:
> >
> > Thanks, Halim. I have now enabled asynchronous progress in MPICH (I
> > can't find something similar in openmpi) and now all ranks acquire the
> > lock and the program finishes as expected. However, if I put a while(1)
> > loop around the acquire-release code in main.c, it will fail again at
> > random and go into an infinite loop. The simple unfair lock does not
> > have this problem.
> >
> > On Tue, Mar 7, 2017 at 12:44 AM, Halim Amer <aamer at anl.gov> wrote:
> >
> > My understanding is that this code assumes asynchronous progress.
> > An example of when the processes hang is as follows:
> >
> > 1) P0 finishes MCSLockAcquire()
> > 2) P1 is busy waiting in MCSLockAcquire() at
> >      do {
> >          MPI_Win_sync(win);
> >      } while (lmem[blocked] == 1);
> > 3) P0 executes MCSLockRelease()
> > 4) P0 waits on MPI_Win_lock_all() inside MCSLockRelease()
> >
> > Hang!
> >
> > For P1 to get out of the loop, P0 has to get out of MPI_Win_lock_all()
> > and execute its Compare_and_swap().
> >
> > For P0 to get out of MPI_Win_lock_all(), it needs an ACK from P1 that
> > it got the lock.
> >
> > P1 does not make communication progress because MPI_Win_sync is not
> > required to do so. It only synchronizes private and public copies.
> >
> > For this hang to disappear, one can either trigger progress manually by
> > using heavy-duty synchronization calls instead of Win_sync (e.g.,
> > Win_unlock_all + Win_lock_all), or enable asynchronous progress.
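> >
> > A rough sketch of the heavy-duty variant of the busy wait (names as in
> > the tutorial code):
> >
> >    do {
> >        MPI_Win_unlock_all(win);   /* closing the epoch forces progress */
> >        MPI_Win_lock_all(0, win);
> >        MPI_Win_sync(win);         /* refresh the private copy before the load */
> >    } while (lmem[blocked] == 1);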
> >
> > To enable asynchronous progress in MPICH, set the
> > MPIR_CVAR_ASYNC_PROGRESS
> > env var to 1.
> >
> > Halim
> > www.mcs.anl.gov/~aamer
> >
> >
> > On 3/6/17 1:11 PM, Ask Jakobsen wrote:
> >
> > I am testing on an x86_64 platform.
> >
> > I have tried to build both the mpich and the mcs lock code with -O0 to
> > avoid aggressive optimization. Following your suggestion, I have also
> > tried making a volatile int *pblocked point to lmem[blocked] in the
> > MCSLockAcquire function and a volatile int *pnextrank point to
> > lmem[nextRank] in MCSLockRelease, but it does not appear to make a
> > difference.
> >
> > At Richard Warren's suggestion I have also tried building the code
> > using openmpi-2.0.2, without any luck (however, it appears to acquire
> > the lock a couple of extra times before failing), which I find
> > troubling.
> >
> > I think I will give up on using local loads/stores and will see if I
> > can figure out how to rewrite the code using MPI calls like
> > MPI_Fetch_and_op, as you suggest. Thanks for your help.
> >
> > On Mon, Mar 6, 2017 at 7:20 PM, Jeff Hammond <jeff.science at gmail.com>
> > wrote:
> >
> > What processor architecture are you testing?
> >
> >
> > Maybe set lmem to volatile or read it with MPI_Fetch_and_op rather than
> > a load.  MPI_Win_sync cannot prevent the compiler from caching *lmem in
> > a register.
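> >
> > For instance, a sketch of the volatile variant of the busy wait
> > (assuming the lmem pointer and blocked index from the tutorial code):
> >
> >    volatile int *vlmem = lmem;    /* force a real load on every iteration */
> >    do {
> >        MPI_Win_sync(win);
> >    } while (vlmem[blocked] == 1);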
> >
> > Jeff
> >
> > On Sat, Mar 4, 2017 at 12:30 AM, Ask Jakobsen <afj at qeye-labs.com>
> > wrote:
> >
> > Hi,
> >
> >
> > I have downloaded the source code for the MCS lock from the excellent
> > book "Using Advanced MPI", available at
> > http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples-advmpi/rma2/mcs-lock.c
> >
> > I have made a very simple piece of test code for testing the MCS lock,
> > but it works only sporadically and often never escapes the busy loops
> > in the acquire and release functions (see attached source code). The
> > code appears semantically correct to my eyes.
> >
> > #include <stdio.h>
> > #include <mpi.h>
> > #include "mcs-lock.h"
> >
> > int main(int argc, char *argv[])
> > {
> >  MPI_Win win;
> >  MPI_Init( &argc, &argv );
> >
> >  MCSLockInit(MPI_COMM_WORLD, &win);
> >
> >  int rank, size;
> >  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >  MPI_Comm_size(MPI_COMM_WORLD, &size);
> >
> >  printf("rank: %d, size: %d\n", rank, size);
> >
> >
> >  MCSLockAcquire(win);
> >  printf("rank %d aquired lock\n", rank);   fflush(stdout);
> >  MCSLockRelease(win);
> >
> >
> >  MPI_Win_free(&win);
> >  MPI_Finalize();
> >  return 0;
> > }
> >
> >
> > I have tested on several hardware platforms with mpich-3.2 and
> > mpich-3.3a2, but with no luck.
> >
> > It appears that MPI_Win_sync is not "refreshing" the local data, or I
> > have a bug I can't spot.
> >
> > A simple unfair lock like
> > http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples-advmpi/rma2/ga_mutex1.c
> > works perfectly.
> >
> > Best regards, Ask Jakobsen
> >
> >
> > --
> > Jeff Hammond
> > jeff.science at gmail.com
> > http://jeffhammond.github.io/
> >
> > --
> > Ask Jakobsen
> > R&D
> >
> > Qeye Labs
> > Lersø Parkallé 107
> > 2100 Copenhagen Ø
> > Denmark
> >
> > mobile: +45 2834 6936
> > email: afj at Qeye-Labs.com

Attachments: mcs-lock.c, mcs-lock-fop.c, main.c