[mpich-discuss] Assertion Failure using MPICH3 RMA
wbland at anl.gov
Mon Nov 3 15:35:38 CST 2014
Have you tried updating to the latest version of MPICH? You said you haven't seen the issue when you run with 3.1. We're now on 3.1.3, and I believe a number of RMA bug fixes have gone in recently.
> On Nov 3, 2014, at 3:14 PM, Corey A. Henderson <cahenderson at wisc.edu> wrote:
> While developing an MPI code that uses MPI-3.0 RMA shared-lock features, I see the following assertion failure in about 1 in 100 runs on my local desktop. The assertion does not fail at the same point in a run, nor does it follow any pattern that I can see. It has not happened on a cluster I use that is running MPICH 3.1, but I haven't run my code there very often yet.
> Failure text:
> Assertion failed in file <snip>/mpich-3.0.4/src/mpid/ch3/src/ch3u_rma_sync.c at line 2803: win_ptr->targets[target_rank].remote_lock_state == MPIDI_CH3_WIN_LOCK_REQUESTED || win_ptr->targets[target_rank].remote_lock_state == MPIDI_CH3_WIN_LOCK_GRANTED
> internal ABORT - process 0
> I have not attempted to recreate the issue with a smaller code snippet because I am not sure where to even begin to do so. Can anyone suggest to me where I might start to look for the cause of this?
> Some notes on what the code does:
> - One fixed-size window per node, opened with MPI_Win_allocate at program start
> - All windows locked (shared) with MPI_Win_lock_all immediately after creation
> - Code uses MPI_Fetch_and_op and MPI_Compare_and_swap on a few concurrently-accessed MPI_INT locations in a node's window
> - Code uses GET/PUT for data access to the rest of the window on any node (those portions that are not accessed concurrently)
> MPICH is v3.0.4 on 64-bit Ubuntu 12.04 LTS in a single-machine configuration.
> The error occurs regardless of how many MPI processes I may be testing with at any given time. I have not nailed down where in the code to trace to see why the error occurs because I don't know what could cause this (which is why I'm asking). The MPI messaging portion of the code hasn't changed in a couple of months, but I'm starting to run my code more often and for longer periods now as it nears completion.
> Any help on where to begin tracing to fix this would be great.
> Corey A. Henderson
> PhD Candidate and NSF Graduate Fellow
> Dept. of Engineering Physics
> Univ. of Wisconsin - Madison
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe: