[mpich-discuss] Assertion Failure using MPICH3 RMA

Corey A. Henderson cahenderson at wisc.edu
Mon Nov 3 15:14:44 CST 2014

While developing an MPI code that uses the MPI-3.0 RMA shared-lock
features, I see the following assertion failure in roughly 1 of every 100
runs on my local desktop. The assertion does not fail at the same point in
a run, nor does it follow any pattern that I can see. It has not happened
on a cluster I use that runs MPICH 3.1, but I haven't run my code there
very often.

Failure text:

Assertion failed in file
<snip>/mpich-3.0.4/src/mpid/ch3/src/ch3u_rma_sync.c at line 2803:
win_ptr->targets[target_rank].remote_lock_state ==
win_ptr->targets[target_rank].remote_lock_state ==
internal ABORT - process 0

I have not attempted to recreate the issue with a smaller code snippet
because I am not sure where to even begin to do so. Can anyone suggest to
me where I might start to look for the cause of this?

Some notes on what the code does:

- One fixed-size window per node, opened (MPI_Win_allocate) at program start
- All windows locked (shared) after creation with MPI_Win_lock_all
- Code uses MPI_Fetch_and_op and MPI_Compare_and_swap on a few
concurrently-accessed MPI_INT locations in a node's window
- Code uses MPI_Get/MPI_Put for data access to the rest of the window on
any node (those portions that are not accessed concurrently)

MPICH is v3.0.4, running on 64-bit Ubuntu 12.04 LTS on a single machine.

The error occurs regardless of how many MPI processes I am testing with.
I have not narrowed down where in the code to start tracing, because I
don't know what could cause this assertion to fail (which is why I'm
asking). The MPI messaging portion of the code hasn't changed in a couple
of months, but I am now running the code more often and for longer periods
as it nears completion.

Any help on where to begin tracing to fix this would be great.

Corey A. Henderson
PhD Candidate and NSF Graduate Fellow
Dept. of Engineering Physics
Univ. of Wisconsin - Madison
