[mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes)

Victor Vysotskiy victor.vysotskiy at teokem.lu.se
Wed Nov 12 04:28:51 CST 2014


Dear Pavan, 
Dear Xin,

I just downloaded the latest nightly MPICH3 tarball ('mpich-master-v3.1.3-174-gb0f5772f') and compiled it on our IB cluster using the Intel compilers v15.0:

%mpichversion 
MPICH Version:          3.1.3
MPICH Release date:     Wed Nov 12 00:00:34 CST 2014
MPICH Device:           ch3:nemesis
MPICH configure:        --prefix=/nobackup/global/x_vicvy/mpich3-dev.ib --with-device=ch3:nemesis:ib CC=icc CXX=icpc FC=ifort F77=ifort
MPICH CC:       icc    -O2
MPICH CXX:      icpc   -O2
MPICH F77:      ifort   -O2
MPICH FC:       ifort   -O2

Unfortunately, the test-bed code still crashes with the same error message:

%mpirun  -np 8 ./mpi_tvec2_rma 64 400000 
Allocating memory: win_buf=195 (Mb), loc_buf=24 (Mb)
Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 71: win_ptr->at_completion_counter >= 0
internal ABORT - process 0
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 65516 RUNNING AT n3
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

One change I have observed: the code always crashes with 8 processes, but it sometimes runs to completion with 4 processes (!). However, even with 4 processes you can reproduce the problem by running the test-bed several times in a row, e.g.:

for i in {1..5}; do mpirun  -np 4 ./mpi_tvec2_rma 64 400000; done
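
Because the failure is intermittent with 4 processes, it can help to count how many runs fail rather than watch the output. The following is only a minimal sketch: `count_failures` is a hypothetical helper (not part of MPICH), and it assumes that mpirun returns a non-zero exit status when a rank aborts.

```shell
# count_failures N CMD [ARGS...]
# Hypothetical helper: runs CMD N times and prints how many runs
# exited with a non-zero status. The body is a subshell, so its
# variables do not leak into the calling shell.
count_failures() (
    runs=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Discard the command's output; only the exit status is used.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
)
```

With the same arguments as in the loop above, it would be called as:

count_failures 5 mpirun -np 4 ./mpi_tvec2_rma 64 400000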

To make sure the problem is not specific to the cluster, I also fetched the latest commits from the mpich/master repo and rebuilt MPICH3 on my laptop with GCC v4.9:

%git log --pretty=oneline --abbrev-commit -5
b0f5772 Revert RMA ADI change for req-based RMA operations.
eedd51e Delete unused variable.
9107068 Delete no longer needed file.
c235c75 Delete no longer used epoch states.
f695c96 Bug-fixing: set window state to MPIDI_RMA_NONE when UNLOCK finishes

%mpichversion 
MPICH Version:          3.1.3
MPICH Release date:     unreleased development copy
MPICH Device:           ch3:nemesis
MPICH configure:        --prefix=/opt/mpi/mpich3-dev/ --no-create --no-recursion
MPICH CC:       gcc    -O2
MPICH CXX:      g++   -O2
MPICH F77:      gfortran   -O2
MPICH FC:       gfortran   -O2


The problem persists on my laptop as well:

mpirun  -np 4 ./mpi_tvec2_rma 64 400000 
Allocating memory: win_buf=195 (Mb), loc_buf=48 (Mb)
Allocating memory: win_buf=195 (Mb), loc_buf=48 (Mb)
Allocating memory: win_buf=195 (Mb), loc_buf=48 (Mb)
Allocating memory: win_buf=195 (Mb), loc_buf=48 (Mb)
Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 71: win_ptr->at_completion_counter >= 0
internal ABORT - process 1
...

It would be great if you could look into the issue and fix it, if needed.

With best regards,
Victor.

P.S. Just in case, the test-bed is attached to the first message of this thread:
http://lists.mpich.org/pipermail/discuss/attachments/20141110/b5b34433/attachment.obj
