[mpich-discuss] Help with debugging
Kochunas, Brendan
bkochuna at umich.edu
Tue Jun 19 19:52:52 CDT 2018
We've been dealing with a particularly nasty bug in our code and are
having trouble debugging it.
We are using RHEL7 with the default GNU compiler collection (4.8.5) and
the yum package for mpich (mpich-3.2-x86_64).
The rest of this email briefly describes:
* our communication pattern
* what we observe
* what we've tested (and think we know)
Description of communication pattern:
This is point-to-point communication using MPI_Isend, MPI_Irecv, and
MPI_Waitall. It's effectively a structured Cartesian grid, and we send
multiple messages per face; the messages are fairly large (a few KB
each). A rough sketch of the pattern is below.
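A minimal sketch of the pattern (not our actual code; the tag, counts,
neighbor ranks, and one message per neighbor are placeholders, and in
practice we post several messages per face) looks roughly like:

subroutine exchange_faces(comm, nnbr, nbr, n, sendbuf, recvbuf)
  use mpi
  implicit none
  integer, intent(in) :: comm, nnbr, nbr(nnbr), n
  double precision, intent(in)    :: sendbuf(n, nnbr)
  double precision, intent(inout) :: recvbuf(n, nnbr)
  integer :: reqs(2*nnbr), stats(MPI_STATUS_SIZE, 2*nnbr)
  integer :: i, ierr

  ! Post all receives first, then all sends, then wait on everything.
  do i = 1, nnbr
    call MPI_Irecv(recvbuf(1,i), n, MPI_DOUBLE_PRECISION, nbr(i), 100, &
                   comm, reqs(i), ierr)
  end do
  do i = 1, nnbr
    call MPI_Isend(sendbuf(1,i), n, MPI_DOUBLE_PRECISION, nbr(i), 100, &
                   comm, reqs(nnbr+i), ierr)
  end do
  call MPI_Waitall(2*nnbr, reqs, stats, ierr)
end subroutine exchange_faces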
What we observe:
The problem we run into is that when we get to the MPI_Waitall we
receive something like:
Fatal error in PMPI_Waitall: Other MPI error, error stack:
PMPI_Waitall(323)...............: MPI_Waitall(count=1152,
req_array=0x13493b90, status_array=0x1) failed
MPIR_Waitall_impl(166)..........:
MPIDI_CH3I_Progress(422)........:
MPID_nem_handle_pkt(642)........:
pkt_CTS_handler(321)............:
MPID_nem_lmt_shm_start_send(270):
MPID_nem_delete_shm_region(923).:
MPIU_SHMW_Seg_detach(707).......: unable to remove shared memory -
unlink No such file or directory
Our test case is 9 processors (a 3x3 grid).
We have tested several things that sometimes (but not reliably) let us
avoid the problem:
* sending multiple smaller messages (about 50x as many) instead of one
large message
* adding debug print statements near the MPI_Waitall
* removing compiler optimizations
* running under valgrind (e.g. mpirun -np <n> valgrind <exe>)
We have also observed this error on multiple machines (a Cray XK7,
workstations with Intel Xeons, and an SGI ICE-X with MVAPICH).
The output from valgrind is:
==210233== Invalid read of size 8
==210233== at 0xB3B7060: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1022)
==210233== by 0xE7F8FBB: MPID_nem_mpich_sendv_header
(mpid_nem_inline.h:363)
==210233== by 0xE7F8FBB: MPIDI_CH3I_Shm_send_progress
(ch3_progress.c:207)
==210233== by 0xE7FBD93: MPIDI_CH3I_Progress (ch3_progress.c:583)
==210233== by 0xE7276B4: MPIR_Waitall_impl (waitall.c:164)
==210233== by 0xE727BB5: PMPI_Waitall (waitall.c:378)
==210233== by 0xDFE4DFD: mpi_waitall (waitallf.c:275)
==210233== by 0x49FF27F: __updatebc_MOD_finish_updatebc_base
(UpdateBC.f90:361)
==210233== Address 0x142ec268 is 152 bytes inside a block of size 216
free'd
==210233== at 0xB3B3CBD: free (vg_replace_malloc.c:530)
==210233== by 0xE83FB41: MPL_trfree (in
/home/bkochuna/sw/lib/libmpi.so.12.1.0)
==210233== by 0xE7837D8: MPIU_trfree (trmem.c:37)
==210233== by 0xE80DC08: MPIU_SHMW_Hnd_free (mpiu_shm_wrappers.h:247)
==210233== by 0xE80DC08: MPIU_SHMW_Hnd_finalize (mpiu_shm_wrappers.h:443)
==210233== by 0xE80DC08: MPID_nem_delete_shm_region
(mpid_nem_lmt_shm.c:963)
==210233== by 0xE80DC08: MPID_nem_lmt_shm_start_send
(mpid_nem_lmt_shm.c:270)
==210233== by 0xE8097E0: pkt_CTS_handler (mpid_nem_lmt.c:352)
==210233== by 0xE7FB3C5: MPID_nem_handle_pkt (ch3_progress.c:760)
==210233== by 0xE7FBF8D: MPIDI_CH3I_Progress (ch3_progress.c:570)
==210233== by 0xE7276B4: MPIR_Waitall_impl (waitall.c:164)
==210233== by 0xE727BB5: PMPI_Waitall (waitall.c:378)
==210233== by 0xDFE4DFD: mpi_waitall (waitallf.c:275)
==210233== by 0x49FF27F: __updatebc_MOD_finish_updatebc_base
(UpdateBC.f90:361)
==210233== Block was alloc'd at
==210233== at 0xB3B2BC3: malloc (vg_replace_malloc.c:299)
==210233== by 0xE83F38F: MPL_trmalloc (in
/home/bkochuna/sw/lib/libmpi.so.12.1.0)
==210233== by 0xE783526: MPIU_trmalloc (trmem.c:29)
==210233== by 0xE80CF74: MPIU_SHMW_Ghnd_alloc (mpiu_shm_wrappers.h:188)
==210233== by 0xE80CF74: MPIU_SHMW_Seg_create_attach_templ
(mpiu_shm_wrappers.h:622)
==210233== by 0xE80CF74: MPIU_SHMW_Seg_create_and_attach
(mpiu_shm_wrappers.h:894)
==210233== by 0xE80CF74: MPID_nem_allocate_shm_region
(mpid_nem_lmt_shm.c:885)
==210233== by 0xE80CF74: MPID_nem_lmt_shm_start_recv
(mpid_nem_lmt_shm.c:180)
==210233== by 0xE8094AF: do_cts (mpid_nem_lmt.c:560)
==210233== by 0xE809EBE: pkt_RTS_handler (mpid_nem_lmt.c:276)
==210233== by 0xE7FB3C5: MPID_nem_handle_pkt (ch3_progress.c:760)
==210233== by 0xE7FBF8D: MPIDI_CH3I_Progress (ch3_progress.c:570)
==210233== by 0xE7276B4: MPIR_Waitall_impl (waitall.c:164)
==210233== by 0xE727BB5: PMPI_Waitall (waitall.c:378)
==210233== by 0xDFE4DFD: mpi_waitall (waitallf.c:275)
==210233== by 0x49FF27F: __updatebc_MOD_finish_updatebc_base
(UpdateBC.f90:361)
So my main questions to the list are:
* Are there any known issues with mpich-3.2 and the shared-memory
communication within nemesis that might affect non-blocking communication?
* Does this look more like an issue in our code or within MPICH?
* If it looks like an issue in our code, what is the best way to debug
it? Our current approach of adding basic print statements changes the
behavior and prevents us from identifying it. I suspect there may be
some memory overwrite beyond array bounds, but compiling in debug with
-fbounds-check does not expose the problem (and due to the nature of the
MPI interfaces, I would not expect it to). We've stared at the code for
the calls to MPI_Isend and MPI_Irecv and everything looks correct (e.g.
we don't touch the buffers while the requests are outstanding, the sizes
match up, etc.).
* Are there limits on the number of simultaneous messages (we are below
the maximum number of MPI requests) or on message sizes (I don't believe
there is a limit here beyond integer overflow of the count)?
* We've tried changing from MPI_Waitall to individual MPI_Waits and
MPI_Tests, but the problem still happens with the individual waits.
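For reference, the individual-wait variant essentially replaces the
MPI_Waitall in the sketch above with a loop like the following (stat
being a single status array of size MPI_STATUS_SIZE):

  integer :: stat(MPI_STATUS_SIZE)
  do i = 1, 2*nnbr
    call MPI_Wait(reqs(i), stat, ierr)
  end do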
Any insights are welcome.
Thanks,
-Brendan