[mpich-discuss] Help with debugging
Kochunas, Brendan
bkochuna at umich.edu
Tue Jun 19 19:52:52 CDT 2018
We've been dealing with a particularly nasty bug in our code and are
having trouble debugging it.
We are using RHEL7 with the default GNU compiler collection (4.8.5) and
the yum package for mpich (mpich-3.2-x86_64).
The rest of this email briefly describes:
* our communication pattern
* what we observe
* what we've tested (and think we know)
Description of communication pattern:
This is point-to-point communication using MPI_Isend, MPI_Irecv, and
MPI_Waitall. It's effectively a structured Cartesian grid, and we send
multiple messages per face; the messages are fairly large (a few KB
each). A rough sketch of the pattern is below.
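A minimal sketch of the pattern (not our actual code; the tag, counts,
neighbor ranks, and one message per neighbor are placeholders, and in
practice we post several messages per face) looks roughly like:

subroutine exchange_faces(comm, nnbr, nbr, n, sendbuf, recvbuf)
  use mpi
  implicit none
  integer, intent(in) :: comm, nnbr, nbr(nnbr), n
  double precision, intent(in)    :: sendbuf(n, nnbr)
  double precision, intent(inout) :: recvbuf(n, nnbr)
  integer :: reqs(2*nnbr), stats(MPI_STATUS_SIZE, 2*nnbr)
  integer :: i, ierr

  ! Post all receives first, then all sends, then wait on everything.
  do i = 1, nnbr
    call MPI_Irecv(recvbuf(1,i), n, MPI_DOUBLE_PRECISION, nbr(i), 100, &
                   comm, reqs(i), ierr)
  end do
  do i = 1, nnbr
    call MPI_Isend(sendbuf(1,i), n, MPI_DOUBLE_PRECISION, nbr(i), 100, &
                   comm, reqs(nnbr+i), ierr)
  end do
  call MPI_Waitall(2*nnbr, reqs, stats, ierr)
end subroutine exchange_faces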
What we observe:
The problem we run into is that when we get to the MPI_Waitall we
receive something like:
Fatal error in PMPI_Waitall: Other MPI error, error stack:
PMPI_Waitall(323)...............: MPI_Waitall(count=1152,
req_array=0x13493b90, status_array=0x1) failed
MPIR_Waitall_impl(166)..........:
MPIDI_CH3I_Progress(422)........:
MPID_nem_handle_pkt(642)........:
pkt_CTS_handler(321)............:
MPID_nem_lmt_shm_start_send(270):
MPID_nem_delete_shm_region(923).:
MPIU_SHMW_Seg_detach(707).......: unable to remove shared memory -
unlink No such file or directory
Our test case is 9 processors (a 3x3 grid).
We have tested several things that sometimes (but not reliably) let us
avoid the problem:
* sending multiple smaller messages (about 50x as many) instead of one
large message
* adding debug print statements near the MPI_Waitall
* removing compiler optimizations
* running under valgrind (e.g. mpirun -np <n> valgrind <exe>)
We have also observed this error on multiple machines (a Cray XK7,
workstations with Intel Xeons, and an SGI ICE-X with MVAPICH).
The output from valgrind is:
==210233== Invalid read of size 8
==210233== at 0xB3B7060: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1022)
==210233== by 0xE7F8FBB: MPID_nem_mpich_sendv_header
(mpid_nem_inline.h:363)
==210233== by 0xE7F8FBB: MPIDI_CH3I_Shm_send_progress
(ch3_progress.c:207)
==210233== by 0xE7FBD93: MPIDI_CH3I_Progress (ch3_progress.c:583)
==210233== by 0xE7276B4: MPIR_Waitall_impl (waitall.c:164)
==210233== by 0xE727BB5: PMPI_Waitall (waitall.c:378)
==210233== by 0xDFE4DFD: mpi_waitall (waitallf.c:275)
==210233== by 0x49FF27F: __updatebc_MOD_finish_updatebc_base
(UpdateBC.f90:361)
==210233== Address 0x142ec268 is 152 bytes inside a block of size 216
free'd
==210233== at 0xB3B3CBD: free (vg_replace_malloc.c:530)
==210233== by 0xE83FB41: MPL_trfree (in
/home/bkochuna/sw/lib/libmpi.so.12.1.0)
==210233== by 0xE7837D8: MPIU_trfree (trmem.c:37)
==210233== by 0xE80DC08: MPIU_SHMW_Hnd_free (mpiu_shm_wrappers.h:247)
==210233== by 0xE80DC08: MPIU_SHMW_Hnd_finalize (mpiu_shm_wrappers.h:443)
==210233== by 0xE80DC08: MPID_nem_delete_shm_region
(mpid_nem_lmt_shm.c:963)
==210233== by 0xE80DC08: MPID_nem_lmt_shm_start_send
(mpid_nem_lmt_shm.c:270)
==210233== by 0xE8097E0: pkt_CTS_handler (mpid_nem_lmt.c:352)
==210233== by 0xE7FB3C5: MPID_nem_handle_pkt (ch3_progress.c:760)
==210233== by 0xE7FBF8D: MPIDI_CH3I_Progress (ch3_progress.c:570)
==210233== by 0xE7276B4: MPIR_Waitall_impl (waitall.c:164)
==210233== by 0xE727BB5: PMPI_Waitall (waitall.c:378)
==210233== by 0xDFE4DFD: mpi_waitall (waitallf.c:275)
==210233== by 0x49FF27F: __updatebc_MOD_finish_updatebc_base
(UpdateBC.f90:361)
==210233== Block was alloc'd at
==210233== at 0xB3B2BC3: malloc (vg_replace_malloc.c:299)
==210233== by 0xE83F38F: MPL_trmalloc (in
/home/bkochuna/sw/lib/libmpi.so.12.1.0)
==210233== by 0xE783526: MPIU_trmalloc (trmem.c:29)
==210233== by 0xE80CF74: MPIU_SHMW_Ghnd_alloc (mpiu_shm_wrappers.h:188)
==210233== by 0xE80CF74: MPIU_SHMW_Seg_create_attach_templ
(mpiu_shm_wrappers.h:622)
==210233== by 0xE80CF74: MPIU_SHMW_Seg_create_and_attach
(mpiu_shm_wrappers.h:894)
==210233== by 0xE80CF74: MPID_nem_allocate_shm_region
(mpid_nem_lmt_shm.c:885)
==210233== by 0xE80CF74: MPID_nem_lmt_shm_start_recv
(mpid_nem_lmt_shm.c:180)
==210233== by 0xE8094AF: do_cts (mpid_nem_lmt.c:560)
==210233== by 0xE809EBE: pkt_RTS_handler (mpid_nem_lmt.c:276)
==210233== by 0xE7FB3C5: MPID_nem_handle_pkt (ch3_progress.c:760)
==210233== by 0xE7FBF8D: MPIDI_CH3I_Progress (ch3_progress.c:570)
==210233== by 0xE7276B4: MPIR_Waitall_impl (waitall.c:164)
==210233== by 0xE727BB5: PMPI_Waitall (waitall.c:378)
==210233== by 0xDFE4DFD: mpi_waitall (waitallf.c:275)
==210233== by 0x49FF27F: __updatebc_MOD_finish_updatebc_base
(UpdateBC.f90:361)
So my main questions to the list are:
* Are there any known issues with mpich-3.2 and the shared-memory
communication within nemesis that might affect non-blocking communication?
* Does this look more like an issue in our code or within MPICH?
* If it looks like an issue in our code, what is the best way to debug
it? Our current approach of adding basic print statements changes the
behavior and prevents us from identifying it. I suspect there may be
some memory overwrite beyond array bounds, but compiling in debug with
-fbounds-check does not expose the problem (and due to the nature of the
MPI interfaces, I would not expect it to). We've stared at the code for
the calls to MPI_Isend and MPI_Irecv and everything looks correct (e.g.
we don't touch the buffers while the requests are outstanding, the sizes
match up, etc.).
* Are there limits on the number of simultaneous messages (we are below
the maximum number of MPI requests) or on message sizes (I don't believe
there is a limit here beyond integer overflow of the count)?
* We've tried changing from MPI_Waitall to individual MPI_Waits and
MPI_Tests, but the problem still happens with the individual waits.
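For reference, the individual-wait variant essentially replaces the
MPI_Waitall in the sketch above with a loop like the following (stat
being a single status array of size MPI_STATUS_SIZE):

  integer :: stat(MPI_STATUS_SIZE)
  do i = 1, 2*nnbr
    call MPI_Wait(reqs(i), stat, ierr)
  end do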
Any insights are welcome.
Thanks,
-Brendan