[mpich-discuss] Help with debugging

Kochunas, Brendan bkochuna at umich.edu
Thu Jun 21 16:34:33 CDT 2018


Hi All,

I wanted to provide an update as I've been debugging this fairly 
intensely the past 2 days.

First some more information about the communication pattern.

The test case uses 9 processors on a 3x3 Cartesian grid. Each processor 
communicates with its neighbors (so there are 12 communicating pairs). 
There are 768 simultaneous messages (64 per communicating pair) for 
which Isend and Irecv have been called but no wait operations. The total 
data over all messages is about 128 MB, which works out to about 10 MB 
per communicating pair. Individual messages range from 270 KB down to 
13.5 KB. The data being passed are just vanilla Fortran arrays of 
REAL(8)'s (i.e. doubles); no crazy derived types.
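
For concreteness, the pattern boils down to something like the sketch 
below. This is not our actual code; the Cartesian-communicator setup, 
the buffer names, and the fixed 128 KB message size are placeholders I 
picked just to land in the large-message range described above.

  program halo_sketch
    use mpi
    implicit none
    integer, parameter :: nmsg = 64        ! messages per communicating pair
    integer, parameter :: msglen = 16384   ! doubles per message (~128 KB, assumed)
    integer :: ierr, rank, nproc, cart, i, d, nbr, nreq
    integer :: dims(2), nbrs(4)
    logical :: periods(2)
    integer, allocatable :: reqs(:)
    real(8), allocatable :: sbuf(:,:,:), rbuf(:,:,:)

    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)   ! intended for mpirun -np 9
    dims = (/3, 3/); periods = .false.
    call MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, .true., cart, ierr)
    if (cart == MPI_COMM_NULL) then                   ! extra ranks (if any) just exit
      call MPI_Finalize(ierr); stop
    end if
    call MPI_Comm_rank(cart, rank, ierr)
    call MPI_Cart_shift(cart, 0, 1, nbrs(1), nbrs(2), ierr)  ! neighbors along dim 0
    call MPI_Cart_shift(cart, 1, 1, nbrs(3), nbrs(4), ierr)  ! neighbors along dim 1

    allocate(sbuf(msglen, nmsg, 4), rbuf(msglen, nmsg, 4), reqs(2*4*nmsg))
    sbuf = real(rank, 8); rbuf = 0.0d0

    ! Post every Irecv and Isend up front, with per-pair-unique tags and no
    ! intervening waits, then complete everything with a single Waitall.
    nreq = 0
    do d = 1, 4
      nbr = nbrs(d)
      if (nbr == MPI_PROC_NULL) cycle
      do i = 1, nmsg
        nreq = nreq + 1
        call MPI_Irecv(rbuf(:,i,d), msglen, MPI_REAL8, nbr, i, cart, reqs(nreq), ierr)
      end do
    end do
    do d = 1, 4
      nbr = nbrs(d)
      if (nbr == MPI_PROC_NULL) cycle
      do i = 1, nmsg
        nreq = nreq + 1
        call MPI_Isend(sbuf(:,i,d), msglen, MPI_REAL8, nbr, i, cart, reqs(nreq), ierr)
      end do
    end do
    call MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE, ierr)

    if (rank == 0) write(*,*) 'exchange completed'
    call MPI_Finalize(ierr)
  end program halo_sketch

The idea is to run this with mpirun -np 9 on a single node so that every 
message goes through the shared-memory path.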

I double-checked the tags and message sizes for all 768 messages and 
everything checks out: no duplicate tags, and the information on both 
processors in a communicating pair is correct. The machine I'm testing 
on is a RHEL7 workstation with the distribution packages for gcc-4.8.5 
and mpich-3.2; I also built and tested mpich-3.2, mpich-3.2.1, and 
mpich-3.3b2 from source with gcc 4.8.5. The machine has two 48-core 
Intel Skylake processors, so all messages are transmitted via shared 
memory.

At the suggestion of Jeff Hammond, I tried building MPICH with different 
devices in place of nemesis: ch3:sock for mpich-3.2, and ch4:ofi with 
libfabric-1.6.1 for mpich-3.3b2. The problem also does not occur with 
OpenMPI or MVAPICH.
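
For reference, the from-source builds were configured roughly along 
these lines (the install prefixes are placeholders, --enable-g=dbg is 
just one way to get the debug info mentioned below, and the exact 
option spellings should be checked against ./configure --help):

  # mpich-3.2 with the sockets channel instead of nemesis
  ./configure --prefix=$HOME/sw/mpich-3.2-sock --with-device=ch3:sock --enable-g=dbg
  make -j && make install

  # mpich-3.3b2 with the ch4 device over libfabric-1.6.1
  ./configure --prefix=$HOME/sw/mpich-3.3b2-ofi --with-device=ch4:ofi \
              --with-libfabric=$HOME/sw/libfabric-1.6.1 --enable-g=dbg
  make -j && make install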

If I did not use nemesis, the problem did not occur for my test case. If 
I did use nemesis, the problem was encountered for all versions of MPICH 
I tested. As nondeterministic as these types of bugs can be, I do seem 
able to reproduce it reliably. However, if I add print statements to my 
code in certain, seemingly random places, the problem goes away.

So my question to the list is: does this sound like an issue with 
nemesis? Or is it possible there is still some application error?

Since I was able to build MPICH with debug info, running this case under 
valgrind produced a slightly more meaningful stack trace:

==99610== Invalid read of size 2
==99610==    at 0xB3B70BF: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1022)
==99610==    by 0xE7F8FBB: MPID_nem_mpich_sendv_header (mpid_nem_inline.h:363)
==99610==    by 0xE7F8FBB: MPIDI_CH3I_Shm_send_progress (ch3_progress.c:207)
==99610==    by 0xE7FBD93: MPIDI_CH3I_Progress (ch3_progress.c:583)
==99610==    by 0xE72754E: MPIR_Waitall_impl (waitall.c:217)
==99610==    by 0xE727BB5: PMPI_Waitall (waitall.c:378)
==99610==    by 0xDFE4DFD: mpi_waitall (waitallf.c:275)

==99610==  Address 0x149ea1e4 is 180 bytes inside a block of size 216 free'd
==99610==    at 0xB3B3CBD: free (vg_replace_malloc.c:530)
==99610==    by 0xE83FB41: MPL_trfree (in /home/bkochuna/sw/lib/libmpi.so.12.1.0)
==99610==    by 0xE7837D8: MPIU_trfree (trmem.c:37)
==99610==    by 0xE80DC08: MPIU_SHMW_Hnd_free (mpiu_shm_wrappers.h:247)
==99610==    by 0xE80DC08: MPIU_SHMW_Hnd_finalize (mpiu_shm_wrappers.h:443)
==99610==    by 0xE80DC08: MPID_nem_delete_shm_region (mpid_nem_lmt_shm.c:963)
==99610==    by 0xE80DC08: MPID_nem_lmt_shm_start_send (mpid_nem_lmt_shm.c:270)
==99610==    by 0xE8097E0: pkt_CTS_handler (mpid_nem_lmt.c:352)
==99610==    by 0xE7FBD0C: MPIDI_CH3I_Progress (ch3_progress.c:541)
==99610==    by 0xE72754E: MPIR_Waitall_impl (waitall.c:217)
==99610==    by 0xE727BB5: PMPI_Waitall (waitall.c:378)
==99610==    by 0xDFE4DFD: mpi_waitall (waitallf.c:275)

==99610==  Block was alloc'd at
==99610==    at 0xB3B2BC3: malloc (vg_replace_malloc.c:299)
==99610==    by 0xE83F38F: MPL_trmalloc (in /home/bkochuna/sw/lib/libmpi.so.12.1.0)
==99610==    by 0xE783526: MPIU_trmalloc (trmem.c:29)
==99610==    by 0xE80CF74: MPIU_SHMW_Ghnd_alloc (mpiu_shm_wrappers.h:188)
==99610==    by 0xE80CF74: MPIU_SHMW_Seg_create_attach_templ (mpiu_shm_wrappers.h:622)
==99610==    by 0xE80CF74: MPIU_SHMW_Seg_create_and_attach (mpiu_shm_wrappers.h:894)
==99610==    by 0xE80CF74: MPID_nem_allocate_shm_region (mpid_nem_lmt_shm.c:885)
==99610==    by 0xE80CF74: MPID_nem_lmt_shm_start_recv (mpid_nem_lmt_shm.c:180)
==99610==    by 0xE8094AF: do_cts (mpid_nem_lmt.c:560)
==99610==    by 0xE809EBE: pkt_RTS_handler (mpid_nem_lmt.c:276)
==99610==    by 0xE7FB3C5: MPID_nem_handle_pkt (ch3_progress.c:760)
==99610==    by 0xE7FBF8D: MPIDI_CH3I_Progress (ch3_progress.c:570)
==99610==    by 0xE72754E: MPIR_Waitall_impl (waitall.c:217)
==99610==    by 0xE727BB5: PMPI_Waitall (waitall.c:378)
==99610==    by 0xDFE4DFD: mpi_waitall (waitallf.c:275)
==99610==

There are basically 8 sets of messages like this, where the number of 
bytes changes slightly (e.g. an invalid read of 8 bytes instead of 2). 
There are no other valgrind errors or runtime check errors.
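
(For anyone wanting to try something similar, a valgrind run of the 
sketch above would look roughly like this; the file and executable names 
refer to that hypothetical sketch, and --track-origins is optional but 
often helpful.)

  mpif90 -g -O0 -o halo_sketch halo_sketch.f90
  mpirun -np 9 valgrind --track-origins=yes ./halo_sketch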

If I'm able to free up enough time over the next couple days, I hope to 
write a simple program that demonstrates the problem. Until then, 
thoughts and opinions are welcome!

Thanks,
-Brendan
On 6/19/2018 10:21 PM, discuss-request at mpich.org wrote:
> Date: Tue, 19 Jun 2018 20:52:52 -0400
> From: "Kochunas, Brendan" <bkochuna at umich.edu>
> To: discuss at mpich.org
> Subject: [mpich-discuss] Help with debugging
> Message-ID: <1d6a3e4e-0145-400c-77f8-d98b8121877d at umich.edu>
> Content-Type: text/plain; charset=utf-8; format=flowed
>
> We've been dealing with a particularly nasty bug in our code and are
> having trouble debugging it.
>
> We are using RHEL7 with the default GNU compiler collection (4.8.5) and the
> yum package for mpich (mpich-3.2-x86_64).
>
> The rest of this email describes briefly:
>
> * Our communication pattern
> * what we observe
> * what we've tested (and think we know)
>
> Description of communication pattern:
> This is point-to-point communication using MPI_Isend, MPI_Irecv, and
> MPI_Waitall. It's effectively a structured Cartesian grid, and we send
> multiple messages per face that are probably large (message sizes are a
> few KB).
>
> What we observe:
> The problem we run into is that when we get to the MPI_Waitall we
> receive something like:
>
> Fatal error in PMPI_Waitall: Other MPI error, error stack:
> PMPI_Waitall(323)...............: MPI_Waitall(count=1152,
> req_array=0x13493b90, status_array=0x1) failed
> MPIR_Waitall_impl(166)..........:
> MPIDI_CH3I_Progress(422)........:
> MPID_nem_handle_pkt(642)........:
> pkt_CTS_handler(321)............:
> MPID_nem_lmt_shm_start_send(270):
> MPID_nem_delete_shm_region(923).:
> MPIU_SHMW_Seg_detach(707).......: unable to remove shared memory -
> unlink No such file or directory
>
> Our test case is 9 processors (a 3x3 grid)
>
> We have tested several things that allow us to avoid the problem (but
> not reliably) if:
> * we send multiple messages (about 50x) that are smaller, instead of one
> large message.
> * Adding debug print statements near the MPI_Waitall
> * Removing compiler optimizations
> * Running under valgrind (e.g. mpirun -np <n> valgrind <exe>)
>
> * We have also observed this error on multiple machines (Cray XK7,
> workstations with Intel Xeon's, SGI ICE-X with MVAPICH)
>
> The output from valgrind is:
>
> ==210233== Invalid read of size 8
> ==210233==    at 0xB3B7060: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1022)
> ==210233==    by 0xE7F8FBB: MPID_nem_mpich_sendv_header (mpid_nem_inline.h:363)
> ==210233==    by 0xE7F8FBB: MPIDI_CH3I_Shm_send_progress (ch3_progress.c:207)
> ==210233==    by 0xE7FBD93: MPIDI_CH3I_Progress (ch3_progress.c:583)
> ==210233==    by 0xE7276B4: MPIR_Waitall_impl (waitall.c:164)
> ==210233==    by 0xE727BB5: PMPI_Waitall (waitall.c:378)
> ==210233==    by 0xDFE4DFD: mpi_waitall (waitallf.c:275)
> ==210233==    by 0x49FF27F: __updatebc_MOD_finish_updatebc_base (UpdateBC.f90:361)
> ==210233==  Address 0x142ec268 is 152 bytes inside a block of size 216 free'd
> ==210233==    at 0xB3B3CBD: free (vg_replace_malloc.c:530)
> ==210233==    by 0xE83FB41: MPL_trfree (in /home/bkochuna/sw/lib/libmpi.so.12.1.0)
> ==210233==    by 0xE7837D8: MPIU_trfree (trmem.c:37)
> ==210233==    by 0xE80DC08: MPIU_SHMW_Hnd_free (mpiu_shm_wrappers.h:247)
> ==210233==    by 0xE80DC08: MPIU_SHMW_Hnd_finalize (mpiu_shm_wrappers.h:443)
> ==210233==    by 0xE80DC08: MPID_nem_delete_shm_region (mpid_nem_lmt_shm.c:963)
> ==210233==    by 0xE80DC08: MPID_nem_lmt_shm_start_send (mpid_nem_lmt_shm.c:270)
> ==210233==    by 0xE8097E0: pkt_CTS_handler (mpid_nem_lmt.c:352)
> ==210233==    by 0xE7FB3C5: MPID_nem_handle_pkt (ch3_progress.c:760)
> ==210233==    by 0xE7FBF8D: MPIDI_CH3I_Progress (ch3_progress.c:570)
> ==210233==    by 0xE7276B4: MPIR_Waitall_impl (waitall.c:164)
> ==210233==    by 0xE727BB5: PMPI_Waitall (waitall.c:378)
> ==210233==    by 0xDFE4DFD: mpi_waitall (waitallf.c:275)
> ==210233==    by 0x49FF27F: __updatebc_MOD_finish_updatebc_base (UpdateBC.f90:361)
> ==210233==  Block was alloc'd at
> ==210233==    at 0xB3B2BC3: malloc (vg_replace_malloc.c:299)
> ==210233==    by 0xE83F38F: MPL_trmalloc (in /home/bkochuna/sw/lib/libmpi.so.12.1.0)
> ==210233==    by 0xE783526: MPIU_trmalloc (trmem.c:29)
> ==210233==    by 0xE80CF74: MPIU_SHMW_Ghnd_alloc (mpiu_shm_wrappers.h:188)
> ==210233==    by 0xE80CF74: MPIU_SHMW_Seg_create_attach_templ (mpiu_shm_wrappers.h:622)
> ==210233==    by 0xE80CF74: MPIU_SHMW_Seg_create_and_attach (mpiu_shm_wrappers.h:894)
> ==210233==    by 0xE80CF74: MPID_nem_allocate_shm_region (mpid_nem_lmt_shm.c:885)
> ==210233==    by 0xE80CF74: MPID_nem_lmt_shm_start_recv (mpid_nem_lmt_shm.c:180)
> ==210233==    by 0xE8094AF: do_cts (mpid_nem_lmt.c:560)
> ==210233==    by 0xE809EBE: pkt_RTS_handler (mpid_nem_lmt.c:276)
> ==210233==    by 0xE7FB3C5: MPID_nem_handle_pkt (ch3_progress.c:760)
> ==210233==    by 0xE7FBF8D: MPIDI_CH3I_Progress (ch3_progress.c:570)
> ==210233==    by 0xE7276B4: MPIR_Waitall_impl (waitall.c:164)
> ==210233==    by 0xE727BB5: PMPI_Waitall (waitall.c:378)
> ==210233==    by 0xDFE4DFD: mpi_waitall (waitallf.c:275)
> ==210233==    by 0x49FF27F: __updatebc_MOD_finish_updatebc_base (UpdateBC.f90:361)
>
> So my main questions to the list are:
> * are there any known issues with mpich-3.2 and the shared memory
> communication within nemesis that might affect non-blocking communication?
> * Does this look more like an issue in our code or within MPICH?
> * If it looks like an issue in our code what is the best way to debug
> this... seems like our current efforts of basic print statements change
> the behavior and prevent us from identifying it. I suspect there may be
> some memory overwrite beyond some array bounds, but compiling in debug
> with -fbounds-check does not expose the problem (and due to the nature
> of the MPI interfaces, I would not expect this to help). We've stared at
> the code for the calls to MPI_Isend and MPI_Irecv and everything looks
> correct (e.g. we don't touch the buffers, the sizes match up, etc.)
> * Are there limits to the number of simultaneous messages? (we are below
> the maximum number of MPI requests) or message sizes (I don't believe
> there is a limit here beyond integer overflow)
> * We've tried changing from MPI_Waitall to individual MPI_Waits and
> MPI_Tests, but the problem still happens with the individual waits.
>
> Any insights are welcome.
>
> Thanks,
> -Brendan
>
>
>
> ------------------------------
>
> Message: 4
> Date: Tue, 19 Jun 2018 19:20:35 -0700
> From: Jeff Hammond <jeff.science at gmail.com>
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] Help with debugging
> Message-ID: <2226E1B6-75F7-43E2-A6C3-101860313B18 at gmail.com>
> Content-Type: text/plain;	charset=utf-8
>
> You can rule out MPICH shm bugs if the code runs fine with Open-MPI or by disabling shm in MPICH, e.g. by using ch3:sock.
>
> The fact that small messages work suggests rendezvous deadlock is the issue, which is often but not always an application error.
>
> It's impossible to reason about the precise nature of the bug without source code.
>
> Jeff
>
> Sent from my iPhone
>

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

