[mpich-discuss] Reading buffers during MPI call in multithreaded application

Jeff Hammond jeff.science at gmail.com
Tue Aug 16 09:50:57 CDT 2016


On Tue, Aug 16, 2016 at 7:18 AM, Mark Davis <markdavisinboston at gmail.com>
wrote:

> Hello, I'm hitting a data race and an error at destruction time when using
> MPI_Bcast in an MPI application that also uses multiple threads per
> process. I'll go into detail below, but the fundamental question I
> have is: what can I assume, if anything, about the state of an MPI
> buffer during an MPI call? (I do realize that the buffers are not
> declared const in MPI_Bcast, whereas in other situations, such as
> MPI_Send, they are.)
>

Nothing, unless the buffer pointer has the const attribute on it.

At the non-root processes, MPI_Bcast behaves like an MPI_Recv call, so you
cannot touch the buffer until the function has returned.  Multiple threads
also cannot pass the same buffer to multiple concurrent MPI_Bcast calls.
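
For reference, the MPI-3 C bindings make that distinction explicit in the
prototypes: the send buffer is const-qualified, the broadcast buffer is not,
because non-root ranks receive into it.

    /* MPI-3 C prototypes, quoted for reference */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm);
    int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
                  int root, MPI_Comm comm);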

Jeff


>
> Background: running on Linux 3.13.0-52 / Ubuntu x86_64. gcc 5.3. MPICH
> 3.2 release version with thread support enabled. I am calling
> MPI_Init_thread with MPI_THREAD_MULTIPLE. I'm also using ThreadSanitizer on my
> application (although not on MPICH itself). C++11 standard.
>
> Problem description: I'm running into a SIGABRT at the end of the
> program when MPICH seems to be running its destructors. My program
> is running NPROC MPI processes, each with T C++11 threads (so, NPROC *
> T threads total). This simplified application (which I've created to
> exhibit this problem) simply has one root looping through H times,
> doing an MPI_Bcast to the other processes each time. Only one thread per
> process participates in the MPI_Bcast. Thread sanitizer is also
> detecting a data race. In fact, thread sanitizer is detecting that
> MPI_Bcast is *writing* to the root's buffer, even though in a
> broadcast, at least semantically, the root should only ever be sending
> data, not receiving data. Taking a quick look at MPI_Bcast (which
> ultimately calls into MPIDI_CH3U_Receive_data_found, which invokes the
> memcpy), it does seem that depending on the message size (in my case,
> it's just 10 MPI_INTs per message), either a scatter followed by an
> allgather or a binomial tree algorithm can be used. I haven't dug in
> to see which one is being used in my case, but this indicates that
> there's at least a possibility that the root's buffer can be received
> into during an MPI_Bcast. My program is producing the "right answer",
> but that could just be luck.
>
>
> Here's the basic structure of the program:
>
>     if root process
>        if root thread
>           A: signal via condition variable to the (T-1) non-root threads
>              that they may now read the buffer (via shared memory)
>           B: MPI_Bcast to other procs
>        else if not root thread
>           wait on condition variable until it's ok to read buffer
>           read buffer
>     else if not root process
>         // AFAIK there are no problems with non-root processes
>         MPI_Bcast as a non-root
>
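
A minimal, compilable C++11/MPI sketch of the structure above (not the
poster's actual code; the names buf, mtx, cv, ready and the constant T are
invented). It issues the MPI_Bcast before the condition-variable signal,
i.e. the reversed B-then-A ordering that the rest of this message reports
as race-free, and shows a single broadcast for brevity:

    #include <mpi.h>
    #include <condition_variable>
    #include <mutex>
    #include <thread>
    #include <vector>

    static int buf[10];                 // buffer being broadcast
    static std::mutex mtx;
    static std::condition_variable cv;
    static bool ready = false;          // true once buf is safe to read

    // Non-root threads on the root process: wait, then read the buffer.
    static void reader_thread() {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [] { return ready; });
        int sum = 0;                    // stand-in for real work on buf
        for (int i = 0; i < 10; ++i) sum += buf[i];
        (void)sum;
    }

    int main(int argc, char **argv) {
        int provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        const int T = 4;                // threads per process (illustrative)

        if (rank == 0) {                // root process
            for (int i = 0; i < 10; ++i) buf[i] = i;
            std::vector<std::thread> readers;
            for (int t = 0; t < T - 1; ++t) readers.emplace_back(reader_thread);
            // B: broadcast first, so MPI has finished touching buf ...
            MPI_Bcast(buf, 10, MPI_INT, 0, MPI_COMM_WORLD);
            {                           // A: ... only then let the readers at it
                std::lock_guard<std::mutex> lk(mtx);
                ready = true;
            }
            cv.notify_all();
            for (auto &th : readers) th.join();
        } else {                        // non-root processes: plain receivers
            MPI_Bcast(buf, 10, MPI_INT, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }
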
>
>
> Thread sanitizer is detecting that on the root process, MPI_Bcast is
> writing to the buffer that is being broadcast. Simultaneously,
> non-root threads in the root process are reading the buffer being
> broadcast *while the MPI_Bcast is happening*. When I change the order
> of statements A and B above, the data race goes away. So, it seems
> that my assumption (that on the root node, the buffer being broadcast
> is only read, never written) is incorrect: there is at least the
> possibility that the MPI call will write the buffer. (FYI, the
> "writing" that's happening in MPI_Bcast is coming from
> MPIDI_CH3U_Receive_data_found,
> src/mpid/ch3/src/ch3u_handle_recv_pkt.c:152.)
>
> Also, at the end of the program, I'm hitting a SIGABRT during the
> destruction of something in libmpi (__do_global_dtors_aux). Full
> backtrace below. This issue also goes away when I reverse the order of
> statements A and B. I imagine I'm corrupting some state in MPICH, but
> I'm not sure.
>
> I should point out that I'd prefer to have the ordering of A and B as
> above, so some threads can make progress while the MPI_Bcast is
> happening.
>
> So, my questions are:
>
> 1. What assumptions can be made about buffers in general during MPI
> operations? Does the standard specify anything about this? Is reading
> non-const buffers during MPI operations always forbidden, or can I do
> it in some situations?
>
> 2. Is there any way for me to achieve what I'm trying to do (above,
> where the reading of the buffer is happening simultaneously with an
> MPI operation that shouldn't need to write the user buffer but does)?
> This would be very helpful to know.
>
> Thank you.
>
>
>
> Program received signal SIGABRT, Aborted.
> [Switching to Thread 0x7ff937bf8700 (LWP 32001)]
> 0x00007ff93fd17c37 in __GI_raise (sig=sig at entry=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>
> (gdb) bt full
> #0  0x00007ff93fd17c37 in __GI_raise (sig=sig at entry=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>         resultvar = 0
>         pid = 31987
>         selftid = 32001
> #1  0x00007ff93fd1b028 in __GI_abort () at abort.c:89
>         save_stage = 2
>         act = {__sigaction_handler = {sa_handler = 0xbea6495,
> sa_sigaction = 0xbea6495}, sa_mask = {__val = {140708540140376,
>               140708494113584, 134272, 0, 140708494597155, 134240,
> 140708497979320, 131072, 140708494598025, 140708358448000,
>               140708521085690, 140708539470768, 0, 140708540266640,
> 140708551504800, 1}}, sa_flags = 934869088,
>           sa_restorer = 0x1}
>         sigs = {__val = {32, 0 <repeats 15 times>}}
> #2  0x00007ff93fd62dfa in malloc_printerr (ptr=<optimized out>,
> str=<optimized out>, action=<optimized out>) at malloc.c:5000
> No locals.
> #3  free_check (mem=<optimized out>, caller=<optimized out>) at hooks.c:298
>         p = <optimized out>
> #4  0x00007ff93fd1d53a in __cxa_finalize (d=0x7ff941207708) at
> cxa_finalize.c:56
>         check = 97
>         cxafn = <optimized out>
>         cxaarg = <optimized out>
>         f = 0x7ff9400a1230 <initial+944>
>         funcs = 0x7ff9400a0e80 <initial>
> #5  0x00007ff940ff7123 in __do_global_dtors_aux () from
> mpich/debug/lib/libmpicxx.so.12
> No symbol table info available.
> #6  0x00007ff937b8e410 in ?? ()
> No symbol table info available.
> #7  0x00007ff9426e270a in _dl_fini () at dl-fini.c:252
>         array = 0x7ff941205238
>         i = 0
>         nmaps = 32761
>         nloaded = <optimized out>
>         i = 4
>         l = 0x7ff9428d7a00
>         ns = 140708494300474
>         maps = 0x7ff937b8e330
>         maps_size = 140708497985152
>         do_audit = 1116568064
>         __PRETTY_FUNCTION__ = "_dl_fini"
>



-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

