[mpich-discuss] Reading buffers during MPI call in multithreaded application

Mark Davis markdavisinboston at gmail.com
Tue Aug 16 15:12:18 CDT 2016


> At the non-root processes, MPI_Bcast behaves like an MPI_Recv call, and thus
> you cannot touch the buffer until the function has returned.

That part makes sense. I'm not allowing the buffer to be read or
otherwise used on the non-root threads, and it makes sense to me that
this acts as an MPI_Recv call.

What confuses me is the root process: its buffer also seems to be
written to during the course of the MPI_Bcast, even though there it
should act like an MPI_Send. It seems like this is just an
implementation detail and, as you pointed out, since the MPI_Bcast
buffer is not marked const, anything could happen.
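
For what it's worth, the prototype itself already allows this: the
buffer argument is a plain void *, not const void *, even when the
caller is the root. A minimal sketch of the root-side call (placeholder
names, using the 10-MPI_INT message size from my test program):

    #include <mpi.h>

    int main(int argc, char **argv) {
        int provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        // int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
        //               int root, MPI_Comm comm);
        // 'buffer' is not const-qualified, even on the root, so the
        // implementation is free to read *and* write it on every rank.
        int buf[10] = {0};
        MPI_Bcast(buf, 10, MPI_INT, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }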

>> Background: running on Linux 3.13.0-52 / Ubuntu x86_64, gcc 5.3, MPICH
>> 3.2 release version with thread support enabled. I am initializing MPI
>> with MPI_THREAD_MULTIPLE. I'm also using thread sanitizer on my
>> application (although not on MPICH itself). C++11 standard.
>>
>> Problem description: I'm running into a SIGABRT at the end of the
>> program when MPICH seems to be running its destructors. My program
>> runs NPROC MPI processes, each with T C++11 threads (so NPROC * T
>> threads total). This simplified application (which I've created to
>> exhibit this problem) simply has one root looping H times, doing an
>> MPI_Bcast to the other processes each time. Only one thread per
>> process participates in the MPI_Bcast. Thread sanitizer is also
>> detecting a data race. In fact, thread sanitizer is detecting that
>> MPI_Bcast is *writing* to the root's buffer, even though in a
>> broadcast, at least semantically, the root should only ever be sending
>> data, not receiving data. Taking a quick look at MPI_Bcast (which
>> ultimately calls MPIDI_CH3U_Receive_data_found, which invokes the
>> memcpy), it does seem that, depending on the message size (in my case
>> it's just 10 MPI_INTs per message), either a scatter/allgather scheme
>> or a binomial tree algorithm can be used. I haven't dug in to see which
>> one is being used in my case, but this indicates that there's at least
>> a possibility that the root's buffer can be received into during an
>> MPI_Bcast. My program is producing the "right answer", but that could
>> just be luck.
>>
>>
>> Here's the basic structure of the program:
>>
>>     if root process
>>        if root thread
>>           A: signal via condition variable to the (T-1) non-root threads
>>              that they can read the buffer (via shared memory)
>>           B: MPI_Bcast to other procs
>>        else if not root thread
>>           wait on condition variable until it's ok to read buffer
>>           read buffer
>>     else if not root process
>>         // AFAIK there are no problems with non-root processes
>>         MPI_Bcast as a non-root
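
To make that concrete, a minimal C++11 sketch of the structure above
(placeholder names and sizes, the H-iteration loop omitted, not my
actual program) looks roughly like this:

    #include <mpi.h>
    #include <condition_variable>
    #include <mutex>
    #include <thread>
    #include <vector>

    int main(int argc, char **argv) {
        int provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int T = 4;                 // threads per process (placeholder)
        std::vector<int> buf(10, rank);  // 10 MPI_INTs per message
        std::mutex m;
        std::condition_variable cv;
        bool ready = false;

        if (rank == 0) {                 // root process
            std::vector<std::thread> readers;
            for (int t = 1; t < T; ++t) {
                readers.emplace_back([&] {
                    std::unique_lock<std::mutex> lk(m);
                    cv.wait(lk, [&] { return ready; });
                    volatile int x = buf[0];   // read buffer (races with B below)
                    (void)x;
                });
            }
            {   // A: signal the (T-1) non-root threads that they can read
                std::lock_guard<std::mutex> lk(m);
                ready = true;
            }
            cv.notify_all();
            // B: broadcast to the other processes
            MPI_Bcast(buf.data(), 10, MPI_INT, 0, MPI_COMM_WORLD);
            for (auto &th : readers) th.join();
        } else {                         // non-root processes
            MPI_Bcast(buf.data(), 10, MPI_INT, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

In a sketch like this, thread sanitizer flags the race between the
readers' load of buf[0] and the write that MPI_Bcast performs
internally on the root's buffer.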
>>
>>
>>
>> Thread sanitizer is detecting that on the root process, MPI_Bcast is
>> writing to the buffer that is being broadcast. Simultaneously,
>> non-root threads in the root process are reading the buffer being
>> broadcast *while the MPI_Bcast is happening*. When I change the order
>> of statements A and B above, the data race goes away. So it seems that
>> my assumption (that on the root node the buffer being broadcast is only
>> ever read, never written) is incorrect: the MPI call can, at least in
>> some cases, write the root's buffer. (FYI, the "writing" that's
>> happening in MPI_Bcast is coming from MPIDI_CH3U_Receive_data_found,
>> src/mpid/ch3/src/ch3u_handle_recv_pkt.c:152.)
>>
>> Also, at the end of the program, I'm hitting a SIGABRT during the
>> destruction of something in libmpi (__do_global_dtors_aux). Full
>> backtrace below. This issue also goes away when I reverse the order of
>> statements A and B. I imagine I'm corrupting some state in MPICH, but
>> I'm not sure.
>>
>> I should point out that I'd prefer to have the ordering of A and B as
>> above, so some threads can make progress while the MPI_Bcast is
>> happening.
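
(For completeness: the ordering that makes both the race report and the
SIGABRT go away is simply B before A, i.e. the root thread finishes the
MPI_Bcast before releasing the reader threads. In terms of the sketch
above, the root-thread branch would become something like this; again,
the names are placeholders:)

    // B first: MPI_Bcast returns before any reader thread is released, so
    // nothing reads the buffer while the MPI call may still be writing it.
    MPI_Bcast(buf.data(), 10, MPI_INT, 0, MPI_COMM_WORLD);
    {
        std::lock_guard<std::mutex> lk(m);
        ready = true;              // A second: now let the (T-1) readers in
    }
    cv.notify_all();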
>>
>> So, my questions are:
>>
>> 1. What assumptions can be made about buffers in general during MPI
>> operations? Does the standard specify anything about this? Is reading
>> a non-const buffer during an MPI operation always off-limits, or is it
>> allowed in some situations?
>>
>> 2. Is there any way for me to achieve what I'm trying to do above,
>> where the buffer is read at the same time as an MPI operation that
>> shouldn't need to write the user buffer (but apparently does)? This
>> would be very helpful to know.
>>
>> Thank you.
>>
>>
>>
>> Program received signal SIGABRT, Aborted.
>> [Switching to Thread 0x7ff937bf8700 (LWP 32001)]
>> 0x00007ff93fd17c37 in __GI_raise (sig=sig at entry=6) at
>> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>>
>> (gdb) bt full
>> #0  0x00007ff93fd17c37 in __GI_raise (sig=sig at entry=6) at
>> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>>         resultvar = 0
>>         pid = 31987
>>         selftid = 32001
>> #1  0x00007ff93fd1b028 in __GI_abort () at abort.c:89
>>         save_stage = 2
>>         act = {__sigaction_handler = {sa_handler = 0xbea6495,
>> sa_sigaction = 0xbea6495}, sa_mask = {__val = {140708540140376,
>>               140708494113584, 134272, 0, 140708494597155, 134240,
>> 140708497979320, 131072, 140708494598025, 140708358448000,
>>               140708521085690, 140708539470768, 0, 140708540266640,
>> 140708551504800, 1}}, sa_flags = 934869088,
>>           sa_restorer = 0x1}
>>         sigs = {__val = {32, 0 <repeats 15 times>}}
>> #2  0x00007ff93fd62dfa in malloc_printerr (ptr=<optimized out>,
>> str=<optimized out>, action=<optimized out>) at malloc.c:5000
>> No locals.
>> #3  free_check (mem=<optimized out>, caller=<optimized out>) at
>> hooks.c:298
>>         p = <optimized out>
>> #4  0x00007ff93fd1d53a in __cxa_finalize (d=0x7ff941207708) at
>> cxa_finalize.c:56
>>         check = 97
>>         cxafn = <optimized out>
>>         cxaarg = <optimized out>
>>         f = 0x7ff9400a1230 <initial+944>
>>         funcs = 0x7ff9400a0e80 <initial>
>> #5  0x00007ff940ff7123 in __do_global_dtors_aux () from
>> mpich/debug/lib/libmpicxx.so.12
>> No symbol table info available.
>> #6  0x00007ff937b8e410 in ?? ()
>> No symbol table info available.
>> #7  0x00007ff9426e270a in _dl_fini () at dl-fini.c:252
>>         array = 0x7ff941205238
>>         i = 0
>>         nmaps = 32761
>>         nloaded = <optimized out>
>>         i = 4
>>         l = 0x7ff9428d7a00
>>         ns = 140708494300474
>>         maps = 0x7ff937b8e330
>>         maps_size = 140708497985152
>>         do_audit = 1116568064
>>         __PRETTY_FUNCTION__ = "_dl_fini"
>
>
>
>
> --
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/
>
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

