[mpich-discuss] Reading buffers during MPI call in multithreaded application

Mark Davis markdavisinboston at gmail.com
Tue Aug 16 09:18:34 CDT 2016


Hello, I'm hitting a data race and an error during destruction when
using MPI_Bcast in an MPI application that also uses multiple threads
per process. I'll go into detail below, but the fundamental question I
have is: what can I assume, if anything, about the state of a user
buffer during an MPI call? (I do realize that the buffer is not
declared const in MPI_Bcast, whereas in other calls, such as MPI_Send,
it is.)

Background: running on Linux 3.13.0-52 / Ubuntu x86_64, gcc 5.3, and
the MPICH 3.2 release built with thread support enabled. I'm
initializing MPI with MPI_Init_thread and requesting
MPI_THREAD_MULTIPLE. I'm also running ThreadSanitizer on my
application (although not on MPICH itself). The code is C++11.
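
For reference, the initialization looks roughly like this (a sketch
with illustrative error handling; the check on "provided" is how I
confirm MPI_THREAD_MULTIPLE was actually granted):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        // Ask for full multithreading support so that any thread may
        // make MPI calls.
        int provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            std::fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        // ... spawn T std::threads per process, do the broadcasts, join ...
        MPI_Finalize();
        return 0;
    }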

Problem description: I'm running into a SIGABRT at the end of the
program when MPICH seems to be running its destructors. My program
runs NPROC MPI processes, each with T C++11 threads (so NPROC * T
threads total). The simplified application I've created to exhibit
this problem simply has one root looping H times, doing an MPI_Bcast
to the other processes each time. Only one thread per process
participates in the MPI_Bcast. ThreadSanitizer is also detecting a
data race. In fact, ThreadSanitizer reports that MPI_Bcast is
*writing* to the root's buffer, even though in a broadcast, at least
semantically, the root should only ever be sending data, not receiving
it. Taking a quick look at MPI_Bcast (which ultimately reaches
MPIDI_CH3U_Receive_data_found, which invokes the memcpy), it seems
that, depending on the message size (in my case just 10 MPI_INTs per
message), either a scatter-plus-allgather algorithm or a binomial tree
algorithm can be used. I haven't dug in to see which one is used in my
case, but this suggests there's at least a possibility that the root's
buffer can be received into during an MPI_Bcast. My program is
producing the "right answer", but that could just be luck.


Here's the basic structure of the program:

    if root process
       if root thread
          A: signal, using a condition variable, to the (T-1) non-root
             threads that they can read the buffer (via shared memory)
          B: MPI_Bcast to the other procs
       else if not root thread
          wait on condition variable until it's ok to read the buffer
          read the buffer
    else if not root process
        // AFAIK there are no problems with non-root processes
        MPI_Bcast as a non-root
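
For concreteness, here's roughly what the root process does in C++ (a
sketch of one iteration of the H-iteration loop, with illustrative
names and no error handling; the real program is more involved):

    #include <mpi.h>
    #include <condition_variable>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Sketch of the root process only (one iteration); non-root
    // processes just call MPI_Bcast with the same count/type/root.
    void root_process_iteration(int root_rank, int T) {
        std::vector<int> buf(10, 42);   // 10 MPI_INTs per message
        std::mutex m;
        std::condition_variable cv;
        bool ready = false;

        // The (T-1) non-root threads: wait until the root thread says
        // the buffer may be read, then read it (shared memory).
        std::vector<std::thread> readers;
        for (int i = 0; i < T - 1; ++i) {
            readers.emplace_back([&] {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return ready; });
                lk.unlock();
                volatile int first = buf[0];   // read the buffer
                (void)first;
            });
        }

        // The root thread:
        {
            std::lock_guard<std::mutex> lk(m);
            ready = true;               // A: signal the reader threads
        }
        cv.notify_all();
        // B: broadcast to the other processes.  TSan reports a write
        // to buf inside this call, racing with the reads above.
        MPI_Bcast(buf.data(), 10, MPI_INT, root_rank, MPI_COMM_WORLD);

        for (auto &t : readers) t.join();
    }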



ThreadSanitizer is detecting that, on the root process, MPI_Bcast is
writing to the buffer being broadcast. Simultaneously, non-root
threads in the root process are reading that buffer *while the
MPI_Bcast is happening*. When I change the order of statements A and B
above, the data race goes away. So it seems that my assumption (that
on the root node the buffer being broadcast is only read by the MPI
call, never written) is incorrect; the call can, or at least may,
write to the root's buffer. (FYI, the "writing" that's happening in
MPI_Bcast comes from MPIDI_CH3U_Receive_data_found at
src/mpid/ch3/src/ch3u_handle_recv_pkt.c:152.)

Also, at the end of the program, I'm hitting a SIGABRT during the
destruction of something in libmpi (__do_global_dtors_aux); full
backtrace below. This issue also goes away when I reverse the order of
statements A and B. I imagine I'm corrupting some state inside MPICH,
but I'm not sure.
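
(For concreteness, the reordering that makes both the TSan report and
the SIGABRT go away is simply doing the broadcast before signaling the
readers; names as in the sketch above:)

    // B first: finish the broadcast before letting the other threads
    // read buf.
    MPI_Bcast(buf.data(), 10, MPI_INT, root_rank, MPI_COMM_WORLD);
    {
        std::lock_guard<std::mutex> lk(m);
        ready = true;          // A: now signal the (T-1) reader threads
    }
    cv.notify_all();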

I should point out that I'd prefer to have the ordering of A and B as
above, so some threads can make progress while the MPI_Bcast is
happening.

So, my questions are:

1. What assumptions can be made about user buffers in general during
MPI operations? Does the standard specify anything about this? Is
reading a non-const buffer during an MPI operation always forbidden,
or is it allowed in some situations?

2. Is there any way for me to achieve what I'm trying to do above,
i.e., read the buffer at the same time as an MPI operation that
semantically shouldn't need to write the user buffer (but apparently
does)? This would be very helpful to know.

Thank you.



Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7ff937bf8700 (LWP 32001)]
0x00007ff93fd17c37 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56

(gdb) bt full
#0  0x00007ff93fd17c37 in __GI_raise (sig=sig at entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56
        resultvar = 0
        pid = 31987
        selftid = 32001
#1  0x00007ff93fd1b028 in __GI_abort () at abort.c:89
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0xbea6495,
sa_sigaction = 0xbea6495}, sa_mask = {__val = {140708540140376,
              140708494113584, 134272, 0, 140708494597155, 134240,
140708497979320, 131072, 140708494598025, 140708358448000,
              140708521085690, 140708539470768, 0, 140708540266640,
140708551504800, 1}}, sa_flags = 934869088,
          sa_restorer = 0x1}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007ff93fd62dfa in malloc_printerr (ptr=<optimized out>,
str=<optimized out>, action=<optimized out>) at malloc.c:5000
No locals.
#3  free_check (mem=<optimized out>, caller=<optimized out>) at hooks.c:298
        p = <optimized out>
#4  0x00007ff93fd1d53a in __cxa_finalize (d=0x7ff941207708) at cxa_finalize.c:56
        check = 97
        cxafn = <optimized out>
        cxaarg = <optimized out>
        f = 0x7ff9400a1230 <initial+944>
        funcs = 0x7ff9400a0e80 <initial>
#5  0x00007ff940ff7123 in __do_global_dtors_aux () from
mpich/debug/lib/libmpicxx.so.12
No symbol table info available.
#6  0x00007ff937b8e410 in ?? ()
No symbol table info available.
#7  0x00007ff9426e270a in _dl_fini () at dl-fini.c:252
        array = 0x7ff941205238
        i = 0
        nmaps = 32761
        nloaded = <optimized out>
        i = 4
        l = 0x7ff9428d7a00
        ns = 140708494300474
        maps = 0x7ff937b8e330
        maps_size = 140708497985152
        do_audit = 1116568064
        __PRETTY_FUNCTION__ = "_dl_fini"