[mpich-discuss] MPICH-3.2: SIGSEGV in MPID_Request_create () at src/mpid/ch3/src/ch3u_request.c:101

Mark Davis markdavisinboston at gmail.com
Thu Aug 11 08:48:49 CDT 2016


Hello, I'm running into a segfault when I run some relatively simple
MPI programs. In this particular case, I'm running a small program in
a loop that does MPI_Bcast, once per loop, within MPI_COMM_WORLD. The
buffer consists of just 7 doubles. I'm running with 6 procs on a
machine with 8 cores on OSX (Darwin - 15.6.0 Darwin Kernel Version
15.6.0: Thu Jun 23 18:25:34 PDT 2016;
root:xnu-3248.60.10~1/RELEASE_X86_64 x86_64). When I run the same
program with a smaller number of procs, the error usually doesn't show
up. My compiler (both for compiling the MPICH source as well as my
application) is clang 3.8.1.

When I run the same program on Linux, also with MPICH-3.2 (I believe
the exact same source), compiled with gcc 5.3, I do not get this
error. So far I have only seen it on the OSX machine with clang.

gdb shows the following stack trace. I have a feeling that this has
something to do with my toolchain and/or libraries on my system given
that I never get this error on my other system (linux). However, it's
possible that there's an application bug as well.

I'm running the MPICH-3.2 stable release; I haven't tried anything
from the repository yet.

Does anyone have any ideas about what's going on here? I'm happy to
provide more details.

Thank you,
Mark


Program received signal SIGSEGV, Segmentation fault.
MPID_Request_create () at src/mpid/ch3/src/ch3u_request.c:101
101             req->dev.ext_hdr_ptr       = NULL;
(gdb) bt full
#0  MPID_Request_create () at src/mpid/ch3/src/ch3u_request.c:101
No locals.
#1  0x00000001003ac4c9 in MPIDI_CH3U_Recvq_FDP_or_AEU
(match=<optimized out>, foundp=0x7fff5fbfe2bc) at
src/mpid/ch3/src/ch3u_recvq.c:830
        proc_failure_bit_masked = <error reading variable
proc_failure_bit_masked (Cannot access memory at address 0x1)>
        error_bit_masked = <error reading variable error_bit_masked
(Cannot access memory at address 0x1)>
        prev_rreq = <optimized out>
        channel_matched = <optimized out>
        rreq = <optimized out>
#2  0x00000001003d1ffe in MPIDI_CH3_PktHandler_EagerSend
(vc=<optimized out>, pkt=0x1004b3fd8 <MPIU_DBG_MaxLevel>,
buflen=0x7fff5fbfe440, rreqp=0x7fff5fbfe438) at
src/mpid/ch3/src/ch3u_eager.c:629
        mpi_errno = <error reading variable mpi_errno (Cannot access
memory at address 0x0)>
        found = <error reading variable found (Cannot access memory at
address 0xefefefefefefefef)>
        rreq = <optimized out>
        data_len = <optimized out>
        complete = <optimized out>
#3  0x00000001003f6045 in MPID_nem_handle_pkt (vc=<optimized out>,
buf=0x102ad07e0 "", buflen=<optimized out>) at
src/mpid/ch3/channels/nemesis/src/ch3_progress.c:760
        len = 140734799800192
        mpi_errno = <optimized out>
        complete = <error reading variable complete (Cannot access
memory at address 0x1)>
        rreq = <optimized out>
#4  0x00000001003f4e41 in MPIDI_CH3I_Progress
(progress_state=0x7fff5fbfe750, is_blocking=1) at
src/mpid/ch3/channels/nemesis/src/ch3_progress.c:570
        payload_len = 4299898840
        cell_buf = <optimized out>
        rreq = <optimized out>
        vc = 0x102ad07e8
        made_progress = <error reading variable made_progress (Cannot
access memory at address 0x0)>
        mpi_errno = <optimized out>
#5  0x000000010035386d in MPIC_Wait (request_ptr=<optimized out>,
errflag=<optimized out>) at src/mpi/coll/helper_fns.c:225
        progress_state = {ch = {completion_count = -1409286143}}
        mpi_errno = <error reading variable mpi_errno (Cannot access
memory at address 0x0)>
#6  0x0000000100353b10 in MPIC_Send (buf=0x100917c30,
count=4299945096, datatype=-1581855963, dest=<optimized out>,
tag=4975608, comm_ptr=0x1004b3fd8 <MPIU_DBG_MaxLevel>,
errflag=<optimized out>) at src/mpi/coll/helper_fns.c:302
        mpi_errno = <optimized out>
        request_ptr = 0x1004bf7e0 <MPID_Request_direct+1760>
#7  0x0000000100246031 in MPIR_Bcast_binomial (buffer=<optimized out>,
count=<optimized out>, datatype=<optimized out>, root=<optimized out>,
comm_ptr=<optimized out>, errflag=<optimized out>) at
src/mpi/coll/bcast.c:280
        nbytes = <optimized out>
        mpi_errno_ret = <optimized out>
        mpi_errno = 0
        comm_size = <optimized out>
        rank = 2
        type_size = <optimized out>
        tmp_buf = 0x0
        position = <optimized out>
        relative_rank = <optimized out>
        mask = <optimized out>
        src = <optimized out>
        status = <optimized out>
        recvd_size = <optimized out>
        dst = <optimized out>
#8  0x00000001002455a3 in MPIR_SMP_Bcast (buffer=<optimized out>,
count=<optimized out>, datatype=<optimized out>, root=<optimized out>,
comm_ptr=<optimized out>, errflag=<optimized out>) at
src/mpi/coll/bcast.c:1087
        mpi_errno_ = <error reading variable mpi_errno_ (Cannot access
memory at address 0x0)>
        mpi_errno = <optimized out>
        mpi_errno_ret = <optimized out>
        nbytes = <optimized out>
        type_size = <optimized out>
        status = <optimized out>
        recvd_size = <optimized out>
#9  MPIR_Bcast_intra (buffer=0x100917c30, count=<optimized out>,
datatype=<optimized out>, root=1, comm_ptr=<optimized out>,
errflag=<optimized out>) at src/mpi/coll/bcast.c:1245
        nbytes = <optimized out>
        mpi_errno_ret = <error reading variable mpi_errno_ret (Cannot
access memory at address 0x0)>
        mpi_errno = <optimized out>
        type_size = <optimized out>
        comm_size = <optimized out>
#10 0x000000010024751e in MPIR_Bcast (buffer=<optimized out>,
count=<optimized out>, datatype=<optimized out>, root=<optimized out>,
comm_ptr=0x0, errflag=<optimized out>) at src/mpi/coll/bcast.c:1475
        mpi_errno = <optimized out>
#11 MPIR_Bcast_impl (buffer=0x1004bf7e0 <MPID_Request_direct+1760>,
count=-269488145, datatype=-16, root=0, comm_ptr=0x0,
errflag=0x1004bf100 <MPID_Request_direct>) at
src/mpi/coll/bcast.c:1451
        mpi_errno = <optimized out>
#12 0x00000001000f3c24 in MPI_Bcast (buffer=<optimized out>, count=7,
datatype=1275069445, root=1, comm=<optimized out>) at
src/mpi/coll/bcast.c:1585
        errflag = 2885681152
        mpi_errno = <optimized out>
        comm_ptr = <optimized out>
#13 0x0000000100001df7 in run_test<int> (my_rank=2,
num_ranks=<optimized out>, count=<optimized out>, root_rank=1,
datatype=@0x7fff5fbfeaec: 1275069445, iterations=<optimized out>) at
bcast_test.cpp:83
No locals.
#14 0x00000001000019cd in main (argc=<optimized out>, argv=<optimized
out>) at bcast_test.cpp:137
        root_rank = <optimized out>
        count = <optimized out>
        iterations = <optimized out>
        my_rank = 4978656
        num_errors = <optimized out>
        runtime_ns = <optimized out>
        stats = {<std::__1::__basic_string_common<true>> = {<No data
fields>}, __r_ =
{<std::__1::__libcpp_compressed_pair_imp<std::__1::basic_string<char,
std::__1::char_traits<char>, std::__1::allocator<char> >::__rep,
std::__1::allocator<char>, 2>> = {<std::__1::allocator<char>> = {<No
data fields>}, __first_ = {{__l = {__cap_ = 17289301308300324847,
__size_ = 17289301308300324847, __data_ = 0xefefefefefefefef <error:
Cannot access memory at address 0xefefefefefefefef>}


