[mpich-discuss] MPICH-3.2: SIGSEGV in MPID_Request_create () at src/mpid/ch3/src/ch3u_request.c:101

Kenneth Raffenetti raffenet at mcs.anl.gov
Thu Aug 11 16:32:14 CDT 2016


Or a snapshot tarball: 
http://www.mpich.org/static/downloads/nightly/master/mpich/

On 08/11/2016 04:21 PM, Halim Amer wrote:
> This is most likely related to the alignment problem reported before
> (http://lists.mpich.org/pipermail/discuss/2016-May/004764.html).
>
> We plan to include a fix in the 3.2.x bug fix release series. Meanwhile,
> please try the repo version (git.mpich.org/mpich.git), which should not
> suffer from this problem.
>
> --Halim
> www.mcs.anl.gov/~aamer
>
> On 8/11/16 8:48 AM, Mark Davis wrote:
>> Hello, I'm running into a segfault when I run some relatively simple
>> MPI programs. In this particular case, a small program calls MPI_Bcast
>> once per loop iteration on MPI_COMM_WORLD. The buffer consists of just
>> 7 doubles. I'm running with 6 procs on a
>> machine with 8 cores on OSX (Darwin - 15.6.0 Darwin Kernel Version
>> 15.6.0: Thu Jun 23 18:25:34 PDT 2016;
>> root:xnu-3248.60.10~1/RELEASE_X86_64 x86_64). When I run the same
>> program with a smaller number of procs, the error usually doesn't show
>> up. My compiler (both for compiling the MPICH source as well as my
>> application) is clang 3.8.1.
>>
>> When I run the same program on Linux, also with MPICH-3.2 (I believe
>> the exact same source), compiled with gcc 5.3, I do not get this
>> error. This seems to be something I get only with
>>
>> gdb shows the following stack trace. I have a feeling that this has
>> something to do with my toolchain and/or libraries on my system given
>> that I never get this error on my other system (Linux). However, it's
>> possible that there's an application bug as well.
>>
>> I'm running the MPICH-3.2 stable release; I haven't tried anything
>> from the repository yet.
>>
>> Does anyone have any ideas about what's going on here? I'm happy to
>> provide more details.
>>
>> Thank you,
>> Mark
>>
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> MPID_Request_create () at src/mpid/ch3/src/ch3u_request.c:101
>> 101             req->dev.ext_hdr_ptr       = NULL;
>> (gdb) bt full
>> #0  MPID_Request_create () at src/mpid/ch3/src/ch3u_request.c:101
>> No locals.
>> #1  0x00000001003ac4c9 in MPIDI_CH3U_Recvq_FDP_or_AEU
>> (match=<optimized out>, foundp=0x7fff5fbfe2bc) at
>> src/mpid/ch3/src/ch3u_recvq.c:830
>>         proc_failure_bit_masked = <error reading variable
>> proc_failure_bit_masked (Cannot access memory at address 0x1)>
>>         error_bit_masked = <error reading variable error_bit_masked
>> (Cannot access memory at address 0x1)>
>>         prev_rreq = <optimized out>
>>         channel_matched = <optimized out>
>>         rreq = <optimized out>
>> #2  0x00000001003d1ffe in MPIDI_CH3_PktHandler_EagerSend
>> (vc=<optimized out>, pkt=0x1004b3fd8 <MPIU_DBG_MaxLevel>,
>> buflen=0x7fff5fbfe440, rreqp=0x7fff5fbfe438) at
>> src/mpid/ch3/src/ch3u_eager.c:629
>>         mpi_errno = <error reading variable mpi_errno (Cannot access
>> memory at address 0x0)>
>>         found = <error reading variable found (Cannot access memory at
>> address 0xefefefefefefefef)>
>>         rreq = <optimized out>
>>         data_len = <optimized out>
>>         complete = <optimized out>
>> #3  0x00000001003f6045 in MPID_nem_handle_pkt (vc=<optimized out>,
>> buf=0x102ad07e0 "", buflen=<optimized out>) at
>> src/mpid/ch3/channels/nemesis/src/ch3_progress.c:760
>>         len = 140734799800192
>>         mpi_errno = <optimized out>
>>         complete = <error reading variable complete (Cannot access
>> memory at address 0x1)>
>>         rreq = <optimized out>
>> #4  0x00000001003f4e41 in MPIDI_CH3I_Progress
>> (progress_state=0x7fff5fbfe750, is_blocking=1) at
>> src/mpid/ch3/channels/nemesis/src/ch3_progress.c:570
>>         payload_len = 4299898840
>>         cell_buf = <optimized out>
>>         rreq = <optimized out>
>>         vc = 0x102ad07e8
>>         made_progress = <error reading variable made_progress (Cannot
>> access memory at address 0x0)>
>>         mpi_errno = <optimized out>
>> #5  0x000000010035386d in MPIC_Wait (request_ptr=<optimized out>,
>> errflag=<optimized out>) at src/mpi/coll/helper_fns.c:225
>>         progress_state = {ch = {completion_count = -1409286143}}
>>         mpi_errno = <error reading variable mpi_errno (Cannot access
>> memory at address 0x0)>
>> #6  0x0000000100353b10 in MPIC_Send (buf=0x100917c30,
>> count=4299945096, datatype=-1581855963, dest=<optimized out>,
>> tag=4975608, comm_ptr=0x1004b3fd8 <MPIU_DBG_MaxLevel>,
>> errflag=<optimized out>) at src/mpi/coll/helper_fns.c:302
>>         mpi_errno = <optimized out>
>>         request_ptr = 0x1004bf7e0 <MPID_Request_direct+1760>
>> #7  0x0000000100246031 in MPIR_Bcast_binomial (buffer=<optimized out>,
>> count=<optimized out>, datatype=<optimized out>, root=<optimized out>,
>> comm_ptr=<optimized out>, errflag=<optimized out>) at
>> src/mpi/coll/bcast.c:280
>>         nbytes = <optimized out>
>>         mpi_errno_ret = <optimized out>
>>         mpi_errno = 0
>>         comm_size = <optimized out>
>>         rank = 2
>>         type_size = <optimized out>
>>         tmp_buf = 0x0
>>         position = <optimized out>
>>         relative_rank = <optimized out>
>>         mask = <optimized out>
>>         src = <optimized out>
>>         status = <optimized out>
>>         recvd_size = <optimized out>
>>         dst = <optimized out>
>> #8  0x00000001002455a3 in MPIR_SMP_Bcast (buffer=<optimized out>,
>> count=<optimized out>, datatype=<optimized out>, root=<optimized out>,
>> comm_ptr=<optimized out>, errflag=<optimized out>) at
>> src/mpi/coll/bcast.c:1087
>>         mpi_errno_ = <error reading variable mpi_errno_ (Cannot access
>> memory at address 0x0)>
>>         mpi_errno = <optimized out>
>>         mpi_errno_ret = <optimized out>
>>         nbytes = <optimized out>
>>         type_size = <optimized out>
>>         status = <optimized out>
>>         recvd_size = <optimized out>
>> #9  MPIR_Bcast_intra (buffer=0x100917c30, count=<optimized out>,
>> datatype=<optimized out>, root=1, comm_ptr=<optimized out>,
>> errflag=<optimized out>) at src/mpi/coll/bcast.c:1245
>>         nbytes = <optimized out>
>>         mpi_errno_ret = <error reading variable mpi_errno_ret (Cannot
>> access memory at address 0x0)>
>>         mpi_errno = <optimized out>
>>         type_size = <optimized out>
>>         comm_size = <optimized out>
>> #10 0x000000010024751e in MPIR_Bcast (buffer=<optimized out>,
>> count=<optimized out>, datatype=<optimized out>, root=<optimized out>,
>> comm_ptr=0x0, errflag=<optimized out>) at src/mpi/coll/bcast.c:1475
>>         mpi_errno = <optimized out>
>> #11 MPIR_Bcast_impl (buffer=0x1004bf7e0 <MPID_Request_direct+1760>,
>> count=-269488145, datatype=-16, root=0, comm_ptr=0x0,
>> errflag=0x1004bf100 <MPID_Request_direct>) at
>> src/mpi/coll/bcast.c:1451
>>         mpi_errno = <optimized out>
>> #12 0x00000001000f3c24 in MPI_Bcast (buffer=<optimized out>, count=7,
>> datatype=1275069445, root=1, comm=<optimized out>) at
>> src/mpi/coll/bcast.c:1585
>>         errflag = 2885681152
>>         mpi_errno = <optimized out>
>>         comm_ptr = <optimized out>
>> #13 0x0000000100001df7 in run_test<int> (my_rank=2,
>> num_ranks=<optimized out>, count=<optimized out>, root_rank=1,
>> datatype=@0x7fff5fbfeaec: 1275069445, iterations=<optimized out>) at
>> bcast_test.cpp:83
>> No locals.
>> #14 0x00000001000019cd in main (argc=<optimized out>, argv=<optimized
>> out>) at bcast_test.cpp:137
>>         root_rank = <optimized out>
>>         count = <optimized out>
>>         iterations = <optimized out>
>>         my_rank = 4978656
>>         num_errors = <optimized out>
>>         runtime_ns = <optimized out>
>>         stats = {<std::__1::__basic_string_common<true>> = {<No data
>> fields>}, __r_ =
>> {<std::__1::__libcpp_compressed_pair_imp<std::__1::basic_string<char,
>> std::__1::char_traits<char>, std::__1::allocator<char> >::__rep,
>> std::__1::allocator<char>, 2>> = {<std::__1::allocator<char>> = {<No
>> data fields>}, __first_ = {{__l = {__cap_ = 17289301308300324847,
>> __size_ = 17289301308300324847, __data_ = 0xefefefefefefefef <error:
>> Cannot access memory at address 0xefefefefefefefef>}
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
>>
