[mpich-discuss] MPICH-3.2: SIGSEGV in MPID_Request_create () at src/mpid/ch3/src/ch3u_request.c:101
Halim Amer
aamer at anl.gov
Thu Aug 11 16:21:26 CDT 2016
This is most likely related to the alignment problem reported earlier
(http://lists.mpich.org/pipermail/discuss/2016-May/004764.html).
We plan to include a fix in the 3.2.x bug-fix release series. In the
meantime, please try the repo version (git.mpich.org/mpich.git), which
should not have this problem.
--Halim
www.mcs.anl.gov/~aamer
On 8/11/16 8:48 AM, Mark Davis wrote:
> Hello, I'm running into a segfault when I run some relatively simple
> MPI programs. In this particular case, I'm running a small program in
> a loop that does MPI_Bcast, once per loop, within MPI_COMM_WORLD. The
> buffer consists of just 7 doubles. I'm running with 6 procs on a
> machine with 8 cores on OSX (Darwin - 15.6.0 Darwin Kernel Version
> 15.6.0: Thu Jun 23 18:25:34 PDT 2016;
> root:xnu-3248.60.10~1/RELEASE_X86_64 x86_64). When I run the same
> program with a smaller number of procs, the error usually doesn't show
> up. My compiler (for building both the MPICH source and my
> application) is clang 3.8.1.
>
> When I run the same program on linux, also with MPICH-3.2 (I believe
> the same exact source), compiled with gcc 5.3, I do not get this
> error. This seems to be something I get only on the OSX system.
>
> gdb shows the following stack trace. I have a feeling that this has
> something to do with my toolchain and/or libraries on my system given
> that I never get this error on my other system (linux). However, it's
> possible that there's an application bug as well.
>
> I'm running the MPICH-3.2 stable release; I haven't tried anything
> from the repository yet.
>
> Does anyone have any ideas about what's going on here? I'm happy to
> provide more details.
>
> Thank you,
> Mark
>
>
> Program received signal SIGSEGV, Segmentation fault.
> MPID_Request_create () at src/mpid/ch3/src/ch3u_request.c:101
> 101 req->dev.ext_hdr_ptr = NULL;
> (gdb) bt full
> #0 MPID_Request_create () at src/mpid/ch3/src/ch3u_request.c:101
> No locals.
> #1 0x00000001003ac4c9 in MPIDI_CH3U_Recvq_FDP_or_AEU
> (match=<optimized out>, foundp=0x7fff5fbfe2bc) at
> src/mpid/ch3/src/ch3u_recvq.c:830
> proc_failure_bit_masked = <error reading variable
> proc_failure_bit_masked (Cannot access memory at address 0x1)>
> error_bit_masked = <error reading variable error_bit_masked
> (Cannot access memory at address 0x1)>
> prev_rreq = <optimized out>
> channel_matched = <optimized out>
> rreq = <optimized out>
> #2 0x00000001003d1ffe in MPIDI_CH3_PktHandler_EagerSend
> (vc=<optimized out>, pkt=0x1004b3fd8 <MPIU_DBG_MaxLevel>,
> buflen=0x7fff5fbfe440, rreqp=0x7fff5fbfe438) at
> src/mpid/ch3/src/ch3u_eager.c:629
> mpi_errno = <error reading variable mpi_errno (Cannot access
> memory at address 0x0)>
> found = <error reading variable found (Cannot access memory at
> address 0xefefefefefefefef)>
> rreq = <optimized out>
> data_len = <optimized out>
> complete = <optimized out>
> #3 0x00000001003f6045 in MPID_nem_handle_pkt (vc=<optimized out>,
> buf=0x102ad07e0 "", buflen=<optimized out>) at
> src/mpid/ch3/channels/nemesis/src/ch3_progress.c:760
> len = 140734799800192
> mpi_errno = <optimized out>
> complete = <error reading variable complete (Cannot access
> memory at address 0x1)>
> rreq = <optimized out>
> #4 0x00000001003f4e41 in MPIDI_CH3I_Progress
> (progress_state=0x7fff5fbfe750, is_blocking=1) at
> src/mpid/ch3/channels/nemesis/src/ch3_progress.c:570
> payload_len = 4299898840
> cell_buf = <optimized out>
> rreq = <optimized out>
> vc = 0x102ad07e8
> made_progress = <error reading variable made_progress (Cannot
> access memory at address 0x0)>
> mpi_errno = <optimized out>
> #5 0x000000010035386d in MPIC_Wait (request_ptr=<optimized out>,
> errflag=<optimized out>) at src/mpi/coll/helper_fns.c:225
> progress_state = {ch = {completion_count = -1409286143}}
> mpi_errno = <error reading variable mpi_errno (Cannot access
> memory at address 0x0)>
> #6 0x0000000100353b10 in MPIC_Send (buf=0x100917c30,
> count=4299945096, datatype=-1581855963, dest=<optimized out>,
> tag=4975608, comm_ptr=0x1004b3fd8 <MPIU_DBG_MaxLevel>,
> errflag=<optimized out>) at src/mpi/coll/helper_fns.c:302
> mpi_errno = <optimized out>
> request_ptr = 0x1004bf7e0 <MPID_Request_direct+1760>
> #7 0x0000000100246031 in MPIR_Bcast_binomial (buffer=<optimized out>,
> count=<optimized out>, datatype=<optimized out>, root=<optimized out>,
> comm_ptr=<optimized out>, errflag=<optimized out>) at
> src/mpi/coll/bcast.c:280
> nbytes = <optimized out>
> mpi_errno_ret = <optimized out>
> mpi_errno = 0
> comm_size = <optimized out>
> rank = 2
> type_size = <optimized out>
> tmp_buf = 0x0
> position = <optimized out>
> relative_rank = <optimized out>
> mask = <optimized out>
> src = <optimized out>
> status = <optimized out>
> recvd_size = <optimized out>
> dst = <optimized out>
> #8 0x00000001002455a3 in MPIR_SMP_Bcast (buffer=<optimized out>,
> count=<optimized out>, datatype=<optimized out>, root=<optimized out>,
> comm_ptr=<optimized out>, errflag=<optimized out>) at
> src/mpi/coll/bcast.c:1087
> mpi_errno_ = <error reading variable mpi_errno_ (Cannot access
> memory at address 0x0)>
> mpi_errno = <optimized out>
> mpi_errno_ret = <optimized out>
> nbytes = <optimized out>
> type_size = <optimized out>
> status = <optimized out>
> recvd_size = <optimized out>
> #9 MPIR_Bcast_intra (buffer=0x100917c30, count=<optimized out>,
> datatype=<optimized out>, root=1, comm_ptr=<optimized out>,
> errflag=<optimized out>) at src/mpi/coll/bcast.c:1245
> nbytes = <optimized out>
> mpi_errno_ret = <error reading variable mpi_errno_ret (Cannot
> access memory at address 0x0)>
> mpi_errno = <optimized out>
> type_size = <optimized out>
> comm_size = <optimized out>
> #10 0x000000010024751e in MPIR_Bcast (buffer=<optimized out>,
> count=<optimized out>, datatype=<optimized out>, root=<optimized out>,
> comm_ptr=0x0, errflag=<optimized out>) at src/mpi/coll/bcast.c:1475
> mpi_errno = <optimized out>
> #11 MPIR_Bcast_impl (buffer=0x1004bf7e0 <MPID_Request_direct+1760>,
> count=-269488145, datatype=-16, root=0, comm_ptr=0x0,
> errflag=0x1004bf100 <MPID_Request_direct>) at
> src/mpi/coll/bcast.c:1451
> mpi_errno = <optimized out>
> #12 0x00000001000f3c24 in MPI_Bcast (buffer=<optimized out>, count=7,
> datatype=1275069445, root=1, comm=<optimized out>) at
> src/mpi/coll/bcast.c:1585
> errflag = 2885681152
> mpi_errno = <optimized out>
> comm_ptr = <optimized out>
> #13 0x0000000100001df7 in run_test<int> (my_rank=2,
> num_ranks=<optimized out>, count=<optimized out>, root_rank=1,
> datatype=@0x7fff5fbfeaec: 1275069445, iterations=<optimized out>) at
> bcast_test.cpp:83
> No locals.
> #14 0x00000001000019cd in main (argc=<optimized out>, argv=<optimized
> out>) at bcast_test.cpp:137
> root_rank = <optimized out>
> count = <optimized out>
> iterations = <optimized out>
> my_rank = 4978656
> num_errors = <optimized out>
> runtime_ns = <optimized out>
> stats = {<std::__1::__basic_string_common<true>> = {<No data
> fields>}, __r_ =
> {<std::__1::__libcpp_compressed_pair_imp<std::__1::basic_string<char,
> std::__1::char_traits<char>, std::__1::allocator<char> >::__rep,
> std::__1::allocator<char>, 2>> = {<std::__1::allocator<char>> = {<No
> data fields>}, __first_ = {{__l = {__cap_ = 17289301308300324847,
> __size_ = 17289301308300324847, __data_ = 0xefefefefefefefef <error:
> Cannot access memory at address 0xefefefefefefefef>}
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>
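A side note on reading the trace above: several of the odd-looking values are the same byte repeated. count=-269488145 in frame #11 is 0xEFEFEFEF when viewed as a 32-bit pattern, and __cap_ and __size_ in frame #14 (17289301308300324847) are 0xEFEFEFEFEFEFEFEF, the same pattern as the address gdb could not read in frame #2. The small C sketch below checks that arithmetic. The helper names is_byte_fill and as_u32 are made up for illustration and are not part of MPICH or gdb. The sketch only confirms the bit patterns; it does not say why the bytes are 0xEF, which could be a fill marker in uninitialized or freed memory, or simply stale values that gdb shows in an optimized build.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of MPICH or gdb): returns 1 when each of
 * the low `nbytes` bytes of `value` equals `fill`, and 0 otherwise. */
int is_byte_fill(uint64_t value, unsigned nbytes, uint8_t fill)
{
    assert(nbytes >= 1 && nbytes <= 8);
    for (unsigned i = 0; i < nbytes; i++) {
        /* Extract byte i (least significant first) and compare it. */
        if (((value >> (8 * i)) & 0xffu) != fill)
            return 0;
    }
    return 1;
}

/* Reinterpret a signed 32-bit value as printed by gdb
 * (e.g. count=-269488145) as its unsigned bit pattern. */
uint32_t as_u32(int32_t v)
{
    return (uint32_t)v;
}
```

For example, is_byte_fill(as_u32(-269488145), 4, 0xEF) returns 1, while the datatype handle 1275069445 from frame #12 does not match the pattern, so that argument at least looks like a real value rather than fill bytes.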