[mpich-discuss] MPICH-3.2: SIGSEGV in MPID_Request_create () at src/mpid/ch3/src/ch3u_request.c:101

Eric A. Borisch eborisch at gmail.com
Thu Aug 11 20:06:01 CDT 2016


Or install from MacPorts; it has the patch.

On Thursday, August 11, 2016, Kenneth Raffenetti <raffenet at mcs.anl.gov>
wrote:

> Or a snapshot tarball:
> http://www.mpich.org/static/downloads/nightly/master/mpich/
>
> On 08/11/2016 04:21 PM, Halim Amer wrote:
>
>> This should be related to the alignment problem reported before
>> (http://lists.mpich.org/pipermail/discuss/2016-May/004764.html).
>>
>> We plan to include a fix in the 3.2.x bug fix release series. Meanwhile,
>> please try the repo version (git.mpich.org/mpich.git), which should not
>> suffer from this problem.
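
For context, the earlier thread attributes the crash to the compiler turning the run of consecutive "req->... = NULL" assignments in MPID_Request_create into 16-byte aligned vector stores, while the request objects handed out by the object pool are not 16-byte aligned. The sketch below is purely illustrative (it is not the MPICH source or the actual fix); it only shows how a type can be over-aligned in C++ so that vectorized member stores of this kind stay aligned.

    // Illustrative only -- not MPICH code. A struct with consecutive pointer
    // members, over-aligned so the compiler may initialize pairs of members
    // with 16-byte aligned SSE stores without faulting.
    struct alignas(16) request_like {
        void *a;
        void *b;
        void *c;
        void *d;
    };

    // Any allocator or object pool backing these must hand out slots that
    // respect the declared 16-byte alignment.
    static_assert(alignof(request_like) == 16,
                  "pool slots must be handed out on 16-byte boundaries");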
>>
>> --Halim
>> www.mcs.anl.gov/~aamer
>>
>> On 8/11/16 8:48 AM, Mark Davis wrote:
>>
>>> Hello, I'm running into a segfault when I run some relatively simple
>>> MPI programs. In this particular case, I'm running a small program in
>>> a loop that does MPI_Bcast, once per loop, within MPI_COMM_WORLD. The
>>> buffer consists of just 7 doubles. I'm running with 6 procs on a
>>> machine with 8 cores on OSX (Darwin - 15.6.0 Darwin Kernel Version
>>> 15.6.0: Thu Jun 23 18:25:34 PDT 2016;
>>> root:xnu-3248.60.10~1/RELEASE_X86_64 x86_64). When I run the same
>>> program with a smaller number of procs, the error usually doesn't show
>>> up. My compiler (both for the MPICH source and for my application) is
>>> clang 3.8.1.
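
For reference, a minimal sketch of the kind of test loop described above (this is not the actual bcast_test.cpp; the root rank of 1 comes from the backtrace below, and the iteration count and use of std::vector are guesses):

    // Broadcast a 7-element buffer from one root in a loop over MPI_COMM_WORLD.
    // Build with mpicxx and run with something like: mpiexec -n 6 ./bcast_loop
    #include <mpi.h>
    #include <vector>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int root = 1;                // root=1 appears in the backtrace
        std::vector<double> buf(7, 0.0);   // "buffer consists of just 7 doubles"
        for (int i = 0; i < 100000; ++i) {
            if (rank == root)
                buf.assign(buf.size(), static_cast<double>(i));
            MPI_Bcast(buf.data(), static_cast<int>(buf.size()),
                      MPI_DOUBLE, root, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }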
>>>
>>> When I run the same program on Linux, also with MPICH-3.2 (I believe
>>> the exact same source), compiled with gcc 5.3, I do not get this
>>> error. This seems to be something I get only with the clang/OSX setup.
>>>
>>> gdb shows the following stack trace. I have a feeling that this has
>>> something to do with my toolchain and/or libraries on my system given
>>> that I never get this error on my other system (linux). However, it's
>>> possible that there's an application bug as well.
>>>
>>> I'm running the MPICH-3.2 stable release; I haven't tried anything
>>> from the repository yet.
>>>
>>> Does anyone have any ideas about what's going on here? I'm happy to
>>> provide more details.
>>>
>>> Thank you,
>>> Mark
>>>
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> MPID_Request_create () at src/mpid/ch3/src/ch3u_request.c:101
>>> 101             req->dev.ext_hdr_ptr       = NULL;
>>> (gdb) bt full
>>> #0  MPID_Request_create () at src/mpid/ch3/src/ch3u_request.c:101
>>> No locals.
>>> #1  0x00000001003ac4c9 in MPIDI_CH3U_Recvq_FDP_or_AEU
>>> (match=<optimized out>, foundp=0x7fff5fbfe2bc) at
>>> src/mpid/ch3/src/ch3u_recvq.c:830
>>>         proc_failure_bit_masked = <error reading variable
>>> proc_failure_bit_masked (Cannot access memory at address 0x1)>
>>>         error_bit_masked = <error reading variable error_bit_masked
>>> (Cannot access memory at address 0x1)>
>>>         prev_rreq = <optimized out>
>>>         channel_matched = <optimized out>
>>>         rreq = <optimized out>
>>> #2  0x00000001003d1ffe in MPIDI_CH3_PktHandler_EagerSend
>>> (vc=<optimized out>, pkt=0x1004b3fd8 <MPIU_DBG_MaxLevel>,
>>> buflen=0x7fff5fbfe440, rreqp=0x7fff5fbfe438) at
>>> src/mpid/ch3/src/ch3u_eager.c:629
>>>         mpi_errno = <error reading variable mpi_errno (Cannot access
>>> memory at address 0x0)>
>>>         found = <error reading variable found (Cannot access memory at
>>> address 0xefefefefefefefef)>
>>>         rreq = <optimized out>
>>>         data_len = <optimized out>
>>>         complete = <optimized out>
>>> #3  0x00000001003f6045 in MPID_nem_handle_pkt (vc=<optimized out>,
>>> buf=0x102ad07e0 "", buflen=<optimized out>) at
>>> src/mpid/ch3/channels/nemesis/src/ch3_progress.c:760
>>>         len = 140734799800192
>>>         mpi_errno = <optimized out>
>>>         complete = <error reading variable complete (Cannot access
>>> memory at address 0x1)>
>>>         rreq = <optimized out>
>>> #4  0x00000001003f4e41 in MPIDI_CH3I_Progress
>>> (progress_state=0x7fff5fbfe750, is_blocking=1) at
>>> src/mpid/ch3/channels/nemesis/src/ch3_progress.c:570
>>>         payload_len = 4299898840
>>>         cell_buf = <optimized out>
>>>         rreq = <optimized out>
>>>         vc = 0x102ad07e8
>>>         made_progress = <error reading variable made_progress (Cannot
>>> access memory at address 0x0)>
>>>         mpi_errno = <optimized out>
>>> #5  0x000000010035386d in MPIC_Wait (request_ptr=<optimized out>,
>>> errflag=<optimized out>) at src/mpi/coll/helper_fns.c:225
>>>         progress_state = {ch = {completion_count = -1409286143}}
>>>         mpi_errno = <error reading variable mpi_errno (Cannot access
>>> memory at address 0x0)>
>>> #6  0x0000000100353b10 in MPIC_Send (buf=0x100917c30,
>>> count=4299945096, datatype=-1581855963, dest=<optimized out>,
>>> tag=4975608, comm_ptr=0x1004b3fd8 <MPIU_DBG_MaxLevel>,
>>> errflag=<optimized out>) at src/mpi/coll/helper_fns.c:302
>>>         mpi_errno = <optimized out>
>>>         request_ptr = 0x1004bf7e0 <MPID_Request_direct+1760>
>>> #7  0x0000000100246031 in MPIR_Bcast_binomial (buffer=<optimized out>,
>>> count=<optimized out>, datatype=<optimized out>, root=<optimized out>,
>>> comm_ptr=<optimized out>, errflag=<optimized out>) at
>>> src/mpi/coll/bcast.c:280
>>>         nbytes = <optimized out>
>>>         mpi_errno_ret = <optimized out>
>>>         mpi_errno = 0
>>>         comm_size = <optimized out>
>>>         rank = 2
>>>         type_size = <optimized out>
>>>         tmp_buf = 0x0
>>>         position = <optimized out>
>>>         relative_rank = <optimized out>
>>>         mask = <optimized out>
>>>         src = <optimized out>
>>>         status = <optimized out>
>>>         recvd_size = <optimized out>
>>>         dst = <optimized out>
>>> #8  0x00000001002455a3 in MPIR_SMP_Bcast (buffer=<optimized out>,
>>> count=<optimized out>, datatype=<optimized out>, root=<optimized out>,
>>> comm_ptr=<optimized out>, errflag=<optimized out>) at
>>> src/mpi/coll/bcast.c:1087
>>>         mpi_errno_ = <error reading variable mpi_errno_ (Cannot access
>>> memory at address 0x0)>
>>>         mpi_errno = <optimized out>
>>>         mpi_errno_ret = <optimized out>
>>>         nbytes = <optimized out>
>>>         type_size = <optimized out>
>>>         status = <optimized out>
>>>         recvd_size = <optimized out>
>>> #9  MPIR_Bcast_intra (buffer=0x100917c30, count=<optimized out>,
>>> datatype=<optimized out>, root=1, comm_ptr=<optimized out>,
>>> errflag=<optimized out>) at src/mpi/coll/bcast.c:1245
>>>         nbytes = <optimized out>
>>>         mpi_errno_ret = <error reading variable mpi_errno_ret (Cannot
>>> access memory at address 0x0)>
>>>         mpi_errno = <optimized out>
>>>         type_size = <optimized out>
>>>         comm_size = <optimized out>
>>> #10 0x000000010024751e in MPIR_Bcast (buffer=<optimized out>,
>>> count=<optimized out>, datatype=<optimized out>, root=<optimized out>,
>>> comm_ptr=0x0, errflag=<optimized out>) at src/mpi/coll/bcast.c:1475
>>>         mpi_errno = <optimized out>
>>> #11 MPIR_Bcast_impl (buffer=0x1004bf7e0 <MPID_Request_direct+1760>,
>>> count=-269488145, datatype=-16, root=0, comm_ptr=0x0,
>>> errflag=0x1004bf100 <MPID_Request_direct>) at
>>> src/mpi/coll/bcast.c:1451
>>>         mpi_errno = <optimized out>
>>> #12 0x00000001000f3c24 in MPI_Bcast (buffer=<optimized out>, count=7,
>>> datatype=1275069445, root=1, comm=<optimized out>) at
>>> src/mpi/coll/bcast.c:1585
>>>         errflag = 2885681152
>>>         mpi_errno = <optimized out>
>>>         comm_ptr = <optimized out>
>>> #13 0x0000000100001df7 in run_test<int> (my_rank=2,
>>> num_ranks=<optimized out>, count=<optimized out>, root_rank=1,
>>> datatype=@0x7fff5fbfeaec: 1275069445, iterations=<optimized out>) at
>>> bcast_test.cpp:83
>>> No locals.
>>> #14 0x00000001000019cd in main (argc=<optimized out>, argv=<optimized
>>> out>) at bcast_test.cpp:137
>>>         root_rank = <optimized out>
>>>         count = <optimized out>
>>>         iterations = <optimized out>
>>>         my_rank = 4978656
>>>         num_errors = <optimized out>
>>>         runtime_ns = <optimized out>
>>>         stats = {<std::__1::__basic_string_common<true>> = {<No data
>>> fields>}, __r_ =
>>> {<std::__1::__libcpp_compressed_pair_imp<std::__1::basic_string<char,
>>> std::__1::char_traits<char>, std::__1::allocator<char> >::__rep,
>>> std::__1::allocator<char>, 2>> = {<std::__1::allocator<char>> = {<No
>>> data fields>}, __first_ = {{__l = {__cap_ = 17289301308300324847,
>>> __size_ = 17289301308300324847, __data_ = 0xefefefefefefefef <error:
>>> Cannot access memory at address 0xefefefefefefefef>}
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

