[mpich-discuss] Help with Cray MPICH

Marcin Zalewski zalewski at indiana.edu
Fri Feb 21 13:02:38 CST 2014


Nick,

I have been trying to run my code using an older version of Cray MPICH
(cray-mpich2/5.6.1), and I am getting a similar stack trace:

#0  0x00002aaaada445d5 in memcpy () from /lib64/libc.so.6
#1  0x00002aaaac5493e4 in MPID_Segment_blkidx_m2m () from /opt/cray/lib64/libmpich_gnu_47.so.1
#2  0x00002aaaac547d10 in DLOOP_Leaf_contig_count_block () from /opt/cray/lib64/libmpich_gnu_47.so.1
#3  0x00002aaaac5501f1 in MPID_Datatype_free_contents () from /opt/cray/lib64/libmpich_gnu_47.so.1
#4  0x00002aaaac538c18 in MPID_nem_gni_progress_send_handle_cookie () from /opt/cray/lib64/libmpich_gnu_47.so.1
#5  0x00002aaaac52cacf in MPID_nem_gni_poll () from /opt/cray/lib64/libmpich_gnu_47.so.1
#6  0x00002aaaac52d65a in nem_gni_display_pkt.isra.1 () from /opt/cray/lib64/libmpich_gnu_47.so.1
#7  0x00002aaaac51285f in MPIDI_CH3I_Progress_init () from /opt/cray/lib64/libmpich_gnu_47.so.1
#8  0x00002aaaac5ecc19 in PMPI_Wait () from /opt/cray/lib64/libmpich_gnu_47.so.1
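
For what it is worth, if the symbols in this non-debug build can be
trusted, MPID_Segment_blkidx_m2m is the MPICH segment routine that
copies block-indexed derived datatypes, so the crash seems to happen
while MPICH packs or unpacks such a datatype during the transfer that
MPI_Wait is completing. A hypothetical minimal pattern that goes
through that path might look like the sketch below; the counts,
displacements, tag, and buffer size are invented for illustration and
are not taken from my application:

#include <mpi.h>

/* Hypothetical sketch only: a nonblocking exchange of a block-indexed
 * derived datatype, completed by MPI_Wait. All counts, displacements,
 * and the tag are made up for illustration. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { NBLOCKS = 4, BLOCKLEN = 8, BUFLEN = 64 };
    int displs[NBLOCKS] = { 0, 16, 32, 48 };   /* in units of MPI_DOUBLE */

    MPI_Datatype blkidx;
    MPI_Type_create_indexed_block(NBLOCKS, BLOCKLEN, displs, MPI_DOUBLE, &blkidx);
    MPI_Type_commit(&blkidx);

    double buf[BUFLEN];
    for (int i = 0; i < BUFLEN; ++i)
        buf[i] = (double)rank;

    if (size >= 2) {
        MPI_Request req;
        if (rank == 0) {
            /* Receive side: MPICH unpacks the block-indexed layout here,
             * which is the kind of work the blkidx segment code performs. */
            MPI_Irecv(buf, 1, blkidx, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Isend(buf, 1, blkidx, 0, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
    }

    MPI_Type_free(&blkidx);
    MPI_Finalize();
    return 0;
}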

Do you know whether the bug you have in mind is limited to the MPT
6.x.x releases, or whether it occurs in earlier versions as well?

Thank you,
Marcin

On Mon, Feb 3, 2014 at 2:46 PM, Nick Radcliffe <nradclif at cray.com> wrote:
> Hi Marcin,
>
> I can't be certain, but this looks like a bug we fixed recently in Cray MPICH. The fix should be available in MPT 6.2.2 by Feb. 20.
>
> -Nick Radcliffe,
> Cray MPT Team
>
> ________________________________________
> From: discuss-bounces at mpich.org [discuss-bounces at mpich.org] on behalf of Marcin Zalewski [marcin.zalewski at gmail.com]
> Sent: Monday, February 03, 2014 1:27 PM
> To: discuss at mpich.org
> Subject: [mpich-discuss] Help with Cray MPICH
>
> I have an application that I am trying to run on a Cray machine composed of
> XE6 nodes. I have run this application previously with Open MPI and
> MVAPICH on a few different machines, so I believe it is more or less
> free of major bugs. However, when I run it on the Cray machine, I
> get a segmentation fault with a stack trace that ends with the following:
>
> #0  0x00002aaaafe265e6 in memcpy () from /lib64/libc.so.6
> #1  0x00002aaaae8fe023 in MPID_Segment_index_m2m () from /opt/cray/lib64/libmpich_gnu_48.so.2
> #2  0x00002aaaae8fca18 in MPID_Segment_manipulate () from /opt/cray/lib64/libmpich_gnu_48.so.2
> #3  0x00002aaaae9049d1 in MPID_Segment_unpack () from /opt/cray/lib64/libmpich_gnu_48.so.2
> #4  0x00002aaaae8edd38 in MPID_nem_gni_complete_rdma_get () from /opt/cray/lib64/libmpich_gnu_48.so.2
> #5  0x00002aaaae8e11c8 in MPID_nem_gni_check_localCQ () from /opt/cray/lib64/libmpich_gnu_48.so.2
> #6  0x00002aaaae8e29fa in MPID_nem_gni_poll () from /opt/cray/lib64/libmpich_gnu_48.so.2
> #7  0x00002aaaae8c3515 in MPIDI_CH3I_Progress () from /opt/cray/lib64/libmpich_gnu_48.so.2
> #8  0x00002aaaae9a3d4d in PMPI_Testsome () from /opt/cray/lib64/libmpich_gnu_48.so.2
>
> I understand that this is very little information to go on, but I am
> stuck on this problem, so I thought I would try this list while I
> solicit help from our support team. Currently, I do not have access
> to a debug build of the library (or to its source code). Has anyone
> here seen a similar error? In general, what would cause a memcpy
> error inside MPICH during a call to MPI_Testsome? I have asserts in
> my code to make sure that the output arrays are large enough, and the
> code works with other MPI implementations, so I am really at a loss
> as to what the problem might be.
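>
> To make that concrete, the sketch below is roughly the shape of my
> MPI_Testsome loop. The request count and the completion handling are
> made up here (this is not my real code), but the sizing of the
> indices and statuses arrays is exactly what my asserts check:
>
> #include <mpi.h>
> #include <assert.h>
>
> #define NREQ 8   /* illustrative; not the real number of requests I use */
>
> /* Hypothetical sketch of the polling pattern: the output arrays are
>  * sized to the full number of outstanding requests, so MPI_Testsome
>  * can never write past their ends. */
> static void poll_requests(MPI_Request reqs[NREQ])
> {
>     int indices[NREQ];
>     MPI_Status statuses[NREQ];
>     int outcount = 0;
>
>     MPI_Testsome(NREQ, reqs, &outcount, indices, statuses);
>
>     if (outcount != MPI_UNDEFINED) {
>         assert(outcount <= NREQ);      /* output arrays are large enough */
>         for (int i = 0; i < outcount; ++i) {
>             /* handle completion of reqs[indices[i]] */
>         }
>     }
> }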
>
> Thank you,
> Marcin
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss


