[mpich-discuss] segment fault when using knem

M Xie xmxmxie at gmail.com
Thu Apr 25 20:03:51 CDT 2013


Hello,

I am running some tests with MPICH on a dual-processor SMP server.
I am using mpich-3.0.3, together with knem to accelerate intra-node
communication.

The channel I use is nemesis:tcp, built with nemesis-local-lmt=knem.
When I run the osu_benchmarks bandwidth test, osu_bw segfaults at a message
size of 2 MB, which is the default value of MPICH_NEM_LMT_DMA_THRESHOLD.
The segmentation fault seems to occur whenever knem's DMA channel is used:
when I set MPICH_NEM_LMT_DMA_THRESHOLD to a smaller value, such as 131072,
osu_bw segfaults at 131072 instead.
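
To make the pattern concrete, here is a small Python sketch of what appears to be happening. It is only a model of the observed behavior, not MPICH code: the names `copy_path` and `first_dma_size`, the `>=` comparison, and the power-of-two sweep up to 4 MB (mimicking osu_bw) are all assumptions made for illustration.

```python
import os

# Default reported in this post: 2 MB.
DEFAULT_DMA_THRESHOLD = 2 * 1024 * 1024


def dma_threshold(env=None):
    """Return the threshold in bytes, honoring MPICH_NEM_LMT_DMA_THRESHOLD if set."""
    env = os.environ if env is None else env
    return int(env.get("MPICH_NEM_LMT_DMA_THRESHOLD", DEFAULT_DMA_THRESHOLD))


def copy_path(msg_size, threshold):
    """Hypothetical model: messages at or above the threshold use knem's DMA channel.

    The >= comparison is an assumption, chosen because the crash is observed
    at exactly the threshold size.
    """
    return "dma" if msg_size >= threshold else "cpu"


def first_dma_size(threshold, max_size=4 * 1024 * 1024):
    """First message size in a power-of-two sweep that would take the DMA path."""
    size = 1
    while size <= max_size:
        if copy_path(size, threshold) == "dma":
            return size
        size *= 2
    return None  # the sweep never reaches the threshold


if __name__ == "__main__":
    for threshold in (DEFAULT_DMA_THRESHOLD, 131072):
        print(threshold, first_dma_size(threshold))
```

Under these assumptions, the first message size that takes the DMA path is the same size at which osu_bw crashes for both threshold settings, which is why I suspect the DMA path rather than the message size itself.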

I also tested the NAS Parallel Benchmarks. I noticed that when the DMA
channel in knem is used, the NPB tests sometimes freeze after running
for a while.

The attachments contain some configuration details and gdb backtraces from the core dumps.

Has anyone encountered or solved a similar problem?

Thanks for your help.
-------------- next part --------------
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/xm/osu_benchmarks/osu_bw...done.
[New Thread 20945]
Missing separate debuginfo for 
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/08/f634a1d22deff00461d50a7699dacdc97657bf
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `./osu_bw'.
Program terminated with signal 11, Segmentation fault.
#0  pkt_DONE_handler (vc=0x22660680, pkt=0x7f72bc1450d8, buflen=0x7fff77872128, rreqp=0x7fff77872130)
    at ../mpich-3.0.3/src/mpid/ch3/channels/nemesis/src/mpid_nem_lmt.c:375
375         switch (MPIDI_Request_get_type(req))
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64
(gdb) bt
#0  pkt_DONE_handler (vc=0x22660680, pkt=0x7f72bc1450d8, buflen=0x7fff77872128, rreqp=0x7fff77872130)
    at ../mpich-3.0.3/src/mpid/ch3/channels/nemesis/src/mpid_nem_lmt.c:375
#1  0x000000000043c3f4 in MPID_nem_handle_pkt (vc=0x22660680, buf=0x7f72bc1450d8 "#", buflen=40)
    at ../mpich-3.0.3/src/mpid/ch3/channels/nemesis/src/ch3_progress.c:638
#2  0x0000000000440658 in MPIDI_CH3I_Progress (progress_state=0x7fff778727c0, is_blocking=<value optimized out>)
    at ../mpich-3.0.3/src/mpid/ch3/channels/nemesis/src/ch3_progress.c:425
#3  0x00000000004100aa in MPIR_Waitall_impl (count=64, array_of_requests=0x107d0200, array_of_statuses=0x207e11a0)
    at ../mpich-3.0.3/src/mpi/pt2pt/waitall.c:159
#4  0x0000000000410750 in PMPI_Waitall (count=64, array_of_requests=0x107d0200, array_of_statuses=0x207e11a0)
    at ../mpich-3.0.3/src/mpi/pt2pt/waitall.c:297
#5  0x0000000000409368 in main ()
-------------- next part --------------
Reading symbols from /root/xm/osu_benchmarks/osu_bw...done.
[New Thread 20946]
Missing separate debuginfo for 
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/08/f634a1d22deff00461d50a7699dacdc97657bf
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `./osu_bw'.
Program terminated with signal 11, Segmentation fault.
#0  OPA_load_ptr (vc=0x209544b8, hdr=0x7fff58234110, hdr_sz=40, sreq_ptr=0x7fff58234160)
    at /root/xm/mpich-3.0.3/src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h:46
46          return ptr->v;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64
(gdb) bt
#0  OPA_load_ptr (vc=0x209544b8, hdr=0x7fff58234110, hdr_sz=40, sreq_ptr=0x7fff58234160)
    at /root/xm/mpich-3.0.3/src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h:46
#1  MPID_nem_queue_dequeue (vc=0x209544b8, hdr=0x7fff58234110, hdr_sz=40, sreq_ptr=0x7fff58234160)
    at ../mpich-3.0.3/src/mpid/ch3/channels/nemesis/include/mpid_nem_queue.h:192
#2  MPID_nem_mpich_send_header (vc=0x209544b8, hdr=0x7fff58234110, hdr_sz=40, sreq_ptr=0x7fff58234160)
    at ../mpich-3.0.3/src/mpid/ch3/channels/nemesis/include/mpid_nem_inline.h:111
#3  MPIDI_CH3_iStartMsg (vc=0x209544b8, hdr=0x7fff58234110, hdr_sz=40, sreq_ptr=0x7fff58234160)
    at ../mpich-3.0.3/src/mpid/ch3/channels/nemesis/src/ch3_istartmsg.c:59
#4  0x000000000044ad04 in MPID_nem_lmt_dma_progress () at ../mpich-3.0.3/src/mpid/ch3/channels/nemesis/src/mpid_nem_lmt_dma.c:562
#5  0x000000000044077b in MPIDI_CH3I_Progress (progress_state=0x7fff582347f0, is_blocking=<value optimized out>)
    at ../mpich-3.0.3/src/mpid/ch3/channels/nemesis/src/ch3_progress.c:460
#6  0x00000000004100aa in MPIR_Waitall_impl (count=64, array_of_requests=0x107d0200, array_of_statuses=0x207e11a0)
    at ../mpich-3.0.3/src/mpi/pt2pt/waitall.c:159
#7  0x0000000000410750 in PMPI_Waitall (count=64, array_of_requests=0x107d0200, array_of_statuses=0x207e11a0)
    at ../mpich-3.0.3/src/mpi/pt2pt/waitall.c:297
#8  0x0000000000409470 in main ()
-------------- next part --------------
Linux version 2.6.32-220.el6.x86_64 (mockbuild at x86-004.build.bos.redhat.com) (gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011
-------------- next part --------------
A non-text attachment was scrubbed...
Name: knem_checks.h
Type: text/x-chdr
Size: 747 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20130426/5deaad85/attachment.bin>
-------------- next part --------------
knem 1.0.0
 Driver ABI=0xd
 Flags: forcing 0x0, ignoring 0x0
 DMAEngine: KernelSupported Enabled ChansAvail ChunkMin=1024B
 Debug: NotBuilt
 Requests submitted                           : 79698
 Requests processed (total)                   : 79698
          processed (using DMA)               : 8178
          processed (offloaded to thread)     : 0
          processed (with pinned local pages) : 8178
 Requests rejected (invalid flags)            : 0
          rejected (not enough memory)        : 0
          rejected (invalid ioctl argument)   : 0
          rejected (unexisting region cookie) : 0
          rejected (failed to pin local pages): 0
 Requests failed during memcpy from/to user   : 0
          failed during DMA copy              : 0
 DMA copy cleanup timeout                     : 0
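
As a quick sanity check on the counters above, the following Python sketch computes what fraction of requests went through the DMA engine and whether knem itself reported any rejected or failed requests. The helper names are made up for illustration, and the script only parses the text shown here; it does not query the driver.

```python
import re

# Counter lines copied from the knem statistics above.
KNEM_STATS = """\
 Requests submitted                           : 79698
 Requests processed (total)                   : 79698
          processed (using DMA)               : 8178
          processed (offloaded to thread)     : 0
          processed (with pinned local pages) : 8178
 Requests rejected (invalid flags)            : 0
          rejected (not enough memory)        : 0
          rejected (invalid ioctl argument)   : 0
          rejected (unexisting region cookie) : 0
          rejected (failed to pin local pages): 0
 Requests failed during memcpy from/to user   : 0
          failed during DMA copy              : 0
 DMA copy cleanup timeout                     : 0
"""


def parse_counters(text):
    """Map each 'label : integer' line to {label: value}."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def dma_fraction(counters):
    """Share of processed requests that used the DMA engine."""
    return counters["processed (using DMA)"] / counters["Requests processed (total)"]


def error_count(counters):
    """Sum of every rejected, failed, or timed-out counter."""
    keywords = ("rejected", "failed", "timeout")
    return sum(v for k, v in counters.items() if any(w in k for w in keywords))


if __name__ == "__main__":
    stats = parse_counters(KNEM_STATS)
    print(f"DMA share: {dma_fraction(stats):.1%}, errors: {error_count(stats)}")
```

With these numbers, roughly one request in ten used DMA and every error counter is zero, so the driver statistics alone do not point to a failed DMA copy.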
-------------- next part --------------
MPICH Version:          3.0.3
MPICH Release date:     Thu Mar 28 16:01:21 CDT 2013
MPICH Device:           ch3:nemesis
MPICH configure:        --prefix=/usr/local/mpi-knem --with-device=ch3:nemesis:tcp --enable-g=most --enable-romio=no --without-mpe --with-nemesis-local-lmt=knem --with-knem-include=/opt/knem/include --disable-f77 --disable-fc CFLAGS=-fPIC CXXFLAGS=-fPIC
MPICH CC:       gcc -fPIC   -g -O2
MPICH CXX:      c++ -fPIC  -g -O2
MPICH F77:      no   -g
MPICH FC:       no   -g
-------------- next part --------------
A non-text attachment was scrubbed...
Name: knem_config.log
Type: application/octet-stream
Size: 15511 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20130426/5deaad85/attachment.obj>

