[mpich-discuss] Segmentation fault with MXM

admin at genome.arizona.edu
Fri Jan 26 16:47:59 CST 2018


I tried the OSU benchmarks with OSU MVAPICH2 and that worked fine; I was 
able to verify that InfiniBand is being used.  However, there is a 
segmentation fault when using MPICH, see below.  When I re-compiled MPICH 
without MXM (configure lines are below), the OSU benchmark works as 
expected, but it reports the slower speed of the 1 Gb Ethernet network...

It seems to be related to MXM, so perhaps I need to contact Mellanox 
regarding this?
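
For reference, the two builds were configured roughly along these lines 
(the MXM install path is only a guess at the usual MOFED/HPC-X location, 
so treat it as a placeholder):

$ ./configure --prefix=/opt/mpich-install \
              --with-device=ch3:nemesis:mxm \
              --with-mxm=/opt/mellanox/mxm    # placeholder MXM path
$ make && make install

The build without MXM just uses the default TCP netmod:

$ ./configure --prefix=/opt/mpich-install --with-device=ch3:nemesis
$ make && make install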

Thanks



$ which mpirun
/opt/mpich-install/bin/mpirun

$ mpirun -np 2 -hostfile /opt/machinelist ./osu_bibw
[1517005194.381407] [n001:18086:0]         sys.c:744  MXM  WARN 
Conflicting CPU frequencies detected, using: 2101.00
[1517005194.513782] [n002:32344:0]         sys.c:744  MXM  WARN 
Conflicting CPU frequencies detected, using: 2101.00
[1517005194.599025] [n002:32344:0]    proto_ep.c:179  MXM  WARN  tl dc 
is requested but not supported
[1517005194.655493] [n001:18086:0]    proto_ep.c:179  MXM  WARN  tl dc 
is requested but not supported
# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size      Bandwidth (MB/s)
1                       1.64
2                       4.88
4                       9.84
8                      19.83
16                     39.48
32                     76.25
64                    150.78
[n002:32344:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
  2 0x000000000005767c mxm_handle_error() 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:641
  3 0x00000000000577ec mxm_error_signal_handler() 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:616
  4 0x0000003550432510 killpg()  ??:0
  5 0x0000000000056258 mxm_mpool_put() 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/util/datatype/mpool.c:210
  6 0x00000000000689ce mxm_cib_ep_poll_tx() 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/tl/cib/cib_progress.c:527
  7 0x000000000006913d mxm_cib_ep_progress() 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/tl/cib/cib_progress.c:552
  8 0x000000000004268a mxm_notifier_chain_call() 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/./mxm/util/datatype/callback.h:74
  9 0x000000000004268a mxm_progress_internal() 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/core/mxm.c:64
10 0x000000000004268a mxm_progress() 
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/core/mxm.c:346
11 0x0000000000177a49 MPID_nem_mxm_poll()  ??:0
12 0x0000000000169be8 MPIDI_CH3I_Progress()  ??:0
13 0x00000000000d0ba7 MPIR_Waitall_impl()  ??:0
14 0x00000000000d1308 PMPI_Waitall()  ??:0
15 0x0000000000401904 main() 
/opt/downloads/osu-micro-benchmarks-5.4/mpi/pt2pt/osu_bibw.c:146
16 0x000000355041ed1d __libc_start_main()  ??:0
17 0x0000000000401269 _start()  ??:0
===================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 32344 RUNNING AT n002
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at n001.genome.arizona.edu] HYD_pmcd_pmip_control_cmd_cb 
(pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
[proxy:0:0 at n001.genome.arizona.edu] HYDT_dmxu_poll_wait_for_event 
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at n001.genome.arizona.edu] main (pm/pmiserv/pmip.c:202): demux 
engine error waiting for event
[mpiexec at pac.genome.arizona.edu] HYDT_bscu_wait_for_completion 
(tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated 
badly; aborting
[mpiexec at pac.genome.arizona.edu] HYDT_bsci_wait_for_completion 
(tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting 
for completion
[mpiexec at pac.genome.arizona.edu] HYD_pmci_wait_for_completion 
(pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for 
completion
[mpiexec at pac.genome.arizona.edu] main (ui/mpich/mpiexec.c:340): process 
manager error waiting for completion

