[mpich-discuss] Segmentation fault with MXM

Min Si msi at anl.gov
Fri Jan 26 17:04:29 CST 2018


It looks like the segfault was reported from the MXM library. But as you 
mentioned, MVAPICH works fine, so we should verify whether this is an 
MPICH issue or an MXM issue.

I will try to reproduce the issue on our test platform and keep you 
updated. Could you please confirm that this is your configure line?
MPICH version: 3.2.1
./configure --prefix=/opt/mpich-install --with-device=ch3:nemesis:mxm 
--with-mxm=/opt/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm
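
For a quick check, the mpichversion utility installed next to mpirun 
prints the version and configure options of that installation (the path 
below just matches your prefix):

$ /opt/mpich-install/bin/mpichversion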

Meanwhile, could you please also try mpich-3.3a3 (see 
http://www.mpich.org/downloads/)? It includes a few bug fixes.
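
In case it helps, a rough build sequence for 3.3a3 that reuses the same 
MXM installation would look something like the following (the tarball URL 
follows the usual layout of the downloads page, and the install prefix is 
just an example -- adjust both for your environment):

$ wget http://www.mpich.org/static/downloads/3.3a3/mpich-3.3a3.tar.gz
$ tar xzf mpich-3.3a3.tar.gz && cd mpich-3.3a3
$ ./configure --prefix=/opt/mpich-3.3a3-install \
      --with-device=ch3:nemesis:mxm \
      --with-mxm=/opt/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm
$ make -j4 && make install

Then rebuild the OSU benchmarks against the new mpicc before rerunning 
osu_bibw.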

Thanks,
Min
On 2018/01/26 16:47, admin at genome.arizona.edu wrote:
> I tried the OSU benchmark with OSU MVAPICH2 and that worked fine; I 
> was able to verify that InfiniBand is being used.  However, there is a 
> segmentation fault when using MPICH, see below.  When MPICH is 
> recompiled without MXM, the OSU benchmark works as expected but 
> reports the slower speed of the 1Gb Ethernet network...
>
>  It seems to be related to MXM, so perhaps I need to contact Mellanox 
> regarding this?
>
> Thanks
>
>
>
> $ which mpirun
> /opt/mpich-install/bin/mpirun
>
> $ mpirun -np 2 -hostfile /opt/machinelist ./osu_bibw
> [1517005194.381407] [n001:18086:0]         sys.c:744  MXM  WARN 
> Conflicting CPU frequencies detected, using: 2101.00
> [1517005194.513782] [n002:32344:0]         sys.c:744  MXM  WARN 
> Conflicting CPU frequencies detected, using: 2101.00
> [1517005194.599025] [n002:32344:0]    proto_ep.c:179  MXM  WARN tl dc 
> is requested but not supported
> [1517005194.655493] [n001:18086:0]    proto_ep.c:179  MXM  WARN tl dc 
> is requested but not supported
> # OSU MPI Bi-Directional Bandwidth Test v5.4.0
> # Size      Bandwidth (MB/s)
> 1                       1.64
> 2                       4.88
> 4                       9.84
> 8                      19.83
> 16                     39.48
> 32                     76.25
> 64                    150.78
> [n002:32344:0] Caught signal 11 (Segmentation fault)
> ==== backtrace ====
>  2 0x000000000005767c mxm_handle_error() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:641
>  3 0x00000000000577ec mxm_error_signal_handler() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:616
>  4 0x0000003550432510 killpg()  ??:0
>  5 0x0000000000056258 mxm_mpool_put() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/util/datatype/mpool.c:210
>  6 0x00000000000689ce mxm_cib_ep_poll_tx() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/tl/cib/cib_progress.c:527
>  7 0x000000000006913d mxm_cib_ep_progress() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/tl/cib/cib_progress.c:552
>  8 0x000000000004268a mxm_notifier_chain_call() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/./mxm/util/datatype/callback.h:74
>  9 0x000000000004268a mxm_progress_internal() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/core/mxm.c:64
> 10 0x000000000004268a mxm_progress() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/core/mxm.c:346
> 11 0x0000000000177a49 MPID_nem_mxm_poll()  ??:0
> 12 0x0000000000169be8 MPIDI_CH3I_Progress()  ??:0
> 13 0x00000000000d0ba7 MPIR_Waitall_impl()  ??:0
> 14 0x00000000000d1308 PMPI_Waitall()  ??:0
> 15 0x0000000000401904 main() 
> /opt/downloads/osu-micro-benchmarks-5.4/mpi/pt2pt/osu_bibw.c:146
> 16 0x000000355041ed1d __libc_start_main()  ??:0
> 17 0x0000000000401269 _start()  ??:0
> ===================
>
> =================================================================================== 
>
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 32344 RUNNING AT n002
> =   EXIT CODE: 139
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =================================================================================== 
>
> [proxy:0:0 at n001.genome.arizona.edu] HYD_pmcd_pmip_control_cmd_cb 
> (pm/pmiserv/pmip_cb.c:887): assert (!closed) failed
> [proxy:0:0 at n001.genome.arizona.edu] HYDT_dmxu_poll_wait_for_event 
> (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:0 at n001.genome.arizona.edu] main (pm/pmiserv/pmip.c:202): 
> demux engine error waiting for event
> [mpiexec at pac.genome.arizona.edu] HYDT_bscu_wait_for_completion 
> (tools/bootstrap/utils/bscu_wait.c:76): one of the processes 
> terminated badly; aborting
> [mpiexec at pac.genome.arizona.edu] HYDT_bsci_wait_for_completion 
> (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting 
> for completion
> [mpiexec at pac.genome.arizona.edu] HYD_pmci_wait_for_completion 
> (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for 
> completion
> [mpiexec at pac.genome.arizona.edu] main (ui/mpich/mpiexec.c:340): 
> process manager error waiting for completion
>
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

