[mpich-discuss] Segmentation fault with MXM

Min Si msi at anl.gov
Mon Jan 29 16:50:10 CST 2018


Hi,

I confirmed this segfault issue with 
hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm. However, mpich-3.3a3 works 
fine with the MLNX_OFED version. You can download it at:
http://www.mellanox.com/downloads/hpc/hpc-x/v2.0/hpcx-v2.0.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64.tbz
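
If you go that route, rebuilding MPICH against the MLNX_OFED flavor of mxm 
should just be a matter of pointing --with-mxm at the extracted tarball. A 
rough sketch (the extraction directory and install prefix below are only 
examples, adjust them to your layout):

$ tar xjf hpcx-v2.0.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64.tbz -C /opt
$ cd mpich-3.3a3
$ ./configure --prefix=/opt/mpich-3.3a3-install \
      --with-device=ch3:nemesis:mxm \
      --with-mxm=/opt/hpcx-v2.0.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat6.9-x86_64/mxm
$ make -j && make install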

I am not sure what the difference is between the (closed-source) mxm 
libraries in the *MLNX_OFED* and *inbox* versions. Perhaps you could get 
more information from the mxm tech support.
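
If you rebuild, it may also be worth double-checking at run time that the 
MPICH library really resolves to the MLNX_OFED mxm rather than the inbox 
one, e.g. with ldd (the library path below is just my guess at your layout):

$ ldd /opt/mpich-3.3a3-install/lib/libmpi.so | grep -i mxm

The libmxm line it prints should point into the MLNX_OFED_LINUX tree, not 
the inbox tree.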

Min

On 2018/01/26 17:27, admin at genome.arizona.edu wrote:
> Min Si wrote on 01/26/2018 04:04 PM:
>> Could you please confirm that this is your configuration?
>> MPICH version: 3.2.1
>> ./configure --prefix=/opt/mpich-install --with-device=ch3:nemesis:mxm 
>> --with-mxm=/opt/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm
>
> Yes, this is correct.
>
>> Meanwhile, could you please also try mpich-3.3a3 (see 
>> http://www.mpich.org/downloads/)? It includes a few bug fixes.
>
> I tried with mpich-3.3a3 and there was still a segmentation fault:
>
> $ which mpirun
> /opt/mpich-3.3a3-install/bin/mpirun
>
> $ mpirun -np 2 -hostfile /tmp/machinelist ./osu_bibw
> [1517009019.605233] [n001:18235:0]         sys.c:744  MXM  WARN 
> Conflicting CPU frequencies detected, using: 2101.00
> [1517009019.650442] [n001:18235:0]    proto_ep.c:179  MXM  WARN tl dc 
> is requested but not supported
> [1517009019.651182] [n002:32467:0]    proto_ep.c:179  MXM  WARN tl dc 
> is requested but not supported
> # OSU MPI Bi-Directional Bandwidth Test v5.4.0
> # Size      Bandwidth (MB/s)
> 1                       1.57
> 2                       4.91
> 4                       9.80
> 8                      20.10
> 16                     40.09
> 32                     77.33
> 64                    149.54
> [n001:18235:0] Caught signal 11 (Segmentation fault)
> ==== backtrace ====
>  2 0x000000000005767c mxm_handle_error() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:641
>  3 0x00000000000577ec mxm_error_signal_handler() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:616
>  4 0x0000003c80832510 killpg()  ??:0
>  5 0x0000000000056258 mxm_mpool_put() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/util/datatype/mpool.c:210
>  6 0x00000000000689ce mxm_cib_ep_poll_tx() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/tl/cib/cib_progress.c:527
>  7 0x000000000006913d mxm_cib_ep_progress() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/tl/cib/cib_progress.c:552
>  8 0x000000000004268a mxm_notifier_chain_call() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/./mxm/util/datatype/callback.h:74
>  9 0x000000000004268a mxm_progress_internal() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/core/mxm.c:64
> 10 0x000000000004268a mxm_progress() 
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/core/mxm.c:346
> 11 0x0000000000177a49 MPID_nem_mxm_poll()  ??:0
> 12 0x0000000000169be8 MPIDI_CH3I_Progress()  ??:0
> 13 0x00000000000d0ba7 MPIR_Waitall_impl()  ??:0
> 14 0x00000000000d1308 PMPI_Waitall()  ??:0
> 15 0x00000000004016f5 main() 
> /opt/downloads/osu-micro-benchmarks-5.4/mpi/pt2pt/osu_bibw.c:124
> 16 0x0000003c8081ed1d __libc_start_main()  ??:0
> 17 0x0000000000401269 _start()  ??:0
> ===================
>
> =================================================================================== 
>
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 18235 RUNNING AT n001
> =   EXIT CODE: 139
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =================================================================================== 
>
> [proxy:0:1 at n002.genome.arizona.edu] HYD_pmcd_pmip_control_cmd_cb 
> (pm/pmiserv/pmip_cb.c:892): assert (!closed) failed
> [proxy:0:1 at n002.genome.arizona.edu] HYDT_dmxu_poll_wait_for_event 
> (tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:1 at n002.genome.arizona.edu] main (pm/pmiserv/pmip.c:202): 
> demux engine error waiting for event
> [mpiexec at pac.genome.arizona.edu] HYDT_bscu_wait_for_completion 
> (tools/bootstrap/utils/bscu_wait.c:76): one of the processes 
> terminated badly; aborting
> [mpiexec at pac.genome.arizona.edu] HYDT_bsci_wait_for_completion 
> (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting 
> for completion
> [mpiexec at pac.genome.arizona.edu] HYD_pmci_wait_for_completion 
> (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for 
> completion
> [mpiexec at pac.genome.arizona.edu] main (ui/mpich/mpiexec.c:340): 
> process manager error waiting for completion
>
>
> Thanks, Min

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

