[mpich-discuss] Segmentation fault with MXM
admin at genome.arizona.edu
Fri Jan 26 17:27:14 CST 2018
Min Si wrote on 01/26/2018 04:04 PM:
> Could you please confirm that this is your configure line?
> MPICH version: 3.2.1
> ./configure --prefix=/opt/mpich-install --with-device=ch3:nemesis:mxm
> --with-mxm=/opt/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm
Yes this is correct.
> Meanwhile, could you please also try mpich-3.3a3 (see
> http://www.mpich.org/downloads/) ? It includes a few bug fixes.
I tried with mpich-3.3a3 and there was still a segmentation fault:
$ which mpirun
/opt/mpich-3.3a3-install/bin/mpirun
$ mpirun -np 2 -hostfile /tmp/machinelist ./osu_bibw
[1517009019.605233] [n001:18235:0] sys.c:744 MXM WARN
Conflicting CPU frequencies detected, using: 2101.00
[1517009019.650442] [n001:18235:0] proto_ep.c:179 MXM WARN tl dc
is requested but not supported
[1517009019.651182] [n002:32467:0] proto_ep.c:179 MXM WARN tl dc
is requested but not supported
# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size Bandwidth (MB/s)
1 1.57
2 4.91
4 9.80
8 20.10
16 40.09
32 77.33
64 149.54
[n001:18235:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
2 0x000000000005767c mxm_handle_error()
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:641
3 0x00000000000577ec mxm_error_signal_handler()
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:616
4 0x0000003c80832510 killpg() ??:0
5 0x0000000000056258 mxm_mpool_put()
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/util/datatype/mpool.c:210
6 0x00000000000689ce mxm_cib_ep_poll_tx()
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/tl/cib/cib_progress.c:527
7 0x000000000006913d mxm_cib_ep_progress()
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/tl/cib/cib_progress.c:552
8 0x000000000004268a mxm_notifier_chain_call()
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/./mxm/util/datatype/callback.h:74
9 0x000000000004268a mxm_progress_internal()
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/core/mxm.c:64
10 0x000000000004268a mxm_progress()
/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u9-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm-v3.6/src/mxm/core/mxm.c:346
11 0x0000000000177a49 MPID_nem_mxm_poll() ??:0
12 0x0000000000169be8 MPIDI_CH3I_Progress() ??:0
13 0x00000000000d0ba7 MPIR_Waitall_impl() ??:0
14 0x00000000000d1308 PMPI_Waitall() ??:0
15 0x00000000004016f5 main()
/opt/downloads/osu-micro-benchmarks-5.4/mpi/pt2pt/osu_bibw.c:124
16 0x0000003c8081ed1d __libc_start_main() ??:0
17 0x0000000000401269 _start() ??:0
===================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 18235 RUNNING AT n001
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1 at n002.genome.arizona.edu] HYD_pmcd_pmip_control_cmd_cb
(pm/pmiserv/pmip_cb.c:892): assert (!closed) failed
[proxy:0:1 at n002.genome.arizona.edu] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1 at n002.genome.arizona.edu] main (pm/pmiserv/pmip.c:202): demux
engine error waiting for event
[mpiexec at pac.genome.arizona.edu] HYDT_bscu_wait_for_completion
(tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
badly; aborting
[mpiexec at pac.genome.arizona.edu] HYDT_bsci_wait_for_completion
(tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
for completion
[mpiexec at pac.genome.arizona.edu] HYD_pmci_wait_for_completion
(pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for
completion
[mpiexec at pac.genome.arizona.edu] main (ui/mpich/mpiexec.c:340): process
manager error waiting for completion
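For reference, the application frame in the backtrace (osu_bibw.c:124, frame 14, PMPI_Waitall) is the call that completes a window of nonblocking sends and receives; the crash happens while MXM's progress engine (mxm_cib_ep_poll_tx -> mxm_mpool_put) recycles buffers under that Waitall. A minimal sketch of the communication pattern involved -- window size, tag, and buffer handling here are illustrative, not copied from the benchmark source:

```c
/* Sketch of the osu_bibw-style bidirectional window: each of two
 * ranks posts a window of MPI_Irecv and MPI_Isend to its peer and
 * then completes them with MPI_Waitall. The segfault above fires
 * inside MPI progress during that Waitall. */
#include <mpi.h>
#include <stdlib.h>

#define WINDOW   64
#define MSG_SIZE 128   /* crash appeared between the 64- and 128-byte sizes */

int main(int argc, char **argv)
{
    int rank;
    MPI_Request sreq[WINDOW], rreq[WINDOW];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *sbuf = malloc(MSG_SIZE);
    char *rbuf = malloc(MSG_SIZE);
    int peer = 1 - rank;   /* two ranks, each exchanging with the other */

    /* Post receives first so the incoming window is never unexpected. */
    for (int j = 0; j < WINDOW; j++)
        MPI_Irecv(rbuf, MSG_SIZE, MPI_CHAR, peer, 1, MPI_COMM_WORLD, &rreq[j]);
    for (int j = 0; j < WINDOW; j++)
        MPI_Isend(sbuf, MSG_SIZE, MPI_CHAR, peer, 1, MPI_COMM_WORLD, &sreq[j]);

    MPI_Waitall(WINDOW, sreq, MPI_STATUSES_IGNORE);  /* <- the frame in the trace */
    MPI_Waitall(WINDOW, rreq, MPI_STATUSES_IGNORE);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
```

Since the same binary runs cleanly over the default ch3:nemesis device here, the fault looks specific to the MXM netmod's transmit-completion path rather than to the benchmark itself.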
Thanks, Min
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss