[mpich-discuss] Segmentation fault with MXM
admin at genome.arizona.edu
Mon Jan 29 18:56:56 CST 2018
Min Si wrote on 01/29/2018 03:50 PM:
> I confirmed this segfault issue with
> hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm. However, mpich-3.3a3 works
> fine with the MLNX_OFED version.
Thanks Min. The "inbox" software is what the OS provides, while OFED is the
Mellanox distribution. We had tried using OFED on our cluster, but it turns
out RDMA support for NFS was removed from it for an unknown reason (no
response from Mellanox). The Red Hat 'inbox' software supports NFS/RDMA just
fine, as it comes directly from the NFS team, and we get much better
performance with our disks.
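For reference, mounting NFS over RDMA with the inbox stack looks roughly like
this (the server name and export path below are just placeholders, not our
actual setup):

# client side: load the RDMA transport module, then mount with proto=rdma
# (20049 is the standard NFS/RDMA port)
$ sudo modprobe xprtrdma
$ sudo mount -t nfs -o proto=rdma,port=20049 nfs-server:/export /mnt/data
$ mount | grep rdma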
Since the MXM software is separate, it may still work. I did try compiling
MPICH-3.3a3 with the OFED MXM as you suggested, and the bandwidth test
completes without error. However, I noticed we could no longer use our
standard hostfile, which uses hostnames like this:
n001:1
n002:1
With that hostfile, the osu_bibw (bi-directional bandwidth) test only reached
about 234 MB/s, which means it was running over the 1Gb Ethernet. Even though
"n001" does not map to the InfiniBand IP address in /etc/hosts, MVAPICH was
able to figure that out on its own, whereas here I had to use a modified
hostfile with the InfiniBand addresses, like this:
10.10.11.1:1
10.10.11.2:1
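Presumably one could also keep the hostname-style hostfile and point Hydra at
the InfiniBand interface with its -iface option instead. I have not verified
whether that helps with the MXM netmod, and "ib0" below is just an assumption
for the interface name on our nodes:

# untested alternative: keep n001/n002 in the hostfile and tell Hydra
# which interface to use; check the actual name with "ip addr"
$ mpirun -n 2 -hostfile /tmp/machinelist -iface ib0 ./osu_bibw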
Even with this definition, the osu_bibw results were slower than with
MVAPICH. There are also errors when using MPICH-3.2.1 with the OFED MXM,
perhaps due to the mixture of inbox and OFED software; see below for the test
results. It seems we should rely on MPICH-3.3a3 or MVAPICH2 if we want to use
InfiniBand.
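If it is the inbox/OFED mixture, a quick check I would try is to see which
verbs and MXM libraries a given build actually resolves at run time; an OFED
libmxm paired with the inbox libibverbs would point to that kind of mismatch:

# which libmxm/libibverbs does the benchmark binary pick up?
$ ldd ./osu_bibw | grep -E 'mxm|ibverbs'
# and is the verbs stack itself healthy on this node?
$ ibv_devinfo | head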
Thanks
$ which mpirun
/opt/mpich-3.3a3-install/bin/mpirun
$ mpirun -n 2 -hostfile /tmp/machinelist ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size Bandwidth (MB/s)
1 0.26
2 0.50
4 0.57
8 2.02
16 4.14
32 7.68
64 15.60
128 26.25
256 47.55
512 72.63
1024 105.92
2048 194.08
4096 361.07
8192 696.26
16384 1088.92
32768 1496.74
65536 1709.40
131072 1733.74
262144 1918.49
524288 1945.67
1048576 1963.35
2097152 1639.91
4194304 1752.34
$
<change to MVAPICH2 and re-login>
$ which mpirun
/opt/mvapich2-install/bin/mpirun
$ mpirun -n 2 -hostfile /tmp/machinelist ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size Bandwidth (MB/s)
1 0.63
2 1.41
4 2.80
8 5.58
16 11.19
32 22.29
64 43.83
128 185.59
256 408.23
512 770.38
1024 1399.30
2048 2441.95
4096 4001.28
8192 4545.17
16384 5979.06
32768 8097.11
65536 8781.06
131072 8830.59
262144 7966.66
524288 7761.30
1048576 8099.95
2097152 8851.12
4194304 9107.43
$
<change to MPICH-3.2.1 and re-login>
$ which mpirun
/opt/mpich-install/bin/mpirun
$ mpirun -n 2 -hostfile /tmp/machinelist ./osu_bibw
[1517272925.686782] [n002:21714:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 2101.00
[1517272925.750534] [n001:6646 :0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 2101.00
[1517272925.759946] [n002:21714:0] ib_dev.c:533 MXM WARN failed call to ibv_exp_use_priv_env(): Function not implemented
[1517272925.759991] [n002:21714:0] ib_dev.c:544 MXM ERROR ibv_query_device() returned 38: Function not implemented
[1517272925.755605] [n001:6646 :0] ib_dev.c:533 MXM WARN failed call to ibv_exp_use_priv_env(): Function not implemented
[1517272925.755624] [n001:6646 :0] ib_dev.c:544 MXM ERROR ibv_query_device() returned 38: Function not implemented
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(474).........:
MPID_Init(190)................: channel initialization failed
MPIDI_CH3_Init(89)............:
MPID_nem_init(320)............:
MPID_nem_mxm_init(163)........:
MPID_nem_mxm_get_ordering(469): mxm_init failed (Input/output error)
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(474).........:
MPID_Init(190)................: channel initialization failed
MPIDI_CH3_Init(89)............:
MPID_nem_init(320)............:
MPID_nem_mxm_init(163)........:
MPID_nem_mxm_get_ordering(469): mxm_init failed (Input/output error)
$