[mpich-discuss] Segmentation fault with MXM

admin at genome.arizona.edu
Mon Jan 29 18:56:56 CST 2018


Min Si wrote on 01/29/2018 03:50 PM:
> I confirmed this segfault issue with 
> hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm. However, mpich-3.3a3 works 
> fine with the MLNX_OFED version. 

Thanks Min.  The "inbox" stack refers to the software provided by the OS, 
while OFED is the Mellanox version.  We had tried using OFED on our 
cluster, but it turns out that RDMA support for NFS was removed from it 
for an unknown reason (no response from Mellanox).  The Red Hat 'inbox' 
software supports NFS/RDMA just fine, as it comes directly from the NFS 
team, and we get much better performance with our disks.

Since the MXM software is separate from the NFS pieces, it may still 
work for us.  I did try compiling MPICH-3.3a3 with the OFED MXM as you 
suggested, and the bandwidth test completes without error.  However, I 
noticed we could no longer use our standard hostfile, which lists 
hostnames (the :1 gives the number of processes per host) like this:

n001:1
n002:1

With the osu_bibw (bi-directional bandwidth) test, the speed was only 
about 234 MB/s, i.e. the traffic was going over 1 Gb Ethernet rather 
than InfiniBand.  Even though "n001" does not resolve to an InfiniBand 
IP address in /etc/hosts, MVAPICH was able to figure that out on its 
own.  So with MPICH I had to use a modified hostfile with the 
InfiniBand addresses, like this:

10.10.11.1:1
10.10.11.2:1
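
A possible workaround to keep names in the hostfile (I have not tried 
this) would be to add aliases for the InfiniBand addresses in 
/etc/hosts; the -ib suffix here is just an example:

10.10.11.1    n001-ib
10.10.11.2    n002-ib

and then use those aliases in the hostfile:

n001-ib:1
n002-ib:1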

Even with this definition, the osu_bibw results were slower than with 
MVAPICH2: MPICH-3.3a3 peaked at about 1.96 GB/s, while MVAPICH2 reached 
over 9 GB/s on the same pair of nodes.  As well, there are errors when 
using MPICH-3.2.1 with the OFED MXM, perhaps due to the mixture of 
inbox/OFED libraries; see below for the test results.  It seems we 
should rely on MPICH-3.3a3 or MVAPICH2 if we want to use InfiniBand.
Thanks
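
For reference, this is roughly how I configured MPICH-3.3a3 against the 
OFED MXM; the prefix and MXM path are our local install locations, and 
I am going from memory on the exact flags:

$ ./configure --prefix=/opt/mpich-3.3a3-install \
              --with-device=ch3:nemesis:mxm \
              --with-mxm=/opt/mellanox/mxm
$ make -j8 && make install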


$ which mpirun
/opt/mpich-3.3a3-install/bin/mpirun

$ mpirun -n 2 -hostfile /tmp/machinelist ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size      Bandwidth (MB/s)
1                       0.26
2                       0.50
4                       0.57
8                       2.02
16                      4.14
32                      7.68
64                     15.60
128                    26.25
256                    47.55
512                    72.63
1024                  105.92
2048                  194.08
4096                  361.07
8192                  696.26
16384                1088.92
32768                1496.74
65536                1709.40
131072               1733.74
262144               1918.49
524288               1945.67
1048576              1963.35
2097152              1639.91
4194304              1752.34

$

<change to MVAPICH2 and re-login>

$ which mpirun
/opt/mvapich2-install/bin/mpirun

$ mpirun -n 2 -hostfile /tmp/machinelist ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size      Bandwidth (MB/s)
1                       0.63
2                       1.41
4                       2.80
8                       5.58
16                     11.19
32                     22.29
64                     43.83
128                   185.59
256                   408.23
512                   770.38
1024                 1399.30
2048                 2441.95
4096                 4001.28
8192                 4545.17
16384                5979.06
32768                8097.11
65536                8781.06
131072               8830.59
262144               7966.66
524288               7761.30
1048576              8099.95
2097152              8851.12
4194304              9107.43
$

<change to MPICH-3.2.1 and re-login>

$ which mpirun
/opt/mpich-install/bin/mpirun

$ mpirun -n 2 -hostfile /tmp/machinelist ./osu_bibw
[1517272925.686782] [n002:21714:0]         sys.c:744  MXM  WARN 
Conflicting CPU frequencies detected, using: 2101.00
[1517272925.750534] [n001:6646 :0]         sys.c:744  MXM  WARN 
Conflicting CPU frequencies detected, using: 2101.00
[1517272925.759946] [n002:21714:0]      ib_dev.c:533  MXM  WARN  failed 
call to ibv_exp_use_priv_env(): Function not implemented
[1517272925.759991] [n002:21714:0]      ib_dev.c:544  MXM  ERROR 
ibv_query_device() returned 38: Function not implemented
[1517272925.755605] [n001:6646 :0]      ib_dev.c:533  MXM  WARN  failed 
call to ibv_exp_use_priv_env(): Function not implemented
[1517272925.755624] [n001:6646 :0]      ib_dev.c:544  MXM  ERROR 
ibv_query_device() returned 38: Function not implemented
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(474).........:
MPID_Init(190)................: channel initialization failed
MPIDI_CH3_Init(89)............:
MPID_nem_init(320)............:
MPID_nem_mxm_init(163)........:
MPID_nem_mxm_get_ordering(469): mxm_init failed (Input/output error)
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(474).........:
MPID_Init(190)................: channel initialization failed
MPIDI_CH3_Init(89)............:
MPID_nem_init(320)............:
MPID_nem_mxm_init(163)........:
MPID_nem_mxm_get_ordering(469): mxm_init failed (Input/output error)
$
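
Error 38 is ENOSYS ("Function not implemented"), which suggests MXM is 
calling Mellanox's extended verbs (ibv_exp_*) against the inbox 
libibverbs, which does not provide them.  If anyone wants to check 
which libibverbs a build actually picks up, something like this should 
show it (paths will differ per system):

$ ldd /opt/mpich-install/lib/libmpi.so | grep -i -E 'ibverbs|mxm'
$ ldconfig -p | grep libibverbs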