[mpich-discuss] Segmentation fault with MXM

Min Si msi at anl.gov
Tue Jan 30 18:39:26 CST 2018


MVAPICH does not rely on the mxm layer; it uses InfiniBand verbs 
directly. Thus, you do not need the Mellanox add-ons (e.g., mxm) to 
run MVAPICH.

I would suggest contacting Mellanox tech support about the Mellanox 
OFED installation issue; it looks like the hostname issue is also 
caused by that. We generally expect Mellanox OFED to provide the best 
performance on Mellanox InfiniBand, and all ongoing InfiniBand 
development in MPICH relies on the Mellanox stack (e.g., hcoll, ucx).
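
For reference, here is a rough sketch of the configure lines I would 
expect for each stack; the install prefixes and library paths below 
are only placeholders and should be adjusted for your system:

  # MPICH 3.3a3 with the mxm netmod (ch3/nemesis); point --with-mxm at
  # the Mellanox MXM install from HPC-X or MLNX_OFED (path is a placeholder)
  ./configure --prefix=/opt/mpich-3.3a3-install \
              --with-device=ch3:nemesis:mxm \
              --with-mxm=/opt/mellanox/mxm

  # MPICH 3.3 with the newer UCX device instead of mxm
  ./configure --prefix=/opt/mpich-ucx-install \
              --with-device=ch4:ucx \
              --with-ucx=/path/to/ucx

  # MVAPICH2 talks to InfiniBand verbs directly, so no Mellanox add-ons
  ./configure --prefix=/opt/mvapich2-install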

Min


On 2018/01/29 18:56, admin at genome.arizona.edu wrote:
> Min Si wrote on 01/29/2018 03:50 PM:
>> I confirmed this segfault issue with 
>> hpcx-v2.0.0-gcc-inbox-redhat6.9-x86_64/mxm. However, mpich-3.3a3 
>> works fine with the MLNX_OFED version. 
>
> Thanks Min.  The "inbox" software refers to what is provided by the 
> OS, while OFED is the Mellanox version.  We had tried using OFED on 
> our cluster, but it turns out RDMA support for NFS was removed for an 
> unknown reason (no response from Mellanox).  The Red Hat 'inbox' 
> software supports NFS/RDMA just fine, as it comes directly from the 
> NFS team, and we get much better performance with our disks.
>
> Since the MXM software is separate, it may still work.  I did try 
> compiling MPICH-3.3a3 with the OFED MXM as you suggested, and the 
> bandwidth test completes without error.  However, I noticed we could 
> no longer use a standard hostfile with hostnames like this:
>
> n001:1
> n002:1
>
> With the osu_bibw (bi-directional bandwidth) test, the speed was only 
> about 234 MB/s because it was going over 1Gb Ethernet.  Even though 
> "n001" does not resolve to an InfiniBand IP address in /etc/hosts, 
> MVAPICH was able to figure that out on its own.  So with MPICH I had 
> to use a modified hostfile like this (but see the note on Hydra's 
> -iface option just after it):
>
> 10.10.11.1:1
> 10.10.11.2:1
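>
> A possible alternative (untested here) would be to keep the 
> hostname-based hostfile and tell Hydra which interface to use with 
> its -iface option; the IPoIB interface name "ib0" and the hostfile 
> name below are just guesses for our nodes:
>
> $ cat /tmp/machinelist.hostnames
> n001:1
> n002:1
> $ mpirun -n 2 -iface ib0 -hostfile /tmp/machinelist.hostnames ./osu_bibw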
>
> Even with this definition, the osu_bibw results were still slower 
> than with MVAPICH.  In addition, there are errors when using 
> MPICH-3.2.1 with the OFED MXM, perhaps due to the mixture of inbox 
> and OFED software; see below for the test results.  It seems we 
> should rely on MPICH-3.3a3 or MVAPICH2 if we want to use InfiniBand.
> Thanks
>
>
> $ which mpirun
> /opt/mpich-3.3a3-install/bin/mpirun
>
> $ mpirun -n 2 -hostfile /tmp/machinelist ./osu_bibw
> # OSU MPI Bi-Directional Bandwidth Test v5.4.0
> # Size      Bandwidth (MB/s)
> 1                       0.26
> 2                       0.50
> 4                       0.57
> 8                       2.02
> 16                      4.14
> 32                      7.68
> 64                     15.60
> 128                    26.25
> 256                    47.55
> 512                    72.63
> 1024                  105.92
> 2048                  194.08
> 4096                  361.07
> 8192                  696.26
> 16384                1088.92
> 32768                1496.74
> 65536                1709.40
> 131072               1733.74
> 262144               1918.49
> 524288               1945.67
> 1048576              1963.35
> 2097152              1639.91
> 4194304              1752.34
>
> $
>
> <change to MVAPICH2 and re-login>
>
> $ which mpirun
> /opt/mvapich2-install/bin/mpirun
>
> $ mpirun -n 2 -hostfile /tmp/machinelist ./osu_bibw
> # OSU MPI Bi-Directional Bandwidth Test v5.4.0
> # Size      Bandwidth (MB/s)
> 1                       0.63
> 2                       1.41
> 4                       2.80
> 8                       5.58
> 16                     11.19
> 32                     22.29
> 64                     43.83
> 128                   185.59
> 256                   408.23
> 512                   770.38
> 1024                 1399.30
> 2048                 2441.95
> 4096                 4001.28
> 8192                 4545.17
> 16384                5979.06
> 32768                8097.11
> 65536                8781.06
> 131072               8830.59
> 262144               7966.66
> 524288               7761.30
> 1048576              8099.95
> 2097152              8851.12
> 4194304              9107.43
> $
>
> <change to MPICH-3.2.1 and re-login>
>
> $ which mpirun
> /opt/mpich-install/bin/mpirun
>
> $ mpirun -n 2 -hostfile /tmp/machinelist ./osu_bibw
> [1517272925.686782] [n002:21714:0]         sys.c:744  MXM  WARN 
> Conflicting CPU frequencies detected, using: 2101.00
> [1517272925.750534] [n001:6646 :0]         sys.c:744  MXM  WARN 
> Conflicting CPU frequencies detected, using: 2101.00
> [1517272925.759946] [n002:21714:0]      ib_dev.c:533  MXM  WARN failed 
> call to ibv_exp_use_priv_env(): Function not implemented
> [1517272925.759991] [n002:21714:0]      ib_dev.c:544  MXM  ERROR 
> ibv_query_device() returned 38: Function not implemented
> [1517272925.755605] [n001:6646 :0]      ib_dev.c:533  MXM  WARN failed 
> call to ibv_exp_use_priv_env(): Function not implemented
> [1517272925.755624] [n001:6646 :0]      ib_dev.c:544  MXM  ERROR 
> ibv_query_device() returned 38: Function not implemented
> Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(474).........:
> MPID_Init(190)................: channel initialization failed
> MPIDI_CH3_Init(89)............:
> MPID_nem_init(320)............:
> MPID_nem_mxm_init(163)........:
> MPID_nem_mxm_get_ordering(469): mxm_init failed (Input/output error)
> Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(474).........:
> MPID_Init(190)................: channel initialization failed
> MPIDI_CH3_Init(89)............:
> MPID_nem_init(320)............:
> MPID_nem_mxm_init(163)........:
> MPID_nem_mxm_get_ordering(469): mxm_init failed (Input/output error)
> $
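>
> The "Function not implemented" errors above would be consistent with 
> the OFED MXM library picking up the inbox libibverbs.  One way to 
> check that guess (the libmxm.so path is only a guess for our install) 
> is to look at what the library actually links against and whether 
> plain verbs queries still work:
>
> $ ldd /opt/mellanox/mxm/lib/libmxm.so | grep -i verbs
> $ ibv_devinfo | head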

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

