[mpich-discuss] How to get srun/mpich to use the right interface?
John Cary
cary at colorado.edu
Thu Jun 11 11:33:34 CDT 2026
Thanks, Hui!
I am trying to have one build that works for our cluster and also for
AWS, which uses ofi. Is that possible?
Thx....John
On 6/11/26 10:06 AM, Zhou, Hui wrote:
> Hi John,
>
> Try using UCX instead of libfabric. You can configure MPICH with
> `./configure --with-device=ch4:ucx` to use UCX. With libfabric, could
> you try setting the environment variable `FI_PROVIDER=verbs`?
>
> If you still have issues, try running a dummy MPI program with
> `MPIR_CVAR_DEBUG_SUMMARY=1` set. That will provide more logging details on
> which libfabric provider is being selected.
>
> --
> Hui Zhou
> ------------------------------------------------------------------------
> *From:* John Cary via discuss <discuss at mpich.org>
> *Sent:* Thursday, June 11, 2026 8:50 AM
> *To:* discuss at mpich.org <discuss at mpich.org>
> *Cc:* John Cary <cary at colorado.edu>
> *Subject:* [mpich-discuss] How to get srun/mpich to use the right
> interface?
> How to get mpich to use the right interface?
>
> mpich configured and built with libfabric as shown below. It is run
> using slurm (srun). The result is
>
> [1781142435.559314543] ne07:rank64.vorpal: Failed to modify UD QP to
> INIT on mlx5_0: Operation not permitted
> [1781142435.563376291] ne07:rank66.vorpal: Failed to modify UD QP to
> INIT on mlx5_0: Operation not permitted
> Abort(203572367): Fatal error in internal_Init: Other MPI error, error stack
> ...
>
> which I think means that mpich is trying to run over the mlx5_0
> interface, which does not exist. The interfaces are
>
> eno1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
> eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
> ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
> lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
>
> and we want to use ib0, infiniband. We tried
>
> export HYDRA_IFACE=ib0
> srun ...
>
> and got the same error.
>
> How can srun/mpich be instructed to use the ib0 interface by default?
>
> Also, how can one see which interface mpich is choosing?
>
> Thx...
>
>
>
> '/user/builds-linux-centos8-zen2/gvxsimall/mpich-5.0.1/configure' \
> --prefix=/home/research/user/installs/linux-centos8-zen2/contrib-gcc1140/mpich-5.0.1-shared \
> --enable-shared \
> --disable-static \
> CC='/home/common/gcc-11.4.0/bin/gcc' \
> CXX='/home/common/gcc-11.4.0/bin/g++' \
> FC='/home/common/gcc-11.4.0/bin/gfortran' \
> F77='/home/common/gcc-11.4.0/bin/gfortran' \
> CFLAGS='-pthread -pipe -fPIC' \
> CXXFLAGS='-pthread -Wno-deprecated-declarations -pipe -fPIC' \
> FFLAGS='-fallow-argument-mismatch -pipe -fPIC' \
> FCFLAGS='-fallow-argument-mismatch -pipe -fPIC' \
> LDFLAGS='-L/home/common/gcc-11.4.0/lib64 -Wl,-rpath,/home/common/gcc-11.4.0/lib64' \
> LDSHARED='-L/home/common/gcc-11.4.0/lib64 -Wl,-rpath,/home/common/gcc-11.4.0/lib64' \
> --with-libfabric=install \
> --with-device=ch4:ofi \
> --without-cuda \
> --disable-gl
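
For anyone following the thread, the suggestions in the quoted reply can be tried with a short script along these lines. This is only a sketch under stated assumptions: `FI_PROVIDER` and `MPIR_CVAR_DEBUG_SUMMARY` come from the reply itself, but the test program `./hello_mpi`, the `MPI_TEST_PROG` variable, and the `list_rdma_devices` helper are hypothetical names introduced here for illustration. The helper assumes the usual Linux sysfs layout, where RDMA devices appear under /sys/class/infiniband and any attached network interfaces under `device/net`; names of the form mlx5_0 are typically RDMA device names rather than network interface names, so the listing may help relate such a device to ib0.

```shell
#!/bin/sh
# Variables named in the quoted reply.
export FI_PROVIDER=verbs            # ask libfabric to use the verbs provider
export MPIR_CVAR_DEBUG_SUMMARY=1    # ask MPICH to log what it selected at init

# Hypothetical helper: list RDMA devices and the network interfaces
# attached to each. The sysfs root is a parameter so the function can be
# pointed at a different directory; the default assumes the usual layout.
list_rdma_devices() {
    root=${1:-/sys/class/infiniband}
    for dev in "$root"/*; do
        [ -d "$dev" ] || continue
        nets=$(ls "$dev/device/net" 2>/dev/null | xargs)
        echo "$(basename "$dev"): ${nets:-no netdev found}"
    done
}

list_rdma_devices

# Hypothetical test binary; replace with any small MPI program.
MPI_TEST_PROG=${MPI_TEST_PROG:-./hello_mpi}
if command -v srun >/dev/null 2>&1 && [ -x "$MPI_TEST_PROG" ]; then
    srun -n 2 "$MPI_TEST_PROG"
else
    echo "srun or $MPI_TEST_PROG not available; nothing launched"
fi
```

If the run starts, the extra logging requested by `MPIR_CVAR_DEBUG_SUMMARY=1` should indicate which libfabric provider was selected, which addresses the "how can one see" question in the original post.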