[mpich-discuss] How to get srun/mpich to use the right interface?

John Cary cary at colorado.edu
Thu Jun 11 11:33:34 CDT 2026


Thanks, Hui!

I am trying to have one build that works for our cluster and also for 
AWS, which uses ofi.  Is that possible?

Thx....John
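
For reference, the two suggestions quoted below could be tried from the shell roughly as follows. This is only a minimal sketch, assuming the libfabric used by this build includes the verbs provider; the srun arguments are elided in the original message, so the launch line is left as a comment.

```shell
# Ask libfabric to use the verbs provider (assumes that provider is
# available in the libfabric this MPICH build uses).
export FI_PROVIDER=verbs

# Ask MPICH to print a debug summary at init, which should report which
# libfabric provider was selected.
export MPIR_CVAR_DEBUG_SUMMARY=1

# Launch as before; the arguments are elided in the original message:
# srun ...
```

If the debug summary still reports a different provider, that points to the environment not reaching the ranks rather than to the build.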



On 6/11/26 10:06 AM, Zhou, Hui wrote:
> Hi John,
>
> Try using UCX instead of libfabric. You can configure MPICH with
> `./configure --with-device=ch4:ucx` to use UCX. With libfabric, could
> you try setting the environment variable `FI_PROVIDER=verbs`?
>
> If you still have issues, try running a dummy MPI program with
> `MPIR_CVAR_DEBUG_SUMMARY=1` set. That will provide more logging details on
> which libfabric provider is being selected.
>
> -- 
> Hui Zhou
> ------------------------------------------------------------------------
> *From:* John Cary via discuss <discuss at mpich.org>
> *Sent:* Thursday, June 11, 2026 8:50 AM
> *To:* discuss at mpich.org <discuss at mpich.org>
> *Cc:* John Cary <cary at colorado.edu>
> *Subject:* [mpich-discuss] How to get srun/mpich to use the right 
> interface?
>
> How to get mpich to use the right interface?
>
> mpich configured and built with libfabric as shown below.  It is run
> using slurm (srun).  The result is
>
> [1781142435.559314543] ne07:rank64.vorpal: Failed to modify UD QP to
> INIT on mlx5_0: Operation not permitted
> [1781142435.563376291] ne07:rank66.vorpal: Failed to modify UD QP to
> INIT on mlx5_0: Operation not permitted
> Abort(203572367): Fatal error in internal_Init: Other MPI error, error stack
> ...
>
> which I think means that mpich is trying to run over the mlx5_0
> interface, which does not exist.  The interfaces are
>
> eno1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
> eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
> lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
>
> and we want to use ib0, infiniband.  We tried
>
> export HYDRA_IFACE=ib0
> srun ...
>
> and got the same error.
>
> How can srun/mpich be instructed to use the ib0 interface by default?
>
> Also, how can one see which interface mpich is choosing?
>
> Thx...
>
>
>
> '/user/builds-linux-centos8-zen2/gvxsimall/mpich-5.0.1/configure' \
>     --prefix=/home/research/user/installs/linux-centos8-zen2/contrib-gcc1140/mpich-5.0.1-shared \
>     --enable-shared \
>     --disable-static \
>     CC='/home/common/gcc-11.4.0/bin/gcc' \
>     CXX='/home/common/gcc-11.4.0/bin/g++' \
>     FC='/home/common/gcc-11.4.0/bin/gfortran' \
>     F77='/home/common/gcc-11.4.0/bin/gfortran' \
>     CFLAGS='-pthread -pipe -fPIC' \
>     CXXFLAGS='-pthread -Wno-deprecated-declarations -pipe -fPIC' \
>     FFLAGS='-fallow-argument-mismatch -pipe -fPIC' \
>     FCFLAGS='-fallow-argument-mismatch -pipe -fPIC' \
>     LDFLAGS='-L/home/common/gcc-11.4.0/lib64 -Wl,-rpath,/home/common/gcc-11.4.0/lib64' \
>     LDSHARED='-L/home/common/gcc-11.4.0/lib64 -Wl,-rpath,/home/common/gcc-11.4.0/lib64' \
>     --with-libfabric=install \
>     --with-device=ch4:ofi \
>     --without-cuda \
>     --disable-gl