[mpich-discuss] How to get srun/mpich to use the right interface?

Zhou, Hui zhouh at anl.gov
Thu Jun 11 12:02:33 CDT 2026


Sure. Try setting `FI_PROVIDER=verbs` to see if it works. Also try setting `MPIR_CVAR_DEBUG_SUMMARY=1` and send me the console log.
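
As a sketch of the suggestion above, a small shell helper can set both variables for a single run and keep the console output in a file that can be attached to a reply. The function name `run_with_debug` and the `LOGFILE` variable are illustrative and not part of MPICH or Slurm. The sketch also assumes that the launcher passes the caller's environment through to the ranks.

```shell
# Hypothetical helper: run any command with the two variables from this
# message set, and keep a copy of stdout and stderr in a log file.
run_with_debug() {
    # LOGFILE is an illustrative name; override it to change the destination.
    logfile="${LOGFILE:-mpich-debug-summary.log}"
    # env sets the variables only for the launched command, not for the shell.
    env FI_PROVIDER=verbs MPIR_CVAR_DEBUG_SUMMARY=1 "$@" 2>&1 | tee "$logfile"
}
```

It would be called as `run_with_debug srun ...` with the usual srun arguments, after which the log file holds the console output.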

Hui
________________________________
From: John Cary <cary at colorado.edu>
Sent: Thursday, June 11, 2026 11:33 AM
To: discuss at mpich.org <discuss at mpich.org>; Zhou, Hui <zhouh at anl.gov>
Subject: Re: [mpich-discuss] How to get srun/mpich to use the right interface?

Thanks, Hui!

I am trying to have one build that works for our cluster and also for AWS, which uses ofi.  Is that possible?

Thx....John



On 6/11/26 10:06 AM, Zhou, Hui wrote:

Hi John,

Try using UCX instead of libfabric. You can configure MPICH with `./configure --with-device=ch4:ucx` to use UCX. With libfabric, could you try setting the environment variable `FI_PROVIDER=verbs`?
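
The two build options mentioned here differ only in the device flags passed to configure. A minimal sketch, assuming a helper named `device_flags` (hypothetical, not an MPICH tool) that prints the flags for either choice; the `ch4:ucx` and `ch4:ofi` strings and `--with-libfabric=install` are taken from this thread, and every other configure option would stay as in the original command.

```shell
# Hypothetical helper: print the configure flags for one netmod choice.
# The ch4:ucx and ch4:ofi device strings come from this thread;
# --with-libfabric=install is copied from the original configure command.
device_flags() {
    case "$1" in
        ucx) printf '%s\n' '--with-device=ch4:ucx' ;;
        ofi) printf '%s\n' '--with-device=ch4:ofi --with-libfabric=install' ;;
        *)   echo "usage: device_flags ucx|ofi" >&2; return 1 ;;
    esac
}
```

For example, `./configure $(device_flags ucx)` followed by the remaining options; the command substitution is left unquoted on purpose so that the flags are split into separate arguments.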

If you still have issues, try running a dummy MPI program with `MPIR_CVAR_DEBUG_SUMMARY=1` set. That will provide more logging details on which libfabric provider is being selected.
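
A dummy program for this purpose only needs to initialize and finalize MPI. The following is a sketch of one way to generate it from the shell; the function name `write_dummy_mpi` and the default file name `mpi_dummy.c` are illustrative choices, not anything prescribed by MPICH.

```shell
# Hypothetical helper: write a minimal MPI program to the given path
# (default: mpi_dummy.c, an illustrative name).
write_dummy_mpi() {
    src="${1:-mpi_dummy.c}"
    # The quoted heredoc delimiter keeps the C source free of shell expansion.
    cat > "$src" <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
}
```

After `write_dummy_mpi`, the file could be compiled with `mpicc mpi_dummy.c -o mpi_dummy` from the MPICH install under test and launched with `MPIR_CVAR_DEBUG_SUMMARY=1` in the environment of the usual srun command.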

--
Hui Zhou
________________________________
From: John Cary via discuss <discuss at mpich.org>
Sent: Thursday, June 11, 2026 8:50 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: John Cary <cary at colorado.edu>
Subject: [mpich-discuss] How to get srun/mpich to use the right interface?

How to get mpich to use the right interface?

MPICH was configured and built with libfabric as shown below.  It is run
using Slurm (srun).  The result is

[1781142435.559314543] ne07:rank64.vorpal: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted
[1781142435.563376291] ne07:rank66.vorpal: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted
Abort(203572367): Fatal error in internal_Init: Other MPI error, error stack
...

which I think means that mpich is trying to run over the mlx5_0
interface, which does not exist.  The interfaces are

eno1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536

and we want to use ib0, infiniband.  We tried

export HYDRA_IFACE=ib0
srun ...

and got the same error.
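
One thing that may be worth checking here: names such as mlx5_0 usually refer to RDMA devices, which live in a separate namespace from network interface names like ib0. This is an assumption and is not confirmed anywhere in the thread. A sketch that lists both sets of names from sysfs, using a hypothetical helper `list_fabric_names`:

```shell
# Hypothetical helper: list RDMA device names and network interface names.
# The sysfs root is a parameter only so the function can be exercised
# against a scratch directory; on a real node leave it at /sys.
list_fabric_names() {
    sysroot="${1:-/sys}"
    for class in infiniband net; do
        printf '%s:' "$class"
        if [ -d "$sysroot/class/$class" ]; then
            for entry in "$sysroot/class/$class"/*; do
                # Skip the unexpanded glob when the directory is empty.
                if [ -e "$entry" ]; then
                    printf ' %s' "${entry##*/}"
                fi
            done
        fi
        printf '\n'
    done
}
```

Running `list_fabric_names` on a compute node prints one line per class; if mlx5_0 shows up under `infiniband`, the device named in the error does exist and the failure lies elsewhere.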

How can srun/mpich be instructed to use the ib0 interface by default?

Also, how can one see which interface mpich is choosing?

Thx...



'/user/builds-linux-centos8-zen2/gvxsimall/mpich-5.0.1/configure' \
   --prefix=/home/research/user/installs/linux-centos8-zen2/contrib-gcc1140/mpich-5.0.1-shared \
   --enable-shared \
   --disable-static \
   CC='/home/common/gcc-11.4.0/bin/gcc' \
   CXX='/home/common/gcc-11.4.0/bin/g++' \
   FC='/home/common/gcc-11.4.0/bin/gfortran' \
   F77='/home/common/gcc-11.4.0/bin/gfortran' \
   CFLAGS='-pthread -pipe -fPIC' \
   CXXFLAGS='-pthread -Wno-deprecated-declarations -pipe -fPIC' \
   FFLAGS='-fallow-argument-mismatch -pipe -fPIC' \
   FCFLAGS='-fallow-argument-mismatch -pipe -fPIC' \
   LDFLAGS='-L/home/common/gcc-11.4.0/lib64 -Wl,-rpath,/home/common/gcc-11.4.0/lib64' \
   LDSHARED='-L/home/common/gcc-11.4.0/lib64 -Wl,-rpath,/home/common/gcc-11.4.0/lib64' \
   --with-libfabric=install \
   --with-device=ch4:ofi \
   --without-cuda \
   --disable-gl

