[mpich-discuss] Building for Heterogeneous Clusters

Zhou, Hui zhouh at anl.gov
Fri Jan 7 12:46:07 CST 2022

MPICH doesn't support building with both libfabric and ucx at this point. We understand this use case and we may support it in the future.
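
Since a single build cannot carry both netmods, one possible workaround is two separate installs of the same MPICH version, one per netmod, each exposed through its own module. The sketch below is an assumption, not an MPICH recommendation. The `MPICH_ROOT` layout and the `configure_cmd` helper are hypothetical; only `--prefix` and `--with-device=ch4:ofi` / `--with-device=ch4:ucx` are actual configure options. It prints the configure commands rather than running them.

```shell
#!/bin/sh
# Hypothetical layout: one install prefix per netmod under $MPICH_ROOT.
MPICH_ROOT="${MPICH_ROOT:-$HOME/mpich/4.0b1}"

# Print the configure command line for one netmod ("ofi" or "ucx").
configure_cmd() {
    netmod="$1"
    echo "./configure --prefix=$MPICH_ROOT/$netmod --with-device=ch4:$netmod"
}

# Show both command lines; run each one in its own clean build tree.
for netmod in ofi ucx; do
    configure_cmd "$netmod"
done
```

Each prefix would then get its own modulefile, so users load whichever build matches the partition they compile and run on.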

Hui Zhou

From: Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] via discuss <discuss at mpich.org>
Sent: Friday, January 7, 2022 11:16 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] <matthew.thompson at nasa.gov>
Subject: [mpich-discuss] Building for Heterogeneous Clusters

Dear MPICH Gurus,

I recently built MPICH 4.0b1 on a cluster I work on that happens to have a couple of different interconnects you can run on. One part is Omnipath and the other is Infiniband.

Now, when I built MPICH, I did so on the Infiniband cluster and a user recently tried my module and:

$ mpifort -o helloWorld.mpi3.MPICH.PSM2.exe helloWorld.mpi3.F90

/usr/bin/ld: /discover/swdev/gmao_SIteam/MPI/mpich/4.0b1/gcc-11.2.0/lib/libmpi.so: undefined reference to `rdma_establish@RDMACM_1.0'

collect2: error: ld returned 1 exit status

Turns out, you try to use that MPICH on the Omnipath cluster, and boom. My "solution" for him was "Build and run on the Infiniband cluster" and that's fine for now as we don't use MPICH in production.

But it got me thinking. Is there any way of building MPICH so that it would "nicely" support both? My first build was a "no extra arguments" type of build, but the configure output did say to maybe try the ch4:ucx device (I think it chose ch4:ofi first). So, I built with "--with-device=ch4:ucx" (still on Infiniband) and this then does build on the Omnipath system...but throws a message on running:

$ mpirun -np 4 ./helloWorld.mpi3.MPICH-UCX.PSM2.exe

libibcm: couldn't read ABI version

libibcm: couldn't read ABI version

libibcm: couldn't read ABI version

libibcm: couldn't read ABI version

I mean, it still *ran* fine, it just prints out one of those for each process.

So I thought I'd ask the experts: is there any way to build an MPICH that is "happy" on both PSM2 and Infiniband? Or should I just tell my users who want to try out MPICH, "Stick with Infiniband"?

Note: There is no way for us to run a job on both clusters simultaneously, so it's not like I need something that will work on both at the same time. Just something that doesn't throw warnings/messages if possible.
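
Because a job only ever runs on one fabric, the choice between per-fabric builds could be made once per node, for example in a modulefile or login script. The following is a rough sketch under stated assumptions: the `pick_netmod` helper is hypothetical, it assumes Omni-Path adapters show up under /sys/class/infiniband as `hfi1_*` and Mellanox InfiniBand adapters as `mlx*`, and it assumes a ch4:ofi build is used for Omni-Path and a ch4:ucx build for InfiniBand. Check the device names on the actual nodes before relying on it.

```shell
#!/bin/sh
# Guess which MPICH netmod build suits this node from its RDMA device names.
# Assumption: Omni-Path HFIs appear as hfi1_*, Mellanox HCAs as mlx*.
# The optional argument overrides the sysfs directory (handy for testing).
pick_netmod() {
    sysdir="${1:-/sys/class/infiniband}"
    for dev in "$sysdir"/*; do
        # An unmatched glob stays literal; skip it.
        [ -e "$dev" ] || continue
        case "${dev##*/}" in
            hfi1_*) echo ofi; return 0 ;;
            mlx*)   echo ucx; return 0 ;;
        esac
    done
    echo unknown
    return 1
}
```

A login script could then do something like `netmod=$(pick_netmod) && PATH="$MPICH_ROOT/$netmod/bin:$PATH"`, where `MPICH_ROOT` is a hypothetical directory holding one install per netmod.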




Matt Thompson, SSAI, Ld Scientific Programmer/Analyst

NASA GSFC,    Global Modeling and Assimilation Office

Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771

Phone: 301-614-6712                 Fax: 301-614-6246
