[mpich-discuss] Building for Heterogeneous Clusters

Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] matthew.thompson at nasa.gov
Fri Jan 7 11:16:52 CST 2022


Dear MPICH Gurus,

I recently built MPICH 4.0b1 on a cluster I work on that happens to have a couple of different interconnects you can run on. One part is Omnipath and the other is Infiniband.

Now, I built MPICH on the Infiniband cluster. A user recently tried my module on the Omnipath cluster and got:

$ mpifort -o helloWorld.mpi3.MPICH.PSM2.exe helloWorld.mpi3.F90
/usr/bin/ld: /discover/swdev/gmao_SIteam/MPI/mpich/4.0b1/gcc-11.2.0/lib/libmpi.so: undefined reference to `rdma_establish@RDMACM_1.0'
collect2: error: ld returned 1 exit status

Turns out, if you try to use that MPICH on the Omnipath cluster, boom. My "solution" for him was "Build and run on the Infiniband cluster" and that's fine for now as we don't use MPICH in production.
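
A rough way to confirm where that link error comes from is to check whether libmpi.so itself lists rdma_establish as an undefined dynamic symbol. The sketch below is only an illustration: the helper name needs_symbol is hypothetical, and it merely filters the text that `nm -D --undefined-only` prints, so it can also be run on saved output.

```shell
# needs_symbol SYMBOL
# Reads `nm -D --undefined-only` output on stdin and succeeds if SYMBOL
# (with or without an @VERSION suffix) is listed as undefined.
# The function name is hypothetical, chosen for this sketch.
needs_symbol() {
    grep -Eq "[[:space:]]U[[:space:]]+$1(@|\$)"
}

# Possible use, with the library path taken from the error message above:
#   nm -D --undefined-only /discover/swdev/gmao_SIteam/MPI/mpich/4.0b1/gcc-11.2.0/lib/libmpi.so \
#       | needs_symbol rdma_establish && echo "libmpi.so needs rdma_establish"
```

If the symbol is listed, the next thing to compare would be the librdmacm installed on each side of the cluster, since that is the library expected to provide it.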

But it got me thinking. Is there any way of building MPICH so that it would "nicely" support both? My first build was a "no extra arguments" type of build, but the configure output did say to maybe try the ch4:ucx device (I think it chose ch4:ofi first). So, I built with "--with-device=ch4:ucx" (still on Infiniband) and this then does build on the Omnipath system...but throws a message on running:
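
For reference, the two builds described above differ only in the --with-device argument. The sketch below prints one configure command per netmod so each build lands in its own prefix; PREFIX_ROOT and the helper name configure_cmd are assumptions made for illustration, not paths or tools from the actual installation.

```shell
# Hypothetical install root; adjust to the local layout.
PREFIX_ROOT="${PREFIX_ROOT:-$HOME/mpich-4.0b1}"

# configure_cmd NETMOD
# Prints (does not run) the configure line for one ch4 netmod, ofi or ucx,
# so it can be reviewed before being executed in the MPICH source tree.
configure_cmd() {
    printf './configure --prefix=%s/%s --with-device=ch4:%s\n' \
        "$PREFIX_ROOT" "$1" "$1"
}

for netmod in ofi ucx; do
    configure_cmd "$netmod"
done
```

Keeping the two installs in separate prefixes would let each be exposed as its own module, at the cost of maintaining two builds.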

$ mpirun -np 4 ./helloWorld.mpi3.MPICH-UCX.PSM2.exe
libibcm: couldn't read ABI version
libibcm: couldn't read ABI version
libibcm: couldn't read ABI version
libibcm: couldn't read ABI version

I mean, it still *ran* fine, it just prints out one of those for each process.

So I thought I'd ask the experts: is there any way to build an MPICH that is "happy" on both PSM2 and Infiniband? Or should I just tell my users that want to try out MPICH "Stick with Infiniband"?

Note: There is no way for us to run a job on both clusters simultaneously, so it's not like I need something that will work on both at the same time. Just something that doesn't throw warnings/messages if possible.
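
Since a job only ever lands on one side, one possible workaround is to detect the fabric on the node and pick a build accordingly. The sketch below is a guess at how that could look, assuming Omni-Path adapters appear as hfi1_* and Mellanox Infiniband adapters as mlx* under /sys/class/infiniband; the function name detect_fabric is hypothetical.

```shell
# detect_fabric [DIR]
# Prints "omnipath", "infiniband", or "unknown" based on the device names
# found under DIR (default: /sys/class/infiniband).
# Assumes hfi1_* means Omni-Path and mlx* means Mellanox Infiniband.
detect_fabric() {
    dir="${1:-/sys/class/infiniband}"
    if ls "$dir"/hfi1_* >/dev/null 2>&1; then
        echo omnipath
    elif ls "$dir"/mlx* >/dev/null 2>&1; then
        echo infiniband
    else
        echo unknown
    fi
}

echo "fabric on this node: $(detect_fabric)"
```

A module file or wrapper script could branch on that value to load whichever MPICH build behaves on that fabric.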

Thanks,
Matt
--
Matt Thompson, SSAI, Ld Scientific Programmer/Analyst
NASA GSFC,    Global Modeling and Assimilation Office
Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771
Phone: 301-614-6712                 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson