[mpich-discuss] Help with MPICH 4.2.0 and win_allocate_shared (or maybe infiniband?)

Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] matthew.thompson at nasa.gov
Tue Mar 26 09:19:44 CDT 2024


All,

I've been trying to get a code of mine working with MPICH 4.2.0. I can build MPICH just fine, then build our base libraries and then our model, and everything compiles fine. Hello world runs fine on multiple nodes as well.

But when I finally try to run our complex model, I get:

Assertion failed in file ../src/mpid/ch4/src/ch4_impl.h at line 838: map_entry != NULL
/discover/swdev/gmao_SIteam/MPI/mpich/4.2.0-SLES15/ifort-2021.11.0/lib/libmpi.so.12(+0x37d211) [0x14bf4f62c211]
/discover/swdev/gmao_SIteam/MPI/mpich/4.2.0-SLES15/ifort-2021.11.0/lib/libmpi.so.12(PMPI_Win_allocate_shared+0x3ba) [0x14bf4f3e452a]
/discover/swdev/gmao_SIteam/Baselibs/ESMA-Baselibs-7.20.0/x86_64-pc-linux-gnu/ifort_2021.11.0-mpich_4.2.0-SLES15/Linux/lib/libesmf.so(_ZN5ESMCI3VMK14ssishmAllocateERSt6vectorImSaImEEPNS0_9memhandleEb+0x18b) [0x14bf6e91481b]
/discover/swdev/gmao_SIteam/Baselibs/ESMA-Baselibs-7.20.0/x86_64-pc-linux-gnu/ifort_2021.11.0-mpich_4.2.0-SLES15/Linux/lib/libesmf.so(_ZN5ESMCI5Array6createEPNS_9ArraySpecEPNS_8DistGridEPNS_10InterArrayIiEES7_S7_S7_S7_S7_S7_P14ESMC_IndexFlagP13ESMC_Pin_FlagS7_S7_S7_PiPNS_2VME+0x2707) [0x14bf6e44a267]

What I'm mainly wondering is whether anyone has experience with an error like this? My guess (at the moment) is that maybe I built things incorrectly for an InfiniBand cluster?
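
In case it helps anyone reproduce this outside of ESMF: from the traceback, the failing call is MPI_Win_allocate_shared inside ESMF's shared-memory allocation. Below is a minimal C sketch of the kind of thing I believe is being exercised (the segment size, disp_unit, and info hints are my own guesses, not what ESMF actually passes): split MPI_COMM_WORLD into node-local communicators and allocate a shared window on each.

  /* Minimal MPI_Win_allocate_shared sketch -- segment size, disp_unit,
   * and info hints here are assumptions, not what ESMF actually uses. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int world_rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

      /* Split into node-local (shared-memory) communicators. */
      MPI_Comm nodecomm;
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                          MPI_INFO_NULL, &nodecomm);

      int node_rank;
      MPI_Comm_rank(nodecomm, &node_rank);

      /* Each rank contributes a small segment to the shared window. */
      MPI_Aint size = 1024 * sizeof(double);
      double *baseptr = NULL;
      MPI_Win win;
      MPI_Win_allocate_shared(size, sizeof(double), MPI_INFO_NULL,
                              nodecomm, &baseptr, &win);

      /* Touch the local segment to make sure the mapping is usable. */
      MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
      baseptr[0] = (double)node_rank;
      MPI_Win_unlock_all(win);

      if (world_rank == 0)
          printf("MPI_Win_allocate_shared succeeded\n");

      MPI_Win_free(&win);
      MPI_Comm_free(&nodecomm);
      MPI_Finalize();
      return 0;
  }

If something like this also trips the ch4 assertion when run across multiple nodes, that would at least take ESMF out of the picture.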

I'm using Intel Fortran Classic 2021.11.0 with GCC 11.4.0 as the backing compiler toolchain, and I built MPICH as:

  mkdir build-ifort-2021.11.0 && cd build-ifort-2021.11.0
  ../configure CC=icx CXX=icpx FC=ifort \
     --with-ucx=embedded --with-hwloc=embedded --with-libfabric=embedded --with-yaksa=embedded \
     --prefix=/discover/swdev/gmao_SIteam/MPI/mpich/4.2.0-SLES15/ifort-2021.11.0 |& tee configure.ifort-2021.11.0.log

All those "embedded" flags are mainly because with Open MPI on this system, I have to do something similar with its configure step:

  --with-hwloc=internal --with-libevent=internal --with-pmix=internal

so I figured I should do the same with MPICH.

Now, at the end of the configure step I did see:

*****************************************************
***
*** device      : ch4:ofi (embedded libfabric)
*** shm feature : auto
*** gpu support : disabled
***
  MPICH is configured with device ch4:ofi, which should work
  for TCP networks and any high-bandwidth interconnect
  supported by libfabric. MPICH can also be configured with
  "--with-device=ch4:ucx", which should work for TCP networks
  and any high-bandwidth interconnect supported by the UCX
  library. In addition, the legacy device ch3 (--with-device=ch3)
  is also available.
*****************************************************

And I did try `--with-device=ch4:ucx`, but that didn't seem to help. The system I'm on has an InfiniBand network, so I imagine ofi should work as well.
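
As a sanity check that the job is really picking up the ofi (or ucx) build I think I configured, I've also been printing the library version string from a tiny C program; as far as I can tell, MPICH reports the device and the configure line in that string. A minimal sketch (nothing here is site-specific):

  /* Print the MPI library version string; for MPICH this appears to
   * include build details such as the device (e.g. ch4:ofi) and the
   * configure options used to build it. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      char version[MPI_MAX_LIBRARY_VERSION_STRING];
      int len, rank;

      MPI_Init(&argc, &argv);
      MPI_Get_library_version(version, &len);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0)
          printf("%s\n", version);

      MPI_Finalize();
      return 0;
  }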

Note that this code works fine with Intel MPI and Open MPI (which are our "main" MPI stacks), but some of our external users are asking about MPICH support.



Matt

--
Matt Thompson, SSAI, Ld Scientific Prog/Analyst/Super
NASA GSFC,    Global Modeling and Assimilation Office
Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771
Phone: 301-614-6712                 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson