[mpich-discuss] Help with MPICH 4.2.0 and win_allocate_shared (or maybe infiniband?)

Raffenetti, Ken raffenet at anl.gov
Tue Mar 26 09:55:19 CDT 2024


It looks like the crash is happening after shared memory window creation fails. The failure path is getting tripped up removing the window id from the global hash, since it was never added. We will address this in the code so users get a better error message after the failure.

Can you confirm that the input communicator to the window creation function is one created with MPI_Comm_split_type(…,MPI_COMM_TYPE_SHARED,…)?

Thanks,
Ken

From: "Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Tuesday, March 26, 2024 at 9:20 AM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]" <matthew.thompson at nasa.gov>
Subject: [mpich-discuss] Help with MPICH 4.2.0 and win_allocate_shared (or maybe infiniband?)

All,

I've been trying to get a code of mine working with MPICH 4.2.0. I can build MPICH just fine, then build our base libraries and the model, and everything compiles. Hello world also runs fine on multiple nodes.

But when I finally try to run our complex model, I get:

Assertion failed in file ../src/mpid/ch4/src/ch4_impl.h at line 838: map_entry != NULL
/discover/swdev/gmao_SIteam/MPI/mpich/4.2.0-SLES15/ifort-2021.11.0/lib/libmpi.so.12(+0x37d211) [0x14bf4f62c211]
/discover/swdev/gmao_SIteam/MPI/mpich/4.2.0-SLES15/ifort-2021.11.0/lib/libmpi.so.12(PMPI_Win_allocate_shared+0x3ba) [0x14bf4f3e452a]
/discover/swdev/gmao_SIteam/Baselibs/ESMA-Baselibs-7.20.0/x86_64-pc-linux-gnu/ifort_2021.11.0-mpich_4.2.0-SLES15/Linux/lib/libesmf.so(_ZN5ESMCI3VMK14ssishmAllocateERSt6vectorImSaImEEPNS0_9memhandleEb+0x18b) [0x14bf6e91481b]
/discover/swdev/gmao_SIteam/Baselibs/ESMA-Baselibs-7.20.0/x86_64-pc-linux-gnu/ifort_2021.11.0-mpich_4.2.0-SLES15/Linux/lib/libesmf.so(_ZN5ESMCI5Array6createEPNS_9ArraySpecEPNS_8DistGridEPNS_10InterArrayIiEES7_S7_S7_S7_S7_S7_P14ESMC_IndexFlagP13ESMC_Pin_FlagS7_S7_S7_PiPNS_2VME+0x2707) [0x14bf6e44a267]

Does anyone have experience with an error like this? My guess at the moment is that I built something wrong for an InfiniBand cluster.

I'm using Intel Fortran Classic 2021.11.0 with GCC 11.4.0 as a backing C compiler and I built as:

  mkdir build-ifort-2021.11.0 && cd build-ifort-2021.11.0
  ../configure CC=icx CXX=icpx FC=ifort \
     --with-ucx=embedded --with-hwloc=embedded --with-libfabric=embedded --with-yaksa=embedded \
     --prefix=/discover/swdev/gmao_SIteam/MPI/mpich/4.2.0-SLES15/ifort-2021.11.0 |& tee configure.ifort-2021.11.0.log

All those "embedded" flags are mainly because with Open MPI on this system, I have to do something similar with its configure step:

  --with-hwloc=internal --with-libevent=internal --with-pmix=internal

so I figured I should do the same with MPICH.

Now, at the end of the configure step I did see:

*****************************************************
***
*** device      : ch4:ofi (embedded libfabric)
*** shm feature : auto
*** gpu support : disabled
***
  MPICH is configured with device ch4:ofi, which should work
  for TCP networks and any high-bandwidth interconnect
  supported by libfabric. MPICH can also be configured with
  "--with-device=ch4:ucx", which should work for TCP networks
  and any high-bandwidth interconnect supported by the UCX
  library. In addition, the legacy device ch3 (--with-device=ch3)
  is also available.
*****************************************************

I did try `--with-device=ch4:ucx`, but that didn't seem to help. The system I am on uses an InfiniBand network, so I imagine ofi should work.

Note that this code works fine with Intel MPI and Open MPI (which are our "main" MPI stacks), but some of our external users are asking about MPICH support.



Matt

--
Matt Thompson, SSAI, Ld Scientific Prog/Analyst/Super
NASA GSFC,    Global Modeling and Assimilation Office
Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771
Phone: 301-614-6712                 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson