[mpich-discuss] [EXTERNAL] Re: Help with MPICH 4.2.0 and win_allocate_shared (or maybe infiniband?)
Raffenetti, Ken
raffenet at anl.gov
Tue Apr 2 20:26:18 CDT 2024
Can you provide details on how to run the application to reproduce the error? Preferably with as few processes as possible. I think we'll need to do some more digging to get to the cause on our side. It would also be good to transfer these details over to GitHub so we can better track the issue.
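If it helps, even a minimal standalone program that exercises the same pattern would be useful. A sketch of what I mean (the window size and variable names here are placeholders, not taken from your application):

/* Minimal sketch: node-local split followed by a shared-memory window,
 * the same call sequence the traceback points at. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Split COMM_WORLD into node-local (shared-memory) communicators. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    /* Allocate a small shared window on the node-local communicator. */
    MPI_Win win;
    void *baseptr = NULL;
    MPI_Win_allocate_shared(1024, 1, MPI_INFO_NULL, node_comm,
                            &baseptr, &win);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("shared window allocated\n");

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}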
Thanks,
Ken
From: "Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]" <matthew.thompson at nasa.gov>
Date: Tuesday, March 26, 2024 at 11:11 AM
To: "Raffenetti, Ken" <raffenet at anl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [EXTERNAL] Re: [mpich-discuss] Help with MPICH 4.2.0 and win_allocate_shared (or maybe infiniband?)
Ken,
I think so. Given the traceback, my guess is it's dealing with this code from ESMF v8.6.0 (https://github.com/esmf-org/esmf/blob/ec5f18667091090df7e7b716d588955ce9aa4bd5/src/Infrastructure/VM/src/ESMCI_VMKernel.C#L466-L475):
#if (MPI_VERSION >= 3)
    // set up communicator across single-system-images SSIs
    MPI_Comm_split_type(mpi_c, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &mpi_c_ssi);
    // set up communicator across root pets of each SSI
    int color;
    MPI_Comm_rank(mpi_c_ssi, &color);
    if (color>0) color = MPI_UNDEFINED; // only root PETs on each SSI
    MPI_Comm_split(mpi_c, color, 0, &mpi_c_ssi_roots);
#endif
Indeed, if it's coming from ESMF, that is the *only* MPI_Comm_split_type in the whole code!
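Based on the traceback, my guess (and it is only a guess, not the literal ESMF source) is that ssishmAllocate then allocates the shared window on that node-local communicator, roughly like this:

/* Rough sketch of the allocation I believe ssishmAllocate performs on the
 * mpi_c_ssi communicator created above.  The function name, disp_unit, and
 * size argument are my guesses, not the actual ESMF code. */
#include <mpi.h>

static void *allocate_node_shared(MPI_Comm mpi_c_ssi, MPI_Aint bytes,
                                  MPI_Win *win)
{
    void *baseptr = NULL;
    /* Each rank on the node contributes "bytes" bytes to one shared segment. */
    MPI_Win_allocate_shared(bytes, 1 /* disp_unit */, MPI_INFO_NULL,
                            mpi_c_ssi, &baseptr, win);
    return baseptr;
}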
We also call MPI_Comm_split_type in the overall code (MAPL), and it likewise only uses that:
https://github.com/search?q=repo%3AGEOS-ESM%2FMAPL%20mpi_comm_split_type&type=code
--
Matt Thompson, SSAI, Ld Scientific Prog/Analyst/Super
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson
From: Raffenetti, Ken <raffenet at anl.gov>
Date: Tuesday, March 26, 2024 at 10:55 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] <matthew.thompson at nasa.gov>
Subject: [EXTERNAL] Re: [mpich-discuss] Help with MPICH 4.2.0 and win_allocate_shared (or maybe infiniband?)
It looks like the crash is happening after shared memory window creation fails. The failure path is getting tripped up removing the window id from the global hash, since it was never added. We will address this in the code so users get a better error message after the failure.
Can you confirm that the input communicator to the window creation function is one created with MPI_Comm_split_type(…,MPI_COMM_TYPE_SHARED,…)?
Thanks,
Ken
From: "Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Tuesday, March 26, 2024 at 9:20 AM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]" <matthew.thompson at nasa.gov>
Subject: [mpich-discuss] Help with MPICH 4.2.0 and win_allocate_shared (or maybe infiniband?)
All,
I've been trying to get a code of mine working with MPICH 4.2.0. I can build MPICH just fine, then build our base libraries and the model, and everything compiles fine. Hello world runs fine on multiple nodes as well.
But when I finally try to run our complex model:
Assertion failed in file ../src/mpid/ch4/src/ch4_impl.h at line 838: map_entry != NULL
/discover/swdev/gmao_SIteam/MPI/mpich/4.2.0-SLES15/ifort-2021.11.0/lib/libmpi.so.12(+0x37d211) [0x14bf4f62c211]
/discover/swdev/gmao_SIteam/MPI/mpich/4.2.0-SLES15/ifort-2021.11.0/lib/libmpi.so.12(PMPI_Win_allocate_shared+0x3ba) [0x14bf4f3e452a]
/discover/swdev/gmao_SIteam/Baselibs/ESMA-Baselibs-7.20.0/x86_64-pc-linux-gnu/ifort_2021.11.0-mpich_4.2.0-SLES15/Linux/lib/libesmf.so(_ZN5ESMCI3VMK14ssishmAllocateERSt6vectorImSaImEEPNS0_9memhandleEb+0x18b) [0x14bf6e91481b]
/discover/swdev/gmao_SIteam/Baselibs/ESMA-Baselibs-7.20.0/x86_64-pc-linux-gnu/ifort_2021.11.0-mpich_4.2.0-SLES15/Linux/lib/libesmf.so(_ZN5ESMCI5Array6createEPNS_9ArraySpecEPNS_8DistGridEPNS_10InterArrayIiEES7_S7_S7_S7_S7_S7_P14ESMC_IndexFlagP13ESMC_Pin_FlagS7_S7_S7_PiPNS_2VME+0x2707) [0x14bf6e44a267]
What I'm mainly wondering is whether anyone has any experience with an error like this. My guess (at the moment) is that maybe I built things incorrectly for an InfiniBand cluster?
I'm using Intel Fortran Classic 2021.11.0 with GCC 11.4.0 as a backing C compiler and I built as:
mkdir build-ifort-2021.11.0 && cd build-ifort-2021.11.0
../configure CC=icx CXX=icpx FC=ifort \
--with-ucx=embedded --with-hwloc=embedded --with-libfabric=embedded --with-yaksa=embedded \
--prefix=/discover/swdev/gmao_SIteam/MPI/mpich/4.2.0-SLES15/ifort-2021.11.0 |& tee configure.ifort-2021.11.0.log
All those "embedded" flags are mainly because with Open MPI on this system, I have to do something similar with its configure step:
--with-hwloc=internal --with-libevent=internal --with-pmix=internal
so I figured I should do the same with MPICH.
Now, at the end of the configure step I did see:
*****************************************************
***
*** device : ch4:ofi (embedded libfabric)
*** shm feature : auto
*** gpu support : disabled
***
MPICH is configured with device ch4:ofi, which should work
for TCP networks and any high-bandwidth interconnect
supported by libfabric. MPICH can also be configured with
"--with-device=ch4:ucx", which should work for TCP networks
and any high-bandwidth interconnect supported by the UCX
library. In addition, the legacy device ch3 (--with-device=ch3)
is also available.
*****************************************************
I did try the `--with-device=ch4:ucx` option, but that didn't seem to help. The system I am on is an InfiniBand network, so I imagine ofi should work as well.
Note that this code works fine with Intel MPI and Open MPI (which are our "main" MPI stacks), but some of our external users are asking about MPICH support.
Matt
--
Matt Thompson, SSAI, Ld Scientific Prog/Analyst/Super
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson