<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Verdana;
panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
{font-family:Aptos;
panose-1:2 11 0 4 2 2 2 2 2 4;}
@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Aptos",sans-serif;
mso-ligatures:standardcontextual;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
span.EmailStyle19
{mso-style-type:personal-reply;
font-family:"Aptos",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;
mso-ligatures:none;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="blue" vlink="purple" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal">It looks like the crash happens after shared memory window creation fails. The failure path trips while removing the window id from the global hash, because the id was never added. We will fix this in the code so that users get
a clearer error message after the failure.<o:p></o:p></p>
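<p class="MsoNormal">To illustrate the kind of failure path described above, here is a minimal sketch in C. It is not the MPICH source: the names registry_t, registry_add, registry_remove, and create_window are hypothetical, and a small fixed-size table stands in for the global hash. The point is only that cleanup should remove an id if, and only if, it was actually registered.<o:p></o:p></p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REGISTRY_SLOTS 16

/* Hypothetical stand-in for a global id table; not MPICH's data structure. */
typedef struct {
    int  ids[REGISTRY_SLOTS];
    bool used[REGISTRY_SLOTS];
} registry_t;

/* Store `id` in the first free slot; returns false if the table is full. */
bool registry_add(registry_t *r, int id)
{
    assert(r != NULL);
    for (size_t i = 0; i < REGISTRY_SLOTS; i++) {
        if (!r->used[i]) {
            r->ids[i] = id;
            r->used[i] = true;
            return true;
        }
    }
    return false;
}

/* Remove `id`; returns false (instead of asserting) if it was never added. */
bool registry_remove(registry_t *r, int id)
{
    assert(r != NULL);
    for (size_t i = 0; i < REGISTRY_SLOTS; i++) {
        if (r->used[i] && r->ids[i] == id) {
            r->used[i] = false;
            return true;
        }
    }
    return false;
}

/* Simulated window creation. `alloc_ok` and `sync_ok` are assumptions that
 * stand in for the results of real steps. Returns 0 on success, -1 on a
 * cleanly reported failure. */
int create_window(registry_t *r, int id, bool alloc_ok, bool sync_ok)
{
    bool registered = false;

    if (!alloc_ok)              /* step 1 fails before the id is registered */
        goto fail;
    if (!registry_add(r, id))   /* step 2: register the id */
        goto fail;
    registered = true;
    if (!sync_ok)               /* step 3 fails after registration */
        goto fail;
    return 0;

fail:
    if (registered)             /* only undo what was actually done */
        registry_remove(r, id);
    return -1;
}
```

<p class="MsoNormal">With a zero-initialized registry_t, create_window(&r, 7, false, true) returns -1 without touching the table, so the caller sees an error code rather than a failed lookup during cleanup.<o:p></o:p></p>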
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Can you confirm that the input communicator to the window creation function is one created with MPI_Comm_split_type(…,<span style="font-size:9.0pt;font-family:"Verdana",sans-serif;color:black;background:white">MPI_COMM_TYPE_SHARED,…)?<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:9.0pt;font-family:"Verdana",sans-serif;color:black;background:white"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:9.0pt;font-family:"Verdana",sans-serif;color:black;background:white">Thanks,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:9.0pt;font-family:"Verdana",sans-serif;color:black;background:white">Ken</span><o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="margin-left:.5in"><b><span style="font-family:"Calibri",sans-serif;color:black">From:
</span></b><span style="font-family:"Calibri",sans-serif;color:black">"Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] via discuss" <discuss@mpich.org><br>
<b>Reply-To: </b>"discuss@mpich.org" <discuss@mpich.org><br>
<b>Date: </b>Tuesday, March 26, 2024 at 9:20 AM<br>
<b>To: </b>"discuss@mpich.org" <discuss@mpich.org><br>
<b>Cc: </b>"Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]" <matthew.thompson@nasa.gov><br>
<b>Subject: </b>[mpich-discuss] Help with MPICH 4.2.0 and win_allocate_shared (or maybe infiniband?)</span><span style="font-size:12.0pt;font-family:"Calibri",sans-serif;color:black;mso-ligatures:none"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
</div>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">All,</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">I've been trying to get a code of mine working with MPICH 4.2.0. I can build MPICH just fine, then build our base libraries and the model, and everything compiles
fine. Hello world runs fine on multiple nodes as well.</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">But when I finally try to run our complex model:</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">Assertion failed in file ../src/mpid/ch4/src/ch4_impl.h at line 838: map_entry != NULL</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">/discover/swdev/gmao_SIteam/MPI/mpich/4.2.0-SLES15/ifort-2021.11.0/lib/libmpi.so.12(+0x37d211) [0x14bf4f62c211]</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">/discover/swdev/gmao_SIteam/MPI/mpich/4.2.0-SLES15/ifort-2021.11.0/lib/libmpi.so.12(PMPI_Win_allocate_shared+0x3ba) [0x14bf4f3e452a]</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">/discover/swdev/gmao_SIteam/Baselibs/ESMA-Baselibs-7.20.0/x86_64-pc-linux-gnu/ifort_2021.11.0-mpich_4.2.0-SLES15/Linux/lib/libesmf.so(_ZN5ESMCI3VMK14ssishmAllocateERSt6vectorImSaImEEPNS0_9memhandleEb+0x18b)
[0x14bf6e91481b]</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">/discover/swdev/gmao_SIteam/Baselibs/ESMA-Baselibs-7.20.0/x86_64-pc-linux-gnu/ifort_2021.11.0-mpich_4.2.0-SLES15/Linux/lib/libesmf.so(_ZN5ESMCI5Array6createEPNS_9ArraySpecEPNS_8DistGridEPNS_10InterArrayIiEES7_S7_S7_S7_S7_S7_P14ESMC_IndexFlagP13ESMC_Pin_FlagS7_S7_S7_PiPNS_2VME+0x2707)
[0x14bf6e44a267]</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">Mainly, I'm wondering whether anyone has experience with an error like this. My guess at the moment is that I built things incorrectly for an InfiniBand
cluster.</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">I'm using Intel Fortran Classic 2021.11.0 with GCC 11.4.0 as a backing C compiler and I built as:</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> mkdir build-ifort-2021.11.0 && cd build-ifort-2021.11.0</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> ../configure CC=icx CXX=icpx FC=ifort \</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> --with-ucx=embedded --with-hwloc=embedded --with-libfabric=embedded --with-yaksa=embedded \</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> --prefix=/discover/swdev/gmao_SIteam/MPI/mpich/4.2.0-SLES15/ifort-2021.11.0 |& tee configure.ifort-2021.11.0.log</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">All those "embedded" flags are mainly because with Open MPI on this system, I have to do something similar with its configure step:</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> --with-hwloc=internal --with-libevent=internal --with-pmix=internal</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">so I figured I should do the same with MPICH.</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">Now, at the end of the configure step I did see:</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">*****************************************************</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">***</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">*** device : ch4:ofi (embedded libfabric)</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">*** shm feature : auto</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">*** gpu support : disabled</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">***</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> MPICH is configured with device ch4:ofi, which should work</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> for TCP networks and any high-bandwidth interconnect</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> supported by libfabric. MPICH can also be configured with</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> "--with-device=ch4:ucx", which should work for TCP networks</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> and any high-bandwidth interconnect supported by the UCX</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> library. In addition, the legacy device ch3 (--with-device=ch3)</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> is also available.</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">*****************************************************</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">I did try `--with-device=ch4:ucx`, but that didn't seem to help. The system I am on uses an InfiniBand network, so I imagine ofi should work.</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">Note that this code works fine with Intel MPI and Open MPI (which are our "main" MPI stacks), but some of our external users are asking about MPICH support.</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas">Matt</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas"> </span><o:p></o:p></p>
<div>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas;mso-ligatures:none">--
</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas;mso-ligatures:none">Matt Thompson, SSAI, Ld Scientific Prog/Analyst/Super</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas;mso-ligatures:none">NASA GSFC, Global Modeling and Assimilation Office</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas;mso-ligatures:none">Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas;mso-ligatures:none">Phone: 301-614-6712 Fax: 301-614-6246</span><o:p></o:p></p>
</div>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:12.0pt;font-family:Consolas;mso-ligatures:none"><a href="http://science.gsfc.nasa.gov/sed/bio/matthew.thompson">http://science.gsfc.nasa.gov/sed/bio/matthew.thompson</a></span><o:p></o:p></p>
</div>
</body>
</html>