<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div class="elementToProof" style="margin: 0px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
>../configure --prefix=/people/d3g293/mpich/mpich-4.2.1/build_decptn/install \</div>
<div style="margin: 0px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
--with-device=ch4:ofi:sockets --with-libfabric=embedded \</div>
<div style="margin: 0px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
--without-ucx CC=gcc CXX=g++<br>
<br>
You are statically using the "sockets" provider. Try configuring with <code>--with-device=ch4:ofi</code> instead.</div>
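<div style="margin: 0px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
A rough sketch of that reconfigure, reusing the options from the quoted command and changing only the device string. The PREFIX and CONFIGURE_ARGS variable names are only for illustration, and the assumption here is that every other option stays the same:
<pre>
```shell
# Same options as the quoted configure line, except that the device string
# no longer pins the "sockets" provider at build time.
PREFIX=/people/d3g293/mpich/mpich-4.2.1/build_decptn/install
CONFIGURE_ARGS="--prefix=$PREFIX --with-device=ch4:ofi --with-libfabric=embedded --without-ucx CC=gcc CXX=g++"

# Print the command for review before running it from the build directory.
echo "../configure $CONFIGURE_ARGS"
```
</pre>
With a build like this, the provider should be selectable at run time, for example through FI_PROVIDER.</div>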
<div style="margin: 0px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="margin: 0px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Hui</div>
<div style="margin: 0px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="margin: 0px; font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Palmer, Bruce J <Bruce.Palmer@pnnl.gov><br>
<b>Sent:</b> Friday, June 21, 2024 12:27 PM<br>
<b>To:</b> Zhou, Hui <zhouh@anl.gov>; discuss@mpich.org <discuss@mpich.org><br>
<b>Subject:</b> RE: Fail on MPI_Wait</font>
<div> </div>
</div>
<div>
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style>
<!--
@font-face
{font-family:"Cambria Math"}
@font-face
{font-family:Calibri}
@font-face
{font-family:Aptos}
@font-face
{font-family:Menlo}
p.x_MsoNormal, li.x_MsoNormal, div.x_MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman",serif}
a:link, span.x_MsoHyperlink
{color:#0563C1;
text-decoration:underline}
a:visited, span.x_MsoHyperlinkFollowed
{color:#954F72;
text-decoration:underline}
p
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman",serif}
code
{font-family:"Courier New"}
p.x_msonormal0, li.x_msonormal0, div.x_msonormal0
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman",serif}
p.x_xxxmsonormal, li.x_xxxmsonormal, div.x_xxxmsonormal
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:Aptos}
p.x_xxxxmsonormal, li.x_xxxxmsonormal, div.x_xxxxmsonormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:Aptos}
p.x_xxxxp1, li.x_xxxxp1, div.x_xxxxp1
{margin:0in;
margin-bottom:.0001pt;
font-size:8.5pt;
font-family:Menlo;
color:black}
p.x_xxxmsochpdefault, li.x_xxxmsochpdefault, div.x_xxxmsochpdefault
{margin:0in;
margin-bottom:.0001pt;
font-size:10.0pt;
font-family:"Times New Roman",serif}
span.x_xxxxs2
{background:#878A04}
span.x_xxxemailstyle26
{font-family:Aptos;
color:windowtext}
span.x_xxxxs1
{}
span.x_xxxxapple-converted-space
{}
span.x_EmailStyle28
{font-family:"Calibri",sans-serif;
color:#1F497D}
.x_MsoChpDefault
{font-size:10.0pt}
@page WordSection1
{margin:1.0in 1.0in 1.0in 1.0in}
div.x_WordSection1
{}
-->
</style>
<div class="x_WordSection1">
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D">Hui,</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"> </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D">When I set FI_PROVIDER=tcp, the code crashes in MPI_Init. Specifically, this code will fail on one process:</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"> </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D">#include "mpi.h"</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"> </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D">int main(int argc, char **argv) {</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"> MPI_Init(&argc, &argv);</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"> MPI_Finalize();</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D">}</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"> </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D">I’m running on a system with the following modules</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"> </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D">[d3g293@deception02 testing]$ module list</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D">Currently Loaded Modulefiles:</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"> 1) gcc/11.2.0 3) python/3.7.0 5) mkl/2019u4</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"> 2) cmake/3.21.4 4) git/2.42.0(default) 6) cuda/11.8</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"> </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D">and a home-built version of mpich-4.2.1 configured with</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"> </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D">../configure --prefix=/people/d3g293/mpich/mpich-4.2.1/build_decptn/install \</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"> --with-device=ch4:ofi:sockets --with-libfabric=embedded \</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"> --without-ucx CC=gcc CXX=g++</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"> </span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D">I thought it might have something to do with my application's build configuration, which is set up to include CUDA, but it also fails in MPI_Init with a non-CUDA configuration if I set the FI_PROVIDER variable.</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"><br>
Bruce</span></p>
<p class="x_MsoNormal"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:#1F497D"> </span></p>
<div>
<div style="border:none; border-top:solid #E1E1E1 1.0pt; padding:3.0pt 0in 0in 0in">
<p class="x_MsoNormal"><b><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">From:</span></b><span style="font-size:11.0pt; font-family:"Calibri",sans-serif"> Zhou, Hui <zhouh@anl.gov>
<br>
<b>Sent:</b> Friday, June 14, 2024 9:27 AM<br>
<b>To:</b> Palmer, Bruce J <Bruce.Palmer@pnnl.gov>; discuss@mpich.org<br>
<b>Subject:</b> Re: Fail on MPI_Wait</span></p>
</div>
</div>
<p class="x_MsoNormal"> </p>
<div>
<p class="x_MsoNormal"><span style="font-family:Aptos; color:black">Never mind. It is v4.2.1.</span></p>
</div>
<div class="x_MsoNormal" align="center" style="text-align:center">
<hr size="2" width="98%" align="center">
</div>
<div id="x_divRplyFwdMsg">
<p class="x_MsoNormal"><b><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">From:</span></b><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black"> Zhou, Hui <</span><a href="mailto:zhouh@anl.gov"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">zhouh@anl.gov</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">><br>
<b>Sent:</b> Friday, June 14, 2024 11:26 AM<br>
<b>To:</b> Palmer, Bruce J <</span><a href="mailto:Bruce.Palmer@pnnl.gov"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">Bruce.Palmer@pnnl.gov</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">>;
</span><a href="mailto:discuss@mpich.org"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">discuss@mpich.org</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black"> <</span><a href="mailto:discuss@mpich.org"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">discuss@mpich.org</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">><br>
<b>Subject:</b> Re: Fail on MPI_Wait</span> </p>
<div>
<p class="x_MsoNormal"> </p>
</div>
</div>
<div>
<div>
<p class="x_MsoNormal"><span style="font-family:Aptos; color:black">Bruce,</span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-family:Aptos; color:black"> </span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-family:Aptos; color:black">What is the mpich version, BTW?</span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-family:Aptos; color:black"> </span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-family:Aptos; color:black">--</span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-family:Aptos; color:black">Hui</span></p>
</div>
<div class="x_MsoNormal" align="center" style="text-align:center">
<hr size="2" width="98%" align="center">
</div>
<div id="x_x_divRplyFwdMsg">
<p class="x_MsoNormal"><b><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">From:</span></b><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black"> Zhou, Hui <</span><a href="mailto:zhouh@anl.gov"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">zhouh@anl.gov</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">><br>
<b>Sent:</b> Friday, June 14, 2024 10:55 AM<br>
<b>To:</b> Palmer, Bruce J <</span><a href="mailto:Bruce.Palmer@pnnl.gov"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">Bruce.Palmer@pnnl.gov</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">>;
</span><a href="mailto:discuss@mpich.org"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">discuss@mpich.org</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black"> <</span><a href="mailto:discuss@mpich.org"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">discuss@mpich.org</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">><br>
<b>Subject:</b> Re: Fail on MPI_Wait</span> </p>
<div>
<p class="x_MsoNormal"> </p>
</div>
</div>
<div>
<div>
<p class="x_MsoNormal"><span style="font-family:Aptos; color:black">Bruce,</span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-family:Aptos; color:black"> </span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-family:Aptos; color:black">You are using the sockets provider. Could you try setting
</span><code><span style="font-size:10.0pt; color:black">FI_PROVIDER=tcp</span></code><span style="font-family:Aptos; color:black"> to see if it makes a difference?<br>
<br>
Meanwhile, if you can get a small reproducer – with the sockets provider or any provider, I'll try to debug it. It is difficult to guess the true source of the issue without a reproducer.</span></p>
</div>
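<div>
<p class="x_MsoNormal"><span style="font-family:Aptos; color:black">A minimal sketch of such a run, assuming the Hydra mpiexec launcher from the MPICH install is on the PATH. The binary name ./reproducer is hypothetical and stands for whatever test program is being run:</span></p>
<pre>
```shell
# Ask libfabric for the tcp provider at run time.
export FI_PROVIDER=tcp

# ./reproducer is a placeholder name for the test binary.
if [ -x ./reproducer ]; then
    mpiexec -n 2 ./reproducer || echo "run failed with FI_PROVIDER=$FI_PROVIDER"
else
    echo "build ./reproducer first; FI_PROVIDER=$FI_PROVIDER"
fi
```
</pre>
</div>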
<div>
<p class="x_MsoNormal"><span style="font-family:Aptos; color:black"> </span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-family:Aptos; color:black">--</span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-family:Aptos; color:black">Hui</span></p>
</div>
<div class="x_MsoNormal" align="center" style="text-align:center">
<hr size="2" width="98%" align="center">
</div>
<div id="x_x_x_divRplyFwdMsg">
<p class="x_MsoNormal"><b><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">From:</span></b><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black"> Palmer, Bruce J <</span><a href="mailto:Bruce.Palmer@pnnl.gov"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">Bruce.Palmer@pnnl.gov</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">><br>
<b>Sent:</b> Friday, June 14, 2024 10:47 AM<br>
<b>To:</b> Zhou, Hui <</span><a href="mailto:zhouh@anl.gov"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">zhouh@anl.gov</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">>;
</span><a href="mailto:discuss@mpich.org"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">discuss@mpich.org</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black"> <</span><a href="mailto:discuss@mpich.org"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">discuss@mpich.org</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">><br>
<b>Subject:</b> Re: Fail on MPI_Wait</span> </p>
<div>
<p class="x_MsoNormal"> </p>
</div>
</div>
<div>
<div>
<p class="x_xxxmsonormal"><span style="font-size:11.0pt">The output to standard out from running on 2 nodes and one process per node is attached.</span></p>
<p class="x_xxxmsonormal"><span style="font-size:11.0pt"> </span></p>
<div id="x_x_x_x_mail-editor-reference-message-container">
<div>
<div style="border:none; border-top:solid #B5C4DF 1.0pt; padding:3.0pt 0in 0in 0in">
<p class="x_xxxmsonormal" style="margin-bottom:12.0pt"><b><span style="color:black">From:
</span></b><span style="color:black">Zhou, Hui <</span><a href="mailto:zhouh@anl.gov">zhouh@anl.gov</a><span style="color:black">><br>
<b>Date: </b>Tuesday, June 11, 2024 at 5:49</span><span style="font-family:"Arial",sans-serif; color:black"> </span><span style="color:black">PM<br>
<b>To: </b></span><a href="mailto:discuss@mpich.org">discuss@mpich.org</a><span style="color:black"> <</span><a href="mailto:discuss@mpich.org">discuss@mpich.org</a><span style="color:black">><br>
<b>Cc: </b>Palmer, Bruce J <</span><a href="mailto:Bruce.Palmer@pnnl.gov">Bruce.Palmer@pnnl.gov</a><span style="color:black">><br>
<b>Subject: </b>Re: Fail on MPI_Wait</span></p>
</div>
<div>
<p class="x_xxxmsonormal"><span style="color:black">>MPIDI_OFI_handle_cq_error(793): OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)<br>
<br>
This error comes from the libfabric provider. First we need to find out which provider you are using. Try setting the environment variable
</span><code><span style="font-size:10.0pt; color:black">MPIR_CVAR_DEBUG_SUMMARY=1</span></code><span style="font-family:"Arial",sans-serif; color:black"></span><span style="color:black"> and running a simple
</span><code><span style="font-size:10.0pt; color:black">MPI_Init+MPI_Finalize</span></code><span style="font-family:"Arial",sans-serif; color:black"></span><span style="color:black"> test code. Could you post its console output?</span></p>
</div>
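<div>
<p class="x_xxxmsonormal"><span style="color:black">For example, something along these lines, assuming mpicc and mpiexec from the MPICH install are on the PATH. The file names init_finalize.c and debug_summary.txt are hypothetical; the source file would hold only the MPI_Init and MPI_Finalize calls:</span></p>
<pre>
```shell
# Have MPICH print its configuration summary during initialization.
export MPIR_CVAR_DEBUG_SUMMARY=1

# init_finalize.c is a placeholder name for the minimal test program.
if [ -f init_finalize.c ]; then
    mpicc -o init_finalize init_finalize.c
    # Keep a copy of the console output so it can be posted to the list.
    mpiexec -n 2 ./init_finalize | tee debug_summary.txt
else
    echo "create init_finalize.c first; MPIR_CVAR_DEBUG_SUMMARY=$MPIR_CVAR_DEBUG_SUMMARY"
fi
```
</pre>
</div>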
<div>
<p class="x_xxxmsonormal"><span style="color:black"> </span></p>
</div>
<div>
<p class="x_xxxmsonormal"><span style="color:black">--</span></p>
</div>
<div>
<p class="x_xxxmsonormal"><span style="color:black">Hui</span></p>
</div>
<div class="x_MsoNormal" align="center" style="text-align:center"><span style="font-family:Aptos">
<hr size="2" width="98%" align="center">
</span></div>
<div id="x_x_x_x_divRplyFwdMsg">
<p class="x_xxxmsonormal"><b><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">From:</span></b><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black"> Palmer, Bruce J via discuss <</span><a href="mailto:discuss@mpich.org"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">discuss@mpich.org</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">><br>
<b>Sent:</b> Tuesday, June 11, 2024 3:17 PM<br>
<b>To:</b> </span><a href="mailto:discuss@mpich.org"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">discuss@mpich.org</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black"> <</span><a href="mailto:discuss@mpich.org"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">discuss@mpich.org</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">><br>
<b>Cc:</b> Palmer, Bruce J <</span><a href="mailto:Bruce.Palmer@pnnl.gov"><span style="font-size:11.0pt; font-family:"Calibri",sans-serif">Bruce.Palmer@pnnl.gov</span></a><span style="font-size:11.0pt; font-family:"Calibri",sans-serif; color:black">><br>
<b>Subject:</b> [mpich-discuss] Fail on MPI_Wait</span> </p>
<div>
<p class="x_xxxmsonormal"> </p>
</div>
</div>
<div>
<div>
<p class="x_xxxxmsonormal">Hi,</p>
<p class="x_xxxxmsonormal"> </p>
<p class="x_xxxxmsonormal">I’m trying to debug a GPU-aware runtime for the Global Arrays library. We had a version of this working a while ago, but it has mysteriously started failing and we are trying to track down why. Currently, we are getting failures in
MPI_Wait and were wondering if anyone could provide some information on what exactly seems to be failing inside the wait call. The error we are getting is</p>
<p class="x_xxxxmsonormal"> </p>
<p class="x_xxxxp1"><span class="x_xxxxs1">Abort(206752655) on node 0: Fatal error in internal_Wait: Other MPI error, error stack:</span></p>
<p class="x_xxxxp1"><span class="x_xxxxs1">internal_Wait(68205)..........: MPI_Wait(request=0x500847a0, status=0x7ffff9331800) failed</span></p>
<p class="x_xxxxp1"><span class="x_xxxxs1">MPIR_Wait(780)................:</span></p>
<p class="x_xxxxp1"><span class="x_xxxxs1">MPIR_Wait_state(737)..........:</span></p>
<p class="x_xxxxp1"><span class="x_xxxxs1">MPIDI_progress_test(134)......:</span></p>
<p class="x_xxxxp1"><span class="x_xxxxs1">MPIDI_OFI_handle_cq_error(793): OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)</span></p>
<p class="x_xxxxmsonormal"> </p>
<p class="x_xxxxmsonormal">I’ve verified that the handle corresponding to <span class="x_xxxxs1">
0x500847a0 is set earlier in the code by an MPI_Isend call and that no MPI_Wait or MPI_Test is called on the handle before the crash with the above error message. I’m using MPICH 4.2.1 built with gcc/8.3.0. The MPICH library was configured with</span></p>
<p class="x_xxxxmsonormal"><span class="x_xxxxs1"> </span></p>
<p class="x_xxxxp1"><span class="x_xxxxs1">../configure --prefix=/people/d3g293/mpich/mpich-4.2.1/build_newell/install \</span></p>
<p class="x_xxxxp1"><span class="x_xxxxapple-converted-space"> </span>
<span class="x_xxxxs1">--with-device=ch4:ofi:sockets --with-libfabric=embedded \</span></p>
<p class="x_xxxxp1"><span class="x_xxxxapple-converted-space"> </span>
<span class="x_xxxxs1">--without-ucx --enable-threads=multiple --with-slurm \</span></p>
<p class="x_xxxxp1"><span class="x_xxxxapple-converted-space"> </span>
<span class="x_xxxxs1">CC=gcc CXX=g++</span></p>
<p class="x_xxxxmsonormal"> </p>
<p class="x_xxxxmsonormal">I’ve tried building with UCX and gotten the same results.</p>
<p class="x_xxxxmsonormal"> </p>
<p class="x_xxxxmsonormal">Are these errors indicative of corruption of the request handle or problems with some internal MPI data structures or something else? Any information you can provide would be appreciated.</p>
<p class="x_xxxxmsonormal"><br>
Thanks,</p>
<p class="x_xxxxmsonormal">Bruce</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>