<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:Wingdings;
panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:DengXian;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:"\@DengXian";
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Menlo;
panose-1:2 11 6 9 3 8 4 2 2 4;}
@font-face
{font-family:"Lucida Grande";
panose-1:2 11 6 0 4 5 2 2 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:10.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0in;
margin-right:0in;
margin-bottom:0in;
margin-left:.5in;
font-size:10.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle21
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
p.p1, li.p1, div.p1
{mso-style-name:p1;
margin:0in;
font-size:8.5pt;
font-family:Menlo;
color:black;}
span.s1
{mso-style-name:s1;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:1511137873;
mso-list-type:hybrid;
mso-list-template-ids:-1715033584 1632531590 67698691 67698693 67698689 67698691 67698693 67698689 67698691 67698693;}
@list l0:level1
{mso-level-start-at:0;
mso-level-number-format:bullet;
mso-level-text:\F0D8 ;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Wingdings;
mso-fareast-font-family:Calibri;
mso-bidi-font-family:"Times New Roman";}
@list l0:level2
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:"Courier New";}
@list l0:level3
{mso-level-number-format:bullet;
mso-level-text:\F0A7 ;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Wingdings;}
@list l0:level4
{mso-level-number-format:bullet;
mso-level-text:\F0B7 ;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Symbol;}
@list l0:level5
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:"Courier New";}
@list l0:level6
{mso-level-number-format:bullet;
mso-level-text:\F0A7 ;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Wingdings;}
@list l0:level7
{mso-level-number-format:bullet;
mso-level-text:\F0B7 ;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Symbol;}
@list l0:level8
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:"Courier New";}
@list l0:level9
{mso-level-number-format:bullet;
mso-level-text:\F0A7 ;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Wingdings;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
--></style>
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt">I kind of dropped this for a while, but I'd like to pick it back up. I did some more testing using different versions of MPICH and got the following results for the MPI-RMA runtime:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">MPICH-3.1.4 configured with<o:p></o:p></span></p>
<p class="p1"><span class="s1">./configure --prefix=/people/d3g293/mpich/mpich-3.1.4/install --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">2/80 tests fail in GA test suite<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">MPICH-4.0.2 configured with<o:p></o:p></span></p>
<p class="p1"><span class="s1">unset F90</span><o:p></o:p></p>
<p class="p1"><span class="s1">./configure --prefix=/people/d3g293/mpich/mpich-4.0.2/install --with-device=ch4:ofi:sockets --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">25/80 tests fail in GA test suite<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Running with MPICH-3.3.2 seems to lead to around 8 failures, but my notes on this aren't that good.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">If I run with OpenMPI 4.1.4, everything passes. Any idea why I'm seeing this? I haven't really done much to this runtime in the last few years.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Bruce<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="margin-bottom:12.0pt"><b><span style="font-size:12.0pt;color:black">From:
</span></b><span style="font-size:12.0pt;color:black">Palmer, Bruce J via discuss <discuss@mpich.org><br>
<b>Date: </b>Wednesday, September 28, 2022 at 12:30 PM<br>
<b>To: </b>'Thakur, Rajeev' <thakur@anl.gov>, discuss@mpich.org <discuss@mpich.org>, Zhou, Hui <zhouh@anl.gov><br>
<b>Cc: </b>Palmer, Bruce J <Bruce.Palmer@pnnl.gov><br>
<b>Subject: </b>Re: [mpich-discuss] Crash on MPI_Rput<o:p></o:p></span></p>
</div>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<div>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">I think the MPI-RMA runtime was mostly (maybe completely) working with 3.2-3.4. It may have even been working earlier with 4.0. I think there is a pretty good chance that the problem is a system
configuration problem at our end and I was hoping that you might have some insight into what it might be based on the errors I'm seeing. I can try running with a few earlier versions of mpich and see if any of them work better.</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"> </span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Bruce</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"> </span><o:p></o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="font-size:11.0pt">From:</span></b><span style="font-size:11.0pt"> Thakur, Rajeev <thakur@anl.gov>
<br>
<b>Sent:</b> Wednesday, September 28, 2022 12:24 PM<br>
<b>To:</b> discuss@mpich.org; Zhou, Hui <zhouh@anl.gov><br>
<b>Cc:</b> Palmer, Bruce J <Bruce.Palmer@pnnl.gov><br>
<b>Subject:</b> Re: [mpich-discuss] Crash on MPI_Rput</span><o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"> <o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Lucida Grande",sans-serif">Was it working with an earlier version of MPICH? If so, which one?</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Lucida Grande",sans-serif"> </span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Lucida Grande",sans-serif">Rajeev</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Lucida Grande",sans-serif"> </span><o:p></o:p></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">"Palmer, Bruce J via discuss" <<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>><br>
<b>Reply-To: </b>"<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>" <<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>><br>
<b>Date: </b>Wednesday, September 28, 2022 at 2:20 PM<br>
<b>To: </b>"Zhou, Hui" <<a href="mailto:zhouh@anl.gov">zhouh@anl.gov</a>>, "<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>" <<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>><br>
<b>Cc: </b>"Palmer, Bruce J" <<a href="mailto:Bruce.Palmer@pnnl.gov">Bruce.Palmer@pnnl.gov</a>><br>
<b>Subject: </b>Re: [mpich-discuss] Crash on MPI_Rput</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
</div>
<p class="MsoNormal"><span style="font-size:11.0pt">I upgraded to mpich-4.0.2 (latest stable release) and get pretty much the same result. This failure is reproducible: I get the same error on multiple runs, so it doesn't look like an unexpected process failure.</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt">One other detail that I forgot to mention earlier is that I'm running this test on 4 processes distributed over 2 nodes. If I run 4 processes on 1 node, the code runs without error.</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Bruce</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="margin-bottom:12.0pt"><b><span style="font-size:12.0pt;color:black">From:
</span></b><span style="font-size:12.0pt;color:black">Zhou, Hui <<a href="mailto:zhouh@anl.gov">zhouh@anl.gov</a>><br>
<b>Date: </b>Tuesday, September 27, 2022 at 2:55 PM<br>
<b>To: </b><a href="mailto:discuss@mpich.org">discuss@mpich.org</a> <<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>><br>
<b>Cc: </b>Palmer, Bruce J <<a href="mailto:Bruce.Palmer@pnnl.gov">Bruce.Palmer@pnnl.gov</a>><br>
<b>Subject: </b>Re: Crash on MPI_Rput</span><o:p></o:p></p>
</div>
<p class="MsoNormal"><span style="font-size:11.0pt">Hi Bruce,</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<ul style="margin-top:0in" type="disc">
<li class="MsoNormal" style="mso-list:l0 level1 lfo2"><span style="font-size:11.0pt">srun: error: node003: task 1: Exited with exit code 7</span><o:p></o:p></li></ul>
<p class="MsoListParagraph"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Looks like one of the processes crashed unexpectedly.</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<div>
<div>
<div>
<p class="MsoNormal"><span style="font-size:11.0pt">-- <br>
Hui Zhou</span><o:p></o:p></p>
</div>
</div>
</div>
<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:12.0pt;margin-left:.5in">
<b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">Palmer, Bruce J via discuss <<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>><br>
<b>Date: </b>Tuesday, September 27, 2022 at 3:32 PM<br>
<b>To: </b><a href="mailto:discuss@mpich.org">discuss@mpich.org</a> <<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>><br>
<b>Cc: </b>Palmer, Bruce J <<a href="mailto:Bruce.Palmer@pnnl.gov">Bruce.Palmer@pnnl.gov</a>><br>
<b>Subject: </b>[mpich-discuss] Crash on MPI_Rput</span><o:p></o:p></p>
</div>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">Hi,</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">I'm testing the MPI-RMA runtime in Global Arrays and I'm getting a lot more crashes than I've seen in the past. The MPI-RMA runtime code is fairly stable and hasn't been modified much
recently, and all the tests used to pass using one of the more recent MPICH releases. However, I'm getting significant crashes at this point. One of them occurs in a program designed to test non-blocking communication. It creates an MPI window using MPI_Alloc_mem
followed by MPI_Win_create, and then calls MPI_Win_lock_all on the window. The code currently crashes when it gets to an MPI_Rput call. I'm trying to see if there is something different in the environment that might be causing this.</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">I'm currently up to MPICH-4.0b1 configured with</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">./configure --prefix=/people/d3g293/mpich/mpich-4.0b1/install --with-device=ch4:ofi:sockets --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">#./configure --prefix=/people/d3g293/mpich/mpich-3.4.1/install-newell-nocuda --with-device=ch4:ofi:sockets --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">I've tried other recent vintages of MPICH, but I get similar results. The error I'm seeing when the program crashes is</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">[proxy:0:1@node003.local] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:899): assert (!closed) failed</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">[proxy:0:1@node003.local] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">srun: error: node003: task 1: Exited with exit code 7</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">[proxy:0:1@node003.local] main (pm/pmiserv/pmip.c:169): demux engine error waiting for event</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">[mpiexec@node002.local] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:74): one of the processes terminated badly; aborting</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">[mpiexec@node002.local] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:21): launcher returned error waiting for completion</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">[mpiexec@node002.local] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:179): launcher returned error waiting for completion</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">[mpiexec@node002.local] main (ui/mpich/mpiexec.c:325): process manager error waiting for completion</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">Any suggestions about what might be going wrong here? It could be a problem with the machine configuration, since this code seemed to be running fine a while ago and has not been modified
since then. I'll try building the latest stable release and see if that fixes anything, but as I mentioned, none of the recent releases seems to work.</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">Bruce Palmer</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">Computer Scientist</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">Pacific Northwest National Laboratory</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt">(509) 375-3899</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
</div>
</div>
</body>
</html>