<div dir="ltr">Kurt,<div><br></div><div>Your flags are correct. With these flags OMPI uses a totally different communication engine, which suggests that if the error persists it might indeed be in the application.</div><div><br></div><div>Sorry,</div><div> George.</div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Sep 15, 2023 at 3:01 PM Mccall, Kurt E. (MSFC-EV41) <<a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg4537108238962042312">
<div lang="EN-US" style="overflow-wrap: break-word;">
<div class="m_4537108238962042312WordSection1">
<p class="MsoNormal">George, thanks for the idea. With those flags, OMPI mpirun said that “sm” was no longer available, and suggested “vader”. So my flags were<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">-mca pml ob1 --mca btl <span style="color:red">vader</span>,self,tcp
<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">Is that still a valid test of OMPI? The errors I have been seeing continued to occur with the flags.<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">Kurt<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
From: George Bosilca via discuss <discuss@mpich.org>
Sent: Friday, September 15, 2023 1:34 PM
To: discuss@mpich.org
Cc: George Bosilca <bosilca@icl.utk.edu>; Raffenetti, Ken <raffenet@anl.gov>
Subject: Re: [mpich-discuss] Buffer corruption due to an excessive number of messages
<p class="MsoNormal"><u></u> <u></u></p>
<table border="1" cellspacing="0" cellpadding="0" align="left" style="border:1.5pt solid black">
<tbody>
<tr>
<td width="100%" style="width:100%;border:none;background:rgb(255,235,156);padding:3.75pt">
<p class="MsoNormal">
<b><span style="font-size:10pt;color:black">CAUTION:</span></b><span style="color:black">
</span><span style="font-size:10pt;color:black">This email originated from outside of NASA. Please take care when clicking links or opening attachments. Use the "Report Message" button to report suspicious messages to the NASA SOC.</span><span style="color:black">
</span><u></u><u></u></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal" style="margin-bottom:12pt"><br>
<br>
<u></u><u></u></p>
Kurt,

<p class="MsoNormal">There is another common component between current MPICH and Open MPI: UCX, that is handling the low level communications. I suggest to try to change the communication substrate to see if your issue continues to exist. For OMPI add `--mca
pml ob1 --mca btl self,sm,tcp' to your mpirun command. <u></u><u></u></p>
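
A full invocation would then look something like this (the process count and executable here are placeholders for whatever you normally run):

    mpirun -np 4 --mca pml ob1 --mca btl self,sm,tcp ./your_app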

George.

On Fri, Sep 15, 2023 at 11:20 AM Joachim Jenke via discuss <discuss@mpich.org> wrote:
<p class="MsoNormal">Am 15.09.23 um 17:09 schrieb Tony Curtis via discuss:<br>
> <br>
> <br>
>> On Sep 15, 2023, at 11:07 AM, Raffenetti, Ken via discuss <br>
>> <<a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a>> wrote:<br>
>><br>
>> 1. Is there a way to detect this kind of overload with an MPI call?<br>
>><br>
>> If MPI detects an error at runtime, the default behavior is to abort <br>
>> the application. If you application does not abort (and you haven't <br>
>> changed the default error handler), then no error was detected by MPI.<br>
>><br>
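(As a side note on the error-handler point: making MPI calls return error codes instead of aborting is a one-line change. A minimal illustrative sketch, which deliberately passes an invalid count so that MPI_Send reports an error:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Return error codes to the caller instead of aborting (the default). */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Deliberately invalid: a negative count triggers an error. */
    int buf = 0;
    int err = MPI_Send(&buf, -1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "MPI reported: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}
)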
>
> There's a tool called MUST that might help:
> https://www.i12.rwth-aachen.de/go/id/nrbe
>
The current release version can only detect conflicts in buffer usage at
the MPI API level. That means it will only detect buffer conflicts for
in-flight messages, as in:
MPI_Irecv(buf,     10, MPI_INT, ..., &req1);  /* receives into buf[0..9]     */
MPI_Irecv(&buf[9], 10, MPI_INT, ..., &req2);  /* overlaps buf[9]: a conflict */
MPI_Wait(&req1, ...);
MPI_Wait(&req2, ...);

The upcoming release I was referencing in my other mail would also detect
conflicting accesses to in-flight buffers, as in:

MPI_Irecv(buf, 10, MPI_INT, ..., &req);
buf[5] = 5;       /* modifies the buffer while the receive is in flight */
MPI_Wait(&req, ...);
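
For anyone who wants to feed that second pattern to a tool, a minimal self-contained version might look like this (the tag and source values are arbitrary; run it with at least two ranks, e.g. via MUST's mustrun launcher):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf[10] = {0};
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Irecv(buf, 10, MPI_INT, 1, 42, MPI_COMM_WORLD, &req);
        buf[5] = 5;   /* conflicting access while the receive is in flight */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("buf[5] = %d\n", buf[5]);
    } else if (rank == 1) {
        int src[10] = {0};
        MPI_Send(src, 10, MPI_INT, 0, 42, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}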

> (Not affiliated, just happen to have been looking at it)

Happy to see that people look at the tool :D

- Joachim

> Tony
--
Dr. rer. nat. Joachim Jenke

IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80-24765
Fax: +49 241 80-624765
jenke@itc.rwth-aachen.de
www.itc.rwth-aachen.de

_______________________________________________
discuss mailing list discuss@mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss