[mpich-discuss] Buffer corruption due to an excessive number of messages

Mccall, Kurt E. (MSFC-EV41) kurt.e.mccall at nasa.gov
Fri Sep 15 14:01:26 CDT 2023


George, thanks for the idea. With those flags, OMPI mpirun said that “sm” was no longer available and suggested “vader” instead. So my flags were:

-mca pml ob1 --mca btl vader,self,tcp

Is that still a valid test of OMPI? The errors I have been seeing continued to occur with those flags.

Kurt


From: George Bosilca via discuss <discuss at mpich.org>
Sent: Friday, September 15, 2023 1:34 PM
To: discuss at mpich.org
Cc: George Bosilca <bosilca at icl.utk.edu>; Raffenetti, Ken <raffenet at anl.gov>
Subject: [EXTERNAL] [BULK] Re: [mpich-discuss] Buffer corruption due to an excessive number of messages

Kurt,

Current MPICH and Open MPI share another common component: UCX, which handles the low-level communication. I suggest changing the communication substrate to see whether your issue persists. For OMPI, add `--mca pml ob1 --mca btl self,sm,tcp` to your mpirun command.

George.


On Fri, Sep 15, 2023 at 11:20 AM Joachim Jenke via discuss <discuss at mpich.org> wrote:
On 15.09.23 at 17:09, Tony Curtis via discuss wrote:
>
>
>> On Sep 15, 2023, at 11:07 AM, Raffenetti, Ken via discuss
>> <discuss at mpich.org> wrote:
>>
>> 1. Is there a way to detect this kind of overload with an MPI call?
>>
>> If MPI detects an error at runtime, the default behavior is to abort
>> the application. If your application does not abort (and you haven't
>> changed the default error handler), then no error was detected by MPI.
>>
>
> There’s a tool called MUST that might help:
>
> MUST - RWTH Aachen University, Lehrstuhl für Informatik 12
> <https://www.i12.rwth-aachen.de/go/id/nrbe>
>

The current release version can only detect conflicts in buffer usage at
the MPI API level. That is, it only detects buffer conflicts between
in-flight messages, as in:

MPI_Irecv(buf, 10, MPI_INT, ..., &req1);
MPI_Irecv(&buf[9], 10, MPI_INT, ..., &req2);  /* overlaps the first receive at buf[9] */
MPI_Wait(&req1, ...);
MPI_Wait(&req2, ...);

The upcoming release I was referencing in my other mail would detect
conflicting accesses to in-flight buffers as in:

MPI_Irecv(buf, 10, MPI_INT, ..., &req);
buf[5] = 5;  /* write to the buffer while the receive is still pending */
MPI_Wait(&req, ...);
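
As a rough illustration of what "buffer conflict" means in the two snippets above, here is a minimal sketch in plain C (no MPI required). It treats each buffer as a byte range and tests the ranges for overlap. The helper names ranges_overlap, case_two_receives, case_write_in_flight and case_disjoint are hypothetical, and reading the 10 in the snippets as an element count is an assumption. This is not how MUST is implemented; it only shows the underlying idea.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of MPI or MUST): returns 1 if the byte
 * ranges [a, a + a_len) and [b, b + b_len) share at least one byte,
 * and 0 otherwise. Empty ranges never overlap anything. */
int ranges_overlap(const void *a, size_t a_len, const void *b, size_t b_len)
{
    uintptr_t a0 = (uintptr_t)a;
    uintptr_t b0 = (uintptr_t)b;

    if (a_len == 0 || b_len == 0)
        return 0;
    return a0 < b0 + b_len && b0 < a0 + a_len;
}

/* Case 1: two pending receives of 10 ints each, posted at buf and
 * &buf[9]. Both ranges contain buf[9], so they conflict. */
int case_two_receives(void)
{
    int buf[32];
    return ranges_overlap(buf, 10 * sizeof(int),
                          &buf[9], 10 * sizeof(int));
}

/* Case 2: a store to buf[5] while a 10-int receive into buf is still
 * pending. The one-element write lies inside the receive range. */
int case_write_in_flight(void)
{
    int buf[32];
    return ranges_overlap(buf, 10 * sizeof(int),
                          &buf[5], sizeof(int));
}

/* Control: a second 10-int receive posted at &buf[10] starts exactly
 * where the first one ends, so the ranges are disjoint. */
int case_disjoint(void)
{
    int buf[32];
    return ranges_overlap(buf, 10 * sizeof(int),
                          &buf[10], 10 * sizeof(int));
}
```

Under these assumptions, both cases from the snippets report a conflict, while moving the second receive to &buf[10] does not. A real tool also has to know which ranges are currently in flight, which this sketch leaves out.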

>
> (Not affiliated, just happen to have been looking at it)

Happy to see that people look at the tool :D

- Joachim

>
> Tony
>
>
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss

--
Dr. rer. nat. Joachim Jenke

IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074  Aachen (Germany)
Tel: +49 241 80- 24765
Fax: +49 241 80-624765
jenke at itc.rwth-aachen.de
www.itc.rwth-aachen.de
