[mpich-discuss] Buffer corruption due to an excessive number of messages

George Bosilca bosilca at icl.utk.edu
Fri Sep 15 14:05:22 CDT 2023


Kurt,

Your flags are correct. With these flags OMPI uses a totally different
communication engine, which suggests that if the error persists it might
indeed be in the application.

Sorry,
  George.



On Fri, Sep 15, 2023 at 3:01 PM Mccall, Kurt E. (MSFC-EV41) <
kurt.e.mccall at nasa.gov> wrote:

> George, thanks for the idea. With those flags, OMPI mpirun said that “sm”
> was no longer available, and suggested “vader”. So my flags were
>
>
>
> -mca pml ob1 --mca btl vader,self,tcp
>
>
>
> Is that still a valid test of OMPI? The errors I have been seeing
> continued to occur with the flags.
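>
> (For illustration, a hedged way to check which BTL components a given Open
> MPI build actually provides; the exact ompi_info output format varies by
> release:)
>
> # List the BTL components compiled into this Open MPI installation.
> # Older releases name the shared-memory BTL "vader"; newer ones name it "sm".
> ompi_info | grep "MCA btl"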
>
>
>
> Kurt
>
>
>
>
>
> From: George Bosilca via discuss <discuss at mpich.org>
> Sent: Friday, September 15, 2023 1:34 PM
> To: discuss at mpich.org
> Cc: George Bosilca <bosilca at icl.utk.edu>; Raffenetti, Ken <raffenet at anl.gov>
> Subject: [EXTERNAL] [BULK] Re: [mpich-discuss] Buffer corruption due to an excessive number of messages
>
>
>
>
>
>
> Kurt,
>
>
>
> There is another component common to current MPICH and Open MPI: UCX,
> which handles the low-level communications. I suggest changing the
> communication substrate to see whether your issue persists. For OMPI, add
> `--mca pml ob1 --mca btl self,sm,tcp` to your mpirun command.
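>
> (For illustration, a hedged sketch of the full command; the process count
> and executable name are placeholders:)
>
> # Force the ob1 PML with the shared-memory and TCP BTLs so UCX is not used.
> mpirun -np 16 --mca pml ob1 --mca btl self,sm,tcp ./my_app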
>
>
>
> George.
>
>
>
>
>
> On Fri, Sep 15, 2023 at 11:20 AM Joachim Jenke via discuss <
> discuss at mpich.org> wrote:
>
> On 15.09.23 at 17:09, Tony Curtis via discuss wrote:
> >
> >
> >> On Sep 15, 2023, at 11:07 AM, Raffenetti, Ken via discuss
> >> <discuss at mpich.org> wrote:
> >>
> >> 1. Is there a way to detect this kind of overload with an MPI call?
> >>
> >> If MPI detects an error at runtime, the default behavior is to abort
> >> the application. If your application does not abort (and you haven't
> >> changed the default error handler), then no error was detected by MPI.
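> >>
> >> (For illustration, a minimal hedged sketch of switching MPI_COMM_WORLD to
> >> MPI_ERRORS_RETURN so that errors come back as return codes instead of
> >> aborting; the surrounding program and the deliberately invalid send are
> >> hypothetical:)
> >>
> >> #include <mpi.h>
> >> #include <stdio.h>
> >>
> >> int main(int argc, char **argv)
> >> {
> >>     MPI_Init(&argc, &argv);
> >>
> >>     /* Replace the default MPI_ERRORS_ARE_FATAL handler so failing calls
> >>        return an error code instead of aborting the job. */
> >>     MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> >>
> >>     int size, dummy = 42;
> >>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>
> >>     /* Send to an out-of-range rank (valid ranks are 0..size-1) to show
> >>        the error path. */
> >>     int rc = MPI_Send(&dummy, 1, MPI_INT, size, 0, MPI_COMM_WORLD);
> >>     if (rc != MPI_SUCCESS) {
> >>         char msg[MPI_MAX_ERROR_STRING];
> >>         int len;
> >>         MPI_Error_string(rc, msg, &len);
> >>         fprintf(stderr, "MPI error: %s\n", msg);
> >>     }
> >>
> >>     MPI_Finalize();
> >>     return 0;
> >> }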
> >>
> >
> > There’s a tool called MUST that might help
> >
> > MUST - RWTH Aachen University: <https://www.i12.rwth-aachen.de/go/id/nrbe>
> >
>
> The current release version can only detect conflicts in buffer usage at
> the MPI API level. That means it will only detect buffer conflicts between
> in-flight messages, as in:
>
> MPI_Irecv(buf,     10, MPI_INT, ..., &req1);  /* receives into buf[0..9]     */
> MPI_Irecv(&buf[9], 10, MPI_INT, ..., &req2);  /* overlaps buf[9]: a conflict */
> MPI_Wait(&req1, ...);
> MPI_Wait(&req2, ...);
>
> The upcoming release I was referencing in my other mail would detect
> conflicting accesses to in-flight buffers as in:
>
> MPI_Irecv(buf, 10, MPI_INT, ..., &req);
> buf[5] = 5;            /* written while the receive is still in flight: a conflict */
> MPI_Wait(&req, ...);
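>
> (For anyone who wants to try it, a hedged sketch of running an application
> under MUST, assuming its mustrun launcher wrapper; the process count and
> binary name are placeholders:)
>
> # Build with the MPI compiler wrapper as usual, then launch through MUST.
> # MUST intercepts the MPI calls and writes an HTML report listing errors
> # such as overlapping or still-in-flight receive buffers.
> mpicc -g -o my_app my_app.c
> mustrun -np 8 ./my_app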
>
> >
> > (Not affiliated, just happen to have been looking at it)
>
> Happy to see that people look at the tool :D
>
> - Joachim
>
> >
> > Tony
> >
> >
> >
>
> --
> Dr. rer. nat. Joachim Jenke
>
> IT Center
> Group: High Performance Computing
> Division: Computational Science and Engineering
> RWTH Aachen University
> Seffenter Weg 23
> D 52074  Aachen (Germany)
> Tel: +49 241 80-24765
> Fax: +49 241 80-624765
> jenke at itc.rwth-aachen.de
> www.itc.rwth-aachen.de
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>
>

