[mpich-discuss] Buffer corruption due to an excessive number of messages

George Bosilca bosilca at icl.utk.edu
Fri Sep 15 13:33:30 CDT 2023


Kurt,

There is another component common to current MPICH and Open MPI: UCX,
which handles the low-level communication. I suggest changing the
communication substrate to see whether your issue persists. For Open MPI,
add `--mca pml ob1 --mca btl self,sm,tcp` to your mpirun command.

George.


On Fri, Sep 15, 2023 at 11:20 AM Joachim Jenke via discuss <
discuss at mpich.org> wrote:

> Am 15.09.23 um 17:09 schrieb Tony Curtis via discuss:
> >
> >
> >> On Sep 15, 2023, at 11:07 AM, Raffenetti, Ken via discuss
> >> <discuss at mpich.org> wrote:
> >>
> >> 1. Is there a way to detect this kind of overload with an MPI call?
> >>
> >> If MPI detects an error at runtime, the default behavior is to abort
> >> the application. If your application does not abort (and you haven't
> >> changed the default error handler), then no error was detected by MPI.
> >>
> >
> > There’s a tool called MUST that might help:
> >
> > MUST - RWTH Aachen University, Lehrstuhl für Informatik 12
> > <https://www.i12.rwth-aachen.de/go/id/nrbe>
> >
>
> The current release version can only detect conflicts in buffer usage at
> the MPI API level. That is, it only detects buffer conflicts between
> in-flight messages, as in:
>
> MPI_Irecv(buf, 10, MPI_INT, ..., &req1);
> MPI_Irecv(&buf[9], 10, MPI_INT, ..., &req2);
> MPI_Wait(&req1, ...);
> MPI_Wait(&req2, ...);
>
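To make the first case concrete: reading the 10 in the two receives above as the element count, both buffers cover buf[9], so they conflict while both requests are in flight. The following is only a minimal sketch of that kind of byte-range overlap test, not MUST's actual implementation. The function names `ranges_overlap` and `int_recv_buffers_conflict` are hypothetical, and the sketch assumes contiguous buffers of `int` (real MPI datatypes can describe non-contiguous layouts, which this does not handle).

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the byte ranges [a, a + a_bytes) and [b, b + b_bytes)
 * share at least one byte, 0 otherwise. Empty ranges never overlap.
 * Hypothetical helper for illustration only. */
int ranges_overlap(const void *a, size_t a_bytes,
                   const void *b, size_t b_bytes)
{
    uintptr_t a_lo = (uintptr_t)a;
    uintptr_t b_lo = (uintptr_t)b;

    if (a_bytes == 0 || b_bytes == 0)
        return 0;

    /* Two half-open intervals intersect iff each one starts
     * before the other one ends. */
    return a_lo < b_lo + b_bytes && b_lo < a_lo + a_bytes;
}

/* Hypothetical wrapper: overlap test for two contiguous receive
 * buffers holding a_count and b_count elements of type int. */
int int_recv_buffers_conflict(const int *a, size_t a_count,
                              const int *b, size_t b_count)
{
    return ranges_overlap(a, a_count * sizeof(int),
                          b, b_count * sizeof(int));
}
```

With `int buf[32];`, the call `int_recv_buffers_conflict(buf, 10, &buf[9], 10)` reports a conflict because both ranges include buf[9], while starting the second buffer at `&buf[10]` does not.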
> The upcoming release I was referencing in my other mail would detect
> conflicting accesses to in-flight buffers as in:
>
> MPI_Irecv(buf, 10, MPI_INT, ..., &req);
> buf[5] = 5;
> MPI_Wait(&req, ...);
>
> >
> > (Not affiliated, just happen to have been looking at it)
>
> Happy to see that people look at the tool :D
>
> - Joachim
>
> >
> > Tony
> >
> >
> >
> > _______________________________________________
> > discuss mailing list     discuss at mpich.org
> > To manage subscription options or unsubscribe:
> > https://lists.mpich.org/mailman/listinfo/discuss
>
> --
> Dr. rer. nat. Joachim Jenke
>
> IT Center
> Group: High Performance Computing
> Division: Computational Science and Engineering
> RWTH Aachen University
> Seffenter Weg 23
> D 52074  Aachen (Germany)
> Tel: +49 241 80- 24765
> Fax: +49 241 80-624765
> jenke at itc.rwth-aachen.de
> www.itc.rwth-aachen.de
>
>