[mpich-discuss] Buffer corruption due to an excessive number of messages

Raffenetti, Ken raffenet at anl.gov
Fri Sep 15 10:07:25 CDT 2023


1. Is there a way to detect this kind of overload with an MPI call?

If MPI detects an error at runtime, the default behavior is to abort the application. If your application does not abort (and you haven't changed the default error handler), then no error was detected by MPI.
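As an illustration (not from the original message), one way to surface errors instead of aborting is to switch the communicator's error handler to MPI_ERRORS_RETURN and check return codes. A minimal sketch; the deliberately invalid destination rank is just there to provoke a returned error:

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: make MPI return error codes instead of aborting, so the
 * application can inspect them itself. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Default is MPI_ERRORS_ARE_FATAL; switch to MPI_ERRORS_RETURN. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int payload = 42;
    /* dest == size is out of range, so this send fails and returns
     * an error code rather than aborting the job. */
    int rc = MPI_Send(&payload, 1, MPI_INT, size, /*tag=*/0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_Send failed: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}
```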

2. Is there an upper bound on the number of messages that can be "in flight"?

There is, but it depends on the build configuration and on factors such as the number of communicators in use, so it is difficult to give an exact answer. However, if you exceed the limit, MPI should fail and let you know something went wrong.

3. Is there an upper bound on message length?

Not really. We test very large message lengths in our regression tests; counts of INT_MAX are well tested. Larger counts are supported using the new large-count _c interfaces added in MPI-4 (MPI_Send_c, MPI_Recv_c, etc.).
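For reference, a minimal sketch of the MPI-4 large-count interface (the _c bindings take MPI_Count instead of int; the helper function names and counts here are illustrative, not from the thread):

```c
#include <mpi.h>

/* Sketch: sending/receiving more than INT_MAX elements with the
 * MPI-4 large-count ("_c") interfaces. */
void send_large(const char *buf, MPI_Count n, int dest, MPI_Comm comm)
{
    /* MPI_Send_c takes an MPI_Count, so n may exceed INT_MAX. */
    MPI_Send_c(buf, n, MPI_CHAR, dest, /*tag=*/0, comm);
}

void recv_large(char *buf, MPI_Count n, int src, MPI_Comm comm)
{
    MPI_Status status;
    MPI_Recv_c(buf, n, MPI_CHAR, src, /*tag=*/0, comm, &status);

    /* The large-count query returns the actual element count received. */
    MPI_Count received;
    MPI_Get_count_c(&status, MPI_CHAR, &received);
    (void)received;
}
```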

Or is there some other possible cause that I haven't thought of?

Is it possible that the same buffer is posted to different receive operations simultaneously? MPI will not check for this, and concurrent writes to the same buffer could result in corrupted data. Another possibility is message underflow (the amount received is less than the size of the posted buffer), meaning the contents of the buffer beyond the received message are not modified by MPI and may be garbage.
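To illustrate the underflow point: after a receive completes, MPI_Get_count reports how many elements actually arrived, which may be fewer than the posted buffer size. A sketch (function and variable names are placeholders):

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: detect message underflow. If the received element count is
 * smaller than the posted buffer size, the elements past 'received'
 * were never written by MPI and must not be read as message data. */
void recv_and_check(int *buf, int max_count, int src, int tag, MPI_Comm comm)
{
    MPI_Status status;
    MPI_Recv(buf, max_count, MPI_INT, src, tag, comm, &status);

    int received;
    MPI_Get_count(&status, MPI_INT, &received);
    if (received < max_count) {
        /* buf[received .. max_count-1] is untouched garbage. */
        fprintf(stderr, "short message: %d of %d elements\n",
                received, max_count);
    }
}
```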

Ken

-----Original Message-----
From: "Mccall, Kurt E. (MSFC-EV41) via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Friday, September 15, 2023 at 10:01 AM
To: "Thakur, Rajeev" <thakur at anl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Cc: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov>
Subject: Re: [mpich-discuss] Buffer corruption due to an excessive number of messages


Yes, I tried it with Open MPI and the same problem occurred.


Kurt


-----Original Message-----
From: Thakur, Rajeev <thakur at anl.gov>
Sent: Friday, September 15, 2023 9:59 AM
To: discuss at mpich.org
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: [EXTERNAL] [BULK] Re: [mpich-discuss] Buffer corruption due to an excessive number of messages


Does it happen with other MPI implementations?


Rajeev


-----Original Message-----
From: "Mccall, Kurt E. (MSFC-EV41) via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Friday, September 15, 2023 at 9:43 AM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov>
Subject: Re: [mpich-discuss] Buffer corruption due to an excessive number of messages

Joachim,

Unfortunately, using MPI_Improbe/MPI_Mrecv didn't solve the problem -- I am still receiving buffers with invalid data objects near the ends of the buffers. The problem goes away when I reduce the size of the job (number of nodes), which makes me think the large number of messages is causing the problem.

1. Is there a way to detect this kind of overload with an MPI call?
2. Is there an upper bound on the number of messages that can be "in flight"?
3. Is there an upper bound on message length?

Or is there some other possible cause that I haven't thought of?

Thanks,
Kurt

-----Original Message-----
From: Joachim Jenke via discuss <discuss at mpich.org>
Sent: Thursday, September 14, 2023 3:10 PM
To: discuss at mpich.org
Cc: Joachim Jenke <jenke at itc.rwth-aachen.de>
Subject: [EXTERNAL] [BULK] Re: [mpich-discuss] Buffer corruption due to an excessive number of messages

Hi Kurt,

just a thought: do you execute single-threaded or multi-threaded?

In case of multi-threaded execution, you should look into MPI_Improbe/MPI_Mrecv to make sure that you really receive the message you probed for.
Even in single-threaded execution, you might try whether using these functions instead fixes your issue.
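The MPI_Improbe/MPI_Mrecv pattern might look like the sketch below (tag, source, and datatype are placeholders). The MPI_Message handle ties the receive to the exact message that was probed, which a separate MPI_Iprobe-then-MPI_Recv pair does not guarantee when several threads receive concurrently:

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch: matched probe + receive. Unlike MPI_Iprobe followed by
 * MPI_Recv, MPI_Mrecv consumes exactly the message MPI_Improbe
 * matched, so another thread cannot intercept it in between. */
void drain_one(MPI_Comm comm)
{
    int flag;
    MPI_Message msg;
    MPI_Status status;

    MPI_Improbe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &msg, &status);
    if (!flag)
        return; /* nothing pending */

    int count;
    MPI_Get_count(&status, MPI_BYTE, &count);

    char *buf = malloc(count);   /* size the buffer to the probed message */
    MPI_Mrecv(buf, count, MPI_BYTE, &msg, MPI_STATUS_IGNORE);
    /* ... process buf ... */
    free(buf);
}
```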

Best
Joachim

On 14.09.23 at 22:02, Mccall, Kurt E. (MSFC-EV41) via discuss wrote:
> It seems that when I send a process too many non-blocking messages (with
> MPI_Isend), MPI_Iprobe/MPI_Recv sometimes returns a buffer with corrupted
> data for some of the messages. Usually the corrupted data objects are at
> the end of the array that was sent. I checked the buffers passed to
> MPI_Isend, and they are uncorrupted.
>
> 1. Is there a way to detect this kind of overload with an MPI call?
> 2. Is there an upper bound on the number of messages that can be "in
> flight"?
> 3. Is there an upper bound on message length?
>
> Thanks,
>
> Kurt
>
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss

--
Dr. rer. nat. Joachim Jenke

IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80-24765
Fax: +49 241 80-624765
jenke at itc.rwth-aachen.de
http://www.itc.rwth-aachen.de/

_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss