[mpich-discuss] Buffer corruption due to an excessive number of messages

Joachim Jenke jenke at itc.rwth-aachen.de
Fri Sep 15 10:08:51 CDT 2023


If you run into the same issue with different MPI implementations, this 
actually sounds like an issue in the application. Either the sender 
overwrites the buffer between the isend and the wait, or the receiver 
side overwrites the receive buffer between the recv and your buffer 
verification.
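
For illustration, a minimal sketch of the ownership rule (the function
and variable names are made up, not taken from your application): the
buffer passed to MPI_Isend belongs to MPI until the matching MPI_Wait
or MPI_Test completes the request.

   #include <mpi.h>

   void send_block(int *buf, int count, int dest, int tag, MPI_Comm comm)
   {
       MPI_Request req;

       MPI_Isend(buf, count, MPI_INT, dest, tag, comm, &req);

       /* WRONG here: reusing or overwriting buf before the wait, e.g.
        * memset(buf, 0, count * sizeof(int)), would corrupt the data
        * the receiver eventually sees. */

       MPI_Wait(&req, MPI_STATUS_IGNORE);  /* only now may buf be reused */
   }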

We are wrapping up a MUST release that tackles the detection of this 
kind of issue. You would need to compile your code with either the LLVM 
or GNU compiler. If you are interested in the details, just let me know.

- Joachim


Am 15.09.23 um 17:00 schrieb Mccall, Kurt E. (MSFC-EV41) via discuss:
> Yes, I tried it with OpenMPI and the same problem occurred.
> 
> Kurt
> 
> -----Original Message-----
> From: Thakur, Rajeev <thakur at anl.gov>
> Sent: Friday, September 15, 2023 9:59 AM
> To: discuss at mpich.org
> Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
> Subject: [EXTERNAL] [BULK] Re: [mpich-discuss] Buffer corruption due to an excessive number of messages
> 
> Does it happen with other MPI implementations?
> 
> Rajeev
> 
> -----Original Message-----
> From: "Mccall, Kurt E. (MSFC-EV41) via discuss" <discuss at mpich.org <mailto:discuss at mpich.org>>
> Reply-To: "discuss at mpich.org <mailto:discuss at mpich.org>" <discuss at mpich.org <mailto:discuss at mpich.org>>
> Date: Friday, September 15, 2023 at 9:43 AM
> To: "discuss at mpich.org <mailto:discuss at mpich.org>" <discuss at mpich.org <mailto:discuss at mpich.org>>
> Cc: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov <mailto:kurt.e.mccall at nasa.gov>>
> Subject: Re: [mpich-discuss] Buffer corruption due to an excessive number of messages
> 
> 
> Joachim,
> 
> 
> Unfortunately, using MPI_Improbe/MPI_Mrecv didn't solve the problem -- I am still receiving buffers with invalid data objects near the ends of the buffers. The problem goes away when I reduce the size of the job (number of nodes), which makes me think the large number of messages is causing the problem.
> 
> 
> 1. Is there a way to detect this kind of overload with an MPI call?
> 2. Is there an upper bound on the number of messages that can be "in flight"?
> 3. Is there an upper bound on message length?
> 
> 
> Or is there some other possible cause that I haven't thought of?
> 
> 
> Thanks,
> Kurt
> 
> 
> -----Original Message-----
> From: Joachim Jenke via discuss <discuss at mpich.org>
> Sent: Thursday, September 14, 2023 3:10 PM
> To: discuss at mpich.org
> Cc: Joachim Jenke <jenke at itc.rwth-aachen.de>
> Subject: [EXTERNAL] [BULK] Re: [mpich-discuss] Buffer corruption due to an excessive number of messages
> 
> 
> Hi Kurt,
> 
> 
> just a thought: do you execute single-threaded or multi-threaded?
> 
> 
> In case of multi-threaded execution, you should look into MPI_Improbe/MPI_Mrecv just to make sure that you really receive the message you probed for.
> Even in single-threaded execution, it may be worth trying these functions to see whether they fix your issue.
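> 
> For illustration, a minimal sketch of the matched-probe pattern (the
> function and variable names below are made up, not taken from your
> code): MPI_Improbe hands back an MPI_Message handle, and MPI_Mrecv
> receives exactly that probed message, so no other thread or receive
> call can intercept it in between.
> 
>    #include <mpi.h>
>    #include <stdlib.h>
> 
>    /* Poll for one pending message and receive it into a fresh buffer;
>     * returns NULL if nothing is pending, otherwise sets *count_out. */
>    int *try_mrecv(MPI_Comm comm, int *count_out)
>    {
>        int flag;
>        MPI_Message msg;
>        MPI_Status status;
> 
>        MPI_Improbe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &msg, &status);
>        if (!flag)
>            return NULL;
> 
>        MPI_Get_count(&status, MPI_INT, count_out);
>        int *buf = malloc(*count_out * sizeof(int));
>        MPI_Mrecv(buf, *count_out, MPI_INT, &msg, &status);
>        return buf;
>    }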
> 
> 
> Best
> Joachim
> 
> 
> Am 14.09.23 um 22:02 schrieb Mccall, Kurt E. (MSFC-EV41) via discuss:
>> It seems that when I send a process too many non-blocking messages
>> (with MPI_Isend), MPI_Iprobe/MPI_Recv sometimes returns a buffer
>> with corrupted data for some of the messages. Usually the corrupted
>> data objects are at the end of the array that was sent. I checked the
>> buffers passed to MPI_Isend, and they are uncorrupted.
>>
>> 1. Is there a way to detect this kind of overload with an MPI call?
>> 2. Is there an upper bound on the number of messages that can be "in
>> flight"?
>> 3. Is there an upper bound on message length?
>>
>> Thanks,
>>
>> Kurt
>>
>>
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
> 
> 
> --
> Dr. rer. nat. Joachim Jenke
> 
> 
> IT Center
> Group: High Performance Computing
> Division: Computational Science and Engineering
> RWTH Aachen University
> Seffenter Weg 23
> D 52074 Aachen (Germany)
> Tel: +49 241 80-24765
> Fax: +49 241 80-624765
> jenke at itc.rwth-aachen.de
> http://www.itc.rwth-aachen.de/
> 
> 
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
> 
> 
> 
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss

-- 
Dr. rer. nat. Joachim Jenke

IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074  Aachen (Germany)
Tel: +49 241 80-24765
Fax: +49 241 80-624765
jenke at itc.rwth-aachen.de
www.itc.rwth-aachen.de


