[mpich-discuss] Buffer corruption due to an excessive number of messages

Thakur, Rajeev thakur at anl.gov
Fri Sep 15 10:03:02 CDT 2023


Then it is less likely to be a bug in the implementation.

Rajeev

-----Original Message-----
From: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov <mailto:kurt.e.mccall at nasa.gov>>
Date: Friday, September 15, 2023 at 10:01 AM
To: "Thakur, Rajeev" <thakur at anl.gov <mailto:thakur at anl.gov>>, "discuss at mpich.org <mailto:discuss at mpich.org>" <discuss at mpich.org <mailto:discuss at mpich.org>>
Subject: Re: [mpich-discuss] Buffer corruption due to an excessive number of messages


Yes, I tried it with OpenMPI and the same problem occurred.


Kurt


-----Original Message-----
From: Thakur, Rajeev <thakur at anl.gov>
Sent: Friday, September 15, 2023 9:59 AM
To: discuss at mpich.org
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: [EXTERNAL] [BULK] Re: [mpich-discuss] Buffer corruption due to an excessive number of messages


Does it happen with other MPI implementations?


Rajeev


-----Original Message-----
From: "Mccall, Kurt E. (MSFC-EV41) via discuss" <discuss at mpich.org <mailto:discuss at mpich.org> <mailto:discuss at mpich.org <mailto:discuss at mpich.org>>>
Reply-To: "discuss at mpich.org <mailto:discuss at mpich.org> <mailto:discuss at mpich.org <mailto:discuss at mpich.org>>" <discuss at mpich.org <mailto:discuss at mpich.org> <mailto:discuss at mpich.org <mailto:discuss at mpich.org>>>
Date: Friday, September 15, 2023 at 9:43 AM
To: "discuss at mpich.org <mailto:discuss at mpich.org> <mailto:discuss at mpich.org <mailto:discuss at mpich.org>>" <discuss at mpich.org <mailto:discuss at mpich.org> <mailto:discuss at mpich.org <mailto:discuss at mpich.org>>>
Cc: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov <mailto:kurt.e.mccall at nasa.gov> <mailto:kurt.e.mccall at nasa.gov <mailto:kurt.e.mccall at nasa.gov>>>
Subject: Re: [mpich-discuss] Buffer corruption due to an excessive number of messages


Joachim,

Unfortunately, using MPI_Improbe/MPI_Mrecv didn't solve the problem -- I am still receiving buffers with invalid data objects near the ends of the buffers. The problem goes away when I reduce the size of the job (number of nodes), which makes me think the large number of messages is the cause.

1. Is there a way to detect this kind of overload with an MPI call?
2. Is there an upper bound on the number of messages that can be "in flight"?
3. Is there an upper bound on message length?

Or is there some other possible cause that I haven't thought of?
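[Editor's note: one classic cause of this symptom, not raised in the thread, is touching a send buffer before its MPI_Isend request has completed -- MPI_Isend only starts the transfer, and the buffer is not yours again until MPI_Wait/MPI_Waitall says so. A minimal sketch of the correct lifetime discipline (all names, counts, and sizes here are illustrative placeholders, not taken from the poster's code):]

```c
/* Sketch: keeping MPI_Isend buffers alive until completion.
 * NMSG, LEN, dest, and tag are placeholder values. */
#include <mpi.h>
#include <stdlib.h>

#define NMSG 1000
#define LEN  4096

void send_many(MPI_Comm comm, int dest, int tag)
{
    MPI_Request reqs[NMSG];
    char *bufs[NMSG];

    for (int i = 0; i < NMSG; i++) {
        bufs[i] = malloc(LEN);
        /* ... fill bufs[i] with the message payload ... */
        MPI_Isend(bufs[i], LEN, MPI_BYTE, dest, tag, comm, &reqs[i]);
        /* Wrong here would be free(bufs[i]) or overwriting bufs[i]:
         * the MPI standard forbids touching the buffer until the
         * request completes, and doing so typically corrupts data
         * near the end of large or late-progressing messages. */
    }

    /* Only after completion may the buffers be reused or freed. */
    MPI_Waitall(NMSG, reqs, MPI_STATUSES_IGNORE);
    for (int i = 0; i < NMSG; i++)
        free(bufs[i]);
}
```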

Thanks,
Kurt

-----Original Message-----
From: Joachim Jenke via discuss <discuss at mpich.org>
Sent: Thursday, September 14, 2023 3:10 PM
To: discuss at mpich.org
Cc: Joachim Jenke <jenke at itc.rwth-aachen.de>
Subject: [EXTERNAL] [BULK] Re: [mpich-discuss] Buffer corruption due to an excessive number of messages

Hi Kurt,

just a thought: do you execute single-threaded or multi-threaded?

In case of multi-threaded execution, you should look into MPI_Improbe/MPI_Mrecv just to make sure that you really receive the message you probed for.
Even in single-threaded execution you might try whether using these functions instead fixes your issue.
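
[Editor's note: the matched-probe pattern Joachim suggests can be sketched as follows. MPI_Improbe returns an MPI_Message handle, and MPI_Mrecv receives exactly that message, so no other thread can match it between the probe and the receive. The function name, tag handling, and MPI_BYTE payload are illustrative assumptions:]

```c
/* Sketch: draining pending messages with MPI_Improbe/MPI_Mrecv
 * instead of MPI_Iprobe/MPI_Recv. */
#include <mpi.h>
#include <stdlib.h>

void drain_messages(MPI_Comm comm, int tag)
{
    int flag;
    MPI_Message msg;
    MPI_Status status;

    /* Probe; if a message is pending, flag is set and msg names it. */
    MPI_Improbe(MPI_ANY_SOURCE, tag, comm, &flag, &msg, &status);
    while (flag) {
        int count;
        MPI_Get_count(&status, MPI_BYTE, &count);

        char *buf = malloc(count);
        /* MPI_Mrecv consumes the specific message returned by the
         * probe -- the probe/receive race of Iprobe+Recv is gone. */
        MPI_Mrecv(buf, count, MPI_BYTE, &msg, MPI_STATUS_IGNORE);

        /* ... process buf ... */
        free(buf);

        MPI_Improbe(MPI_ANY_SOURCE, tag, comm, &flag, &msg, &status);
    }
}
```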

Best
Joachim

Am 14.09.23 um 22:02 schrieb Mccall, Kurt E. (MSFC-EV41) via discuss:
> It seems that when I send a process too many non-blocking messages (with
> MPI_Isend), MPI_Iprobe/MPI_Recv sometimes returns a buffer with corrupted
> data for some of the messages. Usually the corrupted data objects are at
> the end of the array that was sent. I checked the buffers passed to
> MPI_Isend, and they are uncorrupted.
>
> 1. Is there a way to detect this kind of overload with an MPI call?
> 2. Is there an upper bound on the number of messages that can be "in flight"?
> 3. Is there an upper bound on message length?
>
> Thanks,
>
> Kurt
>
>
> _______________________________________________
> discuss mailing list: discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss

--
Dr. rer. nat. Joachim Jenke

IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80-24765
Fax: +49 241 80-624765
jenke at itc.rwth-aachen.de
http://www.itc.rwth-aachen.de/

_______________________________________________
discuss mailing list: discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

More information about the discuss mailing list