[mpich-discuss] Buffer corruption due to an excessive number of messages

Mccall, Kurt E. (MSFC-EV41) kurt.e.mccall at nasa.gov
Fri Sep 15 09:43:03 CDT 2023


Joachim,

Unfortunately, using MPI_Improbe/MPI_Mrecv didn't solve the problem -- I am still receiving buffers with invalid data objects near the ends of the buffers. The problem goes away when I reduce the size of the job (number of nodes), which makes me think the large number of messages is causing the problem.

1. Is there a way to detect this kind of overload with an MPI call?
2. Is there an upper bound on the number of messages that can be "in flight"?
3. Is there an upper bound on message length?

Or is there some other possible cause that I haven't thought of?
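One cause I still need to rule out is reusing or freeing a send buffer before its MPI_Isend completes -- MPI only guarantees the data has left the buffer once the request has finished. Below is a minimal sketch (in C) of how the sends could be throttled so each buffer stays untouched until its request completes; MAX_INFLIGHT, MSG_BYTES, and send_throttled are hypothetical names, not code from my application:

#include <mpi.h>
#include <string.h>

#define MAX_INFLIGHT 64     /* hypothetical cap on outstanding sends */
#define MSG_BYTES    4096   /* hypothetical message size */

/* Each slot owns a buffer that must stay untouched until its send
   completes; MPI_Isend only guarantees the data after the request
   finishes. */
static MPI_Request reqs[MAX_INFLIGHT];
static char bufs[MAX_INFLIGHT][MSG_BYTES];

void send_throttled(int nmsgs, int dest, int tag, MPI_Comm comm)
{
    for (int i = 0; i < MAX_INFLIGHT; ++i)
        reqs[i] = MPI_REQUEST_NULL;

    for (int m = 0; m < nmsgs; ++m) {
        int slot;
        if (m < MAX_INFLIGHT) {
            slot = m;   /* window not yet full */
        } else {
            /* Block until some earlier send completes, then reuse its
               slot -- this caps the number of in-flight messages. */
            MPI_Waitany(MAX_INFLIGHT, reqs, &slot, MPI_STATUS_IGNORE);
        }
        memset(bufs[slot], 0, MSG_BYTES);   /* stand-in for building the payload */
        MPI_Isend(bufs[slot], MSG_BYTES, MPI_BYTE, dest, tag, comm,
                  &reqs[slot]);
    }
    /* Drain every remaining send before the buffers are reused or freed. */
    MPI_Waitall(MAX_INFLIGHT, reqs, MPI_STATUSES_IGNORE);
}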

Thanks,
Kurt

-----Original Message-----
From: Joachim Jenke via discuss <discuss at mpich.org>
Sent: Thursday, September 14, 2023 3:10 PM
To: discuss at mpich.org
Cc: Joachim Jenke <jenke at itc.rwth-aachen.de>
Subject: Re: [mpich-discuss] Buffer corruption due to an excessive number of messages

Hi Kurt,

just a thought: do you execute single-threaded or multi-threaded?

In the multi-threaded case, you should look into MPI_Improbe/MPI_Mrecv to make sure that you really receive the message you probed for; with plain MPI_Iprobe/MPI_Recv, another thread can match the probed message first.
Even in single-threaded execution, it is worth trying whether these functions fix your issue.
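For reference, the matched probe/receive pattern looks roughly like this -- a minimal sketch in C with error checking omitted, and MPI_BYTE as a placeholder datatype:

#include <mpi.h>
#include <stdlib.h>

/* With matched probes, MPI_Mrecv consumes exactly the message that
   MPI_Improbe matched, so another thread cannot steal it between the
   probe and the receive. */
void probe_and_receive(MPI_Comm comm)
{
    int flag = 0;
    MPI_Message msg;
    MPI_Status status;

    MPI_Improbe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &msg, &status);
    if (flag) {
        int count;
        MPI_Get_count(&status, MPI_BYTE, &count);

        char *buf = malloc(count);
        MPI_Mrecv(buf, count, MPI_BYTE, &msg, MPI_STATUS_IGNORE);
        /* ... process buf ... */
        free(buf);
    }
}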

Best
Joachim

On 14.09.23 at 22:02, Mccall, Kurt E. (MSFC-EV41) via discuss wrote:
> It seems that when I send a process too many non-blocking messages (with
> MPI_Isend), MPI_Iprobe/MPI_Recv sometimes returns a buffer with corrupted
> data for some of the messages. Usually the corrupted data objects are at
> the end of the array that was sent. I checked the buffers passed to
> MPI_Isend, and they are uncorrupted.
>
>  1. Is there a way to detect this kind of overload with an MPI call?
>  2. Is there an upper bound on the number of messages that can be "in
>     flight"?
>  3. Is there an upper bound on message length?
>
> Thanks,
>
> Kurt
>
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss

--
Dr. rer. nat. Joachim Jenke

IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80-24765
Fax: +49 241 80-624765
jenke at itc.rwth-aachen.de
http://www.itc.rwth-aachen.de/


