[mpich-discuss] Unexpected behaviour of MPI_Probe + MPI_Get_count

Zhou, Hui zhouh at anl.gov
Mon Mar 25 17:05:01 CDT 2024


Hi Iker,

Could you try setting FI_PROVIDER=sockets or FI_PROVIDER=tcp to see if the issue persists?
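
For example, with a Hydra mpiexec launch (the host names and binary path below are placeholders):

  mpiexec -n 3 -hosts node1,node2 -genv FI_PROVIDER sockets ./reproducer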

--
Hui
________________________________
From: Iker Martín Álvarez <martini at uji.es>
Sent: Monday, March 25, 2024 12:38 PM
To: Zhou, Hui <zhouh at anl.gov>
Cc: discuss at mpich.org <discuss at mpich.org>
Subject: Re: [mpich-discuss] Unexpected behaviour of MPI_Probe + MPI_Get_count

Hello Zhou,

Thanks for the quick reply.
In the attached file you can see the result of running the code with the environment variable you gave.

Kind regards,
Iker

On Mon, 25 Mar 2024 at 17:03, Zhou, Hui <zhouh at anl.gov> wrote:
Hi Iker,

Could you try reproducing the issue with MPIR_CVAR_DEBUG_SUMMARY=1 set and report the console output? The issue may be in a specific provider; the log should show that.
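
For example (again with placeholder host names and binary path):

  mpiexec -n 3 -hosts node1,node2 -genv MPIR_CVAR_DEBUG_SUMMARY 1 ./reproducer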

--
Hui
________________________________
From: Iker Martín Álvarez via discuss <discuss at mpich.org>
Sent: Monday, March 25, 2024 6:23 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Iker Martín Álvarez <martini at uji.es>
Subject: [mpich-discuss] Unexpected behaviour of MPI_Probe + MPI_Get_count

Hello,

I recently encountered an unexpected behaviour of the MPI_Probe + MPI_Get_count functions under specific conditions. I was hoping that this forum could advise me on a solution.

Specifically, the application performs an MPI_Send from the root process to process B. Process B does not know the size of the message in advance, so I use MPI_Probe + MPI_Get_count to discover it. However, if the message is, for example, 1000 bytes, MPI_Get_count reports only 20 bytes to process B.
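
For reference, a minimal sketch of this receive pattern with a plain contiguous 1000-byte message (the ranks, tag, and buffer handling are illustrative; the attached reproducer is the authoritative version):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Root: send 1000 bytes to process B (rank 1 here). */
        char msg[1000] = {0};
        MPI_Send(msg, 1000, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Process B: the size is not known in advance, so probe first. */
        MPI_Status status;
        int count;
        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_BYTE, &count);   /* count in MPI_BYTE units; 1000 expected here */
        printf("MPI_Get_count reported %d bytes\n", count);
        char *buf = malloc(count);
        MPI_Recv(buf, count, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}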

The problem only occurs with a specific installation of MPICH and when the following conditions are met in my code:
- The problem only occurs in internode communications.
- The problem only appears if derived datatypes are used in the communication, specifically a derived type that communicates a vector of integers and a vector of reals, both with the same number of elements (see the sketch after this list).
- None of the MPI functions returns an error code; they all return MPI_SUCCESS.
- If, instead of allocating the number of bytes returned by MPI_Get_count (20), I allocate the expected size (1000), the message is received correctly.
- The size returned by MPI_Get_count seems to vary with the total number of addresses used to create the derived type.
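
For illustration, a sketch of this kind of type construction (the absolute-address layout, the element count N, and MPI_DOUBLE for the reals are placeholders; the attached reproducer is the authoritative version):

#include <mpi.h>

enum { N = 100 };           /* placeholder element count */
static int    ints[N];
static double reals[N];

/* Sender side: build a struct type spanning the two vectors via absolute
 * addresses, so the data is sent from MPI_BOTTOM. */
static void send_vectors(int dest)
{
    MPI_Datatype dtype;
    int          blocklens[2] = { N, N };
    MPI_Aint     displs[2];
    MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };

    MPI_Get_address(ints,  &displs[0]);
    MPI_Get_address(reals, &displs[1]);
    MPI_Type_create_struct(2, blocklens, displs, types, &dtype);
    MPI_Type_commit(&dtype);

    MPI_Send(MPI_BOTTOM, 1, dtype, dest, 0, MPI_COMM_WORLD);
    MPI_Type_free(&dtype);
}

/* Receiver side: probe and ask for the size in bytes. The expected value is
 * N * (sizeof(int) + sizeof(double)); the reported problem is that a much
 * smaller count (e.g. 20) comes back with ch4:ofi and an external libfabric. */
static void probe_size(int *nbytes)
{
    MPI_Status status;
    MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_BYTE, nbytes);
}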

I have attached the file to reproduce the problem. It can also be accessed via the GitLab link below:
https://lorca.act.uji.es/gitlab/martini/mpich_ofi_mpi_probe_bug
It is designed to be run with 3 processes, two of them hosted on one node and the third on a different one.

As previously mentioned, this problem occurs when using MPICH with ch4:ofi and a libfabric other than the embedded one. Specifically, I have tested the following installations in which the error appears:
- MPICH 4.2.0 with config options: '--with-device=ch4:ofi' '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1'
- MPICH 4.0.3 with config options: '--with-device=ch4:ofi' '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1'
- MPICH 4.0.3 with config options: '--with-device=ch4:ofi' '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1' '--disable-psm3'
- MPICH 3.4.1 with config options: '--with-device=ch4:ofi' '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1'

However, it does work as expected for the following MPICH installations:
- MPICH 4.0.3 with config options: '--with-device=ch4:ofi' '--with-libfabric=embedded'
- MPICH 4.0.3 with config options: '--with-device=ch4:ucx' '--with-ucx=/soft/gnu/ucx-1.11'
- MPICH 3.4.1 with config options: '--with-device=ch4:ucx' '--with-ucx=/soft/gnu/ucx-1.11'

Although the code does work with these installations, we would like to use a libfabric installation other than the embedded one because it gives us better networking performance. As for UCX, the application in question uses MPI_Comm_spawn, which MPICH does not currently support with UCX.

Thank you for your help.
Best regards,
Iker