[mpich-discuss] Unexpected behaviour of MPI_Probe + MPI_Get_count

Zhou, Hui zhouh at anl.gov
Mon Mar 25 11:03:13 CDT 2024


Hi Iker,

Could you try reproducing the issue with MPIR_CVAR_DEBUG_SUMMARY=1 set, and report the console output? The issue may be in a specific provider; the log should show that.
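
For example (substituting your actual launch line; ./probe_repro is a placeholder for the reproducer binary):

    MPIR_CVAR_DEBUG_SUMMARY=1 mpiexec -n 3 ./probe_repro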

--
Hui
________________________________
From: Iker Martín Álvarez via discuss <discuss at mpich.org>
Sent: Monday, March 25, 2024 6:23 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Iker Martín Álvarez <martini at uji.es>
Subject: [mpich-discuss] Unexpected behaviour of MPI_Probe + MPI_Get_count

Hello,

I recently encountered an unexpected behaviour of the MPI_Probe + MPI_Get_count functions under specific conditions. I was hoping that this forum could advise me on a solution.

Specifically, the application performs an MPI_Send communication from the root process to process B. Process B does not know the size of the message, so I use MPI_Probe + MPI_Get_count to discover it. However, if, for example, the message is 1000 bytes, MPI_Get_count on process B reports a total of only 20 bytes.
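
For reference, the receive-side pattern looks roughly like this (a minimal sketch inside an already-initialized MPI program; the source rank, tag, and buffer names are illustrative, not copied from the attached reproducer):

    /* Process B: discover the incoming message size, then receive it. */
    MPI_Status status;
    int count;
    MPI_Probe(0 /* root */, 0 /* tag */, MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_BYTE, &count);   /* expect 1000; 20 is observed */
    char *buf = malloc(count);                  /* needs <stdlib.h> */
    MPI_Recv(buf, count, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);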

The problem only occurs with a specific installation of MPICH and when the following conditions are met in my code:
- The problem only occurs in internode communications.
- The problem only appears if derived types are used in the communication: specifically, a derived type used to communicate a vector of integers and a vector of reals, both with the same number of elements (see the sketch after this list).
- None of the MPI functions returns an error code; they all return MPI_SUCCESS.
- If, instead of allocating the number of bytes returned by MPI_Get_count (20), I allocate the expected size (1000), the message is received correctly.
- The size returned by MPI_Get_count seems to vary with the total number of addresses used to create the derived type.
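
For context, the derived type described above might be built along these lines (a hedged sketch, assuming MPI_Type_create_struct over absolute addresses obtained with MPI_Get_address, and double for the reals; the attached reproducer may differ in detail):

    /* N integers and N reals held in two separate arrays. */
    enum { N = 100 };
    int          ints[N];
    double       reals[N];
    int          blocklens[2] = { N, N };
    MPI_Aint     displs[2];
    MPI_Datatype types[2] = { MPI_INT, MPI_DOUBLE };
    MPI_Datatype dtype;

    MPI_Get_address(ints,  &displs[0]);   /* one address per component */
    MPI_Get_address(reals, &displs[1]);
    MPI_Type_create_struct(2, blocklens, displs, types, &dtype);
    MPI_Type_commit(&dtype);
    /* Sender side, with absolute addresses:
       MPI_Send(MPI_BOTTOM, 1, dtype, dest, tag, comm); */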

I have attached the file to reproduce the problem. It can also be accessed via the GitLab link below:
https://lorca.act.uji.es/gitlab/martini/mpich_ofi_mpi_probe_bug
It is designed to be run with 3 processes, two of them hosted on one node and the third on a different one.

As previously mentioned, this problem occurs when using MPICH with ch4:ofi and an external (non-embedded) libfabric. Specifically, I have tested the following installations, all of which show the error:
- MPICH 4.2.0 with config options: '--with-device=ch4:ofi' '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1'
- MPICH 4.0.3 with config options: '--with-device=ch4:ofi' '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1'
- MPICH 4.0.3 with config options: '--with-device=ch4:ofi' '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1' '--disable-psm3'
- MPICH 3.4.1 with config options: '--with-device=ch4:ofi' '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1'

However, it does work as expected for the following MPICH installations:
- MPICH 4.0.3 with config options: '--with-device=ch4:ofi' '--with-libfabric=embedded'
- MPICH 4.0.3 with config options: '--with-device=ch4:ucx' '--with-ucx=/soft/gnu/ucx-1.11'
- MPICH 3.4.1 with config options: '--with-device=ch4:ucx' '--with-ucx=/soft/gnu/ucx-1.11'

Although the code does work with these installations, we would like to use a libfabric installation other than the embedded one because it gives us better networking performance. UCX is not an option because the application uses MPI_Comm_spawn, which MPICH does not currently support with UCX.

Thank you for your help.
Best regards,
Iker