[mpich-discuss] Unexpected behaviour of MPI_Probe + MPI_Get_count

Zhou, Hui zhouh at anl.gov
Wed Mar 27 12:43:48 CDT 2024


It means the libfabric psm3 provider in 1.16.1 has a bug. :) Since the embedded libfabric worked, it means they have fixed the bug in more recent versions. Different providers may have different performance.
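As an aside, the fi_info utility that ships with libfabric can list the providers available in a given build, which may help confirm which provider a particular MPICH installation picks up. A hypothetical invocation:

    fi_info -l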

--
Hui
________________________________
From: Iker Martín Álvarez <martini at uji.es>
Sent: Wednesday, March 27, 2024 6:55 AM
To: Zhou, Hui <zhouh at anl.gov>
Cc: discuss at mpich.org <discuss at mpich.org>
Subject: Re: [mpich-discuss] Unexpected behaviour of MPI_Probe + MPI_Get_count

Hello Hui,

After testing this provider, the issue has been resolved. I am not sure what the difference is between the two providers, but we are grateful that it works now.

Thank you very much for your time.
Kind regards,
Iker

Message from Zhou, Hui <zhouh at anl.gov> on Tue, 26 Mar 2024 at 15:37:
Hi Iker,

Could you try FI_PROVIDER=verbs?
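For example, assuming MPICH's Hydra launcher and a hypothetical binary name, the provider can be forced at launch time like this:

    mpiexec -n 3 -genv FI_PROVIDER verbs ./probe_test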

--
Hui
________________________________
From: Iker Martín Álvarez <martini at uji.es>
Sent: Tuesday, March 26, 2024 2:35 AM
To: Zhou, Hui <zhouh at anl.gov>
Cc: discuss at mpich.org <discuss at mpich.org>
Subject: Re: [mpich-discuss] Unexpected behaviour of MPI_Probe + MPI_Get_count

Hello Zhou,

I just tried both providers, and in both cases the execution hangs at MPI_Init for the processes that are on a different node than the main one. This happens even for an MPI "Hello World" program.

So I can't even check if it has the same problem.

Thank you very much for your time.
Best regards,
Iker

Message from Zhou, Hui <zhouh at anl.gov> on Mon, 25 Mar 2024 at 23:05:
Hi Iker,

Could you try setting FI_PROVIDER=sockets or FI_PROVIDER=tcp to see if the issue persists?

--
Hui
________________________________
From: Iker Martín Álvarez <martini at uji.es>
Sent: Monday, March 25, 2024 12:38 PM
To: Zhou, Hui <zhouh at anl.gov>
Cc: discuss at mpich.org <discuss at mpich.org>
Subject: Re: [mpich-discuss] Unexpected behaviour of MPI_Probe + MPI_Get_count

Hello Zhou,

Thanks for the quick reply.
In the attached file you can see the result of running the code with the environment variable you gave.

Kind regards,
Iker

Message from Zhou, Hui <zhouh at anl.gov> on Mon, 25 Mar 2024 at 17:03:
Hi Iker,

Could you try to reproduce the issue with MPIR_CVAR_DEBUG_SUMMARY=1 set and report the console output? The issue may be in a specific provider; the log should show that.
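For example, assuming the Hydra launcher and a hypothetical binary name:

    mpiexec -n 3 -genv MPIR_CVAR_DEBUG_SUMMARY 1 ./probe_test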

--
Hui
________________________________
From: Iker Martín Álvarez via discuss <discuss at mpich.org>
Sent: Monday, March 25, 2024 6:23 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Iker Martín Álvarez <martini at uji.es>
Subject: [mpich-discuss] Unexpected behaviour of MPI_Probe + MPI_Get_count

Hello,

I recently encountered an unexpected behaviour of the MPI_Probe + MPI_Get_count functions under specific conditions. I was hoping that this forum could advise me on a solution.

Specifically, the application performs an MPI_Send from the root process to process B. Process B doesn't know the size of the message, so I use MPI_Probe + MPI_Get_count to discover it. However, the reported count is wrong: as an example, if the size of the message is 1000 bytes, MPI_Get_count tells process B to expect only 20 bytes.

The problem only occurs with a specific installation of MPICH and when the following conditions are met in my code:
- The problem only occurs in internode communications.
- The problem only appears if derived types are used in the communication, specifically a derived type that communicates a vector of integers and a vector of reals, both with the same number of elements (a minimal sketch of this pattern follows the list below).
- None of the MPI functions returns an error code; they all return MPI_SUCCESS.
- If, instead of allocating the number of bytes returned by MPI_Get_count (20), I allocate the expected value (1000), the message is received correctly.
- The size returned by MPI_Get_count seems to vary depending on the total number of addresses with which the derived type is created.
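
For reference, below is a minimal sketch of the pattern described above. It is not the attached reproducer: the element count, the int/double layout, and the two-rank exchange are assumptions based on this description (the actual test uses three processes spread over two nodes).

#include <mpi.h>
#include <stdio.h>

#define N 100   /* hypothetical number of elements in each vector */

int main(int argc, char **argv)
{
    int    ivec[N];   /* vector of integers */
    double dvec[N];   /* vector of reals    */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Derived type: a block of N ints followed by a block of N doubles,
     * with displacements taken relative to ivec. */
    int          blocklens[2] = { N, N };
    MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };
    MPI_Aint     displs[2], base;
    MPI_Datatype pair_t;

    MPI_Get_address(ivec, &base);
    MPI_Get_address(ivec, &displs[0]);
    MPI_Get_address(dvec, &displs[1]);
    displs[0] = MPI_Aint_diff(displs[0], base);
    displs[1] = MPI_Aint_diff(displs[1], base);
    MPI_Type_create_struct(2, blocklens, displs, types, &pair_t);
    MPI_Type_commit(&pair_t);

    if (rank == 0) {
        /* Root sends one element of the derived type to rank 1. */
        MPI_Send(ivec, 1, pair_t, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        int nbytes;

        /* Receiver does not know the size: probe, then query the count. */
        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_BYTE, &nbytes);

        /* Expected: N * (sizeof(int) + sizeof(double)) bytes.
         * With the affected installations the reported count is much smaller. */
        printf("MPI_Get_count reports %d bytes\n", nbytes);

        MPI_Recv(ivec, 1, pair_t, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&pair_t);
    MPI_Finalize();
    return 0;
}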

I have attached the file to reproduce the problem. It can also be accessed via the GitLab link below:
https://lorca.act.uji.es/gitlab/martini/mpich_ofi_mpi_probe_bug
It is designed to be run with 3 processes, two of them hosted on one node and the third on a different one.

As previously mentioned, this problem occurs when using MPICH with ch4:ofi built against an external (non-embedded) libfabric. Specifically, I have tested the following installations in which the error appears:
- MPICH 4.2.0 with config options: '--with-device=ch4:ofi' '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1'
- MPICH 4.0.3 with config options: '--with-device=ch4:ofi' '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1'
- MPICH 4.0.3 with config options: '--with-device=ch4:ofi' '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1' '--disable-psm3'
- MPICH 3.4.1 with config options: '--with-device=ch4:ofi' '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1'

However, it does work as expected for the following MPICH installations:
- MPICH 4.0.3 with config options: '--with-device=ch4:ofi' '--with-libfabric=embedded'
- MPICH 4.0.3 with config options: '--with-device=ch4:ucx' '--with-ucx=/soft/gnu/ucx-1.11'
- MPICH 3.4.1 with config options: '--with-device=ch4:ucx' '--with-ucx=/soft/gnu/ucx-1.11'

Although the code does work with these installations, we would like to use a libfabric installation other than the embedded one because it gives us better networking performance. UCX is not an option because the application uses MPI_Comm_spawn, which MPICH does not currently support with UCX.

Thank you for your help.
Best regards,
Iker