[mpich-discuss] Unexpected behaviour of MPI_Probe + MPI_Get_count

Iker Martín Álvarez martini at uji.es
Tue Mar 26 02:35:47 CDT 2024


Hello Zhou,

I just tried both providers, and in both cases the execution hangs in
MPI_Init for the processes that are on a different node than the main one.
This happens even for an MPI "Hello World" program.

So I can't even check if it has the same problem.

Thank you very much for your time.
Best regards,
Iker

Message from Zhou, Hui <zhouh at anl.gov> on Mon., 25 March 2024 at
23:05:

> Hi Iker,
>
> Could you try setting FI_PROVIDER=sockets or FI_PROVIDER=tcp to see if
> the issue persists?
>
> --
> Hui
> ------------------------------
> *From:* Iker Martín Álvarez <martini at uji.es>
> *Sent:* Monday, March 25, 2024 12:38 PM
> *To:* Zhou, Hui <zhouh at anl.gov>
> *Cc:* discuss at mpich.org <discuss at mpich.org>
> *Subject:* Re: [mpich-discuss] Unexpected behaviour of MPI_Probe +
> MPI_Get_count
>
> Hello Zhou,
>
> Thanks for the quick reply.
> In the attached file you can see the result of running the code with the
> environment variable you gave.
>
> Kind regards,
> Iker
>
> Message from Zhou, Hui <zhouh at anl.gov> on Mon., 25 March 2024 at
> 17:03:
>
> Hi Iker,
>
> Could you try to reproduce the issue with MPIR_CVAR_DEBUG_SUMMARY=1 set,
> and report the console output? The issue may be in a specific provider; the
> log should show that.
>
> --
> Hui
> ------------------------------
> *From:* Iker Martín Álvarez via discuss <discuss at mpich.org>
> *Sent:* Monday, March 25, 2024 6:23 AM
> *To:* discuss at mpich.org <discuss at mpich.org>
> *Cc:* Iker Martín Álvarez <martini at uji.es>
> *Subject:* [mpich-discuss] Unexpected behaviour of MPI_Probe +
> MPI_Get_count
>
> Hello,
>
> I recently encountered an unexpected behaviour of the MPI_Probe +
> MPI_Get_count functions under specific conditions. I was hoping that this
> forum could advise me on a solution.
>
> Specifically, the application performs an MPI_Send communication from the
> root process to process B. Process B doesn't know the size of the message,
> so I use MPI_Probe + MPI_Get_count to discover it. However, as an example,
> if the message is 1000 bytes long, MPI_Get_count on process B reports only
> 20 bytes.
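>
> As a minimal sketch of this pattern (illustrative only, not the exact code
> of the attached reproducer), assuming the pending message is probed and its
> size queried in MPI_BYTE units before the buffer is allocated:
>
>     #include <mpi.h>
>     #include <stdlib.h>
>
>     /* Receive a message of unknown size from 'src' with tag 'tag'. */
>     void recv_unknown_size(int src, int tag, MPI_Comm comm)
>     {
>         MPI_Status status;
>         int nbytes;
>
>         MPI_Probe(src, tag, comm, &status);         /* wait for a matching message   */
>         MPI_Get_count(&status, MPI_BYTE, &nbytes);  /* size of that message in bytes */
>
>         void *buf = malloc(nbytes);                 /* allocate what was reported    */
>         MPI_Recv(buf, nbytes, MPI_BYTE, src, tag, comm, MPI_STATUS_IGNORE);
>         /* ... unpack / use buf ... */
>         free(buf);
>     }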
>
> The problem only occurs with specific installations of MPICH and when the
> following conditions are met in my code:
> - The problem only occurs in internode communications.
> - The problem only appears if derived datatypes are used in the
> communication; specifically, a derived type that communicates a vector of
> integers and a vector of reals, both with the same number of elements (a
> sketch of such a type follows this list).
> - None of the MPI functions returns an error code; they all return
> MPI_SUCCESS.
> - If, instead of allocating the number of bytes returned by MPI_Get_count
> (20), I allocate the expected value (1000), the message is received
> correctly.
> - The size returned by MPI_Get_count seems to vary with the total number of
> addresses used to create the derived type.
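>
> As a sketch of the kind of derived type meant above (illustrative only, not
> the exact code of the reproducer), assuming one block of n integers and one
> block of n double-precision reals addressed with MPI_Get_address:
>
>     #include <mpi.h>
>
>     /* Build a struct type covering 'n' ints and 'n' doubles living in two
>        separately allocated arrays; one element of this type is then sent or
>        received with 'ints' as the buffer argument. */
>     MPI_Datatype make_vec_pair_type(int n, int *ints, double *reals)
>     {
>         int          blocklens[2] = { n, n };
>         MPI_Aint     displs[2];
>         MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };
>         MPI_Datatype newtype;
>
>         MPI_Get_address(ints,  &displs[0]);
>         MPI_Get_address(reals, &displs[1]);
>         displs[1] = MPI_Aint_diff(displs[1], displs[0]);  /* relative to 'ints' */
>         displs[0] = 0;
>
>         MPI_Type_create_struct(2, blocklens, displs, types, &newtype);
>         MPI_Type_commit(&newtype);
>         return newtype;
>     }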
>
> I have attached the file to reproduce the problem. It can also be accessed
> via the GitLab link below:
> https://lorca.act.uji.es/gitlab/martini/mpich_ofi_mpi_probe_bug
> It is designed to be run with 3 processes, two of them hosted on one node
> and the third on a different one.
>
> As previously mentioned, this problem occurs when MPICH is built with
> ch4:ofi against an external (non-embedded) libfabric. Specifically, I have
> tested the following installations, in which the error appears:
> - MPICH 4.2.0 with config options: '--with-device=ch4:ofi'
> '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1'
> - MPICH 4.0.3 with config options: '--with-device=ch4:ofi'
> '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1'
> - MPICH 4.0.3 with config options: '--with-device=ch4:ofi'
> '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1' '--disable-psm3'
> - MPICH 3.4.1 with config options: '--with-device=ch4:ofi'
> '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1'
>
> However, it does work as expected for the following MPICH installations:
> - MPICH 4.0.3 with config options: '--with-device=ch4:ofi'
> '--with-libfabric=embedded'
> - MPICH 4.0.3 with config options: '--with-device=ch4:ucx'
> '--with-ucx=/soft/gnu/ucx-1.11'
> - MPICH 3.4.1 with config options: '--with-device=ch4:ucx'
> '--with-ucx=/soft/gnu/ucx-1.11'
>
> Although the code does work with these installations, we would like to use
> a libfabric installation other than the embedded one because it gives us
> better networking performance. As for UCX, the application uses
> MPI_Comm_spawn, which MPICH does not currently support with UCX.
>
> Thank you for your help.
> Best regards,
> Iker
>
>