[mpich-discuss] Unexpected behaviour of MPI_Probe + MPI_Get_count

Iker Martín Álvarez martini at uji.es
Wed Mar 27 06:55:06 CDT 2024


Hello Hui,

After testing this provider, the issue has been resolved. I am not sure
what the difference is between the two providers, but we are grateful that
it works now.

Thank you very much for your time.
Kind regards,
Iker

Message from Zhou, Hui <zhouh at anl.gov>, Tuesday, 26 March 2024 at 15:37:

> Hi Iker,
>
> Could you try FI_PROVIDER=verbs?
>
> --
> Hui
> ------------------------------
> *From:* Iker Martín Álvarez <martini at uji.es>
> *Sent:* Tuesday, March 26, 2024 2:35 AM
> *To:* Zhou, Hui <zhouh at anl.gov>
> *Cc:* discuss at mpich.org <discuss at mpich.org>
> *Subject:* Re: [mpich-discuss] Unexpected behaviour of MPI_Probe +
> MPI_Get_count
>
> Hello Zhou,
>
> I just tried using both providers and in both cases the execution hangs at
> MPI_Init for those processes that are on a different node than the main
> one. This happens even for an MPI "Hello World" code for both providers.
>
> So I can't even check if it has the same problem.
>
> Thank you very much for your time.
> Best regards,
> Iker
>
> Message from Zhou, Hui <zhouh at anl.gov>, Monday, 25 March 2024 at 23:05:
>
> Hi Iker,
>
> Could you try setting FI_PROVIDER=sockets or FI_PROVIDER=tcp to see if
> the issue persists?
>
> --
> Hui
> ------------------------------
> *From:* Iker Martín Álvarez <martini at uji.es>
> *Sent:* Monday, March 25, 2024 12:38 PM
> *To:* Zhou, Hui <zhouh at anl.gov>
> *Cc:* discuss at mpich.org <discuss at mpich.org>
> *Subject:* Re: [mpich-discuss] Unexpected behaviour of MPI_Probe +
> MPI_Get_count
>
> Hello Zhou,
>
> Thanks for the quick reply.
> In the attached file you can see the result of running the code with the
> environment variable you gave.
>
> Kind regards,
> Iker
>
> Message from Zhou, Hui <zhouh at anl.gov>, Monday, 25 March 2024 at 17:03:
>
> Hi Iker,
>
> Could you try to reproduce the issue with MPIR_CVAR_DEBUG_SUMMARY=1 set
> and report the console output? The issue may be in a specific provider; the
> log should show that.
>
> --
> Hui
> ------------------------------
> *From:* Iker Martín Álvarez via discuss <discuss at mpich.org>
> *Sent:* Monday, March 25, 2024 6:23 AM
> *To:* discuss at mpich.org <discuss at mpich.org>
> *Cc:* Iker Martín Álvarez <martini at uji.es>
> *Subject:* [mpich-discuss] Unexpected behaviour of MPI_Probe +
> MPI_Get_count
>
> Hello,
>
> I recently encountered an unexpected behaviour of the MPI_Probe +
> MPI_Get_count functions under specific conditions. I was hoping that this
> forum could advise me on a solution.
>
> Specifically, the application performs an MPI_Send communication from the
> root process to process B. Process B doesn't know the size of the message,
> so I use MPI_Probe + MPI_Get_count to discover it (a minimal sketch of this
> pattern follows the list of conditions below). However, as an example, if
> the message is 1000 bytes, MPI_Get_count on process B reports only 20 bytes.
>
> The problem only occurs with a specific installation of MPICH and when the
> following conditions are met in my code:
> - The problem only occurs in internode communications.
> - The problem only appears if derived types are used in the communication,
> specifically a derived type that communicates a vector of integers and a
> vector of reals, both with the same number of elements.
> - None of the MPI functions give an error code. They all return MPI_SUCCESS.
> - If instead of allocating the number of bytes returned by
> MPI_Get_count (=20), I allocate the expected value (1000), the message is
> received correctly.
> - The size returned by MPI_Get_count seems to vary with the total number of
> addresses used to create the derived type.
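>
> To make the pattern concrete, here is a minimal sketch of the communication
> described above. It is not the attached reproducer: the element count, the
> use of double for the reals, and all names are illustrative, and only ranks
> 0 and 1 take part here (the attached code is meant for 3 processes).
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> /* Build a struct type covering n ints and n doubles at their absolute
>  * addresses, so sends/receives use MPI_BOTTOM as the buffer argument. */
> static MPI_Datatype make_type(int n, int *ints, double *reals)
> {
>     int          blocklens[2] = { n, n };
>     MPI_Aint     displs[2];
>     MPI_Datatype types[2] = { MPI_INT, MPI_DOUBLE };
>     MPI_Datatype dtype;
>
>     MPI_Get_address(ints,  &displs[0]);
>     MPI_Get_address(reals, &displs[1]);
>     MPI_Type_create_struct(2, blocklens, displs, types, &dtype);
>     MPI_Type_commit(&dtype);
>     return dtype;
> }
>
> int main(int argc, char **argv)
> {
>     int rank;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     if (rank == 0) {
>         int n = 100;                                  /* illustrative size */
>         int    *ints  = malloc(n * sizeof *ints);
>         double *reals = malloc(n * sizeof *reals);
>         for (int i = 0; i < n; i++) { ints[i] = i; reals[i] = i; }
>
>         MPI_Datatype dtype = make_type(n, ints, reals);
>         MPI_Send(MPI_BOTTOM, 1, dtype, 1, 0, MPI_COMM_WORLD);
>         MPI_Type_free(&dtype);
>         free(ints); free(reals);
>     } else if (rank == 1) {
>         MPI_Status status;
>         int nbytes;
>
>         /* The receiver does not know n: probe first, then ask for the
>          * message size in bytes before allocating anything. */
>         MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
>         MPI_Get_count(&status, MPI_BYTE, &nbytes);
>         printf("rank 1: probed message of %d bytes\n", nbytes);
>
>         /* Each element contributes one int plus one double. */
>         int n = nbytes / (int)(sizeof(int) + sizeof(double));
>         int    *ints  = malloc(n * sizeof *ints);
>         double *reals = malloc(n * sizeof *reals);
>
>         MPI_Datatype dtype = make_type(n, ints, reals);
>         MPI_Recv(MPI_BOTTOM, 1, dtype, 0, 0, MPI_COMM_WORLD,
>                  MPI_STATUS_IGNORE);
>         MPI_Type_free(&dtype);
>         free(ints); free(reals);
>     }
>
>     MPI_Finalize();
>     return 0;
> }
>
> The MPI_Get_count call in this sketch is the point where, in our code, 20
> bytes are reported instead of the expected 1000 on the affected
> installations, while the working installations report the full size.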
>
> I have attached the file to reproduce the problem. It can also be accessed
> via the GitLab link below:
> https://lorca.act.uji.es/gitlab/martini/mpich_ofi_mpi_probe_bug
> It is designed to be run with 3 processes, two of them hosted on one node
> and the third on a different one.
>
> As previously mentioned, this problem occurs when using MPICH with ch4:ofi
> and a non-embedded libfabric. Specifically, I have tested the following
> installations in which the error appears:
> - MPICH 4.2.0 with config options: '--with-device=ch4:ofi'
> '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1'
> - MPICH 4.0.3 with config options: '--with-device=ch4:ofi'
> '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1'
> - MPICH 4.0.3 with config options: '--with-device=ch4:ofi'
> '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1' '--disable-psm3'
> - MPICH 3.4.1 with config options: '--with-device=ch4:ofi'
> '--with-libfabric=/home/martini/Instalaciones/libfabric-1.16.1'
>
> However, it does work as expected for the following MPICH installations:
> - MPICH 4.0.3 with config options: '--with-device=ch4:ofi'
> '--with-libfabric=embedded'
> - MPICH 4.0.3 with config options: '--with-device=ch4:ucx'
> '--with-ucx=/soft/gnu/ucx-1.11'
> - MPICH 3.4.1 with config options: '--with-device=ch4:ucx'
> '--with-ucx=/soft/gnu/ucx-1.11'
>
> Although the code does work with these installations, we would like to use
> a libfabric installation other than the embedded one because it gives us
> better networking performance. As for UCX, the application in question uses
> the MPI_Comm_spawn call, which MPICH does not currently support with UCX.
>
> Thank you for your help.
> Best regards,
> Iker
>
>

