<head><!-- BaNnErBlUrFlE-HeAdEr-start -->
<!-- BaNnErBlUrFlE-HeAdEr-end -->
</head><!-- BaNnErBlUrFlE-BoDy-start -->


<!-- BaNnErBlUrFlE-BoDy-end -->
<div dir="ltr"><div>Hi Hui,</div><div><br></div><div>Yes, it works for that provider, or at least I haven't had any issues with it.</div><div><br></div><div>Kind regards,</div><div>Iker<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Message from Zhou, Hui <<a href="mailto:zhouh@anl.gov">zhouh@anl.gov</a>> on Mon, 28 Oct 2024 at 20:01:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg-3975233306732778935">




<div dir="ltr">
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hi Iker,</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Does it work with `FI_PROVIDER="verbs;ofi_rxm"`?</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hui</div>
<div id="m_633548469818499321appendonsend"></div>
<hr style="display:inline-block;width:98%">
<div id="m_633548469818499321divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Iker Martín Álvarez via discuss <<a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a>><br>
<b>Sent:</b> Monday, October 28, 2024 12:34 PM<br>
<b>To:</b> <a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a> <<a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a>><br>
<b>Cc:</b> Iker Martín Álvarez <<a href="mailto:martini@uji.es" target="_blank">martini@uji.es</a>><br>
<b>Subject:</b> [mpich-discuss] Occasional hang with MPI_Intercomm_merge and OFI+provider verbs</font>
<div> </div>
</div>

<div>
<div dir="ltr">
<div>Hi,</div>
<div><br>
</div>
<div>Lately I have been dealing with an unexpected problem when using MPI_Comm_spawn + MPI_Intercomm_merge, where on some occasions my application hangs when two conditions are met.</div>
<div><br>
</div>
Specifically, the hang occurs when the intracommunicator returned by MPI_Intercomm_merge is used in collective operations such as MPI_Bcast. The two conditions are:<br>
- The node is oversubscribed: the number of processes is greater than the number of available physical cores.<br>
- CH4:ofi is used with FI_PROVIDER="verbs:ofi_rxd".
<div><br>
</div>
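<div>The first condition can be checked programmatically. The following is only a minimal sketch, not part of the reproducer: the helper names is_oversubscribed and online_cpus are hypothetical, and sysconf(_SC_NPROCESSORS_ONLN) is assumed to be available (it is on Linux with glibc). It reports logical CPUs, which can exceed the number of physical cores when SMT is enabled.</div>

```c
#include <unistd.h>

/* Hypothetical helper (not part of the reproducer): returns 1 when more
 * processes are requested than cores are available, 0 otherwise.
 * A non-positive core count is treated as "unknown" and yields 0. */
int is_oversubscribed(long num_procs, long num_cores) {
  return num_cores > 0 && num_procs > num_cores;
}

/* Number of online logical CPUs as reported by the OS, or -1 on failure.
 * Assumption: _SC_NPROCESSORS_ONLN is supported on this platform.
 * Note: this counts logical CPUs, not physical cores. */
long online_cpus(void) {
  return sysconf(_SC_NPROCESSORS_ONLN);
}
```

<div>For instance, is_oversubscribed(10 + 20, 20) returns 1, which corresponds to 10 initial plus 20 spawned processes on a 20-core node.</div>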
<div>I tested a minimal code with MPICH 4.2.0 and MPICH 4.2.3 configured as:</div>
<div>./configure --prefix=... --with-device=ch4:ofi --disable-psm3</div>
<div><br>
</div>
<div>The <a href="https://lorca.act.uji.es/gitlab/martini/mpich_ofi_rxd_intracomm_hang/-/blob/main/BaseCode.c" target="_blank">
minimal code</a> to reproduce the problem is the following:</div>
<div>==========================</div>
<div>
<pre lang="c"><span lang="c" id="m_633548469818499321x_gmail-LC1"><span>#include &lt;stdio.h&gt;</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC2"><span>#include &lt;stdlib.h&gt;</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC3"><span>#include &lt;mpi.h&gt;</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC4"></span>
<span lang="c" id="m_633548469818499321x_gmail-LC5"><span>int</span> <span>main</span><span>(</span><span>int</span> <span>argc</span><span>,</span> <span>char</span><span>*</span> <span>argv</span><span>[])</span> <span>{</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC6">  <span>int</span> <span>rank</span><span>,</span> <span>numP</span><span>,</span> <span>numO</span><span>;</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC7">  <span>int</span> <span>rootBcast</span><span>,</span> <span>order</span><span>;</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC8">  <span>double</span> <span>test</span> <span>=</span> <span>0</span><span>;</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC9">  <span>int</span> <span>solution</span> <span>=</span> 0<span>;</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC10"></span>
<span lang="c" id="m_633548469818499321x_gmail-LC11">  <span>MPI_Init</span><span>(</span><span>&</span><span>argc</span><span>,</span> <span>&</span><span>argv</span><span>);</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC12">  <span>MPI_Comm_rank</span><span>(</span><span>MPI_COMM_WORLD</span><span>,</span> <span>&</span><span>rank</span><span>);</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC13">  <span>MPI_Comm_size</span><span>(</span><span>MPI_COMM_WORLD</span><span>,</span> <span>&</span><span>numP</span><span>);</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC14"></span>
<span lang="c" id="m_633548469818499321x_gmail-LC15">  <span>MPI_Comm</span> <span>intercomm</span><span>,</span> <span>intracomm</span><span>;</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC16">  <span>MPI_Comm_get_parent</span><span>(</span><span>&</span><span>intercomm</span><span>);</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC17">  <span>if</span><span>(</span><span>intercomm</span> <span>==</span> <span>MPI_COMM_NULL</span><span>)</span> <span>{</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC18">    <span>numO</span> <span>=</span> <span>atoi</span><span>(</span><span>argv</span><span>[</span><span>1</span><span>]);</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC19">    <span>MPI_Comm_spawn</span><span>(</span><span>argv</span><span>[</span><span>0</span><span>],</span> <span>MPI_ARGV_NULL</span><span>,</span> <span>numO</span><span>,</span> <span>MPI_INFO_NULL</span><span>,</span> <span>0</span><span>,</span> <span>MPI_COMM_WORLD</span><span>,</span> <span>&</span><span>intercomm</span><span>,</span> <span>MPI_ERRCODES_IGNORE</span><span>);</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC20">    <span>order</span> <span>=</span> <span>0</span><span>;</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC21">  <span>}</span> <span>else</span> <span>{</span> <span>order</span> <span>=</span> <span>1</span><span>;</span> <span>}</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC22"></span>
<span lang="c" id="m_633548469818499321x_gmail-LC23">  <span>MPI_Intercomm_merge</span><span>(</span><span>intercomm</span><span>,</span> <span>order</span><span>,</span> <span>&</span><span>intracomm</span><span>);</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC24">  <span>printf</span><span>(</span><span>"TEST 1 P%02d/%d</span><span>\n</span><span>"</span><span>,</span> <span>rank</span><span>,</span> <span>numP</span><span>);</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC25">  <span>MPI_Bcast</span><span>(</span><span>&</span><span>test</span><span>,</span> <span>1</span><span>,</span> <span>MPI_DOUBLE</span><span>,</span> <span>0</span><span>,</span> <span>intracomm</span><span>);</span></span> // Hangs here
<span lang="c" id="m_633548469818499321x_gmail-LC26">  <span>if</span><span>(</span><span>solution</span><span>)</span> <span>{</span> <span>MPI_Barrier</span><span>(</span><span>intercomm</span><span>);</span> <span>}</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC27">  <span>printf</span><span>(</span><span>"TEST 2 P%02d/%d</span><span>\n</span><span>"</span><span>,</span> <span>rank</span><span>,</span> <span>numP</span><span>);</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC28"></span>
<span lang="c" id="m_633548469818499321x_gmail-LC29">  <span>MPI_Finalize</span><span>();</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC30">  <span>return</span> <span>0</span><span>;</span></span>
<span lang="c" id="m_633548469818499321x_gmail-LC31"><span>}</span></span>
</pre>
</div>
<div>==========================</div>
The code hangs at the MPI_Bcast operation, but only for some of the spawned processes. All my runs used a single node with 20 cores, starting with 10 initial processes and spawning 20 more with MPI_Comm_spawn. If I set the variable "solution" to 1, the application hangs much less often, but it still happens occasionally.<br>
<br>
From my perspective, the code seems to follow the standard. Is this the case? I have been able to run the code with other OFI providers, but I am confused as to why it does not work in this case.
<div><br>
</div>
<div>Thank you for your time.</div>
<div>
<div>Best regards,<br>
</div>
<div>Iker</div>
</div>
</div>
</div>
</div>

</div></blockquote></div>