<div dir="ltr"><div>Hi,</div><div><br></div><div>Lately I have been dealing with an unexpected problem when using MPI_Comm_spawn + MPI_Intercomm_merge: my application occasionally hangs when two conditions are met.</div><div><br></div>Specifically, the hang occurs when the intracommunicator returned by MPI_Intercomm_merge is used in collective operations such as MPI_Bcast. The conditions are:<br>- The node is oversubscribed: the number of processes is greater than the number of available physical cores.<br>- CH4:ofi is used with FI_PROVIDER="verbs:ofi_rxd".<div><br></div><div>I tested a minimal code with MPICH 4.2.0 and MPICH 4.2.3, configured as:</div><div>./configure --prefix=... --with-device=ch4:ofi --disable-psm3</div><div><br></div><div>The <a href="https://urldefense.us/v3/__https://lorca.act.uji.es/gitlab/martini/mpich_ofi_rxd_intracomm_hang/-/blob/main/BaseCode.c__;!!G_uCfscf7eWS!ZoB5c9APxvNk40SirehC83dWaIUE_w3yOQ2EEoJ4MaJlUaREOTMYR85Dd4SCEE0Exrr4U6ZETBg1sg$">minimal code</a> to reproduce the problem is the following:</div><div>==========================</div><div><pre class="gmail-code gmail-highlight" lang="c"><span lang="c" class="gmail-line" id="gmail-LC1"><span class="gmail-cp">#include &lt;stdio.h&gt;</span></span>
<span lang="c" class="gmail-line" id="gmail-LC2"><span class="gmail-cp">#include &lt;stdlib.h&gt;</span></span>
<span lang="c" class="gmail-line" id="gmail-LC3"><span class="gmail-cp">#include &lt;mpi.h&gt;</span></span>
<span lang="c" class="gmail-line" id="gmail-LC4"></span>
<span lang="c" class="gmail-line" id="gmail-LC5"><span class="gmail-kt">int</span> <span class="gmail-nf">main</span><span class="gmail-p">(</span><span class="gmail-kt">int</span> <span class="gmail-n">argc</span><span class="gmail-p">,</span> <span class="gmail-kt">char</span><span class="gmail-o">*</span> <span class="gmail-n">argv</span><span class="gmail-p">[])</span> <span class="gmail-p">{</span></span>
<span lang="c" class="gmail-line" id="gmail-LC6"> <span class="gmail-kt">int</span> <span class="gmail-n">rank</span><span class="gmail-p">,</span> <span class="gmail-n">numP</span><span class="gmail-p">,</span> <span class="gmail-n">numO</span><span class="gmail-p">;</span></span>
<span lang="c" class="gmail-line" id="gmail-LC7"> <span class="gmail-kt">int</span> <span class="gmail-n">rootBcast</span><span class="gmail-p">,</span> <span class="gmail-n">order</span><span class="gmail-p">;</span></span>
<span lang="c" class="gmail-line" id="gmail-LC8"> <span class="gmail-kt">double</span> <span class="gmail-n">test</span> <span class="gmail-o">=</span> <span class="gmail-mi">0</span><span class="gmail-p">;</span></span>
<span lang="c" class="gmail-line" id="gmail-LC9"> <span class="gmail-kt">int</span> <span class="gmail-n">solution</span> <span class="gmail-o">=</span> 0<span class="gmail-p">;</span></span>
<span lang="c" class="gmail-line" id="gmail-LC10"></span>
<span lang="c" class="gmail-line" id="gmail-LC11"> <span class="gmail-n">MPI_Init</span><span class="gmail-p">(</span><span class="gmail-o">&</span><span class="gmail-n">argc</span><span class="gmail-p">,</span> <span class="gmail-o">&</span><span class="gmail-n">argv</span><span class="gmail-p">);</span></span>
<span lang="c" class="gmail-line" id="gmail-LC12"> <span class="gmail-n">MPI_Comm_rank</span><span class="gmail-p">(</span><span class="gmail-n">MPI_COMM_WORLD</span><span class="gmail-p">,</span> <span class="gmail-o">&</span><span class="gmail-n">rank</span><span class="gmail-p">);</span></span>
<span lang="c" class="gmail-line" id="gmail-LC13"> <span class="gmail-n">MPI_Comm_size</span><span class="gmail-p">(</span><span class="gmail-n">MPI_COMM_WORLD</span><span class="gmail-p">,</span> <span class="gmail-o">&</span><span class="gmail-n">numP</span><span class="gmail-p">);</span></span>
<span lang="c" class="gmail-line" id="gmail-LC14"></span>
<span lang="c" class="gmail-line" id="gmail-LC15"> <span class="gmail-n">MPI_Comm</span> <span class="gmail-n">intercomm</span><span class="gmail-p">,</span> <span class="gmail-n">intracomm</span><span class="gmail-p">;</span></span>
<span lang="c" class="gmail-line" id="gmail-LC16"> <span class="gmail-n">MPI_Comm_get_parent</span><span class="gmail-p">(</span><span class="gmail-o">&</span><span class="gmail-n">intercomm</span><span class="gmail-p">);</span></span>
<span lang="c" class="gmail-line" id="gmail-LC17"> <span class="gmail-k">if</span><span class="gmail-p">(</span><span class="gmail-n">intercomm</span> <span class="gmail-o">==</span> <span class="gmail-n">MPI_COMM_NULL</span><span class="gmail-p">)</span> <span class="gmail-p">{</span></span>
<span lang="c" class="gmail-line" id="gmail-LC18"> <span class="gmail-n">numO</span> <span class="gmail-o">=</span> <span class="gmail-n">atoi</span><span class="gmail-p">(</span><span class="gmail-n">argv</span><span class="gmail-p">[</span><span class="gmail-mi">1</span><span class="gmail-p">]);</span></span>
<span lang="c" class="gmail-line" id="gmail-LC19"> <span class="gmail-n">MPI_Comm_spawn</span><span class="gmail-p">(</span><span class="gmail-n">argv</span><span class="gmail-p">[</span><span class="gmail-mi">0</span><span class="gmail-p">],</span> <span class="gmail-n">MPI_ARGV_NULL</span><span class="gmail-p">,</span> <span class="gmail-n">numO</span><span class="gmail-p">,</span> <span class="gmail-n">MPI_INFO_NULL</span><span class="gmail-p">,</span> <span class="gmail-mi">0</span><span class="gmail-p">,</span> <span class="gmail-n">MPI_COMM_WORLD</span><span class="gmail-p">,</span> <span class="gmail-o">&</span><span class="gmail-n">intercomm</span><span class="gmail-p">,</span> <span class="gmail-n">MPI_ERRCODES_IGNORE</span><span class="gmail-p">);</span></span>
<span lang="c" class="gmail-line" id="gmail-LC20"> <span class="gmail-n">order</span> <span class="gmail-o">=</span> <span class="gmail-mi">0</span><span class="gmail-p">;</span></span>
<span lang="c" class="gmail-line" id="gmail-LC21"> <span class="gmail-p">}</span> <span class="gmail-k">else</span> <span class="gmail-p">{</span> <span class="gmail-n">order</span> <span class="gmail-o">=</span> <span class="gmail-mi">1</span><span class="gmail-p">;</span> <span class="gmail-p">}</span></span>
<span lang="c" class="gmail-line" id="gmail-LC22"></span>
<span lang="c" class="gmail-line" id="gmail-LC23"> <span class="gmail-n">MPI_Intercomm_merge</span><span class="gmail-p">(</span><span class="gmail-n">intercomm</span><span class="gmail-p">,</span> <span class="gmail-n">order</span><span class="gmail-p">,</span> <span class="gmail-o">&</span><span class="gmail-n">intracomm</span><span class="gmail-p">);</span></span>
<span lang="c" class="gmail-line" id="gmail-LC24"> <span class="gmail-n">printf</span><span class="gmail-p">(</span><span class="gmail-s">"TEST 1 P%02d/%d</span><span class="gmail-se">\n</span><span class="gmail-s">"</span><span class="gmail-p">,</span> <span class="gmail-n">rank</span><span class="gmail-p">,</span> <span class="gmail-n">numP</span><span class="gmail-p">);</span></span>
<span lang="c" class="gmail-line" id="gmail-LC25"> <span class="gmail-n">MPI_Bcast</span><span class="gmail-p">(</span><span class="gmail-o">&</span><span class="gmail-n">test</span><span class="gmail-p">,</span> <span class="gmail-mi">1</span><span class="gmail-p">,</span> <span class="gmail-n">MPI_DOUBLE</span><span class="gmail-p">,</span> <span class="gmail-mi">0</span><span class="gmail-p">,</span> <span class="gmail-n">intracomm</span><span class="gmail-p">);</span></span> // Hangs here
<span lang="c" class="gmail-line" id="gmail-LC26"> <span class="gmail-k">if</span><span class="gmail-p">(</span><span class="gmail-n">solution</span><span class="gmail-p">)</span> <span class="gmail-p">{</span> <span class="gmail-n">MPI_Barrier</span><span class="gmail-p">(</span><span class="gmail-n">intercomm</span><span class="gmail-p">);</span> <span class="gmail-p">}</span></span>
<span lang="c" class="gmail-line" id="gmail-LC27"> <span class="gmail-n">printf</span><span class="gmail-p">(</span><span class="gmail-s">"TEST 2 P%02d/%d</span><span class="gmail-se">\n</span><span class="gmail-s">"</span><span class="gmail-p">,</span> <span class="gmail-n">rank</span><span class="gmail-p">,</span> <span class="gmail-n">numP</span><span class="gmail-p">);</span></span>
<span lang="c" class="gmail-line" id="gmail-LC28"></span>
<span lang="c" class="gmail-line" id="gmail-LC29"> <span class="gmail-n">MPI_Finalize</span><span class="gmail-p">();</span></span>
<span lang="c" class="gmail-line" id="gmail-LC30"> <span class="gmail-k">return</span> <span class="gmail-mi">0</span><span class="gmail-p">;</span></span>
<span lang="c" class="gmail-line" id="gmail-LC31"><span class="gmail-p">}</span></span>
</pre></div><div>==========================</div>The code hangs only at the MPI_Bcast operation, and only some of the spawned processes are affected. All my executions used a single node with 20 cores, starting with 10 initial processes and spawning 20 more with MPI_Comm_spawn. If I set the variable "solution" to 1, the application hangs much less often, but it still happens on some occasions.<br><br>From my perspective, the code seems to follow the MPI standard. Is this the case? I have been able to run the code with other OFI providers, so I am confused as to why it does not work in this case.<div><br></div><div>Thank you for your time.</div><div><div>Best regards,<br></div><div>Iker</div></div></div>