<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Hi Iker,</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Does it work with `FI_PROVIDER="verbs;ofi_rxm"`?</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Hui</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Iker Martín Álvarez via discuss &lt;discuss@mpich.org&gt;<br>
<b>Sent:</b> Monday, October 28, 2024 12:34 PM<br>
<b>To:</b> discuss@mpich.org &lt;discuss@mpich.org&gt;<br>
<b>Cc:</b> Iker Martín Álvarez &lt;martini@uji.es&gt;<br>
<b>Subject:</b> [mpich-discuss] Occasional hang with MPI_Intercomm_merge and OFI+provider verbs</font>
<div> </div>
</div>
<div>
<div dir="ltr">
<div>Hi,</div>
<div><br>
</div>
<div>Lately I have been dealing with an unexpected problem when using MPI_Comm_spawn + MPI_Intercomm_merge, where on some occasions my application hangs when two conditions are met.</div>
<div><br>
</div>
Specifically, the hang occurs when the intracommunicator returned by MPI_Intercomm_merge is used in collective operations such as MPI_Bcast. The two conditions are:<br>
- The node is oversubscribed: the number of processes is greater than the number of available physical cores.<br>
- MPICH uses CH4:ofi with FI_PROVIDER="verbs:ofi_rxd".
<div><br>
</div>
<div>I tested a minimal code with MPICH 4.2.0 and MPICH 4.2.3 configured as:</div>
<div>./configure --prefix=... --with-device=ch4:ofi --disable-psm3</div>
<div><br>
</div>
<div>The <a href="https://urldefense.us/v3/__https://lorca.act.uji.es/gitlab/martini/mpich_ofi_rxd_intracomm_hang/-/blob/main/BaseCode.c__;!!G_uCfscf7eWS!ZoB5c9APxvNk40SirehC83dWaIUE_w3yOQ2EEoJ4MaJlUaREOTMYR85Dd4SCEE0Exrr4U6ZETBg1sg$">
minimal code</a> to reproduce the problem is the following:</div>
<div>==========================</div>
<div>
<pre class="x_gmail-code x_gmail-highlight" lang="c"><span lang="c" class="x_gmail-line" id="x_gmail-LC1"><span class="x_gmail-cp">#include &lt;stdio.h&gt;</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC2"><span class="x_gmail-cp">#include &lt;stdlib.h&gt;</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC3"><span class="x_gmail-cp">#include &lt;mpi.h&gt;</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC4"></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC5"><span class="x_gmail-kt">int</span> <span class="x_gmail-nf">main</span><span class="x_gmail-p">(</span><span class="x_gmail-kt">int</span> <span class="x_gmail-n">argc</span><span class="x_gmail-p">,</span> <span class="x_gmail-kt">char</span><span class="x_gmail-o">*</span> <span class="x_gmail-n">argv</span><span class="x_gmail-p">[])</span> <span class="x_gmail-p">{</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC6"> <span class="x_gmail-kt">int</span> <span class="x_gmail-n">rank</span><span class="x_gmail-p">,</span> <span class="x_gmail-n">numP</span><span class="x_gmail-p">,</span> <span class="x_gmail-n">numO</span><span class="x_gmail-p">;</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC7"> <span class="x_gmail-kt">int</span> <span class="x_gmail-n">rootBcast</span><span class="x_gmail-p">,</span> <span class="x_gmail-n">order</span><span class="x_gmail-p">;</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC8"> <span class="x_gmail-kt">double</span> <span class="x_gmail-n">test</span> <span class="x_gmail-o">=</span> <span class="x_gmail-mi">0</span><span class="x_gmail-p">;</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC9"> <span class="x_gmail-kt">int</span> <span class="x_gmail-n">solution</span> <span class="x_gmail-o">=</span> 0<span class="x_gmail-p">;</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC10"></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC11"> <span class="x_gmail-n">MPI_Init</span><span class="x_gmail-p">(</span><span class="x_gmail-o">&</span><span class="x_gmail-n">argc</span><span class="x_gmail-p">,</span> <span class="x_gmail-o">&</span><span class="x_gmail-n">argv</span><span class="x_gmail-p">);</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC12"> <span class="x_gmail-n">MPI_Comm_rank</span><span class="x_gmail-p">(</span><span class="x_gmail-n">MPI_COMM_WORLD</span><span class="x_gmail-p">,</span> <span class="x_gmail-o">&</span><span class="x_gmail-n">rank</span><span class="x_gmail-p">);</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC13"> <span class="x_gmail-n">MPI_Comm_size</span><span class="x_gmail-p">(</span><span class="x_gmail-n">MPI_COMM_WORLD</span><span class="x_gmail-p">,</span> <span class="x_gmail-o">&</span><span class="x_gmail-n">numP</span><span class="x_gmail-p">);</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC14"></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC15"> <span class="x_gmail-n">MPI_Comm</span> <span class="x_gmail-n">intercomm</span><span class="x_gmail-p">,</span> <span class="x_gmail-n">intracomm</span><span class="x_gmail-p">;</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC16"> <span class="x_gmail-n">MPI_Comm_get_parent</span><span class="x_gmail-p">(</span><span class="x_gmail-o">&</span><span class="x_gmail-n">intercomm</span><span class="x_gmail-p">);</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC17"> <span class="x_gmail-k">if</span><span class="x_gmail-p">(</span><span class="x_gmail-n">intercomm</span> <span class="x_gmail-o">==</span> <span class="x_gmail-n">MPI_COMM_NULL</span><span class="x_gmail-p">)</span> <span class="x_gmail-p">{</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC18"> <span class="x_gmail-n">numO</span> <span class="x_gmail-o">=</span> <span class="x_gmail-n">atoi</span><span class="x_gmail-p">(</span><span class="x_gmail-n">argv</span><span class="x_gmail-p">[</span><span class="x_gmail-mi">1</span><span class="x_gmail-p">]);</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC19"> <span class="x_gmail-n">MPI_Comm_spawn</span><span class="x_gmail-p">(</span><span class="x_gmail-n">argv</span><span class="x_gmail-p">[</span><span class="x_gmail-mi">0</span><span class="x_gmail-p">],</span> <span class="x_gmail-n">MPI_ARGV_NULL</span><span class="x_gmail-p">,</span> <span class="x_gmail-n">numO</span><span class="x_gmail-p">,</span> <span class="x_gmail-n">MPI_INFO_NULL</span><span class="x_gmail-p">,</span> <span class="x_gmail-mi">0</span><span class="x_gmail-p">,</span> <span class="x_gmail-n">MPI_COMM_WORLD</span><span class="x_gmail-p">,</span> <span class="x_gmail-o">&</span><span class="x_gmail-n">intercomm</span><span class="x_gmail-p">,</span> <span class="x_gmail-n">MPI_ERRCODES_IGNORE</span><span class="x_gmail-p">);</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC20"> <span class="x_gmail-n">order</span> <span class="x_gmail-o">=</span> <span class="x_gmail-mi">0</span><span class="x_gmail-p">;</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC21"> <span class="x_gmail-p">}</span> <span class="x_gmail-k">else</span> <span class="x_gmail-p">{</span> <span class="x_gmail-n">order</span> <span class="x_gmail-o">=</span> <span class="x_gmail-mi">1</span><span class="x_gmail-p">;</span> <span class="x_gmail-p">}</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC22"></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC23"> <span class="x_gmail-n">MPI_Intercomm_merge</span><span class="x_gmail-p">(</span><span class="x_gmail-n">intercomm</span><span class="x_gmail-p">,</span> <span class="x_gmail-n">order</span><span class="x_gmail-p">,</span> <span class="x_gmail-o">&</span><span class="x_gmail-n">intracomm</span><span class="x_gmail-p">);</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC24"> <span class="x_gmail-n">printf</span><span class="x_gmail-p">(</span><span class="x_gmail-s">"TEST 1 P%02d/%d</span><span class="x_gmail-se">\n</span><span class="x_gmail-s">"</span><span class="x_gmail-p">,</span> <span class="x_gmail-n">rank</span><span class="x_gmail-p">,</span> <span class="x_gmail-n">numP</span><span class="x_gmail-p">);</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC25"> <span class="x_gmail-n">MPI_Bcast</span><span class="x_gmail-p">(</span><span class="x_gmail-o">&</span><span class="x_gmail-n">test</span><span class="x_gmail-p">,</span> <span class="x_gmail-mi">1</span><span class="x_gmail-p">,</span> <span class="x_gmail-n">MPI_DOUBLE</span><span class="x_gmail-p">,</span> <span class="x_gmail-mi">0</span><span class="x_gmail-p">,</span> <span class="x_gmail-n">intracomm</span><span class="x_gmail-p">);</span></span> // Hangs here
<span lang="c" class="x_gmail-line" id="x_gmail-LC26"> <span class="x_gmail-k">if</span><span class="x_gmail-p">(</span><span class="x_gmail-n">solution</span><span class="x_gmail-p">)</span> <span class="x_gmail-p">{</span> <span class="x_gmail-n">MPI_Barrier</span><span class="x_gmail-p">(</span><span class="x_gmail-n">intercomm</span><span class="x_gmail-p">);</span> <span class="x_gmail-p">}</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC27"> <span class="x_gmail-n">printf</span><span class="x_gmail-p">(</span><span class="x_gmail-s">"TEST 2 P%02d/%d</span><span class="x_gmail-se">\n</span><span class="x_gmail-s">"</span><span class="x_gmail-p">,</span> <span class="x_gmail-n">rank</span><span class="x_gmail-p">,</span> <span class="x_gmail-n">numP</span><span class="x_gmail-p">);</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC28"></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC29"> <span class="x_gmail-n">MPI_Finalize</span><span class="x_gmail-p">();</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC30"> <span class="x_gmail-k">return</span> <span class="x_gmail-mi">0</span><span class="x_gmail-p">;</span></span>
<span lang="c" class="x_gmail-line" id="x_gmail-LC31"><span class="x_gmail-p">}</span></span>
</pre>
</div>
<div>==========================</div>
The code hangs at the MPI_Bcast operation only for some of the spawned processes. All my runs used a single node with 20 cores, starting with 10 processes and spawning 20 more with MPI_Comm_spawn. If I set the variable "solution" to 1,
the application hangs much less often, but it still happens on some occasions.<br>
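<div>As a rough sketch of the two conditions described above, the check amounts to the following. The helper names is_oversubscribed and is_verbs_rxd are hypothetical and are not part of MPICH or libfabric; the provider check is a plain substring match on whatever string FI_PROVIDER was set to, not a parser for libfabric's provider syntax.</div>

```c
#include <string.h>

/* Hypothetical helper: returns 1 when the total number of processes
 * (initial plus spawned) exceeds the number of available cores. */
static int is_oversubscribed(int initial_procs, int spawned_procs, int cores) {
    return (initial_procs + spawned_procs) > cores;
}

/* Hypothetical helper: returns 1 when the provider string mentions both
 * "verbs" and "ofi_rxd". This is only a substring match (an assumption
 * made for illustration), not a validation of the provider string. */
static int is_verbs_rxd(const char *provider) {
    return provider != NULL
        && strstr(provider, "verbs") != NULL
        && strstr(provider, "ofi_rxd") != NULL;
}
```

<div>For the reported runs (10 initial processes, 20 spawned, 20 cores), is_oversubscribed(10, 20, 20) is true because 30 processes share 20 cores, and is_verbs_rxd("verbs:ofi_rxd") is also true, so both conditions hold.</div>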
<br>
From my perspective, the code seems to follow the standard. Is this the case? I have been able to run the code with other OFI providers, but I am confused as to why it does not work in this case.
<div><br>
</div>
<div>Thank you for your time.</div>
<div>
<div>Best regards,<br>
</div>
<div>Iker</div>
</div>
</div>
</div>
</body>
</html>