<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Hi Kurt,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
We have fixed the bug that occurred when launching with PMI1 and mpiexec -- <a href="https://github.com/pmodels/mpich/issues/5835" id="LPlnkOWALinkPreview">
https://github.com/pmodels/mpich/issues/5835</a>. Could you check out the latest <code>
main</code> branch from GitHub and test? Let us know if you need instructions on building from git checkouts. We are working on the remaining scenarios.</div>
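For reference, a build from a git checkout generally follows the sketch below (the install prefix and job count are placeholders -- adjust them for your site, and see the MPICH README for the authoritative steps):

```shell
# Sketch of building MPICH from a git checkout; prefix is a placeholder.
git clone https://github.com/pmodels/mpich.git
cd mpich
git submodule update --init            # fetch embedded modules (e.g. hwloc)
./autogen.sh                           # requires autoconf/automake/libtool
./configure --prefix=$HOME/mpich-install 2>&1 | tee c.txt
make -j 8 2>&1 | tee m.txt
make install 2>&1 | tee mi.txt
export PATH=$HOME/mpich-install/bin:$PATH   # pick up the new mpiexec/mpicc
```

After installing, `which mpiexec` and `mpiexec --version` are a quick way to confirm the new build is the one on your PATH.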
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
-- <br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Hui Zhou<br>
</div>
<br>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Mccall, Kurt E. (MSFC-EV41) via discuss &lt;discuss@mpich.org&gt;<br>
<b>Sent:</b> Thursday, February 17, 2022 2:38 PM<br>
<b>To:</b> discuss@mpich.org &lt;discuss@mpich.org&gt;<br>
<b>Cc:</b> Mccall, Kurt E. (MSFC-EV41) &lt;kurt.e.mccall@nasa.gov&gt;<br>
<b>Subject:</b> Re: [mpich-discuss] MPI_Init hangs under Slurm</font>
<div> </div>
</div>
<style>
<!--
@font-face
{font-family:"Cambria Math"}
@font-face
{font-family:Calibri}
p.x_MsoNormal, li.x_MsoNormal, div.x_MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif}
span.x_EmailStyle19
{font-family:"Calibri",sans-serif;
color:windowtext}
.x_MsoChpDefault
{font-size:10.0pt}
@page WordSection1
{margin:1.0in 1.0in 1.0in 1.0in}
div.x_WordSection1
{}
-->
</style>
<div lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="x_WordSection1">
<p class="x_MsoNormal">Sorry, my attachment with an .out extension was blocked. Here is the file with a .txt extension.</p>
<p class="x_MsoNormal"> </p>
<div>
<div style="border:none; border-top:solid #E1E1E1 1.0pt; padding:3.0pt 0in 0in 0in">
<p class="x_MsoNormal"><b>From:</b> Mccall, Kurt E. (MSFC-EV41) &lt;kurt.e.mccall@nasa.gov&gt;
<br>
<b>Sent:</b> Thursday, February 17, 2022 2:36 PM<br>
<b>To:</b> discuss@mpich.org<br>
<b>Cc:</b> Mccall, Kurt E. (MSFC-EV41) &lt;kurt.e.mccall@nasa.gov&gt;<br>
<b>Subject:</b> MPI_Init hangs under Slurm</p>
</div>
</div>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">Things were working fine when I was launching 1-node jobs under Slurm 20.11.8, but when I launched a 20-node job, MPICH hung in MPI_Init. The output of “mpiexec -verbose” is attached, and the stack trace at the point where it hangs is below.</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">In the “mpiexec -verbose” output, I wonder why variables such as PATH_modshare point to our Intel MPI implementation, which I am not using. I am using MPICH 4.0 with a patch that Ken Raffenetti provided (which makes MPICH recognize the “host” info key). My $PATH and $LD_LIBRARY_PATH variables definitely point to the correct MPICH installation.</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">I appreciate any help you can give.</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">Here is the Slurm sbatch command:</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">sbatch --nodes=20 --ntasks=20 --job-name $job_name --exclusive --verbose
</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">Here is the mpiexec command:</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">mpiexec -verbose -launcher ssh -print-all-exitcodes -np 20 -wdir ${work_dir} -env DISPLAY localhost:10.0 --ppn 1 &lt;many more args…&gt;</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">Stack trace at MPI_Init:</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">#0 0x00007f6d85f499b2 in read () from /lib64/libpthread.so.0</p>
<p class="x_MsoNormal">#1 0x00007f6d87a5753a in PMIU_readline (fd=5, buf=buf@entry=0x7ffd6fb596e0 "", maxlen=maxlen@entry=1024)</p>
<p class="x_MsoNormal"> at ../mpich-slurm-patch-4.0/src/pmi/simple/simple_pmiutil.c:134</p>
<p class="x_MsoNormal">#2 0x00007f6d87a57a56 in GetResponse (request=0x7f6d87b48351 "cmd=barrier_in\n",</p>
<p class="x_MsoNormal"> expectedCmd=0x7f6d87b48345 "barrier_out", checkRc=0) at ../mpich-slurm-patch-4.0/src/pmi/simple/simple_pmi.c:818</p>
<p class="x_MsoNormal">#3 0x00007f6d87a29915 in MPIDI_PG_SetConnInfo (rank=rank@entry=0,</p>
<p class="x_MsoNormal"> connString=connString@entry=0x1bbf5a0 "description#n001$port#33403$ifname#172.16.56.1$")</p>
<p class="x_MsoNormal"> at ../mpich-slurm-patch-4.0/src/mpid/ch3/src/mpidi_pg.c:559</p>
<p class="x_MsoNormal">#4 0x00007f6d87a38611 in MPID_nem_init (pg_rank=pg_rank@entry=0, pg_p=pg_p@entry=0x1bbf850, has_parent=&lt;optimized out&gt;)</p>
<p class="x_MsoNormal"> at ../mpich-slurm-patch-4.0/src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c:393</p>
<p class="x_MsoNormal">#5 0x00007f6d87a2ad93 in MPIDI_CH3_Init (has_parent=&lt;optimized out&gt;, pg_p=0x1bbf850, pg_rank=0)</p>
<p class="x_MsoNormal"> at ../mpich-slurm-patch-4.0/src/mpid/ch3/channels/nemesis/src/ch3_init.c:83</p>
<p class="x_MsoNormal">#6 0x00007f6d87a1b3b7 in init_world () at ../mpich-slurm-patch-4.0/src/mpid/ch3/src/mpid_init.c:190</p>
<p class="x_MsoNormal">#7 MPID_Init (requested=&lt;optimized out&gt;, provided=provided@entry=0x7f6d87e03540 &lt;MPIR_ThreadInfo&gt;)</p>
<p class="x_MsoNormal"> at ../mpich-slurm-patch-4.0/src/mpid/ch3/src/mpid_init.c:76</p>
<p class="x_MsoNormal">#8 0x00007f6d879828eb in MPII_Init_thread (argc=argc@entry=0x7ffd6fb5a5cc, argv=argv@entry=0x7ffd6fb5a5c0,</p>
<p class="x_MsoNormal"> user_required=0, provided=provided@entry=0x7ffd6fb5a574, p_session_ptr=p_session_ptr@entry=0x0)</p>
<p class="x_MsoNormal"> at ../mpich-slurm-patch-4.0/src/mpi/init/mpir_init.c:208</p>
<p class="x_MsoNormal">#9 0x00007f6d879832a5 in MPIR_Init_impl (argc=0x7ffd6fb5a5cc, argv=0x7ffd6fb5a5c0)</p>
<p class="x_MsoNormal"> at ../mpich-slurm-patch-4.0/src/mpi/init/mpir_init.c:93</p>
<p class="x_MsoNormal">#10 0x00007f6d8786388e in PMPI_Init (argc=0x7ffd6fb5a5cc, argv=0x7ffd6fb5a5c0)</p>
<p class="x_MsoNormal"> at ../mpich-slurm-patch-4.0/src/binding/c/init/init.c:46</p>
<p class="x_MsoNormal">#11 0x000000000040640d in main (argc=23, argv=0x7ffd6fb5ad68) at src/NeedlesMpiManagerMain.cpp:53</p>
<p class="x_MsoNormal"> </p>
</div>
</div>
</body>
</html>