<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Kurt,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
The stack trace shows <code>mpiexec</code> in its polling loop. Since no other PMI messages are being logged, it is still waiting for the processes to call
<code>PMI_Init</code>, which MPICH calls from inside <code>MPI_Init</code>. Can you make your program print something before and after
<code>MPI_Init</code>? That will tell us whether the program is stuck in
<code>MPI_Init</code> or before <code>MPI_Init</code>.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Hui<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Mccall, Kurt E. (MSFC-EV41) &lt;kurt.e.mccall@nasa.gov&gt;<br>
<b>Sent:</b> Tuesday, June 14, 2022 2:21 AM<br>
<b>To:</b> discuss@mpich.org &lt;discuss@mpich.org&gt;; Zhou, Hui &lt;zhouh@anl.gov&gt;<br>
<b>Cc:</b> Mccall, Kurt E. (MSFC-EV41) &lt;kurt.e.mccall@nasa.gov&gt;<br>
<b>Subject:</b> Re: mpiexec fails to launch any processes</font>
<div> </div>
</div>
<style>
<!--
@font-face
{font-family:"Cambria Math"}
@font-face
{font-family:Calibri}
p.x_MsoNormal, li.x_MsoNormal, div.x_MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif}
a:link, span.x_MsoHyperlink
{color:#0563C1;
text-decoration:underline}
code
{font-family:"Courier New"}
p.x_xmsonormal, li.x_xmsonormal, div.x_xmsonormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif}
p.x_xxmsonormal, li.x_xxmsonormal, div.x_xxmsonormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif}
span.x_EmailStyle23
{font-family:"Calibri",sans-serif;
color:windowtext}
.x_MsoChpDefault
{font-size:10.0pt}
@page WordSection1
{margin:1.0in 1.0in 1.0in 1.0in}
div.x_WordSection1
{}
-->
</style>
<div lang="EN-US" link="#0563C1" vlink="purple" style="word-wrap:break-word">
<div class="x_WordSection1">
<p class="x_MsoNormal">Hui,</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">Slurm doesn’t seem to be killing the job, as it still shows up when I run squeue. A gdb stack trace shows where mpiexec is stuck – does this tell you anything?</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">#0 0x00007f9c895ddaa8 in poll () from /lib64/libc.so.6</p>
<p class="x_MsoNormal">#1 0x000000000045352c in HYDT_dmxu_poll_wait_for_event (wtime=-1)</p>
<p class="x_MsoNormal"> at ../../../../mpich-4.0.1/src/pm/hydra/tools/demux/demux_poll.c:39</p>
<p class="x_MsoNormal">#2 0x0000000000452e9a in HYDT_dmx_wait_for_event (wtime=-1)</p>
<p class="x_MsoNormal"> at ../../../../mpich-4.0.1/src/pm/hydra/tools/demux/demux.c:168</p>
<p class="x_MsoNormal">#3 0x000000000040cda4 in HYD_pmci_wait_for_completion (timeout=-1)</p>
<p class="x_MsoNormal"> at ../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:157</p>
<p class="x_MsoNormal">#4 0x0000000000404177 in main (argc=33, argv=0x7fff054e6888)</p>
<p class="x_MsoNormal"> at ../../../../mpich-4.0.1/src/pm/hydra/ui/mpich/mpiexec.c:324</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">Thanks,</p>
<p class="x_MsoNormal">Kurt</p>
<p class="x_MsoNormal"><span style="font-size:14.0pt"> </span></p>
<div>
<div style="border:none; border-top:solid #E1E1E1 1.0pt; padding:3.0pt 0in 0in 0in">
<p class="x_MsoNormal"><b>From:</b> Mccall, Kurt E. (MSFC-EV41) via discuss &lt;discuss@mpich.org&gt;
<br>
<b>Sent:</b> Monday, June 13, 2022 4:16 PM<br>
<b>To:</b> Zhou, Hui &lt;zhouh@anl.gov&gt;; discuss@mpich.org<br>
<b>Cc:</b> Mccall, Kurt E. (MSFC-EV41) &lt;kurt.e.mccall@nasa.gov&gt;<br>
<b>Subject:</b> Re: [mpich-discuss] [EXTERNAL] Re: mpiexec fails to launch any processes</p>
</div>
</div>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal"><span style="font-size:12.0pt">Hui,</span></p>
<p class="x_MsoNormal"><span style="font-size:12.0pt"> </span></p>
<p class="x_MsoNormal"><span style="font-size:12.0pt">That worked too. I guess I’ll have to find a way to pass a “verbose” argument to sbatch and see why Slurm is killing my application.</span></p>
<p class="x_MsoNormal"><span style="font-size:12.0pt"> </span></p>
<p class="x_MsoNormal"><span style="font-size:12.0pt">Thanks,</span></p>
<p class="x_MsoNormal"><span style="font-size:12.0pt">Kurt</span></p>
<p class="x_MsoNormal"><span style="font-size:14.0pt"> </span></p>
<div>
<div style="border:none; border-top:solid #E1E1E1 1.0pt; padding:3.0pt 0in 0in 0in">
<p class="x_MsoNormal"><b>From:</b> Zhou, Hui <<a href="mailto:zhouh@anl.gov">zhouh@anl.gov</a>>
<br>
<b>Sent:</b> Monday, June 13, 2022 4:11 PM<br>
<b>To:</b> Mccall, Kurt E. (MSFC-EV41) <<a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a>>;
<a href="mailto:discuss@mpich.org">discuss@mpich.org</a><br>
<b>Subject:</b> Re: [EXTERNAL] Re: mpiexec fails to launch any processes</p>
</div>
</div>
<p class="x_MsoNormal"> </p>
<div>
<p class="x_MsoNormal"><span style="font-size:12.0pt; color:black">Kurt,</span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-size:12.0pt; color:black"> </span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-size:12.0pt; color:black">Could you try launching <code>hostname</code> with the same command?</span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-size:12.0pt; color:black"> </span></p>
</div>
<div>
<p class="x_MsoNormal"><code><span style="font-size:10.0pt; color:black">  mpiexec -launcher ssh -verbose -print-all-exitcodes -wdir &lt;directory&gt; -np 20 -ppn 1 hostname</span></code><span style="font-size:12.0pt; color:black"></span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-size:12.0pt; color:black"> </span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-size:12.0pt; color:black">If that works, the problem seems to point to your application: something in your code is causing Slurm to kill the job.</span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-size:12.0pt; color:black"> </span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-size:12.0pt; color:black">-- </span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-size:12.0pt; color:black">Hui</span></p>
</div>
<div class="x_MsoNormal" align="center" style="text-align:center">
<hr size="2" width="98%" align="center">
</div>
<div id="x_divRplyFwdMsg">
<p class="x_MsoNormal"><b><span style="color:black">From:</span></b><span style="color:black"> Mccall, Kurt E. (MSFC-EV41) <<a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a>><br>
<b>Sent:</b> Monday, June 13, 2022 4:02 PM<br>
<b>To:</b> Zhou, Hui <<a href="mailto:zhouh@anl.gov">zhouh@anl.gov</a>>; <a href="mailto:discuss@mpich.org">
discuss@mpich.org</a> <<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>><br>
<b>Subject:</b> RE: [EXTERNAL] Re: mpiexec fails to launch any processes</span> </p>
<div>
<p class="x_MsoNormal"> </p>
</div>
</div>
<div>
<div>
<p class="x_xmsonormal">Hui,</p>
<p class="x_xmsonormal"> </p>
<p class="x_xmsonormal">$ mpiexec -N 10 -hostfile MySlurmNodeFile2 hostname</p>
<p class="x_xmsonormal"> </p>
<p class="x_xmsonormal">works properly, reporting from each of 10 nodes.</p>
<p class="x_xmsonormal"> </p>
<p class="x_xmsonormal">Kurt</p>
<p class="x_xmsonormal"><span style="font-size:14.0pt"> </span></p>
<div>
<div style="border:none; border-top:solid #E1E1E1 1.0pt; padding:3.0pt 0in 0in 0in">
<p class="x_xmsonormal"><b>From:</b> Zhou, Hui <<a href="mailto:zhouh@anl.gov">zhouh@anl.gov</a>>
<br>
<b>Sent:</b> Monday, June 13, 2022 2:44 PM<br>
<b>To:</b> <a href="mailto:discuss@mpich.org">discuss@mpich.org</a><br>
<b>Cc:</b> Mccall, Kurt E. (MSFC-EV41) <<a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a>><br>
<b>Subject:</b> [EXTERNAL] Re: mpiexec fails to launch any processes</p>
</div>
</div>
<p class="x_xmsonormal"> </p>
<div>
<p class="x_xmsonormal"><span style="font-size:12.0pt; color:black">Hi Kurt,</span></p>
</div>
<div>
<p class="x_xmsonormal"><span style="font-size:12.0pt; color:black"> </span></p>
</div>
<div>
<p class="x_xmsonormal"><span style="font-size:12.0pt; color:black">I don't have a clear idea yet. Are you able to launch a trivial application, for example <code>hostname</code>?</span></p>
</div>
<div>
<p class="x_xmsonormal"><span style="font-size:12.0pt; color:black"> </span></p>
</div>
<div>
<p class="x_xmsonormal"><span style="font-size:12.0pt; color:black">-- </span></p>
</div>
<div>
<p class="x_xmsonormal"><span style="font-size:12.0pt; color:black">Hui</span></p>
</div>
<div class="x_MsoNormal" align="center" style="text-align:center">
<hr size="2" width="98%" align="center">
</div>
<div id="x_x_divRplyFwdMsg">
<p class="x_xmsonormal"><b><span style="color:black">From:</span></b><span style="color:black"> Mccall, Kurt E. (MSFC-EV41) via discuss <<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>><br>
<b>Sent:</b> Monday, June 13, 2022 12:29 PM<br>
<b>To:</b> <a href="mailto:discuss@mpich.org">discuss@mpich.org</a> <<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>><br>
<b>Cc:</b> Mccall, Kurt E. (MSFC-EV41) <<a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a>><br>
<b>Subject:</b> Re: [mpich-discuss] mpiexec fails to launch any processes</span> </p>
<div>
<p class="x_xmsonormal"> </p>
</div>
</div>
<div>
<div>
<div>
<div style="border:none; border-top:solid #E1E1E1 1.0pt; padding:3.0pt 0in 0in 0in">
<p class="x_xxmsonormal">Outlook blocked the output file slurm.out that I had attached. Trying to send it again as slurm.txt.</p>
<p class="x_xxmsonormal"> </p>
<p class="x_xxmsonormal">Kurt</p>
<p class="x_xxmsonormal"> </p>
</div>
</div>
<p class="x_xxmsonormal"> </p>
<p class="x_xxmsonormal">Hi, </p>
<p class="x_xxmsonormal"> </p>
<p class="x_xxmsonormal">My mpiexec command fails to launch any processes. I ran it with the -verbose option but didn’t see any obvious errors in the output (attached).</p>
<p class="x_xxmsonormal"> </p>
<p class="x_xxmsonormal">The command is:</p>
<p class="x_xxmsonormal"> </p>
<p class="x_xxmsonormal">mpiexec -launcher ssh -verbose -print-all-exitcodes -wdir &lt;directory&gt; -np 20 -ppn 1 &lt;more args…&gt;</p>
<p class="x_xxmsonormal"> </p>
<p class="x_xxmsonormal">I am running MPICH 4.0.1 under Slurm 20.11.8. Thanks for any help.</p>
<p class="x_xxmsonormal"> </p>
<p class="x_xxmsonormal">Kurt</p>
</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>