<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
span.EmailStyle19
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;
mso-ligatures:none;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal">Hi Kurt,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
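<p class="MsoNormal">A little searching suggests that the 64 KB limit you see in the log is due to the Slurm daemons starting before the increased limit is set. Processes launched by the daemons inherit the daemons' limit, not the one you see in an ssh login shell.<o:p></o:p></p>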
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><a href="https://www.mail-archive.com/slurm-dev@lists.llnl.gov/msg00960.html">https://www.mail-archive.com/slurm-dev@lists.llnl.gov/msg00960.html</a><o:p></o:p></p>
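<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">One way to check this, as a rough sketch: compare the limit a job step actually gets with the limit of the running daemon on a compute node. This assumes the daemon process is named slurmd and that you can run srun against n004; adjust for your setup.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><b># limit seen by a process launched through Slurm (assumes srun can target n004)<o:p></o:p></b></p>
<p class="MsoNormal"><b>$ srun -N1 -w n004 bash -c 'ulimit -l'<o:p></o:p></b></p>
<p class="MsoNormal"><b># limit of the running daemon itself; run this on n004 (assumes the process is named slurmd)<o:p></o:p></b></p>
<p class="MsoNormal"><b>$ grep 'Max locked memory' /proc/$(pgrep -o slurmd)/limits<o:p></o:p></b></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Note that ulimit reports kbytes while /proc reports bytes, so 64 in the first command corresponds to 65536 in the second. Keep in mind that the srun result can also reflect limits propagated from the submitting shell (see PropagateResourceLimits in slurm.conf), so the daemon's own value is the more direct check of the theory above.<o:p></o:p></p>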
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">It seems you can remedy this by adding lines to the Slurm startup script to increase the locked memory limit before starting the daemons. Can you give it a try?<o:p></o:p></p>
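<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Something along these lines, as a minimal sketch and not a tested recipe. It assumes slurmd is started from a shell init script on each compute node; the exact file and the line that starts the daemon depend on your installation.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><b># in the Slurm startup script, before the line that starts slurmd<o:p></o:p></b></p>
<p class="MsoNormal"><b>ulimit -l unlimited<o:p></o:p></b></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">If slurmd is managed by systemd instead (likely on RHEL 8, though that is an assumption), the analogous setting would be LimitMEMLOCK=infinity in the [Service] section of the slurmd unit. Either way, the daemons need a restart before jobs pick up the new limit.<o:p></o:p></p>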
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Ken<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="margin-left:.5in"><b><span style="font-size:12.0pt;color:black">From:
</span></b><span style="font-size:12.0pt;color:black">"Mccall, Kurt E. (MSFC-EV41) via discuss" &lt;discuss@mpich.org&gt;<br>
<b>Reply-To: </b>"discuss@mpich.org" &lt;discuss@mpich.org&gt;<br>
<b>Date: </b>Saturday, October 21, 2023 at 1:59 PM<br>
<b>To: </b>"discuss@mpich.org" &lt;discuss@mpich.org&gt;<br>
<b>Cc: </b>"Mccall, Kurt E. (MSFC-EV41)" &lt;kurt.e.mccall@nasa.gov&gt;<br>
<b>Subject: </b>[mpich-discuss] Can't allocate comm memory for ch4, UCX<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
</div>
<p class="MsoNormal" style="margin-left:.5in">Any clue what is causing this problem? My job is failing on all compute nodes with this message:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><b>ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)<o:p></o:p></b></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">It doesn’t make sense because on all nodes
<i><span style="color:red">except for the head node</span></i>, the memory locking limit is “unlimited”:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><b>$ ssh n004<o:p></o:p></b></p>
<p class="MsoNormal" style="margin-left:.5in"><b>$ ulimit -Hl -Sl<o:p></o:p></b></p>
<p class="MsoNormal" style="margin-left:.5in"><b>max locked memory (kbytes, -l) unlimited<o:p></o:p></b></p>
<p class="MsoNormal" style="margin-left:.5in"><b>max locked memory (kbytes, -l) unlimited<o:p></o:p></b></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">On the head node, the limit is still set to 64 KB, but I thought that it wouldn’t matter since the processes run on the compute nodes.
<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">I configured MPICH with ch4 and UCX 1.15.0:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><b>$ ../mpich-4.1.2/configure --prefix=/opt/mpich --with-device=ch4:ucx --with-ucx=/home/kmccall/ucx-1.15.0/install --with-slurm -enable-debuginfo --enable-g=debug 2>&1 | tee c.txt<o:p></o:p></b></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">and my launch command is unremarkable:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><b>mpiexec \<o:p></o:p></b></p>
<p class="MsoNormal" style="margin-left:.5in"><b> -print-all-exitcodes \<o:p></o:p></b></p>
<p class="MsoNormal" style="margin-left:.5in"><b> -wdir ${work_dir} \<o:p></o:p></b></p>
<p class="MsoNormal" style="margin-left:.5in"><b> -np ${num_proc} \<o:p></o:p></b></p>
<p class="MsoNormal" style="margin-left:.5in"><b> -ppn 1 \<o:p></o:p></b></p>
<p class="MsoNormal" style="margin-left:.5in"><b> application…<o:p></o:p></b></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">The cluster is running Red Hat Enterprise Linux release 8.6 and Slurm 20.11.8, and the compiler is g++ 8.5.0.
<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Just for reference, here is the complete error message sequence from slurm.out:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">SLURM_JOB_NODELIST = n[004-008]<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">[1697913222.014184] [n004:39165:0] ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">[1697913222.014288] [n006:38767:0] ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">[1697913222.014374] [n005:39166:0] ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">[1697913222.014851] [n008:38524:0] ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">[1697913222.014807] [n007:38753:0] ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Abort(607769871) on node 0 (rank 0 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">internal_Init(66).............: MPI_Init(argc=0x7ffe9e2b475c, argv=0x7ffe9e2b4750) failed<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">MPII_Init_thread(257).........:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">MPIR_init_comm_world(34)......:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">MPIR_Comm_commit(770).........:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">MPIR_Comm_commit_internal(558):<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">MPID_Comm_commit_pre_hook(158):<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">MPIDI_UCX_init_world(252).....:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">init_worker(38)...............: ucx function returned with failed status(ucx_init.c 38 init_worker Input/output error)<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Abort(339334415) on node 2 (rank 2 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">internal_Init(66).............: MPI_Init(argc=0x7ffd5fd79e5c, argv=0x7ffd5fd79e50) failed<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">MPII_Init_thread(257).........:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">MPIR_init_comm_world(34)......:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">MPIR_Comm_commit(770).........:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">MPIR_Comm_commit_internal(558):<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">MPID_Comm_commit_pre_hook(158):<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">MPIDI_UCX_init_world(252).....:<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">init_worker(38)...............: ucx function returned with failed status(ucx_init.c 38 init_worker Input/output error)<o:p></o:p></p>
</div>
</body>
</html>