<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri",sans-serif;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal">Any clue what is causing this problem? My job is failing on all compute nodes with this message:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><b>ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)<o:p></o:p></b></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">It doesn’t make sense because on all nodes <i><span style="color:red">except for the head node</span></i>, the memory locking limit is “unlimited”:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><b>$ ssh n004<o:p></o:p></b></p>
<p class="MsoNormal"><b>$ ulimit -Hl -Sl<o:p></o:p></b></p>
<p class="MsoNormal"><b>max locked memory (kbytes, -l) unlimited<o:p></o:p></b></p>
<p class="MsoNormal"><b>max locked memory (kbytes, -l) unlimited<o:p></o:p></b></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">On the head node, the limit is still 64 kbytes, but I assumed that would not matter since the processes run on the compute nodes.<o:p></o:p></p>
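<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Since an ssh login does not necessarily get the same limits as a process started by the launcher, it may be worth printing the limit from inside the job itself. The following is only a minimal sketch of such a check, not something I have confirmed as the cause; the function name report_memlock is made up for illustration and is not part of MPICH, UCX, or Slurm:<o:p></o:p></p>
<pre>
```shell
# Hypothetical helper (not part of MPICH, UCX, or Slurm): print the
# soft and hard locked-memory limits of the calling shell, tagged
# with the node name, so the output of several nodes can be compared.
report_memlock() {
    soft=$(ulimit -S -l)   # soft limit, in kbytes or "unlimited"
    hard=$(ulimit -H -l)   # hard limit, in kbytes or "unlimited"
    echo "$(uname -n): memlock soft=${soft} hard=${hard}"
}

report_memlock
```
</pre>
<p class="MsoNormal">Saved as a script (memlock.sh is a hypothetical file name), it could be run through the same launcher as the application, for example with mpiexec -np ${num_proc} -ppn 1 sh ./memlock.sh, so that each line shows the limits the launched processes actually inherit rather than those of an interactive login.<o:p></o:p></p>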
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I configured MPICH 4.1.2 with the ch4:ucx device and UCX 1.15.0:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><b>$ ../mpich-4.1.2/configure --prefix=/opt/mpich --with-device=ch4:ucx --with-ucx=/home/kmccall/ucx-1.15.0/install --with-slurm -enable-debuginfo --enable-g=debug 2&gt;&amp;1 | tee c.txt<o:p></o:p></b></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">and my launch command is unremarkable:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><b>mpiexec \<o:p></o:p></b></p>
<p class="MsoNormal"><b> -print-all-exitcodes \<o:p></o:p></b></p>
<p class="MsoNormal"><b> -wdir ${work_dir} \<o:p></o:p></b></p>
<p class="MsoNormal"><b> -np ${num_proc} \<o:p></o:p></b></p>
<p class="MsoNormal"><b> -ppn 1 \<o:p></o:p></b></p>
<p class="MsoNormal"><b> application…<o:p></o:p></b></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The cluster runs Red Hat Enterprise Linux 8.6 and Slurm 20.11.8; the compiler is g++ 8.5.0.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Just for reference, here is the complete error message sequence from slurm.out:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">SLURM_JOB_NODELIST = n[004-008]<o:p></o:p></p>
<p class="MsoNormal">[1697913222.014184] [n004:39165:0] ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)<o:p></o:p></p>
<p class="MsoNormal">[1697913222.014288] [n006:38767:0] ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)<o:p></o:p></p>
<p class="MsoNormal">[1697913222.014374] [n005:39166:0] ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)<o:p></o:p></p>
<p class="MsoNormal">[1697913222.014851] [n008:38524:0] ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)<o:p></o:p></p>
<p class="MsoNormal">[1697913222.014807] [n007:38753:0] ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)<o:p></o:p></p>
<p class="MsoNormal">Abort(607769871) on node 0 (rank 0 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:<o:p></o:p></p>
<p class="MsoNormal">internal_Init(66).............: MPI_Init(argc=0x7ffe9e2b475c, argv=0x7ffe9e2b4750) failed<o:p></o:p></p>
<p class="MsoNormal">MPII_Init_thread(257).........:<o:p></o:p></p>
<p class="MsoNormal">MPIR_init_comm_world(34)......:<o:p></o:p></p>
<p class="MsoNormal">MPIR_Comm_commit(770).........:<o:p></o:p></p>
<p class="MsoNormal">MPIR_Comm_commit_internal(558):<o:p></o:p></p>
<p class="MsoNormal">MPID_Comm_commit_pre_hook(158):<o:p></o:p></p>
<p class="MsoNormal">MPIDI_UCX_init_world(252).....:<o:p></o:p></p>
<p class="MsoNormal">init_worker(38)...............: ucx function returned with failed status(ucx_init.c 38 init_worker Input/output error)<o:p></o:p></p>
<p class="MsoNormal">Abort(339334415) on node 2 (rank 2 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:<o:p></o:p></p>
<p class="MsoNormal">internal_Init(66).............: MPI_Init(argc=0x7ffd5fd79e5c, argv=0x7ffd5fd79e50) failed<o:p></o:p></p>
<p class="MsoNormal">MPII_Init_thread(257).........:<o:p></o:p></p>
<p class="MsoNormal">MPIR_init_comm_world(34)......:<o:p></o:p></p>
<p class="MsoNormal">MPIR_Comm_commit(770).........:<o:p></o:p></p>
<p class="MsoNormal">MPIR_Comm_commit_internal(558):<o:p></o:p></p>
<p class="MsoNormal">MPID_Comm_commit_pre_hook(158):<o:p></o:p></p>
<p class="MsoNormal">MPIDI_UCX_init_world(252).....:<o:p></o:p></p>
<p class="MsoNormal">init_worker(38)...............: ucx function returned with failed status(ucx_init.c 38 init_worker Input/output error)<o:p></o:p></p>
</div>
</body>
</html>