[mpich-discuss] Can't allocate comm memory for ch4, UCX

Raffenetti, Ken raffenet at anl.gov
Tue Oct 24 09:37:46 CDT 2023

Hi Kurt,

A little searching suggests that the 64 kB limit you see in the log is due to the Slurm daemons starting before the increased limit is set, so the processes they launch inherit the old limit.


It seems you can remedy this by adding lines to the Slurm startup script to increase the locked memory limit before starting the daemons. Can you give it a try?
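
One way to check this on a compute node is to read the limit that a running process actually has from /proc, rather than the limit of an interactive shell. A minimal sketch, assuming a Linux node; the function name show_memlock is made up for illustration, and the demonstration inspects the current shell so it runs anywhere:

```shell
#!/bin/sh
# Print the soft and hard "Max locked memory" limits of a running process,
# as recorded by the kernel in /proc/<pid>/limits (values are in bytes).
# Usage: show_memlock PID   (show_memlock is a hypothetical helper name)
show_memlock() {
    awk '/^Max locked memory/ { print "soft=" $4, "hard=" $5 }' "/proc/$1/limits"
}

# Demonstration: inspect the current shell.
show_memlock "$$"
```

On a compute node you would pass the PID of slurmd instead, e.g. show_memlock "$(pgrep -o slurmd)". If that reports 65536, a line such as "ulimit -l unlimited" placed before the daemon is started in the startup script should raise it once the daemon is restarted. If slurmd is managed by systemd, the equivalent would be LimitMEMLOCK=infinity in the unit. Which of the two applies depends on how Slurm was installed on your cluster.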


From: "Mccall, Kurt E. (MSFC-EV41) via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Saturday, October 21, 2023 at 1:59 PM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov>
Subject: [mpich-discuss] Can't allocate comm memory for ch4, UCX

Any clue what is causing this problem?  My job is failing on all compute nodes with this message:

ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)

It doesn’t make sense because on all nodes except for the head node, the memory locking limit is “unlimited”:

$ ssh n004
$ ulimit -Hl -Sl
max locked memory       (kbytes, -l) unlimited
max locked memory       (kbytes, -l) unlimited

On the head node, the limit is still set at 64 kb, but I thought that it wouldn’t matter since the processes run on the compute nodes.
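
Two things may be worth checking here. An ssh login shell typically gets its limits through PAM and limits.conf, whereas a job step inherits its limits from slurmd. Slurm can additionally propagate the submitting shell's limits to the job (the PropagateResourceLimits setting in slurm.conf), so the head node's 64 kB limit may not be irrelevant. A small sketch that prints the limit from inside whatever process runs it; the function name report_memlock and the script name mentioned below are hypothetical:

```shell
#!/bin/bash
# Print the soft and hard locked-memory limits seen by this process,
# prefixed with the node name so output from several nodes can be told apart.
# Values are in kbytes, or the word "unlimited".
report_memlock() {
    printf '%s soft=%s hard=%s\n' "$(uname -n)" "$(ulimit -S -l)" "$(ulimit -H -l)"
}

report_memlock
```

Saved as report_memlock.sh and run as a job step, for example with "srun -N 5 ./report_memlock.sh" inside the batch script, it shows the limits that the MPI processes would inherit, which can differ from what an interactive ssh session on n004 reports.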

I configured MPICH with ch4 and UCX 1.15.0:

$ ../mpich-4.1.2/configure --prefix=/opt/mpich --with-device=ch4:ucx --with-ucx=/home/kmccall/ucx-1.15.0/install --with-slurm -enable-debuginfo --enable-g=debug  2>&1 | tee c.txt

and my launch command is unremarkable:

mpiexec \
        -print-all-exitcodes \
        -wdir ${work_dir} \
        -np ${num_proc} \
        -ppn 1  \

The cluster is running Red Hat Enterprise Linux release 8.6 and Slurm 20.11.8, and the compiler is g++ 8.5.0.

Just for reference, here is the complete error message sequence from slurm.out:

SLURM_JOB_NODELIST =  n[004-008]
[1697913222.014184] [n004:39165:0]        ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014288] [n006:38767:0]        ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014374] [n005:39166:0]        ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014851] [n008:38524:0]        ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014807] [n007:38753:0]        ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
Abort(607769871) on node 0 (rank 0 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(66).............: MPI_Init(argc=0x7ffe9e2b475c, argv=0x7ffe9e2b4750) failed
init_worker(38)...............:  ucx function returned with failed status(ucx_init.c 38 init_worker Input/output error)
Abort(339334415) on node 2 (rank 2 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(66).............: MPI_Init(argc=0x7ffd5fd79e5c, argv=0x7ffd5fd79e50) failed
init_worker(38)...............:  ucx function returned with failed status(ucx_init.c 38 init_worker Input/output error)