[mpich-discuss] Can't allocate comm memory for ch4, UCX

Raffenetti, Ken raffenet at anl.gov
Tue Oct 24 09:37:46 CDT 2023


Hi Kurt,

A little searching suggests that the 64 kB limit you see in the log is due to the Slurm daemons starting before the increased limit takes effect, so the processes they launch inherit the old value.

https://www.mail-archive.com/slurm-dev@lists.llnl.gov/msg00960.html

It seems you can remedy this by adding lines to the Slurm startup script to increase the locked memory limit before starting the daemons. Can you give it a try?
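
As a rough illustration of why the ordering matters, here is a minimal bash sketch. It shows only that a child process sees whatever locked-memory limit was in force when it was started. The helper name child_memlock is hypothetical, and the comments about the Slurm startup script and systemd are assumptions about how the daemons are launched on this cluster, not something confirmed in this thread.

```shell
#!/usr/bin/env bash
# Sketch: a process keeps the locked-memory limit it had when it started and
# passes that limit on to everything it launches.

# Hypothetical helper: set the soft limit to $1 kB inside a subshell, start a
# child, and print the limit the child sees.
child_memlock() {
    ( ulimit -S -l "$1" && bash -c 'ulimit -S -l' )
}

# The child reports the value that was set before it was launched.
child_memlock 16

# In a SysV-style Slurm startup script, the analogous fix would be a line like
# the following, placed before the daemons are started (requires root):
#   ulimit -l unlimited
# If the daemons are managed by systemd instead (an assumption), the usual
# equivalent is LimitMEMLOCK=infinity in the [Service] section of a drop-in
# for the slurmd unit, followed by a restart of the daemon.
```

Either way, the daemons have to be restarted after the change, since a limit raised later does not reach processes that are already running.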

Ken

From: "Mccall, Kurt E. (MSFC-EV41) via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Saturday, October 21, 2023 at 1:59 PM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov>
Subject: [mpich-discuss] Can't allocate comm memory for ch4, UCX

Any clue what is causing this problem?  My job is failing on all compute nodes with this message:

ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)

It doesn’t make sense because on all nodes except for the head node, the memory locking limit is “unlimited”:

$ ssh n004
$ ulimit -Hl -Sl
max locked memory       (kbytes, -l) unlimited
max locked memory       (kbytes, -l) unlimited

On the head node, the limit is still set at 64 kb, but I thought that it wouldn’t matter since the processes run on the compute nodes.
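
An ssh login gets its limits from the login session, which may differ from what a process started by the Slurm daemon inherits. A minimal sketch for checking the limit of a specific running process on Linux follows. The helper name memlock_of_pid is hypothetical, and the daemon process name slurmd is an assumption.

```shell
#!/usr/bin/env bash
# Print the soft "Max locked memory" limit of a running process, in bytes
# (or "unlimited"), as recorded in /proc/<pid>/limits on Linux.
memlock_of_pid() {
    awk '/^Max locked memory/ { print $4 }' "/proc/$1/limits"
}

# Limit of the current shell.
memlock_of_pid "$$"

# On a compute node, the daemon's own limit could be checked with
# (assumes the process is named slurmd):
#   memlock_of_pid "$(pgrep -o slurmd)"
# Inside a job, for example at the top of the batch script:
#   ulimit -S -l
```

If the value reported inside the job differs from the one seen over ssh, the limit is being inherited from somewhere other than the login configuration.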

I configured MPICH with ch4 and UCX 1.15.0:

$ ../mpich-4.1.2/configure --prefix=/opt/mpich --with-device=ch4:ucx --with-ucx=/home/kmccall/ucx-1.15.0/install --with-slurm -enable-debuginfo --enable-g=debug  2>&1 | tee c.txt

and my launch command is unremarkable:

mpiexec \
        -print-all-exitcodes \
        -wdir ${work_dir} \
        -np ${num_proc} \
        -ppn 1  \
        application…

The cluster is running Red Hat Enterprise Linux release 8.6 and Slurm 20.11.8, and the compiler is g++ 8.5.0.

Just for reference, here is the complete error message sequence from slurm.out:

SLURM_JOB_NODELIST =  n[004-008]
[1697913222.014184] [n004:39165:0]        ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014288] [n006:38767:0]        ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014374] [n005:39166:0]        ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014851] [n008:38524:0]        ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014807] [n007:38753:0]        ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
Abort(607769871) on node 0 (rank 0 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(66).............: MPI_Init(argc=0x7ffe9e2b475c, argv=0x7ffe9e2b4750) failed
MPII_Init_thread(257).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(770).........:
MPIR_Comm_commit_internal(558):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(252).....:
init_worker(38)...............:  ucx function returned with failed status(ucx_init.c 38 init_worker Input/output error)
Abort(339334415) on node 2 (rank 2 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(66).............: MPI_Init(argc=0x7ffd5fd79e5c, argv=0x7ffd5fd79e50) failed
MPII_Init_thread(257).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(770).........:
MPIR_Comm_commit_internal(558):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(252).....:
init_worker(38)...............:  ucx function returned with failed status(ucx_init.c 38 init_worker Input/output error)