[mpich-discuss] Can't allocate comm memory for ch4, UCX
Raffenetti, Ken
raffenet at anl.gov
Tue Oct 24 09:37:46 CDT 2023
Hi Kurt,
A little searching suggests that the 64 kB limit you see in the log is due to the Slurm daemons starting before the increased limit is set, so the processes they launch inherit the old value.
https://www.mail-archive.com/slurm-dev@lists.llnl.gov/msg00960.html
It seems you can remedy this by adding lines to the Slurm startup script to increase the locked memory limit before starting the daemons. Can you give it a try?
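One way to check this before editing any startup script is to read the limit the running daemon actually has. The following is only a sketch, assuming Linux with /proc mounted, a daemon process named slurmd, and pgrep being available; the helper name proc_memlock is made up for illustration.

```shell
# Print the "Max locked memory" row of /proc/<pid>/limits for a given PID.
proc_memlock() {
    grep 'Max locked memory' "/proc/$1/limits"
}

# Sanity check against the current shell.
proc_memlock "$$"

# On a compute node, inspect the Slurm daemon (assumes it is named slurmd):
#   proc_memlock "$(pgrep -o slurmd)"
# If that row shows 65536 bytes (64 kbytes), the limit has to be raised
# before the daemon starts, for example with
#   ulimit -l unlimited        # in an init-style startup script
#   LimitMEMLOCK=infinity      # in the [Service] section of a systemd unit
# followed by a restart of the daemon.
```

Which of the two commented settings applies depends on how the daemon is started on your cluster, which I can't tell from the log.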
Ken
From: "Mccall, Kurt E. (MSFC-EV41) via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Saturday, October 21, 2023 at 1:59 PM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov>
Subject: [mpich-discuss] Can't allocate comm memory for ch4, UCX
Any clue what is causing this problem? My job is failing on all compute nodes with this message:
ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
It doesn’t make sense because on all nodes except for the head node, the memory locking limit is “unlimited”:
$ ssh n004
$ ulimit -Hl -Sl
max locked memory (kbytes, -l) unlimited
max locked memory (kbytes, -l) unlimited
On the head node, the limit is still set at 64 kb, but I thought that it wouldn’t matter since the processes run on the compute nodes.
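An ssh login shell and a task started by the launcher do not necessarily see the same limits, so the ssh check above may not reflect what the MPI processes get. A minimal sketch for printing the limit from inside whatever process runs it; the function name report_memlock is hypothetical, and the srun and mpiexec lines are left as comments because they assume a working allocation:

```shell
# Print this process's soft and hard locked-memory limits, tagged with the node name.
report_memlock() {
    echo "$(uname -n): soft=$(ulimit -S -l) hard=$(ulimit -H -l)"
}

report_memlock

# To see what a launched task gets (rather than an ssh login shell), run the
# same kind of check through the launcher, for example:
#   srun -N 1 -n 1 sh -c 'ulimit -l'
#   mpiexec -np 1 sh -c 'ulimit -l'
```

If the value printed through the launcher differs from the one printed over ssh, the limit is being set somewhere along the launch path rather than in the node's login configuration.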
I configured MPICH with ch4 and UCX 1.15.0:
$ ../mpich-4.1.2/configure --prefix=/opt/mpich --with-device=ch4:ucx --with-ucx=/home/kmccall/ucx-1.15.0/install --with-slurm -enable-debuginfo --enable-g=debug 2>&1 | tee c.txt
and my launch command is unremarkable:
mpiexec \
-print-all-exitcodes \
-wdir ${work_dir} \
-np ${num_proc} \
-ppn 1 \
application…
The cluster is running Red Hat Enterprise Linux release 8.6, slurm 20.11.8, and the compiler version is g++ 8.5.0.
Just for reference, here is the complete error message sequence from slurm.out:
SLURM_JOB_NODELIST = n[004-008]
[1697913222.014184] [n004:39165:0] ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014288] [n006:38767:0] ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014374] [n005:39166:0] ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014851] [n008:38524:0] ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014807] [n007:38753:0] ib_iface.c:1060 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
Abort(607769871) on node 0 (rank 0 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(66).............: MPI_Init(argc=0x7ffe9e2b475c, argv=0x7ffe9e2b4750) failed
MPII_Init_thread(257).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(770).........:
MPIR_Comm_commit_internal(558):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(252).....:
init_worker(38)...............: ucx function returned with failed status(ucx_init.c 38 init_worker Input/output error)
Abort(339334415) on node 2 (rank 2 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(66).............: MPI_Init(argc=0x7ffd5fd79e5c, argv=0x7ffd5fd79e50) failed
MPII_Init_thread(257).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(770).........:
MPIR_Comm_commit_internal(558):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(252).....:
init_worker(38)...............: ucx function returned with failed status(ucx_init.c 38 init_worker Input/output error)