[mpich-discuss] Can't allocate comm memory for ch4, UCX

Mccall, Kurt E. (MSFC-EV41) kurt.e.mccall at nasa.gov
Sat Oct 21 13:58:59 CDT 2023


Any clue what is causing this problem?  My job is failing on all compute nodes with this message:

ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)

This doesn't make sense to me, because on every node except the head node the memory locking limit is "unlimited":

$ ssh n004
$ ulimit -Hl -Sl
max locked memory       (kbytes, -l) unlimited
max locked memory       (kbytes, -l) unlimited

On the head node, the limit is still set to 64 kB, but I thought that wouldn't matter since the processes run on the compute nodes.

I configured MPICH with ch4 and UCX 1.15.0:

$ ../mpich-4.1.2/configure --prefix=/opt/mpich --with-device=ch4:ucx --with-ucx=/home/kmccall/ucx-1.15.0/install --with-slurm -enable-debuginfo --enable-g=debug  2>&1 | tee c.txt

and my launch command is unremarkable:

mpiexec \
        -print-all-exitcodes \
        -wdir ${work_dir} \
        -np ${num_proc} \
        -ppn 1  \
        application...
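
One way to check which limit the launched processes actually see would be to run a tiny script in place of the application. This is only a sketch: the function name report_memlock and the file name memlock.sh are made up for illustration, and it assumes bash is available on the compute nodes. Resource limits are inherited from whatever starts a process, so the values reported here can differ from what an interactive ssh login shows.

```shell
#!/bin/bash
# report_memlock (hypothetical helper): print the node name and the
# soft and hard locked-memory limits of the current process, on one line.
report_memlock() {
    printf '%s soft=%s hard=%s\n' "$(uname -n)" "$(ulimit -S -l)" "$(ulimit -H -l)"
}

report_memlock
```

Saved as memlock.sh and made executable, it could be launched with the same options as the real job, for example "mpiexec -np ${num_proc} -ppn 1 ./memlock.sh", so that each node prints the limits its rank inherited.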

The cluster is running Red Hat Enterprise Linux release 8.6 and Slurm 20.11.8, and the compiler is g++ 8.5.0.

Just for reference, here is the complete error message sequence from slurm.out:

SLURM_JOB_NODELIST =  n[004-008]
[1697913222.014184] [n004:39165:0]        ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014288] [n006:38767:0]        ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014374] [n005:39166:0]        ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014851] [n008:38524:0]        ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
[1697913222.014807] [n007:38753:0]        ib_iface.c:1060 UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory : Please set max locked memory (ulimit -l) to 'unlimited' (current: 64 kbytes)
Abort(607769871) on node 0 (rank 0 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(66).............: MPI_Init(argc=0x7ffe9e2b475c, argv=0x7ffe9e2b4750) failed
MPII_Init_thread(257).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(770).........:
MPIR_Comm_commit_internal(558):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(252).....:
init_worker(38)...............:  ucx function returned with failed status(ucx_init.c 38 init_worker Input/output error)
Abort(339334415) on node 2 (rank 2 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(66).............: MPI_Init(argc=0x7ffd5fd79e5c, argv=0x7ffd5fd79e50) failed
MPII_Init_thread(257).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(770).........:
MPIR_Comm_commit_internal(558):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(252).....:
init_worker(38)...............:  ucx function returned with failed status(ucx_init.c 38 init_worker Input/output error)