Hi Kurt,

This looks like a problem allocating resources on the InfiniBand device on the node. MPI_Init should not require any special system settings. Are you able to run the InfiniBand diagnostics without any MPI library? ibstatus should tell you whether the IB card is online and what state it is in. From there, you could try running an ib_send_bw test across two nodes to verify that traffic is flowing.
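
If ibstatus is not installed on the compute node, the port state can also be read straight from sysfs. The following is only a rough sketch, assuming the usual Linux layout /sys/class/infiniband/<device>/ports/<port>/state; the function name ib_port_states is made up for illustration and is not part of any InfiniBand or MPICH tool.

```python
from pathlib import Path


def ib_port_states(root="/sys/class/infiniband"):
    """Return {(device, port): state} read from sysfs.

    Assumes the layout <root>/<device>/ports/<port>/state, where each
    state file holds a short string describing the port state.
    """
    states = {}
    base = Path(root)
    if not base.is_dir():
        # No InfiniBand sysfs tree: driver not loaded or no device present.
        return states
    for state_file in sorted(base.glob("*/ports/*/state")):
        port = state_file.parent.name
        device = state_file.parent.parent.parent.name
        try:
            states[(device, port)] = state_file.read_text().strip()
        except OSError as exc:
            states[(device, port)] = f"unreadable: {exc}"
    return states


if __name__ == "__main__":
    found = ib_port_states()
    if not found:
        print("No InfiniBand devices found under /sys/class/infiniband")
    for (device, port), state in found.items():
        print(f"{device} port {port}: {state}")
```

A port that is not reported as ACTIVE would point at the fabric or the driver rather than at MPICH.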


I have configured MPICH 4.1.2 with both --with-device=ch4:ofi and --with-device=ch4:ucx. My application fails in both cases when it can't allocate enough memory. For --with-device=ch4:ofi:

Unable to create send CQ of size 5080 on mlx5_0: Cannot allocate memory
n001.cluster.pssclabs.com:rank0.NeedlesMpiMM: Unable to initialize verbs NIC /sys/class/infiniband/mlx5_0 (unit 0:0)
n001.cluster.pssclabs.com:rank0: PSM3 can't open nic unit: 0 (err=23)
Abort(606197135): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(66)........: MPI_Init(argc=0x7ffc1cbd334c, argv=0x7ffc1cbd3340) failed
create_vni_context(982)..: OFI endpoint open failed (ofi_init.c:982:create_vni_context:Cannot allocate memory)

Configuring with --with-device=ch4:ucx, there was a very similar error involving /sys/class/infiniband/mlx5_0 that explicitly stated that the locked memory limit (ulimit -l) needs to be set to "unlimited". Are there any other ch4 device configuration options that don't require unlimited locked memory?
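
For reference, the locked memory limit a process actually inherits can be checked with a short sketch like the one below. It assumes a Linux node with Python 3 available; the helper name memlock_limit is made up for illustration. It uses the standard resource module, which reports RLIMIT_MEMLOCK in bytes, whereas ulimit -l prints KiB.

```python
import resource


def memlock_limit():
    """Return the (soft, hard) locked-memory limits as readable strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def fmt(value):
        # RLIM_INFINITY is what "ulimit -l unlimited" corresponds to.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value // 1024} KiB"

    return fmt(soft), fmt(hard)


if __name__ == "__main__":
    soft, hard = memlock_limit()
    print(f"locked memory limit: soft={soft} hard={hard}")
```

Running it through the same launcher as the application (for example under mpiexec on n001) may be more informative than running it in a login shell, since the two can see different limits.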

