[mpich-discuss] Mpich failure with external libfabric on macOS: OFI resource bind failed

Zhou, Hui zhouh at anl.gov
Fri May 7 17:31:06 CDT 2021


Hi Michka,

Thanks for the information. That is helpful. This seems to be an issue with the latest development in libfabric: it is selecting the `udp;ofi_rxd` provider, which has never been tested with MPICH. You can try forcing the `sockets` provider by setting `MPIR_CVAR_OFI_USE_PROVIDER=sockets` or `FI_PROVIDER=sockets`. Alternatively, you can force the sockets provider at configure time with `--with-device=ch4:ofi:sockets`, or you could disable the `ofi_rxd` provider in your libfabric build. I admit none of these is optimal; the better solution is probably to fix or tune the provider selection logic in MPICH.
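For example, either environment variable can be set directly on the mpiexec command line (a minimal sketch; the rank count and the ./hello binary are just placeholders for your own test):

FI_PROVIDER=sockets mpiexec -n 2 ./hello
MPIR_CVAR_OFI_USE_PROVIDER=sockets mpiexec -n 2 ./hello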

FYI, the embedded libfabric is currently at version 1.10.1, and we are still working on upgrading it to version 1.11.2.

--
Hui Zhou


From: Michka Popoff <michkapopoff at gmail.com>
Date: Friday, May 7, 2021 at 4:37 PM
To: Zhou, Hui <zhouh at anl.gov>
Cc: discuss at mpich.org <discuss at mpich.org>
Subject: Re: [mpich-discuss] Mpich failure with external libfabric on macOS: OFI resource bind failed
The libfabric version is 1.12.1

Here is the log as asked:

==== Capability set configuration ====
libfabric provider: udp;ofi_rxd
MPIDI_OFI_ENABLE_AV_TABLE: 1
MPIDI_OFI_ENABLE_SCALABLE_ENDPOINTS: 0
MPIDI_OFI_ENABLE_SHARED_CONTEXTS: 0
MPIDI_OFI_ENABLE_MR_SCALABLE: 0
MPIDI_OFI_ENABLE_MR_VIRT_ADDRESS: 0
MPIDI_OFI_ENABLE_MR_ALLOCATED: 0
MPIDI_OFI_ENABLE_MR_PROV_KEY: 0
MPIDI_OFI_ENABLE_TAGGED: 1
MPIDI_OFI_ENABLE_AM: 1
MPIDI_OFI_ENABLE_RMA: 1
MPIDI_OFI_ENABLE_ATOMICS: 1
MPIDI_OFI_FETCH_ATOMIC_IOVECS: 1
MPIDI_OFI_ENABLE_DATA_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_CONTROL_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_PT2PT_NOPACK: 1
MPIDI_OFI_ENABLE_HMEM: 0
MPIDI_OFI_NUM_AM_BUFFERS: 8
MPIDI_OFI_CONTEXT_BITS: 20
MPIDI_OFI_SOURCE_BITS: 0
MPIDI_OFI_TAG_BITS: 31
======================================
MAXIMUM SUPPORTED RANKS: 4294967296
MAXIMUM TAG: 2147483648
======================================
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(152).......:
MPID_Init(597)..............:
MPIDI_OFI_mpi_init_hook(674):
create_vni_context(964).....: OFI resource bind failed (ofi_init.c:964:create_vni_context:No message available on STREAM)
…

Michka


On 4 May 2021, at 00:53, Zhou, Hui <zhouh at anl.gov> wrote:

Hi Michka,

Which libfabric version are you using? Could you try setting `MPIR_CVAR_CH4_OFI_CAPABILITY_SETS_DEBUG=` to see if there are more debug messages?
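For example (a minimal sketch; I am assuming that setting the CVAR to 1 enables the capability-set dump, and that ./hello is your failing test binary):

MPIR_CVAR_CH4_OFI_CAPABILITY_SETS_DEBUG=1 mpiexec -n 1 ./hello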

--
Hui Zhou
________________________________
From: Michka Popoff via discuss <discuss at mpich.org>
Sent: Saturday, May 1, 2021 4:33 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Michka Popoff <michkapopoff at gmail.com>
Subject: [mpich-discuss] Mpich failure with external libfabric on macOS: OFI resource bind failed

Hi

Homebrew maintainer here (https://github.com/Homebrew).
Homebrew ships mpich as a package on both macOS and Linux.

The issue below was found with version 3.4.1 but may have been present in earlier releases.

We noticed that mpich was building its own internal libfabric dependency.
After reading https://lists.mpich.org/pipermail/discuss/2021-January/006092.html,
we added `--with-device=ch4:ofi` and set the libfabric path with the `--with-libfabric=` flag,
so that our own libfabric build is used instead.
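For reference, the configure step looks roughly like this (a minimal sketch; the prefix and libfabric paths are illustrative Homebrew locations, not the exact ones from our formula):

./configure --prefix=/usr/local/Cellar/mpich/3.4.1_2 \
            --with-device=ch4:ofi \
            --with-libfabric=/usr/local/opt/libfabric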
The build is fine. We have a small test to check whether mpich is still working correctly:

#include <mpi.h>
#include <stdio.h>
int main()
{
  int size, rank, nameLen;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Init(NULL, NULL);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(name, &nameLen);
  printf("[%d/%d] Hello, world! My name is %s.\n", rank, size, name);
  MPI_Finalize();
  return 0;
}

Executing the test fails with a weird error:

/usr/local/Cellar/mpich/3.4.1_2/bin/mpicc hello.c -o hello
./hello
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(152).......:
MPID_Init(597)..............:
MPIDI_OFI_mpi_init_hook(674):
create_vni_context(964).....: OFI resource bind failed (ofi_init.c:964:create_vni_context:No message available on STREAM)
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1615247
:
system msg for write_line failure : Bad file descriptor
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(152).......:
MPID_Init(597)..............:
MPIDI_OFI_mpi_init_hook(674):
create_vni_context(964).....: OFI resource bind failed (ofi_init.c:964:create_vni_context:No message available on STREAM)
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1615247
:
system msg for write_line failure : Bad file descriptor

This test passes on Linux and fails only on macOS. Using the internal libfabric works fine on both platforms.

Here is the related discussion:
https://github.com/Homebrew/homebrew-core/pull/73062

Maybe you could help us debug this issue?

Regards

Michka
