[mpich-discuss] Mpich failure with external libfabric on macOS: OFI resource bind failed

Zhou, Hui zhouh at anl.gov
Mon May 3 17:53:35 CDT 2021


Hi Michka,

Which libfabric version are you using? Could you try setting `MPIR_CVAR_CH4_OFI_CAPABILITY_SETS_DEBUG=` to see whether there are more debug messages?
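
One way to check the version is sketched below. It assumes the `fi_info` utility that ships with libfabric is installed next to the library and reachable on PATH, which may not hold for every install; the helper name `libfabric_version` is purely illustrative.

```shell
# Print the libfabric version as reported by fi_info, if it is available.
# Assumption: fi_info from the same libfabric install is on PATH.
libfabric_version() {
  if command -v fi_info >/dev/null 2>&1; then
    fi_info --version
  else
    echo "fi_info not found on PATH"
  fi
}

libfabric_version
```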

--
Hui Zhou
________________________________
From: Michka Popoff via discuss <discuss at mpich.org>
Sent: Saturday, May 1, 2021 4:33 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Michka Popoff <michkapopoff at gmail.com>
Subject: [mpich-discuss] Mpich failure with external libfabric on macOS: OFI resource bind failed

Hi

Homebrew maintainer here (https://github.com/Homebrew).
Homebrew ships mpich as a package on both macOS and Linux.

The issue below was found with version 3.4.1 but may have been present in earlier versions.

We noticed that mpich was building its own internal libfabric dependency.
After reading https://lists.mpich.org/pipermail/discuss/2021-January/006092.html,
we added `--with-device=ch4:ofi` and set the libfabric path with the `--with-libfabric=` flag,
to use our own version.
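
For reference, the configure invocation looks roughly like the sketch below. The two flags are the ones named above; the prefix `/opt/libfabric` and the variable names are placeholders for illustration, not the actual Homebrew path.

```shell
# Placeholder prefix: substitute the real install location of the external libfabric.
LIBFABRIC_PREFIX="/opt/libfabric"

# Select the ch4:ofi device and point mpich at the external libfabric.
CONFIGURE_ARGS="--with-device=ch4:ofi --with-libfabric=${LIBFABRIC_PREFIX}"

# Print the command rather than run it, so the sketch is safe outside a source tree.
echo "./configure ${CONFIGURE_ARGS}"
```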

The build succeeds. We have a small test to check that mpich still works:

#include <mpi.h>
#include <stdio.h>
int main()
{
  int size, rank, nameLen;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Init(NULL, NULL);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(name, &nameLen);
  printf("[%d/%d] Hello, world! My name is %s.\n", rank, size, name);
  MPI_Finalize();
  return 0;
}

Executing the test fails with a weird error:

/usr/local/Cellar/mpich/3.4.1_2/bin/mpicc hello.c -o hello
./hello
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(152).......:
MPID_Init(597)..............:
MPIDI_OFI_mpi_init_hook(674):
create_vni_context(964).....: OFI resource bind failed (ofi_init.c:964:create_vni_context:No message available on STREAM)
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1615247
:
system msg for write_line failure : Bad file descriptor
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(152).......:
MPID_Init(597)..............:
MPIDI_OFI_mpi_init_hook(674):
create_vni_context(964).....: OFI resource bind failed (ofi_init.c:964:create_vni_context:No message available on STREAM)
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1615247
:
system msg for write_line failure : Bad file descriptor

This test passes on Linux and fails only on macOS. Using the internal libfabric works on both platforms.

Here is the related discussion:
https://github.com/Homebrew/homebrew-core/pull/73062

Maybe you could help us debug this issue?

Regards

Michka
