[mpich-discuss] Mpich failure with external libfabric on macOS: OFI resource bind failed
Michka Popoff
michkapopoff at gmail.com
Fri May 7 16:36:57 CDT 2021
The libfabric version is 1.12.1
Here is the log as asked:
==== Capability set configuration ====
libfabric provider: udp;ofi_rxd
MPIDI_OFI_ENABLE_AV_TABLE: 1
MPIDI_OFI_ENABLE_SCALABLE_ENDPOINTS: 0
MPIDI_OFI_ENABLE_SHARED_CONTEXTS: 0
MPIDI_OFI_ENABLE_MR_SCALABLE: 0
MPIDI_OFI_ENABLE_MR_VIRT_ADDRESS: 0
MPIDI_OFI_ENABLE_MR_ALLOCATED: 0
MPIDI_OFI_ENABLE_MR_PROV_KEY: 0
MPIDI_OFI_ENABLE_TAGGED: 1
MPIDI_OFI_ENABLE_AM: 1
MPIDI_OFI_ENABLE_RMA: 1
MPIDI_OFI_ENABLE_ATOMICS: 1
MPIDI_OFI_FETCH_ATOMIC_IOVECS: 1
MPIDI_OFI_ENABLE_DATA_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_CONTROL_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_PT2PT_NOPACK: 1
MPIDI_OFI_ENABLE_HMEM: 0
MPIDI_OFI_NUM_AM_BUFFERS: 8
MPIDI_OFI_CONTEXT_BITS: 20
MPIDI_OFI_SOURCE_BITS: 0
MPIDI_OFI_TAG_BITS: 31
======================================
MAXIMUM SUPPORTED RANKS: 4294967296
MAXIMUM TAG: 2147483648
======================================
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(152).......:
MPID_Init(597)..............:
MPIDI_OFI_mpi_init_hook(674):
create_vni_context(964).....: OFI resource bind failed (ofi_init.c:964:create_vni_context:No message available on STREAM)
…
Michka
> On 4 May 2021, at 00:53, Zhou, Hui <zhouh at anl.gov> wrote:
>
> Hi Michka,
>
> Which libfabric version are you using? Could you try setting `MPIR_CVAR_CH4_OFI_CAPABILITY_SETS_DEBUG=` to see if there are more debug messages?
>
> --
> Hui Zhou
> From: Michka Popoff via discuss <discuss at mpich.org>
> Sent: Saturday, May 1, 2021 4:33 PM
> To: discuss at mpich.org <discuss at mpich.org>
> Cc: Michka Popoff <michkapopoff at gmail.com>
> Subject: [mpich-discuss] Mpich failure with external libfabric on macOS: OFI resource bind failed
>
> Hi
>
> Homebrew maintainer here (https://github.com/Homebrew).
> Homebrew ships mpich as a package on both macOS and Linux.
>
> The issue below was found with version 3.4.1 but might have been there for longer.
>
> We noticed that mpich was building its own internal libfabric dependency.
> After reading https://lists.mpich.org/pipermail/discuss/2021-January/006092.html,
> we added `--with-device=ch4:ofi` and set the libfabric path with the `--with-libfabric=` flag
> to use our own version.
>
> The build is fine. We have a small test to check that mpich still works:
>
> #include <mpi.h>
> #include <stdio.h>
> int main()
> {
>     int size, rank, nameLen;
>     char name[MPI_MAX_PROCESSOR_NAME];
>     /* Start MPI; this is the call that aborts in the failing run. */
>     MPI_Init(NULL, NULL);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Get_processor_name(name, &nameLen);
>     printf("[%d/%d] Hello, world! My name is %s.\n", rank, size, name);
>     MPI_Finalize();
>     return 0;
> }
>
> Executing the test fails with the following error:
>
> /usr/local/Cellar/mpich/3.4.1_2/bin/mpicc hello.c -o hello
> ./hello
> Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(152).......:
> MPID_Init(597)..............:
> MPIDI_OFI_mpi_init_hook(674):
> create_vni_context(964).....: OFI resource bind failed (ofi_init.c:964:create_vni_context:No message available on STREAM)
> [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1615247
> :
> system msg for write_line failure : Bad file descriptor
> Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(152).......:
> MPID_Init(597)..............:
> MPIDI_OFI_mpi_init_hook(674):
> create_vni_context(964).....: OFI resource bind failed (ofi_init.c:964:create_vni_context:No message available on STREAM)
> [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1615247
> :
> system msg for write_line failure : Bad file descriptor
>
> This test passes on Linux and fails only on macOS. With the internal libfabric, it passes on both platforms.
>
> Here is the related discussion:
> https://github.com/Homebrew/homebrew-core/pull/73062
>
> Maybe you could help us debug this issue?
>
> Regards
>
> Michka