[mpich-discuss] Mpich failure with external libfabric on macOS: OFI resource bind failed

Michka Popoff michkapopoff at gmail.com
Fri May 7 16:36:57 CDT 2021


The libfabric version is 1.12.1

Here is the log you asked for:

==== Capability set configuration ====
libfabric provider: udp;ofi_rxd
MPIDI_OFI_ENABLE_AV_TABLE: 1
MPIDI_OFI_ENABLE_SCALABLE_ENDPOINTS: 0
MPIDI_OFI_ENABLE_SHARED_CONTEXTS: 0
MPIDI_OFI_ENABLE_MR_SCALABLE: 0
MPIDI_OFI_ENABLE_MR_VIRT_ADDRESS: 0
MPIDI_OFI_ENABLE_MR_ALLOCATED: 0
MPIDI_OFI_ENABLE_MR_PROV_KEY: 0
MPIDI_OFI_ENABLE_TAGGED: 1
MPIDI_OFI_ENABLE_AM: 1
MPIDI_OFI_ENABLE_RMA: 1
MPIDI_OFI_ENABLE_ATOMICS: 1
MPIDI_OFI_FETCH_ATOMIC_IOVECS: 1
MPIDI_OFI_ENABLE_DATA_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_CONTROL_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_PT2PT_NOPACK: 1
MPIDI_OFI_ENABLE_HMEM: 0
MPIDI_OFI_NUM_AM_BUFFERS: 8
MPIDI_OFI_CONTEXT_BITS: 20
MPIDI_OFI_SOURCE_BITS: 0
MPIDI_OFI_TAG_BITS: 31
======================================
MAXIMUM SUPPORTED RANKS: 4294967296
MAXIMUM TAG: 2147483648
======================================
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(152).......:
MPID_Init(597)..............:
MPIDI_OFI_mpi_init_hook(674):
create_vni_context(964).....: OFI resource bind failed (ofi_init.c:964:create_vni_context:No message available on STREAM)
…

Michka

> On 4 May 2021, at 00:53, Zhou, Hui <zhouh at anl.gov> wrote:
> 
> Hi Michka,
> 
> Which libfabric version are you using? Could you try setting `MPIR_CVAR_CH4_OFI_CAPABILITY_SETS_DEBUG=` to see if there are more debug messages?
> 
> -- 
> Hui Zhou
> From: Michka Popoff via discuss <discuss at mpich.org>
> Sent: Saturday, May 1, 2021 4:33 PM
> To: discuss at mpich.org <discuss at mpich.org>
> Cc: Michka Popoff <michkapopoff at gmail.com>
> Subject: [mpich-discuss] Mpich failure with external libfabric on macOS: OFI resource bind failed
>  
> Hi
> 
> Homebrew maintainer here (https://github.com/Homebrew).
> Homebrew ships mpich as a package on both macOS and Linux.
> 
> The issue below was found with version 3.4.1 but might have been there for longer.
> 
> We noticed that mpich was building its own internal libfabric dependency.
> After reading https://lists.mpich.org/pipermail/discuss/2021-January/006092.html,
> we added `--with-device=ch4:ofi` and set the libfabric path with the `--with-libfabric=` flag
> to use our own version.
> 
> The build is fine. We have a small test to check that mpich still works:
> 
> #include <mpi.h>
> #include <stdio.h>
> int main()
> {
>   int size, rank, nameLen;
>   char name[MPI_MAX_PROCESSOR_NAME];
>   MPI_Init(NULL, NULL);
>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   MPI_Get_processor_name(name, &nameLen);
>   printf("[%d/%d] Hello, world! My name is %s.\n", rank, size, name);
>   MPI_Finalize();
>   return 0;
> }
> 
> Executing the test fails with a weird error:
> 
> /usr/local/Cellar/mpich/3.4.1_2/bin/mpicc hello.c -o hello
> ./hello
> Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(152).......: 
> MPID_Init(597)..............: 
> MPIDI_OFI_mpi_init_hook(674): 
> create_vni_context(964).....: OFI resource bind failed (ofi_init.c:964:create_vni_context:No message available on STREAM)
> [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1615247
> :
> system msg for write_line failure : Bad file descriptor
> Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(152).......: 
> MPID_Init(597)..............: 
> MPIDI_OFI_mpi_init_hook(674): 
> create_vni_context(964).....: OFI resource bind failed (ofi_init.c:964:create_vni_context:No message available on STREAM)
> [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1615247
> :
> system msg for write_line failure : Bad file descriptor
> 
> This test passes on Linux and fails only on macOS. Using the internal libfabric works on both platforms.
> 
> Here is the related discussion:
> https://github.com/Homebrew/homebrew-core/pull/73062
> 
> Maybe you could help us debug this issue?
> 
> Regards
> 
> Michka
