[mpich-discuss] OFI poll failed error if using more than one cluster node

Stephen Wong stephen.photond at gmail.com
Mon Jul 29 05:15:39 CDT 2024


For the build configured with --with-device=ch4:ofi, I ran cpi with the debug summary enabled as suggested, e.g.
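
  MPIR_CVAR_DEBUG_SUMMARY=1 mpiexec -host host1,host2 -n 2 cpi

and got: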

==== Various sizes and limits ====
sizeof(MPIDI_per_vci_t): 192
Required minimum FI_VERSION: 0, current version: 10014
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: udp;ofi_rxd, score = 5, pref = -2, FI_SOCKADDR_IN [16] 192.168.1.5
provider: udp;ofi_rxd, score = 5, pref = -2, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: udp;ofi_rxd, score = 5, pref = -2, FI_SOCKADDR_IN [16] 127.0.0.1
provider: udp;ofi_rxd, score = 5, pref = -2, FI_SOCKADDR_IN6 [28] ::1
provider: shm, score = 4, pref = -2, FI_ADDR_STR [14] - fi_shm://4595
provider: shm, score = 4, pref = -2, FI_ADDR_STR [14] - fi_shm://4595
provider: udp, score = 0, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: udp, score = 0, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: udp, score = 0, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: udp, score = 0, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp, score = 0, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: tcp, score = 0, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: tcp, score = 0, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp, score = 0, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: sockets, score = 3, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: sockets, score = 3, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: sockets, score = 3, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: sockets, score = 3, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: sm2, score = 3, pref = 0, FI_ADDR_STR [13] - fi_sm2://4595
provider: sm2, score = 3, pref = 0, FI_ADDR_STR [13] - fi_sm2://4595
Required minimum FI_VERSION: 10005, current version: 10014
==== Capability set configuration ====
libfabric provider: sockets - 192.168.1.0/24
MPIDI_OFI_ENABLE_DATA: 1
MPIDI_OFI_ENABLE_AV_TABLE: 1
MPIDI_OFI_ENABLE_SCALABLE_ENDPOINTS: 1
MPIDI_OFI_ENABLE_SHARED_CONTEXTS: 0
MPIDI_OFI_ENABLE_MR_VIRT_ADDRESS: 0
MPIDI_OFI_ENABLE_MR_ALLOCATED: 0
MPIDI_OFI_ENABLE_MR_REGISTER_NULL: 1
MPIDI_OFI_ENABLE_MR_PROV_KEY: 0
MPIDI_OFI_ENABLE_TAGGED: 1
MPIDI_OFI_ENABLE_AM: 1
MPIDI_OFI_ENABLE_RMA: 1
MPIDI_OFI_ENABLE_ATOMICS: 1
MPIDI_OFI_FETCH_ATOMIC_IOVECS: 1
MPIDI_OFI_ENABLE_DATA_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_CONTROL_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_PT2PT_NOPACK: 1
MPIDI_OFI_ENABLE_TRIGGERED: 0
MPIDI_OFI_ENABLE_HMEM: 0
MPIDI_OFI_NUM_AM_BUFFERS: 8
MPIDI_OFI_NUM_OPTIMIZED_MEMORY_REGIONS: 0
MPIDI_OFI_CONTEXT_BITS: 20
MPIDI_OFI_SOURCE_BITS: 0
MPIDI_OFI_TAG_BITS: 31
MPIDI_OFI_VNI_USE_DOMAIN: 1
MAXIMUM SUPPORTED RANKS: 4294967296
MAXIMUM TAG: 2147483648
==== Provider global thresholds ====
max_buffered_send: 255
max_buffered_write: 255
max_msg_size: 9223372036854775807
max_order_raw: -1
max_order_war: -1
max_order_waw: -1
tx_iov_limit: 8
rx_iov_limit: 8
rma_iov_limit: 8
max_mr_key_size: 8
==== Various sizes and limits ====
MPIDI_OFI_AM_MSG_HEADER_SIZE: 24
MPIDI_OFI_MAX_AM_HDR_SIZE: 255
sizeof(MPIDI_OFI_am_request_header_t): 416
sizeof(MPIDI_OFI_per_vci_t): 52480
MPIDI_OFI_AM_HDR_POOL_CELL_SIZE: 1024
MPIDI_OFI_DEFAULT_SHORT_SEND_SIZE: 16384
==== OFI dynamic settings ====
num_vcis: 1
num_nics: 1
======================================

The run *mpiexec -host host1,host2 -n 2 cpi* then aborted with:

Abort(883025295) on node 1: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(48306).............: MPI_Init(argc=0x7ffcb8c7aedc, argv=0x7ffcb8c7aed0) failed
MPII_Init_thread(265)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(660).......:
MPIDI_OFI_init_vcis(842).........:
check_num_nics(891)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(306)...............:
MPIC_Wait(91)....................:
MPIR_Wait(780)...................:
MPIR_Wait_state(737).............:
MPIDI_progress_test(134).........:
MPIDI_OFI_handle_cq_error(791)...: OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)
Abort(883025295) on node 0: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(48306).............: MPI_Init(argc=0x7fff18beee1c, argv=0x7fff18beee10) failed
MPII_Init_thread(265)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(660).......:
MPIDI_OFI_init_vcis(842).........:
check_num_nics(891)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(306)...............:
MPIC_Wait(91)....................:
MPIR_Wait(780)...................:
MPIR_Wait_state(737).............:
MPIDI_progress_test(134).........:
MPIDI_OFI_handle_cq_error(791)...: OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)


On Fri, 26 Jul 2024 at 19:04, Zhou, Hui <zhouh at anl.gov> wrote:

> Could you run cpi with MPIR_CVAR_DEBUG_SUMMARY=1 and post the output?
>
> Hui Zhou
> ------------------------------
> *From:* Stephen Wong via discuss <discuss at mpich.org>
> *Sent:* Friday, July 26, 2024 4:46 AM
> *To:* discuss at mpich.org <discuss at mpich.org>
> *Cc:* Stephen Wong <stephen.photond at gmail.com>
> *Subject:* [mpich-discuss] OFI poll failed error if using more than one cluster node
>
> Hi,
>
> (I sent this previously without a subject line.)
>
> I am using MPICH 4.2.2 on Ubuntu 24.04, testing with the small program cpi
> that calculates the value of pi using MPI. Starting from either host, I can
> run cpi on host1 alone or on host2 alone without any problem. The problem
> occurs only if I try to use both host1 and host2 together.
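> For example, single-host runs along these lines complete fine:
>
> *mpiexec -host host1 -n 2 cpi*
> *mpiexec -host host2 -n 2 cpi*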
>
> The two-host run uses, for example, the command
> *mpiexec -host host1,host2 -n 2 cpi*
> and then it ends with the error:
>
> Abort(77718927) on node 1: Fatal error in internal_Init: Other MPI error, error stack:
> internal_Init(48306).............: MPI_Init(argc=0x7ffdb68e7fec, argv=0x7ffdb68e7fe0) failed
> MPII_Init_thread(265)............:
> MPIR_init_comm_world(34).........:
> MPIR_Comm_commit(823)............:
> MPID_Comm_commit_post_hook(222)..:
> MPIDI_world_post_init(660).......:
> MPIDI_OFI_init_vcis(842).........:
> check_num_nics(891)..............:
> MPIR_Allreduce_allcomm_auto(4726):
> MPIC_Sendrecv(306)...............:
> MPIC_Wait(91)....................:
> MPIR_Wait(780)...................:
> MPIR_Wait_state(737).............:
> MPIDI_progress_test(134).........:
> MPIDI_OFI_handle_cq_error(791)...: OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)
> Abort(77718927) on node 0: Fatal error in internal_Init: Other MPI error, error stack:
> internal_Init(48306).............: MPI_Init(argc=0x7ffcb5b28adc, argv=0x7ffcb5b28ad0) failed
> MPII_Init_thread(265)............:
> MPIR_init_comm_world(34).........:
> MPIR_Comm_commit(823)............:
> MPID_Comm_commit_post_hook(222)..:
> MPIDI_world_post_init(660).......:
> MPIDI_OFI_init_vcis(842).........:
> check_num_nics(891)..............:
> MPIR_Allreduce_allcomm_auto(4726):
> MPIC_Sendrecv(306)...............:
> MPIC_Wait(91)....................:
> MPIR_Wait(780)...................:
> MPIR_Wait_state(737).............:
> MPIDI_progress_test(134).........:
> MPIDI_OFI_handle_cq_error(791)...: OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)
>
> I searched the archive of this mailing list and found only one thread with
> this "OFI poll failed" error. That thread suggested it has something to do
> with the ch4:ofi device configuration. I checked my configure log and it
> shows
> device : ch4:ofi (embedded libfabric)
> for this build. So I am wondering whether I should switch this option to
> something else, and whether that would fix it. I am not sure what other
> option I could substitute for ch4:ofi.
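>
> The configure lines to try would presumably look like:
>
> *./configure --with-device=ch4:ofi*  (the current build, embedded libfabric)
> *./configure --with-device=ch4:ofi --with-libfabric=/path/to/libfabric*  (an external libfabric; the path is a placeholder)
> *./configure --with-device=ch4:ucx*  (the UCX netmod)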
>
> *****************************************************
>
> Next I tried a build configured with the --with-device=ch3:nemesis option.
> Again I can run cpi on either host1 or host2 alone, but if I run it on
> host1 and host2 together, it just crashes with a core dump.
>
> Using the --with-device=ch3:sock configure option results in more or less
> the same problem, except now it just quits silently when running on host1
> and host2 together.
>
> Any ideas?
> Thanks!
> Stephen.
>
>