[mpich-discuss] OFI poll failed error if using more than one cluster node

Zhou, Hui zhouh at anl.gov
Mon Aug 5 09:37:55 CDT 2024


Hi Stephen,

No, the PMI choice should not affect the init error you are seeing.

Sorry for the lack of responses. I didn't find any clues in the debug log.

Do you see the issue when running on the local node, or only when launching on remote nodes? What kind of systems do you have?

--
Hui
________________________________
From: Stephen Wong <stephen.photond at gmail.com>
Sent: Monday, August 5, 2024 4:41 AM
To: Zhou, Hui <zhouh at anl.gov>
Cc: discuss at mpich.org <discuss at mpich.org>
Subject: Re: [mpich-discuss] OFI poll failed error if using more than one cluster node

Since there has been no update after I posted the debug messages, I have received a new suggestion from someone.
I was told that my MPICH was built with PMIx support only, whereas mpiexec.hydra provides only a PMI server.
Could that be the source of the problem?

On Mon, 29 Jul 2024 at 11:15, Stephen Wong <stephen.photond at gmail.com> wrote:
For the build configured with --with-device=ch4:ofi, I got

==== Various sizes and limits ====
sizeof(MPIDI_per_vci_t): 192
Required minimum FI_VERSION: 0, current version: 10014
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: udp;ofi_rxd, score = 5, pref = -2, FI_SOCKADDR_IN [16] 192.168.1.5
provider: udp;ofi_rxd, score = 5, pref = -2, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: udp;ofi_rxd, score = 5, pref = -2, FI_SOCKADDR_IN [16] 127.0.0.1
provider: udp;ofi_rxd, score = 5, pref = -2, FI_SOCKADDR_IN6 [28] ::1
provider: shm, score = 4, pref = -2, FI_ADDR_STR [14] - fi_shm://4595
provider: shm, score = 4, pref = -2, FI_ADDR_STR [14] - fi_shm://4595
provider: udp, score = 0, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: udp, score = 0, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: udp, score = 0, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: udp, score = 0, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp, score = 0, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: tcp, score = 0, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: tcp, score = 0, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp, score = 0, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: sockets, score = 3, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: sockets, score = 3, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: sockets, score = 3, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: sockets, score = 3, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN6 [28] fe80::1ff:fe23:4567:890a
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: sm2, score = 3, pref = 0, FI_ADDR_STR [13] - fi_sm2://4595
provider: sm2, score = 3, pref = 0, FI_ADDR_STR [13] - fi_sm2://4595
Required minimum FI_VERSION: 10005, current version: 10014
==== Capability set configuration ====
libfabric provider: sockets - 192.168.1.0/24
MPIDI_OFI_ENABLE_DATA: 1
MPIDI_OFI_ENABLE_AV_TABLE: 1
MPIDI_OFI_ENABLE_SCALABLE_ENDPOINTS: 1
MPIDI_OFI_ENABLE_SHARED_CONTEXTS: 0
MPIDI_OFI_ENABLE_MR_VIRT_ADDRESS: 0
MPIDI_OFI_ENABLE_MR_ALLOCATED: 0
MPIDI_OFI_ENABLE_MR_REGISTER_NULL: 1
MPIDI_OFI_ENABLE_MR_PROV_KEY: 0
MPIDI_OFI_ENABLE_TAGGED: 1
MPIDI_OFI_ENABLE_AM: 1
MPIDI_OFI_ENABLE_RMA: 1
MPIDI_OFI_ENABLE_ATOMICS: 1
MPIDI_OFI_FETCH_ATOMIC_IOVECS: 1
MPIDI_OFI_ENABLE_DATA_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_CONTROL_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_PT2PT_NOPACK: 1
MPIDI_OFI_ENABLE_TRIGGERED: 0
MPIDI_OFI_ENABLE_HMEM: 0
MPIDI_OFI_NUM_AM_BUFFERS: 8
MPIDI_OFI_NUM_OPTIMIZED_MEMORY_REGIONS: 0
MPIDI_OFI_CONTEXT_BITS: 20
MPIDI_OFI_SOURCE_BITS: 0
MPIDI_OFI_TAG_BITS: 31
MPIDI_OFI_VNI_USE_DOMAIN: 1
MAXIMUM SUPPORTED RANKS: 4294967296
MAXIMUM TAG: 2147483648
==== Provider global thresholds ====
max_buffered_send: 255
max_buffered_write: 255
max_msg_size: 9223372036854775807
max_order_raw: -1
max_order_war: -1
max_order_waw: -1
tx_iov_limit: 8
rx_iov_limit: 8
rma_iov_limit: 8
max_mr_key_size: 8
==== Various sizes and limits ====
MPIDI_OFI_AM_MSG_HEADER_SIZE: 24
MPIDI_OFI_MAX_AM_HDR_SIZE: 255
sizeof(MPIDI_OFI_am_request_header_t): 416
sizeof(MPIDI_OFI_per_vci_t): 52480
MPIDI_OFI_AM_HDR_POOL_CELL_SIZE: 1024
MPIDI_OFI_DEFAULT_SHORT_SEND_SIZE: 16384
==== OFI dynamic settings ====
num_vcis: 1
num_nics: 1
======================================
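
The summary above lists many candidate providers before naming the one that was selected ("libfabric provider: sockets"). As a reading aid, here is a minimal Python sketch that pulls the candidates and the selected provider out of such a log. The regular expressions and the names `summarize`, `PROVIDER_RE`, and `SELECTED_RE` are assumptions derived only from the lines shown in this message; they do not describe a documented MPICH output format.

```python
import re
from collections import Counter

# A few lines copied from the debug summary above (abbreviated sample).
SAMPLE = """\
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: udp;ofi_rxd, score = 5, pref = -2, FI_SOCKADDR_IN [16] 127.0.0.1
provider: shm, score = 4, pref = -2, FI_ADDR_STR [14] - fi_shm://4595
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.5
libfabric provider: sockets - 192.168.1.0/24
"""

# Assumed shape of a candidate line, inferred from the log in this thread.
PROVIDER_RE = re.compile(
    r"^provider: (?P<name>[^,]+), score = (?P<score>-?\d+), "
    r"pref = (?P<pref>-?\d+), (?P<fmt>\S+) \[\d+\] (?:- )?(?P<addr>\S+)$"
)
# Assumed shape of the line that names the provider actually chosen.
SELECTED_RE = re.compile(r"^libfabric provider: (?P<name>\S+) - (?P<net>\S+)$")


def summarize(text):
    """Return (candidates, selected) parsed from debug-summary text.

    candidates is a list of (name, score, pref, address) tuples;
    selected is (name, network) or None if no such line was found.
    """
    candidates = []
    selected = None
    for raw in text.splitlines():
        line = raw.strip()
        m = PROVIDER_RE.match(line)
        if m:
            candidates.append(
                (m["name"], int(m["score"]), int(m["pref"]), m["addr"])
            )
            continue
        m = SELECTED_RE.match(line)
        if m:
            selected = (m["name"], m["net"])
    return candidates, selected


if __name__ == "__main__":
    candidates, selected = summarize(SAMPLE)
    # Count how many candidate entries each provider contributed.
    for name, count in Counter(c[0] for c in candidates).items():
        print(f"{name}: {count} candidate entr{'y' if count == 1 else 'ies'}")
    print("selected:", selected)
```

Running the same function on the summaries from both hosts would make it easier to compare which provider and network each rank selected.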

mpiexec -host host1,host2 -n 2 cpi

Abort(883025295) on node 1: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(48306).............: MPI_Init(argc=0x7ffcb8c7aedc, argv=0x7ffcb8c7aed0) failed
MPII_Init_thread(265)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(660).......:
MPIDI_OFI_init_vcis(842).........:
check_num_nics(891)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(306)...............:
MPIC_Wait(91)....................:
MPIR_Wait(780)...................:
MPIR_Wait_state(737).............:
MPIDI_progress_test(134).........:
MPIDI_OFI_handle_cq_error(791)...: OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)
Abort(883025295) on node 0: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(48306).............: MPI_Init(argc=0x7fff18beee1c, argv=0x7fff18beee10) failed
MPII_Init_thread(265)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(660).......:
MPIDI_OFI_init_vcis(842).........:
check_num_nics(891)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(306)...............:
MPIC_Wait(91)....................:
MPIR_Wait(780)...................:
MPIR_Wait_state(737).............:
MPIDI_progress_test(134).........:
MPIDI_OFI_handle_cq_error(791)...: OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)


On Fri, 26 Jul 2024 at 19:04, Zhou, Hui <zhouh at anl.gov> wrote:
Could you run cpi with MPIR_CVAR_DEBUG_SUMMARY=1 and post the output?

Hui Zhou
________________________________
From: Stephen Wong via discuss <discuss at mpich.org>
Sent: Friday, July 26, 2024 4:46 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Stephen Wong <stephen.photond at gmail.com>
Subject: [mpich-discuss] OFI poll failed error if using more than one cluster node

Hi,

(I sent this previously without a subject line.)

I am using MPICH 4.2.2 on Ubuntu 24.04 testing with the small program cpi that calculates the value of pi using MPI. I can start on host1 to run cpi on either host1 or host2 alone and I can start on host2 to run cpi on either host2 or host1 alone. The problem occurs only if I try to use both host1 and host2 together.


For example, running the command
mpiexec -host host1,host2 -n 2 cpi
ends with the error

Abort(77718927) on node 1: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(48306).............: MPI_Init(argc=0x7ffdb68e7fec, argv=0x7ffdb68e7fe0) failed
MPII_Init_thread(265)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(660).......:
MPIDI_OFI_init_vcis(842).........:
check_num_nics(891)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(306)...............:
MPIC_Wait(91)....................:
MPIR_Wait(780)...................:
MPIR_Wait_state(737).............:
MPIDI_progress_test(134).........:
MPIDI_OFI_handle_cq_error(791)...: OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)
Abort(77718927) on node 0: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(48306).............: MPI_Init(argc=0x7ffcb5b28adc, argv=0x7ffcb5b28ad0) failed
MPII_Init_thread(265)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(660).......:
MPIDI_OFI_init_vcis(842).........:
check_num_nics(891)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(306)...............:
MPIC_Wait(91)....................:
MPIR_Wait(780)...................:
MPIR_Wait_state(737).............:
MPIDI_progress_test(134).........:
MPIDI_OFI_handle_cq_error(791)...: OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)

I searched the archive of this mailing list and found only one thread that mentions this "OFI poll failed" error.
That thread suggested the error has something to do with the ch4:ofi device configuration.
I checked my configure log and it has
device : ch4:ofi (embedded libfabric)
in the configuration from when I built MPICH. So I am wondering whether switching this option to something else would fix it. I am not sure what other option I could substitute for ch4:ofi.

*****************************************************

Next I tried running configure for a build with the --enable-device = ch3:nemesis option.
Again I can run cpi on either host1 or host2 alone. If I run it on host1 and host2 together, it crashes with a core dump.

Using the --enable-device = ch3:sock configure option resulted in more or less the same problem, except that it now quits silently when running on host1 and host2 together.

Any ideas?
Thanks!
Stephen.
