[mpich-discuss] OFI poll failed error if using more than one cluster node

Zhou, Hui zhouh at anl.gov
Fri Jul 26 13:04:11 CDT 2024


Could you run cpi with MPIR_CVAR_DEBUG_SUMMARY=1 and post the output?
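For example (a sketch based on the command in the original report; the host names and binary path are from that report and may need adjusting for your setup):

```shell
# Set the MPICH debug-summary CVAR in the environment of the launched
# processes and rerun the failing two-node case; the extra output shows
# details of the netmod/provider selection during init.
MPIR_CVAR_DEBUG_SUMMARY=1 mpiexec -host host1,host2 -n 2 ./cpi
```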

Hui Zhou
________________________________
From: Stephen Wong via discuss <discuss at mpich.org>
Sent: Friday, July 26, 2024 4:46 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Stephen Wong <stephen.photond at gmail.com>
Subject: [mpich-discuss] OFI poll failed error if using more than one cluster node

Hi,

(I sent this previously without a subject line.)

I am using MPICH 4.2.2 on Ubuntu 24.04, testing with the small program cpi that calculates the value of pi using MPI. I can launch from host1 and run cpi on either host1 or host2 alone, and likewise launch from host2 and run it on either host alone. The problem occurs only when I try to use host1 and host2 together.


This is done using, for example, the command
mpiexec -host host1,host2 -n 2 cpi
then it ends with the error

Abort(77718927) on node 1: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(48306).............: MPI_Init(argc=0x7ffdb68e7fec, argv=0x7ffdb68e7fe0) failed
MPII_Init_thread(265)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(660).......:
MPIDI_OFI_init_vcis(842).........:
check_num_nics(891)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(306)...............:
MPIC_Wait(91)....................:
MPIR_Wait(780)...................:
MPIR_Wait_state(737).............:
MPIDI_progress_test(134).........:
MPIDI_OFI_handle_cq_error(791)...: OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)
Abort(77718927) on node 0: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(48306).............: MPI_Init(argc=0x7ffcb5b28adc, argv=0x7ffcb5b28ad0) failed
MPII_Init_thread(265)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(660).......:
MPIDI_OFI_init_vcis(842).........:
check_num_nics(891)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(306)...............:
MPIC_Wait(91)....................:
MPIR_Wait(780)...................:
MPIR_Wait_state(737).............:
MPIDI_progress_test(134).........:
MPIDI_OFI_handle_cq_error(791)...: OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)

I searched the archive of this mailing list and found only one other thread with this "OFI poll failed" error.
That thread suggested the problem has something to do with the ch4:ofi device configuration.
I checked my configure log and it has
device : ch4:ofi (embedded libfabric)
in the configuration when I built MPICH. So I am wondering whether I should switch this option to something else, and whether that would fix the problem. I am not sure what other option I could substitute for ch4:ofi.
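For reference, the device is chosen at configure time with --with-device (a sketch; the ch4:ucx alternative assumes a UCX installation is available on the system):

```shell
# Default device: ch4 over libfabric (OFI), here with the embedded copy.
./configure --with-device=ch4:ofi

# Alternative: ch4 over UCX instead of libfabric.
./configure --with-device=ch4:ucx
```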

*****************************************************

Next I tried configuring a build with the --with-device=ch3:nemesis option.
Again I can run cpi on either host1 or host2 alone, but if I run it on host1 and host2 together it crashes with a core dump.

Using the --with-device=ch3:sock configure option resulted in more or less the same problem, except that now it quits silently when running on host1 and host2 together.
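One diagnostic step worth trying (this assumes libfabric's standalone fi_info utility is available on the hosts; it ships with libfabric, though not necessarily with MPICH's embedded copy):

```shell
# List the libfabric providers visible on this host; run it on both
# host1 and host2 and compare the lists -- a provider or NIC mismatch
# between hosts is a common cause of cross-node OFI failures.
fi_info -l
```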

Any ideas?
Thanks!
Stephen.
