[mpich-discuss] OFI poll failed error if using more than one cluster node

Stephen Wong stephen.photond at gmail.com
Fri Jul 26 04:46:16 CDT 2024


Hi,

(I sent this previously without a subject line.)

I am using MPICH 4.2.2 on Ubuntu 24.04, testing with the small program cpi,
which calculates the value of pi using MPI. From host1 I can run cpi on
either host1 or host2 alone, and from host2 I can run cpi on either host2
or host1 alone. The problem occurs only when I try to use host1 and host2
together.

For example, the command

    mpiexec -host host1,host2 -n 2 cpi

ends with the error

Abort(77718927) on node 1: Fatal error in internal_Init: Other MPI error,
error stack:
internal_Init(48306).............: MPI_Init(argc=0x7ffdb68e7fec,
argv=0x7ffdb68e7fe0) failed
MPII_Init_thread(265)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(660).......:
MPIDI_OFI_init_vcis(842).........:
check_num_nics(891)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(306)...............:
MPIC_Wait(91)....................:
MPIR_Wait(780)...................:
MPIR_Wait_state(737).............:
MPIDI_progress_test(134).........:
MPIDI_OFI_handle_cq_error(791)...: OFI poll failed
(ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)
Abort(77718927) on node 0: Fatal error in internal_Init: Other MPI error,
error stack:
internal_Init(48306).............: MPI_Init(argc=0x7ffcb5b28adc,
argv=0x7ffcb5b28ad0) failed
MPII_Init_thread(265)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(660).......:
MPIDI_OFI_init_vcis(842).........:
check_num_nics(891)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(306)...............:
MPIC_Wait(91)....................:
MPIR_Wait(780)...................:
MPIR_Wait_state(737).............:
MPIDI_progress_test(134).........:
MPIDI_OFI_handle_cq_error(791)...: OFI poll failed
(ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)

I searched the archive of this mailing list and found only one thread that
mentions this "OFI poll failed" error. That thread suggested the error has
something to do with the ch4:ofi device configuration.

I checked my configure log, and it shows

    device : ch4:ofi (embedded libfabric)

for the configuration I used when I built MPICH. So I am wondering whether I
should switch this option to something else, and whether that would fix the
problem. I am not sure what other option I could substitute for ch4:ofi.

*****************************************************

Next I ran configure for a build with the --enable-device = ch3:nemesis
option. Again I can run cpi on either host1 or host2 alone, but if I run it
on host1 and host2 together, it crashes with a core dump.

Using the --enable-device = ch3:sock configure option resulted in more or
less the same problem, except that it quits silently when running on host1
and host2 together.
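
Since single-host runs work in both directions and only the two-host run
fails, one thing that may be worth ruling out is name resolution. This is an
assumption, not something confirmed by the error output above: if a node's
own hostname resolves to a loopback address (some Ubuntu installs map it to
127.0.1.1 in /etc/hosts), the other node may be handed an address it cannot
reach. Below is a minimal Python sketch of that check. host1 and host2 are
the names from the mpiexec command; the function names are made up for
illustration and are not part of MPICH.

```python
import ipaddress
import socket


def is_loopback(address):
    """Return True if the IP address string lies in a loopback range."""
    # Drop any IPv6 scope suffix such as "%eth0" before parsing.
    return ipaddress.ip_address(address.split("%")[0]).is_loopback


def resolve(hostname):
    """Return the sorted IP addresses hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def report(hostnames):
    """Build one line per resolved address describing how each name resolves."""
    lines = []
    for name in hostnames:
        addresses = resolve(name)
        if not addresses:
            lines.append(f"{name}: does not resolve")
            continue
        for address in addresses:
            note = " (loopback)" if is_loopback(address) else ""
            lines.append(f"{name}: {address}{note}")
    return lines


if __name__ == "__main__":
    # Check this node's own name plus the two hosts from the mpiexec command.
    for line in report([socket.gethostname(), "host1", "host2"]):
        print(line)
```

Running this on both nodes shows what each one believes the names point to.
A node whose own name resolves only to a loopback address would be a
candidate explanation, though not necessarily the cause of the failure
reported here.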

Any ideas?
Thanks!
Stephen.