<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Could you run cpi with <code>MPIR_CVAR_DEBUG_SUMMARY=1</code> and post the output?</div>
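<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
A minimal sketch of one way to do that, assuming the Hydra <code>mpiexec</code> that ships with MPICH and the host names from the original report (the log file name is just an illustration):
<pre><code># Pass the CVAR to all ranks with -genv and keep a copy of the combined output.
# host1, host2, and cpi are taken from the original report; cpi_debug.log is an arbitrary name.
mpiexec -genv MPIR_CVAR_DEBUG_SUMMARY 1 -host host1,host2 -n 2 cpi 2&gt;&amp;1 | tee cpi_debug.log</code></pre>
Hydra normally forwards the caller's environment as well, so exporting the variable before calling <code>mpiexec</code> should be equivalent.</div>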
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Hui Zhou</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Stephen Wong via discuss &lt;discuss@mpich.org&gt;<br>
<b>Sent:</b> Friday, July 26, 2024 4:46 AM<br>
<b>To:</b> discuss@mpich.org &lt;discuss@mpich.org&gt;<br>
<b>Cc:</b> Stephen Wong &lt;stephen.photond@gmail.com&gt;<br>
<b>Subject:</b> [mpich-discuss] OFI poll failed error if using more than one cluster node</font>
<div> </div>
</div>
<div>
<div dir="ltr">
<div>Hi,</div>
<div><br>
</div>
<div>(I sent this previously without a subject line.)</div>
<div><br>
</div>
<div><span style="color:rgb(12,13,14); font-family:-apple-system,BlinkMacSystemFont,'Segoe UI Adjusted','Segoe UI','Liberation Sans',sans-serif; font-size:15px">I am using MPICH 4.2.2 on Ubuntu 24.04 and testing with cpi, the small program that calculates the value of pi using MPI. From host1 I can run cpi on either host1 or host2 alone, and from host2 I can run cpi on either host2 or host1 alone. The problem occurs only when I try to use host1 and host2 together.</span><br>
</div>
<div><span style="color:rgb(12,13,14); font-family:-apple-system,BlinkMacSystemFont,'Segoe UI Adjusted','Segoe UI','Liberation Sans',sans-serif; font-size:15px"><br>
</span></div>
<div>
<p style="margin:0px 0px 1.1em; padding:0px; border:0px; font-variant-numeric:inherit; font-variant-east-asian:inherit; font-variant-alternates:inherit; font-stretch:inherit; line-height:inherit; font-family:-apple-system,BlinkMacSystemFont,'Segoe UI Adjusted','Segoe UI','Liberation Sans',sans-serif; font-kerning:inherit; font-feature-settings:inherit; font-size:15px; vertical-align:baseline; box-sizing:inherit; clear:both; color:rgb(12,13,14)">
For example, running the command<br style="box-sizing:inherit">
<em style="margin:0px; padding:0px; border:0px; font-variant:inherit; font-weight:inherit; font-stretch:inherit; line-height:inherit; font-family:inherit; font-kerning:inherit; font-feature-settings:inherit; vertical-align:baseline; box-sizing:inherit">mpiexec -host host1,host2 -n 2 cpi</em><br style="box-sizing:inherit">
ends with the error:</p>
<p style="margin:0px 0px 1.1em; padding:0px; border:0px; font-variant-numeric:inherit; font-variant-east-asian:inherit; font-variant-alternates:inherit; font-stretch:inherit; line-height:inherit; font-family:-apple-system,BlinkMacSystemFont,'Segoe UI Adjusted','Segoe UI','Liberation Sans',sans-serif; font-kerning:inherit; font-feature-settings:inherit; font-size:15px; vertical-align:baseline; box-sizing:inherit; clear:both; color:rgb(12,13,14)">
Abort(77718927) on node 1: Fatal error in internal_Init: Other MPI error, error stack:<br style="box-sizing:inherit">
internal_Init(48306).............: MPI_Init(argc=0x7ffdb68e7fec, argv=0x7ffdb68e7fe0) failed<br style="box-sizing:inherit">
MPII_Init_thread(265)............:<br style="box-sizing:inherit">
MPIR_init_comm_world(34).........:<br style="box-sizing:inherit">
MPIR_Comm_commit(823)............:<br style="box-sizing:inherit">
MPID_Comm_commit_post_hook(222)..:<br style="box-sizing:inherit">
MPIDI_world_post_init(660).......:<br style="box-sizing:inherit">
MPIDI_OFI_init_vcis(842).........:<br style="box-sizing:inherit">
check_num_nics(891)..............:<br style="box-sizing:inherit">
MPIR_Allreduce_allcomm_auto(4726):<br style="box-sizing:inherit">
MPIC_Sendrecv(306)...............:<br style="box-sizing:inherit">
MPIC_Wait(91)....................:<br style="box-sizing:inherit">
MPIR_Wait(780)...................:<br style="box-sizing:inherit">
MPIR_Wait_state(737).............:<br style="box-sizing:inherit">
MPIDI_progress_test(134).........:<br style="box-sizing:inherit">
MPIDI_OFI_handle_cq_error(791)...: OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)<br style="box-sizing:inherit">
Abort(77718927) on node 0: Fatal error in internal_Init: Other MPI error, error stack:<br style="box-sizing:inherit">
internal_Init(48306).............: MPI_Init(argc=0x7ffcb5b28adc, argv=0x7ffcb5b28ad0) failed<br style="box-sizing:inherit">
MPII_Init_thread(265)............:<br style="box-sizing:inherit">
MPIR_init_comm_world(34).........:<br style="box-sizing:inherit">
MPIR_Comm_commit(823)............:<br style="box-sizing:inherit">
MPID_Comm_commit_post_hook(222)..:<br style="box-sizing:inherit">
MPIDI_world_post_init(660).......:<br style="box-sizing:inherit">
MPIDI_OFI_init_vcis(842).........:<br style="box-sizing:inherit">
check_num_nics(891)..............:<br style="box-sizing:inherit">
MPIR_Allreduce_allcomm_auto(4726):<br style="box-sizing:inherit">
MPIC_Sendrecv(306)...............:<br style="box-sizing:inherit">
MPIC_Wait(91)....................:<br style="box-sizing:inherit">
MPIR_Wait(780)...................:<br style="box-sizing:inherit">
MPIR_Wait_state(737).............:<br style="box-sizing:inherit">
MPIDI_progress_test(134).........:<br style="box-sizing:inherit">
MPIDI_OFI_handle_cq_error(791)...: OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)</p>
I searched the archive of this mailing list and found only one thread that mentions this OFI poll failed error. </div>
<div>That thread suggested the error has something to do with the ch4:ofi device configuration.</div>
<div>I checked my configure log, and it shows </div>
<div>device : ch4:ofi (embedded libfabric) </div>
<div>as the configuration used when I built MPICH. Should I switch this option to something else, and would that fix the problem? I am not sure what other option I could substitute for ch4:ofi.</div>
<div><br>
</div>
<div>*****************************************************</div>
<div><br>
</div>
<div>Next I tried configuring a build with the --with-device=ch3:nemesis option. </div>
<div>Again, I can run cpi on either host1 or host2 alone. If I run it on host1 and host2 together, it crashes with a core dump.</div>
<div><br>
</div>
<div>Using the --with-device=ch3:sock configure option resulted in more or less the same problem, except that it quits silently when running on host1 and host2 together.</div>
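<div><br>
</div>
<div>For reference, a minimal sketch of how such a rebuild is typically configured and checked, assuming an MPICH source tree; the install prefix is hypothetical, and <code>mpichversion</code> is used only to confirm which device each host's installation reports:</div>
<pre># Select the device at configure time (MPICH's configure spells the option --with-device=...).
# $HOME/mpich-nemesis is a hypothetical install prefix.
./configure --with-device=ch3:nemesis --prefix=$HOME/mpich-nemesis
make &amp;&amp; make install
# Run on both host1 and host2 to confirm the build in use; the output includes a device line.
$HOME/mpich-nemesis/bin/mpichversion</pre>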
<div><br>
</div>
<div>Any ideas?</div>
<div>Thanks!</div>
<font color="#888888">
<div>Stephen.</div>
<div><br>
</div>
</font></div>
</div>
</body>
</html>