<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Could you try upstream MPICH? It is available at <a href="https://www.mpich.org/downloads/">https://www.mpich.org/downloads/</a>.</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
The behavior of Cray MPICH and upstream MPICH may have diverged. Upstream MPICH should be fine with a different number of NICs on different nodes. By default, each process picks the NIC that is closest to the process's CPU affinity; or, if you have multiple
processes on the same node, each process will try to pick a different NIC in a round-robin fashion. Usually a disabled NIC won't be selected during init, but if you see it behave otherwise, I encourage you to create an issue at
<a href="https://github.com/pmodels/mpich/issues">https://github.com/pmodels/mpich/issues</a>.</div>
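<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
To make the default selection policy above concrete, here is a minimal Python sketch. It is only an illustration and not MPICH's actual code: the Nic fields, the pick_nic function, and the use of a NUMA node number as the measure of "closest" are all hypothetical, and combining the affinity preference with round-robin in a single function is an assumption on my part.</div>
<pre>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nic:
    name: str        # interface name, e.g. "hsn0"
    numa_node: int   # NUMA node the NIC hangs off (hypothetical closeness measure)
    enabled: bool    # whether the NIC is usable at init time


def pick_nic(nics, local_rank, cpu_numa_node):
    """Pick one NIC for a process on this node (illustrative policy only)."""
    # Disabled NICs are never candidates.
    usable = [n for n in nics if n.enabled]
    if not usable:
        raise ValueError("no enabled NIC on this node")
    # Prefer NICs on the same NUMA node as the process's CPU affinity;
    # fall back to every usable NIC when none is that close.
    close = [n for n in usable if n.numa_node == cpu_numa_node]
    candidates = close or usable
    # Spread the processes on this node over the candidates round-robin.
    return candidates[local_rank % len(candidates)]
```
</pre>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
In this sketch, a node where hsn1 is marked as disabled hands hsn0 to every local rank, while a node with both NICs enabled alternates between hsn0 and hsn1. Note that the per-node NIC count plays no role, so differing counts across nodes are not a problem under this policy.</div>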
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Cheers,</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Hui</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Kevin Buckley via discuss &lt;discuss@mpich.org&gt;<br>
<b>Sent:</b> Monday, March 9, 2026 12:22 AM<br>
<b>To:</b> discuss@mpich.org &lt;discuss@mpich.org&gt;<br>
<b>Cc:</b> Kevin Buckley &lt;kevin.buckley.pawsey.org.au@gmail.com&gt;<br>
<b>Subject:</b> [mpich-discuss] How does MPICH dtermine available NICs?</font>
<div> </div>
</div>
<div>
<pre style="font-family:sans-serif; font-size:100%; white-space:pre-wrap; word-wrap:break-word">I do hope this is the right list on which to ask this, but
it all seems a bit weird to me, so I thought I'd "turn pro".
TL;DR: it gets really weird at the bottom.
I am trying to work out how MPICH determines the number of
NICs that it "thinks" a node has.
Here's what I am seeing, as a result of my nodes having an
"Inconsistent number of NICs across the job"
PE 0: == Node nid001764 has 1 NIC(s) available
PE 0: == Node nid002792 has 2 NIC(s) available
PE 0:
(where the job is a noddy test job that runs two processes
per node, across two nodes)
however, as far as I am concerned, the second NIC on that
nid002792 has been disabled at Boot-time, by replacing
STARTMODE='auto'
with
STARTMODE='off'
in the
/etc/sysconfig/network/ifcfg-hsn1
file.
As far as wicked (which is really using an ifcfg-compat mode,
and not any new wicked-goodness) is concerned, the second
interface is "not up", and hasn't been configured, hence:
nid002792:~ # wicked ifstatus hsn0
hsn0 up
link: #5, state up, mtu 9000
type: ethernet, hwaddr 02:00:00:00:60:73
config: compat:suse:/etc/sysconfig/network/ifcfg-hsn0
leases: ipv4 static granted
addr: ipv4 10.253.133.14/17 [static]
route: ipv4 172.18.0.0/16 via 10.253.255.254 proto boot
nid002792:~ # wicked ifstatus hsn1
hsn1 device-unconfigured
link: #6, state up, mtu 1500
type: ethernet, hwaddr 02:00:00:00:60:33
nid002792:~ #
and what's more, there is no route using that second interface,
which there would have been, had I not ferkled the ifcfg script:
nid002792:~ # ip route
default via 10.253.128.3 dev hsn0
10.168.28.0/22 dev bond0 proto kernel scope link src 10.168.28.23
10.253.128.0/17 dev hsn0 proto kernel scope link src 10.253.133.14
172.18.0.0/16 via 10.253.255.254 dev hsn0
172.23.0.0/16 via 10.168.31.254 dev bond0
nid002792:~ #
THE REALLY WEIRD BIT
If I don't disable the second NIC at boot time, but then, once the
node has booted, explicitly "ifdown" it, manually, with wicked's
wrapped version of an ifdown:
wicked --systemd ifdown hsn1
then MPICH jobs launching on the node DON'T SEE THE SECOND NIC?
For reference, here are the two "old-school" network config files:
nid002792:~ # cat /etc/sysconfig/network/ifcfg-hsn0
STARTMODE='auto'
BOOTPROTO='static'
IPADDR='10.253.133.14'
NETMASK='255.255.128.0'
MTU='9000'
LINK_REQUIRED='yes'
POST_UP_SCRIPT="systemd:cm-slingshot-ama@.service"
nid002792:~ # cat /etc/sysconfig/network/ifcfg-hsn1
STARTMODE='off'
BOOTPROTO='static'
IPADDR='10.253.133.11'
NETMASK='255.255.128.0'
MTU='9000'
LINK_REQUIRED='yes'
POST_UP_SCRIPT="systemd:cm-slingshot-ama@.service"
nid002792:~ #
Now, "I would have thought" (tm) that MPICH, on "seeing"
an interface that hadn't been START-ed, would not have
considered it as a NIC, for NIC_SYMMETRY purposes?
I am (more than) aware that I can prevent the messages about
NIC_SYMMETRY inconsistencies, but that's not the issue here;
the issue here is that MPICH seems to think a NIC that hasn't
been START-ed is worthy of consideration.
FWIW, it's
cray-mpich/8.1.32
so MPICH 3.4a2, under the hood, where the hood belongs to an
HPE/Cray EX, running SLES 15 SP6.
The info from the cray-mpich/8.1.32 module says
- Cray MPICH offers support for multiple NICs per node. Starting with
version 8.0.8, by default Cray MPICH will use all available NICs on
a node.
but maybe their definition of "available" differs from the one
that I have become accustomed to over the years, to wit: if it's
not START-ed; it's not available?
Interested to hear any thoughts on this?
Kevin M. Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
PERTH
Australia
</pre>
</div>
</body>
</html>