<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8"></head><body><pre style="font-family: sans-serif; font-size: 100%; white-space: pre-wrap; word-wrap: break-word">I do hope this is the right list on which to ask this, but
it all seems a bit weird to me, so I thought I'd "turn pro".
TL;DR: it gets really weird at the bottom.
I am trying to work out how MPICH determines the number of
NICs that it "thinks" a node has.
Here's what I am seeing, as a result of my nodes having an
"Inconsistent number of NICs across the job"
PE 0: == Node nid001764 has 1 NIC(s) available
PE 0: == Node nid002792 has 2 NIC(s) available
PE 0:
(where the job is a noddy test job that runs two processes
per node, across two nodes)
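
For checking the same thing across a larger job, here is a minimal
Python sketch. It assumes the per-node messages appear in the job
output in exactly the form quoted above; the function names
nic_counts and is_symmetric are made up for illustration and are not
part of MPICH or Cray MPICH.

```python
import re

# Matches lines such as "PE 0: == Node nid001764 has 1 NIC(s) available".
# The pattern is an assumption based only on the two lines quoted above.
NIC_LINE = re.compile(r"== Node (\S+) has (\d+) NIC\(s\) available")


def nic_counts(log_text):
    """Return {node_name: nic_count} for every matching line in log_text."""
    counts = {}
    for line in log_text.splitlines():
        match = NIC_LINE.search(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def is_symmetric(counts):
    """True when all nodes report the same NIC count (or none reported)."""
    return len(set(counts.values())) in (0, 1)


sample = """\
PE 0: == Node nid001764 has 1 NIC(s) available
PE 0: == Node nid002792 has 2 NIC(s) available
PE 0:
"""

counts = nic_counts(sample)
for node, count in sorted(counts.items()):
    print(node, count)
print("symmetric" if is_symmetric(counts) else "inconsistent")
```

This only summarises what the job printed; it says nothing about how
the NIC count was arrived at.
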
however, as far as I am concerned, the second NIC on
nid002792 has been disabled at boot time, by replacing
STARTMODE='auto'
with
STARTMODE='off'
in the
/etc/sysconfig/network/ifcfg-hsn1
file.
As far as wicked (which is really using an ifcfg-compat mode,
and not any new wicked-goodness) is concerned, the second
interface is "not up", and hasn't been configured, hence:
nid002792:~ # wicked ifstatus hsn0
hsn0 up
link: #5, state up, mtu 9000
type: ethernet, hwaddr 02:00:00:00:60:73
config: compat:suse:/etc/sysconfig/network/ifcfg-hsn0
leases: ipv4 static granted
addr: ipv4 10.253.133.14/17 [static]
route: ipv4 172.18.0.0/16 via 10.253.255.254 proto boot
nid002792:~ # wicked ifstatus hsn1
hsn1 device-unconfigured
link: #6, state up, mtu 1500
type: ethernet, hwaddr 02:00:00:00:60:33
nid002792:~ #
and what's more, there is no route using that second interface,
which there would have been, had I not ferkled the ifcfg script:
nid002792:~ # ip route
default via 10.253.128.3 dev hsn0
10.168.28.0/22 dev bond0 proto kernel scope link src 10.168.28.23
10.253.128.0/17 dev hsn0 proto kernel scope link src 10.253.133.14
172.18.0.0/16 via 10.253.255.254 dev hsn0
172.23.0.0/16 via 10.168.31.254 dev bond0
nid002792:~ #
THE REALLY WEIRD BIT
If I don't disable the second NIC at boot time, but then, once the
node has booted, explicitly "ifdown" it, manually, with wicked's
wrapped version of an ifdown:
wicked --systemd ifdown hsn1
then MPICH jobs launching on the node DON'T SEE THE SECOND NIC?
For reference, here are the two "old-school" network config files:
nid002792:~ # cat /etc/sysconfig/network/ifcfg-hsn0
STARTMODE='auto'
BOOTPROTO='static'
IPADDR='10.253.133.14'
NETMASK='255.255.128.0'
MTU='9000'
LINK_REQUIRED='yes'
POST_UP_SCRIPT="systemd:cm-slingshot-ama@.service"
nid002792:~ # cat /etc/sysconfig/network/ifcfg-hsn1
STARTMODE='off'
BOOTPROTO='static'
IPADDR='10.253.133.11'
NETMASK='255.255.128.0'
MTU='9000'
LINK_REQUIRED='yes'
POST_UP_SCRIPT="systemd:cm-slingshot-ama@.service"
nid002792:~ #
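
To compare the STARTMODE settings on many nodes without eyeballing
each file, a minimal Python sketch follows. It assumes the files hold
plain KEY='value' lines like the ones shown above; parse_ifcfg is a
hypothetical helper name, and the sketch reports only what the config
file says, not what wicked or MPICH make of it.

```python
import shlex


def parse_ifcfg(text):
    """Parse KEY='value' lines from an ifcfg-style file into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # shlex.split removes the shell-style quotes around the value.
        parts = shlex.split(value)
        settings[key.strip()] = parts[0] if parts else ""
    return settings


# Contents copied from the ifcfg-hsn1 listing above.
hsn1 = """\
STARTMODE='off'
BOOTPROTO='static'
IPADDR='10.253.133.11'
NETMASK='255.255.128.0'
MTU='9000'
LINK_REQUIRED='yes'
POST_UP_SCRIPT="systemd:cm-slingshot-ama@.service"
"""

config = parse_ifcfg(hsn1)
print("hsn1 STARTMODE:", config["STARTMODE"])
```

On a real node the text would be read from
/etc/sysconfig/network/ifcfg-hsn1 instead of the inline string.
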
Now, "I would have thought" (tm) that MPICH, on "seeing"
an interface that hadn't been START-ed, would not have
considered it as a NIC, for NIC_SYMMETRY purposes?
I am (more than) aware that I can prevent the messages about
NIC_SYMMETRY inconsistencies, but that's not the issue here;
the issue here is that MPICH seems to think a NIC that hasn't
been START-ed is worthy of consideration.
FWIW, it's
cray-mpich/8.1.32
so MPICH 3.4a2, under the hood, where the hood belongs to an
HPE/Cray EX, running SLES 15 SP6.
The info from the cray-mpich/8.1.32 module says
- Cray MPICH offers support for multiple NICs per node. Starting with
version 8.0.8, by default Cray MPICH will use all available NICs on
a node.
but maybe their definition of "available" differs from the one
that I have become accustomed to over the years, to wit: if it's
not START-ed, it's not available?
Interested to hear any thoughts on this?
Kevin M. Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
PERTH
Australia
</pre></body></html>