[mpich-discuss] affinity problems with mpiexec.hydra 3.0.4
Ken Raffenetti
raffenet at mcs.anl.gov
Fri Sep 20 10:11:57 CDT 2013
Bill,
I have a correction on the interpretation of -bind-to.
> For two socket machines:
>
> If I use --bind-to socket I expect one rank on one socket, and the
> other on the other socket.
This is correct.
>
> If I use --bind-to socket:1 I expect both ranks on the same socket.
This is incorrect. "--bind-to socket:1" is equivalent to "--bind-to socket".
>
> If I use --bind-to socket:2 I expect one rank on each socket (ie the
> same as --bind-to socket for a two socket machine)
Also incorrect. "--bind-to socket:2" binds each process to a group of 2 sockets taken together, so on a two-socket machine every process would be bound to the union of both sockets.
I hope that clears it up. Now, there does seem to be a bug in the "--bind-to socket" case when ppn=2 on some hardware. I am able to replicate this behavior and will file a report. With ppn>2, things work as expected on the same hardware.
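In the meantime, a possible workaround (a rough sketch, not an official fix: it assumes two ranks per node and that PMI_RANK is exported to the launched processes, as in your get_mapping script) is to leave hydra's binding off and bind each rank yourself with hwloc-bind:

    #!/bin/sh
    # bind_socket.sh (hypothetical wrapper): give each local rank its own socket.
    # Assumes exactly 2 ranks per node; PMI_RANK is the rank hydra exports.
    exec hwloc-bind socket:$((PMI_RANK % 2)) -- "$@"

    # usage (no --bind-to, so hydra leaves the processes unbound):
    hydra-3.0.4/mpiexec.hydra --ppn=2 --launcher=ssh --host=jericho101 \
        ./bind_socket.sh utils/get_affinity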
Ken
----- Original Message -----
> From: "Bill Ryder" <bryder at wetafx.co.nz>
> To: discuss at mpich.org
> Sent: Wednesday, September 18, 2013 4:10:49 PM
> Subject: Re: [mpich-discuss] affinity problems with mpiexec.hydra 3.0.4
>
> Hi Ken,
>
> Same result using hwloc-bind (which is a relief because if that
> didn't agree with /proc/pid/status I would have been unpleasantly
> surprised!)
>
> The most curious thing is that if I set socket:3 I get the exact
> binding I want!
>
> Perhaps my interpretation of --ppn=2 --bind-to socket is incorrect.
>
> For two socket machines:
>
> If I use --bind-to socket I expect one rank on one socket, and the
> other on the other socket.
>
> If I use --bind-to socket:1 I expect both ranks on the same socket.
>
> If I use --bind-to socket:2 I expect one rank on each socket (ie the
> same as --bind-to socket for a two socket machine)
>
> Please let me know if that's incorrect.
>
>
>
> Here's a run with various socket counts, using hwloc-bind and looking
> at /proc/$$/status:
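>
> (utils/get_affinity isn't shown here; a minimal sketch of a script that
> would produce output shaped like the lines below:)
>
>     #!/bin/sh
>     # sketch of an affinity checker (not the actual utils/get_affinity);
>     # the "host.rank" prefix in the output comes from --prepend-pattern "%h.%r ".
>     mask=`awk '/^Cpus_allowed:/      {print $2}' /proc/$$/status`
>     cpus=`awk '/^Cpus_allowed_list:/ {print $2}' /proc/$$/status`
>     echo "Affinity from /proc/\$\$/status : $mask $cpus"
>     echo "Affinity from hwloc-bind --get : `hwloc-bind --get`"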
>
>
> --bind-to socket - I expect each rank to get its own socket - but
> that doesn't happen
>
> hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2
> --launcher=ssh --host=jericho101 --bind-to socket utils/get_affinity
> 2> /dev/null
> jericho101.0 Affinity from /proc/$$/status : 00ff00ff 0-7,16-23
> jericho101.1 Affinity from /proc/$$/status : 00ff00ff 0-7,16-23
> jericho101.1 Affinity from hwloc-bind --get : 0x00ff00ff
> jericho101.0 Affinity from hwloc-bind --get : 0x00ff00ff
>
> socket:1 - both processes bound to the same socket. This is what I
> expect socket:1 to do.
>
> hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2
> --launcher=ssh --host=jericho101 --bind-to socket:1
> utils/get_affinity
> jericho101.0 Affinity from /proc/$$/status : 00ff00ff 0-7,16-23
> jericho101.1 Affinity from /proc/$$/status : 00ff00ff 0-7,16-23
> jericho101.0 Affinity from hwloc-bind --get : 0x00ff00ff
> jericho101.1 Affinity from hwloc-bind --get : 0x00ff00ff
>
> socket:2 - I end up with the same binding as socket and socket:1
>
> hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2
> --launcher=ssh --host=jericho101 --bind-to socket:2
> utils/get_affinity
> jericho101.0 Affinity from /proc/$$/status : 00ff00ff 0-7,16-23
> jericho101.1 Affinity from /proc/$$/status : 00ff00ff 0-7,16-23
> jericho101.0 Affinity from hwloc-bind --get : 0x00ff00ff
> jericho101.1 Affinity from hwloc-bind --get : 0x00ff00ff
>
> socket:3 - this gets strange - this is the affinity I want!
>
> hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2
> --launcher=ssh --host=jericho101 --bind-to socket:3
> utils/get_affinity
> jericho101.0 Affinity from /proc/$$/status : 00ff00ff 0-7,16-23
> jericho101.1 Affinity from /proc/$$/status : ff00ff00 8-15,24-31
> jericho101.0 Affinity from hwloc-bind --get : 0x00ff00ff
> jericho101.1 Affinity from hwloc-bind --get : 0xff00ff00
>
> I was originally asked to look at this because of bad runtimes on
> faster hardware - fixing the affinity fixed the performance problem.
> For the test case I was using, binding each process to a socket gave
> me 8% better performance, so it's definitely worth working on.
>
> Let me know what I can do to help - I don't mind attaching gdb to
> mpiexec or shoving prints into the code.
>
> Also, a probably silly question - does hydra_pmi_proxy figure out
> and set the affinity? That seems like the logical place.
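>
> If it is, one way to see what actually gets applied (a sketch - this
> uses the fork launcher so the proxy runs locally and strace -f can
> follow it, which means running it on the target node itself):
>
>     strace -f -e trace=sched_setaffinity \
>         hydra-3.0.4/mpiexec.hydra --ppn=2 --launcher=fork --bind-to socket \
>         utils/get_affinity 2>&1 | grep sched_setaffinity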
>
> I should also note that the other machine, where --bind-to socket and
> socket:1 do the right thing, seems to do the wrong thing with
> socket:2.
>
> Of course this all assumes I'm interpreting that socket option
> correctly.
>
>
> hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2
> --launcher=ssh --host=abrams211a --bind-to socket utils/get_affinity
> abrams211a.0 Affinity from /proc/$$/status : 00555555 0,2,4,6,8,10,12,14,16,18,20,22
> abrams211a.1 Affinity from /proc/$$/status : 00aaaaaa 1,3,5,7,9,11,13,15,17,19,21,23
> abrams211a.0 Affinity from hwloc-bind --get : 0x00555555
> abrams211a.1 Affinity from hwloc-bind --get : 0x00aaaaaa
>
> mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2 --launcher=ssh
> --host=abrams211a --bind-to socket:1 utils/get_affinity
> abrams211a.0 Affinity from /proc/$$/status : 00555555 0,2,4,6,8,10,12,14,16,18,20,22
> abrams211a.1 Affinity from /proc/$$/status : 00aaaaaa 1,3,5,7,9,11,13,15,17,19,21,23
> abrams211a.0 Affinity from hwloc-bind --get : 0x00555555
> abrams211a.1 Affinity from hwloc-bind --get : 0x00aaaaaa
>
>
> hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2
> --launcher=ssh --host=abrams211a --bind-to socket:2
> utils/get_affinity
> abrams211a.0 Affinity from /proc/$$/status : 00555555 0,2,4,6,8,10,12,14,16,18,20,22
> abrams211a.1 Affinity from /proc/$$/status : 00555555 0,2,4,6,8,10,12,14,16,18,20,22
> abrams211a.0 Affinity from hwloc-bind --get : 0x00555555
> abrams211a.1 Affinity from hwloc-bind --get : 0x00555555
>
>
>
> On 09/18/2013 01:50 AM, Ken Raffenetti wrote:
> > Hi Bill,
> >
> > I'm not seeing any obvious problems. However, in my own testing, I
> > found a situation where the grep method you used was unreliable.
> > Can you try using "hwloc-bind --get" instead and see if the
> > results differ?
> >
> > Ken
> >
> > ----- Original Message -----
> >> From: "Bill Ryder" <bryder at wetafx.co.nz>
> >> To: discuss at mpich.org
> >> Sent: Monday, September 16, 2013 6:04:45 PM
> >> Subject: [mpich-discuss] affinity problems with mpiexec.hydra
> >> 3.0.4
> >>
> >> Greetings all,
> >>
> >> I'm trying to set affinity for hybrid MPI/OpenMP tasks.
> >>
> >> I want to run two processes on a host, and give one socket to one
> >> rank, and the other socket to the other rank.
> >>
> >> I have two types of hardware - one works perfectly - the other
> >> doesn't.
> >>
> >> I first saw the problem using mpiexec.hydra with slurm, but I've
> >> moved to using ssh to remove some possible variables.
> >>
> >>
> >> I have a trivial script which just greps /proc/$$/status for the
> >> Cpus_allowed mask and Cpus_allowed_list.
> >>
> >> It's just: echo "`hostname` $PMI_RANK `grep Cpus_allowed
> >> /proc/$$/status`"
> >>
> >>
> >> On the machine that is doing what I want I get:
> >>
> >> mpiexec.hydra -ppn 2 -hosts abrams201a --bind-to socket -launcher
> >> ssh
> >> ./get_mapping
> >> abrams201a 0 Cpus_allowed: 00555555
> >> Cpus_allowed_list: 0,2,4,6,8,10,12,14,16,18,20,22
> >> abrams201a 1 Cpus_allowed: 00aaaaaa
> >> Cpus_allowed_list: 1,3,5,7,9,11,13,15,17,19,21,23
> >>
> >> rank 0 gets one socket, rank 1 gets the other socket. This is what
> >> I want.
> >>
> >> But on my other machine with a different topology I get this:
> >>
> >> mpiexec.hydra -ppn 2 -hosts jericho101 --bind-to socket -launcher
> >> ssh ./get_mapping
> >>
> >> jericho101 0 Cpus_allowed: 00000000,00ff00ff
> >> Cpus_allowed_list: 0-7,16-23
> >> jericho101 1 Cpus_allowed: 00000000,00ff00ff
> >> Cpus_allowed_list: 0-7,16-23
> >>
> >> So each rank is trying to use the same socket.
> >>
> >> Similarly, if I try to bind to a numanode:
> >>
> >> mpiexec.hydra -ppn 2 -hosts jericho101 --bind-to numa -launcher
> >> ssh
> >> ./get_mapping
> >> jericho101 1 Cpus_allowed: 00000000,00ff00ff
> >> Cpus_allowed_list: 0-7,16-23
> >> jericho101 0 Cpus_allowed: 00000000,00ff00ff
> >> Cpus_allowed_list: 0-7,16-23
> >>
> >>
> >> Or even if I pass numa:2
> >>
> >> mpiexec.hydra -ppn 2 -hosts jericho101 --bind-to numa:2 -launcher
> >> ssh ./get_mapping
> >>
> >> jericho101 0 Cpus_allowed: 00000000,00ff00ff
> >> Cpus_allowed_list: 0-7,16-23
> >> jericho101 1 Cpus_allowed: 00000000,00ff00ff
> >> Cpus_allowed_list: 0-7,16-23
> >>
> >> So once again, instead of handing a numa node to each process, it's
> >> handing the same node to both.
> >>
> >>
> >> How would I start debugging this?
> >>
> >> Or am I missing something really obvious?
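> >>
> >> (For reference, one quick sanity check - a sketch, using the hwloc
> >> 1.3.1 install mentioned below - is to ask hwloc-calc for each
> >> socket's cpuset and compare it with the Cpus_allowed masks above:)
> >>
> >>     hwloc-calc socket:0    # cpuset mask of the first socket
> >>     hwloc-calc socket:1    # cpuset mask of the second socket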
> >>
> >>
> >>
> >> Thanks!
> >> ---------
> >>
> >> Bill Ryder
> >> Weta Digital
> >>
> >>
> >>
> >>
> >>
> >> A bit more data:
> >>
> >> mpiexec.hydra --info
> >> HYDRA build details:
> >>     Version:                                 3.0.4
> >>     Release Date:                            Wed Apr 24 10:08:10 CDT 2013
> >>     CC:                                      cc
> >>     CXX:
> >>     F77:
> >>     F90:
> >>     Configure options:                       '--prefix=/tech/apps/mpich/hydra'
> >>     Process Manager:                         pmi
> >>     Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
> >>     Topology libraries available:            hwloc
> >>     Resource management kernels available:   user slurm ll lsf sge pbs cobalt
> >>     Checkpointing libraries available:
> >>     Demux engines available:                 poll select
> >>
> >>
> >>
> >> I have hwloc 1.3.1 installed locally on each machine
> >>
> >> abrams201a looks like:
> >>
> >> Machine (48GB)
> >> NUMANode L#0 (P#1 24GB) + Socket L#0 + L3 L#0 (12MB)
> >> L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
> >> PU L#0 (P#0)
> >> PU L#1 (P#12)
> >> L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
> >> PU L#2 (P#2)
> >> PU L#3 (P#14)
> >> L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
> >> PU L#4 (P#4)
> >> PU L#5 (P#16)
> >> L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
> >> PU L#6 (P#6)
> >> PU L#7 (P#18)
> >> L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
> >> PU L#8 (P#8)
> >> PU L#9 (P#20)
> >> L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
> >> PU L#10 (P#10)
> >> PU L#11 (P#22)
> >> NUMANode L#1 (P#0 24GB) + Socket L#1 + L3 L#1 (12MB)
> >> L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
> >> PU L#12 (P#1)
> >> PU L#13 (P#13)
> >> L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
> >> PU L#14 (P#3)
> >> PU L#15 (P#15)
> >> L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
> >> PU L#16 (P#5)
> >> PU L#17 (P#17)
> >> L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
> >> PU L#18 (P#7)
> >> PU L#19 (P#19)
> >> L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
> >> PU L#20 (P#9)
> >> PU L#21 (P#21)
> >> L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
> >> PU L#22 (P#11)
> >> PU L#23 (P#23)
> >> HostBridge L#0
> >> PCIBridge
> >> PCI 8086:10e7
> >> Net L#0 "eth0"
> >> PCI 8086:10e7
> >> Net L#1 "eth1"
> >> PCIBridge
> >> PCI 15b3:6746
> >> Net L#2 "eth2"
> >> OpenFabrics L#3 "mlx4_0"
> >> PCI 15b3:6746
> >> PCI 15b3:6746
> >> PCI 15b3:6746
> >> PCIBridge
> >> PCI 102b:0533
> >> PCI 8086:3a20
> >> Block L#4 "sda"
> >>
> >>
> >> And jericho101 looks like:
> >>
> >> Machine (96GB)
> >> NUMANode L#0 (P#0 48GB)
> >> Socket L#0 + L3 L#0 (20MB)
> >> L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
> >> PU L#0 (P#0)
> >> PU L#1 (P#16)
> >> L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
> >> PU L#2 (P#1)
> >> PU L#3 (P#17)
> >> L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
> >> PU L#4 (P#2)
> >> PU L#5 (P#18)
> >> L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
> >> PU L#6 (P#3)
> >> PU L#7 (P#19)
> >> L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
> >> PU L#8 (P#4)
> >> PU L#9 (P#20)
> >> L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
> >> PU L#10 (P#5)
> >> PU L#11 (P#21)
> >> L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
> >> PU L#12 (P#6)
> >> PU L#13 (P#22)
> >> L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
> >> PU L#14 (P#7)
> >> PU L#15 (P#23)
> >> HostBridge L#0
> >> PCIBridge
> >> PCI 14e4:168e
> >> Net L#0 "eth0"
> >> PCI 14e4:168e
> >> Net L#1 "eth1"
> >> PCI 14e4:168e
> >> Net L#2 "eth2"
> >> PCI 14e4:168e
> >> Net L#3 "eth3"
> >> PCI 14e4:168e
> >> Net L#4 "eth4"
> >> PCI 14e4:168e
> >> Net L#5 "eth5"
> >> PCI 14e4:168e
> >> Net L#6 "eth6"
> >> PCI 14e4:168e
> >> Net L#7 "eth7"
> >> PCIBridge
> >> PCI 103c:323b
> >> Block L#8 "sda"
> >> PCIBridge
> >> PCI 102b:0533
> >> NUMANode L#1 (P#1 48GB) + Socket L#1 + L3 L#1 (20MB)
> >> L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
> >> PU L#16 (P#8)
> >> PU L#17 (P#24)
> >> L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
> >> PU L#18 (P#9)
> >> PU L#19 (P#25)
> >> L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
> >> PU L#20 (P#10)
> >> PU L#21 (P#26)
> >> L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
> >> PU L#22 (P#11)
> >> PU L#23 (P#27)
> >> L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12
> >> PU L#24 (P#12)
> >> PU L#25 (P#28)
> >> L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13
> >> PU L#26 (P#13)
> >> PU L#27 (P#29)
> >> L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14
> >> PU L#28 (P#14)
> >> PU L#29 (P#30)
> >> L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15
> >> PU L#30 (P#15)
> >> PU L#31 (P#31)
> >>
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>