[mpich-discuss] affinity problems with mpiexec.hydra 3.0.4
Bill Ryder
bryder at wetafx.co.nz
Wed Sep 18 16:10:49 CDT 2013
Hi Ken,
Same result using hwloc-bind (which is a relief because if that didn't agree with /proc/pid/status I would have been unpleasantly
surprised!)
The most curious thing is that if I set socket:3 I get the exact binding I want!
Perhaps my interpretation of --ppn=2 --bind-to socket is incorrect.
For two socket machines:
If I use --bind-to socket I expect one rank on one socket, and the other on the other socket.
If I use --bind-to socket:1 I expect both ranks on the same socket.
If I use --bind-to socket:2 I expect one rank on each socket (i.e. the same as --bind-to socket for a two-socket machine).
Please let me know if that's incorrect.
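For reference, here's a quick way to cross-check which mask belongs to which socket, so the hex values in the runs below can be matched against the topology. This is only a sketch using the hwloc 1.3.1 command-line tools installed on the node, and the exact output formatting may differ:

# Print the cpuset hwloc associates with each socket on jericho101,
# for comparison against the Cpus_allowed masks reported below.
hwloc-calc socket:0    # from the lstopo output this should be 0x00ff00ff (PUs 0-7,16-23)
hwloc-calc socket:1    # and this should be 0xff00ff00 (PUs 8-15,24-31)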
Here are the results of trying various socket counts, checking both hwloc-bind --get and /proc/$$/status:
--bind-to socket - I expect each rank to get its own socket, but that doesn't happen:
hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2 --launcher=ssh --host=jericho101 --bind-to socket utils/get_affinity 2> /dev/null
jericho101.0 Affinity from /proc/$$/status : 00ff00ff 0-7,16-23
jericho101.1 Affinity from /proc/$$/status : 00ff00ff 0-7,16-23
jericho101.1 Affinity from hwloc-bind --get : 0x00ff00ff
jericho101.0 Affinity from hwloc-bind --get : 0x00ff00ff
socket:1 - both processes bound to the same socket. This is what I expect socket:1 to do.
hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2 --launcher=ssh --host=jericho101 --bind-to socket:1 utils/get_affinity
jericho101.0 Affinity from /proc/$$/status : 00ff00ff 0-7,16-23
jericho101.1 Affinity from /proc/$$/status : 00ff00ff 0-7,16-23
jericho101.0 Affinity from hwloc-bind --get : 0x00ff00ff
jericho101.1 Affinity from hwloc-bind --get : 0x00ff00ff
socket:2 - I end up with the same binding as socket and socket:1
hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2 --launcher=ssh --host=jericho101 --bind-to socket:2 utils/get_affinity
jericho101.0 Affinity from /proc/$$/status : 00ff00ff 0-7,16-23
jericho101.1 Affinity from /proc/$$/status : 00ff00ff 0-7,16-23
jericho101.0 Affinity from hwloc-bind --get : 0x00ff00ff
jericho101.1 Affinity from hwloc-bind --get : 0x00ff00ff
socket:3 - this gets strange - this is the affinity I want!
hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2 --launcher=ssh --host=jericho101 --bind-to socket:3 utils/get_affinity
jericho101.0 Affinity from /proc/$$/status : 00ff00ff 0-7,16-23
jericho101.1 Affinity from /proc/$$/status : ff00ff00 8-15,24-31
jericho101.0 Affinity from hwloc-bind --get : 0x00ff00ff
jericho101.1 Affinity from hwloc-bind --get : 0xff00ff00
I was originally asked to look at this because of bad runtimes on faster hardware; fixing the affinity fixed the performance problem. For the test case I was using, binding each process to a socket gave 8% better performance, so it's definitely worth working on.
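As a stopgap I could obviously do the binding myself with a tiny wrapper around the executable. A rough sketch (assuming hydra exports PMI_RANK, as it does for my get_affinity script, that --ppn=2 puts two consecutive ranks on each node, and that hwloc-bind is on the PATH; bind_wrapper.sh is just an illustrative name):

#!/bin/sh
# bind_wrapper.sh - bind this rank to one socket chosen from its PMI rank,
# then exec the real program.  With --ppn=2 on a two-socket node this should
# land rank 0 on socket 0 and rank 1 on socket 1.
SOCKET=$(( PMI_RANK % 2 ))
exec hwloc-bind socket:${SOCKET} -- "$@"

e.g. hydra-3.0.4/mpiexec.hydra --ppn=2 --launcher=ssh --host=jericho101 ./bind_wrapper.sh utils/get_affinity - but I'd much rather --bind-to socket just did the right thing.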
Let me know what I can do to help - I don't mind gdbing into mpiexec or shoving some prints into the code.
Also, a probably silly question: does hydra_pmi_proxy figure out and set the affinity? That seems the logical place.
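One way I could check that from the outside (a sketch - it assumes the binding ultimately goes through sched_setaffinity() on Linux, uses the fork launcher locally on jericho101 so the whole mpiexec -> hydra_pmi_proxy -> rank tree stays under one strace, and the output prefix is arbitrary):

# strace -ff writes one /tmp/hydra_affinity.<pid> file per process, so it's
# easy to see which pid (proxy or rank) sets the mask, and with what value.
strace -ff -e trace=sched_setaffinity -o /tmp/hydra_affinity \
    hydra-3.0.4/mpiexec.hydra --ppn=2 --launcher=fork --bind-to socket utils/get_affinity
grep . /tmp/hydra_affinity.*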
I should also note that the other machine, where --bind-to socket and socket:1 do the right thing, seems to do the wrong thing with socket:2.
Of course this all assumes I'm interpreting that socket option correctly.
hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2 --launcher=ssh --host=abrams211a --bind-to socket utils/get_affinity
abrams211a.0 Affinity from /proc/$$/status : 00555555 0,2,4,6,8,10,12,14,16,18,20,22
abrams211a.1 Affinity from /proc/$$/status : 00aaaaaa 1,3,5,7,9,11,13,15,17,19,21,23
abrams211a.0 Affinity from hwloc-bind --get : 0x00555555
abrams211a.1 Affinity from hwloc-bind --get : 0x00aaaaaa
mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2 --launcher=ssh --host=abrams211a --bind-to socket:1 utils/get_affinity
abrams211a.0 Affinity from /proc/$$/status : 00555555 0,2,4,6,8,10,12,14,16,18,20,22
abrams211a.1 Affinity from /proc/$$/status : 00aaaaaa 1,3,5,7,9,11,13,15,17,19,21,23
abrams211a.0 Affinity from hwloc-bind --get : 0x00555555
abrams211a.1 Affinity from hwloc-bind --get : 0x00aaaaaa
hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2 --launcher=ssh --host=abrams211a --bind-to socket:2 utils/get_affinity
abrams211a.0 Affinity from /proc/$$/status : 00555555 0,2,4,6,8,10,12,14,16,18,20,22
abrams211a.1 Affinity from /proc/$$/status : 00555555 0,2,4,6,8,10,12,14,16,18,20,22
abrams211a.0 Affinity from hwloc-bind --get : 0x00555555
abrams211a.1 Affinity from hwloc-bind --get : 0x00555555
On 09/18/2013 01:50 AM, Ken Raffenetti wrote:
> Hi Bill,
>
> I'm not seeing any obvious problems. However, in my own testing, I found a situation where the grep method you used was unreliable. Can you try using "hwloc-bind --get" instead and see if the results differ?
>
> Ken
>
> ----- Original Message -----
>> From: "Bill Ryder" <bryder at wetafx.co.nz>
>> To: discuss at mpich.org
>> Sent: Monday, September 16, 2013 6:04:45 PM
>> Subject: [mpich-discuss] affinity problems with mpiexec.hydra 3.0.4
>>
>> Greetings all,
>>
>> I'm trying to set affinity for hybrid MPI/OpenMP tasks.
>>
>> I want to run two processes on a host, and give one socket to one
>> rank, and the other socket to the other rank.
>>
>> I have two types of hardware - one works perfectly - the other
>> doesn't.
>>
>> I first saw the problem using mpiexec.hydra with slurm, but I've moved
>> to using ssh to remove some possible variables.
>>
>>
>> I have a trivial script which just greps out /proc/$$/status for the
>> Cpus_allowed mask and Cpus_allowed_list
>>
>> It's just: echo "`hostname` $PMI_RANK `grep Cpus_allowed /proc/$$/status`"
>>
>>
>> On the machine that is doing what I want I get:
>>
>> mpiexec.hydra -ppn 2 -hosts abrams201a --bind-to socket -launcher ssh ./get_mapping
>> abrams201a 0 Cpus_allowed: 00555555
>> Cpus_allowed_list: 0,2,4,6,8,10,12,14,16,18,20,22
>> abrams201a 1 Cpus_allowed: 00aaaaaa
>> Cpus_allowed_list: 1,3,5,7,9,11,13,15,17,19,21,23
>>
>> rank 0 gets one socket, rank 1 gets the other socket. This is what I
>> want
>>
>> But on my other machine with a different topology I get this:
>>
>> mpiexec.hydra -ppn 2 -hosts jericho101 --bind-to socket -launcher ssh ./get_mapping
>>
>> jericho101 0 Cpus_allowed: 00000000,00ff00ff
>> Cpus_allowed_list: 0-7,16-23
>> jericho101 1 Cpus_allowed: 00000000,00ff00ff
>> Cpus_allowed_list: 0-7,16-23
>>
>> So each rank is trying to use the same socket.
>>
>> Similarly if I try to bind to a numanode
>>
>> mpiexec.hydra -ppn 2 -hosts jericho101 --bind-to numa -launcher ssh ./get_mapping
>> jericho101 1 Cpus_allowed: 00000000,00ff00ff
>> Cpus_allowed_list: 0-7,16-23
>> jericho101 0 Cpus_allowed: 00000000,00ff00ff
>> Cpus_allowed_list: 0-7,16-23
>>
>>
>> Or even if I send numa:2
>>
>> mpiexec.hydra -ppn 2 -hosts jericho101 --bind-to numa:2 -launcher ssh ./get_mapping
>>
>> jericho101 0 Cpus_allowed: 00000000,00ff00ff
>> Cpus_allowed_list: 0-7,16-23
>> jericho101 1 Cpus_allowed: 00000000,00ff00ff
>> Cpus_allowed_list: 0-7,16-23
>>
>> So once again instead of handing a numa node to each process - it's
>> handing the same node to both.
>>
>>
>> How would I start debugging this?
>>
>> Or am I missing something really obvious?
>>
>>
>>
>> Thanks!
>> ---------
>>
>> Bill Ryder
>> Weta Digital
>>
>>
>>
>>
>>
>> A bit more data:
>>
>> mpiexec.hydra --info
>> HYDRA build details:
>> Version: 3.0.4
>> Release Date: Wed Apr 24 10:08:10 CDT 2013
>> CC: cc
>> CXX:
>> F77:
>> F90:
>> Configure options: '--prefix=/tech/apps/mpich/hydra'
>> Process Manager: pmi
>> Launchers available: ssh rsh fork slurm ll lsf sge manual persist
>> Topology libraries available: hwloc
>> Resource management kernels available: user slurm ll lsf sge pbs cobalt
>> Checkpointing libraries available:
>> Demux engines available: poll select
>>
>>
>>
>> I have hwloc 1.3.1 installed locally on each machine
>>
>> abrams201a looks like:
>>
>> Machine (48GB)
>> NUMANode L#0 (P#1 24GB) + Socket L#0 + L3 L#0 (12MB)
>> L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>> PU L#0 (P#0)
>> PU L#1 (P#12)
>> L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>> PU L#2 (P#2)
>> PU L#3 (P#14)
>> L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>> PU L#4 (P#4)
>> PU L#5 (P#16)
>> L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>> PU L#6 (P#6)
>> PU L#7 (P#18)
>> L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
>> PU L#8 (P#8)
>> PU L#9 (P#20)
>> L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
>> PU L#10 (P#10)
>> PU L#11 (P#22)
>> NUMANode L#1 (P#0 24GB) + Socket L#1 + L3 L#1 (12MB)
>> L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
>> PU L#12 (P#1)
>> PU L#13 (P#13)
>> L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
>> PU L#14 (P#3)
>> PU L#15 (P#15)
>> L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
>> PU L#16 (P#5)
>> PU L#17 (P#17)
>> L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
>> PU L#18 (P#7)
>> PU L#19 (P#19)
>> L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
>> PU L#20 (P#9)
>> PU L#21 (P#21)
>> L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
>> PU L#22 (P#11)
>> PU L#23 (P#23)
>> HostBridge L#0
>> PCIBridge
>> PCI 8086:10e7
>> Net L#0 "eth0"
>> PCI 8086:10e7
>> Net L#1 "eth1"
>> PCIBridge
>> PCI 15b3:6746
>> Net L#2 "eth2"
>> OpenFabrics L#3 "mlx4_0"
>> PCI 15b3:6746
>> PCI 15b3:6746
>> PCI 15b3:6746
>> PCIBridge
>> PCI 102b:0533
>> PCI 8086:3a20
>> Block L#4 "sda"
>>
>>
>> And jericho101 looks like:
>>
>> Machine (96GB)
>> NUMANode L#0 (P#0 48GB)
>> Socket L#0 + L3 L#0 (20MB)
>> L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>> PU L#0 (P#0)
>> PU L#1 (P#16)
>> L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>> PU L#2 (P#1)
>> PU L#3 (P#17)
>> L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>> PU L#4 (P#2)
>> PU L#5 (P#18)
>> L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>> PU L#6 (P#3)
>> PU L#7 (P#19)
>> L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
>> PU L#8 (P#4)
>> PU L#9 (P#20)
>> L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
>> PU L#10 (P#5)
>> PU L#11 (P#21)
>> L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
>> PU L#12 (P#6)
>> PU L#13 (P#22)
>> L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
>> PU L#14 (P#7)
>> PU L#15 (P#23)
>> HostBridge L#0
>> PCIBridge
>> PCI 14e4:168e
>> Net L#0 "eth0"
>> PCI 14e4:168e
>> Net L#1 "eth1"
>> PCI 14e4:168e
>> Net L#2 "eth2"
>> PCI 14e4:168e
>> Net L#3 "eth3"
>> PCI 14e4:168e
>> Net L#4 "eth4"
>> PCI 14e4:168e
>> Net L#5 "eth5"
>> PCI 14e4:168e
>> Net L#6 "eth6"
>> PCI 14e4:168e
>> Net L#7 "eth7"
>> PCIBridge
>> PCI 103c:323b
>> Block L#8 "sda"
>> PCIBridge
>> PCI 102b:0533
>> NUMANode L#1 (P#1 48GB) + Socket L#1 + L3 L#1 (20MB)
>> L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
>> PU L#16 (P#8)
>> PU L#17 (P#24)
>> L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
>> PU L#18 (P#9)
>> PU L#19 (P#25)
>> L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
>> PU L#20 (P#10)
>> PU L#21 (P#26)
>> L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
>> PU L#22 (P#11)
>> PU L#23 (P#27)
>> L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12
>> PU L#24 (P#12)
>> PU L#25 (P#28)
>> L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13
>> PU L#26 (P#13)
>> PU L#27 (P#29)
>> L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14
>> PU L#28 (P#14)
>> PU L#29 (P#30)
>> L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15
>> PU L#30 (P#15)
>> PU L#31 (P#31)
>>