[mpich-discuss] affinity problems with mpiexec.hydra 3.0.4

Bill Ryder bryder at wetafx.co.nz
Sat Sep 21 16:08:33 CDT 2013


That is cleared up now.

Thanks Ken.
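
For the archive, here's how I now read the option on a two-socket node
like jericho101. This is a rough sketch only - the masks come from the
runs quoted below, and the socket:2 case is just my reading of Ken's
explanation, not something I've verified:

  # default grouping: one socket per rank
  mpiexec.hydra -ppn 2 -hosts jericho101 -launcher ssh \
      --bind-to socket utils/get_affinity
  # expected: rank 0 -> 0x00ff00ff (socket 0), rank 1 -> 0xff00ff00 (socket 1)
  # (this is the ppn=2 case Ken is filing a report for)

  # socket:1 is just a synonym for socket
  mpiexec.hydra -ppn 2 -hosts jericho101 -launcher ssh \
      --bind-to socket:1 utils/get_affinity

  # socket:2 treats the two sockets as one group, so each rank should get
  # the union of both sockets (0xffffffff here)
  mpiexec.hydra -ppn 2 -hosts jericho101 -launcher ssh \
      --bind-to socket:2 utils/get_affinity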

---
Bill
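
P.S. For anyone trying to reproduce this later: utils/get_affinity is
just a small /bin/sh wrapper along these lines (a from-memory sketch,
not the exact file); the host.rank prefix in the output comes from
mpiexec's --prepend-pattern "%h.%r ":

  #!/bin/sh
  # Report this rank's binding as the kernel sees it and as hwloc sees it.
  mask=`grep Cpus_allowed: /proc/$$/status | awk '{print $2}'`
  list=`grep Cpus_allowed_list: /proc/$$/status | awk '{print $2}'`
  echo "Affinity from /proc/\$\$/status :  $mask $list"
  echo "Affinity from hwloc-bind --get :  `hwloc-bind --get`"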

On 21/09/2013 3:11 a.m., Ken Raffenetti wrote:
> Bill,
>
> I have a correction on the interpretation of -bind-to.
>
>> For two socket machines:
>>
>> If I use --bind-to socket I expect one rank on one socket, and the
>> other on the other socket.
> This is correct.
>
>> If I use --bind-to socket:1 I expect both ranks on the same socket.
> This is incorrect. "--bind-to socket:1" is equivalent to "--bind-to socket"
>
>> If I use --bind-to socket:2 I expect one rank on each socket (ie the
>> same as --bind-to socket for a two socket machine)
> Also incorrect. This would bind each process to the group of 2 sockets taken together.
>
> I hope that is cleared up. Now there does seem to be a bug in the "--bind-to socket" case when ppn=2 on some hardware. I am able to replicate this behavior and will file a report. With ppn>2, things work as expected on the same hardware.
>
> Ken
>
> ----- Original Message -----
>> From: "Bill Ryder" <bryder at wetafx.co.nz>
>> To: discuss at mpich.org
>> Sent: Wednesday, September 18, 2013 4:10:49 PM
>> Subject: Re: [mpich-discuss] affinity problems with mpiexec.hydra 3.0.4
>>
>> Hi Ken,
>>
>> Same result using hwloc-bind (which is a relief because if that
>> didn't agree with /proc/pid/status I would have been unpleasantly
>> surprised!)
>>
>> The most curious thing is that if I set socket:3 I get the exact
>> binding I want!
>>
>> Perhaps my interpretation of --ppn 2 --bind-to socket is incorrect.
>>
>> For two socket machines:
>>
>> If I use --bind-to socket I expect one rank on one socket, and the
>> other on the other socket.
>>
>> If I use --bind-to socket:1 I expect both ranks on the same socket.
>>
>> If I use --bind-to socket:2 I expect one rank on each socket (ie the
>> same as --bind-to socket for a two socket machine)
>>
>> Please let me know if that's incorrect.
>>
>>
>>
>> Here's what I get trying various socket counts, using hwloc-bind and
>> looking at /proc/$$/status:
>>
>>
>> --bind-to socket - I expect each rank to get its own socket - but
>> that doesn't happen
>>
>> hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2
>> --launcher=ssh --host=jericho101 --bind-to socket utils/get_affinity
>> 2> /dev/null
>> jericho101.0 Affinity from /proc/$$/status :  00ff00ff 0-7,16-23
>> jericho101.1 Affinity from /proc/$$/status :  00ff00ff 0-7,16-23
>> jericho101.1 Affinity from hwloc-bind --get :  0x00ff00ff
>> jericho101.0 Affinity from hwloc-bind --get :  0x00ff00ff
>>
>> socket:1 - both processes bound to the same socket. This is what I
>> expect socket:1 to do.
>>
>> hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2
>> --launcher=ssh --host=jericho101 --bind-to socket:1
>> utils/get_affinity
>> jericho101.0 Affinity from /proc/$$/status :  00ff00ff 0-7,16-23
>> jericho101.1 Affinity from /proc/$$/status :  00ff00ff 0-7,16-23
>> jericho101.0 Affinity from hwloc-bind --get :  0x00ff00ff
>> jericho101.1 Affinity from hwloc-bind --get :  0x00ff00ff
>>
>> socket:2 - I end up with the same binding as socket and socket:1
>>
>> hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2
>> --launcher=ssh --host=jericho101 --bind-to socket:2
>> utils/get_affinity
>> jericho101.0 Affinity from /proc/$$/status :  00ff00ff 0-7,16-23
>> jericho101.1 Affinity from /proc/$$/status :  00ff00ff 0-7,16-23
>> jericho101.0 Affinity from hwloc-bind --get :  0x00ff00ff
>> jericho101.1 Affinity from hwloc-bind --get :  0x00ff00ff
>>
>> socket:3 - this gets strange - this is the affinity I want!
>>
>> hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2
>> --launcher=ssh --host=jericho101 --bind-to socket:3
>> utils/get_affinity
>> jericho101.0 Affinity from /proc/$$/status :  00ff00ff 0-7,16-23
>> jericho101.1 Affinity from /proc/$$/status :  ff00ff00 8-15,24-31
>> jericho101.0 Affinity from hwloc-bind --get :  0x00ff00ff
>> jericho101.1 Affinity from hwloc-bind --get :  0xff00ff00
>>
>> I was originally asked to look at this because of bad runtimes on
>> faster hardware - fixing the affinity fixed the performance problem.
>> For the test case I was using, binding each process to a socket gave me
>> 8% better performance, so it's definitely worth working on.
>>
>> Let me know what I can do to help - I don't mind gdbing into mpiexec
>> or something or shoving prints into the code.
>>
>> Also, a probably silly question - does hydra_pmi_proxy figure out and
>> set the affinity? That seems like the logical place.
>>
>> I should also note that the other machine, where --bind-to socket and
>> socket:1 do the right thing, seems to do the wrong thing with socket:2.
>>
>> Of course this all assumes I'm interpreting that socket option
>> correctly.
>>
>>
>> hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2
>> --launcher=ssh --host=abrams211a --bind-to socket utils/get_affinity
>> abrams211a.0 Affinity from /proc/$$/status :  00555555
>> 0,2,4,6,8,10,12,14,16,18,20,22
>> abrams211a.1 Affinity from /proc/$$/status :  00aaaaaa
>> 1,3,5,7,9,11,13,15,17,19,21,23
>> abrams211a.0 Affinity from hwloc-bind --get :  0x00555555
>> abrams211a.1 Affinity from hwloc-bind --get :  0x00aaaaaa
>>
>> mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2 --launcher=ssh
>> --host=abrams211a --bind-to socket:1 utils/get_affinity
>> abrams211a.0 Affinity from /proc/$$/status :  00555555
>> 0,2,4,6,8,10,12,14,16,18,20,22
>> abrams211a.1 Affinity from /proc/$$/status :  00aaaaaa
>> 1,3,5,7,9,11,13,15,17,19,21,23
>> abrams211a.0 Affinity from hwloc-bind --get :  0x00555555
>> abrams211a.1 Affinity from hwloc-bind --get :  0x00aaaaaa
>>
>>
>> hydra-3.0.4/mpiexec.hydra --prepend-pattern "%h.%r " --ppn=2
>> --launcher=ssh --host=abrams211a --bind-to socket:2
>> utils/get_affinity
>> abrams211a.0 Affinity from /proc/$$/status :  00555555
>> 0,2,4,6,8,10,12,14,16,18,20,22
>> abrams211a.1 Affinity from /proc/$$/status :  00555555
>> 0,2,4,6,8,10,12,14,16,18,20,22
>> abrams211a.0 Affinity from hwloc-bind --get :  0x00555555
>> abrams211a.1 Affinity from hwloc-bind --get :  0x00555555
>>
>>
>>
>> On 09/18/2013 01:50 AM, Ken Raffenetti wrote:
>>> Hi Bill,
>>>
>>> I'm not seeing any obvious problems. However, in my own testing, I
>>> found a situation where the grep method you used was unreliable.
>>> Can you try using "hwloc-bind --get" instead and see if the
>>> results differ?
>>>
>>> Ken
>>>
>>> ----- Original Message -----
>>>> From: "Bill Ryder" <bryder at wetafx.co.nz>
>>>> To: discuss at mpich.org
>>>> Sent: Monday, September 16, 2013 6:04:45 PM
>>>> Subject: [mpich-discuss] affinity problems with mpiexec.hydra
>>>> 3.0.4
>>>>
>>>> Greetings all,
>>>>
>>>> I'm trying to set affinity for hybrid MPI/OpenMP tasks.
>>>>
>>>> I want to run two processes on a host, and give one socket to one
>>>> rank, and the other socket to the other rank.
>>>>
>>>> I have two types of hardware - one works perfectly, the other doesn't.
>>>>
>>>> I first saw the problem using mpiexec.hydra with slurm, but I've
>>>> moved to using ssh to remove some possible variables.
>>>>
>>>>
>>>> I have a trivial script which just greps /proc/$$/status for the
>>>> Cpus_allowed mask and Cpus_allowed_list.
>>>>
>>>> It's just: echo "`hostname` $PMI_RANK `grep Cpus_allowed /proc/$$/status`"
>>>>
>>>>
>>>> On the machine that is doing what I want I get:
>>>>
>>>> mpiexec.hydra -ppn 2 -hosts abrams201a --bind-to socket -launcher ssh ./get_mapping
>>>> abrams201a 0 Cpus_allowed:    00555555
>>>> Cpus_allowed_list:    0,2,4,6,8,10,12,14,16,18,20,22
>>>> abrams201a 1 Cpus_allowed:    00aaaaaa
>>>> Cpus_allowed_list:    1,3,5,7,9,11,13,15,17,19,21,23
>>>>
>>>> Rank 0 gets one socket, rank 1 gets the other socket. This is what
>>>> I want.
>>>>
>>>> But on my other machine with a different topology I get this:
>>>>
>>>> mpiexec.hydra -ppn 2 -hosts jericho101  --bind-to socket -launcher
>>>> ssh  ./get_mapping
>>>>
>>>> jericho101 0 Cpus_allowed:    00000000,00ff00ff
>>>> Cpus_allowed_list:    0-7,16-23
>>>> jericho101 1 Cpus_allowed:    00000000,00ff00ff
>>>> Cpus_allowed_list:    0-7,16-23
>>>>
>>>> So each rank is trying to use the same socket.
>>>>
>>>> Similarly if I try to bind to a numanode
>>>>
>>>> mpiexec.hydra -ppn 2 -hosts jericho101 --bind-to numa -launcher ssh ./get_mapping
>>>> jericho101 1 Cpus_allowed:    00000000,00ff00ff
>>>> Cpus_allowed_list:    0-7,16-23
>>>> jericho101 0 Cpus_allowed:    00000000,00ff00ff
>>>> Cpus_allowed_list:    0-7,16-23
>>>>
>>>>
>>>> Or even if I pass numa:2
>>>>
>>>> mpiexec.hydra -ppn 2 -hosts jericho101  --bind-to numa:2 -launcher
>>>> ssh  ./get_mapping
>>>>
>>>> jericho101 0 Cpus_allowed:    00000000,00ff00ff
>>>> Cpus_allowed_list:    0-7,16-23
>>>> jericho101 1 Cpus_allowed:    00000000,00ff00ff
>>>> Cpus_allowed_list:    0-7,16-23
>>>>
>>>> So once again, instead of handing a NUMA node to each process, it's
>>>> handing the same node to both.
>>>>
>>>>
>>>> How would I start debugging this?
>>>>
>>>> Or am I missing something really obvious?
>>>>
>>>>
>>>>
>>>> Thanks!
>>>> ---------
>>>>
>>>> Bill Ryder
>>>> Weta Digital
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> A bit more data:
>>>>
>>>> mpiexec.hydra  --info
>>>> HYDRA build details:
>>>>        Version:                                 3.0.4
>>>>        Release Date:                            Wed Apr 24 10:08:10 CDT 2013
>>>>        CC:                                      cc
>>>>        CXX:
>>>>        F77:
>>>>        F90:
>>>>        Configure options:                       '--prefix=/tech/apps/mpich/hydra'
>>>>        Process Manager:                         pmi
>>>>        Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
>>>>        Topology libraries available:            hwloc
>>>>        Resource management kernels available:   user slurm ll lsf sge pbs cobalt
>>>>        Checkpointing libraries available:
>>>>        Demux engines available:                 poll select
>>>>
>>>>
>>>>
>>>> I have hwloc 1.3.1 installed locally on each machine
>>>>
>>>> abrams201a looks like:
>>>>
>>>> Machine (48GB)
>>>>      NUMANode L#0 (P#1 24GB) + Socket L#0 + L3 L#0 (12MB)
>>>>        L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>>>>          PU L#0 (P#0)
>>>>          PU L#1 (P#12)
>>>>        L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>>>>          PU L#2 (P#2)
>>>>          PU L#3 (P#14)
>>>>        L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>>>>          PU L#4 (P#4)
>>>>          PU L#5 (P#16)
>>>>        L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>>>>          PU L#6 (P#6)
>>>>          PU L#7 (P#18)
>>>>        L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
>>>>          PU L#8 (P#8)
>>>>          PU L#9 (P#20)
>>>>        L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
>>>>          PU L#10 (P#10)
>>>>          PU L#11 (P#22)
>>>>      NUMANode L#1 (P#0 24GB) + Socket L#1 + L3 L#1 (12MB)
>>>>        L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
>>>>          PU L#12 (P#1)
>>>>          PU L#13 (P#13)
>>>>        L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
>>>>          PU L#14 (P#3)
>>>>          PU L#15 (P#15)
>>>>        L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
>>>>          PU L#16 (P#5)
>>>>          PU L#17 (P#17)
>>>>        L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
>>>>          PU L#18 (P#7)
>>>>          PU L#19 (P#19)
>>>>        L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
>>>>          PU L#20 (P#9)
>>>>          PU L#21 (P#21)
>>>>        L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
>>>>          PU L#22 (P#11)
>>>>          PU L#23 (P#23)
>>>>      HostBridge L#0
>>>>        PCIBridge
>>>>          PCI 8086:10e7
>>>>            Net L#0 "eth0"
>>>>          PCI 8086:10e7
>>>>            Net L#1 "eth1"
>>>>        PCIBridge
>>>>          PCI 15b3:6746
>>>>            Net L#2 "eth2"
>>>>            OpenFabrics L#3 "mlx4_0"
>>>>          PCI 15b3:6746
>>>>          PCI 15b3:6746
>>>>          PCI 15b3:6746
>>>>        PCIBridge
>>>>          PCI 102b:0533
>>>>        PCI 8086:3a20
>>>>          Block L#4 "sda"
>>>>
>>>>
>>>> And jericho101 looks like:
>>>>
>>>> Machine (96GB)
>>>>      NUMANode L#0 (P#0 48GB)
>>>>        Socket L#0 + L3 L#0 (20MB)
>>>>          L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>>>>            PU L#0 (P#0)
>>>>            PU L#1 (P#16)
>>>>          L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>>>>            PU L#2 (P#1)
>>>>            PU L#3 (P#17)
>>>>          L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>>>>            PU L#4 (P#2)
>>>>            PU L#5 (P#18)
>>>>          L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>>>>            PU L#6 (P#3)
>>>>            PU L#7 (P#19)
>>>>          L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
>>>>            PU L#8 (P#4)
>>>>            PU L#9 (P#20)
>>>>          L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
>>>>            PU L#10 (P#5)
>>>>            PU L#11 (P#21)
>>>>          L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
>>>>            PU L#12 (P#6)
>>>>            PU L#13 (P#22)
>>>>          L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
>>>>            PU L#14 (P#7)
>>>>            PU L#15 (P#23)
>>>>        HostBridge L#0
>>>>          PCIBridge
>>>>            PCI 14e4:168e
>>>>              Net L#0 "eth0"
>>>>            PCI 14e4:168e
>>>>              Net L#1 "eth1"
>>>>            PCI 14e4:168e
>>>>              Net L#2 "eth2"
>>>>            PCI 14e4:168e
>>>>              Net L#3 "eth3"
>>>>            PCI 14e4:168e
>>>>              Net L#4 "eth4"
>>>>            PCI 14e4:168e
>>>>              Net L#5 "eth5"
>>>>            PCI 14e4:168e
>>>>              Net L#6 "eth6"
>>>>            PCI 14e4:168e
>>>>              Net L#7 "eth7"
>>>>          PCIBridge
>>>>            PCI 103c:323b
>>>>              Block L#8 "sda"
>>>>          PCIBridge
>>>>            PCI 102b:0533
>>>>      NUMANode L#1 (P#1 48GB) + Socket L#1 + L3 L#1 (20MB)
>>>>        L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
>>>>          PU L#16 (P#8)
>>>>          PU L#17 (P#24)
>>>>        L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
>>>>          PU L#18 (P#9)
>>>>          PU L#19 (P#25)
>>>>        L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
>>>>          PU L#20 (P#10)
>>>>          PU L#21 (P#26)
>>>>        L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
>>>>          PU L#22 (P#11)
>>>>          PU L#23 (P#27)
>>>>        L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12
>>>>          PU L#24 (P#12)
>>>>          PU L#25 (P#28)
>>>>        L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13
>>>>          PU L#26 (P#13)
>>>>          PU L#27 (P#29)
>>>>        L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14
>>>>          PU L#28 (P#14)
>>>>          PU L#29 (P#30)
>>>>        L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15
>>>>          PU L#30 (P#15)
>>>>          PU L#31 (P#31)
>>>>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss



