[mpich-discuss] affinity problems with mpiexec.hydra 3.0.4
Ken Raffenetti
raffenet at mcs.anl.gov
Tue Sep 17 08:50:32 CDT 2013
Hi Bill,
I'm not seeing any obvious problems. However, in my own testing, I found a situation where the grep method you used was unreliable. Can you try using "hwloc-bind --get" instead and see if the results differ?
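For example, something along these lines in place of the grep (just a sketch, assuming the hwloc utilities are in the PATH on the compute nodes):

  #!/bin/sh
  # Print host, rank, and the binding as hwloc observes it.
  echo "`hostname` $PMI_RANK `hwloc-bind --get`"

hwloc-bind --get reports the current cpuset as a hex bitmask, so the output is directly comparable with the Cpus_allowed values in your message below.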
Ken
----- Original Message -----
> From: "Bill Ryder" <bryder at wetafx.co.nz>
> To: discuss at mpich.org
> Sent: Monday, September 16, 2013 6:04:45 PM
> Subject: [mpich-discuss] affinity problems with mpiexec.hydra 3.0.4
>
> Greetings all,
>
> I'm trying to set affinity for hybrid MPI/OpenMP tasks.
>
> I want to run two processes on a host, and give one socket to one
> rank, and the other socket to the other rank.
>
> I have two types of hardware: one works perfectly, the other doesn't.
>
> I first saw the problem using mpiexec.hydra with slurm, but I've moved
> to using ssh to remove some possible variables.
>
>
> I have a trivial script which just greps the Cpus_allowed mask and
> Cpus_allowed_list out of /proc/$$/status.
>
> It's just: echo "`hostname` $PMI_RANK `grep Cpus_allowed /proc/$$/status`"
>
>
> On the machine that is doing what I want, I get:
>
> mpiexec.hydra -ppn 2 -hosts abrams201a --bind-to socket -launcher ssh ./get_mapping
> abrams201a 0 Cpus_allowed: 00555555
> Cpus_allowed_list: 0,2,4,6,8,10,12,14,16,18,20,22
> abrams201a 1 Cpus_allowed: 00aaaaaa
> Cpus_allowed_list: 1,3,5,7,9,11,13,15,17,19,21,23
>
> Rank 0 gets one socket, rank 1 gets the other socket. This is what I
> want.
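>
> (As a cross-check, hwloc-calc can show which cpuset a socket corresponds
> to, assuming the hwloc utilities are installed. On abrams201a,
>
>   hwloc-calc socket:0
>
> should print 0x00555555, i.e. exactly rank 0's mask above.)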
>
> But on my other machine with a different topology I get this:
>
> mpiexec.hydra -ppn 2 -hosts jericho101 --bind-to socket -launcher ssh ./get_mapping
>
> jericho101 0 Cpus_allowed: 00000000,00ff00ff
> Cpus_allowed_list: 0-7,16-23
> jericho101 1 Cpus_allowed: 00000000,00ff00ff
> Cpus_allowed_list: 0-7,16-23
>
> So both ranks have been bound to the same socket.
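>
> (For the record, hwloc-calc socket:0 on jericho101 should print
> 0x00ff00ff, matching the mask both ranks got above, i.e. they both
> landed on Socket L#0 / NUMANode L#0.)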
>
> Similarly, if I try to bind to a NUMA node:
>
> mpiexec.hydra -ppn 2 -hosts jericho101 --bind-to numa -launcher ssh ./get_mapping
> jericho101 1 Cpus_allowed: 00000000,00ff00ff
> Cpus_allowed_list: 0-7,16-23
> jericho101 0 Cpus_allowed: 00000000,00ff00ff
> Cpus_allowed_list: 0-7,16-23
>
>
> Or even if I pass numa:2
>
> mpiexec.hydra -ppn 2 -hosts jericho101 --bind-to numa:2 -launcher ssh ./get_mapping
>
> jericho101 0 Cpus_allowed: 00000000,00ff00ff
> Cpus_allowed_list: 0-7,16-23
> jericho101 1 Cpus_allowed: 00000000,00ff00ff
> Cpus_allowed_list: 0-7,16-23
>
> So once again, instead of handing a different NUMA node to each
> process, it's handing the same node to both.
>
>
> How would I start debugging this?
>
> Or am I missing something really obvious?
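>
> In case it helps narrow things down, a wrapper that pins each rank
> itself would be one thing to try (a hypothetical sketch, assuming the
> hwloc utilities are installed on the node; "pin_and_map" is just a
> name I made up here):
>
>   #!/bin/sh
>   # pin_and_map: bind this rank to socket $PMI_RANK, then run the
>   # reporting script with that binding in place.
>   exec hwloc-bind socket:$PMI_RANK -- ./get_mapping
>
> launched the same way, e.g.
>
>   mpiexec.hydra -ppn 2 -hosts jericho101 -launcher ssh ./pin_and_map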
>
>
>
> Thanks!
> ---------
>
> Bill Ryder
> Weta Digital
>
>
>
>
>
> A bit more data:
>
> mpiexec.hydra --info
> HYDRA build details:
>     Version:                                 3.0.4
>     Release Date:                            Wed Apr 24 10:08:10 CDT 2013
>     CC:                                      cc
>     CXX:
>     F77:
>     F90:
>     Configure options:                       '--prefix=/tech/apps/mpich/hydra'
>     Process Manager:                         pmi
>     Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
>     Topology libraries available:            hwloc
>     Resource management kernels available:   user slurm ll lsf sge pbs cobalt
>     Checkpointing libraries available:
>     Demux engines available:                 poll select
>
>
>
> I have hwloc 1.3.1 installed locally on each machine.
>
> abrams201a looks like:
>
> Machine (48GB)
> NUMANode L#0 (P#1 24GB) + Socket L#0 + L3 L#0 (12MB)
> L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
> PU L#0 (P#0)
> PU L#1 (P#12)
> L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
> PU L#2 (P#2)
> PU L#3 (P#14)
> L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
> PU L#4 (P#4)
> PU L#5 (P#16)
> L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
> PU L#6 (P#6)
> PU L#7 (P#18)
> L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
> PU L#8 (P#8)
> PU L#9 (P#20)
> L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
> PU L#10 (P#10)
> PU L#11 (P#22)
> NUMANode L#1 (P#0 24GB) + Socket L#1 + L3 L#1 (12MB)
> L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
> PU L#12 (P#1)
> PU L#13 (P#13)
> L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
> PU L#14 (P#3)
> PU L#15 (P#15)
> L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
> PU L#16 (P#5)
> PU L#17 (P#17)
> L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
> PU L#18 (P#7)
> PU L#19 (P#19)
> L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
> PU L#20 (P#9)
> PU L#21 (P#21)
> L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
> PU L#22 (P#11)
> PU L#23 (P#23)
> HostBridge L#0
> PCIBridge
> PCI 8086:10e7
> Net L#0 "eth0"
> PCI 8086:10e7
> Net L#1 "eth1"
> PCIBridge
> PCI 15b3:6746
> Net L#2 "eth2"
> OpenFabrics L#3 "mlx4_0"
> PCI 15b3:6746
> PCI 15b3:6746
> PCI 15b3:6746
> PCIBridge
> PCI 102b:0533
> PCI 8086:3a20
> Block L#4 "sda"
>
>
> And jericho101 looks like:
>
> Machine (96GB)
> NUMANode L#0 (P#0 48GB)
> Socket L#0 + L3 L#0 (20MB)
> L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
> PU L#0 (P#0)
> PU L#1 (P#16)
> L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
> PU L#2 (P#1)
> PU L#3 (P#17)
> L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
> PU L#4 (P#2)
> PU L#5 (P#18)
> L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
> PU L#6 (P#3)
> PU L#7 (P#19)
> L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
> PU L#8 (P#4)
> PU L#9 (P#20)
> L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
> PU L#10 (P#5)
> PU L#11 (P#21)
> L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
> PU L#12 (P#6)
> PU L#13 (P#22)
> L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
> PU L#14 (P#7)
> PU L#15 (P#23)
> HostBridge L#0
> PCIBridge
> PCI 14e4:168e
> Net L#0 "eth0"
> PCI 14e4:168e
> Net L#1 "eth1"
> PCI 14e4:168e
> Net L#2 "eth2"
> PCI 14e4:168e
> Net L#3 "eth3"
> PCI 14e4:168e
> Net L#4 "eth4"
> PCI 14e4:168e
> Net L#5 "eth5"
> PCI 14e4:168e
> Net L#6 "eth6"
> PCI 14e4:168e
> Net L#7 "eth7"
> PCIBridge
> PCI 103c:323b
> Block L#8 "sda"
> PCIBridge
> PCI 102b:0533
> NUMANode L#1 (P#1 48GB) + Socket L#1 + L3 L#1 (20MB)
> L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
> PU L#16 (P#8)
> PU L#17 (P#24)
> L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
> PU L#18 (P#9)
> PU L#19 (P#25)
> L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
> PU L#20 (P#10)
> PU L#21 (P#26)
> L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
> PU L#22 (P#11)
> PU L#23 (P#27)
> L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12
> PU L#24 (P#12)
> PU L#25 (P#28)
> L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13
> PU L#26 (P#13)
> PU L#27 (P#29)
> L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14
> PU L#28 (P#14)
> PU L#29 (P#30)
> L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15
> PU L#30 (P#15)
> PU L#31 (P#31)
>