[mpich-discuss] affinity problems with mpiexec.hydra 3.0.4

Ken Raffenetti raffenet at mcs.anl.gov
Tue Sep 17 08:50:32 CDT 2013


Hi Bill,

I'm not seeing any obvious problems. However, in my own testing, I found a situation where the grep method you used was unreliable. Can you try using "hwloc-bind --get" instead and see if the results differ?
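
For example, a small wrapper along these lines (an untested sketch; the
name get_binding is just a placeholder) should print the binding that
hwloc itself reports for each rank:

    #!/bin/sh
    # print hostname, PMI rank, and the current CPU binding as hwloc sees it
    echo "`hostname` $PMI_RANK `hwloc-bind --get`"

and run it the same way as your get_mapping script, e.g.

    mpiexec.hydra -ppn 2 -hosts jericho101 --bind-to socket -launcher ssh ./get_binding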

Ken

----- Original Message -----
> From: "Bill Ryder" <bryder at wetafx.co.nz>
> To: discuss at mpich.org
> Sent: Monday, September 16, 2013 6:04:45 PM
> Subject: [mpich-discuss] affinity problems with mpiexec.hydra 3.0.4
> 
> Greetings all,
> 
> I'm trying to set affinity for hybrid MPI/OpenMP tasks.
> 
> I want to run two processes on a host, and give one socket to one
> rank, and the other socket to the other rank.
> 
> I have two types of hardware: one works perfectly, the other doesn't.
> 
> I first saw the problem using mpiexec.hydra with slurm, but I've moved
> to using ssh to remove some possible variables.
> 
> 
> I have a trivial script which just greps the Cpus_allowed mask and
> Cpus_allowed_list out of /proc/$$/status.
> 
> It's just:
> 
>     echo "`hostname` $PMI_RANK `grep Cpus_allowed /proc/$$/status`"
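> 
> (In full, the file is nothing more than that line plus a shebang,
> roughly:)
> 
>     #!/bin/sh
>     # report which CPUs the kernel says this rank may run on
>     echo "`hostname` $PMI_RANK `grep Cpus_allowed /proc/$$/status`"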
> 
> 
> On the machine that is doing what I want I get:
> 
> mpiexec.hydra -ppn 2 -hosts abrams201a --bind-to socket -launcher ssh ./get_mapping
> abrams201a 0 Cpus_allowed:    00555555
> Cpus_allowed_list:    0,2,4,6,8,10,12,14,16,18,20,22
> abrams201a 1 Cpus_allowed:    00aaaaaa
> Cpus_allowed_list:    1,3,5,7,9,11,13,15,17,19,21,23
> 
> rank 0 gets one socket and rank 1 gets the other. This is what I want.
> 
> But on my other machine with a different topology I get this:
> 
> mpiexec.hydra -ppn 2 -hosts jericho101 --bind-to socket -launcher ssh ./get_mapping
> 
> jericho101 0 Cpus_allowed:    00000000,00ff00ff
> Cpus_allowed_list:    0-7,16-23
> jericho101 1 Cpus_allowed:    00000000,00ff00ff
> Cpus_allowed_list:    0-7,16-23
> 
> So both ranks end up bound to the same socket.
> 
> Similarly, if I try to bind to a NUMA node:
> 
> mpiexec.hydra -ppn 2 -hosts jericho101 --bind-to numa -launcher ssh ./get_mapping
> jericho101 1 Cpus_allowed:    00000000,00ff00ff
> Cpus_allowed_list:    0-7,16-23
> jericho101 0 Cpus_allowed:    00000000,00ff00ff
> Cpus_allowed_list:    0-7,16-23
> 
> 
> Or even if I pass numa:2 explicitly:
> 
> mpiexec.hydra -ppn 2 -hosts jericho101 --bind-to numa:2 -launcher ssh ./get_mapping
> 
> jericho101 0 Cpus_allowed:    00000000,00ff00ff
> Cpus_allowed_list:    0-7,16-23
> jericho101 1 Cpus_allowed:    00000000,00ff00ff
> Cpus_allowed_list:    0-7,16-23
> 
> So once again, instead of handing a NUMA node to each process, it's
> handing the same node to both.
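> 
> (For comparison, I believe hwloc-calc can print the cpuset hydra should
> be handing each rank, e.g.
> 
>     hwloc-calc socket:0
>     hwloc-calc socket:1
> 
> which, if I'm reading the lstopo output below correctly, should come out
> as roughly 0x00ff00ff and 0xff00ff00 on jericho101.)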
> 
> 
> How would I start debugging this?
> 
> Or am I missing something really obvious?
> 
> 
> 
> Thanks!
> ---------
> 
> Bill Ryder
> Weta Digital
> 
> 
> 
> 
> 
> A bit more data:
> 
> mpiexec.hydra  --info
> HYDRA build details:
>      Version:                                 3.0.4
>      Release Date:                            Wed Apr 24 10:08:10 CDT 2013
>      CC:                              cc
>      CXX:
>      F77:
>      F90:
>      Configure options: '--prefix=/tech/apps/mpich/hydra'
>      Process Manager:                         pmi
>      Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
>      Topology libraries available:            hwloc
>      Resource management kernels available:   user slurm ll lsf sge pbs cobalt
>      Checkpointing libraries available:
>      Demux engines available:                 poll select
> 
> 
> 
> I have hwloc 1.3.1 installed locally on each machine.
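> 
> The dumps below are plain-text lstopo output; something along these
> lines should reproduce them on each host:
> 
>     lstopo --of console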
> 
> abrams201a looks like:
> 
> Machine (48GB)
>    NUMANode L#0 (P#1 24GB) + Socket L#0 + L3 L#0 (12MB)
>      L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>        PU L#0 (P#0)
>        PU L#1 (P#12)
>      L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>        PU L#2 (P#2)
>        PU L#3 (P#14)
>      L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>        PU L#4 (P#4)
>        PU L#5 (P#16)
>      L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>        PU L#6 (P#6)
>        PU L#7 (P#18)
>      L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
>        PU L#8 (P#8)
>        PU L#9 (P#20)
>      L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
>        PU L#10 (P#10)
>        PU L#11 (P#22)
>    NUMANode L#1 (P#0 24GB) + Socket L#1 + L3 L#1 (12MB)
>      L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
>        PU L#12 (P#1)
>        PU L#13 (P#13)
>      L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
>        PU L#14 (P#3)
>        PU L#15 (P#15)
>      L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
>        PU L#16 (P#5)
>        PU L#17 (P#17)
>      L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
>        PU L#18 (P#7)
>        PU L#19 (P#19)
>      L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
>        PU L#20 (P#9)
>        PU L#21 (P#21)
>      L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
>        PU L#22 (P#11)
>        PU L#23 (P#23)
>    HostBridge L#0
>      PCIBridge
>        PCI 8086:10e7
>          Net L#0 "eth0"
>        PCI 8086:10e7
>          Net L#1 "eth1"
>      PCIBridge
>        PCI 15b3:6746
>          Net L#2 "eth2"
>          OpenFabrics L#3 "mlx4_0"
>        PCI 15b3:6746
>        PCI 15b3:6746
>        PCI 15b3:6746
>      PCIBridge
>        PCI 102b:0533
>      PCI 8086:3a20
>        Block L#4 "sda"
> 
> 
> And jericho101 looks like:
> 
> Machine (96GB)
>    NUMANode L#0 (P#0 48GB)
>      Socket L#0 + L3 L#0 (20MB)
>        L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>          PU L#0 (P#0)
>          PU L#1 (P#16)
>        L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>          PU L#2 (P#1)
>          PU L#3 (P#17)
>        L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>          PU L#4 (P#2)
>          PU L#5 (P#18)
>        L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>          PU L#6 (P#3)
>          PU L#7 (P#19)
>        L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
>          PU L#8 (P#4)
>          PU L#9 (P#20)
>        L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
>          PU L#10 (P#5)
>          PU L#11 (P#21)
>        L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
>          PU L#12 (P#6)
>          PU L#13 (P#22)
>        L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
>          PU L#14 (P#7)
>          PU L#15 (P#23)
>      HostBridge L#0
>        PCIBridge
>          PCI 14e4:168e
>            Net L#0 "eth0"
>          PCI 14e4:168e
>            Net L#1 "eth1"
>          PCI 14e4:168e
>            Net L#2 "eth2"
>          PCI 14e4:168e
>            Net L#3 "eth3"
>          PCI 14e4:168e
>            Net L#4 "eth4"
>          PCI 14e4:168e
>            Net L#5 "eth5"
>          PCI 14e4:168e
>            Net L#6 "eth6"
>          PCI 14e4:168e
>            Net L#7 "eth7"
>        PCIBridge
>          PCI 103c:323b
>            Block L#8 "sda"
>        PCIBridge
>          PCI 102b:0533
>    NUMANode L#1 (P#1 48GB) + Socket L#1 + L3 L#1 (20MB)
>      L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
>        PU L#16 (P#8)
>        PU L#17 (P#24)
>      L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
>        PU L#18 (P#9)
>        PU L#19 (P#25)
>      L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
>        PU L#20 (P#10)
>        PU L#21 (P#26)
>      L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
>        PU L#22 (P#11)
>        PU L#23 (P#27)
>      L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12
>        PU L#24 (P#12)
>        PU L#25 (P#28)
>      L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13
>        PU L#26 (P#13)
>        PU L#27 (P#29)
>      L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14
>        PU L#28 (P#14)
>        PU L#29 (P#30)
>      L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15
>        PU L#30 (P#15)
>        PU L#31 (P#31)
> 


