[mpich-discuss] affinity problems with mpiexec.hydra 3.0.4

Bill Ryder bryder at wetafx.co.nz
Mon Sep 16 18:04:45 CDT 2013


Greetings all,

I'm trying to set affinity for hybrid MPI/OpenMP tasks.

I want to run two processes on a host, giving one socket to one rank and the other socket to the other rank.

I have two types of hardware: one works perfectly, the other doesn't.

I first saw the problem using mpiexec.hydra with slurm, but I've moved to launching over ssh to remove some possible variables.


I have a trivial script which just greps /proc/$$/status for the Cpus_allowed mask and Cpus_allowed_list.

It's just: echo "`hostname` $PMI_RANK `grep Cpus_allowed /proc/$$/status`"
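
For reference, the whole script is just that one line (assuming /bin/sh; PMI_RANK is put in each process's environment by Hydra):

#!/bin/sh
# get_mapping: print the host, the MPI rank, and this shell's allowed CPUs
echo "`hostname` $PMI_RANK `grep Cpus_allowed /proc/$$/status`"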


On the machine that is doing what I want I get:

mpiexec.hydra -ppn 2 -hosts abrams201a --bind-to socket -launcher ssh  ./get_mapping
abrams201a 0 Cpus_allowed:    00555555
Cpus_allowed_list:    0,2,4,6,8,10,12,14,16,18,20,22
abrams201a 1 Cpus_allowed:    00aaaaaa
Cpus_allowed_list:    1,3,5,7,9,11,13,15,17,19,21,23

Rank 0 gets one socket and rank 1 gets the other: the two masks split the even-numbered and odd-numbered PUs, which matches how abrams201a numbers its PUs (see the topology below). This is what I want.

But on my other machine with a different topology I get this:

mpiexec.hydra -ppn 2 -hosts jericho101  --bind-to socket -launcher ssh  ./get_mapping

jericho101 0 Cpus_allowed:    00000000,00ff00ff
Cpus_allowed_list:    0-7,16-23
jericho101 1 Cpus_allowed:    00000000,00ff00ff
Cpus_allowed_list:    0-7,16-23

So both ranks are being handed the same socket: the mask covers socket 0's PUs (0-7 and 16-23), while socket 1's PUs (8-15 and 24-31) are never used.

Similarly, if I try to bind to a NUMA node:

mpiexec.hydra -ppn 2 -hosts jericho101  --bind-to numa -launcher ssh  ./get_mapping
jericho101 1 Cpus_allowed:    00000000,00ff00ff
Cpus_allowed_list:    0-7,16-23
jericho101 0 Cpus_allowed:    00000000,00ff00ff
Cpus_allowed_list:    0-7,16-23


Or even if I pass numa:2:

mpiexec.hydra -ppn 2 -hosts jericho101  --bind-to numa:2 -launcher ssh  ./get_mapping

jericho101 0 Cpus_allowed:    00000000,00ff00ff
Cpus_allowed_list:    0-7,16-23
jericho101 1 Cpus_allowed:    00000000,00ff00ff
Cpus_allowed_list:    0-7,16-23

So once again, instead of handing a different NUMA node to each process, it's handing the same node to both.
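
To double-check my reading of those masks, here's a quick bash sketch that expands the low 32-bit word of a Cpus_allowed mask into a CPU list (the mask value is taken from the output above):

#!/bin/bash
# decode the low word of "00000000,00ff00ff" into the CPUs it allows
mask=00ff00ff
for i in $(seq 0 31); do
    [ $(( 0x$mask >> i & 1 )) -eq 1 ] && printf '%d ' "$i"
done
echo

It prints CPUs 0-7 and 16-23, i.e. socket 0 on jericho101, for both ranks.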


How would I start debugging this?

Or am I missing something really obvious?



Thanks!
---------

Bill Ryder
Weta Digital





A bit more data:

mpiexec.hydra  --info
HYDRA build details:
     Version:                                 3.0.4
     Release Date:                            Wed Apr 24 10:08:10 CDT 2013
     CC:                              cc
     CXX:
     F77:
     F90:
     Configure options: '--prefix=/tech/apps/mpich/hydra'
     Process Manager:                         pmi
     Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
     Topology libraries available:            hwloc
     Resource management kernels available:   user slurm ll lsf sge pbs cobalt
     Checkpointing libraries available:
     Demux engines available:                 poll select



I have hwloc 1.3.1 installed locally on each machine.
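
In case it helps to separate Hydra from hwloc: hwloc ships an hwloc-bind tool that applies a binding directly, so running something like this on jericho101 (assuming hwloc-bind from that 1.3.1 install is on PATH) should show whether hwloc itself resolves socket 1 to the right PUs:

# bind to the second socket via hwloc alone, then inspect the inherited mask
hwloc-bind socket:1 -- grep Cpus_allowed /proc/self/status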

abrams201a looks like:

Machine (48GB)
   NUMANode L#0 (P#1 24GB) + Socket L#0 + L3 L#0 (12MB)
     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
       PU L#0 (P#0)
       PU L#1 (P#12)
     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
       PU L#2 (P#2)
       PU L#3 (P#14)
     L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
       PU L#4 (P#4)
       PU L#5 (P#16)
     L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
       PU L#6 (P#6)
       PU L#7 (P#18)
     L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
       PU L#8 (P#8)
       PU L#9 (P#20)
     L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
       PU L#10 (P#10)
       PU L#11 (P#22)
   NUMANode L#1 (P#0 24GB) + Socket L#1 + L3 L#1 (12MB)
     L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
       PU L#12 (P#1)
       PU L#13 (P#13)
     L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
       PU L#14 (P#3)
       PU L#15 (P#15)
     L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
       PU L#16 (P#5)
       PU L#17 (P#17)
     L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
       PU L#18 (P#7)
       PU L#19 (P#19)
     L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
       PU L#20 (P#9)
       PU L#21 (P#21)
     L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
       PU L#22 (P#11)
       PU L#23 (P#23)
   HostBridge L#0
     PCIBridge
       PCI 8086:10e7
         Net L#0 "eth0"
       PCI 8086:10e7
         Net L#1 "eth1"
     PCIBridge
       PCI 15b3:6746
         Net L#2 "eth2"
         OpenFabrics L#3 "mlx4_0"
       PCI 15b3:6746
       PCI 15b3:6746
       PCI 15b3:6746
     PCIBridge
       PCI 102b:0533
     PCI 8086:3a20
       Block L#4 "sda"


And jericho101 looks like:

Machine (96GB)
   NUMANode L#0 (P#0 48GB)
     Socket L#0 + L3 L#0 (20MB)
       L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
         PU L#0 (P#0)
         PU L#1 (P#16)
       L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
         PU L#2 (P#1)
         PU L#3 (P#17)
       L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
         PU L#4 (P#2)
         PU L#5 (P#18)
       L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
         PU L#6 (P#3)
         PU L#7 (P#19)
       L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
         PU L#8 (P#4)
         PU L#9 (P#20)
       L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
         PU L#10 (P#5)
         PU L#11 (P#21)
       L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
         PU L#12 (P#6)
         PU L#13 (P#22)
       L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
         PU L#14 (P#7)
         PU L#15 (P#23)
     HostBridge L#0
       PCIBridge
         PCI 14e4:168e
           Net L#0 "eth0"
         PCI 14e4:168e
           Net L#1 "eth1"
         PCI 14e4:168e
           Net L#2 "eth2"
         PCI 14e4:168e
           Net L#3 "eth3"
         PCI 14e4:168e
           Net L#4 "eth4"
         PCI 14e4:168e
           Net L#5 "eth5"
         PCI 14e4:168e
           Net L#6 "eth6"
         PCI 14e4:168e
           Net L#7 "eth7"
       PCIBridge
         PCI 103c:323b
           Block L#8 "sda"
       PCIBridge
         PCI 102b:0533
   NUMANode L#1 (P#1 48GB) + Socket L#1 + L3 L#1 (20MB)
     L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
       PU L#16 (P#8)
       PU L#17 (P#24)
     L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
       PU L#18 (P#9)
       PU L#19 (P#25)
     L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
       PU L#20 (P#10)
       PU L#21 (P#26)
     L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
       PU L#22 (P#11)
       PU L#23 (P#27)
     L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12
       PU L#24 (P#12)
       PU L#25 (P#28)
     L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13
       PU L#26 (P#13)
       PU L#27 (P#29)
     L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14
       PU L#28 (P#14)
       PU L#29 (P#30)
     L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15
       PU L#30 (P#15)
       PU L#31 (P#31)



