[mpich-discuss] scheduling to real hw cores, not using hyperthreading (mpich-3.2)

Heinz-Ado Arnolds arnolds at MPA-Garching.MPG.DE
Thu Apr 13 02:40:33 CDT 2017


Dear Ken,

thanks a lot for your fast reply!

See my comments/questions inline too:

On 12.04.2017 21:38, Kenneth Raffenetti wrote:
> Hi,
> 
> Some answers inline:
> 
> On 04/12/2017 10:02 AM, Heinz-Ado Arnolds wrote:
>> Dear MPIch users and developers,
>>
>> first of all many thanks for all the great work you have done for MPIch!
>>
>> I'd like SGE to schedule 4 MPI processes, each starting one OpenMP job with 10 threads, across 2 nodes, each having 2 sockets with 10 cores & 10 hwthreads. Only the 10 cores (not their hwthreads) should be used on each socket.
>>
>> 4 MPI processes: 1 OpenMP job with 10 threads each (i.e. 4x10 threads)
>> 2 nodes, 2 sockets each, 10 cores & 10 hwthreads each
>>
>> lscpu -a -e
>>
>> CPU NODE SOCKET CORE L1d:L1i:L2:L3
>> 0   0    0      0    0:0:0:0
>> 1   1    1      1    1:1:1:1
>> 2   0    0      2    2:2:2:0
>> 3   1    1      3    3:3:3:1
>> 4   0    0      4    4:4:4:0
>> 5   1    1      5    5:5:5:1
>> 6   0    0      6    6:6:6:0
>> 7   1    1      7    7:7:7:1
>> 8   0    0      8    8:8:8:0
>> 9   1    1      9    9:9:9:1
>> 10  0    0      10   10:10:10:0
>> 11  1    1      11   11:11:11:1
>> 12  0    0      12   12:12:12:0
>> 13  1    1      13   13:13:13:1
>> 14  0    0      14   14:14:14:0
>> 15  1    1      15   15:15:15:1
>> 16  0    0      16   16:16:16:0
>> 17  1    1      17   17:17:17:1
>> 18  0    0      18   18:18:18:0
>> 19  1    1      19   19:19:19:1
>> 20  0    0      0    0:0:0:0
>> 21  1    1      1    1:1:1:1
>> 22  0    0      2    2:2:2:0
>> 23  1    1      3    3:3:3:1
>> 24  0    0      4    4:4:4:0
>> 25  1    1      5    5:5:5:1
>> 26  0    0      6    6:6:6:0
>> 27  1    1      7    7:7:7:1
>> 28  0    0      8    8:8:8:0
>> 29  1    1      9    9:9:9:1
>> 30  0    0      10   10:10:10:0
>> 31  1    1      11   11:11:11:1
>> 32  0    0      12   12:12:12:0
>> 33  1    1      13   13:13:13:1
>> 34  0    0      14   14:14:14:0
>> 35  1    1      15   15:15:15:1
>> 36  0    0      16   16:16:16:0
>> 37  1    1      17   17:17:17:1
>> 38  0    0      18   18:18:18:0
>> 39  1    1      19   19:19:19:1
>>
>> When I try to submit the job by using
>>
>> HYDRA_TOPO_DEBUG=1 mpirun -np 4 -bind-to socket:1 -map-by socket ./myid
>>
>> the distribution to the 2 sockets on each of the 2 nodes is done correctly, but each process is bound to all 10 cores plus their 10 hwthreads of its socket:
>>
>>   process 0 binding: 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>>   process 1 binding: 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>>   process 2 binding: 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>>   process 3 binding: 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
> 
> The process binding code treats the machine as a hierarchy. If a core contains 2 hwthreads, then binding to that core effectively binds to both hwthreads. If you bind to one socket on your machine, you should see that as being bound to 20 hwthreads like above. You can view the system hierarchy if you have hwloc installed on your system using the 'lstopo' command. Hopefully that is answering your question.

Yes, I know the architecture of the machine and the hwloc & lstopo tools, and I see that the binding is done to both hwthreads of each core. But that's not what I'd like to get. Can you give me a hint how I could achieve a binding like this:

  process 0 binding: 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  process 1 binding: 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

i.e. binding to only one hwthread of each core, not both. That would be of great help to me. In OpenMPI that's done with "-use-hwthread-cpus".
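
Until mpirun/Hydra offers something comparable, the only workaround I can think of on the application side is to have each rank trim its own binding right after startup, before the OpenMP threads are created. Just a sketch of the idea using hwloc (not tested, and the function name is only an example):

  /* trim_binding.c: restrict the calling process to one PU (hwthread) per
   * core of its current cpuset.  Compile e.g. with: cc trim_binding.c -lhwloc
   * In an MPI+OpenMP code this would be called right after MPI_Init(),
   * before the first parallel region, so the OpenMP threads inherit it. */
  #include <hwloc.h>

  static void bind_one_pu_per_core(void)
  {
      hwloc_topology_t topo;
      hwloc_bitmap_t current, trimmed;
      int i, ncores;

      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);

      /* the binding Hydra gave us (e.g. one full socket, both hwthreads) */
      current = hwloc_bitmap_alloc();
      hwloc_get_cpubind(topo, current, HWLOC_CPUBIND_PROCESS);

      /* keep only the first PU of every core inside that binding */
      trimmed = hwloc_bitmap_alloc();
      ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
      for (i = 0; i < ncores; i++) {
          hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
          if (hwloc_bitmap_intersects(current, core->cpuset)) {
              hwloc_bitmap_t one = hwloc_bitmap_dup(core->cpuset);
              hwloc_bitmap_singlify(one);      /* drop the sibling hwthread */
              hwloc_bitmap_or(trimmed, trimmed, one);
              hwloc_bitmap_free(one);
          }
      }
      hwloc_set_cpubind(topo, trimmed, HWLOC_CPUBIND_PROCESS);

      hwloc_bitmap_free(trimmed);
      hwloc_bitmap_free(current);
      hwloc_topology_destroy(topo);
  }

  int main(void)
  {
      bind_one_pu_per_core();
      return 0;
  }

That would keep the socket placement done by Hydra and only drop the sibling hwthreads, but a real -bind-to option would of course be much nicer.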

>>
>> Additionally it seems that the CPU masks & Cpus_allowed_list are not set for the launched processes:
>>
>>   MPI Instance 0001 of 0004 is on pascal-1-04, 0x000000ff,0xffffffff, Cpus_allowed_list:    0-39
>>   MPI Instance 0002 of 0004 is on pascal-1-04, 0x000000ff,0xffffffff, Cpus_allowed_list:    0-39
>>   MPI Instance 0003 of 0004 is on pascal-3-06, 0x000000ff,0xffffffff, Cpus_allowed_list:    0-39
>>   MPI Instance 0004 of 0004 is on pascal-3-06, 0x000000ff,0xffffffff, Cpus_allowed_list:    0-39
> 
> This is true only if HYDRA_TOPO_DEBUG=1 is set. We changed that behavior based on user feedback, and future versions will actually set the binding when debug is enabled. https://github.com/pmodels/mpich/commit/a145fd73a1bb7dd4f06c8c6b7dff5f6c3695f6f6

Thanks for that helpful hint! I didn't know that I had to choose between a working *or* a verbose binding arrangement.
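
For context, the Cpus_allowed output above comes from my small myid test, which essentially just prints the rank, the host name and the affinity lines from /proc/self/status, roughly like this (reduced sketch, the real program differs in some details):

  #include <mpi.h>
  #include <stdio.h>

  /* Reduced "myid"-like check: report which CPUs the kernel allows
   * for this rank, as seen in /proc/self/status. */
  int main(int argc, char **argv)
  {
      int rank, size, namelen;
      char host[MPI_MAX_PROCESSOR_NAME];
      char line[256], allowed[128] = "?", allowed_list[128] = "?";
      FILE *fp;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Get_processor_name(host, &namelen);

      /* Cpus_allowed / Cpus_allowed_list reflect the binding Hydra applied */
      fp = fopen("/proc/self/status", "r");
      if (fp) {
          while (fgets(line, sizeof(line), fp)) {
              sscanf(line, "Cpus_allowed: %127s", allowed);
              sscanf(line, "Cpus_allowed_list: %127s", allowed_list);
          }
          fclose(fp);
      }

      printf("MPI Instance %04d of %04d is on %s, Cpus_allowed: %s, "
             "Cpus_allowed_list: %s\n", rank + 1, size, host, allowed,
             allowed_list);

      MPI_Finalize();
      return 0;
  }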

>>
>>
>> Another way to achieve my task would be to use "-bind-to user"
>>
>>   HYDRA_TOPO_DEBUG=1 mpirun -np $nmpi -ppn $nmpipn -bind-to user:0+2+4+6+8+10+12+14+16+18,1+3+5+7+9+11+13+15+17+19 ./myid
>>
>> This works great up to specifying 9 cores on each socket ("0+2+4+6+8+10+12+14+16,1+3+5+7+9+11+13+15+17"). As soon as I add ...+18,...+19, the job crashes with these messages:
>>
> 
> Probably overrunning a buffer. Thanks for the bug report :).
> 
> Ken

I'd very much appreciate advice on how to bind to cores without using all of their hwthreads!

Have a nice Easter

Ado

