[mpich-discuss] scheduling to real hw cores, not using hyperthreading (mpich-3.2)

Kenneth Raffenetti raffenet at mcs.anl.gov
Wed Apr 12 14:38:24 CDT 2017


Hi,

Some answers inline:

On 04/12/2017 10:02 AM, Heinz-Ado Arnolds wrote:
> Dear MPICH users and developers,
>
> first of all, many thanks for all the great work you have done on MPICH!
>
> I'd like SGE to schedule 4 MPI processes, each starting one OpenMP job with 10 threads, on 2 nodes, each having 2 sockets with 10 cores & 10 hwthreads. Only the 10 cores (no hwthreads) should be used on each socket.
>
> 4 MPI processes: 1 OpenMP job with 10 threads each (i.e. 4x10 threads)
> 2 nodes, 2 sockets each, 10 cores & 10 hwthreads each
>
> lscpu -a -e
>
> CPU NODE SOCKET CORE L1d:L1i:L2:L3
> 0   0    0      0    0:0:0:0
> 1   1    1      1    1:1:1:1
> 2   0    0      2    2:2:2:0
> 3   1    1      3    3:3:3:1
> 4   0    0      4    4:4:4:0
> 5   1    1      5    5:5:5:1
> 6   0    0      6    6:6:6:0
> 7   1    1      7    7:7:7:1
> 8   0    0      8    8:8:8:0
> 9   1    1      9    9:9:9:1
> 10  0    0      10   10:10:10:0
> 11  1    1      11   11:11:11:1
> 12  0    0      12   12:12:12:0
> 13  1    1      13   13:13:13:1
> 14  0    0      14   14:14:14:0
> 15  1    1      15   15:15:15:1
> 16  0    0      16   16:16:16:0
> 17  1    1      17   17:17:17:1
> 18  0    0      18   18:18:18:0
> 19  1    1      19   19:19:19:1
> 20  0    0      0    0:0:0:0
> 21  1    1      1    1:1:1:1
> 22  0    0      2    2:2:2:0
> 23  1    1      3    3:3:3:1
> 24  0    0      4    4:4:4:0
> 25  1    1      5    5:5:5:1
> 26  0    0      6    6:6:6:0
> 27  1    1      7    7:7:7:1
> 28  0    0      8    8:8:8:0
> 29  1    1      9    9:9:9:1
> 30  0    0      10   10:10:10:0
> 31  1    1      11   11:11:11:1
> 32  0    0      12   12:12:12:0
> 33  1    1      13   13:13:13:1
> 34  0    0      14   14:14:14:0
> 35  1    1      15   15:15:15:1
> 36  0    0      16   16:16:16:0
> 37  1    1      17   17:17:17:1
> 38  0    0      18   18:18:18:0
> 39  1    1      19   19:19:19:1
>
> When I try to submit the job by using
>
> HYDRA_TOPO_DEBUG=1 mpirun -np 4 -bind-to socket:1 -map-by socket ./myid
>
> the distribution to 2 sockets on 2 nodes is done correctly, but each process is bound to all 10 cores + 10 hwthreads of its socket:
>
>   process 0 binding: 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>   process 1 binding: 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
>   process 2 binding: 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
>   process 3 binding: 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

The process binding code treats the machine as a hierarchy. If a core 
contains 2 hwthreads, then binding to that core effectively binds to 
both hwthreads. If you bind to one socket on your machine, you should 
see that as being bound to 20 hwthreads, as above. If you have hwloc 
installed on your system, you can view the system hierarchy with the 
'lstopo' command. Hopefully that answers your question.
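
For a quick illustration of that hierarchy, here is a minimal hwloc 
sketch (assuming the hwloc headers and library are installed; this is 
not part of the hydra binding code itself) that prints each core's 
cpuset, i.e. the set of hwthreads a core-level binding would cover:

    /* cores.c -- list each core and the hwthreads (PUs) it contains.
     * Build with: gcc cores.c -o cores -lhwloc */
    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        for (int i = 0; i < ncores; i++) {
            hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
            char buf[128];
            hwloc_bitmap_snprintf(buf, sizeof(buf), core->cpuset);
            /* On your machine each core should show 2 hwthreads,
             * e.g. PUs 0 and 20 for the first core of socket 0. */
            printf("core %d: cpuset %s (%d hwthreads)\n",
                   i, buf, hwloc_bitmap_weight(core->cpuset));
        }

        hwloc_topology_destroy(topo);
        return 0;
    }

Binding a process to a socket gives it the union of its cores' cpusets, 
which is why the masks above each cover 20 hwthreads.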

>
> Additionally, it seems that the CPU masks & Cpus_allowed_list are not set for the launched processes:
>
>   MPI Instance 0001 of 0004 is on pascal-1-04, 0x000000ff,0xffffffff, Cpus_allowed_list:	0-39
>   MPI Instance 0002 of 0004 is on pascal-1-04, 0x000000ff,0xffffffff, Cpus_allowed_list:	0-39
>   MPI Instance 0003 of 0004 is on pascal-3-06, 0x000000ff,0xffffffff, Cpus_allowed_list:	0-39
>   MPI Instance 0004 of 0004 is on pascal-3-06, 0x000000ff,0xffffffff, Cpus_allowed_list:	0-39

This is true only because HYDRA_TOPO_DEBUG=1 is set: with topology 
debugging enabled, this version reports the intended binding but does 
not actually apply it. We changed that behavior based on user feedback, 
and future versions will set the binding even when debugging is enabled:
https://github.com/pmodels/mpich/commit/a145fd73a1bb7dd4f06c8c6b7dff5f6c3695f6f6
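
Independent of the debug output, you can double-check what hydra really 
applied by having each rank print the affinity mask the OS gave it. 
Below is a minimal sketch of such a "myid"-style program (your actual 
myid source isn't shown in this thread, so this is only an assumed 
reconstruction based on the output format above; it prints a simple 
list of allowed CPUs instead of parsing /proc/self/status):

    /* myid.c (hypothetical reconstruction) -- each rank reports its host
     * and the CPUs it is allowed to run on, via sched_getaffinity().
     * Build with: mpicc myid.c -o myid */
    #define _GNU_SOURCE
    #include <mpi.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        char host[64], cpus[8192] = "";
        cpu_set_t set;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        gethostname(host, sizeof(host));

        CPU_ZERO(&set);
        sched_getaffinity(0, sizeof(set), &set);  /* 0 = this process */

        /* Collect the allowed CPU numbers into a space-separated list. */
        for (int c = 0; c < CPU_SETSIZE; c++) {
            if (CPU_ISSET(c, &set)) {
                char tmp[16];
                snprintf(tmp, sizeof(tmp), "%d ", c);
                strncat(cpus, tmp, sizeof(cpus) - strlen(cpus) - 1);
            }
        }

        printf("MPI Instance %04d of %04d is on %s, Cpus_allowed: %s\n",
               rank + 1, size, host, cpus);

        MPI_Finalize();
        return 0;
    }

Running it without HYDRA_TOPO_DEBUG=1 set should show the bindings that 
hydra actually applied.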

>
>
> Another way to achieve my task would be to use "-bind-to user"
>
>   HYDRA_TOPO_DEBUG=1 mpirun -np $nmpi -ppn $nmpipn -bind-to user:0+2+4+6+8+10+12+14+16+18,1+3+5+7+9+11+13+15+17+19 ./myid
>
> This works great up to specifying 9 cores on each socket ("0+2+4+6+8+10+12+14+16,1+3+5+7+9+11+13+15+17"). As soon as I add ...+18,...+19, the job crashes with these messages:
>

Probably overrunning a buffer. Thanks for the bug report :).

Ken
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss
