[mpich-discuss] Understanding process bindings in MPICH

hritikesh semwal hritikesh.semwal at gmail.com
Fri May 15 12:52:35 CDT 2020


Hello,

I am working on a parallel CFD solver with MPI, and I am running my
executable on an account on a cluster. The hardware topology of the node I
am using is as follows:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
CPU socket(s):         2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2600.079
BogoMIPS:              5199.25
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31

Initially, I was running my executable without any binding options, and in
that case, whenever I switched from 2 to 4 processes, the computation time
inside my iterative loop increased along with the communication time.

Today I read about binding options in MPI, through which I can manage how
processes are allocated to processors. First I used the "-bind-to core"
option and the results were different: I got a time reduction up to 16
processes, but after that, with 24 and 32 processes, the time started
increasing again. The timing results are as follows:
 2 procs: 160 seconds
 4 procs:  84 seconds
 8 procs:  45 seconds
16 procs:  28 seconds
24 procs:  38 seconds
32 procs:  34 seconds
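
To illustrate what I mean by checking the placement, here is a minimal test
program (just a sketch for illustration, not my actual solver) that I can
run under each binding option. It prints the logical CPU each rank is
currently executing on and the size of its affinity mask; with the topology
above, even-numbered CPUs belong to NUMA node 0 and odd-numbered CPUs to
NUMA node 1, and the mask should shrink from all 32 hardware threads
(unbound) to the one core (two hardware threads) the rank is pinned to with
-bind-to core:

/* bindcheck.c -- illustrative only; compile with mpicc,
   run e.g. as: mpiexec -n 8 -bind-to core ./bindcheck   */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &namelen);

    /* Logical CPU this rank is running on; from the lscpu output above,
       even CPUs are on NUMA node 0 and odd CPUs on NUMA node 1.        */
    int cpu = sched_getcpu();

    /* Number of CPUs this rank is allowed to run on.                   */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);

    printf("rank %2d of %2d on %s: cpu %2d (NUMA node %d), affinity mask has %d CPUs\n",
           rank, size, host, cpu, cpu % 2, CPU_COUNT(&mask));

    MPI_Finalize();
    return 0;
}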

After that, I tried some other combinations of binding options but did not
get better timings than with the -bind-to core option. So I switched back
to -bind-to core, but now I am getting different timing results with the
same executable, which are as follows:
 2 procs: 164 seconds
 4 procs:  85 seconds
 8 procs:  45 seconds
16 procs:  48 seconds
24 procs:  52 seconds
32 procs:  98 seconds
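
One thing I am considering, to rule out measurement noise between runs, is
to time the loop as in the sketch below (again only an illustration, not my
actual solver code): put a barrier before starting the clock and report the
maximum loop time over all ranks, so that the slowest rank is what gets
compared between runs rather than whichever rank happens to hit the timer
first:

/* timing sketch -- illustrative only */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);       /* start all ranks together      */
    double t0 = MPI_Wtime();

    /* ... iterative loop: local computation + communication ... */

    double local = MPI_Wtime() - t0;   /* this rank's loop time         */
    double tmax;
    MPI_Reduce(&local, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("loop time (slowest rank): %.3f seconds\n", tmax);

    MPI_Finalize();
    return 0;
}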

I have the following two questions, for which I am seeking your help:

1. Based on the hardware topology of my cluster account shown above, can
anyone suggest optimal binding and mapping options? If so, please tell me.
2. Why am I getting such an irregular pattern of jumps in timing without a
binding option, and why do my timings vary from run to run with a binding
option? Is this a problem with the cluster network or with my MPI code?

If you need further details about my iterative loop, please tell me. As
this message has already gotten long, I can share them later if you think
the data above is not sufficient.

Thank you.