<div dir="ltr">Hello,<div><br></div><div>I am working on a parallel CFD solver with MPI and I am using an account on a cluster to run my executable. The hardware structure of my account is as follows;</div><div><br></div><div>Architecture: x86_64<br>CPU op-mode(s): 32-bit, 64-bit<br>Byte Order: Little Endian<br>CPU(s): 32<br>On-line CPU(s) list: 0-31<br>Thread(s) per core: 2<br>Core(s) per socket: 8<br>CPU socket(s): 2<br>NUMA node(s): 2<br>Vendor ID: GenuineIntel<br>CPU family: 6<br>Model: 62<br>Stepping: 4<br>CPU MHz: 2600.079<br>BogoMIPS: 5199.25<br>Virtualization: VT-x<br>L1d cache: 32K<br>L1i cache: 32K<br>L2 cache: 256K<br>L3 cache: 20480K<br>NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30<br>NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31<br></div><div><br></div><div>Initially, I was running my executable with any binding options and in that case, whenever I was switching from 2 to 4 processors my computation time was also increasing along with communication time inside some iterative loop. </div><div><br></div><div>Today, somewhere I read about binding options in MPI through which I can manage the allocation of processors. Initially, I used the "-bind-to core" option and the results were different and I got time reduction up to 16 processors and after that with 24 and 32 processors, it has started increasing. Results of timing are as follows;</div><div>2 procs- 160 seconds, 4 procs- 84 seconds, 8 procs- 45 seconds, 16 procs- 28 seconds, 24 procs- 38 seconds, 32 procs- 34 seconds.</div><div><br></div><div>After that, I used some other combinations of binding option but did not get better timing results compared to -bind-to core option. So, I back edited the bind to option to core but now I am getting different timing results with the same executable which are as follows,</div><div>2 procs- 164 seconds, 4 procs- 85 seconds, 8 procs- 45 seconds, 16 procs- 48 seconds, 24 procs- 52 seconds, 32 seconds- 98 seconds.</div><div><br></div><div>I have following two questions for which I am seeking your help,</div><div><br></div><div>1. Can anyone please suggest me is it possible
I have the following two questions for which I am seeking your help:

1. Is it possible to suggest optimum binding and mapping options based on the hardware topology of my cluster account? If yes, please tell me which ones.
2. Why do I get such an irregular pattern of timings without any binding option, and why do my timings vary from run to run even with a binding option? Is this a problem with the cluster network or with my MPI code?

If you need further details about my iterative loop, please tell me. As this message has already become long, I can share them later if the data above is not sufficient.

Thank you.
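P.S. One thing I intend to try, in case it is relevant to question 2, is printing which core each rank actually ends up on, with a small diagnostic like the one below. This is not part of my solver, and sched_getcpu() is Linux/glibc-specific:

    /* Small diagnostic (not my solver): report the host and CPU id of each rank. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);

        /* On my node, even CPU ids belong to NUMA node 0 and odd ids to node 1
         * (see the lscpu output above). */
        printf("rank %d on %s, cpu %d\n", rank, host, sched_getcpu());

        MPI_Finalize();
        return 0;
    }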