<!DOCTYPE html><html><head><title></title><style type="text/css">p.MsoNormal,p.MsoNoSpacing{margin:0}</style></head><body><div style="font-family:Arial;"><br></div><div style="font-family:Arial;"><br></div><div>On Fri, May 15, 2020, at 8:52 PM, hritikesh semwal via discuss wrote:<br></div><blockquote type="cite" id="qt" style=""><div dir="ltr"><div>Hello,<br></div><div><br></div><div>I am working on a parallel CFD solver with MPI and I am using an account on a cluster to run my executable. The hardware structure of my account is as follows:<br></div><div><br></div><div><div>Architecture:          x86_64<br></div><div>CPU op-mode(s):        32-bit, 64-bit<br></div><div>Byte Order:            Little Endian<br></div><div>CPU(s):                32<br></div><div>On-line CPU(s) list:   0-31<br></div><div>Thread(s) per core:    2<br></div><div>Core(s) per socket:    8<br></div><div>CPU socket(s):         2<br></div><div>NUMA node(s):          2<br></div><div>Vendor ID:             GenuineIntel<br></div><div>CPU family:            6<br></div><div>Model:                 62<br></div><div>Stepping:              4<br></div><div>CPU MHz:               2600.079<br></div><div>BogoMIPS:              5199.25<br></div><div>Virtualization:        VT-x<br></div><div>L1d cache:             32K<br></div><div>L1i cache:             32K<br></div><div>L2 cache:              256K<br></div><div>L3 cache:              20480K<br></div><div>NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30<br></div><div>NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31<br></div></div><div><br></div><div>Initially, I was running my executable without any binding options, and in that case, whenever I switched from 2 to 4 processors, my computation time increased along with the communication time inside some iterative loop. <br></div><div><br></div><div>Today I read about binding options in MPI, through which I can manage the allocation of processors. 
Initially, I used the "-bind-to core" option and the results were different: the time decreased up to 16 processors, and after that, with 24 and 32 processors, it started increasing again. The timing results are as follows:<br></div><div>2 procs- 160 seconds, 4 procs- 84 seconds, 8 procs- 45 seconds, 16 procs- 28 seconds, 24 procs- 38 seconds, 32 procs- 34 seconds.<br></div></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">This seems reasonable. Are you able to turn off hyperthreading? For most numerical codes it is not useful, as they are typically memory-bandwidth limited. Your machine has 2 sockets with 8 cores each, so 16 physical cores; with more than 16 processes you will therefore not see much speedup.</div><div style="font-family:Arial;"><br></div></body></html>