[mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs

Joachim Protze protze at itc.rwth-aachen.de
Thu Nov 14 03:33:40 CST 2019


Hi Doug,

This is somewhat expected behavior. For many simulation codes you will 
see that they saturate the memory bandwidth of a socket using just 
30-70% of the cores. Adding more processes/threads doesn't improve the 
runtime in such cases.

In your first experiment, when you don't bind the processes to the same 
numa domain (socket), the 64 processes will spread across both sockets 
and use the memory bandwidth of both sockets.

By binding one job to one socket, this job can use the same memory 
bandwidth as two jobs spread on both sockets. Based on the numbers you 
provided, my guess is that ~44 processes of your application are 
sufficient to fully utilize the memory bandwidth. Adding more processes 
to the socket will only slightly improve the runtime.

So the question for optimization might be: is there a number of 
processes in 44..64 which allows the best decomposition of your problem?
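[Editor's note: a quick way to explore this suggestion is the short
Python sketch below. It ranks each candidate process count in 44..64 by
how close to cubic its best 3D factorization is, similar to what
MPI_Dims_create would pick for a 3D domain decomposition. The 44..64
range comes from the thread; the 3D-decomposition assumption and the
helper name are illustrative.]

```python
# Rank candidate process counts n = 44..64 by how balanced (cubic) their
# best 3D factorization n = px*py*pz is; max/min ratio 1.0 = perfect cube.
def best_dims(n):
    best = None
    for px in range(1, n + 1):
        if n % px:
            continue
        for py in range(1, n // px + 1):
            if (n // px) % py:
                continue
            pz = n // (px * py)
            dims = tuple(sorted((px, py, pz)))
            ratio = dims[-1] / dims[0]
            if best is None or ratio < best[0]:
                best = (ratio, dims)
    return best[1]

for n in range(44, 65):
    print(n, best_dims(n))  # e.g. 64 -> (4, 4, 4), 48 -> (3, 4, 4)
```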

- Joachim

On 11/13/19 11:03 PM, Douglas Dommermuth via discuss wrote:
> I tried the following:
> 
> mpirun.mpich -bind-to user:0,1,2,3,4,5,6,7, .... ,57,58,59,60,61,62,63 -n 64 myprog
> 
> The timing was 76.03s.   It was interesting to follow it on the system monitor.
> ________________________________________
> From: Douglas Dommermuth via discuss <discuss at mpich.org>
> Sent: Wednesday, November 13, 2019 1:37 PM
> To: discuss at mpich.org
> Cc: Douglas Dommermuth
> Subject: Re: [mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
> 
> Hi Giuseppe,
> 
> Thanks for your help.  For NUMA node 0, the hardware threads are 0-63 and 128-191; thread 0 is paired with 128 on the same core, 1 with 129, and so on.  For NUMA node 1, the hardware threads are 64-127 and 192-255.  How do I run the two jobs on completely separated hardware threads and NUMA domains?
> 
> Thank you, Doug.
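[Editor's note: one way to sketch the disjoint binding Doug asks about
is to build explicit hwthread lists for hydra's "user" binding, the same
form used elsewhere in this thread (-bind-to user:0,1,2,...). Untested
on this machine; the variable names are the editor's.]

```shell
# Build explicit hwthread lists from the topology described above:
# NUMA 0 = hwthreads 0-63 (SMT siblings 128-191),
# NUMA 1 = hwthreads 64-127 (SMT siblings 192-255).
NODE0=$(seq -s, 0 63)    # physical hwthreads of NUMA node 0
NODE1=$(seq -s, 64 127)  # physical hwthreads of NUMA node 1
echo "$NODE0"
echo "$NODE1"
# Then run the two jobs on disjoint cores, avoiding the SMT siblings:
#   mpirun.mpich -bind-to user:$NODE0 -n 64 ./myprog &
#   mpirun.mpich -bind-to user:$NODE1 -n 64 ./myprog &
#   wait
```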
> ________________________________________
> From: Congiu, Giuseppe <gcongiu at anl.gov>
> Sent: Wednesday, November 13, 2019 12:57 PM
> To: discuss at mpich.org
> Cc: Douglas Dommermuth
> Subject: Re: [mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
> 
> You can try with only two jobs running on completely separated 
> hwthreads and NUMA domains and see if you get a runtime comparable to 
> the case in which you have only one job. Then add the other two jobs 
> and see how much the runtime increases. If it's twice as long, you are 
> running the jobs serially (as Joachim noted); otherwise there is some 
> room for sharing among hwthreads. In any case, consider that by 
> default MPICH uses shared memory for intranode communication, so 
> hwthreads/processes will be busy copying data and leave little time 
> for other hwthreads that want to run. On top of this, memory bandwidth 
> might also be an issue, as you may have all the cores accessing memory 
> at the same time.
> 
>> On Nov 13, 2019, at 2:06 PM, Douglas Dommermuth via discuss <discuss at mpich.org> wrote:
>>
>> The timings for 4 runs are 163.1s, 159.9s, 160.9s, and 158.3s.
>>
>> The subdomain size is currently 32^3.   I could see how 64^3 and 128^3 scale to boost the work relative to the communication.   The solver is incompressible with multigrid, which makes it a bit tricky.   Also, AMD recommends specific BIOS settings.   However, the machine is currently Top 5 on Geekbench 4 for multicore results.
>> ________________________________________
>> From: Congiu, Giuseppe <gcongiu at anl.gov>
>> Sent: Wednesday, November 13, 2019 11:42 AM
>> To: discuss at mpich.org
>> Cc: Douglas Dommermuth
>> Subject: Re: [mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
>>
>> Is this an average runtime over multiple runs?
>>
>>> On Nov 13, 2019, at 1:40 PM, Douglas Dommermuth via discuss <discuss at mpich.org> wrote:
>>>
>>> Hi Giuseppe,
>>>
>>> It took 163.1s for this case:
>>>
>>> mpirun.mpich -bind-to user:0+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:64+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:128+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:192+1 -n 64 myprog &
>>>
>>> Thanks, Doug.
>>> ________________________________________
>>> From: Congiu, Giuseppe <gcongiu at anl.gov>
>>> Sent: Wednesday, November 13, 2019 11:10 AM
>>> To: discuss at mpich.org
>>> Cc: Douglas Dommermuth
>>> Subject: Re: [mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
>>>
>>> Try binding all the ranks of a job to the same numa. See if something like this works better:
>>>
>>> mpirun.mpich -bind-to user:0+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:64+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:128+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:192+1 -n 64 myprog &
>>>
>>> However, this might not completely solve the problem, as MPI processes can still migrate across cores within the NUMA domain.
>>>
>>> —Giuseppe
>>>
>>>> On Nov 13, 2019, at 1:00 PM, Douglas Dommermuth via discuss <discuss at mpich.org> wrote:
>>>>
>>>> Hi Giuseppe and Joachim,
>>>>
>>>> I will look into turning off hyperthreading and running two jobs with a corresponding change in the sizes of the jobs.  Meanwhile, I ran the following case, which took 159.6s:
>>>>
>>>> mpirun.mpich -bind-to user:0+4 -n 64 myprog &
>>>> mpirun.mpich -bind-to user:1+4 -n 64 myprog &
>>>> mpirun.mpich -bind-to user:2+4 -n 64 myprog &
>>>> mpirun.mpich -bind-to user:3+4 -n 64 myprog &
>>>>
>>>> Thank you, Doug.
>>>> ________________________________________
>>>> From: Joachim Protze <protze at itc.rwth-aachen.de>
>>>> Sent: Wednesday, November 13, 2019 9:56 AM
>>>> To: discuss at mpich.org
>>>> Cc: Douglas Dommermuth
>>>> Subject: Re: [mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
>>>>
>>>> Hi Doug,
>>>>
>>>> In general, using hyperthreads only improves execution time if a
>>>> single process/thread does not already fully utilize the core. That
>>>> is, if you see ~100% CPU utilization per process in "top" for the
>>>> single-job execution, doubling the execution time from 2 to 4 MPI
>>>> jobs sounds reasonable: if your application is mostly computing (as
>>>> an MPI application hopefully is), the two processes/threads running
>>>> on the same core share the core's execution time and will end up
>>>> with double the execution time.
>>>>
>>>> Depending on your application, additional processes/threads might
>>>> also increase the pressure on the memory bus and therefore slow
>>>> down the other application by making it wait for memory accesses.
>>>> This might also explain the execution time increase from one to two
>>>> MPI jobs. All this depends on the CPU/memory configuration of the
>>>> machine.
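[Editor's note: the bandwidth argument above can be illustrated with a
toy roofline-style model. All numbers below are made up for
illustration, not measurements from this machine.]

```python
# Toy model: runtime is the slower of overlapped compute and the jobs'
# combined memory traffic through one shared memory bus. Once the bus
# saturates, extra concurrent jobs scale runtime roughly linearly.
def runtime(jobs, compute_s=50.0, traffic_gb=4000.0, bw_gbs=100.0):
    return max(compute_s, jobs * traffic_gb / bw_gbs)

print(runtime(1))  # 50.0 -> one job is already near the bandwidth limit
print(runtime(2))  # 80.0
print(runtime(4))  # 160.0 -> ~2x the two-job time, as observed
```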
>>>>
>>>> Best
>>>> Joachim
>>>>
>>>> On 11/13/19 5:39 PM, Douglas Dommermuth via discuss wrote:
>>>>> I am running Ubuntu 18.04.3 with MPICH 3.3~a2-4, GFortran
>>>>> 4:7.4.0-1ubuntu2.3, and GCC 4:7.4.0-1ubuntu2.3 on dual AMD EPYC 7742
>>>>> processors with hyperthreading enabled.  My codes are written in
>>>>> Fortran with MPI.  The dual AMD processors have 128 cores and 256
>>>>> threads.  I want to optimize the runtime for 4 MPI jobs running
>>>>> concurrently with 64 processes each.  Some timings are provided here:
>>>>>
>>>>> 1. One MPI job with mpiexec.hydra -n 64 myprog => 57.32s
>>>>> 2. One MPI job with mpiexec.hydra -bind-to numa -n 64 myprog => 50.52s
>>>>> 3. Two MPI jobs with mpiexec.hydra -n 64 myprog => 99.77s
>>>>> 4. Two MPI jobs with mpiexec.hydra -bind-to numa -n 64 myprog => 72.23s
>>>>> 5. Four MPI jobs with mpiexec.hydra -bind-to numa -n 64 myprog => 159.2s
>>>>>
>>>>> The option "-bind-to numa" helps, but even so, running four MPI
>>>>> jobs concurrently with 64 processes each is considerably slower than
>>>>> running one MPI job with 64 processes.  I can almost run four MPI
>>>>> jobs sequentially and match the time for running four MPI jobs
>>>>> concurrently.  How can I improve on the result for running 4 MPI
>>>>> jobs concurrently?  Thanks, Doug.
>>>>>
>>>>> _______________________________________________
>>>>> discuss mailing list     discuss at mpich.org
>>>>> To manage subscription options or unsubscribe:
>>>>> https://lists.mpich.org/mailman/listinfo/discuss
>>>>>
>>>>
>>>>
>>>
>>
> 
> 


-- 
Dipl.-Inf. Joachim Protze

IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074  Aachen (Germany)
Tel: +49 241 80- 24765
Fax: +49 241 80-624765
protze at itc.rwth-aachen.de
www.itc.rwth-aachen.de


