[mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
Douglas Dommermuth
dgd at mit.edu
Thu Nov 14 14:25:19 CST 2019
Hi Joachim,
Thanks for your help and the help of others who contributed to my understanding.
I did some tests with just one medium-sized mpi job:
mpirun.mpich -n 128 myprog => 371.5s
mpirun.mpich -bind-to numa -n 128 myprog => 366.4s
mpirun.mpich -bind-to user:0,1,2,3,...,127 -n 128 myprog => 358.4s
mpirun.mpich -bind-to rr -n 128 myprog => 355.1s
mpirun.mpich -n 128 myprog with hyperthreading disabled => 365.6s
mpirun.mpich -bind-to rr -n 128 myprog with hyperthreading disabled => 356.0s
I also did one test with two medium-sized mpi jobs running concurrently:
mpirun.mpich -bind-to user:0,1,2,3,4,...,127 -n 128 myprog and
mpirun.mpich -bind-to user:128,129,130,131,...,255 -n 128 myprog => 690.8s
The rr binding cannot be used for multiple jobs because it pins both jobs to the same threads. Manually entering the threads yields the best performance for multiple jobs, but typing out 128 thread numbers is awkward. It would be nice to have an option such as 0-127+1. Running two jobs concurrently is only slightly faster than running them sequentially. I think manually entering the threads is only useful for very long runs of multiple jobs that use all the available threads, because I don't think it is desirable to bind to single threads over extended periods of time. Unlike manual binding, numa binding spreads the load evenly over all the threads.
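For what it's worth, the long lists can be generated rather than typed, e.g. with something like this (a rough sketch assuming a bash-like shell and GNU seq):
# build "0,1,...,127" and "128,129,...,255" instead of typing them out
CPUS0=$(seq -s, 0 127)
CPUS1=$(seq -s, 128 255)
mpirun.mpich -bind-to user:$CPUS0 -n 128 myprog &
mpirun.mpich -bind-to user:$CPUS1 -n 128 myprog &
wait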
Thanks again, Doug.
________________________________________
From: Joachim Protze <protze at itc.rwth-aachen.de>
Sent: Thursday, November 14, 2019 1:33 AM
To: discuss at mpich.org
Cc: Douglas Dommermuth
Subject: Re: [mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
Hi Doug,
This is somewhat expected behavior. For many simulation codes you will
see that they hit the memory bandwidth of a socket using just 30-70% of
the cores. Adding more processes/threads doesn't improve the runtime in
such cases.
In your first experiment, when you don't bind the processes to the same
numa domain (socket), the 64 processes will spread across both sockets
and use the memory bandwidth of both sockets.
By binding one job to one socket, this job can use the same memory
bandwidth as two jobs spread on both sockets. Based on the numbers you
provided, my guess is that ~44 processes of your application are
sufficient to fully utilize the memory bandwidth. Adding more processes
to the socket will only slightly improve the runtime.
So the question for optimization might be: is there a number of
processes in the range 44..64 that allows the best decomposition of your problem?
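A rough sketch of how such a sweep could look, assuming hwthreads 0-63 all belong to the first socket as in your layout, GNU time is installed, and the rank counts are just placeholders:
# time N ranks pinned to the first N hwthreads of socket 0
for N in 44 48 56 64; do
  /usr/bin/time -f "$N ranks: %e s" mpirun.mpich -bind-to user:$(seq -s, 0 $((N-1))) -n $N myprog
done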
- Joachim
On 11/13/19 11:03 PM, Douglas Dommermuth via discuss wrote:
> I tried the following:
>
> mpirun.mpich -bind-to user:0,1,2,3,4,5,6,7, .... ,57,58,59,60,61,62,63 -n 64 myprog.
>
> The timing was 76.03s. It was interesting to follow it on the system monitor.
> ________________________________________
> From: Douglas Dommermuth via discuss <discuss at mpich.org>
> Sent: Wednesday, November 13, 2019 1:37 PM
> To: discuss at mpich.org
> Cc: Douglas Dommermuth
> Subject: Re: [mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
>
> Hi Giuseppe,
>
> Thanks for your help. For NUMA Node 0, the hardware threads are 0-63 and 128-191. Thread 0 is paired with 128 on the same core, 1 is paired with 129, etc. For NUMA Node 1, the hardware threads are 64-127 and 192-255. How do I run the two jobs on completely separated hardware threads and numas?
>
> Thank you, Doug.
> ________________________________________
> From: Congiu, Giuseppe <gcongiu at anl.gov>
> Sent: Wednesday, November 13, 2019 12:57 PM
> To: discuss at mpich.org
> Cc: Douglas Dommermuth
> Subject: Re: [mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
>
> You can try with only two jobs running on completely separated hwthreads and numas and see if you get a runtime comparable to the case in which you have only one job. Then you can add the other two jobs and see how much the runtime increases. If it's twice as long, you are effectively running the jobs serially (as Joachim noted); otherwise there is some room for sharing among hwthreads. In any case, consider that by default MPICH uses shared memory for intranode communication, so hwthreads/processes will be busy copying data and leave little time for other hwthreads that want to run. On top of this, memory bandwidth might also be an issue, as you may have all the cores accessing memory at the same time.
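> A minimal sketch of that two-job test, assuming NUMA node 0 covers hwthreads 0-63 and node 1 covers 64-127 (check the actual layout with lstopo or numactl --hardware first):
>
> # one job per NUMA domain, so the two jobs never share a hwthread
> mpirun.mpich -bind-to user:$(seq -s, 0 63) -n 64 myprog &
> mpirun.mpich -bind-to user:$(seq -s, 64 127) -n 64 myprog &
> wait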
>
>> On Nov 13, 2019, at 2:06 PM, Douglas Dommermuth via discuss <discuss at mpich.org> wrote:
>>
>> The timings for 4 runs are 163.1s, 159.9s, 160.9s, and 158.3s.
>>
>> The subdomain size is currently 32^3. I could see how 64^3 and 128^3 scale to boost the work relative to the communication. The solver is incompressible with multigrid, which makes it a bit tricky. Also, AMD recommends specific BIOS settings. However, the machine is currently Top 5 on Geekbench 4 for multicore results.
>> ________________________________________
>> From: Congiu, Giuseppe <gcongiu at anl.gov>
>> Sent: Wednesday, November 13, 2019 11:42 AM
>> To: discuss at mpich.org
>> Cc: Douglas Dommermuth
>> Subject: Re: [mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
>>
>> Is this an average runtime over multiple runs?
>>
>>> On Nov 13, 2019, at 1:40 PM, Douglas Dommermuth via discuss <discuss at mpich.org> wrote:
>>>
>>> Hi Giuseppe,
>>>
>>> It took 163.1s for this case:
>>>
>>> mpirun.mpich -bind-to user:0+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:64+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:128+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:192+1 -n 64 myprog &
>>>
>>> Thanks, Doug.
>>> ________________________________________
>>> From: Congiu, Giuseppe <gcongiu at anl.gov>
>>> Sent: Wednesday, November 13, 2019 11:10 AM
>>> To: discuss at mpich.org
>>> Cc: Douglas Dommermuth
>>> Subject: Re: [mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
>>>
>>> Try binding all the ranks of a job to the same numa. See if something like this works better:
>>>
>>> mpirun.mpich -bind-to user:0+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:64+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:128+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:192+1 -n 64 myprog &
>>>
>>> However, this might not completely solve the problem, as MPI processes can still move around across different cores within the numa domain.
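>>> One way to check where the ranks actually end up is something like the following (a sketch using standard procps ps; the psr column shows the hwthread each task last ran on):
>>>
>>> ps -C myprog -L -o pid,tid,psr,pcpu,comm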
>>>
>>> —Giuseppe
>>>
>>>> On Nov 13, 2019, at 1:00 PM, Douglas Dommermuth via discuss <discuss at mpich.org> wrote:
>>>>
>>>> Hi Giuseppe and Joachim,
>>>>
>>>> I will look into turning off hyperthreading and running two jobs with a corresponding change in the sizes of the jobs. Meanwhile, I ran the following case, which took 159.6s:
>>>>
>>>> mpirun.mpich -bind-to user:0+4 -n 64 myprog &
>>>> mpirun.mpich -bind-to user:1+4 -n 64 myprog &
>>>> mpirun.mpich -bind-to user:2+4 -n 64 myprog &
>>>> mpirun.mpich -bind-to user:3+4 -n 64 myprog &
>>>>
>>>> Thank you, Doug.
>>>> ________________________________________
>>>> From: Joachim Protze <protze at itc.rwth-aachen.de>
>>>> Sent: Wednesday, November 13, 2019 9:56 AM
>>>> To: discuss at mpich.org
>>>> Cc: Douglas Dommermuth
>>>> Subject: Re: [mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
>>>>
>>>> Hi Doug,
>>>>
>>>> In general, using hyperthreads only improves execution time if a single
>>>> process/thread does not already fully utilize the core. I.e., if you see
>>>> ~100% cpu utilization per process in "top" for the single-job execution,
>>>> doubling the execution time when going from 2 to 4 mpi jobs sounds
>>>> reasonable. If your application spends most of its time calculating (as an
>>>> MPI application hopefully does), the two processes/threads running on the
>>>> same core share the execution time of that core and will end up with
>>>> roughly double the execution time.
>>>>
>>>> Depending on your application, additional processes/threads also might
>>>> increase the pressure on the memory bus and therefore slow down the
>>>> other application by making it wait for memory accesses. This might also
>>>> explain the execution time increase from one to two mpi jobs.
>>>> All this depends on the cpu/memory configuration of this machine.
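>>>> If it helps, the cpu/memory layout can be inspected with, for example (assuming the numactl and hwloc packages are installed):
>>>>
>>>> numactl --hardware   # NUMA nodes with their hwthreads and memory sizes
>>>> lstopo               # socket/core/hwthread and cache layout (from hwloc)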
>>>>
>>>> Best
>>>> Joachim
>>>>
>>>> On 11/13/19 5:39 PM, Douglas Dommermuth via discuss wrote:
>>>>> I am running Ubuntu 18.04.3 with MPICH 3.3~a2-4 and GFortran
>>>>> 4:7.4.0-1ubuntu2.3 and GCC 4:7.4.0-1ubuntu2.3 on dual AMD EPYC 7742
>>>>> processors with hyper threading enabled. My codes are written in MPI
>>>>> and Fortran. The dual AMD processors have 128 cores and 256 threads.
>>>>> I want to optimize the runtime for 4 mpi jobs running concurrently
>>>>> with 64 threads each. Some timings are provided here:
>>>>>
>>>>> 1. One mpi job with mpiexec.hydra -n 64 myprog => 57.32s
>>>>> 2. One mpi job with mpiexec.hydra -bind-to numa -n 64 => 50.52s
>>>>> 3. Two mpi jobs with mpiexec.hydra -n 64 myprog => 99.77s
>>>>> 4. Two mpi jobs with mpiexec.hydra -bind-to numa -n 64 => 72.23s
>>>>> 5. Four mpi jobs with mpiexec.hydra -bind-to numa -n 64 => 159.2s
>>>>>
>>>>> The option "-bind-to numa" helps, but even so, running four mpi
>>>>> jobs concurrently with 64 threads each is considerably slower than
>>>>> running one mpi job with 64 threads. I can almost run four mpi jobs
>>>>> sequentially and match the time for running four mpi jobs concurrently.
>>>>> How can I improve on the result for running 4 mpi jobs concurrently?
>>>>> Thanks, Doug.
>>>>>
>>>>
>>>>
>>>
>>
>
>
--
Dipl.-Inf. Joachim Protze
IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80- 24765
Fax: +49 241 80-624765
protze at itc.rwth-aachen.de
www.itc.rwth-aachen.de