[mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
Congiu, Giuseppe
gcongiu at anl.gov
Wed Nov 13 16:22:29 CST 2019
Sorry, I meant separate cores and NUMA nodes. You can run lstopo to discover how hwthreads are numbered by hwloc (hydra uses hwloc to do the binding). Once you know how the hwthreads are numbered, you can use "-bind-to user:" to make sure each MPI process runs on separate cores and NUMA nodes.
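As a rough sketch of that last step, assuming the layout Doug reported (NUMA node 0 holds hwthreads 0-63 with siblings 128-191, NUMA node 1 holds 64-127 with siblings 192-255) and assuming those are the ids hydra expects, something like the following builds one explicit hwthread list per job. The helper name cpu_list is made up for illustration, and the script only prints the commands so you can check them against the lstopo output before running anything.

```shell
#!/bin/sh
# Hypothetical helper: print a comma-separated list FIRST,FIRST+1,...,LAST.
cpu_list() {
    list=$1
    i=$(( $1 + 1 ))
    while [ "$i" -le "$2" ]; do
        list="$list,$i"
        i=$(( i + 1 ))
    done
    printf '%s\n' "$list"
}

# Assumed layout, taken from the numbers reported in this thread:
#   NUMA node 0: hwthreads 0-63   (SMT siblings 128-191, left idle here)
#   NUMA node 1: hwthreads 64-127 (SMT siblings 192-255, left idle here)
JOB_A=$(cpu_list 0 63)     # one hwthread per core on NUMA node 0
JOB_B=$(cpu_list 64 127)   # one hwthread per core on NUMA node 1

# Print the commands instead of launching them, so they can be reviewed first.
echo "mpiexec.hydra -bind-to user:$JOB_A -n 64 myprog &"
echo "mpiexec.hydra -bind-to user:$JOB_B -n 64 myprog &"
```

With 64 ranks and 64 ids per list, each rank should get its own core, and the sibling hwthreads stay free for the two-job test.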
Best,
Giuseppe
> On Nov 13, 2019, at 3:37 PM, Douglas Dommermuth via discuss <discuss at mpich.org> wrote:
>
> Hi Giuseppe,
>
> Thanks for your help. For NUMA Node 0, the hardware threads are 0-63 and 128-191. Thread 0 is paired with 128 on the same core, 1 is paired with 129, etc. For NUMA Node 1, the hardware threads are 64-127 and 192-255. How do I run the two jobs on completely separated hardware threads and numas?
>
> Thank you, Doug.
> ________________________________________
> From: Congiu, Giuseppe <gcongiu at anl.gov>
> Sent: Wednesday, November 13, 2019 12:57 PM
> To: discuss at mpich.org
> Cc: Douglas Dommermuth
> Subject: Re: [mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
>
> You can try with only two jobs running on completely separate hwthreads and NUMA nodes and see if you get a runtime comparable to the case in which you have only one job. Then you can add the other two jobs and see how much the runtime increases. If it’s twice as long, you are effectively running the jobs serially (as Joachim noted); otherwise there is some room for sharing among hwthreads. In any case, consider that by default MPICH uses shared memory for intranode communication, so hwthreads/processes will be busy copying data and leave little time for other hwthreads that want to run. On top of this, memory bandwidth might also be an issue, as you have all the cores possibly accessing memory at the same time.
>
>> On Nov 13, 2019, at 2:06 PM, Douglas Dommermuth via discuss <discuss at mpich.org> wrote:
>>
>> The timings for 4 runs are 163.1s, 159.9s, 160.9s, and 158.3s.
>>
>> The subdomain size is currently 32^3. I could see how 64^3 and 128^3 scale to boost the work relative to the communication. The solver is incompressible with multigrid, which makes it a bit tricky. Also, AMD recommends specific BIOS settings. However, the machine is currently Top 5 on Geekbench 4 for multicore results.
>> ________________________________________
>> From: Congiu, Giuseppe <gcongiu at anl.gov>
>> Sent: Wednesday, November 13, 2019 11:42 AM
>> To: discuss at mpich.org
>> Cc: Douglas Dommermuth
>> Subject: Re: [mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
>>
>> Is this an average runtime over multiple runs?
>>
>>> On Nov 13, 2019, at 1:40 PM, Douglas Dommermuth via discuss <discuss at mpich.org> wrote:
>>>
>>> Hi Giuseppe,
>>>
>>> It took 163.1s for this case:
>>>
>>> mpirun.mpich -bind-to user:0+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:64+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:128+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:192+1 -n 64 myprog &
>>>
>>> Thanks, Doug.
>>> ________________________________________
>>> From: Congiu, Giuseppe <gcongiu at anl.gov>
>>> Sent: Wednesday, November 13, 2019 11:10 AM
>>> To: discuss at mpich.org
>>> Cc: Douglas Dommermuth
>>> Subject: Re: [mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
>>>
>>> Try binding all the ranks of a job to the same numa. See if something like this works better:
>>>
>>> mpirun.mpich -bind-to user:0+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:64+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:128+1 -n 64 myprog &
>>> mpirun.mpich -bind-to user:192+1 -n 64 myprog &
>>>
>>> However, this might not completely solve the problem, as MPI processes can still move around across different cores in the NUMA node.
>>>
>>> —Giuseppe
>>>
>>>> On Nov 13, 2019, at 1:00 PM, Douglas Dommermuth via discuss <discuss at mpich.org> wrote:
>>>>
>>>> Hi Giuseppe and Joachim,
>>>>
>>>> I will look into turning off hyperthreading and running two jobs with a corresponding change in the sizes of the jobs. Meanwhile, I ran the following case, which took 159.6s:
>>>>
>>>> mpirun.mpich -bind-to user:0+4 -n 64 myprog &
>>>> mpirun.mpich -bind-to user:1+4 -n 64 myprog &
>>>> mpirun.mpich -bind-to user:2+4 -n 64 myprog &
>>>> mpirun.mpich -bind-to user:3+4 -n 64 myprog &
>>>>
>>>> Thank you, Doug.
>>>> ________________________________________
>>>> From: Joachim Protze <protze at itc.rwth-aachen.de>
>>>> Sent: Wednesday, November 13, 2019 9:56 AM
>>>> To: discuss at mpich.org
>>>> Cc: Douglas Dommermuth
>>>> Subject: Re: [mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs
>>>>
>>>> Hi Doug,
>>>>
>>>> In general, using hyperthreads only improves execution time if a single
>>>> process/thread does not already fully utilize the core. I.e., if you see
>>>> ~100% CPU utilization per process in "top" for the single-job execution,
>>>> doubling the execution time from 2 to 4 MPI jobs sounds reasonable.
>>>> If your application is mostly calculating (as an MPI application
>>>> hopefully does), the two processes/threads running on the same core
>>>> share the execution time of the core and will finally end up with double
>>>> execution time.
>>>>
>>>> Depending on your application, additional processes/threads also might
>>>> increase the pressure on the memory bus and therefore slow down the
>>>> other application by making it wait for memory accesses. This might also
>>>> explain the execution time increase from one to two mpi jobs.
>>>> All this depends on the cpu/memory configuration of this machine.
>>>>
>>>> Best
>>>> Joachim
>>>>
>>>> On 11/13/19 5:39 PM, Douglas Dommermuth via discuss wrote:
>>>>> I am running Ubuntu 18.04.3 with MPICH 3.3~a2-4 and GFortran
>>>>> 4:7.4.0-1ubuntu2.3 and GCC 4:7.4.0-1ubuntu2.3 on dual AMD EPYC 7742
>>>>> processors with hyper threading enabled. My codes are written in MPI
>>>>> and Fortran. The dual AMD processors have 128 cores and 256 threads.
>>>>> I want to optimize the runtime for 4 mpi jobs running concurrently
>>>>> with 64 threads each. Some timings are provided here:
>>>>>
>>>>> 1. One mpi job with mpiexec.hydra -n 64 myprog => 57.32s
>>>>> 2. One mpi job with mpiexec.hydra -bind-to numa -n 64 => 50.52s
>>>>> 3. Two mpi jobs with mpiexec.hydra -n 64 myprog => 99.77s
>>>>> 4. Two mpi jobs with mpiexec.hydra -bind-to numa -n 64 => 72.23s
>>>>> 5. Four mpi jobs with mpiexec.hydra -bind-to numa -n 64 => 159.2s
>>>>>
>>>>> The option "-bind-to numa" helps, but even so, running four mpi
>>>>> jobs concurrently with 64 threads each is considerably slower than
>>>>> running one mpi job with 64 threads. I can almost run four mpi jobs
>>>>> sequentially and match the time for running four mpi jobs concurrently.
>>>>> How can I improve on the result for running 4 mpi jobs concurrently?
>>>>> Thanks, Doug.
>>>>>
>>>>> _______________________________________________
>>>>> discuss mailing list discuss at mpich.org
>>>>> To manage subscription options or unsubscribe:
>>>>> https://lists.mpich.org/mailman/listinfo/discuss
>>>>>
>>>>
>>>>
>>>> --
>>>> Dipl.-Inf. Joachim Protze
>>>>
>>>> IT Center
>>>> Group: High Performance Computing
>>>> Division: Computational Science and Engineering
>>>> RWTH Aachen University
>>>> Seffenter Weg 23
>>>> D 52074 Aachen (Germany)
>>>> Tel: +49 241 80- 24765
>>>> Fax: +49 241 80-624765
>>>> protze at itc.rwth-aachen.de
>>>> www.itc.rwth-aachen.de
>>>>