[mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K supercomputers
Sam Williams
swwilliams at lbl.gov
Sat May 24 09:17:04 CDT 2014
To be clear, I ran these experiments in late February (before the 27th).
On May 24, 2014, at 7:07 AM, Sam Williams <SWWilliams at lbl.gov> wrote:
> I saw the problem on Mira and K but not Edison. I don't know if that is due to scale or implementation.
>
> On Mira, I was running jobs with a 10-minute wallclock limit. I scaled up to 46656 processes with 64 threads per process (c1, OMP_NUM_THREADS=64) and all jobs completed successfully. However, I was only looking at MGSolve times, not MGBuild times. I then decided to explore 8 threads per process (c8, OMP_NUM_THREADS=8), starting at the high end of the concurrency range. With 373248 processes, the jobs timed out while still in MGBuild after 10 minutes, and again with a 20-minute limit. At that point I added the USE_SUBCOMM option to enable/disable the use of comm_split. I haven't tried scaling with the subcommunicator on Mira since then.
>
>
> On May 24, 2014, at 6:39 AM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>
>> Hi, Sam,
>> Could you give me the exact number of MPI ranks for your results on Mira?
>> I ran hpgmg on Edison with export OMP_NUM_THREADS=1, aprun -n 64000 -ss -cc numa_node ./hpgmg-fv 6 1. The total time in MGBuild is about 0.005 seconds. I was wondering how many cores I need to reproduce the problem.
>> Thanks.
>>
>>
>> --Junchao Zhang
>>
>>
>> On Sat, May 17, 2014 at 9:06 AM, Sam Williams <swwilliams at lbl.gov> wrote:
>> I've been conducting scaling experiments on the Mira (Blue Gene/Q) and K (SPARC) supercomputers. I've noticed that the time required for MPI_Comm_split and MPI_Comm_dup can grow quickly with scale (~P^2). As such, their cost eventually becomes a bottleneck. That is, although the benefit of using a subcommunicator is huge (multigrid solves are weak-scalable), the penalty of creating one (multigrid build time) is also huge.
>>
>> For example, when scaling from 1 to 46K nodes (node counts are cubes of integers) on Mira, the time (in seconds) required to build an MG solver (including a subcommunicator) scales as follows:
>> 222335.output: Total time in MGBuild 0.056704
>> 222336.output: Total time in MGBuild 0.060834
>> 222348.output: Total time in MGBuild 0.064782
>> 222349.output: Total time in MGBuild 0.090229
>> 222350.output: Total time in MGBuild 0.075280
>> 222351.output: Total time in MGBuild 0.091852
>> 222352.output: Total time in MGBuild 0.137299
>> 222411.output: Total time in MGBuild 0.301552
>> 222413.output: Total time in MGBuild 0.606444
>> 222415.output: Total time in MGBuild 0.745272
>> 222417.output: Total time in MGBuild 0.779757
>> 222418.output: Total time in MGBuild 4.671838
>> 222419.output: Total time in MGBuild 15.123162
>> 222420.output: Total time in MGBuild 33.875626
>> 222421.output: Total time in MGBuild 49.494547
>> 222422.output: Total time in MGBuild 151.329026
>>
>> If I disable the call to MPI_Comm_split, my time scales as follows:
>> 224982.output: Total time in MGBuild 0.050143
>> 224983.output: Total time in MGBuild 0.052607
>> 224988.output: Total time in MGBuild 0.050697
>> 224989.output: Total time in MGBuild 0.078343
>> 224990.output: Total time in MGBuild 0.054634
>> 224991.output: Total time in MGBuild 0.052158
>> 224992.output: Total time in MGBuild 0.060286
>> 225008.output: Total time in MGBuild 0.062925
>> 225009.output: Total time in MGBuild 0.097357
>> 225010.output: Total time in MGBuild 0.061807
>> 225011.output: Total time in MGBuild 0.076617
>> 225012.output: Total time in MGBuild 0.099683
>> 225013.output: Total time in MGBuild 0.125580
>> 225014.output: Total time in MGBuild 0.190711
>> 225016.output: Total time in MGBuild 0.218329
>> 225017.output: Total time in MGBuild 0.282081
>>
>> Although I didn't measure it directly, this suggests that the time for MPI_Comm_split grows roughly quadratically with process concurrency.
>>
>>
>>
>>
>> I see the same effect on the K machine (8 to 64K nodes), where the code uses comm_split and comm_dup in conjunction:
>> run00008_7_1.sh.o2412931: Total time in MGBuild 0.026458 seconds
>> run00064_7_1.sh.o2415876: Total time in MGBuild 0.039121 seconds
>> run00512_7_1.sh.o2415877: Total time in MGBuild 0.086800 seconds
>> run01000_7_1.sh.o2414496: Total time in MGBuild 0.129764 seconds
>> run01728_7_1.sh.o2415878: Total time in MGBuild 0.224576 seconds
>> run04096_7_1.sh.o2415880: Total time in MGBuild 0.738979 seconds
>> run08000_7_1.sh.o2414504: Total time in MGBuild 2.123800 seconds
>> run13824_7_1.sh.o2415881: Total time in MGBuild 6.276573 seconds
>> run21952_7_1.sh.o2415882: Total time in MGBuild 13.634200 seconds
>> run32768_7_1.sh.o2415884: Total time in MGBuild 36.508670 seconds
>> run46656_7_1.sh.o2415874: Total time in MGBuild 58.668228 seconds
>> run64000_7_1.sh.o2415875: Total time in MGBuild 117.322217 seconds
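>>
>> As a rough sanity check of the growth rate (a back-of-the-envelope estimate using only the 8000-node and 64000-node rows above, not a fit):
>>     nodes:  8000 -> 64000        (8x)
>>     time:   ~2.12 s -> ~117.3 s  (~55x)
>> Purely quadratic growth would predict 8^2 = 64x, so the measured behavior is indeed close to P^2.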
>>
>>
>> A glance at the implementation on Mira (I don't know whether the implementation on K is stock) suggests it uses qsort to sort by key. Unfortunately, qsort is not performance-robust the way heap sort or merge sort is. If one does the natural thing and calls comm_split like...
>> MPI_Comm_split(...,mycolor,myrank,...)
>> then one runs the risk that the keys arrive presorted, which hits the worst-case computational complexity of qsort: O(P^2). Demanding that programmers avoid passing sorted keys seems unreasonable.
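>>
>> To make that call pattern concrete, here is a minimal, self-contained sketch of the kind of code I mean (the color choice and variable names are just illustrative, not taken from hpgmg):
>>
>>     #include <mpi.h>
>>
>>     /* Split MPI_COMM_WORLD into a "coarse solve" group and everyone else,
>>      * passing the world rank as the key.  Because every rank passes its own
>>      * rank, the key list each color group hands to MPI_Comm_split arrives
>>      * already sorted -- exactly the input a naively-pivoted quicksort is
>>      * slowest on. */
>>     int main(int argc, char **argv) {
>>       int myrank, nprocs;
>>       MPI_Init(&argc, &argv);
>>       MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>>       MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>>
>>       /* hypothetical color choice: first eighth of the ranks own the coarse levels */
>>       int mycolor = (myrank < nprocs / 8) ? 0 : 1;
>>
>>       MPI_Comm sub_comm;
>>       MPI_Comm_split(MPI_COMM_WORLD, mycolor, myrank /* presorted keys */, &sub_comm);
>>
>>       MPI_Comm_free(&sub_comm);
>>       MPI_Finalize();
>>       return 0;
>>     }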
>>
>>
>> I should note, I see a similar lack of scaling with MPI_Comm_dup on the K machine. Unfortunately, my BGQ data used an earlier version of the code that did not use comm_dup. As such, I can’t definitively say that it is a problem on that machine as well.
>>
>> Thus, I'm asking that scalable implementations of comm_split/dup, based on a merge or heap sort whose worst-case complexity is still O(P log P), be prioritized in the next update.
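>>
>> For what it's worth, a stable bottom-up merge sort over the (key, rank) pairs is one way to get a guaranteed O(P log P) sort inside comm_split. The sketch below is only meant to illustrate the idea (it is not MPICH code, and the struct/function names are made up):
>>
>>     #include <stdlib.h>
>>     #include <string.h>
>>
>>     /* One (key, rank) entry per process participating in the split. */
>>     typedef struct { int key; int rank; } splitpair;
>>
>>     /* Stable bottom-up merge sort by key: O(P log P) in the worst case,
>>      * unlike a naively-pivoted quicksort, which degrades to O(P^2) on
>>      * presorted keys. */
>>     static void sort_pairs(splitpair *a, int n) {
>>       splitpair *tmp = malloc((size_t)n * sizeof *tmp);
>>       if (tmp == NULL) return;               /* real code would report the error */
>>       for (int width = 1; width < n; width *= 2) {
>>         for (int lo = 0; lo < n; lo += 2 * width) {
>>           int mid = lo + width;     if (mid > n) mid = n;
>>           int hi  = lo + 2 * width; if (hi  > n) hi  = n;
>>           int i = lo, j = mid, k = lo;
>>           while (i < mid && j < hi)
>>             tmp[k++] = (a[j].key < a[i].key) ? a[j++] : a[i++];  /* ties keep left: stable */
>>           while (i < mid) tmp[k++] = a[i++];
>>           while (j < hi)  tmp[k++] = a[j++];
>>         }
>>         memcpy(a, tmp, (size_t)n * sizeof *a);
>>       }
>>       free(tmp);
>>     }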
>>
>>
>> thanks
>> _______________________________________________
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/devel
>>
>