[mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K supercomputers

Mon May 19 21:25:58 CDT 2014

Sam,
        Thanks for the bug report. We will fix it.

Rajeev

On May 17, 2014, at 9:06 AM, Sam Williams <swwilliams at lbl.gov> wrote:

> I've been conducting scaling experiments on the Mira (Blue Gene/Q) and K (Sparc) supercomputers.  I've noticed that the time required for MPI_Comm_split and MPI_Comm_dup can grow quickly with scale (~P^2).  As such, its performance eventually becomes a bottleneck.  That is, although the benefit of using a subcommunicator is huge (multigrid solves are weak-scalable), the penalty of creating one (multigrid build time) is also huge.
> 
> For example, when scaling from 1 to 46K nodes (= cubes of integers) on Mira, the time (in seconds) required to build a MG solver (including a subcommunicator) scales as 
> 222335.output:   Total time in MGBuild      0.056704
> 222336.output:   Total time in MGBuild      0.060834
> 222348.output:   Total time in MGBuild      0.064782
> 222349.output:   Total time in MGBuild      0.090229
> 222350.output:   Total time in MGBuild      0.075280
> 222351.output:   Total time in MGBuild      0.091852
> 222352.output:   Total time in MGBuild      0.137299
> 222411.output:   Total time in MGBuild      0.301552
> 222413.output:   Total time in MGBuild      0.606444
> 222415.output:   Total time in MGBuild      0.745272
> 222417.output:   Total time in MGBuild      0.779757
> 222418.output:   Total time in MGBuild      4.671838
> 222419.output:   Total time in MGBuild     15.123162
> 222420.output:   Total time in MGBuild     33.875626
> 222421.output:   Total time in MGBuild     49.494547
> 222422.output:   Total time in MGBuild    151.329026
> 
> If I disable the call to MPI_Comm_Split, my time scales as
> 224982.output:   Total time in MGBuild      0.050143
> 224983.output:   Total time in MGBuild      0.052607
> 224988.output:   Total time in MGBuild      0.050697
> 224989.output:   Total time in MGBuild      0.078343
> 224990.output:   Total time in MGBuild      0.054634
> 224991.output:   Total time in MGBuild      0.052158
> 224992.output:   Total time in MGBuild      0.060286
> 225008.output:   Total time in MGBuild      0.062925
> 225009.output:   Total time in MGBuild      0.097357
> 225010.output:   Total time in MGBuild      0.061807
> 225011.output:   Total time in MGBuild      0.076617
> 225012.output:   Total time in MGBuild      0.099683
> 225013.output:   Total time in MGBuild      0.125580
> 225014.output:   Total time in MGBuild      0.190711
> 225016.output:   Total time in MGBuild      0.218329
> 225017.output:   Total time in MGBuild      0.282081
> 
> Although I didn't directly measure it, this suggests the time for MPI_Comm_Split is growing roughly quadratically with process concurrency.
> 
> 
> 
> 
> I see the same effect on the K machine (8...64K nodes) where the code uses comm_split/dup in conjunction:
> run00008_7_1.sh.o2412931:   Total time in MGBuild      0.026458 seconds
> run00064_7_1.sh.o2415876:   Total time in MGBuild      0.039121 seconds
> run00512_7_1.sh.o2415877:   Total time in MGBuild      0.086800 seconds
> run01000_7_1.sh.o2414496:   Total time in MGBuild      0.129764 seconds
> run01728_7_1.sh.o2415878:   Total time in MGBuild      0.224576 seconds
> run04096_7_1.sh.o2415880:   Total time in MGBuild      0.738979 seconds
> run08000_7_1.sh.o2414504:   Total time in MGBuild      2.123800 seconds
> run13824_7_1.sh.o2415881:   Total time in MGBuild      6.276573 seconds
> run21952_7_1.sh.o2415882:   Total time in MGBuild     13.634200 seconds
> run32768_7_1.sh.o2415884:   Total time in MGBuild     36.508670 seconds
> run46656_7_1.sh.o2415874:   Total time in MGBuild     58.668228 seconds
> run64000_7_1.sh.o2415875:   Total time in MGBuild    117.322217 seconds
> 
> 
> A glance at the implementation on Mira (I don't know if the implementation on K is stock) suggests it should be using qsort to sort based on keys.  Unfortunately, qsort is not performance robust like heap/merge sort.  If one were to be productive and call comm_split like...
> MPI_Comm_split(...,mycolor,myrank,...)
> then one runs the risk that the keys are presorted.  This hits the worst case computational complexity for qsort... O(P^2).  Demanding programmers avoid sending sorted keys seems unreasonable.
> 
> 
> I should note, I see a similar lack of scaling with MPI_Comm_dup on the K machine.  Unfortunately, my BGQ data used an earlier version of the code that did not use comm_dup.  As such, I can’t definitively say that it is a problem on that machine as well.
> 
> Thus, I'm asking for scalable implementations of comm_split/dup using merge/heap sort whose worst case complexity is still PlogP to be prioritized in the next update.
> 
> 
> thanks
> _______________________________________________
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/devel