[mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K supercomputers

Jeff Hammond jeff.science at gmail.com
Thu Jul 3 16:45:01 CDT 2014


PAMI is open-source via https://repo.anl-external.org/repos/bgq-driver/.

I believe ALCF has already reported this bug but you can contact
support at alcf.anl.gov for an update.

Best,

Jeff

On Thu, Jul 3, 2014 at 2:41 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
> Hi, Sam,
>   I wrote micro-benchmarks for MPI_Comm_split/dup. My profiling results
> suggested the problem lies in an IBM PAMI library call,
> PAMI_Geometry_create_taskrange().  Unfortunately, I don't have access to the
> PAMI source code, so I don't know why it is slow. I reported it to IBM and
> hope IBM will fix it.
>   Alternatively, you can set the environment variable PAMID_COLLECTIVES=0 to
> disable PAMI collectives. My tests showed that this at least fixes the
> scalability problem for Comm_split and Comm_dup.
>   Also through profiling, I found that the qsort() called in the MPICH code
> actually uses the merge sort algorithm in Mira's libc library.
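>
>   For reference, a minimal sketch of such a timing micro-benchmark
> (illustrative only, not the exact benchmark referred to above; the
> color/key choice here is arbitrary) looks like the following.  Running it
> with and without PAMID_COLLECTIVES=0 in the job environment (e.g. passed
> via runjob --envs on Mira) shows the difference:
>
>   #include <mpi.h>
>   #include <stdio.h>
>
>   int main(int argc, char **argv)
>   {
>       int rank, size;
>       double t0, t_dup, t_split;
>       MPI_Comm dup_comm, split_comm;
>
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>       /* time MPI_Comm_dup on the full communicator */
>       MPI_Barrier(MPI_COMM_WORLD);
>       t0 = MPI_Wtime();
>       MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);
>       t_dup = MPI_Wtime() - t0;
>
>       /* time MPI_Comm_split; key = rank, arbitrary two-color split */
>       MPI_Barrier(MPI_COMM_WORLD);
>       t0 = MPI_Wtime();
>       MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &split_comm);
>       t_split = MPI_Wtime() - t0;
>
>       if (rank == 0)
>           printf("P=%d  dup=%f s  split=%f s\n", size, t_dup, t_split);
>
>       MPI_Comm_free(&split_comm);
>       MPI_Comm_free(&dup_comm);
>       MPI_Finalize();
>       return 0;
>   }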
>
>
>
> --Junchao Zhang
>
>
> On Sat, May 17, 2014 at 9:06 AM, Sam Williams <swwilliams at lbl.gov> wrote:
>>
>> I've been conducting scaling experiments on the Mira (Blue Gene/Q) and K
>> (Sparc) supercomputers.  I've noticed that the time required for
>> MPI_Comm_split and MPI_Comm_dup can grow quickly with scale (~P^2).  As
>> such, their performance eventually becomes a bottleneck.  That is, although
>> the benefit of using a subcommunicator is huge (multigrid solves are
>> weak-scalable), the penalty of creating one (multigrid build time) is also
>> huge.
>>
>> For example, when scaling from 1 to 46K nodes (node counts are cubes of
>> integers) on Mira, the time (in seconds) required to build an MG solver
>> (including a subcommunicator) scales as follows:
>> 222335.output:   Total time in MGBuild      0.056704
>> 222336.output:   Total time in MGBuild      0.060834
>> 222348.output:   Total time in MGBuild      0.064782
>> 222349.output:   Total time in MGBuild      0.090229
>> 222350.output:   Total time in MGBuild      0.075280
>> 222351.output:   Total time in MGBuild      0.091852
>> 222352.output:   Total time in MGBuild      0.137299
>> 222411.output:   Total time in MGBuild      0.301552
>> 222413.output:   Total time in MGBuild      0.606444
>> 222415.output:   Total time in MGBuild      0.745272
>> 222417.output:   Total time in MGBuild      0.779757
>> 222418.output:   Total time in MGBuild      4.671838
>> 222419.output:   Total time in MGBuild     15.123162
>> 222420.output:   Total time in MGBuild     33.875626
>> 222421.output:   Total time in MGBuild     49.494547
>> 222422.output:   Total time in MGBuild    151.329026
>>
>> If I disable the call to MPI_Comm_split, my time scales as follows:
>> 224982.output:   Total time in MGBuild      0.050143
>> 224983.output:   Total time in MGBuild      0.052607
>> 224988.output:   Total time in MGBuild      0.050697
>> 224989.output:   Total time in MGBuild      0.078343
>> 224990.output:   Total time in MGBuild      0.054634
>> 224991.output:   Total time in MGBuild      0.052158
>> 224992.output:   Total time in MGBuild      0.060286
>> 225008.output:   Total time in MGBuild      0.062925
>> 225009.output:   Total time in MGBuild      0.097357
>> 225010.output:   Total time in MGBuild      0.061807
>> 225011.output:   Total time in MGBuild      0.076617
>> 225012.output:   Total time in MGBuild      0.099683
>> 225013.output:   Total time in MGBuild      0.125580
>> 225014.output:   Total time in MGBuild      0.190711
>> 225016.output:   Total time in MGBuild      0.218329
>> 225017.output:   Total time in MGBuild      0.282081
>>
>> Although I didn't directly measure it, this suggests the time for
>> MPI_Comm_split is growing roughly quadratically with process concurrency.
>>
>>
>>
>>
>> I see the same effect on the K machine (8...64K nodes), where the code uses
>> comm_split and comm_dup together:
>> run00008_7_1.sh.o2412931:   Total time in MGBuild      0.026458 seconds
>> run00064_7_1.sh.o2415876:   Total time in MGBuild      0.039121 seconds
>> run00512_7_1.sh.o2415877:   Total time in MGBuild      0.086800 seconds
>> run01000_7_1.sh.o2414496:   Total time in MGBuild      0.129764 seconds
>> run01728_7_1.sh.o2415878:   Total time in MGBuild      0.224576 seconds
>> run04096_7_1.sh.o2415880:   Total time in MGBuild      0.738979 seconds
>> run08000_7_1.sh.o2414504:   Total time in MGBuild      2.123800 seconds
>> run13824_7_1.sh.o2415881:   Total time in MGBuild      6.276573 seconds
>> run21952_7_1.sh.o2415882:   Total time in MGBuild     13.634200 seconds
>> run32768_7_1.sh.o2415884:   Total time in MGBuild     36.508670 seconds
>> run46656_7_1.sh.o2415874:   Total time in MGBuild     58.668228 seconds
>> run64000_7_1.sh.o2415875:   Total time in MGBuild    117.322217 seconds
>>
>>
>> A glance at the implementation on Mira (I don't know whether the
>> implementation on K is stock) suggests it uses qsort to sort on the keys.
>> Unfortunately, qsort is not performance-robust the way heap sort or merge
>> sort is.  If one calls comm_split in the natural way, i.e.
>> MPI_Comm_split(...,mycolor,myrank,...)
>> then one runs the risk that the keys are already sorted.  That hits the
>> worst-case computational complexity for a naive qsort: O(P^2).  Demanding
>> that programmers avoid passing sorted keys seems unreasonable.
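>>
>> Spelled out, that natural idiom is just something like the following
>> sketch (the coloring variable is made up for illustration):
>>
>>   int myrank, mycolor;
>>   int ranks_per_subdomain = 64;  /* illustrative value */
>>   MPI_Comm subcomm;
>>   MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>>   mycolor = myrank / ranks_per_subdomain;  /* contiguous coloring */
>>   /* key = myrank, so within each color the keys reach the sort
>>      already in ascending order */
>>   MPI_Comm_split(MPI_COMM_WORLD, mycolor, myrank, &subcomm);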
>>
>>
>> I should note that I see a similar lack of scaling with MPI_Comm_dup on the K
>> machine.  Unfortunately, my BGQ data used an earlier version of the code
>> that did not use comm_dup.  As such, I can’t definitively say that it is a
>> problem on that machine as well.
>>
>> Thus, I'm asking that scalable implementations of comm_split/dup, using a
>> merge or heap sort whose worst-case complexity is still O(P log P), be
>> prioritized in the next update.
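>>
>> For what it's worth, the sort in question only has to order (color, key,
>> rank) triples, so swapping in any O(P log P) worst-case sort -- e.g. a
>> bottom-up merge sort -- looks like a contained change.  A rough sketch of
>> the idea (struct, field, and function names are made up for illustration,
>> not MPICH's actual internals):
>>
>>   #include <stdlib.h>
>>   #include <string.h>
>>
>>   typedef struct { int color, key, rank; } splittype;
>>
>>   /* order by color, then key, with rank as the tie-breaker */
>>   static int cmp(const splittype *a, const splittype *b)
>>   {
>>       if (a->color != b->color) return (a->color < b->color) ? -1 : 1;
>>       if (a->key   != b->key)   return (a->key   < b->key)   ? -1 : 1;
>>       return (a->rank < b->rank) ? -1 : (a->rank > b->rank);
>>   }
>>
>>   /* Bottom-up merge sort: O(n log n) even on presorted input; the rank
>>      tie-breaker keeps equal keys in their original rank order. */
>>   static void splittype_mergesort(splittype *a, int n)
>>   {
>>       splittype *tmp = malloc(n * sizeof(*tmp));
>>       int width, lo;
>>       if (!tmp) return;  /* error handling elided in this sketch */
>>       for (width = 1; width < n; width *= 2) {
>>           for (lo = 0; lo < n; lo += 2 * width) {
>>               int mid = (lo + width < n) ? lo + width : n;
>>               int hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
>>               int i = lo, j = mid, k = lo;
>>               while (i < mid && j < hi)
>>                   tmp[k++] = (cmp(&a[i], &a[j]) <= 0) ? a[i++] : a[j++];
>>               while (i < mid) tmp[k++] = a[i++];
>>               while (j < hi)  tmp[k++] = a[j++];
>>           }
>>           memcpy(a, tmp, n * sizeof(*tmp));
>>       }
>>       free(tmp);
>>   }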
>>
>>
>> thanks
>
>
>



-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/

