[mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K supercomputers

Jeff Hammond jeff.science at gmail.com
Mon Jul 7 12:19:18 CDT 2014


Finding the task range impl requires thinking in C++ but it's not hard.
Just grep your way down a few layers.

Jeff

On Monday, July 7, 2014, Rob Latham <robl at mcs.anl.gov> wrote:

>
>
> On 07/07/2014 10:34 AM, Junchao Zhang wrote:
>
>> Rob,
>>    Is it possible for me to install a debug version of PAMI on Mira? I read
>> the InstallReadme*_BGQ.txt. It is quite complex, and it looks like I need
>> root privileges.
>>    If it is possible, I can profile the code further.
>>
>
> I know man, that install process is crazy... It seems like one should be able
> to get a PAMI library out of comm/sys/pami by setting enough environment
> variables -- is there no configure process for PAMI?
>
> ==rob
>
>
>
>> --Junchao Zhang
>>
>>
>> On Mon, Jul 7, 2014 at 10:19 AM, Rob Latham <robl at mcs.anl.gov> wrote:
>>
>>
>>
>>     On 07/03/2014 04:45 PM, Jeff Hammond wrote:
>>
>>         PAMI is open-source via
>>         https://repo.anl-external.org/repos/bgq-driver/.
>>
>>         I believe ALCF has already reported this bug but you can contact
>>         support at alcf.anl.gov for an update.
>>
>>
>>     In a nice bit of circular logic, ALCF keeps trying to close that
>>     ticket saying "this is being discussed on the MPICH list".
>>
>>     Specifically to Jeff's point, the PAMI things are in
>>     bgq-VERSION-gpl.tar.gz
>>
>>     Junchao: you can find the implementation of
>>     PAMI_Geometry_create_taskrange in comm/sys/pami/api/c/pami.cc, but
>>     all it does is immediately call the object's create_taskrange
>>     member function, so now you have to find where *that* is...
>>
>>     ==rob
>>
>>
>>
>>         Best,
>>
>>         Jeff
>>
>>         On Thu, Jul 3, 2014 at 2:41 PM, Junchao Zhang
>>         <jczhang at mcs.anl.gov> wrote:
>>
>>             Hi, Sam,
>>                 I wrote micro-benchmarks for MPI_Comm_split/dup. My
>>             profiling results suggested the problem lies in an IBM PAMI
>>             library call, PAMI_Geometry_create_taskrange().  Unfortunately,
>>             I don't have access to the PAMI source code and don't know why.
>>             I reported it to IBM and hope IBM will fix it.
>>                 Alternatively, you can set the environment variable
>>             PAMID_COLLECTIVES=0 to disable PAMI collectives. My tests showed
>>             it at least fixed the scalability problem of Comm_split and
>>             Comm_dup.
>>                 Also, through profiling, I found the qsort() called in the
>>             MPICH code actually uses the merge sort algorithm in Mira's
>>             libc library.
>>
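>>             (For reference, a minimal sketch of the kind of timing
>>             micro-benchmark described above -- illustrative only; the
>>             actual harness, iteration counts, and color choice differ:)
>>
>>             /* comm_split_bench.c: time MPI_Comm_split and MPI_Comm_dup.
>>              * Compare a run with and without PAMID_COLLECTIVES=0 set in
>>              * the job environment to isolate the PAMI contribution. */
>>             #include <mpi.h>
>>             #include <stdio.h>
>>
>>             int main(int argc, char **argv)
>>             {
>>                 int rank, size;
>>                 MPI_Comm split_comm, dup_comm;
>>                 double t0, t_split, t_dup;
>>
>>                 MPI_Init(&argc, &argv);
>>                 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>                 MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>                 MPI_Barrier(MPI_COMM_WORLD);
>>                 t0 = MPI_Wtime();
>>                 MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &split_comm);
>>                 t_split = MPI_Wtime() - t0;
>>
>>                 MPI_Barrier(MPI_COMM_WORLD);
>>                 t0 = MPI_Wtime();
>>                 MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);
>>                 t_dup = MPI_Wtime() - t0;
>>
>>                 if (rank == 0)
>>                     printf("np=%d  split=%.6f s  dup=%.6f s\n",
>>                            size, t_split, t_dup);
>>
>>                 MPI_Comm_free(&split_comm);
>>                 MPI_Comm_free(&dup_comm);
>>                 MPI_Finalize();
>>                 return 0;
>>             }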
>>
>>
>>             --Junchao Zhang
>>
>>
>>             On Sat, May 17, 2014 at 9:06 AM, Sam Williams
>>             <swwilliams at lbl.gov> wrote:
>>
>>
>>                 I've been conducting scaling experiments on the Mira
>>                 (Blue Gene/Q) and K (SPARC) supercomputers.  I've noticed
>>                 that the time required for MPI_Comm_split and MPI_Comm_dup
>>                 can grow quickly with scale (~P^2).  As such, their
>>                 performance eventually becomes a bottleneck.  That is,
>>                 although the benefit of using a subcommunicator is huge
>>                 (multigrid solves are weak-scalable), the penalty of
>>                 creating one (multigrid build time) is also huge.
>>
>>                 For example, when scaling from 1 to 46K nodes (= cubes
>>                 of integers) on Mira, the time (in seconds) required to
>>                 build an MG solver (including a subcommunicator) scales
>>                 as follows:
>>                 222335.output:   Total time in MGBuild      0.056704
>>                 222336.output:   Total time in MGBuild      0.060834
>>                 222348.output:   Total time in MGBuild      0.064782
>>                 222349.output:   Total time in MGBuild      0.090229
>>                 222350.output:   Total time in MGBuild      0.075280
>>                 222351.output:   Total time in MGBuild      0.091852
>>                 222352.output:   Total time in MGBuild      0.137299
>>                 222411.output:   Total time in MGBuild      0.301552
>>                 222413.output:   Total time in MGBuild      0.606444
>>                 222415.output:   Total time in MGBuild      0.745272
>>                 222417.output:   Total time in MGBuild      0.779757
>>                 222418.output:   Total time in MGBuild      4.671838
>>                 222419.output:   Total time in MGBuild     15.123162
>>                 222420.output:   Total time in MGBuild     33.875626
>>                 222421.output:   Total time in MGBuild     49.494547
>>                 222422.output:   Total time in MGBuild    151.329026
>>
>>                 If I disable the call to MPI_Comm_split, my time scales as:
>>                 224982.output:   Total time in MGBuild      0.050143
>>                 224983.output:   Total time in MGBuild      0.052607
>>                 224988.output:   Total time in MGBuild      0.050697
>>                 224989.output:   Total time in MGBuild      0.078343
>>                 224990.output:   Total time in MGBuild      0.054634
>>                 224991.output:   Total time in MGBuild      0.052158
>>                 224992.output:   Total time in MGBuild      0.060286
>>                 225008.output:   Total time in MGBuild      0.062925
>>                 225009.output:   Total time in MGBuild      0.097357
>>                 225010.output:   Total time in MGBuild      0.061807
>>                 225011.output:   Total time in MGBuild      0.076617
>>                 225012.output:   Total time in MGBuild      0.099683
>>                 225013.output:   Total time in MGBuild      0.125580
>>                 225014.output:   Total time in MGBuild      0.190711
>>                 225016.output:   Total time in MGBuild      0.218329
>>                 225017.output:   Total time in MGBuild      0.282081
>>
>>                 Although I didn't directly measure it, this suggests the
>>                 time for MPI_Comm_split is growing roughly quadratically
>>                 with process concurrency.
>>
>>
>>
>>
>>                 I see the same effect on the K machine (8...64K nodes),
>>                 where the code uses comm_split and comm_dup together:
>>                 run00008_7_1.sh.o2412931:   Total time in MGBuild      0.026458 seconds
>>                 run00064_7_1.sh.o2415876:   Total time in MGBuild      0.039121 seconds
>>                 run00512_7_1.sh.o2415877:   Total time in MGBuild      0.086800 seconds
>>                 run01000_7_1.sh.o2414496:   Total time in MGBuild      0.129764 seconds
>>                 run01728_7_1.sh.o2415878:   Total time in MGBuild      0.224576 seconds
>>                 run04096_7_1.sh.o2415880:   Total time in MGBuild      0.738979 seconds
>>                 run08000_7_1.sh.o2414504:   Total time in MGBuild      2.123800 seconds
>>                 run13824_7_1.sh.o2415881:   Total time in MGBuild      6.276573 seconds
>>                 run21952_7_1.sh.o2415882:   Total time in MGBuild     13.634200 seconds
>>                 run32768_7_1.sh.o2415884:   Total time in MGBuild     36.508670 seconds
>>                 run46656_7_1.sh.o2415874:   Total time in MGBuild     58.668228 seconds
>>                 run64000_7_1.sh.o2415875:   Total time in MGBuild    117.322217 seconds
>>
>>
>>                 A glance at the implementation on Mira (I don't know if
>>                 the implementation on K is stock) suggests it uses qsort
>>                 to sort based on keys.  Unfortunately, qsort is not
>>                 performance-robust the way heap/merge sort is.  If one
>>                 were to call comm_split in the natural way, e.g.
>>                 MPI_Comm_split(..., mycolor, myrank, ...),
>>                 then one runs the risk that the keys are presorted.
>>                 This hits the worst-case computational complexity for
>>                 qsort: O(P^2).  Demanding that programmers avoid sending
>>                 sorted keys seems unreasonable.
>>
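>>                 (Concretely, the pattern I mean is sketched below; the
>>                 grouping of 64 ranks per subcommunicator is purely
>>                 hypothetical.  The point is that reusing myrank as the key
>>                 hands the sort keys that are already in ascending order:)
>>
>>                 #include <mpi.h>
>>
>>                 /* Reusing the rank as the key preserves the callers'
>>                  * relative order, but every color's keys then arrive
>>                  * presorted -- the worst case for a naively pivoted
>>                  * quicksort inside MPI_Comm_split. */
>>                 MPI_Comm make_subcomm(void)
>>                 {
>>                     int myrank, mycolor;
>>                     MPI_Comm subcomm;
>>
>>                     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>>                     mycolor = myrank / 64;   /* hypothetical grouping */
>>                     MPI_Comm_split(MPI_COMM_WORLD, mycolor, myrank, &subcomm);
>>                     return subcomm;
>>                 }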
>>
>>                 I should note that I see a similar lack of scaling with
>>                 MPI_Comm_dup on the K machine.  Unfortunately, my BGQ data
>>                 used an earlier version of the code that did not use
>>                 comm_dup.  As such, I can't definitively say that it is a
>>                 problem on that machine as well.
>>
>>                 Thus, I'm asking that scalable implementations of
>>                 comm_split/dup, using a merge/heap sort whose worst-case
>>                 complexity is still O(P log P), be prioritized in the
>>                 next update.
>>
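>>                 (For illustration, an in-place heapsort over the gathered
>>                 table of (color, key, rank) entries gives that bound; the
>>                 struct and function names below are illustrative and not
>>                 MPICH's actual internals:)
>>
>>                 #include <stddef.h>
>>
>>                 /* One gathered entry per process in the old communicator. */
>>                 typedef struct {
>>                     int color;
>>                     int key;
>>                     int rank;            /* rank in the old communicator */
>>                 } splitentry;
>>
>>                 /* Order by color, then key; ties broken by old rank, as
>>                  * MPI_Comm_split requires. */
>>                 static int cmp(const splitentry *a, const splitentry *b)
>>                 {
>>                     if (a->color != b->color) return a->color < b->color ? -1 : 1;
>>                     if (a->key   != b->key)   return a->key   < b->key   ? -1 : 1;
>>                     return a->rank < b->rank ? -1 : (a->rank > b->rank);
>>                 }
>>
>>                 static void sift_down(splitentry *t, size_t root, size_t end)
>>                 {
>>                     while (2 * root + 1 <= end) {
>>                         size_t child = 2 * root + 1;
>>                         if (child + 1 <= end && cmp(&t[child], &t[child + 1]) < 0)
>>                             child++;
>>                         if (cmp(&t[root], &t[child]) >= 0)
>>                             return;
>>                         splitentry tmp = t[root]; t[root] = t[child]; t[child] = tmp;
>>                         root = child;
>>                     }
>>                 }
>>
>>                 /* In-place heapsort: O(P log P) worst case regardless of
>>                  * input order, so presorted keys no longer go quadratic. */
>>                 static void sort_entries(splitentry *t, size_t n)
>>                 {
>>                     if (n < 2) return;
>>                     for (size_t i = (n - 2) / 2 + 1; i-- > 0; )
>>                         sift_down(t, i, n - 1);
>>                     for (size_t end = n - 1; end > 0; end--) {
>>                         splitentry tmp = t[0]; t[0] = t[end]; t[end] = tmp;
>>                         sift_down(t, 0, end - 1);
>>                     }
>>                 }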
>>
>>                 thanks
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>     --
>>     Rob Latham
>>     Mathematics and Computer Science Division
>>     Argonne National Lab, IL USA
>>
>>
>>
>>
>>
>>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
> _______________________________________________
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/devel
>


-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/

