[mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K supercomputers
Jeff Hammond
jeff.science at gmail.com
Mon Jul 7 12:19:18 CDT 2014
Finding the task range impl requires thinking in C++, but it's not hard.
Just grep your way down a few layers.
Jeff
On Monday, July 7, 2014, Rob Latham <robl at mcs.anl.gov> wrote:
>
>
> On 07/07/2014 10:34 AM, Junchao Zhang wrote:
>
>> Rob,
>> Is it possible for me to install a debug version of PAMI on Mira? I read
>> the InstallReadme*_BGQ.txt. It is quite complex, and it looks like I need
>> root privileges.
>> If it is possible, I can profile the code further.
>>
>
> I know, man, that install process is crazy... it seems like one should be
> able to get a PAMI library out of comm/sys/pami by setting enough
> environment variables -- is there no configure process for PAMI?
>
> ==rob
>
>
>
>> --Junchao Zhang
>>
>>
>> On Mon, Jul 7, 2014 at 10:19 AM, Rob Latham <robl at mcs.anl.gov> wrote:
>>
>>
>>
>> On 07/03/2014 04:45 PM, Jeff Hammond wrote:
>>
>> PAMI is open-source via https://repo.anl-external.org/repos/bgq-driver/.
>>
>> I believe ALCF has already reported this bug, but you can contact
>> support at alcf.anl.gov for an update.
>>
>>
>> In a nice bit of circular logic, ALCF keeps trying to close that
>> ticket, saying "this is being discussed on the MPICH list".
>>
>> Specifically to Jeff's point, the PAMI things are in
>> bgq-VERSION-gpl.tar.gz.
>>
>> Junchao: you can find the implementation of
>> PAMI_Geometry_create_taskrange in comm/sys/pami/api/c/pami.cc, but all
>> it does is immediately call the object's create_taskrange() member
>> function, so now you have to find where *that* is...
>>
>> ==rob
>>
>>
>>
>> Best,
>>
>> Jeff
>>
>> On Thu, Jul 3, 2014 at 2:41 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>>
>> Hi, Sam,
>> I wrote micro-benchmarks for MPI_Comm_split/dup. My profiling results
>> suggested the problem lies in an IBM PAMI library call,
>> PAMI_Geometry_create_taskrange(). Unfortunately, I don't have access
>> to the PAMI source code and don't know why. I reported it to IBM and
>> hope IBM will fix it.
>> Alternatively, you can set the environment variable
>> PAMID_COLLECTIVES=0 to disable PAMI collectives. My tests showed that
>> it at least fixed the scalability problem of Comm_split and Comm_dup.
>> Also, through profiling, I found that the qsort() called in the MPICH
>> code actually uses the merge sort algorithm in Mira's libc.
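For reference, a minimal micro-benchmark in this spirit might look like the
sketch below. This is illustrative only, not Junchao's actual benchmark; the
two-way split and the use of the world rank as the key are assumptions.
Running the same binary with and without PAMID_COLLECTIVES=0 set in the job
environment makes it easy to compare the two paths.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* time MPI_Comm_dup of the world communicator */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Comm dup;
        MPI_Comm_dup(MPI_COMM_WORLD, &dup);
        double t_dup = MPI_Wtime() - t0;

        /* time MPI_Comm_split into two halves, world rank as the key */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        MPI_Comm half;
        MPI_Comm_split(MPI_COMM_WORLD, rank < size / 2 ? 0 : 1, rank, &half);
        double t_split = MPI_Wtime() - t0;

        /* report the slowest process, which is what the application sees */
        double max_dup, max_split;
        MPI_Reduce(&t_dup,   &max_dup,   1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&t_split, &max_split, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("np=%d  dup=%.6f s  split=%.6f s\n", size, max_dup, max_split);

        MPI_Comm_free(&half);
        MPI_Comm_free(&dup);
        MPI_Finalize();
        return 0;
    }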
>>
>>
>>
>> --Junchao Zhang
>>
>>
>> On Sat, May 17, 2014 at 9:06 AM, Sam Williams <swwilliams at lbl.gov> wrote:
>>
>>
>> I've been conducting scaling experiments on the Mira (Blue Gene/Q) and
>> K (SPARC) supercomputers. I've noticed that the time required for
>> MPI_Comm_split and MPI_Comm_dup can grow quickly with scale (~P^2). As
>> such, their performance eventually becomes a bottleneck. That is,
>> although the benefit of using a subcommunicator is huge (multigrid
>> solves are weak-scalable), the penalty of creating one (multigrid
>> build time) is also huge.
>>
>> For example, when scaling from 1 to 46K nodes (= cubes of integers) on
>> Mira, the time (in seconds) required to build an MG solver (including
>> a subcommunicator) scales as
>> 222335.output: Total time in MGBuild 0.056704
>> 222336.output: Total time in MGBuild 0.060834
>> 222348.output: Total time in MGBuild 0.064782
>> 222349.output: Total time in MGBuild 0.090229
>> 222350.output: Total time in MGBuild 0.075280
>> 222351.output: Total time in MGBuild 0.091852
>> 222352.output: Total time in MGBuild 0.137299
>> 222411.output: Total time in MGBuild 0.301552
>> 222413.output: Total time in MGBuild 0.606444
>> 222415.output: Total time in MGBuild 0.745272
>> 222417.output: Total time in MGBuild 0.779757
>> 222418.output: Total time in MGBuild 4.671838
>> 222419.output: Total time in MGBuild 15.123162
>> 222420.output: Total time in MGBuild 33.875626
>> 222421.output: Total time in MGBuild 49.494547
>> 222422.output: Total time in MGBuild 151.329026
>>
>> If I disable the call to MPI_Comm_split, my time scales as
>> 224982.output: Total time in MGBuild 0.050143
>> 224983.output: Total time in MGBuild 0.052607
>> 224988.output: Total time in MGBuild 0.050697
>> 224989.output: Total time in MGBuild 0.078343
>> 224990.output: Total time in MGBuild 0.054634
>> 224991.output: Total time in MGBuild 0.052158
>> 224992.output: Total time in MGBuild 0.060286
>> 225008.output: Total time in MGBuild 0.062925
>> 225009.output: Total time in MGBuild 0.097357
>> 225010.output: Total time in MGBuild 0.061807
>> 225011.output: Total time in MGBuild 0.076617
>> 225012.output: Total time in MGBuild 0.099683
>> 225013.output: Total time in MGBuild 0.125580
>> 225014.output: Total time in MGBuild 0.190711
>> 225016.output: Total time in MGBuild 0.218329
>> 225017.output: Total time in MGBuild 0.282081
>>
>> Although I didn't directly measure it, this suggests the time for
>> MPI_Comm_split is growing roughly quadratically with process
>> concurrency.
>>
>>
>>
>>
>> I see the same effect on the K machine (8...64K nodes), where the code
>> uses comm_split and comm_dup in conjunction:
>> run00008_7_1.sh.o2412931: Total time in MGBuild   0.026458 seconds
>> run00064_7_1.sh.o2415876: Total time in MGBuild   0.039121 seconds
>> run00512_7_1.sh.o2415877: Total time in MGBuild   0.086800 seconds
>> run01000_7_1.sh.o2414496: Total time in MGBuild   0.129764 seconds
>> run01728_7_1.sh.o2415878: Total time in MGBuild   0.224576 seconds
>> run04096_7_1.sh.o2415880: Total time in MGBuild   0.738979 seconds
>> run08000_7_1.sh.o2414504: Total time in MGBuild   2.123800 seconds
>> run13824_7_1.sh.o2415881: Total time in MGBuild   6.276573 seconds
>> run21952_7_1.sh.o2415882: Total time in MGBuild  13.634200 seconds
>> run32768_7_1.sh.o2415884: Total time in MGBuild  36.508670 seconds
>> run46656_7_1.sh.o2415874: Total time in MGBuild  58.668228 seconds
>> run64000_7_1.sh.o2415875: Total time in MGBuild 117.322217 seconds
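As a rough check on the quadratic trend in these numbers: going from 8000 to
64000 nodes is an 8x increase in node count, and the quoted MGBuild time grows
from about 2.12 s to about 117.3 s, a ~55x increase -- in the same ballpark as
the 64x a P^2 model would predict.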
>>
>>
>> A glance at the implementation on Mira (I don't know if the
>> implementation on K is stock) suggests it should be using qsort to
>> sort based on keys. Unfortunately, qsort is not performance-robust the
>> way heap sort or merge sort is. If one were to be productive and call
>> comm_split like...
>> MPI_Comm_split(..., mycolor, myrank, ...)
>> then one runs the risk that the keys are presorted. This hits the
>> worst-case computational complexity for qsort... O(P^2). Demanding
>> that programmers avoid sending sorted keys seems unreasonable.
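To make that scenario concrete, a hypothetical (but typical) call pattern is
sketched below; the block size of 64 is an arbitrary illustrative choice, not
anything from Sam's code. Because every process passes its world rank as the
key, the keys for each color reach the implementation already sorted.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int myrank;
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        /* one subcommunicator per block of 64 consecutive ranks; keeping
           the world rank as the key preserves the existing order, so the
           keys are presorted within each color */
        int mycolor = myrank / 64;
        MPI_Comm subcomm;
        MPI_Comm_split(MPI_COMM_WORLD, mycolor, myrank, &subcomm);

        MPI_Comm_free(&subcomm);
        MPI_Finalize();
        return 0;
    }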
>>
>>
>> I should note, I see a similar lack of scaling with MPI_Comm_dup on
>> the K machine. Unfortunately, my BGQ data used an earlier version of
>> the code that did not use comm_dup. As such, I can't definitively say
>> that it is a problem on that machine as well.
>>
>> Thus, I'm asking that scalable implementations of comm_split/dup,
>> using merge/heap sort so that the worst-case complexity is still
>> O(P log P), be prioritized in the next update.
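For illustration, here is a sketch of the kind of O(P log P) worst-case sort
being asked for: a stable bottom-up merge sort over (key, rank) pairs. This is
not MPICH code, just one way an implementation could order the members of each
new communicator without qsort's quadratic worst case.

    #include <stdlib.h>
    #include <string.h>

    typedef struct { int key; int rank; } kr_t;

    /* order by key, breaking ties by rank in the parent communicator,
       which is the ordering MPI_Comm_split requires */
    static int kr_less(const kr_t *a, const kr_t *b)
    {
        if (a->key != b->key) return a->key < b->key;
        return a->rank < b->rank;
    }

    /* stable bottom-up merge sort: O(n log n) comparisons even on
       presorted (or adversarial) input */
    static void sort_key_rank(kr_t *a, int n)
    {
        if (n < 2) return;
        kr_t *tmp = malloc((size_t)n * sizeof *tmp);
        if (!tmp) return;            /* real code would report the failure */

        for (int width = 1; width < n; width *= 2) {
            for (int lo = 0; lo < n; lo += 2 * width) {
                int mid = lo + width     < n ? lo + width     : n;
                int hi  = lo + 2 * width < n ? lo + 2 * width : n;
                int i = lo, j = mid, k = lo;
                /* merge runs [lo,mid) and [mid,hi); ties take the left
                   element first, which keeps the sort stable */
                while (i < mid && j < hi)
                    tmp[k++] = kr_less(&a[j], &a[i]) ? a[j++] : a[i++];
                while (i < mid) tmp[k++] = a[i++];
                while (j < hi)  tmp[k++] = a[j++];
            }
            memcpy(a, tmp, (size_t)n * sizeof *a);
        }
        free(tmp);
    }

Roughly speaking, after an implementation has gathered every participant's
(color, key) pair, sorting the (key, rank) pairs for one color with something
like this would define the rank order in the new communicator.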
>>
>>
>> thanks
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA
>>
>>
>>
>>
>>
>>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
>
--
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/