[mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K supercomputers
Rob Latham
robl at mcs.anl.gov
Mon Jul 7 11:04:21 CDT 2014
On 07/07/2014 10:34 AM, Junchao Zhang wrote:
> Rob,
> Is it possible for me to install a debug version of PAMI on Mira? I read
> the InstallReadme*_BGQ.txt. It is quite complex, and it looks like I need
> root privileges.
> If it is possible, I can profile the code further.
I know, that install process is crazy... it seems like one should be
able to get a PAMI library out of comm/sys/pami by setting enough
environment variables -- is there really no configure process for PAMI?
==rob
>
> --Junchao Zhang
>
>
> On Mon, Jul 7, 2014 at 10:19 AM, Rob Latham <robl at mcs.anl.gov> wrote:
>
>
>
> On 07/03/2014 04:45 PM, Jeff Hammond wrote:
>
> PAMI is open-source via
> https://repo.anl-external.org/repos/bgq-driver/.
>
> I believe ALCF has already reported this bug but you can contact
> support at alcf.anl.gov for an update.
>
>
> In a nice bit of circular logic, ALCF keeps trying to close that
> ticket saying "this is being discussed on the MPICH list".
>
> Specifically to Jeff's point, the PAMI things are in
> bgq-VERSION-gpl.tar.gz
>
> Junchao: you can find the implementation of
> PAMI_Geometry_create_taskrange in comm/sys/pami/api/c/pami.cc, but
> all it does is immediately call the object's create_taskrange()
> member function, so now you have to find where *that* is...
>
> ==rob
>
>
>
> Best,
>
> Jeff
>
> On Thu, Jul 3, 2014 at 2:41 PM, Junchao Zhang
> <jczhang at mcs.anl.gov> wrote:
>
> Hi, Sam,
> I wrote micro-benchmarks for MPI_Comm_split/dup (a minimal sketch is
> at the end of this message). My profiling results suggested the
> problem lies in an IBM PAMI library call,
> PAMI_Geometry_create_taskrange(). Unfortunately, I don't have access
> to the PAMI source code and don't know why. I reported it to IBM and
> hope IBM will fix it.
> Alternatively, you can set the environment variable
> PAMID_COLLECTIVES=0 to disable PAMI collectives. My tests showed that
> this at least fixes the scalability problem in Comm_split and
> Comm_dup.
> Also, through profiling I found that the qsort() called in the MPICH
> code actually uses the merge sort algorithm in Mira's libc.
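>
> Here is the minimal sketch of the micro-benchmark I mentioned above;
> the half-and-half color choice and the single timed call per function
> are illustrative, not my exact code:
>
>     #include <mpi.h>
>     #include <stdio.h>
>
>     int main(int argc, char **argv)
>     {
>         int rank, size;
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>         /* Split the world into two halves, keyed by world rank. */
>         int color = (rank < size / 2) ? 0 : 1;
>         MPI_Comm split_comm, dup_comm;
>
>         MPI_Barrier(MPI_COMM_WORLD);
>         double t0 = MPI_Wtime();
>         MPI_Comm_split(MPI_COMM_WORLD, color, rank, &split_comm);
>         double t_split = MPI_Wtime() - t0;
>
>         MPI_Barrier(MPI_COMM_WORLD);
>         t0 = MPI_Wtime();
>         MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);
>         double t_dup = MPI_Wtime() - t0;
>
>         if (rank == 0)
>             printf("P=%d  split %.6f s  dup %.6f s\n", size, t_split, t_dup);
>
>         MPI_Comm_free(&split_comm);
>         MPI_Comm_free(&dup_comm);
>         MPI_Finalize();
>         return 0;
>     }
>
> Taking the maximum time over all ranks (e.g., with an MPI_Reduce using
> MPI_MAX) gives a more robust number at scale than rank 0's time alone.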
>
>
>
> --Junchao Zhang
>
>
> On Sat, May 17, 2014 at 9:06 AM, Sam Williams
> <swwilliams at lbl.gov> wrote:
>
>
> I've been conducting scaling experiments on the Mira
> (Blue Gene/Q) and K (SPARC) supercomputers. I've noticed that the
> time required for MPI_Comm_split and MPI_Comm_dup can grow quickly
> with scale (~P^2). As such, their performance eventually becomes a
> bottleneck. That is, although the benefit of using a subcommunicator
> is huge (multigrid solves are weak-scalable), the penalty of creating
> one (multigrid build time) is also huge.
>
> For example, when scaling from 1 to 46K nodes (node counts
> are cubes of integers) on Mira, the time (in seconds) required to
> build an MG solver (including a subcommunicator) scales as
> 222335.output: Total time in MGBuild 0.056704
> 222336.output: Total time in MGBuild 0.060834
> 222348.output: Total time in MGBuild 0.064782
> 222349.output: Total time in MGBuild 0.090229
> 222350.output: Total time in MGBuild 0.075280
> 222351.output: Total time in MGBuild 0.091852
> 222352.output: Total time in MGBuild 0.137299
> 222411.output: Total time in MGBuild 0.301552
> 222413.output: Total time in MGBuild 0.606444
> 222415.output: Total time in MGBuild 0.745272
> 222417.output: Total time in MGBuild 0.779757
> 222418.output: Total time in MGBuild 4.671838
> 222419.output: Total time in MGBuild 15.123162
> 222420.output: Total time in MGBuild 33.875626
> 222421.output: Total time in MGBuild 49.494547
> 222422.output: Total time in MGBuild 151.329026
>
> If I disable the call to MPI_Comm_split, my time scales as
> 224982.output: Total time in MGBuild 0.050143
> 224983.output: Total time in MGBuild 0.052607
> 224988.output: Total time in MGBuild 0.050697
> 224989.output: Total time in MGBuild 0.078343
> 224990.output: Total time in MGBuild 0.054634
> 224991.output: Total time in MGBuild 0.052158
> 224992.output: Total time in MGBuild 0.060286
> 225008.output: Total time in MGBuild 0.062925
> 225009.output: Total time in MGBuild 0.097357
> 225010.output: Total time in MGBuild 0.061807
> 225011.output: Total time in MGBuild 0.076617
> 225012.output: Total time in MGBuild 0.099683
> 225013.output: Total time in MGBuild 0.125580
> 225014.output: Total time in MGBuild 0.190711
> 225016.output: Total time in MGBuild 0.218329
> 225017.output: Total time in MGBuild 0.282081
>
> Although I didn't directly measure it, this suggests the
> time for MPI_Comm_split is growing roughly quadratically with
> process concurrency.
>
>
>
>
> I see the same effect on the K machine (8 to 64K nodes),
> where the code uses comm_split and comm_dup in conjunction:
> run00008_7_1.sh.o2412931: Total time in MGBuild   0.026458 seconds
> run00064_7_1.sh.o2415876: Total time in MGBuild   0.039121 seconds
> run00512_7_1.sh.o2415877: Total time in MGBuild   0.086800 seconds
> run01000_7_1.sh.o2414496: Total time in MGBuild   0.129764 seconds
> run01728_7_1.sh.o2415878: Total time in MGBuild   0.224576 seconds
> run04096_7_1.sh.o2415880: Total time in MGBuild   0.738979 seconds
> run08000_7_1.sh.o2414504: Total time in MGBuild   2.123800 seconds
> run13824_7_1.sh.o2415881: Total time in MGBuild   6.276573 seconds
> run21952_7_1.sh.o2415882: Total time in MGBuild  13.634200 seconds
> run32768_7_1.sh.o2415884: Total time in MGBuild  36.508670 seconds
> run46656_7_1.sh.o2415874: Total time in MGBuild  58.668228 seconds
> run64000_7_1.sh.o2415875: Total time in MGBuild 117.322217 seconds
>
>
> A glance at the implementation on Mira (I don't know if
> the implementation on K is stock) suggests it is using qsort to sort
> based on the keys. Unfortunately, qsort is not performance-robust the
> way heap or merge sort is. If one calls comm_split in the natural,
> productive way,
> MPI_Comm_split(..., mycolor, myrank, ...)
> then one runs the risk that the keys are presorted. This hits the
> worst-case computational complexity for qsort: O(P^2). Demanding that
> programmers avoid passing sorted keys seems unreasonable.
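>
> For concreteness, here is roughly what that "productive" call pattern
> looks like (the plane size and variable names are illustrative):
>
>     #include <mpi.h>
>
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         int myrank;
>         MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>
>         /* Group ranks into planes of a process grid; 64 ranks per
>          * plane is just an example. */
>         int mycolor = myrank / 64;
>
>         /* Using the world rank as the key is the obvious choice, and
>          * it hands MPI_Comm_split keys that are already sorted within
>          * each color -- the presorted case described above. */
>         MPI_Comm subcomm;
>         MPI_Comm_split(MPI_COMM_WORLD, mycolor, myrank, &subcomm);
>
>         MPI_Comm_free(&subcomm);
>         MPI_Finalize();
>         return 0;
>     }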
>
>
> I should note, I see a similar lack of scaling with
> MPI_Comm_dup on the K
> machine. Unfortunately, my BGQ data used an earlier
> version of the code
> that did not use comm_dup. As such, I can’t
> definitively say that it is a
> problem on that machine as well.
>
> Thus, I'm asking that scalable implementations of
> comm_split/dup, using merge or heap sort so that the worst-case
> complexity is still O(P log P), be prioritized in the next update.
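>
> To be concrete about the suggestion, here is a minimal sketch (not
> MPICH's actual code) of sorting the gathered (key, old-rank) pairs
> with a bottom-up merge sort, which stays O(P log P) even when the
> keys arrive presorted; a stable sort keeps equal keys in their
> gathered (parent-rank) order, matching comm_split's tie-breaking rule:
>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>
>     /* One (key, old-rank) pair per process in a color, as comm_split
>      * would gather them. */
>     typedef struct { int key; int rank; } keyrank;
>
>     /* Bottom-up merge sort: O(n log n) comparisons in every case,
>      * including already-sorted input. */
>     static void merge_sort_keyrank(keyrank *a, int n)
>     {
>         keyrank *tmp = malloc((size_t)n * sizeof(*tmp));
>         if (!tmp) return;   /* real code would report the failure */
>         for (int width = 1; width < n; width *= 2) {
>             for (int lo = 0; lo < n; lo += 2 * width) {
>                 int mid = lo + width     < n ? lo + width     : n;
>                 int hi  = lo + 2 * width < n ? lo + 2 * width : n;
>                 int i = lo, j = mid, k = lo;
>                 /* Stable merge of a[lo..mid) and a[mid..hi). */
>                 while (i < mid && j < hi)
>                     tmp[k++] = (a[j].key < a[i].key) ? a[j++] : a[i++];
>                 while (i < mid) tmp[k++] = a[i++];
>                 while (j < hi)  tmp[k++] = a[j++];
>             }
>             memcpy(a, tmp, (size_t)n * sizeof(*a));
>         }
>         free(tmp);
>     }
>
>     int main(void)
>     {
>         /* Presorted keys: the qsort worst case described above. */
>         keyrank a[6] = {{0,0},{1,1},{2,2},{3,3},{4,4},{5,5}};
>         merge_sort_keyrank(a, 6);
>         for (int i = 0; i < 6; i++)
>             printf("key=%d rank=%d\n", a[i].key, a[i].rank);
>         return 0;
>     }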
>
>
> thanks
>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
>
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA