[mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K supercomputers

Mon Jul 7 10:34:50 CDT 2014

Rob,
  Is it possible for me to install a debug version of PAMI on Mira? I read
InstallReadme*_BGQ.txt. It is quite complex, and it looks like I need root
privilege.
  If it is possible, I can profile the code further.

--Junchao Zhang

On Mon, Jul 7, 2014 at 10:19 AM, Rob Latham <robl at mcs.anl.gov> wrote:

>
>
> On 07/03/2014 04:45 PM, Jeff Hammond wrote:
>
>> PAMI is open-source via https://repo.anl-external.org/repos/bgq-driver/.
>>
>> I believe ALCF has already reported this bug but you can contact
>> support at alcf.anl.gov for an update.
>>
>
> in a nice bit of circular logic, ALCF keeps trying to close that ticket
> saying "this is being discussed on the MPICH list".
>
> Specifically to Jeff's point, the PAMI things are in bgq-VERSION-gpl.tar.gz
>
> Junchao: you can find the implementation of PAMI_Geometry_create_taskrange
> in comm/sys/pami/api/c/pami.cc, but all it does is immediately call the
> object's create_taskrange member function, so now you have to find where
> *that* is...
>
> ==rob
>
>
>
>> Best,
>>
>> Jeff
>>
>> On Thu, Jul 3, 2014 at 2:41 PM, Junchao Zhang <jczhang at mcs.anl.gov>
>> wrote:
>>
>>> Hi, Sam,
>>>    I wrote micro-benchmarks for MPI_Comm_split/dup. My profiling results
>>> suggested the problem lies in an IBM PAMI library call,
>>> PAMI_Geometry_create_taskrange().  Unfortunately, I don't have access to
>>> the PAMI source code and don't know why it is slow. I reported it to IBM
>>> and hope IBM will fix it.
>>>    Alternatively, you can set the environment variable PAMID_COLLECTIVES=0
>>> to disable PAMI collectives. My tests showed that this at least fixes the
>>> scalability problem of Comm_split and Comm_dup.
>>>    Also through profiling, I found that the qsort() called in the MPICH
>>> code actually uses the merge sort algorithm in Mira's libc library.
>>>
>>>
>>>
>>> --Junchao Zhang
>>>
>>>
>>> On Sat, May 17, 2014 at 9:06 AM, Sam Williams <swwilliams at lbl.gov>
>>> wrote:
>>>
>>>>
>>>> I've been conducting scaling experiments on the Mira (Blue Gene/Q) and K
>>>> (SPARC) supercomputers.  I've noticed that the time required for
>>>> MPI_Comm_split and MPI_Comm_dup can grow quickly with scale (~P^2).  As
>>>> such, their cost eventually becomes a bottleneck.  That is, although the
>>>> benefit of using a subcommunicator is huge (multigrid solves are
>>>> weak-scalable), the penalty of creating one (multigrid build time) is
>>>> also huge.
>>>>
>>>> For example, when scaling from 1 to 46K nodes (= cubes of integers) on
>>>> Mira, the time (in seconds) required to build a MG solver (including a
>>>> subcommunicator) scales as
>>>> 222335.output:   Total time in MGBuild      0.056704
>>>> 222336.output:   Total time in MGBuild      0.060834
>>>> 222348.output:   Total time in MGBuild      0.064782
>>>> 222349.output:   Total time in MGBuild      0.090229
>>>> 222350.output:   Total time in MGBuild      0.075280
>>>> 222351.output:   Total time in MGBuild      0.091852
>>>> 222352.output:   Total time in MGBuild      0.137299
>>>> 222411.output:   Total time in MGBuild      0.301552
>>>> 222413.output:   Total time in MGBuild      0.606444
>>>> 222415.output:   Total time in MGBuild      0.745272
>>>> 222417.output:   Total time in MGBuild      0.779757
>>>> 222418.output:   Total time in MGBuild      4.671838
>>>> 222419.output:   Total time in MGBuild     15.123162
>>>> 222420.output:   Total time in MGBuild     33.875626
>>>> 222421.output:   Total time in MGBuild     49.494547
>>>> 222422.output:   Total time in MGBuild    151.329026
>>>>
>>>> If I disable the call to MPI_Comm_split, my time scales as
>>>> 224982.output:   Total time in MGBuild      0.050143
>>>> 224983.output:   Total time in MGBuild      0.052607
>>>> 224988.output:   Total time in MGBuild      0.050697
>>>> 224989.output:   Total time in MGBuild      0.078343
>>>> 224990.output:   Total time in MGBuild      0.054634
>>>> 224991.output:   Total time in MGBuild      0.052158
>>>> 224992.output:   Total time in MGBuild      0.060286
>>>> 225008.output:   Total time in MGBuild      0.062925
>>>> 225009.output:   Total time in MGBuild      0.097357
>>>> 225010.output:   Total time in MGBuild      0.061807
>>>> 225011.output:   Total time in MGBuild      0.076617
>>>> 225012.output:   Total time in MGBuild      0.099683
>>>> 225013.output:   Total time in MGBuild      0.125580
>>>> 225014.output:   Total time in MGBuild      0.190711
>>>> 225016.output:   Total time in MGBuild      0.218329
>>>> 225017.output:   Total time in MGBuild      0.282081
>>>>
>>>> Although I didn't directly measure it, this suggests the time for
>>>> MPI_Comm_split is growing roughly quadratically with process
>>>> concurrency.
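
The subtraction implied above can be sketched as follows. This is only a rough estimate: it assumes the two lists line up run for run (same node counts, same order), which the message implies but does not state, and the name `split_overhead` is illustrative only.

```python
# MGBuild times (seconds) copied from the two Mira lists quoted above.
with_split = [
    0.056704, 0.060834, 0.064782, 0.090229, 0.075280, 0.091852,
    0.137299, 0.301552, 0.606444, 0.745272, 0.779757, 4.671838,
    15.123162, 33.875626, 49.494547, 151.329026,
]
without_split = [
    0.050143, 0.052607, 0.050697, 0.078343, 0.054634, 0.052158,
    0.060286, 0.062925, 0.097357, 0.061807, 0.076617, 0.099683,
    0.125580, 0.190711, 0.218329, 0.282081,
]


def split_overhead(with_times, without_times):
    """Per-run difference, taken as a rough estimate of the comm_split cost.

    Assumption: entry i of both lists was measured at the same node count.
    """
    return [w - wo for w, wo in zip(with_times, without_times)]


if __name__ == "__main__":
    for total, extra in zip(with_split, split_overhead(with_split, without_split)):
        print(f"total {total:10.6f} s   attributed to split {extra:10.6f} s "
              f"({100.0 * extra / total:5.1f}%)")
```

Under that pairing assumption, nearly all of the largest run's build time is attributed to the split.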
>>>>
>>>>
>>>>
>>>>
>>>> I see the same effect on the K machine (8...64K nodes), where the code
>>>> uses comm_split/dup in conjunction:
>>>> run00008_7_1.sh.o2412931:   Total time in MGBuild      0.026458 seconds
>>>> run00064_7_1.sh.o2415876:   Total time in MGBuild      0.039121 seconds
>>>> run00512_7_1.sh.o2415877:   Total time in MGBuild      0.086800 seconds
>>>> run01000_7_1.sh.o2414496:   Total time in MGBuild      0.129764 seconds
>>>> run01728_7_1.sh.o2415878:   Total time in MGBuild      0.224576 seconds
>>>> run04096_7_1.sh.o2415880:   Total time in MGBuild      0.738979 seconds
>>>> run08000_7_1.sh.o2414504:   Total time in MGBuild      2.123800 seconds
>>>> run13824_7_1.sh.o2415881:   Total time in MGBuild      6.276573 seconds
>>>> run21952_7_1.sh.o2415882:   Total time in MGBuild     13.634200 seconds
>>>> run32768_7_1.sh.o2415884:   Total time in MGBuild     36.508670 seconds
>>>> run46656_7_1.sh.o2415874:   Total time in MGBuild     58.668228 seconds
>>>> run64000_7_1.sh.o2415875:   Total time in MGBuild    117.322217 seconds
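
One quick way to check the ~P^2 claim against the K numbers is to compare each step's observed slowdown with what a pure quadratic model would predict. The sketch below assumes the number embedded in each run file name is the node count (consistent with "8...64K nodes", but not stated outright). It also treats the whole MGBuild time as if it scaled like the communicator setup, which overstates things at small scale. `growth_vs_quadratic` is an illustrative name.

```python
# (nodes, MGBuild seconds) from the K list above; node counts are assumed to
# be the number embedded in each run file name.
k_runs = [
    (8, 0.026458), (64, 0.039121), (512, 0.086800), (1000, 0.129764),
    (1728, 0.224576), (4096, 0.738979), (8000, 2.123800), (13824, 6.276573),
    (21952, 13.634200), (32768, 36.508670), (46656, 58.668228),
    (64000, 117.322217),
]


def growth_vs_quadratic(runs):
    """For each consecutive pair of runs, return (p0, p1, observed, quadratic).

    observed  = t1 / t0          (measured slowdown)
    quadratic = (p1 / p0) ** 2   (slowdown a pure P^2 cost would give)
    """
    rows = []
    for (p0, t0), (p1, t1) in zip(runs, runs[1:]):
        rows.append((p0, p1, t1 / t0, (p1 / p0) ** 2))
    return rows


if __name__ == "__main__":
    for p0, p1, observed, quadratic in growth_vs_quadratic(k_runs):
        print(f"{p0:>6} -> {p1:>6}: time x{observed:6.2f}, "
              f"P^2 model x{quadratic:6.2f}")
```

If the two columns track each other at the large end, the data is at least consistent with quadratic growth. A proper fit would need the split time measured on its own.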
>>>>
>>>>
>>>> A glance at the implementation on Mira (I don't know if the
>>>> implementation on K is stock) suggests it should be using qsort to sort
>>>> based on keys.  Unfortunately, qsort is not performance robust like
>>>> heap/merge sort.  If one were to be productive and call comm_split like...
>>>> MPI_Comm_split(...,mycolor,myrank,...)
>>>> then one runs the risk that the keys are presorted.  This hits the worst
>>>> case computational complexity for qsort... O(P^2).  Demanding that
>>>> programmers avoid sending sorted keys seems unreasonable.
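
To illustrate the concern, here is a small sketch that counts comparisons on presorted keys for a textbook quicksort (first element as pivot) and for a top-down merge sort. Neither is libc's qsort() or the MPICH code; both are stand-ins written for this example. As noted elsewhere in this thread, profiling showed that qsort() in Mira's libc actually uses merge sort, so this shows the worst case being described rather than what Mira does.

```python
def quicksort_comparisons(keys):
    """Count key comparisons made by a textbook quicksort (pivot = first item)."""
    a = list(keys)
    comparisons = 0
    stack = [(0, len(a) - 1)]            # explicit stack avoids deep recursion
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = a[lo]
        i = lo
        for j in range(lo + 1, hi + 1):  # partition a[lo+1..hi] around pivot
            comparisons += 1
            if a[j] < pivot:
                i += 1
                a[i], a[j] = a[j], a[i]
        a[lo], a[i] = a[i], a[lo]        # move pivot to its final position
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return comparisons


def mergesort_comparisons(keys):
    """Return (sorted list, comparison count) for a top-down merge sort."""
    if len(keys) <= 1:
        return list(keys), 0
    mid = len(keys) // 2
    left, count_left = mergesort_comparisons(keys[:mid])
    right, count_right = mergesort_comparisons(keys[mid:])
    merged = []
    count = count_left + count_right
    i = j = 0
    while i < len(left) and j < len(right):
        count += 1
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, count


if __name__ == "__main__":
    for p in (256, 1024, 4096):
        presorted = list(range(p))       # keys equal to rank, as in the call above
        _, merge_count = mergesort_comparisons(presorted)
        print(p, quicksort_comparisons(presorted), merge_count)
```

On presorted input this naive quicksort performs P(P-1)/2 comparisons, while the merge sort stays within P log2 P.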
>>>>
>>>>
>>>> I should note, I see a similar lack of scaling with MPI_Comm_dup on the K
>>>> machine.  Unfortunately, my BGQ data used an earlier version of the code
>>>> that did not use comm_dup.  As such, I can't definitively say that it is
>>>> a problem on that machine as well.
>>>>
>>>> Thus, I'm asking that scalable implementations of comm_split/dup, using
>>>> merge/heap sort whose worst-case complexity is still O(P log P), be
>>>> prioritized in the next update.
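
For reference, the ordering step being asked for can be sketched in a few lines. This is not the MPICH or PAMI implementation, only a model of the result MPI_Comm_split must produce: ranks are grouped by color and ordered by key, with ties broken by rank in the old communicator. It assumes the (color, key) pairs have already been gathered, ignores MPI_UNDEFINED, and uses Python's built-in sort, which is merge-based with O(P log P) worst case. `split_ranks` is a made-up name.

```python
def split_ranks(colors, keys):
    """Model of the rank assignment done by a communicator split.

    colors[r] and keys[r] are the values supplied by old rank r.
    Returns new_rank, where new_rank[r] is r's rank inside its color group.
    """
    # One O(P log P) sort on (color, key, old rank); the old rank is the
    # tie-breaker for equal keys.
    order = sorted(range(len(colors)), key=lambda r: (colors[r], keys[r], r))

    new_rank = [None] * len(colors)
    next_in_color = {}                   # color -> next free rank in that group
    for r in order:
        c = colors[r]
        new_rank[r] = next_in_color.get(c, 0)
        next_in_color[c] = new_rank[r] + 1
    return new_rank


if __name__ == "__main__":
    # Four old ranks, two colors, key = old rank (the presorted case).
    print(split_ranks([0, 1, 0, 1], [0, 1, 2, 3]))
```

With presorted keys (key = old rank) this does no more work than with any other key order, which is the behaviour being requested.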
>>>>
>>>>
>>>> thanks
>>>> _______________________________________________
>>>> To manage subscription options or unsubscribe:
>>>> https://lists.mpich.org/mailman/listinfo/devel
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
>