[mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K supercomputers

Rob Latham robl at mcs.anl.gov
Mon Jul 7 10:19:34 CDT 2014



On 07/03/2014 04:45 PM, Jeff Hammond wrote:
> PAMI is open-source via https://repo.anl-external.org/repos/bgq-driver/.
>
> I believe ALCF has already reported this bug but you can contact
> support at alcf.anl.gov for an update.

In a nice bit of circular logic, ALCF keeps trying to close that ticket, 
saying "this is being discussed on the MPICH list".

Specifically to Jeff's point, the PAMI things are in bgq-VERSION-gpl.tar.gz

Junchao: you can find the implementation of 
PAMI_Geometry_create_taskrange in comm/sys/pami/api/c/pami.cc, but all 
it does is immediately call the object's create_taskrange member 
function, so now you have to find where *that* is...

==rob

>
> Best,
>
> Jeff
>
> On Thu, Jul 3, 2014 at 2:41 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> Hi, Sam,
>>    I wrote micro-benchmarks for MPI_Comm_split/dup. My profiling results
>> suggest that the problem lies in an IBM PAMI library call,
>> PAMI_Geometry_create_taskrange().  Unfortunately, I don't have access to the
>> PAMI source code and don't know why it is slow. I have reported it to IBM and
>> hope they will fix it.
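>>    (For reference, the micro-benchmark just times each call with MPI_Wtime(),
>> roughly like the sketch below; this is a simplified illustration, not the
>> exact code I ran.)
>>
>>    #include <stdio.h>
>>    #include <mpi.h>
>>
>>    int main(int argc, char **argv)
>>    {
>>        int rank;
>>        double t0, t1;
>>        MPI_Comm newcomm;
>>
>>        MPI_Init(&argc, &argv);
>>        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>        /* Time MPI_Comm_dup of the whole world. */
>>        MPI_Barrier(MPI_COMM_WORLD);
>>        t0 = MPI_Wtime();
>>        MPI_Comm_dup(MPI_COMM_WORLD, &newcomm);
>>        t1 = MPI_Wtime();
>>        MPI_Comm_free(&newcomm);
>>        if (rank == 0) printf("MPI_Comm_dup:   %f s\n", t1 - t0);
>>
>>        /* Time MPI_Comm_split; color groups 8 consecutive ranks, key = rank. */
>>        MPI_Barrier(MPI_COMM_WORLD);
>>        t0 = MPI_Wtime();
>>        MPI_Comm_split(MPI_COMM_WORLD, rank / 8, rank, &newcomm);
>>        t1 = MPI_Wtime();
>>        MPI_Comm_free(&newcomm);
>>        if (rank == 0) printf("MPI_Comm_split: %f s\n", t1 - t0);
>>
>>        MPI_Finalize();
>>        return 0;
>>    }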
>>    Alternatively, you can set the environment variable PAMID_COLLECTIVES=0 to
>> disable the PAMI collectives. My tests showed that this at least fixed the
>> scalability problem of Comm_split and Comm_dup.
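>>    (How the variable reaches the compute nodes depends on your job launcher;
>> on BG/Q it is normally forwarded at job submission via the launcher's
>> environment-forwarding option. Purely as an illustration, and assuming the
>> pamid device reads the variable from the process environment during MPI_Init,
>> you could also set it in the program itself:)
>>
>>    #include <stdlib.h>
>>    #include <mpi.h>
>>
>>    int main(int argc, char **argv)
>>    {
>>        /* Illustration only: assumes pamid consults the process environment
>>         * for PAMID_COLLECTIVES at MPI_Init time. */
>>        setenv("PAMID_COLLECTIVES", "0", 1);
>>        MPI_Init(&argc, &argv);
>>        /* ... application ... */
>>        MPI_Finalize();
>>        return 0;
>>    }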
>>    Also, through profiling, I found that the qsort() called in the MPICH code
>> actually uses the merge sort algorithm in Mira's libc.
>>
>>
>>
>> --Junchao Zhang
>>
>>
>> On Sat, May 17, 2014 at 9:06 AM, Sam Williams <swwilliams at lbl.gov> wrote:
>>>
>>> I've been conducting scaling experiments on the Mira (Blue Gene/Q) and K
>>> (SPARC) supercomputers.  I've noticed that the time required for
>>> MPI_Comm_split and MPI_Comm_dup can grow quickly with scale (~P^2).  As
>>> such, their performance eventually becomes a bottleneck.  That is, although
>>> the benefit of using a subcommunicator is huge (multigrid solves are
>>> weak-scalable), the penalty of creating one (multigrid build time) is also
>>> huge.
>>>
>>> For example, when scaling from 1 to 46K nodes (node counts are cubes of
>>> integers) on Mira, the time (in seconds) required to build an MG solver
>>> subcommunicator) scales as
>>> 222335.output:   Total time in MGBuild      0.056704
>>> 222336.output:   Total time in MGBuild      0.060834
>>> 222348.output:   Total time in MGBuild      0.064782
>>> 222349.output:   Total time in MGBuild      0.090229
>>> 222350.output:   Total time in MGBuild      0.075280
>>> 222351.output:   Total time in MGBuild      0.091852
>>> 222352.output:   Total time in MGBuild      0.137299
>>> 222411.output:   Total time in MGBuild      0.301552
>>> 222413.output:   Total time in MGBuild      0.606444
>>> 222415.output:   Total time in MGBuild      0.745272
>>> 222417.output:   Total time in MGBuild      0.779757
>>> 222418.output:   Total time in MGBuild      4.671838
>>> 222419.output:   Total time in MGBuild     15.123162
>>> 222420.output:   Total time in MGBuild     33.875626
>>> 222421.output:   Total time in MGBuild     49.494547
>>> 222422.output:   Total time in MGBuild    151.329026
>>>
>>> If I disable the call to MPI_Comm_split, my time scales as
>>> 224982.output:   Total time in MGBuild      0.050143
>>> 224983.output:   Total time in MGBuild      0.052607
>>> 224988.output:   Total time in MGBuild      0.050697
>>> 224989.output:   Total time in MGBuild      0.078343
>>> 224990.output:   Total time in MGBuild      0.054634
>>> 224991.output:   Total time in MGBuild      0.052158
>>> 224992.output:   Total time in MGBuild      0.060286
>>> 225008.output:   Total time in MGBuild      0.062925
>>> 225009.output:   Total time in MGBuild      0.097357
>>> 225010.output:   Total time in MGBuild      0.061807
>>> 225011.output:   Total time in MGBuild      0.076617
>>> 225012.output:   Total time in MGBuild      0.099683
>>> 225013.output:   Total time in MGBuild      0.125580
>>> 225014.output:   Total time in MGBuild      0.190711
>>> 225016.output:   Total time in MGBuild      0.218329
>>> 225017.output:   Total time in MGBuild      0.282081
>>>
>>> Although I didn't directly measure it, this suggests the time for
>>> MPI_Comm_split is growing roughly quadratically with process concurrency.
>>>
>>>
>>>
>>>
>>> I see the same effect on the K machine (8 to 64K nodes), where the code uses
>>> comm_split and comm_dup together:
>>> run00008_7_1.sh.o2412931:   Total time in MGBuild      0.026458 seconds
>>> run00064_7_1.sh.o2415876:   Total time in MGBuild      0.039121 seconds
>>> run00512_7_1.sh.o2415877:   Total time in MGBuild      0.086800 seconds
>>> run01000_7_1.sh.o2414496:   Total time in MGBuild      0.129764 seconds
>>> run01728_7_1.sh.o2415878:   Total time in MGBuild      0.224576 seconds
>>> run04096_7_1.sh.o2415880:   Total time in MGBuild      0.738979 seconds
>>> run08000_7_1.sh.o2414504:   Total time in MGBuild      2.123800 seconds
>>> run13824_7_1.sh.o2415881:   Total time in MGBuild      6.276573 seconds
>>> run21952_7_1.sh.o2415882:   Total time in MGBuild     13.634200 seconds
>>> run32768_7_1.sh.o2415884:   Total time in MGBuild     36.508670 seconds
>>> run46656_7_1.sh.o2415874:   Total time in MGBuild     58.668228 seconds
>>> run64000_7_1.sh.o2415875:   Total time in MGBuild    117.322217 seconds
>>>
>>>
>>> A glance at the implementation on Mira (I don't know whether the
>>> implementation on K is stock) suggests it uses qsort to sort the processes by
>>> key.  Unfortunately, qsort is not performance-robust the way heap sort or
>>> merge sort is.  If one does the natural thing and calls comm_split like
>>> MPI_Comm_split(...,mycolor,myrank,...)
>>> then one runs the risk that the keys are presorted.  That hits qsort's
>>> worst-case computational complexity... O(P^2).  Demanding that programmers
>>> avoid passing sorted keys seems unreasonable.
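>>>
>>> (Concretely, the calling pattern I mean looks like the sketch below: the key
>>> is just the MPI_COMM_WORLD rank, so within each color the keys reach the
>>> implementation already sorted.  This is an illustration of the pattern, not
>>> my actual solver code.)
>>>
>>>    #include <mpi.h>
>>>
>>>    int main(int argc, char **argv)
>>>    {
>>>        int myrank, mycolor;
>>>        MPI_Comm subcomm;
>>>
>>>        MPI_Init(&argc, &argv);
>>>        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>>>
>>>        /* e.g. group every 64 consecutive ranks into one subcommunicator */
>>>        mycolor = myrank / 64;
>>>
>>>        /* Using the world rank as the key preserves the original ordering,
>>>         * but it also hands the implementation presorted keys. */
>>>        MPI_Comm_split(MPI_COMM_WORLD, mycolor, myrank, &subcomm);
>>>
>>>        /* ... build the multigrid solver on subcomm ... */
>>>        MPI_Comm_free(&subcomm);
>>>        MPI_Finalize();
>>>        return 0;
>>>    }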
>>>
>>>
>>> I should note that I see a similar lack of scaling with MPI_Comm_dup on the K
>>> machine.  Unfortunately, my BGQ data used an earlier version of the code
>>> that did not use comm_dup.  As such, I can’t definitively say that it is a
>>> problem on that machine as well.
>>>
>>> Thus, I'm asking that scalable implementations of comm_split/dup, using a
>>> merge or heap sort whose worst-case complexity is still O(P log P), be
>>> prioritized in the next update.
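>>>
>>> (By way of illustration, the kind of replacement I have in mind is a
>>> guaranteed O(P log P), stable sort of the (key, rank) pairs -- e.g. a simple
>>> bottom-up merge sort like the sketch below.  This is only a sketch of the
>>> idea, not a patch against the MPICH source.)
>>>
>>>    #include <stdlib.h>
>>>    #include <string.h>
>>>
>>>    typedef struct { int key; int rank; } keyrank;
>>>
>>>    /* Merge the sorted runs a[lo..mid) and a[mid..hi) via tmp, copy back. */
>>>    static void merge(keyrank *a, keyrank *tmp, int lo, int mid, int hi)
>>>    {
>>>        int i = lo, j = mid, k = lo;
>>>        while (i < mid && j < hi)
>>>            tmp[k++] = (a[j].key < a[i].key) ? a[j++] : a[i++];
>>>        while (i < mid) tmp[k++] = a[i++];
>>>        while (j < hi)  tmp[k++] = a[j++];
>>>        memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(keyrank));
>>>    }
>>>
>>>    /* Bottom-up merge sort: O(n log n) in the worst case, and stable, so
>>>     * equal keys keep their old-communicator rank order as required. */
>>>    static void sort_keyranks(keyrank *a, int n)
>>>    {
>>>        keyrank *tmp = malloc(n * sizeof(keyrank));
>>>        int width, lo;
>>>        for (width = 1; width < n; width *= 2) {
>>>            for (lo = 0; lo + width < n; lo += 2 * width) {
>>>                int mid = lo + width;
>>>                int hi  = lo + 2 * width;
>>>                if (hi > n) hi = n;
>>>                merge(a, tmp, lo, mid, hi);
>>>            }
>>>        }
>>>        free(tmp);
>>>    }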
>>>
>>>
>>> thanks
>>> _______________________________________________
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/devel
>>
>>
>>
>
>
>

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

