[mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K supercomputers

Junchao Zhang jczhang at mcs.anl.gov
Mon Sep 22 17:24:25 CDT 2014


Forwarding new messages from IBM:
  * If you set PAMID_COLLECTIVES_MEMORY_OPTIMIZED=1 and the block is
irregular, the PAMI optimized collectives are disabled for that comm and the
geometry / collective network is not created at comm-creation time, so the
comm dup/split performance is good (see the timing sketch after this list).
For now the user can go ahead and use it; if there are problems with it,
we'll need to decide then whether we will support it or look at what code
changes would be necessary to address the underlying issue.
  * The whole 'power-of-2' criterion was basically too loose a definition of
an irregular block. For Blue Gene/Q there is no crystal-clear formal
definition at the moment, but a tighter one is that a block is regular only
if it is fully populated, with every node present in all 5 dimensions and
the same number of MPI ranks on each node. There are still some exceptions
and caveats beyond this that can only be answered by looking at the code,
but this is the tightest and simplest definition I can come up with at the
moment.
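
As a quick way to check this on a given block, one can simply time the two
calls. A minimal sketch (not IBM's code; the 2-color split below is an
arbitrary example, and PAMID_COLLECTIVES_MEMORY_OPTIMIZED=1 or
PAMID_COLLECTIVES=0 would be set in the job environment at launch, outside
the program):

#include <mpi.h>
#include <stdio.h>

/* Times MPI_Comm_dup and MPI_Comm_split on MPI_COMM_WORLD.  Run it once
 * with and once without the environment variable above and compare the
 * reported maxima. */
int main(int argc, char **argv)
{
    int rank, size;
    double t0, t_dup, t_split, max_dup, max_split;
    MPI_Comm dup_comm, split_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);
    t_dup = MPI_Wtime() - t0;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    /* arbitrary 2-color split */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &split_comm);
    t_split = MPI_Wtime() - t0;

    MPI_Reduce(&t_dup,   &max_dup,   1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_split, &max_split, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("P=%d  dup %.6f s  split %.6f s\n", size, max_dup, max_split);

    MPI_Comm_free(&dup_comm);
    MPI_Comm_free(&split_comm);
    MPI_Finalize();
    return 0;
}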

On Mon, Sep 22, 2014 at 10:51 AM, Sam Williams <swwilliams at lbl.gov> wrote:

> All of Mira (48k) is technically not a power of two.
>
> I tried the PAMI option at one point.  I don't recall it giving better
> performance.
>
> - Sam
>
> On Sep 22, 2014, at 8:39 AM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>
> Sam,
>    I had some updates from IBM last week. They reproduced the problem and
> found that it only happens when the number of MPI ranks is non-power-of-2.
> Their advice: since the IBM BG/Q optimized collectives are mostly designed
> to be helpful only on blocks with power-of-2 geometries, you can test in
> your program whether subsequent collective calls on comms with a
> non-power-of-2 geometry are actually faster with PAMID_COLLECTIVES=1 than
> with PAMID_COLLECTIVES=0. If the answer is no, then you can just run with
> PAMID_COLLECTIVES=0 and avoid the dup/split performance issue. Otherwise,
> IBM may prioritize this ticket.
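>
> A minimal sketch of such a comparison (the choice of collective, the split,
> and the iteration count below are arbitrary placeholders, not part of IBM's
> advice):
>
> #include <mpi.h>
> #include <stdio.h>
>
> /* Time a representative collective on a subcommunicator.  Run the same
>  * job once with PAMID_COLLECTIVES=1 and once with PAMID_COLLECTIVES=0 and
>  * compare; if =0 is no slower, the dup/split cost can be avoided too. */
> int main(int argc, char **argv)
> {
>     int rank, size, color, i;
>     double in = 1.0, out, t0, t, tmax;
>     MPI_Comm sub;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     /* carve off one rank; if the job itself is power-of-2, the big
>      * subcommunicator is then non-power-of-2 */
>     color = (rank == size - 1) ? 1 : 0;
>     MPI_Comm_split(MPI_COMM_WORLD, color, rank, &sub);
>
>     MPI_Barrier(MPI_COMM_WORLD);
>     t0 = MPI_Wtime();
>     for (i = 0; i < 1000; i++)
>         MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, sub);
>     t = MPI_Wtime() - t0;
>
>     MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
>     if (rank == 0)
>         printf("1000 allreduces on subcomm: max %.6f s\n", tmax);
>
>     MPI_Comm_free(&sub);
>     MPI_Finalize();
>     return 0;
> }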
>
>   Thanks.
> --Junchao Zhang
>
> On Thu, Jul 3, 2014 at 4:41 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>
>> Hi, Sam,
>>   I wrote micro-benchmarks for MPI_Comm_split/dup. My profiling results
>> suggested the problem lies in an IBM PAMI library call,
>> PAMI_Geometry_create_taskrange().  Unfortunately, I don't have access to
>> the PAMI source code and don't know why. I reported it to IBM and hope IBM
>> will fix it.
>>   Alternatively, you can set the environment variable PAMID_COLLECTIVES=0
>> to disable PAMI collectives. My tests showed that this at least fixed the
>> scalability problem of MPI_Comm_split and MPI_Comm_dup.
>>   Also, through profiling I found that the qsort() called in the MPICH
>> code actually uses the merge sort algorithm in Mira's libc library.
>>
>>
>>
>> --Junchao Zhang
>>
>>
>> On Sat, May 17, 2014 at 9:06 AM, Sam Williams <swwilliams at lbl.gov> wrote:
>>
>>> I've been conducting scaling experiments on the Mira (Blue Gene/Q) and K
>>> (Sparc) supercomputers.  I've noticed that the time required for
>>> MPI_Comm_split and MPI_Comm_dup can grow quickly with scale (~P^2).  As
>>> such, its performance eventually becomes a bottleneck.  That is, although
>>> the benefit of using a subcommunicator is huge (multigrid solves are
>>> weak-scalable), the penalty of creating one (multigrid build time) is also
>>> huge.
>>>
>>> For example, when scaling from 1 to 46K nodes (node counts are cubes of
>>> integers) on Mira, the time (in seconds) required to build an MG solver
>>> (including a subcommunicator) scales as
>>> 222335.output:   Total time in MGBuild      0.056704
>>> 222336.output:   Total time in MGBuild      0.060834
>>> 222348.output:   Total time in MGBuild      0.064782
>>> 222349.output:   Total time in MGBuild      0.090229
>>> 222350.output:   Total time in MGBuild      0.075280
>>> 222351.output:   Total time in MGBuild      0.091852
>>> 222352.output:   Total time in MGBuild      0.137299
>>> 222411.output:   Total time in MGBuild      0.301552
>>> 222413.output:   Total time in MGBuild      0.606444
>>> 222415.output:   Total time in MGBuild      0.745272
>>> 222417.output:   Total time in MGBuild      0.779757
>>> 222418.output:   Total time in MGBuild      4.671838
>>> 222419.output:   Total time in MGBuild     15.123162
>>> 222420.output:   Total time in MGBuild     33.875626
>>> 222421.output:   Total time in MGBuild     49.494547
>>> 222422.output:   Total time in MGBuild    151.329026
>>>
>>> If I disable the call to MPI_Comm_split, my time scales as
>>> 224982.output:   Total time in MGBuild      0.050143
>>> 224983.output:   Total time in MGBuild      0.052607
>>> 224988.output:   Total time in MGBuild      0.050697
>>> 224989.output:   Total time in MGBuild      0.078343
>>> 224990.output:   Total time in MGBuild      0.054634
>>> 224991.output:   Total time in MGBuild      0.052158
>>> 224992.output:   Total time in MGBuild      0.060286
>>> 225008.output:   Total time in MGBuild      0.062925
>>> 225009.output:   Total time in MGBuild      0.097357
>>> 225010.output:   Total time in MGBuild      0.061807
>>> 225011.output:   Total time in MGBuild      0.076617
>>> 225012.output:   Total time in MGBuild      0.099683
>>> 225013.output:   Total time in MGBuild      0.125580
>>> 225014.output:   Total time in MGBuild      0.190711
>>> 225016.output:   Total time in MGBuild      0.218329
>>> 225017.output:   Total time in MGBuild      0.282081
>>>
>>> Although I didn't directly measure it, this suggests the time for
>>> MPI_Comm_split is growing roughly quadratically with process concurrency.
>>>
>>>
>>>
>>>
>>> I see the same effect on the K machine (8 to 64K nodes), where the code
>>> uses comm_split and comm_dup together:
>>> run00008_7_1.sh.o2412931:   Total time in MGBuild      0.026458 seconds
>>> run00064_7_1.sh.o2415876:   Total time in MGBuild      0.039121 seconds
>>> run00512_7_1.sh.o2415877:   Total time in MGBuild      0.086800 seconds
>>> run01000_7_1.sh.o2414496:   Total time in MGBuild      0.129764 seconds
>>> run01728_7_1.sh.o2415878:   Total time in MGBuild      0.224576 seconds
>>> run04096_7_1.sh.o2415880:   Total time in MGBuild      0.738979 seconds
>>> run08000_7_1.sh.o2414504:   Total time in MGBuild      2.123800 seconds
>>> run13824_7_1.sh.o2415881:   Total time in MGBuild      6.276573 seconds
>>> run21952_7_1.sh.o2415882:   Total time in MGBuild     13.634200 seconds
>>> run32768_7_1.sh.o2415884:   Total time in MGBuild     36.508670 seconds
>>> run46656_7_1.sh.o2415874:   Total time in MGBuild     58.668228 seconds
>>> run64000_7_1.sh.o2415875:   Total time in MGBuild    117.322217 seconds
>>>
>>>
>>> A glance at the implementation on Mira (I don't know whether the
>>> implementation on K is stock) suggests it is using qsort to sort based on
>>> keys.  Unfortunately, qsort is not performance-robust the way heap/merge
>>> sort is.  If one takes the natural, productive approach and calls
>>> comm_split like
>>> MPI_Comm_split(...,mycolor,myrank,...)
>>> then one runs the risk that the keys are presorted.  That hits the
>>> worst-case computational complexity for qsort, O(P^2).  Demanding that
>>> programmers avoid sending sorted keys seems unreasonable.
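>>>
>>> For illustration, here is a hypothetical row-split (not the actual
>>> multigrid code) in which every color's keys reach that sort already in
>>> order:
>>>
>>> #include <mpi.h>
>>>
>>> /* Hypothetical example: split the job into rows of a process grid.
>>>  * Using the world rank as the key is the natural choice, but within
>>>  * each color the keys then arrive at the sort already in order (the
>>>  * presorted case described above). */
>>> int main(int argc, char **argv)
>>> {
>>>     int myrank, mycolor, ranks_per_row = 1024;  /* illustrative width */
>>>     MPI_Comm row_comm;
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>>>
>>>     mycolor = myrank / ranks_per_row;            /* row index */
>>>     MPI_Comm_split(MPI_COMM_WORLD, mycolor, myrank, &row_comm);
>>>
>>>     MPI_Comm_free(&row_comm);
>>>     MPI_Finalize();
>>>     return 0;
>>> }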
>>>
>>>
>>> I should note, I see a similar lack of scaling with MPI_Comm_dup on the
>>> K machine.  Unfortunately, my BGQ data used an earlier version of the code
>>> that did not use comm_dup.  As such, I can’t definitively say that it is a
>>> problem on that machine as well.
>>>
>>> Thus, I'm asking that scalable implementations of comm_split/dup, using a
>>> merge/heap sort whose worst-case complexity is still O(P log P), be
>>> prioritized in the next update.
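>>>
>>> To make that concrete, here is a minimal sketch of the kind of sort being
>>> requested: a stable merge sort over hypothetical (key, rank) pairs (not
>>> MPICH's actual internal data structures) whose worst case stays
>>> O(P log P) even when the keys arrive presorted:
>>>
>>> #include <stdlib.h>
>>> #include <string.h>
>>>
>>> /* Hypothetical stand-in for the (key, world-rank) pairs that a
>>>  * comm_split implementation must order. */
>>> typedef struct { int key; int rank; } keyval_t;
>>>
>>> /* Stable top-down merge sort: O(P log P) worst case, unlike a quicksort
>>>  * with a poor pivot, which degrades to O(P^2) on presorted keys. */
>>> void merge_sort(keyval_t *a, keyval_t *tmp, int n)
>>> {
>>>     if (n < 2) return;
>>>     int mid = n / 2, i = 0, j = mid, k = 0;
>>>     merge_sort(a, tmp, mid);
>>>     merge_sort(a + mid, tmp, n - mid);
>>>     while (i < mid && j < n)                /* take left on ties => stable */
>>>         tmp[k++] = (a[j].key < a[i].key) ? a[j++] : a[i++];
>>>     while (i < mid) tmp[k++] = a[i++];
>>>     while (j < n)   tmp[k++] = a[j++];
>>>     memcpy(a, tmp, n * sizeof(keyval_t));
>>> }
>>>
>>> /* Sort n pairs by key; equal keys keep their original order. */
>>> int sort_keyvals(keyval_t *pairs, int n)
>>> {
>>>     keyval_t *tmp = malloc(n * sizeof(keyval_t));
>>>     if (!tmp) return -1;
>>>     merge_sort(pairs, tmp, n);
>>>     free(tmp);
>>>     return 0;
>>> }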
>>>
>>>
>>> thanks
>>>
>>
>>
>