[mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K supercomputers
Jeff Hammond
jeff.science at gmail.com
Mon Sep 22 12:16:07 CDT 2014
PAMID_COLLECTIVES=0 is really bad for performance. IBM should figure
out a way to disable the optimized collectives on a per-communicator
basis when MPI_COMM_SPLIT is going to have issues. I recall they allow
one to "unoptimize" a communicator, but I thought that was only
possible after it was created.
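
(Purely as an illustration of what a per-communicator switch might look
like: standard MPI-3 info hints could carry such a control. In the sketch
below, MPI_Comm_set_info is real MPI-3, but the "pamid_collectives" key is
a made-up name, not a documented PAMID hint.)

#include <mpi.h>

/* Hypothetical sketch: selectively turn off optimized collectives on one
 * communicator via an info hint.  MPI_Comm_set_info is standard MPI-3,
 * but "pamid_collectives" is an assumed, non-existent hint key. */
static void disable_optimized_collectives(MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "pamid_collectives", "0");  /* hypothetical key */
    MPI_Comm_set_info(comm, info);
    MPI_Info_free(&info);
}
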
Jeff
On Mon, Sep 22, 2014 at 8:39 AM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
> Sam,
> I had some updates from IBM last week. They reproduced the problem and
> found it only happens when the number of MPI ranks is a non-power-of-2.
> Their advice: since the IBM BG/Q optimized collectives are mostly designed
> to help blocks with power-of-2 geometries, check in your program whether
> subsequent collective calls with PAMID_COLLECTIVES=1 are actually faster
> than with PAMID_COLLECTIVES=0 on communicators with a non-power-of-2
> geometry. If they are not, you can simply run with PAMID_COLLECTIVES=0 and
> avoid the dup/split performance issue. Otherwise, IBM may prioritize this
> ticket.
>
> Thanks.
> --Junchao Zhang
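
(A minimal sketch of the kind of comparison suggested above, not code from
the thread: run the same binary once with PAMID_COLLECTIVES=1 and once with
PAMID_COLLECTIVES=0 and compare the reported times. Dropping the last rank
is just one arbitrary way to obtain a non-power-of-2 subcommunicator.)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Drop the last rank so the subcommunicator has a non-power-of-2
     * size whenever MPI_COMM_WORLD is a power of 2. */
    int color = (rank == size - 1) ? MPI_UNDEFINED : 0;
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &sub);

    if (sub != MPI_COMM_NULL) {
        double x = 1.0, y = 0.0;
        MPI_Barrier(sub);
        double t0 = MPI_Wtime();
        for (int i = 0; i < 1000; i++)
            MPI_Allreduce(&x, &y, 1, MPI_DOUBLE, MPI_SUM, sub);
        double t1 = MPI_Wtime();
        int srank;
        MPI_Comm_rank(sub, &srank);
        if (srank == 0)
            printf("average MPI_Allreduce time: %g s\n", (t1 - t0) / 1000.0);
        MPI_Comm_free(&sub);
    }
    MPI_Finalize();
    return 0;
}
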
>
> On Thu, Jul 3, 2014 at 4:41 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>>
>> Hi, Sam,
>> I wrote micro-benchmarks for MPI_Comm_split/dup. My profiling results
>> suggested the problem lies in an IBM PAMI library call,
>> PAMI_Geometry_create_taskrange(). Unfortunately, I don't have access to
>> the PAMI source code and don't know why. I reported it to IBM and hope
>> they will fix it.
>> Alternatively, you can set the environment variable PAMID_COLLECTIVES=0
>> to disable the PAMI collectives. My tests showed that this at least fixes
>> the scalability problem of Comm_split and Comm_dup.
>> Also, through profiling I found that the qsort() called in the MPICH code
>> actually uses the merge sort algorithm in Mira's libc library.
>>
>>
>>
>> --Junchao Zhang
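
(Junchao's micro-benchmark itself is not shown in the thread; a minimal
reconstruction of the idea, not his actual code, simply times the two calls
at each process count:)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Comm split_comm, dup_comm;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &split_comm);
    double t1 = MPI_Wtime();
    MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);
    double t2 = MPI_Wtime();

    /* Rank 0's view of how long each call took at this process count. */
    if (rank == 0)
        printf("P=%d  MPI_Comm_split=%.6f s  MPI_Comm_dup=%.6f s\n",
               size, t1 - t0, t2 - t1);

    MPI_Comm_free(&split_comm);
    MPI_Comm_free(&dup_comm);
    MPI_Finalize();
    return 0;
}
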
>>
>>
>> On Sat, May 17, 2014 at 9:06 AM, Sam Williams <swwilliams at lbl.gov> wrote:
>>>
>>> I've been conducting scaling experiments on the Mira (Blue Gene/Q) and K
>>> (SPARC) supercomputers. I've noticed that the time required for
>>> MPI_Comm_split and MPI_Comm_dup can grow quickly with scale (~P^2), so
>>> their performance eventually becomes a bottleneck. That is, although the
>>> benefit of using a subcommunicator is huge (multigrid solves are
>>> weak-scalable), the penalty of creating one (multigrid build time) is
>>> also huge.
>>>
>>> For example, when scaling from 1 to 46K nodes (node counts are cubes of
>>> integers) on Mira, the time (in seconds) required to build an MG solver
>>> (including a subcommunicator) scales as follows:
>>> 222335.output: Total time in MGBuild 0.056704
>>> 222336.output: Total time in MGBuild 0.060834
>>> 222348.output: Total time in MGBuild 0.064782
>>> 222349.output: Total time in MGBuild 0.090229
>>> 222350.output: Total time in MGBuild 0.075280
>>> 222351.output: Total time in MGBuild 0.091852
>>> 222352.output: Total time in MGBuild 0.137299
>>> 222411.output: Total time in MGBuild 0.301552
>>> 222413.output: Total time in MGBuild 0.606444
>>> 222415.output: Total time in MGBuild 0.745272
>>> 222417.output: Total time in MGBuild 0.779757
>>> 222418.output: Total time in MGBuild 4.671838
>>> 222419.output: Total time in MGBuild 15.123162
>>> 222420.output: Total time in MGBuild 33.875626
>>> 222421.output: Total time in MGBuild 49.494547
>>> 222422.output: Total time in MGBuild 151.329026
>>>
>>> If I disable the call to MPI_Comm_split, the time scales as follows:
>>> 224982.output: Total time in MGBuild 0.050143
>>> 224983.output: Total time in MGBuild 0.052607
>>> 224988.output: Total time in MGBuild 0.050697
>>> 224989.output: Total time in MGBuild 0.078343
>>> 224990.output: Total time in MGBuild 0.054634
>>> 224991.output: Total time in MGBuild 0.052158
>>> 224992.output: Total time in MGBuild 0.060286
>>> 225008.output: Total time in MGBuild 0.062925
>>> 225009.output: Total time in MGBuild 0.097357
>>> 225010.output: Total time in MGBuild 0.061807
>>> 225011.output: Total time in MGBuild 0.076617
>>> 225012.output: Total time in MGBuild 0.099683
>>> 225013.output: Total time in MGBuild 0.125580
>>> 225014.output: Total time in MGBuild 0.190711
>>> 225016.output: Total time in MGBuild 0.218329
>>> 225017.output: Total time in MGBuild 0.282081
>>>
>>> Although I didn't directly measure it, this suggests that the time for
>>> MPI_Comm_split is growing roughly quadratically with process concurrency.
>>>
>>>
>>>
>>>
>>> I see the same effect on the K machine (8...64K nodes), where the code
>>> uses comm_split and comm_dup together:
>>> run00008_7_1.sh.o2412931: Total time in MGBuild 0.026458 seconds
>>> run00064_7_1.sh.o2415876: Total time in MGBuild 0.039121 seconds
>>> run00512_7_1.sh.o2415877: Total time in MGBuild 0.086800 seconds
>>> run01000_7_1.sh.o2414496: Total time in MGBuild 0.129764 seconds
>>> run01728_7_1.sh.o2415878: Total time in MGBuild 0.224576 seconds
>>> run04096_7_1.sh.o2415880: Total time in MGBuild 0.738979 seconds
>>> run08000_7_1.sh.o2414504: Total time in MGBuild 2.123800 seconds
>>> run13824_7_1.sh.o2415881: Total time in MGBuild 6.276573 seconds
>>> run21952_7_1.sh.o2415882: Total time in MGBuild 13.634200 seconds
>>> run32768_7_1.sh.o2415884: Total time in MGBuild 36.508670 seconds
>>> run46656_7_1.sh.o2415874: Total time in MGBuild 58.668228 seconds
>>> run64000_7_1.sh.o2415875: Total time in MGBuild 117.322217 seconds
>>>
>>>
>>> A glance at the implementation on Mira (I don't know whether the
>>> implementation on K is stock) suggests it uses qsort to sort by key.
>>> Unfortunately, qsort is not performance-robust the way heap sort or
>>> merge sort is. If one calls comm_split in the natural way, e.g.
>>> MPI_Comm_split(..., mycolor, myrank, ...)
>>> then one runs the risk that the keys are already sorted. That hits the
>>> worst-case computational complexity for qsort: O(P^2). Demanding that
>>> programmers avoid passing sorted keys seems unreasonable.
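
(A concrete sketch of that "natural" call pattern, illustrative only and
not Sam's actual code; ranks_per_subdomain is a hypothetical application
parameter:)

#include <mpi.h>

/* Common idiom: group ranks by an application-level color and keep their
 * relative order by passing the world rank as the key.  Within each color
 * the keys then arrive already sorted, which is the worst case for a
 * naively pivoted quicksort. */
MPI_Comm make_subcomm(int ranks_per_subdomain)  /* hypothetical parameter */
{
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    int color = world_rank / ranks_per_subdomain;
    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &subcomm);
    return subcomm;
}
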
>>>
>>>
>>> I should note, I see a similar lack of scaling with MPI_Comm_dup on the K
>>> machine. Unfortunately, my BGQ data used an earlier version of the code
>>> that did not use comm_dup. As such, I can’t definitively say that it is a
>>> problem on that machine as well.
>>>
>>> Thus, I'm asking that scalable implementations of comm_split/dup, using
>>> a merge or heap sort whose worst-case complexity is still O(P log P),
>>> be prioritized in the next update.
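
(For what it's worth, a worst-case O(P log P) replacement need not be
elaborate. The sketch below, with illustrative struct and function names
rather than the MPICH internals, sorts a gathered (key, rank) table with a
stable bottom-up merge sort, so presorted keys cost no more than any other
input:)

#include <stdlib.h>
#include <string.h>

/* Illustrative only: not MPICH's internal struct or function names. */
typedef struct { int key; int rank; } splittype;

/* Stable bottom-up merge sort: worst case O(n log n) even on presorted
 * keys.  Stability matters because MPI_Comm_split breaks key ties by the
 * rank in the parent communicator, which is the order the table was
 * gathered in. */
static void sort_keys(splittype *a, int n)
{
    splittype *tmp = malloc((size_t)n * sizeof(*tmp));
    if (tmp == NULL)
        return;                        /* real code would report the error */
    for (int width = 1; width < n; width *= 2) {
        for (int lo = 0; lo < n; lo += 2 * width) {
            int mid = (lo + width < n) ? lo + width : n;
            int hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
            int i = lo, j = mid, k = lo;
            while (i < mid && j < hi)  /* <= keeps equal keys in order */
                tmp[k++] = (a[i].key <= a[j].key) ? a[i++] : a[j++];
            while (i < mid) tmp[k++] = a[i++];
            while (j < hi)  tmp[k++] = a[j++];
        }
        memcpy(a, tmp, (size_t)n * sizeof(*a));
    }
    free(tmp);
}
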
>>>
>>>
>>> thanks
>>> _______________________________________________
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/devel
>>
>>
>
>
--
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/