[mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K supercomputers

Michael Blocksome blocksom at us.ibm.com
Mon Sep 22 15:21:05 CDT 2014


Thanks Bob!

Here's the code snippet in question:

        $ git blame src/mpid/pamid/src/comm/mpid_comm.c
        ...
        77999d6f (Bob Cernohous     2013-03-18 14:51:05 -0500 291) if((MPIDI_Process.optimized.memory  & MPID_OPT_LVL_IRREG) && (comm->local_size & (comm->local_size-1)))
        53f6e934 (Haizhu Liu        2012-11-07 20:56:27 -0500 292)       {
        858da8da (Bob Cernohous     2013-02-14 13:37:36 -0600 293)  /* Don't create irregular geometries.  Fallback to MPICH only collectives */
        858da8da (Bob Cernohous     2013-02-14 13:37:36 -0600 294)  geom_init = 0;
        224dfb1b (Bob Cernohous     2013-03-06 13:14:47 -0600 295)  comm->mpid.geometry = PAMI_GEOMETRY_NULL;
        63577b28 (Bob Cernohous     2013-02-07 10:21:02 -0600 296)       }
        ...

The environment variable "PAMID_COLLECTIVES_MEMORY_OPTIMIZED=1" will 
enable this code. Here is the documentation:

        $ git blame src/mpid/pamid/src/mpidi_env.c
        ...
        63577b28 (Bob Cernohous     2013-02-07 10:21:02 -0600  111)  * - PAMID_COLLECTIVES_MEMORY_OPTIMIZED - Controls whether collectives are
        63577b28 (Bob Cernohous     2013-02-07 10:21:02 -0600  112)  * optimized to reduce memory usage. This may disable some PAMI collectives.
        63577b28 (Bob Cernohous     2013-02-07 10:21:02 -0600  113)  * Possible values:
        63577b28 (Bob Cernohous     2013-02-07 10:21:02 -0600  114)  *   - 0 - Collectives are not memory optimized.
        77999d6f (Bob Cernohous     2013-03-18 14:51:05 -0500  115)  *   - n - Collectives are memory optimized. Levels are bitwise values :
        77999d6f (Bob Cernohous     2013-03-18 14:51:05 -0500  116)  *  MPID_OPT_LVL_IRREG     = 1,   Do not optimize irregular communicators
        77999d6f (Bob Cernohous     2013-03-18 14:51:05 -0500  117)  *  MPID_OPT_LVL_NONCONTIG = 2,   Disable some non-contig collectives
        ... 
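
For reference, the guard combines the MPID_OPT_LVL_IRREG bit with the usual
power-of-2 test: the expression (n & (n-1)) is nonzero exactly when n is not
a power of two, so the MPICH-only fallback is taken for irregular
(non-power-of-2) communicator sizes whenever the memory-optimization level
sets that bit. A minimal, stand-alone sketch of the same test (the function
and variable names here are illustrative, not the pamid internals):

        #include <stdio.h>

        #define MPID_OPT_LVL_IRREG     1  /* do not optimize irregular communicators */
        #define MPID_OPT_LVL_NONCONTIG 2  /* disable some non-contig collectives     */

        /* Return 1 if the PAMI geometry should be skipped for this communicator size. */
        static int skip_pami_geometry(unsigned opt_memory, int local_size)
        {
          int irregular = (local_size & (local_size - 1)) != 0;  /* not a power of 2 */
          return (opt_memory & MPID_OPT_LVL_IRREG) && irregular;
        }

        int main(void)
        {
          unsigned opt = MPID_OPT_LVL_IRREG;  /* e.g. PAMID_COLLECTIVES_MEMORY_OPTIMIZED=1 */
          int size;
          for (size = 2; size <= 12; ++size)
            printf("local_size %2d -> %s\n", size,
                   skip_pami_geometry(opt, size) ? "MPICH fallback" : "PAMI geometry");
          return 0;
        }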



Michael Blocksome
Parallel Environment MPI Middleware
POWER, x86, and Blue Gene HPC Messaging
blocksom at us.ibm.com




From:   Bob Cernohous <bcernohous at cray.com>
To:     "devel at mpich.org" <devel at mpich.org>
Date:   09/22/2014 01:18 PM
Subject:        Re: [mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K supercomputers
Sent by:        devel-bounces at mpich.org



I thought there was a "memory" optimization that disabled PAMI on 
irregular communicators.  However, I don't know the current state of that 
code.

       if(MPIDI_Process.optimized.memory && (comm->local_size & 
(comm->local_size-1)))
       {
         /* Don't create irregular geometries.  Fallback to MPICH only 
collectives */
         geom_init = 0;
         comm->mpid.geometry = NULL;
       }


> -----Original Message-----
> From: devel-bounces at mpich.org [mailto:devel-bounces at mpich.org] On
> Behalf Of Jeff Hammond
> Sent: Monday, September 22, 2014 12:16 PM
> To: devel at mpich.org
> Subject: Re: [mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K
> supercomputers
> 
> PAMID_COLLECTIVES=0 is really bad for performance.  IBM should figure out a
> way to disable it on a per-communicator basis when MPI_COMM_SPLIT is
> going to have issues.  I recall they allow one to "unoptimize" a communicator
> but I thought that was only possible after it was created.
> 
> Jeff
> 
> On Mon, Sep 22, 2014 at 8:39 AM, Junchao Zhang <jczhang at mcs.anl.gov>
> wrote:
> > Sam,
> >    I had some updates from IBM last week. They reproduced the problem
> > and found it only happens when the number of MPI ranks is
> > non-power-of-2.  Their advice is that since the IBM BG/Q optimized
> > collectives themselves are mostly designed only to be helpful for
> > blocks with power-of-2 geometries, you can try in your program to see
> > if subsequent collective calls with
> > PAMID_COLLECTIVES=1 are actually faster than PAMID_COLLECTIVES=0 on
> > comms with a non-power-of-2 geometry. If the answer is no, then you
> > can just run with PAMID_COLLECTIVES=0 and avoid the dup/split
> > performance issue.
> > Otherwise, IBM may prioritize this ticket.
> >
> >   Thanks.
> > --Junchao Zhang
> >
> > On Thu, Jul 3, 2014 at 4:41 PM, Junchao Zhang <jczhang at mcs.anl.gov>
> > wrote:
> >>
> >> Hi, Sam,
> >>   I wrote micro-benchmarks for MPI_Comm_split/dup. My profiling
> >> results suggested the problem lies in an IBM PAMI library call,
> >> PAMI_Geometry_create_taskrange().  Unfortunately, I don't have access
> >> to the PAMI source code and don't know why. I reported it to IBM and
> >> hope IBM will fix it.
> >>   Alternatively, you can set the environment variable
> >> PAMID_COLLECTIVES=0 to disable PAMI collectives. My tests showed that
> >> this at least fixed the scalability problem of Comm_split and Comm_dup.
> >>   Also through profiling, I found that the qsort() called in the MPICH
> >> code actually uses the merge sort algorithm from Mira's libc library.
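> >>
> >>   A minimal sketch of such a micro-benchmark (illustrative only, not
> >> the actual benchmark code) might look like:
> >>
> >>     #include <mpi.h>
> >>     #include <stdio.h>
> >>
> >>     int main(int argc, char **argv)
> >>     {
> >>       int rank, size;
> >>       double t0, t_dup, t_split, max_dup, max_split;
> >>       MPI_Comm dup, split;
> >>
> >>       MPI_Init(&argc, &argv);
> >>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>       MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>
> >>       MPI_Barrier(MPI_COMM_WORLD);
> >>       t0 = MPI_Wtime();
> >>       MPI_Comm_dup(MPI_COMM_WORLD, &dup);
> >>       t_dup = MPI_Wtime() - t0;
> >>
> >>       MPI_Barrier(MPI_COMM_WORLD);
> >>       t0 = MPI_Wtime();
> >>       /* key == rank, so the keys each color receives are presorted */
> >>       MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &split);
> >>       t_split = MPI_Wtime() - t0;
> >>
> >>       /* report the slowest rank, which bounds the collective's cost */
> >>       MPI_Reduce(&t_dup,   &max_dup,   1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
> >>       MPI_Reduce(&t_split, &max_split, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
> >>       if (rank == 0)
> >>         printf("P=%d  dup=%f s  split=%f s\n", size, max_dup, max_split);
> >>
> >>       MPI_Comm_free(&split);
> >>       MPI_Comm_free(&dup);
> >>       MPI_Finalize();
> >>       return 0;
> >>     }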
> >>
> >>
> >>
> >> --Junchao Zhang
> >>
> >>
> >> On Sat, May 17, 2014 at 9:06 AM, Sam Williams <swwilliams at lbl.gov>
> >> wrote:
> >>>
> >>> I've been conducting scaling experiments on the Mira (Blue Gene/Q)
> >>> and K
> >>> (Sparc) supercomputers.  I've noticed that the time required for
> >>> MPI_Comm_split and MPI_Comm_dup can grow quickly with scale (~P^2).
> >>> As such, its performance eventually becomes a bottleneck.  That is,
> >>> although the benefit of using a subcommunicator is huge (multigrid
> >>> solves are weak-scalable), the penalty of creating one (multigrid
> >>> build time) is also huge.
> >>>
> >>> For example, when scaling from 1 to 46K nodes (= cubes of integers)
> >>> on Mira, the time (in seconds) required to build a MG solver
> >>> (including a
> >>> subcommunicator) scales as
> >>> 222335.output:   Total time in MGBuild      0.056704
> >>> 222336.output:   Total time in MGBuild      0.060834
> >>> 222348.output:   Total time in MGBuild      0.064782
> >>> 222349.output:   Total time in MGBuild      0.090229
> >>> 222350.output:   Total time in MGBuild      0.075280
> >>> 222351.output:   Total time in MGBuild      0.091852
> >>> 222352.output:   Total time in MGBuild      0.137299
> >>> 222411.output:   Total time in MGBuild      0.301552
> >>> 222413.output:   Total time in MGBuild      0.606444
> >>> 222415.output:   Total time in MGBuild      0.745272
> >>> 222417.output:   Total time in MGBuild      0.779757
> >>> 222418.output:   Total time in MGBuild      4.671838
> >>> 222419.output:   Total time in MGBuild     15.123162
> >>> 222420.output:   Total time in MGBuild     33.875626
> >>> 222421.output:   Total time in MGBuild     49.494547
> >>> 222422.output:   Total time in MGBuild    151.329026
> >>>
> >>> If I disable the call to MPI_Comm_Split, my time scales as
> >>> 224982.output:   Total time in MGBuild      0.050143
> >>> 224983.output:   Total time in MGBuild      0.052607
> >>> 224988.output:   Total time in MGBuild      0.050697
> >>> 224989.output:   Total time in MGBuild      0.078343
> >>> 224990.output:   Total time in MGBuild      0.054634
> >>> 224991.output:   Total time in MGBuild      0.052158
> >>> 224992.output:   Total time in MGBuild      0.060286
> >>> 225008.output:   Total time in MGBuild      0.062925
> >>> 225009.output:   Total time in MGBuild      0.097357
> >>> 225010.output:   Total time in MGBuild      0.061807
> >>> 225011.output:   Total time in MGBuild      0.076617
> >>> 225012.output:   Total time in MGBuild      0.099683
> >>> 225013.output:   Total time in MGBuild      0.125580
> >>> 225014.output:   Total time in MGBuild      0.190711
> >>> 225016.output:   Total time in MGBuild      0.218329
> >>> 225017.output:   Total time in MGBuild      0.282081
> >>>
> >>> Although I didn't directly measure it, this suggests the time for
> >>> MPI_Comm_Split is growing roughly quadratically with process
> >>> concurrency.
> >>>
> >>>
> >>>
> >>>
> >>> I see the same effect on the K machine (8...64K nodes) where the
> >>> code uses comm_split/dup in conjunction:
> >>> run00008_7_1.sh.o2412931:   Total time in MGBuild      0.026458 seconds
> >>> run00064_7_1.sh.o2415876:   Total time in MGBuild      0.039121 seconds
> >>> run00512_7_1.sh.o2415877:   Total time in MGBuild      0.086800 seconds
> >>> run01000_7_1.sh.o2414496:   Total time in MGBuild      0.129764 seconds
> >>> run01728_7_1.sh.o2415878:   Total time in MGBuild      0.224576 seconds
> >>> run04096_7_1.sh.o2415880:   Total time in MGBuild      0.738979 seconds
> >>> run08000_7_1.sh.o2414504:   Total time in MGBuild      2.123800 seconds
> >>> run13824_7_1.sh.o2415881:   Total time in MGBuild      6.276573 seconds
> >>> run21952_7_1.sh.o2415882:   Total time in MGBuild     13.634200 seconds
> >>> run32768_7_1.sh.o2415884:   Total time in MGBuild     36.508670 seconds
> >>> run46656_7_1.sh.o2415874:   Total time in MGBuild     58.668228 seconds
> >>> run64000_7_1.sh.o2415875:   Total time in MGBuild    117.322217 seconds
> >>>
> >>>
> >>> A glance at the implementation on Mira (I don't know if the
> >>> implementation on K is stock) suggests it should be using qsort to
> >>> sort based on keys.  Unfortunately, qsort is not performance robust
> >>> like heap/merge sort.  If one were to be productive and call comm_split
> >>> like...
> >>> MPI_Comm_split(...,mycolor,myrank,...)
> >>> then one runs the risk that the keys are presorted.  This hits the
> >>> worst case computational complexity for qsort... O(P^2).  Demanding
> >>> programmers avoid sending sorted keys seems unreasonable.
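> >>>
> >>> Concretely, the pattern in question looks something like the fragment
> >>> below (inside an already-initialized MPI program; the sub-domain size
> >>> is just an illustrative constant, not from any particular code):
> >>>
> >>>     const int ranks_per_subdomain = 8;   /* illustrative value */
> >>>     int myrank, subrank;
> >>>     MPI_Comm subcomm;
> >>>     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
> >>>     /* one subcommunicator per sub-domain; key == parent rank => presorted keys */
> >>>     MPI_Comm_split(MPI_COMM_WORLD, myrank / ranks_per_subdomain, myrank, &subcomm);
> >>>     MPI_Comm_rank(subcomm, &subrank);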
> >>>
> >>>
> >>> I should note, I see a similar lack of scaling with MPI_Comm_dup on
> >>> the K machine.  Unfortunately, my BGQ data used an earlier version
> >>> of the code that did not use comm_dup.  As such, I can’t
> >>> definitively say that it is a problem on that machine as well.
> >>>
> >>> Thus, I'm asking that scalable implementations of comm_split/dup,
> >>> using merge/heap sort whose worst-case complexity is still O(P log P),
> >>> be prioritized in the next update.
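> >>>
> >>> For illustration, a sort with that worst-case bound could be as simple
> >>> as a bottom-up merge sort over the gathered (color, key, rank) triples;
> >>> the sketch below is illustrative only (the struct and function names
> >>> are made up, not MPICH's internal ones):
> >>>
> >>>     #include <stdlib.h>
> >>>     #include <string.h>
> >>>
> >>>     typedef struct { int color, key, rank; } splittype;
> >>>
> >>>     /* Order by color, then key, then original rank (the rank tie-break
> >>>        matches MPI_Comm_split's required ordering).                     */
> >>>     static int cmp(const splittype *a, const splittype *b)
> >>>     {
> >>>       if (a->color != b->color) return a->color < b->color ? -1 : 1;
> >>>       if (a->key   != b->key)   return a->key   < b->key   ? -1 : 1;
> >>>       return (a->rank > b->rank) - (a->rank < b->rank);
> >>>     }
> >>>
> >>>     /* Bottom-up merge sort: O(P log P) comparisons, even on presorted input. */
> >>>     static void mergesort_triples(splittype *a, int n)
> >>>     {
> >>>       splittype *tmp = malloc(n * sizeof *tmp);
> >>>       int width, lo;
> >>>       for (width = 1; width < n; width *= 2) {
> >>>         for (lo = 0; lo < n; lo += 2 * width) {
> >>>           int mid = lo + width     < n ? lo + width     : n;
> >>>           int hi  = lo + 2 * width < n ? lo + 2 * width : n;
> >>>           int i = lo, j = mid, k = lo;
> >>>           while (i < mid && j < hi)
> >>>             tmp[k++] = (cmp(&a[i], &a[j]) <= 0) ? a[i++] : a[j++];
> >>>           while (i < mid) tmp[k++] = a[i++];
> >>>           while (j < hi)  tmp[k++] = a[j++];
> >>>         }
> >>>         memcpy(a, tmp, n * sizeof *a);
> >>>       }
> >>>       free(tmp);
> >>>     }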
> >>>
> >>>
> >>> thanks
> >>> _______________________________________________
> >>> To manage subscription options or unsubscribe:
> >>> https://lists.mpich.org/mailman/listinfo/devel
> >>
> >>
> >
> >
> > _______________________________________________
> > To manage subscription options or unsubscribe:
> > https://lists.mpich.org/mailman/listinfo/devel
> 
> 
> 
> --
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/
> _______________________________________________
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/devel
_______________________________________________
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/devel
