<tt><font size=2>Thanks Bob!</font></tt>
<br>
<br><tt><font size=2>Here's the code snippet in question:</font></tt>
<br>
<br><tt><font size=2> $ git blame src/mpid/pamid/src/comm/mpid_comm.c</font></tt>
<br><tt><font size=2> ...</font></tt>
<br><tt><font size=2>  77999d6f (Bob Cernohous 2013-03-18 14:51:05 -0500 291)   if((MPIDI_Process.optimized.memory & MPID_OPT_LVL_IRREG) && (comm->local_size & (comm->local_size-1)))</font></tt>
<br><tt><font size=2>  53f6e934 (Haizhu Liu    2012-11-07 20:56:27 -0500 292)   {</font></tt>
<br><tt><font size=2>  858da8da (Bob Cernohous 2013-02-14 13:37:36 -0600 293)     /* Don't create irregular geometries. Fallback to MPICH only collectives */</font></tt>
<br><tt><font size=2>  858da8da (Bob Cernohous 2013-02-14 13:37:36 -0600 294)     geom_init = 0;</font></tt>
<br><tt><font size=2>  224dfb1b (Bob Cernohous 2013-03-06 13:14:47 -0600 295)     comm->mpid.geometry = PAMI_GEOMETRY_NULL;</font></tt>
<br><tt><font size=2>  63577b28 (Bob Cernohous 2013-02-07 10:21:02 -0600 296)   }</font></tt>
<br><tt><font size=2> ...</font></tt>
<br>
<br><tt><font size=2>Setting the environment variable PAMID_COLLECTIVES_MEMORY_OPTIMIZED=1 (the MPID_OPT_LVL_IRREG bit) enables this code. Here is the documentation:</font></tt>
<br>
<br><tt><font size=2> $ git blame src/mpid/pamid/src/mpidi_env.c</font></tt>
<br><tt><font size=2> ...</font></tt>
<br><tt><font size=2>  63577b28 (Bob Cernohous 2013-02-07 10:21:02 -0600 111)  * - PAMID_COLLECTIVES_MEMORY_OPTIMIZED - Controls whether collectives are</font></tt>
<br><tt><font size=2>  63577b28 (Bob Cernohous 2013-02-07 10:21:02 -0600 112)  *   optimized to reduce memory usage. This may disable some PAMI collectives.</font></tt>
<br><tt><font size=2>  63577b28 (Bob Cernohous 2013-02-07 10:21:02 -0600 113)  *   Possible values:</font></tt>
<br><tt><font size=2>  63577b28 (Bob Cernohous 2013-02-07 10:21:02 -0600 114)  *   - 0 - Collectives are not memory optimized.</font></tt>
<br><tt><font size=2>  77999d6f (Bob Cernohous 2013-03-18 14:51:05 -0500 115)  *   - n - Collectives are memory optimized. Levels are bitwise values :</font></tt>
<br><tt><font size=2>  77999d6f (Bob Cernohous 2013-03-18 14:51:05 -0500 116)  *         MPID_OPT_LVL_IRREG     = 1, Do not optimize irregular communicators</font></tt>
<br><tt><font size=2>  77999d6f (Bob Cernohous 2013-03-18 14:51:05 -0500 117)  *         MPID_OPT_LVL_NONCONTIG = 2, Disable some non-contig collectives</font></tt>
<br><tt><font size=2> ... </font></tt>
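<br>
<br><tt><font size=2>The (comm->local_size & (comm->local_size-1)) expression in the blamed line is the usual bit trick for "size is not a power of two", so when the MPID_OPT_LVL_IRREG bit is set the pamid device skips creating a PAMI geometry for such communicators and falls back to MPICH-only collectives. A minimal standalone sketch of that test (the helper name below is just for illustration, not something from the MPICH source):</font></tt>
<br>
<br><tt><font size=2>    /* Nonzero iff n is greater than zero and not a power of two,<br>
       i.e. the "irregular" case the blamed code falls back on. */<br>
    static int is_irregular_size(unsigned int n)<br>
    {<br>
        return (n != 0) && ((n & (n - 1)) != 0);<br>
    }</font></tt>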
<br>
<br>
<br><font size=2 face="sans-serif"><br>
Michael Blocksome<br>
Parallel Environment MPI Middleware<br>
POWER, x86, and Blue Gene HPC Messaging<br>
blocksom@us.ibm.com<br>
</font>
<br>
<br>
<br>
<br><font size=1 color=#5f5f5f face="sans-serif">From:
</font><font size=1 face="sans-serif">Bob Cernohous <bcernohous@cray.com></font>
<br><font size=1 color=#5f5f5f face="sans-serif">To:
</font><font size=1 face="sans-serif">"devel@mpich.org"
<devel@mpich.org></font>
<br><font size=1 color=#5f5f5f face="sans-serif">Date:
</font><font size=1 face="sans-serif">09/22/2014 01:18 PM</font>
<br><font size=1 color=#5f5f5f face="sans-serif">Subject:
</font><font size=1 face="sans-serif">Re: [mpich-devel]
MPI_Comm_Split/Dup scalability on BGQ and K
supercomputers</font>
<br><font size=1 color=#5f5f5f face="sans-serif">Sent by:
</font><font size=1 face="sans-serif">devel-bounces@mpich.org</font>
<br>
<hr noshade>
<br>
<br>
<br><tt><font size=2>I thought there was a "memory" optimization
that disabled PAMI on irregular communicators. However I don't know
the current state of that code.<br>
<br>
if(MPIDI_Process.optimized.memory && (comm->local_size
& (comm->local_size-1)))<br>
{<br>
/* Don't create irregular geometries. Fallback
to MPICH only collectives */<br>
geom_init = 0;<br>
comm->mpid.geometry = NULL;<br>
}<br>
<br>
<br>
> -----Original Message-----<br>
> From: devel-bounces@mpich.org [</font></tt><a href="mailto:devel-bounces@mpich.org"><tt><font size=2>mailto:devel-bounces@mpich.org</font></tt></a><tt><font size=2>]
On<br>
> Behalf Of Jeff Hammond<br>
> Sent: Monday, September 22, 2014 12:16 PM<br>
> To: devel@mpich.org<br>
> Subject: Re: [mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and
K<br>
> supercomputers<br>
> <br>
> PAMID_COLLECTIVES=0 is really bad for performance. IBM should
figure out a<br>
> way to disable it on a per-communicator basis when MPI_COMM_SPLIT
is<br>
> going to have issues. I recall they allow one to "unoptimize"
a communicator<br>
> but I thought that was only possible after it was created.<br>
> <br>
> Jeff<br>
> <br>
> On Mon, Sep 22, 2014 at 8:39 AM, Junchao Zhang <jczhang@mcs.anl.gov><br>
> wrote:<br>
> > Sam,<br>
> > I had some updates from IBM last week. They reproduced
the problem<br>
> > and found it only happens when the number of MPI ranks is<br>
> > non-power-of-2. Their advice is that since the IBM BG/Q
optimized<br>
> > collectives themselves are mostly designed only to be helpful
for<br>
> > blocks with power-of-2 geometries, you can try in your program
to see<br>
> > if subsequent collective calls with<br>
> > PAMID_COLLECTIVES=1 are actually faster than PAMID_COLLECTIVES=0
on<br>
> > comms with a non-power-of-2 geometry. If the answer is no, then
you<br>
> > can just run with PAMID_COLLECTIVES=0 and avoid the dup/split<br>
> performance issue.<br>
> > Otherwise, IBM may prioritize this ticket.<br>
> ><br>
> > Thanks.<br>
> > --Junchao Zhang<br>
> ><br>
> > On Thu, Jul 3, 2014 at 4:41 PM, Junchao Zhang <jczhang@mcs.anl.gov><br>
> wrote:<br>
> >><br>
> >> Hi, Sam,<br>
> >> I wrote micro-benchmarks for MPI_Comm_split/dup. My
profiling<br>
> >> results suggested the problem lies in an IBM PAMI library
call,<br>
> >> PAMI_Geometry_create_taskrange(). Unfortunately, I
don't have access<br>
> >> to the PAMI source code and don't know why. I reported it
to IBM and<br>
> >> hope IBM will fix it.<br>
> >> Alternatively, you can set an environment variable<br>
> >> PAMID_COLLECTIVES=0 to disable PAMI collectives. My tests
showed it<br>
> >> at least fixed the scalability problem of Comm_split and
Comm_dup.<br>
> >> Also through profiling, I found the qsort() called
in MPICH code is<br>
> >> actually using the merge sort algorithm in Mira's libc library.<br>
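<br>
(For reference, a minimal sketch of the kind of micro-benchmark described above -- timing a single MPI_Comm_split with MPI_Wtime. The color/key choice is only illustrative, and it passes the existing rank as the key, the presorted pattern Sam describes below; this is not Junchao's actual benchmark code.)<br>
<br>
#include <mpi.h><br>
#include <stdio.h><br>
<br>
int main(int argc, char **argv)<br>
{<br>
    int rank, size;<br>
    MPI_Init(&argc, &argv);<br>
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);<br>
    MPI_Comm_size(MPI_COMM_WORLD, &size);<br>
<br>
    /* Split into two halves; the key is the existing rank, so the keys<br>
       reach the implementation's key sort already in order. */<br>
    int color = (2 * rank >= size);<br>
    MPI_Comm subcomm;<br>
    double t0 = MPI_Wtime();<br>
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subcomm);<br>
    double t1 = MPI_Wtime();<br>
    MPI_Comm_free(&subcomm);<br>
    if (rank == 0)<br>
        printf("MPI_Comm_split took %f seconds\n", t1 - t0);<br>
    MPI_Finalize();<br>
    return 0;<br>
}<br>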
> >><br>
> >><br>
> >><br>
> >> --Junchao Zhang<br>
> >><br>
> >><br>
> >> On Sat, May 17, 2014 at 9:06 AM, Sam Williams <swwilliams@lbl.gov><br>
> wrote:<br>
> >>><br>
> >>> I've been conducting scaling experiments on the Mira
(Blue Gene/Q)<br>
> >>> and K<br>
> >>> (Sparc) supercomputers. I've noticed that the time
required for<br>
> >>> MPI_Comm_split and MPI_Comm_dup can grow quickly with
scale<br>
> (~P^2).<br>
> >>> As such, its performance eventually becomes a bottleneck.
That is,<br>
> >>> although the benefit of using a subcommunicator is huge
(multigrid<br>
> >>> solves are weak-scalable), the penalty of creating one
(multigrid<br>
> >>> build time) is also huge.<br>
> >>><br>
> >>> For example, when scaling from 1 to 46K nodes (= cubes
of integers)<br>
> >>> on Mira, the time (in seconds) required to build a MG
solver<br>
> >>> (including a<br>
> >>> subcommunicator) scales as<br>
> >>> 222335.output: Total time in MGBuild
0.056704<br>
> >>> 222336.output: Total time in MGBuild
0.060834<br>
> >>> 222348.output: Total time in MGBuild
0.064782<br>
> >>> 222349.output: Total time in MGBuild
0.090229<br>
> >>> 222350.output: Total time in MGBuild
0.075280<br>
> >>> 222351.output: Total time in MGBuild
0.091852<br>
> >>> 222352.output: Total time in MGBuild
0.137299<br>
> >>> 222411.output: Total time in MGBuild
0.301552<br>
> >>> 222413.output: Total time in MGBuild
0.606444<br>
> >>> 222415.output: Total time in MGBuild
0.745272<br>
> >>> 222417.output: Total time in MGBuild
0.779757<br>
> >>> 222418.output: Total time in MGBuild
4.671838<br>
> >>> 222419.output: Total time in MGBuild
15.123162<br>
> >>> 222420.output: Total time in MGBuild
33.875626<br>
> >>> 222421.output: Total time in MGBuild
49.494547<br>
> >>> 222422.output: Total time in MGBuild 151.329026<br>
> >>><br>
> >>> If I disable the call to MPI_Comm_Split, my time scales
as<br>
> >>> 224982.output: Total time in MGBuild
0.050143<br>
> >>> 224983.output: Total time in MGBuild
0.052607<br>
> >>> 224988.output: Total time in MGBuild
0.050697<br>
> >>> 224989.output: Total time in MGBuild
0.078343<br>
> >>> 224990.output: Total time in MGBuild
0.054634<br>
> >>> 224991.output: Total time in MGBuild
0.052158<br>
> >>> 224992.output: Total time in MGBuild
0.060286<br>
> >>> 225008.output: Total time in MGBuild
0.062925<br>
> >>> 225009.output: Total time in MGBuild
0.097357<br>
> >>> 225010.output: Total time in MGBuild
0.061807<br>
> >>> 225011.output: Total time in MGBuild
0.076617<br>
> >>> 225012.output: Total time in MGBuild
0.099683<br>
> >>> 225013.output: Total time in MGBuild
0.125580<br>
> >>> 225014.output: Total time in MGBuild
0.190711<br>
> >>> 225016.output: Total time in MGBuild
0.218329<br>
> >>> 225017.output: Total time in MGBuild
0.282081<br>
> >>><br>
> >>> Although I didn't directly measure it, this suggests
the time for<br>
> >>> MPI_Comm_Split is growing roughly quadratically with
process<br>
> concurrency.<br>
> >>><br>
> >>><br>
> >>><br>
> >>><br>
> >>> I see the same effect on the K machine (8...64K nodes)
where the<br>
> >>> code uses comm_split/dup in conjunction:<br>
> >>> run00008_7_1.sh.o2412931: Total time in MGBuild
0.026458<br>
> seconds<br>
> >>> run00064_7_1.sh.o2415876: Total time in MGBuild
0.039121<br>
> seconds<br>
> >>> run00512_7_1.sh.o2415877: Total time in MGBuild
0.086800<br>
> seconds<br>
> >>> run01000_7_1.sh.o2414496: Total time in MGBuild
0.129764<br>
> seconds<br>
> >>> run01728_7_1.sh.o2415878: Total time in MGBuild
0.224576<br>
> seconds<br>
> >>> run04096_7_1.sh.o2415880: Total time in MGBuild
0.738979<br>
> seconds<br>
> >>> run08000_7_1.sh.o2414504: Total time in MGBuild
2.123800<br>
> seconds<br>
> >>> run13824_7_1.sh.o2415881: Total time in MGBuild
6.276573<br>
> seconds<br>
> >>> run21952_7_1.sh.o2415882: Total time in MGBuild
13.634200<br>
> seconds<br>
> >>> run32768_7_1.sh.o2415884: Total time in MGBuild
36.508670<br>
> seconds<br>
> >>> run46656_7_1.sh.o2415874: Total time in MGBuild
58.668228<br>
> seconds<br>
> >>> run64000_7_1.sh.o2415875: Total time in MGBuild
117.322217<br>
> seconds<br>
> >>><br>
> >>><br>
> >>> A glance at the implementation on Mira (I don't know
if the<br>
> >>> implementation on K is stock) suggests it should be using
qsort to<br>
> >>> sort based on keys. Unfortunately, qsort is not
performance robust<br>
> >>> like heap/merge sort. If one were to be productive
and call comm_split<br>
> like...<br>
> >>> MPI_Comm_split(...,mycolor,myrank,...)<br>
> >>> then one runs the risk that the keys are presorted. This
hits the<br>
> >>> worst case computational complexity for qsort... O(P^2).
Demanding<br>
> >>> programmers avoid sending sorted keys seems unreasonable.<br>
> >>><br>
> >>><br>
> >>> I should note, I see a similar lack of scaling with MPI_Comm_dup
on<br>
> >>> the K machine. Unfortunately, my BGQ data used
an earlier version<br>
> >>> of the code that did not use comm_dup. As such,
I can’t<br>
> >>> definitively say that it is a problem on that machine
as well.<br>
> >>><br>
> >>> Thus, I'm asking for scalable implementations of comm_split/dup<br>
> >>> using merge/heap sort whose worst case complexity is
still P log P to<br>
> >>> be prioritized in the next update.<br>
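<br>
(A worst-case O(P log P), stable sort of the (key, rank) pairs is short to write; the sketch below is illustrative only -- the struct and function names are made up here, not taken from MPICH.)<br>
<br>
#include <string.h><br>
<br>
typedef struct { int key; int rank; } keyrank_t;<br>
<br>
/* Bottom-up merge sort by key; tmp must hold n elements.  O(n log n)<br>
   comparisons even on presorted input, and stable, so entries with<br>
   equal keys keep their input order. */<br>
static void sort_keyrank(keyrank_t *a, keyrank_t *tmp, int n)<br>
{<br>
    for (int width = 1; width < n; width *= 2) {<br>
        for (int lo = 0; lo < n; lo += 2 * width) {<br>
            int mid = (lo + width < n) ? lo + width : n;<br>
            int hi  = (lo + 2 * width < n) ? lo + 2 * width : n;<br>
            int i = lo, j = mid, k = lo;<br>
            while (i < mid && j < hi)<br>
                tmp[k++] = (a[i].key <= a[j].key) ? a[i++] : a[j++];<br>
            while (i < mid) tmp[k++] = a[i++];<br>
            while (j < hi)  tmp[k++] = a[j++];<br>
        }<br>
        memcpy(a, tmp, n * sizeof(keyrank_t));<br>
    }<br>
}<br>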
> >>><br>
> >>><br>
> >>> thanks<br>
> >>> _______________________________________________<br>
> >>> To manage subscription options or unsubscribe:<br>
> >>> </font></tt><a href=https://lists.mpich.org/mailman/listinfo/devel><tt><font size=2>https://lists.mpich.org/mailman/listinfo/devel</font></tt></a><tt><font size=2><br>
> >><br>
> >><br>
> ><br>
> ><br>
> > _______________________________________________<br>
> > To manage subscription options or unsubscribe:<br>
> > </font></tt><a href=https://lists.mpich.org/mailman/listinfo/devel><tt><font size=2>https://lists.mpich.org/mailman/listinfo/devel</font></tt></a><tt><font size=2><br>
> <br>
> <br>
> <br>
> --<br>
> Jeff Hammond<br>
> jeff.science@gmail.com<br>
> </font></tt><a href=http://jeffhammond.github.io/><tt><font size=2>http://jeffhammond.github.io/</font></tt></a><tt><font size=2><br>
> _______________________________________________<br>
> To manage subscription options or unsubscribe:<br>
> </font></tt><a href=https://lists.mpich.org/mailman/listinfo/devel><tt><font size=2>https://lists.mpich.org/mailman/listinfo/devel</font></tt></a><tt><font size=2><br>
_______________________________________________<br>
To manage subscription options or unsubscribe:<br>
</font></tt><a href=https://lists.mpich.org/mailman/listinfo/devel><tt><font size=2>https://lists.mpich.org/mailman/listinfo/devel</font></tt></a>
<br>