[mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K supercomputers

Sun May 25 08:14:53 CDT 2014

I tested hpgmg on Mira and reproduced the scaling problem. I used
mpiwrapper-gcc.
With 64000 MPI tasks, the output is:

Duplicating MPI_COMM_WORLD...done (2.059304 seconds)
Building MPI subcommunicators..., level 1...done (2.291601 seconds)
Total time in MGBuild     28.085873 seconds

With 125000 MPI Tasks, the output is:

Duplicating MPI_COMM_WORLD...done (7.807612 seconds)
Building MPI subcommunicators...,  level 1...done (8.158366 seconds)
Total time in MGBuild     88.644299 seconds

Let me replace qsort with mergesort in Comm_split to see what will happen.

--Junchao Zhang

On Sat, May 24, 2014 at 9:07 AM, Sam Williams <swwilliams at lbl.gov> wrote:

> I saw the problem on Mira and K but not Edison.  I don't know if that is
> due to scale or implementation.
>
> On Mira, I was running jobs with a 10min wallclock limit.  I scaled to
> 46656 processes with 64 threads per process (c1, OMP_NUM_THREADS=64) and
> all jobs completed successfully.  However, I was only looking at MGSolve
> times and not MGBuild times.  I then decided to explore 8 threads per
> process (c8, OMP_NUM_THREADS=8) and started at the high concurrency.  The
> jobs timed out while still in MGBuild after 10mins with 373248 processes as
> well as with a 20min timeout.  At that point I added the USE_SUBCOMM option
> to enable/disable the use of comm_split.  I haven't tried scaling with the
> sub communicator on Mira since then.
>
>
> On May 24, 2014, at 6:39 AM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>
> > Hi, Sam,
> >   Could you give me the exact number of MPI ranks for your results on
> Mira?
> >   I run hpgmg on Edison with export OMP_NUM_THREADS=1, aprun -n 64000
> -ss -cc numa_node ./hpgmg-fv 6 1. The total time in MGBuild is about 0.005
> seconds. I was wondering how many cores I need to apply to reproduce the
> problem.
> >   Thanks.
> >
> >
> > --Junchao Zhang
> >
> >
> > On Sat, May 17, 2014 at 9:06 AM, Sam Williams <swwilliams at lbl.gov>
> wrote:
> > I've been conducting scaling experiments on the Mira (Blue Gene/Q) and K
> (Sparc) supercomputers.  I've noticed that the time required for
> MPI_Comm_split and MPI_Comm_dup can grow quickly with scale (~P^2).  As
> such, its performance eventually becomes a bottleneck.  That is, although
> the benefit of using a subcommunicator is huge (multigrid solves are
> weak-scalable), the penalty of creating one (multigrid build time) is also
> huge.
> >
> > For example, when scaling from 1 to 46K nodes (= cubes of integers) on
> Mira, the time (in seconds) required to build a MG solver (including a
> subcommunicator) scales as
> > 222335.output:   Total time in MGBuild      0.056704
> > 222336.output:   Total time in MGBuild      0.060834
> > 222348.output:   Total time in MGBuild      0.064782
> > 222349.output:   Total time in MGBuild      0.090229
> > 222350.output:   Total time in MGBuild      0.075280
> > 222351.output:   Total time in MGBuild      0.091852
> > 222352.output:   Total time in MGBuild      0.137299
> > 222411.output:   Total time in MGBuild      0.301552
> > 222413.output:   Total time in MGBuild      0.606444
> > 222415.output:   Total time in MGBuild      0.745272
> > 222417.output:   Total time in MGBuild      0.779757
> > 222418.output:   Total time in MGBuild      4.671838
> > 222419.output:   Total time in MGBuild     15.123162
> > 222420.output:   Total time in MGBuild     33.875626
> > 222421.output:   Total time in MGBuild     49.494547
> > 222422.output:   Total time in MGBuild    151.329026
> >
> > If I disable the call to MPI_Comm_Split, my time scales as
> > 224982.output:   Total time in MGBuild      0.050143
> > 224983.output:   Total time in MGBuild      0.052607
> > 224988.output:   Total time in MGBuild      0.050697
> > 224989.output:   Total time in MGBuild      0.078343
> > 224990.output:   Total time in MGBuild      0.054634
> > 224991.output:   Total time in MGBuild      0.052158
> > 224992.output:   Total time in MGBuild      0.060286
> > 225008.output:   Total time in MGBuild      0.062925
> > 225009.output:   Total time in MGBuild      0.097357
> > 225010.output:   Total time in MGBuild      0.061807
> > 225011.output:   Total time in MGBuild      0.076617
> > 225012.output:   Total time in MGBuild      0.099683
> > 225013.output:   Total time in MGBuild      0.125580
> > 225014.output:   Total time in MGBuild      0.190711
> > 225016.output:   Total time in MGBuild      0.218329
> > 225017.output:   Total time in MGBuild      0.282081
> >
> > Although I didn't directly measure it, this suggests the time for
> MPI_Comm_Split is growing roughly quadratically with process concurrency.
> >
> >
> >
> >
> > I see the same effect on the K machine (8...64K nodes) where the code
> uses comm_split/dup in conjunction:
> > run00008_7_1.sh.o2412931:   Total time in MGBuild      0.026458 seconds
> > run00064_7_1.sh.o2415876:   Total time in MGBuild      0.039121 seconds
> > run00512_7_1.sh.o2415877:   Total time in MGBuild      0.086800 seconds
> > run01000_7_1.sh.o2414496:   Total time in MGBuild      0.129764 seconds
> > run01728_7_1.sh.o2415878:   Total time in MGBuild      0.224576 seconds
> > run04096_7_1.sh.o2415880:   Total time in MGBuild      0.738979 seconds
> > run08000_7_1.sh.o2414504:   Total time in MGBuild      2.123800 seconds
> > run13824_7_1.sh.o2415881:   Total time in MGBuild      6.276573 seconds
> > run21952_7_1.sh.o2415882:   Total time in MGBuild     13.634200 seconds
> > run32768_7_1.sh.o2415884:   Total time in MGBuild     36.508670 seconds
> > run46656_7_1.sh.o2415874:   Total time in MGBuild     58.668228 seconds
> > run64000_7_1.sh.o2415875:   Total time in MGBuild    117.322217 seconds
> >
> >
> > A glance at the implementation on Mira (I don't know if the
> implementation on K is stock) suggests it should be using qsort to sort
> based on keys.  Unfortunately, qsort is not performance robust like
> heap/merge sort.  If one were to be productive and call comm_split like...
> > MPI_Comm_split(...,mycolor,myrank,...)
> > then one runs the risk that the keys are presorted.  This hits the worst
> case computational complexity for qsort... O(P^2).  Demanding programmers
> avoid sending sorted keys seems unreasonable.
> >
> >
> > I should note, I see a similar lack of scaling with MPI_Comm_dup on the
> K machine.  Unfortunately, my BGQ data used an earlier version of the code
> that did not use comm_dup.  As such, I can’t definitively say that it is a
> problem on that machine as well.
> >
> > Thus, I'm asking for scalable implementations of comm_split/dup using
> merge/heap sort whose worst case complexity is still PlogP to be
> prioritized in the next update.
> >
> >
> > thanks
> > _______________________________________________
> > To manage subscription options or unsubscribe:
> > https://lists.mpich.org/mailman/listinfo/devel
> >
> > _______________________________________________
> > To manage subscription options or unsubscribe:
> > https://lists.mpich.org/mailman/listinfo/devel
>
> _______________________________________________
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mpich.org/pipermail/devel/attachments/20140525/a3c01c17/attachment-0001.html>