[mpich-devel] Fwd: MPI_Comm_Split/Dup scalability on BGQ and K supercomputers

Rob Latham robl at mcs.anl.gov
Thu May 29 14:52:25 CDT 2014



On 05/29/2014 02:16 PM, Junchao Zhang wrote:
> Hi, Rob,
>    Rajeev mentioned you know of a bug in Comm_split on Mira. What is it?

The bug I saw appeared when mpich/master was built against the older V1R1M1 
driver.   With the Mira update a few months ago to V1R2M1, the memory 
corruption I saw at large scale no longer occurred.

Edgar Solomonik hit a bug not in comm_split itself, but in something 
trying to use the split-up communicators.

Setting the environment variable PAMID_COLLECTIVES=0 was a workaround, 
but one with significant performance implications.

Please send email to support at alcf.anl.gov with your test case and findings, 
and they will update their PMR (a PMR is essentially the way ALCF pays IBM to 
fix things) with this information.

Jeff Hammond was driving these fixes, but he's left.  Adding your voice 
to the need for working/efficient MPI Communicator operations will help 
many apps.

==rob

>    Thanks.
>
> --Junchao Zhang
>
>
> ---------- Forwarded message ----------
> From: *Thakur, Rajeev* <thakur at mcs.anl.gov <mailto:thakur at mcs.anl.gov>>
> Date: Thu, May 29, 2014 at 2:11 PM
> Subject: Re: [mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K
> supercomputers
> To: "Zhang, Junchao" <jczhang at mcs.anl.gov <mailto:jczhang at mcs.anl.gov>>
> Cc: "Balaji, Pavan" <balaji at anl.gov <mailto:balaji at anl.gov>>
>
>
> Can you look at the source code of IBM MPI on Mira to see what they are
> doing? I don't know if the source code is in our repo or somewhere else,
> but Rob Latham should know.
>
> Also, there is a bug in Comm_split in the latest version of MPI
> installed on Mira. There was a lot of discussion on the early-users
> mailing list. Rob Latham knows about it, and he can tell you what the
> bug is about.
>
> Rajeev
>
>
> On May 29, 2014, at 1:49 PM, Junchao Zhang <jczhang at mcs.anl.gov
> <mailto:jczhang at mcs.anl.gov>>
>   wrote:
>
>  > I did more experiments on Mira. But I feel I am lost :)   In short, I
> think the original MPICH comm_split implementation is good if not the
> best. However, for unknown reasons, the MPI library provided by Mira is bad.
>  > I wrote three micro-benchmarks: 1) qsort.c, which is a sequential C
> code, to test libc qsort performance when sorting a presorted array.
> 2) split.c, which measures MPI_Comm_split(MPI_COMM_WORLD, 0, myrank,
> ...). Note that all processes have the same color and use their old
> rank. The maximal time on processes is reported. 3) dup.c, which simply
> does MPI_Comm_dup(MPI_COMM_WORLD, &newcomm).
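[Editorial note: a minimal sketch of what the qsort.c micro-benchmark presumably boils down to. This is a reconstruction under assumptions, not Junchao's actual file; the function names are invented for illustration.]

```c
#include <stdlib.h>
#include <time.h>

/* Compare two ints for qsort; returns <0, 0, >0 without overflow. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Return 1 if a[0..n-1] is nondecreasing. */
static int is_sorted(const int *a, int n)
{
    for (int i = 1; i < n; i++)
        if (a[i - 1] > a[i]) return 0;
    return 1;
}

/* Time libc qsort on an already-sorted array of n ints, the presorted
 * case the benchmark targets.  Returns elapsed seconds, or -1.0 if the
 * sort produced an unsorted result. */
static double time_qsort_presorted(int n)
{
    int *a = malloc(n * sizeof *a);
    for (int i = 0; i < n; i++)
        a[i] = i;                       /* presorted input */

    clock_t t0 = clock();
    qsort(a, n, sizeof *a, cmp_int);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    if (!is_sorted(a, n))
        secs = -1.0;                    /* signal a broken sort */
    free(a);
    return secs;
}
```

One would print `time_qsort_presorted(512000)` and compare the time against the COMM_SPLIT numbers for the same count.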
>  >
>  > I compiled the code using either mpixlc, which is provided by
> +mpiwrapper-gcc on Mira, or mpicc, which is built from mpich/master by
> myself, or mpicc-mergesort, which is also built by myself with a
> mergesort comm_split.c
>  >
>  > QSORT : compiled with mpixlc, run with 1 core on 1 node (also run
> with 16 cores simultaneously on 1 node, which gives even better
> results). The results indicate the libc-qsort used by mpixlc is good.
>  > # integers  time (s)
>  > 64000       0.110789
>  > 125000     0.200747
>  > 512000     0.445075
>  >
>  > COMM_SPLIT : compiled with mpixlc. (Note the results match Sam's.)
>  > # ranks     time (s)
>  >  64000       2.297776
>  >  125000     8.157885
>  >  512000     237.079862
>  >
>  > COMM_DUP : compiled with mpixlc. (Note the results match Sam's.)
>  >  64000       2.057966
>  >  125000     7.808858
>  >  512000     235.948144
>  >
>  > COMM_SPLIT : compiled with mpicc, run two times. The variation is big
> at 512,000.
>  >  64000       0.549024  0.545003
>  >  125000     0.959198  0.965059
>  >  512000     6.625385  3.716696
>  >
>  > COMM_DUP : compiled with mpicc
>  >  64000       0.028518
>  >  125000     0.056386
>  >  512000     0.202677
>  >
>  > COMM_SPLIT : compiled with mpicc-mergesort
>  >  64000       0.362572
>  >  125000     0.712167
>  >  512000     2.929134
>  >
>  > COMM_DUP : compiled with mpicc-mergesort
>  >  64000       0.028835
>  >  125000     0.055987
>  >  512000     0.202660
>  >
>  > (Another difference worth mentioning is that the mpixlc version is
> "MPICH2 version 1.5".  I looked at the log history of comm_split.c;
> after MPICH2 1.5, there are no big changes.)
>  > We can see that Comm_dup performance is independent of Comm_split and
> that mpicc-mergesort is slightly faster than mpicc.  It may still be
> worth pushing mergesort into mpich.
>  >
>  > I have no idea why mpixlc is bad.  Should I report it to ALCF?
>  >
>  > --Junchao Zhang
>  >
>  >
>  > On Wed, May 28, 2014 at 7:34 AM, Junchao Zhang <jczhang at mcs.anl.gov
> <mailto:jczhang at mcs.anl.gov>> wrote:
>  > Fixed them.  Thanks.
>  >
>  > --Junchao Zhang
>  >
>  >
>  > On Tue, May 27, 2014 at 11:02 PM, Thakur, Rajeev <thakur at mcs.anl.gov
> <mailto:thakur at mcs.anl.gov>> wrote:
>  > Junchao,
>  >               Your code looks fine, although I didn't check the merge
> sort in detail for potential bugs. You may want to add a comment
> explaining that the 2x in the memory allocation for keytable is because
> the second half of it will be used as scratch space for the merge sort.
> Otherwise, someone looking at the code 3 years from now will miss it
> :-). Also, a typo in one comment: sortting -> sorting.
>  >
>  > Rajeev
>  >
>  >
>  > On May 27, 2014, at 3:42 PM, Junchao Zhang <jczhang at mcs.anl.gov
> <mailto:jczhang at mcs.anl.gov>> wrote:
>  >
>  > > Here is the code.
>  > > We do not just sort keys. We also want to keep track of old ranks
> for these keys. So we actually sort an array of {key, rank}.
>  > >
>  > > --Junchao Zhang
>  > >
>  > >
>  > > On Tue, May 27, 2014 at 12:58 PM, Junchao Zhang
> <jczhang at mcs.anl.gov <mailto:jczhang at mcs.anl.gov>> wrote:
>  > > Downloaded mpich2-1.3.1 and checked that.  It uses bubble sort,
> which is a stable sort algorithm, so no extra buffer is needed.
> Merge sort is also stable, but needs O(N) extra memory. I can allocate
> the array and the scratch together in one malloc instead of two separate
> mallocs. Let me do that.
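[Editorial note: a sketch of that single-allocation scheme: a stable bottom-up merge sort over {key, rank} pairs whose scratch space is the second half of one 2x-sized buffer. This is an illustration of the idea, not the actual comm_split.c patch; the `splittype` field names are assumed.]

```c
#include <string.h>

/* {key, rank} pair, as in the comm_split sort table (field names assumed). */
typedef struct { int key; int rank; } splittype;

/* Stable bottom-up merge sort on keys; ties keep their original order.
 * tbl must have room for 2*n entries: tbl[0..n-1] holds the data and
 * tbl[n..2n-1] is scratch, hence the single 2x allocation. */
static void merge_sort_pairs(splittype *tbl, int n)
{
    splittype *src = tbl, *dst = tbl + n;
    for (int width = 1; width < n; width *= 2) {
        for (int lo = 0; lo < n; lo += 2 * width) {
            int mid = lo + width < n ? lo + width : n;
            int hi  = lo + 2 * width < n ? lo + 2 * width : n;
            int i = lo, j = mid, k = lo;
            /* take from the left run on ties: this is what keeps it stable */
            while (i < mid && j < hi)
                dst[k++] = (src[j].key < src[i].key) ? src[j++] : src[i++];
            while (i < mid) dst[k++] = src[i++];
            while (j < hi)  dst[k++] = src[j++];
        }
        splittype *tmp = src; src = dst; dst = tmp;  /* ping-pong buffers */
    }
    if (src != tbl)  /* odd number of passes: copy the result back */
        memcpy(tbl, src, n * sizeof *tbl);
}
```

The caller allocates 2*n entries in one malloc, fills the first n with {key, old rank}, and calls `merge_sort_pairs(tbl, n)`.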
>  > >
>  > > --Junchao Zhang
>  > >
>  > >
>  > > On Tue, May 27, 2014 at 12:46 PM, Thakur, Rajeev
> <thakur at mcs.anl.gov <mailto:thakur at mcs.anl.gov>> wrote:
>  > > The version I suggested (1.4) was not old enough. It has the new
> code with quick sort, not the old code with bubble sort. Instead, see
> 1.3.1 (http://www.mpich.org/static/downloads/1.3.1/). I confirmed that
> it has the old bubble sort. I believe it used O(p) less memory as well
> but I didn't recheck just now.
>  > >
>  > > Rajeev
>  > >
>  > >
>  > >
>  > > On May 27, 2014, at 12:38 PM, Junchao Zhang <jczhang at mcs.anl.gov
> <mailto:jczhang at mcs.anl.gov>>
>  > >  wrote:
>  > >
>  > > > I read the old code. It is not efficient. The code gathers {key,
> color} (the color field is in effect later used to store old ranks) into
> an array. Actually, that is enough info for a stable sort (i.e., using
> old ranks for ties).  But the old code did not realize that and created
> a new type, splittype {key, color, orig_idx}. It assigns orig_idx the
> index of the element in the array and uses that for ties.  This is the
> O(p) memory you mentioned. We can say it adds 50% memory use.
>  > > > For merge sort, the buffer has to be standalone (i.e., not mixed
> with old data).  So I deleted the orig_idx field from splittype, and
> allocated a buffer as big as the original array, in effect adding 100%
> memory use.
>  > > > I think we can have a fast qsort with 0% extra memory for our
> case. I will do that experiment later.
>  > > >
>  > > > --Junchao Zhang
>  > > >
>  > > >
>  > > > On Tue, May 27, 2014 at 11:17 AM, Thakur, Rajeev
> <thakur at mcs.anl.gov <mailto:thakur at mcs.anl.gov>> wrote:
>  > > > Junchao,
>  > > >               As far as I remember, when we switched from bubble
> sort to quicksort, we had to allocate additional O(nprocs) memory for
> the trick to make the qsort stable. You can check the old code in a
> release in 2011 (say 1.4 from
> http://www.mpich.org/static/downloads/1.4/). It would be good if we can
> avoid that additional memory allocation in the new sort. And if in the
> merge sort you have added one more allocation, that should definitely be
> avoided.
>  > > >
>  > > > Rajeev
>  > > >
>  > > >
>  > > > On May 27, 2014, at 9:40 AM, Junchao Zhang <jczhang at mcs.anl.gov
> <mailto:jczhang at mcs.anl.gov>> wrote:
>  > > >
>  > > > > Rajeev,
>  > > > >    The code is at mpich-review/commsplit.  I pushed it there
> several days ago and MPICH test results look good.  Could you review it?
>  > > > >    Yes, I will look why the improvement of Comm_dup is better.
>  > > > >    Another thing I want to investigate: in Comm_split, we
> sort {key, old rank}, i.e., an integer pair. So basically, we have
> mechanisms to avoid the instability problem of qsort. We can copy an
> optimized qsort instead of using the one in libc, and test its
> performance.  The benefit of qsort is that we can avoid the extra buffer
> needed in mergesort. With a customized qsort, we can also avoid
> function-pointer dereferencing.  In libc qsort, one provides a generic
> comp function pointer, but our case is simple: we just want to compare
> integers. We can avoid the function-call cost for each comparison.
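[Editorial note: sketching that idea. Because old ranks are unique, comparing (key, rank) pairs lexicographically gives a strict total order, so even an unstable in-place quicksort reproduces the stable-by-key result with no scratch buffer and no function-pointer calls. This is a hypothetical illustration, not MPICH code; the median-of-three pivot guards against presorted input.]

```c
/* {key, rank} pair as discussed above (names assumed). */
typedef struct { int key; int rank; } splittype;

/* Inlined lexicographic (key, rank) comparison: no function pointer. */
static inline int pair_less(splittype a, splittype b)
{
    return a.key < b.key || (a.key == b.key && a.rank < b.rank);
}

static void swap_pair(splittype *a, splittype *b)
{
    splittype t = *a; *a = *b; *b = t;
}

/* In-place quicksort of a[lo..hi] with a median-of-three pivot, falling
 * back to insertion sort on small ranges.  Zero extra memory. */
static void sort_pairs(splittype *a, int lo, int hi)
{
    while (hi - lo > 8) {
        int mid = lo + (hi - lo) / 2;
        /* median-of-three: order a[lo] <= a[mid] <= a[hi] */
        if (pair_less(a[mid], a[lo])) swap_pair(&a[mid], &a[lo]);
        if (pair_less(a[hi], a[mid])) swap_pair(&a[hi], &a[mid]);
        if (pair_less(a[mid], a[lo])) swap_pair(&a[mid], &a[lo]);
        splittype pivot = a[mid];
        int i = lo, j = hi;
        while (i <= j) {                 /* Hoare-style partition */
            while (pair_less(a[i], pivot)) i++;
            while (pair_less(pivot, a[j])) j--;
            if (i <= j) swap_pair(&a[i++], &a[j--]);
        }
        sort_pairs(a, lo, j);            /* recurse left, loop on right */
        lo = i;
    }
    for (int i = lo + 1; i <= hi; i++) { /* insertion sort for tiny ranges */
        splittype v = a[i];
        int j = i - 1;
        while (j >= lo && pair_less(v, a[j])) { a[j + 1] = a[j]; j--; }
        a[j + 1] = v;
    }
}
```

Since comm_split's input table is already in old-rank order, equal keys come out in rank order, matching what a stable sort would produce.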
>  > > > >
>  > > > >
>  > > > >
>  > > > > --Junchao Zhang
>  > > > >
>  > > > >
>  > > > > On Tue, May 27, 2014 at 9:24 AM, Thakur, Rajeev
> <thakur at mcs.anl.gov <mailto:thakur at mcs.anl.gov>> wrote:
>  > > > > Thanks. Have you run your new version through the MPICH test
> suite to make sure there are no bugs in the new sort?
>  > > >
>  > > > Can you look at Comm_dup to see why it improved so much?
>  > > > >
>  > > > > BG/Q has a relatively slower processor, so the sorting time is
> noticeable, unlike on the NERSC system.
>  > > > >
>  > > > > I think we should include it in 3.1.1
>  > > > >
>  > > > > Rajeev
>  > > > >
>  > > > > On May 27, 2014, at 9:16 AM, Junchao Zhang <jczhang at mcs.anl.gov
> <mailto:jczhang at mcs.anl.gov>>
>  > > > >  wrote:
>  > > > >
>  > > > > > I tested Sam's HPGMG on Mira. I tried mpiwrapper-gcc and
> mpiwrapper-xl provided by the softenv of Mira, and also tried a
> self-built mpich with my own comm_split + mergesort. Here are test results.
>  > > > > >
>  > > > > > mpiwrapper-gcc and mpiwrapper-xl have similar results like
> the following (all time is in seconds)
>  > > > > > # of ranks     comm_dup     comm_split    MGBuild
>  > > > > > 64000           2.059304        2.291601       28.085873
>  > > > > > 125000         7.807612        8.158366       88.644299
>  > > > > > 512000      235.923766       (no result; job timed out)
>  > > > > >
>  > > > > > mpich-mergesort
>  > > > > > # of ranks     comm_dup     comm_split    MGBuild
>  > > > > > 64000           0.046616        0.716383       6.999050
>  > > > > > 125000         0.065818        1.293382       10.696351
>  > > > > > 512000         0.229605        4.998839        52.467708
>  > > > > >
>  > > > > > I don't know why the MPICH libraries on Mira are bad, since
> optimizations in qsort to avoid the worst case are not new. The libc
> library on Mira should have an optimized qsort.
>  > > > > > Another thing I have not looked into is why comm_dup is
> affected by comm_split.  It seems comm_dup calls comm_split internally.
>  > > > > >
>  > > > > > In short, the current mergesort solution largely fixes the
> scaling problem.  We may further investigate the issue and come up with
> better solutions. But if you think this is a key fix, probably we can
> push it to 3.1.1.
>  > > > > >
>  > > > > >
>  > > > > > --Junchao Zhang
>  > > > > >
>  > > > > >
>  > > > > > On Sun, May 25, 2014 at 12:25 PM, Junchao Zhang
> <jczhang at mcs.anl.gov <mailto:jczhang at mcs.anl.gov>> wrote:
>  > > > > > Yes, I am going to test that.
>  > > > > >
>  > > > > > --Junchao Zhang
>  > > > > >
>  > > > > >
>  > > > > > On Sun, May 25, 2014 at 11:11 AM, Thakur, Rajeev
> <thakur at mcs.anl.gov <mailto:thakur at mcs.anl.gov>> wrote:
>  > > > > > Can you also see if the xlc compiler makes any difference?
>  > > > > >
>  > > > > > Rajeev
>  > > > > >
>  > > > > > On May 25, 2014, at 8:14 AM, Junchao Zhang
> <jczhang at mcs.anl.gov <mailto:jczhang at mcs.anl.gov>>
>  > > > > >  wrote:
>  > > > > >
>  > > > > > > I tested hpgmg on Mira and reproduced the scaling problem.
> I used mpiwrapper-gcc.
>  > > > > > > With 64000 MPI tasks, the output is:
>  > > > > > > Duplicating MPI_COMM_WORLD...done (2.059304 seconds)
>  > > > > > > Building MPI subcommunicators..., level 1...done (2.291601
> seconds)
>  > > > > > > Total time in MGBuild     28.085873 seconds
>  > > > > > > With 125000 MPI Tasks, the output is:
>  > > > > > > Duplicating MPI_COMM_WORLD...done (7.807612 seconds)
>  > > > > > > Building MPI subcommunicators...,  level 1...done (8.158366
> seconds)
>  > > > > > > Total time in MGBuild     88.644299 seconds
>  > > > > > >
>  > > > > > > Let me replace qsort with mergesort in Comm_split to see
> what will happen.
>  > > > > > >
>  > > > > > > --Junchao Zhang
>  > > > > > >
>  > > > > > >
>  > > > > > > On Sat, May 24, 2014 at 9:07 AM, Sam Williams
> <swwilliams at lbl.gov <mailto:swwilliams at lbl.gov>> wrote:
>  > > > > > > I saw the problem on Mira and K but not Edison.  I don't
> know if that is due to scale or implementation.
>  > > > > > >
>  > > > > > > On Mira, I was running jobs with a 10min wallclock limit.
>   I scaled to 46656 processes with 64 threads per process (c1,
> OMP_NUM_THREADS=64) and all jobs completed successfully.  However, I was
> only looking at MGSolve times and not MGBuild times.  I then decided to
> explore 8 threads per process (c8, OMP_NUM_THREADS=8) and started at the
> high concurrency.  The jobs timed out while still in MGBuild after
> 10mins with 373248 processes as well as with a 20min timeout.  At that
> point I added the USE_SUBCOMM option to enable/disable the use of
> comm_split.  I haven't tried scaling with the sub communicator on Mira
> since then.
>  > > > > > >
>  > > > > > >
>  > > > > > > On May 24, 2014, at 6:39 AM, Junchao Zhang
> <jczhang at mcs.anl.gov <mailto:jczhang at mcs.anl.gov>> wrote:
>  > > > > > >
>  > > > > > > > Hi, Sam,
>  > > > > > > >   Could you give me the exact number of MPI ranks for
> your results on Mira?
>  > > > > > > >   I run hpgmg on Edison with export OMP_NUM_THREADS=1,
> aprun -n 64000 -ss -cc numa_node ./hpgmg-fv 6 1. The total time in
> MGBuild is about 0.005 seconds. I was wondering how many cores I need to
> apply to reproduce the problem.
>  > > > > > > >   Thanks.
>  > > > > > > >
>  > > > > > > >
>  > > > > > > > --Junchao Zhang
>  > > > > > > >
>  > > > > > > >
>  > > > > > > > On Sat, May 17, 2014 at 9:06 AM, Sam Williams
> <swwilliams at lbl.gov <mailto:swwilliams at lbl.gov>> wrote:
>  > > > > > > > I've been conducting scaling experiments on the Mira
> (Blue Gene/Q) and K (Sparc) supercomputers.  I've noticed that the time
> required for MPI_Comm_split and MPI_Comm_dup can grow quickly with scale
> (~P^2).  As such, its performance eventually becomes a bottleneck.  That
> is, although the benefit of using a subcommunicator is huge (multigrid
> solves are weak-scalable), the penalty of creating one (multigrid build
> time) is also huge.
>  > > > > > > >
>  > > > > > > > For example, when scaling from 1 to 46K nodes (= cubes of
> integers) on Mira, the time (in seconds) required to build a MG solver
> (including a subcommunicator) scales as
>  > > > > > > > 222335.output:   Total time in MGBuild      0.056704
>  > > > > > > > 222336.output:   Total time in MGBuild      0.060834
>  > > > > > > > 222348.output:   Total time in MGBuild      0.064782
>  > > > > > > > 222349.output:   Total time in MGBuild      0.090229
>  > > > > > > > 222350.output:   Total time in MGBuild      0.075280
>  > > > > > > > 222351.output:   Total time in MGBuild      0.091852
>  > > > > > > > 222352.output:   Total time in MGBuild      0.137299
>  > > > > > > > 222411.output:   Total time in MGBuild      0.301552
>  > > > > > > > 222413.output:   Total time in MGBuild      0.606444
>  > > > > > > > 222415.output:   Total time in MGBuild      0.745272
>  > > > > > > > 222417.output:   Total time in MGBuild      0.779757
>  > > > > > > > 222418.output:   Total time in MGBuild      4.671838
>  > > > > > > > 222419.output:   Total time in MGBuild     15.123162
>  > > > > > > > 222420.output:   Total time in MGBuild     33.875626
>  > > > > > > > 222421.output:   Total time in MGBuild     49.494547
>  > > > > > > > 222422.output:   Total time in MGBuild    151.329026
>  > > > > > > >
>  > > > > > > > If I disable the call to MPI_Comm_Split, my time scales as
>  > > > > > > > 224982.output:   Total time in MGBuild      0.050143
>  > > > > > > > 224983.output:   Total time in MGBuild      0.052607
>  > > > > > > > 224988.output:   Total time in MGBuild      0.050697
>  > > > > > > > 224989.output:   Total time in MGBuild      0.078343
>  > > > > > > > 224990.output:   Total time in MGBuild      0.054634
>  > > > > > > > 224991.output:   Total time in MGBuild      0.052158
>  > > > > > > > 224992.output:   Total time in MGBuild      0.060286
>  > > > > > > > 225008.output:   Total time in MGBuild      0.062925
>  > > > > > > > 225009.output:   Total time in MGBuild      0.097357
>  > > > > > > > 225010.output:   Total time in MGBuild      0.061807
>  > > > > > > > 225011.output:   Total time in MGBuild      0.076617
>  > > > > > > > 225012.output:   Total time in MGBuild      0.099683
>  > > > > > > > 225013.output:   Total time in MGBuild      0.125580
>  > > > > > > > 225014.output:   Total time in MGBuild      0.190711
>  > > > > > > > 225016.output:   Total time in MGBuild      0.218329
>  > > > > > > > 225017.output:   Total time in MGBuild      0.282081
>  > > > > > > >
>  > > > > > > > Although I didn't directly measure it, this suggests the
> time for MPI_Comm_Split is growing roughly quadratically with process
> concurrency.
>  > > > > > > >
>  > > > > > > >
>  > > > > > > >
>  > > > > > > >
>  > > > > > > > I see the same effect on the K machine (8...64K nodes)
> where the code uses comm_split/dup in conjunction:
>  > > > > > > > run00008_7_1.sh.o2412931:   Total time in MGBuild
>   0.026458 seconds
>  > > > > > > > run00064_7_1.sh.o2415876:   Total time in MGBuild
>   0.039121 seconds
>  > > > > > > > run00512_7_1.sh.o2415877:   Total time in MGBuild
>   0.086800 seconds
>  > > > > > > > run01000_7_1.sh.o2414496:   Total time in MGBuild
>   0.129764 seconds
>  > > > > > > > run01728_7_1.sh.o2415878:   Total time in MGBuild
>   0.224576 seconds
>  > > > > > > > run04096_7_1.sh.o2415880:   Total time in MGBuild
>   0.738979 seconds
>  > > > > > > > run08000_7_1.sh.o2414504:   Total time in MGBuild
>   2.123800 seconds
>  > > > > > > > run13824_7_1.sh.o2415881:   Total time in MGBuild
>   6.276573 seconds
>  > > > > > > > run21952_7_1.sh.o2415882:   Total time in MGBuild
> 13.634200 seconds
>  > > > > > > > run32768_7_1.sh.o2415884:   Total time in MGBuild
> 36.508670 seconds
>  > > > > > > > run46656_7_1.sh.o2415874:   Total time in MGBuild
> 58.668228 seconds
>  > > > > > > > run64000_7_1.sh.o2415875:   Total time in MGBuild
>   117.322217 seconds
>  > > > > > > >
>  > > > > > > >
>  > > > > > > > A glance at the implementation on Mira (I don't know if
> the implementation on K is stock) suggests it should be using qsort to
> sort based on keys.  Unfortunately, qsort is not performance-robust like
> heap/merge sort.  If one were to call comm_split like
>  > > > > > > > MPI_Comm_split(...,mycolor,myrank,...)
> then one runs the risk that the keys are presorted.  This hits qsort's
> worst-case computational complexity, O(P^2).  Demanding that programmers
> avoid sending sorted keys seems unreasonable.
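[Editorial note: the hazard is easy to reproduce with a textbook first-element-pivot quicksort. This is a deliberately naive sketch for illustration; real libc qsort implementations are hybrids and most avoid this case. On already-sorted input every partition is maximally lopsided and the comparison count grows as ~n^2/2.]

```c
/* Comparison counter, reset by the caller before each sort. */
static long ncmp;

/* Textbook Lomuto quicksort with the FIRST element as pivot.  On input
 * that is already sorted, no element is ever less than the pivot, so
 * each level strips off one element and comparisons total n(n-1)/2. */
static void naive_qsort(int *a, int lo, int hi)
{
    if (lo >= hi) return;
    int pivot = a[lo], i = lo;
    for (int j = lo + 1; j <= hi; j++) {
        ncmp++;
        if (a[j] < pivot) {
            i++;
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }
    int t = a[lo]; a[lo] = a[i]; a[i] = t;  /* move pivot into place */
    naive_qsort(a, lo, i - 1);
    naive_qsort(a, i + 1, hi);
}
```

Filling a[] with 0..P-1, which is exactly the key pattern of `MPI_Comm_split(..., 0, myrank, ...)`, and reading ncmp afterward shows the quadratic blowup; a shuffled array sorts with far fewer comparisons.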
>  > > > > > > >
>  > > > > > > >
>  > > > > > > > I should note, I see a similar lack of scaling with
> MPI_Comm_dup on the K machine.  Unfortunately, my BGQ data used an
> earlier version of the code that did not use comm_dup.  As such, I can’t
> definitively say that it is a problem on that machine as well.
>  > > > > > > >
>  > > > > > > > Thus, I'm asking for scalable implementations of
> comm_split/dup, using merge/heap sort whose worst-case complexity is
> still O(P log P), to be prioritized in the next update.
>  > > > > > > >
>  > > > > > > >
>  > > > > > > > thanks
>  > > > > > > > _______________________________________________
>  > > > > > > > To manage subscription options or unsubscribe:
>  > > > > > > > https://lists.mpich.org/mailman/listinfo/devel
>  > > > > > > >
>  > > > > >
>  > > > > >
>  > > > > >
>  > > > >
>  > > > >
>  > > >
>  > > >
>  > >
>  > >
>  > >
>  > > <comm_split.c>
>  >
>  >
>  >
>
>

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


More information about the devel mailing list