[mpich-devel] Fwd: MPI_Comm_Split/Dup scalability on BGQ and K supercomputers
Rob Latham
robl at mcs.anl.gov
Thu May 29 14:41:44 CDT 2014
On 05/29/2014 02:21 PM, Kenneth Raffenetti wrote:
> I do not [know where the source code to IBM's MPI
> implementation on Blue Gene is]. RobL should know. I imagine you have to add it via softenv?
Don't be shy, man! These messages should all be on the mpich-devel list,
which I'm looping back in now.
- you can add the mpich-ibm repository and then clone it with git. A
bit of a mess, since V1R1M1 and V1R2M1 have different directory
structures.
- you can browse the source online. Mira is now (finally!) running
V1R2M1, so follow the BGQ/IBM_V1R2M1/efix15 tag:
http://git.mpich.org/mpich-ibm.git/tree/81ec186864f8cfe69ed9d99dfbd2d5ffd423e6cd:/mpich2
==rob
>
> On 05/29/2014 02:15 PM, Junchao Zhang wrote:
>> Ken,
>> Do you know where the IBM MPI source is on Mira?
>>
>> --Junchao Zhang
>>
>>
>> ---------- Forwarded message ----------
>> From: Thakur, Rajeev <thakur at mcs.anl.gov>
>> Date: Thu, May 29, 2014 at 2:11 PM
>> Subject: Re: [mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K
>> supercomputers
>> To: "Zhang, Junchao" <jczhang at mcs.anl.gov <mailto:jczhang at mcs.anl.gov>>
>> Cc: "Balaji, Pavan" <balaji at anl.gov <mailto:balaji at anl.gov>>
>>
>>
>> Can you look at the source code of IBM MPI on Mira to see what they are
>> doing? I don't know if the source code is in our repo or somewhere else,
>> but Rob Latham should know.
>>
>> Also, there is a bug in Comm_split in the latest version of MPI
>> installed on Mira. There was a lot of discussion on the early-users
>> mailing list. Rob Latham knows about it, and he can tell you what the
>> bug is about.
>>
>> Rajeev
>>
>>
>> On May 29, 2014, at 1:49 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>>
>> > I did more experiments on Mira. But I feel I am lost :) In short, I
>> think the original MPICH comm_split implementation is good if not the
>> best. However, for unknown reasons, the MPI library provided by Mira
>> is bad.
>> > I wrote three micro-benchmarks: 1) qsort.c, a sequential C program
>> that tests libc qsort performance when sorting a presorted array;
>> 2) split.c, which measures MPI_Comm_split(MPI_COMM_WORLD, 0, myrank,
>> ...). Note that all processes have the same color and use their old
>> rank as the key. The maximum time across processes is reported.
>> 3) dup.c, which simply does MPI_Comm_dup(MPI_COMM_WORLD, &newcomm).
>> >
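>> > For reference, split.c boils down to something like the following
>> (a minimal sketch, assuming the usual timing boilerplate; the code I
>> actually ran may differ in details):
>> >
>> > #include <mpi.h>
>> > #include <stdio.h>
>> >
>> > int main(int argc, char **argv)
>> > {
>> >     int myrank;
>> >     double t, tmax;
>> >     MPI_Comm newcomm;
>> >
>> >     MPI_Init(&argc, &argv);
>> >     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>> >
>> >     /* same color everywhere; key = old rank, so keys arrive presorted */
>> >     MPI_Barrier(MPI_COMM_WORLD);
>> >     t = MPI_Wtime();
>> >     MPI_Comm_split(MPI_COMM_WORLD, 0, myrank, &newcomm);
>> >     t = MPI_Wtime() - t;
>> >
>> >     /* report the maximum time across processes */
>> >     MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
>> >     if (myrank == 0)
>> >         printf("MPI_Comm_split took %f seconds\n", tmax);
>> >
>> >     MPI_Comm_free(&newcomm);
>> >     MPI_Finalize();
>> >     return 0;
>> > }
>> >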
>> > I compiled the code using either mpixlc, which is provided by
>> +mpiwrapper-gcc on Mira; or mpicc, which I built myself from mpich/master;
>> or mpicc-mergesort, which I also built myself, with a mergesort-based
>> comm_split.c.
>> >
>> > QSORT : compiled with mpixlc, run with 1 core on 1 node (also run
>> with 16 cores simultaneously on 1 node, which gives even better
>> results). The results indicate the libc qsort used by mpixlc is good.
>> > # integers time (s)
>> > 64000 0.110789
>> > 125000 0.200747
>> > 512000 0.445075
>> >
>> > COMM_SPLIT : compiled with mpixlc. (Note the results match Sam's.)
>> > # ranks time (s)
>> > 64000 2.297776
>> > 125000 8.157885
>> > 512000 237.079862
>> >
>> > COMM_DUP : compiled with mpixlc. (Note the results match Sam's.)
>> > # ranks time (s)
>> > 64000 2.057966
>> > 125000 7.808858
>> > 512000 235.948144
>> >
>> > COMM_SPLIT : compiled with mpicc, run two times. The variation is big
>> at 512,000.
>> > # ranks run 1 (s) run 2 (s)
>> > 64000 0.549024 0.545003
>> > 125000 0.959198 0.965059
>> > 512000 6.625385 3.716696
>> >
>> > COMM_DUP : compiled with mpicc
>> > # ranks time (s)
>> > 64000 0.028518
>> > 125000 0.056386
>> > 512000 0.202677
>> >
>> > COMM_SPLIT : compiled with mpicc-mergesort
>> > # ranks time (s)
>> > 64000 0.362572
>> > 125000 0.712167
>> > 512000 2.929134
>> >
>> > COMM_DUP : compiled with mpicc-mergesort
>> > # ranks time (s)
>> > 64000 0.028835
>> > 125000 0.055987
>> > 512000 0.202660
>> >
>> > (Another difference worth mentioning is that the mpixlc build reports
>> "MPICH2 version 1.5". I looked at the log history of comm_split.c;
>> there are no big changes after MPICH2 1.5.)
>> > We can see that Comm_dup performance is independent of Comm_split and
>> that mpicc-mergesort is slightly better than mpicc. It may still be worth
>> pushing mergesort to mpich.
>> >
>> > I have no idea why mpixlc is bad. Should I report it to ALCF?
>> >
>> > --Junchao Zhang
>> >
>> >
>> > On Wed, May 28, 2014 at 7:34 AM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > Fixed them. Thanks.
>> >
>> > --Junchao Zhang
>> >
>> >
>> > On Tue, May 27, 2014 at 11:02 PM, Thakur, Rajeev <thakur at mcs.anl.gov> wrote:
>> > Junchao,
>> > Your code looks fine, although I didn't check the merge
>> sort in detail for potential bugs. You may want to add a comment explaining
>> that the 2x in the memory allocation for keytable is because the second
>> half of it will be used as scratch space for the merge sort. Otherwise,
>> someone looking at the code 3 years from now will miss it :-). Also, a
>> typo in one comment: sortting -> sorting.
>> >
>> > Rajeev
>> >
>> >
>> > On May 27, 2014, at 3:42 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> >
>> > > Here is the code.
>> > > We do not just sort keys. We also want to keep track of old ranks
>> for these keys. So we actually sort an array of {key, rank}.
>> > >
>> > > --Junchao Zhang
>> > >
>> > >
>> > > On Tue, May 27, 2014 at 12:58 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > > Downloaded mpich2-1.3.1 and checked it. It uses bubble sort, which
>> is a stable sort algorithm, and no extra buffer is needed.
>> > > Merge sort is stable but needs O(N) extra memory. I can allocate
>> the array and the scratch together in one malloc instead of two separate
>> mallocs. Let me do that.
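>> > >
>> > > Something like this (a sketch of the intended layout, not the final
>> patch):
>> > >
>> > > #include <stdlib.h>
>> > >
>> > > typedef struct { int key; int rank; } splittype;
>> > >
>> > > /* One malloc: the first nprocs entries hold the {key, rank} table;
>> > >    the second nprocs entries are scratch space for the merge sort. */
>> > > static splittype *alloc_keytable(int nprocs, splittype **scratch)
>> > > {
>> > >     splittype *keytable = malloc(2 * nprocs * sizeof(splittype));
>> > >     if (keytable)
>> > >         *scratch = keytable + nprocs;
>> > >     return keytable;    /* free(keytable) releases both halves */
>> > > }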
>> > >
>> > > --Junchao Zhang
>> > >
>> > >
>> > > On Tue, May 27, 2014 at 12:46 PM, Thakur, Rajeev <thakur at mcs.anl.gov> wrote:
>> > > The version I suggested (1.4) was not old enough. It has the new
>> code with quick sort, not the old code with bubble sort. Instead, see
>> 1.3.1 (http://www.mpich.org/static/downloads/1.3.1/). I confirmed that
>> it has the old bubble sort. I believe it used O(p) less memory as well
>> but I didn't recheck just now.
>> > >
>> > > Rajeev
>> > >
>> > >
>> > >
>> > > On May 27, 2014, at 12:38 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > >
>> > > > I read the old code. It is not efficient. The code gathers {key,
>> color} (the color field is in effect later used to store old ranks) and
>> puts the pairs in an array. Actually, that is enough info for a stable
>> sort (i.e., using old ranks for ties). But the old code did not realize
>> that and created a new type, splittype {key, color, orig_idx}. It assigns
>> orig_idx the index of the element in the array and uses that for ties.
>> This is the O(p) memory you mentioned. We can say it adds 50% memory use.
>> > > > For merge sort, the buffer has to be standalone (i.e., not mixed
>> with old data). So I deleted the orig_idx field from splittype, and
>> allocated a buffer as big as the original array, in effect adding 100%
>> memory use.
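>> > > >
>> > > > A sketch of such a stable merge sort over {key, rank} pairs
>> (illustrative only; the code I pushed may differ in details):
>> > > >
>> > > > #include <string.h>
>> > > >
>> > > > typedef struct { int key; int rank; } splittype;
>> > > >
>> > > > /* Stable: on equal keys the left (earlier) element is taken
>> > > >    first, so ties keep their original order and no orig_idx is
>> > > >    needed. tmp must have room for n elements. */
>> > > > static void merge_sort(splittype *a, splittype *tmp, int n)
>> > > > {
>> > > >     int h = n / 2, i = 0, j = h, k = 0;
>> > > >     if (n <= 1)
>> > > >         return;
>> > > >     merge_sort(a, tmp, h);
>> > > >     merge_sort(a + h, tmp, n - h);
>> > > >     while (i < h && j < n)
>> > > >         tmp[k++] = (a[j].key < a[i].key) ? a[j++] : a[i++];
>> > > >     while (i < h)
>> > > >         tmp[k++] = a[i++];
>> > > >     while (j < n)
>> > > >         tmp[k++] = a[j++];
>> > > >     memcpy(a, tmp, n * sizeof(splittype));
>> > > > }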
>> > > > I think we can have a fast qsort with 0% extra memory for our
>> case. I will do that experiment later.
>> > > >
>> > > > --Junchao Zhang
>> > > >
>> > > >
>> > > > On Tue, May 27, 2014 at 11:17 AM, Thakur, Rajeev <thakur at mcs.anl.gov> wrote:
>> > > > Junchao,
>> > > > As far as I remember, when we switched from bubble
>> sort to quicksort, we had to allocate additional O(nprocs) memory for
>> the trick to make the qsort stable. You can check the old code in a
>> release in 2011 (say 1.4 from
>> http://www.mpich.org/static/downloads/1.4/). It would be good if we could
>> avoid that additional memory allocation in the new sort. And if you have
>> added one more allocation in the merge sort, that should definitely be
>> avoided.
>> > > >
>> > > > Rajeev
>> > > >
>> > > >
>> > > > On May 27, 2014, at 9:40 AM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > > >
>> > > > > Rajeev,
>> > > > > The code is at mpich-review/commsplit. I pushed it there
>> several days ago and MPICH test results look good. Could you review it?
>> > > > > Yes, I will look into why Comm_dup improved so much.
>> > > > > Another thing I want to investigate: in Comm_split, we sort
>> {key, old rank}, i.e., an integer pair, so we already have a mechanism
>> to avoid qsort's instability problem. We can copy an optimized qsort
>> instead of using the one in libc, and test its performance. The benefit
>> of qsort is that we avoid the extra buffer needed in mergesort. With a
>> customized qsort, we can also avoid function-pointer dereferencing: libc
>> qsort takes a generic comp function pointer, but our case is simple. We
>> just want to compare integers, so we can avoid the function call cost
>> for each comparison.
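>> > > > >
>> > > > > For illustration, the inlined comparison could look like this
>> (a sketch; tie-breaking on the old rank also gives the stable order we
>> need):
>> > > > >
>> > > > > typedef struct { int key; int rank; } splittype;
>> > > > >
>> > > > > /* Direct integer comparison instead of the generic comp()
>> > > > >    function pointer that libc qsort requires; the compiler can
>> > > > >    inline this at every call site. */
>> > > > > static inline int pair_less(const splittype *a,
>> > > > >                             const splittype *b)
>> > > > > {
>> > > > >     if (a->key != b->key)
>> > > > >         return a->key < b->key;
>> > > > >     return a->rank < b->rank;   /* tie-break on old rank */
>> > > > > }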
>> > > > >
>> > > > >
>> > > > >
>> > > > > --Junchao Zhang
>> > > > >
>> > > > >
>> > > > > On Tue, May 27, 2014 at 9:24 AM, Thakur, Rajeev <thakur at mcs.anl.gov> wrote:
>> > > > > Thanks. Have you run your new version through the MPICH test
>> suite to make sure there are no bugs in the new sort?
>> > > > >
>> > > > > Can you look at Comm_dup to see why it improved so much?
>> > > > >
>> > > > > BG/Q has a relatively slow processor, so the sorting time is
>> noticeable, unlike on the NERSC system.
>> > > > >
>> > > > > I think we should include it in 3.1.1.
>> > > > >
>> > > > > Rajeev
>> > > > >
>> > > > > On May 27, 2014, at 9:16 AM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > > > >
>> > > > > > I tested Sam's HPGMG on Mira. I tried mpiwrapper-gcc and
>> mpiwrapper-xl, provided by softenv on Mira, and also tried a
>> self-built mpich with my own comm_split + mergesort. Here are the test
>> results.
>> > > > > >
>> > > > > > mpiwrapper-gcc and mpiwrapper-xl give similar results, like
>> the following (all times are in seconds):
>> > > > > > # of ranks comm_dup comm_split MGBuild
>> > > > > > 64000 2.059304 2.291601 28.085873
>> > > > > > 125000 7.807612 8.158366 88.644299
>> > > > > > 512000 235.923766 (no result; the job timed out)
>> > > > > >
>> > > > > > mpich-mergesort
>> > > > > > # of ranks comm_dup comm_split MGBuild
>> > > > > > 64000 0.046616 0.716383 6.999050
>> > > > > > 125000 0.065818 1.293382 10.696351
>> > > > > > 512000 0.229605 4.998839 52.467708
>> > > > > >
>> > > > > > I don't know why the MPICH libraries on Mira are bad, since
>> optimizations in qsort to avoid the worst case are not new. The libc
>> on Mira should have an optimized qsort.
>> > > > > > Another thing I have not looked into is why comm_dup is
>> affected by comm_split. It seems comm_dup calls comm_split internally.
>> > > > > >
>> > > > > > In short, the current mergesort solution largely fixes the
>> scaling problem. We may further investigate the issue and come up with
>> better solutions. But if you think this is a key fix, probably we can
>> push it to 3.1.1.
>> > > > > >
>> > > > > >
>> > > > > > --Junchao Zhang
>> > > > > >
>> > > > > >
>> > > > > > On Sun, May 25, 2014 at 12:25 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > > > > > Yes, I am going to test that.
>> > > > > >
>> > > > > > --Junchao Zhang
>> > > > > >
>> > > > > >
>> > > > > > On Sun, May 25, 2014 at 11:11 AM, Thakur, Rajeev <thakur at mcs.anl.gov> wrote:
>> > > > > > Can you also see if the xlc compiler makes any difference?
>> > > > > >
>> > > > > > Rajeev
>> > > > > >
>> > > > > > On May 25, 2014, at 8:14 AM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > > > > >
>> > > > > > > I tested hpgmg on Mira and reproduced the scaling problem.
>> I used mpiwrapper-gcc.
>> > > > > > > With 64000 MPI tasks, the output is:
>> > > > > > > Duplicating MPI_COMM_WORLD...done (2.059304 seconds)
>> > > > > > > Building MPI subcommunicators..., level 1...done (2.291601
>> seconds)
>> > > > > > > Total time in MGBuild 28.085873 seconds
>> > > > > > > With 125000 MPI Tasks, the output is:
>> > > > > > > Duplicating MPI_COMM_WORLD...done (7.807612 seconds)
>> > > > > > > Building MPI subcommunicators..., level 1...done (8.158366
>> seconds)
>> > > > > > > Total time in MGBuild 88.644299 seconds
>> > > > > > >
>> > > > > > > Let me replace qsort with mergesort in Comm_split to see
>> what will happen.
>> > > > > > >
>> > > > > > > --Junchao Zhang
>> > > > > > >
>> > > > > > >
>> > > > > > > On Sat, May 24, 2014 at 9:07 AM, Sam Williams <swwilliams at lbl.gov> wrote:
>> > > > > > > I saw the problem on Mira and K but not Edison. I don't
>> know if that is due to scale or implementation.
>> > > > > > >
>> > > > > > > On Mira, I was running jobs with a 10min wallclock limit.
>> I scaled to 46656 processes with 64 threads per process (c1,
>> OMP_NUM_THREADS=64) and all jobs completed successfully. However, I was
>> only looking at MGSolve times and not MGBuild times. I then decided to
>> explore 8 threads per process (c8, OMP_NUM_THREADS=8) and started at the
>> high concurrency. With 373248 processes, the jobs timed out while still
>> in MGBuild, both with the 10min limit and with a 20min timeout. At that
>> point I added the USE_SUBCOMM option to enable/disable the use of
>> comm_split. I haven't tried scaling with the subcommunicator on Mira
>> since then.
>> > > > > > >
>> > > > > > >
>> > > > > > > On May 24, 2014, at 6:39 AM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > > > > > >
>> > > > > > > > Hi, Sam,
>> > > > > > > > Could you give me the exact number of MPI ranks for
>> your results on Mira?
>> > > > > > > > I ran hpgmg on Edison with export OMP_NUM_THREADS=1 and
>> aprun -n 64000 -ss -cc numa_node ./hpgmg-fv 6 1. The total time in
>> MGBuild is about 0.005 seconds. I was wondering how many cores I need to
>> request to reproduce the problem.
>> > > > > > > > Thanks.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > --Junchao Zhang
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Sat, May 17, 2014 at 9:06 AM, Sam Williams <swwilliams at lbl.gov> wrote:
>> > > > > > > > I've been conducting scaling experiments on the Mira
>> (Blue Gene/Q) and K (SPARC) supercomputers. I've noticed that the time
>> required for MPI_Comm_split and MPI_Comm_dup can grow quickly with scale
>> (~P^2). As such, their performance eventually becomes a bottleneck. That
>> is, although the benefit of using a subcommunicator is huge (multigrid
>> solves are weak-scalable), the penalty of creating one (multigrid build
>> time) is also huge.
>> > > > > > > >
>> > > > > > > > For example, when scaling from 1 to 46K nodes (node
>> counts are cubes of integers) on Mira, the time (in seconds) required to
>> build an MG solver (including a subcommunicator) scales as
>> > > > > > > > 222335.output: Total time in MGBuild 0.056704
>> > > > > > > > 222336.output: Total time in MGBuild 0.060834
>> > > > > > > > 222348.output: Total time in MGBuild 0.064782
>> > > > > > > > 222349.output: Total time in MGBuild 0.090229
>> > > > > > > > 222350.output: Total time in MGBuild 0.075280
>> > > > > > > > 222351.output: Total time in MGBuild 0.091852
>> > > > > > > > 222352.output: Total time in MGBuild 0.137299
>> > > > > > > > 222411.output: Total time in MGBuild 0.301552
>> > > > > > > > 222413.output: Total time in MGBuild 0.606444
>> > > > > > > > 222415.output: Total time in MGBuild 0.745272
>> > > > > > > > 222417.output: Total time in MGBuild 0.779757
>> > > > > > > > 222418.output: Total time in MGBuild 4.671838
>> > > > > > > > 222419.output: Total time in MGBuild 15.123162
>> > > > > > > > 222420.output: Total time in MGBuild 33.875626
>> > > > > > > > 222421.output: Total time in MGBuild 49.494547
>> > > > > > > > 222422.output: Total time in MGBuild 151.329026
>> > > > > > > >
>> > > > > > > > If I disable the call to MPI_Comm_split, my time scales as
>> > > > > > > > 224982.output: Total time in MGBuild 0.050143
>> > > > > > > > 224983.output: Total time in MGBuild 0.052607
>> > > > > > > > 224988.output: Total time in MGBuild 0.050697
>> > > > > > > > 224989.output: Total time in MGBuild 0.078343
>> > > > > > > > 224990.output: Total time in MGBuild 0.054634
>> > > > > > > > 224991.output: Total time in MGBuild 0.052158
>> > > > > > > > 224992.output: Total time in MGBuild 0.060286
>> > > > > > > > 225008.output: Total time in MGBuild 0.062925
>> > > > > > > > 225009.output: Total time in MGBuild 0.097357
>> > > > > > > > 225010.output: Total time in MGBuild 0.061807
>> > > > > > > > 225011.output: Total time in MGBuild 0.076617
>> > > > > > > > 225012.output: Total time in MGBuild 0.099683
>> > > > > > > > 225013.output: Total time in MGBuild 0.125580
>> > > > > > > > 225014.output: Total time in MGBuild 0.190711
>> > > > > > > > 225016.output: Total time in MGBuild 0.218329
>> > > > > > > > 225017.output: Total time in MGBuild 0.282081
>> > > > > > > >
>> > > > > > > > Although I didn't directly measure it, this suggests the
>> time for MPI_Comm_split is growing roughly quadratically with process
>> concurrency.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > I see the same effect on the K machine (8...64K nodes)
>> where the code uses comm_split/dup in conjunction:
>> > > > > > > > run00008_7_1.sh.o2412931: Total time in MGBuild   0.026458 seconds
>> > > > > > > > run00064_7_1.sh.o2415876: Total time in MGBuild   0.039121 seconds
>> > > > > > > > run00512_7_1.sh.o2415877: Total time in MGBuild   0.086800 seconds
>> > > > > > > > run01000_7_1.sh.o2414496: Total time in MGBuild   0.129764 seconds
>> > > > > > > > run01728_7_1.sh.o2415878: Total time in MGBuild   0.224576 seconds
>> > > > > > > > run04096_7_1.sh.o2415880: Total time in MGBuild   0.738979 seconds
>> > > > > > > > run08000_7_1.sh.o2414504: Total time in MGBuild   2.123800 seconds
>> > > > > > > > run13824_7_1.sh.o2415881: Total time in MGBuild   6.276573 seconds
>> > > > > > > > run21952_7_1.sh.o2415882: Total time in MGBuild  13.634200 seconds
>> > > > > > > > run32768_7_1.sh.o2415884: Total time in MGBuild  36.508670 seconds
>> > > > > > > > run46656_7_1.sh.o2415874: Total time in MGBuild  58.668228 seconds
>> > > > > > > > run64000_7_1.sh.o2415875: Total time in MGBuild 117.322217 seconds
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > A glance at the implementation on Mira (I don't know if
>> the implementation on K is stock) suggests it should be using qsort to
>> sort based on keys. Unfortunately, qsort is not performance-robust like
>> heap/merge sort. If one were to be productive and call comm_split
>> like...
>> > > > > > > > MPI_Comm_split(...,mycolor,myrank,...)
>> > > > > > > > then one runs the risk that the keys are presorted. This
>> hits the worst-case computational complexity for qsort... O(P^2).
>> Demanding that programmers avoid sending sorted keys seems unreasonable.
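>> > > > > > > >
>> > > > > > > > To see why, consider a textbook quicksort with a fixed
>> pivot choice (illustrative only; I have not seen the actual BG/Q
>> source):
>> > > > > > > >
>> > > > > > > > /* Lomuto partition with the last element as pivot. On
>> > > > > > > >    presorted input every partition splits into sizes
>> > > > > > > >    (n-1, 0), so the recursion depth is n and the total
>> > > > > > > >    work is O(n^2). */
>> > > > > > > > static void naive_qsort(int *a, int lo, int hi)
>> > > > > > > > {
>> > > > > > > >     int pivot, i, j, t;
>> > > > > > > >     if (lo >= hi)
>> > > > > > > >         return;
>> > > > > > > >     pivot = a[hi];
>> > > > > > > >     for (i = lo - 1, j = lo; j < hi; j++)
>> > > > > > > >         if (a[j] <= pivot) {
>> > > > > > > >             t = a[++i]; a[i] = a[j]; a[j] = t;
>> > > > > > > >         }
>> > > > > > > >     t = a[i + 1]; a[i + 1] = a[hi]; a[hi] = t;
>> > > > > > > >     naive_qsort(a, lo, i);
>> > > > > > > >     naive_qsort(a, i + 2, hi);
>> > > > > > > > }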
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > I should note that I see a similar lack of scaling with
>> MPI_Comm_dup on the K machine. Unfortunately, my BGQ data used an
>> earlier version of the code that did not use comm_dup. As such, I can’t
>> definitively say that it is a problem on that machine as well.
>> > > > > > > >
>> > > > > > > > Thus, I'm asking for scalable implementations of
>> comm_split/dup, using merge/heap sort whose worst-case complexity is
>> still O(P log P), to be prioritized in the next update.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > thanks
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > >
>> > > >
>> > >
>> > >
>> > >
>> > > <comm_split.c>
>> >
>> >
>> >
>>
>>
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA