[mpich-devel] Fwd: MPI_Comm_Split/Dup scalability on BGQ and K supercomputers
Rob Latham
robl at mcs.anl.gov
Thu May 29 14:41:44 CDT 2014
On 05/29/2014 02:21 PM, Kenneth Raffenetti wrote:
> I do not [know where the source code to IBM's MPI
> implementation on Blue Gene is]. RobL should know. I imagine you have to add it via softenv?
Don't be shy, man! These messages should all be on the mpich-devel list,
which I'm looping back in now.
- you can add the mpich-ibm repository and then clone it with git. A
bit of a mess, since V1R1M1 and V1R2M1 have different directory
structures.
- you can browse the source online. Mira is now (finally!) running
V1R2M1, so follow the BGQ/IBM_V1R2M1/efix15 tag:
http://git.mpich.org/mpich-ibm.git/tree/81ec186864f8cfe69ed9d99dfbd2d5ffd423e6cd:/mpich2
==rob
>
> On 05/29/2014 02:15 PM, Junchao Zhang wrote:
>> Ken,
>> Do you know where the IBM MPI source is on Mira?
>>
>> --Junchao Zhang
>>
>>
>> ---------- Forwarded message ----------
>> From: Thakur, Rajeev <thakur at mcs.anl.gov>
>> Date: Thu, May 29, 2014 at 2:11 PM
>> Subject: Re: [mpich-devel] MPI_Comm_Split/Dup scalability on BGQ and K
>> supercomputers
>> To: "Zhang, Junchao" <jczhang at mcs.anl.gov <mailto:jczhang at mcs.anl.gov>>
>> Cc: "Balaji, Pavan" <balaji at anl.gov <mailto:balaji at anl.gov>>
>>
>>
>> Can you look at the source code of IBM MPI on Mira to see what they are
>> doing? I don't know if the source code is in our repo or somewhere else,
>> but Rob Latham should know.
>>
>> Also, there is a bug in Comm_split in the latest version of MPI
>> installed on Mira. There was a lot of discussion on the early-users
>> mailing list. Rob Latham knows about it, and he can tell you what the
>> bug is about.
>>
>> Rajeev
>>
>>
>> On May 29, 2014, at 1:49 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>>
>> > I did more experiments on Mira. But I feel I am lost :) In short, I
>> think the original MPICH comm_split implementation is good if not the
>> best. However, for unknown reasons, the MPI library provided by Mira
>> is bad.
>> > I wrote three micro-benchmarks: 1) qsort.c, a sequential C program
>> that tests libc qsort performance when sorting a presorted array;
>> 2) split.c, which measures MPI_Comm_split(MPI_COMM_WORLD, 0, myrank,
>> ...). Note that all processes have the same color and use their old
>> rank as the key. The maximum time across processes is reported.
>> 3) dup.c, which simply does MPI_Comm_dup(MPI_COMM_WORLD, &newcomm).
>> >
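>> > For reference, split.c boils down to something like the following
>> (a minimal sketch, assuming the usual timing boilerplate; the code I
>> actually ran may differ in details):
>> >
>> > #include <mpi.h>
>> > #include <stdio.h>
>> >
>> > int main(int argc, char **argv)
>> > {
>> >     int myrank;
>> >     double t, tmax;
>> >     MPI_Comm newcomm;
>> >
>> >     MPI_Init(&argc, &argv);
>> >     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>> >
>> >     /* same color everywhere; key = old rank, so keys arrive presorted */
>> >     MPI_Barrier(MPI_COMM_WORLD);
>> >     t = MPI_Wtime();
>> >     MPI_Comm_split(MPI_COMM_WORLD, 0, myrank, &newcomm);
>> >     t = MPI_Wtime() - t;
>> >
>> >     /* report the maximum time across processes */
>> >     MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
>> >     if (myrank == 0)
>> >         printf("MPI_Comm_split took %f seconds\n", tmax);
>> >
>> >     MPI_Comm_free(&newcomm);
>> >     MPI_Finalize();
>> >     return 0;
>> > }
>> >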
>> > I compiled the code using either mpixlc, which is provided by
>> +mpiwrapper-gcc on Mira; or mpicc, which I built myself from mpich/master;
>> or mpicc-mergesort, which I also built myself, with a mergesort-based
>> comm_split.c.
>> >
>> > QSORT : compiled with mpixlc, run with 1 core on 1 node (also run
>> with 16 cores simultaneously on 1 node, which gives even better
>> results). The results indicate the libc qsort used by mpixlc is good.
>> > # integers time (s)
>> > 64000 0.110789
>> > 125000 0.200747
>> > 512000 0.445075
>> >
>> > COMM_SPLIT : compiled with mpixlc. (Note the results match Sam's.)
>> > # ranks time (s)
>> > 64000 2.297776
>> > 125000 8.157885
>> > 512000 237.079862
>> >
>> > COMM_DUP : compiled with mpixlc. (Note the results match Sam's.)
>> > # ranks time (s)
>> > 64000 2.057966
>> > 125000 7.808858
>> > 512000 235.948144
>> >
>> > COMM_SPLIT : compiled with mpicc, run two times. The variation is big
>> at 512,000.
>> > # ranks run 1 (s) run 2 (s)
>> > 64000 0.549024 0.545003
>> > 125000 0.959198 0.965059
>> > 512000 6.625385 3.716696
>> >
>> > COMM_DUP : compiled with mpicc
>> > # ranks time (s)
>> > 64000 0.028518
>> > 125000 0.056386
>> > 512000 0.202677
>> >
>> > COMM_SPLIT : compiled with mpicc-mergesort
>> > # ranks time (s)
>> > 64000 0.362572
>> > 125000 0.712167
>> > 512000 2.929134
>> >
>> > COMM_DUP : compiled with mpicc-mergesort
>> > # ranks time (s)
>> > 64000 0.028835
>> > 125000 0.055987
>> > 512000 0.202660
>> >
>> > (Another difference worth mentioning is that the mpixlc build reports
>> "MPICH2 version 1.5". I looked at the log history of comm_split.c;
>> there are no big changes after MPICH2 1.5.)
>> > We can see that Comm_dup performance is independent of Comm_split and
>> that mpicc-mergesort is slightly better than mpicc. It may still be worth
>> pushing mergesort to mpich.
>> >
>> > I have no idea why mpixlc is bad. Should I report it to ALCF?
>> >
>> > --Junchao Zhang
>> >
>> >
>> > On Wed, May 28, 2014 at 7:34 AM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > Fixed them. Thanks.
>> >
>> > --Junchao Zhang
>> >
>> >
>> > On Tue, May 27, 2014 at 11:02 PM, Thakur, Rajeev <thakur at mcs.anl.gov> wrote:
>> > Junchao,
>> > Your code looks fine, although I didn't check the merge
>> sort in detail for potential bugs. You may want to add a comment explaining
>> that the 2x in the memory allocation for keytable is because the second
>> half of it will be used as scratch space for the merge sort. Otherwise,
>> someone looking at the code 3 years from now will miss it :-). Also, a
>> typo in one comment: sortting -> sorting.
>> >
>> > Rajeev
>> >
>> >
>> > On May 27, 2014, at 3:42 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> >
>> > > Here is the code.
>> > > We do not just sort keys. We also want to keep track of old ranks
>> for these keys. So we actually sort an array of {key, rank}.
>> > >
>> > > --Junchao Zhang
>> > >
>> > >
>> > > On Tue, May 27, 2014 at 12:58 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > > Downloaded mpich2-1.3.1 and checked it. It uses bubble sort, which
>> is a stable sort algorithm, and no extra buffer is needed.
>> > > Merge sort is stable but needs O(N) extra memory. I can allocate
>> the array and the scratch together in one malloc instead of two separate
>> mallocs. Let me do that.
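>> > >
>> > > Something like this (a sketch of the intended layout, not the final
>> patch):
>> > >
>> > > #include <stdlib.h>
>> > >
>> > > typedef struct { int key; int rank; } splittype;
>> > >
>> > > /* One malloc: the first nprocs entries hold the {key, rank} table;
>> > >    the second nprocs entries are scratch space for the merge sort. */
>> > > static splittype *alloc_keytable(int nprocs, splittype **scratch)
>> > > {
>> > >     splittype *keytable = malloc(2 * nprocs * sizeof(splittype));
>> > >     if (keytable)
>> > >         *scratch = keytable + nprocs;
>> > >     return keytable;    /* free(keytable) releases both halves */
>> > > }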
>> > >
>> > > --Junchao Zhang
>> > >
>> > >
>> > > On Tue, May 27, 2014 at 12:46 PM, Thakur, Rajeev <thakur at mcs.anl.gov> wrote:
>> > > The version I suggested (1.4) was not old enough. It has the new
>> code with quick sort, not the old code with bubble sort. Instead, see
>> 1.3.1 (http://www.mpich.org/static/downloads/1.3.1/). I confirmed that
>> it has the old bubble sort. I believe it used O(p) less memory as well
>> but I didn't recheck just now.
>> > >
>> > > Rajeev
>> > >
>> > >
>> > >
>> > > On May 27, 2014, at 12:38 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > >
>> > > > I read the old code. It is not efficient. The code gathers {key,
>> color} (the color field is in effect later used to store old ranks) and
>> puts the pairs in an array. Actually, that is enough info for a stable
>> sort (i.e., using old ranks for ties). But the old code did not realize
>> that and created a new type, splittype {key, color, orig_idx}. It assigns
>> orig_idx the index of the element in the array and uses that for ties.
>> This is the O(p) memory you mentioned. We can say it adds 50% memory use.
>> > > > For merge sort, the buffer has to be standalone (i.e., not mixed
>> with old data). So I deleted the orig_idx field from splittype, and
>> allocated a buffer as big as the original array, in effect adding 100%
>> memory use.
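>> > > >
>> > > > A sketch of such a stable merge sort over {key, rank} pairs
>> (illustrative only; the code I pushed may differ in details):
>> > > >
>> > > > #include <string.h>
>> > > >
>> > > > typedef struct { int key; int rank; } splittype;
>> > > >
>> > > > /* Stable: on equal keys the left (earlier) element is taken
>> > > >    first, so ties keep their original order and no orig_idx is
>> > > >    needed. tmp must have room for n elements. */
>> > > > static void merge_sort(splittype *a, splittype *tmp, int n)
>> > > > {
>> > > >     int h = n / 2, i = 0, j = h, k = 0;
>> > > >     if (n <= 1)
>> > > >         return;
>> > > >     merge_sort(a, tmp, h);
>> > > >     merge_sort(a + h, tmp, n - h);
>> > > >     while (i < h && j < n)
>> > > >         tmp[k++] = (a[j].key < a[i].key) ? a[j++] : a[i++];
>> > > >     while (i < h)
>> > > >         tmp[k++] = a[i++];
>> > > >     while (j < n)
>> > > >         tmp[k++] = a[j++];
>> > > >     memcpy(a, tmp, n * sizeof(splittype));
>> > > > }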
>> > > > I think we can have a fast qsort with 0% extra memory for our
>> case. I will do that experiment later.
>> > > >
>> > > > --Junchao Zhang
>> > > >
>> > > >
>> > > > On Tue, May 27, 2014 at 11:17 AM, Thakur, Rajeev <thakur at mcs.anl.gov> wrote:
>> > > > Junchao,
>> > > > As far as I remember, when we switched from bubble
>> sort to quicksort, we had to allocate additional O(nprocs) memory for
>> the trick to make the qsort stable. You can check the old code in a
>> release in 2011 (say 1.4 from
>> http://www.mpich.org/static/downloads/1.4/). It would be good if we could
>> avoid that additional memory allocation in the new sort. And if you have
>> added one more allocation in the merge sort, that should definitely be
>> avoided.
>> > > >
>> > > > Rajeev
>> > > >
>> > > >
>> > > > On May 27, 2014, at 9:40 AM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > > >
>> > > > > Rajeev,
>> > > > > The code is at mpich-review/commsplit. I pushed it there
>> several days ago and MPICH test results look good. Could you review it?
>> > > > > Yes, I will look into why Comm_dup improved so much.
>> > > > > Another thing I want to investigate: in Comm_split, we sort
>> {key, old rank}, i.e., an integer pair, so we already have a mechanism
>> to avoid qsort's instability problem. We can copy an optimized qsort
>> instead of using the one in libc, and test its performance. The benefit
>> of qsort is that we avoid the extra buffer needed in mergesort. With a
>> customized qsort, we can also avoid function-pointer dereferencing: libc
>> qsort takes a generic comp function pointer, but our case is simple. We
>> just want to compare integers, so we can avoid the function call cost
>> for each comparison.
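>> > > > >
>> > > > > For illustration, the inlined comparison could look like this
>> (a sketch; tie-breaking on the old rank also gives the stable order we
>> need):
>> > > > >
>> > > > > typedef struct { int key; int rank; } splittype;
>> > > > >
>> > > > > /* Direct integer comparison instead of the generic comp()
>> > > > >    function pointer that libc qsort requires; the compiler can
>> > > > >    inline this at every call site. */
>> > > > > static inline int pair_less(const splittype *a,
>> > > > >                             const splittype *b)
>> > > > > {
>> > > > >     if (a->key != b->key)
>> > > > >         return a->key < b->key;
>> > > > >     return a->rank < b->rank;   /* tie-break on old rank */
>> > > > > }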
>> > > > >
>> > > > >
>> > > > >
>> > > > > --Junchao Zhang
>> > > > >
>> > > > >
>> > > > > On Tue, May 27, 2014 at 9:24 AM, Thakur, Rajeev <thakur at mcs.anl.gov> wrote:
>> > > > > Thanks. Have you run your new version through the MPICH test
>> suite to make sure there are no bugs in the new sort?
>> > > > >
>> > > > > Can you look at Comm_dup to see why it improved so much?
>> > > > >
>> > > > > BG/Q has a relatively slow processor, so the sorting time is
>> noticeable, unlike on the NERSC system.
>> > > > >
>> > > > > I think we should include it in 3.1.1.
>> > > > >
>> > > > > Rajeev
>> > > > >
>> > > > > On May 27, 2014, at 9:16 AM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > > > >
>> > > > > > I tested Sam's HPGMG on Mira. I tried mpiwrapper-gcc and
>> mpiwrapper-xl, provided by softenv on Mira, and also tried a
>> self-built mpich with my own comm_split + mergesort. Here are the test
>> results.
>> > > > > >
>> > > > > > mpiwrapper-gcc and mpiwrapper-xl give similar results, like
>> the following (all times are in seconds):
>> > > > > > # of ranks comm_dup comm_split MGBuild
>> > > > > > 64000 2.059304 2.291601 28.085873
>> > > > > > 125000 7.807612 8.158366 88.644299
>> > > > > > 512000 235.923766 (no result; the job timed out)
>> > > > > >
>> > > > > > mpich-mergesort
>> > > > > > # of ranks comm_dup comm_split MGBuild
>> > > > > > 64000 0.046616 0.716383 6.999050
>> > > > > > 125000 0.065818 1.293382 10.696351
>> > > > > > 512000 0.229605 4.998839 52.467708
>> > > > > >
>> > > > > > I don't know why the MPICH libraries on Mira are bad, since
>> optimizations in qsort to avoid the worst case are not new. The libc
>> on Mira should have an optimized qsort.
>> > > > > > Another thing I have not looked into is why comm_dup is
>> affected by comm_split. It seems comm_dup calls comm_split internally.
>> > > > > >
>> > > > > > In short, the current mergesort solution largely fixes the
>> scaling problem. We may further investigate the issue and come up with
>> better solutions. But if you think this is a key fix, probably we can
>> push it to 3.1.1.
>> > > > > >
>> > > > > >
>> > > > > > --Junchao Zhang
>> > > > > >
>> > > > > >
>> > > > > > On Sun, May 25, 2014 at 12:25 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > > > > > Yes, I am going to test that.
>> > > > > >
>> > > > > > --Junchao Zhang
>> > > > > >
>> > > > > >
>> > > > > > On Sun, May 25, 2014 at 11:11 AM, Thakur, Rajeev <thakur at mcs.anl.gov> wrote:
>> > > > > > Can you also see if the xlc compiler makes any difference?
>> > > > > >
>> > > > > > Rajeev
>> > > > > >
>> > > > > > On May 25, 2014, at 8:14 AM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > > > > >
>> > > > > > > I tested hpgmg on Mira and reproduced the scaling problem.
>> I used mpiwrapper-gcc.
>> > > > > > > With 64000 MPI tasks, the output is:
>> > > > > > > Duplicating MPI_COMM_WORLD...done (2.059304 seconds)
>> > > > > > > Building MPI subcommunicators..., level 1...done (2.291601
>> seconds)
>> > > > > > > Total time in MGBuild 28.085873 seconds
>> > > > > > > With 125000 MPI Tasks, the output is:
>> > > > > > > Duplicating MPI_COMM_WORLD...done (7.807612 seconds)
>> > > > > > > Building MPI subcommunicators..., level 1...done (8.158366
>> seconds)
>> > > > > > > Total time in MGBuild 88.644299 seconds
>> > > > > > >
>> > > > > > > Let me replace qsort with mergesort in Comm_split to see
>> what will happen.
>> > > > > > >
>> > > > > > > --Junchao Zhang
>> > > > > > >
>> > > > > > >
>> > > > > > > On Sat, May 24, 2014 at 9:07 AM, Sam Williams <swwilliams at lbl.gov> wrote:
>> > > > > > > I saw the problem on Mira and K but not Edison. I don't
>> know if that is due to scale or implementation.
>> > > > > > >
>> > > > > > > On Mira, I was running jobs with a 10min wallclock limit.
>> I scaled to 46656 processes with 64 threads per process (c1,
>> OMP_NUM_THREADS=64) and all jobs completed successfully. However, I was
>> only looking at MGSolve times and not MGBuild times. I then decided to
>> explore 8 threads per process (c8, OMP_NUM_THREADS=8) and started at the
>> high concurrency. With 373248 processes, the jobs timed out while still
>> in MGBuild, both with the 10min limit and with a 20min timeout. At that
>> point I added the USE_SUBCOMM option to enable/disable the use of
>> comm_split. I haven't tried scaling with the subcommunicator on Mira
>> since then.
>> > > > > > >
>> > > > > > >
>> > > > > > > On May 24, 2014, at 6:39 AM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> > > > > > >
>> > > > > > > > Hi, Sam,
>> > > > > > > > Could you give me the exact number of MPI ranks for
>> your results on Mira?
>> > > > > > > > I ran hpgmg on Edison with export OMP_NUM_THREADS=1 and
>> aprun -n 64000 -ss -cc numa_node ./hpgmg-fv 6 1. The total time in
>> MGBuild is about 0.005 seconds. I was wondering how many cores I need to
>> request to reproduce the problem.
>> > > > > > > > Thanks.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > --Junchao Zhang
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Sat, May 17, 2014 at 9:06 AM, Sam Williams <swwilliams at lbl.gov> wrote:
>> > > > > > > > I've been conducting scaling experiments on the Mira
>> (Blue Gene/Q) and K (SPARC) supercomputers. I've noticed that the time
>> required for MPI_Comm_split and MPI_Comm_dup can grow quickly with scale
>> (~P^2). As such, their performance eventually becomes a bottleneck. That
>> is, although the benefit of using a subcommunicator is huge (multigrid
>> solves are weak-scalable), the penalty of creating one (multigrid build
>> time) is also huge.
>> > > > > > > >
>> > > > > > > > For example, when scaling from 1 to 46K nodes (node
>> counts are cubes of integers) on Mira, the time (in seconds) required to
>> build an MG solver (including a subcommunicator) scales as
>> > > > > > > > 222335.output: Total time in MGBuild 0.056704
>> > > > > > > > 222336.output: Total time in MGBuild 0.060834
>> > > > > > > > 222348.output: Total time in MGBuild 0.064782
>> > > > > > > > 222349.output: Total time in MGBuild 0.090229
>> > > > > > > > 222350.output: Total time in MGBuild 0.075280
>> > > > > > > > 222351.output: Total time in MGBuild 0.091852
>> > > > > > > > 222352.output: Total time in MGBuild 0.137299
>> > > > > > > > 222411.output: Total time in MGBuild 0.301552
>> > > > > > > > 222413.output: Total time in MGBuild 0.606444
>> > > > > > > > 222415.output: Total time in MGBuild 0.745272
>> > > > > > > > 222417.output: Total time in MGBuild 0.779757
>> > > > > > > > 222418.output: Total time in MGBuild 4.671838
>> > > > > > > > 222419.output: Total time in MGBuild 15.123162
>> > > > > > > > 222420.output: Total time in MGBuild 33.875626
>> > > > > > > > 222421.output: Total time in MGBuild 49.494547
>> > > > > > > > 222422.output: Total time in MGBuild 151.329026
>> > > > > > > >
>> > > > > > > > If I disable the call to MPI_Comm_split, my time scales as
>> > > > > > > > 224982.output: Total time in MGBuild 0.050143
>> > > > > > > > 224983.output: Total time in MGBuild 0.052607
>> > > > > > > > 224988.output: Total time in MGBuild 0.050697
>> > > > > > > > 224989.output: Total time in MGBuild 0.078343
>> > > > > > > > 224990.output: Total time in MGBuild 0.054634
>> > > > > > > > 224991.output: Total time in MGBuild 0.052158
>> > > > > > > > 224992.output: Total time in MGBuild 0.060286
>> > > > > > > > 225008.output: Total time in MGBuild 0.062925
>> > > > > > > > 225009.output: Total time in MGBuild 0.097357
>> > > > > > > > 225010.output: Total time in MGBuild 0.061807
>> > > > > > > > 225011.output: Total time in MGBuild 0.076617
>> > > > > > > > 225012.output: Total time in MGBuild 0.099683
>> > > > > > > > 225013.output: Total time in MGBuild 0.125580
>> > > > > > > > 225014.output: Total time in MGBuild 0.190711
>> > > > > > > > 225016.output: Total time in MGBuild 0.218329
>> > > > > > > > 225017.output: Total time in MGBuild 0.282081
>> > > > > > > >
>> > > > > > > > Although I didn't directly measure it, this suggests the
>> time for MPI_Comm_split is growing roughly quadratically with process
>> concurrency.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > I see the same effect on the K machine (8...64K nodes)
>> where the code uses comm_split/dup in conjunction:
>> > > > > > > > run00008_7_1.sh.o2412931: Total time in MGBuild   0.026458 seconds
>> > > > > > > > run00064_7_1.sh.o2415876: Total time in MGBuild   0.039121 seconds
>> > > > > > > > run00512_7_1.sh.o2415877: Total time in MGBuild   0.086800 seconds
>> > > > > > > > run01000_7_1.sh.o2414496: Total time in MGBuild   0.129764 seconds
>> > > > > > > > run01728_7_1.sh.o2415878: Total time in MGBuild   0.224576 seconds
>> > > > > > > > run04096_7_1.sh.o2415880: Total time in MGBuild   0.738979 seconds
>> > > > > > > > run08000_7_1.sh.o2414504: Total time in MGBuild   2.123800 seconds
>> > > > > > > > run13824_7_1.sh.o2415881: Total time in MGBuild   6.276573 seconds
>> > > > > > > > run21952_7_1.sh.o2415882: Total time in MGBuild  13.634200 seconds
>> > > > > > > > run32768_7_1.sh.o2415884: Total time in MGBuild  36.508670 seconds
>> > > > > > > > run46656_7_1.sh.o2415874: Total time in MGBuild  58.668228 seconds
>> > > > > > > > run64000_7_1.sh.o2415875: Total time in MGBuild 117.322217 seconds
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > A glance at the implementation on Mira (I don't know if
>> the implementation on K is stock) suggests it should be using qsort to
>> sort based on keys. Unfortunately, qsort is not performance-robust like
>> heap/merge sort. If one were to be productive and call comm_split
>> like...
>> > > > > > > > MPI_Comm_split(...,mycolor,myrank,...)
>> > > > > > > > then one runs the risk that the keys are presorted. This
>> hits the worst-case computational complexity for qsort... O(P^2).
>> Demanding that programmers avoid sending sorted keys seems unreasonable.
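>> > > > > > > >
>> > > > > > > > To see why, consider a textbook quicksort with a fixed
>> pivot choice (illustrative only; I have not seen the actual BG/Q
>> source):
>> > > > > > > >
>> > > > > > > > /* Lomuto partition with the last element as pivot. On
>> > > > > > > >    presorted input every partition splits into sizes
>> > > > > > > >    (n-1, 0), so the recursion depth is n and the total
>> > > > > > > >    work is O(n^2). */
>> > > > > > > > static void naive_qsort(int *a, int lo, int hi)
>> > > > > > > > {
>> > > > > > > >     int pivot, i, j, t;
>> > > > > > > >     if (lo >= hi)
>> > > > > > > >         return;
>> > > > > > > >     pivot = a[hi];
>> > > > > > > >     for (i = lo - 1, j = lo; j < hi; j++)
>> > > > > > > >         if (a[j] <= pivot) {
>> > > > > > > >             t = a[++i]; a[i] = a[j]; a[j] = t;
>> > > > > > > >         }
>> > > > > > > >     t = a[i + 1]; a[i + 1] = a[hi]; a[hi] = t;
>> > > > > > > >     naive_qsort(a, lo, i);
>> > > > > > > >     naive_qsort(a, i + 2, hi);
>> > > > > > > > }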
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > I should note that I see a similar lack of scaling with
>> MPI_Comm_dup on the K machine. Unfortunately, my BGQ data used an
>> earlier version of the code that did not use comm_dup. As such, I can’t
>> definitively say that it is a problem on that machine as well.
>> > > > > > > >
>> > > > > > > > Thus, I'm asking for scalable implementations of
>> comm_split/dup, using merge/heap sort whose worst-case complexity is
>> still O(P log P), to be prioritized in the next update.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > thanks
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > >
>> > > >
>> > >
>> > >
>> > >
>> > > <comm_split.c>
>> >
>> >
>> >
>>
>>
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA