[mpich-discuss] Is there any optimization of collective calls (MPI_Allreduce) for 2^n ranks?

Aiman Fang aimanf at cs.uchicago.edu
Thu Feb 26 10:10:21 CST 2015


Thank you all for your help.

We were running experiments that required comparing collective-communication performance for power-of-two and non-power-of-two process counts. After adding an MPI_Barrier before MPI_Allreduce, we did observe a performance improvement for the non-power-of-two process counts.
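
For reference, the measurement pattern looks roughly like the sketch below: 100 outer loops, each with a synthetic compute phase and 30 timed MPI_Allreduce calls, with an optional MPI_Barrier before each collective. The buffer size, datatype, and the choice to keep the barrier outside the timed region are illustrative assumptions, not the exact benchmark we ran.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int count          = 1024;       /* doubles per allreduce (illustrative) */
    const int calls_per_loop = 30;         /* 30 MPI_Allreduce calls per loop */
    const int loops          = 100;        /* 100 loops in total */
    const int use_barrier    = (argc > 1); /* any argument enables the barrier */

    double *sendbuf = malloc(count * sizeof(double));
    double *recvbuf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++)
        sendbuf[i] = rank + i;

    double comm_time = 0.0;
    for (int l = 0; l < loops; l++) {
        /* ... synthetic computation phase would go here ... */
        for (int c = 0; c < calls_per_loop; c++) {
            if (use_barrier)
                MPI_Barrier(MPI_COMM_WORLD);  /* resynchronize ranks first */
            double t0 = MPI_Wtime();          /* barrier time is not counted */
            MPI_Allreduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
            comm_time += MPI_Wtime() - t0;
        }
    }

    if (rank == 0)
        printf("%d ranks: total MPI_Allreduce time = %f s\n", nprocs, comm_time);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc, the same binary can be run with and without a command-line argument to compare the unsynchronized and barrier-synchronized cases.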

Best,
Aiman
On Feb 25, 2015, at 8:58 PM, Jeff Hammond <jeff.science at gmail.com> wrote:

> See papers by Sameer Kumar and coworkers on Google Scholar for
> details.  You need to look at the pamid device code in MPICH to see
> the collectives that are actually used on BGQ, but that won't actually
> help, since all of the BGQ collective magic is usually inside PAMI.
> PAMI is open-source but not exactly easy to understand.
> 
> Unlike on previous generations of Blue Gene, the BGQ collective
> optimizations are much more general and often apply to communicator sizes other
> than 2^n.
> 
> Based on my experiments from when I was at ALCF, you may find that calling
> MPI_Barrier before MPI_Allreduce can improve performance
> significantly.
> 
> This is hardly the full story.  It would be useful to know more about
> what you are trying to accomplish.
> 
> Best,
> 
> Jeff
> 
> On Wed, Feb 25, 2015 at 3:07 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> Yes. Many collectives have optimizations for power-of-two numbers of processes.
>> In MPICH's source file allreduce.c, you can find the following comment.
>> 
>> /* This is the default implementation of allreduce. The algorithm is:
>> 
>>   Algorithm: MPI_Allreduce
>> 
>>   For the heterogeneous case, we call MPI_Reduce followed by MPI_Bcast
>>   in order to meet the requirement that all processes must have the
>>   same result. For the homogeneous case, we use the following algorithms.
>> 
>> 
>>   For long messages and for builtin ops and if count >= pof2 (where
>>   pof2 is the nearest power-of-two less than or equal to the number
>>   of processes), we use Rabenseifner's algorithm (see
>>   http://www.hlrs.de/mpi/myreduce.html).
>>   This algorithm implements the allreduce in two steps: first a
>>   reduce-scatter, followed by an allgather. A recursive-halving
>>   algorithm (beginning with processes that are distance 1 apart) is
>>   used for the reduce-scatter, and a recursive doubling
>>   algorithm is used for the allgather. The non-power-of-two case is
>>   handled by dropping to the nearest lower power-of-two: the first
>>   few even-numbered processes send their data to their right neighbors
>>   (rank+1), and the reduce-scatter and allgather happen among the remaining
>>   power-of-two processes. At the end, the first few even-numbered
>>   processes get the result from their right neighbors.
>> 
>>   For the power-of-two case, the cost for the reduce-scatter is
>>   lgp.alpha + n.((p-1)/p).beta + n.((p-1)/p).gamma. The cost for the
>>   allgather is lgp.alpha + n.((p-1)/p).beta. Therefore, the
>>   total cost is:
>>   Cost = 2.lgp.alpha + 2.n.((p-1)/p).beta + n.((p-1)/p).gamma
>> 
>>   For the non-power-of-two case,
>>   Cost = (2.floor(lgp)+2).alpha + (2.((p-1)/p) + 2).n.beta +
>>          n.(1+(p-1)/p).gamma
>> 
>> 
>>   For short messages, for user-defined ops, and for count < pof2
>>   we use a recursive doubling algorithm (similar to the one in
>>   MPI_Allgather). We use this algorithm in the case of user-defined ops
>>   because in this case derived datatypes are allowed, and the user
>>   could pass basic datatypes on one process and derived on another as
>>   long as the type maps are the same. Breaking up derived datatypes
>>   to do the reduce-scatter is tricky.
>> 
>>   Cost = lgp.alpha + n.lgp.beta + n.lgp.gamma
>> 
>>   Possible improvements:
>> 
>>   End Algorithm: MPI_Allreduce
>> */
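>>
>> As a rough, self-contained illustration (not part of the MPICH source), the
>> toy program below plugs p = 511, 512, and 513 into the two long-message cost
>> formulas above; the values chosen for alpha (latency), beta (per-byte
>> transfer), gamma (per-byte reduction), and the message size n are arbitrary
>> placeholders:
>>
>> #include <math.h>
>> #include <stdio.h>
>>
>> int main(void)
>> {
>>     const double alpha = 1.0e-6;  /* hypothetical latency, seconds */
>>     const double beta  = 1.0e-9;  /* hypothetical seconds per byte moved */
>>     const double gamma = 5.0e-10; /* hypothetical seconds per byte reduced */
>>     const double n     = 8192.0;  /* message size in bytes (illustrative) */
>>     const int    procs[] = { 511, 512, 513 };
>>
>>     for (int i = 0; i < 3; i++) {
>>         double p       = procs[i];
>>         double lgp     = log2(p);
>>         int    is_pof2 = (procs[i] & (procs[i] - 1)) == 0;
>>
>>         /* Rabenseifner (long-message) cost model quoted above */
>>         double cost;
>>         if (is_pof2)
>>             cost = 2.0 * lgp * alpha
>>                  + 2.0 * n * ((p - 1.0) / p) * beta
>>                  + n * ((p - 1.0) / p) * gamma;
>>         else
>>             cost = (2.0 * floor(lgp) + 2.0) * alpha
>>                  + (2.0 * ((p - 1.0) / p) + 2.0) * n * beta
>>                  + n * (1.0 + (p - 1.0) / p) * gamma;
>>
>>         printf("p = %3d  predicted allreduce cost = %e s\n", procs[i], cost);
>>     }
>>     return 0;
>> }
>>
>> Whatever constants you pick, the non-power-of-two cases pay roughly twice
>> the beta and gamma terms of the power-of-two case, which is why 511 and 513
>> processes are expected to be slower than 512 for long messages.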
>> 
>> --Junchao Zhang
>> 
>> On Wed, Feb 25, 2015 at 2:59 PM, Aiman Fang <aimanf at cs.uchicago.edu> wrote:
>>> 
>>> Hi,
>>> 
>>> I came across a problem in my experiments that makes me wonder whether there
>>> is any optimization of collective calls, such as MPI_Allreduce, for 2^n
>>> numbers of ranks.
>>> 
>>> We ran some experiments on the Argonne Vesta system to measure the time of
>>> MPI_Allreduce calls using 511, 512, and 513 processes (one process per
>>> node). In each run, the synthetic benchmark first did some computation and
>>> then called MPI_Allreduce 30 times, for 100 loops in total. We measured the
>>> total time spent on communication.
>>> 
>>> We found that the 512-process run gives the best performance. The times for
>>> 511, 512, and 513 processes are 0.1492, 0.1449, and 0.1547 seconds,
>>> respectively; that is, 512 processes outperform 511 by about 3.0% and 513 by
>>> about 6.8%.
>>> 
>>> The MPICH version we used is as follows.
>>> 
>>> $ mpichversion
>>> MPICH Version:          3.1.2
>>> MPICH Release date:     Mon Jul 21 16:00:21 CDT 2014
>>> MPICH Device:           pamid
>>> MPICH configure:        --prefix=/home/fujita/soft/mpich-3.1.2
>>> --host=powerpc64-bgq-linux --with-device=pamid --with-file-system=gpfs:BGQ
>>> --disable-wrapper-rpath
>>> MPICH CC:       powerpc64-bgq-linux-gcc    -O2
>>> MPICH CXX:      powerpc64-bgq-linux-g++   -O2
>>> MPICH F77:      powerpc64-bgq-linux-gfortran   -O2
>>> MPICH FC:       powerpc64-bgq-linux-gfortran   -O2
>>> 
>>> Thanks!
>>> 
>>> Best,
>>> Aiman
> 
> 
> 
> -- 
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/
