[mpich-discuss] Is there any optimization of collective calls (MPI_Allreduce) for 2^n ranks?

Junchao Zhang jczhang at mcs.anl.gov
Wed Feb 25 17:07:32 CST 2015


Yes. Many collectives are optimized for a power-of-two number of processes. In
MPICH's source file allreduce.c, you can find the following comment.

/* This is the default implementation of allreduce. The algorithm is:

   Algorithm: MPI_Allreduce

   For the heterogeneous case, we call MPI_Reduce followed by MPI_Bcast
   in order to meet the requirement that all processes must have the
   same result. For the homogeneous case, we use the following algorithms.


   For long messages and for builtin ops and if count >= pof2 (where
   pof2 is the nearest power-of-two less than or equal to the number
   of processes), we use Rabenseifner's algorithm (see
   http://www.hlrs.de/mpi/myreduce.html).
   This algorithm implements the allreduce in two steps: first a
   reduce-scatter, followed by an allgather. A recursive-halving
   algorithm (beginning with processes that are distance 1 apart) is
   used for the reduce-scatter, and a recursive doubling
   algorithm is used for the allgather. The non-power-of-two case is
   handled by dropping to the nearest lower power-of-two: the first
   few even-numbered processes send their data to their right neighbors
   (rank+1), and the reduce-scatter and allgather happen among the remaining
   power-of-two processes. At the end, the first few even-numbered
   processes get the result from their right neighbors.

   For the power-of-two case, the cost for the reduce-scatter is
   lgp.alpha + n.((p-1)/p).beta + n.((p-1)/p).gamma. The cost for the
   allgather is lgp.alpha + n.((p-1)/p).beta. Therefore, the
   total cost is:
   Cost = 2.lgp.alpha + 2.n.((p-1)/p).beta + n.((p-1)/p).gamma

   For the non-power-of-two case,
   Cost = (2.floor(lgp)+2).alpha + (2.((p-1)/p) + 2).n.beta +
          n.(1+(p-1)/p).gamma


   For short messages, for user-defined ops, and for count < pof2
   we use a recursive doubling algorithm (similar to the one in
   MPI_Allgather). We use this algorithm in the case of user-defined ops
   because in this case derived datatypes are allowed, and the user
   could pass basic datatypes on one process and derived on another as
   long as the type maps are the same. Breaking up derived datatypes
   to do the reduce-scatter is tricky.

   Cost = lgp.alpha + n.lgp.beta + n.lgp.gamma

   Possible improvements:

   End Algorithm: MPI_Allreduce
*/
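
To make the cost model concrete for your runs (assuming the long-message
Rabenseifner path above is taken for your message sizes): with p = 512 = 2^9,
the latency term is 2.lgp = 18 alpha, and the bandwidth and computation terms
are roughly 2.n.beta + n.gamma. With p = 513, pof2 = 512, so the latency term
grows to (2*9 + 2) = 20 alpha and the other terms roughly double to
4.n.beta + 2.n.gamma; with p = 511, pof2 = 256, giving (2*8 + 2) = 18 alpha
but the same extra beta and gamma terms. The non-power-of-two runs pay for the
extra fold-in/fold-out steps, which is consistent with 512 being the fastest
in your measurements.

For reference, here is a simplified sketch (not the actual MPICH code) of the
recursive-doubling exchange used for short messages, assuming a power-of-two
number of processes, contiguous doubles, and MPI_SUM; the helper name and the
scratch buffer are made up for illustration:

    #include <mpi.h>

    /* Illustrative helper, not part of MPICH: allreduce 'count' doubles with
     * MPI_SUM over 'comm', assuming the communicator size is a power of two.
     * 'tmp' is a scratch buffer of at least 'count' doubles. */
    static void allreduce_recursive_doubling(double *buf, double *tmp,
                                             int count, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* At step k, exchange the current partial result with the partner
         * whose rank differs in bit k and fold it in; after lg(size) steps
         * every rank holds the full reduction. */
        for (int mask = 1; mask < size; mask <<= 1) {
            int partner = rank ^ mask;
            MPI_Sendrecv(buf, count, MPI_DOUBLE, partner, 0,
                         tmp, count, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            for (int i = 0; i < count; i++)
                buf[i] += tmp[i];
        }
    }

The non-power-of-two case needs an extra exchange before and after the main
loop to fold the leftover ranks in and out, which is roughly where the extra
alpha, beta, and gamma terms in the formulas above come from.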

--Junchao Zhang

On Wed, Feb 25, 2015 at 2:59 PM, Aiman Fang <aimanf at cs.uchicago.edu> wrote:

> Hi,
>
> I came across a problem in my experiments that made me wonder: is there any
> optimization of collective calls, such as MPI_Allreduce, for 2^n ranks?
>
> We did some experiments on the Argonne Vesta system to measure the time of
> MPI_Allreduce calls using 511, 512, and 513 processes (one process per
> node). In each run, the synthetic benchmark first did some computation and
> then called MPI_Allreduce 30 times, for a total of 100 loops. We measured
> the total time spent on communication.
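>
> For reference, here is a simplified sketch of the kind of measurement loop
> we used (the buffer size, count, datatype, and op below are placeholders,
> not the actual benchmark code):
>
>     double sendbuf[1024], recvbuf[1024];   /* placeholder buffers */
>     int count = 1024;                      /* placeholder message size */
>     double t_comm = 0.0;
>     for (int loop = 0; loop < 100; loop++) {
>         /* ... synthetic computation phase ... */
>         double t0 = MPI_Wtime();
>         for (int i = 0; i < 30; i++)
>             MPI_Allreduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM,
>                           MPI_COMM_WORLD);
>         t_comm += MPI_Wtime() - t0;        /* accumulate MPI_Allreduce time */
>     }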
>
> We found that the 512-process run gives the best performance. The times for
> 511, 512, and 513 processes are 0.1492, 0.1449, and 0.1547 seconds,
> respectively; the 512-process run outperforms the 511-process run by 3.7%
> and the 513-process run by 6.7%.
>
> The mpich version we used is as follows.
>
> $ mpichversion
> MPICH Version:          3.1.2
> MPICH Release date:     Mon Jul 21 16:00:21 CDT 2014
> MPICH Device:           pamid
> MPICH configure:        --prefix=/home/fujita/soft/mpich-3.1.2
> --host=powerpc64-bgq-linux --with-device=pamid --with-file-system=gpfs:BGQ
> --disable-wrapper-rpath
> MPICH CC:       powerpc64-bgq-linux-gcc    -O2
> MPICH CXX:      powerpc64-bgq-linux-g++   -O2
> MPICH F77:      powerpc64-bgq-linux-gfortran   -O2
> MPICH FC:       powerpc64-bgq-linux-gfortran   -O2
>
> Thanks!
>
> Best,
> Aiman
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>