[mpich-discuss] Is there any optimization of collective calls (MPI_Allreduce) for 2^n ranks?
Junchao Zhang
jczhang at mcs.anl.gov
Wed Feb 25 17:07:32 CST 2015
Yes. Many collectives have optimizations for power-of-two numbers of processes. In
MPICH's source code (allreduce.c), you can find the following comment.
/* This is the default implementation of allreduce. The algorithm is:

   Algorithm: MPI_Allreduce

   For the heterogeneous case, we call MPI_Reduce followed by MPI_Bcast
   in order to meet the requirement that all processes must have the
   same result. For the homogeneous case, we use the following algorithms.

   For long messages and for builtin ops and if count >= pof2 (where
   pof2 is the nearest power-of-two less than or equal to the number
   of processes), we use Rabenseifner's algorithm (see
   http://www.hlrs.de/mpi/myreduce.html).
   This algorithm implements the allreduce in two steps: first a
   reduce-scatter, followed by an allgather. A recursive-halving
   algorithm (beginning with processes that are distance 1 apart) is
   used for the reduce-scatter, and a recursive doubling
   algorithm is used for the allgather. The non-power-of-two case is
   handled by dropping to the nearest lower power-of-two: the first
   few even-numbered processes send their data to their right neighbors
   (rank+1), and the reduce-scatter and allgather happen among the remaining
   power-of-two processes. At the end, the first few even-numbered
   processes get the result from their right neighbors.

   For the power-of-two case, the cost for the reduce-scatter is
   lgp.alpha + n.((p-1)/p).beta + n.((p-1)/p).gamma. The cost for the
   allgather is lgp.alpha + n.((p-1)/p).beta. Therefore, the
   total cost is:

   Cost = 2.lgp.alpha + 2.n.((p-1)/p).beta + n.((p-1)/p).gamma

   For the non-power-of-two case,

   Cost = (2.floor(lgp)+2).alpha + (2.((p-1)/p) + 2).n.beta + n.(1+(p-1)/p).gamma

   For short messages, for user-defined ops, and for count < pof2,
   we use a recursive doubling algorithm (similar to the one in
   MPI_Allgather). We use this algorithm in the case of user-defined ops
   because in this case derived datatypes are allowed, and the user
   could pass basic datatypes on one process and derived on another as
   long as the type maps are the same. Breaking up derived datatypes
   to do the reduce-scatter is tricky.

   Cost = lgp.alpha + n.lgp.beta + n.lgp.gamma

   Possible improvements:

   End Algorithm: MPI_Allreduce
*/
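
To see how this plays out for the 511/512/513 measurements quoted below, assume
the long-message (Rabenseifner) path above is the one actually taken and note
that (p-1)/p is essentially 1 at these sizes. The comment's own cost model then
gives roughly:

p = 512 (power of two, lgp = 9):  Cost ~ 18.alpha + 2.n.beta + n.gamma
p = 511 (floor(lgp) = 8):         Cost ~ 18.alpha + 4.n.beta + 2.n.gamma
p = 513 (floor(lgp) = 9):         Cost ~ 20.alpha + 4.n.beta + 2.n.gamma

The power-of-two run avoids the fold-in/fold-out steps entirely, which is
consistent with 512 processes being fastest and 513 slowest in the numbers below.

For reference, here is a minimal sketch (not MPICH's actual code) of the
recursive-doubling exchange pattern used for the short-message and
user-defined-op case, assuming MPI_SUM on doubles and a power-of-two
communicator size. The loop runs exactly lg(p) steps, which is why a
non-power-of-two process count needs extra send/receive steps before and
after it:

/* Sketch only: recursive-doubling allreduce for MPI_SUM on doubles.
 * Assumes the communicator size p is a power of two; error handling omitted. */
#include <mpi.h>
#include <stdlib.h>

static void allreduce_recursive_doubling(double *buf, int count, MPI_Comm comm)
{
    int rank, p, mask, partner, i;
    double *tmp;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    tmp = (double *) malloc(count * sizeof(double));

    /* lg(p) steps: exchange the current partial result with the rank whose
     * id differs in one bit, then combine the two contributions locally. */
    for (mask = 1; mask < p; mask <<= 1) {
        partner = rank ^ mask;
        MPI_Sendrecv(buf, count, MPI_DOUBLE, partner, 0,
                     tmp, count, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (i = 0; i < count; i++)
            buf[i] += tmp[i];
    }
    free(tmp);
}

After the loop every rank holds the full sum. With a non-power-of-two p,
MPICH's code first folds the extra ranks' data into a power-of-two subset and
ships the result back out at the end, which is where the extra alpha terms
above come from.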
--Junchao Zhang
On Wed, Feb 25, 2015 at 2:59 PM, Aiman Fang <aimanf at cs.uchicago.edu> wrote:
> Hi,
>
> I came across a problem in some experiments that made me wonder whether there
> is any optimization of collective calls, such as MPI_Allreduce, for a
> power-of-two (2^n) number of ranks.
>
> We did some experiments on the Argonne Vesta system to measure the time of
> MPI_Allreduce calls using 511, 512, and 513 processes (one process per
> node). In each run, the synthetic benchmark first did some computation and
> then called MPI_Allreduce 30 times, for a total of 100 loops. We measured the
> total time spent on communication.
>
> We found that the 512-process run gives the best performance. The times for
> 511, 512, and 513 processes are 0.1492, 0.1449, and 0.1547 seconds,
> respectively. The 512-process run outperforms the 511-process run by 3.7% and
> the 513-process run by 6.7%.
>
> The MPICH version we used is as follows.
>
> $ mpichversion
> MPICH Version: 3.1.2
> MPICH Release date: Mon Jul 21 16:00:21 CDT 2014
> MPICH Device: pamid
> MPICH configure: --prefix=/home/fujita/soft/mpich-3.1.2
> --host=powerpc64-bgq-linux --with-device=pamid --with-file-system=gpfs:BGQ
> --disable-wrapper-rpath
> MPICH CC: powerpc64-bgq-linux-gcc -O2
> MPICH CXX: powerpc64-bgq-linux-g++ -O2
> MPICH F77: powerpc64-bgq-linux-gfortran -O2
> MPICH FC: powerpc64-bgq-linux-gfortran -O2
>
> Thanks!
>
> Best,
> Aiman
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>
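
For anyone who wants to reproduce this kind of measurement, a minimal timing
loop might look like the sketch below. This is not the original benchmark; the
placement of the barrier, the loop counts, and the use of MPI_Wtime are
illustrative assumptions.

/* Sketch only: time 30 MPI_Allreduce calls per loop over 100 loops. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, loop, i;
    double in, out, t0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    in = (double) rank;

    for (loop = 0; loop < 100; loop++) {
        /* ... computation phase of the real benchmark would go here ... */
        MPI_Barrier(MPI_COMM_WORLD);   /* keep compute out of the timed region */
        t0 = MPI_Wtime();
        for (i = 0; i < 30; i++)
            MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        total += MPI_Wtime() - t0;
    }

    if (rank == 0)
        printf("total allreduce time: %f s\n", total);
    MPI_Finalize();
    return 0;
}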
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss