MPI_Barrier before MPI_Allreduce can improve performance
Tue Jun 18 13:52:11 CDT 2019
This is hardly the full story. It would be useful to know more about
what you are trying to accomplish.
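To make that concrete: the pattern under discussion is putting a barrier in
front of the timed region, so that skew left over from earlier computation
is not charged to the collective itself. A minimal sketch of such a
benchmark (hypothetical code, not the original poster's):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i;
        double x = 1.0, sum = 0.0, t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* ... computation phase would go here ... */

        /* Synchronize first so that load imbalance from the computation
           is not attributed to the allreduce. */
        MPI_Barrier(MPI_COMM_WORLD);

        t0 = MPI_Wtime();
        for (i = 0; i < 30; i++)
            MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("30 MPI_Allreduce calls took %f s\n", t1 - t0);

        MPI_Finalize();
        return 0;
    }

Note that the barrier mostly changes what the timer attributes to the
allreduce; whether it improves end-to-end time depends on how the collective
handles ranks that arrive early.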
On Wed, Feb 25, 2015 at 3:07 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
> Yes. Many collectives have optimizations for power-of-two process counts.
> In MPICH's source file allreduce.c, you can find the following comment.
> /* This is the default implementation of allreduce. The algorithm is:
> Algorithm: MPI_Allreduce
> For the heterogeneous case, we call MPI_Reduce followed by MPI_Bcast
> in order to meet the requirement that all processes must have the
> same result. For the homogeneous case, we use the following algorithms.
> For long messages and for builtin ops and if count >= pof2 (where
> pof2 is the nearest power-of-two less than or equal to the number
> of processes), we use Rabenseifner's algorithm (see
> http://www.hlrs.de/mpi/myreduce.html).
> This algorithm implements the allreduce in two steps: first a
> reduce-scatter, followed by an allgather. A recursive-halving
> algorithm (beginning with processes that are distance 1 apart) is
> used for the reduce-scatter, and a recursive doubling
> algorithm is used for the allgather. The non-power-of-two case is
> handled by dropping to the nearest lower power-of-two: the first
> few even-numbered processes send their data to their right neighbors
> (rank+1), and the reduce-scatter and allgather happen among the remaining
> power-of-two processes. At the end, the first few even-numbered
> processes get the result from their right neighbors.
> For the power-of-two case, the cost for the reduce-scatter is
> lgp.alpha + n.((p-1)/p).beta + n.((p-1)/p).gamma. The cost for the
> allgather is lgp.alpha + n.((p-1)/p).beta. Therefore, the
> total cost is:
> Cost = 2.lgp.alpha + 2.n.((p-1)/p).beta + n.((p-1)/p).gamma
> For the non-power-of-two case,
> Cost = (2.floor(lgp)+2).alpha + (2.((p-1)/p) + 2).n.beta + n.(1+(p-1)/p).gamma
> For short messages, for user-defined ops, and for count < pof2
> we use a recursive doubling algorithm (similar to the one in
> MPI_Allgather). We use this algorithm in the case of user-defined ops
> because in this case derived datatypes are allowed, and the user
> could pass basic datatypes on one process and derived on another as
> long as the type maps are the same. Breaking up derived datatypes
> to do the reduce-scatter is tricky.
> Cost = lgp.alpha + n.lgp.beta + n.lgp.gamma
> Possible improvements:
> End Algorithm: MPI_Allreduce
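As a concrete illustration of the short-message path in that comment, here
is a minimal recursive-doubling sketch for MPI_SUM on doubles, assuming the
number of ranks is an exact power of two (an illustration only, not MPICH's
actual implementation):

    #include <mpi.h>
    #include <stdlib.h>

    /* Each of the lg(p) rounds exchanges the full buffer with the partner
       at distance 'mask' and combines, so every rank ends up with the
       global sum; this matches Cost = lgp.alpha + n.lgp.beta + n.lgp.gamma
       from the comment above. */
    void allreduce_recursive_doubling(double *buf, int n, MPI_Comm comm)
    {
        int rank, p, mask, partner, i;
        double *tmp;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);     /* assumed to be a power of two */
        tmp = malloc(n * sizeof(double));

        for (mask = 1; mask < p; mask <<= 1) {
            partner = rank ^ mask;   /* pairing only works when p = 2^k */
            MPI_Sendrecv(buf, n, MPI_DOUBLE, partner, 0,
                         tmp, n, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            for (i = 0; i < n; i++)
                buf[i] += tmp[i];
        }
        free(tmp);
    }

With a non-power-of-two communicator, the fold-in step described above adds
extra exchanges around these rounds, which is where the extra 2.alpha and
the roughly doubled beta term in the non-power-of-two cost come from.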
> --Junchao Zhang
> On Wed, Feb 25, 2015 at 2:59 PM, Aiman Fang <aimanf at cs.uchicago.edu> wrote:
>> I came across a result in my experiments that makes me wonder whether
>> there is any optimization of collective calls, such as MPI_Allreduce, for
>> 2^n numbers of ranks?
>> We did some experiments on the Argonne Vesta system to measure the time
>> of MPI_Allreduce calls using 511, 512 and 513 processes (one process per
>> node). In each run, a synthetic benchmark first did some computation and
>> then called MPI_Allreduce 30 times, for a total of 100 such loops. We
>> measured the total time spent in communication.
>> We found that the 512-process run gives the best performance. The times
>> for 511, 512 and 513 processes are 0.1492, 0.1449 and 0.1547 seconds,
>> respectively. 512 processes outperform 511 by 3.7%, and 513 by 6.7%.
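Plugging these sizes into the cost model quoted above: for p = 512, lg p =
9, so the long-message algorithm pays 2*9 = 18 latency (alpha) terms; for
p = 513, floor(lg p) = 9, so the non-power-of-two path pays 2*9+2 = 20
latency terms and roughly twice the bandwidth (beta) term. A measurable
penalty for 513 ranks relative to 512 is exactly what the model predicts.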
>> The mpich version we used is as follows.
>> $ mpichversion
>> MPICH Version: 3.1.2
>> MPICH Release date: Mon Jul 21 16:00:21 CDT 2014
>> MPICH Device: pamid
>> MPICH configure: --prefix=/home/fujita/soft/mpich-3.1.2
>> --host=powerpc64-bgq-linux --with-device=pamid --with-file-system=gpfs:BGQ
>> MPICH CC: powerpc64-bgq-linux-gcc -O2
>> MPICH CXX: powerpc64-bgq-linux-g++ -O2
>> MPICH F77: powerpc64-bgq-linux-gfortran -O2
>> MPICH FC: powerpc64-bgq-linux-gfortran -O2
--
jeff.science at gmail.com