[mpich-discuss] Is there any optimization of collective calls (MPI_Allreduce) for 2^n ranks?
Aiman Fang
aimanf at cs.uchicago.edu
Thu Feb 26 10:10:21 CST 2015
Thank you all for your help.
We were running experiments that required comparing the performance of collective communication with power-of-two and non-power-of-two numbers of processes. After adding an MPI_Barrier before MPI_Allreduce, we did observe an improvement in the performance of the non-power-of-two runs.
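
For reference, here is a minimal sketch of the pattern we ended up timing (illustrative only, not our exact benchmark code): synchronize with MPI_Barrier so that all ranks enter the allreduce together, then time the MPI_Allreduce itself.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double in = (double)rank, out = 0.0;

    /* Synchronize first so the allreduce timing is not skewed by
       ranks arriving at different times. */
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("allreduce time: %g s\n", t1 - t0);

    MPI_Finalize();
    return 0;
}
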
Best,
Aiman
On Feb 25, 2015, at 8:58 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
> See papers by Sameer Kumar and coworkers on Google Scholar for
> details. You need to look at the pamid device code in MPICH to see
> the collectives that are actually used on BGQ, but that won't help
> much, since all the BGQ collective magic is usually inside PAMI.
> PAMI is open-source but not exactly easy to understand.
>
> Unlike previous generations of Blue Gene, BGQ collective optimizations
> are much more general and often appear for communicator sizes other
> than 2^n.
>
> From my experiments when I was at ALCF, you may find that calling
> MPI_Barrier before MPI_Allreduce can improve performance
> significantly.
>
> This is hardly the full story. It would be useful to know more about
> what you are trying to accomplish.
>
> Best,
>
> Jeff
>
> On Wed, Feb 25, 2015 at 3:07 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> Yes. Many collectives have optimizations for power-of-two processes. In
>> MPICH's source code allreduce.c, you can find the following comments.
>>
>> /* This is the default implementation of allreduce. The algorithm is:
>>
>> Algorithm: MPI_Allreduce
>>
>> For the heterogeneous case, we call MPI_Reduce followed by MPI_Bcast
>> in order to meet the requirement that all processes must have the
>> same result. For the homogeneous case, we use the following algorithms.
>>
>>
>> For long messages and for builtin ops and if count >= pof2 (where
>> pof2 is the nearest power-of-two less than or equal to the number
>> of processes), we use Rabenseifner's algorithm (see
>> http://www.hlrs.de/mpi/myreduce.html).
>> This algorithm implements the allreduce in two steps: first a
>> reduce-scatter, followed by an allgather. A recursive-halving
>> algorithm (beginning with processes that are distance 1 apart) is
>> used for the reduce-scatter, and a recursive doubling
>> algorithm is used for the allgather. The non-power-of-two case is
>> handled by dropping to the nearest lower power-of-two: the first
>> few even-numbered processes send their data to their right neighbors
>> (rank+1), and the reduce-scatter and allgather happen among the remaining
>> power-of-two processes. At the end, the first few even-numbered
>> processes get the result from their right neighbors.
>>
>> For the power-of-two case, the cost for the reduce-scatter is
>> lgp.alpha + n.((p-1)/p).beta + n.((p-1)/p).gamma. The cost for the
>> allgather is lgp.alpha + n.((p-1)/p).beta. Therefore, the
>> total cost is:
>> Cost = 2.lgp.alpha + 2.n.((p-1)/p).beta + n.((p-1)/p).gamma
>>
>> For the non-power-of-two case,
>> Cost = (2.floor(lgp)+2).alpha + (2.((p-1)/p) + 2).n.beta +
>> n.(1+(p-1)/p).gamma
>>
>>
>> For short messages, for user-defined ops, and for count < pof2
>> we use a recursive doubling algorithm (similar to the one in
>> MPI_Allgather). We use this algorithm in the case of user-defined ops
>> because in this case derived datatypes are allowed, and the user
>> could pass basic datatypes on one process and derived on another as
>> long as the type maps are the same. Breaking up derived datatypes
>> to do the reduce-scatter is tricky.
>>
>> Cost = lgp.alpha + n.lgp.beta + n.lgp.gamma
>>
>> Possible improvements:
>>
>> End Algorithm: MPI_Allreduce
>> */
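>>
>> As a back-of-the-envelope check against the formulas above (a sketch of the
>> cost model only, not MPICH code), the number of alpha (latency) terms works
>> out as follows:
>>
>> #include <math.h>
>> #include <stdio.h>
>>
>> /* alpha-term count from the cost model quoted above:
>>    power-of-two:      2.lgp
>>    non-power-of-two:  2.floor(lgp) + 2                     */
>> static int alpha_terms(int p)
>> {
>>     int lgp = (int)floor(log2((double)p));
>>     int is_pof2 = ((p & (p - 1)) == 0);
>>     return is_pof2 ? 2 * lgp : 2 * lgp + 2;
>> }
>>
>> int main(void)
>> {
>>     for (int p = 511; p <= 513; p++)
>>         printf("p = %d -> %d alpha terms\n", p, alpha_terms(p));
>>     return 0;
>> }
>>
>> This gives 18 terms for both 511 and 512 and 20 for 513; the non-power-of-two
>> cases additionally pay the extra n.beta and n.gamma traffic for folding the
>> leftover ranks in and out.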
>>
>> --Junchao Zhang
>>
>> On Wed, Feb 25, 2015 at 2:59 PM, Aiman Fang <aimanf at cs.uchicago.edu> wrote:
>>>
>>> Hi,
>>>
>>> I came across something in our experiments that makes me wonder: is there
>>> any optimization of collective calls, such as MPI_Allreduce, for 2^n ranks?
>>>
>>> We ran some experiments on the Argonne Vesta system to measure the time of
>>> MPI_Allreduce calls using 511, 512, and 513 processes (one process per
>>> node). In each run, the synthetic benchmark first did some computation and
>>> then called MPI_Allreduce 30 times, repeating this for a total of 100
>>> loops. We measured the total time spent on communication.
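>>>
>>> The timed loop was roughly of the following shape (a sketch only;
>>> do_computation() is just a stand-in for the synthetic work):
>>>
>>> #include <mpi.h>
>>> #include <stdio.h>
>>>
>>> /* Placeholder for the synthetic compute phase. */
>>> static void do_computation(void) { /* ... */ }
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     MPI_Init(&argc, &argv);
>>>     int rank;
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>     double local = 1.0, global = 0.0, comm_time = 0.0;
>>>
>>>     for (int loop = 0; loop < 100; loop++) {
>>>         do_computation();                     /* compute phase */
>>>         double t0 = MPI_Wtime();
>>>         for (int i = 0; i < 30; i++)          /* 30 allreduces per loop */
>>>             MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
>>>                           MPI_SUM, MPI_COMM_WORLD);
>>>         comm_time += MPI_Wtime() - t0;        /* accumulate comm. time */
>>>     }
>>>
>>>     if (rank == 0)
>>>         printf("total communication time: %g s\n", comm_time);
>>>
>>>     MPI_Finalize();
>>>     return 0;
>>> }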
>>>
>>> We found that the 512-process run gives the best performance. The times for
>>> 511, 512, and 513 processes are 0.1492, 0.1449, and 0.1547 seconds,
>>> respectively; 512 processes outperform 511 by 3.7% and 513 by 6.7%.
>>>
>>> The MPICH version we used is as follows.
>>>
>>> $ mpichversion
>>> MPICH Version: 3.1.2
>>> MPICH Release date: Mon Jul 21 16:00:21 CDT 2014
>>> MPICH Device: pamid
>>> MPICH configure: --prefix=/home/fujita/soft/mpich-3.1.2
>>> --host=powerpc64-bgq-linux --with-device=pamid --with-file-system=gpfs:BGQ
>>> --disable-wrapper-rpath
>>> MPICH CC: powerpc64-bgq-linux-gcc -O2
>>> MPICH CXX: powerpc64-bgq-linux-g++ -O2
>>> MPICH F77: powerpc64-bgq-linux-gfortran -O2
>>> MPICH FC: powerpc64-bgq-linux-gfortran -O2
>>>
>>> Thanks!
>>>
>>> Best,
>>> Aiman
>
>
>
> --
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss