[mpich-discuss] optimization of MPI_Alltoall(..)

William Gropp wgropp at illinois.edu
Tue May 20 07:52:31 CDT 2014

Yes, there are.  There are more things that you can do to take into account the interconnect topology, and there are things that can be done to take better advantage of the SMP nodes.  The algorithm in MPICH is rather generic and tries not to overstress the network, since contention and queue search times can have a significant impact on performance of alltoall.
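
As a rough illustration of a schedule that keeps contention low, here is a sketch of a pairwise exchange in which every rank talks to exactly one partner per step. This is an assumption made for illustration, not a statement of which algorithm MPICH selects for a given message size; the function names are hypothetical, and the XOR pairing only works when the number of ranks is a power of two (32 qualifies). The node mapping assumes block placement, i.e. ranks fill one node before moving to the next.

```c
#include <assert.h>

/* Partner of `rank` in step `step` (1 .. nranks-1) of an XOR-based
 * pairwise exchange. Requires nranks to be a power of two. */
int pairwise_partner(int rank, int step, int nranks)
{
    assert(nranks > 0 && (nranks & (nranks - 1)) == 0);
    assert(rank >= 0 && rank < nranks);
    assert(step > 0 && step < nranks);
    return rank ^ step;
}

/* Node index of `rank` under block placement (assumed: ranks fill one
 * node completely before the next node is used). */
int node_of_rank(int rank, int cores_per_node)
{
    return rank / cores_per_node;
}

/* Number of exchange steps in which `rank` talks to a different node. */
int offnode_steps(int rank, int nranks, int cores_per_node)
{
    int count = 0;
    for (int step = 1; step < nranks; step++) {
        int partner = pairwise_partner(rank, step, nranks);
        /* In an MPI code, one MPI_Sendrecv with `partner` would go here. */
        if (node_of_rank(partner, cores_per_node) !=
            node_of_rank(rank, cores_per_node))
            count++;
    }
    return count;
}
```

With 32 ranks on 4 nodes of 8 cores, offnode_steps(0, 32, 8) gives 24: of the 31 steps, 24 cross the network and 7 stay inside the node, where shared memory can be used instead.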

With the appropriate algorithm, you should be able to sustain near the peak interconnect bandwidth (assuming also that the memory system on the node is fast enough to keep up with the network).
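
To put a number on "near the peak" for the setup in the quoted message, here is a back-of-the-envelope bound. It is only a sketch and rests on several assumptions: ranks and data are spread evenly over the nodes, each node has a single full-duplex link, 1 Gbit/s is taken as 125 MB/s with no protocol overhead, and the time for the local transposes is ignored. The function names are hypothetical.

```c
#include <assert.h>

/* Fraction of a node's alltoall data that has to leave the node,
 * assuming ranks are spread evenly over `nodes` nodes. */
double offnode_fraction(int nodes)
{
    assert(nodes > 1);
    return (double)(nodes - 1) / (double)nodes;
}

/* Upper bound on the aggregate alltoall rate (total MB / time), given
 * one link of `link_mbs` MB/s per node.
 *   time  >= (total / nodes) * offnode_fraction / link_mbs
 *   total / time <= nodes * link_mbs / offnode_fraction            */
double alltoall_bound_mbs(int nodes, double link_mbs)
{
    return (double)nodes * link_mbs / offnode_fraction(nodes);
}
```

For 4 nodes and 125 MB/s per link, alltoall_bound_mbs(4, 125.0) comes to about 667 MB/s, so under these assumptions the 398 MB/sec reported below is roughly 60% of the idealized limit.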


William Gropp
Director, Parallel Computing Institute
Thomas M. Siebel Chair in Computer Science
University of Illinois Urbana-Champaign

On May 19, 2014, at 9:19 PM, Jan T. Balewski wrote:

> Hi,
> I want to use MPI to transpose a big NxM matrix.
> Currently the size is N=20k, M=40k, type=ushort, total size = 16 GB (it will be
> larger in the future).
> To evaluate the code I report the processing speed in MB/sec; it is dominated
> by the cost of MPI_Alltoall(...) between 32 MPI processes.
> By changing the order in which MPI jobs are assigned to the cores on the blades
> I was able to increase the overall speed by ~15%, from 349 MB/sec to 398
> MB/sec.
> My question: are there more tricks I can use to accelerate MPI_Alltoall(...)
> further?
> (For now using the existing hardware)
> Thanks for any suggestions
> Jan
> P.S. Below are gory details:
> The transposition is done in 2 stages by 32 MPI jobs running on four 8-core Dell
> 1950 blades.
> The 4 blades are connected via eth0 to the same card in the Cisco 1 GBit switch.
> This is the order of operations:
> - the big matrix is divided into 32x32 non-square blocks
> - each of the 32 blocks is transposed individually on the CPU in each MPI job
> - the 32x32 blocks are exchanged between MPI jobs and transposed using the
> MPI_Alltoall(...) command
> - the time is measured using MPI_Wtime()
> FYI, the whole code is visible at:
> https://bitbucket.org/balewski/kruk-mpi/src/8d3c4b7deb566f2768f132135b12e58f28f252d9/kruk2.c?at=master
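
The local stage listed above (split the rows held by one MPI job into blocks, transpose each block) can be sketched as a packing routine. This is not taken from kruk2.c: the function name, the row-major layout, and the assumption that the column count divides evenly by the number of blocks are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* `slab` holds rows x cols values in row-major order (the rows owned by
 * one MPI job). It is split column-wise into `nblocks` blocks; each
 * block is transposed and written contiguously into `sendbuf`, so that
 * block b starts at sendbuf + b * rows * (cols / nblocks). */
void pack_transposed_blocks(const unsigned short *slab,
                            size_t rows, size_t cols, size_t nblocks,
                            unsigned short *sendbuf)
{
    assert(nblocks > 0 && cols % nblocks == 0);
    size_t bc = cols / nblocks;          /* columns per block */
    for (size_t b = 0; b < nblocks; b++) {
        unsigned short *dst = sendbuf + b * rows * bc;
        for (size_t r = 0; r < rows; r++)
            for (size_t c = 0; c < bc; c++)
                /* element (r, c) of block b becomes element (c, r) */
                dst[c * rows + r] = slab[r * cols + b * bc + c];
    }
}
```

A buffer packed this way has one contiguous chunk per destination, which is the layout MPI_Alltoall expects (here with a count of rows * (cols / nblocks) and type MPI_UNSIGNED_SHORT). The received chunks would still have to be placed into the row layout of the transposed matrix.
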
> ***** mode 1:  uses 8 consecutive cores per blade, next fill 2nd blade, etc
> $ mpiexec -f machinefile -n 32 ./kruk2
> where machinefile is:
> oswrk139.lns.mit.edu:8
> oswrk140.lns.mit.edu:8
> oswrk145.lns.mit.edu:8
> oswrk150.lns.mit.edu:8
> Summary: Ncpu=32  avrT/sec=40.83, totMB=16257.0 avrSpeed=398.2(MB/sec)
> ***** mode 2:  uses 1st core from all blades, next use 2nd core from all, etc:
> $ mpiexec -f machinefile -n 32 ./kruk2
> where machinefile is:
> oswrk139.lns.mit.edu:1
> oswrk140.lns.mit.edu:1
> oswrk145.lns.mit.edu:1
> oswrk150.lns.mit.edu:1
> Summary: Ncpu=32  avrT/sec=46.61, totMB=16257.0 avrSpeed=348.8(MB/sec)
> (3)
