[mpich-discuss] optimization of MPI_Alltoall(..)

Jan T. Balewski balewski at MIT.EDU
Mon May 19 21:19:35 CDT 2014

I want to use MPI to transpose a big NxM matrix.
Currently N=20k, M=40k, type=ushort, and the total size is 16 GB (it will be
larger in the future).

To evaluate the code I report the processing speed in MB/sec; it is dominated
by the cost of MPI_Alltoall(...) between 32 MPI processes.
By changing the order in which MPI jobs are assigned to the cores on the blades
I was able to increase the overall speed by ~15%, from 349 MB/sec to 398 MB/sec.

My question is: are there more tricks I can play to accelerate MPI_Alltoall(...)
further?
(For now, using the existing hardware.)
Thanks for any suggestions.

P.S. Below are the gory details:

The transposition is done in 2 stages by 32 MPI jobs running on four 8-core Dell
1950 blades.
The 4 blades are connected via eth0 to the same card in a 1 Gbit Cisco switch.
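
A rough best case for this hardware can be estimated from the link speed alone.
The sketch below is only an estimate under stated assumptions: each blade has a
single full-duplex 1 Gbit/s link (125 MB/s per direction at line rate, protocol
overhead ignored), the data is spread evenly over the 4 blades, and 3/4 of each
blade's data has to leave the blade during the all-to-all. The function name is
hypothetical and is not part of the original code.

```python
def alltoall_network_bound(total_mb, n_nodes, link_mb_per_s):
    """Best-case all-to-all time and speed when each node has one full-duplex link.

    Assumes data is evenly spread over the nodes and every node sends an equal
    share to every node (including itself), so (n_nodes - 1) / n_nodes of each
    node's data has to cross its link.
    """
    per_node_mb = total_mb / n_nodes
    off_node_mb = per_node_mb * (n_nodes - 1) / n_nodes
    min_seconds = off_node_mb / link_mb_per_s
    return min_seconds, total_mb / min_seconds


# totMB is taken from the Summary lines; 125 MB/s is the 1 Gbit/s line rate.
seconds, speed = alltoall_network_bound(total_mb=16257.0, n_nodes=4,
                                        link_mb_per_s=125.0)
print(f"best case: {seconds:.1f} s, {speed:.0f} MB/sec")
```

Under these assumptions the bound is about 667 MB/sec, so the measured
398 MB/sec would be roughly 60% of what the links could carry at best.
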
This is the order of operations:
- the big matrix is divided into 32x32 non-square blocks
- each of the 32 blocks held by an MPI job is transposed individually on the CPU
- the 32x32 blocks are exchanged between the MPI jobs, and thereby transposed,
using MPI_Alltoall(...)
- the time is measured using MPI_Wtime()
FYI, the whole code is visible at:

***** mode 1: uses 8 consecutive cores on the 1st blade, then fills the 2nd blade, etc.:
$ mpiexec -f machinefile -n 32 ./kruk2
where machinefile is:

Summary: Ncpu=32  avrT/sec=40.83, totMB=16257.0 avrSpeed=398.2(MB/sec)

***** mode 2: uses the 1st core of every blade, then the 2nd core of every blade, etc.:
$ mpiexec -f machinefile -n 32 ./kruk2
where machinefile is:

Summary: Ncpu=32  avrT/sec=46.61, totMB=16257.0 avrSpeed=348.8(MB/sec)
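
The two Summary lines can be cross-checked, and the two placements compared,
with a few lines of Python. This is a sketch under assumptions: the
rank-to-blade mappings below are inferred from the mode descriptions (the
machinefile contents are not shown), and all function names are hypothetical.

```python
def blade_of_rank_block(rank, cores_per_blade=8):
    """Mode 1 (assumed mapping): ranks 0-7 on blade 0, 8-15 on blade 1, ..."""
    return rank // cores_per_blade


def blade_of_rank_cyclic(rank, n_blades=4):
    """Mode 2 (assumed mapping): ranks dealt to the blades round-robin."""
    return rank % n_blades


def same_blade_peers(blade_of_rank, n_ranks=32):
    """For each rank, count the other ranks that share its blade."""
    return [sum(1 for other in range(n_ranks)
                if other != rank and blade_of_rank(other) == blade_of_rank(rank))
            for rank in range(n_ranks)]


def speed_mb_per_s(total_mb, seconds):
    """Recompute avrSpeed from totMB and avrT as printed in the Summary lines."""
    return total_mb / seconds


mode1 = speed_mb_per_s(16257.0, 40.83)
mode2 = speed_mb_per_s(16257.0, 46.61)
print(f"mode 1: {mode1:.1f} MB/sec, mode 2: {mode2:.1f} MB/sec, "
      f"gain: {100 * (mode1 / mode2 - 1):.1f}%")
print(set(same_blade_peers(blade_of_rank_block)),
      set(same_blade_peers(blade_of_rank_cyclic)))
```

With these assumed mappings every rank has 7 on-blade peers out of 31 in both
modes, so the volume of off-blade traffic would be the same; the measured
difference would then have to come from something this sketch does not model.
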
