[mpich-discuss] optimization of MPI_Alltoall(..)

Mon May 19 21:19:35 CDT 2014

Hi,
I want to use MPI to transpose a big NxM matrix.
Currently the size is N=20k, M=40k, type=ushort, total size=16GB (it will be
larger in the future).

To evaluate the code I report the processing speed in MB/sec. It is dominated
by the cost of MPI_Alltoall(...) between 32 MPI processes.
By changing the order in which MPI jobs are assigned to the cores on the blades
I was able to increase the overall speed by ~15%, from 349 MB/sec to 398
MB/sec.

My question is: are there more tricks I can play to accelerate MPI_Alltoall(...)
further?
(For now, using the existing hardware.)
Thanks for any suggestions
Jan

P.S. Below are gory details:

The transposition is done in 2 stages on 32 MPI jobs running on four 8-core Dell
1950 blades.
The 4 blades are connected via eth0 to the same card in the Cisco 1 Gbit switch.
This is the order of operations:
- the big matrix is divided into 32x32 non-square blocks
- each of the 32 blocks is transposed individually on the CPU in each MPI job
- the 32x32 blocks are exchanged between MPI jobs and transposed using the
MPI_Alltoall(...) command
- the time is measured using MPI_Wtime()
FYI, the whole code is visible at:
https://bitbucket.org/balewski/kruk-mpi/src/8d3c4b7deb566f2768f132135b12e58f28f252d9/kruk2.c?at=master

***** mode 1: uses 8 consecutive cores on the 1st blade, then fills the 2nd blade, etc.
$ mpiexec -f machinefile -n 32 ./kruk2
where machinefile is:
oswrk139.lns.mit.edu:8
oswrk140.lns.mit.edu:8
oswrk145.lns.mit.edu:8
oswrk150.lns.mit.edu:8

Summary: Ncpu=32  avrT/sec=40.83, totMB=16257.0 avrSpeed=398.2(MB/sec)

***** mode 2: uses the 1st core from all blades, then the 2nd core from all blades, etc.:
$ mpiexec -f machinefile -n 32 ./kruk2
where machinefile is:
oswrk139.lns.mit.edu:1
oswrk140.lns.mit.edu:1
oswrk145.lns.mit.edu:1
oswrk150.lns.mit.edu:1

Summary: Ncpu=32  avrT/sec=46.61, totMB=16257.0 avrSpeed=348.8(MB/sec)