[mpich-discuss] optimization of MPI_Alltoall(..)
Jan T. Balewski
balewski at MIT.EDU
Mon May 19 21:19:35 CDT 2014
Hi,
I want to use MPI to transpose a big NxM matrix.
Currently the size is N=20k, M=40k, type=ushort, total size = 16 GB (it will be
larger in the future).
To evaluate the code I report the processing speed in MB/sec (the total data
volume divided by the measured time); it is dominated by the cost of
MPI_Alltoall(...) between 32 MPI processes.
By changing the order in which the MPI jobs are assigned to the cores on the
blades I was able to increase the overall speed by ~15%, from 349 MB/sec to
398 MB/sec.
My question is: are there more tricks I can play to accelerate
MPI_Alltoall(...) further?
(For now, using the existing hardware.)
Thanks for any suggestions
Jan
P.S. Below are gory details:
The transposition is done in 2 stages by 32 MPI jobs running on 4 8-core Dell
1950 blades.
The 4 blades are connected via eth0 to the same card in the Cisco 1 Gbit switch.
This is the order of operations:
- the big matrix is divided into 32x32 non-square blocks
- each of the 32 blocks is transposed individually on the CPU in each MPI job
- the 32x32 blocks are exchanged between the MPI jobs and transposed using the
MPI_Alltoall(...) call (a rough sketch of this pattern is shown below)
- the time is measured using MPI_Wtime()
FYI, the whole code is visible at:
https://bitbucket.org/balewski/kruk-mpi/src/8d3c4b7deb566f2768f132135b12e58f28f252d9/kruk2.c?at=master
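For reference, here is a stripped-down sketch of that two-stage pattern. It is
NOT the actual kruk2.c: the toy sizes and the simple row-block layout are a
simplification, but the sequence (local block transposes, then one
MPI_Alltoall, timed with MPI_Wtime) is the same.

/* Sketch of the two-stage transpose: each of the P ranks owns N/P
 * consecutive rows of an N x M ushort matrix (assumes N, M divisible by P). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    const long N = 1024, M = 2048;        /* toy sizes, not the real 20k x 40k */
    const long nloc = N / P;              /* rows owned by this rank  */
    const long mloc = M / P;              /* width of one block       */

    unsigned short *a    = malloc(nloc * M * sizeof *a);    /* my rows         */
    unsigned short *sbuf = malloc(nloc * M * sizeof *sbuf); /* packed blocks   */
    unsigned short *rbuf = malloc(mloc * N * sizeof *rbuf); /* received blocks */

    for (long i = 0; i < nloc * M; i++) a[i] = (unsigned short)(rank + i);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    /* stage 1: transpose each nloc x mloc block locally and pack it
     * contiguously, so that block b is the chunk sent to rank b */
    for (int b = 0; b < P; b++)
        for (long i = 0; i < nloc; i++)
            for (long j = 0; j < mloc; j++)
                sbuf[(long)b * nloc * mloc + j * nloc + i] = a[i * M + b * mloc + j];

    /* stage 2: every rank sends one transposed block to every other rank */
    MPI_Alltoall(sbuf, (int)(nloc * mloc), MPI_UNSIGNED_SHORT,
                 rbuf, (int)(nloc * mloc), MPI_UNSIGNED_SHORT,
                 MPI_COMM_WORLD);
    /* rbuf now holds mloc rows (length N) of the transposed matrix,
     * stored block by block in sender order */

    double dt = MPI_Wtime() - t0;
    double mb = nloc * M * sizeof(unsigned short) / 1.e6; /* per-rank volume */
    if (rank == 0)
        printf("P=%d  t=%.3f s  ~%.1f MB/s per rank\n", P, dt, mb / dt);

    free(a); free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}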
***** mode 1: use 8 consecutive cores per blade, then fill the 2nd blade, etc.
$ mpiexec -f machinefile -n 32 ./kruk2
where machinefile is:
oswrk139.lns.mit.edu:8
oswrk140.lns.mit.edu:8
oswrk145.lns.mit.edu:8
oswrk150.lns.mit.edu:8
Summary: Ncpu=32 avrT/sec=40.83, totMB=16257.0 avrSpeed=398.2(MB/sec)
***** mode 2: use the 1st core from all blades, then the 2nd core from all, etc.:
$ mpiexec -f machinefile -n 32 ./kruk2
where machinefile is:
oswrk139.lns.mit.edu:1
oswrk140.lns.mit.edu:1
oswrk145.lns.mit.edu:1
oswrk150.lns.mit.edu:1
Summary: Ncpu=32 avrT/sec=46.61, totMB=16257.0 avrSpeed=348.8(MB/sec)
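To double-check which host each rank actually lands on in the two modes, a
tiny standalone check like this can print the rank-to-host mapping
(MPI_Get_processor_name is standard MPI; this snippet is not part of kruk2.c):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);
    printf("rank %2d runs on %s\n", rank, host); /* one line per rank */
    MPI_Finalize();
    return 0;
}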