[mpich-discuss] optimization of MPI_Alltoall(..)
Kokron, Daniel S. (GSFC-606.2)[Computer Sciences Corporation]
daniel.s.kokron at nasa.gov
Tue May 20 09:40:59 CDT 2014
Jan,
Here is some output from your code on a cluster that uses FDR InfiniBand instead of 1 Gbit Ethernet. I also used the vendor MPI, which is SMP-aware.
2 nodes, each with 2 Sandy Bridge sockets (no process placement)
./a.out START: NchTtot=201600 NchTXtot=40320; oneCPU: NchTX=254016000 tot_size/B=508032000, kB=496125, MB=484, nCyc=1
Ncpu=32 , block: NchT_b=6300 NchX_b=1260 size/B=15876000, kB=15503, MB=15
#### Summary: Ncpu=32 avrT/sec=8.05 , maxT/sec=8.05, sumT/sec=257.68 totMB=16257.0 avrSpeed=2018.9(MB/sec)
4 nodes, each with 2 Sandy Bridge sockets (no process placement)
mpiexec -np 32 ./a.out
./a.out START: NchTtot=201600 NchTXtot=40320; oneCPU: NchTX=254016000 tot_size/B=508032000, kB=496125, MB=484, nCyc=1
Ncpu=32 , block: NchT_b=6300 NchX_b=1260 size/B=15876000, kB=15503, MB=15
#### Summary: Ncpu=32 avrT/sec=6.94 , maxT/sec=6.94, sumT/sec=221.97 totMB=16257.0 avrSpeed=2343.7(MB/sec)
Same 4 nodes, but with process placement (4 ranks/socket)
mpiexec -np 32 ./a.out
./a.out START: NchTtot=201600 NchTXtot=40320; oneCPU: NchTX=254016000 tot_size/B=508032000, kB=496125, MB=484, nCyc=1
Ncpu=32 , block: NchT_b=6300 NchX_b=1260 size/B=15876000, kB=15503, MB=15
#### Summary: Ncpu=32 avrT/sec=6.86 , maxT/sec=6.86, sumT/sec=219.50 totMB=16257.0 avrSpeed=2370.0(MB/sec)
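As an aside, a quick way to verify how many ranks actually land on each node (independent of the launcher flags or machinefile layout) is to split MPI_COMM_WORLD by shared-memory domain. A minimal sketch, assuming an MPI-3 library (e.g. MPICH 3.x or a vendor MPI that provides MPI_Comm_split_type):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int worldRank, nodeRank, nodeSize, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm nodeComm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);
    MPI_Get_processor_name(host, &len);

    /* group the ranks that share a node (MPI-3) */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodeComm);
    MPI_Comm_rank(nodeComm, &nodeRank);
    MPI_Comm_size(nodeComm, &nodeSize);

    printf("world rank %d on %s: local rank %d of %d on this node\n",
           worldRank, host, nodeRank, nodeSize);

    MPI_Comm_free(&nodeComm);
    MPI_Finalize();
    return 0;
}

With the 4-node runs above one would expect it to report 8 ranks per node (4 per socket when placement is enforced).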
________________________________________
From: discuss-bounces at mpich.org [discuss-bounces at mpich.org] on behalf of Jan T. Balewski [balewski at MIT.EDU]
Sent: Monday, May 19, 2014 10:19 PM
To: discuss at mpich.org
Subject: [mpich-discuss] optimization of MPI_Alltoall(..)
Hi,
I want to use MPI to transpose a big NxM matrix.
Currently the size is N=200k, M=40k, type=ushort, total size = 16 GB (it will be
larger in the future).
To evaluate the code I report the processing speed in MB/sec; it is dominated
by the cost of MPI_Alltoall(...) between the 32 MPI processes.
By changing the order in which MPI jobs are assigned to the cores on the blades
I was able to increase the overall speed by ~15%, from 349 MB/sec to 398 MB/sec.
My question is: are there more tricks I can play to accelerate MPI_Alltoall(...)
further? (For now, using the existing hardware.)
Thanks for any suggestions
Jan
P.S. Below are gory details:
The transposition is done in 2 stages by 32 MPI jobs running on four 8-core Dell
1950 blades.
The 4 blades are connected via eth0 to the same card in a Cisco 1 Gbit switch.
This is the order of operations:
- the big matrix is divided into 32x32 non-square blocks
- each MPI job transposes each of its 32 blocks individually on the CPU
- the 32x32 blocks are then exchanged between the MPI jobs with a single
MPI_Alltoall(...) call, which completes the transpose
- the time is measured using MPI_Wtime()
FYI, the whole code is visible at:
https://bitbucket.org/balewski/kruk-mpi/src/8d3c4b7deb566f2768f132135b12e58f28f252d9/kruk2.c?at=master
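In case it helps, the exchange step boils down to roughly the sketch below. This is a simplified illustration, not the actual kruk2.c: the buffer names are made up, the block geometry (6300 x 1260 ushort = 15876000 B per block) is taken from the output earlier in this thread, and the local fill/transpose step is omitted.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int ncpu, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ncpu);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one block per destination rank: 6300 x 1260 ushort = 15876000 B */
    const long blockElems = 6300L * 1260L;
    unsigned short *sendBuf = malloc(ncpu * blockElems * sizeof *sendBuf);
    unsigned short *recvBuf = malloc(ncpu * blockElems * sizeof *recvBuf);

    /* ... fill sendBuf with the locally transposed blocks ... */

    MPI_Barrier(MPI_COMM_WORLD);          /* start the clock together */
    double t0 = MPI_Wtime();

    /* block i of sendBuf goes to rank i; one block arrives from every rank */
    MPI_Alltoall(sendBuf, (int)blockElems, MPI_UNSIGNED_SHORT,
                 recvBuf, (int)blockElems, MPI_UNSIGNED_SHORT,
                 MPI_COMM_WORLD);

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("Alltoall of %ld MB per rank took %.2f s\n",
               ncpu * blockElems * 2L / 1000000L, t1 - t0);

    free(sendBuf);
    free(recvBuf);
    MPI_Finalize();
    return 0;
}

The summary numbers below are consistent with avrSpeed = totMB / avrT, e.g. 16257 MB / 40.83 s = 398.2 MB/sec.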
***** mode 1: use 8 consecutive cores per blade, then fill the 2nd blade, etc.
$ mpiexec -f machinefile -n 32 ./kruk2
where machinefile is:
oswrk139.lns.mit.edu:8
oswrk140.lns.mit.edu:8
oswrk145.lns.mit.edu:8
oswrk150.lns.mit.edu:8
Summary: Ncpu=32 avrT/sec=40.83, totMB=16257.0 avrSpeed=398.2(MB/sec)
***** mode 2: use the 1st core on every blade, then the 2nd core on every blade, etc.:
$ mpiexec -f machinefile -n 32 ./kruk2
where machinefile is:
oswrk139.lns.mit.edu:1
oswrk140.lns.mit.edu:1
oswrk145.lns.mit.edu:1
oswrk150.lns.mit.edu:1
Summary: Ncpu=32 avrT/sec=46.61, totMB=16257.0 avrSpeed=348.8(MB/sec)
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss