[mpich-discuss] optimization of MPI_Alltoall(..)

Kokron, Daniel S. (GSFC-606.2)[Computer Sciences Corporation] daniel.s.kokron at nasa.gov
Tue May 20 09:40:59 CDT 2014


Jan,

Here is some output from your code on a cluster that uses FDR InfiniBand instead of 1 Gbit Ethernet. I also used the vendor MPI, which is SMP-aware.

2 nodes, each with 2 Sandy Bridge sockets (no process placement)
./a.out START: NchTtot=201600  NchTXtot=40320; oneCPU:  NchTX=254016000 tot_size/B=508032000, kB=496125, MB=484, nCyc=1
Ncpu=32 , block: NchT_b=6300  NchX_b=1260  size/B=15876000, kB=15503, MB=15
#### Summary: Ncpu=32  avrT/sec=8.05 , maxT/sec=8.05, sumT/sec=257.68 totMB=16257.0 avrSpeed=2018.9(MB/sec)

4 nodes, each with 2 Sandy Bridge sockets (no process placement)
mpiexec -np 32 ./a.out 
./a.out START: NchTtot=201600  NchTXtot=40320; oneCPU:  NchTX=254016000 tot_size/B=508032000, kB=496125, MB=484, nCyc=1
Ncpu=32 , block: NchT_b=6300  NchX_b=1260  size/B=15876000, kB=15503, MB=15
#### Summary: Ncpu=32  avrT/sec=6.94 , maxT/sec=6.94, sumT/sec=221.97 totMB=16257.0 avrSpeed=2343.7(MB/sec)

Same 4 nodes, but with process placement (4 ranks/socket)
mpiexec -np 32 ./a.out
./a.out START: NchTtot=201600  NchTXtot=40320; oneCPU:  NchTX=254016000 tot_size/B=508032000, kB=496125, MB=484, nCyc=1
Ncpu=32 , block: NchT_b=6300  NchX_b=1260  size/B=15876000, kB=15503, MB=15
#### Summary: Ncpu=32  avrT/sec=6.86 , maxT/sec=6.86, sumT/sec=219.50 totMB=16257.0 avrSpeed=2370.0(MB/sec)
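For reference, here is a minimal sketch of how the avrT/maxT/sumT/avrSpeed figures in the Summary lines could be produced: each rank times one MPI_Alltoall with MPI_Wtime() and the per-rank times are reduced on rank 0. The buffer sizes and names (perDest, sendBuf, recvBuf) are illustrative assumptions chosen to match the printout above, not taken from kruk2.c:

/* Hedged sketch: timing one MPI_Alltoall and reducing the per-rank times
 * to the avr/max/sum figures seen in the Summary lines above.
 * Names and sizes are illustrative assumptions, not copied from kruk2.c. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, ncpu;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ncpu);

    /* 6300*1260 unsigned shorts = 15876000 B sent to each peer, i.e. about
       508 MB of send buffer per rank with 32 ranks, as in the printout above;
       scale perDest down for a quick test on smaller machines */
    const int perDest = 6300 * 1260;
    size_t nElem = (size_t)perDest * ncpu;
    unsigned short *sendBuf = calloc(nElem, sizeof *sendBuf);
    unsigned short *recvBuf = calloc(nElem, sizeof *recvBuf);

    MPI_Barrier(MPI_COMM_WORLD);              /* start all ranks together */
    double t0 = MPI_Wtime();
    MPI_Alltoall(sendBuf, perDest, MPI_UNSIGNED_SHORT,
                 recvBuf, perDest, MPI_UNSIGNED_SHORT, MPI_COMM_WORLD);
    double dt = MPI_Wtime() - t0;

    double maxT = 0.0, sumT = 0.0;
    MPI_Reduce(&dt, &maxT, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&dt, &sumT, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        double totMB = (double)nElem * ncpu * sizeof(unsigned short) / 1e6;
        printf("Ncpu=%d avrT/sec=%.2f maxT/sec=%.2f sumT/sec=%.2f "
               "totMB=%.1f avrSpeed=%.1f(MB/sec)\n",
               ncpu, sumT / ncpu, maxT, sumT, totMB, totMB / (sumT / ncpu));
    }

    free(sendBuf);
    free(recvBuf);
    MPI_Finalize();
    return 0;
}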

________________________________________
From: discuss-bounces at mpich.org [discuss-bounces at mpich.org] on behalf of Jan T. Balewski [balewski at MIT.EDU]
Sent: Monday, May 19, 2014 10:19 PM
To: discuss at mpich.org
Subject: [mpich-discuss] optimization of MPI_Alltoall(..)

Hi,
I want to use MPI to transpose a big NxM matrix.
Currently the size is N=200k, M=40k, type=ushort, total size = 16 GB (it will be
larger in the future).

To evaluate the code I report the processing speed in MB/sec; it is dominated
by the cost of MPI_Alltoall(...) between 32 MPI processes.
By changing the order in which the MPI processes are assigned to the cores on
the blades, I was able to increase the overall speed by ~15%, from 349 MB/sec
to 398 MB/sec.

My question is: are there more tricks I can play to accelerate MPI_Alltoall(...)
further?
(For now, using the existing hardware.)
Thanks for any suggestions,
Jan

P.S. Below are the gory details:

The transposition is done in 2 stages by 32 MPI processes running on 4 eight-core
Dell 1950 blades.
The 4 blades are connected via eth0 to the same card in a Cisco 1 Gbit switch.
This is the order of operations:
- the big matrix is divided into a 32x32 grid of non-square blocks
- each of the 32 blocks held by an MPI process is transposed individually on the CPU
- the 32x32 blocks are exchanged between the MPI processes using the
MPI_Alltoall(...) call (see the sketch below the link)
- the time is measured using MPI_Wtime()
FYI, the whole code is visible at:
https://bitbucket.org/balewski/kruk-mpi/src/8d3c4b7deb566f2768f132135b12e58f28f252d9/kruk2.c?at=master
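A minimal sketch of this two-stage pattern, assuming each rank owns a slab of nRowsLoc rows of the big matrix stored row-major; the function and variable names (transpose_distributed, slab, work, nRowsLoc, nColsLoc) are illustrative and not taken from kruk2.c:

/* Hedged sketch of the two-stage transpose described above.
 * Assumption: each rank owns nRowsLoc rows of the big matrix, stored
 * row-major with nColsLoc*ncpu columns.  Names are illustrative only. */
#include <mpi.h>
#include <stddef.h>

void transpose_distributed(unsigned short *slab,  /* in: my rows, out: my share of the transpose */
                           unsigned short *work,  /* scratch, same size as slab */
                           int nRowsLoc, int nColsLoc, int ncpu)
{
    const size_t blk = (size_t)nRowsLoc * nColsLoc;   /* elements per block */

    /* Stage 1: transpose each of the ncpu blocks locally.  Block b covers
       columns [b*nColsLoc, (b+1)*nColsLoc) of the slab and is packed into a
       contiguous, transposed (nColsLoc x nRowsLoc) region of work. */
    for (int b = 0; b < ncpu; b++)
        for (int r = 0; r < nRowsLoc; r++)
            for (int c = 0; c < nColsLoc; c++)
                work[b * blk + (size_t)c * nRowsLoc + r] =
                    slab[(size_t)r * nColsLoc * ncpu + (size_t)b * nColsLoc + c];

    /* Stage 2: block b goes to rank b; the block received from rank p lands
       in slot p of slab.  With nRowsLoc=6300 and nColsLoc=1260 each block is
       15876000 B, matching the block size reported above. */
    MPI_Alltoall(work, (int)blk, MPI_UNSIGNED_SHORT,
                 slab, (int)blk, MPI_UNSIGNED_SHORT, MPI_COMM_WORLD);

    /* After the exchange, slab holds this rank's columns of the original
       matrix as ncpu contiguous transposed blocks (one per source rank);
       one more local copy would be needed if a plain row-major layout of
       the result is required. */
}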


***** mode 1: use 8 consecutive cores per blade, then fill the 2nd blade, etc.
$ mpiexec -f machinefile -n 32 ./kruk2
where machinefile is:
oswrk139.lns.mit.edu:8
oswrk140.lns.mit.edu:8
oswrk145.lns.mit.edu:8
oswrk150.lns.mit.edu:8

Summary: Ncpu=32  avrT/sec=40.83, totMB=16257.0 avrSpeed=398.2(MB/sec)


***** mode 2: use the 1st core from all blades, then the 2nd core from all, etc.:
$ mpiexec -f machinefile -n 32 ./kruk2
where machinefile is:
oswrk139.lns.mit.edu:1
oswrk140.lns.mit.edu:1
oswrk145.lns.mit.edu:1
oswrk150.lns.mit.edu:1

Summary: Ncpu=32  avrT/sec=46.61, totMB=16257.0 avrSpeed=348.8(MB/sec)


