[mpich-discuss] Issue with MPI_IALLTOALL on Cray XE6
Rattermann, Dale
drattermann at drc.com
Thu Jun 12 13:47:01 CDT 2014
Hi,
I'm not sure if anyone can help me with this question, but I'll ask anyway. I'm working on optimizing a code that uses MPI on a Cray XE6. In this code there is a subroutine that calls MPI_ALLTOALL 16 different times. I changed that subroutine to use non-blocking MPI_IALLTOALLs instead, so that computation could be overlapped with communication, but found that it actually slowed the code down considerably. I wrote a little test program in Fortran (attached below) to time the two scenarios. I've tried both the cray-mpich/6.0.0 and cray-mpich/6.3.0 modules. Running this test, I've noticed two things:
1) MPI_IALLTOALL plus MPI_WAIT runs slower than MPI_ALLTOALL on the Cray, even when computation is overlapped with the communication
2) MPI_ALLTOALL runs very slowly the first time it is called (this can be seen by commenting out the warm-up MPI_ALLTOALL call on lines 26 and 27 of the attached code and recompiling and rerunning)
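To be concrete, the change amounts to the pattern below (a simplified sketch of what the attached test program times; do_work() is just a placeholder for whatever independent computation is available):
! Original: blocking exchange; computation starts only after it completes
call MPI_ALLTOALL(a,n,MPI_REAL8,atemp,n,MPI_REAL8,
> MPI_COMM_WORLD,ierr)
call do_work()
! Modified: start the exchange, do the independent work, then wait
call MPI_IALLTOALL(a,n,MPI_REAL8,atemp,n,MPI_REAL8,
> MPI_COMM_WORLD,rqst,ierr)
call do_work()
call MPI_WAIT(rqst,MPI_STATUS_IGNORE,ierr)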
I'm using an interactive batch session on Garnet with 2 compute nodes (64 cores total). I've compiled and run the code using the following:
ftn mpi_test.f
aprun -n 64 ./a.out 6400000
Here is an example of results I get while running with lines 26 and 27 uncommented:
nratter at batch04-wlm:~/temp/mpi_test> make clean; make; aprun -n 64 ./a.out 6400000
rm a.out
ftn mpi_test.f
0 : dt for MPI_ALLTOALL section = 0.9902300834655762
0 : dt for MPI_IALLTOALL section = 1.116391181945801
And here is an example of results I get while running with lines 26 and 27 commented out:
nratter at batch04-wlm:~/temp/mpi_test> make clean; make; aprun -n 64 ./a.out 6400000
rm a.out
ftn mpi_test.f
0 : dt for MPI_ALLTOALL section = 2.331636905670166
0 : dt for MPI_IALLTOALL section = 1.070204973220825
Does anyone know of a way to improve the efficiency of MPI_IALLTOALL/MPI_WAIT, and to prevent the first call to MPI_ALLTOALL from taking so long? In terms of priority, I am much more interested in improving the non-blocking communications. I've tried many different combinations of environment variables with no success. Thanks for your time… I'm hoping someone can help here.
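(For reference, the kind of settings I mean are the asynchronous-progress variables from the Cray MPICH intro_mpi man page, for example:
export MPICH_MAX_THREAD_SAFETY=multiple
export MPICH_NEMESIS_ASYNC_PROGRESS=1
aprun -n 64 ./a.out 6400000
possibly combined with aprun core specialization (-r) so the progress threads get cores of their own, which would mean running fewer than 32 ranks per node. I'm not claiming this exact combination is right; it's just the sort of thing I've been trying.)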
Thanks,
Nick Rattermann
-------------------------
program hello_world
implicit none
include 'mpif.h'
integer ierr, num_procs, my_id,rqst(2),cnt,i,N,maxnum
real*8,dimension(:),allocatable :: a,atemp,b,btemp,
> c,ctemp,d,dtemp,e,f
character(len=100) :: arg
double precision :: t1,t2
CALL GETARG(1,arg) !Grab the 1st command line argument
! and store it in the temporary variable
! 'arg'
read(arg,*) maxnum !Now convert string to integer
allocate(a(maxnum),atemp(maxnum),b(maxnum),btemp(maxnum),
> c(maxnum),ctemp(maxnum),d(maxnum),dtemp(maxnum),
> e(maxnum),f(maxnum))
call MPI_INIT ( ierr )
! find out MY process ID, and how many processes were started.
call MPI_COMM_RANK (MPI_COMM_WORLD, my_id, ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD, num_procs, ierr)
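!- Warm-up MPI_ALLTOALL (the call referred to above as lines 26 and 27);
! comment it out to see how slow the first MPI_ALLTOALL call is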
call MPI_ALLTOALL(c,maxnum/num_procs,MPI_REAL8,ctemp,
> MAXNUM/num_procs,MPI_REAL8,MPI_COMM_WORLD,ierr)
!- MPI_ALLTOALL SECTION
a = 1.
atemp = 0.
b = 2.
btemp = 0.
call MPI_BARRIER(MPI_COMM_WORLD,ierr)
t1 = MPI_WTIME()
call MPI_ALLTOALL(a,maxnum/num_procs,MPI_REAL8,atemp,
> MAXNUM/num_procs,MPI_REAL8,MPI_COMM_WORLD,ierr)
call MPI_ALLTOALL(b,maxnum/num_procs,MPI_REAL8,btemp,
> MAXNUM/num_procs,MPI_REAL8,MPI_COMM_WORLD,ierr)
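! Computation (starts only after both blocking alltoalls complete)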
do i=1,maxnum
e(i) = real(i)
end do
cnt = 0
do i=1,maxnum
if (atemp(i) .ne. 1.) cnt = cnt + 1
end do
if (cnt .ne. 0) print *,my_id,": ATEMP NOT EQUAL TO 1!!!"
cnt = 0
do i=1,maxnum
if (btemp(i) .ne. 2.) cnt = cnt + 1
end do
if (cnt .ne. 0) print *,my_id,": BTEMP NOT EQUAL TO 2!!!"
t2 = MPI_WTIME()
if (my_id == 0) print*,my_id,": dt for MPI_ALLTOALL section = ",
> t2-t1
!- MPI_IALLTOALL SECTION
c = 1.
ctemp = 0.
d = 2.
dtemp = 0.
call MPI_BARRIER(MPI_COMM_WORLD,ierr)
t1 = MPI_WTIME()
call MPI_IALLTOALL(c,maxnum/num_procs,MPI_REAL8,ctemp,
> MAXNUM/num_procs,MPI_REAL8,MPI_COMM_WORLD,rqst(1),ierr)
call MPI_IALLTOALL(d,maxnum/num_procs,MPI_REAL8,dtemp,
> MAXNUM/num_procs,MPI_REAL8,MPI_COMM_WORLD,rqst(2),ierr)
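! Independent computation intended to overlap with the two
! outstanding IALLTOALLs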
do i=1,maxnum
f(i) = real(i)
end do
call MPI_WAIT(rqst(1),MPI_STATUS_IGNORE,ierr)
cnt = 0
do i=1,maxnum
if (ctemp(i) .ne. 1.) cnt = cnt + 1
end do
if (cnt .ne. 0) print *,my_id,": CTEMP NOT EQUAL TO 1!!!"
call MPI_WAIT(rqst(2),MPI_STATUS_IGNORE,ierr)
cnt = 0
do i=1,maxnum
if (dtemp(i) .ne. 2.) cnt = cnt + 1
end do
if (cnt .ne. 0) print *,my_id,": DTEMP NOT EQUAL TO 2!!!"
t2 = MPI_WTIME()
if (my_id == 0) print*,my_id,": dt for MPI_IALLTOALL section = "
> ,t2-t1
call MPI_FINALIZE ( ierr )
deallocate(a,atemp,b,btemp,c,ctemp,d,dtemp,e,f)
end