[mpich-devel] Suboptimal MPI_Allreduce() for intercommunicators
Rajeev Thakur
thakur at mcs.anl.gov
Wed Apr 30 10:56:38 CDT 2014
It was probably done that way to avoid the need for memory allocation, but could you add a ticket in Trac? We will look at it.
(For example, on one side of the intercommunicator sendbuf=1 byte, recvbuf=1 gigabyte; on the other side, sendbuf=1 gigabyte, recvbuf=1 byte.)
Rajeev
On Apr 30, 2014, at 4:44 AM, Lisandro Dalcin <dalcinl at gmail.com> wrote:
> The implementation of Allreduce for intercommunicators
> (MPIR_Allreduce_inter in src/mpi/coll/allreduce.c) uses more or less
> the following algorithm (this is the Python code I'm using to test this
> issue):
>
> def allreduce_inter_mpich(obj, op, comm, tag, localcomm, low_group):
>     zero = 0
>     if comm.rank == 0:
>         root = MPI.ROOT
>     else:
>         root = MPI.PROC_NULL
>     if low_group:
>         ignore = reduce_inter(obj, op, zero, comm, tag, localcomm)
>         result = reduce_inter(obj, op, root, comm, tag, localcomm)
>     else:
>         result = reduce_inter(obj, op, root, comm, tag, localcomm)
>         ignore = reduce_inter(obj, op, zero, comm, tag, localcomm)
>     return localcomm.bcast(result, 0)
>
>
> However, while the broadcasts at each group overlap, the calls to
> reduce_inter() introduce serialization. A much better implementation
> would be:
>
> def allreduce_inter_dalcinl(obj, op, comm, tag, localcomm):
>     result = reduce_binomial(obj, op, 0, localcomm, tag)
>     if comm.rank == 0:
>         result = comm.sendrecv(result, 0, tag, None, 0, tag)
>     return localcomm.bcast(result, 0)
>
> i.e., perform (overlapped) reductions in the local groups, exchange
> results between the local and remote rank 0, and perform (overlapped)
> broadcasts in the local groups.
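
The three phases described above can be modelled without MPI. What follows is only a sequential sketch for checking the data flow, not the attached test script. The function name allreduce_inter_model and the plain-list representation of the two groups are assumptions made for illustration. It relies on the intercommunicator semantics of Allreduce, where each group ends up with the reduction of the other group's data.

```python
from functools import reduce
import operator


def allreduce_inter_model(group_a, group_b, op=operator.add):
    """Sequential model of the three-phase intercommunicator allreduce.

    group_a and group_b hold one contribution per process (hypothetical
    stand-ins for the per-rank send buffers).
    """
    # Phase 1: each group reduces its own contributions to its local rank 0.
    # In a real run the two local reductions proceed at the same time.
    partial_a = reduce(op, group_a)
    partial_b = reduce(op, group_b)

    # Phase 2: the two rank-0 processes swap their partial results.
    recv_a, recv_b = partial_b, partial_a

    # Phase 3: each rank 0 broadcasts what it received within its own group.
    return [recv_a] * len(group_a), [recv_b] * len(group_b)


if __name__ == "__main__":
    print(allreduce_inter_model([1, 2, 3, 4], [10, 20, 30, 40]))
```

In this model every process in the first group receives the reduction of the second group's values, and vice versa.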
>
> I'm attaching a test Python script (I do not expect you to run it :-),
> but perhaps you want to see the code). I'm defining a reduce operation
> that artificially sleeps for 1 second. Running this code on 8 cores on my
> desktop clearly shows the issue with the MPICH implementation:
>
> $ mpiexec -n 8 python test-reduce.py
> [mpich] time: min=4.003491e+00 max=4.003569e+00
> [dalcinl] time: min=2.002367e+00 max=2.002456e+00
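
The roughly 4 s versus 2 s above is consistent with a simple critical-path estimate. The sketch below is a back-of-the-envelope model, not a measurement. It assumes the 8 processes are split into two groups of 4 (the split is not stated in the message), that each reduction is a binomial tree whose depth is the ceiling of log2 of the group size, and that each tree level costs one application of the 1-second operation. The function names are hypothetical.

```python
def binomial_steps(group_size):
    """Depth of a binomial reduction tree: ceil(log2(group_size))."""
    return (group_size - 1).bit_length()


def critical_path_seconds(group_size, op_seconds=1.0):
    """Estimated time spent in the reduce operation for both schemes.

    Returns (serialized, overlapped): two back-to-back inter-group
    reductions versus two local reductions running concurrently.
    """
    steps = binomial_steps(group_size)
    serialized = 2 * steps * op_seconds   # one reduction after the other
    overlapped = steps * op_seconds       # both reductions at the same time
    return serialized, overlapped


if __name__ == "__main__":
    print(critical_path_seconds(4))
```

Under these assumptions a group of 4 needs two tree levels, giving 4 s for the serialized scheme and 2 s for the overlapped one; communication and broadcast costs are ignored.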
>
> What do you think? Am I right? Or perhaps I'm missing something obvious?
>
>
> --
> Lisandro Dalcin
> ---------------
> CIMEC (UNL/CONICET)
> Predio CONICET-Santa Fe
> Colectora RN 168 Km 472, Paraje El Pozo
> 3000 Santa Fe, Argentina
> Tel: +54-342-4511594 (ext 1016)
> Tel/Fax: +54-342-4511169
> <test-allreduce.py>