[mpich-devel] Suboptimal MPI_Allreduce() for intercommunicators

Rajeev Thakur thakur at mcs.anl.gov
Wed Apr 30 10:56:38 CDT 2014


It was probably done that way to avoid the need for memory allocation, but could you add a ticket in Trac so that we can look at it?

(For example, on one side of the intercommunicator sendbuf = 1 byte and recvbuf = 1 gigabyte; on the other side, sendbuf = 1 gigabyte and recvbuf = 1 byte.)

Rajeev


On Apr 30, 2014, at 4:44 AM, Lisandro Dalcin <dalcinl at gmail.com> wrote:

> The implementation of Allreduce for intercommunicators
> (MPIR_Allreduce_inter in src/mpi/coll/allreduce.c) uses more or less
> the following algorithm (this is the Python code I'm using to test the
> issue):
> 
> def allreduce_inter_mpich(obj, op, comm, tag, localcomm, low_group):
>     # reduce_inter() mimics MPI_Reduce on the intercommunicator 'comm';
>     # it is defined in the attached script
>     zero = 0  # rank of the root in the remote group when contributing
>     if comm.rank == 0:
>         root = MPI.ROOT       # the local leader receives the remote group's result
>     else:
>         root = MPI.PROC_NULL  # the other local ranks only contribute
>     if low_group:
>         # first contribute to the remote root, then act as the root group;
>         # the high group does the same two steps in the opposite order
>         ignore = reduce_inter(obj, op, zero, comm, tag, localcomm)
>         result = reduce_inter(obj, op, root, comm, tag, localcomm)
>     else:
>         result = reduce_inter(obj, op, root, comm, tag, localcomm)
>         ignore = reduce_inter(obj, op, zero, comm, tag, localcomm)
>     return localcomm.bcast(result, 0)
> 
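> (For context, a minimal sketch of how an intercommunicator reduce of this
> kind is typically carried out: the contributing group reduces within its
> local group and its leader forwards the result to the root in the other
> group. This is only an illustration in terms of mpi4py calls, not code
> from the attached script, and the names here are mine.)
> 
> from mpi4py import MPI
> 
> def reduce_inter_sketch(obj, op, root, intercomm, localcomm, tag=0):
>     # the root group's leader receives the reduction of the *remote*
>     # group's data; the root group's other ranks have nothing to do
>     if root == MPI.ROOT:
>         return intercomm.recv(None, source=0, tag=tag)
>     if root == MPI.PROC_NULL:
>         return None
>     # contributing group: reduce within the local group, then the local
>     # leader forwards the result to rank 'root' of the remote group
>     result = localcomm.reduce(obj, op=op, root=0)
>     if localcomm.rank == 0:
>         intercomm.send(result, dest=root, tag=tag)
>     return None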
> 
> However, while the broadcasts in the two local groups overlap, the calls
> to reduce_inter() are serialized. A much better implementation would be:
> 
> def allreduce_inter_dalcinl(obj, op, comm, tag, localcomm):
>     # reduce within the local group (reduce_binomial() is defined in the
>     # attached script); both groups proceed concurrently
>     result = reduce_binomial(obj, op, 0, localcomm, tag)
>     if comm.rank == 0:
>         # the two local leaders swap their groups' results over the intercomm
>         result = comm.sendrecv(result, 0, tag, None, 0, tag)
>     # broadcast the received result within the local group
>     return localcomm.bcast(result, 0)
> 
> i.e., perform (overlapping) reductions within the local groups, exchange
> the results between the local and remote rank 0, and then do (overlapping)
> broadcasts within the local groups.
> 
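> For reference, here is the same scheme written only in terms of mpi4py's
> standard pickle-based calls (a rough, self-contained sketch with tags left
> at their defaults; it needs nothing from the attached script):
> 
> def allreduce_inter_sketch(obj, op, intercomm, localcomm):
>     # 1. reduce within the local group; both groups do this concurrently
>     mine = localcomm.reduce(obj, op=op, root=0)
>     # 2. the two local leaders swap their partial results; on an
>     #    intercommunicator, dest/source address the remote group, so
>     #    dest=0/source=0 is the remote group's leader
>     theirs = None
>     if localcomm.rank == 0:
>         theirs = intercomm.sendrecv(mine, dest=0, source=0)
>     # 3. every process ends up with the reduction over the remote
>     #    group's data, as intercommunicator Allreduce semantics require
>     return localcomm.bcast(theirs, root=0)
> 
> # e.g. result = allreduce_inter_sketch(value, MPI.SUM, intercomm, localcomm)
> 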
> I'm attaching a test Python script (I do not expect you to run it :-),
> but perhaps you want to see the code). I'm defining a reduce operation
> that artificially sleeps for 1 second. Running this code on 8 cores on my
> desktop clearly shows the issue with the MPICH implementation:
> 
> $ mpiexec -n 8 python test-reduce.py
> [mpich]   time: min=4.003491e+00 max=4.003569e+00
> [dalcinl] time: min=2.002367e+00 max=2.002456e+00
> 
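> (For clarity, the artificially slow operation is essentially a binary
> combine that sleeps for one second per invocation; the sketch below is
> only illustrative, the attached script has the exact definition. That
> would account for the timings: with two groups of four ranks, a depth-2
> binomial reduction costs about two seconds, and the MPICH scheme pays
> that cost twice in sequence while the proposed scheme pays it once.)
> 
> import time
> 
> def slow_sum(a, b):
>     # each pairwise combination costs one second
>     time.sleep(1)
>     return a + b
> 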
> What do you think? Am I right? Or perhaps I'm missing something obvious?
> 
> 
> -- 
> Lisandro Dalcin
> ---------------
> CIMEC (UNL/CONICET)
> Predio CONICET-Santa Fe
> Colectora RN 168 Km 472, Paraje El Pozo
> 3000 Santa Fe, Argentina
> Tel: +54-342-4511594 (ext 1016)
> Tel/Fax: +54-342-4511169
> <test-allreduce.py>


