[mpich-discuss] Poor performance of Waitany / Waitsome

Jeff Hammond jeff.science at gmail.com
Wed Jan 15 13:52:47 CST 2014


If Waitall weren't faster than Waitsome or Waitany, then it wouldn't
exist, since obviously one can implement the former in terms of the
latter.  The point of Waitall is to be an optimization over N calls to
Wait.  Waitsome and Waitany are just weaker optimizations that could
be cast as Testall followed by inspection of the output arrays, so it
shouldn't be surprising that they aren't as efficient.
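
As a minimal sketch of that equivalence (not how MPICH actually
implements anything), Waitall has exactly the semantics of a loop
over Wait, so the only reason for the single call to exist is that
it can be made faster internally:

    #include <mpi.h>

    /* Sketch only: MPI_Waitall is semantically just N calls to MPI_Wait. */
    void waitall_via_wait(int n, MPI_Request reqs[], MPI_Status stats[])
    {
        for (int i = 0; i < n; i++)
            MPI_Wait(&reqs[i], &stats[i]);   /* block on each request in turn */
    }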

The appropriate question to ask is whether Waitany is implemented
optimally or not.  Comparing Waitall to its emulation using Waitany
_does not_ answer this question.  If you find that emulating Waitany
using Testall followed by a loop beats MPICH's native Waitany, then
that's useful information.
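
For example (a sketch only; it uses Testany rather than Testall as
the polling primitive, since Testany is the variant that hands back a
single completed index), timing something like this inside the same
benchmark would tell you whether the native Waitany is doing anything
smarter than a polling loop:

    #include <mpi.h>

    /* Sketch: emulate MPI_Waitany by polling until one request completes,
     * then compare its timing against the native MPI_Waitany call. */
    void waitany_via_polling(int n, MPI_Request reqs[],
                             int *index, MPI_Status *status)
    {
        int flag = 0;
        while (!flag)
            MPI_Testany(n, reqs, index, &flag, status);
    }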

Jeff

On Wed, Jan 15, 2014 at 12:17 PM, John Grime <jgrime at uchicago.edu> wrote:
> Hi all,
>
> I noticed unexpectedly poor performance of the MPI_Waitany() routine (Mac
> OSX 10.9.1, MPICH v3.0.4 via Macports).
>
> I noticed that “wbland” had added relevant information to the “trac” system:
>
> http://trac.mpich.org/projects/mpich/ticket/1988
>
> … so I downloaded his example code, modified it a little and wrote a wrapper
> script to examine the different routines (attached to this email, apologies
> in advance for any dumb contents).
>
> Results:
>
> ./time_routines.sh 4 50
>
> nprocs = 4, ntokens = 16, ncycles = 50
> Method          : Time         Relative
>     MPI_Waitall : 1.358000e-03    1.000x
>     MPI_Waitany : 1.491000e-03    1.098x
>    MPI_Waitsome : 3.243000e-03    2.388x
>    PMPI_Waitall : 9.860000e-04    0.726x
>    PMPI_Waitany : 1.421000e-03    1.046x
>   PMPI_Waitsome : 4.432000e-03    3.264x
>
>
> nprocs = 4, ntokens = 64, ncycles = 50
> Method          : Time         Relative
>     MPI_Waitall : 2.075000e-03    1.000x
>     MPI_Waitany : 5.746000e-03    2.769x
>    MPI_Waitsome : 1.314400e-02    6.334x
>    PMPI_Waitall : 3.142000e-03    1.514x
>    PMPI_Waitany : 5.450000e-03    2.627x
>   PMPI_Waitsome : 1.891500e-02    9.116x
>
>
> nprocs = 4, ntokens = 128, ncycles = 50
> Method          : Time         Relative
>     MPI_Waitall : 5.159000e-03    1.000x
>     MPI_Waitany : 1.615100e-02    3.131x
>    MPI_Waitsome : 5.004100e-02    9.700x
>    PMPI_Waitall : 3.480000e-03    0.675x
>    PMPI_Waitany : 2.564000e-02    4.970x
>   PMPI_Waitsome : 3.799700e-02    7.365x
>
>
> nprocs = 4, ntokens = 512, ncycles = 50
> Method          : Time         Relative
>     MPI_Waitall : 1.949800e-02    1.000x
>     MPI_Waitany : 2.431020e-01   12.468x
>    MPI_Waitsome : 3.643640e-01   18.687x
>    PMPI_Waitall : 1.869800e-02    0.959x
>    PMPI_Waitany : 2.491870e-01   12.780x
>   PMPI_Waitsome : 3.500600e-01   17.954x
>
>
> nprocs = 4, ntokens = 1024, ncycles = 50
> Method          : Time         Relative
>     MPI_Waitall : 2.749100e-02    1.000x
>     MPI_Waitany : 1.223122e+00   44.492x
>    MPI_Waitsome : 1.554282e+00   56.538x
>    PMPI_Waitall : 3.329800e-02    1.211x
>    PMPI_Waitany : 1.232125e+00   44.819x
>   PMPI_Waitsome : 1.531198e+00   55.698x
>
> … and so it seems the performance delta between the different approaches
> (Waitall / Waitany / Waitsome) increases as a function of the number of
> outstanding requests (ntokens).
>
> This is a bit of a problem for me, as I make heavy use of Waitany() to
> overlap communication with calculations (the pattern sketched after this
> message). Is there any way to avoid this behavior?
>
> Cheers,
>
>
> J.
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
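
For reference, the overlap pattern described above is roughly the
following (a minimal sketch; process_chunk() stands in for whatever
per-message computation the application does and is not part of the
original benchmark):

    #include <mpi.h>

    extern void process_chunk(int idx);   /* hypothetical per-message compute */

    /* Sketch: handle each message as soon as its request completes,
     * rather than waiting for all of them before computing. */
    void overlap_with_waitany(int nreq, MPI_Request reqs[])
    {
        for (int done = 0; done < nreq; done++) {
            int idx;
            MPI_Waitany(nreq, reqs, &idx, MPI_STATUS_IGNORE);
            process_chunk(idx);
        }
    }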



-- 
Jeff Hammond
jeff.science at gmail.com
