[mpich-discuss] Poor performance of Waitany / Waitsome

John Grime jgrime at uchicago.edu
Wed Jan 15 15:10:42 CST 2014

Cheers for the help, Jeff!

I just tried to mimic Waitall() using a variety of the “MPI_Test…” routines (code attached), and the results are not what I would expect:

Although Waitsome() seems to give consistently the worst performance of the Wait routines (Waitall < Waitany < Waitsome), Testsome() *appears* to always be faster than Testany(), and for larger numbers of requests the performance order seems to actually reverse.

Now, I may have done something spectacularly dumb here (it would be the 5th such example from today alone), but on the assumption I have not: is this result expected given the underlying implementation?


./time_routines.sh 4 50

nprocs = 4, ntokens = 16, ncycles = 50
Method          : Time         Relative
    MPI_Waitall : 1.526000e-03    1.000x
    MPI_Waitany : 1.435000e-03    0.940x
   MPI_Waitsome : 3.381000e-03    2.216x
    MPI_Testall : 3.101000e-03    2.032x
    MPI_Testany : 8.080000e-03    5.295x
   MPI_Testsome : 3.037000e-03    1.990x
   PMPI_Waitall : 1.603000e-03    1.050x
   PMPI_Waitany : 1.404000e-03    0.920x
  PMPI_Waitsome : 4.666000e-03    3.058x

nprocs = 4, ntokens = 64, ncycles = 50
Method          : Time         Relative
    MPI_Waitall : 3.173000e-03    1.000x
    MPI_Waitany : 5.362000e-03    1.690x
   MPI_Waitsome : 1.809100e-02    5.702x
    MPI_Testall : 1.364200e-02    4.299x
    MPI_Testany : 2.309300e-02    7.278x
   MPI_Testsome : 1.469800e-02    4.632x
   PMPI_Waitall : 2.063000e-03    0.650x
   PMPI_Waitany : 9.420000e-03    2.969x
  PMPI_Waitsome : 1.890300e-02    5.957x

nprocs = 4, ntokens = 128, ncycles = 50
Method          : Time         Relative
    MPI_Waitall : 4.730000e-03    1.000x
    MPI_Waitany : 2.691000e-02    5.689x
   MPI_Waitsome : 4.519000e-02    9.554x
    MPI_Testall : 4.696900e-02    9.930x
    MPI_Testany : 7.285200e-02   15.402x
   MPI_Testsome : 3.773400e-02    7.978x
   PMPI_Waitall : 5.158000e-03    1.090x
   PMPI_Waitany : 2.223200e-02    4.700x
  PMPI_Waitsome : 4.205000e-02    8.890x

nprocs = 4, ntokens = 512, ncycles = 50
Method          : Time         Relative
    MPI_Waitall : 1.365900e-02    1.000x
    MPI_Waitany : 3.261610e-01   23.879x
   MPI_Waitsome : 3.944020e-01   28.875x
    MPI_Testall : 5.408010e-01   39.593x
    MPI_Testany : 4.865990e-01   35.625x
   MPI_Testsome : 3.067470e-01   22.458x
   PMPI_Waitall : 1.976100e-02    1.447x
   PMPI_Waitany : 3.011500e-01   22.048x
  PMPI_Waitsome : 3.791930e-01   27.761x

nprocs = 4, ntokens = 1024, ncycles = 50
Method          : Time         Relative
    MPI_Waitall : 4.087800e-02    1.000x
    MPI_Waitany : 1.245209e+00   30.462x
   MPI_Waitsome : 1.704020e+00   41.686x
    MPI_Testall : 1.940940e+00   47.481x
    MPI_Testany : 1.618215e+00   39.586x
   MPI_Testsome : 1.133568e+00   27.731x
   PMPI_Waitall : 3.970200e-02    0.971x
   PMPI_Waitany : 1.344188e+00   32.883x
  PMPI_Waitsome : 1.685816e+00   41.240x

nprocs = 4, ntokens = 2048, ncycles = 50
Method          : Time         Relative
    MPI_Waitall : 1.173840e-01    1.000x
    MPI_Waitany : 4.600552e+00   39.192x
   MPI_Waitsome : 6.840568e+00   58.275x
    MPI_Testall : 6.762144e+00   57.607x
    MPI_Testany : 5.170525e+00   44.048x
   MPI_Testsome : 4.260335e+00   36.294x
   PMPI_Waitall : 1.291590e-01    1.100x
   PMPI_Waitany : 5.161881e+00   43.974x
  PMPI_Waitsome : 7.388439e+00   62.942x

On Jan 15, 2014, at 2:53 PM, Jeff Hammond <jeff.science at gmail.com> wrote:

> On Wed, Jan 15, 2014 at 2:23 PM, John Grime <jgrime at uchicago.edu> wrote:
>> Hi Jeff,
>>> If Waitall wasn't faster than Waitsome or Waitany, then it wouldn't
>>> exist since obviously one can implement the former in terms of the
>>> latter
>> I see no reason it wouldn’t exist in such a case, given that it’s an elegant/convenient way to wait for all requests to complete vs. Waitsome / Waitany. It makes sense to me that it would be in the API in any case, much as I appreciate the value of the RISC-y approach you imply.
>>> it shouldn't be surprising that they aren't as efficient.
> >> I wouldn’t expect them to have identical performance - but nor would I have expected a performance difference of ~50x for the same number of outstanding requests, even given that a naive loop over the request array will be O(N). That loop should be pretty cheap after all, even given that you can’t use cache well due to the potential for background state changes in the request object data or whatever (I’m not sure how it’s actually implemented, which is why I’m asking about this issue on the mailing list).
>>> The appropriate question to ask is whether Waitany is implemented
>>> optimally or not.
>> Well, yes. I kinda hoped that question was heavily implied by my original email!
>>> If you find that emulating Waitany
>>> using Testall following by a loop, then that's useful information.
>> I accidentally the whole thing, Jeff! ;)
>> But that’s a good idea, thanks - I’ll give it a try and report back!
> Testall is the wrong semantic here.  I thought it would test them all
> individually but it doesn't.  I implemented it anyways and it is the
> worst of all.  I attached your test with my modifications.  Because I
> am an evil bastard, I made a ton of whitespace changes in addition to
> the nontrivial ones.
> Jeff
> --
> Jeff Hammond
> jeff.science at gmail.com
> <nb_ring.c>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: nb_ring.c
Type: application/octet-stream
Size: 7723 bytes
Desc: nb_ring.c
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140115/aad621e9/attachment-0001.obj>
