[mpich-discuss] Poor performance of Waitany / Waitsome

John Grime jgrime at uchicago.edu
Wed Jan 15 14:23:40 CST 2014


Hi Jeff,

> If Waitall wasn't faster than Waitsome or Waitany, then it wouldn't
> exist since obviously one can implement the former in terms of the
> latter


I see no reason it wouldn’t exist in such a case, given that it’s an elegant/convenient way to wait for all requests to complete vs. Waitsome / Waitany. It makes sense to me that it would be in the API in any case, much as I appreciate the value of the RISC-y approach you imply.

> it shouldn't be surprising that they aren't as efficient.

I wouldn’t expect them to have identical performance - but nor would I have expected a difference of ~50x for the same number of outstanding requests, even given that a naive loop over the request array will be O(N). That loop should be pretty cheap after all, even if caching is poor due to the potential for background state changes in the request object data or whatever (I’m not sure how it’s actually implemented, which is why I’m asking about this issue on the mailing list).

> The appropriate question to ask is whether Waitany is implemented
> optimally or not.


Well, yes. I kinda hoped that question was heavily implied by my original email!


> If you find that emulating Waitany
> using Testall followed by a loop, then that's useful information.

I accidentally the whole thing, Jeff! ;)

But that’s a good idea, thanks - I’ll give it a try and report back!
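For the record, the emulation I have in mind is roughly this (an untested sketch - I’m reaching for MPI_Testany rather than Testall, and `my_waitany` is just my own name for the wrapper):

```c
#include <mpi.h>

/* Untested sketch: a busy-wait emulation of MPI_Waitany on top of
 * MPI_Testany. Comparing its timing against the real MPI_Waitany
 * should say something about how Waitany is implemented internally. */
int my_waitany(int count, MPI_Request reqs[], int *index, MPI_Status *status)
{
    int flag = 0;
    do {
        int err = MPI_Testany(count, reqs, index, &flag, status);
        if (err != MPI_SUCCESS) return err;
    } while (!flag);
    /* On success, *index holds the completed request's slot, or
     * MPI_UNDEFINED if there were no active requests. */
    return MPI_SUCCESS;
}
```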

J.

On Jan 15, 2014, at 1:52 PM, Jeff Hammond <jeff.science at gmail.com> wrote:

> If Waitall wasn't faster than Waitsome or Waitany, then it wouldn't
> exist, since obviously one can implement the former in terms of the
> latter.  The point of Waitall is to be an optimization for N calls to
> Wait.  Waitsome and Waitany are just weaker optimizations that could
> be cast as Testall followed by output array inspection and it
> shouldn't be surprising that they aren't as efficient.
> 
> The appropriate question to ask is whether Waitany is implemented
> optimally or not.  Comparing Waitall to its emulation using Waitany
> _does not_ answer this question.  If you find that emulating Waitany
> using Testall followed by a loop, then that's useful information.
> 
> Jeff
> 
> On Wed, Jan 15, 2014 at 12:17 PM, John Grime <jgrime at uchicago.edu> wrote:
>> Hi all,
>> 
>> I noticed unexpectedly poor performance of the MPI_Waitany() routine (Mac
>> OSX 10.9.1, MPICH v3.0.4 via Macports).
>> 
>> I noticed that “wbland” had added relevant information to the “trac” system:
>> 
>> http://trac.mpich.org/projects/mpich/ticket/1988
>> 
>> … so I downloaded his example code, modified it a little and wrote a wrapper
>> script to examine the different routines (attached to this email, apologies
>> in advance for any dumb contents).
>> 
>> Results:
>> 
>> ./time_routines.sh 4 50
>> 
>> nprocs = 4, ntokens = 16, ncycles = 50
>> Method          : Time         Relative
>>    MPI_Waitall : 1.358000e-03    1.000x
>>    MPI_Waitany : 1.491000e-03    1.098x
>>   MPI_Waitsome : 3.243000e-03    2.388x
>>   PMPI_Waitall : 9.860000e-04    0.726x
>>   PMPI_Waitany : 1.421000e-03    1.046x
>>  PMPI_Waitsome : 4.432000e-03    3.264x
>> 
>> 
>> nprocs = 4, ntokens = 64, ncycles = 50
>> Method          : Time         Relative
>>    MPI_Waitall : 2.075000e-03    1.000x
>>    MPI_Waitany : 5.746000e-03    2.769x
>>   MPI_Waitsome : 1.314400e-02    6.334x
>>   PMPI_Waitall : 3.142000e-03    1.514x
>>   PMPI_Waitany : 5.450000e-03    2.627x
>>  PMPI_Waitsome : 1.891500e-02    9.116x
>> 
>> 
>> nprocs = 4, ntokens = 128, ncycles = 50
>> Method          : Time         Relative
>>    MPI_Waitall : 5.159000e-03    1.000x
>>    MPI_Waitany : 1.615100e-02    3.131x
>>   MPI_Waitsome : 5.004100e-02    9.700x
>>   PMPI_Waitall : 3.480000e-03    0.675x
>>   PMPI_Waitany : 2.564000e-02    4.970x
>>  PMPI_Waitsome : 3.799700e-02    7.365x
>> 
>> 
>> nprocs = 4, ntokens = 512, ncycles = 50
>> Method          : Time         Relative
>>    MPI_Waitall : 1.949800e-02    1.000x
>>    MPI_Waitany : 2.431020e-01   12.468x
>>   MPI_Waitsome : 3.643640e-01   18.687x
>>   PMPI_Waitall : 1.869800e-02    0.959x
>>   PMPI_Waitany : 2.491870e-01   12.780x
>>  PMPI_Waitsome : 3.500600e-01   17.954x
>> 
>> 
>> nprocs = 4, ntokens = 1024, ncycles = 50
>> Method          : Time         Relative
>>    MPI_Waitall : 2.749100e-02    1.000x
>>    MPI_Waitany : 1.223122e+00   44.492x
>>   MPI_Waitsome : 1.554282e+00   56.538x
>>   PMPI_Waitall : 3.329800e-02    1.211x
>>   PMPI_Waitany : 1.232125e+00   44.819x
>>  PMPI_Waitsome : 1.531198e+00   55.698x
>> 
>> … and so it seems the performance delta between the different approaches
>> ( Waitall / Waitany / Waitsome ) increases with the number of outstanding
>> requests (ntokens).
>> 
>> This is a bit of a problem for me, as I make heavy use of Waitany() to
>> overlap communication with calculations. Is there any way to avoid this
>> behavior?
>> 
>> Cheers,
>> 
>> 
>> J.
>> 
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
> 
> 
> 
> -- 
> Jeff Hammond
> jeff.science at gmail.com
