<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
<div class="BodyFragment"><font size="2"><span style="font-size:10pt;">
<div class="PlainText">Cheers for the help, Jeff!<br>
<br>
I just tried to mimic Waitall() using a variety of the “MPI_Test…” routines (code attached), and the results are not what I would expect:<br>
<br>
Although Waitsome() consistently gives the worst performance of the Wait routines (in run time, Waitall < Waitany < Waitsome), Testsome() *appears* to always be faster than Testany(), and for larger numbers of requests (512 and up) the order among the Test routines actually seems to reverse: Testsome < Testany < Testall.</div>
</span></font></div>
<div class="BodyFragment"><font size="2"><span style="font-size:10pt;">
<div class="PlainText"><br>
<br>
Now, I may have done something spectacularly dumb here (it would be the 5th such example from today alone), but on the assumption I have not: is this result expected given the underlying implementation?<br>
<br>
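To make the question concrete, here is a toy model in plain C (no MPI calls) of how I imagine the emulation loops might behave. Everything in it is an assumption for illustration: the names toy_waitany, toy_testsome, toy_waitall_via_waitany and toy_waitall_via_testsome are made up, and the front-to-back scan over the request array is my guess, not something taken from the MPICH source. It only counts how many request slots get inspected.<br>
<br>
```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical request states for this toy model (not MPI types). */
enum { TOY_PENDING, TOY_COMPLETE, TOY_RETIRED };

/* Assumed Waitany-like call: scan from slot 0, retire the first completed
 * request found, and return its index (or -1 if none is complete).
 * Every inspected slot is added to *scans. */
static int toy_waitany(int *state, int n, long *scans) {
    for (int i = 0; i < n; ++i) {
        ++*scans;
        if (state[i] == TOY_COMPLETE) {
            state[i] = TOY_RETIRED;
            return i;
        }
    }
    return -1;
}

/* Assumed Testsome-like call: one full pass that retires every completed
 * request and returns how many were retired. */
static int toy_testsome(int *state, int n, long *scans) {
    int retired = 0;
    for (int i = 0; i < n; ++i) {
        ++*scans;
        if (state[i] == TOY_COMPLETE) {
            state[i] = TOY_RETIRED;
            ++retired;
        }
    }
    return retired;
}

/* Emulate Waitall with n Waitany-like calls, assuming all n requests have
 * already completed. Returns the total number of slots inspected. */
long toy_waitall_via_waitany(int n) {
    assert(n > 0);
    int *state = malloc((size_t)n * sizeof *state);
    assert(state != NULL);
    for (int i = 0; i < n; ++i) state[i] = TOY_COMPLETE;

    long scans = 0;
    for (int done = 0; done < n; ++done) {
        int idx = toy_waitany(state, n, &scans);
        (void)idx;  /* in this model idx == done: slots retire in order */
    }
    free(state);
    return scans;
}

/* Emulate Waitall with repeated Testsome-like calls, assuming `batch`
 * further requests complete (in index order) before each call. */
long toy_waitall_via_testsome(int n, int batch) {
    assert(n > 0 && batch > 0);
    int *state = malloc((size_t)n * sizeof *state);
    assert(state != NULL);
    for (int i = 0; i < n; ++i) state[i] = TOY_PENDING;

    long scans = 0;
    int next = 0;       /* next pending slot to mark complete */
    int remaining = n;  /* requests not yet retired */
    while (remaining > 0) {
        for (int k = 0; k < batch && next < n; ++k) state[next++] = TOY_COMPLETE;
        remaining -= toy_testsome(state, n, &scans);
    }
    free(state);
    return scans;
}
```
<br>
Under these assumptions a single pass over N requests costs N inspections, whereas retiring them one at a time through the Waitany-style loop costs N(N+1)/2: toy_waitall_via_waitany(2048) counts 2098176 inspections versus 2048 for one pass. Whether the real implementation behaves anything like this is exactly what I am asking.<br>
<br>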
J.<br>
<br>
./time_routines.sh 4 50<br>
<br>
nprocs = 4, ntokens = 16, ncycles = 50<br>
Method : Time Relative<br>
MPI_Waitall : 1.526000e-03 1.000x<br>
MPI_Waitany : 1.435000e-03 0.940x<br>
MPI_Waitsome : 3.381000e-03 2.216x<br>
MPI_Testall : 3.101000e-03 2.032x<br>
MPI_Testany : 8.080000e-03 5.295x<br>
MPI_Testsome : 3.037000e-03 1.990x<br>
PMPI_Waitall : 1.603000e-03 1.050x<br>
PMPI_Waitany : 1.404000e-03 0.920x<br>
PMPI_Waitsome : 4.666000e-03 3.058x<br>
<br>
<br>
nprocs = 4, ntokens = 64, ncycles = 50<br>
Method : Time Relative<br>
MPI_Waitall : 3.173000e-03 1.000x<br>
MPI_Waitany : 5.362000e-03 1.690x<br>
MPI_Waitsome : 1.809100e-02 5.702x<br>
MPI_Testall : 1.364200e-02 4.299x<br>
MPI_Testany : 2.309300e-02 7.278x<br>
MPI_Testsome : 1.469800e-02 4.632x<br>
PMPI_Waitall : 2.063000e-03 0.650x<br>
PMPI_Waitany : 9.420000e-03 2.969x<br>
PMPI_Waitsome : 1.890300e-02 5.957x<br>
<br>
<br>
nprocs = 4, ntokens = 128, ncycles = 50<br>
Method : Time Relative<br>
MPI_Waitall : 4.730000e-03 1.000x<br>
MPI_Waitany : 2.691000e-02 5.689x<br>
MPI_Waitsome : 4.519000e-02 9.554x<br>
MPI_Testall : 4.696900e-02 9.930x<br>
MPI_Testany : 7.285200e-02 15.402x<br>
MPI_Testsome : 3.773400e-02 7.978x<br>
PMPI_Waitall : 5.158000e-03 1.090x<br>
PMPI_Waitany : 2.223200e-02 4.700x<br>
PMPI_Waitsome : 4.205000e-02 8.890x<br>
<br>
<br>
nprocs = 4, ntokens = 512, ncycles = 50<br>
Method : Time Relative<br>
MPI_Waitall : 1.365900e-02 1.000x<br>
MPI_Waitany : 3.261610e-01 23.879x<br>
MPI_Waitsome : 3.944020e-01 28.875x<br>
MPI_Testall : 5.408010e-01 39.593x<br>
MPI_Testany : 4.865990e-01 35.625x<br>
MPI_Testsome : 3.067470e-01 22.458x<br>
PMPI_Waitall : 1.976100e-02 1.447x<br>
PMPI_Waitany : 3.011500e-01 22.048x<br>
PMPI_Waitsome : 3.791930e-01 27.761x<br>
<br>
<br>
nprocs = 4, ntokens = 1024, ncycles = 50<br>
Method : Time Relative<br>
MPI_Waitall : 4.087800e-02 1.000x<br>
MPI_Waitany : 1.245209e+00 30.462x<br>
MPI_Waitsome : 1.704020e+00 41.686x<br>
MPI_Testall : 1.940940e+00 47.481x<br>
MPI_Testany : 1.618215e+00 39.586x<br>
MPI_Testsome : 1.133568e+00 27.731x<br>
PMPI_Waitall : 3.970200e-02 0.971x<br>
PMPI_Waitany : 1.344188e+00 32.883x<br>
PMPI_Waitsome : 1.685816e+00 41.240x<br>
<br>
<br>
nprocs = 4, ntokens = 2048, ncycles = 50<br>
Method : Time Relative<br>
MPI_Waitall : 1.173840e-01 1.000x<br>
MPI_Waitany : 4.600552e+00 39.192x<br>
MPI_Waitsome : 6.840568e+00 58.275x<br>
MPI_Testall : 6.762144e+00 57.607x<br>
MPI_Testany : 5.170525e+00 44.048x<br>
MPI_Testsome : 4.260335e+00 36.294x<br>
PMPI_Waitall : 1.291590e-01 1.100x<br>
PMPI_Waitany : 5.161881e+00 43.974x<br>
PMPI_Waitsome : 7.388439e+00 62.942x<br>
<br>
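As a rough check on how these timings scale, the small C sketch below computes the growth factor t(2N)/t(N) between the ntokens = 1024 and ntokens = 2048 runs above. The struct and function names (timing, growth_factor, print_growth_factors) are made up for illustration, and the numbers are single measurements copied from the output above, so treat the result as indicative only. A factor near 2 would suggest cost linear in the number of requests; a factor near 4 would suggest quadratic cost.<br>
<br>
```c
#include <assert.h>
#include <stdio.h>

/* Timings copied from the ntokens = 1024 and ntokens = 2048 runs. */
struct timing {
    const char *method;
    double t_1024;
    double t_2048;
};

static const struct timing timings[] = {
    {"MPI_Waitall",  4.087800e-02, 1.173840e-01},
    {"MPI_Waitany",  1.245209e+00, 4.600552e+00},
    {"MPI_Waitsome", 1.704020e+00, 6.840568e+00},
    {"MPI_Testall",  1.940940e+00, 6.762144e+00},
    {"MPI_Testany",  1.618215e+00, 5.170525e+00},
    {"MPI_Testsome", 1.133568e+00, 4.260335e+00},
};

/* Ratio of the time at 2N requests to the time at N requests. */
double growth_factor(double t_n, double t_2n) {
    assert(t_n > 0.0);
    return t_2n / t_n;
}

/* Print one growth factor per method. */
void print_growth_factors(void) {
    size_t count = sizeof timings / sizeof timings[0];
    for (size_t i = 0; i < count; ++i) {
        printf("%-13s: %.2fx\n", timings[i].method,
               growth_factor(timings[i].t_1024, timings[i].t_2048));
    }
}
```
<br>
For these runs, Waitany grows by roughly 3.7x when ntokens doubles, against roughly 2.9x for Waitall, which at least looks closer to quadratic than to linear behaviour for the Waitany loop.<br>
<br>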
<br>
On Jan 15, 2014, at 2:53 PM, Jeff Hammond <jeff.science@gmail.com> wrote:<br>
<br>
> On Wed, Jan 15, 2014 at 2:23 PM, John Grime <jgrime@uchicago.edu> wrote:<br>
>> Hi Jeff,<br>
>> <br>
>>> If Waitall wasn't faster than Waitsome or Waitany, then it wouldn't<br>
>>> exist since obviously one can implement the former in terms of the<br>
>>> latter<br>
>> <br>
>> <br>
>> I see no reason it wouldn’t exist in such a case, given that it’s an elegant/convenient way to wait for all requests to complete vs. Waitsome / Waitany. It makes sense to me that it would be in the API in any case, much as I appreciate the value of the RISC-y approach you imply.<br>
>> <br>
>>> it shouldn't be surprising that they aren't as efficient.<br>
>> <br>
>> I wouldn’t expect them to have identical performance - but nor would I have expected a performance difference of ~50x for the same number of outstanding requests, even given that a naive loop over the request array will be O(N). That loop should be pretty cheap after all, even given that you can’t use cache well due to the potential for background state changes in the request object data or whatever (I’m not sure how it’s actually implemented, which is why I’m asking about this issue on the mailing list).<br>
>> <br>
>>> The appropriate question to ask is whether Waitany is implemented<br>
>>> optimally or not.<br>
>> <br>
>> <br>
>> Well, yes. I kinda hoped that question was heavily implied by my original email!<br>
>> <br>
>> <br>
>>> If you find that emulating Waitany<br>
>>> using Testall followed by a loop, then that's useful information.<br>
>> <br>
>> I accidentally the whole thing, Jeff! ;)<br>
>> <br>
>> But that’s a good idea, thanks - I’ll give it a try and report back!<br>
> <br>
> Testall is the wrong semantic here. I thought it would test them all<br>
> individually but it doesn't. I implemented it anyways and it is the<br>
> worst of all. I attached your test with my modifications. Because I<br>
> am an evil bastard, I made a ton of whitespace changes in addition to<br>
> the nontrivial ones.<br>
> <br>
> Jeff<br>
> <br>
> -- <br>
> Jeff Hammond<br>
> jeff.science@gmail.com<br>
> <nb_ring.c><br>
<br>
</div>
</span></font></div>
</body>
</html>