[mpich-discuss] Poor performance of Waitany / Waitsome
John Grime
jgrime at uchicago.edu
Wed Jan 15 15:39:33 CST 2014
Hi Jeff,
> Given that you're using shared memory on a bloated OS (anything
> driving a GUI Window Manager), software overhead is going to be
> significant.
Very true - I would not expect these times to be indicative of what MPICH can actually achieve, but nonetheless the general trends seem to be reproducible.
I’m hoping that if I can get a good handle on what’s happening, I can write better MPI code in the general case. The major head-scratcher for me is that the MPI_TestX routines seem to be slower than their MPI_WaitX counterparts in most situations, when I would imagine they’re doing something fairly similar behind the scenes.
The wildcard appears to be MPI_Testsome().
I must be doing something dumb here, so I’ll also caffeinate and consider!
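
For reference, the kind of emulation I tried looks roughly like the sketch below (waitall_via_testany is just an illustrative name, not something from the attached code): spin on MPI_Testany until every request in the array has completed.

  #include <mpi.h>

  /* Sketch: complete all 'count' requests by spinning on MPI_Testany.
   * MPI_Testany sets a completed request to MPI_REQUEST_NULL, so counting
   * completions until none remain behaves like MPI_Waitall. */
  static void waitall_via_testany(int count, MPI_Request reqs[])
  {
      int remaining = count;
      while (remaining > 0) {
          int index, flag;
          MPI_Testany(count, reqs, &index, &flag, MPI_STATUS_IGNORE);
          if (flag) {
              if (index == MPI_UNDEFINED)
                  break;        /* no active requests remain */
              remaining--;      /* one more request has completed */
          }
          /* note: every call rescans the whole request array */
      }
  }
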
J.
On Jan 15, 2014, at 3:26 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
> Given that you're using shared memory on a bloated OS (anything
> driving a GUI Window Manager), software overhead is going to be
> significant. You can only do so much about this. You might want to
> compile MPICH yourself using all the optimization flags.
>
> For example, I decided that "--enable-static
> --enable-fast=O3,nochkmsg,notiming,ndebug,nompit
> --disable-weak-symbols --enable-threads=single" were configure options
> that someone in search of speed might use. I have not done any
> systematic testing yet so some MPICH developer might tell me I'm a
> clueless buffoon for bothering to (de)activate some of these options.
>
> If you were to assume that I was going to rerun your test with
> different builds of MPICH on my Mac laptop as soon as I get some
> coffee, you would be correct. Hence, apathy on your part has no
> impact on the experiments regarding MPICH build variants and speed :-)
>
> Jeff
>
> On Wed, Jan 15, 2014 at 3:10 PM, John Grime <jgrime at uchicago.edu> wrote:
>> Cheers for the help, Jeff!
>>
>> I just tried to mimic Waitall() using a variety of the “MPI_Test…” routines
>> (code attached), and the results are not what I would expect:
>>
>> Although Waitsome() consistently gives the worst performance of the Wait
>> variants (Waitall < Waitany < Waitsome), Testsome() *appears* to always be
>> faster than Testany(), and for larger numbers of requests the performance
>> order seems to actually reverse.
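>>
>> For reference, the Waitsome flavour of the emulation is roughly the loop
>> below (a sketch only - the function name is made up and the details may
>> differ from the attached code):
>>
>>   #include <stdlib.h>
>>   #include <mpi.h>
>>
>>   /* Sketch: complete all 'count' requests via repeated MPI_Waitsome.
>>    * Each call reports however many requests completed since the last
>>    * call; we accumulate until nothing active remains. */
>>   static void waitall_via_waitsome(int count, MPI_Request reqs[])
>>   {
>>       int *indices = malloc(count * sizeof(int));
>>       int done = 0;
>>       while (done < count) {
>>           int outcount;
>>           MPI_Waitsome(count, reqs, &outcount, indices, MPI_STATUSES_IGNORE);
>>           if (outcount == MPI_UNDEFINED)
>>               break;              /* no active requests remain */
>>           done += outcount;
>>       }
>>       free(indices);
>>   }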
>>
>>
>> Now, I may have done something spectacularly dumb here (it would be the 5th
>> such example from today alone), but on the assumption I have not: is this
>> result expected given the underlying implementation?
>>
>> J.
>>
>>
>> ./time_routines.sh 4 50
>>
>> nprocs = 4, ntokens = 16, ncycles = 50
>> Method : Time Relative
>> MPI_Waitall : 1.526000e-03 1.000x
>> MPI_Waitany : 1.435000e-03 0.940x
>> MPI_Waitsome : 3.381000e-03 2.216x
>> MPI_Testall : 3.101000e-03 2.032x
>> MPI_Testany : 8.080000e-03 5.295x
>> MPI_Testsome : 3.037000e-03 1.990x
>> PMPI_Waitall : 1.603000e-03 1.050x
>> PMPI_Waitany : 1.404000e-03 0.920x
>> PMPI_Waitsome : 4.666000e-03 3.058x
>>
>>
>>
>> nprocs = 4, ntokens = 64, ncycles = 50
>> Method : Time Relative
>> MPI_Waitall : 3.173000e-03 1.000x
>> MPI_Waitany : 5.362000e-03 1.690x
>> MPI_Waitsome : 1.809100e-02 5.702x
>> MPI_Testall : 1.364200e-02 4.299x
>> MPI_Testany : 2.309300e-02 7.278x
>> MPI_Testsome : 1.469800e-02 4.632x
>> PMPI_Waitall : 2.063000e-03 0.650x
>> PMPI_Waitany : 9.420000e-03 2.969x
>> PMPI_Waitsome : 1.890300e-02 5.957x
>>
>>
>>
>> nprocs = 4, ntokens = 128, ncycles = 50
>> Method : Time Relative
>> MPI_Waitall : 4.730000e-03 1.000x
>> MPI_Waitany : 2.691000e-02 5.689x
>> MPI_Waitsome : 4.519000e-02 9.554x
>> MPI_Testall : 4.696900e-02 9.930x
>> MPI_Testany : 7.285200e-02 15.402x
>> MPI_Testsome : 3.773400e-02 7.978x
>> PMPI_Waitall : 5.158000e-03 1.090x
>> PMPI_Waitany : 2.223200e-02 4.700x
>> PMPI_Waitsome : 4.205000e-02 8.890x
>>
>>
>>
>> nprocs = 4, ntokens = 512, ncycles = 50
>> Method : Time Relative
>> MPI_Waitall : 1.365900e-02 1.000x
>> MPI_Waitany : 3.261610e-01 23.879x
>> MPI_Waitsome : 3.944020e-01 28.875x
>> MPI_Testall : 5.408010e-01 39.593x
>> MPI_Testany : 4.865990e-01 35.625x
>> MPI_Testsome : 3.067470e-01 22.458x
>> PMPI_Waitall : 1.976100e-02 1.447x
>> PMPI_Waitany : 3.011500e-01 22.048x
>> PMPI_Waitsome : 3.791930e-01 27.761x
>>
>>
>>
>> nprocs = 4, ntokens = 1024, ncycles = 50
>> Method : Time Relative
>> MPI_Waitall : 4.087800e-02 1.000x
>> MPI_Waitany : 1.245209e+00 30.462x
>> MPI_Waitsome : 1.704020e+00 41.686x
>> MPI_Testall : 1.940940e+00 47.481x
>> MPI_Testany : 1.618215e+00 39.586x
>> MPI_Testsome : 1.133568e+00 27.731x
>> PMPI_Waitall : 3.970200e-02 0.971x
>> PMPI_Waitany : 1.344188e+00 32.883x
>> PMPI_Waitsome : 1.685816e+00 41.240x
>>
>>
>> nprocs = 4, ntokens = 2048, ncycles = 50
>> Method : Time Relative
>> MPI_Waitall : 1.173840e-01 1.000x
>> MPI_Waitany : 4.600552e+00 39.192x
>> MPI_Waitsome : 6.840568e+00 58.275x
>> MPI_Testall : 6.762144e+00 57.607x
>> MPI_Testany : 5.170525e+00 44.048x
>> MPI_Testsome : 4.260335e+00 36.294x
>> PMPI_Waitall : 1.291590e-01 1.100x
>> PMPI_Waitany : 5.161881e+00 43.974x
>> PMPI_Waitsome : 7.388439e+00 62.942x
>>
>>
>>
>> On Jan 15, 2014, at 2:53 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
>>
>>> On Wed, Jan 15, 2014 at 2:23 PM, John Grime <jgrime at uchicago.edu> wrote:
>>>> Hi Jeff,
>>>>
>>>>> If Waitall wasn't faster than Waitsome or Waitany, then it wouldn't
>>>>> exist since obviously one can implement the former in terms of the
>>>>> latter
>>>>
>>>>
>>>> I see no reason it wouldn’t exist in such a case, given that it’s an
>>>> elegant/convenient way to wait for all requests to complete vs. Waitsome /
>>>> Waitany. It makes sense to me that it would be in the API in any case, much
>>>> as I appreciate the value of the RISC-y approach you imply.
>>>>
>>>>> it shouldn't be surprising that they aren't as efficient.
>>>>
>>>> I wouldn’t expect them to have identical performance - but nor would I
>>>> have expected a performance difference of ~50x for the same number of
>>>> outstanding requests, even given that a naive loop over the request array
>>>> will be O(N). That loop should be pretty cheap after all, even given that
>>>> you can’t use cache well due to the potential for background state changes
>>>> in the request object data or whatever (I’m not sure how it’s actually
>>>> implemented, which is why I’m asking about this issue on the mailing list).
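>>>>
>>>> To make that concrete, the Waitany-based emulation of Waitall is essentially
>>>> the loop below (a sketch with a made-up name, not anything from the attached
>>>> code): even if each individual call is only a linear scan, making N such
>>>> calls over N requests is O(N^2) work in total.
>>>>
>>>>   #include <mpi.h>
>>>>
>>>>   /* Sketch: MPI_Waitall expressed as repeated MPI_Waitany calls.
>>>>    * Each call scans the whole request array, and we make one call per
>>>>    * request, so the emulation does O(N^2) work before any per-call
>>>>    * overhead is even counted. */
>>>>   static void waitall_via_waitany(int n, MPI_Request reqs[])
>>>>   {
>>>>       for (int done = 0; done < n; done++) {
>>>>           int index;
>>>>           MPI_Waitany(n, reqs, &index, MPI_STATUS_IGNORE);
>>>>           if (index == MPI_UNDEFINED)
>>>>               break;  /* no active requests left */
>>>>       }
>>>>   }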
>>>>
>>>>> The appropriate question to ask is whether Waitany is implemented
>>>>> optimally or not.
>>>>
>>>>
>>>> Well, yes. I kinda hoped that question was heavily implied by my original
>>>> email!
>>>>
>>>>
>>>>> If you find that emulating Waitany using Testall followed by a loop is
>>>>> faster, then that's useful information.
>>>>
>>>> I accidentally the whole thing, Jeff! ;)
>>>>
>>>> But that’s a good idea, thanks - I’ll give it a try and report back!
>>>
>>> Testall is the wrong semantic here. I thought it would test them all
>>> individually but it doesn't. I implemented it anyways and it is the
>>> worst of all. I attached your test with my modifications. Because I
>>> am an evil bastard, I made a ton of whitespace changes in addition to
>>> the nontrivial ones.
>>>
>>> Jeff
>>>
>>> --
>>> Jeff Hammond
>>> jeff.science at gmail.com
>>> <nb_ring.c>
>>
>>
>
>
>
> --
> Jeff Hammond
> jeff.science at gmail.com