[mpich-discuss] Poor performance of Waitany / Waitsome

Jeff Hammond jeff.science at gmail.com
Wed Jan 15 15:26:44 CST 2014


Given that you're using shared memory on a bloated OS (anything
driving a GUI Window Manager), software overhead is going to be
significant.  You can only do so much about this.  You might want to
compile MPICH yourself using all the optimization flags.

For example, I decided that "--enable-static
--enable-fast=O3,nochkmsg,notiming,ndebug,nompit
--disable-weak-symbols --enable-threads=single" were configure options
that someone in search of speed might use.  I have not done any
systematic testing yet so some MPICH developer might tell me I'm a
clueless buffoon for bothering to (de)activate some of these options.

If you were to assume that I was going to rerun your test with
different builds of MPICH on my Mac laptop as soon as I get some
coffee, you would be correct.  Hence, apathy on your part has no
impact on the experiments regarding MPICH build variants and speed :-)

Jeff

On Wed, Jan 15, 2014 at 3:10 PM, John Grime <jgrime at uchicago.edu> wrote:
> Cheers for the help, Jeff!
>
> I just tried to mimic Waitall() using a variety of the “MPI_Test…” routines
> (code attached), and the results are not what I would expect:
>
> Although Waitsome() consistently seems to give the worst performance
> (Waitall < Waitany < Waitsome), Testsome() *appears* to always be faster
> than Testany(), and for larger numbers of requests the performance order
> seems to actually reverse.
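>
> For reference, the Testany variant of the emulation boils down to a
> polling loop along these lines (a simplified sketch rather than the
> code I attached; the requests are assumed to have been posted already):
>
> #include <mpi.h>
>
> /* Drain an array of posted requests by polling MPI_Testany until
>    every one of them has completed. */
> static void drain_with_testany(int n, MPI_Request reqs[])
> {
>     int remaining = n;
>     while (remaining > 0) {
>         int idx, flag;
>         MPI_Testany(n, reqs, &idx, &flag, MPI_STATUS_IGNORE);
>         if (flag && idx != MPI_UNDEFINED)
>             remaining--;   /* one more request retired */
>     }
> }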
>
>
> Now, I may have done something spectacularly dumb here (it would be the 5th
> such example from today alone), but on the assumption I have not: is this
> result expected given the underlying implementation?
>
> J.
>
>
> ./time_routines.sh 4 50
>
> nprocs = 4, ntokens = 16, ncycles = 50
> Method          : Time         Relative
>     MPI_Waitall : 1.526000e-03    1.000x
>     MPI_Waitany : 1.435000e-03    0.940x
>    MPI_Waitsome : 3.381000e-03    2.216x
>     MPI_Testall : 3.101000e-03    2.032x
>     MPI_Testany : 8.080000e-03    5.295x
>    MPI_Testsome : 3.037000e-03    1.990x
>    PMPI_Waitall : 1.603000e-03    1.050x
>    PMPI_Waitany : 1.404000e-03    0.920x
>   PMPI_Waitsome : 4.666000e-03    3.058x
>
>
>
> nprocs = 4, ntokens = 64, ncycles = 50
> Method          : Time         Relative
>     MPI_Waitall : 3.173000e-03    1.000x
>     MPI_Waitany : 5.362000e-03    1.690x
>    MPI_Waitsome : 1.809100e-02    5.702x
>     MPI_Testall : 1.364200e-02    4.299x
>     MPI_Testany : 2.309300e-02    7.278x
>    MPI_Testsome : 1.469800e-02    4.632x
>    PMPI_Waitall : 2.063000e-03    0.650x
>    PMPI_Waitany : 9.420000e-03    2.969x
>   PMPI_Waitsome : 1.890300e-02    5.957x
>
>
>
> nprocs = 4, ntokens = 128, ncycles = 50
> Method          : Time         Relative
>     MPI_Waitall : 4.730000e-03    1.000x
>     MPI_Waitany : 2.691000e-02    5.689x
>    MPI_Waitsome : 4.519000e-02    9.554x
>     MPI_Testall : 4.696900e-02    9.930x
>     MPI_Testany : 7.285200e-02   15.402x
>    MPI_Testsome : 3.773400e-02    7.978x
>    PMPI_Waitall : 5.158000e-03    1.090x
>    PMPI_Waitany : 2.223200e-02    4.700x
>   PMPI_Waitsome : 4.205000e-02    8.890x
>
>
>
> nprocs = 4, ntokens = 512, ncycles = 50
> Method          : Time         Relative
>     MPI_Waitall : 1.365900e-02    1.000x
>     MPI_Waitany : 3.261610e-01   23.879x
>    MPI_Waitsome : 3.944020e-01   28.875x
>     MPI_Testall : 5.408010e-01   39.593x
>     MPI_Testany : 4.865990e-01   35.625x
>    MPI_Testsome : 3.067470e-01   22.458x
>    PMPI_Waitall : 1.976100e-02    1.447x
>    PMPI_Waitany : 3.011500e-01   22.048x
>   PMPI_Waitsome : 3.791930e-01   27.761x
>
>
>
> nprocs = 4, ntokens = 1024, ncycles = 50
> Method          : Time         Relative
>     MPI_Waitall : 4.087800e-02    1.000x
>     MPI_Waitany : 1.245209e+00   30.462x
>    MPI_Waitsome : 1.704020e+00   41.686x
>     MPI_Testall : 1.940940e+00   47.481x
>     MPI_Testany : 1.618215e+00   39.586x
>    MPI_Testsome : 1.133568e+00   27.731x
>    PMPI_Waitall : 3.970200e-02    0.971x
>    PMPI_Waitany : 1.344188e+00   32.883x
>   PMPI_Waitsome : 1.685816e+00   41.240x
>
>
> nprocs = 4, ntokens = 2048, ncycles = 50
> Method          : Time         Relative
>     MPI_Waitall : 1.173840e-01    1.000x
>     MPI_Waitany : 4.600552e+00   39.192x
>    MPI_Waitsome : 6.840568e+00   58.275x
>     MPI_Testall : 6.762144e+00   57.607x
>     MPI_Testany : 5.170525e+00   44.048x
>    MPI_Testsome : 4.260335e+00   36.294x
>    PMPI_Waitall : 1.291590e-01    1.100x
>    PMPI_Waitany : 5.161881e+00   43.974x
>   PMPI_Waitsome : 7.388439e+00   62.942x
>
>
>
> On Jan 15, 2014, at 2:53 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
>
>> On Wed, Jan 15, 2014 at 2:23 PM, John Grime <jgrime at uchicago.edu> wrote:
>>> Hi Jeff,
>>>
>>>> If Waitall wasn't faster than Waitsome or Waitany, then it wouldn't
>>>> exist since obviously one can implement the former in terms of the
>>>> latter
>>>
>>>
>>> I see no reason it wouldn’t exist in such a case, given that it’s an
>>> elegant/convenient way to wait for all requests to complete vs. Waitsome /
>>> Waitany. It makes sense to me that it would be in the API in any case, much
>>> as I appreciate the value of the RISC-y approach you imply.
>>>
>>>> it shouldn't be surprising that they aren't as efficient.
>>>
>>> I wouldn’t expect them to have identical performance - but nor would I
>>> have expected a performance difference of ~50x for the same number of
>>> outstanding requests, even given that a naive loop over the request array
>>> will be O(N). That loop should be pretty cheap after all, even given that
>>> you can’t use cache well due to the potential for background state changes
>>> in the request object data or whatever (I’m not sure how it’s actually
>>> implemented, which is why I’m asking about this issue on the mailing list).
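>>>
>>> To spell out where I'd expect the cost to come from: emulating one
>>> Waitall with a Waitany loop presumably ends up looking something like
>>> the sketch below, where every call rescans the whole request array, so
>>> retiring N requests is O(N^2) scans in total rather than O(N). That is
>>> my guess at the shape of it, though - I haven't checked the MPICH
>>> source.
>>>
>>> #include <mpi.h>
>>>
>>> /* Waitall built from Waitany: each call walks all n requests, so
>>>    completing every one costs O(n^2) scans overall. */
>>> static void waitall_via_waitany(int n, MPI_Request reqs[])
>>> {
>>>     for (int done = 0; done < n; done++) {
>>>         int idx;
>>>         /* blocks until some request completes; that request is set
>>>            to MPI_REQUEST_NULL by the library */
>>>         MPI_Waitany(n, reqs, &idx, MPI_STATUS_IGNORE);
>>>     }
>>> }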
>>>
>>>> The appropriate question to ask is whether Waitany is implemented
>>>> optimally or not.
>>>
>>>
>>> Well, yes. I kinda hoped that question was heavily implied by my original
>>> email!
>>>
>>>
>>>> If you find that emulating Waitany
>>>> using Testall followed by a loop, then that's useful information.
>>>
>>> I accidentally the whole thing, Jeff! ;)
>>>
>>> But that’s a good idea, thanks - I’ll give it a try and report back!
>>
>> Testall is the wrong semantic here.  I thought it would test them all
>> individually but it doesn't.  I implemented it anyways and it is the
>> worst of all.  I attached your test with my modifications.  Because I
>> am an evil bastard, I made a ton of whitespace changes in addition to
>> the nontrivial ones.
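>>
>> What I mean by the wrong semantic, roughly (a hypothetical helper, not
>> what is in the attached file): Testall only reports completion once
>> *every* request is done, whereas testing requests individually lets
>> you retire each one as soon as it finishes.
>>
>> #include <mpi.h>
>>
>> static void testall_vs_individual_test(int n, MPI_Request reqs[])
>> {
>>     /* MPI_Testall: flag is set only when ALL n requests have
>>        completed; it does not say which individual ones are done. */
>>     int all_done;
>>     MPI_Testall(n, reqs, &all_done, MPI_STATUSES_IGNORE);
>>
>>     /* Per-request MPI_Test: each completion can be acted on as soon
>>        as it happens, independent of the others. */
>>     for (int i = 0; i < n; i++) {
>>         if (reqs[i] != MPI_REQUEST_NULL) {
>>             int flag;
>>             MPI_Test(&reqs[i], &flag, MPI_STATUS_IGNORE);
>>             /* if flag != 0, reqs[i] completed and is now
>>                MPI_REQUEST_NULL */
>>         }
>>     }
>> }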
>>
>> Jeff
>>
>> --
>> Jeff Hammond
>> jeff.science at gmail.com
>> <nb_ring.c>



-- 
Jeff Hammond
jeff.science at gmail.com


