[mpich-devel] O(N^p) data

Jeff Hammond jeff.science at gmail.com
Wed Aug 24 14:42:44 CDT 2016


It's been a few years.  I don't remember.  Probably the ~82B/rank figure
came from the bug that caused me to measure the prefactor in the first
place, and the bug-free value was ~8B/rank.

Anyway, we could run MPI with 64 ppn on 48Ki nodes with 16 GB per node,
which means MPI state was a lot less than 200 MB per rank.  I recall the
MPI footprint was ~20 MB per rank, not including the 32 MB of POSIX shared
memory that MPI required.
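
For anyone who wants to sanity-check those numbers, a quick
back-of-the-envelope sketch (the rank count and the ~8.2 B/rank
prefactor are the figures recalled in this thread, so treat the
output as an estimate, not a measurement):

#include <stdio.h>

int main(void)
{
    long nodes  = 48 * 1024;      /* 48Ki nodes (Mira-class) */
    long ppn    = 64;             /* ranks per node */
    long nranks = nodes * ppn;    /* ~3.1M ranks */
    double per_peer = 8.2;        /* bytes of O(nproc) state per peer */

    printf("ranks: %ld\n", nranks);              /* 3145728 */
    printf("state per rank: %.1f MB\n",
           per_peer * nranks / 1e6);             /* ~25.8 MB */
    return 0;
}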

Jeff


On Wed, Aug 24, 2016 at 12:16 PM, Dan Ibanez <dan.a.ibanez at gmail.com> wrote:

> Just a slight discrepancy:
> > the prefactor was ~82 bytes, which is pretty lean (~25 MB per rank at
> > 3M ranks)
> 82 * 3M = 246 MB
> Did you mean 8.2 bytes per rank?
>
> On Wed, Aug 24, 2016 at 2:05 PM, Dan Ibanez <dan.a.ibanez at gmail.com>
> wrote:
>
>> Thanks, Jeff!
>>
>> Yeah, I've been able to write scalable MPI-based code
>> that doesn't use the MPI_All* functions, and the
>> MPI_Neighbor_all* variants are just perfect; they have
>> replaced lots of low-level send/recv systems.
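>>
>> For concreteness, a minimal sketch of that pattern (my own
>> illustration, not code from this thread; the names are hypothetical):
>> build a distributed graph communicator from each rank's neighbor
>> list, then exchange data with MPI_Neighbor_alltoall, with no
>> O(nproc) argument vectors anywhere:
>>
>> #include <mpi.h>
>>
>> /* Exchange one double with each of 'nneighbors' peers; the
>>  * neighborhood is assumed symmetric (sources == destinations). */
>> void neighbor_exchange(int nneighbors, const int *neighbors,
>>                        const double *sendbuf, double *recvbuf,
>>                        MPI_Comm comm)
>> {
>>     MPI_Comm graph;
>>     MPI_Dist_graph_create_adjacent(comm,
>>         nneighbors, neighbors, MPI_UNWEIGHTED,  /* sources */
>>         nneighbors, neighbors, MPI_UNWEIGHTED,  /* destinations */
>>         MPI_INFO_NULL, 0 /* no reorder */, &graph);
>>     MPI_Neighbor_alltoall(sendbuf, 1, MPI_DOUBLE,
>>                           recvbuf, 1, MPI_DOUBLE, graph);
>>     MPI_Comm_free(&graph);
>> }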
>>
>> I was interested in the theoretical scalability of the
>> implementation, and your answer is pretty comprehensive,
>> so I'll go read those papers.
>>
>> On Wed, Aug 24, 2016 at 1:55 PM, Jeff Hammond <jeff.science at gmail.com>
>> wrote:
>>
>>> It depends on where you look in MPICH.  I analyzed memory consumption of
>>> MPI on Blue Gene/Q, which was based on MPICH (and is OSS, so you can read
>>> all of it).  There was O(nproc) memory usage at every node, but I recall
>>> the prefactor was ~82 bytes, which is pretty lean (~25 MB per rank at 3M
>>> ranks).  I don't know if the O(nproc) data was in MPICH itself or the
>>> underlying layer (PAMI), or both, but it doesn't really matter from a user
>>> perspective.
>>>
>>> Some _networks_ might make it hard not to have O(nproc) eager buffers on
>>> every rank, and there are other "features" of network HW/SW that may
>>> require O(nproc) data.  Obviously, since this sort of thing is not
>>> scalable, networks that historically had such properties have evolved to
>>> support more scalable designs.  Some of the low-level issues are addressed
>>> in
>>> https://www.open-mpi.org/papers/ipdps-2006/ipdps-2006-openmpi-ib-scalability.pdf
>>>
>>> User buffers are a separate issue.  MPI_Alltoall and MPI_Allgather act
>>> on O(nproc) user storage.  MPI_Allgatherv, MPI_Alltoallv and MPI_Alltoallw
>>> have O(nproc) input vectors.  MPI experts often refer to the vector
>>> collectives as unscalable, but of course this may not matter in practice
>>> for many users.  And in some of the cases where MPI_Alltoallv is used, one
>>> can replace it with a carefully written loop over Send-Recv calls that does
>>> not require the user to allocate O(nproc) vectors specifically for MPI; a
>>> sketch follows below.
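>>>
>>> As an illustrative sketch of that replacement (the neighbor lists and
>>> buffers here are hypothetical names, not anything from MPICH): post
>>> nonblocking receives and sends only for the ranks you actually talk
>>> to, so the exchange needs O(neighbors) rather than O(nproc) memory:
>>>
>>> #include <stdlib.h>
>>> #include <mpi.h>
>>>
>>> void sparse_exchange(int nneighbors, const int *neighbors,
>>>                      double **sendbufs, const int *sendcounts,
>>>                      double **recvbufs, const int *recvcounts,
>>>                      MPI_Comm comm)
>>> {
>>>     MPI_Request *reqs = malloc(2 * nneighbors * sizeof(MPI_Request));
>>>     /* Post all receives first, then the sends, then wait. */
>>>     for (int i = 0; i < nneighbors; ++i)
>>>         MPI_Irecv(recvbufs[i], recvcounts[i], MPI_DOUBLE,
>>>                   neighbors[i], 0, comm, &reqs[i]);
>>>     for (int i = 0; i < nneighbors; ++i)
>>>         MPI_Isend(sendbufs[i], sendcounts[i], MPI_DOUBLE,
>>>                   neighbors[i], 0, comm, &reqs[nneighbors + i]);
>>>     MPI_Waitall(2 * nneighbors, reqs, MPI_STATUSES_IGNORE);
>>>     free(reqs);
>>> }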
>>>
>>> There's a paper by Argonne+IBM that addresses this topic in more detail:
>>> http://www.mcs.anl.gov/~thakur/papers/mpi-million.pdf
>>>
>>> Jeff
>>>
>>>
>>> On Wed, Aug 24, 2016 at 10:28 AM, Dan Ibanez <dan.a.ibanez at gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> This may be a silly question, but the reason
>>>> I'm asking is to obtain a fairly definitive answer.
>>>> Basically, does MPICH have data structures
>>>> which are of size:
>>>> 1) O(N)
>>>> 2) O(N^2)
>>>> where N is the size of MPI_COMM_WORLD?
>>>> My initial guess would be no, because there
>>>> exist machines (Mira) for which it is not
>>>> possible to store N^2 bytes, and even N bytes
>>>> becomes an issue.
>>>> I understand there are MPI functions (MPI_Alltoall) one can
>>>> call that by definition will require at least O(N) memory,
>>>> but supposing one does not use these, would the internal
>>>> MPICH systems still have this memory complexity?
>>>>
>>>> Thank you for looking at this anyway
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Jeff Hammond
>>> jeff.science at gmail.com
>>> http://jeffhammond.github.io/
>>>
>>>
>>
>>
>
>



-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/