<div dir="ltr">It's been a few years.  I don't remember.  Probably 82B/rank was the bug that caused me to measure the prefactor in the first place, and the bug-free value was ~8B/rank.<div><br></div><div>Anyways, we could run MPI with 64 ppn on 48Ki nodes with 16 GB per node, which means MPI state was a lot less than 200 MB.  I recall MPI footprint was ~20 MB, not including the 32 MB of POSIX shared memory that MPI required.</div><div><br></div><div>Jeff<br><div><br></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Aug 24, 2016 at 12:16 PM, Dan Ibanez <span dir="ltr"><<a href="mailto:dan.a.ibanez@gmail.com" target="_blank">dan.a.ibanez@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Just a slight discrepancy:<span class=""><div>> the prefactor was ~82 bytes, which is pretty lean (~25 MB per rank at 3M ranks)</div></span><div>82 * 3M = 246MB</div><div>Did you mean 8.2 bytes per rank ?</div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Aug 24, 2016 at 2:05 PM, Dan Ibanez <span dir="ltr"><<a href="mailto:dan.a.ibanez@gmail.com" target="_blank">dan.a.ibanez@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Thanks Jeff !<div><br></div><div>Yea, I've been able to write scalable MPI-based code</div><div>that doesn't use MPI_All* functions, and the</div><div>MPI_Neighbor_all* variants are just perfect; they have</div><div>replaced lots of low-level send/recv systems.</div><div><br></div><div>I was interested in the theoretical scalability of the</div><div>implementation, and your answer is pretty comprehensive</div><div>so I'll go read those papers.</div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Aug 24, 2016 at 1:55 PM, Jeff Hammond <span dir="ltr"><<a 
href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">It depends on where you look in MPICH.  I analyzed memory consumption of MPI on Blue Gene/Q, which was based on MPICH (and is OSS, so you can read all of it).  There was O(nproc) memory usage at every node, but I recall the prefactor was ~82 bytes, which is pretty lean (~25 MB per rank at 3M ranks).  I don't know if the O(nproc) data was in MPICH itself or the underlying layer (PAMI), or both, but it doesn't really matter from a user perspective.<div><br></div><div>Some _networks_ might make it hard not to have O(nproc) eager buffers on every rank, and there are other "features" of network HW/SW that may require O(nproc) data.  Obviously, since this sort of thing is not scalable, networks that historically had such properties have evolved to support more scalable designs.  Some of the low-level issues are addressed in <a href="https://www.open-mpi.org/papers/ipdps-2006/ipdps-2006-openmpi-ib-scalability.pdf" target="_blank">https://www.open-mpi.org/pa<wbr>pers/ipdps-2006/ipdps-2006-ope<wbr>nmpi-ib-scalability.pdf</a>.</div><div><br></div><div>User buffers are a separate issue.  MPI_Alltoall and MPI_Allgather act on O(nproc) user storage.  MPI_Allgatherv, MPI_Alltoallv and MPI_Alltoallw have O(nproc) input vectors.  MPI experts often refer to the vector collectives as unscalable, but of course this may not matter in practice for many users.  
And in some of the cases where MPI_Alltoallv is used, one can replace it with a carefully written loop over Send-Recv calls that does not require the user to allocate O(nproc) vectors specifically for MPI.</div><div><br></div><div>There's a paper by Argonne+IBM that addresses this topic in more detail: <a href="http://www.mcs.anl.gov/~thakur/papers/mpi-million.pdf" target="_blank">http://www.mcs.anl.gov<wbr>/~thakur/papers/mpi-million.pd<wbr>f</a></div><div><br></div><div>Jeff</div><div><br><div class="gmail_extra"><br><div class="gmail_quote"><div><div>On Wed, Aug 24, 2016 at 10:28 AM, Dan Ibanez <span dir="ltr"><<a href="mailto:dan.a.ibanez@gmail.com" target="_blank">dan.a.ibanez@gmail.com</a>></span> wrote:<br></div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><div dir="ltr">Hello,<div><br></div><div>This may be a silly question, but the reason</div><div>I'm asking is to obtain a fairly definitive answer.</div><div>Basically, does MPICH have data structures</div><div>which are of size:</div><div>1) O(N)</div><div>2) O(N^2)</div><div>Where N is the size of MPI_COMM_WORLD ?</div><div>My initial guess would be no, because there</div><div>exist machines (Mira) for which it is not</div><div>possible to store N^2 bytes, and even N bytes</div><div>becomes an issue.</div><div>I understand there are MPI functions (MPI_Alltoall) one can</div><div>call that by definition will require at least O(N) memory,</div><div>but supposing one does not use these, would the internal</div><div>MPICH systems still have this memory complexity ?</div><div><br></div><div>Thank you for looking at this anyway</div></div>

<br></div></div>______________________________<wbr>_________________<br>

To manage subscription options or unsubscribe:<br>

<a href="https://lists.mpich.org/mailman/listinfo/devel" rel="noreferrer" target="_blank">https://lists.mpich.org/mailma<wbr>n/listinfo/devel</a><span><font color="#888888"><br></font></span></blockquote></div><span><font color="#888888"><br><br clear="all"><div><br></div>-- <br><div data-smartmail="gmail_signature">Jeff Hammond<br><a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br><a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>

</font></span></div></div></div>

</blockquote></div><br></div>

</div></div></blockquote></div><br></div>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">Jeff Hammond<br><a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br><a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>

</div>