<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Feb 23, 2013 at 8:21 PM, Jeff Hammond <span dir="ltr">&lt;<a href="mailto:jhammond@alcf.anl.gov" target="_blank">jhammond@alcf.anl.gov</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":2pz">I personally think that MKL, meaning the BLAS, is an amazing<br>

programming model.</div></blockquote><div><br></div><div style>Are you trolling me?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":2pz">  However, my point here was merely to say that<br>


Intel has an integrated software stack that manages thread resources<br>

intelligently relative to N independent implementations of MPI,<br>

BLAS/LAPACK, and OpenMP/TBB, for example.  I was under the impression<br>

that Intel prevents stupid oversubscription, which is a good thing for<br>

users but essentially impossible to implement portably.<br></div></blockquote><div><br></div><div style>It's relatively easy to be integrated when you're not interoperable.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div id=":2pz"><div class="im">

</div>I think the only way for async progress to work well is to have<br>

fine-grained locking inside of MPI, as is done in PAMID.  Any<br>

implementation that resorts to fat-locking is probably better off<br>

without async progress unless the application is doing something<br>

really silly (like never calling MPI on a rank that is the target of<br>

MPI RMA).</div></blockquote></div><br>There are more than 100k cycles between the times that I enter the MPI stack, and only two threads ever contend for locks (my funneled thread and MPICH's async-progress thread). I'm not convinced that you need super fine-grained locks to make progress during that window. FWIW, my experience has been that standard Nemesis does a _much_ better job of making asynchronous progress than Cray's implementation. If we could look at the code for Cray's implementation, we might get a better idea of why.</div>

</div>