<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Feb 23, 2013 at 7:48 PM, Jeff Hammond <span dir="ltr">&lt;<a href="mailto:jhammond@alcf.anl.gov" target="_blank">jhammond@alcf.anl.gov</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":23z">There's absolutely no reason why it should.  I am not trying to<br>

maximize progress; I am trying to maximize computational performance.<br>

Pinning the 7 comm threads to one core is going to be terrible for<br>

them, but I am assuming that I don't need that much progress, whereas<br>

I do need the computation to run at top speed.  DGEMM on 7 cores<br>

and 1 core of MPI should be much better than DGEMM on 8 cores where<br>

each core is time-shared with a comm thread.<br></div></blockquote><div><br></div><div style>Sure.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":23z">

<div class="im">

</div>You should run Intel MPI, MKL, OpenMP and TBB and only compile your<br>

code with Intel compilers.  That's your best chance to have everything<br>

work together. </div></blockquote><div><br></div><div style>Those are terrible programming models, oversynchronizing and inflexible, and they don't have a concept of memory locality anyway. ;-)</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div id=":23z"> I really don't see how Hydra is supposed to know what<br>

GOMP is doing and try to deal with it. </div></blockquote><div><br></div><div style>AFAIK, GOMP doesn't set affinity at all. Setting affinity doesn't really make sense for TBB because the programming model doesn't have explicit memory locality. But in this case, _I'm_ setting affinity for my pthreads or my OpenMP threads because I know how I use them collectively.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":23z"><div class="im">

<br>

</div>What testing I have done of MPICH_NEMESIS_ASYNC_PROGRESS=1 on Cray<br>

XC30 indicates that NWChem is better off without comm threads since it<br>

communicates and computes at fine enough granularity such that the<br>

lack of progress is less than the overhead of internal locking<br>

(because ASYNC_PROGRESS implies MPI_THREAD_MULTIPLE) and competition<br>

for execution resources (even though XC30 has Intel SNB with HT<br>

enabled).</div></blockquote></div><br>MPICH_NEMESIS_ASYNC_PROGRESS=1 has been slower every time I've tried it on Hopper, even when nonblocking communication is in flight for a long time while the application is computing. I'm hoping that works better in the future, but for now, I still wait just as long when I get around to calling MPI_Wait, except that everything in between ran slower due to MPI_THREAD_MULTIPLE.</div>

</div>