<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Feb 23, 2013 at 7:48 PM, Jeff Hammond <span dir="ltr"><<a href="mailto:jhammond@alcf.anl.gov" target="_blank">jhammond@alcf.anl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":23z">There's absolutely no reason why it should. I am not trying to<br>
maximize progress; I am trying to maximize computational performance.<br>
Pinning the 7 comm threads to one core is going to be terrible for<br>
them, but I am assuming that I don't need that much progress, whereas<br>
I do need to the computation to run at top speed. DGEMM on 7 cores<br>
and 1 core of MPI should be much better than DGEMM on 8 cores where<br>
each core is time-shared with a comm thread.<br></div></blockquote><div><br></div><div style>Sure.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":23z">
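Just to make that layout concrete, here's a minimal sketch (my own
illustration; comm_loop, the core numbering, and the 7+1 split are
placeholders for whatever the application actually does): seven OpenMP
compute threads pinned to cores 0-6 and one communication helper pthread
pinned to core 7, so the comm work never time-shares with the DGEMM threads.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <omp.h>

/* Placeholder comm helper: would poll MPI progress in a loop. */
static void *comm_loop(void *arg) { (void)arg; return NULL; }

static void pin_to_core(pthread_t t, int core)
{
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  pthread_setaffinity_np(t, sizeof(set), &set);
}

void run_split(void)
{
  pthread_t comm;
  pthread_create(&comm, NULL, comm_loop, NULL);
  pin_to_core(comm, 7);              /* comm helper gets core 7 to itself */

  omp_set_num_threads(7);            /* compute gets cores 0-6 */
  #pragma omp parallel
  {
    pin_to_core(pthread_self(), omp_get_thread_num());
    /* ... this thread's share of the DGEMM ... */
  }

  pthread_join(comm, NULL);
}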
<div class="im">
</div>You should run Intel MPI, MKL, OpenMP and TBB and only compile your<br>
code with Intel compilers. That's your best chance to have everything<br>
work together. </div></blockquote><div><br></div><div style>Those are terrible programming models, oversynchronizing and inflexible, and they don't have a concept of memory locality anyway. ;-)</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div id=":23z"> I really don't see how Hydra is supposed to know what<br>
GOMP is doing and try to deal with it. </div></blockquote><div><br></div><div style>AFAIK, GOMP doesn't set affinity at all. It doesn't really make sense for TBB because the programming model doesn't have explicit memory locality. But in this case, _I'm_ setting affinity for my pthreads or my OpenMP threads because I know how I use them collectively.</div>
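When I say I'm setting affinity myself, I mean something like the following
(again just a sketch; worker and the one-thread-per-core map are assumptions
of mine): the application creates its own pthreads and binds each one at
creation time, so neither Hydra nor GOMP has to guess what I want.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Placeholder worker; in practice a kernel that I know touches data
 * resident near its chosen core. */
static void *worker(void *arg) { (void)arg; return NULL; }

int spawn_pinned_workers(pthread_t *threads, int nthreads)
{
  for (int i = 0; i < nthreads; i++) {
    pthread_attr_t attr;
    cpu_set_t set;
    pthread_attr_init(&attr);
    CPU_ZERO(&set);
    CPU_SET(i, &set);                /* one thread per core, cores 0..n-1 */
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
    if (pthread_create(&threads[i], &attr, worker, NULL)) return -1;
    pthread_attr_destroy(&attr);
  }
  return 0;
}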
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":23z"><div class="im">
<br>
</div>What testing I have done of MPICH_NEMESIS_ASYNC_PROGRESS=1 on Cray<br>
XC30 indicates that NWChem is better off without comm threads since it<br>
communicates and computes at fine enough granularity such that the<br>
lack of progress is less than the overhead of internal locking<br>
(because ASYNC_PROGRESS implies MPI_THREAD_MULTIPLE) and competition<br>
for execution resources (even though XC30 has Intel SNB with HT<br>
enabled).</div></blockquote></div><br>MPICH_NEMESIS_ASYNC_PROGRESS=1 is always slower when I've tried it on Hopper, even when I have nonblocking communication running for a long time while the application is computing. I'm hoping that works better in the future, but for now, I still wait just as long when I get around to calling MPI_Wait, except that everything in between ran slower due to MPI_THREAD_MULTIPLE.</div>
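The pattern I'm talking about is just the usual overlap idiom, sketched below
(the buffer names and the compute placeholder are mine): post the nonblocking
operations, compute for a long time, then call the wait. With working async
progress the wait should return almost immediately; on Hopper it doesn't, and
the compute in between is slower too.

#include <mpi.h>

/* Sketch of the overlap pattern described above.  MPICH's
 * MPICH_NEMESIS_ASYNC_PROGRESS=1 implies MPI_THREAD_MULTIPLE internally;
 * requesting it explicitly here just makes that assumption visible. */
void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                          int peer, MPI_Comm comm)
{
  MPI_Request reqs[2];
  MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
  MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

  /* ... long-running local computation; a progress thread would ideally
   * move the messages while this runs ... */

  /* Without effective async progress, most of the transfer happens here. */
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}

int main(int argc, char **argv)
{
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  /* ... allocate buffers, pick a peer rank, call exchange_and_compute() ... */
  MPI_Finalize();
  return 0;
}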