[mpich-discuss] Affinity with MPICH_ASYNC_PROGRESS

Jeff Hammond jhammond at alcf.anl.gov
Sat Feb 23 20:21:28 CST 2013


>> You should run Intel MPI, MKL, OpenMP and TBB and only compile your
>> code with Intel compilers.  That's your best chance to have everything
>> work together.
>
> Those are terrible programming models, oversynchronizing and inflexible, and
> they don't have a concept of memory locality anyway. ;-)

I personally think that MKL, meaning the BLAS, is an amazing
programming model.  However, my point here was merely to say that
Intel has an integrated software stack that manages thread resources
intelligently relative to N independent implementations of MPI,
BLAS/LAPACK, and OpenMP/TBB, for example.  I was under the impression
that Intel prevents stupid oversubscription, which is a good thing for
users but essentially impossible to implement portably.

>>  I really don't see how Hydra is supposed to know what
>> GOMP is doing and try to deal with it.
>
> AFAIK, GOMP doesn't set affinity at all. It doesn't really make sense for
> TBB because the programming model doesn't have explicit memory locality. But
> in this case, _I'm_ setting affinity for my pthreads or my OpenMP threads
> because I know how I use them collectively.

I guess I am thinking about simplistic affinity, like breadth-first
placement and not letting the OS move threads around arbitrarily.
However, no affinity is better than bad affinity.  For example, any
application that uses Pthreads with MVAPICH has to set
MV2_ENABLE_AFFINITY=0 or pay a performance penalty, because MVAPICH
implements bad affinity by default.
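
To be concrete, by breadth-first placement I mean nothing fancier than
the sketch below (Linux-specific, using pthread_setaffinity_np; the
thread count and the thread-to-core map are just stand-ins, and a real
code would offset the core ids by whatever range the rank owns):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NTHREADS 4   /* stand-in for the per-rank thread count */

/* Pin the calling thread to one core.  Mapping thread i to core i is
   only "breadth-first" if the OS numbers physical cores before
   hardware threads, which is typical on Linux but not guaranteed. */
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        fprintf(stderr, "warning: failed to pin thread to core %d\n", core);
}

static void * work(void *arg)
{
    pin_to_core((int)(size_t)arg);
    /* ... application work ... */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, (void *)(size_t)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}

Compile with -pthread; the same idea works for OpenMP threads by doing
the pinning inside a parallel region keyed off omp_get_thread_num().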

>> What testing I have done of MPICH_NEMESIS_ASYNC_PROGRESS=1 on Cray
>> XC30 indicates that NWChem is better off without comm threads since it
>> communicates and computes at fine enough granularity such that the
>> lack of progress is less than the overhead of internal locking
>> (because ASYNC_PROGRESS implies MPI_THREAD_MULTIPLE) and competition
>> for execution resources (even though XC30 has Intel SNB with HT
>> enabled).
>
> MPICH_NEMESIS_ASYNC_PROGRESS=1 is always slower when I've tried it on
> Hopper, even when I have nonblocking communication running for a long time
> while the application is computing. I'm hoping that works better in the
> future, but for now, I still wait just as long when I get around to calling
> MPI_Wait, except that everything in between ran slower due to
> MPI_THREAD_MULTIPLE.
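
For reference, the overlap pattern in question is just the usual
post-compute-wait idiom; a minimal sketch (the message size, partner
choice, and compute() loop are placeholders):

#include <mpi.h>

#define N (1<<20)

/* Stand-in for a long compute phase that makes no MPI calls. */
static void compute(double *u, int n)
{
    for (int iter = 0; iter < 100; iter++)
        for (int i = 0; i < n; i++)
            u[i] = 0.5 * (u[i] + 1.0);
}

int main(int argc, char **argv)
{
    static double sendbuf[N], recvbuf[N], work[N];
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int partner = rank ^ 1;   /* pair even/odd ranks */
    MPI_Request req[2];

    if (partner < size) {
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[1]);
    }

    compute(work, N);   /* no MPI calls in here */

    if (partner < size) {
        /* Without a progress thread or explicit MPI_Test polling, large
           transfers often do not move until this call. */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

The question is whether the Isend/Irecv actually make progress during
compute() or only once MPI_Waitall is reached.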

I think the only way for async progress to work well is to have
fine-grained locking inside of MPI, as is done in PAMID.  Any
implementation that resorts to fat-locking is probably better off
without async progress unless the application is doing something
really silly (like never calling MPI on a rank that is the target of
MPI RMA).
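
To spell out the RMA case I have in mind, here is a rough sketch (the
window setup is minimal and compute() is a stand-in; whether the origin
actually stalls in MPI_Win_unlock depends on the interconnect and the
implementation):

#include <mpi.h>

#define N 1024

/* Stand-in for a long compute phase with no MPI calls in it. */
static void compute(double *u, int n)
{
    for (int iter = 0; iter < 1000; iter++)
        for (int i = 0; i < n; i++)
            u[i] = 0.5 * (u[i] + 1.0);
}

int main(int argc, char **argv)
{
    double winbuf[N] = {0}, src[N] = {0}, work[N] = {0};
    int rank;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_create(winbuf, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        /* Origin: passive-target Put into rank 1's window. */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        MPI_Put(src, N, MPI_DOUBLE, 1, 0, N, MPI_DOUBLE, win);
        /* Without async progress (or hardware that handles the Put on
           its own), this unlock may not complete until the target
           re-enters the MPI library. */
        MPI_Win_unlock(1, win);
    } else if (rank == 1) {
        compute(work, N);   /* target never calls MPI during the epoch */
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}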

Jeff

-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond


