[mpich-discuss] low performance in an asynchronous, mixed MPI/pthreads app

Geoffrey Irving irving at naml.us
Sun Jan 6 20:57:58 CST 2013


Hello,

With help from Jeff Hammond and Jed Brown, I am trying to optimize and
scale a solver for the board game pentago on the Cray XE6 machine
Hopper at NERSC.  The code mixes MPI and pthreads, and has an
asynchronous message passing structure.  We're seeing much lower
performance than expected, and are trying to understand why.  Here is
a visualization of a 96 core test run using 16 ranks with 6 threads
per rank (4 Hopper nodes):

    http://naml.us/random/pentago/history-random17.png

Each rank has 1 dedicated communication thread doing almost all of the
MPI calls, and 5 worker threads doing computation.  White (of which
there is too much) indicates an idle thread.  The color says what's
happening on each thread, and the lines indicate information flow:
ranks request input data from other ranks, respond to input requests
with data, compute once data arrives, send output to other ranks, and
process output for final storage.  The compute is yellow and the
processing is interleaved red/blue.  The weirdness is as follows:

1. All messages are sent using MPI_Isend, but Isend sometimes takes a
very long time to return even for tiny (8 byte) messages.  For
example, the light blue "wakeup" messages are sent from a worker
thread to the master communication thread to indicate that computation
is complete, breaking the communication thread out of an MPI_Waitsome
call (using MPI_THREAD_MULTIPLE).  Some of the input request Isends
also take a long time, and these are also 8 bytes.  What would cause
MPI_Isend to take a long time to return, both for small messages
(wakeups and input requests) and for large messages (input responses
and output sends)?
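
For concreteness, here is a minimal sketch of the wakeup pattern (not
the actual pentago code; wakeup_comm and WAKEUP_TAG are made-up
names): a worker Isends 8 bytes to its own rank, and the communication
thread keeps a matching Irecv among the requests it hands to
MPI_Waitsome.

    #include <mpi.h>
    #include <stdint.h>

    static MPI_Comm wakeup_comm;       // e.g. MPI_Comm_dup'd from
                                       // MPI_COMM_WORLD at startup
    static const int WAKEUP_TAG = 17;

    // Worker thread: signal the communication thread that a line is done.
    // Requires MPI_THREAD_MULTIPLE, since that thread may be blocked in
    // MPI_Waitsome.  The caller keeps *line_id alive and completes *req later.
    void signal_done(uint64_t* line_id, MPI_Request* req) {
      int self;
      MPI_Comm_rank(wakeup_comm, &self);
      MPI_Isend(line_id, 8, MPI_BYTE, self, WAKEUP_TAG, wakeup_comm, req);
    }

    // Communication thread: one preposted wakeup Irecv among its other requests.
    void post_wakeup_recv(uint64_t* buffer, MPI_Request* req) {
      MPI_Irecv(buffer, 8, MPI_BYTE, MPI_ANY_SOURCE, WAKEUP_TAG,
                wakeup_comm, req);
    }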

2. There's a common pattern in the visualization where a bunch of
output processing happens immediately *before* computation.  Whenever
this happens the output and compute are unrelated in terms of
dataflow.  This is probably the same problem as (1).  What seems to be
happening is that an MPI_Isend on the communication thread is taking a
long time to return, preventing the communication thread from noticing
that other messages have arrived (the communication thread will call
MPI_Waitsome once the Isend returns).  If both output messages from
processing and input messages for compute stack up, the code processes
the output first in order to deallocate certain memory as soon as
possible.

3. Stalled wakeup Isends seem to complete at the same time as input
response Isends on the same rank.  Is the MPI_THREAD_MULTIPLE support
problematic?

4. The message latency is quite high.  For example, one of the input
responses takes over 0.3 s in the above image.  Each such response
message is about 92K.  It's possible that line is actually more than
one input response from the same rank (I do not merge them currently),
but in that case I'd expect them to complete at different times.
Unfortunately, my current visualization doesn't do a good job of
visualizing instantaneous bandwidth: I have the data but don't know
quite what to measure.  Bandwidth averaged over 10s intervals is shown
here:

    http://naml.us/random/pentago/bandwidth-10s-async-random17-nonblock.svg
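
(For scale, a single 92K response taking 0.3 s works out to roughly
300 KB/s for that message, several orders of magnitude below what
Hopper's Gemini network should sustain, so raw transfer time is
presumably not the bottleneck.)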

I've also tried asynchronous progress mode (6 threads per rank, 1
progress thread for MPI, 1 communication thread for me, and 4 worker
threads).  This gave a 12% speedup but didn't change the picture
qualitatively (apologies for the mismatched colors and different
horizontal scale):

    http://naml.us/random/pentago/history-async-random17.png

Some things that might contribute to the problem:

1. Since the 92K input responses are sent in response to 8 byte
request messages, I can prepost all matching Irecvs.  However, output
messages arrive unexpectedly, so the best I can do is post a few
wildcard Irecvs (MPI_ANY_SOURCE, MPI_ANY_TAG on the output message
communicator).  Jeff thought wildcard Irecvs might not be good enough
to hit the fast path.  Moreover, output messages are not compressed,
so most are 262K.  I didn't compress them historically because
compression isn't all that fast (I'm using Google's Snappy plus domain
specific preconditioning), but that may be a good experiment to run.
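
For reference, the wildcard receives look roughly like this (a sketch,
not the real code; output_comm, NUM_WILDCARDS, and the buffer handling
are assumptions):

    #include <mpi.h>
    #include <vector>

    static const int NUM_WILDCARDS = 8;          // a few outstanding Irecvs
    static const int MAX_OUTPUT_BYTES = 262144;  // uncompressed block size

    // Since neither source nor tag is known in advance, both are wildcards;
    // this is the part that may miss the preposted-receive fast path.
    void post_output_recvs(MPI_Comm output_comm,
                           std::vector<std::vector<char> >& buffers,
                           std::vector<MPI_Request>& requests) {
      buffers.assign(NUM_WILDCARDS, std::vector<char>(MAX_OUTPUT_BYTES));
      requests.resize(NUM_WILDCARDS);
      for (int i = 0; i < NUM_WILDCARDS; i++)
        MPI_Irecv(&buffers[i][0], MAX_OUTPUT_BYTES, MPI_BYTE,
                  MPI_ANY_SOURCE, MPI_ANY_TAG, output_comm, &requests[i]);
    }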

2. I don't know how to visualize instantaneous bandwidth, even though
I have start and end times for all large messages (start and end of
Isend, Irecv request completion time).  It's possible the problem is
bad network load balancing.  I'm happy to whip up a plot if anyone has
suggestions as to what to draw.
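
One candidate definition, sketched below under an assumed record
layout (the Message struct is not my actual trace format): treat each
message as moving its bytes at a uniform rate over [start, end] and
sum the rates of all messages in flight at each event time.

    #include <algorithm>
    #include <vector>

    struct Message { double start, end, bytes; };
    struct Sample  { double time, rate; };          // rate in bytes/second

    std::vector<Sample> instantaneous_bandwidth(const std::vector<Message>& msgs) {
      std::vector<std::pair<double,double> > events;  // (time, +/- rate)
      for (size_t i = 0; i < msgs.size(); i++) {
        const double rate = msgs[i].bytes / (msgs[i].end - msgs[i].start);
        events.push_back(std::make_pair(msgs[i].start, +rate));
        events.push_back(std::make_pair(msgs[i].end,  -rate));
      }
      std::sort(events.begin(), events.end());
      std::vector<Sample> curve;
      double rate = 0;
      for (size_t i = 0; i < events.size(); i++) {
        rate += events[i].second;        // total rate of all in-flight messages
        Sample s = { events[i].first, rate };
        curve.push_back(s);
      }
      return curve;
    }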

3. Waking a communication thread out of an MPI_Waitsome is somewhat of
a hack.  It's the only time the worker threads touch MPI; otherwise I
could use MPI_THREAD_FUNNELED.  However, the only alternative I know
is for the communication thread to poll on MPI_Testsome and check for
thread completion in between.  Might that be better (it's an easy
experiment to run)?
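
The polling version would look roughly like this (handle_completions
and drain_done_queue are assumed hooks, not functions in the actual
code):

    #include <mpi.h>
    #include <vector>

    void handle_completions(const std::vector<int>& indices, int count,
                            const std::vector<MPI_Status>& statuses);  // assumed
    bool drain_done_queue();  // assumed: pick up finished lines from workers

    // All MPI calls stay on the communication thread, so
    // MPI_THREAD_FUNNELED would suffice.  Termination condition omitted.
    void comm_loop(std::vector<MPI_Request>& requests) {
      std::vector<int> indices(requests.size());
      std::vector<MPI_Status> statuses(requests.size());
      for (;;) {
        int count = 0;
        MPI_Testsome((int)requests.size(), &requests[0], &count,
                     &indices[0], &statuses[0]);
        bool progress = false;
        if (count != MPI_UNDEFINED && count > 0) {
          handle_completions(indices, count, statuses);
          progress = true;
        }
        if (drain_done_queue())
          progress = true;
        if (!progress) {
          // Optionally back off (sched_yield or a short nanosleep) at the
          // cost of extra latency for the next message.
        }
      }
    }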

Finally, a few details about the computation structure follow.  I've
tried to make the above self-contained, so feel free to stop here.

Thanks for any thoughts!
Geoffrey

---------------------------------------

Values of pentago positions with n stones depend only on values of
positions with n+1 stones, so the computation is embarrassingly
parallel within each such "level".  However, it is also memory
constrained: holding level 24 and 25 in memory at the same time (one
input level and one output level) requires about 80 TB.  The state
space of each level is divided into "blocks", and the space of
computation required to go from level n to n+1 is divided into
"lines".  Each line has a set of input blocks at level n and a set of
output blocks at level n+1.  Most blocks have size 64 * 8**4 = 262144
bytes uncompressed.

We partition all blocks and lines amongst the ranks, then operate as follows:
1. When a rank has enough memory to allocate a new line, it posts
Irecvs and Isends an 8 byte request message to each rank which owns an
input block (steps 1 and 2 are sketched after this list).
2. When a rank receives a request, it replies using an Isend from a
flat buffer (doing O(1) work otherwise).  These messages are
compressed, and take up around 92K.
3. Once all input blocks arrive, we compute the line's output blocks.
Each block at level n+1 is computed from several (4) lines, so if
necessary we send the output blocks to the rank which owns them.
These messages are uncompressed, and are usually 262K.
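
Here is a rough sketch of steps 1 and 2 (the communicator, tag scheme,
and buffer management are guesses rather than the real interfaces):

    #include <mpi.h>
    #include <stdint.h>

    // Step 1: the requesting rank preposts the response Irecv, then Isends
    // the 8 byte block id.  Tagging with the block id lets the response be
    // matched without wildcards (assuming ids fit under MPI_TAG_UB).
    void request_block(MPI_Comm comm, int owner, uint64_t block_id,
                       char* response_buf, int max_response_bytes,
                       uint64_t* request_buf, MPI_Request reqs[2]) {
      const int tag = (int)block_id;  // assumption: small enough to be a tag
      MPI_Irecv(response_buf, max_response_bytes, MPI_BYTE, owner, tag,
                comm, &reqs[0]);
      *request_buf = block_id;        // must stay alive until the send completes
      MPI_Isend(request_buf, 8, MPI_BYTE, owner, tag, comm, &reqs[1]);
    }

    // Step 2: the owner replies straight out of its flat compressed block
    // store, doing O(1) work beyond the Isend itself.
    void answer_request(MPI_Comm comm, int requester, uint64_t block_id,
                        char* compressed, int compressed_bytes,
                        MPI_Request* req) {
      MPI_Isend(compressed, compressed_bytes, MPI_BYTE, requester,
                (int)block_id, comm, req);
    }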

Code is here for reference:

    https://github.com/girving/pentago


