[mpich-discuss] low performance in an asynchronous, mixed MPI/pthreads app

Darius Buntinas buntinas at mcs.anl.gov
Mon Jan 7 13:39:21 CST 2013


Hi Geoff,

Is it possible for you to try this with fewer worker threads per node (leaving some idle cores on each node)?

What's the scale of the x-axis? (about how much time is one tick?)

Do the worker threads receive messages, or is it only the comm thread that receives messages?

How many isends are performed by the worker in the "wakeup" section?  

Aside from the isends in "wakeup," are there any other MPI calls made by the worker threads?

How many isends are performed by the comm thread in "response_send," and how many in "output_send?"

Does "wait" do any MPI other than MPI_Waitsome?

Does the worker do any MPI calls other than isends in "response_send" and "output_send," and waitsome in "wait?"

Thanks,
-d


On Jan 6, 2013, at 8:57 PM, Geoffrey Irving wrote:

> Hello,
> 
> With help from Jeff Hammond and Jed Brown, I am trying to optimize and
> scale a solver for the board game pentago on the Cray XE6 machine
> Hopper at NERSC.  The code mixes MPI and pthreads, and has an
> asynchronous message passing structure.  We're seeing much lower
> performance than expected, and are trying to understand why.  Here is
> a visualization of a 96 core test run using 16 ranks with 6 threads
> per rank (4 Hopper nodes):
> 
>    http://naml.us/random/pentago/history-random17.png
> 
> Each rank has 1 dedicated communication thread doing almost all of the
> MPI calls, and 5 worker threads doing computation.  White (of which
> there is too much) indicates an idle thread.  The color says what's
> happening on each thread, and the lines indicate information flow:
> ranks request input data from other ranks, respond to input requests
> with data, compute once data arrives, send output to other ranks, and
> process output for final storage.  The compute is yellow and the
> processing is interleaved red/blue.  The weirdness is as follows:
> 
> 1. All messages are sent using MPI_Isend, but Isend sometimes takes a
> very long time to return even for tiny (8 byte) messages.  For
> example, the light blue "wakeup" messages are sent from a worker
> thread to the master communication thread to indicate that computation
> is complete, breaking the communication thread out of an MPI_Waitsome
> call (using MPI_THREAD_MULTIPLE).  Some of the input request Isends
> also take a long time, and these are also 8 bytes.  What would cause
> MPI_Isend to take a long time to return, both for small messages
> (wakeups and input requests) and for large messages (input responses
> and output sends)?  (I've sketched the wakeup pattern below, after
> point 4.)
> 
> 2. There's a common pattern in the visualization where a bunch of
> output processing happens immediately *before* computation.  Whenever
> this happens the output and compute are unrelated in terms of
> dataflow.  This is probably the same problem as (1).  What seems to be
> happening is that an MPI_Isend on the communication thread is taking a
> long time to return, preventing the communication thread from noticing
> that other messages have arrived (the communication thread will call
> MPI_Waitsome once the Isend returns).  If both output messages from
> processing and input messages for compute stack up, the code processes
> the output first in order to deallocate certain memory as soon as
> possible.
> 
> 3. Stalled wakeup Isends seem to complete at the same time as input
> response Isends on the same rank.  Is the MPI_THREAD_MULTIPLE support
> problematic?
> 
> 4. The message latency is quite high.  For example, one of the input
> responses takes over 0.3 s in the above image.  Each such response
> message is about 92K.  It's possible that line is actually more than
> one input response from the same rank (I do not merge them currently),
> but in that case I'd expect them to complete at different times.
> Unfortunately, my current visualization doesn't do a good job of
> visualizing instantaneous bandwidth: I have the data but don't know
> quite what to measure.  Bandwidth averaged over 10s intervals is shown
> here:
> 
>    http://naml.us/random/pentago/bandwidth-10s-async-random17-nonblock.svg
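> 
> To make issue 1 concrete, here is a minimal sketch of the wakeup
> pattern (MAX_REQS, WAKEUP_TAG, comm, and my_rank are placeholders
> rather than names from the real code):
> 
>     /* Comm thread: keep an 8 byte wakeup Irecv posted alongside the
>        real traffic, then block in MPI_Waitsome. */
>     enum { MAX_REQS = 64 };
>     char wakeup_buf[8];
>     MPI_Request reqs[MAX_REQS];
>     int nreqs = 0, outcount, indices[MAX_REQS];
>     MPI_Irecv(wakeup_buf, 8, MPI_BYTE, MPI_ANY_SOURCE, WAKEUP_TAG,
>               comm, &reqs[nreqs++]);
>     /* ... post Irecvs/Isends for requests, responses, and outputs ... */
>     MPI_Waitsome(nreqs, reqs, &outcount, indices, MPI_STATUSES_IGNORE);
> 
>     /* Worker thread (hence MPI_THREAD_MULTIPLE): poke the comm thread
>        with a tiny self-addressed Isend when its computation finishes. */
>     char done_msg[8] = {0};
>     MPI_Request wakeup;
>     MPI_Isend(done_msg, 8, MPI_BYTE, my_rank, WAKEUP_TAG, comm, &wakeup);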
> 
> I've also tried asynchronous progress mode (6 threads per rank, 1
> progress thread for MPI, 1 communication thread for me, and 4 worker
> threads).  This gave a 12% speedup but didn't change the picture
> qualitatively (apologies for the mismatched colors and different
> horizontal scale):
> 
>    http://naml.us/random/pentago/history-async-random17.png
> 
> Some things that might contribute to the problem:
> 
> 1. Since the 92K input responses are sent in response to 8 byte
> request messages, I can prepost all matching Irecvs.  However, output
> messages arrive unexpectedly, so the best I can do is post a few
> wildcard Irecvs (MPI_ANY_SOURCE, MPI_ANY_TAG on the output message
> communicator).  Jeff thought wildcard Irecvs might not be good enough
> to hit the fast path.  Moreover, output messages are not compressed,
> so most are 262K.  I didn't compress them historically because
> compression isn't all that fast (I'm using Google's Snappy plus domain
> specific preconditioning), but that may be a good experiment to run.
> (The wildcard receive pool I mean is sketched below, after point 3.)
> 
> 2. I don't know how to visualize instantaneous bandwidth, even though
> I have start and end times for all large messages (start and end of
> Isend, Irecv request completion time).  It's possible the problem is
> bad network load balancing.  I'm happy to whip up a plot if anyone has
> suggestions as to what to draw.
> 
> 3. Waking a communication thread out of an MPI_Waitsome is somewhat of
> a hack.  It's the only time the worker threads touch MPI; otherwise I
> could use MPI_THREAD_FUNNELED.  However, the only alternative I know
> is for the communication thread to poll on MPI_Testsome and check for
> thread completion in between.  Might that be better (it's an easy
> experiment to run)?  A rough sketch of that polling loop follows below.
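> 
> For point 1 above, the wildcard receive pool would look roughly like
> this (POOL and the buffer/communicator names are placeholders):
> 
>     /* Keep a small pool of wildcard Irecvs posted on the output
>        communicator, each large enough for an uncompressed 262144 byte
>        block.  After handling a completed slot, repost it with the
>        same call so the pool stays full. */
>     enum { POOL = 8, OUTPUT_BYTES = 262144 };
>     static char bufs[POOL][OUTPUT_BYTES];
>     MPI_Request pool[POOL];
>     for (int i = 0; i < POOL; i++)
>       MPI_Irecv(bufs[i], OUTPUT_BYTES, MPI_BYTE, MPI_ANY_SOURCE,
>                 MPI_ANY_TAG, output_comm, &pool[i]);
> 
> And for point 3, the polling alternative is roughly the following
> (workers_done and the two handle_* helpers are hypothetical; the flag
> would be an atomic or mutex-protected counter set by the workers):
> 
>     /* Comm thread: poll instead of blocking.  Workers make no MPI
>        calls at all, so MPI_THREAD_FUNNELED would suffice. */
>     for (;;) {
>       int outcount, indices[MAX_REQS];
>       MPI_Testsome(nreqs, reqs, &outcount, indices, MPI_STATUSES_IGNORE);
>       if (outcount > 0)
>         handle_completed_messages(outcount, indices);
>       if (workers_done)
>         handle_finished_lines();
>       sched_yield();  /* back off a little to avoid burning the core */
>     }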
> 
> Finally, a few details about the computation structure follow.  I've
> tried to make the above self-contained, so feel free to stop here.
> 
> Thanks for any thoughts!
> Geoffrey
> 
> ---------------------------------------
> 
> Values of pentago positions with n stones depend only on values of
> positions with n+1 stones, so the computation is embarrassingly
> parallel within each such "level".  However, it is also memory
> constrained: holding levels 24 and 25 in memory at the same time (one
> input level and one output level) requires about 80 TB.  The state
> space of each level is divided into "blocks", and the space of
> computation required to go from level n to n+1 is divided into
> "lines".  Each line has a set of input blocks at level n and a set of
> output blocks at level n+1.  Most blocks have size 64 * 8**4 = 262144
> bytes uncompressed.
> 
> We partition all blocks and lines amongst the ranks, then operate as follows:
> 1. When a rank has enough memory to allocate a new line, it posts
> Irecvs and Isends an 8 byte request message to each rank which owns an
> input block (steps 1 and 2 are sketched below, after step 3).
> 2. When a rank receives a request, it replies using an Isend from a
> flat buffer (doing O(1) work otherwise).  These messages are
> compressed, and take up around 92K.
> 3. Once all input blocks arrive, we compute the line's output blocks.
> Each block at level n+1 is computed from several (4) lines, so if
> necessary we send the output blocks to the rank which owns them.
> These messages are uncompressed, and are usually 262K.
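> 
> A rough sketch of steps 1 and 2 (the tags, sizes, and buffer names are
> placeholders rather than identifiers from the real code):
> 
>     /* Step 1, requesting rank: prepost the Irecv for the compressed
>        response, then fire the 8 byte request. */
>     MPI_Irecv(response_buf, MAX_COMPRESSED_BYTES, MPI_BYTE,
>               owner, RESPONSE_TAG, comm, &recv_req);
>     MPI_Isend(request_msg, 8, MPI_BYTE, owner, REQUEST_TAG,
>               comm, &send_req);
> 
>     /* Step 2, owning rank: reply straight from the flat compressed
>        buffer it already holds (roughly 92K on the wire). */
>     MPI_Isend(block_data, block_bytes, MPI_BYTE,
>               requester, RESPONSE_TAG, comm, &reply_req);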
> 
> Code is here for reference:
> 
>    https://github.com/girving/pentago