[mpich-discuss] low performance in an asynchronous, mixed MPI/pthreads app

Dave Goodell goodell at mcs.anl.gov
Mon Jan 7 12:53:20 CST 2013


On Jan 6, 2013, at 8:57 PM CST, Geoffrey Irving wrote:

> We're seeing much lower
> performance than expected, and are trying to understand why.  Here is
> a visualization of a 96 core test run using 16 ranks with 6 threads
> per rank (4 Hopper nodes):
> 
>    http://naml.us/random/pentago/history-random17.png
> 
> 1. All messages are sent using MPI_Isend, but Isend sometimes takes a
> very long time to return even for tiny (8 byte) messages.  For
> example, the light blue "wakeup" messages are sent from a worker
> thread to the master communication thread to indicate that computation
> is complete, breaking the communication thread out of an MPI_Waitsome
> call (using MPI_THREAD_MULTIPLE).  Some of the input request Isends
> also take a long time, and these are also 8 bytes.  What would cause
> MPI_Isend to take a long time to return, both for small messages
> (wakeups and input requests) and for large messages (input responses
> and output sends)?

One possibility is that MPICH-nemesis (on which Cray's MPI for the XE6 is based) does not have any fine-grained parallelism in the library.  In order for an Isend-ing thread to enter the MPI library at all, it must queue on a mutex until the thread currently occupying the progress engine decides to yield.  In nemesis the default yield interval is 10 iterations of {poll shared memory, then poll the network module}, although Cray may have tweaked this number (I have not seen their source).  So overly long netmod polling times could also contribute to this delay.
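
For concreteness, here is a stripped-down sketch of the pattern I think you are describing: a communication thread parked in MPI_Waitsome while a worker thread tries to Isend an 8-byte wakeup to it.  With a single coarse-grained lock, that tiny Isend can stall for roughly the length of the waiter's polling interval before it even gets into the library.  The tag, types, and structure below are illustrative, not taken from your code.

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define WAKEUP_TAG 17   /* hypothetical tag for the 8-byte wakeup message */

/* Worker thread: its only MPI call is the tiny wakeup Isend.  With one
 * coarse-grained lock, this call can stall until the communication thread
 * (blocked in MPI_Waitsome below) yields the lock between polling rounds. */
static void *worker(void *arg)
{
    int rank = *(int *)arg;
    double payload = 0.0;                 /* stands in for "computation done" */
    MPI_Request req;
    MPI_Isend(&payload, 1, MPI_DOUBLE, rank, WAKEUP_TAG, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Communication thread (main): prepost the wakeup recv from ourselves,
     * then block in MPI_Waitsome.  While it waits, this thread sits in the
     * progress engine, releasing the big lock only every so many polling
     * iterations (10 by default, per the note above). */
    double wakeup;
    MPI_Request recv_req;
    MPI_Irecv(&wakeup, 1, MPI_DOUBLE, rank, WAKEUP_TAG, MPI_COMM_WORLD, &recv_req);

    pthread_t tid;
    pthread_create(&tid, NULL, worker, &rank);

    int outcount, index;
    MPI_Waitsome(1, &recv_req, &outcount, &index, MPI_STATUSES_IGNORE);

    pthread_join(tid, NULL);
    MPI_Finalize();
    return 0;
}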

This design is simplistic, and we aren't especially happy with it, but so far we've found it to be very difficult to retrofit finer-grained parallelism into the existing nemesis codebase.  My personal opinion is that a real fix will require a substantial rewrite of a large part of the lower level code.

If I'm understanding your questions/explanations correctly, then this is likely related to issues 2 & 3 as well, possibly 4.

[…]
> Some things that might contribute to the problem:
> 
> 1. Since the 92K input responses are sent in response to 8 byte
> request messages, I can prepost all matching Irecvs.  However, output
> messages arrive unexpectedly, so the best I can do is post a few
> wildcard Irecvs (MPI_ANY_SOURCE, MPI_ANY_TAG on the output message
> communicator).  Jeff thought wildcard Irecvs might not be good enough
> to hit the fast path.  Moreover, output messages are not compressed,
> so most are 262K.  I didn't compress them historically because
> compression isn't all that fast (I'm using Google's Snappy plus domain
> specific preconditioning), but that may be a good experiment to run.

Preposted any_source+any_tag Irecvs should be on the fast-enough path.  The size is large enough that it might be tipping you over into the rendezvous protocol instead of the eager protocol, but only Cray can answer that for sure.  I probably wouldn't pursue compression as a first next step.
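
To make sure we mean the same thing, here is a sketch of the preposted wildcard-receive pool, assuming a dedicated output communicator and a known upper bound on the uncompressed message size (the pool depth and names are made up):

#include <mpi.h>
#include <stdlib.h>

#define NPOST   8            /* hypothetical pool depth */
#define MAX_OUT 262144       /* upper bound on an uncompressed output message */

/* Prepost a pool of wildcard receives on the output communicator so arriving
 * output messages land directly in preposted buffers rather than sitting on
 * the unexpected-message queue. */
void prepost_output_recvs(MPI_Comm output_comm, char *bufs[NPOST],
                          MPI_Request reqs[NPOST])
{
    for (int i = 0; i < NPOST; i++) {
        bufs[i] = malloc(MAX_OUT);
        MPI_Irecv(bufs[i], MAX_OUT, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  output_comm, &reqs[i]);
    }
}

/* When one of reqs[] completes (via your existing Waitsome), read the actual
 * source/tag/count from the MPI_Status, process it, and re-post an MPI_Irecv
 * into the same slot to keep the pool full. */

On the eager-vs-rendezvous question: if Cray exposes an eager-threshold knob for the Gemini netmod (I believe the environment variable is MPICH_GNI_MAX_EAGER_MSG_SIZE, but check the intro_mpi man page on your system), raising it past 262K would be a cheap way to test that theory.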

> 2. I don't know how to visualize instantaneous bandwidth, even though
> I have start and end times for all large messages (start and end of
> Isend, Irecv request completion time).  It's possible the problem is
> bad network load balancing.  I'm happy to whip up a plot if anyone has
> suggestions as to what to draw.

I'm not sure what the goal is here.  262 KiB messages don't strike me as large enough to worry much about bandwidth on a modern Cray machine.  Latencies still seem like the more useful metric, especially if you aren't seeing much variation as a function of message size.
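
Since you already have post and completion times for each message, per-message latency versus size (or versus wall-clock time, colored by size) is probably the more revealing plot.  A trivial sketch of the bookkeeping, with made-up names:

#include <mpi.h>
#include <stdio.h>

/* Record kept per outstanding request: when it was posted and how big the
 * message is.  Purely illustrative scaffolding, not your data structures. */
typedef struct {
    double t_post;   /* MPI_Wtime() taken right before the Isend/Irecv */
    int    bytes;    /* message size */
} msg_record;

/* Call when the request described by rec completes; plotting latency against
 * bytes shows whether the slow operations correlate with message size. */
static void log_latency(const msg_record *rec)
{
    double latency = MPI_Wtime() - rec->t_post;
    printf("%d bytes  %.6f s\n", rec->bytes, latency);
}

If the latencies come out roughly flat across sizes, that would point more toward lock contention and progress-engine scheduling than toward the network.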

> 3. Waking a communication thread out of an MPI_Waitsome is somewhat of
> a hack.  It's the only time the worker threads touch MPI; otherwise I
> could use MPI_THREAD_FUNNELED.  However, the only alternative I know
> is for the communication thread to poll on MPI_Testsome and check for
> thread completion in between.  Might that be better (it's an easy
> experiment to run)?

Frankly, it might, because of the aforementioned design.  What non-MPI mechanism for inter-thread signaling are you planning on using?
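
Roughly what that loop could look like, assuming workers can signal completion through plain shared memory (C11 atomics below, or whatever you already use for inter-thread signaling); the names and the termination test are made up for illustration, and since all MPI calls stay on one thread, MPI_THREAD_FUNNELED would suffice:

#include <mpi.h>
#include <sched.h>
#include <stdatomic.h>

atomic_int workers_done;    /* each worker does atomic_fetch_add(&workers_done, 1)
                               when it finishes -- no MPI call involved */

/* Single communication thread: poll MPI with Testsome (which returns
 * immediately) and check worker completion in between, instead of blocking in
 * Waitsome and being woken by an 8-byte Isend from another thread. */
void comm_loop(int nreqs, MPI_Request reqs[], int indices[], int nworkers)
{
    int active = nreqs;                      /* requests still outstanding */
    while (active > 0 || atomic_load(&workers_done) < nworkers) {
        int outcount = 0;
        MPI_Testsome(nreqs, reqs, &outcount, indices, MPI_STATUSES_IGNORE);
        for (int i = 0; i < outcount; i++) {
            /* handle reqs[indices[i]]; post any follow-up Isend/Irecv here,
               still from this one thread */
            active--;
        }
        /* ... hand out new work / post new requests as workers finish ... */
        sched_yield();                       /* optional: don't spin flat out */
    }
}

The trade-off is a busy-ish polling loop, but given the coarse lock described above, that may still beat having a second thread fight its way into the library.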

-Dave



