[mpich-discuss] low performance in an asynchronous, mixed MPI/pthreads app
Geoffrey Irving
irving at naml.us
Tue Jan 8 23:32:55 CST 2013
On Mon, Jan 7, 2013 at 10:44 PM, Geoffrey Irving <irving at naml.us> wrote:
> On Mon, Jan 7, 2013 at 10:53 AM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>> On Jan 6, 2013, at 8:57 PM CST, Geoffrey Irving wrote:
>>
>>> We're seeing much lower
>>> performance than expected, and are trying to understand why. Here is
>>> a visualization of a 96 core test run using 16 ranks with 6 threads
>>> per rank (4 Hopper nodes):
>>>
>>> http://naml.us/random/pentago/history-random17.png
>>>
>>> 1. All messages are sent using MPI_Isend, but Isend sometimes takes a
>>> very long time to return even for tiny (8 byte) messages. For
>>> example, the light blue "wakeup" messages are sent from a worker
>>> thread to the master communication thread to indicate that computation
>>> is complete, breaking the communication thread out of an MPI_Waitsome
>>> call (using MPI_THREAD_MULTIPLE). Some of the input request Isends
>>> also take a long time, and these are also 8 bytes. What would cause
>>> MPI_Isend to take a long time to return, both for small messages
>>> (wakeups and input requests) and for large messages (input responses
>>> and output sends)?
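(For concreteness, the wakeup path looks roughly like the sketch below; the
tag, names, and extern variables are illustrative, not the actual code:)

    #include <mpi.h>
    #include <stdint.h>

    #define WAKEUP_TAG 17          /* illustrative tag */
    extern MPI_Comm wakeup_comm;   /* communicator reserved for wakeups */
    extern int my_rank;

    /* Worker thread: 8 byte Isend to our own rank to break the comm
     * thread out of MPI_Waitsome.  Needs MPI_THREAD_MULTIPLE since this
     * runs outside the comm thread.  'payload' and 'req' must stay
     * valid until the request completes. */
    void send_wakeup(uint64_t* payload, MPI_Request* req) {
      MPI_Isend(payload, 8, MPI_BYTE, my_rank, WAKEUP_TAG,
                wakeup_comm, req);
    }

    /* Comm thread: keep one of these posted among the requests passed
     * to MPI_Waitsome, so a worker's Isend completes it and Waitsome
     * returns. */
    void post_wakeup_recv(uint64_t* buffer, MPI_Request* req) {
      MPI_Irecv(buffer, 8, MPI_BYTE, MPI_ANY_SOURCE, WAKEUP_TAG,
                wakeup_comm, req);
    }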
>>
>> One possibility is that MPICH-nemesis (on which Cray's MPI for the XE6 is based) does not have any fine-grained parallelism in the library. So in order for an Isend-ing thread to enter the MPI library, it must queue on a mutex until the thread currently occupying the progress engine decides to yield. In nemesis the default is 10 iterations of polling shared memory and then the network module, although Cray may have tweaked this number (I have not seen their source). So overly long netmod polling times could contribute to this delay.
>>
>> This design is simplistic, and we aren't especially happy with it, but so far we've found it to be very difficult to retrofit finer-grained parallelism into the existing nemesis codebase. My personal opinion is that a real fix will require a substantial rewrite of a large part of the lower level code.
>>
>> If I'm understanding your questions/explanations correctly, then this is likely related to issues 2 & 3 as well, possibly 4.
>>
>> […]
>>> Some things that might contribute to the problem:
>>>
>>> 1. Since the 92K input responses are sent in response to 8 byte
>>> request messages, I can prepost all matching Irecvs. However, output
>>> messages arrive unexpectedly, so the best I can do is post a few
>>> wildcard Irecvs (MPI_ANY_SOURCE, MPI_ANY_TAG on the output message
>>> communicator). Jeff thought wildcard Irecvs might not be good enough
>>> to hit the fast path. Moreover, output messages are not compressed,
>>> so most are 262K. I didn't compress them historically because
>>> compression isn't all that fast (I'm using Google's Snappy plus domain
>>> specific preconditioning), but that may be a good experiment to run.
>>
>> Preposted any_source+any_tag irecvs should be on the fast-enough path. The size is large enough that it might be tipping you over into a rendezvous instead of eager protocol, but only Cray can answer that for sure. I probably wouldn't pursue compression as a first next step.
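(For reference, the wildcard prepost is roughly the sketch below; the slot
count and buffer handling are illustrative, not the actual code:)

    #include <mpi.h>
    #include <stdlib.h>

    #define OUTPUT_SLOTS 16          /* how many recvs to keep posted */
    #define OUTPUT_BYTES (262*1024)  /* fits an uncompressed output */

    static char* output_bufs[OUTPUT_SLOTS];
    static MPI_Request output_reqs[OUTPUT_SLOTS];

    /* Post a pool of wildcard receives on the output communicator. */
    void prepost_output_recvs(MPI_Comm output_comm) {
      for (int i = 0; i < OUTPUT_SLOTS; i++) {
        output_bufs[i] = malloc(OUTPUT_BYTES);
        MPI_Irecv(output_bufs[i], OUTPUT_BYTES, MPI_BYTE,
                  MPI_ANY_SOURCE, MPI_ANY_TAG, output_comm,
                  &output_reqs[i]);
      }
    }

    /* After a slot completes (the status gives source, tag, and the
     * actual size via MPI_Get_count), process the message and repost
     * the same slot so the pool stays full. */
    void repost_output_recv(MPI_Comm output_comm, int i) {
      MPI_Irecv(output_bufs[i], OUTPUT_BYTES, MPI_BYTE,
                MPI_ANY_SOURCE, MPI_ANY_TAG, output_comm,
                &output_reqs[i]);
    }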
>
> And you're correct: at least in the current setup output compression
> slows things down, probably in part because it results in more
> intrarank messages (I had already started implementing this experiment
> before I got your email).
>
>>> 2. I don't know how to visualize instantaneous bandwidth, even though
>>> I have start and end times for all large messages (start and end of
>>> Isend, Irecv request completion time). It's possible the problem is
>>> bad network load balancing. I'm happy to whip up a plot if anyone has
>>> suggestions as to what to draw.
>>
>> I'm not sure what the goal is here. 262 KiB messages don't strike me as large enough to worry much about bandwidth on a modern Cray machine. Latencies seem like a more useful metric still, esp. if you aren't seeing much variation as a function of message size.
>>
>>> 3. Waking a communication thread out of an MPI_Waitsome is somewhat of
>>> a hack. It's the only time the worker threads touch MPI; otherwise I
>>> could use MPI_THREAD_FUNNELED. However, the only alternative I know
>>> is for the communication thread to poll on MPI_Testsome and check for
>>> thread completion in between. Might that be better (it's an easy
>>> experiment to run)?
>>
>> Frankly, it might, because of the aforementioned design. What non-MPI mechanism for inter-thread signaling are you planning on using?
>
> All the thread signalling is via pthread spinlocks and the occasional
> lockless spin when I can get away with it. In the MPI_Testsome case
> the communication thread would check a flag for inter-thread messages
> without grabbing a lock, then grab a spinlock and recheck.
>
> I'll implement this tomorrow and see how it fares.
Polling on MPI_Testsome with spinlocks under MPI_THREAD_FUNNELED does
significantly better than bare MPI_THREAD_MULTIPLE, and a little better
than asynchronous progress mode:
funneled, 1 comm thread and 5 workers: 157.0734 s
multiple, 1 comm thread and 5 workers: 208.9402 s
asynchronous progress, 1 progress thread, 1 comm thread, 4 workers: 169.9694 s
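For reference, the funneled comm thread loop is roughly the following
sketch (names and structure are simplified and illustrative, not the
actual code):

    #include <mpi.h>
    #include <pthread.h>

    #define MAX_REQUESTS 128                /* illustrative upper bound */

    extern int nreq;                        /* <= MAX_REQUESTS */
    extern MPI_Request reqs[MAX_REQUESTS];  /* outstanding Isends/Irecvs */
    extern pthread_spinlock_t queue_lock;   /* protects the wakeup queue */
    extern volatile int wakeup_pending;     /* set by workers under lock */

    void handle_completion(int index, MPI_Status* status);
    void drain_wakeups(void);               /* hand out new work, post sends */

    void comm_thread_loop(void) {
      int indices[MAX_REQUESTS];
      MPI_Status statuses[MAX_REQUESTS];
      for (;;) {
        /* Only this thread touches MPI, so MPI_THREAD_FUNNELED suffices. */
        int outcount = 0;
        MPI_Testsome(nreq, reqs, &outcount, indices, statuses);
        if (outcount == MPI_UNDEFINED)
          outcount = 0;                     /* all requests were inactive */
        for (int i = 0; i < outcount; i++)
          handle_completion(indices[i], &statuses[i]);

        /* Check for inter-thread messages without grabbing the lock,
         * then grab the spinlock and recheck (real code would use
         * proper atomics rather than volatile). */
        if (wakeup_pending) {
          pthread_spin_lock(&queue_lock);
          if (wakeup_pending) {
            wakeup_pending = 0;
            drain_wakeups();
          }
          pthread_spin_unlock(&queue_lock);
        }
      }
    }

The unlocked check means workers only pay for the spinlock when there is
actually something to hand off.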
Here are a few history trace screenshots (zooming in). The worker
thread stalls in wakeup are gone, of course, but otherwise it is
similar.
http://naml.us/random/pentago/history-funnel-random17.png
http://naml.us/random/pentago/history-funnel-random17-1.png
http://naml.us/random/pentago/history-funnel-random17-2.png
I'm going to try this on BlueGene next to see whether the performance is similar.
Geoffrey