[mpich-discuss] low performance in an asynchronous, mixed MPI/pthreads app
Geoffrey Irving
irving at naml.us
Tue Jan 8 00:44:36 CST 2013
On Mon, Jan 7, 2013 at 10:53 AM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> On Jan 6, 2013, at 8:57 PM CST, Geoffrey Irving wrote:
>
>> We're seeing much lower
>> performance than expected, and are trying to understand why. Here is
>> a visualization of a 96-core test run using 16 ranks with 6 threads
>> per rank (4 Hopper nodes):
>>
>> http://naml.us/random/pentago/history-random17.png
>>
>> 1. All messages are sent using MPI_Isend, but Isend sometimes takes a
>> very long time to return even for tiny (8 byte) messages. For
>> example, the light blue "wakeup" messages are sent from a worker
>> thread to the master communication thread to indicate that computation
>> is complete, breaking the communication thread out of an MPI_Waitsome
>> call (using MPI_THREAD_MULTIPLE). Some of the input request Isends
>> also take a long time, and these are also 8 bytes. What would cause
>> MPI_Isend to take a long time to return, both for small messages
>> (wakeups and input requests) and for large messages (input responses
>> and output sends)?
>
> One possibility is that MPICH-nemesis (on which Cray's MPI for the XE6 is based) does not have any fine-grained parallelism in the library. In order for an Isend-ing thread to enter the MPI library, it must queue on a mutex until the thread currently occupying the progress engine decides to yield. In nemesis the default is to yield after 10 iterations of polling shared memory and then the network module, although Cray may have tweaked this number (I have not seen their source). So overly long netmod polling times could contribute to this delay.
>
> This design is simplistic, and we aren't especially happy with it, but so far we've found it to be very difficult to retrofit finer-grained parallelism into the existing nemesis codebase. My personal opinion is that a real fix will require a substantial rewrite of a large part of the lower level code.
>
> If I'm understanding your questions/explanations correctly, then this is likely related to issues 2 & 3 as well, possibly 4.
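For concreteness, the wakeup path on my side looks roughly like the
sketch below; the tag, communicator, and buffer handling are simplified
placeholders rather than the actual code, but the MPI calls involved are
the same:

    #include <mpi.h>
    #include <stdint.h>

    #define WAKEUP_TAG 17   /* illustrative tag */

    /* Worker thread: 8-byte self-send that knocks the communication
       thread out of MPI_Waitsome.  Under nemesis with MPI_THREAD_MULTIPLE
       this Isend has to take the single global critical section, so it
       can stall until the thread sitting in the progress engine yields. */
    void signal_wakeup(MPI_Comm comm, uint64_t* payload, MPI_Request* req) {
      int self;
      MPI_Comm_rank(comm, &self);
      MPI_Isend(payload, 1, MPI_UINT64_T, self, WAKEUP_TAG, comm, req);
    }

    /* Communication thread: the matching wakeup Irecv is preposted along
       with all the other requests, and the thread blocks here. */
    void wait_for_events(MPI_Request* reqs, int n, int* outcount,
                         int* indices) {
      MPI_Waitsome(n, reqs, outcount, indices, MPI_STATUSES_IGNORE);
      /* ... dispatch completed requests, including wakeups ... */
    }

    int main(int argc, char** argv) {
      int provided;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      /* ... spawn pthread workers, run the communication loop ... */
      MPI_Finalize();
      return 0;
    }

If I understand the above, every one of these wakeups is a second thread
contending with the Waitsome thread for that single critical section.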
>
> […]
>> Some things that might contribute to the problem:
>>
>> 1. Since the 92K input responses are sent in response to 8 byte
>> request messages, I can prepost all matching Irecvs. However, output
>> messages arrive unexpectedly, so the best I can do is post a few
>> wildcard Irecvs (MPI_ANY_SOURCE, MPI_ANY_TAG on the output message
>> communicator). Jeff thought wildcard Irecvs might not be good enough
>> to hit the fast path. Moreover, output messages are not compressed,
>> so most are 262K. Historically I didn't compress them because
>> compression isn't all that fast (I'm using Google's Snappy plus
>> domain-specific preconditioning), but that may be a good experiment
>> to run.
>
> Preposted any_source+any_tag irecvs should be on the fast-enough path. The size is large enough that it might be tipping you over into the rendezvous protocol instead of the eager protocol, but only Cray can answer that for sure. I probably wouldn't pursue compression as a first next step.
And you're correct: at least in the current setup, output compression
slows things down, probably in part because it results in more
intra-rank messages (I had already started implementing this experiment
before I got your email).
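For reference, the wildcard prepost on the output communicator looks
roughly like the following; the buffer count is arbitrary, and the
262144-byte cap is just my reading of the 262K figure above:

    #include <mpi.h>
    #include <stdlib.h>

    #define OUTPUT_MAX_BYTES 262144   /* uncompressed output size above */
    #define NUM_PREPOSTED 4           /* "a few" wildcard Irecvs */

    char*       output_bufs[NUM_PREPOSTED];
    MPI_Request output_reqs[NUM_PREPOSTED];

    /* Post a handful of any_source/any_tag Irecvs on the output
       communicator so unexpected output messages land in preposted
       buffers rather than the unexpected-message queue. */
    void prepost_output_recvs(MPI_Comm output_comm) {
      for (int i = 0; i < NUM_PREPOSTED; i++) {
        output_bufs[i] = malloc(OUTPUT_MAX_BYTES);
        MPI_Irecv(output_bufs[i], OUTPUT_MAX_BYTES, MPI_BYTE,
                  MPI_ANY_SOURCE, MPI_ANY_TAG, output_comm,
                  &output_reqs[i]);
      }
    }

    /* After the comm thread sees one complete, process it and repost
       into the same slot. */
    void repost_output_recv(MPI_Comm output_comm, int i) {
      MPI_Irecv(output_bufs[i], OUTPUT_MAX_BYTES, MPI_BYTE,
                MPI_ANY_SOURCE, MPI_ANY_TAG, output_comm,
                &output_reqs[i]);
    }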
>> 2. I don't know how to visualize instantaneous bandwidth, even though
>> I have start and end times for all large messages (start and end of
>> Isend, Irecv request completion time). It's possible the problem is
>> bad network load balancing. I'm happy to whip up a plot if anyone has
>> suggestions as to what to draw.
>
> I'm not sure what the goal is here. 262 KiB messages don't strike me as large enough to worry much about bandwidth on a modern Cray machine. Latencies seem like a more useful metric still, esp. if you aren't seeing much variation as a function of message size.
>
>> 3. Waking a communication thread out of an MPI_Waitsome is somewhat of
>> a hack. It's the only time the worker threads touch MPI; otherwise I
>> could use MPI_THREAD_FUNNELED. However, the only alternative I know of
>> is for the communication thread to poll MPI_Testsome and check for
>> thread completion in between. Might that be better (it's an easy
>> experiment to run)?
>
> Frankly, it might, because of the aforementioned design. What non-MPI mechanism for inter-thread signaling are you planning on using?
All the thread signalling is via pthread spinlocks and the occasional
lockless spin when I can get away with it. In the MPI_Testsome case
the communication thread would check a flag for inter-thread messages
without grabbing a lock, then grab a spinlock and recheck.
I'll implement this tomorrow and see how it fares.
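Concretely, the polling loop I have in mind would look something like
this; the flag and lock names are made up, and the request and shutdown
bookkeeping is elided:

    #include <mpi.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    pthread_spinlock_t queue_lock;    /* protects the worker->comm queue */
    atomic_bool        pending_flag;  /* set by workers after enqueueing */

    /* Communication thread: alternate a nonblocking MPI progress check
       with a check of the inter-thread flag.  Workers never call MPI. */
    void comm_poll_loop(MPI_Request* reqs, int n) {
      int outcount;
      int indices[n];   /* assumes n > 0; real code manages this array */
      for (;;) {
        /* Nonblocking check on all outstanding requests. */
        MPI_Testsome(n, reqs, &outcount, indices, MPI_STATUSES_IGNORE);
        if (outcount != MPI_UNDEFINED && outcount > 0) {
          /* ... dispatch completed requests ... */
        }
        /* Lockless peek at the flag... */
        if (atomic_load(&pending_flag)) {
          /* ...then grab the spinlock and recheck before draining. */
          pthread_spin_lock(&queue_lock);
          if (atomic_load(&pending_flag)) {
            atomic_store(&pending_flag, false);
            /* ... drain worker -> comm messages ... */
          }
          pthread_spin_unlock(&queue_lock);
        }
        /* ... break when shutdown is requested ... */
      }
    }

The nice property is that the workers then never touch MPI at all, so I
could drop back to MPI_THREAD_FUNNELED.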
Thanks,
Geoffrey