[mpich-discuss] low performance in an asynchronous, mixed MPI/pthreads app
Geoffrey Irving
irving at naml.us
Tue Jan 8 23:32:55 CST 2013
On Mon, Jan 7, 2013 at 10:44 PM, Geoffrey Irving <irving at naml.us> wrote:
> On Mon, Jan 7, 2013 at 10:53 AM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>> On Jan 6, 2013, at 8:57 PM CST, Geoffrey Irving wrote:
>>
>>> We're seeing much lower
>>> performance than expected, and are trying to understand why. Here is
>>> a visualization of a 96 core test run using 16 ranks with 6 threads
>>> per rank (4 Hopper nodes):
>>>
>>> http://naml.us/random/pentago/history-random17.png
>>>
>>> 1. All messages are sent using MPI_Isend, but Isend sometimes takes a
>>> very long time to return even for tiny (8 byte) messages. For
>>> example, the light blue "wakeup" messages are sent from a worker
>>> thread to the master communication thread to indicate that computation
>>> is complete, breaking the communication thread out of an MPI_Waitsome
>>> call (using MPI_THREAD_MULTIPLE). Some of the input request Isends
>>> also take a long time, and these are also 8 bytes. What would cause
>>> MPI_Isend to take a long time to return, both for small messages
>>> (wakeups and input requests) and for large messages (input responses
>>> and output sends)?
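(For concreteness, the wakeup path looks roughly like the sketch below; the
tag, names, and extern variables are illustrative, not the actual code:)

    #include <mpi.h>
    #include <stdint.h>

    #define WAKEUP_TAG 17          /* illustrative tag */
    extern MPI_Comm wakeup_comm;   /* communicator reserved for wakeups */
    extern int my_rank;

    /* Worker thread: 8 byte Isend to our own rank to break the comm
     * thread out of MPI_Waitsome.  Needs MPI_THREAD_MULTIPLE since this
     * runs outside the comm thread.  'payload' and 'req' must stay
     * valid until the request completes. */
    void send_wakeup(uint64_t* payload, MPI_Request* req) {
      MPI_Isend(payload, 8, MPI_BYTE, my_rank, WAKEUP_TAG,
                wakeup_comm, req);
    }

    /* Comm thread: keep one of these posted among the requests passed
     * to MPI_Waitsome, so a worker's Isend completes it and Waitsome
     * returns. */
    void post_wakeup_recv(uint64_t* buffer, MPI_Request* req) {
      MPI_Irecv(buffer, 8, MPI_BYTE, MPI_ANY_SOURCE, WAKEUP_TAG,
                wakeup_comm, req);
    }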
>>
>> One possibility is that MPICH-nemesis (on which Cray's MPI for the XE6 is based) does not have any fine-grained parallelism in the library. So in order for an Isend-ing thread to enter the MPI library, it must queue on a mutex until the thread currently occupying the progress engine decides to yield. In nemesis the default is 10 iterations of polling shared memory and then the network module, although Cray may have tweaked this number (I have not seen their source). So overly long netmod polling times could contribute to this delay.
>>
>> This design is simplistic, and we aren't especially happy with it, but so far we've found it to be very difficult to retrofit finer-grained parallelism into the existing nemesis codebase. My personal opinion is that a real fix will require a substantial rewrite of a large part of the lower level code.
>>
>> If I'm understanding your questions/explanations correctly, then this is likely related to issues 2 & 3 as well, possibly 4.
>>
>> […]
>>> Some things that might contribute to the problem:
>>>
>>> 1. Since the 92K input responses are sent in response to 8 byte
>>> request messages, I can prepost all matching Irecvs. However, output
>>> messages arrive unexpectedly, so the best I can do is post a few
>>> wildcard Irecvs (MPI_ANY_SOURCE, MPI_ANY_TAG on the output message
>>> communicator). Jeff thought wildcard Irecvs might not be good enough
>>> to hit the fast path. Moreover, output messages are not compressed,
>>> so most are 262K. I didn't compress them historically because
>>> compression isn't all that fast (I'm using Google's Snappy plus domain
>>> specific preconditioning), but that may be a good experiment to run.
>>
>> Preposted any_source+any_tag irecvs should be on the fast-enough path. The size is large enough that it might be tipping you over into a rendezvous instead of eager protocol, but only Cray can answer that for sure. I probably wouldn't pursue compression as a first next step.
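(For reference, the wildcard prepost is roughly the sketch below; the slot
count and buffer handling are illustrative, not the actual code:)

    #include <mpi.h>
    #include <stdlib.h>

    #define OUTPUT_SLOTS 16          /* how many recvs to keep posted */
    #define OUTPUT_BYTES (262*1024)  /* fits an uncompressed output */

    static char* output_bufs[OUTPUT_SLOTS];
    static MPI_Request output_reqs[OUTPUT_SLOTS];

    /* Post a pool of wildcard receives on the output communicator. */
    void prepost_output_recvs(MPI_Comm output_comm) {
      for (int i = 0; i < OUTPUT_SLOTS; i++) {
        output_bufs[i] = malloc(OUTPUT_BYTES);
        MPI_Irecv(output_bufs[i], OUTPUT_BYTES, MPI_BYTE,
                  MPI_ANY_SOURCE, MPI_ANY_TAG, output_comm,
                  &output_reqs[i]);
      }
    }

    /* After a slot completes (the status gives source, tag, and the
     * actual size via MPI_Get_count), process the message and repost
     * the same slot so the pool stays full. */
    void repost_output_recv(MPI_Comm output_comm, int i) {
      MPI_Irecv(output_bufs[i], OUTPUT_BYTES, MPI_BYTE,
                MPI_ANY_SOURCE, MPI_ANY_TAG, output_comm,
                &output_reqs[i]);
    }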
>
> And you're correct: at least in the current setup output compression
> slows things down, probably in part because it results in more
> intrarank messages (I had already started implementing this experiment
> before I got your email).
>
>>> 2. I don't know how to visualize instantaneous bandwidth, even though
>>> I have start and end times for all large messages (start and end of
>>> Isend, Irecv request completion time). It's possible the problem is
>>> bad network load balancing. I'm happy to whip up a plot if anyone has
>>> suggestions as to what to draw.
>>
>> I'm not sure what the goal is here. 262 KiB messages don't strike me as large enough to worry much about bandwidth on a modern Cray machine. Latencies seem like a more useful metric still, esp. if you aren't seeing much variation as a function of message size.
>>
>>> 3. Waking a communication thread out of an MPI_Waitsome is somewhat of
>>> a hack. It's the only time the worker threads touch MPI; otherwise I
>>> could use MPI_THREAD_FUNNELED. However, the only alternative I know
>>> is for the communication thread to poll on MPI_Testsome and check for
>>> thread completion in between. Might that be better (it's an easy
>>> experiment to run)?
>>
>> Frankly, it might, because of the aforementioned design. What non-MPI mechanism for inter-thread signaling are you planning on using?
>
> All the thread signalling is via pthread spinlocks and the occasional
> lockless spin when I can get away with it. In the MPI_Testsome case
> the communication thread would check a flag for inter-thread messages
> without grabbing a lock, then grab a spinlock and recheck.
>
> I'll implement this tomorrow and see how it fares.
Polling on MPI_Testsome with spinlocks under MPI_THREAD_FUNNELED does
significantly better than bare MPI_THREAD_MULTIPLE, and a little better
than asynchronous progress mode:
funneled, 1 comm thread and 5 workers: 157.0734 s
multiple, 1 comm thread and 5 workers: 208.9402 s
asynchronous progress, 1 progress thread, 1 comm thread, 4 workers: 169.9694 s
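For reference, the funneled comm thread loop is roughly the following
sketch (names and structure are simplified and illustrative, not the
actual code):

    #include <mpi.h>
    #include <pthread.h>

    #define MAX_REQUESTS 128                /* illustrative upper bound */

    extern int nreq;                        /* <= MAX_REQUESTS */
    extern MPI_Request reqs[MAX_REQUESTS];  /* outstanding Isends/Irecvs */
    extern pthread_spinlock_t queue_lock;   /* protects the wakeup queue */
    extern volatile int wakeup_pending;     /* set by workers under lock */

    void handle_completion(int index, MPI_Status* status);
    void drain_wakeups(void);               /* hand out new work, post sends */

    void comm_thread_loop(void) {
      int indices[MAX_REQUESTS];
      MPI_Status statuses[MAX_REQUESTS];
      for (;;) {
        /* Only this thread touches MPI, so MPI_THREAD_FUNNELED suffices. */
        int outcount = 0;
        MPI_Testsome(nreq, reqs, &outcount, indices, statuses);
        if (outcount == MPI_UNDEFINED)
          outcount = 0;                     /* all requests were inactive */
        for (int i = 0; i < outcount; i++)
          handle_completion(indices[i], &statuses[i]);

        /* Check for inter-thread messages without grabbing the lock,
         * then grab the spinlock and recheck (real code would use
         * proper atomics rather than volatile). */
        if (wakeup_pending) {
          pthread_spin_lock(&queue_lock);
          if (wakeup_pending) {
            wakeup_pending = 0;
            drain_wakeups();
          }
          pthread_spin_unlock(&queue_lock);
        }
      }
    }

The unlocked check means workers only pay for the spinlock when there is
actually something to hand off.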
Here are a few history trace screenshots (zooming in). The worker
thread stalls in wakeup are gone, of course, but otherwise it is
similar.
http://naml.us/random/pentago/history-funnel-random17.png
http://naml.us/random/pentago/history-funnel-random17-1.png
http://naml.us/random/pentago/history-funnel-random17-2.png
I'm going to try this on BlueGene next to see whether the performance is similar.
Geoffrey