[mpich-discuss] low performance in an asynchronous, mixed MPI/pthreads app
Geoffrey Irving
irving at naml.us
Tue Jan 8 00:44:36 CST 2013
On Mon, Jan 7, 2013 at 10:53 AM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> On Jan 6, 2013, at 8:57 PM CST, Geoffrey Irving wrote:
>
>> We're seeing much lower
>> performance than expected, and are trying to understand why. Here is
>> a visualization of a 96-core test run using 16 ranks with 6 threads
>> per rank (4 Hopper nodes):
>>
>> http://naml.us/random/pentago/history-random17.png
>>
>> 1. All messages are sent using MPI_Isend, but Isend sometimes takes a
>> very long time to return even for tiny (8 byte) messages. For
>> example, the light blue "wakeup" messages are sent from a worker
>> thread to the master communication thread to indicate that computation
>> is complete, breaking the communication thread out of an MPI_Waitsome
>> call (using MPI_THREAD_MULTIPLE). Some of the input request Isends
>> also take a long time, and these are also 8 bytes. What would cause
>> MPI_Isend to take a long time to return, both for small messages
>> (wakeups and input requests) and for large messages (input responses
>> and output sends)?
>
> One possibility is that MPICH-nemesis (on which Cray's MPI for the XE6 is based) does not have any fine-grained parallelism in the library. In order for an Isend-ing thread to enter the MPI library, it must queue on a mutex until the thread currently occupying the progress engine decides to yield. In nemesis the default is to yield after 10 iterations of polling shared memory and then the network module, although Cray may have tweaked this number (I have not seen their source). So overly long netmod polling times could contribute to this delay.
>
> This design is simplistic, and we aren't especially happy with it, but so far we've found it to be very difficult to retrofit finer-grained parallelism into the existing nemesis codebase. My personal opinion is that a real fix will require a substantial rewrite of a large part of the lower level code.
>
> If I'm understanding your questions/explanations correctly, then this is likely related to issues 2 & 3 as well, possibly 4.
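For concreteness, the wakeup path on my side looks roughly like the
sketch below; the tag, communicator, and buffer handling are simplified
placeholders rather than the actual code, but the MPI calls involved are
the same:

    #include <mpi.h>
    #include <stdint.h>

    #define WAKEUP_TAG 17   /* illustrative tag */

    /* Worker thread: 8-byte self-send that knocks the communication
       thread out of MPI_Waitsome.  Under nemesis with MPI_THREAD_MULTIPLE
       this Isend has to take the single global critical section, so it
       can stall until the thread sitting in the progress engine yields. */
    void signal_wakeup(MPI_Comm comm, uint64_t* payload, MPI_Request* req) {
      int self;
      MPI_Comm_rank(comm, &self);
      MPI_Isend(payload, 1, MPI_UINT64_T, self, WAKEUP_TAG, comm, req);
    }

    /* Communication thread: the matching wakeup Irecv is preposted along
       with all the other requests, and the thread blocks here. */
    void wait_for_events(MPI_Request* reqs, int n, int* outcount,
                         int* indices) {
      MPI_Waitsome(n, reqs, outcount, indices, MPI_STATUSES_IGNORE);
      /* ... dispatch completed requests, including wakeups ... */
    }

    int main(int argc, char** argv) {
      int provided;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      /* ... spawn pthread workers, run the communication loop ... */
      MPI_Finalize();
      return 0;
    }

If I understand the above, every one of these wakeups is a second thread
contending with the Waitsome thread for that single critical section.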
>
> […]
>> Some things that might contribute to the problem:
>>
>> 1. Since the 92K input responses are sent in response to 8 byte
>> request messages, I can prepost all matching Irecvs. However, output
>> messages arrive unexpectedly, so the best I can do is post a few
>> wildcard Irecvs (MPI_ANY_SOURCE, MPI_ANY_TAG on the output message
>> communicator). Jeff thought wildcard Irecvs might not be good enough
>> to hit the fast path. Moreover, output messages are not compressed,
>> so most are 262K. Historically I didn't compress them because
>> compression isn't all that fast (I'm using Google's Snappy plus
>> domain-specific preconditioning), but that may be a good experiment
>> to run.
>
> Preposted any_source+any_tag irecvs should be on the fast-enough path. The size is large enough that it might be tipping you over into the rendezvous protocol instead of the eager protocol, but only Cray can answer that for sure. I probably wouldn't pursue compression as a first next step.
And you're correct: at least in the current setup, output compression
slows things down, probably in part because it results in more
intra-rank messages (I had already started implementing this experiment
before I got your email).
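For reference, the wildcard prepost on the output communicator looks
roughly like the following; the buffer count is arbitrary, and the
262144-byte cap is just my reading of the 262K figure above:

    #include <mpi.h>
    #include <stdlib.h>

    #define OUTPUT_MAX_BYTES 262144   /* uncompressed output size above */
    #define NUM_PREPOSTED 4           /* "a few" wildcard Irecvs */

    char*       output_bufs[NUM_PREPOSTED];
    MPI_Request output_reqs[NUM_PREPOSTED];

    /* Post a handful of any_source/any_tag Irecvs on the output
       communicator so unexpected output messages land in preposted
       buffers rather than the unexpected-message queue. */
    void prepost_output_recvs(MPI_Comm output_comm) {
      for (int i = 0; i < NUM_PREPOSTED; i++) {
        output_bufs[i] = malloc(OUTPUT_MAX_BYTES);
        MPI_Irecv(output_bufs[i], OUTPUT_MAX_BYTES, MPI_BYTE,
                  MPI_ANY_SOURCE, MPI_ANY_TAG, output_comm,
                  &output_reqs[i]);
      }
    }

    /* After the comm thread sees one complete, process it and repost
       into the same slot. */
    void repost_output_recv(MPI_Comm output_comm, int i) {
      MPI_Irecv(output_bufs[i], OUTPUT_MAX_BYTES, MPI_BYTE,
                MPI_ANY_SOURCE, MPI_ANY_TAG, output_comm,
                &output_reqs[i]);
    }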
>> 2. I don't know how to visualize instantaneous bandwidth, even though
>> I have start and end times for all large messages (start and end of
>> Isend, Irecv request completion time). It's possible the problem is
>> bad network load balancing. I'm happy to whip up a plot if anyone has
>> suggestions as to what to draw.
>
> I'm not sure what the goal is here. 262 KiB messages don't strike me as large enough to worry much about bandwidth on a modern Cray machine. Latencies seem like a more useful metric still, esp. if you aren't seeing much variation as a function of message size.
>
>> 3. Waking a communication thread out of an MPI_Waitsome is somewhat of
>> a hack. It's the only time the worker threads touch MPI; otherwise I
>> could use MPI_THREAD_FUNNELED. However, the only alternative I know of
>> is for the communication thread to poll MPI_Testsome and check for
>> thread completion in between. Might that be better (it's an easy
>> experiment to run)?
>
> Frankly, it might, because of the aforementioned design. What non-MPI mechanism for inter-thread signaling are you planning on using?
All the thread signalling is via pthread spinlocks and the occasional
lockless spin when I can get away with it. In the MPI_Testsome case
the communication thread would check a flag for inter-thread messages
without grabbing a lock, then grab a spinlock and recheck.
I'll implement this tomorrow and see how it fares.
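Concretely, the polling loop I have in mind would look something like
this; the flag and lock names are made up, and the request and shutdown
bookkeeping is elided:

    #include <mpi.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    pthread_spinlock_t queue_lock;    /* protects the worker->comm queue */
    atomic_bool        pending_flag;  /* set by workers after enqueueing */

    /* Communication thread: alternate a nonblocking MPI progress check
       with a check of the inter-thread flag.  Workers never call MPI. */
    void comm_poll_loop(MPI_Request* reqs, int n) {
      int outcount;
      int indices[n];   /* assumes n > 0; real code manages this array */
      for (;;) {
        /* Nonblocking check on all outstanding requests. */
        MPI_Testsome(n, reqs, &outcount, indices, MPI_STATUSES_IGNORE);
        if (outcount != MPI_UNDEFINED && outcount > 0) {
          /* ... dispatch completed requests ... */
        }
        /* Lockless peek at the flag... */
        if (atomic_load(&pending_flag)) {
          /* ...then grab the spinlock and recheck before draining. */
          pthread_spin_lock(&queue_lock);
          if (atomic_load(&pending_flag)) {
            atomic_store(&pending_flag, false);
            /* ... drain worker -> comm messages ... */
          }
          pthread_spin_unlock(&queue_lock);
        }
        /* ... break when shutdown is requested ... */
      }
    }

The nice property is that the workers then never touch MPI at all, so I
could drop back to MPI_THREAD_FUNNELED.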
Thanks,
Geoffrey