[mpich-discuss] Support for MIC in MPICH 3.0.4

Maciej.Golebiewski at csiro.au
Tue Jul 9 20:09:05 CDT 2013


Hi Jeff,

> Does your code actually scale to 240 MPI ranks per card?  What is
> the performance relative to peak or a fully-populated MPI-only run
> on e.g SNB?

I don't expect it to scale to that many ranks; I mentioned 240 ranks/card in my previous email only to demonstrate that I was able to start that many ranks with Intel MPI.

Performance peaks at 60 ranks on the MIC and is still about 4.5 times slower than 12 ranks on a dual-socket system with 6-core X5670 Xeons (on which it scales up to 36 ranks over 3 nodes; the grid is only about 15K points, which is probably what prevents further scaling).

While the MIC cores are slower than Xeon cores, being 4.5 times slower with 5 times as many ranks was quite disappointing. My first hunch was that perhaps Intel MPI does not handle many simultaneous immediate send/recv requests well (according to the Vampir traces the node leader is always ready for the workers to talk to it, but the workers take a long time to complete either the receive of input or the send of results), so I wanted to compare the performance using another MPI implementation.
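To make it concrete, the traffic pattern I mean looks roughly like the sketch below (names, counts and message sizes are made up for illustration, not taken from the actual code): the node leader posts one nonblocking receive per worker and waits on all of them, so with 60 ranks on the card there can be ~59 requests outstanding at once.

      program leader_sketch
        ! Illustrative only: just the shape of the traffic, not our code.
        use mpi
        implicit none
        integer, parameter :: chunk = 256          ! placeholder message size
        integer :: ierr, rank, nprocs, w, nworkers
        integer, allocatable :: reqs(:)
        real(8), allocatable :: results(:,:)
        real(8) :: buf(chunk)

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
        nworkers = nprocs - 1

        if (rank == 0) then
           ! Leader: one immediate receive per worker, all in flight at once.
           allocate(reqs(nworkers), results(chunk, nworkers))
           do w = 1, nworkers
              call MPI_Irecv(results(1, w), chunk, MPI_DOUBLE_PRECISION, &
                             w, 0, MPI_COMM_WORLD, reqs(w), ierr)
           end do
           call MPI_Waitall(nworkers, reqs, MPI_STATUSES_IGNORE, ierr)
        else
           ! Worker: sends its chunk of results back to the leader.
           buf = real(rank, 8)
           call MPI_Send(buf, chunk, MPI_DOUBLE_PRECISION, 0, 0, &
                         MPI_COMM_WORLD, ierr)
        end if

        call MPI_Finalize(ierr)
      end program leader_sketch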

After reading your comment about the ring interconnect, I now consider the possibility that I'm instead hitting the limits of the interconnect.

> You're going to have to convert to MPI+Threads eventually so you

That's one of those things that are obvious from a technical point of view but are not going to happen any time soon due to various non-technical impediments.

> If you don't like OpenMP, you could consider Pthreads or TBB

I've done my fair share of Pthreads programming in the past, and recommending it over OpenMP for a scientific application actively developed by people who are not expert programmers is an evil thing. Very, very evil. ;)

> OpenMP, it's because you have an F77  code with common blocks that
> cannot be made thread-safe for all the beer in Australia.

Close enough, but not that bad. It's F90 with data in multiple modules; the main time loop could be nicely task-parallelized. The main effort would be to refactor the code so that no subroutine accesses data through modules, only through its arguments (roughly the kind of change sketched below), and then to forever keep policing the scientists to stick to this rule. :)
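A toy example of the refactoring I mean (made-up names, not our actual modules): instead of a subroutine reaching into a module for its state, the state is passed in explicitly, so the caller decides what is shared and what is private to a task or thread.

      ! Toy example only, not the real code.
      module state_mod
        implicit none
        real(8) :: field(1000)
      end module state_mod

      ! Before: reads/writes module data directly, so two tasks calling
      ! this concurrently would race on "field".
      subroutine step_unsafe(dt)
        use state_mod
        implicit none
        real(8), intent(in) :: dt
        field = field + dt
      end subroutine step_unsafe

      ! After: all data comes in through the argument list, so it can be
      ! called safely on per-task data.
      subroutine step_safe(field, n, dt)
        implicit none
        integer, intent(in) :: n
        real(8), intent(inout) :: field(n)
        real(8), intent(in) :: dt
        field = field + dt
      end subroutine step_safe

      program demo
        use state_mod
        implicit none
        field = 0.0d0
        call step_unsafe(0.1d0)
        call step_safe(field, size(field), 0.1d0)
        print *, field(1)
      end program demo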

Again, the technical side is not the problem, it's the external circumstances.

Anyway, thanks for your comments. Replying to them helped me tidy my thoughts and get a better idea of how to present my results and conclusions to the user.

Cheers,

Maciej


