[mpich-discuss] Hydra fills in the MPIR_proctable[] incorrectly with multiple processes per node

John DelSignore John.DelSignore at roguewave.com
Mon Mar 18 11:56:22 CDT 2013


Pavan Balaji wrote:
> Hi John,
> 
> On 03/12/2013 11:32 AM US Central Time, John DelSignore wrote:
> If possible, it would be good if MPICH could be changed to share the
> host and executable name character strings across multiple process
> descriptor entries.
> 
> This is a maintenance hassle with respect to chasing dangling pointers
> in the future.

I've never looked at the MPICH implementation, so I don't understand the issue with dangling pointers.

However, I do know that other MPI implementations share common strings across entries in the MPIR proctable, and as a result we have seen significant performance benefits when extracting it.

>  Also, I'm not sure how much space this saves.

It's not purely a matter of saving space (which can be significant), it's also a matter of saving time when the tool extracts the strings from the starter process. On systems like IBM Blue Gene/Q we are beginning to deal with MPI jobs approaching and exceeding one million processes.

Let's do a rough calculation on a BG/Q job that has 1,048,576 processes spread across 65,536 compute nodes (16 processes per node), with an IO-node to compute-node ratio of 128:1. Also assume that the absolute path to the executable is "only" 50 characters long, that it's an SPMD code (one executable for all processes), and that the host name strings are IP addresses averaging 15 characters long.

Without the "common string" optimization, the executable strings would occupy at least 50MB of space, and the host name strings would occupy 15MB of space. But the thing that really hurts is that the tool would have to read 2 million separate strings using an interface like ptrace() or /proc, which can be slow.

With the "common string" optimization, the executable string would occupy about 50 bytes of space, and the host name strings would occupy 7,680 bytes ((1,048,576 / 16 / 128) * 15). But the bigger win is that the tool would have to read only 513 separate strings using ptrace() or /proc, which represents a big time savings.

> We are working with the MPI Forum on a more scalable interface where
> this information is distributed across various processes.  That might be
> a better model here.

I'm on the MPI Tools Working Group, and we are many, many miles away from a new MPIR interface, which we've been calling "MPIR v2". There is no MPIR v2 proposal on the table, and there seems to be virtually no interest in creating one. From my perspective, IBM has proven that MPIR v1 can scale to millions of processes, so it seems unlikely to me that there will be an MPIR v2 anytime soon. Besides, I think the real motivator for an MPIR v2 will be to support MPI dynamic process creation, not scalability, but there doesn't seem to be much demand for that either.

Cheers, John D.


>  -- Pavan
> 



