[mpich-discuss] Implementation of MPICH collectives

Pavan Balaji balaji at mcs.anl.gov
Thu Sep 12 13:00:24 CDT 2013


The communication with the process manager (mpiexec) is only to look up information about remote processes.  The process manager is not involved in most MPI operations, e.g., send/recv or collectives.

The process manager (Hydra, as well as the other process managers in mpich and all its derivatives) essentially provides a key-value database.  Each process can put some key/vals in there and look up key/vals put by other processes.  When hydra_pmi_proxy starts the MPI processes, each process puts a "business card" in the key-value space.  After that, if process 0 wants to communicate with process 1, it'll look up the business card for process 1 and establish a communication channel with it (shared memory in this case).  After the communication channel is set up, all communication happens directly between the two processes without involving the process manager.
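Roughly, that put/lookup flow can be modeled as follows.  This is a toy Python sketch of the idea, not MPICH's actual PMI interface (the real one is a C API); all names here (KVStore, publish_business_card, the "shm:..." card format, etc.) are made up for illustration:

```python
class KVStore:
    """Stands in for the key-value database the process manager provides."""
    def __init__(self):
        self._db = {}

    def put(self, key, value):
        self._db[key] = value

    def get(self, key):
        return self._db[key]


class Process:
    def __init__(self, rank, kvs):
        self.rank = rank
        self.kvs = kvs
        self.channels = {}  # peer rank -> established channel

    def publish_business_card(self):
        # At startup, each process advertises how it can be reached.
        # The "shm:" card format is invented for this sketch.
        self.kvs.put(f"card-{self.rank}", f"shm:segment-{self.rank}")

    def connect(self, peer_rank):
        # On first contact with a peer: look up its business card once,
        # set up a channel, and afterwards communicate directly without
        # going back to the process manager.
        if peer_rank not in self.channels:
            card = self.kvs.get(f"card-{peer_rank}")
            self.channels[peer_rank] = card
        return self.channels[peer_rank]


kvs = KVStore()
procs = [Process(r, kvs) for r in range(2)]
for p in procs:
    p.publish_business_card()

# Process 0 wants to talk to process 1:
channel = procs[0].connect(1)
print(channel)  # shm:segment-1
```

The point of the model is only the shape of the protocol: publish once at startup, look up on first use, then bypass the process manager entirely.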

That's of course a very rough layout of how things happen.  In practice, there are several optimizations to make this more efficient, with the distributed proxies exchanging and caching information for efficiency.  Also, the exchange of business card information with the process manager is eager in some cases (e.g., for shared memory) and lazy in other cases (e.g., for network connections).
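The eager-vs-lazy distinction can be sketched in the same toy style (again, hypothetical names, not the real Hydra proxy code): a proxy can either pull all business cards up front or fetch each one on first use and cache it.

```python
class Proxy:
    """Toy stand-in for a proxy that caches business-card lookups."""
    def __init__(self, kvs):
        self.kvs = kvs      # the global key-value space
        self.cache = {}     # locally cached cards

    def prefetch(self, ranks):
        # Eager exchange: pull all cards up front (cheap when, e.g.,
        # everything is reachable over shared memory).
        for r in ranks:
            self.cache[r] = self.kvs[f"card-{r}"]

    def lookup(self, rank):
        # Lazy exchange: fetch a card only on first use, then serve
        # subsequent requests from the cache (avoids paying for
        # connections, e.g. network ones, that are never used).
        if rank not in self.cache:
            self.cache[rank] = self.kvs[f"card-{rank}"]
        return self.cache[rank]


kvs = {f"card-{r}": f"addr-{r}" for r in range(4)}

eager = Proxy(kvs)
eager.prefetch(range(4))          # all cards cached immediately

lazy = Proxy(kvs)
lazy.lookup(2)                    # only rank 2's card fetched so far
```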

 -- Pavan

On Sep 12, 2013, at 12:36 PM, Jiri Simsa wrote:

> Hello,
> I have been trying to understand how MPICH implements collective operations. To do so, I have been reading the MPICH source code and stepping through mpiexec executions. 
> For the sake of this discussion, let's assume that all MPI processes are executed on the same computer using: mpiexec -n <n> <mpi_program>
> This is my current abstract understanding of MPICH:
> - mpiexec spawns a hydra_pmi_proxy process, which in turn spawns <n> instances of <mpi_program>
> - hydra_pmi_proxy process uses socket pairs to communicate with the instances of <mpi_program>
> I am not quite sure though what happens under the hood when a collective operation, such as MPI_Allreduce, is executed. I have noticed that instances of <mpi_program> create and listen on a socket in the course of executing MPI_Allreduce, but I am not sure who connects to these sockets. Any chance someone could describe the data flow inside of MPICH when a collective operation, such as MPI_Allreduce, is executed? Thanks!
> Best,
> --Jiri Simsa
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
