[mpich-discuss] Implementation of MPICH collectives

Jiri Simsa jsimsa at cs.cmu.edu
Sat Sep 14 12:33:34 CDT 2013


Using the configure option does the right thing. Thank you.
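
For reference, a rebuild along those lines would look roughly like this (the
install prefix is illustrative; mpich-3.0.4 is the version used elsewhere in
this thread):

  tar xzf mpich-3.0.4.tar.gz && cd mpich-3.0.4
  ./configure --prefix=$HOME/mpich-nolocal --enable-nemesis-dbg-nolocal
  make && make install
  $HOME/mpich-nolocal/bin/mpiexec -n 2 ./simple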

--Jiri


On Fri, Sep 13, 2013 at 10:23 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:

> Or try configuring with the option --enable-nemesis-dbg-nolocal. It should
> disable shared memory communication in the build.
>
> On Sep 13, 2013, at 8:17 PM, Jiri Simsa wrote:
>
> > Yes. To verify the behavior I wrote a simple test program:
> >
> > #include "mpi.h"
> > #include <stdlib.h>
> > #include <string.h>
> >
> > int main(int argc, char **argv) {
> >   char message[256];
> >   int rank;
> >   if (getenv("MPIR_PARAM_CH3_NO_LOCAL") != NULL) {
> >     printf("MPIR_PARAM_CH3_NO_LOCAL = %s\n",
> getenv("MPIR_PARAM_CH3_NO_LOCAL"));
> >   }
> >   MPI_Init(&argc, &argv);
> >   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >   if (rank == 0) { strncpy(message, "Hello!", strlen("Hello!")); }
> >   MPI_Bcast(message, strlen("Hello!"), MPI_CHAR, 0, MPI_COMM_WORLD);
> >   MPI_Finalize();
> >   printf("%d: %s\n", rank, message);
> >   return 0;
> > }
> >
> > When I run it with "mpiexec -n 2 ./simple" I get the following output:
> >
> > MPIR_PARAM_CH3_NO_LOCAL = 1
> > MPIR_PARAM_CH3_NO_LOCAL = 1
> > 0: Hello!
> > 1: Hello!
> >
> > I have compiled mpich-3.0.4 with --enable-g=dbg,log, set the MPICH_DBG
> > environment variable to FILE, and set the MPICH_DBG_LEVEL environment
> > variable to VERBOSE. I am attaching the log file for process 0, which
> > shows (to the best of my understanding) that the broadcast uses the
> > fastbox (fbox) and memcpy to transfer the data.
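
For reference, that logging can be driven from the command line, assuming the
--enable-g=dbg,log build is the one being used:

  MPICH_DBG=FILE MPICH_DBG_LEVEL=VERBOSE mpiexec -n 2 ./simple

which should produce one dbg*.log file per process in the working directory
(the attached dbg0-*.log is the rank 0 file).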
> >
> >
> > On Fri, Sep 13, 2013 at 8:56 PM, Pavan Balaji <balaji at mcs.anl.gov>
> wrote:
> >
> > Not really.  It shouldn't be using the nemesis fast box.  Are you
> setting the environment correctly?
> >
> >  -- Pavan
> >
> > On Sep 13, 2013, at 7:08 PM, Jiri Simsa wrote:
> >
> > > To be more precise, I don't see any such call before MPI_Bcast()
> > > returns in the root. Is MPICH buffering the data to be broadcast
> > > until some later point?
> > >
> > > --Jiri
> > >
> > >
> > > On Fri, Sep 13, 2013 at 7:55 PM, Jiri Simsa <jsimsa at cs.cmu.edu> wrote:
> > > Well, it seems like it is copying data from the "nemesis fastbox". More
> > > importantly, I don't see any call to socket(), connect(), send(),
> > > sendto(), or sendmsg() that I would expect to be part of the data transfer.
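
One low-level way to check that, assuming a Linux host with strace available,
is to run each rank under strace and inspect the per-process traces for
network-related calls:

  mpiexec -n 2 strace -ff -e trace=network -o trace ./simple
  grep socket trace.*

If the broadcast really goes through shared memory, one would expect sockets
to show up only for startup/PMI traffic rather than for the message payload
itself.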
> > >
> > > --Jiri
> > >
> > >
> > > On Fri, Sep 13, 2013 at 5:44 PM, Pavan Balaji <balaji at mcs.anl.gov>
> wrote:
> > >
> > > Depends on what the memcpy is doing.  It might be some internal data
> manipulation.
> > >
> > >  -- Pavan
> > >
> > > On Sep 13, 2013, at 4:34 PM, Jiri Simsa wrote:
> > >
> > > > Hm, I set that variable and then stepped through a program that
> > > > calls MPI_Bcast (using mpiexec -n 2 <program> on a single node).
> > > > MPI_Bcast still seems to use memcpy(), whereas I would expect it to
> > > > use the sockets interface. Is the memcpy() to be expected?
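
In case it is useful to others, one common way to do this kind of stepping is
to start each rank in its own debugger window (this assumes an X display and
xterm are available; the program name is a placeholder):

  mpiexec -n 2 xterm -e gdb --args ./program
  # then, in each gdb window:
  (gdb) break MPI_Bcast
  (gdb) run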
> > > >
> > > > --Jiri
> > > >
> > > >
> > > > On Fri, Sep 13, 2013 at 10:25 AM, Pavan Balaji <balaji at mcs.anl.gov>
> wrote:
> > > >
> > > > Yes, you can set the environment variable MPIR_PARAM_CH3_NOLOCAL=1.
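
Taking the parameter name exactly as written above, that would mean launching
the test as, for example:

  MPIR_PARAM_CH3_NOLOCAL=1 mpiexec -n 2 ./simple

or passing it explicitly through hydra:

  mpiexec -genv MPIR_PARAM_CH3_NOLOCAL 1 -n 2 ./simple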
> > > >
> > > >  -- Pavan
> > > >
> > > > On Sep 13, 2013, at 7:53 AM, Jiri Simsa wrote:
> > > >
> > > > > Pavan,
> > > > >
> > > > > Thank you for your answer. That's precisely what I was looking
> for. Any chance there is a way to force the intranode communication to use
> tcp?
> > > > >
> > > > > --Jiri
> > > > >
> > > > > Within the node, it uses shared memory.  Outside the node, it
> depends on the netmod you configured with.  tcp is the default netmod.
> > > > >  -- Pavan
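
The netmod itself is fixed at configure time. As a point of reference, the
default mpich-3.0.x build corresponds to a device/netmod selection along the
lines of:

  ./configure --with-device=ch3:nemesis:tcp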
> > > > > On Sep 12, 2013, at 2:24 PM, Jiri Simsa wrote:
> > > > > > The high-order bit of my question is: What OS interface(s) does
> MPICH use to transfer data from one MPI process to another?
> > > > > >
> > > > > >
> > > > > > On Thu, Sep 12, 2013 at 1:36 PM, Jiri Simsa <jsimsa at cs.cmu.edu>
> wrote:
> > > > > > Hello,
> > > > > >
> > > > > > I have been trying to understand how MPICH implements collective
> operations. To do so, I have been reading the MPICH source code and
> stepping through mpiexec executions.
> > > > > >
> > > > > > For the sake of this discussion, let's assume that all MPI
> processes are executed on the same computer using: mpiexec -n <n>
> <mpi_program>
> > > > > >
> > > > > > This is my current abstract understanding of MPICH:
> > > > > >
> > > > > > - mpiexec spawns a hydra_pmi_proxy process, which in turn spawns
> <n> instances of <mpi_program>
> > > > > > - the hydra_pmi_proxy process uses socket pairs to communicate
> > > > > > with the instances of <mpi_program>
> > > > > >
> > > > > > I am not quite sure, though, what happens under the hood when a
> > > > > > collective operation, such as MPI_Allreduce, is executed. I have
> > > > > > noticed that instances of <mpi_program> create and listen on a
> > > > > > socket in the course of executing MPI_Allreduce, but I am not sure
> > > > > > who connects to these sockets. Could someone describe the data flow
> > > > > > inside MPICH when a collective operation such as MPI_Allreduce is
> > > > > > executed? Thanks!
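
For completeness, here is a minimal self-contained MPI_Allreduce reproducer
(not part of the original post) that can be compiled with mpicc and run under
the logging or tracing approaches discussed above:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char **argv) {
  int rank, size, sum = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  /* every rank contributes its rank; all ranks receive the total */
  MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  printf("%d of %d: sum = %d\n", rank, size, sum);
  MPI_Finalize();
  return 0;
}

Built as "mpicc -o allreduce allreduce.c" and run as "mpiexec -n <n>
./allreduce", every rank should print the same sum, and the transport used
for the exchange (shared memory vs. tcp) can then be observed.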
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > --Jiri Simsa
> > > > > >
> > > > > > _______________________________________________
> > > > > > discuss mailing list     discuss at mpich.org
> > > > > > To manage subscription options or unsubscribe:
> > > > > > https://lists.mpich.org/mailman/listinfo/discuss
> > > > > --
> > > > > Pavan Balaji
> > > > > http://www.mcs.anl.gov/~balaji
> > > >
> > > > --
> > > > Pavan Balaji
> > > > http://www.mcs.anl.gov/~balaji
> > > >
> > > >
> > >
> > > --
> > > Pavan Balaji
> > > http://www.mcs.anl.gov/~balaji
> > >
> > >
> > >
> >
> > --
> > Pavan Balaji
> > http://www.mcs.anl.gov/~balaji
> >
> >
> > <dbg0-7f997b82eb40.log>
> > _______________________________________________
> > discuss mailing list     discuss at mpich.org
> > To manage subscription options or unsubscribe:
> > https://lists.mpich.org/mailman/listinfo/discuss
>
>