[mpich-discuss] Implementation of MPICH collectives

Jiri Simsa jsimsa at cs.cmu.edu
Fri Sep 13 20:17:43 CDT 2013


Yes. To verify the behavior I wrote a simple test program:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
  char message[256];
  int rank;
  if (getenv("MPIR_PARAM_CH3_NO_LOCAL") != NULL) {
    printf("MPIR_PARAM_CH3_NO_LOCAL = %s\n",
           getenv("MPIR_PARAM_CH3_NO_LOCAL"));
  }
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) { strcpy(message, "Hello!"); }
  /* broadcast strlen("Hello!") + 1 bytes so the terminating '\0' is sent too */
  MPI_Bcast(message, strlen("Hello!") + 1, MPI_CHAR, 0, MPI_COMM_WORLD);
  MPI_Finalize();
  printf("%d: %s\n", rank, message);
  return 0;
}

When I run it with "mpiexec -n 2 ./simple" I get the following output:

MPIR_PARAM_CH3_NO_LOCAL = 1
MPIR_PARAM_CH3_NO_LOCAL = 1
0: Hello!
1: Hello!

I compiled mpich-3.0.4 with --enable-g=dbg,log, set the MPICH_DBG
environment variable to FILE, and set the MPICH_DBG_LEVEL environment
variable to VERBOSE. I am attaching the log file for process 0, which shows
(to the best of my understanding) that the broadcast uses the nemesis
fastbox and memcpy() to transfer the data.
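
For reference, the logging setup described above can be reproduced roughly as
follows (a sketch; the exact log file names, such as dbg0-7f997b82eb40.log,
vary by rank and process):

```shell
# Sketch of the debug-logging setup described above.
# Assumes mpich-3.0.4 was configured with --enable-g=dbg,log.
export MPICH_DBG=FILE           # write debug output to per-process files
export MPICH_DBG_LEVEL=VERBOSE  # most detailed logging level
mpiexec -n 2 ./simple           # produces dbg<rank>-<id>.log files
```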


On Fri, Sep 13, 2013 at 8:56 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:

>
> Not really.  It shouldn't be using the nemesis fast box.  Are you setting
> the environment correctly?
>
>  -- Pavan
>
> On Sep 13, 2013, at 7:08 PM, Jiri Simsa wrote:
>
> > To be more precise, I don't see any such call before MPI_Bcast() returns
> in the root. Is MPICH buffering the data to be broadcast until some later
> point?
> >
> > --Jiri
> >
> >
> > On Fri, Sep 13, 2013 at 7:55 PM, Jiri Simsa <jsimsa at cs.cmu.edu> wrote:
> > Well, it seems like it is copying data from the "nemesis fastbox". More
> importantly, I don't see any calls to socket(), connect(), send(),
> sendto(), or sendmsg() that I would expect to be part of the data transfer.
> >
> > --Jiri
> >
> >
> > On Fri, Sep 13, 2013 at 5:44 PM, Pavan Balaji <balaji at mcs.anl.gov>
> wrote:
> >
> > Depends on what the memcpy is doing.  It might be some internal data
> manipulation.
> >
> >  -- Pavan
> >
> > On Sep 13, 2013, at 4:34 PM, Jiri Simsa wrote:
> >
> > > Hm, I set that variable and then stepped through a program that calls
> MPI_Bcast (using mpiexec -n 2 <program> on a single node). MPI_Bcast still
> seems to use memcpy(), whereas I would expect it to use the sockets
> interface. Is the memcpy() to be expected?
> > >
> > > --Jiri
> > >
> > >
> > > On Fri, Sep 13, 2013 at 10:25 AM, Pavan Balaji <balaji at mcs.anl.gov>
> wrote:
> > >
> > > Yes, you can set the environment variable MPIR_PARAM_CH3_NOLOCAL=1.
> > >
> > >  -- Pavan
> > >
> > > On Sep 13, 2013, at 7:53 AM, Jiri Simsa wrote:
> > >
> > > > Pavan,
> > > >
> > > > Thank you for your answer. That's precisely what I was looking for.
> Any chance there is a way to force the intranode communication to use tcp?
> > > >
> > > > --Jiri
> > > >
> > > > Within the node, it uses shared memory.  Outside the node, it
> depends on the netmod you configured with.  tcp is the default netmod.
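
The netmod choice mentioned here is fixed at configure time; a sketch of
selecting the tcp netmod explicitly (ch3:nemesis is the default device in
this MPICH series, and tcp its default netmod, so this spells out the
defaults):

```shell
# Configure MPICH with the nemesis channel and the tcp netmod explicitly.
./configure --with-device=ch3:nemesis:tcp
make && make install
```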
> > > >  -- Pavan
> > > > On Sep 12, 2013, at 2:24 PM, Jiri Simsa wrote:
> > > > > The high-order bit of my question is: What OS interface(s) does
> MPICH use to transfer data from one MPI process to another?
> > > > >
> > > > >
> > > > > On Thu, Sep 12, 2013 at 1:36 PM, Jiri Simsa <jsimsa at cs.cmu.edu>
> wrote:
> > > > > Hello,
> > > > >
> > > > > I have been trying to understand how MPICH implements collective
> operations. To do so, I have been reading the MPICH source code and
> stepping through mpiexec executions.
> > > > >
> > > > > For the sake of this discussion, let's assume that all MPI
> processes are executed on the same computer using: mpiexec -n <n>
> <mpi_program>
> > > > >
> > > > > This is my current abstract understanding of MPICH:
> > > > >
> > > > > - mpiexec spawns a hydra_pmi_proxy process, which in turn spawns
> <n> instances of <mpi_program>
> > > > > - hydra_pmi_proxy process uses socket pairs to communicate with
> the instances of <mpi_program>
> > > > >
> > > > > I am not quite sure, though, what happens under the hood when a
> collective operation, such as MPI_Allreduce, is executed. I have noticed
> that instances of <mpi_program> create and listen on a socket in the course
> of executing MPI_Allreduce but I am not sure who connects to these sockets.
> Any chance someone could describe the data flow inside of MPICH when a
> collective operation, such as MPI_Allreduce, is executed? Thanks!
> > > > >
> > > > > Best,
> > > > >
> > > > > --Jiri Simsa
> > > > >
> > > > > _______________________________________________
> > > > > discuss mailing list     discuss at mpich.org
> > > > > To manage subscription options or unsubscribe:
> > > > > https://lists.mpich.org/mailman/listinfo/discuss
> > > > --
> > > > Pavan Balaji
> > > > http://www.mcs.anl.gov/~balaji
> > >
> > > --
> > > Pavan Balaji
> > > http://www.mcs.anl.gov/~balaji
> > >
> > >
> >
> > --
> > Pavan Balaji
> > http://www.mcs.anl.gov/~balaji
> >
> >
> >
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dbg0-7f997b82eb40.log
Type: application/octet-stream
Size: 91603 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20130913/0a5d7960/attachment.obj>
