[mpich-discuss] Process group collision with MPI_Comm_accept when multiple clients from different host machines have the same pid

Balaji, Pavan balaji at anl.gov
Wed May 3 11:02:11 CDT 2017


Yup.  +1.

Note that this needs to be fixed in Hydra as well, which essentially does the same thing.

  -- Pavan

> On May 1, 2017, at 11:52 PM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
> 
> Hi Min,
>  
> I found the following piece of code in the MPICH source, in src/pm/util/pmiserv.c:
>  
>     /* We include the pid of the PMI server as a way to allow multiple
>        PMI servers to coexist.  This is needed to support connect/accept
>        operations when multiple mpiexec's are used, and the KVS space
>        is served directly by mpiexec (it should really have the
>        hostname as well, just to avoid getting the same pid on two
>        different hosts, but this is probably good enough for most
>        uses) */
>    
>     MPIU_Snprintf( (char *)(kvs->kvsname), MAXNAMELEN, "kvs_%d_%d",
>                  (int)getpid(), kvsnum++ );
>     kvs->pairs     = 0;
>     kvs->lastByIdx = 0;
>     kvs->lastIdx   = -1;
>  
>  
> and the following in src/pm/hydra/pm/pmiserv/common.c:
>  
>  
> HYD_status HYD_pmcd_pmi_allocate_kvs(struct HYD_pmcd_pmi_kvs ** kvs, int pgid)
> {
>     HYD_status status = HYD_SUCCESS;
>  
>     HYDU_FUNC_ENTER();
>     HYDU_MALLOC(*kvs, struct HYD_pmcd_pmi_kvs *, sizeof(struct HYD_pmcd_pmi_kvs), status);
>     HYDU_snprintf((*kvs)->kvsname, PMI_MAXKVSLEN, "kvs_%d_%d", (int) getpid(), pgid);
>     (*kvs)->key_pair = NULL;
>  
>   fn_exit:
>     HYDU_FUNC_EXIT();
>     return status;
>  
>   fn_fail:
>     goto fn_exit;
> }
>  
>  
> I think the hostname should be added to the kvsname to make it unique when accept/connect are performed by processes that were not launched by the same mpiexec.
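> 
> For example, here is a rough, untested sketch of what I mean for the Hydra version. It assumes gethostname() from <unistd.h> (and strncpy from <string.h>) is available, and simply folds the hostname into the name built in HYD_pmcd_pmi_allocate_kvs; the combined name would still have to fit in PMI_MAXKVSLEN:
> 
>     /* sketch only: make the KVS name unique across hosts, not just per pid */
>     char hostname[256];
> 
>     if (gethostname(hostname, sizeof(hostname)) != 0)
>         strncpy(hostname, "unknown", sizeof(hostname));
>     hostname[sizeof(hostname) - 1] = '\0';    /* gethostname may not terminate on truncation */
> 
>     HYDU_snprintf((*kvs)->kvsname, PMI_MAXKVSLEN, "kvs_%s_%d_%d",
>                   hostname, (int) getpid(), pgid);
> 
> The pmiserv.c version could presumably get the same treatment with MPIU_Snprintf.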
>  
>  
> Thanks,
> Hirak
>  
>  
>  
>  
>  
> Hi Hirak,
>  
> Before looking into PMI, it would be good to first determine whether this is a
> problem in your server-client code or in the dynamic-process part of the
> MPICH code. Could you please reproduce this issue with a simple program
> and send it to us?
>  
> One thing I noticed is that the server program is multithreaded. Are you
> using multiple threads to accept client connections? In any case, a
> reproducer program would be great.
>  
> Please also try the latest MPICH release and see whether the problem still occurs.
>  
> In summary, it would be great if you could send us the following:
> - A reproducer program
> - MPICH's config.log (you can find it in the directory where you built
> MPICH)
>  
> Thanks,
> Min
>  
> On 4/14/17 1:15 AM, Roy, Hirak wrote:
> > 
> > 
> > Dear MPICH team,
> > 
> > We use MPICH for a server-client application, specifically MPICH-3.0.4
> > with the ch3:sock channel.
> > In this application there is one server and 100 clients.
> > Each client is launched independently on a different host machine using
> > an individual wrapper script. (We explicitly use: mpiexec -n 1.)
> > 
> > The server is multithreaded; it uses MPI_Comm_accept (on
> > MPI_COMM_SELF), and the clients use MPI_Comm_connect to connect.
> > We have observed the following issue after all the clients have connected
> > to the server: if we send a message to a client (let's say 'm'), it
> > unexpectedly reaches some other client (let's say 'n'). (The server sends
> > the message using the communicator returned by the accept call.) This
> > happens randomly, in roughly one out of 5-6 runs.
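> > 
> > For reference, a stripped-down sketch of the pattern we use (hypothetical
> > file names, not our actual code; our real server does one accept per
> > client from worker threads):
> > 
> > /* server.c (sketch): open a port, accept one client on MPI_COMM_SELF,
> >    and send it a message over the communicator returned by the accept. */
> > #include <mpi.h>
> > #include <stdio.h>
> > 
> > int main(int argc, char **argv)
> > {
> >     char port[MPI_MAX_PORT_NAME];
> >     MPI_Comm client;
> >     int payload = 42;
> > 
> >     MPI_Init(&argc, &argv);
> >     MPI_Open_port(MPI_INFO_NULL, port);
> >     printf("port: %s\n", port);   /* handed to the clients out of band */
> > 
> >     /* our real server calls this once per client, from worker threads */
> >     MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
> > 
> >     /* this should only ever reach the client that was just accepted */
> >     MPI_Send(&payload, 1, MPI_INT, 0, 0, client);
> > 
> >     MPI_Comm_disconnect(&client);
> >     MPI_Close_port(port);
> >     MPI_Finalize();
> >     return 0;
> > }
> > 
> > /* client.c (sketch): launched separately as "mpiexec -n 1 ./client <port>" */
> > #include <mpi.h>
> > #include <stdio.h>
> > 
> > int main(int argc, char **argv)
> > {
> >     MPI_Comm server;
> >     int payload;
> > 
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
> >     MPI_Recv(&payload, 1, MPI_INT, 0, 0, server, MPI_STATUS_IGNORE);
> >     printf("client received %d\n", payload);
> >     MPI_Comm_disconnect(&server);
> >     MPI_Finalize();
> >     return 0;
> > }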
> > 
> > On looking further into the MPICH code, we found that:
> > 1) there is a collision of the pg (process group) of the two processes (m
> > and n) after MPI_Comm_accept, and
> > 2) as a result of (1), comm->vc is the same for m and n, even though the
> > comm handles are different.
> > It seems that the supposedly unique string (something like kvs_<int>_<int>)
> > is not unique for these two processes: 'm' and 'n' run on different host
> > machines but happen to have the same pid, and the KVS name has the form
> > kvs_<pid>_<id>.
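> > 
> > As a hypothetical illustration (made-up pid): two mpiexec's on different
> > hosts format the name the same way, so equal pids produce identical KVS
> > names and the server cannot tell the two process groups apart:
> > 
> > /* illustration only: same pid on host A and host B -> same KVS name */
> > #include <stdio.h>
> > 
> > int main(void)
> > {
> >     char on_host_a[64], on_host_b[64];
> > 
> >     snprintf(on_host_a, sizeof(on_host_a), "kvs_%d_%d", 12345, 0); /* mpiexec on host A, pid 12345 */
> >     snprintf(on_host_b, sizeof(on_host_b), "kvs_%d_%d", 12345, 0); /* mpiexec on host B, pid 12345 */
> > 
> >     printf("%s\n%s\n", on_host_a, on_host_b);   /* both print "kvs_12345_0" */
> >     return 0;
> > }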
> > 
> > 
> > We have the following questions:
> > 1) Have we built MPICH with some kind of incorrect configuration (the
> > Hydra build details are at the end of the email)?
> > 2) Are we using an incorrect process manager or configuration, and is that
> > why there is a possible collision of process groups?
> > 3) What is the purpose of the process-group sharing/uniquifying? If there
> > is no real reason for it, could it be disabled, or does something else
> > rely on the id string being unique?
> > 4) If there is no other workaround, what could be done to make the id
> > string unique? Add the hostname? Would everything else still work with that?
> > 
> > 
> > It would be good if you could let us know whether there is any workaround
> > for this issue.
> > 
> > 
> > Thanks,
> > Hirak Roy
> > 
> > HYDRA build details:
> >     CXX:                             no  -O3 -fPIC
> >     F77:                             no
> >     F90:                             no
> >     Configure options: '--disable-option-checking'
> > '--prefix=/home/hroy/local/mpich-3.0.4/linux_x86_64' '--disable-f77'
> > '--disable-fc' '--disable-f90modules' '--disable-cxx'
> > '--enable-fast=nochkmsg' '--enable-fast=notiming'
> > '--enable-fast=ndebug' '--enable-fast=O3' '--with-device=ch3:sock'
> > 'CFLAGS=-O3 -fPIC -O3' 'CXXFLAGS=-O3 -fPIC '
> > 'CC=/u/prod/gnu/gcc/20121129/gcc-4.5.0-linux_x86_64/bin/gcc' 'LDFLAGS=
> > ' '--cache-file=/dev/null' '--srcdir=.' 'LIBS=-lrt -lpthread '
> > 'CPPFLAGS= -I/home/hroy/tools/mpich/mpich-3.0.4/src/mpl/include
> > -I/home/hroy/tools/mpich/mpich-3.0.4/src/mpl/include
> > -I/home/hroy/tools/mpich/mpich-3.0.4/src/openpa/src
> > -I/home/hroy/tools/mpich/mpich-3.0.4/src/openpa/src
> > -I/home/hroy/tools/mpich/mpich-3.0.4/src/mpi/romio/include'
> >     Process Manager:                         pmi
> >     Launchers available:                     ssh rsh fork slurm ll lsf
> > sge manual persist
> >     Topology libraries available:            hwloc
> >     Resource management kernels available:   user slurm ll lsf sge pbs
> > cobalt
> >     Checkpointing libraries available:
> >     Demux engines available:                 poll select
> > 
> > 
> > 
>  

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

