[mpich-discuss] Process Group Collision for multiple clients from different host machines having same pid with MPI_Comm_accept
Balaji, Pavan
balaji at anl.gov
Wed May 3 11:02:11 CDT 2017
Yup. +1.
Note that this needs to be fixed in Hydra as well, which essentially does the same thing.
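For reference, a minimal sketch of the kind of change being discussed, applied to Hydra's HYD_pmcd_pmi_allocate_kvs (a sketch only, assuming gethostname() from <unistd.h>; the buffer size and fallback behavior are illustrative, not a committed patch):

    HYD_status HYD_pmcd_pmi_allocate_kvs(struct HYD_pmcd_pmi_kvs **kvs, int pgid)
    {
        HYD_status status = HYD_SUCCESS;
        char hostname[256];          /* illustrative size; Hydra may define its own limit */

        HYDU_FUNC_ENTER();

        HYDU_MALLOC(*kvs, struct HYD_pmcd_pmi_kvs *, sizeof(struct HYD_pmcd_pmi_kvs), status);

        /* Include the hostname so that two servers on different hosts that
           happen to share a pid still generate distinct KVS names. */
        if (gethostname(hostname, sizeof(hostname)) != 0)
            hostname[0] = '\0';      /* fall back to the old pid-only naming */
        hostname[sizeof(hostname) - 1] = '\0';

        HYDU_snprintf((*kvs)->kvsname, PMI_MAXKVSLEN, "kvs_%s_%d_%d",
                      hostname, (int) getpid(), pgid);
        (*kvs)->key_pair = NULL;

      fn_exit:
        HYDU_FUNC_EXIT();
        return status;

      fn_fail:
        goto fn_exit;
    }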
-- Pavan
> On May 1, 2017, at 11:52 PM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
>
> Hi Min,
>
> I found the following piece of code in the MPICH source, src/pm/util/pmiserv.c:
>
> /* We include the pid of the PMI server as a way to allow multiple
>    PMI servers to coexist. This is needed to support connect/accept
>    operations when multiple mpiexec's are used, and the KVS space
>    is served directly by mpiexec (it should really have the
>    hostname as well, just to avoid getting the same pid on two
>    different hosts, but this is probably good enough for most
>    uses) */
>
> MPIU_Snprintf( (char *)(kvs->kvsname), MAXNAMELEN, "kvs_%d_%d",
>                (int)getpid(), kvsnum++ );
> kvs->pairs = 0;
> kvs->lastByIdx = 0;
> kvs->lastIdx = -1;
>
>
> and in src/pm/hydra/pm/pmiserv/common.c
>
>
> HYD_status HYD_pmcd_pmi_allocate_kvs(struct HYD_pmcd_pmi_kvs **kvs, int pgid)
> {
>     HYD_status status = HYD_SUCCESS;
>
>     HYDU_FUNC_ENTER();
>
>     HYDU_MALLOC(*kvs, struct HYD_pmcd_pmi_kvs *, sizeof(struct HYD_pmcd_pmi_kvs), status);
>     HYDU_snprintf((*kvs)->kvsname, PMI_MAXKVSLEN, "kvs_%d_%d", (int) getpid(), pgid);
>     (*kvs)->key_pair = NULL;
>
>   fn_exit:
>     HYDU_FUNC_EXIT();
>     return status;
>
>   fn_fail:
>     goto fn_exit;
> }
>
>
> I think the hostname should be added to the kvsname to make it unique when accept/connect is done between processes that were not launched by the same mpiexec.
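>
> A minimal sketch of that idea for the pmiserv.c path quoted above (assuming gethostname() from <unistd.h>; the buffer size and fallback are illustrative only):
>
>     char hostname[256];                    /* illustrative size */
>
>     if (gethostname(hostname, sizeof(hostname)) != 0)
>         hostname[0] = '\0';                /* keep the old behavior on error */
>     hostname[sizeof(hostname) - 1] = '\0';
>
>     /* hostname + pid + counter instead of pid + counter alone */
>     MPIU_Snprintf( (char *)(kvs->kvsname), MAXNAMELEN, "kvs_%s_%d_%d",
>                    hostname, (int)getpid(), kvsnum++ );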
>
>
> Thanks,
> Hirak
>
>
>
>
>
> Hi Hirak,
>
> Before looking into PMI, it would be good to first determine whether this is a
> problem in your server-client code or in the dynamic-process part of the
> MPICH code. Could you please reproduce this issue with a simple program
> and send it to us?
>
> One thing I noticed is that the server program is multithreaded. Are you
> using multiple threads to accept client connections? In any case, a
> reproducer program would be great.
>
> Please also try the latest MPICH release and see whether the issue still occurs.
>
> In summary, it would be great if you could send us the following files:
> - A reproducer program
> - MPICH's config.log (you can find it in the directory where you built
> MPICH)
>
> Thanks,
> Min
>
> On 4/14/17 1:15 AM, Roy, Hirak wrote:
> >
> >
> > Dear MPICH team,
> >
> > We use MPICH for a server-client application, built as MPICH-3.0.4 with
> > the sock channel.
> > In this application there is one server and 100 clients.
> > Each client is launched independently on a different host machine using
> > an individual wrapper script (we explicitly use: mpiexec -n 1).
> >
> > The server is multithreaded and uses MPI_Comm_accept (on
> > MPI_COMM_SELF); the clients use MPI_Comm_connect to connect.
> > We have observed the following issue after all the clients connect to
> > the server: if we send a message to a client (let's say 'm'), it
> > unexpectedly reaches some other client (let's say 'n'). (The server
> > sends the message using the communicator returned by the accept call.)
> > This happens randomly in about one out of 5-6 runs.
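> >
> > (For context, a stripped-down sketch of this accept/connect pattern; MPI_Init/MPI_Finalize, the out-of-band port exchange, and error handling are omitted, and this is not the actual application code.)
> >
> >     /* Server side, one MPI process, possibly one accept per thread: */
> >     char port_name[MPI_MAX_PORT_NAME];
> >     MPI_Comm client;
> >     MPI_Open_port(MPI_INFO_NULL, port_name);
> >     /* ... publish port_name out of band (file, name server, ...) ... */
> >     MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
> >
> >     /* Client side, launched separately with "mpiexec -n 1": */
> >     MPI_Comm server;
> >     MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);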
> >
> > On looking further into the MPICH code, we found that:
> > 1) there is a collision of the pg (process group) of two processes (m and
> > n) after MPI_Comm_accept;
> > 2) as a result of (1), comm->vc is the same for m and n, even though the
> > communicators are different. It appears that the supposedly unique string
> > (something like kvs_<int>_<int>) is not unique for these two processes:
> > 'm' and 'n' run on different host machines and happen to have the same
> > pid, and the kvs string looks like kvs_<pid>_<rank>. (A small standalone
> > illustration follows below.)
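> >
> > A small standalone illustration of the naming scheme (not MPICH code): run on two different hosts where a process happens to get the same pid, both produce the identical string.
> >
> >     #include <stdio.h>
> >     #include <unistd.h>
> >
> >     int main(void)
> >     {
> >         char kvsname[64];
> >         int kvsnum = 0;    /* per-server counter, starts at 0 on every host */
> >
> >         /* Same format as the quoted code: only the local pid and a local
> >            counter, nothing that distinguishes the host. */
> >         snprintf(kvsname, sizeof(kvsname), "kvs_%d_%d", (int) getpid(), kvsnum);
> >         printf("%s\n", kvsname);    /* e.g. "kvs_12345_0" on both hosts */
> >         return 0;
> >     }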
> >
> >
> > We have the following questions:
> > 1) Have we built MPICH with some kind of incorrect
> > configuration (the Hydra configuration is at the end of this email)?
> > 2) Are we using an incorrect process manager or configuration, and is
> > that why there is a possible collision of process groups?
> > 3) What is the purpose of the process-group sharing/uniquifying? If there
> > is no real reason for it, could it be disabled, or does something
> > else rely on the id string being unique?
> > 4) If there is no other workaround, what could be done to make the
> > id string unique? Add the hostname? Would everything else still work with that?
> >
> >
> > It would be good if you could let us know whether there is a workaround
> > for this issue.
> >
> >
> > Thanks,
> > Hirak Roy
> >
> > HYDRA build details:
> > CXX: no -O3 -fPIC
> > F77: no
> > F90: no
> > Configure options: '--disable-option-checking'
> > '--prefix=/home/hroy/local/mpich-3.0.4/linux_x86_64' '--disable-f77'
> > '--disable-fc' '--disable-f90modules' '--disable-cxx'
> > '--enable-fast=nochkmsg' '--enable-fast=notiming'
> > '--enable-fast=ndebug' '--enable-fast=O3' '--with-device=ch3:sock'
> > 'CFLAGS=-O3 -fPIC -O3' 'CXXFLAGS=-O3 -fPIC '
> > 'CC=/u/prod/gnu/gcc/20121129/gcc-4.5.0-linux_x86_64/bin/gcc' 'LDFLAGS=
> > ' '--cache-file=/dev/null' '--srcdir=.' 'LIBS=-lrt -lpthread '
> > 'CPPFLAGS= -I/home/hroy/tools/mpich/mpich-3.0.4/src/mpl/include
> > -I/home/hroy/tools/mpich/mpich-3.0.4/src/mpl/include
> > -I/home/hroy/tools/mpich/mpich-3.0.4/src/openpa/src
> > -I/home/hroy/tools/mpich/mpich-3.0.4/src/openpa/src
> > -I/home/hroy/tools/mpich/mpich-3.0.4/src/mpi/romio/include'
> > Process Manager: pmi
> > Launchers available: ssh rsh fork slurm ll lsf sge manual persist
> > Topology libraries available: hwloc
> > Resource management kernels available: user slurm ll lsf sge pbs cobalt
> > Checkpointing libraries available:
> > Demux engines available: poll select
> >
> >
> >
>
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss