[mpich-discuss] Process Group Collision for multiple clients from different host machines having same pid with MPI_Comm_accept

Roy, Hirak Hirak_Roy at mentor.com
Fri Apr 14 01:15:10 CDT 2017


Dear MPICH team,

We use MPICH-3.0.4 with the sock channel for a server-client application.
In this application there is one server and 100 clients.
Each client is launched independently on a different host machine by its own wrapper script (we explicitly use: mpiexec -n 1).

The server is multithreaded; it calls MPI_Comm_accept (on MPI_COMM_SELF) and the clients call MPI_Comm_connect to connect.
We have observed the following issue after all the clients have connected to the server:
if we send a message to a client (let's say 'm'), it unexpectedly reaches some other client (let's say 'n'). (The server sends the message using the communicator returned by the accept call.) This happens randomly in about one out of 5-6 runs.
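
For reference, here is a minimal sketch of the pattern we are describing. The port exchange, the names, and the single accept loop are simplifications for illustration only (our real server runs each accept in its own thread); it is not our actual code:

    /* Sketch only: server accepts each client on MPI_COMM_SELF; each client
     * is started separately on its own host with "mpiexec -n 1". */
    #include <mpi.h>
    #include <stdio.h>

    #define NUM_CLIENTS 100

    static void run_server(void)
    {
        char port[MPI_MAX_PORT_NAME];
        MPI_Comm clients[NUM_CLIENTS];

        MPI_Open_port(MPI_INFO_NULL, port);
        printf("server port: %s\n", port);   /* published to clients out of band */

        for (int i = 0; i < NUM_CLIENTS; i++) {
            /* In the real server each accept runs in its own thread; a loop
             * keeps the sketch short. */
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &clients[i]);
        }

        /* A message meant for client m is sent on the communicator returned
         * by that client's accept; with the pg collision it can arrive at
         * some other client n instead. */
        int m = 5, payload = 42;
        MPI_Send(&payload, 1, MPI_INT, 0 /* the client's rank */, 0, clients[m]);

        MPI_Close_port(port);
    }

    static void run_client(const char *port)
    {
        MPI_Comm server;
        int payload;

        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, server, MPI_STATUS_IGNORE);
        MPI_Comm_disconnect(&server);
    }

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (argc > 1)
            run_client(argv[1]);   /* per host: mpiexec -n 1 ./app <port> */
        else
            run_server();

        MPI_Finalize();
        return 0;
    }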

On looking further into the MPICH code, we found that:
1) There is a collision of the pg (process group) of two processes (m and n) after MPI_Comm_accept.
2) As a result of (1), comm->vc is the same for m and n (although the comm objects are different). It seems that the "unique" id string (something like kva_<int>_<int>) is not actually unique for these two processes: m and n run on different host machines but happen to have the same pid, and the kva string looked like kva_<pid>_<rank>.


We have the following questions:
1) Have we built MPICH with some kind of incorrect configuration (the Hydra configuration is at the end of the email)?
2) Are we using an incorrect process manager or configuration, and is that why process groups can collide?
3) What is the purpose of process-group sharing/uniquifying? If there is no real reason for it, could it be disabled, or does something else rely on the id string being unique?
4) If there is no other workaround, what could be done to make the id string unique? Add the host name (the sketch after this list illustrates the idea)? Would everything else still work with this?
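
To illustrate what we mean in question 4: the idea would be to include the host name in the id string, so that two processes with the same pid on different hosts no longer collide. This is only a sketch of the string construction (the "kva_" format follows what we observed above), not the actual MPICH internal code:

    /* Sketch of an id built from hostname + pid + rank instead of pid + rank. */
    #include <stdio.h>
    #include <unistd.h>

    static void make_pg_id(char *buf, size_t len, int rank)
    {
        char host[256];
        gethostname(host, sizeof(host));       /* differs between the two hosts */
        host[sizeof(host) - 1] = '\0';

        /* pid alone can collide across hosts (as we saw for clients m and n);
         * hostname + pid cannot. */
        snprintf(buf, len, "kva_%s_%d_%d", host, (int)getpid(), rank);
    }

    int main(void)
    {
        char id[512];
        make_pg_id(id, sizeof(id), 0);
        printf("%s\n", id);
        return 0;
    }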


It would be good if you could let us know whether there is any workaround for this issue.


Thanks,
Hirak Roy

HYDRA build details:

    CXX:                             no  -O3 -fPIC
    F77:                             no
    F90:                             no
    Configure options:                       '--disable-option-checking' '--prefix=/home/hroy/local/mpich-3.0.4/linux_x86_64' '--disable-f77' '--disable-fc' '--disable-f90modules' '--disable-cxx' '--enable-fast=nochkmsg' '--enable-fast=notiming' '--enable-fast=ndebug' '--enable-fast=O3' '--with-device=ch3:sock' 'CFLAGS=-O3 -fPIC -O3' 'CXXFLAGS=-O3 -fPIC ' 'CC=/u/prod/gnu/gcc/20121129/gcc-4.5.0-linux_x86_64/bin/gcc' 'LDFLAGS= ' '--cache-file=/dev/null' '--srcdir=.' 'LIBS=-lrt -lpthread ' 'CPPFLAGS= -I/home/hroy/tools/mpich/mpich-3.0.4/src/mpl/include -I/home/hroy/tools/mpich/mpich-3.0.4/src/mpl/include -I/home/hroy/tools/mpich/mpich-3.0.4/src/openpa/src -I/home/hroy/tools/mpich/mpich-3.0.4/src/openpa/src -I/home/hroy/tools/mpich/mpich-3.0.4/src/mpi/romio/include'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available:
    Demux engines available:                 poll select
