[mpich-discuss] Process Group Collision for multiple clients from different host machines having same pid with MPI_Comm_accept

Min Si msi at anl.gov
Sat Apr 15 13:22:02 CDT 2017


Hi Hirak,

Before looking into PMI, it would be good to first make sure whether this 
is a problem in your server-client code or in the dynamic-process part of 
the MPICH code. Could you please reproduce this issue with a simple 
program and send it to us?

One thing I noticed is that the server program is multithreaded. Are you 
using multiple threads to accept client connections? In any case, a 
reproducer program would be great.
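
For example, a stripped-down accept/connect pair along the following 
lines would be enough for us to try (this is only a sketch; the 
file-based port exchange, the two-client loop, and the file name are 
placeholders, not your actual code, and error checking is omitted):

/* Build one binary; start the server first, then launch each client
 * separately with "mpiexec -n 1 ./a.out". */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        char port[MPI_MAX_PORT_NAME];
        MPI_Open_port(MPI_INFO_NULL, port);
        FILE *f = fopen("port.txt", "w");
        fprintf(f, "%s\n", port);
        fclose(f);

        for (int i = 0; i < 2; i++) {
            MPI_Comm client;
            /* accept on MPI_COMM_SELF, as in your server */
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
            /* send each client its own index; a misrouted message shows
             * up as two clients printing the same index */
            MPI_Send(&i, 1, MPI_INT, 0, 0, client);
            MPI_Comm_disconnect(&client);
        }
        MPI_Close_port(port);
    } else {
        char port[MPI_MAX_PORT_NAME];
        FILE *f = fopen("port.txt", "r");
        fgets(port, sizeof(port), f);
        port[strcspn(port, "\n")] = '\0';
        fclose(f);

        MPI_Comm server;
        int idx;
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
        MPI_Recv(&idx, 1, MPI_INT, 0, 0, server, MPI_STATUS_IGNORE);
        printf("client got index %d\n", idx);
        MPI_Comm_disconnect(&server);
    }

    MPI_Finalize();
    return 0;
}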

Please also try the latest MPICH release and see whether the problem 
still happens.

In summary, it would be great if you could send us the following files:
- A reproducer program
- MPICH's config.log (you can find it in the directory where you built 
MPICH)

Thanks,
Min

On 4/14/17 1:15 AM, Roy, Hirak wrote:
>
>
> Dear MPICH team,
>
> We use MPICH for a server-client application: MPICH-3.0.4 with the 
> sock channel.
> In this application there is one server and 100 clients.
> Each client is launched independently on a different host machine using 
> an individual wrapper script (we explicitly use: mpiexec -n 1).
>
> The server is multithreaded; it uses MPI_Comm_accept (on 
> MPI_COMM_SELF), and the clients use MPI_Comm_connect to connect.
> We have observed the following issue after all the clients connect to 
> the server:
> if we send a message to a client (let's say 'm'), it unexpectedly 
> reaches some other client (let's say 'n'). (The server sends the 
> message using the communicator returned by the accept call.) This 
> happens randomly, in about one out of 5-6 runs.
>
> On looking further into the MPICH code, we found that:
> 1) There is a collision of the pg (process group) of two processes (m 
> and n) after MPI_Comm_accept.
> 2) As a result of (1), comm->vc is the same for m and n, although the 
> comms are different. It seems that the supposedly unique string 
> (something like kva_<int>_<int>) is not unique for these two processes. 
> The 'm' and 'n' processes are running on different host machines and 
> happen to have the same pid. The kva string looked like kva_<pid>_<rank>.
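>
> To illustrate the collision we suspect (this is only a sketch of the 
> naming scheme as we understand it, not the actual MPICH code):
>
> #include <stdio.h>
> #include <unistd.h>
>
> int main(void)
> {
>     int rank = 0;   /* each client is a singleton, so rank is 0 */
>     char id[64], host[64], uid[192];
>
>     /* pid + rank alone: two clients on different hosts that happen
>      * to share a pid produce identical strings, so their process
>      * groups collide on the server side */
>     snprintf(id, sizeof(id), "kva_%d_%d", (int) getpid(), rank);
>
>     /* qualifying the string with the host name would keep it unique */
>     gethostname(host, sizeof(host));
>     snprintf(uid, sizeof(uid), "kva_%s_%d_%d", host, (int) getpid(), rank);
>
>     printf("%s\n%s\n", id, uid);
>     return 0;
> }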
>
>
> We have the following questions:
> 1) Have we built MPICH with some kind of incorrect configuration 
> (the hydra configuration is at the end of the email)?
> 2) Are we using an incorrect process manager or configuration, and is 
> that why there is a possible collision of process groups?
> 3) What is the purpose of process-group sharing/uniquifying? If there 
> is no real reason for it, could it be disabled, or does something 
> else rely on the id string being unique?
> 4) If there is no other workaround, what could be done to make the 
> id string unique? Add the host name? Would everything else be OK with that?
>
>
> It would be good if you could let us know whether there is any 
> workaround for this issue.
>
>
> Thanks,
> Hirak Roy
>
> HYDRA build details:
>     CXX:                             no  -O3 -fPIC
>     F77:                             no
>     F90:                             no
>     Configure options: '--disable-option-checking' 
> '--prefix=/home/hroy/local/mpich-3.0.4/linux_x86_64' '--disable-f77' 
> '--disable-fc' '--disable-f90modules' '--disable-cxx' 
> '--enable-fast=nochkmsg' '--enable-fast=notiming' 
> '--enable-fast=ndebug' '--enable-fast=O3' '--with-device=ch3:sock' 
> 'CFLAGS=-O3 -fPIC -O3' 'CXXFLAGS=-O3 -fPIC ' 
> 'CC=/u/prod/gnu/gcc/20121129/gcc-4.5.0-linux_x86_64/bin/gcc' 'LDFLAGS= 
> ' '--cache-file=/dev/null' '--srcdir=.' 'LIBS=-lrt -lpthread ' 
> 'CPPFLAGS= -I/home/hroy/tools/mpich/mpich-3.0.4/src/mpl/include 
> -I/home/hroy/tools/mpich/mpich-3.0.4/src/mpl/include 
> -I/home/hroy/tools/mpich/mpich-3.0.4/src/openpa/src 
> -I/home/hroy/tools/mpich/mpich-3.0.4/src/openpa/src 
> -I/home/hroy/tools/mpich/mpich-3.0.4/src/mpi/romio/include'
>     Process Manager:                         pmi
>     Launchers available:                     ssh rsh fork slurm ll lsf 
> sge manual persist
>     Topology libraries available:            hwloc
>     Resource management kernels available:   user slurm ll lsf sge pbs 
> cobalt
>     Checkpointing libraries available:
>     Demux engines available:                 poll select
>
>
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
