<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi Hirak,<br>
<br>
Before looking into PMI, it would be good to first determine whether this is
a problem in your server-client code or in the dynamic-process part
of the MPICH code. Could you please reproduce the issue with a simple
program and send it to us? <br>
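To make it concrete, a standalone program along the lines of the sketch
below would be enough (this is only a rough, untested outline; in
particular, the way the port string gets from the server to the clients
is an assumption, so adapt it to whatever your application does):<br>
<pre>
/* Minimal reproducer sketch: one server accepting several clients on
 * MPI_COMM_SELF, then sending each client its own index.  A misrouted
 * message shows up as a client printing an index that is not its own.
 *
 * Build:   mpicc reproducer.c -o reproducer
 * Server:  mpiexec -n 1 ./reproducer server
 * Client:  mpiexec -n 1 ./reproducer client '&lt;port-string&gt;'
 */
#include &lt;mpi.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

#define NCLIENTS 4   /* scale toward 100 to mimic your setup */

int main(int argc, char **argv)
{
    MPI_Init(&amp;argc, &amp;argv);

    if (argc &gt; 1 &amp;&amp; strcmp(argv[1], "server") == 0) {
        char port[MPI_MAX_PORT_NAME];
        MPI_Comm clients[NCLIENTS];
        int i;

        MPI_Open_port(MPI_INFO_NULL, port);
        printf("port: %s\n", port);   /* hand this string to the clients */
        fflush(stdout);

        /* accept each client on MPI_COMM_SELF, as in your server */
        for (i = 0; i &lt; NCLIENTS; i++)
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &amp;clients[i]);

        /* send each client its own index over the communicator
         * returned by the corresponding accept */
        for (i = 0; i &lt; NCLIENTS; i++)
            MPI_Send(&amp;i, 1, MPI_INT, 0, 0, clients[i]);

        for (i = 0; i &lt; NCLIENTS; i++)
            MPI_Comm_disconnect(&amp;clients[i]);
        MPI_Close_port(port);
    } else {
        /* client: argv[2] is the port string printed by the server */
        MPI_Comm server;
        int idx = -1;
        MPI_Comm_connect(argv[2], MPI_INFO_NULL, 0, MPI_COMM_SELF, &amp;server);
        MPI_Recv(&amp;idx, 1, MPI_INT, 0, 0, server, MPI_STATUS_IGNORE);
        printf("client received index %d\n", idx);
        MPI_Comm_disconnect(&amp;server);
    }

    MPI_Finalize();
    return 0;
}
</pre>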
<br>
One thing I noticed is that the server program is multithreaded. Are
you using multiple threads to accept the client connections? In any
case, a reproducer program would be great.<br>
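If you do call the accepts from several threads, please also
double-check that MPI is initialized with MPI_THREAD_MULTIPLE;
concurrent MPI calls from multiple threads are only defined at that
thread level. A minimal check looks something like this (just a
sketch):<br>
<pre>
/* Sketch: check the thread level the MPI library actually provides.
 * Calling MPI from several threads concurrently (e.g., one accept per
 * thread) is only defined with MPI_THREAD_MULTIPLE. */
#include &lt;mpi.h&gt;
#include &lt;stdio.h&gt;

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&amp;argc, &amp;argv, MPI_THREAD_MULTIPLE, &amp;provided);
    if (provided &lt; MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not provided (got %d)\n",
                provided);

    /* ... MPI_Open_port / MPI_Comm_accept from your accept threads ... */

    MPI_Finalize();
    return 0;
}
</pre>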
<br>
Please also try the latest MPICH release and see whether the problem
still occurs. <br>
<br>
In summary, it would be great if you could send us the following:<br>
- A reproducer program<br>
- MPICH's config.log (you can find it in the directory where you
built MPICH)<br>
<br>
Thanks,<br>
Min<br>
<br>
<div class="moz-cite-prefix">On 4/14/17 1:15 AM, Roy, Hirak wrote:<br>
</div>
<blockquote cite="mid:1492150510118.38115@mentor.com" type="cite">
<style type="text/css" style="display:none"><!-- p { margin-top: 0px; margin-bottom: 0px; }--></style>
<p><br>
</p>
<div>Dear MPICH team,</div>
<div><br>
</div>
<div>We use MPICH for a server-client application. We use
MPICH-3.0.4 with the sock channel (ch3:sock).</div>
<div>In this application there is one server and 100 clients.</div>
<div>Each client is launched independently on a different
host machine using an individual wrapper script (we explicitly
use: mpiexec -n 1).</div>
<div><br>
</div>
<div>The server is multithreaded; it uses MPI_Comm_accept (on
MPI_COMM_SELF), and the clients use MPI_Comm_connect to connect.</div>
<div>We have observed the following issue after all the clients
have connected to the server:</div>
<div>If we send a message to a client (let's say 'm'), it
unexpectedly reaches some other client (let's say 'n'). (The server
sends the message using the communicator returned by the accept call.)
This happens randomly in roughly one out of 5-6 runs.</div>
<div><br>
</div>
<div>On looking further into the MPICH code, we found that:</div>
<div>1) There is a collision of the pg (process group) of two processes
(m and n) after MPI_Comm_accept.</div>
<div>2) As a result of (1), the comm->vc values are the same for m and n,
although the comms are different. It seems that the unique string
(something like kva_&lt;int&gt;_int) is not unique for these two
processes. The 'm' and 'n' processes run on different
host machines and happen to have the same pid; the kva string looked
like kva_pid_rank.</div>
<div><br>
</div>
<div><br>
</div>
<div>We have the following questions:</div>
<div>1) Have we built MPICH with some kind of incorrect
configuration (the hydra configuration is at the end of the email)?</div>
<div>2) Are we using an incorrect process manager or configuration,
and is that why there is a possible collision of process groups?</div>
<div>3) What is the purpose of process-group sharing/uniquifying?
If there is no real reason for it, could it be disabled, or
does something else rely on the id string being unique?</div>
<div>4) If there is no other workaround, what could be done to
make the id string unique? Add the host name? Would everything
else still work with that?</div>
<div><br>
</div>
<div><br>
</div>
<div>It would be good if you could let us know whether there is a
workaround for this issue.</div>
<div><br>
</div>
<div><br>
</div>
<div>Thanks,</div>
<div>Hirak Roy</div>
<div><br>
</div>
<div>HYDRA build details:</div>
<div> </div>
<div> CXX: no -O3 -fPIC </div>
<div> F77: no </div>
<div> F90: no </div>
<div> Configure options:
'--disable-option-checking'
'--prefix=/home/hroy/local/mpich-3.0.4/linux_x86_64'
'--disable-f77' '--disable-fc' '--disable-f90modules'
'--disable-cxx' '--enable-fast=nochkmsg'
'--enable-fast=notiming' '--enable-fast=ndebug'
'--enable-fast=O3' '--with-device=ch3:sock' 'CFLAGS=-O3 -fPIC
-O3' 'CXXFLAGS=-O3 -fPIC '
'CC=/u/prod/gnu/gcc/20121129/gcc-4.5.0-linux_x86_64/bin/gcc'
'LDFLAGS= ' '--cache-file=/dev/null' '--srcdir=.' 'LIBS=-lrt
-lpthread ' 'CPPFLAGS=
-I/home/hroy/tools/mpich/mpich-3.0.4/src/mpl/include
-I/home/hroy/tools/mpich/mpich-3.0.4/src/mpl/include
-I/home/hroy/tools/mpich/mpich-3.0.4/src/openpa/src
-I/home/hroy/tools/mpich/mpich-3.0.4/src/openpa/src
-I/home/hroy/tools/mpich/mpich-3.0.4/src/mpi/romio/include'</div>
<div> Process Manager: pmi</div>
<div> Launchers available: ssh rsh fork
slurm ll lsf sge manual persist</div>
<div> Topology libraries available: hwloc</div>
<div> Resource management kernels available: user slurm ll
lsf sge pbs cobalt</div>
<div> Checkpointing libraries available: </div>
<div> Demux engines available: poll select</div>
<div><br>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
discuss mailing list <a class="moz-txt-link-abbreviated" href="mailto:discuss@mpich.org">discuss@mpich.org</a>
To manage subscription options or unsubscribe:
<a class="moz-txt-link-freetext" href="https://lists.mpich.org/mailman/listinfo/discuss">https://lists.mpich.org/mailman/listinfo/discuss</a></pre>
</blockquote>
<br>
</body>
</html>