[mpich-discuss] A possible bug in HYD_pmcd_pmi_allocate_kvs

Xiaopeng Duan xpduan12 at gmail.com
Thu Jun 6 00:52:25 CDT 2019


Thank you, Ken.

We ran into a separate problem with 3.3, and will try the fix once we have
resolved that issue.

Just out of curiosity, why was a random number chosen for the fix instead of
the hostname or address? It looks to me like a random number can still
repeat (although very rarely), whereas hostnames and addresses should be
unique within a system.
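
For illustration, here is a minimal sketch of the hostname-qualified naming
described in the original report quoted below ('kvs_HOSTNAME_PID_pgid'). It is
not the actual Hydra code or the patch from the pull request: the function name
make_kvs_name, the buffer size HOST_BUF_LEN, and the error handling are all
assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200112L  /* expose gethostname() under strict C modes */

#include <stdio.h>
#include <unistd.h>

#define HOST_BUF_LEN 64  /* hypothetical size, not taken from MPICH */

/*
 * Hypothetical helper: writes "kvs_<hostname>_<pid>_<pgid>" into buf.
 * Returns 0 on success, -1 if the hostname lookup fails or the name
 * does not fit in len bytes.
 */
static int make_kvs_name(char *buf, size_t len, int pgid)
{
    char host[HOST_BUF_LEN];

    if (gethostname(host, sizeof(host)) != 0)
        return -1;
    /* gethostname() may not null-terminate when the name is truncated */
    host[sizeof(host) - 1] = '\0';

    int n = snprintf(buf, len, "kvs_%s_%d_%d", host, (int) getpid(), pgid);
    if (n < 0 || (size_t) n >= len)
        return -1;  /* encoding error or truncated output */

    return 0;
}
```

With a name like this, two processes that share a pid on different machines
would still differ in the hostname field, provided the hostnames themselves are
distinct and not truncated to the same prefix.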

Regards,
Xiaopeng

On Wed, Jun 5, 2019, 8:29 AM Raffenetti, Kenneth J. <raffenet at mcs.anl.gov>
wrote:

> We added a similar fix in https://github.com/pmodels/mpich/pull/2788.
> This was included in the MPICH 3.3 release.
>
> Ken
>
> On 6/4/19 11:27 PM, Xiaopeng Duan via discuss wrote:
> > Hi, MPICH experts,
> >
> > We are working on a dynamic master-worker flow using
> > MPI_Comm_connect/MPI_Comm_accept. In some cases, when the total number of
> > worker processes is large, two of them may get the same kvs_name, which
> > confuses the internal group identifiers. This was traced to the naming
> > convention in HYD_pmcd_pmi_allocate_kvs(), which considers only the
> > process id, while two processes on different machines may have the same
> > pid. I tried adding the host name (from gethostname() in unistd.h) to the
> > name, i.e. 'kvs_HOSTNAME_PID_pgid', and everything then works fine in our
> > testing.
> >
> > So I'm wondering whether this change is safe (we may need it for our
> > release) and whether it could go into an official MPICH release at some
> > point.
> >
> > Thank you very much.
> >
> > Xiaopeng
> >
> > _______________________________________________
> > discuss mailing list     discuss at mpich.org
> > To manage subscription options or unsubscribe:
> > https://lists.mpich.org/mailman/listinfo/discuss
> >
>