[mpich-discuss] A possible bug in HYD_pmcd_pmi_allocate_kvs
Raffenetti, Kenneth J.
raffenet at mcs.anl.gov
Thu Jun 6 08:25:21 CDT 2019
On 6/6/19 12:52 AM, Xiaopeng Duan wrote:
> Thank you, Ken.
>
> We were having another problem with 3.3, and will try it once we have
> fixed our issue.
>
> Just out of curiosity, why was a random number chosen for the fix
> instead of the hostname or address? It looks to me like the random
> number still has some chance of repeating (although very rarely),
> whereas hostnames and addresses should be unique within a system.
I had the same thought when looking back at this patch. Maybe Giuseppe
can share why that was added. I'm fairly sure it can be safely removed.
Ken
>
> Regards,
> Xiaopeng
>
> On Wed, Jun 5, 2019, 8:29 AM Raffenetti, Kenneth J.
> <raffenet at mcs.anl.gov> wrote:
>
> We added a similar fix in https://github.com/pmodels/mpich/pull/2788.
> This was included in the MPICH 3.3 release.
>
> Ken
>
> On 6/4/19 11:27 PM, Xiaopeng Duan via discuss wrote:
> > Hi, MPICH experts,
> >
> > We are working on a dynamic master-worker flow using
> > MPI_Comm_connect/MPI_Comm_accept. In some cases, when the total number
> > of worker processes is large, they may get the same kvs_name and
> > confuse the internal group identifiers. This was traced to the naming
> > convention in HYD_pmcd_pmi_allocate_kvs(), which considers only the
> > process id, while two processes on different machines may have the
> > same pid. I tried adding the host name (from gethostname() in
> > unistd.h) to the name, i.e. 'kvs_HOSTNAME_PID_pgid', and everything
> > works fine in our testing.
> >
> > So I'm wondering if this change is safe (we may need it for our
> > release) and if it would go into an official MPICH release at some
> > point.
> >
> > Thank you very much.
> >
> > Xiaopeng
> >
> > _______________________________________________
> > discuss mailing list     discuss at mpich.org
> > To manage subscription options or unsubscribe:
> > https://lists.mpich.org/mailman/listinfo/discuss
> >
>
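Below is a minimal sketch of the host-qualified naming scheme Xiaopeng
describes above, i.e. building a 'kvs_HOSTNAME_PID_pgid' string from
gethostname(), getpid(), and a process group id. It is only an
illustration of the idea; the build_kvs_name() helper, the buffer sizes,
and the exact output format are assumptions for this sketch and are not
taken from the MPICH/Hydra source.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical helper, not part of MPICH: format
     * "kvs_HOSTNAME_PID_pgid" into buf so the name stays unique even
     * when two processes on different hosts happen to share a pid. */
    static int build_kvs_name(char *buf, size_t len, int pgid)
    {
        char hostname[256];

        /* gethostname() can fail; fall back to a fixed string. */
        if (gethostname(hostname, sizeof(hostname)) != 0)
            strcpy(hostname, "unknown");
        hostname[sizeof(hostname) - 1] = '\0';  /* ensure termination */

        return snprintf(buf, len, "kvs_%s_%d_%d",
                        hostname, (int) getpid(), pgid);
    }

    int main(void)
    {
        char name[300];

        build_kvs_name(name, sizeof(name), 0);
        printf("%s\n", name);   /* e.g. kvs_node01_12345_0 */
        return 0;
    }

As Xiaopeng notes above, hostnames should be unique within a system, so
including them removes the cross-node pid collision, while a random
component only makes such a collision unlikely.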