[mpich-discuss] A possible bug in HYD_pmcd_pmi_allocate_kvs
xpduan12 at gmail.com
Thu Jun 6 15:34:13 CDT 2019
Thank you, Giuseppe and Ken.
On Thu, Jun 6, 2019, 2:56 PM Congiu, Giuseppe <gcongiu at anl.gov> wrote:
> Actually, the fix uses a combination of the hostname and a random number
> whose seed is a timestamp.
> I don’t remember exactly why we didn’t go with the hostname alone, but I
> suspect it is because the hostname might not be unique. Adding a random
> number seeded with a timestamp should be robust enough against collisions.
> > On Jun 6, 2019, at 8:25 AM, Raffenetti, Kenneth J. via discuss <discuss at mpich.org> wrote:
> > On 6/6/19 12:52 AM, Xiaopeng Duan wrote:
> >> Thank you, Ken.
> >> We are having another problem with 3.3 and will try it once we have
> >> fixed that issue.
> >> Just out of curiosity, why was a random number chosen for the fix
> >> instead of the hostname or address? It looks to me like a random number
> >> still has some chance of repeating (although very small), whereas
> >> hostnames and addresses should be unique within a system.
> > I had the same thought when looking back at this patch. Maybe Giuseppe
> > can share why that was added. I'm fairly sure it can be safely removed.
> > Ken
> >> Regards,
> >> Xiaopeng
> >> On Wed, Jun 5, 2019, 8:29 AM Raffenetti, Kenneth J.
> >> <raffenet at mcs.anl.gov> wrote:
> >> We added a similar fix in https://github.com/pmodels/mpich/pull/2788
> >> This was included in the MPICH 3.3 release.
> >> Ken
> >> On 6/4/19 11:27 PM, Xiaopeng Duan via discuss wrote:
> >>> Hi, MPICH experts,
> >>> We are working on a dynamic master-worker flow using
> >>> MPI_Comm_connect/MPI_Comm_accept. In some cases, when the total number
> >>> of worker processes is large, they may get the same kvs_name and
> >>> confuse the internal group identifiers. This was traced to the naming
> >>> convention in HYD_pmcd_pmi_allocate_kvs(), which considers only the
> >>> process id, while two processes on different machines may have the
> >>> same pid. I tried adding the host name (from gethostname() in
> >>> unistd.h) to the name, i.e. 'kvs_HOSTNAME_PID_pgid', and then
> >>> everything works fine in our testing.
> >>> So I'm wondering whether this change is safe (we may need it for our
> >>> release) and whether it will go into an official MPICH release at some
> >>> point.
> >>> Thank you very much.
> >>> Xiaopeng
> >>> _______________________________________________
> >>> discuss mailing list discuss at mpich.org
> >>> To manage subscription options or unsubscribe:
> >>> https://lists.mpich.org/mailman/listinfo/discuss
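For reference, the naming scheme discussed in this thread (hostname plus pid plus a random number seeded from a timestamp) could be sketched in C roughly as below. This is only an illustration under stated assumptions, not the code from the MPICH pull request: the helper name make_kvs_name, the buffer sizes, and the exact field order kvs_HOSTNAME_PID_RAND_PGID are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define KVS_NAME_BUF 512   /* hypothetical size chosen for this sketch */
#define HOST_NAME_BUF 256  /* hypothetical size chosen for this sketch */

/* Hypothetical helper (not the MPICH implementation): writes a name of the
 * form kvs_HOSTNAME_PID_RAND_PGID into buf.
 * Returns 0 on success, -1 on error or if buf is too small. */
static int make_kvs_name(char *buf, size_t len, int pgid)
{
    char host[HOST_NAME_BUF];
    static int seeded = 0;
    int n;

    assert(buf != NULL && len > 0);

    if (gethostname(host, sizeof(host)) != 0)
        return -1;
    /* gethostname() may leave the buffer unterminated on truncation */
    host[sizeof(host) - 1] = '\0';

    /* Seed once from a timestamp, as described in the thread. Processes
     * seeded in the same second draw the same number, so the hostname and
     * pid fields are still what separates them. */
    if (!seeded) {
        srand((unsigned) time(NULL));
        seeded = 1;
    }

    n = snprintf(buf, len, "kvs_%s_%d_%d_%d",
                 host, (int) getpid(), rand(), pgid);
    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

A caller would pass a buffer of KVS_NAME_BUF bytes and the process group id, and treat a return value of -1 as a failure to build the name. Keeping the pid in the name means two processes on the same host still get different names even if their random numbers happen to match.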