[mpich-discuss] A possible bug in HYD_pmcd_pmi_allocate_kvs

Congiu, Giuseppe gcongiu at anl.gov
Thu Jun 6 14:56:07 CDT 2019


Actually, the fix uses a combination of the hostname and a random number whose seed is a timestamp.
I don’t remember exactly why we didn’t go with the hostname alone, but I suspect it is because the hostname might not be
unique. Adding the random number with a timestamp seed should be robust enough against collisions.

Giuseppe
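
For readers of the archive, here is a minimal sketch of the naming scheme discussed in this thread (hostname, pid, and a random number seeded from a timestamp). It is not the code from the MPICH patch: the helper name make_kvs_name, the field order, and the buffer sizes are assumptions made for illustration only.

```c
#define _POSIX_C_SOURCE 200112L  /* exposes gethostname() under strict -std modes */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper, NOT the MPICH implementation: writes a name of the
 * form kvs_<hostname>_<pid>_<random>_<pgid> into buf.
 * Returns 0 on success, -1 on error or if buf is too small. */
static int make_kvs_name(char *buf, size_t len, int pgid)
{
    static int seeded = 0;
    char host[256];
    int n;

    assert(buf != NULL && len > 0);

    if (gethostname(host, sizeof(host)) != 0)
        return -1;
    /* gethostname() may leave the buffer unterminated if the name is truncated */
    host[sizeof(host) - 1] = '\0';

    if (!seeded) {
        /* Seed the generator once per process from the current timestamp */
        srand((unsigned int) time(NULL));
        seeded = 1;
    }

    n = snprintf(buf, len, "kvs_%s_%d_%d_%d",
                 host, (int) getpid(), rand(), pgid);
    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

Calling make_kvs_name(name, sizeof(name), 0) fills name with something like kvs_<hostname>_<pid>_<random>_0. Note that time(NULL) has one-second resolution, so processes that seed within the same second may draw the same random value; in this sketch the hostname and pid are what distinguish them in that case.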

> On Jun 6, 2019, at 8:25 AM, Raffenetti, Kenneth J. via discuss <discuss at mpich.org> wrote:
> 
> On 6/6/19 12:52 AM, Xiaopeng Duan wrote:
>> Thank you, Ken.
>> 
>> We were having another problem with 3.3, and will try it once we have fixed 
>> our issue.
>> 
>> Just out of curiosity, why was a random number chosen for the fix instead 
>> of the hostname or address? It looks to me like the random number still has some 
>> possibility of repeating (although very rarely), but hostnames and addresses 
>> should be unique in a system.
> 
> I had the same thought when looking back at this patch. Maybe Giuseppe 
> can share why that was added. I'm fairly sure it can be safely removed.
> 
> Ken
> 
>> 
>> Regards,
>> Xiaopeng
>> 
>> On Wed, Jun 5, 2019, 8:29 AM Raffenetti, Kenneth J. 
>> <raffenet at mcs.anl.gov> wrote:
>> 
>>    We added a similar fix in https://github.com/pmodels/mpich/pull/2788.
>>    This was included in the MPICH 3.3 release.
>> 
>>    Ken
>> 
>>    On 6/4/19 11:27 PM, Xiaopeng Duan via discuss wrote:
>>> Hi, MPICH experts,
>>> 
>>> We are working on a dynamic master-worker flow using
>>> mpi_comm_connect/mpi_comm_accept. In some cases, when the total number of
>>> worker processes is large, they may get the same kvs_name and confuse the
>>> internal group identifiers. This was traced to the naming convention in
>>> HYD_pmcd_pmi_allocate_kvs(), which considers only the process id, while two
>>> processes on different machines may have the same pid. I tried adding the
>>> host name (from gethostname() in unistd.h) to the name, i.e.
>>> 'kvs_HOSTNAME_PID_pgid', and then everything worked fine in our testing.
>>> 
>>> So I'm wondering if this change is safe (we may need it for our release)
>>> and if it would go into the official MPICH release some time.
>>> 
>>> Thank you very much.
>>> 
>>> Xiaopeng
>>> 
>>> _______________________________________________
>>> discuss mailing list discuss at mpich.org <mailto:discuss at mpich.org>
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss
>>> 
>> 


