[mpich-discuss] A possible bug in HYD_pmcd_pmi_allocate_kvs
Congiu, Giuseppe
gcongiu at anl.gov
Thu Jun 6 14:56:07 CDT 2019
Actually, the fix uses a combination of the hostname and a random number whose seed is a timestamp.
I don’t remember exactly why we didn’t go with the hostname alone, but I suspect it is because the hostname
might not be unique. Adding a random number seeded with a timestamp should be robust enough against collisions.
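
For illustration only, here is a minimal sketch of that naming idea. It is not the code from the actual patch: make_kvs_name(), KVS_NAME_MAX, HOST_MAX and the field order in the name are all hypothetical, and the only things taken from this thread are the ingredients (hostname, PID, pgid, and a random number seeded from a timestamp).

```c
#define _POSIX_C_SOURCE 200112L  /* for gethostname() under strict -std modes */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define KVS_NAME_MAX 512  /* hypothetical buffer size, not an MPICH constant */
#define HOST_MAX     256  /* hypothetical hostname buffer size */

/* Hypothetical helper: writes "kvs_<host>_<pid>_<pgid>_<rnd>" into buf.
 * Returns 0 on success, -1 if the hostname lookup fails or the name
 * does not fit in len bytes. */
static int make_kvs_name(char *buf, size_t len, int pgid)
{
    static int seeded = 0;
    char host[HOST_MAX];
    int n;

    assert(buf != NULL && len > 0);

    if (gethostname(host, sizeof(host)) != 0)
        return -1;
    /* gethostname() need not NUL-terminate a truncated name. */
    host[sizeof(host) - 1] = '\0';

    /* Seed the generator once per process from the current time. */
    if (!seeded) {
        srand((unsigned) time(NULL));
        seeded = 1;
    }

    n = snprintf(buf, len, "kvs_%s_%d_%d_%d",
                 host, (int) getpid(), pgid, rand());

    /* snprintf() returns the length it wanted to write; treat
     * truncation as an error so names are never silently clipped. */
    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

A caller would pass a buffer of KVS_NAME_MAX bytes and the process group id. Two processes on different hosts that share a PID then differ in the hostname field, and the random field covers the case where hostnames are not unique.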
Giuseppe
> On Jun 6, 2019, at 8:25 AM, Raffenetti, Kenneth J. via discuss <discuss at mpich.org> wrote:
>
> On 6/6/19 12:52 AM, Xiaopeng Duan wrote:
>> Thank you, Ken.
>>
>> We were having another problem with 3.3, and will try it once we have
>> fixed our issue.
>>
>> Just out of curiosity, why was a random number chosen for the fix instead
>> of the hostname or address? It looks to me like the random number still has
>> some chance of repeating (although very rarely), but hostnames and addresses
>> should be unique in a system.
>
> I had the same thought when looking back at this patch. Maybe Giuseppe
> can share why that was added. I'm fairly sure it can be safely removed.
>
> Ken
>
>>
>> Regards,
>> Xiaopeng
>>
>> On Wed, Jun 5, 2019, 8:29 AM Raffenetti, Kenneth J.
>> <raffenet at mcs.anl.gov> wrote:
>>
>> We added a similar fix in https://github.com/pmodels/mpich/pull/2788.
>> This was included in the MPICH 3.3 release.
>>
>> Ken
>>
>> On 6/4/19 11:27 PM, Xiaopeng Duan via discuss wrote:
>>> Hi, MPICH experts,
>>>
>>> We are working on a dynamic master-worker flow using
>>> mpi_comm_connect/mpi_comm_accept. In some cases, when the total number of
>>> worker processes is large, they may get the same kvs_name and confuse the
>>> internal group identifiers. This was traced to the naming convention in
>>> HYD_pmcd_pmi_allocate_kvs(), which considers only the process ID, while two
>>> processes on different machines may have the same PID. I tried adding the
>>> host name (from gethostname() in unistd.h) to the name, i.e.
>>> 'kvs_HOSTNAME_PID_pgid', and then everything works fine in our testing.
>>>
>>> So I'm wondering whether this change is safe (we may need it for our release)
>>> and whether it would go into an official MPICH release at some point.
>>>
>>> Thank you very much.
>>>
>>> Xiaopeng
>>>
>>> _______________________________________________
>>> discuss mailing list discuss at mpich.org
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss
>>>
>>