[mpich-discuss] hydra, stdin close(), and SLURM

Aaron Knister aaron.s.knister at nasa.gov
Wed Jul 29 16:14:53 CDT 2015


Thanks Pavan!

-Aaron
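
For anyone finding this thread in the archives: the descriptor reuse
described in the original report quoted below can be reproduced without
SLURM. Here is a minimal Python sketch, assuming a POSIX system. The
child script and the function name fd_after_closing_stdin are made up
for illustration, and os.devnull stands in for the sssd cache file; none
of this is code from hydra, srun, or sssd.

```python
import subprocess
import sys

# Child program: close stdin, then open an unrelated file and report its fd.
CHILD = r"""
import os
os.close(0)                            # stdin is closed before the real work starts
fd = os.open(os.devnull, os.O_RDONLY)  # stand-in for a library opening a cache file
print(fd)
"""


def fd_after_closing_stdin() -> int:
    """Return the descriptor a fresh open() receives once fd 0 is closed."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.DEVNULL,  # the child starts with a valid fd 0
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print(fd_after_closing_stdin())
```

On a POSIX system the function should return 0, because open() hands out
the lowest unused descriptor. A program that later checks whether fd 0
is open would then treat that file as stdin, which is the situation
"--input none" sidesteps.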

On 7/28/15 3:23 PM, Balaji, Pavan wrote:
> Hi Aaron,
>
> I've committed it to mpich/master:
>
> 	http://git.mpich.org/mpich.git/commitdiff/6b41775b2056ff18b3c28aab71764e35904c00fa
>
> Thanks for the contribution.
>
> This should be in tonight's nightlies:
>
> 	http://www.mpich.org/static/downloads/nightly/master/mpich/
>
> ... and in the upcoming mpich-3.2rc1 release.
>
>    -- Pavan
>
>
>
>
> On 7/27/15, 1:40 PM, "Balaji, Pavan" <balaji at anl.gov> wrote:
>
>> Hi Aaron,
>>
>>
>>
>> Please send the patch to me directly.
>>
>> General guidelines as to the kind of patches we ask for:
>>
>> 	https://wiki.mpich.org/mpich/index.php/Version_Control_Systems_101
>>
>> You can ignore the git workflow related text, which is for our internal testing.  I'll take care of that for you.
>>
>> Thanks,
>>
>>   -- Pavan
>>
>> On 7/27/15, 1:36 PM, "Aaron Knister" <aaron.s.knister at nasa.gov> wrote:
>>
>>> Hi Pavan,
>>>
>>> I see your reply in the archives but it didn't make it to my inbox so
>>> I'm replying to my post. I don't disagree with you about the error
>>> being in the SLURM code, but I'm not sure how one would prevent this
>>> reliably. SLURM has no expectation that an external library will open
>>> something at file descriptor 0 before it reaches the point in the code
>>> where it's ready to poll for stdin. Do you have any suggestions?
>>>
>>> It's been a long while since I've done a git e-mail patch so it might
>>> take me a bit to figure out. Should I send the patch to the list or to
>>> you directly?
>>>
>>> Thanks!
>>>
>>> -Aaron
>>>
>>> On 7/25/15 10:26 PM, Aaron Knister wrote:
>>>> I sent this off to the mvapich list yesterday and it was suggested I
>>>> raise it here since this is the upstream project:
>>>>
>>>> This is a bit of a cross post from a thread I started on the slurm dev
>>>> list: http://article.gmane.org/gmane.comp.distributed.slurm.devel/8176
>>>>
>>>> I'd like to get feedback on the idea that "--input none" be passed to
>>>> srun when using the SLURM hydra bootstrap mechanism. I figured it
>>>> would be inserted somewhere around here
>>>> http://trac.mpich.org/projects/mpich/browser/src/pm/hydra/tools/bootstrap/external/slurm_launch.c#L98.
>>>>
>>>>
>>>> Without this argument I'm getting spurious job aborts and confusing
>>>> errors. The gist of it is mpiexec.hydra closes stdin before it exec's
>>>> srun. srun then (possibly via the munge libraries) calls some function
>>>> that does a lookup via NSS. We use sssd for AAA so libnss_sss will
>>>> handle this request. Part of the caching mechanism sssd uses will
>>>> cause the library to open() the cache file. The lowest fd available is
>>>> 0 so the cache file is opened on fd 0. srun then believes it's got
>>>> stdin attached and it causes the issues outlined in the slurm dev
>>>> post. I think passing "--input none" is the right thing to do here
>>>> since hydra has in fact closed stdin to srun. I tested this via the
>>>> HYDRA_LAUNCHER_EXTRA_ARGS environment variable and it does resolve the
>>>> errors I described.
>>>>
>>>> Thanks!
>>>> -Aaron
>>>>
>>> -- 
>>> Aaron Knister
>>> NASA Center for Climate Simulation (Code 606.2)
>>> Goddard Space Flight Center
>>> (301) 286-2776
>>>
>>>
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776

