[mpich-discuss] hydra, stdin close(), and SLURM

Balaji, Pavan balaji at anl.gov
Tue Jul 28 14:23:54 CDT 2015


Hi Aaron,

I've committed it to mpich/master:

	http://git.mpich.org/mpich.git/commitdiff/6b41775b2056ff18b3c28aab71764e35904c00fa

Thanks for the contribution.

This should be in tonight's nightlies:

	http://www.mpich.org/static/downloads/nightly/master/mpich/

... and in the upcoming mpich-3.2rc1 release.

  -- Pavan




On 7/27/15, 1:40 PM, "Balaji, Pavan" <balaji at anl.gov> wrote:

>Hi Aaron,
>
>
>
>Please send the patch to me directly.
>
>General guidelines as to the kind of patches we ask for:
>
>	https://wiki.mpich.org/mpich/index.php/Version_Control_Systems_101
>
>You can ignore the git workflow related text, which is for our internal testing.  I'll take care of that for you.
>
>Thanks,
>
>  -- Pavan
>
>On 7/27/15, 1:36 PM, "Aaron Knister" <aaron.s.knister at nasa.gov> wrote:
>
>>Hi Pavan,
>>
>>I see your reply in the archives but it didn't make it to my inbox so 
>>I'm replying to my post. I don't disagree with you about the error 
>>being in the SLURM code, but I'm not sure how one would prevent this 
>>reliably. SLURM has no expectation that an external library will open 
>>something at file descriptor 0 before it reaches the point in the code 
>>where it's ready to poll for stdin. Do you have any suggestions?
>>
>>It's been a long while since I've done a git e-mail patch so it might 
>>take me a bit to figure out. Should I send the patch to the list or to 
>>you directly?
>>
>>Thanks!
>>
>>-Aaron
>>
>>On 7/25/15 10:26 PM, Aaron Knister wrote:
>>> I sent this off to the mvapich list yesterday and it was suggested I 
>>> raise it here since this is the upstream project:
>>>
>>> This is a bit of a cross post from a thread I started on the slurm dev 
>>> list: http://article.gmane.org/gmane.comp.distributed.slurm.devel/8176
>>>
>>> I'd like to get feedback on the idea that "--input none" be passed to 
>>> srun when using the SLURM hydra bootstrap mechanism. I figured it 
>>> would be inserted somewhere around here 
>>> http://trac.mpich.org/projects/mpich/browser/src/pm/hydra/tools/bootstrap/external/slurm_launch.c#L98. 
>>>
>>>
>>> Without this argument I'm getting spurious job aborts and confusing 
>>> errors. The gist of it is mpiexec.hydra closes stdin before it exec's 
>>> srun. srun then (possibly via the munge libraries) calls some function 
>>> that does a lookup via nss. We use sssd for AAA so libnss_sssd will 
>>> handle this request. Part of the caching mechanism sssd uses will 
>>> cause the library to open() the cache file. The lowest fd available is 
>>> 0 so the cache file is opened on fd 0. srun then believes it's got 
>>> stdin attached and it causes the issues outlined in the slurm dev 
>>> post. I think passing "--input none" is the right thing to do here 
>>> since hydra has in fact closed stdin to srun. I tested this via the 
>>> HYDRA_LAUNCHER_EXTRA_ARGS environment variable and it does resolve the 
>>> errors I described.
>>>
>>> Thanks!
>>> -Aaron
>>>
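The descriptor reuse described in the message above follows from the POSIX
rule that open() returns the lowest-numbered unused file descriptor. Below is
a minimal sketch of that behaviour, assuming a POSIX system and a
single-threaded process. The function name is hypothetical, and os.devnull
only stands in for the sssd cache file; none of this is taken from the hydra,
SLURM, or sssd sources.

```python
import os


def first_open_after_closing_stdin(path=os.devnull):
    """Close fd 0, open an unrelated file, and report where it landed.

    Returns (fd, stdin_looks_open). `path` is a stand-in for whatever file a
    library happens to open first (the cache file in the report above).
    """
    try:
        saved = os.dup(0)      # keep a copy so stdin can be restored later
    except OSError:
        saved = None           # fd 0 was already closed on entry
    else:
        os.close(0)            # mimic a launcher closing stdin before exec

    try:
        # The "library" opens its file; the kernel hands out the lowest
        # unused descriptor, which is now 0.
        fd = os.open(path, os.O_RDONLY)

        # A later "is stdin attached?" style check sees a valid fd 0.
        try:
            os.fstat(0)
            stdin_looks_open = True
        except OSError:
            stdin_looks_open = False

        os.close(fd)
    finally:
        if saved is not None:
            os.dup2(saved, 0)  # put the real stdin back on fd 0
            os.close(saved)

    return fd, stdin_looks_open


if __name__ == "__main__":
    fd, looks_open = first_open_after_closing_stdin()
    print("file opened on fd", fd, "- fd 0 looks open:", looks_open)
```

The sketch shows only why a program that later polls fd 0 can mistake an
unrelated file for stdin. Passing "--input none" sidesteps this because srun
is told not to read stdin at all.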
>>
>>-- 
>>Aaron Knister
>>NASA Center for Climate Simulation (Code 606.2)
>>Goddard Space Flight Center
>>(301) 286-2776
>>
>>
>_______________________________________________
>discuss mailing list     discuss at mpich.org
>To manage subscription options or unsubscribe:
>https://lists.mpich.org/mailman/listinfo/discuss

