[mpich-discuss] hydra, stdin close(), and SLURM

Aaron Knister aaron.s.knister at nasa.gov
Mon Jul 27 13:36:07 CDT 2015


Hi Pavan,

I see your reply in the archives, but it didn't make it to my inbox, so 
I'm replying to my own post. I don't disagree with you that the error 
is in the SLURM code, but I'm not sure how one would prevent this 
reliably. SLURM has no expectation that an external library will open 
something at file descriptor 0 before it reaches the point in the code 
where it's ready to poll for stdin. Do you have any suggestions?
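
For anyone who wants to see the descriptor reuse in isolation, here is a 
minimal sketch in Python (not SLURM or hydra code). It assumes a POSIX 
system and a single-threaded process; the function name and the use of 
os.devnull as a stand-in for the sssd cache file are made up for 
illustration.

```python
import os


def fd_after_closing_stdin(path=os.devnull):
    """Close fd 0, open `path`, and return the descriptor that open() yields."""
    try:
        saved = os.dup(0)   # keep a copy of stdin so it can be restored later
    except OSError:
        saved = None        # stdin was already closed; nothing to save
    else:
        os.close(0)         # mimic the launcher closing stdin before exec

    try:
        # Stand-in for a library opening its cache file: POSIX hands back
        # the lowest-numbered descriptor that is currently unused.
        fd = os.open(path, os.O_RDONLY)
        os.close(fd)
    finally:
        if saved is not None:
            os.dup2(saved, 0)   # put the original stdin back on fd 0
            os.close(saved)
    return fd


if __name__ == "__main__":
    print(fd_after_closing_stdin())
```

The returned descriptor should be 0, which is why code that later treats 
fd 0 as stdin ends up polling an unrelated file.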

It's been a long while since I've done a git e-mail patch so it might 
take me a bit to figure out. Should I send the patch to the list or to 
you directly?

Thanks!

-Aaron

On 7/25/15 10:26 PM, Aaron Knister wrote:
> I sent this off to the mvapich list yesterday and it was suggested I 
> raise it here since this is the upstream project:
>
> This is a bit of a cross post from a thread I started on the slurm dev 
> list: http://article.gmane.org/gmane.comp.distributed.slurm.devel/8176
>
> I'd like to get feedback on the idea that "--input none" be passed to 
> srun when using the SLURM hydra bootstrap mechanism. I figured it 
> would be inserted somewhere around here 
> http://trac.mpich.org/projects/mpich/browser/src/pm/hydra/tools/bootstrap/external/slurm_launch.c#L98. 
>
>
> Without this argument I'm getting spurious job aborts and confusing 
> errors. The gist of it is that mpiexec.hydra closes stdin before it 
> execs srun. srun then (possibly via the munge libraries) calls some 
> function that does a lookup via NSS. We use sssd for AAA, so 
> libnss_sssd handles this request. As part of its caching mechanism, 
> the library open()s the sssd cache file. The lowest fd available is 
> 0, so the cache file is opened on fd 0. srun then believes it has 
> stdin attached, which causes the issues outlined in the slurm dev 
> post. I think passing "--input none" is the right thing to do here 
> since hydra has in fact closed stdin to srun. I tested this via the 
> HYDRA_LAUNCHER_EXTRA_ARGS environment variable and it does resolve the 
> errors I described.
>
> Thanks!
> -Aaron
>
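
As a rough sketch of the workaround described above, the following Python 
helper builds an environment with "--input none" appended to 
HYDRA_LAUNCHER_EXTRA_ARGS. The helper name is hypothetical, and the 
mpiexec.hydra command line in the trailing comment is only a placeholder, 
not a tested invocation.

```python
import os


def hydra_env_with_input_none(base_env=None):
    """Return a copy of the environment with '--input none' in the extra args."""
    env = dict(os.environ if base_env is None else base_env)
    extra = env.get("HYDRA_LAUNCHER_EXTRA_ARGS", "").split()

    # Leave the value alone if an --input option is already present.
    has_input = any(a == "--input" or a.startswith("--input=") for a in extra)
    if not has_input:
        extra += ["--input", "none"]

    env["HYDRA_LAUNCHER_EXTRA_ARGS"] = " ".join(extra)
    return env


# Placeholder usage (command and arguments are assumptions):
#   import subprocess
#   subprocess.run(["mpiexec.hydra", "-n", "4", "./a.out"],
#                  env=hydra_env_with_input_none())
```

Existing extra arguments are preserved, so this only adds the srun option 
when it is missing.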

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776

