[mpich-discuss] hydra, stdin close(), and SLURM
Balaji, Pavan
balaji at anl.gov
Mon Jul 27 13:40:17 CDT 2015
Hi Aaron,
Please send the patch to me directly.
General guidelines as to the kind of patches we ask for:
https://wiki.mpich.org/mpich/index.php/Version_Control_Systems_101
You can ignore the git workflow related text, which is for our internal testing. I'll take care of that for you.
Thanks,
-- Pavan
On 7/27/15, 1:36 PM, "Aaron Knister" <aaron.s.knister at nasa.gov> wrote:
>Hi Pavan,
>
>I see your reply in the archives but it didn't make it to my inbox so
>I'm replying to my post. I don't disagree with you about the error
>being in the SLURM code, but I'm not sure how one would prevent this
>reliably. SLURM has no expectation that an external library will open
>something at file descriptor 0 before it reaches the point in the code
>where it's ready to poll for stdin. Do you have any suggestions?
>
>It's been a long while since I've done a git e-mail patch so it might
>take me a bit to figure out. Should I send the patch to the list or to
>you directly?
>
>Thanks!
>
>-Aaron
>
>On 7/25/15 10:26 PM, Aaron Knister wrote:
>> I sent this off to the mvapich list yesterday and it was suggested I
>> raise it here since this is the upstream project:
>>
>> This is a bit of a cross post from a thread I started on the slurm dev
>> list: http://article.gmane.org/gmane.comp.distributed.slurm.devel/8176
>>
>> I'd like to get feedback on the idea that "--input none" be passed to
>> srun when using the SLURM hydra bootstrap mechanism. I figured it
>> would be inserted somewhere around here
>> http://trac.mpich.org/projects/mpich/browser/src/pm/hydra/tools/bootstrap/external/slurm_launch.c#L98.
>>
>>
>> Without this argument I'm getting spurious job aborts and confusing
>> errors. The gist of it is mpiexec.hydra closes stdin before it exec's
>> srun. srun then (possibly via the munge libraries) calls some function
>> that does a lookup via NSS. We use sssd for AAA so libnss_sssd will
>> handle this request. Part of the caching mechanism sssd uses will
>> cause the library to open() the cache file. The lowest fd available is
>> 0 so the cache file is opened on fd 0. srun then believes it's got
>> stdin attached and it causes the issues outlined in the slurm dev
>> post. I think passing "--input none" is the right thing to do here
>> since hydra has in fact closed stdin to srun. I tested this via the
>> HYDRA_LAUNCHER_EXTRA_ARGS environment variable and it does resolve the
>> errors I described.
>>
>> Thanks!
>> -Aaron
>>
>
>--
>Aaron Knister
>NASA Center for Climate Simulation (Code 606.2)
>Goddard Space Flight Center
>(301) 286-2776
>
>
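For readers following the thread, the descriptor reuse Aaron describes can be reproduced in a few lines of C. This is only a sketch under stated assumptions: the function names fd_after_closing_stdin and ensure_std_fds_open are hypothetical and do not come from hydra, SLURM, or sssd. The first function shows that once fd 0 is closed, the next open() lands on fd 0. The second shows one common defensive pattern (pointing any closed standard descriptor at /dev/null before other code runs). Nobody in this thread proposed it, and it is not a description of what SLURM does.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Close stdin, open `path`, and return the descriptor open() handed back.
 * POSIX requires open() to return the lowest unused descriptor, so with
 * fd 0 free the file lands on fd 0. The original stdin is restored before
 * returning. Hypothetical helper, for illustration only. */
int fd_after_closing_stdin(const char *path)
{
    int saved = dup(STDIN_FILENO);   /* -1 if stdin was already closed */
    close(STDIN_FILENO);

    int fd = open(path, O_RDONLY);   /* stands in for the library's cache open() */
    int result = fd;
    if (fd >= 0)
        close(fd);

    if (saved >= 0) {                /* put the real stdin back */
        dup2(saved, STDIN_FILENO);
        close(saved);
    }
    return result;
}

/* Hypothetical guard: if fd 0, 1, or 2 is closed, attach it to /dev/null so
 * that a later open() cannot land on a standard descriptor.
 * Returns 0 on success, -1 on failure. */
int ensure_std_fds_open(void)
{
    for (int fd = 0; fd <= 2; fd++) {
        if (fcntl(fd, F_GETFD) == -1 && errno == EBADF) {
            int nfd = open("/dev/null", O_RDWR);
            if (nfd < 0)
                return -1;
            if (nfd != fd) {         /* move it onto the missing slot */
                if (dup2(nfd, fd) < 0) {
                    close(nfd);
                    return -1;
                }
                close(nfd);
            }
        }
    }
    return 0;
}
```

Calling fd_after_closing_stdin("/dev/null") from a small main() should report 0, which is the situation in which srun mistakes the cache file for stdin. If ensure_std_fds_open() runs first in a process started with stdin closed, later open() calls return descriptors above 2. Passing "--input none" to srun, as suggested in the quoted message, avoids the problem from the launcher side instead.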