[mpich-discuss] MPICH Hydra mpiexec and Slurm job allocation
Stefan
r900 at mail.com
Wed Dec 4 17:02:21 CST 2019
Oh sorry, I thought I mentioned that. I'm using the latest stable release, 3.3.2.
/Stefan
On December 4, 2019 11:43:04 PM GMT+01:00, "Raffenetti, Kenneth J." <raffenet at mcs.anl.gov> wrote:
>Which version of MPICH are you using?
>
>Ken
>
>On 12/4/19 4:28 PM, Stefan via discuss wrote:
>> Hi,
>>
>> I'm having some issues making mpirun/mpiexec play nicely with Slurm
>> allocations. I'm using Slurm 19.05.4, and have configured MPICH with:
>> --enable-shared --enable-static --with-slurm=/sw/slurm/19.05.4 \
>> --with-pm=hydra
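
For reference, assuming this Hydra build supports the usual -info option, the
build summary is a quick way to confirm whether Slurm support was actually
compiled in (look for slurm among the listed launchers/resource managers):

$ mpiexec.hydra -info | grep -i slurm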
>>
>> Now I request resources from Slurm with:
>> $ salloc -N 2 --ntasks-per-node 4
>>
>> Then when I try to run a test binary:
>> $ mpiexec.hydra ./mpich_hello
>> Error: node list format not recognized. Try using '-hosts=<hostnames>'.
>> Aborted (core dumped)
>>
>> When I do the same with OpenMPI's mpirun/mpiexec it runs on the allocated
>> nodes. Am I missing something, or does MPICH simply not support this use case?
>>
>> Currently I'm working around this by using a script to translate Slurm
>> node allocations into a host list, and I run it like this:
>> $ mpiexec.hydra -hosts $(mpich-host) ./mpich_hello
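
The mpich-host script itself isn't shown in the thread; a minimal sketch of such
a helper (an assumption, not the poster's actual script, relying on the standard
scontrol utility and the SLURM_NODELIST variable set inside the allocation) could be:

#!/bin/sh
# mpich-host (hypothetical sketch): expand the compressed Slurm node list,
# e.g. "node-b[01-02]", into the comma-separated form that Hydra's -hosts expects.
scontrol show hostnames "$SLURM_NODELIST" | paste -sd, -

For an salloc -N 2 allocation this prints something like node-b01,node-b02,
which is the host-list format the -hosts option accepts.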
>>
>> That works fine, but I suppose this workaround should not be necessary.
>> Here is ltrace output showing that mpiexec tries to process some Slurm-related
>> environment variables but apparently fails to do so:
>> https://paste.ubuntu.com/p/327tGrTzq5/
>>
>> I've also tried with salloc -N 1 -n 1, so that the environment variables
>> are simpler, e.g.
>> SLURM_NODELIST=node-b01
>> SLURM_TASKS_PER_NODE=1
>> but that did not change the way mpiexec fails.
>>
>> /Stefan