[mpich-discuss] MPICH Hydra mpiexec and Slurm job allocation
r900 at mail.com
Wed Dec 4 16:28:51 CST 2019
Hi,
I'm having some issues making mpirun/mpiexec play nicely with Slurm
allocations. I'm using Slurm 19.05.4 and have configured MPICH with:
--enable-shared --enable-static --with-slurm=/sw/slurm/19.05.4 \
--with-pm=hydra
Now I request resources from Slurm with:
$ salloc -N 2 --ntasks-per-node 4
Then when I try to run a test binary:
$ mpiexec.hydra ./mpich_hello
Error: node list format not recognized. Try using '-hosts=<hostnames>'.
Aborted (core dumped)
When I do the same with Open MPI's mpirun/mpiexec, it runs on the allocated
nodes. Am I missing something, or does MPICH simply not support this use case?
Currently I'm working around this with a script that translates the Slurm
node allocation into a host list, and running like this:
$ mpiexec.hydra -hosts $(mpich-host) ./mpich_hello
That works fine, but I suppose this workaround should not be necessary.
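For reference, the workaround boils down to something like the following sketch (roughly what my script does; the function name here is just for illustration). It uses scontrol, which ships with Slurm, to expand the compact node list:

```shell
# Sketch of the Slurm-to-MPICH host list translation: expand the
# compact node list in SLURM_NODELIST, e.g. "node-b[01-02]", into
# one hostname per line with scontrol, then join them with commas
# as expected by mpiexec.hydra's -hosts option.
mpich_host() {
    scontrol show hostnames "$SLURM_NODELIST" | paste -sd, -
}
```

Used as: mpiexec.hydra -hosts $(mpich_host) ./mpich_hello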
Here is ltrace output showing that mpiexec tries to process some
Slurm-related environment variables but apparently fails to do so:
https://paste.ubuntu.com/p/327tGrTzq5/
I've also tried with salloc -N 1 -n 1, so that the environment variables
are simpler, e.g.
SLURM_NODELIST=node-b01
SLURM_TASKS_PER_NODE=1
but that did not change the way mpiexec fails.
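For what it's worth, the compact per-node syntax that mpiexec would have to handle (with my original salloc above, SLURM_TASKS_PER_NODE would be "4(x2)") does not look hard to parse; a minimal shell sketch, with the function name chosen just for illustration:

```shell
# Sketch: expand Slurm's compact SLURM_TASKS_PER_NODE syntax,
# e.g. "4(x2),2" -> one task count per node (4, 4, 2), which is
# the kind of parsing mpiexec apparently attempts and fails at.
expand_tasks_per_node() {
    echo "$1" | tr ',' '\n' | while IFS= read -r part; do
        case "$part" in
            *"(x"*)
                count=${part%%(*}                 # count before "(x"
                repeat=${part#*x}; repeat=${repeat%)}  # repeat inside "(x...)"
                i=0
                while [ "$i" -lt "$repeat" ]; do
                    echo "$count"
                    i=$((i + 1))
                done
                ;;
            *) echo "$part" ;;    # plain count, no repeat suffix
        esac
    done
}
```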
/Stefan