[mpich-discuss] MPICH Hydra mpiexec and Slurm job allocation

r900 at mail.com r900 at mail.com
Wed Dec 4 16:28:51 CST 2019


Hi,

I'm having trouble getting mpirun/mpiexec to play nicely with Slurm
allocations. I'm using Slurm 19.05.4 and have configured MPICH with:
 --enable-shared --enable-static --with-slurm=/sw/slurm/19.05.4 \
 --with-pm=hydra

Now I request resources from Slurm with:
 $ salloc -N 2 --ntasks-per-node 4

Then when I try to run a test binary:
 $ mpiexec.hydra ./mpich_hello
 Error: node list format not recognized. Try using '-hosts=<hostnames>'.
 Aborted (core dumped)

When I do the same with OpenMPI's mpirun/mpiexec it runs on the allocated
nodes. Am I missing something, or does MPICH simply not support this use case?

Currently I'm working around this with a script that translates the Slurm
node allocation into a host list, and running like this:
 $ mpiexec.hydra -hosts $(mpich-host) ./mpich_hello

That works fine, but I assume this workaround should not be necessary.
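For reference, a minimal sketch of what such a translation script could look like (the name "mpich-host" above is my own; this sketch assumes it just expands Slurm's compact nodelist syntax into the comma-separated form that Hydra's -hosts option accepts):

```shell
#!/bin/sh
# Sketch of a "mpich-host"-style helper (hypothetical name).
# Slurm's scontrol expands the compact nodelist syntax, e.g.
# "node-[01-02]" -> one hostname per line; paste joins them
# with commas for mpiexec.hydra's -hosts option.
scontrol show hostnames "$SLURM_NODELIST" | paste -sd, -
```

With salloc -N 2 this would print something like node-b01,node-b02.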
Here is ltrace output showing that mpiexec tries to process some
Slurm-related environment variables but apparently fails:
 https://paste.ubuntu.com/p/327tGrTzq5/

I've also tried with salloc -N 1 -n 1, so that the environment variables
are simpler, e.g.
 SLURM_NODELIST=node-b01
 SLURM_TASKS_PER_NODE=1
but that did not change the way mpiexec fails.

/Stefan
