[mpich-discuss] MPICH Hydra mpiexec and Slurm job allocation

Raffenetti, Kenneth J. raffenet at mcs.anl.gov
Wed Dec 4 16:43:04 CST 2019


Which version of MPICH are you using?

Ken

On 12/4/19 4:28 PM, Stefan via discuss wrote:
> Hi,
> 
> I'm having some trouble making mpirun/mpiexec play nicely with Slurm
> allocations. I'm using Slurm 19.05.4, and have configured MPICH with:
>   --enable-shared --enable-static --with-slurm=/sw/slurm/19.05.4 \
>   --with-pm=hydra
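> 
> (As a quick sanity check, and assuming a stock Hydra build, the build
> configuration can be inspected with:
>   $ mpiexec.hydra -info
> "slurm" should appear among the available launchers and resource
> management kernels.)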
> 
> Now I request resources from Slurm with:
>   $ salloc -N 2 --ntasks-per-node 4
> 
> Then when I try to run a test binary:
>   $ mpiexec.hydra ./mpich_hello
>   Error: node list format not recognized. Try using '-hosts=<hostnames>'.
>   Aborted (core dumped)
> 
> When I do the same with Open MPI's mpirun/mpiexec, it runs on the allocated
> nodes. Am I missing something, or does MPICH simply not support this use case?
> 
> Currently I'm working around this with a script that translates the Slurm
> node allocation into a host list, and running it like this:
>   $ mpiexec.hydra -hosts $(mpich-host) ./mpich_hello
> 
> That works fine, but I suppose this workaround should not be necessary.
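> 
> For reference, a minimal sketch of what such a translation helper might
> look like (the name mpich-host is just my local script; this assumes
> scontrol is on PATH and SLURM_NTASKS_PER_NODE is set):
> 
>   #!/bin/sh
>   # Expand Slurm's compressed node list into one hostname per line,
>   # append the per-node task count, and join with commas, producing
>   # e.g. "node-b01:4,node-b02:4" for Hydra's -hosts option.
>   scontrol show hostnames "$SLURM_NODELIST" \
>     | sed "s/$/:${SLURM_NTASKS_PER_NODE:-1}/" \
>     | paste -sd, -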
> Here is ltrace output showing that mpiexec tries to process some
> Slurm-related environment variables but apparently fails to do so:
>   https://paste.ubuntu.com/p/327tGrTzq5/
> 
> I've also tried with salloc -N 1 -n 1, so that the environment variables
> are simpler, e.g.
>   SLURM_NODELIST=node-b01
>   SLURM_TASKS_PER_NODE=1
> but that did not change the way mpiexec fails.
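> 
> (For completeness: in the two-node case those variables arrive in Slurm's
> compressed notation, e.g., with hypothetical node names,
>   SLURM_NODELIST=node-b[01-02]
>   SLURM_TASKS_PER_NODE=4(x2)
> so the failure does not seem to be limited to the bracketed/multiplier
> format.)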
> 
> /Stefan
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
> 

