[mpich-discuss] MPICH Hydra mpiexec and Slurm job allocation

Stefan r900 at mail.com
Wed Dec 4 17:02:21 CST 2019


Oh, sorry, I thought I mentioned that. I'm using the latest stable release, 3.3.2.

/Stefan

On December 4, 2019 11:43:04 PM GMT+01:00, "Raffenetti, Kenneth J." <raffenet at mcs.anl.gov> wrote:
>Which version of MPICH are you using?
>
>Ken
>
>On 12/4/19 4:28 PM, Stefan via discuss wrote:
>> Hi,
>> 
>> I'm having some issues making mpirun/mpiexec play nicely with Slurm
>> allocations. I'm using Slurm 19.05.4, and have configured MPICH with:
>>   --enable-shared --enable-static --with-slurm=/sw/slurm/19.05.4 \
>>   --with-pm=hydra
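>> 
>> (A quick sanity check that a Hydra build like this actually picked up
>> the Slurm support is something along the lines of
>>   $ mpiexec.hydra -info | grep -i slurm
>> which should list slurm among the available launchers and resource
>> management kernels.)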
>> 
>> Now I request resources from Slurm with:
>>   $ salloc -N 2 --ntasks-per-node 4
>> 
>> Then when I try to run a test binary:
>>   $ mpiexec.hydra ./mpich_hello
>>   Error: node list format not recognized. Try using '-hosts=<hostnames>'.
>>   Aborted (core dumped)
>> 
>> When I do the same with OpenMPI's mpirun/mpiexec it runs on the allocated
>> nodes. Am I missing something, or does MPICH simply not support this use case?
>> 
>> Currently I'm working around this by using a script to translate Slurm
>> node allocations into a host list and running it like this:
>>   $ mpiexec.hydra -hosts $(mpich-host) ./mpich_hello
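>> 
>> (A helper like mpich-host can be as small as the following sketch,
>> assuming scontrol is on the PATH; it just expands SLURM_NODELIST and
>> joins the hostnames with commas:)
>>   #!/bin/sh
>>   # Expand Slurm's compressed node list into one hostname per line,
>>   # then join the lines with commas so the result can be fed to -hosts.
>>   scontrol show hostnames "$SLURM_NODELIST" | paste -sd, -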
>> 
>> That works fine, but I suppose this workaround should not be necessary.
>> Here is ltrace output which shows that mpiexec tries to process some
>> Slurm-related environment variables but apparently fails to do so:
>>   https://paste.ubuntu.com/p/327tGrTzq5/
>> 
>> I've also tried with salloc -N 1 -n 1, so that the environment variables
>> are simpler, e.g.
>>   SLURM_NODELIST=node-b01
>>   SLURM_TASKS_PER_NODE=1
>> but that did not change the way mpiexec fails.
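>> 
>> (For comparison, with the two-node allocation the same variables come
>> out in Slurm's compressed notation, i.e. something like
>>   SLURM_NODELIST=node-b[01-02]
>>   SLURM_TASKS_PER_NODE=4(x2)
>> exact node names aside.)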
>> 
>> /Stefan
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
>> 

