[mpich-discuss] Slurm and MPI_Comm_spawn

Mccall, Kurt E. (MSFC-EV41) kurt.e.mccall at nasa.gov
Fri Jan 7 16:24:49 CST 2022


Hui,

"Failing" == spinning in place, deep in the MPI library inside the MPI_Comm_spawn call.   I can provide the stack trace if you need it.   Slurm says this, but not sure why it mentions srun since I am calling that nowhere.

srun: error: Unable to create step for job 34378: Requested node configuration is not available

I tried ntasks = 4, 20, and 48; the results were always the same.
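
For reference, the call that hangs is essentially the standard parent-side spawn pattern sketched below (placeholder executable name and counts, not our actual code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    int errcodes[2];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The hang occurs somewhere inside this collective call. */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, errcodes);

    if (rank == 0) {
        for (int i = 0; i < 2; ++i)
            if (errcodes[i] != MPI_SUCCESS)
                fprintf(stderr, "spawn error code for child %d: %d\n",
                        i, errcodes[i]);
    }

    MPI_Finalize();
    return 0;
}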

Kurt

From: Zhou, Hui <zhouh at anl.gov>
Sent: Friday, January 7, 2022 4:13 PM
To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>; discuss at mpich.org
Subject: [EXTERNAL] Re: Slurm and MPI_Comm_spawn

Hi Kurt,

Thanks for the details. When you say the job is failing, is the process hanging or aborting? Are there any error messages?

My suspicion is that Slurm is preventing the extra processes from being launched, since you have assigned all the resources to the first two MPI processes. Could you try increasing ntasks in the batch command?
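
For example (illustrative numbers only): if mpiexec starts 2 parent processes and each later spawns, say, 10 children, the allocation would need at least 22 task slots, e.g.

sbatch --nodes=2 --ntasks=22 <bash_script>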

--
Hui Zhou
________________________________
From: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Sent: Friday, January 7, 2022 3:33 PM
To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org <discuss at mpich.org>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: Re: Slurm and MPI_Comm_spawn


Thanks for the reply, Hui.



configure --prefix=/home/kmccall/mpich-install-4.0b1 --with-device=ch3:nemesis --disable-fortran  -enable-debuginfo --enable-g=debug

The program is run via sbatch, which is given a bash script as an argument.



sbatch  --nodes=2  --ntasks=2  --cpus-per-task=24   <bash_script>

The bash script calls mpiexec:



mpiexec -print-all-exitcodes -enable-x -np 2  -wdir ${work_dir} -env DISPLAY localhost:10.0 --ppn 1 <cmd>

From: Zhou, Hui <zhouh at anl.gov>
Sent: Friday, January 7, 2022 2:39 PM
To: discuss at mpich.org
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: [EXTERNAL] Re: Slurm and MPI_Comm_spawn



MPICH uses PMI 1 by default.



How is your MPICH configured? And how do you run your program? Is it via srun?
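
(If the exact flags are not at hand, the mpichversion utility installed next to mpiexec should print them; running

mpichversion

and looking at the configure line in its output is usually enough.)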



--
Hui Zhou

From: Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org>
Date: Friday, January 7, 2022 at 2:21 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: [mpich-discuss] Slurm and MPI_Comm_spawn

My MPICH/Slurm job is failing when the call to MPI_Comm_spawn is made.   The Slurm MPI guide https://slurm.schedmd.com/mpi_guide.html#mpich2 specifically states that MPI_Comm_spawn will work going through Hydra's PMI 1.1 interface.



How do I ensure that it goes through that interface?
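
One check I could imagine making (this assumes, without my having verified it, that Hydra's simple PMI 1 exposes variables such as PMI_FD and PMI_RANK in each process's environment) would be to print those variables from the spawned processes:

/* Hedged sanity check: print PMI-related environment variables.
 * Assumes Hydra's simple PMI 1 sets PMI_FD / PMI_RANK; not verified. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *fd   = getenv("PMI_FD");
    const char *rank = getenv("PMI_RANK");
    printf("PMI_FD=%s PMI_RANK=%s\n",
           fd ? fd : "(unset)", rank ? rank : "(unset)");
    return 0;
}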



Maybe we'll have to rebuild Slurm to support PMI 1.1.    This Slurm command yields the following output; PMI 1.1 is not mentioned, although PMI 2 is.



$ srun -mpi=list
srun: MPI types are...
srun: cray_shasta
srun: pmi2
srun: none



