[mpich-discuss] MPI_Comm_spawn crosses node boundaries

Mccall, Kurt E. (MSFC-EV41) kurt.e.mccall at nasa.gov
Thu Feb 3 19:00:07 CST 2022


Ken,

I'm trying to build MPICH 4.0 in several ways, one of which will be the configuration you suggested below.  For this particular attempt, which follows the Slurm MPI guide, I built it with

configure --with-slurm=/opt/slurm --with-pmi=pmi2/simple <etc>

and invoked it with

srun --mpi=pmi2 <etc>

The job crashes with the following messages.  Any idea what is wrong?

slurmstepd: error: mpi/pmi2: no value for key  in req
slurmstepd: error: mpi/pmi2: no value for key  in req
slurmstepd: error: mpi/pmi2: no value for key <99>è­þ^? in req
slurmstepd: error: mpi/pmi2: no value for key  in req
slurmstepd: error: mpi/pmi2: no value for key  in req
slurmstepd: error: mpi/pmi2: no value for key ´2¾ÿ^? in req
slurmstepd: error: mpi/pmi2: no value for key ; in req
slurmstepd: error: mpi/pmi2: no value for key  in req
slurmstepd: error: *** STEP 52227.0 ON n001 CANCELLED AT 2022-02-03T18:48:02 ***
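
For context, the spawn call that fails is just the standard pattern, along the lines of this minimal sketch ("worker" here is a placeholder, not my actual child executable):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm intercomm;
        int errcode;

        MPI_Init(&argc, &argv);

        /* Spawn one child process; "worker" is a placeholder name. */
        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, &errcode);

        printf("spawn returned, errcode = %d\n", errcode);

        MPI_Comm_free(&intercomm);
        MPI_Finalize();
        return 0;
    }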

-----Original Message-----
From: Raffenetti, Ken <raffenet at anl.gov> 
Sent: Friday, January 28, 2022 3:15 PM
To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>; discuss at mpich.org
Subject: [EXTERNAL] Re: [mpich-discuss] MPI_Comm_spawn crosses node boundaries

On 1/28/22, 2:22 PM, "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov> wrote:

    Ken,

    I confirmed that MPI_Comm_spawn fails completely if I build MPICH without the PMI2 option.

Dang, I thought that would work :(.

    Looking at the Slurm documentation at https://slurm.schedmd.com/mpi_guide.html#intel_mpiexec_hydra,
    it states "All MPI_comm_spawn work fine now going through hydra's PMI 1.1 interface."  The full quote is below for reference.

    1) how do I build MPICH to support hydra's PMI 1.1 interface?

That is the default, so no extra configuration should be needed. One thing I notice in your log output is that the Slurm envvars seem to have changed names from what we have in our source, e.g. SLURM_JOB_NODELIST vs. SLURM_NODELIST. Do your initial processes launch on the right nodes?
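
One quick way to check is to print the host name and both spellings of the variable from every rank, with a sketch like this (compile it against your MPICH build and launch it the same way as your job):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);

        /* Report where each rank landed and which nodelist variable is set. */
        const char *jobnl = getenv("SLURM_JOB_NODELIST");
        const char *nl    = getenv("SLURM_NODELIST");
        printf("rank %d on %s: SLURM_JOB_NODELIST=%s SLURM_NODELIST=%s\n",
               rank, host, jobnl ? jobnl : "(unset)", nl ? nl : "(unset)");

        MPI_Finalize();
        return 0;
    }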

    2) Can you offer any guesses on how to build Slurm to do the same?  (I realize this isn't a Slurm forum  😊)

Hopefully you don't need to rebuild Slurm to do it. What you could try is building MPICH against the Slurm PMI library: add "--with-pm=none --with-pmi=slurm --with-slurm=<path/to/install>" to your configure line, then use srun instead of mpiexec and see how it goes.

Ken


