[mpich-discuss] using mpi4py in a Singularity container run at a large computing center with Slurm installed

Martin Cuma martin.cuma at utah.edu
Fri Aug 31 18:13:38 CDT 2018


Hi Heather,

this is a nicely complex problem that I can't say I know a solution to, 
but let me share what I know and perhaps it'll shed some light on the 
problem.

To answer your question on how mpirun interacts with srun (or SLURM in 
general): most MPIs (or, more precisely, the PMIs that MPI uses for 
process launch) these days have SLURM support, so when built on a SLURM 
system they can leverage it. Or SLURM is set up to facilitate the remote 
node connection (e.g. by hijacking ssh through its own PMI - I don't know 
this, just guessing). So, for the MPI distros that I have tried (MPICH and 
its derivatives Intel MPI and MVAPICH2, and OpenMPI), mpirun at some point 
calls srun, whether or not it was built with explicit SLURM support. That 
would explain the srun error you are getting.
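
To illustrate, if the mpirun is MPICH's Hydra-based mpiexec, you can check 
and override which launcher it uses - a quick sketch, assuming a Hydra 
mpiexec (the -info/-launcher options and the HYDRA_LAUNCHER variable are 
Hydra-specific):
$ mpiexec -info | grep -i launcher    # shows the available/default launchers
$ mpiexec -launcher ssh -np 2 ./cpi   # force ssh instead of the slurm/srun launcher
$ export HYDRA_LAUNCHER=ssh           # same override via the environment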

Now, what I think is happening in your case is that you are calling 
mpirun (or its equivalent inside mpi4py) from INSIDE the container, where 
there is no srun. Notice that most MPI container examples, including the 
very well written ANL page, instruct you to run mpirun (or aprun in 
Cray's case) OUTSIDE of the container, on the host, and launch N instances 
of the container through mpirun.
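
For illustration, that host-side pattern looks roughly like this (the 
image and binary names are just placeholders - the host mpirun starts 4 
container instances, one per MPI rank):
$ mpirun -np 4 singularity exec ./ubuntu-mpi.simg ./cpi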

I reproduced your problem on our system in the following way:
1. Build a Singularity container with a local MPI installation, e.g. 
https://github.com/CHPC-UofU/Singularity-ubuntu-mpi
2. Shell into the container and build an MPI program (e.g. I used the 
cpi.c example from MPICH: mpicc cpi.c -o cpi).
3. This then runs OK in the container on an interactive node (= outside a 
SLURM job - mpirun does not use SLURM's PMI, i.e. does not call srun).
4. Launch a job, shell into the container, and try to run mpirun -np 2 
./cpi - I get the same error you get (see the command sketch after these 
steps), since srun is not on the PATH in the container:
$ which srun
$
Now, I can try to set the path to the SLURM binaries:
$ export PATH="/uufs/notchpeak.peaks/sys/installdir/slurm/std/bin:$PATH"
$ which srun
/uufs/notchpeak.peaks/sys/installdir/slurm/std/bin/srun
but then I get another error:
$ mpirun -np 2 ./cpi
srun: error: Invalid user for SlurmUser slurm, ignored
srun: fatal: Unable to process configuration file
so the environment needs some more changes to get srun to work correctly 
from inside the container. Though I think this would only ever be hackable 
for an intra-node MPI launch; for an inter-node launch you would rely on 
SLURM, which would have to be accessed from outside of the container.
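
If someone really wanted to pursue that hack, the direction would be to 
bind the host's SLURM pieces into the container - a rough, unverified 
sketch (the install path is from our system, and /etc/slurm is just a 
common location for slurm.conf; adjust both to your site):
$ singularity shell \
    -B /uufs/notchpeak.peaks/sys/installdir/slurm/std \
    -B /etc/slurm \
    ubuntu-mpi.simg
and the 'Invalid user for SlurmUser' error suggests the container's 
/etc/passwd would also need a slurm user entry.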

So, bottom line, launching mpirun from the host is preferable.

I am not sure how you would launch the mpi4py code from the host, since I 
don't use mpi4py, but in theory it should not be any different from 
launching MPI binaries. Though I figure modifying your launch scripts 
around this may be complicated.
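
In case it helps, the host-side launch of an mpi4py program would 
presumably just use the container's Python as the executable - a sketch of 
a batch script, with the script name, interpreter and image as placeholders:
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks=8
mpirun -np $SLURM_NTASKS singularity exec ./ubuntu-mpi.simg \
    python3 ./my_mpi4py_script.py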

BTW, I have had reasonable success with mixing ABI-compatible MPIs 
(MPICH, MVAPICH2, Intel MPI) inside and outside of the container. It often 
works, but sometimes it does not.

HTH,
MC

  -- 
Martin Cuma
Center for High Performance Computing
Department of Geology and Geophysics
University of Utah

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss
