[mpich-discuss] using mpi4py in a Singularity container run at a large computing center with Slurm installed
Martin Cuma
martin.cuma at utah.edu
Fri Aug 31 18:13:38 CDT 2018
Hi Heather,
this is a nicely complex problem that I can't say I know a solution to,
but let me say what I know and perhaps it'll shed some light on the
problem.
To answer your question on how mpirun interacts with srun (or SLURM in
general): most MPIs (or, better to say, the PMIs that MPI uses for process
launch) these days have SLURM support, so when built they can leverage
SLURM. Or SLURM is set up to facilitate the remote node connection
(e.g. by hijacking ssh through its own PMI; I don't know this, just
guessing). So, for the MPI distros that I tried (MPICH and derivatives -
Intel MPI, MVAPICH2; and OpenMPI), mpirun at some point calls srun when
run inside a SLURM job, no matter if it was built with SLURM support
explicitly or not. Which would explain the srun error you are getting.
Now, what I think is happening in your case is that you are calling the
mpirun (or its equivalent inside mpi4py) from INSIDE of the container,
where there's no srun. Notice that most MPI container examples, including
the very well written ANL page, instruct you to use mpirun (or aprun in
Cray's case) OUTSIDE of the container (on the host), and launch N
instances of the container through that mpirun.
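
A minimal sketch of what such a host-side launch line can look like. The
image name ubuntu_mpi.sif and the process count are placeholders (not
from this thread), and the sketch only prints the command rather than
running it:

```shell
#!/bin/sh
# Hypothetical placeholders: IMAGE and NPROCS are not from the original message.
IMAGE="${IMAGE:-ubuntu_mpi.sif}"
NPROCS="${NPROCS:-2}"

# Build the host-side launch line: mpirun runs on the host and starts
# N copies of "singularity exec", each running the program inside the container.
# Arguments: <number of processes> <container image> <program>
build_host_launch() {
    printf 'mpirun -np %s singularity exec %s %s\n' "$1" "$2" "$3"
}

CMD=$(build_host_launch "$NPROCS" "$IMAGE" ./cpi)
echo "$CMD"
# To actually launch, run the printed line on the host (inside the SLURM job),
# not from a shell inside the container.
```

This assumes the host has an mpirun that can work with the MPI library
inside the container.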
I reproduced your problem on our system in the following way:
1. Build a Singularity container with local MPI installation, e.g.
https://github.com/CHPC-UofU/Singularity-ubuntu-mpi
2. shell into the container and build some mpi program (e.g. I have the
cpi.c example from mpich - mpicc cpi.c -o cpi).
3. This then runs OK in the container on an interactive node (i.e. outside
a SLURM job, where mpirun does not use SLURM's PMI and so does not call srun).
4. Launch a SLURM job, then shell into the container, and try to run mpirun
-np 2 ./cpi - I get the same error you get, since srun is not found:
$ which srun
$
Now, I can try to add the SLURM binaries to the path:
$ export PATH="/uufs/notchpeak.peaks/sys/installdir/slurm/std/bin:$PATH"
$ which srun
/uufs/notchpeak.peaks/sys/installdir/slurm/std/bin/srun
but then get another error:
$ mpirun -np 2 ./cpi
srun: error: Invalid user for SlurmUser slurm, ignored
srun: fatal: Unable to process configuration file
so the environment needs some more changes to get srun to work
correctly from inside the container. Though I think this would still only
be hackable for an intra-node MPI launch, since an inter-node launch relies
on SLURM, which would have to be accessed from outside of the container.
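
The failure mode above can be spotted before calling mpirun with a quick
check like the following. This is only a diagnostic sketch: the function
name check_launch_env is made up for illustration, SLURM_JOB_ID is the
variable SLURM sets inside a job, and SINGULARITY_CONTAINER is assumed to
be set by Singularity inside a container.

```shell
#!/bin/sh
# Diagnostic sketch; check_launch_env is a hypothetical helper name.
check_launch_env() {
    # Assumption: Singularity sets SINGULARITY_CONTAINER inside a container.
    if [ -n "${SINGULARITY_CONTAINER:-}" ]; then
        echo "running inside a container"
    else
        echo "running on the host"
    fi
    # Inside a SLURM job mpirun is expected to call srun, so srun must be found.
    if [ -n "${SLURM_JOB_ID:-}" ] && ! command -v srun >/dev/null 2>&1; then
        echo "in a SLURM job but srun is not on PATH: mpirun may fail here"
        return 1
    fi
    echo "no obvious srun problem"
}

check_launch_env || true
```

Note that finding srun on the PATH is necessary but not sufficient, as the
SlurmUser error above shows.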
So, bottom line, launching mpirun from the host is preferable.
I am not sure how you can launch mpi4py from the host, since I don't
use mpi4py, but in theory it should not be any different from launching
MPI binaries. Though I figure modifying your launch scripts around this
may be complicated.
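
As an untested sketch of that idea, a batch script could have the host
mpirun start the container's Python interpreter on each rank. The names
mpi4py_job.sh, ubuntu_mpi.sif and my_script.py are placeholders, and this
assumes python3 with mpi4py is installed inside the container and that the
host MPI is compatible with the one mpi4py was built against:

```shell
#!/bin/sh
# Hypothetical placeholders: mpi4py_job.sh, ubuntu_mpi.sif, my_script.py.
JOB_SCRIPT="${JOB_SCRIPT:-mpi4py_job.sh}"

# Write a batch script in which mpirun runs on the host and each rank
# runs "singularity exec ... python3 my_script.py" inside the container.
cat > "$JOB_SCRIPT" <<'EOF'
#!/bin/sh
#SBATCH --nodes=2
#SBATCH --ntasks=4
mpirun -np "$SLURM_NTASKS" singularity exec ubuntu_mpi.sif python3 my_script.py
EOF

echo "wrote $JOB_SCRIPT; submit it from the host with: sbatch $JOB_SCRIPT"
```

The node and task counts are arbitrary; the point is only that mpirun sits
outside the container and Python sits inside it.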
BTW, I have had reasonable success with mixing ABI-compatible MPIs
(MPICH, MVAPICH2, Intel MPI) in and out of the container. It often works
but sometimes it does not.
HTH,
MC
--
Martin Cuma
Center for High Performance Computing
Department of Geology and Geophysics
University of Utah