[mpich-discuss] using mpi4py in a Singularity container run at a large computing center with Slurm installed

Heather Kelly heather999kelly at gmail.com
Sat Sep 1 06:01:57 CDT 2018


Hi Martin,
Thank you so much for taking the time to look at this so carefully -
especially on a Friday before a holiday weekend!
You reproduced precisely the behavior I am seeing, where things run fine on
an interactive node. Just hacking something together that works intra-node
would be fine as a start. This gives me a much better understanding, so I
can play around more.
I'll report back if I manage to make some progress or can come up with an
intelligent question :)
Take care,
Heather

On Fri, Aug 31, 2018 at 7:13 PM Martin Cuma <martin.cuma at utah.edu> wrote:

> Hi Heather,
>
> this is a nicely complex problem that I can't say I know the solution
> to, but let me share what I know and perhaps it will shed some light on
> the problem.
>
> To answer your question on how mpirun interacts with srun (or SLURM in
> general): most MPIs (or, more precisely, the PMIs that MPIs use for
> process launch) these days have SLURM support, so when built against it
> they can leverage SLURM. Alternatively, SLURM is set up to facilitate
> the remote-node connection (e.g. by hijacking ssh through its own PMI -
> I don't know this for sure, just guessing). In any case, for the MPI
> distros I have tried (MPICH and its derivatives - Intel MPI, MVAPICH2 -
> and OpenMPI), mpirun at some point calls srun, whether or not it was
> built with explicit SLURM support. That would explain the srun error
> you are getting.
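>
> (One quick way to check what a given mpirun/mpiexec will try: if it is
> MPICH with the Hydra launcher, I believe "mpiexec -info" prints the
> build details, including which launchers and resource managers - ssh,
> fork, slurm, etc. - it knows about:
> $ mpiexec -info
> I am less sure what the equivalent is for the other distros.)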
>
> Now, what I think is happening in your case is that you are calling
> mpirun (or its equivalent inside mpi4py) from INSIDE the container,
> where there is no srun. Notice that most MPI container examples,
> including the very well written ANL page, instruct you to run mpirun
> (or aprun in Cray's case) OUTSIDE the container, on the host, and
> launch N instances of the container through mpirun, as sketched below.
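>
> As a sketch of that host-side pattern (the image name and paths are
> just placeholders), the launch looks something like:
> $ mpirun -np 4 singularity exec ./mycontainer.simg /opt/app/cpi
> i.e. mpirun starts N copies of "singularity exec", each copy becomes
> one MPI rank running inside its own container instance, and the process
> launch itself (ssh or srun) happens on the host, where those tools
> exist.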
>
> I reproduced your problem on our system in the following way:
> 1. Build a Singularity container with a local MPI installation, e.g.
> https://github.com/CHPC-UofU/Singularity-ubuntu-mpi
> 2. Shell into the container and build some MPI program (e.g. I use the
> cpi.c example from MPICH: mpicc cpi.c -o cpi).
> 3. This then runs OK in the container on an interactive node (= outside
> a SLURM job, so mpirun does not use SLURM's PMI and does not call srun).
> 4. Launch a SLURM job, shell into the container, and try to run mpirun
> -np 2 ./cpi - I get the same error you get, since
> $ which srun
> $
>
> Now, I can try to set the path to the SLURM binaries
> $ export PATH="/uufs/notchpeak.peaks/sys/installdir/slurm/std/bin:$PATH"
> $ which srun
> /uufs/notchpeak.peaks/sys/installdir/slurm/std/bin/srun
> but then get another error:
> $ mpirun -np 2 ./cpi
> srun: error: Invalid user for SlurmUser slurm, ignored
> srun: fatal: Unable to process configuration file
> so the environment needs some more changes before srun will work
> correctly from inside the container. Even then, I think this would only
> be hackable for an intra-node MPI launch; inter-node, you will rely on
> SLURM, which would have to be accessed from outside of the container
> (a couple of ideas are below).
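>
> Two hedged ideas for the intra-node hack, in case they help. First, if
> the container's MPI is MPICH with Hydra, you may be able to sidestep
> srun entirely by forcing the local fork launcher (this only works for
> ranks on the same node):
> $ mpirun -launcher fork -np 2 ./cpi
> Second, the "Invalid user for SlurmUser" error suggests the container
> cannot resolve the "slurm" user; binding the host's passwd/group files
> and the SLURM config directory into the container (paths are guesses
> for your site) might get srun a bit further:
> $ singularity shell -B /etc/passwd,/etc/group,/etc/slurm ubuntu.img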
>
> So, bottom line, launching mpirun from the host is preferable.
>
> I am not sure how you would launch the mpi4py program from the host,
> since I don't use mpi4py, but in theory it should not be any different
> from launching MPI binaries (a sketch is below). Though I imagine
> modifying your launch scripts around this may be complicated.
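>
> A minimal sketch of what I mean (the script name is made up): on the
> host, run
> $ mpirun -np 4 singularity exec ./mycontainer.simg python3 /opt/app/hello_mpi.py
> where hello_mpi.py is just a basic mpi4py check:
> # hello_mpi.py - minimal mpi4py sanity check
> from mpi4py import MPI
> comm = MPI.COMM_WORLD
> print("rank", comm.Get_rank(), "of", comm.Get_size())
> mpi4py calls MPI_Init when it is imported, and each python process
> picks up its rank from the environment that mpirun set up, so nothing
> special should be needed on the Python side.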
>
> BTW, I have had reasonable success with mixing ABI-compatible MPIs
> (MPICH, MVAPICH2, Intel MPI) inside and outside of the container. It
> often works, but sometimes it does not.
>
> HTH,
> MC
>
>   --
> Martin Cuma
> Center for High Performance Computing
> Department of Geology and Geophysics
> University of Utah
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>