[mpich-discuss] using mpi4py in a Singularity container run at a large computing center with Slurm installed

Heather Kelly heather999kelly at gmail.com
Fri Aug 31 16:44:40 CDT 2018


Hi Martin,
Thanks for the response.
Unfortunately, I don't have direct access to the Dockerfile used to create
this image, but I have at times built my own when necessary.  Here is an
example:
https://github.com/LSSTDESC/dockerfiles/blob/master/lsst_sims/Dockerfile-buildAllFromSource
There's a fair amount going on there: the newinstall.sh script installs
miniconda plus a number of python packages (including mpich 3.2.1), and then
some C/C++ code is downloaded and built.
In my search for clues, I see you are absolutely right: the hope is to
reuse the same (or a similar) version of MPI in the container as is available
on the native system: https://www.alcf.anl.gov/user-guides/singularity.
That is not what is happening here, and I'm not certain how easily I can
change that. I might be able to modify the one package that includes the
python code I want to run.
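
For reference (mostly as a note to myself), my rough understanding of that
ALCF-style approach - completely untested on my end, and the image name,
rank count, and paths below are placeholders or guesses taken from the env
dump at the end of this email - is something along these lines:

    # untested guess at the "reuse the host MPI" approach from the ALCF page
    # make the host's (ABI-compatible) Cray MPICH libraries visible inside the image
    export SINGULARITYENV_LD_LIBRARY_PATH=/opt/cray/pe/mpt/7.7.0/gni/mpich-intel/16.0/lib
    # let Slurm launch the ranks, one container instance per rank, with the host MPI bind-mounted in
    srun -n 32 singularity exec -B /opt/cray:/opt/cray <imageName> python myScript.py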

I've tried to simplify what I'm up to: I connect to one of the compute
nodes, where I am able to start up shifter or singularity (depending on
which center I'm working at - the error is the same in both cases). Here I'm
using shifter on a compute node, and I completely skip using srun, but the
result is the same:

shifter --image=<imageName> ./run_shifter_smp.sh $PWD/drp-test $PWD/filesToIngest.txt |& tee smp.log
root INFO: Loading config overrride file '/global/cscratch1/sd/desc/DC2/Run1.2iTest-20180830/obs_lsstCam/config/ingest.py'
LsstCamMapper WARN: Unable to find calib root directory
CameraMapper INFO: Loading Posix exposure registry from /global/cscratch1/sd/desc/DC2/Run1.2i-20180830/drp-test
[mpiexec@nid00011] HYDU_create_process (utils/launch/launch.c:75): execvp error on file srun (No such file or directory)

That run_shifter_smp.sh script just sets up my environment and calls the
routine, asking it to run in SMP mode:
ingestDriver.py $1 @$2 --cores 20 --mode link --batch-type smp
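
In case it matters, that script is essentially just the following
(reconstructed from memory, so the env-setup lines are only approximate):

    #!/bin/bash
    # rough sketch of run_shifter_smp.sh - the setup lines are approximate
    source /opt/lsst/software/stack/loadLSST.bash   # activate the stack's conda env
    setup lsst_distrib                              # EUPS setup (approximate)
    ingestDriver.py $1 @$2 --cores 20 --mode link --batch-type smp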

I can double check where mpiexec is pointing and see:
(lsst-scipipe-10a4fa6) sh-4.2$ type mpiexec
mpiexec is /opt/lsst/software/stack/stack/miniconda3-4.5.4-10a4fa6/Linux64/mpich/3.2.1/bin/mpiexec
which is the mpiexec in our software stack, rather than the system's.
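
To compare what's inside the image versus what the host provides, I suppose
I could do something like this (untested; mpichversion ships with MPICH, and
the image name is the same placeholder as above):

    # untested: compare the container's MPICH with the host's
    shifter --image=<imageName> mpichversion    # container: the conda-installed MPICH 3.2.1
    module list 2>&1 | grep -i mpich            # host: cray-mpich-abi/7.7.0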

I also took a look at the slew of environment variables containing "MPI";
they're listed at the end of this email in case they're of any use.

Here's a look at the modules loaded in this env:
module list
Currently Loaded Modulefiles:
  1) modules/3.2.10.6
  2) intel/18.0.1.163
  3) craype-network-aries
  4) craype/2.5.14
  5) cray-libsci/18.03.1
  6) udreg/2.3.2-6.0.5.0_13.12__ga14955a.ari
  7) ugni/6.0.14-6.0.5.0_16.9__g19583bb.ari
  8) pmi/5.0.13
  9) dmapp/7.1.1-6.0.5.0_49.8__g1125556.ari
 10) gni-headers/5.0.12-6.0.5.0_2.15__g2ef1ebc.ari
 11) xpmem/2.2.4-6.0.5.1_8.18__g35d5e73.ari
 12) job/2.2.2-6.0.5.0_8.47__g3c644b5.ari
 13) dvs/2.7_2.2.65-6.0.5.2_16.2__gbec2cb0
 14) alps/6.5.28-6.0.5.0_18.6__g13a91b6.ari
 15) rca/2.2.16-6.0.5.0_15.34__g5e09e6d.ari
 16) atp/2.1.1
 17) PrgEnv-intel/6.0.4
 18) craype-haswell
 19) cray-mpich-abi/7.7.0
 20) nano/2.2.6
 21) altd/2.0
 22) darshan/3.1.4
 23) Base-opts/2.4.123-6.0.5.0_11.2__g6460790.ari


Reading the FAQ, I was curious about this statement describing what happens
when MPICH is configured with Slurm so that it uses srun:
"Once configured with slurm, no internal process manager is built for
MPICH; the user is expected to use SLURM's launch models (such as srun)."
https://wiki.mpich.org/mpich/index.php/FAQ#Q:_How_do_I_use_MPICH_with_slurm.3F
Does that mean the host system's MPI likely only knows about Slurm's launch
models and wouldn't even be able to run in SMP mode?
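
If the answer is instead that the container's Hydra is simply auto-detecting
the Slurm environment on the compute node and picking srun as its launcher,
one thing I plan to try (untested, and assuming shifter passes my environment
through) is Hydra's documented launcher override:

    # untested: see which launchers the container's Hydra was built with,
    # then force the local "fork" launcher so mpiexec stops exec'ing srun
    shifter --image=<imageName> mpiexec -info | grep -i launch
    export HYDRA_LAUNCHER=fork   # or pass "-launcher fork" to mpiexec directly
    shifter --image=<imageName> ./run_shifter_smp.sh $PWD/drp-test $PWD/filesToIngest.txt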

Take care,
Heather

(lsst-scipipe-10a4fa6) sh-4.2$ set|grep MPI
CRAY_MPICH2_DIR=/opt/cray/pe/mpt/7.7.0/gni/mpich-intel/16.0
CRAY_MPICH2_VER=7.7.0
CRAY_MPICH_BASEDIR=/opt/cray/pe/mpt/7.7.0/gni
CRAY_MPICH_DIR=/opt/cray/pe/mpt/7.7.0/gni/mpich-intel/16.0
CRAY_MPICH_ROOTDIR=/opt/cray/pe/mpt/7.7.0
CRAY_PRE_COMPILE_OPTS=-hnetwork=aries
MPI4PY_DIR=/opt/lsst/software/stack/stack/miniconda3-4.5.4-10a4fa6/Linux64/mpi4py/2.0.0+6
MPICH_ABORT_ON_ERROR=1
MPICH_DIR=/opt/lsst/software/stack/stack/miniconda3-4.5.4-10a4fa6/Linux64/mpich/3.2.1
MPICH_MPIIO_DVS_MAXNODES=32
MPI_DIR=/opt/lsst/software/stack/stack/miniconda3-4.5.4-10a4fa6/Linux64/mpi/0.0.1+3
PE_FFTW2_DEFAULT_REQUIRED_PRODUCTS=PE_MPICH
PE_FFTW_DEFAULT_REQUIRED_PRODUCTS=PE_MPICH
PE_GA_DEFAULT_GENCOMPILERS_GNU='5.1 4.9'
PE_HDF5_DEFAULT_GENCOMPILERS_GNU='5.1 4.9'
PE_HDF5_PARALLEL_DEFAULT_GENCOMPILERS_GNU='5.1 4.9'
PE_HDF5_PARALLEL_DEFAULT_REQUIRED_PRODUCTS=PE_MPICH
PE_LIBSCI_DEFAULT_GENCOMPILERS_CRAY_x86_64=8.6
PE_LIBSCI_DEFAULT_GENCOMPILERS_GNU_x86_64='7.1 6.1 5.1 4.9'
PE_LIBSCI_DEFAULT_GENCOMPILERS_INTEL_x86_64=16.0
PE_LIBSCI_DEFAULT_REQUIRED_PRODUCTS=PE_MPICH
PE_LIBSCI_GENCOMPILERS_CRAY_x86_64=8.6
PE_LIBSCI_GENCOMPILERS_GNU_x86_64='7.1 6.1 5.1 4.9'
PE_LIBSCI_GENCOMPILERS_INTEL_x86_64=16.0
PE_LIBSCI_REQUIRED_PRODUCTS=PE_MPICH
PE_MPICH_ALTERNATE_LIBS_dpm=_dpm
PE_MPICH_ALTERNATE_LIBS_multithreaded=_mt
PE_MPICH_CXX_PKGCONFIG_LIBS=mpichcxx
PE_MPICH_DEFAULT_DIR_CRAY_DEFAULT64=64
PE_MPICH_DEFAULT_FIXED_PRGENV=INTEL
PE_MPICH_DEFAULT_GENCOMPILERS_CRAY=8.6
PE_MPICH_DEFAULT_GENCOMPILERS_GNU='5.1 4.9'
PE_MPICH_DEFAULT_GENCOMPS_CRAY=86
PE_MPICH_DEFAULT_GENCOMPS_GNU='51 49'
PE_MPICH_DEFAULT_VOLATILE_PKGCONFIG_PATH=/opt/cray/pe/mpt/7.7.0/gni/mpich-@PRGENV@@PE_MPICH_DEFAULT_DIR_DEFAULT64@/@PE_MPICH_DEFAULT_GENCOMPS@/lib/pkgconfig
PE_MPICH_DEFAULT_VOLATILE_PRGENV='CRAY GNU'
PE_MPICH_DIR_CRAY_DEFAULT64=64
PE_MPICH_FIXED_PRGENV=INTEL
PE_MPICH_FORTRAN_PKGCONFIG_LIBS=mpichf90
PE_MPICH_GENCOMPILERS_CRAY=8.6
PE_MPICH_GENCOMPILERS_GNU='5.1 4.9'
PE_MPICH_GENCOMPS_CRAY=86
PE_MPICH_GENCOMPS_GNU='51 49'
PE_MPICH_MODULE_NAME=cray-mpich
PE_MPICH_NV_LIBS=
PE_MPICH_NV_LIBS_nvidia20=-lcudart
PE_MPICH_NV_LIBS_nvidia35=-lcudart
PE_MPICH_NV_LIBS_nvidia60=-lcudart
PE_MPICH_PKGCONFIG_LIBS=mpich
PE_MPICH_PKGCONFIG_VARIABLES=PE_MPICH_NV_LIBS_@accelerator@:PE_MPICH_ALTERNATE_LIBS_@multithreaded@:PE_MPICH_ALTERNATE_LIBS_@dpm@
PE_MPICH_TARGET_VAR_nvidia20=-lcudart
PE_MPICH_TARGET_VAR_nvidia35=-lcudart
PE_MPICH_VOLATILE_PKGCONFIG_PATH=/opt/cray/pe/mpt/7.7.0/gni/mpich-@PRGENV@@PE_MPICH_DIR_DEFAULT64@/@PE_MPICH_GENCOMPS@/lib/pkgconfig
PE_MPICH_VOLATILE_PRGENV='CRAY GNU'
PE_NETCDF_DEFAULT_GENCOMPILERS_GNU='5.1 4.9'
PE_NETCDF_HDF5PARALLEL_DEFAULT_GENCOMPILERS_GNU='5.1 4.9'
PE_NETCDF_HDF5PARALLEL_DEFAULT_REQUIRED_PRODUCTS=PE_HDF5_PARALLEL:PE_MPICH
PE_PARALLEL_NETCDF_DEFAULT_GENCOMPILERS_GNU='5.1 4.9'
PE_PETSC_DEFAULT_GENCOMPILERS_CRAY_haswell=8.6
PE_PETSC_DEFAULT_GENCOMPILERS_CRAY_interlagos=8.6
PE_PETSC_DEFAULT_GENCOMPILERS_CRAY_mic_knl=8.6
PE_PETSC_DEFAULT_GENCOMPILERS_CRAY_sandybridge=8.6
PE_PETSC_DEFAULT_GENCOMPILERS_CRAY_skylake=8.6
PE_PETSC_DEFAULT_GENCOMPILERS_CRAY_x86_64=8.6
PE_PETSC_DEFAULT_GENCOMPILERS_GNU_haswell='5.3 4.9'
PE_PETSC_DEFAULT_GENCOMPILERS_GNU_interlagos='5.3 4.9'
PE_PETSC_DEFAULT_GENCOMPILERS_GNU_mic_knl=5.3
PE_PETSC_DEFAULT_GENCOMPILERS_GNU_sandybridge='5.3 4.9'
PE_PETSC_DEFAULT_GENCOMPILERS_GNU_skylake=6.1
PE_PETSC_DEFAULT_GENCOMPILERS_GNU_x86_64='5.3 4.9'
PE_PETSC_DEFAULT_GENCOMPILERS_INTEL_haswell=16.0
PE_PETSC_DEFAULT_GENCOMPILERS_INTEL_interlagos=16.0
PE_PETSC_DEFAULT_GENCOMPILERS_INTEL_mic_knl=16.0
PE_PETSC_DEFAULT_GENCOMPILERS_INTEL_sandybridge=16.0
PE_PETSC_DEFAULT_GENCOMPILERS_INTEL_skylake=16.0
PE_PETSC_DEFAULT_GENCOMPILERS_INTEL_x86_64=16.0
PE_PETSC_DEFAULT_REQUIRED_PRODUCTS=PE_MPICH:PE_LIBSCI:PE_HDF5_PARALLEL:PE_TPSL
PE_PKGCONFIG_DEFAULT_PRODUCTS=PE_TRILINOS:PE_TPSL_64:PE_TPSL:PE_PETSC:PE_PARALLEL_NETCDF:PE_NETCDF_HDF5PARALLEL:PE_NETCDF:PE_MPICH:PE_LIBSCI:PE_HDF5_PARALLEL:PE_HDF5:PE_GA:PE_FFTW2:PE_FFTW
PE_PKGCONFIG_PRODUCTS=PE_MPICH:PE_LIBSCI
PE_TPSL_64_DEFAULT_GENCOMPILERS_CRAY_haswell=8.6
PE_TPSL_64_DEFAULT_GENCOMPILERS_CRAY_interlagos=8.6
PE_TPSL_64_DEFAULT_GENCOMPILERS_CRAY_mic_knl=8.6
PE_TPSL_64_DEFAULT_GENCOMPILERS_CRAY_sandybridge=8.6
PE_TPSL_64_DEFAULT_GENCOMPILERS_CRAY_x86_64=8.6
PE_TPSL_64_DEFAULT_GENCOMPILERS_CRAY_x86_skylake=8.6
PE_TPSL_64_DEFAULT_GENCOMPILERS_GNU_haswell='5.1 4.9'
PE_TPSL_64_DEFAULT_GENCOMPILERS_GNU_interlagos='5.1 4.9'
PE_TPSL_64_DEFAULT_GENCOMPILERS_GNU_mic_knl=5.1
PE_TPSL_64_DEFAULT_GENCOMPILERS_GNU_sandybridge='5.1 4.9'
PE_TPSL_64_DEFAULT_GENCOMPILERS_GNU_x86_64='5.1 4.9'
PE_TPSL_64_DEFAULT_GENCOMPILERS_GNU_x86_skylake=6.1
PE_TPSL_64_DEFAULT_GENCOMPILERS_INTEL_haswell=16.0
PE_TPSL_64_DEFAULT_GENCOMPILERS_INTEL_interlagos=16.0
PE_TPSL_64_DEFAULT_GENCOMPILERS_INTEL_mic_knl=16.0
PE_TPSL_64_DEFAULT_GENCOMPILERS_INTEL_sandybridge=16.0
PE_TPSL_64_DEFAULT_GENCOMPILERS_INTEL_x86_64=16.0
PE_TPSL_64_DEFAULT_GENCOMPILERS_INTEL_x86_skylake=16.0
PE_TPSL_64_DEFAULT_REQUIRED_PRODUCTS=PE_MPICH:PE_LIBSCI
PE_TPSL_DEFAULT_GENCOMPILERS_CRAY_haswell=8.6
PE_TPSL_DEFAULT_GENCOMPILERS_CRAY_interlagos=8.6
PE_TPSL_DEFAULT_GENCOMPILERS_CRAY_mic_knl=8.6
PE_TPSL_DEFAULT_GENCOMPILERS_CRAY_sandybridge=8.6
PE_TPSL_DEFAULT_GENCOMPILERS_CRAY_x86_64=8.6
PE_TPSL_DEFAULT_GENCOMPILERS_CRAY_x86_skylake=8.6
PE_TPSL_DEFAULT_GENCOMPILERS_GNU_haswell='5.1 4.9'
PE_TPSL_DEFAULT_GENCOMPILERS_GNU_interlagos='5.1 4.9'
PE_TPSL_DEFAULT_GENCOMPILERS_GNU_mic_knl=5.1
PE_TPSL_DEFAULT_GENCOMPILERS_GNU_sandybridge='5.1 4.9'
PE_TPSL_DEFAULT_GENCOMPILERS_GNU_x86_64='5.1 4.9'
PE_TPSL_DEFAULT_GENCOMPILERS_GNU_x86_skylake=6.1
PE_TPSL_DEFAULT_GENCOMPILERS_INTEL_haswell=16.0
PE_TPSL_DEFAULT_GENCOMPILERS_INTEL_interlagos=16.0
PE_TPSL_DEFAULT_GENCOMPILERS_INTEL_mic_knl=16.0
PE_TPSL_DEFAULT_GENCOMPILERS_INTEL_sandybridge=16.0
PE_TPSL_DEFAULT_GENCOMPILERS_INTEL_x86_64=16.0
PE_TPSL_DEFAULT_GENCOMPILERS_INTEL_x86_skylake=16.0
PE_TPSL_DEFAULT_REQUIRED_PRODUCTS=PE_MPICH:PE_LIBSCI
PE_TRILINOS_DEFAULT_GENCOMPILERS_CRAY_x86_64=8.6
PE_TRILINOS_DEFAULT_GENCOMPILERS_GNU_x86_64='5.1 4.9'
PE_TRILINOS_DEFAULT_GENCOMPILERS_INTEL_x86_64=16.0
PE_TRILINOS_DEFAULT_REQUIRED_PRODUCTS=PE_MPICH:PE_HDF5_PARALLEL:PE_NETCDF_HDF5PARALLEL:PE_LIBSCI:PE_TPSL
SETUP_MPI='mpi 0.0.1+3 -f Linux64 -Z /opt/lsst/software/stack/stack/miniconda3-4.5.4-10a4fa6'
SETUP_MPI4PY='mpi4py 2.0.0+6 -f Linux64 -Z /opt/lsst/software/stack/stack/miniconda3-4.5.4-10a4fa6'
SETUP_MPICH='mpich 3.2.1 -f Linux64 -Z /opt/lsst/software/stack/stack/miniconda3-4.5.4-10a4fa6'

On Fri, Aug 31, 2018 at 3:47 PM Kandes, Martin <mkandes at sdsc.edu> wrote:

> Hi Heather,
>
>
> Can you copy and paste your Slurm batch job script here to give us an
> overview of what the job looks like? It'd also be helpful if you could
> provide the definition (or recipe) file for this container and a list of
> the software modules available at the site where you are running the container.
>
>
> Marty
>
>
> P.S. In general, you want to have the same MPI implementation and version
> installed within the Singularity container as the one available on the
> host systems where the container will run.
> ------------------------------
> *From:* Heather Kelly <heather999kelly at gmail.com>
> *Sent:* Friday, August 31, 2018 12:34:25 PM
> *To:* discuss at mpich.org
> *Subject:* [mpich-discuss] using mpi4py in a Singularity container run at
> a large computing center with Slurm installed
>
> Hi,
> Complete newbie here.
> I have a Singularity container (created by someone else) that includes a
> python script that uses mpi4py; both mpi4py and MPI are installed in the
> image. I'm trying to run this at a large computing center where Slurm is
> installed. The code in the container wants to use its own installation of
> mpi4py and MPI. The code provides flags that let the user choose whether it
> should use Slurm, SMP, or nothing at all. The default is SMP.
>
> When I attempt to run this code in the image, on a compute node of this
> computing center, I receive an error:
> HYDU_create_process (utils/launch/launch.c:75): execvp error on file srun
> (No such file or directory)
> even though I have told the program that I want to use SMP; it appears
> mpiexec is trying to submit a job to Slurm. Checking the environment,
> mpiexec points to the version installed in the container, not the one
> available at the computing center.
>
> Is there an env variable or some way to set the process management system
> to avoid using Slurm?
>
> In other contexts this code works just fine; my problem seems specific to
> running it in a container at these large computing centers where Slurm is
> available. It's as if the local computing center's MPI install is taking
> precedence, and perhaps that's just how it works, but I'd like to find a
> way around that.
>
> Take care,
> Heather
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>