[mpich-discuss] Installing MPICH on clusters
Raffenetti, Kenneth J.
raffenet at mcs.anl.gov
Tue Sep 21 11:33:25 CDT 2021
MPICH does not need to use Spectrum MPI's libraries. Are you sure your application is actually linked against MPICH and not Spectrum MPI?
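A quick way to check (a sketch; adjust the binary path to yours) is to inspect the dynamic dependencies on the compute node:
# Show which MPI libraries the binary resolves at run time; entries under
# a Spectrum MPI install path mean it was linked against Spectrum MPI.
ldd ./step-17.release | grep -i mpi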
Ken
On 9/20/21, 6:33 PM, "Feimi Yu via discuss" <discuss at mpich.org> wrote:
Thank you for the hint! I finally got my job running on my own MPI build. Just out of curiosity: I had actually searched for this libmpiprofilesupport library before and discovered that it is specific to Spectrum MPI, so I didn't go further. But MPICH just ran magically after I loaded Spectrum MPI and had the Spectrum MPI lib path in $LD_LIBRARY_PATH. Why does MPICH have to use Spectrum MPI's libraries?
Thanks!
Feimi
On 9/19/2021 1:57 PM, Zhou, Hui wrote:
> ./step-17.release: error while loading shared libraries: libmpiprofilesupport.so.3: cannot open shared object file: No such file or directory
This is complaining that a dynamic library (one that is linked into your binary) cannot be found on the compute node. Make sure the path to that library is in the LD_LIBRARY_PATH.
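As a sketch (the library directory below is a placeholder), you can list the unresolved libraries and prepend the right directory:
# Print every dynamic dependency the loader cannot resolve on this node.
ldd ./step-17.release | grep "not found"
# Prepend the directory that actually contains the missing library.
export LD_LIBRARY_PATH=/path/to/libdir:$LD_LIBRARY_PATH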
________________________________________
From: Feimi Yu <yuf2 at rpi.edu>
Sent: Saturday, September 18, 2021 9:13 PM
To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org <discuss at mpich.org>
Subject: Re: [mpich-discuss] Installing MPICH on clusters
Hi Hui,
Thank you for the response! Here is the Slurm batch file I used to run a program with MPICH configured with Hydra:
#!/bin/bash
#SBATCH --job-name=5e-7
#SBATCH --partition=el8
#SBATCH --time 6:00:00
#SBATCH --ntasks 40
#SBATCH --nodes 1
#SBATCH --gres=gpu:4
date
export LD_LIBRARY_PATH=/gpfs/u/home/CFSI/CFSIfmyu/barn-shared/dcs-rh8/mpich-build/lib:$MPI_ROOT:$LD_LIBRARY_PATH
srun --mpi=mpichmx hostname -s | sort -u > /tmp/hosts.$SLURM_JOB_ID
awk "{ print \$0 \":40\"; }" /tmp/hosts.$SLURM_JOB_ID >/tmp/tmp.$SLURM_JOB_ID
mv /tmp/tmp.$SLURM_JOB_ID ./hosts.$SLURM_JOB_ID
/gpfs/u/home/CFSI/CFSIfmyu/barn-shared/dcs-rh8/mpich-build/bin/mpiexec -f ./hosts.$SLURM_JOB_ID -np $SLURM_NPROCS ./step-17.release
date
And the error message is:
./step-17.release: error while loading shared libraries: libmpiprofilesupport.so.3: cannot open shared object file: No such file or directory
I was not sure whether this is related to a network problem, because the cluster uses InfiniBand. Running "/sbin/ifconfig" shows ib0, ib1, ib2 and ib3. I tried the option "-iface ib0" and the error message became:
[mpiexec at dcs176] HYDU_sock_get_iface_ip (utils/sock/sock.c:451): unable to find interface ib0
[mpiexec at dcs176] HYDU_sock_create_and_listen_portstr (utils/sock/sock.c:496): unable to get network interface IP
[mpiexec at dcs176] HYD_pmci_launch_procs (pm/pmiserv/pmiserv_pmci.c:79): unable to create PMI port
[mpiexec at dcs176] main (ui/mpich/mpiexec.c:322): process manager returned error launching processes
Specifying ib1-ib3 gives similar results.
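As I understand it, the name given to -iface must carry an IP address on the node where mpiexec runs; here is a sketch of how the candidates can be listed there:
# List each interface name with its IPv4 address on the launch node.
ip -o -4 addr show | awk '{print $2, $4}'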
Thanks!
Feimi
On 9/17/21 7:57 PM, Zhou, Hui wrote:
Hi Feimi,
Hydra should be able to work with slurm. How are you launching the job and what is the failure message?
--
Hui Zhou
________________________________________
From: Feimi Yu via discuss <discuss at mpich.org>
Sent: Friday, September 17, 2021 10:55 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Feimi Yu <yuf2 at rpi.edu>
Subject: [mpich-discuss] Installing MPICH on clusters
Hi,
I'm working on a supercomputer that only provides the Spectrum MPI implementation in its modules. Since our code does not perform well with Spectrum MPI, I decided to install my own MPICH build on our partition (I'm not an administrator). The supercomputer runs RHEL 8 on the ppc64le architecture, with Slurm as the process manager. I tried several build options according to the user guide but could not run a job, so I have a few questions. Here is what I tried:
1. Build with the Hydra PM. I could not launch a job with Hydra at all.
2. Then I decided to build with the ``--with-pm=none`` option and use srun + ``mpiexec -f hostfile`` to launch my job. But what confuses me is the PMI setting:
srun --mpi=list gives the following:
srun: mpi/mpichgm
srun: mpi/mpichmx
srun: mpi/none
srun: mpi/mvapich
srun: mpi/openmpi
srun: mpi/pmi2
srun: mpi/lam
srun: mpi/mpich1_p4
srun: mpi/mpich1_shmem
At first I tried PMIx, since I found the PMIx libraries, but it didn't do the trick: it segfaults in PMPI_Init_thread(). The error message is:
[dcs135:2312190] PMIX ERROR: NOT-FOUND in file client/pmix_client.c at line 562
Abort(1090831) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(159):
MPID_Init(509).......:
MPIR_pmi_init(92)....: PMIX_Init returned -46
[dcs135:2312190:0:2312190] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
Then I switched to PMI2, but make kept telling me there were undefined references to the PMI2 library. (Actually, I couldn't find the PMI2 libraries either.)
Then I used ``--with-pmi=slurm``, and it turned out that I couldn't locate the Slurm header files; I guess I don't have permission to access them.
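For context, that last attempt roughly corresponds to a configure line like the following (a sketch; the --with-slurm path is a guess on my part, and the prefix is my install directory from above):
# Build MPICH without a built-in process manager, using Slurm's PMI.
# --with-slurm should point at a Slurm install whose headers are readable.
./configure --prefix=/gpfs/u/home/CFSI/CFSIfmyu/barn-shared/dcs-rh8/mpich-build \
            --with-pm=none --with-pmi=slurm --with-slurm=/usr
make -j8 && make install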
I was wondering: is it still possible for me to build a usable MPICH as a user? If so, what do I need to do to get the PMI working?
Thanks!
Feimi