[mpich-discuss] Installing MPICH on clusters

Zhou, Hui zhouh at anl.gov
Wed Sep 22 15:50:23 CDT 2021


Feimi Yu,

The profiling library is a separate library that sits between MPI and the user application so that the system can gather statistics on applications' MPI usage. Its use is optional and typically transparent to users. It is probably loaded by linker or preloader scripts on your system. It is not surprising that this profiling library has a hard-coded libmpi dependency set during its build; in your case it is linked to the system MPI, which is Spectrum MPI. There is usually an option to not load this profiling library. Please consult your system administrator on how to skip it.
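
As a rough way to confirm such a hard-coded dependency, the dynamic section of the profiling library can be inspected with readelf. This is only a sketch: the helper name needed_libs and the library path in the usage comment are assumptions for illustration, not something taken from the system in question.

```shell
#!/bin/bash
# Print the shared libraries an ELF file declares as NEEDED.
# Reads `readelf -d` output on stdin and extracts the names in [brackets].
needed_libs() {
    sed -n 's/.*(NEEDED).*\[\(.*\)\]/\1/p'
}

# Usage (the path is hypothetical; use wherever the library lives):
#   readelf -d /path/to/libmpiprofilesupport.so.3 | needed_libs
# It is also worth checking whether a preload is set in the job environment:
#   echo "LD_PRELOAD=${LD_PRELOAD:-<unset>}"
```

If a libmpi from the system MPI shows up in that list, the profiling library will pull it in regardless of which MPI the application itself was built against.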

-- 
Hui Zhou

-----Original Message-----
From: Raffenetti, Kenneth J. <raffenet at mcs.anl.gov> 
Sent: Tuesday, September 21, 2021 11:33 AM
To: discuss at mpich.org; Zhou, Hui <zhouh at anl.gov>
Cc: Feimi Yu <yuf2 at rpi.edu>
Subject: Re: [mpich-discuss] Installing MPICH on clusters

MPICH does not need to use Spectrum MPI's libraries. Are you sure your application isn't actually linked against Spectrum MPI rather than MPICH?
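
One way to answer that question is to look at which libmpi the loader actually resolves for the binary. The following is a minimal sketch, assuming ldd is available on the node; the helper name resolved_mpi_libs is made up for illustration.

```shell
#!/bin/bash
# From `ldd` output on stdin, print each libmpi* library and the path
# the loader resolved it to (or "(not found)" if it is unresolved).
resolved_mpi_libs() {
    awk '$1 ~ /^libmpi/ && $2 == "=>" {
        print $1, ($3 == "not" ? "(not found)" : $3)
    }'
}

# Usage (binary name taken from the error message in this thread):
#   ldd ./step-17.release | resolved_mpi_libs
```

If the reported path points into the Spectrum MPI installation rather than the MPICH build directory, the application was linked against, or is picking up, the wrong MPI.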

Ken

On 9/20/21, 6:33 PM, "Feimi Yu via discuss" <discuss at mpich.org> wrote:

    Thank you for the hint! I finally got my job running on my own MPI build. Just out of curiosity: I had actually searched for this libmpiprofilesupport library before and found that it is specific to Spectrum MPI, so I didn't go further. But MPICH just ran magically after I loaded Spectrum MPI and added the Spectrum MPI lib path to $LD_LIBRARY_PATH. Why does MPICH have to use Spectrum MPI's libraries?
    
    Thanks!
    Feimi
    

    On 9/19/2021 1:57 PM, Zhou, Hui wrote:
    
    
    > ./step-17.release: error while loading shared libraries: libmpiprofilesupport.so.3: cannot open shared object file: No such file or directory
    
    
    This error means the dynamic loader cannot find a shared library (one that your binary is linked against) on the compute node. Make sure the path to that library is in LD_LIBRARY_PATH.
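
A quick way to see every library the loader fails to resolve, sketched here under the assumption that ldd can be run on the compute node (the helper name missing_libs and the directory in the usage comment are hypothetical):

```shell
#!/bin/bash
# Print the names of shared libraries that the dynamic loader cannot
# resolve. Reads `ldd` output on stdin.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage on the compute node:
#   ldd ./step-17.release | missing_libs
# After locating a missing library, prepend its directory, for example:
#   export LD_LIBRARY_PATH=/path/to/dir/containing/the/lib:$LD_LIBRARY_PATH
```

Running the check inside the batch job, rather than on the login node, matters because the two environments may load different modules.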
    
    
    ________________________________________
    From: Feimi Yu <yuf2 at rpi.edu>
    Sent: Saturday, September 18, 2021 9:13 PM
    To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org
    Subject: Re: [mpich-discuss] Installing MPICH on clusters  
    
    Hi Hui,
    
    Thank you for the response! Here is the Slurm batch file I used to run a program with MPICH configured with Hydra:
    #!/bin/bash
    #SBATCH --job-name=5e-7
    #SBATCH --partition=el8
    #SBATCH --time 6:00:00
    #SBATCH --ntasks 40
    #SBATCH --nodes 1
    #SBATCH --gres=gpu:4
    
    date
    export LD_LIBRARY_PATH=/gpfs/u/home/CFSI/CFSIfmyu/barn-shared/dcs-rh8/mpich-build/lib:$MPI_ROOT:$LD_LIBRARY_PATH
    srun --mpi=mpichmx hostname -s | sort -u > /tmp/hosts.$SLURM_JOB_ID
    awk "{ print \$0 \":40\"; }" /tmp/hosts.$SLURM_JOB_ID >/tmp/tmp.$SLURM_JOB_ID
    mv /tmp/tmp.$SLURM_JOB_ID ./hosts.$SLURM_JOB_ID
    
    /gpfs/u/home/CFSI/CFSIfmyu/barn-shared/dcs-rh8/mpich-build/bin/mpiexec -f ./hosts.$SLURM_JOB_ID -np $SLURM_NPROCS ./step-17.release
    
    date
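
The host file step in the script above can also be written without the temporary file and the escaped awk quoting. This is a minimal sketch, not the script that was actually run: the function name make_hostfile is made up, and it assumes the "host:slots" host file format that Hydra's mpiexec accepts with -f.

```shell
#!/bin/bash
# Turn a list of hostnames (one per line, duplicates allowed) into a
# Hydra host file with a fixed number of slots per host, e.g. "dcs176:40".
make_hostfile() {
    local slots="$1"
    sort -u | awk -v n="$slots" '{ print $0 ":" n }'
}

# Inside a Slurm job the node list could come from scontrol, for example:
#   scontrol show hostnames "$SLURM_JOB_NODELIST" | make_hostfile 40 > "hosts.$SLURM_JOB_ID"
```

Passing the slot count through awk's -v option avoids the nested backslash escapes, and reading the node list from scontrol avoids launching a job step with srun just to collect hostnames.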
    
    
    And the error message is:
    ./step-17.release: error while loading shared libraries: libmpiprofilesupport.so.3: cannot open shared object file: No such file or directory
    
    I was not sure whether this is related to a network problem, because the cluster uses InfiniBand. Running "/sbin/ifconfig" lists ib0, ib1, ib2 and ib3. I tried the option "-iface ib0" and the error message became:
    [mpiexec at dcs176] HYDU_sock_get_iface_ip (utils/sock/sock.c:451): unable to find interface ib0 
    [mpiexec at dcs176] HYDU_sock_create_and_listen_portstr (utils/sock/sock.c:496): unable to get network interface IP
    [mpiexec at dcs176] HYD_pmci_launch_procs (pm/pmiserv/pmiserv_pmci.c:79): unable to create PMI port
    [mpiexec at dcs176] main (ui/mpich/mpiexec.c:322): process manager returned error launching processes
    
    Specifying ib1-ib3 gives similar results.
    
    
    Thanks!
    Feimi
    
    
    On 9/17/21 7:57 PM, Zhou, Hui wrote:
    
    
    Hi Feimi,
    
    Hydra should be able to work with Slurm. How are you launching the job, and what is the failure message?
    
    -- 
    
    Hui Zhou
    
    ________________________________________
    From: Feimi Yu via discuss <discuss at mpich.org>
    Sent: Friday, September 17, 2021 10:55 AM
    To: discuss at mpich.org
    Cc: Feimi Yu <yuf2 at rpi.edu>
    Subject: [mpich-discuss] Installing MPICH on clusters  
    
    Hi,
    I'm working on a supercomputer that only provides the Spectrum MPI implementation as a module. Since our code does not perform well with Spectrum MPI, I decided to install an MPICH build on our own partition (I'm not an administrator). The supercomputer runs RHEL 8 on the ppc64le architecture with Slurm as the process manager. I tried several build options according to the user guide but could not run a job, so I have a few questions. Here are the things I tried:
    1. Build with Hydra PM. I could not launch a job with Hydra at all.
    2. Then I decided to build with the ``--with-pm=none`` option and use srun + ``mpiexec -f hostfile`` to launch my job. But what confuses me is the PMI setting:
    srun --mpi=list gives the following:
    srun: mpi/mpichgm
    srun: mpi/mpichmx
    srun: mpi/none
    srun: mpi/mvapich
    srun: mpi/openmpi
    srun: mpi/pmi2
    srun: mpi/lam
    srun: mpi/mpich1_p4
    srun: mpi/mpich1_shmem
    At first I tried to use pmix since I had found the pmix libraries, but it didn't do the trick: it segfaults in PMPI_Init_thread(). The error message is:
    [dcs135:2312190] PMIX ERROR: NOT-FOUND in file client/pmix_client.c at line 562
    Abort(1090831) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
    MPIR_Init_thread(159): 
    MPID_Init(509).......: 
    MPIR_pmi_init(92)....: PMIX_Init returned -46 
    [dcs135:2312190:0:2312190] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
    
    Then I switched to pmi2, but make keeps reporting undefined references to the PMI2 library (actually, I couldn't find the pmi2 libraries either).
    Then I used ``--with-pmi=slurm``, and it turned out that I couldn't locate the Slurm header files. I guess I don't have permission to access them.
    I was wondering whether it is still possible for me to build a usable MPICH as a user. If so, what can I do to get PMI working?
    
    Thanks!
    Feimi 


