[mpich-discuss] Installing MPICH on clusters
Feimi Yu
yuf2 at rpi.edu
Mon Sep 20 18:33:22 CDT 2021
Thank you for the hint! I finally got my job running on my own MPI
build. Just out of curiosity: I had actually searched for this
libmpiprofilesupport library before and discovered that it is specific
to Spectrum MPI, so I didn't go further. But MPICH just ran magically
after I loaded the Spectrum MPI module and had the Spectrum MPI lib path
in $LD_LIBRARY_PATH. Why does MPICH have to use Spectrum MPI's libraries?
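
(A quick way to see this, in case anyone hits the same thing: ldd lists
the shared libraries the dynamic linker resolves for a binary, so the
line below shows which MPI libraries the executable is actually picking
up. ./step-17.release is just the binary from my batch script.)

ldd ./step-17.release | grep -i mpi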
Thanks!
Feimi
On 9/19/2021 1:57 PM, Zhou, Hui wrote:
> > ./step-17.release: error while loading shared libraries:
> libmpiprofilesupport.so.3: cannot open shared object file: No such
> file or directory
>
> This is complaining that a dynamic library linked into your binary
> cannot be found on the compute node. Make sure the path to that
> library is in LD_LIBRARY_PATH.
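>
> For example, something along these lines in your batch script before
> launching (the path is only a placeholder for wherever that .so file
> actually lives on your system):
>
> export LD_LIBRARY_PATH=/path/containing/that/library:$LD_LIBRARY_PATH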
>
> ------------------------------------------------------------------------
> *From:* Feimi Yu <yuf2 at rpi.edu>
> *Sent:* Saturday, September 18, 2021 9:13 PM
> *To:* Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org <discuss at mpich.org>
> *Subject:* Re: [mpich-discuss] Installing MPICH on clusters
>
> Hi Hui,
>
>
> Thank you for the response! Here is the Slurm batch file I used to run
> a program with MPICH configured with Hydra:
>
> #!/bin/bash
> #SBATCH --job-name=5e-7
> #SBATCH --partition=el8
> #SBATCH --time 6:00:00
> #SBATCH --ntasks 40
> #SBATCH --nodes 1
> #SBATCH --gres=gpu:4
>
> date
> export LD_LIBRARY_PATH=/gpfs/u/home/CFSI/CFSIfmyu/barn-shared/dcs-rh8/mpich-build/lib:$MPI_ROOT:$LD_LIBRARY_PATH
> srun --mpi=mpichmx hostname -s | sort -u > /tmp/hosts.$SLURM_JOB_ID
> awk "{ print \$0 \":40\"; }" /tmp/hosts.$SLURM_JOB_ID > /tmp/tmp.$SLURM_JOB_ID
> mv /tmp/tmp.$SLURM_JOB_ID ./hosts.$SLURM_JOB_ID
>
> /gpfs/u/home/CFSI/CFSIfmyu/barn-shared/dcs-rh8/mpich-build/bin/mpiexec -f ./hosts.$SLURM_JOB_ID -np $SLURM_NPROCS ./step-17.release
>
> date
>
>
> And the error message is:
>
> ./step-17.release: error while loading shared libraries:
> libmpiprofilesupport.so.3: cannot open shared object file: No such
> file or directory
>
>
> I was not sure whether this is related to a network problem, because
> the clusters use InfiniBand. Running "/sbin/ifconfig" gives ib0, ib1,
> ib2, and ib3. I tried the option "-iface ib0" and the error message
> became:
>
> [mpiexec at dcs176] HYDU_sock_get_iface_ip (utils/sock/sock.c:451):
> unable to find interface ib0
> [mpiexec at dcs176] HYDU_sock_create_and_listen_portstr
> (utils/sock/sock.c:496): unable to get network interface IP
> [mpiexec at dcs176] HYD_pmci_launch_procs (pm/pmiserv/pmiserv_pmci.c:79):
> unable to create PMI port
> [mpiexec at dcs176] main (ui/mpich/mpiexec.c:322): process manager
> returned error launching processes
>
>
> Specifying ib1-ib3 gives similar results.
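>
> (The interface names above are from running /sbin/ifconfig; to
> double-check what the allocated compute node itself reports, something
> like the following could be run under the same allocation:)
>
> srun --nodes=1 --ntasks=1 /sbin/ifconfig -a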
>
>
> Thanks!
>
> Feimi
>
>
> On 9/17/21 7:57 PM, Zhou, Hui wrote:
>> Hi Feimi,
>>
>> Hydra should be able to work with Slurm. How are you launching the
>> job, and what is the failure message?
>>
>> --
>> Hui Zhou
>> ------------------------------------------------------------------------
>> *From:* Feimi Yu via discuss <discuss at mpich.org>
>> *Sent:* Friday, September 17, 2021 10:55 AM
>> *To:* discuss at mpich.org <discuss at mpich.org>
>> *Cc:* Feimi Yu <yuf2 at rpi.edu>
>> *Subject:* [mpich-discuss] Installing MPICH on clusters
>>
>> Hi,
>>
>> I'm working on a supercomputer that only provides the Spectrum MPI
>> implementation through its modules. Since our code does not perform
>> well with Spectrum MPI, I decided to install my own MPICH build on
>> our partition (I'm not an administrator). The supercomputer runs
>> RHEL 8 on the ppc64le architecture with Slurm as the process manager.
>> I tried several build options according to the user guide but could
>> not run a job, so I have a few questions. Here are the things I tried:
>>
>> 1. Build with Hydra PM. I could not launch a job with Hydra at all.
>>
>> 2. Then I decided to build with the ``--with-pm=none`` option and use
>> srun + ``mpiexec -f hostfile`` to launch my job. But what confuses me
>> is the PMI setting:
>>
>> ``srun --mpi=list`` gives the following:
>>
>> srun: mpi/mpichgm
>> srun: mpi/mpichmx
>> srun: mpi/none
>> srun: mpi/mvapich
>> srun: mpi/openmpi
>> srun: mpi/pmi2
>> srun: mpi/lam
>> srun: mpi/mpich1_p4
>> srun: mpi/mpich1_shmem
>>
>> At first I tried to use PMIx, since I found the PMIx libraries. But it
>> didn't do the trick: it segfaults in PMPI_Init_thread(). The error
>> message is:
>>
>> [dcs135:2312190] PMIX ERROR: NOT-FOUND in file client/pmix_client.c
>> at line 562
>>
>> Abort(1090831) on node 0 (rank 0 in comm 0): Fatal error in
>> PMPI_Init_thread: Other MPI error, error stack:
>> MPIR_Init_thread(159):
>> MPID_Init(509).......:
>> MPIR_pmi_init(92)....: PMIX_Init returned -46
>> [dcs135:2312190:0:2312190] Caught signal 11 (Segmentation fault:
>> address not mapped to object at address (nil))
>>
>> Then I switched to pmi2, but make kept reporting undefined references
>> to the PMI2 library. (Actually, I couldn't find the PMI2 libraries
>> either.)
>>
>> Then I used ``--with-pmi=slurm``, and it turned out that the Slurm
>> header files could not be located. I guess I don't have permission to
>> access them.
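>>
>> (For reference, the configure lines I tried were roughly of this form;
>> the install prefix is shortened here, and the Slurm path is a
>> placeholder for the installation whose headers I could not locate:)
>>
>> ./configure --prefix=$HOME/mpich-build --with-pm=none --with-pmi=slurm --with-slurm=/path/to/slurm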
>>
>> I was wondering whether it is still possible for me to build a usable
>> MPICH as a regular user. If so, what can I do to get the PMI part
>> working?
>>
>> Thanks!
>>
>> Feimi