[mpich-discuss] Installing MPICH on clusters

Feimi Yu yuf2 at rpi.edu
Sat Sep 18 21:13:20 CDT 2021

Hi Hui,

Thank you for the response! Here is the Slurm batch file I used to run a 
program with MPICH configured with Hydra:

#SBATCH --job-name=5e-7
#SBATCH --partition=el8
#SBATCH --time 6:00:00
#SBATCH --ntasks 40
#SBATCH --nodes 1
#SBATCH --gres=gpu:4

srun --mpi=mpichmx hostname -s | sort -u > /tmp/hosts.$SLURM_JOB_ID
awk "{ print \$0 \":40\"; }" /tmp/hosts.$SLURM_JOB_ID > /tmp/tmp.$SLURM_JOB_ID
mv /tmp/tmp.$SLURM_JOB_ID ./hosts.$SLURM_JOB_ID
mpiexec -f ./hosts.$SLURM_JOB_ID -np $SLURM_NPROCS ./step-17.release
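
For reference, the hostfile this produces should end up with one line per
node and the slot count appended by the awk step, e.g. (the hostname below
is only an example):

  dcs176:40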

And the error message is:

./step-17.release: error while loading shared libraries: 
libmpiprofilesupport.so.3: cannot open shared object file: No such file 
or directory
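
One thing I suspect, though I have not confirmed it, is that
LD_LIBRARY_PATH does not include my MPICH library directory on the
compute node, so the loader cannot resolve the MPI shared libraries. A
minimal check I plan to add to the batch script before the mpiexec line
(the prefix $HOME/mpich-install is just a placeholder for wherever I
installed MPICH):

  # Point the loader at my own MPICH build; adjust the prefix as needed.
  export LD_LIBRARY_PATH=$HOME/mpich-install/lib:$LD_LIBRARY_PATH
  # Verify that no shared libraries remain unresolved for the executable.
  ldd ./step-17.release | grep "not found"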

I was not sure whether this is related to a network problem, since the 
cluster uses InfiniBand. Running "/sbin/ifconfig" lists ib0, ib1, ib2, 
and ib3. I tried the option "-iface ib0", and the error message became:

[mpiexec at dcs176] HYDU_sock_get_iface_ip (utils/sock/sock.c:451): unable 
to find interface ib0
[mpiexec at dcs176] HYDU_sock_create_and_listen_portstr 
(utils/sock/sock.c:496): unable to get network interface IP
[mpiexec at dcs176] HYD_pmci_launch_procs (pm/pmiserv/pmiserv_pmci.c:79): 
unable to create PMI port
[mpiexec at dcs176] main (ui/mpich/mpiexec.c:322): process manager returned 
error launching processes

Specifying ib1-ib3 gives similar results.
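
Since mpiexec complains that it cannot find ib0 even though ifconfig
lists it, I also double-checked, from inside the job allocation on the
node that actually launches mpiexec, that the name matches exactly
(standard tools only, nothing MPICH-specific):

  # List the interface names visible on this node; "-iface" must match
  # one of these exactly.
  ip -o link show | awk -F': ' '{ print $2 }'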



On 9/17/21 7:57 PM, Zhou, Hui wrote:
> Hi Feimi,
> Hydra should be able to work with Slurm. How are you launching the job, 
> and what is the failure message?
> -- 
> Hui Zhou
> ------------------------------------------------------------------------
> *From:* Feimi Yu via discuss <discuss at mpich.org>
> *Sent:* Friday, September 17, 2021 10:55 AM
> *To:* discuss at mpich.org <discuss at mpich.org>
> *Cc:* Feimi Yu <yuf2 at rpi.edu>
> *Subject:* [mpich-discuss] Installing MPICH on clusters
> Hi,
> I'm working on a supercomputer that only provides the Spectrum MPI 
> implementation in its modules. Since our code does not perform well with 
> Spectrum MPI, I decided to install my own MPICH build on our partition 
> (I'm not an administrator). The supercomputer runs RHEL 8 on the 
> ppc64le architecture with Slurm as the process manager. I tried 
> several build options according to the user guide but could not run 
> a job, so I have a few questions. Here is what I tried:
> 1. Build with the Hydra PM. I could not launch a job with Hydra at all.
> 2. Then I decided to build with the ``--with-pm=none`` option and use 
> srun + ``mpiexec -f hostfile`` to launch my job. But what confuses me 
> is the PMI setting:
> srun --mpi=list gives the following:
> srun: mpi/mpichgm
> srun: mpi/mpichmx
> srun: mpi/none
> srun: mpi/mvapich
> srun: mpi/openmpi
> srun: mpi/pmi2
> srun: mpi/lam
> srun: mpi/mpich1_p4
> srun: mpi/mpich1_shmem
> At first I tried to use pmix, since I found the pmix libraries, but it 
> didn't do the trick: it segfaults in PMPI_Init_thread(). The error message is:
> [dcs135:2312190] PMIX ERROR: NOT-FOUND in file client/pmix_client.c at line 562
> Abort(1090831) on node 0 (rank 0 in comm 0): Fatal error in 
> PMPI_Init_thread: Other MPI error, error stack:
> MPIR_Init_thread(159):
> MPID_Init(509).......:
> MPIR_pmi_init(92)....: PMIX_Init returned -46
> [dcs135:2312190:0:2312190] Caught signal 11 (Segmentation fault: 
> address not mapped to object at address (nil))
> Then I switched to pmi2, but make kept reporting undefined references 
> to the PMI2 library (and I couldn't find the pmi2 libraries either).
> Then I used ``--with-pmi=slurm``, and it turned out that I couldn't 
> locate the Slurm header files. I guess I don't have permission to 
> access them.
> I was wondering whether it is still possible to build a usable MPICH 
> as a user. If so, what can I do to get PMI to work?
> Thanks!
> Feimi