Hi Hui,

Thank you for the response! Here is the Slurm batch file I used to run a program with MPICH configured with Hydra:
#!/bin/bash
#SBATCH --job-name=5e-7
#SBATCH --partition=el8
#SBATCH --time 6:00:00
#SBATCH --ntasks 40
#SBATCH --nodes 1
#SBATCH --gres=gpu:4

date
export LD_LIBRARY_PATH=/gpfs/u/home/CFSI/CFSIfmyu/barn-shared/dcs-rh8/mpich-build/lib:$MPI_ROOT:$LD_LIBRARY_PATH
srun --mpi=mpichmx hostname -s | sort -u > /tmp/hosts.$SLURM_JOB_ID
awk "{ print \$0 \":40\"; }" /tmp/hosts.$SLURM_JOB_ID > /tmp/tmp.$SLURM_JOB_ID
mv /tmp/tmp.$SLURM_JOB_ID ./hosts.$SLURM_JOB_ID

/gpfs/u/home/CFSI/CFSIfmyu/barn-shared/dcs-rh8/mpich-build/bin/mpiexec -f ./hosts.$SLURM_JOB_ID -np $SLURM_NPROCS ./step-17.release

date
And the error message is:

./step-17.release: error while loading shared libraries: libmpiprofilesupport.so.3: cannot open shared object file: No such file or directory
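In case it is useful, this is the kind of check I have been running to see where the unresolved library is supposed to come from. It is only a sketch (the binary and the mpich-build path are the ones from the script above), not output from an actual run:

# list any shared libraries the binary cannot resolve on the login node
ldd ./step-17.release | grep -i "not found"

# repeat the check on a compute node, since the LD_LIBRARY_PATH exported in the
# batch script is not necessarily what the launched processes end up inheriting
srun --ntasks=1 bash -c 'echo $LD_LIBRARY_PATH; ldd ./step-17.release | grep -i "not found"'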
I was not sure whether this is related to a network problem, because the clusters use InfiniBand. Running "/sbin/ifconfig" shows ib0, ib1, ib2, and ib3. I tried the option "-iface ib0" and the error message became:
[mpiexec@dcs176] HYDU_sock_get_iface_ip (utils/sock/sock.c:451): unable to find interface ib0
[mpiexec@dcs176] HYDU_sock_create_and_listen_portstr (utils/sock/sock.c:496): unable to get network interface IP
[mpiexec@dcs176] HYD_pmci_launch_procs (pm/pmiserv/pmiserv_pmci.c:79): unable to create PMI port
[mpiexec@dcs176] main (ui/mpich/mpiexec.c:322): process manager returned error launching processes
Specifying ib1-ib3 gives similar results.
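For what it is worth, this is roughly how I have been double-checking that those interfaces are up and actually carry an IP address (just a sketch of the commands, on the assumption that "-iface" needs an interface Hydra can resolve an IP for):

# list the InfiniBand interfaces and their link state
/sbin/ip -o link show | grep ib
# check whether any of them has an IPv4 address assigned
/sbin/ip -o -4 addr show | grep ib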
Thanks!

Feimi
On 9/17/21 7:57 PM, Zhou, Hui wrote:
Hi Feimi,

Hydra should be able to work with Slurm. How are you launching the job, and what is the failure message?

--
Hui Zhou
________________________________
From: Feimi Yu via discuss <discuss@mpich.org>
Sent: Friday, September 17, 2021 10:55 AM
To: discuss@mpich.org <discuss@mpich.org>
Cc: Feimi Yu <yuf2@rpi.edu>
Subject: [mpich-discuss] Installing MPICH on clusters

Hi,
I'm working on a supercomputer that only provides the Spectrum MPI implementation as a module. Since our code does not perform well with Spectrum MPI, I decided to install my own MPICH build on our partition (I'm not an administrator). The machine runs RHEL 8 on the ppc64le architecture, with Slurm as the process manager. I tried several build options according to the user guide but could not run a job, so I have a few questions. Here is what I tried:
1. Build with the Hydra process manager. I could not launch a job with Hydra at all.
2. Build with the ``--with-pm=none`` option and launch my job with srun + ``mpiexec -f hostfile``. What confuses me is the PMI setting. ``srun --mpi=list`` gives the following:
srun: mpi/mpichgm
srun: mpi/mpichmx
srun: mpi/none
srun: mpi/mvapich
srun: mpi/openmpi
srun: mpi/pmi2
srun: mpi/lam
srun: mpi/mpich1_p4
srun: mpi/mpich1_shmem
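If one of the Slurm PMI plugins actually matched my build, what I was hoping to end up with is a direct srun launch along these lines (a sketch of the intent only, not something that currently runs for me):

# let Slurm's PMI-2 plugin wire up the MPI processes directly,
# instead of generating a hostfile and calling mpiexec by hand
srun --mpi=pmi2 -n $SLURM_NPROCS ./step-17.release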
At first I tried to use PMIx, since I found the pmix libraries, but it didn't do the trick: it segfaults in PMPI_Init_thread(). The error message is:
[dcs135:2312190] PMIX ERROR: NOT-FOUND in file client/pmix_client.c at line 562

Abort(1090831) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(159):
MPID_Init(509).......:
MPIR_pmi_init(92)....: PMIX_Init returned -46
[dcs135:2312190:0:2312190] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
Then I switched to pmi2, but make keeps reporting undefined references to the PMI2 library (actually, I couldn't find the pmi2 libraries either).
Then I tried ``--with-pmi=slurm``, and it turned out that configure couldn't locate the Slurm header files. I guess I don't have permission to access them.
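For reference, the configure invocation from that last attempt looked roughly like this (reconstructed from memory, so treat it as a sketch; the prefix is the same build directory used in the batch script above):

# build MPICH into my own prefix, with no built-in process manager,
# and ask it to use Slurm's PMI; this is the step where the missing headers bite
./configure --prefix=/gpfs/u/home/CFSI/CFSIfmyu/barn-shared/dcs-rh8/mpich-build \
            --with-pm=none \
            --with-pmi=slurm
make && make install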
I was wondering whether it is still possible for me to build a usable MPICH as an ordinary user. If so, what do I need to do to get the PMI side working?
Thanks!

Feimi