[mpich-discuss] MPICH configure

Palmer, Bruce J Bruce.Palmer at pnnl.gov
Thu May 7 11:16:34 CDT 2020


Hi Hui,

Sorry for the late reply; I keep getting pulled off onto other projects. I’m actually running a test suite most of the time, so the job scripts look like this:


#!/bin/csh
#SBATCH -t 02:30:00
#SBATCH -A XGA
#SBATCH -p short,slurm,gpu
#SBATCH -N 2
#SBATCH -n 6
#SBATCH -o ./test.out
#SBATCH -e ./test.err

source /etc/profile.d/modules.csh

source ~/set_mpich
env | grep PATH
module list

#make check-ga MPIEXEC="mpirun -n 6 "
make check-ga MPIEXEC="srun -n 6 "



I’ve tried using mpirun, srun and mpiexec in the MPIEXEC variable. If I run a test standalone, then the job submission script is


#!/bin/csh
#SBATCH -t 02:30:00
#SBATCH -A XGA
#SBATCH -p short,slurm,gpu
#SBATCH -N 2
#SBATCH -n 6
#SBATCH -o ./test.out
#SBATCH -e ./test.err

source /etc/profile.d/modules.csh

source ~/set_mpich
env | grep PATH
module list

srun -n 6 test.x > test.out
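
One way to separate Slurm's task launch from MPI start-up is to run a trivial non-MPI script under the same srun line. The sketch below is written in POSIX sh rather than csh, and the script name task_banner.sh is hypothetical. SLURM_PROCID and SLURM_NTASKS are environment variables that Slurm sets for each task in a job step.

```sh
#!/bin/sh
# task_banner.sh (hypothetical name): print which Slurm task this process is.
# SLURM_PROCID and SLURM_NTASKS are set by srun for every task in a step;
# "unset" is printed when the script runs outside of srun.
task_banner() {
    printf 'task %s of %s on %s\n' \
        "${SLURM_PROCID:-unset}" "${SLURM_NTASKS:-unset}" "$(uname -n)"
}

task_banner
```

In the job script this would stand in for the test executable, e.g. `srun -n 6 ./task_banner.sh`. Six lines numbered 0 through 5 would suggest that Slurm itself starts all tasks, which would point at the PMI wire-up between srun and the MPI library rather than at the allocation.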



Again, I’ve tried running with mpirun, srun, and mpiexec. The environment in the set_mpich file is



module purge
module load gcc/5.2.0
module load python/2.7.8
module load cmake/3.8.2
module load git
module load mkl
setenv CC gcc
setenv CFLAGS "-pthread"
setenv CXX g++
setenv CXXFLAGS "-pthread"
setenv FC gfortran
setenv FCFLAGS "-pthread"
unsetenv F90
unsetenv F90FLAGS

setenv PATH /people/d3g293/mpich/mpich-3.3.2/install/bin:${PATH}
setenv MANPATH /people/d3g293/mpich/mpich-3.3.2/install/share/man:${MANPATH}
setenv LD_LIBRARY_PATH /people/d3g293/mpich/mpich-3.3.2/install/lib:${LD_LIBRARY_PATH}
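
Since this thread compares MPICH 3.3.1 and 3.3.2 builds, it can help to confirm which install each launcher resolves to once set_mpich has been sourced. The following is a minimal sketch in POSIX sh (not csh). report_tool is a hypothetical helper name, and the list of tools is only an assumption about what is worth checking.

```sh
#!/bin/sh
# Print the full path a command name resolves to, or note that it is missing.
report_tool() {
    if tool_path=$(command -v "$1"); then
        echo "$1: $tool_path"
    else
        echo "$1: not found in PATH"
        return 1
    fi
}

# Assumed set of launchers and wrappers to check; adjust as needed.
for tool in mpicc mpiexec mpirun srun; do
    report_tool "$tool" || true
done
```

When run from the job script after `source ~/set_mpich`, the child sh process inherits the PATH that csh exported with setenv, so the output reflects what the job actually sees.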



Bruce

From: "Zhou, Hui" <zhouh at anl.gov>
Date: Thursday, April 30, 2020 at 11:25 AM
To: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Bruce,

Could you share your job scripts with us? It’ll be helpful to understand exactly how you launch jobs.

--
Hui Zhou


From: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Date: Thursday, April 30, 2020 at 9:25 AM
To: "Zhou, Hui" <zhouh at anl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH configure

Hi Hui,

I checked with the system admins and they say there are no firewall restrictions between nodes. According to them, the nodes can talk to any port on any machine.

I’ve also followed up on their suggestion that I link with PMI and verified with ldd that the PMI libraries are showing up before MPI in the executables. I still only get 1 process when running with srun. This is a summary of what I am seeing when I run on 2 nodes (after configuring and building with slurm).

MPICH-3.3.1
Launch with mpiexec: runs okay
Launch with mpirun: runs okay
Link with PMI and Launch with srun: only get 1 process (from MPI_Comm_size on MPI_COMM_WORLD)

MPICH-3.3.2
Launch with mpiexec: hangs
Launch with mpirun: hangs
Link with PMI and Launch with srun: only get 1 process (from MPI_Comm_size on MPI_COMM_WORLD)
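
The ldd check described above (PMI libraries listed before the MPI library) can be scripted so that it is repeatable for both builds. The following is a minimal sketch in POSIX sh with awk. pmi_before_mpi is a hypothetical helper, and matching on the substrings libpmi and libmpi is an assumption about how the libraries are named on this system.

```sh
#!/bin/sh
# Read `ldd` output on stdin and report whether the first PMI library
# is listed before the first MPI library.
# Exit status: 0 = PMI first, 1 = MPI first, 2 = one of them is missing.
pmi_before_mpi() {
    awk '
        /libpmi/ && !pmi { pmi = NR }   # line number of first PMI library
        /libmpi/ && !mpi { mpi = NR }   # line number of first MPI library
        END {
            if (!pmi) { print "no PMI library found"; exit 2 }
            if (!mpi) { print "no MPI library found"; exit 2 }
            if (pmi < mpi) { print "PMI before MPI"; exit 0 }
            print "MPI before PMI"; exit 1
        }
    '
}

# Usage (test.x is the executable from the job script):
#   ldd test.x | pmi_before_mpi
```

The function only inspects the order of lines in the ldd output; it says nothing about which PMI version the MPI library was configured for.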

For what it’s worth, it looks like the error message

[proxy:0:1 at node168.local] HYDU_sock_connect (utils/sock/sock.c:145): unable to connect from "node168.local" to "node100.local" (Connection refused)
[proxy:0:1 at node168.local] main (pm/pmiserv/pmip.c:183): unable to connect to server node100.local at port 54762 (check for firewalls!)

doesn’t show up immediately. It looks like it appears (if it appears) after the system has been hung up for a while.

Bruce

