[mpich-discuss] Running issue on CENTOS Cluster with SLURM

Yaser Afshar ya.afshar at gmail.com
Mon Apr 20 09:03:02 CDT 2020


- AMD cluster (AMD Opteron(tm) Processor 6344) with InfiniPath_QLE7240
- CentOS Linux release 7.7.1908 (Core)

- SLURM

MPICH version:

MPICH configuration:
./configure --prefix=/home/yaser/bin/mpich/3.3.2/intel CC=icc CXX=icpc
FC=ifort --with-hwloc=/opt/HWLOC/2.2.0 --with-ucx=/opt/UCX/1.8.0
--with-knem=/opt/KNEM/1.1.3 --with-device=ch4:ucx --enable-mpi-cxx
--enable-mpi1-compatibility --enable-threads=multiple --with-pmi --with-pm

I am testing using the OSU microbenchmark (

Running the job with `srun` succeeds with no problem:
srun --mpi=pmi2 ./osu_bw

When I use `mpiexec` or `mpirun`, it fails with:
[proxy:0:1@pd-compute-3-40.local] HYDU_sock_connect (utils/sock/sock.c:145): unable to connect from "pd-compute-3-40.local" to "pd-compute-1-6.local" (Connection refused)
[proxy:0:1@pd-compute-3-40.local] main (pm/pmiserv/pmip.c:183): unable to connect to server pd-compute-1-6.local at port 42216 (check for firewalls!)
srun: error: pd-compute-3-40: task 1: Exited with exit code 5

The firewall is off, so that is not the cause:

> systemctl status firewalld
● firewalld.service
   Loaded: masked (/dev/null; bad)
   Active: inactive (dead)
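
A further check that might narrow this down is to probe, from the failing node, the host and port named in the error. The sketch below is only an illustration: `parse_hydra_target` and `can_connect` are hypothetical helper names, and it assumes `bash` with `/dev/tcp` support and the coreutils `timeout` command. The port appears to be chosen per run (an assumption), so the probe is only meaningful while something is still listening on the launch node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pull "<host> <port>" out of a Hydra
# "unable to connect to server ... at port ..." message.
# Prints nothing when the line does not match.
parse_hydra_target() {
    printf '%s\n' "$1" |
        sed -n 's/.*unable to connect to server \([^ ]*\) at port \([0-9][0-9]*\).*/\1 \2/p'
}

# Hypothetical helper: succeed (exit 0) if a TCP connection to
# host $1, port $2 can be opened within 3 seconds.
# Assumes bash's /dev/tcp redirection and coreutils `timeout`.
can_connect() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Error text copied from the failing run.
line='main (pm/pmiserv/pmip.c:183): unable to connect to server pd-compute-1-6.local at port 42216 (check for firewalls!)'

target=$(parse_hydra_target "$line")
echo "$target"

# To probe from the compute node (not run here):
#   set -- $target
#   can_connect "$1" "$2" && echo reachable || echo "not reachable"
```

A "Connection refused" reply usually means the remote host answered but nothing accepted the connection on that address and port, whereas a silently dropped packet tends to show up as a timeout; the probe can help tell the two apart.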

I could not find any hint in the MPICH FAQ, nor anything useful elsewhere.
Could you help me resolve this issue?

Many thanks.