[mpich-discuss] Running issue on CENTOS Cluster with SLURM

Yaser Afshar ya.afshar at gmail.com
Mon Apr 20 09:03:02 CDT 2020


Hi,

System:
- AMD cluster (AMD Opteron(tm) Processor 6344) with InfiniPath_QLE7240
- CentOS Linux release 7.7.1908 (Core)

Scheduler:
- slurm

MPICH version:
3.3.2

MPICH configuration:
./configure --prefix=/home/yaser/bin/mpich/3.3.2/intel \
    CC=icc CXX=icpc FC=ifort \
    --with-hwloc=/opt/HWLOC/2.2.0 --with-ucx=/opt/UCX/1.8.0 \
    --with-knem=/opt/KNEM/1.1.3 --with-device=ch4:ucx --enable-mpi-cxx \
    --enable-mpi1-compatibility --enable-threads=multiple --with-pmi --with-pm

I am testing with the OSU micro-benchmarks
(http://mvapich.cse.ohio-state.edu/benchmarks).

Running the job with `srun` succeeds without any problem:
```
srun --mpi=pmi2 ./osu_bw
```

When I use `mpiexec` or `mpirun`, it fails with:
```
[proxy:0:1@pd-compute-3-40.local] HYDU_sock_connect (utils/sock/sock.c:145): unable to connect from "pd-compute-3-40.local" to "pd-compute-1-6.local" (Connection refused)
[proxy:0:1@pd-compute-3-40.local] main (pm/pmiserv/pmip.c:183): unable to connect to server pd-compute-1-6.local at port 42216 (check for firewalls!)
srun: error: pd-compute-3-40: task 1: Exited with exit code 5
```

The firewall is off, so that should not be the cause:

```
> systemctl status firewalld
● firewalld.service
   Loaded: masked (/dev/null; bad)
   Active: inactive (dead)
```
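
Since `firewalld` being inactive does not by itself prove that one compute node can reach another over TCP, a direct connection test is another way to check. Below is a minimal sketch, assuming bash (for its `/dev/tcp` redirection) and the coreutils `timeout` command are available on the nodes. The helper name `check_port` is hypothetical and is not part of MPICH or SLURM.

```shell
#!/usr/bin/env bash
# check_port: hypothetical helper, not part of MPICH or SLURM.
# Prints "open" if a TCP connection to host:port succeeds within
# 3 seconds, and "closed" otherwise (refused, unreachable, or timed out).
check_port() {
    local host="$1" port="$2"
    # bash opens a TCP socket when a redirection targets /dev/tcp/<host>/<port>
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}
```

It could be run from pd-compute-3-40 as `check_port pd-compute-1-6.local 42216`, using the host and port from the error message above. The port in that message appears to be chosen per launch, so the test only means something while a process is listening on the target port; a "closed" result after the job has already exited is expected.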

I could not find any hint in the MPICH FAQ, or anything useful elsewhere.
Could you help me resolve this issue?

Many thanks.