[mpich-discuss] Communication issue with simple test?
Appel, Thibaut
t.appel17 at imperial.ac.uk
Tue Nov 13 07:30:10 CST 2018
Dear MPICH users,
I'm having an issue with what appears to be communication between nodes of the local cluster we're using. For context, the cluster consists of 8 nodes and was set up with now-outdated versions of the Intel compilers and MPI libraries. I ran mpdboot/mpdringtest and it seems to work fine.
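Roughly, the MPD test I ran looked like this (the hostfile name and loop count here are just illustrative):

    mpdboot -n 8 -f ~/mpd.hosts   # start an MPD ring across the 8 nodes
    mpdtrace                      # list the hosts that joined the ring
    mpdringtest 100               # send a message around the ring 100 times
    mpdallexit                    # shut the ring down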
Now, I would like to use my application code with MPICH 3.3 installed from PETSc and with gcc-8 installed from linuxbrew on the different nodes.
I tested a simple Fortran program:
program test
   use mpi
   use ISO_fortran_env, only: output_unit
   implicit none
   integer :: irank, nproc, ierr
   character(len=80) :: hostname
   call MPI_INIT(ierr)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, irank, ierr)
   if (irank == 0) then
      write(output_unit,'(1X,A,I3,A)') 'Started test program with', nproc, ' MPI processes'
   end if
   call MPI_BARRIER(MPI_COMM_WORLD, ierr)
   call HOSTNM(hostname, ierr)
   write(output_unit,'(1X,A,I3,A)') 'I am processor #', irank, ' running on '//hostname
   call MPI_FINALIZE(ierr)
end program test
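For reference, I built and ran it roughly as follows, using the MPICH 3.3 wrappers installed by PETSc (the source and executable names are just placeholders):

    mpifort -o mpi_test test.f90   # gfortran-8 wrapper; HOSTNM is a GNU extension it accepts
    mpiexec -n 8 ./mpi_test        # single-node run on the local host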
It works fine on the local host, but when I try to launch it on all the nodes with "mpiexec -f path_to_hostfile -n 8 path_to_my_program", I get:
Started test program with 8 MPI processes
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(287)...........................: MPI_Barrier(MPI_COMM_WORLD) failed
PMPI_Barrier(273)...........................:
MPIR_Barrier_impl(173)......................:
MPIR_Barrier_intra_auto(108)................:
MPIR_Barrier_intra_recursive_doubling(47)...:
MPIC_Sendrecv(347)..........................:
MPIC_Wait(73)...............................:
MPIDI_CH3i_Progress_wait(242)...............: an error occurred while handling an event returned by MPIDI_CH3I_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(698)..:
MPIDI_CH3_Sockconn_handle_connect_event(597): [ch3:sock] failed to connnect to remote process
MPIDI_CH3I_Sock_post_connect_ifaddr(1774)...: unexpected operating system error (set=0,sock=5,errno=101:Network is unreachable)
Note that when I comment out the call to MPI_BARRIER, it also works fine across the nodes, so communication between nodes seems to be the issue. All the nodes see and have access to the same mpiexec/mpifort executables: I checked $PATH and the output of 'which mpiexec'/'which mpifort' on each node, roughly as shown below.
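The check I did was essentially the following (the hostnames are placeholders for our actual node names):

    for host in node1 node2 node3 node4 node5 node6 node7 node8; do
        ssh "$host" 'echo $PATH; which mpiexec; which mpifort'
    done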
Could you suggest ways to efficiently diagnose what might be wrong?
mpiexec --version gives
HYDRA build details:
Version: 3.3b1
Release Date: Mon Feb 5 10:16:15 CST 2018
CC: gcc-8 -fPIC -fstack-protector -O3 -march=native
CXX: g++-8 -fstack-protector -O3 -march=native -fPIC
F77: gfortran-8 -fPIC -ffree-line-length-0 -O3 -march=native
F90: gfortran-8 -fPIC -ffree-line-length-0 -O3 -march=native
Configure options: '--disable-option-checking' '--prefix=/home/petsc/icm_cplx' 'MAKE=/usr/bin/make' '--libdir=/home/petsc/icm_cplx/lib' 'CC=gcc-8' 'CFLAGS=-fPIC -fstack-protector -O3 -march=native -O2' 'AR=/usr/bin/ar' 'ARFLAGS=cr' 'CXX=g++-8' 'CXXFLAGS=-fstack-protector -O3 -march=native -fPIC -O2' 'F77=gfortran-8' 'FFLAGS=-fPIC -ffree-line-length-0 -O3 -march=native -O2' 'FC=gfortran-8' 'FCFLAGS=-fPIC -ffree-line-length-0 -O3 -march=native -O2' '--enable-shared' '--with-device=ch3:sock' '--with-pm=hydra' '--enable-g=meminit' '--cache-file=/dev/null' '--srcdir=.' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -I/home/petsc/icm_cplx/externalpackages/mpich-3.3b1/src/mpl/include -I/home/petsc/icm_cplx/externalpackages/mpich-3.3b1/src/mpl/include -I/home/petsc/icm_cplx/externalpackages/mpich-3.3b1/src/openpa/src -I/home/petsc/icm_cplx/externalpackages/mpich-3.3b1/src/openpa/src -D_REENTRANT -I/home/petsc/icm_cplx/externalpackages/mpich-3.3b1/src/mpi/romio/include' 'MPLLIBNAME=mpl'
Process Manager: pmi
Launchers available: ssh rsh fork slurm ll lsf sge manual persist
Topology libraries available: hwloc
Resource management kernels available: user slurm ll lsf sge pbs cobalt
Checkpointing libraries available:
Demux engines available: poll select
Thank you,
Thibaut