[mpich-discuss] Communication issue with simple test?

Si, Min msi at anl.gov
Tue Nov 13 10:23:03 CST 2018


Hi Thibaut,

In order to help us isolate this problem, could you please try the following options:

- Use MPICH 3.3rc instead of MPICH 3.3b1
- Use ch3:tcp instead of ch3:sock (tcp is used by default, so please just delete `--with-device=ch3:sock` when you configure)
- Try `mpiexec -f path_to_hostfile -n 8 hostname`

Best regards,
Min
On 2018/11/13 7:30, Appel, Thibaut via discuss wrote:
Dear MPICH users,

I'm having an issue with what appears to be communication between nodes of the local cluster we're using. For context, the cluster consists of 8 nodes and was set up with now-outdated versions of the Intel compilers and MPI libraries. I ran mpdboot/mpdringtest and it seems to work fine.

Now, I would like to use my application code with MPICH 3.3 installed via PETSc and with gcc-8 installed via Linuxbrew on the different nodes.

I tested a simple Fortran program:

program test

  use mpi
  use ISO_fortran_env, only: output_unit

  implicit none

  integer :: irank, nproc, ierr
  character(len=80) :: hostname

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,nproc,ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,irank,ierr)

  if (irank == 0) THEN
    WRITE(output_unit,'(1X,A,I3,A)') 'Started test program with', nproc, ' MPI processes'
  end if

  call MPI_BARRIER(MPI_COMM_WORLD,ierr)

  call HOSTNM(hostname,ierr)
  WRITE(output_unit,'(1X,A,I3,A)') 'I am processor #', irank, ' running on '//hostname

  call MPI_FINALIZE(ierr)

end program test
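
(HOSTNM is a gfortran extension, by the way; the portable alternative would be MPI_GET_PROCESSOR_NAME, roughly as below, though I don't expect that to matter for the problem that follows:

  ! these would replace the hostname declaration and the HOSTNM call above
  character(len=MPI_MAX_PROCESSOR_NAME) :: hostname
  integer :: namelen

  call MPI_GET_PROCESSOR_NAME(hostname, namelen, ierr)

)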

It works fine on the local host, but when I try to launch it across all the nodes with "mpiexec -f path_to_hostfile -n 8 path_to_my_program" I get:

 Started test program with  8 MPI processes
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(287)...........................: MPI_Barrier(MPI_COMM_WORLD) failed
PMPI_Barrier(273)...........................:
MPIR_Barrier_impl(173)......................:
MPIR_Barrier_intra_auto(108)................:
MPIR_Barrier_intra_recursive_doubling(47)...:
MPIC_Sendrecv(347)..........................:
MPIC_Wait(73)...............................:
MPIDI_CH3i_Progress_wait(242)...............: an error occurred while handling an event returned by MPIDI_CH3I_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(698)..:
MPIDI_CH3_Sockconn_handle_connect_event(597): [ch3:sock] failed to connnect to remote process
MPIDI_CH3I_Sock_post_connect_ifaddr(1774)...: unexpected operating system error (set=0,sock=5,errno=101:Network is unreachable)

Note that when I comment out the call to MPI_BARRIER, it also works fine, so communication between nodes seems to be the issue. All the nodes see and have access to the same mpiexec/mpifort executables: I checked $PATH and the output of 'which mpiexec'/'which mpifort'.
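
To narrow down which pair of nodes fails, one thing I can try is a point-to-point ping where rank 0 exchanges an integer with every other rank in turn, something like this (just a sketch in the same spirit as the test above):

program pingtest

  use mpi

  implicit none

  integer :: irank, nproc, ierr, i, buf
  integer :: status(MPI_STATUS_SIZE)

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,nproc,ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,irank,ierr)

  if (irank == 0) then
    ! Ping each rank in turn: the first pair that hangs or errors out
    ! right after the last successful print is the suspect connection.
    do i = 1, nproc-1
      call MPI_SEND(i,   1, MPI_INTEGER, i, 0, MPI_COMM_WORLD, ierr)
      call MPI_RECV(buf, 1, MPI_INTEGER, i, 1, MPI_COMM_WORLD, status, ierr)
      write(*,'(1X,A,I3,A)') 'rank 0 <-> rank', i, ' OK'
    end do
  else
    ! Echo the integer back to rank 0.
    call MPI_RECV(buf, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
    call MPI_SEND(buf, 1, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, ierr)
  end if

  call MPI_FINALIZE(ierr)

end program pingtest

Run with the hostfile over two nodes at a time, that should at least tell me whether every inter-node link is affected or only some of them.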

Beyond that, could you suggest an efficient way to diagnose what might be going wrong?

mpiexec --version gives

HYDRA build details:
    Version:                                 3.3b1
    Release Date:                            Mon Feb  5 10:16:15 CST 2018
    CC:                              gcc-8  -fPIC -fstack-protector -O3 -march=native
    CXX:                             g++-8  -fstack-protector -O3 -march=native -fPIC
    F77:                             gfortran-8 -fPIC -ffree-line-length-0 -O3 -march=native
    F90:                             gfortran-8 -fPIC -ffree-line-length-0 -O3 -march=native
    Configure options:                       '--disable-option-checking' '--prefix=/home/petsc/icm_cplx' 'MAKE=/usr/bin/make' '--libdir=/home/petsc/icm_cplx/lib' 'CC=gcc-8' 'CFLAGS=-fPIC -fstack-protector -O3 -march=native -O2' 'AR=/usr/bin/ar' 'ARFLAGS=cr' 'CXX=g++-8' 'CXXFLAGS=-fstack-protector -O3 -march=native -fPIC -O2' 'F77=gfortran-8' 'FFLAGS=-fPIC -ffree-line-length-0 -O3 -march=native -O2' 'FC=gfortran-8' 'FCFLAGS=-fPIC -ffree-line-length-0 -O3 -march=native -O2' '--enable-shared' '--with-device=ch3:sock' '--with-pm=hydra' '--enable-g=meminit' '--cache-file=/dev/null' '--srcdir=.' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -I/home/petsc/icm_cplx/externalpackages/mpich-3.3b1/src/mpl/include -I/home/petsc/icm_cplx/externalpackages/mpich-3.3b1/src/mpl/include -I/home/petsc/icm_cplx/externalpackages/mpich-3.3b1/src/openpa/src -I/home/petsc/icm_cplx/externalpackages/mpich-3.3b1/src/openpa/src -D_REENTRANT -I/home/petsc/icm_cplx/externalpackages/mpich-3.3b1/src/mpi/romio/include' 'MPLLIBNAME=mpl'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available:
    Demux engines available:                 poll select


Thank you,

Thibaut



_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss



