[mpich-discuss] MPICH-3.2.1 crashes on MPI_BARRIER call

Bill Petrick wpetrick at industrialimaging.com
Wed Jun 13 09:24:35 CDT 2018


Hi all,

For the past 9 years I've been running MPICH-1.2.7 with the g95 Fortran
compiler.  I just upgraded to MPICH-3.2.1 with the gfortran compiler.  My
first tests on 2 nodes ran fine.  When I run the program on 3 or more
nodes, it crashes at the first MPI_BARRIER call.  As a test I ran the
following "helloworld" program.  It runs fine on all nodes until I add
the MPI_BARRIER statement.  With the MPI_BARRIER statement it still runs
fine on 2 nodes but crashes on 3 or more.  I launch it with the
following command:

mpiexec -f machines -launcher rsh -n 2 ./helloworld

This program runs fine on all nodes without the MPI_BARRIER call but 
crashes on more than two nodes with the BARRIER call.
____________________________________________________
program main
   use mpi
   integer ( kind = 4 ) error
   integer ( kind = 4 ) id
   integer ( kind = 4 ) p
   real ( kind = 8 ) wtime
!  Initialize MPI.
   call MPI_Init ( error )
!  Get the number of processes.
   call MPI_Comm_size ( MPI_COMM_WORLD, p, error )
!  Get the individual process ID.
   call MPI_Comm_rank ( MPI_COMM_WORLD, id, error )
   call MPI_BARRIER(MPI_COMM_WORLD,error)
!
!  Print a message.
!
.................etc
____________________________________________________

However, when I run it on 3 or more nodes, it crashes at MPI_BARRIER with
the following error message:
__________________________________________________________________________
iic@node0:/clusterfiles$ mpiexec -f machines -launcher rsh -n 3 ./helloworld
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(425)......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(332).: Failure during collective
MPIR_Barrier_impl(327).:
MPIR_Barrier(292)......:
MPIR_Barrier_intra(180): Failure during collective
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(425)...............: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(332)..........: Failure during collective
MPIR_Barrier_impl(327)..........:
MPIR_Barrier(292)...............:
MPIR_Barrier_intra(169).........:
MPID_nem_tcp_connpoll(1845).....: Communication error with rank 2: Connection refused
MPIR_Barrier_intra(169).........:
MPIDI_CH3U_Recvq_FDU_or_AEP(629): Communication error with rank 2
iic@node0:/clusterfiles$
______________________________________________________


/etc/hosts is consistent across all nodes.
/usr/bin/rsh exists on all nodes.
The "machines" file is the same on all nodes.
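
A "Connection refused" between ranks can occur even with identical
/etc/hosts files, for example when a node's own hostname resolves to a
loopback address (127.x.x.x), so that other nodes cannot reach it by the
address it advertises. This is only one possible cause and is not
confirmed for this cluster. The following is a minimal sketch for
checking it on each node, assuming getent and awk are available; the
function names (is_loopback, classify, check_self) are hypothetical
helpers, not part of MPICH:

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not part of MPICH or Hydra.

# is_loopback ADDR: succeed if ADDR is an IPv4 loopback address (127.0.0.0/8).
is_loopback() {
    case "$1" in
        127.*) return 0 ;;
        *)     return 1 ;;
    esac
}

# classify ADDR NAME: print a one-line verdict for a name/address pair.
classify() {
    addr=$1
    name=$2
    if [ -z "$addr" ]; then
        echo "WARN: $name has no address in the hosts database"
    elif is_loopback "$addr"; then
        echo "WARN: $name resolves to loopback address $addr"
    else
        echo "OK: $name resolves to $addr"
    fi
}

# check_self: look up this node's own hostname and classify the result.
check_self() {
    name=$(hostname)
    addr=$(getent hosts "$name" | awk '{print $1; exit}')
    classify "$addr" "$name"
}
```

Calling check_self on every node listed in the "machines" file (for
instance through rsh) would show whether any of them reports a WARN
line. An OK result on all nodes would point away from name resolution
and toward something else, such as a firewall blocking the TCP ports
that the ranks use to connect to each other.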

Anyone have any ideas about what could be wrong? Thanks, Bill
