[mpich-discuss] MPICH-3.2.1 crashes on MPI_BARRIER call - no idea from anyone so far
Bill Petrick
wpetrick at industrialimaging.com
Tue Jun 19 16:28:35 CDT 2018
Hi all,
For the past 9 years I've been running MPICH-1.2.7 with the g95 Fortran
compiler. I just upgraded to MPICH-3.2.1 with the gfortran compiler. My
first tests on 2 nodes ran fine, but when I run the program on 3 or
more nodes it crashes at the first MPI_BARRIER call. As a test I ran
the "helloworld" program below. Without the MPI_BARRIER statement it runs
fine on all nodes. With the MPI_BARRIER statement it still runs fine on
2 nodes but crashes on 3 or more. I launch it with the following
command:
mpiexec -f machines -launcher rsh -n 2 ./helloworld
This is the test program, with the MPI_BARRIER call included:
____________________________________________________
program main
  use mpi
  integer ( kind = 4 ) error
  integer ( kind = 4 ) id
  integer ( kind = 4 ) p
  real ( kind = 8 ) wtime
  ! Initialize MPI.
  call MPI_Init ( error )
  ! Get the number of processes.
  call MPI_Comm_size ( MPI_COMM_WORLD, p, error )
  ! Get the individual process ID.
  call MPI_Comm_rank ( MPI_COMM_WORLD, id, error )
  call MPI_BARRIER(MPI_COMM_WORLD,error)
  !
  ! Print a message.
  !
.................etc
____________________________________________________
When I run it on 3 or more nodes, it crashes at MPI_BARRIER with the
following error message:
__________________________________________________________________________
iic@node0:/clusterfiles$ mpiexec -f machines -launcher rsh -n 3 ./helloworld
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(425)......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(332).: Failure during collective
MPIR_Barrier_impl(327).:
MPIR_Barrier(292)......:
MPIR_Barrier_intra(180): Failure during collective
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(425)...............: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(332)..........: Failure during collective
MPIR_Barrier_impl(327)..........:
MPIR_Barrier(292)...............:
MPIR_Barrier_intra(169).........:
MPID_nem_tcp_connpoll(1845).....: Communication error with rank 2: Connection refused
MPIR_Barrier_intra(169).........:
MPIDI_CH3U_Recvq_FDU_or_AEP(629): Communication error with rank 2
iic@node0:/clusterfiles$
______________________________________________________
/etc/hosts is consistent across all nodes.
/usr/bin/rsh exists on all nodes.
The "machines" file is the same on all nodes.
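
The three checks above could be scripted roughly as follows. This is only a sketch under several assumptions: the host file has one entry per line, optionally in "host:nprocs" form; it sits at the same absolute path on every node; and passwordless rsh works to each node. The function name check_nodes and the REMOTE variable are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch only. Assumptions: one host per line in the host file (optionally
# "host:nprocs"), the same absolute path to that file on every node, and
# passwordless rsh. REMOTE is a hypothetical hook, e.g. to swap in ssh.
REMOTE="${REMOTE:-rsh}"

# Usage: check_nodes /absolute/path/to/machines
# Prints one line per node: the host name, the cksum of /etc/hosts, the
# cksum of the host file, and whether /usr/bin/rsh is executable there.
check_nodes() {
    hostfile=$1
    # Drop comments, ":nprocs" suffixes, and blank lines.
    sed -e 's/#.*//' -e 's/:.*//' -e '/^[[:space:]]*$/d' "$hostfile" |
    while read -r host rest; do
        # stdin is /dev/null so the remote command cannot consume the host list.
        result=$($REMOTE "$host" "cksum < /etc/hosts; cksum < '$hostfile'; test -x /usr/bin/rsh && echo rsh-ok || echo rsh-missing" < /dev/null 2>&1)
        # Collapse the multi-line reply onto a single line per host.
        printf '%s: %s\n' "$host" "$(printf '%s' "$result" | tr '\n' ' ')"
    done
}
```

Calling check_nodes with the full path of the machines file prints one line per node. If the files really are identical everywhere, the checksum fields should match on every line.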
Does anyone have any idea what could be wrong?
Thanks,
Bill