[mpich-discuss] MPICH-3.2.1 crashes on MPI_BARRIER call

Congiu, Giuseppe gcongiu at anl.gov
Tue Oct 9 16:27:19 CDT 2018


Hello Bill,

This looks like a network connectivity problem: the error stack reports "Connection refused" when communicating with rank 2. Please make sure that every machine can connect to every other machine and that no firewall is blocking the connections.
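
As a rough way to test the "every machine to every other machine" part, something like the following could be run from the node where you launch mpiexec. It is only a sketch under a few assumptions: the function names list_hosts and check_pairs and the REMOTE_SH variable are made up for illustration, the host file is assumed to hold one host per line (optionally "host:nprocs", with "#" comments), and ping is assumed to be available on every node. The remote side prints a marker word because classic rsh does not pass back the exit status of the remote command.

```sh
#!/bin/sh
# Hypothetical helper (names are illustrative, not part of MPICH):
# checks that every host in a host file can reach every other host.

# Remote shell used to run a command on a node. rsh matches the launcher
# used in this thread; ssh can be substituted.
REMOTE_SH=${REMOTE_SH:-rsh}

# Print one host name per line from a host file, dropping comments,
# blank lines, leading whitespace, and any ":<nprocs>" suffix.
list_hosts() {
    sed -e 's/#.*//' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]:].*//' \
        -e '/^$/d' "$1"
}

# For each ordered pair (src, dst), ask src to ping dst once.
# Prints "OK src -> dst" or "FAIL src -> dst"; returns 1 if any pair fails.
check_pairs() {
    hosts=$(list_hosts "$1")
    status=0
    for src in $hosts; do
        for dst in $hosts; do
            [ "$src" = "$dst" ] && continue
            # The remote command echoes a marker only if ping succeeded.
            out=$("$REMOTE_SH" "$src" \
                "ping -c 1 $dst >/dev/null 2>&1 && echo reachable" \
                </dev/null 2>/dev/null)
            if [ "$out" = "reachable" ]; then
                echo "OK   $src -> $dst"
            else
                echo "FAIL $src -> $dst"
                status=1
            fi
        done
    done
    return $status
}
```

With the functions loaded, "check_pairs machines" would walk the same host file you pass to mpiexec. Note that ping only covers name resolution and basic reachability; it does not exercise the TCP connections MPICH opens between ranks, so a firewall can still be the culprit even when every pair reports OK.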

Best,
Giuseppe

> On Jun 13, 2018, at 9:24 AM, Bill Petrick <wpetrick at industrialimaging.com> wrote:
> 
> Hi all,
> 
> For the past 9 years I've been running MPICH-1.2.7 with the g95 Fortran compiler. I just upgraded to MPICH-3.2.1 with the gfortran compiler. My first tests on 2 nodes were fine, but when I run the program on 3 or more nodes it crashes at the first MPI_BARRIER call. As a test I ran the following "helloworld" program. It runs on all nodes just fine until I add the MPI_BARRIER statement. With the MPI_BARRIER statement it still runs fine on 2 nodes but crashes on 3 or more. I execute it with the following command:
> 
> mpiexec -f machines -launcher rsh -n 2 ./helloworld
> 
> This program runs fine on all nodes without the MPI_BARRIER call but crashes on more than two nodes with the BARRIER call.
> ____________________________________________________
> program main
>   use mpi
>   integer ( kind = 4 ) error
>   integer ( kind = 4 ) id
>   integer ( kind = 4 ) p
>   real ( kind = 8 ) wtime
> !  Initialize MPI.
>   call MPI_Init ( error )
> !  Get the number of processes.
>   call MPI_Comm_size ( MPI_COMM_WORLD, p, error )
> !  Get the individual process ID.
>   call MPI_Comm_rank ( MPI_COMM_WORLD, id, error )
>   call MPI_BARRIER(MPI_COMM_WORLD,error)
> !
> !  Print a message.
> !
> .................etc__________________________________________
> 
> However, when I run it on 3 or more nodes it crashes at MPI_BARRIER with the following error message:
> __________________________________________________________________________
> iic@node0:/clusterfiles$ mpiexec -f machines -launcher rsh -n 3 ./helloworld
> Fatal error in PMPI_Barrier: Unknown error class, error stack:
> PMPI_Barrier(425)......: MPI_Barrier(MPI_COMM_WORLD) failed
> MPIR_Barrier_impl(332).: Failure during collective
> MPIR_Barrier_impl(327).:
> MPIR_Barrier(292)......:
> MPIR_Barrier_intra(180): Failure during collective
> Fatal error in PMPI_Barrier: Unknown error class, error stack:
> PMPI_Barrier(425)...............: MPI_Barrier(MPI_COMM_WORLD) failed
> MPIR_Barrier_impl(332)..........: Failure during collective
> MPIR_Barrier_impl(327)..........:
> MPIR_Barrier(292)...............:
> MPIR_Barrier_intra(169).........:
> MPID_nem_tcp_connpoll(1845).....: Communication error with rank 2: Connection refused
> MPIR_Barrier_intra(169).........:
> MPIDI_CH3U_Recvq_FDU_or_AEP(629): Communication error with rank 2
> iic@node0:/clusterfiles$
> ______________________________________________________
> 
> 
> /etc/hosts is consistent across all nodes.
> /usr/bin/rsh exists on all nodes.
> The "machines" file is the same on all nodes.
> 
> Anyone have any ideas about what could be wrong? Thanks, Bill
> 
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
