[mpich-discuss] MPICH-3.2.1 crashes on MPI_BARRIER call - no idea from anyone so far

Min Si msi at anl.gov
Tue Jun 19 16:50:57 CDT 2018


Hi Bill,

I just tried the program with MPICH-3.2.1/gfortran/ssh, and it works 
fine. I suspect it is an issue with the low-level network connection. We 
will need more information from you to figure out the problem.

Could you please first check:
- Make sure the `mpiexec` in your PATH is the one installed with 
MPICH-3.2.1, not the old one installed with MPICH-1.2.7 (one way to 
check is sketched below)
- Send us the output of `mpichversion`
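
For example, assuming MPICH-3.2.1 is the only MPICH on your PATH (the 
exact install path will differ on your system), something like this 
should confirm which launcher and library you are actually running:

which mpiexec
mpichversion

`which mpiexec` should point into the MPICH-3.2.1 installation, and 
`mpichversion` should report version 3.2.1 rather than 1.2.7.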

And try:
mpiexec -f machines -launcher rsh -n 3 hostname
This runs no MPI code, so it tells us whether the rsh launcher itself 
can reach all three nodes.

Min

On 2018/06/19 16:28, Bill Petrick wrote:
> Hi all,
>
> For the past 9 years I've been running MPICH-1.2.7 with the g95 
> Fortran compiler.  I just upgraded to MPICH-3.2.1 with the gfortran 
> compiler.  My first tests running on 2 nodes were fine, but when I run 
> the program on 3 or more nodes it crashes at the first MPI_BARRIER 
> call.  As a test I ran the following "helloworld" program.  It runs on 
> all nodes just fine until I add the MPI_BARRIER statement.  With the 
> MPI_BARRIER statement it also runs fine on 2 nodes but crashes when 
> run on 3 or more nodes.  I execute it with the following command:
>
> mpiexec -f machines -launcher rsh -n 2 ./helloworld
>
> This program runs fine on all nodes without the MPI_BARRIER call but 
> crashes on more than two nodes with the BARRIER call.
> ____________________________________________________
> program main
>   use mpi
>   integer ( kind = 4 ) error
>   integer ( kind = 4 ) id
>   integer ( kind = 4 ) p
>   real ( kind = 8 ) wtime
> !  Initialize MPI.
>   call MPI_Init ( error )
> !  Get the number of processes.
>   call MPI_Comm_size ( MPI_COMM_WORLD, p, error )
> !  Get the individual process ID.
>   call MPI_Comm_rank ( MPI_COMM_WORLD, id, error )
>   call MPI_BARRIER(MPI_COMM_WORLD,error)
> !
> !  Print a message.
> !
> .................etc__________________________________________
>
> However, when I run it on 3 or more nodes it crashes at MPI_BARRIER 
> with the following error message:
> __________________________________________________________________________ 
>
> iic at node0:/clusterfiles$ mpiexec -f machines -launcher rsh -n 3 
> ./helloworld
> Fatal error in PMPI_Barrier: Unknown error class, error stack:
> PMPI_Barrier(425)......: MPI_Barrier(MPI_COMM_WORLD) failed
> MPIR_Barrier_impl(332).: Failure during collective
> MPIR_Barrier_impl(327).:
> MPIR_Barrier(292)......:
> MPIR_Barrier_intra(180): Failure during collective
> Fatal error in PMPI_Barrier: Unknown error class, error stack:
> PMPI_Barrier(425)...............: MPI_Barrier(MPI_COMM_WORLD) failed
> MPIR_Barrier_impl(332)..........: Failure during collective
> MPIR_Barrier_impl(327)..........:
> MPIR_Barrier(292)...............:
> MPIR_Barrier_intra(169).........:
> MPID_nem_tcp_connpoll(1845).....: Communication error with rank 2: 
> Connection refused
> MPIR_Barrier_intra(169).........:
> MPIDI_CH3U_Recvq_FDU_or_AEP(629): Communication error with rank 2
> iic at node0:/clusterfiles$ 
> ______________________________________________________
>
>
> /etc/hosts is consistent across all nodes.
> /usr/bin/rsh exists on all nodes.
> The "machines" file is the same on all nodes.
>
> Anyone have any ideas about what could be wrong? Thanks, Bill
>

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

