[mpich-discuss] MPICH-3.2.1 crashes on MPI_BARRIER call - no idea from anyone so far
Min Si
msi at anl.gov
Tue Jun 19 16:50:57 CDT 2018
Hi Bill,
I just tried the program with MPICH-3.2.1/gfortran/ssh, and it works fine. I
suspect the issue is related to the low-level network connection between your
nodes. We might need more information from you to figure out the problem.
Could you please first check:
- Make sure `mpiexec` is the one installed with MPICH-3.2.1, and not the
old one installed with MPICH-1.2.7 (see the commands below for a quick check)
- Send us the output of `mpichversion`
And try:
mpiexec -f machines -launcher rsh -n 3 hostname
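To check which `mpiexec` is being picked up, something like the following
on node0 should be enough:

which mpiexec
mpichversion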
Min
On 2018/06/19 16:28, Bill Petrick wrote:
> Hi all,
>
> For the past 9 years I've been running MPICH-1.2.7 with the g95
> Fortran compiler. I just upgraded to MPICH-3.2.1 with the gfortran
> compiler. My first tests running on 2 nodes were fine, but when I ran
> the program on 3 or more nodes it crashed at the first MPI_BARRIER call.
> As a test I ran the following "helloworld" program. It runs on all
> nodes just fine until I add the MPI_BARRIER statement. With the
> MPI_BARRIER statement it still runs fine on 2 nodes but crashes when
> run on 3 or more nodes. I execute it with the following command:
>
> mpiexec -f machines -launcher rsh -n 2 ./helloworld
>
> This program runs fine on all nodes without the MPI_BARRIER call but
> crashes on more than two nodes with the BARRIER call.
> ____________________________________________________
> program main
> use mpi
> integer ( kind = 4 ) error
> integer ( kind = 4 ) id
> integer ( kind = 4 ) p
> real ( kind = 8 ) wtime
> ! Initialize MPI.
> call MPI_Init ( error )
> ! Get the number of processes.
> call MPI_Comm_size ( MPI_COMM_WORLD, p, error )
> ! Get the individual process ID.
> call MPI_Comm_rank ( MPI_COMM_WORLD, id, error )
> call MPI_BARRIER(MPI_COMM_WORLD,error)
> !
> ! Print a message.
> !
> .................etc__________________________________________
>
> However, when I run it on 3 or more nodes it crashes at MPI_BARRIER
> with the following error message:
> __________________________________________________________________________
>
> iic at node0:/clusterfiles$ mpiexec -f machines -launcher rsh -n 3
> ./helloworld
> Fatal error in PMPI_Barrier: Unknown error class, error stack:
> PMPI_Barrier(425)......: MPI_Barrier(MPI_COMM_WORLD) failed
> MPIR_Barrier_impl(332).: Failure during collective
> MPIR_Barrier_impl(327).:
> MPIR_Barrier(292)......:
> MPIR_Barrier_intra(180): Failure during collective
> Fatal error in PMPI_Barrier: Unknown error class, error stack:
> PMPI_Barrier(425)...............: MPI_Barrier(MPI_COMM_WORLD) failed
> MPIR_Barrier_impl(332)..........: Failure during collective
> MPIR_Barrier_impl(327)..........:
> MPIR_Barrier(292)...............:
> MPIR_Barrier_intra(169).........:
> MPID_nem_tcp_connpoll(1845).....: Communication error with rank 2:
> Connection refused
> MPIR_Barrier_intra(169).........:
> MPIDI_CH3U_Recvq_FDU_or_AEP(629): Communication error with rank 2
> iic at node0:/clusterfiles$
> ______________________________________________________
>
>
> /etc/hosts is consistent across all nodes.
> /usr/bin/rsh exists on all nodes.
> The "machines" file is the same on all nodes.
>
> Anyone have any ideas about what could be wrong? Thanks, Bill
>
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss