[mpich-discuss] FW: Potential MPICH problem

Richard Warren Richard.Warren at hdfgroup.org
Wed Jan 11 10:27:42 CST 2017


Hi Rob,
I did put together an initial simplified test case to try to reproduce the issue, but that code does NOT fail!  I've since focused my efforts not so much on the file-closing operations as on the other collective operations that take place PRIOR to the close.  I suspect that some operation(s) preceding the file close get things out of sync, and that is why we observe the failure "down-stream".  At the moment I'm tracing these other collective operations to get a better understanding of what's happening.  The code we're testing has not been released yet, so I'm not convinced you could reproduce the problem directly from the currently available HDF5 and CGNS downloads.  I could send you the actual test, since it links against the static libraries (.a) for both HDF5 and CGNS.  Would that be of interest to you?
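
To illustrate the kind of out-of-sync collective I have in mind, here is a minimal, self-contained sketch (a hypothetical example, not our actual CGNS/HDF5 code) in which a collective skipped on one rank only surfaces later, inside MPI_Barrier, as a "Message truncated" error very much like the one in the trace below:

/* mismatch.c -- hypothetical sketch, NOT the CGNS/HDF5 code:
 * rank 0 issues an MPI_Bcast that rank 1 never matches, then both
 * ranks enter MPI_Barrier.  The stray 4-byte broadcast payload can be
 * matched against a small internal message of a later collective,
 * so the error is reported far from the real bug (it may also hang,
 * depending on the implementation and timing). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)                        /* collective skipped on rank 1 */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);          /* failure is reported here */

    printf("rank %d done\n", rank);
    MPI_Finalize();
    return 0;
}

Run with mpirun -n 2, rank 0's unmatched broadcast payload can be picked up by a later collective's internal receive; whether the symptom is a truncation error, a hang, or a failure in yet another collective depends on the MPI implementation and on timing.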
Thanks,
Richard


On 1/11/17, 11:02 AM, "Rob Latham" <robl at mcs.anl.gov> wrote:

    
    
    On 12/29/2016 08:28 AM, Richard Warren wrote:
    >
    >
    > Hi All,
    >
    > I’m writing to get some advice and possibly report a bug.   The
    > circumstances are that we are currently working on updating HDF5
    > functionality and have run into an issue running a parallel test of a
    > CFD code (benchmark.hdf) from the CGNS code base
    > <https://github.com/CGNS/CGNS.git>.   I’ve debugged enough to see that
    > our failure occurs during a call to MPI_File_set_view, with the failure
    > signature as follows:
    
    I'm having a heck of a time building CGNS.  When I make CGNS,
    cgnsconvert fails to find the HDF5 symbols despite my telling
    CMake where to find libhdf5.
    
    Did you get that test case you mentioned?
    
    
    ==rob
    
    >
    >
    >
    > [brtnfld at jelly] ~/scratch/CGNS/CGNS/src/ptests % mpirun -n 2 benchmark_hdf5
    >
    > Fatal error in PMPI_Barrier: Message truncated, error stack:
    > PMPI_Barrier(430)...................: MPI_Barrier(comm=0x84000006) failed
    > MPIR_Barrier_impl(337)..............: Failure during collective
    > MPIR_Barrier_impl(330)..............:
    > MPIR_Barrier(294)...................:
    > MPIR_Barrier_intra(151).............:
    > barrier_smp_intra(111)..............:
    > MPIR_Bcast_impl(1462)...............:
    > MPIR_Bcast(1486)....................:
    > MPIR_Bcast_intra(1295)..............:
    > MPIR_Bcast_binomial(241)............:
    > MPIC_Recv(352)......................:
    > MPIDI_CH3U_Request_unpack_uebuf(608): Message truncated; 4 bytes received but buffer size is 1
    >
    > [cli_1]: aborting job:
    > Fatal error in PMPI_Barrier: Message truncated, error stack:
    > PMPI_Barrier(430)...................: MPI_Barrier(comm=0x84000006) failed
    > MPIR_Barrier_impl(337)..............: Failure during collective
    > MPIR_Barrier_impl(330)..............:
    > MPIR_Barrier(294)...................:
    > MPIR_Barrier_intra(151).............:
    > barrier_smp_intra(111)..............:
    > MPIR_Bcast_impl(1462)...............:
    > MPIR_Bcast(1486)....................:
    > MPIR_Bcast_intra(1295)..............:
    > MPIR_Bcast_binomial(241)............:
    > MPIC_Recv(352)......................:
    > MPIDI_CH3U_Request_unpack_uebuf(608): Message truncated; 4 bytes received but buffer size is 1
    >
    > benchmark_hdf5: /mnt/hdf/brtnfld/hdf5/trunk/hdf5/src/H5Fint.c:1465: H5F_close: Assertion `f->file_id > 0' failed.
    >
    > Fatal error in PMPI_Allgather: Unknown error class, error stack:
    > PMPI_Allgather(1002)......................: MPI_Allgather(sbuf=0x7ffdfdaf9b10, scount=1, MPI_LONG_LONG_INT, rbuf=0x1d53ed8, rcount=1, MPI_LONG_LONG_INT, comm=0xc4000002) failed
    > MPIR_Allgather_impl(842)..................:
    > MPIR_Allgather(801).......................:
    > MPIR_Allgather_intra(216).................:
    > MPIC_Sendrecv(475)........................:
    > MPIC_Wait(243)............................:
    > MPIDI_CH3i_Progress_wait(239).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
    > MPIDI_CH3I_Progress_handle_sock_event(451):
    > MPIDU_Socki_handle_read(649)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
    >
    > [cli_0]: aborting job:
    > Fatal error in PMPI_Allgather: Unknown error class, error stack:
    > PMPI_Allgather(1002)......................: MPI_Allgather(sbuf=0x7ffdfdaf9b10, scount=1, MPI_LONG_LONG_INT, rbuf=0x1d53ed8, rcount=1, MPI_LONG_LONG_INT, comm=0xc4000002) failed
    > MPIR_Allgather_impl(842)..................:
    > MPIR_Allgather(801).......................:
    > MPIR_Allgather_intra(216).................:
    > MPIC_Sendrecv(475)........................:
    > MPIC_Wait(243)............................:
    > MPIDI_CH3i_Progress_wait(239).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
    > MPIDI_CH3I_Progress_handle_sock_event(451):
    > MPIDU_Socki_handle_read(649)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
    >
    > benchmark_hdf5: /mnt/hdf/brtnfld/hdf5/trunk/hdf5/src/H5Fint.c:1465: H5F_close: Assertion `f->file_id > 0' failed.
    >
    >
    >
    > Please note that the above trace is the original stack trace, which
    > appears to use sockets; I've since reproduced the same problem
    > running on an SMP with shared memory.
    >
    >
    >
    > While it’s not definitive that the issue has anything to do with the
    > above stack trace, the very same benchmark runs perfectly well utilizing
    > PHDF5 built with OpenMPI.  My own testing is with MPICH version 3.2
    > available from your download site and with OpenMPI 2.0.1 (also their
    > latest download). Both MPI releases were built from source on my Fedora
    > 25 Linux distribution using GCC 6.2.1 20160916 (Red Hat 6.2.1-2).
    >
    >
    >
    > Given that the synchronous calls into MPI_File_set_view appear to be
    > coded correctly AND that there isn’t much in the way of input parameters
    > that would cause problems (other than incorrect coding), we tend to
    > believe that the internal message queues between processes may somehow
    > be corrupted.  This impression is strengthened by the fact that our
    > recent codebase changes (which are unrelated to the actual calls to
    > MPI_File_set_view) may have introduced this issue.  Note, too, that the
    > code paths to MPI_File_set_view have been taken many times previously
    > and those function calls have all succeeded.
    >
    >
    >
    > Are there any suggestions out there as to how to further debug this
    > potential corruption issue?
    >
    > Many thanks,
    >
    > Richard A. Warren
    >
    >
    >
    >
    >
    >
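
For reference, MPI_File_set_view is collective over the communicator the file was opened on: every rank must make the call, and the datarep string and the extent of etype must be identical across ranks, while disp, filetype, and info may legitimately differ per rank. A minimal sketch of a matched call (a generic example, not the HDF5/CGNS code under test):

/* setview.c -- generic sketch of a matched, collective MPI_File_set_view
 * call (not the HDF5/CGNS code under discussion).  Every rank participates;
 * disp and filetype may differ per rank, datarep and etype must match. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File     fh;
    MPI_Datatype filetype;
    int          rank, count = 1024;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "view.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* each rank views a contiguous block of 'count' ints at its own offset */
    MPI_Type_contiguous(count, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_set_view(fh, (MPI_Offset)rank * count * sizeof(int),
                      MPI_INT, filetype, "native", MPI_INFO_NULL);

    /* ... collective or independent I/O on the view would go here ... */

    MPI_Type_free(&filetype);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

Under the hypothesis in the reply at the top of this thread, calls like this are fine on their own; they only appear to misbehave once an earlier collective has already knocked the ranks out of step.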
    

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

