[mpich-discuss] FW: Potential MPICH problem

Richard Warren Richard.Warren at hdfgroup.org
Fri Dec 30 07:48:26 CST 2016


Hi Rob,
Many thanks for your note and the heads-up about your vacation.   The vacation situation applies to most of our group as well, so it’s a good time for me to investigate the problem a bit more.   Along with your suggestions, I’m thinking I might write an MPI-only reproducer: I can probably capture a trace of the MPI function calls that happen during the file close operation and package that into a test I can share with you.
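
As a very rough sketch of what I have in mind (illustrative only; the file name, view parameters, and call order below are placeholders, not the actual sequence HDF5 issues during close):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;

    MPI_Init(&argc, &argv);

    /* Stand-ins for the traced calls; the real displacements and
     * datatypes would be replayed from the captured trace. */
    MPI_File_open(MPI_COMM_WORLD, "repro.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);
    MPI_Barrier(MPI_COMM_WORLD);   /* the collective that fails in our trace */
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}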

Thanks again,
Richard

On 12/29/16, 10:24 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:

    
    
    On 12/29/2016 08:28 AM, Richard Warren wrote:
    >
    >
    > Hi All,
    >
    > I’m writing to get some advice and possibly report a bug.   The
    > circumstances are that we are currently updating HDF5 functionality
    > and have run into an issue running a parallel CFD test
    > (benchmark_hdf5) from the CGNS code base
    > <https://github.com/CGNS/CGNS.git>.   I’ve debugged enough to see that
    > the failure occurs during a call to MPI_File_set_view, with the
    > following failure signature:
    
    Interesting.  I can take a look at it but I'll warn you I'm on vacation 
    until 9 January.
    
    Stuff I would do:
    - hook up valgrind or AddressSanitizer to see if anything unexpected
    crops up (your "message queue corrupted" theory).
    - get a gdb backtrace of the failing MPI_File_set_view, including the
    caller's stack (example invocations below).
    - maybe try older/newer MPICH versions, though lots of tests invoke
    MPI_FILE_SET_VIEW without problems, so that's a bit of a long shot.
    
    ==rob
    
    >
    > [brtnfld at jelly] ~/scratch/CGNS/CGNS/src/ptests % mpirun -n 2 benchmark_hdf5
    >
    > Fatal error in PMPI_Barrier: Message truncated, error stack:
    > PMPI_Barrier(430)...................: MPI_Barrier(comm=0x84000006) failed
    > MPIR_Barrier_impl(337)..............: Failure during collective
    > MPIR_Barrier_impl(330)..............:
    > MPIR_Barrier(294)...................:
    > MPIR_Barrier_intra(151).............:
    > barrier_smp_intra(111)..............:
    > MPIR_Bcast_impl(1462)...............:
    > MPIR_Bcast(1486)....................:
    > MPIR_Bcast_intra(1295)..............:
    > MPIR_Bcast_binomial(241)............:
    > MPIC_Recv(352)......................:
    > MPIDI_CH3U_Request_unpack_uebuf(608): Message truncated; 4 bytes received but buffer size is 1
    >
    > [cli_1]: aborting job:
    > Fatal error in PMPI_Barrier: Message truncated, error stack:
    > PMPI_Barrier(430)...................: MPI_Barrier(comm=0x84000006) failed
    > MPIR_Barrier_impl(337)..............: Failure during collective
    > MPIR_Barrier_impl(330)..............:
    > MPIR_Barrier(294)...................:
    > MPIR_Barrier_intra(151).............:
    > barrier_smp_intra(111)..............:
    > MPIR_Bcast_impl(1462)...............:
    > MPIR_Bcast(1486)....................:
    > MPIR_Bcast_intra(1295)..............:
    > MPIR_Bcast_binomial(241)............:
    > MPIC_Recv(352)......................:
    > MPIDI_CH3U_Request_unpack_uebuf(608): Message truncated; 4 bytes received but buffer size is 1
    >
    > benchmark_hdf5: /mnt/hdf/brtnfld/hdf5/trunk/hdf5/src/H5Fint.c:1465: H5F_close: Assertion `f->file_id > 0' failed.
    >
    > Fatal error in PMPI_Allgather: Unknown error class, error stack:
    > PMPI_Allgather(1002)......................: MPI_Allgather(sbuf=0x7ffdfdaf9b10, scount=1, MPI_LONG_LONG_INT, rbuf=0x1d53ed8, rcount=1, MPI_LONG_LONG_INT, comm=0xc4000002) failed
    > MPIR_Allgather_impl(842)..................:
    > MPIR_Allgather(801).......................:
    > MPIR_Allgather_intra(216).................:
    > MPIC_Sendrecv(475)........................:
    > MPIC_Wait(243)............................:
    > MPIDI_CH3i_Progress_wait(239).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
    > MPIDI_CH3I_Progress_handle_sock_event(451):
    > MPIDU_Socki_handle_read(649)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
    >
    > [cli_0]: aborting job:
    > Fatal error in PMPI_Allgather: Unknown error class, error stack:
    > PMPI_Allgather(1002)......................: MPI_Allgather(sbuf=0x7ffdfdaf9b10, scount=1, MPI_LONG_LONG_INT, rbuf=0x1d53ed8, rcount=1, MPI_LONG_LONG_INT, comm=0xc4000002) failed
    > MPIR_Allgather_impl(842)..................:
    > MPIR_Allgather(801).......................:
    > MPIR_Allgather_intra(216).................:
    > MPIC_Sendrecv(475)........................:
    > MPIC_Wait(243)............................:
    > MPIDI_CH3i_Progress_wait(239).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
    > MPIDI_CH3I_Progress_handle_sock_event(451):
    > MPIDU_Socki_handle_read(649)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
    >
    > benchmark_hdf5: /mnt/hdf/brtnfld/hdf5/trunk/hdf5/src/H5Fint.c:1465: H5F_close: Assertion `f->file_id > 0' failed.
    >
    > Please note that the trace above is from the original run, which
    > appears to use the sockets channel; I’ve since reproduced the same
    > problem on an SMP using shared memory.
    >
    >
    >
    > While it’s not definitive that the issue lies in MPICH itself, the
    > very same benchmark runs perfectly well when PHDF5 is built with
    > OpenMPI.  My own testing used MPICH 3.2 from your download site and
    > OpenMPI 2.0.1 (also their latest download).  Both MPI releases were
    > built from source on my Fedora 25 Linux system using GCC 6.2.1
    > 20160916 (Red Hat 6.2.1-2).
    >
    >
    >
    > Given that the collective calls into MPI_File_set_view appear to be
    > coded correctly, AND that there isn’t much in the way of input
    > parameters that could cause problems (other than incorrect coding), we
    > tend to believe that the internal message queues between processes may
    > somehow be corrupted.  This impression is strengthened by the fact
    > that our recent codebase changes, which don’t touch the actual calls
    > to MPI_File_set_view, appear to have introduced the issue.  Note too
    > that these code paths to MPI_File_set_view have been exercised many
    > times before, and those calls have always succeeded.
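    >
    > For illustration only (the ranks and types below are made up, not
    > taken from our code, and rank is assumed to hold the result of
    > MPI_Comm_rank): a plain count/type mismatch in a collective produces
    > this same truncation signature, 4 bytes arriving at a 1-byte receive
    > buffer:
    >
    >     int  ival = 0;   /* rank 0 broadcasts 4 bytes ...          */
    >     char cval = 0;   /* ... the other rank posts a 1-byte recv */
    >     if (rank == 0)
    >         MPI_Bcast(&ival, 1, MPI_INT,  0, MPI_COMM_WORLD);
    >     else
    >         MPI_Bcast(&cval, 1, MPI_CHAR, 0, MPI_COMM_WORLD);
    >
    > so crossed or mismatched collectives would be another way to arrive at
    > this error besides genuine queue corruption.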
    >
    >
    >
    > Are there any suggestions out there as to how to further debug this
    > potential corruption issue?
    >
    > Many thanks,
    >
    > Richard A. Warren
    >
    

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

