[mpich-discuss] FW: Potential MPICH problem

Rob Latham robl at mcs.anl.gov
Sat Dec 31 07:47:54 CST 2016



On 12/30/2016 07:48 AM, Richard Warren wrote:
> Hi Rob,
> Many thanks for your note and the heads-up about your vacation.  The same applies to most of our group, so it’s a good time for me to investigate the problem a bit more.  Along with your suggestions, I’m thinking of writing an MPI-only reproducer: I can probably capture a trace of the MPI calls made during the file close operation and package that into a test I can share with you…
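>
> A bare-bones MPI-only skeleton of the kind of reproducer I have in mind (hypothetical file name and view arguments, not the actual traced calls) would be roughly:
>
>     #include <mpi.h>
>
>     int main(int argc, char **argv)
>     {
>         MPI_File fh;
>         MPI_Init(&argc, &argv);
>         /* the idea: replay the traced MPI calls here, in order; the
>          * sequence below is just a placeholder */
>         MPI_File_open(MPI_COMM_WORLD, "repro.dat",
>                       MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
>         MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);
>         MPI_Barrier(MPI_COMM_WORLD);
>         MPI_File_close(&fh);
>         MPI_Finalize();
>         return 0;
>     }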
>

I'd be surprised if it's a problem with MPI_FILE_SET_VIEW itself.  For 
example, if you put an MPI_BARRIER in there right before the call, I bet 
you'd see the failure show up in MPI_BARRIER instead.
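
A quick sketch of what I mean (placeholder arguments; use whatever 
communicator and view arguments the real call gets):

    MPI_Barrier(MPI_COMM_WORLD);  /* or the communicator the file was opened on */
    MPI_File_set_view(fh, disp, etype, filetype, "native", MPI_INFO_NULL);

If the barrier dies with the same "Message truncated" error, the mismatch 
happened before you ever reached the set_view call.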

==rob

> Thanks again,
> Richard
>
> On 12/29/16, 10:24 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>
>
>
>     On 12/29/2016 08:28 AM, Richard Warren wrote:
>     >
>     >
>     > Hi All,
>     >
>     > I’m writing to get some advice and possibly report a bug.  We are
>     > currently working on updating HDF5 functionality and have run into an
>     > issue running a parallel test of a CFD code (benchmark.hdf) from the
>     > CGNS code base <https://github.com/CGNS/CGNS.git>.  I’ve debugged far
>     > enough to see that the failure occurs during a call to
>     > MPI_File_set_view, with the following failure signature:
>
>     Interesting.  I can take a look at it but I'll warn you I'm on vacation
>     until 9 January.
>
>     Stuff I would do:
>     - hook up valgrind or address sanitizer to see if anything unexpected
>     crops up (your "message queue corrupted" theory); example invocations
>     below.
>     - get a gdb backtrace of the failing MPI_File_set_view, including the
>     caller's stack.
>     - maybe try older/newer MPICH versions, though lots of tests invoke
>     MPI_FILE_SET_VIEW without problems, so that's a bit of a long shot.
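>
>     For the first two, something along these lines usually works (adjust the
>     rank count and binary path to your setup):
>
>         mpirun -n 2 valgrind --track-origins=yes ./benchmark_hdf5
>         mpirun -n 2 xterm -e gdb ./benchmark_hdf5    # one gdb per rank; needs X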
>
>     ==rob
>
>     >
>     >
>     >
>     > [brtnfld at jelly] ~/scratch/CGNS/CGNS/src/ptests % mpirun -n 2 benchmark_hdf5
>     >
>     > Fatal error in PMPI_Barrier: Message truncated, error stack:
>     >
>     > PMPI_Barrier(430)...................: MPI_Barrier(comm=0x84000006) failed
>     >
>     > MPIR_Barrier_impl(337)..............: Failure during collective
>     >
>     > MPIR_Barrier_impl(330)..............:
>     >
>     > MPIR_Barrier(294)...................:
>     >
>     > MPIR_Barrier_intra(151).............:
>     >
>     > barrier_smp_intra(111)..............:
>     >
>     > MPIR_Bcast_impl(1462)...............:
>     >
>     > MPIR_Bcast(1486)....................:
>     >
>     > MPIR_Bcast_intra(1295)..............:
>     >
>     > MPIR_Bcast_binomial(241)............:
>     >
>     > MPIC_Recv(352)......................:
>     >
>     > MPIDI_CH3U_Request_unpack_uebuf(608): Message truncated; 4 bytes
>     > received but buffer size is 1
>     >
>     > [cli_1]: aborting job:
>     >
>     > Fatal error in PMPI_Barrier: Message truncated, error stack:
>     >
>     > PMPI_Barrier(430)...................: MPI_Barrier(comm=0x84000006) failed
>     >
>     > MPIR_Barrier_impl(337)..............: Failure during collective
>     >
>     > MPIR_Barrier_impl(330)..............:
>     >
>     > MPIR_Barrier(294)...................:
>     >
>     > MPIR_Barrier_intra(151).............:
>     >
>     > barrier_smp_intra(111)..............:
>     >
>     > MPIR_Bcast_impl(1462)...............:
>     >
>     > MPIR_Bcast(1486)....................:
>     >
>     > MPIR_Bcast_intra(1295)..............:
>     >
>     > MPIR_Bcast_binomial(241)............:
>     >
>     > MPIC_Recv(352)......................:
>     >
>     > MPIDI_CH3U_Request_unpack_uebuf(608): Message truncated; 4 bytes
>     > received but buffer size is 1
>     >
>     > benchmark_hdf5: /mnt/hdf/brtnfld/hdf5/trunk/hdf5/src/H5Fint.c:1465:
>     > H5F_close: Assertion `f->file_id > 0' failed.
>     >
>     > Fatal error in PMPI_Allgather: Unknown error class, error stack:
>     >
>     > PMPI_Allgather(1002)......................:
>     > MPI_Allgather(sbuf=0x7ffdfdaf9b10, scount=1, MPI_LONG_LONG_INT,
>     > rbuf=0x1d53ed8, rcount=1, MPI_LONG_LONG_INT, comm=0xc4000002) failed
>     >
>     > MPIR_Allgather_impl(842)..................:
>     >
>     > MPIR_Allgather(801).......................:
>     >
>     > MPIR_Allgather_intra(216).................:
>     >
>     > MPIC_Sendrecv(475)........................:
>     >
>     > MPIC_Wait(243)............................:
>     >
>     > MPIDI_CH3i_Progress_wait(239).............: an error occurred while
>     > handling an event returned by MPIDU_Sock_Wait()
>     >
>     > MPIDI_CH3I_Progress_handle_sock_event(451):
>     >
>     > MPIDU_Socki_handle_read(649)..............: connection failure
>     > (set=0,sock=1,errno=104:Connection reset by peer)
>     >
>     > [cli_0]: aborting job:
>     >
>     > Fatal error in PMPI_Allgather: Unknown error class, error stack:
>     >
>     > PMPI_Allgather(1002)......................:
>     > MPI_Allgather(sbuf=0x7ffdfdaf9b10, scount=1, MPI_LONG_LONG_INT,
>     > rbuf=0x1d53ed8, rcount=1, MPI_LONG_LONG_INT, comm=0xc4000002) failed
>     >
>     > MPIR_Allgather_impl(842)..................:
>     >
>     > MPIR_Allgather(801).......................:
>     >
>     > MPIR_Allgather_intra(216).................:
>     >
>     > MPIC_Sendrecv(475)........................:
>     >
>     > MPIC_Wait(243)............................:
>     >
>     > MPIDI_CH3i_Progress_wait(239).............: an error occurred while
>     > handling an event returned by MPIDU_Sock_Wait()
>     >
>     > MPIDI_CH3I_Progress_handle_sock_event(451):
>     >
>     > MPIDU_Socki_handle_read(649)..............: connection failure
>     > (set=0,sock=1,errno=104:Connection reset by peer)
>     >
>     > benchmark_hdf5: /mnt/hdf/brtnfld/hdf5/trunk/hdf5/src/H5Fint.c:1465:
>     > H5F_close: Assertion `f->file_id > 0' failed.
>     >
>     >
>     >
>     > Please note that the above is the original stack trace, which appears
>     > to use sockets; I’ve since reproduced the same problem on an SMP with
>     > shared memory.
>     >
>     >
>     >
>     > While it’s not certain that the issue has anything to do with the above
>     > stack trace, the very same benchmark runs perfectly well with PHDF5
>     > built against OpenMPI.  My testing uses MPICH 3.2 from your download
>     > site and OpenMPI 2.0.1 (also their latest release).  Both MPI releases
>     > were built from source on my Fedora 25 Linux system using GCC 6.2.1
>     > 20160916 (Red Hat 6.2.1-2).
>     >
>     >
>     >
>     > Given that the synchronous calls into MPI_File_set_view appear to be
>     > coded correctly, and that there isn’t much in the way of input
>     > parameters that could cause problems (other than incorrect coding), we
>     > tend to believe that the internal message queues between processes may
>     > somehow be corrupted.  This impression is strengthened by the fact that
>     > our recent codebase changes, which are unrelated to the actual calls to
>     > MPI_File_set_view, appear to have introduced this issue.  Note too that
>     > the code paths to MPI_File_set_view have been exercised many times
>     > before, and those calls have always succeeded.
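>     >
>     > For what it’s worth, the “4 bytes received but buffer size is 1”
>     > signature is the sort of thing you get when two ranks disagree about a
>     > collective, e.g. (toy illustration only, not our actual code):
>     >
>     >     int  ival = 0;
>     >     char cval = 0;
>     >     if (rank == 0)
>     >         MPI_Bcast(&ival, 1, MPI_INT,  0, comm);   /* sends 4 bytes  */
>     >     else
>     >         MPI_Bcast(&cval, 1, MPI_CHAR, 0, comm);   /* expects 1 byte */
>     >
>     > so one possibility is that an earlier, unrelated collective left the
>     > ranks out of step with each other.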
>     >
>     >
>     >
>     > Does anyone have suggestions on how to further debug this potential
>     > corruption issue?
>     >
>     > Many thanks,
>     >
>     > Richard A. Warren
>     >
>     >
>     >
>     >
>     >
>
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

