[mpich-discuss] FW: Potential MPICH problem
Rob Latham
robl at mcs.anl.gov
Sat Dec 31 07:47:54 CST 2016
On 12/30/2016 07:48 AM, Richard Warren wrote:
> Hi Rob,
> Many thanks for your note and the heads-up about your vacation. The same applies to most of our group, so it’s a good time for me to investigate the problem a bit more. Along with your suggestions, I’m thinking I might write an MPI-only reproducer, e.g. produce a trace of the MPI function calls made during the file close operation and package that into a test I can share with you…
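> A bare-bones skeleton of what such a reproducer might look like (just a sketch; the file name and the trivial contiguous view are placeholders for whatever the captured trace ends up showing):
>
>     #include <mpi.h>
>
>     int main(int argc, char **argv)
>     {
>         MPI_File fh;
>
>         MPI_Init(&argc, &argv);
>
>         /* collective open on the same communicator the real code uses */
>         MPI_File_open(MPI_COMM_WORLD, "reproducer.dat",
>                       MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
>
>         /* trivial contiguous view; the real test would replay the
>            displacement/etype/filetype captured from the HDF5 close path */
>         MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);
>
>         MPI_File_close(&fh);
>         MPI_Finalize();
>         return 0;
>     }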
>
I'd be surprised if it's a problem with MPI_FILE_SET_VIEW itself. If
you put an MPI_BARRIER in there just before the call, I bet you'd find
the problem reported in MPI_BARRIER instead.
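
Roughly what I mean (a sketch only; the wrapper name and arguments here
are made up for illustration, not anything in HDF5 or CGNS):

    #include <mpi.h>

    /* If the communicator's message queues are already in a bad state,
       the barrier should fail first and point the finger away from the
       set_view call itself. */
    static int checked_set_view(MPI_File fh, MPI_Comm comm, MPI_Offset disp,
                                MPI_Datatype etype, MPI_Datatype filetype)
    {
        MPI_Barrier(comm);   /* expect the error to surface here */
        return MPI_File_set_view(fh, disp, etype, filetype, "native",
                                 MPI_INFO_NULL);
    }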
==rob
> Thanks again,
> Richard
>
> On 12/29/16, 10:24 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:
>
>
>
> On 12/29/2016 08:28 AM, Richard Warren wrote:
> >
> >
> > Hi All,
> >
> > I’m writing to get some advice and possibly report a bug. The
> > circumstances are that we are currently working on updating HDF5
> > functionality and have run into an issue running a parallel test of a
> > CFD code (benchmark.hdf) from the CGNS code base
> > <https://github.com/CGNS/CGNS.git>. I’ve debugged enough to see that
> > our failure occurs during a call to MPI_File_set_view, with the failure
> > signature as follows:
>
> Interesting. I can take a look at it, but I'll warn you I'm on vacation
> until 9 January.
>
> Stuff I would do (rough commands sketched below):
> - hook up valgrind or AddressSanitizer to see if anything unexpected
> crops up (your "message queue corrupted" theory).
> - get a gdb backtrace of the failing MPI_File_set_view, including the
> caller's stack.
> - maybe try older/newer MPICH versions, though lots of tests invoke
> MPI_FILE_SET_VIEW without problems, so that's a bit of a long shot.
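>
> Something along these lines (a sketch, assuming a 2-rank run of the same
> benchmark; adjust paths and rank counts as needed):
>
>     # run each rank under valgrind
>     mpiexec -n 2 valgrind ./benchmark_hdf5
>
>     # or rebuild with AddressSanitizer instead of valgrind, e.g.
>     #   CFLAGS="-g -fsanitize=address" LDFLAGS="-fsanitize=address"
>
>     # one gdb per rank, each in its own xterm; break on the failing call
>     # and print the backtrace ("bt") when it stops
>     mpiexec -n 2 xterm -e gdb -ex 'break MPI_File_set_view' -ex run ./benchmark_hdf5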
>
> ==rob
>
> >
> >
> >
> > [brtnfld at jelly] ~/scratch/CGNS/CGNS/src/ptests % mpirun -n 2 benchmark_hdf5
> >
> > Fatal error in PMPI_Barrier: Message truncated, error stack:
> > PMPI_Barrier(430)...................: MPI_Barrier(comm=0x84000006) failed
> > MPIR_Barrier_impl(337)..............: Failure during collective
> > MPIR_Barrier_impl(330)..............:
> > MPIR_Barrier(294)...................:
> > MPIR_Barrier_intra(151).............:
> > barrier_smp_intra(111)..............:
> > MPIR_Bcast_impl(1462)...............:
> > MPIR_Bcast(1486)....................:
> > MPIR_Bcast_intra(1295)..............:
> > MPIR_Bcast_binomial(241)............:
> > MPIC_Recv(352)......................:
> > MPIDI_CH3U_Request_unpack_uebuf(608): Message truncated; 4 bytes
> > received but buffer size is 1
> >
> > [cli_1]: aborting job:
> > Fatal error in PMPI_Barrier: Message truncated, error stack:
> > PMPI_Barrier(430)...................: MPI_Barrier(comm=0x84000006) failed
> > MPIR_Barrier_impl(337)..............: Failure during collective
> > MPIR_Barrier_impl(330)..............:
> > MPIR_Barrier(294)...................:
> > MPIR_Barrier_intra(151).............:
> > barrier_smp_intra(111)..............:
> > MPIR_Bcast_impl(1462)...............:
> > MPIR_Bcast(1486)....................:
> > MPIR_Bcast_intra(1295)..............:
> > MPIR_Bcast_binomial(241)............:
> > MPIC_Recv(352)......................:
> > MPIDI_CH3U_Request_unpack_uebuf(608): Message truncated; 4 bytes
> > received but buffer size is 1
> >
> > benchmark_hdf5: /mnt/hdf/brtnfld/hdf5/trunk/hdf5/src/H5Fint.c:1465:
> > H5F_close: Assertion `f->file_id > 0' failed.
> >
> > Fatal error in PMPI_Allgather: Unknown error class, error stack:
> > PMPI_Allgather(1002)......................:
> > MPI_Allgather(sbuf=0x7ffdfdaf9b10, scount=1, MPI_LONG_LONG_INT,
> > rbuf=0x1d53ed8, rcount=1, MPI_LONG_LONG_INT, comm=0xc4000002) failed
> > MPIR_Allgather_impl(842)..................:
> > MPIR_Allgather(801).......................:
> > MPIR_Allgather_intra(216).................:
> > MPIC_Sendrecv(475)........................:
> > MPIC_Wait(243)............................:
> > MPIDI_CH3i_Progress_wait(239).............: an error occurred while
> > handling an event returned by MPIDU_Sock_Wait()
> > MPIDI_CH3I_Progress_handle_sock_event(451):
> > MPIDU_Socki_handle_read(649)..............: connection failure
> > (set=0,sock=1,errno=104:Connection reset by peer)
> >
> > [cli_0]: aborting job:
> > Fatal error in PMPI_Allgather: Unknown error class, error stack:
> > PMPI_Allgather(1002)......................:
> > MPI_Allgather(sbuf=0x7ffdfdaf9b10, scount=1, MPI_LONG_LONG_INT,
> > rbuf=0x1d53ed8, rcount=1, MPI_LONG_LONG_INT, comm=0xc4000002) failed
> > MPIR_Allgather_impl(842)..................:
> > MPIR_Allgather(801).......................:
> > MPIR_Allgather_intra(216).................:
> > MPIC_Sendrecv(475)........................:
> > MPIC_Wait(243)............................:
> > MPIDI_CH3i_Progress_wait(239).............: an error occurred while
> > handling an event returned by MPIDU_Sock_Wait()
> > MPIDI_CH3I_Progress_handle_sock_event(451):
> > MPIDU_Socki_handle_read(649)..............: connection failure
> > (set=0,sock=1,errno=104:Connection reset by peer)
> >
> > benchmark_hdf5: /mnt/hdf/brtnfld/hdf5/trunk/hdf5/src/H5Fint.c:1465:
> > H5F_close: Assertion `f->file_id > 0' failed.
> >
> >
> >
> > Please note that the above is the original stack trace, which appears
> > to use the sockets channel; I have since reproduced the same problem
> > running on an SMP with shared memory.
> >
> >
> >
> > While the stack trace above isn’t definitive about where the issue
> > lies, the very same benchmark runs perfectly well with PHDF5 built
> > against OpenMPI. My testing is with MPICH 3.2 from your download site
> > and with OpenMPI 2.0.1 (also their latest download). Both MPI releases
> > were built from source on my Fedora 25 Linux system using GCC 6.2.1
> > 20160916 (Red Hat 6.2.1-2).
> >
> >
> >
> > Given that the synchronous calls into MPI_File_set_view appear to be
> > coded correctly, and that there isn’t much in the way of input
> > parameters that could cause problems (other than incorrect coding), we
> > tend to believe that the internal message queues between processes may
> > somehow be corrupted. This impression is strengthened by the fact that
> > our recent codebase changes (which are unrelated to the actual calls to
> > MPI_File_set_view) may have introduced this issue. Note, too, that
> > these code paths to MPI_File_set_view have been taken many times before
> > and those calls have always succeeded.
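> >
> > Purely to illustrate the kind of mismatch that produces this exact
> > signature ("4 bytes received but buffer size is 1"); this is a made-up
> > example, not a claim about where any mismatch sits in HDF5 or CGNS:
> >
> >     #include <mpi.h>
> >
> >     int main(int argc, char **argv)
> >     {
> >         int rank;
> >
> >         MPI_Init(&argc, &argv);
> >         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >
> >         if (rank == 0) {
> >             int value = 42;
> >             /* root broadcasts a 4-byte int ... */
> >             MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
> >         } else {
> >             char flag;
> >             /* ... but this rank only posts a 1-byte buffer, so the
> >                incoming message is truncated; with MPICH this shows up
> >                as a fatal "Message truncated" error like the one above */
> >             MPI_Bcast(&flag, 1, MPI_BYTE, 0, MPI_COMM_WORLD);
> >         }
> >
> >         MPI_Finalize();
> >         return 0;
> >     }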
> >
> >
> >
> > Are there any suggestions out there as to how to further debug this
> > potential corruption issue?
> >
> > Many thanks,
> >
> > Richard A. Warren
> >
> >
> >
> >
> >
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss