[mpich-discuss] FW: Potential MPICH problem
Richard Warren
Richard.Warren at hdfgroup.org
Fri Dec 30 07:48:26 CST 2016
Hi Rob,
Many thanks for your note and the heads-up about your vacation. The vacation situation applies to most of our group as well, so it's a good time for me to investigate the problem a bit more. Along with your suggestions, I'm thinking that I might work on writing an MPI-only reproducer, e.g. I can probably capture a trace of the MPI function calls that happen during the file close operation and package that into a test that I can share with you…
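A minimal MPI-only reproducer along those lines might look like the following sketch. To be clear, this is not the actual HDF5 call sequence: the file name, datatypes, offsets, and the barrier/allgather at "close" time are placeholders standing in for whatever the real trace of the close operation shows.

```c
/* Sketch of an MPI-only reproducer: replay a set_view/close-like
 * sequence without HDF5.  Compile with mpicc and run under mpirun.
 * Everything here is a stand-in for the traced HDF5 calls. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_File_open(MPI_COMM_WORLD, "reproducer.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* The suspect call: each rank views a disjoint slice of the file. */
    MPI_File_set_view(fh, rank * (MPI_Offset)sizeof(long long),
                      MPI_LONG_LONG_INT, MPI_LONG_LONG_INT,
                      "native", MPI_INFO_NULL);

    /* Mimic the collectives seen at file-close time in the failure
     * (an allgather followed by a barrier, as in the error stacks). */
    long long mine = rank;
    long long *all = malloc(nprocs * sizeof(long long));
    MPI_Allgather(&mine, 1, MPI_LONG_LONG_INT,
                  all, 1, MPI_LONG_LONG_INT, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    free(all);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```

Once the real trace is in hand, the placeholder collectives above would be replaced with the exact sequence (communicators, counts, datatypes) that the close path actually issues.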
Thanks again,
Richard
On 12/29/16, 10:24 PM, "Rob Latham" <robl at mcs.anl.gov> wrote:
On 12/29/2016 08:28 AM, Richard Warren wrote:
>
>
> Hi All,
>
> I’m writing to get some advice and possibly report a bug. The
> circumstances are that we are currently working on updating HDF5
> functionality and have run into an issue running a parallel test of a
> CFD code (benchmark.hdf) from the CGNS code base
> <https://github.com/CGNS/CGNS.git>. I’ve debugged enough to see that
> our failure occurs during a call to MPI_File_set_view, with the failure
> signature as follows:
Interesting. I can take a look at it but I'll warn you I'm on vacation
until 9 January.
Stuff I would do:
- hook up valgrind or address sanitizer to see if anything unexpected
does crop up (your "message queue corrupted" theory).
- get a gdb backtrace of the failing MPI_File_set_view including
caller's stack.
- maybe try out with older/newer versions, though lots of tests invoke
MPI_FILE_SET_VIEW without problems, so that's a bit of a long shot.
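On the backtrace suggestion: a common trick under mpirun is to park each rank in a loop until gdb attaches, placed just before the suspect call so the caller's stack is intact at the failure. This is a generic sketch, not MPICH-specific; the WAIT_FOR_GDB variable name is made up.

```c
/* Sketch: pause MPI ranks so gdb can attach before the failing call.
 * Insert just before the suspect MPI_File_set_view; then run
 * "gdb -p <pid>", do "set var holdit = 0", continue, and "bt"
 * when the failure hits. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only wait when asked, so normal runs are unaffected. */
    volatile int holdit = (getenv("WAIT_FOR_GDB") != NULL);
    if (holdit)
        printf("rank %d: pid %d waiting for gdb attach\n",
               rank, (int)getpid());
    while (holdit)          /* in gdb: set var holdit = 0 */
        sleep(1);

    /* ... the failing MPI_File_set_view call would go here ... */

    MPI_Finalize();
    return 0;
}
```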
==rob
>
> [brtnfld at jelly] ~/scratch/CGNS/CGNS/src/ptests % mpirun -n 2 benchmark_hdf5
>
> Fatal error in PMPI_Barrier: Message truncated, error stack:
> PMPI_Barrier(430)...................: MPI_Barrier(comm=0x84000006) failed
> MPIR_Barrier_impl(337)..............: Failure during collective
> MPIR_Barrier_impl(330)..............:
> MPIR_Barrier(294)...................:
> MPIR_Barrier_intra(151).............:
> barrier_smp_intra(111)..............:
> MPIR_Bcast_impl(1462)...............:
> MPIR_Bcast(1486)....................:
> MPIR_Bcast_intra(1295)..............:
> MPIR_Bcast_binomial(241)............:
> MPIC_Recv(352)......................:
> MPIDI_CH3U_Request_unpack_uebuf(608): Message truncated; 4 bytes
> received but buffer size is 1
>
> [cli_1]: aborting job:
> Fatal error in PMPI_Barrier: Message truncated, error stack:
> PMPI_Barrier(430)...................: MPI_Barrier(comm=0x84000006) failed
> MPIR_Barrier_impl(337)..............: Failure during collective
> MPIR_Barrier_impl(330)..............:
> MPIR_Barrier(294)...................:
> MPIR_Barrier_intra(151).............:
> barrier_smp_intra(111)..............:
> MPIR_Bcast_impl(1462)...............:
> MPIR_Bcast(1486)....................:
> MPIR_Bcast_intra(1295)..............:
> MPIR_Bcast_binomial(241)............:
> MPIC_Recv(352)......................:
> MPIDI_CH3U_Request_unpack_uebuf(608): Message truncated; 4 bytes
> received but buffer size is 1
>
> benchmark_hdf5: /mnt/hdf/brtnfld/hdf5/trunk/hdf5/src/H5Fint.c:1465:
> H5F_close: Assertion `f->file_id > 0' failed.
>
> Fatal error in PMPI_Allgather: Unknown error class, error stack:
> PMPI_Allgather(1002)......................:
> MPI_Allgather(sbuf=0x7ffdfdaf9b10, scount=1, MPI_LONG_LONG_INT,
> rbuf=0x1d53ed8, rcount=1, MPI_LONG_LONG_INT, comm=0xc4000002) failed
> MPIR_Allgather_impl(842)..................:
> MPIR_Allgather(801).......................:
> MPIR_Allgather_intra(216).................:
> MPIC_Sendrecv(475)........................:
> MPIC_Wait(243)............................:
> MPIDI_CH3i_Progress_wait(239).............: an error occurred while
> handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(451):
> MPIDU_Socki_handle_read(649)..............: connection failure
> (set=0,sock=1,errno=104:Connection reset by peer)
>
> [cli_0]: aborting job:
> Fatal error in PMPI_Allgather: Unknown error class, error stack:
> PMPI_Allgather(1002)......................:
> MPI_Allgather(sbuf=0x7ffdfdaf9b10, scount=1, MPI_LONG_LONG_INT,
> rbuf=0x1d53ed8, rcount=1, MPI_LONG_LONG_INT, comm=0xc4000002) failed
> MPIR_Allgather_impl(842)..................:
> MPIR_Allgather(801).......................:
> MPIR_Allgather_intra(216).................:
> MPIC_Sendrecv(475)........................:
> MPIC_Wait(243)............................:
> MPIDI_CH3i_Progress_wait(239).............: an error occurred while
> handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(451):
> MPIDU_Socki_handle_read(649)..............: connection failure
> (set=0,sock=1,errno=104:Connection reset by peer)
>
> benchmark_hdf5: /mnt/hdf/brtnfld/hdf5/trunk/hdf5/src/H5Fint.c:1465:
> H5F_close: Assertion `f->file_id > 0' failed.
>
> Please note that the above trace is from the original run, which
> appears to use the sockets channel; I have since reproduced the same
> problem by running on an SMP with shared memory.
>
> While it's not certain that the issue lies in the code shown in the
> above stack trace, the very same benchmark runs perfectly well using
> PHDF5 built against OpenMPI. My own testing is with MPICH version 3.2
> from your download site and with OpenMPI 2.0.1 (also their latest
> release). Both MPI libraries were built from source on my Fedora
> 25 Linux distribution using GCC 6.2.1 20160916 (Red Hat 6.2.1-2).
>
> Given that the collective calls into MPI_File_set_view appear to be
> coded correctly, and that there isn't much in the way of input
> parameters that could cause problems (other than outright incorrect
> coding), we tend to believe that the internal message queues between
> processes have somehow become corrupted. This impression is
> strengthened by the fact that our recent codebase changes (which are
> unrelated to the actual calls to MPI_File_set_view) appear to have
> introduced the issue. Note too that the code paths to
> MPI_File_set_view have been exercised many times previously, and
> those function calls have all succeeded.
>
> Are there any suggestions for how to further debug this potential
> corruption issue?
>
> Many thanks,
>
> Richard A. Warren
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>