[mpich-discuss] FW: Potential MPICH problem

Richard Warren Richard.Warren at hdfgroup.org
Thu Dec 29 08:28:35 CST 2016


Hi All,
I’m writing to get some advice and possibly to report a bug. We are currently working on updating HDF5 functionality and have run into an issue while running a parallel test of a CFD code (benchmark.hdf) from the CGNS code base <https://github.com/CGNS/CGNS.git>. I’ve debugged enough to see that our failure occurs during a call to MPI_File_set_view, with the following failure signature:

[brtnfld at jelly] ~/scratch/CGNS/CGNS/src/ptests % mpirun -n 2 benchmark_hdf5
Fatal error in PMPI_Barrier: Message truncated, error stack:
PMPI_Barrier(430)...................: MPI_Barrier(comm=0x84000006) failed
MPIR_Barrier_impl(337)..............: Failure during collective
MPIR_Barrier_impl(330)..............:
MPIR_Barrier(294)...................:
MPIR_Barrier_intra(151).............:
barrier_smp_intra(111)..............:
MPIR_Bcast_impl(1462)...............:
MPIR_Bcast(1486)....................:
MPIR_Bcast_intra(1295)..............:
MPIR_Bcast_binomial(241)............:
MPIC_Recv(352)......................:
MPIDI_CH3U_Request_unpack_uebuf(608): Message truncated; 4 bytes received but buffer size is 1
[cli_1]: aborting job:
Fatal error in PMPI_Barrier: Message truncated, error stack:
PMPI_Barrier(430)...................: MPI_Barrier(comm=0x84000006) failed
MPIR_Barrier_impl(337)..............: Failure during collective
MPIR_Barrier_impl(330)..............:
MPIR_Barrier(294)...................:
MPIR_Barrier_intra(151).............:
barrier_smp_intra(111)..............:
MPIR_Bcast_impl(1462)...............:
MPIR_Bcast(1486)....................:
MPIR_Bcast_intra(1295)..............:
MPIR_Bcast_binomial(241)............:
MPIC_Recv(352)......................:
MPIDI_CH3U_Request_unpack_uebuf(608): Message truncated; 4 bytes received but buffer size is 1
benchmark_hdf5: /mnt/hdf/brtnfld/hdf5/trunk/hdf5/src/H5Fint.c:1465: H5F_close: Assertion `f->file_id > 0' failed.
Fatal error in PMPI_Allgather: Unknown error class, error stack:
PMPI_Allgather(1002)......................: MPI_Allgather(sbuf=0x7ffdfdaf9b10, scount=1, MPI_LONG_LONG_INT, rbuf=0x1d53ed8, rcount=1, MPI_LONG_LONG_INT, comm=0xc4000002) failed
MPIR_Allgather_impl(842)..................:
MPIR_Allgather(801).......................:
MPIR_Allgather_intra(216).................:
MPIC_Sendrecv(475)........................:
MPIC_Wait(243)............................:
MPIDI_CH3i_Progress_wait(239).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(451):
MPIDU_Socki_handle_read(649)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
[cli_0]: aborting job:
Fatal error in PMPI_Allgather: Unknown error class, error stack:
PMPI_Allgather(1002)......................: MPI_Allgather(sbuf=0x7ffdfdaf9b10, scount=1, MPI_LONG_LONG_INT, rbuf=0x1d53ed8, rcount=1, MPI_LONG_LONG_INT, comm=0xc4000002) failed
MPIR_Allgather_impl(842)..................:
MPIR_Allgather(801).......................:
MPIR_Allgather_intra(216).................:
MPIC_Sendrecv(475)........................:
MPIC_Wait(243)............................:
MPIDI_CH3i_Progress_wait(239).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(451):
MPIDU_Socki_handle_read(649)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
benchmark_hdf5: /mnt/hdf/brtnfld/hdf5/trunk/hdf5/src/H5Fint.c:1465: H5F_close: Assertion `f->file_id > 0' failed.

Please note that the above trace is the original stack trace, which appears to use the sockets channel, though I’ve reproduced the same problem running on an SMP with shared memory.
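
For reference, here is a minimal two-rank sketch of the kind of collective MPI_File_set_view call involved (this is not the actual CGNS/HDF5 code path; the file name, displacement, and datatypes below are placeholders):

    /* Minimal sketch: every rank must make the same collective
     * MPI_File_set_view call with consistent etype/filetype/datarep. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        int rank, val;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "setview_test.dat",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        /* Each rank views the file starting at its own displacement. */
        MPI_Offset disp = (MPI_Offset)rank * sizeof(int);
        MPI_File_set_view(fh, disp, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

        val = rank;
        MPI_File_write_all(fh, &val, 1, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }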

While it’s not definitive that the issue has anything to do with the above stack trace, the very same benchmark runs perfectly well with PHDF5 built against OpenMPI. My own testing is with MPICH version 3.2, available from your download site, and with OpenMPI 2.0.1 (also their latest download). Both MPI releases were built from source on my Fedora 25 Linux distribution using GCC 6.2.1 20160916 (Red Hat 6.2.1-2).
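
Since both MPI stacks are installed side by side, one quick sanity check (a standalone sketch, not part of the benchmark itself) is to print which library a binary actually picked up via MPI_Get_library_version:

    /* Prints the MPI library's self-reported version string. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char version[MPI_MAX_LIBRARY_VERSION_STRING];
        int len;

        MPI_Init(&argc, &argv);
        MPI_Get_library_version(version, &len);
        printf("%s\n", version);
        MPI_Finalize();
        return 0;
    }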

Given that the synchronous calls into MPI_File_set_view appear to be coded correctly, and that there isn’t much in the way of input parameters that could cause problems (other than incorrect coding), we tend to believe that the internal message queues between processes may somehow be corrupted. This impression is strengthened by the fact that our recent codebase changes, which are unrelated to the actual calls to MPI_File_set_view, appear to have introduced the issue. Note, too, that the code paths to MPI_File_set_view have been taken many times previously, and those calls have all succeeded.
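
One avenue for narrowing this down further (a sketch on our side, not taken from the HDF5 sources) would be to switch the communicator’s error handler from the default MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN and decode the return code at the suspect call site, so the first failing call is reported rather than the downstream collective that aborts:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Return error codes instead of aborting on the first failure. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rc = MPI_Barrier(MPI_COMM_WORLD);  /* stand-in for the suspect collective */
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "MPI error: %s\n", msg);
        }

        MPI_Finalize();
        return 0;
    }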

Are there any suggestions out there as to how to further debug this potential corruption issue?
Many thanks,
Richard A. Warren
