[mpich-discuss] FW: Potential MPICH problem
Richard Warren
Richard.Warren at hdfgroup.org
Fri Jan 20 20:13:13 CST 2017
Hi Rob,
In my previous email I mentioned that the failure in MPI_File_set_view that we've observed and reported might be the effect of some prior data corruption, e.g. a buffer overrun, but I haven't found evidence of that yet. Even more curious is that the application crash occurs from within an internal Barrier operation inside MPI_File_set_view, even though an MPI_Barrier call that I added immediately prior to the set_view completes without error (though I used MPI_COMM_WORLD for that "test").
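For clarity, the "test" amounts to the following sequence (a minimal sketch with placeholder arguments, not the actual HDF5 call site, which per the backtrace below is H5FD_mpio_write):

#include <mpi.h>

/* Minimal sketch of the experiment (placeholder arguments): an explicit
 * barrier on MPI_COMM_WORLD returns without error, yet the barrier that
 * MPI_File_set_view performs internally on the file's communicator is
 * where the crash occurs. */
static void set_view_with_probe_barrier(MPI_File fh, MPI_Datatype filetype)
{
    MPI_Barrier(MPI_COMM_WORLD);            /* added "test": succeeds */

    /* crashes from within its internal Barrier */
    MPI_File_set_view(fh, (MPI_Offset)0, MPI_BYTE, filetype,
                      "native", MPI_INFO_NULL);
}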
Here's a more detailed trace showing me stepping through the MPI_File_set_view/Barrier code:
112 my_rank = MPID_nem_mem_region.rank;
(gdb)
119 if (MPID_nem_fbox_is_full((MPID_nem_fbox_common_ptr_t)pbox))
(gdb)
OPA_load_acquire_int (ptr=0x7ffff5a91040) at /home/riwarren/Sandbox/mpich-3.2/src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h:65
65 /home/riwarren/Sandbox/mpich-3.2/src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h: No such file or directory.
(gdb)
MPID_nem_mpich_send_header (size=48, again=<synthetic pointer>, vc=0xb1b838, buf=0x7fffffffcf10) at ./src/mpid/ch3/channels/nemesis/include/mpid_nem_inline.h:119
119 if (MPID_nem_fbox_is_full((MPID_nem_fbox_common_ptr_t)pbox))
(gdb)
122 pbox->cell.pkt.mpich.source = MPID_nem_mem_region.local_rank;
(gdb)
128 MPIU_Memcpy((void *)pbox->cell.pkt.mpich.p.payload, buf, size);
(gdb)
122 pbox->cell.pkt.mpich.source = MPID_nem_mem_region.local_rank;
(gdb)
123 pbox->cell.pkt.mpich.datalen = size;
(gdb)
124 pbox->cell.pkt.mpich.seqno = vc_ch->send_seqno++;
(gdb)
128 MPIU_Memcpy((void *)pbox->cell.pkt.mpich.p.payload, buf, size);
(gdb) n
130 OPA_store_release_int(&pbox->flag.value, 1);
(gdb) where
#0 MPID_nem_mpich_send_header (size=48, again=<synthetic pointer>, vc=0xb1b838, buf=0x7fffffffcf10) at ./src/mpid/ch3/channels/nemesis/include/mpid_nem_inline.h:130
#1 MPIDI_CH3_iSend (vc=vc@entry=0xb1b838, sreq=0x7ffff769fe88 <MPID_Request_direct+904>, hdr=hdr@entry=0x7fffffffcf10, hdr_sz=48, hdr_sz@entry=32) at src/mpid/ch3/channels/nemesis/src/ch3_isend.c:56
#2 0x00007ffff736e6eb in MPID_Isend (buf=buf@entry=0x0, count=count@entry=0, datatype=datatype@entry=1275068685, rank=rank@entry=1, tag=tag@entry=1, comm=comm@entry=0x7ffff7e27520, context_offset=1,
request=0x7fffffffcf98) at src/mpid/ch3/src/mpid_isend.c:115
#3 0x00007ffff73127a4 in MPIC_Sendrecv (sendbuf=sendbuf@entry=0x0, sendcount=sendcount@entry=0, sendtype=sendtype@entry=1275068685, dest=dest@entry=1, sendtag=sendtag@entry=1, recvbuf=recvbuf@entry=0x0,
recvcount=0, recvtype=1275068685, source=1, recvtag=1, comm_ptr=0x7ffff7e27520, status=0x7fffffffcfa0, errflag=0x7fffffffd14c) at src/mpi/coll/helper_fns.c:481
#4 0x00007ffff726f18c in MPIR_Barrier_intra (comm_ptr=0x7ffff7e27520, errflag=0x7fffffffd14c) at src/mpi/coll/barrier.c:162
#5 0x00007ffff726f7b2 in MPIR_Barrier (comm_ptr=<optimized out>, errflag=<optimized out>) at src/mpi/coll/barrier.c:291
#6 0x00007ffff726f095 in MPIR_Barrier_impl (comm_ptr=<optimized out>, errflag=errflag@entry=0x7fffffffd14c) at src/mpi/coll/barrier.c:326
#7 0x00007ffff726f26a in barrier_smp_intra (errflag=0x7fffffffd14c, comm_ptr=0x7ffff7e27370) at src/mpi/coll/barrier.c:81
#8 MPIR_Barrier_intra (comm_ptr=0x7ffff7e27370, errflag=0x7fffffffd14c) at src/mpi/coll/barrier.c:146
#9 0x00007ffff726f7b2 in MPIR_Barrier (comm_ptr=<optimized out>, errflag=<optimized out>) at src/mpi/coll/barrier.c:291
#10 0x00007ffff726f095 in MPIR_Barrier_impl (comm_ptr=comm_ptr@entry=0x7ffff7e27370, errflag=errflag@entry=0x7fffffffd14c) at src/mpi/coll/barrier.c:326
#11 0x00007ffff726f9e2 in PMPI_Barrier (comm=-1006632958) at src/mpi/coll/barrier.c:410
#12 0x00007ffff73b6bf1 in PMPI_File_set_view (fh=0xb99e78, disp=0, etype=etype@entry=1275068685, filetype=-1946157049, datarep=<optimized out>, datarep@entry=0xae9b08 <H5FD_mpi_native_g> "native",
info=-1677721596) at mpi-io/set_view.c:188
#13 0x000000000075cce1 in H5FD_mpio_write (_file=_file@entry=0xbb21d0, type=type@entry=H5FD_MEM_DEFAULT, dxpl_id=<optimized out>, addr=addr@entry=0, size=size@entry=1, buf=buf@entry=0xbcb830)
at H5FDmpio.c:1781
#14 0x000000000055e285 in H5FD_write (file=0xbb21d0, dxpl=0xb84210, type=type@entry=H5FD_MEM_DEFAULT, addr=addr@entry=0, size=size@entry=1, buf=buf@entry=0xbcb830) at H5FDint.c:294
#15 0x000000000054aa12 in H5F__accum_write (fio_info=fio_info@entry=0x7fffffffd350, map_type=map_type@entry=H5FD_MEM_DEFAULT, addr=addr@entry=0, size=size@entry=1, buf=buf@entry=0xbcb830) at H5Faccum.c:821
#16 0x000000000054c5fc in H5F_block_write (f=f@entry=0xbb2290, type=type@entry=H5FD_MEM_DEFAULT, addr=addr@entry=0, size=size@entry=1, dxpl_id=dxpl_id@entry=720575940379279375, buf=buf@entry=0xbcb830)
at H5Fio.c:195
#17 0x0000000000752a11 in H5C__collective_write (f=f@entry=0xbb2290, dxpl_id=dxpl_id@entry=720575940379279375) at H5Cmpio.c:1454
#18 0x0000000000754267 in H5C_apply_candidate_list (f=f@entry=0xbb2290, dxpl_id=dxpl_id@entry=720575940379279375, cache_ptr=cache_ptr@entry=0x7ffff455a040, num_candidates=1,
candidates_list_ptr=<optimized out>, mpi_rank=<optimized out>, mpi_size=2) at H5Cmpio.c:760
#19 0x0000000000750676 in H5AC__rsp__dist_md_write__flush (f=f@entry=0xbb2290, dxpl_id=dxpl_id@entry=720575940379279375) at H5ACmpio.c:1707
#20 0x0000000000751f7f in H5AC__run_sync_point (f=f@entry=0xbb2290, dxpl_id=dxpl_id@entry=720575940379279375, sync_point_op=sync_point_op@entry=1) at H5ACmpio.c:2158
#21 0x000000000075205e in H5AC__flush_entries (f=f@entry=0xbb2290, dxpl_id=dxpl_id@entry=720575940379279375) at H5ACmpio.c:2301
#22 0x00000000004bfe16 in H5AC_dest (f=f@entry=0xbb2290, dxpl_id=dxpl_id@entry=720575940379279375) at H5AC.c:582
#23 0x0000000000543ef1 in H5F_dest (f=f@entry=0xbb2290, dxpl_id=720575940379279375, flush=flush@entry=true) at H5Fint.c:964
#24 0x0000000000544ae2 in H5F_try_close (f=f@entry=0xbb2290, was_closed=was_closed@entry=0x0) at H5Fint.c:1800
#25 0x0000000000544ee2 in H5F_close (f=0xbb2290) at H5Fint.c:1626
#26 0x00000000005b65cd in H5I_dec_ref (id=id@entry=72057594037927936) at H5I.c:1308
#27 0x00000000005b669e in H5I_dec_app_ref (id=id@entry=72057594037927936) at H5I.c:1353
#28 0x000000000053d058 in H5Fclose (file_id=72057594037927936) at H5F.c:769
#29 0x0000000000487353 in ADFH_Database_Close (root=4.7783097267364807e-299, status=0x7fffffffdb24) at adfh/ADFH.c:2447
#30 0x0000000000481633 in cgio_close_file (cgio_num=1) at cgns_io.c:817
#31 0x00000000004060d7 in cg_close (file_number=1) at cgnslib.c:636
#32 0x0000000000437417 in cgp_close (fn=1) at pcgnslib.c:288
#33 0x0000000000403524 in main (argc=1, argv=0x7fffffffdd28) at benchmark_hdf5.c:186
The code fails when I attempt to step over the OPA_store_release_int function. Might you have some explanation of what *COULD* go wrong with this? Not knowing the current details of the MPICH collectives, why would a Barrier operation that immediately precedes the call to MPI_File_set_view complete correctly, while the same operation fails when invoked from within MPI_File_set_view?
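To make the question concrete, my reading of the last few steps is roughly the following (paraphrased from the gdb session above, not the actual MPICH source):

/* Paraphrase of the nemesis fastbox header send as stepped through above;
 * pbox points into the shared-memory region mapped by both local ranks. */
pbox->cell.pkt.mpich.source  = MPID_nem_mem_region.local_rank;
pbox->cell.pkt.mpich.datalen = size;                 /* 48 in this trace */
pbox->cell.pkt.mpich.seqno   = vc_ch->send_seqno++;
MPIU_Memcpy((void *)pbox->cell.pkt.mpich.p.payload, buf, size);
OPA_store_release_int(&pbox->flag.value, 1);         /* failure seen stepping over this */

The writes into the cell complete without incident, so whatever goes wrong appears to be either the release store itself or what the receiving rank does once it sees the flag set (the original failure was a "Message truncated" error on the receive side), which is why I keep coming back to the corruption hypothesis.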
Thanks for any insight you have on this issue!
Best regards,
Richard
________________________________
From: Richard Warren <Richard.Warren at hdfgroup.org>
Sent: Wednesday, January 11, 2017 11:27:42 AM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] FW: Potential MPICH problem
Hi Rob,
I did put together an initial simplified test case to attempt to reproduce the issue, but that code does NOT fail! I've since focused my efforts not so much on the file-closing operations as on the other collective operations that take place PRIOR to the close. I suspect that some operation(s) preceding the file close are responsible for getting things out of sync, which is why we observe the failure downstream. At the moment I'm tracing those other collective operations to gain a better understanding of what's happening. The code we're testing has not been released yet, so I'm not convinced that you could reproduce it directly from the currently available HDF5 and CGNS downloads. I could send you the actual test, since it links to the static libraries (.a) for both HDF5 and CGNS. Would that be of interest to you?
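For reference, the kind of stripped-down test I mean is roughly of this shape (a sketch only, with placeholder file name and layout, going straight at MPI-IO rather than through HDF5/CGNS); this pattern by itself runs clean:

#include <mpi.h>

/* Stripped-down sketch (placeholder filename and layout): open collectively,
 * set a view, do a collective write, and close. */
int main(int argc, char **argv)
{
    MPI_File fh;
    int rank, val;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    val = rank;

    MPI_File_open(MPI_COMM_WORLD, "setview_test.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)(rank * sizeof(int)),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, &val, 1, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}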
Thanks,
Richard
On 1/11/17, 11:02 AM, "Rob Latham" <robl at mcs.anl.gov> wrote:
On 12/29/2016 08:28 AM, Richard Warren wrote:
>
>
> Hi All,
>
> I’m writing to get some advice and possibly report a bug. The
> circumstances are that we are currently working on updating HDF5
> functionality and have run into an issue running a parallel test of a
> CFD code (benchmark.hdf) from the CGNS code base
> <https://github.com/CGNS/CGNS.git>. I’ve debugged enough to see that
> our failure occurs during a call to MPI_File_set_view, with the failure
> signature as follows:
I'm having a heck of a time building CGNS. When I make cgns,
cgnsconvert fails to find the hdf5 symbols despite me telling
cmake where to find libhdf5.
Did you get that test case you mentioned?
==rob
>
>
>
> [brtnfld at jelly] ~/scratch/CGNS/CGNS/src/ptests % mpirun -n 2 benchmark_hdf5
>
> Fatal error in PMPI_Barrier: Message truncated, error stack:
>
> PMPI_Barrier(430)...................: MPI_Barrier(comm=0x84000006) failed
>
> MPIR_Barrier_impl(337)..............: Failure during collective
>
> MPIR_Barrier_impl(330)..............:
>
> MPIR_Barrier(294)...................:
>
> MPIR_Barrier_intra(151).............:
>
> barrier_smp_intra(111)..............:
>
> MPIR_Bcast_impl(1462)...............:
>
> MPIR_Bcast(1486)....................:
>
> MPIR_Bcast_intra(1295)..............:
>
> MPIR_Bcast_binomial(241)............:
>
> MPIC_Recv(352)......................:
>
> MPIDI_CH3U_Request_unpack_uebuf(608): Message truncated; 4 bytes
> received but buffer size is 1
>
> [cli_1]: aborting job:
>
> Fatal error in PMPI_Barrier: Message truncated, error stack:
>
> PMPI_Barrier(430)...................: MPI_Barrier(comm=0x84000006) failed
>
> MPIR_Barrier_impl(337)..............: Failure during collective
>
> MPIR_Barrier_impl(330)..............:
>
> MPIR_Barrier(294)...................:
>
> MPIR_Barrier_intra(151).............:
>
> barrier_smp_intra(111)..............:
>
> MPIR_Bcast_impl(1462)...............:
>
> MPIR_Bcast(1486)....................:
>
> MPIR_Bcast_intra(1295)..............:
>
> MPIR_Bcast_binomial(241)............:
>
> MPIC_Recv(352)......................:
>
> MPIDI_CH3U_Request_unpack_uebuf(608): Message truncated; 4 bytes
> received but buffer size is 1
>
> benchmark_hdf5: /mnt/hdf/brtnfld/hdf5/trunk/hdf5/src/H5Fint.c:1465:
> H5F_close: Assertion `f->file_id > 0' failed.
>
> Fatal error in PMPI_Allgather: Unknown error class, error stack:
>
> PMPI_Allgather(1002)......................:
> MPI_Allgather(sbuf=0x7ffdfdaf9b10, scount=1, MPI_LONG_LONG_INT,
> rbuf=0x1d53ed8, rcount=1, MPI_LONG_LONG_INT, comm=0xc4000002) failed
>
> MPIR_Allgather_impl(842)..................:
>
> MPIR_Allgather(801).......................:
>
> MPIR_Allgather_intra(216).................:
>
> MPIC_Sendrecv(475)........................:
>
> MPIC_Wait(243)............................:
>
> MPIDI_CH3i_Progress_wait(239).............: an error occurred while
> handling an event returned by MPIDU_Sock_Wait()
>
> MPIDI_CH3I_Progress_handle_sock_event(451):
>
> MPIDU_Socki_handle_read(649)..............: connection failure
> (set=0,sock=1,errno=104:Connection reset by peer)
>
> [cli_0]: aborting job:
>
> Fatal error in PMPI_Allgather: Unknown error class, error stack:
>
> PMPI_Allgather(1002)......................:
> MPI_Allgather(sbuf=0x7ffdfdaf9b10, scount=1, MPI_LONG_LONG_INT,
> rbuf=0x1d53ed8, rcount=1, MPI_LONG_LONG_INT, comm=0xc4000002) failed
>
> MPIR_Allgather_impl(842)..................:
>
> MPIR_Allgather(801).......................:
>
> MPIR_Allgather_intra(216).................:
>
> MPIC_Sendrecv(475)........................:
>
> MPIC_Wait(243)............................:
>
> MPIDI_CH3i_Progress_wait(239).............: an error occurred while
> handling an event returned by MPIDU_Sock_Wait()
>
> MPIDI_CH3I_Progress_handle_sock_event(451):
>
> MPIDU_Socki_handle_read(649)..............: connection failure
> (set=0,sock=1,errno=104:Connection reset by peer)
>
> benchmark_hdf5: /mnt/hdf/brtnfld/hdf5/trunk/hdf5/src/H5Fint.c:1465:
> H5F_close: Assertion `f->file_id > 0’ failed.
>
>
>
> Please note that the above trace was the original stacktrace which
> appears to utilize sockets, though I’ve reproduced the same problem by
> running on an SMP with shared memory.
>
>
>
> While it’s not definitive that the issue has anything to do with the
> above stack trace, the very same benchmark runs perfectly well utilizing
> PHDF5 built with OpenMPI. My own testing is with MPICH version 3.2
> available from your download site and with OpenMPI 2.0.1 (also their
> latest download). Both MPI releases were built from source on my Fedora
> 25 Linux distribution using GCC 6.2.1 20160916 (Red Hat 6.2.1-2).
>
>
>
> Given that the synchronous calls into MPI_File_set_view appear to be
> coded correctly AND that there isn’t much in the way of input parameters
> that would cause problems (other than incorrect coding), we tend to
> believe that the internal message queues between processes may somehow
> be corrupted. This impression is strengthened by the fact that our
> recent codebase changes (which are unrelated to the actual calls to
> MPI_File_set_view) may have introduced this issue. Note too, that the
> code paths to MPI_File_set_view have been taken many times previously
> and those function calls have all succeeded.
>
>
>
> Are there any suggestions out there as to how to further debug this
> potential corruption issue?
>
> Many thanks,
>
> Richard A. Warren
>
>
>
>
>