[mpich-discuss] [Fwd: MPI_File_read_at_all external32 bug in MV2-2.1]
Rob Latham
robl at mcs.anl.gov
Mon Nov 9 15:12:28 CST 2015
On 11/06/2015 07:13 PM, Balaji, Pavan wrote:
> Adam,
>
> Not yet. We were working heads-down on the mpich-3.2 release.
>
> I'm adding the mpich discuss list to cc, and poking @robl as well. Do you have a test program that reproduces the error? (sorry if you already sent it, I'm still catching up on the pending issues).
>
Adam, thanks for reporting. Since you're the first person I've ever met
who cared about external32 support, let me give you a bit of background.
Our external32 support came from Intel, and we know about one bug: it
shows up when a type's external32 representation differs in size from its
in-memory representation (MPI_LONG and MPI_AINT are perhaps the two most
prominent examples):
- http://trac.mpich.org/projects/mpich/ticket/1754 (but look for
'goodell' and 'robl'. 'jhammond' comments are less relevant for this
discussion)
- http://trac.mpich.org/projects/mpich/ticket/1755
We do have an external32 test case, but it's pretty simple, and
independent i/o only.
OpenMPI found a bug in our external32 code, which sounds related to your
problem:
http://git.mpich.org/mpich.git/commitdiff/53f11fd73b2f9aa7b994e78ed05e9ae74264c63f
So you can see why Pavan was hoping you had a test case: it's taken a
few tries to get it right.
==rob
> -- Pavan
>
>> On Nov 6, 2015, at 2:10 PM, Adam T. Moody <moody20 at llnl.gov> wrote:
>>
>> Hi Pavan,
>> Did you have a chance to look into this in MPICH?
>>
>> I'm pretty certain there is a lurking ROMIO bug here.
>> -Adam
>>
>>
>> Adam T. Moody wrote:
>>
>>> Hi Pavan and Howard,
>>> FYI, it looks like this same bug is in MPICH-3.2rc1 and Open MPI-1.10.0.
>>>
>>> I guess no one uses collective I/O with external32 :-)
>>> -Adam
>>>
>>> ------------------------------------------------------------------------
>>>
>>> Subject:
>>> MPI_File_read_at_all external32 bug in MV2-2.1
>>> From:
>>> "Adam T. Moody" <moody20 at llnl.gov>
>>> Date:
>>> Fri, 30 Oct 2015 11:48:35 -0700
>>> To:
>>> "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at cse.ohio-state.edu>
>>>
>>>
>>> Hello MVAPICH team,
>>> I've hit a bug in MPI_File_read_at_all in MVAPICH2-2.1. I have an application that reads and writes files in external32 format. It writes the file just fine, but it throws the following error when reading the file back:
>>>
>>> internal ABORT - process 0
>>> srun: error: rzmerl2: task 0: Exited with exit code 1
>>> Assertion failed in file src/mpid/common/datatype/mpid_ext32_segment.c at line 277: FALSE
>>> memcpy argument memory ranges overlap, dst_=0x2aaab5c160a8 src_=0x2aaab5c160a8 len_=176
>>>
>>> Above, MPI is trying to do a memcpy where the source and destination buffer are the same address. Looking through the code for MVAPICH2-2.1, the problem seems to be at line 132 in src/mpi/romio/mpi-io/read_all.c:
>>>
>>>     if (e32_buf != NULL) {
>>>         error_code = MPIU_read_external32_conversion_fn(xbuf, datatype,
>>>                                                         count, e32_buf);
>>>         ADIOI_Free(e32_buf);
>>>     }
>>>
>>> I think the fix is to change "xbuf" above to "buf" as it is in read.c below:
>>>
>>>     if (e32_buf != NULL) {
>>>         error_code = MPIU_read_external32_conversion_fn(buf, datatype,
>>>                                                         count, e32_buf);
>>>         ADIOI_Free(e32_buf);
>>>     }
>>>
>>> When in external32 mode, xbuf == e32_buf, which acts as a temporary buffer in which to read the data. The code is then meant to unpack and convert the data from the temporary buffer into the user buffer at buf.
>>>
>>> It's probably worth checking the other external32 code paths to look for similar bugs.
>>> Thanks,
>>> -Adam
>>>
>>
>
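
For anyone following along, the buffer handling described in the quoted report can be sketched without MPI. This is only an illustration under stated assumptions, not the ROMIO code: read_e32_conversion, read_ints and file_bytes are hypothetical stand-ins, the data type is fixed to 32-bit integers, and the "file" is a small in-memory array. The names buf, xbuf and e32_buf mirror the ones in the report.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the conversion routine: decode `count`
 * big-endian (external32) 32-bit integers from e32_buf into userbuf. */
static int read_e32_conversion(int32_t *userbuf, int count,
                               const unsigned char *e32_buf)
{
    for (int i = 0; i < count; i++) {
        const unsigned char *p = e32_buf + 4 * (size_t)i;
        uint32_t v = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                     ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
        userbuf[i] = (int32_t)v;
    }
    return 0;
}

/* Simulated file contents: the values 1, 2, 3 stored big-endian. */
static const unsigned char file_bytes[12] = {
    0, 0, 0, 1,   0, 0, 0, 2,   0, 0, 0, 3
};

/* Sketch of the read path: read into a temporary buffer, then convert
 * from the temporary buffer into the caller's buffer. */
static int read_ints(int32_t *buf, int count)
{
    assert(count >= 0 && count <= 3);   /* the simulated file holds 3 ints */

    size_t nbytes = 4 * (size_t)count;
    unsigned char *e32_buf = malloc(nbytes ? nbytes : 1);
    if (e32_buf == NULL)
        return -1;

    void *xbuf = e32_buf;               /* in external32 mode xbuf aliases e32_buf */
    memcpy(xbuf, file_bytes, nbytes);   /* stands in for the actual file read */

    /* The destination must be buf. Passing xbuf here would make source
     * and destination the same address, as in the reported memcpy error,
     * and the caller's buffer would never be filled. */
    int err = read_e32_conversion(buf, count, e32_buf);

    free(e32_buf);
    return err;
}
```

Calling read_ints(out, 3) on an int32_t out[3] should leave 1, 2, 3 in out on either byte order, since the conversion assembles each value from individual bytes.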
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA