[mpich-discuss] [Fwd: MPI_File_read_at_all external32 bug in MV2-2.1]

Rob Latham robl at mcs.anl.gov
Mon Nov 9 15:49:02 CST 2015



On 11/09/2015 03:22 PM, Adam T. Moody wrote:
> Hi Rob,
> Ok, thanks for the background.  I'll put together a simple reproducer
> for the particular case that I've got.  I'll send that a little later.
>
> The bug that OpenMPI found is the same kind of bug but in a different
> routine (read instead of read_at_all).  My guess is that this is likely
> a cut-and-paste bug, so it'd be a good idea to check the other MPI IO
> routines for this problem.

I took a crack at reducing the cut and paste when I first merged the 
patch but clearly more should be done... and now that I know there's a 
user out there, I'll clean it up.

My usual advice is to use something like Parallel-Netcdf or HDF5, which 
have portability like external32 but also have a self-describing file 
format.  Is a library not an option for you?

==rob

> -Adam
>
>
> Rob Latham wrote:
>
>>
>>
>> On 11/06/2015 07:13 PM, Balaji, Pavan wrote:
>>
>>> Adam,
>>>
>>> Not yet.  We were working heads-down on the mpich-3.2 release.
>>>
>>> I'm adding the mpich discuss list to cc, and poking @robl as well. Do
>>> you have a test program that reproduces the error?  (sorry if you
>>> already sent it, I'm still catching up on the pending issues).
>>>
>>
>> Adam, thanks for reporting. Since you're the first person I've ever
>> met who cared about external32 support, let me give you a bit of
>> background.
>>
>> Our external32 support came from Intel, and we know about one bug: it
>> shows up when the external32 representation differs in size from the
>> in-memory type (MPI_LONG and MPI_AINT are perhaps the two most
>> prominent examples):
>>
>> - http://trac.mpich.org/projects/mpich/ticket/1754 (but look for
>> 'goodell' and 'robl'.  'jhammond' comments are less relevant for this
>> discussion)
>>
>> - http://trac.mpich.org/projects/mpich/ticket/1755
>>
>> We do have an external32 test case, but it's pretty simple, and
>> independent i/o only.
>>
>> OpenMPI found a bug in our external32 code, which sounds related to
>> your problem:
>>
>> http://git.mpich.org/mpich.git/commitdiff/53f11fd73b2f9aa7b994e78ed05e9ae74264c63f
>>
>>
>> So you can see why Pavan was hoping you had a test case: it's taken a
>> few tries to get it right.
>>
>> ==rob
>>
>>>    -- Pavan
>>>
>>>> On Nov 6, 2015, at 2:10 PM, Adam T. Moody <moody20 at llnl.gov> wrote:
>>>>
>>>> Hi Pavan,
>>>> Did you have a chance to look into this in MPICH?
>>>>
>>>> I'm pretty certain there is a lurking ROMIO bug here.
>>>> -Adam
>>>>
>>>>
>>>> Adam T. Moody wrote:
>>>>
>>>>> Hi Pavan and Howard,
>>>>> FYI, it looks like this same bug is in MPICH-3.2rc1 and Open MPI-1.10.0.
>>>>>
>>>>> I guess no one uses collective I/O with external32 :-)
>>>>> -Adam
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> Subject:
>>>>> MPI_File_read_at_all external32 bug in MV2-2.1
>>>>> From:
>>>>> "Adam T. Moody" <moody20 at llnl.gov>
>>>>> Date:
>>>>> Fri, 30 Oct 2015 11:48:35 -0700
>>>>> To:
>>>>> "mvapich-discuss at cse.ohio-state.edu"
>>>>> <mvapich-discuss at cse.ohio-state.edu>
>>>>>
>>>>>
>>>>> Hello MVAPICH team,
>>>>> I've hit a bug in MPI_File_read_at_all in MVAPICH2-2.1.  I have an
>>>>> application that reads and writes files in external32 format.  It
>>>>> writes the file just fine, but it throws the following error when
>>>>> reading the file back:
>>>>>
>>>>> internal ABORT - process 0
>>>>> srun: error: rzmerl2: task 0: Exited with exit code 1
>>>>> Assertion failed in file
>>>>> src/mpid/common/datatype/mpid_ext32_segment.c at line 277: FALSE
>>>>> memcpy argument memory ranges overlap, dst_=0x2aaab5c160a8
>>>>> src_=0x2aaab5c160a8 len_=176
>>>>>
>>>>> Above, MPI is trying to do a memcpy where the source and
>>>>> destination buffer are the same address.  Looking through the code
>>>>> for MVAPICH2-2.1, the problem seems to be at line 132 in
>>>>> src/mpi/romio/mpi-io/read_all.c:
>>>>>
>>>>>    if (e32_buf != NULL) {
>>>>>        error_code = MPIU_read_external32_conversion_fn(xbuf, datatype,
>>>>>                count, e32_buf);
>>>>>        ADIOI_Free(e32_buf);
>>>>>    }
>>>>>
>>>>> I think the fix is to change "xbuf" above to "buf" as it is in
>>>>> read.c below:
>>>>>
>>>>>    if (e32_buf != NULL) {
>>>>>        error_code = MPIU_read_external32_conversion_fn(buf, datatype,
>>>>>                count, e32_buf);
>>>>>        ADIOI_Free(e32_buf);
>>>>>    }
>>>>>
>>>>> When in external32 mode, xbuf == e32_buf, which acts as a temporary
>>>>> buffer in which to read the data.  The code is then meant to unpack
>>>>> and convert the data from the temporary buffer into the user buffer
>>>>> at buf.
>>>>>
>>>>> It's probably worth checking the other external32 code paths to
>>>>> look for similar bugs.
>>>>> Thanks,
>>>>> -Adam
>>>>>
>>>>
>>>
>>
>
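
The aliasing described in the quoted report can be illustrated without MPI.
What follows is only a minimal sketch under stated assumptions, not ROMIO
code: unpack_be32, read_ext32 and the use_bug flag are hypothetical names
invented for this illustration, the "file" is just a byte array, and the
conversion handles only 32-bit big-endian integers.  The real code path calls
MPIU_read_external32_conversion_fn and aborts on an assertion, whereas this
sketch returns an error code so the two cases can be compared.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the conversion step: unpack `count` 32-bit
 * big-endian integers from src into native integers at dst.
 * Returns 0 on success, -1 if dst and src are the same address
 * (the condition the memcpy overlap assertion reported). */
static int unpack_be32(int32_t *dst, const unsigned char *src, size_t count)
{
    if ((const void *)dst == (const void *)src)
        return -1;
    for (size_t i = 0; i < count; i++) {
        uint32_t v = ((uint32_t)src[4 * i] << 24) |
                     ((uint32_t)src[4 * i + 1] << 16) |
                     ((uint32_t)src[4 * i + 2] << 8) |
                     (uint32_t)src[4 * i + 3];
        memcpy(&dst[i], &v, sizeof v);   /* store the native-order value */
    }
    return 0;
}

/* Simulated read path (assumption: `file` holds count * 4 bytes).
 * use_bug != 0 passes xbuf as the destination, as in read_all.c;
 * use_bug == 0 passes the user buffer buf, as in read.c.
 * Returns 0 on success, -1 on aliasing, -2 on allocation failure. */
static int read_ext32(int32_t *buf, const unsigned char *file,
                      size_t count, int use_bug)
{
    assert(buf != NULL);
    if (count == 0)
        return 0;

    unsigned char *e32_buf = malloc(count * 4);
    if (e32_buf == NULL)
        return -2;

    void *xbuf = e32_buf;            /* external32 mode: xbuf == e32_buf */
    memcpy(xbuf, file, count * 4);   /* stands in for the actual file read */

    int rc = unpack_be32(use_bug ? (int32_t *)xbuf : buf, e32_buf, count);
    free(e32_buf);
    return rc;
}
```

With the bytes 00 00 00 01 00 00 01 00 as input and count = 2, the fixed path
(use_bug = 0) returns 0 and fills the user buffer with 1 and 256.  The buggy
path (use_bug = 1) returns -1 and leaves the user buffer untouched, because
the source and destination of the conversion are the same temporary buffer.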

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


