[mpich-discuss] [Fwd: MPI_File_read_at_all external32 bug in MV2-2.1]

Rob Latham robl at mcs.anl.gov
Mon Nov 9 15:12:28 CST 2015



On 11/06/2015 07:13 PM, Balaji, Pavan wrote:
> Adam,
>
> Not yet.  We were working heads-down on the mpich-3.2 release.
>
> I'm adding the mpich discuss list to cc, and poking @robl as well.  Do you have a test program that reproduces the error?  (sorry if you already sent it, I'm still catching up on the pending issues).
>

Adam, thanks for reporting. Since you're the first person I've ever met 
who cared about external32 support, let me give you a bit of background.

Our external32 support came from Intel, and we know about one bug: it 
shows up when the external32 representation of a type differs in size 
from its in-memory representation (MPI_LONG and MPI_AINT are perhaps the 
two most prominent examples). See:

- http://trac.mpich.org/projects/mpich/ticket/1754 (but look for 
'goodell' and 'robl'.  'jhammond' comments are less relevant for this 
discussion)

- http://trac.mpich.org/projects/mpich/ticket/1755
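
To illustrate the size mismatch, here is a minimal sketch in plain C. It 
is not MPICH code. It assumes external32 stores MPI_LONG as a 4-byte 
big-endian integer, while a native long is often 8 bytes; 
pack_long_ext32 and unpack_long_ext32 are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: external32 stores MPI_LONG in 4 bytes, big-endian. */
enum { EXT32_LONG_SIZE = 4 };

/* Pack one native long into its 4-byte big-endian form.
 * Returns 0 on success, -1 if the value does not fit in 32 bits. */
int pack_long_ext32(long value, unsigned char out[EXT32_LONG_SIZE])
{
    assert(out != NULL);
    if (value < INT32_MIN || value > INT32_MAX)
        return -1;

    int32_t narrow = (int32_t)value;
    uint32_t bits;
    memcpy(&bits, &narrow, sizeof bits);   /* reuse the two's complement bit pattern */

    out[0] = (unsigned char)(bits >> 24);
    out[1] = (unsigned char)((bits >> 16) & 0xFF);
    out[2] = (unsigned char)((bits >> 8) & 0xFF);
    out[3] = (unsigned char)(bits & 0xFF);
    return 0;
}

/* Unpack a 4-byte big-endian value into a native long (sign-extended). */
long unpack_long_ext32(const unsigned char in[EXT32_LONG_SIZE])
{
    assert(in != NULL);
    uint32_t bits = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
                    ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
    int32_t narrow;
    memcpy(&narrow, &bits, sizeof narrow);
    return (long)narrow;
}
```

The point of the sketch is that the file side and the memory side can 
need different byte counts for the same element count, so conversion 
code cannot size or step through both buffers with sizeof(long).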

We do have an external32 test case, but it's pretty simple and covers 
independent I/O only.

Open MPI found a bug in our external32 code, which sounds related to your 
problem:

http://git.mpich.org/mpich.git/commitdiff/53f11fd73b2f9aa7b994e78ed05e9ae74264c63f

So you can see why Pavan was hoping you had a test case: it's taken a 
few tries to get it right.
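
For reference, the buffer handling Adam describes in the quoted report 
below can be sketched in plain C, with no MPI involved. This is only an 
illustration under assumptions: convert_from_ext32 and read_ext32 are 
hypothetical stand-ins, the "file" is an in-memory byte array, and the 
elements are 4-byte big-endian integers. The names buf, xbuf, and 
e32_buf mirror the ones in the report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical conversion step: unpack `count` 4-byte big-endian integers
 * from src into native int32_t values at dst. Source and destination must
 * be distinct buffers; returns -1 if they are the same address. */
int convert_from_ext32(int32_t *dst, const unsigned char *src, size_t count)
{
    if ((const void *)dst == (const void *)src)
        return -1;   /* the xbuf == e32_buf situation from the report */

    for (size_t i = 0; i < count; i++) {
        const unsigned char *p = src + 4 * i;
        uint32_t bits = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                        ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
        memcpy(&dst[i], &bits, sizeof bits);
    }
    return 0;
}

/* Hypothetical read path. With use_user_buf != 0 the conversion targets the
 * user buffer (the proposed fix); otherwise it targets xbuf (the reported bug). */
int read_ext32(int32_t *buf, const unsigned char *file_bytes, size_t count,
               int use_user_buf)
{
    assert(buf != NULL && file_bytes != NULL);
    if (count == 0)
        return 0;

    unsigned char *e32_buf = malloc(count * 4);
    if (e32_buf == NULL)
        return -2;

    void *xbuf = e32_buf;                 /* in external32 mode, reads land here */
    memcpy(xbuf, file_bytes, count * 4);  /* stands in for the actual file read */

    int rc = use_user_buf
                 ? convert_from_ext32(buf, e32_buf, count)
                 : convert_from_ext32((int32_t *)xbuf, e32_buf, count);

    free(e32_buf);
    return rc;
}
```

Calling read_ext32 with use_user_buf set to 1 fills the caller's buffer, 
while 0 fails the same-address check, which is the shape of the 
assertion failure in the report.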

==rob

>    -- Pavan
>
>> On Nov 6, 2015, at 2:10 PM, Adam T. Moody <moody20 at llnl.gov> wrote:
>>
>> Hi Pavan,
>> Did you have a chance to look into this in MPICH?
>>
>> I'm pretty certain there is a lurking ROMIO bug here.
>> -Adam
>>
>>
>> Adam T. Moody wrote:
>>
>>> Hi Pavan and Howard,
>>> FYI, it looks like this same bug is in MPICH-3.2rc1 and Open MPI-1.10.0.
>>>
>>> I guess no one uses collective I/O with external32 :-)
>>> -Adam
>>>
>>> ------------------------------------------------------------------------
>>>
>>> Subject:
>>> MPI_File_read_at_all external32 bug in MV2-2.1
>>> From:
>>> "Adam T. Moody" <moody20 at llnl.gov>
>>> Date:
>>> Fri, 30 Oct 2015 11:48:35 -0700
>>> To:
>>> "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at cse.ohio-state.edu>
>>>
>>>
>>> Hello MVAPICH team,
>>> I've hit a bug in MPI_File_read_at_all in MVAPICH2-2.1.  I have an application that reads and writes files in external32 format.  It writes the file just fine, but it throws the following error when reading the file back:
>>>
>>> internal ABORT - process 0
>>> srun: error: rzmerl2: task 0: Exited with exit code 1
>>> Assertion failed in file src/mpid/common/datatype/mpid_ext32_segment.c at line 277: FALSE
>>> memcpy argument memory ranges overlap, dst_=0x2aaab5c160a8 src_=0x2aaab5c160a8 len_=176
>>>
>>> Above, MPI is trying to do a memcpy where the source and destination buffer are the same address.  Looking through the code for MVAPICH2-2.1, the problem seems to be at line 132 in src/mpi/romio/mpi-io/read_all.c:
>>>
>>>    if (e32_buf != NULL) {
>>>        error_code = MPIU_read_external32_conversion_fn(xbuf, datatype,
>>>                count, e32_buf);
>>>        ADIOI_Free(e32_buf);
>>>    }
>>>
>>> I think the fix is to change "xbuf" above to "buf" as it is in read.c below:
>>>
>>>    if (e32_buf != NULL) {
>>>        error_code = MPIU_read_external32_conversion_fn(buf, datatype,
>>>                count, e32_buf);
>>>        ADIOI_Free(e32_buf);
>>>    }
>>>
>>> When in external32 mode, xbuf == e32_buf, which acts as a temporary buffer in which to read the data.  The code is then meant to unpack and convert the data from the temporary buffer into the user buffer at buf.
>>>
>>> It's probably worth checking the other external32 code paths to look for similar bugs.
>>> Thanks,
>>> -Adam
>>>
>>
>

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


