[mpich-discuss] Use of MPI derived data types / MPI file IO
jgrime at uchicago.edu
Mon Nov 19 09:25:00 CST 2012
Hi Wei-keng,
I checked this out with the latest MPICH2 installation I have on my desktop, and I
can indeed call MPI_Type_free() on a derived type that has not been committed;
no error code is returned.
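
For anyone who wants to check this themselves, the test boils down to something
like the following (a minimal sketch, not the exact program I ran):

    #include <mpi.h>
    #include <stdio.h>

    int main( int argc, char **argv )
    {
        MPI_Datatype t;
        int rc;

        MPI_Init( &argc, &argv );

        /* Build a derived type but deliberately skip MPI_Type_commit(). */
        MPI_Type_contiguous( 4, MPI_DOUBLE, &t );

        /* Free the uncommitted type; here this returned MPI_SUCCESS. */
        rc = MPI_Type_free( &t );
        printf( "MPI_Type_free returned %d (MPI_SUCCESS is %d)\n", rc, MPI_SUCCESS );

        MPI_Finalize();
        return 0;
    }
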
Maybe this could/should be clarified in the MPI standard at some point, although
I'm probably one of only 5 people who ever cared!
J.
---- Original message ----
>Date: Mon, 19 Nov 2012 09:17:36 -0600
>From: discuss-bounces at mpich.org (on behalf of Wei-keng Liao <wkliao at ece.northwestern.edu>)
>Subject: Re: [mpich-discuss] Use of MPI derived data types / MPI file IO
>To: discuss at mpich.org
>
>Hi, John,
>
>I did not find anything in the MPI standard specific to this question. However,
>if you check the man page of MPI_Type_free, one of the errors is
>
>    MPI_ERR_TYPE
>        Invalid datatype argument. May be an uncommitted MPI_Datatype (see MPI_Type_commit).
>
>So, it seems to imply one should only free the data types that have been committed.
>But I think the MPICH development team should confirm whether this is the case.
>
>
>Wei-keng
>
>
>On Nov 19, 2012, at 8:46 AM, <jgrime at uchicago.edu> wrote:
>
>> Hi Wei-keng,
>>
>> It now works! Thanks for the help!
>>
>> One last question to the list:
>>
>> I've looked at the MPI-2 standard documents, and I'm still a little confused as to
>> the precise semantics of MPI_Type_free(); as "MPI_Datatype" is an opaque type,
>> I'm assuming that there is a certain amount of background allocation going on
>> inside the MPI runtime when I call something like MPI_Type_create_struct() or
>> similar routines.
>>
>> Am I right in assuming that I should call MPI_Type_free() on *all* derived data
>> types I generate, even if they are not subsequently registered using
>> MPI_Type_commit()? I would imagine that any other behaviour is likely to lead to
>> memory leaks!
>>
>> Cheers,
>>
>> J.
>>
>> ---- Original message ----
>>> Date: Sun, 18 Nov 2012 18:58:37 -0600
>>> From: discuss-bounces at mpich.org (on behalf of Wei-keng Liao <wkliao at ece.northwestern.edu>)
>>> Subject: Re: [mpich-discuss] Use of MPI derived data types / MPI file IO
>>> To: discuss at mpich.org
>>>
>>> Hi, John,
>>>
>>> You certainly are on the right track to achieve that. Your code is almost
>>> there; only the call to MPI_File_set_view is incorrect. In fact, you don't need it.
>>>
>>> Try removing the call to MPI_File_set_view and replacing the MPI_File_write_all with:
>>>
>>>    MPI_File_write_at_all(f, offset, &atoms[0], (int)atoms.size(), mpi_atom_type_resized, &stat);
>>>
>>> On the reader side, you need to set the offset based on the new struct. Other than
>>> that, it is the same as the writer case (no need for MPI_File_set_view either).
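>>>
>>> Put together, the writer and reader might look roughly like this (just a sketch,
>>> reusing the variable names from your code; I have not compiled or run it):
>>>
>>>    // Writer: every rank writes its atoms at its own byte offset (as computed in
>>>    // your code: sum of preceding ranks' counts * type_size, plus sizeof(int)).
>>>    MPI_File_open( MPI_COMM_WORLD, fpath, MPI_MODE_CREATE | MPI_MODE_WRONLY,
>>>                   MPI_INFO_NULL, &f );
>>>    if( rank == 0 )
>>>        MPI_File_write( f, &global_N, 1, MPI_INT, &stat );
>>>    MPI_File_write_at_all( f, offset, &atoms[0], (int)atoms.size(),
>>>                           mpi_atom_type_resized, &stat );
>>>    MPI_File_close( &f );
>>>
>>>    // Reader (single process): skip the integer header, then read global_N structs.
>>>    MPI_File_open( MPI_COMM_SELF, fpath, MPI_MODE_RDONLY, MPI_INFO_NULL, &f );
>>>    MPI_File_read( f, &global_N, 1, MPI_INT, &stat );
>>>    atoms.resize( global_N );
>>>    MPI_File_read_at( f, sizeof(int), &atoms[0], global_N, mpi_atom_type_resized, &stat );
>>>    MPI_File_close( &f );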
>>>
>>> As for the portability issue, I would suggest you use high-level I/O libraries,
>>> such as PnetCDF.
>>>
>>> Wei-keng
>>>
>>> On Nov 18, 2012, at 12:38 PM, <jgrime at uchicago.edu> wrote:
>>>
>>>> Hi Wei-keng,
>>>>
>>>> That's a good point, thanks!
>>>>
>>>> However, I actually only want to save certain parts of the "atom" structure to file,
>>>> and saving the whole array as a raw dump could waste a lot of disk space.
>>>>
>>>> For example, the "atom" structure I used in the example code in reality contains
>>>> not only an integer and three contiguous doubles, but also at least another two
>>>> double[3] entries which I may not want to save to disk. As the full data set can
>>>> be hundreds of millions (or even billions) of "atom" structures, using a derived
>>>> data type with only a restricted subset of the data in each "atom" structure will
>>>> produce considerably smaller file sizes!
>>>>
>>>> There's also the problem of making the resultant file "portable" - raw memory
>>>> dumps could make life difficult in trying to use output files on machines with
>>>> different processor architectures. Once I get the derived data types working, I
>>>> can then switch from the "native" representation to something else ("external32"
>>>> etc.), which should allow me to create portable output files, provided I'm careful
>>>> with using MPI's file offset routines etc. if the file is larger than plain old 32-bit
>>>> offsets can handle.
>>>>
>>>> Cheers,
>>>>
>>>> J.
>>>>
>>>> ---- Original message ----
>>>>> Date: Sun, 18 Nov 2012 12:27:04 -0600
>>>>> From: discuss-bounces at mpich.org (on behalf of Wei-keng Liao <wkliao at ece.northwestern.edu>)
>>>>> Subject: Re: [mpich-discuss] Use of MPI derived data types / MPI file IO
>>>>> To: discuss at mpich.org
>>>>>
>>>>> Hi, John
>>>>>
>>>>> If your I/O is simply appending one process's data after another and the I/O
>>>>> buffers in memory are all contiguous, then you can simply do the following without
>>>>> defining MPI derived data types or setting the file view:
>>>>>
>>>>>    MPI_File_write_at_all(f, offset, &atoms[0], (int)atoms.size() * sizeof(struct atom),
>>>>>                          MPI_BYTE, &stat);
>>>>>
>>>>> Using derived data types is usually for when you have a noncontiguous buffer in
>>>>> memory or want to access non-contiguous data in the file.
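>>>>>
>>>>> Here "offset" would be the byte offset of each process's block in the file; a
>>>>> sketch, reusing the all_N counts your code already gathers with MPI_Allgather:
>>>>>
>>>>>    MPI_Offset offset = sizeof( int );   // skip the leading integer header
>>>>>    for( int i=0; i<rank; i++ )
>>>>>        offset += (MPI_Offset)all_N[i] * sizeof( struct atom );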
>>>>>
>>>>>
>>>>> Wei-keng
>>>>>
>>>>> On Nov 18, 2012, at 11:52 AM, <jgrime at uchicago.edu> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm having some problems with using derived data types and MPI parallel IO, and
>>>>>> was wondering if anyone could help. I tried to search the archives in case this
>>>>>> was covered earlier, but that just gave me "ht://Dig error" messages.
>>>>>>
>>>>>> Outline: I have written a C++ program where each MPI rank acts on data stored
>>>>>> in a local array of structures. The arrays are typically of different lengths on each
>>>>>> rank. I wish to write and read the contents of these arrays to disk using MPI's
>>>>>> parallel IO routines. The file format is simply an initial integer which describes
>>>>>> how many "structures" are in the file, followed by the data which represents the
>>>>>> "structure information" from all ranks (i.e. the total data set).
>>>>>>
>>>>>> So far, I've tried two different approaches: the first consists of each rank
>>>>>> serialising the contents of the local array of structures into a byte array, which is
>>>>>> then saved to file "f" using MPI_File_set_view( f, offset, MPI_CHAR, MPI_CHAR,
>>>>>> "native", MPI_INFO_NULL ) to skip the initial integer "header", and then a call to
>>>>>> MPI_File_write_all( f, local_bytearray, local_n_bytes, MPI_CHAR, &status ).
>>>>>> Here, "offset" is simply the size of an integer (in bytes) plus the sum of the
>>>>>> number of bytes each preceding rank wishes to write to the file (received via an
>>>>>> earlier MPI_Allgather call). This seems to work, as when I read the file back in
>>>>>> on a single MPI rank and deserialise the data into an array of structures I get
>>>>>> the results I expect.
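>>>>>>
>>>>>> In sketch form (error checking omitted; "global_N" is assumed to already hold the
>>>>>> total structure count, and local_bytearray / local_n_bytes the serialised local data):
>>>>>>
>>>>>>    // Gather each rank's byte count so every rank can compute its file offset.
>>>>>>    std::vector<int> all_bytes( nranks );
>>>>>>    MPI_Allgather( &local_n_bytes, 1, MPI_INT, &all_bytes[0], 1, MPI_INT,
>>>>>>        MPI_COMM_WORLD );
>>>>>>
>>>>>>    MPI_Offset offset = sizeof( int );   // skip the leading integer "header"
>>>>>>    for( int i=0; i<rank; i++ ) offset += all_bytes[i];
>>>>>>
>>>>>>    MPI_File f;
>>>>>>    MPI_Status status;
>>>>>>    MPI_File_open( MPI_COMM_WORLD, fpath, MPI_MODE_CREATE | MPI_MODE_WRONLY,
>>>>>>        MPI_INFO_NULL, &f );
>>>>>>    if( rank == 0 ) MPI_File_write( f, &global_N, 1, MPI_INT, &status );
>>>>>>    MPI_File_set_view( f, offset, MPI_CHAR, MPI_CHAR, (char *)"native", MPI_INFO_NULL );
>>>>>>    MPI_File_write_all( f, local_bytearray, local_n_bytes, MPI_CHAR, &status );
>>>>>>    MPI_File_close( &f );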
>>>>>>
>>>>>> The second approach is to use MPI's derived data types to create MPI
>>>>>> representations of the structures, and then treat the arrays of structures as MPI
>>>>>> data types. This allows me to avoid copying the local data into an intermediate
>>>>>> buffer etc., and seems the more elegant approach. I cannot, however, seem to
>>>>>> make this approach work.
>>>>>>
>>>>>> I'm pretty sure the problem lies in my use of the file views, but I'm not sure
>>>>>> where I'm going wrong. The reading of the integer "header" always works fine,
>>>>>> but the data that follows is garbled. I'm using the "native" data representation for
>>>>>> testing, but will likely change that to something more portable when I get this
>>>>>> code working.
>>>>>>
>>>>>> I've included the important excerpts of the test code I'm trying to use below
>>>>>> (with some printf()s and error handling etc. removed to make it a little more
>>>>>> concise). I have previously tested that std::vector allocates a contiguous flat
>>>>>> array of the appropriate data type in memory, so passing a pointer/reference to
>>>>>> the first element in such a data structure behaves the same way as simply
>>>>>> passing a conventional array of the appropriate data type:
>>>>>>
>>>>>> struct atom
>>>>>> {
>>>>>>     int global_id;
>>>>>>     double xyz[3];
>>>>>> };
>>>>>>
>>>>>> void write( char * fpath, std::vector<struct atom> &atoms, int rank, int nranks )
>>>>>> {
>>>>>>     /*
>>>>>>         Memory layout information for the structure we wish to convert into an
>>>>>>         MPI derived data type.
>>>>>>     */
>>>>>>     std::vector<int> s_blocklengths;
>>>>>>     std::vector<MPI_Aint> s_displacements;
>>>>>>     std::vector<MPI_Datatype> s_datatypes;
>>>>>>     MPI_Aint addr_start, addr;
>>>>>>     MPI_Datatype mpi_atom_type, mpi_atom_type_resized;
>>>>>>     int type_size;
>>>>>>
>>>>>>     struct atom a;
>>>>>>
>>>>>>     MPI_File f;
>>>>>>     MPI_Status stat;
>>>>>>     MPI_Offset offset;
>>>>>>     char *datarep = (char *)"native";
>>>>>>
>>>>>>     std::vector<int> all_N;
>>>>>>     int local_N, global_N;
>>>>>>
>>>>>>     /*
>>>>>>         Set up the structure data type: single integer, and 3 double precision floats.
>>>>>>         We use the temporary "a" structure to determine the layout of memory inside
>>>>>>         atom structures.
>>>>>>     */
>>>>>>     MPI_Get_address( &a, &addr_start );
>>>>>>
>>>>>>     s_blocklengths.push_back( 1 );
>>>>>>     s_datatypes.push_back( MPI_INT );
>>>>>>     MPI_Get_address( &a.global_id, &addr );
>>>>>>     s_displacements.push_back( addr - addr_start );
>>>>>>
>>>>>>     s_blocklengths.push_back( 3 );
>>>>>>     s_datatypes.push_back( MPI_DOUBLE );
>>>>>>     MPI_Get_address( &a.xyz[0], &addr );
>>>>>>     s_displacements.push_back( addr - addr_start );
>>>>>>
>>>>>>     MPI_Type_create_struct( (int)s_blocklengths.size(), &s_blocklengths[0],
>>>>>>         &s_displacements[0], &s_datatypes[0], &mpi_atom_type );
>>>>>>     MPI_Type_commit( &mpi_atom_type );
>>>>>>
>>>>>>     /*
>>>>>>         Take into account any compiler padding in creating an array of structures.
>>>>>>     */
>>>>>>     MPI_Type_create_resized( mpi_atom_type, 0, sizeof(struct atom),
>>>>>>         &mpi_atom_type_resized );
>>>>>>     MPI_Type_commit( &mpi_atom_type_resized );
>>>>>>
>>>>>>     MPI_Type_size( mpi_atom_type_resized, &type_size );
>>>>>>
>>>>>>     local_N = (int)atoms.size();
>>>>>>     all_N.resize( nranks );
>>>>>>
>>>>>>     MPI_Allgather( &local_N, 1, MPI_INT, &all_N[0], 1, MPI_INT, MPI_COMM_WORLD );
>>>>>>
>>>>>>     global_N = 0;
>>>>>>     for( size_t i=0; i<all_N.size(); i++ ) global_N += all_N[i];
>>>>>>
>>>>>>     offset = 0;
>>>>>>     for( int i=0; i<rank; i++ ) offset += all_N[i];
>>>>>>
>>>>>>     offset *= type_size;      // convert from structure counts -> bytes into file for true structure size
>>>>>>     offset += sizeof( int );  // skip leading integer (global_N) in file.
>>>>>>
>>>>>>     MPI_File_open( MPI_COMM_WORLD, fpath, MPI_MODE_CREATE | MPI_MODE_WRONLY,
>>>>>>         MPI_INFO_NULL, &f );
>>>>>>     if( rank == 0 )
>>>>>>     {
>>>>>>         MPI_File_write( f, &global_N, 1, MPI_INT, &stat );
>>>>>>     }
>>>>>>     MPI_File_set_view( f, offset, mpi_atom_type_resized,
>>>>>>         mpi_atom_type_resized, datarep, MPI_INFO_NULL );
>>>>>>
>>>>>>     MPI_File_write_all( f, &atoms[0], (int)atoms.size(), mpi_atom_type_resized, &stat );
>>>>>>     MPI_File_close( &f );
>>>>>>
>>>>>>     MPI_Type_free( &mpi_atom_type );
>>>>>>     MPI_Type_free( &mpi_atom_type_resized );
>>>>>>
>>>>>>     return;
>>>>>> }
>>>>>>
>>>>>> void read( char * fpath, std::vector<struct atom> &atoms )
>>>>>> {
>>>>>>     std::vector<int> s_blocklengths;
>>>>>>     std::vector<MPI_Aint> s_displacements;
>>>>>>     std::vector<MPI_Datatype> s_datatypes;
>>>>>>     MPI_Datatype mpi_atom_type, mpi_atom_type_resized;
>>>>>>
>>>>>>     struct atom a;
>>>>>>     MPI_Aint addr_start, addr;
>>>>>>
>>>>>>     MPI_File f;
>>>>>>     MPI_Status stat;
>>>>>>
>>>>>>     int global_N;
>>>>>>     char *datarep = (char *)"native";
>>>>>>
>>>>>>     int type_size;
>>>>>>     int errcode;
>>>>>>
>>>>>>     /*
>>>>>>         Set up the structure data type
>>>>>>     */
>>>>>>     MPI_Get_address( &a, &addr_start );
>>>>>>
>>>>>>     s_blocklengths.push_back( 1 );
>>>>>>     s_datatypes.push_back( MPI_INT );
>>>>>>     MPI_Get_address( &a.global_id, &addr );
>>>>>>     s_displacements.push_back( addr - addr_start );
>>>>>>
>>>>>>     s_blocklengths.push_back( 3 );
>>>>>>     s_datatypes.push_back( MPI_DOUBLE );
>>>>>>     MPI_Get_address( &a.xyz[0], &addr );
>>>>>>     s_displacements.push_back( addr - addr_start );
>>>>>>
>>>>>>     MPI_Type_create_struct( (int)s_blocklengths.size(), &s_blocklengths[0],
>>>>>>         &s_displacements[0], &s_datatypes[0], &mpi_atom_type );
>>>>>>     MPI_Type_commit( &mpi_atom_type );
>>>>>>
>>>>>>     /*
>>>>>>         Take into account any compiler padding in creating an array of structures.
>>>>>>     */
>>>>>>     MPI_Type_create_resized( mpi_atom_type, 0, sizeof(struct atom),
>>>>>>         &mpi_atom_type_resized );
>>>>>>     MPI_Type_commit( &mpi_atom_type_resized );
>>>>>>
>>>>>>     MPI_Type_size( mpi_atom_type_resized, &type_size );
>>>>>>
>>>>>>     MPI_File_open( MPI_COMM_SELF, fpath, MPI_MODE_RDONLY, MPI_INFO_NULL, &f );
>>>>>>
>>>>>>     MPI_File_read( f, &global_N, 1, MPI_INT, &stat );
>>>>>>
>>>>>>     atoms.clear();
>>>>>>     atoms.resize( global_N );
>>>>>>
>>>>>>     errcode = MPI_File_set_view( f, sizeof(int), mpi_atom_type_resized,
>>>>>>         mpi_atom_type_resized, datarep, MPI_INFO_NULL );
>>>>>>     errcode = MPI_File_read( f, &atoms[0], global_N, mpi_atom_type_resized, &stat );
>>>>>>     errcode = MPI_File_close( &f );
>>>>>>
>>>>>>     MPI_Type_free( &mpi_atom_type );
>>>>>>     MPI_Type_free( &mpi_atom_type_resized );
>>>>>>
>>>>>>     return;
>>>>>> }
>>>>>>
>>>>>> Calling MPI_Type_get_extent() and MPI_Type_get_true_extent() for both
>>>>>> mpi_atom_type and mpi_atom_type_resized returns (0, 32) bytes in all cases.
>>>>>> Calling MPI_Type_size() on both derived data types returns 28 bytes.
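>>>>>>
>>>>>> (For reference, those checks were along these lines - a sketch:)
>>>>>>
>>>>>>    MPI_Aint lb, extent, true_lb, true_extent;
>>>>>>    int size;
>>>>>>    MPI_Type_get_extent( mpi_atom_type_resized, &lb, &extent );                // (0, 32)
>>>>>>    MPI_Type_get_true_extent( mpi_atom_type_resized, &true_lb, &true_extent ); // (0, 32)
>>>>>>    MPI_Type_size( mpi_atom_type_resized, &size );                             // 28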
>>>>>>
>>>>>> If I call MPI_File_get_type_extent() on both derived data types after opening the
>>>>>> file, they both resolve to 32 bytes - so I think the problem is in the difference
>>>>>> between the data representation in memory and on disk. If I explicitly use 32
>>>>>> bytes in the offset calculation in the write() routine above, it still doesn't work.
>>>>>>
>>>>>> I'm finding it remarkably difficult to do something very simple using MPI's
>>>>>> derived data types and the parallel IO, and hence I'm guessing that I have
>>>>>> fundamentally misunderstood one or more aspects of this. If anyone can help
>>>>>> clarify where I'm going wrong, that would be much appreciated!
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> John.
>
>_______________________________________________
>discuss mailing list discuss at mpich.org
>To manage subscription options or unsubscribe:
>https://lists.mpich.org/mailman/listinfo/discuss