[mpich-discuss] Use of MPI derived data types / MPI file IO
jgrime at uchicago.edu
Mon Nov 19 08:46:44 CST 2012
Hi Wei-keng,
It now works! Thanks for the help!
One last question to the list:
I've looked at the MPI-2 standard documents, and I'm still a little confused as to
the precise semantics of MPI_Type_free(); as "MPI_Datatype" is an opaque type,
I'm assuming that there is a certain amount of background allocation going on
inside the MPI runtime when I call something like MPI_Type_create_struct() or
similar routines.
Am I right in assuming that I should call MPI_Type_free() on *all* derived data
types I generate, even if they are not subsequently registered using
MPI_Type_commit()? I would imagine that any other behaviour is likely to lead to
memory leaks!
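
For concreteness, the pattern I have in mind is something like this (just a
sketch, using offsetof for brevity rather than the MPI_Get_address approach in
my real code):

#include <mpi.h>
#include <cstddef>

struct atom { int global_id; double xyz[3]; };

void make_and_release_types()
{
    int          blocklengths[2]  = { 1, 3 };
    MPI_Aint     displacements[2] = { offsetof(struct atom, global_id),
                                      offsetof(struct atom, xyz) };
    MPI_Datatype datatypes[2]     = { MPI_INT, MPI_DOUBLE };
    MPI_Datatype mpi_atom_type, mpi_atom_type_resized;

    MPI_Type_create_struct( 2, blocklengths, displacements, datatypes, &mpi_atom_type );
    MPI_Type_create_resized( mpi_atom_type, 0, sizeof(struct atom), &mpi_atom_type_resized );
    MPI_Type_commit( &mpi_atom_type_resized ); // only the type actually used in I/O is committed

    // ... use mpi_atom_type_resized ...

    // Is this right: free *both* types, even though mpi_atom_type was never committed?
    MPI_Type_free( &mpi_atom_type );
    MPI_Type_free( &mpi_atom_type_resized );
}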
Cheers,
J.
---- Original message ----
>Date: Sun, 18 Nov 2012 18:58:37 -0600
>From: discuss-bounces at mpich.org (on behalf of Wei-keng Liao <wkliao at ece.northwestern.edu>)
>Subject: Re: [mpich-discuss] Use of MPI derived data types / MPI file IO
>To: discuss at mpich.org
>
>Hi, John,
>
>You certainly are on the right track to achieve that. Your code is almost there;
>only the call to MPI_File_set_view is incorrect. In fact, you don't need it.
>
>Try removing the call to MPI_File_set_view and replacing the MPI_File_write_all with:
>MPI_File_write_at_all(f, offset, &atoms[0], (int)atoms.size(), mpi_atom_type_resized, &stat);
>
>On the reader side, you need to set the offset based on the new struct. Other than
>that, it is the same as the writer case (no need for MPI_File_set_view either).
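>
>For example, the read in your read() routine could then become something like
>(just a sketch, keeping your single-rank reader and the integer header):
>
>MPI_File_read_at(f, sizeof(int), &atoms[0], global_N, mpi_atom_type_resized, &stat);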
>
>As for the portability issue, I would suggest you use high-level I/O libraries,
>such as PnetCDF.
>
>Wei-keng
>
>On Nov 18, 2012, at 12:38 PM, <jgrime at uchicago.edu> wrote:
>
>> Hi Wei-keng,
>>
>> That's a good point, thanks!
>>
>> However, I actually only want to save certain parts of the "atom" structure to
>> file, and saving the whole array as a raw dump could waste a lot of disk space.
>>
>> For example, the "atom" structure I used in the example code in reality contains
>> not only an integer and three contiguous doubles, but also at least another two
>> double[3] entries which I may not want to save to disk. As the full data set can
>> be hundreds of millions (or even billions) of "atom" structures, using a derived
>> data type with only a restricted subset of the data in each "atom" structure will
>> produce considerably smaller file sizes!
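>>
>> Roughly what I have in mind is something like the following sketch ("vel" and
>> "force" here just stand in for the extra double[3] members, and I'd build the
>> displacements with MPI_Get_address rather than offsetof in the real code):
>>
>> #include <mpi.h>
>> #include <cstddef>
>>
>> struct atom
>> {
>>     int    global_id;
>>     double xyz[3];
>>     double vel[3];    // not saved to disk
>>     double force[3];  // not saved to disk
>> };
>>
>> MPI_Datatype make_file_type()
>> {
>>     // Describe only the members we want in the file...
>>     int          blocklengths[2]  = { 1, 3 };
>>     MPI_Aint     displacements[2] = { offsetof(struct atom, global_id),
>>                                       offsetof(struct atom, xyz) };
>>     MPI_Datatype datatypes[2]     = { MPI_INT, MPI_DOUBLE };
>>     MPI_Datatype subset_type, file_type;
>>
>>     MPI_Type_create_struct( 2, blocklengths, displacements, datatypes, &subset_type );
>>
>>     // ...but keep the in-memory stride of the full structure, so an array of
>>     // "atom" can be handed straight to the write call.
>>     MPI_Type_create_resized( subset_type, 0, sizeof(struct atom), &file_type );
>>     MPI_Type_commit( &file_type );
>>
>>     MPI_Type_free( &subset_type );
>>     return file_type;
>> }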
>>
>> There's also the problem of making the resultant file "portable" - raw memory
>> dumps could make life difficult in trying to use output files on machines with
>> different processor architectures. Once I get the derived data types working, I
>> can then switch from the "native" representation to something else ("external32"
>> etc), which should allow me to create portable output files, provided I'm careful
>> with using MPI's file offset routines etc if the file is larger than plain old 32-bit
>> offsets can handle.
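>>
>> I'm assuming that switch is mostly just the data representation string changing
>> in the write()/read() routines, e.g.
>>
>> char *datarep = (char *)"external32"; // instead of "native"
>> MPI_File_set_view( f, offset, mpi_atom_type_resized, mpi_atom_type_resized,
>>                    datarep, MPI_INFO_NULL );
>>
>> with the file offsets then computed from MPI_File_get_type_extent() rather than
>> from the in-memory type size.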
>>
>> Cheers,
>>
>> J.
>>
>> ---- Original message ----
>>> Date: Sun, 18 Nov 2012 12:27:04 -0600
>>> From: discuss-bounces at mpich.org (on behalf of Wei-keng Liao <wkliao at ece.northwestern.edu>)
>>> Subject: Re: [mpich-discuss] Use of MPI derived data types / MPI file IO
>>> To: discuss at mpich.org
>>>
>>> Hi, John
>>>
>>> If your I/O is simply appending one process's data after another and the I/O
>>> buffers in memory are all contiguous, then you can simply do the following
>>> without defining MPI derived data types or setting the file view.
>>>
>>> MPI_File_write_at_all(f, offset, &atoms[0],
>>>     (int)atoms.size() * sizeof(struct atom), MPI_BYTE, &stat);
>>>
>>> Using derived data types is usually for cases where you have a noncontiguous
>>> buffer in memory or want to access non-contiguous data in files.
>>>
>>>
>>> Wei-keng
>>>
>>> On Nov 18, 2012, at 11:52 AM, <jgrime at uchicago.edu> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm having some problems with using derived data types and MPI parallel IO,
>>>> and was wondering if anyone could help. I tried to search the archives in case
>>>> this was covered earlier, but that just gave me "ht://Dig error" messages.
>>>>
>>>> Outline: I have written a C++ program where each MPI rank acts on data stored
>>>> in a local array of structures. The arrays are typically of different lengths on
>>>> each rank. I wish to write and read the contents of these arrays to disk using
>>>> MPI's parallel IO routines. The file format is simply an initial integer which
>>>> describes how many "structures" are in the file, followed by the data which
>>>> represents the "structure information" from all ranks (i.e. the total data set).
>>>>
>>>> So far, I've tried two different approaches: the first consists of each rank
>>>> serialising the contents of the local array of structures into a byte array, which
>>>> is then saved to file "f" using MPI_File_set_view( f, offset, MPI_CHAR, MPI_CHAR,
>>>> "native", MPI_INFO_NULL ) to skip the initial integer "header", and then a call to
>>>> MPI_File_write_all( f, local_bytearray, local_n_bytes, MPI_CHAR, &status ).
>>>> Here, "offset" is simply the size of an integer (in bytes) + the summation of the
>>>> number of bytes each preceding rank wishes to write to the file (received via an
>>>> earlier MPI_Allgather call). This seems to work, as when I read the file back in
>>>> on a single MPI rank and deserialise the data into an array of structures I get
>>>> the results I expect.
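>>>>
>>>> (In code, the core of that first approach is just:
>>>>
>>>> MPI_File_set_view( f, offset, MPI_CHAR, MPI_CHAR, (char *)"native", MPI_INFO_NULL );
>>>> MPI_File_write_all( f, local_bytearray, local_n_bytes, MPI_CHAR, &status );
>>>>
>>>> with local_bytearray holding each rank's serialised structures.)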
>>>>
>>>> The second approach is to use MPI's derived data types to create MPI
>>>> representations of the structures, and then treat the arrays of structures as
>>>> MPI data types. This allows me to avoid copying the local data into an
>>>> intermediate buffer etc, and seems the more elegant approach. I cannot,
>>>> however, seem to make this approach work.
>>>>
>>>> I'm pretty sure the problem lies in my use of the file views, but I'm not sure
>>>> where I'm going wrong. The reading of the integer "header" always works fine,
>>>> but the subsequent data is garbled. I'm using the "native" data representation
>>>> for testing, but will likely change that to something more portable when I get
>>>> this code working.
>>>>
>>>> I've included the important excerpts of the test code I'm trying to use below
>>>> (with some printf()s and error handling etc removed to make it a little more
>>>> concise). I have previously tested that std::vector allocates a contiguous flat
>>>> array of the appropriate data type in memory, so passing a pointer/reference to
>>>> the first element in such a data structure behaves the same way as simply
>>>> passing a conventional array of the appropriate data type:
>>>>
>>>> struct atom
>>>> {
>>>>     int global_id;
>>>>     double xyz[3];
>>>> };
>>>>
>>>> void write( char * fpath, std::vector<struct atom> &atoms, int rank, int nranks )
>>>> {
>>>>     /*
>>>>         Memory layout information for the structure we wish to convert into
>>>>         an MPI derived data type.
>>>>     */
>>>>     std::vector<int> s_blocklengths;
>>>>     std::vector<MPI_Aint> s_displacements;
>>>>     std::vector<MPI_Datatype> s_datatypes;
>>>>     MPI_Aint addr_start, addr;
>>>>     MPI_Datatype mpi_atom_type, mpi_atom_type_resized;
>>>>     int type_size;
>>>>
>>>>     struct atom a;
>>>>
>>>>     MPI_File f;
>>>>     MPI_Status stat;
>>>>     MPI_Offset offset;
>>>>     char *datarep = (char *)"native";
>>>>
>>>>     std::vector<int> all_N;
>>>>     int local_N, global_N;
>>>>
>>>>     /*
>>>>         Set up the structure data type: single integer, and 3 double precision
>>>>         floats. We use the temporary "a" structure to determine the layout of
>>>>         memory inside atom structures.
>>>>     */
>>>>     MPI_Get_address( &a, &addr_start );
>>>>
>>>>     s_blocklengths.push_back( 1 );
>>>>     s_datatypes.push_back( MPI_INT );
>>>>     MPI_Get_address( &a.global_id, &addr );
>>>>     s_displacements.push_back( addr - addr_start );
>>>>
>>>>     s_blocklengths.push_back( 3 );
>>>>     s_datatypes.push_back( MPI_DOUBLE );
>>>>     MPI_Get_address( &a.xyz[0], &addr );
>>>>     s_displacements.push_back( addr - addr_start );
>>>>
>>>>     MPI_Type_create_struct( (int)s_blocklengths.size(), &s_blocklengths[0],
>>>>         &s_displacements[0], &s_datatypes[0], &mpi_atom_type );
>>>>     MPI_Type_commit( &mpi_atom_type );
>>>>
>>>>     /*
>>>>         Take into account any compiler padding in creating an array of structures.
>>>>     */
>>>>     MPI_Type_create_resized( mpi_atom_type, 0, sizeof(struct atom),
>>>>         &mpi_atom_type_resized );
>>>>     MPI_Type_commit( &mpi_atom_type_resized );
>>>>
>>>>     MPI_Type_size( mpi_atom_type_resized, &type_size );
>>>>
>>>>     local_N = (int)atoms.size();
>>>>     all_N.resize( nranks );
>>>>
>>>>     MPI_Allgather( &local_N, 1, MPI_INT, &all_N[0], 1, MPI_INT, MPI_COMM_WORLD );
>>>>
>>>>     global_N = 0;
>>>>     for( size_t i=0; i<all_N.size(); i++ ) global_N += all_N[i];
>>>>
>>>>     offset = 0;
>>>>     for( int i=0; i<rank; i++ ) offset += all_N[i];
>>>>
>>>>     // convert from structure counts -> bytes into file for true structure size
>>>>     offset *= type_size;
>>>>     offset += sizeof( int ); // skip leading integer (global_N) in file.
>>>>
>>>>     MPI_File_open( MPI_COMM_WORLD, fpath, MPI_MODE_CREATE | MPI_MODE_WRONLY,
>>>>         MPI_INFO_NULL, &f );
>>>>     if( rank == 0 )
>>>>     {
>>>>         MPI_File_write( f, &global_N, 1, MPI_INT, &stat );
>>>>     }
>>>>     MPI_File_set_view( f, offset, mpi_atom_type_resized, mpi_atom_type_resized,
>>>>         datarep, MPI_INFO_NULL );
>>>>
>>>>     MPI_File_write_all( f, &atoms[0], (int)atoms.size(), mpi_atom_type_resized, &stat );
>>>>     MPI_File_close( &f );
>>>>
>>>>     MPI_Type_free( &mpi_atom_type );
>>>>     MPI_Type_free( &mpi_atom_type_resized );
>>>>
>>>>     return;
>>>> }
>>>>
>>>> void read( char * fpath, std::vector<struct atom> &atoms )
>>>> {
>>>>     std::vector<int> s_blocklengths;
>>>>     std::vector<MPI_Aint> s_displacements;
>>>>     std::vector<MPI_Datatype> s_datatypes;
>>>>     MPI_Datatype mpi_atom_type, mpi_atom_type_resized;
>>>>
>>>>     struct atom a;
>>>>     MPI_Aint addr_start, addr;
>>>>
>>>>     MPI_File f;
>>>>     MPI_Status stat;
>>>>
>>>>     int global_N;
>>>>     char *datarep = (char *)"native";
>>>>
>>>>     int type_size;
>>>>     int errcode;
>>>>
>>>>     /*
>>>>         Set up the structure data type
>>>>     */
>>>>     MPI_Get_address( &a, &addr_start );
>>>>
>>>>     s_blocklengths.push_back( 1 );
>>>>     s_datatypes.push_back( MPI_INT );
>>>>     MPI_Get_address( &a.global_id, &addr );
>>>>     s_displacements.push_back( addr - addr_start );
>>>>
>>>>     s_blocklengths.push_back( 3 );
>>>>     s_datatypes.push_back( MPI_DOUBLE );
>>>>     MPI_Get_address( &a.xyz[0], &addr );
>>>>     s_displacements.push_back( addr - addr_start );
>>>>
>>>>     MPI_Type_create_struct( (int)s_blocklengths.size(), &s_blocklengths[0],
>>>>         &s_displacements[0], &s_datatypes[0], &mpi_atom_type );
>>>>     MPI_Type_commit( &mpi_atom_type );
>>>>
>>>>     /*
>>>>         Take into account any compiler padding in creating an array of structures.
>>>>     */
>>>>     MPI_Type_create_resized( mpi_atom_type, 0, sizeof(struct atom),
>>>>         &mpi_atom_type_resized );
>>>>     MPI_Type_commit( &mpi_atom_type_resized );
>>>>
>>>>     MPI_Type_size( mpi_atom_type_resized, &type_size );
>>>>
>>>>     MPI_File_open( MPI_COMM_SELF, fpath, MPI_MODE_RDONLY, MPI_INFO_NULL, &f );
>>>>
>>>>     MPI_File_read( f, &global_N, 1, MPI_INT, &stat );
>>>>
>>>>     atoms.clear();
>>>>     atoms.resize( global_N );
>>>>
>>>>     errcode = MPI_File_set_view( f, sizeof(int), mpi_atom_type_resized,
>>>>         mpi_atom_type_resized, datarep, MPI_INFO_NULL );
>>>>     errcode = MPI_File_read( f, &atoms[0], global_N, mpi_atom_type_resized, &stat );
>>>>     errcode = MPI_File_close( &f );
>>>>
>>>>     MPI_Type_free( &mpi_atom_type );
>>>>     MPI_Type_free( &mpi_atom_type_resized );
>>>>
>>>>     return;
>>>> }
>>>>
>>>> Calling MPI_Type_get_extent() and MPI_Type_get_true_extent() for both
>>>> mpi_atom_type and mpi_atom_type_resized returns (0,32) bytes in all cases.
>>>> Calling MPI_Type_size() on both derived data types returns 28 bytes.
>>>>
>>>> If I call MPI_File_get_type_extent() on both derived data types after opening
>>>> the file, they both resolve to 32 bytes - so I think the problem is in the
>>>> difference between the data representation in memory and on disk. If I
>>>> explicitly use 32 bytes in the offset calculation in the write() routine above,
>>>> it still doesn't work.
>>>>
>>>> I'm finding it remarkably difficult to do something very simple using MPI's
>>>> derived data types and the parallel IO, and hence I'm guessing that I have
>>>> fundamentally misunderstood one or more aspects of this. If anyone can help
>>>> clarify where I'm going wrong, that would be much appreciated!
>>>>
>>>> Cheers,
>>>>
>>>> John.