[mpich-discuss] Use of MPI derived data types / MPI file IO
Wei-keng Liao
wkliao at ece.northwestern.edu
Sun Nov 18 18:58:37 CST 2012
Hi, John,
You certainly are on the right track to achieve that. Your code is almost there;
only the call to MPI_File_set_view is incorrect, and in fact you don't need it at all.
Try removing the call to MPI_File_set_view and replace the MPI_File_write_all with:
MPI_File_write_at_all(f, offset, &atoms[0], (int)atoms.size(), mpi_atom_type_resized, &stat);
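
For example (an untested sketch reusing the variables from your write() below), the end of
the routine would become something like the following. If I am not mistaken, without a file
view the structs land packed in the file, so type_size (28 bytes here) per atom is the right
multiplier for the offset, just as in your existing code:

    offset = sizeof( int );                           /* skip the global_N header */
    for( int i = 0; i < rank; i++ )
        offset += (MPI_Offset)all_N[i] * type_size;   /* packed bytes of lower ranks */

    MPI_File_open( MPI_COMM_WORLD, fpath, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                   MPI_INFO_NULL, &f );
    if( rank == 0 )
        MPI_File_write( f, &global_N, 1, MPI_INT, &stat );

    MPI_File_write_at_all( f, offset, &atoms[0], (int)atoms.size(),
                           mpi_atom_type_resized, &stat );
    MPI_File_close( &f );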
On the reader side, you need to set the offset based on the new struct. Other than
that, it is the same as the writer case (no need for MPI_File_set_view there, either).
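
For a single reading process that could be as simple as (again untested, reusing the
variables from your read() routine):

    MPI_File_open( MPI_COMM_SELF, fpath, MPI_MODE_RDONLY, MPI_INFO_NULL, &f );
    MPI_File_read( f, &global_N, 1, MPI_INT, &stat );

    atoms.resize( global_N );

    /* everything sits packed right after the 4-byte header */
    MPI_File_read_at( f, sizeof( int ), &atoms[0], global_N,
                      mpi_atom_type_resized, &stat );
    MPI_File_close( &f );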
As for the portability issue, I would suggest using a high-level I/O library,
such as PnetCDF.
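
Just to give a flavour, here is a rough, untested PnetCDF sketch of the same writer
(variable names are made up, error checking is omitted, and ids[]/coords[] are assumed
to be separate contiguous arrays extracted from your structs):

    #include <pnetcdf.h>

    int ncid, dim_n, dim_xyz, v_id, v_xyz, dims[2];
    MPI_Offset start[2], count[2];

    ncmpi_create( MPI_COMM_WORLD, "atoms.nc", NC_CLOBBER | NC_64BIT_DATA,
                  MPI_INFO_NULL, &ncid );
    ncmpi_def_dim( ncid, "natoms", (MPI_Offset)global_N, &dim_n );
    ncmpi_def_dim( ncid, "xyz", 3, &dim_xyz );
    dims[0] = dim_n;  dims[1] = dim_xyz;
    ncmpi_def_var( ncid, "global_id", NC_INT,    1, &dim_n, &v_id  );
    ncmpi_def_var( ncid, "coords",    NC_DOUBLE, 2, dims,   &v_xyz );
    ncmpi_enddef( ncid );

    start[0] = my_start;  start[1] = 0;    /* my_start = atoms on all lower ranks */
    count[0] = local_N;   count[1] = 3;
    ncmpi_put_vara_int_all   ( ncid, v_id,  start, count, &ids[0]    );
    ncmpi_put_vara_double_all( ncid, v_xyz, start, count, &coords[0] );
    ncmpi_close( ncid );

The resulting file is self-describing and machine independent, so the byte-order and
file-offset concerns largely go away.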
Wei-keng
On Nov 18, 2012, at 12:38 PM, jgrime at uchicago.edu wrote:
> Hi Wei-keng,
>
> That's a good point, thanks!
>
> However, I actually only want to save certain parts of the "atom" structure to file,
> and saving the whole array as a raw dump could waste a lot of disk space.
>
> For example, the "atom" structure I used in the example code in reality contains
> not only an integer and three contiguous doubles, but also at least another two
> double[3] entries which I may not want to save to disk. As the full data set can
> be hundreds of millions (or even billions) of "atom" structures, using a derived
> data type with only a restricted subset of the data in each "atom" structure will
> produce considerably smaller file sizes!
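>
> For illustration, the real structure is closer to the following (the extra field names
> here are invented), and I would build the MPI struct type from only the first two members:
>
> struct real_atom
> {
>     int    global_id;
>     double xyz[3];
>     double vel[3];     /* extra per-atom data I don't want in the file */
>     double force[3];
> };
>
> int          blocklens[2] = { 1, 3 };
> MPI_Aint     base, disps[2];
> MPI_Datatype types[2] = { MPI_INT, MPI_DOUBLE };
> MPI_Datatype subset_type, subset_type_resized;
> struct real_atom a;
>
> MPI_Get_address( &a, &base );
> MPI_Get_address( &a.global_id, &disps[0] );
> MPI_Get_address( &a.xyz[0],    &disps[1] );
> disps[0] -= base;
> disps[1] -= base;
>
> MPI_Type_create_struct( 2, blocklens, disps, types, &subset_type );
> /* resize so the type steps over the unselected fields and any padding */
> MPI_Type_create_resized( subset_type, 0, sizeof(struct real_atom), &subset_type_resized );
> MPI_Type_commit( &subset_type_resized );
>
> As far as I understand it, using that as the memory datatype means only the selected
> fields ever get transferred, which is exactly what I'm after.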
>
> There's also the problem of making the resultant file "portable" - raw memory
> dumps could make life difficult in trying to use output files on machines with
> different processor architectures. Once I get the derived data types working, I
> can then switch from the "native" representation to something else ("external32"
> etc), which should allow me to create portable output files, provided I'm careful
> with using MPI's file offset routines etc. if the file is larger than plain old 32 bit
> offsets can handle.
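>
> (As far as I can tell, that switch would amount to changing the data representation
> string passed to MPI_File_set_view, e.g.
>
> MPI_File_set_view( f, offset, mpi_atom_type_resized, mpi_atom_type_resized,
>                    (char *)"external32", MPI_INFO_NULL );
>
> with any byte offsets then taken from MPI_File_get_type_extent() on the open file,
> since the external32 sizes may differ from the native, in-memory ones.)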
>
> Cheers,
>
> J.
>
> ---- Original message ----
>> Date: Sun, 18 Nov 2012 12:27:04 -0600
>> From: discuss-bounces at mpich.org (on behalf of Wei-keng Liao
>> <wkliao at ece.northwestern.edu>)
>> Subject: Re: [mpich-discuss] Use of MPI derived data types / MPI file IO
>> To: discuss at mpich.org
>>
>> Hi, John
>>
>> If your I/O is simply appending one process's data after another and the I/O
>> buffers in memory are all contiguous, then you can simply do the following without
>> defining MPI derived data types or setting the file view.
>>
>> MPI_File_write_at_all(f, offset, &atoms[0], (int)atoms.size() * sizeof(struct atom),
>>                       MPI_BYTE, &stat);
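>>
>> The byte offset itself can come from a prefix sum over the local byte counts,
>> for example (sketch):
>>
>> MPI_Offset my_bytes = (MPI_Offset)atoms.size() * sizeof(struct atom), offset;
>> MPI_Scan( &my_bytes, &offset, 1, MPI_OFFSET, MPI_SUM, MPI_COMM_WORLD );
>> offset -= my_bytes;        /* exclusive sum: bytes on all lower ranks */
>> offset += sizeof( int );   /* skip the leading header integer */
>>
>> or from an MPI_Allgather of the per-rank counts, as you already do.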
>>
>> Derived data types are usually for cases where you have a noncontiguous buffer in
>> memory or want to access non-contiguous data in the file.
>>
>>
>> Wei-keng
>>
>> On Nov 18, 2012, at 11:52 AM, jgrime at uchicago.edu wrote:
>>
>>> Hi all,
>>>
>>> I'm having some problems with using derived data types and MPI parallel IO, and
>>> was wondering if anyone could help. I tried to search the archives in case this
>>> was covered earlier, but that just gave me "ht://Dig error" messages.
>>>
>>> Outline: I have written a C++ program where each MPI rank acts on data stored
>>> in a local array of structures. The arrays are typically of different lengths on each
>>> rank. I wish to write and read the contents of these arrays to disk using MPI's
>>> parallel IO routines. The file format is simply an initial integer which describes
>>> how many "structures" are in the file, followed by the data which represents the
>>> "structure information" from all ranks (ie the total data set).
>>>
>>> So far, I've tried two different approaches: the first consists of each rank
>>> serialising the contents of the local array of structures into a byte array, which is
>>> then saved to file "f" using MPI_File_set_view( f, offset, MPI_CHAR, MPI_CHAR,
>>> "native", MPI_INFO_NULL ) to skip the initial integer "header", followed by a call to
>>> MPI_File_write_all( f, local_bytearray, local_n_bytes, MPI_CHAR, &status ). Here,
>>> "offset" is simply the size of an integer (in bytes) + the sum of the number of bytes
>>> each preceding rank wishes to write to the file (received via an earlier
>>> MPI_Allgather call). This seems to work, as when I read the file back in on a single
>>> MPI rank and deserialise the data into an array of structures I get the results I expect.
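>>>
>>> In code, that first approach is roughly the following (declarations and the
>>> serialisation itself omitted; local_bytearray / local_n_bytes hold the serialised data):
>>>
>>> MPI_File_open( MPI_COMM_WORLD, fpath, MPI_MODE_CREATE | MPI_MODE_WRONLY,
>>>     MPI_INFO_NULL, &f );
>>> if( rank == 0 ) MPI_File_write( f, &global_N, 1, MPI_INT, &status );
>>>
>>> /* offset = sizeof(int) + bytes written by all lower ranks */
>>> MPI_File_set_view( f, offset, MPI_CHAR, MPI_CHAR, (char *)"native", MPI_INFO_NULL );
>>> MPI_File_write_all( f, local_bytearray, local_n_bytes, MPI_CHAR, &status );
>>> MPI_File_close( &f );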
>>>
>>> The second approach is to use MPI's derived data types to create MPI
>>> representations of the structures, and then treat the arrays of structures as MPI
>>> data types. This allows me to avoid copying the local data into an intermediate
>>> buffer etc, and seems the more elegant approach. I cannot, however, seem to
>>> make this approach work.
>>>
>>> I'm pretty sure the problem lies in my use of the file views, but I'm not sure
>>> where I'm going wrong. The reading of the integer "header" always works fine,
>>> but the data that follows is garbled. I'm using the "native" data representation for
>>> testing, but will likely change that to something more portable when I get this
>>> code working.
>>>
>>> I've included the important excerpts of the test code I'm trying to use below
>>> (with some printf()s and error handling etc removed to make it a little more
>>> concise). I have previously tested that std::vector allocates a contiguous flat
>>> array of the appropriate data type in memory, so passing a pointer/reference to
>>> the first element in such a data structure behaves the same way as simply
>>> passing a conventional array of the appropriate data type:
>>>
>>> struct atom
>>> {
>>>     int    global_id;
>>>     double xyz[3];
>>> };
>>>
>>> void write( char * fpath, std::vector<struct atom> &atoms, int rank, int nranks )
>>> {
>>>     /*
>>>         Memory layout information for the structure we wish to convert into
>>>         an MPI derived data type.
>>>     */
>>>     std::vector<int> s_blocklengths;
>>>     std::vector<MPI_Aint> s_displacements;
>>>     std::vector<MPI_Datatype> s_datatypes;
>>>     MPI_Aint addr_start, addr;
>>>     MPI_Datatype mpi_atom_type, mpi_atom_type_resized;
>>>     int type_size;
>>>
>>>     struct atom a;
>>>
>>>     MPI_File f;
>>>     MPI_Status stat;
>>>     MPI_Offset offset;
>>>     char *datarep = (char *)"native";
>>>
>>>     std::vector<int> all_N;
>>>     int local_N, global_N;
>>>
>>>     /*
>>>         Set up the structure data type: single integer, and 3 double precision floats.
>>>         We use the temporary "a" structure to determine the layout of memory inside
>>>         atom structures.
>>>     */
>>>     MPI_Get_address( &a, &addr_start );
>>>
>>>     s_blocklengths.push_back( 1 );
>>>     s_datatypes.push_back( MPI_INT );
>>>     MPI_Get_address( &a.global_id, &addr );
>>>     s_displacements.push_back( addr - addr_start );
>>>
>>>     s_blocklengths.push_back( 3 );
>>>     s_datatypes.push_back( MPI_DOUBLE );
>>>     MPI_Get_address( &a.xyz[0], &addr );
>>>     s_displacements.push_back( addr - addr_start );
>>>
>>>     MPI_Type_create_struct( (int)s_blocklengths.size(), &s_blocklengths[0],
>>>         &s_displacements[0], &s_datatypes[0], &mpi_atom_type );
>>>     MPI_Type_commit( &mpi_atom_type );
>>>
>>>     /*
>>>         Take into account any compiler padding in creating an array of structures.
>>>     */
>>>     MPI_Type_create_resized( mpi_atom_type, 0, sizeof(struct atom),
>>>         &mpi_atom_type_resized );
>>>     MPI_Type_commit( &mpi_atom_type_resized );
>>>
>>>     MPI_Type_size( mpi_atom_type_resized, &type_size );
>>>
>>>     local_N = (int)atoms.size();
>>>     all_N.resize( nranks );
>>>
>>>     MPI_Allgather( &local_N, 1, MPI_INT, &all_N[0], 1, MPI_INT, MPI_COMM_WORLD );
>>>
>>>     global_N = 0;
>>>     for( size_t i=0; i<all_N.size(); i++ ) global_N += all_N[i];
>>>
>>>     offset = 0;
>>>     for( int i=0; i<rank; i++ ) offset += all_N[i];
>>>
>>>     offset *= type_size;     // convert from structure counts -> bytes into file for true structure size
>>>     offset += sizeof( int ); // skip leading integer (global_N) in file.
>>>
>>>     MPI_File_open( MPI_COMM_WORLD, fpath, MPI_MODE_CREATE | MPI_MODE_WRONLY,
>>>         MPI_INFO_NULL, &f );
>>>     if( rank == 0 )
>>>     {
>>>         MPI_File_write( f, &global_N, 1, MPI_INT, &stat );
>>>     }
>>>     MPI_File_set_view( f, offset, mpi_atom_type_resized, mpi_atom_type_resized,
>>>         datarep, MPI_INFO_NULL );
>>>
>>>     MPI_File_write_all( f, &atoms[0], (int)atoms.size(), mpi_atom_type_resized, &stat );
>>>     MPI_File_close( &f );
>>>
>>>     MPI_Type_free( &mpi_atom_type );
>>>     MPI_Type_free( &mpi_atom_type_resized );
>>>
>>>     return;
>>> }
>>>
>>> void read( char * fpath, std::vector<struct atom> &atoms )
>>> {
>>>     std::vector<int> s_blocklengths;
>>>     std::vector<MPI_Aint> s_displacements;
>>>     std::vector<MPI_Datatype> s_datatypes;
>>>     MPI_Datatype mpi_atom_type, mpi_atom_type_resized;
>>>
>>>     struct atom a;
>>>     MPI_Aint addr_start, addr;
>>>
>>>     MPI_File f;
>>>     MPI_Status stat;
>>>
>>>     int global_N;
>>>     char *datarep = (char *)"native";
>>>
>>>     int type_size;
>>>
>>>     /*
>>>         Set up the structure data type
>>>     */
>>>     MPI_Get_address( &a, &addr_start );
>>>
>>>     s_blocklengths.push_back( 1 );
>>>     s_datatypes.push_back( MPI_INT );
>>>     MPI_Get_address( &a.global_id, &addr );
>>>     s_displacements.push_back( addr - addr_start );
>>>
>>>     s_blocklengths.push_back( 3 );
>>>     s_datatypes.push_back( MPI_DOUBLE );
>>>     MPI_Get_address( &a.xyz[0], &addr );
>>>     s_displacements.push_back( addr - addr_start );
>>>
>>>     MPI_Type_create_struct( (int)s_blocklengths.size(), &s_blocklengths[0],
>>>         &s_displacements[0], &s_datatypes[0], &mpi_atom_type );
>>>     MPI_Type_commit( &mpi_atom_type );
>>>
>>>     /*
>>>         Take into account any compiler padding in creating an array of structures.
>>>     */
>>>     MPI_Type_create_resized( mpi_atom_type, 0, sizeof(struct atom),
>>>         &mpi_atom_type_resized );
>>>     MPI_Type_commit( &mpi_atom_type_resized );
>>>
>>>     MPI_Type_size( mpi_atom_type_resized, &type_size );
>>>
>>>     MPI_File_open( MPI_COMM_SELF, fpath, MPI_MODE_RDONLY, MPI_INFO_NULL, &f );
>>>
>>>     MPI_File_read( f, &global_N, 1, MPI_INT, &stat );
>>>
>>>     atoms.clear();
>>>     atoms.resize( global_N );
>>>
>>>     MPI_File_set_view( f, sizeof(int), mpi_atom_type_resized,
>>>         mpi_atom_type_resized, datarep, MPI_INFO_NULL );
>>>     MPI_File_read( f, &atoms[0], global_N, mpi_atom_type_resized, &stat );
>>>     MPI_File_close( &f );
>>>
>>>     MPI_Type_free( &mpi_atom_type );
>>>     MPI_Type_free( &mpi_atom_type_resized );
>>>
>>>     return;
>>> }
>>>
>>> Calling MPI_Type_get_extent() and MPI_Type_get_true_extent() for both
>>> mpi_atom_type and mpi_atom_type_resized returns (0,32) bytes in all cases.
>>> Calling MPI_Type_size() on both derived data types returns 28 bytes.
>>>
>>> If I call MPI_File_get_type_extent() on both derived data types after opening the
>>> file, they both resolve to 32 bytes - so I think the problem is in the difference
>>> between the data representation in memory and on disk. If I explicitly use 32
>>> bytes in the offset calculation in the write() routine above, it still doesn't work.
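>>>
>>> (For reference, the checks amount to something like:
>>>
>>> MPI_Aint lb, extent;
>>> int size;
>>> MPI_Type_get_extent( mpi_atom_type_resized, &lb, &extent );
>>> MPI_Type_size( mpi_atom_type_resized, &size );
>>> /* here: lb = 0, extent = 32 = sizeof(struct atom), size = 28 = 4 + 3*8 */
>>>
>>> so the in-memory extent matches sizeof(struct atom), while the "useful" data
>>> per struct is only 28 bytes.)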
>>>
>>> I'm finding it remarkably difficult to do something very simple using MPI's
>>> derived data types and the parallel IO, and hence I'm guessing that I have
>>> fundamentally misunderstood one or more aspects of this. If anyone can help
>>> clarify where I'm going wrong, that would be much appreciated!
>>>
>>> Cheers,
>>>
>>> John.