[mpich-discuss] Use of MPI derived data types / MPI file IO

jgrime at uchicago.edu jgrime at uchicago.edu
Sun Nov 18 12:38:30 CST 2012


Hi Wei-keng,

That's a good point, thanks!

However, I actually only want to save certain parts of the "atom" structure to file, 
and saving the whole array as a raw dump could waste a lot of disk space.

For example, the "atom" structure I used in the example code in reality contains 
not only an integer and three contiguous doubles, but also at least another two 
double[3] entries which I may not want to save to disk. As the full data set can 
be hundreds of millions (or even billions) of "atom" structures, using a derived 
data type with only a restricted subset of the data in each "atom" structure will 
produce considerably smaller file sizes!
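
Roughly, what I have in mind is something like the following sketch (the vel/force
members are just stand-ins for those extra double[3] entries, and all the names are
made up). The derived type only describes the members I want on disk, while the
resized extent still strides over the whole in-memory structure:

struct atom
{
	int    global_id;
	double xyz[3];
	double vel[3];    /* extra data I may not want to save */
	double force[3];  /* extra data I may not want to save */
};

/* Describe only global_id and xyz, displacements measured from a temporary "a". */
int          blocklengths[2] = { 1, 3 };
MPI_Datatype datatypes[2]    = { MPI_INT, MPI_DOUBLE };
MPI_Aint     displacements[2], addr_start, addr;
MPI_Datatype subset_type, subset_type_resized;
struct atom  a;

MPI_Get_address( &a, &addr_start );
MPI_Get_address( &a.global_id, &addr );  displacements[0] = addr - addr_start;
MPI_Get_address( &a.xyz[0],    &addr );  displacements[1] = addr - addr_start;

MPI_Type_create_struct( 2, blocklengths, displacements, datatypes, &subset_type );

/* Resize so consecutive array elements are stepped over correctly when this is
   used as the datatype for the in-memory buffer. */
MPI_Type_create_resized( subset_type, 0, sizeof(struct atom), &subset_type_resized );
MPI_Type_commit( &subset_type_resized );

The resized type only governs how data is gathered from memory, of course; how tightly
it ends up packed on disk is down to the filetype used in the file view.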

There's also the problem of making the resultant file "portable" - raw memory 
dumps could make life difficult in trying to use output files on machines with 
different processor architectures. Once I get the derived data types working, I 
can then switch from the "native" representation to something else ("external32" 
etc), which should allow me to create portable output files, provided I'm careful 
with using MPI's file offset routines etc. if the file is larger than plain old 32-bit
offsets can handle.
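
As a first pass (just a sketch, and the leading integer header would need the same
treatment), the only change should be the data representation string handed to
MPI_File_set_view(), plus computing offsets from the external32 sizes rather than from
sizeof() on the local machine:

char *datarep = (char *)"external32";
MPI_Offset disp = 4;  /* an MPI_INT is fixed at 4 bytes in external32 */

/* MPI_Offset is 64 bits wide, so large files are fine provided all offset
   arithmetic is done in MPI_Offset rather than int. */
MPI_File_set_view( f, disp, mpi_atom_type_resized, mpi_atom_type_resized, datarep, MPI_INFO_NULL );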

Cheers,

J.

---- Original message ----
>Date: Sun, 18 Nov 2012 12:27:04 -0600
>From: discuss-bounces at mpich.org (on behalf of Wei-keng Liao <wkliao at ece.northwestern.edu>)
>Subject: Re: [mpich-discuss] Use of MPI derived data types / MPI file IO  
>To: discuss at mpich.org
>
>Hi, John
>
>If your I/O is simply appending one process's data after another and the I/O buffers
>in memory are all contiguous, then you can simply do the following without defining
>MPI derived data types or setting the file view.
>
>MPI_File_write_at_all(f, offset, &atoms[0], (int)atoms.size() * sizeof(struct atom), MPI_BYTE, &stat);
>
>Derived data types are usually needed when you have a noncontiguous buffer in memory
>or want to access non-contiguous data in files.
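>
>For example, each rank's byte offset for that call could come from an exclusive prefix
>sum. A sketch, assuming the same std::vector<struct atom> "atoms" and "rank" as in
>John's write() routine:
>
>MPI_Offset nbytes = (MPI_Offset)( atoms.size() * sizeof(struct atom) );
>MPI_Offset offset = 0;
>
>/* offset = total number of bytes written by all lower ranks */
>MPI_Exscan( &nbytes, &offset, 1, MPI_OFFSET, MPI_SUM, MPI_COMM_WORLD );
>if( rank == 0 ) offset = 0;  /* MPI_Exscan leaves rank 0's result undefined */
>
>offset += sizeof( int );  /* skip the leading integer header */
>
>The MPI_File_write_at_all() call above can then be used with that offset as-is.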
>
>
>Wei-keng
>
>On Nov 18, 2012, at 11:52 AM, <jgrime at uchicago.edu> <jgrime at uchicago.edu> wrote:
>
>> Hi all,
>> 
>> I'm having some problems with using derived data types and MPI parallel IO, and was
>> wondering if anyone could help. I tried to search the archives in case this was
>> covered earlier, but that just gave me "ht://Dig error" messages.
>> 
>> Outline: I have written a C++ program where each MPI rank acts on data stored in a
>> local array of structures. The arrays are typically of different lengths on each
>> rank. I wish to write and read the contents of these arrays to disk using MPI's
>> parallel IO routines. The file format is simply an initial integer which describes
>> how many "structures" are in the file, followed by the data which represents the
>> "structure information" from all ranks (ie the total data set).
>> 
>> So far, I've tried two different approaches: the first consists of each rank
>> serialising the contents of the local array of structures into a byte array, which is
>> then saved to file "f" using MPI_File_set_view( f, offset, MPI_CHAR, MPI_CHAR,
>> "native", MPI_INFO_NULL ) to skip the initial integer "header" and then a call to
>> MPI_File_write_all( f, local_bytearray, local_n_bytes, MPI_CHAR, &status ). Here,
>> "offset" is simply the size of an integer (in bytes) + the summation of the number of
>> bytes each preceding rank wishes to write to the file (received via an earlier
>> MPI_Allgather call). This seems to work, as when I read the file back in on a single
>> MPI rank and deserialise the data into an array of structures I get the results I
>> expect.
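>> 
>> Roughly, the working version looks like this (serialise() is just a stand-in for my
>> byte-packing code, and f, rank, nranks and "atoms" are as in the write() routine
>> further down):
>> 
>> 	std::vector<char> bytes = serialise( atoms );  // placeholder for the packing step
>> 	int local_n_bytes = (int)bytes.size();
>> 	MPI_Status status;
>> 
>> 	std::vector<int> all_n_bytes( nranks );
>> 	MPI_Allgather( &local_n_bytes, 1, MPI_INT, &all_n_bytes[0], 1, MPI_INT, MPI_COMM_WORLD );
>> 
>> 	MPI_Offset offset = sizeof( int );  // skip the leading global_N header
>> 	for( int i=0; i<rank; i++ ) offset += all_n_bytes[i];
>> 
>> 	MPI_File_set_view( f, offset, MPI_CHAR, MPI_CHAR, (char *)"native", MPI_INFO_NULL );
>> 	MPI_File_write_all( f, &bytes[0], local_n_bytes, MPI_CHAR, &status );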
>> 
>> The second approach is to use MPI's derived data types to create MPI representations
>> of the structures, and then treat the arrays of structures as MPI data types. This
>> allows me to avoid copying the local data into an intermediate buffer etc, and seems
>> the more elegant approach. I cannot, however, seem to make this approach work.
>> 
>> I'm pretty sure the problem lies in my use of the file views, but I'm not sure where
>> I'm going wrong. The reading of the integer "header" always works fine, but the data
>> that follows is garbled. I'm using the "native" data representation for testing, but
>> will likely change that to something more portable when I get this code working.
>> 
>> I've included the important excerpts of the test code I'm trying to use below (with
>> some printf()s and error handling etc removed to make it a little more concise). I
>> have previously tested that std::vector allocates a contiguous flat array of the
>> appropriate data type in memory, so passing a pointer/reference to the first element
>> in such a data structure behaves the same way as simply passing a conventional array
>> of the appropriate data type:
>> 
>> struct atom
>> {
>> 	int global_id;
>> 	double xyz[3];
>> };
>> 
>> void write( char * fpath, std::vector<struct atom> &atoms, int rank, int nranks )
>> {
>> 	/*
>> 		Memory layout information for the structure we wish to convert into an
>> 		MPI derived data type.
>> 	*/
>> 	std::vector<int> s_blocklengths;
>> 	std::vector<MPI_Aint> s_displacements;
>> 	std::vector<MPI_Datatype> s_datatypes;
>> 	MPI_Aint addr_start, addr;
>> 	MPI_Datatype mpi_atom_type, mpi_atom_type_resized;
>> 	int type_size;
>> 	
>> 	struct atom a;
>> 	
>> 	MPI_File f;
>> 	MPI_Status stat;
>> 	MPI_Offset offset;
>> 	char *datarep = (char *)"native";
>> 
>> 	std::vector<int> all_N;
>> 	int local_N, global_N;
>> 
>> 	/*
>> 		Set up the structure data type: single integer, and 3 double precision floats.
>> 		We use the temporary "a" structure to determine the layout of memory inside
>> 		atom structures.
>> 	*/
>> 	MPI_Get_address( &a, &addr_start );
>> 	
>> 	s_blocklengths.push_back( 1 );
>> 	s_datatypes.push_back( MPI_INT );
>> 	MPI_Get_address( &a.global_id, &addr );
>> 	s_displacements.push_back( addr - addr_start );
>> 
>> 	s_blocklengths.push_back( 3 );
>> 	s_datatypes.push_back( MPI_DOUBLE );
>> 	MPI_Get_address( &a.xyz[0], &addr );
>> 	s_displacements.push_back( addr - addr_start );
>> 	
>> 	MPI_Type_create_struct( (int)s_blocklengths.size(), &s_blocklengths[0], &s_displacements[0], &s_datatypes[0], &mpi_atom_type );
>> 	MPI_Type_commit( &mpi_atom_type );
>> 	
>> 	/*
>> 		Take into account any compiler padding in creating an array of structures.
>> 	*/
>> 	MPI_Type_create_resized( mpi_atom_type, 0, sizeof(struct atom), &mpi_atom_type_resized );
>> 	MPI_Type_commit( &mpi_atom_type_resized );
>> 		
>> 	MPI_Type_size( mpi_atom_type_resized, &type_size );
>> 
>> 	local_N = (int)atoms.size();
>> 	all_N.resize( nranks );
>> 
>> 	MPI_Allgather( &local_N, 1, MPI_INT, &all_N[0], 1, MPI_INT, MPI_COMM_WORLD );
>> 
>> 	global_N = 0;
>> 	for( size_t i=0; i<all_N.size(); i++ ) global_N += all_N[i];
>> 
>> 	offset = 0;
>> 	for( int i=0; i<rank; i++ ) offset += all_N[i];
>> 
>> 	offset *= type_size; // convert from structure counts -> bytes into file for true structure size
>> 	offset += sizeof( int ); // skip leading integer (global_N) in file.
>> 
>> 	MPI_File_open( MPI_COMM_WORLD, fpath, MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &f );
>> 	if( rank == 0 )
>> 	{
>> 		MPI_File_write( f, &global_N, 1, MPI_INT, &stat );
>> 	}
>> 	MPI_File_set_view( f, offset, mpi_atom_type_resized, mpi_atom_type_resized, datarep, MPI_INFO_NULL );
>> 	
>> 	MPI_File_write_all( f, &atoms[0], (int)atoms.size(), mpi_atom_type_resized, &stat );
>> 	MPI_File_close( &f );
>> 
>> 	MPI_Type_free( &mpi_atom_type );
>> 	MPI_Type_free( &mpi_atom_type_resized );
>> 
>> 	return;
>> }
>> 
>> void read( char * fpath, std::vector<struct atom> &atoms )
>> {
>> 	std::vector<int> s_blocklengths;
>> 	std::vector<MPI_Aint> s_displacements;
>> 	std::vector<MPI_Datatype> s_datatypes;
>> 	MPI_Datatype mpi_atom_type, mpi_atom_type_resized;
>> 	
>> 	struct atom a;
>> 	MPI_Aint addr_start, addr;
>> 	
>> 	MPI_File f;
>> 	MPI_Status stat;
>> 	
>> 	int global_N;
>> 	char *datarep = (char *)"native";
>> 
>> 	int type_size;
>> 	int errcode;
>> 
>> 	/*
>> 		Set up the structure data type
>> 	*/
>> 	MPI_Get_address( &a, &addr_start );
>> 	
>> 	s_blocklengths.push_back( 1 );
>> 	s_datatypes.push_back( MPI_INT );
>> 	MPI_Get_address( &a.global_id, &addr );
>> 	s_displacements.push_back( addr - addr_start );
>> 
>> 	s_blocklengths.push_back( 3 );
>> 	s_datatypes.push_back( MPI_DOUBLE );
>> 	MPI_Get_address( &a.xyz[0], &addr );
>> 	s_displacements.push_back( addr - addr_start );
>> 	
>> 	MPI_Type_create_struct( (int)s_blocklengths.size(), &s_blocklengths[0], &s_displacements[0], &s_datatypes[0], &mpi_atom_type );
>> 	MPI_Type_commit( &mpi_atom_type );
>> 	
>> 	/*
>> 		Take into account any compiler padding in creating an array of structures.
>> 	*/
>> 	MPI_Type_create_resized( mpi_atom_type, 0, sizeof(struct atom), &mpi_atom_type_resized );
>> 	MPI_Type_commit( &mpi_atom_type_resized );
>> 
>> 	MPI_Type_size( mpi_atom_type_resized, &type_size );
>> 	
>> 	MPI_File_open( MPI_COMM_SELF, fpath, MPI_MODE_RDONLY, MPI_INFO_NULL, &f );
>> 
>> 	MPI_File_read( f, &global_N, 1, MPI_INT, &stat );
>> 	
>> 	atoms.clear();
>> 	atoms.resize( global_N );
>> 
>> 	errcode = MPI_File_set_view( f, sizeof(int), mpi_atom_type_resized, mpi_atom_type_resized, datarep, MPI_INFO_NULL );
>> 	errcode = MPI_File_read( f, &atoms[0], global_N, mpi_atom_type_resized, &stat );
>> 	errcode = MPI_File_close( &f );
>> 
>> 	MPI_Type_free( &mpi_atom_type );
>> 	MPI_Type_free( &mpi_atom_type_resized );
>> 
>> 	return;
>> }
>> 
>> Calling MPI_Type_get_extent() and MPI_Type_get_true_extent() for both 
>> mpi_atom_type and mpi_atom_type_resized returns (0,32) bytes in all cases. 
>> Calling MPI_Type_size() on both derived data types returns 28 bytes.
>> 
>> If I call MPI_File_get_type_extent() on both derived data types after opening the
>> file, they both resolve to 32 bytes - so I think the problem is in the difference
>> between the data representation in memory and on disk. If I explicitly use 32 bytes
>> in the offset calculation in the write() routine above, it still doesn't work.
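>> 
>> Roughly, those queries look like:
>> 
>> 	MPI_Aint lb, extent;
>> 	int size;
>> 
>> 	MPI_Type_get_extent( mpi_atom_type_resized, &lb, &extent );      // (0, 32)
>> 	MPI_Type_size( mpi_atom_type_resized, &size );                   // 28, i.e. without the 4 padding bytes
>> 	MPI_File_get_type_extent( f, mpi_atom_type_resized, &extent );   // 32 with "native"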
>> 
>> I'm finding it remarkably difficult to do something very simple using MPI's derived
>> data types and the parallel IO, and hence I'm guessing that I have fundamentally
>> misunderstood one or more aspects of this. If anyone can help clarify where I'm
>> going wrong, that would be much appreciated!
>> 
>> Cheers,
>> 
>> John.


