[mpich-discuss] Use of MPI derived data types / MPI file IO

Wei-keng Liao wkliao at ece.northwestern.edu
Mon Nov 19 09:17:36 CST 2012


Hi, John,

I did not find anything specific about this question in the MPI standard. However,
if you check the man page of MPI_Type_free, one of the errors is
   MPI_ERR_TYPE
   Invalid datatype argument. May be an uncommitted MPI_Datatype (see MPI_Type_commit).

So it seems to imply that one should only free data types that have been committed.
But I think the MPICH development team should confirm whether this is the case.
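
For what it's worth, here is a minimal sketch of the pattern in question: two
derived types, only one of which is ever committed. Whether the second
MPI_Type_free() call is always legal is exactly the point that needs confirming.

	MPI_Datatype t_committed, t_uncommitted;

	MPI_Type_contiguous( 3, MPI_DOUBLE, &t_committed );
	MPI_Type_commit( &t_committed );        /* used in communication or file I/O */

	MPI_Type_contiguous( 4, MPI_INT, &t_uncommitted );
	/* never committed, e.g. only used as a building block for another type */

	MPI_Type_free( &t_committed );          /* clearly required           */
	MPI_Type_free( &t_uncommitted );        /* the case being asked about */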


Wei-keng


On Nov 19, 2012, at 8:46 AM, <jgrime at uchicago.edu> wrote:

> Hi Wei-keng,
> 
> It now works! Thanks for the help!
> 
> One last question to the list:
> 
> I've looked at the MPI 2 standards documents, and I'm still a little confused as to 
> the precise semantics of MPI_Type_free(); as "MPI_Datatype" is an opaque type, 
> I'm assuming that there is a certain amount of background allocation going on 
> inside the MPI runtime when I call something like MPI_Type_create_struct() or 
> similar routines.
> 
> Am I right in assuming that I should call MPI_Type_free() on *all* derived data 
> types I generate, even if they are not subsequently registered using 
> MPI_Type_commit()? I would imagine that any other behaviour is likely to lead to 
> memory leaks!
> 
> Cheers,
> 
> J.
> 
> ---- Original message ----
>> Date: Sun, 18 Nov 2012 18:58:37 -0600
>> From: discuss-bounces at mpich.org (on behalf of Wei-keng Liao <wkliao at ece.northwestern.edu>)
>> Subject: Re: [mpich-discuss] Use of MPI derived data types / MPI file IO  
>> To: discuss at mpich.org
>> 
>> Hi, John,
>> 
>> You certainly are on the right track. Your code is almost there; only the call
>> to MPI_File_set_view is incorrect, and in fact you don't need it at all.
>> 
>> Try removing the call to MPI_File_set_view and replacing the MPI_File_write_all with:
>> MPI_File_write_at_all(f, offset, &atoms[0], (int)atoms.size(), mpi_atom_type_resized, &stat);
>> 
>> On the reader side, you need to set the offset based on the new struct. Other than
>> that, it is the same as the writer case (no need for MPI_File_set_view either).
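>> 
>> As a minimal sketch (assuming the same variable names as in your code below, and
>> assuming offset is already this rank's byte position in the file, i.e. the integer
>> header plus all preceding ranks' data):
>> 
>> 	// Explicit-offset collective write; no MPI_File_set_view needed.
>> 	MPI_File_open( MPI_COMM_WORLD, fpath, MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &f );
>> 	if( rank == 0 )
>> 		MPI_File_write( f, &global_N, 1, MPI_INT, &stat );
>> 	MPI_File_write_at_all( f, offset, &atoms[0], (int)atoms.size(), mpi_atom_type_resized, &stat );
>> 	MPI_File_close( &f );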
>> 
>> As for the portability issue, I would suggest using high-level I/O libraries,
>> such as PnetCDF.
>> 
>> Wei-keng
>> 
>> On Nov 18, 2012, at 12:38 PM, <jgrime at uchicago.edu> wrote:
>> 
>>> Hi Wei-keng,
>>> 
>>> That's a good point, thanks!
>>> 
>>> However, I actually only want to save certain parts of the "atom" structure to file,
>>> and saving the whole array as a raw dump could waste a lot of disk space.
>>> 
>>> For example, the "atom" structure I used in the example code in reality contains
>>> not only an integer and three contiguous doubles, but also at least another two
>>> double[3] entries which I may not want to save to disk. As the full data set can
>>> be hundreds of millions (or even billions) of "atom" structures, using a derived
>>> data type with only a restricted subset of the data in each "atom" structure will
>>> produce considerably smaller file sizes!
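>>> 
>>> As a rough sketch of the kind of "restricted" type I have in mind (the extra
>>> double[3] members here are hypothetical stand-ins for the fields I don't want
>>> on disk):
>>> 
>>> 	struct atom_full
>>> 	{
>>> 		int global_id;
>>> 		double xyz[3];
>>> 		double vel[3];    // not written to disk
>>> 		double force[3];  // not written to disk
>>> 	};
>>> 
>>> 	// Describe only global_id and xyz, then resize to the full struct so that
>>> 	// an array of atom_full can be written with a count > 1.
>>> 	// offsetof() comes from <stddef.h> / <cstddef>.
>>> 	int          bl[2]    = { 1, 3 };
>>> 	MPI_Aint     disp[2]  = { (MPI_Aint)offsetof(struct atom_full, global_id),
>>> 	                          (MPI_Aint)offsetof(struct atom_full, xyz) };
>>> 	MPI_Datatype types[2] = { MPI_INT, MPI_DOUBLE };
>>> 	MPI_Datatype t_subset, t_subset_resized;
>>> 	MPI_Type_create_struct( 2, bl, disp, types, &t_subset );
>>> 	MPI_Type_create_resized( t_subset, 0, sizeof(struct atom_full), &t_subset_resized );
>>> 	MPI_Type_commit( &t_subset_resized );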
>>> 
>>> There's also the problem of making the resultant file "portable" - raw memory
>>> dumps could make life difficult in trying to use output files on machines with
>>> different processor architectures. Once I get the derived data types working, I
>>> can then switch from the "native" representation to something else ("external32"
>>> etc), which should allow me to create portable output files, provided I'm careful
>>> with using MPI's file offset routines etc if the file is larger than plain old 32-bit
>>> offsets can handle.
>>> 
>>> Cheers,
>>> 
>>> J.
>>> 
>>> ---- Original message ----
>>>> Date: Sun, 18 Nov 2012 12:27:04 -0600
>>>> From: discuss-bounces at mpich.org (on behalf of Wei-keng Liao <wkliao at ece.northwestern.edu>)
>>>> Subject: Re: [mpich-discuss] Use of MPI derived data types / MPI file IO  
>>>> To: discuss at mpich.org
>>>> 
>>>> Hi, John
>>>> 
>>>> If your I/O is simply appending one process's data after another and the I/O
>>>> buffers in memory are all contiguous, then you can simply do the following
>>>> without defining MPI derived data types or setting the file view.
>>>> 
>>>> MPI_File_write_at_all(f, offset, &atoms[0], (int)atoms.size() * sizeof(struct atom), MPI_BYTE, &stat);
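>>>> 
>>>> A minimal sketch of how the byte offset for that call might be computed (variable
>>>> names here follow the code further down the thread; all_N is assumed to hold every
>>>> rank's atom count, gathered with MPI_Allgather):
>>>> 
>>>> 	MPI_Offset offset = sizeof( int );  // skip the leading global_N header
>>>> 	for( int i=0; i<rank; i++ )
>>>> 		offset += (MPI_Offset)all_N[i] * sizeof( struct atom );  // whole structs, padding included
>>>> 	MPI_File_write_at_all( f, offset, &atoms[0], (int)atoms.size() * sizeof(struct atom), MPI_BYTE, &stat );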
>>>> 
>>>> Derived data types are usually needed when you have noncontiguous buffers in
>>>> memory or want to access noncontiguous data in files.
>>>> 
>>>> 
>>>> Wei-keng
>>>> 
>>>> On Nov 18, 2012, at 11:52 AM, <jgrime at uchicago.edu> wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I'm having some problems with using derived data types and MPI parallel IO, and
>>>>> was wondering if anyone could help. I tried to search the archives in case this
>>>>> was covered earlier, but that just gave me "ht://Dig error" messages.
>>>>> 
>>>>> Outline: I have written a C++ program where each MPI rank acts on data stored
>>>>> in a local array of structures. The arrays are typically of different lengths on
>>>>> each rank. I wish to write and read the contents of these arrays to disk using
>>>>> MPI's parallel IO routines. The file format is simply an initial integer which
>>>>> describes how many "structures" are in the file, followed by the data which
>>>>> represents the "structure information" from all ranks (i.e. the total data set).
>>>>> 
>>>>> So far, I've tried two different approaches: the first consists of each rank
>>>>> serialising the contents of the local array of structures into a byte array, which
>>>>> is then saved to file "f" using MPI_File_set_view( f, offset, MPI_CHAR, MPI_CHAR,
>>>>> "native", MPI_INFO_NULL ) to skip the initial integer "header" and then a call to
>>>>> MPI_File_write_all( f, local_bytearray, local_n_bytes, MPI_CHAR, &status ). Here,
>>>>> "offset" is simply the size of an integer (in bytes) + the summation of the number
>>>>> of bytes each preceding rank wishes to write to the file (received via an earlier
>>>>> MPI_Allgather call). This seems to work, as when I read the file back in on a
>>>>> single MPI rank and deserialise the data into an array of structures I get the
>>>>> results I expect.
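>>>>> 
>>>>> A condensed sketch of that first approach (the serialisation step and the
>>>>> byte-offset calculation are placeholders for the real code):
>>>>> 
>>>>> 	// local_bytearray/local_n_bytes come from serialising the local structures;
>>>>> 	// offset = sizeof(int) + sum of the byte counts of all lower ranks.
>>>>> 	MPI_File_set_view( f, offset, MPI_CHAR, MPI_CHAR, "native", MPI_INFO_NULL );
>>>>> 	MPI_File_write_all( f, local_bytearray, local_n_bytes, MPI_CHAR, &status );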
>>>>> 
>>>>> The second approach is to use MPI's derived data types to create MPI
>>>>> representations of the structures, and then treat the arrays of structures as
>>>>> MPI data types. This allows me to avoid copying the local data into an
>>>>> intermediate buffer etc, and seems the more elegant approach. I cannot,
>>>>> however, seem to make this approach work.
>>>>> 
>>>>> I'm pretty sure the problem lies in my use of the file views, but I'm not sure
>>>>> where I'm going wrong. The reading of the integer "header" always works fine,
>>>>> but the data that follows is garbled. I'm using the "native" data representation
>>>>> for testing, but will likely change that to something more portable when I get
>>>>> this code working.
>>>>> 
>>>>> I've included the important excerpts of the test code I'm trying to use below
>>>>> (with some printf()s and error handling etc removed to make it a little more
>>>>> concise). I have previously tested that std::vector allocates a contiguous flat
>>>>> array of the appropriate data type in memory, so passing a pointer/reference to
>>>>> the first element in such a data structure behaves the same way as simply
>>>>> passing a conventional array of the appropriate data type:
>>>>> 
>>>>> struct atom
>>>>> {
>>>>> 	int global_id;
>>>>> 	double xyz[3];
>>>>> };
>>>>> 
>>>>> void write( char * fpath, std::vector<struct atom> &atoms, int rank, int nranks )
>>>>> {
>>>>> 	/*
>>>>> 		Memory layout information for the structure we wish to convert into
>>>>> 		an MPI derived data type.
>>>>> 	*/
>>>>> 	std::vector<int> s_blocklengths;
>>>>> 	std::vector<MPI_Aint> s_displacements;
>>>>> 	std::vector<MPI_Datatype> s_datatypes;
>>>>> 	MPI_Aint addr_start, addr;
>>>>> 	MPI_Datatype mpi_atom_type, mpi_atom_type_resized;
>>>>> 	int type_size;
>>>>> 	
>>>>> 	struct atom a;
>>>>> 	
>>>>> 	MPI_File f;
>>>>> 	MPI_Status stat;
>>>>> 	MPI_Offset offset;
>>>>> 	char *datarep = (char *)"native";
>>>>> 
>>>>> 	std::vector<int> all_N;
>>>>> 	int local_N, global_N;
>>>>> 
>>>>> 	/*
>>>>> 		Set up the structure data type: single integer, and 3 double precision floats.
>>>>> 		We use the temporary "a" structure to determine the layout of memory inside
>>>>> 		atom structures.
>>>>> 	*/
>>>>> 	MPI_Get_address( &a, &addr_start );
>>>>> 	
>>>>> 	s_blocklengths.push_back( 1 );
>>>>> 	s_datatypes.push_back( MPI_INT );
>>>>> 	MPI_Get_address( &a.global_id, &addr );
>>>>> 	s_displacements.push_back( addr - addr_start );
>>>>> 
>>>>> 	s_blocklengths.push_back( 3 );
>>>>> 	s_datatypes.push_back( MPI_DOUBLE );
>>>>> 	MPI_Get_address( &a.xyz[0], &addr );
>>>>> 	s_displacements.push_back( addr - addr_start );
>>>>> 	
>>>>> 	MPI_Type_create_struct( (int)s_blocklengths.size(), &s_blocklengths[0], &s_displacements[0], &s_datatypes[0], &mpi_atom_type );
>>>>> 	MPI_Type_commit( &mpi_atom_type );
>>>>> 	
>>>>> 	/*
>>>>> 		Take into account any compiler padding in creating an array of structures.
>>>>> 	*/
>>>>> 	MPI_Type_create_resized( mpi_atom_type, 0, sizeof(struct atom), &mpi_atom_type_resized );
>>>>> 	MPI_Type_commit( &mpi_atom_type_resized );
>>>>> 		
>>>>> 	MPI_Type_size( mpi_atom_type_resized, &type_size );
>>>>> 
>>>>> 	local_N = (int)atoms.size();
>>>>> 	all_N.resize( nranks );
>>>>> 
>>>>> 	MPI_Allgather( &local_N, 1, MPI_INT, &all_N[0], 1, MPI_INT, MPI_COMM_WORLD );
>>>>> 
>>>>> 	global_N = 0;
>>>>> 	for( size_t i=0; i<all_N.size(); i++ ) global_N += all_N[i];
>>>>> 
>>>>> 	offset = 0;
>>>>> 	for( int i=0; i<rank; i++ ) offset += all_N[i];
>>>>> 
>>>>> 	offset *= type_size; // convert from structure counts -> bytes into file for true structure size
>>>>> 	offset += sizeof( int ); // skip leading integer (global_N) in file.
>>>>> 
>>>>> 	MPI_File_open( MPI_COMM_WORLD, fpath, MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &f );
>>>>> 	if( rank == 0 )
>>>>> 	{
>>>>> 		MPI_File_write( f, &global_N, 1, MPI_INT, &stat );
>>>>> 	}
>>>>> 	MPI_File_set_view( f, offset, mpi_atom_type_resized, mpi_atom_type_resized, datarep, MPI_INFO_NULL );
>>>>> 	
>>>>> 	MPI_File_write_all( f, &atoms[0], (int)atoms.size(), mpi_atom_type_resized, &stat );
>>>>> 	MPI_File_close( &f );
>>>>> 
>>>>> 	MPI_Type_free( &mpi_atom_type );
>>>>> 	MPI_Type_free( &mpi_atom_type_resized );
>>>>> 
>>>>> 	return;
>>>>> }
>>>>> 
>>>>> void read( char * fpath, std::vector<struct atom> &atoms )
>>>>> {
>>>>> 	std::vector<int> s_blocklengths;
>>>>> 	std::vector<MPI_Aint> s_displacements;
>>>>> 	std::vector<MPI_Datatype> s_datatypes;
>>>>> 	MPI_Datatype mpi_atom_type, mpi_atom_type_resized;
>>>>> 	
>>>>> 	struct atom a;
>>>>> 	MPI_Aint addr_start, addr;
>>>>> 	
>>>>> 	MPI_File f;
>>>>> 	MPI_Status stat;
>>>>> 	
>>>>> 	int global_N;
>>>>> 	char *datarep = (char *)"native";
>>>>> 
>>>>> 	int type_size;
>>>>> 	int errcode;
>>>>> 
>>>>> 	/*
>>>>> 		Set up the structure data type
>>>>> 	*/
>>>>> 	MPI_Get_address( &a, &addr_start );
>>>>> 	
>>>>> 	s_blocklengths.push_back( 1 );
>>>>> 	s_datatypes.push_back( MPI_INT );
>>>>> 	MPI_Get_address( &a.global_id, &addr );
>>>>> 	s_displacements.push_back( addr - addr_start );
>>>>> 
>>>>> 	s_blocklengths.push_back( 3 );
>>>>> 	s_datatypes.push_back( MPI_DOUBLE );
>>>>> 	MPI_Get_address( &a.xyz[0], &addr );
>>>>> 	s_displacements.push_back( addr - addr_start );
>>>>> 	
>>>>> 	MPI_Type_create_struct( (int)s_blocklengths.size(), &s_blocklengths[0], &s_displacements[0], &s_datatypes[0], &mpi_atom_type );
>>>>> 	MPI_Type_commit( &mpi_atom_type );
>>>>> 	
>>>>> 	/*
>>>>> 		Take into account any compiler padding in creating an array of structures.
>>>>> 	*/
>>>>> 	MPI_Type_create_resized( mpi_atom_type, 0, sizeof(struct atom), &mpi_atom_type_resized );
>>>>> 	MPI_Type_commit( &mpi_atom_type_resized );
>>>>> 
>>>>> 	MPI_Type_size( mpi_atom_type_resized, &type_size );
>>>>> 	
>>>>> 	MPI_File_open( MPI_COMM_SELF, fpath, MPI_MODE_RDONLY, MPI_INFO_NULL, &f );
>>>>> 
>>>>> 	MPI_File_read( f, &global_N, 1, MPI_INT, &stat );
>>>>> 	
>>>>> 	atoms.clear();
>>>>> 	atoms.resize( global_N );
>>>>> 
>>>>> 	errcode = MPI_File_set_view( f, sizeof(int), mpi_atom_type_resized, mpi_atom_type_resized, datarep, MPI_INFO_NULL );
>>>>> 	errcode = MPI_File_read( f, &atoms[0], global_N, mpi_atom_type_resized, &stat );
>>>>> 	errcode = MPI_File_close( &f );
>>>>> 
>>>>> 	MPI_Type_free( &mpi_atom_type );
>>>>> 	MPI_Type_free( &mpi_atom_type_resized );
>>>>> 
>>>>> 	return;
>>>>> }
>>>>> 
>>>>> Calling MPI_Type_get_extent() and MPI_Type_get_true_extent() for both
>>>>> mpi_atom_type and mpi_atom_type_resized returns (0,32) bytes in all cases.
>>>>> Calling MPI_Type_size() on both derived data types returns 28 bytes.
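>>>>> 
>>>>> For reference, these are the queries I'm making (the 4-byte difference presumably
>>>>> being the trailing padding the compiler adds to align the struct):
>>>>> 
>>>>> 	MPI_Aint lb, extent;
>>>>> 	int size;
>>>>> 	MPI_Type_get_extent( mpi_atom_type_resized, &lb, &extent );  // gives (0, 32)
>>>>> 	MPI_Type_size( mpi_atom_type_resized, &size );               // gives 28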
>>>>> 
>>>>> If I call MPI_File_get_type_extent() on both derived data types after opening the
>>>>> file, they both resolve to 32 bytes - so I think the problem is in the difference
>>>>> between the data representation in memory and on disk. If I explicitly use 32
>>>>> bytes in the offset calculation in the write() routine above, it still doesn't work.
>>>>> 
>>>>> I'm finding it remarkably difficult to do something very simple using MPI's 
>>>>> derived data types and the parallel IO, and hence I'm guessing that I have 
>>>>> fundamentally misunderstood one or more aspects of this. If anyone can help
>>>>> clarify where I'm going wrong, that would be much appreciated!
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> John.
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss



