[mpich-discuss] Use of MPI derived data types / MPI file IO

Wei-keng Liao wkliao at ece.northwestern.edu
Mon Nov 19 10:14:51 CST 2012


Hi, John,

I think the question you raised is important and can help improve the MPI and MPICH
documentation.

Based on what you observed from testing the latest MPICH and the comments
from Dave Goodell (thanks Dave), I suggest that at least the MPICH man page
should be revised to clarify this.


Wei-keng

On Nov 19, 2012, at 9:25 AM, <jgrime at uchicago.edu> wrote:

> Hi Wei-keng,
> 
> I checked this out with the latest mpich2 installation I have on my desktop, and I 
> can indeed call MPI_Type_free() on a derived type which has not been committed 
> with no error code returned.
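> 
> For the record, the test was essentially just this (a minimal sketch, with the error
> handler set to return codes rather than abort):
> 
> #include <mpi.h>
> 
> int main( int argc, char **argv )
> {
> 	MPI_Datatype t;
> 	int err;
> 
> 	MPI_Init( &argc, &argv );
> 	MPI_Comm_set_errhandler( MPI_COMM_WORLD, MPI_ERRORS_RETURN );
> 
> 	// Derived type is created but deliberately never committed.
> 	MPI_Type_contiguous( 3, MPI_DOUBLE, &t );
> 
> 	// Returns MPI_SUCCESS here, i.e. freeing an uncommitted type is accepted.
> 	err = MPI_Type_free( &t );
> 
> 	MPI_Finalize();
> 	return ( err == MPI_SUCCESS ) ? 0 : 1;
> }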
> 
> Maybe this could/should be clarified in the MPI standard at some point, although 
> I'm probably one of only 5 people who ever cared!
> 
> J.
> 
> ---- Original message ----
>> Date: Mon, 19 Nov 2012 09:17:36 -0600
>> From: discuss-bounces at mpich.org (on behalf of Wei-keng Liao <wkliao at ece.northwestern.edu>)
>> Subject: Re: [mpich-discuss] Use of MPI derived data types / MPI file IO  
>> To: discuss at mpich.org
>> 
>> Hi, John,
>> 
>> I did not find anything specific about this question in the MPI standard. However,
>> if you check the man page of MPI_Type_free, one of the listed errors is
>>  MPI_ERR_TYPE
>>  Invalid datatype argument. May be an uncommitted MPI_Datatype (see MPI_Type_commit).
>> 
>> So it seems to imply that one should only free data types that have been committed.
>> But I think the MPICH development team should confirm whether this is the case.
>> 
>> 
>> Wei-keng
>> 
>> 
>> On Nov 19, 2012, at 8:46 AM, <jgrime at uchicago.edu> wrote:
>> 
>>> Hi Wei-keng,
>>> 
>>> It now works! Thanks for the help!
>>> 
>>> One last question to the list:
>>> 
>>> I've looked at the MPI 2 standards documents, and I'm still a little confused as to
>>> the precise semantics of MPI_Type_free(); as "MPI_Datatype" is an opaque type,
>>> I'm assuming that there is a certain amount of background allocation going on
>>> inside the MPI runtime when I call something like MPI_Type_create_struct() or
>>> similar routines.
>>> 
>>> Am I right in assuming that I should call MPI_Type_free() on *all* derived data
>>> types I generate, even if they are not subsequently registered using
>>> MPI_Type_commit()? I would imagine that any other behaviour is likely to lead to
>>> memory leaks!
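>>> 
>>> To make the question concrete, the sort of pattern I have in mind is below (just a
>>> sketch, reusing the s_* arrays from the write() routine in my earlier mail - here the
>>> intermediate type is never committed, since only the resized version is actually used):
>>> 
>>> 	MPI_Datatype tmp_type, file_type;
>>> 
>>> 	// Intermediate struct type: never committed.
>>> 	MPI_Type_create_struct( (int)s_blocklengths.size(), &s_blocklengths[0], &s_displacements[0], &s_datatypes[0], &tmp_type );
>>> 
>>> 	// Resized version: committed and used for the actual IO calls.
>>> 	MPI_Type_create_resized( tmp_type, 0, sizeof(struct atom), &file_type );
>>> 	MPI_Type_commit( &file_type );
>>> 
>>> 	/* ... IO using file_type ... */
>>> 
>>> 	// Should the uncommitted tmp_type be freed too, or only file_type?
>>> 	MPI_Type_free( &tmp_type );
>>> 	MPI_Type_free( &file_type );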
>>> 
>>> Cheers,
>>> 
>>> J.
>>> 
>>> ---- Original message ----
>>>> Date: Sun, 18 Nov 2012 18:58:37 -0600
>>>> From: discuss-bounces at mpich.org (on behalf of Wei-keng Liao <wkliao at ece.northwestern.edu>)
>>>> Subject: Re: [mpich-discuss] Use of MPI derived data types / MPI file IO  
>>>> To: discuss at mpich.org
>>>> 
>>>> Hi, John,
>>>> 
>>>> You certainly are on the right track to achieve that. Your code is almost
>>>> there; only the call to MPI_File_set_view is incorrect. In fact, you don't need it.
>>>> 
>>>> Try removing the call to MPI_File_set_view and replace the MPI_File_write_all with:
>>>> MPI_File_write_at_all(f, offset, &atoms[0], (int)atoms.size(), mpi_atom_type_resized, &stat);
>>>> 
>>>> On the reader side, you need to set the offset based on the new struct. Other than
>>>> that, it is the same as the writer case (no need of MPI_File_set_view either).
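>>>> 
>>>> For example, for the single-rank reader in your code, something like this in place of
>>>> the MPI_File_set_view / MPI_File_read pair should work (a sketch, keeping your
>>>> variable names):
>>>> 
>>>> MPI_Offset read_offset = sizeof(int);   /* skip the leading global_N header */
>>>> MPI_File_read_at( f, read_offset, &atoms[0], global_N, mpi_atom_type_resized, &stat );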
>>>> 
>>>> As for the portability issue, I would suggest using high-level I/O libraries,
>>>> such as PnetCDF.
>>>> 
>>>> Wei-keng
>>>> 
>>>> On Nov 18, 2012, at 12:38 PM, <jgrime at uchicago.edu> wrote:
>>>> 
>>>>> Hi Wei-keng,
>>>>> 
>>>>> That's a good point, thanks!
>>>>> 
>>>>> However, I actually only want to save certain parts of the "atom" structure to file,
>>>>> and saving the whole array as a raw dump could waste a lot of disk space.
>>>>> 
>>>>> For example, the "atom" structure I used in the example code in reality contains
>>>>> not only an integer and three contiguous doubles, but also at least another two
>>>>> double[3] entries which I may not want to save to disk. As the full data set can
>>>>> be hundreds of millions (or even billions) of "atom" structures, using a derived
>>>>> data type with only a restricted subset of the data in each "atom" structure will
>>>>> produce considerably smaller file sizes!
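>>>>> 
>>>>> For instance, the real structure looks more like this (the field names beyond
>>>>> global_id and xyz are just placeholders), and the derived type only describes the
>>>>> fields that actually go to disk:
>>>>> 
>>>>> struct atom
>>>>> {
>>>>> 	int global_id;
>>>>> 	double xyz[3];
>>>>> 	double vel[3];     // e.g. velocities - not written to disk
>>>>> 	double force[3];   // e.g. forces     - not written to disk
>>>>> };
>>>>> 
>>>>> // Only global_id and xyz are described; resizing the type to sizeof(struct atom)
>>>>> // makes MPI stride over the remaining fields.  offsetof() comes from <cstddef>.
>>>>> int          blocklen[2] = { 1, 3 };
>>>>> MPI_Aint     disp[2]     = { offsetof(struct atom, global_id), offsetof(struct atom, xyz) };
>>>>> MPI_Datatype types[2]    = { MPI_INT, MPI_DOUBLE };
>>>>> MPI_Datatype io_type, io_type_resized;
>>>>> 
>>>>> MPI_Type_create_struct( 2, blocklen, disp, types, &io_type );
>>>>> MPI_Type_create_resized( io_type, 0, sizeof(struct atom), &io_type_resized );
>>>>> MPI_Type_commit( &io_type_resized );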
>>>>> 
>>>>> There's also the problem of making the resultant file "portable" - raw memory
>>>>> dumps could make life difficult in trying to use output files on machines with
>>>>> different processor architectures. Once I get the derived data types working, I
>>>>> can then switch from the "native" representation to something else ("external32"
>>>>> etc.), which should allow me to create portable output files, provided I'm careful
>>>>> with using MPI's file offset routines etc. if the file is larger than plain old 32 bit
>>>>> offsets can handle.
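>>>>> 
>>>>> i.e. something like the following, assuming the MPI implementation actually
>>>>> supports the "external32" representation (a sketch only):
>>>>> 
>>>>> // Portable on-disk representation; note that with "external32" the on-disk
>>>>> // extent of the type generally differs from sizeof(struct atom), so file
>>>>> // offsets should come from MPI_File_get_type_extent() rather than sizeof().
>>>>> MPI_File_set_view( f, offset, mpi_atom_type_resized, mpi_atom_type_resized,
>>>>>                    (char *)"external32", MPI_INFO_NULL );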
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> J.
>>>>> 
>>>>> ---- Original message ----
>>>>>> Date: Sun, 18 Nov 2012 12:27:04 -0600
>>>>>> From: discuss-bounces at mpich.org (on behalf of Wei-keng Liao <wkliao at ece.northwestern.edu>)
>>>>>> Subject: Re: [mpich-discuss] Use of MPI derived data types / MPI file IO  
>>>>>> To: discuss at mpich.org
>>>>>> 
>>>>>> Hi, John
>>>>>> 
>>>>>> If your I/O is simply appending one process's data after another and the I/O
>>>>>> buffers in memory are all contiguous, then you can simply do the following
>>>>>> without defining MPI derived data types or setting the file view:
>>>>>> 
>>>>>> MPI_File_write_at_all(f, offset, &atoms[0], (int)atoms.size() * sizeof(struct atom), MPI_BYTE, &stat);
>>>>>> 
>>>>>> Derived data types are usually needed when you have a noncontiguous buffer in
>>>>>> memory or want to access non-contiguous data in files.
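>>>>>> 
>>>>>> For completeness, each process's byte offset could be computed with a prefix sum
>>>>>> instead of the MPI_Allgather, e.g. (a sketch, reusing the names from the code below):
>>>>>> 
>>>>>> long long my_bytes = (long long)atoms.size() * sizeof(struct atom);
>>>>>> long long start    = 0;
>>>>>> 
>>>>>> // Exclusive prefix sum of byte counts; the result on rank 0 is undefined,
>>>>>> // so reset it to 0 explicitly.
>>>>>> MPI_Exscan( &my_bytes, &start, 1, MPI_LONG_LONG_INT, MPI_SUM, MPI_COMM_WORLD );
>>>>>> if (rank == 0) start = 0;
>>>>>> 
>>>>>> MPI_Offset offset = (MPI_Offset)start + sizeof(int);  /* skip the global_N header */
>>>>>> MPI_File_write_at_all( f, offset, &atoms[0],
>>>>>>                        (int)(atoms.size() * sizeof(struct atom)), MPI_BYTE, &stat );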
>>>>>> 
>>>>>> 
>>>>>> Wei-keng
>>>>>> 
>>>>>> On Nov 18, 2012, at 11:52 AM, <jgrime at uchicago.edu> wrote:
>>>>>> 
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> I'm having some problems with using derived data types and MPI parallel IO, and
>>>>>>> was wondering if anyone could help. I tried to search the archives in case this
>>>>>>> was covered earlier, but that just gave me "ht://Dig error" messages.
>>>>>>> 
>>>>>>> Outline: I have written a C++ program where each MPI rank acts on data stored
>>>>>>> in a local array of structures. The arrays are typically of different lengths on each
>>>>>>> rank. I wish to write and read the contents of these arrays to disk using MPI's
>>>>>>> parallel IO routines. The file format is simply an initial integer which describes
>>>>>>> how many "structures" are in the file, followed by the data which represents the
>>>>>>> "structure information" from all ranks (i.e. the total data set).
>>>>>>> 
>>>>>>> So far, I've tried two different approaches: the first consists of each rank
>>>>>>> serialising the contents of the local array of structures into a byte array, which is
>>>>>>> then saved to file "f" using MPI_File_set_view( f, offset, MPI_CHAR, MPI_CHAR,
>>>>>>> "native", MPI_INFO_NULL ) to skip the initial integer "header" and then a call to
>>>>>>> MPI_File_write_all( f, local_bytearray, local_n_bytes, MPI_CHAR, &status ). Here,
>>>>>>> "offset" is simply the size of an integer (in bytes) + the summation of the number
>>>>>>> of bytes each preceding rank wishes to write to the file (received via an earlier
>>>>>>> MPI_Allgather call). This seems to work, as when I read the file back in on a
>>>>>>> single MPI rank and deserialise the data into an array of structures I get the
>>>>>>> results I expect.
>>>>>>> 
>>>>>>> The second approach is to use MPI's derived data types to create MPI
>>>>>>> representations of the structures, and then treat the arrays of structures as MPI
>>>>>>> data types. This allows me to avoid copying the local data into an intermediate
>>>>>>> buffer etc., and seems the more elegant approach. I cannot, however, seem to
>>>>>>> make this approach work.
>>>>>>> 
>>>>>>> I'm pretty sure the problem lies in my use of the file views, but I'm not sure
>>>>>>> where I'm going wrong. The reading of the integer "header" always works fine,
>>>>>>> but the data that follows is garbled. I'm using the "native" data representation for
>>>>>>> testing, but will likely change that to something more portable when I get this
>>>>>>> code working.
>>>>>>> 
>>>>>>> I've included the important excerpts of the test code I'm trying to use below
>>>>>>> (with some printf()s and error handling etc. removed to make it a little more
>>>>>>> concise). I have previously tested that std::vector allocates a contiguous flat
>>>>>>> array of the appropriate data type in memory, so passing a pointer/reference to
>>>>>>> the first element in such a data structure behaves the same way as simply
>>>>>>> passing a conventional array of the appropriate data type:
>>>>>>> 
>>>>>>> struct atom
>>>>>>> {
>>>>>>> 	int global_id;
>>>>>>> 	double xyz[3];
>>>>>>> };
>>>>>>> 
>>>>>>> void write( char * fpath, std::vector<struct atom> &atoms, int rank, int nranks )
>>>>>>> {
>>>>>>> 	/*
>>>>>>> 		Memory layout information for the structure we wish to convert into an
>>>>>>> 		MPI derived data type.
>>>>>>> 	*/
>>>>>>> 	std::vector<int> s_blocklengths;
>>>>>>> 	std::vector<MPI_Aint> s_displacements;
>>>>>>> 	std::vector<MPI_Datatype> s_datatypes;
>>>>>>> 	MPI_Aint addr_start, addr;
>>>>>>> 	MPI_Datatype mpi_atom_type, mpi_atom_type_resized;
>>>>>>> 	int type_size;
>>>>>>> 	
>>>>>>> 	struct atom a;
>>>>>>> 	
>>>>>>> 	MPI_File f;
>>>>>>> 	MPI_Status stat;
>>>>>>> 	MPI_Offset offset;
>>>>>>> 	char *datarep = (char *)"native";
>>>>>>> 
>>>>>>> 	std::vector<int> all_N;
>>>>>>> 	int local_N, global_N;
>>>>>>> 
>>>>>>> 	/*
>>>>>>> 		Set up the structure data type: single integer, and 3 double precision floats.
>>>>>>> 		We use the temporary "a" structure to determine the layout of memory inside
>>>>>>> 		atom structures.
>>>>>>> 	*/
>>>>>>> 	MPI_Get_address( &a, &addr_start );
>>>>>>> 	
>>>>>>> 	s_blocklengths.push_back( 1 );
>>>>>>> 	s_datatypes.push_back( MPI_INT );
>>>>>>> 	MPI_Get_address( &a.global_id, &addr );
>>>>>>> 	s_displacements.push_back( addr - addr_start );
>>>>>>> 
>>>>>>> 	s_blocklengths.push_back( 3 );
>>>>>>> 	s_datatypes.push_back( MPI_DOUBLE );
>>>>>>> 	MPI_Get_address( &a.xyz[0], &addr );
>>>>>>> 	s_displacements.push_back( addr - addr_start );
>>>>>>> 	
>>>>>>> 	MPI_Type_create_struct( (int)s_blocklengths.size(), &s_blocklengths[0], &s_displacements[0], &s_datatypes[0], &mpi_atom_type );
>>>>>>> 	MPI_Type_commit( &mpi_atom_type );
>>>>>>> 	
>>>>>>> 	/*
>>>>>>> 		Take into account any compiler padding in creating an array of structures.
>>>>>>> 	*/
>>>>>>> 	MPI_Type_create_resized( mpi_atom_type, 0, sizeof(struct atom), &mpi_atom_type_resized );
>>>>>>> 	MPI_Type_commit( &mpi_atom_type_resized );
>>>>>>> 		
>>>>>>> 	MPI_Type_size( mpi_atom_type_resized, &type_size );
>>>>>>> 
>>>>>>> 	local_N = (int)atoms.size();
>>>>>>> 	all_N.resize( nranks );
>>>>>>> 
>>>>>>> 	MPI_Allgather( &local_N, 1, MPI_INT, &all_N[0], 1, MPI_INT, MPI_COMM_WORLD );
>>>>>>> 
>>>>>>> 	global_N = 0;
>>>>>>> 	for( size_t i=0; i<all_N.size(); i++ ) global_N += all_N[i];
>>>>>>> 
>>>>>>> 	offset = 0;
>>>>>>> 	for( int i=0; i<rank; i++ ) offset += all_N[i];
>>>>>>> 
>>>>>>> 	offset *= type_size; // convert from structure counts -> bytes into file for true structure size
>>>>>>> 	offset += sizeof( int ); // skip leading integer (global_N) in file.
>>>>>>> 
>>>>>>> 	MPI_File_open( MPI_COMM_WORLD, fpath, MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &f );
>>>>>>> 	if( rank == 0 )
>>>>>>> 	{
>>>>>>> 		MPI_File_write( f, &global_N, 1, MPI_INT, &stat );
>>>>>>> 	}
>>>>>>> 	MPI_File_set_view( f, offset, mpi_atom_type_resized, mpi_atom_type_resized, datarep, MPI_INFO_NULL );
>>>>>>> 	
>>>>>>> 	MPI_File_write_all( f, &atoms[0], (int)atoms.size(), mpi_atom_type_resized, &stat );
>>>>>>> 	MPI_File_close( &f );
>>>>>>> 
>>>>>>> 	MPI_Type_free( &mpi_atom_type );
>>>>>>> 	MPI_Type_free( &mpi_atom_type_resized );
>>>>>>> 
>>>>>>> 	return;
>>>>>>> }
>>>>>>> 
>>>>>>> void read( char * fpath, std::vector<struct atom> &atoms )
>>>>>>> {
>>>>>>> 	std::vector<int> s_blocklengths;
>>>>>>> 	std::vector<MPI_Aint> s_displacements;
>>>>>>> 	std::vector<MPI_Datatype> s_datatypes;
>>>>>>> 	MPI_Datatype mpi_atom_type, mpi_atom_type_resized;
>>>>>>> 	
>>>>>>> 	struct atom a;
>>>>>>> 	MPI_Aint addr_start, addr;
>>>>>>> 	
>>>>>>> 	MPI_File f;
>>>>>>> 	MPI_Status stat;
>>>>>>> 	
>>>>>>> 	int global_N;
>>>>>>> 	char *datarep = (char *)"native";
>>>>>>> 
>>>>>>> 	int type_size, errcode;
>>>>>>> 
>>>>>>> 	/*
>>>>>>> 		Set up the structure data type
>>>>>>> 	*/
>>>>>>> 	MPI_Get_address( &a, &addr_start );
>>>>>>> 	
>>>>>>> 	s_blocklengths.push_back( 1 );
>>>>>>> 	s_datatypes.push_back( MPI_INT );
>>>>>>> 	MPI_Get_address( &a.global_id, &addr );
>>>>>>> 	s_displacements.push_back( addr - addr_start );
>>>>>>> 
>>>>>>> 	s_blocklengths.push_back( 3 );
>>>>>>> 	s_datatypes.push_back( MPI_DOUBLE );
>>>>>>> 	MPI_Get_address( &a.xyz[0], &addr );
>>>>>>> 	s_displacements.push_back( addr - addr_start );
>>>>>>> 	
>>>>>>> 	MPI_Type_create_struct( (int)s_blocklengths.size(), &s_blocklengths[0], &s_displacements[0], &s_datatypes[0], &mpi_atom_type );
>>>>>>> 	MPI_Type_commit( &mpi_atom_type );
>>>>>>> 	
>>>>>>> 	/*
>>>>>>> 		Take into account any compiler padding in creating an array of structures.
>>>>>>> 	*/
>>>>>>> 	MPI_Type_create_resized( mpi_atom_type, 0, sizeof(struct atom), &mpi_atom_type_resized );
>>>>>>> 	MPI_Type_commit( &mpi_atom_type_resized );
>>>>>>> 
>>>>>>> 	MPI_Type_size( mpi_atom_type_resized, &type_size );
>>>>>>> 	
>>>>>>> 	MPI_File_open( MPI_COMM_SELF, fpath, MPI_MODE_RDONLY, MPI_INFO_NULL, &f );
>>>>>>> 
>>>>>>> 	MPI_File_read( f, &global_N, 1, MPI_INT, &stat );
>>>>>>> 	
>>>>>>> 	atoms.clear();
>>>>>>> 	atoms.resize( global_N );
>>>>>>> 
>>>>>>> 	errcode = MPI_File_set_view( f, sizeof(int), mpi_atom_type_resized, mpi_atom_type_resized, datarep, MPI_INFO_NULL );
>>>>>>> 	errcode = MPI_File_read( f, &atoms[0], global_N, mpi_atom_type_resized, &stat );
>>>>>>> 	errcode = MPI_File_close( &f );
>>>>>>> 
>>>>>>> 	MPI_Type_free( &mpi_atom_type );
>>>>>>> 	MPI_Type_free( &mpi_atom_type_resized );
>>>>>>> 
>>>>>>> 	return;
>>>>>>> }
>>>>>>> 
>>>>>>> Calling MPI_Type_get_extent() and MPI_Type_get_true_extent() for both
>>>>>>> mpi_atom_type and mpi_atom_type_resized returns (0,32) bytes in all cases.
>>>>>>> Calling MPI_Type_size() on both derived data types returns 28 bytes.
>>>>>>> 
>>>>>>> If I call MPI_File_get_type_extent() on both derived data types after opening the
>>>>>>> file, they both resolve to 32 bytes - so I think the problem is in the difference
>>>>>>> between the data representation in memory and on disk. If I explicitly use 32
>>>>>>> bytes in the offset calculation in the write() routine above, it still doesn't work.
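>>>>>>> 
>>>>>>> For reference, these are the queries I'm doing (a small sketch, using the resized
>>>>>>> type and the open file handle from the code above):
>>>>>>> 
>>>>>>> MPI_Aint lb, extent, true_lb, true_extent, file_extent;
>>>>>>> int size;
>>>>>>> 
>>>>>>> MPI_Type_get_extent( mpi_atom_type_resized, &lb, &extent );                 // (0, 32)
>>>>>>> MPI_Type_get_true_extent( mpi_atom_type_resized, &true_lb, &true_extent );  // (0, 32)
>>>>>>> MPI_Type_size( mpi_atom_type_resized, &size );                              // 28
>>>>>>> MPI_File_get_type_extent( f, mpi_atom_type_resized, &file_extent );         // 32 with "native"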
>>>>>>> 
>>>>>>> I'm finding it remarkably difficult to do something very simple using MPI's
>>>>>>> derived data types and the parallel IO, and hence I'm guessing that I have
>>>>>>> fundamentally misunderstood one or more aspects of this. If anyone can help
>>>>>>> clarify where I'm going wrong, that would be much appreciated!
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> 
>>>>>>> John.




More information about the discuss mailing list