[mpich-discuss] bug in MPI_File_write_all?

CANELA-XANDRI Oriol Oriol.CAnela-Xandri at roslin.ed.ac.uk
Fri May 16 11:42:16 CDT 2014


I also tried with version 3.0.4 (for which I have a distribution package), with the same results. If you think the result could change with 3.1, I will try it.

Oriol


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


-----Original Message-----
From: discuss-bounces at mpich.org [mailto:discuss-bounces at mpich.org] On Behalf Of Rajeev Thakur
Sent: 16 May 2014 17:14
To: discuss at mpich.org
Subject: Re: [mpich-discuss] bug in MPI_File_write_all?

1.4.1 is a very old version of MPICH. Can you try with the latest release, 3.1?

Rajeev

On May 16, 2014, at 8:50 AM, CANELA-XANDRI Oriol <Oriol.CAnela-Xandri at roslin.ed.ac.uk> wrote:

> Hello,
> 
> I am writing a function for storing a block-cyclic distributed array using the MPI_Type_create_darray, MPI_File_set_view, and MPI_File_write_all functions. After some testing I found that in some particular cases, sometimes, some of the values of the array are not stored properly in the file. The error seems random, but it always affects the same number in the same position (instead of a number there is a NaN; it seems that position is simply not written). I tried other implementations (e.g. Open MPI) and everything works well there (I am using MPICH2 version 1.4.1).
> 
> For instance, I saw the error for a matrix with 5 rows and 4 columns, a block size of 1 row by 3 columns, and 9 MPI processes distributed in a 3x3 grid. My function is below (nBlockRows/nBlockCols define the size of the blocks, nGlobRows/nGlobCols define the global size of the matrix, nProcRows/nProcCols define the dimensions of the process grid, and fname is the name of the file):
> 
> void Matrix::writeMatrixMPI(std::string fname) {
>   int dims[] = {this->nGlobRows, this->nGlobCols};
>   int dargs[] = {this->nBlockRows, this->nBlockCols};
>   int distribs[] = {MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC};
>   int dim[] = {communicator->nProcRows, communicator->nProcCols};
>   char nat[] = "native";
>   int rc;
>   MPI_Datatype dcarray;
>   MPI_File cFile;
>   MPI_Status status;
> 
>   MPI_Type_create_darray(communicator->mpiNumTasks, communicator->mpiRank, 2, dims, distribs, dargs, dim, MPI_ORDER_FORTRAN, MPI_DOUBLE, &dcarray);
>   MPI_Type_commit(&dcarray);
> 
>   std::vector<char> fn(fname.begin(), fname.end());
>   fn.push_back('\0');
>   rc = MPI_File_open(MPI_COMM_WORLD, &fn[0], MPI_MODE_EXCL | MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &cFile);
>   if (rc) {
>     ss << "Error: Failed to open file: " << rc;
>     misc.error(ss.str(), 0);
>   }
>   else
>   {
>     MPI_File_set_view(cFile, 0, MPI_DOUBLE, dcarray, nat, MPI_INFO_NULL);
>     MPI_File_write_all(cFile, this->m, this->nRows*this->nCols, MPI_DOUBLE, &status);
>   }
>   MPI_File_close(&cFile);
>   MPI_Type_free(&dcarray);
> }
> 
> If I call the function repeatedly without deleting the file between calls, the file always seems to be correct. If I delete the file between calls, or create a file with a different name each time, then I see the error in some of the files. The error is a NaN, always in the same position. This is very strange to me, but as I said, with other implementations I do not have the problem.
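> 
> For reference, a minimal sketch of the kind of driver that exercises this behaviour (assuming the Matrix and communicator objects from the code above; the countNans helper and the iteration count are only illustrative):
> 
> #include <mpi.h>
> #include <cstdio>
> #include <string>
> #include <vector>
> 
> // Illustrative read-back check: rank 0 reads the whole file with plain
> // stdio and counts NaN values (a NaN compares unequal to itself).
> static int countNans(const std::string &fname, int nValues) {
>   std::vector<double> buf(nValues);
>   std::FILE *f = std::fopen(fname.c_str(), "rb");
>   if (f == NULL) return -1;
>   (void)std::fread(&buf[0], sizeof(double), nValues, f);
>   std::fclose(f);
>   int nBad = 0;
>   for (int i = 0; i < nValues; ++i) {
>     if (buf[i] != buf[i]) ++nBad;
>   }
>   return nBad;
> }
> 
> // Inside a test routine: delete the file between iterations, rewrite it,
> // and let rank 0 check the result (matrix, communicator, nGlobRows and
> // nGlobCols are taken from the code above).
> for (int iter = 0; iter < 100; ++iter) {
>   if (communicator->mpiRank == 0) {
>     MPI_File_delete(const_cast<char*>(fname.c_str()), MPI_INFO_NULL);
>   }
>   MPI_Barrier(MPI_COMM_WORLD);  // make sure the file is gone before it is recreated
>   matrix.writeMatrixMPI(fname);
>   MPI_Barrier(MPI_COMM_WORLD);  // make sure the collective write has finished
>   if (communicator->mpiRank == 0) {
>     int nBad = countNans(fname, nGlobRows * nGlobCols);
>     if (nBad > 0) std::printf("iteration %d: %d NaN value(s)\n", iter, nBad);
>   }
> }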
> 

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss


