[mpich-discuss] bug in MPI_File_write_all?

CANELA-XANDRI Oriol Oriol.CAnela-Xandri at roslin.ed.ac.uk
Fri May 16 08:50:15 CDT 2014


I am writing a function that stores a block-cyclic distributed array using MPI_Type_create_darray, MPI_File_set_view, and MPI_File_write_all. After some testing I found that, in some particular cases, some values of the array are occasionally not stored properly in the file. The error appears randomly, but when it does it always affects the same element in the same position: the file contains a nan instead of the expected number, as if that position had never been written. I tried other implementations (e.g. Open MPI) and everything seems to work well there. I am using MPICH2 version 1.4.1.

For instance, I saw the error for a matrix with 5 rows and 4 columns, with a block size of 1 for rows and 3 for columns, on 9 MPI processes arranged in a 3x3 grid. My function is the following, where nBlockRows/nBlockCols define the block size, nGlobRows/nGlobCols define the global size of the matrix, nProcRows/nProcCols define the dimensions of the process grid, and fname is the name of the file:

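void Matrix::writeMatrixMPI(std::string fname)
{
  int dims[] = {this->nGlobRows, this->nGlobCols};
  int dargs[] = {this->nBlockRows, this->nBlockCols};
  // Assumed (declaration missing from the archived post): cyclic in both dimensions.
  int distribs[] = {MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC};
  int dim[] = {communicator->nProcRows, communicator->nProcCols};
  char nat[] = "native";
  int rc;
  MPI_Datatype dcarray;
  MPI_File cFile;
  MPI_Status status;

  MPI_Type_create_darray(communicator->mpiNumTasks, communicator->mpiRank, 2, dims, distribs, dargs, dim, MPI_ORDER_FORTRAN, MPI_DOUBLE, &dcarray);
  MPI_Type_commit(&dcarray);  // Assumed: a datatype must be committed before it is used in a file view.

  std::vector<char> fn(fname.begin(), fname.end());
  fn.push_back('\0');  // Assumed: terminate the name so it can be passed as a C string.
  // Assumed (open call missing from the archived post): communicator and access mode are guesses.
  rc = MPI_File_open(MPI_COMM_WORLD, &fn[0], MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &cFile);
  if (rc != MPI_SUCCESS) {
    std::stringstream ss;
    ss << "Error: Failed to open file: " << rc;
    misc.error(ss.str(), 0);
  } else {
    MPI_File_set_view(cFile, 0, MPI_DOUBLE, dcarray, nat, MPI_INFO_NULL);
    MPI_File_write_all(cFile, this->m, this->nRows*this->nCols, MPI_DOUBLE, &status);
    MPI_File_close(&cFile);  // Assumed: close the file once the collective write returns.
  }
  MPI_Type_free(&dcarray);  // Assumed: release the datatype when done.
}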
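
To make the layout of the 5x4 example concrete, here is a minimal sketch of how many elements each process would own under a block-cyclic distribution. It is not part of the function above and does not identify the cause of the problem. The helper names localExtent and localElements are hypothetical, and the sketch assumes that blocks are dealt out starting at process 0 and that ranks are laid out row-major in the process grid (the ordering MPI_Type_create_darray assumes).

```cpp
#include <cassert>

// Number of rows (or columns) owned by process `proc` when `n` global items
// are dealt out cyclically in blocks of `blockSize` over `nProcs` processes,
// starting at process 0.
int localExtent(int n, int blockSize, int proc, int nProcs)
{
  assert(n >= 0 && blockSize > 0 && nProcs > 0 && proc >= 0 && proc < nProcs);
  int fullBlocks = n / blockSize;                  // complete blocks in this dimension
  int count = (fullBlocks / nProcs) * blockSize;   // blocks received in complete rounds
  int extra = fullBlocks % nProcs;                 // complete blocks left after those rounds
  if (proc < extra) {
    count += blockSize;                            // one more complete block
  } else if (proc == extra) {
    count += n % blockSize;                        // the trailing partial block, if any
  }
  return count;
}

// Number of matrix elements owned by `rank`, assuming a row-major process grid.
int localElements(int nGlobRows, int nGlobCols, int nBlockRows, int nBlockCols,
                  int nProcRows, int nProcCols, int rank)
{
  int procRow = rank / nProcCols;
  int procCol = rank % nProcCols;
  return localExtent(nGlobRows, nBlockRows, procRow, nProcRows) *
         localExtent(nGlobCols, nBlockCols, procCol, nProcCols);
}
```

Under these assumptions, calling localElements(5, 4, 1, 3, 3, 3, rank) for ranks 0 to 8 gives 6, 2, 0, 6, 2, 0, 3, 1, 0, which adds up to the 20 elements of the matrix. In other words, the third process column owns no elements in this configuration.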

If I call the function repeatedly without deleting the file between calls, the file always seems to be correct. If I delete the file between calls, or write to a file with a different name each time, then I see the error in some of the files. The error is always a nan in the same position. This seems very strange to me but, as I said, I do not have the problem with other implementations.
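
One way to compare the output files from repeated runs is to scan each one for nan values and report their positions. The following is a minimal sketch of such a check, not something taken from the code above. The function name findNaNs is hypothetical, and it assumes the file holds nothing but raw doubles in the machine's native representation (as produced with the "native" data representation and a zero displacement), with no header.

```cpp
#include <cmath>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Returns the zero-based positions (counted in doubles, not bytes) of every
// NaN found in a file of raw native-format doubles. A file that cannot be
// opened yields an empty result.
std::vector<std::size_t> findNaNs(const std::string& path)
{
  std::vector<std::size_t> positions;
  std::ifstream in(path.c_str(), std::ios::binary);
  double value;
  std::size_t index = 0;
  // Read one double at a time; the loop stops at end of file or on a short read.
  while (in.read(reinterpret_cast<char*>(&value), sizeof(value))) {
    if (std::isnan(value)) {
      positions.push_back(index);
    }
    ++index;
  }
  return positions;
}
```

Running this over each file written by the test should show whether the bad value really sits at the same offset every time, and that offset can then be mapped back to the process that owns the element.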

