[mpich-discuss] bug in MPI_File_write_all?

CANELA-XANDRI Oriol Oriol.CAnela-Xandri at roslin.ed.ac.uk
Mon May 19 04:49:32 CDT 2014


Yes, I attach a test program below. The error can be reproduced by running it with 9 MPI processes.
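With MPICH installed it can be built and run, for example, like this (the compiler wrapper and launcher names may differ on your system, and darray_test.cpp is just a placeholder file name):

  mpicxx darray_test.cpp -o darray_test
  mpiexec -n 9 ./darray_test

When the bug triggers, one rank reads back a wrong value and the program aborts with the "Error in data" message; otherwise it runs through all 10 write/read iterations silently.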

#include <mpi.h>

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

/**
 * Compute the number of locally owned rows or columns for one dimension of a
 * block-cyclically distributed matrix (the same logic as ScaLAPACK's NUMROC)
 */
int getNumroc(int globalSize, int myProc, int nProcs, int blockSize)
{
  int myDist = myProc % nProcs;
  int nBlocks = globalSize / blockSize;
  int numroc = nBlocks / nProcs;
  numroc *= blockSize;
  int extraBlocks = nBlocks % nProcs;
  if(myDist < extraBlocks)
  {
    numroc += blockSize;
  }
  else if(myDist == extraBlocks)
  {
    numroc += globalSize % blockSize;
  }
  return numroc;
}
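
// For the 5x4 matrix used in main below, with 1x3 blocks on a 3x3 process
// grid, this yields 2, 2 and 1 local rows for process rows 0-2 and 3, 1 and 0
// local columns for process columns 0-2, so the ranks in the last process
// column own no elements at all.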

int main(int argc, char **argv)
{
  //MPI vars
  bool mpiRoot;
  int mpiRank;
  int mpiNumTasks;
  char hostName[MPI_MAX_PROCESSOR_NAME];
  int lenHostName;
  int myProcRow;
  int myProcCol;
  
  // Initialize MPI
  int tmp;
  tmp = MPI_Init(&argc, &argv);
  if (tmp != MPI_SUCCESS) {
    std::cerr << "Error: MPI could not be started. Terminating." << std::endl;
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  MPI_Comm_rank(MPI_COMM_WORLD, &mpiRank);
  MPI_Comm_size(MPI_COMM_WORLD, &mpiNumTasks);
  MPI_Get_processor_name(hostName, &lenHostName);

  if(mpiNumTasks != 9)
  {
    std::cerr << "Error: This is a test program designed for running with 9 threads." << std::endl;
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  
  //Id of the process in a 2d grid
  myProcRow = mpiRank / 3;
  myProcCol = mpiRank % 3;
  
  double *m;                                ///<A pointer to the local part of the distributed matrix
  double *mRead;                            ///<A pointer to the buffer into which the matrix is read back
  
  int nGlobRows = 5;                        ///<Number of rows of the global matrix
  int nGlobCols = 4;                        ///<Number of columns of the global matrix
  int nRows;                                ///<Number of rows of the local matrix
  int nCols;                                ///<Number of columns of the local matrix
  int nBlockRows = 1;                       ///<Number of rows of the distributed matrix blocks
  int nBlockCols = 3;                       ///<Number of columns of the distributed matrix blocks
  
  nRows = getNumroc(nGlobRows, myProcRow, 3, nBlockRows);
  nCols = getNumroc(nGlobCols, myProcCol, 3, nBlockCols);

  m = new double[nRows*nCols];
  mRead = new double[nRows*nCols];
  for(int i = 0; i < nRows; i++)
  {
    for(int j = 0; j < nCols; j++)
    {
      m[i*nCols + j] = 1;
    }
  }
  
  for(int repeat = 0; repeat < 10; repeat++)
  {
    int dims[] = {nGlobRows, nGlobCols};
    int dargs[] = {nBlockRows, nBlockCols};
    int distribs[] = {MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC};
    int dim[] = {3, 3};
    char nat[] = "native";
    int rc;
    MPI_Datatype dcarray; 
    MPI_File cFile;
    MPI_Status status;
    
    // Build the darray datatype describing this rank's portion of the global
    // nGlobRows x nGlobCols matrix, distributed block-cyclically with
    // nBlockRows x nBlockCols blocks over the 3x3 process grid (dim).
    // MPI_ORDER_FORTRAN selects column-major ordering of the global array.
    MPI_Type_create_darray(mpiNumTasks, mpiRank, 2, dims, distribs, dargs, dim, MPI_ORDER_FORTRAN, MPI_DOUBLE, &dcarray);
    MPI_Type_commit(&dcarray);
    
    std::stringstream ss;
    ss << "test_" << repeat << ".bin";
    std::string fname = ss.str(); //"test.bin";
    std::vector<char> fn(fname.begin(), fname.end());
    fn.push_back('\0');
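    // Remove any file left over from a previous run; the return code is not
    // checked because the file may not exist yet.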
    MPI_File_delete (&fn[0], MPI_INFO_NULL);
    
    //Write file
    rc = MPI_File_open(MPI_COMM_WORLD, &fn[0], MPI_MODE_EXCL | MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &cFile);
    if(rc){
      std::cerr << "Error: Failed to open file." << std::endl;
      MPI_Abort(MPI_COMM_WORLD, 1);
    }
    else
    {
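      // Every rank takes part in the collective write; with this 5x4 matrix on
      // a 3x3 grid, three of the nine ranks contribute nRows*nCols == 0 elements.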
      MPI_File_set_view(cFile, 0, MPI_DOUBLE, dcarray, nat, MPI_INFO_NULL);
      MPI_File_write_all(cFile, m, nRows*nCols, MPI_DOUBLE, &status);    
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_File_close(&cFile);


    //Initialize the read buffer to 0 before reading the file back
    for(int i = 0; i < nRows; i++)
    {
      for(int j = 0; j < nCols; j++)
      {
        mRead[i*nCols + j] = 0;
      }
    }
    //Read file
    rc = MPI_File_open(MPI_COMM_WORLD, &fn[0], MPI_MODE_RDONLY, MPI_INFO_NULL, &cFile);
    if(rc){
      std::cerr << "Error: Failed to open file." << std::endl;
      MPI_Abort(MPI_COMM_WORLD, 1);
    }
    else
    {
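      // As with the write, all ranks join the collective read, some with count 0.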
      MPI_File_set_view(cFile, 0, MPI_DOUBLE, dcarray, nat, MPI_INFO_NULL);
      MPI_File_read_all(cFile, mRead, nRows*nCols, MPI_DOUBLE, &status);    
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_File_close(&cFile);
    MPI_Type_free(&dcarray);
    
    //Check data
    for(int i = 0; i < nRows; i++)
    {
      for(int j = 0; j < nCols; j++)
      {
        if(mRead[i*nCols + j] != 1)
        {
          std::cerr << "Error in data. " << repeat << " iteration." << std::endl;
          MPI_Abort(MPI_COMM_WORLD, 1);
        }
      }
    }
  }
 
  delete [] m;
  delete [] mRead;

  MPI_Finalize();

  return 0;
}

Oriol


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


-----Original Message-----
From: discuss-bounces at mpich.org [mailto:discuss-bounces at mpich.org] On Behalf Of Rajeev Thakur
Sent: 16 May 2014 20:22
To: discuss at mpich.org
Subject: Re: [mpich-discuss] bug in MPI_File_write_all?

Can you send us a small test program that can be compiled and run?

Rajeev


On May 16, 2014, at 11:42 AM, CANELA-XANDRI Oriol <Oriol.CAnela-Xandri at roslin.ed.ac.uk> wrote:

> I also tried with version 3.0.4 (for which I have a distribution package), with the same results. If you think the result could change with 3.1, I will try it.
> 
> Oriol
> 
> 
> 
> -----Original Message-----
> From: discuss-bounces at mpich.org [mailto:discuss-bounces at mpich.org] On 
> Behalf Of Rajeev Thakur
> Sent: 16 May 2014 17:14
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] bug in MPI_File_write_all?
> 
> 1.4.1 is a very old version of MPICH. Can you try with the latest release, 3.1?
> 
> Rajeev
> 
> On May 16, 2014, at 8:50 AM, CANELA-XANDRI Oriol <Oriol.CAnela-Xandri at roslin.ed.ac.uk> wrote:
> 
>> Hello,
>> 
>> I am writing a function for storing a block-cyclic distributed array using the MPI_Type_create_darray, MPI_File_set_view, and MPI_File_write_all functions. After some testing I found that in some particular cases, sometimes, some of the values of the array are not stored properly in the file. The error seems random, but it always affects the same number in the same position (instead of a number there is a NaN; it seems that position is simply not written). I tried other implementations (e.g. Open MPI) and everything seems to work well (I am using MPICH2 version 1.4.1).
>> 
>> For instance, I saw the error for a matrix with 5 rows and 4 columns, with a block size of 1 for rows and 3 for columns, and 9 MPI processes arranged in a 3x3 grid. My function is below (nBlockRows/nBlockCols define the block sizes, nGlobRows/nGlobCols the global size of the matrix, nProcRows/nProcCols the dimensions of the process grid, and fname the name of the file):
>> 
>> void Matrix::writeMatrixMPI(std::string fname) {
>>   int dims[] = {this->nGlobRows, this->nGlobCols};
>>   int dargs[] = {this->nBlockRows, this->nBlockCols};
>>   int distribs[] = {MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC};
>>   int dim[] = {communicator->nProcRows, communicator->nProcCols};
>>   char nat[] = "native";
>>   int rc;
>>   MPI_Datatype dcarray;
>>   MPI_File cFile;
>>   MPI_Status status;
>> 
>>   MPI_Type_create_darray(communicator->mpiNumTasks, communicator->mpiRank, 2, dims, distribs, dargs, dim, MPI_ORDER_FORTRAN, MPI_DOUBLE, &dcarray);
>>   MPI_Type_commit(&dcarray);
>> 
>>   std::vector<char> fn(fname.begin(), fname.end());
>>   fn.push_back('\0');
>>   rc = MPI_File_open(MPI_COMM_WORLD, &fn[0], MPI_MODE_EXCL | MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &cFile);
>>   if(rc){
>>     ss << "Error: Failed to open file: " << rc;
>>     misc.error(ss.str(), 0);
>>   }
>>   else
>>   {
>>     MPI_File_set_view(cFile, 0, MPI_DOUBLE, dcarray, nat, MPI_INFO_NULL);
>>     MPI_File_write_all(cFile, this->m, this->nRows*this->nCols, MPI_DOUBLE, &status);
>>   }
>>   MPI_File_close(&cFile);
>>   MPI_Type_free(&dcarray);
>> }
>> 
>> If I call the function repeatedly without deleting the file between calls, the file always seems to be correct. If I delete the file between calls, or create a file with a different name each time, then I see the error in some of the files. The error is a NaN, always in the same position. This is very strange to me, but as I said, with other implementations I do not have the problem.
>> 

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss


