[mpich-discuss] bug in MPI_File_write_all?
Rajeev Thakur
thakur at mcs.anl.gov
Mon May 19 11:04:55 CDT 2014
The program works on my Mac laptop with the latest MPICH source. Can you try with the latest nightly snapshot of MPICH from here:
http://www.mpich.org/static/downloads/nightly/master/mpich/
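(For reference, building from one of those nightly tarballs is the usual configure/make sequence; the tarball name below is only a placeholder for whichever snapshot you download:)

  tar xzf mpich-master-<date>.tar.gz
  cd mpich-master-<date>
  ./configure --prefix=$HOME/mpich-nightly
  make && make install
  export PATH=$HOME/mpich-nightly/bin:$PATH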
Rajeev
On May 19, 2014, at 4:49 AM, CANELA-XANDRI Oriol <Oriol.CAnela-Xandri at roslin.ed.ac.uk> wrote:
> Yes, I attach a test program below. The error can be reproduced by running it with 9 MPI processes.
>
> #include <mpi.h>
>
> #include <iostream>
> #include <sstream>
> #include <string>
> #include <vector>
>
> /**
>  * Get the number of local rows and columns.
>  */
> int getNumroc(int globalSize, int myProc, int nProcs, int blockSize)
> {
>   int myDist = myProc % nProcs;
>   int nBlocks = globalSize / blockSize;
>   int numroc = nBlocks / nProcs;
>   numroc *= blockSize;
>   int extraBlocks = nBlocks % nProcs;
>   if(myDist < extraBlocks)
>   {
>     numroc += blockSize;
>   }
>   else if(myDist == extraBlocks)
>   {
>     numroc += globalSize % blockSize;
>   }
>   return numroc;
> }
>
> int main(int argc, char **argv)
> {
>   //MPI vars
>   bool mpiRoot;
>   int mpiRank;
>   int mpiNumTasks;
>   char hostName[MPI_MAX_PROCESSOR_NAME];
>   int lenHostName;
>   int myProcRow;
>   int myProcCol;
>
>   //Initiate MPI
>   int tmp = MPI_Init(&argc, &argv);
>   if(tmp != MPI_SUCCESS)
>   {
>     std::cerr << "Error: MPI can not be started. Terminating." << std::endl;
>     MPI_Abort(MPI_COMM_WORLD, 1);
>   }
>   MPI_Comm_rank(MPI_COMM_WORLD, &mpiRank);
>   MPI_Comm_size(MPI_COMM_WORLD, &mpiNumTasks);
>   MPI_Get_processor_name(hostName, &lenHostName);
>
>   if(mpiNumTasks != 9)
>   {
>     std::cerr << "Error: This is a test program designed to run with 9 MPI processes." << std::endl;
>     MPI_Abort(MPI_COMM_WORLD, 1);
>   }
>
>   //Id of the process in a 2d grid
>   myProcRow = mpiRank / 3;
>   myProcCol = mpiRank % 3;
>
>   double *m;     ///<A pointer to the local part of the distributed matrix that is written
>   double *mRead; ///<A pointer to the local part of the distributed matrix that is read back
>
>   int nGlobRows = 5;  ///<Number of rows of the global matrix
>   int nGlobCols = 4;  ///<Number of columns of the global matrix
>   int nRows;          ///<Number of rows of the local matrix
>   int nCols;          ///<Number of columns of the local matrix
>   int nBlockRows = 1; ///<Number of rows of the distributed matrix blocks
>   int nBlockCols = 3; ///<Number of columns of the distributed matrix blocks
>
>   nRows = getNumroc(nGlobRows, myProcRow, 3, nBlockRows);
>   nCols = getNumroc(nGlobCols, myProcCol, 3, nBlockCols);
>
>   m = new double[nRows*nCols];
>   mRead = new double[nRows*nCols];
>   for(int i = 0; i < nRows; i++)
>   {
>     for(int j = 0; j < nCols; j++)
>     {
>       m[i*nCols + j] = 1;
>     }
>   }
>
>   for(int repeat = 0; repeat < 10; repeat++)
>   {
>     int dims[] = {nGlobRows, nGlobCols};
>     int dargs[] = {nBlockRows, nBlockCols};
>     int distribs[] = {MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC};
>     int dim[] = {3, 3};
>     char nat[] = "native";
>     int rc;
>     MPI_Datatype dcarray;
>     MPI_File cFile;
>     MPI_Status status;
>
>     MPI_Type_create_darray(mpiNumTasks, mpiRank, 2, dims, distribs, dargs, dim, MPI_ORDER_FORTRAN, MPI_DOUBLE, &dcarray);
>     MPI_Type_commit(&dcarray);
>
>     std::stringstream ss;
>     ss << "test_" << repeat << ".bin";
>     std::string fname = ss.str();
>     std::vector<char> fn(fname.begin(), fname.end());
>     fn.push_back('\0');
>     MPI_File_delete(&fn[0], MPI_INFO_NULL);
>
>     //Write file
>     rc = MPI_File_open(MPI_COMM_WORLD, &fn[0], MPI_MODE_EXCL | MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &cFile);
>     if(rc)
>     {
>       std::cerr << "Error: Failed to open file." << std::endl;
>       MPI_Abort(MPI_COMM_WORLD, 1);
>     }
>     else
>     {
>       MPI_File_set_view(cFile, 0, MPI_DOUBLE, dcarray, nat, MPI_INFO_NULL);
>       MPI_File_write_all(cFile, m, nRows*nCols, MPI_DOUBLE, &status);
>     }
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_File_close(&cFile);
>
>     //Initialize the matrix to 0 before reading it back
>     for(int i = 0; i < nRows; i++)
>     {
>       for(int j = 0; j < nCols; j++)
>       {
>         mRead[i*nCols + j] = 0;
>       }
>     }
>
>     //Read file
>     rc = MPI_File_open(MPI_COMM_WORLD, &fn[0], MPI_MODE_RDONLY, MPI_INFO_NULL, &cFile);
>     if(rc)
>     {
>       std::cerr << "Error: Failed to open file." << std::endl;
>       MPI_Abort(MPI_COMM_WORLD, 1);
>     }
>     else
>     {
>       MPI_File_set_view(cFile, 0, MPI_DOUBLE, dcarray, nat, MPI_INFO_NULL);
>       MPI_File_read_all(cFile, mRead, nRows*nCols, MPI_DOUBLE, &status);
>     }
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_File_close(&cFile);
>     MPI_Type_free(&dcarray);
>
>     //Check data
>     for(int i = 0; i < nRows; i++)
>     {
>       for(int j = 0; j < nCols; j++)
>       {
>         if(mRead[i*nCols + j] != 1)
>         {
>           std::cerr << "Error in data at iteration " << repeat << "." << std::endl;
>           MPI_Abort(MPI_COMM_WORLD, 1);
>         }
>       }
>     }
>   }
>
>   delete [] m;
>   delete [] mRead;
>
>   MPI_Finalize();
>   return 0;
> }
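>
> (For reference, a way to compile and run this reproducer, assuming MPICH's compiler wrappers are on the PATH; the source file name below is only a placeholder:)
>
>   mpicxx -o darray_test darray_test.cpp
>   mpiexec -n 9 ./darray_test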
>
> Oriol
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>
> -----Original Message-----
> From: discuss-bounces at mpich.org [mailto:discuss-bounces at mpich.org] On Behalf Of Rajeev Thakur
> Sent: 16 May 2014 20:22
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] bug in MPI_File_write_all?
>
> Can you send us a small test program that can be compiled and run?
>
> Rajeev
>
>
> On May 16, 2014, at 11:42 AM, CANELA-XANDRI Oriol <Oriol.CAnela-Xandri at roslin.ed.ac.uk> wrote:
>
>> I also tried with version 3.0.4 (for which I have a distribution package), with the same results. If you think the result could change with 3.1, I will try it.
>>
>> Oriol
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>>
>> -----Original Message-----
>> From: discuss-bounces at mpich.org [mailto:discuss-bounces at mpich.org] On
>> Behalf Of Rajeev Thakur
>> Sent: 16 May 2014 17:14
>> To: discuss at mpich.org
>> Subject: Re: [mpich-discuss] bug in MPI_File_write_all?
>>
>> 1.4.1 is a very old version of MPICH. Can you try with the latest release, 3.1?
>>
>> Rajeev
>>
>> On May 16, 2014, at 8:50 AM, CANELA-XANDRI Oriol <Oriol.CAnela-Xandri at roslin.ed.ac.uk> wrote:
>>
>>> Hello,
>>>
>>> I am writing a function for storing a block-cyclic distributed array using the MPI_Type_create_darray, MPI_File_set_view, and MPI_File_write_all functions. After some testing I found that in some particular cases, and only sometimes, some of the values of the array are not stored properly in the file. The error seems random, but it always affects the same number in the same position (instead of a number there is a nan, so it seems that position is never written). I tried with other implementations (e.g. OpenMPI) and everything seems to work well (I am using MPICH2 version 1.4.1).
>>>
>>> For instance, I saw the error for a matrix with 5 rows and 4 columns, a block size of 1 for rows and 3 for columns, and 9 MPI processes distributed in a 3x3 grid. My function is the following (nBlockRows/nBlockCols define the size of the blocks, nGlobRows/nGlobCols define the global size of the matrix, nProcRows/nProcCols define the dimensions of the process grid, and fname is the name of the file):
>>>
>>> void Matrix::writeMatrixMPI(std::string fname)
>>> {
>>>   int dims[] = {this->nGlobRows, this->nGlobCols};
>>>   int dargs[] = {this->nBlockRows, this->nBlockCols};
>>>   int distribs[] = {MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC};
>>>   int dim[] = {communicator->nProcRows, communicator->nProcCols};
>>>   char nat[] = "native";
>>>   int rc;
>>>   MPI_Datatype dcarray;
>>>   MPI_File cFile;
>>>   MPI_Status status;
>>>
>>>   MPI_Type_create_darray(communicator->mpiNumTasks, communicator->mpiRank, 2, dims, distribs, dargs, dim, MPI_ORDER_FORTRAN, MPI_DOUBLE, &dcarray);
>>>   MPI_Type_commit(&dcarray);
>>>
>>>   std::vector<char> fn(fname.begin(), fname.end());
>>>   fn.push_back('\0');
>>>   rc = MPI_File_open(MPI_COMM_WORLD, &fn[0], MPI_MODE_EXCL | MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &cFile);
>>>   if(rc)
>>>   {
>>>     ss << "Error: Failed to open file: " << rc;
>>>     misc.error(ss.str(), 0);
>>>   }
>>>   else
>>>   {
>>>     MPI_File_set_view(cFile, 0, MPI_DOUBLE, dcarray, nat, MPI_INFO_NULL);
>>>     MPI_File_write_all(cFile, this->m, this->nRows*this->nCols, MPI_DOUBLE, &status);
>>>   }
>>>   MPI_File_close(&cFile);
>>>   MPI_Type_free(&dcarray);
>>> }
>>>
>>> If I call the function repeatedly without deleting the file between calls, the file always seems to be correct. If I delete the file between calls, or create a file with a different name each time, then I see the error in some of the files. The error is always a nan in the same position. This is very strange to me, but, as I said, I do not have the problem with other implementations.
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>> _______________________________________________
>>> discuss mailing list discuss at mpich.org
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss
>>
>> _______________________________________________
>> discuss mailing list discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss