[mpich-devel] mpich3 error

Brent Morgan brent.taylormorgan at gmail.com
Sat Jan 16 13:31:01 CST 2021


Hi Hui, Mpich community,

Thanks for the response.  You're right; I'll provide a toy program that
replicates the code structure (and results).  The toy program computes a
sum from the value each process returns; the value itself isn't important
here.  The timing, however, is the only thing that matters in this
demonstration, and it exactly reproduces what we observe in our actual
program.  This points directly at the MPI functionality, and we can't
figure out what the issue is.
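
Stripped of the application details, the pattern the toy exercises is just
a broadcast, a per-rank computation, and a one-float-per-rank gather that
the root sums.  A minimal, self-contained version of that pattern (purely
illustrative- not our attached code) looks like this:

    #include <mpi.h>
    #include <cstdio>
    #include <vector>
    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        float x = 0.5f;
        MPI_Bcast(&x, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);   //root sends the input to everyone
        float local = x * (rank + 1);                      //stand-in for the real f(x)
        std::vector<float> all(size);
        MPI_Gather(&local, 1, MPI_FLOAT, all.data(), 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
        if (rank == 0) {                                   //only the root has the gathered values
            float sum = 0;
            for (int i = 0; i < size; i++) sum += all[i];
            printf("sum over %d ranks: %f\n", size, sum);
        }
        MPI_Finalize();
        return 0;
    }

The attached program follows the same structure, only with a non-blocking
broadcast, a timeout on the worker side, and a heavy dummy loop standing in
for our real objective function.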

I have attached the code.  Is something wrong with our implementation?  It
starts with the main() function.  Thank you very much for any help,

Best,
Brent
P.S. My subscription to discuss at mpich.org is currently pending.

On Sat, Jan 16, 2021 at 12:24 PM Brent Morgan <brent.taylormorgan at gmail.com>
wrote:

> Hi Hui, Mpich community,
>
> Thanks for the response.  You're right, I'll provide a toy program that
> replicates the code structure (and results).  The toy program is
> calculating a sum value from each process- the value isn't too important
> for this toy program.  The timing, however, is the only thing important in
> our demonstration.  It exactly replicates what we are observing for our
> actual program.  This directly relates to the MPI functionality- we can't
> find out what the issue is.
>
> [image: image.png]
> I have attached the code.  Is something wrong with our implementation?  It
> starts with the main() function.  Thank you very much for any help,
>
> Best,
> Brent
>
> On Fri, Jan 15, 2021 at 10:43 PM Zhou, Hui <zhouh at anl.gov> wrote:
>
>> Your description only mentions MPI_Gather. If there is indeed a problem
>> with MPI_Gather, then you should be able to reproduce the issue with a
>> sample program. Share it with us and we can assist you better. If you can't
>> reproduce the issue with a simple example, then I suspect there are other
>> problems that you are not able to fully describe. We really can't help much
>> without being able to see the code.
>>
>>
>>
>> That said, I am not even sure what issue you are describing. A 100-process
>> MPI_Gather will be slower than a 50-process MPI_Gather. And since it is a
>> collective, if one of your processes is delayed by its computation or
>> anything else, the whole collective will take longer to finish simply
>> because it is waiting for the late process. You really need to tell us what
>> your program is doing for us to even offer an intelligent guess.
>>
>>
>>
>> --
>> Hui Zhou
>>
>>
>>
>>
>>
>> From: Brent Morgan <brent.taylormorgan at gmail.com>
>> Date: Friday, January 15, 2021 at 10:42 PM
>> To: Zhou, Hui <zhouh at anl.gov>, discuss at mpich.org <discuss at mpich.org>
>> Cc: Robert Katona <robert.katona at hotmail.com>
>> Subject: Re: [mpich-devel] mpich3 error
>>
>> Hi MPICH community,
>>
>>
>>
>> My team has downloaded mpich 3.3.2 (using ch3 as the default device) and
>> implemented MPI, and for a small number of processes (<50) everything worked
>> fine for our MPI implementation.  For >=50 processes, a ch3 error crashed the
>> program after a random amount of time (sometimes 10 seconds, sometimes
>> 100 seconds).  So we compiled mpich 3.3.2 with ch4 (instead of the default
>> ch3) using the '--with-device=ch4:ofi' flag, and this got rid of the error-
>> but for >12 processes everything would suddenly run 2x slower.
>>
>>
>>
>> Upon Hui's suggestion, we upgraded to mpich 3.4 and compiled it with the
>> '--with-device=ch4:ofi' flag (ch4 is the default device in mpich 3.4).
>> Everything worked fine until we hit 20 processes; at >=20 processes the
>> 2x slowdown happens again.
>>
>>
>>
>> We have tried one communicator and multiple communicators in an attempt to
>> make the MPI implementation faster, but there is no significant difference
>> in what we observe.  We are using the MPI_Gather collective merely to
>> compute the sum of the results of N processes, but we can't seem to maintain
>> stable behavior as we increase the number of processes.  Is there something
>> we are missing that is ultimately causing this?  We are at a loss here,
>> thank you.
>>
>>
>>
>> Best,
>>
>> Brent
>>
>>
>>
>>
>>
>
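One way to check Hui's point above- that a single late rank stretches the
whole collective- is to time the barrier and the gather separately on every
rank.  A sketch of that measurement (a hypothetical helper, not part of the
attached code; 'comm', 'local', and 'all' stand for whatever the caller
uses) would be:

    #include <mpi.h>
    #include <cstdio>
    //prints, per rank, how long it waited at the barrier versus how long the gather itself took
    static void timed_gather(float local, float* all, int rank, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        MPI_Barrier(comm);                    //absorbs compute skew between ranks
        double t1 = MPI_Wtime();
        MPI_Gather(&local, 1, MPI_FLOAT, all, 1, MPI_FLOAT, 0, comm);
        double t2 = MPI_Wtime();
        printf("rank %d: waited %.3f s at barrier, gather took %.3f s\n",
               rank, t1 - t0, t2 - t1);
    }

If the barrier wait dominates on some ranks, the imbalance is in the
computation; if the gather itself is slow on every rank, the collective (or
the device/network layer underneath it) is the place to look.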
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 87383 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/devel/attachments/20210116/2d01c090/attachment-0002.png>
-------------- next part --------------
//Includes, constants, and globals below were not in the original attachment;
//they are filled in (with assumed values where noted) so the toy compiles as
//a standalone file.
#include <mpi.h>
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <vector>
using namespace std;

#define PROB_SIZE 541            //size of the x vector (value assumed)
#define MPI_APP_TIMEOUT 60.0f    //worker broadcast timeout in seconds (value assumed)
#define LOG_D(rank, msg) (std::cout << "[rank " << (rank) << "] " << (msg) << std::endl) //minimal stand-in for the original logging macro
namespace Cluster { const int rootProcessId = 0; } //root rank (assumed to be 0)

static MPI_Comm subComm;                 //communicator used by collector_toy()
static float *resultArr1, *resultArr2;   //gather buffers, allocated in main()
static int numOfWorkers;                 //number of worker processes (numOfCpu - 1)

struct MPI_Data_t{
    int numOfCpu; 
    int columnGrid;
    int rowGrid; //this is calculated
    int rank;
    int subRank;
    int nodeSize, nodeRank;
    int flag;
    int sharedMemSize;
    int winDisp;
    int *sharedMemModel;
    float *pMem, *pMemLocal;
    int color;
    MPI_Win sharedMemoryWin;
    MPI_Aint winSize;
    MPI_Comm allComm, nodeComm;
    MPI_Comm rowComm;
};

float collector_toy(vector<float>& xVec) //this function will be called from the root process
{
    MPI_Request request;
    MPI_Status status;
    float retVal = 0;
    float noData = 0;

    //send the x vector to all workers so they can start evaluating f(x)
    MPI_Ibcast(&xVec[0], PROB_SIZE, MPI_FLOAT, Cluster::rootProcessId, subComm, &request);
    MPI_Wait(&request, &status); //wait for the non-blocking broadcast to complete

    MPI_Barrier(subComm); //synchronize all processes before the gathers
    int mpiState;
    mpiState = MPI_Gather(&noData, 1, MPI_FLOAT, resultArr1, 1, MPI_FLOAT, Cluster::rootProcessId, subComm); //collect one float from every process
    mpiState = MPI_Gather(&noData, 1, MPI_FLOAT, resultArr2, 1, MPI_FLOAT, Cluster::rootProcessId, subComm); //collect one float from every process
    (void)mpiState; //return codes are not checked in this toy

    //all results are gathered; sum the worker contributions
    //(index 0 holds the root's own noData placeholder, so start from index 1)
    float a = 0;
    float b = 0;
    for(int i = 0; i < numOfWorkers; i++)
    {
        a += resultArr1[i+1];
        b += resultArr2[i+1];
    }

    retVal = (a / b);
    retVal = -retVal;
    return retVal;
}

float toyDummy[2];
float* ToyObjFun(float* x, int cpuId, int numOfWorkers) //cpuId can be used to define per-worker behavior
{
    //split the 540 outer iterations evenly across the workers (cpuId is 1-based here)
    int start_process = (541 - 1) / numOfWorkers * (cpuId - 1);
    int end_process = start_process + ((541 - 1) / numOfWorkers);
    int idxOffset = (cpuId - 1) * PROB_SIZE / numOfWorkers; //offset of this worker's slice of x
    toyDummy[0] = 0;
    toyDummy[1] = 0;
    for (int j = start_process; j < end_process; j++)
    {
        for(int i=0;i<100000000;i++)
        {
            toyDummy[0] += x[j+idxOffset]*i;
        }
        toyDummy[0] /= PROB_SIZE/numOfWorkers;
        toyDummy[1] = toyDummy[0]*j;
    }
    return toyDummy;
}

void App_toy_function(MPI_Data_t& mMpiData)
{
    std::chrono::high_resolution_clock::time_point searchStartTime; //portable clock type (the original used the internal _V2 namespace)
    std::chrono::duration<double> searchElapsedTime;
    int loopNum = 1;
    double searchTimer;
    float yRes;
    int numOfCpu = mMpiData.numOfCpu;
    int rank = mMpiData.rank;
    subComm = mMpiData.allComm;
    if(rank == 0){
        cout << "Master node starting" << endl;

        MPI_Barrier(mMpiData.allComm); //synchronize all processes before the timed loop
        vector<float> x(PROB_SIZE, 0.5);

        while(loopNum)
        {
            searchStartTime = std::chrono::high_resolution_clock::now();
            yRes = collector_toy(x);
            searchElapsedTime = std::chrono::high_resolution_clock::now() - searchStartTime;
            searchTimer = searchElapsedTime.count();
            cout << yRes << ", " << searchTimer << " s" << endl;
            loopNum--;
        }
    }
    else
    {
        float timer;
        float* ptrResult;
        float f32Result[2];
        bool isAppOK = true;
        float x[PROB_SIZE];

        int flag;
        MPI_Request request;
        MPI_Status status;
        MPI_Barrier(mMpiData.allComm);  //synchronize all processes before entering the work loop
        while(1)
        {
            MPI_Ibcast(x, PROB_SIZE, MPI_FLOAT, Cluster::rootProcessId, mMpiData.allComm, &request);
            double bcastStart = MPI_Wtime();
            while(1) //poll until the broadcast arrives or the timeout expires
            {
                MPI_Test(&request, &flag, &status);
                timer = (float)(MPI_Wtime() - bcastStart); //elapsed seconds (the original never updated timer)
                if(timer > MPI_APP_TIMEOUT)
                {
                    isAppOK = false;
                    break; //break out of the inner while
                }
                if(flag)
                {
                    break;
                }
            }
            if(isAppOK)
            {
                ptrResult = ToyObjFun(x, rank,  (numOfCpu - 1));
                f32Result[0] = ptrResult[0];
                f32Result[1] = ptrResult[1];

                MPI_Barrier(mMpiData.allComm); //matches the root's barrier in collector_toy()

                MPI_Gather(&f32Result[0], 1, MPI_FLOAT, resultArr1, 1, MPI_FLOAT,Cluster::rootProcessId, mMpiData.allComm);
                MPI_Gather(&f32Result[1], 1, MPI_FLOAT, resultArr2, 1, MPI_FLOAT,Cluster::rootProcessId, mMpiData.allComm);
            }
            else //the master is no longer broadcasting, so quit instead of hanging
            {
                std::cout << "App finished on CPU: " << rank << std::endl;
                break;
            }
        }
    }
}

int main()
{
    int subRootNodeNum;
    float dummy;
    float* rxPtr;
    MPI_Data_t mMpiData;
    memset(&mMpiData, 0, sizeof(MPI_Data_t));

    /* MPI initialization */
    mMpiData.columnGrid = 10;
    mMpiData.allComm = MPI_COMM_WORLD;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(mMpiData.allComm, &mMpiData.rank);
    LOG_D(mMpiData.rank, "MPI initialization done");
    MPI_Comm_size(mMpiData.allComm, &mMpiData.numOfCpu); //get how many processes are available for the calculation
    LOG_D(mMpiData.rank, "ranking done");

    LOG_D(mMpiData.rank,"Allocation of ram started");
    numOfWorkers = mMpiData.numOfCpu - 1;
    resultArr1 = (float *) malloc(sizeof(float) * mMpiData.numOfCpu); //one slot per process for the gathered results
    resultArr2 = (float *) malloc(sizeof(float) * mMpiData.numOfCpu); //one slot per process for the gathered results
    LOG_D(mMpiData.rank,"Allocation of ram done");

    if(mMpiData.rank == 0) {
        cout << "Num of cpu " << mMpiData.numOfCpu << endl;
        cout << "Num of workers: " << numOfWorkers << endl;
    }
    LOG_D(mMpiData.rank,"Starting app");

    App_toy_function(mMpiData);

    MPI_Barrier(mMpiData.allComm); //Blocks the caller until all processes in the communicator have called it;
    free(resultArr1);
    free(resultArr2);

    if(mMpiData.rank == 0) {
        std::cout << "----------------------------------------------" << std::endl;
    }
    MPI_Finalize(); //the original attachment never finalized MPI
    return 0;
}
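
//Build-and-run notes (not part of the original attachment): assuming this file
//is saved as toy.cpp and the MPICH bin directory is on PATH, something like
//    mpicxx toy.cpp -o toy
//    mpiexec -n 50 ./toy
//builds it and launches it on 50 processes.  The ch4:ofi build discussed
//earlier in the thread corresponds roughly to configuring MPICH with
//    ./configure --prefix=$HOME/mpich-install --with-device=ch4:ofi
//    make -j 8 && make install
//where the install prefix and -j value are illustrative.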
-------------- next part --------------
A non-text attachment was scrubbed...
Name: speedup_curve.png
Type: image/png
Size: 60825 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/devel/attachments/20210116/2d01c090/attachment-0003.png>

