[mpich-discuss] MPI_Barrier cannot stop all process randomly

Sufeng Niu sniu at hawk.iit.edu
Tue Jul 9 12:01:56 CDT 2013


Hello,

Sorry to post such a long and simple question.
I found that MPI_Barrier sometimes does not seem to stop all the processes.
I wrote a simple test program that creates a struct datatype, shown below:

#include <stdlib.h>
#include <stdio.h>
#include "mpi.h"

typedef struct
{
    int a;
    char b;
    int c;
    int d;
} foo;

int main(int argc, char *argv[])
{

    int rank, size;
    int i;

    foo    x;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);
    printf("-- processor %s, rank %d out of %d processors\n",
           processor_name, rank, size);

    MPI_Barrier(MPI_COMM_WORLD);

    int count=4;

    MPI_Datatype testtype;
    MPI_Datatype types[4] = {MPI_INT, MPI_CHAR, MPI_INT, MPI_DOUBLE};
    int len[4] = {1, 1, 1, 1};
    MPI_Aint disp[4];
    long int base;

    MPI_Address(&x, disp);
    MPI_Address(&(x.a), disp+1);
    MPI_Address(&(x.b), disp+2);
    MPI_Address(&(x.c), disp+3);
    base = disp[0];
    for(i=0; i<4; i++) disp[i] -= base;

    MPI_Type_struct(count, len, disp, types, &testtype);
    MPI_Type_commit(&testtype);

    if(rank == 0){
        x.a = 2;
        x.b = 0;
        x.c = 10;
        x.d = 3;
    }

    printf("rank %d(before): x value is %d, %d, %d, %d\n", rank, x.a, x.b,
           x.c, x.d);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Bcast(&x, 1, testtype, 0, MPI_COMM_WORLD);

    printf("rank %d(after): x value is %d, %d, %d, %d\n", rank, x.a, x.b,
           x.c, x.d);

    MPI_Finalize();

    return 0;
}

The output should look like this:
-- processor iocfccd3.aps.anl.gov, rank 0 out of 4 processors
-- processor iocfccd3.aps.anl.gov, rank 3 out of 4 processors
-- processor iocfccd3.aps.anl.gov, rank 1 out of 4 processors
-- processor iocfccd3.aps.anl.gov, rank 2 out of 4 processors
rank 0(before): x value is 2, 0, 10, 3
rank 1(before): x value is 1197535864, -1, 4994901, 0
rank 2(before): x value is 1591464488, -1, 4994901, 0
rank 3(before): x value is 1851622184, -1, 4994901, 0
rank 0(after): x value is 2, 0, 10, 3
rank 3(after): x value is 2, 0, 10, 3
rank 1(after): x value is 2, 0, 10, 3
rank 2(after): x value is 2, 0, 10, 3

but sometimes it shows up as:
-- processor iocfccd3.aps.anl.gov, rank 0 out of 4 processors
rank 0(before): x value is 2, 0, 10, 3
rank 0(after): x value is 2, 0, 10, 3
-- processor iocfccd3.aps.anl.gov, rank 1 out of 4 processors
rank 1(before): x value is -464731256, -1, 4994901, 0
rank 1(after): x value is 2, 0, 10, 3
-- processor iocfccd3.aps.anl.gov, rank 2 out of 4 processors
rank 2(before): x value is 1863042488, -1, 4994901, 0
rank 2(after): x value is 2, 0, 10, 3
-- processor iocfccd3.aps.anl.gov, rank 3 out of 4 processors
rank 3(before): x value is 1721065144, -1, 4994901, 0
rank 3(after): x value is 2, 0, 10, 3

or
-- processor iocfccd3.aps.anl.gov, rank 0 out of 4 processors
-- processor iocfccd3.aps.anl.gov, rank 1 out of 4 processors
rank 1(before): x value is -1883169624, -1, 4994901, 0
-- processor iocfccd3.aps.anl.gov, rank 2 out of 4 processors
rank 2(before): x value is -451256152, -1, 4994901, 0
-- processor iocfccd3.aps.anl.gov, rank 3 out of 4 processors
rank 3(before): x value is 1715067240, -1, 4994901, 0
rank 0(before): x value is 2, 0, 10, 3
rank 0(after): x value is 2, 0, 10, 3
rank 1(after): x value is 2, 0, 10, 3
rank 2(after): x value is 2, 0, 10, 3
rank 3(after): x value is 2, 0, 10, 3

The ordering is random from run to run, and I am not sure where the problem is.

The second issue is that I use MPI_Datatype to create an MPI struct type for
the broadcast. However, in the program shown above, if I change the struct from:
typedef struct
{
    int a;
    char b;
    int c;
    int d;
} foo;
to
typedef struct
{
    int a;
    char b;
    int c;
    double d;
} foo;

I found the result is:
-- processor ephesus.ece.iit.edu, rank 1 out of 4 processors
-- processor ephesus.ece.iit.edu, rank 2 out of 4 processors
-- processor ephesus.ece.iit.edu, rank 3 out of 4 processors
-- processor ephesus.ece.iit.edu, rank 0 out of 4 processors
rank 0(before): x value is 2, 0, 10, 3.250000
rank 3(before): x value is 0, 0, 547474368, 0.000000
rank 1(before): x value is 0, 0, 547474368, 0.000000
rank 2(before): x value is 0, 0, 547474368, 0.000000
rank 0(after): x value is 2, 0, 10, 3.250000
rank 2(after): x value is 2, 0, 10, 0.000000 <- should be 3.25
rank 3(after): x value is 2, 0, 10, 0.000000 <- should be 3.25
rank 1(after): x value is 2, 0, 10, 0.000000 <- should be 3.25

Do you have any clues on why this happened? Thanks a lot!

-- 
Best Regards,
Sufeng Niu
ECASP lab, ECE department, Illinois Institute of Technology
Tel: 312-731-7219