[mpich-discuss] MPI_Barrier cannot stop all process randomly
Pavan Balaji
balaji at mcs.anl.gov
Tue Jul 9 12:05:23 CDT 2013
printf buffers its output and doesn't write it out immediately (or, in
this case, send it to mpiexec immediately). Try calling "fflush(stdout);"
immediately after each printf.
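
As a rough sketch of that suggestion (the helper name print_now is
hypothetical, not part of MPI or of the program below), the printf and
the flush can be wrapped together so that no call site forgets the
flush:

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper (an assumption for illustration only):
 * print a formatted line, then flush stdout right away so the text
 * leaves the process's buffer before the next MPI call is made.
 * Returns the number of characters printed, or a negative value
 * if printing or flushing failed. */
static int print_now(const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vprintf(fmt, ap);
    va_end(ap);

    if (fflush(stdout) != 0)
        return -1;
    return n;
}
```

In the test program, each printf(...) before an MPI_Barrier would become
print_now(...) with the same arguments. Flushing hands the text over
promptly; the order in which lines from different ranks are merged may
still vary, since the barrier synchronizes the processes and not the
forwarding of their output.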
-- Pavan
On 07/09/2013 12:01 PM, Sufeng Niu wrote:
> Hello,
>
> Sorry to post this long, stupid and simple question.
> I found that sometimes MPI_Barrier does not seem to stop all the
> processes. I wrote a simple test program that creates the data struct
> shown below:
>
> #include <stdlib.h>
> #include <stdio.h>
> #include "mpi.h"
>
> typedef struct
> {
>     int a;
>     char b;
>     int c;
>     int d;
> } foo;
>
> int main(int argc, char *argv[])
> {
>     int rank, size;
>     int i;
>
>     foo x;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     char processor_name[MPI_MAX_PROCESSOR_NAME];
>     int name_len;
>     MPI_Get_processor_name(processor_name, &name_len);
>     printf("-- processor %s, rank %d out of %d processors\n",
>            processor_name, rank, size);
>
>     MPI_Barrier(MPI_COMM_WORLD);
>
>     int count=4;
>
>     MPI_Datatype testtype;
>     MPI_Datatype types[4] = {MPI_INT, MPI_CHAR, MPI_INT, MPI_DOUBLE};
>     int len[4] = {1, 1, 1, 1};
>     MPI_Aint disp[4];
>     long int base;
>
>     MPI_Address(&x, disp);
>     MPI_Address(&(x.a), disp+1);
>     MPI_Address(&(x.b), disp+2);
>     MPI_Address(&(x.c), disp+3);
>     base = disp[0];
>     for(i=0; i<4; i++) disp[i] -= base;
>
>     MPI_Type_struct(count, len, disp, types, &testtype);
>     MPI_Type_commit(&testtype);
>
>     if(rank == 0){
>         x.a = 2;
>         x.b = 0;
>         x.c = 10;
>         x.d = 3;
>     }
>
>     printf("rank %d(before): x value is %d, %d, %d, %d\n", rank, x.a,
>            x.b, x.c, x.d);
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Bcast(&x, 1, testtype, 0, MPI_COMM_WORLD);
>
>     printf("rank %d(after): x value is %d, %d, %d, %d\n", rank, x.a,
>            x.b, x.c, x.d);
>
>     MPI_Finalize();
>
>     return 0;
> }
>
> The output should look like:
> -- processor iocfccd3.aps.anl.gov, rank 0 out of 4 processors
> -- processor iocfccd3.aps.anl.gov, rank 3 out of 4 processors
> -- processor iocfccd3.aps.anl.gov, rank 1 out of 4 processors
> -- processor iocfccd3.aps.anl.gov, rank 2 out of 4 processors
> rank 0(before): x value is 2, 0, 10, 3
> rank 1(before): x value is 1197535864, -1, 4994901, 0
> rank 2(before): x value is 1591464488, -1, 4994901, 0
> rank 3(before): x value is 1851622184, -1, 4994901, 0
> rank 0(after): x value is 2, 0, 10, 3
> rank 3(after): x value is 2, 0, 10, 3
> rank 1(after): x value is 2, 0, 10, 3
> rank 2(after): x value is 2, 0, 10, 3
>
> but sometimes it shows as:
> -- processor iocfccd3.aps.anl.gov, rank 0 out of 4 processors
> rank 0(before): x value is 2, 0, 10, 3
> rank 0(after): x value is 2, 0, 10, 3
> -- processor iocfccd3.aps.anl.gov, rank 1 out of 4 processors
> rank 1(before): x value is -464731256, -1, 4994901, 0
> rank 1(after): x value is 2, 0, 10, 3
> -- processor iocfccd3.aps.anl.gov, rank 2 out of 4 processors
> rank 2(before): x value is 1863042488, -1, 4994901, 0
> rank 2(after): x value is 2, 0, 10, 3
> -- processor iocfccd3.aps.anl.gov, rank 3 out of 4 processors
> rank 3(before): x value is 1721065144, -1, 4994901, 0
> rank 3(after): x value is 2, 0, 10, 3
>
> or
> -- processor iocfccd3.aps.anl.gov, rank 0 out of 4 processors
> -- processor iocfccd3.aps.anl.gov, rank 1 out of 4 processors
> rank 1(before): x value is -1883169624, -1, 4994901, 0
> -- processor iocfccd3.aps.anl.gov, rank 2 out of 4 processors
> rank 2(before): x value is -451256152, -1, 4994901, 0
> -- processor iocfccd3.aps.anl.gov, rank 3 out of 4 processors
> rank 3(before): x value is 1715067240, -1, 4994901, 0
> rank 0(before): x value is 2, 0, 10, 3
> rank 0(after): x value is 2, 0, 10, 3
> rank 1(after): x value is 2, 0, 10, 3
> rank 2(after): x value is 2, 0, 10, 3
> rank 3(after): x value is 2, 0, 10, 3
>
> The ordering is random, and I am not sure where the problem is.
>
> The second issue: I use MPI_Datatype to create an MPI struct for
> broadcast. However, in the program shown above, if I change the struct
> from
> typedef struct
> {
>     int a;
>     char b;
>     int c;
>     int d;
> } foo;
> to
> typedef struct
> {
>     int a;
>     char b;
>     int c;
>     double d;
> } foo;
>
> I found the result is:
> -- processor ephesus.ece.iit.edu, rank 1 out of 4 processors
> -- processor ephesus.ece.iit.edu, rank 2 out of 4 processors
> -- processor ephesus.ece.iit.edu, rank 3 out of 4 processors
> -- processor ephesus.ece.iit.edu, rank 0 out of 4 processors
> rank 0(before): x value is 2, 0, 10, 3.250000
> rank 3(before): x value is 0, 0, 547474368, 0.000000
> rank 1(before): x value is 0, 0, 547474368, 0.000000
> rank 2(before): x value is 0, 0, 547474368, 0.000000
> rank 0(after): x value is 2, 0, 10, 3.250000
> rank 2(after): x value is 2, 0, 10, 0.000000 <- should be 3.25
> rank 3(after): x value is 2, 0, 10, 0.000000 <- should be 3.25
> rank 1(after): x value is 2, 0, 10, 0.000000 <- should be 3.25
>
> Do you have any clues on why this happened? Thanks a lot!
>
> --
> Best Regards,
> Sufeng Niu
> ECASP lab, ECE department, Illinois Institute of Technology
> Tel: 312-731-7219
>
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>
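
On the second issue, a possible explanation, judging only from the
quoted code: the four MPI_Address calls record the addresses of x, x.a,
x.b and x.c, so no displacement is ever taken for x.d, and the
MPI_DOUBLE entry ends up at the offset of c. Assuming a typical platform
with 4-byte int and 8-byte double, the int d version works by accident
because the 8 bytes starting at c also cover d; with double d, the
member sits beyond that range and is never sent. The sketch below is
plain C with no MPI calls, and foo_displacements is a hypothetical
helper; it only shows one displacement per member, in the same order as
the types array.

```c
#include <stddef.h>

/* The second version of the struct from the question. */
typedef struct {
    int a;
    char b;
    int c;
    double d;
} foo;

/* Hypothetical helper (an assumption for illustration only):
 * fill disp[] with the byte offset of each member of foo, in the
 * order a, b, c, d, matching {MPI_INT, MPI_CHAR, MPI_INT, MPI_DOUBLE}.
 * In the MPI program the array would be declared as MPI_Aint disp[4]
 * and passed to the struct-type constructor unchanged. */
static void foo_displacements(ptrdiff_t disp[4])
{
    disp[0] = (ptrdiff_t)offsetof(foo, a);
    disp[1] = (ptrdiff_t)offsetof(foo, b);
    disp[2] = (ptrdiff_t)offsetof(foo, c);
    disp[3] = (ptrdiff_t)offsetof(foo, d);
}
```

Taking the offsets with offsetof avoids the off-by-one in the address
list. If addresses are taken at run time instead, MPI_Get_address and
MPI_Type_create_struct are the current replacements for the deprecated
MPI_Address and MPI_Type_struct. The printf format for d would also need
%f rather than %d once d is a double.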
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji