[mpich-discuss] MPI_Barrier cannot stop all process randomly

Jeff Hammond jeff.science at gmail.com
Tue Jul 9 12:24:41 CDT 2013


...and depending on your filesystem (assuming you pipe stdout to a
file), even that isn't sufficient.

printf is a terrible way to evaluate temporal relationships and you
shouldn't try to reason about parallel programs with it unless you've
taken great care to ensure a global total ordering of output streams.
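
One way to get such an ordering is to funnel all output through a single
rank: each rank formats its line into a fixed-width record, rank 0
gathers the records, and only rank 0 prints them, in rank order. Below
is a minimal sketch of the formatting and printing side. LINE_LEN and
the helper names format_record and print_in_rank_order are hypothetical,
not part of the posted program, and the MPI_Gather call appears only as
a comment so the sketch compiles without an MPI installation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LINE_LEN 128  /* assumed fixed record width per rank */

/* Format one rank's message into a fixed-width, NUL-padded record. */
void format_record(char *record, int rank, const char *msg)
{
    assert(record != NULL && msg != NULL);
    memset(record, 0, LINE_LEN);
    snprintf(record, LINE_LEN, "rank %d: %s", rank, msg);
}

/*
 * Print nranks records of LINE_LEN bytes each, in rank order, from one
 * process. On rank 0 the buffer would be filled by something like:
 *
 *   MPI_Gather(record, LINE_LEN, MPI_CHAR,
 *              gathered, LINE_LEN, MPI_CHAR, 0, MPI_COMM_WORLD);
 *
 * Returns 0 on success and a nonzero value on an I/O error.
 */
int print_in_rank_order(FILE *out, const char *gathered, int nranks)
{
    assert(out != NULL && gathered != NULL && nranks >= 0);
    for (int r = 0; r < nranks; r++) {
        const char *rec = gathered + (size_t)r * LINE_LEN;
        /* The precision bounds the read if a record lacks a NUL. */
        if (fprintf(out, "%.*s\n", LINE_LEN, rec) < 0)
            return -1;
    }
    return fflush(out);
}
```

Because a single process writes every line, the order in the file is the
order of the loop and no longer depends on how the launcher or the
filesystem interleaves the separate streams.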

Jeff

On Tue, Jul 9, 2013 at 12:05 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>
> printf buffers output and doesn't write it out immediately (or in this case,
> send it to mpiexec immediately).  Try doing "fflush(stdout);" immediately
> after printf.
>
>  -- Pavan
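
A small sketch of that suggestion, wrapped in a helper so that no call
site forgets the flush. The name print_now is hypothetical and not part
of the posted program.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/*
 * printf-style helper that flushes the stream after every call.
 * Returns the number of characters written, or -1 on error.
 */
int print_now(FILE *stream, const char *fmt, ...)
{
    assert(stream != NULL && fmt != NULL);

    va_list ap;
    va_start(ap, fmt);
    int n = vfprintf(stream, fmt, ap);
    va_end(ap);

    /* Push the buffered bytes out now instead of at exit. */
    if (n < 0 || fflush(stream) == EOF)
        return -1;
    return n;
}
```

In the posted program, print_now(stdout, "rank %d(before): ...", ...)
would replace each printf. Calling setvbuf(stdout, NULL, _IOLBF, 0) once
at the top of main, before any output, is another option. Either way
this only controls when each process hands its bytes off; it does not by
itself order the lines from different ranks.
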
>
>
> On 07/09/2013 12:01 PM, Sufeng Niu wrote:
>>
>> Hello,
>>
>> Sorry to post this long, stupid and simple question.
>> I found that sometimes MPI_Barrier does not seem to stop all the
>> processes. I wrote a simple test program that creates the struct shown below:
>>
>> #include <stdlib.h>
>> #include <stdio.h>
>> #include "mpi.h"
>>
>> typedef struct
>> {
>>      int a;
>>      char b;
>>      int c;
>>      int d;
>> } foo;
>>
>> int main(int argc, char *argv[])
>> {
>>
>>      int rank, size;
>>      int i;
>>
>>      foo    x;
>>
>>      MPI_Init(&argc, &argv);
>>      MPI_Comm_size(MPI_COMM_WORLD, &size);
>>      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>      char processor_name[MPI_MAX_PROCESSOR_NAME];
>>      int name_len;
>>      MPI_Get_processor_name(processor_name, &name_len);
>>      printf("-- processor %s, rank %d out of %d processors\n",
>> processor_name, rank, size);
>>
>>      MPI_Barrier(MPI_COMM_WORLD);
>>
>>
>>      int count=4;
>>
>>      MPI_Datatype testtype;
>>      MPI_Datatype types[4] = {MPI_INT, MPI_CHAR, MPI_INT, MPI_DOUBLE};
>>      int len[4] = {1, 1, 1, 1};
>>      MPI_Aint disp[4];
>>      long int base;
>>
>>      MPI_Address(&x, disp);
>>      MPI_Address(&(x.a), disp+1);
>>      MPI_Address(&(x.b), disp+2);
>>      MPI_Address(&(x.c), disp+3);
>>      base = disp[0];
>>      for(i=0; i<4; i++) disp[i] -= base;
>>
>>      MPI_Type_struct(count, len, disp, types, &testtype);
>>      MPI_Type_commit(&testtype);
>>
>>      if(rank == 0){
>>          x.a = 2;
>>          x.b = 0;
>>          x.c = 10;
>>          x.d = 3;
>>      }
>>
>>      printf("rank %d(before): x value is %d, %d, %d, %d\n", rank, x.a,
>> x.b, x.c, x.d);
>>      MPI_Barrier(MPI_COMM_WORLD);
>>
>>      MPI_Bcast(&x, 1, testtype, 0, MPI_COMM_WORLD);
>>
>>      printf("rank %d(after): x value is %d, %d, %d, %d\n", rank, x.a,
>> x.b, x.c, x.d);
>>
>>      MPI_Finalize();
>>
>>      return 0;
>> }
>>
>> The output should look like:
>> -- processor iocfccd3.aps.anl.gov, rank 0 out of 4 processors
>> -- processor iocfccd3.aps.anl.gov, rank 3 out of 4 processors
>> -- processor iocfccd3.aps.anl.gov, rank 1 out of 4 processors
>> -- processor iocfccd3.aps.anl.gov, rank 2 out of 4 processors
>> rank 0(before): x value is 2, 0, 10, 3
>> rank 1(before): x value is 1197535864, -1, 4994901, 0
>> rank 2(before): x value is 1591464488, -1, 4994901, 0
>> rank 3(before): x value is 1851622184, -1, 4994901, 0
>> rank 0(after): x value is 2, 0, 10, 3
>> rank 3(after): x value is 2, 0, 10, 3
>> rank 1(after): x value is 2, 0, 10, 3
>> rank 2(after): x value is 2, 0, 10, 3
>>
>> but sometimes it shows as:
>> -- processor iocfccd3.aps.anl.gov, rank 0 out of 4 processors
>> rank 0(before): x value is 2, 0, 10, 3
>> rank 0(after): x value is 2, 0, 10, 3
>> -- processor iocfccd3.aps.anl.gov, rank 1 out of 4 processors
>> rank 1(before): x value is -464731256, -1, 4994901, 0
>> rank 1(after): x value is 2, 0, 10, 3
>> -- processor iocfccd3.aps.anl.gov, rank 2 out of 4 processors
>> rank 2(before): x value is 1863042488, -1, 4994901, 0
>> rank 2(after): x value is 2, 0, 10, 3
>> -- processor iocfccd3.aps.anl.gov, rank 3 out of 4 processors
>> rank 3(before): x value is 1721065144, -1, 4994901, 0
>> rank 3(after): x value is 2, 0, 10, 3
>>
>> or
>> -- processor iocfccd3.aps.anl.gov, rank 0 out of 4 processors
>> -- processor iocfccd3.aps.anl.gov, rank 1 out of 4 processors
>> rank 1(before): x value is -1883169624, -1, 4994901, 0
>> -- processor iocfccd3.aps.anl.gov, rank 2 out of 4 processors
>> rank 2(before): x value is -451256152, -1, 4994901, 0
>> -- processor iocfccd3.aps.anl.gov, rank 3 out of 4 processors
>> rank 3(before): x value is 1715067240, -1, 4994901, 0
>> rank 0(before): x value is 2, 0, 10, 3
>> rank 0(after): x value is 2, 0, 10, 3
>> rank 1(after): x value is 2, 0, 10, 3
>> rank 2(after): x value is 2, 0, 10, 3
>> rank 3(after): x value is 2, 0, 10, 3
>>
>> The ordering is random, and I am not sure where the problem is.
>>
>> The second issue is that I use MPI_Datatype to create an MPI struct type
>> for the broadcast. However, in the program shown above, if I change the struct:
>> typedef struct
>> {
>>      int a;
>>      char b;
>>      int c;
>>      int d;
>> } foo;
>> to
>> typedef struct
>> {
>>      int a;
>>      char b;
>>      int c;
>>      double d;
>> } foo;
>>
>> I found the result is:
>> -- processor ephesus.ece.iit.edu, rank 1 out of 4 processors
>> -- processor ephesus.ece.iit.edu, rank 2 out of 4 processors
>> -- processor ephesus.ece.iit.edu, rank 3 out of 4 processors
>> -- processor ephesus.ece.iit.edu, rank 0 out of 4 processors
>> rank 0(before): x value is 2, 0, 10, 3.250000
>> rank 3(before): x value is 0, 0, 547474368, 0.000000
>> rank 1(before): x value is 0, 0, 547474368, 0.000000
>> rank 2(before): x value is 0, 0, 547474368, 0.000000
>> rank 0(after): x value is 2, 0, 10, 3.250000
>> rank 2(after): x value is 2, 0, 10, 0.000000 <- should be 3.25
>> rank 3(after): x value is 2, 0, 10, 0.000000 <- should be 3.25
>> rank 1(after): x value is 2, 0, 10, 0.000000 <- should be 3.25
>>
>> Do you have any clues on why this happened? Thanks a lot!
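
On the datatype question, a likely cause is the displacement array. The
posted code takes the addresses of x, x.a, x.b and x.c, so the four
displacements describe the start of the struct plus a, b and c, and d is
never measured. The MPI_DOUBLE entry is therefore paired with the offset
of c, and the bytes of d are never part of the broadcast. That would
match a, b and c arriving while d stays at 0. The sketch below compares
the two displacement sets using offsetof; the function names are
hypothetical, and size_t stands in for MPI_Aint so that it compiles
without MPI.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int a;
    char b;
    int c;
    double d;
} foo;

/* Displacements as the posted program computes them, relative to &x:
 * &x, &x.a, &x.b, &x.c. Member d is missing. */
void posted_displacements(size_t disp[4])
{
    assert(disp != NULL);
    disp[0] = 0;                 /* &x   - &x */
    disp[1] = offsetof(foo, a);  /* &x.a - &x */
    disp[2] = offsetof(foo, b);  /* &x.b - &x */
    disp[3] = offsetof(foo, c);  /* &x.c - &x */
}

/* One displacement per member, in the same order as the types array
 * {MPI_INT, MPI_CHAR, MPI_INT, MPI_DOUBLE}. */
void member_displacements(size_t disp[4])
{
    assert(disp != NULL);
    disp[0] = offsetof(foo, a);
    disp[1] = offsetof(foo, b);
    disp[2] = offsetof(foo, c);
    disp[3] = offsetof(foo, d);
}
```

In the MPI program the second set corresponds to calling
MPI_Get_address on x.a, x.b, x.c and x.d, subtracting the address of x,
and passing the result to MPI_Type_create_struct. MPI_Address and
MPI_Type_struct were deprecated in MPI-2 and removed in MPI-3.0, so the
newer names are the safer choice.
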
>>
>> --
>> Best Regards,
>> Sufeng Niu
>> ECASP lab, ECE department, Illinois Institute of Technology
>> Tel: 312-731-7219
>>
>>
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
>>
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji



-- 
Jeff Hammond
jeff.science at gmail.com


