[mpich-discuss] MPI_Barrier cannot stop all process randomly
Sufeng Niu
sniu at hawk.iit.edu
Tue Jul 9 12:50:09 CDT 2013
I see, thanks a lot! That makes sense. I tried fflush(stdout), but it still
does not seem to work properly. I understand, though, that printf is not a
proper way to check the timing.
Thank you!
Sufeng
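
For readers following the thread: one way to get the "global total ordering
of output streams" that Jeff mentions below is to stop printing from every
rank and let a single rank write everything. The following is only a sketch
under assumptions, not something proposed in the thread. Each rank formats
its line into a fixed-size slot; the slots are assumed to be collected on
rank 0 (for example with MPI_Gather of LINE_LEN MPI_CHAR elements per rank,
which is left out here so the snippet compiles without MPI); rank 0 then
joins them in rank order and prints once. The names LINE_LEN,
format_rank_line and join_in_rank_order are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LINE_LEN 128  /* fixed slot size per rank; an arbitrary choice */

/* Every rank: format the message into its own slot instead of calling
 * printf directly. snprintf always NUL-terminates within LINE_LEN. */
int format_rank_line(char *slot, int rank, int a, int b, int c, int d)
{
    return snprintf(slot, LINE_LEN,
                    "rank %d(before): x value is %d, %d, %d, %d\n",
                    rank, a, b, c, d);
}

/* Rank 0 only: concatenate the gathered slots in rank order into `out`.
 * `slots` is assumed to hold nranks consecutive LINE_LEN-byte slots, the
 * layout a gather to the root would produce.
 * Returns the number of characters written, excluding the terminator. */
size_t join_in_rank_order(char *out, size_t outsz,
                          const char *slots, int nranks)
{
    size_t used = 0;

    assert(out != NULL && outsz > 0 && slots != NULL);
    out[0] = '\0';
    for (int r = 0; r < nranks; r++) {
        const char *line = slots + (size_t)r * LINE_LEN;
        size_t len = strlen(line);

        if (used + len >= outsz)  /* stop rather than overflow `out` */
            break;
        memcpy(out + used, line, len + 1);
        used += len;
    }
    return used;
}
```

On rank 0 the joined buffer would then go out with a single
fputs(out, stdout) followed by fflush(stdout). Because only one process
writes, the order of the lines should no longer depend on how mpiexec
forwards the streams of the individual ranks.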
On Tue, Jul 9, 2013 at 12:24 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
> ...and depending on your filesystem (assuming you pipe stdout to a
> file), even that isn't sufficient.
>
> printf is a terrible way to evaluate temporal relationships and you
> shouldn't try to reason about parallel programs with it unless you've
> taken great care to ensure a global total ordering of output streams.
>
> Jeff
>
> On Tue, Jul 9, 2013 at 12:05 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> >
> > printf buffers output and doesn't write it out immediately (or in this case,
> > send it to mpiexec immediately). Try doing "fflush(stdout);" immediately
> > after printf.
> >
> > -- Pavan
> >
> >
> > On 07/09/2013 12:01 PM, Sufeng Niu wrote:
> >>
> >> Hello,
> >>
> > >> Sorry to post this long, stupid, and simple question.
> > >> I found that MPI_Barrier sometimes does not seem to stop all the
> > >> processes. I wrote a simple test program that creates a struct datatype,
> > >> shown below:
> >>
> > >> #include <stdlib.h>
> > >> #include <stdio.h>
> > >> #include "mpi.h"
> > >>
> > >> typedef struct
> > >> {
> > >>     int a;
> > >>     char b;
> > >>     int c;
> > >>     int d;
> > >> } foo;
> > >>
> > >> int main(int argc, char *argv[])
> > >> {
> > >>     int rank, size;
> > >>     int i;
> > >>
> > >>     foo x;
> > >>
> > >>     MPI_Init(&argc, &argv);
> > >>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> > >>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > >>
> > >>     char processor_name[MPI_MAX_PROCESSOR_NAME];
> > >>     int name_len;
> > >>     MPI_Get_processor_name(processor_name, &name_len);
> > >>     printf("-- processor %s, rank %d out of %d processors\n",
> > >>            processor_name, rank, size);
> > >>
> > >>     MPI_Barrier(MPI_COMM_WORLD);
> > >>
> > >>     int count=4;
> > >>
> > >>     MPI_Datatype testtype;
> > >>     MPI_Datatype types[4] = {MPI_INT, MPI_CHAR, MPI_INT, MPI_DOUBLE};
> > >>     int len[4] = {1, 1, 1, 1};
> > >>     MPI_Aint disp[4];
> > >>     long int base;
> > >>
> > >>     MPI_Address(&x, disp);
> > >>     MPI_Address(&(x.a), disp+1);
> > >>     MPI_Address(&(x.b), disp+2);
> > >>     MPI_Address(&(x.c), disp+3);
> > >>     base = disp[0];
> > >>     for(i=0; i<4; i++) disp[i] -= base;
> > >>
> > >>     MPI_Type_struct(count, len, disp, types, &testtype);
> > >>     MPI_Type_commit(&testtype);
> > >>
> > >>     if(rank == 0){
> > >>         x.a = 2;
> > >>         x.b = 0;
> > >>         x.c = 10;
> > >>         x.d = 3;
> > >>     }
> > >>
> > >>     printf("rank %d(before): x value is %d, %d, %d, %d\n", rank, x.a,
> > >>            x.b, x.c, x.d);
> > >>     MPI_Barrier(MPI_COMM_WORLD);
> > >>
> > >>     MPI_Bcast(&x, 1, testtype, 0, MPI_COMM_WORLD);
> > >>
> > >>     printf("rank %d(after): x value is %d, %d, %d, %d\n", rank, x.a,
> > >>            x.b, x.c, x.d);
> > >>
> > >>     MPI_Finalize();
> > >>
> > >>     return 0;
> > >> }
> >>
> > >> The output should look like:
> > >> -- processor iocfccd3.aps.anl.gov, rank 0 out of 4 processors
> > >> -- processor iocfccd3.aps.anl.gov, rank 3 out of 4 processors
> > >> -- processor iocfccd3.aps.anl.gov, rank 1 out of 4 processors
> > >> -- processor iocfccd3.aps.anl.gov, rank 2 out of 4 processors
> >> rank 0(before): x value is 2, 0, 10, 3
> >> rank 1(before): x value is 1197535864, -1, 4994901, 0
> >> rank 2(before): x value is 1591464488, -1, 4994901, 0
> >> rank 3(before): x value is 1851622184, -1, 4994901, 0
> >> rank 0(after): x value is 2, 0, 10, 3
> >> rank 3(after): x value is 2, 0, 10, 3
> >> rank 1(after): x value is 2, 0, 10, 3
> >> rank 2(after): x value is 2, 0, 10, 3
> >>
> > >> but sometimes it shows up as:
> > >> -- processor iocfccd3.aps.anl.gov, rank 0 out of 4 processors
> > >> rank 0(before): x value is 2, 0, 10, 3
> > >> rank 0(after): x value is 2, 0, 10, 3
> > >> -- processor iocfccd3.aps.anl.gov, rank 1 out of 4 processors
> > >> rank 1(before): x value is -464731256, -1, 4994901, 0
> > >> rank 1(after): x value is 2, 0, 10, 3
> > >> -- processor iocfccd3.aps.anl.gov, rank 2 out of 4 processors
> > >> rank 2(before): x value is 1863042488, -1, 4994901, 0
> > >> rank 2(after): x value is 2, 0, 10, 3
> > >> -- processor iocfccd3.aps.anl.gov, rank 3 out of 4 processors
> > >> rank 3(before): x value is 1721065144, -1, 4994901, 0
> > >> rank 3(after): x value is 2, 0, 10, 3
> >>
> >> or
> > >> -- processor iocfccd3.aps.anl.gov, rank 0 out of 4 processors
> > >> -- processor iocfccd3.aps.anl.gov, rank 1 out of 4 processors
> > >> rank 1(before): x value is -1883169624, -1, 4994901, 0
> > >> -- processor iocfccd3.aps.anl.gov, rank 2 out of 4 processors
> > >> rank 2(before): x value is -451256152, -1, 4994901, 0
> > >> -- processor iocfccd3.aps.anl.gov, rank 3 out of 4 processors
> >> rank 3(before): x value is 1715067240, -1, 4994901, 0
> >> rank 0(before): x value is 2, 0, 10, 3
> >> rank 0(after): x value is 2, 0, 10, 3
> >> rank 1(after): x value is 2, 0, 10, 3
> >> rank 2(after): x value is 2, 0, 10, 3
> >> rank 3(after): x value is 2, 0, 10, 3
> >>
> > >> The order is random each time, and I am not sure where the problem is.
> >>
> > >> The second issue: I use MPI_Datatype to create an MPI struct type for the
> > >> broadcast. However, if I change the struct in the program shown above from
> > >> typedef struct
> > >> {
> > >>     int a;
> > >>     char b;
> > >>     int c;
> > >>     int d;
> > >> } foo;
> > >> to
> > >> typedef struct
> > >> {
> > >>     int a;
> > >>     char b;
> > >>     int c;
> > >>     double d;
> > >> } foo;
> >>
> > >> I get this result:
> > >> -- processor ephesus.ece.iit.edu, rank 1 out of 4 processors
> > >> -- processor ephesus.ece.iit.edu, rank 2 out of 4 processors
> > >> -- processor ephesus.ece.iit.edu, rank 3 out of 4 processors
> > >> -- processor ephesus.ece.iit.edu, rank 0 out of 4 processors
> > >> rank 0(before): x value is 2, 0, 10, 3.250000
> > >> rank 3(before): x value is 0, 0, 547474368, 0.000000
> > >> rank 1(before): x value is 0, 0, 547474368, 0.000000
> > >> rank 2(before): x value is 0, 0, 547474368, 0.000000
> > >> rank 0(after): x value is 2, 0, 10, 3.250000
> > >> rank 2(after): x value is 2, 0, 10, 0.000000 <- should be 3.25
> > >> rank 3(after): x value is 2, 0, 10, 0.000000 <- should be 3.25
> > >> rank 1(after): x value is 2, 0, 10, 0.000000 <- should be 3.25
> > >>
> >> Do you have any clues on why this happened? Thanks a lot!
> >>
> >> --
> >> Best Regards,
> >> Sufeng Niu
> >> ECASP lab, ECE department, Illinois Institute of Technology
> >> Tel: 312-731-7219
> >>
> >>
> >> _______________________________________________
> >> discuss mailing list discuss at mpich.org
> >> To manage subscription options or unsubscribe:
> >> https://lists.mpich.org/mailman/listinfo/discuss
> >>
> >
> > --
> > Pavan Balaji
> > http://www.mcs.anl.gov/~balaji
>
>
>
> --
> Jeff Hammond
> jeff.science at gmail.com
>
--
Best Regards,
Sufeng Niu
ECASP lab, ECE department, Illinois Institute of Technology
Tel: 312-731-7219
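
A note on the second question in the original message (d not arriving after
MPI_Bcast when it is a double). The thread does not answer it, so the
following is a reading of the posted code, not a confirmed diagnosis: the
first MPI_Address call stores the address of x itself in disp[0], the
members a, b and c go into disp[1] to disp[3], and x.d never gets a
displacement. After subtracting the base, the type map pairs
{MPI_INT, MPI_CHAR, MPI_INT, MPI_DOUBLE} with the offsets of {x, a, b, c}.
The plain-C sketch below reproduces both sets of displacements with offsetof
so they can be compared without MPI; posted_displacements and
member_displacements are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>  /* offsetof, size_t */

/* Same layout as the second struct in the question (d is a double). */
typedef struct {
    int a;
    char b;
    int c;
    double d;
} foo;

#define FOO_FIELDS 4

/* Displacements as the posted code effectively computes them:
 * disp[0] comes from &x itself, so every member lands one slot late
 * and x.d never gets an entry. */
void posted_displacements(size_t disp[FOO_FIELDS])
{
    assert(disp != NULL);
    disp[0] = 0;                 /* &x   - &x */
    disp[1] = offsetof(foo, a);  /* &x.a - &x */
    disp[2] = offsetof(foo, b);  /* &x.b - &x */
    disp[3] = offsetof(foo, c);  /* &x.c - &x */
}

/* Displacements that line up with types[] = {int, char, int, double}. */
void member_displacements(size_t disp[FOO_FIELDS])
{
    assert(disp != NULL);
    disp[0] = offsetof(foo, a);
    disp[1] = offsetof(foo, b);
    disp[2] = offsetof(foo, c);
    disp[3] = offsetof(foo, d);
}
```

On a typical ABI with 4-byte int and 8-byte aligned double (an assumption),
the posted displacements are 0, 0, 4, 8, so the MPI_DOUBLE entry covers
bytes 8 to 15 and the double d at offset 16 is never transferred. With
int d at offset 12 the same bytes happen to include d, which may be why the
first version looked correct. In MPI code the member displacements would be
passed to MPI_Type_create_struct, obtained either from offsetof or from
MPI_Get_address on each member; these are the current replacements for the
deprecated MPI_Type_struct and MPI_Address.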