[mpich-discuss] About an error while using mpi i/o collectives : "Error in ADIOI_Calc_aggregator(): rank_index(1)..."

Thakur, Rajeev thakur at anl.gov
Wed Aug 23 10:12:51 CDT 2017


> Just curious: Is/Was there any specific reason for this constraint?

So that it is easier for implementations to support.

Rajeev

> On Aug 23, 2017, at 12:35 AM, pramod kumbhar <pramod.s.kumbhar at gmail.com> wrote:
> 
> Hi Rajeev, Rob,
> 
> Thanks for the clarification. I see the etype and filetype specification in the standard mentions the "monotonically nondecreasing" constraint.
> Just curious: Is/Was there any specific reason for this constraint?
> 
> @Rob: the error message I posted in the first email was triggered by the application violating this constraint. Depending on the number of ranks and the order of offsets, the application deadlocks, crashes, etc.
> 
> Thanks,
> 
> Pramod
> 
> 
> On Wed, Aug 23, 2017 at 3:07 AM, Latham, Robert J. <robl at mcs.anl.gov> wrote:
> On Tue, 2017-08-22 at 22:27 +0000, Thakur, Rajeev wrote:
> > Yes, displacements for the filetype must be in “monotonically
> > nondecreasing order”.
> 
> ... which sounds pretty restrictive, but there is no constraint on
> memory types.  Folks work around this by shuffling the memory addresses
> to match the ascending file offsets.
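> 
> For instance, here is a minimal sketch of that shuffle (the Block struct
> and build_types helper are hypothetical, not part of ROMIO or MPI):
> 
>     #include <mpi.h>
>     #include <algorithm>
>     #include <vector>
> 
>     // One contiguous chunk: its file offset and its displacement
>     // within the in-memory buffer.
>     struct Block { MPI_Aint file_off; MPI_Aint mem_disp; int len; };
> 
>     // Build a filetype with monotonically nondecreasing displacements
>     // and a matching memory type, from blocks supplied in any order.
>     void build_types(std::vector<Block> blocks,
>                      MPI_Datatype *filetype, MPI_Datatype *memtype)
>     {
>         // Sort by file offset, carrying each memory displacement along.
>         std::sort(blocks.begin(), blocks.end(),
>                   [](const Block &a, const Block &b)
>                   { return a.file_off < b.file_off; });
> 
>         std::vector<int> lens;
>         std::vector<MPI_Aint> fdisp, mdisp;
>         for (const Block &b : blocks) {
>             lens.push_back(b.len);
>             fdisp.push_back(b.file_off);  // ascending after the sort
>             mdisp.push_back(b.mem_disp);  // free to be out of order
>         }
> 
>         MPI_Type_create_hindexed((int)lens.size(), lens.data(), fdisp.data(),
>                                  MPI_BYTE, filetype);
>         MPI_Type_create_hindexed((int)lens.size(), lens.data(), mdisp.data(),
>                                  MPI_BYTE, memtype);
>         MPI_Type_commit(filetype);
>         MPI_Type_commit(memtype);
>     }
> 
> Set the view with the sorted filetype and then call
> MPI_File_write_all(fh, buf, 1, memtype, &status): the same bytes should
> land in the same places, just described in ascending file order.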
> 
> ==rob
> 
> >
> > Rajeev
> >
> > > On Aug 22, 2017, at 3:05 PM, pramod kumbhar <pramod.s.kumbhar at gmail.com> wrote:
> > >
> > > Hi Rob,
> > >
> > > Thanks! Below is not exactly the same issue/error, but it's related:
> > >
> > > While constructing a derived datatype (the filetype used for set_view),
> > > do we need the displacements / offsets to be in ascending order?
> > > I mean, suppose I am creating a derived datatype using
> > > MPI_Type_create_hindexed (or an MPI struct type) with
> > > lengths/displacements as:
> > >
> > >         blocklengths[0] = 8;
> > >         blocklengths[1] = 231670;
> > >         blocklengths[2] = 116606;
> > >
> > >         displacements[0] = 0;
> > >         displacements[1] = 8;
> > >         displacements[2] = 231678;
> > >
> > > Above displacements are in ascending order. Suppose I shuffle the order
> > > a bit:
> > >
> > >         blocklengths[0] = 8;
> > >         blocklengths[1] = 116606;
> > >         blocklengths[2] = 231670;
> > >
> > >         displacements[0] = 0;
> > >         displacements[1] = 231678;
> > >         displacements[2] = 8;
> > >
> > > It's still the same data, but I specified the block-lengths/offsets in a
> > > different order. (The resultant file will have the data in a different
> > > order, but that's ignored here.)
> > > Isn't this a valid specification? This second example results in a
> > > segfault (in ADIOI_GEN_WriteStrided / ADIOI_GEN_WriteStridedColl).
> > >
> > > I quickly wrote the attached program; let me know if I have missed
> > > anything obvious here.
> > >
> > > Regards,
> > > Pramod
> > >
> > > p.s. you can compile & run as:
> > >
> > > Not working => mpicxx test.cpp && mpirun -n 2 ./a.out
> > > Working     => mpicxx test.cpp -DUSE_ORDER && mpirun -n 2 ./a.out
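> > > 
> > > For reference, a minimal sketch of what the test does (illustrative,
> > > not the attached test.cpp itself; the file name and buffer fill are
> > > made up):
> > > 
> > >     #include <mpi.h>
> > >     #include <vector>
> > > 
> > >     int main(int argc, char **argv)
> > >     {
> > >         MPI_Init(&argc, &argv);
> > > 
> > >         int      blocklengths[3];
> > >         MPI_Aint displacements[3];
> > >     #ifdef USE_ORDER
> > >         /* ascending file offsets: works */
> > >         blocklengths[0] = 8;      displacements[0] = 0;
> > >         blocklengths[1] = 231670; displacements[1] = 8;
> > >         blocklengths[2] = 116606; displacements[2] = 231678;
> > >     #else
> > >         /* shuffled file offsets: crashes in the strided write path */
> > >         blocklengths[0] = 8;      displacements[0] = 0;
> > >         blocklengths[1] = 116606; displacements[1] = 231678;
> > >         blocklengths[2] = 231670; displacements[2] = 8;
> > >     #endif
> > > 
> > >         MPI_Datatype filetype;
> > >         MPI_Type_create_hindexed(3, blocklengths, displacements,
> > >                                  MPI_BYTE, &filetype);
> > >         MPI_Type_commit(&filetype);
> > > 
> > >         int rank;
> > >         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > >         const MPI_Offset total = 8 + 231670 + 116606;  /* 348284 bytes */
> > >         std::vector<char> buf(total, 'x');
> > > 
> > >         MPI_File fh;
> > >         MPI_File_open(MPI_COMM_WORLD, "test.out",
> > >                       MPI_MODE_CREATE | MPI_MODE_WRONLY,
> > >                       MPI_INFO_NULL, &fh);
> > >         /* each rank views its own 348284-byte region of the file */
> > >         MPI_File_set_view(fh, rank * total, MPI_BYTE, filetype,
> > >                           "native", MPI_INFO_NULL);
> > >         MPI_File_write_all(fh, buf.data(), (int)buf.size(), MPI_BYTE,
> > >                            MPI_STATUS_IGNORE);
> > >         MPI_File_close(&fh);
> > > 
> > >         MPI_Type_free(&filetype);
> > >         MPI_Finalize();
> > >         return 0;
> > >     }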
> > >
> > >
> > >
> > > On Tue, Aug 22, 2017 at 5:25 PM, Latham, Robert J. <robl at mcs.anl.gov> wrote:
> > > On Mon, 2017-08-21 at 17:45 +0200, pramod kumbhar wrote:
> > > > Dear All,
> > > >
> > > > In one of our applications I am seeing the following error while using
> > > > the collective call MPI_File_write_all:
> > > >
> > > > Error in ADIOI_Calc_aggregator(): rank_index(1) >= fd->hints->cb_nodes (1) fd_size=102486061 off=102486469
> > > >
> > > > The non-collective version works fine.
> > > >
> > > > While looking at the call stack I came across the below comment in
> > > > mpich-3.2/src/mpi/romio/adio/common/ad_aggregate.c:
> > > >
> > > >     /* we index into fd_end with rank_index, and fd_end was allocated
> > > >      * to be no bigger than fd->hints->cb_nodes.   If we ever violate
> > > >      * that, we're overrunning arrays.  Obviously, we should never
> > > >      * ever hit this abort */
> > > >     if (rank_index >= fd->hints->cb_nodes || rank_index < 0) {
> > > >         FPRINTF(stderr, "Error in ADIOI_Calc_aggregator(): rank_index(%d) >= fd->hints->cb_nodes (%d) fd_size=%lld off=%lld\n",
> > > >                 rank_index, fd->hints->cb_nodes, fd_size, off);
> > > >         MPI_Abort(MPI_COMM_WORLD, 1);
> > > >     }
> > > >
> > > > I am going to look into the application and see if there is an issue
> > > > with offset overflow. But looking at the above comment ("Obviously, we
> > > > should never ever hit this abort") I thought I should ask if there is
> > > > any obvious thing I am missing.
> > >
> > > that's my comment.  The array indexed by 'rank_index' (fd_end) is
> > > allocated based on the 'cb_nodes' hint.  I definitely would like to
> > > know more about how the code is manipulating rank_index, cb_nodes,
> > > and fd_end.
> > >
> > > If there is a reduced test case you can send me, that will be a huge
> > > help.
> > >
> > > ==rob
> > >
> > > >
> > > > Regards,
> > > > Pramod
> > > >
> > > > p.s. I will provide a reproducer after looking into this more
> > > > carefully.
> > >
> > >
> > > <test.cpp>
> >
> 

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

