[mpich-discuss] About an error while using mpi i/o collectives : "Error in ADIOI_Calc_aggregator(): rank_index(1)..."
pramod kumbhar
pramod.s.kumbhar at gmail.com
Wed Aug 23 00:37:30 CDT 2017
> Thanks for clarification. I see type and filetype specification in the
> standard mention "monotonically nondecreasing" constraint.
>
Correction: I meant etype and filetype.
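
For anyone hitting the same thing, here is a minimal sketch (plain C, no MPI
calls) of checking the "monotonically nondecreasing" constraint and of
reordering the blocks by displacement before the filetype is built. The
struct block_t and the helpers is_nondecreasing() and
sort_blocks_by_displacement() are hypothetical names made up for this
illustration; they are not part of MPI or ROMIO. It also assumes a plain
long is wide enough for the displacements (real MPI code would use MPI_Aint).

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper type: one (blocklength, displacement) pair of a filetype. */
typedef struct {
    int  blocklength;
    long displacement;   /* byte displacement; MPI code would use MPI_Aint */
} block_t;

/* Returns 1 if the displacements are monotonically nondecreasing, else 0. */
static int is_nondecreasing(const block_t *blocks, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (blocks[i].displacement < blocks[i - 1].displacement)
            return 0;
    }
    return 1;
}

/* qsort comparator: ascending by displacement. */
static int compare_displacement(const void *a, const void *b)
{
    const block_t *x = (const block_t *)a;
    const block_t *y = (const block_t *)b;
    return (x->displacement > y->displacement) - (x->displacement < y->displacement);
}

/* Reorders the blocks in place so that the displacements ascend. */
static void sort_blocks_by_displacement(block_t *blocks, size_t n)
{
    qsort(blocks, n, sizeof blocks[0], compare_displacement);
}
```

With the shuffled example quoted below (displacements 0, 231678, 8), the
check fails before sorting and passes afterwards. The sorted pairs can then
be copied into the arrays passed to MPI_Type_create_hindexed, and the memory
side has to be arranged to match the new file order, as Rob describes.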
-Pramod
On Wed, Aug 23, 2017 at 3:07 AM, Latham, Robert J. <robl at mcs.anl.gov> wrote:
>
>> On Tue, 2017-08-22 at 22:27 +0000, Thakur, Rajeev wrote:
>> > Yes, displacements for the filetype must be in “monotonically
>> > nondecreasing order”.
>>
>> ... which sounds pretty restrictive, but there is no constraint on
>> memory types. Folks work around this by shuffling the memory addresses
>> to match the ascending file offsets.
>>
>> ==rob
>>
>> >
>> > Rajeev
>> >
>> > > On Aug 22, 2017, at 3:05 PM, pramod kumbhar <pramod.s.kumbhar at gmail.com> wrote:
>> > >
>> > > Hi Rob,
>> > >
>> > > Thanks! The following is not exactly the same issue/error, but it is related:
>> > >
>> > > While constructing a derived datatype (the filetype used for set_view),
>> > > do the displacements / offsets need to be in ascending order?
>> > > I mean, suppose I am creating a derived datatype using
>> > > MPI_Type_create_hindexed (or an MPI struct) with lengths/displacements
>> > > as:
>> > >
>> > > blocklengths[0] = 8;
>> > > blocklengths[1] = 231670;
>> > > blocklengths[2] = 116606;
>> > >
>> > > displacements[0] = 0;
>> > > displacements[1] = 8;
>> > > displacements[2] = 231678;
>> > >
>> > > The displacements above are in ascending order. Suppose I shuffle the
>> > > order a bit:
>> > >
>> > > blocklengths[0] = 8;
>> > > blocklengths[1] = 116606;
>> > > blocklengths[2] = 231670;
>> > >
>> > > displacements[0] = 0;
>> > > displacements[1] = 231678;
>> > > displacements[2] = 8;
>> > >
>> > > The blocks are still the same; I only changed the order in which the
>> > > block lengths/offsets are specified. (The resulting file will have its
>> > > data in a different order, but that is ignored here.)
>> > > Isn't this a valid specification? This second example results in a
>> > > segfault (in ADIO_GEN_WriteStrided / Coll).
>> > >
>> > > I quickly wrote the attached program; let me know if I have missed
>> > > anything obvious here.
>> > >
>> > > Regards,
>> > > Pramod
>> > >
>> > > p.s. you can compile & run as:
>> > >
>> > > Not working => mpicxx test.cpp && mpirun -n 2 ./a.out
>> > > Working => mpicxx test.cpp -DUSE_ORDER && mpirun -n 2 ./a.out
>> > >
>> > >
>> > >
>> > > On Tue, Aug 22, 2017 at 5:25 PM, Latham, Robert J. <robl at mcs.anl.gov> wrote:
>> > > On Mon, 2017-08-21 at 17:45 +0200, pramod kumbhar wrote:
>> > > > Dear All,
>> > > >
>> > > > In one of our applications I am seeing the following error while using
>> > > > the collective call MPI_File_write_all:
>> > > >
>> > > > Error in ADIOI_Calc_aggregator(): rank_index(1) >= fd->hints->cb_nodes (1) fd_size=102486061 off=102486469
>> > > >
>> > > > The non-collective version works fine.
>> > > >
>> > > > While looking at the call stack I came across the comment below in
>> > > > mpich-3.2/src/mpi/romio/adio/common/ad_aggregate.c:
>> > > >
>> > > > /* we index into fd_end with rank_index, and fd_end was allocated to be no
>> > > >  * bigger than fd->hins->cb_nodes. If we ever violate that, we're
>> > > >  * overrunning arrays. Obviously, we should never ever hit this abort */
>> > > > if (rank_index >= fd->hints->cb_nodes || rank_index < 0) {
>> > > >     FPRINTF(stderr, "Error in ADIOI_Calc_aggregator(): rank_index(%d) >= fd->hints->cb_nodes (%d) fd_size=%lld off=%lld\n",
>> > > >             rank_index,fd->hints->cb_nodes,fd_size,off);
>> > > >     MPI_Abort(MPI_COMM_WORLD, 1);
>> > > > }
>> > > >
>> > > > I am going to look into the application and see if there is an issue
>> > > > with offset overflow. But looking at the above comment ("Obviously, we
>> > > > should never ever hit this abort") I thought I should ask if there is
>> > > > anything obvious I am missing.
>> > >
>> > > that's my comment. The 'rank_index' array is allocated based on the
>> > > 'cb_nodes' hint. I definitely would like to know more about how the
>> > > code is manipulating rank_index, cb_nodes, and fd_end .
>> > >
>> > > If there is a reduced test case you can send me, that will be a huge
>> > > help.
>> > >
>> > > ==rob
>> > >
>> > > >
>> > > > Regards,
>> > > > Pramod
>> > > >
>> > > > p.s. I will provide a reproducer after looking into this more
>> > > > carefully.
>> > > > _______________________________________________
>> > > > discuss mailing list discuss at mpich.org
>> > > > To manage subscription options or unsubscribe:
>> > > > https://lists.mpich.org/mailman/listinfo/discuss
>> > >
>> > >
>> > > <test.cpp>
>> >
>>
>
>