[mpich-discuss] Dataloop error message

Latham, Robert J. robl at mcs.anl.gov
Thu Mar 9 11:22:45 CST 2017


On Wed, 2017-03-08 at 23:41 +0000, Palmer, Bruce J wrote:
> Rob,
> 
> Attached are the valgrind logs for a failed run. I've checked out the
> code on our side and I don't see anything obviously bogus (not that
> that means much). Do these suggest anything to you? I'm still trying
> to create a short reproducer, but as you can imagine, all my efforts
> so far work just fine.

Yeah, these valgrind logs look ok.  The warnings about uninitialized
bytes are common, especially when MPICH is not configured with extra
debugging to explicitly zero out all those buffers.

The dataloop code is getting some kind of garbage datatype.  What types
are you trying to use?  Simple contiguous types?  A deeply nested
complex user-defined type?
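
For anyone following along, here is a rough sketch of how a kind switch
like the one quoted below can end in an assertion when it is handed
garbage. It is only an illustration: the SKETCH_* names, their numeric
values, and both functions are made up here and are not MPICH's actual
definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical constants for illustration only; not MPICH's real values. */
#define SKETCH_KIND_MASK 0x7u

enum sketch_kind {
    SKETCH_CONTIG       = 1,
    SKETCH_VECTOR       = 2,
    SKETCH_BLOCKINDEXED = 3,
    SKETCH_INDEXED      = 4,
    SKETCH_STRUCT       = 5
};

/* Return a name for a recognized kind, or NULL for anything else.
 * Bits outside the mask are ignored, as in the quoted switch. */
const char *sketch_kind_name(unsigned kind)
{
    switch (kind & SKETCH_KIND_MASK) {
    case SKETCH_CONTIG:       return "CONTIG";
    case SKETCH_VECTOR:       return "VECTOR";
    case SKETCH_BLOCKINDEXED: return "BLOCKINDEXED";
    case SKETCH_INDEXED:      return "INDEXED";
    case SKETCH_STRUCT:       return "STRUCT";
    default:                  return NULL;   /* not a known kind */
    }
}

/* Stand-in for an update routine: aborts on an unknown kind, which is
 * roughly what an "Assertion failed ... : 0" message reports. */
void sketch_update(unsigned kind)
{
    const char *name = sketch_kind_name(kind);
    assert(name != NULL && "unknown dataloop kind");
    printf("updating %s loop\n", name);
}
```

With these made-up values, a kind field that has been overwritten with
0 or 7 falls through to the default case, so sketch_update() would
abort, while any of the five known kinds passes.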

==rob

> 
> Bruce
> 
> -----Original Message-----
> From: Latham, Robert J. [mailto:robl at mcs.anl.gov]
> Sent: Wednesday, March 08, 2017 7:26 AM
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] Dataloop error message
> 
> On Tue, 2017-03-07 at 19:31 +0000, Palmer, Bruce J wrote:
> > Hi,
> >  
> > I’m trying to track down a possible race condition in a test program
> > that is using MPI RMA from MPICH 3.2. The program repeats a series of
> > put/get/accumulate operations to different processors. When I’m
> > running on 1 node 4 processors everything is fine but when I move to
> > 2 nodes 4 processors I start getting failures. The error messages
> > I’m seeing are
> >  
> > Assertion failed in file src/mpid/common/datatype/dataloop/dataloop.c at line 265: 0
> 
> That's a strange one!  It came from the "Dataloop_update" routine,
> which updates pointers after a copy operation.  That particular
> assertion came from the "handle different types" switch:
> 
>     switch(dataloop->kind & DLOOP_KIND_MASK)
> 
> which means that somehow this code got a datatype that was not one of
> CONTIG, VECTOR, BLOCKINDEXED, INDEXED, or STRUCT (in dataloop terms;
> the MPI type "HINDEXED", for example, maps directly to INDEXED, so not
> all MPI types are explicitly handled).
> 
>  
> > Assertion failed in file src/mpid/common/datatype/dataloop/dataloop.c at line 157: dataloop->loop_params.cm_t.dataloop
> 
> Also inside "Dataloop_update".  This assertion
> 
>     DLOOP_Assert(dataloop->loop_params.cm_t.dataloop)
> 
> basically suggests garbage was passed to the Dataloop_update routine.
>  
> > Does anyone have a handle on what these routines do and what kind of
> > behavior is generating these errors? The test program is allocating
> > memory and using it to create a window, followed immediately by a call
> > to MPI_Win_lock_all to create a passive synchronization epoch.
> > I’ve been using request based RMA calls (Rput, Rget, Raccumulate)
> > followed by an immediate call to MPI_Wait for the individual RMA
> > operations. Any suggestions about what these errors are telling me?
> > If I start putting in print statements to narrow down the location of
> > the error, the code runs to completion.
> 
> The two assertions plus your observation that "printf debugging makes
> it go away" sure sound a lot like some kind of memory corruption.
> Any chance you can collect some valgrind logs?
> 
> ==rob
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss