[mpich-discuss] Dataloop error message

Palmer, Bruce J Bruce.Palmer at pnnl.gov
Wed Mar 8 17:41:46 CST 2017


Rob,

Attached are the valgrind logs for a failed run. I've checked out the code on our side and I don't see anything obviously bogus (not that that means much). Do these suggest anything to you? I'm still trying to create a short reproducer, but as you can imagine, all my efforts so far work just fine.

Bruce

-----Original Message-----
From: Latham, Robert J. [mailto:robl at mcs.anl.gov] 
Sent: Wednesday, March 08, 2017 7:26 AM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] Dataloop error message

On Tue, 2017-03-07 at 19:31 +0000, Palmer, Bruce J wrote:
> Hi,
>  
> I’m trying to track down a possible race condition in a test program 
> that is using MPI RMA from MPICH 3.2. The program repeats a series of 
> put/get/accumulate operations to different processors. When I’m
> running on 1 node with 4 processors everything is fine, but when I move to
> 2 nodes with 4 processors I start getting failures. The error messages I’m
> seeing are
>  
> Assertion failed in file src/mpid/common/datatype/dataloop/dataloop.c
> at line 265: 0

That's a strange one!  It came from the "Dataloop_update" routine,
which updates pointers after a copy operation.  That particular assertion comes from the "handle different types" switch

    switch(dataloop->kind & DLOOP_KIND_MASK)

which means somehow this code got a datatype that was not one of CONTIG, VECTOR, BLOCKINDEXED, INDEXED, or STRUCT (in dataloop terms;
the MPI type "HINDEXED", for example, maps directly to INDEXED, so not all MPI types are handled explicitly).
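
The failing "0" assertion is what you get when an exhaustive switch falls into a default branch that is never supposed to be reached.  A minimal sketch of that pattern (paraphrased from the description above, not the actual MPICH source) looks like:

    /* Sketch of the kind dispatch in Dataloop_update (paraphrased; the
     * real MPICH 3.2 source differs in detail). */
    switch (dataloop->kind & DLOOP_KIND_MASK) {
        case DLOOP_KIND_CONTIG:
        case DLOOP_KIND_VECTOR:
        case DLOOP_KIND_BLOCKINDEXED:
        case DLOOP_KIND_INDEXED:
        case DLOOP_KIND_STRUCT:
            /* ... update the embedded dataloop pointers for this kind ... */
            break;
        default:
            /* An unrecognized kind means the dataloop itself is corrupt;
             * this is the "Assertion failed ... at line 265: 0" message. */
            DLOOP_Assert(0);
    }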

 
> Assertion failed in file src/mpid/common/datatype/dataloop/dataloop.c
> at line 157: dataloop->loop_params.cm_t.dataloop

Also inside "Dataloop_update". This assertion

    DLOOP_Assert(dataloop->loop_params.cm_t.dataloop)

basically suggests that garbage was passed to the Dataloop_update routine.
 
> Does anyone have a handle on what these routines do and what kind of 
> behavior is generating these errors? The test program is allocating 
> memory and using it to create a window, followed immediately by a call 
> to MPI_Win_lock_all to create a passive synchronization epoch.
> I’ve been using request based RMA calls (Rput, Rget, Raccumulate) 
> followed by an immediate call to MPI_Wait  for the individual RMA 
> operations. Any suggestions about what these errors are telling me?
> If I start putting in print statements to narrow down the location of 
> the error, the code runs to completion.
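
For reference, the access pattern described above looks roughly like the sketch below (buffer size, datatype, and target-rank choice are illustrative assumptions, not the actual test program):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nproc;
        const int n = 1024;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* Allocate memory and expose it through a window. */
        double *base = malloc(n * sizeof(double));
        MPI_Win win;
        MPI_Win_create(base, n * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* Passive-target epoch covering all ranks. */
        MPI_Win_lock_all(0, win);

        /* Request-based RMA, completed immediately with MPI_Wait. */
        double src[8] = {0.0};
        int target = (rank + 1) % nproc;
        MPI_Request req;
        MPI_Rput(src, 8, MPI_DOUBLE, target, 0, 8, MPI_DOUBLE, win, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        MPI_Win_unlock_all(win);
        MPI_Win_free(&win);
        free(base);
        MPI_Finalize();
        return 0;
    }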

The two assertions, plus your observation that "printf debugging makes it go away", sure sound a lot like some kind of memory corruption. Any chance you can collect some valgrind logs?

==rob
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss
-------------- next part --------------
Four valgrind logs were attached (scrubbed by the archive; application/octet-stream):
  log.1673 (15568 bytes): <http://lists.mpich.org/pipermail/discuss/attachments/20170308/19a5c4fd/attachment.obj>
  log.1674 (6583 bytes):  <http://lists.mpich.org/pipermail/discuss/attachments/20170308/19a5c4fd/attachment-0001.obj>
  log.1729 (13004 bytes): <http://lists.mpich.org/pipermail/discuss/attachments/20170308/19a5c4fd/attachment-0002.obj>
  log.1730 (6583 bytes):  <http://lists.mpich.org/pipermail/discuss/attachments/20170308/19a5c4fd/attachment-0003.obj>