<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">I’d still like to see MPICH adopt the Dataloop code that Tarun wrote, which should be much faster and, in particular, more appropriate for use with RMA. I think that would be more productive in the long term than continuing to maintain the current code.<div class=""><br class=""></div><div class="">Bill</div><div class=""><br class=""><div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div style="color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">William Gropp<br class="">Acting Director and Chief Scientist, NCSA<br class="">Director, Parallel Computing Institute<br class="">Thomas M. Siebel Chair in Computer Science<br class="">University of Illinois Urbana-Champaign</div><br class="Apple-interchange-newline"></div><br class="Apple-interchange-newline">
</div>
<br class=""><div><blockquote type="cite" class=""><div class="">On Mar 8, 2017, at 5:41 PM, Palmer, Bruce J &lt;<a href="mailto:Bruce.Palmer@pnnl.gov" class="">Bruce.Palmer@pnnl.gov</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="">Rob,<br class=""><br class="">Attached are the valgrind logs for a failed run. I've checked out the code on our side and I don't see anything obviously bogus (not that that means much). Do these suggest anything to you? I'm still trying to create a short reproducer, but as you can imagine, all my efforts so far work just fine.<br class=""><br class="">Bruce<br class=""><br class="">-----Original Message-----<br class="">From: Latham, Robert J. [<a href="mailto:robl@mcs.anl.gov" class="">mailto:robl@mcs.anl.gov</a>] <br class="">Sent: Wednesday, March 08, 2017 7:26 AM<br class="">To: <a href="mailto:discuss@mpich.org" class="">discuss@mpich.org</a><br class="">Subject: Re: [mpich-discuss] Dataloop error message<br class=""><br class="">On Tue, 2017-03-07 at 19:31 +0000, Palmer, Bruce J wrote:<br class=""><blockquote type="cite" class="">Hi,<br class=""> <br class="">I’m trying to track down a possible race condition in a test program <br class="">that is using MPI RMA from MPICH 3.2. The program repeats a series of <br class="">put/get/accumulate operations to different processors. When I’m <br class="">running on 1 node 4 processors everything is fine but when I move to<br class="">2 nodes 4 processors I start getting failures. The error messages I’m <br class="">seeing are<br class=""> <br class="">Assertion failed in file src/mpid/common/datatype/dataloop/dataloop.c<br class="">at line 265: 0<br class=""></blockquote><br class="">That's a strange one! It came from the "Dataloop_update" routine. <br class="">It updates pointers after a copy operation. 
That particular assertion came from the "handle different types" switch<br class=""><br class=""> switch(dataloop-&gt;kind &amp; DLOOP_KIND_MASK)<br class=""><br class="">which means somehow this code got a datatype that was not one of CONTIG, VECTOR, BLOCKINDEXED, INDEXED, or STRUCT (in dataloop terms; <br class="">the MPI type "HINDEXED", for example, maps to INDEXED directly, so not all MPI types are explicitly handled).<br class=""><br class=""> <br class=""><blockquote type="cite" class="">Assertion failed in file src/mpid/common/datatype/dataloop/dataloop.c<br class="">at line 157: dataloop-&gt;loop_params.cm_t.dataloop<br class=""></blockquote><br class="">Also inside "Dataloop_update". This assertion<br class=""><br class=""> DLOOP_Assert(dataloop-&gt;loop_params.cm_t.dataloop)<br class=""><br class="">basically suggests garbage was passed to the Dataloop_update routine.<br class=""> <br class=""><blockquote type="cite" class="">Does anyone have a handle on what these routines do and what kind of <br class="">behavior is generating these errors? The test program is allocating <br class="">memory and using it to create a window, followed immediately by a call <br class="">to MPI_Win_lock_all to create a passive synchronization epoch.<br class="">I’ve been using request-based RMA calls (Rput, Rget, Raccumulate) <br class="">followed by an immediate call to MPI_Wait for the individual RMA <br class="">operations. Any suggestions about what these errors are telling me?<br class="">If I start putting in print statements to narrow down the location of <br class="">the error, the code runs to completion.<br class=""></blockquote><br class="">The two assertions plus your observation that "printf debugging makes it go away" sure sound a lot like some kind of memory corruption. Any chance you can collect some valgrind logs? 
<br class=""><br class="">==rob<br class="">_______________________________________________<br class="">discuss mailing list <a href="mailto:discuss@mpich.org" class="">discuss@mpich.org</a><br class="">To manage subscription options or unsubscribe:<br class=""><a href="https://lists.mpich.org/mailman/listinfo/discuss" class="">https://lists.mpich.org/mailman/listinfo/discuss</a><br class=""><span id="cid:3EB14914-46CF-420C-BC5F-024C34DF2D0C">&lt;log.1673&gt;</span><span id="cid:352E92DE-B893-47D7-BC0D-02E630BE80A4">&lt;log.1674&gt;</span><span id="cid:9570D5B0-9402-4794-874C-10609956506D">&lt;log.1729&gt;</span><span id="cid:EF50D5B2-9E2B-4CF3-AF1A-AA4E91B9D2C6">&lt;log.1730&gt;</span><br class=""></div></div></blockquote></div><br class=""></div></body></html>