<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hard to know what's really going on without a proper test case. Is the test case using cancel? The reference counting is known to be sloppy (at least in CH3) for some of the cancellation paths.<div><br></div><div>-Dave</div><div><br><div><div>On Apr 25, 2013, at 2:00 PM CDT, Bob Cernohous &lt;<a href="mailto:bobc@us.ibm.com">bobc@us.ibm.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><font size="2" face="sans-serif">Here's the big lock failure... looks like
we use a request after it's complete and its refcount has dropped to 0</font>
<br>
<br><font size="2" face="sans-serif">...</font>
<br><font size="2" face="Courier 10 Pitch">stderr[0]: decr 0x15604b0 (0x44000000
kind=COMM) refcount to 3</font>
<br><font size="2" face="Courier 10 Pitch">stderr[0]: set 0x15d0fe8 (0xac000001
kind=REQUEST) refcount to 2</font>
<br><font size="2" face="Courier 10 Pitch">stderr[0]: incr 0x15604b0 (0x44000000
kind=COMM) refcount to 4</font>
<br><font size="2" face="Courier 10 Pitch">stderr[0]: decr 0x15d0fe8 (0xac000001
kind=REQUEST) refcount to 1</font>
<br><font size="2" face="Courier 10 Pitch">stderr[0]: decr 0x15d0fe8 (0xac000001
kind=REQUEST) refcount to 0</font>
<br><font size="2" face="Courier 10 Pitch">
^^^^^^^^^</font>
<br><font size="2" face="Courier 10 Pitch">stderr[0]: decr 0x15604b0 (0x44000000
kind=COMM) refcount to 3</font>
<br><font size="2" face="Courier 10 Pitch">stderr[0]: decr 0x15d12a0 (0xac000004
kind=REQUEST) refcount to 0</font>
<br><font size="2" face="Courier 10 Pitch">stderr[0]: decr 0x15604b0 (0x44000000
kind=COMM) refcount to 2</font>
<br><font size="2" face="Courier 10 Pitch">stderr[0]: badcase 0x15d0fe8 (0xac000001
kind=0) refcount 0</font>
<br><font size="2" face="Courier 10 Pitch">
^^^^^^^^^</font>
<br><font size="2" face="Courier 10 Pitch">stderr[0]: Abort(1) on node 0
(rank 0 in comm 1140850688): Fatal error in MPI_Bsend: Internal MPI error!,
error stack:</font>
<br><font size="2" face="Courier 10 Pitch">stderr[0]: MPI_Bsend(181)..............:
MPI_Bsend(buf=0x19c8606d70, count=1024, MPI_CHAR, dest=1, tag=0, MPI_COMM_WORLD)
failed</font>
<br><font size="2" face="Courier 10 Pitch">stderr[0]: MPIR_Bsend_isend(226).......:
</font>
<br><font size="2" face="Courier 10 Pitch">stderr[0]: MPIR_Bsend_check_active(456):
</font>
<br><font size="2" face="Courier 10 Pitch">stderr[0]: MPIR_Test_impl(65)..........:
</font>
<br><font size="2" face="Courier 10 Pitch">stderr[0]: MPIR_Request_complete(239)..:
INTERNAL ERROR: unexpected value in case statement (value=0)</font>
<br>
<br><font size="2" face="sans-serif"><br>
Bob Cernohous: (T/L 553) 507-253-6093<br>
<br>
<a href="mailto:BobC@us.ibm.com">BobC@us.ibm.com</a><br>
IBM Rochester, Building 030-2(C335), Department 61L<br>
3605 Hwy 52 North, Rochester, MN 55901-7829<br>
<br>
> Chaos reigns within.<br>
> Reflect, repent, and reboot.<br>
> Order shall return.<br>
</font>
<br>
<br><tt><font size="2"><a href="mailto:devel-bounces@mpich.org">devel-bounces@mpich.org</a> wrote on 04/25/2013 11:38:24
AM:<br>
<br>
> From: Bob Cernohous/Rochester/IBM@IBMUS</font></tt>
<br><tt><font size="2">> To: <a href="mailto:devel@mpich.org">devel@mpich.org</a>, </font></tt>
<br><tt><font size="2">> Date: 04/25/2013 11:43 AM</font></tt>
<br><tt><font size="2">> Subject: Re: [mpich-devel] MPI_Bsend under MPIU_THREAD_GRANULARITY_PER_OBJECT</font></tt>
<br><tt><font size="2">> Sent by: <a href="mailto:devel-bounces@mpich.org">devel-bounces@mpich.org</a></font></tt>
<br><tt><font size="2">> <br>
> Patches could be tricky because I'm even seeing intermittent (and
<br>
> different) failures on our 'legacy' libraries which are not per-obj
<br>
> but use the big lock. So there's probably more than one problem
here. <br>
> <br>
> Abort(1) on node 3 (rank 3 in comm 1140850688): Fatal error in <br>
> MPI_Bsend: Internal MPI error!, error stack: <br>
> MPI_Bsend(181)..............: MPI_Bsend(buf=0x19c8a06da0, <br>
> count=1024, MPI_CHAR, dest=0, tag=0, MPI_COMM_WORLD) failed <br>
> MPIR_Bsend_isend(226).......: <br>
> MPIR_Bsend_check_active(456): <br>
> MPIR_Test_impl(65)..........: <br>
> MPIR_Request_complete(234)..: INTERNAL ERROR: unexpected value in
<br>
> case statement (value=0) <br>
> <br>
> <br>
> <br>
> From: Dave Goodell <<a href="mailto:goodell@mcs.anl.gov">goodell@mcs.anl.gov</a>>
<br>
> To: <a href="mailto:devel@mpich.org">devel@mpich.org</a>, <br>
> Date: 04/25/2013 11:14 AM <br>
> Subject: Re: [mpich-devel] MPI_Bsend under
<br>
> MPIU_THREAD_GRANULARITY_PER_OBJECT <br>
> Sent by: <a href="mailto:devel-bounces@mpich.org">devel-bounces@mpich.org</a> <br>
> <br>
> <br>
> <br>
> The Bsend paths almost certainly have not been protected correctly.
<br>
> Patches to fix the issue are most welcome.<br>
> <br>
> -Dave<br>
> <br>
> On Apr 25, 2013, at 10:49 AM CDT, Bob Cernohous <<a href="mailto:bobc@us.ibm.com">bobc@us.ibm.com</a>>
wrote:<br>
> <br>
> > Let me start by saying that I have not been involved in the nitty-gritty
<br>
> of the per-object locking design. <br>
> > <br>
> > What protects the attached buffer/data structures/request when
<br>
> making multithreaded MPI_Bsend() calls? All I see in the code path
is a <br>
> (no-op) MPIU_THREAD_CS_ENTER(ALLFUNC,). <br>
> > <br>
> > I have a customer test in which the threads seem to be walking
all<br>
> over the request: <br>
> > <br>
> > /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_request.h:259 <br>
> > 0000000001088c0c MPIR_Request_complete <br>
> > /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/mpir_request.c:87
<br>
> > 000000000106e874 MPIR_Test_impl <br>
> > /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/test.c:62
<br>
> > 00000000010188f0 MPIR_Bsend_check_active <br>
> > /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/bsendutil.c:455
<br>
> > 0000000001018dc0 MPIR_Bsend_isend <br>
> > /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/bsendutil.c:226
<br>
> > 0000000001008734 PMPI_Bsend <br>
> > /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/bsend.c:163
<br>
> > 00000000010009c0 00000012.long_branch_r2off.__libc_start_main+0
<br>
> > :0 <br>
> > 000000000130cbc0 start_thread <br>
> > <br>
> > e.g. (fprintf output from MPIU_HANDLE_LOG_REFCOUNT_CHANGE) <br>
> > <br>
> > stderr[8]: set 0x15f8048 (0xac0000ff kind=REQUEST) refcount to 2 <br>
> > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to 1 <br>
> > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to 0 <br>
> > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -1 <br>
> > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -2 <br>
> > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -3 <br>
> > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -4 <br>
> > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -5 <br>
> > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -6 <br>
> > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -7 <br>
> > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -8 <br>
> > <br>
> > <br>
> <br>
</font></tt></blockquote></div><br></div></body></html>