<font size=2 face="sans-serif">Patches could be tricky, because I'm seeing
intermittent (and different) failures even on our 'legacy' libraries,
which are not per-object but use the big lock. So there's probably more
than one problem here.</font>
<br>
<br><font size=2 face="sans-serif">Abort(1) on node 3 (rank 3 in comm 1140850688):
Fatal error in MPI_Bsend: Internal MPI error!, error stack:</font>
<br><font size=2 face="sans-serif">MPI_Bsend(181)..............: MPI_Bsend(buf=0x19c8a06da0,
count=1024, MPI_CHAR, dest=0, tag=0, MPI_COMM_WORLD) failed</font>
<br><font size=2 face="sans-serif">MPIR_Bsend_isend(226).......: </font>
<br><font size=2 face="sans-serif">MPIR_Bsend_check_active(456): </font>
<br><font size=2 face="sans-serif">MPIR_Test_impl(65)..........: </font>
<br><font size=2 face="sans-serif">MPIR_Request_complete(234)..: INTERNAL
ERROR: unexpected value in case statement (value=0)</font>
<br>
<br><font size=2 face="sans-serif"><br>
Bob Cernohous: (T/L 553) 507-253-6093<br>
<br>
BobC@us.ibm.com<br>
IBM Rochester, Building 030-2(C335), Department 61L<br>
3605 Hwy 52 North, Rochester, MN 55901-7829<br>
<br>
> Chaos reigns within.<br>
> Reflect, repent, and reboot.<br>
> Order shall return.<br>
</font>
<br>
<br>
<br>
<br><font size=1 color=#5f5f5f face="sans-serif">From:
</font><font size=1 face="sans-serif">Dave Goodell &lt;goodell@mcs.anl.gov&gt;</font>
<br><font size=1 color=#5f5f5f face="sans-serif">To:
</font><font size=1 face="sans-serif">devel@mpich.org</font>
<br><font size=1 color=#5f5f5f face="sans-serif">Date:
</font><font size=1 face="sans-serif">04/25/2013 11:14 AM</font>
<br><font size=1 color=#5f5f5f face="sans-serif">Subject:
</font><font size=1 face="sans-serif">Re: [mpich-devel]
MPI_Bsend under MPIU_THREAD_GRANULARITY_PER_OBJECT</font>
<br><font size=1 color=#5f5f5f face="sans-serif">Sent by:
</font><font size=1 face="sans-serif">devel-bounces@mpich.org</font>
<br>
<hr noshade>
<br>
<br>
<br><tt><font size=2>The Bsend paths almost certainly have not been protected
correctly. Patches to fix the issue are most welcome.<br>
<br>
-Dave<br>
<br>
On Apr 25, 2013, at 10:49 AM CDT, Bob Cernohous &lt;bobc@us.ibm.com&gt;
wrote:<br>
<br>
> Let me start by saying that I have not been involved in the nitty-gritty
of the per-object locking design. <br>
> <br>
> What protects the attached buffer, data structures, and request when
multiple threads call MPI_Bsend()? All I see in the code path is a (no-op)
MPIU_THREAD_CS_ENTER(ALLFUNC,). <br>
> <br>
> I have a customer test in which the threads seem to be walking all
over the request, around: <br>
> <br>
> /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_request.h:259
<br>
> 0000000001088c0c MPIR_Request_complete <br>
> /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/mpir_request.c:87
<br>
> 000000000106e874 MPIR_Test_impl <br>
> /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/test.c:62
<br>
> 00000000010188f0 MPIR_Bsend_check_active <br>
> /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/bsendutil.c:455
<br>
> 0000000001018dc0 MPIR_Bsend_isend <br>
> /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/bsendutil.c:226
<br>
> 0000000001008734 PMPI_Bsend <br>
> /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/bsend.c:163
<br>
> 00000000010009c0 00000012.long_branch_r2off.__libc_start_main+0 <br>
> :0 <br>
> 000000000130cbc0 start_thread <br>
> <br>
> e.g. (fprintf output from MPIU_HANDLE_LOG_REFCOUNT_CHANGE): <br>
> <br>
> stderr[8]: set 0x15f8048 (0xac0000ff kind=REQUEST) refcount to 2 <br>
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to 1
<br>
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to 0
<br>
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -1
<br>
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -2
<br>
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -3
<br>
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -4
<br>
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -5
<br>
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -6
<br>
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -7
<br>
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -8
<br>
<br>
</font></tt>
<br>
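<br><font size=2 face="sans-serif">The refcount trace quoted above, where a single request handle is decremented down to -8, is consistent with several threads releasing the same object without mutual exclusion. As a minimal sketch, not MPICH code, the following shows the per-object idea: the object carries its own lock, so exactly one thread observes the transition to zero. All names (demo_request, demo_request_release, demo_run) are hypothetical, and a pthread mutex stands in for whatever lock or atomic primitive a real fix would use.</font>

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a request object; NOT MPICH's request type. */
typedef struct {
    pthread_mutex_t lock;  /* per-object lock: protects the fields below */
    int ref_count;
    int times_freed;       /* how many releases saw the count reach zero */
} demo_request;

/* Drop one reference. Returns 1 if this call released the last one. */
static int demo_request_release(demo_request *req)
{
    int last;
    pthread_mutex_lock(&req->lock);
    assert(req->ref_count > 0);  /* going below zero would be an over-release */
    req->ref_count--;
    last = (req->ref_count == 0);
    if (last)
        req->times_freed++;      /* a real object would be freed here, once */
    pthread_mutex_unlock(&req->lock);
    return last;
}

static void *demo_worker(void *arg)
{
    demo_request_release((demo_request *)arg);
    return NULL;
}

/* Give each of nthreads threads one reference and let them all release it.
 * Returns how many releases observed zero (expected: exactly 1),
 * or -1 if nthreads is out of range. */
static int demo_run(int nthreads)
{
    enum { MAX_THREADS = 64 };
    pthread_t tids[MAX_THREADS];
    demo_request req;
    int i, started = 0, freed;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    pthread_mutex_init(&req.lock, NULL);
    req.ref_count = nthreads;
    req.times_freed = 0;

    for (i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[started], NULL, demo_worker, &req) == 0)
            started++;
        else
            demo_request_release(&req);  /* drop that reference here instead */
    }
    for (i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    freed = req.times_freed;
    assert(req.ref_count == 0);
    pthread_mutex_destroy(&req.lock);
    return freed;
}
```

<font size=2 face="sans-serif">Calling demo_run(8) returns 1: eight threads each drop one reference, and only the final release sees zero. Build with -pthread. Whether the actual Bsend failure is this kind of race is not established in the thread; the sketch only illustrates what per-object protection is meant to guarantee.</font>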