[mpich-devel] MPI_Bsend under MPIU_THREAD_GRANULARITY_PER_OBJECT

Dave Goodell goodell at mcs.anl.gov
Thu Apr 25 14:03:35 CDT 2013


Hard to know what's really going on without a proper test case.  Is the test case using cancel?  The reference counting is known to be sloppy (at least in CH3) for some of the cancellation paths.
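
If the customer's code can't be shared, even a rough sketch of the pattern would help.  Something like the following is what I'd guess the reproducer looks like -- only a guess, mind you: thread count, message size, iteration count, and the two-rank assumption are all arbitrary, and it assumes plain concurrent MPI_Bsend under MPI_THREAD_MULTIPLE with no cancel anywhere:

/* bsend_mt.c: hypothetical reproducer sketch, not a confirmed test case.
 * Run with exactly 2 ranks; several threads per rank hammer MPI_Bsend
 * against the shared attached buffer while one thread drains receives. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4
#define MSGSIZE  1024
#define ITERS    1000

static void *sender(void *arg)
{
    char msg[MSGSIZE];
    int rank = *(int *) arg;
    int peer = (rank == 0) ? 1 : 0;   /* assumes exactly two ranks */
    memset(msg, 'x', sizeof(msg));
    for (int i = 0; i < ITERS; i++) {
        /* all sender threads go through the bsend bookkeeping at once */
        MPI_Bsend(msg, MSGSIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
    }
    return NULL;
}

static void *receiver(void *arg)
{
    char msg[MSGSIZE];
    for (int i = 0; i < NTHREADS * ITERS; i++)
        MPI_Recv(msg, MSGSIZE, MPI_CHAR, MPI_ANY_SOURCE, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "need MPI_THREAD_MULTIPLE\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* attach a buffer big enough for every message so that buffer
     * exhaustion never enters the picture */
    int bufsize = NTHREADS * ITERS * (MSGSIZE + MPI_BSEND_OVERHEAD);
    char *buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);

    pthread_t th[NTHREADS + 1];
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, sender, &rank);
    pthread_create(&th[NTHREADS], NULL, receiver, NULL);
    for (int t = 0; t <= NTHREADS; t++)
        pthread_join(th[t], NULL);

    MPI_Buffer_detach(&buf, &bufsize);
    free(buf);
    MPI_Finalize();
    return 0;
}

If something in that shape doesn't trip it, then cancel (or something else specific to the customer's path) is probably part of the story.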

-Dave

On Apr 25, 2013, at 2:00 PM CDT, Bob Cernohous <bobc at us.ibm.com> wrote:

> Here's the big-lock failure... it looks like we use a request after it's complete and its refcount has dropped to 0: 
> 
> ... 
> stderr[0]: decr 0x15604b0 (0x44000000 kind=COMM) refcount to 3 
> stderr[0]: set 0x15d0fe8 (0xac000001 kind=REQUEST) refcount to 2 
> stderr[0]: incr 0x15604b0 (0x44000000 kind=COMM) refcount to 4 
> stderr[0]: decr 0x15d0fe8 (0xac000001 kind=REQUEST) refcount to 1 
> stderr[0]: decr 0x15d0fe8 (0xac000001 kind=REQUEST) refcount to 0 
>                 ^^^^^^^^^ 
> stderr[0]: decr 0x15604b0 (0x44000000 kind=COMM) refcount to 3 
> stderr[0]: decr 0x15d12a0 (0xac000004 kind=REQUEST) refcount to 0 
> stderr[0]: decr 0x15604b0 (0x44000000 kind=COMM) refcount to 2 
> stderr[0]: badcase 0x15d0fe8 (0xac000001 kind=0) refcount 0 
>                    ^^^^^^^^^ 
> stderr[0]: Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in MPI_Bsend: Internal MPI error!, error stack: 
> stderr[0]: MPI_Bsend(181)..............: MPI_Bsend(buf=0x19c8606d70, count=1024, MPI_CHAR, dest=1, tag=0, MPI_COMM_WORLD) failed 
> stderr[0]: MPIR_Bsend_isend(226).......: 
> stderr[0]: MPIR_Bsend_check_active(456): 
> stderr[0]: MPIR_Test_impl(65)..........: 
> stderr[0]: MPIR_Request_complete(239)..: INTERNAL ERROR: unexpected value in case statement (value=0) 
> 
> 
> Bob Cernohous:  (T/L 553) 507-253-6093
> 
> BobC at us.ibm.com
> IBM Rochester, Building 030-2(C335), Department 61L
> 3605 Hwy 52 North, Rochester,  MN 55901-7829
> 
> > Chaos reigns within.
> > Reflect, repent, and reboot.
> > Order shall return.
> 
> 
> devel-bounces at mpich.org wrote on 04/25/2013 11:38:24 AM:
> 
> > From: Bob Cernohous/Rochester/IBM at IBMUS 
> > To: devel at mpich.org, 
> > Date: 04/25/2013 11:43 AM 
> > Subject: Re: [mpich-devel] MPI_Bsend under MPIU_THREAD_GRANULARITY_PER_OBJECT 
> > Sent by: devel-bounces at mpich.org 
> > 
> > Patches could be tricky because I'm seeing intermittent (and 
> > different) failures even on our 'legacy' libraries, which are not 
> > per-object but use the big lock.  So there's probably more than one 
> > problem here. 
> > 
> > Abort(1) on node 3 (rank 3 in comm 1140850688): Fatal error in MPI_Bsend: Internal MPI error!, error stack: 
> > MPI_Bsend(181)..............: MPI_Bsend(buf=0x19c8a06da0, count=1024, MPI_CHAR, dest=0, tag=0, MPI_COMM_WORLD) failed 
> > MPIR_Bsend_isend(226).......: 
> > MPIR_Bsend_check_active(456): 
> > MPIR_Test_impl(65)..........: 
> > MPIR_Request_complete(234)..: INTERNAL ERROR: unexpected value in case statement (value=0) 
> > 
> > 
> > Bob Cernohous:  (T/L 553) 507-253-6093
> > 
> > BobC at us.ibm.com
> > IBM Rochester, Building 030-2(C335), Department 61L
> > 3605 Hwy 52 North, Rochester,  MN 55901-7829
> > 
> > > Chaos reigns within.
> > > Reflect, repent, and reboot.
> > > Order shall return.
> > 
> > 
> > 
> > 
> > From:        Dave Goodell <goodell at mcs.anl.gov> 
> > To:        devel at mpich.org, 
> > Date:        04/25/2013 11:14 AM 
> > Subject:        Re: [mpich-devel] MPI_Bsend under MPIU_THREAD_GRANULARITY_PER_OBJECT 
> > Sent by:        devel-bounces at mpich.org 
> > 
> > 
> > 
> > The Bsend paths almost certainly have not been protected correctly. 
> > Patches to fix the issue are most welcome.
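
To be concrete about what "protected correctly" would have to mean here: MPIR_Bsend_isend and MPIR_Bsend_check_active both go through shared bsend bookkeeping (the attached-buffer segments and their active requests), so every one of those paths needs to serialize on the same lock.  In bare pthreads terms the pattern is roughly the following -- purely an illustration with made-up names, not MPICH code or its critical-section macros:

#include <pthread.h>
#include <stddef.h>

/* illustrative stand-ins for the bsend bookkeeping; names are invented */
struct bsend_seg { struct bsend_seg *next; size_t size; int active; };
static struct bsend_seg *seg_list;
static pthread_mutex_t bsend_mutex = PTHREAD_MUTEX_INITIALIZER;

/* every path that reads or writes the segment list takes the same lock,
 * so one thread cannot free or reuse a segment (or its request) while
 * another thread is still testing it */
static struct bsend_seg *bsend_alloc_segment(size_t size)
{
    struct bsend_seg *s;
    pthread_mutex_lock(&bsend_mutex);
    for (s = seg_list; s != NULL; s = s->next) {
        if (!s->active && s->size >= size) {
            s->active = 1;
            break;
        }
    }
    pthread_mutex_unlock(&bsend_mutex);
    return s;
}

static void bsend_check_active(void)
{
    pthread_mutex_lock(&bsend_mutex);
    for (struct bsend_seg *s = seg_list; s != NULL; s = s->next) {
        if (s->active /* && the segment's request has completed */)
            s->active = 0;    /* reclaim under the same lock */
    }
    pthread_mutex_unlock(&bsend_mutex);
}

Under per-object granularity something has to play that role explicitly, since (as Bob notes below) the ALLFUNC critical section compiles away to a no-op there.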
> > 
> > -Dave
> > 
> > On Apr 25, 2013, at 10:49 AM CDT, Bob Cernohous <bobc at us.ibm.com> wrote:
> > 
> > > I'll start by saying that I have not been involved in the nitty-gritty of the per-object locking design. 
> > > 
> > > What protects the attached buffer/data structures/request when doing multithreaded MPI_Bsend() calls?  All I see in the code path is a (no-op) MPIU_THREAD_CS_ENTER(ALLFUNC,). 
> > > 
> > > I have a customer test in which the threads seem to be walking all over the request, somewhere around: 
> > > 
> > >         /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_request.h:259 
> > > 0000000001088c0c MPIR_Request_complete 
> > >         /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/mpir_request.c:87 
> > > 000000000106e874 MPIR_Test_impl 
> > >         /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/test.c:62 
> > > 00000000010188f0 MPIR_Bsend_check_active 
> > >         /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/bsendutil.c:455 
> > > 0000000001018dc0 MPIR_Bsend_isend 
> > >         /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/bsendutil.c:226 
> > > 0000000001008734 PMPI_Bsend 
> > >         /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/bsend.c:163 
> > > 00000000010009c0 00000012.long_branch_r2off.__libc_start_main+0 
> > >         :0 
> > > 000000000130cbc0 start_thread 
> > > 
> > > e.g. (printed from MPIU_HANDLE_LOG_REFCOUNT_CHANGE): 
> > > 
> > > stderr[8]: set 0x15f8048 (0xac0000ff kind=REQUEST) refcount to 2 
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to 1 
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to 0 
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -1 
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -2 
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -3 
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -4 
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -5 
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -6 
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -7 
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -8 
> > > 
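
A refcount sailing straight past zero like that is what you'd expect from some combination of (a) a plain, non-atomic decrement racing with itself and (b) threads continuing to release a handle after it already hit zero and was recycled.  Purely as an illustration of those two failure modes and the usual guard -- this is generic code, not the MPICH implementation:

#include <stdatomic.h>
#include <assert.h>

struct obj {
    int refcount;            /* plain int: read/modify/write can interleave */
    _Atomic int refcount_a;  /* atomic variant */
};

/* Racy: two threads can both read 1 and both store 0, so the
 * "reached zero, free it" decision is taken twice or not at all.
 * Once the object is recycled, further decrements on the stale handle
 * drive the count negative -- which matches the -1..-8 trace above. */
static int obj_release_racy(struct obj *o)
{
    o->refcount -= 1;
    return (o->refcount == 0);
}

/* Safe: exactly one thread observes the transition to zero, and the
 * assert catches any release of a handle that was already dead. */
static int obj_release_atomic(struct obj *o)
{
    int old = atomic_fetch_sub_explicit(&o->refcount_a, 1,
                                        memory_order_acq_rel);
    assert(old > 0);      /* old <= 0 means an extra or stale release */
    return (old == 1);    /* only the last releaser frees the object */
}

The negative counts alone don't say which of the two it is, but the badcase/kind=0 abort in the big-lock trace further up would be consistent with the second: the request gets freed and recycled, and a later test on the stale handle lands in the unexpected-kind case in MPIR_Request_complete.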
> > > 
> > > Bob Cernohous:  (T/L 553) 507-253-6093
> > > 
> > > BobC at us.ibm.com
> > > IBM Rochester, Building 030-2(C335), Department 61L
> > > 3605 Hwy 52 North, Rochester,  MN 55901-7829
> > > 
> > > > Chaos reigns within.
> > > > Reflect, repent, and reboot.
> > > > Order shall return.
> > 
