[mpich-devel] MPI_Bsend under MPIU_THREAD_GRANULARITY_PER_OBJECT
Dave Goodell
goodell at mcs.anl.gov
Thu Apr 25 14:10:20 CDT 2013
Thanks for the test case.
On my laptop with a debugging build of ch3:nemesis:tcp, I cannot reproduce the problem. I tried with and without valgrind, with and without the "MPI_Comm_dup", and for a range of "-np" values. Obviously this is inconclusive because this is a threading issue, but maybe there's a pamid-specific bug here?
-Dave
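For context, a minimal sketch of the kind of multithreaded Bsend test being discussed (illustrative only; this is not the attached main.c, and the thread count, message size, and iteration count are arbitrary):

    #include <mpi.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    #define NTHREADS 4
    #define MSGSIZE  1024
    #define NITER    100

    /* Each sender thread posts buffered sends to rank 0. */
    static void *sender(void *arg)
    {
        char buf[MSGSIZE];
        memset(buf, 'x', MSGSIZE);
        for (int i = 0; i < NITER; i++)
            MPI_Bsend(buf, MSGSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided, rank, size, bufsize;
        void *attach_buf;
        pthread_t tid[NTHREADS];

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            MPI_Abort(MPI_COMM_WORLD, 1);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Attach a buffer big enough for every Bsend that can be pending. */
        bufsize = NTHREADS * NITER * (MSGSIZE + MPI_BSEND_OVERHEAD);
        attach_buf = malloc(bufsize);
        MPI_Buffer_attach(attach_buf, bufsize);

        if (rank == 0) {
            /* Drain all messages sent by the other ranks' threads. */
            char rbuf[MSGSIZE];
            for (int i = 0; i < (size - 1) * NTHREADS * NITER; i++)
                MPI_Recv(rbuf, MSGSIZE, MPI_CHAR, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            /* Hammer MPI_Bsend from several threads at once. */
            for (int t = 0; t < NTHREADS; t++)
                pthread_create(&tid[t], NULL, sender, NULL);
            for (int t = 0; t < NTHREADS; t++)
                pthread_join(tid[t], NULL);
        }

        MPI_Buffer_detach(&attach_buf, &bufsize);
        free(attach_buf);
        MPI_Finalize();
        return 0;
    }

Built against an MPI_THREAD_MULTIPLE library and run with a few ranks, this is the sort of workload that exercises the bsend bookkeeping from several threads concurrently.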
On Apr 25, 2013, at 2:02 PM CDT, Bob Cernohous <bobc at us.ibm.com> wrote:
> Testcase is pretty simple. I'll keep poking at it myself and look at that ticket too.
>
>
>
> Bob Cernohous: (T/L 553) 507-253-6093
>
> BobC at us.ibm.com
> IBM Rochester, Building 030-2(C335), Department 61L
> 3605 Hwy 52 North, Rochester, MN 55901-7829
>
> > Chaos reigns within.
> > Reflect, repent, and reboot.
> > Order shall return.
>
>
> devel-bounces at mpich.org wrote on 04/25/2013 01:57:35 PM:
>
> > From: Dave Goodell <goodell at mcs.anl.gov>
> > To: devel at mpich.org,
> > Date: 04/25/2013 01:58 PM
> > Subject: Re: [mpich-devel] MPI_Bsend under MPIU_THREAD_GRANULARITY_PER_OBJECT
> > Sent by: devel-bounces at mpich.org
> >
> > In that case, test programs to help us reproduce this are also welcome.
> >
> > There's also always a risk that one of the currently missing ALLFUNC
> > critical sections is affecting your user:
> >
> > https://trac.mpich.org/projects/mpich/ticket/1797
> >
> > -Dave
> >
> > On Apr 25, 2013, at 11:38 AM CDT, Bob Cernohous <bobc at us.ibm.com> wrote:
> >
> > Patches could be tricky because I'm even seeing intermittent (and
> > different) failures on our 'legacy' libraries, which are not
> > per-object but use the big lock. So there's probably more than one
> > problem here.
> >
> > Abort(1) on node 3 (rank 3 in comm 1140850688): Fatal error in
> > MPI_Bsend: Internal MPI error!, error stack:
> > MPI_Bsend(181)..............: MPI_Bsend(buf=0x19c8a06da0,
> > count=1024, MPI_CHAR, dest=0, tag=0, MPI_COMM_WORLD) failed
> > MPIR_Bsend_isend(226).......:
> > MPIR_Bsend_check_active(456):
> > MPIR_Test_impl(65)..........:
> > MPIR_Request_complete(234)..: INTERNAL ERROR: unexpected value in
> > case statement (value=0)
> >
> >
> > Bob Cernohous: (T/L 553) 507-253-6093
> >
> > BobC at us.ibm.com
> > IBM Rochester, Building 030-2(C335), Department 61L
> > 3605 Hwy 52 North, Rochester, MN 55901-7829
> >
> > > Chaos reigns within.
> > > Reflect, repent, and reboot.
> > > Order shall return.
> >
> >
> >
> >
> > From: Dave Goodell <goodell at mcs.anl.gov>
> > To: devel at mpich.org,
> > Date: 04/25/2013 11:14 AM
> > Subject: Re: [mpich-devel] MPI_Bsend under
> > MPIU_THREAD_GRANULARITY_PER_OBJECT
> > Sent by: devel-bounces at mpich.org
> >
> >
> >
> > The Bsend paths almost certainly have not been protected correctly.
> > Patches to fix the issue are most welcome.
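Purely as an illustration of the direction such a patch might take (the BSENDDATA name is invented for this sketch and does not exist in MPICH; a real fix would need to cover every path that touches the attached-buffer bookkeeping in bsendutil.c):

    /* Hypothetical sketch, not an actual MPICH patch: give the bsend
     * bookkeeping its own critical section so it stays serialized even
     * when the ALLFUNC section compiles away, as it does in
     * MPIU_THREAD_GRANULARITY_PER_OBJECT builds. */
    MPIU_THREAD_CS_ENTER(BSENDDATA, );   /* invented lock name */
    /* ... allocate a segment from the attached buffer, start the send,
     * and update the active-request list (the MPIR_Bsend_isend /
     * MPIR_Bsend_check_active work) ... */
    MPIU_THREAD_CS_EXIT(BSENDDATA, );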
> >
> > -Dave
> >
> > On Apr 25, 2013, at 10:49 AM CDT, Bob Cernohous <bobc at us.ibm.com> wrote:
> >
> > > Let me start by saying that I have not been involved in the
> > > nitty-gritty of the per-object locking design.
> > >
> > > What protects the attached buffer/data structures/request when
> > doing multithreaded MPI_Bsend()'s? All I see in the code path is a
> > (no-op) MPIU_THREAD_CS_ENTER(ALLFUNC,).
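In schematic form (generic names; not the literal MPICH macro definitions), that is the crux: in the global-granularity build the ALLFUNC critical section takes the one big lock, but in a per-object build it expands to nothing, and no finer-grained lock covers the bsend bookkeeping.

    /* Schematic only; names are generic, not MPICH's actual macros. */
    #include <pthread.h>
    static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

    #ifdef GRANULARITY_GLOBAL              /* "big lock" build */
    #  define CS_ENTER_ALLFUNC() pthread_mutex_lock(&big_lock)
    #  define CS_EXIT_ALLFUNC()  pthread_mutex_unlock(&big_lock)
    #else                                  /* per-object build: compiles away */
    #  define CS_ENTER_ALLFUNC() do { } while (0)
    #  define CS_EXIT_ALLFUNC()  do { } while (0)
    #endif

    /* In the per-object case, a path like
     *   MPI_Bsend -> MPIR_Bsend_isend -> MPIR_Bsend_check_active -> MPIR_Test_impl
     * therefore manipulates the attached buffer and its requests with no
     * serialization between threads. */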
> > >
> > > I have a customer test in which the threads seem to be walking
> > > all over the request around here:
> > >
> > > /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_request.h:259
> > > 0000000001088c0c MPIR_Request_complete
> > > /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/mpir_request.c:87
> > > 000000000106e874 MPIR_Test_impl
> > > /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/test.c:62
> > > 00000000010188f0 MPIR_Bsend_check_active
> > > /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/bsendutil.c:455
> > > 0000000001018dc0 MPIR_Bsend_isend
> > > /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/bsendutil.c:226
> > > 0000000001008734 PMPI_Bsend
> > > /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/bsend.c:163
> > > 00000000010009c0 00000012.long_branch_r2off.__libc_start_main+0
> > > :0
> > > 000000000130cbc0 start_thread
> > >
> > > e.g. (fprintf output from MPIU_HANDLE_LOG_REFCOUNT_CHANGE):
> > >
> > > stderr[8]: set 0x15f8048 (0xac0000ff kind=REQUEST) refcount to 2
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to 1
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to 0
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -1
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -2
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -3
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -4
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -5
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -6
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -7
> > > stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -8
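A toy illustration of that failure mode (plain pthreads, not MPICH code; timing dependent): if nothing serializes the test-and-complete step, every thread can see the request as still active and each performs the final decrement, driving the count negative much like the trace above.

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NTHREADS 8

    /* Toy model of an active bsend request: 2 references, no lock. */
    static int refcount = 2;

    /* Each thread loosely models the check-active/test step: it sees the
     * request as still pending and performs the completion decrement.
     * Nothing stops every thread from doing so. */
    static void *completer(void *arg)
    {
        if (refcount > 0) {        /* check ... */
            sched_yield();         /* ... widen the race window ... */
            refcount--;            /* ... then the unsynchronized decrement */
            printf("decr refcount to %d\n", refcount);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (int t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, completer, NULL);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        printf("final refcount = %d\n", refcount);
        return 0;
    }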
> > >
> > >
> > > Bob Cernohous: (T/L 553) 507-253-6093
> > >
> > > BobC at us.ibm.com
> > > IBM Rochester, Building 030-2(C335), Department 61L
> > > 3605 Hwy 52 North, Rochester, MN 55901-7829
> > >
> > > > Chaos reigns within.
> > > > Reflect, repent, and reboot.
> > > > Order shall return.
> >
> <main.c>