[mpich-devel] MPI_Bsend under MPIU_THREAD_GRANULARITY_PER_OBJECT
Bob Cernohous
bobc at us.ibm.com
Thu Apr 25 11:38:24 CDT 2013
Patches could be tricky because I'm seeing intermittent (and different)
failures even on our 'legacy' libraries, which are not per-object but
use the big lock. So there's probably more than one problem here.
Here's one of the failures:
Abort(1) on node 3 (rank 3 in comm 1140850688): Fatal error in MPI_Bsend:
Internal MPI error!, error stack:
MPI_Bsend(181)..............: MPI_Bsend(buf=0x19c8a06da0, count=1024,
MPI_CHAR, dest=0, tag=0, MPI_COMM_WORLD) failed
MPIR_Bsend_isend(226).......:
MPIR_Bsend_check_active(456):
MPIR_Test_impl(65)..........:
MPIR_Request_complete(234)..: INTERNAL ERROR: unexpected value in case
statement (value=0)
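
The pattern that hits this is basically several threads in one process
sharing a single attached buffer and doing buffered sends concurrently --
roughly the following (a minimal sketch, not the customer's actual test;
the thread count, message count, and sizes here are made up):

#include <mpi.h>
#include <pthread.h>
#include <stdlib.h>

#define NTHREADS  4
#define NSENDS    100
#define MSGSIZE   1024

/* Each thread does buffered sends to rank 0 out of the one buffer that
   was attached for the whole process. */
static void *sender(void *arg)
{
    char msg[MSGSIZE] = {0};
    (void) arg;
    for (int i = 0; i < NSENDS; i++)
        MPI_Bsend(msg, MSGSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One attached buffer, shared by every thread in the process. */
    int bufsize = NTHREADS * NSENDS * (MSGSIZE + MPI_BSEND_OVERHEAD);
    void *buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);

    if (rank != 0) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, sender, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
    } else {
        /* Rank 0 just drains everything the other ranks send. */
        char msg[MSGSIZE];
        for (int i = 0; i < (size - 1) * NTHREADS * NSENDS; i++)
            MPI_Recv(msg, MSGSIZE, MPI_CHAR, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Buffer_detach(&buf, &bufsize);
    free(buf);
    MPI_Finalize();
    return 0;
}
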
Bob Cernohous: (T/L 553) 507-253-6093
BobC at us.ibm.com
IBM Rochester, Building 030-2(C335), Department 61L
3605 Hwy 52 North, Rochester, MN 55901-7829
> Chaos reigns within.
> Reflect, repent, and reboot.
> Order shall return.
From: Dave Goodell <goodell at mcs.anl.gov>
To: devel at mpich.org,
Date: 04/25/2013 11:14 AM
Subject: Re: [mpich-devel] MPI_Bsend under
MPIU_THREAD_GRANULARITY_PER_OBJECT
Sent by: devel-bounces at mpich.org
The Bsend paths almost certainly have not been protected correctly.
Patches to fix the issue are most welcome.
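
The shape of the fix is to make every path that touches the shared bsend
bookkeeping (attach/detach, segment allocation, and the check-active/test
path) enter the same critical section, since ALLFUNC is a no-op at
per-object granularity. Purely to illustrate where the protection has to
go -- this is a plain pthread mutex, not the real MPIU_THREAD_CS macros or
the real bsendutil structures:

#include <pthread.h>
#include <stddef.h>

/* Illustrative stand-ins for the shared bsend bookkeeping; the real
   structures live in src/mpi/pt2pt/bsendutil.c. */
struct bsend_segment {
    struct bsend_segment *next;
    int active;                      /* segment has an outstanding isend */
};

static struct bsend_segment *segments;
static pthread_mutex_t bsend_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Every entry point that walks or mutates the segment list (and the
   requests hanging off it) takes the same lock, so a thread testing a
   request for completion can't race a thread allocating or freeing the
   segment that owns it. */
static void bsend_check_active(void)
{
    pthread_mutex_lock(&bsend_mutex);
    for (struct bsend_segment *s = segments; s != NULL; s = s->next) {
        if (s->active) {
            /* ... test/complete the associated request exactly once ... */
        }
    }
    pthread_mutex_unlock(&bsend_mutex);
}
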
-Dave
On Apr 25, 2013, at 10:49 AM CDT, Bob Cernohous <bobc at us.ibm.com> wrote:
> Let me start by saying that I have not been involved in the nitty-gritty
> of the per-object locking design.
>
> What protects the attached buffer, its data structures, and the request
> when multiple threads call MPI_Bsend() concurrently? All I see in the
> code path is a (no-op) MPIU_THREAD_CS_ENTER(ALLFUNC,).
>
> I have a customer test in which the threads seem to be walking all over
> the request; the failure is around:
>
>
> /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpid/pamid/include/../src/mpid_request.h:259
> 0000000001088c0c MPIR_Request_complete
> /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/mpir_request.c:87
> 000000000106e874 MPIR_Test_impl
> /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/test.c:62
> 00000000010188f0 MPIR_Bsend_check_active
> /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/bsendutil.c:455
> 0000000001018dc0 MPIR_Bsend_isend
> /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/bsendutil.c:226
> 0000000001008734 PMPI_Bsend
> /bgusr/bobc/bgq/comm/lib/dev/mpich2/src/mpi/pt2pt/bsend.c:163
> 00000000010009c0 00000012.long_branch_r2off.__libc_start_main+0
> :0
> 000000000130cbc0 start_thread
>
> e.g. (printed from MPIU_HANDLE_LOG_REFCOUNT_CHANGE):
>
> stderr[8]: set 0x15f8048 (0xac0000ff kind=REQUEST) refcount to 2
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to 1
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to 0
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -1
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -2
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -3
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -4
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -5
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -6
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -7
> stderr[8]: decr 0x15f8048 (0xac0000ff kind=REQUEST) refcount to -8
>
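
A refcount that keeps marching below zero is the classic signature of an
unserialized release path: two threads both pass the "is it complete?"
test, both decrement, and both run the free path, after which later
decrements are operating on a dead request. A generic sketch of the
failure mode and the fix (this is not MPICH's actual refcount code):

#include <pthread.h>

/* Generic stand-in for a refcounted request. */
struct req {
    int ref_count;
    pthread_mutex_t lock;            /* protects ref_count and the free path */
};

/* Unsafe: plain decrement-and-test.  Two threads can both read 1 and both
   see the count hit zero (or miss it entirely), so both free the request;
   any further release then drives the count negative, as in the trace. */
static int release_unsafe(struct req *r)
{
    r->ref_count--;
    return (r->ref_count == 0);
}

/* Safer: decrement-and-test under a lock (or an atomic RMW), so exactly
   one caller observes the transition to zero and runs the free path. */
static int release_locked(struct req *r)
{
    pthread_mutex_lock(&r->lock);
    int last = (--r->ref_count == 0);
    pthread_mutex_unlock(&r->lock);
    return last;
}
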
>
> Bob Cernohous: (T/L 553) 507-253-6093
>
> BobC at us.ibm.com
> IBM Rochester, Building 030-2(C335), Department 61L
> 3605 Hwy 52 North, Rochester, MN 55901-7829
>
> > Chaos reigns within.
> > Reflect, repent, and reboot.
> > Order shall return.