[mpich-devel] MPI_Bsend fails multithreaded

Bob Cernohous bobc at us.ibm.com
Fri Apr 26 00:05:33 CDT 2013


If I put a new, big CS_ENTER(BSENDDATA) lock around the bsend processing 
(one that doesn't yield), then MPIU_THREAD_GRANULARITY_PER_OBJECT works OK.
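
Roughly, the workaround looks like the following. This is only a minimal,
self-contained sketch of the idea, not the actual patch: the per-object
critical-section macros are modeled with a plain pthread mutex, and the
data structure is simplified (the real names are BsendBuffer,
MPIR_Bsend_data_t, MPIU_THREAD_CS_ENTER/EXIT).

#include <pthread.h>

/* Simplified stand-in for the bsend bookkeeping; real fields differ. */
typedef struct Bsend_data {
    struct Bsend_data *next;
    /* ... request handle, user buffer pointer, etc. ... */
} Bsend_data_t;

static struct {
    Bsend_data_t   *active;
    pthread_mutex_t bsend_cs;       /* the new, big "BSENDDATA" lock */
} BsendBuffer = { NULL, PTHREAD_MUTEX_INITIALIZER };

/* Walk the active list entirely inside the dedicated lock and never yield
 * it, so no other thread can see (or free) a half-processed entry. */
static void bsend_check_active(void)
{
    pthread_mutex_lock(&BsendBuffer.bsend_cs);     /* CS_ENTER(BSENDDATA,) */
    Bsend_data_t *active = BsendBuffer.active;
    while (active) {
        Bsend_data_t *next_active = active->next;
        /* ... test the request; if complete, unlink it and free it ... */
        active = next_active;
    }
    pthread_mutex_unlock(&BsendBuffer.bsend_cs);   /* CS_EXIT(BSENDDATA,) */
}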

Bob Cernohous:  (T/L 553) 507-253-6093

BobC at us.ibm.com
IBM Rochester, Building 030-2(C335), Department 61L
3605 Hwy 52 North, Rochester,  MN 55901-7829

> Chaos reigns within.
> Reflect, repent, and reboot.
> Order shall return.


devel-bounces at mpich.org wrote on 04/25/2013 11:20:17 PM:

> From: Bob Cernohous/Rochester/IBM at IBMUS
> To: mpich2-dev at mcs.anl.gov, 
> Cc: Nysal K Jan <jnysal at in.ibm.com>, Haizhu Liu/Poughkeepsie/IBM at IBMUS, Sameh S Sharkawi/Austin/IBM at IBMUS, Su Huang/Poughkeepsie/IBM at IBMUS
> Date: 04/25/2013 11:20 PM
> Subject: Re: [mpich-devel] MPI_Bsend fails multithreaded
> Sent by: devel-bounces at mpich.org
> 
> I think my problem with MPIU_THREAD_GRANULARITY_GLOBAL is a CS_YIELD. 
> 
> MPIU_THREAD_CS_ENTER(ALLFUNC,); 
> .... 
> 
> MPIR_Bsend_data_t *active = BsendBuffer.active, *next_active; 
>     while (active) { 
>         fprintf(stderr, "%2.2u:%u:active %p (0x%08x kind=%d) refcount %d\n", 
>                 Kernel_ProcessorID(), __LINE__, 
>                 (active->request), 
>                 (active->request)->handle, 
>                 active->request->kind, 
>                 MPIU_Object_get_ref((active->request))); 
> ... 
> 
> There's one or more yields somewhere, in test and/or progress; I 
> haven't tracked it down yet, and I'm out tomorrow. I end up with three 
> threads (26, 54, and 48) working on the same active request: 26 frees 
> it and moves on to the next active entry, and 48 then chokes on the 
> freed request. 
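>
> Here is a minimal, self-contained model of the interleaving I think is
> happening. This is not MPICH code: the pthread mutex stands in for the
> ALLFUNC critical section, and the explicit unlock/yield/lock stands in
> for the CS_YIELD that test/progress can do.
>
> #include <pthread.h>
> #include <sched.h>
> #include <stdlib.h>
>
> static pthread_mutex_t allfunc_cs = PTHREAD_MUTEX_INITIALIZER;
>
> typedef struct req { int complete; } req_t;
> static req_t *active_req;                      /* shared "active" entry */
>
> static void check_active(void)
> {
>     pthread_mutex_lock(&allfunc_cs);           /* CS_ENTER(ALLFUNC,) */
>     req_t *r = active_req;
>     if (r) {
>         /* The test/progress call may yield: drop and re-take the lock. */
>         pthread_mutex_unlock(&allfunc_cs);
>         sched_yield();                         /* another thread runs here */
>         pthread_mutex_lock(&allfunc_cs);
>
>         /* By now another thread may have run this same code, seen the
>          * same r, completed it, unlinked it, and freed it; touching r
>          * below is then the use-after-free that shows up in the trace
>          * as "badcase ... kind=0" with refcount 0. */
>         if (r->complete) {
>             active_req = NULL;
>             free(r);
>         }
>     }
>     pthread_mutex_unlock(&allfunc_cs);         /* CS_EXIT(ALLFUNC,) */
> }
>
> The trace below shows exactly this interleaving.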
> 
> stderr[0]: threaded exit 
> stderr[0]: 26:441:active 0x15d1c78 (0xac000003 kind=1) refcount 2 
> stderr[0]: 26:decr 0x15d1aa8 (0xac000001 kind=REQUEST) refcount to 1 
> stderr[0]: 26:decr 0x15d1c78 (0xac000003 kind=REQUEST) refcount to 1 
> stderr[0]: yield 
> 
> stderr[0]: 54:441:active 0x15d1c78 (0xac000003 kind=1) refcount 1 
> stderr[0]: 54:set 0x15d1d60 (0xac000004 kind=REQUEST) refcount to 1 
> stderr[0]: 54:set 0x15d1d60 (0xac000004 kind=REQUEST) refcount to 2 
> stderr[0]: 54:decr 0x15d1d60 (0xac000004 kind=REQUEST) refcount to 1 
> stderr[0]: yield 
> 
> stderr[0]: 48:441:active 0x15d1c78 (0xac000003 kind=1) refcount 1 
> stderr[0]: 48:set 0x15d1e48 (0xac000005 kind=REQUEST) refcount to 1 
> stderr[0]: 48:set 0x15d1e48 (0xac000005 kind=REQUEST) refcount to 2 
> stderr[0]: 48:decr 0x15d1e48 (0xac000005 kind=REQUEST) refcount to 1 
> stderr[0]: yield 
> 
> stderr[0]: 26:decr 0x15d1c78 (0xac000003 kind=REQUEST) refcount to 0 
> stderr[0]: 26:decr 0x1560f78 (0x44000000 kind=COMM) refcount to 3 
> stderr[0]: 26:free 0x15d1c78 (0xac000003 kind=0) refcount 0 
> stderr[0]: 26:356:prev 0x15d1c78, active 0x15d1aa8 (0xac000001 kind=1) refcount 1 
> stderr[0]: 26:441:active 0x15d1aa8 (0xac000001 kind=1) refcount 1 
> 
> stderr[0]: yield 
> stderr[0]: 32:441:active 0x15d1aa8 (0xac000001 kind=1) refcount 1 
> 
> stderr[0]: yield 
> stderr[0]: 48:badcase 0x15d1c78 (0xac000003 kind=0) refcount 0 
> stderr[0]: Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in MPI_Bsend: Internal MPI error!, error stack: 
> stderr[0]: MPI_Bsend(181)..............: MPI_Bsend(buf=0x19c8a06d70, count=1024, MPI_CHAR, dest=1, tag=0, MPI_COMM_WORLD) failed 
> stderr[0]: MPIR_Bsend_isend(226).......: 
> stderr[0]: MPIR_Bsend_check_active(474): 
> stderr[0]: MPIR_Test_impl(65)..........: 
> stderr[0]: MPIR_Request_complete(239)..: INTERNAL ERROR: unexpected value in case statement (value=0) 
> 
> 
> I'm guessing the problem with MPIU_THREAD_GRANULARITY_PER_OBJECT is 
> that there's no lock at all, so the threads are all over each other and 
> no yield is even needed. It's just not thread safe with the static 
> BsendBuffer. 
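>
> To make that concrete, a minimal, self-contained model of the
> per-object case (again not MPICH code, just the shape of the problem):
>
> typedef struct node { struct node *next; } node_t;
>
> static struct {
>     node_t *active;     /* static, shared, and completely unlocked */
> } BsendBuffer;
>
> static void check_active_unlocked(void)
> {
>     /* With per-object granularity there is no ALLFUNC lock here, so two
>      * threads calling MPI_Bsend can both load the same head, both decide
>      * the same request is complete, and both unlink/free it.  No yield
>      * is required; plain concurrent access to the static BsendBuffer is
>      * enough to corrupt the list. */
>     node_t *active = BsendBuffer.active;
>     while (active) {
>         node_t *next = active->next;
>         /* ... test, unlink, free ... */
>         active = next;
>     }
> }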
> 
> 
> Bob Cernohous:  (T/L 553) 507-253-6093
> 
> BobC at us.ibm.com
> IBM Rochester, Building 030-2(C335), Department 61L
> 3605 Hwy 52 North, Rochester,  MN 55901-7829
> 
> > Chaos reigns within.
> > Reflect, repent, and reboot.
> > Order shall return.