[mpich-devel] MPICH2 hang

Jim Dinan dinan at mcs.anl.gov
Sat Dec 15 23:39:12 CST 2012


Hi Bob,

I think I have a fix for detecting this error, and I should be able to 
send it along next week.

AFAIK, MPICH2 has never detected the particular context ID exhaustion 
scenario from the test case you sent.  Is it possible that the size of 
the context ID space was increased on BG/P until the error went away?  
If that was the case (it would be in src/include/mpiimpl.h, where 
MPIR_MAX_CONTEXT_MASK is defined), it may make sense to do the same for 
BG/Q, in case there are apps that need more communicators than the 
default limit.
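
For reference, here is a rough sketch of how a bitmask-based, collective 
context ID allocation works and why it can run out.  This is not the 
MPICH source -- the names and the mask size of 64 below are assumptions; 
the real loop lives around MPIR_Get_contextid() in 
src/mpi/comm/commutil.c.  Each process contributes a bitmask of the IDs 
it still has free, an allreduce with MPI_BAND keeps only the IDs free on 
every process, and if that shared mask comes back all zeros there is 
nothing left to hand out.

    /* Illustrative sketch only -- names and sizes are assumptions,
     * not the MPICH implementation. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdint.h>

    #define MAX_CONTEXT_MASK 64                      /* assumed default   */
    #define NUM_CONTEXT_IDS  (MAX_CONTEXT_MASK * 32) /* 2048 IDs in total */

    /* One bit per context ID: bit set => ID is still free locally. */
    static uint32_t local_mask[MAX_CONTEXT_MASK];

    static int try_alloc_context_id(MPI_Comm comm)
    {
        uint32_t shared[MAX_CONTEXT_MASK];
        int i, b;

        /* AND the masks across all processes: a bit survives only if
         * that context ID is free everywhere. */
        MPI_Allreduce(local_mask, shared, MAX_CONTEXT_MASK,
                      MPI_UINT32_T, MPI_BAND, comm);

        for (i = 0; i < MAX_CONTEXT_MASK; i++) {
            for (b = 0; b < 32; b++) {
                if (shared[i] & (1u << b)) {
                    local_mask[i] &= ~(1u << b);   /* mark it used locally */
                    return i * 32 + b;
                }
            }
        }
        return -1;   /* no ID free on *all* processes: either report
                      * "too many communicators" or, if the caller just
                      * retries, spin forever */
    }

    int main(int argc, char **argv)
    {
        int i, id;
        MPI_Init(&argc, &argv);
        for (i = 0; i < MAX_CONTEXT_MASK; i++)
            local_mask[i] = 0xFFFFFFFFu;           /* all IDs start free */
        id = try_alloc_context_id(MPI_COMM_WORLD);
        printf("context ID space = %d, allocated id %d\n",
               NUM_CONTEXT_IDS, id);
        MPI_Finalize();
        return 0;
    }

If the default mask size really is 64 words of 32 bits, that gives 2048 
IDs, which would line up with the hang appearing after roughly 2K calls 
to MPI_Comm_create; raising MPIR_MAX_CONTEXT_MASK only raises that 
ceiling, it does not remove it.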

Best,
  ~Jim.

On 12/15/12 9:23 AM, Bob Cernohous wrote:
> I didn't try it myself on BG/P.  All I really know is someone at ANL
> (possibly Nick?) reported (about ftmain.f90) :
>
>
> This program creates as many communicators as there are MPI tasks. It is
> a bad program provided by a user. On BG/P, this type of program threw a
> proper warning. It should do the same on BG/Q.
>
> Business impact ( BusImpact )
> It is mostly a nuisance to those who don't understand the inherent
>   limitations in MPICH2.
>
>
> I don't have access to PMRs, but it was 41473,122,000.  The CPS issue
> that was opened to me from that PMR had no other details.
>
> Bob Cernohous:  (T/L 553) 507-253-6093
>
> BobC at us.ibm.com
> IBM Rochester, Building 030-2(C335), Department 61L
> 3605 Hwy 52 North, Rochester,  MN 55901-7829
>
>  > Chaos reigns within.
>  > Reflect, repent, and reboot.
>  > Order shall return.
>
>
>
>
> From: Jim Dinan <dinan at mcs.anl.gov>
> To: devel at mpich.org,
> Date: 12/14/2012 11:52 PM
> Subject: Re: [mpich-devel] MPICH2 hang
> Sent by: devel-bounces at mpich.org
> ------------------------------------------------------------------------
>
>
>
> Hi Bob,
>
> The ftmain2.f90 test fails on MPICH2 1.2.1p1, which was released on
> 2-22-2010, well before any of the MPI-3 changes.  Could you provide some
> more information on when this test was reporting a failure instead of
> hanging?
>
> It looks like this test case generates a context ID exhaustion pattern
> where each process still has free context IDs, but the processes have
> no free context ID in common.  Because there is no common context ID
> available, allocation can't succeed and it loops indefinitely.  This is
> a resource exhaustion pattern that, AFAIK, MPICH has not detected in the
> past.
>
> I attached for reference a C translation of this test that is a little
> easier to grok and also fails on MPICH, going back to MPICH2 1.2.1p1.
>
>   ~Jim.
>
> On 12/14/12 11:27 AM, Jim Dinan wrote:
>  > Hi Bob,
>  >
>  > Thanks for the detailed bug report and test cases.  I confirmed the
>  > failure you are seeing on the MPICH trunk.  This is likely related to
>  > changes we made to support MPI-3 MPI_Comm_create_group().  I created a
>  > ticket to track this:
>  >
>  > https://trac.mpich.org/projects/mpich/ticket/1768
>  >
>  >   ~Jim.
>  >
>  > On 12/12/12 5:38 PM, Bob Cernohous wrote:
>  >>
>  >> I've had a hang reported on BG/Q after about 2K MPI_Comm_create calls.
>  >>
>  >>   It hangs on the latest 2 releases (mpich2 v1.5.x and v1.4.x) on BG/Q.
>  >>
>  >>   It also hangs on Linux with the 64-bit (MPI over PAMI) MPICH2 library.
>  >>
>  >>   On older mpich 1.? (BG/P) it failed with 'too many communicators' and
>  >>   didn't hang, which is what they expected.
>  >>
>  >>   It seems like it's stuck in the while (*context_id == 0) loop in
>  >>   commutil.c, repeatedly calling allreduce and never settling on a
>  >>   context id.  I didn't do a lot of debugging, but it seems like it's
>  >>   in vanilla MPICH code, not something we modified.
>  >>
>  >>   ftmain.f90 fails if you run it on >2k ranks (creates one comm per
>  >> rank).  This was the original customer testcase.
>  >>
>  >> ftmain2.f90 triggers the same failure by looping, so you can run it
>  >> on fewer ranks.
>  >>
>  >>
>  >>
>  >>
>  >> I just noticed that with --np 1, I get the 'too many communicators'
>  >> error from ftmain2, but --np 2 and up hangs.
>  >>
>  >> stdout[0]:  check_newcomm do-start       0 , repeat         2045 , total
>  >>         2046
>  >> stderr[0]: Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error
>  >> in PMPI_Comm_create: Other MPI error, error stack:
>  >> stderr[0]: PMPI_Comm_create(609).........:
>  >> MPI_Comm_create(MPI_COMM_WORLD, group=0xc80700f6, new_comm=0x1dbfffb520)
>  >> failed
>  >> stderr[0]: PMPI_Comm_create(590).........:
>  >> stderr[0]: MPIR_Comm_create_intra(250)...:
>  >> stderr[0]: MPIR_Get_contextid(521).......:
>  >> stderr[0]: MPIR_Get_contextid_sparse(752): Too many communicators
> [attachment "too_many_comms3.c" deleted by Bob Cernohous/Rochester/IBM]

