[mpich-devel] MPICH2 hang
Jeff Hammond
jhammond at alcf.anl.gov
Sun Dec 16 10:55:10 CST 2012
I am copying Nick on this since he was the one who reported the PMR to IBM.
My understanding of MPICH is that increasing the maximum communicator
count is not easy. I recall reading an article by Cray on how they
had to reduce the max to accommodate extra bits in order to run MPI at
scale on Jaguar.
I agree that this should not hang; rather, it should fail with an
appropriate error message. On the other hand, I don't think that any
effort should be put forth to accommodate users who want thousands of
communicators.
Jeff
On Sat, Dec 15, 2012 at 11:39 PM, Jim Dinan <dinan at mcs.anl.gov> wrote:
> Hi Bob,
>
> I think I have a fix for detecting this error, and I should be able to send
> it along next week.
>
> AFAIK, MPICH2 has never detected the particular context ID exhaustion
> scenario from the test case you sent. Is it possible that the size of the
> context ID space was increased on the BG/P until the error went away? If
> that was the case (would be in src/include/mpiimpl.h where
> MPIR_MAX_CONTEXT_MASK is defined), it may make sense to do the same for
> BG/Q, in case there are apps that need more communicators than the default
> limit.
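For readers who haven't looked at that code, a rough sketch of how the mask
size bounds the communicator count follows. The constant names other than
MPIR_MAX_CONTEXT_MASK and the concrete numbers are assumptions chosen to line
up with the ~2K limit discussed later in this thread, not values taken from
the MPICH source.

/* Rough sketch, not MPICH source: free context IDs are tracked in a bit
 * mask, so the total ID count is (mask words) x (bits per word).  The
 * word count is what MPIR_MAX_CONTEXT_MASK controls in
 * src/include/mpiimpl.h; the numbers below are assumptions. */
#include <stdio.h>

#define ASSUMED_MAX_CONTEXT_MASK 64   /* assumed default mask words */
#define ASSUMED_BITS_PER_WORD    32   /* assumed unsigned int words  */

int main(void)
{
    int total = ASSUMED_MAX_CONTEXT_MASK * ASSUMED_BITS_PER_WORD;  /* 2048 */
    /* A few IDs are reserved for built-ins (MPI_COMM_WORLD, MPI_COMM_SELF,
     * ...), so user code hits the limit slightly before this total. */
    printf("representable context IDs: %d\n", total);
    return 0;
}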
>
> Best,
> ~Jim.
>
>
> On 12/15/12 9:23 AM, Bob Cernohous wrote:
>>
>> I didn't try it myself on BG/P. All I really know is someone at ANL
>> (possibly Nick?) reported (about ftmain.f90):
>>
>>
>> This program creates as many communicators as there are MPI tasks. It is
>> a bad program provided by a user. On BG/P, this type of program threw a
>> proper warning. It should do the same on BG/Q.
>>
>> Business impact ( BusImpact )
>> It is mostly a nuisance to those who don't understand the inherent
>> limitations in MPICH2.
>>
>>
>> I don't have access to PMRs but it was 41473,122,000. The CPS issue
>> that was opened to me from that PMR had no other details.
>>
>> Bob Cernohous: (T/L 553) 507-253-6093
>>
>> BobC at us.ibm.com
>> IBM Rochester, Building 030-2(C335), Department 61L
>> 3605 Hwy 52 North, Rochester, MN 55901-7829
>>
>> > Chaos reigns within.
>> > Reflect, repent, and reboot.
>> > Order shall return.
>>
>>
>>
>>
>> From: Jim Dinan <dinan at mcs.anl.gov>
>> To: devel at mpich.org,
>> Date: 12/14/2012 11:52 PM
>> Subject: Re: [mpich-devel] MPICH2 hang
>> Sent by: devel-bounces at mpich.org
>> ------------------------------------------------------------------------
>>
>>
>>
>>
>> Hi Bob,
>>
>> The ftmain2.f90 test fails on MPICH2 1.2.1p1, which was released on
>> 2-22-2010, well before any of the MPI-3 changes. Could you provide some
>> more information on when this test was reporting a failure instead of
>> hanging?
>>
>> It looks like this test case generates a context ID exhaustion pattern
>> where context IDs are available at all processes, but the processes have
>> no free context IDs in common. Because there is no common context ID
>> available, allocation can't succeed and it loops indefinitely. This is
>> a resource exhaustion pattern that, AFAIK, MPICH has not detected in the
>> past.
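To make that failure mode concrete, here is an illustrative sketch of an
allreduce-based allocation loop of the kind described above. It is not the
actual MPIR_Get_contextid code from commutil.c, and MASK_WORDS is an assumed
size: if every rank still has free IDs locally but the masks share no common
bit, nothing changes between iterations and the loop spins forever.

/* Illustrative only -- not MPICH's implementation.  Each rank keeps a
 * mask of locally free context IDs; a new communicator needs an ID that
 * is free on every rank, found by ANDing the masks together. */
#include <mpi.h>

#define MASK_WORDS 64                    /* assumed mask size */

static unsigned local_mask[MASK_WORDS];  /* bit set => ID free on this rank */

int allocate_context_id(MPI_Comm comm)
{
    unsigned common[MASK_WORDS];
    for (;;) {
        /* IDs free on *all* ranks. */
        MPI_Allreduce(local_mask, common, MASK_WORDS,
                      MPI_UNSIGNED, MPI_BAND, comm);
        for (int w = 0; w < MASK_WORDS; w++) {
            if (common[w] != 0) {
                int bit = 0;
                while (!(common[w] & (1u << bit)))
                    bit++;
                local_mask[w] &= ~(1u << bit);   /* claim it locally */
                return w * 32 + bit;             /* same ID on every rank */
            }
        }
        /* Every rank may still have plenty of free IDs of its own, but if
         * none are free everywhere, this iteration is identical to the
         * next one -- the loop never terminates, which matches the hang. */
    }
}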
>>
>> I attached for reference a C translation of this test that is a little
>> easier to grok and also fails on MPICH, going back to MPICH2 1.2.1p1.
>>
>> ~Jim.
>>
>> On 12/14/12 11:27 AM, Jim Dinan wrote:
>> > Hi Bob,
>> >
>> > Thanks for the detailed bug report and test cases. I confirmed the
>> > failure you are seeing on the MPICH trunk. This is likely related to
>> > changes we made to support MPI-3 MPI_Comm_create_group(). I created a
>> > ticket to track this:
>> >
>> > https://trac.mpich.org/projects/mpich/ticket/1768
>> >
>> > ~Jim.
>> >
>> > On 12/12/12 5:38 PM, Bob Cernohous wrote:
>> >>
>> >> I've had a hang reported on BG/Q after about 2K MPI_Comm_create calls.
>> >>
>> >> It hangs on the latest 2 releases (mpich2 v1.5.x and v1.4.x) on BG/Q.
>> >>
>> >> It also hangs on linux: 64bit (MPI over PAMI) MPICH2 library.
>> >>
>> >> On older mpich 1.? (BG/P) it failed with 'too many communicators' and
>> >> didn't hang, which is what they expected.
>> >>
>> >> It seems like it's stuck in the while (*context_id == 0) loop
>> >> repeatedly calling allreduce and never settling on a context id in
>> >> commutil.c. I didn't do a lot of debugging, but it seems like it's in
>> >> vanilla mpich code, not something we modified.
>> >>
>> >> ftmain.f90 fails if you run it on >2k ranks (creates one comm per
>> >> rank). This was the original customer testcase.
>> >>
>> >> ftmain2.f90 fails by looping, so you can run it on fewer ranks.
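For anyone who wants to reproduce the pattern without the Fortran sources, a
rough C analogue of the one-communicator-per-rank test is sketched below. It
is my own sketch, not the C translation Jim attached, and the printing
interval is arbitrary.

/* Rough C analogue of the ftmain.f90 pattern (a sketch, not the attached
 * too_many_comms3.c): every iteration collectively creates a communicator
 * whose group holds a single rank and never frees it, so with >2K ranks
 * the context ID space runs out -- or, per this thread, the job hangs. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Group world_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    for (int i = 0; i < size; i++) {
        MPI_Group g;
        MPI_Comm c;
        MPI_Group_incl(world_group, 1, &i, &g);   /* group = { rank i } */
        /* Collective over MPI_COMM_WORLD; only rank i gets a real comm,
         * and it is deliberately never freed. */
        MPI_Comm_create(MPI_COMM_WORLD, g, &c);
        MPI_Group_free(&g);
        if (rank == 0 && (i + 1) % 256 == 0)
            printf("created %d communicators\n", i + 1);
    }

    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}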
>> >>
>> >>
>> >>
>> >>
>> >> I just noticed that with --np 1, I get the 'too many communicators' from
>> >> ftmain2. But --np 2 and up hangs.
>> >>
>> >> stdout[0]: check_newcomm do-start 0 , repeat 2045 , total 2046
>> >> stderr[0]: Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error
>> >> in PMPI_Comm_create: Other MPI error, error stack:
>> >> stderr[0]: PMPI_Comm_create(609).........: MPI_Comm_create(MPI_COMM_WORLD,
>> >> group=0xc80700f6, new_comm=0x1dbfffb520) failed
>> >> stderr[0]: PMPI_Comm_create(590).........:
>> >> stderr[0]: MPIR_Comm_create_intra(250)...:
>> >> stderr[0]: MPIR_Get_contextid(521).......:
>> >> stderr[0]: MPIR_Get_contextid_sparse(752): Too many communicators
>> [attachment "too_many_comms3.c" deleted by Bob Cernohous/Rochester/IBM]
--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond