[mpich-devel] MPICH2 hang
Jeff Hammond
jhammond at alcf.anl.gov
Sun Dec 16 10:55:10 CST 2012
I am copying Nick on this since he was the one who reported the PMR to IBM.
My understanding of MPICH is that increasing the maximum communicator
count is not easy. I recall reading an article by Cray on how they
had to reduce the max to accommodate extra bits in order to run MPI at
scale on Jaguar.
I agree that this should not hang; rather, it should fail with an
appropriate error message. On the other hand, I don't think that any
effort should be put forth to accommodate users who want thousands of
communicators.
Jeff
On Sat, Dec 15, 2012 at 11:39 PM, Jim Dinan <dinan at mcs.anl.gov> wrote:
> Hi Bob,
>
> I think I have a fix for detecting this error, and I should be able to send
> it along next week.
>
> AFAIK, MPICH2 has never detected the particular context ID exhaustion
> scenario from the test case you sent. Is it possible that the size of the
> context ID space was increased on the BG/P until the error went away? If
> that was the case (would be in src/include/mpiimpl.h where
> MPIR_MAX_CONTEXT_MASK is defined), it may make sense to do the same for
> BG/Q, in case there are apps that need more communicators than the default
> limit.
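For readers who haven't looked at that code, a rough sketch of how the mask
size bounds the communicator count follows. The constant names other than
MPIR_MAX_CONTEXT_MASK and the concrete numbers are assumptions chosen to line
up with the ~2K limit discussed later in this thread, not values taken from
the MPICH source.

/* Rough sketch, not MPICH source: free context IDs are tracked in a bit
 * mask, so the total ID count is (mask words) x (bits per word).  The
 * word count is what MPIR_MAX_CONTEXT_MASK controls in
 * src/include/mpiimpl.h; the numbers below are assumptions. */
#include <stdio.h>

#define ASSUMED_MAX_CONTEXT_MASK 64   /* assumed default mask words */
#define ASSUMED_BITS_PER_WORD    32   /* assumed unsigned int words  */

int main(void)
{
    int total = ASSUMED_MAX_CONTEXT_MASK * ASSUMED_BITS_PER_WORD;  /* 2048 */
    /* A few IDs are reserved for built-ins (MPI_COMM_WORLD, MPI_COMM_SELF,
     * ...), so user code hits the limit slightly before this total. */
    printf("representable context IDs: %d\n", total);
    return 0;
}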
>
> Best,
> ~Jim.
>
>
> On 12/15/12 9:23 AM, Bob Cernohous wrote:
>>
>> I didn't try it myself on BG/P. All I really know is someone at ANL
>> (possibly Nick?) reported (about ftmain.f90):
>>
>>
>> This program creates as many communicators as there are MPI tasks. It is
>> a bad program provided by a user. On BG/P, this type of program threw a
>> proper warning. It should do the same on BG/Q.
>>
>> Business impact ( BusImpact )
>> It is mostly a nuisance to those who don't understand the inherent
>> limitations in MPICH2.
>>
>>
>> I don't have access to PMRs but it was 41473,122,000. The CPS issue
>> that was opened to me from that PMR had no other details.
>>
>> Bob Cernohous: (T/L 553) 507-253-6093
>>
>> BobC at us.ibm.com
>> IBM Rochester, Building 030-2(C335), Department 61L
>> 3605 Hwy 52 North, Rochester, MN 55901-7829
>>
>> > Chaos reigns within.
>> > Reflect, repent, and reboot.
>> > Order shall return.
>>
>>
>>
>>
>> From: Jim Dinan <dinan at mcs.anl.gov>
>> To: devel at mpich.org,
>> Date: 12/14/2012 11:52 PM
>> Subject: Re: [mpich-devel] MPICH2 hang
>> Sent by: devel-bounces at mpich.org
>> ------------------------------------------------------------------------
>>
>>
>>
>>
>> Hi Bob,
>>
>> The ftmain2.f90 test fails on MPICH2 1.2.1p1, which was released on
>> 2-22-2010, well before any of the MPI-3 changes. Could you provide some
>> more information on when this test was reporting a failure instead of
>> hanging?
>>
>> It looks like this test case generates a context ID exhaustion pattern
>> where context IDs are available at all processes, but the processes have
>> no free context IDs in common. Because there is no common context ID
>> available, allocation can't succeed and it loops indefinitely. This is
>> a resource exhaustion pattern that, AFAIK, MPICH has not detected in the
>> past.
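To make that failure mode concrete, here is an illustrative sketch of an
allreduce-based allocation loop of the kind described above. It is not the
actual MPIR_Get_contextid code from commutil.c, and MASK_WORDS is an assumed
size: if every rank still has free IDs locally but the masks share no common
bit, nothing changes between iterations and the loop spins forever.

/* Illustrative only -- not MPICH's implementation.  Each rank keeps a
 * mask of locally free context IDs; a new communicator needs an ID that
 * is free on every rank, found by ANDing the masks together. */
#include <mpi.h>

#define MASK_WORDS 64                    /* assumed mask size */

static unsigned local_mask[MASK_WORDS];  /* bit set => ID free on this rank */

int allocate_context_id(MPI_Comm comm)
{
    unsigned common[MASK_WORDS];
    for (;;) {
        /* IDs free on *all* ranks. */
        MPI_Allreduce(local_mask, common, MASK_WORDS,
                      MPI_UNSIGNED, MPI_BAND, comm);
        for (int w = 0; w < MASK_WORDS; w++) {
            if (common[w] != 0) {
                int bit = 0;
                while (!(common[w] & (1u << bit)))
                    bit++;
                local_mask[w] &= ~(1u << bit);   /* claim it locally */
                return w * 32 + bit;             /* same ID on every rank */
            }
        }
        /* Every rank may still have plenty of free IDs of its own, but if
         * none are free everywhere, this iteration is identical to the
         * next one -- the loop never terminates, which matches the hang. */
    }
}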
>>
>> I attached for reference a C translation of this test that is a little
>> easier to grok and also fails on MPICH, going back to MPICH2 1.2.1p1.
>>
>> ~Jim.
>>
>> On 12/14/12 11:27 AM, Jim Dinan wrote:
>> > Hi Bob,
>> >
>> > Thanks for the detailed bug report and test cases. I confirmed the
>> > failure you are seeing on the MPICH trunk. This is likely related to
>> > changes we made to support MPI-3 MPI_Comm_create_group(). I created a
>> > ticket to track this:
>> >
>> > https://trac.mpich.org/projects/mpich/ticket/1768
>> >
>> > ~Jim.
>> >
>> > On 12/12/12 5:38 PM, Bob Cernohous wrote:
>> >>
>> >> I've had a hang reported on BG/Q after about 2K MPI_Comm_create calls.
>> >>
>> >> It hangs on the latest 2 releases (mpich2 v1.5.x and v1.4.x) on BG/Q.
>> >>
>> >> It also hangs on linux: 64bit (MPI over PAMI) MPICH2 library.
>> >>
>> >> On older mpich 1.? (BG/P) it failed with 'too many communicators' and
>> >> didn't hang, which is what they expected.
>> >>
>> >> It seems like it's stuck in the while (*context_id == 0) loop
>> >> repeatedly calling allreduce and never settling on a context id in
>> >> commutil.c. I didn't do a lot of debugging, but it seems like it's in
>> >> vanilla mpich code, not something we modified.
>> >>
>> >> ftmain.f90 fails if you run it on >2k ranks (creates one comm per
>> >> rank). This was the original customer testcase.
>> >>
>> >> ftmain2.f90 fails by looping, so you can run it on fewer ranks.
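For anyone who wants to reproduce the pattern without the Fortran sources, a
rough C analogue of the one-communicator-per-rank test is sketched below. It
is my own sketch, not the C translation Jim attached, and the printing
interval is arbitrary.

/* Rough C analogue of the ftmain.f90 pattern (a sketch, not the attached
 * too_many_comms3.c): every iteration collectively creates a communicator
 * whose group holds a single rank and never frees it, so with >2K ranks
 * the context ID space runs out -- or, per this thread, the job hangs. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Group world_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    for (int i = 0; i < size; i++) {
        MPI_Group g;
        MPI_Comm c;
        MPI_Group_incl(world_group, 1, &i, &g);   /* group = { rank i } */
        /* Collective over MPI_COMM_WORLD; only rank i gets a real comm,
         * and it is deliberately never freed. */
        MPI_Comm_create(MPI_COMM_WORLD, g, &c);
        MPI_Group_free(&g);
        if (rank == 0 && (i + 1) % 256 == 0)
            printf("created %d communicators\n", i + 1);
    }

    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}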
>> >>
>> >>
>> >>
>> >>
>> >> I just noticed that with --np 1, I get the 'too many communicators' from
>> >> ftmain2. But --np 2 and up hangs.
>> >>
>> >> stdout[0]: check_newcomm do-start 0 , repeat 2045 , total 2046
>> >> stderr[0]: Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error
>> >> in PMPI_Comm_create: Other MPI error, error stack:
>> >> stderr[0]: PMPI_Comm_create(609).........: MPI_Comm_create(MPI_COMM_WORLD,
>> >> group=0xc80700f6, new_comm=0x1dbfffb520) failed
>> >> stderr[0]: PMPI_Comm_create(590).........:
>> >> stderr[0]: MPIR_Comm_create_intra(250)...:
>> >> stderr[0]: MPIR_Get_contextid(521).......:
>> >> stderr[0]: MPIR_Get_contextid_sparse(752): Too many communicators
>> [attachment "too_many_comms3.c" deleted by Bob Cernohous/Rochester/IBM]
--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond