[mpich-devel] MPICH2 hang

Bob Cernohous bobc at us.ibm.com
Sat Dec 15 09:23:08 CST 2012


I didn't try it myself on BG/P.  All I really know is that someone at ANL 
(possibly Nick?) reported (about ftmain.f90):


This program creates as many communicators as there are MPI tasks. It is a 
bad program provided by a user. On BG/P, this type of program threw a 
proper warning. It should do the same on BG/Q. 
 
Business impact (BusImpact) 
It is mostly a nuisance to those who don't understand the inherent 
limitations in MPICH2. 


I don't have access to PMRs, but it was 41473,122,000.  The CPS issue 
that was opened to me from that PMR had no other details.

Bob Cernohous:  (T/L 553) 507-253-6093

BobC at us.ibm.com
IBM Rochester, Building 030-2(C335), Department 61L
3605 Hwy 52 North, Rochester,  MN 55901-7829

> Chaos reigns within.
> Reflect, repent, and reboot.
> Order shall return.




From:   Jim Dinan <dinan at mcs.anl.gov>
To:     devel at mpich.org, 
Date:   12/14/2012 11:52 PM
Subject:        Re: [mpich-devel] MPICH2 hang
Sent by:        devel-bounces at mpich.org



Hi Bob,

The ftmain2.f90 test fails on MPICH2 1.2.1p1, which was released on 
2-22-2010, well before any of the MPI-3 changes.  Could you provide some 
more information on when this test was reporting a failure instead of 
hanging?

It looks like this test case generates a context ID exhaustion pattern 
where context IDs are available at all processes, but the processes have 
no free context IDs in common.  Because there is no common context ID 
available, allocation can't succeed and the allocation loop runs 
indefinitely.  This is a resource exhaustion pattern that, AFAIK, MPICH 
has not detected in the past.
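
For anyone who hasn't looked at this code path, here is a rough sketch of 
the mask-and-allreduce idea and why it can spin.  It is only an 
illustration: the mask size, function name, and bookkeeping are made up 
and do not reproduce the actual commutil.c code.

#include <mpi.h>

#define MASK_SIZE 64   /* illustrative mask size, not MPICH's actual value */

/* Sketch: each process contributes a bitmask of the context IDs it still
   considers free; the bitwise AND across the communicator is the set of
   IDs that are free on every process. */
static int try_alloc_context_id(unsigned local_mask[MASK_SIZE], MPI_Comm comm)
{
    unsigned common[MASK_SIZE];
    int context_id = 0;

    while (context_id == 0) {
        MPI_Allreduce(local_mask, common, MASK_SIZE, MPI_UNSIGNED,
                      MPI_BAND, comm);

        for (int i = 0; i < MASK_SIZE && context_id == 0; i++) {
            for (int bit = 0; bit < 32; bit++) {
                if (common[i] & (1u << bit)) {
                    context_id = i * 32 + bit + 1;   /* lowest common free ID */
                    local_mask[i] &= ~(1u << bit);   /* mark it used locally */
                    break;
                }
            }
        }
        /* If 'common' is all zeros, every process may still have free IDs
           of its own, but none are free everywhere: context_id stays 0 and
           the loop repeats the allreduce forever unless the exhaustion is
           detected and reported. */
    }
    return context_id;
}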

I attached for reference a C translation of this test that is a little 
easier to grok and also fails on MPICH, going back to MPICH2 1.2.1p1.
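
The attachment itself isn't reproduced here, but the pattern it exercises 
(each rank repeatedly creating a small communicator of its own and never 
freeing it) looks roughly like the sketch below.  The single-member 
groups, the 2048 repeat count, and the output are illustrative 
assumptions, not the exact test.

#include <mpi.h>
#include <stdio.h>

/* Illustrative reproducer, not the deleted too_many_comms3.c: every rank
   repeatedly creates a communicator containing only itself and never
   frees it, so context IDs are consumed until allocation fails or hangs. */
int main(int argc, char **argv)
{
    int rank, i, repeat = 2048;           /* ~2K mirrors the reported hang */
    MPI_Group world_group, self_group;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    for (i = 0; i < repeat; i++) {
        /* Each rank passes its own one-member group; MPI_Comm_create is
           collective over MPI_COMM_WORLD, so all ranks take part in the
           context ID allocation on every iteration. */
        MPI_Group_incl(world_group, 1, &rank, &self_group);
        MPI_Comm_create(MPI_COMM_WORLD, self_group, &newcomm);
        MPI_Group_free(&self_group);
        if (rank == 0 && i % 256 == 0)
            printf("created %d communicators so far\n", i + 1);
        /* Intentionally no MPI_Comm_free(&newcomm): the leak is the bug. */
    }

    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}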

  ~Jim.

On 12/14/12 11:27 AM, Jim Dinan wrote:
> Hi Bob,
>
> Thanks for the detailed bug report and test cases.  I confirmed the
> failure you are seeing on the MPICH trunk.  This is likely related to
> changes we made to support MPI-3 MPI_Comm_create_group().  I created a
> ticket to track this:
>
> https://trac.mpich.org/projects/mpich/ticket/1768
>
>   ~Jim.
>
> On 12/12/12 5:38 PM, Bob Cernohous wrote:
>>
>> I've had a hang reported on BG/Q after about 2K MPI_Comm_create's.
>>
>>   It hangs on the latest 2 releases (mpich2 v1.5.x and v1.4.x) on BG/Q.
>>
>>   It also hangs on linux: 64bit (MPI over PAMI) MPICH2 library.
>>
>>   On older mpich 1.? (BG/P) it failed with 'too many communicators' and
>>   didn't hang, which is what they expected.
>>
>>   It seems like it's stuck in the while (*context_id == 0) loop
>>   repeatedly calling allreduce and never settling on a context id in
>>   commutil.c.  I didn't do a lot of debugging, but it seems like it's
>>   in vanilla mpich code, not something we modified.
>>
>>   ftmain.f90 fails if you run it on >2k ranks (creates one comm per
>> rank).  This was the original customer testcase.
>>
>> ftmain2.f90 fails by looping so you can run on fewer ranks.
>>
>>
>>
>>
>> I just noticed that with --np 1, I get the 'too many communicators' from
>> ftmain2.  But --np 2 and up hangs.
>>
>> stdout[0]:  check_newcomm do-start       0 , repeat         2045 , total         2046
>> stderr[0]: Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error
>> in PMPI_Comm_create: Other MPI error, error stack:
>> stderr[0]: PMPI_Comm_create(609).........:
>> MPI_Comm_create(MPI_COMM_WORLD, group=0xc80700f6, new_comm=0x1dbfffb520)
>> failed
>> stderr[0]: PMPI_Comm_create(590).........:
>> stderr[0]: MPIR_Comm_create_intra(250)...:
>> stderr[0]: MPIR_Get_contextid(521).......:
>> stderr[0]: MPIR_Get_contextid_sparse(752): Too many communicators
[attachment "too_many_comms3.c" deleted by Bob Cernohous/Rochester/IBM] 