<font size=2 face="sans-serif">I didn't try it myself on BG/P. All
I really know is someone at ANL (possibly Nick?) reported (about ftmain.f90)
:</font>
<br>
<br>
<br><font size=2 face="sans-serif">This programs create as many communicators
as there are MPI tasks. It is a bad program provided by a user. On BG/P,
this type of program threw a proper warning. It should do the same on BG/Q.
</font>
<br><font size=2 face="sans-serif">
</font>
<br><font size=2 face="sans-serif">Business impact ( BusImpact )
</font>
<br><font size=2 face="sans-serif">It is mostly a nuisance to those who
don't understand the inherent limitations in MPICH2.
</font>
<br>
<br>
<br><font size=2 face="sans-serif">I don't have access to PMR's but it
was </font><font size=1 color=#000080 face="sans-serif"><b>41473,122,000
. </b></font><font size=2 face="sans-serif">The CPS issue that was
opened to me from that PMR had no other details.</font>
<br><font size=2 face="sans-serif"><br>
Bob Cernohous: (T/L 553) 507-253-6093<br>
<br>
BobC@us.ibm.com<br>
IBM Rochester, Building 030-2(C335), Department 61L<br>
3605 Hwy 52 North, Rochester, MN 55901-7829<br>
<br>
> Chaos reigns within.<br>
> Reflect, repent, and reboot.<br>
> Order shall return.<br>
</font>

From: Jim Dinan <dinan@mcs.anl.gov>
To: devel@mpich.org
Date: 12/14/2012 11:52 PM
Subject: Re: [mpich-devel] MPICH2 hang
Sent by: devel-bounces@mpich.org

----------------------------------------

Hi Bob,

The ftmain2.f90 test fails on MPICH2 1.2.1p1, which was released on 2-22-2010, well before any of the MPI-3 changes. Could you provide some more information on when this test was reporting a failure instead of hanging?

It looks like this test case generates a context ID exhaustion pattern where context IDs are available at all processes, but the processes have no free context IDs in common. Because there is no common context ID available, allocation can't succeed and it loops indefinitely. This is a resource exhaustion pattern that, AFAIK, MPICH has not detected in the past.
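
(For intuition, a simplified model of that allocation loop. This is a sketch only, not the actual MPIR_Get_contextid code in commutil.c; the 64-bit mask, the MPI_BAND reduction, and the get_context_id name are illustrative assumptions:)

#include <mpi.h>
#include <stdint.h>

/* bit i set => context ID i+1 still free locally; ID 0 means "not yet allocated" */
static uint64_t my_free_mask = ~UINT64_C(0);

static int get_context_id(MPI_Comm comm)
{
    int context_id = 0;
    while (context_id == 0) {
        uint64_t common;
        /* Every member contributes the IDs it still has free. */
        MPI_Allreduce(&my_free_mask, &common, 1, MPI_UINT64_T, MPI_BAND, comm);
        if (common != 0) {
            /* All members pick the same lowest common free ID. */
            int i;
            for (i = 0; i < 64; i++) {
                if (common & (UINT64_C(1) << i)) {
                    my_free_mask &= ~(UINT64_C(1) << i);
                    context_id = i + 1;
                    break;
                }
            }
        }
        /* If every process still has free IDs but the masks never share a
         * bit, common stays 0 and the loop allreduces forever -- the hang.
         * In this model, noticing my_free_mask == 0 here is what would give
         * a clean "Too many communicators" failure instead. */
    }
    return context_id;
}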

I attached for reference a C translation of this test that is a little easier to grok and also fails on MPICH, going back to MPICH2 1.2.1p1.
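
(The attachment itself is not preserved in this thread -- see the deletion note at the bottom -- so as a stand-in, here is a minimal sketch of the kind of program being described: one communicator per rank, wrapped in a repeat loop so a small job can still burn through the roughly 2K available context IDs. This is not the deleted too_many_comms3.c, and the nested {0..r} groups are an assumption; the thread does not show how ftmain.f90 actually chooses its groups:)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, r, rep;
    MPI_Group world_group, sub_group;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    /* Repeat so even a small job eventually exhausts context IDs. */
    for (rep = 0; rep < 2048; rep++) {
        for (r = 0; r < size; r++) {
            int range[1][3] = { { 0, r, 1 } };          /* ranks 0..r */
            MPI_Group_range_incl(world_group, 1, range, &sub_group);
            /* Collective over MPI_COMM_WORLD; ranks <= r get a new
             * communicator and consume a context ID, the rest get
             * MPI_COMM_NULL.  Nothing is ever freed. */
            MPI_Comm_create(MPI_COMM_WORLD, sub_group, &newcomm);
            MPI_Group_free(&sub_group);
            if (rank == 0)
                printf("created comm %d of repeat %d\n", r, rep);
        }
    }

    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}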

~Jim.

On 12/14/12 11:27 AM, Jim Dinan wrote:
> Hi Bob,
>
> Thanks for the detailed bug report and test cases. I confirmed the
> failure you are seeing on the MPICH trunk. This is likely related to
> changes we made to support MPI-3 MPI_Comm_create_group(). I created a
> ticket to track this:
>
> https://trac.mpich.org/projects/mpich/ticket/1768
>
> ~Jim.
>
> On 12/12/12 5:38 PM, Bob Cernohous wrote:
>>
>> I've had a hang reported on BG/Q after about 2K MPI_Comm_create's.
>>
>> It hangs on the latest 2 releases (mpich2 v1.5.x and v1.4.x) on BG/Q.
>>
>> It also hangs on linux: 64-bit (MPI over PAMI) MPICH2 library.
>>
>> On older mpich 1.? (BG/P) it failed with 'too many communicators' and
>> didn't hang, which is what they expected.
>>
>> It seems like it's stuck in the while (*context_id == 0) loop
>> repeatedly calling allreduce and never settling on a context id in
>> commutil.c. I didn't do a lot of debug, but it seems like it's in
>> vanilla mpich code, not something we modified.
>>
>> ftmain.f90 fails if you run it on >2k ranks (creates one comm per
>> rank). This was the original customer testcase.
>>
>> ftmain2.f90 fails by looping so you can run on fewer ranks.
>>
>> I just noticed that with --np 1, I get the 'too many communicators' from
>> ftmain2. But --np 2 and up hangs.
>>
>> stdout[0]: check_newcomm do-start 0 , repeat 2045 , total 2046
>> stderr[0]: Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error
>> in PMPI_Comm_create: Other MPI error, error stack:
>> stderr[0]: PMPI_Comm_create(609).........:
>> MPI_Comm_create(MPI_COMM_WORLD, group=0xc80700f6, new_comm=0x1dbfffb520)
>> failed
>> stderr[0]: PMPI_Comm_create(590).........:
>> stderr[0]: MPIR_Comm_create_intra(250)...:
>> stderr[0]: MPIR_Get_contextid(521).......:
>> stderr[0]: MPIR_Get_contextid_sparse(752): Too many communicators

[attachment "too_many_comms3.c" deleted by Bob Cernohous/Rochester/IBM]