[mpich-devel] MPICH2 hang

Jim Dinan dinan at mcs.anl.gov
Fri Dec 14 11:27:59 CST 2012


Hi Bob,

Thanks for the detailed bug report and test cases.  I confirmed the 
failure you are seeing on the MPICH trunk.  This is likely related to 
changes we made to support MPI-3 MPI_Comm_create_group().  I created a 
ticket to track this:

https://trac.mpich.org/projects/mpich/ticket/1768

  ~Jim.

On 12/12/12 5:38 PM, Bob Cernohous wrote:
>
> I've had a hang reported on BG/Q after about 2K MPI_Comm_create's.
>
>   It hangs on the latest 2 releases (mpich2 v1.5.x and v1.4.x) on BG/Q.
>
>   It also hangs on linux: 64bit (MPI over PAMI) MPICH2 library.
>
>   On older mpich 1.? (BG/P) it failed with 'too many communicators' and
>   didn't hang, which is what they expected.
>
>   It seems like it's stuck in the while (*context_id == 0)  loop
>   repeatedly calling allreduce and never settling on a context id in
>   commutil.c.  I didn't do a lot of debug but seems like it's in
>   vanilla mpich code, not something we modified.
>
>   ftmain.f90 fails if you run it on >2k ranks (creates one comm per
> rank).  This was the original customer testcase.
>
> ftmain2.f90 fails by looping so you can run on fewer ranks.
>
>
>
>
> I just noticed that with --np 1, I get the 'too many communicators' from
> ftmain2.  But --np 2 and up hangs.
>
> stdout[0]:  check_newcomm do-start       0 , repeat         2045 , total         2046
> stderr[0]: Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in PMPI_Comm_create: Other MPI error, error stack:
> stderr[0]: PMPI_Comm_create(609).........: MPI_Comm_create(MPI_COMM_WORLD, group=0xc80700f6, new_comm=0x1dbfffb520) failed
> stderr[0]: PMPI_Comm_create(590).........:
> stderr[0]: MPIR_Comm_create_intra(250)...:
> stderr[0]: MPIR_Get_contextid(521).......:
> stderr[0]: MPIR_Get_contextid_sparse(752): Too many communicators
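
The loop Bob describes above, one that keeps reducing until it settles on a
nonzero context id, can be illustrated with a small stand-alone sketch.  This
is not the code in commutil.c.  It assumes, purely for illustration, that each
rank keeps a bitmask of free context ids, that the allreduce combines those
masks with a bitwise AND, and that id 0 means "none found".  The mask size
(64 x 32 bits) and all names below are hypothetical, so the count it produces
need not match the 2046 in the log.  The point is only that an empty agreed
mask has to be reported as an error; a loop that simply retries on id 0 would
never exit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MASK_WORDS 64   /* assumption: 64 x 32 bits of context ids */
#define MAX_RANKS  8    /* upper bound on simulated ranks in this sketch */

typedef struct {
    uint32_t mask[MASK_WORDS];  /* bit set = id still free on this rank */
} rank_state;

/* Mark every id free except id 0, which is reserved to mean "none found". */
void rank_init(rank_state *r) {
    for (size_t i = 0; i < MASK_WORDS; i++) r->mask[i] = 0xFFFFFFFFu;
    r->mask[0] &= ~1u;
}

/* Stand-in for the allreduce: AND the masks of all ranks together. */
void and_reduce(const rank_state *ranks, int nranks, uint32_t *out) {
    for (size_t i = 0; i < MASK_WORDS; i++) {
        uint32_t w = 0xFFFFFFFFu;
        for (int r = 0; r < nranks; r++) w &= ranks[r].mask[i];
        out[i] = w;
    }
}

/* Lowest id that is free on every rank, or 0 if there is none. */
unsigned lowest_free(const uint32_t *mask) {
    for (size_t i = 0; i < MASK_WORDS; i++)
        for (unsigned b = 0; b < 32; b++)
            if (mask[i] & (1u << b)) return (unsigned)(i * 32 + b);
    return 0;
}

/* Allocate one id on all ranks.  Returns 0 on exhaustion instead of retrying. */
unsigned alloc_context_id(rank_state *ranks, int nranks) {
    uint32_t agreed[MASK_WORDS];
    and_reduce(ranks, nranks, agreed);
    unsigned id = lowest_free(agreed);
    if (id == 0) return 0;  /* exhausted: a retry-only loop would spin here */
    for (int r = 0; r < nranks; r++)
        ranks[r].mask[id / 32] &= ~(1u << (id % 32));
    return id;
}

/* Count how many ids can be handed out before the masks run dry. */
unsigned count_until_exhausted(int nranks) {
    assert(nranks >= 1 && nranks <= MAX_RANKS);
    rank_state ranks[MAX_RANKS];
    for (int r = 0; r < nranks; r++) rank_init(&ranks[r]);
    unsigned n = 0;
    while (alloc_context_id(ranks, nranks) != 0) n++;
    return n;
}
```

In this sketch the caller sees 0 once the ids run out and can raise an error
such as 'too many communicators', which is the behavior reported with --np 1.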

