[mpich-devel] MPICH2 hang

Bob Cernohous bobc at us.ibm.com
Wed Dec 12 17:38:09 CST 2012


 I've had a hang reported on BG/Q after about 2K MPI_Comm_create calls.

 It hangs on the latest 2 releases (mpich2 v1.5.x and v1.4.x) on BG/Q.

 It also hangs on Linux with the 64-bit (MPI over PAMI) MPICH2 library.

 On older mpich 1.? (BG/P) it failed with 'too many communicators' and
 didn't hang, which is what they expected.

 It seems to be stuck in the while (*context_id == 0) loop in
 commutil.c, repeatedly calling allreduce and never settling on a
 context id.  I haven't done much debugging, but it looks like this is
 in vanilla mpich code, not something we modified.

 ftmain.f90 fails if you run it on >2k ranks (creates one comm per 
 rank).  This was the original customer testcase.
 
 ftmain2.f90 creates the communicators in a loop, so it reproduces the
 failure on fewer ranks.




I just noticed that with --np 1, I get the 'too many communicators' from 
ftmain2.  But --np 2 and up hangs.

stdout[0]:  check_newcomm do-start           0 , repeat         2045 , total        2046
stderr[0]: Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in PMPI_Comm_create: Other MPI error, error stack:
stderr[0]: PMPI_Comm_create(609).........: MPI_Comm_create(MPI_COMM_WORLD, group=0xc80700f6, new_comm=0x1dbfffb520) failed
stderr[0]: PMPI_Comm_create(590).........: 
stderr[0]: MPIR_Comm_create_intra(250)...: 
stderr[0]: MPIR_Get_contextid(521).......: 
stderr[0]: MPIR_Get_contextid_sparse(752): Too many communicators
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ftmain.f90
Type: application/octet-stream
Size: 3768 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/devel/attachments/20121212/baccc0c4/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ftmain2.f90
Type: application/octet-stream
Size: 3922 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/devel/attachments/20121212/baccc0c4/attachment-0003.obj>

