[mpich-commits] [mpich] MPICH primary repository branch, master, updated. v3.0.4-95-g1c0b649

mysql vizuser noreply at mpich.org
Thu Apr 25 14:12:25 CDT 2013


This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "MPICH primary repository".

The branch, master has been updated
       via  1c0b649785d408fa47d09864f2943a93978b1d56 (commit)
       via  06ea2eabca6d5b1556ea2b29e54f78fa8f41b5c8 (commit)
       via  788da81b50ab8874f22b574dbd69792a61a466e1 (commit)
       via  3197bbc5aea3fd560ce7ba22a77946c7c08946c5 (commit)
       via  016bcef9a69be95f0e777443b9e8f7c7ef3c5a63 (commit)
       via  9d3c3b374b39a2c4f17ad5e6d1dfc6d0899f6bcd (commit)
       via  15dddc3fd045cef313cd5f96e956e0920cc03de8 (commit)
       via  eed075d6bf4b3b4462a9e574bbf23f44df5960d7 (commit)
       via  e45c8bf210bae11d337ecdb8939037cd6b05a51b (commit)
       via  21db6f2edc633121304c796e2ac08208d2836004 (commit)
       via  973957e7d4637d1b713e1db00e2aba25a12458a6 (commit)
       via  3fdf08871a0a69b4b7d4953a4acf9316b88cc677 (commit)
       via  873180629a97fa154482e85445ae8227749ce386 (commit)
       via  8b6ceadb7eb56792cc3bd8d432bede1ad33d8fd3 (commit)
       via  ec2b9406b27a619000be4d8e255dd4d80fe1bf3a (commit)
       via  c32aeb9e104a375bc909244b51a6acc55fb47f9b (commit)
       via  09d20b381725981f5c10a7f5e062c521339b1081 (commit)
       via  c95d740c4268268233242c7f3d0c277e2b0abf2d (commit)
       via  224dfb1bc40328e9e51c52d110aaf971eb2c862f (commit)
       via  858da8da2e6e8d205b56e4343d800224718819b7 (commit)
       via  63577b2830635062056e45d4b211b4227040556c (commit)
       via  cbb6e7a7bd4928593d089126f6d1f82425a30ca7 (commit)
       via  09a16913c92416d95ed6aec5b29451d92e7fe4d9 (commit)
       via  4d66fef3da8af915ed42985295b1d5f9647a7cc9 (commit)
       via  1c0b1499d406f8bb6fb30427ef9cb476b98e0a93 (commit)
       via  7eb95d3a0059ebbf9b345be084b9551273fcd877 (commit)
       via  bdc6a36771a6276965f9e6caf86f7e26df9f1fd5 (commit)
       via  69ebc32fc07b36727a8ea9690d21cc3616ec191e (commit)
       via  07a94c0a595514ba127efd6d9b2077ca7d9d05f3 (commit)
       via  f3093fa83df4c6483d1c21df5208d9ff23967404 (commit)
       via  964ce7d207fd6374309309ec31ea5a8cd3e2e9e8 (commit)
       via  fd39fc2b4f9fce35c057bc09e22666e85b627f93 (commit)
       via  ba5b2ce35bb828791afdd1c8d70f7e3fe73015c9 (commit)
       via  cae9c7a1ddbaada469923ffc8e077e795075193f (commit)
       via  b2a8f02bdb8e1e0668e2d32a10f5ba1869f22592 (commit)
       via  0e3b48bdf1935ff6869e37d0ea19873e3f8282fb (commit)
       via  4fcf7a46582b799599a41997984bef9f1420c994 (commit)
       via  e741c721d6b57d32fcca37b031b6e625d7dba4ef (commit)
       via  0d5992f038460c80ad3af86b038936136f97e439 (commit)
       via  685b4d97b439813706516192feb08dccbca124f8 (commit)
       via  3fcd8d8e05dea84d7d10352f18445edea66cac75 (commit)
       via  f0eefb17a7e960824e8c7f54539c7a5b8fb08128 (commit)
       via  be2862cd7dc3a0c619200905b0a2985805cc3a26 (commit)
       via  6d3a8039a552b753a8f34e1278a4d308c0dfa82d (commit)
       via  14d4dd38f994a5d702accafddc7e2e64093d7ab9 (commit)
       via  c5f01fb660c1a60ac4ae2d01ff8884723c1dc4a3 (commit)
       via  4dafcf4b091fcd915f73faac1aece1f6d84f6444 (commit)
       via  5870a2240e7efb273484d188680443a53afbebd2 (commit)
       via  5a5f5276507f8fb77290eb9bccb9a1f572bef75c (commit)
       via  efeeb12cbb4b14dd8e15100beb333e4a86eda9a5 (commit)
       via  cc9a7b0329fc2b668cd6c15e86714261cb6588bf (commit)
       via  dad9e73110e94b921c9dd1fba0a611a157039ca5 (commit)
       via  8a98906cb6d4dd96189c23a652ebbccaa3b760be (commit)
       via  3b93604c9a5a7dbdaf96ecc4b97700403169a35c (commit)
       via  33225d96054a33e5d0aa0136f586cb052232bcaf (commit)
       via  286b7d55e4000819ce494de008134657b5839d8b (commit)
       via  4e48732bb08577a4eb8ef449de50e35348274e35 (commit)
       via  feb17d535e2eb7982b769389259238432222e403 (commit)
       via  a6885c7b09d0b29efab6e2f3540d333e6944cb03 (commit)
       via  c6846c1b9010fd447c46769dbf8cc58146859321 (commit)
       via  634f95f6d91a7f5b8af8fbde5021ca7fb93e348e (commit)
       via  e6ab6613f7ab50009b83bc760e2e3b63b71f30c6 (commit)
       via  5510e24f5fec7c51234b3df86ca9cac03ddf9edf (commit)
       via  5ea5683ff71c4419b628d4a3a4513772ae83000a (commit)
       via  9f335ce7ed568a19da82245492723e885dae9192 (commit)
       via  2f03f4ba14b1ff2bad3a6e35682c480703d7ec42 (commit)
       via  27efcfd94efdecaa42eb1dc58d601559bf045afc (commit)
       via  1a40b5d4b1c996a259f1bf62715bf9f84929e788 (commit)
       via  769b38bcfb9f623c2860169e11ef02d91b3eec60 (commit)
       via  8bcd1b9eb270df9196be41903a8ef2ddf7e2ca9b (commit)
       via  1753acded0dbe6c634b73b5610c5735b8013f08c (commit)
       via  53f6e9344deb468e2d1c244a8936928f101cf5fb (commit)
       via  e25d1dd65eee37596341d7b678881fda9384450c (commit)
       via  d1cfa4d42946c19c281ab18f0a09fd6fa02eb573 (commit)
       via  4022fa1dac9d6b9f7f66590ff29bef3c9a2221f1 (commit)
       via  8a5655f69f631bce6c573fe567051b694e3e004d (commit)
       via  136bf33fddd2333b0d9b0199dbf6582975f81b85 (commit)
       via  0994aab0c7f2febd633da2a0097626a57796179e (commit)
       via  4b9c2b4be5574f7df76589c2666abc6eaf51b20a (commit)
       via  b55e2ffda51d29261fcd2b230c2ff5276ccff743 (commit)
       via  eebe16d1604f4df9562e469631288e80b7026bf7 (commit)
       via  6d89f69981a1d00f992272062493e2594c997045 (commit)
       via  48c81cff1b5dd8b33c39387e4d7e6b9278bfdb20 (commit)
       via  26f8ef856e978b6e98838280c9228ebdbc627c58 (commit)
       via  9132e28ea86e5855d175048488224de575ba7ae7 (commit)
       via  6d73db2b6cea0f2f680f68800a6f7bdd11f8b47d (commit)
       via  3999e397ed27f6fc30ea628daa5f659ed6d3c8f0 (commit)
       via  6b41fba7799a84ee348551409e19aa18de984c6a (commit)
       via  1fd977b986e704c92304711195b42cbfb0aae30c (commit)
       via  7e9856376c646160f33a272bb53bac7080c24a97 (commit)
       via  87b2176055388b1e7219f76d883996475659c8f5 (commit)
       via  1cc25348873276e804d1ebc551916dd6f75e5157 (commit)
       via  35b7af16e01290c95aa61c8b31ce0acb94f92706 (commit)
       via  ed96b182c00bea496e4f3acbb6c7471ea288ab93 (commit)
      from  85acf6c54540d3e5e63deb4ef3a003dc78803659 (commit)

The revisions listed above that are new to this repository have not
appeared in any other notification email, so we list them in full
below.

- Log -----------------------------------------------------------------
http://git.mpich.org/mpich.git/commitdiff/1c0b649785d408fa47d09864f2943a93978b1d56

commit 1c0b649785d408fa47d09864f2943a93978b1d56
Merge: 06ea2ea 85acf6c
Author: Dave Goodell <goodell at mcs.anl.gov>
Date:   Thu Apr 25 11:20:34 2013 -0500

    Merge remote-tracking branch 'origin/master' into master
    
    Conflicts:
    	src/mpi/coll/helper_fns.c


http://git.mpich.org/mpich.git/commitdiff/06ea2eabca6d5b1556ea2b29e54f78fa8f41b5c8

commit 06ea2eabca6d5b1556ea2b29e54f78fa8f41b5c8
Merge: b98c7fd 788da81
Author: Dave Goodell <goodell at mcs.anl.gov>
Date:   Fri Apr 19 10:45:40 2013 -0500

    Merge branch 'ibm-integ'
    
    Brings in the 'pending-00' branch plus a few additional commits directly
    cherry-picked from 'pending-skip' to the 'ibm-integ' branch.
    
    No reviewer.


http://git.mpich.org/mpich.git/commitdiff/788da81b50ab8874f22b574dbd69792a61a466e1

commit 788da81b50ab8874f22b574dbd69792a61a466e1
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Tue Apr 16 16:34:29 2013 -0500

    Use 'MPIU_ERR_CHKANDJUMP' for onesided parameter checks
    
    The pamid onesided implementation used a non-standard error reporting
    macro which is not available in the top mpich.org master branch.
    
    See ticket #1809

diff --git a/src/mpid/pamid/src/onesided/mpid_win_accumulate.c b/src/mpid/pamid/src/onesided/mpid_win_accumulate.c
index 85f42e2..b34c083 100644
--- a/src/mpid/pamid/src/onesided/mpid_win_accumulate.c
+++ b/src/mpid/pamid/src/onesided/mpid_win_accumulate.c
@@ -152,6 +152,10 @@ MPIDI_Accumulate(pami_context_t   context,
  * \param[in] win              Window
  * \return MPI_SUCCESS
  */
+#undef FUNCNAME
+#define FUNCNAME MPID_Accumulate
+#undef FCNAME
+#define FCNAME MPIU_QUOTE(FUNCNAME)
 int
 MPID_Accumulate(void         *origin_addr,
                 int           origin_count,
@@ -163,6 +167,7 @@ MPID_Accumulate(void         *origin_addr,
                 MPI_Op        op,
                 MPID_Win     *win)
 {
+  int mpi_errno = MPI_SUCCESS;
   MPIDI_Win_request *req = MPIU_Calloc0(1, MPIDI_Win_request);
   req->win          = win;
   req->type         = MPIDI_WIN_REQUEST_ACCUMULATE;
@@ -321,6 +326,6 @@ MPID_Accumulate(void         *origin_addr,
    */
   PAMI_Context_post(MPIDI_Context[0], &req->post_request, MPIDI_Accumulate, req);
 
-
-  return MPI_SUCCESS;
+fn_fail:
+  return mpi_errno;
 }
diff --git a/src/mpid/pamid/src/onesided/mpid_win_free.c b/src/mpid/pamid/src/onesided/mpid_win_free.c
index a71e039..69e082b 100644
--- a/src/mpid/pamid/src/onesided/mpid_win_free.c
+++ b/src/mpid/pamid/src/onesided/mpid_win_free.c
@@ -30,6 +30,10 @@
  * \param[in,out] win  Window
  * \return MPI_SUCCESS or error returned from MPI_Barrier.
  */
+#undef FUNCNAME
+#define FUNCNAME MPID_Win_free
+#undef FCNAME
+#define FCNAME MPIU_QUOTE(FUNCNAME)
 int
 MPID_Win_free(MPID_Win **win_ptr)
 {
@@ -71,5 +75,6 @@ MPID_Win_free(MPID_Win **win_ptr)
 
   MPIU_Handle_obj_free(&MPID_Win_mem, win);
 
+fn_fail:
   return mpi_errno;
 }
diff --git a/src/mpid/pamid/src/onesided/mpid_win_get.c b/src/mpid/pamid/src/onesided/mpid_win_get.c
index aa8114b..d9eaf93 100644
--- a/src/mpid/pamid/src/onesided/mpid_win_get.c
+++ b/src/mpid/pamid/src/onesided/mpid_win_get.c
@@ -214,6 +214,10 @@ MPIDI_Get_use_pami_get(pami_context_t context, MPIDI_Win_request * req, int *fre
  * \param[in] win              Window
  * \return MPI_SUCCESS
  */
+#undef FUNCNAME
+#define FUNCNAME MPID_Get
+#undef FCNAME
+#define FCNAME MPIU_QUOTE(FUNCNAME)
 int
 MPID_Get(void         *origin_addr,
          int           origin_count,
@@ -224,6 +228,7 @@ MPID_Get(void         *origin_addr,
          MPI_Datatype  target_datatype,
          MPID_Win     *win)
 {
+  int mpi_errno = MPI_SUCCESS;
   MPIDI_Win_request *req = MPIU_Calloc0(1, MPIDI_Win_request);
   req->win          = win;
   req->type         = MPIDI_WIN_REQUEST_GET;
@@ -360,6 +365,6 @@ MPID_Get(void         *origin_addr,
    */
   PAMI_Context_post(MPIDI_Context[0], &req->post_request, MPIDI_Get, req);
 
-
-  return MPI_SUCCESS;
+fn_fail:
+  return mpi_errno;
 }
diff --git a/src/mpid/pamid/src/onesided/mpid_win_pscw.c b/src/mpid/pamid/src/onesided/mpid_win_pscw.c
index 221e316..7dc2cef 100644
--- a/src/mpid/pamid/src/onesided/mpid_win_pscw.c
+++ b/src/mpid/pamid/src/onesided/mpid_win_pscw.c
@@ -113,13 +113,15 @@ MPID_Win_start(MPID_Group *group,
   MPID_PROGRESS_WAIT_WHILE(group->size != sync->pw.count);
   sync->pw.count = 0;
 
-  MPIU_ERR_CHKORASSERT(win->mpid.sync.sc.group == NULL,
-                       mpi_errno, MPI_ERR_GROUP, return mpi_errno, "**group");
+  MPIU_ERR_CHKANDJUMP((win->mpid.sync.sc.group != NULL), mpi_errno, MPI_ERR_GROUP, "**group");
 
   win->mpid.sync.sc.group = group;
   win->mpid.sync.origin_epoch_type = MPID_EPOTYPE_START;
 
+fn_exit:
   return mpi_errno;
+fn_fail:
+  goto fn_exit;
 }
 
 
@@ -174,8 +176,8 @@ MPID_Win_post(MPID_Group *group,
 
   MPIR_Group_add_ref(group);
 
-  MPIU_ERR_CHKORASSERT(win->mpid.sync.pw.group == NULL,
-                       mpi_errno, MPI_ERR_GROUP, return mpi_errno,"**group");
+  MPIU_ERR_CHKANDJUMP((win->mpid.sync.pw.group != NULL), mpi_errno, MPI_ERR_GROUP, "**group");
+
   win->mpid.sync.pw.group = group;
 
   MPIDI_WinPSCW_info info = {
@@ -187,6 +189,7 @@ MPID_Win_post(MPID_Group *group,
 
   win->mpid.sync.target_epoch_type = MPID_EPOTYPE_POST;
 
+fn_fail:
   return mpi_errno;
 }
 
diff --git a/src/mpid/pamid/src/onesided/mpid_win_put.c b/src/mpid/pamid/src/onesided/mpid_win_put.c
index 1d4f428..8864721 100644
--- a/src/mpid/pamid/src/onesided/mpid_win_put.c
+++ b/src/mpid/pamid/src/onesided/mpid_win_put.c
@@ -220,6 +220,10 @@ MPIDI_Put_use_pami_put(pami_context_t   context, MPIDI_Win_request * req,int *fr
  * \param[in] win              Window
  * \return MPI_SUCCESS
  */
+#undef FUNCNAME
+#define FUNCNAME MPID_Put
+#undef FCNAME
+#define FCNAME MPIU_QUOTE(FUNCNAME)
 int
 MPID_Put(void         *origin_addr,
          int           origin_count,
@@ -230,6 +234,7 @@ MPID_Put(void         *origin_addr,
          MPI_Datatype  target_datatype,
          MPID_Win     *win)
 {
+  int mpi_errno = MPI_SUCCESS;
   MPIDI_Win_request *req = MPIU_Calloc0(1, MPIDI_Win_request);
   req->win          = win;
   req->type         = MPIDI_WIN_REQUEST_PUT;
@@ -365,6 +370,6 @@ MPID_Put(void         *origin_addr,
    */
   PAMI_Context_post(MPIDI_Context[0], &req->post_request, MPIDI_Put, req);
 
-
-  return MPI_SUCCESS;
+fn_fail:
+  return mpi_errno;
 }
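The diffs above replace a non-standard macro with the standard `MPIU_ERR_CHKANDJUMP` idiom: test a condition, record an error class in `mpi_errno`, and jump to a shared `fn_fail` label. A minimal sketch of that control flow, using a hypothetical stand-in macro and made-up error-class values (not the real MPICH definitions):

```c
#include <assert.h>
#include <stddef.h>

#define MPI_SUCCESS   0
#define MPI_ERR_GROUP 8   /* hypothetical stand-in value */

/* Stand-in for MPIU_ERR_CHKANDJUMP: if cond_ holds, record the error
 * class in err_ and jump to the fn_fail label in the calling function. */
#define ERR_CHKANDJUMP(cond_, err_, class_) \
    do { if (cond_) { (err_) = (class_); goto fn_fail; } } while (0)

struct win_sync { void *group; };

/* Mirrors the MPID_Win_start change: fail if a group is already
 * attached to the synchronization object, otherwise attach it. */
static int win_start(struct win_sync *sync, void *group)
{
    int mpi_errno = MPI_SUCCESS;

    ERR_CHKANDJUMP(sync->group != NULL, mpi_errno, MPI_ERR_GROUP);
    sync->group = group;

fn_exit:
    return mpi_errno;
fn_fail:
    goto fn_exit;
}
```

The `fn_exit`/`fn_fail` pair is the same shape the patch adds to `MPID_Win_start`: success falls through to `fn_exit`, and error paths converge on `fn_fail` before returning.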

http://git.mpich.org/mpich.git/commitdiff/3197bbc5aea3fd560ce7ba22a77946c7c08946c5

commit 3197bbc5aea3fd560ce7ba22a77946c7c08946c5
Merge: 016bcef 8b6cead
Author: Dave Goodell <goodell at mcs.anl.gov>
Date:   Mon Apr 15 11:31:19 2013 -0500

    Merge remote-tracking branch 'origin/pending-00' into ibm-integ
    
    Conflicts:
    	src/mpi/romio/adio/ad_ufs/ad_ufs_open.c
    
    Fixed whitespace in favor of better readability.  Probably not the
    whitespace used in the 'ibm' branch.
    
    No reviewer.


http://git.mpich.org/mpich.git/commitdiff/016bcef9a69be95f0e777443b9e8f7c7ef3c5a63

commit 016bcef9a69be95f0e777443b9e8f7c7ef3c5a63
Author: Dave Goodell <goodell at mcs.anl.gov>
Date:   Fri Apr 12 17:24:17 2013 -0500

    make `MPIR_Comm_create_intra` non-static
    
    This should be used instead of the `MPIR_Comm_create_intra_ext` hack in
    [f617bd9f].
    
    No reviewer.

diff --git a/src/include/mpiimpl.h b/src/include/mpiimpl.h
index 507fdf0..d3e6d03 100644
--- a/src/include/mpiimpl.h
+++ b/src/include/mpiimpl.h
@@ -4088,6 +4088,11 @@ int MPIR_Comm_create_create_and_map_vcrt(int n,
                                          MPID_VCRT *out_vcrt,
                                          MPID_VCR **out_vcr);
 
+/* implements the logic for MPI_Comm_create for intracommunicators only */
+int MPIR_Comm_create_intra(MPID_Comm *comm_ptr, MPID_Group *group_ptr,
+                           MPID_Comm **newcomm_ptr);
+
+
 int MPIR_Comm_commit( MPID_Comm * );
 
 int MPIR_Comm_is_node_aware( MPID_Comm * );
diff --git a/src/mpi/comm/comm_create.c b/src/mpi/comm/comm_create.c
index a86543e..894679b 100644
--- a/src/mpi/comm/comm_create.c
+++ b/src/mpi/comm/comm_create.c
@@ -20,12 +20,9 @@
 
 /* prototypes to make the compiler happy in the case that PMPI_LOCAL expands to
  * nothing instead of "static" */
-PMPI_LOCAL int MPIR_Comm_create_intra(MPID_Comm *comm_ptr, MPID_Group *group_ptr,
-                                      MPID_Comm **newcomm_ptr);
 PMPI_LOCAL int MPIR_Comm_create_inter(MPID_Comm *comm_ptr, MPID_Group *group_ptr,
                                       MPID_Comm **newcomm_ptr);
 
-
 /* Define MPICH_MPI_FROM_PMPI if weak symbols are not supported to build
    the MPI routines */
 #ifndef MPICH_MPI_FROM_PMPI
@@ -213,8 +210,8 @@ fn_fail:
 #define FCNAME MPIU_QUOTE(FUNCNAME)
 /* comm create impl for intracommunicators, assumes that the standard error
  * checking has already taken place in the calling function */
-PMPI_LOCAL int MPIR_Comm_create_intra(MPID_Comm *comm_ptr, MPID_Group *group_ptr,
-                                      MPID_Comm **newcomm_ptr)
+int MPIR_Comm_create_intra(MPID_Comm *comm_ptr, MPID_Group *group_ptr,
+                           MPID_Comm **newcomm_ptr)
 {
     int mpi_errno = MPI_SUCCESS;
     MPIR_Context_id_t new_context_id = 0;

http://git.mpich.org/mpich.git/commitdiff/9d3c3b374b39a2c4f17ad5e6d1dfc6d0899f6bcd

commit 9d3c3b374b39a2c4f17ad5e6d1dfc6d0899f6bcd
Author: Qi QC Zhang <keirazhang at cn.ibm.com>
Date:   Mon Dec 10 00:28:53 2012 -0500

    check NULL size in MPI_File_get_size
    
    (ibm) D187578
    (ibm) 2213e89684327f2b5c4b7d2cdd7416a3ec7a110a
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpi/romio/mpi-io/get_size.c b/src/mpi/romio/mpi-io/get_size.c
index f3150de..1582380 100644
--- a/src/mpi/romio/mpi-io/get_size.c
+++ b/src/mpi/romio/mpi-io/get_size.c
@@ -51,6 +51,12 @@ int MPI_File_get_size(MPI_File fh, MPI_Offset *size)
 
     /* --BEGIN ERROR HANDLING-- */
     MPIO_CHECK_FILE_HANDLE(adio_fh, myname, error_code);
+    if(size == NULL){
+        error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
+                     myname, __LINE__, MPI_ERR_ARG,
+                     "**nullptr", "**nullptr %s", "size");
+        goto fn_fail;
+    }
     /* --END ERROR HANDLING-- */
 
     ADIOI_TEST_DEFERRED(adio_fh, myname, &error_code);
@@ -70,4 +76,9 @@ int MPI_File_get_size(MPI_File fh, MPI_Offset *size)
 
 fn_exit:
     return error_code;
+fn_fail:
+    /* --BEGIN ERROR HANDLING-- */
+    error_code = MPIO_Err_return_file(fh, error_code);
+    goto fn_exit;
+    /* --END ERROR HANDLING-- */
 }
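The patch above guards `MPI_File_get_size` against a NULL `size` output pointer before it is dereferenced, routing the error through an `fn_fail` block. A minimal sketch of that validate-then-jump shape, with hypothetical error-code values standing in for the real MPI ones:

```c
#include <assert.h>
#include <stddef.h>

#define MPI_SUCCESS 0
#define MPI_ERR_ARG 12  /* hypothetical stand-in value */

/* Sketch of the guarded getter: reject a NULL output pointer up front,
 * route the error through fn_fail, and return through fn_exit. */
static int file_get_size(long file_bytes, long *size)
{
    int error_code = MPI_SUCCESS;

    if (size == NULL) {
        error_code = MPI_ERR_ARG;
        goto fn_fail;
    }
    *size = file_bytes;

fn_exit:
    return error_code;
fn_fail:
    /* In ROMIO this is where MPIO_Err_return_file() would run. */
    goto fn_exit;
}
```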

http://git.mpich.org/mpich.git/commitdiff/15dddc3fd045cef313cd5f96e956e0920cc03de8

commit 15dddc3fd045cef313cd5f96e956e0920cc03de8
Author: Dave Goodell <goodell at mcs.anl.gov>
Date:   Fri Apr 12 16:39:48 2013 -0500

    whitespace fixup for [eed075d6]
    
    It appears to have been written with an incorrect tabstop=4 assumption.
    
    No reviewer.

diff --git a/src/mpi/romio/adio/ad_ufs/ad_ufs_open.c b/src/mpi/romio/adio/ad_ufs/ad_ufs_open.c
index 751e7fc..a615778 100644
--- a/src/mpi/romio/adio/ad_ufs/ad_ufs_open.c
+++ b/src/mpi/romio/adio/ad_ufs/ad_ufs_open.c
@@ -91,19 +91,19 @@ void ADIOI_UFS_Open(ADIO_File fd, int *error_code)
 					       __LINE__, MPI_ERR_READ_ONLY,
 					       "**ioneedrd", 0 );
 	}
-    else if(errno == EISDIR) {
-        *error_code = MPIO_Err_create_code(MPI_SUCCESS,
-                           MPIR_ERR_RECOVERABLE, myname,
-                           __LINE__, MPI_ERR_BAD_FILE,
-                           "**filename", 0);
-    }
-    else if(errno == EEXIST) {
-        *error_code = MPIO_Err_create_code(MPI_SUCCESS,
-                           MPIR_ERR_RECOVERABLE, myname,
-                           __LINE__, MPI_ERR_FILE_EXISTS,
-                           "**fileexist", 0);
+        else if(errno == EISDIR) {
+            *error_code = MPIO_Err_create_code(MPI_SUCCESS,
+                                               MPIR_ERR_RECOVERABLE, myname,
+                                               __LINE__, MPI_ERR_BAD_FILE,
+                                               "**filename", 0);
+        }
+        else if(errno == EEXIST) {
+            *error_code = MPIO_Err_create_code(MPI_SUCCESS,
+                                               MPIR_ERR_RECOVERABLE, myname,
+                                               __LINE__, MPI_ERR_FILE_EXISTS,
+                                               "**fileexist", 0);
 
-    }
+        }
 	else {
 	    *error_code = MPIO_Err_create_code(MPI_SUCCESS,
 					       MPIR_ERR_RECOVERABLE, myname,

http://git.mpich.org/mpich.git/commitdiff/eed075d6bf4b3b4462a9e574bbf23f44df5960d7

commit eed075d6bf4b3b4462a9e574bbf23f44df5960d7
Author: Qi QC Zhang <keirazhang at cn.ibm.com>
Date:   Tue Dec 4 01:29:47 2012 -0500

    Wrong error class returned on GPFS
    
    (ibm) D187578
    (ibm) 8f26ccae0b1f8ea23caf4d37c936b84c84762f61
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpi/romio/adio/ad_ufs/ad_ufs_open.c b/src/mpi/romio/adio/ad_ufs/ad_ufs_open.c
index 1a8ee3b..751e7fc 100644
--- a/src/mpi/romio/adio/ad_ufs/ad_ufs_open.c
+++ b/src/mpi/romio/adio/ad_ufs/ad_ufs_open.c
@@ -91,6 +91,19 @@ void ADIOI_UFS_Open(ADIO_File fd, int *error_code)
 					       __LINE__, MPI_ERR_READ_ONLY,
 					       "**ioneedrd", 0 );
 	}
+    else if(errno == EISDIR) {
+        *error_code = MPIO_Err_create_code(MPI_SUCCESS,
+                           MPIR_ERR_RECOVERABLE, myname,
+                           __LINE__, MPI_ERR_BAD_FILE,
+                           "**filename", 0);
+    }
+    else if(errno == EEXIST) {
+        *error_code = MPIO_Err_create_code(MPI_SUCCESS,
+                           MPIR_ERR_RECOVERABLE, myname,
+                           __LINE__, MPI_ERR_FILE_EXISTS,
+                           "**fileexist", 0);
+
+    }
 	else {
 	    *error_code = MPIO_Err_create_code(MPI_SUCCESS,
 					       MPIR_ERR_RECOVERABLE, myname,
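The fix above gives `EISDIR` and `EEXIST` their own MPI error classes instead of letting them fall into the generic branch. The errno dispatch can be sketched in isolation; the class values below are hypothetical stand-ins, not the real MPI constants:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the MPI error classes used in the patch. */
#define MPI_ERR_READ_ONLY    3
#define MPI_ERR_BAD_FILE     4
#define MPI_ERR_FILE_EXISTS  5
#define MPI_ERR_IO           6

/* Mirrors the errno dispatch in ADIOI_UFS_Open after the patch: each
 * interesting errno maps to its own MPI error class, and everything
 * else falls back to a generic I/O error. */
static int open_errno_to_class(int err)
{
    switch (err) {
    case EROFS:  return MPI_ERR_READ_ONLY;
    case EISDIR: return MPI_ERR_BAD_FILE;
    case EEXIST: return MPI_ERR_FILE_EXISTS;
    default:     return MPI_ERR_IO;
    }
}
```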

http://git.mpich.org/mpich.git/commitdiff/e45c8bf210bae11d337ecdb8939037cd6b05a51b

commit e45c8bf210bae11d337ecdb8939037cd6b05a51b
Author: Qi QC Zhang <keirazhang at cn.ibm.com>
Date:   Thu Oct 11 01:04:54 2012 -0400

    update error checkings in ADIO
    [editor's note: "checkings" should read "checking"]
    
    (ibm) 70c9c492f206d4b2b124360bd1a8c07283493895
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpi/romio/adio/common/ad_fstype.c b/src/mpi/romio/adio/common/ad_fstype.c
index 01d816f..85d2142 100644
--- a/src/mpi/romio/adio/common/ad_fstype.c
+++ b/src/mpi/romio/adio/common/ad_fstype.c
@@ -204,12 +204,17 @@ static void ADIO_FileSysType_parentdir(const char *filename, char **dirnamep)
 }
 #endif /* ROMIO_NTFS */
 
-#ifdef ROMIO_BGL   /* BlueGene support for lockless i/o (necessary for PVFS.
+#if defined(ROMIO_BGL) || defined(ROMIO_BG)
+		    /* BlueGene support for lockless i/o (necessary for PVFS.
 		      possibly beneficial for others, unless data sieving
 		      writes desired) */
 
 /* BlueGene environment variables can override lockless selection.*/
+#ifdef ROMIO_BG
+extern void ad_bg_get_env_vars();
+#else
 extern void ad_bgl_get_env_vars();
+#endif
 extern long bglocklessmpio_f_type;
 
 static void check_for_lockless_exceptions(long stat_type, int *fstype)
@@ -350,6 +355,16 @@ static void ADIO_FileSysType_fncall(const char *filename, int *fstype, int *erro
     }
 # endif
 
+#ifdef ROMIO_BG
+/* The BlueGene generic ADIO is also a special case. */
+    ad_bg_get_env_vars();
+
+    *fstype = ADIO_BG;
+    check_for_lockless_exceptions(fsbuf.f_type, fstype);
+    *error_code = MPI_SUCCESS;
+    return;
+#endif
+
 #  ifdef ROMIO_BGL 
     /* BlueGene is a special case: all file systems are AD_BGL, except for
      * certain exceptions */
@@ -579,6 +594,9 @@ static void ADIO_FileSysType_prefix(const char *filename, int *fstype, int *erro
     else if (!strncmp(filename, "bgl:", 4) || !strncmp(filename, "BGL:", 4)) {
 	*fstype = ADIO_BGL;
     }
+    else if (!strncmp(filename, "bg:", 3) || !strncmp(filename, "BG:", 3)) {
+	*fstype = ADIO_BG;
+    }
     else if (!strncmp(filename, "bglockless:", 11) || 
 	    !strncmp(filename, "BGLOCKLESS:", 11)) {
 	*fstype = ADIO_BGLOCKLESS;
@@ -828,6 +846,16 @@ void ADIO_ResolveFileType(MPI_Comm comm, const char *filename, int *fstype,
 	*ops = &ADIO_BGL_operations;
 #endif
     }
+    if (file_system == ADIO_BG) {
+#ifndef ROMIO_BG
+	*error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
+					myname, __LINE__, MPI_ERR_IO,
+					"**iofstypeunsupported", 0);
+	return;
+#else
+	*ops = &ADIO_BG_operations;
+#endif
+    }
     if (file_system == ADIO_BGLOCKLESS) {
 #ifndef ROMIO_BGLOCKLESS
 	*error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, 
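The `ADIO_FileSysType_prefix` hunk above adds a `bg:`/`BG:` case to ROMIO's filename-prefix dispatch. The matching logic can be sketched as follows; the `FS_*` enum is a hypothetical stand-in for ROMIO's `ADIO_*` file-system codes:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for ROMIO's ADIO_* file-system type codes. */
enum fstype { FS_UFS = 0, FS_BGL, FS_BG, FS_BGLOCKLESS };

/* Mirrors ADIO_FileSysType_prefix: each strncmp includes the trailing
 * ':' in the prefix, so "bglockless:" cannot accidentally match the
 * shorter "bg:" test and the order of the checks is not significant. */
static enum fstype fstype_from_prefix(const char *filename)
{
    if (!strncmp(filename, "bgl:", 4) || !strncmp(filename, "BGL:", 4))
        return FS_BGL;
    if (!strncmp(filename, "bg:", 3) || !strncmp(filename, "BG:", 3))
        return FS_BG;
    if (!strncmp(filename, "bglockless:", 11) ||
        !strncmp(filename, "BGLOCKLESS:", 11))
        return FS_BGLOCKLESS;
    return FS_UFS;  /* no recognized prefix: fall back to a default */
}
```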

http://git.mpich.org/mpich.git/commitdiff/21db6f2edc633121304c796e2ac08208d2836004

commit 21db6f2edc633121304c796e2ac08208d2836004
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Wed Apr 3 13:09:40 2013 -0500

    BG ROMIO changes that remained unresolved after the merge.
    
    I think these are "really old" baseline changes that didn't get pushed
    into the top-level mpich master branch during previous (?) code
    contributions.
    
    These changes are needed at this point because the following commits,
    which were essentially cherry-picked from the ibm master branch,
    depend on some of the romio changes in this commit.

diff --git a/src/mpi/romio/adio/ad_bg/ad_bg_aggrs.c b/src/mpi/romio/adio/ad_bg/ad_bg_aggrs.c
index c99b2d5..2596b87 100644
--- a/src/mpi/romio/adio/ad_bg/ad_bg_aggrs.c
+++ b/src/mpi/romio/adio/ad_bg/ad_bg_aggrs.c
@@ -186,7 +186,7 @@ ADIOI_BG_compute_agg_ranklist_serial_do (const ADIOI_BG_ConfInfo_t *confInfo,
    /* In this array, we can pick an appropriate number of midpoints based on
     * our bridgenode index and the number of aggregators */
 
-   numAggs = confInfo->aggRatio * confInfo->ioMaxSize /*virtualPsetSize*/;
+   numAggs = confInfo->aggRatio * confInfo->ioMinSize /*virtualPsetSize*/;
    if(numAggs == 1)
       aggTotal = 1;
    else
@@ -194,8 +194,9 @@ ADIOI_BG_compute_agg_ranklist_serial_do (const ADIOI_BG_ConfInfo_t *confInfo,
     * bridge node is an aggregator */
       aggTotal = confInfo->numBridgeRanks * (numAggs+1);
 
-   distance = (confInfo->ioMaxSize /*virtualPsetSize*/ / numAggs);
-   TRACE_ERR("numBridgeRanks: %d, aggRatio: %f numBridge: %d pset size: %d numAggs: %d distance: %d, aggTotal: %d\n", confInfo->numBridgeRanks, confInfo->aggRatio, confInfo->numBridgeRanks,  confInfo->ioMaxSize /*virtualPsetSize*/, numAggs, distance, aggTotal);
+   if(aggTotal>confInfo->nProcs) aggTotal=confInfo->nProcs;
+
+   TRACE_ERR("numBridgeRanks: %d, aggRatio: %f numBridge: %d pset size: %d/%d numAggs: %d, aggTotal: %d\n", confInfo->numBridgeRanks, confInfo->aggRatio, confInfo->numBridgeRanks,  confInfo->ioMinSize, confInfo->ioMaxSize /*virtualPsetSize*/, numAggs, aggTotal);
    aggList = (int *)ADIOI_Malloc(aggTotal * sizeof(int));
 
 
@@ -205,30 +206,59 @@ ADIOI_BG_compute_agg_ranklist_serial_do (const ADIOI_BG_ConfInfo_t *confInfo,
       aggList[0] = bridgelist[0].bridge;
    else
    {
-      for(i=0; i < confInfo->numBridgeRanks; i++)
-      {
-         aggList[i]=bridgelist[i*confInfo->ioMaxSize /*virtualPsetSize*/].bridge;
-         TRACE_ERR("aggList[%d]: %d\n", i, aggList[i]);
-         
+     int lastBridge = bridgelist[confInfo->nProcs-1].bridge;
+     int nextBridge = 0, nextAggr = confInfo->numBridgeRanks;
+     int psetSize = 0;
+     int procIndex;
+     for(procIndex=confInfo->nProcs-1; procIndex>=0; procIndex--)
+     {
+       TRACE_ERR("bridgelist[%d].bridge %u/rank %u\n",procIndex,  bridgelist[procIndex].bridge, bridgelist[procIndex].rank);
+       if(lastBridge == bridgelist[procIndex].bridge)
+       {
+         psetSize++;
+         if(procIndex) continue; 
+         else procIndex--;/* procIndex == 0 */
+       }
+       /* Sets up a list of nodes which will act as aggregators. numAggs
+        * per bridge node total. The list of aggregators is
+        * bridgeNode 0
+        * bridgeNode 1
+        * bridgeNode ...
+        * bridgeNode N
+        * bridgeNode[0]aggr[0]
+        * bridgeNode[0]aggr[1]...
+        * bridgeNode[0]aggr[N]...
+        * ...
+        * bridgeNode[N]aggr[0]..
+        * bridgeNode[N]aggr[N]
+        */
+       aggList[nextBridge]=lastBridge;
+       distance = psetSize/numAggs;
+       TRACE_ERR("nextBridge %u is bridge %u, distance %u, size %u\n",nextBridge, aggList[nextBridge],distance,psetSize);
+       if(numAggs>1)
+       {
          for(j = 0; j < numAggs; j++)
          {
-            /* Sets up a list of nodes which will act as aggregators. numAggs
-             * per bridge node total. The list of aggregators is
-             * bridgeNodes
-             * bridgeNode[0]aggr[0]
-             * bridgeNode[0]aggr[1]...
-             * bridgeNode[0]aggr[N]...
-             * ...
-             * bridgeNode[N]aggr[0]..
-             * bridgeNode[N]aggr[N]
-             */
-            aggList[i*numAggs+j+confInfo->numBridgeRanks] = bridgelist[i*confInfo->ioMaxSize /*virtualPsetSize*/ + j*distance+1].rank;
-            TRACE_ERR("(post bridge) agglist[%d] -> %d\n", confInfo->numBridgeRanks +i*numAggs+j, aggList[i*numAggs+j+confInfo->numBridgeRanks]);
+           ADIOI_BG_assert(nextAggr<aggTotal);
+           aggList[nextAggr] = bridgelist[procIndex+j*distance+1].rank;
+           TRACE_ERR("agglist[%d] -> bridgelist[%d] = %d\n", nextAggr, procIndex+j*distance+1,aggList[nextAggr]);
+           if(aggList[nextAggr]==lastBridge) /* can't have bridge in the list twice */
+           {  
+             aggList[nextAggr] = bridgelist[procIndex+psetSize].rank; /* take the last one in the pset */
+             TRACE_ERR("replacement agglist[%d] -> bridgelist[%d] = %d\n", nextAggr, procIndex+psetSize,aggList[nextAggr]);
+           }
+           nextAggr++;
          }
-      }
+       }
+       if(procIndex<0) break;
+       lastBridge = bridgelist[procIndex].bridge;
+       psetSize = 1;
+       nextBridge++;
+     }
    }
 
-   memcpy(tmp_ranklist, aggList, (numAggs*confInfo->numBridgeRanks+numAggs)*sizeof(int));
+   TRACE_ERR("memcpy(tmp_ranklist, aggList, (numAggs(%u)*confInfo->numBridgeRanks(%u)+numAggs(%u)) (%u) %u*sizeof(int))\n",numAggs,confInfo->numBridgeRanks,numAggs,(numAggs*confInfo->numBridgeRanks+numAggs),aggTotal);
+   memcpy(tmp_ranklist, aggList, aggTotal*sizeof(int));
    for(i=0;i<aggTotal;i++)
    {
       TRACE_ERR("tmp_ranklist[%d]: %d\n", i, tmp_ranklist[i]);
@@ -605,7 +635,6 @@ void ADIOI_BG_Calc_my_req(ADIO_File fd, ADIO_Offset *offset_list, ADIO_Offset *l
 #ifdef AGGREGATION_PROFILE
     MPE_Log_event (5024, 0, NULL);
 #endif
-
     *count_my_req_per_proc_ptr = (int *) ADIOI_Calloc(nprocs,sizeof(int)); 
     count_my_req_per_proc = *count_my_req_per_proc_ptr;
 /* count_my_req_per_proc[i] gives the no. of contig. requests of this
@@ -820,7 +849,7 @@ void ADIOI_BG_Calc_others_req(ADIO_File fd, int count_my_req_procs,
      */
     count_others_req_per_proc = (int *) ADIOI_Malloc(nprocs*sizeof(int));
 /*     cora2a1=timebase(); */
-for(i=0;i<nprocs;i++)
+/*for(i=0;i<nprocs;i++) ?*/
     MPI_Alltoall(count_my_req_per_proc, 1, MPI_INT,
 		 count_others_req_per_proc, 1, MPI_INT, fd->comm);
 
@@ -903,7 +932,7 @@ for(i=0;i<nprocs;i++)
     if ( sendBufForLens    == (void*)0xFFFFFFFFFFFFFFFF) sendBufForLens    = NULL;
 
     /* Calculate the displacements from the sendBufForOffsets/Lens */
-    MPI_Barrier(fd->comm);
+    MPI_Barrier(fd->comm);/* Why?*/
     for (i=0; i<nprocs; i++)
     {
 	/* Send these offsets to process i.*/
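The reworked loop in `ADIOI_BG_compute_agg_ranklist_serial_do` above picks, for each bridge node, `numAggs` aggregator ranks evenly spaced through that bridge's pset (`distance = psetSize/numAggs`), skipping the bridge node at index 0. The spacing rule alone can be sketched with a hypothetical helper (names and signature are illustrative, not MPICH's):

```c
#include <assert.h>

/* Sketch of the even-spacing rule: given a pset of pset_size ranks
 * starting at base, pick num_aggs aggregators one "distance" apart,
 * offset by 1 so the bridge node itself (index 0) is never chosen. */
static void pick_aggregators(int base, int pset_size, int num_aggs, int *out)
{
    int distance = pset_size / num_aggs;
    for (int j = 0; j < num_aggs; j++)
        out[j] = base + j * distance + 1;
}
```

The real code additionally replaces any pick that collides with the bridge rank by the last rank in the pset, a case this sketch omits.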
diff --git a/src/mpi/romio/adio/ad_bg/ad_bg_pset.c b/src/mpi/romio/adio/ad_bg/ad_bg_pset.c
index 14c5ebc..b5d9026 100644
--- a/src/mpi/romio/adio/ad_bg/ad_bg_pset.c
+++ b/src/mpi/romio/adio/ad_bg/ad_bg_pset.c
@@ -112,7 +112,7 @@ ADIOI_BG_persInfo_init(ADIOI_BG_ConfInfo_t *conf,
       conf->cpuIDsize = hw.ppn;
       /*conf->virtualPsetSize = conf->ioMaxSize * conf->cpuIDsize;*/
       conf->nAggrs = 1;
-      conf->aggRatio = 1. * conf->nAggrs / conf->ioMaxSize /*virtualPsetSize*/;
+      conf->aggRatio = 1. * conf->nAggrs / conf->ioMinSize /*virtualPsetSize*/;
       if(conf->aggRatio > 1) conf->aggRatio = 1.;
       TRACE_ERR("I am (single) Bridge rank\n");
       return;
@@ -194,7 +194,7 @@ ADIOI_BG_persInfo_init(ADIOI_BG_ConfInfo_t *conf,
          if(countPset < mincompute)
             mincompute = countPset;
 
-         /* Is this my bridge? */
+         /* Was this my bridge we finished? */
          if(tempCoords == bridgeCoords)
          {
             /* Am I the bridge rank? */
@@ -208,6 +208,7 @@ ADIOI_BG_persInfo_init(ADIOI_BG_ConfInfo_t *conf,
             proc->myIOSize = countPset;
             proc->ioNodeIndex = bridgeIndex;
          }
+         /* Setup next bridge */
          tempCoords = bridges[i].bridgeCoord & ~1;
          tempRank   = bridges[i].rank;
          bridgeIndex++;
@@ -226,7 +227,7 @@ ADIOI_BG_persInfo_init(ADIOI_BG_ConfInfo_t *conf,
    if(countPset < mincompute)
       mincompute = countPset;
 
-   /* Is this my bridge? */
+   /* Was this my bridge? */
    if(tempCoords == bridgeCoords)
    {
       /* Am I the bridge rank? */
@@ -252,15 +253,17 @@ ADIOI_BG_persInfo_init(ADIOI_BG_ConfInfo_t *conf,
             
       conf->nAggrs = n_aggrs;
       /*    First pass gets nAggrs = -1 */
-      if(conf->nAggrs <=0 || 
-         MIN(conf->nProcs, conf->ioMaxSize /*virtualPsetSize*/) < conf->nAggrs) 
+      if(conf->nAggrs <=0) 
          conf->nAggrs = ADIOI_BG_NAGG_PSET_DFLT;
-      if(conf->nAggrs > conf->numBridgeRanks) /* maybe? * conf->cpuIDsize) */
-         conf->nAggrs = conf->numBridgeRanks; /* * conf->cpuIDsize; */
-   
-      conf->aggRatio = 1. * conf->nAggrs / conf->ioMaxSize /*virtualPsetSize*/;
-      if(conf->aggRatio > 1) conf->aggRatio = 1.;
-      TRACE_ERR("Maximum ranks under a bridge rank: %d, minimum: %d, nAggrs: %d, vps: %d, numBridgeRanks: %d pset dflt: %d naggrs: %d ratio: %f\n", maxcompute, mincompute, conf->nAggrs, conf->ioMaxSize /*virtualPsetSize*/, conf->numBridgeRanks, ADIOI_BG_NAGG_PSET_DFLT, conf->nAggrs, conf->aggRatio);
+      if(conf->ioMinSize <= conf->nAggrs) 
+        conf->nAggrs = MAX(1,conf->ioMinSize-1); /* not including bridge itself */
+/*      if(conf->nAggrs > conf->numBridgeRanks) 
+         conf->nAggrs = conf->numBridgeRanks; 
+*/
+      conf->aggRatio = 1. * conf->nAggrs / conf->ioMinSize /*virtualPsetSize*/;
+/*    if(conf->aggRatio > 1) conf->aggRatio = 1.; */
+      TRACE_ERR("n_aggrs %zd, conf->nProcs %zu, conf->ioMaxSize %zu, ADIOI_BG_NAGG_PSET_DFLT %zu,conf->numBridgeRanks %zu,conf->nAggrs %zu\n",(size_t)n_aggrs, (size_t)conf->nProcs, (size_t)conf->ioMaxSize, (size_t)ADIOI_BG_NAGG_PSET_DFLT,(size_t)conf->numBridgeRanks,(size_t)conf->nAggrs);
+      TRACE_ERR("Maximum ranks under a bridge rank: %d, minimum: %d, nAggrs: %d, numBridgeRanks: %d pset dflt: %d naggrs: %d ratio: %f\n", maxcompute, mincompute, conf->nAggrs, conf->numBridgeRanks, ADIOI_BG_NAGG_PSET_DFLT, conf->nAggrs, conf->aggRatio);
    }
 
    ADIOI_BG_assert((bridgerank != -1));
diff --git a/src/mpi/romio/adio/ad_bglockless/ad_bglockless_features.c b/src/mpi/romio/adio/ad_bglockless/ad_bglockless_features.c
index 4153c5e..5e78f80 100644
--- a/src/mpi/romio/adio/ad_bglockless/ad_bglockless_features.c
+++ b/src/mpi/romio/adio/ad_bglockless/ad_bglockless_features.c
@@ -1,3 +1,22 @@
+/* begin_generated_IBM_copyright_prolog                             */
+/*                                                                  */
+/* This is an automatically generated copyright prolog.             */
+/* After initializing,  DO NOT MODIFY OR MOVE                       */
+/*  --------------------------------------------------------------- */
+/*                                                                  */
+/* Licensed Materials - Property of IBM                             */
+/* Blue Gene/Q                                                      */
+/* (C) Copyright IBM Corp.  2011, 2012                              */
+/* US Government Users Restricted Rights - Use, duplication or      */      
+/*   disclosure restricted by GSA ADP Schedule Contract with IBM    */
+/*   Corp.                                                          */
+/*                                                                  */
+/* This software is available to you under the Eclipse Public       */
+/* License (EPL).                                                   */
+/*                                                                  */
+/*  --------------------------------------------------------------- */
+/*                                                                  */
+/* end_generated_IBM_copyright_prolog                               */
 #include "adio.h"
 
 int ADIOI_BGLOCKLESS_Feature(ADIO_File fd, int flag)
diff --git a/src/mpi/romio/adio/common/ad_get_sh_fp.c b/src/mpi/romio/adio/common/ad_get_sh_fp.c
index 2a6bc5b..dcadb0d 100644
--- a/src/mpi/romio/adio/common/ad_get_sh_fp.c
+++ b/src/mpi/romio/adio/common/ad_get_sh_fp.c
@@ -42,6 +42,14 @@ void ADIO_Get_shared_fp(ADIO_File fd, int incr, ADIO_Offset *shared_fp,
 	return;
     }
 #endif
+#ifdef ROMIO_BG
+    /* BGLOCKLESS won't support shared fp */
+    if (fd->file_system == ADIO_BG) {
+	ADIOI_BG_Get_shared_fp(fd, incr, shared_fp, error_code);
+	return;
+    }
+#endif
+
 
     if (fd->shared_fp_fd == ADIO_FILE_NULL) {
 	MPI_Comm_dup(MPI_COMM_SELF, &dupcommself);
diff --git a/src/mpi/romio/adio/common/ad_set_sh_fp.c b/src/mpi/romio/adio/common/ad_set_sh_fp.c
index 2787b3e..ba6affd 100644
--- a/src/mpi/romio/adio/common/ad_set_sh_fp.c
+++ b/src/mpi/romio/adio/common/ad_set_sh_fp.c
@@ -33,6 +33,13 @@ void ADIO_Set_shared_fp(ADIO_File fd, ADIO_Offset offset, int *error_code)
 	return;
     }
 #endif
+#ifdef ROMIO_BG
+    /* BGLOCKLESS won't support shared fp */
+    if (fd->file_system == ADIO_BG) {
+	ADIOI_BG_Set_shared_fp(fd, offset, error_code);
+	return;
+    }
+#endif
 
     if (fd->shared_fp_fd == ADIO_FILE_NULL) {
 	MPI_Comm_dup(MPI_COMM_SELF, &dupcommself);
diff --git a/src/mpi/romio/adio/common/lock.c b/src/mpi/romio/adio/common/lock.c
index d064ede..2590d77 100644
--- a/src/mpi/romio/adio/common/lock.c
+++ b/src/mpi/romio/adio/common/lock.c
@@ -153,7 +153,7 @@ int ADIOI_Set_lock(FDTYPE fd, int cmd, int type, ADIO_Offset offset, int whence,
     if (err && (errno != EBADF)) {
 	/* FIXME: This should use the error message system, 
 	   especially for MPICH */
-	FPRINTF(stderr, "File locking failed in ADIOI_Set_lock(fd %X,cmd %s/%X,type %s/%X,whence %X) with return value %X and errno %X.\n"
+	FPRINTF(stderr, "This requires fcntl(2) to be implemented. As of 8/25/2011 it is not. Generic MPICH Message: File locking failed in ADIOI_Set_lock(fd %X,cmd %s/%X,type %s/%X,whence %X) with return value %X and errno %X.\n"
                   "- If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).\n"
                   "- If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option.\n",
           fd,
diff --git a/src/mpi/romio/adio/include/adio.h b/src/mpi/romio/adio/include/adio.h
index 4d4acf9..067641f 100644
--- a/src/mpi/romio/adio/include/adio.h
+++ b/src/mpi/romio/adio/include/adio.h
@@ -293,6 +293,7 @@ typedef struct {
 #define ADIO_BGL                 164   /* IBM BGL */
 #define ADIO_BGLOCKLESS          165   /* IBM BGL (lock-free) */
 #define ADIO_ZOIDFS              167   /* ZoidFS: the I/O forwarding fs */
+#define ADIO_BG                  168
 
 #define ADIO_SEEK_SET            SEEK_SET
 #define ADIO_SEEK_CUR            SEEK_CUR
diff --git a/src/mpi/romio/adio/include/adioi_fs_proto.h b/src/mpi/romio/adio/include/adioi_fs_proto.h
index d28c123..65f0183 100644
--- a/src/mpi/romio/adio/include/adioi_fs_proto.h
+++ b/src/mpi/romio/adio/include/adioi_fs_proto.h
@@ -79,6 +79,11 @@ extern struct ADIOI_Fns_struct ADIO_BGL_operations;
 /* prototypes are in adio/ad_bgl/ad_bgl.h */
 #endif
 
+#ifdef ROMIO_BG
+extern struct ADIOI_Fns_struct ADIO_BG_operations;
+/* prototypes are in adio/ad_bg/ad_bg.h */
+#endif
+
 #ifdef ROMIO_BGLOCKLESS
 extern struct ADIOI_Fns_struct ADIO_BGLOCKLESS_operations;
 /* no extra prototypes for this fs at this time */
diff --git a/src/mpi/romio/configure.ac b/src/mpi/romio/configure.ac
index 2ee57b0..4530838 100644
--- a/src/mpi/romio/configure.ac
+++ b/src/mpi/romio/configure.ac
@@ -1171,15 +1171,15 @@ if test -n "$file_system_bg"; then
     AC_DEFINE(ROMIO_BG,1,[Define for ROMIO with BG])
 fi
 if test -n "$file_system_bglockless"; then
-    if test x"$file_system_bgl" != x; then
+    if test -n "$file_system_bgl"; then
         AC_DEFINE(ROMIO_BGLOCKLESS,1,[Define for lock-free ROMIO with BGL])
     fi
 
-    if test x"$file_system_bg" != x; then
+    if test -n "$file_system_bg"; then
         AC_DEFINE(ROMIO_BGLOCKLESS,1,[Define for lock-free ROMIO with BG])
     fi
 
-    if test x"$ROMIO_BGLOCKLESS" -ne x1; then
+    if test -n "$ROMIO_BGLOCKLESS"; then
         AC_MSG_ERROR("bglockless requested without [bgl|bg]")
     fi
 fi

http://git.mpich.org/mpich.git/commitdiff/973957e7d4637d1b713e1db00e2aba25a12458a6

commit 973957e7d4637d1b713e1db00e2aba25a12458a6
Author: Dave Goodell <goodell at mcs.anl.gov>
Date:   Fri Apr 12 15:41:33 2013 -0500

    initialize mpi_errno in `MPIC_` routines
    
    Otherwise they could be used uninitialized in the `MPIU_ERR_` macros.
    Followup to [3fdf0887].
    
    No reviewer.
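    (Editor's note: a minimal stand-alone sketch of the hazard this commit closes.
    The names `finish`, `MY_SUCCESS`, and `do_recv_fixed` are illustrative stand-ins,
    not MPICH symbols; the point is that an `MPIU_ERR_`-style macro reads the error
    variable on every exit path, so leaving it uninitialized on the success path is
    undefined behavior.)

```c
#include <assert.h>

#define MY_SUCCESS 0

/* Hypothetical stand-in for an MPIU_ERR_-style macro: it reads err
 * on every exit path, even when no failure occurred.  If err was
 * never assigned, that read is undefined behavior. */
static int finish(int err) { return err; }

/* Before the fix, 'int mpi_errno;' left the value indeterminate on
 * the success path.  After the fix, it starts at MY_SUCCESS. */
static int do_recv_fixed(int fail)
{
    int mpi_errno = MY_SUCCESS;   /* the one-line fix */
    if (fail)
        mpi_errno = 1;            /* error path sets it explicitly */
    return finish(mpi_errno);     /* success path now reads a defined value */
}
```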

diff --git a/src/mpi/coll/helper_fns.c b/src/mpi/coll/helper_fns.c
index 74b4b01..5e3f2c2 100644
--- a/src/mpi/coll/helper_fns.c
+++ b/src/mpi/coll/helper_fns.c
@@ -88,7 +88,7 @@ int MPIC_Send(const void *buf, int count, MPI_Datatype datatype, int dest, int t
 int MPIC_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
 	     MPI_Comm comm, MPI_Status *status)
 {
-    int mpi_errno, context_id;
+    int mpi_errno = MPI_SUCCESS, context_id;
     MPID_Request *request_ptr=NULL;
     MPID_Comm *comm_ptr = NULL;
     MPIDI_STATE_DECL(MPID_STATE_MPIC_RECV);
@@ -136,7 +136,7 @@ int MPIC_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
 int MPIC_Ssend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag,
                MPI_Comm comm)
 {
-    int mpi_errno, context_id;
+    int mpi_errno = MPI_SUCCESS, context_id;
     MPID_Request *request_ptr=NULL;
     MPID_Comm *comm_ptr=NULL;
     MPIDI_STATE_DECL(MPID_STATE_MPIC_SSEND);
@@ -180,7 +180,7 @@ int MPIC_Sendrecv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                   MPI_Comm comm, MPI_Status *status) 
 {
     MPID_Request *recv_req_ptr=NULL, *send_req_ptr=NULL;
-    int mpi_errno, context_id;
+    int mpi_errno = MPI_SUCCESS, context_id;
     MPID_Comm *comm_ptr = NULL;
     MPIDI_STATE_DECL(MPID_STATE_MPIC_SENDRECV);
 
@@ -476,7 +476,7 @@ int MPIR_Localcopy(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
 int MPIC_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag,
               MPI_Comm comm, MPI_Request *request)
 {
-    int mpi_errno, context_id;
+    int mpi_errno = MPI_SUCCESS, context_id;
     MPID_Request *request_ptr=NULL;
     MPID_Comm *comm_ptr=NULL;
     MPIDI_STATE_DECL(MPID_STATE_MPIC_ISEND);
@@ -511,7 +511,7 @@ int MPIC_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int
 int MPIC_Irecv(void *buf, int count, MPI_Datatype datatype, int
                source, int tag, MPI_Comm comm, MPI_Request *request)
 {
-    int mpi_errno, context_id;
+    int mpi_errno = MPI_SUCCESS, context_id;
     MPID_Request *request_ptr=NULL;
     MPID_Comm *comm_ptr = NULL;
     MPIDI_STATE_DECL(MPID_STATE_MPIC_IRECV);

http://git.mpich.org/mpich.git/commitdiff/3fdf08871a0a69b4b7d4953a4acf9316b88cc677

commit 3fdf08871a0a69b4b7d4953a4acf9316b88cc677
Author: Qi QC Zhang <keirazhang at cn.ibm.com>
Date:   Fri Sep 28 01:07:38 2012 -0400

    fix uninitialized `mpi_errno`
    
    Created by squashing two mpich-ibm.git commits together (533f660f and
    acb6d143).  Original subject was:
    
    "pami coredump at _lapi_shm_amsend"
    
    merged IBM breadcrumbs:
    (ibm) D180594
    (ibm) fe8f99116561f407c0e5e39e2f7b3354537e9279
    (ibm) 88f19b240b90983f8c6f99b273df996839dfecf4
    (ibm) 170b04ee98ad1706da2723d7cbf538711945f402
    
    No reviewer.

diff --git a/src/mpi/coll/helper_fns.c b/src/mpi/coll/helper_fns.c
index 60741a0..74b4b01 100644
--- a/src/mpi/coll/helper_fns.c
+++ b/src/mpi/coll/helper_fns.c
@@ -47,7 +47,7 @@ int MPIC_Probe(int source, int tag, MPI_Comm comm, MPI_Status *status)
 int MPIC_Send(const void *buf, int count, MPI_Datatype datatype, int dest, int tag,
               MPI_Comm comm)
 {
-    int mpi_errno, context_id;
+    int context_id, mpi_errno = MPI_SUCCESS;
     MPID_Request *request_ptr=NULL;
     MPID_Comm *comm_ptr=NULL;
     MPIDI_STATE_DECL(MPID_STATE_MPIC_SEND);

http://git.mpich.org/mpich.git/commitdiff/873180629a97fa154482e85445ae8227749ce386

commit 873180629a97fa154482e85445ae8227749ce386
Author: Dave Goodell <goodell at mcs.anl.gov>
Date:   Fri Apr 12 14:08:11 2013 -0500

    attr: check for handle allocation error
    
    ANL-preferred alternative to mpich-ibm.git:[c47282a2].
    
    No reviewer.
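    (Editor's note: a small sketch of the failure mode the added assertion guards.
    `handle_obj_alloc` and `attr_alloc` are hypothetical stand-ins for
    `MPIU_Handle_obj_alloc` / `MPID_Attr_alloc`; the real pooled allocator can
    return NULL on exhaustion, and the old code wrote through the pointer
    unchecked in `MPIU_Object_set_ref`.)

```c
#include <assert.h>
#include <stdlib.h>

struct attribute { int ref; };

/* Stand-in for a pooled handle allocator that may return NULL when
 * the pool is exhausted. */
static struct attribute *handle_obj_alloc(int exhausted)
{
    return exhausted ? NULL : malloc(sizeof(struct attribute));
}

static struct attribute *attr_alloc(void)
{
    struct attribute *attr = handle_obj_alloc(0);
    assert(attr != NULL);   /* the added check: fail fast here, not on
                             * the later write through attr */
    attr->ref = 0;          /* analogous to MPIU_Object_set_ref(attr, 0) */
    return attr;
}
```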

diff --git a/src/mpi/attr/attrutil.c b/src/mpi/attr/attrutil.c
index 7b51cef..f04c1f7 100644
--- a/src/mpi/attr/attrutil.c
+++ b/src/mpi/attr/attrutil.c
@@ -50,6 +50,7 @@ MPID_Attribute *MPID_Attr_alloc(void)
     MPID_Attribute *attr = (MPID_Attribute *)MPIU_Handle_obj_alloc(&MPID_Attr_mem);
     /* attributes don't have refcount semantics, but let's keep valgrind and
      * the debug logging pacified */
+    MPIU_Assert(attr != NULL);
     MPIU_Object_set_ref(attr, 0);
     return attr;
 }

http://git.mpich.org/mpich.git/commitdiff/8b6ceadb7eb56792cc3bd8d432bede1ad33d8fd3

commit 8b6ceadb7eb56792cc3bd8d432bede1ad33d8fd3
Author: Haizhu Liu <haizhu at us.ibm.com>
Date:   Mon Mar 11 09:14:28 2013 -0400

    MPI_Win_unlock could complete too soon, before the remote side receives the control message
    
    (ibm) D188500
    (ibm) 17bab8b8a6410d0df5b280dbed974dd80822573e
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>
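    (Editor's note: a single-threaded simulation of the handshake this commit
    introduces; the struct and function names are illustrative, not the PAMI API.
    The old code waited on a local `done` flag and then cleared
    `remote.locked` itself, so unlock could return before the target saw the
    message.  The new code spins in the progress loop until a remote-completion
    callback clears the flag.)

```c
#include <assert.h>

/* Minimal model of win->mpid.sync.lock.remote */
struct win_sync { int remote_locked; };

/* Stands in for MPIDI_WinUnlockDoneCB: fired only once the target has
 * actually received the unlock control message. */
static void unlock_done_cb(void *cookie)
{
    ((struct win_sync *)cookie)->remote_locked = 0;
}

/* Stands in for MPID_PROGRESS_WAIT_WHILE(sync->lock.remote.locked):
 * here "making progress" just means delivering the pending callback. */
static void progress_wait(struct win_sync *sync,
                          void (*pending_cb)(void *), void *cookie)
{
    while (sync->remote_locked)
        pending_cb(cookie);      /* remote completion eventually arrives */
}

static int run_unlock(void)
{
    struct win_sync sync = { 1 };            /* lock held remotely */
    progress_wait(&sync, unlock_done_cb, &sync);
    return sync.remote_locked;               /* 0: remote side confirmed */
}
```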

diff --git a/src/mpid/pamid/src/onesided/mpid_win_lock.c b/src/mpid/pamid/src/onesided/mpid_win_lock.c
index af2c14a..f2a9c5f 100644
--- a/src/mpid/pamid/src/onesided/mpid_win_lock.c
+++ b/src/mpid/pamid/src/onesided/mpid_win_lock.c
@@ -215,8 +215,7 @@ MPID_Win_unlock(int       rank,
   .win  = win,
   };
   MPIDI_Context_post(MPIDI_Context[0], &info.work, MPIDI_WinUnlock_post, &info);
-  MPID_PROGRESS_WAIT_WHILE(!info.done);
-  sync->lock.remote.locked = 0;
+  MPID_PROGRESS_WAIT_WHILE(sync->lock.remote.locked);
 
   if(win->mpid.sync.target_epoch_type == MPID_EPOTYPE_REFENCE)
   {
diff --git a/src/mpid/pamid/src/onesided/mpidi_onesided.h b/src/mpid/pamid/src/onesided/mpidi_onesided.h
index 2697eb4..d2f7024 100644
--- a/src/mpid/pamid/src/onesided/mpidi_onesided.h
+++ b/src/mpid/pamid/src/onesided/mpidi_onesided.h
@@ -168,6 +168,10 @@ void
 MPIDI_Win_DoneCB(pami_context_t  context,
                  void          * cookie,
                  pami_result_t   result);
+void
+MPIDI_WinUnlockDoneCB(pami_context_t  context,
+                 void          * cookie,
+                 pami_result_t   result);
 
 void
 MPIDI_WinAccumCB(pami_context_t    context,
diff --git a/src/mpid/pamid/src/onesided/mpidi_win_control.c b/src/mpid/pamid/src/onesided/mpidi_win_control.c
index 532adde..2e3984b 100644
--- a/src/mpid/pamid/src/onesided/mpidi_win_control.c
+++ b/src/mpid/pamid/src/onesided/mpidi_win_control.c
@@ -35,21 +35,49 @@ MPIDI_WinCtrlSend(pami_context_t       context,
   rc = PAMI_Endpoint_create(MPIDI_Client, peer, 0, &dest);
   MPID_assert(rc == PAMI_SUCCESS);
 
-  pami_send_immediate_t params = {
-    .dispatch = MPIDI_Protocols_WinCtrl,
-    .dest     = dest,
-    .header   = {
-      .iov_base = control,
-      .iov_len  = sizeof(MPIDI_Win_control_t),
-    },
-    .data     = {
-      .iov_base = NULL,
-      .iov_len  = 0,
-    },
-  };
-
-  rc = PAMI_Send_immediate(context, &params);
+  if(control->type == MPIDI_WIN_MSGTYPE_UNLOCK) {
+    pami_send_t params = {
+      .send   = {
+        .dispatch = MPIDI_Protocols_WinCtrl,
+        .dest     = dest,
+        .header   = {
+          .iov_base = control,
+          .iov_len  = sizeof(MPIDI_Win_control_t),
+        },
+      },
+      .events = {
+        .cookie   = win,
+        .local_fn = NULL,
+        .remote_fn= MPIDI_WinUnlockDoneCB,
+      },
+    };
+    rc = PAMI_Send(context, &params);
+  } else {
+    pami_send_immediate_t params = {
+      .dispatch = MPIDI_Protocols_WinCtrl,
+      .dest     = dest,
+      .header   = {
+        .iov_base = control,
+        .iov_len  = sizeof(MPIDI_Win_control_t),
+      },
+      .data     = {
+        .iov_base = NULL,
+        .iov_len  = 0,
+      },
+    };
+    rc = PAMI_Send_immediate(context, &params);
+  }
   MPID_assert(rc == PAMI_SUCCESS);
+
+}
+
+void
+MPIDI_WinUnlockDoneCB(pami_context_t   context,
+                      void           * cookie,
+                      pami_result_t    result)
+{
+  MPID_Win *win = (MPID_Win *)cookie;
+  win->mpid.sync.lock.remote.locked = 0;
 }
 
 

http://git.mpich.org/mpich.git/commitdiff/ec2b9406b27a619000be4d8e255dd4d80fe1bf3a

commit ec2b9406b27a619000be4d8e255dd4d80fe1bf3a
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Thu Mar 7 09:23:05 2013 -0600

    Implement a simple PAMID_NUMREQUESTS for async flow control
    
    (ibm) Issue 9136
    (ibm) 88839e74ae10101f237899722cb622c8f83076e5
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>
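    (Editor's note: a sketch of the flow-control pattern the hunks below repeat in
    each collective.  Instead of a barrier after every async collective, a
    per-communicator counter is decremented and a barrier is issued only when it
    hits zero, then the counter is reset to the configured window.  The names
    here are illustrative stand-ins for `comm_ptr->mpid.num_requests`,
    `MPIDI_Process.optimized.num_requests`, and `MPIDO_Barrier`.)

```c
#include <assert.h>

struct comm { unsigned num_requests; };

static unsigned barriers_called;

static void barrier(void) { barriers_called++; }

/* Called once per async collective; window_default plays the role of
 * MPIDI_Process.optimized.num_requests (settable via PAMID_NUMREQUESTS). */
static void after_async_collective(struct comm *c, unsigned window_default)
{
    if (!(--c->num_requests)) {
        c->num_requests = window_default;  /* reset the window */
        barrier();                         /* flush unexpected data */
    }
}

/* Run 'collectives' async collectives with the given window and report
 * how many flow-control barriers were needed. */
static unsigned run(unsigned window, unsigned collectives)
{
    struct comm c = { window };
    unsigned i;
    barriers_called = 0;
    for (i = 0; i < collectives; i++)
        after_async_collective(&c, window);
    return barriers_called;
}
```

    With the default window of 1 this degenerates to the old barrier-every-time
    behavior; larger windows trade memory for fewer synchronizations.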

diff --git a/src/mpid/pamid/include/mpidi_datatypes.h b/src/mpid/pamid/include/mpidi_datatypes.h
index 08df29f..3dd8b9a 100644
--- a/src/mpid/pamid/include/mpidi_datatypes.h
+++ b/src/mpid/pamid/include/mpidi_datatypes.h
@@ -114,6 +114,7 @@ typedef struct
     unsigned select_colls;      /**< Enable collective selection */
     unsigned auto_select_colls; /**< Enable automatic collective selection */
     unsigned memory;            /**< Enable memory optimized subcomm's */
+    unsigned num_requests;      /**< Number of requests between flow control barriers */
   }
   optimized;
 
@@ -312,7 +313,7 @@ struct MPIDI_Comm
   char allgathervs[4];
   char scattervs[2];
   char optgather, optscatter, optreduce;
-
+  unsigned num_requests;
   /* These need to be freed at geom destroy, so we need to store them
    * inside the communicator struct until destroy time rather than
    * allocating pointers on the stack
diff --git a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
index c2230d7..a63c91f 100644
--- a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
+++ b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
@@ -497,12 +497,13 @@ MPIDO_Allgather(const void *sendbuf,
                        recvbuf, recvcount, recvtype,
                        comm_ptr, mpierrno);
          }
-         if(my_md->check_correct.values.asyncflowctl) 
-         { /* need better flow control than a barrier every time */
+         if(my_md->check_correct.values.asyncflowctl && !(--(comm_ptr->mpid.num_requests))) 
+         { 
+           comm_ptr->mpid.num_requests = MPIDI_Process.optimized.num_requests;
            int tmpmpierrno;   
            if(unlikely(verbose))
              fprintf(stderr,"Query barrier required for %s\n", my_md->name);
-           MPIR_Barrier(comm_ptr, &tmpmpierrno);
+           MPIDO_Barrier(comm_ptr, &tmpmpierrno);
          }
       }
 
diff --git a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
index b674bec..b01f4db 100644
--- a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
+++ b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
@@ -522,12 +522,13 @@ MPIDO_Allgatherv(const void *sendbuf,
                                   recvbuf, recvcounts, displs, recvtype,
                                   comm_ptr, mpierrno);
          }
-         if(my_md->check_correct.values.asyncflowctl) 
-         { /* need better flow control than a barrier every time */
+         if(my_md->check_correct.values.asyncflowctl && !(--(comm_ptr->mpid.num_requests))) 
+         { 
+           comm_ptr->mpid.num_requests = MPIDI_Process.optimized.num_requests;
            int tmpmpierrno;   
            if(unlikely(verbose))
              fprintf(stderr,"Query barrier required for %s\n", my_md->name);
-           MPIR_Barrier(comm_ptr, &tmpmpierrno);
+           MPIDO_Barrier(comm_ptr, &tmpmpierrno);
          }
       }
 
diff --git a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
index b2e4848..5b33afb 100644
--- a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
+++ b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
@@ -179,12 +179,13 @@ int MPIDO_Alltoall(const void *sendbuf,
                                    recvbuf, recvcount, recvtype,
                                    comm_ptr, mpierrno);
       }
-      if(my_md->check_correct.values.asyncflowctl) 
-      { /* need better flow control than a barrier every time */
+      if(my_md->check_correct.values.asyncflowctl && !(--(comm_ptr->mpid.num_requests))) 
+      { 
+         comm_ptr->mpid.num_requests = MPIDI_Process.optimized.num_requests;
          int tmpmpierrno;   
          if(unlikely(verbose))
             fprintf(stderr,"Query barrier required for %s\n", my_md->name);
-         MPIR_Barrier(comm_ptr, &tmpmpierrno);
+         MPIDO_Barrier(comm_ptr, &tmpmpierrno);
       }
    }
 
diff --git a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
index ab4c30e..87a6df1 100644
--- a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
+++ b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
@@ -177,12 +177,13 @@ int MPIDO_Alltoallv(const void *sendbuf,
                               recvbuf, recvcounts, recvdispls, recvtype,
                               comm_ptr, mpierrno);
       }
-      if(my_md->check_correct.values.asyncflowctl) 
-      { /* need better flow control than a barrier every time */
+      if(my_md->check_correct.values.asyncflowctl && !(--(comm_ptr->mpid.num_requests))) 
+      { 
+         comm_ptr->mpid.num_requests = MPIDI_Process.optimized.num_requests;
          int tmpmpierrno;   
          if(unlikely(verbose))
             fprintf(stderr,"Query barrier required for %s\n", pname);
-         MPIR_Barrier(comm_ptr, &tmpmpierrno);
+         MPIDO_Barrier(comm_ptr, &tmpmpierrno);
       }
    }
 
diff --git a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
index 29ec76a..e86696e 100644
--- a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
+++ b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
@@ -229,12 +229,13 @@ int MPIDO_Bcast(void *buffer,
          MPIDI_Update_last_algorithm(comm_ptr,"BCAST_MPICH");
          return MPIR_Bcast_intra(buffer, count, datatype, root, comm_ptr, mpierrno);
       }
-      if(my_md->check_correct.values.asyncflowctl) 
-      { /* need better flow control than a barrier every time */
+      if(my_md->check_correct.values.asyncflowctl && !(--(comm_ptr->mpid.num_requests))) 
+      { 
+         comm_ptr->mpid.num_requests = MPIDI_Process.optimized.num_requests;
          int tmpmpierrno;   
          if(unlikely(verbose))
             fprintf(stderr,"Query barrier required for %s\n", my_md->name);
-         MPIR_Barrier(comm_ptr, &tmpmpierrno);
+         MPIDO_Barrier(comm_ptr, &tmpmpierrno);
       }
    }
 
diff --git a/src/mpid/pamid/src/coll/gather/mpido_gather.c b/src/mpid/pamid/src/coll/gather/mpido_gather.c
index d48818f..acdca25 100644
--- a/src/mpid/pamid/src/coll/gather/mpido_gather.c
+++ b/src/mpid/pamid/src/coll/gather/mpido_gather.c
@@ -328,12 +328,13 @@ int MPIDO_Gather(const void *sendbuf,
                            recvbuf, recvcount, recvtype,
                            root, comm_ptr, mpierrno);
       }
-      if(my_md->check_correct.values.asyncflowctl) 
-      { /* need better flow control than a barrier every time */
+      if(my_md->check_correct.values.asyncflowctl && !(--(comm_ptr->mpid.num_requests))) 
+      { 
+        comm_ptr->mpid.num_requests = MPIDI_Process.optimized.num_requests;
         int tmpmpierrno;   
         if(unlikely(verbose))
           fprintf(stderr,"Query barrier required for %s\n", my_md->name);
-        MPIR_Barrier(comm_ptr, &tmpmpierrno);
+        MPIDO_Barrier(comm_ptr, &tmpmpierrno);
       }
    }
 
diff --git a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
index 42249a5..1645f44 100644
--- a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
+++ b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
@@ -190,12 +190,13 @@ int MPIDO_Gatherv(const void *sendbuf,
                              recvbuf, recvcounts, displs, recvtype,
                              root, comm_ptr, mpierrno);
       }
-      if(my_md->check_correct.values.asyncflowctl) 
-      { /* need better flow control than a barrier every time */
+      if(my_md->check_correct.values.asyncflowctl && !(--(comm_ptr->mpid.num_requests))) 
+      { 
+         comm_ptr->mpid.num_requests = MPIDI_Process.optimized.num_requests;
          int tmpmpierrno;   
          if(unlikely(verbose))
             fprintf(stderr,"Query barrier required for %s\n", my_md->name);
-         MPIR_Barrier(comm_ptr, &tmpmpierrno);
+         MPIDO_Barrier(comm_ptr, &tmpmpierrno);
       }
    }
    
diff --git a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
index 8ab9ce9..45fb05c 100644
--- a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
+++ b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
@@ -192,12 +192,13 @@ int MPIDO_Reduce(const void *sendbuf,
       }  
       else 
       {   
-         if(my_md->check_correct.values.asyncflowctl) 
-         { /* need better flow control than a barrier every time */
+         if(my_md->check_correct.values.asyncflowctl && !(--(comm_ptr->mpid.num_requests))) 
+         { 
+            comm_ptr->mpid.num_requests = MPIDI_Process.optimized.num_requests;
             int tmpmpierrno;   
             if(unlikely(verbose))
                fprintf(stderr,"Query barrier required for %s\n", my_md->name);
-            MPIR_Barrier(comm_ptr, &tmpmpierrno);
+            MPIDO_Barrier(comm_ptr, &tmpmpierrno);
          }
          alg_selected = 1;
       }
diff --git a/src/mpid/pamid/src/coll/scan/mpido_scan.c b/src/mpid/pamid/src/coll/scan/mpido_scan.c
index 1af1d84..f47cb81 100644
--- a/src/mpid/pamid/src/coll/scan/mpido_scan.c
+++ b/src/mpid/pamid/src/coll/scan/mpido_scan.c
@@ -180,12 +180,13 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
          else
             return MPIR_Scan(sendbuf, recvbuf, count, datatype, op, comm_ptr, mpierrno);
       }
-      if(my_md->check_correct.values.asyncflowctl) 
-      { /* need better flow control than a barrier every time */
+      if(my_md->check_correct.values.asyncflowctl && !(--(comm_ptr->mpid.num_requests))) 
+      { 
+         comm_ptr->mpid.num_requests = MPIDI_Process.optimized.num_requests;
          int tmpmpierrno;   
          if(unlikely(verbose))
             fprintf(stderr,"Query barrier required for %s\n", my_md->name);
-         MPIR_Barrier(comm_ptr, &tmpmpierrno);
+         MPIDO_Barrier(comm_ptr, &tmpmpierrno);
       }
    }
    
diff --git a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
index 00eb1e0..082c316 100644
--- a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
+++ b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
@@ -280,12 +280,13 @@ int MPIDO_Scatter(const void *sendbuf,
                             recvbuf, recvcount, recvtype,
                             root, comm_ptr, mpierrno);
       }
-      if(my_md->check_correct.values.asyncflowctl) 
-      { /* need better flow control than a barrier every time */
+      if(my_md->check_correct.values.asyncflowctl && !(--(comm_ptr->mpid.num_requests))) 
+      { 
+        comm_ptr->mpid.num_requests = MPIDI_Process.optimized.num_requests;
         int tmpmpierrno;   
         if(unlikely(verbose))
           fprintf(stderr,"Query barrier required for %s\n", my_md->name);
-        MPIR_Barrier(comm_ptr, &tmpmpierrno);
+        MPIDO_Barrier(comm_ptr, &tmpmpierrno);
       }
    }
 
diff --git a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
index e15acd9..508739a 100644
--- a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
+++ b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
@@ -399,12 +399,13 @@ int MPIDO_Scatterv(const void *sendbuf,
                              recvbuf, recvcount, recvtype,
                              root, comm_ptr, mpierrno);
       }
-      if(my_md->check_correct.values.asyncflowctl) 
-      { /* need better flow control than a barrier every time */
+      if(my_md->check_correct.values.asyncflowctl && !(--(comm_ptr->mpid.num_requests))) 
+      { 
+        comm_ptr->mpid.num_requests = MPIDI_Process.optimized.num_requests;
         int tmpmpierrno;   
         if(unlikely(verbose))
           fprintf(stderr,"Query barrier required for %s\n", my_md->name);
-        MPIR_Barrier(comm_ptr, &tmpmpierrno);
+        MPIDO_Barrier(comm_ptr, &tmpmpierrno);
       }
    }
 
diff --git a/src/mpid/pamid/src/comm/mpid_comm.c b/src/mpid/pamid/src/comm/mpid_comm.c
index 5a73df8..989dc19 100644
--- a/src/mpid/pamid/src/comm/mpid_comm.c
+++ b/src/mpid/pamid/src/comm/mpid_comm.c
@@ -250,6 +250,8 @@ void MPIDI_Coll_comm_create(MPID_Comm *comm)
          return;
       }
    }
+   /* Initialize the async flow control in case it will be used. */
+   comm->mpid.num_requests = MPIDI_Process.optimized.num_requests;
 
    TRACE_ERR("Querying protocols\n");
    /* Determine what protocols are available for this comm/geom */
diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index 4620821..39d54af 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -121,6 +121,7 @@ MPIDI_Process_t  MPIDI_Process = {
     .subcomms            = 1,
     .select_colls        = 2,
     .memory              = 0,
+    .num_requests        = 1,
   },
 
   .mpir_nbc              = 0,
@@ -711,6 +712,7 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
              "  optimized.select_colls: %u\n"
              "  optimized.subcomms    : %u\n"
              "  optimized.memory      : %u\n"
+             "  optimized.num_requests: %u\n"
              "  mpir_nbc              : %u\n" 
              "  numTasks              : %u\n",
              MPIDI_Process.verbose,
@@ -744,6 +746,7 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
              MPIDI_Process.optimized.select_colls,
              MPIDI_Process.optimized.subcomms,
              MPIDI_Process.optimized.memory,
+             MPIDI_Process.optimized.num_requests,
              MPIDI_Process.mpir_nbc, 
              MPIDI_Process.numTasks);
       switch (*threading)
diff --git a/src/mpid/pamid/src/mpidi_env.c b/src/mpid/pamid/src/mpidi_env.c
index 62a048b..3e25f06 100644
--- a/src/mpid/pamid/src/mpidi_env.c
+++ b/src/mpid/pamid/src/mpidi_env.c
@@ -153,13 +153,15 @@
  ***************************************************************************
  *
  * - PAMID_NUMREQUESTS - Sets the number of outstanding asynchronous
- *   broadcasts to have before a barrier is called.  This is mostly
- *   used in allgather/allgatherv using asynchronous broadcasts.
- *   Higher numbers can help on larger partitions and larger
- *   message sizes. This is also used for asynchronous broadcasts.
- *   After every {PAMID_NUMREQUESTS} async bcasts, the "glue" will call
- *   a barrier. See PAMID_BCAST and PAMID_ALLGATHER(V) for more information
- *   - Default is 32.
+ *   collectives to have before a barrier is called.  This is used when
+ *   the PAMI collective metadata indicates 'asyncflowctl' may be needed 
+ *   to avoid 'flooding' other participants with unexpected data. 
+ *   Higher numbers can help on larger partitions and larger message sizes. 
+ * 
+ *   After every {PAMID_NUMREQUESTS} async collectives, the "glue" will call
+ *   a barrier. 
+ *   - Default is 1 (guaranteed functionality) 
 + *   - N>1 may be used to tune performance
  *
  ***************************************************************************
  *                            "Safety" Options                             *
@@ -505,6 +507,13 @@ MPIDI_Env_setup(int rank, int requested)
     ENV_Unsigned(names, &MPIDI_Process.statistics, 1, &found_deprecated_env_var, rank);
   }
 
+  /* Set async flow control - number of collectives between barriers */
+  {
+    char* names[] = {"PAMID_NUMREQUESTS", NULL};
+    ENV_Unsigned(names, &MPIDI_Process.optimized.num_requests, 1, &found_deprecated_env_var, rank);
+    TRACE_ERR("MPIDI_Process.optimized.num_requests=%u\n", MPIDI_Process.optimized.num_requests);
+  }
+
   /* "Globally" set the optimization flag for low-level collectives in geometry creation.
    * This is probably temporary. metadata should set this flag likely.
    */

http://git.mpich.org/mpich.git/commitdiff/c32aeb9e104a375bc909244b51a6acc55fb47f9b

commit c32aeb9e104a375bc909244b51a6acc55fb47f9b
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Wed Mar 6 15:27:23 2013 -0600

    After review: Comment out [v] range checks
    
    (ibm) 9f41db5eb709f092a632cfedb46848b8f82d07f8
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
index 4e818ec..b674bec 100644
--- a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
+++ b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
@@ -484,7 +484,8 @@ MPIDO_Allgatherv(const void *sendbuf,
            /* process metadata bits */
            if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
               result.check.unspecified = 1;
-           if(my_md->check_correct.values.rangeminmax)
+/* Can't check ranges like this.  Non-local.  Comment out for now.
+          if(my_md->check_correct.values.rangeminmax)
            {
              MPI_Aint data_true_lb;
              MPID_Datatype *data_ptr;
@@ -492,7 +493,7 @@ MPIDO_Allgatherv(const void *sendbuf,
              MPIDI_Datatype_get_info(sendcount, sendtype, data_contig, data_size, data_ptr, data_true_lb); 
              if((my_md->range_lo <= data_size) &&
                 (my_md->range_hi >= data_size))
-                ; /* ok, algorithm selected */
+                ; 
              else
              {
                 result.check.range = 1;
@@ -506,6 +507,7 @@ MPIDO_Allgatherv(const void *sendbuf,
                 }
              }
            }
+ */
          }
          else /* calling the check fn is sufficient */
            result = my_md->check_fn(&allgatherv);
diff --git a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
index 8290b2a..ab4c30e 100644
--- a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
+++ b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
@@ -139,7 +139,7 @@ int MPIDO_Alltoallv(const void *sendbuf,
         /* process metadata bits */
          if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
             result.check.unspecified = 1;
-/*
+/* Can't check ranges like this.  Non-local.  Comment out for now.
          if(my_md->check_correct.values.rangeminmax)
          {
             MPI_Aint data_true_lb;
diff --git a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
index 6b7b049..42249a5 100644
--- a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
+++ b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
@@ -152,6 +152,7 @@ int MPIDO_Gatherv(const void *sendbuf,
          /* process metadata bits */
          if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
             result.check.unspecified = 1;
+/* Can't check ranges like this.  Non-local.  Comment out for now.
          if(my_md->check_correct.values.rangeminmax)
          {
             MPI_Aint data_true_lb;
@@ -160,7 +161,7 @@ int MPIDO_Gatherv(const void *sendbuf,
             MPIDI_Datatype_get_info(sendcount, sendtype, data_contig, data_size, data_ptr, data_true_lb); 
             if((my_md->range_lo <= data_size) &&
                (my_md->range_hi >= data_size))
-               ; /* ok, algorithm selected */
+               ; 
             else
             {
                result.check.range = 1;
@@ -174,6 +175,7 @@ int MPIDO_Gatherv(const void *sendbuf,
                }
             }
          }
+ */
       }
       else /* calling the check fn is sufficient */
          result = my_md->check_fn(&gatherv);
diff --git a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
index 06e03de..e15acd9 100644
--- a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
+++ b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
@@ -361,6 +361,7 @@ int MPIDO_Scatterv(const void *sendbuf,
         /* process metadata bits */
         if((!my_md->check_correct.values.inplace) && (recvbuf == MPI_IN_PLACE))
            result.check.unspecified = 1;
+/* Can't check ranges like this.  Non-local.  Comment out for now.
          if(my_md->check_correct.values.rangeminmax)
          {
            MPI_Aint data_true_lb;
@@ -369,7 +370,7 @@ int MPIDO_Scatterv(const void *sendbuf,
            MPIDI_Datatype_get_info(recvcount, recvtype, data_contig, data_size, data_ptr, data_true_lb); 
            if((my_md->range_lo <= data_size) &&
               (my_md->range_hi >= data_size))
-              ; /* ok, algorithm selected */
+              ; 
            else
            {
               result.check.range = 1;
@@ -383,6 +384,7 @@ int MPIDO_Scatterv(const void *sendbuf,
               }
            }
          }
+ */
       }
       else /* calling the check fn is sufficient */
         result = my_md->check_fn(&scatterv);

http://git.mpich.org/mpich.git/commitdiff/09d20b381725981f5c10a7f5e062c521339b1081

commit 09d20b381725981f5c10a7f5e062c521339b1081
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Wed Mar 6 15:22:45 2013 -0600

    After review: Fix scatter range check
    
    (ibm) e3494dedebd6e061a7b957d31662043f35a97ace
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
index aafb0c9..00eb1e0 100644
--- a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
+++ b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
@@ -246,7 +246,10 @@ int MPIDO_Scatter(const void *sendbuf,
            MPI_Aint data_true_lb;
            MPID_Datatype *data_ptr;
            int data_size, data_contig;
-           MPIDI_Datatype_get_info(recvcount, recvtype, data_contig, data_size, data_ptr, data_true_lb); 
+           if(rank == root)
+             MPIDI_Datatype_get_info(sendcount, sendtype, data_contig, data_size, data_ptr, data_true_lb); 
+           else
+             MPIDI_Datatype_get_info(recvcount, recvtype, data_contig, data_size, data_ptr, data_true_lb); 
            if((my_md->range_lo <= data_size) &&
               (my_md->range_hi >= data_size))
               ; /* ok, algorithm selected */

http://git.mpich.org/mpich.git/commitdiff/c95d740c4268268233242c7f3d0c277e2b0abf2d

commit c95d740c4268268233242c7f3d0c277e2b0abf2d
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Wed Mar 6 13:31:12 2013 -0600

    After review: Update comments and range checks
    
    (ibm) 62fe532130c1cc7daf4d4e55de90560fe56f7e99
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
index 2b29704..c2230d7 100644
--- a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
+++ b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
@@ -98,7 +98,7 @@ int MPIDO_Allgather_allreduce(const void *sendbuf,
 
     for(i = 0; i < (send_size/sizeof(int)); ++i) 
       tmpsbuf[i] = (double)sibuf[i];
-    
+    /* Switch to comm->coll_fns->fn() */
     rc = MPIDO_Allreduce(MPI_IN_PLACE,
 			 tmprbuf,
 			 recv_size/sizeof(int),
@@ -124,6 +124,7 @@ int MPIDO_Allgather_allreduce(const void *sendbuf,
   memset(destbuf + send_size, 0, recv_size - (rank + 1) * send_size);
 
   if (sendtype == MPI_DOUBLE && recvtype == MPI_DOUBLE)
+    /* Switch to comm->coll_fns->fn() */
     rc = MPIDO_Allreduce(MPI_IN_PLACE,
 			 startbuf,
 			 recv_size/sizeof(double),
@@ -132,6 +133,7 @@ int MPIDO_Allgather_allreduce(const void *sendbuf,
 			 comm_ptr,
 			 mpierrno);
   else
+    /* Switch to comm->coll_fns->fn() */
     rc = MPIDO_Allreduce(MPI_IN_PLACE,
 			 startbuf,
 			 recv_size/sizeof(int),
@@ -190,7 +192,7 @@ int MPIDO_Allgather_bcast(const void *sendbuf,
   for (i = 0; i < np; i++)
   {
     void *destbuf = recvbuf + i * recvcount * extent;
-    /* TODO: Change to PAMI */
+    /* Switch to comm->coll_fns->fn() */
     rc = MPIDO_Bcast(destbuf,
                      recvcount,
                      recvtype,
@@ -259,6 +261,7 @@ int MPIDO_Allgather_alltoall(const void *sendbuf,
   }
 
 
+  /* Switch to comm->coll_fns->fn() */
   rc = MPIDO_Alltoallv((const void *)a2a_sendbuf,
                        a2a_sendcounts,
                        a2a_senddispls,
diff --git a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
index a289b48..4e818ec 100644
--- a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
+++ b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
@@ -103,6 +103,7 @@ int MPIDO_Allgatherv_allreduce(const void *sendbuf,
     for(i = 0; i < (send_size/sizeof(int)); ++i) 
       tmpsbuf[i] = (double)sibuf[i];
     
+    /* Switch to comm->coll_fns->fn() */
     rc = MPIDO_Allreduce(MPI_IN_PLACE,
 			 tmprbuf,
 			 buffer_sum/sizeof(int),
@@ -135,7 +136,7 @@ int MPIDO_Allgatherv_allreduce(const void *sendbuf,
   memset(startbuf + start, 0, length);
 
   TRACE_ERR("Calling MPIDO_Allreduce from MPIDO_Allgatherv_allreduce\n");
-  /* TODO: Change to PAMI allreduce */
+  /* Switch to comm->coll_fns->fn() */
   rc = MPIDO_Allreduce(MPI_IN_PLACE,
 		       startbuf,
 		       buffer_sum/sizeof(unsigned),
@@ -194,7 +195,7 @@ int MPIDO_Allgatherv_bcast(const void *sendbuf,
   for (i = 0; i < comm_ptr->local_size; i++)
   {
     void *destbuffer = recvbuf + displs[i] * extent;
-    /* TODO: Change to PAMI */
+    /* Switch to comm->coll_fns->fn() */
     rc = MPIDO_Bcast(destbuffer,
                      recvcounts[i],
                      recvtype,
@@ -265,7 +266,7 @@ int MPIDO_Allgatherv_alltoall(const void *sendbuf,
   }
 
    TRACE_ERR("Calling alltoallv in MPIDO_Allgatherv_alltoallv\n");
-   /* TODO: Change to PAMI alltoallv */
+   /* Switch to comm->coll_fns->fn() */
   rc = MPIDO_Alltoallv(a2a_sendbuf,
 		       a2a_sendcounts,
 		       a2a_senddispls,
@@ -483,25 +484,28 @@ MPIDO_Allgatherv(const void *sendbuf,
            /* process metadata bits */
            if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
               result.check.unspecified = 1;
-         MPI_Aint data_true_lb;
-         MPID_Datatype *data_ptr;
-         int data_size, data_contig;
-         MPIDI_Datatype_get_info(sendcount, sendtype, data_contig, data_size, data_ptr, data_true_lb); 
-         if((my_md->range_lo <= data_size) &&
-            (my_md->range_hi >= data_size))
-            ; /* ok, algorithm selected */
-         else
-         {
-            result.check.range = 1;
-            if(unlikely(verbose))
-            {   
-               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
-                       data_size,
-                       my_md->range_lo,
-                       my_md->range_hi,
-                       my_md->name);
-            }
-         }
+           if(my_md->check_correct.values.rangeminmax)
+           {
+             MPI_Aint data_true_lb;
+             MPID_Datatype *data_ptr;
+             int data_size, data_contig;
+             MPIDI_Datatype_get_info(sendcount, sendtype, data_contig, data_size, data_ptr, data_true_lb); 
+             if((my_md->range_lo <= data_size) &&
+                (my_md->range_hi >= data_size))
+                ; /* ok, algorithm selected */
+             else
+             {
+                result.check.range = 1;
+                if(unlikely(verbose))
+                {   
+                   fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                           data_size,
+                           my_md->range_lo,
+                           my_md->range_hi,
+                           my_md->name);
+                }
+             }
+           }
          }
          else /* calling the check fn is sufficient */
            result = my_md->check_fn(&allgatherv);
diff --git a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
index dc796a9..b2e4848 100644
--- a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
+++ b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
@@ -143,23 +143,26 @@ int MPIDO_Alltoall(const void *sendbuf,
         /* process metadata bits */
          if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
             result.check.unspecified = 1;
-         MPI_Aint data_true_lb;
-         MPID_Datatype *data_ptr;
-         int data_size, data_contig;
-         MPIDI_Datatype_get_info(sendcount, sendtype, data_contig, data_size, data_ptr, data_true_lb); 
-         if((my_md->range_lo <= data_size) &&
-            (my_md->range_hi >= data_size))
-            ; /* ok, algorithm selected */
-         else
+         if(my_md->check_correct.values.rangeminmax)
          {
-            result.check.range = 1;
-            if(unlikely(verbose))
-            {   
-               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
-                       data_size,
-                       my_md->range_lo,
-                       my_md->range_hi,
-                       my_md->name);
+            MPI_Aint data_true_lb;
+            MPID_Datatype *data_ptr;
+            int data_size, data_contig;
+            MPIDI_Datatype_get_info(sendcount, sendtype, data_contig, data_size, data_ptr, data_true_lb); 
+            if((my_md->range_lo <= data_size) &&
+               (my_md->range_hi >= data_size))
+               ; /* ok, algorithm selected */
+            else
+            {
+               result.check.range = 1;
+               if(unlikely(verbose))
+               {   
+                  fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                          data_size,
+                          my_md->range_lo,
+                          my_md->range_hi,
+                          my_md->name);
+               }
             }
          }
       }
diff --git a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
index eabc356..8290b2a 100644
--- a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
+++ b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
@@ -139,25 +139,30 @@ int MPIDO_Alltoallv(const void *sendbuf,
         /* process metadata bits */
          if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
             result.check.unspecified = 1;
-/*         MPI_Aint data_true_lb;
-         MPID_Datatype *data_ptr;
-         int data_size, data_contig;
-         MPIDI_Datatype_get_info(??, sendtype, data_contig, data_size, data_ptr, data_true_lb); 
-         if((my_md->range_lo <= data_size) &&
-            (my_md->range_hi >= data_size))
-            ; *//* ok, algorithm selected */
-/*         else
+/*
+         if(my_md->check_correct.values.rangeminmax)
          {
-            result.check.range = 1;
-            if(unlikely(verbose))
-            {   
-               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
-                       data_size,
-                       my_md->range_lo,
-                       my_md->range_hi,
-                       my_md->name);
+            MPI_Aint data_true_lb;
+            MPID_Datatype *data_ptr;
+            int data_size, data_contig;
+            MPIDI_Datatype_get_info(??, sendtype, data_contig, data_size, data_ptr, data_true_lb); 
+            if((my_md->range_lo <= data_size) &&
+               (my_md->range_hi >= data_size))
+               ; 
+            else
+            {
+               result.check.range = 1;
+               if(unlikely(verbose))
+               {   
+                  fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                          data_size,
+                          my_md->range_lo,
+                          my_md->range_hi,
+                          my_md->name);
+               }
             }
-         }*/
+         }
+*/
       }
       else /* calling the check fn is sufficient */
          result = my_md->check_fn(&alltoallv);
diff --git a/src/mpid/pamid/src/coll/gather/mpido_gather.c b/src/mpid/pamid/src/coll/gather/mpido_gather.c
index f44f19d..d48818f 100644
--- a/src/mpid/pamid/src/coll/gather/mpido_gather.c
+++ b/src/mpid/pamid/src/coll/gather/mpido_gather.c
@@ -295,7 +295,8 @@ int MPIDO_Gather(const void *sendbuf,
               (my_md->range_lo <= recv_bytes) &&
               (my_md->range_hi >= recv_bytes)
               ) &&
-             ((my_md->range_lo <= send_bytes) &&
+             ((rank != root) &&
+              (my_md->range_lo <= send_bytes) &&
               (my_md->range_hi >= send_bytes)
               )
              )
diff --git a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
index 7df7210..6b7b049 100644
--- a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
+++ b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
@@ -152,23 +152,26 @@ int MPIDO_Gatherv(const void *sendbuf,
          /* process metadata bits */
          if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
             result.check.unspecified = 1;
-         MPI_Aint data_true_lb;
-         MPID_Datatype *data_ptr;
-         int data_size, data_contig;
-         MPIDI_Datatype_get_info(sendcount, sendtype, data_contig, data_size, data_ptr, data_true_lb); 
-         if((my_md->range_lo <= data_size) &&
-            (my_md->range_hi >= data_size))
-            ; /* ok, algorithm selected */
-         else
+         if(my_md->check_correct.values.rangeminmax)
          {
-            result.check.range = 1;
-            if(unlikely(verbose))
-            {   
-               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
-                       data_size,
-                       my_md->range_lo,
-                       my_md->range_hi,
-                       my_md->name);
+            MPI_Aint data_true_lb;
+            MPID_Datatype *data_ptr;
+            int data_size, data_contig;
+            MPIDI_Datatype_get_info(sendcount, sendtype, data_contig, data_size, data_ptr, data_true_lb); 
+            if((my_md->range_lo <= data_size) &&
+               (my_md->range_hi >= data_size))
+               ; /* ok, algorithm selected */
+            else
+            {
+               result.check.range = 1;
+               if(unlikely(verbose))
+               {   
+                  fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                          data_size,
+                          my_md->range_lo,
+                          my_md->range_hi,
+                          my_md->name);
+               }
             }
          }
       }
diff --git a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
index 79e65f7..8ab9ce9 100644
--- a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
+++ b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
@@ -94,6 +94,7 @@ int MPIDO_Reduce(const void *sendbuf,
       {
          tbuf = destbuf = MPIU_Malloc(tsize);
       }
+      /* Switch to comm->coll_fns->fn() */
       MPIDO_Allreduce(sendbuf,
                       destbuf,
                       count,
@@ -156,23 +157,26 @@ int MPIDO_Reduce(const void *sendbuf,
          /* process metadata bits */
          if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
             result.check.unspecified = 1;
-         MPI_Aint data_true_lb;
-         MPID_Datatype *data_ptr;
-         int data_size, data_contig;
-         MPIDI_Datatype_get_info(count, datatype, data_contig, data_size, data_ptr, data_true_lb); 
-         if((my_md->range_lo <= data_size) &&
-            (my_md->range_hi >= data_size))
-            ; /* ok, algorithm selected */
-         else
+         if(my_md->check_correct.values.rangeminmax)
          {
-            result.check.range = 1;
-            if(unlikely(verbose))
-            {   
-               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
-                       data_size,
-                       my_md->range_lo,
-                       my_md->range_hi,
-                       my_md->name);
+            MPI_Aint data_true_lb;
+            MPID_Datatype *data_ptr;
+            int data_size, data_contig;
+            MPIDI_Datatype_get_info(count, datatype, data_contig, data_size, data_ptr, data_true_lb); 
+            if((my_md->range_lo <= data_size) &&
+               (my_md->range_hi >= data_size))
+               ; /* ok, algorithm selected */
+            else
+            {
+               result.check.range = 1;
+               if(unlikely(verbose))
+               {   
+                  fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                          data_size,
+                          my_md->range_lo,
+                          my_md->range_hi,
+                          my_md->name);
+               }
             }
          }
       }
diff --git a/src/mpid/pamid/src/coll/scan/mpido_scan.c b/src/mpid/pamid/src/coll/scan/mpido_scan.c
index 73e21b1..1af1d84 100644
--- a/src/mpid/pamid/src/coll/scan/mpido_scan.c
+++ b/src/mpid/pamid/src/coll/scan/mpido_scan.c
@@ -142,23 +142,26 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
         /* process metadata bits */
          if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
             result.check.unspecified = 1;
-         MPI_Aint data_true_lb;
-         MPID_Datatype *data_ptr;
-         int data_size, data_contig;
-         MPIDI_Datatype_get_info(count, datatype, data_contig, data_size, data_ptr, data_true_lb); 
-         if((my_md->range_lo <= data_size) &&
-            (my_md->range_hi >= data_size))
-            ; /* ok, algorithm selected */
-         else
+         if(my_md->check_correct.values.rangeminmax)
          {
-            result.check.range = 1;
-            if(unlikely(verbose))
-            {   
-               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
-                       data_size,
-                       my_md->range_lo,
-                       my_md->range_hi,
-                       my_md->name);
+            MPI_Aint data_true_lb;
+            MPID_Datatype *data_ptr;
+            int data_size, data_contig;
+            MPIDI_Datatype_get_info(count, datatype, data_contig, data_size, data_ptr, data_true_lb); 
+            if((my_md->range_lo <= data_size) &&
+               (my_md->range_hi >= data_size))
+               ; /* ok, algorithm selected */
+            else
+            {
+               result.check.range = 1;
+               if(unlikely(verbose))
+               {   
+                  fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                          data_size,
+                          my_md->range_lo,
+                          my_md->range_hi,
+                          my_md->name);
+               }
             }
          }
       }
diff --git a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
index 7b9de8c..aafb0c9 100644
--- a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
+++ b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
@@ -83,7 +83,7 @@ int MPIDO_Scatter_bcast(void * sendbuf,
     }
   }
 
-   /* TODO: Needs to be a PAMI bcast */
+  /* Switch to comm->coll_fns->fn() */
   rc = MPIDO_Bcast(tempbuf, nbytes*size, MPI_CHAR, root, comm_ptr, mpierrno);
 
   if(rank == root && recvbuf == MPI_IN_PLACE)
@@ -241,24 +241,27 @@ int MPIDO_Scatter(const void *sendbuf,
         /* process metadata bits */
         if((!my_md->check_correct.values.inplace) && (recvbuf == MPI_IN_PLACE))
            result.check.unspecified = 1;
-         MPI_Aint data_true_lb;
-         MPID_Datatype *data_ptr;
-         int data_size, data_contig;
-         MPIDI_Datatype_get_info(recvcount, recvtype, data_contig, data_size, data_ptr, data_true_lb); 
-         if((my_md->range_lo <= data_size) &&
-            (my_md->range_hi >= data_size))
-            ; /* ok, algorithm selected */
-         else
+         if(my_md->check_correct.values.rangeminmax)
          {
-            result.check.range = 1;
-            if(unlikely(verbose))
-            {   
-               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
-                       data_size,
-                       my_md->range_lo,
-                       my_md->range_hi,
-                       my_md->name);
-            }
+           MPI_Aint data_true_lb;
+           MPID_Datatype *data_ptr;
+           int data_size, data_contig;
+           MPIDI_Datatype_get_info(recvcount, recvtype, data_contig, data_size, data_ptr, data_true_lb); 
+           if((my_md->range_lo <= data_size) &&
+              (my_md->range_hi >= data_size))
+              ; /* ok, algorithm selected */
+           else
+           {
+              result.check.range = 1;
+              if(unlikely(verbose))
+              {   
+                 fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                         data_size,
+                         my_md->range_lo,
+                         my_md->range_hi,
+                         my_md->name);
+              }
+           }
          }
       }
       else /* calling the check fn is sufficient */
diff --git a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
index 14edf3d..06e03de 100644
--- a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
+++ b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
@@ -68,6 +68,7 @@ int MPIDO_Scatterv_bcast(void *sendbuf,
   else
     tempbuf = sendbuf;
 
+  /* Switch to comm->coll_fns->fn() */
   rc = MPIDO_Bcast(tempbuf, sum, sendtype, root, comm_ptr, mpierrno);
 
   if(rank == root && recvbuf == MPI_IN_PLACE)
@@ -360,24 +361,27 @@ int MPIDO_Scatterv(const void *sendbuf,
         /* process metadata bits */
         if((!my_md->check_correct.values.inplace) && (recvbuf == MPI_IN_PLACE))
            result.check.unspecified = 1;
-         MPI_Aint data_true_lb;
-         MPID_Datatype *data_ptr;
-         int data_size, data_contig;
-         MPIDI_Datatype_get_info(recvcount, recvtype, data_contig, data_size, data_ptr, data_true_lb); 
-         if((my_md->range_lo <= data_size) &&
-            (my_md->range_hi >= data_size))
-            ; /* ok, algorithm selected */
-         else
+         if(my_md->check_correct.values.rangeminmax)
          {
-            result.check.range = 1;
-            if(unlikely(verbose))
-            {   
-               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
-                       data_size,
-                       my_md->range_lo,
-                       my_md->range_hi,
-                       my_md->name);
-            }
+           MPI_Aint data_true_lb;
+           MPID_Datatype *data_ptr;
+           int data_size, data_contig;
+           MPIDI_Datatype_get_info(recvcount, recvtype, data_contig, data_size, data_ptr, data_true_lb); 
+           if((my_md->range_lo <= data_size) &&
+              (my_md->range_hi >= data_size))
+              ; /* ok, algorithm selected */
+           else
+           {
+              result.check.range = 1;
+              if(unlikely(verbose))
+              {   
+                 fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                         data_size,
+                         my_md->range_lo,
+                         my_md->range_hi,
+                         my_md->name);
+              }
+           }
          }
       }
       else /* calling the check fn is sufficient */

http://git.mpich.org/mpich.git/commitdiff/224dfb1bc40328e9e51c52d110aaf971eb2c862f

commit 224dfb1bc40328e9e51c52d110aaf971eb2c862f
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Wed Mar 6 13:14:47 2013 -0600

    After review: Use PAMI_GEOMETRY_NULL
    
    (ibm) aaaa64363a946b0f66727233b73a34069b5dda61
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/comm/mpid_comm.c b/src/mpid/pamid/src/comm/mpid_comm.c
index ba8db3a..5a73df8 100644
--- a/src/mpid/pamid/src/comm/mpid_comm.c
+++ b/src/mpid/pamid/src/comm/mpid_comm.c
@@ -147,7 +147,7 @@ void MPIDI_Coll_comm_create(MPID_Comm *comm)
 
   if(comm->comm_kind != MPID_INTRACOMM) return;
   /* Create a geometry */
-   
+
    if(comm->mpid.geometry != MPIDI_Process.world_geometry)
    {
       if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL)
@@ -198,7 +198,7 @@ void MPIDI_Coll_comm_create(MPID_Comm *comm)
       {
          /* Don't create irregular geometries.  Fallback to MPICH only collectives */
          geom_init = 0;
-         comm->mpid.geometry = NULL;
+         comm->mpid.geometry = PAMI_GEOMETRY_NULL;
       }
       else if(comm->mpid.tasks == NULL)
       {   
@@ -207,7 +207,7 @@ void MPIDI_Coll_comm_create(MPID_Comm *comm)
          geom_post.context_offset = 0; /* TODO BES investigate */
          geom_post.num_configs = numconfigs;
          geom_post.newgeom = &comm->mpid.geometry,
-         geom_post.parent = NULL;
+         geom_post.parent = PAMI_GEOMETRY_NULL;
          geom_post.id     = comm->context_id;
          geom_post.ranges = &comm->mpid.range;
          geom_post.tasks = NULL;;
@@ -226,7 +226,7 @@ void MPIDI_Coll_comm_create(MPID_Comm *comm)
          geom_post.context_offset = 0; /* TODO BES investigate */
          geom_post.num_configs = numconfigs;
          geom_post.newgeom = &comm->mpid.geometry,
-         geom_post.parent = NULL;
+         geom_post.parent = PAMI_GEOMETRY_NULL;
          geom_post.id     = comm->context_id;
          geom_post.ranges = NULL;
          geom_post.tasks = comm->mpid.tasks;
@@ -242,7 +242,7 @@ void MPIDI_Coll_comm_create(MPID_Comm *comm)
       TRACE_ERR("Waiting for geom create to finish\n");
       MPID_PROGRESS_WAIT_WHILE(geom_init);
 
-      if(comm->mpid.geometry == NULL)
+      if(comm->mpid.geometry == PAMI_GEOMETRY_NULL)
       {
          if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm->rank == 0))
             fprintf(stderr,"Created unoptimized communicator id=%u, size=%u\n", (unsigned) comm->context_id,comm->local_size);
@@ -275,13 +275,13 @@ void MPIDI_Coll_comm_destroy(MPID_Comm *comm)
   if (!MPIDI_Process.optimized.collectives)
     return;
 
-  if(comm->comm_kind != MPID_INTRACOMM) 
+  if(comm->comm_kind != MPID_INTRACOMM)
     return;
 
   /* It's possible (MPIR_Setup_intercomm_localcomm) to have an intracomm
      without a geometry even when using optimized collectives */
-  if(comm->mpid.geometry == NULL)
-    return; 
+  if(comm->mpid.geometry == PAMI_GEOMETRY_NULL)
+    return;
 
    MPIU_TestFree(&comm->coll_fns);
    for(i=0;i<PAMI_XFER_COUNT;i++)

http://git.mpich.org/mpich.git/commitdiff/858da8da2e6e8d205b56e4343d800224718819b7

commit 858da8da2e6e8d205b56e4343d800224718819b7
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Thu Feb 14 13:37:36 2013 -0600

    Use new configuration options.
    
    - PAMI_CLIENT_NONCONTIG
    - PAMI_CLIENT_MEMORY_OPTIMIZE
    - PAMI_GEOMETRY_NONCONTIG
    - PAMI_GEOMETRY_MEMORY_OPTIMIZE
    
    PAMID_COLLECTIVES_MEMORY_OPTIMIZED will set the appropriate configuration options.
    
    (ibm) Issue 9356
    (ibm) e725e372ddabb55bd1c17f2be7fefd1268d74896
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_datatypes.h b/src/mpid/pamid/include/mpidi_datatypes.h
index 3e25404..08df29f 100644
--- a/src/mpid/pamid/include/mpidi_datatypes.h
+++ b/src/mpid/pamid/include/mpidi_datatypes.h
@@ -110,10 +110,10 @@ typedef struct
   struct
   {
     unsigned collectives;       /**< Enable optimized collective functions. */
-    unsigned subcomms;
+    unsigned subcomms;          /**< Enable hardware optimized subcomm's */
     unsigned select_colls;      /**< Enable collective selection */
     unsigned auto_select_colls; /**< Enable automatic collective selection */
-    unsigned memory;
+    unsigned memory;            /**< Enable memory optimized subcomms */
   }
   optimized;
 
diff --git a/src/mpid/pamid/src/comm/mpid_comm.c b/src/mpid/pamid/src/comm/mpid_comm.c
index 27f789d..ba8db3a 100644
--- a/src/mpid/pamid/src/comm/mpid_comm.c
+++ b/src/mpid/pamid/src/comm/mpid_comm.c
@@ -177,29 +177,33 @@ void MPIDI_Coll_comm_create(MPID_Comm *comm)
       if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm->rank == 0))
          fprintf(stderr,"create geometry tasks %p {%u..%u}\n", comm->mpid.tasks, MPID_VCR_GET_LPID(comm->vcr, 0),MPID_VCR_GET_LPID(comm->vcr, comm->local_size-1));
 
-      pami_configuration_t config;
-      size_t numconfigs = 0;
-
+      pami_configuration_t config[3];
+      config[0].name = PAMI_GEOMETRY_NONCONTIG;
+      config[0].value.intval = 0; // Disable non-contig, pamid doesn't use pami for non-contig data collectives
+      size_t numconfigs = 1;
       if(MPIDI_Process.optimized.subcomms)
       {
-         config.name = PAMI_GEOMETRY_OPTIMIZE;
-         numconfigs = 1;
+         config[numconfigs].name = PAMI_GEOMETRY_OPTIMIZE;
+         config[numconfigs].value.intval = 1; 
+         ++numconfigs;
       }
-      else
+      if(MPIDI_Process.optimized.memory) 
       {
-         numconfigs = 0;
+         config[numconfigs].name = PAMI_GEOMETRY_MEMORY_OPTIMIZE;
+         config[numconfigs].value.intval = MPIDI_Process.optimized.memory; /* level of optimization */
+         ++numconfigs;
       }
 
       if(MPIDI_Process.optimized.memory && (comm->local_size & (comm->local_size-1)))
       {
-	/* Don't create irregular geometries.  Fallback to MPICH only collectives */
-	geom_init = 0;
-	comm->mpid.geometry = NULL;
+         /* Don't create irregular geometries.  Fallback to MPICH only collectives */
+         geom_init = 0;
+         comm->mpid.geometry = NULL;
       }
       else if(comm->mpid.tasks == NULL)
       {   
          geom_post.client = MPIDI_Client;
-         geom_post.configs = &config;
+         geom_post.configs = config;
          geom_post.context_offset = 0; /* TODO BES investigate */
          geom_post.num_configs = numconfigs;
          geom_post.newgeom = &comm->mpid.geometry,
@@ -218,7 +222,7 @@ void MPIDI_Coll_comm_create(MPID_Comm *comm)
       else
       {
          geom_post.client = MPIDI_Client;
-         geom_post.configs = &config;
+         geom_post.configs = config;
          geom_post.context_offset = 0; /* TODO BES investigate */
          geom_post.num_configs = numconfigs;
          geom_post.newgeom = &comm->mpid.geometry,
diff --git a/src/mpid/pamid/src/comm/mpid_optcolls.c b/src/mpid/pamid/src/comm/mpid_optcolls.c
index df095bc..85bfd45 100644
--- a/src/mpid/pamid/src/comm/mpid_optcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_optcolls.c
@@ -713,6 +713,8 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
     comm_ptr->mpid.query_cached_allreduce = MPID_COLL_USE_MPICH;
 
     comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0] = 0;
+    comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] = MPID_COLL_USE_MPICH;
+    comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][1] = MPID_COLL_USE_MPICH;
     /* For BGQ */
     /*  1ppn: I0:MultiCombineDput:-:MU if it is available, but it has a check_fn
      *  since it is MU-based*/
diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index 4f8f1b4..4620821 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -279,11 +279,20 @@ MPIDI_PAMI_client_init(int* rank, int* size, int* mpidi_dynamic_tasking, char **
   /* ------------------------------------ */
   /*  Initialize the MPICH->PAMI Client  */
   /* ------------------------------------ */
-  pami_configuration_t config;
   pami_result_t        rc = PAMI_ERROR;
-  unsigned             n  = 0;
+  
+  pami_configuration_t config[2];
+  config[0].name = PAMI_CLIENT_NONCONTIG;
+  config[0].value.intval = 0; // Disable non-contig, pamid doesn't use pami for non-contig data
+  size_t numconfigs = 1;
+  if(MPIDI_Process.optimized.memory) 
+  {
+    config[numconfigs].name = PAMI_CLIENT_MEMORY_OPTIMIZE;
+    config[numconfigs].value.intval = MPIDI_Process.optimized.memory;
+    ++numconfigs;
+  }
 
-  rc = PAMI_Client_create("MPI", &MPIDI_Client, &config, n);
+  rc = PAMI_Client_create("MPI", &MPIDI_Client, config, numconfigs);
   MPID_assert_always(rc == PAMI_SUCCESS);
   PAMIX_Initialize(MPIDI_Client);
 
diff --git a/src/mpid/pamid/src/mpidi_env.c b/src/mpid/pamid/src/mpidi_env.c
index 58c1470..62a048b 100644
--- a/src/mpid/pamid/src/mpidi_env.c
+++ b/src/mpid/pamid/src/mpidi_env.c
@@ -112,7 +112,11 @@
  *   - 0 - Collectives are not memory optimized.
  *   - n - Collectives are memory optimized. 'n' may represent different 
  *         levels of optimization. 
- *  
+ *
+ * - PAMID_OPTIMIZED_SUBCOMMS - Use PAMI 'optimized' collectives. Default is 1.
+ *   - 0 - Some optimized protocols may be disabled.
+ *   - 1 - All performance optimized protocols will be enabled when available.
+ * 
  * - PAMID_VERBOSE - Increases the amount of information dumped during an
 *   MPI_Abort() call and during various MPI function calls.  Possible values:
  *   - 0 - No additional information is dumped.

http://git.mpich.org/mpich.git/commitdiff/63577b2830635062056e45d4b211b4227040556c

commit 63577b2830635062056e45d4b211b4227040556c
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Thu Feb 7 10:21:02 2013 -0600

    Change PAMI_COLLECTIVES_MEMORY_OPTIMIZED to PAMID_COLLECTIVES_MEMORY_OPTIMIZED
    
    (ibm) Issue 9208
    (ibm) a1096bdaa3ecf07db05fb23889e97abeed6708bb
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_datatypes.h b/src/mpid/pamid/include/mpidi_datatypes.h
index 33cf60a..3e25404 100644
--- a/src/mpid/pamid/include/mpidi_datatypes.h
+++ b/src/mpid/pamid/include/mpidi_datatypes.h
@@ -113,6 +113,7 @@ typedef struct
     unsigned subcomms;
     unsigned select_colls;      /**< Enable collective selection */
     unsigned auto_select_colls; /**< Enable automatic collective selection */
+    unsigned memory;
   }
   optimized;
 
diff --git a/src/mpid/pamid/src/comm/mpid_comm.c b/src/mpid/pamid/src/comm/mpid_comm.c
index 89fea1c..27f789d 100644
--- a/src/mpid/pamid/src/comm/mpid_comm.c
+++ b/src/mpid/pamid/src/comm/mpid_comm.c
@@ -190,8 +190,14 @@ void MPIDI_Coll_comm_create(MPID_Comm *comm)
          numconfigs = 0;
       }
 
-      if(comm->mpid.tasks == NULL)
+      if(MPIDI_Process.optimized.memory && (comm->local_size & (comm->local_size-1)))
       {
+	/* Don't create irregular geometries.  Fallback to MPICH only collectives */
+	geom_init = 0;
+	comm->mpid.geometry = NULL;
+      }
+      else if(comm->mpid.tasks == NULL)
+      {   
          geom_post.client = MPIDI_Client;
          geom_post.configs = &config;
          geom_post.context_offset = 0; /* TODO BES investigate */
diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index 57181cb..4f8f1b4 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -120,6 +120,7 @@ MPIDI_Process_t  MPIDI_Process = {
     .collectives         = MPIDI_OPTIMIZED_COLLECTIVE_DEFAULT,
     .subcomms            = 1,
     .select_colls        = 2,
+    .memory              = 0,
   },
 
   .mpir_nbc              = 0,
@@ -700,6 +701,7 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
              "  optimized.collectives : %u\n"
              "  optimized.select_colls: %u\n"
              "  optimized.subcomms    : %u\n"
+             "  optimized.memory      : %u\n"
              "  mpir_nbc              : %u\n" 
              "  numTasks              : %u\n",
              MPIDI_Process.verbose,
@@ -732,6 +734,7 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
              MPIDI_Process.optimized.collectives,
              MPIDI_Process.optimized.select_colls,
              MPIDI_Process.optimized.subcomms,
+             MPIDI_Process.optimized.memory,
              MPIDI_Process.mpir_nbc, 
              MPIDI_Process.numTasks);
       switch (*threading)
diff --git a/src/mpid/pamid/src/mpidi_env.c b/src/mpid/pamid/src/mpidi_env.c
index f19272a..58c1470 100644
--- a/src/mpid/pamid/src/mpidi_env.c
+++ b/src/mpid/pamid/src/mpidi_env.c
@@ -106,6 +106,13 @@
  *   - 0 - Optimized collective selection is not used.
  *   - 1 - Optimized collective selection is used. (default)
  *
+ * - PAMID_COLLECTIVES_MEMORY_OPTIMIZED - Controls whether collectives are 
+ *   optimized to reduce memory usage. This may disable some PAMI collectives.
+ *   Possible values:
+ *   - 0 - Collectives are not memory optimized.
+ *   - n - Collectives are memory optimized. 'n' may represent different 
+ *         levels of optimization. 
+ *  
  * - PAMID_VERBOSE - Increases the amount of information dumped during an
 *   MPI_Abort() call and during various MPI function calls.  Possible values:
  *   - 0 - No additional information is dumped.
@@ -865,6 +872,12 @@ MPIDI_Env_setup(int rank, int requested)
          MPIDI_Process.optimized.auto_select_colls = MPID_AUTO_SELECT_COLLS_NONE;/* Auto coll sel is disabled for all */ 
    }
    
+   /* Set the status for memory optimized collectives */
+   {
+      char* names[] = {"PAMID_COLLECTIVES_MEMORY_OPTIMIZED", NULL};
+      ENV_Unsigned(names, &MPIDI_Process.optimized.memory, 1, &found_deprecated_env_var, rank);
+      TRACE_ERR("MPIDI_Process.optimized.memory=%u\n", MPIDI_Process.optimized.memory);
+   }
 
   /* Set the status of the optimized shared memory point-to-point functions */
   {

http://git.mpich.org/mpich.git/commitdiff/cbb6e7a7bd4928593d089126f6d1f82425a30ca7

commit cbb6e7a7bd4928593d089126f6d1f82425a30ca7
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Thu Jan 3 15:56:03 2013 -0600

    Implement MPIX_Cart_comm_create and MPIX_Pset_* functions.
    
    The following MPIX functions were also provided for BG/P mpich:
    
    -> MPIX_Pset_same_comm_create
    -> MPIX_Pset_diff_comm_create
    -> MPIX_Cart_comm_create
    
    The following MPIX function are new for BG/Q mpich:
    
    -> MPIX_Pset_same_comm_create_from_parent
    -> MPIX_Pset_diff_comm_create_from_parent
    -> MPIX_Pset_io_node
    
    (ibm) CPS 92XKPE
    (ibm) Issue 9231
    (ibm) dd601c33995486e257bd9a664e5027ff7b37e5cf
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpix.h b/src/mpid/pamid/include/mpix.h
index 9272116..2f1f345 100644
--- a/src/mpid/pamid/include/mpix.h
+++ b/src/mpid/pamid/include/mpix.h
@@ -135,6 +135,97 @@ extern "C" {
     */
   int MPIX_Get_last_algorithm_name(MPI_Comm comm, char *protocol, int length);
 
+  /**
+   * \brief Create a communicator such that all nodes in the same
+   *        communicator are served by the same I/O node
+   *
+   * \note This is a collective operation on MPI_COMM_WORLD
+   *
+   * \param [out] pset_comm The new communicator
+   *
+   * \return MPI status code
+   */
+  int MPIX_Pset_same_comm_create (MPI_Comm *pset_comm);
+
+  /**
+   * \brief Create a communicator such that all nodes in the same
+   *        communicator are served by a different I/O node
+   *
+   * \note This is a collective operation on MPI_COMM_WORLD
+   *
+   * \param [out] pset_comm The new communicator
+   *
+   * \return MPI status code
+   */
+  int MPIX_Pset_diff_comm_create (MPI_Comm *pset_comm);
+
+  /**
+   * \brief Create a communicator such that all nodes in the same
+   *        communicator are served by the same I/O node
+   *
+   * \note This is a collective operation on the parent communicator.
+   *
+   * \param [in]  parent_comm The parent communicator
+   * \param [out] pset_comm   The new communicator
+   *
+   * \return MPI status code
+   */
+  int MPIX_Pset_same_comm_create_from_parent (MPI_Comm parent_comm, MPI_Comm *pset_comm);
+
+  /**
+   * \brief Create a communicator such that all nodes in the same
+   *        communicator are served by a different I/O node
+   *
+   * \note This is a collective operation on the parent communicator
+   *
+   * \param [in]  parent_comm The parent communicator
+   * \param [out] pset_comm   The new communicator
+   *
+   * \return MPI status code
+   */
+  int MPIX_Pset_diff_comm_create_from_parent (MPI_Comm parent_comm, MPI_Comm *pset_comm);
+
+  /**
+   * \brief Retrieve information about the I/O node associated with the
+   *        local compute node.
+   *
+   * The I/O node route identifier is a unique number, but it is not a
+   * monotonically increasing integer such as a rank in a communicator.
+   * Multiple ranks, and multiple compute nodes, can be associated with the
+   * same I/O node route.
+   *
+   * The distance to the I/O node is the number of hops on the torus from the
+   * local compute node to the associated I/O node.
+   *
+   * \note On BG/Q the 'bridge' compute nodes are those nodes that are closest
+   *       to the I/O node and will have a distance of '1'.
+   *
+   * \param [out] io_node_route_id     The unique I/O node route identifier
+   * \param [out] distance_to_io_node  The number of hops to the I/O node
+   */
+  void MPIX_Pset_io_node (int *io_node_route_id, int *distance_to_io_node);
+
+  /**
+   * \brief Create a Cartesian communicator that exactly matches the partition
+   *
+   * This is a collective operation on MPI_COMM_WORLD, and will only run
+   * successfully on a full partition job (no -np)
+   *
+   * The communicator is created to match the size of each dimension, the
+   * physical coords on each node, and the torus/mesh link status.
+   *
+   * Because of MPICH2 dimension ordering, the associated arrays (i.e. coords,
+   * sizes, and periods) are in [a, b, c, d, e, t] order. Consequently, when
+   * using the default ABCDET mapping, the rank in cart_comm will match the rank
+   * in MPI_COMM_WORLD. However, when using a non-default mapping or a mapfile
+   * the ranks will be different.
+   *
+   * \param [out] cart_comm The new Cartesian communicator
+   *
+   * \return MPI_SUCCESS or MPI_ERR_TOPOLOGY
+   */
+  int MPIX_Cart_comm_create (MPI_Comm *cart_comm);
+
 
 #if defined(__cplusplus)
 }
diff --git a/src/mpid/pamid/src/mpix/mpix.c b/src/mpid/pamid/src/mpix/mpix.c
index 5be3505..87741fc 100644
--- a/src/mpid/pamid/src/mpix/mpix.c
+++ b/src/mpid/pamid/src/mpix/mpix.c
@@ -21,6 +21,12 @@
  */
 
 #include <mpidimpl.h>
+
+#ifdef __BGQ__
+#include <stdlib.h>
+#include <spi/include/kernel/location.h>
+#endif /* __BGQ__ */
+
 MPIX_Hardware_t MPIDI_HW;
 
 /* Determine the number of torus dimensions. Implemented to keep this code
@@ -513,6 +519,347 @@ MPIX_Get_last_algorithm_name(MPI_Comm comm, char *protocol, int length)
    return MPI_SUCCESS;
 }
 
+#undef FUNCNAME
+#define FUNCNAME MPIX_Pset_io_node
+#undef FCNAME
+#define FCNAME MPIU_QUOTE(FUNCNAME)
+void
+MPIX_Pset_io_node (int *io_node_route_id, int *distance_to_io_node)
+{
+  int iA,  iB,  iC,  iD,  iE;                /* The local node's coordinates  */
+  int nA,  nB,  nC,  nD,  nE;                /* Size of each torus dimension  */
+  int brA, brB, brC, brD, brE;               /* The bridge node's coordinates */
+  int Nflags;
+  int torusA, torusB, torusC, torusD, torusE;        /* mesh == 0, torus == 1 */
+  int d1, d2;
+  int dA, dB, dC, dD, dE;          /* distance from local node to bridge node */
+
+  Personality_t personality;
+
+  Kernel_GetPersonality(&personality, sizeof(personality));
+
+  iA  = personality.Network_Config.Acoord;
+  iB  = personality.Network_Config.Bcoord;
+  iC  = personality.Network_Config.Ccoord;
+  iD  = personality.Network_Config.Dcoord;
+  iE  = personality.Network_Config.Ecoord;
+
+  nA  = personality.Network_Config.Anodes;
+  nB  = personality.Network_Config.Bnodes;
+  nC  = personality.Network_Config.Cnodes;
+  nD  = personality.Network_Config.Dnodes;
+  nE  = personality.Network_Config.Enodes;
+
+  brA = personality.Network_Config.cnBridge_A;
+  brB = personality.Network_Config.cnBridge_B;
+  brC = personality.Network_Config.cnBridge_C;
+  brD = personality.Network_Config.cnBridge_D;
+  brE = personality.Network_Config.cnBridge_E;
+
+  Nflags = personality.Network_Config.NetFlags;
+
+  if (Nflags & ND_ENABLE_TORUS_DIM_A) torusA = 1;
+  else                                torusA = 0;
+  if (Nflags & ND_ENABLE_TORUS_DIM_B) torusB = 1;
+  else                                torusB = 0;
+  if (Nflags & ND_ENABLE_TORUS_DIM_C) torusC = 1;
+  else                                torusC = 0;
+  if (Nflags & ND_ENABLE_TORUS_DIM_D) torusD = 1;
+  else                                torusD = 0;
+  if (Nflags & ND_ENABLE_TORUS_DIM_E) torusE = 1;
+  else                                torusE = 0;
+
+  /*
+   * This is the bridge node, numbered in ABCDE order, E increments first.
+   * It is considered the unique "io node route identifier" because each
+   * bridge node only has one torus link to one io node.
+   */
+  *io_node_route_id = brE + brD*nE + brC*nD*nE + brB*nC*nD*nE + brA*nB*nC*nD*nE;
+
+  d1 = abs(iA - brA);
+  d2 = nA - d1;
+  if (torusA) dA = (d1 < d2) ? d1 : d2;
+  else        dA = d1;
+
+  d1 = abs(iB - brB);
+  d2 = nB - d1;
+  if (torusB) dB = (d1 < d2) ? d1 : d2;
+  else        dB = d1;
+
+  d1 = abs(iC - brC);
+  d2 = nC - d1;
+  if (torusC) dC = (d1 < d2) ? d1 : d2;
+  else        dC = d1;
+
+  d1 = abs(iD - brD);
+  d2 = nD - d1;
+  if (torusD) dD = (d1 < d2) ? d1 : d2;
+  else        dD = d1;
+
+  d1 = abs(iE - brE);
+  d2 = nE - d1;
+  if (torusE) dE = (d1 < d2) ? d1 : d2;
+  else        dE = d1;
+
+  /* This is the number of hops to the io node */
+  *distance_to_io_node = dA + dB + dC + dD + dE + 1;
+
+  return;
+};
+
+/**
+ * \brief Create a communicator of ranks that have a common bridge node.
+ *
+ * \note This function is private to this source file.
+ *
+ * \param [in]  parent_comm_ptr  Pointer to the parent communicator
+ * \param [out] pset_comm_ptr    Pointer to the new 'MPID' communicator
+ *
+ * \return MPI status
+ */
+int _MPIX_Pset_same_comm_create (MPID_Comm *parent_comm_ptr, MPID_Comm **pset_comm_ptr)
+{
+  int color, key;
+  int mpi_errno;
+
+  MPIX_Pset_io_node (&color, &key);
+
+  /*
+   * Use MPIR_Comm_split_impl to make a communicator of all ranks in the parent
+   * communicator that share the same bridge node; i.e. the 'color' is the
+   * 'io node route identifier', which is unique to each BGQ bridge node.
+   *
+   * Setting the 'key' to the 'distance to io node' ensures that rank 0 in
+   * the new communicator is on the bridge node, or as close to the bridge node
+   * as possible.
+   */
+
+  *pset_comm_ptr = NULL;
+  mpi_errno = MPI_SUCCESS;
+  mpi_errno = MPIR_Comm_split_impl(parent_comm_ptr, color, key, pset_comm_ptr);
+
+  return mpi_errno;
+}
+
+#undef FUNCNAME
+#define FUNCNAME MPIX_Pset_same_comm_create_from_parent
+#undef FCNAME
+#define FCNAME MPIU_QUOTE(FUNCNAME)
+int
+MPIX_Pset_same_comm_create_from_parent (MPI_Comm parent_comm, MPI_Comm *pset_comm)
+{
+  int mpi_errno;
+  MPID_Comm *parent_comm_ptr, *pset_comm_ptr;
+
+  *pset_comm = MPI_COMM_NULL;
+
+  /*
+   * Convert the parent communicator object handle to an object pointer;
+   * needed by the error handling code.
+   */
+  parent_comm_ptr = NULL;
+  MPID_Comm_get_ptr(parent_comm, parent_comm_ptr);
+
+  mpi_errno = MPI_SUCCESS;
+  mpi_errno = _MPIX_Pset_same_comm_create (parent_comm_ptr, &pset_comm_ptr);
+  if (mpi_errno) MPIU_ERR_POP(mpi_errno);
+  if (pset_comm_ptr)
+    MPIU_OBJ_PUBLISH_HANDLE(*pset_comm, pset_comm_ptr->handle);
+  else
+    goto fn_fail;
+
+fn_exit:
+  return mpi_errno;
+fn_fail:
+  mpi_errno = MPIR_Err_return_comm( parent_comm_ptr, FCNAME, mpi_errno );
+  goto fn_exit;
+};
+
+#undef FUNCNAME
+#define FUNCNAME MPIX_Pset_same_comm_create
+#undef FCNAME
+#define FCNAME MPIU_QUOTE(FUNCNAME)
+int
+MPIX_Pset_same_comm_create (MPI_Comm *pset_comm)
+{
+  return MPIX_Pset_same_comm_create_from_parent (MPI_COMM_WORLD, pset_comm);
+};
+
+
+#undef FUNCNAME
+#define FUNCNAME MPIX_Pset_diff_comm_create_from_parent
+#undef FCNAME
+#define FCNAME MPIU_QUOTE(FUNCNAME)
+int
+MPIX_Pset_diff_comm_create_from_parent (MPI_Comm parent_comm, MPI_Comm *pset_comm)
+{
+  MPID_Comm *parent_comm_ptr, *pset_same_comm_ptr, *pset_diff_comm_ptr;
+  int color, key;
+  int mpi_errno;
+
+  *pset_comm = MPI_COMM_NULL;
+
+  /*
+   * Convert the parent communicator object handle to an object pointer;
+   * needed by the error handling code.
+   */
+  parent_comm_ptr = NULL;
+  MPID_Comm_get_ptr(parent_comm, parent_comm_ptr);
+
+  /*
+   * Determine the 'color' of this rank to create the new communicator - which
+   * is the rank in a (transient) communicator where all ranks share a common
+   * bridge node.
+   */
+  mpi_errno = MPI_SUCCESS;
+  mpi_errno = _MPIX_Pset_same_comm_create (parent_comm_ptr, &pset_same_comm_ptr);
+  if (mpi_errno) MPIU_ERR_POP(mpi_errno);
+  if (pset_same_comm_ptr == NULL)
+    goto fn_fail;
+
+  color = MPIR_Comm_rank(pset_same_comm_ptr) * MPIDI_HW.ppn + MPIDI_HW.coreID;
+
+  /* Discard the 'pset_same_comm_ptr' .. it is no longer needed. */
+  mpi_errno = MPIR_Comm_free_impl(pset_same_comm_ptr);
+  if (mpi_errno) MPIU_ERR_POP(mpi_errno);
+
+  /* Set the 'key' for this rank to order the ranks in the new communicator. */
+  key = MPIR_Comm_rank(parent_comm_ptr);
+
+  pset_diff_comm_ptr = NULL;
+  mpi_errno = MPI_SUCCESS;
+  mpi_errno = MPIR_Comm_split_impl(parent_comm_ptr, color, key, &pset_diff_comm_ptr);
+  if (mpi_errno) MPIU_ERR_POP(mpi_errno);
+  if (pset_diff_comm_ptr)
+    MPIU_OBJ_PUBLISH_HANDLE(*pset_comm, pset_diff_comm_ptr->handle);
+  else
+    goto fn_fail;
+
+fn_exit:
+  return mpi_errno;
+fn_fail:
+  mpi_errno = MPIR_Err_return_comm( parent_comm_ptr, FCNAME, mpi_errno );
+  goto fn_exit;
+};
+
+#undef FUNCNAME
+#define FUNCNAME MPIX_Pset_diff_comm_create
+#undef FCNAME
+#define FCNAME MPIU_QUOTE(FUNCNAME)
+int
+MPIX_Pset_diff_comm_create (MPI_Comm *pset_comm)
+{
+  return MPIX_Pset_diff_comm_create_from_parent (MPI_COMM_WORLD, pset_comm);
+};
+
+
+
+/**
+ * \brief Compare each element of two six-element arrays
+ * \param [in] A The first array
+ * \param [in] B The second array
+ * \return MPI_SUCCESS (does not return on failure)
+ */
+#define CMP_6(A,B)                              \
+({                                              \
+  assert(A[0] == B[0]);                         \
+  assert(A[1] == B[1]);                         \
+  assert(A[2] == B[2]);                         \
+  assert(A[3] == B[3]);                         \
+  assert(A[4] == B[4]);                         \
+  assert(A[5] == B[5]);                         \
+  MPI_SUCCESS;                                  \
+})
+
+#undef FUNCNAME
+#define FUNCNAME MPIX_Cart_comm_create
+#undef FCNAME
+#define FCNAME MPIU_QUOTE(FUNCNAME)
+int
+MPIX_Cart_comm_create (MPI_Comm *cart_comm)
+{
+  int result;
+  int rank, numprocs,
+      dims[6],
+      wrap[6],
+      coords[6];
+  int new_rank1, new_rank2;
+  MPI_Comm new_comm = MPI_COMM_NULL;
+  int cart_rank,
+      cart_dims[6],
+      cart_wrap[6],
+      cart_coords[6];
+  int Nflags;
+
+  *cart_comm = MPI_COMM_NULL;
+  PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
+  PMPI_Comm_size(MPI_COMM_WORLD, &numprocs);
+
+  Personality_t personality;
+
+  Kernel_GetPersonality(&personality, sizeof(personality));
+
+  dims[0] = personality.Network_Config.Anodes;
+  dims[1] = personality.Network_Config.Bnodes;
+  dims[2] = personality.Network_Config.Cnodes;
+  dims[3] = personality.Network_Config.Dnodes;
+  dims[4] = personality.Network_Config.Enodes;
+  dims[5] = Kernel_ProcessCount();
+
+  /* This only works if MPI_COMM_WORLD is the full partition */
+  if (dims[5] * dims[4] * dims[3] * dims[2] * dims[1] * dims[0] != numprocs)
+    return MPI_ERR_TOPOLOGY;
+
+  Nflags = personality.Network_Config.NetFlags;
+  wrap[0] = ((Nflags & ND_ENABLE_TORUS_DIM_A) != 0);
+  wrap[1] = ((Nflags & ND_ENABLE_TORUS_DIM_B) != 0);
+  wrap[2] = ((Nflags & ND_ENABLE_TORUS_DIM_C) != 0);
+  wrap[3] = ((Nflags & ND_ENABLE_TORUS_DIM_D) != 0);
+  wrap[4] = ((Nflags & ND_ENABLE_TORUS_DIM_E) != 0);
+  wrap[5] = 1;
+
+  coords[0] = personality.Network_Config.Acoord;
+  coords[1] = personality.Network_Config.Bcoord;
+  coords[2] = personality.Network_Config.Ccoord;
+  coords[3] = personality.Network_Config.Dcoord;
+  coords[4] = personality.Network_Config.Ecoord;
+  coords[5] = Kernel_MyTcoord();
+
+  new_rank1 =                                         coords[5] +
+                                            dims[5] * coords[4] +
+                                  dims[5] * dims[4] * coords[3] +
+                        dims[5] * dims[4] * dims[3] * coords[2] +
+              dims[5] * dims[4] * dims[3] * dims[2] * coords[1] +
+    dims[5] * dims[4] * dims[3] * dims[2] * dims[1] * coords[0];
+
+  result = PMPI_Comm_split(MPI_COMM_WORLD, 0, new_rank1, &new_comm);
+  if (result != MPI_SUCCESS)
+  {
+     PMPI_Comm_free(&new_comm);
+     return result;
+  }
+  PMPI_Comm_rank(new_comm, &new_rank2);
+  assert(new_rank1 == new_rank2);
+
+  result = PMPI_Cart_create(new_comm,
+                            6,
+                            dims,
+                            wrap,
+                            0,
+                            cart_comm);
+  if (result != MPI_SUCCESS)
+    return result;
+
+  PMPI_Comm_rank(*cart_comm, &cart_rank);
+  PMPI_Cart_get (*cart_comm, 6, cart_dims, cart_wrap, cart_coords);
+
+  CMP_6(dims,   cart_dims);
+  CMP_6(wrap,   cart_wrap);
+  CMP_6(coords, cart_coords);
+
+  PMPI_Comm_free(&new_comm);
+  return MPI_SUCCESS;
+};
 
 #endif
 

http://git.mpich.org/mpich.git/commitdiff/09a16913c92416d95ed6aec5b29451d92e7fe4d9

commit 09a16913c92416d95ed6aec5b29451d92e7fe4d9
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Thu Jan 24 15:36:33 2013 -0600

    Fix optgather flag processing
    
    (ibm) Issue 9295
    (ibm) fc790d0ebd2d3f5e4001158fdf8b0277cc037339
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/coll/gather/mpido_gather.c b/src/mpid/pamid/src/coll/gather/mpido_gather.c
index 3b755a9..f44f19d 100644
--- a/src/mpid/pamid/src/coll/gather/mpido_gather.c
+++ b/src/mpid/pamid/src/coll/gather/mpido_gather.c
@@ -134,7 +134,7 @@ int MPIDO_Gather(const void *sendbuf,
   MPI_Aint true_lb = 0;
   pami_xfer_t gather;
   MPIDI_Post_coll_t gather_post;
-  int success = 1, contig, send_bytes=-1, recv_bytes = 0;
+  int use_opt = 1, contig=0, send_bytes=-1, recv_bytes = 0;
   const int rank = comm_ptr->rank;
   const int size = comm_ptr->local_size;
 #if ASSERT_LEVEL==0
@@ -146,69 +146,91 @@ int MPIDO_Gather(const void *sendbuf,
    const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
    const int selected_type = mpid->user_selected_type[PAMI_XFER_GATHER];
 
-  if (sendtype != MPI_DATATYPE_NULL && sendcount >= 0)
+  if (rank == root)
+  {
+    if (recvtype != MPI_DATATYPE_NULL && recvcount >= 0)
+    {
+      MPIDI_Datatype_get_info(recvcount, recvtype, contig,
+                              recv_bytes, data_ptr, true_lb);
+      if (!contig || ((recv_bytes * size) % sizeof(int))) /* ? */
+        use_opt = 0;
+    }
+    else
+      use_opt = 0;
+  }
+
+  if ((sendbuf != MPI_IN_PLACE) && sendtype != MPI_DATATYPE_NULL && sendcount >= 0)
   {
     MPIDI_Datatype_get_info(sendcount, sendtype, contig,
                             send_bytes, data_ptr, true_lb);
     if (!contig || ((send_bytes * size) % sizeof(int)))
-      success = 0;
+      use_opt = 0;
   }
-  else
-    success = 0;
-
-  if (success && rank == root)
+  else 
   {
-    if (recvtype != MPI_DATATYPE_NULL && recvcount >= 0)
+    if(sendbuf == MPI_IN_PLACE)
+      send_bytes = recv_bytes;
+    if (sendtype == MPI_DATATYPE_NULL || sendcount == 0)
     {
-      MPIDI_Datatype_get_info(recvcount, recvtype, contig,
-                              recv_bytes, data_ptr, true_lb);
-      if (!contig) success = 0;
+      send_bytes = 0;
+      use_opt = 0;
     }
-    else
-      success = 0;
   }
 
-  MPIDI_Update_last_algorithm(comm_ptr, "GATHER_MPICH");
-  if(!mpid->optgather ||
+  if(!mpid->optgather &&
    selected_type == MPID_COLL_USE_MPICH)
   {
+    MPIDI_Update_last_algorithm(comm_ptr, "GATHER_MPICH");
     if(unlikely(verbose))
-      fprintf(stderr,"Using MPICH gather algorithm\n");
+      fprintf(stderr,"Using MPICH gather algorithm (01) opt %x, selected type %d\n",mpid->optgather,selected_type);
     return MPIR_Gather(sendbuf, sendcount, sendtype,
                        recvbuf, recvcount, recvtype,
                        root, comm_ptr, mpierrno);
   }
-
-   if(mpid->preallreduces[MPID_GATHER_PREALLREDUCE])
-   {
-      volatile unsigned allred_active = 1;
-      pami_xfer_t allred;
-      MPIDI_Post_coll_t allred_post;
-      allred.cb_done = cb_allred;
-      allred.cookie = (void *)&allred_active;
-      /* Guaranteed to work allreduce */
-      allred.algorithm = mpid->coll_algorithm[PAMI_XFER_ALLREDUCE][0][0];
-      allred.cmd.xfer_allreduce.sndbuf = (void *)(size_t)success;
-      allred.cmd.xfer_allreduce.stype = PAMI_TYPE_SIGNED_INT;
-      allred.cmd.xfer_allreduce.rcvbuf = (void *)(size_t)success;
-      allred.cmd.xfer_allreduce.rtype = PAMI_TYPE_SIGNED_INT;
-      allred.cmd.xfer_allreduce.stypecount = 1;
-      allred.cmd.xfer_allreduce.rtypecount = 1;
-      allred.cmd.xfer_allreduce.op = PAMI_DATA_BAND;
-
-      MPIDI_Context_post(MPIDI_Context[0], &allred_post.state,
-                         MPIDI_Pami_post_wrapper, (void *)&allred);
-      MPID_PROGRESS_WAIT_WHILE(allred_active);
-   }
-
-   if(selected_type == MPID_COLL_USE_MPICH || !success)
-   {
+  if(mpid->preallreduces[MPID_GATHER_PREALLREDUCE])
+  {
     if(unlikely(verbose))
-      fprintf(stderr,"Using MPICH gather algorithm\n");
-    return MPIR_Gather(sendbuf, sendcount, sendtype,
-                       recvbuf, recvcount, recvtype,
-                       root, comm_ptr, mpierrno);
-   }
+      fprintf(stderr,"MPID_GATHER_PREALLREDUCE opt %x, selected type %d, use_opt %d\n",mpid->optgather,selected_type, use_opt);
+    volatile unsigned allred_active = 1;
+    pami_xfer_t allred;
+    MPIDI_Post_coll_t allred_post;
+    allred.cb_done = cb_allred;
+    allred.cookie = (void *)&allred_active;
+    /* Guaranteed to work allreduce */
+    allred.algorithm = mpid->coll_algorithm[PAMI_XFER_ALLREDUCE][0][0];
+    allred.cmd.xfer_allreduce.sndbuf = (void *)PAMI_IN_PLACE;
+    allred.cmd.xfer_allreduce.stype = PAMI_TYPE_SIGNED_INT;
+    allred.cmd.xfer_allreduce.rcvbuf = (void *)&use_opt;
+    allred.cmd.xfer_allreduce.rtype = PAMI_TYPE_SIGNED_INT;
+    allred.cmd.xfer_allreduce.stypecount = 1;
+    allred.cmd.xfer_allreduce.rtypecount = 1;
+    allred.cmd.xfer_allreduce.op = PAMI_DATA_BAND;
+    
+    MPIDI_Context_post(MPIDI_Context[0], &allred_post.state,
+                       MPIDI_Pami_post_wrapper, (void *)&allred);
+    MPID_PROGRESS_WAIT_WHILE(allred_active);
+    if(unlikely(verbose))
+      fprintf(stderr,"MPID_GATHER_PREALLREDUCE opt %x, selected type %d, use_opt %d\n",mpid->optgather,selected_type, use_opt);
+  }
+
+  if(mpid->optgather)
+  {
+    if(use_opt)
+    {
+      MPIDI_Update_last_algorithm(comm_ptr, "GLUE_REDUCE");
+      abort();
+      /* GLUE_REDUCE ? */
+    }
+    else
+    {
+      MPIDI_Update_last_algorithm(comm_ptr, "GATHER_MPICH");
+      if(unlikely(verbose))
+        fprintf(stderr,"Using MPICH gather algorithm (02) opt %x, selected type %d, use_opt %d\n",mpid->optgather,selected_type, use_opt);
+      return MPIR_Gather(sendbuf, sendcount, sendtype,
+                         recvbuf, recvcount, recvtype,
+                         root, comm_ptr, mpierrno);
+    }
+  }
 
 
    pami_algorithm_t my_gather;
@@ -268,8 +290,15 @@ int MPIDO_Gather(const void *sendbuf,
            result.check.unspecified = 1;
         if(my_md->check_correct.values.rangeminmax)
         {
-          if((my_md->range_lo <= recv_bytes) &&
-             (my_md->range_hi >= recv_bytes))
+          /* Non-local decision? */
+          if(((rank == root) &&
+              (my_md->range_lo <= recv_bytes) &&
+              (my_md->range_hi >= recv_bytes)
+              ) &&
+             ((my_md->range_lo <= send_bytes) &&
+              (my_md->range_hi >= send_bytes)
+              )
+             )
             ; /* ok, algorithm selected */
           else
           {
diff --git a/src/mpid/pamid/src/comm/mpid_optcolls.c b/src/mpid/pamid/src/comm/mpid_optcolls.c
index ebb7ad7..df095bc 100644
--- a/src/mpid/pamid/src/comm/mpid_optcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_optcolls.c
@@ -264,6 +264,13 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
     TRACE_ERR("Done setting optimized allgatherv[int]\n");
   }
 
+  if(comm_ptr->mpid.user_selected_type[PAMI_XFER_GATHER] == MPID_COLL_NOSELECTION)
+  {
+    TRACE_ERR("Default gather to MPICH\n");
+    comm_ptr->mpid.user_selected_type[PAMI_XFER_GATHER] = MPID_COLL_USE_MPICH;
+    comm_ptr->mpid.opt_protocol[PAMI_XFER_GATHER][0] = 0;
+  }
+
   opt_proto = -1;
   mustquery = 0;
   /* Alltoall */
@@ -832,6 +839,8 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       fprintf(stderr,"Selecting %s for opt allgatherv comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLGATHERV_INT][0].name, comm_ptr);
     if(comm_ptr->mpid.must_query[PAMI_XFER_ALLGATHERV_INT][0] == MPID_COLL_USE_MPICH)
       fprintf(stderr,"Selecting MPICH for allgatherv below %d size comm %p\n", comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLGATHERV_INT][0], comm_ptr);
+    if(comm_ptr->mpid.user_selected_type[PAMI_XFER_GATHER] == MPID_COLL_USE_MPICH)
+      fprintf(stderr,"Selecting MPICH for gather comm %p\n", comm_ptr);
     if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] == MPID_COLL_OPTIMIZED)
       fprintf(stderr,"Selecting %s for opt bcast up to size %d comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name,
               comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0], comm_ptr);

http://git.mpich.org/mpich.git/commitdiff/4d66fef3da8af915ed42985295b1d5f9647a7cc9

commit 4d66fef3da8af915ed42985295b1d5f9647a7cc9
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Wed Jan 23 17:44:43 2013 -0600

    Simple support for async flow control metadata
    
    (ibm) Issue 9136
    (ibm) cf4fa7aa56634b0fe697fe2024f53f34fc4298b3
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
index 31a629d..2b29704 100644
--- a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
+++ b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
@@ -494,6 +494,13 @@ MPIDO_Allgather(const void *sendbuf,
                        recvbuf, recvcount, recvtype,
                        comm_ptr, mpierrno);
          }
+         if(my_md->check_correct.values.asyncflowctl) 
+         { /* need better flow control than a barrier every time */
+           int tmpmpierrno;   
+           if(unlikely(verbose))
+             fprintf(stderr,"Query barrier required for %s\n", my_md->name);
+           MPIR_Barrier(comm_ptr, &tmpmpierrno);
+         }
       }
 
       if(unlikely(verbose))
diff --git a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
index 008b7aa..a289b48 100644
--- a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
+++ b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
@@ -516,6 +516,13 @@ MPIDO_Allgatherv(const void *sendbuf,
                                   recvbuf, recvcounts, displs, recvtype,
                                   comm_ptr, mpierrno);
          }
+         if(my_md->check_correct.values.asyncflowctl) 
+         { /* need better flow control than a barrier every time */
+           int tmpmpierrno;   
+           if(unlikely(verbose))
+             fprintf(stderr,"Query barrier required for %s\n", my_md->name);
+           MPIR_Barrier(comm_ptr, &tmpmpierrno);
+         }
       }
 
       if(unlikely(verbose))
diff --git a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
index 4ae5535..dc796a9 100644
--- a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
+++ b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
@@ -176,6 +176,13 @@ int MPIDO_Alltoall(const void *sendbuf,
                                    recvbuf, recvcount, recvtype,
                                    comm_ptr, mpierrno);
       }
+      if(my_md->check_correct.values.asyncflowctl) 
+      { /* need better flow control than a barrier every time */
+         int tmpmpierrno;   
+         if(unlikely(verbose))
+            fprintf(stderr,"Query barrier required for %s\n", my_md->name);
+         MPIR_Barrier(comm_ptr, &tmpmpierrno);
+      }
    }
 
    if(unlikely(verbose))
diff --git a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
index 4d5d450..29ec76a 100644
--- a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
+++ b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
@@ -229,6 +229,13 @@ int MPIDO_Bcast(void *buffer,
          MPIDI_Update_last_algorithm(comm_ptr,"BCAST_MPICH");
          return MPIR_Bcast_intra(buffer, count, datatype, root, comm_ptr, mpierrno);
       }
+      if(my_md->check_correct.values.asyncflowctl) 
+      { /* need better flow control than a barrier every time */
+         int tmpmpierrno;   
+         if(unlikely(verbose))
+            fprintf(stderr,"Query barrier required for %s\n", my_md->name);
+         MPIR_Barrier(comm_ptr, &tmpmpierrno);
+      }
    }
 
    if(unlikely(verbose))
diff --git a/src/mpid/pamid/src/coll/gather/mpido_gather.c b/src/mpid/pamid/src/coll/gather/mpido_gather.c
index 8859f05..3b755a9 100644
--- a/src/mpid/pamid/src/coll/gather/mpido_gather.c
+++ b/src/mpid/pamid/src/coll/gather/mpido_gather.c
@@ -298,6 +298,13 @@ int MPIDO_Gather(const void *sendbuf,
                            recvbuf, recvcount, recvtype,
                            root, comm_ptr, mpierrno);
       }
+      if(my_md->check_correct.values.asyncflowctl) 
+      { /* need better flow control than a barrier every time */
+        int tmpmpierrno;   
+        if(unlikely(verbose))
+          fprintf(stderr,"Query barrier required for %s\n", my_md->name);
+        MPIR_Barrier(comm_ptr, &tmpmpierrno);
+      }
    }
 
    MPIDI_Update_last_algorithm(comm_ptr,
diff --git a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
index a6f7b42..7df7210 100644
--- a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
+++ b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
@@ -185,6 +185,13 @@ int MPIDO_Gatherv(const void *sendbuf,
                              recvbuf, recvcounts, displs, recvtype,
                              root, comm_ptr, mpierrno);
       }
+      if(my_md->check_correct.values.asyncflowctl) 
+      { /* need better flow control than a barrier every time */
+         int tmpmpierrno;   
+         if(unlikely(verbose))
+            fprintf(stderr,"Query barrier required for %s\n", my_md->name);
+         MPIR_Barrier(comm_ptr, &tmpmpierrno);
+      }
    }
    
    MPIDI_Update_last_algorithm(comm_ptr, my_md->name);
diff --git a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
index 1a5c9d1..79e65f7 100644
--- a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
+++ b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
@@ -186,7 +186,17 @@ int MPIDO_Reduce(const void *sendbuf,
             fprintf(stderr,"Query failed for %s.  Using MPICH reduce.\n",
                     my_md->name);
       }  
-      else alg_selected = 1;
+      else 
+      {   
+         if(my_md->check_correct.values.asyncflowctl) 
+         { /* need better flow control than a barrier every time */
+            int tmpmpierrno;   
+            if(unlikely(verbose))
+               fprintf(stderr,"Query barrier required for %s\n", my_md->name);
+            MPIR_Barrier(comm_ptr, &tmpmpierrno);
+         }
+         alg_selected = 1;
+      }
    }
 
    if(alg_selected)
diff --git a/src/mpid/pamid/src/coll/scan/mpido_scan.c b/src/mpid/pamid/src/coll/scan/mpido_scan.c
index 4a4e37a..73e21b1 100644
--- a/src/mpid/pamid/src/coll/scan/mpido_scan.c
+++ b/src/mpid/pamid/src/coll/scan/mpido_scan.c
@@ -177,6 +177,13 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
          else
             return MPIR_Scan(sendbuf, recvbuf, count, datatype, op, comm_ptr, mpierrno);
       }
+      if(my_md->check_correct.values.asyncflowctl) 
+      { /* need better flow control than a barrier every time */
+         int tmpmpierrno;   
+         if(unlikely(verbose))
+            fprintf(stderr,"Query barrier required for %s\n", my_md->name);
+         MPIR_Barrier(comm_ptr, &tmpmpierrno);
+      }
    }
    
    if(unlikely(verbose))
diff --git a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
index aca6c89..7b9de8c 100644
--- a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
+++ b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
@@ -274,6 +274,13 @@ int MPIDO_Scatter(const void *sendbuf,
                             recvbuf, recvcount, recvtype,
                             root, comm_ptr, mpierrno);
       }
+      if(my_md->check_correct.values.asyncflowctl) 
+      { /* need better flow control than a barrier every time */
+        int tmpmpierrno;   
+        if(unlikely(verbose))
+          fprintf(stderr,"Query barrier required for %s\n", my_md->name);
+        MPIR_Barrier(comm_ptr, &tmpmpierrno);
+      }
    }
 
    if(unlikely(verbose))
diff --git a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
index 6f899ff..14edf3d 100644
--- a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
+++ b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
@@ -393,6 +393,13 @@ int MPIDO_Scatterv(const void *sendbuf,
                              recvbuf, recvcount, recvtype,
                              root, comm_ptr, mpierrno);
       }
+      if(my_md->check_correct.values.asyncflowctl) 
+      { /* need better flow control than a barrier every time */
+        int tmpmpierrno;   
+        if(unlikely(verbose))
+          fprintf(stderr,"Query barrier required for %s\n", my_md->name);
+        MPIR_Barrier(comm_ptr, &tmpmpierrno);
+      }
    }
 
    MPIDI_Update_last_algorithm(comm_ptr, my_md->name);

http://git.mpich.org/mpich.git/commitdiff/1c0b1499d406f8bb6fb30427ef9cb476b98e0a93

commit 1c0b1499d406f8bb6fb30427ef9cb476b98e0a93
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Wed Jan 23 17:20:23 2013 -0600

    Switch RankBased to SequenceBased
    
    (ibm) Trac #657
    (ibm) 8ab9f9f17829fd077dcbb4054d4bd06ca4001523
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/comm/mpid_optcolls.c b/src/mpid/pamid/src/comm/mpid_optcolls.c
index e2bf8e4..ebb7ad7 100644
--- a/src/mpid/pamid/src/comm/mpid_optcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_optcolls.c
@@ -529,7 +529,7 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][0]; i++)
       {
         /* This is a good choice for small messages only */
-        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:RankBased_Binomial:SHMEM:MU") == 0)
+        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:SequenceBased_Binomial:SHMEM:MU") == 0)
         {
           opt_proto = i;
           comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0] = 256;
@@ -539,7 +539,7 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       if(opt_proto == -1) for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
       {
         /* This is a good choice for small messages only */
-        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:RankBased_Binomial:SHMEM:MU") == 0)
+        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:SequenceBased_Binomial:SHMEM:MU") == 0)
         {
           opt_proto = i;
           mustquery = 1;
@@ -630,7 +630,7 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
             }
           }
         }
-      if(strcasecmp(comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name, "I0:RankBased_Binomial:SHMEM:MU") == 0)
+      if(strcasecmp(comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name, "I0:SequenceBased_Binomial:SHMEM:MU") == 0)
       {
         /* This protocol was only good for up to 256, and it was an irregular, so let's set
          * 2-nomial for larger message sizes. Cutoff should have already been set to 256 too */

http://git.mpich.org/mpich.git/commitdiff/7eb95d3a0059ebbf9b345be084b9551273fcd877

commit 7eb95d3a0059ebbf9b345be084b9551273fcd877
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Wed Jan 23 17:11:29 2013 -0600

    Update MPI_IN_PLACE support after D188059/D188060 fixes
    
    (ibm) Issue 9136
    (ibm) b59bcad03176bd250cbe10e4bfe46589febe0a15
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
index f1333f0..31a629d 100644
--- a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
+++ b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
@@ -367,8 +367,6 @@ MPIDO_Allgather(const void *sendbuf,
    sbuf = PAMI_IN_PLACE;
    if(sendbuf != MPI_IN_PLACE)
    {
-     if(unlikely(verbose))
-         fprintf(stderr,"allgather MPI_IN_PLACE buffering\n");
       MPIDI_Datatype_get_info(sendcount,
                             sendtype,
                             config[MPID_SEND_CONTIG],
@@ -377,6 +375,10 @@ MPIDO_Allgather(const void *sendbuf,
                             send_true_lb);
       sbuf = (char *)sendbuf+send_true_lb;
    }
+   else
+     if(unlikely(verbose))
+       fprintf(stderr,"allgather MPI_IN_PLACE buffering\n");
+
 
   /* verify everyone's datatype contiguity */
   /* Check buffer alignment now, since we're pre-allreducing anyway */
diff --git a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
index cc0efdf..008b7aa 100644
--- a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
+++ b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
@@ -378,6 +378,9 @@ MPIDO_Allgatherv(const void *sendbuf,
    if(sendbuf == MPI_IN_PLACE)
    {
      sbuf = PAMI_IN_PLACE;
+     if(unlikely(verbose))
+       fprintf(stderr,"allgatherv MPI_IN_PLACE buffering\n");
+     stype = rtype;
      scount = recvcounts[rank];
      send_size = recv_size * scount; 
    }
diff --git a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
index b3788a4..1a5c9d1 100644
--- a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
+++ b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
@@ -77,7 +77,7 @@ int MPIDO_Reduce(const void *sendbuf,
    if(sendbuf == MPI_IN_PLACE) 
    {
       if(unlikely(verbose))
-         fprintf(stderr,"reduce MPI_IN_PLACE buffering\n");
+	fprintf(stderr,"reduce MPI_IN_PLACE send buffering (%d,%d)\n",count,tsize);
       sbuf = PAMI_IN_PLACE;
    }
 

http://git.mpich.org/mpich.git/commitdiff/bdc6a36771a6276965f9e6caf86f7e26df9f1fd5

commit bdc6a36771a6276965f9e6caf86f7e26df9f1fd5
Author: Sameer Kumar <sameerk at us.ibm.com>
Date:   Wed Jan 16 04:09:51 2013 -0600

    Allgather(v) optimizations to use allreduce double sum instead of integer BOR.
    
    Only int32, int64, float and double can take advantage of this optimization.
    
    (ibm) Trac #636
    (ibm) 1c6de3526b6c7b10450d55193501087cfd664abe
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
index 47acda0..f1333f0 100644
--- a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
+++ b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
@@ -47,6 +47,9 @@ static void allgather_cb_done(void *ctxt, void *clientdata, pami_result_t err)
  *       - The datatype parameters needed added to the function signature
  */
 /* ****************************************************************** */
+
+#define MAX_ALLGATHER_ALLREDUCE_BUFFER_SIZE  (1024*1024*2)
+
 int MPIDO_Allgather_allreduce(const void *sendbuf,
 			      int sendcount,
 			      MPI_Datatype sendtype,
@@ -61,7 +64,7 @@ int MPIDO_Allgather_allreduce(const void *sendbuf,
                               int *mpierrno)
 
 {
-  int rc;
+  int rc, i;
   char *startbuf = NULL;
   char *destbuf = NULL;
   const int rank = comm_ptr->rank;
@@ -69,23 +72,74 @@ int MPIDO_Allgather_allreduce(const void *sendbuf,
   startbuf   = (char *) recvbuf + recv_true_lb;
   destbuf    = startbuf + rank * send_size;
 
-  memset(startbuf, 0, rank * send_size);
-  memset(destbuf + send_size, 0, recv_size - (rank + 1) * send_size);
-
   if (sendbuf != MPI_IN_PLACE)
   {
     char *outputbuf = (char *) sendbuf + send_true_lb;
     memcpy(destbuf, outputbuf, send_size);
   }
+
   /* TODO: Change to PAMI */
-  rc = MPIDO_Allreduce(MPI_IN_PLACE,
-                       startbuf,
-                       recv_size/sizeof(unsigned),
-                       MPI_UNSIGNED,
-                       MPI_BOR,
-                       comm_ptr,
-                       mpierrno);
+  /*Convert and then do the allreduce*/
+  if ( recv_size <= MAX_ALLGATHER_ALLREDUCE_BUFFER_SIZE &&
+       (send_size & 0x3)==0 &&  /*integer/long allgathers only*/
+       (sendtype != MPI_DOUBLE || recvtype != MPI_DOUBLE))       
+  {
+    double *tmprbuf = (double *)MPIU_Malloc(recv_size*2);
+    if (tmprbuf == NULL)
+      goto direct_algo; /*skip int to fp conversion and go to direct
+			  algo*/
+
+    double *tmpsbuf = tmprbuf + (rank*send_size)/sizeof(int);
+    int *sibuf = (int *) destbuf;
+    
+    memset(tmprbuf, 0, rank*send_size*2);
+    memset(tmpsbuf + send_size/sizeof(int), 0, 
+	   (recv_size - (rank + 1)*send_size)*2);
+
+    for(i = 0; i < (send_size/sizeof(int)); ++i) 
+      tmpsbuf[i] = (double)sibuf[i];
+    
+    rc = MPIDO_Allreduce(MPI_IN_PLACE,
+			 tmprbuf,
+			 recv_size/sizeof(int),
+			 MPI_DOUBLE,
+			 MPI_SUM,
+			 comm_ptr,
+			 mpierrno);
+    
+    sibuf = (int *) startbuf;
+    for(i = 0; i < (rank*send_size/sizeof(int)); ++i) 
+      sibuf[i] = (int)tmprbuf[i];
+
+    for(i = (rank+1)*send_size/sizeof(int); i < recv_size/sizeof(int); ++i) 
+      sibuf[i] = (int)tmprbuf[i];
+
+    MPIU_Free(tmprbuf);
+    return rc;
+  }
+
+ direct_algo:
 
+  memset(startbuf, 0, rank * send_size);
+  memset(destbuf + send_size, 0, recv_size - (rank + 1) * send_size);
+
+  if (sendtype == MPI_DOUBLE && recvtype == MPI_DOUBLE)
+    rc = MPIDO_Allreduce(MPI_IN_PLACE,
+			 startbuf,
+			 recv_size/sizeof(double),
+			 MPI_DOUBLE,
+			 MPI_SUM,
+			 comm_ptr,
+			 mpierrno);
+  else
+    rc = MPIDO_Allreduce(MPI_IN_PLACE,
+			 startbuf,
+			 recv_size/sizeof(int),
+			 MPI_UNSIGNED,
+			 MPI_BOR,
+			 comm_ptr,
+			 mpierrno);
+  
   return rc;
 }
 
@@ -204,6 +258,7 @@ int MPIDO_Allgather_alltoall(const void *sendbuf,
     a2a_recvcounts[rank] = 0;
   }
 
+
   rc = MPIDO_Alltoallv((const void *)a2a_sendbuf,
                        a2a_sendcounts,
                        a2a_senddispls,
@@ -214,16 +269,6 @@ int MPIDO_Allgather_alltoall(const void *sendbuf,
                        recvtype,
                        comm_ptr,
                        mpierrno); 
-/*  rc = MPIR_Alltoallv((const void *)a2a_sendbuf,
-		       a2a_sendcounts,
-		       a2a_senddispls,
-		       MPI_CHAR,
-		       recvbuf,
-		       a2a_recvcounts,
-		       a2a_recvdispls,
-		       recvtype,
-		       comm_ptr,
-		       mpierrno); */
 
   return rc;
 }
diff --git a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
index 60e98dd..cc0efdf 100644
--- a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
+++ b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
@@ -46,6 +46,7 @@ static void allred_cb_done(void *ctxt, void *clientdata, pami_result_t err)
 *       - Tree allreduce is available (for max performance)
  */
 /* ****************************************************************** */
+#define MAX_ALLGATHERV_ALLREDUCE_BUFFER_SIZE (1024*1024*2)
 int MPIDO_Allgatherv_allreduce(const void *sendbuf,
 			       int sendcount,
 			       MPI_Datatype sendtype,
@@ -61,7 +62,7 @@ int MPIDO_Allgatherv_allreduce(const void *sendbuf,
 			       MPID_Comm * comm_ptr,
                                int *mpierrno)
 {
-  int start, rc;
+  int start, rc, i;
   int length;
   char *startbuf = NULL;
   char *destbuf = NULL;
@@ -71,6 +72,58 @@ int MPIDO_Allgatherv_allreduce(const void *sendbuf,
   startbuf = (char *) recvbuf + recv_true_lb;
   destbuf = startbuf + displs[rank] * recv_size;
 
+  if (sendbuf != MPI_IN_PLACE)
+  {
+    char *outputbuf = (char *) sendbuf + send_true_lb;
+    memcpy(destbuf, outputbuf, send_size);
+  }
+
+  //printf("buffer_sum %d, send_size %d recv_size %d\n", buffer_sum, 
+  // (int)send_size, (int)recv_size);	 
+
+  /* TODO: Change to PAMI */
+  /*integer/long/double allgathers only*/
+  /*Convert and then do the allreduce*/
+  if ( buffer_sum <= MAX_ALLGATHERV_ALLREDUCE_BUFFER_SIZE &&
+       (send_size & 0x3)==0 && (recv_size & 0x3)==0)  
+  {
+    double *tmprbuf = (double *)MPIU_Malloc(buffer_sum*2);
+    if (tmprbuf == NULL)
+      goto direct_algo; /*skip int to fp conversion and go to direct
+			  algo*/
+
+    double *tmpsbuf = tmprbuf + (displs[rank]*recv_size)/sizeof(int);
+    int *sibuf = (int *) destbuf;
+    
+    memset(tmprbuf, 0, displs[rank]*recv_size*2);
+    start  = (displs[rank] + recvcounts[rank]) * recv_size;   
+    length = buffer_sum - (displs[rank] + recvcounts[rank]) * recv_size;
+    memset(tmprbuf + start/sizeof(int), 0, length*2);
+
+    for(i = 0; i < (send_size/sizeof(int)); ++i) 
+      tmpsbuf[i] = (double)sibuf[i];
+    
+    rc = MPIDO_Allreduce(MPI_IN_PLACE,
+			 tmprbuf,
+			 buffer_sum/sizeof(int),
+			 MPI_DOUBLE,
+			 MPI_SUM,
+			 comm_ptr,
+			 mpierrno);
+    
+    sibuf = (int *) startbuf;
+    for(i = 0; i < (displs[rank]*recv_size/sizeof(int)); ++i) 
+      sibuf[i] = (int)tmprbuf[i];
+    
+    for(i = start/sizeof(int); i < buffer_sum/sizeof(int); ++i) 
+      sibuf[i] = (int)tmprbuf[i];
+
+    MPIU_Free(tmprbuf);
+    return rc;
+  }
+
+ direct_algo:
+
   start = 0;
   length = displs[rank] * recv_size;
   memset(startbuf + start, 0, length);
@@ -81,15 +134,8 @@ int MPIDO_Allgatherv_allreduce(const void *sendbuf,
 			 recvcounts[rank]) * recv_size;
   memset(startbuf + start, 0, length);
 
-  if (sendbuf != MPI_IN_PLACE)
-  {
-    char *outputbuf = (char *) sendbuf + send_true_lb;
-    memcpy(destbuf, outputbuf, send_size);
-  }
-
-
-   TRACE_ERR("Calling MPIDO_Allreduce from MPIDO_Allgatherv_allreduce\n");
-   /* TODO: Change to PAMI allreduce */
+  TRACE_ERR("Calling MPIDO_Allreduce from MPIDO_Allgatherv_allreduce\n");
+  /* TODO: Change to PAMI allreduce */
   rc = MPIDO_Allreduce(MPI_IN_PLACE,
 		       startbuf,
 		       buffer_sum/sizeof(unsigned),
@@ -98,7 +144,7 @@ int MPIDO_Allgatherv_allreduce(const void *sendbuf,
 		       comm_ptr,
                        mpierrno);
 
-   TRACE_ERR("Leaving MPIDO_Allgatherv_allreduce\n");
+  TRACE_ERR("Leaving MPIDO_Allgatherv_allreduce\n");
   return rc;
 }
 
@@ -220,7 +266,7 @@ int MPIDO_Allgatherv_alltoall(const void *sendbuf,
 
    TRACE_ERR("Calling alltoallv in MPIDO_Allgatherv_alltoallv\n");
    /* TODO: Change to PAMI alltoallv */
-  rc = MPIR_Alltoallv(a2a_sendbuf,
+  rc = MPIDO_Alltoallv(a2a_sendbuf,
 		       a2a_sendcounts,
 		       a2a_senddispls,
 		       MPI_CHAR,
@@ -298,10 +344,10 @@ MPIDO_Allgatherv(const void *sendbuf,
   allred.cmd.xfer_allreduce.rtypecount = 6;
   allred.cmd.xfer_allreduce.op = PAMI_DATA_BAND;
 
-   use_alltoall = mpid->allgathervs[2];
-   use_tree_reduce = mpid->allgathervs[0];
-   use_bcast = mpid->allgathervs[1];
-   use_pami = selected_type != MPID_COLL_USE_MPICH;
+  use_alltoall = mpid->allgathervs[2];
+  use_tree_reduce = mpid->allgathervs[0];
+  use_bcast = mpid->allgathervs[1];
+  use_pami = selected_type != MPID_COLL_USE_MPICH;
 	 
    if((sendbuf != MPI_IN_PLACE) && (MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS))
      use_pami = 0;

http://git.mpich.org/mpich.git/commitdiff/69ebc32fc07b36727a8ea9690d21cc3616ec191e

commit 69ebc32fc07b36727a8ea9690d21cc3616ec191e
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Fri Jan 11 11:02:44 2013 -0600

    MPI_Allgather glue protocol updates.
    
    (ibm) Trac #652
    (ibm) 3cd75510705df9ade6c6a098e78d41c3c6822032
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
index 756ec17..47acda0 100644
--- a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
+++ b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
@@ -204,8 +204,17 @@ int MPIDO_Allgather_alltoall(const void *sendbuf,
     a2a_recvcounts[rank] = 0;
   }
 
-/* TODO: Change to PAMI */
-  rc = MPIR_Alltoallv((const void *)a2a_sendbuf,
+  rc = MPIDO_Alltoallv((const void *)a2a_sendbuf,
+                       a2a_sendcounts,
+                       a2a_senddispls,
+                       MPI_CHAR,
+                       recvbuf,
+                       a2a_recvcounts,
+                       a2a_recvdispls,
+                       recvtype,
+                       comm_ptr,
+                       mpierrno); 
+/*  rc = MPIR_Alltoallv((const void *)a2a_sendbuf,
 		       a2a_sendcounts,
 		       a2a_senddispls,
 		       MPI_CHAR,
@@ -214,7 +223,7 @@ int MPIDO_Allgather_alltoall(const void *sendbuf,
 		       a2a_recvdispls,
 		       recvtype,
 		       comm_ptr,
-		       mpierrno);
+		       mpierrno); */
 
   return rc;
 }

http://git.mpich.org/mpich.git/commitdiff/07a94c0a595514ba127efd6d9b2077ca7d9d05f3

commit 07a94c0a595514ba127efd6d9b2077ca7d9d05f3
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Fri Dec 7 15:10:56 2012 -0600

    Add PAMID_COLLECTIVE_REDUCE=GLUE_ALLREDUCE
    
    (ibm) Issue 9185
    (ibm) f7a38942c46b35d0508af00b5c655996b5f9acd0
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_datatypes.h b/src/mpid/pamid/include/mpidi_datatypes.h
index 8f9bf88..33cf60a 100644
--- a/src/mpid/pamid/include/mpidi_datatypes.h
+++ b/src/mpid/pamid/include/mpidi_datatypes.h
@@ -310,7 +310,7 @@ struct MPIDI_Comm
   char allgathers[4];
   char allgathervs[4];
   char scattervs[2];
-  char optgather, optscatter;
+  char optgather, optscatter, optreduce;
 
   /* These need to be freed at geom destroy, so we need to store them
    * inside the communicator struct until destroy time rather than
diff --git a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
index f56010f..b3788a4 100644
--- a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
+++ b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
@@ -49,20 +49,21 @@ int MPIDO_Reduce(const void *sendbuf,
    pami_type_t pdt;
    int rc;
    int alg_selected = 0;
+   const int rank = comm_ptr->rank;
 #if ASSERT_LEVEL==0
    /* We can't afford the tracing in ndebug/performance libraries */
     const unsigned verbose = 0;
 #else
-    const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (comm_ptr->rank == 0);
+    const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (rank == 0);
 #endif
    const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
    const int selected_type = mpid->user_selected_type[PAMI_XFER_REDUCE];
 
    rc = MPIDI_Datatype_to_pami(datatype, &pdt, op, &pop, &mu);
    if(unlikely(verbose))
-      fprintf(stderr,"reduce - rc %u, dt: %p, op: %p, mu: %u, selectedvar %u != %u (MPICH)\n",
-         rc, pdt, pop, mu, 
-         (unsigned)selected_type, MPID_COLL_USE_MPICH);
+      fprintf(stderr,"reduce - rc %u, root %u, count %d, dt: %p, op: %p, mu: %u, selectedvar %u != %u (MPICH) sendbuf %p, recvbuf %p\n",
+	      rc, root, count, pdt, pop, mu, 
+	      (unsigned)selected_type, MPID_COLL_USE_MPICH,sendbuf, recvbuf);
 
    pami_xfer_t reduce;
    pami_algorithm_t my_reduce=0;
@@ -70,13 +71,6 @@ int MPIDO_Reduce(const void *sendbuf,
    int queryreq = 0;
    volatile unsigned reduce_active = 1;
 
-   if(selected_type == MPID_COLL_USE_MPICH || rc != MPI_SUCCESS)
-   {
-      if(unlikely(verbose))
-         fprintf(stderr,"Using MPICH reduce algorithm\n");
-      return MPIR_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm_ptr, mpierrno);
-   }
-
    MPIDI_Datatype_get_info(count, datatype, dt_contig, tsize, dt_null, true_lb);
    rbuf = (char *)recvbuf + true_lb;
    sbuf = (char *)sendbuf + true_lb;
@@ -89,6 +83,35 @@ int MPIDO_Reduce(const void *sendbuf,
 
    reduce.cb_done = reduce_cb_done;
    reduce.cookie = (void *)&reduce_active;
+   if(mpid->optreduce) /* GLUE_ALLREDUCE */
+   {
+      char* tbuf = NULL;
+      if(unlikely(verbose))
+         fprintf(stderr,"Using protocol GLUE_ALLREDUCE for reduce (%d,%d)\n",count,tsize);
+      MPIDI_Update_last_algorithm(comm_ptr, "REDUCE_OPT_ALLREDUCE");
+      void *destbuf = recvbuf;
+      if(rank != root) /* temp buffer for non-root destbuf */
+      {
+         tbuf = destbuf = MPIU_Malloc(tsize);
+      }
+      MPIDO_Allreduce(sendbuf,
+                      destbuf,
+                      count,
+                      datatype,
+                      op,
+                      comm_ptr,
+                      mpierrno);
+      if(tbuf)
+         MPIU_Free(tbuf);
+      return 0;
+   }
+   if(selected_type == MPID_COLL_USE_MPICH || rc != MPI_SUCCESS)
+   {
+      if(unlikely(verbose))
+         fprintf(stderr,"Using MPICH reduce algorithm\n");
+      return MPIR_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm_ptr, mpierrno);
+   }
+
    if(selected_type == MPID_COLL_OPTIMIZED)
    {
       if((mpid->cutoff_size[PAMI_XFER_REDUCE][0] == 0) || 
diff --git a/src/mpid/pamid/src/comm/mpid_selectcolls.c b/src/mpid/pamid/src/comm/mpid_selectcolls.c
index 4905dfd..d4c50df 100644
--- a/src/mpid/pamid/src/comm/mpid_selectcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_selectcolls.c
@@ -253,6 +253,18 @@ void MPIDI_Comm_coll_envvars(MPID_Comm *comm)
       char* names[] = {"PAMID_COLLECTIVE_ALLTOALL", NULL};
       MPIDI_Check_protocols(names, comm, "alltoall", PAMI_XFER_ALLTOALL);
    }
+   comm->mpid.optreduce = 0;
+   envopts = getenv("PAMID_COLLECTIVE_REDUCE");
+   if(envopts != NULL)
+   {
+      if(strcasecmp(envopts, "GLUE_ALLREDUCE") == 0)
+      {
+         if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm->rank == 0)
+            fprintf(stderr,"Selecting glue allreduce for reduce\n");
+         comm->mpid.optreduce = 1;
+      }
+   }
+   /* In addition to glue protocols, check for other PAMI protocols and check for PE now */
    {
       TRACE_ERR("Checking reduce\n");
       char* names[] = {"PAMID_COLLECTIVE_REDUCE", "MP_S_MPI_REDUCE", NULL};
@@ -394,7 +406,7 @@ void MPIDI_Comm_coll_envvars(MPID_Comm *comm)
       if(strcasecmp(envopts, "GLUE_REDUCE") == 0)
       {
          if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm->rank == 0)
-            fprintf(stderr,"using glue_reduce for gather\n");
+            fprintf(stderr,"Selecting glue reduce for gather\n");
          comm->mpid.optgather = 1;
       }
    }
@@ -572,6 +584,10 @@ void MPIDI_Comm_coll_query(MPID_Comm *comm)
             {
                fprintf(stderr,"comm[%p] coll type %d (%s), \"glue\" algorithm: GLUE_BCAST\n", comm, i, MPIDI_Coll_type_name(i));
             }
+            if(i == PAMI_XFER_REDUCE)
+            {
+               fprintf(stderr,"comm[%p] coll type %d (%s), \"glue\" algorithm: GLUE_ALLREDUCE\n", comm, i, MPIDI_Coll_type_name(i));
+            }
          }
       }
    }

http://git.mpich.org/mpich.git/commitdiff/f3093fa83df4c6483d1c21df5208d9ff23967404

commit f3093fa83df4c6483d1c21df5208d9ff23967404
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Thu Dec 13 14:03:40 2012 -0600

    PAMI_COLLECTIVES_MEMORY_OPTIMIZED does not optimize irregular communicators
    
    (ibm) Issue 9208
    (ibm) 9207de08e5685466e2191cd41db44d0cf677588b
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>
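A pattern recurs throughout the hunks below: the old `queryreq == MPID_COLL_ALWAYS_QUERY` test is replaced by a test on `my_md->check_fn`, so that when a protocol publishes no check function the glue inspects the metadata itself (in-place support, min/max message range). A minimal standalone sketch of that fallback follows; the `metadata_t` struct and its field names are simplified stand-ins for the real `pami_metadata_t` fields seen in the diffs, not the actual PAMI types:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical, trimmed-down metadata record modeled on the
 * pami_metadata_t fields used in the hunks below. */
typedef struct {
    const char *name;
    size_t range_lo, range_hi;
    int has_range;      /* stands in for check_correct.values.rangeminmax */
    int allows_inplace; /* stands in for check_correct.values.inplace */
    unsigned (*check_fn)(size_t bytes); /* NULL => inspect metadata manually */
} metadata_t;

/* Returns a nonzero failure bitmask if the protocol must be rejected
 * for this message, 0 if the algorithm may be selected. */
static unsigned query_protocol(const metadata_t *md, size_t bytes, int in_place)
{
    if (md->check_fn != NULL)
        return md->check_fn(bytes);    /* calling the check fn is sufficient */

    unsigned fail = 0;
    if (in_place && !md->allows_inplace)
        fail |= 1u;                    /* the "unspecified" bit in the real code */
    if (md->has_range && (bytes < md->range_lo || bytes > md->range_hi)) {
        fail |= 2u;                    /* the "range" bit */
        fprintf(stderr, "message size (%zu) outside range (%zu<->%zu) for %s.\n",
                bytes, md->range_lo, md->range_hi, md->name);
    }
    return fail;
}
```

In the real patches a nonzero bitmask makes the caller fall back to the MPICH implementation of the collective.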

diff --git a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
index 5dcb2ac..756ec17 100644
--- a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
+++ b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
@@ -242,8 +242,8 @@ MPIDO_Allgather(const void *sendbuf,
    MPI_Aint send_true_lb = 0;
    MPI_Aint recv_true_lb = 0;
    int rc, comm_size = comm_ptr->local_size;
-   size_t send_size = 0;
-   size_t recv_size = 0;
+   size_t send_bytes = 0;
+   size_t recv_bytes = 0;
    volatile unsigned allred_active = 1;
    volatile unsigned allgather_active = 1;
    pami_xfer_t allred;
@@ -303,11 +303,11 @@ MPIDO_Allgather(const void *sendbuf,
    MPIDI_Datatype_get_info(recvcount,
 			  recvtype,
         config[MPID_RECV_CONTIG],
-			  recv_size,
+			  recv_bytes,
 			  dt_null,
 			  recv_true_lb);
 
-   send_size = recv_size;
+   send_bytes = recv_bytes;
    rbuf = (char *)recvbuf+recv_true_lb;
 
    sbuf = PAMI_IN_PLACE;
@@ -318,7 +318,7 @@ MPIDO_Allgather(const void *sendbuf,
       MPIDI_Datatype_get_info(sendcount,
                             sendtype,
                             config[MPID_SEND_CONTIG],
-                            send_size,
+                            send_bytes,
                             dt_null,
                             send_true_lb);
       sbuf = (char *)sendbuf+send_true_lb;
@@ -347,12 +347,12 @@ MPIDO_Allgather(const void *sendbuf,
        use_alltoall = allgathers[2] &&
             config[MPID_RECV_CONTIG] && config[MPID_SEND_CONTIG];;
 
-      /* Note: some of the glue protocols use recv_size*comm_size rather than 
-       * recv_size so we use that for comparison here, plus we pass that in
+      /* Note: some of the glue protocols use recv_bytes*comm_size rather than 
+       * recv_bytes so we use that for comparison here, plus we pass that in
        * to those protocols. */
        use_tree_reduce =  allgathers[0] &&
          config[MPID_RECV_CONTIG] && config[MPID_SEND_CONTIG] &&
-         config[MPID_RECV_CONTINUOUS] && (recv_size*comm_size%sizeof(unsigned)) == 0;
+         config[MPID_RECV_CONTINUOUS] && (recv_bytes*comm_size%sizeof(unsigned)) == 0;
 
        use_bcast = allgathers[1];
 
@@ -368,12 +368,12 @@ MPIDO_Allgather(const void *sendbuf,
       allgather.cmd.xfer_allgather.sndbuf = sbuf;
       allgather.cmd.xfer_allgather.stype = PAMI_TYPE_BYTE;
       allgather.cmd.xfer_allgather.rtype = PAMI_TYPE_BYTE;
-      allgather.cmd.xfer_allgather.stypecount = send_size;
-      allgather.cmd.xfer_allgather.rtypecount = recv_size;
+      allgather.cmd.xfer_allgather.stypecount = send_bytes;
+      allgather.cmd.xfer_allgather.rtypecount = recv_bytes;
       if(selected_type == MPID_COLL_OPTIMIZED)
       {
         if((mpid->cutoff_size[PAMI_XFER_ALLGATHER][0] == 0) || 
-           (mpid->cutoff_size[PAMI_XFER_ALLGATHER][0] > 0 && mpid->cutoff_size[PAMI_XFER_ALLGATHER][0] >= send_size))
+           (mpid->cutoff_size[PAMI_XFER_ALLGATHER][0] > 0 && mpid->cutoff_size[PAMI_XFER_ALLGATHER][0] >= send_bytes))
         {
            allgather.algorithm = mpid->opt_protocol[PAMI_XFER_ALLGATHER][0];
            my_md = &mpid->opt_protocol_md[PAMI_XFER_ALLGATHER][0];
@@ -400,13 +400,31 @@ MPIDO_Allgather(const void *sendbuf,
          TRACE_ERR("Querying allgather protocol %s, type was: %d\n",
             my_md->name,
             selected_type);
-         if(queryreq == MPID_COLL_ALWAYS_QUERY)
+         if(my_md->check_fn == NULL)
          {
            /* process metadata bits */
            if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
               result.check.unspecified = 1;
+           if(my_md->check_correct.values.rangeminmax)
+           {
+             if((my_md->range_lo <= recv_bytes) &&
+                (my_md->range_hi >= recv_bytes))
+                ; /* ok, algorithm selected */
+             else
+             {
+               result.check.range = 1;
+               if(unlikely(verbose))
+               {   
+                 fprintf(stderr,"message size (%zu) outside range (%zu<->%zu) for %s.\n",
+                         recv_bytes,
+                         my_md->range_lo,
+                         my_md->range_hi,
+                         my_md->name);
+               }
+             }
+           }
          }
-         else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+         else /* calling the check fn is sufficient */
            result = my_md->check_fn(&allgather);
          TRACE_ERR("bitmask: %#X\n", result.bitmask);
          result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
@@ -451,7 +469,7 @@ MPIDO_Allgather(const void *sendbuf,
       MPIDI_Update_last_algorithm(comm_ptr, "ALLGATHER_OPT_ALLREDUCE");
      return MPIDO_Allgather_allreduce(sendbuf, sendcount, sendtype,
                                recvbuf, recvcount, recvtype,
-                               send_true_lb, recv_true_lb, send_size, recv_size*comm_size, comm_ptr, mpierrno);
+                               send_true_lb, recv_true_lb, send_bytes, recv_bytes*comm_size, comm_ptr, mpierrno);
    }
    if(use_alltoall)
    {
@@ -461,7 +479,7 @@ MPIDO_Allgather(const void *sendbuf,
       MPIDI_Update_last_algorithm(comm_ptr, "ALLGATHER_OPT_ALLTOALL");
      return MPIDO_Allgather_alltoall(sendbuf, sendcount, sendtype,
                                recvbuf, recvcount, recvtype,
-                               send_true_lb, recv_true_lb, send_size, recv_size*comm_size, comm_ptr, mpierrno);
+                               send_true_lb, recv_true_lb, send_bytes, recv_bytes*comm_size, comm_ptr, mpierrno);
    }
 
    if(use_bcast)
@@ -472,7 +490,7 @@ MPIDO_Allgather(const void *sendbuf,
      MPIDI_Update_last_algorithm(comm_ptr, "ALLGATHER_OPT_BCAST");
      return MPIDO_Allgather_bcast(sendbuf, sendcount, sendtype,
                                recvbuf, recvcount, recvtype,
-                               send_true_lb, recv_true_lb, send_size, recv_size*comm_size, comm_ptr, mpierrno);
+                               send_true_lb, recv_true_lb, send_bytes, recv_bytes*comm_size, comm_ptr, mpierrno);
    }
    
    /* Nothing used yet; dump to MPICH */
diff --git a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
index fd8e3a4..60e98dd 100644
--- a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
+++ b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
@@ -429,13 +429,32 @@ MPIDO_Allgatherv(const void *sendbuf,
          metadata_result_t result = {0};
          TRACE_ERR("Querying allgatherv_int protocol %s, type was %d\n", my_md->name,
             selected_type);
-         if(queryreq == MPID_COLL_ALWAYS_QUERY)
+         if(my_md->check_fn == NULL)
          {
            /* process metadata bits */
            if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
               result.check.unspecified = 1;
+         MPI_Aint data_true_lb;
+         MPID_Datatype *data_ptr;
+         int data_size, data_contig;
+         MPIDI_Datatype_get_info(sendcount, sendtype, data_contig, data_size, data_ptr, data_true_lb); 
+         if((my_md->range_lo <= data_size) &&
+            (my_md->range_hi >= data_size))
+            ; /* ok, algorithm selected */
+         else
+         {
+            result.check.range = 1;
+            if(unlikely(verbose))
+            {   
+               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                       data_size,
+                       my_md->range_lo,
+                       my_md->range_hi,
+                       my_md->name);
+            }
+         }
          }
-         else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+         else /* calling the check fn is sufficient */
            result = my_md->check_fn(&allgatherv);
          TRACE_ERR("Allgatherv bitmask: %#X\n", result.bitmask);
          result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
diff --git a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
index b930030..4ae5535 100644
--- a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
+++ b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
@@ -138,13 +138,32 @@ int MPIDO_Alltoall(const void *sendbuf,
       metadata_result_t result = {0};
       TRACE_ERR("querying alltoall protocol %s, query level was %d\n", pname,
          queryreq);
-      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+      if(my_md->check_fn == NULL)
       {
         /* process metadata bits */
          if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
             result.check.unspecified = 1;
+         MPI_Aint data_true_lb;
+         MPID_Datatype *data_ptr;
+         int data_size, data_contig;
+         MPIDI_Datatype_get_info(sendcount, sendtype, data_contig, data_size, data_ptr, data_true_lb); 
+         if((my_md->range_lo <= data_size) &&
+            (my_md->range_hi >= data_size))
+            ; /* ok, algorithm selected */
+         else
+         {
+            result.check.range = 1;
+            if(unlikely(verbose))
+            {   
+               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                       data_size,
+                       my_md->range_lo,
+                       my_md->range_hi,
+                       my_md->name);
+            }
+         }
       }
-      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+      else /* calling the check fn is sufficient */
          result = my_md->check_fn(&alltoall);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
diff --git a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
index ccdc3ba..eabc356 100644
--- a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
+++ b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
@@ -134,13 +134,32 @@ int MPIDO_Alltoallv(const void *sendbuf,
    {
       metadata_result_t result = {0};
       TRACE_ERR("querying alltoallv protocol %s, type was %d\n", pname, queryreq);
-      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+      if(my_md->check_fn == NULL)
       {
         /* process metadata bits */
          if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
             result.check.unspecified = 1;
+/*         MPI_Aint data_true_lb;
+         MPID_Datatype *data_ptr;
+         int data_size, data_contig;
+         MPIDI_Datatype_get_info(??, sendtype, data_contig, data_size, data_ptr, data_true_lb); 
+         if((my_md->range_lo <= data_size) &&
+            (my_md->range_hi >= data_size))
+            ; *//* ok, algorithm selected */
+/*         else
+         {
+            result.check.range = 1;
+            if(unlikely(verbose))
+            {   
+               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                       data_size,
+                       my_md->range_lo,
+                       my_md->range_hi,
+                       my_md->name);
+            }
+         }*/
       }
-      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+      else /* calling the check fn is sufficient */
          result = my_md->check_fn(&alltoallv);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
diff --git a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
index 4578613..4d5d450 100644
--- a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
+++ b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
@@ -190,15 +190,38 @@ int MPIDO_Bcast(void *buffer,
    {
       metadata_result_t result = {0};
       TRACE_ERR("querying bcast protocol %s, type was: %d\n",
-         my_md->name, queryreq);
-      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+                my_md->name, queryreq);
+      if(my_md->check_fn != NULL) /* calling the check fn is sufficient */
       {
-        /* process metadata bits */
-      }
-      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+         metadata_result_t result = {0};
          result = my_md->check_fn(&bcast);
+         result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+      } 
+      else /* no check_fn, manually look at the metadata fields */
+      {
+         TRACE_ERR("Optimzed selection line %d\n",__LINE__);
+         /* Check if the message range if restricted */
+         if(my_md->check_correct.values.rangeminmax)
+         {
+            if((my_md->range_lo <= data_size) &&
+               (my_md->range_hi >= data_size))
+               ; /* ok, algorithm selected */
+            else
+            {
+               result.check.range = 1;
+               if(unlikely(verbose))
+               {   
+                  fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                          data_size,
+                          my_md->range_lo,
+                          my_md->range_hi,
+                          my_md->name);
+               }
+            }
+         }
+         /* \todo check the rest of the metadata */
+      }
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
-      result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
       if(result.bitmask)
       {
          if(unlikely(verbose))
diff --git a/src/mpid/pamid/src/coll/gather/mpido_gather.c b/src/mpid/pamid/src/coll/gather/mpido_gather.c
index 0bebefe..8859f05 100644
--- a/src/mpid/pamid/src/coll/gather/mpido_gather.c
+++ b/src/mpid/pamid/src/coll/gather/mpido_gather.c
@@ -261,13 +261,31 @@ int MPIDO_Gather(const void *sendbuf,
       metadata_result_t result = {0};
       TRACE_ERR("querying gather protocol %s, type was %d\n",
          my_md->name, queryreq);
-      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+      if(my_md->check_fn == NULL)
       {
         /* process metadata bits */
         if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
            result.check.unspecified = 1;
+        if(my_md->check_correct.values.rangeminmax)
+        {
+          if((my_md->range_lo <= recv_bytes) &&
+             (my_md->range_hi >= recv_bytes))
+            ; /* ok, algorithm selected */
+          else
+          {
+            result.check.range = 1;
+            if(unlikely(verbose))
+            {   
+              fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                      recv_bytes,
+                      my_md->range_lo,
+                      my_md->range_hi,
+                      my_md->name);
+            }
+          }
+        }
       }
-      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+      else /* calling the check fn is sufficient */
         result = my_md->check_fn(&gather);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
diff --git a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
index e6c7273..a6f7b42 100644
--- a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
+++ b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
@@ -147,13 +147,32 @@ int MPIDO_Gatherv(const void *sendbuf,
       metadata_result_t result = {0};
       TRACE_ERR("querying gatherv protocol %s, type was %d\n", 
          my_md->name, queryreq);
-      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+      if(my_md->check_fn == NULL)
       {
          /* process metadata bits */
          if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
             result.check.unspecified = 1;
+         MPI_Aint data_true_lb;
+         MPID_Datatype *data_ptr;
+         int data_size, data_contig;
+         MPIDI_Datatype_get_info(sendcount, sendtype, data_contig, data_size, data_ptr, data_true_lb); 
+         if((my_md->range_lo <= data_size) &&
+            (my_md->range_hi >= data_size))
+            ; /* ok, algorithm selected */
+         else
+         {
+            result.check.range = 1;
+            if(unlikely(verbose))
+            {   
+               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                       data_size,
+                       my_md->range_lo,
+                       my_md->range_hi,
+                       my_md->name);
+            }
+         }
       }
-      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+      else /* calling the check fn is sufficient */
          result = my_md->check_fn(&gatherv);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
diff --git a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
index c8c0bb7..f56010f 100644
--- a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
+++ b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
@@ -124,36 +124,46 @@ int MPIDO_Reduce(const void *sendbuf,
    if(unlikely(queryreq == MPID_COLL_ALWAYS_QUERY || 
                queryreq == MPID_COLL_CHECK_FN_REQUIRED))
    {
-      if(my_md->check_fn != NULL)
+      metadata_result_t result = {0};
+      TRACE_ERR("Querying reduce protocol %s, type was %d\n",
+                my_md->name,
+                queryreq);
+      if(my_md->check_fn == NULL)
       {
-         metadata_result_t result = {0};
-         TRACE_ERR("Querying reduce protocol %s, type was %d\n",
-            my_md->name,
-            queryreq);
-         if(queryreq == MPID_COLL_ALWAYS_QUERY)
-         {
-            /* process metadata bits */
-            if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
-               result.check.unspecified = 1;
-         }
-         else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
-            result = my_md->check_fn(&reduce);
-         TRACE_ERR("Bitmask: %#X\n", result.bitmask);
-         result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
-         if(result.bitmask)
+         /* process metadata bits */
+         if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
+            result.check.unspecified = 1;
+         MPI_Aint data_true_lb;
+         MPID_Datatype *data_ptr;
+         int data_size, data_contig;
+         MPIDI_Datatype_get_info(count, datatype, data_contig, data_size, data_ptr, data_true_lb); 
+         if((my_md->range_lo <= data_size) &&
+            (my_md->range_hi >= data_size))
+            ; /* ok, algorithm selected */
+         else
          {
+            result.check.range = 1;
             if(unlikely(verbose))
-              fprintf(stderr,"Query failed for %s.  Using MPICH reduce.\n",
-                 my_md->name);
+            {   
+               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                       data_size,
+                       my_md->range_lo,
+                       my_md->range_hi,
+                       my_md->name);
+            }
          }
-         else alg_selected = 1;
       }
-      else
+      else /* calling the check fn is sufficient */
+         result = my_md->check_fn(&reduce);
+      TRACE_ERR("Bitmask: %#X\n", result.bitmask);
+      result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+      if(result.bitmask)
       {
-         /* No check function, but check required */
-         /* look at meta data */
-         /* assert(0);*/
-      }
+         if(unlikely(verbose))
+            fprintf(stderr,"Query failed for %s.  Using MPICH reduce.\n",
+                    my_md->name);
+      }  
+      else alg_selected = 1;
    }
 
    if(alg_selected)
diff --git a/src/mpid/pamid/src/coll/scan/mpido_scan.c b/src/mpid/pamid/src/coll/scan/mpido_scan.c
index 6c073c1..4a4e37a 100644
--- a/src/mpid/pamid/src/coll/scan/mpido_scan.c
+++ b/src/mpid/pamid/src/coll/scan/mpido_scan.c
@@ -137,13 +137,32 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
       TRACE_ERR("Querying scan protocol %s, type was %d\n",
          my_md->name,
          selected_type);
-      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+      if(my_md->check_fn == NULL)
       {
         /* process metadata bits */
          if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
             result.check.unspecified = 1;
+         MPI_Aint data_true_lb;
+         MPID_Datatype *data_ptr;
+         int data_size, data_contig;
+         MPIDI_Datatype_get_info(count, datatype, data_contig, data_size, data_ptr, data_true_lb); 
+         if((my_md->range_lo <= data_size) &&
+            (my_md->range_hi >= data_size))
+            ; /* ok, algorithm selected */
+         else
+         {
+            result.check.range = 1;
+            if(unlikely(verbose))
+            {   
+               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                       data_size,
+                       my_md->range_lo,
+                       my_md->range_hi,
+                       my_md->name);
+            }
+         }
       }
-      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+      else /* calling the check fn is sufficient */
          result = my_md->check_fn(&scan);
       TRACE_ERR("Bitmask: %#X\n", result.bitmask);
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
diff --git a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
index 01667d2..aca6c89 100644
--- a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
+++ b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
@@ -236,13 +236,32 @@ int MPIDO_Scatter(const void *sendbuf,
       metadata_result_t result = {0};
       TRACE_ERR("querying scatter protoocl %s, type was %d\n",
          my_md->name, queryreq);
-      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+      if(my_md->check_fn == NULL)
       {
         /* process metadata bits */
         if((!my_md->check_correct.values.inplace) && (recvbuf == MPI_IN_PLACE))
            result.check.unspecified = 1;
+         MPI_Aint data_true_lb;
+         MPID_Datatype *data_ptr;
+         int data_size, data_contig;
+         MPIDI_Datatype_get_info(recvcount, recvtype, data_contig, data_size, data_ptr, data_true_lb); 
+         if((my_md->range_lo <= data_size) &&
+            (my_md->range_hi >= data_size))
+            ; /* ok, algorithm selected */
+         else
+         {
+            result.check.range = 1;
+            if(unlikely(verbose))
+            {   
+               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                       data_size,
+                       my_md->range_lo,
+                       my_md->range_hi,
+                       my_md->name);
+            }
+         }
       }
-      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+      else /* calling the check fn is sufficient */
         result = my_md->check_fn(&scatter);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
diff --git a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
index e5f8e7f..6f899ff 100644
--- a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
+++ b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
@@ -355,13 +355,32 @@ int MPIDO_Scatterv(const void *sendbuf,
       metadata_result_t result = {0};
       TRACE_ERR("querying scatterv protocol %s, type was %d\n",
          my_md->name, queryreq);
-      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+      if(my_md->check_fn == NULL)
       {
         /* process metadata bits */
         if((!my_md->check_correct.values.inplace) && (recvbuf == MPI_IN_PLACE))
            result.check.unspecified = 1;
+         MPI_Aint data_true_lb;
+         MPID_Datatype *data_ptr;
+         int data_size, data_contig;
+         MPIDI_Datatype_get_info(recvcount, recvtype, data_contig, data_size, data_ptr, data_true_lb); 
+         if((my_md->range_lo <= data_size) &&
+            (my_md->range_hi >= data_size))
+            ; /* ok, algorithm selected */
+         else
+         {
+            result.check.range = 1;
+            if(unlikely(verbose))
+            {   
+               fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                       data_size,
+                       my_md->range_lo,
+                       my_md->range_hi,
+                       my_md->name);
+            }
+         }
       }
-      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+      else /* calling the check fn is sufficient */
         result = my_md->check_fn(&scatterv);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
diff --git a/src/mpid/pamid/src/comm/mpid_comm.c b/src/mpid/pamid/src/comm/mpid_comm.c
index 93baee8..89fea1c 100644
--- a/src/mpid/pamid/src/comm/mpid_comm.c
+++ b/src/mpid/pamid/src/comm/mpid_comm.c
@@ -231,6 +231,14 @@ void MPIDI_Coll_comm_create(MPID_Comm *comm)
 
       TRACE_ERR("Waiting for geom create to finish\n");
       MPID_PROGRESS_WAIT_WHILE(geom_init);
+
+      if(comm->mpid.geometry == NULL)
+      {
+         if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm->rank == 0))
+            fprintf(stderr,"Created unoptimized communicator id=%u, size=%u\n", (unsigned) comm->context_id,comm->local_size);
+         MPIU_TestFree(&comm->coll_fns);
+         return;
+      }
    }
 
    TRACE_ERR("Querying protocols\n");

http://git.mpich.org/mpich.git/commitdiff/964ce7d207fd6374309309ec31ea5a8cd3e2e9e8

commit 964ce7d207fd6374309309ec31ea5a8cd3e2e9e8
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Wed Dec 5 17:56:02 2012 -0600

    Do optimized selection even if first protocol list is empty
    
    (ibm) Issue 8756
    (ibm) e8f70eff99ac0910c65071d0b71786892583642c
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>
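The patch below turns the old two-way branch (empty "always works" list means use MPICH) into a three-way one, so an empty first list with a non-empty "must query" list no longer forces MPICH. A sketch of the resulting decision tree, using hypothetical names in place of the real `comm->mpid` fields:

```c
/* Outcome of per-collective protocol selection. */
typedef struct { int use_mpich; int needs_query; } selection_t;

/* Mirrors the three-way branch the patch adds: both lists empty -> MPICH;
 * an "always works" protocol available -> take it directly; otherwise a
 * query protocol is registered, but the selection type stays NOSELECTION
 * so the optimized-collective pass may still override it. */
static selection_t select_collective(int always_works, int must_query)
{
    selection_t s = { 0, 0 };
    if (always_works == 0 && must_query == 0)
        s.use_mpich = 1;       /* nothing available at all */
    else if (always_works == 0)
        s.needs_query = 1;     /* only must-query protocols exist */
    /* else: the first always-works protocol is used without a query */
    return s;
}
```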

diff --git a/src/mpid/pamid/src/comm/mpid_selectcolls.c b/src/mpid/pamid/src/comm/mpid_selectcolls.c
index 1e974d2..4905dfd 100644
--- a/src/mpid/pamid/src/comm/mpid_selectcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_selectcolls.c
@@ -195,17 +195,24 @@ void MPIDI_Comm_coll_envvars(MPID_Comm *comm)
       comm->mpid.user_selected_type[i] = MPID_COLL_NOSELECTION;
          if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm->rank == 0)
             fprintf(stderr,"Setting up collective %d on comm %p\n", i, comm);
-      if(comm->mpid.coll_count[i][0] == 0)
+	 if((comm->mpid.coll_count[i][0] == 0) && (comm->mpid.coll_count[i][1] == 0))
       {
          comm->mpid.user_selected_type[i] = MPID_COLL_USE_MPICH;
          comm->mpid.user_selected[i] = 0;
       }
-      else
+	 else if(comm->mpid.coll_count[i][0] != 0)
       {
          comm->mpid.user_selected[i] = comm->mpid.coll_algorithm[i][0][0];
          memcpy(&comm->mpid.user_metadata[i], &comm->mpid.coll_metadata[i][0][0],
                sizeof(pami_metadata_t));
       }
+	 else
+	   {
+	     MPIDI_Update_coll(i, MPID_COLL_QUERY, 0, comm);
+	     /* even though it's a query protocol, say NOSELECTION 
+		so the optcoll selection will override (maybe) */
+	     comm->mpid.user_selected_type[i] = MPID_COLL_NOSELECTION;
+	   }
    }
 
 

http://git.mpich.org/mpich.git/commitdiff/fd39fc2b4f9fce35c057bc09e22666e85b627f93

commit fd39fc2b4f9fce35c057bc09e22666e85b627f93
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Wed Dec 5 15:47:32 2012 -0600

    Remove misleading verbose output
    
    (ibm) Issue 9159
    (ibm) b7ee8fbf401396fd5022e7ac78ad3267ca35e20c
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/comm/mpid_selectcolls.c b/src/mpid/pamid/src/comm/mpid_selectcolls.c
index 57cb3cc..1e974d2 100644
--- a/src/mpid/pamid/src/comm/mpid_selectcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_selectcolls.c
@@ -197,8 +197,6 @@ void MPIDI_Comm_coll_envvars(MPID_Comm *comm)
             fprintf(stderr,"Setting up collective %d on comm %p\n", i, comm);
       if(comm->mpid.coll_count[i][0] == 0)
       {
-         if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm->rank == 0)
-            fprintf(stderr,"There are no 'always works' protocols of type %d. This could be a problem later in your app\n", i);
          comm->mpid.user_selected_type[i] = MPID_COLL_USE_MPICH;
          comm->mpid.user_selected[i] = 0;
       }

http://git.mpich.org/mpich.git/commitdiff/ba5b2ce35bb828791afdd1c8d70f7e3fe73015c9

commit ba5b2ce35bb828791afdd1c8d70f7e3fe73015c9
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Mon Dec 3 12:14:34 2012 -0600

    More thorough check for optimized bcast protocol
    
    (ibm) Issue 9159
    (ibm) fc2af458fc604f6b06126c5191bcf20983e8503c
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>
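The bcast changes below make each protocol lookup scan the "always works" list first and fall back to the "must query" list only if the name was not found, flagging the latter case. A sketch of that two-pass search (the list layout and capacity are assumed for illustration; the real code walks `coll_metadata[PAMI_XFER_BROADCAST][list][i].name`):

```c
#include <strings.h> /* strcasecmp, as used by the real selection code */

/* Search two protocol-name lists for `wanted`, preferring list 0
 * ("always works") over list 1 ("must query"). Returns the index in
 * whichever list matched, or -1; *mustquery is set only when the
 * match came from the must-query list. */
static int find_protocol(const char *lists[2][8], const int counts[2],
                         const char *wanted, int *mustquery)
{
    int i;
    *mustquery = 0;
    for (i = 0; i < counts[0]; i++)
        if (strcasecmp(lists[0][i], wanted) == 0)
            return i;
    for (i = 0; i < counts[1]; i++)
        if (strcasecmp(lists[1][i], wanted) == 0) {
            *mustquery = 1;
            return i;
        }
    return -1;
}
```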

diff --git a/src/mpid/pamid/src/comm/mpid_optcolls.c b/src/mpid/pamid/src/comm/mpid_optcolls.c
index 341901a..e2bf8e4 100644
--- a/src/mpid/pamid/src/comm/mpid_optcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_optcolls.c
@@ -525,7 +525,18 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       }
     if(opt_proto == -1)
     {
-      for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
+      /* this protocol is sometimes query, sometimes always works so check both lists but prefer always works */
+      for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][0]; i++)
+      {
+        /* This is a good choice for small messages only */
+        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:RankBased_Binomial:SHMEM:MU") == 0)
+        {
+          opt_proto = i;
+          comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0] = 256;
+          break;
+        }
+      }
+      if(opt_proto == -1) for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
       {
         /* This is a good choice for small messages only */
         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:RankBased_Binomial:SHMEM:MU") == 0)
@@ -540,12 +551,18 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
     /* Next best to check */
     if(opt_proto == -1)
     {
+      /* this protocol is sometimes query, sometimes always works so check both lists but prefer always works */
       for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][0]; i++)
       {
-        /* Also, NOT in the 'must query' list */
         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:2-nomial:SHMEM:MU") == 0)
           opt_proto = i;
       }
+      if(opt_proto == -1) for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
+      {
+        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:2-nomial:SHMEM:MU") == 0)
+          opt_proto = i;
+          mustquery = 1;
+      }
     }
 
     /* These protocols are good for most message sizes, but there are some
@@ -617,11 +634,18 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       {
         /* This protocol was only good for up to 256, and it was an irregular, so let's set
          * 2-nomial for larger message sizes. Cutoff should have already been set to 256 too */
+        /* this protocol is sometimes "must query" and sometimes "always works", so check both lists but prefer "always works" */
         for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][0]; i++)
         {
           if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:2-nomial:SHMEM:MU") == 0)
             opt_proto = i;
         }
+        if(opt_proto == -1) for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
+        {
+          if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:2-nomial:SHMEM:MU") == 0)
+          { opt_proto = i;
+            mustquery = 1; }
+        }
       }
 
       if(opt_proto != -1)

http://git.mpich.org/mpich.git/commitdiff/cae9c7a1ddbaada469923ffc8e077e795075193f

commit cae9c7a1ddbaada469923ffc8e077e795075193f
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Mon Dec 3 11:59:48 2012 -0600

    Metadata glue fixes
    
    (ibm) Issue 9159
    (ibm) ed90c94e27d58b25452cf1f396c928a151f0f52e
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/comm/mpid_optcolls.c b/src/mpid/pamid/src/comm/mpid_optcolls.c
index d9b19bc..341901a 100644
--- a/src/mpid/pamid/src/comm/mpid_optcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_optcolls.c
@@ -30,54 +30,54 @@
 
 static int MPIDI_Check_FCA_envvar(char *string, int *user_range_hi)
 {
-   char *env = getenv("MP_MPI_PAMI_FOR");
-   if(env != NULL)
-   {
-      if(strcasecmp(env, "ALL") == 0)
-         return 1;
-      int len = strlen(env);
-      len++;
-      char *temp = MPIU_Malloc(sizeof(char) * len);
-      char *ptrToFree = temp;
-      strcpy(temp, env);
-      char *sepptr;
-      for(sepptr = temp; (sepptr = strsep(&temp, ",")) != NULL ; )
+  char *env = getenv("MP_MPI_PAMI_FOR");
+  if(env != NULL)
+  {
+    if(strcasecmp(env, "ALL") == 0)
+      return 1;
+    int len = strlen(env);
+    len++;
+    char *temp = MPIU_Malloc(sizeof(char) * len);
+    char *ptrToFree = temp;
+    strcpy(temp, env);
+    char *sepptr;
+    for(sepptr = temp; (sepptr = strsep(&temp, ",")) != NULL ; )
+    {
+      char *subsepptr, *temp_sepptr;
+      temp_sepptr = sepptr;
+      subsepptr = strsep(&temp_sepptr, ":");
+      if(temp_sepptr != NULL)/* SSS: There is a colon for this collective */
       {
-         char *subsepptr, *temp_sepptr;
-         temp_sepptr = sepptr;
-         subsepptr = strsep(&temp_sepptr, ":");
-         if(temp_sepptr != NULL)/* SSS: There is a a colon for this collective */
-         {
-             if(strcasecmp(subsepptr, string) == 0)
-             {
-                *user_range_hi = atoi(temp_sepptr);
-                MPIU_Free(ptrToFree);
-                return 1;
-             }
-             else
-                sepptr++;
-         }
-         else
-         { 
-             if(strcasecmp(sepptr, string) == 0)
-             {
-                *user_range_hi = -1;
-                MPIU_Free(ptrToFree);
-                return 1;
-             }
-             else
-                sepptr++;
-         }
+        if(strcasecmp(subsepptr, string) == 0)
+        {
+          *user_range_hi = atoi(temp_sepptr);
+          MPIU_Free(ptrToFree);
+          return 1;
+        }
+        else
+          sepptr++;
       }
-      /* We didn't find it, but the end var was set, so return 0 */
-      MPIU_Free(ptrToFree);
-      return 0;
-   }
-   if(MPIDI_Process.optimized.collectives == MPID_COLL_FCA)
-      return 1; /* To have gotten this far, opt colls are on. If the env doesn't exist,
-                   we should use the FCA protocol for "string" */
-   else
-      return -1; /* we don't have any FCA things */
+      else
+      {
+        if(strcasecmp(sepptr, string) == 0)
+        {
+          *user_range_hi = -1;
+          MPIU_Free(ptrToFree);
+          return 1;
+        }
+        else
+          sepptr++;
+      }
+    }
+    /* We didn't find it, but the end var was set, so return 0 */
+    MPIU_Free(ptrToFree);
+    return 0;
+  }
+  if(MPIDI_Process.optimized.collectives == MPID_COLL_FCA)
+    return 1; /* To have gotten this far, opt colls are on. If the env doesn't exist,
+                 we should use the FCA protocol for "string" */
+  else
+    return -1; /* we don't have any FCA things */
 }
 
 static inline void 
@@ -87,763 +87,776 @@ MPIDI_Coll_comm_check_FCA(char *coll_name,
                           int query_type,
                           int proto_num,
                           MPID_Comm *comm_ptr)
-{                        
-   int opt_proto = -1;
-   int i;
-   int user_range_hi = -1;/* SSS: By default we assume user hasn't defined a range_hi (cutoff_size) */
+{
+  int opt_proto = -1;
+  int i;
+  int user_range_hi = -1;/* SSS: By default we assume user hasn't defined a range_hi (cutoff_size) */
 #ifdef TRACE_ON
-   char *envstring = getenv("MP_MPI_PAMI_FOR");
+  char *envstring = getenv("MP_MPI_PAMI_FOR");
 #endif
-   TRACE_ERR("Checking for %s in %s\n", coll_name, envstring);
-   int check_var = MPIDI_Check_FCA_envvar(coll_name, &user_range_hi);
-   if(check_var == 1)
-   {
-      TRACE_ERR("Found %s\n",coll_name);
-      /* Look for protocol_name in the "always works list */
-      for(i = 0; i <comm_ptr->mpid.coll_count[pami_xfer][0]; i++)
+  TRACE_ERR("Checking for %s in %s\n", coll_name, envstring);
+  int check_var = MPIDI_Check_FCA_envvar(coll_name, &user_range_hi);
+  if(check_var == 1)
+  {
+    TRACE_ERR("Found %s\n",coll_name);
+    /* Look for protocol_name in the "always works" list */
+    for(i = 0; i <comm_ptr->mpid.coll_count[pami_xfer][0]; i++)
+    {
+      if(strcasecmp(comm_ptr->mpid.coll_metadata[pami_xfer][0][i].name, protocol_name) == 0)
       {
-         if(strcasecmp(comm_ptr->mpid.coll_metadata[pami_xfer][0][i].name, protocol_name) == 0)
-         {
-            opt_proto = i;
-            break;
-         }
+        opt_proto = i;
+        break;
       }
-      if(opt_proto != -1) /* we found it, so copy it to the optimized var */
-      {                                                                                           
-         TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimized protocol\n",
-               pami_xfer, opt_proto,
-               comm_ptr->mpid.coll_metadata[pami_xfer][0][opt_proto].name);
-            comm_ptr->mpid.opt_protocol[pami_xfer][proto_num] =
-                  comm_ptr->mpid.coll_algorithm[pami_xfer][0][opt_proto];
-            memcpy(&comm_ptr->mpid.opt_protocol_md[pami_xfer][proto_num],
-                  &comm_ptr->mpid.coll_metadata[pami_xfer][0][opt_proto],
-                  sizeof(pami_metadata_t));
-            comm_ptr->mpid.must_query[pami_xfer][proto_num] = query_type;
-            if(user_range_hi != -1)
-              comm_ptr->mpid.cutoff_size[pami_xfer][proto_num] = user_range_hi;
-            else
-              comm_ptr->mpid.cutoff_size[pami_xfer][proto_num] = 0;
-            comm_ptr->mpid.user_selected_type[pami_xfer] = MPID_COLL_OPTIMIZED;
-      }                                                                                           
-      else /* see if it is in the must query list instead */
+    }
+    if(opt_proto != -1) /* we found it, so copy it to the optimized var */
+    {
+      TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimized protocol\n",
+                pami_xfer, opt_proto,
+                comm_ptr->mpid.coll_metadata[pami_xfer][0][opt_proto].name);
+      comm_ptr->mpid.opt_protocol[pami_xfer][proto_num] =
+      comm_ptr->mpid.coll_algorithm[pami_xfer][0][opt_proto];
+      memcpy(&comm_ptr->mpid.opt_protocol_md[pami_xfer][proto_num],
+             &comm_ptr->mpid.coll_metadata[pami_xfer][0][opt_proto],
+             sizeof(pami_metadata_t));
+      comm_ptr->mpid.must_query[pami_xfer][proto_num] = query_type;
+      if(user_range_hi != -1)
+        comm_ptr->mpid.cutoff_size[pami_xfer][proto_num] = user_range_hi;
+      else
+        comm_ptr->mpid.cutoff_size[pami_xfer][proto_num] = 0;
+      comm_ptr->mpid.user_selected_type[pami_xfer] = MPID_COLL_OPTIMIZED;
+    }
+    else /* see if it is in the must query list instead */
+    {
+      for(i = 0; i <comm_ptr->mpid.coll_count[pami_xfer][1]; i++)
       {
-         for(i = 0; i <comm_ptr->mpid.coll_count[pami_xfer][1]; i++)
-         {
-            if(strcasecmp(comm_ptr->mpid.coll_metadata[pami_xfer][1][i].name, protocol_name) == 0)
-            {
-               opt_proto = i;
-               break;
-            }
-         }
-         if(opt_proto != -1) /* ok, it was in the must query list */
-         {
-            TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimized protocol\n",
+        if(strcasecmp(comm_ptr->mpid.coll_metadata[pami_xfer][1][i].name, protocol_name) == 0)
+        {
+          opt_proto = i;
+          break;
+        }
+      }
+      if(opt_proto != -1) /* ok, it was in the must query list */
+      {
+        TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimized protocol\n",
                   pami_xfer, opt_proto,
                   comm_ptr->mpid.coll_metadata[pami_xfer][1][opt_proto].name);
-            comm_ptr->mpid.opt_protocol[pami_xfer][proto_num] =
-                  comm_ptr->mpid.coll_algorithm[pami_xfer][1][opt_proto];
-            memcpy(&comm_ptr->mpid.opt_protocol_md[pami_xfer][proto_num],
-                  &comm_ptr->mpid.coll_metadata[pami_xfer][1][opt_proto],
-                  sizeof(pami_metadata_t));
-            comm_ptr->mpid.must_query[pami_xfer][proto_num] = query_type;
-            if(user_range_hi != -1)
-              comm_ptr->mpid.cutoff_size[pami_xfer][proto_num] = user_range_hi;
-            else
-              comm_ptr->mpid.cutoff_size[pami_xfer][proto_num] = 0;
-            comm_ptr->mpid.user_selected_type[pami_xfer] = MPID_COLL_OPTIMIZED;
-         }
-         else /* that protocol doesn't exist */
-         {
-            TRACE_ERR("Couldn't find %s in the list for %s, reverting to MPICH\n",protocol_name,coll_name);
-                  comm_ptr->mpid.user_selected_type[pami_xfer] = MPID_COLL_USE_MPICH;
-         }
+        comm_ptr->mpid.opt_protocol[pami_xfer][proto_num] =
+        comm_ptr->mpid.coll_algorithm[pami_xfer][1][opt_proto];
+        memcpy(&comm_ptr->mpid.opt_protocol_md[pami_xfer][proto_num],
+               &comm_ptr->mpid.coll_metadata[pami_xfer][1][opt_proto],
+               sizeof(pami_metadata_t));
+        comm_ptr->mpid.must_query[pami_xfer][proto_num] = query_type;
+        if(user_range_hi != -1)
+          comm_ptr->mpid.cutoff_size[pami_xfer][proto_num] = user_range_hi;
+        else
+          comm_ptr->mpid.cutoff_size[pami_xfer][proto_num] = 0;
+        comm_ptr->mpid.user_selected_type[pami_xfer] = MPID_COLL_OPTIMIZED;
       }
-   }
-   else if(check_var == 0)/* The env var was set, but wasn't set for coll_name */
-   {
-      TRACE_ERR("Couldn't find any optimal %s protocols or user chose not to set it. Selecting MPICH\n",coll_name);
-                  comm_ptr->mpid.user_selected_type[pami_xfer] = MPID_COLL_USE_MPICH;
-   }
-   else
-      return; 
+      else /* that protocol doesn't exist */
+      {
+        TRACE_ERR("Couldn't find %s in the list for %s, reverting to MPICH\n",protocol_name,coll_name);
+        comm_ptr->mpid.user_selected_type[pami_xfer] = MPID_COLL_USE_MPICH;
+      }
+    }
+  }
+  else if(check_var == 0)/* The env var was set, but wasn't set for coll_name */
+  {
+    TRACE_ERR("Couldn't find any optimal %s protocols or user chose not to set it. Selecting MPICH\n",coll_name);
+    comm_ptr->mpid.user_selected_type[pami_xfer] = MPID_COLL_USE_MPICH;
+  }
+  else
+    return; 
 }
 
 
 void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
 {
-   TRACE_ERR("Entering MPIDI_Comm_coll_select\n");
-   int opt_proto = -1;
-   int mustquery = 0;
-   int i;
-   int use_threaded_collectives = 1;
-
-   /* Some highly optimized protocols (limited resource) do not
-      support MPI_THREAD_MULTIPLE semantics so do not enable them
-      except on COMM_WORLD.
-      NOTE: we are not checking metadata because these are known,
-      hardcoded optimized protocols.
-   */
-   if((MPIR_ThreadInfo.thread_provided == MPI_THREAD_MULTIPLE) &&
-      (comm_ptr != MPIR_Process.comm_world)) use_threaded_collectives = 0;
-   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm_ptr->rank == 0))
-      fprintf(stderr, "thread_provided=%s, %scomm_world, use_threaded_collectives %u\n", 
-              MPIR_ThreadInfo.thread_provided == MPI_THREAD_MULTIPLE? "MPI_THREAD_MULTIPLE":
-              MPIR_ThreadInfo.thread_provided == MPI_THREAD_SINGLE?"MPI_THREAD_SINGLE":
-              MPIR_ThreadInfo.thread_provided == MPI_THREAD_FUNNELED?"MPI_THREAD_FUNNELED":
-              MPIR_ThreadInfo.thread_provided == MPI_THREAD_SERIALIZED?"MPI_THREAD_SERIALIZED":
-              "??", 
-              (comm_ptr != MPIR_Process.comm_world)?"!":"",
-              use_threaded_collectives);
-   
-   /* First, setup the (easy, allreduce is complicated) FCA collectives if there 
-    * are any because they are always usable when they are on */
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_REDUCE] == MPID_COLL_NOSELECTION)
-   {
-      MPIDI_Coll_comm_check_FCA("REDUCE","I1:Reduce:FCA:FCA",PAMI_XFER_REDUCE,MPID_COLL_CHECK_FN_REQUIRED, 0, comm_ptr);
-   }
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHER] == MPID_COLL_NOSELECTION)
-   {
-      MPIDI_Coll_comm_check_FCA("ALLGATHER","I1:Allgather:FCA:FCA",PAMI_XFER_ALLGATHER,MPID_COLL_NOQUERY, 0, comm_ptr);
-   }
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] == MPID_COLL_NOSELECTION)
-   {
-      MPIDI_Coll_comm_check_FCA("ALLGATHERV","I1:AllgathervInt:FCA:FCA",PAMI_XFER_ALLGATHERV_INT,MPID_COLL_NOQUERY, 0, comm_ptr);
-   }
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] == MPID_COLL_NOSELECTION)
-   {
-      MPIDI_Coll_comm_check_FCA("BCAST", "I1:Broadcast:FCA:FCA", PAMI_XFER_BROADCAST, MPID_COLL_NOQUERY, 0, comm_ptr);
-      MPIDI_Coll_comm_check_FCA("BCAST", "I1:Broadcast:FCA:FCA", PAMI_XFER_BROADCAST, MPID_COLL_NOQUERY, 1, comm_ptr);
-   }
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BARRIER] == MPID_COLL_NOSELECTION)
-   {
-      MPIDI_Coll_comm_check_FCA("BARRIER","I1:Barrier:FCA:FCA",PAMI_XFER_BARRIER,MPID_COLL_NOQUERY, 0, comm_ptr);
-   }
-   /* SSS: There isn't really an FCA Gatherv protocol. We do this call to force the use of MPICH for gatherv
-    * when FCA is enabled so we don't have to use PAMI protocol.  */
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_GATHERV_INT] == MPID_COLL_NOSELECTION)
-   {
-      MPIDI_Coll_comm_check_FCA("GATHERV","I1:GathervInt:FCA:FCA",PAMI_XFER_GATHERV_INT,MPID_COLL_NOQUERY, 0, comm_ptr);
-   }
-
-   opt_proto = -1;
-   mustquery = 0;
-   /* So, several protocols are really easy. Tackle them first. */
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] == MPID_COLL_NOSELECTION)
-   {
-      TRACE_ERR("No allgatherv[int] env var, so setting optimized allgatherv[int]\n");
-      /* Use I0:RectangleDput */
-      for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_ALLGATHERV_INT][1]; i++)
+  TRACE_ERR("Entering MPIDI_Comm_coll_select\n");
+  int opt_proto = -1;
+  int mustquery = 0;
+  int i;
+  int use_threaded_collectives = 1;
+
+  /* Some highly optimized protocols (limited resource) do not
+     support MPI_THREAD_MULTIPLE semantics so do not enable them
+     except on COMM_WORLD.
+     NOTE: we are not checking metadata because these are known,
+     hardcoded optimized protocols.
+  */
+  if((MPIR_ThreadInfo.thread_provided == MPI_THREAD_MULTIPLE) &&
+     (comm_ptr != MPIR_Process.comm_world)) use_threaded_collectives = 0;
+  if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm_ptr->rank == 0))
+    fprintf(stderr, "thread_provided=%s, %scomm_world, use_threaded_collectives %u\n", 
+            MPIR_ThreadInfo.thread_provided == MPI_THREAD_MULTIPLE? "MPI_THREAD_MULTIPLE":
+            MPIR_ThreadInfo.thread_provided == MPI_THREAD_SINGLE?"MPI_THREAD_SINGLE":
+            MPIR_ThreadInfo.thread_provided == MPI_THREAD_FUNNELED?"MPI_THREAD_FUNNELED":
+            MPIR_ThreadInfo.thread_provided == MPI_THREAD_SERIALIZED?"MPI_THREAD_SERIALIZED":
+            "??", 
+            (comm_ptr != MPIR_Process.comm_world)?"!":"",
+            use_threaded_collectives);
+
+  /* First, set up the easy FCA collectives (allreduce is complicated) if there
+   * are any, because they are always usable when they are on */
+  if(comm_ptr->mpid.user_selected_type[PAMI_XFER_REDUCE] == MPID_COLL_NOSELECTION)
+  {
+    MPIDI_Coll_comm_check_FCA("REDUCE","I1:Reduce:FCA:FCA",PAMI_XFER_REDUCE,MPID_COLL_CHECK_FN_REQUIRED, 0, comm_ptr);
+  }
+  if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHER] == MPID_COLL_NOSELECTION)
+  {
+    MPIDI_Coll_comm_check_FCA("ALLGATHER","I1:Allgather:FCA:FCA",PAMI_XFER_ALLGATHER,MPID_COLL_NOQUERY, 0, comm_ptr);
+  }
+  if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] == MPID_COLL_NOSELECTION)
+  {
+    MPIDI_Coll_comm_check_FCA("ALLGATHERV","I1:AllgathervInt:FCA:FCA",PAMI_XFER_ALLGATHERV_INT,MPID_COLL_NOQUERY, 0, comm_ptr);
+  }
+  if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] == MPID_COLL_NOSELECTION)
+  {
+    MPIDI_Coll_comm_check_FCA("BCAST", "I1:Broadcast:FCA:FCA", PAMI_XFER_BROADCAST, MPID_COLL_NOQUERY, 0, comm_ptr);
+    MPIDI_Coll_comm_check_FCA("BCAST", "I1:Broadcast:FCA:FCA", PAMI_XFER_BROADCAST, MPID_COLL_NOQUERY, 1, comm_ptr);
+  }
+  if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BARRIER] == MPID_COLL_NOSELECTION)
+  {
+    MPIDI_Coll_comm_check_FCA("BARRIER","I1:Barrier:FCA:FCA",PAMI_XFER_BARRIER,MPID_COLL_NOQUERY, 0, comm_ptr);
+  }
+  /* SSS: There isn't really an FCA Gatherv protocol. We do this call to force the use of MPICH for gatherv
+   * when FCA is enabled so we don't have to use a PAMI protocol.  */
+  if(comm_ptr->mpid.user_selected_type[PAMI_XFER_GATHERV_INT] == MPID_COLL_NOSELECTION)
+  {
+    MPIDI_Coll_comm_check_FCA("GATHERV","I1:GathervInt:FCA:FCA",PAMI_XFER_GATHERV_INT,MPID_COLL_NOQUERY, 0, comm_ptr);
+  }
+
+  opt_proto = -1;
+  mustquery = 0;
+  /* So, several protocols are really easy. Tackle them first. */
+  if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] == MPID_COLL_NOSELECTION)
+  {
+    TRACE_ERR("No allgatherv[int] env var, so setting optimized allgatherv[int]\n");
+    /* Use I0:RectangleDput */
+    for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_ALLGATHERV_INT][1]; i++)
+    {
+      if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLGATHERV_INT][1][i].name, "I0:RectangleDput:SHMEM:MU") == 0)
       {
-         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLGATHERV_INT][1][i].name, "I0:RectangleDput:SHMEM:MU") == 0)
-         {
-            opt_proto = i;
-            mustquery = 1;
-            break;
-         }
+        opt_proto = i;
+        mustquery = 1;
+        break;
       }
-      if(opt_proto != -1)
+    }
+    if(opt_proto != -1)
+    {
+      TRACE_ERR("Memcpy protocol type %d number %d (%s) to optimized protocol\n",
+                PAMI_XFER_ALLGATHERV_INT, opt_proto,
+                comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLGATHERV_INT][mustquery][opt_proto].name);
+      comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLGATHERV_INT][0] =
+      comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLGATHERV_INT][mustquery][opt_proto];
+      memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLGATHERV_INT][0], 
+             &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLGATHERV_INT][mustquery][opt_proto], 
+             sizeof(pami_metadata_t));
+      comm_ptr->mpid.must_query[PAMI_XFER_ALLGATHERV_INT][0] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
+      comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] = MPID_COLL_OPTIMIZED;
+    }
+    else /* no optimized allgatherv? */
+    {
+      TRACE_ERR("Couldn't find optimal allgatherv[int] protocol\n");
+      comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] = MPID_COLL_USE_MPICH;
+      comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLGATHERV_INT][0] = 0;
+      comm_ptr->mpid.allgathervs[0] = 1; /* Use GLUE_ALLREDUCE */
+    }
+    TRACE_ERR("Done setting optimized allgatherv[int]\n");
+  }
+
+  opt_proto = -1;
+  mustquery = 0;
+  /* Alltoall */
+  /* If the user has forced a selection, don't bother setting it here */
+  if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALL] == MPID_COLL_NOSELECTION)
+  {
+    TRACE_ERR("No alltoall env var, so setting optimized alltoall\n");
+    /* the best alltoall is always I0:M2MComposite:MU:MU, though there are
+     * displacement array memory issues today.... */
+    /* Loop over the protocols until we find the one we want */
+    if(use_threaded_collectives)
+      for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_ALLTOALL][1]; i++)
       {
-         TRACE_ERR("Memcpy protocol type %d number %d (%s) to optimized protocol\n",
-            PAMI_XFER_ALLGATHERV_INT, opt_proto,
-            comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLGATHERV_INT][mustquery][opt_proto].name);
-         comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLGATHERV_INT][0] =
-                comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLGATHERV_INT][mustquery][opt_proto];
-         memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLGATHERV_INT][0], 
-                &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLGATHERV_INT][mustquery][opt_proto], 
-                sizeof(pami_metadata_t));
-         comm_ptr->mpid.must_query[PAMI_XFER_ALLGATHERV_INT][0] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
-         comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] = MPID_COLL_OPTIMIZED;
+        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALL][1][i].name, "I0:M2MComposite:MU:MU") == 0)
+        {
+          opt_proto = i;
+          mustquery = 1;
+          break;
+        }
       }
-      else /* no optimized allgatherv? */
+    if(opt_proto != -1)
+    {
+      TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimized protocol\n",
+                PAMI_XFER_ALLTOALL, opt_proto, 
+                comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALL][mustquery][opt_proto].name);
+      comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLTOALL][0] =
+      comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLTOALL][mustquery][opt_proto];
+      memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLTOALL][0], 
+             &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALL][mustquery][opt_proto], 
+             sizeof(pami_metadata_t));
+      comm_ptr->mpid.must_query[PAMI_XFER_ALLTOALL][0] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
+      comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALL] = MPID_COLL_OPTIMIZED;
+    }
+    else
+    {
+      TRACE_ERR("Couldn't find I0:M2MComposite:MU:MU in the list for alltoall, reverting to MPICH\n");
+      comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALL] = MPID_COLL_USE_MPICH;
+      comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLTOALL][0] = 0;
+    }
+    TRACE_ERR("Done setting optimized alltoall\n");
+  }
+
+
+  opt_proto = -1;
+  mustquery = 0;
+  /* Alltoallv */
+  if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALLV_INT] == MPID_COLL_NOSELECTION)
+  {
+    TRACE_ERR("No alltoallv env var, so setting optimized alltoallv\n");
+    /* the best alltoallv is always I0:M2MComposite:MU:MU, though there are
+     * displacement array memory issues today.... */
+    /* Loop over the protocols until we find the one we want */
+    if(use_threaded_collectives)
+      for(i = 0; i <comm_ptr->mpid.coll_count[PAMI_XFER_ALLTOALLV_INT][1]; i++)
       {
-         TRACE_ERR("Couldn't find optimial allgatherv[int] protocol\n");
-         comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] = MPID_COLL_USE_MPICH;
-         comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLGATHERV_INT][0] = 0;
-         comm_ptr->mpid.allgathervs[0] = 1; /* Use GLUE_ALLREDUCE */
+        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALLV_INT][1][i].name, "I0:M2MComposite:MU:MU") == 0)
+        {
+          opt_proto = i;
+          mustquery = 1;
+          break;
+        }
       }
-      TRACE_ERR("Done setting optimized allgatherv[int]\n");
-   }
-
-   opt_proto = -1;
-   mustquery = 0;
-   /* Alltoall */
-   /* If the user has forced a selection, don't bother setting it here */
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALL] == MPID_COLL_NOSELECTION)
-   {
-      TRACE_ERR("No alltoall env var, so setting optimized alltoall\n");
-      /* the best alltoall is always I0:M2MComposite:MU:MU, though there are
-       * displacement array memory issues today.... */
-      /* Loop over the protocols until we find the one we want */
-      if(use_threaded_collectives)
-       for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_ALLTOALL][1]; i++)
-       {
-         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALL][1][i].name, "I0:M2MComposite:MU:MU") == 0)
-         {
-            opt_proto = i;
-            mustquery = 1;
-            break;
-         }
-       }
-      if(opt_proto != -1)
-      {
-         TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimized protocol\n",
-            PAMI_XFER_ALLTOALL, opt_proto, 
-            comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALL][mustquery][opt_proto].name);
-         comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLTOALL][0] =
-                comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLTOALL][mustquery][opt_proto];
-         memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLTOALL][0], 
-                &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALL][mustquery][opt_proto], 
-                sizeof(pami_metadata_t));
-         comm_ptr->mpid.must_query[PAMI_XFER_ALLTOALL][0] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
-         comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALL] = MPID_COLL_OPTIMIZED;
-      }
-      else
+    if(opt_proto != -1)
+    {
+      TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimized protocol\n",
+                PAMI_XFER_ALLTOALLV_INT, opt_proto, 
+                comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALLV_INT][mustquery][opt_proto].name);
+      comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLTOALLV_INT][0] =
+      comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLTOALLV_INT][mustquery][opt_proto];
+      memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLTOALLV_INT][0], 
+             &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALLV_INT][mustquery][opt_proto], 
+             sizeof(pami_metadata_t));
+      comm_ptr->mpid.must_query[PAMI_XFER_ALLTOALLV_INT][0] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
+      comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALLV_INT] = MPID_COLL_OPTIMIZED;
+    }
+    else
+    {
+      TRACE_ERR("Couldn't find I0:M2MComposite:MU:MU in the list for alltoallv, reverting to MPICH\n");
+      comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALLV_INT] = MPID_COLL_USE_MPICH;
+      comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLTOALLV_INT][0] = 0;
+    }
+    TRACE_ERR("Done setting optimized alltoallv\n");
+  }
+
+  opt_proto = -1;
+  mustquery = 0;
+  /* Barrier */
+  if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BARRIER] == MPID_COLL_NOSELECTION)
+  {
+    TRACE_ERR("No barrier env var, so setting optimized barrier\n");
+    /* For 1ppn, I0:MultiSync:-:GI is best, followed by
+     * I0:RectangleMultiSync:-:MU, followed by
+     * I0:OptBinomial:P2P:P2P
+     */
+    /* For 16 and 64 ppn, I0:MultiSync2Device:SHMEM:GI (which doesn't exist at 1ppn)
+    * is best, followed by
+    * I0:RectangleMultiSync2Device:SHMEM:MU for rectangles, followed by
+    * I0:OptBinomial:P2P:P2P */
+    /* So we basically check for the protocols in reverse-optimal order */
+
+
+    /* In order, >1ppn we use
+     * I0:MultiSync2Device:SHMEM:GI
+     * I0:RectangleMultiSync2Device:SHMEM:MU
+     * I0:OptBinomial:P2P:P2P
+     * MPICH
+     *
+     * In order 1ppn we use
+     * I0:MultiSync:-:GI
+     * I0:RectangleMultiSync:-:MU
+     * I0:OptBinomial:P2P:P2P
+     * MPICH
+     */
+    if(use_threaded_collectives)
+      for(i = 0 ; i < comm_ptr->mpid.coll_count[PAMI_XFER_BARRIER][0]; i++)
       {
-         TRACE_ERR("Couldn't find I0:M2MComposite:MU:MU in the list for alltoall, reverting to MPICH\n");
-         comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALL] = MPID_COLL_USE_MPICH;
-         comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLTOALL][0] = 0;
+        /* These two are mutually exclusive */
+        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BARRIER][0][i].name, "I0:MultiSync2Device:SHMEM:GI") == 0)
+        {
+          opt_proto = i;
+        }
+        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BARRIER][0][i].name, "I0:MultiSync:-:GI") == 0)
+        {
+          opt_proto = i;
+        }
       }
-      TRACE_ERR("Done setting optimized alltoall\n");
-   }
-
-
-   opt_proto = -1;
-   mustquery = 0;
-   /* Alltoallv */
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALLV_INT] == MPID_COLL_NOSELECTION)
-   {
-      TRACE_ERR("No alltoallv env var, so setting optimized alltoallv\n");
-      /* the best alltoallv is always I0:M2MComposite:MU:MU, though there are
-       * displacement array memory issues today.... */
-      /* Loop over the protocols until we find the one we want */
+    /* Next best rectangular to check */
+    if(opt_proto == -1)
+    {
       if(use_threaded_collectives)
-       for(i = 0; i <comm_ptr->mpid.coll_count[PAMI_XFER_ALLTOALLV_INT][1]; i++)
-       {
-         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALLV_INT][1][i].name, "I0:M2MComposite:MU:MU") == 0)
-         {
+        for(i = 0 ; i < comm_ptr->mpid.coll_count[PAMI_XFER_BARRIER][0]; i++)
+        {
+          if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BARRIER][0][i].name, "I0:RectangleMultiSync2Device:SHMEM:MU") == 0)
             opt_proto = i;
-            mustquery = 1;
-            break;
-         }
-       }
-      if(opt_proto != -1)
-      {
-         TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimized protocol\n",
-            PAMI_XFER_ALLTOALLV_INT, opt_proto, 
-            comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALLV_INT][mustquery][opt_proto].name);
-         comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLTOALLV_INT][0] =
-                comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLTOALLV_INT][mustquery][opt_proto];
-         memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLTOALLV_INT][0], 
-                &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALLV_INT][mustquery][opt_proto], 
-                sizeof(pami_metadata_t));
-         comm_ptr->mpid.must_query[PAMI_XFER_ALLTOALLV_INT][0] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
-         comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALLV_INT] = MPID_COLL_OPTIMIZED;
-      }
-      else
-      {
-         TRACE_ERR("Couldn't find I0:M2MComposite:MU:MU in the list for alltoallv, reverting to MPICH\n");
-         comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALLV_INT] = MPID_COLL_USE_MPICH;
-         comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLTOALLV_INT][0] = 0;
-      }
-      TRACE_ERR("Done setting optimized alltoallv\n");
-   }
-   
-   opt_proto = -1;
-   mustquery = 0;
-   /* Barrier */
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BARRIER] == MPID_COLL_NOSELECTION)
-   {
-      TRACE_ERR("No barrier env var, so setting optimized barrier\n");
-      /* For 1ppn, I0:MultiSync:-:GI is best, followed by
-       * I0:RectangleMultiSync:-:MU, followed by
-       * I0:OptBinomial:P2P:P2P
-       */
-       /* For 16 and 64 ppn, I0:MultiSync2Device:SHMEM:GI (which doesn't exist at 1ppn)
-       * is best, followed by
-       * I0:RectangleMultiSync2Device:SHMEM:MU for rectangles, followed by
-       * I0:OptBinomial:P2P:P2P */
-      /* So we basically check for the protocols in reverse-optimal order */
-
-
-         /* In order, >1ppn we use
-          * I0:MultiSync2Device:SHMEM:GI
-          * I0:RectangleMultiSync2Device:SHMEM:MU
-          * I0:OptBinomial:P2P:P2P
-          * MPICH
-          *
-          * In order 1ppn we use
-          * I0:MultiSync:-:GI
-          * I0:RectangleMultiSync:-:MU
-          * I0:OptBinomial:P2P:P2P
-          * MPICH
-          */
-      if(use_threaded_collectives)
-       for(i = 0 ; i < comm_ptr->mpid.coll_count[PAMI_XFER_BARRIER][0]; i++)
-       {
-          /* These two are mutually exclusive */
-          if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BARRIER][0][i].name, "I0:MultiSync2Device:SHMEM:GI") == 0)
-          {
-             opt_proto = i;
-          }
-          if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BARRIER][0][i].name, "I0:MultiSync:-:GI") == 0)
-          {
-             opt_proto = i;
-          }
-       }
-      /* Next best rectangular to check */
-      if(opt_proto == -1)
-      {
-         if(use_threaded_collectives)
-          for(i = 0 ; i < comm_ptr->mpid.coll_count[PAMI_XFER_BARRIER][0]; i++)
-          {
-            if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BARRIER][0][i].name, "I0:RectangleMultiSync2Device:SHMEM:MU") == 0)
-               opt_proto = i;
-            if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BARRIER][0][i].name, "I0:RectangleMultiSync:-:MU") == 0)
-               opt_proto = i;
-          }
-      }
-      /* Finally, see if we have opt binomial */
-      if(opt_proto == -1)
-      {
-         for(i = 0 ; i < comm_ptr->mpid.coll_count[PAMI_XFER_BARRIER][0]; i++)
-         {
-            if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BARRIER][0][i].name, "I0:OptBinomial:P2P:P2P") == 0)
-               opt_proto = i;
-         }
-      }
-
-      if(opt_proto != -1)
+          if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BARRIER][0][i].name, "I0:RectangleMultiSync:-:MU") == 0)
+            opt_proto = i;
+        }
+    }
+    /* Finally, see if we have opt binomial */
+    if(opt_proto == -1)
+    {
+      for(i = 0 ; i < comm_ptr->mpid.coll_count[PAMI_XFER_BARRIER][0]; i++)
       {
-         TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimize protocol\n",
-            PAMI_XFER_BARRIER, opt_proto, 
-            comm_ptr->mpid.coll_metadata[PAMI_XFER_BARRIER][0][opt_proto].name);
-
-         comm_ptr->mpid.opt_protocol[PAMI_XFER_BARRIER][0] =
-                comm_ptr->mpid.coll_algorithm[PAMI_XFER_BARRIER][0][opt_proto]; 
-         memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BARRIER][0], 
-                &comm_ptr->mpid.coll_metadata[PAMI_XFER_BARRIER][0][opt_proto], 
-                sizeof(pami_metadata_t));
-         comm_ptr->mpid.must_query[PAMI_XFER_BARRIER][0] = MPID_COLL_NOQUERY;
-         comm_ptr->mpid.user_selected_type[PAMI_XFER_BARRIER] = MPID_COLL_OPTIMIZED;
+        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BARRIER][0][i].name, "I0:OptBinomial:P2P:P2P") == 0)
+          opt_proto = i;
       }
-      else
+    }
+
+    if(opt_proto != -1)
+    {
+      TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimize protocol\n",
+                PAMI_XFER_BARRIER, opt_proto, 
+                comm_ptr->mpid.coll_metadata[PAMI_XFER_BARRIER][0][opt_proto].name);
+
+      comm_ptr->mpid.opt_protocol[PAMI_XFER_BARRIER][0] =
+      comm_ptr->mpid.coll_algorithm[PAMI_XFER_BARRIER][0][opt_proto]; 
+      memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BARRIER][0], 
+             &comm_ptr->mpid.coll_metadata[PAMI_XFER_BARRIER][0][opt_proto], 
+             sizeof(pami_metadata_t));
+      comm_ptr->mpid.must_query[PAMI_XFER_BARRIER][0] = MPID_COLL_NOQUERY;
+      comm_ptr->mpid.user_selected_type[PAMI_XFER_BARRIER] = MPID_COLL_OPTIMIZED;
+    }
+    else
+    {
+      TRACE_ERR("Couldn't find any optimal barrier protocols. Selecting MPICH\n");
+      comm_ptr->mpid.user_selected_type[PAMI_XFER_BARRIER] = MPID_COLL_USE_MPICH;
+      comm_ptr->mpid.opt_protocol[PAMI_XFER_BARRIER][0] = 0;
+    }
+
+    TRACE_ERR("Done setting optimized barrier\n");
+  }
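
[editor's note] The barrier selection above scans the metadata list once per candidate protocol, "in reverse-optimal order", so a later (better) match overwrites an earlier one. A minimal standalone sketch of the equivalent best-first search, using hypothetical names (`proto_md_t`, `find_protocol`) rather than the real PAMI structures:

```c
#include <string.h>
#include <strings.h>   /* strcasecmp, as used in the patch */

/* Simplified stand-in for pami_metadata_t: only the name matters here. */
typedef struct { const char *name; } proto_md_t;

/* Try each candidate name in best-to-worst order against the list of
 * available protocols; return the index of the first one found, or -1
 * so the caller can fall back to MPICH. */
static int find_protocol(const proto_md_t *md, int count,
                         const char *const *candidates, int ncand)
{
    int c, i;
    for (c = 0; c < ncand; c++)        /* best candidate first */
        for (i = 0; i < count; i++)    /* scan what this run advertises */
            if (strcasecmp(md[i].name, candidates[c]) == 0)
                return i;              /* index into the metadata list */
    return -1;                         /* nothing found: use MPICH */
}
```

Searching best-first with an early return gives the same winner as the patch's overwrite-in-reverse-order loops, without rescanning the list.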
+
+  opt_proto = -1;
+  mustquery = 0;
+
+  /* This becomes messy when we have two message sizes. If we were gutting the 
+   * existing framework, it might be easier, but I think the existing framework
+   * is useful for future collective selection libraries/mechanisms, so I'd
+   * rather leave it in place and deal with it here instead.... */
+
+  /* Broadcast */
+  /* 1ppn */
+  /* small messages: I0:MultiCastDput:-:MU
+   * if it exists, I0:RectangleDput:MU:MU for >64k */
+
+  /* 16ppn */
+  /* small messages: I0:MultiCastDput:SHMEM:MU
+   * for 16ppn/1node: I0:MultiCastDput:SHMEM:- perhaps
+   * I0:RectangleDput:SHMEM:MU for >128k */
+  /* nonrect(?): I0:MultiCast2DeviceDput:SHMEM:MU 
+   * no hw: I0:Sync2-nary:Shmem:MUDput */
+  /* 64ppn */
+  /* all sizes: I0:MultiCastDput:SHMEM:MU */
+  /* otherwise, I0:2-nomial:SHMEM:MU */
+
+
+
+  /* First, set up small message bcasts */
+  if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] == MPID_COLL_NOSELECTION)
+  {
+    /* Complicated exceptions: */
+    /* I0:RankBased_Binomial:-:ShortMU is good on irregular for <256 bytes */
+    /* I0:MultiCast:SHMEM:- is good at 1 node/16ppn, which is a SOW point */
+    TRACE_ERR("No bcast env var, so setting optimized bcast\n");
+
+    if(use_threaded_collectives)
+      for(i = 0 ; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
       {
-         TRACE_ERR("Couldn't find any optimal barrier protocols. Selecting MPICH\n");
-         comm_ptr->mpid.user_selected_type[PAMI_XFER_BARRIER] = MPID_COLL_USE_MPICH;
-         comm_ptr->mpid.opt_protocol[PAMI_XFER_BARRIER][0] = 0;
+        /* These two are mutually exclusive */
+        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:MultiCastDput:-:MU") == 0)
+        {
+          opt_proto = i;
+          mustquery = 1;
+          break;
+        }
+        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:MultiCastDput:SHMEM:MU") == 0)
+        {
+          opt_proto = i;
+          mustquery = 1;
+          break;
+        }
       }
-
-      TRACE_ERR("Done setting optimized barrier\n");
-   }
-
-   opt_proto = -1;
-   mustquery = 0;
-
-   /* This becomes messy when we have to message sizes. If we were gutting the 
-    * existing framework, it might be easier, but I think the existing framework
-    * is useful for future collective selection libraries/mechanisms, so I'd
-    * rather leave it in place and deal with it here instead.... */
-
-   /* Broadcast */
-   /* 1ppn */
-   /* small messages: I0:MulticastDput:-:MU
-    * if it exists, I0:RectangleDput:MU:MU for >64k */
-
-   /* 16ppn */
-   /* small messages: I0:MultiCastDput:SHMEM:MU
-    * for 16ppn/1node: I0:MultiCastDput:SHMEM:- perhaps
-    * I0:RectangleDput:SHMEM:MU for >128k */
-   /* nonrect(?): I0:MultiCast2DeviceDput:SHMEM:MU 
-    * no hw: I0:Sync2-nary:Shmem:MUDput */
-   /* 64ppn */
-   /* all sizes: I0:MultiCastDput:SHMEM:MU */
-   /* otherwise, I0:2-nomial:SHMEM:MU */
-
-
-
-   /* First, set up small message bcasts */
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] == MPID_COLL_NOSELECTION)
-   {
-      /* Complicated exceptions: */
-      /* I0:RankBased_Binomial:-:ShortMU is good on irregular for <256 bytes */
-      /* I0:MultiCast:SHMEM:- is good at 1 node/16ppn, which is a SOW point */
-      TRACE_ERR("No bcast env var, so setting optimized bcast\n");
-
-      if(use_threaded_collectives)
-       for(i = 0 ; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
-       {
-         /* These two are mutually exclusive */
-         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:MultiCastDput:-:MU") == 0)
-            opt_proto = i;
-         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:MultiCastDput:SHMEM:MU") == 0)
-            opt_proto = i;
-	 mustquery = 1;
-       }
-      /* Next best MU 2 device to check */
-      if(use_threaded_collectives)
+    /* Next best MU 2 device to check */
+    if(use_threaded_collectives)
       if(opt_proto == -1)
       {
-         if(use_threaded_collectives)
+        if(use_threaded_collectives)
           for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
           {
             if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:MultiCast2DeviceDput:SHMEM:MU") == 0)
-               opt_proto = i;
-               mustquery = 1;
+            {
+              opt_proto = i;
+              mustquery = 1;
+              break;
+            }
           }
       }
       /* Check for  rectangle */
-      if(use_threaded_collectives)
+    if(use_threaded_collectives)
       if(opt_proto == -1)
       {
-         unsigned len = strlen("I0:RectangleDput:");
-         if(use_threaded_collectives)
+        unsigned len = strlen("I0:RectangleDput:");
+        if(use_threaded_collectives)
           for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
           {
             if(strcasecmp (comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:RectangleDput:SHMEM:MU") == 0)
             { /* Prefer the :SHMEM:MU so break when it's found */
-               opt_proto = i; 
-               mustquery = 1;
-               break;
+              opt_proto = i; 
+              mustquery = 1;
+              break;
             }
             /* Otherwise any RectangleDput is better than nothing. */
             if(strncasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:RectangleDput:",len) == 0)
-	    {
-               opt_proto = i;
-               mustquery = 1;
-	    }
-          }
-      }
-      if(opt_proto == -1)
-      {
-         for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
-         {
-            /* This is a good choice for small messages only */
-            if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:RankBased_Binomial:SHMEM:MU") == 0)
             {
-               opt_proto = i;
-               mustquery = 1;
-               comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0] = 256;
+              opt_proto = i;
+              mustquery = 1;
             }
-         }
+          }
       }
-      /* Next best to check */
-      if(opt_proto == -1)
+    if(opt_proto == -1)
+    {
+      for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
       {
-         for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][0]; i++)
-         {
-            /* Also, NOT in the 'must query' list */
-            if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:2-nomial:SHMEM:MU") == 0)
-               opt_proto = i;
-         }
+        /* This is a good choice for small messages only */
+        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:RankBased_Binomial:SHMEM:MU") == 0)
+        {
+          opt_proto = i;
+          mustquery = 1;
+          comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0] = 256;
+          break;
+        }
       }
-      
-      /* These protocols are good for most message sizes, but there are some
-       * better choices for larger messages */
-      /* Set opt_proto for bcast[0] right now */
-      if(opt_proto != -1)
+    }
+    /* Next best to check */
+    if(opt_proto == -1)
+    {
+      for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][0]; i++)
       {
-         TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimize protocol 0\n",
-            PAMI_XFER_BROADCAST, opt_proto, 
-            comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][mustquery][opt_proto].name);
-
-         comm_ptr->mpid.opt_protocol[PAMI_XFER_BROADCAST][0] = 
-                comm_ptr->mpid.coll_algorithm[PAMI_XFER_BROADCAST][mustquery][opt_proto];
-         memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0], 
-                &comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][mustquery][opt_proto], 
-                sizeof(pami_metadata_t));
-         comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][0] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
-         comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] = MPID_COLL_OPTIMIZED;
+        /* Also, NOT in the 'must query' list */
+        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:2-nomial:SHMEM:MU") == 0)
+          opt_proto = i;
       }
-      else
-      {
-         TRACE_ERR("Couldn't find any optimal bcast protocols. Selecting MPICH\n");
-         comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] = MPID_COLL_USE_MPICH;
-         comm_ptr->mpid.opt_protocol[PAMI_XFER_BROADCAST][0] = 0;
-      }
-
-      TRACE_ERR("Done setting optimized bcast 0\n");
-
-      /* Now, look into large message bcasts */
-      opt_proto = -1;
-      mustquery = 0;
-      /* If bcast0 is I0:MultiCastDput:-:MU, and I0:RectangleDput:MU:MU is available, use
-       * it for 64k messages */
-      if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] != MPID_COLL_USE_MPICH)
-      {
+    }
+
+    /* These protocols are good for most message sizes, but there are some
+     * better choices for larger messages */
+    /* Set opt_proto for bcast[0] right now */
+    if(opt_proto != -1)
+    {
+      TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimize protocol 0\n",
+                PAMI_XFER_BROADCAST, opt_proto, 
+                comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][mustquery][opt_proto].name);
+
+      comm_ptr->mpid.opt_protocol[PAMI_XFER_BROADCAST][0] = 
+      comm_ptr->mpid.coll_algorithm[PAMI_XFER_BROADCAST][mustquery][opt_proto];
+      memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0], 
+             &comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][mustquery][opt_proto], 
+             sizeof(pami_metadata_t));
+      comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][0] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
+      comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] = MPID_COLL_OPTIMIZED;
+    }
+    else
+    {
+      TRACE_ERR("Couldn't find any optimal bcast protocols. Selecting MPICH\n");
+      comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] = MPID_COLL_USE_MPICH;
+      comm_ptr->mpid.opt_protocol[PAMI_XFER_BROADCAST][0] = 0;
+    }
+
+    TRACE_ERR("Done setting optimized bcast 0\n");
+
+    /* Now, look into large message bcasts */
+    opt_proto = -1;
+    mustquery = 0;
+    /* If bcast0 is I0:MultiCastDput:-:MU, and I0:RectangleDput:MU:MU is available, use
+     * it for 64k messages */
+    if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] != MPID_COLL_USE_MPICH)
+    {
       if(use_threaded_collectives)
-         if(strcasecmp(comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name, "I0:MultiCastDput:-:MU") == 0)
-         {
-            /* See if I0:RectangleDput:MU:MU is available */
-            for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
+        if(strcasecmp(comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name, "I0:MultiCastDput:-:MU") == 0)
+        {
+          /* See if I0:RectangleDput:MU:MU is available */
+          for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
+          {
+            if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:RectangleDput:MU:MU") == 0)
             {
-               if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:RectangleDput:MU:MU") == 0)
-               {
-                  opt_proto = i;
-		  mustquery = 1;
-                  comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0] = 65536;
-               }
+              opt_proto = i;
+              mustquery = 1;
+              comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0] = 65536;
+              break;
             }
-         }
-          /* small messages: I0:MultiCastDput:SHMEM:MU*/
-          /* I0:RectangleDput:SHMEM:MU for >128k */
+          }
+        }
+        /* small messages: I0:MultiCastDput:SHMEM:MU*/
+        /* I0:RectangleDput:SHMEM:MU for >128k */
       if(use_threaded_collectives)
-         if(strcasecmp(comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name, "I0:MultiCastDput:SHMEM:MU") == 0)
-         {
-            /* See if I0:RectangleDput:SHMEM:MU is available */
-            for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
-            {
-               if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:RectangleDput:SHMEM:MU") == 0)
-               {
-                  opt_proto = i;
-		  mustquery = 1;
-                  comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0] = 131072;
-               }
-            }
-         }
-         if(strcasecmp(comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name, "I0:RankBased_Binomial:SHMEM:MU") == 0)
-         {
-            /* This protocol was only good for up to 256, and it was an irregular, so let's set
-             * 2-nomial for larger message sizes. Cutoff should have already been set to 256 too */
-            for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][0]; i++)
+        if(strcasecmp(comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name, "I0:MultiCastDput:SHMEM:MU") == 0)
+        {
+          /* See if I0:RectangleDput:SHMEM:MU is available */
+          for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
+          {
+            if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:RectangleDput:SHMEM:MU") == 0)
             {
-               if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:2-nomial:SHMEM:MU") == 0)
-                  opt_proto = i;
+              opt_proto = i;
+              mustquery = 1;
+              comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0] = 131072;
+              break;
             }
-         }
+          }
+        }
+      if(strcasecmp(comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name, "I0:RankBased_Binomial:SHMEM:MU") == 0)
+      {
+        /* This protocol was only good for up to 256, and it was an irregular, so let's set
+         * 2-nomial for larger message sizes. Cutoff should have already been set to 256 too */
+        for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][0]; i++)
+        {
+          if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:2-nomial:SHMEM:MU") == 0)
+            opt_proto = i;
+        }
+      }
 
-         if(opt_proto != -1)
-         {
-            if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm_ptr->rank == 0)
-            {
-               fprintf(stderr,"Selecting %s as optimal broadcast 1 (above %d)\n", 
+      if(opt_proto != -1)
+      {
+        if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm_ptr->rank == 0)
+        {
+          fprintf(stderr,"Selecting %s as optimal broadcast 1 (above %d)\n", 
                   comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][mustquery][opt_proto].name, 
                   comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0]);
-            }
-            TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimize protocol 1 (above %d)\n",
-               PAMI_XFER_BROADCAST, opt_proto, 
-               comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][mustquery][opt_proto].name,
-               comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0]);
-
-            comm_ptr->mpid.opt_protocol[PAMI_XFER_BROADCAST][1] =
-                    comm_ptr->mpid.coll_algorithm[PAMI_XFER_BROADCAST][mustquery][opt_proto];
-            memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][1], 
-                   &comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][mustquery][opt_proto], 
-                   sizeof(pami_metadata_t));
-            comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
-            /* This should already be set... */
-            /* comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] = MPID_COLL_OPTIMIZED; */
-         }
-         else
-         {
-            TRACE_ERR("Secondary bcast protocols unavilable; using primary for all sizes\n");
-            
-            TRACE_ERR("Duplicating protocol type %d, number %d (%s) to optimize protocol 1 (above %d)\n",
-               PAMI_XFER_BROADCAST, 0, 
-               comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name,
-               comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0]);
-
-            comm_ptr->mpid.opt_protocol[PAMI_XFER_BROADCAST][1] = 
-                    comm_ptr->mpid.opt_protocol[PAMI_XFER_BROADCAST][0];
-            memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][1], 
-                   &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0],
-                   sizeof(pami_metadata_t));
-            comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] = comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][0];
-         }
+        }
+        TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimize protocol 1 (above %d)\n",
+                  PAMI_XFER_BROADCAST, opt_proto, 
+                  comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][mustquery][opt_proto].name,
+                  comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0]);
+
+        comm_ptr->mpid.opt_protocol[PAMI_XFER_BROADCAST][1] =
+        comm_ptr->mpid.coll_algorithm[PAMI_XFER_BROADCAST][mustquery][opt_proto];
+        memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][1], 
+               &comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][mustquery][opt_proto], 
+               sizeof(pami_metadata_t));
+        comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
+        /* This should already be set... */
+        /* comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] = MPID_COLL_OPTIMIZED; */
       }
-      TRACE_ERR("Done with bcast protocol selection\n");
-   }
-
-   opt_proto = -1;
-   mustquery = 0;
-   /* The most fun... allreduce */
-   /* 512-way data: */
-   /* For starters, Amith's protocol works on doubles on sum/min/max. Because
-    * those are targetted types/ops, we will pre-cache it.
-    * That protocol works on ints, up to 8k/ppn max message size. We'll precache 
-    * it too
-    */
-   
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] == MPID_COLL_NOSELECTION)
-   {
-      /* the user hasn't selected a protocol, so we can NULL the protocol/metadatas */
-      comm_ptr->mpid.query_cached_allreduce = MPID_COLL_USE_MPICH;
-
-      comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0] = 0;
-      /* For BGQ */
-      /*  1ppn: I0:MultiCombineDput:-:MU if it is available, but it has a check_fn
-       *  since it is MU-based*/
-      /*  Next best is I1:ShortAllreduce:P2P:P2P for short messages, then MPICH is best*/
-      /*  I0:MultiCombineDput:-:MU could be used in the i/dsmm cached protocols, so we'll do that */
-      /*  First, look in the 'must query' list and see i I0:MultiCombine:Dput:-:MU is there */
-
-      char *pname="...none...";
-      int user_range_hi = -1;
-      int fca_enabled = 0;
-      if((fca_enabled = MPIDI_Check_FCA_envvar("ALLREDUCE", &user_range_hi)) == 1)
-         pname = "I1:Allreduce:FCA:FCA";
-      else if(use_threaded_collectives)
-         pname = "I0:MultiCombineDput:-:MU";
-      /*SSS: Any "MU" protocol will not be available on non-BG systems. I just need to check for FCA in the 
-                   first if only. No need to do another check since the second if will never succeed for PE systems*/
-      for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_ALLREDUCE][1]; i++)
+      else
       {
-         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i].name, pname) == 0)
-         {
-            /* So, this should be fine for the i/dsmm protocols. everything else needs to call the check function */
-            /* This also works for all message sizes, so no need to deal with it specially for query */
-            comm_ptr->mpid.cached_allreduce = 
-                   comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLREDUCE][1][i];
-            memcpy(&comm_ptr->mpid.cached_allreduce_md,
-                   &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i],
-                  sizeof(pami_metadata_t));
-            comm_ptr->mpid.query_cached_allreduce = MPID_COLL_QUERY;
-
-            comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] = MPID_COLL_OPTIMIZED;
-            if(fca_enabled && user_range_hi != -1)
-              comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0] = user_range_hi;
-            opt_proto = i;
+        TRACE_ERR("Secondary bcast protocols unavailable; using primary for all sizes\n");
 
-         }
-         if(use_threaded_collectives)
-         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i].name, "I0:MultiCombineDput:SHMEM:MU") == 0)
-         {
-            /* This works well for doubles sum/min/max but has trouble with int > 8k/ppn */
-            comm_ptr->mpid.cached_allreduce =
-                   comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLREDUCE][1][i];
-            memcpy(&comm_ptr->mpid.cached_allreduce_md,
-                   &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i],
-                  sizeof(pami_metadata_t));
-            comm_ptr->mpid.query_cached_allreduce = MPID_COLL_CHECK_FN_REQUIRED;
-
-            comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] = MPID_COLL_OPTIMIZED;
-            opt_proto = i;
-         }
+        TRACE_ERR("Duplicating protocol type %d, number %d (%s) to optimize protocol 1 (above %d)\n",
+                  PAMI_XFER_BROADCAST, 0, 
+                  comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name,
+                  comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0]);
+
+        comm_ptr->mpid.opt_protocol[PAMI_XFER_BROADCAST][1] = 
+        comm_ptr->mpid.opt_protocol[PAMI_XFER_BROADCAST][0];
+        memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][1], 
+               &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0],
+               sizeof(pami_metadata_t));
+        comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] = comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][0];
       }
-      /* At this point, if opt_proto != -1, we have must-query protocols in the i/dsmm caches */
-      /* We should pick a backup, non-must query */
-      /* I0:ShortAllreduce:P2P:P2P < 128, then mpich*/
-
-      /*SSS: ShortAllreduce is available on both BG and PE. I have to pick just one to check for in this case. 
-                  However, I need to add FCA for both opt_protocol[0]and[1] to cover all data sizes*/
-      if(fca_enabled == 1)
-         pname = "I1:Allreduce:FCA:FCA";
-      else 
-         pname = "I1:ShortAllreduce:P2P:P2P";
-
-      for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_ALLREDUCE][1]; i++)
+    }
+    TRACE_ERR("Done with bcast protocol selection\n");
+  }
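
[editor's note] The bcast code above fills two slots: opt_protocol[0] for messages up to cutoff_size and opt_protocol[1] above it. A sketch of the resulting dispatch, with a simplified stand-in struct (`bcast_sel_t` and the boundary handling are assumptions, not the actual mpido_bcast logic):

```c
#include <stddef.h>

/* Simplified stand-in for the per-communicator bcast selection state. */
typedef struct {
    int    opt_protocol[2];  /* [0] small-message, [1] large-message */
    size_t cutoff_size;      /* e.g. 65536 or 131072 as set above */
} bcast_sel_t;

/* Pick a protocol slot by message size; whether the cutoff itself goes to
 * slot 0 or slot 1 is an assumption here. */
static int select_bcast(const bcast_sel_t *sel, size_t bytes)
{
    return (bytes <= sel->cutoff_size) ? sel->opt_protocol[0]
                                       : sel->opt_protocol[1];
}
```

When no secondary protocol exists, the patch copies slot 0 into slot 1, so this dispatch degenerates to a single protocol for all sizes.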
+
+  opt_proto = -1;
+  mustquery = 0;
+  /* The most fun... allreduce */
+  /* 512-way data: */
+  /* For starters, Amith's protocol works on doubles on sum/min/max. Because
+   * those are targeted types/ops, we will pre-cache it.
+   * That protocol works on ints, up to 8k/ppn max message size. We'll precache 
+   * it too
+   */
+
+  if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] == MPID_COLL_NOSELECTION)
+  {
+    /* the user hasn't selected a protocol, so we can NULL the protocol/metadatas */
+    comm_ptr->mpid.query_cached_allreduce = MPID_COLL_USE_MPICH;
+
+    comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0] = 0;
+    /* For BGQ */
+    /*  1ppn: I0:MultiCombineDput:-:MU if it is available, but it has a check_fn
+     *  since it is MU-based*/
+    /*  Next best is I1:ShortAllreduce:P2P:P2P for short messages, then MPICH is best*/
+    /*  I0:MultiCombineDput:-:MU could be used in the i/dsmm cached protocols, so we'll do that */
+    /*  First, look in the 'must query' list and see if I0:MultiCombineDput:-:MU is there */
+
+    char *pname="...none...";
+    int user_range_hi = -1;
+    int fca_enabled = 0;
+    if((fca_enabled = MPIDI_Check_FCA_envvar("ALLREDUCE", &user_range_hi)) == 1)
+      pname = "I1:Allreduce:FCA:FCA";
+    else if(use_threaded_collectives)
+      pname = "I0:MultiCombineDput:-:MU";
+    /*SSS: Any "MU" protocol will not be available on non-BG systems. I just need to check for FCA in the 
+                 first if only. No need to do another check since the second if will never succeed for PE systems*/
+    for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_ALLREDUCE][1]; i++)
+    {
+      if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i].name, pname) == 0)
       {
-         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i].name, pname) == 0)
-         {
-            comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLREDUCE][0] =
-                   comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLREDUCE][1][i];
-            memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLREDUCE][0],
-                   &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i],
-                   sizeof(pami_metadata_t));
-            if(fca_enabled == 1)
-            {
-              comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] = MPID_COLL_CHECK_FN_REQUIRED;
-              if(user_range_hi != -1)
-                comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0] = user_range_hi;
-              /*SSS: Otherwise another protocol may get selected in mpido_allreduce if we don't set this flag here*/
-              comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][1] = MPID_COLL_CHECK_FN_REQUIRED;
-            }
-            else
-            {
-              /* Short is good for up to 512 bytes... but it's a query protocol */
-              comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] = MPID_COLL_QUERY;
-              /* MPICH above that ... when short query fails */
-              comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][1] = MPID_COLL_USE_MPICH;
-            }
-            comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] = MPID_COLL_OPTIMIZED;
+        /* So, this should be fine for the i/dsmm protocols. Everything else needs to call the check function */
+        /* This also works for all message sizes, so no need to deal with it specially for query */
+        comm_ptr->mpid.cached_allreduce = 
+        comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLREDUCE][1][i];
+        memcpy(&comm_ptr->mpid.cached_allreduce_md,
+               &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i],
+               sizeof(pami_metadata_t));
+        comm_ptr->mpid.query_cached_allreduce = MPID_COLL_QUERY;
+
+        comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] = MPID_COLL_OPTIMIZED;
+        if(fca_enabled && user_range_hi != -1)
+          comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0] = user_range_hi;
+        opt_proto = i;
 
-            opt_proto = i;
-         }
       }
-      if(opt_proto == -1)
+      if(use_threaded_collectives)
+        if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i].name, "I0:MultiCombineDput:SHMEM:MU") == 0)
+        {
+          /* This works well for doubles sum/min/max but has trouble with int > 8k/ppn */
+          comm_ptr->mpid.cached_allreduce =
+          comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLREDUCE][1][i];
+          memcpy(&comm_ptr->mpid.cached_allreduce_md,
+                 &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i],
+                 sizeof(pami_metadata_t));
+          comm_ptr->mpid.query_cached_allreduce = MPID_COLL_CHECK_FN_REQUIRED;
+
+          comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] = MPID_COLL_OPTIMIZED;
+          opt_proto = i;
+        }
+    }
+    /* At this point, if opt_proto != -1, we have must-query protocols in the i/dsmm caches */
+    /* We should pick a backup, non-must query */
+    /* I0:ShortAllreduce:P2P:P2P < 128, then mpich*/
+
+    /*SSS: ShortAllreduce is available on both BG and PE. I have to pick just one to check for in this case.
+           However, I need to add FCA for both opt_protocol[0] and [1] to cover all data sizes*/
+    if(fca_enabled == 1)
+      pname = "I1:Allreduce:FCA:FCA";
+    else
+      pname = "I1:ShortAllreduce:P2P:P2P";
+
+    for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_ALLREDUCE][1]; i++)
+    {
+      if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i].name, pname) == 0)
       {
-         if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm_ptr->rank == 0)
-            fprintf(stderr,"Optimized allreduce falls back to MPICH\n");
-         comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] = MPID_COLL_USE_MPICH;
-         comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][1] = MPID_COLL_USE_MPICH;
+        comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLREDUCE][0] =
+        comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLREDUCE][1][i];
+        memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLREDUCE][0],
+               &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i],
+               sizeof(pami_metadata_t));
+        if(fca_enabled == 1)
+        {
+          comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] = MPID_COLL_CHECK_FN_REQUIRED;
+          if(user_range_hi != -1)
+            comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0] = user_range_hi;
+          /*SSS: Otherwise another protocol may get selected in mpido_allreduce if we don't set this flag here*/
+          comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][1] = MPID_COLL_CHECK_FN_REQUIRED;
+        }
+        else
+        {
+          /* Short is good for up to 512 bytes... but it's a query protocol */
+          comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] = MPID_COLL_QUERY;
+          /* MPICH above that ... when short query fails */
+          comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][1] = MPID_COLL_USE_MPICH;
+        }
+        comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] = MPID_COLL_OPTIMIZED;
+
+        opt_proto = i;
       }
-      TRACE_ERR("Done setting optimized allreduce protocols\n");
-   }
-
-   
-   if(MPIDI_Process.optimized.select_colls != 2)
-   {
-      for(i = 0; i < PAMI_XFER_COUNT; i++)
+    }
+    if(opt_proto == -1)
+    {
+      if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm_ptr->rank == 0)
+        fprintf(stderr,"Optimized allreduce falls back to MPICH\n");
+      comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] = MPID_COLL_USE_MPICH;
+      comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][1] = MPID_COLL_USE_MPICH;
+    }
+    TRACE_ERR("Done setting optimized allreduce protocols\n");
+  }
+
+
+  if(MPIDI_Process.optimized.select_colls != 2)
+  {
+    for(i = 0; i < PAMI_XFER_COUNT; i++)
+    {
+      if(i == PAMI_XFER_AMBROADCAST || i == PAMI_XFER_AMSCATTER ||
+         i == PAMI_XFER_AMGATHER || i == PAMI_XFER_AMREDUCE)
+        continue;
+      if(comm_ptr->mpid.user_selected_type[i] != MPID_COLL_OPTIMIZED)
       {
-         if(i == PAMI_XFER_AMBROADCAST || i == PAMI_XFER_AMSCATTER ||
-            i == PAMI_XFER_AMGATHER || i == PAMI_XFER_AMREDUCE)
-            continue;
-         if(comm_ptr->mpid.user_selected_type[i] != MPID_COLL_OPTIMIZED)
-         {
-            if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm_ptr->rank == 0)
-               fprintf(stderr, "Collective wasn't selected for type %d,using MPICH (comm %p)\n", i, comm_ptr);
-            comm_ptr->mpid.user_selected_type[i] = MPID_COLL_USE_MPICH;
-         }
+        if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm_ptr->rank == 0)
+          fprintf(stderr, "Collective wasn't selected for type %d,using MPICH (comm %p)\n", i, comm_ptr);
+        comm_ptr->mpid.user_selected_type[i] = MPID_COLL_USE_MPICH;
       }
-   }
-   
-            
-   if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm_ptr->rank == 0)
-   {
-      if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BARRIER] == MPID_COLL_OPTIMIZED)
-         fprintf(stderr,"Selecting %s for opt barrier comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BARRIER][0].name, comm_ptr);
-      if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] == MPID_COLL_OPTIMIZED)
-         fprintf(stderr,"Selecting %s for opt allgatherv comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLGATHERV_INT][0].name, comm_ptr);
-      if(comm_ptr->mpid.must_query[PAMI_XFER_ALLGATHERV_INT][0] == MPID_COLL_USE_MPICH)
-         fprintf(stderr,"Selecting MPICH for allgatherv below %d size comm %p\n", comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLGATHERV_INT][0], comm_ptr);
-      if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] == MPID_COLL_OPTIMIZED)
-         fprintf(stderr,"Selecting %s for opt bcast up to size %d comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name,
-            comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0], comm_ptr);
-      if((comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] == MPID_COLL_NOQUERY) ||
-	 (comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] == MPID_COLL_ALWAYS_QUERY))
-         fprintf(stderr,"Selecting %s (mustquery=%d) for opt bcast above size %d comm %p\n",
-		 comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][1].name,
-		 comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1],
-		 comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0], comm_ptr);
-      if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALLV_INT] == MPID_COLL_OPTIMIZED)
-         fprintf(stderr,"Selecting %s for opt alltoallv comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLTOALLV_INT][0].name, comm_ptr);
-      if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALL] == MPID_COLL_OPTIMIZED)
-         fprintf(stderr,"Selecting %s for opt alltoall comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLTOALL][0].name, comm_ptr);
-      if(comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_USE_MPICH)
-         fprintf(stderr,"Selecting MPICH for allreduce below %d size comm %p\n", comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
-      else
+    }
+  }
+
+
+  if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm_ptr->rank == 0)
+  {
+    if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BARRIER] == MPID_COLL_OPTIMIZED)
+      fprintf(stderr,"Selecting %s for opt barrier comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BARRIER][0].name, comm_ptr);
+    if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] == MPID_COLL_OPTIMIZED)
+      fprintf(stderr,"Selecting %s for opt allgatherv comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLGATHERV_INT][0].name, comm_ptr);
+    if(comm_ptr->mpid.must_query[PAMI_XFER_ALLGATHERV_INT][0] == MPID_COLL_USE_MPICH)
+      fprintf(stderr,"Selecting MPICH for allgatherv below %d size comm %p\n", comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLGATHERV_INT][0], comm_ptr);
+    if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] == MPID_COLL_OPTIMIZED)
+      fprintf(stderr,"Selecting %s for opt bcast up to size %d comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name,
+              comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0], comm_ptr);
+    if((comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] == MPID_COLL_NOQUERY) ||
+       (comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] == MPID_COLL_ALWAYS_QUERY))
+      fprintf(stderr,"Selecting %s (mustquery=%d) for opt bcast above size %d comm %p\n",
+              comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][1].name,
+              comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1],
+              comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0], comm_ptr);
+    if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALLV_INT] == MPID_COLL_OPTIMIZED)
+      fprintf(stderr,"Selecting %s for opt alltoallv comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLTOALLV_INT][0].name, comm_ptr);
+    if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALL] == MPID_COLL_OPTIMIZED)
+      fprintf(stderr,"Selecting %s for opt alltoall comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLTOALL][0].name, comm_ptr);
+    if(comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_USE_MPICH)
+      fprintf(stderr,"Selecting MPICH for allreduce below %d size comm %p\n", comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
+    else
+    {
+      if(comm_ptr->mpid.query_cached_allreduce != MPID_COLL_USE_MPICH)
+      {
+        fprintf(stderr,"Selecting %s for double sum/min/max ops allreduce, query: %d comm %p\n",
+                comm_ptr->mpid.cached_allreduce_md.name, comm_ptr->mpid.query_cached_allreduce, comm_ptr);
+      }
+      fprintf(stderr,"Selecting %s for other operations allreduce up to %d comm %p\n",
+              comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLREDUCE][0].name, 
+              comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
+    }
+    if(comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][1] == MPID_COLL_USE_MPICH)
+      fprintf(stderr,"Selecting MPICH for allreduce above %d size comm %p\n", comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
+    else
+    {
+      if(comm_ptr->mpid.query_cached_allreduce != MPID_COLL_USE_MPICH)
       {
-         if(comm_ptr->mpid.query_cached_allreduce != MPID_COLL_USE_MPICH)
-         {
-            fprintf(stderr,"Selecting %s for double sum/min/max ops allreduce, query: %d comm %p\n",
-               comm_ptr->mpid.cached_allreduce_md.name, comm_ptr->mpid.query_cached_allreduce, comm_ptr);
-         }
-         fprintf(stderr,"Selecting %s for other operations allreduce up to %d comm %p\n",
-               comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLREDUCE][0].name, 
-               comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
+        fprintf(stderr,"Selecting %s for double sum/min/max ops allreduce, above %d query: %d comm %p\n",
+                comm_ptr->mpid.cached_allreduce_md.name, 
+                comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0],
+                comm_ptr->mpid.query_cached_allreduce, comm_ptr);
       }
-      if(comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][1] == MPID_COLL_USE_MPICH)
-         fprintf(stderr,"Selecting MPICH for allreduce above %d size comm %p\n", comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
       else
       {
-         if(comm_ptr->mpid.query_cached_allreduce != MPID_COLL_USE_MPICH)
-         {
-            fprintf(stderr,"Selecting %s for double sum/min/max ops allreduce, above %d query: %d comm %p\n",
-               comm_ptr->mpid.cached_allreduce_md.name, 
-               comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0],
-               comm_ptr->mpid.query_cached_allreduce, comm_ptr);
-         }
-         else
-         {
-            fprintf(stderr,"Selecting MPICH for double sum/min/max ops allreduce, above %d size comm %p\n",
-               comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
-         }
-         fprintf(stderr,"Selecting %s for other operations allreduce over %d comm %p\n",
-            comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLREDUCE][1].name,
-            comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
+        fprintf(stderr,"Selecting MPICH for double sum/min/max ops allreduce, above %d size comm %p\n",
+                comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
       }
-   }
+      fprintf(stderr,"Selecting %s for other operations allreduce over %d comm %p\n",
+              comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLREDUCE][1].name,
+              comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
+    }
+  }
 
-   TRACE_ERR("Leaving MPIDI_Comm_coll_select\n");
+  TRACE_ERR("Leaving MPIDI_Comm_coll_select\n");
 
 }
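For readers following the selection logic in the hunk above: per collective, the code keeps two protocol slots split by a message-size cutoff ([0] at or below, [1] above), each with a query mode that decides whether the protocol is used directly, queried per call, or abandoned for MPICH. A minimal, self-contained sketch of that dispatch — every type and field name here is a hypothetical stand-in, not the actual MPIDI/PAMI structures:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins -- not the real MPIDI types. */
enum query_mode { COLL_NOQUERY, COLL_QUERY, COLL_CHECK_FN_REQUIRED, COLL_USE_MPICH };

struct proto_slot { const char *name; enum query_mode mode; };

struct coll_entry {
  struct proto_slot slot[2];  /* [0] = small messages, [1] = large */
  size_t cutoff;              /* byte cutoff between the two slots */
};

/* Return the selected protocol name for a message size, or NULL when
 * the chosen slot falls back to MPICH. */
static const char *pick_protocol(const struct coll_entry *e, size_t bytes)
{
  const struct proto_slot *s = (bytes <= e->cutoff) ? &e->slot[0] : &e->slot[1];
  return (s->mode == COLL_USE_MPICH) ? NULL : s->name;
}
```

This mirrors the non-FCA case above, where ShortAllreduce is a query protocol good up to 512 bytes and everything larger punts to MPICH.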
 

http://git.mpich.org/mpich.git/commitdiff/b2a8f02bdb8e1e0668e2d32a10f5ba1869f22592

commit b2a8f02bdb8e1e0668e2d32a10f5ba1869f22592
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Tue Nov 27 16:56:39 2012 -0600

    Support MPI_IN_PLACE metadata
    
    (ibm) Issue 9136
    (ibm) 1e1246237339bde2e68905bc007b7814c3f380f2
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>
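    The recurring hunk in this commit adds the same check to each collective:
    under MPID_COLL_ALWAYS_QUERY, if the protocol metadata does not advertise
    in-place support but the caller passed MPI_IN_PLACE, the query result is
    marked unspecified so the routine falls back to MPICH. A hedged sketch of
    that pattern — the types and sentinel below are stand-ins, not the real
    PAMI metadata structs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical MPI_IN_PLACE-style sentinel: a unique, non-NULL address. */
static const int in_place_tag;
#define IN_PLACE ((const void *)&in_place_tag)

/* Hypothetical stand-ins for the metadata bits touched by this commit. */
typedef struct { bool inplace; } check_values_t;
typedef struct { check_values_t values; } check_correct_t;
typedef struct { check_correct_t check_correct; } my_metadata_t;
typedef struct { unsigned unspecified; } my_result_t;

/* If the protocol's metadata does not claim in-place support but the
 * caller passed the in-place sentinel, flag the result; the caller then
 * treats a nonzero result as "query failed, use MPICH". */
static my_result_t query_inplace(const my_metadata_t *md, const void *sendbuf)
{
  my_result_t result = { 0 };
  if (!md->check_correct.values.inplace && sendbuf == IN_PLACE)
    result.unspecified = 1;
  return result;
}
```

    Centralizing the check like this (rather than pasting it into each
    MPIDO_* routine, as the diff below does) would be one way to avoid the
    repetition, at the cost of threading the metadata type through a helper.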

diff --git a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
index c881db3..5dcb2ac 100644
--- a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
+++ b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
@@ -258,7 +258,7 @@ MPIDO_Allgather(const void *sendbuf,
    const int selected_type = mpid->user_selected_type[PAMI_XFER_ALLGATHER];
 
    for (i=0;i<6;i++) config[i] = 1;
-   const pami_metadata_t *my_md;
+   const pami_metadata_t *my_md = (pami_metadata_t *)NULL;
 
 
    allred.cb_done = allred_cb_done;
@@ -403,6 +403,8 @@ MPIDO_Allgather(const void *sendbuf,
          if(queryreq == MPID_COLL_ALWAYS_QUERY)
          {
            /* process metadata bits */
+           if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
+              result.check.unspecified = 1;
          }
          else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
            result = my_md->check_fn(&allgather);
@@ -411,7 +413,7 @@ MPIDO_Allgather(const void *sendbuf,
          if(result.bitmask)
          {
            if(unlikely(verbose))
-             fprintf(stderr,"Query failed for %s.\n",
+             fprintf(stderr,"Query failed for %s.  Using MPICH allgather\n",
                      my_md->name);
            MPIDI_Update_last_algorithm(comm_ptr, "ALLGATHER_MPICH");
            return MPIR_Allgather(sendbuf, sendcount, sendtype,
@@ -434,7 +436,6 @@ MPIDO_Allgather(const void *sendbuf,
       TRACE_ERR("Calling PAMI_Collective with allgather structure\n");
       MPIDI_Post_coll_t allgather_post;
       MPIDI_Context_post(MPIDI_Context[0], &allgather_post.state, MPIDI_Pami_post_wrapper, (void *)&allgather);
-      TRACE_ERR("Allgather %s\n", MPIDI_Process.context_post.active>0?"posted":"invoked");
 
       MPIDI_Update_last_algorithm(comm_ptr, my_md->name);
       MPID_PROGRESS_WAIT_WHILE(allgather_active);
diff --git a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
index 44d3498..fd8e3a4 100644
--- a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
+++ b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
@@ -283,7 +283,7 @@ MPIDO_Allgatherv(const void *sendbuf,
   volatile unsigned allgatherv_active = 1;
   pami_type_t stype, rtype;
   int tmp;
-  const pami_metadata_t *my_md;
+  const pami_metadata_t *my_md = (pami_metadata_t *)NULL;
 
   for(i=0;i<6;i++) config[i] = 1;
 
@@ -301,7 +301,6 @@ MPIDO_Allgatherv(const void *sendbuf,
    use_alltoall = mpid->allgathervs[2];
    use_tree_reduce = mpid->allgathervs[0];
    use_bcast = mpid->allgathervs[1];
-   /* Assuming PAMI doesn't support MPI_IN_PLACE */
    use_pami = selected_type != MPID_COLL_USE_MPICH;
 	 
    if((sendbuf != MPI_IN_PLACE) && (MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS))
@@ -433,6 +432,8 @@ MPIDO_Allgatherv(const void *sendbuf,
          if(queryreq == MPID_COLL_ALWAYS_QUERY)
          {
            /* process metadata bits */
+           if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
+              result.check.unspecified = 1;
          }
          else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
            result = my_md->check_fn(&allgatherv);
@@ -441,7 +442,7 @@ MPIDO_Allgatherv(const void *sendbuf,
          if(result.bitmask)
          {
            if(unlikely(verbose))
-             fprintf(stderr,"Query failed for %s\n", my_md->name);
+             fprintf(stderr,"Query failed for %s. Using MPICH allgatherv.\n", my_md->name);
            MPIDI_Update_last_algorithm(comm_ptr, "ALLGATHERV_MPICH");
            return MPIR_Allgatherv(sendbuf, sendcount, sendtype,
                                   recvbuf, recvcounts, displs, recvtype,
@@ -460,7 +461,6 @@ MPIDO_Allgatherv(const void *sendbuf,
                  my_md->name,
               (unsigned) comm_ptr->context_id);
       }
-      TRACE_ERR("Calling allgatherv via %s()\n", MPIDI_Process.context_post.active>0?"PAMI_Collective":"PAMI_Context_post");
       MPIDI_Post_coll_t allgatherv_post;
       MPIDI_Context_post(MPIDI_Context[0], &allgatherv_post.state,
                          MPIDI_Pami_post_wrapper, (void *)&allgatherv);
diff --git a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
index 3185832..9edab4e 100644
--- a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
+++ b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
@@ -59,7 +59,7 @@ int MPIDO_Allreduce(const void *sendbuf,
   volatile unsigned active = 1;
   pami_xfer_t allred;
   pami_algorithm_t my_allred = 0;
-  const pami_metadata_t *my_allred_md = (pami_metadata_t *)NULL;
+  const pami_metadata_t *my_md = (pami_metadata_t *)NULL;
   int alg_selected = 0;
   const int rank = comm_ptr->rank;
   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
@@ -80,7 +80,7 @@ int MPIDO_Allreduce(const void *sendbuf,
     {
       /* double protocol works on all message sizes */
       my_allred = mpid->cached_allreduce;
-      my_allred_md = &mpid->cached_allreduce_md;
+      my_md = &mpid->cached_allreduce_md;
       alg_selected = 1;
     }
     if(likely(op == MPI_SUM))
@@ -140,14 +140,14 @@ int MPIDO_Allreduce(const void *sendbuf,
       if(mpid->query_cached_allreduce != MPID_COLL_USE_MPICH)
       { /* try the cached algorithm first, assume it's always a query algorithm so query now */
         my_allred = mpid->cached_allreduce;
-        my_allred_md = &mpid->cached_allreduce_md;
+        my_md = &mpid->cached_allreduce_md;
         alg_selected = 1;
-        if(my_allred_md->check_fn != NULL)/*This should always be the case in FCA.. Otherwise punt to mpich*/
+        if(my_md->check_fn != NULL)/*This should always be the case in FCA.. Otherwise punt to mpich*/
         {
           metadata_result_t result = {0};
           TRACE_ERR("querying allreduce algorithm %s\n",
-                  my_allred_md->name);
-          result = my_allred_md->check_fn(&allred);
+                  my_md->name);
+          result = my_md->check_fn(&allred);
           TRACE_ERR("bitmask: %#X\n", result.bitmask);
           /* \todo Ignore check_correct.values.nonlocal until we implement the
                    'pre-allreduce allreduce' or the 'safe' environment flag.
@@ -160,30 +160,30 @@ int MPIDO_Allreduce(const void *sendbuf,
           {
             alg_selected = 0;
             if(unlikely(verbose))
-              fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
+              fprintf(stderr,"check_fn failed for %s.\n", my_md->name);
           }
         }
         else /* no check_fn, manually look at the metadata fields */
         {
           TRACE_ERR("Optimzed selection line %d\n",__LINE__);
          /* Check if the message range is restricted */
-          if(my_allred_md->check_correct.values.rangeminmax)
+          if(my_md->check_correct.values.rangeminmax)
           {
             MPI_Aint data_true_lb;
             MPID_Datatype *data_ptr;
             int data_size, data_contig;
             MPIDI_Datatype_get_info(count, dt, data_contig, data_size, data_ptr, data_true_lb); 
-            if((my_allred_md->range_lo <= data_size) &&
-               (my_allred_md->range_hi >= data_size))
+            if((my_md->range_lo <= data_size) &&
+               (my_md->range_hi >= data_size))
               ; /* ok, algorithm selected */
             else
             {
               if(unlikely(verbose))
                 fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
                         data_size,
-                        my_allred_md->range_lo,
-                        my_allred_md->range_hi,
-                        my_allred_md->name);
+                        my_md->range_lo,
+                        my_md->range_hi,
+                        my_md->name);
               alg_selected = 0;
             }
           }
@@ -201,7 +201,7 @@ int MPIDO_Allreduce(const void *sendbuf,
         {
           TRACE_ERR("Optimzed selection line %d\n",__LINE__);
           my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
-          my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
+          my_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
           alg_selected = 1;
         }
         else if(queryreq1 == MPID_COLL_NOQUERY &&
@@ -209,7 +209,7 @@ int MPIDO_Allreduce(const void *sendbuf,
         {
           TRACE_ERR("Optimzed selection line %d\n",__LINE__);
           my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][1];
-          my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
+          my_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
           alg_selected = 1;
         }
         else if(((queryreq0 == MPID_COLL_CHECK_FN_REQUIRED) ||
@@ -220,7 +220,7 @@ int MPIDO_Allreduce(const void *sendbuf,
         {
           TRACE_ERR("Optimzed selection line %d\n",__LINE__);
           my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
-          my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
+          my_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
           alg_selected = 1;
           queryreq = queryreq0;
         }
@@ -230,7 +230,7 @@ int MPIDO_Allreduce(const void *sendbuf,
         {  
           TRACE_ERR("Optimzed selection line %d\n",__LINE__);
           my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][1];
-          my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
+          my_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
           alg_selected = 1;
           queryreq = queryreq1;
         }
@@ -242,12 +242,12 @@ int MPIDO_Allreduce(const void *sendbuf,
                       (queryreq == MPID_COLL_ALWAYS_QUERY)))
           {
             TRACE_ERR("Optimzed selection line %d\n",__LINE__);
-            if(my_allred_md->check_fn != NULL)/*This should always be the case in FCA.. Otherwise punt to mpich*/
+            if(my_md->check_fn != NULL)/*This should always be the case in FCA.. Otherwise punt to mpich*/
             {
               metadata_result_t result = {0};
               TRACE_ERR("querying allreduce algorithm %s\n",
-                        my_allred_md->name);
-              result = my_allred_md->check_fn(&allred);
+                        my_md->name);
+              result = my_md->check_fn(&allred);
               TRACE_ERR("bitmask: %#X\n", result.bitmask);
               /* \todo Ignore check_correct.values.nonlocal until we implement the
                 'pre-allreduce allreduce' or the 'safe' environment flag.
@@ -260,30 +260,30 @@ int MPIDO_Allreduce(const void *sendbuf,
               {
                 alg_selected = 0;
                 if(unlikely(verbose))
-                  fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
+                  fprintf(stderr,"check_fn failed for %s.\n", my_md->name);
               }
             } 
             else /* no check_fn, manually look at the metadata fields */
             {
               TRACE_ERR("Optimzed selection line %d\n",__LINE__);
              /* Check if the message range is restricted */
-              if(my_allred_md->check_correct.values.rangeminmax)
+              if(my_md->check_correct.values.rangeminmax)
               {
                 MPI_Aint data_true_lb;
                 MPID_Datatype *data_ptr;
                 int data_size, data_contig;
                 MPIDI_Datatype_get_info(count, dt, data_contig, data_size, data_ptr, data_true_lb); 
-                if((my_allred_md->range_lo <= data_size) &&
-                   (my_allred_md->range_hi >= data_size))
+                if((my_md->range_lo <= data_size) &&
+                   (my_md->range_hi >= data_size))
                   ; /* ok, algorithm selected */
                 else
                 {
                   if(unlikely(verbose))
                     fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
                             data_size,
-                            my_allred_md->range_lo,
-                            my_allred_md->range_hi,
-                            my_allred_md->name);
+                            my_md->range_lo,
+                            my_md->range_hi,
+                            my_md->name);
                   alg_selected = 0;
                 }
               }
@@ -292,7 +292,7 @@ int MPIDO_Allreduce(const void *sendbuf,
           }
           else
           {
-            TRACE_ERR("Using %s for allreduce\n", my_allred_md->name);
+            TRACE_ERR("Using %s for allreduce\n", my_md->name);
           }
         }
       }
@@ -301,19 +301,19 @@ int MPIDO_Allreduce(const void *sendbuf,
     {
       TRACE_ERR("Non-Optimzed selection line %d\n",__LINE__);
       my_allred = mpid->user_selected[PAMI_XFER_ALLREDUCE];
-      my_allred_md = &mpid->user_metadata[PAMI_XFER_ALLREDUCE];
+      my_md = &mpid->user_metadata[PAMI_XFER_ALLREDUCE];
       if(selected_type == MPID_COLL_QUERY ||
          selected_type == MPID_COLL_ALWAYS_QUERY ||
          selected_type == MPID_COLL_CHECK_FN_REQUIRED)
       {
         TRACE_ERR("Non-Optimzed selection line %d\n",__LINE__);
-        if(my_allred_md->check_fn != NULL)
+        if(my_md->check_fn != NULL)
         {
           /* For now, we don't distinguish between MPID_COLL_ALWAYS_QUERY &
              MPID_COLL_CHECK_FN_REQUIRED, we just call the fn                */
           metadata_result_t result = {0};
           TRACE_ERR("querying allreduce algorithm %s, type was %d\n",
-                    my_allred_md->name,
+                    my_md->name,
                     selected_type);
           result = mpid->user_metadata[PAMI_XFER_ALLREDUCE].check_fn(&allred);
           TRACE_ERR("bitmask: %#X\n", result.bitmask);
@@ -326,28 +326,28 @@ int MPIDO_Allreduce(const void *sendbuf,
             alg_selected = 1; /* query algorithm successfully selected */
           else
             if(unlikely(verbose))
-              fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
+              fprintf(stderr,"check_fn failed for %s.\n", my_md->name);
         }
         else /* no check_fn, manually look at the metadata fields */
         {
           TRACE_ERR("Non-Optimzed selection line %d\n",__LINE__);
          /* Check if the message range is restricted */
-          if(my_allred_md->check_correct.values.rangeminmax)
+          if(my_md->check_correct.values.rangeminmax)
           {
             MPI_Aint data_true_lb;
             MPID_Datatype *data_ptr;
             int data_size, data_contig;
             MPIDI_Datatype_get_info(count, dt, data_contig, data_size, data_ptr, data_true_lb); 
-            if((my_allred_md->range_lo <= data_size) &&
-               (my_allred_md->range_hi >= data_size))
+            if((my_md->range_lo <= data_size) &&
+               (my_md->range_hi >= data_size))
               alg_selected = 1; /* query algorithm successfully selected */
             else
               if(unlikely(verbose))
                 fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
                       data_size,
-                      my_allred_md->range_lo,
-                      my_allred_md->range_hi,
-                      my_allred_md->name);
+                      my_md->range_lo,
+                      my_md->range_hi,
+                      my_md->name);
           }
           /* \todo check the rest of the metadata */
         }
@@ -374,7 +374,7 @@ int MPIDO_Allreduce(const void *sendbuf,
     threadID = (unsigned long long int)tid;
     fprintf(stderr,"<%llx> Using protocol %s for allreduce on %u\n", 
             threadID,
-            my_allred_md->name,
+            my_md->name,
             (unsigned) comm_ptr->context_id);
   }
 
@@ -383,7 +383,7 @@ int MPIDO_Allreduce(const void *sendbuf,
                      MPIDI_Pami_post_wrapper, (void *)&allred);
 
   MPID_assert(rc == PAMI_SUCCESS);
-  MPIDI_Update_last_algorithm(comm_ptr,my_allred_md->name);
+  MPIDI_Update_last_algorithm(comm_ptr,my_md->name);
   MPID_PROGRESS_WAIT_WHILE(active);
   TRACE_ERR("allreduce done\n");
   return MPI_SUCCESS;
diff --git a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
index c5f54fa..b930030 100644
--- a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
+++ b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
@@ -91,23 +91,23 @@ int MPIDO_Alltoall(const void *sendbuf,
 
    pami_xfer_t alltoall;
    pami_algorithm_t my_alltoall;
-   const pami_metadata_t *my_alltoall_md;
+   const pami_metadata_t *my_md = (pami_metadata_t *)NULL;
    int queryreq = 0;
    if(selected_type == MPID_COLL_OPTIMIZED)
    {
       TRACE_ERR("Optimized alltoall was pre-selected\n");
       my_alltoall = mpid->opt_protocol[PAMI_XFER_ALLTOALL][0];
-      my_alltoall_md = &mpid->opt_protocol_md[PAMI_XFER_ALLTOALL][0];
+      my_md = &mpid->opt_protocol_md[PAMI_XFER_ALLTOALL][0];
       queryreq = mpid->must_query[PAMI_XFER_ALLTOALL][0];
    }
    else
    {
       TRACE_ERR("Alltoall was specified by user\n");
       my_alltoall = mpid->user_selected[PAMI_XFER_ALLTOALL];
-      my_alltoall_md = &mpid->user_metadata[PAMI_XFER_ALLTOALL];
+      my_md = &mpid->user_metadata[PAMI_XFER_ALLTOALL];
       queryreq = selected_type;
    }
-   char *pname = my_alltoall_md->name;
+   char *pname = my_md->name;
    TRACE_ERR("Using alltoall protocol %s\n", pname);
 
    alltoall.cb_done = cb_alltoall;
@@ -141,15 +141,17 @@ int MPIDO_Alltoall(const void *sendbuf,
       if(queryreq == MPID_COLL_ALWAYS_QUERY)
       {
         /* process metadata bits */
+         if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
+            result.check.unspecified = 1;
       }
       else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
-         result = my_alltoall_md->check_fn(&alltoall);
+         result = my_md->check_fn(&alltoall);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
       if(result.bitmask)
       {
         if(unlikely(verbose))
-           fprintf(stderr,"Query failed for %s\n", pname);
+           fprintf(stderr,"Query failed for %s. Using MPICH alltoall.\n", pname);
         MPIDI_Update_last_algorithm(comm_ptr, "ALLTOALL_MPICH");
         return MPIR_Alltoall_intra(sendbuf, sendcount, sendtype,
                                    recvbuf, recvcount, recvtype,
@@ -165,7 +167,7 @@ int MPIDO_Alltoall(const void *sendbuf,
       threadID = (unsigned long long int)tid;
       fprintf(stderr,"<%llx> Using protocol %s for alltoall on %u\n", 
               threadID,
-              my_alltoall_md->name,
+              my_md->name,
               (unsigned) comm_ptr->context_id);
    }
 
diff --git a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
index 96bb397..ccdc3ba 100644
--- a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
+++ b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
@@ -82,25 +82,25 @@ int MPIDO_Alltoallv(const void *sendbuf,
 
    pami_xfer_t alltoallv;
    pami_algorithm_t my_alltoallv;
-   const pami_metadata_t *my_alltoallv_md;
+   const pami_metadata_t *my_md = (pami_metadata_t *)NULL;
    int queryreq = 0;
 
    if(selected_type == MPID_COLL_OPTIMIZED)
    {
       TRACE_ERR("Optimized alltoallv was selected\n");
       my_alltoallv = mpid->opt_protocol[PAMI_XFER_ALLTOALLV_INT][0];
-      my_alltoallv_md = &mpid->opt_protocol_md[PAMI_XFER_ALLTOALLV_INT][0];
+      my_md = &mpid->opt_protocol_md[PAMI_XFER_ALLTOALLV_INT][0];
       queryreq = mpid->must_query[PAMI_XFER_ALLTOALLV_INT][0];
    }
    else
    { /* is this purely an else? or do i need to check for some other selectedvar... */
       TRACE_ERR("Alltoallv specified by user\n");
       my_alltoallv = mpid->user_selected[PAMI_XFER_ALLTOALLV_INT];
-      my_alltoallv_md = &mpid->user_metadata[PAMI_XFER_ALLTOALLV_INT];
+      my_md = &mpid->user_metadata[PAMI_XFER_ALLTOALLV_INT];
       queryreq = selected_type;
    }
    alltoallv.algorithm = my_alltoallv;
-   char *pname = my_alltoallv_md->name;
+   char *pname = my_md->name;
 
 
    alltoallv.cb_done = cb_alltoallv;
@@ -137,20 +137,29 @@ int MPIDO_Alltoallv(const void *sendbuf,
       if(queryreq == MPID_COLL_ALWAYS_QUERY)
       {
         /* process metadata bits */
+         if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
+            result.check.unspecified = 1;
       }
       else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
-         result = my_alltoallv_md->check_fn(&alltoallv);
+         result = my_md->check_fn(&alltoallv);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
       if(result.bitmask)
       {
         if(unlikely(verbose))
-          fprintf(stderr,"Query failed for %s\n", pname);
+          fprintf(stderr,"Query failed for %s. Using MPICH alltoallv\n", pname);
         MPIDI_Update_last_algorithm(comm_ptr, "ALLTOALLV_MPICH");
         return MPIR_Alltoallv(sendbuf, sendcounts, senddispls, sendtype,
                               recvbuf, recvcounts, recvdispls, recvtype,
                               comm_ptr, mpierrno);
       }
+      if(my_md->check_correct.values.asyncflowctl) 
+      { /* need better flow control than a barrier every time */
+         int tmpmpierrno;   
+         if(unlikely(verbose))
+            fprintf(stderr,"Query barrier required for %s\n", pname);
+         MPIR_Barrier(comm_ptr, &tmpmpierrno);
+      }
    }
 
    if(unlikely(verbose))
@@ -161,7 +170,7 @@ int MPIDO_Alltoallv(const void *sendbuf,
       threadID = (unsigned long long int)tid;
       fprintf(stderr,"<%llx> Using protocol %s for alltoallv on %u\n", 
               threadID,
-              my_alltoallv_md->name,
+              my_md->name,
               (unsigned) comm_ptr->context_id);
    }
 
diff --git a/src/mpid/pamid/src/coll/barrier/mpido_barrier.c b/src/mpid/pamid/src/coll/barrier/mpido_barrier.c
index abf51d9..23e5e6d 100644
--- a/src/mpid/pamid/src/coll/barrier/mpido_barrier.c
+++ b/src/mpid/pamid/src/coll/barrier/mpido_barrier.c
@@ -39,7 +39,7 @@ int MPIDO_Barrier(MPID_Comm *comm_ptr, int *mpierrno)
    MPIDI_Post_coll_t barrier_post;
    pami_xfer_t barrier;
    pami_algorithm_t my_barrier;
-   const pami_metadata_t *my_barrier_md;
+   const pami_metadata_t *my_md = (pami_metadata_t *)NULL;
    int queryreq = 0;
    const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
    const int selected_type = mpid->user_selected_type[PAMI_XFER_BARRIER];
@@ -64,14 +64,14 @@ int MPIDO_Barrier(MPID_Comm *comm_ptr, int *mpierrno)
    {
       TRACE_ERR("Optimized barrier (%s) was pre-selected\n", mpid->opt_protocol_md[PAMI_XFER_BARRIER][0].name);
       my_barrier = mpid->opt_protocol[PAMI_XFER_BARRIER][0];
-      my_barrier_md = &mpid->opt_protocol_md[PAMI_XFER_BARRIER][0];
+      my_md = &mpid->opt_protocol_md[PAMI_XFER_BARRIER][0];
       queryreq = mpid->must_query[PAMI_XFER_BARRIER][0];
    }
    else
    {
       TRACE_ERR("Barrier (%s) was specified by user\n", mpid->user_metadata[PAMI_XFER_BARRIER].name);
       my_barrier = mpid->user_selected[PAMI_XFER_BARRIER];
-      my_barrier_md = &mpid->user_metadata[PAMI_XFER_BARRIER];
+      my_md = &mpid->user_metadata[PAMI_XFER_BARRIER];
       queryreq = selected_type;
    }
 
@@ -88,16 +88,14 @@ int MPIDO_Barrier(MPID_Comm *comm_ptr, int *mpierrno)
       threadID = (unsigned long long int)tid;
      fprintf(stderr,"<%llx> Using protocol %s for barrier on %u\n", 
              threadID,
-             my_barrier_md->name,
+             my_md->name,
             (unsigned) comm_ptr->context_id);
    }
-   TRACE_ERR("%s barrier\n", MPIDI_Process.context_post.active>0?"posting":"invoking");
    MPIDI_Context_post(MPIDI_Context[0], &barrier_post.state,
                       MPIDI_Pami_post_wrapper, (void *)&barrier);
-   TRACE_ERR("barrier %s rc: %d\n", MPIDI_Process.context_post.active>0?"posted":"invoked", rc);
 
    TRACE_ERR("advance spinning\n");
-   MPIDI_Update_last_algorithm(comm_ptr, my_barrier_md->name);
+   MPIDI_Update_last_algorithm(comm_ptr, my_md->name);
    MPID_PROGRESS_WAIT_WHILE(active);
    TRACE_ERR("exiting mpido_barrier\n");
    return 0;
@@ -131,4 +129,4 @@ int MPIDO_Barrier_simple(MPID_Comm *comm_ptr, int *mpierrno)
    MPID_PROGRESS_WAIT_WHILE(active);
    TRACE_ERR("Exiting MPIDO_Barrier_optimized\n");
    return 0;
-}
\ No newline at end of file
+}
diff --git a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
index 9e320b3..4578613 100644
--- a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
+++ b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
@@ -129,7 +129,7 @@ int MPIDO_Bcast(void *buffer,
 
    pami_xfer_t bcast;
    pami_algorithm_t my_bcast;
-   const pami_metadata_t *my_bcast_md;
+   const pami_metadata_t *my_md = (pami_metadata_t *)NULL;
    int queryreq = 0;
 
    bcast.cb_done = cb_bcast;
@@ -152,7 +152,7 @@ int MPIDO_Bcast(void *buffer,
         if(data_size <= mpid->cutoff_size[PAMI_XFER_BROADCAST][1])
         {
           my_bcast = mpid->opt_protocol[PAMI_XFER_BROADCAST][1];
-          my_bcast_md = &mpid->opt_protocol_md[PAMI_XFER_BROADCAST][1];
+          my_md = &mpid->opt_protocol_md[PAMI_XFER_BROADCAST][1];
           queryreq = mpid->must_query[PAMI_XFER_BROADCAST][1];
         }
         else
@@ -164,13 +164,13 @@ int MPIDO_Bcast(void *buffer,
       if(data_size > mpid->cutoff_size[PAMI_XFER_BROADCAST][0])
       {
          my_bcast = mpid->opt_protocol[PAMI_XFER_BROADCAST][1];
-         my_bcast_md = &mpid->opt_protocol_md[PAMI_XFER_BROADCAST][1];
+         my_md = &mpid->opt_protocol_md[PAMI_XFER_BROADCAST][1];
          queryreq = mpid->must_query[PAMI_XFER_BROADCAST][1];
       }
       else
       {
          my_bcast = mpid->opt_protocol[PAMI_XFER_BROADCAST][0];
-         my_bcast_md = &mpid->opt_protocol_md[PAMI_XFER_BROADCAST][0];
+         my_md = &mpid->opt_protocol_md[PAMI_XFER_BROADCAST][0];
          queryreq = mpid->must_query[PAMI_XFER_BROADCAST][0];
       }
    }
@@ -179,7 +179,7 @@ int MPIDO_Bcast(void *buffer,
       TRACE_ERR("Bcast (%s) was specified by user\n",
          mpid->user_metadata[PAMI_XFER_BROADCAST].name);
       my_bcast =  mpid->user_selected[PAMI_XFER_BROADCAST];
-      my_bcast_md = &mpid->user_metadata[PAMI_XFER_BROADCAST];
+      my_md = &mpid->user_metadata[PAMI_XFER_BROADCAST];
       queryreq = selected_type;
    }
 
@@ -190,13 +190,13 @@ int MPIDO_Bcast(void *buffer,
    {
       metadata_result_t result = {0};
       TRACE_ERR("querying bcast protocol %s, type was: %d\n",
-         my_bcast_md->name, queryreq);
+         my_md->name, queryreq);
       if(queryreq == MPID_COLL_ALWAYS_QUERY)
       {
         /* process metadata bits */
       }
       else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
-         result = my_bcast_md->check_fn(&bcast);
+         result = my_md->check_fn(&bcast);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
       if(result.bitmask)
@@ -216,12 +216,12 @@ int MPIDO_Bcast(void *buffer,
       threadID = (unsigned long long int)tid;
       fprintf(stderr,"<%llx> Using protocol %s for bcast on %u\n", 
               threadID,
-              my_bcast_md->name,
+              my_md->name,
               (unsigned) comm_ptr->context_id);
    }
 
    MPIDI_Context_post(MPIDI_Context[0], &bcast_post.state, MPIDI_Pami_post_wrapper, (void *)&bcast);
-   MPIDI_Update_last_algorithm(comm_ptr, my_bcast_md->name);
+   MPIDI_Update_last_algorithm(comm_ptr, my_md->name);
    MPID_PROGRESS_WAIT_WHILE(active);
    TRACE_ERR("bcast done\n");
 
diff --git a/src/mpid/pamid/src/coll/gather/mpido_gather.c b/src/mpid/pamid/src/coll/gather/mpido_gather.c
index d8721a1..0bebefe 100644
--- a/src/mpid/pamid/src/coll/gather/mpido_gather.c
+++ b/src/mpid/pamid/src/coll/gather/mpido_gather.c
@@ -212,7 +212,7 @@ int MPIDO_Gather(const void *sendbuf,
 
 
    pami_algorithm_t my_gather;
-   const pami_metadata_t *my_gather_md;
+   const pami_metadata_t *my_md = (pami_metadata_t *)NULL;
    int queryreq = 0;
    volatile unsigned active = 1;
 
@@ -242,7 +242,7 @@ int MPIDO_Gather(const void *sendbuf,
       TRACE_ERR("Optimized gather (%s) was pre-selected\n",
          mpid->opt_protocol_md[PAMI_XFER_GATHER][0].name);
       my_gather = mpid->opt_protocol[PAMI_XFER_GATHER][0];
-      my_gather_md = &mpid->opt_protocol_md[PAMI_XFER_GATHER][0];
+      my_md = &mpid->opt_protocol_md[PAMI_XFER_GATHER][0];
       queryreq = mpid->must_query[PAMI_XFER_GATHER][0];
    }
    else
@@ -250,7 +250,7 @@ int MPIDO_Gather(const void *sendbuf,
       TRACE_ERR("Optimized gather (%s) was specified by user\n",
       mpid->user_metadata[PAMI_XFER_GATHER].name);
       my_gather = mpid->user_selected[PAMI_XFER_GATHER];
-      my_gather_md = &mpid->user_metadata[PAMI_XFER_GATHER];
+      my_md = &mpid->user_metadata[PAMI_XFER_GATHER];
       queryreq = selected_type;
    }
 
@@ -260,19 +260,21 @@ int MPIDO_Gather(const void *sendbuf,
    {
       metadata_result_t result = {0};
       TRACE_ERR("querying gather protocol %s, type was %d\n",
-         my_gather_md->name, queryreq);
+         my_md->name, queryreq);
       if(queryreq == MPID_COLL_ALWAYS_QUERY)
       {
         /* process metadata bits */
+        if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
+           result.check.unspecified = 1;
       }
       else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
-        result = my_gather_md->check_fn(&gather);
+        result = my_md->check_fn(&gather);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
       if(result.bitmask)
       {
         if(unlikely(verbose))
-          fprintf(stderr,"query failed for %s\n", my_gather_md->name);
+          fprintf(stderr,"query failed for %s\n", my_md->name);
         MPIDI_Update_last_algorithm(comm_ptr, "GATHER_MPICH");
         return MPIR_Gather(sendbuf, sendcount, sendtype,
                            recvbuf, recvcount, recvtype,
@@ -292,11 +294,10 @@ int MPIDO_Gather(const void *sendbuf,
       threadID = (unsigned long long int)tid;
       fprintf(stderr,"<%llx> Using protocol %s for gather on %u\n", 
               threadID,
-              my_gather_md->name,
+              my_md->name,
               (unsigned) comm_ptr->context_id);
    }
 
-   TRACE_ERR("%s gather\n", MPIDI_Process.context_post.active>0?"Posting":"Invoking");
    MPIDI_Context_post(MPIDI_Context[0], &gather_post.state,
                       MPIDI_Pami_post_wrapper, (void *)&gather);
 
diff --git a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
index a5f9ffe..e6c7273 100644
--- a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
+++ b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
@@ -118,7 +118,7 @@ int MPIDO_Gatherv(const void *sendbuf,
    gatherv.cmd.xfer_gatherv_int.sndbuf = sbuf;
 
    pami_algorithm_t my_gatherv;
-   const pami_metadata_t *my_gatherv_md;
+   const pami_metadata_t *my_md = (pami_metadata_t *)NULL;
    int queryreq = 0;
 
    if(selected_type == MPID_COLL_OPTIMIZED)
@@ -126,7 +126,7 @@ int MPIDO_Gatherv(const void *sendbuf,
       TRACE_ERR("Optimized gatherv %s was selected\n",
          mpid->opt_protocol_md[PAMI_XFER_GATHERV_INT][0].name);
       my_gatherv = mpid->opt_protocol[PAMI_XFER_GATHERV_INT][0];
-      my_gatherv_md = &mpid->opt_protocol_md[PAMI_XFER_GATHERV_INT][0];
+      my_md = &mpid->opt_protocol_md[PAMI_XFER_GATHERV_INT][0];
       queryreq = mpid->must_query[PAMI_XFER_GATHERV_INT][0];
    }
    else
@@ -134,7 +134,7 @@ int MPIDO_Gatherv(const void *sendbuf,
       TRACE_ERR("Optimized gatherv %s was set by user\n",
          mpid->user_metadata[PAMI_XFER_GATHERV_INT].name);
          my_gatherv = mpid->user_selected[PAMI_XFER_GATHERV_INT];
-         my_gatherv_md = &mpid->user_metadata[PAMI_XFER_GATHERV_INT];
+         my_md = &mpid->user_metadata[PAMI_XFER_GATHERV_INT];
          queryreq = selected_type;
    }
 
@@ -146,19 +146,21 @@ int MPIDO_Gatherv(const void *sendbuf,
    {
       metadata_result_t result = {0};
       TRACE_ERR("querying gatherv protocol %s, type was %d\n", 
-         my_gatherv_md->name, queryreq);
+         my_md->name, queryreq);
       if(queryreq == MPID_COLL_ALWAYS_QUERY)
       {
-        /* process metadata bits */
+         /* process metadata bits */
+         if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
+            result.check.unspecified = 1;
       }
       else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
-         result = my_gatherv_md->check_fn(&gatherv);
+         result = my_md->check_fn(&gatherv);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
       if(result.bitmask)
       {
          if(unlikely(verbose))
-            fprintf(stderr,"Query failed for %s\n", my_gatherv_md->name);
+            fprintf(stderr,"Query failed for %s. Using MPICH gatherv.\n", my_md->name);
          MPIDI_Update_last_algorithm(comm_ptr, "GATHERV_MPICH");
          return MPIR_Gatherv(sendbuf, sendcount, sendtype,
                              recvbuf, recvcounts, displs, recvtype,
@@ -166,7 +168,7 @@ int MPIDO_Gatherv(const void *sendbuf,
       }
    }
    
-   MPIDI_Update_last_algorithm(comm_ptr, my_gatherv_md->name);
+   MPIDI_Update_last_algorithm(comm_ptr, my_md->name);
 
    if(unlikely(verbose))
    {
@@ -176,15 +178,13 @@ int MPIDO_Gatherv(const void *sendbuf,
       threadID = (unsigned long long int)tid;
       fprintf(stderr,"<%llx> Using protocol %s for gatherv on %u\n", 
               threadID,
-              my_gatherv_md->name,
+              my_md->name,
               (unsigned) comm_ptr->context_id);
    }
 
    MPIDI_Post_coll_t gatherv_post;
-   TRACE_ERR("%s gatherv\n", MPIDI_Process.context_post.active>0?"Posting":"Invoking");
    MPIDI_Context_post(MPIDI_Context[0], &gatherv_post.state,
                       MPIDI_Pami_post_wrapper, (void *)&gatherv);
-   TRACE_ERR("Gatherv %s\n", MPIDI_Process.context_post.active>0?"posted":"invoked");
    
    TRACE_ERR("Waiting on active %d\n", gatherv_active);
    MPID_PROGRESS_WAIT_WHILE(gatherv_active);
diff --git a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
index 0f664ad..c8c0bb7 100644
--- a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
+++ b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
@@ -66,7 +66,7 @@ int MPIDO_Reduce(const void *sendbuf,
 
    pami_xfer_t reduce;
    pami_algorithm_t my_reduce=0;
-   const pami_metadata_t *my_reduce_md=NULL;
+   const pami_metadata_t *my_md = (pami_metadata_t *)NULL;
    int queryreq = 0;
    volatile unsigned reduce_active = 1;
 
@@ -97,7 +97,7 @@ int MPIDO_Reduce(const void *sendbuf,
         TRACE_ERR("Optimized Reduce (%s) was pre-selected\n",
          mpid->opt_protocol_md[PAMI_XFER_REDUCE][0].name);
         my_reduce    = mpid->opt_protocol[PAMI_XFER_REDUCE][0];
-        my_reduce_md = &mpid->opt_protocol_md[PAMI_XFER_REDUCE][0];
+        my_md = &mpid->opt_protocol_md[PAMI_XFER_REDUCE][0];
         queryreq     = mpid->must_query[PAMI_XFER_REDUCE][0];
       }
 
@@ -107,7 +107,7 @@ int MPIDO_Reduce(const void *sendbuf,
       TRACE_ERR("Optimized reduce (%s) was specified by user\n",
       mpid->user_metadata[PAMI_XFER_REDUCE].name);
       my_reduce    =  mpid->user_selected[PAMI_XFER_REDUCE];
-      my_reduce_md = &mpid->user_metadata[PAMI_XFER_REDUCE];
+      my_md = &mpid->user_metadata[PAMI_XFER_REDUCE];
       queryreq     = selected_type;
    }
    reduce.algorithm = my_reduce;
@@ -124,25 +124,27 @@ int MPIDO_Reduce(const void *sendbuf,
    if(unlikely(queryreq == MPID_COLL_ALWAYS_QUERY || 
                queryreq == MPID_COLL_CHECK_FN_REQUIRED))
    {
-      if(my_reduce_md->check_fn != NULL)
+      if(my_md->check_fn != NULL)
       {
          metadata_result_t result = {0};
          TRACE_ERR("Querying reduce protocol %s, type was %d\n",
-            my_reduce_md->name,
+            my_md->name,
             queryreq);
          if(queryreq == MPID_COLL_ALWAYS_QUERY)
          {
             /* process metadata bits */
+            if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
+               result.check.unspecified = 1;
          }
          else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
-            result = my_reduce_md->check_fn(&reduce);
+            result = my_md->check_fn(&reduce);
          TRACE_ERR("Bitmask: %#X\n", result.bitmask);
          result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
          if(result.bitmask)
          {
-            if(verbose)
-              fprintf(stderr,"Query failed for %s.\n",
-                 my_reduce_md->name);
+            if(unlikely(verbose))
+              fprintf(stderr,"Query failed for %s.  Using MPICH reduce.\n",
+                 my_md->name);
          }
          else alg_selected = 1;
       }
@@ -164,15 +166,12 @@ int MPIDO_Reduce(const void *sendbuf,
          threadID = (unsigned long long int)tid;
          fprintf(stderr,"<%llx> Using protocol %s for reduce on %u\n", 
                  threadID,
-                 my_reduce_md->name,
+                 my_md->name,
               (unsigned) comm_ptr->context_id);
       }
-      TRACE_ERR("%s reduce, context %d, algoname: %s, exflag: %d\n", MPIDI_Process.context_post.active>0?"Posting":"Invoking", 0,
-                my_reduce_md->name, exflag);
       MPIDI_Post_coll_t reduce_post;
       MPIDI_Context_post(MPIDI_Context[0], &reduce_post.state,
                          MPIDI_Pami_post_wrapper, (void *)&reduce);
-      TRACE_ERR("Reduce %s\n", MPIDI_Process.context_post.active>0?"posted":"invoked");
    }
    else
    {
@@ -183,7 +182,7 @@ int MPIDO_Reduce(const void *sendbuf,
    }
 
    MPIDI_Update_last_algorithm(comm_ptr,
-                               my_reduce_md->name);
+                               my_md->name);
    MPID_PROGRESS_WAIT_WHILE(reduce_active);
    TRACE_ERR("Reduce done\n");
    return 0;
diff --git a/src/mpid/pamid/src/coll/scan/mpido_scan.c b/src/mpid/pamid/src/coll/scan/mpido_scan.c
index ba5854d..6c073c1 100644
--- a/src/mpid/pamid/src/coll/scan/mpido_scan.c
+++ b/src/mpid/pamid/src/coll/scan/mpido_scan.c
@@ -63,7 +63,7 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
    pami_data_function pop;
    pami_type_t pdt;
    int rc;
-   const pami_metadata_t *my_md;
+   const pami_metadata_t *my_md = (pami_metadata_t *)NULL;
    int queryreq = 0;
 #if ASSERT_LEVEL==0
    /* We can't afford the tracing in ndebug/performance libraries */
@@ -140,6 +140,8 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
       if(queryreq == MPID_COLL_ALWAYS_QUERY)
       {
         /* process metadata bits */
+         if((!my_md->check_correct.values.inplace) && (sendbuf == MPI_IN_PLACE))
+            result.check.unspecified = 1;
       }
       else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
          result = my_md->check_fn(&scan);
@@ -147,8 +149,9 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
       if(result.bitmask)
       {
-         fprintf(stderr,"Query failed for %s.\n",
-            my_md->name);
+         if(unlikely(verbose))
+            fprintf(stderr,"Query failed for %s.  Using MPICH scan\n",
+                    my_md->name);
          MPIDI_Update_last_algorithm(comm_ptr, "SCAN_MPICH");
          if(exflag)
             return MPIR_Exscan(sendbuf, recvbuf, count, datatype, op, comm_ptr, mpierrno);
@@ -172,7 +175,6 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
    MPIDI_Post_coll_t scan_post;
    MPIDI_Context_post(MPIDI_Context[0], &scan_post.state,
                       MPIDI_Pami_post_wrapper, (void *)&scan);
-   TRACE_ERR("Scan %s\n", MPIDI_Process.context_post.active>0?"posted":"invoked");
    MPIDI_Update_last_algorithm(comm_ptr, my_md->name);
    MPID_PROGRESS_WAIT_WHILE(scan_active);
    TRACE_ERR("Scan done\n");
diff --git a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
index b5ce1a4..01667d2 100644
--- a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
+++ b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
@@ -185,7 +185,7 @@ int MPIDO_Scatter(const void *sendbuf,
    pami_xfer_t scatter;
    MPIDI_Post_coll_t scatter_post;
    pami_algorithm_t my_scatter;
-   const pami_metadata_t *my_scatter_md;
+   const pami_metadata_t *my_md = (pami_metadata_t *)NULL;
    volatile unsigned scatter_active = 1;
    int queryreq = 0;
 
@@ -194,7 +194,7 @@ int MPIDO_Scatter(const void *sendbuf,
       TRACE_ERR("Optimized scatter %s was selected\n",
          mpid->opt_protocol_md[PAMI_XFER_SCATTER][0].name);
       my_scatter = mpid->opt_protocol[PAMI_XFER_SCATTER][0];
-      my_scatter_md = &mpid->opt_protocol_md[PAMI_XFER_SCATTER][0];
+      my_md = &mpid->opt_protocol_md[PAMI_XFER_SCATTER][0];
       queryreq = mpid->must_query[PAMI_XFER_SCATTER][0];
    }
    else
@@ -202,7 +202,7 @@ int MPIDO_Scatter(const void *sendbuf,
       TRACE_ERR("Optimized scatter %s was set by user\n",
          mpid->user_metadata[PAMI_XFER_SCATTER].name);
       my_scatter = mpid->user_selected[PAMI_XFER_SCATTER];
-      my_scatter_md = &mpid->user_metadata[PAMI_XFER_SCATTER];
+      my_md = &mpid->user_metadata[PAMI_XFER_SCATTER];
       queryreq = selected_type;
    }
  
@@ -235,19 +235,21 @@ int MPIDO_Scatter(const void *sendbuf,
    {
       metadata_result_t result = {0};
       TRACE_ERR("querying scatter protoocl %s, type was %d\n",
-         my_scatter_md->name, queryreq);
+         my_md->name, queryreq);
       if(queryreq == MPID_COLL_ALWAYS_QUERY)
       {
         /* process metadata bits */
+        if((!my_md->check_correct.values.inplace) && (recvbuf == MPI_IN_PLACE))
+           result.check.unspecified = 1;
       }
       else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
-        result = my_scatter_md->check_fn(&scatter);
+        result = my_md->check_fn(&scatter);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
       if(result.bitmask)
       {
         if(unlikely(verbose))
-          fprintf(stderr,"query failed for %s\n", my_scatter_md->name);
+          fprintf(stderr,"query failed for %s\n", my_md->name);
         MPIDI_Update_last_algorithm(comm_ptr, "SCATTER_MPICH");
         return MPIR_Scatter(sendbuf, sendcount, sendtype,
                             recvbuf, recvcount, recvtype,
@@ -263,10 +265,9 @@ int MPIDO_Scatter(const void *sendbuf,
       threadID = (unsigned long long int)tid;
       fprintf(stderr,"<%llx> Using protocol %s for scatter on %u\n", 
               threadID,
-              my_scatter_md->name,
+              my_md->name,
               (unsigned) comm_ptr->context_id);
    }
-   TRACE_ERR("%s scatter\n", MPIDI_Process.context_post.active>0?"Posting":"Invoking");
    MPIDI_Context_post(MPIDI_Context[0], &scatter_post.state,
                       MPIDI_Pami_post_wrapper, (void *)&scatter);
    TRACE_ERR("Waiting on active %d\n", scatter_active);
diff --git a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
index 6e565ea..e5f8e7f 100644
--- a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
+++ b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
@@ -278,7 +278,7 @@ int MPIDO_Scatterv(const void *sendbuf,
 
    pami_xfer_t scatterv;
    pami_algorithm_t my_scatterv;
-   const pami_metadata_t *my_scatterv_md;
+   const pami_metadata_t *my_md = (pami_metadata_t *)NULL;
    volatile unsigned scatterv_active = 1;
    int queryreq = 0;
 
@@ -287,7 +287,7 @@ int MPIDO_Scatterv(const void *sendbuf,
       TRACE_ERR("Optimized scatterv %s was selected\n",
          mpid->opt_protocol_md[PAMI_XFER_SCATTERV_INT][0].name);
       my_scatterv = mpid->opt_protocol[PAMI_XFER_SCATTERV_INT][0];
-      my_scatterv_md = &mpid->opt_protocol_md[PAMI_XFER_SCATTERV_INT][0];
+      my_md = &mpid->opt_protocol_md[PAMI_XFER_SCATTERV_INT][0];
       queryreq = mpid->must_query[PAMI_XFER_SCATTERV_INT][0];
    }
    else
@@ -295,7 +295,7 @@ int MPIDO_Scatterv(const void *sendbuf,
       TRACE_ERR("User selected %s for scatterv\n",
       mpid->user_selected[PAMI_XFER_SCATTERV_INT]);
       my_scatterv = mpid->user_selected[PAMI_XFER_SCATTERV_INT];
-      my_scatterv_md = &mpid->user_metadata[PAMI_XFER_SCATTERV_INT];
+      my_md = &mpid->user_metadata[PAMI_XFER_SCATTERV_INT];
       queryreq = selected_type;
    }
 
@@ -354,19 +354,21 @@ int MPIDO_Scatterv(const void *sendbuf,
    {
       metadata_result_t result = {0};
       TRACE_ERR("querying scatterv protocol %s, type was %d\n",
-         my_scatterv_md->name, queryreq);
+         my_md->name, queryreq);
       if(queryreq == MPID_COLL_ALWAYS_QUERY)
       {
         /* process metadata bits */
+        if((!my_md->check_correct.values.inplace) && (recvbuf == MPI_IN_PLACE))
+           result.check.unspecified = 1;
       }
       else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
-        result = my_scatterv_md->check_fn(&scatterv);
+        result = my_md->check_fn(&scatterv);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
       if(result.bitmask)
       {
         if(unlikely(verbose))
-          fprintf(stderr,"Query failed for %s\n", my_scatterv_md->name);
+          fprintf(stderr,"Query failed for %s. Using MPICH scatterv.\n", my_md->name);
         MPIDI_Update_last_algorithm(comm_ptr, "SCATTERV_MPICH");
         return MPIR_Scatterv(sendbuf, sendcounts, displs, sendtype,
                              recvbuf, recvcount, recvtype,
@@ -374,7 +376,7 @@ int MPIDO_Scatterv(const void *sendbuf,
       }
    }
 
-   MPIDI_Update_last_algorithm(comm_ptr, my_scatterv_md->name);
+   MPIDI_Update_last_algorithm(comm_ptr, my_md->name);
 
    if(unlikely(verbose))
    {
@@ -384,11 +386,10 @@ int MPIDO_Scatterv(const void *sendbuf,
       threadID = (unsigned long long int)tid;
       fprintf(stderr,"<%llx> Using protocol %s for scatterv on %u\n", 
               threadID,
-              my_scatterv_md->name,
+              my_md->name,
               (unsigned) comm_ptr->context_id);
    }
    MPIDI_Post_coll_t scatterv_post;
-   TRACE_ERR("%s scatterv\n", MPIDI_Process.context_post.active>0?"Posting":"Invoking");
    MPIDI_Context_post(MPIDI_Context[0], &scatterv_post.state,
                       MPIDI_Pami_post_wrapper, (void *)&scatterv);
 
@@ -554,7 +555,6 @@ int MPIDO_Scatterv_simple(const void *sendbuf,
   /* set the internal control flow to disable internal star tuning */
    if(mpid->preallreduces[MPID_SCATTERV_PREALLREDUCE])
    {
-     TRACE_ERR("%s scatterv pre-allreduce\n", MPIDI_Process.context_post.active>0?"Posting":"Invoking");
      MPIDI_Post_coll_t allred_post;
      rc = MPIDI_Context_post(MPIDI_Context[0], &allred_post.state,
                              MPIDI_Pami_post_wrapper, (void *)&allred);
diff --git a/src/mpid/pamid/src/comm/mpid_optcolls.c b/src/mpid/pamid/src/comm/mpid_optcolls.c
index 35f9571..d9b19bc 100644
--- a/src/mpid/pamid/src/comm/mpid_optcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_optcolls.c
@@ -234,7 +234,7 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       /* Use I0:RectangleDput */
       for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_ALLGATHERV_INT][1]; i++)
       {
-         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLGATHERV_INT][0][i].name, "I0:RectangleDput:SHMEM:MU") == 0)
+         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLGATHERV_INT][1][i].name, "I0:RectangleDput:SHMEM:MU") == 0)
          {
             opt_proto = i;
             mustquery = 1;
@@ -254,11 +254,12 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
          comm_ptr->mpid.must_query[PAMI_XFER_ALLGATHERV_INT][0] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
          comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] = MPID_COLL_OPTIMIZED;
       }
-      else
+      else /* no optimized allgatherv? */
       {
          TRACE_ERR("Couldn't find optimial allgatherv[int] protocol\n");
          comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] = MPID_COLL_USE_MPICH;
          comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLGATHERV_INT][0] = 0;
+         comm_ptr->mpid.allgathervs[0] = 1; /* Use GLUE_ALLREDUCE */
       }
       TRACE_ERR("Done setting optimized allgatherv[int]\n");
    }

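The one-character fix in the hunk above (`coll_metadata[...][0][i]` to `coll_metadata[...][1][i]`) matters because the loop bound comes from `coll_count[...][1]`: the loop was iterating over the count of one metadata list while reading names out of another. A reduced sketch of the search, with a hypothetical flattened name list standing in for the two-dimensional metadata table:

```c
#include <assert.h>
#include <string.h>
#include <strings.h>  /* strcasecmp (POSIX) */

/* Hypothetical flattened view of one metadata list: the fix keeps the
 * loop bound (coll_count[...][1]) and the indexed list
 * (coll_metadata[...][1]) in step. Returns the matching index
 * (opt_proto in the diff) or -1 to signal the MPICH fallback. */
static int find_protocol(const char *const names[], int count,
                         const char *want)
{
    for (int i = 0; i < count; i++)
        if (strcasecmp(names[i], want) == 0)
            return i;
    return -1;
}
```

With the mismatched index, `count` could exceed (or undershoot) the length of the list actually being read, so the search could walk past the end of one array or silently miss `I0:RectangleDput:SHMEM:MU` in the other.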
http://git.mpich.org/mpich.git/commitdiff/0e3b48bdf1935ff6869e37d0ea19873e3f8282fb

commit 0e3b48bdf1935ff6869e37d0ea19873e3f8282fb
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Tue Nov 27 15:59:10 2012 -0600

    Revert "Trac #636:Disable optimized allgatherv"
    
    This reverts commit 4aa0587ffd9e6246bdf48723c234d17c1844612c.
    
    (ibm) 3de8e2735df35e0a94dd532cf1fb9e0eba8e460f
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/comm/mpid_optcolls.c b/src/mpid/pamid/src/comm/mpid_optcolls.c
index f8d52bb..35f9571 100644
--- a/src/mpid/pamid/src/comm/mpid_optcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_optcolls.c
@@ -208,11 +208,6 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
    if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] == MPID_COLL_NOSELECTION)
    {
       MPIDI_Coll_comm_check_FCA("ALLGATHERV","I1:AllgathervInt:FCA:FCA",PAMI_XFER_ALLGATHERV_INT,MPID_COLL_NOQUERY, 0, comm_ptr);
-      if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] != MPID_COLL_OPTIMIZED)
-      {
-        /* SSS: FCA not selected, then punt to mpich (avoiding rectangledput from running */
-        comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] = MPID_COLL_USE_MPICH;
-      }
    }
    if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] == MPID_COLL_NOSELECTION)
    {

http://git.mpich.org/mpich.git/commitdiff/4fcf7a46582b799599a41997984bef9f1420c994

commit 4fcf7a46582b799599a41997984bef9f1420c994
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Tue Nov 27 15:54:18 2012 -0600

    Revert "Ticket #632: Disable optimized alltoall[v]'s with MPI_IN_PLACE"
    
    This reverts commit bbb81a6003d69df30231d0a69da7dfceb156e4f0.
    
    (ibm) a7ecbb4fe666762cb202195bd07cbe0e67bf360e
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
index 84d4a89..96bb397 100644
--- a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
+++ b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
@@ -15,7 +15,7 @@
 /*                                                                  */
 /* end_generated_IBM_copyright_prolog                               */
 /*  (C)Copyright IBM Corp.  2007, 2011  */
-/**
+/**  
  * \file src/coll/alltoallv/mpido_alltoallv.c
  * \brief ???
  */

http://git.mpich.org/mpich.git/commitdiff/e741c721d6b57d32fcca37b031b6e625d7dba4ef

commit e741c721d6b57d32fcca37b031b6e625d7dba4ef
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Tue Nov 27 11:17:35 2012 -0600

    Fix double allreduce when there is no cached protocol
    
    (ibm) Issue 8597
    (ibm) d65b89385b12cb726118693bea96831ac8ba0cf9
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
index f99e27b..3185832 100644
--- a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
+++ b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
@@ -75,7 +75,8 @@ int MPIDO_Allreduce(const void *sendbuf,
   {
     rc = MPI_SUCCESS;
     pdt = PAMI_TYPE_DOUBLE;
-    if(likely(selected_type == MPID_COLL_OPTIMIZED))
+    if(likely(selected_type == MPID_COLL_OPTIMIZED) &&
+       (mpid->query_cached_allreduce != MPID_COLL_USE_MPICH))
     {
       /* double protocol works on all message sizes */
       my_allred = mpid->cached_allreduce;
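    Stripped of the PAMI details, the guard this commit adds reduces to a small predicate. The sketch below is illustrative only: the enum values and function name are simplified stand-ins for the MPID_COLL_* states, not MPICH's actual API.

```c
#include <assert.h>

/* Illustrative stand-ins for the MPID_COLL_* selection states. */
enum coll_state { COLL_OPTIMIZED, COLL_QUERY, COLL_USE_MPICH };

/* The MPI_DOUBLE fast path may only take the cached allreduce protocol
 * when one was actually cached: query_cached == COLL_USE_MPICH means
 * "no cached protocol, fall back to MPICH". Before this fix the fast
 * path checked only selected_type and took the cached slot anyway. */
static int use_cached_allreduce(enum coll_state selected_type,
                                enum coll_state query_cached)
{
    return selected_type == COLL_OPTIMIZED && query_cached != COLL_USE_MPICH;
}
```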

http://git.mpich.org/mpich.git/commitdiff/0d5992f038460c80ad3af86b038936136f97e439

commit 0d5992f038460c80ad3af86b038936136f97e439
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Mon Nov 26 16:18:12 2012 -0600

    Use optimized protocol for more supported dt/op combinations
    
    (ibm) Issue 8597
    (ibm) d05ae82a0621be1a435716e6177acb154ae90ec0
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_datatypes.h b/src/mpid/pamid/include/mpidi_datatypes.h
index b96d488..8f9bf88 100644
--- a/src/mpid/pamid/include/mpidi_datatypes.h
+++ b/src/mpid/pamid/include/mpidi_datatypes.h
@@ -321,29 +321,21 @@ struct MPIDI_Comm
   /* For create_tasklist/endpoints if we ever use it */
   pami_task_t *tasks;
   pami_endpoint_t *endpoints;
-   /* There are some protocols where the optimized protocol always works and
-    * is the best performance */
-   /* Assume we have small vs large cutoffs vs medium for some protocols */
-   pami_algorithm_t opt_protocol[PAMI_XFER_COUNT][2];
-   int must_query[PAMI_XFER_COUNT][2];
-   pami_metadata_t opt_protocol_md[PAMI_XFER_COUNT][2];
-   int cutoff_size[PAMI_XFER_COUNT][2];
-   /* Our best allreduce double protocol only works on 
-    * doubles and sum/min/max. Since that is a common
-    * occurance let's cache that protocol and call
-    * it without checking */
-   pami_algorithm_t cached_allred_dsmm; /*dsmm = double, sum/min/max */
-   pami_metadata_t cached_allred_dsmm_md;
-   int query_allred_dsmm; 
-
-   /* We have some integer optimized protocols that only work on
-    * sum/min/max but also have datasize/ppn <= 8k limitations */
-   /* Using Amith's protocol, these work on int/min/max/sum of SMALL messages */
-   pami_algorithm_t cached_allred_ismm;
-   pami_metadata_t cached_allred_ismm_md;
-   /* Because this only works at select message sizes, this will have to be
-    * nonzero */
-   int query_allred_ismm;
+  /* There are some protocols where the optimized protocol always works and
+   * is the best performance */
+  /* Assume we have small vs large cutoffs vs medium for some protocols */
+  pami_algorithm_t opt_protocol[PAMI_XFER_COUNT][2];
+  int must_query[PAMI_XFER_COUNT][2];
+  pami_metadata_t opt_protocol_md[PAMI_XFER_COUNT][2];
+  int cutoff_size[PAMI_XFER_COUNT][2];
+  /* Our best allreduce protocol always works on 
+   * doubles and sum/min/max. Since that is a common
+   * occurrence, let's cache that protocol and call
+   * it without checking.  Any other dt/op must be 
+   * checked */ 
+  pami_algorithm_t cached_allreduce;
+  pami_metadata_t cached_allreduce_md;
+  int query_cached_allreduce; 
 
   union tasks_descrip_t {
     /* For create_taskrange */
diff --git a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
index 03b4188..f99e27b 100644
--- a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
+++ b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
@@ -23,7 +23,10 @@
 /* #define TRACE_ON */
 
 #include <mpidimpl.h>
-
+/* 
+#undef TRACE_ERR
+#define TRACE_ERR(format, ...) fprintf(stderr, format, ##__VA_ARGS__)
+*/
 static void cb_allreduce(void *ctxt, void *clientdata, pami_result_t err)
 {
   int *active = (int *) clientdata;
@@ -55,7 +58,7 @@ int MPIDO_Allreduce(const void *sendbuf,
 #endif
   volatile unsigned active = 1;
   pami_xfer_t allred;
-  pami_algorithm_t my_allred;
+  pami_algorithm_t my_allred = 0;
   const pami_metadata_t *my_allred_md = (pami_metadata_t *)NULL;
   int alg_selected = 0;
   const int rank = comm_ptr->rank;
@@ -67,17 +70,29 @@ int MPIDO_Allreduce(const void *sendbuf,
 #else
   const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (rank == 0);
 #endif
+  int queryreq = 0;
   if(likely(dt == MPI_DOUBLE || dt == MPI_DOUBLE_PRECISION))
   {
     rc = MPI_SUCCESS;
     pdt = PAMI_TYPE_DOUBLE;
+    if(likely(selected_type == MPID_COLL_OPTIMIZED))
+    {
+      /* double protocol works on all message sizes */
+      my_allred = mpid->cached_allreduce;
+      my_allred_md = &mpid->cached_allreduce_md;
+      alg_selected = 1;
+    }
     if(likely(op == MPI_SUM))
       pop = PAMI_DATA_SUM;
     else if(likely(op == MPI_MAX))
       pop = PAMI_DATA_MAX;
     else if(likely(op == MPI_MIN))
       pop = PAMI_DATA_MIN;
-    else rc = MPIDI_Datatype_to_pami(dt, &pdt, op, &pop, &mu);
+    else 
+    {
+      alg_selected = 0;
+      rc = MPIDI_Datatype_to_pami(dt, &pdt, op, &pop, &mu);
+    }
   }
   else rc = MPIDI_Datatype_to_pami(dt, &pdt, op, &pop, &mu);
 
@@ -116,129 +131,190 @@ int MPIDO_Allreduce(const void *sendbuf,
   allred.cmd.xfer_allreduce.op = pop;
 
   TRACE_ERR("Allreduce - Basic Collective Selection\n");
-  if(likely(selected_type == MPID_COLL_OPTIMIZED))
+
+  if(unlikely(!alg_selected)) /* Cached double algorithm not selected above */
   {
-    if(likely(pop == PAMI_DATA_SUM || pop == PAMI_DATA_MAX || pop == PAMI_DATA_MIN))
+    if(likely(selected_type == MPID_COLL_OPTIMIZED))
     {
-      /* double protocol works on all message sizes */
-      if(likely(pdt == PAMI_TYPE_DOUBLE && mpid->query_allred_dsmm == MPID_COLL_QUERY))
-      {
-        my_allred = mpid->cached_allred_dsmm;
-        my_allred_md = &mpid->cached_allred_dsmm_md;
-        alg_selected = 1;
-      }
-      else if(pdt == PAMI_TYPE_UNSIGNED_INT && mpid->query_allred_ismm == MPID_COLL_QUERY)
-      {
-        my_allred = mpid->cached_allred_ismm;
-        my_allred_md = &mpid->cached_allred_ismm_md;
-        alg_selected = 1;
-      }
-      /* The integer protocol at >1 ppn requires small messages only */
-      else if(pdt == PAMI_TYPE_UNSIGNED_INT && mpid->query_allred_ismm == MPID_COLL_CHECK_FN_REQUIRED &&
-              count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
-      {
-        my_allred = mpid->cached_allred_ismm;
-        my_allred_md = &mpid->cached_allred_ismm_md;
-        alg_selected = 1;
-      }
-      else if(mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_NOQUERY &&
-              count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
-      {
-        my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
-        my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
-        alg_selected = 1;
-      }
-      else if(mpid->must_query[PAMI_XFER_ALLREDUCE][1] == MPID_COLL_NOQUERY &&
-              count > mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
-      {
-        my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][1];
-        my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
+      if(mpid->query_cached_allreduce != MPID_COLL_USE_MPICH)
+      { /* try the cached algorithm first, assume it's always a query algorithm so query now */
+        my_allred = mpid->cached_allreduce;
+        my_allred_md = &mpid->cached_allreduce_md;
         alg_selected = 1;
+        if(my_allred_md->check_fn != NULL)/* This should always be the case in FCA. Otherwise punt to MPICH */
+        {
+          metadata_result_t result = {0};
+          TRACE_ERR("querying allreduce algorithm %s\n",
+                  my_allred_md->name);
+          result = my_allred_md->check_fn(&allred);
+          TRACE_ERR("bitmask: %#X\n", result.bitmask);
+          /* \todo Ignore check_correct.values.nonlocal until we implement the
+                   'pre-allreduce allreduce' or the 'safe' environment flag.
+                   We will basically assume 'safe' -- that all ranks are aligned (or not).
+          */
+          result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+          if(!result.bitmask)
+            ; /* ok, algorithm selected */
+          else
+          {
+            alg_selected = 0;
+            if(unlikely(verbose))
+              fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
+          }
+        }
+        else /* no check_fn, manually look at the metadata fields */
+        {
+          TRACE_ERR("Optimized selection line %d\n",__LINE__);
+          /* Check if the message range is restricted */
+          if(my_allred_md->check_correct.values.rangeminmax)
+          {
+            MPI_Aint data_true_lb;
+            MPID_Datatype *data_ptr;
+            int data_size, data_contig;
+            MPIDI_Datatype_get_info(count, dt, data_contig, data_size, data_ptr, data_true_lb); 
+            if((my_allred_md->range_lo <= data_size) &&
+               (my_allred_md->range_hi >= data_size))
+              ; /* ok, algorithm selected */
+            else
+            {
+              if(unlikely(verbose))
+                fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                        data_size,
+                        my_allred_md->range_lo,
+                        my_allred_md->range_hi,
+                        my_allred_md->name);
+              alg_selected = 0;
+            }
+          }
+          /* \todo check the rest of the metadata */
+        }
       }
-      else if((mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_CHECK_FN_REQUIRED) ||
-              (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_QUERY) ||
-              (mpid->must_query[PAMI_XFER_ALLREDUCE][0] ==  MPID_COLL_ALWAYS_QUERY))
+      /* If we didn't use the cached protocol above (query failed?) then check regular optimized protocol fields */
+      if(!alg_selected)
       {
-        if((mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] == 0) || 
-           (count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] && mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] > 0))
+        const int queryreq0 = mpid->must_query[PAMI_XFER_ALLREDUCE][0];
+        const int queryreq1 = mpid->must_query[PAMI_XFER_ALLREDUCE][1];
+        /* TODO this really needs to be cleaned up for BGQ and fca  */
+        if(queryreq0 == MPID_COLL_NOQUERY &&
+           count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
         {
+          TRACE_ERR("Optimized selection line %d\n",__LINE__);
           my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
           my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
           alg_selected = 1;
         }
-      }
-    }
-    else
-    {
-      /* so we aren't one of the key ops... */
-      if(mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_NOQUERY &&
-         count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
-      {
-        my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
-        my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
-        alg_selected = 1;
-      }
-      else if(mpid->must_query[PAMI_XFER_ALLREDUCE][1] == MPID_COLL_NOQUERY &&
-              count > mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
-      {
-        my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][1];
-        my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
-        alg_selected = 1;
-      }
-      else if((mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_CHECK_FN_REQUIRED) ||
-              (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_QUERY) ||
-              (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_ALWAYS_QUERY))
-      {
-        if((mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] == 0) || 
-           (count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] && mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] > 0))
+        else if(queryreq1 == MPID_COLL_NOQUERY &&
+                count > mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
         {
+          TRACE_ERR("Optimized selection line %d\n",__LINE__);
+          my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][1];
+          my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
+          alg_selected = 1;
+        }
+        else if(((queryreq0 == MPID_COLL_CHECK_FN_REQUIRED) ||
+                 (queryreq0 == MPID_COLL_QUERY) ||
+                 (queryreq0 ==  MPID_COLL_ALWAYS_QUERY)) &&
+                ((mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] == 0) || 
+                 (count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] && mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] > 0)))
+        {
+          TRACE_ERR("Optimized selection line %d\n",__LINE__);
           my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
           my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
           alg_selected = 1;
+          queryreq = queryreq0;
         }
-      }
-    }
-    TRACE_ERR("Alg selected: %d\n", alg_selected);
-    if(likely(alg_selected))
-    {
-      if(unlikely(mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_CHECK_FN_REQUIRED))
-      {
-        if(my_allred_md->check_fn != NULL)/*This should always be the case in FCA.. Otherwise punt to mpich*/
+        else if((queryreq1 == MPID_COLL_CHECK_FN_REQUIRED) ||
+                (queryreq1 == MPID_COLL_QUERY) ||
+                (queryreq1 ==  MPID_COLL_ALWAYS_QUERY))
+        {  
+          TRACE_ERR("Optimized selection line %d\n",__LINE__);
+          my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][1];
+          my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
+          alg_selected = 1;
+          queryreq = queryreq1;
+        }
+        TRACE_ERR("Alg selected: %d\n", alg_selected);
+        if(likely(alg_selected))
         {
-          metadata_result_t result = {0};
-          TRACE_ERR("querying allreduce algorithm %s, type was %d\n",
-                    my_allred_md->name,
-                    mpid->must_query[PAMI_XFER_ALLREDUCE]);
-          result = my_allred_md->check_fn(&allred);
-          TRACE_ERR("bitmask: %#X\n", result.bitmask);
-          /* \todo Ignore check_correct.values.nonlocal until we implement the
-             'pre-allreduce allreduce' or the 'safe' environment flag.
-             We will basically assume 'safe' -- that all ranks are aligned (or not).
-          */
-          result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
-          if(!result.bitmask)
+          if(unlikely((queryreq == MPID_COLL_CHECK_FN_REQUIRED) ||
+                      (queryreq == MPID_COLL_QUERY) ||
+                      (queryreq == MPID_COLL_ALWAYS_QUERY)))
           {
-            allred.algorithm = my_allred;
+            TRACE_ERR("Optimized selection line %d\n",__LINE__);
+            if(my_allred_md->check_fn != NULL)/* This should always be the case in FCA. Otherwise punt to MPICH */
+            {
+              metadata_result_t result = {0};
+              TRACE_ERR("querying allreduce algorithm %s\n",
+                        my_allred_md->name);
+              result = my_allred_md->check_fn(&allred);
+              TRACE_ERR("bitmask: %#X\n", result.bitmask);
+              /* \todo Ignore check_correct.values.nonlocal until we implement the
+                'pre-allreduce allreduce' or the 'safe' environment flag.
+                We will basically assume 'safe' -- that all ranks are aligned (or not).
+              */
+              result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+              if(!result.bitmask)
+                ; /* ok, algorithm selected */
+              else
+              {
+                alg_selected = 0;
+                if(unlikely(verbose))
+                  fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
+              }
+            } 
+            else /* no check_fn, manually look at the metadata fields */
+            {
+              TRACE_ERR("Optimized selection line %d\n",__LINE__);
+              /* Check if the message range is restricted */
+              if(my_allred_md->check_correct.values.rangeminmax)
+              {
+                MPI_Aint data_true_lb;
+                MPID_Datatype *data_ptr;
+                int data_size, data_contig;
+                MPIDI_Datatype_get_info(count, dt, data_contig, data_size, data_ptr, data_true_lb); 
+                if((my_allred_md->range_lo <= data_size) &&
+                   (my_allred_md->range_hi >= data_size))
+                  ; /* ok, algorithm selected */
+                else
+                {
+                  if(unlikely(verbose))
+                    fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                            data_size,
+                            my_allred_md->range_lo,
+                            my_allred_md->range_hi,
+                            my_allred_md->name);
+                  alg_selected = 0;
+                }
+              }
+              /* \todo check the rest of the metadata */
+            }
           }
           else
           {
-            alg_selected = 0;
-            if(unlikely(verbose))
-              fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
+            TRACE_ERR("Using %s for allreduce\n", my_allred_md->name);
           }
         }
-        else alg_selected = 0;
       }
-      else if(unlikely(((mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_QUERY) ||
-                        (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_ALWAYS_QUERY))))
+    }
+    else
+    {
+      TRACE_ERR("Non-Optimized selection line %d\n",__LINE__);
+      my_allred = mpid->user_selected[PAMI_XFER_ALLREDUCE];
+      my_allred_md = &mpid->user_metadata[PAMI_XFER_ALLREDUCE];
+      if(selected_type == MPID_COLL_QUERY ||
+         selected_type == MPID_COLL_ALWAYS_QUERY ||
+         selected_type == MPID_COLL_CHECK_FN_REQUIRED)
       {
-        if(my_allred_md->check_fn != NULL)/*This should always be the case in FCA.. Otherwise punt to mpich*/
+        TRACE_ERR("Non-Optimized selection line %d\n",__LINE__);
+        if(my_allred_md->check_fn != NULL)
         {
+          /* For now, we don't distinguish between MPID_COLL_ALWAYS_QUERY &
+             MPID_COLL_CHECK_FN_REQUIRED, we just call the fn                */
           metadata_result_t result = {0};
           TRACE_ERR("querying allreduce algorithm %s, type was %d\n",
                     my_allred_md->name,
-                    mpid->must_query[PAMI_XFER_ALLREDUCE]);
-          result = my_allred_md->check_fn(&allred);
+                    selected_type);
+          result = mpid->user_metadata[PAMI_XFER_ALLREDUCE].check_fn(&allred);
           TRACE_ERR("bitmask: %#X\n", result.bitmask);
           /* \todo Ignore check_correct.values.nonlocal until we implement the
              'pre-allreduce allreduce' or the 'safe' environment flag.
@@ -246,18 +322,14 @@ int MPIDO_Allreduce(const void *sendbuf,
           */
           result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
           if(!result.bitmask)
-          {
-            allred.algorithm = my_allred;
-          }
+            alg_selected = 1; /* query algorithm successfully selected */
           else
-          {
-            alg_selected = 0;
             if(unlikely(verbose))
               fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
-          }
         }
         else /* no check_fn, manually look at the metadata fields */
         {
+          TRACE_ERR("Non-Optimized selection line %d\n",__LINE__);
           /* Check if the message range if restricted */
           if(my_allred_md->check_correct.values.rangeminmax)
           {
@@ -267,83 +339,21 @@ int MPIDO_Allreduce(const void *sendbuf,
             MPIDI_Datatype_get_info(count, dt, data_contig, data_size, data_ptr, data_true_lb); 
             if((my_allred_md->range_lo <= data_size) &&
                (my_allred_md->range_hi >= data_size))
-              allred.algorithm = my_allred; /* query algorithm successfully selected */
+              alg_selected = 1; /* query algorithm successfully selected */
             else
-            {
               if(unlikely(verbose))
                 fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
-                        data_size,
-                        my_allred_md->range_lo,
-                        my_allred_md->range_hi,
-                        my_allred_md->name);
-              alg_selected = 0;
-            }
+                      data_size,
+                      my_allred_md->range_lo,
+                      my_allred_md->range_hi,
+                      my_allred_md->name);
           }
           /* \todo check the rest of the metadata */
         }
       }
-      else
-      {
-        TRACE_ERR("Using %s for allreduce\n", my_allred_md->name);
-        allred.algorithm = my_allred;
-      }
-    }
-  }
-  else
-  {
-    my_allred = mpid->user_selected[PAMI_XFER_ALLREDUCE];
-    my_allred_md = &mpid->user_metadata[PAMI_XFER_ALLREDUCE];
-    allred.algorithm = my_allred;
-    if(selected_type == MPID_COLL_QUERY ||
-       selected_type == MPID_COLL_ALWAYS_QUERY ||
-       selected_type == MPID_COLL_CHECK_FN_REQUIRED)
-    {
-      if(my_allred_md->check_fn != NULL)
-      {
-        /* For now, we don't distinguish between MPID_COLL_ALWAYS_QUERY &
-           MPID_COLL_CHECK_FN_REQUIRED, we just call the fn                */
-        metadata_result_t result = {0};
-        TRACE_ERR("querying allreduce algorithm %s, type was %d\n",
-                  my_allred_md->name,
-                  selected_type);
-        result = mpid->user_metadata[PAMI_XFER_ALLREDUCE].check_fn(&allred);
-        TRACE_ERR("bitmask: %#X\n", result.bitmask);
-        /* \todo Ignore check_correct.values.nonlocal until we implement the
-           'pre-allreduce allreduce' or the 'safe' environment flag.
-           We will basically assume 'safe' -- that all ranks are aligned (or not).
-        */
-        result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
-        if(!result.bitmask)
-          alg_selected = 1; /* query algorithm successfully selected */
-        else
-          if(unlikely(verbose))
-          fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
-      }
-      else /* no check_fn, manually look at the metadata fields */
-      {
-        /* Check if the message range if restricted */
-        if(my_allred_md->check_correct.values.rangeminmax)
-        {
-          MPI_Aint data_true_lb;
-          MPID_Datatype *data_ptr;
-          int data_size, data_contig;
-          MPIDI_Datatype_get_info(count, dt, data_contig, data_size, data_ptr, data_true_lb); 
-          if((my_allred_md->range_lo <= data_size) &&
-             (my_allred_md->range_hi >= data_size))
-            alg_selected = 1; /* query algorithm successfully selected */
-          else
-            if(unlikely(verbose))
-            fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
-                    data_size,
-                    my_allred_md->range_lo,
-                    my_allred_md->range_hi,
-                    my_allred_md->name);
-        }
-        /* \todo check the rest of the metadata */
-      }
+      else alg_selected = 1; /* non-query algorithm selected */
+  
     }
-    else alg_selected = 1; /* non-query algorithm selected */
-
   }
 
   if(unlikely(!alg_selected)) /* must be fallback to MPICH */
@@ -353,6 +363,7 @@ int MPIDO_Allreduce(const void *sendbuf,
     MPIDI_Update_last_algorithm(comm_ptr, "ALLREDUCE_MPICH");
     return MPIR_Allreduce(sendbuf, recvbuf, count, dt, op, comm_ptr, mpierrno);
   }
+  allred.algorithm = my_allred;
 
   if(unlikely(verbose))
   {
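    The reshuffled selection logic in this file follows one cascade: try the cached double protocol, then the small-message bucket (slot 0, up to the cutoff), then the large-message bucket (slot 1), and finally fall back to MPICH. A minimal sketch of the bucket choice, with hypothetical names mirroring opt_protocol[..][0/1] and cutoff_size[..][0]:

```c
#include <assert.h>

/* Hypothetical two-bucket table mirroring opt_protocol[xfer][0/1] and
 * cutoff_size[xfer][0]; slot 0 is the small-message protocol, slot 1
 * the large-message one. A cutoff of 0 means "no size limit recorded". */
struct proto_table {
    int cutoff;
};

static int pick_bucket(const struct proto_table *t, int count)
{
    if (t->cutoff == 0 || count <= t->cutoff)
        return 0; /* small-message protocol */
    return 1;     /* large-message protocol */
}
```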
diff --git a/src/mpid/pamid/src/comm/mpid_optcolls.c b/src/mpid/pamid/src/comm/mpid_optcolls.c
index d9d3a10..f8d52bb 100644
--- a/src/mpid/pamid/src/comm/mpid_optcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_optcolls.c
@@ -28,7 +28,7 @@
 #include <mpidimpl.h>
 
 
-static int MPIDI_Check_FCA_envvar(char *string)
+static int MPIDI_Check_FCA_envvar(char *string, int *user_range_hi)
 {
    char *env = getenv("MP_MPI_PAMI_FOR");
    if(env != NULL)
@@ -36,19 +36,38 @@ static int MPIDI_Check_FCA_envvar(char *string)
       if(strcasecmp(env, "ALL") == 0)
          return 1;
       int len = strlen(env);
+      len++;
       char *temp = MPIU_Malloc(sizeof(char) * len);
       char *ptrToFree = temp;
       strcpy(temp, env);
       char *sepptr;
       for(sepptr = temp; (sepptr = strsep(&temp, ",")) != NULL ; )
       {
-         if(strcasecmp(sepptr, string) == 0)
+         char *subsepptr, *temp_sepptr;
+         temp_sepptr = sepptr;
+         subsepptr = strsep(&temp_sepptr, ":");
+         if(temp_sepptr != NULL)/* SSS: There is a colon for this collective */
          {
-            MPIU_Free(ptrToFree);
-            return 1;
+             if(strcasecmp(subsepptr, string) == 0)
+             {
+                *user_range_hi = atoi(temp_sepptr);
+                MPIU_Free(ptrToFree);
+                return 1;
+             }
+             else
+                sepptr++;
          }
          else
-            sepptr++;
+         { 
+             if(strcasecmp(sepptr, string) == 0)
+             {
+                *user_range_hi = -1;
+                MPIU_Free(ptrToFree);
+                return 1;
+             }
+             else
+                sepptr++;
+         }
       }
       /* We didn't find it, but the end var was set, so return 0 */
       MPIU_Free(ptrToFree);
@@ -71,11 +90,12 @@ MPIDI_Coll_comm_check_FCA(char *coll_name,
 {                        
    int opt_proto = -1;
    int i;
+   int user_range_hi = -1;/* SSS: By default we assume user hasn't defined a range_hi (cutoff_size) */
 #ifdef TRACE_ON
    char *envstring = getenv("MP_MPI_PAMI_FOR");
 #endif
    TRACE_ERR("Checking for %s in %s\n", coll_name, envstring);
-   int check_var = MPIDI_Check_FCA_envvar(coll_name);
+   int check_var = MPIDI_Check_FCA_envvar(coll_name, &user_range_hi);
    if(check_var == 1)
    {
       TRACE_ERR("Found %s\n",coll_name);
@@ -99,6 +119,10 @@ MPIDI_Coll_comm_check_FCA(char *coll_name,
                   &comm_ptr->mpid.coll_metadata[pami_xfer][0][opt_proto],
                   sizeof(pami_metadata_t));
             comm_ptr->mpid.must_query[pami_xfer][proto_num] = query_type;
+            if(user_range_hi != -1)
+              comm_ptr->mpid.cutoff_size[pami_xfer][proto_num] = user_range_hi;
+            else
+              comm_ptr->mpid.cutoff_size[pami_xfer][proto_num] = 0;
             comm_ptr->mpid.user_selected_type[pami_xfer] = MPID_COLL_OPTIMIZED;
       }                                                                                           
       else /* see if it is in the must query list instead */
@@ -122,6 +146,10 @@ MPIDI_Coll_comm_check_FCA(char *coll_name,
                   &comm_ptr->mpid.coll_metadata[pami_xfer][1][opt_proto],
                   sizeof(pami_metadata_t));
             comm_ptr->mpid.must_query[pami_xfer][proto_num] = query_type;
+            if(user_range_hi != -1)
+              comm_ptr->mpid.cutoff_size[pami_xfer][proto_num] = user_range_hi;
+            else
+              comm_ptr->mpid.cutoff_size[pami_xfer][proto_num] = 0;
             comm_ptr->mpid.user_selected_type[pami_xfer] = MPID_COLL_OPTIMIZED;
          }
          else /* that protocol doesn't exist */
@@ -642,42 +670,41 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
    if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] == MPID_COLL_NOSELECTION)
    {
       /* the user hasn't selected a protocol, so we can NULL the protocol/metadatas */
-      comm_ptr->mpid.query_allred_dsmm = MPID_COLL_USE_MPICH;
-      comm_ptr->mpid.query_allred_ismm = MPID_COLL_USE_MPICH;
-
-      comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0] = 128;
-      /* 1ppn: I0:MultiCombineDput:-:MU if it is available, but it has a check_fn
-       * since it is MU-based*/
-      /* Next best is I1:ShortAllreduce:P2P:P2P for short messages, then MPICH is best*/
-      /* I0:MultiCombineDput:-:MU could be used in the i/dsmm cached protocols, so we'll do that */
-      /* First, look in the 'must query' list and see i I0:MultiCombine:Dput:-:MU is there */
+      comm_ptr->mpid.query_cached_allreduce = MPID_COLL_USE_MPICH;
+
+      comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0] = 0;
+      /* For BGQ */
+      /*  1ppn: I0:MultiCombineDput:-:MU if it is available, but it has a check_fn
+       *  since it is MU-based*/
+      /*  Next best is I1:ShortAllreduce:P2P:P2P for short messages, then MPICH is best*/
+      /*  I0:MultiCombineDput:-:MU could be used in the i/dsmm cached protocols, so we'll do that */
+      /*  First, look in the 'must query' list and see if I0:MultiCombine:Dput:-:MU is there */
+
+      char *pname="...none...";
+      int user_range_hi = -1;
+      int fca_enabled = 0;
+      if((fca_enabled = MPIDI_Check_FCA_envvar("ALLREDUCE", &user_range_hi)) == 1)
+         pname = "I1:Allreduce:FCA:FCA";
+      else if(use_threaded_collectives)
+         pname = "I0:MultiCombineDput:-:MU";
+      /*SSS: Any "MU" protocol will not be available on non-BG systems. I just need to check for FCA in the 
+                   first if only. No need to do another check since the second if will never succeed for PE systems*/
       for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_ALLREDUCE][1]; i++)
       {
-         char *pname="...none...";
-         if(MPIDI_Check_FCA_envvar("ALLREDUCE") == 1)
-            pname = "I1:Allreduce:FCA:FCA";
-         else if(use_threaded_collectives)
-            pname = "I0:MultiCombineDput:-:MU";
-         /*SSS: Any "MU" protocol will not be available on non-BG systems. I just need to check for FCA in the 
-                first if only. No need to do another check since the second if will never succeed for PE systems*/
          if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i].name, pname) == 0)
          {
             /* So, this should be fine for the i/dsmm protocols. everything else needs to call the check function */
             /* This also works for all message sizes, so no need to deal with it specially for query */
-            comm_ptr->mpid.cached_allred_dsmm = 
+            comm_ptr->mpid.cached_allreduce = 
                    comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLREDUCE][1][i];
-            memcpy(&comm_ptr->mpid.cached_allred_dsmm_md,
+            memcpy(&comm_ptr->mpid.cached_allreduce_md,
                    &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i],
                   sizeof(pami_metadata_t));
-            comm_ptr->mpid.query_allred_dsmm = MPID_COLL_QUERY;
+            comm_ptr->mpid.query_cached_allreduce = MPID_COLL_QUERY;
 
-            comm_ptr->mpid.cached_allred_ismm =
-                   comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLREDUCE][1][i];
-            memcpy(&comm_ptr->mpid.cached_allred_ismm_md,
-                   &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i],
-                  sizeof(pami_metadata_t));
-            comm_ptr->mpid.query_allred_ismm = MPID_COLL_QUERY;
             comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] = MPID_COLL_OPTIMIZED;
+            if(fca_enabled && user_range_hi != -1)
+              comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0] = user_range_hi;
             opt_proto = i;
 
          }
@@ -685,39 +712,30 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
          if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i].name, "I0:MultiCombineDput:SHMEM:MU") == 0)
          {
             /* This works well for doubles sum/min/max but has trouble with int > 8k/ppn */
-            comm_ptr->mpid.cached_allred_dsmm =
+            comm_ptr->mpid.cached_allreduce =
                    comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLREDUCE][1][i];
-            memcpy(&comm_ptr->mpid.cached_allred_dsmm_md,
+            memcpy(&comm_ptr->mpid.cached_allreduce_md,
                    &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i],
                   sizeof(pami_metadata_t));
-            comm_ptr->mpid.query_allred_dsmm = MPID_COLL_QUERY;
+            comm_ptr->mpid.query_cached_allreduce = MPID_COLL_CHECK_FN_REQUIRED;
 
-            comm_ptr->mpid.cached_allred_ismm =
-                   comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLREDUCE][1][i];
-            memcpy(&comm_ptr->mpid.cached_allred_ismm_md,
-                   &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i],
-                  sizeof(pami_metadata_t));
-            comm_ptr->mpid.query_allred_ismm = MPID_COLL_CHECK_FN_REQUIRED;
             comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] = MPID_COLL_OPTIMIZED;
-            /* I don't think MPIX_HW is initialized yet, so keep this at 128 for now, which 
-             * is the upper lower limit anyway (8192 / 64) */
-            /* comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0] = 8192 / ppn; */
             opt_proto = i;
          }
       }
       /* At this point, if opt_proto != -1, we have must-query protocols in the i/dsmm caches */
       /* We should pick a backup, non-must query */
       /* I0:ShortAllreduce:P2P:P2P < 128, then mpich*/
+
+      /* SSS: ShortAllreduce is available on both BG and PE, so we only check for one
+         of them here. However, FCA must be added for both opt_protocol[0] and [1]
+         to cover all data sizes. */
+      if(fca_enabled == 1)
+         pname = "I1:Allreduce:FCA:FCA";
+      else 
+         pname = "I1:ShortAllreduce:P2P:P2P";
+
       for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_ALLREDUCE][1]; i++)
       {
-         char *pname;
-         int pickFCA = MPIDI_Check_FCA_envvar("ALLREDUCE");
-         if(pickFCA == 1)
-            pname = "I1:Allreduce:FCA:FCA";
-         else 
-            pname = "I1:ShortAllreduce:P2P:P2P";
-         /*SSS: ShortAllreduce is available on both BG and PE. I have to pick just one to check for in this case. 
-                However, I need to add FCA for both opt_protocol[0]and[1] to cover all data sizes*/
          if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i].name, pname) == 0)
          {
             comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLREDUCE][0] =
@@ -725,20 +743,20 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
             memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLREDUCE][0],
                    &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLREDUCE][1][i],
                    sizeof(pami_metadata_t));
-            if(pickFCA == 1)
+            if(fca_enabled == 1)
             {
               comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] = MPID_COLL_CHECK_FN_REQUIRED;
-              comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0] = 0;/*SSS: Always use opt_protocol[0] for FCA*/
+              if(user_range_hi != -1)
+                comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0] = user_range_hi;
               /*SSS: Otherwise another protocol may get selected in mpido_allreduce if we don't set this flag here*/
               comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][1] = MPID_COLL_CHECK_FN_REQUIRED;
             }
             else
             {
-              /*SSS: (on BG) MPICH is actually better at > 128 bytes for 1/16/64ppn at 512 nodes */
-              comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][1] = MPID_COLL_USE_MPICH;
               /* Short is good for up to 512 bytes... but it's a query protocol */
               comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] = MPID_COLL_QUERY;
-              comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0] = 512;
+              /* Use MPICH above that, i.e. when the short-protocol query fails */
+              comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][1] = MPID_COLL_USE_MPICH;
             }
             comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] = MPID_COLL_OPTIMIZED;
 
@@ -748,7 +766,7 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       if(opt_proto == -1)
       {
          if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm_ptr->rank == 0)
-            fprintf(stderr,"Opt to MPICH\n");
+            fprintf(stderr,"Optimized allreduce falls back to MPICH\n");
          comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] = MPID_COLL_USE_MPICH;
          comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][1] = MPID_COLL_USE_MPICH;
       }
@@ -780,7 +798,7 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] == MPID_COLL_OPTIMIZED)
          fprintf(stderr,"Selecting %s for opt allgatherv comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLGATHERV_INT][0].name, comm_ptr);
       if(comm_ptr->mpid.must_query[PAMI_XFER_ALLGATHERV_INT][0] == MPID_COLL_USE_MPICH)
-         fprintf(stderr,"Selecting MPICH for allgatherv below %d size comm %p\n", comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
+         fprintf(stderr,"Selecting MPICH for allgatherv below %d size comm %p\n", comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLGATHERV_INT][0], comm_ptr);
       if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] == MPID_COLL_OPTIMIZED)
          fprintf(stderr,"Selecting %s for opt bcast up to size %d comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name,
             comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0], comm_ptr);
@@ -798,17 +816,12 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
          fprintf(stderr,"Selecting MPICH for allreduce below %d size comm %p\n", comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
       else
       {
-         if(comm_ptr->mpid.query_allred_ismm != MPID_COLL_USE_MPICH)
-         {
-            fprintf(stderr,"Selecting %s for integer sum/min/max ops, query: %d comm %p\n",
-               comm_ptr->mpid.cached_allred_ismm_md.name, comm_ptr->mpid.query_allred_ismm, comm_ptr);
-         }
-         if(comm_ptr->mpid.query_allred_dsmm != MPID_COLL_USE_MPICH)
+         if(comm_ptr->mpid.query_cached_allreduce != MPID_COLL_USE_MPICH)
          {
-            fprintf(stderr,"Selecting %s for double sum/min/max ops, query: %d comm %p\n",
-               comm_ptr->mpid.cached_allred_dsmm_md.name, comm_ptr->mpid.query_allred_dsmm, comm_ptr);
+            fprintf(stderr,"Selecting %s for double sum/min/max allreduce ops, query: %d comm %p\n",
+               comm_ptr->mpid.cached_allreduce_md.name, comm_ptr->mpid.query_cached_allreduce, comm_ptr);
          }
-         fprintf(stderr,"Selecting %s for other operations up to %d comm %p\n",
+         fprintf(stderr,"Selecting %s for other allreduce operations up to %d comm %p\n",
                comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLREDUCE][0].name, 
                comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
       }
@@ -816,31 +829,19 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
          fprintf(stderr,"Selecting MPICH for allreduce above %d size comm %p\n", comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
       else
       {
-         if(comm_ptr->mpid.query_allred_ismm != MPID_COLL_USE_MPICH)
-         {
-            fprintf(stderr,"Selecting %s for integer sum/min/max ops, above %d query: %d comm %p\n",
-               comm_ptr->mpid.cached_allred_ismm_md.name, 
-               comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0],
-               comm_ptr->mpid.query_allred_ismm, comm_ptr);
-         }
-         else
-         {
-            fprintf(stderr,"Selecting MPICH for integer sum/min/max ops above %d size comm %p\n",
-               comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
-         }
-         if(comm_ptr->mpid.query_allred_dsmm != MPID_COLL_USE_MPICH)
+         if(comm_ptr->mpid.query_cached_allreduce != MPID_COLL_USE_MPICH)
          {
-            fprintf(stderr,"Selecting %s for double sum/min/max ops, above %d query: %d comm %p\n",
-               comm_ptr->mpid.cached_allred_dsmm_md.name, 
+            fprintf(stderr,"Selecting %s for double sum/min/max allreduce ops, above %d query: %d comm %p\n",
+               comm_ptr->mpid.cached_allreduce_md.name, 
                comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0],
-               comm_ptr->mpid.query_allred_dsmm, comm_ptr);
+               comm_ptr->mpid.query_cached_allreduce, comm_ptr);
          }
          else
          {
-            fprintf(stderr,"Selecting MPICH for double sum/min/max ops above %d size comm %p\n",
+            fprintf(stderr,"Selecting MPICH for double sum/min/max allreduce ops above %d size comm %p\n",
                comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
          }
-         fprintf(stderr,"Selecting %s for other operations over %d comm %p\n",
+         fprintf(stderr,"Selecting %s for other allreduce operations over %d comm %p\n",
             comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLREDUCE][1].name,
             comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0], comm_ptr);
       }

http://git.mpich.org/mpich.git/commitdiff/685b4d97b439813706516192feb08dccbca124f8

commit 685b4d97b439813706516192feb08dccbca124f8
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Mon Nov 26 15:55:44 2012 -0600

    Reformat mpido_allreduce
    
    (ibm) Issue 8597
    (ibm) 8ad98635e08d20729a3fc335dbd127795ad380a2
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
index f5c2ecc..03b4188 100644
--- a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
+++ b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
@@ -26,10 +26,10 @@
 
 static void cb_allreduce(void *ctxt, void *clientdata, pami_result_t err)
 {
-   int *active = (int *) clientdata;
-   TRACE_ERR("callback enter, active: %d\n", (*active));
-   MPIDI_Progress_signal();
-   (*active)--;
+  int *active = (int *) clientdata;
+  TRACE_ERR("callback enter, active: %d\n", (*active));
+  MPIDI_Progress_signal();
+  (*active)--;
 }
 
 int MPIDO_Allreduce(const void *sendbuf,
@@ -40,62 +40,62 @@ int MPIDO_Allreduce(const void *sendbuf,
                     MPID_Comm *comm_ptr,
                     int *mpierrno)
 {
-   void *sbuf;
-   TRACE_ERR("Entering mpido_allreduce\n");
-   pami_type_t pdt;
-   pami_data_function pop;
-   int mu;
-   int rc;
+  void *sbuf;
+  TRACE_ERR("Entering mpido_allreduce\n");
+  pami_type_t pdt;
+  pami_data_function pop;
+  int mu;
+  int rc;
 #ifdef TRACE_ON
-    int len; 
-    char op_str[255]; 
-    char dt_str[255]; 
-    MPIDI_Op_to_string(op, op_str); 
-    PMPI_Type_get_name(dt, dt_str, &len); 
+  int len; 
+  char op_str[255]; 
+  char dt_str[255]; 
+  MPIDI_Op_to_string(op, op_str); 
+  PMPI_Type_get_name(dt, dt_str, &len); 
 #endif
-   volatile unsigned active = 1;
-   pami_xfer_t allred;
-   pami_algorithm_t my_allred;
-   const pami_metadata_t *my_allred_md = (pami_metadata_t *)NULL;
-   int alg_selected = 0;
-   const int rank = comm_ptr->rank;
-   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
-   const int selected_type = mpid->user_selected_type[PAMI_XFER_ALLREDUCE];
+  volatile unsigned active = 1;
+  pami_xfer_t allred;
+  pami_algorithm_t my_allred;
+  const pami_metadata_t *my_allred_md = (pami_metadata_t *)NULL;
+  int alg_selected = 0;
+  const int rank = comm_ptr->rank;
+  const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+  const int selected_type = mpid->user_selected_type[PAMI_XFER_ALLREDUCE];
 #if ASSERT_LEVEL==0
-   /* We can't afford the tracing in ndebug/performance libraries */
-    const unsigned verbose = 0;
+  /* We can't afford the tracing in ndebug/performance libraries */
+  const unsigned verbose = 0;
 #else
-    const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (rank == 0);
+  const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (rank == 0);
 #endif
-   if(likely(dt == MPI_DOUBLE || dt == MPI_DOUBLE_PRECISION))
-   {
-      rc = MPI_SUCCESS;
-      pdt = PAMI_TYPE_DOUBLE;
-      if(likely(op == MPI_SUM))
-         pop = PAMI_DATA_SUM; 
-      else if(likely(op == MPI_MAX))
-         pop = PAMI_DATA_MAX; 
-      else if(likely(op == MPI_MIN))
-         pop = PAMI_DATA_MIN; 
-      else rc = MPIDI_Datatype_to_pami(dt, &pdt, op, &pop, &mu);
-   }
-   else rc = MPIDI_Datatype_to_pami(dt, &pdt, op, &pop, &mu);
+  if(likely(dt == MPI_DOUBLE || dt == MPI_DOUBLE_PRECISION))
+  {
+    rc = MPI_SUCCESS;
+    pdt = PAMI_TYPE_DOUBLE;
+    if(likely(op == MPI_SUM))
+      pop = PAMI_DATA_SUM;
+    else if(likely(op == MPI_MAX))
+      pop = PAMI_DATA_MAX;
+    else if(likely(op == MPI_MIN))
+      pop = PAMI_DATA_MIN;
+    else rc = MPIDI_Datatype_to_pami(dt, &pdt, op, &pop, &mu);
+  }
+  else rc = MPIDI_Datatype_to_pami(dt, &pdt, op, &pop, &mu);
 
-    if(unlikely(verbose))
+  if(unlikely(verbose))
     fprintf(stderr,"allred rc %u,count %d, Datatype %p, op %p, mu %u, selectedvar %u != %u, sendbuf %p, recvbuf %p\n",
             rc, count, pdt, pop, mu, 
             (unsigned)selected_type,MPID_COLL_USE_MPICH, sendbuf, recvbuf);
-      /* convert to metadata query */
+  /* convert to metadata query */
   /* Punt count 0 allreduce to MPICH. Let them do whatever's 'right' */
   if(unlikely(rc != MPI_SUCCESS || (count==0) ||
-	      selected_type == MPID_COLL_USE_MPICH))
-   {
-     if(unlikely(verbose))
-         fprintf(stderr,"Using MPICH allreduce type %u.\n",
-                 selected_type);
-      MPIDI_Update_last_algorithm(comm_ptr, "ALLREDUCE_MPICH");
-      return MPIR_Allreduce(sendbuf, recvbuf, count, dt, op, comm_ptr, mpierrno);
-   }
+              selected_type == MPID_COLL_USE_MPICH))
+  {
+    if(unlikely(verbose))
+      fprintf(stderr,"Using MPICH allreduce type %u.\n",
+              selected_type);
+    MPIDI_Update_last_algorithm(comm_ptr, "ALLREDUCE_MPICH");
+    return MPIR_Allreduce(sendbuf, recvbuf, count, dt, op, comm_ptr, mpierrno);
+  }
 
   sbuf = (void *)sendbuf;
   if(unlikely(sendbuf == MPI_IN_PLACE))
@@ -105,276 +105,276 @@ int MPIDO_Allreduce(const void *sendbuf,
       sbuf = PAMI_IN_PLACE;
    }
 
-   allred.cb_done = cb_allreduce;
-   allred.cookie = (void *)&active;
-   allred.cmd.xfer_allreduce.sndbuf = sbuf;
-   allred.cmd.xfer_allreduce.stype = pdt;
-   allred.cmd.xfer_allreduce.rcvbuf = recvbuf;
-   allred.cmd.xfer_allreduce.rtype = pdt;
-   allred.cmd.xfer_allreduce.stypecount = count;
-   allred.cmd.xfer_allreduce.rtypecount = count;
-   allred.cmd.xfer_allreduce.op = pop;
+  allred.cb_done = cb_allreduce;
+  allred.cookie = (void *)&active;
+  allred.cmd.xfer_allreduce.sndbuf = sbuf;
+  allred.cmd.xfer_allreduce.stype = pdt;
+  allred.cmd.xfer_allreduce.rcvbuf = recvbuf;
+  allred.cmd.xfer_allreduce.rtype = pdt;
+  allred.cmd.xfer_allreduce.stypecount = count;
+  allred.cmd.xfer_allreduce.rtypecount = count;
+  allred.cmd.xfer_allreduce.op = pop;
 
-   TRACE_ERR("Allreduce - Basic Collective Selection\n");
-   if(likely(selected_type == MPID_COLL_OPTIMIZED))
-   {
-     if(likely(pop == PAMI_DATA_SUM || pop == PAMI_DATA_MAX || pop == PAMI_DATA_MIN))
+  TRACE_ERR("Allreduce - Basic Collective Selection\n");
+  if(likely(selected_type == MPID_COLL_OPTIMIZED))
+  {
+    if(likely(pop == PAMI_DATA_SUM || pop == PAMI_DATA_MAX || pop == PAMI_DATA_MIN))
+    {
+      /* double protocol works on all message sizes */
+      if(likely(pdt == PAMI_TYPE_DOUBLE && mpid->query_allred_dsmm == MPID_COLL_QUERY))
       {
-         /* double protocol works on all message sizes */
-         if(likely(pdt == PAMI_TYPE_DOUBLE && mpid->query_allred_dsmm == MPID_COLL_QUERY))
-         {
-            my_allred = mpid->cached_allred_dsmm;
-            my_allred_md = &mpid->cached_allred_dsmm_md;
-            alg_selected = 1;
-         }
-         else if(pdt == PAMI_TYPE_UNSIGNED_INT && mpid->query_allred_ismm == MPID_COLL_QUERY)
-         {
-            my_allred = mpid->cached_allred_ismm;
-            my_allred_md = &mpid->cached_allred_ismm_md;
-            alg_selected = 1;
-         }
-         /* The integer protocol at >1 ppn requires small messages only */
-         else if(pdt == PAMI_TYPE_UNSIGNED_INT && mpid->query_allred_ismm == MPID_COLL_CHECK_FN_REQUIRED &&
-                 count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
-         {
-            my_allred = mpid->cached_allred_ismm;
-            my_allred_md = &mpid->cached_allred_ismm_md;
-            alg_selected = 1;
-         }
-         else if(mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_NOQUERY &&
-                 count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
-         {
-            my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
-            my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
-            alg_selected = 1;
-         }
-         else if(mpid->must_query[PAMI_XFER_ALLREDUCE][1] == MPID_COLL_NOQUERY &&
-                 count > mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
-         {
-            my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][1];
-            my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
-            alg_selected = 1;
-         }
-         else if((mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_CHECK_FN_REQUIRED) ||
-		 (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_QUERY) ||
-		 (mpid->must_query[PAMI_XFER_ALLREDUCE][0] ==  MPID_COLL_ALWAYS_QUERY))
-         {
-            if((mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] == 0) || 
-			(count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] && mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] > 0))
-            {
-              my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
-              my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
-              alg_selected = 1;
-            }
-         }
+        my_allred = mpid->cached_allred_dsmm;
+        my_allred_md = &mpid->cached_allred_dsmm_md;
+        alg_selected = 1;
       }
-      else
+      else if(pdt == PAMI_TYPE_UNSIGNED_INT && mpid->query_allred_ismm == MPID_COLL_QUERY)
       {
-         /* so we aren't one of the key ops... */
-         if(mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_NOQUERY &&
-            count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
-         {
-            my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
-            my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
-            alg_selected = 1;
-         }
-         else if(mpid->must_query[PAMI_XFER_ALLREDUCE][1] == MPID_COLL_NOQUERY &&
-                 count > mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
-         {
-            my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][1];
-            my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
-            alg_selected = 1;
-         }
-         else if((mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_CHECK_FN_REQUIRED) ||
-		 (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_QUERY) ||
-		 (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_ALWAYS_QUERY))
-         {
-            if((mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] == 0) || 
-               (count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] && mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] > 0))
-            {			
-              my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
-              my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
-              alg_selected = 1;
-            }
-         }
+        my_allred = mpid->cached_allred_ismm;
+        my_allred_md = &mpid->cached_allred_ismm_md;
+        alg_selected = 1;
+      }
+      /* The integer protocol at >1 ppn requires small messages only */
+      else if(pdt == PAMI_TYPE_UNSIGNED_INT && mpid->query_allred_ismm == MPID_COLL_CHECK_FN_REQUIRED &&
+              count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
+      {
+        my_allred = mpid->cached_allred_ismm;
+        my_allred_md = &mpid->cached_allred_ismm_md;
+        alg_selected = 1;
+      }
+      else if(mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_NOQUERY &&
+              count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
+      {
+        my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
+        my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
+        alg_selected = 1;
+      }
+      else if(mpid->must_query[PAMI_XFER_ALLREDUCE][1] == MPID_COLL_NOQUERY &&
+              count > mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
+      {
+        my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][1];
+        my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
+        alg_selected = 1;
       }
-      TRACE_ERR("Alg selected: %d\n", alg_selected);
-      if(likely(alg_selected))
+      else if((mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_CHECK_FN_REQUIRED) ||
+              (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_QUERY) ||
+              (mpid->must_query[PAMI_XFER_ALLREDUCE][0] ==  MPID_COLL_ALWAYS_QUERY))
       {
-	if(unlikely(mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_CHECK_FN_REQUIRED))
+        if((mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] == 0) || 
+           (count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] && mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] > 0))
         {
-           if(my_allred_md->check_fn != NULL)/*This should always be the case in FCA.. Otherwise punt to mpich*/
-           {
-              metadata_result_t result = {0};
-              TRACE_ERR("querying allreduce algorithm %s, type was %d\n",
-                 my_allred_md->name,
-                 mpid->must_query[PAMI_XFER_ALLREDUCE]);
-              result = my_allred_md->check_fn(&allred);
-              TRACE_ERR("bitmask: %#X\n", result.bitmask);
-              /* \todo Ignore check_correct.values.nonlocal until we implement the
-                 'pre-allreduce allreduce' or the 'safe' environment flag.
-                 We will basically assume 'safe' -- that all ranks are aligned (or not).
-              */
-              result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
-              if(!result.bitmask)
-              {
-                 allred.algorithm = my_allred;
-              }
-              else
-              {
-                 alg_selected = 0;
-                 if(unlikely(verbose))
-                    fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
-              }
-           }
-         else alg_selected = 0;
-	}
-	else if(unlikely(((mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_QUERY) ||
-			  (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_ALWAYS_QUERY))))
+          my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
+          my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
+          alg_selected = 1;
+        }
+      }
+    }
+    else
+    {
+      /* so we aren't one of the key ops... */
+      if(mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_NOQUERY &&
+         count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
+      {
+        my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
+        my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
+        alg_selected = 1;
+      }
+      else if(mpid->must_query[PAMI_XFER_ALLREDUCE][1] == MPID_COLL_NOQUERY &&
+              count > mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
+      {
+        my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][1];
+        my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
+        alg_selected = 1;
+      }
+      else if((mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_CHECK_FN_REQUIRED) ||
+              (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_QUERY) ||
+              (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_ALWAYS_QUERY))
+      {
+        if((mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] == 0) || 
+           (count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] && mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] > 0))
         {
-           if(my_allred_md->check_fn != NULL)/*This should always be the case in FCA.. Otherwise punt to mpich*/
-           {
-              metadata_result_t result = {0};
-              TRACE_ERR("querying allreduce algorithm %s, type was %d\n",
-                 my_allred_md->name,
-                 mpid->must_query[PAMI_XFER_ALLREDUCE]);
-              result = my_allred_md->check_fn(&allred);
-              TRACE_ERR("bitmask: %#X\n", result.bitmask);
-              /* \todo Ignore check_correct.values.nonlocal until we implement the
-                 'pre-allreduce allreduce' or the 'safe' environment flag.
-                 We will basically assume 'safe' -- that all ranks are aligned (or not).
-              */
-              result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
-              if(!result.bitmask)
-              {
-                 allred.algorithm = my_allred;
-              }
-              else
-              {
-                 alg_selected = 0;
-                 if(unlikely(verbose))
-                    fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
-              }
-           }
-	   else /* no check_fn, manually look at the metadata fields */
-	   {
-	     /* Check if the message range if restricted */
-	     if(my_allred_md->check_correct.values.rangeminmax)
-	     {
-               MPI_Aint data_true_lb;
-               MPID_Datatype *data_ptr;
-               int data_size, data_contig;
-               MPIDI_Datatype_get_info(count, dt, data_contig, data_size, data_ptr, data_true_lb); 
-               if((my_allred_md->range_lo <= data_size) &&
-                  (my_allred_md->range_hi >= data_size))
-                 allred.algorithm = my_allred; /* query algorithm successfully selected */
-               else
-		 {
-		   if(unlikely(verbose))
-                     fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
-                             data_size,
-                             my_allred_md->range_lo,
-                             my_allred_md->range_hi,
-                             my_allred_md->name);
-		   alg_selected = 0;
-		 }
-	     }
-	     /* \todo check the rest of the metadata */
-	   }
+          my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
+          my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
+          alg_selected = 1;
         }
-        else
+      }
+    }
+    TRACE_ERR("Alg selected: %d\n", alg_selected);
+    if(likely(alg_selected))
+    {
+      if(unlikely(mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_CHECK_FN_REQUIRED))
+      {
+        if(my_allred_md->check_fn != NULL) /* This should always be the case for FCA; otherwise punt to MPICH */
         {
-           TRACE_ERR("Using %s for allreduce\n", my_allred_md->name);
-           allred.algorithm = my_allred;
+          metadata_result_t result = {0};
+          TRACE_ERR("querying allreduce algorithm %s, type was %d\n",
+                    my_allred_md->name,
+                    mpid->must_query[PAMI_XFER_ALLREDUCE]);
+          result = my_allred_md->check_fn(&allred);
+          TRACE_ERR("bitmask: %#X\n", result.bitmask);
+          /* \todo Ignore check_correct.values.nonlocal until we implement the
+             'pre-allreduce allreduce' or the 'safe' environment flag.
+             We will basically assume 'safe' -- that all ranks are aligned (or not).
+          */
+          result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+          if(!result.bitmask)
+          {
+            allred.algorithm = my_allred;
+          }
+          else
+          {
+            alg_selected = 0;
+            if(unlikely(verbose))
+              fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
+          }
         }
+        else alg_selected = 0;
       }
-   }
-   else
-   {
-      my_allred = mpid->user_selected[PAMI_XFER_ALLREDUCE];
-      my_allred_md = &mpid->user_metadata[PAMI_XFER_ALLREDUCE];
-      allred.algorithm = my_allred;
-      if(selected_type == MPID_COLL_QUERY ||
-         selected_type == MPID_COLL_ALWAYS_QUERY ||
-         selected_type == MPID_COLL_CHECK_FN_REQUIRED)
+      else if(unlikely(((mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_QUERY) ||
+                        (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_ALWAYS_QUERY))))
       {
-         if(my_allred_md->check_fn != NULL)
-         {
-            /* For now, we don't distinguish between MPID_COLL_ALWAYS_QUERY &
-               MPID_COLL_CHECK_FN_REQUIRED, we just call the fn                */
-            metadata_result_t result = {0};
-            TRACE_ERR("querying allreduce algorithm %s, type was %d\n",
-               my_allred_md->name,
-               selected_type);
-            result = mpid->user_metadata[PAMI_XFER_ALLREDUCE].check_fn(&allred);
-            TRACE_ERR("bitmask: %#X\n", result.bitmask);
-            /* \todo Ignore check_correct.values.nonlocal until we implement the
-               'pre-allreduce allreduce' or the 'safe' environment flag.
-               We will basically assume 'safe' -- that all ranks are aligned (or not).
-            */
-            result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
-            if(!result.bitmask)
-               alg_selected = 1; /* query algorithm successfully selected */
-            else 
-               if(unlikely(verbose))
-                  fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
-         }
-         else /* no check_fn, manually look at the metadata fields */
-         {
-            /* Check if the message range if restricted */
-            if(my_allred_md->check_correct.values.rangeminmax)
+        if(my_allred_md->check_fn != NULL) /* This should always be the case for FCA; otherwise punt to MPICH */
+        {
+          metadata_result_t result = {0};
+          TRACE_ERR("querying allreduce algorithm %s, type was %d\n",
+                    my_allred_md->name,
+                    mpid->must_query[PAMI_XFER_ALLREDUCE]);
+          result = my_allred_md->check_fn(&allred);
+          TRACE_ERR("bitmask: %#X\n", result.bitmask);
+          /* \todo Ignore check_correct.values.nonlocal until we implement the
+             'pre-allreduce allreduce' or the 'safe' environment flag.
+             We will basically assume 'safe' -- that all ranks are aligned (or not).
+          */
+          result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+          if(!result.bitmask)
+          {
+            allred.algorithm = my_allred;
+          }
+          else
+          {
+            alg_selected = 0;
+            if(unlikely(verbose))
+              fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
+          }
+        }
+        else /* no check_fn, manually look at the metadata fields */
+        {
+          /* Check if the message range is restricted */
+          if(my_allred_md->check_correct.values.rangeminmax)
+          {
+            MPI_Aint data_true_lb;
+            MPID_Datatype *data_ptr;
+            int data_size, data_contig;
+            MPIDI_Datatype_get_info(count, dt, data_contig, data_size, data_ptr, data_true_lb); 
+            if((my_allred_md->range_lo <= data_size) &&
+               (my_allred_md->range_hi >= data_size))
+              allred.algorithm = my_allred; /* query algorithm successfully selected */
+            else
             {
-               MPI_Aint data_true_lb;
-               MPID_Datatype *data_ptr;
-               int data_size, data_contig;
-               MPIDI_Datatype_get_info(count, dt, data_contig, data_size, data_ptr, data_true_lb); 
-               if((my_allred_md->range_lo <= data_size) &&
-                  (my_allred_md->range_hi >= data_size))
-                  alg_selected = 1; /* query algorithm successfully selected */
-               else
-                 if(unlikely(verbose))
-                     fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
-                             data_size,
-                             my_allred_md->range_lo,
-                             my_allred_md->range_hi,
-                             my_allred_md->name);
+              if(unlikely(verbose))
+                fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                        data_size,
+                        my_allred_md->range_lo,
+                        my_allred_md->range_hi,
+                        my_allred_md->name);
+              alg_selected = 0;
             }
-            /* \todo check the rest of the metadata */
-         }
+          }
+          /* \todo check the rest of the metadata */
+        }
+      }
+      else
+      {
+        TRACE_ERR("Using %s for allreduce\n", my_allred_md->name);
+        allred.algorithm = my_allred;
       }
-      else alg_selected = 1; /* non-query algorithm selected */
+    }
+  }
+  else
+  {
+    my_allred = mpid->user_selected[PAMI_XFER_ALLREDUCE];
+    my_allred_md = &mpid->user_metadata[PAMI_XFER_ALLREDUCE];
+    allred.algorithm = my_allred;
+    if(selected_type == MPID_COLL_QUERY ||
+       selected_type == MPID_COLL_ALWAYS_QUERY ||
+       selected_type == MPID_COLL_CHECK_FN_REQUIRED)
+    {
+      if(my_allred_md->check_fn != NULL)
+      {
+        /* For now, we don't distinguish between MPID_COLL_ALWAYS_QUERY &
+           MPID_COLL_CHECK_FN_REQUIRED, we just call the fn                */
+        metadata_result_t result = {0};
+        TRACE_ERR("querying allreduce algorithm %s, type was %d\n",
+                  my_allred_md->name,
+                  selected_type);
+        result = mpid->user_metadata[PAMI_XFER_ALLREDUCE].check_fn(&allred);
+        TRACE_ERR("bitmask: %#X\n", result.bitmask);
+        /* \todo Ignore check_correct.values.nonlocal until we implement the
+           'pre-allreduce allreduce' or the 'safe' environment flag.
+           We will basically assume 'safe' -- that all ranks are aligned (or not).
+        */
+        result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+        if(!result.bitmask)
+          alg_selected = 1; /* query algorithm successfully selected */
+        else
+          if(unlikely(verbose))
+          fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
+      }
+      else /* no check_fn, manually look at the metadata fields */
+      {
+        /* Check if the message range if restricted */
+        if(my_allred_md->check_correct.values.rangeminmax)
+        {
+          MPI_Aint data_true_lb;
+          MPID_Datatype *data_ptr;
+          int data_size, data_contig;
+          MPIDI_Datatype_get_info(count, dt, data_contig, data_size, data_ptr, data_true_lb); 
+          if((my_allred_md->range_lo <= data_size) &&
+             (my_allred_md->range_hi >= data_size))
+            alg_selected = 1; /* query algorithm successfully selected */
+          else
+            if(unlikely(verbose))
+            fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
+                    data_size,
+                    my_allred_md->range_lo,
+                    my_allred_md->range_hi,
+                    my_allred_md->name);
+        }
+        /* \todo check the rest of the metadata */
+      }
+    }
+    else alg_selected = 1; /* non-query algorithm selected */
 
-   }
+  }
 
-   if(unlikely(!alg_selected)) /* must be fallback to MPICH */
-   {
-     if(unlikely(verbose))
-         fprintf(stderr,"Using MPICH allreduce\n");
-      MPIDI_Update_last_algorithm(comm_ptr, "ALLREDUCE_MPICH");
-      return MPIR_Allreduce(sendbuf, recvbuf, count, dt, op, comm_ptr, mpierrno);
-   }
+  if(unlikely(!alg_selected)) /* must be fallback to MPICH */
+  {
+    if(unlikely(verbose))
+      fprintf(stderr,"Using MPICH allreduce\n");
+    MPIDI_Update_last_algorithm(comm_ptr, "ALLREDUCE_MPICH");
+    return MPIR_Allreduce(sendbuf, recvbuf, count, dt, op, comm_ptr, mpierrno);
+  }
 
-   if(unlikely(verbose))
-   {
-      unsigned long long int threadID;
-      MPIU_Thread_id_t tid;
-      MPIU_Thread_self(&tid);
-      threadID = (unsigned long long int)tid;
-      fprintf(stderr,"<%llx> Using protocol %s for allreduce on %u\n", 
-              threadID,
-              my_allred_md->name,
-              (unsigned) comm_ptr->context_id);
-   }
+  if(unlikely(verbose))
+  {
+    unsigned long long int threadID;
+    MPIU_Thread_id_t tid;
+    MPIU_Thread_self(&tid);
+    threadID = (unsigned long long int)tid;
+    fprintf(stderr,"<%llx> Using protocol %s for allreduce on %u\n", 
+            threadID,
+            my_allred_md->name,
+            (unsigned) comm_ptr->context_id);
+  }
 
-   MPIDI_Post_coll_t allred_post;
-   MPIDI_Context_post(MPIDI_Context[0], &allred_post.state,
-                      MPIDI_Pami_post_wrapper, (void *)&allred);
+  MPIDI_Post_coll_t allred_post;
+  MPIDI_Context_post(MPIDI_Context[0], &allred_post.state,
+                     MPIDI_Pami_post_wrapper, (void *)&allred);
 
-   MPID_assert(rc == PAMI_SUCCESS);
-   MPIDI_Update_last_algorithm(comm_ptr,my_allred_md->name);
-   MPID_PROGRESS_WAIT_WHILE(active);
-   TRACE_ERR("allreduce done\n");
-   return MPI_SUCCESS;
+  MPID_assert(rc == PAMI_SUCCESS);
+  MPIDI_Update_last_algorithm(comm_ptr,my_allred_md->name);
+  MPID_PROGRESS_WAIT_WHILE(active);
+  TRACE_ERR("allreduce done\n");
+  return MPI_SUCCESS;
 }
 
 int MPIDO_Allreduce_simple(const void *sendbuf,

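The allreduce selection logic in the hunk above prefers a protocol's `check_fn` and, when none exists, falls back to a manual range test on the message size, with a zero `bitmask` meaning the query succeeded. A minimal stand-alone sketch of that decision (the types and names here are simplified stand-ins, not the real pamid definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of the PAMI metadata types: the real
   metadata_result_t is a union with per-field check bits. */
typedef struct { unsigned bitmask; } metadata_result_t;

typedef struct metadata {
    const char *name;
    size_t range_lo, range_hi;              /* valid message-size range */
    int has_range;                          /* check_correct.values.rangeminmax */
    metadata_result_t (*check_fn)(void *xfer);
} metadata_t;

/* Returns 1 if the protocol described by md may be used for a message
   of data_size bytes, mirroring the query flow in MPIDO_Allreduce:
   call check_fn when present, otherwise inspect the range metadata. */
static int select_algorithm(const metadata_t *md, void *xfer, size_t data_size)
{
    if (md->check_fn != NULL) {
        metadata_result_t r = md->check_fn(xfer);
        return r.bitmask == 0;              /* zero bitmask == query passed */
    }
    if (md->has_range)
        return md->range_lo <= data_size && data_size <= md->range_hi;
    return 1;                               /* nothing to check */
}

/* Stub check functions standing in for protocol-provided queries. */
static metadata_result_t always_ok(void *xfer)  { (void)xfer; metadata_result_t r = {0};    return r; }
static metadata_result_t always_bad(void *xfer) { (void)xfer; metadata_result_t r = {0x04}; return r; }
```

When `select_algorithm` returns 0 the caller would take the `ALLREDUCE_MPICH` fallback path, exactly as the hunk does with `MPIR_Allreduce`.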
http://git.mpich.org/mpich.git/commitdiff/3fcd8d8e05dea84d7d10352f18445edea66cac75

commit 3fcd8d8e05dea84d7d10352f18445edea66cac75
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Mon Nov 26 14:39:32 2012 -0600

    Glue updates for metadata changes
    
    (ibm) Issue 8756
    (ibm) 625a34684858febdf227063c212dff4edcc5b6de
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>
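The diffs in this commit repeatedly apply one pattern to each collective: capture the query level in a local `queryreq`, call the protocol's `check_fn` only for `MPID_COLL_CHECK_FN_REQUIRED` (the `MPID_COLL_ALWAYS_QUERY` bit-processing branch is still a stub), and treat a nonzero `bitmask` as failure, falling back to the MPICH implementation. A minimal sketch of that shared shape (enum values and function names here are illustrative stand-ins, not the real pamid symbols):

```c
#include <assert.h>

/* Hypothetical reduction of the pamid query levels. */
enum { MPID_COLL_NOQUERY, MPID_COLL_ALWAYS_QUERY, MPID_COLL_CHECK_FN_REQUIRED };

typedef struct { unsigned bitmask; } metadata_result_t;

/* One collective's decision, mirroring the refactored hunks:
   returns 1 to run the PAMI protocol, 0 to fall back to MPIR_*. */
static int run_pami(int queryreq, metadata_result_t (*check_fn)(void))
{
    if (queryreq == MPID_COLL_ALWAYS_QUERY ||
        queryreq == MPID_COLL_CHECK_FN_REQUIRED) {
        metadata_result_t result = {0};
        if (queryreq == MPID_COLL_ALWAYS_QUERY) {
            /* process metadata bits (still a stub in this commit) */
        } else {
            result = check_fn();        /* calling the check fn is sufficient */
        }
        if (result.bitmask)
            return 0;                   /* query failed: use MPICH fallback */
    }
    return 1;
}

/* Stub check functions standing in for protocol metadata queries. */
static metadata_result_t pass_fn(void) { metadata_result_t r = {0};    return r; }
static metadata_result_t fail_fn(void) { metadata_result_t r = {0x20}; return r; }
```

Note that before the fallback each hunk also records the choice via `MPIDI_Update_last_algorithm(comm_ptr, "..._MPICH")`, which several of the old paths omitted.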

diff --git a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
index c455d38..c881db3 100644
--- a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
+++ b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
@@ -248,6 +248,7 @@ MPIDO_Allgather(const void *sendbuf,
    volatile unsigned allgather_active = 1;
    pami_xfer_t allred;
    const int rank = comm_ptr->rank;
+   int queryreq = 0;
 #if ASSERT_LEVEL==0
    /* We can't afford the tracing in ndebug/performance libraries */
     const unsigned verbose = 0;
@@ -372,10 +373,11 @@ MPIDO_Allgather(const void *sendbuf,
       if(selected_type == MPID_COLL_OPTIMIZED)
       {
         if((mpid->cutoff_size[PAMI_XFER_ALLGATHER][0] == 0) || 
-	    (mpid->cutoff_size[PAMI_XFER_ALLGATHER][0] > 0 && mpid->cutoff_size[PAMI_XFER_ALLGATHER][0] >= send_size))
+           (mpid->cutoff_size[PAMI_XFER_ALLGATHER][0] > 0 && mpid->cutoff_size[PAMI_XFER_ALLGATHER][0] >= send_size))
         {
            allgather.algorithm = mpid->opt_protocol[PAMI_XFER_ALLGATHER][0];
            my_md = &mpid->opt_protocol_md[PAMI_XFER_ALLGATHER][0];
+           queryreq     = mpid->must_query[PAMI_XFER_ALLGATHER][0];
         }
         else
         {
@@ -388,22 +390,30 @@ MPIDO_Allgather(const void *sendbuf,
       {
          allgather.algorithm = mpid->user_selected[PAMI_XFER_ALLGATHER];
          my_md = &mpid->user_metadata[PAMI_XFER_ALLGATHER];
+         queryreq     = selected_type;
       }
 
-      if(unlikely( selected_type == MPID_COLL_ALWAYS_QUERY ||
-                   selected_type == MPID_COLL_CHECK_FN_REQUIRED))
+      if(unlikely( queryreq == MPID_COLL_ALWAYS_QUERY ||
+                   queryreq == MPID_COLL_CHECK_FN_REQUIRED))
       {
          metadata_result_t result = {0};
          TRACE_ERR("Querying allgather protocol %s, type was: %d\n",
             my_md->name,
             selected_type);
-         result = my_md->check_fn(&allgather);
+         if(queryreq == MPID_COLL_ALWAYS_QUERY)
+         {
+           /* process metadata bits */
+         }
+         else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+           result = my_md->check_fn(&allgather);
          TRACE_ERR("bitmask: %#X\n", result.bitmask);
-         if(!result.bitmask)
+         result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+         if(result.bitmask)
          {
-      if(unlikely(verbose))
-            fprintf(stderr,"Query failed for %s.\n",
-               my_md->name);
+           if(unlikely(verbose))
+             fprintf(stderr,"Query failed for %s.\n",
+                     my_md->name);
+           MPIDI_Update_last_algorithm(comm_ptr, "ALLGATHER_MPICH");
            return MPIR_Allgather(sendbuf, sendcount, sendtype,
                        recvbuf, recvcount, recvtype,
                        comm_ptr, mpierrno);
diff --git a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
index a4ef524..44d3498 100644
--- a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
+++ b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
@@ -268,6 +268,7 @@ MPIDO_Allgatherv(const void *sendbuf,
   char *sbuf, *rbuf;
   const int rank = comm_ptr->rank;
   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+  int queryreq = 0;
 
 #if ASSERT_LEVEL==0
    /* We can't afford the tracing in ndebug/performance libraries */
@@ -396,10 +397,11 @@ MPIDO_Allgatherv(const void *sendbuf,
       if(selected_type == MPID_COLL_OPTIMIZED)
       {
         if((mpid->cutoff_size[PAMI_XFER_ALLGATHERV_INT][0] == 0) || 
-	    (mpid->cutoff_size[PAMI_XFER_ALLGATHERV_INT][0] > 0 && mpid->cutoff_size[PAMI_XFER_ALLGATHERV_INT][0] >= send_size))
+           (mpid->cutoff_size[PAMI_XFER_ALLGATHERV_INT][0] > 0 && mpid->cutoff_size[PAMI_XFER_ALLGATHERV_INT][0] >= send_size))
         {		
           allgatherv.algorithm = mpid->opt_protocol[PAMI_XFER_ALLGATHERV_INT][0];
           my_md = &mpid->opt_protocol_md[PAMI_XFER_ALLGATHERV_INT][0];
+          queryreq     = mpid->must_query[PAMI_XFER_ALLGATHERV_INT][0];
         }
         else
           return MPIR_Allgatherv(sendbuf, sendcount, sendtype,
@@ -410,6 +412,7 @@ MPIDO_Allgatherv(const void *sendbuf,
       {  
         allgatherv.algorithm = mpid->user_selected[PAMI_XFER_ALLGATHERV_INT];
         my_md = &mpid->user_metadata[PAMI_XFER_ALLGATHERV_INT];
+        queryreq     = selected_type;
       }
       
       allgatherv.cmd.xfer_allgatherv_int.sndbuf = sbuf;
@@ -421,15 +424,21 @@ MPIDO_Allgatherv(const void *sendbuf,
       allgatherv.cmd.xfer_allgatherv_int.rtypecounts = (int *) recvcounts;
       allgatherv.cmd.xfer_allgatherv_int.rdispls = (int *) displs;
 
-      if(unlikely (selected_type == MPID_COLL_ALWAYS_QUERY ||
-                   selected_type == MPID_COLL_CHECK_FN_REQUIRED))
+      if(unlikely (queryreq == MPID_COLL_ALWAYS_QUERY ||
+                   queryreq == MPID_COLL_CHECK_FN_REQUIRED))
       {
          metadata_result_t result = {0};
          TRACE_ERR("Querying allgatherv_int protocol %s, type was %d\n", my_md->name,
             selected_type);
-         result = my_md->check_fn(&allgatherv);
+         if(queryreq == MPID_COLL_ALWAYS_QUERY)
+         {
+           /* process metadata bits */
+         }
+         else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+           result = my_md->check_fn(&allgatherv);
          TRACE_ERR("Allgatherv bitmask: %#X\n", result.bitmask);
-         if(!result.bitmask)
+         result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+         if(result.bitmask)
          {
            if(unlikely(verbose))
              fprintf(stderr,"Query failed for %s\n", my_md->name);
diff --git a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
index aa1543e..c5f54fa 100644
--- a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
+++ b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
@@ -132,20 +132,28 @@ int MPIDO_Alltoall(const void *sendbuf,
    alltoall.cmd.xfer_alltoall.rtypecount = recvcount;
    alltoall.cmd.xfer_alltoall.rtype = rtype;
 
-   if(unlikely(queryreq == MPID_COLL_ALWAYS_QUERY || queryreq == MPID_COLL_CHECK_FN_REQUIRED))
+   if(unlikely(queryreq == MPID_COLL_ALWAYS_QUERY || 
+               queryreq == MPID_COLL_CHECK_FN_REQUIRED))
    {
       metadata_result_t result = {0};
       TRACE_ERR("querying alltoall protocol %s, query level was %d\n", pname,
          queryreq);
-      result = my_alltoall_md->check_fn(&alltoall);
+      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+      {
+        /* process metadata bits */
+      }
+      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+         result = my_alltoall_md->check_fn(&alltoall);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
-      if(!result.bitmask)
+      result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+      if(result.bitmask)
       {
-      if(unlikely(verbose))
-         fprintf(stderr,"Query failed for %s\n", pname);
-      return MPIR_Alltoall_intra(sendbuf, sendcount, sendtype,
-                                 recvbuf, recvcount, recvtype,
-                                 comm_ptr, mpierrno);
+        if(unlikely(verbose))
+           fprintf(stderr,"Query failed for %s\n", pname);
+        MPIDI_Update_last_algorithm(comm_ptr, "ALLTOALL_MPICH");
+        return MPIR_Alltoall_intra(sendbuf, sendcount, sendtype,
+                                   recvbuf, recvcount, recvtype,
+                                   comm_ptr, mpierrno);
       }
    }
 
diff --git a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
index e128de8..84d4a89 100644
--- a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
+++ b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
@@ -129,16 +129,24 @@ int MPIDO_Alltoallv(const void *sendbuf,
    alltoallv.cmd.xfer_alltoallv_int.rtypecounts = (int *) recvcounts;
    alltoallv.cmd.xfer_alltoallv_int.rtype = rtype;
 
-   if(unlikely(queryreq == MPID_COLL_ALWAYS_QUERY || queryreq == MPID_COLL_CHECK_FN_REQUIRED))
+   if(unlikely(queryreq == MPID_COLL_ALWAYS_QUERY || 
+               queryreq == MPID_COLL_CHECK_FN_REQUIRED))
    {
       metadata_result_t result = {0};
       TRACE_ERR("querying alltoallv protocol %s, type was %d\n", pname, queryreq);
-      result = my_alltoallv_md->check_fn(&alltoallv);
+      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+      {
+        /* process metadata bits */
+      }
+      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+         result = my_alltoallv_md->check_fn(&alltoallv);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
-      if(!result.bitmask)
+      result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+      if(result.bitmask)
       {
         if(unlikely(verbose))
           fprintf(stderr,"Query failed for %s\n", pname);
+        MPIDI_Update_last_algorithm(comm_ptr, "ALLTOALLV_MPICH");
         return MPIR_Alltoallv(sendbuf, sendcounts, senddispls, sendtype,
                               recvbuf, recvcounts, recvdispls, recvtype,
                               comm_ptr, mpierrno);
diff --git a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
index bab8b58..9e320b3 100644
--- a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
+++ b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
@@ -104,7 +104,7 @@ int MPIDO_Bcast(void *buffer,
    {
      if(unlikely(verbose))
        fprintf(stderr,"Using MPICH bcast algorithm\n");
-      MPIDI_Update_last_algorithm(comm_ptr,"MPICH");
+      MPIDI_Update_last_algorithm(comm_ptr,"BCAST_MPICH");
       return MPIR_Bcast_intra(buffer, count, datatype, root, comm_ptr, mpierrno);
    }
 
@@ -185,35 +185,25 @@ int MPIDO_Bcast(void *buffer,
 
    bcast.algorithm = my_bcast;
 
-   if(queryreq == MPID_COLL_ALWAYS_QUERY)
-   {
-     metadata_result_t result = {0};
-     TRACE_ERR("querying bcast protocol %s, type was: %d\n",
-	       my_bcast_md->name, queryreq);
-     // TODO check bits?
-     TRACE_ERR("bitmask: %#X\n", result.bitmask);
-     result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
-     if(result.bitmask)
-     {
-       if(unlikely(verbose))
-	 fprintf(stderr,"Using MPICH bcast algorithm - query bits failed\n");
-       MPIDI_Update_last_algorithm(comm_ptr,"MPICH");
-       return MPIR_Bcast_intra(buffer, count, datatype, root, comm_ptr, mpierrno);
-     }
-   }
-   else if(queryreq == MPID_COLL_CHECK_FN_REQUIRED)
+   if(unlikely(queryreq == MPID_COLL_ALWAYS_QUERY ||
+               queryreq == MPID_COLL_CHECK_FN_REQUIRED))
    {
       metadata_result_t result = {0};
       TRACE_ERR("querying bcast protocol %s, type was: %d\n",
          my_bcast_md->name, queryreq);
-      result = my_bcast_md->check_fn(&bcast);
+      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+      {
+        /* process metadata bits */
+      }
+      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+         result = my_bcast_md->check_fn(&bcast);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
       if(result.bitmask)
       {
          if(unlikely(verbose))
             fprintf(stderr,"Using MPICH bcast algorithm - query fn failed\n");
-         MPIDI_Update_last_algorithm(comm_ptr,"MPICH");
+         MPIDI_Update_last_algorithm(comm_ptr,"BCAST_MPICH");
          return MPIR_Bcast_intra(buffer, count, datatype, root, comm_ptr, mpierrno);
       }
    }
diff --git a/src/mpid/pamid/src/coll/gather/mpido_gather.c b/src/mpid/pamid/src/coll/gather/mpido_gather.c
index 139b5d9..d8721a1 100644
--- a/src/mpid/pamid/src/coll/gather/mpido_gather.c
+++ b/src/mpid/pamid/src/coll/gather/mpido_gather.c
@@ -261,12 +261,19 @@ int MPIDO_Gather(const void *sendbuf,
       metadata_result_t result = {0};
       TRACE_ERR("querying gather protocol %s, type was %d\n",
          my_gather_md->name, queryreq);
-      result = my_gather_md->check_fn(&gather);
+      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+      {
+        /* process metadata bits */
+      }
+      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+        result = my_gather_md->check_fn(&gather);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
-      if(!result.bitmask)
+      result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+      if(result.bitmask)
       {
         if(unlikely(verbose))
           fprintf(stderr,"query failed for %s\n", my_gather_md->name);
+        MPIDI_Update_last_algorithm(comm_ptr, "GATHER_MPICH");
         return MPIR_Gather(sendbuf, sendcount, sendtype,
                            recvbuf, recvcount, recvtype,
                            root, comm_ptr, mpierrno);
diff --git a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
index ea2c6e5..a5f9ffe 100644
--- a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
+++ b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
@@ -141,14 +141,21 @@ int MPIDO_Gatherv(const void *sendbuf,
    gatherv.algorithm = my_gatherv;
 
 
-   if(unlikely(queryreq == MPID_COLL_ALWAYS_QUERY || queryreq == MPID_COLL_CHECK_FN_REQUIRED))
+   if(unlikely(queryreq == MPID_COLL_ALWAYS_QUERY || 
+               queryreq == MPID_COLL_CHECK_FN_REQUIRED))
    {
       metadata_result_t result = {0};
       TRACE_ERR("querying gatherv protocol %s, type was %d\n", 
          my_gatherv_md->name, queryreq);
-      result = my_gatherv_md->check_fn(&gatherv);
+      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+      {
+        /* process metadata bits */
+      }
+      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+         result = my_gatherv_md->check_fn(&gatherv);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
-      if(!result.bitmask)
+      result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+      if(result.bitmask)
       {
          if(unlikely(verbose))
             fprintf(stderr,"Query failed for %s\n", my_gatherv_md->name);
diff --git a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
index ec85e33..0f664ad 100644
--- a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
+++ b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
@@ -121,7 +121,8 @@ int MPIDO_Reduce(const void *sendbuf,
    reduce.cmd.xfer_reduce.root = MPID_VCR_GET_LPID(comm_ptr->vcr, root);
 
 
-   if(queryreq == MPID_COLL_ALWAYS_QUERY || queryreq == MPID_COLL_CHECK_FN_REQUIRED)
+   if(unlikely(queryreq == MPID_COLL_ALWAYS_QUERY || 
+               queryreq == MPID_COLL_CHECK_FN_REQUIRED))
    {
       if(my_reduce_md->check_fn != NULL)
       {
@@ -129,8 +130,14 @@ int MPIDO_Reduce(const void *sendbuf,
          TRACE_ERR("Querying reduce protocol %s, type was %d\n",
             my_reduce_md->name,
             queryreq);
-         result = my_reduce_md->check_fn(&reduce);
+         if(queryreq == MPID_COLL_ALWAYS_QUERY)
+         {
+            /* process metadata bits */
+         }
+         else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+            result = my_reduce_md->check_fn(&reduce);
          TRACE_ERR("Bitmask: %#X\n", result.bitmask);
+         result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
          if(result.bitmask)
          {
             if(verbose)
@@ -169,6 +176,7 @@ int MPIDO_Reduce(const void *sendbuf,
    }
    else
    {
+      MPIDI_Update_last_algorithm(comm_ptr, "REDUCE_MPICH");
       if(unlikely(verbose))
          fprintf(stderr,"Using MPICH reduce algorithm\n");
       return MPIR_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm_ptr, mpierrno);
diff --git a/src/mpid/pamid/src/coll/scan/mpido_scan.c b/src/mpid/pamid/src/coll/scan/mpido_scan.c
index 7f38bd0..ba5854d 100644
--- a/src/mpid/pamid/src/coll/scan/mpido_scan.c
+++ b/src/mpid/pamid/src/coll/scan/mpido_scan.c
@@ -64,6 +64,7 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
    pami_type_t pdt;
    int rc;
    const pami_metadata_t *my_md;
+   int queryreq = 0;
 #if ASSERT_LEVEL==0
    /* We can't afford the tracing in ndebug/performance libraries */
     const unsigned verbose = 0;
@@ -111,11 +112,13 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
    {
       scan.algorithm = mpid->opt_protocol[PAMI_XFER_SCAN][0];
       my_md = &mpid->opt_protocol_md[PAMI_XFER_SCAN][0];
+      queryreq     = mpid->must_query[PAMI_XFER_SCAN][0];
    }
    else
    {
       scan.algorithm = mpid->user_selected[PAMI_XFER_SCAN];
       my_md = &mpid->user_metadata[PAMI_XFER_SCAN];
+      queryreq     = selected_type;
    }
    scan.cmd.xfer_scan.sndbuf = sbuf;
    scan.cmd.xfer_scan.rcvbuf = rbuf;
@@ -127,19 +130,30 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
    scan.cmd.xfer_scan.exclusive = exflag;
 
 
-   if(selected_type == MPID_COLL_ALWAYS_QUERY ||
-      selected_type == MPID_COLL_CHECK_FN_REQUIRED)
+   if(unlikely(queryreq == MPID_COLL_ALWAYS_QUERY ||
+               queryreq == MPID_COLL_CHECK_FN_REQUIRED))
    {
       metadata_result_t result = {0};
       TRACE_ERR("Querying scan protocol %s, type was %d\n",
          my_md->name,
          selected_type);
-      result = my_md->check_fn(&scan);
+      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+      {
+        /* process metadata bits */
+      }
+      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+         result = my_md->check_fn(&scan);
       TRACE_ERR("Bitmask: %#X\n", result.bitmask);
-      if(!result.bitmask)
+      result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+      if(result.bitmask)
       {
          fprintf(stderr,"Query failed for %s.\n",
             my_md->name);
+         MPIDI_Update_last_algorithm(comm_ptr, "SCAN_MPICH");
+         if(exflag)
+            return MPIR_Exscan(sendbuf, recvbuf, count, datatype, op, comm_ptr, mpierrno);
+         else
+            return MPIR_Scan(sendbuf, recvbuf, count, datatype, op, comm_ptr, mpierrno);
       }
    }
    
diff --git a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
index 0d82e1d..b5ce1a4 100644
--- a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
+++ b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
@@ -236,9 +236,15 @@ int MPIDO_Scatter(const void *sendbuf,
       metadata_result_t result = {0};
       TRACE_ERR("querying scatter protoocl %s, type was %d\n",
          my_scatter_md->name, queryreq);
-      result = my_scatter_md->check_fn(&scatter);
+      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+      {
+        /* process metadata bits */
+      }
+      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+        result = my_scatter_md->check_fn(&scatter);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
-      if(!result.bitmask)
+      result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+      if(result.bitmask)
       {
         if(unlikely(verbose))
           fprintf(stderr,"query failed for %s\n", my_scatter_md->name);
diff --git a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
index a0539eb..6e565ea 100644
--- a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
+++ b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
@@ -349,14 +349,21 @@ int MPIDO_Scatterv(const void *sendbuf,
    scatterv.cmd.xfer_scatterv_int.rtypecount = recvcount;
    scatterv.cmd.xfer_scatterv_int.sdispls = (int *) displs;
 
-   if(unlikely(queryreq == MPID_COLL_ALWAYS_QUERY || queryreq == MPID_COLL_CHECK_FN_REQUIRED))
+   if(unlikely(queryreq == MPID_COLL_ALWAYS_QUERY || 
+               queryreq == MPID_COLL_CHECK_FN_REQUIRED))
    {
       metadata_result_t result = {0};
       TRACE_ERR("querying scatterv protocol %s, type was %d\n",
          my_scatterv_md->name, queryreq);
-      result = my_scatterv_md->check_fn(&scatterv);
+      if(queryreq == MPID_COLL_ALWAYS_QUERY)
+      {
+        /* process metadata bits */
+      }
+      else /* (queryreq == MPID_COLL_CHECK_FN_REQUIRED - calling the check fn is sufficient */
+        result = my_scatterv_md->check_fn(&scatterv);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
-      if(!result.bitmask)
+      result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+      if(result.bitmask)
       {
         if(unlikely(verbose))
           fprintf(stderr,"Query failed for %s\n", my_scatterv_md->name);
diff --git a/src/mpid/pamid/src/comm/mpid_optcolls.c b/src/mpid/pamid/src/comm/mpid_optcolls.c
index b901c57..d9d3a10 100644
--- a/src/mpid/pamid/src/comm/mpid_optcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_optcolls.c
@@ -145,6 +145,7 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
 {
    TRACE_ERR("Entering MPIDI_Comm_coll_select\n");
    int opt_proto = -1;
+   int mustquery = 0;
    int i;
    int use_threaded_collectives = 1;
 
@@ -201,16 +202,19 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       MPIDI_Coll_comm_check_FCA("GATHERV","I1:GathervInt:FCA:FCA",PAMI_XFER_GATHERV_INT,MPID_COLL_NOQUERY, 0, comm_ptr);
    }
 
+   opt_proto = -1;
+   mustquery = 0;
    /* So, several protocols are really easy. Tackle them first. */
    if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] == MPID_COLL_NOSELECTION)
    {
       TRACE_ERR("No allgatherv[int] env var, so setting optimized allgatherv[int]\n");
       /* Use I0:RectangleDput */
-      for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_ALLGATHERV_INT][0]; i++)
+      for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_ALLGATHERV_INT][1]; i++)
       {
          if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLGATHERV_INT][0][i].name, "I0:RectangleDput:SHMEM:MU") == 0)
          {
             opt_proto = i;
+            mustquery = 1;
             break;
          }
       }
@@ -218,13 +222,13 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       {
          TRACE_ERR("Memcpy protocol type %d number %d (%s) to optimized protocol\n",
             PAMI_XFER_ALLGATHERV_INT, opt_proto,
-            comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLGATHERV_INT][0][opt_proto].name);
+            comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLGATHERV_INT][mustquery][opt_proto].name);
          comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLGATHERV_INT][0] =
-                comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLGATHERV_INT][0][opt_proto];
+                comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLGATHERV_INT][mustquery][opt_proto];
          memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLGATHERV_INT][0], 
-                &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLGATHERV_INT][0][opt_proto], 
+                &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLGATHERV_INT][mustquery][opt_proto], 
                 sizeof(pami_metadata_t));
-         comm_ptr->mpid.must_query[PAMI_XFER_ALLGATHERV_INT][0] = MPID_COLL_NOQUERY;
+         comm_ptr->mpid.must_query[PAMI_XFER_ALLGATHERV_INT][0] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
          comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] = MPID_COLL_OPTIMIZED;
       }
       else
@@ -237,7 +241,7 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
    }
 
    opt_proto = -1;
-
+   mustquery = 0;
    /* Alltoall */
    /* If the user has forced a selection, don't bother setting it here */
    if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALL] == MPID_COLL_NOSELECTION)
@@ -247,11 +251,12 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
        * displacement array memory issues today.... */
       /* Loop over the protocols until we find the one we want */
       if(use_threaded_collectives)
-       for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_ALLTOALL][0]; i++)
+       for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_ALLTOALL][1]; i++)
        {
-         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALL][0][i].name, "I0:M2MComposite:MU:MU") == 0)
+         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALL][1][i].name, "I0:M2MComposite:MU:MU") == 0)
          {
             opt_proto = i;
+            mustquery = 1;
             break;
          }
        }
@@ -259,13 +264,13 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       {
          TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimized protocol\n",
             PAMI_XFER_ALLTOALL, opt_proto, 
-            comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALL][0][opt_proto].name);
+            comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALL][mustquery][opt_proto].name);
          comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLTOALL][0] =
-                comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLTOALL][0][opt_proto];
+                comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLTOALL][mustquery][opt_proto];
          memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLTOALL][0], 
-                &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALL][0][opt_proto], 
+                &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALL][mustquery][opt_proto], 
                 sizeof(pami_metadata_t));
-         comm_ptr->mpid.must_query[PAMI_XFER_ALLTOALL][0] = MPID_COLL_NOQUERY;
+         comm_ptr->mpid.must_query[PAMI_XFER_ALLTOALL][0] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
          comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALL] = MPID_COLL_OPTIMIZED;
       }
       else
@@ -279,6 +284,7 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
 
 
    opt_proto = -1;
+   mustquery = 0;
    /* Alltoallv */
    if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALLV_INT] == MPID_COLL_NOSELECTION)
    {
@@ -287,11 +293,12 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
        * displacement array memory issues today.... */
       /* Loop over the protocols until we find the one we want */
       if(use_threaded_collectives)
-       for(i = 0; i <comm_ptr->mpid.coll_count[PAMI_XFER_ALLTOALLV_INT][0]; i++)
+       for(i = 0; i <comm_ptr->mpid.coll_count[PAMI_XFER_ALLTOALLV_INT][1]; i++)
        {
-         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALLV_INT][0][i].name, "I0:M2MComposite:MU:MU") == 0)
+         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALLV_INT][1][i].name, "I0:M2MComposite:MU:MU") == 0)
          {
             opt_proto = i;
+            mustquery = 1;
             break;
          }
        }
@@ -299,13 +306,13 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       {
          TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimized protocol\n",
             PAMI_XFER_ALLTOALLV_INT, opt_proto, 
-            comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALLV_INT][0][opt_proto].name);
+            comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALLV_INT][mustquery][opt_proto].name);
          comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLTOALLV_INT][0] =
-                comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLTOALLV_INT][0][opt_proto];
+                comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLTOALLV_INT][mustquery][opt_proto];
          memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLTOALLV_INT][0], 
-                &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALLV_INT][0][opt_proto], 
+                &comm_ptr->mpid.coll_metadata[PAMI_XFER_ALLTOALLV_INT][mustquery][opt_proto], 
                 sizeof(pami_metadata_t));
-         comm_ptr->mpid.must_query[PAMI_XFER_ALLTOALLV_INT][0] = MPID_COLL_NOQUERY;
+         comm_ptr->mpid.must_query[PAMI_XFER_ALLTOALLV_INT][0] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
          comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALLV_INT] = MPID_COLL_OPTIMIZED;
       }
       else
@@ -318,6 +325,7 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
    }
    
    opt_proto = -1;
+   mustquery = 0;
    /* Barrier */
    if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BARRIER] == MPID_COLL_NOSELECTION)
    {
@@ -405,6 +413,7 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
    }
 
    opt_proto = -1;
+   mustquery = 0;
 
    /* This becomes messy when we have to message sizes. If we were gutting the 
     * existing framework, it might be easier, but I think the existing framework
@@ -435,7 +444,6 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       /* I0:RankBased_Binomial:-:ShortMU is good on irregular for <256 bytes */
       /* I0:MultiCast:SHMEM:- is good at 1 node/16ppn, which is a SOW point */
       TRACE_ERR("No bcast env var, so setting optimized bcast\n");
-      int mustquery = 0;
 
       if(use_threaded_collectives)
        for(i = 0 ; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
@@ -529,12 +537,11 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
          comm_ptr->mpid.opt_protocol[PAMI_XFER_BROADCAST][0] = 0;
       }
 
-      mustquery = 0;
-
       TRACE_ERR("Done setting optimized bcast 0\n");
 
       /* Now, look into large message bcasts */
       opt_proto = -1;
+      mustquery = 0;
       /* If bcast0 is I0:MultiCastDput:-:MU, and I0:RectangleDput:MU:MU is available, use
        * it for 64k messages */
       if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] != MPID_COLL_USE_MPICH)
@@ -616,14 +623,14 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
             memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][1], 
                    &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0],
                    sizeof(pami_metadata_t));
-            comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] = 
-	      comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][0];
+            comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] = comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][0];
          }
       }
       TRACE_ERR("Done with bcast protocol selection\n");
    }
 
    opt_proto = -1;
+   mustquery = 0;
    /* The most fun... allreduce */
    /* 512-way data: */
    /* For starters, Amith's protocol works on doubles on sum/min/max. Because
diff --git a/src/mpid/pamid/src/comm/mpid_selectcolls.c b/src/mpid/pamid/src/comm/mpid_selectcolls.c
index c454d4f..57cb3cc 100644
--- a/src/mpid/pamid/src/comm/mpid_selectcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_selectcolls.c
@@ -82,10 +82,10 @@ static void MPIDI_Update_coll(pami_algorithm_t coll,
          TRACE_ERR("Protocol %s setting to always query/call check_fn\n", comm->mpid.coll_metadata[coll][type][index].name);
          comm->mpid.user_selected_type[coll] = MPID_COLL_CHECK_FN_REQUIRED;
       } 
-      else
+      else /* No check fn but we still need to check metadata bits (query protocol)  */
       {
-	TRACE_ERR("Protocol %s setting to always query/no check_fn\n", comm->mpid.coll_metadata[coll][type][index].name);
-	comm->mpid.user_selected_type[coll] = MPID_COLL_ALWAYS_QUERY;
+         TRACE_ERR("Protocol %s setting to always query/no check_fn\n", comm->mpid.coll_metadata[coll][type][index].name);
+         comm->mpid.user_selected_type[coll] = MPID_COLL_ALWAYS_QUERY;
       }
 
    }

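The hunks above move protocol selection from the no-query list (`[0]`) to the must-query list (`[1]`) and record the choice with a `mustquery ? MPID_COLL_ALWAYS_QUERY : MPID_COLL_NOQUERY` ternary. A minimal sketch of that pattern, using hypothetical stand-in names rather than the real PAMID structures:

```c
#include <assert.h>
#include <string.h>
#include <strings.h>  /* POSIX strcasecmp */

/* Hypothetical, simplified stand-ins for the PAMID query modes:
 * NOQUERY protocols can be cached and reused blindly; ALWAYS_QUERY
 * protocols must have their metadata rechecked on every call. */
enum { COLL_NOQUERY = 0, COLL_ALWAYS_QUERY = 1 };

/* Scan one protocol list for a name; return its index or -1. */
static int find_protocol(const char *names[], int count, const char *want)
{
   int i;
   for (i = 0; i < count; i++)
      if (strcasecmp(names[i], want) == 0)
         return i;
   return -1;
}

/* Mirrors the pattern in the patch: search the must-query list [1]
 * for the preferred alltoall protocol, and derive the query mode from
 * whether the match came from that list. */
static int select_alltoall(const char *list1[], int n1, int *query_mode)
{
   int mustquery = 0;
   int opt_proto = find_protocol(list1, n1, "I0:M2MComposite:MU:MU");
   if (opt_proto >= 0)
      mustquery = 1;
   *query_mode = mustquery ? COLL_ALWAYS_QUERY : COLL_NOQUERY;
   return opt_proto;
}
```

The design point is that the list index (`[0]` vs `[1]`) and the stored query mode must stay in sync; the patch threads a single `mustquery` flag through both to keep them consistent.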
http://git.mpich.org/mpich.git/commitdiff/f0eefb17a7e960824e8c7f54539c7a5b8fb08128

commit f0eefb17a7e960824e8c7f54539c7a5b8fb08128
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Wed Nov 21 14:22:32 2012 -0600

    Update PAMID with new bcast metadata changes
    
    (ibm) Issue 8783
    (ibm) 119ad60f127b16688c1a18b266752f7c2704c9e8
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
index 73b0b1e..bab8b58 100644
--- a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
+++ b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
@@ -176,7 +176,7 @@ int MPIDO_Bcast(void *buffer,
    }
    else
    {
-      TRACE_ERR("Optimized bcast (%s) was specified by user\n",
+      TRACE_ERR("Bcast (%s) was specified by user\n",
          mpid->user_metadata[PAMI_XFER_BROADCAST].name);
       my_bcast =  mpid->user_selected[PAMI_XFER_BROADCAST];
       my_bcast_md = &mpid->user_metadata[PAMI_XFER_BROADCAST];
@@ -185,17 +185,34 @@ int MPIDO_Bcast(void *buffer,
 
    bcast.algorithm = my_bcast;
 
-   if(queryreq == MPID_COLL_ALWAYS_QUERY || queryreq == MPID_COLL_CHECK_FN_REQUIRED)
+   if(queryreq == MPID_COLL_ALWAYS_QUERY)
+   {
+     metadata_result_t result = {0};
+     TRACE_ERR("querying bcast protocol %s, type was: %d\n",
+	       my_bcast_md->name, queryreq);
+     // TODO check bits?
+     TRACE_ERR("bitmask: %#X\n", result.bitmask);
+     result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+     if(result.bitmask)
+     {
+       if(unlikely(verbose))
+	 fprintf(stderr,"Using MPICH bcast algorithm - query bits failed\n");
+       MPIDI_Update_last_algorithm(comm_ptr,"MPICH");
+       return MPIR_Bcast_intra(buffer, count, datatype, root, comm_ptr, mpierrno);
+     }
+   }
+   else if(queryreq == MPID_COLL_CHECK_FN_REQUIRED)
    {
       metadata_result_t result = {0};
       TRACE_ERR("querying bcast protocol %s, type was: %d\n",
          my_bcast_md->name, queryreq);
       result = my_bcast_md->check_fn(&bcast);
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
-      if(!result.bitmask)
+      result.check.nonlocal = 0; /* #warning REMOVE THIS WHEN IMPLEMENTED */
+      if(result.bitmask)
       {
          if(unlikely(verbose))
-            fprintf(stderr,"Using MPICH bcast algorithm\n");
+            fprintf(stderr,"Using MPICH bcast algorithm - query fn failed\n");
          MPIDI_Update_last_algorithm(comm_ptr,"MPICH");
          return MPIR_Bcast_intra(buffer, count, datatype, root, comm_ptr, mpierrno);
       }
@@ -307,4 +324,4 @@ int MPIDO_Bcast_simple(void *buffer,
 
    TRACE_ERR("Exiting MPIDO_Bcast_optimized\n");
    return 0;
-}
\ No newline at end of file
+}
diff --git a/src/mpid/pamid/src/comm/mpid_optcolls.c b/src/mpid/pamid/src/comm/mpid_optcolls.c
index 274eccf..b901c57 100644
--- a/src/mpid/pamid/src/comm/mpid_optcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_optcolls.c
@@ -431,8 +431,6 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
    /* First, set up small message bcasts */
    if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] == MPID_COLL_NOSELECTION)
    {
-      /* Note: Neither of these protocols are in a 'must query' list so
-       * the for() loop only needs to loop over [0] protocols */
       /* Complicated exceptions: */
       /* I0:RankBased_Binomial:-:ShortMU is good on irregular for <256 bytes */
       /* I0:MultiCast:SHMEM:- is good at 1 node/16ppn, which is a SOW point */
@@ -440,43 +438,47 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       int mustquery = 0;
 
       if(use_threaded_collectives)
-       for(i = 0 ; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][0]; i++)
+       for(i = 0 ; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
        {
          /* These two are mutually exclusive */
-         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:MultiCastDput:-:MU") == 0)
+         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:MultiCastDput:-:MU") == 0)
             opt_proto = i;
-         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:MultiCastDput:SHMEM:MU") == 0)
+         if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:MultiCastDput:SHMEM:MU") == 0)
             opt_proto = i;
+	 mustquery = 1;
        }
-      /* Next best rectangular to check */
+      /* Next best MU 2 device to check */
       if(use_threaded_collectives)
       if(opt_proto == -1)
       {
-         /* This is also NOT in the 'must query' list */
          if(use_threaded_collectives)
-          for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][0]; i++)
+          for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
           {
-            if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:MultiCast2DeviceDput:SHMEM:MU") == 0)
+            if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:MultiCast2DeviceDput:SHMEM:MU") == 0)
                opt_proto = i;
+               mustquery = 1;
           }
       }
-      /* Another rectangular to check */
+      /* Check for  rectangle */
       if(use_threaded_collectives)
       if(opt_proto == -1)
       {
-         /* This is also NOT in the 'must query' list */
          unsigned len = strlen("I0:RectangleDput:");
          if(use_threaded_collectives)
-          for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][0]; i++)
+          for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
           {
-            if(strcasecmp (comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:RectangleDput:SHMEM:MU") == 0)
+            if(strcasecmp (comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:RectangleDput:SHMEM:MU") == 0)
             { /* Prefer the :SHMEM:MU so break when it's found */
                opt_proto = i; 
+               mustquery = 1;
                break;
             }
             /* Otherwise any RectangleDput is better than nothing. */
-            if(strncasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:RectangleDput:",len) == 0)
+            if(strncasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:RectangleDput:",len) == 0)
+	    {
                opt_proto = i;
+               mustquery = 1;
+	    }
           }
       }
       if(opt_proto == -1)
@@ -484,7 +486,6 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
          for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
          {
             /* This is a good choice for small messages only */
-            /* BES TODO Why is this protocol in a must query list? */
             if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:RankBased_Binomial:SHMEM:MU") == 0)
             {
                opt_proto = i;
@@ -518,7 +519,7 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
          memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0], 
                 &comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][mustquery][opt_proto], 
                 sizeof(pami_metadata_t));
-         comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][0] = MPID_COLL_NOQUERY;
+         comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][0] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
          comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] = MPID_COLL_OPTIMIZED;
       }
       else
@@ -528,6 +529,7 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
          comm_ptr->mpid.opt_protocol[PAMI_XFER_BROADCAST][0] = 0;
       }
 
+      mustquery = 0;
 
       TRACE_ERR("Done setting optimized bcast 0\n");
 
@@ -535,18 +537,18 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       opt_proto = -1;
       /* If bcast0 is I0:MultiCastDput:-:MU, and I0:RectangleDput:MU:MU is available, use
        * it for 64k messages */
-      /* Again, none of these protocols are in the 'must query' list */
       if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] != MPID_COLL_USE_MPICH)
       {
       if(use_threaded_collectives)
          if(strcasecmp(comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name, "I0:MultiCastDput:-:MU") == 0)
          {
             /* See if I0:RectangleDput:MU:MU is available */
-            for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][0]; i++)
+            for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
             {
-               if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:RectangleDput:MU:MU") == 0)
+               if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:RectangleDput:MU:MU") == 0)
                {
                   opt_proto = i;
+		  mustquery = 1;
                   comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0] = 65536;
                }
             }
@@ -557,11 +559,12 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
          if(strcasecmp(comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name, "I0:MultiCastDput:SHMEM:MU") == 0)
          {
             /* See if I0:RectangleDput:SHMEM:MU is available */
-            for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][0]; i++)
+            for(i = 0; i < comm_ptr->mpid.coll_count[PAMI_XFER_BROADCAST][1]; i++)
             {
-               if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][i].name, "I0:RectangleDput:SHMEM:MU") == 0)
+               if(strcasecmp(comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][1][i].name, "I0:RectangleDput:SHMEM:MU") == 0)
                {
                   opt_proto = i;
+		  mustquery = 1;
                   comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0] = 131072;
                }
             }
@@ -582,20 +585,20 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
             if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm_ptr->rank == 0)
             {
                fprintf(stderr,"Selecting %s as optimal broadcast 1 (above %d)\n", 
-                  comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][opt_proto].name, 
+                  comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][mustquery][opt_proto].name, 
                   comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0]);
             }
             TRACE_ERR("Memcpy protocol type %d, number %d (%s) to optimize protocol 1 (above %d)\n",
                PAMI_XFER_BROADCAST, opt_proto, 
-               comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][opt_proto].name,
+               comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][mustquery][opt_proto].name,
                comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0]);
 
             comm_ptr->mpid.opt_protocol[PAMI_XFER_BROADCAST][1] =
-                    comm_ptr->mpid.coll_algorithm[PAMI_XFER_BROADCAST][0][opt_proto];
+                    comm_ptr->mpid.coll_algorithm[PAMI_XFER_BROADCAST][mustquery][opt_proto];
             memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][1], 
-                   &comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][0][opt_proto], 
+                   &comm_ptr->mpid.coll_metadata[PAMI_XFER_BROADCAST][mustquery][opt_proto], 
                    sizeof(pami_metadata_t));
-            comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] = MPID_COLL_NOQUERY;
+            comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] = mustquery?MPID_COLL_ALWAYS_QUERY:MPID_COLL_NOQUERY;
             /* This should already be set... */
             /* comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] = MPID_COLL_OPTIMIZED; */
          }
@@ -613,7 +616,8 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
             memcpy(&comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][1], 
                    &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0],
                    sizeof(pami_metadata_t));
-            comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] = MPID_COLL_NOQUERY;
+            comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] = 
+	      comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][0];
          }
       }
       TRACE_ERR("Done with bcast protocol selection\n");
@@ -773,9 +777,12 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm_ptr)
       if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] == MPID_COLL_OPTIMIZED)
          fprintf(stderr,"Selecting %s for opt bcast up to size %d comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name,
             comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0], comm_ptr);
-      if(comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] == MPID_COLL_NOQUERY)
-         fprintf(stderr,"Selecting %s for opt bcast above size %d comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][1].name,
-            comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0], comm_ptr);
+      if((comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] == MPID_COLL_NOQUERY) ||
+	 (comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1] == MPID_COLL_ALWAYS_QUERY))
+         fprintf(stderr,"Selecting %s (mustquery=%d) for opt bcast above size %d comm %p\n",
+		 comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][1].name,
+		 comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1],
+		 comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0], comm_ptr);
       if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALLV_INT] == MPID_COLL_OPTIMIZED)
          fprintf(stderr,"Selecting %s for opt alltoallv comm %p\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLTOALLV_INT][0].name, comm_ptr);
       if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALL] == MPID_COLL_OPTIMIZED)
diff --git a/src/mpid/pamid/src/comm/mpid_selectcolls.c b/src/mpid/pamid/src/comm/mpid_selectcolls.c
index 7db6f94..c454d4f 100644
--- a/src/mpid/pamid/src/comm/mpid_selectcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_selectcolls.c
@@ -79,8 +79,13 @@ static void MPIDI_Update_coll(pami_algorithm_t coll,
       {
          /* For now, if there's a check_fn we will always call it and not cache.
             We *could* be smarter about this eventually.                        */
-         TRACE_ERR("Protocol %s setting to always query\n", comm->mpid.coll_metadata[coll][type][index].name);
-         comm->mpid.user_selected_type[coll] = MPID_COLL_ALWAYS_QUERY;
+         TRACE_ERR("Protocol %s setting to always query/call check_fn\n", comm->mpid.coll_metadata[coll][type][index].name);
+         comm->mpid.user_selected_type[coll] = MPID_COLL_CHECK_FN_REQUIRED;
+      } 
+      else
+      {
+	TRACE_ERR("Protocol %s setting to always query/no check_fn\n", comm->mpid.coll_metadata[coll][type][index].name);
+	comm->mpid.user_selected_type[coll] = MPID_COLL_ALWAYS_QUERY;
       }
 
    }

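The MPIDO_Bcast hunk above splits the query path in two: `MPID_COLL_ALWAYS_QUERY` checks metadata bits only (the bit checks themselves are still a TODO in the patch), while `MPID_COLL_CHECK_FN_REQUIRED` calls the protocol's `check_fn`; in either case a nonzero `result.bitmask` forces a fallback to the plain MPICH algorithm. A sketch of that control flow, with hypothetical simplified types:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the PAMID query machinery. */
typedef struct { unsigned bitmask; } metadata_result_t;
typedef metadata_result_t (*check_fn_t)(void *call);

enum { COLL_ALWAYS_QUERY, COLL_CHECK_FN_REQUIRED, COLL_NOQUERY };

/* Returns 1 when the optimized protocol may be used, 0 when the caller
 * should fall back to the MPICH algorithm -- mirroring the patch, where
 * a nonzero result.bitmask after the query means "do not use". */
static int bcast_query_ok(int queryreq, check_fn_t check_fn, void *call)
{
   metadata_result_t result = {0};
   if (queryreq == COLL_ALWAYS_QUERY)
   {
      /* Metadata-bits-only query; the patch leaves the actual bit
       * checks as a TODO, so nothing sets result.bitmask here yet. */
   }
   else if (queryreq == COLL_CHECK_FN_REQUIRED)
   {
      result = check_fn(call);   /* protocol-supplied check function */
   }
   return result.bitmask == 0;
}

/* Two toy check functions for illustration. */
static metadata_result_t fail_check(void *call)
{ (void)call; metadata_result_t r = {0x4}; return r; }
static metadata_result_t pass_check(void *call)
{ (void)call; metadata_result_t r = {0};   return r; }
```

Note the patch also flipped the sense of the test (`!result.bitmask` became `result.bitmask` guarding the fallback), which is the actual bug fix accompanying the restructuring.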
http://git.mpich.org/mpich.git/commitdiff/be2862cd7dc3a0c619200905b0a2985805cc3a26

commit be2862cd7dc3a0c619200905b0a2985805cc3a26
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Tue Nov 20 11:44:20 2012 -0600

    Set appropriate wrapper flags and default error string length.
    
    (ibm) 1c4bb2d66f358c76ee53dfde6866fa9cbecc42df
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/subconfigure.m4 b/src/mpid/pamid/subconfigure.m4
index 017f7f1..0ca5777 100644
--- a/src/mpid/pamid/subconfigure.m4
+++ b/src/mpid/pamid/subconfigure.m4
@@ -1,4 +1,22 @@
 [#] start of __file__
+dnl begin_generated_IBM_copyright_prolog                             
+dnl                                                                  
+dnl This is an automatically generated copyright prolog.             
+dnl After initializing,  DO NOT MODIFY OR MOVE                       
+dnl  --------------------------------------------------------------- 
+dnl Licensed Materials - Property of IBM                             
+dnl Blue Gene/Q 5765-PER 5765-PRP                                    
+dnl                                                                  
+dnl (C) Copyright IBM Corp. 2011, 2012 All Rights Reserved           
+dnl US Government Users Restricted Rights -                          
+dnl Use, duplication, or disclosure restricted                       
+dnl by GSA ADP Schedule Contract with IBM Corp.                      
+dnl                                                                  
+dnl  --------------------------------------------------------------- 
+dnl                                                                  
+dnl end_generated_IBM_copyright_prolog                               
+dnl -*- mode: makefile-gmake; -*-
+
 dnl MPICH_SUBCFG_BEFORE=src/mpid/common/sched
 dnl MPICH_SUBCFG_BEFORE=src/mpid/common/datatype
 dnl MPICH_SUBCFG_BEFORE=src/mpid/common/thread
@@ -93,10 +111,25 @@ if test "${pamid_platform}" = "BGQ" ; then
       PAC_APPEND_FLAG([-I${bgq_driver}/spi/include],            [CPPFLAGS])
       PAC_APPEND_FLAG([-I${bgq_driver}/spi/include/kernel/cnk], [CPPFLAGS])
 
-      PAC_APPEND_FLAG([-I${bgq_driver}],                        [WRAPPER_CPPFLAGS])
-      PAC_APPEND_FLAG([-I${bgq_driver}/comm/sys/include],       [WRAPPER_CPPFLAGS])
-      PAC_APPEND_FLAG([-I${bgq_driver}/spi/include],            [WRAPPER_CPPFLAGS])
-      PAC_APPEND_FLAG([-I${bgq_driver}/spi/include/kernel/cnk], [WRAPPER_CPPFLAGS])
+      PAC_APPEND_FLAG([-I${bgq_driver}],                        [WRAPPER_CFLAGS])
+      PAC_APPEND_FLAG([-I${bgq_driver}],                        [WRAPPER_CXXFLAGS])
+      PAC_APPEND_FLAG([-I${bgq_driver}],                        [WRAPPER_FCFLAGS])
+      PAC_APPEND_FLAG([-I${bgq_driver}],                        [WRAPPER_FFLAGS])
+
+      PAC_APPEND_FLAG([-I${bgq_driver}/comm/sys/include],       [WRAPPER_CFLAGS])
+      PAC_APPEND_FLAG([-I${bgq_driver}/comm/sys/include],       [WRAPPER_CXXFLAGS])
+      PAC_APPEND_FLAG([-I${bgq_driver}/comm/sys/include],       [WRAPPER_FCFLAGS])
+      PAC_APPEND_FLAG([-I${bgq_driver}/comm/sys/include],       [WRAPPER_FFLAGS])
+
+      PAC_APPEND_FLAG([-I${bgq_driver}/spi/include],            [WRAPPER_CFLAGS])
+      PAC_APPEND_FLAG([-I${bgq_driver}/spi/include],            [WRAPPER_CXXFLAGS])
+      PAC_APPEND_FLAG([-I${bgq_driver}/spi/include],            [WRAPPER_FCFLAGS])
+      PAC_APPEND_FLAG([-I${bgq_driver}/spi/include],            [WRAPPER_FFLAGS])
+
+      PAC_APPEND_FLAG([-I${bgq_driver}/spi/include/kernel/cnk], [WRAPPER_CFLAGS])
+      PAC_APPEND_FLAG([-I${bgq_driver}/spi/include/kernel/cnk], [WRAPPER_CXXFLAGS])
+      PAC_APPEND_FLAG([-I${bgq_driver}/spi/include/kernel/cnk], [WRAPPER_FCFLAGS])
+      PAC_APPEND_FLAG([-I${bgq_driver}/spi/include/kernel/cnk], [WRAPPER_FFLAGS])
 
       PAC_APPEND_FLAG([-L${bgq_driver}/spi/lib],                [LDFLAGS])
 
@@ -128,7 +161,9 @@ if test "${pamid_platform}" = "BGQ" ; then
   MPID_LIBTOOL_STATIC_FLAG="-all-static"
 fi
 
-
+if test "${pamid_platform}" = "PE" ; then
+        MPID_MAX_ERROR_STRING=512
+fi
 #
 # Check for gnu-style option to enable all warnings; if specified, then
 # add gnu option to treat all warnings as errors.

http://git.mpich.org/mpich.git/commitdiff/6d3a8039a552b753a8f34e1278a4d308c0dfa82d

commit 6d3a8039a552b753a8f34e1278a4d308c0dfa82d
Author: Haizhu Liu <haizhu at us.ibm.com>
Date:   Mon Feb 25 13:14:53 2013 -0500

    Dyntask/pgroup_intercomm_test test failure
    
    (ibm) D188897
    (ibm) 8b3fcd50bae917073b042b4a570805da96dd8f3c
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index b33f07f..57181cb 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -811,9 +811,9 @@ MPIDI_VCRT_init(int rank, int size, char *world_tasks, MPIDI_PG_t *pg)
 
 #ifdef DYNAMIC_TASKING
   if(mpidi_dynamic_tasking) {
-    comm->vcr[0]->pg=pg->vct[0].pg;
-    comm->vcr[0]->pg_rank=pg->vct[0].pg_rank;
-    pg->vct[0].taskid = comm->vcr[0]->taskid;
+    comm->vcr[0]->pg=pg->vct[rank].pg;
+    comm->vcr[0]->pg_rank=pg->vct[rank].pg_rank;
+    pg->vct[rank].taskid = comm->vcr[0]->taskid;
     if(comm->vcr[0]->pg) {
       TRACE_ERR("Adding ref for comm=%x vcr=%x pg=%x\n", comm, comm->vcr[0], comm->vcr[0]->pg);
       MPIDI_PG_add_ref(comm->vcr[0]->pg);

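The mpid_init.c fix above replaces a hard-coded `pg->vct[0]` with `pg->vct[rank]`, so each task reads its own entry of the virtual-connection table instead of rank 0's. The bug pattern can be sketched with a hypothetical, simplified table type:

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for a process-group VCT entry. */
typedef struct { int pg_rank; int taskid; } vct_entry;

/* Buggy form: always reads slot 0, so every rank saw rank 0's data. */
static int lookup_pg_rank_buggy(const vct_entry *vct, int rank)
{
   (void)rank;
   return vct[0].pg_rank;
}

/* Fixed form, as in the patch: index the table by this task's rank. */
static int lookup_pg_rank_fixed(const vct_entry *vct, int rank)
{
   return vct[rank].pg_rank;
}
```

With dynamic tasking, ranks other than 0 were being wired to rank 0's process group and taskid, which is consistent with the reported pgroup_intercomm_test failure.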
http://git.mpich.org/mpich.git/commitdiff/14d4dd38f994a5d702accafddc7e2e64093d7ab9

commit 14d4dd38f994a5d702accafddc7e2e64093d7ab9
Author: Su Huang <suhuang at us.ibm.com>
Date:   Wed Feb 20 15:21:02 2013 -0500

    probe12 failed on MPICH2 from rcot PTF1 build
    
    (ibm) D188785
    (ibm) 47ee200a814c25a7d8e594c38ed1a50ee171cbe3
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/src/mpid_recvq.c b/src/mpid/pamid/src/mpid_recvq.c
index 313bb37..8897296 100644
--- a/src/mpid/pamid/src/mpid_recvq.c
+++ b/src/mpid/pamid/src/mpid_recvq.c
@@ -158,23 +158,20 @@ MPIDI_Recvq_FU(int source, int tag, int context_id, MPI_Status * status)
 #ifdef USE_STATISTICS
         ++search_length;
 #endif
-#ifdef OUT_OF_ORDER_HANDLING
-        if(( ( (int)(nMsgs-MPIDI_Request_getMatchSeq(rreq))) >= 0) || (source == MPI_ANY_SOURCE)) {
-#endif
         if ( (  MPIDI_Request_getMatchCtxt(rreq)              == match.context_id) &&
              ( (MPIDI_Request_getMatchRank(rreq) & mask.rank) == match.rank      ) &&
              ( (MPIDI_Request_getMatchTag(rreq)  & mask.tag ) == match.tag       )
              )
           {
 #ifdef OUT_OF_ORDER_HANDLING
-            if(source == MPI_ANY_SOURCE) {
-              pami_source= MPIDI_Request_getPeerRank_pami(rreq);
-              in_cntr = &MPIDI_In_cntr[pami_source];
-              nMsgs = in_cntr->nMsgs+1;
-              if((int) (nMsgs-MPIDI_Request_getMatchSeq(rreq)) < 0 )
-                 goto NEXT_MSG;
-
-            }
+            pami_source= MPIDI_Request_getPeerRank_pami(rreq);
+            in_cntr=&MPIDI_In_cntr[pami_source];
+            nMsgs = in_cntr->nMsgs + 1;
+            if(( ( (int)(nMsgs-MPIDI_Request_getMatchSeq(rreq))) >= 0) || (source == MPI_ANY_SOURCE)) {
+               if(source == MPI_ANY_SOURCE) {
+                 if((int) (nMsgs-MPIDI_Request_getMatchSeq(rreq)) < 0 )
+                    goto NEXT_MSG;
+               }
             if (rreq->mpid.nextR != NULL)  { /* recv is in the out of order list */
               if (MPIDI_Request_getMatchSeq(rreq) == nMsgs)
                 in_cntr->nMsgs=nMsgs;
@@ -185,12 +182,12 @@ MPIDI_Recvq_FU(int source, int tag, int context_id, MPI_Status * status)
             if(status != MPI_STATUS_IGNORE) 
               *status = (rreq->status);
             break;
-          }
-
 #ifdef OUT_OF_ORDER_HANDLING
+           }
+#endif
+
         }
      NEXT_MSG:
-#endif
         rreq = rreq->mpid.next;
       }
     }

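The mpid_recvq.c hunk hoists the sequence-number guard `(int)(nMsgs - MPIDI_Request_getMatchSeq(rreq)) >= 0` so it runs after the envelope match rather than before it. That guard is a wraparound-safe "has this message's turn arrived" test on unsigned sequence counters; casting the unsigned difference to `int` keeps the comparison correct across 32-bit counter wrap, as long as the two counters stay within `INT_MAX` of each other. A minimal sketch:

```c
#include <assert.h>

/* Wraparound-safe ordering test matching the patch's guard: with
 * unsigned sequence counters, the signed view of the difference is
 * non-negative exactly when matchSeq is at or before nMsgs, even
 * when the counters have wrapped past 0xFFFFFFFF. */
static int seq_arrived(unsigned nMsgs, unsigned matchSeq)
{
   return (int)(nMsgs - matchSeq) >= 0;
}
```

This is the same idiom TCP uses for sequence-number comparison; a plain `nMsgs >= matchSeq` would misorder messages at the wrap point.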
http://git.mpich.org/mpich.git/commitdiff/c5f01fb660c1a60ac4ae2d01ff8884723c1dc4a3

commit c5f01fb660c1a60ac4ae2d01ff8884723c1dc4a3
Author: Haizhu Liu <haizhu at us.ibm.com>
Date:   Mon Feb 18 15:33:42 2013 -0500

    MPI_Comm_disconnect problem
    
    (ibm) D188388
    (ibm) 26421dc57d61e3e58c4235c7553f63bd53507aab
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_datatypes.h b/src/mpid/pamid/include/mpidi_datatypes.h
index bca2e5c..b96d488 100644
--- a/src/mpid/pamid/include/mpidi_datatypes.h
+++ b/src/mpid/pamid/include/mpidi_datatypes.h
@@ -154,6 +154,7 @@ enum
     MPIDI_Protocols_RVZ_zerobyte,
 #ifdef DYNAMIC_TASKING
     MPIDI_Protocols_Dyntask,
+    MPIDI_Protocols_Dyntask_disconnect,
 #endif
     MPIDI_Protocols_COUNT,
   };
@@ -352,6 +353,8 @@ struct MPIDI_Comm
     pami_endpoint_t *endpoints;
   } tasks_descriptor;
 #ifdef DYNAMIC_TASKING
+  int local_leader;
+  long long world_intercomm_cntr;
   int *world_ids;      /* ids of worlds that composed this communicator (inter communicator created for dynamic tasking */
 #endif
 };
diff --git a/src/mpid/pamid/include/mpidi_prototypes.h b/src/mpid/pamid/include/mpidi_prototypes.h
index 77632f1..e40baa3 100644
--- a/src/mpid/pamid/include/mpidi_prototypes.h
+++ b/src/mpid/pamid/include/mpidi_prototypes.h
@@ -148,6 +148,14 @@ void MPIDI_Recvfrom_remote_world (pami_context_t    context,
                                   size_t            sndlen,
                                   pami_endpoint_t   sender,
                                   pami_recv_t     * recv);
+void MPIDI_Recvfrom_remote_world_disconnect (pami_context_t    context,
+                                  void            * cookie,
+                                  const void      * _msginfo,
+                                  size_t            msginfo_size,
+                                  const void      * sndbuf,
+                                  size_t            sndlen,
+                                  pami_endpoint_t   sender,
+                                  pami_recv_t     * recv);
 #endif
 #ifdef OUT_OF_ORDER_HANDLING
 void MPIDI_Recvq_process_out_of_order_msgs(pami_task_t src, pami_context_t context);
diff --git a/src/mpid/pamid/include/mpidimpl.h b/src/mpid/pamid/include/mpidimpl.h
index 5993ac2..8e5d24f 100644
--- a/src/mpid/pamid/include/mpidimpl.h
+++ b/src/mpid/pamid/include/mpidimpl.h
@@ -100,6 +100,12 @@ typedef struct conn_info {
   struct conn_info   *next;
 }conn_info;
 
+/* linked list of transaction ids for all active remote connections in my world */
+typedef struct transactionID_struct {
+  long long                     tranid;
+  int                           *cntr_for_AM; /* Array size = TOTAL_AM */
+  struct transactionID_struct   *next;
+}transactionID_struct;
 
 /*--------------------------
   BEGIN MPI PORT SECTION
diff --git a/src/mpid/pamid/src/dyntask/mpid_comm_disconnect.c b/src/mpid/pamid/src/dyntask/mpid_comm_disconnect.c
index 0317f5e..094ca0f 100644
--- a/src/mpid/pamid/src/dyntask/mpid_comm_disconnect.c
+++ b/src/mpid/pamid/src/dyntask/mpid_comm_disconnect.c
@@ -15,12 +15,182 @@
 /*                                                                  */
 /* end_generated_IBM_copyright_prolog                               */
 /*  (C)Copyright IBM Corp.  2007, 2011  */
-
 #include "mpidimpl.h"
 
 #ifdef DYNAMIC_TASKING
 
 extern conn_info *_conn_info_list;
+
+#define DISCONNECT_LAPI_XFER_TIMEOUT  5*60*1000000
+#define TOTAL_AM    3
+#define FIRST_AM    0
+#define SECOND_AM   1
+#define LAST_AM     2
+
+/* Returns the current time in microseconds, as a double, in "ticks" */
+#define CURTIME(ticks) {                                        \
+  struct timeval tp;                                            \
+  struct timezone tzp;                                          \
+  gettimeofday(&tp,&tzp);                                       \
+  ticks = (double) tp.tv_sec * 1000000 + (double) tp.tv_usec;   \
+}
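The CURTIME macro above can be exercised on its own. This sketch passes NULL for the timezone argument (the patch passes an unused struct timezone; gettimeofday ignores it either way) and adds an illustrative elapsed_usec_demo helper that is not part of the patch:

```c
#include <sys/time.h>
#include <unistd.h>

/* Same pattern as the CURTIME macro in the patch: fold gettimeofday()
 * into a single microsecond count held in a double. */
#define CURTIME(ticks) {                                        \
  struct timeval tp;                                            \
  gettimeofday(&tp, NULL);                                      \
  ticks = (double) tp.tv_sec * 1000000 + (double) tp.tv_usec;   \
}

/* Measure roughly one millisecond of elapsed time in microseconds. */
double elapsed_usec_demo(void)
{
  double start, now;
  CURTIME(start)
  usleep(1000);                 /* sleep ~1 ms */
  CURTIME(now)
  return now - start;
}
```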
+
+
+/* Used inside the termination message sent by the smaller task to the larger task in disconnect */
+typedef struct {
+  long long tranid;
+  int       whichAM;
+}AM_struct2;
+
+
+extern transactionID_struct *_transactionID_list;
+
+
+void MPIDI_send_AM_to_remote_leader_on_disconnect(int taskid, long long comm_cntr, int whichAM)
+{
+   pami_send_immediate_t xferP;
+
+   int              rc, current_val;
+   AM_struct2       AM_data;
+   pami_endpoint_t  dest;
+
+   AM_data.tranid  = comm_cntr;
+   AM_data.whichAM = whichAM;
+
+   bzero(&xferP, sizeof(pami_send_immediate_t));
+   xferP.header.iov_base = (void*)&AM_data;
+   xferP.header.iov_len  = sizeof(AM_struct2);
+   xferP.dispatch = (size_t)MPIDI_Protocols_Dyntask_disconnect;
+
+   rc = PAMI_Endpoint_create(MPIDI_Client, taskid, 0, &dest);
+   if(rc != 0)
+     TRACE_ERR("PAMI_Endpoint_create failed\n");
+
+   TRACE_ERR("PAMI_Resume to taskid=%d\n", taskid);
+   PAMI_Resume(MPIDI_Context[0], &dest, 1);
+
+   xferP.dest = dest;
+
+   rc = PAMI_Send_immediate(MPIDI_Context[0], &xferP);
+}
+
+void MPIDI_Recvfrom_remote_world_disconnect(pami_context_t    context,
+                void            * cookie,
+                const void      * _msginfo,
+                size_t            msginfo_size,
+                const void      * sndbuf,
+                size_t            sndlen,
+                pami_endpoint_t   sender,
+                pami_recv_t     * recv)
+{
+  AM_struct2        *AM_data;
+  long long        tranid;
+  int              whichAM;
+
+  AM_data  = ((AM_struct2 *)_msginfo);
+  tranid   = AM_data->tranid;
+  whichAM  = AM_data->whichAM;
+  MPIDI_increment_AM_cntr_for_tranid(tranid, whichAM);
+
+  TRACE_ERR("MPIDI_Recvfrom_remote_world_disconnect: invoked for tranid = %lld, whichAM = %d \n",tranid, whichAM);
+
+  return;
+}
+
+
+/**
+ * Function to retrieve the active message counter for a particular transaction id.
+ * This function is used inside the disconnect routine.
+ * whichAM = FIRST_AM/SECOND_AM/LAST_AM
+ */
+int MPIDI_get_AM_cntr_for_tranid(long long tranid, int whichAM)
+{
+  transactionID_struct *tridtmp;
+
+  if(_transactionID_list == NULL)
+    TRACE_ERR("MPIDI_get_AM_cntr_for_tranid - _transactionID_list is NULL\n");
+
+  tridtmp = _transactionID_list;
+
+  while(tridtmp != NULL) {
+    if(tridtmp->tranid == tranid) {
+      return tridtmp->cntr_for_AM[whichAM];
+    }
+    tridtmp = tridtmp->next;
+  }
+
+  return -1;
+}
+
+
+/**
+ * Function used to guarantee the delivery of a LAPI active message at the
+ * destination. Called at the destination taskid and returns only when
+ * the expected LAPI active message is received. If the sequence number
+ * of the LAPI active message is LAST_AM, the function may also return
+ * when DISCONNECT_LAPI_XFER_TIMEOUT expires.
+ */
+void MPIDI_wait_for_AM(long long tranid, int expected_AM, int whichAM)
+{
+  double starttime, currtime, elapsetime;
+  int    rc, curr_AMcntr;
+
+  MPIU_THREAD_CS_EXIT(ALLFUNC,);
+  rc = PAMI_Context_advance(MPIDI_Context[0], (size_t)100);
+  MPIU_THREAD_CS_ENTER(ALLFUNC,);
+  if(whichAM == LAST_AM) {
+    CURTIME(starttime)
+    do {
+      CURTIME(currtime)
+      elapsetime = currtime - starttime;
+
+      MPIU_THREAD_CS_EXIT(ALLFUNC,);
+      rc = PAMI_Context_advance(MPIDI_Context[0], (size_t)100);
+      MPIU_THREAD_CS_ENTER(ALLFUNC,);
+      curr_AMcntr = MPIDI_get_AM_cntr_for_tranid(tranid, whichAM);
+      TRACE_ERR("_try_to_disconnect: Looping in timer for TranID %lld, whichAM %d expected_AM = %d, Current AM = %d\n",tranid,whichAM,expected_AM,curr_AMcntr);
+    }while(curr_AMcntr != expected_AM && elapsetime < DISCONNECT_LAPI_XFER_TIMEOUT);
+  }
+  else {
+    do {
+      MPIU_THREAD_CS_EXIT(ALLFUNC,);
+      rc = PAMI_Context_advance(MPIDI_Context[0], (size_t)100);
+      MPIU_THREAD_CS_ENTER(ALLFUNC,);
+      curr_AMcntr = MPIDI_get_AM_cntr_for_tranid(tranid, whichAM);
+      TRACE_ERR("_try_to_disconnect: Looping in timer for TranID %lld, whichAM %d expected_AM = %d, Current AM = %d\n",tranid,whichAM,expected_AM,curr_AMcntr);
+    }while(curr_AMcntr != expected_AM);
+  }
+}
+
+/* Function to swap two integers. Used inside _qsort_dyntask below. */
+static void _swap_dyntask(int t[],int i,int j)
+{
+   int  tmp;
+
+   tmp = t[i];
+   t[i] = t[j];
+   t[j] = tmp;
+}
+
+/* Recursive quicksort routine used only in this file */
+static void _qsort_dyntask(int t[],int left,int right)
+{
+   int i,last;
+
+   if(left >= right)  return;
+   last = left;
+   for(i=left+1;i<=right;i++)
+      if(t[i] < t[left])
+         _swap_dyntask(t,++last,i);
+   _swap_dyntask(t,left,last);
+   _qsort_dyntask(t,left,last-1);
+   _qsort_dyntask(t,last+1,right);
+}
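The pair of helpers above is a textbook recursive quicksort with the first element as pivot. A standalone copy (illustrative names swap_int/qsort_int, not part of the patch) shows the partition scheme:

```c
/* Swap t[i] and t[j]. */
static void swap_int(int t[], int i, int j)
{
  int tmp = t[i];
  t[i] = t[j];
  t[j] = tmp;
}

/* Recursive quicksort over t[left..right], same scheme as _qsort_dyntask:
 * t[left] is the pivot, smaller elements are swapped forward past `last`,
 * then the pivot is placed between the two halves and each half is sorted. */
static void qsort_int(int t[], int left, int right)
{
  int i, last;

  if (left >= right) return;
  last = left;
  for (i = left + 1; i <= right; i++)
    if (t[i] < t[left])
      swap_int(t, ++last, i);
  swap_int(t, left, last);
  qsort_int(t, left, last - 1);
  qsort_int(t, last + 1, right);
}
```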
+
+
+
 /*@
    MPID_Comm_disconnect - Disconnect a communicator
 
@@ -34,9 +204,21 @@ extern conn_info *_conn_info_list;
 @*/
 int MPID_Comm_disconnect(MPID_Comm *comm_ptr)
 {
-    int rc, i,ref_count,mpi_errno, probe_flag=0;
+    int rc, i,j, k, ref_count,mpi_errno=0, probe_flag=0;
+    pami_task_t *local_list;
     MPI_Status status;
+    int errflag = FALSE;
     MPIDI_PG_t *pg;
+    int total_leaders=0, gsize;
+    pami_task_t *leader_tids;
+    int expected_firstAM=0, expected_secondAM=0, expected_lastAM=0;
+    MPID_Comm *commworld_ptr;
+    MPID_VCR *glist;
+    MPID_Comm *lcomm;
+    int local_tasks=0, localtasks_in_remglist=0;
+    int jobIdSize=64;
+    char jobId[jobIdSize];
+    int MY_TASKID = PAMIX_Client_query(MPIDI_Client, PAMI_CLIENT_TASK_ID  ).value.intval;
 
     if(comm_ptr->mpid.world_ids != NULL) {
 	rc = MPID_Iprobe(comm_ptr->rank, MPI_ANY_TAG, comm_ptr, MPID_CONTEXT_INTER_PT2PT, &probe_flag, &status);
@@ -45,16 +227,189 @@ int MPID_Comm_disconnect(MPID_Comm *comm_ptr)
 	  exit(1);
         }
 
-        for(i=0; comm_ptr->mpid.world_ids[i] != -1; i++) {
-          ref_count = MPIDI_Decrement_ref_count(comm_ptr->mpid.world_ids[i]);
-          TRACE_ERR("ref_count=%d with world=%d comm_ptr=%x\n", ref_count, comm_ptr->mpid.world_ids[i], comm_ptr);
-          if(ref_count == -1)
-	    TRACE_ERR("something is wrong\n");
-        }
+	/* make commSubWorld */
+	{
+	  commworld_ptr = MPIR_Process.comm_world;
+	  
+	  glist = commworld_ptr->vcr;
+	  gsize = commworld_ptr->local_size;
+	  for(i=0;i<comm_ptr->local_size;i++) {
+	    for(j=0;j<gsize;j++) {
+	      if(comm_ptr->local_vcr[i]->taskid == glist[j]->taskid)
+		local_tasks++;
+	    }
+	  }
+
+	  /**
+	   * Tasks belonging to the same local world may also be part of
+	   * the GROUPREMLIST, so these tasks will have to be used in addition
+	   * to the tasks in GROUPLIST to construct lcomm
+	   **/
+	  if(comm_ptr->comm_kind == MPID_INTERCOMM) {
+	    for(i=0;i<comm_ptr->remote_size;i++) {
+	      for(j=0;j<gsize;j++) {
+		if(comm_ptr->vcr[i]->taskid == glist[j]->taskid) {
+		  local_tasks++;
+		  localtasks_in_remglist++;
+		}
+	      }
+	    }
+	  }
+	  k=0;
+	  local_list = MPIU_Malloc(local_tasks*sizeof(pami_task_t));
+
+	  for(i=0;i<comm_ptr->local_size;i++) {
+	    for(j=0;j<gsize;j++) {
+	      if(comm_ptr->local_vcr[i]->taskid == glist[j]->taskid)
+		local_list[k++] = glist[j]->taskid;
+	    }
+	  }
+	  if((comm_ptr->comm_kind == MPID_INTERCOMM) && localtasks_in_remglist) {
+	    for(i=0;i<comm_ptr->remote_size;i++) {
+	      for(j=0;j<gsize;j++) {
+		if(comm_ptr->vcr[i]->taskid == glist[j]->taskid)
+		  local_list[k++] = glist[j]->taskid;
+	      }
+	    }
+	    /* Sort the local_list when there are localtasks_in_remglist */
+	    _qsort_dyntask(local_list, 0, local_tasks-1);
+	  }
+	  
+	  mpi_errno = MPIR_Comm_create(&lcomm);
+	  if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("MPIR_Comm_create returned with mpi_errno=%d\n", mpi_errno);
+	  }
+
+	  /* fill in all the fields of lcomm. */
+	  if(localtasks_in_remglist==0) {
+	    lcomm->context_id     = MPID_CONTEXT_SET_FIELD(DYNAMIC_PROC, comm_ptr->recvcontext_id, 1);
+	    lcomm->recvcontext_id = lcomm->context_id;
+	  } else {
+	    lcomm->context_id     = MPID_CONTEXT_SET_FIELD(DYNAMIC_PROC, comm_ptr->recvcontext_id, 1);
+	    lcomm->recvcontext_id = MPID_CONTEXT_SET_FIELD(DYNAMIC_PROC, comm_ptr->context_id, 1);
+	  }
+	  TRACE_ERR("lcomm->context_id =%d\n", lcomm->context_id);
+
+	  /* sanity: the INVALID context ID value could potentially conflict with the
+	   * dynamic process space */
+	  MPIU_Assert(lcomm->context_id     != MPIR_INVALID_CONTEXT_ID);
+	  MPIU_Assert(lcomm->recvcontext_id != MPIR_INVALID_CONTEXT_ID);
+
+	  /* FIXME - we probably need a unique context_id. */
+
+	  /* Fill in new intercomm */
+	  lcomm->comm_kind    = MPID_INTRACOMM;
+	  lcomm->remote_size = lcomm->local_size = local_tasks;
+
+	  /* Set up VC reference table */
+	  mpi_errno = MPID_VCRT_Create(lcomm->remote_size, &lcomm->vcrt);
+	  if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("MPID_VCRT_Create returned with mpi_errno=%d", mpi_errno);
+	  }
+	  MPID_VCRT_Add_ref(lcomm->vcrt);
+	  mpi_errno = MPID_VCRT_Get_ptr(lcomm->vcrt, &lcomm->vcr);
+	  if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("MPID_VCRT_Get_ptr returned with mpi_errno=%d", mpi_errno);
+	  }
 
-        MPIU_Free(comm_ptr->mpid.world_ids);
+	  for(i=0; i<local_tasks; i++) {
+	    if(MY_TASKID == local_list[i]) lcomm->rank = i;
+	    lcomm->vcr[i]->taskid = local_list[i];
+	  }
+	}
+
+	TRACE_ERR("subcomm for disconnect is established local_tasks=%d calling MPIR_Barrier_intra\n", local_tasks);
+	mpi_errno = MPIR_Barrier_intra(lcomm, &errflag);
+	if (mpi_errno != MPI_SUCCESS) {
+	  TRACE_ERR("MPIR_Barrier_intra returned with mpi_errno=%d\n", mpi_errno);
+	}
+	TRACE_ERR("after barrier in disconnect\n");
+
+	if(MY_TASKID != comm_ptr->mpid.local_leader) {
+	  for(i=0; comm_ptr->mpid.world_ids[i] != -1; i++) {
+	    ref_count = MPIDI_Decrement_ref_count(comm_ptr->mpid.world_ids[i]);
+	    TRACE_ERR("ref_count=%d with world=%d comm_ptr=%x\n", ref_count, comm_ptr->mpid.world_ids[i], comm_ptr);
+	    if(ref_count == -1)
+	      TRACE_ERR("something is wrong\n");
+	  }
+	}
+
+	if(MY_TASKID == comm_ptr->mpid.local_leader) {
+	  PMI2_Job_GetId(jobId, jobIdSize);
+	  for(i=0;comm_ptr->mpid.world_ids[i]!=-1;i++)  {
+	    if(atoi(jobId) != comm_ptr->mpid.world_ids[i])
+	      total_leaders++;
+	  }
+	  TRACE_ERR("total_leaders=%d\n", total_leaders);
+	  leader_tids = MPIU_Malloc(total_leaders*sizeof(int));
+	  MPIDI_get_allremote_leaders(leader_tids, comm_ptr);
+	  
+	  { /* First Pair of Send / Recv -- All smaller task send to all larger tasks */
+	    for(i=0;i<total_leaders;i++) {
+	      MPID_assert(leader_tids[i] != -1);
+	      if(MY_TASKID < leader_tids[i]) {
+		TRACE_ERR("_try_to_disconnect: FIRST: comm_ptr->mpid.world_intercomm_cntr %lld, toTaskid %d\n",comm_ptr->mpid.world_intercomm_cntr,leader_tids[i]);
+		expected_firstAM++;
+		MPIDI_send_AM_to_remote_leader_on_disconnect(leader_tids[i], comm_ptr->mpid.world_intercomm_cntr, FIRST_AM);
+	      }
+	      else {
+		expected_secondAM++;
+	      }
+	    }
+	    if(expected_secondAM) {
+	      MPIDI_wait_for_AM(comm_ptr->mpid.world_intercomm_cntr, expected_secondAM, FIRST_AM);
+	    }
+	  }
+	  
+	  { /* Second Pair of Send / Recv -- All larger tasks send to all smaller tasks */
+	    for(i=0;i<total_leaders;i++) {
+	      MPID_assert(leader_tids[i] != -1);
+	      if(MY_TASKID > leader_tids[i]) {
+		TRACE_ERR("_try_to_disconnect: SECOND: comm_ptr->mpid.world_intercomm_cntr %lld, toTaskid %d\n",comm_ptr->mpid.world_intercomm_cntr,leader_tids[i]);
+		MPIDI_send_AM_to_remote_leader_on_disconnect(leader_tids[i], comm_ptr->mpid.world_intercomm_cntr, SECOND_AM);
+	      }
+	    }
+	    if(expected_firstAM) {
+	      MPIDI_wait_for_AM(comm_ptr->mpid.world_intercomm_cntr, expected_firstAM, SECOND_AM);
+	    }
+	  }
+
+	  for(i=0; comm_ptr->mpid.world_ids[i] != -1; i++) {
+	    ref_count = MPIDI_Decrement_ref_count(comm_ptr->mpid.world_ids[i]);
+	    TRACE_ERR("ref_count=%d with world=%d comm_ptr=%x\n", ref_count, comm_ptr->mpid.world_ids[i], comm_ptr);
+	    if(ref_count == -1)
+	      TRACE_ERR("something is wrong\n");
+	  }
+
+	  for(i=0;i<total_leaders;i++) {
+	    MPID_assert(leader_tids[i] != -1);
+	    if(MY_TASKID < leader_tids[i]) {
+	      TRACE_ERR("_try_to_disconnect: LAST: comm_ptr->mpid.world_intercomm_cntr %lld, toTaskid %d\n",comm_ptr->mpid.world_intercomm_cntr,leader_tids[i]);
+	      MPIDI_send_AM_to_remote_leader_on_disconnect(leader_tids[i], comm_ptr->mpid.world_intercomm_cntr, LAST_AM);
+	    }
+	    else {
+	      expected_lastAM++;
+	    }
+	  }
+	  if(expected_lastAM) {
+	    MPIDI_wait_for_AM(comm_ptr->mpid.world_intercomm_cntr, expected_lastAM,
+			      LAST_AM);
+	  }
+	  MPIU_Free(leader_tids);
+	}
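The leader block above runs a three-phase handshake: leaders with smaller task ids send FIRST_AM and later LAST_AM to larger ones, while larger leaders answer with SECOND_AM; each side waits until its counters reach the expected totals. The bookkeeping can be summarized as counting, per phase, how many messages a leader should expect. count_expected_AMs below is a hypothetical helper written for illustration, not part of this patch:

```c
#define FIRST_AM  0
#define SECOND_AM 1
#define LAST_AM   2

/* Given my task id and the remote leaders' task ids, count how many
 * active messages of each phase I should wait for in the disconnect
 * handshake: smaller leaders send me FIRST_AM and LAST_AM; larger
 * leaders send me SECOND_AM. */
void count_expected_AMs(int my_taskid, const int *leaders, int n,
                        int expected[3])
{
  expected[FIRST_AM] = expected[SECOND_AM] = expected[LAST_AM] = 0;
  for (int i = 0; i < n; i++) {
    if (leaders[i] < my_taskid) {
      expected[FIRST_AM]++;        /* smaller leader initiates */
      expected[LAST_AM]++;         /* and sends the final AM */
    } else {
      expected[SECOND_AM]++;       /* larger leader replies */
    }
  }
}
```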
+
+	TRACE_ERR("_try_to_disconnect: Going inside final barrier for tranid %lld\n",comm_ptr->mpid.world_intercomm_cntr);
+	mpi_errno = MPIR_Barrier_intra(lcomm, &errflag);
+	if (mpi_errno != MPI_SUCCESS) {
+	  TRACE_ERR("MPIR_Barrier_intra returned with mpi_errno=%d\n", mpi_errno);
+	}
+        mpi_errno = MPIR_Comm_release(lcomm,0);
+        if (mpi_errno) TRACE_ERR("MPIR_Comm_release returned with mpi_errno=%d\n", mpi_errno);
+
+	MPIDI_free_tranid_node(comm_ptr->mpid.world_intercomm_cntr);
         mpi_errno = MPIR_Comm_release(comm_ptr,1);
         if (mpi_errno) TRACE_ERR("MPIR_Comm_release returned with mpi_errno=%d\n", mpi_errno);
+	MPIU_Free(local_list);
     }
     return mpi_errno;
 }
@@ -75,4 +430,81 @@ int MPIDI_Decrement_ref_count(int wid) {
   }
   return ref_count;
 }
+
+void MPIDI_get_allremote_leaders(int *tid_arr, MPID_Comm *comm_ptr)
+{
+  conn_info  *tmp_node;
+  int        i,j,k,arr_len,gsize, found=0;
+  int        leader1=-1, leader2=-1;
+  MPID_VCR   *glist;
+  
+  for(i=0;comm_ptr->mpid.world_ids[i] != -1;i++)
+  {
+    TRACE_ERR("i=%d comm_ptr->mpid.world_ids[i]=%d\n", i, comm_ptr->mpid.world_ids[i]);
+    tmp_node = _conn_info_list;
+    found=0;
+    if(tmp_node==NULL) {TRACE_ERR("_conn_info_list is NULL\n");}
+    while(tmp_node != NULL) {
+      if(tmp_node->rem_world_id == comm_ptr->mpid.world_ids[i]) {
+        if(comm_ptr->comm_kind == MPID_INTRACOMM) {
+          glist = comm_ptr->local_vcr;
+          gsize = comm_ptr->local_size;
+        }
+        else {
+          glist = comm_ptr->vcr;
+          gsize = comm_ptr->remote_size;
+        }
+        for(j=0;j<gsize;j++) {
+          for(k=0;tmp_node->rem_taskids[k]!=-1;k++) {
+            TRACE_ERR("j=%d k=%d glist[j]->taskid=%d tmp_node->rem_taskids[k]=%d\n", j, k,glist[j]->taskid, tmp_node->rem_taskids[k]);
+            if(glist[j]->taskid == tmp_node->rem_taskids[k]) {
+              leader1 = glist[j]->taskid;
+              found = 1;
+              break;
+            }
+          }
+          if(found) break;
+        }
+        /*
+	 * There may be a case where my local_comm's GROUPLIST contains tasks
+	 * from remote world-x and GROUPREMLIST contains the other remaining tasks of world-x.
+	 * If the smallest task of world-x is in my GROUPLIST then the above iteration
+	 * will give the leader as the smallest task from world-x in my GROUPREMLIST.
+	 * But this will not be the correct leader_taskid. I should find the smallest task
+	 * of world-x in my GROUPLIST and then see which of the two leaders is the
+	 * smaller one. The smaller one is the one in which I am interested.
+	 **/
+        if(comm_ptr->comm_kind == MPID_INTERCOMM) {
+          found=0;
+          glist = comm_ptr->local_vcr;
+          gsize = comm_ptr->local_size;
+          for(j=0;j<gsize;j++) {
+            for(k=0;tmp_node->rem_taskids[k]!=-1;k++) {
+            TRACE_ERR("j=%d k=%d glist[j]->taskid=%d tmp_node->rem_taskids[k]=%d\n", j, k, glist[j]->taskid, tmp_node->rem_taskids[k]);
+              if(glist[j]->taskid == tmp_node->rem_taskids[k]) {
+                leader2 = glist[j]->taskid;
+                found = 1;
+                break;
+              }
+            }
+            if(found) break;
+          }
+        }
+        if(found) {
+          break;
+        }
+      } else  {TRACE_ERR("world id is different tmp_node->rem_world_id =%d comm_ptr=%x comm_ptr->mpid.world_ids[i]=%d\n", tmp_node->rem_world_id, comm_ptr, comm_ptr->mpid.world_ids[i]);}
+      tmp_node = tmp_node->next;
+    }
+
+    TRACE_ERR("comm_ptr=%x leader1=%d leader2=%d\n", comm_ptr, leader1, leader2);
+    if(leader1 == -1)
+      *(tid_arr+i) = leader2;
+    else if(leader2 == -1)
+      *(tid_arr+i) = leader1;
+    else
+      *(tid_arr+i) = leader1 < leader2 ? leader1 : leader2;
+  }
+}
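The tie-breaking at the end of MPIDI_get_allremote_leaders, where -1 marks a missing candidate on one side, can be isolated as a small helper (pick_leader is an illustrative name, not part of the patch):

```c
/* Pick the smaller of two candidate leader task ids, where -1 means
 * "no candidate found on that side" -- the same selection used at the
 * end of MPIDI_get_allremote_leaders. */
int pick_leader(int leader1, int leader2)
{
  if (leader1 == -1) return leader2;
  if (leader2 == -1) return leader1;
  return leader1 < leader2 ? leader1 : leader2;
}
```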
+
 #endif
diff --git a/src/mpid/pamid/src/dyntask/mpidi_port.c b/src/mpid/pamid/src/dyntask/mpidi_port.c
index f2445a8..dffc327 100644
--- a/src/mpid/pamid/src/dyntask/mpidi_port.c
+++ b/src/mpid/pamid/src/dyntask/mpidi_port.c
@@ -20,10 +20,13 @@
 
 #ifdef DYNAMIC_TASKING
 #define MAX_HOST_DESCRIPTION_LEN 256
+#define WORLDINTCOMMCNTR _global_world_intercomm_cntr
 #ifdef USE_PMI2_API
 #define MPID_MAX_JOBID_LEN 256
+#define TOTAL_AM 3
 #endif
 
+transactionID_struct *_transactionID_list = NULL;
 
 typedef struct {
   MPID_VCR vcr;
@@ -32,6 +35,7 @@ typedef struct {
 
 conn_info  *_conn_info_list = NULL;
 extern int mpidi_dynamic_tasking;
+long long _global_world_intercomm_cntr;
 
 typedef struct MPIDI_Acceptq
 {
@@ -46,6 +50,7 @@ static int maxAcceptQueueSize = 0;
 static int AcceptQueueSize    = 0;
 
 pthread_mutex_t rem_connlist_mutex = PTHREAD_MUTEX_INITIALIZER;
+extern struct transactionID;
 
 /* FIXME: If dynamic processes are not supported, this file will contain
    no code and some compilers may warn about an "empty translation unit" */
@@ -444,6 +449,42 @@ fn_fail:
 }
 
 
+/**
+ * Function to add a new transaction id to the transaction id list. This function
+ * gets called only when a new connection is made with remote tasks.
+ */
+void MPIDI_add_new_tranid(long long tranid)
+{
+  int i;
+  transactionID_struct *tridtmp=NULL;
+
+  MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+  if(_transactionID_list == NULL) {
+    _transactionID_list = (transactionID_struct*) MPIU_Malloc(sizeof(transactionID_struct));
+    _transactionID_list->cntr_for_AM = MPIU_Malloc(TOTAL_AM*sizeof(int));
+    _transactionID_list->tranid = tranid;
+    for(i=0;i<TOTAL_AM;i++)
+      _transactionID_list->cntr_for_AM[i] = 0;
+    _transactionID_list->next     = NULL;
+    MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+    return;
+  }
+
+  tridtmp = _transactionID_list;
+  while(tridtmp->next != NULL)
+    tridtmp = tridtmp->next;
+
+  tridtmp->next = (transactionID_struct*) MPIU_Malloc(sizeof(transactionID_struct));
+  tridtmp = tridtmp->next;
+  tridtmp->tranid  = tranid;
+  tridtmp->cntr_for_AM = MPIU_Malloc(TOTAL_AM*sizeof(int));
+  for(i=0;i<TOTAL_AM;i++)
+    tridtmp->cntr_for_AM[i] = 0;
+  tridtmp->next    = NULL;
+  MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+}
+
+
 /* ------------------------------------------------------------------------- */
 /*
    MPIDI_Comm_connect()
@@ -471,7 +512,7 @@ int MPIDI_Comm_connect(const char *port_name, MPID_Info *info, int root,
     MPIDI_PG_t **remote_pg = NULL;
     MPIR_Context_id_t recvcontext_id = MPIR_INVALID_CONTEXT_ID;
     int errflag = FALSE;
-    MPIU_CHKLMEM_DECL(3);
+    long long comm_cntr, lcomm_cntr;
 
     /* Get the context ID here because we need to send it to the remote side */
     mpi_errno = MPIR_Get_contextid( comm_ptr, &recvcontext_id );
@@ -482,6 +523,10 @@ int MPIDI_Comm_connect(const char *port_name, MPID_Info *info, int root,
     local_comm_size = comm_ptr->local_size;
     TRACE_ERR("In MPIDI_Comm_connect - port_name=%s rank=%d root=%d\n", port_name, rank, root);
 
+    WORLDINTCOMMCNTR += 1;
+    comm_cntr = WORLDINTCOMMCNTR;
+    lcomm_cntr = WORLDINTCOMMCNTR;
+
     if (rank == root)
     {
 	/* Establish a communicator to communicate with the root on the
@@ -525,12 +570,26 @@ int MPIDI_Comm_connect(const char *port_name, MPID_Info *info, int root,
                on the send if the port name is invalid */
 	    TRACE_ERR("MPIC_Sendrecv returned with mpi_errno=%d\n", mpi_errno);
 	}
+
+        mpi_errno = MPIC_Sendrecv_replace(&comm_cntr, 1, MPI_INT, 0,
+                                  sendtag++, 0, recvtag++, tmp_comm->handle,
+                                  MPI_STATUS_IGNORE);
+        if (mpi_errno != MPI_SUCCESS) {
+            /* this is a no_port error because we may fail to connect
+               on the send if the port name is invalid */
+            TRACE_ERR("MPIC_Sendrecv returned with mpi_errno=%d\n", mpi_errno);
+        }
     }
 
     /* broadcast the received info to local processes */
     mpi_errno = MPIR_Bcast_intra(recv_ints, 3, MPI_INT, root, comm_ptr, &errflag);
     if (mpi_errno) TRACE_ERR("MPIR_Bcast_intra returned with mpi_errno=%d\n", mpi_errno);
 
+    mpi_errno = MPIR_Bcast_intra(&comm_cntr, 1, MPI_LONG_LONG_INT, root, comm_ptr, &errflag);
+    if (mpi_errno) TRACE_ERR("MPIR_Bcast_intra returned with mpi_errno=%d\n", mpi_errno);
+
+    if(lcomm_cntr > comm_cntr)  comm_cntr = lcomm_cntr;
+
     /* check if root was unable to connect to the port */
 
     n_remote_pgs     = recv_ints[0];
@@ -603,6 +662,9 @@ int MPIDI_Comm_connect(const char *port_name, MPID_Info *info, int root,
 
     mpi_errno = MPIDI_SetupNewIntercomm( comm_ptr, remote_comm_size,
 				   remote_translation, n_remote_pgs, remote_pg, *newcomm );
+    (*newcomm)->mpid.world_intercomm_cntr   = comm_cntr;
+    MPIDI_add_new_tranid(comm_cntr);
+
 /*    MPIDI_Parse_connection_info(n_remote_pgs, remote_pg); */
     if (mpi_errno != MPI_SUCCESS) {
 	TRACE_ERR("MPIDI_SetupNewIntercomm returned with mpi_errno=%d\n", mpi_errno);
@@ -928,6 +990,99 @@ void MPIDI_Parse_connection_info(int n_remote_pgs, MPIDI_PG_t **remote_pg) {
 }
 
 
+
+/**
+ * Function to increment the active message counter for a particular transaction id.
+ * This function is used inside the disconnect routine.
+ * whichAM = FIRST_AM/SECOND_AM/LAST_AM
+ */
+void MPIDI_increment_AM_cntr_for_tranid(long long tranid, int whichAM)
+{
+  transactionID_struct *tridtmp;
+
+  /* No error is thrown here if the tranid is not found. This covers the case where a
+   * timeout happened in MPI_Comm_disconnect, the tasks have already freed the tranid
+   * list node, and the active message arrives afterwards.
+   */
+
+  tridtmp = _transactionID_list;
+
+  MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+  while(tridtmp != NULL) {
+    if(tridtmp->tranid == tranid) {
+      tridtmp->cntr_for_AM[whichAM]++;
+      TRACE_ERR("MPIDI_increment_AM_cntr_for_tranid - tridtmp->cntr_for_AM[%d]=%d\n",
+              whichAM, tridtmp->cntr_for_AM[whichAM]);
+      break;
+    }
+    tridtmp = tridtmp->next;
+  }
+  MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+}
+
+/**
+ * Function to free a particular transaction id node from the transaction id list.
+ * This function is called inside the disconnect routine once the remote connection is
+ * terminated.
+ */
+void MPIDI_free_tranid_node(long long tranid)
+{
+  transactionID_struct *tridtmp, *tridtmp2;
+
+  MPID_assert(_transactionID_list != NULL);
+
+  tridtmp = tridtmp2 = _transactionID_list;
+
+  MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+  while(tridtmp != NULL) {
+    if(tridtmp->tranid == tranid) {
+      /* If there is only one node */
+      if(_transactionID_list->next == NULL) {
+        MPIU_Free(_transactionID_list->cntr_for_AM);
+        MPIU_Free(_transactionID_list);
+        _transactionID_list = NULL;
+        MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+        return;
+      }
+      /* If more than one node and if this is the first node of the list */
+      if(tridtmp == _transactionID_list && tridtmp->next != NULL) {
+        _transactionID_list = _transactionID_list->next;
+        MPIU_Free(tridtmp->cntr_for_AM);
+        MPIU_Free(tridtmp);
+        MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+        return;
+      }
+      /* For a node at any other position in the list */
+      tridtmp2->next = tridtmp->next;
+      MPIU_Free(tridtmp->cntr_for_AM);
+      MPIU_Free(tridtmp);
+      MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+      return;
+    }
+    tridtmp2 = tridtmp;
+    tridtmp = tridtmp->next;
+  }
+  MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+}
+
+/** This routine is used inside finalize to free all the nodes
+ * when disconnect has not been called
+ */
+void MPIDI_free_all_tranid_node()
+{
+  transactionID_struct *tridtmp;
+
+  MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+  while(_transactionID_list != NULL) {
+    tridtmp = _transactionID_list;
+    _transactionID_list = _transactionID_list->next;
+    MPIU_Free(tridtmp->cntr_for_AM);
+    MPIU_Free(tridtmp);
+  }
+  MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+}
+
 /* Sends the process group information to the peer and frees the
    pg_list */
 static int MPIDI_SendPGtoPeerAndFree( struct MPID_Comm *tmp_comm, int *sendtag_p,
@@ -998,6 +1153,8 @@ int MPIDI_Comm_accept(const char *port_name, MPID_Info *info, int root,
     MPIDI_PG_t **remote_pg = NULL;
     int errflag = FALSE;
     char send_char[16], recv_char[16], remote_taskids[16];
+    long long comm_cntr, lcomm_cntr;
+    int leader_taskid;
 
     /* Create the new intercommunicator here. We need to send the
        context id to the other side. */
@@ -1014,6 +1171,10 @@ int MPIDI_Comm_accept(const char *port_name, MPID_Info *info, int root,
     rank = comm_ptr->rank;
     local_comm_size = comm_ptr->local_size;
 
+    WORLDINTCOMMCNTR += 1;
+    comm_cntr = WORLDINTCOMMCNTR;
+    lcomm_cntr = WORLDINTCOMMCNTR;
+
     if (rank == root)
     {
 	/* Establish a communicator to communicate with the root on the
@@ -1065,6 +1226,14 @@ int MPIDI_Comm_accept(const char *port_name, MPID_Info *info, int root,
 	    TRACE_ERR("MPIC_Sendrecv returned with mpi_errno=%d\n", mpi_errno);
 	}
 #endif
+        mpi_errno = MPIC_Sendrecv_replace(&comm_cntr, 1, MPI_INT, 0,
+                                  sendtag++, 0, recvtag++, tmp_comm->handle,
+                                  MPI_STATUS_IGNORE);
+        if (mpi_errno != MPI_SUCCESS) {
+            /* this is a no_port error because we may fail to connect
+               on the send if the port name is invalid */
+            TRACE_ERR("MPIC_Sendrecv returned with mpi_errno=%d\n", mpi_errno);
+        }
 
     }
 
@@ -1073,6 +1242,10 @@ int MPIDI_Comm_accept(const char *port_name, MPID_Info *info, int root,
     mpi_errno = MPIR_Bcast_intra(recv_ints, 3, MPI_INT, root, comm_ptr, &errflag);
     if (mpi_errno) TRACE_ERR("MPIR_Bcast_intra returned with mpi_errno=%d\n", mpi_errno);
 
+    mpi_errno = MPIR_Bcast_intra(&comm_cntr, 1, MPI_LONG_LONG_INT, root, comm_ptr, &errflag);
+    if (mpi_errno) TRACE_ERR("MPIR_Bcast_intra returned with mpi_errno=%d\n", mpi_errno);
+
+    if(lcomm_cntr > comm_cntr)  comm_cntr = lcomm_cntr;
     n_remote_pgs     = recv_ints[0];
     remote_comm_size = recv_ints[1];
     context_id       = recv_ints[2];
@@ -1147,6 +1320,9 @@ int MPIDI_Comm_accept(const char *port_name, MPID_Info *info, int root,
 
     mpi_errno = MPIDI_SetupNewIntercomm( comm_ptr, remote_comm_size,
 				   remote_translation, n_remote_pgs, remote_pg, intercomm );
+    intercomm->mpid.world_intercomm_cntr   = comm_cntr;
+    MPIDI_add_new_tranid(comm_cntr);
+
     if (mpi_errno != MPI_SUCCESS) {
 	TRACE_ERR("MPIDI_SetupNewIntercomm returned with mpi_errno=%d\n", mpi_errno);
     }
@@ -1213,13 +1389,21 @@ static int MPIDI_SetupNewIntercomm( struct MPID_Comm *comm_ptr, int remote_comm_
 			      int n_remote_pgs, MPIDI_PG_t **remote_pg,
 			      struct MPID_Comm *intercomm )
 {
-    int mpi_errno = MPI_SUCCESS, i, j, index;
+    int mpi_errno = MPI_SUCCESS, i, j, index=0;
     int errflag = FALSE;
     int total_rem_world_cnts, p=0;
     char *world_tasks, *cp1;
     conn_info *tmp_node;
     int conn_world_ids[64];
+    MPID_VCR *worldlist;
+    int worldsize;
     pami_endpoint_t dest;
+    MPID_Comm *comm;
+    pami_task_t leader1=-1, leader2=-1, leader_taskid=-1;
+    long long comm_cntr=0, lcomm_cntr=-1;
+    int jobIdSize=64;
+    char jobId[jobIdSize];
+
     TRACE_ERR("MPIDI_SetupNewIntercomm - remote_comm_size=%d\n", remote_comm_size);
     /* FIXME: How much of this could/should be common with the
        upper level (src/mpi/comm/ *.c) code? For best robustness,
@@ -1302,15 +1486,56 @@ static int MPIDI_SetupNewIntercomm( struct MPID_Comm *comm_ptr, int remote_comm_
       }
    }
    else {
+    index=0;
     intercomm->mpid.world_ids = MPIU_Malloc((n_remote_pgs+1)*sizeof(int));
+    PMI2_Job_GetId(jobId, jobIdSize);
     for(i=0;i<n_remote_pgs;i++) {
-      intercomm->mpid.world_ids[i] = atoi((char *)remote_pg[i]->id);
+      if(atoi(jobId) != atoi((char *)remote_pg[i]->id) )
+	intercomm->mpid.world_ids[index++] = atoi((char *)remote_pg[i]->id);
     }
-    intercomm->mpid.world_ids[i] = -1;
+    intercomm->mpid.world_ids[index++] = -1;
    }
    for(i=0; intercomm->mpid.world_ids[i] != -1; i++)
      TRACE_ERR("intercomm=%x intercomm->mpid.world_ids[%d]=%d\n", intercomm, i, intercomm->mpid.world_ids[i]);
 
+   leader_taskid = comm_ptr->vcr[0]->taskid;
+
+   MPID_Comm *comm_world_ptr = MPIR_Process.comm_world;
+   worldlist = comm_world_ptr->vcr;
+   worldsize = comm_world_ptr->local_size;
+   comm = intercomm;
+   for(i=0;i<intercomm->local_size;i++)
+     {
+       for(j=0;j<comm_world_ptr->local_size;j++)
+	 {
+	   if(intercomm->local_vcr[i]->taskid == comm_world_ptr->vcr[j]->taskid) {
+	     leader1 = comm_world_ptr->vcr[j]->taskid;
+	     break;
+	   }
+	 }
+       if(leader1 != -1)
+	 break;
+     }
+   for(i=0;i<intercomm->remote_size;i++)
+     {
+       for(j=0;j<comm_world_ptr->local_size;j++)
+	 {
+	   if(intercomm->vcr[i]->taskid == comm_world_ptr->vcr[j]->taskid) {
+	     leader2 = comm_world_ptr->vcr[j]->taskid;
+	     break;
+	   }
+	 }
+       if(leader2 != -1)
+	 break;
+     }
+   
+   if(leader1 == -1)
+     leader_taskid = leader2;
+   else if(leader2 == -1)
+     leader_taskid = leader1;
+   else
+     leader_taskid = leader1 < leader2 ? leader1 : leader2;
+   intercomm->mpid.local_leader = leader_taskid;
 
    mpi_errno = MPIR_Comm_commit(intercomm);
    if (mpi_errno) TRACE_ERR("MPIR_Comm_commit returned with mpi_errno=%d\n", mpi_errno);
diff --git a/src/mpid/pamid/src/mpid_finalize.c b/src/mpid/pamid/src/mpid_finalize.c
index 7becd22..2ed8633 100644
--- a/src/mpid/pamid/src/mpid_finalize.c
+++ b/src/mpid/pamid/src/mpid_finalize.c
@@ -81,6 +81,7 @@ int MPID_Finalize()
   }
   if(_conn_info_list) 
     MPIU_Free(_conn_info_list);
+  MPIDI_free_all_tranid_node();
 #endif
 
 
diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index f4412c0..b33f07f 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -147,6 +147,7 @@ static struct
   struct protocol_t RVZ_zerobyte;
 #ifdef DYNAMIC_TASKING
   struct protocol_t Dyntask;
+  struct protocol_t Dyntask_disconnect;
 #endif
 } proto_list = {
   .Short = {
@@ -257,6 +258,17 @@ static struct
     },
     .immediate_min     = sizeof(MPIDI_MsgInfo),
   },
+  .Dyntask_disconnect = {
+    .func = MPIDI_Recvfrom_remote_world_disconnect,
+    .dispatch = MPIDI_Protocols_Dyntask_disconnect,
+    .options = {
+      .consistency     = USE_PAMI_CONSISTENCY,
+      .long_header     = PAMI_HINT_DISABLE,
+      .recv_immediate  = PAMI_HINT_ENABLE,
+      .use_rdma        = PAMI_HINT_DISABLE,
+    },
+    .immediate_min     = sizeof(MPIDI_MsgInfo),
+  },
 #endif
 };
 
@@ -580,6 +592,7 @@ MPIDI_PAMI_dispath_init()
   MPIDI_PAMI_dispath_set(MPIDI_Protocols_RVZ_zerobyte, &proto_list.RVZ_zerobyte, NULL);
 #ifdef DYNAMIC_TASKING
   MPIDI_PAMI_dispath_set(MPIDI_Protocols_Dyntask,   &proto_list.Dyntask,  NULL);
+  MPIDI_PAMI_dispath_set(MPIDI_Protocols_Dyntask_disconnect,   &proto_list.Dyntask_disconnect,  NULL);
 #endif
 
   /*

http://git.mpich.org/mpich.git/commitdiff/4dafcf4b091fcd915f73faac1aece1f6d84f6444

commit 4dafcf4b091fcd915f73faac1aece1f6d84f6444
Author: sssharka <sssharka at us.ibm.com>
Date:   Tue Feb 19 19:28:29 2013 -0500

    MPI_Scatterv coredump in PAMI_Type_transform_data with FCA
    
    Adding support for PAMI_IN_PLACE in the MPIDO_<Collective> path
    
    (ibm) D188759
    (ibm) fab60e6078c5048b4f39e7e6156acdbe575a0996
    
    Signed-off-by: Bob Cernohous <bobc at us.ibm.com>
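[Editorial note] The diffs below all replace hand-computed in-place aliasing (e.g. `sbuf = (char *)recvbuf + recv_size*rank`) with PAMI's own `PAMI_IN_PLACE` sentinel, letting the library do the aliasing. A minimal stand-alone sketch of that sentinel-forwarding pattern; every name here is an illustrative stand-in, not the real MPI/PAMI symbol:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the MPI_IN_PLACE / PAMI_IN_PLACE sentinels:
 * each is a unique address that can never equal a user buffer. */
static int mpi_in_place_tag;
static int pami_in_place_tag;
#define FAKE_MPI_IN_PLACE  ((void *)&mpi_in_place_tag)
#define FAKE_PAMI_IN_PLACE ((void *)&pami_in_place_tag)

/* Before the fix the adaptor translated MPI_IN_PLACE into an offset
 * inside recvbuf itself; after the fix it simply forwards the lower
 * library's sentinel and lets that library handle the aliasing. */
static void *resolve_sendbuf(const void *sendbuf)
{
    if (sendbuf == FAKE_MPI_IN_PLACE)
        return FAKE_PAMI_IN_PLACE;   /* forward the sentinel unchanged */
    return (void *)sendbuf;          /* normal user-supplied buffer */
}
```

The same one-line substitution recurs in every `MPIDO_<Collective>` hunk in this commit.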

diff --git a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
index a280464..c455d38 100644
--- a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
+++ b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
@@ -309,7 +309,7 @@ MPIDO_Allgather(const void *sendbuf,
    send_size = recv_size;
    rbuf = (char *)recvbuf+recv_true_lb;
 
-   sbuf = (char *)recvbuf+recv_size*rank;
+   sbuf = PAMI_IN_PLACE;
    if(sendbuf != MPI_IN_PLACE)
    {
      if(unlikely(verbose))
diff --git a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
index f97ce89..a4ef524 100644
--- a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
+++ b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
@@ -331,11 +331,7 @@ MPIDO_Allgatherv(const void *sendbuf,
 
    if(sendbuf == MPI_IN_PLACE)
    {
-     if(unlikely(verbose))
-       fprintf(stderr,"allgatherv MPI_IN_PLACE buffering\n");
-     sbuf = (char *)recvbuf+displs[rank]*recv_size;
-     send_true_lb = recv_true_lb;
-     stype = rtype;
+     sbuf = PAMI_IN_PLACE;
      scount = recvcounts[rank];
      send_size = recv_size * scount; 
    }
diff --git a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
index 32292c1..f5c2ecc 100644
--- a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
+++ b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
@@ -102,7 +102,7 @@ int MPIDO_Allreduce(const void *sendbuf,
    {
      if(unlikely(verbose))
          fprintf(stderr,"allreduce MPI_IN_PLACE buffering\n");
-      sbuf = recvbuf;
+      sbuf = PAMI_IN_PLACE;
    }
 
    allred.cb_done = cb_allreduce;
diff --git a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
index a4012eb..aa1543e 100644
--- a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
+++ b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
@@ -59,8 +59,6 @@ int MPIDO_Alltoall(const void *sendbuf,
    const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
    const int selected_type = mpid->user_selected_type[PAMI_XFER_ALLTOALL];
 
-   if(sendbuf == MPI_IN_PLACE) 
-     pamidt = 0; /* Disable until ticket #632 is fixed */
    if(sendbuf != MPI_IN_PLACE)
    {
       MPIDI_Datatype_get_info(1, sendtype, snd_contig, sndlen, sdt, sdt_true_lb);
@@ -75,7 +73,7 @@ int MPIDO_Alltoall(const void *sendbuf,
 
 
    /* Is it a built in type? If not, send to MPICH */
-   if(MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS)
+   if(sendbuf != MPI_IN_PLACE && (MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS))
       pamidt = 0;
    if(MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
       pamidt = 0;
@@ -121,7 +119,7 @@ int MPIDO_Alltoall(const void *sendbuf,
          fprintf(stderr,"alltoall MPI_IN_PLACE buffering\n");
       alltoall.cmd.xfer_alltoall.stype = rtype;
       alltoall.cmd.xfer_alltoall.stypecount = recvcount;
-      alltoall.cmd.xfer_alltoall.sndbuf = (char *)recvbuf + rdt_true_lb;
+      alltoall.cmd.xfer_alltoall.sndbuf = PAMI_IN_PLACE;
    }
    else
    {
diff --git a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
index b7ccb0d..e128de8 100644
--- a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
+++ b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
@@ -62,13 +62,14 @@ int MPIDO_Alltoallv(const void *sendbuf,
    const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
    const int selected_type = mpid->user_selected_type[PAMI_XFER_ALLTOALLV_INT];
 
-   if(sendbuf == MPI_IN_PLACE) 
-     pamidt = 0; /* Disable until ticket #632 is fixed */
-   if(MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS)
+   if((sendbuf != MPI_IN_PLACE) && (MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS))
       pamidt = 0;
    if(MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
       pamidt = 0;
 
+   MPIDI_Datatype_get_info(1, recvtype, rcv_contig, rcvtypelen, rdt, rdt_true_lb);
+   if(!rcv_contig) pamidt = 0;
+
    if((selected_type == MPID_COLL_USE_MPICH) ||
        pamidt == 0)
    {
@@ -79,8 +80,6 @@ int MPIDO_Alltoallv(const void *sendbuf,
                             comm_ptr, mpierrno);
    }
 
-   MPIDI_Datatype_get_info(1, recvtype, rcv_contig, rcvtypelen, rdt, rdt_true_lb);
-
    pami_xfer_t alltoallv;
    pami_algorithm_t my_alltoallv;
    const pami_metadata_t *my_alltoallv_md;
@@ -114,7 +113,7 @@ int MPIDO_Alltoallv(const void *sendbuf,
       alltoallv.cmd.xfer_alltoallv_int.stype = rtype;
       alltoallv.cmd.xfer_alltoallv_int.sdispls = (int *) recvdispls;
       alltoallv.cmd.xfer_alltoallv_int.stypecounts = (int *) recvcounts;
-      alltoallv.cmd.xfer_alltoallv_int.sndbuf = (char *)recvbuf+rdt_true_lb;
+      alltoallv.cmd.xfer_alltoallv_int.sndbuf = PAMI_IN_PLACE;
    }
    else
    {
diff --git a/src/mpid/pamid/src/coll/gather/mpido_gather.c b/src/mpid/pamid/src/coll/gather/mpido_gather.c
index 806eacd..139b5d9 100644
--- a/src/mpid/pamid/src/coll/gather/mpido_gather.c
+++ b/src/mpid/pamid/src/coll/gather/mpido_gather.c
@@ -146,7 +146,7 @@ int MPIDO_Gather(const void *sendbuf,
    const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
    const int selected_type = mpid->user_selected_type[PAMI_XFER_GATHER];
 
-  if ((sendbuf == MPI_IN_PLACE) && sendtype != MPI_DATATYPE_NULL && sendcount >= 0)
+  if (sendtype != MPI_DATATYPE_NULL && sendcount >= 0)
   {
     MPIDI_Datatype_get_info(sendcount, sendtype, contig,
                             send_bytes, data_ptr, true_lb);
@@ -224,7 +224,7 @@ int MPIDO_Gather(const void *sendbuf,
      if(unlikely(verbose))
        fprintf(stderr,"gather MPI_IN_PLACE buffering\n");
      gather.cmd.xfer_gather.stypecount = recv_bytes;
-     gather.cmd.xfer_gather.sndbuf = (char *)recvbuf + recv_bytes*rank;
+     gather.cmd.xfer_gather.sndbuf = PAMI_IN_PLACE;
    }
    else
    {
diff --git a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
index e4e5ad0..ea2c6e5 100644
--- a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
+++ b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
@@ -105,7 +105,7 @@ int MPIDO_Gatherv(const void *sendbuf,
       {
          if(unlikely(verbose))
             fprintf(stderr,"gatherv MPI_IN_PLACE buffering\n");
-         sbuf = (char*)rbuf + rsize*displs[rank];
+         sbuf = PAMI_IN_PLACE;
          gatherv.cmd.xfer_gatherv_int.stype = rtype;
          gatherv.cmd.xfer_gatherv_int.stypecount = recvcounts[rank];
       }
diff --git a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
index 95903ef..ec85e33 100644
--- a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
+++ b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
@@ -84,7 +84,7 @@ int MPIDO_Reduce(const void *sendbuf,
    {
       if(unlikely(verbose))
          fprintf(stderr,"reduce MPI_IN_PLACE buffering\n");
-      sbuf = rbuf;
+      sbuf = PAMI_IN_PLACE;
    }
 
    reduce.cb_done = reduce_cb_done;
diff --git a/src/mpid/pamid/src/coll/scan/mpido_scan.c b/src/mpid/pamid/src/coll/scan/mpido_scan.c
index 52e62fa..7f38bd0 100644
--- a/src/mpid/pamid/src/coll/scan/mpido_scan.c
+++ b/src/mpid/pamid/src/coll/scan/mpido_scan.c
@@ -82,8 +82,7 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
    pami_xfer_t scan;
    volatile unsigned scan_active = 1;
 
-   if((sendbuf == MPI_IN_PLACE) || /* Disable until ticket #627 is fixed */
-      (selected_type == MPID_COLL_USE_MPICH || rc != MPI_SUCCESS))
+   if((selected_type == MPID_COLL_USE_MPICH || rc != MPI_SUCCESS))
    {
       if(unlikely(verbose))
          fprintf(stderr,"Using MPICH scan algorithm (exflag %d)\n",exflag);
@@ -99,7 +98,7 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
    {
       if(unlikely(verbose))
          fprintf(stderr,"scan MPI_IN_PLACE buffering\n");
-      sbuf = rbuf;
+      sbuf = PAMI_IN_PLACE;
    }
    else
    {
diff --git a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
index 0153a0d..0d82e1d 100644
--- a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
+++ b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
@@ -133,7 +133,7 @@ int MPIDO_Scatter(const void *sendbuf,
     if(MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS)
       use_pami = 0;
   }
-  if(MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
+  if(recvbuf != MPI_IN_PLACE && (MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS))
     use_pami = 0;
 
   if(!use_pami)
@@ -219,7 +219,7 @@ int MPIDO_Scatter(const void *sendbuf,
        fprintf(stderr,"scatter MPI_IN_PLACE buffering\n");
      MPIDI_Datatype_get_info(sendcount, sendtype, contig,
                              nbytes, data_ptr, true_lb);
-     scatter.cmd.xfer_scatter.rcvbuf = (char *)sendbuf + nbytes*rank;
+     scatter.cmd.xfer_scatter.rcvbuf = PAMI_IN_PLACE;
      scatter.cmd.xfer_scatter.rtype = stype;
      scatter.cmd.xfer_scatter.rtypecount = sendcount;
    }
diff --git a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
index 53c116e..a0539eb 100644
--- a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
+++ b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
@@ -326,7 +326,7 @@ int MPIDO_Scatterv(const void *sendbuf,
       {
         if(unlikely(verbose))
           fprintf(stderr,"scatterv MPI_IN_PLACE buffering\n");
-        rbuf = (char *)sendbuf + ssize*displs[rank] + send_true_lb;
+        rbuf = PAMI_IN_PLACE;
       }
       else
       {  

http://git.mpich.org/mpich.git/commitdiff/5870a2240e7efb273484d188680443a53afbebd2

commit 5870a2240e7efb273484d188680443a53afbebd2
Author: Haizhu Liu <haizhu at us.ibm.com>
Date:   Mon Feb 4 10:27:46 2013 -0500

    spaiccreate core at MPI_Intercomm_create
    
    (ibm) D188390
    (ibm) 9f8b7ef72a7a7cabd57bd5c979690b8c7b2326fe
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>
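[Editorial note] The `MPID_VCR *vc = 0;` → `MPID_VCR vc = 0;` change in the diff below is the classic hidden-pointer-typedef pitfall: `MPID_VCR` is already a pointer type, so the original declaration was a pointer-to-pointer and `*vc = ...` wrote through NULL. A self-contained sketch of the pattern (the typedef here mimics, but is not, the real pamid one):

```c
#include <stddef.h>

struct vcr_t { int taskid; };
typedef struct vcr_t *VCR;       /* like MPID_VCR: the typedef hides a '*' */

static int lookup_taskid(void)
{
    struct vcr_t table[2] = { { 7 }, { 9 } };

    /* Buggy shape: VCR *vc = 0; *vc = &table[1];  -- dereferences NULL.
     * Fixed shape (as in the commit): the variable itself holds the
     * connection pointer, no extra level of indirection. */
    VCR vc = NULL;
    vc = &table[1];
    return vc->taskid;
}
```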

diff --git a/src/mpid/pamid/src/mpid_vc.c b/src/mpid/pamid/src/mpid_vc.c
index 14c470c..a49f3a8 100644
--- a/src/mpid/pamid/src/mpid_vc.c
+++ b/src/mpid/pamid/src/mpid_vc.c
@@ -147,7 +147,7 @@ int MPID_VCR_CommFromLpids( MPID_Comm *newcomm_ptr,
     MPID_VCRT_Get_ptr( newcomm_ptr->vcrt, &newcomm_ptr->vcr );
     if(mpidi_dynamic_tasking) {
       for (i=0; i<size; i++) {
-	MPID_VCR *vc = 0;
+        MPID_VCR vc = 0;
 
 	/* For rank i in the new communicator, find the corresponding
 	   virtual connection.  For lpids less than the size of comm_world,
@@ -158,7 +158,7 @@ int MPID_VCR_CommFromLpids( MPID_Comm *newcomm_ptr,
 	   MPIR_Process.comm_world->rank, i, lpids[i] ); */
 #if 0
 	if (lpids[i] < commworld_ptr->remote_size) {
-	    *vc = commworld_ptr->vcr[lpids[i]];
+           vc = commworld_ptr->vcr[lpids[i]];
 	}
 	else {
 #endif
@@ -169,7 +169,7 @@ int MPID_VCR_CommFromLpids( MPID_Comm *newcomm_ptr,
 
 	    MPIDI_PG_Get_iterator(&iter);
 	    /* Skip comm_world */
-	    MPIDI_PG_Get_next( &iter, &pg );
+            /*MPIDI_PG_Get_next( &iter, &pg ); */
 	    do {
 		MPIDI_PG_Get_next( &iter, &pg );
                 /*MPIU_ERR_CHKINTERNAL(!pg, mpi_errno, "no pg"); */
@@ -177,8 +177,8 @@ int MPID_VCR_CommFromLpids( MPID_Comm *newcomm_ptr,
 		   for this process group could help speed this search */
 		for (j=0; j<pg->size; j++) {
 		    /*printf( "Checking lpid %d against %d in pg %s\n",
-			    lpids[i], pg->vct[j].lpid, (char *)pg->id );
-			    fflush(stdout); */
+                            lpids[i], pg->vct[j].taskid, (char *)pg->id );
+                           fflush(stdout); */
 		    if (pg->vct[j].taskid == lpids[i]) {
 			vc = &pg->vct[j];
 			/*printf( "found vc %x for lpid = %d in another pg\n",

http://git.mpich.org/mpich.git/commitdiff/5a5f5276507f8fb77290eb9bccb9a1f572bef75c

commit 5a5f5276507f8fb77290eb9bccb9a1f572bef75c
Author: Haizhu Liu <haizhu at us.ibm.com>
Date:   Thu Jan 31 15:34:29 2013 -0500

    Dynamic test case finalize6 hang
    
    (ibm) D188486
    (ibm) c7fa105f8d4198d811ae2ef6d7252116363834be
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>
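[Editorial note] The fix below moves the blocking `PMI2_Finalize()` onto a helper thread while the main thread keeps advancing the PAMI context and polling a done flag, so finalize traffic can still make progress. A minimal sketch of that offload-and-poll shape, with a counter standing in for `PAMI_Context_advance`:

```c
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Mirrors mpidi_sync_done; volatile because the main thread polls it. */
static volatile int sync_done = 0;
static int advances = 0;

static void *finalize_req(void *arg)     /* stand-in for PMI2_Finalize() */
{
    (void)arg;
    sync_done = 1;                       /* blocking work would happen here */
    return NULL;
}

static int run_finalize(void)
{
    pthread_t t;
    if (pthread_create(&t, NULL, finalize_req, NULL) != 0)
        return -1;
    /* Main thread keeps "advancing the context" instead of blocking
     * inside the finalize call itself. */
    while (sync_done != 1) {
        advances++;                      /* PAMI_Context_advance stand-in */
        sched_yield();
    }
    return pthread_join(t, NULL);
}
```

Note the real commit also drops the lock around the poll loop (`MPIU_THREAD_CS_EXIT`/`ENTER`) so the helper thread is not starved.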

diff --git a/src/mpid/pamid/src/dyntask/mpidi_pg.c b/src/mpid/pamid/src/dyntask/mpidi_pg.c
index ca89c85..d588882 100644
--- a/src/mpid/pamid/src/dyntask/mpidi_pg.c
+++ b/src/mpid/pamid/src/dyntask/mpidi_pg.c
@@ -13,6 +13,8 @@
 #ifdef DYNAMIC_TASKING
 
 extern int mpidi_dynamic_tasking;
+int mpidi_sync_done=0;
+
 
 #define MAX_JOBID_LEN 1024
 
@@ -71,6 +73,11 @@ int MPIDI_PG_Init(int *argc_p, char ***argv_p,
     return mpi_errno;
 }
 
+void *mpidi_finalize_req(void *arg) {
+    PMI2_Finalize();
+    mpidi_sync_done=1;
+}
+
 /*@
    MPIDI_PG_Finalize - Finalize the process groups, including freeing all
    process group structures
@@ -82,9 +89,10 @@ int MPIDI_PG_Finalize(void)
    int                    my_max_worldid, world_max_worldid;
    int                    wid_bit_array_size=0, wid;
    unsigned char          *wid_bit_array=NULL, *root_wid_barray=NULL;
-   MPIDI_PG_t *pg, *pgNext;
-   char key[PMI2_MAX_KEYLEN];
-   char value[PMI2_MAX_VALLEN];
+   MPIDI_PG_t             *pg, *pgNext;
+   char                   key[PMI2_MAX_KEYLEN];
+   char                   value[PMI2_MAX_VALLEN];
+   pthread_t              finalize_req_thread;
 
    /* Print the state of the process groups */
    if (verbose) {
@@ -161,17 +169,21 @@ int MPIDI_PG_Finalize(void)
    TRACE_ERR("PMI2_KVS_Fence returned with mpi_errno=%d\n", mpi_errno);
 
    MPIU_Free(root_wid_barray); /* root_wid_barray is now NULL for non-root */
-/*    if (pg_world->connData) { */
-#ifdef USE_PMI2_API
-	mpi_errno = PMI2_Finalize();
-#else
-	int rc;
-	rc = PMI_Finalize();
-	if (rc) {
-          TRACE_ERR("PMI_Finalize returned with rc=%d\n", rc);
-	}
-#endif
-    /*}*/
+
+
+   pthread_create(&finalize_req_thread, NULL, mpidi_finalize_req, NULL);
+   MPIU_THREAD_CS_EXIT(ALLFUNC,);
+   while (mpidi_sync_done !=1) {
+     mpi_errno=PAMI_Context_advance(MPIDI_Context[0], 1000);
+     if (mpi_errno == PAMI_EAGAIN) {
+       usleep(1);
+     }
+   }
+
+   if (mpi_errno = pthread_join(finalize_req_thread, NULL) ) {
+         TRACE_ERR("error returned from pthread_join() mpi_errno=%d\n",mpi_errno);
+   }
+   MPIU_THREAD_CS_ENTER(ALLFUNC,);
 
    if(_conn_info_list) {
      if(_conn_info_list->rem_taskids)

http://git.mpich.org/mpich.git/commitdiff/efeeb12cbb4b14dd8e15100beb333e4a86eda9a5

commit efeeb12cbb4b14dd8e15100beb333e4a86eda9a5
Author: Haizhu Liu <haizhu at us.ibm.com>
Date:   Tue Jan 29 15:14:02 2013 -0500

    multi_ports failed in dynamic tasking
    
    (ibm) D188299
    (ibm) 9a123cd1deefa7633870a7333d0cfa502cf31dc8
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>
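[Editorial note] Before the fix below, `MPIDI_PG_Dup_vcr` returned `&pg->vct[rank]`, so every "dup" of the same rank aliased one shared slot; with multiple ports connecting concurrently they stomped each other. The commit allocates a fresh VCR per dup instead. An illustrative sketch of that copy-on-dup ownership pattern (names are not the real `MPIU_*` helpers):

```c
#include <stdlib.h>

struct vcr_t { int taskid; int pg_rank; };

/* Buggy variant returned &table[rank]: every caller aliases one slot.
 * Fixed variant (as in the commit) hands each caller its own copy,
 * which the caller then owns and must free. */
static struct vcr_t *dup_vcr(const struct vcr_t *table, int rank)
{
    struct vcr_t *vcr = malloc(sizeof *vcr);
    if (!vcr) return NULL;
    *vcr = table[rank];
    return vcr;
}
```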

diff --git a/src/mpid/pamid/src/dyntask/mpidi_pg.c b/src/mpid/pamid/src/dyntask/mpidi_pg.c
index 236830a..ca89c85 100644
--- a/src/mpid/pamid/src/dyntask/mpidi_pg.c
+++ b/src/mpid/pamid/src/dyntask/mpidi_pg.c
@@ -913,7 +913,8 @@ int MPIDI_PG_Dup_vcr( MPIDI_PG_t *pg, int rank, pami_task_t taskid, MPID_VCR *vc
 
     TRACE_ERR("ENTER MPIDI_PG_Dup_vcr - pg->id=%s rank=%d taskid=%d\n", pg->id, rank, taskid);
     pg->vct[rank].taskid = taskid;
-    vcr = &pg->vct[rank];
+
+    vcr = MPIU_Malloc(sizeof(struct MPID_VCR_t));
     TRACE_ERR("MPIDI_PG_Dup_vcr- pg->vct[%d].pg=%x pg=%x vcr=%x vcr->pg=%x\n", rank, pg->vct[rank].pg, pg, vcr, vcr->pg);
     vcr->pg = pg;
     vcr->pg_rank = rank;

http://git.mpich.org/mpich.git/commitdiff/cc9a7b0329fc2b668cd6c15e86714261cb6588bf

commit cc9a7b0329fc2b668cd6c15e86714261cb6588bf
Author: Su Huang <suhuang at us.ibm.com>
Date:   Fri Feb 1 10:54:22 2013 -0500

    MPI job hangs at MPI_Iprobe using mpich2
    
    (ibm) D188431
    (ibm) 8Q6
    (ibm) 3dd53dcc2f880660e7a4d636189ec6aea935a6de
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>
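[Editorial note] The hunk below moves the per-source sequence lookup inside the match test and keeps the wrap-safe arrival check `(int)(nMsgs - MPIDI_Request_getMatchSeq(rreq)) >= 0`. That idiom deserves a note: the subtraction is done in unsigned arithmetic and then cast to signed, so the comparison stays correct even after the per-source message counter wraps around. A tiny sketch:

```c
/* Wrap-safe "has this sequence number arrived yet" test, as used in the
 * recv queue: unsigned subtraction, then a signed comparison, so a
 * counter that has wrapped past zero still orders correctly against
 * sequence numbers issued just before the wrap. */
static int seq_arrived(unsigned nMsgs, unsigned matchSeq)
{
    return (int)(nMsgs - matchSeq) >= 0;
}
```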

diff --git a/src/mpid/pamid/src/mpid_recvq.c b/src/mpid/pamid/src/mpid_recvq.c
index ce32428..313bb37 100644
--- a/src/mpid/pamid/src/mpid_recvq.c
+++ b/src/mpid/pamid/src/mpid_recvq.c
@@ -87,13 +87,6 @@ MPIDI_Recvq_FU(int source, int tag, int context_id, MPI_Status * status)
   MPIDI_In_cntr_t *in_cntr;
   uint nMsgs=0;
   pami_task_t pami_source;
-  
-
-  if(source != MPI_ANY_SOURCE) {
-    pami_source = PAMIX_Endpoint_query(source);
-    in_cntr=&MPIDI_In_cntr[pami_source];
-    nMsgs = in_cntr->nMsgs + 1;
-  }
 #endif
 
   if (tag != MPI_ANY_TAG && source != MPI_ANY_SOURCE)
@@ -103,31 +96,32 @@ MPIDI_Recvq_FU(int source, int tag, int context_id, MPI_Status * status)
 #ifdef USE_STATISTICS
         ++search_length;
 #endif
-#ifdef OUT_OF_ORDER_HANDLING
-        if( ((int)(nMsgs-MPIDI_Request_getMatchSeq(rreq))) >= 0 )
-        { 
-#endif
         if ( (MPIDI_Request_getMatchCtxt(rreq) == context_id) &&
              (MPIDI_Request_getMatchRank(rreq) == source    ) &&
              (MPIDI_Request_getMatchTag(rreq)  == tag       )
              )
           {
-#ifdef OUT_OF_ORDER_HANDLING
-            if (rreq->mpid.nextR != NULL) {  /* recv is in the out of order list */
-              if (MPIDI_Request_getMatchSeq(rreq) == nMsgs) {
-                in_cntr->nMsgs=nMsgs;
-              MPIDI_Recvq_remove_req_from_ool(rreq,in_cntr);
-            } 
-           } 
-#endif
+  #ifdef OUT_OF_ORDER_HANDLING
+            pami_source= MPIDI_Request_getPeerRank_pami(rreq);
+            in_cntr=&MPIDI_In_cntr[pami_source];
+            nMsgs = in_cntr->nMsgs + 1;
+            if( ((int)(nMsgs-MPIDI_Request_getMatchSeq(rreq))) >= 0 )
+            {
+              if (rreq->mpid.nextR != NULL) {  /* recv is in the out of order list */
+                 if (MPIDI_Request_getMatchSeq(rreq) == nMsgs) {
+                     in_cntr->nMsgs=nMsgs;
+                     MPIDI_Recvq_remove_req_from_ool(rreq,in_cntr);
+                 } 
+              } 
+  #endif
             found = TRUE;
             if(status != MPI_STATUS_IGNORE)
               *status = (rreq->status);
             break;
-          }
 #ifdef OUT_OF_ORDER_HANDLING
-       }
+             }
 #endif
+       }
         rreq = rreq->mpid.next;
       }
     }

http://git.mpich.org/mpich.git/commitdiff/dad9e73110e94b921c9dd1fba0a611a157039ca5

commit dad9e73110e94b921c9dd1fba0a611a157039ca5
Author: Haizhu Liu <haizhu at us.ibm.com>
Date:   Sat Jan 26 17:18:30 2013 -0500

    soft spawn patch caused error msg expansion error during compile
    
    (ibm) 45921fb741c700d5452e96e6ea3c1df1a40952f3
    
    Signed-off-by: Bob Cernohous <bobc at us.ibm.com>

diff --git a/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c b/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
index 729abc4..00306fb 100644
--- a/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
+++ b/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
@@ -320,7 +320,7 @@ int MPIDI_Comm_spawn_multiple(int count, char **commands,
 
     if(pmi_errno) {
            mpi_errno = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_FATAL, __FILE__, __LINE__, MPI_ERR_SPAWN,
-            "**noresource", 0);
+            "**mpi_comm_spawn", 0);
     }
 
  fn_exit:

http://git.mpich.org/mpich.git/commitdiff/8a98906cb6d4dd96189c23a652ebbccaa3b760be

commit 8a98906cb6d4dd96189c23a652ebbccaa3b760be
Author: Haizhu Liu <haizhu at us.ibm.com>
Date:   Wed Jan 23 13:34:21 2013 -0500

    Dynamic cases failed with large procs
    
    (ibm) D187931
    (ibm) 23aa052a1030b1d665c926701ec46fbb5f343384
    
    Signed-off-by: Bob Cernohous <bobc at us.ibm.com>
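[Editorial note] The `MPIDI_connToStringKVS` hunk below fixes the large-proc failure by checking, before each appended taskid, that the buffer can still hold the remaining entries, and reallocating it larger if not. A self-contained sketch of that grow-on-append pattern (plain `malloc`/`realloc` stand in for the real `MPIU_*` helpers):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "id0:id1:...:" the way the commit does: estimate how much room
 * the remaining entries need whenever the current buffer runs short,
 * realloc once, then keep appending. */
static char *taskids_to_string(const int *ids, int n)
{
    size_t cap = 8, len = 0;
    char *s = malloc(cap);
    if (!s) return NULL;
    s[0] = '\0';
    for (int i = 0; i < n; i++) {
        char item[16];
        int vallen = snprintf(item, sizeof item, "%d:", ids[i]);
        if (len + (size_t)vallen + 1 >= cap) {
            cap += (size_t)(n - i) * ((size_t)vallen + 1);
            char *ns = realloc(s, cap);
            if (!ns) { free(s); return NULL; }
            s = ns;
        }
        memcpy(s + len, item, (size_t)vallen + 1);
        len += (size_t)vallen;
    }
    return s;
}
```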

diff --git a/src/mpid/pamid/src/dyntask/mpidi_pg.c b/src/mpid/pamid/src/dyntask/mpidi_pg.c
index af58e9f..236830a 100644
--- a/src/mpid/pamid/src/dyntask/mpidi_pg.c
+++ b/src/mpid/pamid/src/dyntask/mpidi_pg.c
@@ -459,7 +459,6 @@ int MPIDI_PG_Create_from_string(const char * str, MPIDI_PG_t ** pg_pptr,
     int mpi_errno = MPI_SUCCESS;
     const char *p;
     char *pg_id, *pg_id2, *cp2, *cp3,*str2, *str3;
-    pami_task_t taskids[10];
     int vct_sz, i;
     MPIDI_PG_t *existing_pg, *pg_ptr=0;
 
@@ -484,9 +483,9 @@ int MPIDI_PG_Create_from_string(const char * str, MPIDI_PG_t ** pg_pptr,
     while (*p) p++; p++;
     vct_sz = atoi(p);
 
-    p++;p++;
-    TRACE_ERR("before MPIDI_PG_Create - p=%s\n", p);
+    while (*p) p++;p++;
     char *p_tmp = MPIU_Strdup(p);
+    TRACE_ERR("before MPIDI_PG_Create - p=%s p_tmp=%s vct_sz=%d\n", p, p_tmp, vct_sz);
     mpi_errno = MPIDI_PG_Create(vct_sz, (void *)str, pg_pptr);
     if (mpi_errno != MPI_SUCCESS) {
 	TRACE_ERR("MPIDI_PG_Create returned with mpi_errno=%d\n", mpi_errno);
@@ -610,6 +609,16 @@ int MPIDI_connToStringKVS( char **buf_p, int *slen, MPIDI_PG_t *pg )
 
     /* add the taskids of the pg */
     for(i = 0; i < pg->size; i++) {
+      MPIU_Snprintf(buf, MPIDI_MAX_KVS_VALUE_LEN, "%d:", pg->vct[i].taskid);
+      vallen = strlen(buf);
+      if (len+vallen+1 >= curSlen) {
+        char *nstring = 0;
+        curSlen += (pg->size - i) * (vallen + 1 );
+        nstring = MPIU_Realloc( string, curSlen);
+        MPID_assert(nstring != NULL);
+        string = nstring;
+      }
+      /* Append to string */
       nChars = MPIU_Snprintf(&string[len], curSlen - len, "%d:", pg->vct[i].taskid);
       len+=nChars;
     }

http://git.mpich.org/mpich.git/commitdiff/3b93604c9a5a7dbdaf96ecc4b97700403169a35c

commit 3b93604c9a5a7dbdaf96ecc4b97700403169a35c
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Mon Jan 21 13:28:22 2013 -0600

    Fix BGQ compile errors
    
    mpido_allgather.c:497: error: 'snd_data_contig' may be used uninitialized in this function
    
    mpido_scatterv.c:411: error: 'recv_true_lb' may be used uninitialized in this function
    
    (ibm) D188060
    (ibm) 9bb4fb69d061e234d0a5c77b07697603364de625
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>
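[Editorial note] Both hunks below are the standard cure for GCC's "may be used uninitialized" error under `-Werror`: when a local is only assigned on some branches, initialize it at its declaration so every path is defined. A trivial sketch of the shape being fixed:

```c
/* `contig` was previously assigned only inside one branch, so the
 * compiler flagged "may be used uninitialized".  Initializing at the
 * declaration (as the commit does for snd_data_contig and
 * recv_true_lb) makes every path well-defined. */
static int describe(int have_sendbuf)
{
    int contig = 1;              /* the fix: a safe default */
    if (have_sendbuf)
        contig = 0;              /* the only assignment before the fix */
    return contig;
}
```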

diff --git a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
index 9bb2388..a280464 100644
--- a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
+++ b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
@@ -494,7 +494,7 @@ MPIDO_Allgather_simple(const void *sendbuf,
    void *snd_noncontig_buff = NULL, *rcv_noncontig_buff = NULL;
    MPI_Aint send_true_lb = 0;
    MPI_Aint recv_true_lb = 0;
-   int snd_data_contig, rcv_data_contig;
+   int snd_data_contig = 1, rcv_data_contig;
    size_t send_size = 0;
    size_t recv_size = 0;
    MPID_Segment segment;
diff --git a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
index 532b986..53c116e 100644
--- a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
+++ b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
@@ -408,7 +408,7 @@ int MPIDO_Scatterv_simple(const void *sendbuf,
   int tmp, pamidt = 1;
   int ssize, rsize;
   MPID_Datatype *dt_ptr = NULL;
-  MPI_Aint send_true_lb=0, recv_true_lb;
+  MPI_Aint send_true_lb=0, recv_true_lb=0;
   char *sbuf, *rbuf;
   pami_type_t stype, rtype = NULL;
   const int rank = comm_ptr->rank;

http://git.mpich.org/mpich.git/commitdiff/33225d96054a33e5d0aa0136f586cb052232bcaf

commit 33225d96054a33e5d0aa0136f586cb052232bcaf
Author: sssharka <sssharka at us.ibm.com>
Date:   Fri Jan 18 16:53:11 2013 -0500

    multi mpi core with collective selection enabled
    
    Adding support for PAMI_IN_PLACE in collective selection path
    
    (ibm) D188060
    (ibm) 7Z8
    (ibm) 1a6d15e469f89a81de14fcd59875e39d3cd0d353
    
    Signed-off-by: Bob Cernohous <bobc at us.ibm.com>

diff --git a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
index c346d89..9bb2388 100644
--- a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
+++ b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
@@ -503,7 +503,6 @@ MPIDO_Allgather_simple(const void *sendbuf,
 
    const pami_metadata_t *my_md;
 
-
    char *rbuf = NULL, *sbuf = NULL;
 
 
@@ -532,31 +531,33 @@ MPIDO_Allgather_simple(const void *sendbuf,
       }
    }
 
-   MPIDI_Datatype_get_info(sendcount,
-                         sendtype,
-                         snd_data_contig,
-                         send_size,
-                         dt_null,
-                         send_true_lb);
-
-   sbuf = (char *)sendbuf+send_true_lb;
-   if(sendbuf == MPI_IN_PLACE) 
-     sbuf = (char *)recvbuf+recv_size*rank;
-
-   if(!snd_data_contig)
+   if(sendbuf == MPI_IN_PLACE)
+     sbuf = PAMI_IN_PLACE;
+   else
    {
-      snd_noncontig_buff = MPIU_Malloc(send_size);
-      sbuf = snd_noncontig_buff;
-      if(snd_noncontig_buff == NULL)
-      {
-         MPID_Abort(NULL, MPI_ERR_NO_SPACE, 1,
-            "Fatal:  Cannot allocate pack buffer");
-      }
-      DLOOP_Offset last = send_size;
-      MPID_Segment_init(sendbuf != MPI_IN_PLACE?sendbuf:(void*)((char *)recvbuf+recv_size*rank), 
-	                    sendcount, sendtype, &segment, 0);
-      MPID_Segment_pack(&segment, 0, &last, snd_noncontig_buff);
-   }
+     MPIDI_Datatype_get_info(sendcount,
+                           sendtype,
+                           snd_data_contig,
+                           send_size,
+                           dt_null,
+                           send_true_lb);
+
+     sbuf = (char *)sendbuf+send_true_lb;
+
+     if(!snd_data_contig)
+     {
+        snd_noncontig_buff = MPIU_Malloc(send_size);
+        sbuf = snd_noncontig_buff;
+        if(snd_noncontig_buff == NULL)
+        {
+           MPID_Abort(NULL, MPI_ERR_NO_SPACE, 1,
+              "Fatal:  Cannot allocate pack buffer");
+        }
+        DLOOP_Offset last = send_size;
+        MPID_Segment_init(sendbuf, sendcount, sendtype, &segment, 0);
+        MPID_Segment_pack(&segment, 0, &last, snd_noncontig_buff);
+     }
+  }
 
    TRACE_ERR("Using PAMI-level allgather protocol\n");
    pami_xfer_t allgather;
@@ -564,7 +565,7 @@ MPIDO_Allgather_simple(const void *sendbuf,
    allgather.cookie = (void *)&allgather_active;
    allgather.cmd.xfer_allgather.rcvbuf = rbuf;
    allgather.cmd.xfer_allgather.sndbuf = sbuf;
-   allgather.cmd.xfer_allgather.stype = PAMI_TYPE_BYTE;
+   allgather.cmd.xfer_allgather.stype = PAMI_TYPE_BYTE;/* stype is ignored when sndbuf == PAMI_IN_PLACE */
    allgather.cmd.xfer_allgather.rtype = PAMI_TYPE_BYTE;
    allgather.cmd.xfer_allgather.stypecount = send_size;
    allgather.cmd.xfer_allgather.rtypecount = recv_size;
@@ -587,4 +588,4 @@ MPIDO_Allgather_simple(const void *sendbuf,
    if(!snd_data_contig)  MPIU_Free(snd_noncontig_buff);
    TRACE_ERR("Allgather done\n");
    return MPI_SUCCESS;
-}
\ No newline at end of file
+}
diff --git a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
index 5c182a5..f97ce89 100644
--- a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
+++ b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
@@ -542,7 +542,7 @@ MPIDO_Allgatherv_simple(const void *sendbuf,
   int scount=sendcount;
 
   char *sbuf, *rbuf;
-  pami_type_t stype, rtype;
+  pami_type_t stype = NULL, rtype;
   const int rank = comm_ptr->rank;
   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
 
@@ -571,10 +571,7 @@ MPIDO_Allgatherv_simple(const void *sendbuf,
 
    if(sendbuf == MPI_IN_PLACE)
    {
-     sbuf = (char *)recvbuf+displs[rank]*recv_size;
-     send_true_lb = recv_true_lb;
-     scount = recvcounts[rank];
-     send_size = recv_size * scount; 
+     sbuf = PAMI_IN_PLACE;
    }
    else
    {
@@ -601,7 +598,7 @@ MPIDO_Allgatherv_simple(const void *sendbuf,
    allgatherv.cookie = (void *)&allgatherv_active;
    allgatherv.cmd.xfer_allgatherv_int.sndbuf = sbuf;
    allgatherv.cmd.xfer_allgatherv_int.rcvbuf = rbuf;
-   allgatherv.cmd.xfer_allgatherv_int.stype = stype;
+   allgatherv.cmd.xfer_allgatherv_int.stype = stype;/* stype is ignored when sndbuf == PAMI_IN_PLACE */
    allgatherv.cmd.xfer_allgatherv_int.rtype = rtype;
    allgatherv.cmd.xfer_allgatherv_int.stypecount = scount;
    allgatherv.cmd.xfer_allgatherv_int.rtypecounts = (int *) recvcounts;
@@ -620,4 +617,4 @@ MPIDO_Allgatherv_simple(const void *sendbuf,
    MPID_PROGRESS_WAIT_WHILE(allgatherv_active);
 
    return MPI_SUCCESS;
-}
\ No newline at end of file
+}
diff --git a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
index d6c0d0a..32292c1 100644
--- a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
+++ b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
@@ -428,7 +428,7 @@ int MPIDO_Allreduce_simple(const void *sendbuf,
   sbuf = (void *)sendbuf;
   if(unlikely(sendbuf == MPI_IN_PLACE))
   {
-     sbuf = recvbuf;
+     sbuf = PAMI_IN_PLACE;
   }
 
   allred.cb_done = cb_allreduce;
@@ -452,4 +452,4 @@ int MPIDO_Allreduce_simple(const void *sendbuf,
   MPID_PROGRESS_WAIT_WHILE(active);
   TRACE_ERR("Allreduce done\n");
   return MPI_SUCCESS;
-}
\ No newline at end of file
+}
diff --git a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
index d36e87e..a4012eb 100644
--- a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
+++ b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
@@ -186,7 +186,7 @@ int MPIDO_Alltoall_simple(const void *sendbuf,
    TRACE_ERR("Entering MPIDO_Alltoall_optimized\n");
    volatile unsigned active = 1;
    MPID_Datatype *sdt, *rdt;
-   pami_type_t stype, rtype;
+   pami_type_t stype = NULL, rtype;
    MPI_Aint sdt_true_lb=0, rdt_true_lb;
    MPIDI_Post_coll_t alltoall_post;
    int sndlen, rcvlen, snd_contig, rcv_contig, pamidt=1;
@@ -194,8 +194,11 @@ int MPIDO_Alltoall_simple(const void *sendbuf,
 
    const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
 
-   MPIDI_Datatype_get_info(1, sendtype, snd_contig, sndlen, sdt, sdt_true_lb);
-   if(!snd_contig) pamidt = 0;
+   if(sendbuf != MPI_IN_PLACE)
+   {
+     MPIDI_Datatype_get_info(1, sendtype, snd_contig, sndlen, sdt, sdt_true_lb);
+     if(!snd_contig) pamidt = 0;
+   }
    MPIDI_Datatype_get_info(1, recvtype, rcv_contig, rcvlen, rdt, rdt_true_lb);
    if(!rcv_contig) pamidt = 0;
 
@@ -205,14 +208,11 @@ int MPIDO_Alltoall_simple(const void *sendbuf,
 
 
    /* Is it a built in type? If not, send to MPICH */
-   if(MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS)
+   if(sendbuf != MPI_IN_PLACE && (MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS))
       pamidt = 0;
    if(MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
       pamidt = 0;
 
-   if(sendbuf ==  MPI_IN_PLACE)
-      pamidt = 0;
-
    if(pamidt == 0)
    {
       return MPIR_Alltoall_intra(sendbuf, sendcount, sendtype,
@@ -231,20 +231,15 @@ int MPIDO_Alltoall_simple(const void *sendbuf,
    alltoall.cb_done = cb_alltoall;
    alltoall.cookie = (void *)&active;
    alltoall.algorithm = mpid->coll_algorithm[PAMI_XFER_ALLTOALL][0][0];
+   alltoall.cmd.xfer_alltoall.stype = stype;/* stype is ignored when sndbuf == PAMI_IN_PLACE */
+   alltoall.cmd.xfer_alltoall.stypecount = sendcount;
+   alltoall.cmd.xfer_alltoall.sndbuf = (char *)sendbuf + sdt_true_lb;
+
    if(sendbuf == MPI_IN_PLACE)
    {
-      alltoall.cmd.xfer_alltoall.stype = rtype;
-      alltoall.cmd.xfer_alltoall.stypecount = recvcount;
-      alltoall.cmd.xfer_alltoall.sndbuf = (char *)recvbuf + rdt_true_lb;
-   }
-   else
-   {
-      alltoall.cmd.xfer_alltoall.stype = stype;
-      alltoall.cmd.xfer_alltoall.stypecount = sendcount;
-      alltoall.cmd.xfer_alltoall.sndbuf = (char *)sendbuf + sdt_true_lb;
+      alltoall.cmd.xfer_alltoall.sndbuf = PAMI_IN_PLACE;
    }
    alltoall.cmd.xfer_alltoall.rcvbuf = (char *)recvbuf + rdt_true_lb;
-
    alltoall.cmd.xfer_alltoall.rtypecount = recvcount;
    alltoall.cmd.xfer_alltoall.rtype = rtype;
 
@@ -256,4 +251,4 @@ int MPIDO_Alltoall_simple(const void *sendbuf,
 
    TRACE_ERR("Leaving MPIDO_Alltoall_optimized\n");
    return MPI_SUCCESS;
-}
\ No newline at end of file
+}
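The hunks above (and the matching alltoallv/gather/scatter changes below) replace the old emulation of MPI_IN_PLACE — copying the receive-side type, count, and buffer into the send parameters — with PAMI's native PAMI_IN_PLACE sentinel, which tells the library to ignore stype/stypecount entirely. A minimal sketch of the new buffer-selection logic, using hypothetical stand-in sentinels (not the real MPI/PAMI values):

```c
#include <stddef.h>

/* Hypothetical stand-ins for the MPI_IN_PLACE and PAMI_IN_PLACE sentinels;
 * the real values are implementation-defined. */
#define MY_MPI_IN_PLACE  ((const void *)-1)
#define MY_PAMI_IN_PLACE ((void *)-2)

/* Sketch of the new control flow: the send parameters are filled in
 * unconditionally from the send side, and only sndbuf is overwritten with
 * the in-place sentinel, because PAMI ignores stype and stypecount when
 * sndbuf == PAMI_IN_PLACE. */
static void *select_sndbuf(const void *sendbuf, ptrdiff_t true_lb)
{
    if (sendbuf == MY_MPI_IN_PLACE)
        return MY_PAMI_IN_PLACE;
    return (char *)sendbuf + true_lb;  /* normal case: apply true lower bound */
}
```

This is why the patches can also skip MPIDI_Datatype_get_info and MPIDI_Datatype_to_pami on the send side when sendbuf == MPI_IN_PLACE: those results are never consulted.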
diff --git a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
index d59f905..b7ccb0d 100644
--- a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
+++ b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
@@ -187,8 +187,8 @@ int MPIDO_Alltoallv_simple(const void *sendbuf,
    volatile unsigned active = 1;
    int sndtypelen, rcvtypelen, snd_contig, rcv_contig;
    MPID_Datatype *sdt, *rdt;
-   pami_type_t stype, rtype;
-   MPI_Aint sdt_true_lb, rdt_true_lb;
+   pami_type_t stype = NULL, rtype;
+   MPI_Aint sdt_true_lb = 0, rdt_true_lb;
    MPIDI_Post_coll_t alltoallv_post;
    int pamidt = 1;
    int tmp;
@@ -196,19 +196,20 @@ int MPIDO_Alltoallv_simple(const void *sendbuf,
 
    const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
 
-   if(MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS)
+   if(sendbuf != MPI_IN_PLACE && (MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS))
       pamidt = 0;
    if(MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
       pamidt = 0;
 
 
-   MPIDI_Datatype_get_info(1, sendtype, snd_contig, sndtypelen, sdt, sdt_true_lb);
-   if(!snd_contig) pamidt = 0;
+   if(sendbuf != MPI_IN_PLACE)
+   {
+     MPIDI_Datatype_get_info(1, sendtype, snd_contig, sndtypelen, sdt, sdt_true_lb);
+     if(!snd_contig) pamidt = 0;
+   }
    MPIDI_Datatype_get_info(1, recvtype, rcv_contig, rcvtypelen, rdt, rdt_true_lb);
    if(!rcv_contig) pamidt = 0;
 
-   if(sendbuf ==  MPI_IN_PLACE)
-      pamidt = 0;
 
    if(pamidt == 0)
    {
@@ -227,23 +228,17 @@ int MPIDO_Alltoallv_simple(const void *sendbuf,
 
    alltoallv.cb_done = cb_alltoallv;
    alltoallv.cookie = (void *)&active;
+   alltoallv.cmd.xfer_alltoallv_int.stype = stype;/* stype is ignored when sndbuf == PAMI_IN_PLACE */
+   alltoallv.cmd.xfer_alltoallv_int.sdispls = (int *) senddispls;
+   alltoallv.cmd.xfer_alltoallv_int.stypecounts = (int *) sendcounts;
+   alltoallv.cmd.xfer_alltoallv_int.sndbuf = (char *)sendbuf+sdt_true_lb;
+
    /* We won't bother with alltoallv since MPI is always going to be ints. */
    if(sendbuf == MPI_IN_PLACE)
    {
-      alltoallv.cmd.xfer_alltoallv_int.stype = rtype;
-      alltoallv.cmd.xfer_alltoallv_int.sdispls = (int *) recvdispls;
-      alltoallv.cmd.xfer_alltoallv_int.stypecounts = (int *) recvcounts;
-      alltoallv.cmd.xfer_alltoallv_int.sndbuf = (char *)recvbuf+rdt_true_lb;
-   }
-   else
-   {
-      alltoallv.cmd.xfer_alltoallv_int.stype = stype;
-      alltoallv.cmd.xfer_alltoallv_int.sdispls = (int *) senddispls;
-      alltoallv.cmd.xfer_alltoallv_int.stypecounts = (int *) sendcounts;
-      alltoallv.cmd.xfer_alltoallv_int.sndbuf = (char *)sendbuf+sdt_true_lb;
+      alltoallv.cmd.xfer_alltoallv_int.sndbuf = PAMI_IN_PLACE;
    }
    alltoallv.cmd.xfer_alltoallv_int.rcvbuf = (char *)recvbuf+rdt_true_lb;
-      
    alltoallv.cmd.xfer_alltoallv_int.rdispls = (int *) recvdispls;
    alltoallv.cmd.xfer_alltoallv_int.rtypecounts = (int *) recvcounts;
    alltoallv.cmd.xfer_alltoallv_int.rtype = rtype;
@@ -260,4 +255,4 @@ int MPIDO_Alltoallv_simple(const void *sendbuf,
 
 
    return MPI_SUCCESS;
-}
\ No newline at end of file
+}
diff --git a/src/mpid/pamid/src/coll/gather/mpido_gather.c b/src/mpid/pamid/src/coll/gather/mpido_gather.c
index 77d9926..806eacd 100644
--- a/src/mpid/pamid/src/coll/gather/mpido_gather.c
+++ b/src/mpid/pamid/src/coll/gather/mpido_gather.c
@@ -320,10 +320,13 @@ int MPIDO_Gather_simple(const void *sendbuf,
   const int size = comm_ptr->local_size;
   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
 
-  MPIDI_Datatype_get_info(sendcount, sendtype, contig,
+  if(sendbuf != MPI_IN_PLACE)
+  {
+    MPIDI_Datatype_get_info(sendcount, sendtype, contig,
                             send_bytes, data_ptr, true_lb);
-  if (!contig)
+    if (!contig)
       success = 0;
+  }
 
   if (success && rank == root)
   {
@@ -352,17 +355,13 @@ int MPIDO_Gather_simple(const void *sendbuf,
    gather.cb_done = cb_gather;
    gather.cookie = (void *)&active;
    gather.cmd.xfer_gather.root = MPID_VCR_GET_LPID(comm_ptr->vcr, root);
+   gather.cmd.xfer_gather.stypecount = send_bytes;/* stypecount is ignored when sndbuf == PAMI_IN_PLACE */
+   gather.cmd.xfer_gather.sndbuf = (void *)sendbuf;
    if(sendbuf == MPI_IN_PLACE) 
    {
-     gather.cmd.xfer_gather.stypecount = recv_bytes;
-     gather.cmd.xfer_gather.sndbuf = (char *)recvbuf + recv_bytes*rank;
+     gather.cmd.xfer_gather.sndbuf = PAMI_IN_PLACE;
    }
-   else
-   {
-     gather.cmd.xfer_gather.stypecount = send_bytes;
-     gather.cmd.xfer_gather.sndbuf = (void *)sendbuf;
-   }
-   gather.cmd.xfer_gather.stype = PAMI_TYPE_BYTE;
+   gather.cmd.xfer_gather.stype = PAMI_TYPE_BYTE;/* stype is ignored when sndbuf == PAMI_IN_PLACE */
    gather.cmd.xfer_gather.rcvbuf = (void *)recvbuf;
    gather.cmd.xfer_gather.rtype = PAMI_TYPE_BYTE;
    gather.cmd.xfer_gather.rtypecount = recv_bytes;
@@ -382,4 +381,4 @@ int MPIDO_Gather_simple(const void *sendbuf,
 
    TRACE_ERR("Leaving MPIDO_Gather_optimized\n");
    return MPI_SUCCESS;
-}
\ No newline at end of file
+}
diff --git a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
index d653527..e4e5ad0 100644
--- a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
+++ b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
@@ -206,7 +206,7 @@ int MPIDO_Gatherv_simple(const void *sendbuf,
    MPID_Datatype *dt_ptr = NULL;
    MPI_Aint send_true_lb, recv_true_lb;
    char *sbuf, *rbuf;
-   pami_type_t stype, rtype;
+   pami_type_t stype = NULL, rtype;
    int tmp;
    volatile unsigned gatherv_active = 1;
    const int rank = comm_ptr->rank;
@@ -217,7 +217,7 @@ int MPIDO_Gatherv_simple(const void *sendbuf,
    /* Check for native PAMI types and MPI_IN_PLACE on sendbuf */
    /* MPI_IN_PLACE is a nonlocal decision. We will need a preallreduce if we ever have
     * multiple "good" gathervs that work on different counts for example */
-   if((MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS))
+   if(sendbuf != MPI_IN_PLACE && (MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS))
       pamidt = 0;
    if(MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
       pamidt = 0;
@@ -232,18 +232,17 @@ int MPIDO_Gatherv_simple(const void *sendbuf,
    {
       if(sendbuf == MPI_IN_PLACE) 
       {
-         sbuf = (char*)rbuf + rsize*displs[rank];
-         gatherv.cmd.xfer_gatherv_int.stype = rtype;
-         gatherv.cmd.xfer_gatherv_int.stypecount = recvcounts[rank];
+         sbuf = PAMI_IN_PLACE;
       }
       else
       {
          MPIDI_Datatype_get_info(1, sendtype, contig, ssize, dt_ptr, send_true_lb);
 		 if(!contig) pamidt = 0;
          sbuf = (char *)sbuf + send_true_lb;
-         gatherv.cmd.xfer_gatherv_int.stype = stype;
-         gatherv.cmd.xfer_gatherv_int.stypecount = sendcount;
       }
+      gatherv.cmd.xfer_gatherv_int.stype = stype;/* stype is ignored when sndbuf == PAMI_IN_PLACE */
+      gatherv.cmd.xfer_gatherv_int.stypecount = sendcount;
+
    }
    else
    {
@@ -292,4 +291,4 @@ int MPIDO_Gatherv_simple(const void *sendbuf,
 
    TRACE_ERR("Leaving MPIDO_Gatherv_optimized\n");
    return MPI_SUCCESS;
-}
\ No newline at end of file
+}
diff --git a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
index a1780e6..95903ef 100644
--- a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
+++ b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
@@ -221,7 +221,7 @@ int MPIDO_Reduce_simple(const void *sendbuf,
    sbuf = (char *)sendbuf + true_lb;
    if(sendbuf == MPI_IN_PLACE) 
    {
-      sbuf = rbuf;
+      sbuf = PAMI_IN_PLACE;
    }
 
    reduce.cb_done = reduce_cb_done;
@@ -249,4 +249,4 @@ int MPIDO_Reduce_simple(const void *sendbuf,
    MPID_PROGRESS_WAIT_WHILE(reduce_active);
    TRACE_ERR("Reduce done\n");
    return MPI_SUCCESS;
-}
\ No newline at end of file
+}
diff --git a/src/mpid/pamid/src/coll/scan/mpido_scan.c b/src/mpid/pamid/src/coll/scan/mpido_scan.c
index 8097284..52e62fa 100644
--- a/src/mpid/pamid/src/coll/scan/mpido_scan.c
+++ b/src/mpid/pamid/src/coll/scan/mpido_scan.c
@@ -201,7 +201,7 @@ int MPIDO_Doscan_simple(const void *sendbuf, void *recvbuf,
    rbuf = (char *)recvbuf + true_lb;
    if(sendbuf == MPI_IN_PLACE) 
    {
-      sbuf = rbuf;
+      sbuf = PAMI_IN_PLACE;
    }
    else
    {
diff --git a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
index 4aea4ee..0153a0d 100644
--- a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
+++ b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
@@ -288,14 +288,14 @@ int MPIDO_Scatter_simple(const void *sendbuf,
   int contig, nbytes = 0;
   const int rank = comm_ptr->rank;
   int success = 1;
-  pami_type_t stype, rtype;
+  pami_type_t stype, rtype = NULL;
   int tmp;
   int use_pami = 1;
   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
 
   if(MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS)
     use_pami = 0;
-  if(MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
+  if(recvbuf != MPI_IN_PLACE && (MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS))
     use_pami = 0;
 
 
@@ -357,19 +357,13 @@ int MPIDO_Scatter_simple(const void *sendbuf,
    scatter.cmd.xfer_scatter.sndbuf = (void *)sendbuf;
    scatter.cmd.xfer_scatter.stype = stype;
    scatter.cmd.xfer_scatter.stypecount = sendcount;
+   scatter.cmd.xfer_scatter.rcvbuf = (void *)recvbuf;
+   scatter.cmd.xfer_scatter.rtype = rtype;/* rtype is ignored when rcvbuf == PAMI_IN_PLACE */
+   scatter.cmd.xfer_scatter.rtypecount = recvcount;
+
    if(recvbuf == MPI_IN_PLACE) 
    {
-     MPIDI_Datatype_get_info(sendcount, sendtype, contig,
-                             nbytes, data_ptr, true_lb);
-     scatter.cmd.xfer_scatter.rcvbuf = (char *)sendbuf + nbytes*rank;
-     scatter.cmd.xfer_scatter.rtype = stype;
-     scatter.cmd.xfer_scatter.rtypecount = sendcount;
-   }
-   else
-   {
-     scatter.cmd.xfer_scatter.rcvbuf = (void *)recvbuf;
-     scatter.cmd.xfer_scatter.rtype = rtype;
-     scatter.cmd.xfer_scatter.rtypecount = recvcount;
+     scatter.cmd.xfer_scatter.rcvbuf = PAMI_IN_PLACE;
    }
 
 
diff --git a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
index 5808c5b..532b986 100644
--- a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
+++ b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
@@ -403,12 +403,14 @@ int MPIDO_Scatterv_simple(const void *sendbuf,
                    MPID_Comm *comm_ptr,
                    int *mpierrno)
 {
-  int snd_contig, rcv_contig, tmp, pamidt = 1;
+  int snd_contig = 1;
+  int rcv_contig = 1;
+  int tmp, pamidt = 1;
   int ssize, rsize;
   MPID_Datatype *dt_ptr = NULL;
   MPI_Aint send_true_lb=0, recv_true_lb;
   char *sbuf, *rbuf;
-  pami_type_t stype, rtype;
+  pami_type_t stype, rtype = NULL;
   const int rank = comm_ptr->rank;
   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
 
@@ -423,7 +425,8 @@ int MPIDO_Scatterv_simple(const void *sendbuf,
    if(MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS)
       pamidt = 0;
    MPIDI_Datatype_get_info(1, sendtype, snd_contig, ssize, dt_ptr, send_true_lb);
-   MPIDI_Datatype_get_info(1, recvtype, rcv_contig, rsize, dt_ptr, recv_true_lb);
+   if(recvbuf != MPI_IN_PLACE)
+     MPIDI_Datatype_get_info(1, recvtype, rcv_contig, rsize, dt_ptr, recv_true_lb);
 
    if(pamidt == 0 || !snd_contig || !rcv_contig)
    {
@@ -442,7 +445,7 @@ int MPIDO_Scatterv_simple(const void *sendbuf,
    {
       if(recvbuf == MPI_IN_PLACE) 
       {
-        rbuf = (char *)sendbuf + ssize*displs[rank] + send_true_lb;
+        rbuf = PAMI_IN_PLACE;
       }
       else
       {
@@ -460,7 +463,7 @@ int MPIDO_Scatterv_simple(const void *sendbuf,
    scatterv.cmd.xfer_scatterv_int.rcvbuf = rbuf;
    scatterv.cmd.xfer_scatterv_int.sndbuf = sbuf;
    scatterv.cmd.xfer_scatterv_int.stype = stype;
-   scatterv.cmd.xfer_scatterv_int.rtype = rtype;
+   scatterv.cmd.xfer_scatterv_int.rtype = rtype;/* rtype is ignored when rcvbuf == PAMI_IN_PLACE */
    scatterv.cmd.xfer_scatterv_int.stypecounts = (int *) sendcounts;
    scatterv.cmd.xfer_scatterv_int.rtypecount = recvcount;
    scatterv.cmd.xfer_scatterv_int.sdispls = (int *) displs;

http://git.mpich.org/mpich.git/commitdiff/286b7d55e4000819ce494de008134657b5839d8b

commit 286b7d55e4000819ce494de008134657b5839d8b
Author: Haizhu Liu <haizhu at us.ibm.com>
Date:   Mon Jan 21 13:22:27 2013 -0500

    MPI_Abort does not abort a connected world when it is in the finalize process
    
    (ibm) D188181
    (ibm) 26a901c45585070e45d14f7d9cfb2783b8314dca
    
    Signed-off-by: Bob Cernohous <bobc at us.ibm.com>

diff --git a/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c b/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
index 5d50503..729abc4 100644
--- a/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
+++ b/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
@@ -135,7 +135,7 @@ int MPIDI_Comm_spawn_multiple(int count, char **commands,
     int len=0;
     int *info_keyval_sizes=0, i, mpi_errno=MPI_SUCCESS;
     PMI_keyval_t **info_keyval_vectors=0, preput_keyval_vector;
-    int *pmi_errcodes = 0, pmi_errno;
+    int *pmi_errcodes = 0, pmi_errno=0;
     int total_num_processes, should_accept = 1;
     MPID_Info tmp_info_ptr;
     char *tmp;
diff --git a/src/pmi/pmi2/poe/poe2pmi.c b/src/pmi/pmi2/poe/poe2pmi.c
index 11c0d1a..e37563e 100644
--- a/src/pmi/pmi2/poe/poe2pmi.c
+++ b/src/pmi/pmi2/poe/poe2pmi.c
@@ -290,30 +290,28 @@ int _mpi_world_exiting_handler_wrapper(pami_context_t context, void *cookie)
   int world_id = req->world_id;
   MPID_Comm *comm = MPIR_Process.comm_world;
 
-  if(!mpidi_finalized) {
-    ref_count = MPIDI_get_refcnt_of_world(world_id);
-    TRACE_ERR("_mpi_world_exiting_handler: invoked for world %d exiting ref_count=%d my comm_word_size=%d\n", world_id, ref_count, world_size);
-    if(ref_count == 0) {
-      taskid_list = MPIDI_get_taskids_in_world_id(world_id);
-      if(taskid_list != NULL) {
-        for(i=0;taskid_list[i]!=-1;i++) {
-          PAMI_Endpoint_create(MPIDI_Client, taskid_list[i], 0, &dest);
-	  MPIDI_OpState_reset(taskid_list[i]);
-	  MPIDI_IpState_reset(taskid_list[i]);
-	  TRACE_ERR("PAMI_Purge on taskid_list[%d]=%d\n", i,taskid_list[i]);
-            PAMI_Purge(context, &dest, 1);
-        }
-        MPIDI_delete_conn_record(world_id);
+  ref_count = MPIDI_get_refcnt_of_world(world_id);
+  TRACE_ERR("_mpi_world_exiting_handler: invoked for world %d exiting ref_count=%d my comm_word_size=%d\n", world_id, ref_count, world_size);
+  if(ref_count == 0) {
+    taskid_list = MPIDI_get_taskids_in_world_id(world_id);
+    if(taskid_list != NULL) {
+      for(i=0;taskid_list[i]!=-1;i++) {
+        PAMI_Endpoint_create(MPIDI_Client, taskid_list[i], 0, &dest);
+        MPIDI_OpState_reset(taskid_list[i]);
+	MPIDI_IpState_reset(taskid_list[i]);
+	TRACE_ERR("PAMI_Purge on taskid_list[%d]=%d\n", i,taskid_list[i]);
+        PAMI_Purge(context, &dest, 1);
       }
-      rc = -1;
+      MPIDI_delete_conn_record(world_id);
     }
-    my_state = TRUE;
-
-    rc = _mpi_reduce_for_dyntask(&my_state, &reduce_state);
-    if(rc) return rc;
-	
-    TRACE_ERR("_mpi_world_exiting_handler: Out of _mpi_reduce_for_dyntask for exiting world %d reduce_state=%d\n",world_id, reduce_state);
+    rc = -1;
   }
+  my_state = TRUE;
+
+  rc = _mpi_reduce_for_dyntask(&my_state, &reduce_state);
+  if(rc) return rc;
+
+  TRACE_ERR("_mpi_world_exiting_handler: Out of _mpi_reduce_for_dyntask for exiting world %d reduce_state=%d\n",world_id, reduce_state);
 
   if(comm->rank == 0) {
     MPIU_Snprintf(world_id_str, sizeof(world_id_str), "%d", world_id);
@@ -342,8 +340,12 @@ int _mpi_world_exiting_handler(int world_id)
     req = MPIU_Malloc(sizeof(struct worldExitReq));
     req->world_id = world_id;
 
-    if(!mpidi_finalized && MPIDI_Context[0])
-      PAMI_Context_post(MPIDI_Context[0], &(req->work), _mpi_world_exiting_handler_wrapper, req);
+    if(MPIDI_Context[0]) {
+      if(!mpidi_finalized)
+        PAMI_Context_post(MPIDI_Context[0], &(req->work), _mpi_world_exiting_handler_wrapper, req);
+      else
+        _mpi_world_exiting_handler_wrapper(MPIDI_Context[0], req);
+    }
 
     return MPI_SUCCESS;
 }
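The fix above moves the `mpidi_finalized` test out of the handler body and into the dispatch site: while MPI is active the wrapper is posted to the context's progress engine, but once finalization has begun it is invoked synchronously, so a connected world exiting during MPI_Finalize is still cleaned up. A sketch of that dispatch pattern with hypothetical types (not the PAMI signatures):

```c
#include <stddef.h>

typedef int (*handler_fn)(void *context, void *cookie);
typedef int (*post_fn)(void *context, handler_fn fn, void *cookie);

/* Asynchronous post while active, synchronous call once finalized. */
static int dispatch_exit_handler(void *context, int finalized,
                                 handler_fn fn, void *cookie, post_fn post)
{
    if (context == NULL)
        return 0;                          /* no context: nothing to do */
    if (!finalized)
        return post(context, fn, cookie);  /* progress engine runs it later */
    return fn(context, cookie);            /* finalize path: run it now */
}

/* Instrumented fakes so the dispatch can be exercised standalone. */
static int calls_direct, calls_posted;
static int fake_handler(void *c, void *k) { (void)c; (void)k; return ++calls_direct; }
static int fake_post(void *c, handler_fn f, void *k) { (void)c; (void)f; (void)k; return ++calls_posted; }
```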

http://git.mpich.org/mpich.git/commitdiff/4e48732bb08577a4eb8ef449de50e35348274e35

commit 4e48732bb08577a4eb8ef449de50e35348274e35
Author: Haizhu Liu <haizhu at us.ibm.com>
Date:   Thu Jan 17 10:22:24 2013 -0500

    Soft spawn bug fix in mpich2
    
    (ibm) D188226
    (ibm) 0be9a2d68b98248d393ba8fc1b1ba2a9eb4c8c6b
    
    Signed-off-by: Bob Cernohous <bobc at us.ibm.com>

diff --git a/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c b/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
index 461e4ad..5d50503 100644
--- a/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
+++ b/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
@@ -139,6 +139,7 @@ int MPIDI_Comm_spawn_multiple(int count, char **commands,
     int total_num_processes, should_accept = 1;
     MPID_Info tmp_info_ptr;
     char *tmp;
+    int tmp_ret = 0;
 
     if (comm_ptr->rank == root) {
 	/* create an array for the pmi error codes */
@@ -163,7 +164,7 @@ int MPIDI_Comm_spawn_multiple(int count, char **commands,
         {
             int *argcs = MPIU_Malloc(count*sizeof(int));
             struct MPID_Info preput;
-            struct MPID_Info *preput_p[1] = { &preput };
+            struct MPID_Info *preput_p[2] = { &preput, &tmp_info_ptr };
 
             MPIU_Assert(argcs);
 
@@ -172,12 +173,12 @@ int MPIDI_Comm_spawn_multiple(int count, char **commands,
             /* FIXME cheating on constness */
             preput.key = (char *)MPIDI_PARENT_PORT_KVSKEY;
             preput.value = port_name;
-            preput.next = NULL;
+            preput.next = &tmp_info_ptr;
 
 	    tmp_info_ptr.key = "COMMCTX";
 	    len=sprintf(ctxid_str, "%d", comm_ptr->context_id);
 	    TRACE_ERR("COMMCTX=%d\n", comm_ptr->context_id);
-	     ctxid_str[len]='\0';
+	    ctxid_str[len]='\0';
 	    tmp_info_ptr.value = ctxid_str;
 	    tmp_info_ptr.next = NULL;
 
@@ -189,10 +190,6 @@ int MPIDI_Comm_spawn_multiple(int count, char **commands,
                         ++argcs[i];
                     }
                 }
-
-                /* a fib for now */
-                info_keyval_sizes[i] = 1;
-		info_ptrs[i] = &tmp_info_ptr;
             }
 
             /* XXX DJG don't need this, PMI API is thread-safe? */
@@ -203,19 +200,23 @@ int MPIDI_Comm_spawn_multiple(int count, char **commands,
                                        argcs, (const char ***)argvs,
                                        maxprocs,
                                        info_keyval_sizes, (const MPID_Info **)info_ptrs,
-                                       1, (const struct MPID_Info **)preput_p,
+                                       2, (const struct MPID_Info **)preput_p,
                                        jobId, jobIdSize,
                                        pmi_errcodes);
-	    TRACE_ERR("after PMI2_Job_Spawn - jobId=%s\n", jobId);
+	    TRACE_ERR("after PMI2_Job_Spawn - pmi_errno=%d jobId=%s\n", pmi_errno, jobId);
 
 	    tmp=MPIU_Strdup(jobId);
-	    strtok(tmp, ";");
-	    pami_task_t leader_taskid = atoi(strtok(NULL, ";"));
-	    pami_endpoint_t ldest;
+	    tmp_ret = atoi(strtok(tmp, ";"));
+
+	    if( (pmi_errno == PMI2_SUCCESS) && (tmp_ret != -1) ) {
+	      pami_task_t leader_taskid = atoi(strtok(NULL, ";"));
+	      pami_endpoint_t ldest;
+
+              PAMI_Endpoint_create(MPIDI_Client,  leader_taskid, 0, &ldest);
+	      TRACE_ERR("PAMI_Resume to taskid=%d\n", leader_taskid);
+              PAMI_Resume(MPIDI_Context[0], &ldest, 1);
+            }
 
-            PAMI_Endpoint_create(MPIDI_Client,  leader_taskid, 0, &ldest);
-	    TRACE_ERR("PAMI_Resume to taskid=%d\n", leader_taskid);
-            PAMI_Resume(MPIDI_Context[0], &ldest, 1);
             MPIU_Free(tmp);
 
             MPIU_Free(argcs);
@@ -263,10 +264,10 @@ int MPIDI_Comm_spawn_multiple(int count, char **commands,
 	TRACE_ERR("pmi_errno from PMI_Spawn_multiple=%d\n", pmi_errno);
 #endif
 
-	if (errcodes != MPI_ERRCODES_IGNORE) {
+        if (errcodes != MPI_ERRCODES_IGNORE) {
 	    for (i=0; i<total_num_processes; i++) {
 		/* FIXME: translate the pmi error codes here */
-		errcodes[i] = pmi_errcodes[i];
+		errcodes[i] = pmi_errcodes[0];
                 /* We want to accept if any of the spawns succeeded.
                    Alternatively, this is the same as we want to NOT accept if
                    all of them failed.  should_accept = NAND(e_0, ..., e_n)
@@ -275,6 +276,13 @@ int MPIDI_Comm_spawn_multiple(int count, char **commands,
 	    }
             should_accept = !should_accept; /* the `N' in NAND */
 	}
+
+#ifdef USE_PMI2_API
+        if( (pmi_errno == PMI2_SUCCESS) && (tmp_ret == -1) )
+#else
+        if( (pmi_errno == PMI_SUCCESS) && (tmp_ret == -1) )
+#endif
+	  should_accept = 0;
     }
 
     if (errcodes != MPI_ERRCODES_IGNORE) {
@@ -282,6 +290,9 @@ int MPIDI_Comm_spawn_multiple(int count, char **commands,
         mpi_errno = MPIR_Bcast_impl(&should_accept, 1, MPI_INT, root, comm_ptr, &errflag);
         if (mpi_errno) TRACE_ERR("MPIR_Bcast_impl returned with mpi_errno=%d\n", mpi_errno);
 
+        mpi_errno = MPIR_Bcast_impl(&pmi_errno, 1, MPI_INT, root, comm_ptr, &errflag);
+        if (mpi_errno) TRACE_ERR("MPIR_Bcast_impl returned with mpi_errno=%d\n", mpi_errno);
+
         mpi_errno = MPIR_Bcast_impl(&total_num_processes, 1, MPI_INT, root, comm_ptr, &errflag);
         if (mpi_errno) TRACE_ERR("MPIR_Bcast_impl returned with mpi_errno=%d\n", mpi_errno);
 
@@ -292,6 +303,10 @@ int MPIDI_Comm_spawn_multiple(int count, char **commands,
     if (should_accept) {
         mpi_errno = MPID_Comm_accept(port_name, NULL, root, comm_ptr, intercomm);
 	TRACE_ERR("mpi_errno from MPID_Comm_accept=%d\n", mpi_errno);
+    } else {
+	if( (pmi_errno == PMI2_SUCCESS) && (errcodes[0] != 0) ) {
+	  MPIR_Comm_create(intercomm);
+	}
     }
 
     if (comm_ptr->rank == root) {
@@ -303,6 +318,11 @@ int MPIDI_Comm_spawn_multiple(int count, char **commands,
 	/* --END ERROR HANDLING-- */
     }
 
+    if(pmi_errno) {
+           mpi_errno = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_FATAL, __FILE__, __LINE__, MPI_ERR_SPAWN,
+            "**noresource", 0);
+    }
+
  fn_exit:
     if (info_keyval_vectors) {
 	MPIDI_free_pmi_keyvals(info_keyval_vectors, count, info_keyval_sizes);

http://git.mpich.org/mpich.git/commitdiff/feb17d535e2eb7982b769389259238432222e403

commit feb17d535e2eb7982b769389259238432222e403
Author: Charles Archer <archerc at us.ibm.com>
Date:   Mon Jan 21 04:13:16 2013 -0500

    Change the default alltoall throttle from 4 to 32 outstanding sends
    
    (ibm) D186947
    (ibm) 83f4d03ce02075afbd2240dbe16ccd83f496172a
    
    Signed-off-by: Bob Cernohous <bobc at us.ibm.com>

diff --git a/src/util/param/params.yml b/src/util/param/params.yml
index 5d5fe3f..adb6e56 100644
--- a/src/util/param/params.yml
+++ b/src/util/param/params.yml
@@ -68,7 +68,7 @@ parameters:
     - category    : collective
       name        : ALLTOALL_THROTTLE
       type        : int
-      default     : 4
+      default     : 32
       description : >-
         max no. of irecvs/isends posted at a time in some alltoall
         algorithms. Setting it to 0 causes all irecvs/isends to be
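The throttle caps how many irecv/isend pairs an alltoall algorithm keeps in flight at once: requests are posted in batches of at most ALLTOALL_THROTTLE, waited on, then the next batch is posted, with 0 meaning everything is posted at once. A sketch of the batch-sizing arithmetic that such a loop implies (illustrative, not MPICH's internal code):

```c
/* With throttle T, the next batch posts min(T, remaining) requests;
 * T == 0 disables throttling and posts everything that is left. */
static int batch_size(int throttle, int remaining)
{
    if (throttle == 0 || throttle > remaining)
        return remaining;      /* unthrottled, or final partial batch */
    return throttle;
}
```

Raising the default from 4 to 32 lets eight times as many messages overlap before the algorithm stalls on a wait, at the cost of more posted-request memory.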

http://git.mpich.org/mpich.git/commitdiff/a6885c7b09d0b29efab6e2f3540d333e6944cb03

commit a6885c7b09d0b29efab6e2f3540d333e6944cb03
Author: Su Huang <suhuang at us.ibm.com>
Date:   Tue Jan 15 15:08:48 2013 -0500

    Incorrect MPI_IO default key value
    
    (ibm) D188160
    (ibm) fcae45edf7a6d4baeb3dc031eb4d8c026dbad553
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index 1f90db5..f4412c0 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -984,6 +984,7 @@ int MPID_Init(int * argc,
   /* ------------------------------- */
   MPIR_Process.attrs.tag_ub = INT_MAX;
   MPIR_Process.attrs.wtime_is_global = 1;
+  MPIR_Process.attrs.io   = MPI_ANY_SOURCE;
 
 
   /* ------------------------------- */
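The one-line fix sets the predefined MPI_IO attribute on MPI_COMM_WORLD to MPI_ANY_SOURCE, which per the MPI standard means every rank may perform language-standard I/O (the other legal values are MPI_PROC_NULL for "no rank" or a specific rank number). A sketch of how a caller would interpret the attribute, using hypothetical stand-in constants rather than the implementation's real sentinel values:

```c
/* Hypothetical stand-ins for MPI_ANY_SOURCE and MPI_PROC_NULL. */
enum { ANY_SOURCE = -2, PROC_NULL = -1 };

/* MPI_IO semantics: ANY_SOURCE => every rank can do I/O,
 * PROC_NULL => no rank can, otherwise only the named rank can. */
static int rank_can_do_io(int io_attr, int my_rank)
{
    if (io_attr == ANY_SOURCE) return 1;
    if (io_attr == PROC_NULL)  return 0;
    return io_attr == my_rank;
}
```

In a real program the value would come from MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_IO, ...), alongside the tag_ub and wtime_is_global attributes set just above in the hunk.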

http://git.mpich.org/mpich.git/commitdiff/c6846c1b9010fd447c46769dbf8cc58146859321

commit c6846c1b9010fd447c46769dbf8cc58146859321
Author: Su Huang <suhuang at us.ibm.com>
Date:   Thu Dec 20 22:46:46 2012 -0500

    NAMD core dumps due to unsafe MPICH2 iprobe() handling
    
    (ibm) D187900 GENCI
    (ibm) bbb6075e55017ad1e01711243df0930613268460
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/src/mpid_recvq.c b/src/mpid/pamid/src/mpid_recvq.c
index 853abac..ce32428 100644
--- a/src/mpid/pamid/src/mpid_recvq.c
+++ b/src/mpid/pamid/src/mpid_recvq.c
@@ -83,6 +83,18 @@ MPIDI_Recvq_FU(int source, int tag, int context_id, MPI_Status * status)
 #ifdef USE_STATISTICS
   unsigned search_length = 0;
 #endif
+#ifdef OUT_OF_ORDER_HANDLING
+  MPIDI_In_cntr_t *in_cntr;
+  uint nMsgs=0;
+  pami_task_t pami_source;
+  
+
+  if(source != MPI_ANY_SOURCE) {
+    pami_source = PAMIX_Endpoint_query(source);
+    in_cntr=&MPIDI_In_cntr[pami_source];
+    nMsgs = in_cntr->nMsgs + 1;
+  }
+#endif
 
   if (tag != MPI_ANY_TAG && source != MPI_ANY_SOURCE)
     {
@@ -91,17 +103,31 @@ MPIDI_Recvq_FU(int source, int tag, int context_id, MPI_Status * status)
 #ifdef USE_STATISTICS
         ++search_length;
 #endif
+#ifdef OUT_OF_ORDER_HANDLING
+        if( ((int)(nMsgs-MPIDI_Request_getMatchSeq(rreq))) >= 0 )
+        { 
+#endif
         if ( (MPIDI_Request_getMatchCtxt(rreq) == context_id) &&
              (MPIDI_Request_getMatchRank(rreq) == source    ) &&
              (MPIDI_Request_getMatchTag(rreq)  == tag       )
              )
           {
+#ifdef OUT_OF_ORDER_HANDLING
+            if (rreq->mpid.nextR != NULL) {  /* recv is in the out of order list */
+              if (MPIDI_Request_getMatchSeq(rreq) == nMsgs) {
+                in_cntr->nMsgs=nMsgs;
+              MPIDI_Recvq_remove_req_from_ool(rreq,in_cntr);
+            } 
+           } 
+#endif
             found = TRUE;
             if(status != MPI_STATUS_IGNORE)
               *status = (rreq->status);
             break;
           }
-
+#ifdef OUT_OF_ORDER_HANDLING
+       }
+#endif
         rreq = rreq->mpid.next;
       }
     }
@@ -138,17 +164,39 @@ MPIDI_Recvq_FU(int source, int tag, int context_id, MPI_Status * status)
 #ifdef USE_STATISTICS
         ++search_length;
 #endif
+#ifdef OUT_OF_ORDER_HANDLING
+        if(( ( (int)(nMsgs-MPIDI_Request_getMatchSeq(rreq))) >= 0) || (source == MPI_ANY_SOURCE)) {
+#endif
         if ( (  MPIDI_Request_getMatchCtxt(rreq)              == match.context_id) &&
              ( (MPIDI_Request_getMatchRank(rreq) & mask.rank) == match.rank      ) &&
              ( (MPIDI_Request_getMatchTag(rreq)  & mask.tag ) == match.tag       )
              )
           {
+#ifdef OUT_OF_ORDER_HANDLING
+            if(source == MPI_ANY_SOURCE) {
+              pami_source= MPIDI_Request_getPeerRank_pami(rreq);
+              in_cntr = &MPIDI_In_cntr[pami_source];
+              nMsgs = in_cntr->nMsgs+1;
+              if((int) (nMsgs-MPIDI_Request_getMatchSeq(rreq)) < 0 )
+                 goto NEXT_MSG;
+
+            }
+            if (rreq->mpid.nextR != NULL)  { /* recv is in the out of order list */
+              if (MPIDI_Request_getMatchSeq(rreq) == nMsgs)
+                in_cntr->nMsgs=nMsgs;
+              MPIDI_Recvq_remove_req_from_ool(rreq,in_cntr);
+            }
+#endif
             found = TRUE;
-            if(status != MPI_STATUS_IGNORE)
+            if(status != MPI_STATUS_IGNORE) 
               *status = (rreq->status);
             break;
           }
 
+#ifdef OUT_OF_ORDER_HANDLING
+        }
+     NEXT_MSG:
+#endif
         rreq = rreq->mpid.next;
       }
     }
@@ -488,7 +536,6 @@ MPIDI_Recvq_AEU(MPID_Request *newreq, int source, pami_task_t pami_source, int t
   memset(rstatus,0,sizeof(recv_status));
 #endif
 
-
   in_cntr = &MPIDI_In_cntr[pami_source];
   MPIDI_Request_setMatch(rreq, tag, source, context_id); /* mpi rank needed */
   MPIDI_Request_setPeerRank_pami(rreq, pami_source);
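The iprobe fix only lets a queued request match once its sequence number falls within the messages already accounted for from that source. The `(int)(nMsgs - MPIDI_Request_getMatchSeq(rreq)) >= 0` idiom in the hunk above is a wraparound-safe ordering test on unsigned counters: the subtraction is done modulo 2^32 and the result reinterpreted as signed (implementation-defined in C, but two's complement in practice), so the comparison stays correct after the counter wraps. A standalone sketch:

```c
/* True when seq is at or before the nMsgs watermark, even across a
 * 32-bit wrap of the counters. */
static int seq_arrived(unsigned nMsgs, unsigned seq)
{
    return (int)(nMsgs - seq) >= 0;
}
```

A plain `nMsgs >= seq` comparison would misorder messages as soon as either counter wrapped past zero.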

http://git.mpich.org/mpich.git/commitdiff/634f95f6d91a7f5b8af8fbde5021ca7fb93e348e

commit 634f95f6d91a7f5b8af8fbde5021ca7fb93e348e
Author: Su Huang <suhuang at us.ibm.com>
Date:   Mon Dec 17 15:17:49 2012 -0500

    job fails w/ MP_BUFFER_MEM set on x86_64
    
    (ibm) D187797
    (ibm) 8Q9
    (ibm) 031ac21508e6289a6987aaf2d1b858b0649eb586
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/src/mpidi_env.c b/src/mpid/pamid/src/mpidi_env.c
index d141a69..f19272a 100644
--- a/src/mpid/pamid/src/mpidi_env.c
+++ b/src/mpid/pamid/src/mpidi_env.c
@@ -1175,6 +1175,7 @@ int  MPIDI_get_buf_mem(unsigned long *buf_mem,unsigned long *buf_mem_max)
             TRACE_ERR("ERROR in MP_BUFFER_MEM %s(%d)\n",__FILE__,__LINE__);
             return 1;
          }
+        return 0;
      } else {
          /* MP_BUFFER_MEM is not specified by the user*/
          *buf_mem     = BUFFER_MEM_DEFAULT;

http://git.mpich.org/mpich.git/commitdiff/e6ab6613f7ab50009b83bc760e2e3b63b71f30c6

commit e6ab6613f7ab50009b83bc760e2e3b63b71f30c6
Author: Haizhu Liu <haizhu at us.ibm.com>
Date:   Tue Dec 11 09:27:20 2012 -0500

    Fix for concurrent_spawns
    
    (ibm) D187684
    (ibm) 8d36c1a79e548218e3c9d82fa63c099adc331e24
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c b/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
index 2eb6c71..461e4ad 100644
--- a/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
+++ b/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
@@ -46,7 +46,7 @@ static int  MPIDI_mpi_to_pmi_keyvals( MPID_Info *info_ptr, PMI_keyval_t **kv_ptr
 	kv[i].key = MPIU_Strdup(key);
 	kv[i].val = MPIU_Malloc( vallen + 1 );
 	MPIR_Info_get_impl( info_ptr, key, vallen+1, kv[i].val, &flag );
-	TRACE_OUT(("key: <%s>, value: <%s>\n", kv[i].key, kv[i].val));
+	TRACE_OUT("key: <%s>, value: <%s>\n", kv[i].key, kv[i].val);
     }
 
  fn_fail:
diff --git a/src/mpid/pamid/src/dyntask/mpidi_pg.c b/src/mpid/pamid/src/dyntask/mpidi_pg.c
index fe0cb3c..af58e9f 100644
--- a/src/mpid/pamid/src/dyntask/mpidi_pg.c
+++ b/src/mpid/pamid/src/dyntask/mpidi_pg.c
@@ -29,7 +29,7 @@ static int verbose = 0;
 
 /* Key track of the process group corresponding to the MPI_COMM_WORLD
    of this process */
-static MPIDI_PG_t *pg_world = NULL;
+MPIDI_PG_t *pg_world = NULL;
 
 #define MPIDI_MAX_KVS_KEY_LEN      256
 
@@ -97,8 +97,12 @@ int MPIDI_PG_Finalize(void)
    my_max_worldid  = -1;
 
    while(NULL != conn_node) {
-     if(conn_node->rem_world_id>my_max_worldid && conn_node->ref_count>0)
+     if(conn_node->rem_world_id>my_max_worldid && conn_node->ref_count>0) {
+       TRACE_ERR("conn_node->rem_world_id=%d conn_node->ref_count=%d\n", conn_node->rem_world_id, conn_node->ref_count);
        my_max_worldid = conn_node->rem_world_id;
+     } else {
+       TRACE_ERR("conn_node->rem_world_id=%d conn_node->ref_count=%d\n", conn_node->rem_world_id, conn_node->ref_count);
+     }
      conn_node = conn_node->next;
    }
    MPIR_Allreduce_impl( &my_max_worldid, &world_max_worldid, 1, MPI_INT, MPI_MAX,   MPIR_Process.comm_world, &mpi_errno);
@@ -111,6 +115,7 @@ int MPIDI_PG_Finalize(void)
     * don't add 1, then the bit array will be size 1 byte, and when
     * we try to set a bit in position 8, we will get a segfault.
     */
+   TRACE_ERR("my_max_worldid=%d world_max_worldid=%d\n", my_max_worldid, world_max_worldid);
    if(world_max_worldid != -1) {
      world_max_worldid++;
      wid_bit_array_size = (world_max_worldid + CHAR_BIT -1) / CHAR_BIT;
@@ -141,8 +146,6 @@ int MPIDI_PG_Finalize(void)
      MPIU_Snprintf(key, PMI2_MAX_KEYLEN-1, "%s", "ROOTWIDARRAY");
      MPIU_Snprintf(value, PMI2_MAX_VALLEN-1, "%s", root_wid_barray);
      TRACE_ERR("root_wid_barray=%s\n", value);
-     key[strlen(key)+1]='\0';
-     value[strlen(value)+1]='\0';
      mpi_errno = PMI2_KVS_Put(key, value);
      TRACE_ERR("PMI2_KVS_Put returned with mpi_errno=%d\n", mpi_errno);
 
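[Editor's note] The comment in the hunk above explains why world_max_worldid is incremented before sizing the bit array: bit indices run from 0 through world_max_worldid inclusive, so the byte count is a ceiling divide over the incremented value. The arithmetic can be sketched as a standalone helper (wid_bit_array_bytes is a hypothetical name, not part of the patch):

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper mirroring the sizing logic in MPIDI_PG_Finalize():
 * number of bytes needed for a bit array holding bits 0..max_id inclusive. */
static int wid_bit_array_bytes(int world_max_worldid)
{
    world_max_worldid++;  /* make room for bit index world_max_worldid itself */
    return (world_max_worldid + CHAR_BIT - 1) / CHAR_BIT;  /* ceiling divide */
}
```

Without the increment, a max world id of 8 would yield a 1-byte array, and setting bit 8 would write past it, which is exactly the segfault the comment warns about.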
diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index 9b966a7..1f90db5 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -31,6 +31,8 @@
 #define PAMIX_CLIENT_DYNAMIC_TASKING 1032
 #define PAMIX_CLIENT_WORLD_TASKS     1033
 #define MAX_JOBID_LEN                1024
+int     world_rank;
+int     world_size;
 #endif
 int mpidi_dynamic_tasking = 0;
 
@@ -310,8 +312,8 @@ MPIDI_PAMI_client_init(int* rank, int* size, int* mpidi_dynamic_tasking, char **
               config2[1].value.intval,
               config2[2].value.intval,
               config2[3].value.chararray);
-    *rank = config2[0].value.intval;
-    *size = config2[1].value.intval;
+    *rank = world_rank = config2[0].value.intval;
+    *size = world_size = config2[1].value.intval;
     *mpidi_dynamic_tasking  = config2[2].value.intval;
     *world_tasks = config2[3].value.chararray;
   }
diff --git a/src/pmi/pmi2/poe/poe2pmi.c b/src/pmi/pmi2/poe/poe2pmi.c
index ca68e18..11c0d1a 100644
--- a/src/pmi/pmi2/poe/poe2pmi.c
+++ b/src/pmi/pmi2/poe/poe2pmi.c
@@ -39,7 +39,13 @@
 
 #define MAX_INT_STR_LEN 11 /* number of digits in MAX_UINT + 1 */
 
+struct worldExitReq {
+  pami_work_t work;
+  int         world_id;
+};
+
 int (*mp_world_exiting_handler)(int) = NULL;
+
 typedef enum { PMI2_UNINITIALIZED = 0, NORMAL_INIT_WITH_PM = 1 } PMI2State;
 static PMI2State PMI2_initialized = PMI2_UNINITIALIZED;
 
@@ -52,6 +58,10 @@ static int PMI2_debug_init = 0;    /* Set this to true to debug the init */
 
 int PMI2_pmiverbose = 0;    /* Set this to true to print PMI debugging info */
 
+extern MPIDI_PG_t *pg_world;
+extern int world_rank;
+extern int world_size;
+
 #ifdef MPICH_IS_THREADED
 static MPID_Thread_mutex_t mutex;
 static int blocked = FALSE;
@@ -82,14 +92,14 @@ int PMI2_Init(int *spawned, int *size, int *rank, int *appnum)
         TRACE_ERR("failed to open libpoe.so\n");
     }
 
-    mp_world_exiting_handler = &(_mpi_world_exiting_handler);
-
     pmi2_init = (int (*)())dlsym(poeptr, "PMI2_Init");
     if (pmi2_init == NULL) {
         TRACE_ERR("failed to dlsym PMI2_Init\n");
     }
 
-    return (*pmi2_init)(spawned, size, rank, appnum);
+    ret = (*pmi2_init)(spawned, size, rank, appnum);
+    mp_world_exiting_handler = &(_mpi_world_exiting_handler);
+    return ret;
 }
 
 int PMI2_Finalize(void)
@@ -261,7 +271,7 @@ int PMI2_Info_GetJobAttr(const char name[], char value[], int valuelen, int *fla
  * This is the mpi level of callback that get invoked when a task get notified
  * of a world's exiting
  */
-int _mpi_world_exiting_handler(int world_id)
+int _mpi_world_exiting_handler_wrapper(pami_context_t context, void *cookie)
 {
   /* check the reference count associated with that remote world
      if the reference count is zero, the task will call LAPI_Purge_totask on
@@ -276,10 +286,13 @@ int _mpi_world_exiting_handler(int world_id)
   char world_id_str[32];
   int mpi_errno = MPI_SUCCESS;
   pami_endpoint_t dest;
+  struct worldExitReq *req = (struct worldExitReq *)cookie;
+  int world_id = req->world_id;
+  MPID_Comm *comm = MPIR_Process.comm_world;
 
   if(!mpidi_finalized) {
     ref_count = MPIDI_get_refcnt_of_world(world_id);
-    TRACE_ERR("_mpi_world_exiting_handler: invoked for world %d exiting ref_count=%d\n", world_id, ref_count);
+    TRACE_ERR("_mpi_world_exiting_handler: invoked for world %d exiting ref_count=%d my comm_word_size=%d\n", world_id, ref_count, world_size);
     if(ref_count == 0) {
       taskid_list = MPIDI_get_taskids_in_world_id(world_id);
       if(taskid_list != NULL) {
@@ -288,8 +301,7 @@ int _mpi_world_exiting_handler(int world_id)
 	  MPIDI_OpState_reset(taskid_list[i]);
 	  MPIDI_IpState_reset(taskid_list[i]);
 	  TRACE_ERR("PAMI_Purge on taskid_list[%d]=%d\n", i,taskid_list[i]);
-	  if(MPIDI_Context[0])
-            PAMI_Purge(MPIDI_Context[0], &dest, 1);
+            PAMI_Purge(context, &dest, 1);
         }
         MPIDI_delete_conn_record(world_id);
       }
@@ -297,18 +309,16 @@ int _mpi_world_exiting_handler(int world_id)
     }
     my_state = TRUE;
 
-/*  _mpi_reduce_for_dyntask(&my_state, &reduce_state); */
-    if(MPIDI_Context[0])
-      MPIR_Reduce_impl(&my_state,&reduce_state,1,
-                       MPI_INT,MPI_LAND,0,MPIR_Process.comm_world,&mpi_errno);
+    rc = _mpi_reduce_for_dyntask(&my_state, &reduce_state);
+    if(rc) return rc;
+	
     TRACE_ERR("_mpi_world_exiting_handler: Out of _mpi_reduce_for_dyntask for exiting world %d reduce_state=%d\n",world_id, reduce_state);
   }
 
-  if(MPIR_Process.comm_world->rank == 0) {
+  if(comm->rank == 0) {
     MPIU_Snprintf(world_id_str, sizeof(world_id_str), "%d", world_id);
     PMI2_Abort(0, world_id_str);
-/*    _mp_send_exiting_ack(world_id); */
-    if(MPIDI_Context[0] && (reduce_state != TRUE)) {
+    if((reduce_state != world_size)) {
       TRACE_ERR("root is exiting with error\n");
       exit(-1);
     }
@@ -321,5 +331,119 @@ int _mpi_world_exiting_handler(int world_id)
     rc = -2;
   }
 
-  return rc;
+  if(cookie) MPIU_Free(cookie);
+  return PAMI_SUCCESS;
+}
+
+
+int _mpi_world_exiting_handler(int world_id)
+{
+    struct worldExitReq *req;
+    req = MPIU_Malloc(sizeof(struct worldExitReq));
+    req->world_id = world_id;
+
+    if(!mpidi_finalized && MPIDI_Context[0])
+      PAMI_Context_post(MPIDI_Context[0], &(req->work), _mpi_world_exiting_handler_wrapper, req);
+
+    return MPI_SUCCESS;
+}
+
+
+int getchildren(int iam, double alpha,int gsize, int *children,
+                int *blocks, int *numchildren, int *parent)
+{
+  int fakeme=iam,i;
+  int p=gsize,pbig,bflag=0,blocks_from_children=0;
+
+  *numchildren=0;
+
+  if( blocks != NULL )
+    bflag=1;
+
+   while( p > 1 ) {
+
+     pbig = MAX(1,MIN((int) (alpha*(double)p), p-1));
+
+     if ( fakeme == 0 ) {
+
+        (children)[*numchildren] = (iam+pbig+gsize)%gsize;
+        if(bflag)
+          (blocks)[*numchildren] = p -pbig;
+
+        *numchildren +=1;
+     }
+     if ( fakeme == pbig ) {
+        *parent = (iam-pbig+gsize)%gsize;
+        if(bflag)
+          blocks_from_children = p - pbig;
+     }
+     if( pbig > fakeme) {
+       p = pbig;
+     } else {
+       p -=pbig;
+       fakeme -=pbig;
+     }
+   }
+   if(bflag)
+      (blocks)[*numchildren] = blocks_from_children;
+}
+
+int _mpi_reduce_for_dyntask(int *sendbuf, int *recvbuf)
+{
+  int         *children, gid, child_rank, parent_rank, rc;
+  int         numchildren, parent=0, i, result=0,tag, remaining_child_count;
+  MPID_Comm   *comm_ptr;
+  int         mpi_errno;
+
+  int TASKS= world_size;
+  children = MPIU_Malloc(TASKS*sizeof(int));
+
+  comm_ptr = MPIR_Process.comm_world;
+
+  if(pg_world && pg_world->id)
+    tag = (-1) * (atoi(pg_world->id));
+  else {
+    TRACE_ERR("pg_world hasn't been created, should skip the rest of the handler and return\n");
+    return -1;
+  }
+
+  result = *sendbuf;
+
+  getchildren(world_rank, 0.5, TASKS, children, NULL, &numchildren, &parent);
+
+  TRACE_ERR("_mpi_reduce_for_dyntask - numchildren=%d parent=%d world_rank=%d\n", numchildren, parent, world_rank);
+  for(i=numchildren-1;i>=0;i--)
+  {
+    remaining_child_count = i;
+    child_rank = (children[i])% TASKS;
+    mpi_errno = MPIC_Recv(recvbuf, sizeof(int),MPI_BYTE, pg_world->vct[child_rank].taskid, tag, comm_ptr->handle, MPI_STATUS_IGNORE);
+
+    if(world_rank != parent)
+    {
+      if(remaining_child_count == 0) {
+        parent_rank = (parent) % TASKS;
+        result += *recvbuf;
+        MPIC_Send(&result, sizeof(int), MPI_BYTE, pg_world->vct[parent_rank].taskid, tag, comm_ptr->handle);
+      }
+      else
+      {
+        result += *recvbuf;
+      }
+    }
+    if(world_rank == 0)
+    {
+      result += *recvbuf;
+    }
+  }
+
+  if(world_rank != parent && numchildren == 0) {
+    parent_rank = (parent) % TASKS;
+    MPIC_Send(sendbuf, sizeof(int), MPI_BYTE, pg_world->vct[parent_rank].taskid, tag, comm_ptr->handle);
+  }
+
+  if(world_rank == 0) {
+    *recvbuf = result;
+  }
+  MPIU_Free(children);
+  return 0;
 }
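[Editor's note] The getchildren()/_mpi_reduce_for_dyntask() pair added above builds a reduction tree by repeatedly splitting the remaining group at pbig = alpha * p. A trimmed, standalone copy of the child/parent computation (getchildren_sketch is an illustrative name; the blocks bookkeeping is dropped and MAX/MIN are spelled out) shows the shape of the tree:

```c
#include <assert.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Standalone sketch of getchildren() above: for rank `iam` in a group of
 * `gsize`, compute its children and its parent in the reduction tree by
 * recursively splitting the group of size p at pbig = alpha * p. */
static void getchildren_sketch(int iam, double alpha, int gsize,
                               int *children, int *numchildren, int *parent)
{
    int fakeme = iam, p = gsize, pbig;

    *numchildren = 0;
    while (p > 1) {
        pbig = MAX(1, MIN((int)(alpha * (double)p), p - 1));
        if (fakeme == 0)                     /* local root: adopt the split-off leader */
            children[(*numchildren)++] = (iam + pbig + gsize) % gsize;
        if (fakeme == pbig)                  /* split-off leader: record parent */
            *parent = (iam - pbig + gsize) % gsize;
        if (pbig > fakeme) {                 /* stay in the lower sub-group */
            p = pbig;
        } else {                             /* move into the upper sub-group */
            p -= pbig;
            fakeme -= pbig;
        }
    }
}
```

For gsize = 8 and alpha = 0.5 this yields a binomial-style tree: rank 0 gets children {4, 2, 1}, rank 4 gets {6, 5}, rank 2 gets {3}, rank 6 gets {7}, and every nonzero rank reports exactly one parent.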

http://git.mpich.org/mpich.git/commitdiff/5510e24f5fec7c51234b3df86ca9cac03ddf9edf

commit 5510e24f5fec7c51234b3df86ca9cac03ddf9edf
Author: sssharka <sssharka at us.ibm.com>
Date:   Mon Dec 10 10:27:25 2012 -0500

    Implement collective algorithm selection in MPICH
    
    (ibm) F183421
    (ibm) b36666c383fd5b1388fd470788a9922e84cdc35e
    
    Signed-off-by: Bob Cernohous <bobc at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_constants.h b/src/mpid/pamid/include/mpidi_constants.h
index 0a5d997..be0d18c 100644
--- a/src/mpid/pamid/include/mpidi_constants.h
+++ b/src/mpid/pamid/include/mpidi_constants.h
@@ -95,4 +95,25 @@ MPID_EPOTYPE_FENCE     = 4,       /**< MPI_Win_fence access/exposure epoch */
 MPID_EPOTYPE_REFENCE   = 5,       /**< MPI_Win_fence possible access/exposure epoch */
 };
 
+enum
+{
+  MPID_AUTO_SELECT_COLLS_NONE            = 0,
+  MPID_AUTO_SELECT_COLLS_BARRIER         = 1,
+  MPID_AUTO_SELECT_COLLS_BCAST           = ((int)((MPID_AUTO_SELECT_COLLS_BARRIER        << 1) & 0xFFFFFFFF)),
+  MPID_AUTO_SELECT_COLLS_ALLGATHER       = ((int)((MPID_AUTO_SELECT_COLLS_BCAST          << 1) & 0xFFFFFFFF)),
+  MPID_AUTO_SELECT_COLLS_ALLGATHERV      = ((int)((MPID_AUTO_SELECT_COLLS_ALLGATHER      << 1) & 0xFFFFFFFF)),
+  MPID_AUTO_SELECT_COLLS_ALLREDUCE       = ((int)((MPID_AUTO_SELECT_COLLS_ALLGATHERV     << 1) & 0xFFFFFFFF)),
+  MPID_AUTO_SELECT_COLLS_ALLTOALL        = ((int)((MPID_AUTO_SELECT_COLLS_ALLREDUCE      << 1) & 0xFFFFFFFF)),
+  MPID_AUTO_SELECT_COLLS_ALLTOALLV       = ((int)((MPID_AUTO_SELECT_COLLS_ALLTOALL       << 1) & 0xFFFFFFFF)),
+  MPID_AUTO_SELECT_COLLS_EXSCAN          = ((int)((MPID_AUTO_SELECT_COLLS_ALLTOALLV      << 1) & 0xFFFFFFFF)),
+  MPID_AUTO_SELECT_COLLS_GATHER          = ((int)((MPID_AUTO_SELECT_COLLS_EXSCAN         << 1) & 0xFFFFFFFF)),
+  MPID_AUTO_SELECT_COLLS_GATHERV         = ((int)((MPID_AUTO_SELECT_COLLS_GATHER         << 1) & 0xFFFFFFFF)),
+  MPID_AUTO_SELECT_COLLS_REDUCE_SCATTER  = ((int)((MPID_AUTO_SELECT_COLLS_GATHERV        << 1) & 0xFFFFFFFF)),
+  MPID_AUTO_SELECT_COLLS_REDUCE          = ((int)((MPID_AUTO_SELECT_COLLS_REDUCE_SCATTER << 1) & 0xFFFFFFFF)),
+  MPID_AUTO_SELECT_COLLS_SCAN            = ((int)((MPID_AUTO_SELECT_COLLS_REDUCE         << 1) & 0xFFFFFFFF)),
+  MPID_AUTO_SELECT_COLLS_SCATTER         = ((int)((MPID_AUTO_SELECT_COLLS_SCAN           << 1) & 0xFFFFFFFF)),
+  MPID_AUTO_SELECT_COLLS_SCATTERV        = ((int)((MPID_AUTO_SELECT_COLLS_SCATTER        << 1) & 0xFFFFFFFF)),
+  MPID_AUTO_SELECT_COLLS_ALL             = 0xFFFFFFFF,
+};
+
 #endif
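[Editor's note] The enum added above is a shift chain: each collective gets the next bit position (masked to 32 bits), so the values can be OR-ed into a selection mask and tested individually, with MPID_AUTO_SELECT_COLLS_ALL covering everything. A minimal sketch of the same pattern (the SEL_* names here are illustrative, not the real constants):

```c
#include <assert.h>

/* Same construction as MPID_AUTO_SELECT_COLLS_*: a shift chain where each
 * enumerator occupies the next bit, suitable for use as a bitmask. */
enum {
  SEL_NONE      = 0,
  SEL_BARRIER   = 1,
  SEL_BCAST     = SEL_BARRIER << 1,
  SEL_ALLGATHER = SEL_BCAST << 1,
  SEL_ALL       = 0xFFFFFFFF,
};
```

Enabling a subset is then a plain OR, and membership is a plain AND, which is how a runtime knob like auto_select_colls can switch individual collectives on or off.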
diff --git a/src/mpid/pamid/include/mpidi_datatypes.h b/src/mpid/pamid/include/mpidi_datatypes.h
index 5e4b30d..bca2e5c 100644
--- a/src/mpid/pamid/include/mpidi_datatypes.h
+++ b/src/mpid/pamid/include/mpidi_datatypes.h
@@ -109,9 +109,10 @@ typedef struct
 
   struct
   {
-    unsigned collectives;  /**< Enable optimized collective functions. */
+    unsigned collectives;       /**< Enable optimized collective functions. */
     unsigned subcomms;
-    unsigned select_colls; /**< Enable collective selection */
+    unsigned select_colls;      /**< Enable collective selection */
+    unsigned auto_select_colls; /**< Enable automatic collective selection */
   }
   optimized;
 
diff --git a/src/mpid/pamid/include/mpidi_prototypes.h b/src/mpid/pamid/include/mpidi_prototypes.h
index 316b681..77632f1 100644
--- a/src/mpid/pamid/include/mpidi_prototypes.h
+++ b/src/mpid/pamid/include/mpidi_prototypes.h
@@ -217,52 +217,93 @@ void MPIDI_Comm_coll_select(MPID_Comm *comm);
 void MPIDI_Coll_register    (void);
 
 int MPIDO_Bcast(void *buffer, int count, MPI_Datatype dt, int root, MPID_Comm *comm_ptr, int *mpierrno);
+int MPIDO_Bcast_simple(void *buffer, int count, MPI_Datatype dt, int root, MPID_Comm *comm_ptr, int *mpierrno);
 int MPIDO_Barrier(MPID_Comm *comm_ptr, int *mpierrno);
+int MPIDO_Barrier_simple(MPID_Comm *comm_ptr, int *mpierrno);
 
 int MPIDO_Allreduce(const void *sbuffer, void *rbuffer, int count,
                     MPI_Datatype datatype, MPI_Op op, MPID_Comm *comm_ptr, int *mpierrno);
+int MPIDO_Allreduce_simple(const void *sbuffer, void *rbuffer, int count,
+                    MPI_Datatype datatype, MPI_Op op, MPID_Comm *comm_ptr, int *mpierrno);
 int MPIDO_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, 
                  MPI_Op op, int root, MPID_Comm *comm_ptr, int *mpierrno);
+int MPIDO_Reduce_simple(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, 
+                 MPI_Op op, int root, MPID_Comm *comm_ptr, int *mpierrno);
 int MPIDO_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                     void *recvbuf, int recvcount, MPI_Datatype recvtype,
                     MPID_Comm *comm_ptr, int *mpierrno);
+int MPIDO_Allgather_simple(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
+                    void *recvbuf, int recvcount, MPI_Datatype recvtype,
+                    MPID_Comm *comm_ptr, int *mpierrno);
 
 int MPIDO_Allgatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                      void *recvbuf, const int *recvcounts, const int *displs,
                      MPI_Datatype recvtype, MPID_Comm * comm_ptr, int *mpierrno);
+int MPIDO_Allgatherv_simple(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
+                     void *recvbuf, const int *recvcounts, const int *displs,
+                     MPI_Datatype recvtype, MPID_Comm * comm_ptr, int *mpierrno);
+int MPIDO_Iallgatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
+                      void *recvbuf, const int *recvcounts, const int *displs,
+                      MPI_Datatype recvtype, MPID_Comm * comm_ptr,
+                      MPID_Request ** request);
 
 int MPIDO_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  void *recvbuf, int recvcount, MPI_Datatype recvtype,
                  int root, MPID_Comm * comm_ptr, int *mpierrno);
+int MPIDO_Gather_simple(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
+                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
+                 int root, MPID_Comm * comm_ptr, int *mpierrno);
 
 int MPIDO_Gatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                   void *recvbuf, const int *recvcounts, const int *displs, MPI_Datatype recvtype,
                   int root, MPID_Comm * comm_ptr, int *mpierrno);
+int MPIDO_Gatherv_simple(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
+                  void *recvbuf, const int *recvcounts, const int *displs, MPI_Datatype recvtype,
+                  int root, MPID_Comm * comm_ptr, int *mpierrno);
 
 int MPIDO_Scan(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
                MPI_Op op, MPID_Comm * comm_ptr, int *mpierrno);
+int MPIDO_Scan_simple(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
+               MPI_Op op, MPID_Comm * comm_ptr, int *mpierrno);
 
 int MPIDO_Exscan(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
                MPI_Op op, MPID_Comm * comm_ptr, int *mpierrno);
+int MPIDO_Exscan_simple(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
+               MPI_Op op, MPID_Comm * comm_ptr, int *mpierrno);
 
 int MPIDO_Scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                   void *recvbuf, int recvcount, MPI_Datatype recvtype,
                   int root, MPID_Comm * comm_ptr, int *mpierrno);
+int MPIDO_Scatter_simple(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
+                  void *recvbuf, int recvcount, MPI_Datatype recvtype,
+                  int root, MPID_Comm * comm_ptr, int *mpierrno);
 
 int MPIDO_Scatterv(const void *sendbuf, const int *sendcounts, const int *displs,
                    MPI_Datatype sendtype,
                    void *recvbuf, int recvcount, MPI_Datatype recvtype,
                    int root, MPID_Comm * comm_ptr, int *mpierrno);
+int MPIDO_Scatterv_simple(const void *sendbuf, const int *sendcounts, const int *displs,
+                   MPI_Datatype sendtype,
+                   void *recvbuf, int recvcount, MPI_Datatype recvtype,
+                   int root, MPID_Comm * comm_ptr, int *mpierrno);
 
 int MPIDO_Alltoallv(const void *sendbuf, const int *sendcounts, const int *senddispls,
                     MPI_Datatype sendtype,
                     void *recvbuf, const int *recvcounts, const int *recvdispls,
                     MPI_Datatype recvtype,
                     MPID_Comm *comm_ptr, int *mpierrno);
+int MPIDO_Alltoallv_simple(const void *sendbuf, const int *sendcounts, const int *senddispls,
+                    MPI_Datatype sendtype,
+                    void *recvbuf, const int *recvcounts, const int *recvdispls,
+                    MPI_Datatype recvtype,
+                    MPID_Comm *comm_ptr, int *mpierrno);
 
 int MPIDO_Alltoall(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                    void *recvbuf, int recvcount, MPI_Datatype recvtype,
                    MPID_Comm *comm_ptr, int *mpierrno);
+int MPIDO_Alltoall_simple(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
+                   void *recvbuf, int recvcount, MPI_Datatype recvtype,
+                   MPID_Comm *comm_ptr, int *mpierrno);
 
 int MPIDI_Datatype_to_pami(MPI_Datatype        dt,
                            pami_type_t        *pdt,
diff --git a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
index d8c8ff6..c346d89 100644
--- a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
+++ b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
@@ -473,3 +473,118 @@ MPIDO_Allgather(const void *sendbuf,
                          recvbuf, recvcount, recvtype,
                          comm_ptr, mpierrno);
 }
+
+
+int
+MPIDO_Allgather_simple(const void *sendbuf,
+                int sendcount,
+                MPI_Datatype sendtype,
+                void *recvbuf,
+                int recvcount,
+                MPI_Datatype recvtype,
+                MPID_Comm * comm_ptr,
+                int *mpierrno)
+{
+     /* *********************************
+   * Check the nature of the buffers
+   * *********************************
+   */
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+   MPID_Datatype * dt_null = NULL;
+   void *snd_noncontig_buff = NULL, *rcv_noncontig_buff = NULL;
+   MPI_Aint send_true_lb = 0;
+   MPI_Aint recv_true_lb = 0;
+   int snd_data_contig, rcv_data_contig;
+   size_t send_size = 0;
+   size_t recv_size = 0;
+   MPID_Segment segment;
+   volatile unsigned allgather_active = 1;
+   const int rank = comm_ptr->rank;
+
+   const pami_metadata_t *my_md;
+
+
+   char *rbuf = NULL, *sbuf = NULL;
+
+
+   if ((sendcount < 1 && sendbuf != MPI_IN_PLACE) || recvcount < 1)
+      return MPI_SUCCESS;
+
+   /* Gather datatype information */
+   MPIDI_Datatype_get_info(recvcount,
+			  recvtype,
+			  rcv_data_contig,
+			  recv_size,
+			  dt_null,
+			  recv_true_lb);
+
+   send_size = recv_size;
+   rbuf = (char *)recvbuf+recv_true_lb;
+
+   if(!rcv_data_contig)
+   {
+      rcv_noncontig_buff = MPIU_Malloc(recv_size);
+      rbuf = rcv_noncontig_buff;
+      if(rcv_noncontig_buff == NULL)
+      {
+         MPID_Abort(NULL, MPI_ERR_NO_SPACE, 1,
+            "Fatal:  Cannot allocate pack buffer");
+      }
+   }
+
+   MPIDI_Datatype_get_info(sendcount,
+                         sendtype,
+                         snd_data_contig,
+                         send_size,
+                         dt_null,
+                         send_true_lb);
+
+   sbuf = (char *)sendbuf+send_true_lb;
+   if(sendbuf == MPI_IN_PLACE) 
+     sbuf = (char *)recvbuf+recv_size*rank;
+
+   if(!snd_data_contig)
+   {
+      snd_noncontig_buff = MPIU_Malloc(send_size);
+      sbuf = snd_noncontig_buff;
+      if(snd_noncontig_buff == NULL)
+      {
+         MPID_Abort(NULL, MPI_ERR_NO_SPACE, 1,
+            "Fatal:  Cannot allocate pack buffer");
+      }
+      DLOOP_Offset last = send_size;
+      MPID_Segment_init(sendbuf != MPI_IN_PLACE?sendbuf:(void*)((char *)recvbuf+recv_size*rank), 
+	                    sendcount, sendtype, &segment, 0);
+      MPID_Segment_pack(&segment, 0, &last, snd_noncontig_buff);
+   }
+
+   TRACE_ERR("Using PAMI-level allgather protocol\n");
+   pami_xfer_t allgather;
+   allgather.cb_done = allgather_cb_done;
+   allgather.cookie = (void *)&allgather_active;
+   allgather.cmd.xfer_allgather.rcvbuf = rbuf;
+   allgather.cmd.xfer_allgather.sndbuf = sbuf;
+   allgather.cmd.xfer_allgather.stype = PAMI_TYPE_BYTE;
+   allgather.cmd.xfer_allgather.rtype = PAMI_TYPE_BYTE;
+   allgather.cmd.xfer_allgather.stypecount = send_size;
+   allgather.cmd.xfer_allgather.rtypecount = recv_size;
+   allgather.algorithm = mpid->coll_algorithm[PAMI_XFER_ALLGATHER][0][0];
+   my_md = &mpid->coll_metadata[PAMI_XFER_ALLGATHER][0][0];
+
+   TRACE_ERR("Calling PAMI_Collective with allgather structure\n");
+   MPIDI_Post_coll_t allgather_post;
+   MPIDI_Context_post(MPIDI_Context[0], &allgather_post.state, MPIDI_Pami_post_wrapper, (void *)&allgather);
+   TRACE_ERR("Allgather %s\n", MPIDI_Process.context_post.active>0?"posted":"invoked");
+
+   MPIDI_Update_last_algorithm(comm_ptr, my_md->name);
+   MPID_PROGRESS_WAIT_WHILE(allgather_active);
+   if(!rcv_data_contig)
+   {
+      MPIR_Localcopy(rcv_noncontig_buff, recv_size, MPI_CHAR,
+                        recvbuf,         recvcount,     recvtype);
+      MPIU_Free(rcv_noncontig_buff);   
+   }
+   if(!snd_data_contig)  MPIU_Free(snd_noncontig_buff);
+   TRACE_ERR("Allgather done\n");
+   return MPI_SUCCESS;
+}
\ No newline at end of file
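[Editor's note] In the MPI_IN_PLACE branch of MPIDO_Allgather_simple above, the send pointer is derived from the receive buffer: each rank's contribution already occupies its recv_size-byte slot, so sending is just an offset into recvbuf. A sketch of that offset computation (inplace_sendbuf is a hypothetical helper, assuming the contiguous case):

```c
#include <assert.h>
#include <stddef.h>

/* For MPI_IN_PLACE, rank i's contribution already sits at slot i of the
 * receive buffer (recv_size bytes per rank, contiguous case assumed),
 * mirroring `sbuf = (char *)recvbuf + recv_size * rank;` above. */
static const char *inplace_sendbuf(const char *recvbuf, size_t recv_size, int rank)
{
    return recvbuf + recv_size * (size_t)rank;
}
```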
diff --git a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
index e67ba76..5c182a5 100644
--- a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
+++ b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
@@ -517,3 +517,107 @@ MPIDO_Allgatherv(const void *sendbuf,
                        recvbuf, recvcounts, displs, recvtype,
                        comm_ptr, mpierrno);
 }
+
+int
+MPIDO_Allgatherv_simple(const void *sendbuf,
+		 int sendcount,
+		 MPI_Datatype sendtype,
+		 void *recvbuf,
+		 const int *recvcounts,
+		 const int *displs,
+		 MPI_Datatype recvtype,
+		 MPID_Comm * comm_ptr,
+                 int *mpierrno)
+{
+   TRACE_ERR("Entering MPIDO_Allgatherv_optimized\n");
+  /* function pointer to be used to point to appropriate algorithm */
+  /* Check the nature of the buffers */
+  MPID_Datatype *dt_null = NULL;
+  MPI_Aint send_true_lb  = 0;
+  MPI_Aint recv_true_lb  = 0;
+  size_t   send_size     = 0;
+  size_t   recv_size     = 0;
+  int snd_data_contig = 0, rcv_data_contig = 0;
+  void *snd_noncontig_buff = NULL, *rcv_noncontig_buff = NULL;
+  int scount=sendcount;
+
+  char *sbuf, *rbuf;
+  pami_type_t stype, rtype;
+  const int rank = comm_ptr->rank;
+  const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+
+  volatile unsigned allgatherv_active = 1;
+  int tmp;
+  const pami_metadata_t *my_md;
+
+   if((sendbuf != MPI_IN_PLACE) && (MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS))
+   {
+     return MPIR_Allgatherv(sendbuf, sendcount, sendtype,
+                       recvbuf, recvcounts, displs, recvtype,
+                       comm_ptr, mpierrno);
+   }
+   if(MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
+   {
+     return MPIR_Allgatherv(sendbuf, sendcount, sendtype,
+                       recvbuf, recvcounts, displs, recvtype,
+                       comm_ptr, mpierrno);
+   }
+   MPIDI_Datatype_get_info(1,
+			  recvtype,
+			  rcv_data_contig,
+			  recv_size,
+			  dt_null,
+			  recv_true_lb);
+
+   if(sendbuf == MPI_IN_PLACE)
+   {
+     sbuf = (char *)recvbuf+displs[rank]*recv_size;
+     send_true_lb = recv_true_lb;
+     scount = recvcounts[rank];
+     send_size = recv_size * scount; 
+   }
+   else
+   {
+      MPIDI_Datatype_get_info(sendcount,
+                              sendtype,
+                              snd_data_contig,
+                              send_size,
+                              dt_null,
+                              send_true_lb);
+       sbuf = (char *)sendbuf+send_true_lb;
+   }
+
+   rbuf = (char *)recvbuf+recv_true_lb;
+
+   if(!snd_data_contig || !rcv_data_contig)
+   {
+      return MPIR_Allgatherv(sendbuf, sendcount, sendtype,
+                       recvbuf, recvcounts, displs, recvtype,
+                       comm_ptr, mpierrno);
+   }
+
+   pami_xfer_t allgatherv;
+   allgatherv.cb_done = allgatherv_cb_done;
+   allgatherv.cookie = (void *)&allgatherv_active;
+   allgatherv.cmd.xfer_allgatherv_int.sndbuf = sbuf;
+   allgatherv.cmd.xfer_allgatherv_int.rcvbuf = rbuf;
+   allgatherv.cmd.xfer_allgatherv_int.stype = stype;
+   allgatherv.cmd.xfer_allgatherv_int.rtype = rtype;
+   allgatherv.cmd.xfer_allgatherv_int.stypecount = scount;
+   allgatherv.cmd.xfer_allgatherv_int.rtypecounts = (int *) recvcounts;
+   allgatherv.cmd.xfer_allgatherv_int.rdispls = (int *) displs;
+   allgatherv.algorithm = mpid->coll_algorithm[PAMI_XFER_ALLGATHERV_INT][0][0];
+   my_md = &mpid->coll_metadata[PAMI_XFER_ALLGATHERV_INT][0][0];
+
+   TRACE_ERR("Calling allgatherv via %s()\n", MPIDI_Process.context_post.active>0?"PAMI_Collective":"PAMI_Context_post");
+   MPIDI_Post_coll_t allgatherv_post;
+   MPIDI_Context_post(MPIDI_Context[0], &allgatherv_post.state,
+                      MPIDI_Pami_post_wrapper, (void *)&allgatherv);
+
+   MPIDI_Update_last_algorithm(comm_ptr, my_md->name);
+
+   TRACE_ERR("Rank %d waiting on active %d\n", rank, allgatherv_active);
+   MPID_PROGRESS_WAIT_WHILE(allgatherv_active);
+
+   return MPI_SUCCESS;
+}
\ No newline at end of file
diff --git a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
index b77b27d..d6c0d0a 100644
--- a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
+++ b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
@@ -377,3 +377,79 @@ int MPIDO_Allreduce(const void *sendbuf,
    return MPI_SUCCESS;
 }
 
+int MPIDO_Allreduce_simple(const void *sendbuf,
+                    void *recvbuf,
+                    int count,
+                    MPI_Datatype dt,
+                    MPI_Op op,
+                    MPID_Comm *comm_ptr,
+                    int *mpierrno)
+{
+   void *sbuf;
+   TRACE_ERR("Entering MPIDO_Allreduce_optimized\n");
+   pami_type_t pdt;
+   pami_data_function pop;
+   int mu;
+   int rc;
+#ifdef TRACE_ON
+    int len; 
+    char op_str[255]; 
+    char dt_str[255]; 
+    MPIDI_Op_to_string(op, op_str); 
+    PMPI_Type_get_name(dt, dt_str, &len); 
+#endif
+   volatile unsigned active = 1;
+   pami_xfer_t allred;
+   const pami_metadata_t *my_allred_md = (pami_metadata_t *)NULL;
+   const int rank = comm_ptr->rank;
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+   MPID_Datatype *data_ptr;
+   MPI_Aint data_true_lb = 0;
+   int data_size, data_contig;
+
+   rc = MPIDI_Datatype_to_pami(dt, &pdt, op, &pop, &mu);
+
+      /* convert to metadata query */
+  /* Punt count 0 allreduce to MPICH. Let them do whatever's 'right' */
+  if(unlikely(rc != MPI_SUCCESS || (count==0)))
+   {
+      MPIDI_Update_last_algorithm(comm_ptr, "ALLREDUCE_MPICH");
+      return MPIR_Allreduce(sendbuf, recvbuf, count, dt, op, comm_ptr, mpierrno);
+   }
+   MPIDI_Datatype_get_info(1, dt,
+			   data_contig, data_size, data_ptr, data_true_lb);
+
+   if(!data_contig)
+   {
+      MPIDI_Update_last_algorithm(comm_ptr, "ALLREDUCE_MPICH");
+      return MPIR_Allreduce(sendbuf, recvbuf, count, dt, op, comm_ptr, mpierrno);
+   }
+
+  sbuf = (void *)sendbuf;
+  if(unlikely(sendbuf == MPI_IN_PLACE))
+  {
+     sbuf = recvbuf;
+  }
+
+  allred.cb_done = cb_allreduce;
+  allred.cookie = (void *)&active;
+  allred.cmd.xfer_allreduce.sndbuf = sbuf;
+  allred.cmd.xfer_allreduce.stype = pdt;
+  allred.cmd.xfer_allreduce.rcvbuf = recvbuf;
+  allred.cmd.xfer_allreduce.rtype = pdt;
+  allred.cmd.xfer_allreduce.stypecount = count;
+  allred.cmd.xfer_allreduce.rtypecount = count;
+  allred.cmd.xfer_allreduce.op = pop;
+  allred.algorithm = mpid->coll_algorithm[PAMI_XFER_ALLREDUCE][0][0];
+  my_allred_md = &mpid->coll_metadata[PAMI_XFER_ALLREDUCE][0][0];
+
+  MPIDI_Post_coll_t allred_post;
+  MPIDI_Context_post(MPIDI_Context[0], &allred_post.state,
+                      MPIDI_Pami_post_wrapper, (void *)&allred);
+
+  MPID_assert(rc == PAMI_SUCCESS);
+  MPIDI_Update_last_algorithm(comm_ptr,my_allred_md->name);
+  MPID_PROGRESS_WAIT_WHILE(active);
+  TRACE_ERR("Allreduce done\n");
+  return MPI_SUCCESS;
+}
\ No newline at end of file
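[Editor's note] MPIDO_Allreduce_simple above follows the dispatch shape shared by all of the *_simple routines in this commit: map the MPI datatype/op to a device (PAMI) equivalent, and punt to the generic MPICH implementation when the mapping fails, the data is noncontiguous, or count is zero. A condensed sketch of that pattern (every name and return value here is a stand-in, not the real API):

```c
#include <assert.h>

/* Stand-ins for the two code paths: the generic MPICH fallback and the
 * device-optimized path. Return values are arbitrary markers for testing. */
static int generic_allreduce(int count) { return 100 + count; }
static int device_allreduce(int count)  { return count; }

/* Condensed version of the dispatch logic: use the device path only when
 * the datatype maps cleanly and there is work to do ("punt count 0
 * allreduce to MPICH", as the source comment puts it). */
static int allreduce_dispatch(int datatype_is_contig, int count)
{
    if (!datatype_is_contig || count == 0)
        return generic_allreduce(count);   /* ALLREDUCE_MPICH fallback */
    return device_allreduce(count);
}
```

The same guard-then-fallback structure appears again in MPIDO_Allgatherv_simple and MPIDO_Alltoall_simple below, each returning the corresponding MPIR_* routine when the fast path does not apply.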
diff --git a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
index 4688236..d36e87e 100644
--- a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
+++ b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
@@ -172,3 +172,88 @@ int MPIDO_Alltoall(const void *sendbuf,
    TRACE_ERR("Leaving alltoall\n");
   return PAMI_SUCCESS;
 }
+
+
+int MPIDO_Alltoall_simple(const void *sendbuf,
+                   int sendcount,
+                   MPI_Datatype sendtype,
+                   void *recvbuf,
+                   int recvcount,
+                   MPI_Datatype recvtype,
+                   MPID_Comm *comm_ptr,
+                   int *mpierrno)
+{
+   TRACE_ERR("Entering MPIDO_Alltoall_optimized\n");
+   volatile unsigned active = 1;
+   MPID_Datatype *sdt, *rdt;
+   pami_type_t stype, rtype;
+   MPI_Aint sdt_true_lb=0, rdt_true_lb;
+   MPIDI_Post_coll_t alltoall_post;
+   int sndlen, rcvlen, snd_contig, rcv_contig, pamidt=1;
+   int tmp;
+
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+
+   MPIDI_Datatype_get_info(1, sendtype, snd_contig, sndlen, sdt, sdt_true_lb);
+   if(!snd_contig) pamidt = 0;
+   MPIDI_Datatype_get_info(1, recvtype, rcv_contig, rcvlen, rdt, rdt_true_lb);
+   if(!rcv_contig) pamidt = 0;
+
+   /* Alltoall is much simpler if bytes are required because we don't need to
+    * malloc displ/count arrays and copy things
+    */
+
+
+   /* Is it a built in type? If not, send to MPICH */
+   if(MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS)
+      pamidt = 0;
+   if(MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
+      pamidt = 0;
+
+   if(sendbuf ==  MPI_IN_PLACE)
+      pamidt = 0;
+
+   if(pamidt == 0)
+   {
+      return MPIR_Alltoall_intra(sendbuf, sendcount, sendtype,
+                      recvbuf, recvcount, recvtype,
+                      comm_ptr, mpierrno);
+
+   }
+
+   pami_xfer_t alltoall;
+   const pami_metadata_t *my_alltoall_md;
+   my_alltoall_md = &mpid->coll_metadata[PAMI_XFER_ALLTOALL][0][0];
+
+   char *pname = my_alltoall_md->name;
+   TRACE_ERR("Using alltoall protocol %s\n", pname);
+
+   alltoall.cb_done = cb_alltoall;
+   alltoall.cookie = (void *)&active;
+   alltoall.algorithm = mpid->coll_algorithm[PAMI_XFER_ALLTOALL][0][0];
+   if(sendbuf == MPI_IN_PLACE)
+   {
+      alltoall.cmd.xfer_alltoall.stype = rtype;
+      alltoall.cmd.xfer_alltoall.stypecount = recvcount;
+      alltoall.cmd.xfer_alltoall.sndbuf = (char *)recvbuf + rdt_true_lb;
+   }
+   else
+   {
+      alltoall.cmd.xfer_alltoall.stype = stype;
+      alltoall.cmd.xfer_alltoall.stypecount = sendcount;
+      alltoall.cmd.xfer_alltoall.sndbuf = (char *)sendbuf + sdt_true_lb;
+   }
+   alltoall.cmd.xfer_alltoall.rcvbuf = (char *)recvbuf + rdt_true_lb;
+
+   alltoall.cmd.xfer_alltoall.rtypecount = recvcount;
+   alltoall.cmd.xfer_alltoall.rtype = rtype;
+
+   MPIDI_Context_post(MPIDI_Context[0], &alltoall_post.state,
+                      MPIDI_Pami_post_wrapper, (void *)&alltoall);
+
+   TRACE_ERR("Waiting on active\n");
+   MPID_PROGRESS_WAIT_WHILE(active);
+
+   TRACE_ERR("Leaving MPIDO_Alltoall_simple\n");
+   return MPI_SUCCESS;
+}
\ No newline at end of file
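The PAMI-vs-MPICH fallback decision in MPIDO_Alltoall_simple reduces to a single predicate: take the PAMI path only when both datatypes are contiguous, both map to native PAMI types, and the send buffer is not MPI_IN_PLACE. A minimal stand-alone sketch of that predicate (the struct and helper names are illustrative, not PAMID symbols):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch of the pamidt checks in MPIDO_Alltoall_simple:
 * fall back to the MPICH implementation unless every condition holds. */
typedef struct {
    bool snd_contig;     /* sendtype is contiguous */
    bool rcv_contig;     /* recvtype is contiguous */
    bool snd_is_pami;    /* MPIDI_Datatype_to_pami(sendtype, ...) succeeded */
    bool rcv_is_pami;    /* MPIDI_Datatype_to_pami(recvtype, ...) succeeded */
    bool send_in_place;  /* sendbuf == MPI_IN_PLACE */
} coll_args_t;

static bool use_pami_alltoall(const coll_args_t *a)
{
    return a->snd_contig && a->rcv_contig &&
           a->snd_is_pami && a->rcv_is_pami &&
           !a->send_in_place;
}
```

Any failed condition routes the call to MPIR_Alltoall_intra, matching the `if(pamidt == 0)` branch in the diff.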
diff --git a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
index 856a247..d59f905 100644
--- a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
+++ b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
@@ -170,3 +170,94 @@ int MPIDO_Alltoallv(const void *sendbuf,
 
    return 0;
 }
+
+
+int MPIDO_Alltoallv_simple(const void *sendbuf,
+                   const int *sendcounts,
+                   const int *senddispls,
+                   MPI_Datatype sendtype,
+                   void *recvbuf,
+                   const int *recvcounts,
+                   const int *recvdispls,
+                   MPI_Datatype recvtype,
+                   MPID_Comm *comm_ptr,
+                   int *mpierrno)
+{
+   TRACE_ERR("Entering MPIDO_Alltoallv_simple\n");
+   volatile unsigned active = 1;
+   int sndtypelen, rcvtypelen, snd_contig, rcv_contig;
+   MPID_Datatype *sdt, *rdt;
+   pami_type_t stype, rtype;
+   MPI_Aint sdt_true_lb, rdt_true_lb;
+   MPIDI_Post_coll_t alltoallv_post;
+   int pamidt = 1;
+   int tmp;
+   const int rank = comm_ptr->rank;
+
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+
+   if(MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS)
+      pamidt = 0;
+   if(MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
+      pamidt = 0;
+
+
+   MPIDI_Datatype_get_info(1, sendtype, snd_contig, sndtypelen, sdt, sdt_true_lb);
+   if(!snd_contig) pamidt = 0;
+   MPIDI_Datatype_get_info(1, recvtype, rcv_contig, rcvtypelen, rdt, rdt_true_lb);
+   if(!rcv_contig) pamidt = 0;
+
+   if(sendbuf == MPI_IN_PLACE)
+      pamidt = 0;
+
+   if(pamidt == 0)
+   {
+      return MPIR_Alltoallv(sendbuf, sendcounts, senddispls, sendtype,
+                              recvbuf, recvcounts, recvdispls, recvtype,
+                              comm_ptr, mpierrno);
+
+   }   
+
+   pami_xfer_t alltoallv;
+   const pami_metadata_t *my_alltoallv_md;
+   my_alltoallv_md = &mpid->coll_metadata[PAMI_XFER_ALLTOALLV_INT][0][0];
+
+   alltoallv.algorithm = mpid->coll_algorithm[PAMI_XFER_ALLTOALLV_INT][0][0];
+   char *pname = my_alltoallv_md->name;
+   TRACE_ERR("Using alltoallv protocol %s\n", pname);
+
+   alltoallv.cb_done = cb_alltoallv;
+   alltoallv.cookie = (void *)&active;
+   /* Use the xfer_alltoallv_int variant: MPI counts and displacements are always ints. */
+   if(sendbuf == MPI_IN_PLACE)
+   {
+      alltoallv.cmd.xfer_alltoallv_int.stype = rtype;
+      alltoallv.cmd.xfer_alltoallv_int.sdispls = (int *) recvdispls;
+      alltoallv.cmd.xfer_alltoallv_int.stypecounts = (int *) recvcounts;
+      alltoallv.cmd.xfer_alltoallv_int.sndbuf = (char *)recvbuf+rdt_true_lb;
+   }
+   else
+   {
+      alltoallv.cmd.xfer_alltoallv_int.stype = stype;
+      alltoallv.cmd.xfer_alltoallv_int.sdispls = (int *) senddispls;
+      alltoallv.cmd.xfer_alltoallv_int.stypecounts = (int *) sendcounts;
+      alltoallv.cmd.xfer_alltoallv_int.sndbuf = (char *)sendbuf+sdt_true_lb;
+   }
+   alltoallv.cmd.xfer_alltoallv_int.rcvbuf = (char *)recvbuf+rdt_true_lb;
+      
+   alltoallv.cmd.xfer_alltoallv_int.rdispls = (int *) recvdispls;
+   alltoallv.cmd.xfer_alltoallv_int.rtypecounts = (int *) recvcounts;
+   alltoallv.cmd.xfer_alltoallv_int.rtype = rtype;
+
+
+   MPIDI_Context_post(MPIDI_Context[0], &alltoallv_post.state,
+                      MPIDI_Pami_post_wrapper, (void *)&alltoallv);
+
+   TRACE_ERR("%d waiting on active %u\n", rank, active);
+   MPID_PROGRESS_WAIT_WHILE(active);
+
+
+   TRACE_ERR("Leaving MPIDO_Alltoallv_simple\n");
+
+
+   return MPI_SUCCESS;
+}
\ No newline at end of file
diff --git a/src/mpid/pamid/src/coll/barrier/mpido_barrier.c b/src/mpid/pamid/src/coll/barrier/mpido_barrier.c
index b08070e..abf51d9 100644
--- a/src/mpid/pamid/src/coll/barrier/mpido_barrier.c
+++ b/src/mpid/pamid/src/coll/barrier/mpido_barrier.c
@@ -102,3 +102,33 @@ int MPIDO_Barrier(MPID_Comm *comm_ptr, int *mpierrno)
    TRACE_ERR("exiting mpido_barrier\n");
    return 0;
 }
+
+
+int MPIDO_Barrier_simple(MPID_Comm *comm_ptr, int *mpierrno)
+{
+   TRACE_ERR("Entering MPIDO_Barrier_simple\n");
+   volatile unsigned active=1;
+   MPIDI_Post_coll_t barrier_post;
+   pami_xfer_t barrier;
+   pami_algorithm_t my_barrier;
+   const pami_metadata_t *my_barrier_md;
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+ 
+   barrier.cb_done = cb_barrier;
+   barrier.cookie = (void *)&active;
+   my_barrier = mpid->coll_algorithm[PAMI_XFER_BARRIER][0][0];
+   my_barrier_md = &mpid->coll_metadata[PAMI_XFER_BARRIER][0][0];
+   barrier.algorithm = my_barrier;
+
+
+   TRACE_ERR("%s barrier\n", MPIDI_Process.context_post.active>0?"posting":"invoking");
+   MPIDI_Context_post(MPIDI_Context[0], &barrier_post.state,
+                      MPIDI_Pami_post_wrapper, (void *)&barrier);
+   TRACE_ERR("barrier %s\n", MPIDI_Process.context_post.active>0?"posted":"invoked");
+
+   TRACE_ERR("advance spinning\n");
+   MPIDI_Update_last_algorithm(comm_ptr, my_barrier_md->name);
+   MPID_PROGRESS_WAIT_WHILE(active);
+   TRACE_ERR("Exiting MPIDO_Barrier_simple\n");
+   return 0;
+}
\ No newline at end of file
diff --git a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
index ec85f11..73b0b1e 100644
--- a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
+++ b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
@@ -229,3 +229,82 @@ int MPIDO_Bcast(void *buffer,
    TRACE_ERR("leaving bcast\n");
    return 0;
 }
+
+
+int MPIDO_Bcast_simple(void *buffer,
+                int count,
+                MPI_Datatype datatype,
+                int root,
+                MPID_Comm *comm_ptr,
+                int *mpierrno)
+{
+   TRACE_ERR("Entering MPIDO_Bcast_simple\n");
+
+   int data_contig;
+   void *data_buffer    = NULL,
+        *noncontig_buff = NULL;
+   volatile unsigned active = 1;
+   MPI_Aint data_true_lb = 0;
+   MPID_Datatype *data_ptr;
+   MPID_Segment segment;
+   MPIDI_Post_coll_t bcast_post;
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+   const int rank = comm_ptr->rank;
+
+
+   /* Query the size of a single element first; the total size could overflow an int */
+   int data_size_one;
+   MPIDI_Datatype_get_info(1, datatype,
+			   data_contig, data_size_one, data_ptr, data_true_lb);
+
+   const int data_size = data_size_one*(size_t)count;
+
+   data_buffer = (char *)buffer + data_true_lb;
+
+   if(!data_contig)
+   {
+      noncontig_buff = MPIU_Malloc(data_size);
+      data_buffer = noncontig_buff;
+      if(noncontig_buff == NULL)
+      {
+         MPID_Abort(NULL, MPI_ERR_NO_SPACE, 1,
+            "Fatal:  Cannot allocate pack buffer");
+      }
+      if(rank == root)
+      {
+         DLOOP_Offset last = data_size;
+         MPID_Segment_init(buffer, count, datatype, &segment, 0);
+         MPID_Segment_pack(&segment, 0, &last, noncontig_buff);
+      }
+   }
+
+   pami_xfer_t bcast;
+   const pami_metadata_t *my_bcast_md;
+   int queryreq = 0;
+
+   bcast.cb_done = cb_bcast;
+   bcast.cookie = (void *)&active;
+   bcast.cmd.xfer_broadcast.root = MPID_VCR_GET_LPID(comm_ptr->vcr, root);
+   bcast.algorithm = mpid->coll_algorithm[PAMI_XFER_BROADCAST][0][0];
+   bcast.cmd.xfer_broadcast.buf = data_buffer;
+   bcast.cmd.xfer_broadcast.type = PAMI_TYPE_BYTE;
+   /* Needs to be sizeof(type)*count since we are using bytes as the generic type */
+   bcast.cmd.xfer_broadcast.typecount = data_size;
+   my_bcast_md = &mpid->coll_metadata[PAMI_XFER_BROADCAST][0][0];
+
+   MPIDI_Context_post(MPIDI_Context[0], &bcast_post.state, MPIDI_Pami_post_wrapper, (void *)&bcast);
+   MPIDI_Update_last_algorithm(comm_ptr, my_bcast_md->name);
+   MPID_PROGRESS_WAIT_WHILE(active);
+   TRACE_ERR("bcast done\n");
+
+   if(!data_contig)
+   {
+      if(rank != root)
+         MPIR_Localcopy(noncontig_buff, data_size, MPI_CHAR,
+                        buffer,         count,     datatype);
+      MPIU_Free(noncontig_buff);
+   }
+
+   TRACE_ERR("Exiting MPIDO_Bcast_simple\n");
+   return 0;
+}
\ No newline at end of file
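For noncontiguous datatypes, MPIDO_Bcast_simple stages the data through a contiguous scratch buffer: the root packs before the transfer and every other rank unpacks afterwards. A self-contained sketch of that round-trip, with memcpy over a fixed stride standing in for MPID_Segment_pack / MPIR_Localcopy (the helper names are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pack `count` elements of size `elem`, laid out `stride` bytes apart,
 * into a contiguous buffer (the root-side staging step). */
static void pack_strided(const char *src, size_t stride, size_t elem,
                         size_t count, char *dst)
{
    for (size_t i = 0; i < count; i++)
        memcpy(dst + i * elem, src + i * stride, elem);
}

/* Scatter a contiguous buffer back out to the strided layout
 * (the non-root unpack step after the broadcast completes). */
static void unpack_strided(const char *src, size_t elem, size_t count,
                           char *dst, size_t stride)
{
    for (size_t i = 0; i < count; i++)
        memcpy(dst + i * stride, src + i * elem, elem);
}
```

In the real code the scratch buffer is MPIU_Malloc'd, the broadcast itself moves PAMI_TYPE_BYTE data, and only ranks other than the root unpack.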
diff --git a/src/mpid/pamid/src/coll/gather/mpido_gather.c b/src/mpid/pamid/src/coll/gather/mpido_gather.c
index 534e129..77d9926 100644
--- a/src/mpid/pamid/src/coll/gather/mpido_gather.c
+++ b/src/mpid/pamid/src/coll/gather/mpido_gather.c
@@ -299,3 +299,87 @@ int MPIDO_Gather(const void *sendbuf,
    TRACE_ERR("Leaving MPIDO_Gather\n");
    return 0;
 }
+
+
+int MPIDO_Gather_simple(const void *sendbuf,
+                 int sendcount,
+                 MPI_Datatype sendtype,
+                 void *recvbuf,
+                 int recvcount,
+                 MPI_Datatype recvtype,
+                 int root,
+                 MPID_Comm *comm_ptr,
+		 int *mpierrno)
+{
+  MPID_Datatype * data_ptr;
+  MPI_Aint true_lb = 0;
+  pami_xfer_t gather;
+  MPIDI_Post_coll_t gather_post;
+  int success = 1, contig, send_bytes=-1, recv_bytes = 0;
+  const int rank = comm_ptr->rank;
+  const int size = comm_ptr->local_size;
+  const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+
+  MPIDI_Datatype_get_info(sendcount, sendtype, contig,
+                            send_bytes, data_ptr, true_lb);
+  if (!contig)
+      success = 0;
+
+  if (success && rank == root)
+  {
+    if (recvtype != MPI_DATATYPE_NULL && recvcount >= 0)
+    {
+      MPIDI_Datatype_get_info(recvcount, recvtype, contig,
+                              recv_bytes, data_ptr, true_lb);
+      if (!contig) success = 0;
+    }
+    else
+      success = 0;
+  }
+
+  MPIDI_Update_last_algorithm(comm_ptr, "GATHER_MPICH");
+  if(!success)
+  {
+    return MPIR_Gather(sendbuf, sendcount, sendtype,
+                       recvbuf, recvcount, recvtype,
+                       root, comm_ptr, mpierrno);
+  }
+
+
+   const pami_metadata_t *my_gather_md;
+   volatile unsigned active = 1;
+
+   gather.cb_done = cb_gather;
+   gather.cookie = (void *)&active;
+   gather.cmd.xfer_gather.root = MPID_VCR_GET_LPID(comm_ptr->vcr, root);
+   if(sendbuf == MPI_IN_PLACE) 
+   {
+     gather.cmd.xfer_gather.stypecount = recv_bytes;
+     gather.cmd.xfer_gather.sndbuf = (char *)recvbuf + recv_bytes*rank;
+   }
+   else
+   {
+     gather.cmd.xfer_gather.stypecount = send_bytes;
+     gather.cmd.xfer_gather.sndbuf = (void *)sendbuf;
+   }
+   gather.cmd.xfer_gather.stype = PAMI_TYPE_BYTE;
+   gather.cmd.xfer_gather.rcvbuf = (void *)recvbuf;
+   gather.cmd.xfer_gather.rtype = PAMI_TYPE_BYTE;
+   gather.cmd.xfer_gather.rtypecount = recv_bytes;
+   gather.algorithm = mpid->coll_algorithm[PAMI_XFER_GATHER][0][0];
+   my_gather_md = &mpid->coll_metadata[PAMI_XFER_GATHER][0][0];
+
+   MPIDI_Update_last_algorithm(comm_ptr,
+            my_gather_md->name);
+
+
+   TRACE_ERR("%s gather\n", MPIDI_Process.context_post.active>0?"Posting":"Invoking");
+   MPIDI_Context_post(MPIDI_Context[0], &gather_post.state,
+                      MPIDI_Pami_post_wrapper, (void *)&gather);
+
+   TRACE_ERR("Waiting on active: %d\n", active);
+   MPID_PROGRESS_WAIT_WHILE(active);
+
+   TRACE_ERR("Leaving MPIDO_Gather_simple\n");
+   return MPI_SUCCESS;
+}
\ No newline at end of file
diff --git a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
index 67db093..d653527 100644
--- a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
+++ b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
@@ -185,3 +185,111 @@ int MPIDO_Gatherv(const void *sendbuf,
    TRACE_ERR("Leaving MPIDO_Gatherv\n");
    return 0;
 }
+
+
+int MPIDO_Gatherv_simple(const void *sendbuf, 
+                  int sendcount, 
+                  MPI_Datatype sendtype,
+                  void *recvbuf, 
+                  const int *recvcounts, 
+                  const int *displs, 
+                  MPI_Datatype recvtype,
+                  int root, 
+                  MPID_Comm * comm_ptr, 
+                  int *mpierrno)
+
+{
+   TRACE_ERR("Entering MPIDO_Gatherv_simple\n");
+   int rc;
+   int contig, rsize=0, ssize=0;
+   int pamidt = 1;
+   MPID_Datatype *dt_ptr = NULL;
+   MPI_Aint send_true_lb, recv_true_lb;
+   char *sbuf, *rbuf;
+   pami_type_t stype, rtype;
+   int tmp;
+   volatile unsigned gatherv_active = 1;
+   const int rank = comm_ptr->rank;
+
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+
+
+   /* Check for native PAMI types and MPI_IN_PLACE on sendbuf */
+   /* MPI_IN_PLACE is a nonlocal decision. We will need a preallreduce if we ever have
+    * multiple "good" gathervs that work on different counts for example */
+   if((MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS))
+      pamidt = 0;
+   if(MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
+      pamidt = 0;
+
+   MPIDI_Datatype_get_info(1, recvtype, contig, rsize, dt_ptr, recv_true_lb);
+   if(!contig) pamidt = 0;
+
+   rbuf = (char *)recvbuf + recv_true_lb;
+   sbuf = (void *) sendbuf;
+   pami_xfer_t gatherv;
+   if(rank == root)
+   {
+      if(sendbuf == MPI_IN_PLACE) 
+      {
+         sbuf = (char*)rbuf + rsize*displs[rank];
+         gatherv.cmd.xfer_gatherv_int.stype = rtype;
+         gatherv.cmd.xfer_gatherv_int.stypecount = recvcounts[rank];
+      }
+      else
+      {
+         MPIDI_Datatype_get_info(1, sendtype, contig, ssize, dt_ptr, send_true_lb);
+         if(!contig) pamidt = 0;
+         sbuf = (char *)sbuf + send_true_lb;
+         gatherv.cmd.xfer_gatherv_int.stype = stype;
+         gatherv.cmd.xfer_gatherv_int.stypecount = sendcount;
+      }
+   }
+   else
+   {
+      gatherv.cmd.xfer_gatherv_int.stype = stype;
+      gatherv.cmd.xfer_gatherv_int.stypecount = sendcount;     
+   }
+
+   if(pamidt == 0)
+   {
+      TRACE_ERR("GATHERV using MPICH\n");
+      MPIDI_Update_last_algorithm(comm_ptr, "GATHERV_MPICH");
+      return MPIR_Gatherv(sendbuf, sendcount, sendtype,
+               recvbuf, recvcounts, displs, recvtype,
+               root, comm_ptr, mpierrno);
+   }
+
+   gatherv.cb_done = cb_gatherv;
+   gatherv.cookie = (void *)&gatherv_active;
+   gatherv.cmd.xfer_gatherv_int.root = MPID_VCR_GET_LPID(comm_ptr->vcr, root);
+   gatherv.cmd.xfer_gatherv_int.rcvbuf = rbuf;
+   gatherv.cmd.xfer_gatherv_int.rtype = rtype;
+   gatherv.cmd.xfer_gatherv_int.rtypecounts = (int *) recvcounts;
+   gatherv.cmd.xfer_gatherv_int.rdispls = (int *) displs;
+
+   gatherv.cmd.xfer_gatherv_int.sndbuf = sbuf;
+
+   const pami_metadata_t *my_gatherv_md;
+
+   gatherv.algorithm = mpid->coll_algorithm[PAMI_XFER_GATHERV_INT][0][0];
+   my_gatherv_md = &mpid->coll_metadata[PAMI_XFER_GATHERV_INT][0][0];
+
+   MPIDI_Update_last_algorithm(comm_ptr, my_gatherv_md->name);
+
+   MPIDI_Post_coll_t gatherv_post;
+   TRACE_ERR("%s gatherv\n", MPIDI_Process.context_post.active>0?"Posting":"Invoking");
+   MPIDI_Context_post(MPIDI_Context[0], &gatherv_post.state,
+                      MPIDI_Pami_post_wrapper, (void *)&gatherv);
+   TRACE_ERR("Gatherv %s\n", MPIDI_Process.context_post.active>0?"posted":"invoked");
+   
+   TRACE_ERR("Waiting on active %d\n", gatherv_active);
+   MPID_PROGRESS_WAIT_WHILE(gatherv_active);
+
+   TRACE_ERR("Leaving MPIDO_Gatherv_simple\n");
+   return MPI_SUCCESS;
+}
\ No newline at end of file
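The MPI_IN_PLACE handling at the root in MPIDO_Gatherv_simple reads the root's own contribution directly out of its slot in the receive buffer, at recvbuf plus the receive-type extent times displs[rank]. A sketch of that address computation (inplace_sendbuf is an illustrative helper, not a PAMID symbol):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of `sbuf = (char*)rbuf + rsize*displs[rank]` from
 * MPIDO_Gatherv_simple: locate the root's in-place contribution
 * inside the receive buffer. Widen to size_t before multiplying. */
static const char *inplace_sendbuf(const char *recvbuf, int rsize,
                                   const int *displs, int rank)
{
    return recvbuf + (size_t)rsize * (size_t)displs[rank];
}
```

The matching type and count for that in-place send then come from recvtype and recvcounts[rank], as in the diff.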
diff --git a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
index b84d868..a1780e6 100644
--- a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
+++ b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
@@ -180,3 +180,73 @@ int MPIDO_Reduce(const void *sendbuf,
    TRACE_ERR("Reduce done\n");
    return 0;
 }
+
+
+int MPIDO_Reduce_simple(const void *sendbuf, 
+                 void *recvbuf, 
+                 int count, 
+                 MPI_Datatype datatype,
+                 MPI_Op op, 
+                 int root, 
+                 MPID_Comm *comm_ptr, 
+                 int *mpierrno)
+
+{
+   MPID_Datatype *dt_null = NULL;
+   MPI_Aint true_lb = 0;
+   int dt_contig, tsize;
+   int mu;
+   char *sbuf, *rbuf;
+   pami_data_function pop;
+   pami_type_t pdt;
+   int rc;
+   int alg_selected = 0;
+
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+
+   rc = MPIDI_Datatype_to_pami(datatype, &pdt, op, &pop, &mu);
+
+   pami_xfer_t reduce;
+   const pami_metadata_t *my_reduce_md=NULL;
+   volatile unsigned reduce_active = 1;
+
+   MPIDI_Datatype_get_info(count, datatype, dt_contig, tsize, dt_null, true_lb);
+   if(rc != MPI_SUCCESS || !dt_contig)
+   {
+      return MPIR_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm_ptr, mpierrno);
+   }
+
+
+   rbuf = (char *)recvbuf + true_lb;
+   sbuf = (char *)sendbuf + true_lb;
+   if(sendbuf == MPI_IN_PLACE) 
+   {
+      sbuf = rbuf;
+   }
+
+   reduce.cb_done = reduce_cb_done;
+   reduce.cookie = (void *)&reduce_active;
+   reduce.algorithm = mpid->coll_algorithm[PAMI_XFER_REDUCE][0][0];
+   reduce.cmd.xfer_reduce.sndbuf = sbuf;
+   reduce.cmd.xfer_reduce.rcvbuf = rbuf;
+   reduce.cmd.xfer_reduce.stype = pdt;
+   reduce.cmd.xfer_reduce.rtype = pdt;
+   reduce.cmd.xfer_reduce.stypecount = count;
+   reduce.cmd.xfer_reduce.rtypecount = count;
+   reduce.cmd.xfer_reduce.op = pop;
+   reduce.cmd.xfer_reduce.root = MPID_VCR_GET_LPID(comm_ptr->vcr, root);
+   my_reduce_md = &mpid->coll_metadata[PAMI_XFER_REDUCE][0][0];
+
+   TRACE_ERR("%s reduce, context %d, algoname: %s\n", MPIDI_Process.context_post.active>0?"Posting":"Invoking", 0,
+                my_reduce_md->name);
+   MPIDI_Post_coll_t reduce_post;
+   MPIDI_Context_post(MPIDI_Context[0], &reduce_post.state,
+                         MPIDI_Pami_post_wrapper, (void *)&reduce);
+   TRACE_ERR("Reduce %s\n", MPIDI_Process.context_post.active>0?"posted":"invoked");
+
+   MPIDI_Update_last_algorithm(comm_ptr,
+                               my_reduce_md->name);
+   MPID_PROGRESS_WAIT_WHILE(reduce_active);
+   TRACE_ERR("Reduce done\n");
+   return MPI_SUCCESS;
+}
\ No newline at end of file
diff --git a/src/mpid/pamid/src/coll/scan/mpido_scan.c b/src/mpid/pamid/src/coll/scan/mpido_scan.c
index be79dfe..8097284 100644
--- a/src/mpid/pamid/src/coll/scan/mpido_scan.c
+++ b/src/mpid/pamid/src/coll/scan/mpido_scan.c
@@ -165,3 +165,85 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
    TRACE_ERR("Scan done\n");
    return rc;
 }
+
+
+int MPIDO_Doscan_simple(const void *sendbuf, void *recvbuf, 
+               int count, MPI_Datatype datatype,
+               MPI_Op op, MPID_Comm * comm_ptr, int *mpierrno, int exflag)
+{
+   MPID_Datatype *dt_null = NULL;
+   MPI_Aint true_lb = 0;
+   int dt_contig, tsize;
+   int mu;
+   char *sbuf, *rbuf;
+   pami_data_function pop;
+   pami_type_t pdt;
+   int rc;
+   const pami_metadata_t *my_md;
+
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+
+   rc = MPIDI_Datatype_to_pami(datatype, &pdt, op, &pop, &mu);
+
+   pami_xfer_t scan;
+   volatile unsigned scan_active = 1;
+   MPIDI_Datatype_get_info(count, datatype, dt_contig, tsize, dt_null, true_lb);
+   
+   if(rc != MPI_SUCCESS || !dt_contig)
+   {
+      if(exflag)
+         return MPIR_Exscan(sendbuf, recvbuf, count, datatype, op, comm_ptr, mpierrno);
+      else
+         return MPIR_Scan(sendbuf, recvbuf, count, datatype, op, comm_ptr, mpierrno);
+   }
+
+
+   rbuf = (char *)recvbuf + true_lb;
+   if(sendbuf == MPI_IN_PLACE) 
+   {
+      sbuf = rbuf;
+   }
+   else
+   {
+      sbuf = (char *)sendbuf + true_lb;
+   }
+
+   scan.cb_done = scan_cb_done;
+   scan.cookie = (void *)&scan_active;
+   scan.algorithm = mpid->coll_algorithm[PAMI_XFER_SCAN][0][0];
+   my_md = &mpid->coll_metadata[PAMI_XFER_SCAN][0][0];
+   scan.cmd.xfer_scan.sndbuf = sbuf;
+   scan.cmd.xfer_scan.rcvbuf = rbuf;
+   scan.cmd.xfer_scan.stype = pdt;
+   scan.cmd.xfer_scan.rtype = pdt;
+   scan.cmd.xfer_scan.stypecount = count;
+   scan.cmd.xfer_scan.rtypecount = count;
+   scan.cmd.xfer_scan.op = pop;
+   scan.cmd.xfer_scan.exclusive = exflag;
+   
+   MPIDI_Post_coll_t scan_post;
+   MPIDI_Context_post(MPIDI_Context[0], &scan_post.state,
+                      MPIDI_Pami_post_wrapper, (void *)&scan);
+   TRACE_ERR("Scan %s\n", MPIDI_Process.context_post.active>0?"posted":"invoked");
+   MPIDI_Update_last_algorithm(comm_ptr, my_md->name);
+   MPID_PROGRESS_WAIT_WHILE(scan_active);
+   TRACE_ERR("Scan done\n");
+   return rc;
+}
+
+
+int MPIDO_Exscan_simple(const void *sendbuf, void *recvbuf, 
+               int count, MPI_Datatype datatype,
+               MPI_Op op, MPID_Comm * comm_ptr, int *mpierrno)
+{
+   return MPIDO_Doscan_simple(sendbuf, recvbuf, count, datatype,
+                op, comm_ptr, mpierrno, 1);
+}
+
+int MPIDO_Scan_simple(const void *sendbuf, void *recvbuf, 
+               int count, MPI_Datatype datatype,
+               MPI_Op op, MPID_Comm * comm_ptr, int *mpierrno)
+{
+   return MPIDO_Doscan_simple(sendbuf, recvbuf, count, datatype,
+                op, comm_ptr, mpierrno, 0);
+}
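MPIDO_Scan_simple and MPIDO_Exscan_simple differ only in the exflag they pass to MPIDO_Doscan_simple, which PAMI forwards as xfer_scan.exclusive. A serial int-sum sketch of the two semantics that flag selects (illustrative only, not the distributed implementation):

```c
#include <assert.h>

/* exflag == 0: inclusive prefix reduction (MPI_Scan semantics);
 * exflag == 1: exclusive prefix reduction (MPI_Exscan semantics),
 * here specialized to a sum over ints for illustration. */
static void prefix_sum(const int *in, int *out, int n, int exflag)
{
    int acc = 0;
    for (int i = 0; i < n; i++) {
        if (exflag) { out[i] = acc; acc += in[i]; }   /* result excludes own value */
        else        { acc += in[i]; out[i] = acc; }   /* result includes own value */
    }
}
```

With exflag set, element 0 of the result is undefined in MPI terms; the sketch simply leaves the identity (0) there.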
diff --git a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
index d5896be..4aea4ee 100644
--- a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
+++ b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
@@ -273,6 +273,118 @@ int MPIDO_Scatter(const void *sendbuf,
 }
 
 
+int MPIDO_Scatter_simple(const void *sendbuf,
+                  int sendcount,
+                  MPI_Datatype sendtype,
+                  void *recvbuf,
+                  int recvcount,
+                  MPI_Datatype recvtype,
+                  int root,
+                  MPID_Comm *comm_ptr,
+                  int *mpierrno)
+{
+  MPID_Datatype * data_ptr;
+  MPI_Aint true_lb = 0;
+  int contig, nbytes = 0;
+  const int rank = comm_ptr->rank;
+  int success = 1;
+  pami_type_t stype, rtype;
+  int tmp;
+  int use_pami = 1;
+  const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+
+  if(MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS)
+    use_pami = 0;
+  if(MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
+    use_pami = 0;
+
+
+  if (rank == root)
+  {
+    if (sendtype != MPI_DATATYPE_NULL && sendcount >= 0)
+    {
+      MPIDI_Datatype_get_info(sendcount, sendtype, contig,
+                              nbytes, data_ptr, true_lb);
+      if (!contig) success = 0;
+    }
+    else
+      success = 0;
+
+    if (success)
+    {
+      if (recvtype != MPI_DATATYPE_NULL && recvcount >= 0)
+      {
+        MPIDI_Datatype_get_info(recvcount, recvtype, contig,
+                                nbytes, data_ptr, true_lb);
+        if (!contig) success = 0;
+      }
+      else success = 0;
+    }
+  }
+
+  else
+  {
+    if (recvtype != MPI_DATATYPE_NULL && recvcount >= 0)
+    {
+      MPIDI_Datatype_get_info(recvcount, recvtype, contig,
+                              nbytes, data_ptr, true_lb);
+      if (!contig) success = 0;
+    }
+    else
+      success = 0;
+  }
+  
+  if(!use_pami || !success)
+  {
+    MPIDI_Update_last_algorithm(comm_ptr, "SCATTER_MPICH");
+    return MPIR_Scatter(sendbuf, sendcount, sendtype,
+                        recvbuf, recvcount, recvtype,
+                        root, comm_ptr, mpierrno);
+  }
+
+   pami_xfer_t scatter;
+   MPIDI_Post_coll_t scatter_post;
+   const pami_metadata_t *my_scatter_md;
+   volatile unsigned scatter_active = 1;
+
+ 
+   scatter.algorithm = mpid->coll_algorithm[PAMI_XFER_SCATTER][0][0];
+   my_scatter_md = &mpid->coll_metadata[PAMI_XFER_SCATTER][0][0];
+
+   scatter.cb_done = cb_scatter;
+   scatter.cookie = (void *)&scatter_active;
+   scatter.cmd.xfer_scatter.root = MPID_VCR_GET_LPID(comm_ptr->vcr, root);
+   scatter.cmd.xfer_scatter.sndbuf = (void *)sendbuf;
+   scatter.cmd.xfer_scatter.stype = stype;
+   scatter.cmd.xfer_scatter.stypecount = sendcount;
+   if(recvbuf == MPI_IN_PLACE) 
+   {
+     MPIDI_Datatype_get_info(sendcount, sendtype, contig,
+                             nbytes, data_ptr, true_lb);
+     scatter.cmd.xfer_scatter.rcvbuf = (char *)sendbuf + nbytes*rank;
+     scatter.cmd.xfer_scatter.rtype = stype;
+     scatter.cmd.xfer_scatter.rtypecount = sendcount;
+   }
+   else
+   {
+     scatter.cmd.xfer_scatter.rcvbuf = (void *)recvbuf;
+     scatter.cmd.xfer_scatter.rtype = rtype;
+     scatter.cmd.xfer_scatter.rtypecount = recvcount;
+   }
+
+
+   TRACE_ERR("%s scatter\n", MPIDI_Process.context_post.active>0?"Posting":"Invoking");
+   MPIDI_Context_post(MPIDI_Context[0], &scatter_post.state,
+                      MPIDI_Pami_post_wrapper, (void *)&scatter);
+   TRACE_ERR("Waiting on active %d\n", scatter_active);
+   MPID_PROGRESS_WAIT_WHILE(scatter_active);
+
+
+   TRACE_ERR("Leaving MPIDO_Scatter_simple\n");
+
+   return MPI_SUCCESS;
+}
+
 
 #if 0 /* old glue-based scatter-via-bcast */
 
diff --git a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
index cafd9d2..5808c5b 100644
--- a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
+++ b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
@@ -392,7 +392,94 @@ int MPIDO_Scatterv(const void *sendbuf,
    return 0;
 }
 
+int MPIDO_Scatterv_simple(const void *sendbuf,
+                   const int *sendcounts,
+                   const int *displs,
+                   MPI_Datatype sendtype,
+                   void *recvbuf,
+                   int recvcount,
+                   MPI_Datatype recvtype,
+                   int root,
+                   MPID_Comm *comm_ptr,
+                   int *mpierrno)
+{
+  int snd_contig, rcv_contig, tmp, pamidt = 1;
+  int ssize, rsize;
+  MPID_Datatype *dt_ptr = NULL;
+  MPI_Aint send_true_lb=0, recv_true_lb;
+  char *sbuf, *rbuf;
+  pami_type_t stype, rtype;
+  const int rank = comm_ptr->rank;
+  const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+
+   pami_xfer_t scatterv;
+   const pami_metadata_t *my_scatterv_md;
+   volatile unsigned scatterv_active = 1;
+
+
+   if((recvbuf != MPI_IN_PLACE) && MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
+      pamidt = 0;
+
+   if(MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS)
+      pamidt = 0;
+   MPIDI_Datatype_get_info(1, sendtype, snd_contig, ssize, dt_ptr, send_true_lb);
+   MPIDI_Datatype_get_info(1, recvtype, rcv_contig, rsize, dt_ptr, recv_true_lb);
+
+   if(pamidt == 0 || !snd_contig || !rcv_contig)
+   {
+      TRACE_ERR("Scatterv using MPICH\n");
+      MPIDI_Update_last_algorithm(comm_ptr, "SCATTERV_MPICH");
+      return MPIR_Scatterv(sendbuf, sendcounts, displs, sendtype,
+                           recvbuf, recvcount, recvtype,
+                           root, comm_ptr, mpierrno);
+   }
+
+
+   sbuf = (char *)sendbuf + send_true_lb;
+   rbuf = recvbuf;
+
+   if(rank == root)
+   {
+      if(recvbuf == MPI_IN_PLACE) 
+      {
+        rbuf = (char *)sendbuf + ssize*displs[rank] + send_true_lb;
+      }
+      else
+      {
+        rbuf = (char *)recvbuf + recv_true_lb;
+      }
+   }
+
+   scatterv.cb_done = cb_scatterv;
+   scatterv.cookie = (void *)&scatterv_active;
+   scatterv.cmd.xfer_scatterv_int.root = MPID_VCR_GET_LPID(comm_ptr->vcr, root);
+
+   scatterv.algorithm = mpid->coll_algorithm[PAMI_XFER_SCATTERV_INT][0][0];
+   my_scatterv_md = &mpid->coll_metadata[PAMI_XFER_SCATTERV_INT][0][0];
+   
+   scatterv.cmd.xfer_scatterv_int.rcvbuf = rbuf;
+   scatterv.cmd.xfer_scatterv_int.sndbuf = sbuf;
+   scatterv.cmd.xfer_scatterv_int.stype = stype;
+   scatterv.cmd.xfer_scatterv_int.rtype = rtype;
+   scatterv.cmd.xfer_scatterv_int.stypecounts = (int *) sendcounts;
+   scatterv.cmd.xfer_scatterv_int.rtypecount = recvcount;
+   scatterv.cmd.xfer_scatterv_int.sdispls = (int *) displs;
+
+
+   MPIDI_Update_last_algorithm(comm_ptr, my_scatterv_md->name);
+
 
+   MPIDI_Post_coll_t scatterv_post;
+   TRACE_ERR("%s scatterv\n", MPIDI_Process.context_post.active>0?"Posting":"Invoking");
+   MPIDI_Context_post(MPIDI_Context[0], &scatterv_post.state,
+                      MPIDI_Pami_post_wrapper, (void *)&scatterv);
+
+   TRACE_ERR("Waiting on active %d\n", scatterv_active);
+   MPID_PROGRESS_WAIT_WHILE(scatterv_active);
+
+   TRACE_ERR("Leaving MPIDO_Scatterv_simple\n");
+   return MPI_SUCCESS;
+}
 
 
 #if 0
diff --git a/src/mpid/pamid/src/comm/mpid_selectcolls.c b/src/mpid/pamid/src/comm/mpid_selectcolls.c
index 7c1a4ed..7db6f94 100644
--- a/src/mpid/pamid/src/comm/mpid_selectcolls.c
+++ b/src/mpid/pamid/src/comm/mpid_selectcolls.c
@@ -393,6 +393,92 @@ void MPIDI_Comm_coll_envvars(MPID_Comm *comm)
       MPIDI_Check_protocols(names, comm, "gather", PAMI_XFER_GATHER);
    }
 
+   /* If automatic collective selection is enabled and the user has not explicitly
+      overridden it, use the auto-selected protocols; otherwise fall through to the
+      manual collective-selection code path. */
+   /* ************ Barrier ************ */
+   if((MPIDI_Process.optimized.auto_select_colls & MPID_AUTO_SELECT_COLLS_BARRIER) &&
+       comm->mpid.user_selected_type[PAMI_XFER_BARRIER] == MPID_COLL_NOSELECTION)
+   {
+     comm->coll_fns->Barrier      = MPIDO_Barrier_simple;
+   }
+   /* ************ Bcast ************ */
+   if((MPIDI_Process.optimized.auto_select_colls & MPID_AUTO_SELECT_COLLS_BCAST) &&
+       comm->mpid.user_selected_type[PAMI_XFER_BROADCAST] == MPID_COLL_NOSELECTION)
+   {
+     comm->coll_fns->Bcast        = MPIDO_Bcast_simple;
+   }
+   /* ************ Allreduce ************ */
+   if((MPIDI_Process.optimized.auto_select_colls & MPID_AUTO_SELECT_COLLS_ALLREDUCE) &&
+       comm->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] == MPID_COLL_NOSELECTION)
+   {
+     comm->coll_fns->Allreduce    = MPIDO_Allreduce_simple;
+   }
+   /* ************ Allgather ************ */
+   if((MPIDI_Process.optimized.auto_select_colls & MPID_AUTO_SELECT_COLLS_ALLGATHER) &&
+       comm->mpid.user_selected_type[PAMI_XFER_ALLGATHER] == MPID_COLL_NOSELECTION)
+   {
+     comm->coll_fns->Allgather    = MPIDO_Allgather_simple;
+   }
+   /* ************ Allgatherv ************ */
+   if((MPIDI_Process.optimized.auto_select_colls & MPID_AUTO_SELECT_COLLS_ALLGATHERV) &&
+       comm->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] == MPID_COLL_NOSELECTION)
+   {
+     comm->coll_fns->Allgatherv   = MPIDO_Allgatherv_simple;
+   }
+   /* ************ Scatterv ************ */
+   if((MPIDI_Process.optimized.auto_select_colls & MPID_AUTO_SELECT_COLLS_SCATTERV) &&
+       comm->mpid.user_selected_type[PAMI_XFER_SCATTERV_INT] == MPID_COLL_NOSELECTION)
+   {
+     comm->coll_fns->Scatterv     = MPIDO_Scatterv_simple;
+   }
+   /* ************ Scatter ************ */
+   if((MPIDI_Process.optimized.auto_select_colls & MPID_AUTO_SELECT_COLLS_SCATTER) &&
+       comm->mpid.user_selected_type[PAMI_XFER_SCATTER] == MPID_COLL_NOSELECTION)
+   {
+     comm->coll_fns->Scatter      = MPIDO_Scatter_simple;
+   }
+   /* ************ Gather ************ */
+   if((MPIDI_Process.optimized.auto_select_colls & MPID_AUTO_SELECT_COLLS_GATHER) &&
+       comm->mpid.user_selected_type[PAMI_XFER_GATHER] == MPID_COLL_NOSELECTION)
+   {
+     comm->coll_fns->Gather       = MPIDO_Gather_simple;
+   }
+   /* ************ Alltoallv ************ */
+   if((MPIDI_Process.optimized.auto_select_colls & MPID_AUTO_SELECT_COLLS_ALLTOALLV) &&
+       comm->mpid.user_selected_type[PAMI_XFER_ALLTOALLV_INT] == MPID_COLL_NOSELECTION)
+   {
+     comm->coll_fns->Alltoallv    = MPIDO_Alltoallv_simple;
+   }
+   /* ************ Alltoall ************ */
+   if((MPIDI_Process.optimized.auto_select_colls & MPID_AUTO_SELECT_COLLS_ALLTOALL) &&
+       comm->mpid.user_selected_type[PAMI_XFER_ALLTOALL] == MPID_COLL_NOSELECTION)
+   {
+     comm->coll_fns->Alltoall     = MPIDO_Alltoall_simple;
+   }
+   /* ************ Gatherv ************ */
+   if((MPIDI_Process.optimized.auto_select_colls & MPID_AUTO_SELECT_COLLS_GATHERV) &&
+       comm->mpid.user_selected_type[PAMI_XFER_GATHERV_INT] == MPID_COLL_NOSELECTION)
+   {
+     comm->coll_fns->Gatherv      = MPIDO_Gatherv_simple;
+   }
+   /* ************ Reduce ************ */
+   if((MPIDI_Process.optimized.auto_select_colls & MPID_AUTO_SELECT_COLLS_REDUCE) &&
+       comm->mpid.user_selected_type[PAMI_XFER_REDUCE] == MPID_COLL_NOSELECTION)
+   {
+     comm->coll_fns->Reduce       = MPIDO_Reduce_simple;
+   }
+   /* ************ Scan ************ */
+   if((MPIDI_Process.optimized.auto_select_colls & MPID_AUTO_SELECT_COLLS_SCAN) &&
+       comm->mpid.user_selected_type[PAMI_XFER_SCAN] == MPID_COLL_NOSELECTION)
+   {
+     comm->coll_fns->Scan         = MPIDO_Scan_simple;
+   }
+   /* ************ Exscan ************ */
+   if((MPIDI_Process.optimized.auto_select_colls & MPID_AUTO_SELECT_COLLS_EXSCAN) &&
+       comm->mpid.user_selected_type[PAMI_XFER_SCAN] == MPID_COLL_NOSELECTION)
+   {
+     comm->coll_fns->Exscan       = MPIDO_Exscan_simple;
+   }
    TRACE_ERR("MPIDI_Comm_coll_envvars exit\n");
 }
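The chain of guards above all follow one gating pattern: a collective's "simple" (auto-selected) implementation is installed only when its bit is set in `auto_select_colls` and the user made no explicit protocol selection. A minimal, self-contained sketch of that predicate (flag values and names here are illustrative, not the actual MPICH constants):

```c
#include <assert.h>

/* Illustrative stand-ins for the MPID_AUTO_SELECT_COLLS_* bitmask
 * and the MPID_COLL_NOSELECTION sentinel. */
enum {
    AUTO_BARRIER = 1u << 0,
    AUTO_BCAST   = 1u << 1,
    AUTO_NONE    = 0u,
    AUTO_ALL     = ~0u,
};
enum { COLL_NOSELECTION = 0, COLL_USER_SELECTED = 1 };

/* A collective falls back to its auto-selected "simple" variant only
 * when its auto-selection bit is on AND the user chose nothing. */
static int use_simple(unsigned auto_mask, unsigned coll_bit, int user_sel)
{
    return (auto_mask & coll_bit) && user_sel == COLL_NOSELECTION;
}
```

Any explicit user selection wins over auto-selection, which is why each `if` in the patch tests both conditions.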
 
@@ -479,7 +565,7 @@ void MPIDI_Comm_coll_query(MPID_Comm *comm)
          }
       }
    }
-   /* Determine if we have protocols for these maybe, rather than just setting them? */
+   /* Determine if we have protocols for these maybe, rather than just setting them?  */
    comm->coll_fns->Barrier      = MPIDO_Barrier;
    comm->coll_fns->Bcast        = MPIDO_Bcast;
    comm->coll_fns->Allreduce    = MPIDO_Allreduce;
diff --git a/src/mpid/pamid/src/mpidi_env.c b/src/mpid/pamid/src/mpidi_env.c
index 8a9bbf6..d141a69 100644
--- a/src/mpid/pamid/src/mpidi_env.c
+++ b/src/mpid/pamid/src/mpidi_env.c
@@ -842,6 +842,29 @@ MPIDI_Env_setup(int rank, int requested)
       TRACE_ERR("MPIDI_Process.optimized.select_colls=%u\n", MPIDI_Process.optimized.select_colls);
    }
 
+   /* Finally, if MP_COLLECTIVE_SELECTION is "on", then we want to overwrite any other setting */
+   {
+      unsigned temp;
+      temp = 0;
+      char* names[] = {"MP_COLLECTIVE_SELECTION", NULL};
+      ENV_Char(names, &temp);
+      if(temp)
+      {
+         pami_extension_t extension;
+         pami_result_t status = PAMI_ERROR;
+         status = PAMI_Extension_open (MPIDI_Client, "EXT_collsel", &extension);
+         if(status == PAMI_SUCCESS)
+         {
+
+           MPIDI_Process.optimized.auto_select_colls = MPID_AUTO_SELECT_COLLS_ALL; /* All collectives will be using auto coll sel. 
+                                                                                    We will check later on each individual coll. */ 
+           MPIDI_Process.optimized.collectives       = 1;                          /* Enable optimized collectives so we can create PAMI Geometry */
+         }
+      }
+      else
+         MPIDI_Process.optimized.auto_select_colls = MPID_AUTO_SELECT_COLLS_NONE;/* Auto coll sel is disabled for all */ 
+   }
+   
 
   /* Set the status of the optimized shared memory point-to-point functions */
   {

http://git.mpich.org/mpich.git/commitdiff/5ea5683ff71c4419b628d4a3a4513772ae83000a

commit 5ea5683ff71c4419b628d4a3a4513772ae83000a
Author: Qi QC Zhang <keirazhang at cn.ibm.com>
Date:   Mon Dec 10 00:28:53 2012 -0500

    check NULL size in MPI_File_get_size
    
    (ibm) D187578
    (ibm) 2213e89684327f2b5c4b7d2cdd7416a3ec7a110a
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpi/romio/mpi-io/get_size.c b/src/mpi/romio/mpi-io/get_size.c
index f3150de..1582380 100644
--- a/src/mpi/romio/mpi-io/get_size.c
+++ b/src/mpi/romio/mpi-io/get_size.c
@@ -51,6 +51,12 @@ int MPI_File_get_size(MPI_File fh, MPI_Offset *size)
 
     /* --BEGIN ERROR HANDLING-- */
     MPIO_CHECK_FILE_HANDLE(adio_fh, myname, error_code);
+    if(size == NULL){
+        error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
+                     myname, __LINE__, MPI_ERR_ARG,
+                     "**nullptr", "**nullptr %s", "size");
+        goto fn_fail;
+    }
     /* --END ERROR HANDLING-- */
 
     ADIOI_TEST_DEFERRED(adio_fh, myname, &error_code);
@@ -70,4 +76,9 @@ int MPI_File_get_size(MPI_File fh, MPI_Offset *size)
 
 fn_exit:
     return error_code;
+fn_fail:
+    /* --BEGIN ERROR HANDLING-- */
+    error_code = MPIO_Err_return_file(fh, error_code);
+    goto fn_exit;
+    /* --END ERROR HANDLING-- */
 }
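The patch above adds a NULL check on the output argument and routes the failure through a `fn_fail` label, which is ROMIO's standard single-exit error idiom. A minimal, MPI-free sketch of the same control flow (function and error names are illustrative, not ROMIO's):

```c
#include <assert.h>
#include <stddef.h>

enum { ERR_OK = 0, ERR_NULL_ARG = 1 };

/* Illustrative stand-in for MPI_File_get_size's validation flow:
 * validate the output pointer first, then fall through to the normal
 * path; every failure funnels through fn_fail to a single exit. */
static int get_size(const long *file_len, long *size)
{
    int error_code = ERR_OK;

    if (size == NULL) {          /* mirrors the size == NULL guard */
        error_code = ERR_NULL_ARG;
        goto fn_fail;
    }

    *size = *file_len;           /* normal path */

fn_exit:
    return error_code;

fn_fail:
    /* central error handling (ROMIO calls MPIO_Err_return_file here) */
    goto fn_exit;
}
```

The `goto fn_exit` at the end of `fn_fail` keeps cleanup in one place, which is why the real patch adds the label pair rather than returning early.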

http://git.mpich.org/mpich.git/commitdiff/9f335ce7ed568a19da82245492723e885dae9192

commit 9f335ce7ed568a19da82245492723e885dae9192
Author: Qi QC Zhang <keirazhang at cn.ibm.com>
Date:   Tue Dec 4 01:29:47 2012 -0500

    Wrong error class returned on GPFS
    
    (ibm) D187578
    (ibm) 8f26ccae0b1f8ea23caf4d37c936b84c84762f61
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpi/romio/adio/ad_ufs/ad_ufs_open.c b/src/mpi/romio/adio/ad_ufs/ad_ufs_open.c
index 1a8ee3b..751e7fc 100644
--- a/src/mpi/romio/adio/ad_ufs/ad_ufs_open.c
+++ b/src/mpi/romio/adio/ad_ufs/ad_ufs_open.c
@@ -91,6 +91,19 @@ void ADIOI_UFS_Open(ADIO_File fd, int *error_code)
 					       __LINE__, MPI_ERR_READ_ONLY,
 					       "**ioneedrd", 0 );
 	}
+    else if(errno == EISDIR) {
+        *error_code = MPIO_Err_create_code(MPI_SUCCESS,
+                           MPIR_ERR_RECOVERABLE, myname,
+                           __LINE__, MPI_ERR_BAD_FILE,
+                           "**filename", 0);
+    }
+    else if(errno == EEXIST) {
+        *error_code = MPIO_Err_create_code(MPI_SUCCESS,
+                           MPIR_ERR_RECOVERABLE, myname,
+                           __LINE__, MPI_ERR_FILE_EXISTS,
+                           "**fileexist", 0);
+
+    }
 	else {
 	    *error_code = MPIO_Err_create_code(MPI_SUCCESS,
 					       MPIR_ERR_RECOVERABLE, myname,
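The hunk above extends ADIO's errno-to-error-class mapping so that `EISDIR` and `EEXIST` get specific MPI error classes instead of falling through to the generic I/O error. A simplified sketch of that mapping, with illustrative constants standing in for the real `MPI_ERR_*` values:

```c
#include <assert.h>
#include <errno.h>

/* Illustrative error-class constants; the real values come from mpi.h. */
enum { ERR_IO = 1, ERR_NO_SUCH_FILE, ERR_ACCESS, ERR_BAD_FILE, ERR_FILE_EXISTS };

/* Map an errno from open(2) to an MPI-style error class.  The two
 * "new" cases are the ones the patch above adds for GPFS. */
static int classify_open_errno(int err)
{
    switch (err) {
    case ENOENT:  return ERR_NO_SUCH_FILE;
    case EACCES:  return ERR_ACCESS;
    case EISDIR:  return ERR_BAD_FILE;     /* new: path is a directory */
    case EEXIST:  return ERR_FILE_EXISTS;  /* new: EXCL create, file exists */
    default:      return ERR_IO;           /* generic fallback */
    }
}
```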

http://git.mpich.org/mpich.git/commitdiff/2f03f4ba14b1ff2bad3a6e35682c480703d7ec42

commit 2f03f4ba14b1ff2bad3a6e35682c480703d7ec42
Author: Haizhu Liu <haizhu at us.ibm.com>
Date:   Wed Dec 5 22:12:49 2012 -0500

    Fix for spawn/spaiccreate failure where the PG information needs to be broadcast to all COMM_WORLD members that did not participate in the spawn
    
    (ibm) D187685
    (ibm) 25e0e4da417b377f3ec01544d71be7dccbf91c3d
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidpost.h b/src/mpid/pamid/include/mpidpost.h
index a9302b2..5d238fb 100644
--- a/src/mpid/pamid/include/mpidpost.h
+++ b/src/mpid/pamid/include/mpidpost.h
@@ -37,4 +37,9 @@
 #include "../src/pt2pt/mpid_send.h"
 #include "../src/pt2pt/mpid_irecv.h"
 
+#ifdef DYNAMIC_TASKING
+#define MPID_ICCREATE_REMOTECOMM_HOOK(_p,_c,_np,_gp,_r) \
+     MPID_PG_ForwardPGInfo(_p,_c,_np,_gp,_r)
+#endif
+
 #endif
diff --git a/src/mpid/pamid/include/mpidpre.h b/src/mpid/pamid/include/mpidpre.h
index e78f3f7..4214268 100644
--- a/src/mpid/pamid/include/mpidpre.h
+++ b/src/mpid/pamid/include/mpidpre.h
@@ -62,5 +62,8 @@
 #include "mpidi_trace.h"
 #endif
 
+#ifdef DYNAMIC_TASKING
+#define HAVE_GPID_ROUTINES
+#endif
 
 #endif
diff --git a/src/mpid/pamid/src/dyntask/mpidi_pg.c b/src/mpid/pamid/src/dyntask/mpidi_pg.c
index 55fafae..fe0cb3c 100644
--- a/src/mpid/pamid/src/dyntask/mpidi_pg.c
+++ b/src/mpid/pamid/src/dyntask/mpidi_pg.c
@@ -12,6 +12,8 @@
 
 #ifdef DYNAMIC_TASKING
 
+extern int mpidi_dynamic_tasking;
+
 #define MAX_JOBID_LEN 1024
 
 /* FIXME: These routines need a description.  What is their purpose?  Who
@@ -458,6 +460,7 @@ int MPIDI_PG_Create_from_string(const char * str, MPIDI_PG_t ** pg_pptr,
     int vct_sz, i;
     MPIDI_PG_t *existing_pg, *pg_ptr=0;
 
+    TRACE_ERR("MPIDI_PG_Create_from_string - str=%s\n", str);
     /* The pg_id is at the beginning of the string, so we can just pass
        it to the find routine */
     /* printf( "Looking for pg with id %s\n", str );fflush(stdout); */
@@ -477,13 +480,22 @@ int MPIDI_PG_Create_from_string(const char * str, MPIDI_PG_t ** pg_pptr,
     p = str;
     while (*p) p++; p++;
     vct_sz = atoi(p);
+
+    p++;p++;
+    TRACE_ERR("before MPIDI_PG_Create - p=%s\n", p);
+    char *p_tmp = MPIU_Strdup(p);
     mpi_errno = MPIDI_PG_Create(vct_sz, (void *)str, pg_pptr);
     if (mpi_errno != MPI_SUCCESS) {
 	TRACE_ERR("MPIDI_PG_Create returned with mpi_errno=%d\n", mpi_errno);
     }
 
     pg_ptr = *pg_pptr;
+    pg_ptr->vct[0].taskid=atoi(strtok(p_tmp,":"));
+    for(i=1; i<vct_sz; i++) {
+	pg_ptr->vct[i].taskid=atoi(strtok(NULL,":"));
+    }
     TRACE_ERR("pg_ptr->id = %s\n",(*pg_pptr)->id);
+    MPIU_Free(p_tmp);
 
     if(verbose)
       MPIU_PG_Printall(stderr);
@@ -575,7 +587,7 @@ int MPIDI_connToStringKVS( char **buf_p, int *slen, MPIDI_PG_t *pg )
 					     the pg id is a string */
     char buf[MPIDI_MAX_KVS_VALUE_LEN];
     int   i, j, vallen, rc, mpi_errno = MPI_SUCCESS, len;
-    int   curSlen;
+    int   curSlen, nChars;
 
     /* Make an initial allocation of a string with an estimate of the
        needed space */
@@ -591,7 +603,13 @@ int MPIDI_connToStringKVS( char **buf_p, int *slen, MPIDI_PG_t *pg )
     /* Add the size of the pg */
     MPIU_Snprintf( &string[len], curSlen - len, "%d", pg->size );
     while (string[len]) len++;
-    len++;
+    string[len++] = 0;
+
+    /* add the taskids of the pg */
+    for(i = 0; i < pg->size; i++) {
+      nChars = MPIU_Snprintf(&string[len], curSlen - len, "%d:", pg->vct[i].taskid);
+      len+=nChars;
+    }
 
 #if 0
     for (i=0; i<pg->size; i++) {
@@ -923,13 +941,11 @@ int MPIU_PG_Printall( FILE *fp )
         /* XXX DJG FIXME-MT should we be checking this? */
 	fprintf( fp, "size = %d, refcount = %d, id = %s\n",
 		 pg->size, MPIU_Object_get_ref(pg), (char *)pg->id );
-#if 0
 	for (i=0; i<pg->size; i++) {
-	    fprintf( fp, "\tVCT rank = %d, refcount = %d, taskid = %d, state = %d \n",
-		     pg->vct[i].pg_rank, MPIU_Object_get_ref(&pg->vct[i]),
-		     pg->vct[i].taskid, (int)pg->vct[i].state );
+	    fprintf( fp, "\tVCT rank = %d, refcount = %d, taskid = %d\n",
+		     pg->vct[i].pg_rank, MPIU_Object_get_ref(pg),
+		     pg->vct[i].taskid );
 	}
-#endif
 	fflush(fp);
 	pg = pg->next;
     }
@@ -958,4 +974,63 @@ void MPIDI_PG_IdToNum( MPIDI_PG_t *pg, int *id )
 {
     *id = atoi((char *)pg->id);
 }
+
+
+int MPID_PG_ForwardPGInfo( MPID_Comm *peer_ptr, MPID_Comm *comm_ptr,
+			   int nPGids, const int gpids[],
+			   int root )
+{
+    int mpi_errno = MPI_SUCCESS;
+    int i, allfound = 1, pgid, pgidWorld;
+    MPIDI_PG_t *pg = 0;
+    MPIDI_PG_iterator iter;
+    int errflag = FALSE;
+
+    if(mpidi_dynamic_tasking) {
+    /* Get the pgid for CommWorld (always attached to the first process
+       group) */
+    MPIDI_PG_Get_iterator(&iter);
+    MPIDI_PG_Get_next( &iter, &pg );
+    MPIDI_PG_IdToNum( pg, &pgidWorld );
+
+    /* Extract the unique process groups */
+    for (i=0; i<nPGids && allfound; i++) {
+	if (gpids[0] != pgidWorld) {
+	    /* Add this gpid to the list of values to check */
+	    /* FIXME: For testing, we just test in place */
+            MPIDI_PG_Get_iterator(&iter);
+	    do {
+                MPIDI_PG_Get_next( &iter, &pg );
+		if (!pg) {
+		    /* We don't know this pgid */
+		    allfound = 0;
+		    break;
+		}
+		MPIDI_PG_IdToNum( pg, &pgid );
+	    } while (pgid != gpids[0]);
+	}
+	gpids += 2;
+    }
+
+    /* See if everyone is happy */
+    mpi_errno = MPIR_Allreduce_impl( MPI_IN_PLACE, &allfound, 1, MPI_INT, MPI_LAND, comm_ptr, &errflag );
+
+    if (allfound) return MPI_SUCCESS;
+
+    /* FIXME: We need a cleaner way to handle this case than using an ifdef.
+       We could have an empty version of MPID_PG_BCast in ch3u_port.c, but
+       that's a rather crude way of addressing this problem.  Better is to
+       make the handling of local and remote PIDS for the dynamic process
+       case part of the dynamic process "module"; devices that don't support
+       dynamic processes (and hence have only COMM_WORLD) could optimize for
+       that case */
+    /* We need to share the process groups.  We use routines
+       from ch3u_port.c */
+    MPID_PG_BCast( peer_ptr, comm_ptr, root );
+    }
+ fn_exit:
+    return MPI_SUCCESS;
+ fn_fail:
+    goto fn_exit;
+}
 #endif
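The mpidi_pg.c changes above round-trip the taskids through the connection string: `MPIDI_connToStringKVS` appends them as `"tid0:tid1:...:"`, and `MPIDI_PG_Create_from_string` splits them back out with `strtok`. A self-contained sketch of that encode/decode pair (function names here are illustrative, not the MPICH routines):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Serialize taskids as a colon-separated run, matching the
 * "%d:" format the patch adds to the connection string. */
static int encode_taskids(char *buf, size_t n, const int *tids, int count)
{
    int len = 0;
    for (int i = 0; i < count; i++)
        len += snprintf(buf + len, n - len, "%d:", tids[i]);
    return len;
}

/* Parse them back, mirroring the strtok loop that fills
 * pg_ptr->vct[i].taskid in MPIDI_PG_Create_from_string.
 * Note strtok modifies buf, just as the patch dups the
 * string (p_tmp) before tokenizing. */
static void decode_taskids(char *buf, int *tids, int count)
{
    tids[0] = atoi(strtok(buf, ":"));
    for (int i = 1; i < count; i++)
        tids[i] = atoi(strtok(NULL, ":"));
}
```

This is also why the patch takes `MPIU_Strdup(p)` before calling `MPIDI_PG_Create`: `strtok` destroys its input, and the original string is still needed.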
diff --git a/src/mpid/pamid/src/dyntask/mpidi_port.c b/src/mpid/pamid/src/dyntask/mpidi_port.c
index 03f5380..f2445a8 100644
--- a/src/mpid/pamid/src/dyntask/mpidi_port.c
+++ b/src/mpid/pamid/src/dyntask/mpidi_port.c
@@ -1240,6 +1240,8 @@ static int MPIDI_SetupNewIntercomm( struct MPID_Comm *comm_ptr, int remote_comm_
     intercomm->local_vcrt = comm_ptr->vcrt;
     MPID_VCRT_Add_ref(comm_ptr->vcrt);
     intercomm->local_vcr  = comm_ptr->vcr;
+    for(i=0; i<comm_ptr->local_size; i++)
+	TRACE_ERR("intercomm->local_vcr[%d]->pg_rank=%d comm_ptr->vcr[%d].pg_rank=%d intercomm->local_vcr[%d]->taskid=%d comm_ptr->vcr[%d]->taskid=%d\n", i, intercomm->local_vcr[i]->pg_rank, i, comm_ptr->vcr[i]->pg_rank, i, intercomm->local_vcr[i]->taskid, i, comm_ptr->vcr[i]->taskid);
 
     /* Set up VC reference table */
     mpi_errno = MPID_VCRT_Create(intercomm->remote_size, &intercomm->vcrt);
@@ -1454,4 +1456,92 @@ void MPIDI_delete_conn_record(int wid) {
   }
   MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
 }
+
+
+int MPID_PG_BCast( MPID_Comm *peercomm_p, MPID_Comm *comm_p, int root )
+{
+    int n_local_pgs=0, mpi_errno = MPI_SUCCESS;
+    pg_translation *local_translation = 0;
+    pg_node *pg_list, *pg_next, *pg_head = 0;
+    int rank, i, peer_comm_size;
+    int errflag = FALSE;
+    MPIU_CHKLMEM_DECL(1);
+
+    peer_comm_size = comm_p->local_size;
+    rank            = comm_p->rank;
+
+    local_translation = (pg_translation*)MPIU_Malloc(peer_comm_size*sizeof(pg_translation));
+    if (rank == root) {
+	/* Get the process groups known to the *peercomm* */
+	MPIDI_ExtractLocalPGInfo( peercomm_p, local_translation, &pg_head,
+			    &n_local_pgs );
+    }
+
+    /* Now, broadcast the number of local pgs */
+    mpi_errno = MPIR_Bcast_impl( &n_local_pgs, 1, MPI_INT, root, comm_p, &errflag);
+
+    pg_list = pg_head;
+    for (i=0; i<n_local_pgs; i++) {
+	int len, flag;
+	char *pg_str=0;
+	MPIDI_PG_t *pgptr;
+
+	if (rank == root) {
+	    if (!pg_list) {
+		/* FIXME: Error, the pg_list is broken */
+		printf( "Unexpected end of pg_list\n" ); fflush(stdout);
+		break;
+	    }
+	    pg_str  = pg_list->str;
+	    len     = pg_list->lenStr;
+	    pg_list = pg_list->next;
+	}
+	mpi_errno = MPIR_Bcast_impl( &len, 1, MPI_INT, root, comm_p, &errflag);
+	if (rank != root) {
+	    pg_str = (char *)MPIU_Malloc(len);
+	    if (!pg_str) {
+		goto fn_exit;
+	    }
+	}
+	mpi_errno = MPIR_Bcast_impl( pg_str, len, MPI_CHAR, root, comm_p, &errflag);
+	if (mpi_errno) {
+	    if (rank != root)
+		MPIU_Free( pg_str );
+	}
+
+	if (rank != root) {
+	    /* flag is true if the pg was created, false if it
+	       already existed. This step
+	       also initializes the created process group  */
+	    MPIDI_PG_Create_from_string( pg_str, &pgptr, &flag );
+	    if (flag) {
+		/*printf( "[%d]Added pg named %s to list\n", rank,
+			(char *)pgptr->id );
+			fflush(stdout); */
+	    }
+	    MPIU_Free( pg_str );
+	}
+    }
+
+    /* Free pg_list */
+    pg_list = pg_head;
+
+    /* FIXME: We should use the PG destroy function for this, and ensure that
+       the PG fields are valid for that function */
+    while (pg_list) {
+	pg_next = pg_list->next;
+	MPIU_Free( pg_list->str );
+	if (pg_list->pg_id ) {
+	    MPIU_Free( pg_list->pg_id );
+	}
+	MPIU_Free( pg_list );
+	pg_list = pg_next;
+    }
+
+ fn_exit:
+    MPIU_Free(local_translation);
+    return mpi_errno;
+ fn_fail:
+    goto fn_exit;
+}
 #endif
diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index f493198..9b966a7 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -798,6 +798,7 @@ MPIDI_VCRT_init(int rank, int size, char *world_tasks, MPIDI_PG_t *pg)
   if(mpidi_dynamic_tasking) {
     comm->vcr[0]->pg=pg->vct[0].pg;
     comm->vcr[0]->pg_rank=pg->vct[0].pg_rank;
+    pg->vct[0].taskid = comm->vcr[0]->taskid;
     if(comm->vcr[0]->pg) {
       TRACE_ERR("Adding ref for comm=%x vcr=%x pg=%x\n", comm, comm->vcr[0], comm->vcr[0]->pg);
       MPIDI_PG_add_ref(comm->vcr[0]->pg);
@@ -842,11 +843,13 @@ MPIDI_VCRT_init(int rank, int size, char *world_tasks, MPIDI_PG_t *pg)
     {
 	  comm->vcr[p]->pg=pg->vct[p].pg;
           comm->vcr[p]->pg_rank=pg->vct[p].pg_rank;
+          pg->vct[p].taskid = comm->vcr[p]->taskid;
 	  if(comm->vcr[p]->pg) {
-		TRACE_ERR("Adding ref for comm=%x vcr=%x pg=%x\n", comm, comm->vcr[p], comm->vcr[p]->pg);
-		MPIDI_PG_add_ref(comm->vcr[p]->pg);
+            TRACE_ERR("Adding ref for comm=%x vcr=%x pg=%x\n", comm, comm->vcr[p], comm->vcr[p]->pg);
+            MPIDI_PG_add_ref(comm->vcr[p]->pg);
 	  }
        /* MPID_VCR_Dup(&pg->vct[p], &(comm->vcr[p]));*/
+	  TRACE_ERR("comm->vcr[%d]->pg->id=%s comm->vcr[%d]->pg_rank=%d\n", p, comm->vcr[p]->pg->id, p, comm->vcr[p]->pg_rank);
 	  TRACE_ERR("TASKID -- comm->vcr[%d]=%d\n", p, comm->vcr[p]->taskid);
     }
 
diff --git a/src/mpid/pamid/src/mpid_vc.c b/src/mpid/pamid/src/mpid_vc.c
index f051c78..14c470c 100644
--- a/src/mpid/pamid/src/mpid_vc.c
+++ b/src/mpid/pamid/src/mpid_vc.c
@@ -146,7 +146,7 @@ int MPID_VCR_CommFromLpids( MPID_Comm *newcomm_ptr,
     MPID_VCRT_Create( size, &newcomm_ptr->vcrt );
     MPID_VCRT_Get_ptr( newcomm_ptr->vcrt, &newcomm_ptr->vcr );
     if(mpidi_dynamic_tasking) {
-    for (i=0; i<size; i++) {
+      for (i=0; i<size; i++) {
 	MPID_VCR *vc = 0;
 
 	/* For rank i in the new communicator, find the corresponding
@@ -156,10 +156,12 @@ int MPID_VCR_CommFromLpids( MPID_Comm *newcomm_ptr,
 	*/
 	/* printf( "[%d] Remote rank %d has lpid %d\n",
 	   MPIR_Process.comm_world->rank, i, lpids[i] ); */
+#if 0
 	if (lpids[i] < commworld_ptr->remote_size) {
 	    *vc = commworld_ptr->vcr[lpids[i]];
 	}
 	else {
+#endif
 	    /* We must find the corresponding vcr for a given lpid */
 	    /* For now, this means iterating through the process groups */
 	    MPIDI_PG_t *pg = 0;
@@ -185,16 +187,18 @@ int MPID_VCR_CommFromLpids( MPID_Comm *newcomm_ptr,
 		    }
 		}
 	    } while (!vc);
+#if 0
 	}
+#endif
 
 	/* printf( "about to dup vc %x for lpid = %d in another pg\n",
 	   (int)vc, lpids[i] ); */
 	/* Note that his will increment the ref count for the associate
 	   PG if necessary.  */
-	MPID_VCR_Dup( *vc, &newcomm_ptr->vcr[i] );
-    }
+	MPID_VCR_Dup( vc, &newcomm_ptr->vcr[i] );
+      }
     } else  {
-    for (i=0; i<size; i++) {
+      for (i=0; i<size; i++) {
         /* For rank i in the new communicator, find the corresponding
            rank in the comm world (FIXME FOR MPI2) */
         /* printf( "[%d] Remote rank %d has lpid %d\n",
@@ -209,7 +213,7 @@ int MPID_VCR_CommFromLpids( MPID_Comm *newcomm_ptr,
             return 1;
             /* MPID_VCR_Dup( ???, &newcomm_ptr->vcr[i] ); */
         }
-    }
+      }
 
     }
 fn_exit:
@@ -231,7 +235,7 @@ int MPID_GPID_ToLpidArray( int size, int gpid[], int lpid[] )
     MPIDI_PG_iterator iter;
 
     if(mpidi_dynamic_tasking) {
-    for (i=0; i<size; i++) {
+      for (i=0; i<size; i++) {
         MPIDI_PG_Get_iterator(&iter);
 	do {
 	    MPIDI_PG_Get_next( &iter, &pg );
@@ -267,12 +271,12 @@ int MPID_GPID_ToLpidArray( int size, int gpid[], int lpid[] )
 	    }
 	} while (1);
 	gpid += 2;
-    }
+      }
     } else {
-    for (i=0; i<size; i++) {
+      for (i=0; i<size; i++) {
         lpid[i] = *++gpid;  gpid++;
-    }
-    return 0;
+      }
+      return 0;
 
     }
 
@@ -296,33 +300,32 @@ int MPID_GPID_GetAllInComm( MPID_Comm *comm_ptr, int local_size,
     MPIU_Assert(comm_ptr->local_size == local_size);
 
     if(mpidi_dynamic_tasking) {
-    *singlePG = 1;
-    for (i=0; i<comm_ptr->local_size; i++) {
-	vc = comm_ptr->vcr[i];
-
-	/* Get the process group id as an int */
-	MPIDI_PG_IdToNum( vc->pg, &pgid );
-
-	*gpid++ = pgid;
-	if (lastPGID != pgid) {
-	    if (lastPGID != -1)
-		*singlePG = 0;
-	    lastPGID = pgid;
-	}
-	*gpid++ = vc->pg_rank;
-
-        MPIU_DBG_MSG_FMT(COMM,VERBOSE, (MPIU_DBG_FDEST,
-                         "pgid=%d vc->pg_rank=%d",
-                         pgid, vc->pg_rank));
-    }
+      *singlePG = 1;
+      for (i=0; i<comm_ptr->local_size; i++) {
+	  vc = comm_ptr->vcr[i];
+
+	  /* Get the process group id as an int */
+	  MPIDI_PG_IdToNum( vc->pg, &pgid );
+
+	  *gpid++ = pgid;
+	  if (lastPGID != pgid) {
+	      if (lastPGID != -1)
+                 *singlePG = 0;
+	      lastPGID = pgid;
+	  }
+	  *gpid++ = vc->pg_rank;
+
+          MPIU_DBG_MSG_FMT(COMM,VERBOSE, (MPIU_DBG_FDEST,
+                           "pgid=%d vc->pg_rank=%d",
+                           pgid, vc->pg_rank));
+      }
     } else {
-    for (i=0; i<comm_ptr->local_size; i++) {
+      for (i=0; i<comm_ptr->local_size; i++) {
         *gpid++ = 0;
         (void)MPID_VCR_Get_lpid( comm_ptr->vcr[i], gpid );
         gpid++;
-    }
-    *singlePG = 1;
-
+      }
+      *singlePG = 1;
     }
 
     return mpi_errno;
@@ -334,4 +337,6 @@ int MPIDI_VC_Init( MPID_VCR vcr, MPIDI_PG_t *pg, int rank )
     vcr->pg      = pg;
     vcr->pg_rank = rank;
 }
+
+
 #endif

http://git.mpich.org/mpich.git/commitdiff/27efcfd94efdecaa42eb1dc58d601559bf045afc

commit 27efcfd94efdecaa42eb1dc58d601559bf045afc
Author: Su Huang <suhuang at us.ibm.com>
Date:   Wed Dec 5 13:16:20 2012 -0500

    Trace tool to include 0 byte rendezvous messages
    
    (ibm) D186175
    (ibm) b1c4f0883d55d79d16e3d3c09746266c0a3d5403
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/src/pt2pt/mpidi_control.c b/src/mpid/pamid/src/pt2pt/mpidi_control.c
index dec4384..9cc8a2b 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_control.c
+++ b/src/mpid/pamid/src/pt2pt/mpidi_control.c
@@ -132,6 +132,13 @@ MPIDI_RecvRzvDoneCB_zerobyte(pami_context_t  context,
   MPIDI_Request_setControl(rreq, original_value);
 
   MPIDI_RecvDoneCB(context, rreq, PAMI_SUCCESS);
+#ifdef MPIDI_TRACE
+  pami_task_t source;
+  source = MPIDI_Request_getPeerRank_pami(rreq);
+  MPIDI_Trace_buf[source].R[(rreq->mpid.idx)].sync_com_in_HH=1;
+  MPIDI_Trace_buf[source].R[(rreq->mpid.idx)].matchedInHH=1;
+  MPIDI_Trace_buf[source].R[(rreq->mpid.idx)].bufadd=rreq->mpid.userbuf;
+#endif
   MPID_Request_release(rreq);
 }
 
diff --git a/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c b/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c
index 33095ca..d6308dd 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c
+++ b/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c
@@ -299,6 +299,14 @@ MPIDI_SendMsg_rzv_zerobyte(pami_context_t    context,
 
   rc = PAMI_Send_immediate(context, &params);
   MPID_assert(rc == PAMI_SUCCESS);
+#ifdef MPIDI_TRACE
+  MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].bufaddr=sreq->mpid.envelope.data;
+  MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].mode=MPIDI_Protocols_RVZ_zerobyte;
+  MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].sendRzv=1;
+  MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].sendEnvelop=1;
+  MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].memRegion=sreq->mpid.envelope.memregion_used;
+  MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].use_pami_get=MPIDI_Process.mp_s_use_pami_get;
+#endif
 }
 
 

http://git.mpich.org/mpich.git/commitdiff/1a40b5d4b1c996a259f1bf62715bf9f84929e788

commit 1a40b5d4b1c996a259f1bf62715bf9f84929e788
Author: Su Huang <suhuang at us.ibm.com>
Date:   Mon Dec 3 10:23:33 2012 -0500

    Support MP_SHMEM_PT2PT with yes or no option
    
    (ibm) D187584
    (ibm) 6d0a67d612b8c2e748849147914cbe5ec3bd0be4
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/src/mpidi_env.c b/src/mpid/pamid/src/mpidi_env.c
index 7f97dac..8a9bbf6 100644
--- a/src/mpid/pamid/src/mpidi_env.c
+++ b/src/mpid/pamid/src/mpidi_env.c
@@ -845,10 +845,16 @@ MPIDI_Env_setup(int rank, int requested)
 
   /* Set the status of the optimized shared memory point-to-point functions */
   {
-    char* names[] = {"PAMID_SHMEM_PT2PT", "MP_SHMEM_PT2PT", "PAMI_SHMEM_PT2PT", NULL};
+    char* names[] = {"PAMID_SHMEM_PT2PT", "PAMI_SHMEM_PT2PT", NULL};
     ENV_Unsigned(names, &MPIDI_Process.shmem_pt2pt, 2, &found_deprecated_env_var, rank);
   }
 
+  /* MP_SHMEM_PT2PT = yes or no       */
+  {
+    char* names[] = {"MP_SHMEM_PT2PT", NULL};
+      ENV_Char(names, &MPIDI_Process.shmem_pt2pt);
+  }
+
   /* Enable MPIR_* implementations of non-blocking collectives */
   {
     char* names[] = {"PAMID_MPIR_NBC", NULL};
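The change above moves `MP_SHMEM_PT2PT` from the numeric `ENV_Unsigned` parser to `ENV_Char`, so it now accepts yes/no values. A simplified sketch of what an `ENV_Char`-style parser does (this is an assumption about its behavior, not the actual mpidi_env.c code): set the flag to 1 for a value starting with `y`/`Y`, 0 for `n`/`N`, and leave the existing default untouched when the variable is unset.

```c
#include <stdlib.h>

/* Hypothetical ENV_Char-style yes/no parser; the real routine lives
 * in src/mpid/pamid/src/mpidi_env.c and also handles name lists. */
static void env_char(const char *name, unsigned *dval)
{
    const char *s = getenv(name);
    if (s == NULL)
        return;                       /* variable unset: keep default */
    if (s[0] == 'y' || s[0] == 'Y')
        *dval = 1;                    /* "yes" enables the feature */
    else if (s[0] == 'n' || s[0] == 'N')
        *dval = 0;                    /* "no" disables it */
}
```

Keeping the default when the variable is absent matters here, since `PAMID_SHMEM_PT2PT`/`PAMI_SHMEM_PT2PT` may already have set `MPIDI_Process.shmem_pt2pt`.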

http://git.mpich.org/mpich.git/commitdiff/769b38bcfb9f623c2860169e11ef02d91b3eec60

commit 769b38bcfb9f623c2860169e11ef02d91b3eec60
Author: Haizhu Liu <haizhu at us.ibm.com>
Date:   Mon Dec 3 11:25:09 2012 -0500

    Fix AIX build break and memory cleanup
    
    (ibm) D187547
    (ibm) 64a14a47dce017444ce2cd5c551375e903cd5965
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c b/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
index 23eba8e..2eb6c71 100644
--- a/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
+++ b/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
@@ -306,9 +306,11 @@ int MPIDI_Comm_spawn_multiple(int count, char **commands,
  fn_exit:
     if (info_keyval_vectors) {
 	MPIDI_free_pmi_keyvals(info_keyval_vectors, count, info_keyval_sizes);
-	MPIU_Free(info_keyval_sizes);
 	MPIU_Free(info_keyval_vectors);
     }
+    if (info_keyval_sizes) {
+	MPIU_Free(info_keyval_sizes);
+    }
     if (pmi_errcodes) {
 	MPIU_Free(pmi_errcodes);
     }
diff --git a/src/mpid/pamid/src/dyntask/mpid_port.c b/src/mpid/pamid/src/dyntask/mpid_port.c
index b2c43cb..51bc422 100644
--- a/src/mpid/pamid/src/dyntask/mpid_port.c
+++ b/src/mpid/pamid/src/dyntask/mpid_port.c
@@ -7,7 +7,6 @@
 #include "mpidimpl.h"
 #include "netdb.h"
 #include <net/if.h>
-#include <linux/sockios.h>
 
 
 #ifdef DYNAMIC_TASKING
diff --git a/src/mpid/pamid/src/dyntask/mpidi_pg.c b/src/mpid/pamid/src/dyntask/mpidi_pg.c
index bd9d748..55fafae 100644
--- a/src/mpid/pamid/src/dyntask/mpidi_pg.c
+++ b/src/mpid/pamid/src/dyntask/mpidi_pg.c
@@ -332,7 +332,7 @@ int MPIDI_PG_Destroy(MPIDI_PG_t * pg)
 	    TRACE_ERR("destroying pg->vct=%x\n", pg->vct);
 	    MPIU_Free(pg->vct);
 	    TRACE_ERR("after destroying pg->vct=%x\n", pg->vct);
-#if 0
+
 	    if (pg->connData) {
 		if (pg->freeConnInfo) {
                     TRACE_ERR("calling freeConnInfo on pg\n");
@@ -343,7 +343,7 @@ int MPIDI_PG_Destroy(MPIDI_PG_t * pg)
 		    MPIU_Free(pg->connData);
 		}
 	    }
-#endif
+
 	    TRACE_ERR("final destroying pg\n");
 	    MPIU_Free(pg);
 
diff --git a/src/mpid/pamid/src/dyntask/mpidi_port.c b/src/mpid/pamid/src/dyntask/mpidi_port.c
index 1ddafd0..03f5380 100644
--- a/src/mpid/pamid/src/dyntask/mpidi_port.c
+++ b/src/mpid/pamid/src/dyntask/mpidi_port.c
@@ -636,6 +636,7 @@ int MPIDI_Comm_connect(const char *port_name, MPID_Info *info, int root,
     }
 
 fn_exit:
+    if(local_translation) MPIU_Free(local_translation);
     return mpi_errno;
 
 fn_fail:
@@ -812,39 +813,6 @@ static int MPIDI_ReceivePGAndDistribute( struct MPID_Comm *tmp_comm, struct MPID
 }
 
 
-/**
- * This routine adds the remote world (wid) to local known world linked list
- * if there is no record of it before, or increment the reference count
- * associated with world (wid) if it is known before
- */
-void MPIDI_Parse_connection_info(int n_remote_pgs, MPIDI_PG_t **remote_pg) {
-  int i, p, ref_count=0;
-  int jobIdSize=8;
-  char jobId[jobIdSize];
-  char *pginfo_sav, *pgid_taskid_sav, *pgid, *pgid_taskid[20], *pginfo_tmp, *cp3, *cp2;
-  pami_task_t *taskids;
-  int n_rem_wids=0;
-  int mpi_errno = MPI_SUCCESS;
-  MPIDI_PG_t *existing_pg;
-
-  for(p=0; p<n_remote_pgs; p++) {
-        TRACE_ERR("call MPIDI_PG_Find to find %s\n", (char*)(remote_pg[p]->id));
-        mpi_errno = MPIDI_PG_Find(remote_pg[p]->id, &existing_pg);
-        if (mpi_errno) TRACE_ERR("MPIDI_PG_Find failed\n");
-
-         if (existing_pg != NULL) {
-	  taskids = MPIU_Malloc((existing_pg->size)*sizeof(pami_task_t));
-          for(i=0; i<existing_pg->size; i++) {
-             taskids[i]=existing_pg->vct[i].taskid;
-	     TRACE_ERR("id=%s taskids[%d]=%d\n", (char*)(remote_pg[p]->id), i, taskids[i]);
-          }
-          MPIDI_Add_connection_info(atoi((char*)(remote_pg[p]->id)), existing_pg->size, taskids);
-	  MPIU_Free(taskids);
-        }
-  }
-}
-
-
 void MPIDI_Add_connection_info(int wid, int wsize, pami_task_t *taskids) {
   int jobIdSize=64;
   char jobId[jobIdSize];
@@ -927,6 +895,39 @@ void MPIDI_Add_connection_info(int wid, int wsize, pami_task_t *taskids) {
 }
 
 
+/**
+ * This routine adds the remote world (wid) to local known world linked list
+ * if there is no record of it before, or increment the reference count
+ * associated with world (wid) if it is known before
+ */
+void MPIDI_Parse_connection_info(int n_remote_pgs, MPIDI_PG_t **remote_pg) {
+  int i, p, ref_count=0;
+  int jobIdSize=8;
+  char jobId[jobIdSize];
+  char *pginfo_sav, *pgid_taskid_sav, *pgid, *pgid_taskid[20], *pginfo_tmp, *cp3, *cp2;
+  pami_task_t *taskids;
+  int n_rem_wids=0;
+  int mpi_errno = MPI_SUCCESS;
+  MPIDI_PG_t *existing_pg;
+
+  for(p=0; p<n_remote_pgs; p++) {
+        TRACE_ERR("call MPIDI_PG_Find to find %s\n", (char*)(remote_pg[p]->id));
+        mpi_errno = MPIDI_PG_Find(remote_pg[p]->id, &existing_pg);
+        if (mpi_errno) TRACE_ERR("MPIDI_PG_Find failed\n");
+
+         if (existing_pg != NULL) {
+	  taskids = MPIU_Malloc((existing_pg->size)*sizeof(pami_task_t));
+          for(i=0; i<existing_pg->size; i++) {
+             taskids[i]=existing_pg->vct[i].taskid;
+	     TRACE_ERR("id=%s taskids[%d]=%d\n", (char*)(remote_pg[p]->id), i, taskids[i]);
+          }
+          MPIDI_Add_connection_info(atoi((char*)(remote_pg[p]->id)), existing_pg->size, taskids);
+	  MPIU_Free(taskids);
+        }
+  }
+}
+
+
 /* Sends the process group information to the peer and frees the
    pg_list */
 static int MPIDI_SendPGtoPeerAndFree( struct MPID_Comm *tmp_comm, int *sendtag_p,
@@ -1176,6 +1177,7 @@ int MPIDI_Comm_accept(const char *port_name, MPID_Info *info, int root,
     }
 
 fn_exit:
+    if(local_translation) MPIU_Free(local_translation);
     return mpi_errno;
 
 fn_fail:
@@ -1317,6 +1319,8 @@ static int MPIDI_SetupNewIntercomm( struct MPID_Comm *comm_ptr, int remote_comm_
    }
 
  fn_exit:
+    if(remote_pg) MPIU_Free(remote_pg);
+    if(remote_translation) MPIU_Free(remote_translation);
     return mpi_errno;
 
  fn_fail:
@@ -1345,7 +1349,7 @@ int MPIDI_Acceptq_dequeue(MPID_VCR * vcr, int port_name_tag)
 	    else
 		prev->next = q_item->next;
 
-	    /*MPIU_Free(q_item); */
+	    MPIU_Free(q_item);
 	    AcceptQueueSize--;
 	    break;;
 	}
diff --git a/src/mpid/pamid/src/mpid_finalize.c b/src/mpid/pamid/src/mpid_finalize.c
index 177973a..7becd22 100644
--- a/src/mpid/pamid/src/mpid_finalize.c
+++ b/src/mpid/pamid/src/mpid_finalize.c
@@ -31,6 +31,9 @@ extern pami_extension_t pe_extension;
 
 extern int mpidi_dynamic_tasking;
 int mpidi_finalized = 0;
+#ifdef DYNAMIC_TASKING
+extern conn_info  *_conn_info_list;
+#endif
 
 
 void MPIDI_close_pe_extension() {
@@ -76,6 +79,8 @@ int MPID_Finalize()
 
     MPIDI_FreeParentPort();
   }
+  if(_conn_info_list) 
+    MPIU_Free(_conn_info_list);
 #endif
 
 

http://git.mpich.org/mpich.git/commitdiff/8bcd1b9eb270df9196be41903a8ef2ddf7e2ca9b

commit 8bcd1b9eb270df9196be41903a8ef2ddf7e2ca9b
Author: Su Huang <suhuang at us.ibm.com>
Date:   Thu Nov 29 10:21:34 2012 -0500

    Support MP_BUFFER_MEM with two values
    
    (ibm) D187137
    (ibm) 5300640b00038e54bd716361a8e7c80268d2a2c9
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_datatypes.h b/src/mpid/pamid/include/mpidi_datatypes.h
index 55b1f21..5e4b30d 100644
--- a/src/mpid/pamid/include/mpidi_datatypes.h
+++ b/src/mpid/pamid/include/mpidi_datatypes.h
@@ -83,6 +83,7 @@ typedef struct
   unsigned disable_internal_eager_scale; /**< The number of tasks at which point eager will be disabled */
 #if TOKEN_FLOW_CONTROL
   unsigned long long mp_buf_mem;
+  unsigned long long mp_buf_mem_max;
   unsigned is_token_flow_control_on;
 #endif
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
@@ -131,6 +132,7 @@ typedef struct
   } perobj;                  /**< This structure is only used in the 'perobj' mpich lock mode. */
 
   unsigned mpir_nbc;         /**< Enable MPIR_* non-blocking collectives implementations. */
+  int  numTasks;             /* total number of tasks on a job                            */
 #ifdef DYNAMIC_TASKING
   struct MPIDI_PG_t * my_pg; /**< Process group I belong to */
   int                 my_pg_rank; /**< Rank in process group */
diff --git a/src/mpid/pamid/include/mpidi_util.h b/src/mpid/pamid/include/mpidi_util.h
index 6665571..5bec9c6 100644
--- a/src/mpid/pamid/include/mpidi_util.h
+++ b/src/mpid/pamid/include/mpidi_util.h
@@ -82,6 +82,8 @@ typedef struct {
         int timeout;
         int interrupts;
         uint  polling_interval;
+        unsigned long buffer_mem;
+        long long buffer_mem_max;
         int eager_limit;
         int use_token_flow_control;
         char wait_mode[8];
@@ -133,7 +135,7 @@ typedef struct {
     long lateArrivals;       /* Count of msgs received for which a recv
                                         was posted                      */
     long unorderedMsgs;      /* Total number of out of order msgs  */
-    long pamid_reserve_11;
+    long buffer_mem_hwmark;
     long pamid_reserve_10;
     long pamid_reserve_9;
     long pamid_reserve_8;
@@ -151,7 +153,6 @@ extern MPIX_stats_t *mpid_statp;
 extern int   prtStat;
 extern int   prtEnv;
 extern void set_mpich_env(int *,int*);
-extern int numTasks;
 extern void MPIDI_open_pe_extension();
 extern void MPIDI_close_pe_extension();
 extern MPIDI_Statistics_write(FILE *);
diff --git a/src/mpid/pamid/src/mpid_finalize.c b/src/mpid/pamid/src/mpid_finalize.c
index 6e1d212..177973a 100644
--- a/src/mpid/pamid/src/mpid_finalize.c
+++ b/src/mpid/pamid/src/mpid_finalize.c
@@ -94,15 +94,16 @@ int MPID_Finalize()
 
 #ifdef MPIDI_TRACE
  {  int i;
-  for (i=0; i< numTasks; i++) {
-      if (MPIDI_In_cntr[i].R)
-          MPIU_Free(MPIDI_In_cntr[i].R);
-      if (MPIDI_In_cntr[i].PR)
-          MPIU_Free(MPIDI_In_cntr[i].PR);
-      if (MPIDI_Out_cntr[i].S)
-          MPIU_Free(MPIDI_Out_cntr[i].S);
+  for (i=0; i< MPIDI_Process.numTasks; i++) {
+      if (MPIDI_Trace_buf[i].R)
+          MPIU_Free(MPIDI_Trace_buf[i].R);
+      if (MPIDI_Trace_buf[i].PR)
+          MPIU_Free(MPIDI_Trace_buf[i].PR);
+      if (MPIDI_Trace_buf[i].S)
+          MPIU_Free(MPIDI_Trace_buf[i].S);
   }
  }
+ MPIU_Free(MPIDI_Trace_buf);
 #endif
 
 #ifdef OUT_OF_ORDER_HANDLING
diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index fe6f5dc..f493198 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -102,6 +102,7 @@ MPIDI_Process_t  MPIDI_Process = {
   .disable_internal_eager_scale = MPIDI_DISABLE_INTERNAL_EAGER_SCALE,
 #if TOKEN_FLOW_CONTROL
   .mp_buf_mem          = BUFFER_MEM_DEFAULT,
+  .mp_buf_mem_max      = BUFFER_MEM_DEFAULT,
   .is_token_flow_control_on = 0,
 #endif
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
@@ -120,6 +121,7 @@ MPIDI_Process_t  MPIDI_Process = {
   },
 
   .mpir_nbc              = 0,
+  .numTasks              = 0,
 };
 
 
@@ -364,9 +366,7 @@ MPIDI_PAMI_context_init(int* threading, int *size)
 {
   int requested_thread_level;
   requested_thread_level = *threading;
-#ifdef OUT_OF_ORDER_HANDLING
-  extern int numTasks;
-#endif
+  int  numTasks;
 
 #if (MPIU_THREAD_GRANULARITY == MPIU_THREAD_GRANULARITY_PER_OBJECT)
   /*
@@ -455,8 +455,8 @@ MPIDI_PAMI_context_init(int* threading, int *size)
 
   TRACE_ERR ("Thread-level=%d, requested=%d\n", *threading, requested_thread_level);
 
+  MPIDI_Process.numTasks= numTasks = PAMIX_Client_query(MPIDI_Client, PAMI_CLIENT_NUM_TASKS).value.intval;
 #ifdef OUT_OF_ORDER_HANDLING
-  numTasks  = PAMIX_Client_query(MPIDI_Client, PAMI_CLIENT_NUM_TASKS).value.intval;
   MPIDI_In_cntr = MPIU_Calloc0(numTasks, MPIDI_In_cntr_t);
   if(MPIDI_In_cntr == NULL)
     MPID_abort();
@@ -467,21 +467,6 @@ MPIDI_PAMI_context_init(int* threading, int *size)
   memset((void *) MPIDI_Out_cntr,0, sizeof(MPIDI_Out_cntr_t));
 #endif
 
-if (TOKEN_FLOW_CONTROL_ON)
-  {
-    #if TOKEN_FLOW_CONTROL
-    int i;
-    MPIDI_mm_init(numTasks,&MPIDI_Process.pt2pt.limits.application.eager.remote,&MPIDI_Process.mp_buf_mem);
-    MPIDI_Token_cntr = MPIU_Calloc0(numTasks, MPIDI_Token_cntr_t);
-    memset((void *) MPIDI_Token_cntr,0, (sizeof(MPIDI_Token_cntr_t) * numTasks));
-    for (i=0; i < numTasks; i++)
-      {
-        MPIDI_Token_cntr[i].tokens=MPIDI_tfctrl_enabled;
-      }
-    #else
-    MPID_assert_always(0);
-    #endif
-}
 
 #ifdef MPIDI_TRACE
       int i; 
@@ -621,6 +606,22 @@ MPIDI_PAMI_dispath_init()
 
   if (MPIDI_Process.pt2pt.limits.internal.immediate.local > send_immediate_max_bytes)
     MPIDI_Process.pt2pt.limits.internal.immediate.local = send_immediate_max_bytes;
+
+  if (TOKEN_FLOW_CONTROL_ON)
+     {
+       #if TOKEN_FLOW_CONTROL
+        int i;
+        MPIDI_mm_init(MPIDI_Process.numTasks,&MPIDI_Process.pt2pt.limits.application.eager.remote,&MPIDI_Process.mp_buf_mem);
+        MPIDI_Token_cntr = MPIU_Calloc0(MPIDI_Process.numTasks, MPIDI_Token_cntr_t);
+        memset((void *) MPIDI_Token_cntr,0, (sizeof(MPIDI_Token_cntr_t) * MPIDI_Process.numTasks));
+        for (i=0; i < MPIDI_Process.numTasks; i++)
+        {
+          MPIDI_Token_cntr[i].tokens=MPIDI_tfctrl_enabled;
+        }
+        #else
+         MPID_assert_always(0);
+        #endif
+     }
 }
 
 
@@ -670,8 +671,11 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
              "  rma_pending           : %u\n"
              "  shmem_pt2pt           : %u\n"
              "  disable_internal_eager_scale : %u\n"
+#if TOKEN_FLOW_CONTROL
              "  mp_buf_mem               : %u\n"
+             "  mp_buf_mem_max           : %u\n"
              "  is_token_flow_control_on : %u\n"
+#endif
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
              "  mp_infolevel : %u\n"
              "  mp_statistics: %u\n"
@@ -681,7 +685,8 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
              "  optimized.collectives : %u\n"
              "  optimized.select_colls: %u\n"
              "  optimized.subcomms    : %u\n"
-             "  mpir_nbc              : %u\n",
+             "  mpir_nbc              : %u\n" 
+             "  numTasks              : %u\n",
              MPIDI_Process.verbose,
              MPIDI_Process.statistics,
              MPIDI_Process.avail_contexts,
@@ -700,10 +705,8 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
              MPIDI_Process.disable_internal_eager_scale,
 #if TOKEN_FLOW_CONTROL             
              MPIDI_Process.mp_buf_mem,
+             MPIDI_Process.mp_buf_mem_max,
              MPIDI_Process.is_token_flow_control_on,
-#else
-             0,
-             0,
 #endif
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
              MPIDI_Process.mp_infolevel,
@@ -714,7 +717,8 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
              MPIDI_Process.optimized.collectives,
              MPIDI_Process.optimized.select_colls,
              MPIDI_Process.optimized.subcomms,
-             MPIDI_Process.mpir_nbc);
+             MPIDI_Process.mpir_nbc, 
+             MPIDI_Process.numTasks);
       switch (*threading)
         {
           case MPI_THREAD_MULTIPLE:
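The hunks above move the token flow control setup out of context init and into `MPIDI_PAMI_dispath_init()` (spelling as in the source), after the eager limits are finalized, keyed off the newly cached `MPIDI_Process.numTasks`. A minimal sketch of that per-task counter initialization (`struct token_cntr` and `init_token_cntrs` are simplified stand-ins for `MPIDI_Token_cntr_t` and the `MPIU_Calloc0` loop, not the real pamid API; note `calloc` already zeroes the array, so the explicit `memset` in the patch is redundant):

```c
#include <stdlib.h>

/* Stand-in for MPIDI_Token_cntr_t: one counter per task. */
struct token_cntr { unsigned tokens; };

/* Allocate one zeroed counter per task and seed each with the
 * enabled number of flow-control tokens, as the loop above does
 * with MPIDI_tfctrl_enabled.  Returns NULL on allocation failure. */
static struct token_cntr *init_token_cntrs(int num_tasks, unsigned tokens_each)
{
  struct token_cntr *c = calloc((size_t)num_tasks, sizeof(*c));
  if (c == NULL) return NULL;
  for (int i = 0; i < num_tasks; i++)
    c[i].tokens = tokens_each;
  return c;
}
```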
diff --git a/src/mpid/pamid/src/mpidi_bufmm.c b/src/mpid/pamid/src/mpidi_bufmm.c
index 213f434..4414359 100644
--- a/src/mpid/pamid/src/mpidi_bufmm.c
+++ b/src/mpid/pamid/src/mpidi_bufmm.c
@@ -127,8 +127,8 @@ static size_t flex_size;                  /* size for flex slot           */
 static char  *buddy_heap_ptr;             /* ptr points to beg. of buddy  */
 static char  *end_heap_ptr;               /* ptr points to end of buddy   */
 static char  *heap;                       /* begin. address of flex stack */
-static uint mem_inuse;                    /* memory in use                */
-static uint mem_hwmark;                   /* highest memory usage         */
+static long mem_inuse;                    /* memory in use                */
+long mem_hwmark;                          /* highest memory usage         */
 
 
 static int  sizetable[TAB_SIZE + 1];      /* from bucket to size            */
@@ -158,7 +158,10 @@ void MPIDI_calc_tokens(int nTasks,uint *eager_limit_in, unsigned long *buf_mem_i
  int  i;
 
        /* Round up passed eager limit to power of 2 */
-    buf_mem_max= *buf_mem_in;
+    if (MPIDI_Process.mp_buf_mem_max > *buf_mem_in) 
+        buf_mem_max = MPIDI_Process.mp_buf_mem_max;
+    else 
+        buf_mem_max= *buf_mem_in;
 
     if (*eager_limit_in != 0) {
        for (val=1 ; val < *eager_limit_in ; val *= 2);
@@ -196,7 +199,7 @@ void MPIDI_calc_tokens(int nTasks,uint *eager_limit_in, unsigned long *buf_mem_i
                  val = MIN_BUF_BKT_SIZE;
                  buf_mem_max = new_buf_mem_max;
                  if ( application_set_buf_mem ) {
-                     printf("informational messge \n"); fflush(stdout);
+                     TRACE_ERR("informational messge \n");
                  }
              }
              else {
@@ -211,8 +214,7 @@ void MPIDI_calc_tokens(int nTasks,uint *eager_limit_in, unsigned long *buf_mem_i
        if ( *eager_limit_in != val ) {
           if ( application_set_eager_limit && (*eager_limit_in > val)) {
              /* Only give warning on reduce. */
-             printf("warning message if eager limit is reduced \n"); fflush(stdout);
-             fflush(stderr);
+             printf("ATTENTION: eager limit is reduced from %d to %d \n",*eager_limit_in,val); fflush(stdout);
           }
           *eager_limit_in = val;
 
@@ -223,7 +225,7 @@ void MPIDI_calc_tokens(int nTasks,uint *eager_limit_in, unsigned long *buf_mem_i
           sprintf(EagerLimit, "MP_EAGER_LIMIT=%d",val);
           rc = putenv(EagerLimit);
           if (rc !=0) {
-              printf("PUTENV with Eager Limit failed \n"); fflush(stdout);
+              TRACE_ERR("PUTENV with Eager Limit failed \n"); 
           }
        }
       }
@@ -235,6 +237,12 @@ void MPIDI_calc_tokens(int nTasks,uint *eager_limit_in, unsigned long *buf_mem_i
     /* user may want to set MP_EAGER_LIMIT to 0 or less than 256 */
     if (*eager_limit_in < MPIDI_Process.pt2pt.limits.application.immediate.remote)
         MPIDI_Process.pt2pt.limits.application.immediate.remote= *eager_limit_in;
+    if (*eager_limit_in < MPIDI_Process.pt2pt.limits.application.eager.remote)
+        MPIDI_Process.pt2pt.limits.application.eager.remote= *eager_limit_in;
+#   ifdef DUMP_MM
+     printf("MPIDI_tfctrl_enabled=%d eager_limit=%d  buf_mem=%d  buf_mem_max=%d\n",
+             MPIDI_tfctrl_enabled,*eager_limit_in,*buf_mem_in,buf_mem_max); fflush(stdout);
+#   endif 
 
 }
 
@@ -371,9 +379,9 @@ static void MPIDI_init_buddy(unsigned long buf_mem)
     size = (size == 0) ? 1 : (size > MAX_BUDDIES) ? MAX_BUDDIES : size;
     MPIDI_alloc_buddies(size,&space);
     if ( space == NO ) {
-        printf("ERROR  line=%d\n",__LINE__); fflush(stdout);
+        TRACE_ERR("out of memory %s(%d)\n",__FILE__,__LINE__); 
+        MPID_abort();
     }
-/*    printf("MPI-MM flex=%ld  #buddy=%ld\n",flex_size,size); */
 }
 
 
@@ -601,29 +609,47 @@ static void MPIDI_buddy_free(void *ptr)
    bud->free =1;
    MPIDI_add_head(bud,bud->bucket);
 }
+#  ifdef TRACE
+   int nAllocs =0;   /* number of times MPIDI_mm_alloc() is called */
+   int nFree =0;   /* number of times MPIDI_mm_free() is called */
+   int nM=0;       /* number of times MPIU_Malloc() is called   */
+   int nF=0;       /* number of times MPIU_Free() is called     */
+#  endif 
 void *MPIDI_mm_alloc(size_t size)
 {
    void *pt;
    int bucket,tmp;
-   int  nTimes=0;
 
    MPID_assert(size <= max_size);
    tmp = NORMSIZE(size);
    tmp =bucket =sizetrans[tmp];
    if(bucket >flex_count || (pt =MPIDI_flex_alloc(tmp)) ==NULL) {
       pt =MPIDI_buddy_alloc(bucket);
-      nTimes++;
+      if (MPIDI_Process.mp_statistics) {
+          mem_inuse = mem_inuse + sizetable[tmp];
+          if (mem_inuse > mem_hwmark) {
+              mem_hwmark = mem_inuse;
+          }
+       }
    }
    if (pt == NULL) {
+       MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
        pt=MPIU_Malloc(size);
-       if (MPIDI_Process.statistics) {
+       MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+       if (MPIDI_Process.mp_statistics) {
            mem_inuse = mem_inuse + sizetable[tmp];
-          if (mem_inuse > mem_hwmark)
+          if (mem_inuse > mem_hwmark) {
              mem_hwmark = mem_inuse;
+          }
        }
+#      ifdef TRACE
+       nM++;
        if (pt == NULL) {
            printf("ERROR  line=%d\n",__LINE__); fflush(stdout);
+       } else {
+         printf("malloc nM=%d size=%d pt=0x%p \n",nM,size,pt); fflush(stdout);
        }
+#      endif
    }
 #  ifdef TRACE
    printf("MPIDI_mm_alloc(%4d): %p\n",size,pt);
@@ -637,8 +663,8 @@ void MPIDI_mm_free(void *ptr, size_t size)
    int tmp,bucket;
 
    if (size > MAX_SIZE) {
-      printf("ERROR  line=%d\n",__LINE__); fflush(stdout);
-      exit(1);
+      TRACE_ERR("Out of memory in %s(%d)\n",__FILE__,__LINE__);
+      MPID_abort(); 
    }
    if ((ptr >= (void *) heap) && (ptr < (void *)end_heap_ptr)) {
      if(*((char *)ptr -OVERHEAD) ==FLEX){
@@ -646,8 +672,34 @@ void MPIDI_mm_free(void *ptr, size_t size)
      }
      else
         MPIDI_buddy_free(ptr);
+     if (MPIDI_Process.mp_statistics) {
+         tmp = NORMSIZE(size);
+         bucket =sizetrans[tmp];
+         mem_inuse = mem_inuse - sizetable[bucket];
+         if (mem_inuse > mem_hwmark) {
+             mem_hwmark = mem_inuse;
+         }
+     }
    } else {
-       printf("ERROR free %s(%d)\n",__FILE__,__LINE__); fflush(stdout);
+     if (!ptr) {
+        TRACE_ERR("NULL ptr passed MPIDI_mm_free() in %s(%d)\n",__FILE__,__LINE__);
+        MPID_abort();
+     }
+     if (MPIDI_Process.mp_statistics) {
+        tmp = NORMSIZE(size);
+        bucket =sizetrans[tmp];
+         mem_inuse = mem_inuse - sizetable[bucket];
+         if (mem_inuse > mem_hwmark) {
+             mem_hwmark = mem_inuse;
+         }
+     }
+     MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+#    ifdef TRACE
+     nF++;
+     printf("free nF=%d size=%d ptr=0x%p \n",nF,sizetable[bucket],ptr); fflush(stdout);
+#    endif
+     MPIU_Free(ptr);
+     MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
    }
 #  ifdef TRACE
    nFrees++;
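The changes above make `MPIDI_mm_alloc()`/`MPIDI_mm_free()` track a signed in-use byte count and record its peak in `mem_hwmark` whenever `mp_statistics` is on; the peak is later reported as `buffer_mem_hwmark` by MPIDI_Statistics_write(). The bookkeeping pattern, reduced to a sketch with stand-in names (`account_alloc`/`account_free` and the `_sim` counters are illustrative, not pamid symbols):

```c
/* In-use counter and its high-water mark, as in mem_inuse/mem_hwmark. */
static long mem_inuse_sim  = 0;
static long mem_hwmark_sim = 0;

/* Every allocation bumps the in-use count; the peak follows it
 * monotonically, exactly as the patch does after each bucket alloc. */
static void account_alloc(long size)
{
  mem_inuse_sim += size;
  if (mem_inuse_sim > mem_hwmark_sim)
    mem_hwmark_sim = mem_inuse_sim;
}

/* Frees only lower the in-use count; the peak is never decreased. */
static void account_free(long size)
{
  mem_inuse_sim -= size;
}
```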
diff --git a/src/mpid/pamid/src/mpidi_env.c b/src/mpid/pamid/src/mpidi_env.c
index 427a042..7f97dac 100644
--- a/src/mpid/pamid/src/mpidi_env.c
+++ b/src/mpid/pamid/src/mpidi_env.c
@@ -331,7 +331,7 @@
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
 int prtStat=0;
 int prtEnv=0;
-int numTasks=0;
+
 MPIX_stats_t *mpid_statp=NULL;
 extern MPIDI_printenv_t  *mpich_env;
 #endif
@@ -339,7 +339,7 @@ extern MPIDI_printenv_t  *mpich_env;
 #define ENV_Deprecated(a, b, c, d, e) ENV_Deprecated__(a, b, c, d, e)
 
 #ifdef TOKEN_FLOW_CONTORL
- extern void MPIDI_get_buf_mem(unsigned long *);
+ extern int MPIDI_get_buf_mem(unsigned long *);
  extern int MPIDI_atoi(char* , int* );
 #endif
  extern int application_set_eager_limit;
@@ -708,7 +708,9 @@ MPIDI_Env_setup(int rank, int requested)
   }
   /* Determine buffer memory for early arrivals */
   {
-    MPIDI_get_buf_mem(&MPIDI_Process.mp_buf_mem);
+    int rc;
+    rc=MPIDI_get_buf_mem(&MPIDI_Process.mp_buf_mem,&MPIDI_Process.mp_buf_mem_max);
+    MPID_assert_always(rc == MPI_SUCCESS);
   }
 #endif
 
@@ -1026,11 +1028,12 @@ MPIDI_Env_setup(int rank, int requested)
     }
 }
 
-
+#if TOKEN_FLOW_CONTROL
 int  MPIDI_set_eager_limit(unsigned int *eager_limit)
 {
      char *cp;
      int  val;
+     int  numTasks=MPIDI_Process.numTasks;
      cp = getenv("MP_EAGER_LIMIT");
      if (cp)
        {
@@ -1038,21 +1041,54 @@ int  MPIDI_set_eager_limit(unsigned int *eager_limit)
          if ( MPIDI_atoi(cp, &val) == 0 )
            *eager_limit=val;
        }
+     else
+       {
+        /*  set default
+        *  Number of tasks      MP_EAGER_LIMIT
+        * -----------------     --------------
+        *      1  -    256         32768
+        *    257  -    512         16384
+        *    513  -   1024          8192
+        *   1025  -   2048          4096
+        *   2049  -   4096          2048
+        *   4097  &  above          1024
+        */
+       if      (numTasks <  257) *eager_limit = 32768;
+       else if (numTasks <  513) *eager_limit = 16384;
+       else if (numTasks < 1025) *eager_limit =  8192;
+       else if (numTasks < 2049) *eager_limit =  4096;
+       else if (numTasks < 4097) *eager_limit =  2048;
+       else                      *eager_limit =  1024;
+
+       }
      return 0;
 }
 
-#if TOKEN_FLOW_CONTROL
-   /*****************************************************************/
-   /* Check for MP_BUFFER_MEM, if the value is not set by the user, */
-   /* then set the value with the default of 64 MB.                 */
-   /*****************************************************************/
-int  MPIDI_get_buf_mem(unsigned long *buf_mem) {
+   /******************************************************************/
+   /*                                                                */
+   /* Check for MP_BUFFER_MEM, if the value is not set by the user,  */
+   /* then set the value with the default of 64 MB.                  */
+   /* MP_BUFFER_MEM supports the following format:                   */
+   /* MP_BUFFER_MEM=xxM                                              */
+   /* MP_BUFFER_MEM=xxM,yyyM                                         */
+   /* MP_BUFFER_MEM=xxM,yyyG                                         */
+   /* MP_BUFFER_MEM=,yyyM                                            */
+   /* xx:  pre allocated size  the max. allowable value is 256 MB    */
+   /*      the space is allocated during the initialization.         */
+   /*      the default is 64 MB                                      */
+   /* yyy: maximum size - the maximum size to which the early arrival*/
+   /*      buffer can temporarily grow when the preallocated portion */
+   /*      of the EA buffer has been filled.                         */
+   /*                                                                */
+   /******************************************************************/
+int  MPIDI_get_buf_mem(unsigned long *buf_mem,unsigned long *buf_mem_max)
+    {
      char *cp;
      int  i;
      int args_in_error=0;
      char pre_alloc_buf[25], buf_max[25];
      char *buf_max_cp;
-     int pre_alloc_val;
+     int pre_alloc_val=0;
      unsigned long buf_max_val;
      int  has_error = 0;
 
@@ -1060,8 +1096,38 @@ int  MPIDI_get_buf_mem(unsigned long *buf_mem) {
          pre_alloc_buf[24] = '\0';
          buf_max[24] = '\0';
          if ( (buf_max_cp = strchr(cp, ',')) ) {
-              printf("No max buffer mem support in MPICH2 \n"); fflush(stdout);
-         } else {
+           if ( *(++buf_max_cp)  == '\0' ) {
+              /* Error: missing buffer_mem_max */
+              has_error = 1;
+           }
+           else if ( cp[0] == ',' ) {
+              /* Pre_alloc value is default -- use default   */
+              pre_alloc_val = -1;
+              strncpy(buf_max, buf_max_cp, 24);
+              if ( MPIDI_atoll(buf_max, &buf_max_val) != 0 )
+                 has_error = 1;
+           }
+           else {
+              /* both values are present */
+              for (i=0; ; i++ ) {
+                 if ( (cp[i] != ',') && (i<24) )
+                    pre_alloc_buf[i] = cp[i];
+                 else {
+                    pre_alloc_buf[i] = '\0';
+                    break;
+                 }
+              }
+              strncpy(buf_max, buf_max_cp, 24);
+              if ( MPIDI_atoi(pre_alloc_buf, &pre_alloc_val) == 0 ) {
+                 if ( MPIDI_atoll(buf_max, &buf_max_val) != 0 )
+                    has_error = 1;
+              }
+              else
+                 has_error = 1;
+           }
+        }
+        else
+         {
             /* Old single value format  */
             if ( MPIDI_atoi(cp, &pre_alloc_val) == 0 )
                buf_max_val = (unsigned long)pre_alloc_val;
@@ -1069,17 +1135,22 @@ int  MPIDI_get_buf_mem(unsigned long *buf_mem) {
                has_error = 1;
          }
          if ( has_error == 0) {
-              *buf_mem     = (int) pre_alloc_val;
+             if ((int) pre_alloc_val != -1)  /* MP_BUFFER_MEM=,128MB  */
+                 *buf_mem     = (int) pre_alloc_val;
              if (buf_max_val > ONE_SHARED_SEGMENT)
                  *buf_mem = ONE_SHARED_SEGMENT;
+             if (buf_max_val  > *buf_mem_max)
+                  *buf_mem_max = buf_max_val;
          } else {
             args_in_error += 1;
-            printf("ERROR in MP_BUFFER_MEM %s(%d)\n",__FILE__,__LINE__); fflush(stdout);
+            TRACE_ERR("ERROR in MP_BUFFER_MEM %s(%d)\n",__FILE__,__LINE__);
+            return 1;
          }
      } else {
          /* MP_BUFFER_MEM is not specified by the user*/
          *buf_mem     = BUFFER_MEM_DEFAULT;
+         TRACE_ERR("buffer_mem=%d  buffer_mem_max=%d\n",*buf_mem,*buf_mem_max);
+         return 0;
      }
-  return 0;
 }
 #endif
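When MP_EAGER_LIMIT is unset, the new `else` branch in `MPIDI_set_eager_limit()` above picks a default that steps down as the job grows, per the table in the comment. A sketch of that ladder (`default_eager_limit` is an illustrative helper, not a pamid function):

```c
/* Default MP_EAGER_LIMIT by task count, mirroring the table:
 *      1 -  256 -> 32768        1025 - 2048 ->  4096
 *    257 -  512 -> 16384        2049 - 4096 ->  2048
 *    513 - 1024 ->  8192        4097 & above -> 1024   */
static unsigned default_eager_limit(int num_tasks)
{
  if      (num_tasks <  257) return 32768;
  else if (num_tasks <  513) return 16384;
  else if (num_tasks < 1025) return  8192;
  else if (num_tasks < 2049) return  4096;
  else if (num_tasks < 4097) return  2048;
  else                       return  1024;
}
```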
diff --git a/src/mpid/pamid/src/mpidi_util.c b/src/mpid/pamid/src/mpidi_util.c
index 60892a8..fceebe2 100644
--- a/src/mpid/pamid/src/mpidi_util.c
+++ b/src/mpid/pamid/src/mpidi_util.c
@@ -66,6 +66,8 @@ void MPIDI_Set_mpich_env(int rank, int size) {
                     mpich_env->retransmit_interval); /* microseconds */
             rc = putenv(polling_buf);
      }
+     mpich_env->buffer_mem=MPIDI_Process.mp_buf_mem;
+     mpich_env->buffer_mem_max=MPIDI_Process.mp_buf_mem_max;
 }
 
 
@@ -598,6 +600,8 @@ int MPIDI_Print_mpenv(int rank,int size)
         sender.eager_limit = mpich_env->eager_limit;
         sender.use_token_flow_control=MPIDI_Process.is_token_flow_control_on;
         sender.retransmit_interval = mpich_env->retransmit_interval;
+        sender.buffer_mem = mpich_env->buffer_mem;
+        sender.buffer_mem_max = mpich_env->buffer_mem_max;
 
         /* Get shared memory  */
         sender.shmem_pt2pt = MPIDI_Process.shmem_pt2pt;
@@ -685,7 +689,7 @@ int MPIDI_Print_mpenv(int rank,int size)
         #ifdef _DEBUG
         printf("task_count = %d\n", task_count);
         printf("calling _mpi_gather(%p,%d,%d,%p,%d,%d,%d,%d,%p,%d)\n",
-            &sender,sizeof(_printenv_t),MPI_BYTE,gatherer,sizeof(_printenv_t),MPI_BYTE,
+            &sender,sizeof(MPIDI_printenv_t),MPI_BYTE,gatherer,sizeof(MPIDI_printenv_t),MPI_BYTE,
             0, MPI_COMM_WORLD,NULL,0);
         fflush(stdout);
         #endif
@@ -738,6 +742,8 @@ int MPIDI_Print_mpenv(int rank,int size)
                 MATCHI(timeout,"Connection Timeout (MP_TIMEOUT/sec):");
                 MATCHB(interrupts,"Adapter Interrupts Enabled (MP_CSS_INTERRUPT):");
                 MATCHI(polling_interval,"Polling Interval (MP_POLLING_INTERVAL/usec):");
+                MATCHI(buffer_mem,"Buffer Memory (MP_BUFFER_MEM/Bytes):");
+                MATCHLL(buffer_mem_max,"Max. Buffer Memory (MP_BUFFER_MEM_MAX/Bytes):");
                 MATCHI(eager_limit,"Message Eager Limit (MP_EAGER_LIMIT/Bytes):");
                 MATCHI(use_token_flow_control,"Use token flow control:");
                 MATCHC(wait_mode,"Message Wait Mode(MP_WAIT_MODE):",8);
@@ -1148,3 +1154,105 @@ int MPIDI_atoi(char* str_in, unsigned int* val)
 
    return retval;
 }
+
+ /***************************************************************************
+  Name:           MPIDI_checkll()
+  
+  Function:       Determine whether a given number and units
+                  value are valid. If they are valid, the
+                  multiplication of the number and the units
+                  will be returned as a long long. If the
+                  number and units are invalid, a 1 will be returned.
+
+  Description:    if units is G
+                    multiplier is 1G
+                  else if units is M
+                    multiplier is 1M
+                  else if units is K
+                    multiplier is 1K
+                  else
+                    return error
+
+                    multiply value by multiplier
+                    return result
+  Parameters:     A0 = MPIDI_checkll(A1, A2, A3)
+
+                  A1    given value                   int
+                  A2    given units                   char *
+                  A3    result                        long long *
+
+                  A0    Return Code                   int
+
+  Return Codes:   0 OK
+                  1 bad value
+ ***************************************************************************/
+
+int MPIDI_checkll(int myval, char myunits, long long *mygoodval)
+{
+  int multiplier;                   /* units multiplier for entered value */
+
+  if (myunits == 'G') {             /* if units is G */
+     multiplier = ONEG;
+  }
+  else if (myunits == 'M') {        /* if units is M */
+     multiplier = ONEM;
+  }
+  else if (myunits == 'K') {        /* if units is K */
+     multiplier = ONEK;
+  }
+  else
+     return 1;                      /* Unknown unit */
+
+  *mygoodval = (long long) myval * multiplier;  /* do multiplication */
+  return 0;                         /* good return */
+}
+
+
+int MPIDI_atoll(char* str_in, long long* val)
+{
+   char tempbuf[256];
+   char size_mult;                 /* multiplier for size strings */
+   int  i, tempval;
+   int  letter=0, retval=0;
+
+   /*****************************************/
+   /* Check for letter, if none, MPIDI_atoi */
+   /*****************************************/
+   for (i=0; i<strlen(str_in); i++) {
+      if (!isdigit(str_in[i])) {
+         letter = 1;
+         break;
+      }
+   }
+   if (!letter) {   /* only digits */
+      errno = 0;
+      *val = atoll(str_in);
+      if (errno) {
+         retval = errno;
+      }
+   }
+   else {
+      /***********************************/
+      /* Check for K or M.               */
+      /***********************************/
+      MPIDI_toupper(str_in);
+      retval= MPIDI_scan_str3(str_in, 'G', 'M', 'K', &size_mult, tempbuf);
+
+      if ( retval == 0) {
+         tempval = atoi(tempbuf);
+
+         /***********************************/
+         /* If 0 K or 0 M entered, set to 0 */
+         /* otherwise, do conversion.       */
+         /***********************************/
+         if (tempval != 0)
+            retval = MPIDI_checkll(tempval, size_mult, val);
+         else
+            *val = 0;
+      }
+   }
+
+   return retval;
+}
+
+
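`MPIDI_atoll()` above defers the suffix handling to `MPIDI_checkll()`, which maps G/M/K to a multiplier and rejects anything else. A compact sketch of that mapping (`checkll` here is illustrative; the real ONEK/ONEM/ONEG constants come from the pamid headers and are assumed to be the usual powers of two):

```c
/* Map a unit suffix to its multiplier and scale the value.
 * Returns 0 on success, 1 on an unknown unit, like MPIDI_checkll(). */
static int checkll(int myval, char myunits, long long *mygoodval)
{
  long long multiplier;
  switch (myunits) {
    case 'G': multiplier = 1LL << 30; break;   /* assumed ONEG */
    case 'M': multiplier = 1LL << 20; break;   /* assumed ONEM */
    case 'K': multiplier = 1LL << 10; break;   /* assumed ONEK */
    default:  return 1;                        /* unknown unit */
  }
  *mygoodval = (long long)myval * multiplier;
  return 0;
}
```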
diff --git a/src/mpid/pamid/src/mpix/mpix.c b/src/mpid/pamid/src/mpix/mpix.c
index 5dca860..5be3505 100644
--- a/src/mpid/pamid/src/mpix/mpix.c
+++ b/src/mpid/pamid/src/mpix/mpix.c
@@ -243,9 +243,11 @@ MPIDI_Statistics_write(FILE *statfile) {
     long long Tot_pkt_recv_cnt=0;
     long long Tot_data_sent=0;
     long long Tot_data_recv=0;
+    extern long mem_hwmark;
 
     memset(&time_buf,0, 201);
     sprintf(time_buf, __DATE__" "__TIME__);
+    mpid_statp->buffer_mem_hwmark =  mem_hwmark;
     mpid_statp->sendWaitsComplete =  mpid_statp->sends - mpid_statp->sendsComplete;
     fprintf(statfile,"Start of task (pid=%d) statistics at %s \n", getpid(), time_buf);
     fprintf(statfile, "MPICH: sends = %ld\n", mpid_statp->sends);
@@ -257,6 +259,7 @@ MPIDI_Statistics_write(FILE *statfile) {
     fprintf(statfile, "MPICH: earlyArrivalsMatched = %ld\n", mpid_statp->earlyArrivalsMatched);
     fprintf(statfile, "MPICH: lateArrivals = %ld\n", mpid_statp->lateArrivals);
     fprintf(statfile, "MPICH: unorderedMsgs = %ld\n", mpid_statp->unorderedMsgs);
+    fprintf(statfile, "MPICH: buffer_mem_hwmark = %ld\n", mpid_statp->buffer_mem_hwmark);
     fflush(statfile);
     memset(&query_stat,0, sizeof(query_stat));
     query_stat.name =  (pami_attribute_name_t)PAMI_CONTEXT_STATISTICS ;
diff --git a/src/mpid/pamid/src/pt2pt/mpidi_done.c b/src/mpid/pamid/src/pt2pt/mpidi_done.c
index 9864322..0b5dd7a 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_done.c
+++ b/src/mpid/pamid/src/pt2pt/mpidi_done.c
@@ -35,7 +35,7 @@ MPIDI_SendDoneCB(pami_context_t   context,
 {
 #ifdef MPIDI_TRACE
   MPID_Request * req = (MPID_Request *) clientdata;
-  MPIDI_Out_cntr[(req->mpid.partner_id)].S[(req->mpid.idx)].sendComp=1;
+  MPIDI_Trace_buf[(req->mpid.partner_id)].S[(req->mpid.idx)].sendComp=1;
 #endif
   MPIDI_SendDoneCB_inline(context,
                           clientdata,
@@ -198,8 +198,9 @@ void MPIDI_Recvq_process_out_of_order_msgs(pami_task_t src, pami_context_t conte
           }
 
 #ifdef MPIDI_TRACE
-       MPIDI_In_cntr[src].R[(rreq->mpid.idx)].matchedInOOL=1;
-       MPIDI_In_cntr[src].R[(rreq->mpid.idx)].rlen=dt_size;
+       rreq->mpid.idx = ooreq->mpid.idx;
+       MPIDI_Trace_buf[src].R[(rreq->mpid.idx)].matchedInOOL=1;
+       MPIDI_Trace_buf[src].R[(rreq->mpid.idx)].rlen=dt_size;
 #endif
         ooreq->comm = rreq->comm;
         MPIR_Comm_add_ref(ooreq->comm);

http://git.mpich.org/mpich.git/commitdiff/1753acded0dbe6c634b73b5610c5735b8013f08c

commit 1753acded0dbe6c634b73b5610c5735b8013f08c
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Thu Nov 29 13:42:14 2012 -0600

    set bgq default eager limits for intra- and inter-node to 4097.
    
    (ibm) Issue 9102
    (ibm) c1971e58c8a57036b5ab24849916f90e507bfc83
    
    Signed-off-by: Haizhu Liu <haizhu at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_platform.h b/src/mpid/pamid/include/mpidi_platform.h
index 1917b39..e706e03 100644
--- a/src/mpid/pamid/include/mpidi_platform.h
+++ b/src/mpid/pamid/include/mpidi_platform.h
@@ -67,7 +67,9 @@
 
 #ifdef __BGQ__
 #undef  MPIDI_EAGER_LIMIT_LOCAL
-#define MPIDI_EAGER_LIMIT_LOCAL  64
+#define MPIDI_EAGER_LIMIT_LOCAL  4097
+#undef  MPIDI_EAGER_LIMIT
+#define MPIDI_EAGER_LIMIT  4097
 #undef  MPIDI_DISABLE_INTERNAL_EAGER_SCALE
 #define MPIDI_DISABLE_INTERNAL_EAGER_SCALE (512*1024)
 #define MPIDI_MAX_THREADS     64

http://git.mpich.org/mpich.git/commitdiff/53f6e9344deb468e2d1c244a8936928f101cf5fb

commit 53f6e9344deb468e2d1c244a8936928f101cf5fb
Author: Haizhu Liu <haizhu at us.ibm.com>
Date:   Wed Nov 7 20:56:27 2012 -0500

    Dynamic tasking support
    
    (ibm) F182392
    (ibm) 084d46ff209380813e5a8822ae6b4c517fc1dc42
    (ibm) 910067ee7018c6616568457a80644704271b3b82
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_datatypes.h b/src/mpid/pamid/include/mpidi_datatypes.h
index 810f242..55b1f21 100644
--- a/src/mpid/pamid/include/mpidi_datatypes.h
+++ b/src/mpid/pamid/include/mpidi_datatypes.h
@@ -131,6 +131,10 @@ typedef struct
   } perobj;                  /**< This structure is only used in the 'perobj' mpich lock mode. */
 
   unsigned mpir_nbc;         /**< Enable MPIR_* non-blocking collectives implementations. */
+#ifdef DYNAMIC_TASKING
+  struct MPIDI_PG_t * my_pg; /**< Process group I belong to */
+  int                 my_pg_rank; /**< Rank in process group */
+#endif
 } MPIDI_Process_t;
 
 
@@ -145,6 +149,9 @@ enum
     MPIDI_Protocols_WinCtrl,
     MPIDI_Protocols_WinAccum,
     MPIDI_Protocols_RVZ_zerobyte,
+#ifdef DYNAMIC_TASKING
+    MPIDI_Protocols_Dyntask,
+#endif
     MPIDI_Protocols_COUNT,
   };
 
@@ -306,7 +313,7 @@ struct MPIDI_Comm
    * allocating pointers on the stack
    */
   /* For create_taskrange */
-  pami_geometry_range_t *ranges;
+  pami_geometry_range_t range;
   /* For create_tasklist/endpoints if we ever use it */
   pami_task_t *tasks;
   pami_endpoint_t *endpoints;
@@ -341,6 +348,9 @@ struct MPIDI_Comm
     pami_task_t *tasks;
     pami_endpoint_t *endpoints;
   } tasks_descriptor;
+#ifdef DYNAMIC_TASKING
+  int *world_ids;      /* ids of worlds that composed this communicator (inter communicator created for dynamic tasking) */
+#endif
 };
 
 
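The hunk above slots MPIDI_Protocols_Dyntask in just before MPIDI_Protocols_COUNT. Keeping a trailing COUNT member and guarding optional entries with #ifdef is a common C idiom: the count sizes the dispatch table automatically, and indices stay compact when the feature is compiled out. A small self-contained sketch of the pattern (all names hypothetical):

```c
#include <assert.h>

#define DYNAMIC_TASKING 1  /* per-target toggle, as mpidi_platform.h does */

enum protocols {
    PROTO_SEND,
    PROTO_RTS,
#ifdef DYNAMIC_TASKING
    PROTO_DYNTASK,          /* new member slots in before COUNT */
#endif
    PROTO_COUNT             /* always last: number of live protocols */
};

typedef void (*handler_fn)(void);
static void on_send(void)    {}
static void on_rts(void)     {}
#ifdef DYNAMIC_TASKING
static void on_dyntask(void) {}
#endif

/* PROTO_COUNT sizes the table, so registration stays in sync with the
 * enum even as guarded members come and go */
static handler_fn handlers[PROTO_COUNT] = {
    on_send,
    on_rts,
#ifdef DYNAMIC_TASKING
    on_dyntask,
#endif
};
```

Disabling DYNAMIC_TASKING shrinks both the enum and the table with no other edits.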
diff --git a/src/mpid/pamid/include/mpidi_hooks.h b/src/mpid/pamid/include/mpidi_hooks.h
index 4d7604c..b4228ca 100644
--- a/src/mpid/pamid/include/mpidi_hooks.h
+++ b/src/mpid/pamid/include/mpidi_hooks.h
@@ -29,9 +29,17 @@
 #define __include_mpidi_hooks_h__
 
 
-typedef pami_task_t         MPID_VCR;
+struct MPID_VCR_t {
+	pami_task_t      taskid;
+#ifdef DYNAMIC_TASKING
+	int              pg_rank;    /** rank in process group **/
+	struct MPIDI_PG *pg;         /** process group **/
+#endif
+};
+typedef struct MPID_VCR_t * MPID_VCR ;
 typedef struct MPIDI_VCRT * MPID_VCRT;
 
+
 typedef size_t              MPIDI_msg_sz_t;
 
 #define MPID_Irsend     MPID_Isend
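The mpidi_hooks.h change above turns MPID_VCR from a bare pami_task_t into a pointer to a struct, so each virtual-connection entry can also carry process-group state under DYNAMIC_TASKING. A simplified model of that shape change (types and the accessor are illustrative, not the pamid API):

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified model of the MPID_VCR change: the reference grows from a
 * plain task id into a struct, and the table stores pointers to it. */
typedef unsigned task_t;

struct pg;                       /* opaque process group */

struct vcr_t {
    task_t     taskid;           /* always present */
    int        pg_rank;          /* rank within its process group */
    struct pg *pg;               /* owning process group */
};
typedef struct vcr_t *vcr;       /* MPID_VCR becomes a pointer type */

/* existing call sites keep working through an accessor: what used to
 * be table[index] is now table[index]->taskid */
static task_t vcr_get_lpid(vcr *table, int index)
{
    return table[index]->taskid;
}
```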
diff --git a/src/mpid/pamid/include/mpidi_macros.h b/src/mpid/pamid/include/mpidi_macros.h
index 3e1798d..14d0195 100644
--- a/src/mpid/pamid/include/mpidi_macros.h
+++ b/src/mpid/pamid/include/mpidi_macros.h
@@ -111,9 +111,20 @@ _dt_contig_out, _data_sz_out, _dt_ptr, _dt_true_lb)             \
 
 #define MPID_VCR_GET_LPID(vcr, index)           \
 ({                                              \
-  vcr[index];                                   \
+  vcr[index]->taskid;                           \
 })
 
+
+#define MPID_VCR_GET_LPIDS(comm, taskids)                      \
+({                                                             \
+  int i;                                                       \
+  taskids=MPIU_Malloc((comm->local_size)*sizeof(pami_task_t)); \
+  MPID_assert(taskids != NULL);                                \
+  for(i=0; i<comm->local_size; i++)                            \
+    taskids[i] = comm->vcr[i]->taskid;                         \
+})
+
+
 #define MPID_GPID_Get(comm_ptr, rank, gpid)             \
 ({                                                      \
   gpid[1] = MPID_VCR_GET_LPID(comm_ptr->vcr, rank);     \
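The new MPID_VCR_GET_LPIDS macro above flattens a communicator's VCR table into a freshly allocated array of task ids that geometry-creation calls can consume directly. A function-style sketch of the same operation (names are illustrative; the real macro asserts on allocation failure instead of returning NULL):

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned task_t;

struct vcr { task_t taskid; };

/* Copy each entry's task id out of the pointer table; the caller owns
 * and eventually frees the result, as MPIDI_Coll_comm_destroy does. */
static task_t *get_lpids(struct vcr **table, int local_size)
{
    task_t *ids = malloc((size_t)local_size * sizeof *ids);
    if (ids == NULL)
        return NULL;
    for (int i = 0; i < local_size; i++)
        ids[i] = table[i]->taskid;
    return ids;
}
```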
diff --git a/src/mpid/pamid/include/mpidi_platform.h b/src/mpid/pamid/include/mpidi_platform.h
index 6ce3125..1917b39 100644
--- a/src/mpid/pamid/include/mpidi_platform.h
+++ b/src/mpid/pamid/include/mpidi_platform.h
@@ -40,6 +40,7 @@
 #define USE_PAMI_RDMA 1
 #define USE_PAMI_CONSISTENCY PAMI_HINT_ENABLE
 #undef  OUT_OF_ORDER_HANDLING
+#undef  DYNAMIC_TASKING
 #undef  RDMA_FAILOVER
 
 #define ASYNC_PROGRESS_MODE_DEFAULT 0
@@ -123,6 +124,7 @@ static const char _ibm_release_version_[] = "V1R2M0";
 #define MPIDI_BANNER          1
 #define MPIDI_NO_ASSERT       1
 #define TOKEN_FLOW_CONTROL    1
+#define DYNAMIC_TASKING       1
 
 /* 'is local task' extension and limits */
 #define PAMIX_IS_LOCAL_TASK
diff --git a/src/mpid/pamid/include/mpidi_prototypes.h b/src/mpid/pamid/include/mpidi_prototypes.h
index 74e91d7..316b681 100644
--- a/src/mpid/pamid/include/mpidi_prototypes.h
+++ b/src/mpid/pamid/include/mpidi_prototypes.h
@@ -127,7 +127,6 @@ void MPIDI_RecvRzvCB_zerobyte (pami_context_t    context,
                                size_t            sndlen,
                                pami_endpoint_t   sender,
                                pami_recv_t     * recv);
-
 void MPIDI_RecvDoneCB        (pami_context_t    context,
                               void            * clientdata,
                               pami_result_t     result);
@@ -140,6 +139,16 @@ void MPIDI_RecvRzvDoneCB     (pami_context_t    context,
 void MPIDI_RecvRzvDoneCB_zerobyte (pami_context_t    context,
                                    void            * cookie,
                                    pami_result_t     result);
+#ifdef DYNAMIC_TASKING
+void MPIDI_Recvfrom_remote_world (pami_context_t    context,
+                                  void            * cookie,
+                                  const void      * _msginfo,
+                                  size_t            msginfo_size,
+                                  const void      * sndbuf,
+                                  size_t            sndlen,
+                                  pami_endpoint_t   sender,
+                                  pami_recv_t     * recv);
+#endif
 #ifdef OUT_OF_ORDER_HANDLING
 void MPIDI_Recvq_process_out_of_order_msgs(pami_task_t src, pami_context_t context);
 int MPIDI_Recvq_search_recv_posting_queue(int src, int tag, int context_id,
diff --git a/src/mpid/pamid/include/mpidimpl.h b/src/mpid/pamid/include/mpidimpl.h
index 9e2b478..5993ac2 100644
--- a/src/mpid/pamid/include/mpidimpl.h
+++ b/src/mpid/pamid/include/mpidimpl.h
@@ -28,4 +28,121 @@
 #include "pamix.h"
 #include <mpix.h>
 
+
+#ifdef DYNAMIC_TASKING
+
+#define MPIDI_MAX_KVS_VALUE_LEN    4096
+
+typedef struct MPIDI_PG
+{
+    /* MPIU_Object field.  MPIDI_PG_t objects are not allocated using the
+       MPIU_Object system, but we do use the associated reference counting
+       routines.  Therefore, handle must be present, but is not used
+       except by debugging routines */
+    MPIU_OBJECT_HEADER; /* adds handle and ref_count fields */
+
+    /* Next pointer used to maintain a list of all process groups known to
+       this process */
+    struct MPIDI_PG * next;
+
+    /* Number of processes in the process group */
+    int size;
+
+    /* VC table.  At present this is a pointer to an array of VC structures.
+       Someday we may want make this a pointer to an array
+       of VC references.  Thus, it is important to use MPIDI_PG_Get_vc()
+       instead of directly referencing this field. */
+    MPID_VCR vct;
+
+    /* Pointer to the process group ID.  The actual ID is defined and
+       allocated by the process group.  The pointer is kept in the
+       device space because it is necessary for the device to be able to
+       find a particular process group. */
+    void * id;
+
+    /* Replacement abstraction for connection information */
+    /* Connection information needed to access processes in this process
+       group and to share the data with other processes.  The items are
+       connData - pointer for data used to implement these functions
+                  (e.g., a pointer to an array of process group info)
+       getConnInfo( rank, buf, bufsize, self ) - function to store into
+                  buf the connection information for rank in this process
+                  group
+       connInfoToString( buf_p, size, self ) - return in buf_p a string
+                  that can be sent to another process to recreate the
+                  connection information (the info needed to support
+                  getConnInfo)
+       connInfoFromString( buf, self ) - setup the information needed
+                  to implement getConnInfo
+       freeConnInfo( self ) - free any storage or resources associated
+                  with the connection information.
+
+       See ch3/src/mpidi_pg.c
+    */
+    void *connData;
+    int  (*getConnInfo)( int, char *, int, struct MPIDI_PG * );
+    int  (*connInfoToString)( char **, int *, struct MPIDI_PG * );
+    int  (*connInfoFromString)( const char *,  struct MPIDI_PG * );
+    int  (*freeConnInfo)( struct MPIDI_PG * );
+}
+MPIDI_PG_t;
+
+typedef int (*MPIDI_PG_Compare_ids_fn_t)(void * id1, void * id2);
+typedef int (*MPIDI_PG_Destroy_fn_t)(MPIDI_PG_t * pg);
+
+
+typedef MPIDI_PG_t * MPIDI_PG_iterator;
+
+typedef struct conn_info {
+  int                rem_world_id;
+  int                ref_count;
+  int                *rem_taskids;  /* The last member of this array is -1 */
+  struct conn_info   *next;
+}conn_info;
+
+
+/*--------------------------
+  BEGIN MPI PORT SECTION
+  --------------------------*/
+/* These are the default functions */
+int MPIDI_Comm_connect(const char *, struct MPID_Info *, int, struct MPID_Comm *, struct MPID_Comm **);
+int MPIDI_Comm_accept(const char *, struct MPID_Info *, int, struct MPID_Comm *, struct MPID_Comm **);
+
+int MPIDI_Comm_spawn_multiple(int, char **, char ***, int *, struct MPID_Info **,
+                              int, struct MPID_Comm *, struct MPID_Comm **, int *);
+
+
+typedef struct MPIDI_Port_Ops {
+    int (*OpenPort)( struct MPID_Info *, char *);
+    int (*ClosePort)( const char * );
+    int (*CommAccept)( const char *, struct MPID_Info *, int, struct MPID_Comm *,
+                       struct MPID_Comm ** );
+    int (*CommConnect)( const char *, struct MPID_Info *, int, struct MPID_Comm *,
+                        struct MPID_Comm ** );
+} MPIDI_PortFns;
+
+
+#define MPIDI_VC_add_ref( _vc )                                 \
+    do { MPIU_Object_add_ref( _vc ); } while (0)
+
+#define MPIDI_PG_add_ref(pg_)                   \
+do {                                            \
+    MPIU_Object_add_ref(pg_);                   \
+} while (0)
+#define MPIDI_PG_release_ref(pg_, inuse_)       \
+do {                                            \
+    MPIU_Object_release_ref(pg_, inuse_);       \
+} while (0)
+
+#define MPIDI_VC_release_ref( _vc, _inuse ) \
+    do { MPIU_Object_release_ref( _vc, _inuse ); } while (0)
+
+
+/* Initialize a new VC */
+int MPIDI_PG_Create_from_string(const char * str, MPIDI_PG_t ** pg_pptr,
+				int *flag);
+int MPIDI_PG_Get_size(MPIDI_PG_t * pg);
+#define MPIDI_PG_Get_size(pg_) ((pg_)->size)
+#endif  /** DYNAMIC_TASKING **/
+
 #endif
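The MPIDI_PG_add_ref / MPIDI_PG_release_ref macros above wrap the MPIU_Object reference-counting routines: release reports through an out-parameter whether the object is still in use, leaving the actual free to the caller. A minimal standalone model of that contract (struct and function names are illustrative):

```c
#include <assert.h>

/* Minimal model of the add_ref / release_ref pair: the last release
 * sets *inuse to 0, which is the caller's cue to tear the object down. */
struct pg { int ref_count; };

static void pg_add_ref(struct pg *p)
{
    p->ref_count++;
}

static void pg_release_ref(struct pg *p, int *inuse)
{
    *inuse = (--p->ref_count > 0);
}
```

Separating "count reached zero" from "free it" lets the process-group list unlink an entry before destroying it.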
diff --git a/src/mpid/pamid/include/mpidpre.h b/src/mpid/pamid/include/mpidpre.h
index a351a32..e78f3f7 100644
--- a/src/mpid/pamid/include/mpidpre.h
+++ b/src/mpid/pamid/include/mpidpre.h
@@ -62,4 +62,5 @@
 #include "mpidi_trace.h"
 #endif
 
+
 #endif
diff --git a/src/mpid/pamid/src/Makefile.mk b/src/mpid/pamid/src/Makefile.mk
index 1fc9915..065b33a 100644
--- a/src/mpid/pamid/src/Makefile.mk
+++ b/src/mpid/pamid/src/Makefile.mk
@@ -36,6 +36,7 @@ include $(top_srcdir)/src/mpid/pamid/src/mpix/Makefile.mk
 include $(top_srcdir)/src/mpid/pamid/src/onesided/Makefile.mk
 include $(top_srcdir)/src/mpid/pamid/src/pamix/Makefile.mk
 include $(top_srcdir)/src/mpid/pamid/src/pt2pt/Makefile.mk
+include $(top_srcdir)/src/mpid/pamid/src/dyntask/Makefile.mk
 
 
 lib_lib at MPILIBNAME@_la_SOURCES +=               \
diff --git a/src/mpid/pamid/src/comm/mpid_comm.c b/src/mpid/pamid/src/comm/mpid_comm.c
index 0414387..93baee8 100644
--- a/src/mpid/pamid/src/comm/mpid_comm.c
+++ b/src/mpid/pamid/src/comm/mpid_comm.c
@@ -61,7 +61,8 @@ typedef struct MPIDI_Post_geom_create
    pami_geometry_t parent;
    unsigned id;
    pami_geometry_range_t *ranges;
-   size_t slice_count;
+   pami_task_t *tasks;
+   size_t count; /* count of ranges or tasks */
    pami_event_function fn;
    void* cookie;
 } MPIDI_Post_geom_create_t;
@@ -89,7 +90,27 @@ static pami_result_t geom_rangelist_create_wrapper(pami_context_t context, void
                geom_struct->parent,
                geom_struct->id,
                geom_struct->ranges,
-               geom_struct->slice_count,
+               geom_struct->count,
+               context,
+               geom_struct->fn,
+               geom_struct->cookie);
+   TRACE_ERR("Done in geom create wrapper\n");
+}
+static pami_result_t geom_tasklist_create_wrapper(pami_context_t context, void *cookie)
+{
+   /* I'll need one of these per geometry creation function..... */
+   MPIDI_Post_geom_create_t *geom_struct = (MPIDI_Post_geom_create_t *)cookie;
+   TRACE_ERR("In geom create wrapper\n");
+   return PAMI_Geometry_create_tasklist(
+               geom_struct->client,
+               geom_struct->context_offset,
+               geom_struct->configs,
+               geom_struct->num_configs,
+               geom_struct->newgeom,
+               geom_struct->parent,
+               geom_struct->id,
+               geom_struct->tasks,
+               geom_struct->count,
                context,
                geom_struct->fn,
                geom_struct->cookie);
@@ -133,28 +154,29 @@ void MPIDI_Coll_comm_create(MPID_Comm *comm)
          fprintf(stderr,"world geom: %p parent geom: %p\n", MPIDI_Process.world_geometry, comm->mpid.parent);
       TRACE_ERR("Creating subgeom\n");
       /* Change to this at some point */
-      #if 0
-      rc = PAMI_Geometry_create_tasklist( MPIDI_Client,
-                                          NULL,
-                                          0,
-                                          &comm->mpid.geometry,
-                                          NULL, /* Parent */
-                                          comm->context_id,
-                                          /* task list, where/how do I get that? */
-                                          comm->local_size,
-                                          MPIDI_Context[0],
-                                          geom_create_cb_done,
-                                          &geom_init);
-      #endif
-
-      comm->mpid.tasks_descriptor.ranges = MPIU_Malloc(sizeof(comm->mpid.tasks_descriptor) *
-                                             comm->local_size);
-      /* Can we just pass a max and min. Does that work now? */
-      for(i=0;i<comm->local_size;i++)
+
+      comm->mpid.tasks = NULL;
+      for(i=1;i<comm->local_size;i++)
       {
-         comm->mpid.tasks_descriptor.ranges[i].lo = MPID_VCR_GET_LPID(comm->vcr, i);
-         comm->mpid.tasks_descriptor.ranges[i].hi = MPID_VCR_GET_LPID(comm->vcr, i);
+         /* only if sequential tasks should we use a (single) range.
+            Multi or reordered ranges are inefficient */
+         if(MPID_VCR_GET_LPID(comm->vcr, i) != (MPID_VCR_GET_LPID(comm->vcr, i-1) + 1)) {
+            /* not sequential, use tasklist */
+	    MPID_VCR_GET_LPIDS(comm, comm->mpid.tasks);
+            break;
+         }
       }
+      /* Should we use a range? (no task list set) */
+      if(comm->mpid.tasks == NULL)
+      {
+         /* one range, {first rank ... last rank} */
+         comm->mpid.range.lo = MPID_VCR_GET_LPID(comm->vcr, 0);
+         comm->mpid.range.hi = MPID_VCR_GET_LPID(comm->vcr, comm->local_size-1);
+      }
+
+      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm->rank == 0))
+         fprintf(stderr,"create geometry tasks %p {%u..%u}\n", comm->mpid.tasks, MPID_VCR_GET_LPID(comm->vcr, 0),MPID_VCR_GET_LPID(comm->vcr, comm->local_size-1));
+
       pami_configuration_t config;
       size_t numconfigs = 0;
 
@@ -168,21 +190,44 @@ void MPIDI_Coll_comm_create(MPID_Comm *comm)
          numconfigs = 0;
       }
 
-      geom_post.client = MPIDI_Client;
-      geom_post.configs = &config;
-      geom_post.context_offset = 0; /* TODO BES investigate */
-      geom_post.num_configs = numconfigs;
-      geom_post.newgeom = &comm->mpid.geometry,
-      geom_post.parent = NULL;
-      geom_post.id     = comm->context_id;
-      geom_post.ranges = comm->mpid.tasks_descriptor.ranges;
-      geom_post.slice_count = (size_t)comm->local_size,
-      geom_post.fn = geom_create_cb_done;
-      geom_post.cookie = (void*)&geom_init;
-
-      TRACE_ERR("%s geom_create\n", MPIDI_Process.context_post>0?"Posting":"Invoking");
-      MPIDI_Context_post(MPIDI_Context[0], &geom_post.state,
-                         geom_rangelist_create_wrapper, (void *)&geom_post);
+      if(comm->mpid.tasks == NULL)
+      {
+         geom_post.client = MPIDI_Client;
+         geom_post.configs = &config;
+         geom_post.context_offset = 0; /* TODO BES investigate */
+         geom_post.num_configs = numconfigs;
+         geom_post.newgeom = &comm->mpid.geometry,
+         geom_post.parent = NULL;
+         geom_post.id     = comm->context_id;
+         geom_post.ranges = &comm->mpid.range;
+         geom_post.tasks = NULL;
+         geom_post.count = (size_t)1;
+         geom_post.fn = geom_create_cb_done;
+         geom_post.cookie = (void*)&geom_init;
+
+         TRACE_ERR("%s geom_rangelist_create\n", MPIDI_Process.context_post>0?"Posting":"Invoking");
+         MPIDI_Context_post(MPIDI_Context[0], &geom_post.state,
+                            geom_rangelist_create_wrapper, (void *)&geom_post);
+      }
+      else
+      {
+         geom_post.client = MPIDI_Client;
+         geom_post.configs = &config;
+         geom_post.context_offset = 0; /* TODO BES investigate */
+         geom_post.num_configs = numconfigs;
+         geom_post.newgeom = &comm->mpid.geometry,
+         geom_post.parent = NULL;
+         geom_post.id     = comm->context_id;
+         geom_post.ranges = NULL;
+         geom_post.tasks = comm->mpid.tasks;
+         geom_post.count = (size_t)comm->local_size;
+         geom_post.fn = geom_create_cb_done;
+         geom_post.cookie = (void*)&geom_init;
+
+         TRACE_ERR("%s geom_tasklist_create\n", MPIDI_Process.context_post>0?"Posting":"Invoking");
+         MPIDI_Context_post(MPIDI_Context[0], &geom_post.state,
+                            geom_tasklist_create_wrapper, (void *)&geom_post);
+      }
 
       TRACE_ERR("Waiting for geom create to finish\n");
       MPID_PROGRESS_WAIT_WHILE(geom_init);
@@ -242,20 +287,9 @@ void MPIDI_Coll_comm_destroy(MPID_Comm *comm)
 
    TRACE_ERR("Waiting for geom destroy to finish\n");
    MPID_PROGRESS_WAIT_WHILE(geom_destroy);
-   TRACE_ERR("Freeing geometry ranges\n");
+   MPIU_Free(comm->mpid.tasks);
+/*   TRACE_ERR("Freeing geometry ranges\n");
    MPIU_TestFree(&comm->mpid.tasks_descriptor.ranges);
+*/
    TRACE_ERR("MPIDI_Coll_comm_destroy exit\n");
 }
-
-
-
-void MPIDI_Comm_world_setup()
-{
-  TRACE_ERR("MPIDI_Comm_world_setup enter\n");
-
-  /* Anything special required for COMM_WORLD goes here */
-   MPID_Comm *comm;
-   comm = MPIR_Process.comm_world;
-
-  TRACE_ERR("MPIDI_Comm_world_setup exit\n");
-}
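The rewritten MPIDI_Coll_comm_create above now scans the communicator once: if the task ids are strictly sequential it describes them as a single {lo..hi} range and posts geom_rangelist_create_wrapper; otherwise it builds an explicit task list and posts geom_tasklist_create_wrapper, the general but less efficient form. The sequential check can be sketched as (hypothetical helper name):

```c
#include <assert.h>

typedef unsigned task_t;

/* Returns 1 when one contiguous {ids[0]..ids[n-1]} range suffices,
 * i.e. every id is exactly its predecessor plus one; mirrors the loop
 * starting at i=1 in the patch above. */
static int is_sequential(const task_t *ids, int n)
{
    for (int i = 1; i < n; i++)
        if (ids[i] != ids[i - 1] + 1)
            return 0;
    return 1;
}
```

A single range avoids allocating and transmitting a per-task list when the communicator happens to be contiguous, which COMM_WORLD-derived communicators often are.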
diff --git a/src/mpid/pamid/src/dyntask/Makefile.mk b/src/mpid/pamid/src/dyntask/Makefile.mk
new file mode 100644
index 0000000..64402d8
--- /dev/null
+++ b/src/mpid/pamid/src/dyntask/Makefile.mk
@@ -0,0 +1,32 @@
+# begin_generated_IBM_copyright_prolog
+#
+# This is an automatically generated copyright prolog.
+# After initializing,  DO NOT MODIFY OR MOVE
+#  ---------------------------------------------------------------
+# Licensed Materials - Property of IBM
+# Blue Gene/Q 5765-PER 5765-PRP
+#
+# (C) Copyright IBM Corp. 2011, 2012 All Rights Reserved
+# US Government Users Restricted Rights -
+# Use, duplication, or disclosure restricted
+# by GSA ADP Schedule Contract with IBM Corp.
+#
+#  ---------------------------------------------------------------
+#
+# end_generated_IBM_copyright_prolog
+# -*- mode: makefile-gmake; -*-
+
+# note that the includes always happen but the effects of their contents are
+# affected by "if BUILD_PAMID"
+if BUILD_PAMID
+
+
+lib_lib at MPILIBNAME@_la_SOURCES +=                                    \
+  src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c              \
+  src/mpid/pamid/src/dyntask/mpid_comm_disconnect.c                  \
+  src/mpid/pamid/src/dyntask/mpid_port.c                             \
+  src/mpid/pamid/src/dyntask/mpidi_pg.c                              \
+  src/mpid/pamid/src/dyntask/mpidi_port.c
+
+
+endif BUILD_PAMID
diff --git a/src/mpid/pamid/src/dyntask/mpid_comm_disconnect.c b/src/mpid/pamid/src/dyntask/mpid_comm_disconnect.c
new file mode 100644
index 0000000..0317f5e
--- /dev/null
+++ b/src/mpid/pamid/src/dyntask/mpid_comm_disconnect.c
@@ -0,0 +1,78 @@
+/* begin_generated_IBM_copyright_prolog                             */
+/*                                                                  */
+/* This is an automatically generated copyright prolog.             */
+/* After initializing,  DO NOT MODIFY OR MOVE                       */
+/*  --------------------------------------------------------------- */
+/* Licensed Materials - Property of IBM                             */
+/* Blue Gene/Q 5765-PER 5765-PRP                                    */
+/*                                                                  */
+/* (C) Copyright IBM Corp. 2011, 2012 All Rights Reserved           */
+/* US Government Users Restricted Rights -                          */
+/* Use, duplication, or disclosure restricted                       */
+/* by GSA ADP Schedule Contract with IBM Corp.                      */
+/*                                                                  */
+/*  --------------------------------------------------------------- */
+/*                                                                  */
+/* end_generated_IBM_copyright_prolog                               */
+/*  (C)Copyright IBM Corp.  2007, 2011  */
+
+#include "mpidimpl.h"
+
+#ifdef DYNAMIC_TASKING
+
+extern conn_info *_conn_info_list;
+/*@
+   MPID_Comm_disconnect - Disconnect a communicator
+
+   Arguments:
+.  comm_ptr - communicator
+
+   Notes:
+
+.N Errors
+.N MPI_SUCCESS
+@*/
+int MPID_Comm_disconnect(MPID_Comm *comm_ptr)
+{
+    int rc, i, ref_count, mpi_errno = MPI_SUCCESS, probe_flag = 0;
+    MPI_Status status;
+    MPIDI_PG_t *pg;
+
+    if(comm_ptr->mpid.world_ids != NULL) {
+	rc = MPID_Iprobe(comm_ptr->rank, MPI_ANY_TAG, comm_ptr, MPID_CONTEXT_INTER_PT2PT, &probe_flag, &status);
+        if(rc || probe_flag) {
+          TRACE_ERR("PENDING_PTP");
+	  exit(1);
+        }
+
+        for(i=0; comm_ptr->mpid.world_ids[i] != -1; i++) {
+          ref_count = MPIDI_Decrement_ref_count(comm_ptr->mpid.world_ids[i]);
+          TRACE_ERR("ref_count=%d with world=%d comm_ptr=%x\n", ref_count, comm_ptr->mpid.world_ids[i], comm_ptr);
+          if(ref_count == -1)
+	    TRACE_ERR("something is wrong\n");
+        }
+
+        MPIU_Free(comm_ptr->mpid.world_ids);
+        mpi_errno = MPIR_Comm_release(comm_ptr,1);
+        if (mpi_errno) TRACE_ERR("MPIR_Comm_release returned with mpi_errno=%d\n", mpi_errno);
+    }
+    return mpi_errno;
+}
+
+
+int MPIDI_Decrement_ref_count(int wid) {
+  conn_info *tmp_node;
+  int ref_count=-1;
+
+  tmp_node = _conn_info_list;
+  while(tmp_node != NULL) {
+    if(tmp_node->rem_world_id == wid) {
+      ref_count = --tmp_node->ref_count;
+      TRACE_ERR("decrement_ref_count: ref_count decremented to %d for remote world %d\n",ref_count,wid);
+      break;
+    }
+    tmp_node = tmp_node->next;
+  }
+  return ref_count;
+}
+#endif
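MPIDI_Decrement_ref_count above walks the global conn_info list looking for the entry whose rem_world_id matches, decrements its count, and returns -1 when the world id is unknown so the disconnect path can flag the inconsistency. The same bookkeeping in a self-contained form (field and function names simplified):

```c
#include <assert.h>
#include <stddef.h>

/* Each remote world known to this process carries a reference count;
 * disconnect decrements the matching entry, and -1 signals "no such
 * world", which the caller above treats as an error. */
struct conn_info {
    int world_id;
    int ref_count;
    struct conn_info *next;
};

static int decrement_ref_count(struct conn_info *list, int wid)
{
    for (struct conn_info *n = list; n != NULL; n = n->next)
        if (n->world_id == wid)
            return --n->ref_count;
    return -1;                   /* not found */
}
```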
diff --git a/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c b/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
new file mode 100644
index 0000000..23eba8e
--- /dev/null
+++ b/src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
@@ -0,0 +1,377 @@
+/* -*- Mode: C; c-basic-offset:4 ; -*- */
+/*
+ *  (C) 2001 by Argonne National Laboratory.
+ *      See COPYRIGHT in top-level directory.
+ */
+
+#include "mpidimpl.h"
+#ifdef USE_PMI2_API
+#include "pmi2.h"
+#else
+#include "pmi.h"
+#endif
+
+#ifdef DYNAMIC_TASKING
+
+/* Define the name of the kvs key used to provide the port name to the
+   children */
+#define MPIDI_PARENT_PORT_KVSKEY "PARENT_ROOT_PORT_NAME"
+
+/* FIXME: We can avoid these two routines if we define PMI as using
+   MPI info values */
+/* Turn a SINGLE MPI_Info into an array of PMI_keyvals (return the pointer
+   to the array of PMI keyvals) */
+static int  MPIDI_mpi_to_pmi_keyvals( MPID_Info *info_ptr, PMI_keyval_t **kv_ptr, int *nkeys_ptr )
+{
+    char key[MPI_MAX_INFO_KEY];
+    PMI_keyval_t *kv = 0;
+    int          i, nkeys = 0, vallen, flag, mpi_errno=MPI_SUCCESS;
+
+    if (!info_ptr || info_ptr->handle == MPI_INFO_NULL) {
+	goto fn_exit;
+    }
+
+    MPIR_Info_get_nkeys_impl( info_ptr, &nkeys );
+    if (nkeys == 0) {
+	goto fn_exit;
+    }
+    kv = (PMI_keyval_t *)MPIU_Malloc( nkeys * sizeof(PMI_keyval_t) );
+
+    for (i=0; i<nkeys; i++) {
+	mpi_errno = MPIR_Info_get_nthkey_impl( info_ptr, i, key );
+	if (mpi_errno)
+          TRACE_ERR("MPIR_Info_get_nthkey_impl returned with mpi_errno=%d\n", mpi_errno);
+	MPIR_Info_get_valuelen_impl( info_ptr, key, &vallen, &flag );
+
+	kv[i].key = MPIU_Strdup(key);
+	kv[i].val = MPIU_Malloc( vallen + 1 );
+	MPIR_Info_get_impl( info_ptr, key, vallen+1, kv[i].val, &flag );
+	TRACE_OUT(("key: <%s>, value: <%s>\n", kv[i].key, kv[i].val));
+    }
+
+ fn_fail:
+ fn_exit:
+    *kv_ptr    = kv;
+    *nkeys_ptr = nkeys;
+    return mpi_errno;
+}
+
+
+/* Free the entire array of PMI keyvals */
+static void MPIDI_free_pmi_keyvals(PMI_keyval_t **kv, int size, int *counts)
+{
+    int i,j;
+
+    for (i=0; i<size; i++)
+    {
+	for (j=0; j<counts[i]; j++)
+	{
+	    if (kv[i][j].key != NULL)
+		MPIU_Free((char *)kv[i][j].key);
+	    if (kv[i][j].val != NULL)
+		MPIU_Free(kv[i][j].val);
+	}
+	if (kv[i] != NULL)
+	{
+	    MPIU_Free(kv[i]);
+	}
+    }
+}
+
+/*@
+   MPID_Comm_spawn_multiple -
+
+   Input Arguments:
++  int count - count
+.  char *array_of_commands[] - commands
+.  char* *array_of_argv[] - arguments
+.  int array_of_maxprocs[] - maxprocs
+.  MPI_Info array_of_info[] - infos
+.  int root - root
+-  MPI_Comm comm - communicator
+
+   Output Arguments:
++  MPI_Comm *intercomm - intercommunicator
+-  int array_of_errcodes[] - error codes
+
+   Notes:
+
+.N Errors
+.N MPI_SUCCESS
+@*/
+int MPID_Comm_spawn_multiple(int count, char *array_of_commands[],
+			     char ** array_of_argv[], const int array_of_maxprocs[],
+			     MPID_Info * array_of_info_ptrs[], int root,
+			     MPID_Comm * comm_ptr, MPID_Comm ** intercomm,
+			     int array_of_errcodes[])
+{
+    int mpi_errno = MPI_SUCCESS;
+
+    /* We allow an empty implementation of this function to
+       simplify building MPICH2 on systems that have difficulty
+       supporting process creation */
+    mpi_errno = MPIDI_Comm_spawn_multiple(count, array_of_commands,
+					  array_of_argv, array_of_maxprocs,
+					  array_of_info_ptrs,
+					  root, comm_ptr, intercomm,
+					  array_of_errcodes);
+    return mpi_errno;
+}
+
+
+/*
+ * MPIDI_Comm_spawn_multiple()
+ */
+int MPIDI_Comm_spawn_multiple(int count, char **commands,
+                              char ***argvs, int *maxprocs,
+                              MPID_Info **info_ptrs, int root,
+                              MPID_Comm *comm_ptr, MPID_Comm
+                              **intercomm, int *errcodes)
+{
+    char port_name[MPI_MAX_PORT_NAME];
+    char jobId[64];
+    char ctxid_str[16];
+    int jobIdSize = 64;
+    int len=0;
+    int *info_keyval_sizes=0, i, mpi_errno=MPI_SUCCESS;
+    PMI_keyval_t **info_keyval_vectors=0, preput_keyval_vector;
+    int *pmi_errcodes = 0, pmi_errno;
+    int total_num_processes, should_accept = 1;
+    MPID_Info tmp_info_ptr;
+    char *tmp;
+
+    if (comm_ptr->rank == root) {
+	/* create an array for the pmi error codes */
+	total_num_processes = 0;
+	for (i=0; i<count; i++) {
+	    total_num_processes += maxprocs[i];
+	}
+	pmi_errcodes = (int*)MPIU_Malloc(sizeof(int) * total_num_processes);
+
+	/* initialize them to 0 */
+	for (i=0; i<total_num_processes; i++)
+	    pmi_errcodes[i] = 0;
+
+	/* Open a port for the spawned processes to connect to */
+	/* FIXME: info may be needed for port name */
+        mpi_errno = MPID_Open_port(NULL, port_name);
+	TRACE_ERR("mpi_errno from MPID_Open_port=%d\n", mpi_errno);
+
+	/* Spawn the processes */
+#ifdef USE_PMI2_API
+        MPIU_Assert(count > 0);
+        {
+            int *argcs = MPIU_Malloc(count*sizeof(int));
+            struct MPID_Info preput;
+            struct MPID_Info *preput_p[1] = { &preput };
+
+            MPIU_Assert(argcs);
+
+            info_keyval_sizes = MPIU_Malloc(count * sizeof(int));
+
+            /* FIXME cheating on constness */
+            preput.key = (char *)MPIDI_PARENT_PORT_KVSKEY;
+            preput.value = port_name;
+            preput.next = NULL;
+
+	    tmp_info_ptr.key = "COMMCTX";
+	    len=sprintf(ctxid_str, "%d", comm_ptr->context_id);
+	    TRACE_ERR("COMMCTX=%d\n", comm_ptr->context_id);
+	     ctxid_str[len]='\0';
+	    tmp_info_ptr.value = ctxid_str;
+	    tmp_info_ptr.next = NULL;
+
+            /* compute argcs array */
+            for (i = 0; i < count; ++i) {
+                argcs[i] = 0;
+                if (argvs != NULL && argvs[i] != NULL) {
+                    while (argvs[i][argcs[i]]) {
+                        ++argcs[i];
+                    }
+                }
+
+                /* a fib for now */
+                info_keyval_sizes[i] = 1;
+		info_ptrs[i] = &tmp_info_ptr;
+            }
+
+            /* XXX DJG don't need this, PMI API is thread-safe? */
+            /*MPIU_THREAD_CS_ENTER(PMI,);*/
+            /* release the global CS for spawn PMI calls */
+            MPIU_THREAD_CS_EXIT(ALLFUNC,);
+            pmi_errno = PMI2_Job_Spawn(count, (const char **)commands,
+                                       argcs, (const char ***)argvs,
+                                       maxprocs,
+                                       info_keyval_sizes, (const MPID_Info **)info_ptrs,
+                                       1, (const struct MPID_Info **)preput_p,
+                                       jobId, jobIdSize,
+                                       pmi_errcodes);
+	    TRACE_ERR("after PMI2_Job_Spawn - jobId=%s\n", jobId);
+
+	    tmp=MPIU_Strdup(jobId);
+	    strtok(tmp, ";");
+	    pami_task_t leader_taskid = atoi(strtok(NULL, ";"));
+	    pami_endpoint_t ldest;
+
+            PAMI_Endpoint_create(MPIDI_Client,  leader_taskid, 0, &ldest);
+	    TRACE_ERR("PAMI_Resume to taskid=%d\n", leader_taskid);
+            PAMI_Resume(MPIDI_Context[0], &ldest, 1);
+            MPIU_Free(tmp);
+
+            MPIU_Free(argcs);
+            if (pmi_errno != PMI2_SUCCESS) {
+               TRACE_ERR("PMI2_Job_Spawn returned with pmi_errno=%d\n", pmi_errno);
+            }
+        }
+#else
+        /* FIXME: This is *really* awkward.  We should either
+           Fix on MPI-style info data structures for PMI (avoid unnecessary
+           duplication) or add an MPIU_Info_getall(...) that creates
+           the necessary arrays of key/value pairs */
+
+        /* convert the infos into PMI keyvals */
+        info_keyval_sizes   = (int *) MPIU_Malloc(count * sizeof(int));
+        info_keyval_vectors =
+            (PMI_keyval_t**) MPIU_Malloc(count * sizeof(PMI_keyval_t*));
+
+        if (!info_ptrs) {
+            for (i=0; i<count; i++) {
+                info_keyval_vectors[i] = 0;
+                info_keyval_sizes[i]   = 0;
+            }
+        }
+        else {
+            for (i=0; i<count; i++) {
+                mpi_errno = MPIDI_mpi_to_pmi_keyvals( info_ptrs[i],
+                                                &info_keyval_vectors[i],
+                                                &info_keyval_sizes[i] );
+                if (mpi_errno) { TRACE_ERR("MPIDI_mpi_to_pmi_keyvals returned with mpi_errno=%d\n", mpi_errno); }
+            }
+        }
+
+        preput_keyval_vector.key = MPIDI_PARENT_PORT_KVSKEY;
+        preput_keyval_vector.val = port_name;
+
+        pmi_errno = PMI_Spawn_multiple(count, (const char **)
+                                       commands,
+                                       (const char ***) argvs,
+                                       maxprocs, info_keyval_sizes,
+                                       (const PMI_keyval_t **)
+                                       info_keyval_vectors, 1,
+                                       &preput_keyval_vector,
+                                       pmi_errcodes);
+	TRACE_ERR("pmi_errno from PMI_Spawn_multiple=%d\n", pmi_errno);
+#endif
+
+	if (errcodes != MPI_ERRCODES_IGNORE) {
+	    for (i=0; i<total_num_processes; i++) {
+		/* FIXME: translate the pmi error codes here */
+		errcodes[i] = pmi_errcodes[i];
+                /* We want to accept if any of the spawns succeeded.
+                   Equivalently, we do NOT accept only if all of
+                   them failed.  should_accept = NAND(e_0, ..., e_n)
+                   Remember, success equals false (0). */
+                should_accept = should_accept && errcodes[i];
+	    }
+            should_accept = !should_accept; /* the `N' in NAND */
+	}
+    }
+
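The NAND acceptance rule described in the comment above can be sketched in isolation. `should_accept_spawn` is a hypothetical helper, not part of the MPICH source; it assumes PMI-style error codes where 0 means success.

```c
#include <assert.h>

/* Hypothetical helper illustrating the NAND acceptance rule: with
 * PMI error codes, success == 0, so we accept unless every spawn
 * failed, i.e. should_accept = NAND(e_0, ..., e_n). */
static int should_accept_spawn(const int *errcodes, int n)
{
    int all_failed = 1;   /* running AND over the error codes */
    int i;
    for (i = 0; i < n; i++)
        all_failed = all_failed && errcodes[i];
    return !all_failed;   /* the `N' in NAND */
}
```

A single successful spawn (a single zero) is enough to make the result true, which matches the accumulate-then-negate loop in the code above.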
+    if (errcodes != MPI_ERRCODES_IGNORE) {
+        int errflag = FALSE;
+        mpi_errno = MPIR_Bcast_impl(&should_accept, 1, MPI_INT, root, comm_ptr, &errflag);
+        if (mpi_errno) TRACE_ERR("MPIR_Bcast_impl returned with mpi_errno=%d\n", mpi_errno);
+
+        mpi_errno = MPIR_Bcast_impl(&total_num_processes, 1, MPI_INT, root, comm_ptr, &errflag);
+        if (mpi_errno) TRACE_ERR("MPIR_Bcast_impl returned with mpi_errno=%d\n", mpi_errno);
+
+        mpi_errno = MPIR_Bcast_impl(errcodes, total_num_processes, MPI_INT, root, comm_ptr, &errflag);
+        if (mpi_errno) TRACE_ERR("MPIR_Bcast_impl returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    if (should_accept) {
+        mpi_errno = MPID_Comm_accept(port_name, NULL, root, comm_ptr, intercomm);
+	TRACE_ERR("mpi_errno from MPID_Comm_accept=%d\n", mpi_errno);
+    }
+
+    if (comm_ptr->rank == root) {
+	/* Close the port opened for the spawned processes to connect to */
+	mpi_errno = MPID_Close_port(port_name);
+	/* --BEGIN ERROR HANDLING-- */
+	if (mpi_errno != MPI_SUCCESS)
+	    TRACE_ERR("MPID_Close_port returned with mpi_errno=%d\n", mpi_errno);
+	/* --END ERROR HANDLING-- */
+    }
+
+ fn_exit:
+    if (info_keyval_vectors) {
+	MPIDI_free_pmi_keyvals(info_keyval_vectors, count, info_keyval_sizes);
+	MPIU_Free(info_keyval_sizes);
+	MPIU_Free(info_keyval_vectors);
+    }
+    if (pmi_errcodes) {
+	MPIU_Free(pmi_errcodes);
+    }
+    return mpi_errno;
+ fn_fail:
+    goto fn_exit;
+}
+
+
+/* This function is used only with mpid_init to set up the parent communicator
+   if there is one.  The routine belongs in this file because the parent
+   port name is set up with the "preput" arguments to PMI_Spawn_multiple */
+static char *parent_port_name = 0;    /* Name of parent port if this
+					 process was spawned (and is root
+					 of comm world) or null */
+
+int MPIDI_GetParentPort(char ** parent_port)
+{
+    int mpi_errno = MPI_SUCCESS;
+    int pmi_errno;
+    char val[MPIDI_MAX_KVS_VALUE_LEN];
+
+    if (parent_port_name == NULL)
+    {
+	char *kvsname = NULL;
+	/* We can always use PMI_KVS_Get on our own process group */
+	MPIDI_PG_GetConnKVSname( &kvsname );
+#ifdef USE_PMI2_API
+        {
+            int vallen = 0;
+            pmi_errno = PMI2_KVS_Get(kvsname, PMI2_ID_NULL, MPIDI_PARENT_PORT_KVSKEY, val, sizeof(val), &vallen);
+	    TRACE_ERR("PMI2_KVS_Get - val=%s\n", val);
+            if (pmi_errno)
+                TRACE_ERR("PMI2_KVS_Get returned with pmi_errno=%d\n", pmi_errno);
+        }
+#else
+	/*MPIU_THREAD_CS_ENTER(PMI,);*/
+	pmi_errno = PMI_KVS_Get( kvsname, MPIDI_PARENT_PORT_KVSKEY, val, sizeof(val));
+/*	MPIU_THREAD_CS_EXIT(PMI,);*/
+	if (pmi_errno) {
+            mpi_errno = MPIR_Err_create_code(MPI_SUCCESS, MPIR_ERR_FATAL, FCNAME, __LINE__, MPI_ERR_OTHER, "**pmi_kvsget", "**pmi_kvsget %d", pmi_errno);
+            goto fn_exit;
+	}
+#endif
+	parent_port_name = MPIU_Strdup(val);
+    }
+
+    *parent_port = parent_port_name;
+
+ fn_exit:
+    return mpi_errno;
+ fn_fail:
+    goto fn_exit;
+}
+
+
+void MPIDI_FreeParentPort(void)
+{
+    if (parent_port_name) {
+	MPIU_Free( parent_port_name );
+	parent_port_name = 0;
+    }
+}
+
+
+#endif
diff --git a/src/mpid/pamid/src/dyntask/mpid_port.c b/src/mpid/pamid/src/dyntask/mpid_port.c
new file mode 100644
index 0000000..b2c43cb
--- /dev/null
+++ b/src/mpid/pamid/src/dyntask/mpid_port.c
@@ -0,0 +1,294 @@
+/* -*- Mode: C; c-basic-offset:4 ; -*- */
+/*
+ *  (C) 2001 by Argonne National Laboratory.
+ *      See COPYRIGHT in top-level directory.
+ */
+
+#include "mpidimpl.h"
+#include "netdb.h"
+#include <net/if.h>
+#include <linux/sockios.h>
+
+
+#ifdef DYNAMIC_TASKING
+
+#define MAX_HOST_DESCRIPTION_LEN 128
+#define NUM_IFREQS 10
+#define MPI_MAX_TASKID_NAME 8
+#define MPIDI_TASKID_TAG_KEY "taskid"
+
+static int MPIDI_Open_port(MPID_Info *, char *);
+static int MPIDI_Close_port(const char *);
+
+
+/* Define the functions that are used to implement the port
+ * operations */
+static MPIDI_PortFns portFns = { MPIDI_Open_port,
+				 MPIDI_Close_port,
+				 MPIDI_Comm_accept,
+				 MPIDI_Comm_connect };
+
+/*@
+   MPID_Open_port - Open an MPI Port
+
+   Input Arguments:
+.  MPI_Info info - info
+
+   Output Arguments:
+.  char *port_name - port name
+
+   Notes:
+
+
+.N Errors
+.N MPI_SUCCESS
+.N MPI_ERR_OTHER
+@*/
+int MPID_Open_port(MPID_Info *info_ptr, char *port_name)
+{
+    int mpi_errno=MPI_SUCCESS;
+
+    if (portFns.OpenPort) {
+	mpi_errno = portFns.OpenPort( info_ptr, port_name );
+	if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("OpenPort returned with mpi_errno=%d\n", mpi_errno);
+	}
+    }
+
+ fn_fail:
+    return mpi_errno;
+}
+
+
+
+/*@
+   MPID_Close_port - Close port
+
+   Input Parameter:
+.  port_name - Name of MPI port to close
+
+   Notes:
+
+.N Errors
+.N MPI_SUCCESS
+.N MPI_ERR_OTHER
+
+@*/
+int MPID_Close_port(const char *port_name)
+{
+    int mpi_errno=MPI_SUCCESS;
+
+    if (portFns.ClosePort) {
+	mpi_errno = portFns.ClosePort( port_name );
+	if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("ClosePort returned with mpi_errno=%d\n", mpi_errno);
+	}
+    }
+
+ fn_fail:
+    return mpi_errno;
+}
+
+int MPID_Comm_accept(const char * port_name, MPID_Info * info, int root,
+		     MPID_Comm * comm, MPID_Comm ** newcomm_ptr)
+{
+    int mpi_errno = MPI_SUCCESS;
+
+    if (portFns.CommAccept) {
+	mpi_errno = portFns.CommAccept( port_name, info, root, comm,
+					newcomm_ptr );
+	if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("CommAccept returned with mpi_errno=%d\n", mpi_errno);
+	}
+    }
+
+ fn_fail:
+    return mpi_errno;
+}
+
+int MPID_Comm_connect(const char * port_name, MPID_Info * info, int root,
+		      MPID_Comm * comm, MPID_Comm ** newcomm_ptr)
+{
+    int mpi_errno=MPI_SUCCESS;
+
+    if (portFns.CommConnect) {
+	mpi_errno = portFns.CommConnect( port_name, info, root, comm,
+					 newcomm_ptr );
+	if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("CommConnect returned with mpi_errno=%d\n", mpi_errno);
+	}
+    }
+
+ fn_fail:
+    return mpi_errno;
+}
+
+
+/*
+ * Here are the routines that provide some of the default implementations
+ * for the Port routines.
+ *
+ * MPIDI_Open_port - creates a port "name" that includes a tag value that
+ * is used to separate different MPI Port values.  That tag value is
+ * extracted with MPIDI_GetTagFromPort
+ * MPIDI_GetTagFromPort - Routine to return the tag associated with a port.
+ *
+ * The port_name_tag is used in the connect and accept messages that
+ * are used in the connect/accept protocol.
+ */
+
+#define MPIDI_PORT_NAME_TAG_KEY "tag"
+#define MPIDI_TASKID_TAG_KEY "taskid"
+
+/* Though the port_name_tag_mask itself is an int, we can only have as
+ * many tags as the context_id space can support. */
+static int port_name_tag_mask[MPIR_MAX_CONTEXT_MASK] = { 0 };
+
+static int MPIDI_get_port_name_tag(int * port_name_tag)
+{
+    int i, j;
+    int mpi_errno = MPI_SUCCESS;
+
+
+    for (i = 0; i < MPIR_MAX_CONTEXT_MASK; i++)
+	if (port_name_tag_mask[i] != ~0)
+	    break;
+
+    TRACE_ERR("MPIDI_get_port_name_tag - i=%d MPIR_MAX_CONTEXT_MASK=%d", i,  MPIR_MAX_CONTEXT_MASK);
+    if (i < MPIR_MAX_CONTEXT_MASK) {
+	/* Found a free tag. port_name_tag_mask[i] is not fully used
+	 * up. */
+
+	/* OR the mask value with powers of two. If the OR value is
+	 * the same as the original value, then it means that the
+	 * OR'ed bit was originally 1 (used); otherwise, it was
+	 * originally 0 (free). */
+	for (j = 0; j < (8 * sizeof(int)); j++) {
+	    if ((port_name_tag_mask[i] | (1 << ((8 * sizeof(int)) - j - 1))) !=
+		port_name_tag_mask[i]) {
+		/* Mark the appropriate bit as used and return that */
+		port_name_tag_mask[i] |= (1 << ((8 * sizeof(int)) - j - 1));
+		*port_name_tag = ((i * 8 * sizeof(int)) + j);
+		goto fn_exit;
+	    }
+	}
+    }
+    else {
+	goto fn_fail;
+    }
+
+fn_exit:
+    return mpi_errno;
+
+fn_fail:
+    /* Everything is used up */
+    *port_name_tag = -1;
+    mpi_errno = MPI_ERR_OTHER;
+    goto fn_exit;
+}
+
+static void MPIDI_free_port_name_tag(int tag)
+{
+    int index, rem_tag;
+
+    index = tag / (sizeof(int) * 8);
+    rem_tag = tag - (index * sizeof(int) * 8);
+
+    port_name_tag_mask[index] &= ~(1 << ((8 * sizeof(int)) - 1 - rem_tag));
+}
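The tag allocator above packs one tag per bit, most-significant bit first, across an array of ints. The allocate/free round trip can be sketched standalone; `MASK_WORDS` is a hypothetical stand-in for `MPIR_MAX_CONTEXT_MASK`, and the bit arithmetic deliberately mirrors the routines above.

```c
#include <assert.h>

#define MASK_WORDS 4   /* stand-in for MPIR_MAX_CONTEXT_MASK */

static int tag_mask[MASK_WORDS];

/* Scan for a word that is not fully used (~0), then claim its
 * highest-order free bit.  Bit j of word i, counted from the MSB,
 * encodes tag i*8*sizeof(int) + j. */
static int alloc_tag(void)
{
    int i, j;
    for (i = 0; i < MASK_WORDS; i++) {
        if (tag_mask[i] == ~0) continue;
        for (j = 0; j < (int)(8 * sizeof(int)); j++) {
            int bit = 1 << ((8 * sizeof(int)) - j - 1);
            if (!(tag_mask[i] & bit)) {
                tag_mask[i] |= bit;
                return (int)(i * 8 * sizeof(int)) + j;
            }
        }
    }
    return -1;   /* every tag is in use */
}

static void free_tag(int tag)
{
    int idx = tag / (int)(8 * sizeof(int));
    int rem = tag % (int)(8 * sizeof(int));
    tag_mask[idx] &= ~(1 << ((8 * sizeof(int)) - 1 - rem));
}
```

Because allocation always takes the highest-order free bit, freed tags are reused before new ones are handed out, keeping tag values small.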
+
+
+/*
+ * MPIDI_Open_port()
+ */
+static int MPIDI_Open_port(MPID_Info *info_ptr, char *port_name)
+{
+    int mpi_errno = MPI_SUCCESS;
+    int str_errno = MPIU_STR_SUCCESS;
+    int len;
+    int port_name_tag = 0; /* this tag is added to the business card,
+                              which is then returned as the port name */
+    int taskid_tag;
+    int myRank = MPIR_Process.comm_world->rank;
+
+    mpi_errno = MPIDI_get_port_name_tag(&port_name_tag);
+    TRACE_ERR("MPIDI_get_port_name_tag - port_name_tag=%d mpi_errno=%d\n", port_name_tag, mpi_errno);
+
+    len = MPI_MAX_PORT_NAME;
+    str_errno = MPIU_Str_add_int_arg(&port_name, &len,
+                                     MPIDI_PORT_NAME_TAG_KEY, port_name_tag);
+    /*len = MPI_MAX_TASKID_NAME;*/
+    taskid_tag = PAMIX_Client_query(MPIDI_Client, PAMI_CLIENT_TASK_ID  ).value.intval;
+    str_errno = MPIU_Str_add_int_arg(&port_name, &len,
+                                     MPIDI_TASKID_TAG_KEY, taskid_tag);
+    TRACE_ERR("MPIU_Str_add_int_arg - port_name=%s str_errno=%d\n", port_name, str_errno);
+
+    /* This works because Get_business_card uses the same MPIU_Str_xxx
+       functions as above to add the business card to the input string */
+    /* FIXME: We should instead ask the mpid_pg routines to give us
+       a connection string. There may need to be a separate step to
+       restrict us to a connection information that is only valid for
+       connections between processes that are started separately (e.g.,
+       may not use shared memory).  We may need a channel-specific
+       function to create an exportable connection string.  */
+
+fn_exit:
+    return mpi_errno;
+fn_fail:
+    goto fn_exit;
+}
+
+/*
+ * MPIDI_Close_port()
+ */
+static int MPIDI_Close_port(const char *port_name)
+{
+    int mpi_errno = MPI_SUCCESS;
+    int port_name_tag;
+
+    mpi_errno = MPIDI_GetTagFromPort(port_name, &port_name_tag);
+
+    MPIDI_free_port_name_tag(port_name_tag);
+
+fn_exit:
+    return mpi_errno;
+fn_fail:
+    goto fn_exit;
+}
+
+/*
+ * The connect and accept routines use this routine to get the port tag
+ * from the port name.
+ */
+int MPIDI_GetTagFromPort( const char *port_name, int *port_name_tag )
+{
+    int mpi_errno = MPI_SUCCESS;
+    int str_errno = MPIU_STR_SUCCESS;
+
+    str_errno = MPIU_Str_get_int_arg(port_name, MPIDI_PORT_NAME_TAG_KEY,
+                                     port_name_tag);
+
+ fn_exit:
+    return mpi_errno;
+ fn_fail:
+    goto fn_exit;
+}
+
+/*
+ * The connect and accept routines use this routine to get the taskid
+ * from the port name.
+ */
+int MPIDI_GetTaskidFromPort( const char *port_name, int *taskid_tag )
+{
+    int mpi_errno = MPI_SUCCESS;
+    int str_errno = MPIU_STR_SUCCESS;
+
+    str_errno = MPIU_Str_get_int_arg(port_name, MPIDI_TASKID_TAG_KEY,
+                                     taskid_tag);
+
+ fn_exit:
+    return mpi_errno;
+ fn_fail:
+    goto fn_exit;
+}
+#endif
diff --git a/src/mpid/pamid/src/dyntask/mpidi_pg.c b/src/mpid/pamid/src/dyntask/mpidi_pg.c
new file mode 100644
index 0000000..bd9d748
--- /dev/null
+++ b/src/mpid/pamid/src/dyntask/mpidi_pg.c
@@ -0,0 +1,961 @@
+/* -*- Mode: C; c-basic-offset:4 ; -*- */
+/*
+ *  (C) 2001 by Argonne National Laboratory.
+ *      See COPYRIGHT in top-level directory.
+ */
+#include <mpidimpl.h>
+#ifdef USE_PMI2_API
+#include "pmi2.h"
+#else
+#include "pmi.h"
+#endif
+
+#ifdef DYNAMIC_TASKING
+
+#define MAX_JOBID_LEN 1024
+
+/* FIXME: These routines need a description.  What is their purpose?  Who
+   calls them and why?  What does each one do?
+*/
+static MPIDI_PG_t * MPIDI_PG_list = NULL;
+static MPIDI_PG_t * MPIDI_PG_iterator_next = NULL;
+static MPIDI_PG_Compare_ids_fn_t MPIDI_PG_Compare_ids_fn;
+static MPIDI_PG_Destroy_fn_t MPIDI_PG_Destroy_fn;
+
+/* Set verbose to 1 to record changes to the process group structure. */
+static int verbose = 0;
+
+/* Keep track of the process group corresponding to the MPI_COMM_WORLD
+   of this process */
+static MPIDI_PG_t *pg_world = NULL;
+
+#define MPIDI_MAX_KVS_KEY_LEN      256
+
+extern conn_info *_conn_info_list;
+
+int MPIDI_PG_Init(int *argc_p, char ***argv_p,
+		  MPIDI_PG_Compare_ids_fn_t compare_ids_fn,
+		  MPIDI_PG_Destroy_fn_t destroy_fn)
+{
+    int mpi_errno = MPI_SUCCESS;
+    char *p;
+
+    MPIDI_PG_Compare_ids_fn = compare_ids_fn;
+    MPIDI_PG_Destroy_fn     = destroy_fn;
+
+    /* Check for debugging options.  We use MPICHD_DBG and -mpichd-dbg
+       to avoid confusion with the code in src/util/dbg/dbg_printf.c */
+    p = getenv( "MPICHD_DBG_PG" );
+    if (p && ( strcmp( p, "YES" ) == 0 || strcmp( p, "yes" ) == 0) )
+	verbose = 1;
+    if (argc_p && argv_p) {
+	int argc = *argc_p, i;
+	char **argv = *argv_p;
+	/* applied patch from Juha Jeronen, req #3920 */
+	for (i=1; i<argc && argv[i]; i++) {
+	    if (strcmp( "-mpichd-dbg-pg", argv[i] ) == 0) {
+		int j;
+		verbose = 1;
+		for (j=i; j<argc-1; j++) {
+		    argv[j] = argv[j+1];
+		}
+		argv[argc-1] = NULL;
+		*argc_p = argc - 1;
+		break;
+	    }
+	}
+    }
+
+    return mpi_errno;
+}
+
+/*@
+   MPIDI_PG_Finalize - Finalize the process groups, including freeing all
+   process group structures
+  @*/
+int MPIDI_PG_Finalize(void)
+{
+   int mpi_errno = MPI_SUCCESS;
+   conn_info              *conn_node;
+   int                    my_max_worldid, world_max_worldid;
+   int                    wid_bit_array_size=0, wid;
+   unsigned char          *wid_bit_array=NULL, *root_wid_barray=NULL;
+   MPIDI_PG_t *pg, *pgNext;
+   char key[PMI2_MAX_KEYLEN];
+   char value[PMI2_MAX_VALLEN];
+
+   /* Print the state of the process groups */
+   if (verbose) {
+     MPIU_PG_Printall( stdout );
+   }
+
+   /* FIXME - straighten out the use of PMI_Finalize - no use after
+      PG_Finalize */
+   conn_node     = _conn_info_list;
+   my_max_worldid  = -1;
+
+   while(NULL != conn_node) {
+     if(conn_node->rem_world_id>my_max_worldid && conn_node->ref_count>0)
+       my_max_worldid = conn_node->rem_world_id;
+     conn_node = conn_node->next;
+   }
+   MPIR_Allreduce_impl( &my_max_worldid, &world_max_worldid, 1, MPI_INT, MPI_MAX,   MPIR_Process.comm_world, &mpi_errno);
+
+   /* Create a bit array of size world_max_worldid + 1.
+    * We add 1 because if my world is only connected to world_id 0,
+    * then world_max_worldid == 0 and, without the +1, the bit array
+    * would have size 0.  Likewise, if world_max_worldid is 8, then
+    * without the +1 the bit array would be only 1 byte, and setting
+    * the bit in position 8 would write past the end of it.
+    */
+   if(world_max_worldid != -1) {
+     world_max_worldid++;
+     wid_bit_array_size = (world_max_worldid + CHAR_BIT -1) / CHAR_BIT;
+     wid_bit_array = MPIU_Malloc(wid_bit_array_size*sizeof(unsigned char));
+     memset(wid_bit_array, 0, wid_bit_array_size*sizeof(unsigned char));
+     root_wid_barray = MPIU_Malloc(wid_bit_array_size*sizeof(unsigned char));
+
+     memset(root_wid_barray, 0, wid_bit_array_size*sizeof(unsigned char));
+     conn_node     = _conn_info_list;
+     while(NULL != conn_node) {
+       if(conn_node->ref_count >0) {
+	 wid = conn_node->rem_world_id;
+	 wid_bit_array[wid/CHAR_BIT] |= 1 << (wid%CHAR_BIT);
+	 TRACE_ERR("wid=%d wid_bit_array[%d]=%x\n", wid, wid/CHAR_BIT, 1 << (wid%CHAR_BIT));
+       }
+       conn_node = conn_node->next;
+
+     }
+     /* Let root of my world know about this bit array */
+     MPIR_Reduce_impl(wid_bit_array,root_wid_barray,wid_bit_array_size,
+		   MPI_UNSIGNED_CHAR,MPI_BOR,0,MPIR_Process.comm_world,&mpi_errno);
+
+     MPIU_Free(wid_bit_array);
+   }
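The sizing rule explained in the comment above, and the bit-set step that follows it, can be shown in a small sketch. `wid_array_bytes` and `wid_set` are hypothetical helpers illustrating the same arithmetic, assuming 0-based world ids:

```c
#include <assert.h>
#include <limits.h>

/* For world ids 0..max_wid the array must hold max_wid + 1 bits,
 * rounded up to whole bytes. */
static int wid_array_bytes(int max_wid)
{
    int nbits = max_wid + 1;                    /* ids are 0-based */
    return (nbits + CHAR_BIT - 1) / CHAR_BIT;   /* round up */
}

/* Mark world id `wid` as connected, as the loop above does. */
static void wid_set(unsigned char *arr, int wid)
{
    arr[wid / CHAR_BIT] |= (unsigned char)(1u << (wid % CHAR_BIT));
}
```

With 8-bit chars, a maximum world id of 8 needs 9 bits and hence 2 bytes, which is exactly the overflow case the comment warns about.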
+
+   if(MPIR_Process.comm_world->rank == 0) {
+
+     MPIU_Snprintf(key, PMI2_MAX_KEYLEN-1, "%s", "ROOTWIDARRAY");
+     MPIU_Snprintf(value, PMI2_MAX_VALLEN-1, "%s", root_wid_barray);
+     TRACE_ERR("root_wid_barray=%s\n", value);
+     key[strlen(key)+1]='\0';
+     value[strlen(value)+1]='\0';
+     mpi_errno = PMI2_KVS_Put(key, value);
+     TRACE_ERR("PMI2_KVS_Put returned with mpi_errno=%d\n", mpi_errno);
+
+     MPIU_Snprintf(key, PMI2_MAX_KEYLEN-1, "%s", "WIDBITARRAYSZ");
+     MPIU_Snprintf(value, PMI2_MAX_VALLEN-1, "%x", wid_bit_array_size);
+     key[strlen(key)+1]='\0';
+     value[strlen(value)+1]='\0';
+     mpi_errno = PMI2_KVS_Put(key, value);
+     TRACE_ERR("PMI2_KVS_Put returned with mpi_errno=%d\n", mpi_errno);
+
+   }
+   mpi_errno = PMI2_KVS_Fence();
+   TRACE_ERR("PMI2_KVS_Fence returned with mpi_errno=%d\n", mpi_errno);
+
+   MPIU_Free(root_wid_barray); /* root_wid_barray is now NULL for non-root */
+/*    if (pg_world->connData) { */
+#ifdef USE_PMI2_API
+	mpi_errno = PMI2_Finalize();
+#else
+	int rc;
+	rc = PMI_Finalize();
+	if (rc) {
+          TRACE_ERR("PMI_Finalize returned with rc=%d\n", rc);
+	}
+#endif
+    /*}*/
+
+   if(_conn_info_list) {
+     if(_conn_info_list->rem_taskids)
+       MPIU_Free(_conn_info_list->rem_taskids);
+     else
+       MPIU_Free(_conn_info_list);
+   }
+   /* Free the storage associated with the process groups */
+   pg = MPIDI_PG_list;
+   while (pg) {
+     pgNext = pg->next;
+
+     /* In finalize, we free all process group information, even if
+        the ref count is not zero.  This can happen if the user
+        fails to use MPI_Comm_disconnect on communicators that
+        were created with the dynamic process routines.*/
+	/* XXX DJG FIXME-MT should we be checking this? */
+     if (MPIU_Object_get_ref(pg) == 0 || 1) {
+       if (pg == MPIDI_Process.my_pg)
+         MPIDI_Process.my_pg = NULL;
+        MPIU_Object_set_ref(pg, 0); /* satisfy assertions in PG_Destroy */
+        MPIDI_PG_Destroy( pg );
+     }
+     pg     = pgNext;
+   }
+
+   /* If COMM_WORLD is still around (it normally should be),
+      try to free it here.  The reason that we need to free it at this
+      point is that comm_world (and comm_self) still exist, and
+      hence the usual process to free the related VC structures will
+      not be invoked. */
+   if (MPIDI_Process.my_pg) {
+     MPIDI_PG_Destroy(MPIDI_Process.my_pg);
+   }
+   MPIDI_Process.my_pg = NULL;
+
+   return mpi_errno;
+}
+
+
+/* This routine creates a new process group description and appends it to
+   the list of the known process groups.  The pg_id is saved, not copied.
+   The PG_Destroy routine that was set with MPIDI_PG_Init is responsible for
+   freeing any storage associated with the pg_id.
+
+   The new process group is returned in pg_ptr
+*/
+int MPIDI_PG_Create(int vct_sz, void * pg_id, MPIDI_PG_t ** pg_ptr)
+{
+    MPIDI_PG_t * pg = NULL, *pgnext;
+    int p, i, j;
+    int mpi_errno = MPI_SUCCESS;
+    char *cp, *world_tasks, *cp1;
+
+    pg = MPIU_Malloc(sizeof(MPIDI_PG_t));
+    pg->vct = MPIU_Malloc(sizeof(struct MPID_VCR_t)*vct_sz);
+
+    pg->handle = 0;
+    /* The reference count indicates the number of vc's that are or
+       have been in use and not disconnected. It starts at zero,
+       except for MPI_COMM_WORLD. */
+    MPIU_Object_set_ref(pg, 0);
+    pg->size = vct_sz;
+    pg->id   = MPIU_Strdup(pg_id);
+    TRACE_ERR("PG_Create - pg=%x pg->id=%s pg->vct=%x\n", pg, pg->id, pg->vct);
+    /* Initialize the connection information to null.  Use
+       the appropriate MPIDI_PG_InitConnXXX routine to set up these
+       fields */
+
+    pg->connData           = 0;
+    pg->getConnInfo        = 0;
+    pg->connInfoToString   = 0;
+    pg->connInfoFromString = 0;
+    pg->freeConnInfo       = 0;
+
+    for (p = 0; p < vct_sz; p++)
+    {
+	/* Initialize device fields in the VC object */
+	MPIDI_VC_Init(&pg->vct[p], pg,p);
+    }
+
+    /* The first process group is always the world group */
+    if (!pg_world) { pg_world = pg; }
+
+    /* Add pg's at the tail so that comm world is always the first pg */
+    pg->next = 0;
+    if (!MPIDI_PG_list)
+    {
+	MPIDI_PG_list = pg;
+    }
+    else
+    {
+	pgnext = MPIDI_PG_list;
+	while (pgnext->next)
+	{
+	    pgnext = pgnext->next;
+	}
+	pgnext->next = pg;
+    }
+    /* These are now done in MPIDI_VC_Init */
+    *pg_ptr = pg;
+
+  fn_exit:
+    return mpi_errno;
+
+  fn_fail:
+    goto fn_exit;
+}
+
+int MPIDI_PG_Destroy(MPIDI_PG_t * pg)
+{
+    MPIDI_PG_t * pg_prev;
+    MPIDI_PG_t * pg_cur;
+    int i;
+    int mpi_errno = MPI_SUCCESS;
+
+    MPIU_Assert(MPIU_Object_get_ref(pg) == 0);
+
+    pg_prev = NULL;
+    pg_cur = MPIDI_PG_list;
+
+    while(pg_cur != NULL)
+    {
+	if (pg_cur == pg)
+	{
+	    if (MPIDI_PG_iterator_next == pg)
+	    {
+		MPIDI_PG_iterator_next = MPIDI_PG_iterator_next->next;
+	    }
+
+	    if (pg_prev == NULL)
+		MPIDI_PG_list = pg->next;
+	    else
+		pg_prev->next = pg->next;
+
+	    TRACE_ERR("destroying pg=%p pg->id=%s\n", pg, (char *)pg->id);
+
+	    for (i = 0; i < pg->size; ++i) {
+		/* FIXME it would be good if we could make this assertion.
+		   Unfortunately, either:
+		   1) We're not being disciplined and some caller of this
+		      function doesn't bother to manage all the refcounts
+		      because he thinks he knows better.  Annoying, but not
+		      strictly a bug.
+		      (wdg - actually, that is a bug - managing the ref
+		      counts IS required and missing one is a bug.)
+		   2) There is a real bug lurking out there somewhere and we
+		      just haven't hit it in the tests yet.  */
+
+		/* This used to be handled in MPID_VCRT_Release, but that was
+		   not the right place to do this.  The VC should only be freed
+		   when the PG that it belongs to is freed, not just when the
+		   VC's refcount drops to zero. [goodell@ 2008-06-13] */
+		/* In that case, the fact that the VC is in the PG should
+		   increment the ref count - reflecting the fact that the
+		   use in the PG constitutes a reference-count-incrementing
+		   use.  Alternately, if the PG is able to recreate a VC,
+		   and can thus free unused (or idle) VCs, it should be allowed
+		   to do so.  [wdg 2008-08-31] */
+	    }
+
+	    MPIDI_PG_Destroy_fn(pg);
+	    TRACE_ERR("destroying pg->vct=%x\n", pg->vct);
+	    MPIU_Free(pg->vct);
+	    TRACE_ERR("after destroying pg->vct=%x\n", pg->vct);
+#if 0
+	    if (pg->connData) {
+		if (pg->freeConnInfo) {
+                    TRACE_ERR("calling freeConnInfo on pg\n");
+		    (*pg->freeConnInfo)( pg );
+		}
+		else {
+                    TRACE_ERR("free pg->connData\n");
+		    MPIU_Free(pg->connData);
+		}
+	    }
+#endif
+	    TRACE_ERR("final destroying pg\n");
+	    MPIU_Free(pg);
+
+	    goto fn_exit;
+	}
+
+	pg_prev = pg_cur;
+	pg_cur = pg_cur->next;
+    }
+
+  fn_exit:
+    return mpi_errno;
+  fn_fail:
+    goto fn_exit;
+}
+
+int MPIDI_PG_Find(void * id, MPIDI_PG_t ** pg_ptr)
+{
+    MPIDI_PG_t * pg;
+    int mpi_errno = MPI_SUCCESS;
+
+    pg = MPIDI_PG_list;
+
+    while (pg != NULL)
+    {
+	if (MPIDI_PG_Compare_ids_fn(id, pg->id) != FALSE)
+	{
+	    *pg_ptr = pg;
+	    goto fn_exit;
+	}
+
+	pg = pg->next;
+    }
+
+    *pg_ptr = NULL;
+
+  fn_exit:
+    return mpi_errno;
+}
+
+
+int MPIDI_PG_Id_compare(void * id1, void *id2)
+{
+    return MPIDI_PG_Compare_ids_fn(id1, id2);
+}
+
+/* iter always points at the next element */
+int MPIDI_PG_Get_next(MPIDI_PG_iterator *iter, MPIDI_PG_t ** pg_ptr)
+{
+    *pg_ptr = (*iter);
+    if ((*iter) != NULL) {
+	(*iter) = (*iter)->next;
+    }
+
+    return MPI_SUCCESS;
+}
+
+int MPIDI_PG_Has_next(MPIDI_PG_iterator *iter)
+{
+    return (*iter != NULL);
+}
+
+int MPIDI_PG_Get_iterator(MPIDI_PG_iterator *iter)
+{
+    *iter = MPIDI_PG_list;
+    return MPI_SUCCESS;
+}
+
+/* FIXME: What does DEV_IMPLEMENTS_KVS mean?  Why is it used?  Who uses
+   PG_To_string and why?  */
+
+/* PG_To_string is used in the implementation of connect/accept (and
+   hence in spawn) */
+/* Note: This routine allocates memory that is returned in str_ptr.
+   The caller of this routine must free that data */
+int MPIDI_PG_To_string(MPIDI_PG_t *pg_ptr, char **str_ptr, int *lenStr)
+{
+    int mpi_errno = MPI_SUCCESS;
+
+    /* Replace this with the new string */
+    MPIDI_connToStringKVS( str_ptr, lenStr, pg_ptr );
+#if 0
+    if (pg_ptr->connInfoToString) {
+	(*pg_ptr->connInfoToString)( str_ptr, lenStr, pg_ptr );
+    }
+    else {
+	MPIU_ERR_SETANDJUMP(mpi_errno,MPI_ERR_INTERN,"**noConnInfoToString");
+    }
+#endif
+
+fn_exit:
+    return mpi_errno;
+fn_fail:
+    goto fn_exit;
+}
+
+/* This routine takes a string description of a process group (created with
+   MPIDI_PG_To_string, usually on a different process) and returns a pointer to
+   the matching process group.  If the group already exists, flag is set to
+   false.  If the group does not exist, it is created with MPIDI_PG_Create (and
+   hence is added to the list of active process groups) and flag is set to
+   true.  In addition, the connection information is set up using the
+   information in the input string.
+*/
+int MPIDI_PG_Create_from_string(const char * str, MPIDI_PG_t ** pg_pptr,
+				int *flag)
+{
+    int mpi_errno = MPI_SUCCESS;
+    const char *p;
+    char *pg_id, *pg_id2, *cp2, *cp3,*str2, *str3;
+    pami_task_t taskids[10];
+    int vct_sz, i;
+    MPIDI_PG_t *existing_pg, *pg_ptr=0;
+
+    /* The pg_id is at the beginning of the string, so we can just pass
+       it to the find routine */
+    /* printf( "Looking for pg with id %s\n", str );fflush(stdout); */
+    mpi_errno = MPIDI_PG_Find((void *)str, &existing_pg);
+    if (mpi_errno) TRACE_ERR("MPIDI_PG_Find returned with mpi_errno=%d\n", mpi_errno);
+
+    if (existing_pg != NULL) {
+	/* return the existing PG */
+	*pg_pptr = existing_pg;
+	*flag = 0;
+	/* Note that the memory for the pg_id is freed in the exit */
+	goto fn_exit;
+    }
+    *flag = 1;
+
+    /* Get the size from the string: it is the null-terminated
+       field that follows the null-terminated pg_id */
+    p = str;
+    while (*p) p++;     /* skip past the pg_id */
+    p++;                /* step over its terminator */
+    vct_sz = atoi(p);
+    mpi_errno = MPIDI_PG_Create(vct_sz, (void *)str, pg_pptr);
+    if (mpi_errno != MPI_SUCCESS) {
+	TRACE_ERR("MPIDI_PG_Create returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    pg_ptr = *pg_pptr;
+    TRACE_ERR("pg_ptr->id = %s\n",(*pg_pptr)->id);
+
+    if(verbose)
+      MPIU_PG_Printall(stderr);
+
+fn_exit:
+    return mpi_errno;
+fn_fail:
+    goto fn_exit;
+}
+
+#ifdef HAVE_CTYPE_H
+/* Needed for isdigit */
+#include <ctype.h>
+#endif
+
+
+/* For all of these routines, the format of the process group description
+   that is created and used by the connTo/FromString routines is this:
+   (All items are strings, terminated by null)
+
+   process group id string
+   sizeof process group (as string)
+   conninfo for rank 0
+   conninfo for rank 1
+   ...
+
+   The "conninfo for rank 0" etc. for the original (MPI_COMM_WORLD)
+   process group are stored in the PMI_KVS space with the keys
+   p<rank>-businesscard .
+
+   Fixme: Add a routine to publish the connection info to this file so that
+   the key for the businesscard is defined in just this one file.
+*/
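The layout documented above is a flat sequence of null-terminated strings. A hypothetical parser (not part of the MPICH source) that extracts the size field shows how consumers walk such a description:

```c
#include <assert.h>
#include <stdlib.h>

/* The description is: pg id string, then the size as a string, then
 * one conninfo entry per rank, each null-terminated.  This sketch
 * pulls out the size field. */
static int parse_pg_size(const char *desc)
{
    const char *p = desc;
    while (*p) p++;      /* skip past the pg id string */
    p++;                 /* step over its terminator */
    return atoi(p);      /* the size, encoded as a string */
}
```

This is the same skip-past-the-terminator walk used by `MPIDI_PG_Create_from_string` above.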
+
+
+/* The "KVS" versions are for the process group to which the calling
+   process belongs.  These use the PMI_KVS routines to access the
+   process information */
+static int MPIDI_getConnInfoKVS( int rank, char *buf, int bufsize, MPIDI_PG_t *pg )
+{
+#ifdef USE_PMI2_API
+    char key[MPIDI_MAX_KVS_KEY_LEN];
+    int  mpi_errno = MPI_SUCCESS, rc;
+    int vallen;
+
+    rc = MPIU_Snprintf(key, MPIDI_MAX_KVS_KEY_LEN, "P%d-businesscard", rank );
+
+    mpi_errno = PMI2_KVS_Get(pg->connData, PMI2_ID_NULL, key, buf, bufsize, &vallen);
+    if (mpi_errno) {
+	TRACE_ERR("PMI2_KVS_Get returned with mpi_errno=%d\n", mpi_errno);
+    }
+ fn_exit:
+    return mpi_errno;
+ fn_fail:
+    goto fn_exit;
+#else
+    char key[MPIDI_MAX_KVS_KEY_LEN];
+    int  mpi_errno = MPI_SUCCESS, rc, pmi_errno;
+
+    rc = MPIU_Snprintf(key, MPIDI_MAX_KVS_KEY_LEN, "P%d-businesscard", rank );
+    if (rc < 0 || rc > MPIDI_MAX_KVS_KEY_LEN) {
+	MPIU_ERR_SETANDJUMP(mpi_errno,MPI_ERR_OTHER,"**nomem");
+    }
+
+/*    MPIU_THREAD_CS_ENTER(PMI,);*/
+    pmi_errno = PMI_KVS_Get(pg->connData, key, buf, bufsize );
+    if (pmi_errno) {
+	MPIDI_PG_CheckForSingleton();
+	pmi_errno = PMI_KVS_Get(pg->connData, key, buf, bufsize );
+    }
+/*    MPIU_THREAD_CS_EXIT(PMI,);*/
+    if (pmi_errno) {
+	MPIU_ERR_SETANDJUMP(mpi_errno,MPI_ERR_OTHER,"**pmi_kvs_get");
+    }
+
+ fn_exit:
+   return mpi_errno;
+ fn_fail:
+    goto fn_exit;
+#endif
+}
+
+/* *slen is the length of the string, including the null terminator.  So if the
+   resulting string is |foo\0bar\0|, then *slen == 8. */
+int MPIDI_connToStringKVS( char **buf_p, int *slen, MPIDI_PG_t *pg )
+{
+    char *string = 0;
+    char *pg_idStr = (char *)pg->id;      /* In the PMI/KVS space,
+					     the pg id is a string */
+    char buf[MPIDI_MAX_KVS_VALUE_LEN];
+    int   i, j, vallen, rc, mpi_errno = MPI_SUCCESS, len;
+    int   curSlen;
+
+    /* Make an initial allocation of a string with an estimate of the
+       needed space */
+    len = 0;
+    curSlen = 10 + pg->size * 128;
+    string = (char *)MPIU_Malloc( curSlen );
+
+    /* Start with the id of the pg */
+    while (*pg_idStr && len < curSlen)
+	string[len++] = *pg_idStr++;
+    string[len++] = 0;
+
+    /* Add the size of the pg */
+    MPIU_Snprintf( &string[len], curSlen - len, "%d", pg->size );
+    while (string[len]) len++;
+    len++;
+
+#if 0
+    for (i=0; i<pg->size; i++) {
+	rc = getConnInfoKVS( i, buf, MPIDI_MAX_KVS_VALUE_LEN, pg );
+	if (rc) {
+	    MPIU_Internal_error_printf(
+		    "Panic: getConnInfoKVS failed for %s (rc=%d)\n",
+		    (char *)pg->id, rc );
+	}
+#ifndef USE_PERSISTENT_SHARED_MEMORY
+	/* FIXME: This is a hack to avoid including shared-memory
+	   queue names in the business card that may be used
+	   by processes that were not part of the same COMM_WORLD.
+	   To fix this, the shared memory channels should look at the
+	   returned connection info and decide whether to use
+	   sockets or shared memory by determining whether the
+	   process is in the same MPI_COMM_WORLD. */
+	/* FIXME: The more general problem is that the connection information
+	   needs to include some information on the range of validity (e.g.,
+	   all processes, same comm world, particular ranks), and that
+	   representation needs to be scalable */
+/*	printf( "Adding key %s value %s\n", key, val ); */
+	{
+	char *p = strstr( buf, "$shm_host" );
+	if (p) p[1] = 0;
+	/*	    printf( "(fixed) Adding key %s value %s\n", key, val ); */
+	}
+#endif
+	/* Add the information to the output buffer */
+	vallen = strlen(buf);
+	/* Check that this will fit in the remaining space */
+	if (len + vallen + 1 >= curSlen) {
+	    char *nstring = 0;
+            curSlen += (pg->size - i) * (vallen + 1 );
+	    nstring = MPIU_Realloc( string, curSlen);
+	    if (!nstring) {
+		MPIU_ERR_SETANDJUMP(mpi_errno,MPI_ERR_OTHER,"**nomem");
+	    }
+	    string = nstring;
+	}
+	/* Append to string */
+	for (j=0; j<vallen+1; j++) {
+	    string[len++] = buf[j];
+	}
+    }
+#endif
+
+    MPIU_Assert(len <= curSlen);
+
+    *buf_p = string;
+    *slen  = len;
+ fn_exit:
+    return mpi_errno;
+ fn_fail:
+    if (string) MPIU_Free(string);
+    goto fn_exit;
+}
+
+static int MPIDI_connFromStringKVS( const char *buf ATTRIBUTE((unused)),
+			      MPIDI_PG_t *pg ATTRIBUTE((unused)) )
+{
+    /* Fixme: this should be a failure to call this routine */
+    return MPI_SUCCESS;
+}
+static int MPIDI_connFreeKVS( MPIDI_PG_t *pg )
+{
+    if (pg->connData) {
+	MPIU_Free( pg->connData );
+    }
+    return MPI_SUCCESS;
+}
+
+
+int MPIDI_PG_InitConnKVS( MPIDI_PG_t *pg )
+{
+#ifdef USE_PMI2_API
+    int mpi_errno = MPI_SUCCESS;
+
+    pg->connData = (char *)MPIU_Malloc(MAX_JOBID_LEN);
+    if (pg->connData == NULL) {
+	TRACE_ERR("MPIDI_PG_InitConnKVS - MPIU_Malloc failure\n");
+    }
+
+    mpi_errno = PMI2_Job_GetId(pg->connData, MAX_JOBID_LEN);
+    if (mpi_errno) TRACE_ERR("PMI2_Job_GetId returned with mpi_errno=%d\n", mpi_errno);
+#else
+    int pmi_errno, kvs_name_sz;
+    int mpi_errno = MPI_SUCCESS;
+
+    pmi_errno = PMI_KVS_Get_name_length_max( &kvs_name_sz );
+    if (pmi_errno != PMI_SUCCESS) {
+	MPIU_ERR_SETANDJUMP1(mpi_errno,MPI_ERR_OTHER,
+			     "**pmi_kvs_get_name_length_max",
+			     "**pmi_kvs_get_name_length_max %d", pmi_errno);
+    }
+
+    pg->connData = (char *)MPIU_Malloc(kvs_name_sz + 1);
+    if (pg->connData == NULL) {
+	MPIU_ERR_SETANDJUMP(mpi_errno,MPI_ERR_OTHER, "**nomem");
+    }
+
+    pmi_errno = PMI_KVS_Get_my_name(pg->connData, kvs_name_sz);
+    if (pmi_errno != PMI_SUCCESS) {
+	MPIU_ERR_SETANDJUMP1(mpi_errno,MPI_ERR_OTHER,
+			     "**pmi_kvs_get_my_name",
+			     "**pmi_kvs_get_my_name %d", pmi_errno);
+    }
+#endif
+    pg->getConnInfo        = MPIDI_getConnInfoKVS;
+    pg->connInfoToString   = MPIDI_connToStringKVS;
+    pg->connInfoFromString = MPIDI_connFromStringKVS;
+    pg->freeConnInfo       = MPIDI_connFreeKVS;
+
+ fn_exit:
+    return mpi_errno;
+ fn_fail:
+    if (pg->connData) { MPIU_Free(pg->connData); }
+    goto fn_exit;
+}
+
+/* Return the kvsname associated with the MPI_COMM_WORLD of this process. */
+int MPIDI_PG_GetConnKVSname( char ** kvsname )
+{
+    *kvsname = pg_world->connData;
+    return MPI_SUCCESS;
+}
+
+/* For process groups that are not our MPI_COMM_WORLD, store the connection
+   information in an array of strings.  These routines and structure
+   implement the access to this information. */
+typedef struct {
+    int     toStringLen;   /* Length needed to encode this connection info */
+    char ** connStrings;   /* pointer to an array, indexed by rank, containing
+			      connection information */
+} MPIDI_ConnInfo;
+
+
+static int MPIDI_getConnInfo( int rank, char *buf, int bufsize, MPIDI_PG_t *pg )
+{
+    MPIDI_ConnInfo *connInfo = (MPIDI_ConnInfo *)pg->connData;
+
+    /* printf( "Entering getConnInfo\n" ); fflush(stdout); */
+    if (!connInfo || !connInfo->connStrings || !connInfo->connStrings[rank]) {
+	/* FIXME: Create and return a valid error code here */
+	/*printf( "Fatal error in getConnInfo (rank = %d)\n", rank );
+	printf( "connInfo = %p\n", connInfo );fflush(stdout); */
+	if (connInfo) {
+/*	    printf( "connInfo->connStrings = %p\n", connInfo->connStrings ); */
+	}
+	/* Fatal error.  Connection information missing */
+	fflush(stdout);
+    }
+
+    /* printf( "Copying %s to buf\n", connInfo->connStrings[rank] ); fflush(stdout); */
+
+    MPIU_Strncpy( buf, connInfo->connStrings[rank], bufsize );
+    return MPI_SUCCESS;
+}
+
+
+static int MPIDI_connToString( char **buf_p, int *slen, MPIDI_PG_t *pg )
+{
+    int mpi_errno = MPI_SUCCESS;
+    char *str = NULL, *pg_id;
+    int  i, len=0;
+    MPIDI_ConnInfo *connInfo = (MPIDI_ConnInfo *)pg->connData;
+
+    /* Create this from the string array */
+    str = (char *)MPIU_Malloc(connInfo->toStringLen);
+
+#if defined(MPICH_DEBUG_MEMINIT)
+    memset(str, 0, connInfo->toStringLen);
+#endif
+
+    pg_id = pg->id;
+    /* FIXME: This is a hack, and it doesn't even work */
+    /*    MPIDI_PrintConnStrToFile( stdout, __FILE__, __LINE__,
+	  "connToString: pg id is", (char *)pg_id );*/
+    /* This is intended to cause a process to transition from a singleton
+       to a non-singleton. */
+    /* XXX DJG TODO figure out what this little bit is all about. */
+    if (strstr( pg_id, "singinit_kvs" ) == pg_id) {
+#ifdef USE_PMI2_API
+        MPIU_Assertp(0); /* don't know what to do here for pmi2 yet.  DARIUS */
+#else
+	PMI_KVS_Get_my_name( pg->id, 256 );
+#endif
+    }
+
+    while (*pg_id) str[len++] = *pg_id++;
+    str[len++] = 0;
+
+    MPIU_Snprintf( &str[len], 20, "%d", pg->size);
+	/* Skip over the size string (and its terminator) */
+    while (str[len++]);
+
+    /* Copy each connection string */
+    for (i=0; i<pg->size; i++) {
+	char *p = connInfo->connStrings[i];
+	while (*p) { str[len++] = *p++; }
+	str[len++] = 0;
+    }
+
+    if (len > connInfo->toStringLen) {
+	*buf_p = 0;
+	*slen  = 0;
+	TRACE_ERR("len > connInfo->toStringLen");
+    }
+
+    *buf_p = str;
+    *slen = len;
+
+fn_exit:
+    return mpi_errno;
+fn_fail:
+    goto fn_exit;
+
+}
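For reference, the layout MPIDI_connToString builds is simply a run of NUL-terminated fields: the pg id, then the decimal size, then one connection string per rank. A minimal standalone sketch of that packing (hypothetical helper name, not part of MPICH):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Pack a pg id, a size, and `size` connection strings into one buffer
 * of NUL-terminated fields, mirroring the layout built by
 * MPIDI_connToString.  Returns a malloc'd buffer; *out_len gets the
 * total length, including every terminator. */
static char *pack_conn_info(const char *pg_id, int size,
                            const char **conn, int *out_len)
{
    char sizestr[20];
    int i, len = 0, total;

    snprintf(sizestr, sizeof(sizestr), "%d", size);

    /* First pass: compute the total length (each field plus its NUL). */
    total = (int)strlen(pg_id) + 1 + (int)strlen(sizestr) + 1;
    for (i = 0; i < size; i++)
        total += (int)strlen(conn[i]) + 1;

    char *buf = malloc(total);
    if (!buf) return NULL;

    /* Second pass: copy each field, keeping the terminators. */
    memcpy(buf + len, pg_id, strlen(pg_id) + 1);
    len += (int)strlen(pg_id) + 1;
    memcpy(buf + len, sizestr, strlen(sizestr) + 1);
    len += (int)strlen(sizestr) + 1;
    for (i = 0; i < size; i++) {
        memcpy(buf + len, conn[i], strlen(conn[i]) + 1);
        len += (int)strlen(conn[i]) + 1;
    }
    *out_len = len;
    return buf;
}
```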
+
+
+static int MPIDI_connFromString( const char *buf, MPIDI_PG_t *pg )
+{
+    MPIDI_ConnInfo *conninfo = 0;
+    int i, mpi_errno = MPI_SUCCESS;
+    const char *buf0 = buf;   /* save the start of buf */
+
+    /* printf( "Starting with buf = %s\n", buf );fflush(stdout); */
+
+    /* Skip the pg id */
+    while (*buf) buf++;
+    buf++;
+
+    /* Determine the size of the pg */
+    pg->size = atoi( buf );
+    while (*buf) buf++;
+    buf++;
+
+    conninfo = (MPIDI_ConnInfo *)MPIU_Malloc( sizeof(MPIDI_ConnInfo) );
+    conninfo->connStrings = (char **)MPIU_Malloc( pg->size * sizeof(char *));
+
+    /* For now, make a copy of each item */
+    for (i=0; i<pg->size; i++) {
+	/* printf( "Adding conn[%d] = %s\n", i, buf );fflush(stdout); */
+	conninfo->connStrings[i] = MPIU_Strdup( buf );
+	while (*buf) buf++;
+	buf++;
+    }
+    pg->connData = conninfo;
+
+    /* Save the length of the string needed to encode the connection
+       information */
+    conninfo->toStringLen = (int)(buf - buf0) + 1;
+
+    return mpi_errno;
+}
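The inverse walk that MPIDI_connFromString performs can be sketched in isolation: skip the pg id, read the decimal size with atoi, then take one pointer per rank. Hypothetical helper name; the returned array aliases the input buffer rather than duplicating each string as the MPICH code does with MPIU_Strdup:

```c
#include <stdlib.h>
#include <string.h>

/* Walk a buffer of NUL-separated fields the way MPIDI_connFromString
 * does.  *used gets the number of bytes consumed (one less than the
 * value the original code stores in toStringLen). */
static const char **parse_conn_info(const char *buf, int *size, int *used)
{
    const char *p = buf;
    int i;

    p += strlen(p) + 1;          /* skip the pg id and its terminator */
    *size = atoi(p);             /* decimal process-group size        */
    p += strlen(p) + 1;          /* skip the size string              */

    const char **conn = malloc(*size * sizeof(*conn));
    if (!conn) return NULL;
    for (i = 0; i < *size; i++) {
        conn[i] = p;             /* one string per rank, in rank order */
        p += strlen(p) + 1;
    }
    *used = (int)(p - buf);
    return conn;
}
```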
+
+
+static int MPIDI_connFree( MPIDI_PG_t *pg )
+{
+    MPIDI_ConnInfo *conninfo = (MPIDI_ConnInfo *)pg->connData;
+    int i;
+
+    for (i=0; i<pg->size; i++) {
+	MPIU_Free( conninfo->connStrings[i] );
+    }
+    MPIU_Free( conninfo->connStrings );
+    MPIU_Free( conninfo );
+
+    return MPI_SUCCESS;
+}
+
+
+/*@
+  MPIDI_PG_Dup_vcr - Duplicate a virtual connection from a process group
+
+  Notes:
+  This routine provides a dup of a virtual connection given a process group
+  and a rank in that group.  This routine is used only in initializing
+  the MPI-1 communicators 'MPI_COMM_WORLD' and 'MPI_COMM_SELF', and in creating
+  the initial intercommunicator after an 'MPI_Comm_spawn',
+  'MPI_Comm_spawn_multiple', or 'MPI_Comm_connect/MPI_Comm_accept'.
+
+  In addition to returning a dup of the virtual connection, it manages the
+  reference count of the process group, which is always the number of in-use
+  virtual connections.
+  @*/
+int MPIDI_PG_Dup_vcr( MPIDI_PG_t *pg, int rank, pami_task_t taskid, MPID_VCR *vcr_p )
+{
+    int inuse;
+    MPID_VCR vcr;
+
+    TRACE_ERR("ENTER MPIDI_PG_Dup_vcr - pg->id=%s rank=%d taskid=%d\n", pg->id, rank, taskid);
+    pg->vct[rank].taskid = taskid;
+    vcr = &pg->vct[rank];
+    TRACE_ERR("MPIDI_PG_Dup_vcr- pg->vct[%d].pg=%x pg=%x vcr=%x vcr->pg=%x\n", rank, pg->vct[rank].pg, pg, vcr, vcr->pg);
+    vcr->pg = pg;
+    vcr->pg_rank = rank;
+    vcr->taskid = taskid;
+    /* Increase the reference count of the vc.  If the reference count
+       increases from 0 to 1, increase the reference count of the
+       process group *and* the reference count of the vc (this
+       allows us to distinguish between Comm_free and Comm_disconnect) */
+    /* FIXME-MT: This should be a fetch and increment for thread-safety */
+    /*if (MPIU_Object_get_ref(vcr_p) == 0) { */
+	TRACE_ERR("MPIDI_PG_add_ref on pg=%s pg=%x\n", pg->id, pg);
+	MPIDI_PG_add_ref(pg);
+        inuse=MPIU_Object_get_ref(pg);
+	TRACE_ERR("after MPIDI_PG_add_ref on pg=%s inuse=%d\n", pg->id, inuse);
+/*	MPIDI_VC_add_ref(vcr_p);
+    }
+    MPIDI_VC_add_ref(vcr_p);*/
+    *vcr_p = vcr;
+
+    return MPI_SUCCESS;
+}
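The FIXME-MT note above asks for a fetch-and-increment rather than a separate load and store. One way to get that, sketched with C11 atomics (illustrative type only, not MPICH's actual MPIU_Object machinery):

```c
#include <stdatomic.h>

/* Bump a reference count atomically and report whether this was the
 * 0 -> 1 transition -- the case where the enclosing process group must
 * also be referenced.  Hypothetical object type for illustration. */
typedef struct { atomic_int ref_count; } obj_t;

static int obj_add_ref(obj_t *obj)
{
    /* atomic_fetch_add returns the PREVIOUS value, so detecting the
     * first reference needs no separate load (no lost-update race). */
    int prev = atomic_fetch_add(&obj->ref_count, 1);
    return prev == 0;   /* nonzero iff we took the count from 0 to 1 */
}
```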
+
+
+/*
+ * This routine may be called to print the contents (including states and
+ * reference counts) for process groups.
+ */
+int MPIU_PG_Printall( FILE *fp )
+{
+    MPIDI_PG_t *pg;
+    int         i;
+
+    pg = MPIDI_PG_list;
+
+    fprintf( fp, "Process groups:\n" );
+    while (pg) {
+        /* XXX DJG FIXME-MT should we be checking this? */
+	fprintf( fp, "size = %d, refcount = %d, id = %s\n",
+		 pg->size, MPIU_Object_get_ref(pg), (char *)pg->id );
+#if 0
+	for (i=0; i<pg->size; i++) {
+	    fprintf( fp, "\tVCT rank = %d, refcount = %d, taskid = %d, state = %d \n",
+		     pg->vct[i].pg_rank, MPIU_Object_get_ref(&pg->vct[i]),
+		     pg->vct[i].taskid, (int)pg->vct[i].state );
+	}
+#endif
+	fflush(fp);
+	pg = pg->next;
+    }
+
+    return 0;
+}
+
+#ifdef HAVE_CTYPE_H
+/* Needed for isdigit */
+#include <ctype.h>
+#endif
+/* Convert a process group id into a number.  This is a hash-based approach,
+ * which has the potential for some collisions.  This is an alternative to the
+ * previous approach that caused req#3930, which was to sum up the values of the
+ * characters.  The summing approach worked OK when the id's were all similar
+ * but with an incrementing prefix or suffix, but terrible for a 32 hex-character
+ * UUID type of id.
+ *
+ * FIXME It would really be best if the PM could give us this value.
+ */
+/* FIXME: This is a temporary hack for devices that do not define
+   MPIDI_DEV_IMPLEMENTS_KVS
+   FIXME: MPIDI_DEV_IMPLEMENTS_KVS should be removed
+ */
+void MPIDI_PG_IdToNum( MPIDI_PG_t *pg, int *id )
+{
+    *id = atoi((char *)pg->id);
+}
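The comment above explains why summing characters collided badly on UUID-like ids (req#3930) and why a hash is preferred. A djb2-style sketch of such a mapping, for illustration only; this is not necessarily the hash MPICH adopted, and the code in this device still uses the atoi hack:

```c
/* djb2-style string hash: maps a pg id (e.g. a 32-hex-character UUID)
 * to a non-negative int with far fewer collisions than summing the
 * characters.  Illustrative only. */
static int pg_id_to_num(const char *id)
{
    unsigned long h = 5381;
    while (*id)
        h = h * 33 + (unsigned char)*id++;   /* the djb2 step: h*33 + c */
    return (int)(h & 0x7fffffff);            /* keep the result non-negative */
}
```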
+#endif
diff --git a/src/mpid/pamid/src/dyntask/mpidi_port.c b/src/mpid/pamid/src/dyntask/mpidi_port.c
new file mode 100644
index 0000000..1ddafd0
--- /dev/null
+++ b/src/mpid/pamid/src/dyntask/mpidi_port.c
@@ -0,0 +1,1453 @@
+/* begin_generated_IBM_copyright_prolog                             */
+/*                                                                  */
+/* This is an automatically generated copyright prolog.             */
+/* After initializing,  DO NOT MODIFY OR MOVE                       */
+/*  --------------------------------------------------------------- */
+/* Licensed Materials - Property of IBM                             */
+/* Blue Gene/Q 5765-PER 5765-PRP                                    */
+/*                                                                  */
+/* (C) Copyright IBM Corp. 2011, 2012 All Rights Reserved           */
+/* US Government Users Restricted Rights -                          */
+/* Use, duplication, or disclosure restricted                       */
+/* by GSA ADP Schedule Contract with IBM Corp.                      */
+/*                                                                  */
+/*  --------------------------------------------------------------- */
+/*                                                                  */
+/* end_generated_IBM_copyright_prolog                               */
+/*  (C)Copyright IBM Corp.  2007, 2011  */
+
+#include <mpidimpl.h>
+
+#ifdef DYNAMIC_TASKING
+#define MAX_HOST_DESCRIPTION_LEN 256
+#ifdef USE_PMI2_API
+#define MPID_MAX_JOBID_LEN 256
+#endif
+
+
+typedef struct {
+  MPID_VCR vcr;
+  int        port_name_tag;
+}AM_struct;
+
+conn_info  *_conn_info_list = NULL;
+extern int mpidi_dynamic_tasking;
+
+typedef struct MPIDI_Acceptq
+{
+    int             port_name_tag;
+    MPID_VCR 	    vcr;
+    struct MPIDI_Acceptq *next;
+}
+MPIDI_Acceptq_t;
+
+static MPIDI_Acceptq_t * acceptq_head=0;
+static int maxAcceptQueueSize = 0;
+static int AcceptQueueSize    = 0;
+
+pthread_mutex_t rem_connlist_mutex = PTHREAD_MUTEX_INITIALIZER;
+
+/* FIXME: If dynamic processes are not supported, this file will contain
+   no code and some compilers may warn about an "empty translation unit" */
+
+/* FIXME: pg_translation is used for ? */
+typedef struct pg_translation {
+    int pg_index;    /* index of a process group (index in pg_node) */
+    int pg_rank;     /* rank in that process group */
+    pami_task_t pg_taskid;     /* PAMI task id of that rank */
+} pg_translation;
+
+
+typedef struct pg_node {
+    int  index;            /* Internal index of process group
+			      (see pg_translation) */
+    char *pg_id;
+    char *str;             /* String describing connection info for pg */
+    int   lenStr;          /* Length of this string (including the null terminator(s)) */
+    struct pg_node *next;
+} pg_node;
+
+
+void MPIDI_Recvfrom_remote_world(pami_context_t    context,
+                void            * cookie,
+                const void      * _msginfo,
+                size_t            msginfo_size,
+                const void      * sndbuf,
+                size_t            sndlen,
+                pami_endpoint_t   sender,
+                pami_recv_t     * recv)
+{
+  AM_struct        *AM_data;
+  MPID_VCR       *new_vcr;
+  int              port_name_tag;
+  MPIDI_Acceptq_t *q_item;
+  pami_endpoint_t dest;
+
+
+  q_item = MPIU_Malloc(sizeof(MPIDI_Acceptq_t));
+  q_item->vcr = MPIU_Malloc(sizeof(struct MPID_VCR_t));
+  q_item->vcr->pg = MPIU_Malloc(sizeof(MPIDI_PG_t));
+  MPIU_Object_set_ref(q_item->vcr->pg, 0);
+  TRACE_ERR("ENTER MPIDI_Acceptq_enqueue-1 q_item=%llx _msginfo=%llx (AM_struct *)_msginfo=%llx ((AM_struct *)_msginfo)->vcr=%llx\n", q_item, _msginfo, (AM_struct *)_msginfo, ((AM_struct *)_msginfo)->vcr);
+  q_item->port_name_tag = ((AM_struct *)_msginfo)->port_name_tag;
+  q_item->vcr->taskid = PAMIX_Endpoint_query(sender);
+  TRACE_ERR("MPIDI_Recvfrom_remote_world INVOKED with new_vcr->taskid=%d\n",sender);
+
+  /* Keep some statistics on the accept queue */
+  AcceptQueueSize++;
+  if (AcceptQueueSize > maxAcceptQueueSize)
+    maxAcceptQueueSize = AcceptQueueSize;
+
+  q_item->next = acceptq_head;
+  acceptq_head = q_item;
+  return;
+}
+
+
+/* These functions help implement the connect/accept algorithm */
+static int MPIDI_ExtractLocalPGInfo( struct MPID_Comm *, pg_translation [],
+			       pg_node **, int * );
+static int MPIDI_ReceivePGAndDistribute( struct MPID_Comm *, struct MPID_Comm *, int, int *,
+				   int, MPIDI_PG_t *[] );
+static int MPIDI_SendPGtoPeerAndFree( struct MPID_Comm *, int *, pg_node * );
+static int MPIDI_SetupNewIntercomm( struct MPID_Comm *comm_ptr, int remote_comm_size,
+			      pg_translation remote_translation[],
+			      int n_remote_pgs, MPIDI_PG_t **remote_pg,
+			      struct MPID_Comm *intercomm );
+static int MPIDI_Initialize_tmp_comm(struct MPID_Comm **comm_pptr,
+					  struct MPID_VCR_t *vcr_ptr, int is_low_group, int context_id_offset);
+
+
+/* ------------------------------------------------------------------------- */
+/*
+ * Structure of this file and the connect/accept algorithm:
+ *
+ * Here are the steps involved in implementing MPI_Comm_connect and
+ * MPI_Comm_accept.  These same steps are used within MPI_Comm_spawn
+ * and MPI_Comm_spawn_multiple.
+ *
+ * First, the connecting process establishes a connection (not a virtual
+ * connection!) to the designated accepting process.
+ * This makes use of the usual (channel-specific) connection code.
+ * Once this connection is established, the connecting process sends a packet
+ * to the accepting process.
+ * This packet contains a "port_tag_name", which is a value that
+ * is used to separate different MPI port names (values from MPI_Open_port)
+ * on the same process (this is a way to multiplex many MPI port names on
+ * a single communication connection port).
+ *
+ * On the accepting side, the process waits until the progress engine
+ * inserts the connect request into the accept queue (this is done with the
+ * routine MPIDI_Acceptq_dequeue).  This routine returns the matched
+ * virtual connection (VC).
+ *
+ * Once both sides have established their VCs, they both invoke
+ * MPIDI_Initialize_tmp_comm to create a temporary intercommunicator.
+ * A temporary intercommunicator is constructed so that we can use
+ * MPI routines to send the other information that we need to complete
+ * the connect/accept operation (described below).
+ *
+ * The above is implemented with the routines
+ *   MPIDI_Create_inter_root_communicator_connect
+ *   MPIDI_Create_inter_root_communicator_accept
+ *   MPIDI_Initialize_tmp_comm
+ *
+ * At this point, the two "root" processes of the communicators that are
+ * connecting can use MPI communication.  They must then exchange the
+ * following information:
+ *
+ *    The size of the "remote" communicator
+ *    Description of all process groups; that is, all of the MPI_COMM_WORLDs
+ *    that they know.
+ *    The shared context id that will be used
+ *
+ *
+ */
+/* ------------------------------------------------------------------------- */
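The port_name_tag multiplexing described above comes down to a tag-matched removal from the accept queue: connect requests are pushed LIFO as they arrive, and the accept side pulls the first entry whose tag matches its open port. A standalone sketch with illustrative types (not MPICH's MPIDI_Acceptq_t):

```c
#include <stdlib.h>

/* One pending connect request; port_name_tag says which MPI_Open_port
 * value on this process the request is aimed at. */
struct acceptq_entry {
    int port_name_tag;
    struct acceptq_entry *next;
};

/* Unlink and return the first entry matching `tag`, or NULL if no
 * connect for that port has been enqueued yet (the caller would then
 * poke the progress engine and retry, as the accept loop above does). */
static struct acceptq_entry *acceptq_dequeue(struct acceptq_entry **head,
                                             int tag)
{
    struct acceptq_entry **pp = head;
    while (*pp) {
        if ((*pp)->port_name_tag == tag) {
            struct acceptq_entry *hit = *pp;
            *pp = hit->next;      /* unlink; other ports' entries stay queued */
            return hit;
        }
        pp = &(*pp)->next;
    }
    return NULL;
}
```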
+
+int MPIDU_send_AM_to_leader(MPID_VCR new_vcr, int port_name_tag, pami_task_t taskid)
+{
+   pami_send_t xferP;
+   pami_endpoint_t dest;
+   int              rc, current_val;
+
+   AM_struct        AM_data;
+
+   AM_data.vcr = new_vcr;
+   TRACE_ERR("MPIDU_send_AM_to_leader - new_vcr->taskid=%d\n", new_vcr->taskid);
+   AM_data.port_name_tag = port_name_tag;
+   TRACE_ERR("send - %p %d %p %d\n", AM_data.vcr, AM_data.port_name_tag, AM_data.vcr, AM_data.vcr->taskid);
+
+
+   bzero(&xferP, sizeof(pami_send_t));
+   xferP.send.header.iov_base = (void*)&AM_data;
+   xferP.send.header.iov_len  = sizeof(AM_struct);
+   xferP.send.dispatch = MPIDI_Protocols_Dyntask;
+   /*xferP.hints.use_rdma  = mpci_enviro.use_shmem;
+   xferP.hints.use_shmem = mpci_enviro.use_shmem;*/
+   rc = PAMI_Endpoint_create(MPIDI_Client, taskid, 0, &dest);
+   TRACE_ERR("PAMI_Resume to taskid=%d\n", taskid);
+	PAMI_Resume(MPIDI_Context[0],
+                    &dest, 1);
+
+   if(rc != 0)
+     TRACE_ERR("PAMI_Endpoint_create failed\n");
+
+   xferP.send.dest = dest;
+
+   rc = PAMI_Send(MPIDI_Context[0], &xferP);
+
+   return rc;
+}
+
+
+/*
+ * These next two routines are used to create a virtual connection
+ * (VC) and a temporary intercommunicator that can be used to
+ * communicate between the two "root" processes for the
+ * connect and accept.
+ */
+
+/* FIXME: Describe the algorithm for the connection logic */
+int MPIDI_Connect_to_root(const char * port_name,
+                          MPID_VCR * new_vc)
+{
+    int mpi_errno = MPI_SUCCESS;
+    MPID_VCR vc;
+    char host_description[MAX_HOST_DESCRIPTION_LEN];
+    int port, port_name_tag; pami_task_t taskid_tag;
+    int hasIfaddr = 0;
+    AM_struct *conn;
+
+    /* First, create a new vc (we may use this to pass to a generic
+       connection routine) */
+    vc = MPIU_Malloc(sizeof(struct MPID_VCR_t));
+    vc->pg = MPIU_Malloc(sizeof(MPIDI_PG_t));
+    MPIU_Object_set_ref(vc->pg, 0);
+    TRACE_ERR("vc from MPIDI_Connect_to_root=%llx vc->pg=%llx\n", vc, vc->pg);
+    /* FIXME - where does this vc get freed? */
+
+    *new_vc = vc;
+
+    /* FIXME: There may need to be an additional routine here, to ensure that the
+       channel is initialized for this pair of process groups (this process
+       and the remote process to which the vc will connect). */
+/*    MPIDI_VC_Init(vc->vc, NULL, 0); */
+    vc->taskid = PAMIX_Client_query(MPIDI_Client, PAMI_CLIENT_TASK_ID  ).value.intval;
+    TRACE_ERR("MPIDI_Connect_to_root - vc->taskid=%d\n", vc->taskid);
+
+    mpi_errno = MPIDI_GetTagFromPort(port_name, &port_name_tag);
+    if (mpi_errno != MPIU_STR_SUCCESS) {
+      TRACE_ERR("MPIDI_GetTagFromPort returned with mpi_errno=%d", mpi_errno);
+    }
+    mpi_errno = MPIDI_GetTaskidFromPort(port_name, &taskid_tag);
+    if (mpi_errno != MPIU_STR_SUCCESS) {
+      TRACE_ERR("MPIDI_GetTaskidFromPort returned with mpi_errno=%d", mpi_errno);
+    }
+
+    TRACE_ERR("posting connect to host %s, port %d task %d vc %p\n",
+	host_description, port, taskid_tag, vc );
+    mpi_errno = MPIDU_send_AM_to_leader(vc, port_name_tag, taskid_tag);
+
+ fn_exit:
+    return mpi_errno;
+ fn_fail:
+    goto fn_exit;
+}
+
+
+/* ------------------------------------------------------------------------- */
+/* Business card management.  These routines insert or extract connection
+   information when using sockets from the business card */
+/* ------------------------------------------------------------------------- */
+
+/* FIXME: These are small routines; we may want to bring them together
+   into a more specific post-connection-for-sock */
+
+/* The host_description should be of length MAX_HOST_DESCRIPTION_LEN */
+
+
+static int MPIDI_Create_inter_root_communicator_connect(const char *port_name,
+							struct MPID_Comm **comm_pptr,
+							MPID_VCR *vc_pptr)
+{
+    int mpi_errno = MPI_SUCCESS;
+    struct MPID_Comm *tmp_comm;
+    struct MPID_VCR_t *connect_vc= NULL;
+    int port_name_tag, taskid_tag;
+    /* Connect to the root on the other side. Create a
+       temporary intercommunicator between the two roots so that
+       we can use MPI functions to communicate data between them. */
+
+    MPIDI_Connect_to_root(port_name, &(connect_vc));
+    if (mpi_errno != MPI_SUCCESS) {
+	TRACE_ERR("MPIDI_Connect_to_root returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    /* extract the tag from the port_name */
+    mpi_errno = MPIDI_GetTagFromPort( port_name, &port_name_tag);
+    if (mpi_errno != MPIU_STR_SUCCESS) {
+	TRACE_ERR("MPIDI_GetTagFromPort returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    mpi_errno = MPIDI_GetTaskidFromPort(port_name, &taskid_tag);
+    if (mpi_errno != MPIU_STR_SUCCESS) {
+	TRACE_ERR("MPIDI_GetTaskidFromPort returned with mpi_errno=%d\n", mpi_errno);
+    }
+    connect_vc->taskid=taskid_tag;
+    mpi_errno = MPIDI_Initialize_tmp_comm(&tmp_comm, connect_vc, 1, port_name_tag);
+    if (mpi_errno != MPI_SUCCESS) {
+	TRACE_ERR("MPIDI_Initialize_tmp_comm returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    *comm_pptr = tmp_comm;
+    *vc_pptr = connect_vc;
+
+ fn_exit:
+    return mpi_errno;
+ fn_fail:
+    goto fn_exit;
+}
+
+/* Creates a communicator for the purpose of communicating with one other
+   process (the root of the other group).  It also returns the virtual
+   connection */
+static int MPIDI_Create_inter_root_communicator_accept(const char *port_name,
+						struct MPID_Comm **comm_pptr,
+						MPID_VCR *vc_pptr)
+{
+    int mpi_errno = MPI_SUCCESS;
+    struct MPID_Comm *tmp_comm;
+    MPID_VCR new_vc;
+
+    MPID_Progress_state progress_state;
+    int port_name_tag;
+
+    /* extract the tag from the port_name */
+    mpi_errno = MPIDI_GetTagFromPort( port_name, &port_name_tag);
+    if (mpi_errno != MPIU_STR_SUCCESS) {
+	TRACE_ERR("MPIDI_GetTagFromPort returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    /* FIXME: Describe the algorithm used here, and what routine
+       is used on the other side of this connection */
+    /* dequeue the accept queue to see if a connection with the
+       root on the connect side has been formed in the progress
+       engine (the connection is returned in the form of a vc). If
+       not, poke the progress engine. */
+
+    for(;;)
+    {
+	MPIDI_Acceptq_dequeue(&new_vc, port_name_tag);
+	if (new_vc != NULL)
+	{
+	    break;
+	}
+
+	mpi_errno = MPID_Progress_wait(100);
+	/* --BEGIN ERROR HANDLING-- */
+	if (mpi_errno)
+	{
+	    TRACE_ERR("MPID_Progress_wait returned with mpi_errno=%d\n", mpi_errno);
+	}
+	/* --END ERROR HANDLING-- */
+    }
+
+    mpi_errno = MPIDI_Initialize_tmp_comm(&tmp_comm, new_vc, 0, port_name_tag);
+    if (mpi_errno != MPI_SUCCESS) {
+	TRACE_ERR("MPIDI_Initialize_tmp_comm returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    *comm_pptr = tmp_comm;
+    *vc_pptr = new_vc;
+
+fn_exit:
+    return mpi_errno;
+fn_fail:
+    goto fn_exit;
+}
+
+/* This is a utility routine used to initialize temporary communicators
+   used in connect/accept operations, and is only used in the above two
+   routines */
+static int MPIDI_Initialize_tmp_comm(struct MPID_Comm **comm_pptr,
+					  struct MPID_VCR_t *vc_ptr, int is_low_group, int context_id_offset)
+{
+    int mpi_errno = MPI_SUCCESS;
+    struct MPID_Comm *tmp_comm, *commself_ptr;
+
+    MPID_Comm_get_ptr( MPI_COMM_SELF, commself_ptr );
+
+    /* WDG-old code allocated a context id that was then discarded */
+    mpi_errno = MPIR_Comm_create(&tmp_comm);
+    if (mpi_errno != MPI_SUCCESS) {
+	TRACE_ERR("MPIR_Comm_create returned with mpi_errno=%d\n", mpi_errno);
+    }
+    /* fill in all the fields of tmp_comm. */
+
+    /* We use the second half of the context ID bits for dynamic
+     * processes. This assumes that the context ID mask array is made
+     * up of uint32_t's. */
+    /* FIXME: This code is still broken for the following case:
+     * If the same process opens connections to the multiple
+     * processes, this context ID might get out of sync.
+     */
+    tmp_comm->context_id     = MPID_CONTEXT_SET_FIELD(DYNAMIC_PROC, context_id_offset, 1);
+    tmp_comm->recvcontext_id = tmp_comm->context_id;
+
+    /* sanity: the INVALID context ID value could potentially conflict with the
+     * dynamic process space */
+    MPIU_Assert(tmp_comm->context_id     != MPIR_INVALID_CONTEXT_ID);
+    MPIU_Assert(tmp_comm->recvcontext_id != MPIR_INVALID_CONTEXT_ID);
+
+    /* FIXME - we probably need a unique context_id. */
+    tmp_comm->remote_size = 1;
+
+    /* Fill in new intercomm */
+    tmp_comm->local_size   = 1;
+    tmp_comm->rank         = 0;
+    tmp_comm->comm_kind    = MPID_INTERCOMM;
+    tmp_comm->local_comm   = NULL;
+    tmp_comm->is_low_group = is_low_group;
+
+    /* No pg structure needed since vc has already been set up
+       (connection has been established). */
+
+    /* Point local vcr, vcrt at those of commself_ptr */
+    /* FIXME: Explain why */
+    tmp_comm->local_vcrt = commself_ptr->vcrt;
+    MPID_VCRT_Add_ref(commself_ptr->vcrt);
+    tmp_comm->local_vcr  = commself_ptr->vcr;
+
+    /* No pg needed since connection has already been formed.
+       FIXME - ensure that the comm_release code does not try to
+       free an unallocated pg */
+
+    /* Set up VC reference table */
+    mpi_errno = MPID_VCRT_Create(tmp_comm->remote_size, &tmp_comm->vcrt);
+    if (mpi_errno != MPI_SUCCESS) {
+	TRACE_ERR("MPID_VCRT_Create returned with mpi_errno=%d", mpi_errno);
+    }
+    mpi_errno = MPID_VCRT_Get_ptr(tmp_comm->vcrt, &tmp_comm->vcr);
+    if (mpi_errno != MPI_SUCCESS) {
+	TRACE_ERR("MPID_VCRT_Get_ptr returned with mpi_errno=%d", mpi_errno);
+    }
+
+    /* FIXME: Why do we do a dup here? */
+    MPID_VCR_Dup(vc_ptr, tmp_comm->vcr);
+
+    *comm_pptr = tmp_comm;
+
+fn_exit:
+    return mpi_errno;
+fn_fail:
+    goto fn_exit;
+}
+
+
+/* ------------------------------------------------------------------------- */
+/*
+   MPIDI_Comm_connect()
+
+   Algorithm: First create a connection (vc) between this root and the
+   root on the accept side. Using this vc, create a temporary
+   intercomm between the two roots. Use MPI functions to communicate
+   the other information needed to create the real intercommunicator
+   between the processes on the two sides. Then free the
+   intercommunicator between the roots. Most of the complexity is
+   because there can be multiple process groups on each side.
+*/
+int MPIDI_Comm_connect(const char *port_name, MPID_Info *info, int root,
+		       struct MPID_Comm *comm_ptr, struct MPID_Comm **newcomm)
+{
+    int mpi_errno=MPI_SUCCESS;
+    int j, i, rank, recv_ints[3], send_ints[3], context_id;
+    int remote_comm_size=0;
+    struct MPID_Comm *tmp_comm = NULL;
+    MPID_VCR new_vc= NULL;
+    int sendtag=100, recvtag=100, n_remote_pgs;
+    int n_local_pgs=1, local_comm_size;
+    pg_translation *local_translation = NULL, *remote_translation = NULL;
+    pg_node *pg_list = NULL;
+    MPIDI_PG_t **remote_pg = NULL;
+    MPIR_Context_id_t recvcontext_id = MPIR_INVALID_CONTEXT_ID;
+    int errflag = FALSE;
+    MPIU_CHKLMEM_DECL(3);
+
+    /* Get the context ID here because we need to send it to the remote side */
+    mpi_errno = MPIR_Get_contextid( comm_ptr, &recvcontext_id );
+    TRACE_ERR("MPIDI_Comm_connect calling MPIR_Get_contextid = %d\n", recvcontext_id);
+    if (mpi_errno) TRACE_ERR("MPIR_Get_contextid returned with mpi_errno=%d\n", mpi_errno);
+
+    rank = comm_ptr->rank;
+    local_comm_size = comm_ptr->local_size;
+    TRACE_ERR("In MPIDI_Comm_connect - port_name=%s rank=%d root=%d\n", port_name, rank, root);
+
+    if (rank == root)
+    {
+	/* Establish a communicator to communicate with the root on the
+	   other side. */
+	mpi_errno = MPIDI_Create_inter_root_communicator_connect(
+	    port_name, &tmp_comm, &new_vc);
+	if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("MPIDI_Create_inter_root_communicator_connect returned mpi_errno=%d\n", mpi_errno);
+	}
+	TRACE_ERR("after MPIDI_Create_inter_root_communicator_connect - tmp_comm=%p  new_vc=%p mpi_errno=%d\n", tmp_comm, new_vc, mpi_errno);
+
+	/* Make an array to translate local ranks to process group index
+	   and rank */
+	local_translation = MPIU_Malloc(local_comm_size*sizeof(pg_translation));
+/*	MPIU_CHKLMEM_MALLOC(local_translation,pg_translation*,
+			    local_comm_size*sizeof(pg_translation),
+			    mpi_errno,"local_translation"); */
+
+	/* Make a list of the local communicator's process groups and encode
+	   them in strings to be sent to the other side.
+	   The encoded string for each process group contains the process
+	   group id, size and all its KVS values */
+	mpi_errno = MPIDI_ExtractLocalPGInfo( comm_ptr, local_translation,
+					&pg_list, &n_local_pgs );
+
+	/* Send the remote root: n_local_pgs, local_comm_size,
+           Recv from the remote root: n_remote_pgs, remote_comm_size,
+           recvcontext_id for newcomm */
+
+        send_ints[0] = n_local_pgs;
+        send_ints[1] = local_comm_size;
+        send_ints[2] = recvcontext_id;
+
+	TRACE_ERR("connect:sending 3 ints, %d, %d, %d, and receiving 2 ints with sendtag=%d recvtag=%d\n", send_ints[0], send_ints[1], send_ints[2], sendtag, recvtag);
+        mpi_errno = MPIC_Sendrecv(send_ints, 3, MPI_INT, 0,
+                                  sendtag++, recv_ints, 3, MPI_INT,
+                                  0, recvtag++, tmp_comm->handle,
+                                  MPI_STATUS_IGNORE);
+        if (mpi_errno != MPI_SUCCESS) {
+            /* this is a no_port error because we may fail to connect
+               on the send if the port name is invalid */
+	    TRACE_ERR("MPIC_Sendrecv returned with mpi_errno=%d\n", mpi_errno);
+	}
+    }
+
+    /* broadcast the received info to local processes */
+    mpi_errno = MPIR_Bcast_intra(recv_ints, 3, MPI_INT, root, comm_ptr, &errflag);
+    if (mpi_errno) TRACE_ERR("MPIR_Bcast_intra returned with mpi_errno=%d\n", mpi_errno);
+
+    /* check if root was unable to connect to the port */
+
+    n_remote_pgs     = recv_ints[0];
+    remote_comm_size = recv_ints[1];
+    context_id	     = recv_ints[2];
+
+   TRACE_ERR("MPIDI_Comm_connect - n_remote_pgs=%d remote_comm_size=%d context_id=%d\n", n_remote_pgs,
+	remote_comm_size, context_id);
+    remote_pg = MPIU_Malloc(n_remote_pgs * sizeof(MPIDI_PG_t*));
+    remote_translation = MPIU_Malloc(remote_comm_size * sizeof(pg_translation));
+    /* Exchange the process groups and their corresponding KVSes */
+    if (rank == root)
+    {
+	mpi_errno = MPIDI_SendPGtoPeerAndFree( tmp_comm, &sendtag, pg_list );
+	mpi_errno = MPIDI_ReceivePGAndDistribute( tmp_comm, comm_ptr, root, &recvtag,
+					n_remote_pgs, remote_pg );
+	/* Receive the translations from remote process rank to process group
+	   index */
+	mpi_errno = MPIC_Sendrecv(local_translation, local_comm_size * 3,
+				  MPI_INT, 0, sendtag++,
+				  remote_translation, remote_comm_size * 3,
+				  MPI_INT, 0, recvtag++, tmp_comm->handle,
+				  MPI_STATUS_IGNORE);
+	if (mpi_errno) {
+	    TRACE_ERR("MPIC_Sendrecv returned with mpi_errno=%d\n", mpi_errno);
+	}
+
+	for (i=0; i<remote_comm_size; i++)
+	{
+	    TRACE_ERR(" remote_translation[%d].pg_index = %d\n remote_translation[%d].pg_rank = %d\n",
+		i, remote_translation[i].pg_index, i, remote_translation[i].pg_rank);
+	}
+    }
+    else
+    {
+	mpi_errno = MPIDI_ReceivePGAndDistribute( tmp_comm, comm_ptr, root, &recvtag,
+					    n_remote_pgs, remote_pg );
+    }
+
+    /* Broadcast out the remote rank translation array */
+    mpi_errno = MPIR_Bcast_intra(remote_translation, remote_comm_size * 3, MPI_INT,
+                                 root, comm_ptr, &errflag);
+    if (mpi_errno) TRACE_ERR("MPIR_Bcast_intra returned with mpi_errno=%d\n", mpi_errno);
+
+    char *pginfo = MPIU_Malloc(256*sizeof(char));
+    memset(pginfo, 0, 256);
+    char cp[20];
+    for (i=0; i<remote_comm_size; i++)
+    {
+	TRACE_ERR(" remote_translation[%d].pg_index = %d remote_translation[%d].pg_rank = %d remote_translation[%d].pg_taskid=%d\n",
+	    i, remote_translation[i].pg_index, i, remote_translation[i].pg_rank, i, remote_translation[i].pg_taskid);
+	TRACE_ERR("remote_pg[remote_translation[%d].pg_index]->id=%s\n", i, (char *)(remote_pg[remote_translation[i].pg_index]->id));
+	strcat(pginfo, (char *)(remote_pg[remote_translation[i].pg_index]->id));
+	sprintf(cp, ":%d ", remote_translation[i].pg_taskid);
+	strcat(pginfo, cp);
+    }
+    pginfo[strlen(pginfo)]='\0';
+    TRACE_ERR("connection info %s\n", pginfo);
+    /*MPIDI_Parse_connection_info(n_remote_pgs, remote_pg);*/
+    MPIU_Free(pginfo);
+
+    mpi_errno = MPIR_Comm_create(newcomm);
+    if (mpi_errno) TRACE_ERR("MPIR_Comm_create returned with mpi_errno=%d\n", mpi_errno);
+
+    (*newcomm)->context_id     = context_id;
+    (*newcomm)->recvcontext_id = recvcontext_id;
+    (*newcomm)->is_low_group   = 1;
+
+    mpi_errno = MPIDI_SetupNewIntercomm( comm_ptr, remote_comm_size,
+				   remote_translation, n_remote_pgs, remote_pg, *newcomm );
+/*    MPIDI_Parse_connection_info(n_remote_pgs, remote_pg); */
+    if (mpi_errno != MPI_SUCCESS) {
+	TRACE_ERR("MPIDI_SetupNewIntercomm returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    /* synchronize with remote root */
+    if (rank == root)
+    {
+        mpi_errno = MPIC_Sendrecv(&i, 0, MPI_INT, 0,
+                                  sendtag++, &j, 0, MPI_INT,
+                                  0, recvtag++, tmp_comm->handle,
+                                  MPI_STATUS_IGNORE);
+        if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("MPIC_Sendrecv returned with mpi_errno=%d\n", mpi_errno);
+        }
+
+        /* All communication with remote root done. Release the communicator. */
+        MPIR_Comm_release(tmp_comm,0);
+    }
+
+    TRACE_ERR("connect:barrier\n");
+    mpi_errno = MPIR_Barrier_intra(comm_ptr, &errflag);
+    if (mpi_errno != MPI_SUCCESS) {
+	TRACE_ERR("MPIR_Barrier_intra returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    TRACE_ERR("connect:free new vc\n");
+    /* Free new_vc. It was explicitly allocated in MPIDI_Connect_to_root.*/
+    if (rank == root) {
+	MPIU_Free( new_vc);
+    }
+
+fn_exit:
+    return mpi_errno;
+
+fn_fail:
+    goto fn_exit;
+}
+
+/*
+ * Extract all of the process groups from the given communicator and
+ * form a list (returned in pg_list) of those process groups.
+ * Also returned is an array (local_translation) that maps each rank in
+ * the communicator to its process group index and its rank within that
+ * process group (local_translation must be allocated before this routine
+ * is called).  The number of distinct process groups is returned in
+ * n_local_pgs_p.
+ *
+ * This allows an intercomm_create to exchange the full description of
+ * all of the process groups that have made up the communicator that
+ * will define the "remote group".
+ */
+static int MPIDI_ExtractLocalPGInfo( struct MPID_Comm *comm_p,
+			       pg_translation local_translation[],
+			       pg_node **pg_list_p,
+			       int *n_local_pgs_p )
+{
+    pg_node        *pg_list = 0, *pg_iter, *pg_trailer;
+    int            i, cur_index = 0, local_comm_size, mpi_errno = 0;
+    char           *pg_id;
+
+    local_comm_size = comm_p->local_size;
+
+    /* Make a list of the local communicator's process groups and encode
+       them in strings to be sent to the other side.
+       The encoded string for each process group contains the process
+       group id, size and all its KVS values */
+
+    cur_index = 0;
+    pg_list = MPIU_Malloc(sizeof(pg_node));
+
+    pg_list->pg_id = MPIU_Strdup(comm_p->vcr[0]->pg->id);
+    pg_list->index = cur_index++;
+    pg_list->next = NULL;
+    /* XXX DJG FIXME-MT should we be checking this?  the add/release macros already check this */
+    TRACE_ERR("MPIU_Object_get_ref(comm_p->vcr[0]->pg) comm_p=%x vsr=%x pg=%x %d\n", comm_p, comm_p->vcr[0], comm_p->vcr[0]->pg, MPIU_Object_get_ref(comm_p->vcr[0]->pg));
+    MPIU_Assert( MPIU_Object_get_ref(comm_p->vcr[0]->pg));
+    mpi_errno = MPIDI_PG_To_string(comm_p->vcr[0]->pg, &pg_list->str,
+				   &pg_list->lenStr );
+    TRACE_ERR("pg_list->str=%s pg_list->lenStr=%d\n", pg_list->str, pg_list->lenStr);
+    if (mpi_errno != MPI_SUCCESS) {
+	TRACE_ERR("MPIDI_PG_To_string returned with mpi_errno=%d\n", mpi_errno);
+    }
+    TRACE_ERR("PG as string is %s\n", pg_list->str );
+    local_translation[0].pg_index = 0;
+    local_translation[0].pg_rank = comm_p->vcr[0]->pg_rank;
+    local_translation[0].pg_taskid = comm_p->vcr[0]->taskid;
+    TRACE_ERR("local_translation[0].pg_index=%d local_translation[0].pg_rank=%d\n", local_translation[0].pg_index, local_translation[0].pg_rank);
+    pg_iter = pg_list;
+    for (i=1; i<local_comm_size; i++) {
+	pg_iter = pg_list;
+	pg_trailer = pg_list;
+	while (pg_iter != NULL) {
+	    /* Check to ensure pg is (probably) valid */
+            /* XXX DJG FIXME-MT should we be checking this?  the add/release macros already check this */
+	    MPIU_Assert(MPIU_Object_get_ref(comm_p->vcr[i]->pg) != 0);
+	    if (MPIDI_PG_Id_compare(comm_p->vcr[i]->pg->id, pg_iter->pg_id)) {
+		local_translation[i].pg_index = pg_iter->index;
+		local_translation[i].pg_rank  = comm_p->vcr[i]->pg_rank;
+		local_translation[i].pg_taskid  = comm_p->vcr[i]->taskid;
+                TRACE_ERR("local_translation[%d].pg_index=%d local_translation[%d].pg_rank=%d\n", i, local_translation[i].pg_index, i,local_translation[i].pg_rank);
+		break;
+	    }
+	    if (pg_trailer != pg_iter)
+		pg_trailer = pg_trailer->next;
+	    pg_iter = pg_iter->next;
+	}
+	if (pg_iter == NULL) {
+	    /* We use MPIU_Malloc directly because we do not know in
+	       advance how many nodes we may allocate */
+	    pg_iter = (pg_node*)MPIU_Malloc(sizeof(pg_node));
+	    pg_iter->pg_id = MPIU_Strdup(comm_p->vcr[i]->pg->id);
+	    pg_iter->index = cur_index++;
+	    pg_iter->next = NULL;
+	    mpi_errno = MPIDI_PG_To_string(comm_p->vcr[i]->pg, &pg_iter->str,
+					   &pg_iter->lenStr );
+
+            TRACE_ERR("cur_index=%d pg_iter->str=%s pg_iter->lenStr=%d\n", cur_index, pg_iter->str, pg_iter->lenStr);
+	    if (mpi_errno != MPI_SUCCESS) {
+		TRACE_ERR("MPIDI_PG_To_string returned with mpi_errno=%d\n", mpi_errno);
+	    }
+	    local_translation[i].pg_index = pg_iter->index;
+	    local_translation[i].pg_rank = comm_p->vcr[i]->pg_rank;
+	    local_translation[i].pg_taskid = comm_p->vcr[i]->taskid;
+	    pg_trailer->next = pg_iter;
+	}
+    }
+
+    *n_local_pgs_p = cur_index;
+    *pg_list_p     = pg_list;
+
+ fn_exit:
+    return mpi_errno;
+ fn_fail:
+    goto fn_exit;
+}
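MPIDI_ExtractLocalPGInfo above walks the communicator's VCs, keeping one pg_node per distinct process-group id and recording each rank's (group index, group rank) pair. The same dedup-with-translation idea can be sketched in plain, self-contained C; `entry` and `lookup_or_add` are illustrative names for this sketch, not MPICH types:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for pg_node: one node per distinct group id. */
typedef struct entry { char *id; int index; struct entry *next; } entry;

/* Return the list index for 'id', appending a new node when unseen.
   '*n' tracks the number of distinct ids (like cur_index above). */
static int lookup_or_add(entry **head, const char *id, int *n)
{
    entry *it, *tail = NULL;
    for (it = *head; it != NULL; tail = it, it = it->next)
        if (strcmp(it->id, id) == 0)
            return it->index;            /* group already known */
    it = malloc(sizeof(*it));
    it->id = malloc(strlen(id) + 1);
    strcpy(it->id, id);
    it->index = (*n)++;                  /* next free index */
    it->next = NULL;
    if (tail) tail->next = it; else *head = it;
    return it->index;
}
```

A translation array is then filled rank by rank with `translation[i] = lookup_or_add(&list, id_of_rank(i), &n);`, which mirrors how local_translation[i].pg_index is assigned in the loop above.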
+
+
+
+/* The root process in comm_ptr receives strings describing the
+   process groups and then distributes them to the other processes
+   in comm_ptr.
+   See MPIDI_SendPGtoPeerAndFree for the routine that sends the descriptions */
+static int MPIDI_ReceivePGAndDistribute( struct MPID_Comm *tmp_comm, struct MPID_Comm *comm_ptr,
+				   int root, int *recvtag_p,
+				   int n_remote_pgs, MPIDI_PG_t *remote_pg[] )
+{
+    char *pg_str = 0;
+    char *pginfo = 0;
+    int  i, j, flag;
+    int  rank = comm_ptr->rank;
+    int  mpi_errno = 0;
+    int  recvtag = *recvtag_p;
+    int errflag = FALSE;
+
+    TRACE_ERR("MPIDI_ReceivePGAndDistribute - n_remote_pgs=%d\n", n_remote_pgs);
+    for (i=0; i<n_remote_pgs; i++) {
+
+	if (rank == root) {
+	    /* First, receive the pg description from the partner */
+	    mpi_errno = MPIC_Recv(&j, 1, MPI_INT, 0, recvtag++,
+				  tmp_comm->handle, MPI_STATUS_IGNORE);
+	    *recvtag_p = recvtag;
+	    if (mpi_errno != MPI_SUCCESS) {
+		TRACE_ERR("MPIC_Recv returned with mpi_errno=%d\n", mpi_errno);
+	    }
+	    pg_str = (char*)MPIU_Malloc(j);
+	    mpi_errno = MPIC_Recv(pg_str, j, MPI_CHAR, 0, recvtag++,
+				  tmp_comm->handle, MPI_STATUS_IGNORE);
+	    *recvtag_p = recvtag;
+	    if (mpi_errno != MPI_SUCCESS) {
+		TRACE_ERR("MPIC_Recv returned with mpi_errno=%d\n", mpi_errno);
+	    }
+	}
+
+	/* Broadcast the size and data to the local communicator */
+	TRACE_ERR("accept:broadcasting 1 int\n");
+	mpi_errno = MPIR_Bcast_intra(&j, 1, MPI_INT, root, comm_ptr, &errflag);
+	if (mpi_errno != MPI_SUCCESS) TRACE_ERR("MPIR_Bcast_intra returned with mpi_errno=%d\n", mpi_errno);
+
+	if (rank != root) {
+	    /* The root has already allocated this string */
+	    pg_str = (char*)MPIU_Malloc(j);
+	}
+	TRACE_ERR("accept:broadcasting string of length %d\n", j);
+	pg_str[j-1]='\0';
+	mpi_errno = MPIR_Bcast_intra(pg_str, j, MPI_CHAR, root, comm_ptr, &errflag);
+	if (mpi_errno != MPI_SUCCESS)
+           TRACE_ERR("MPIR_Bcast_intra returned with mpi_errno=%d\n", mpi_errno);
+	/* Then reconstruct the received process group.  This step
+	   also initializes the created process group */
+
+	TRACE_ERR("Adding connection information - pg_str=%s\n", pg_str);
+	TRACE_ERR("Creating pg from string %s\n", pg_str);
+	mpi_errno = MPIDI_PG_Create_from_string(pg_str, &remote_pg[i], &flag);
+        TRACE_ERR("remote_pg[%d]->id=%s\n", i, (char*)(remote_pg[i]->id));
+	if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("MPIDI_PG_Create_from_string returned with mpi_errno=%d\n", mpi_errno);
+	}
+
+	MPIU_Free(pg_str);
+    }
+    /*MPIDI_Parse_connection_info(pg_str); */
+ fn_exit:
+    return mpi_errno;
+ fn_fail:
+    goto fn_exit;
+}
+
+
+/**
+ * This routine adds the remote world (wid) to the local list of known
+ * worlds if there is no record of it yet, or increments the reference
+ * count associated with world (wid) if it is already known
+ */
+void MPIDI_Parse_connection_info(int n_remote_pgs, MPIDI_PG_t **remote_pg) {
+  int i, p;
+  pami_task_t *taskids;
+  int mpi_errno = MPI_SUCCESS;
+  MPIDI_PG_t *existing_pg;
+
+  for(p=0; p<n_remote_pgs; p++) {
+        TRACE_ERR("call MPIDI_PG_Find to find %s\n", (char*)(remote_pg[p]->id));
+        mpi_errno = MPIDI_PG_Find(remote_pg[p]->id, &existing_pg);
+        if (mpi_errno) TRACE_ERR("MPIDI_PG_Find failed\n");
+
+         if (existing_pg != NULL) {
+	  taskids = MPIU_Malloc((existing_pg->size)*sizeof(pami_task_t));
+          for(i=0; i<existing_pg->size; i++) {
+             taskids[i]=existing_pg->vct[i].taskid;
+	     TRACE_ERR("id=%s taskids[%d]=%d\n", (char*)(remote_pg[p]->id), i, taskids[i]);
+          }
+          MPIDI_Add_connection_info(atoi((char*)(remote_pg[p]->id)), existing_pg->size, taskids);
+	  MPIU_Free(taskids);
+        }
+  }
+}
+
+
+void MPIDI_Add_connection_info(int wid, int wsize, pami_task_t *taskids) {
+  int jobIdSize=64;
+  char jobId[jobIdSize];
+  int ref_count, i;
+  conn_info *tmp_node1=NULL, *tmp_node2=NULL;
+
+  TRACE_ERR("MPIDI_Add_connection_info ENTER wid=%d wsize=%d\n", wid, wsize);
+  PMI2_Job_GetId(jobId, jobIdSize);
+  if(atoi(jobId) == wid)
+	return;
+
+  /* FIXME: check the lock */
+  MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+  if(_conn_info_list == NULL) { /* Connection list is not yet created */
+    _conn_info_list = (conn_info*) MPIU_Malloc(sizeof(conn_info));
+    _conn_info_list->rem_world_id = wid;
+    _conn_info_list->ref_count    = 1;
+
+    ref_count = _conn_info_list->ref_count;
+    if(taskids != NULL) {
+      _conn_info_list->rem_taskids = MPIU_Malloc((wsize+1)*sizeof(int));
+      for(i=0;i<wsize;i++) {
+        _conn_info_list->rem_taskids[i] = taskids[i];
+      }
+      _conn_info_list->rem_taskids[i]   = -1;
+    }
+    else
+      _conn_info_list->rem_taskids = NULL;
+    _conn_info_list->next = NULL;
+  }
+  else {
+    tmp_node1 = _conn_info_list;
+    while(tmp_node1) {
+      tmp_node2 = tmp_node1;
+      if(tmp_node1->rem_world_id == wid)
+        break;
+      tmp_node1 = tmp_node1->next;
+    }
+    if(tmp_node1) {  /* Connection already exists. Increment reference count */
+      if(tmp_node1->ref_count == 0) {
+        if(taskids != NULL) {
+          tmp_node1->rem_taskids = MPIU_Malloc((wsize+1)*sizeof(int));
+          for(i=0;i<wsize;i++) {
+            tmp_node1->rem_taskids[i] = taskids[i];
+          }
+          tmp_node1->rem_taskids[i] = -1;
+        }
+        tmp_node1->rem_world_id = wid;
+      }
+      tmp_node1->ref_count++;
+      ref_count = tmp_node1->ref_count;
+    }
+    else {           /* Connection does not exist. Create a new one */
+      tmp_node2->next = (conn_info*) MPIU_Malloc(sizeof(conn_info));
+      tmp_node2 = tmp_node2->next;
+      tmp_node2->rem_world_id = wid;
+      tmp_node2->ref_count    = 1;
+      ref_count = tmp_node2->ref_count;
+      if(taskids != NULL) {
+        tmp_node2->rem_taskids = MPIU_Malloc((wsize+1)*sizeof(int));
+        for(i=0;i<wsize;i++) {
+          tmp_node2->rem_taskids[i] = taskids[i];
+        }
+        tmp_node2->rem_taskids[i] = -1;
+      }
+      else
+        tmp_node2->rem_taskids = NULL;
+      tmp_node2->next = NULL;
+    }
+  }
+  MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+
+  tmp_node1 = _conn_info_list;
+  while(tmp_node1) {
+    TRACE_ERR("REM WORLD=%d ref_count=%d\n", tmp_node1->rem_world_id, tmp_node1->ref_count);
+    tmp_node1 = tmp_node1->next;
+  }
+}
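MPIDI_Add_connection_info above either bumps the reference count of an already-known remote world or appends a new node to the singly linked list. The insert-or-increment pattern, stripped of the taskid bookkeeping and locking, can be condensed as follows (`conn_rec` and `add_world` are illustrative names for this sketch, not the real conn_info API):

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-in for conn_info: one node per remote world. */
typedef struct conn_rec { int world_id; int ref_count; struct conn_rec *next; } conn_rec;

/* Increment the count for world_id, creating the node on first sight.
   Returns the resulting reference count. */
static int add_world(conn_rec **list, int world_id)
{
    conn_rec *it = *list, *tail = NULL;
    for (; it != NULL; tail = it, it = it->next)
        if (it->world_id == world_id)
            return ++it->ref_count;      /* world already known */
    it = malloc(sizeof(*it));
    it->world_id = world_id;
    it->ref_count = 1;                   /* first reference */
    it->next = NULL;
    if (tail) tail->next = it; else *list = it;
    return 1;
}
```

The real routine additionally records the remote taskid array on first insertion and skips the local world id entirely, but the list walk and the ref_count discipline are the same.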
+
+
+/* Sends the process group information to the peer and frees the
+   pg_list */
+static int MPIDI_SendPGtoPeerAndFree( struct MPID_Comm *tmp_comm, int *sendtag_p,
+				pg_node *pg_list )
+{
+    int mpi_errno = 0;
+    int sendtag = *sendtag_p, i;
+    pg_node *pg_iter;
+
+    while (pg_list != NULL) {
+	pg_iter = pg_list;
+        i = pg_iter->lenStr;
+	TRACE_ERR("connect:sending 1 int: %d\n", i);
+	mpi_errno = MPIC_Send(&i, 1, MPI_INT, 0, sendtag++, tmp_comm->handle);
+	*sendtag_p = sendtag;
+	if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("MPIC_Send returned with mpi_errno=%d\n", mpi_errno);
+	}
+
+	TRACE_ERR("connect:sending string length %d\n", i);
+	mpi_errno = MPIC_Send(pg_iter->str, i, MPI_CHAR, 0, sendtag++,
+			      tmp_comm->handle);
+	*sendtag_p = sendtag;
+	if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("MPIC_Send returned with mpi_errno=%d\n", mpi_errno);
+	}
+
+	pg_list = pg_list->next;
+	MPIU_Free(pg_iter->str);
+	MPIU_Free(pg_iter->pg_id);
+	MPIU_Free(pg_iter);
+    }
+
+ fn_exit:
+    return mpi_errno;
+ fn_fail:
+    goto fn_exit;
+}
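MPIDI_SendPGtoPeerAndFree frames each process-group description as a length int followed by that many bytes, and MPIDI_ReceivePGAndDistribute consumes the records in the same order. Independent of MPI, that framing reduces to the following sketch (`pack_record`/`unpack_record` are hypothetical helpers for illustration):

```c
#include <assert.h>
#include <string.h>

/* Append one length-prefixed record to buf; returns bytes written. */
static size_t pack_record(char *buf, const char *str, int len)
{
    memcpy(buf, &len, sizeof(int));               /* length first */
    memcpy(buf + sizeof(int), str, (size_t)len);  /* then the bytes */
    return sizeof(int) + (size_t)len;
}

/* Read one record back; returns bytes consumed. */
static size_t unpack_record(const char *buf, char *out, int *len)
{
    memcpy(len, buf, sizeof(int));
    memcpy(out, buf + sizeof(int), (size_t)*len);
    return sizeof(int) + (size_t)*len;
}
```

In the MPI code the two memcpy halves become the paired MPIC_Send/MPIC_Recv of one MPI_INT followed by `len` MPI_CHARs, with the tags incremented in lockstep on both sides.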
+
+/* ---------------------------------------------------------------------- */
+/*
+ * MPIDI_Comm_accept()
+
+   Algorithm: First dequeue the vc from the accept queue (it was
+   enqueued by the progress engine in response to a connect request
+   from the root process that is attempting the connection on
+   the connect side). Use this vc to create an
+   intercommunicator between this root and the root on the connect
+   side. Use this intercomm. to communicate the other information
+   needed to create the real intercommunicator between the processes
+   on the two sides. Then free the intercommunicator between the
+   roots. Most of the complexity is because there can be multiple
+   process groups on each side.
+
+ */
+int MPIDI_Comm_accept(const char *port_name, MPID_Info *info, int root,
+		      struct MPID_Comm *comm_ptr, struct MPID_Comm **newcomm)
+{
+    int mpi_errno=MPI_SUCCESS;
+    int i, j, rank, recv_ints[3], send_ints[3], context_id;
+    int remote_comm_size=0;
+    struct MPID_Comm *tmp_comm = NULL, *intercomm;
+    MPID_VCR new_vc = NULL;
+    int sendtag=100, recvtag=100, local_comm_size;
+    int n_local_pgs=1, n_remote_pgs;
+    pg_translation *local_translation = NULL, *remote_translation = NULL;
+    pg_node *pg_list = NULL;
+    MPIDI_PG_t **remote_pg = NULL;
+    int errflag = FALSE;
+    char send_char[16], recv_char[16], remote_taskids[16];
+
+    /* Create the new intercommunicator here. We need to send the
+       context id to the other side. */
+    mpi_errno = MPIR_Comm_create(newcomm);
+    if (mpi_errno != MPI_SUCCESS) {
+	TRACE_ERR("MPIR_Comm_create returned with mpi_errno=%d\n", mpi_errno);
+    }
+    mpi_errno = MPIR_Get_contextid( comm_ptr, &(*newcomm)->recvcontext_id );
+    TRACE_ERR("In MPIDI_Comm_accept - MPIR_Get_contextid=%d\n", (*newcomm)->recvcontext_id);
+    if (mpi_errno) TRACE_ERR("MPIR_Get_contextid returned with mpi_errno=%d\n", mpi_errno);
+    /* FIXME why is this commented out? */
+    /*    (*newcomm)->context_id = (*newcomm)->recvcontext_id; */
+
+    rank = comm_ptr->rank;
+    local_comm_size = comm_ptr->local_size;
+
+    if (rank == root)
+    {
+	/* Establish a communicator to communicate with the root on the
+	   other side. */
+	mpi_errno = MPIDI_Create_inter_root_communicator_accept(port_name,
+						&tmp_comm, &new_vc);
+	TRACE_ERR("done MPIDI_Create_inter_root_communicator_accept mpi_errno=%d tmp_comm=%p new_vc=%p \n", mpi_errno, tmp_comm, new_vc);
+	if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("MPIDI_Create_inter_root_communicator_accept returned with mpi_errno=%d\n", mpi_errno);
+	}
+
+	/* Make an array to translate local ranks to process group index and
+	   rank */
+	local_translation = MPIU_Malloc(local_comm_size*sizeof(pg_translation));
+/*	MPIU_CHKLMEM_MALLOC(local_translation,pg_translation*,
+			    local_comm_size*sizeof(pg_translation),
+			    mpi_errno,"local_translation"); */
+
+	/* Make a list of the local communicator's process groups and encode
+	   them in strings to be sent to the other side.
+	   The encoded string for each process group contains the process
+	   group id, size and all its KVS values */
+	mpi_errno = MPIDI_ExtractLocalPGInfo( comm_ptr, local_translation,
+					&pg_list, &n_local_pgs );
+        /* Send the remote root: n_local_pgs, local_comm_size, context_id for
+	   newcomm.
+           Recv from the remote root: n_remote_pgs, remote_comm_size,
+	   context_id */
+
+        send_ints[0] = n_local_pgs;
+        send_ints[1] = local_comm_size;
+        send_ints[2] = (*newcomm)->recvcontext_id;
+
+	TRACE_ERR("accept:sending 3 ints, %d, %d, %d, and receiving 3 ints with sendtag=%d recvtag=%d\n", send_ints[0], send_ints[1], send_ints[2], sendtag, recvtag);
+        mpi_errno = MPIC_Sendrecv(send_ints, 3, MPI_INT, 0,
+                                  sendtag++, recv_ints, 3, MPI_INT,
+                                  0, recvtag++, tmp_comm->handle,
+                                  MPI_STATUS_IGNORE);
+        if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("MPIC_Sendrecv returned with mpi_errno=%d\n", mpi_errno);
+	}
+#if 0
+	send_char = pg_list->str;
+	TRACE_ERR("accept:sending 1 string %s and receiving 1 string %s\n", send_char, recv_char);
+        mpi_errno = MPIC_Sendrecv(send_char, 1, MPI_CHAR, 0,
+                                  sendtag++, recv_char, 3, MPI_CHAR,
+                                  0, recvtag++, tmp_comm->handle,
+                                  MPI_STATUS_IGNORE);
+        if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("MPIC_Sendrecv returned with mpi_errno=%d\n", mpi_errno);
+	}
+#endif
+
+    }
+
+    /* broadcast the received info to local processes */
+    TRACE_ERR("accept:broadcasting 3 ints - %d, %d and %d\n", recv_ints[0], recv_ints[1], recv_ints[2]);
+    mpi_errno = MPIR_Bcast_intra(recv_ints, 3, MPI_INT, root, comm_ptr, &errflag);
+    if (mpi_errno) TRACE_ERR("MPIR_Bcast_intra returned with mpi_errno=%d\n", mpi_errno);
+
+    n_remote_pgs     = recv_ints[0];
+    remote_comm_size = recv_ints[1];
+    context_id       = recv_ints[2];
+    remote_pg = MPIU_Malloc(n_remote_pgs * sizeof(MPIDI_PG_t*));
+    remote_translation =  MPIU_Malloc(remote_comm_size * sizeof(pg_translation));
+    TRACE_ERR("[%d]accept:remote process groups: %d\nremote comm size: %d\n", rank, n_remote_pgs, remote_comm_size);
+
+    /* Exchange the process groups and their corresponding KVSes */
+    if (rank == root)
+    {
+	/* The root receives the PG from the peer (in tmp_comm) and
+	   distributes them to the processes in comm_ptr */
+	mpi_errno = MPIDI_ReceivePGAndDistribute( tmp_comm, comm_ptr, root, &recvtag,
+					n_remote_pgs, remote_pg );
+
+	mpi_errno = MPIDI_SendPGtoPeerAndFree( tmp_comm, &sendtag, pg_list );
+
+	/* Receive the translations from remote process rank to process group index */
+	TRACE_ERR("accept:sending %d ints and receiving %d ints\n", local_comm_size * 3, remote_comm_size * 3);
+	mpi_errno = MPIC_Sendrecv(local_translation, local_comm_size * 3,
+				  MPI_INT, 0, sendtag++,
+				  remote_translation, remote_comm_size * 3,
+				  MPI_INT, 0, recvtag++, tmp_comm->handle,
+				  MPI_STATUS_IGNORE);
+	for (i=0; i<remote_comm_size; i++)
+	{
+	    TRACE_ERR(" remote_translation[%d].pg_index = %d\n remote_translation[%d].pg_rank = %d\n",
+		i, remote_translation[i].pg_index, i, remote_translation[i].pg_rank);
+	}
+    }
+    else
+    {
+	mpi_errno = MPIDI_ReceivePGAndDistribute( tmp_comm, comm_ptr, root, &recvtag,
+					    n_remote_pgs, remote_pg );
+    }
+    for(i=0; i<n_remote_pgs; i++)
+    {
+	TRACE_ERR("after calling MPIDI_ReceivePGAndDistribute - remote_pg[%d]->id=%s\n", i, (char *)(remote_pg[i]->id));
+    }
+
+
+    /* Broadcast out the remote rank translation array */
+    TRACE_ERR("Broadcast remote_translation");
+    mpi_errno = MPIR_Bcast_intra(remote_translation, remote_comm_size * 3, MPI_INT,
+                                 root, comm_ptr, &errflag);
+    if (mpi_errno) TRACE_ERR("MPIR_Bcast_intra returned with mpi_errno=%d\n", mpi_errno);
+    TRACE_ERR("[%d]accept:Received remote_translation after broadcast:\n", rank);
+    char *pginfo = MPIU_Malloc(256*sizeof(char));
+    memset(pginfo, 0, 256);
+    char cp[20];
+    for (i=0; i<remote_comm_size; i++)
+    {
+	TRACE_ERR(" remote_translation[%d].pg_index = %d remote_translation[%d].pg_rank = %d remote_translation[%d].pg_taskid=%d\n",
+	    i, remote_translation[i].pg_index, i, remote_translation[i].pg_rank, i, remote_translation[i].pg_taskid);
+	TRACE_ERR("remote_pg[remote_translation[%d].pg_index]->id=%s\n", i, (char *)(remote_pg[remote_translation[i].pg_index]->id));
+	strcat(pginfo, (char *)(remote_pg[remote_translation[i].pg_index]->id));
+	sprintf(cp, ":%d ", remote_translation[i].pg_taskid);
+	strcat(pginfo, cp);
+    }
+    pginfo[strlen(pginfo)]='\0';
+    TRACE_ERR("connection info %s\n", pginfo);
+/*    MPIDI_Parse_connection_info(n_remote_pgs, remote_pg); */
+    MPIU_Free(pginfo);
+
+
+    /* Now fill in newcomm */
+    intercomm               = *newcomm;
+    intercomm->context_id   = context_id;
+    intercomm->is_low_group = 0;
+
+    mpi_errno = MPIDI_SetupNewIntercomm( comm_ptr, remote_comm_size,
+				   remote_translation, n_remote_pgs, remote_pg, intercomm );
+    if (mpi_errno != MPI_SUCCESS) {
+	TRACE_ERR("MPIDI_SetupNewIntercomm returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    /* synchronize with remote root */
+    if (rank == root)
+    {
+        mpi_errno = MPIC_Sendrecv(&i, 0, MPI_INT, 0,
+                                  sendtag++, &j, 0, MPI_INT,
+                                  0, recvtag++, tmp_comm->handle,
+                                  MPI_STATUS_IGNORE);
+        if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("MPIC_Sendrecv returned with mpi_errno=%d\n", mpi_errno);
+        }
+
+        /* All communication with remote root done. Release the communicator. */
+        MPIR_Comm_release(tmp_comm,0);
+    }
+
+    mpi_errno = MPIR_Barrier_intra(comm_ptr, &errflag);
+    if (mpi_errno != MPI_SUCCESS) {
+	TRACE_ERR("MPIR_Barrier_intra returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    /* Free new_vc once the connection is completed. */
+    if (rank == root) {
+	MPIU_Free( new_vc );
+    }
+
+fn_exit:
+    return mpi_errno;
+
+fn_fail:
+    goto fn_exit;
+}
+
+/* ------------------------------------------------------------------------- */
+
+/* This routine initializes the new intercomm, setting up the
+   VCRT and other common structures.  The is_low_group and context_id
+   fields are NOT set because they differ in the use of this
+   routine in Comm_accept and Comm_connect.  The virtual connections
+   are initialized from a collection of process groups.
+
+   Input parameters:
++  comm_ptr - communicator that gives the group for the "local" group on the
+   new intercommunicator
+.  remote_comm_size - size of remote group
+.  remote_translation - array that specifies the process group and rank in
+   that group for each of the processes to include in the remote group of the
+   new intercommunicator
+-  remote_pg - array of remote process groups
+
+   Input/Output Parameter:
+.  intercomm - New intercommunicator.  The intercommunicator must already
+   have been allocated; this routine initializes many of the fields
+
+   Note:
+   This routine performs a barrier over 'comm_ptr'.  Why?
+*/
+static int MPIDI_SetupNewIntercomm( struct MPID_Comm *comm_ptr, int remote_comm_size,
+			      pg_translation remote_translation[],
+			      int n_remote_pgs, MPIDI_PG_t **remote_pg,
+			      struct MPID_Comm *intercomm )
+{
+    int mpi_errno = MPI_SUCCESS, i, j, index;
+    int errflag = FALSE;
+    int total_rem_world_cnts, p=0;
+    char *world_tasks, *cp1;
+    conn_info *tmp_node;
+    int conn_world_ids[64];
+    pami_endpoint_t dest;
+    TRACE_ERR("MPIDI_SetupNewIntercomm - remote_comm_size=%d\n", remote_comm_size);
+    /* FIXME: How much of this could/should be common with the
+       upper level (src/mpi/comm/ *.c) code? For best robustness,
+       this should use the same routine (not copy/paste code) as
+       in the upper level code. */
+    intercomm->attributes   = NULL;
+    intercomm->remote_size  = remote_comm_size;
+    intercomm->local_size   = comm_ptr->local_size;
+    intercomm->rank         = comm_ptr->rank;
+    intercomm->local_group  = NULL;
+    intercomm->remote_group = NULL;
+    intercomm->comm_kind    = MPID_INTERCOMM;
+    intercomm->local_comm   = NULL;
+    intercomm->coll_fns     = NULL;
+    intercomm->mpid.world_ids = NULL; /*FIXME*/
+
+    /* Point local vcr, vcrt at those of incoming intracommunicator */
+    intercomm->local_vcrt = comm_ptr->vcrt;
+    MPID_VCRT_Add_ref(comm_ptr->vcrt);
+    intercomm->local_vcr  = comm_ptr->vcr;
+
+    /* Set up VC reference table */
+    mpi_errno = MPID_VCRT_Create(intercomm->remote_size, &intercomm->vcrt);
+    if (mpi_errno != MPI_SUCCESS) {
+        TRACE_ERR("MPID_VCRT_Create returned with mpi_errno=%d\n", mpi_errno);
+    }
+    mpi_errno = MPID_VCRT_Get_ptr(intercomm->vcrt, &intercomm->vcr);
+    if (mpi_errno != MPI_SUCCESS) {
+        TRACE_ERR("MPID_VCRT_Get_ptr returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    for (i=0; i < intercomm->remote_size; i++) {
+	MPIDI_PG_Dup_vcr(remote_pg[remote_translation[i].pg_index],
+			 remote_translation[i].pg_rank, remote_translation[i].pg_taskid,&intercomm->vcr[i]);
+	TRACE_ERR("MPIDI_SetupNewIntercomm - pg_id=%s pg_rank=%d pg_taskid=%d intercomm->vcr[%d]->taskid=%d intercomm->vcr[%d]->pg=%x\n ", remote_pg[remote_translation[i].pg_index]->id, remote_translation[i].pg_rank, remote_translation[i].pg_taskid, i, intercomm->vcr[i]->taskid, i, intercomm->vcr[i]->pg);
+	PAMI_Endpoint_create(MPIDI_Client, remote_translation[i].pg_taskid, 0, &dest);
+	/*PAMI_Resume(MPIDI_Context[0],
+                    &dest, 1); */
+    }
+
+    MPIDI_Parse_connection_info(n_remote_pgs, remote_pg);
+
+    /* anchor connection information in mpid */
+    total_rem_world_cnts = 0;
+    tmp_node = _conn_info_list;
+    p=0;
+    while(tmp_node != NULL) {
+       total_rem_world_cnts++;
+       conn_world_ids[p++]=tmp_node->rem_world_id;
+       tmp_node = tmp_node->next;
+    }
+    if(intercomm->mpid.world_ids) { /* need to look at other places that may populate world id list for this communicator */
+      for(i=0;intercomm->mpid.world_ids[i]!=-1;i++)
+      {
+        for(j=0;j<total_rem_world_cnts;j++) {
+          if(intercomm->mpid.world_ids[i] == conn_world_ids[j]) {
+            conn_world_ids[j] = -1;
+          }
+        }
+      }
+      /* At this point i counts the world_ids already present in
+       * intercomm->mpid.world_ids, excluding the trailing -1 entry */
+      index = 0;
+      for(j=0;j<total_rem_world_cnts;j++) {
+        if(conn_world_ids[j] != -1)
+          index++;
+      }
+      if(index) {
+        int *new_ids = MPIU_Malloc((index+i+1)*sizeof(int));
+        /* copy the i existing world_ids, then append the new ones;
+           overwriting the pointer before copying would lose them */
+        for(j=0;j<i;j++)
+          new_ids[j] = intercomm->mpid.world_ids[j];
+        for(j=0;j<total_rem_world_cnts;j++) {
+          if(conn_world_ids[j] != -1) {
+            new_ids[i++] = conn_world_ids[j];
+          }
+        }
+        new_ids[i] = -1;
+        MPIU_Free(intercomm->mpid.world_ids);
+        intercomm->mpid.world_ids = new_ids;
+      }
+   }
+   else {
+    intercomm->mpid.world_ids = MPIU_Malloc((n_remote_pgs+1)*sizeof(int));
+    for(i=0;i<n_remote_pgs;i++) {
+      intercomm->mpid.world_ids[i] = atoi((char *)remote_pg[i]->id);
+    }
+    intercomm->mpid.world_ids[i] = -1;
+   }
+   for(i=0; intercomm->mpid.world_ids[i] != -1; i++)
+     TRACE_ERR("intercomm=%x intercomm->mpid.world_ids[%d]=%d\n", intercomm, i, intercomm->mpid.world_ids[i]);
+
+
+   mpi_errno = MPIR_Comm_commit(intercomm);
+   if (mpi_errno) TRACE_ERR("MPIR_Comm_commit returned with mpi_errno=%d\n", mpi_errno);
+
+    mpi_errno = MPIR_Barrier_intra(comm_ptr, &errflag);
+    if (mpi_errno != MPI_SUCCESS) {
+	TRACE_ERR("MPIR_Barrier_intra returned with mpi_errno=%d\n", mpi_errno);
+   }
+
+ fn_exit:
+    return mpi_errno;
+
+ fn_fail:
+    goto fn_exit;
+}
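The world-id bookkeeping in MPIDI_SetupNewIntercomm merges a -1-terminated array with the ids gathered from the connection list, skipping duplicates. A standalone sketch of that sentinel-array merge (`merge_world_ids` is an illustrative helper, not part of MPICH):

```c
#include <assert.h>
#include <stdlib.h>

/* Merge a -1-terminated id list with 'cnt' candidate ids, dropping
   duplicates; returns a newly allocated -1-terminated list. */
static int *merge_world_ids(const int *old_ids, const int *cand, int cnt)
{
    int n = 0, extra = 0, i, j, *out;
    while (old_ids[n] != -1) n++;             /* count existing entries */
    out = malloc((size_t)(n + cnt + 1) * sizeof(int));
    for (i = 0; i < n; i++) out[i] = old_ids[i];
    for (j = 0; j < cnt; j++) {
        int dup = 0;
        for (i = 0; i < n + extra; i++)
            if (out[i] == cand[j]) { dup = 1; break; }
        if (!dup) out[n + extra++] = cand[j]; /* append only unseen ids */
    }
    out[n + extra] = -1;                      /* restore sentinel */
    return out;
}
```

The production code marks duplicates in conn_world_ids with -1 first and then counts survivors before allocating, but the net effect is the same set union with a -1 terminator.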
+
+
+/* Attempt to dequeue a vc from the accept queue. If the queue is
+   empty or the port_name_tag doesn't match, return a NULL vc. */
+int MPIDI_Acceptq_dequeue(MPID_VCR * vcr, int port_name_tag)
+{
+    int mpi_errno=MPI_SUCCESS;
+    MPIDI_Acceptq_t *q_item, *prev;
+    *vcr = NULL;
+    q_item = acceptq_head;
+    prev = q_item;
+
+    while (q_item != NULL)
+    {
+	if (q_item->port_name_tag == port_name_tag)
+	{
+	    *vcr = q_item->vcr;
+
+	    if ( q_item == acceptq_head )
+		acceptq_head = q_item->next;
+	    else
+		prev->next = q_item->next;
+
+	    /*MPIU_Free(q_item); */
+	    AcceptQueueSize--;
+	    break;
+	}
+	else
+	{
+	    prev = q_item;
+	    q_item = q_item->next;
+	}
+    }
+
+    return mpi_errno;
+}
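MPIDI_Acceptq_dequeue above unlinks the first queued VC whose port_name_tag matches, patching either the head pointer or the predecessor's next link. The keyed-unlink pattern on a singly linked list can be shown in isolation (`item` and `dequeue_by_tag` are illustrative names for this sketch):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct item { int tag; struct item *next; } item;

/* Unlink and return the first node whose tag matches, or NULL. */
static item *dequeue_by_tag(item **head, int tag)
{
    item *it = *head, *prev = NULL;
    for (; it != NULL; prev = it, it = it->next) {
        if (it->tag == tag) {
            if (prev) prev->next = it->next;
            else      *head = it->next;   /* matched at the head */
            it->next = NULL;
            return it;
        }
    }
    return NULL;                          /* no queued item with this tag */
}
```

Note that the routine above tracks `prev` starting at the head rather than NULL and special-cases `q_item == acceptq_head`; both formulations unlink the same node.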
+
+
+/**
+ * This routine returns the list of taskids associated with world (wid)
+ */
+int* MPIDI_get_taskids_in_world_id(int wid) {
+  conn_info *tmp_node;
+
+  MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+  tmp_node = _conn_info_list;
+  while(tmp_node != NULL) {
+    if(tmp_node->rem_world_id == wid) {
+      break;
+    }
+    tmp_node = tmp_node->next;
+  }
+  MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+  if(tmp_node == NULL)
+    return NULL;
+  else
+    return (tmp_node->rem_taskids);
+}
+
+void MPIDI_IpState_reset(int dest)
+{
+  MPIDI_In_cntr_t *in_cntr;
+  in_cntr=&MPIDI_In_cntr[dest];
+
+  in_cntr->n_OutOfOrderMsgs = 0;
+  in_cntr->nMsgs = 0;
+  in_cntr->OutOfOrderList = NULL;
+}
+
+
+void MPIDI_OpState_reset(int dest)
+{
+  MPIDI_Out_cntr_t *out_cntr;
+  out_cntr=&MPIDI_Out_cntr[dest];
+
+  out_cntr->nMsgs = 0;
+  out_cntr->unmatched = 0;
+}
+
+
+/**
+ * This routine returns the connection reference count associated with the
+ * remote world identified by wid
+ */
+int MPIDI_get_refcnt_of_world(int wid) {
+  conn_info *tmp_node;
+
+  MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+  tmp_node = _conn_info_list;
+  while(tmp_node != NULL) {
+    if(tmp_node->rem_world_id == wid) {
+      break;
+    }
+    tmp_node = tmp_node->next;
+  }
+  MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+  if(tmp_node == NULL)
+    return 0;
+  else
+    return (tmp_node->ref_count);
+}
+
+/**
+ * This routine deletes world (wid) from the linked list of known world
+ * descriptors
+ */
+void MPIDI_delete_conn_record(int wid) {
+  conn_info *tmp_node1, *tmp_node2;
+
+  MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+  tmp_node1 = tmp_node2 = _conn_info_list;
+  while(tmp_node1) {
+    if(tmp_node1->rem_world_id == wid) {
+      if(tmp_node1 == tmp_node2) {
+        _conn_info_list = tmp_node1->next;
+      }
+      else {
+        tmp_node2->next = tmp_node1->next;
+      }
+      if(tmp_node1->rem_taskids != NULL)
+        MPIU_Free(tmp_node1->rem_taskids);
+      MPIU_Free(tmp_node1);
+      break;
+    }
+    tmp_node2 = tmp_node1;
+    tmp_node1 = tmp_node1->next;
+  }
+  MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+}
+#endif
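MPIDI_delete_conn_record above is the standard two-pointer singly-linked-list deletion: a trailing pointer follows the scan pointer so the matching node can be unlinked whether it is the head or an interior node. A minimal standalone sketch of the same walk (toy node type, thread-safety macros elided; names are illustrative, not MPICH's):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy version of the conn_info list: delete the node whose world id
 * matches wid, using the same leading/trailing pointer walk as
 * MPIDI_delete_conn_record. */
struct node { int wid; struct node *next; };

static void delete_record(struct node **head, int wid)
{
    struct node *cur = *head, *prev = *head;

    while (cur) {
        if (cur->wid == wid) {
            if (cur == prev)
                *head = cur->next;   /* deleting the list head */
            else
                prev->next = cur->next;
            free(cur);
            return;
        }
        prev = cur;
        cur = cur->next;
    }
}
```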
diff --git a/src/mpid/pamid/src/misc/mpid_abort.c b/src/mpid/pamid/src/misc/mpid_abort.c
index edfc229..ef8af06 100644
--- a/src/mpid/pamid/src/misc/mpid_abort.c
+++ b/src/mpid/pamid/src/misc/mpid_abort.c
@@ -92,5 +92,8 @@ int MPID_Abort(MPID_Comm * comm, int mpi_errno, int exit_code, const char *error
     if ( (strncasecmp("no", env, 2)==0) || (strncasecmp("exit", env, 4)==0) || (strncmp("0", env, 1)==0) )
       exit(1);
 
+#ifdef DYNAMIC_TASKING
+  return PMI2_Abort(1,error_msg);
+#endif
   abort();
 }
diff --git a/src/mpid/pamid/src/misc/mpid_get_universe_size.c b/src/mpid/pamid/src/misc/mpid_get_universe_size.c
index dfefd49..87049bb 100644
--- a/src/mpid/pamid/src/misc/mpid_get_universe_size.c
+++ b/src/mpid/pamid/src/misc/mpid_get_universe_size.c
@@ -26,9 +26,71 @@
 
 #include <mpidimpl.h>
 
+#ifdef DYNAMIC_TASKING
+#ifdef USE_PMI2_API
+#include "pmi2.h"
+#else
+#include "pmi.h"
+#endif  /* USE_PMI2_API */
+#endif  /* DYNAMIC_TASKING */
+
+extern int mpidi_dynamic_tasking;
+
+/*
+ * MPID_Get_universe_size - Get the universe size from the process manager
+ *
+ * Notes: This requires that the PMI routines are used to
+ * communicate with the process manager.
+ */
 int MPID_Get_universe_size(int  * universe_size)
 {
-  int mpi_errno = MPI_SUCCESS;
+    int mpi_errno = MPI_SUCCESS;
+#ifdef DYNAMIC_TASKING
+#ifdef USE_PMI2_API
+    if(mpidi_dynamic_tasking) {
+      char val[PMI2_MAX_VALLEN];
+      int found = 0;
+      char *endptr;
+
+      mpi_errno = PMI2_Info_GetJobAttr("universeSize", val, sizeof(val), &found);
+      TRACE_ERR("mpi_errno from PMI2_Info_GetJobAttr=%d\n", mpi_errno);
+
+      if (!found) {
+        TRACE_ERR("PMI2_Info_GetJobAttr not found\n");
+	*universe_size = MPIR_UNIVERSE_SIZE_NOT_AVAILABLE;
+      }
+      else {
+        *universe_size = strtol(val, &endptr, 0);
+        TRACE_ERR("PMI2_Info_GetJobAttr found universe_size=%d\n", *universe_size);
+      }
+#else
+      int pmi_errno = PMI_SUCCESS;
+
+      pmi_errno = PMI_Get_universe_size(universe_size);
+      if (pmi_errno != PMI_SUCCESS) {
+        MPIU_ERR_SETANDJUMP1(mpi_errno, MPI_ERR_OTHER,
+			     "**pmi_get_universe_size",
+			     "**pmi_get_universe_size %d", pmi_errno);
+      }
+      if (*universe_size < 0)
+      {
+	*universe_size = MPIR_UNIVERSE_SIZE_NOT_AVAILABLE;
+      }
+#endif /*USE_PMI2_API*/
+   } else {
+     *universe_size = MPIR_UNIVERSE_SIZE_NOT_AVAILABLE;
+   }
+#else
   *universe_size = MPIR_UNIVERSE_SIZE_NOT_AVAILABLE;
-  return mpi_errno;
+#endif /*DYNAMIC_TASKING*/
+
+
+fn_exit:
+    return mpi_errno;
+
+    /* --BEGIN ERROR HANDLING-- */
+fn_fail:
+    *universe_size = MPIR_UNIVERSE_SIZE_NOT_AVAILABLE;
+    goto fn_exit;
+    /* --END ERROR HANDLING-- */
 }
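The hunk above falls back to MPIR_UNIVERSE_SIZE_NOT_AVAILABLE whenever the process manager does not report a "universeSize" job attribute, and otherwise parses the attribute string with strtol(). A standalone sketch of just that decision logic (the constant's value and the helper name are illustrative stand-ins, not MPICH's):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for MPIR_UNIVERSE_SIZE_NOT_AVAILABLE. */
#define UNIVERSE_SIZE_NOT_AVAILABLE (-1)

/* Mirrors the PMI2 branch of MPID_Get_universe_size: if the job
 * attribute was found, parse it with strtol(); otherwise report that
 * the universe size is not available. */
static int parse_universe_size(const char *val, int found)
{
    char *endptr;

    if (!found)
        return UNIVERSE_SIZE_NOT_AVAILABLE;
    return (int) strtol(val, &endptr, 0);
}
```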
diff --git a/src/mpid/pamid/src/misc/mpid_unimpl.c b/src/mpid/pamid/src/misc/mpid_unimpl.c
index 16c973a..d390ea3 100644
--- a/src/mpid/pamid/src/misc/mpid_unimpl.c
+++ b/src/mpid/pamid/src/misc/mpid_unimpl.c
@@ -21,6 +21,7 @@
  */
 #include <mpidimpl.h>
 
+#ifndef DYNAMIC_TASKING
 int MPID_Close_port(const char *port_name)
 {
   MPID_abort();
@@ -69,7 +70,7 @@ int MPID_Comm_spawn_multiple(int count,
   MPID_abort();
   return 0;
 }
-
+#endif
 
 int MPID_Comm_reenable_anysource(MPID_Comm *comm,
                                  MPID_Group **failed_group_ptr)
diff --git a/src/mpid/pamid/src/mpid_finalize.c b/src/mpid/pamid/src/mpid_finalize.c
index 549853f..6e1d212 100644
--- a/src/mpid/pamid/src/mpid_finalize.c
+++ b/src/mpid/pamid/src/mpid_finalize.c
@@ -29,6 +29,9 @@ extern void MPIDI_close_mm();
 #ifdef MPIDI_STATISTICS
 extern pami_extension_t pe_extension;
 
+extern int mpidi_dynamic_tasking;
+int mpidi_finalized = 0;
+
 
 void MPIDI_close_pe_extension() {
      int rc;
@@ -58,6 +61,24 @@ int MPID_Finalize()
   }
   MPIDI_close_pe_extension();
 #endif
+
+#ifdef DYNAMIC_TASKING
+  mpidi_finalized = 1;
+  if(mpidi_dynamic_tasking) {
+    /* Tell the process group code that we're done with the process groups.
+       This will notify PMI (with PMI_Finalize) if necessary.  It
+       also frees all PG structures, including the PG for COMM_WORLD, whose
+       pointer is also saved in MPIDI_Process.my_pg */
+    mpierrno = MPIDI_PG_Finalize();
+    if (mpierrno) {
+	TRACE_ERR("MPIDI_PG_Finalize returned with mpierrno=%d\n", mpierrno);
+    }
+
+    MPIDI_FreeParentPort();
+  }
+#endif
+
+
   /* ------------------------- */
   /* shutdown request queues   */
   /* ------------------------- */
@@ -93,7 +114,7 @@ int MPID_Finalize()
    {
      #if TOKEN_FLOW_CONTROL
      extern char *EagerLimit;
-     
+
      if (EagerLimit) MPIU_Free(EagerLimit);
      MPIU_Free(MPIDI_Token_cntr);
      MPIDI_close_mm();
diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index e571d16..fe6f5dc 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -27,6 +27,13 @@
 #include "mpidi_platform.h"
 #include "onesided/mpidi_onesided.h"
 
+#ifdef DYNAMIC_TASKING
+#define PAMIX_CLIENT_DYNAMIC_TASKING 1032
+#define PAMIX_CLIENT_WORLD_TASKS     1033
+#define MAX_JOBID_LEN                1024
+#endif
+int mpidi_dynamic_tasking = 0;
+
 #if TOKEN_FLOW_CONTROL
   extern int MPIDI_mm_init(int,uint *,unsigned long *);
   extern int MPIDI_tfctrl_enabled;
@@ -134,6 +141,9 @@ static struct
   struct protocol_t WinCtrl;
   struct protocol_t WinAccum;
   struct protocol_t RVZ_zerobyte;
+#ifdef DYNAMIC_TASKING
+  struct protocol_t Dyntask;
+#endif
 } proto_list = {
   .Short = {
     .func = MPIDI_RecvShortAsyncCB,
@@ -231,10 +241,23 @@ static struct
     },
     .immediate_min     = sizeof(MPIDI_MsgEnvelope),
   },
+#ifdef DYNAMIC_TASKING
+  .Dyntask = {
+    .func = MPIDI_Recvfrom_remote_world,
+    .dispatch = MPIDI_Protocols_Dyntask,
+    .options = {
+      .consistency     = USE_PAMI_CONSISTENCY,
+      .long_header     = PAMI_HINT_DISABLE,
+      .recv_immediate  = PAMI_HINT_ENABLE,
+      .use_rdma        = PAMI_HINT_DISABLE,
+    },
+    .immediate_min     = sizeof(MPIDI_MsgInfo),
+  },
+#endif
 };
 
 static void
-MPIDI_PAMI_client_init(int* rank, int* size, int threading)
+MPIDI_PAMI_client_init(int* rank, int* size, int* mpidi_dynamic_tasking, char **world_tasks)
 {
   /* ------------------------------------ */
   /*  Initialize the MPICH->PAMI Client  */
@@ -248,12 +271,64 @@ MPIDI_PAMI_client_init(int* rank, int* size, int threading)
   PAMIX_Initialize(MPIDI_Client);
 
 
-  /* ---------------------------------- */
-  /*  Get my rank and the process size  */
-  /* ---------------------------------- */
-  *rank = PAMIX_Client_query(MPIDI_Client, PAMI_CLIENT_TASK_ID  ).value.intval;
-  MPIR_Process.comm_world->rank = *rank; /* Set the rank early to make tracing better */
-  *size = PAMIX_Client_query(MPIDI_Client, PAMI_CLIENT_NUM_TASKS).value.intval;
+  *mpidi_dynamic_tasking=0;
+#ifdef DYNAMIC_TASKING
+  *world_tasks = NULL;
+  pami_result_t status = PAMI_ERROR;
+
+  typedef pami_result_t (*dyn_task_query_fn) (
+             pami_client_t          client,
+             pami_configuration_t   config[],
+             size_t                 num_configs);
+  dyn_task_query_fn  dyn_task_query = NULL;
+
+  pami_extension_t extension;
+  status = PAMI_Extension_open (MPIDI_Client, "PE_dyn_task", &extension);
+  if(status != PAMI_SUCCESS)
+  {
+    TRACE_ERR("Error. The PE_dyn_task extension is not implemented. result = %d\n", status);
+  }
+
+  dyn_task_query =  (dyn_task_query_fn) PAMI_Extension_symbol(extension, "query");
+  if (dyn_task_query == (void*)NULL) {
+    TRACE_ERR("Err: the Dynamic Tasking extension function dyn_task_query is not implemented.\n");
+
+  } else {
+    pami_configuration_t config2[] =
+    {
+       {PAMI_CLIENT_TASK_ID, -1},
+       {PAMI_CLIENT_NUM_TASKS, -1},
+       {(pami_attribute_name_t)PAMIX_CLIENT_DYNAMIC_TASKING},
+       {(pami_attribute_name_t)PAMIX_CLIENT_WORLD_TASKS},
+    };
+
+    dyn_task_query(MPIDI_Client, config2, 4);
+    TRACE_ERR("dyn_task_query: task_id %d num_tasks %d dynamic_tasking %d world_tasks %s\n",
+              config2[0].value.intval,
+              config2[1].value.intval,
+              config2[2].value.intval,
+              config2[3].value.chararray);
+    *rank = config2[0].value.intval;
+    *size = config2[1].value.intval;
+    *mpidi_dynamic_tasking  = config2[2].value.intval;
+    *world_tasks = config2[3].value.chararray;
+  }
+
+  status = PAMI_Extension_close (extension);
+  if(status != PAMI_SUCCESS)
+  {
+    TRACE_ERR("Error. The PE_dyn_task extension could not be closed. result = %d\n", status);
+  }
+#endif
+
+  if(*mpidi_dynamic_tasking == 0) {
+     /* ---------------------------------- */
+     /*  Get my rank and the process size  */
+     /* ---------------------------------- */
+     *rank = PAMIX_Client_query(MPIDI_Client, PAMI_CLIENT_TASK_ID  ).value.intval;
+     MPIR_Process.comm_world->rank = *rank; /* Set the rank early to make tracing better */
+     *size = PAMIX_Client_query(MPIDI_Client, PAMI_CLIENT_NUM_TASKS).value.intval;
+  }
 
   /* --------------------------------------------------------------- */
   /* Determine if the eager point-to-point protocol for internal mpi */
@@ -285,7 +360,7 @@ MPIDI_PAMI_client_init(int* rank, int* size, int threading)
 
 
 static void
-MPIDI_PAMI_context_init(int* threading)
+MPIDI_PAMI_context_init(int* threading, int *size)
 {
   int requested_thread_level;
   requested_thread_level = *threading;
@@ -410,13 +485,16 @@ if (TOKEN_FLOW_CONTROL_ON)
 
 #ifdef MPIDI_TRACE
       int i; 
+      MPIDI_Trace_buf = MPIU_Calloc0(numTasks, MPIDI_Trace_buf_t);
+      if(MPIDI_Trace_buf == NULL) MPID_abort();
+      memset((void *) MPIDI_Trace_buf,0, sizeof(MPIDI_Trace_buf_t));
       for (i=0; i < numTasks; i++) {
-          MPIDI_In_cntr[i].R=MPIU_Calloc0(N_MSGS, recv_status);
-          if (MPIDI_In_cntr[i].R==NULL) MPID_abort();
-          MPIDI_In_cntr[i].PR=MPIU_Calloc0(N_MSGS, posted_recv);
-          if (MPIDI_In_cntr[i].PR ==NULL) MPID_abort();
-          MPIDI_Out_cntr[i].S=MPIU_Calloc0(N_MSGS, send_status);
-          if (MPIDI_Out_cntr[i].S ==NULL) MPID_abort();
+          MPIDI_Trace_buf[i].R=MPIU_Calloc0(N_MSGS, recv_status);
+          if (MPIDI_Trace_buf[i].R==NULL) MPID_abort();
+          MPIDI_Trace_buf[i].PR=MPIU_Calloc0(N_MSGS, posted_recv);
+          if (MPIDI_Trace_buf[i].PR ==NULL) MPID_abort();
+          MPIDI_Trace_buf[i].S=MPIU_Calloc0(N_MSGS, send_status);
+          if (MPIDI_Trace_buf[i].S ==NULL) MPID_abort();
       }
 #endif
 
@@ -491,7 +569,7 @@ MPIDI_PAMI_dispath_init()
       }
     else
       {
-        TRACE_ERR((" Attention: PAMI_Client_query(DISPATCH_SEND_IMMEDIATE_MAX=%d) rc=%d\n", config.name, rc));
+        TRACE_ERR(" Attention: PAMI_Client_query(DISPATCH_SEND_IMMEDIATE_MAX=%d) rc=%d\n", config.name, rc);
         MPIDI_Process.pt2pt.limits_array[2] = 256;
       }
 
@@ -513,6 +591,9 @@ MPIDI_PAMI_dispath_init()
   MPIDI_PAMI_dispath_set(MPIDI_Protocols_WinCtrl,   &proto_list.WinCtrl,   NULL);
   MPIDI_PAMI_dispath_set(MPIDI_Protocols_WinAccum,  &proto_list.WinAccum,  NULL);
   MPIDI_PAMI_dispath_set(MPIDI_Protocols_RVZ_zerobyte, &proto_list.RVZ_zerobyte, NULL);
+#ifdef DYNAMIC_TASKING
+  MPIDI_PAMI_dispath_set(MPIDI_Protocols_Dyntask,   &proto_list.Dyntask,  NULL);
+#endif
 
   /*
    * The first two protocols are our short protocols: they use
@@ -561,7 +642,7 @@ printEnvVars(char *type)
 static void
 MPIDI_PAMI_init(int* rank, int* size, int* threading)
 {
-  MPIDI_PAMI_context_init(threading);
+  MPIDI_PAMI_context_init(threading, size);
 
 
   MPIDI_PAMI_dispath_init();
@@ -681,12 +762,21 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
 #endif
 }
 
-
+#ifndef DYNAMIC_TASKING
 static void
 MPIDI_VCRT_init(int rank, int size)
+#else
+static void
+MPIDI_VCRT_init(int rank, int size, char *world_tasks, MPIDI_PG_t *pg)
+#endif
 {
   int i, rc;
   MPID_Comm * comm;
+  int p, mpi_errno=0;
+#ifdef DYNAMIC_TASKING
+  char *world_tasks_save,*cp;
+  char *pg_id;
+#endif
 
   /* ------------------------------- */
   /* Initialize MPI_COMM_SELF object */
@@ -698,8 +788,18 @@ MPIDI_VCRT_init(int rank, int size)
   MPID_assert_always(rc == MPI_SUCCESS);
   rc = MPID_VCRT_Get_ptr(comm->vcrt, &comm->vcr);
   MPID_assert_always(rc == MPI_SUCCESS);
-  comm->vcr[0] = rank;
-
+  comm->vcr[0]->taskid= PAMIX_Client_query(MPIDI_Client, PAMI_CLIENT_TASK_ID  ).value.intval;
+
+#ifdef DYNAMIC_TASKING
+  if(mpidi_dynamic_tasking) {
+    comm->vcr[0]->pg=pg->vct[0].pg;
+    comm->vcr[0]->pg_rank=pg->vct[0].pg_rank;
+    if(comm->vcr[0]->pg) {
+      TRACE_ERR("Adding ref for comm=%x vcr=%x pg=%x\n", comm, comm->vcr[0], comm->vcr[0]->pg);
+      MPIDI_PG_add_ref(comm->vcr[0]->pg);
+    }
+  }
+#endif
 
   /* -------------------------------- */
   /* Initialize MPI_COMM_WORLD object */
@@ -711,8 +811,56 @@ MPIDI_VCRT_init(int rank, int size)
   MPID_assert_always(rc == MPI_SUCCESS);
   rc = MPID_VCRT_Get_ptr(comm->vcrt, &comm->vcr);
   MPID_assert_always(rc == MPI_SUCCESS);
-  for (i=0; i<size; i++)
-    comm->vcr[i] = i;
+
+#ifdef DYNAMIC_TASKING
+  if(mpidi_dynamic_tasking) {
+    i=0;
+    world_tasks_save = MPIU_Strdup(world_tasks);
+    if(world_tasks != NULL) {
+      comm->vcr[0]->taskid = atoi(strtok(world_tasks, ":"));
+      TRACE_ERR("comm->vcr[0]->taskid =%d\n", comm->vcr[0]->taskid);
+      while( (cp=strtok(NULL, ":")) != NULL) {
+        comm->vcr[++i]->taskid= atoi(cp);
+        TRACE_ERR("comm->vcr[i]->taskid =%d\n", comm->vcr[i]->taskid);
+      }
+    }
+    MPIU_Free(world_tasks_save);
+
+        /* This memory will be freed by the PG_Destroy if there is an error */
+        pg_id = MPIU_Malloc(MAX_JOBID_LEN);
+
+        mpi_errno = PMI2_Job_GetId(pg_id, MAX_JOBID_LEN);
+        TRACE_ERR("PMI2_Job_GetId - pg_id=%s\n", pg_id);
+
+    /* Initialize the connection table on COMM_WORLD from the process group's
+       connection table */
+    for (p = 0; p < comm->local_size; p++)
+    {
+	  comm->vcr[p]->pg=pg->vct[p].pg;
+          comm->vcr[p]->pg_rank=pg->vct[p].pg_rank;
+	  if(comm->vcr[p]->pg) {
+		TRACE_ERR("Adding ref for comm=%x vcr=%x pg=%x\n", comm, comm->vcr[p], comm->vcr[p]->pg);
+		MPIDI_PG_add_ref(comm->vcr[p]->pg);
+	  }
+       /* MPID_VCR_Dup(&pg->vct[p], &(comm->vcr[p]));*/
+	  TRACE_ERR("TASKID -- comm->vcr[%d]=%d\n", p, comm->vcr[p]->taskid);
+    }
+
+  i = 0;
+
+  }else {
+	for (i=0; i<size; i++) {
+	  comm->vcr[i]->taskid = i;
+	  TRACE_ERR("comm->vcr[%d]=%d\n", i, comm->vcr[i]->taskid);
+        }
+	TRACE_ERR("MP_I_WORLD_TASKS not SET\n");
+  }
+#else
+  for (i=0; i<size; i++) {
+    comm->vcr[i]->taskid = i;
+    TRACE_ERR("comm->vcr[%d]=%d\n", i, comm->vcr[i]->taskid);
+  }
+#endif
 }
 
 
@@ -734,12 +882,26 @@ int MPID_Init(int * argc,
               int * has_env)
 {
   int rank, size;
+#ifdef DYNAMIC_TASKING
+  int has_parent=0;
+  MPIDI_PG_t * pg=NULL;
+  int pg_rank=-1;
+  int pg_size;
+  int appnum,mpi_errno;
+  MPID_Comm * comm;
+  int i,j;
+  pami_configuration_t config;
+  int world_size;
+#endif
+  char *world_tasks;
+  pami_result_t rc;
 
 
   /* ------------------------------------------------------------------------------- */
   /*  Initialize the pami client to get the process rank; needed for env var output. */
   /* ------------------------------------------------------------------------------- */
-  MPIDI_PAMI_client_init(&rank, &size, requested);
+  MPIDI_PAMI_client_init(&rank, &size, &mpidi_dynamic_tasking, &world_tasks);
+  TRACE_OUT("after MPIDI_PAMI_client_init rank=%d size=%d mpidi_dynamic_tasking=%d\n", rank, size, mpidi_dynamic_tasking);
 
   /* ------------------------------------ */
   /*  Get new defaults from the Env Vars  */
@@ -769,6 +931,34 @@ int MPID_Init(int * argc,
 #endif
   MPIDI_PAMI_init(&rank, &size, provided);
 
+#ifdef DYNAMIC_TASKING
+  if (mpidi_dynamic_tasking) {
+
+    /*
+     * Perform PMI initialization
+     */
+    mpi_errno = MPIDI_InitPG( argc, argv,
+			      has_args, has_env, &has_parent, &pg_rank, &pg );
+    if (mpi_errno) {
+	TRACE_ERR("MPIDI_InitPG returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    /* FIXME: Why are pg_size and pg_rank handled differently? */
+    pg_size = MPIDI_PG_Get_size(pg);
+
+    TRACE_ERR("MPID_Init - pg_size=%d\n", pg_size);
+    MPIDI_Process.my_pg = pg;  /* brad : this is rework for shared memories
+				* because they need this set earlier
+                                * for getting the business card
+                                */
+    MPIDI_Process.my_pg_rank = pg_rank;
+    /* FIXME: Why do we add a ref to pg here? */
+    TRACE_ERR("Adding ref pg=%x\n", pg);
+    MPIDI_PG_add_ref(pg);
+
+  }
+#endif
+
   /* ------------------------- */
   /* initialize request queues */
   /* ------------------------- */
@@ -790,23 +980,61 @@ int MPID_Init(int * argc,
   /* ------------------------------- */
   /* Initialize communicator objects */
   /* ------------------------------- */
+#ifndef DYNAMIC_TASKING
   MPIDI_VCRT_init(rank, size);
-
+#else
+  MPIDI_VCRT_init(rank, size, world_tasks, pg);
+#endif
 
   /* ------------------------------- */
   /* Setup optimized communicators   */
   /* ------------------------------- */
   TRACE_ERR("creating world geometry\n");
-  pami_result_t rc;
   rc = PAMI_Geometry_world(MPIDI_Client, &MPIDI_Process.world_geometry);
   MPID_assert_always(rc == PAMI_SUCCESS);
   TRACE_ERR("calling comm_create on comm world %p\n", MPIR_Process.comm_world);
   MPIR_Process.comm_world->mpid.geometry = MPIDI_Process.world_geometry;
   MPIR_Process.comm_world->mpid.parent   = PAMI_GEOMETRY_NULL;
-  MPIDI_Comm_create(MPIR_Process.comm_world);
-  MPIDI_Comm_world_setup();
-
-
+  MPIR_Comm_commit(MPIR_Process.comm_world);
+
+#ifdef DYNAMIC_TASKING
+  if (has_parent) {
+     char * parent_port;
+
+     /* FIXME: To allow just the "root" process to
+        request the port and then use MPIR_Bcast_intra to
+        distribute it to the rest of the processes,
+        we need to perform the Bcast after MPI is
+        otherwise initialized.  We could do this
+        by adding another MPID call that the MPI_Init(_thread)
+        routine would make after the rest of MPI is
+        initialized, but before MPI_Init returns.
+        In fact, such a routine could be used to
+        perform various checks, including parameter
+        consistency value (e.g., all processes have the
+        same environment variable values). Alternately,
+        we could allow a few routines to operate with
+        predefined parameter choices (e.g., bcast, allreduce)
+        for the purposes of initialization. */
+	mpi_errno = MPIDI_GetParentPort(&parent_port);
+	if (mpi_errno != MPI_SUCCESS) {
+          TRACE_ERR("MPIDI_GetParentPort returned with mpi_errno=%d\n", mpi_errno);
+	}
+
+	mpi_errno = MPID_Comm_connect(parent_port, NULL, 0,
+				      MPIR_Process.comm_world, &comm);
+	if (mpi_errno != MPI_SUCCESS) {
+	    TRACE_ERR("mpi_errno from Comm_connect=%d\n", mpi_errno);
+	}
+
+	MPIR_Process.comm_parent = comm;
+	MPIU_Assert(MPIR_Process.comm_parent != NULL);
+	MPIU_Strncpy(comm->name, "MPI_COMM_PARENT", MPI_MAX_OBJECT_NAME);
+
+	/* FIXME: Check that this intercommunicator gets freed in MPI_Finalize
+	   if not already freed.  */
+   }
+#endif
   /* ------------------------------- */
   /* Initialize timer data           */
   /* ------------------------------- */
@@ -929,3 +1157,169 @@ static_assertions()
   MPID_assert_static(sizeof(uint64_t) == sizeof(size_t));
 #endif
 }
+
+#ifdef DYNAMIC_TASKING
+/* FIXME: The PG code should supply these, since it knows how the
+   pg_ids and other data are represented */
+int MPIDI_PG_Compare_ids(void * id1, void * id2)
+{
+    return (strcmp((char *) id1, (char *) id2) == 0) ? TRUE : FALSE;
+}
+
+int MPIDI_PG_Destroy_id(MPIDI_PG_t * pg)
+{
+    if (pg->id != NULL)
+    {
+	TRACE_ERR("free pg id =%p pg=%p\n", pg->id, pg);
+	MPIU_Free(pg->id);
+	TRACE_ERR("done free pg id \n");
+    }
+
+    return MPI_SUCCESS;
+}
+
+
+int MPIDI_InitPG( int *argc, char ***argv,
+	          int *has_args, int *has_env, int *has_parent,
+	          int *pg_rank_p, MPIDI_PG_t **pg_p )
+{
+    int pmi_errno;
+    int mpi_errno = MPI_SUCCESS;
+    int pg_rank, pg_size, appnum, pg_id_sz;
+    int usePMI=1;
+    char *pg_id;
+    MPIDI_PG_t *pg = 0;
+
+    /* If we use PMI here, make the PMI calls to get the
+       basic values.  Note that systems that return setvals == true
+       do not make use of PMI for the KVS routines either (it is
+       assumed that they discover connection information through some
+       other mechanism) */
+    /* FIXME: We may want to allow the channel to ifdef out the use
+       of PMI calls, or ask the channel to provide stubs that
+       return errors if the routines are in fact used */
+    if (usePMI) {
+	/*
+	 * Initialize the process management interface (PMI),
+	 * and get rank and size information about our process group
+	 */
+
+#ifdef USE_PMI2_API
+	TRACE_ERR("Calling PMI2_Init\n");
+        mpi_errno = PMI2_Init(has_parent, &pg_size, &pg_rank, &appnum);
+	TRACE_ERR("PMI2_Init - pg_size=%d pg_rank=%d\n", pg_size, pg_rank);
+        /*if (mpi_errno) MPIU_ERR_POP(mpi_errno);*/
+#else
+	TRACE_ERR("Calling PMI_Init\n");
+	pmi_errno = PMI_Init(has_parent);
+	if (pmi_errno != PMI_SUCCESS) {
+	/*    MPIU_ERR_SETANDJUMP1(mpi_errno,MPI_ERR_OTHER, "**pmi_init",
+			     "**pmi_init %d", pmi_errno); */
+	}
+
+	pmi_errno = PMI_Get_rank(&pg_rank);
+	if (pmi_errno != PMI_SUCCESS) {
+	    /*MPIU_ERR_SETANDJUMP1(mpi_errno,MPI_ERR_OTHER, "**pmi_get_rank",
+			     "**pmi_get_rank %d", pmi_errno); */
+	}
+
+	pmi_errno = PMI_Get_size(&pg_size);
+	if (pmi_errno != 0) {
+	/*MPIU_ERR_SETANDJUMP1(mpi_errno,MPI_ERR_OTHER, "**pmi_get_size",
+			     "**pmi_get_size %d", pmi_errno);*/
+	}
+
+	pmi_errno = PMI_Get_appnum(&appnum);
+	if (pmi_errno != PMI_SUCCESS) {
+/*	    MPIU_ERR_SETANDJUMP1(mpi_errno,MPI_ERR_OTHER, "**pmi_get_appnum",
+				 "**pmi_get_appnum %d", pmi_errno); */
+	}
+#endif
+	/* Note that if pmi is not available, the value of MPI_APPNUM is
+	   not set */
+	if (appnum != -1) {
+	    MPIR_Process.attrs.appnum = appnum;
+	}
+
+#ifdef USE_PMI2_API
+
+        /* This memory will be freed by the PG_Destroy if there is an error */
+	pg_id = MPIU_Malloc(MAX_JOBID_LEN);
+
+        mpi_errno = PMI2_Job_GetId(pg_id, MAX_JOBID_LEN);
+	TRACE_ERR("PMI2_Job_GetId - pg_id=%s\n", pg_id);
+#else
+	/* Now, initialize the process group information with PMI calls */
+	/*
+	 * Get the process group id
+	 */
+	pmi_errno = PMI_KVS_Get_name_length_max(&pg_id_sz);
+	if (pmi_errno != PMI_SUCCESS) {
+          TRACE_ERR("PMI_KVS_Get_name_length_max returned with pmi_errno=%d\n", pmi_errno);
+	}
+
+	/* This memory will be freed by the PG_Destroy if there is an error */
+	pg_id = MPIU_Malloc(pg_id_sz + 1);
+
+	/* Note in the singleton init case, the pg_id is a dummy.
+	   We'll want to replace this value if we join a
+	   process manager */
+	pmi_errno = PMI_KVS_Get_my_name(pg_id, pg_id_sz);
+	if (pmi_errno != PMI_SUCCESS) {
+          TRACE_ERR("PMI_KVS_Get_my_name returned with pmi_errno=%d\n", pmi_errno);
+	}
+#endif
+    }
+    else {
+	/* Create a default pg id */
+	pg_id = MPIU_Malloc(2);
+	MPIU_Strncpy( pg_id, "0", 2 );
+    }
+
+	TRACE_ERR("pg_size=%d pg_id=%s\n", pg_size, pg_id);
+    /*
+     * Initialize the process group tracking subsystem
+     */
+    mpi_errno = MPIDI_PG_Init(argc, argv,
+			     MPIDI_PG_Compare_ids, MPIDI_PG_Destroy_id);
+    if (mpi_errno != MPI_SUCCESS) {
+      TRACE_ERR("MPIDI_PG_Init returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    /*
+     * Create a new structure to track the process group for our MPI_COMM_WORLD
+     */
+    TRACE_ERR("pg_size=%d pg_id=%p pg_id=%s\n", pg_size, pg_id, pg_id);
+    mpi_errno = MPIDI_PG_Create(pg_size, pg_id, &pg);
+    MPIU_Free(pg_id);
+    if (mpi_errno != MPI_SUCCESS) {
+      TRACE_ERR("MPIDI_PG_Create returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    /* FIXME: We can allow the channels to tell the PG how to get
+       connection information by passing the pg to the channel init routine */
+    if (usePMI) {
+	/* Tell the process group how to get connection information */
+        mpi_errno = MPIDI_PG_InitConnKVS( pg );
+        if (mpi_errno)
+          TRACE_ERR("MPIDI_PG_InitConnKVS returned with mpi_errno=%d\n", mpi_errno);
+    }
+
+    /* FIXME: has_args and has_env need to come from PMI eventually... */
+    *has_args = TRUE;
+    *has_env  = TRUE;
+
+    *pg_p      = pg;
+    *pg_rank_p = pg_rank;
+
+ fn_exit:
+    return mpi_errno;
+ fn_fail:
+    /* --BEGIN ERROR HANDLING-- */
+    if (pg) {
+	MPIDI_PG_Destroy( pg );
+    }
+    goto fn_exit;
+    /* --END ERROR HANDLING-- */
+}
+#endif
diff --git a/src/mpid/pamid/src/mpid_vc.c b/src/mpid/pamid/src/mpid_vc.c
index 7cdac0e..f051c78 100644
--- a/src/mpid/pamid/src/mpid_vc.c
+++ b/src/mpid/pamid/src/mpid_vc.c
@@ -26,6 +26,7 @@
 
 #include <mpidimpl.h>
 
+extern int mpidi_dynamic_tasking;
 /**
  * \brief Virtual connection reference table
  */
@@ -39,22 +40,32 @@ struct MPIDI_VCRT
 
 int MPID_VCR_Dup(MPID_VCR orig_vcr, MPID_VCR * new_vcr)
 {
+#ifdef DYNAMIC_TASKING
+    if(mpidi_dynamic_tasking) {
+      if (orig_vcr->pg) {
+        MPIDI_PG_add_ref( orig_vcr->pg );
+      }
+    }
+#endif
+
     *new_vcr = orig_vcr;
     return MPI_SUCCESS;
 }
 
 int MPID_VCR_Get_lpid(MPID_VCR vcr, int * lpid_ptr)
 {
-    *lpid_ptr = (int)vcr;
+    *lpid_ptr = (int)(vcr->taskid);
     return MPI_SUCCESS;
 }
 
 int MPID_VCRT_Create(int size, MPID_VCRT *vcrt_ptr)
 {
     struct MPIDI_VCRT * vcrt;
-    int result;
+    int i,result;
 
     vcrt = MPIU_Malloc(sizeof(struct MPIDI_VCRT) + size*sizeof(MPID_VCR));
+    for(i = 0; vcrt != NULL && i < size; i++)
+	vcrt->vcr_table[i] = MPIU_Malloc(sizeof(struct MPID_VCR_t));
     if (vcrt != NULL)
     {
         MPIU_Object_set_ref(vcrt, 1);
@@ -77,11 +88,41 @@ int MPID_VCRT_Add_ref(MPID_VCRT vcrt)
 
 int MPID_VCRT_Release(MPID_VCRT vcrt, int isDisconnect)
 {
-    int count;
+    int count, i, inuse;
 
     MPIU_Object_release_ref(vcrt, &count);
-    if (count == 0)
-      MPIU_Free(vcrt);
+
+    if (count == 0) {
+#ifdef DYNAMIC_TASKING
+     if(mpidi_dynamic_tasking) {
+      for (i = 0; i < vcrt->size; i++)
+        {
+          MPID_VCR const vcr = vcrt->vcr_table[i];
+            if (vcr->pg == MPIDI_Process.my_pg &&
+                vcr->pg_rank == MPIDI_Process.my_pg_rank)
+              {
+                inuse=MPIU_Object_get_ref(vcr->pg);
+	        TRACE_ERR("before MPIDI_PG_release_ref on vcr=%x pg=%x pg=%s inuse=%d\n", vcr, vcr->pg, (vcr->pg)->id, inuse);
+	        TRACE_ERR("before MPIDI_PG_release_ref on vcr=%x pg=%x pg=%s inuse=%d\n", vcr, vcr->pg, (vcr->pg)->id, inuse);
+                MPIDI_PG_release_ref(vcr->pg, &inuse);
+	        TRACE_ERR("MPIDI_PG_release_ref on pg=%s inuse=%d\n", (vcr->pg)->id, inuse);
+                if (inuse == 0)
+                 {
+                   MPIDI_PG_Destroy(vcr->pg);
+                 }
+                 continue;
+              }
+            inuse=MPIU_Object_get_ref(vcr->pg);
+
+            MPIDI_PG_release_ref(vcr->pg, &inuse);
+            if (inuse == 0)
+              MPIDI_PG_Destroy(vcr->pg);
+	/*MPIU_Free(vcrt->vcr_table[i]);*/
+       }
+     } /** CHECK */
+#endif
+     MPIU_Free(vcrt);
+    }
     return MPI_SUCCESS;
 }
 
@@ -90,3 +131,207 @@ int MPID_VCRT_Get_ptr(MPID_VCRT vcrt, MPID_VCR **vc_pptr)
     *vc_pptr = vcrt->vcr_table;
     return MPI_SUCCESS;
 }
+
+#ifdef DYNAMIC_TASKING
+int MPID_VCR_CommFromLpids( MPID_Comm *newcomm_ptr,
+			    int size, const int lpids[] )
+{
+    int mpi_errno = MPI_SUCCESS;
+    MPID_Comm *commworld_ptr;
+    int i;
+    MPIDI_PG_iterator iter;
+
+    commworld_ptr = MPIR_Process.comm_world;
+    /* Setup the communicator's vc table: remote group */
+    MPID_VCRT_Create( size, &newcomm_ptr->vcrt );
+    MPID_VCRT_Get_ptr( newcomm_ptr->vcrt, &newcomm_ptr->vcr );
+    if(mpidi_dynamic_tasking) {
+    for (i=0; i<size; i++) {
+	MPID_VCR *vc = 0;
+
+	/* For rank i in the new communicator, find the corresponding
+	   virtual connection.  For lpids less than the size of comm_world,
+	   we can just take the corresponding entry from comm_world.
+	   Otherwise, we need to search through the process groups.
+	*/
+	/* printf( "[%d] Remote rank %d has lpid %d\n",
+	   MPIR_Process.comm_world->rank, i, lpids[i] ); */
+	if (lpids[i] < commworld_ptr->remote_size) {
+	    *vc = commworld_ptr->vcr[lpids[i]];
+	}
+	else {
+	    /* We must find the corresponding vcr for a given lpid */
+	    /* For now, this means iterating through the process groups */
+	    MPIDI_PG_t *pg = 0;
+	    int j;
+
+	    MPIDI_PG_Get_iterator(&iter);
+	    /* Skip comm_world */
+	    MPIDI_PG_Get_next( &iter, &pg );
+	    do {
+		MPIDI_PG_Get_next( &iter, &pg );
+                /*MPIU_ERR_CHKINTERNAL(!pg, mpi_errno, "no pg"); */
+		/* FIXME: a quick check on the min/max values of the lpid
+		   for this process group could help speed this search */
+		for (j=0; j<pg->size; j++) {
+		    /*printf( "Checking lpid %d against %d in pg %s\n",
+			    lpids[i], pg->vct[j].lpid, (char *)pg->id );
+			    fflush(stdout); */
+		    if (pg->vct[j].taskid == lpids[i]) {
+			vc = &pg->vct[j];
+			/*printf( "found vc %x for lpid = %d in another pg\n",
+			  (int)vc, lpids[i] );*/
+			break;
+		    }
+		}
+	    } while (!vc);
+	}
+
+	/* printf( "about to dup vc %x for lpid = %d in another pg\n",
+	   (int)vc, lpids[i] ); */
+	/* Note that this will increment the ref count for the associated
+	   PG if necessary.  */
+	MPID_VCR_Dup( *vc, &newcomm_ptr->vcr[i] );
+    }
+    } else  {
+    for (i=0; i<size; i++) {
+        /* For rank i in the new communicator, find the corresponding
+           rank in the comm world (FIXME FOR MPI2) */
+        /* printf( "[%d] Remote rank %d has lpid %d\n",
+           MPIR_Process.comm_world->rank, i, lpids[i] ); */
+        if (lpids[i] < commworld_ptr->remote_size) {
+            MPID_VCR_Dup( commworld_ptr->vcr[lpids[i]],
+                          &newcomm_ptr->vcr[i] );
+        }
+        else {
+            /* We must find the corresponding vcr for a given lpid */
+            /* FIXME: Error */
+            return 1;
+            /* MPID_VCR_Dup( ???, &newcomm_ptr->vcr[i] ); */
+        }
+    }
+
+    }
+fn_exit:
+    return mpi_errno;
+fn_fail:
+    goto fn_exit;
+}
+
+/*
+ * The following is a very simple code for looping through
+ * the GPIDs.  Note that this code requires that all processes
+ * have information on the process groups.
+ */
+int MPID_GPID_ToLpidArray( int size, int gpid[], int lpid[] )
+{
+    int i, mpi_errno = MPI_SUCCESS;
+    int pgid;
+    MPIDI_PG_t *pg = 0;
+    MPIDI_PG_iterator iter;
+
+    if(mpidi_dynamic_tasking) {
+    for (i=0; i<size; i++) {
+        MPIDI_PG_Get_iterator(&iter);
+	do {
+	    MPIDI_PG_Get_next( &iter, &pg );
+	    if (!pg) {
+		/* Internal error.  This gpid is unknown on this process */
+		TRACE_ERR("No matching pg found for id = %d\n", pgid );
+		lpid[i] = -1;
+		/*MPIU_ERR_SET2(mpi_errno,MPI_ERR_INTERN, "**unknowngpid",
+			      "**unknowngpid %d %d", gpid[0], gpid[1] ); */
+		return mpi_errno;
+	    }
+	    MPIDI_PG_IdToNum( pg, &pgid );
+
+	    if (pgid == gpid[0]) {
+		/* found the process group.  gpid[1] is the rank in
+		   this process group */
+		/* Sanity check on size */
+		TRACE_ERR("found the process group for id = %d\n", pgid );
+		TRACE_ERR("pg->size = %d gpid[1]=%d\n", pg->size, gpid[1] );
+		if (pg->size > gpid[1]) {
+		    TRACE_ERR("pg->vct[gpid[1]].taskid = %d\n", pg->vct[gpid[1]].taskid );
+		    lpid[i] = pg->vct[gpid[1]].taskid;
+		}
+		else {
+		    lpid[i] = -1;
+		    /*MPIU_ERR_SET2(mpi_errno,MPI_ERR_INTERN, "**unknowngpid",
+				  "**unknowngpid %d %d", gpid[0], gpid[1] ); */
+		    return mpi_errno;
+		}
+		/* printf( "lpid[%d] = %d for gpid = (%d)%d\n", i, lpid[i],
+		   gpid[0], gpid[1] ); */
+		break;
+	    }
+	} while (1);
+	gpid += 2;
+    }
+    } else {
+    for (i=0; i<size; i++) {
+        /* gpids are stored as (pgid, rank) pairs; skip the pgid, take the rank */
+        lpid[i] = *++gpid;  gpid++;
+    }
+    return 0;
+
+    }
+
+    return mpi_errno;
+}
+/*
+ * The following routines convert to/from the global pids, which are
+ * represented as pairs of ints (process group id, rank in that process group)
+ */
+
+/* FIXME: These routines belong in a different place */
+int MPID_GPID_GetAllInComm( MPID_Comm *comm_ptr, int local_size,
+			    int local_gpids[], int *singlePG )
+{
+    int mpi_errno = MPI_SUCCESS;
+    int i;
+    int *gpid = local_gpids;
+    int lastPGID = -1, pgid;
+    MPID_VCR vc;
+
+    MPIU_Assert(comm_ptr->local_size == local_size);
+
+    if(mpidi_dynamic_tasking) {
+    *singlePG = 1;
+    for (i=0; i<comm_ptr->local_size; i++) {
+	vc = comm_ptr->vcr[i];
+
+	/* Get the process group id as an int */
+	MPIDI_PG_IdToNum( vc->pg, &pgid );
+
+	*gpid++ = pgid;
+	if (lastPGID != pgid) {
+	    if (lastPGID != -1)
+		*singlePG = 0;
+	    lastPGID = pgid;
+	}
+	*gpid++ = vc->pg_rank;
+
+        MPIU_DBG_MSG_FMT(COMM,VERBOSE, (MPIU_DBG_FDEST,
+                         "pgid=%d vc->pg_rank=%d",
+                         pgid, vc->pg_rank));
+    }
+    } else {
+    for (i=0; i<comm_ptr->local_size; i++) {
+        *gpid++ = 0;
+        (void)MPID_VCR_Get_lpid( comm_ptr->vcr[i], gpid );
+        gpid++;
+    }
+    *singlePG = 1;
+
+    }
+
+    return mpi_errno;
+}
+
+
+int MPIDI_VC_Init( MPID_VCR vcr, MPIDI_PG_t *pg, int rank )
+{
+    vcr->pg      = pg;
+    vcr->pg_rank = rank;
+    return MPI_SUCCESS;
+}
+#endif
diff --git a/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c b/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c
index 3e043e4..33095ca 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c
+++ b/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c
@@ -403,7 +403,7 @@ MPIDI_SendMsg(pami_context_t   context,
   pami_endpoint_t dest;
   MPIDI_Context_endpoint(sreq, &dest);
   pami_task_t  dest_tid;
-  dest_tid=sreq->comm->vcr[rank];
+  dest_tid=sreq->comm->vcr[rank]->taskid;
 #if (MPIDI_STATISTICS)
   MPID_NSTAT(mpid_statp->sends);
 #endif
diff --git a/src/mpid/pamid/subconfigure.m4 b/src/mpid/pamid/subconfigure.m4
index 723bfce..017f7f1 100644
--- a/src/mpid/pamid/subconfigure.m4
+++ b/src/mpid/pamid/subconfigure.m4
@@ -20,6 +20,10 @@ pamid_platform=${device_args}
 
 # Set a value for the maximum processor name.
 MPID_MAX_PROCESSOR_NAME=128
+PM_REQUIRES_PMI=pmi2
+if test "${pamid_platform}" = "PE" ; then
+        PM_REQUIRES_PMI=pmi2/poe
+fi
 
 MPID_DEVICE_TIMER_TYPE=double
 MPID_MAX_THREAD_LEVEL=MPI_THREAD_MULTIPLE
diff --git a/src/pm/hydra/mpichprereq b/src/pm/hydra/mpichprereq
index 2f3451d..bfbf778 100755
--- a/src/pm/hydra/mpichprereq
+++ b/src/pm/hydra/mpichprereq
@@ -5,12 +5,12 @@
 
 
 if test -z "$PM_REQUIRES_PMI" ; then
-    if test "$with_pmi" = "pmi2" -o "$with_pmi" = "simple" ; then
+    if test "$with_pmi" = "pmi2/simple" -o "$with_pmi" = "simple" ; then
         PM_REQUIRES_PMI=$with_pmi
     else
 	PM_REQUIRES_PMI=simple
     fi
-elif test "$PM_REQUIRES_PMI" != "simple" -a "$PM_REQUIRES_PMI" != "pmi2" ; then
+elif test "$PM_REQUIRES_PMI" != "simple" -a "$PM_REQUIRES_PMI" != "pmi2/simple" -a "$PM_REQUIRES_PMI" != "pmi2/poe"; then
     echo "hydra requires the \"simple\", \"pmi2/simple\", or \"pmi2/poe\" PMI implementation; \"$PM_REQUIRES_PMI\" has already been selected"
     exit 1
 fi
diff --git a/src/pmi/pmi2/Makefile.mk b/src/pmi/pmi2/Makefile.mk
index 1f4ea2d..4000a38 100644
--- a/src/pmi/pmi2/Makefile.mk
+++ b/src/pmi/pmi2/Makefile.mk
@@ -5,18 +5,5 @@
 ##     See COPYRIGHT in top-level directory.
 ##
 
-if BUILD_PMI_PMI2
-
-lib_lib at MPILIBNAME@_la_SOURCES += \
-    src/pmi/pmi2/simple2pmi.c     \
-    src/pmi/pmi2/simple_pmiutil.c
-
-noinst_HEADERS +=                 \
-    src/pmi/pmi2/simple_pmiutil.h \
-    src/pmi/pmi2/simple2pmi.h     \
-    src/pmi/pmi2/pmi2compat.h
-
-AM_CPPFLAGS += -I$(top_srcdir)/src/pmi/pmi2
-
-endif BUILD_PMI_PMI2
-
+include $(top_srcdir)/src/pmi/pmi2/poe/Makefile.mk
+include $(top_srcdir)/src/pmi/pmi2/simple/Makefile.mk
diff --git a/src/pmi/pmi2/poe/Makefile.mk b/src/pmi/pmi2/poe/Makefile.mk
new file mode 100644
index 0000000..a10fb21
--- /dev/null
+++ b/src/pmi/pmi2/poe/Makefile.mk
@@ -0,0 +1,15 @@
+## -*- Mode: Makefile; -*-
+## vim: set ft=automake :
+##
+## (C) 2011 by Argonne National Laboratory.
+##     See COPYRIGHT in top-level directory.
+##
+
+if BUILD_PMI_PMI2_POE
+
+lib_lib at MPILIBNAME@_la_SOURCES += \
+    src/pmi/pmi2/poe/poe2pmi.c
+
+AM_CPPFLAGS += -I$(top_srcdir)/src/pmi/pmi2/poe
+
+endif BUILD_PMI_PMI2_POE
diff --git a/src/pmi/pmi2/poe/poe2pmi.c b/src/pmi/pmi2/poe/poe2pmi.c
new file mode 100644
index 0000000..ca68e18
--- /dev/null
+++ b/src/pmi/pmi2/poe/poe2pmi.c
@@ -0,0 +1,325 @@
+/* -*- Mode: C; c-basic-offset:4 ; -*- */
+/*
+ *  (C) 2007 by Argonne National Laboratory.
+ *      See COPYRIGHT in top-level directory.
+ */
+
+#include <dlfcn.h>
+#include "mpichconf.h"
+#include "pmi2.h"
+#include "mpiimpl.h"
+
+#include <stdio.h>
+#ifdef HAVE_UNISTD_H
+#include <unistd.h>
+#endif
+#ifdef HAVE_STDLIB_H
+#include <stdlib.h>
+#endif
+#ifdef HAVE_STRING_H
+#include <string.h>
+#endif
+#ifdef HAVE_STRINGS_H
+#include <strings.h>
+#endif
+#if defined(HAVE_SYS_SOCKET_H)
+#include <sys/socket.h>
+#endif
+
+#ifdef USE_PMI_PORT
+#ifndef MAXHOSTNAME
+#define MAXHOSTNAME 256
+#endif
+#endif
+
+#define PMII_EXIT_CODE -1
+
+#define PMI_VERSION    2
+#define PMI_SUBVERSION 0
+
+#define MAX_INT_STR_LEN 11 /* number of digits in MAX_UINT + 1 */
+
+int (*mp_world_exiting_handler)(int) = NULL;
+typedef enum { PMI2_UNINITIALIZED = 0, NORMAL_INIT_WITH_PM = 1 } PMI2State;
+static PMI2State PMI2_initialized = PMI2_UNINITIALIZED;
+
+static int PMI2_debug = 0;
+static int PMI2_fd = -1;
+static int PMI2_size = 1;
+static int PMI2_rank = 0;
+
+static int PMI2_debug_init = 0;    /* Set this to true to debug the init */
+
+int PMI2_pmiverbose = 0;    /* Set this to true to print PMI debugging info */
+
+#ifdef MPICH_IS_THREADED
+static MPID_Thread_mutex_t mutex;
+static int blocked = FALSE;
+static MPID_Thread_cond_t cond;
+#endif
+
+extern int mpidi_finalized;
+extern int (*mp_world_exiting_handler)(int);
+extern int _mpi_world_exiting_handler(int);
+
+void *poeptr = NULL;
+
+/* ------------------------------------------------------------------------- */
+/* PMI API Routines */
+/* ------------------------------------------------------------------------- */
+int PMI2_Init(int *spawned, int *size, int *rank, int *appnum)
+{
+    int pmi2_errno = PMI2_SUCCESS;
+    char *p;
+    char *jobid;
+    char *pmiid;
+    int ret;
+
+    int (*pmi2_init)(int*, int*, int *, int*);
+
+    poeptr = dlopen("libpoe.so", RTLD_NOW|RTLD_GLOBAL);
+    if (poeptr == NULL) {
+        TRACE_ERR("failed to open libpoe.so\n");
+        return PMI2_FAIL;
+    }
+
+    mp_world_exiting_handler = &(_mpi_world_exiting_handler);
+
+    pmi2_init = (int (*)())dlsym(poeptr, "PMI2_Init");
+    if (pmi2_init == NULL) {
+        TRACE_ERR("failed to dlsym PMI2_Init\n");
+    }
+
+    return (*pmi2_init)(spawned, size, rank, appnum);
+}
+
+int PMI2_Finalize(void)
+{
+    int pmi2_errno = PMI2_SUCCESS;
+    int rc;
+    const char *errmsg;
+
+    int (*pmi2_finalize)(void);
+
+    pmi2_finalize = (int (*)())dlsym(poeptr, "PMI2_Finalize");
+    if (pmi2_finalize == NULL) {
+        TRACE_ERR("failed to dlsym PMI2_Finalize\n");
+    }
+
+    return (*pmi2_finalize)();
+
+}
+
+int PMI2_Initialized(void)
+{
+    /* Turn this into a logical value (1 or 0).  This allows us
+       to use PMI2_initialized to distinguish between initialization with
+       a PMI service (e.g., via mpiexec) and the singleton init,
+       which has no PMI service */
+    return PMI2_initialized != 0;
+}
+
+int PMI2_Abort( int flag, const char msg[] )
+{
+    int (*pmi2_abort)(int, const char*);
+
+    pmi2_abort = (int (*)())dlsym(poeptr, "PMI2_Abort");
+    if (pmi2_abort == NULL) {
+        TRACE_ERR("failed to dlsym pmi2_abort\n");
+    }
+
+    return (*pmi2_abort)(flag, msg);
+}
+
+int PMI2_Job_Spawn(int count, const char * cmds[],
+                   int argcs[], const char ** argvs[],
+                   const int maxprocs[],
+                   const int info_keyval_sizes[],
+                   const struct MPID_Info *info_keyval_vectors[],
+                   int preput_keyval_size,
+                   const struct MPID_Info *preput_keyval_vector[],
+                   char jobId[], int jobIdSize,
+                   int errors[])
+{
+    int  i,rc,spawncnt,total_num_processes,num_errcodes_found;
+    int found;
+    const char *jid;
+    int jidlen;
+    char *lead, *lag;
+    int spawn_rc;
+    const char *errmsg = NULL;
+    int pmi2_errno = 0;
+
+    int (*pmi2_job_spawn)(int , const char * [], int [], const char ** [],const int [],const int [],const struct MPID_Info *[],int ,const struct MPID_Info *[],char jobId[],int ,int []);
+
+    pmi2_job_spawn = (int (*)())dlsym(poeptr, "PMI2_Job_Spawn");
+    if (pmi2_job_spawn == NULL) {
+        TRACE_ERR("failed to dlsym pmi2_job_spawn\n");
+    }
+
+    return (*pmi2_job_spawn)(count, cmds, argcs, argvs, maxprocs,
+                             info_keyval_sizes, info_keyval_vectors,
+                             preput_keyval_size, preput_keyval_vector,
+                             jobId, jobIdSize, errors);
+
+}
+
+int PMI2_Job_GetId(char jobid[], int jobid_size)
+{
+    int pmi2_errno = PMI2_SUCCESS;
+    int found;
+    const char *jid;
+    int jidlen;
+    int rc;
+    const char *errmsg;
+
+    int (*pmi2_job_getid)(char*, int);
+
+    pmi2_job_getid = (int (*)())dlsym(poeptr, "PMI2_Job_GetId");
+    if (pmi2_job_getid == NULL) {
+        TRACE_ERR("failed to dlsym pmi2_job_getid\n");
+    }
+
+    return (*pmi2_job_getid)(jobid, jobid_size);
+}
+
+
+int PMI2_KVS_Put(const char key[], const char value[])
+{
+    int pmi2_errno = PMI2_SUCCESS;
+    int rc;
+    const char *errmsg;
+
+    int (*pmi2_kvs_put)(const char*, const char*);
+
+    pmi2_kvs_put = (int (*)())dlsym(poeptr, "PMI2_KVS_Put");
+    if (pmi2_kvs_put == NULL) {
+        TRACE_ERR("failed to dlsym pmi2_kvs_put\n");
+    }
+
+    return (*pmi2_kvs_put)(key, value);
+}
+
+int PMI2_KVS_Fence(void)
+{
+    int pmi2_errno = PMI2_SUCCESS;
+    int rc;
+    const char *errmsg;
+
+    int (*pmi2_kvs_fence)(void);
+
+    pmi2_kvs_fence = (int (*)())dlsym(poeptr, "PMI2_KVS_Fence");
+    if (pmi2_kvs_fence == NULL) {
+        TRACE_ERR("failed to dlsym pmi2_kvs_fence\n");
+    }
+
+    return (*pmi2_kvs_fence)();
+}
+
+int PMI2_KVS_Get(const char *jobid, int src_pmi_id, const char key[], char value [], int maxValue, int *valLen)
+{
+    int pmi2_errno = PMI2_SUCCESS;
+    int found, keyfound;
+    const char *kvsvalue;
+    int kvsvallen;
+    int ret;
+    int rc;
+    char src_pmi_id_str[256];
+    const char *errmsg;
+
+    int (*pmi2_kvs_get)(const char*, int, const char *, char *, int, int*);
+
+    pmi2_kvs_get = (int (*)())dlsym(poeptr, "PMI2_KVS_Get");
+    if (pmi2_kvs_get == NULL) {
+        TRACE_ERR("failed to dlsym pmi2_kvs_get\n");
+    }
+
+    return (*pmi2_kvs_get)(jobid, src_pmi_id, key, value, maxValue, valLen);
+}
+
+
+int PMI2_Info_GetJobAttr(const char name[], char value[], int valuelen, int *flag)
+{
+    int pmi2_errno = PMI2_SUCCESS;
+    int found;
+    const char *kvsvalue;
+    int kvsvallen;
+    int rc;
+    const char *errmsg;
+
+    int (*pmi2_info_getjobattr)(const char*, char *, int, int*);
+
+    pmi2_info_getjobattr = (int (*)())dlsym(poeptr, "PMI2_Info_GetJobAttr");
+    if (pmi2_info_getjobattr == NULL) {
+        TRACE_ERR("failed to dlsym pmi2_info_getjobattr\n");
+    }
+
+    return (*pmi2_info_getjobattr)(name, value, valuelen, flag);
+}
+
+
+/**
+ * This is the MPI-level callback that gets invoked when a task is
+ * notified of an exiting world.
+ */
+int _mpi_world_exiting_handler(int world_id)
+{
+  /* Check the reference count associated with that remote world.
+     If the reference count is zero, the task will call LAPI_Purge_totask on
+     all tasks in that world and reset MPCI.  It will also remove the world
+     structure corresponding to that world ID.
+     If the reference count is not zero, it should call STOPALL.
+  */
+  int rc,ref_count = -1;
+  int  *taskid_list = NULL;
+  int i;
+  int my_state=FALSE,reduce_state=FALSE;
+  char world_id_str[32];
+  int mpi_errno = MPI_SUCCESS;
+  pami_endpoint_t dest;
+
+  if(!mpidi_finalized) {
+    ref_count = MPIDI_get_refcnt_of_world(world_id);
+    TRACE_ERR("_mpi_world_exiting_handler: invoked for world %d exiting ref_count=%d\n", world_id, ref_count);
+    if(ref_count == 0) {
+      taskid_list = MPIDI_get_taskids_in_world_id(world_id);
+      if(taskid_list != NULL) {
+        for(i=0;taskid_list[i]!=-1;i++) {
+          PAMI_Endpoint_create(MPIDI_Client, taskid_list[i], 0, &dest);
+	  MPIDI_OpState_reset(taskid_list[i]);
+	  MPIDI_IpState_reset(taskid_list[i]);
+	  TRACE_ERR("PAMI_Purge on taskid_list[%d]=%d\n", i,taskid_list[i]);
+	  if(MPIDI_Context[0])
+            PAMI_Purge(MPIDI_Context[0], &dest, 1);
+        }
+        MPIDI_delete_conn_record(world_id);
+      }
+      rc = -1;
+    }
+    my_state = TRUE;
+
+/*  _mpi_reduce_for_dyntask(&my_state, &reduce_state); */
+    if(MPIDI_Context[0])
+      MPIR_Reduce_impl(&my_state,&reduce_state,1,
+                       MPI_INT,MPI_LAND,0,MPIR_Process.comm_world,&mpi_errno);
+    TRACE_ERR("_mpi_world_exiting_handler: Out of _mpi_reduce_for_dyntask for exiting world %d reduce_state=%d\n",world_id, reduce_state);
+  }
+
+  if(MPIR_Process.comm_world->rank == 0) {
+    MPIU_Snprintf(world_id_str, sizeof(world_id_str), "%d", world_id);
+    PMI2_Abort(0, world_id_str);
+/*    _mp_send_exiting_ack(world_id); */
+    if(MPIDI_Context[0] && (reduce_state != TRUE)) {
+      TRACE_ERR("root is exiting with error\n");
+      exit(-1);
+    }
+    TRACE_ERR("_mpi_world_exiting_handler: Root finished sending SSM_WORLD_EXITING to POE for exiting world %d\n",world_id);
+  }
+
+  if(ref_count != 0) {
+    TRACE_ERR("STOPALL is sent by task %d\n", PAMIX_Client_query(MPIDI_Client, PAMI_CLIENT_TASK_ID  ).value.intval);
+    PMI2_Abort(1, "STOPALL should be sent");
+    rc = -2;
+  }
+
+  return rc;
+}
diff --git a/src/pmi/pmi2/poe/subconfigure.m4 b/src/pmi/pmi2/poe/subconfigure.m4
new file mode 100644
index 0000000..1b10221
--- /dev/null
+++ b/src/pmi/pmi2/poe/subconfigure.m4
@@ -0,0 +1,24 @@
+[#] start of __file__
+dnl MPICH2_SUBCFG_AFTER=src/pmi
+
+AC_DEFUN([PAC_SUBCFG_PREREQ_]PAC_SUBCFG_AUTO_SUFFIX,[
+])
+
+AC_DEFUN([PAC_SUBCFG_BODY_]PAC_SUBCFG_AUTO_SUFFIX,[
+
+AM_CONDITIONAL([BUILD_PMI_PMI2_POE],[test "x$pmi_name" = "xpmi2/poe"])
+
+AM_COND_IF([BUILD_PMI_PMI2_POE],[
+if test "$enable_pmiport" != "no" ; then
+   enable_pmiport=yes
+fi
+
+dnl causes USE_PMI2_API to be AC_DEFINE'ed by the top-level configure.ac
+USE_PMI2_API=yes
+
+PAC_C_GNU_ATTRIBUTE
+])dnl end COND_IF
+
+])dnl end BODY macro
+
+[#] end of __file__
diff --git a/src/pmi/pmi2/simple/Makefile.mk b/src/pmi/pmi2/simple/Makefile.mk
new file mode 100644
index 0000000..4336dd5
--- /dev/null
+++ b/src/pmi/pmi2/simple/Makefile.mk
@@ -0,0 +1,21 @@
+## -*- Mode: Makefile; -*-
+## vim: set ft=automake :
+##
+## (C) 2011 by Argonne National Laboratory.
+##     See COPYRIGHT in top-level directory.
+##
+
+if BUILD_PMI_PMI2_SIMPLE
+
+lib_lib at MPILIBNAME@_la_SOURCES += \
+    src/pmi/pmi2/simple/simple2pmi.c     \
+    src/pmi/pmi2/simple/simple_pmiutil.c
+
+noinst_HEADERS +=                 \
+    src/pmi/pmi2/simple/simple_pmiutil.h \
+    src/pmi/pmi2/simple/simple2pmi.h     \
+    src/pmi/pmi2/simple/pmi2compat.h
+
+AM_CPPFLAGS += -I$(top_srcdir)/src/pmi/pmi2/simple
+
+endif BUILD_PMI_PMI2_SIMPLE
diff --git a/src/pmi/pmi2/README b/src/pmi/pmi2/simple/README
similarity index 95%
rename from src/pmi/pmi2/README
rename to src/pmi/pmi2/simple/README
index 3cfd2fe..9ccf5a0 100644
--- a/src/pmi/pmi2/README
+++ b/src/pmi/pmi2/simple/README
@@ -4,13 +4,13 @@ PMI version 1.  This version is not yet in use in MPICH, and is
 being developed to prototype the changes proposed.  In particular, this version
 adds support for better thread-safe behavior.
 
-Currently, the source files simply define the interfaces.  There is no 
+Currently, the source files simply define the interfaces.  There is no
 implementation yet.  The files have been added to the repository so that they
-can be reviewed by interested parties, particularly 3rd-parties that 
+can be reviewed by interested parties, particularly 3rd-parties that
 will need to interface to this second-generation interface.
 
 A major issue that PMI version 2 needs to address is thread-safety and
-responsiveness.  In particular, no thread that is waiting on a PMI call can 
+responsiveness.  In particular, no thread that is waiting on a PMI call can
 block other threads, even PMI_Spawn.  The design sketched out here
 addresses these issues as well as providing a relatively simple model
 for thread interaction.
@@ -30,7 +30,7 @@ exit-atomic
 During the "waitfor" operation, other threads may enter this atomic
 section.
 
-Here is a sketch of the implementation of "waitfor" that makes use of 
+Here is a sketch of the implementation of "waitfor" that makes use of
 the Posix condition variable to block the threads that are waiting for
 a response.
 
@@ -57,7 +57,7 @@ int wait-function( int *flag, void *socket_set )
 do {
      poll( socket_set ... );
      for each active fd
-          process fd (may complete some operations). 
+          process fd (may complete some operations).
     } while (!*flag);
 }
 
@@ -73,7 +73,7 @@ waitfor( ... )
 
 and the enter/exit atomic are no-ops.  This makes the
 implementation of each of the PMI routines essentially independent of
-the number of threads.  
+the number of threads.
 
 For example, the PMI_KVS_Get operation will look more like
 
@@ -106,7 +106,7 @@ if (!*flag) {
 }
 
 An easy way to do this with the single wait function is to make the
-routine wait only if there is a non-null flag, i.e., 
+routine wait only if there is a non-null flag, i.e.,
 
 wait-function( NULL, wait-ctx )
 if (!*flag) {
diff --git a/src/pmi/pmi2/pmi2compat.h b/src/pmi/pmi2/simple/pmi2compat.h
similarity index 100%
rename from src/pmi/pmi2/pmi2compat.h
rename to src/pmi/pmi2/simple/pmi2compat.h
diff --git a/src/pmi/pmi2/simple2pmi.c b/src/pmi/pmi2/simple/simple2pmi.c
similarity index 99%
rename from src/pmi/pmi2/simple2pmi.c
rename to src/pmi/pmi2/simple/simple2pmi.c
index b42f46e..2c52808 100644
--- a/src/pmi/pmi2/simple2pmi.c
+++ b/src/pmi/pmi2/simple/simple2pmi.c
@@ -152,7 +152,7 @@ static inline void ENQUEUE(PMI2_Command *cmd)
         pendingq_tail = pi;
     }
 }
-        
+
 static inline int SEARCH_REMOVE(PMI2_Command *cmd)
 {
     pending_item_t *pi, *prev;
@@ -167,7 +167,7 @@ static inline int SEARCH_REMOVE(PMI2_Command *cmd)
     }
     prev = pi;
     pi = pi->next;
-    
+
     for ( ; pi ; pi = pi->next) {
         if (pi->cmd == cmd) {
             prev->next = pi->next;
@@ -177,7 +177,7 @@ static inline int SEARCH_REMOVE(PMI2_Command *cmd)
             return 1;
         }
     }
-    
+
     return 0;
 }
 
@@ -223,9 +223,9 @@ int PMI2_Init(int *spawned, int *size, int *rank, int *appnum)
 	PMI2_size = 1;
 	PMI2_rank = 0;
 	*spawned = 0;
-	
+
 	PMI2_initialized = SINGLETON_INIT_BUT_NO_PM;
-	
+
         goto fn_exit;
     }
 
@@ -266,14 +266,14 @@ int PMI2_Init(int *spawned, int *size, int *rank, int *appnum)
         int spawner_jobid_len;
         PMI2_Command cmd = {0};
         int debugged;
-        
+
 
         jobid = getenv("PMI_JOBID");
         if (jobid) {
             init_kv_str(&pairs[npairs], PMIJOBID_KEY, jobid);
             ++npairs;
         }
-        
+
         pmiid = getenv("PMI_ID");
         if (pmiid) {
             init_kv_str(&pairs[npairs], PMIRANK_KEY, pmiid);
@@ -296,14 +296,14 @@ int PMI2_Init(int *spawned, int *size, int *rank, int *appnum)
 #endif
         init_kv_str(&pairs[npairs], THREADED_KEY, isThreaded ? "TRUE" : "FALSE");
         ++npairs;
-        
- 
+
+
         pmi2_errno = PMIi_WriteSimpleCommand(PMI2_fd, 0, FULLINIT_CMD, pairs_p, npairs); /* don't pass in thread id for init */
         if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
 
         /* Read auth-response */
         /* Send auth-response-complete */
-    
+
         /* Read fullinit-response */
         pmi2_errno = PMIi_ReadCommandExp(PMI2_fd, &cmd, FULLINITRESP_CMD, &rc, &errmsg);
         if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
@@ -359,7 +359,7 @@ int PMI2_Finalize(void)
     int rc;
     const char *errmsg;
     PMI2_Command cmd = {0};
-   
+
     if ( PMI2_initialized > SINGLETON_INIT_BUT_NO_PM) {
         pmi2_errno = PMIi_WriteSimpleCommandStr(PMI2_fd, &cmd, FINALIZE_CMD, NULL);
         if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
@@ -368,7 +368,7 @@ int PMI2_Finalize(void)
         PMI2U_ERR_CHKANDJUMP1(rc, pmi2_errno, PMI2_ERR_OTHER, "**pmi2_finalize", "**pmi2_finalize %s", errmsg ? errmsg : "unknown");
         PMI2U_Free(cmd.command);
         freepairs(cmd.pairs, cmd.nPairs);
-        
+
 	shutdown( PMI2_fd, SHUT_RDWR );
 	close( PMI2_fd );
     }
@@ -395,7 +395,7 @@ int PMI2_Abort( int flag, const char msg[] )
 
     /* ignoring return code, because we're exiting anyway */
     PMIi_WriteSimpleCommandStr(PMI2_fd, NULL, ABORT_CMD, ISWORLD_KEY, flag ? TRUE_VAL : FALSE_VAL, MSG_KEY, msg, NULL);
-    
+
     PMI2U_Exit(PMII_EXIT_CODE);
     return PMI2_SUCCESS;
 }
@@ -553,7 +553,7 @@ int PMI2_Job_GetId(char jobid[], int jobid_size)
     int rc;
     const char *errmsg;
     PMI2_Command cmd = {0};
-    
+
     pmi2_errno = PMIi_WriteSimpleCommandStr(PMI2_fd, &cmd, JOBGETID_CMD, NULL);
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
     pmi2_errno = PMIi_ReadCommandExp(PMI2_fd, &cmd, JOBGETIDRESP_CMD, &rc, &errmsg);
@@ -598,7 +598,7 @@ int PMI2_Job_Connect(const char jobid[], PMI2_Connect_comm_t *conn)
 
     PMI2U_ERR_CHKANDJUMP(kvscopy, pmi2_errno, PMI2_ERR_OTHER, "**notimpl");
 
-    
+
  fn_exit:
     PMI2U_Free(cmd.command);
     freepairs(cmd.pairs, cmd.nPairs);
@@ -620,7 +620,7 @@ int PMI2_Job_Disconnect(const char jobid[])
     pmi2_errno = PMIi_ReadCommandExp(PMI2_fd, &cmd, JOBDISCONNECTRESP_CMD, &rc, &errmsg);
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
     PMI2U_ERR_CHKANDJUMP1(rc, pmi2_errno, PMI2_ERR_OTHER, "**pmi2_jobdisconnect", "**pmi2_jobdisconnect %s", errmsg ? errmsg : "unknown");
-        
+
 fn_exit:
     PMI2U_Free(cmd.command);
     freepairs(cmd.pairs, cmd.nPairs);
@@ -641,7 +641,7 @@ int PMI2_KVS_Put(const char key[], const char value[])
     pmi2_errno = PMIi_ReadCommandExp(PMI2_fd, &cmd, KVSPUTRESP_CMD, &rc, &errmsg);
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
     PMI2U_ERR_CHKANDJUMP1(rc, pmi2_errno, PMI2_ERR_OTHER, "**pmi2_kvsput", "**pmi2_kvsput %s", errmsg ? errmsg : "unknown");
-        
+
 fn_exit:
     PMI2U_Free(cmd.command);
     freepairs(cmd.pairs, cmd.nPairs);
@@ -683,10 +683,10 @@ int PMI2_KVS_Get(const char *jobid, int src_pmi_id, const char key[], char value
     const char *errmsg;
 
     PMI2U_Snprintf(src_pmi_id_str, sizeof(src_pmi_id_str), "%d", src_pmi_id);
-    
+
     pmi2_errno = PMIi_InitIfSingleton();
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
-    
+
     pmi2_errno = PMIi_WriteSimpleCommandStr(PMI2_fd, &cmd, KVSGET_CMD, JOBID_KEY, jobid, SRCID_KEY, src_pmi_id_str, KEY_KEY, key, NULL);
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
     pmi2_errno = PMIi_ReadCommandExp(PMI2_fd, &cmd, KVSGETRESP_CMD, &rc, &errmsg);
@@ -702,7 +702,7 @@ int PMI2_KVS_Get(const char *jobid, int src_pmi_id, const char key[], char value
 
     ret = PMI2U_Strncpy(value, kvsvalue, maxValue);
     *valLen = ret ? -kvsvallen : kvsvallen;
-    
+
 
  fn_exit:
     PMI2U_Free(cmd.command);
@@ -725,7 +725,7 @@ int PMI2_Info_GetNodeAttr(const char name[], char value[], int valuelen, int *fl
 
     pmi2_errno = PMIi_InitIfSingleton();
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
-    
+
     pmi2_errno = PMIi_WriteSimpleCommandStr(PMI2_fd, &cmd, GETNODEATTR_CMD, KEY_KEY, name, WAIT_KEY, waitfor ? "TRUE" : "FALSE", NULL);
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
     pmi2_errno = PMIi_ReadCommandExp(PMI2_fd, &cmd, GETNODEATTRRESP_CMD, &rc, &errmsg);
@@ -740,7 +740,7 @@ int PMI2_Info_GetNodeAttr(const char name[], char value[], int valuelen, int *fl
 
         PMI2U_Strncpy(value, kvsvalue, valuelen);
     }
-    
+
 fn_exit:
     PMI2U_Free(cmd.command);
     freepairs(cmd.pairs, cmd.nPairs);
@@ -760,10 +760,10 @@ int PMI2_Info_GetNodeAttrIntArray(const char name[], int array[], int arraylen,
     const char *errmsg;
     int i;
     const char *valptr;
-    
+
     pmi2_errno = PMIi_InitIfSingleton();
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
-    
+
     pmi2_errno = PMIi_WriteSimpleCommandStr(PMI2_fd, &cmd, GETNODEATTR_CMD, KEY_KEY, name, WAIT_KEY, "FALSE", NULL);
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
     pmi2_errno = PMIi_ReadCommandExp(PMI2_fd, &cmd, GETNODEATTRRESP_CMD, &rc, &errmsg);
@@ -787,10 +787,10 @@ int PMI2_Info_GetNodeAttrIntArray(const char name[], int array[], int arraylen,
             PMI2U_ERR_CHKANDJUMP1(rc != 1, pmi2_errno, PMI2_ERR_OTHER, "**intern", "**intern %s", "unable to parse intarray");
             ++i;
         }
-        
+
         *outlen = i;
     }
-    
+
 fn_exit:
     PMI2U_Free(cmd.command);
     freepairs(cmd.pairs, cmd.nPairs);
@@ -811,7 +811,7 @@ int PMI2_Info_PutNodeAttr(const char name[], const char value[])
     pmi2_errno = PMIi_ReadCommandExp(PMI2_fd, &cmd, PUTNODEATTRRESP_CMD, &rc, &errmsg);
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
     PMI2U_ERR_CHKANDJUMP1(rc, pmi2_errno, PMI2_ERR_OTHER, "**pmi2_putnodeattr", "**pmi2_putnodeattr %s", errmsg ? errmsg : "unknown");
-        
+
 fn_exit:
     PMI2U_Free(cmd.command);
     freepairs(cmd.pairs, cmd.nPairs);
@@ -832,7 +832,7 @@ int PMI2_Info_GetJobAttr(const char name[], char value[], int valuelen, int *fla
 
     pmi2_errno = PMIi_InitIfSingleton();
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
-    
+
     pmi2_errno = PMIi_WriteSimpleCommandStr(PMI2_fd, &cmd, GETJOBATTR_CMD, KEY_KEY, name, NULL);
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
     pmi2_errno = PMIi_ReadCommandExp(PMI2_fd, &cmd, GETJOBATTRRESP_CMD, &rc, &errmsg);
@@ -845,10 +845,10 @@ int PMI2_Info_GetJobAttr(const char name[], char value[], int valuelen, int *fla
     if (*flag) {
         found = getval(cmd.pairs, cmd.nPairs, VALUE_KEY, &kvsvalue, &kvsvallen);
         PMI2U_ERR_CHKANDJUMP(found != 1, pmi2_errno, PMI2_ERR_OTHER, "**intern");
-        
+
         PMI2U_Strncpy(value, kvsvalue, valuelen);
     }
-    
+
 fn_exit:
     PMI2U_Free(cmd.command);
     freepairs(cmd.pairs, cmd.nPairs);
@@ -867,10 +867,10 @@ int PMI2_Info_GetJobAttrIntArray(const char name[], int array[], int arraylen, i
     const char *errmsg;
     int i;
     const char *valptr;
-    
+
     pmi2_errno = PMIi_InitIfSingleton();
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
-    
+
     pmi2_errno = PMIi_WriteSimpleCommandStr(PMI2_fd, &cmd, GETJOBATTR_CMD, KEY_KEY, name, NULL);
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
     pmi2_errno = PMIi_ReadCommandExp(PMI2_fd, &cmd, GETJOBATTRRESP_CMD, &rc, &errmsg);
@@ -894,10 +894,10 @@ int PMI2_Info_GetJobAttrIntArray(const char name[], int array[], int arraylen, i
             PMI2U_ERR_CHKANDJUMP1(rc != 1, pmi2_errno, PMI2_ERR_OTHER, "**intern", "**intern %s", "unable to parse intarray");
             ++i;
         }
-        
+
         *outlen = i;
     }
-    
+
 fn_exit:
     PMI2U_Free(cmd.command);
     freepairs(cmd.pairs, cmd.nPairs);
@@ -1032,7 +1032,7 @@ static void freepairs(PMI2_Keyvalpair** pairs, int npairs)
 static int getval(PMI2_Keyvalpair *const pairs[], int npairs, const char *key,  const char **value, int *vallen)
 {
     int i;
-    
+
     for (i = 0; i < npairs; ++i)
         if (strncmp(key, pairs[i]->key, PMI2_MAX_KEYLEN) == 0) {
             *value = pairs[i]->value;
@@ -1049,7 +1049,7 @@ static int getvalint(PMI2_Keyvalpair *const pairs[], int npairs, const char *key
     int vallen;
     int ret;
     /* char *endptr; */
-    
+
     found = getval(pairs, npairs, key, &value, &vallen);
     if (found != 1)
         return found;
@@ -1060,11 +1060,11 @@ static int getvalint(PMI2_Keyvalpair *const pairs[], int npairs, const char *key
     ret = sscanf(value, "%d", val);
     if (ret != 1)
         return -1;
-    
+
     /* *val = strtoll(value, &endptr, 0); */
     /* if (endptr - value != vallen) */
     /*     return -1; */
-    
+
     return 1;
 }
 
@@ -1076,7 +1076,7 @@ static int getvalptr(PMI2_Keyvalpair *const pairs[], int npairs, const char *key
     int ret;
     void **val_ = val;
     /* char *endptr; */
-    
+
     found = getval(pairs, npairs, key, &value, &vallen);
     if (found != 1)
         return found;
@@ -1091,7 +1091,7 @@ static int getvalptr(PMI2_Keyvalpair *const pairs[], int npairs, const char *key
     /* *val_ = (void *)(PMI2R_Upint)strtoll(value, &endptr, 0); */
     /* if (endptr - value != vallen) */
     /*     return -1; */
-    
+
     return 1;
 }
 
@@ -1101,8 +1101,8 @@ static int getvalbool(PMI2_Keyvalpair *const pairs[], int npairs, const char *ke
     int found;
     const char *value;
     int vallen;
-    
-    
+
+
     found = getval(pairs, npairs, key, &value, &vallen);
     if (found != 1)
         return found;
@@ -1137,7 +1137,7 @@ static int parse_keyval(char **cmdptr, int *len, char **key, char **val, int *va
     char *c = *cmdptr;
     char *d;
 
-    
+
     /* find key */
     *key = c; /* key is at the start of the buffer */
     while (*len && *c != '=') {
@@ -1188,14 +1188,14 @@ static int create_keyval(PMI2_Keyvalpair **kv, const char *key, const char *val,
     PMI2U_CHKPMEM_DECL(3);
 
     PMI2U_CHKPMEM_MALLOC(*kv, PMI2_Keyvalpair *, sizeof(PMI2_Keyvalpair), pmi2_errno, "pair");
-        
+
     PMI2U_CHKPMEM_MALLOC(key_p, char *, strlen(key)+1, pmi2_errno, "key");
     PMI2U_Strncpy(key_p, key, PMI2_MAX_KEYLEN+1);
-    
+
     PMI2U_CHKPMEM_MALLOC(value_p, char *, vallen+1, pmi2_errno, "value");
     PMI2U_Memcpy(value_p, val, vallen);
     value_p[vallen] = '\0';
-    
+
     (*kv)->key = key_p;
     (*kv)->value = value_p;
     (*kv)->valueLen = vallen;
@@ -1265,7 +1265,7 @@ int PMIi_ReadCommand( int fd, PMI2_Command *cmd )
             offset += nbytes;
         }
         while (offset < PMII_COMMANDLEN_SIZE);
-    
+
         cmd_len = atoi(cmd_len_str);
 
         cmd_buf = PMI2U_Malloc(cmd_len+1);
@@ -1293,7 +1293,7 @@ int PMIi_ReadCommand( int fd, PMI2_Command *cmd )
         c = cmd_buf;
         remaining_len = cmd_len;
         num_pairs = 0;
-    
+
         while (remaining_len) {
             while (remaining_len && *c != ';') {
                 --remaining_len;
@@ -1308,14 +1308,14 @@ int PMIi_ReadCommand( int fd, PMI2_Command *cmd )
                 ++c;
             }
         }
-    
+
         c = cmd_buf;
         remaining_len = cmd_len;
         pmi2_errno = parse_keyval(&c, &remaining_len, &key, &val, &vallen);
         if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
 
         PMI2U_ERR_CHKANDJUMP(strncmp(key, "cmd", PMI2_MAX_KEYLEN) != 0, pmi2_errno, PMI2_ERR_OTHER, "**bad_cmd");
-    
+
         command = PMI2U_Malloc(vallen+1);
         if (!command) { PMI2U_CHKMEM_SETERR(pmi2_errno, vallen+1, "command"); goto fn_exit; }
         PMI2U_Memcpy(command, val, vallen);
@@ -1325,18 +1325,18 @@ int PMIi_ReadCommand( int fd, PMI2_Command *cmd )
 
         pairs = PMI2U_Malloc(sizeof(PMI2_Keyvalpair *) * nPairs);
         if (!pairs) { PMI2U_CHKMEM_SETERR(pmi2_errno, sizeof(PMI2_Keyvalpair *) * nPairs, "pairs"); goto fn_exit; }
-    
+
         pair_index = 0;
         while (remaining_len)
         {
             PMI2_Keyvalpair *pair;
-        
+
             pmi2_errno = parse_keyval(&c, &remaining_len, &key, &val, &vallen);
             if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
 
             pmi2_errno = create_keyval(&pair, key, val, vallen);
             if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
-        
+
             pairs[pair_index] = pair;
             ++pair_index;
         }
@@ -1347,12 +1347,12 @@ int PMIi_ReadCommand( int fd, PMI2_Command *cmd )
         else
             if (PMI2_debug && SEARCH_REMOVE(target_cmd) == 0) {
                 int i;
-                
+
                 printf("command=%s\n", command);
                 for (i = 0; i < nPairs; ++i)
                     dump_PMI2_Keyvalpair(stdout, pairs[i]);
             }
-        
+
         target_cmd->command = command;
         target_cmd->nPairs = nPairs;
         target_cmd->pairs = pairs;
@@ -1362,7 +1362,7 @@ int PMIi_ReadCommand( int fd, PMI2_Command *cmd )
 
         PMI2U_Free(cmd_buf);
     } while (!cmd->complete);
-    
+
 #ifdef MPICH_IS_THREADED
     MPIU_THREAD_CHECK_BEGIN;
     {
@@ -1375,7 +1375,7 @@ int PMIi_ReadCommand( int fd, PMI2_Command *cmd )
 #endif
 
 
-    
+
 
 fn_exit:
     return pmi2_errno;
@@ -1392,7 +1392,7 @@ int PMIi_ReadCommandExp( int fd, PMI2_Command *cmd, const char *exp, int* rc, co
     int pmi2_errno = PMI2_SUCCESS;
     int found;
     int msglen;
-    
+
     pmi2_errno = PMIi_ReadCommand(fd, cmd);
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
 
@@ -1407,7 +1407,7 @@ int PMIi_ReadCommandExp( int fd, PMI2_Command *cmd, const char *exp, int* rc, co
     if (!found)
         *errmsg = NULL;
 
-    
+
 fn_exit:
     return pmi2_errno;
 fn_fail:
@@ -1451,7 +1451,7 @@ int PMIi_WriteSimpleCommand( int fd, PMI2_Command *resp, const char cmd[], PMI2_
     }
     MPIU_THREAD_CHECK_END;
 #endif
-    
+
     for (pair_index = 0; pair_index < npairs; ++pair_index) {
         /* write key= */
         PMI2U_ERR_CHKANDJUMP(strlen(pairs[pair_index]->key) > PMI2_MAX_KEYLEN, pmi2_errno, PMI2_ERR_OTHER, "**key_too_long");
@@ -1488,8 +1488,8 @@ int PMIi_WriteSimpleCommand( int fd, PMI2_Command *resp, const char cmd[], PMI2_
 
     cmdbuf[cmdlen+PMII_COMMANDLEN_SIZE] = '\0'; /* silence valgrind warnings in printf_d */
     printf_d("PMI sending: %s\n", cmdbuf);
-    
-    
+
+
  #ifdef MPICH_IS_THREADED
     MPIU_THREAD_CHECK_BEGIN;
     {
@@ -1527,7 +1527,7 @@ int PMIi_WriteSimpleCommand( int fd, PMI2_Command *resp, const char cmd[], PMI2_
     }
     MPIU_THREAD_CHECK_END;
 #endif
-    
+
 fn_exit:
     return pmi2_errno;
 fn_fail:
@@ -1573,7 +1573,7 @@ int PMIi_WriteSimpleCommandStr(int fd, PMI2_Command *resp, const char cmd[], ...
 
     pmi2_errno = PMIi_WriteSimpleCommand(fd, resp, cmd, pairs_p, npairs);
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
-    
+
 fn_exit:
     PMI2U_CHKLMEM_FREEALL();
     return pmi2_errno;
@@ -1620,13 +1620,13 @@ static int PMII_Connect_to_pm( char *hostname, int portnum )
     int                fd;
     int                optval = 1;
     int                q_wait = 1;
-    
+
     hp = gethostbyname( hostname );
     if (!hp) {
 	PMI2U_printf( 1, "Unable to get host entry for %s\n", hostname );
 	return -1;
     }
-    
+
     memset( (void *)&sa, 0, sizeof(sa) );
     /* POSIX might define h_addr_list only and node define h_addr */
 #ifdef HAVE_H_ADDR_LIST
@@ -1636,13 +1636,13 @@ static int PMII_Connect_to_pm( char *hostname, int portnum )
 #endif
     sa.sin_family = hp->h_addrtype;
     sa.sin_port   = htons( (unsigned short) portnum );
-    
+
     fd = socket( AF_INET, SOCK_STREAM, TCP );
     if (fd < 0) {
 	PMI2U_printf( 1, "Unable to get AF_INET socket\n" );
 	return -1;
     }
-    
+
     if (setsockopt( fd, IPPROTO_TCP, TCP_NODELAY,
 		    (char *)&optval, sizeof(optval) )) {
 	perror( "Error calling setsockopt:" );
@@ -1657,13 +1657,13 @@ static int PMII_Connect_to_pm( char *hostname, int portnum )
 	    if (q_wait)
 		close(fd);
 	    return -1;
-	    
+
 	case EINPROGRESS: /*  (nonblocking) - select for writing. */
 	    break;
-	    
+
 	case EISCONN: /*  (already connected) */
 	    break;
-	    
+
 	case ETIMEDOUT: /* timed out */
 	    PMI2U_printf( 1, "connect failed with timeout\n" );
 	    return -1;
@@ -1799,7 +1799,7 @@ static int getPMIFD(void)
 
 	/* Connect to the indicated port (in format hostname:portnumber)
 	   and get the fd for the socket */
-	
+
 	/* Split p into host and port */
 	pn = p;
 	ph = hostname;
@@ -1809,7 +1809,7 @@ static int getPMIFD(void)
 	*ph = 0;
 
         PMI2U_ERR_CHKANDJUMP1(*pn != ':', pmi2_errno, PMI2_ERR_OTHER, "**pmi2_port", "**pmi2_port %s", p);
-        
+
         portnum = atoi( pn+1 );
         /* FIXME: Check for valid integer after : */
         /* This routine only gets the fd to use to talk to
@@ -1858,7 +1858,7 @@ static void dump_PMI2_Keyvalpair(FILE *file, PMI2_Keyvalpair *kv)
 static void dump_PMI2_Command(FILE *file, PMI2_Command *cmd)
 {
     int i;
-    
+
     fprintf(file, "cmd    = %s\n", cmd->command);
     fprintf(file, "nPairs = %d\n", cmd->nPairs);
 
diff --git a/src/pmi/pmi2/simple2pmi.h b/src/pmi/pmi2/simple/simple2pmi.h
similarity index 98%
rename from src/pmi/pmi2/simple2pmi.h
rename to src/pmi/pmi2/simple/simple2pmi.h
index 01d2497..f6e0f90 100644
--- a/src/pmi/pmi2/simple2pmi.h
+++ b/src/pmi/pmi2/simple/simple2pmi.h
@@ -79,15 +79,15 @@ static const char FALSE_VAL[] = "FALSE";
 
 /* Local types */
 
-/* Parse commands are in this structure.  Fields in this structure are 
+/* Parse commands are in this structure.  Fields in this structure are
    dynamically allocated as necessary */
 typedef struct PMI2_Keyvalpair {
     const char *key;
     const char *value;
     int         valueLen;  /* Length of a value (values may contain nulls, so
                               we need this) */
-    int         isCopy;    /* The value is a copy (and will need to be freed) 
-                              if this is true, otherwise, 
+    int         isCopy;    /* The value is a copy (and will need to be freed)
+                              if this is true, otherwise,
                               it is a null-terminated string in the original
                               buffer */
 } PMI2_Keyvalpair;
diff --git a/src/pmi/pmi2/simple_pmiutil.c b/src/pmi/pmi2/simple/simple_pmiutil.c
similarity index 93%
rename from src/pmi/pmi2/simple_pmiutil.c
rename to src/pmi/pmi2/simple/simple_pmiutil.c
index ca84609..65ca889 100644
--- a/src/pmi/pmi2/simple_pmiutil.c
+++ b/src/pmi/pmi2/simple/simple_pmiutil.c
@@ -33,7 +33,7 @@
 #define MAXVALLEN 1024
 #define MAXKEYLEN   32
 
-/* These are not the keyvals in the keyval space that is part of the 
+/* These are not the keyvals in the keyval space that is part of the
    PMI specification.
    They are just part of this implementation's internal utilities.
 */
@@ -44,7 +44,7 @@ struct PMI2U_keyval_pairs {
 static struct PMI2U_keyval_pairs PMI2U_keyval_tab[64] = { { {0}, {0} } };
 static int  PMI2U_keyval_tab_idx = 0;
 
-/* This is used to prepend printed output.  Set the initial value to 
+/* This is used to prepend printed output.  Set the initial value to
    "unset" */
 static char PMI2U_print_id[PMI2U_IDSIZE] = "unset";
 
@@ -66,7 +66,7 @@ void PMI2U_printf( int print_flag, const char *fmt, ... )
 {
     va_list ap;
     static FILE *logfile= 0;
-    
+
     /* In some cases when we are debugging, the handling of stdout or
        stderr may be unreliable.  In that case, we make it possible to
        select an output file. */
@@ -77,7 +77,7 @@ void PMI2U_printf( int print_flag, const char *fmt, ... )
 	    char filename[1024];
 	    p = getenv("PMI_ID");
 	    if (p) {
-		PMI2U_Snprintf( filename, sizeof(filename), 
+		PMI2U_Snprintf( filename, sizeof(filename),
 			       "testclient-%s.out", p );
 		logfile = fopen( filename, "w" );
 	    }
@@ -85,7 +85,7 @@ void PMI2U_printf( int print_flag, const char *fmt, ... )
 		logfile = fopen( "testserver.out", "w" );
 	    }
 	}
-	else 
+	else
 	    logfile = stderr;
     }
 
@@ -102,26 +102,26 @@ void PMI2U_printf( int print_flag, const char *fmt, ... )
 }
 
 #define MAX_READLINE 1024
-/* 
+/*
  * Return the next newline-terminated string of maximum length maxlen.
  * This is a buffered version, and reads from fd as necessary.  A
  */
 int PMI2U_readline( int fd, char *buf, int maxlen )
 {
     static char readbuf[MAX_READLINE];
-    static char *nextChar = 0, *lastChar = 0;  /* lastChar is really one past 
+    static char *nextChar = 0, *lastChar = 0;  /* lastChar is really one past
 						  last char */
     static int  lastErrno = 0;
     static int lastfd = -1;
     int curlen, n;
     char *p, ch;
 
-    /* Note: On the client side, only one thread at a time should 
-       be calling this, and there should only be a single fd.  
-       Server side code should not use this routine (see the 
+    /* Note: On the client side, only one thread at a time should
+       be calling this, and there should only be a single fd.
+       Server side code should not use this routine (see the
        replacement version in src/pm/util/pmiserv.c) */
     PMI2U_Assert(nextChar == lastChar || fd == lastfd);
-    
+
     p      = buf;
     curlen = 1;    /* Make room for the null */
     while (curlen < maxlen) {
@@ -152,7 +152,7 @@ int PMI2U_readline( int fd, char *buf, int maxlen )
 	    /* FIXME: Make this an optional output */
 	    /* printf( "Readline %s\n", readbuf ); */
 	}
-	
+
 	ch   = *nextChar++;
 	*p++ = ch;
 	curlen++;
@@ -168,7 +168,7 @@ int PMI2U_readline( int fd, char *buf, int maxlen )
     return curlen-1;
 }
 
-int PMI2U_writeline( int fd, char *buf )	
+int PMI2U_writeline( int fd, char *buf )
 {
     int size, n;
 
@@ -237,20 +237,20 @@ int PMI2U_parse_keyvals( char *st )
 	/* Null terminate the key */
 	*p = 0;
 	/* store key */
-        PMI2U_Strncpy( PMI2U_keyval_tab[PMI2U_keyval_tab_idx].key, keystart, 
+        PMI2U_Strncpy( PMI2U_keyval_tab[PMI2U_keyval_tab_idx].key, keystart,
 		      MAXKEYLEN );
 
 	valstart = ++p;			/* start of value */
 	while ( *p != ' ' && *p != '\n' && *p != '\0' )
 	    p++;
 	/* store value */
-        PMI2U_Strncpy( PMI2U_keyval_tab[PMI2U_keyval_tab_idx].value, valstart, 
+        PMI2U_Strncpy( PMI2U_keyval_tab[PMI2U_keyval_tab_idx].value, valstart,
 		      MAXVALLEN );
 	offset = p - valstart;
 	/* When compiled with -fPIC, the pgcc compiler generates incorrect
-	   code if "p - valstart" is used instead of using the 
+	   code if "p - valstart" is used instead of using the
 	   intermediate offset */
-	PMI2U_keyval_tab[PMI2U_keyval_tab_idx].value[offset] = '\0';  
+	PMI2U_keyval_tab[PMI2U_keyval_tab_idx].value[offset] = '\0';
 	PMI2U_keyval_tab_idx++;
 	if ( *p == ' ' )
 	    continue;
@@ -262,23 +262,23 @@ int PMI2U_parse_keyvals( char *st )
 void PMI2U_dump_keyvals( void )
 {
     int i;
-    for (i=0; i < PMI2U_keyval_tab_idx; i++) 
+    for (i=0; i < PMI2U_keyval_tab_idx; i++)
 	PMI2U_printf(1, "  %s=%s\n",PMI2U_keyval_tab[i].key, PMI2U_keyval_tab[i].value);
 }
 
 char *PMI2U_getval( const char *keystr, char *valstr, int vallen )
 {
     int i, rc;
-    
+
     for (i = 0; i < PMI2U_keyval_tab_idx; i++) {
-	if ( strcmp( keystr, PMI2U_keyval_tab[i].key ) == 0 ) { 
+	if ( strcmp( keystr, PMI2U_keyval_tab[i].key ) == 0 ) {
 	    rc = PMI2U_Strncpy( valstr, PMI2U_keyval_tab[i].value, vallen );
 	    if (rc != 0) {
 		PMI2U_printf( 1, "PMI2U_Strncpy failed in PMI2U_getval\n" );
 		return NULL;
 	    }
 	    return valstr;
-       } 
+       }
     }
     valstr[0] = '\0';
     return NULL;
@@ -287,7 +287,7 @@ char *PMI2U_getval( const char *keystr, char *valstr, int vallen )
 void PMI2U_chgval( const char *keystr, char *valstr )
 {
     int i;
-    
+
     for ( i = 0; i < PMI2U_keyval_tab_idx; i++ ) {
 	if ( strcmp( keystr, PMI2U_keyval_tab[i].key ) == 0 ) {
 	    PMI2U_Strncpy( PMI2U_keyval_tab[i].value, valstr, MAXVALLEN - 1 );
diff --git a/src/pmi/pmi2/simple_pmiutil.h b/src/pmi/pmi2/simple/simple_pmiutil.h
similarity index 99%
rename from src/pmi/pmi2/simple_pmiutil.h
rename to src/pmi/pmi2/simple/simple_pmiutil.h
index 76762a4..583ceed 100644
--- a/src/pmi/pmi2/simple_pmiutil.h
+++ b/src/pmi/pmi2/simple/simple_pmiutil.h
@@ -110,7 +110,7 @@ extern int PMI2_pmiverbose; /* Set this to true to print PMI debugging info */
 #define PMI2U_AssertDeclValue(_a, _b) _a = _b
 #else
 /* Empty decls not allowed in C */
-#define PMI2U_AssertDecl(a_) a_ 
+#define PMI2U_AssertDecl(a_) a_
 #define PMI2U_AssertDeclValue(_a, _b) _a ATTRIBUTE((unused))
 #endif
 
diff --git a/src/pmi/pmi2/subconfigure.m4 b/src/pmi/pmi2/simple/subconfigure.m4
similarity index 94%
copy from src/pmi/pmi2/subconfigure.m4
copy to src/pmi/pmi2/simple/subconfigure.m4
index 1893d7f..530012a 100644
--- a/src/pmi/pmi2/subconfigure.m4
+++ b/src/pmi/pmi2/simple/subconfigure.m4
@@ -1,14 +1,14 @@
 [#] start of __file__
-dnl MPICH_SUBCFG_AFTER=src/pmi
+dnl MPICH2_SUBCFG_AFTER=src/pmi
 
 AC_DEFUN([PAC_SUBCFG_PREREQ_]PAC_SUBCFG_AUTO_SUFFIX,[
 ])
 
 AC_DEFUN([PAC_SUBCFG_BODY_]PAC_SUBCFG_AUTO_SUFFIX,[
 
-AM_CONDITIONAL([BUILD_PMI_PMI2],[test "x$pmi_name" = "xpmi2"])
+AM_CONDITIONAL([BUILD_PMI_PMI2_SIMPLE],[test "x$pmi_name" = "xpmi2/simple"])
 
-AM_COND_IF([BUILD_PMI_PMI2],[
+AM_COND_IF([BUILD_PMI_PMI2_SIMPLE],[
 if test "$enable_pmiport" != "no" ; then
    enable_pmiport=yes
 fi
@@ -47,7 +47,7 @@ if test "$enable_pmiport" = "yes" ; then
     AC_SEARCH_LIBS(gethostbyname,nsl)
     missing_functions=no
     AC_CHECK_FUNCS(socket setsockopt gethostbyname,,missing_functions=yes)
-    
+
     if test "$missing_functions" = "no" ; then
         AC_DEFINE(USE_PMI_PORT,1,[Define if access to PMI information through a port rather than just an fd is allowed])
     else
diff --git a/src/pmi/pmi2/subconfigure.m4 b/src/pmi/pmi2/subconfigure.m4
index 1893d7f..d1dc441 100644
--- a/src/pmi/pmi2/subconfigure.m4
+++ b/src/pmi/pmi2/subconfigure.m4
@@ -16,6 +16,10 @@ fi
 dnl causes USE_PMI2_API to be AC_DEFINE'ed by the top-level configure.ac
 USE_PMI2_API=yes
 
+# common ARG_ENABLE, shared by "simple" and "poe"
+AC_ARG_ENABLE(pmiport,
+[--enable-pmiport - Allow PMI interface to use a host-port pair to contact
+                   for PMI services],,enable_pmiport=default)
 AC_CHECK_HEADERS([unistd.h string.h stdlib.h sys/socket.h strings.h assert.h])
 dnl Use snprintf if possible when creating messages
 AC_CHECK_FUNCS(snprintf)

http://git.mpich.org/mpich.git/commitdiff/e25d1dd65eee37596341d7b678881fda9384450c

commit e25d1dd65eee37596341d7b678881fda9384450c
Author: Charles Archer <archerc at us.ibm.com>
Date:   Tue Nov 27 11:35:55 2012 -0500

    Out of order message handling deadlock fix
    
    This code adds a check for out-of-order message processing in the
    receive path and the eager callback path to clear out-of-order
    messages and move them into the appropriate queue.
    
    The checks are necessary because we discovered a hang condition
    where a message existed in both the posted and unexpected queues.
    
    (ibm) D187489
    (ibm) df7f72e60270391b185a91b6f6310a8a756d6105
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>
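
The drain logic the commit describes can be sketched as follows. This is a minimal stand-in, not the pamid implementation: `ooo_msg_t`, `src_state_t`, `deliver()`, and `process_ooo()` are hypothetical names, with `n_ooo` mirroring the `n_OutOfOrderMsgs` counter checked in the diff below. The idea is the same: after progress on a source, repeatedly rescan the parked out-of-order list and move every message that has become in-order into the delivery path.

```c
#include <stddef.h>

/* Hypothetical per-source state: the next expected sequence number and a
   singly linked list of messages that arrived early. */
typedef struct ooo_msg {
    unsigned seqno;
    struct ooo_msg *next;
} ooo_msg_t;

typedef struct {
    unsigned next_seqno;   /* next in-order sequence number */
    ooo_msg_t *ooo_head;   /* parked out-of-order messages */
    int n_ooo;             /* count; mirrors n_OutOfOrderMsgs */
} src_state_t;

/* Stand-in for moving a message to the posted/unexpected queue:
   just advance the expected sequence number. */
static void deliver(src_state_t *s, ooo_msg_t *m)
{
    s->next_seqno = m->seqno + 1;
}

/* Drain every parked message that has become in-order.  Each delivery can
   make an earlier list entry eligible, so rescan until no match is found. */
static void process_ooo(src_state_t *s)
{
    int found = 1;
    while (found && s->n_ooo > 0) {
        found = 0;
        ooo_msg_t **pp = &s->ooo_head;
        while (*pp) {
            if ((*pp)->seqno == s->next_seqno) {
                ooo_msg_t *m = *pp;
                *pp = m->next;     /* unlink from the parked list */
                deliver(s, m);
                s->n_ooo--;
                found = 1;         /* restart the scan */
                break;
            }
            pp = &(*pp)->next;
        }
    }
}
```

Without such a drain after each receive, a message parked in the out-of-order list can never be matched against a later-posted request, which is the hang the commit fixes.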

diff --git a/src/mpid/pamid/src/pt2pt/mpidi_callback_eager.c b/src/mpid/pamid/src/pt2pt/mpidi_callback_eager.c
index eef73b8..5136715 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_callback_eager.c
+++ b/src/mpid/pamid/src/pt2pt/mpidi_callback_eager.c
@@ -157,7 +157,7 @@ MPIDI_RecvCB(pami_context_t    context,
 #else
       rreq = MPIDI_Recvq_FDP(rank, PAMIX_Endpoint_query(sender), tag, context_id, msginfo->MPIseqno);
 #endif
-      
+
       if (unlikely(rreq == NULL))
       {
         MPIDI_Callback_process_unexp(newreq, context, msginfo, sndlen, sender, sndbuf, recv, msginfo->isSync);
@@ -308,6 +308,12 @@ MPIDI_RecvCB(pami_context_t    context,
    MPIDI_In_cntr[(PAMIX_Endpoint_query(sender))].R[(rreq->mpid.idx)].bufadd=rreq->mpid.userbuf;
 #endif
 
+#ifdef OUT_OF_ORDER_HANDLING
+  if (MPIDI_In_cntr[PAMIX_Endpoint_query(sender)].n_OutOfOrderMsgs > 0) {
+    MPIDI_Recvq_process_out_of_order_msgs(PAMIX_Endpoint_query(sender), context);
+  }
+#endif
+
  fn_exit_eager:
  MPIDI_Return_tokens(context, source, rettoks);
   /* ---------------------------------------- */
diff --git a/src/mpid/pamid/src/pt2pt/mpidi_recv.h b/src/mpid/pamid/src/pt2pt/mpidi_recv.h
index 5916255..7042594 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_recv.h
+++ b/src/mpid/pamid/src/pt2pt/mpidi_recv.h
@@ -213,6 +213,10 @@ MPIDI_Recv(void          * buf,
       }
       MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
       MPID_Request_discard(newreq);
+#ifdef OUT_OF_ORDER_HANDLING
+      if ((MPIDI_In_cntr[rreq->mpid.peer_pami].n_OutOfOrderMsgs>0))
+          MPIDI_Recvq_process_out_of_order_msgs(rreq->mpid.peer_pami, MPIDI_Context[0]);
+#endif
     }
   else
     {

http://git.mpich.org/mpich.git/commitdiff/d1cfa4d42946c19c281ab18f0a09fd6fa02eb573

commit d1cfa4d42946c19c281ab18f0a09fd6fa02eb573
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Fri Nov 2 13:45:11 2012 -0500

    Enable MPIR_* non-blocking collectives implementation
    
    If the environment variable 'PAMID_MPIR_NBC' is set to a non-zero
    value, a PAMI work function is posted to context 0 that invokes the
    schedule progress function.
    
    By default, MPIR_* non-blocking collectives are disabled in order to
    avoid impacting the performance of other MPI operations.
    
    (ibm) 213eb5171c8cc3e636a6a7ea665e10454b174ef8
    
    Signed-off-by: Charles Archer <archerc at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_datatypes.h b/src/mpid/pamid/include/mpidi_datatypes.h
index 0edc6d9..810f242 100644
--- a/src/mpid/pamid/include/mpidi_datatypes.h
+++ b/src/mpid/pamid/include/mpidi_datatypes.h
@@ -130,6 +130,7 @@ typedef struct
     } context_post;
   } perobj;                  /**< This structure is only used in the 'perobj' mpich lock mode. */
 
+  unsigned mpir_nbc;         /**< Enable MPIR_* non-blocking collectives implementations. */
 } MPIDI_Process_t;
 
 
diff --git a/src/mpid/pamid/include/mpidi_prototypes.h b/src/mpid/pamid/include/mpidi_prototypes.h
index 79119a9..74e91d7 100644
--- a/src/mpid/pamid/include/mpidi_prototypes.h
+++ b/src/mpid/pamid/include/mpidi_prototypes.h
@@ -263,4 +263,8 @@ int MPIDI_Datatype_to_pami(MPI_Datatype        dt,
 void MPIDI_Op_to_string(MPI_Op op, char *string);
 pami_result_t MPIDI_Pami_post_wrapper(pami_context_t context, void *cookie);
 
+
+void MPIDI_NBC_init ();
+
+
 #endif
diff --git a/src/mpid/pamid/src/Makefile.mk b/src/mpid/pamid/src/Makefile.mk
index 247812a..1fc9915 100644
--- a/src/mpid/pamid/src/Makefile.mk
+++ b/src/mpid/pamid/src/Makefile.mk
@@ -56,7 +56,8 @@ lib_lib at MPILIBNAME@_la_SOURCES +=               \
     src/mpid/pamid/src/mpid_mrecv.c             \
     src/mpid/pamid/src/mpid_mprobe.c            \
     src/mpid/pamid/src/mpid_imrecv.c            \
-    src/mpid/pamid/src/mpid_improbe.c
+    src/mpid/pamid/src/mpid_improbe.c           \
+    src/mpid/pamid/src/mpidi_nbc_sched.c
 
 endif BUILD_PAMID
 
diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index e4ac607..e571d16 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -111,6 +111,8 @@ MPIDI_Process_t  MPIDI_Process = {
     .subcomms            = 1,
     .select_colls        = 2,
   },
+
+  .mpir_nbc              = 0,
 };
 
 
@@ -597,7 +599,8 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
 #endif
              "  optimized.collectives : %u\n"
              "  optimized.select_colls: %u\n"
-             "  optimized.subcomms    : %u\n",
+             "  optimized.subcomms    : %u\n"
+             "  mpir_nbc              : %u\n",
              MPIDI_Process.verbose,
              MPIDI_Process.statistics,
              MPIDI_Process.avail_contexts,
@@ -629,7 +632,8 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
 #endif
              MPIDI_Process.optimized.collectives,
              MPIDI_Process.optimized.select_colls,
-             MPIDI_Process.optimized.subcomms);
+             MPIDI_Process.optimized.subcomms,
+             MPIDI_Process.mpir_nbc);
       switch (*threading)
         {
           case MPI_THREAD_MULTIPLE:
@@ -828,6 +832,7 @@ int MPID_Init(int * argc,
  */
 int MPID_InitCompleted()
 {
+  MPIDI_NBC_init();
   MPIDI_Progress_init();
   return MPI_SUCCESS;
 }
diff --git a/src/mpid/pamid/src/mpidi_env.c b/src/mpid/pamid/src/mpidi_env.c
index 1eab34a..427a042 100644
--- a/src/mpid/pamid/src/mpidi_env.c
+++ b/src/mpid/pamid/src/mpidi_env.c
@@ -847,6 +847,12 @@ MPIDI_Env_setup(int rank, int requested)
     ENV_Unsigned(names, &MPIDI_Process.shmem_pt2pt, 2, &found_deprecated_env_var, rank);
   }
 
+  /* Enable MPIR_* implementations of non-blocking collectives */
+  {
+    char* names[] = {"PAMID_MPIR_NBC", NULL};
+    ENV_Unsigned(names, &MPIDI_Process.mpir_nbc, 1, &found_deprecated_env_var, rank);
+  }
+
   /* Check for deprecated collectives environment variables. These variables are
    * used in src/mpid/pamid/src/comm/mpid_selectcolls.c */
   {
diff --git a/src/mpid/pamid/src/mpidi_nbc_sched.c b/src/mpid/pamid/src/mpidi_nbc_sched.c
new file mode 100644
index 0000000..998d03e
--- /dev/null
+++ b/src/mpid/pamid/src/mpidi_nbc_sched.c
@@ -0,0 +1,65 @@
+/* begin_generated_IBM_copyright_prolog                             */
+/*                                                                  */
+/* This is an automatically generated copyright prolog.             */
+/* After initializing,  DO NOT MODIFY OR MOVE                       */
+/*  --------------------------------------------------------------- */
+/* Licensed Materials - Property of IBM                             */
+/* Blue Gene/Q 5765-PER 5765-PRP                                    */
+/*                                                                  */
+/* (C) Copyright IBM Corp. 2011, 2012 All Rights Reserved           */
+/* US Government Users Restricted Rights -                          */
+/* Use, duplication, or disclosure restricted                       */
+/* by GSA ADP Schedule Contract with IBM Corp.                      */
+/*                                                                  */
+/*  --------------------------------------------------------------- */
+/*                                                                  */
+/* end_generated_IBM_copyright_prolog                               */
+/*  (C)Copyright IBM Corp.  2007, 2011  */
+/**
+ * \file src/mpidi_nbc_sched.c
+ * \brief Non-blocking collectives hooks
+ */
+
+#include <pami.h>
+#include <mpidimpl.h>
+
+/**
+ * work object for persistent advance of nbc schedules
+ */
+pami_work_t mpidi_nbc_work_object;
+
+/**
+ * \brief Persistent work function for nbc schedule progress
+ */
+pami_result_t mpidi_nbc_work_function (pami_context_t context, void *cookie)
+{
+  int made_progress = 0;
+  MPIDU_Sched_progress (&made_progress);
+
+  return PAMI_EAGAIN;
+}
+
+/**
+ * \brief Initialize support for MPIR_* nbc implementation.
+ *
+ * The MPIR_* non-blocking collectives only work if the schedule is advanced.
+ * This is done by posting a work function to context 0 that invokes the
+ * schedule progress function.
+ *
+ * Because this is a persistent work function and will negatively impact the
+ * performance of all other MPI operations - even when mpir non-blocking
+ * collectives are not used - the work function is only posted if explicitly
+ * requested.
+ */
+void MPIDI_NBC_init ()
+{
+  if (MPIDI_Process.mpir_nbc != 0)
+  {
+    PAMI_Context_post(MPIDI_Context[0],
+                      &mpidi_nbc_work_object,
+                      mpidi_nbc_work_function,
+                      NULL);
+  }
+
+  return;
+}

http://git.mpich.org/mpich.git/commitdiff/4022fa1dac9d6b9f7f66590ff29bef3c9a2221f1

commit 4022fa1dac9d6b9f7f66590ff29bef3c9a2221f1
Author: Qi QC Zhang <keirazhang at cn.ibm.com>
Date:   Thu Oct 11 01:04:54 2012 -0400

    update error checking in ADIO
    
    (ibm) 70c9c492f206d4b2b124360bd1a8c07283493895
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpi/romio/adio/common/ad_fstype.c b/src/mpi/romio/adio/common/ad_fstype.c
index 01d816f..85d2142 100644
--- a/src/mpi/romio/adio/common/ad_fstype.c
+++ b/src/mpi/romio/adio/common/ad_fstype.c
@@ -204,12 +204,17 @@ static void ADIO_FileSysType_parentdir(const char *filename, char **dirnamep)
 }
 #endif /* ROMIO_NTFS */
 
-#ifdef ROMIO_BGL   /* BlueGene support for lockless i/o (necessary for PVFS.
+#if defined(ROMIO_BGL) || defined(ROMIO_BG)
+		    /* BlueGene support for lockless i/o (necessary for PVFS.
 		      possibly beneficial for others, unless data sieving
 		      writes desired) */
 
 /* BlueGene environment variables can override lockless selection.*/
+#ifdef ROMIO_BG
+extern void ad_bg_get_env_vars();
+#else
 extern void ad_bgl_get_env_vars();
+#endif
 extern long bglocklessmpio_f_type;
 
 static void check_for_lockless_exceptions(long stat_type, int *fstype)
@@ -350,6 +355,16 @@ static void ADIO_FileSysType_fncall(const char *filename, int *fstype, int *erro
     }
 # endif
 
+#ifdef ROMIO_BG
+/* The BlueGene generic ADIO is also a special case. */
+    ad_bg_get_env_vars();
+
+    *fstype = ADIO_BG;
+    check_for_lockless_exceptions(fsbuf.f_type, fstype);
+    *error_code = MPI_SUCCESS;
+    return;
+#endif
+
 #  ifdef ROMIO_BGL 
     /* BlueGene is a special case: all file systems are AD_BGL, except for
      * certain exceptions */
@@ -579,6 +594,9 @@ static void ADIO_FileSysType_prefix(const char *filename, int *fstype, int *erro
     else if (!strncmp(filename, "bgl:", 4) || !strncmp(filename, "BGL:", 4)) {
 	*fstype = ADIO_BGL;
     }
+    else if (!strncmp(filename, "bg:", 3) || !strncmp(filename, "BG:", 3)) {
+	*fstype = ADIO_BG;
+    }
     else if (!strncmp(filename, "bglockless:", 11) || 
 	    !strncmp(filename, "BGLOCKLESS:", 11)) {
 	*fstype = ADIO_BGLOCKLESS;
@@ -828,6 +846,16 @@ void ADIO_ResolveFileType(MPI_Comm comm, const char *filename, int *fstype,
 	*ops = &ADIO_BGL_operations;
 #endif
     }
+    if (file_system == ADIO_BG) {
+#ifndef ROMIO_BG
+	*error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
+					myname, __LINE__, MPI_ERR_IO,
+					"**iofstypeunsupported", 0);
+	return;
+#else
+	*ops = &ADIO_BG_operations;
+#endif
+    }
     if (file_system == ADIO_BGLOCKLESS) {
 #ifndef ROMIO_BGLOCKLESS
 	*error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE, 
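
The `ADIO_FileSysType_prefix` hunk above adds a `bg:`/`BG:` branch to ROMIO's filename-prefix dispatch. A minimal sketch of that pattern, with illustrative `FS_*` values standing in for the real `ADIO_*` constants:

```c
#include <string.h>

/* Illustrative ids only; ROMIO's ADIO_* constants differ. */
enum { FS_UFS = 0, FS_BGL, FS_BG, FS_BGLOCKLESS };

/* Prefix dispatch as in ADIO_FileSysType_prefix: a recognized
   "type:" prefix on the filename selects the driver directly.
   "bgl:" and "bglockless:" must be tested with their full prefix
   length so "bg:" does not shadow them. */
static int fstype_from_prefix(const char *filename)
{
    if (!strncmp(filename, "bgl:", 4) || !strncmp(filename, "BGL:", 4))
        return FS_BGL;
    if (!strncmp(filename, "bglockless:", 11) ||
        !strncmp(filename, "BGLOCKLESS:", 11))
        return FS_BGLOCKLESS;
    if (!strncmp(filename, "bg:", 3) || !strncmp(filename, "BG:", 3))
        return FS_BG;
    return FS_UFS;  /* no recognized prefix: fall back to stat-based detection */
}
```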

http://git.mpich.org/mpich.git/commitdiff/8a5655f69f631bce6c573fe567051b694e3e004d

commit 8a5655f69f631bce6c573fe567051b694e3e004d
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Wed Apr 3 13:09:40 2013 -0500

    BG ROMIO changes that remained unresolved after the merge.
    
    I think these are "really old" baseline changes that didn't get pushed
    into the top-level mpich master branch during previous (?) code
    contributions.
    
    These changes are needed at this point because the following commits,
    which were essentially cherry-picked from the IBM master branch, depend
    on some of the ROMIO changes in this commit.

diff --git a/src/mpi/romio/adio/ad_bg/ad_bg_aggrs.c b/src/mpi/romio/adio/ad_bg/ad_bg_aggrs.c
index c99b2d5..2596b87 100644
--- a/src/mpi/romio/adio/ad_bg/ad_bg_aggrs.c
+++ b/src/mpi/romio/adio/ad_bg/ad_bg_aggrs.c
@@ -186,7 +186,7 @@ ADIOI_BG_compute_agg_ranklist_serial_do (const ADIOI_BG_ConfInfo_t *confInfo,
    /* In this array, we can pick an appropriate number of midpoints based on
     * our bridgenode index and the number of aggregators */
 
-   numAggs = confInfo->aggRatio * confInfo->ioMaxSize /*virtualPsetSize*/;
+   numAggs = confInfo->aggRatio * confInfo->ioMinSize /*virtualPsetSize*/;
    if(numAggs == 1)
       aggTotal = 1;
    else
@@ -194,8 +194,9 @@ ADIOI_BG_compute_agg_ranklist_serial_do (const ADIOI_BG_ConfInfo_t *confInfo,
     * bridge node is an aggregator */
       aggTotal = confInfo->numBridgeRanks * (numAggs+1);
 
-   distance = (confInfo->ioMaxSize /*virtualPsetSize*/ / numAggs);
-   TRACE_ERR("numBridgeRanks: %d, aggRatio: %f numBridge: %d pset size: %d numAggs: %d distance: %d, aggTotal: %d\n", confInfo->numBridgeRanks, confInfo->aggRatio, confInfo->numBridgeRanks,  confInfo->ioMaxSize /*virtualPsetSize*/, numAggs, distance, aggTotal);
+   if(aggTotal>confInfo->nProcs) aggTotal=confInfo->nProcs;
+
+   TRACE_ERR("numBridgeRanks: %d, aggRatio: %f numBridge: %d pset size: %d/%d numAggs: %d, aggTotal: %d\n", confInfo->numBridgeRanks, confInfo->aggRatio, confInfo->numBridgeRanks,  confInfo->ioMinSize, confInfo->ioMaxSize /*virtualPsetSize*/, numAggs, aggTotal);
    aggList = (int *)ADIOI_Malloc(aggTotal * sizeof(int));
 
 
@@ -205,30 +206,59 @@ ADIOI_BG_compute_agg_ranklist_serial_do (const ADIOI_BG_ConfInfo_t *confInfo,
       aggList[0] = bridgelist[0].bridge;
    else
    {
-      for(i=0; i < confInfo->numBridgeRanks; i++)
-      {
-         aggList[i]=bridgelist[i*confInfo->ioMaxSize /*virtualPsetSize*/].bridge;
-         TRACE_ERR("aggList[%d]: %d\n", i, aggList[i]);
-         
+     int lastBridge = bridgelist[confInfo->nProcs-1].bridge;
+     int nextBridge = 0, nextAggr = confInfo->numBridgeRanks;
+     int psetSize = 0;
+     int procIndex;
+     for(procIndex=confInfo->nProcs-1; procIndex>=0; procIndex--)
+     {
+       TRACE_ERR("bridgelist[%d].bridge %u/rank %u\n",procIndex,  bridgelist[procIndex].bridge, bridgelist[procIndex].rank);
+       if(lastBridge == bridgelist[procIndex].bridge)
+       {
+         psetSize++;
+         if(procIndex) continue; 
+         else procIndex--;/* procIndex == 0 */
+       }
+       /* Sets up a list of nodes which will act as aggregators. numAggs
+        * per bridge node total. The list of aggregators is
+        * bridgeNode 0
+        * bridgeNode 1
+        * bridgeNode ...
+        * bridgeNode N
+        * bridgeNode[0]aggr[0]
+        * bridgeNode[0]aggr[1]...
+        * bridgeNode[0]aggr[N]...
+        * ...
+        * bridgeNode[N]aggr[0]..
+        * bridgeNode[N]aggr[N]
+        */
+       aggList[nextBridge]=lastBridge;
+       distance = psetSize/numAggs;
+       TRACE_ERR("nextBridge %u is bridge %u, distance %u, size %u\n",nextBridge, aggList[nextBridge],distance,psetSize);
+       if(numAggs>1)
+       {
          for(j = 0; j < numAggs; j++)
          {
-            /* Sets up a list of nodes which will act as aggregators. numAggs
-             * per bridge node total. The list of aggregators is
-             * bridgeNodes
-             * bridgeNode[0]aggr[0]
-             * bridgeNode[0]aggr[1]...
-             * bridgeNode[0]aggr[N]...
-             * ...
-             * bridgeNode[N]aggr[0]..
-             * bridgeNode[N]aggr[N]
-             */
-            aggList[i*numAggs+j+confInfo->numBridgeRanks] = bridgelist[i*confInfo->ioMaxSize /*virtualPsetSize*/ + j*distance+1].rank;
-            TRACE_ERR("(post bridge) agglist[%d] -> %d\n", confInfo->numBridgeRanks +i*numAggs+j, aggList[i*numAggs+j+confInfo->numBridgeRanks]);
+           ADIOI_BG_assert(nextAggr<aggTotal);
+           aggList[nextAggr] = bridgelist[procIndex+j*distance+1].rank;
+           TRACE_ERR("agglist[%d] -> bridgelist[%d] = %d\n", nextAggr, procIndex+j*distance+1,aggList[nextAggr]);
+           if(aggList[nextAggr]==lastBridge) /* can't have bridge in the list twice */
+           {  
+             aggList[nextAggr] = bridgelist[procIndex+psetSize].rank; /* take the last one in the pset */
+             TRACE_ERR("replacement agglist[%d] -> bridgelist[%d] = %d\n", nextAggr, procIndex+psetSize,aggList[nextAggr]);
+           }
+           nextAggr++;
          }
-      }
+       }
+       if(procIndex<0) break;
+       lastBridge = bridgelist[procIndex].bridge;
+       psetSize = 1;
+       nextBridge++;
+     }
    }
 
-   memcpy(tmp_ranklist, aggList, (numAggs*confInfo->numBridgeRanks+numAggs)*sizeof(int));
+   TRACE_ERR("memcpy(tmp_ranklist, aggList, (numAggs(%u)*confInfo->numBridgeRanks(%u)+numAggs(%u)) (%u) %u*sizeof(int))\n",numAggs,confInfo->numBridgeRanks,numAggs,(numAggs*confInfo->numBridgeRanks+numAggs),aggTotal);
+   memcpy(tmp_ranklist, aggList, aggTotal*sizeof(int));
    for(i=0;i<aggTotal;i++)
    {
       TRACE_ERR("tmp_ranklist[%d]: %d\n", i, tmp_ranklist[i]);
@@ -605,7 +635,6 @@ void ADIOI_BG_Calc_my_req(ADIO_File fd, ADIO_Offset *offset_list, ADIO_Offset *l
 #ifdef AGGREGATION_PROFILE
     MPE_Log_event (5024, 0, NULL);
 #endif
-
     *count_my_req_per_proc_ptr = (int *) ADIOI_Calloc(nprocs,sizeof(int)); 
     count_my_req_per_proc = *count_my_req_per_proc_ptr;
 /* count_my_req_per_proc[i] gives the no. of contig. requests of this
@@ -820,7 +849,7 @@ void ADIOI_BG_Calc_others_req(ADIO_File fd, int count_my_req_procs,
      */
     count_others_req_per_proc = (int *) ADIOI_Malloc(nprocs*sizeof(int));
 /*     cora2a1=timebase(); */
-for(i=0;i<nprocs;i++)
+/*for(i=0;i<nprocs;i++) ?*/
     MPI_Alltoall(count_my_req_per_proc, 1, MPI_INT,
 		 count_others_req_per_proc, 1, MPI_INT, fd->comm);
 
@@ -903,7 +932,7 @@ for(i=0;i<nprocs;i++)
     if ( sendBufForLens    == (void*)0xFFFFFFFFFFFFFFFF) sendBufForLens    = NULL;
 
     /* Calculate the displacements from the sendBufForOffsets/Lens */
-    MPI_Barrier(fd->comm);
+    MPI_Barrier(fd->comm);/* Why?*/
     for (i=0; i<nprocs; i++)
     {
 	/* Send these offsets to process i.*/
diff --git a/src/mpi/romio/adio/ad_bg/ad_bg_pset.c b/src/mpi/romio/adio/ad_bg/ad_bg_pset.c
index 14c5ebc..b5d9026 100644
--- a/src/mpi/romio/adio/ad_bg/ad_bg_pset.c
+++ b/src/mpi/romio/adio/ad_bg/ad_bg_pset.c
@@ -112,7 +112,7 @@ ADIOI_BG_persInfo_init(ADIOI_BG_ConfInfo_t *conf,
       conf->cpuIDsize = hw.ppn;
       /*conf->virtualPsetSize = conf->ioMaxSize * conf->cpuIDsize;*/
       conf->nAggrs = 1;
-      conf->aggRatio = 1. * conf->nAggrs / conf->ioMaxSize /*virtualPsetSize*/;
+      conf->aggRatio = 1. * conf->nAggrs / conf->ioMinSize /*virtualPsetSize*/;
       if(conf->aggRatio > 1) conf->aggRatio = 1.;
       TRACE_ERR("I am (single) Bridge rank\n");
       return;
@@ -194,7 +194,7 @@ ADIOI_BG_persInfo_init(ADIOI_BG_ConfInfo_t *conf,
          if(countPset < mincompute)
             mincompute = countPset;
 
-         /* Is this my bridge? */
+         /* Was this my bridge we finished? */
          if(tempCoords == bridgeCoords)
          {
             /* Am I the bridge rank? */
@@ -208,6 +208,7 @@ ADIOI_BG_persInfo_init(ADIOI_BG_ConfInfo_t *conf,
             proc->myIOSize = countPset;
             proc->ioNodeIndex = bridgeIndex;
          }
+         /* Setup next bridge */
          tempCoords = bridges[i].bridgeCoord & ~1;
          tempRank   = bridges[i].rank;
          bridgeIndex++;
@@ -226,7 +227,7 @@ ADIOI_BG_persInfo_init(ADIOI_BG_ConfInfo_t *conf,
    if(countPset < mincompute)
       mincompute = countPset;
 
-   /* Is this my bridge? */
+   /* Was this my bridge? */
    if(tempCoords == bridgeCoords)
    {
       /* Am I the bridge rank? */
@@ -252,15 +253,17 @@ ADIOI_BG_persInfo_init(ADIOI_BG_ConfInfo_t *conf,
             
       conf->nAggrs = n_aggrs;
       /*    First pass gets nAggrs = -1 */
-      if(conf->nAggrs <=0 || 
-         MIN(conf->nProcs, conf->ioMaxSize /*virtualPsetSize*/) < conf->nAggrs) 
+      if(conf->nAggrs <=0) 
          conf->nAggrs = ADIOI_BG_NAGG_PSET_DFLT;
-      if(conf->nAggrs > conf->numBridgeRanks) /* maybe? * conf->cpuIDsize) */
-         conf->nAggrs = conf->numBridgeRanks; /* * conf->cpuIDsize; */
-   
-      conf->aggRatio = 1. * conf->nAggrs / conf->ioMaxSize /*virtualPsetSize*/;
-      if(conf->aggRatio > 1) conf->aggRatio = 1.;
-      TRACE_ERR("Maximum ranks under a bridge rank: %d, minimum: %d, nAggrs: %d, vps: %d, numBridgeRanks: %d pset dflt: %d naggrs: %d ratio: %f\n", maxcompute, mincompute, conf->nAggrs, conf->ioMaxSize /*virtualPsetSize*/, conf->numBridgeRanks, ADIOI_BG_NAGG_PSET_DFLT, conf->nAggrs, conf->aggRatio);
+      if(conf->ioMinSize <= conf->nAggrs) 
+        conf->nAggrs = MAX(1,conf->ioMinSize-1); /* not including bridge itself */
+/*      if(conf->nAggrs > conf->numBridgeRanks) 
+         conf->nAggrs = conf->numBridgeRanks; 
+*/
+      conf->aggRatio = 1. * conf->nAggrs / conf->ioMinSize /*virtualPsetSize*/;
+/*    if(conf->aggRatio > 1) conf->aggRatio = 1.; */
+      TRACE_ERR("n_aggrs %zd, conf->nProcs %zu, conf->ioMaxSize %zu, ADIOI_BG_NAGG_PSET_DFLT %zu,conf->numBridgeRanks %zu,conf->nAggrs %zu\n",(size_t)n_aggrs, (size_t)conf->nProcs, (size_t)conf->ioMaxSize, (size_t)ADIOI_BG_NAGG_PSET_DFLT,(size_t)conf->numBridgeRanks,(size_t)conf->nAggrs);
+      TRACE_ERR("Maximum ranks under a bridge rank: %d, minimum: %d, nAggrs: %d, numBridgeRanks: %d pset dflt: %d naggrs: %d ratio: %f\n", maxcompute, mincompute, conf->nAggrs, conf->numBridgeRanks, ADIOI_BG_NAGG_PSET_DFLT, conf->nAggrs, conf->aggRatio);
    }
 
    ADIOI_BG_assert((bridgerank != -1));
diff --git a/src/mpi/romio/adio/ad_bglockless/ad_bglockless_features.c b/src/mpi/romio/adio/ad_bglockless/ad_bglockless_features.c
index 4153c5e..5e78f80 100644
--- a/src/mpi/romio/adio/ad_bglockless/ad_bglockless_features.c
+++ b/src/mpi/romio/adio/ad_bglockless/ad_bglockless_features.c
@@ -1,3 +1,22 @@
+/* begin_generated_IBM_copyright_prolog                             */
+/*                                                                  */
+/* This is an automatically generated copyright prolog.             */
+/* After initializing,  DO NOT MODIFY OR MOVE                       */
+/*  --------------------------------------------------------------- */
+/*                                                                  */
+/* Licensed Materials - Property of IBM                             */
+/* Blue Gene/Q                                                      */
+/* (C) Copyright IBM Corp.  2011, 2012                              */
+/* US Government Users Restricted Rights - Use, duplication or      */      
+/*   disclosure restricted by GSA ADP Schedule Contract with IBM    */
+/*   Corp.                                                          */
+/*                                                                  */
+/* This software is available to you under the Eclipse Public       */
+/* License (EPL).                                                   */
+/*                                                                  */
+/*  --------------------------------------------------------------- */
+/*                                                                  */
+/* end_generated_IBM_copyright_prolog                               */
 #include "adio.h"
 
 int ADIOI_BGLOCKLESS_Feature(ADIO_File fd, int flag)
diff --git a/src/mpi/romio/adio/common/ad_get_sh_fp.c b/src/mpi/romio/adio/common/ad_get_sh_fp.c
index d3fc3f5..5610915 100644
--- a/src/mpi/romio/adio/common/ad_get_sh_fp.c
+++ b/src/mpi/romio/adio/common/ad_get_sh_fp.c
@@ -49,6 +49,14 @@ void ADIO_Get_shared_fp(ADIO_File fd, int incr, ADIO_Offset *shared_fp,
 	return;
     }
 #endif
+#ifdef ROMIO_BG
+    /* BGLOCKLESS won't support shared fp */
+    if (fd->file_system == ADIO_BG) {
+	ADIOI_BG_Get_shared_fp(fd, incr, shared_fp, error_code);
+	return;
+    }
+#endif
+
 
     if (fd->shared_fp_fd == ADIO_FILE_NULL) {
 	MPI_Comm_dup(MPI_COMM_SELF, &dupcommself);
diff --git a/src/mpi/romio/adio/common/ad_set_sh_fp.c b/src/mpi/romio/adio/common/ad_set_sh_fp.c
index 2787b3e..ba6affd 100644
--- a/src/mpi/romio/adio/common/ad_set_sh_fp.c
+++ b/src/mpi/romio/adio/common/ad_set_sh_fp.c
@@ -33,6 +33,13 @@ void ADIO_Set_shared_fp(ADIO_File fd, ADIO_Offset offset, int *error_code)
 	return;
     }
 #endif
+#ifdef ROMIO_BG
+    /* BGLOCKLESS won't support shared fp */
+    if (fd->file_system == ADIO_BG) {
+	ADIOI_BG_Set_shared_fp(fd, offset, error_code);
+	return;
+    }
+#endif
 
     if (fd->shared_fp_fd == ADIO_FILE_NULL) {
 	MPI_Comm_dup(MPI_COMM_SELF, &dupcommself);
diff --git a/src/mpi/romio/adio/common/lock.c b/src/mpi/romio/adio/common/lock.c
index d064ede..2590d77 100644
--- a/src/mpi/romio/adio/common/lock.c
+++ b/src/mpi/romio/adio/common/lock.c
@@ -153,7 +153,7 @@ int ADIOI_Set_lock(FDTYPE fd, int cmd, int type, ADIO_Offset offset, int whence,
     if (err && (errno != EBADF)) {
 	/* FIXME: This should use the error message system, 
 	   especially for MPICH */
-	FPRINTF(stderr, "File locking failed in ADIOI_Set_lock(fd %X,cmd %s/%X,type %s/%X,whence %X) with return value %X and errno %X.\n"
+	FPRINTF(stderr, "This requires fcntl(2) to be implemented. As of 8/25/2011 it is not. Generic MPICH Message: File locking failed in ADIOI_Set_lock(fd %X,cmd %s/%X,type %s/%X,whence %X) with return value %X and errno %X.\n"
                   "- If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).\n"
                   "- If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option.\n",
           fd,
diff --git a/src/mpi/romio/adio/include/adio.h b/src/mpi/romio/adio/include/adio.h
index 4d4acf9..067641f 100644
--- a/src/mpi/romio/adio/include/adio.h
+++ b/src/mpi/romio/adio/include/adio.h
@@ -293,6 +293,7 @@ typedef struct {
 #define ADIO_BGL                 164   /* IBM BGL */
 #define ADIO_BGLOCKLESS          165   /* IBM BGL (lock-free) */
 #define ADIO_ZOIDFS              167   /* ZoidFS: the I/O forwarding fs */
+#define ADIO_BG                  168
 
 #define ADIO_SEEK_SET            SEEK_SET
 #define ADIO_SEEK_CUR            SEEK_CUR
diff --git a/src/mpi/romio/adio/include/adioi_fs_proto.h b/src/mpi/romio/adio/include/adioi_fs_proto.h
index d28c123..65f0183 100644
--- a/src/mpi/romio/adio/include/adioi_fs_proto.h
+++ b/src/mpi/romio/adio/include/adioi_fs_proto.h
@@ -79,6 +79,11 @@ extern struct ADIOI_Fns_struct ADIO_BGL_operations;
 /* prototypes are in adio/ad_bgl/ad_bgl.h */
 #endif
 
+#ifdef ROMIO_BG
+extern struct ADIOI_Fns_struct ADIO_BG_operations;
+/* prototypes are in adio/ad_bg/ad_bg.h */
+#endif
+
 #ifdef ROMIO_BGLOCKLESS
 extern struct ADIOI_Fns_struct ADIO_BGLOCKLESS_operations;
 /* no extra prototypes for this fs at this time */
diff --git a/src/mpi/romio/configure.ac b/src/mpi/romio/configure.ac
index 2ee57b0..4530838 100644
--- a/src/mpi/romio/configure.ac
+++ b/src/mpi/romio/configure.ac
@@ -1171,15 +1171,15 @@ if test -n "$file_system_bg"; then
     AC_DEFINE(ROMIO_BG,1,[Define for ROMIO with BG])
 fi
 if test -n "$file_system_bglockless"; then
-    if test x"$file_system_bgl" != x; then
+    if test -n "$file_system_bgl"; then
         AC_DEFINE(ROMIO_BGLOCKLESS,1,[Define for lock-free ROMIO with BGL])
     fi
 
-    if test x"$file_system_bg" != x; then
+    if test -n "$file_system_bg"; then
         AC_DEFINE(ROMIO_BGLOCKLESS,1,[Define for lock-free ROMIO with BG])
     fi
 
-    if test x"$ROMIO_BGLOCKLESS" -ne x1; then
+    if test -n "$ROMIO_BGLOCKLESS"; then
         AC_MSG_ERROR("bglockless requested without [bgl|bg]")
     fi
 fi

http://git.mpich.org/mpich.git/commitdiff/136bf33fddd2333b0d9b0199dbf6582975f81b85

commit 136bf33fddd2333b0d9b0199dbf6582975f81b85
Author: Qi QC Zhang <keirazhang at cn.ibm.com>
Date:   Thu Oct 11 01:35:11 2012 -0400

    handle ENOMEM errors in ADIO
    
    (ibm) f8d7585697d27676bc15f9b47ce41a8fa5536a11
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpi/romio/adio/common/cb_config_list.c b/src/mpi/romio/adio/common/cb_config_list.c
index 9f62933..83e9f29 100644
--- a/src/mpi/romio/adio/common/cb_config_list.c
+++ b/src/mpi/romio/adio/common/cb_config_list.c
@@ -688,7 +688,7 @@ static int get_max_procs(int cb_nodes)
  *
  * Returns a token of types defined at top of this file.
  */
-#ifdef ROMIO_BGL
+#if defined(ROMIO_BGL) || defined(ROMIO_BG)
 /* On BlueGene, the ',' character shows up in get_processor_name, so we have to
  * use a different delimiter */
 #define COLON ':'

http://git.mpich.org/mpich.git/commitdiff/0994aab0c7f2febd633da2a0097626a57796179e

commit 0994aab0c7f2febd633da2a0097626a57796179e
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Wed Oct 17 16:27:39 2012 -0500

    Split large broadcasts into smaller broadcasts
    
    Also a little streamlining/cleanup of collectives, including:
    - ndebug changes to remove verbose logging
    - use local const variables to cache pointer references
    - likely/unlikely code path changes
    
    (ibm) Issue 8863
    (ibm) f747641815a2250a53407f01c12592dfe5c8ae33
    
    Signed-off-by: Su Huang <suhuang at us.ibm.com>

diff --git a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
index 0d48f6a..d8c8ff6 100644
--- a/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
+++ b/src/mpid/pamid/src/coll/allgather/mpido_allgather.c
@@ -20,7 +20,7 @@
  * \brief ???
  */
 
-/*#define TRACE_ON */
+/* #define TRACE_ON */
 #include <mpidimpl.h>
 
 
@@ -61,11 +61,10 @@ int MPIDO_Allgather_allreduce(const void *sendbuf,
                               int *mpierrno)
 
 {
-  int rc, rank;
+  int rc;
   char *startbuf = NULL;
   char *destbuf = NULL;
-
-  rank = comm_ptr->rank;
+  const int rank = comm_ptr->rank;
 
   startbuf   = (char *) recvbuf + recv_true_lb;
   destbuf    = startbuf + rank * send_size;
@@ -80,12 +79,12 @@ int MPIDO_Allgather_allreduce(const void *sendbuf,
   }
   /* TODO: Change to PAMI */
   rc = MPIDO_Allreduce(MPI_IN_PLACE,
-		       startbuf,
-		       recv_size/sizeof(int),
-		       MPI_INT,
-		       MPI_BOR,
-		       comm_ptr,
-           mpierrno);
+                       startbuf,
+                       recv_size/sizeof(unsigned),
+                       MPI_UNSIGNED,
+                       MPI_BOR,
+                       comm_ptr,
+                       mpierrno);
 
   return rc;
 }
@@ -100,20 +99,21 @@ int MPIDO_Allgather_allreduce(const void *sendbuf,
  */
 /* ****************************************************************** */
 int MPIDO_Allgather_bcast(const void *sendbuf,
-			  int sendcount,
-			  MPI_Datatype sendtype,
-			  void *recvbuf,
-			  int recvcount,
-			  MPI_Datatype recvtype,
-			  MPI_Aint send_true_lb,
-			  MPI_Aint recv_true_lb,
-			  size_t send_size,
-			  size_t recv_size,
-			  MPID_Comm * comm_ptr,
+                          int sendcount,
+                          MPI_Datatype sendtype,
+                          void *recvbuf,
+                          int recvcount,  
+                          MPI_Datatype recvtype,
+                          MPI_Aint send_true_lb,
+                          MPI_Aint recv_true_lb,
+                          size_t send_size,
+                          size_t recv_size,
+                          MPID_Comm * comm_ptr,
                           int *mpierrno)
 {
   int i, np, rc = 0;
   MPI_Aint extent;
+  const int rank = comm_ptr->rank;
 
   np = comm_ptr ->local_size;
   MPID_Datatype_get_extent_macro(recvtype, extent);
@@ -122,7 +122,7 @@ int MPIDO_Allgather_bcast(const void *sendbuf,
 				     np * recvcount * extent));
   if (sendbuf != MPI_IN_PLACE)
   {
-    void *destbuf = recvbuf + comm_ptr->rank * recvcount * extent;
+    void *destbuf = recvbuf + rank * recvcount * extent;
     MPIR_Localcopy(sendbuf,
                    sendcount,
                    sendtype,
@@ -175,13 +175,15 @@ int MPIDO_Allgather_alltoall(const void *sendbuf,
   void *a2a_sendbuf = NULL;
   char *destbuf=NULL;
   char *startbuf=NULL;
+  const int size = comm_ptr->local_size;
+  const int rank = comm_ptr->rank;
 
-  int a2a_sendcounts[comm_ptr->local_size];
-  int a2a_senddispls[comm_ptr->local_size];
-  int a2a_recvcounts[comm_ptr->local_size];
-  int a2a_recvdispls[comm_ptr->local_size];
+  int a2a_sendcounts[size];
+  int a2a_senddispls[size];
+  int a2a_recvcounts[size];
+  int a2a_recvdispls[size];
 
-  for (i = 0; i < comm_ptr->local_size; ++i)
+  for (i = 0; i < size; ++i)
   {
     a2a_sendcounts[i] = send_size;
     a2a_senddispls[i] = 0;
@@ -195,11 +197,11 @@ int MPIDO_Allgather_alltoall(const void *sendbuf,
   else
   {
     startbuf = (char *) recvbuf + recv_true_lb;
-    destbuf = startbuf + comm_ptr->rank * send_size;
+    destbuf = startbuf + rank * send_size;
     a2a_sendbuf = destbuf;
-    a2a_sendcounts[comm_ptr->rank] = 0;
+    a2a_sendcounts[rank] = 0;
 
-    a2a_recvcounts[comm_ptr->rank] = 0;
+    a2a_recvcounts[rank] = 0;
   }
 
 /* TODO: Change to PAMI */
@@ -234,7 +236,7 @@ MPIDO_Allgather(const void *sendbuf,
    * Check the nature of the buffers
    * *********************************
    */
-/*  MPIDO_Coll_config config = {1,1,1,1,1,1};*/
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
    int config[6], i;
    MPID_Datatype * dt_null = NULL;
    MPI_Aint send_true_lb = 0;
@@ -245,15 +247,24 @@ MPIDO_Allgather(const void *sendbuf,
    volatile unsigned allred_active = 1;
    volatile unsigned allgather_active = 1;
    pami_xfer_t allred;
+   const int rank = comm_ptr->rank;
+#if ASSERT_LEVEL==0
+   /* We can't afford the tracing in ndebug/performance libraries */
+    const unsigned verbose = 0;
+#else
+    const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (rank == 0);
+#endif
+   const int selected_type = mpid->user_selected_type[PAMI_XFER_ALLGATHER];
+
    for (i=0;i<6;i++) config[i] = 1;
-   pami_metadata_t *my_md;
+   const pami_metadata_t *my_md;
 
 
    allred.cb_done = allred_cb_done;
    allred.cookie = (void *)&allred_active;
    /* Pick an algorithm that is guaranteed to work for the pre-allreduce */
    /* TODO: This needs selection for fast(er|est) allreduce protocol */
-   allred.algorithm = comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLREDUCE][0][0]; 
+   allred.algorithm = mpid->coll_algorithm[PAMI_XFER_ALLREDUCE][0][0]; 
    allred.cmd.xfer_allreduce.sndbuf = (void *)config;
    allred.cmd.xfer_allreduce.stype = PAMI_TYPE_SIGNED_INT;
    allred.cmd.xfer_allreduce.rcvbuf = (void *)config;
@@ -265,20 +276,19 @@ MPIDO_Allgather(const void *sendbuf,
   char use_tree_reduce, use_alltoall, use_bcast, use_pami, use_opt;
   char *rbuf = NULL, *sbuf = NULL;
 
-   use_alltoall = comm_ptr->mpid.allgathers[2];
-   use_tree_reduce = comm_ptr->mpid.allgathers[0];
-   use_bcast = comm_ptr->mpid.allgathers[1];
-   use_pami = 
-      (comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHER] == MPID_COLL_USE_MPICH) ? 0 : 1;
-/*   if(sendbuf == MPI_IN_PLACE) use_pami = 0;*/
+   const char * const allgathers = mpid->allgathers;
+   use_alltoall = allgathers[2];
+   use_tree_reduce = allgathers[0];
+   use_bcast = allgathers[1];
+   use_pami = (selected_type == MPID_COLL_USE_MPICH) ? 0 : 1;
    use_opt = use_alltoall || use_tree_reduce || use_bcast || use_pami;
 
 
    TRACE_ERR("flags before: b: %d a: %d t: %d p: %d\n", use_bcast, use_alltoall, use_tree_reduce, use_pami);
    if(!use_opt)
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
-         fprintf(stderr,"Using MPICH allgather algorithm\n");
+     if(unlikely(verbose))
+       fprintf(stderr,"Using MPICH allgather algorithm\n");
       TRACE_ERR("No options set/available; using MPICH for allgather\n");
       MPIDI_Update_last_algorithm(comm_ptr, "ALLGATHER_MPICH");
       return MPIR_Allgather(sendbuf, sendcount, sendtype,
@@ -299,9 +309,10 @@ MPIDO_Allgather(const void *sendbuf,
    send_size = recv_size;
    rbuf = (char *)recvbuf+recv_true_lb;
 
+   sbuf = (char *)recvbuf+recv_size*rank;
    if(sendbuf != MPI_IN_PLACE)
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL))
+     if(unlikely(verbose))
          fprintf(stderr,"allgather MPI_IN_PLACE buffering\n");
       MPIDI_Datatype_get_info(sendcount,
                             sendtype,
@@ -311,13 +322,6 @@ MPIDO_Allgather(const void *sendbuf,
                             send_true_lb);
       sbuf = (char *)sendbuf+send_true_lb;
    }
-   else
-   {
-      sbuf = (char *)recvbuf+recv_size*comm_ptr->rank;
-   }
-/*   fprintf(stderr,"sendount: %d, recvcount: %d send_size: %zd
-     recv_size: %zd\n", sendcount, recvcount, send_size,
-     recv_size);*/
 
   /* verify everyone's datatype contiguity */
   /* Check buffer alignment now, since we're pre-allreducing anyway */
@@ -328,7 +332,7 @@ MPIDO_Allgather(const void *sendbuf,
                !((long)sendbuf & 0x0F) && !((long)recvbuf & 0x0F);
 
       /* #warning need to determine best allreduce for short messages */
-      if(comm_ptr->mpid.preallreduces[MPID_ALLGATHER_PREALLREDUCE])
+      if(mpid->preallreduces[MPID_ALLGATHER_PREALLREDUCE])
       {
          TRACE_ERR("Preallreducing in allgather\n");
          MPIDI_Post_coll_t allred_post;
@@ -339,17 +343,17 @@ MPIDO_Allgather(const void *sendbuf,
      }
 
 
-       use_alltoall = comm_ptr->mpid.allgathers[2] &&
+       use_alltoall = allgathers[2] &&
            config[MPID_RECV_CONTIG] && config[MPID_SEND_CONTIG];
 
       /* Note: some of the glue protocols use recv_size*comm_size rather than 
        * recv_size so we use that for comparison here, plus we pass that in
        * to those protocols. */
-       use_tree_reduce = comm_ptr->mpid.allgathers[0] &&
+       use_tree_reduce =  allgathers[0] &&
          config[MPID_RECV_CONTIG] && config[MPID_SEND_CONTIG] &&
-         config[MPID_RECV_CONTINUOUS] && (recv_size*comm_size % sizeof(int) == 0);
+         config[MPID_RECV_CONTINUOUS] && (recv_size*comm_size%sizeof(unsigned)) == 0;
 
-       use_bcast = comm_ptr->mpid.allgathers[1];
+       use_bcast = allgathers[1];
 
        TRACE_ERR("flags after: b: %d a: %d t: %d p: %d\n", use_bcast, use_alltoall, use_tree_reduce, use_pami);
    }
@@ -365,34 +369,48 @@ MPIDO_Allgather(const void *sendbuf,
       allgather.cmd.xfer_allgather.rtype = PAMI_TYPE_BYTE;
       allgather.cmd.xfer_allgather.stypecount = send_size;
       allgather.cmd.xfer_allgather.rtypecount = recv_size;
-      if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHER] == MPID_COLL_OPTIMIZED)
+      if(selected_type == MPID_COLL_OPTIMIZED)
       {
-         allgather.algorithm = comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLGATHER][0];
-         my_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLGATHER][0];
+        if((mpid->cutoff_size[PAMI_XFER_ALLGATHER][0] == 0) || 
+	    (mpid->cutoff_size[PAMI_XFER_ALLGATHER][0] > 0 && mpid->cutoff_size[PAMI_XFER_ALLGATHER][0] >= send_size))
+        {
+           allgather.algorithm = mpid->opt_protocol[PAMI_XFER_ALLGATHER][0];
+           my_md = &mpid->opt_protocol_md[PAMI_XFER_ALLGATHER][0];
+        }
+        else
+        {
+           return MPIR_Allgather(sendbuf, sendcount, sendtype,
+                       recvbuf, recvcount, recvtype,
+                       comm_ptr, mpierrno);
+        }
       }
       else
       {
-         allgather.algorithm = comm_ptr->mpid.user_selected[PAMI_XFER_ALLGATHER];
-         my_md = &comm_ptr->mpid.user_metadata[PAMI_XFER_ALLGATHER];
+         allgather.algorithm = mpid->user_selected[PAMI_XFER_ALLGATHER];
+         my_md = &mpid->user_metadata[PAMI_XFER_ALLGATHER];
       }
 
-      if(unlikely( comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHER] == MPID_COLL_ALWAYS_QUERY ||
-                   comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHER] == MPID_COLL_CHECK_FN_REQUIRED))
+      if(unlikely( selected_type == MPID_COLL_ALWAYS_QUERY ||
+                   selected_type == MPID_COLL_CHECK_FN_REQUIRED))
       {
          metadata_result_t result = {0};
          TRACE_ERR("Querying allgather protocol %s, type was: %d\n",
             my_md->name,
-            comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHER]);
+            selected_type);
          result = my_md->check_fn(&allgather);
          TRACE_ERR("bitmask: %#X\n", result.bitmask);
          if(!result.bitmask)
          {
+      if(unlikely(verbose))
             fprintf(stderr,"Query failed for %s.\n",
                my_md->name);
+           return MPIR_Allgather(sendbuf, sendcount, sendtype,
+                       recvbuf, recvcount, recvtype,
+                       comm_ptr, mpierrno);
          }
       }
 
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+      if(unlikely(verbose))
       {
          unsigned long long int threadID;
          MPIU_Thread_id_t tid;
@@ -409,7 +427,6 @@ MPIDO_Allgather(const void *sendbuf,
       TRACE_ERR("Allgather %s\n", MPIDI_Process.context_post.active>0?"posted":"invoked");
 
       MPIDI_Update_last_algorithm(comm_ptr, my_md->name);
-
       MPID_PROGRESS_WAIT_WHILE(allgather_active);
       TRACE_ERR("Allgather done\n");
       return PAMI_SUCCESS;
@@ -417,35 +434,38 @@ MPIDO_Allgather(const void *sendbuf,
 
    if(use_tree_reduce)
    {
+      if(unlikely(verbose))
+         fprintf(stderr,"Using protocol GLUE_ALLREDUCE for allgather\n");
       TRACE_ERR("Using allgather via allreduce\n");
       MPIDI_Update_last_algorithm(comm_ptr, "ALLGATHER_OPT_ALLREDUCE");
-     rc = MPIDO_Allgather_allreduce(sendbuf, sendcount, sendtype,
+     return MPIDO_Allgather_allreduce(sendbuf, sendcount, sendtype,
                                recvbuf, recvcount, recvtype,
                                send_true_lb, recv_true_lb, send_size, recv_size*comm_size, comm_ptr, mpierrno);
-      return rc;
    }
    if(use_alltoall)
    {
+      if(unlikely(verbose))
+         fprintf(stderr,"Using protocol GLUE_ALLTOALL for allgather\n");
       TRACE_ERR("Using allgather via alltoall\n");
       MPIDI_Update_last_algorithm(comm_ptr, "ALLGATHER_OPT_ALLTOALL");
-     rc = MPIDO_Allgather_alltoall(sendbuf, sendcount, sendtype,
+     return MPIDO_Allgather_alltoall(sendbuf, sendcount, sendtype,
                                recvbuf, recvcount, recvtype,
                                send_true_lb, recv_true_lb, send_size, recv_size*comm_size, comm_ptr, mpierrno);
-      return rc;
    }
 
    if(use_bcast)
    {
+      if(unlikely(verbose))
+         fprintf(stderr,"Using protocol GLUE_BCAST for allgather\n");
       TRACE_ERR("Using allgather via bcast\n");
-      MPIDI_Update_last_algorithm(comm_ptr, "ALLGATHER_OPT_BCAST");
-     rc = MPIDO_Allgather_bcast(sendbuf, sendcount, sendtype,
+     MPIDI_Update_last_algorithm(comm_ptr, "ALLGATHER_OPT_BCAST");
+     return MPIDO_Allgather_bcast(sendbuf, sendcount, sendtype,
                                recvbuf, recvcount, recvtype,
                                send_true_lb, recv_true_lb, send_size, recv_size*comm_size, comm_ptr, mpierrno);
-      return rc;
    }
    
    /* Nothing used yet; dump to MPICH */
-   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+   if(unlikely(verbose))
       fprintf(stderr,"Using MPICH allgather algorithm\n");
    TRACE_ERR("Using allgather via mpich\n");
    MPIDI_Update_last_algorithm(comm_ptr, "ALLGATHER_MPICH");
diff --git a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
index 9ccff6d..e67ba76 100644
--- a/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
+++ b/src/mpid/pamid/src/coll/allgatherv/mpido_allgatherv.c
@@ -20,7 +20,7 @@
  * \brief ???
  */
 
-/*#define TRACE_ON */
+/* #define TRACE_ON */
 #include <mpidimpl.h>
 
 static void allgatherv_cb_done(void *ctxt, void *clientdata, pami_result_t err)
@@ -65,19 +65,20 @@ int MPIDO_Allgatherv_allreduce(const void *sendbuf,
   int length;
   char *startbuf = NULL;
   char *destbuf = NULL;
+  const int rank = comm_ptr->rank;
   TRACE_ERR("Entering MPIDO_Allgatherv_allreduce\n");
 
   startbuf = (char *) recvbuf + recv_true_lb;
-  destbuf = startbuf + displs[comm_ptr->rank] * recv_size;
+  destbuf = startbuf + displs[rank] * recv_size;
 
   start = 0;
-  length = displs[comm_ptr->rank] * recv_size;
+  length = displs[rank] * recv_size;
   memset(startbuf + start, 0, length);
 
-  start  = (displs[comm_ptr->rank] +
-	    recvcounts[comm_ptr->rank]) * recv_size;
-  length = buffer_sum - (displs[comm_ptr->rank] +
-			 recvcounts[comm_ptr->rank]) * recv_size;
+  start  = (displs[rank] +
+	    recvcounts[rank]) * recv_size;
+  length = buffer_sum - (displs[rank] +
+			 recvcounts[rank]) * recv_size;
   memset(startbuf + start, 0, length);
 
   if (sendbuf != MPI_IN_PLACE)
@@ -86,14 +87,13 @@ int MPIDO_Allgatherv_allreduce(const void *sendbuf,
     memcpy(destbuf, outputbuf, send_size);
   }
 
-  /*if (0==comm_ptr->rank) puts("allreduce allgatherv");*/
 
    TRACE_ERR("Calling MPIDO_Allreduce from MPIDO_Allgatherv_allreduce\n");
    /* TODO: Change to PAMI allreduce */
   rc = MPIDO_Allreduce(MPI_IN_PLACE,
 		       startbuf,
-		       buffer_sum/sizeof(int),
-		       MPI_INT,
+		       buffer_sum/sizeof(unsigned),
+		       MPI_UNSIGNED,
 		       MPI_BOR,
 		       comm_ptr,
                        mpierrno);
@@ -127,6 +127,7 @@ int MPIDO_Allgatherv_bcast(const void *sendbuf,
 			   MPID_Comm * comm_ptr,
                            int *mpierrno)
 {
+   const int rank = comm_ptr->rank;
    TRACE_ERR("Entering MPIDO_Allgatherv_bcast\n");
   int i, rc=MPI_ERR_INTERN;
   MPI_Aint extent;
@@ -134,12 +135,12 @@ int MPIDO_Allgatherv_bcast(const void *sendbuf,
 
   if (sendbuf != MPI_IN_PLACE)
   {
-    void *destbuffer = recvbuf + displs[comm_ptr->rank] * extent;
+    void *destbuffer = recvbuf + displs[rank] * extent;
     MPIR_Localcopy(sendbuf,
                    sendcount,
                    sendtype,
                    destbuffer,
-                   recvcounts[comm_ptr->rank],
+                   recvcounts[rank],
                    recvtype);
   }
 
@@ -155,8 +156,7 @@ int MPIDO_Allgatherv_bcast(const void *sendbuf,
                      comm_ptr,
                      mpierrno);
   }
-  /*if (0==comm_ptr->rank) puts("bcast allgatherv");*/
-   TRACE_ERR("Leaving MPIDO_Allgatherv_bcast\n");
+  TRACE_ERR("Leaving MPIDO_Allgatherv_bcast\n");
 
   return rc;
 }
@@ -193,11 +193,13 @@ int MPIDO_Allgatherv_alltoall(const void *sendbuf,
   int i, rc;
   int my_recvcounts = -1;
   void *a2a_sendbuf = NULL;
-  int a2a_sendcounts[comm_ptr->local_size];
-  int a2a_senddispls[comm_ptr->local_size];
+  const int size = comm_ptr->local_size;
+  int a2a_sendcounts[size];
+  int a2a_senddispls[size];
+   const int rank = comm_ptr->rank;
 
-  total_send_size = recvcounts[comm_ptr->rank] * recv_size;
-  for (i = 0; i < comm_ptr->local_size; ++i)
+  total_send_size = recvcounts[rank] * recv_size;
+  for (i = 0; i < size; ++i)
   {
     a2a_sendcounts[i] = total_send_size;
     a2a_senddispls[i] = 0;
@@ -209,16 +211,15 @@ int MPIDO_Allgatherv_alltoall(const void *sendbuf,
   else
   {
     startbuf = (char *) recvbuf + recv_true_lb;
-    destbuf = startbuf + displs[comm_ptr->rank] * recv_size;
+    destbuf = startbuf + displs[rank] * recv_size;
     a2a_sendbuf = destbuf;
-    a2a_sendcounts[comm_ptr->rank] = 0;
-    my_recvcounts = recvcounts[comm_ptr->rank];
-    recvcounts[comm_ptr->rank] = 0;
+    a2a_sendcounts[rank] = 0;
+    my_recvcounts = recvcounts[rank];
+    recvcounts[rank] = 0;
   }
 
    TRACE_ERR("Calling alltoallv in MPIDO_Allgatherv_alltoallv\n");
    /* TODO: Change to PAMI alltoallv */
-  /*if (0==comm_ptr->rank) puts("all2all allgatherv");*/
   rc = MPIR_Alltoallv(a2a_sendbuf,
 		       a2a_sendcounts,
 		       a2a_senddispls,
@@ -230,7 +231,7 @@ int MPIDO_Allgatherv_alltoall(const void *sendbuf,
 		       comm_ptr,
 		       mpierrno);
   if (sendbuf == MPI_IN_PLACE)
-    recvcounts[comm_ptr->rank] = my_recvcounts;
+    recvcounts[rank] = my_recvcounts;
 
    TRACE_ERR("Leaving MPIDO_Allgatherv_alltoallv\n");
   return rc;
@@ -261,22 +262,33 @@ MPIDO_Allgatherv(const void *sendbuf,
   double msize;
   int scount=sendcount;
 
-  int i, rc, buffer_sum = 0, np = comm_ptr->local_size;
+  int i, rc, buffer_sum = 0;
+  const int size = comm_ptr->local_size;
   char use_tree_reduce, use_alltoall, use_bcast, use_pami, use_opt;
   char *sbuf, *rbuf;
+  const int rank = comm_ptr->rank;
+  const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+
+#if ASSERT_LEVEL==0
+   /* We can't afford the tracing in ndebug/performance libraries */
+    const unsigned verbose = 0;
+#else
+   const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (rank == 0);
+#endif
+   const int selected_type = mpid->user_selected_type[PAMI_XFER_ALLGATHERV_INT];
 
   pami_xfer_t allred;
   volatile unsigned allred_active = 1;
   volatile unsigned allgatherv_active = 1;
   pami_type_t stype, rtype;
   int tmp;
-  pami_metadata_t *my_md;
+  const pami_metadata_t *my_md;
 
   for(i=0;i<6;i++) config[i] = 1;
 
   allred.cb_done = allred_cb_done;
   allred.cookie = (void *)&allred_active;
-  allred.algorithm = comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLREDUCE][0][0];
+  allred.algorithm = mpid->coll_algorithm[PAMI_XFER_ALLREDUCE][0][0];
   allred.cmd.xfer_allreduce.sndbuf = (void *)config;
   allred.cmd.xfer_allreduce.stype = PAMI_TYPE_SIGNED_INT;
   allred.cmd.xfer_allreduce.rcvbuf = (void *)config;
@@ -285,11 +297,11 @@ MPIDO_Allgatherv(const void *sendbuf,
   allred.cmd.xfer_allreduce.rtypecount = 6;
   allred.cmd.xfer_allreduce.op = PAMI_DATA_BAND;
 
-   use_alltoall = comm_ptr->mpid.allgathervs[2];
-   use_tree_reduce = comm_ptr->mpid.allgathervs[0];
-   use_bcast = comm_ptr->mpid.allgathervs[1];
+   use_alltoall = mpid->allgathervs[2];
+   use_tree_reduce = mpid->allgathervs[0];
+   use_bcast = mpid->allgathervs[1];
    /* Assuming PAMI doesn't support MPI_IN_PLACE */
-   use_pami = comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] != MPID_COLL_USE_MPICH;
+   use_pami = selected_type != MPID_COLL_USE_MPICH;
 	 
    if((sendbuf != MPI_IN_PLACE) && (MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS))
      use_pami = 0;
@@ -300,9 +312,9 @@ MPIDO_Allgatherv(const void *sendbuf,
 
    if(!use_opt) /* back to MPICH */
    {
-     if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
-     fprintf(stderr,"Using MPICH allgatherv type %u.\n",
-             comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT]);
+     if(unlikely(verbose))
+       fprintf(stderr,"Using MPICH allgatherv type %u.\n",
+             selected_type);
      TRACE_ERR("Using MPICH Allgatherv\n");
      MPIDI_Update_last_algorithm(comm_ptr, "ALLGATHERV_MPICH");
      return MPIR_Allgatherv(sendbuf, sendcount, sendtype,
@@ -319,13 +331,13 @@ MPIDO_Allgatherv(const void *sendbuf,
 
    if(sendbuf == MPI_IN_PLACE)
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL))
-         fprintf(stderr,"allgatherv MPI_IN_PLACE buffering\n");
-        sbuf = (char *)recvbuf+displs[comm_ptr->rank]*recv_size;
-        send_true_lb = recv_true_lb;
-        stype = rtype;
-        scount = recvcounts[comm_ptr->rank];
-        send_size = recv_size * scount; 
+     if(unlikely(verbose))
+       fprintf(stderr,"allgatherv MPI_IN_PLACE buffering\n");
+     sbuf = (char *)recvbuf+displs[rank]*recv_size;
+     send_true_lb = recv_true_lb;
+     stype = rtype;
+     scount = recvcounts[rank];
+     send_size = recv_size * scount; 
    }
    else
    {
@@ -345,7 +357,7 @@ MPIDO_Allgatherv(const void *sendbuf,
       if (displs[0])
        config[MPID_RECV_CONTINUOUS] = 0;
 
-      for (i = 1; i < np; i++)
+      for (i = 1; i < size; i++)
       {
         buffer_sum += recvcounts[i - 1];
         if (buffer_sum != displs[i])
@@ -355,13 +367,13 @@ MPIDO_Allgatherv(const void *sendbuf,
         }
       }
 
-      buffer_sum += recvcounts[np - 1];
+      buffer_sum += recvcounts[size - 1];
 
       buffer_sum *= recv_size;
-      msize = (double)buffer_sum / (double)np;
+      msize = (double)buffer_sum / (double)size;
 
       /* disable with "safe allgatherv" env var */
-      if(comm_ptr->mpid.preallreduces[MPID_ALLGATHERV_PREALLREDUCE])
+      if(mpid->preallreduces[MPID_ALLGATHERV_PREALLREDUCE])
       {
          MPIDI_Post_coll_t allred_post;
          MPIDI_Context_post(MPIDI_Context[0], &allred_post.state,
@@ -370,14 +382,14 @@ MPIDO_Allgatherv(const void *sendbuf,
          MPID_PROGRESS_WAIT_WHILE(allred_active);
       }
 
-      use_tree_reduce = comm_ptr->mpid.allgathervs[0] &&
+      use_tree_reduce = mpid->allgathervs[0] &&
          config[MPID_RECV_CONTIG] && config[MPID_SEND_CONTIG] &&
-         config[MPID_RECV_CONTINUOUS] && buffer_sum % sizeof(int) == 0;
+         config[MPID_RECV_CONTINUOUS] && buffer_sum % sizeof(unsigned) == 0;
 
-      use_alltoall = comm_ptr->mpid.allgathervs[2] &&
+      use_alltoall = mpid->allgathervs[2] &&
          config[MPID_RECV_CONTIG] && config[MPID_SEND_CONTIG];
 
-      use_bcast = comm_ptr->mpid.allgathervs[1];
+      use_bcast = mpid->allgathervs[1];
    }
 
    if(use_pami)
@@ -385,15 +397,23 @@ MPIDO_Allgatherv(const void *sendbuf,
       pami_xfer_t allgatherv;
       allgatherv.cb_done = allgatherv_cb_done;
       allgatherv.cookie = (void *)&allgatherv_active;
-      if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] == MPID_COLL_OPTIMIZED)
-      {  
-        allgatherv.algorithm = comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLGATHERV_INT][0];
-        my_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLGATHERV_INT][0];
+      if(selected_type == MPID_COLL_OPTIMIZED)
+      {
+        if((mpid->cutoff_size[PAMI_XFER_ALLGATHERV_INT][0] == 0) ||
+           (mpid->cutoff_size[PAMI_XFER_ALLGATHERV_INT][0] > 0 && mpid->cutoff_size[PAMI_XFER_ALLGATHERV_INT][0] >= send_size))
+        {
+          allgatherv.algorithm = mpid->opt_protocol[PAMI_XFER_ALLGATHERV_INT][0];
+          my_md = &mpid->opt_protocol_md[PAMI_XFER_ALLGATHERV_INT][0];
+        }
+        else
+          return MPIR_Allgatherv(sendbuf, sendcount, sendtype,
+                       recvbuf, recvcounts, displs, recvtype,
+                       comm_ptr, mpierrno);
       }
       else
       {  
-        allgatherv.algorithm = comm_ptr->mpid.user_selected[PAMI_XFER_ALLGATHERV_INT];
-        my_md = &comm_ptr->mpid.user_metadata[PAMI_XFER_ALLGATHERV_INT];
+        allgatherv.algorithm = mpid->user_selected[PAMI_XFER_ALLGATHERV_INT];
+        my_md = &mpid->user_metadata[PAMI_XFER_ALLGATHERV_INT];
       }
       
       allgatherv.cmd.xfer_allgatherv_int.sndbuf = sbuf;
@@ -405,21 +425,26 @@ MPIDO_Allgatherv(const void *sendbuf,
       allgatherv.cmd.xfer_allgatherv_int.rtypecounts = (int *) recvcounts;
       allgatherv.cmd.xfer_allgatherv_int.rdispls = (int *) displs;
 
-      if(unlikely (comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] == MPID_COLL_ALWAYS_QUERY ||
-                   comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT] == MPID_COLL_CHECK_FN_REQUIRED))
+      if(unlikely (selected_type == MPID_COLL_ALWAYS_QUERY ||
+                   selected_type == MPID_COLL_CHECK_FN_REQUIRED))
       {
          metadata_result_t result = {0};
          TRACE_ERR("Querying allgatherv_int protocol %s, type was %d\n", my_md->name,
-            comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT]);
+            selected_type);
          result = my_md->check_fn(&allgatherv);
          TRACE_ERR("Allgatherv bitmask: %#X\n", result.bitmask);
          if(!result.bitmask)
          {
-            fprintf(stderr,"Query failed for %s\n", my_md->name);
+           if(unlikely(verbose))
+             fprintf(stderr,"Query failed for %s\n", my_md->name);
+           MPIDI_Update_last_algorithm(comm_ptr, "ALLGATHERV_MPICH");
+           return MPIR_Allgatherv(sendbuf, sendcount, sendtype,
+                                  recvbuf, recvcounts, displs, recvtype,
+                                  comm_ptr, mpierrno);
          }
       }
 
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+      if(unlikely(verbose))
       {
          unsigned long long int threadID;
          MPIU_Thread_id_t tid;
@@ -430,7 +455,6 @@ MPIDO_Allgatherv(const void *sendbuf,
                  my_md->name,
               (unsigned) comm_ptr->context_id);
       }
-
       TRACE_ERR("Calling allgatherv via %s()\n", MPIDI_Process.context_post.active>0?"PAMI_Collective":"PAMI_Context_post");
       MPIDI_Post_coll_t allgatherv_post;
       MPIDI_Context_post(MPIDI_Context[0], &allgatherv_post.state,
@@ -438,7 +462,7 @@ MPIDO_Allgatherv(const void *sendbuf,
 
       MPIDI_Update_last_algorithm(comm_ptr, my_md->name);
 
-      TRACE_ERR("Rank %d waiting on active %d\n", comm_ptr->rank, allgatherv_active);
+      TRACE_ERR("Rank %d waiting on active %d\n", rank, allgatherv_active);
       MPID_PROGRESS_WAIT_WHILE(allgatherv_active);
 
       return PAMI_SUCCESS;
@@ -447,9 +471,9 @@ MPIDO_Allgatherv(const void *sendbuf,
    /* TODO These need ordered in speed-order */
    if(use_tree_reduce)
    {
-     if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+     if(unlikely(verbose))
        fprintf(stderr,"Using tree reduce allgatherv type %u.\n",
-               comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT]);
+               selected_type);
      rc = MPIDO_Allgatherv_allreduce(sendbuf, sendcount, sendtype,
              recvbuf, recvcounts, buffer_sum, displs, recvtype,
              send_true_lb, recv_true_lb, send_size, recv_size,
@@ -460,9 +484,9 @@ MPIDO_Allgatherv(const void *sendbuf,
 
    if(use_bcast)
    {
-     if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+     if(unlikely(verbose))
        fprintf(stderr,"Using bcast allgatherv type %u.\n",
-               comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT]);
+               selected_type);
      rc = MPIDO_Allgatherv_bcast(sendbuf, sendcount, sendtype,
              recvbuf, recvcounts, buffer_sum, displs, recvtype,
              send_true_lb, recv_true_lb, send_size, recv_size,
@@ -473,9 +497,9 @@ MPIDO_Allgatherv(const void *sendbuf,
 
    if(use_alltoall)
    {
-     if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+     if(unlikely(verbose))
        fprintf(stderr,"Using alltoall allgatherv type %u.\n",
-               comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT]);
+               selected_type);
      rc = MPIDO_Allgatherv_alltoall(sendbuf, sendcount, sendtype,
              recvbuf, (int *)recvcounts, buffer_sum, displs, recvtype,
              send_true_lb, recv_true_lb, send_size, recv_size,
@@ -484,9 +508,9 @@ MPIDO_Allgatherv(const void *sendbuf,
      return rc;
    }
 
-   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+   if(unlikely(verbose))
       fprintf(stderr,"Using MPICH allgatherv type %u.\n",
-            comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLGATHERV_INT]);
+            selected_type);
    TRACE_ERR("Using MPICH for Allgatherv\n");
    MPIDI_Update_last_algorithm(comm_ptr, "ALLGATHERV_MPICH");
    return MPIR_Allgatherv(sendbuf, sendcount, sendtype,
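[Editor's note] Most of the allgatherv changes above are one mechanical refactor: repeated chases through `comm_ptr->mpid....` and `comm_ptr->rank` are hoisted into `const` locals (`mpid`, `rank`, `selected_type`), and the verbose test collapses into a single precomputed flag. A minimal sketch of the pattern with simplified stand-in types (not the real `MPID_Comm` layout):

```c
#include <assert.h>
#include <stdio.h>

/* Simplified stand-ins for MPID_Comm / MPIDI_Comm; the real structs carry
 * far more state. */
struct mpidi_comm { int user_selected_type[8]; };
struct mpid_comm  { int rank; struct mpidi_comm mpid; };

static int pick_algorithm(const struct mpid_comm *comm_ptr, int xfer,
                          int verbose_level)
{
    /* Hoist once into const locals instead of re-dereferencing comm_ptr at
     * every use site, as the patch does. */
    const struct mpidi_comm *const mpid = &comm_ptr->mpid;
    const int rank = comm_ptr->rank;
    const int selected_type = mpid->user_selected_type[xfer];

    /* One precomputed flag replaces the repeated
     * "verbose >= DETAILS_ALL && rank == 0" test at every fprintf site. */
    const unsigned verbose = (verbose_level >= 2) && (rank == 0);

    if (verbose)
        fprintf(stderr, "selected type %d\n", selected_type);
    return selected_type;
}
```

Besides shortening the hot path, the `const` locals make it explicit that the values are invariant across the selection logic.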
diff --git a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
index f20842e..b77b27d 100644
--- a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
+++ b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
@@ -20,7 +20,7 @@
  * \brief ???
  */
 
-/*#define TRACE_ON*/
+/* #define TRACE_ON */
 
 #include <mpidimpl.h>
 
@@ -56,9 +56,17 @@ int MPIDO_Allreduce(const void *sendbuf,
    volatile unsigned active = 1;
    pami_xfer_t allred;
    pami_algorithm_t my_allred;
-   pami_metadata_t *my_allred_md = (pami_metadata_t *)NULL;
+   const pami_metadata_t *my_allred_md = (pami_metadata_t *)NULL;
    int alg_selected = 0;
-
+   const int rank = comm_ptr->rank;
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+   const int selected_type = mpid->user_selected_type[PAMI_XFER_ALLREDUCE];
+#if ASSERT_LEVEL==0
+   /* We can't afford the tracing in ndebug/performance libraries */
+    const unsigned verbose = 0;
+#else
+    const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (rank == 0);
+#endif
    if(likely(dt == MPI_DOUBLE || dt == MPI_DOUBLE_PRECISION))
    {
       rc = MPI_SUCCESS;
@@ -73,29 +81,29 @@ int MPIDO_Allreduce(const void *sendbuf,
    }
    else rc = MPIDI_Datatype_to_pami(dt, &pdt, op, &pop, &mu);
 
-  if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
-      fprintf(stderr,"allred rc %u,count %d, Datatype %p, op %p, mu %u, selectedvar %u != %u, sendbuf %p, recvbuf %p\n",
-              rc, count, pdt, pop, mu, 
-              (unsigned)comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE],MPID_COLL_USE_MPICH, sendbuf, recvbuf);
+   if(unlikely(verbose))
+      fprintf(stderr,"allred rc %u,count %d, Datatype %p, op %p, mu %u, selectedvar %u != %u, sendbuf %p, recvbuf %p\n",
+              rc, count, pdt, pop, mu,
+              (unsigned)selected_type, MPID_COLL_USE_MPICH, sendbuf, recvbuf);
       /* convert to metadata query */
   /* Punt count 0 allreduce to MPICH. Let them do whatever's 'right' */
   if(unlikely(rc != MPI_SUCCESS || (count==0) ||
-	      comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] == MPID_COLL_USE_MPICH))
+	      selected_type == MPID_COLL_USE_MPICH))
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+     if(unlikely(verbose))
          fprintf(stderr,"Using MPICH allreduce type %u.\n",
-                 comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE]);
+                 selected_type);
       MPIDI_Update_last_algorithm(comm_ptr, "ALLREDUCE_MPICH");
       return MPIR_Allreduce(sendbuf, recvbuf, count, dt, op, comm_ptr, mpierrno);
    }
 
+  sbuf = (void *)sendbuf;
   if(unlikely(sendbuf == MPI_IN_PLACE))
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL))
+     if(unlikely(verbose))
          fprintf(stderr,"allreduce MPI_IN_PLACE buffering\n");
       sbuf = recvbuf;
    }
-   else sbuf = (void *)sendbuf;
 
    allred.cb_done = cb_allreduce;
    allred.cookie = (void *)&active;
@@ -108,91 +116,99 @@ int MPIDO_Allreduce(const void *sendbuf,
    allred.cmd.xfer_allreduce.op = pop;
 
    TRACE_ERR("Allreduce - Basic Collective Selection\n");
-   if(likely(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] == MPID_COLL_OPTIMIZED))
+   if(likely(selected_type == MPID_COLL_OPTIMIZED))
    {
      if(likely(pop == PAMI_DATA_SUM || pop == PAMI_DATA_MAX || pop == PAMI_DATA_MIN))
       {
          /* double protocol works on all message sizes */
-         if(likely(pdt == PAMI_TYPE_DOUBLE && comm_ptr->mpid.query_allred_dsmm == MPID_COLL_QUERY))
+         if(likely(pdt == PAMI_TYPE_DOUBLE && mpid->query_allred_dsmm == MPID_COLL_QUERY))
          {
-            my_allred = comm_ptr->mpid.cached_allred_dsmm;
-            my_allred_md = &comm_ptr->mpid.cached_allred_dsmm_md;
+            my_allred = mpid->cached_allred_dsmm;
+            my_allred_md = &mpid->cached_allred_dsmm_md;
             alg_selected = 1;
          }
-         else if(pdt == PAMI_TYPE_UNSIGNED_INT && comm_ptr->mpid.query_allred_ismm == MPID_COLL_QUERY)
+         else if(pdt == PAMI_TYPE_UNSIGNED_INT && mpid->query_allred_ismm == MPID_COLL_QUERY)
          {
-            my_allred = comm_ptr->mpid.cached_allred_ismm;
-            my_allred_md = &comm_ptr->mpid.cached_allred_ismm_md;
+            my_allred = mpid->cached_allred_ismm;
+            my_allred_md = &mpid->cached_allred_ismm_md;
             alg_selected = 1;
          }
          /* The integer protocol at >1 ppn requires small messages only */
-         else if(pdt == PAMI_TYPE_UNSIGNED_INT && comm_ptr->mpid.query_allred_ismm == MPID_COLL_CHECK_FN_REQUIRED &&
-                 count <= comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0])
+         else if(pdt == PAMI_TYPE_UNSIGNED_INT && mpid->query_allred_ismm == MPID_COLL_CHECK_FN_REQUIRED &&
+                 count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
          {
-            my_allred = comm_ptr->mpid.cached_allred_ismm;
-            my_allred_md = &comm_ptr->mpid.cached_allred_ismm_md;
+            my_allred = mpid->cached_allred_ismm;
+            my_allred_md = &mpid->cached_allred_ismm_md;
             alg_selected = 1;
          }
-         else if(comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_NOQUERY &&
-                 count <= comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0])
+         else if(mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_NOQUERY &&
+                 count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
          {
-            my_allred = comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLREDUCE][0];
-            my_allred_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
+            my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
+            my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
             alg_selected = 1;
          }
-         else if(comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][1] == MPID_COLL_NOQUERY &&
-                 count > comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0])
+         else if(mpid->must_query[PAMI_XFER_ALLREDUCE][1] == MPID_COLL_NOQUERY &&
+                 count > mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
          {
-            my_allred = comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLREDUCE][1];
-            my_allred_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
+            my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][1];
+            my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
             alg_selected = 1;
          }
-         else if((comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_CHECK_FN_REQUIRED) ||
-		 (comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_QUERY) ||
-		 (comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] ==  MPID_COLL_ALWAYS_QUERY))
+         else if((mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_CHECK_FN_REQUIRED) ||
+		 (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_QUERY) ||
+		 (mpid->must_query[PAMI_XFER_ALLREDUCE][0] ==  MPID_COLL_ALWAYS_QUERY))
          {
-            my_allred = comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLREDUCE][0];
-            my_allred_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
-            alg_selected = 1;
+            if((mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] == 0) ||
+               (count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] && mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] > 0))
+            {
+              my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
+              my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
+              alg_selected = 1;
+            }
          }
       }
       else
       {
          /* so we aren't one of the key ops... */
-         if(comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_NOQUERY &&
-            count <= comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0])
+         if(mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_NOQUERY &&
+            count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
          {
-            my_allred = comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLREDUCE][0];
-            my_allred_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
+            my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
+            my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
             alg_selected = 1;
          }
-         else if(comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][1] == MPID_COLL_NOQUERY &&
-                 count > comm_ptr->mpid.cutoff_size[PAMI_XFER_ALLREDUCE][0])
+         else if(mpid->must_query[PAMI_XFER_ALLREDUCE][1] == MPID_COLL_NOQUERY &&
+                 count > mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0])
          {
-            my_allred = comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLREDUCE][1];
-            my_allred_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
+            my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][1];
+            my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][1];
             alg_selected = 1;
          }
-         else if((comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_CHECK_FN_REQUIRED) ||
-		 (comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_QUERY) ||
-		 (comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_ALWAYS_QUERY))
+         else if((mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_CHECK_FN_REQUIRED) ||
+		 (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_QUERY) ||
+		 (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_ALWAYS_QUERY))
          {
-            my_allred = comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLREDUCE][0];
-            my_allred_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
-            alg_selected = 1;
+            if((mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] == 0) || 
+               (count <= mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] && mpid->cutoff_size[PAMI_XFER_ALLREDUCE][0] > 0))
+            {
+              my_allred = mpid->opt_protocol[PAMI_XFER_ALLREDUCE][0];
+              my_allred_md = &mpid->opt_protocol_md[PAMI_XFER_ALLREDUCE][0];
+              alg_selected = 1;
+            }
          }
       }
       TRACE_ERR("Alg selected: %d\n", alg_selected);
       if(likely(alg_selected))
       {
-	if(unlikely(comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_CHECK_FN_REQUIRED))
+	if(unlikely(mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_CHECK_FN_REQUIRED))
         {
            if(my_allred_md->check_fn != NULL)/*This should always be the case in FCA.. Otherwise punt to mpich*/
            {
               metadata_result_t result = {0};
               TRACE_ERR("querying allreduce algorithm %s, type was %d\n",
                  my_allred_md->name,
-                 comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE]);
+                 mpid->must_query[PAMI_XFER_ALLREDUCE]);
               result = my_allred_md->check_fn(&allred);
               TRACE_ERR("bitmask: %#X\n", result.bitmask);
               /* \todo Ignore check_correct.values.nonlocal until we implement the
@@ -207,21 +223,21 @@ int MPIDO_Allreduce(const void *sendbuf,
               else
               {
                  alg_selected = 0;
-                 if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+                 if(unlikely(verbose))
                     fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
               }
            }
          else alg_selected = 0;
 	}
-	else if(unlikely(((comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_QUERY) ||
-			  (comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_ALWAYS_QUERY))))
+	else if(unlikely(((mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_QUERY) ||
+			  (mpid->must_query[PAMI_XFER_ALLREDUCE][0] == MPID_COLL_ALWAYS_QUERY))))
         {
            if(my_allred_md->check_fn != NULL)/*This should always be the case in FCA.. Otherwise punt to mpich*/
            {
               metadata_result_t result = {0};
               TRACE_ERR("querying allreduce algorithm %s, type was %d\n",
                  my_allred_md->name,
-                 comm_ptr->mpid.must_query[PAMI_XFER_ALLREDUCE]);
+                 mpid->must_query[PAMI_XFER_ALLREDUCE]);
               result = my_allred_md->check_fn(&allred);
               TRACE_ERR("bitmask: %#X\n", result.bitmask);
               /* \todo Ignore check_correct.values.nonlocal until we implement the
@@ -236,7 +252,7 @@ int MPIDO_Allreduce(const void *sendbuf,
               else
               {
                  alg_selected = 0;
-                 if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+                 if(unlikely(verbose))
                     fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
               }
            }
@@ -254,7 +270,7 @@ int MPIDO_Allreduce(const void *sendbuf,
                  allred.algorithm = my_allred; /* query algorithm successfully selected */
                else
 		 {
-		   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+		   if(unlikely(verbose))
                      fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
                              data_size,
                              my_allred_md->range_lo,
@@ -275,12 +291,12 @@ int MPIDO_Allreduce(const void *sendbuf,
    }
    else
    {
-      my_allred = comm_ptr->mpid.user_selected[PAMI_XFER_ALLREDUCE];
-      my_allred_md = &comm_ptr->mpid.user_metadata[PAMI_XFER_ALLREDUCE];
+      my_allred = mpid->user_selected[PAMI_XFER_ALLREDUCE];
+      my_allred_md = &mpid->user_metadata[PAMI_XFER_ALLREDUCE];
       allred.algorithm = my_allred;
-      if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] == MPID_COLL_QUERY ||
-         comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] == MPID_COLL_ALWAYS_QUERY ||
-         comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] == MPID_COLL_CHECK_FN_REQUIRED)
+      if(selected_type == MPID_COLL_QUERY ||
+         selected_type == MPID_COLL_ALWAYS_QUERY ||
+         selected_type == MPID_COLL_CHECK_FN_REQUIRED)
       {
          if(my_allred_md->check_fn != NULL)
          {
@@ -289,8 +305,8 @@ int MPIDO_Allreduce(const void *sendbuf,
             metadata_result_t result = {0};
             TRACE_ERR("querying allreduce algorithm %s, type was %d\n",
                my_allred_md->name,
-               comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE]);
-            result = comm_ptr->mpid.user_metadata[PAMI_XFER_ALLREDUCE].check_fn(&allred);
+               selected_type);
+            result = mpid->user_metadata[PAMI_XFER_ALLREDUCE].check_fn(&allred);
             TRACE_ERR("bitmask: %#X\n", result.bitmask);
             /* \todo Ignore check_correct.values.nonlocal until we implement the
                'pre-allreduce allreduce' or the 'safe' environment flag.
@@ -300,7 +316,7 @@ int MPIDO_Allreduce(const void *sendbuf,
             if(!result.bitmask)
                alg_selected = 1; /* query algorithm successfully selected */
             else 
-               if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+               if(unlikely(verbose))
                   fprintf(stderr,"check_fn failed for %s.\n", my_allred_md->name);
          }
          else /* no check_fn, manually look at the metadata fields */
@@ -316,7 +332,7 @@ int MPIDO_Allreduce(const void *sendbuf,
                   (my_allred_md->range_hi >= data_size))
                   alg_selected = 1; /* query algorithm successfully selected */
                else
-                 if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+                 if(unlikely(verbose))
                      fprintf(stderr,"message size (%u) outside range (%zu<->%zu) for %s.\n",
                              data_size,
                              my_allred_md->range_lo,
@@ -332,13 +348,13 @@ int MPIDO_Allreduce(const void *sendbuf,
 
    if(unlikely(!alg_selected)) /* must be fallback to MPICH */
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+     if(unlikely(verbose))
          fprintf(stderr,"Using MPICH allreduce\n");
       MPIDI_Update_last_algorithm(comm_ptr, "ALLREDUCE_MPICH");
       return MPIR_Allreduce(sendbuf, recvbuf, count, dt, op, comm_ptr, mpierrno);
    }
 
-   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+   if(unlikely(verbose))
    {
       unsigned long long int threadID;
       MPIU_Thread_id_t tid;
@@ -354,6 +370,8 @@ int MPIDO_Allreduce(const void *sendbuf,
    MPIDI_Context_post(MPIDI_Context[0], &allred_post.state,
                       MPIDI_Pami_post_wrapper, (void *)&allred);
 
+   MPID_assert(rc == PAMI_SUCCESS);
+   MPIDI_Update_last_algorithm(comm_ptr,my_allred_md->name);
    MPID_PROGRESS_WAIT_WHILE(active);
    TRACE_ERR("allreduce done\n");
    return MPI_SUCCESS;
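[Editor's note] The new guards in mpido_allreduce.c all encode the same rule: a `cutoff_size` of 0 means "no limit", otherwise the count must fit under the cutoff, and anything larger falls back to MPICH. As a sketch, the condition reduces to a tiny predicate (hypothetical helper name):

```c
#include <assert.h>

/* Hypothetical helper mirroring the cutoff checks added in this patch:
 * a cutoff of 0 disables the limit entirely; otherwise the optimized
 * protocol is used only when the count fits under the cutoff. */
static int within_cutoff(int cutoff, int count)
{
    return (cutoff == 0) || (count <= cutoff);
}
```

Counts failing the predicate take the `MPIR_Allreduce` path. Note the patch also re-tests `cutoff > 0` inside the second disjunct; for nonnegative cutoffs that test is redundant once the `== 0` case has been handled.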
diff --git a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
index a7b6686..4688236 100644
--- a/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
+++ b/src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c
@@ -20,7 +20,7 @@
  * \brief ???
  */
 
-/*#define TRACE_ON*/
+/* #define TRACE_ON */
 
 #include <mpidimpl.h>
 
@@ -50,6 +50,14 @@ int MPIDO_Alltoall(const void *sendbuf,
    MPIDI_Post_coll_t alltoall_post;
    int sndlen, rcvlen, snd_contig, rcv_contig, pamidt=1;
    int tmp;
+#if ASSERT_LEVEL==0
+   /* We can't afford the tracing in ndebug/performance libraries */
+    const unsigned verbose = 0;
+#else
+    const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (comm_ptr->rank == 0);
+#endif
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+   const int selected_type = mpid->user_selected_type[PAMI_XFER_ALLTOALL];
 
    if(sendbuf == MPI_IN_PLACE) 
      pamidt = 0; /* Disable until ticket #632 is fixed */
@@ -72,11 +80,10 @@ int MPIDO_Alltoall(const void *sendbuf,
    if(MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
       pamidt = 0;
 
-   if(
-      (comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALL] == MPID_COLL_USE_MPICH) ||
-      pamidt == 0)
+   if((selected_type == MPID_COLL_USE_MPICH) ||
+       pamidt == 0)
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+      if(unlikely(verbose))
          fprintf(stderr,"Using MPICH alltoall algorithm\n");
       return MPIR_Alltoall_intra(sendbuf, sendcount, sendtype,
                       recvbuf, recvcount, recvtype,
@@ -86,21 +93,21 @@ int MPIDO_Alltoall(const void *sendbuf,
 
    pami_xfer_t alltoall;
    pami_algorithm_t my_alltoall;
-   pami_metadata_t *my_alltoall_md;
+   const pami_metadata_t *my_alltoall_md;
    int queryreq = 0;
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALL] == MPID_COLL_OPTIMIZED)
+   if(selected_type == MPID_COLL_OPTIMIZED)
    {
       TRACE_ERR("Optimized alltoall was pre-selected\n");
-      my_alltoall = comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLTOALL][0];
-      my_alltoall_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLTOALL][0];
-      queryreq = comm_ptr->mpid.must_query[PAMI_XFER_ALLTOALL][0];
+      my_alltoall = mpid->opt_protocol[PAMI_XFER_ALLTOALL][0];
+      my_alltoall_md = &mpid->opt_protocol_md[PAMI_XFER_ALLTOALL][0];
+      queryreq = mpid->must_query[PAMI_XFER_ALLTOALL][0];
    }
    else
    {
       TRACE_ERR("Alltoall was specified by user\n");
-      my_alltoall = comm_ptr->mpid.user_selected[PAMI_XFER_ALLTOALL];
-      my_alltoall_md = &comm_ptr->mpid.user_metadata[PAMI_XFER_ALLTOALL];
-      queryreq = comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALL];
+      my_alltoall = mpid->user_selected[PAMI_XFER_ALLTOALL];
+      my_alltoall_md = &mpid->user_metadata[PAMI_XFER_ALLTOALL];
+      queryreq = selected_type;
    }
    char *pname = my_alltoall_md->name;
    TRACE_ERR("Using alltoall protocol %s\n", pname);
@@ -110,7 +117,7 @@ int MPIDO_Alltoall(const void *sendbuf,
    alltoall.algorithm = my_alltoall;
    if(sendbuf == MPI_IN_PLACE)
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL))
+      if(unlikely(verbose))
          fprintf(stderr,"alltoall MPI_IN_PLACE buffering\n");
       alltoall.cmd.xfer_alltoall.stype = rtype;
       alltoall.cmd.xfer_alltoall.stypecount = recvcount;
@@ -136,11 +143,15 @@ int MPIDO_Alltoall(const void *sendbuf,
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       if(!result.bitmask)
       {
+      if(unlikely(verbose))
          fprintf(stderr,"Query failed for %s\n", pname);
+      return MPIR_Alltoall_intra(sendbuf, sendcount, sendtype,
+                                 recvbuf, recvcount, recvtype,
+                                 comm_ptr, mpierrno);
       }
    }
 
-   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+   if(unlikely(verbose))
    {
       unsigned long long int threadID;
       MPIU_Thread_id_t tid;
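[Editor's note] The alltoall hunk above turns a failed query from a bare `fprintf` into a real fallback: when the protocol's `check_fn` returns a zero bitmask, the call now punts to `MPIR_Alltoall_intra` instead of proceeding with a rejected algorithm. A simplified sketch of that control flow, with stand-in types for `pami_metadata_t`/`metadata_result_t` and stub check functions (in this path a zero bitmask means the query failed; the allreduce file reads the bitmask with the opposite sense):

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned bitmask; } result_t;   /* metadata_result_t stand-in */
typedef struct {
    const char *name;
    result_t (*check_fn)(void *xfer);            /* PAMI check_fn stand-in */
} metadata_t;

/* Stub check functions for demonstration: one accepts, one rejects. */
static result_t check_ok(void *xfer)     { (void)xfer; result_t r = { 1u }; return r; }
static result_t check_reject(void *xfer) { (void)xfer; result_t r = { 0u }; return r; }

/* Returns 1 to proceed with the selected protocol, 0 to fall back to the
 * generic MPICH algorithm -- mirroring the early return the patch adds. */
static int query_or_fallback(const metadata_t *md, void *xfer)
{
    if (md->check_fn != NULL) {
        result_t result = md->check_fn(xfer);
        if (!result.bitmask)    /* zero bitmask: query failed in this path */
            return 0;
    }
    return 1;                   /* no check_fn required, or query passed */
}
```

Falling back rather than continuing after a failed query is the safer choice: the metadata has just said the selected protocol cannot handle this call.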
diff --git a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
index 99cff37..856a247 100644
--- a/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
+++ b/src/mpid/pamid/src/coll/alltoallv/mpido_alltoallv.c
@@ -19,7 +19,7 @@
  * \file src/coll/alltoallv/mpido_alltoallv.c
  * \brief ???
  */
-/*#define TRACE_ON*/
+/* #define TRACE_ON */
 
 #include <mpidimpl.h>
 
@@ -43,8 +43,7 @@ int MPIDO_Alltoallv(const void *sendbuf,
                    MPID_Comm *comm_ptr,
                    int *mpierrno)
 {
-   if(comm_ptr->rank == 0)
-      TRACE_ERR("Entering MPIDO_Alltoallv\n");
+   TRACE_ERR("Entering MPIDO_Alltoallv\n");
    volatile unsigned active = 1;
    int sndtypelen, rcvtypelen, snd_contig, rcv_contig;
    MPID_Datatype *sdt, *rdt;
@@ -53,6 +52,15 @@ int MPIDO_Alltoallv(const void *sendbuf,
    MPIDI_Post_coll_t alltoallv_post;
    int pamidt = 1;
    int tmp;
+   const int rank = comm_ptr->rank;
+#if ASSERT_LEVEL==0
+   /* We can't afford the tracing in ndebug/performance libraries */
+    const unsigned verbose = 0;
+#else
+    const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (rank == 0);
+#endif
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+   const int selected_type = mpid->user_selected_type[PAMI_XFER_ALLTOALLV_INT];
 
    if(sendbuf == MPI_IN_PLACE) 
      pamidt = 0; /* Disable until ticket #632 is fixed */
@@ -61,42 +69,36 @@ int MPIDO_Alltoallv(const void *sendbuf,
    if(MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
       pamidt = 0;
 
-   if(
-      (comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALLV_INT] == 
-            MPID_COLL_USE_MPICH) ||
+   if((selected_type == MPID_COLL_USE_MPICH) ||
        pamidt == 0)
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+      if(unlikely(verbose))
          fprintf(stderr,"Using MPICH alltoallv algorithm\n");
-      if(!comm_ptr->rank)
-         TRACE_ERR("Using MPICH alltoallv\n");
       return MPIR_Alltoallv(sendbuf, sendcounts, senddispls, sendtype,
                             recvbuf, recvcounts, recvdispls, recvtype,
                             comm_ptr, mpierrno);
    }
-   if(!comm_ptr->rank)
-      TRACE_ERR("Using %s for alltoallv protocol\n", pname);
 
    MPIDI_Datatype_get_info(1, recvtype, rcv_contig, rcvtypelen, rdt, rdt_true_lb);
 
    pami_xfer_t alltoallv;
    pami_algorithm_t my_alltoallv;
-   pami_metadata_t *my_alltoallv_md;
+   const pami_metadata_t *my_alltoallv_md;
    int queryreq = 0;
 
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALLV_INT] == MPID_COLL_OPTIMIZED)
+   if(selected_type == MPID_COLL_OPTIMIZED)
    {
       TRACE_ERR("Optimized alltoallv was selected\n");
-      my_alltoallv = comm_ptr->mpid.opt_protocol[PAMI_XFER_ALLTOALLV_INT][0];
-      my_alltoallv_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_ALLTOALLV_INT][0];
-      queryreq = comm_ptr->mpid.must_query[PAMI_XFER_ALLTOALLV_INT][0];
+      my_alltoallv = mpid->opt_protocol[PAMI_XFER_ALLTOALLV_INT][0];
+      my_alltoallv_md = &mpid->opt_protocol_md[PAMI_XFER_ALLTOALLV_INT][0];
+      queryreq = mpid->must_query[PAMI_XFER_ALLTOALLV_INT][0];
    }
    else
    { /* is this purely an else? or do i need to check for some other selectedvar... */
       TRACE_ERR("Alltoallv specified by user\n");
-      my_alltoallv = comm_ptr->mpid.user_selected[PAMI_XFER_ALLTOALLV_INT];
-      my_alltoallv_md = &comm_ptr->mpid.user_metadata[PAMI_XFER_ALLTOALLV_INT];
-      queryreq = comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLTOALLV_INT];
+      my_alltoallv = mpid->user_selected[PAMI_XFER_ALLTOALLV_INT];
+      my_alltoallv_md = &mpid->user_metadata[PAMI_XFER_ALLTOALLV_INT];
+      queryreq = selected_type;
    }
    alltoallv.algorithm = my_alltoallv;
    char *pname = my_alltoallv_md->name;
@@ -107,7 +109,7 @@ int MPIDO_Alltoallv(const void *sendbuf,
    /* We won't bother with alltoallv since MPI is always going to be ints. */
    if(sendbuf == MPI_IN_PLACE)
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL))
+     if(unlikely(verbose))
          fprintf(stderr,"alltoallv MPI_IN_PLACE buffering\n");
       alltoallv.cmd.xfer_alltoallv_int.stype = rtype;
       alltoallv.cmd.xfer_alltoallv_int.sdispls = (int *) recvdispls;
@@ -136,11 +138,15 @@ int MPIDO_Alltoallv(const void *sendbuf,
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       if(!result.bitmask)
       {
-         fprintf(stderr,"Query failed for %s\n", pname);
+        if(unlikely(verbose))
+          fprintf(stderr,"Query failed for %s\n", pname);
+        return MPIR_Alltoallv(sendbuf, sendcounts, senddispls, sendtype,
+                              recvbuf, recvcounts, recvdispls, recvtype,
+                              comm_ptr, mpierrno);
       }
    }
 
-   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+   if(unlikely(verbose))
    {
       unsigned long long int threadID;
       MPIU_Thread_id_t tid;
@@ -155,7 +161,7 @@ int MPIDO_Alltoallv(const void *sendbuf,
    MPIDI_Context_post(MPIDI_Context[0], &alltoallv_post.state,
                       MPIDI_Pami_post_wrapper, (void *)&alltoallv);
 
-   TRACE_ERR("%d waiting on active %d\n", comm_ptr->rank, active);
+   TRACE_ERR("%d waiting on active %d\n", rank, active);
    MPID_PROGRESS_WAIT_WHILE(active);
 
 
diff --git a/src/mpid/pamid/src/coll/barrier/mpido_barrier.c b/src/mpid/pamid/src/coll/barrier/mpido_barrier.c
index 4afd2d3..b08070e 100644
--- a/src/mpid/pamid/src/coll/barrier/mpido_barrier.c
+++ b/src/mpid/pamid/src/coll/barrier/mpido_barrier.c
@@ -20,7 +20,7 @@
  * \brief ???
  */
 
-/*#define TRACE_ON*/
+/* #define TRACE_ON */
 
 #include <mpidimpl.h>
 
@@ -39,12 +39,20 @@ int MPIDO_Barrier(MPID_Comm *comm_ptr, int *mpierrno)
    MPIDI_Post_coll_t barrier_post;
    pami_xfer_t barrier;
    pami_algorithm_t my_barrier;
-   pami_metadata_t *my_barrier_md;
+   const pami_metadata_t *my_barrier_md;
    int queryreq = 0;
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+   const int selected_type = mpid->user_selected_type[PAMI_XFER_BARRIER];
+#if ASSERT_LEVEL==0
+   /* We can't afford the tracing in ndebug/performance libraries */
+    const unsigned verbose = 0;
+#else
+    const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (comm_ptr->rank == 0);
+#endif
 
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BARRIER] == MPID_COLL_USE_MPICH)
+   if(unlikely(selected_type == MPID_COLL_USE_MPICH))
    {
-     if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+     if(unlikely(verbose))
        fprintf(stderr,"Using MPICH barrier\n");
       TRACE_ERR("Using MPICH Barrier\n");
       return MPIR_Barrier(comm_ptr, mpierrno);
@@ -52,29 +60,27 @@ int MPIDO_Barrier(MPID_Comm *comm_ptr, int *mpierrno)
 
    barrier.cb_done = cb_barrier;
    barrier.cookie = (void *)&active;
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BARRIER] == MPID_COLL_OPTIMIZED)
+   if(likely(selected_type == MPID_COLL_OPTIMIZED))
    {
-      TRACE_ERR("Optimized barrier (%s) was pre-selected\n", comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BARRIER][0].name);
-      my_barrier = comm_ptr->mpid.opt_protocol[PAMI_XFER_BARRIER][0];
-      my_barrier_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BARRIER][0];
-      queryreq = comm_ptr->mpid.must_query[PAMI_XFER_BARRIER][0];
+      TRACE_ERR("Optimized barrier (%s) was pre-selected\n", mpid->opt_protocol_md[PAMI_XFER_BARRIER][0].name);
+      my_barrier = mpid->opt_protocol[PAMI_XFER_BARRIER][0];
+      my_barrier_md = &mpid->opt_protocol_md[PAMI_XFER_BARRIER][0];
+      queryreq = mpid->must_query[PAMI_XFER_BARRIER][0];
    }
    else
    {
-      TRACE_ERR("Barrier (%s) was specified by user\n", comm_ptr->mpid.user_metadata[PAMI_XFER_BARRIER].name);
-      my_barrier = comm_ptr->mpid.user_selected[PAMI_XFER_BARRIER];
-      my_barrier_md = &comm_ptr->mpid.user_metadata[PAMI_XFER_BARRIER];
-      queryreq = comm_ptr->mpid.user_selected_type[PAMI_XFER_BARRIER];
+      TRACE_ERR("Barrier (%s) was specified by user\n", mpid->user_metadata[PAMI_XFER_BARRIER].name);
+      my_barrier = mpid->user_selected[PAMI_XFER_BARRIER];
+      my_barrier_md = &mpid->user_metadata[PAMI_XFER_BARRIER];
+      queryreq = selected_type;
    }
 
    barrier.algorithm = my_barrier;
    /* There is no support for query-required barrier protocols here */
-   MPID_assert_always(queryreq != MPID_COLL_ALWAYS_QUERY);
-   MPID_assert_always(queryreq != MPID_COLL_CHECK_FN_REQUIRED);
+   MPID_assert(queryreq != MPID_COLL_ALWAYS_QUERY);
+   MPID_assert(queryreq != MPID_COLL_CHECK_FN_REQUIRED);
 
-   /* TODO Name needs fixed somehow */
-   MPIDI_Update_last_algorithm(comm_ptr, my_barrier_md->name);
-   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+   if(unlikely(verbose))
    {
       unsigned long long int threadID;
       MPIU_Thread_id_t tid;
@@ -83,7 +89,6 @@ int MPIDO_Barrier(MPID_Comm *comm_ptr, int *mpierrno)
      fprintf(stderr,"<%llx> Using protocol %s for barrier on %u\n", 
              threadID,
              my_barrier_md->name,
-/*             comm_ptr->rank,comm_ptr->local_size,comm_ptr->remote_size,*/
             (unsigned) comm_ptr->context_id);
    }
    TRACE_ERR("%s barrier\n", MPIDI_Process.context_post.active>0?"posting":"invoking");
@@ -92,6 +97,7 @@ int MPIDO_Barrier(MPID_Comm *comm_ptr, int *mpierrno)
    TRACE_ERR("barrier %s rc: %d\n", MPIDI_Process.context_post.active>0?"posted":"invoked", rc);
 
    TRACE_ERR("advance spinning\n");
+   MPIDI_Update_last_algorithm(comm_ptr, my_barrier_md->name);
    MPID_PROGRESS_WAIT_WHILE(active);
    TRACE_ERR("exiting mpido_barrier\n");
    return 0;

diff --git a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
index fcd2980..ec85f11 100644
--- a/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
+++ b/src/mpid/pamid/src/coll/bcast/mpido_bcast.c
@@ -20,7 +20,7 @@
  * \brief ???
  */
 
-/*#define TRACE_ON*/
+/* #define TRACE_ON */
 
 #include <mpidimpl.h>
 
@@ -40,7 +40,8 @@ int MPIDO_Bcast(void *buffer,
                 int *mpierrno)
 {
    TRACE_ERR("in mpido_bcast\n");
-   int data_size, data_contig;
+   const size_t BCAST_LIMIT =      0x40000000;
+   int data_contig, rc;
    void *data_buffer    = NULL,
         *noncontig_buff = NULL;
    volatile unsigned active = 1;
@@ -48,48 +49,77 @@ int MPIDO_Bcast(void *buffer,
    MPID_Datatype *data_ptr;
    MPID_Segment segment;
    MPIDI_Post_coll_t bcast_post;
-/*   MPIDI_Post_coll_t allred_post; eventually for
-   preallreduces*/
-   if(count == 0)
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+   const int rank = comm_ptr->rank;
+#if ASSERT_LEVEL==0
+   /* We can't afford the tracing in ndebug/performance libraries */
+    const unsigned verbose = 0;
+#else
+   const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (rank == 0);
+#endif
+   const int selected_type = mpid->user_selected_type[PAMI_XFER_BROADCAST];
+
+   /* Must calculate data_size based on count=1 in case the total size exceeds the integer range */
+   int data_size_one;
+   MPIDI_Datatype_get_info(1, datatype,
+			   data_contig, data_size_one, data_ptr, data_true_lb);
+   /* do this calculation once and use twice */
+   const size_t data_size_sz = (size_t)data_size_one*(size_t)count;
+   if(unlikely(verbose))
+     fprintf(stderr,"bcast count %d, size %d (%#zX), root %d, buffer %p\n",
+	     count,data_size_one, (size_t)data_size_one*(size_t)count, root,buffer);
+   if(unlikely( data_size_sz > BCAST_LIMIT) )
    {
-      MPIDI_Update_last_algorithm(comm_ptr,"BCAST_NONE");
-      return MPI_SUCCESS;
-   }
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] == MPID_COLL_USE_MPICH)
-   {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
-         fprintf(stderr,"Using MPICH bcast algorithm\n");
-      return MPIR_Bcast_intra(buffer, count, datatype, root, comm_ptr, mpierrno);
+      void *new_buffer=buffer;
+      int c, new_count = (int)BCAST_LIMIT/data_size_one;
+      MPID_assert(new_count > 0);
+
+      for(c=1; ((size_t)c*(size_t)new_count) <= (size_t)count; ++c)
+      {
+        if ((rc = MPIDO_Bcast(new_buffer,
+                        new_count,
+                        datatype,
+                        root,
+                        comm_ptr,
+                              mpierrno)) != MPI_SUCCESS)
+         return rc;
+	 new_buffer = (char*)new_buffer + (size_t)data_size_one*(size_t)new_count;
+      }
+      new_count = count % new_count; /* 0 is ok, just returns no-op */
+      return MPIDO_Bcast(new_buffer,
+                         new_count,
+                         datatype,
+                         root,
+                         comm_ptr,
+                         mpierrno);
    }
 
-   MPIDI_Datatype_get_info(count, datatype,
-               data_contig, data_size, data_ptr, data_true_lb);
+   /* Must use data_size based on count for byte bcast processing.
+      Previously calculated as a size_t but large data_sizes were 
+      handled above so this cast to int should be fine here.  
+   */
+   const int data_size = (int)data_size_sz;
 
-   /* If the user has constructed some weird 0-length datatype but 
-    * count is not 0, we'll let mpich handle it */
-   if(unlikely( data_size == 0) )
+   if(selected_type == MPID_COLL_USE_MPICH || data_size == 0)
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
-         fprintf(stderr,"Using MPICH bcast algorithm for data_size 0\n");
+     if(unlikely(verbose))
+       fprintf(stderr,"Using MPICH bcast algorithm\n");
+      MPIDI_Update_last_algorithm(comm_ptr,"MPICH");
       return MPIR_Bcast_intra(buffer, count, datatype, root, comm_ptr, mpierrno);
    }
+
    data_buffer = (char *)buffer + data_true_lb;
 
    if(!data_contig)
    {
-      if(comm_ptr->rank == root)
-         TRACE_ERR("noncontig data\n");
       noncontig_buff = MPIU_Malloc(data_size);
       data_buffer = noncontig_buff;
       if(noncontig_buff == NULL)
       {
-         fprintf(stderr,
-            "Pack: Tree Bcast cannot allocate local non-contig pack buffer\n");
-/*         MPIX_Dump_stacks();*/
          MPID_Abort(NULL, MPI_ERR_NO_SPACE, 1,
             "Fatal:  Cannot allocate pack buffer");
       }
-      if(comm_ptr->rank == root)
+      if(rank == root)
       {
          DLOOP_Offset last = data_size;
          MPID_Segment_init(buffer, count, datatype, &segment, 0);
@@ -99,44 +129,58 @@ int MPIDO_Bcast(void *buffer,
 
    pami_xfer_t bcast;
    pami_algorithm_t my_bcast;
-   pami_metadata_t *my_bcast_md;
+   const pami_metadata_t *my_bcast_md;
    int queryreq = 0;
 
    bcast.cb_done = cb_bcast;
    bcast.cookie = (void *)&active;
    bcast.cmd.xfer_broadcast.root = MPID_VCR_GET_LPID(comm_ptr->vcr, root);
-   bcast.algorithm = comm_ptr->mpid.user_selected[PAMI_XFER_BROADCAST];
+   bcast.algorithm = mpid->user_selected[PAMI_XFER_BROADCAST];
    bcast.cmd.xfer_broadcast.buf = data_buffer;
    bcast.cmd.xfer_broadcast.type = PAMI_TYPE_BYTE;
    /* Needs to be sizeof(type)*count since we are using bytes as * the generic type */
    bcast.cmd.xfer_broadcast.typecount = data_size;
 
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST] == MPID_COLL_OPTIMIZED)
+   if(selected_type == MPID_COLL_OPTIMIZED)
    {
       TRACE_ERR("Optimized bcast (%s) and (%s) were pre-selected\n",
-         comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0].name,
-         comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][1].name);
+         mpid->opt_protocol_md[PAMI_XFER_BROADCAST][0].name,
+         mpid->opt_protocol_md[PAMI_XFER_BROADCAST][1].name);
+
+      if(mpid->cutoff_size[PAMI_XFER_BROADCAST][1] != 0)/* SSS: There is FCA cutoff (FCA only sets cutoff for [PAMI_XFER_BROADCAST][1]) */
+      {
+        if(data_size <= mpid->cutoff_size[PAMI_XFER_BROADCAST][1])
+        {
+          my_bcast = mpid->opt_protocol[PAMI_XFER_BROADCAST][1];
+          my_bcast_md = &mpid->opt_protocol_md[PAMI_XFER_BROADCAST][1];
+          queryreq = mpid->must_query[PAMI_XFER_BROADCAST][1];
+        }
+        else
+        {
+          return MPIR_Bcast_intra(buffer, count, datatype, root, comm_ptr, mpierrno);
+        }
+      }
 
-      if(data_size > comm_ptr->mpid.cutoff_size[PAMI_XFER_BROADCAST][0])
+      if(data_size > mpid->cutoff_size[PAMI_XFER_BROADCAST][0])
       {
-         my_bcast = comm_ptr->mpid.opt_protocol[PAMI_XFER_BROADCAST][1];
-         my_bcast_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][1];
-         queryreq = comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][1];
+         my_bcast = mpid->opt_protocol[PAMI_XFER_BROADCAST][1];
+         my_bcast_md = &mpid->opt_protocol_md[PAMI_XFER_BROADCAST][1];
+         queryreq = mpid->must_query[PAMI_XFER_BROADCAST][1];
       }
       else
       {
-         my_bcast = comm_ptr->mpid.opt_protocol[PAMI_XFER_BROADCAST][0];
-         my_bcast_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_BROADCAST][0];
-         queryreq = comm_ptr->mpid.must_query[PAMI_XFER_BROADCAST][0];
+         my_bcast = mpid->opt_protocol[PAMI_XFER_BROADCAST][0];
+         my_bcast_md = &mpid->opt_protocol_md[PAMI_XFER_BROADCAST][0];
+         queryreq = mpid->must_query[PAMI_XFER_BROADCAST][0];
       }
    }
    else
    {
       TRACE_ERR("Optimized bcast (%s) was specified by user\n",
-         comm_ptr->mpid.user_metadata[PAMI_XFER_BROADCAST].name);
-      my_bcast =  comm_ptr->mpid.user_selected[PAMI_XFER_BROADCAST];
-      my_bcast_md = &comm_ptr->mpid.user_metadata[PAMI_XFER_BROADCAST];
-      queryreq = comm_ptr->mpid.user_selected_type[PAMI_XFER_BROADCAST];
+         mpid->user_metadata[PAMI_XFER_BROADCAST].name);
+      my_bcast =  mpid->user_selected[PAMI_XFER_BROADCAST];
+      my_bcast_md = &mpid->user_metadata[PAMI_XFER_BROADCAST];
+      queryreq = selected_type;
    }
 
    bcast.algorithm = my_bcast;
@@ -150,16 +194,14 @@ int MPIDO_Bcast(void *buffer,
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       if(!result.bitmask)
       {
-         fprintf(stderr,"query failed for %s.\n", my_bcast_md->name);
+         if(unlikely(verbose))
+            fprintf(stderr,"Using MPICH bcast algorithm\n");
+         MPIDI_Update_last_algorithm(comm_ptr,"MPICH");
+         return MPIR_Bcast_intra(buffer, count, datatype, root, comm_ptr, mpierrno);
       }
    }
 
-
-   TRACE_ERR("%s bcast, context: %d, algoname: %s\n",
-             MPIDI_Process.context_post.active>0?"posting":"invoking", 0, my_bcast_md->name);
-   MPIDI_Update_last_algorithm(comm_ptr, my_bcast_md->name);
-
-   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+   if(unlikely(verbose))
    {
       unsigned long long int threadID;
       MPIU_Thread_id_t tid;
@@ -172,16 +214,13 @@ int MPIDO_Bcast(void *buffer,
    }
 
    MPIDI_Context_post(MPIDI_Context[0], &bcast_post.state, MPIDI_Pami_post_wrapper, (void *)&bcast);
-
-   TRACE_ERR("bcast %s\n", MPIDI_Process.context_post.active>0?"posted":"invoked");
-
+   MPIDI_Update_last_algorithm(comm_ptr, my_bcast_md->name);
    MPID_PROGRESS_WAIT_WHILE(active);
    TRACE_ERR("bcast done\n");
 
    if(!data_contig)
    {
-      TRACE_ERR("cleaning up noncontig\n");
-      if(comm_ptr->rank != root)
+      if(rank != root)
          MPIR_Localcopy(noncontig_buff, data_size, MPI_CHAR,
                         buffer,         count,     datatype);
       MPIU_Free(noncontig_buff);
diff --git a/src/mpid/pamid/src/coll/gather/mpido_gather.c b/src/mpid/pamid/src/coll/gather/mpido_gather.c
index 5dfe787..534e129 100644
--- a/src/mpid/pamid/src/coll/gather/mpido_gather.c
+++ b/src/mpid/pamid/src/coll/gather/mpido_gather.c
@@ -49,8 +49,8 @@ int MPIDO_Gather_reduce(void * sendbuf,
 {
   MPID_Datatype * data_ptr;
   MPI_Aint true_lb;
-  int rank = comm_ptr->rank;
-  int size = comm_ptr->local_size;
+  const int rank = comm_ptr->rank;
+  const int size = comm_ptr->local_size;
   int rc, sbytes, rbytes, contig;
   char *tempbuf = NULL;
   char *inplacetemp = NULL;
@@ -134,15 +134,23 @@ int MPIDO_Gather(const void *sendbuf,
   MPI_Aint true_lb = 0;
   pami_xfer_t gather;
   MPIDI_Post_coll_t gather_post;
-/*  char *sbuf = sendbuf, *rbuf = recvbuf;*/
   int success = 1, contig, send_bytes=-1, recv_bytes = 0;
-  int rank = comm_ptr->rank;
+  const int rank = comm_ptr->rank;
+  const int size = comm_ptr->local_size;
+#if ASSERT_LEVEL==0
+   /* We can't afford the tracing in ndebug/performance libraries */
+    const unsigned verbose = 0;
+#else
+    const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (rank == 0);
+#endif
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+   const int selected_type = mpid->user_selected_type[PAMI_XFER_GATHER];
 
   if ((sendbuf == MPI_IN_PLACE) && sendtype != MPI_DATATYPE_NULL && sendcount >= 0)
   {
     MPIDI_Datatype_get_info(sendcount, sendtype, contig,
                             send_bytes, data_ptr, true_lb);
-    if (!contig || ((send_bytes * comm_ptr->local_size) % sizeof(int)))
+    if (!contig || ((send_bytes * size) % sizeof(int)))
       success = 0;
   }
   else
@@ -161,17 +169,17 @@ int MPIDO_Gather(const void *sendbuf,
   }
 
   MPIDI_Update_last_algorithm(comm_ptr, "GATHER_MPICH");
-  if(!comm_ptr->mpid.optgather ||
-   comm_ptr->mpid.user_selected_type[PAMI_XFER_GATHER] == MPID_COLL_USE_MPICH)
+  if(!mpid->optgather ||
+   selected_type == MPID_COLL_USE_MPICH)
   {
-    if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+    if(unlikely(verbose))
       fprintf(stderr,"Using MPICH gather algorithm\n");
     return MPIR_Gather(sendbuf, sendcount, sendtype,
                        recvbuf, recvcount, recvtype,
                        root, comm_ptr, mpierrno);
   }
 
-   if(comm_ptr->mpid.preallreduces[MPID_GATHER_PREALLREDUCE])
+   if(mpid->preallreduces[MPID_GATHER_PREALLREDUCE])
    {
       volatile unsigned allred_active = 1;
       pami_xfer_t allred;
@@ -179,7 +187,7 @@ int MPIDO_Gather(const void *sendbuf,
       allred.cb_done = cb_allred;
       allred.cookie = (void *)&allred_active;
       /* Guaranteed to work allreduce */
-      allred.algorithm = comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLREDUCE][0][0];
+      allred.algorithm = mpid->coll_algorithm[PAMI_XFER_ALLREDUCE][0][0];
       allred.cmd.xfer_allreduce.sndbuf = (void *)(size_t)success;
       allred.cmd.xfer_allreduce.stype = PAMI_TYPE_SIGNED_INT;
       allred.cmd.xfer_allreduce.rcvbuf = (void *)(size_t)success;
@@ -193,9 +201,9 @@ int MPIDO_Gather(const void *sendbuf,
       MPID_PROGRESS_WAIT_WHILE(allred_active);
    }
 
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_GATHER] == MPID_COLL_USE_MPICH || !success)
+   if(selected_type == MPID_COLL_USE_MPICH || !success)
    {
-    if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+    if(unlikely(verbose))
       fprintf(stderr,"Using MPICH gather algorithm\n");
     return MPIR_Gather(sendbuf, sendcount, sendtype,
                        recvbuf, recvcount, recvtype,
@@ -204,7 +212,7 @@ int MPIDO_Gather(const void *sendbuf,
 
 
    pami_algorithm_t my_gather;
-   pami_metadata_t *my_gather_md;
+   const pami_metadata_t *my_gather_md;
    int queryreq = 0;
    volatile unsigned active = 1;
 
@@ -213,10 +221,10 @@ int MPIDO_Gather(const void *sendbuf,
    gather.cmd.xfer_gather.root = MPID_VCR_GET_LPID(comm_ptr->vcr, root);
    if(sendbuf == MPI_IN_PLACE) 
    {
-     if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL))
+     if(unlikely(verbose))
        fprintf(stderr,"gather MPI_IN_PLACE buffering\n");
      gather.cmd.xfer_gather.stypecount = recv_bytes;
-     gather.cmd.xfer_gather.sndbuf = (char *)recvbuf + recv_bytes*comm_ptr->rank;
+     gather.cmd.xfer_gather.sndbuf = (char *)recvbuf + recv_bytes*rank;
    }
    else
    {
@@ -229,21 +237,21 @@ int MPIDO_Gather(const void *sendbuf,
    gather.cmd.xfer_gather.rtypecount = recv_bytes;
 
    /* If glue-level protocols are good, this will require some changes */
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_GATHER] == MPID_COLL_OPTIMIZED)
+   if(selected_type == MPID_COLL_OPTIMIZED)
    {
       TRACE_ERR("Optimized gather (%s) was pre-selected\n",
-         comm_ptr->mpid.opt_protocol_md[PAMI_XFER_GATHER][0].name);
-      my_gather = comm_ptr->mpid.opt_protocol[PAMI_XFER_GATHER][0];
-      my_gather_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_GATHER][0];
-      queryreq = comm_ptr->mpid.must_query[PAMI_XFER_GATHER][0];
+         mpid->opt_protocol_md[PAMI_XFER_GATHER][0].name);
+      my_gather = mpid->opt_protocol[PAMI_XFER_GATHER][0];
+      my_gather_md = &mpid->opt_protocol_md[PAMI_XFER_GATHER][0];
+      queryreq = mpid->must_query[PAMI_XFER_GATHER][0];
    }
    else
    {
       TRACE_ERR("Optimized gather (%s) was specified by user\n",
-         comm_ptr->mpid.user_metadata[PAMI_XFER_GATHER].name);
-      my_gather = comm_ptr->mpid.user_selected[PAMI_XFER_GATHER];
-      my_gather_md = &comm_ptr->mpid.user_metadata[PAMI_XFER_GATHER];
-      queryreq = comm_ptr->mpid.user_selected_type[PAMI_XFER_GATHER];
+      mpid->user_metadata[PAMI_XFER_GATHER].name);
+      my_gather = mpid->user_selected[PAMI_XFER_GATHER];
+      my_gather_md = &mpid->user_metadata[PAMI_XFER_GATHER];
+      queryreq = selected_type;
    }
 
    gather.algorithm = my_gather;
@@ -257,15 +265,19 @@ int MPIDO_Gather(const void *sendbuf,
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       if(!result.bitmask)
       {
-         fprintf(stderr,"query failed for %s\n", my_gather_md->name);
+        if(unlikely(verbose))
+          fprintf(stderr,"query failed for %s\n", my_gather_md->name);
+        return MPIR_Gather(sendbuf, sendcount, sendtype,
+                           recvbuf, recvcount, recvtype,
+                           root, comm_ptr, mpierrno);
       }
    }
 
    MPIDI_Update_last_algorithm(comm_ptr,
-            comm_ptr->mpid.user_metadata[PAMI_XFER_GATHER].name);
+            mpid->user_metadata[PAMI_XFER_GATHER].name);
 
 
-   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+   if(unlikely(verbose))
    {
       unsigned long long int threadID;
       MPIU_Thread_id_t tid;
diff --git a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
index 8947985..67db093 100644
--- a/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
+++ b/src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c
@@ -20,7 +20,7 @@
  * \brief ???
  */
 
-/*#define TRACE_ON*/
+/* #define TRACE_ON */
 #include <mpidimpl.h>
 
 static void cb_gatherv(void *ctxt, void *clientdata, pami_result_t err)
@@ -43,15 +43,24 @@ int MPIDO_Gatherv(const void *sendbuf,
 
 {
    TRACE_ERR("Entering MPIDO_Gatherv\n");
-   int contig, rsize, ssize;
+   int rc;
+   int contig, rsize=0, ssize=0;
    int pamidt = 1;
-   ssize = 0;
    MPID_Datatype *dt_ptr = NULL;
    MPI_Aint send_true_lb, recv_true_lb;
    char *sbuf, *rbuf;
    pami_type_t stype, rtype;
    int tmp;
    volatile unsigned gatherv_active = 1;
+   const int rank = comm_ptr->rank;
+#if ASSERT_LEVEL==0
+   /* We can't afford the tracing in ndebug/performance libraries */
+    const unsigned verbose = 0;
+#else
+    const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (rank == 0);
+#endif
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+   const int selected_type = mpid->user_selected_type[PAMI_XFER_GATHERV_INT];
 
    /* Check for native PAMI types and MPI_IN_PLACE on sendbuf */
    /* MPI_IN_PLACE is a nonlocal decision. We will need a preallreduce if we ever have
@@ -61,9 +70,9 @@ int MPIDO_Gatherv(const void *sendbuf,
    if(MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
       pamidt = 0;
 
-   if(pamidt == 0 || comm_ptr->mpid.user_selected_type[PAMI_XFER_GATHERV_INT] == MPID_COLL_USE_MPICH)
+   if(pamidt == 0 || selected_type == MPID_COLL_USE_MPICH)
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+      if(unlikely(verbose))
          fprintf(stderr,"Using MPICH gatherv algorithm\n");
       TRACE_ERR("GATHERV using MPICH\n");
       MPIDI_Update_last_algorithm(comm_ptr, "GATHERV_MPICH");
@@ -90,15 +99,15 @@ int MPIDO_Gatherv(const void *sendbuf,
    gatherv.cmd.xfer_gatherv_int.stype = stype;
    gatherv.cmd.xfer_gatherv_int.stypecount = sendcount;
 
-   if(comm_ptr->rank == root)
+   if(rank == root)
    {
       if(sendbuf == MPI_IN_PLACE) 
       {
-         if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL))
+         if(unlikely(verbose))
             fprintf(stderr,"gatherv MPI_IN_PLACE buffering\n");
-         sbuf = (char*)rbuf + rsize*displs[comm_ptr->rank];
+         sbuf = (char*)rbuf + rsize*displs[rank];
          gatherv.cmd.xfer_gatherv_int.stype = rtype;
-         gatherv.cmd.xfer_gatherv_int.stypecount = recvcounts[comm_ptr->rank];
+         gatherv.cmd.xfer_gatherv_int.stypecount = recvcounts[rank];
       }
       else
       {
@@ -109,24 +118,24 @@ int MPIDO_Gatherv(const void *sendbuf,
    gatherv.cmd.xfer_gatherv_int.sndbuf = sbuf;
 
    pami_algorithm_t my_gatherv;
-   pami_metadata_t *my_gatherv_md;
+   const pami_metadata_t *my_gatherv_md;
    int queryreq = 0;
 
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_GATHERV_INT] == MPID_COLL_OPTIMIZED)
+   if(selected_type == MPID_COLL_OPTIMIZED)
    {
       TRACE_ERR("Optimized gatherv %s was selected\n",
-         comm_ptr->mpid.opt_protocol_md[PAMI_XFER_GATHERV_INT][0].name);
-      my_gatherv = comm_ptr->mpid.opt_protocol[PAMI_XFER_GATHERV_INT][0];
-      my_gatherv_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_GATHERV_INT][0];
-      queryreq = comm_ptr->mpid.must_query[PAMI_XFER_GATHERV_INT][0];
+         mpid->opt_protocol_md[PAMI_XFER_GATHERV_INT][0].name);
+      my_gatherv = mpid->opt_protocol[PAMI_XFER_GATHERV_INT][0];
+      my_gatherv_md = &mpid->opt_protocol_md[PAMI_XFER_GATHERV_INT][0];
+      queryreq = mpid->must_query[PAMI_XFER_GATHERV_INT][0];
    }
    else
    {
       TRACE_ERR("Optimized gatherv %s was set by user\n",
-         comm_ptr->mpid.user_metadata[PAMI_XFER_GATHERV_INT].name);
-         my_gatherv = comm_ptr->mpid.user_selected[PAMI_XFER_GATHERV_INT];
-         my_gatherv_md = &comm_ptr->mpid.user_metadata[PAMI_XFER_GATHERV_INT];
-         queryreq = comm_ptr->mpid.user_selected_type[PAMI_XFER_GATHERV_INT];
+         mpid->user_metadata[PAMI_XFER_GATHERV_INT].name);
+         my_gatherv = mpid->user_selected[PAMI_XFER_GATHERV_INT];
+         my_gatherv_md = &mpid->user_metadata[PAMI_XFER_GATHERV_INT];
+         queryreq = selected_type;
    }
 
    gatherv.algorithm = my_gatherv;
@@ -141,13 +150,18 @@ int MPIDO_Gatherv(const void *sendbuf,
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       if(!result.bitmask)
       {
-         fprintf(stderr,"Query failed for %s\n", my_gatherv_md->name);
+         if(unlikely(verbose))
+            fprintf(stderr,"Query failed for %s\n", my_gatherv_md->name);
+         MPIDI_Update_last_algorithm(comm_ptr, "GATHERV_MPICH");
+         return MPIR_Gatherv(sendbuf, sendcount, sendtype,
+                             recvbuf, recvcounts, displs, recvtype,
+                             root, comm_ptr, mpierrno);
       }
    }
    
    MPIDI_Update_last_algorithm(comm_ptr, my_gatherv_md->name);
 
-   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+   if(unlikely(verbose))
    {
       unsigned long long int threadID;
       MPIU_Thread_id_t tid;
diff --git a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
index 1170f45..b84d868 100644
--- a/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
+++ b/src/mpid/pamid/src/coll/reduce/mpido_reduce.c
@@ -20,7 +20,7 @@
  * \brief ???
  */
 
-/*#define TRACE_ON*/
+/* #define TRACE_ON */
 #include <mpidimpl.h>
 
 static void reduce_cb_done(void *ctxt, void *clientdata, pami_result_t err)
@@ -49,55 +49,66 @@ int MPIDO_Reduce(const void *sendbuf,
    pami_type_t pdt;
    int rc;
    int alg_selected = 0;
+#if ASSERT_LEVEL==0
+   /* We can't afford the tracing in ndebug/performance libraries */
+    const unsigned verbose = 0;
+#else
+    const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (comm_ptr->rank == 0);
+#endif
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+   const int selected_type = mpid->user_selected_type[PAMI_XFER_REDUCE];
 
    rc = MPIDI_Datatype_to_pami(datatype, &pdt, op, &pop, &mu);
-   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+   if(unlikely(verbose))
       fprintf(stderr,"reduce - rc %u, dt: %p, op: %p, mu: %u, selectedvar %u != %u (MPICH)\n",
          rc, pdt, pop, mu, 
-         (unsigned)comm_ptr->mpid.user_selected_type[PAMI_XFER_REDUCE], MPID_COLL_USE_MPICH);
-
+         (unsigned)selected_type, MPID_COLL_USE_MPICH);
 
    pami_xfer_t reduce;
-   pami_algorithm_t my_reduce;
-   pami_metadata_t *my_reduce_md;
+   pami_algorithm_t my_reduce=0;
+   const pami_metadata_t *my_reduce_md=NULL;
    int queryreq = 0;
    volatile unsigned reduce_active = 1;
 
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_REDUCE] == MPID_COLL_USE_MPICH || rc != MPI_SUCCESS)
+   if(selected_type == MPID_COLL_USE_MPICH || rc != MPI_SUCCESS)
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+      if(unlikely(verbose))
          fprintf(stderr,"Using MPICH reduce algorithm\n");
       return MPIR_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm_ptr, mpierrno);
    }
 
    MPIDI_Datatype_get_info(count, datatype, dt_contig, tsize, dt_null, true_lb);
    rbuf = (char *)recvbuf + true_lb;
+   sbuf = (char *)sendbuf + true_lb;
    if(sendbuf == MPI_IN_PLACE) 
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL))
+      if(unlikely(verbose))
          fprintf(stderr,"reduce MPI_IN_PLACE buffering\n");
       sbuf = rbuf;
    }
-   else
-      sbuf = (char *)sendbuf + true_lb;
 
    reduce.cb_done = reduce_cb_done;
    reduce.cookie = (void *)&reduce_active;
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_REDUCE] == MPID_COLL_OPTIMIZED)
+   if(selected_type == MPID_COLL_OPTIMIZED)
    {
-      TRACE_ERR("Optimized Reduce (%s) was pre-selected\n",
-         comm_ptr->mpid.opt_protocol_md[PAMI_XFER_REDUCE][0].name);
-      my_reduce    = comm_ptr->mpid.opt_protocol[PAMI_XFER_REDUCE][0];
-      my_reduce_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_REDUCE][0];
-      queryreq     = comm_ptr->mpid.must_query[PAMI_XFER_REDUCE][0];
+      if((mpid->cutoff_size[PAMI_XFER_REDUCE][0] == 0) || 
+          (mpid->cutoff_size[PAMI_XFER_REDUCE][0] >= tsize && mpid->cutoff_size[PAMI_XFER_REDUCE][0] > 0))
+      {
+        TRACE_ERR("Optimized Reduce (%s) was pre-selected\n",
+         mpid->opt_protocol_md[PAMI_XFER_REDUCE][0].name);
+        my_reduce    = mpid->opt_protocol[PAMI_XFER_REDUCE][0];
+        my_reduce_md = &mpid->opt_protocol_md[PAMI_XFER_REDUCE][0];
+        queryreq     = mpid->must_query[PAMI_XFER_REDUCE][0];
+      }
+
    }
    else
    {
       TRACE_ERR("Optimized reduce (%s) was specified by user\n",
-         comm_ptr->mpid.user_metadata[PAMI_XFER_REDUCE].name);
-      my_reduce    =  comm_ptr->mpid.user_selected[PAMI_XFER_REDUCE];
-      my_reduce_md = &comm_ptr->mpid.user_metadata[PAMI_XFER_REDUCE];
-      queryreq     = comm_ptr->mpid.user_selected_type[PAMI_XFER_REDUCE];
+      mpid->user_metadata[PAMI_XFER_REDUCE].name);
+      my_reduce    =  mpid->user_selected[PAMI_XFER_REDUCE];
+      my_reduce_md = &mpid->user_metadata[PAMI_XFER_REDUCE];
+      queryreq     = selected_type;
    }
    reduce.algorithm = my_reduce;
    reduce.cmd.xfer_reduce.sndbuf = sbuf;
@@ -122,7 +133,7 @@ int MPIDO_Reduce(const void *sendbuf,
          TRACE_ERR("Bitmask: %#X\n", result.bitmask);
          if(result.bitmask)
          {
-            if(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0)
+            if(verbose)
               fprintf(stderr,"Query failed for %s.\n",
                  my_reduce_md->name);
          }
@@ -138,7 +149,7 @@ int MPIDO_Reduce(const void *sendbuf,
 
    if(alg_selected)
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+      if(unlikely(verbose))
       {
          unsigned long long int threadID;
          MPIU_Thread_id_t tid;
@@ -158,14 +169,13 @@ int MPIDO_Reduce(const void *sendbuf,
    }
    else
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+      if(unlikely(verbose))
          fprintf(stderr,"Using MPICH reduce algorithm\n");
       return MPIR_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm_ptr, mpierrno);
    }
 
    MPIDI_Update_last_algorithm(comm_ptr,
                                my_reduce_md->name);
-
    MPID_PROGRESS_WAIT_WHILE(reduce_active);
    TRACE_ERR("Reduce done\n");
    return 0;
diff --git a/src/mpid/pamid/src/coll/scan/mpido_scan.c b/src/mpid/pamid/src/coll/scan/mpido_scan.c
index ced9370..be79dfe 100644
--- a/src/mpid/pamid/src/coll/scan/mpido_scan.c
+++ b/src/mpid/pamid/src/coll/scan/mpido_scan.c
@@ -20,7 +20,7 @@
  * \brief ???
  */
 
-/*#define TRACE_ON */
+/* #define TRACE_ON */
 #include <mpidimpl.h>
 
 static void scan_cb_done(void *ctxt, void *clientdata, pami_result_t err)
@@ -63,23 +63,29 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
    pami_data_function pop;
    pami_type_t pdt;
    int rc;
-   pami_metadata_t *my_md;
+   const pami_metadata_t *my_md;
+#if ASSERT_LEVEL==0
+   /* We can't afford the tracing in ndebug/performance libraries */
+    const unsigned verbose = 0;
+#else
+    const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (comm_ptr->rank == 0);
+#endif
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+   const int selected_type = mpid->user_selected_type[PAMI_XFER_SCAN];
 
    rc = MPIDI_Datatype_to_pami(datatype, &pdt, op, &pop, &mu);
-   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_0 && comm_ptr->rank == 0))
+   if(unlikely(verbose))
       fprintf(stderr,"rc %u, dt: %p, op: %p, mu: %u, selectedvar %u != %u (MPICH)\n",
          rc, pdt, pop, mu, 
-         (unsigned)comm_ptr->mpid.user_selected_type[PAMI_XFER_SCAN], MPID_COLL_USE_MPICH);
-
+         (unsigned)selected_type, MPID_COLL_USE_MPICH);
 
    pami_xfer_t scan;
    volatile unsigned scan_active = 1;
 
    if((sendbuf == MPI_IN_PLACE) || /* Disable until ticket #627 is fixed */
-      (comm_ptr->mpid.user_selected_type[PAMI_XFER_SCAN] == MPID_COLL_USE_MPICH || rc != MPI_SUCCESS))
-      
+      (selected_type == MPID_COLL_USE_MPICH || rc != MPI_SUCCESS))
    {
-      if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+      if(unlikely(verbose))
          fprintf(stderr,"Using MPICH scan algorithm (exflag %d)\n",exflag);
       if(exflag)
          return MPIR_Exscan(sendbuf, recvbuf, count, datatype, op, comm_ptr, mpierrno);
@@ -91,8 +97,8 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
    rbuf = (char *)recvbuf + true_lb;
    if(sendbuf == MPI_IN_PLACE) 
    {
-     if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL))
-       fprintf(stderr,"scan MPI_IN_PLACE buffering\n");
+      if(unlikely(verbose))
+         fprintf(stderr,"scan MPI_IN_PLACE buffering\n");
       sbuf = rbuf;
    }
    else
@@ -102,15 +108,15 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
 
    scan.cb_done = scan_cb_done;
    scan.cookie = (void *)&scan_active;
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_SCAN] == MPID_COLL_OPTIMIZED)
+   if(selected_type == MPID_COLL_OPTIMIZED)
    {
-      scan.algorithm = comm_ptr->mpid.opt_protocol[PAMI_XFER_SCAN][0];
-      my_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_SCAN][0];
+      scan.algorithm = mpid->opt_protocol[PAMI_XFER_SCAN][0];
+      my_md = &mpid->opt_protocol_md[PAMI_XFER_SCAN][0];
    }
    else
    {
-      scan.algorithm = comm_ptr->mpid.user_selected[PAMI_XFER_SCAN];
-      my_md = &comm_ptr->mpid.user_metadata[PAMI_XFER_SCAN];
+      scan.algorithm = mpid->user_selected[PAMI_XFER_SCAN];
+      my_md = &mpid->user_metadata[PAMI_XFER_SCAN];
    }
    scan.cmd.xfer_scan.sndbuf = sbuf;
    scan.cmd.xfer_scan.rcvbuf = rbuf;
@@ -122,13 +128,13 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
    scan.cmd.xfer_scan.exclusive = exflag;
 
 
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_SCAN] == MPID_COLL_ALWAYS_QUERY ||
-      comm_ptr->mpid.user_selected_type[PAMI_XFER_SCAN] == MPID_COLL_CHECK_FN_REQUIRED)
+   if(selected_type == MPID_COLL_ALWAYS_QUERY ||
+      selected_type == MPID_COLL_CHECK_FN_REQUIRED)
    {
       metadata_result_t result = {0};
       TRACE_ERR("Querying scan protocol %s, type was %d\n",
          my_md->name,
-         comm_ptr->mpid.user_selected_type[PAMI_XFER_SCAN]);
+         selected_type);
       result = my_md->check_fn(&scan);
       TRACE_ERR("Bitmask: %#X\n", result.bitmask);
       if(!result.bitmask)
@@ -138,7 +144,7 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
       }
    }
    
-   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+   if(unlikely(verbose))
    {
       unsigned long long int threadID;
       MPIU_Thread_id_t tid;
@@ -154,10 +160,7 @@ int MPIDO_Doscan(const void *sendbuf, void *recvbuf,
    MPIDI_Context_post(MPIDI_Context[0], &scan_post.state,
                       MPIDI_Pami_post_wrapper, (void *)&scan);
    TRACE_ERR("Scan %s\n", MPIDI_Process.context_post.active>0?"posted":"invoked");
-
-   MPIDI_Update_last_algorithm(comm_ptr,
-      my_md->name);
-
+   MPIDI_Update_last_algorithm(comm_ptr, my_md->name);
    MPID_PROGRESS_WAIT_WHILE(scan_active);
    TRACE_ERR("Scan done\n");
    return rc;
diff --git a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
index 123635c..d5896be 100644
--- a/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
+++ b/src/mpid/pamid/src/coll/scatter/mpido_scatter.c
@@ -20,7 +20,7 @@
  * \brief ???
  */
 
-/*#define TRACE_ON */
+/* #define TRACE_ON */
 
 #include <mpidimpl.h>
 
@@ -45,8 +45,8 @@ int MPIDO_Scatter_bcast(void * sendbuf,
 {
   /* Pretty simple - bcast a temp buffer and copy our little chunk out */
   int contig, nbytes, rc;
-  int rank = comm_ptr->rank;
-  int size = comm_ptr->local_size;
+  const int rank = comm_ptr->rank;
+  const int size = comm_ptr->local_size;
   char *tempbuf = NULL;
 
   MPID_Datatype * dt_ptr;
@@ -110,13 +110,20 @@ int MPIDO_Scatter(const void *sendbuf,
 {
   MPID_Datatype * data_ptr;
   MPI_Aint true_lb = 0;
-/*  char *sbuf = sendbuf, *rbuf = recvbuf;*/
   int contig, nbytes = 0;
-  int rank = comm_ptr->rank;
+  const int rank = comm_ptr->rank;
   int success = 1;
   pami_type_t stype, rtype;
   int tmp;
-  char use_pami = !(comm_ptr->mpid.user_selected_type[PAMI_XFER_SCATTER] == MPID_COLL_USE_MPICH);
+#if ASSERT_LEVEL==0
+   /* We can't afford the tracing in ndebug/performance libraries */
+    const unsigned verbose = 0;
+#else
+    const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (rank == 0);
+#endif
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+   const int selected_type = mpid->user_selected_type[PAMI_XFER_SCATTER];
+   char use_pami = !(selected_type == MPID_COLL_USE_MPICH);
 
   /* if (rank == root)
      We can't decide on just the root to use MPICH. Really need a pre-allreduce.
@@ -131,7 +138,7 @@ int MPIDO_Scatter(const void *sendbuf,
 
   if(!use_pami)
   {
-    if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+    if(unlikely(verbose))
       fprintf(stderr,"Using MPICH scatter algorithm\n");
     MPIDI_Update_last_algorithm(comm_ptr, "SCATTER_MPICH");
     return MPIR_Scatter(sendbuf, sendcount, sendtype,
@@ -178,25 +185,25 @@ int MPIDO_Scatter(const void *sendbuf,
    pami_xfer_t scatter;
    MPIDI_Post_coll_t scatter_post;
    pami_algorithm_t my_scatter;
-   pami_metadata_t *my_scatter_md;
+   const pami_metadata_t *my_scatter_md;
    volatile unsigned scatter_active = 1;
    int queryreq = 0;
 
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_SCATTER] == MPID_COLL_OPTIMIZED)
+   if(selected_type == MPID_COLL_OPTIMIZED)
    {
       TRACE_ERR("Optimized scatter %s was selected\n",
-         comm_ptr->mpid.opt_protocol_md[PAMI_XFER_SCATTER][0].name);
-      my_scatter = comm_ptr->mpid.opt_protocol[PAMI_XFER_SCATTER][0];
-      my_scatter_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_SCATTER][0];
-      queryreq = comm_ptr->mpid.must_query[PAMI_XFER_SCATTER][0];
+         mpid->opt_protocol_md[PAMI_XFER_SCATTER][0].name);
+      my_scatter = mpid->opt_protocol[PAMI_XFER_SCATTER][0];
+      my_scatter_md = &mpid->opt_protocol_md[PAMI_XFER_SCATTER][0];
+      queryreq = mpid->must_query[PAMI_XFER_SCATTER][0];
    }
    else
    {
       TRACE_ERR("Optimized scatter %s was set by user\n",
-         comm_ptr->mpid.user_metadata[PAMI_XFER_SCATTER].name);
-      my_scatter = comm_ptr->mpid.user_selected[PAMI_XFER_SCATTER];
-      my_scatter_md = &comm_ptr->mpid.user_metadata[PAMI_XFER_SCATTER];
-      queryreq = comm_ptr->mpid.user_selected_type[PAMI_XFER_SCATTER];
+         mpid->user_metadata[PAMI_XFER_SCATTER].name);
+      my_scatter = mpid->user_selected[PAMI_XFER_SCATTER];
+      my_scatter_md = &mpid->user_metadata[PAMI_XFER_SCATTER];
+      queryreq = selected_type;
    }
  
    scatter.algorithm = my_scatter;
@@ -208,11 +215,11 @@ int MPIDO_Scatter(const void *sendbuf,
    scatter.cmd.xfer_scatter.stypecount = sendcount;
    if(recvbuf == MPI_IN_PLACE) 
    {
-     if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL))
+     if(unlikely(verbose))
        fprintf(stderr,"scatter MPI_IN_PLACE buffering\n");
      MPIDI_Datatype_get_info(sendcount, sendtype, contig,
                              nbytes, data_ptr, true_lb);
-     scatter.cmd.xfer_scatter.rcvbuf = (char *)sendbuf + nbytes*comm_ptr->rank;
+     scatter.cmd.xfer_scatter.rcvbuf = (char *)sendbuf + nbytes*rank;
      scatter.cmd.xfer_scatter.rtype = stype;
      scatter.cmd.xfer_scatter.rtypecount = sendcount;
    }
@@ -233,11 +240,16 @@ int MPIDO_Scatter(const void *sendbuf,
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       if(!result.bitmask)
       {
-         fprintf(stderr,"query failed for %s\n", my_scatter_md->name);
+        if(unlikely(verbose))
+          fprintf(stderr,"query failed for %s\n", my_scatter_md->name);
+        MPIDI_Update_last_algorithm(comm_ptr, "SCATTER_MPICH");
+        return MPIR_Scatter(sendbuf, sendcount, sendtype,
+                            recvbuf, recvcount, recvtype,
+                            root, comm_ptr, mpierrno);
       }
    }
 
-   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+   if(unlikely(verbose))
    {
       unsigned long long int threadID;
       MPIU_Thread_id_t tid;
@@ -273,7 +285,7 @@ int MPIDO_Scatter(const void *sendbuf,
 
   if (!success)
   {
-    if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+    if(unlikely(verbose))
       fprintf(stderr,"Using MPICH scatter algorithm\n");
     return MPIR_Scatter(sendbuf, sendcount, sendtype,
                         recvbuf, recvcount, recvtype,
diff --git a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
index 331d2cb..cafd9d2 100644
--- a/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
+++ b/src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c
@@ -36,14 +36,14 @@ int MPIDO_Scatterv_bcast(void *sendbuf,
                          MPID_Comm *comm_ptr,
                          int *mpierrno)
 {
-  int rank = comm_ptr->rank;
-  int np = comm_ptr->local_size;
+  const int rank = comm_ptr->rank;
+  const int size = comm_ptr->local_size;
   char *tempbuf;
   int i, sum = 0, dtsize, rc=0, contig;
   MPID_Datatype *dt_ptr;
   MPI_Aint dt_lb;
 
-  for (i = 0; i < np; i++)
+  for (i = 0; i < size; i++)
    if (sendcounts[i] > 0)
       sum += sendcounts[i];
 
@@ -94,8 +94,8 @@ int MPIDO_Scatterv_alltoallv(void * sendbuf,
                              MPID_Comm * comm_ptr,
                              int *mpierrno)
 {
-  int rank = comm_ptr->rank;
-  int size = comm_ptr->local_size;
+  const int rank = comm_ptr->rank;
+  const int size = comm_ptr->local_size;
 
   int *sdispls, *scounts;
   int *rdispls, *rcounts;
@@ -126,7 +126,6 @@ int MPIDO_Scatterv_alltoallv(void * sendbuf,
                                 MPI_ERR_OTHER,
                                 "**nomem", 0);
   }
-  /*   memset(rbuf, 0, rbytes * size * sizeof(char));*/
 
   if(rank == root)
   {
@@ -154,7 +153,6 @@ int MPIDO_Scatterv_alltoallv(void * sendbuf,
     }
     memset(sdispls, 0, size*sizeof(int));
     memset(scounts, 0, size*sizeof(int));
-    /*      memset(sbuf, 0, rbytes * sizeof(char));*/
   }
 
   rdispls = MPIU_Malloc(size * sizeof(int));
@@ -196,7 +194,6 @@ int MPIDO_Scatterv_alltoallv(void * sendbuf,
   }
   else
   {
-    /*      memcpy(recvbuf, rbuf+(root*rbytes), rbytes);*/
     memcpy(recvbuf, rbuf, rbytes);
     MPIU_Free(rbuf);
     MPIU_Free(rdispls);
@@ -246,10 +243,19 @@ int MPIDO_Scatterv(const void *sendbuf,
   pami_xfer_t allred;
   int optscatterv[3];
   pami_type_t stype, rtype;
+  const int rank = comm_ptr->rank;
+#if ASSERT_LEVEL==0
+   /* We can't afford the tracing in ndebug/performance libraries */
+    const unsigned verbose = 0;
+#else
+    const unsigned verbose = (MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL) && (rank == 0);
+#endif
+   const struct MPIDI_Comm* const mpid = &(comm_ptr->mpid);
+   const int selected_type = mpid->user_selected_type[PAMI_XFER_SCATTERV_INT];
 
   allred.cb_done = allred_cb_done;
   allred.cookie = (void *)&allred_active;
-  allred.algorithm = comm_ptr->mpid.coll_algorithm[PAMI_XFER_ALLREDUCE][0][0];
+  allred.algorithm = mpid->coll_algorithm[PAMI_XFER_ALLREDUCE][0][0];
   allred.cmd.xfer_allreduce.sndbuf = (void *)optscatterv;
   allred.cmd.xfer_allreduce.stype = PAMI_TYPE_SIGNED_INT;
   allred.cmd.xfer_allreduce.rcvbuf = (void *)optscatterv;
@@ -259,9 +265,9 @@ int MPIDO_Scatterv(const void *sendbuf,
   allred.cmd.xfer_allreduce.op = PAMI_DATA_BAND;
 
 
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_SCATTERV_INT] == MPID_COLL_USE_MPICH)
+   if(selected_type == MPID_COLL_USE_MPICH)
   {
-    if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+    if(unlikely(verbose))
       fprintf(stderr,"Using MPICH scatterv algorithm\n");
     MPIDI_Update_last_algorithm(comm_ptr, "SCATTERV_MPICH");
     return MPIR_Scatterv(sendbuf, sendcounts, displs, sendtype,
@@ -272,25 +278,25 @@ int MPIDO_Scatterv(const void *sendbuf,
 
    pami_xfer_t scatterv;
    pami_algorithm_t my_scatterv;
-   pami_metadata_t *my_scatterv_md;
+   const pami_metadata_t *my_scatterv_md;
    volatile unsigned scatterv_active = 1;
    int queryreq = 0;
 
-   if(comm_ptr->mpid.user_selected_type[PAMI_XFER_SCATTERV_INT] == MPID_COLL_OPTIMIZED)
+   if(selected_type == MPID_COLL_OPTIMIZED)
    {
       TRACE_ERR("Optimized scatterv %s was selected\n",
-         comm_ptr->mpid.opt_protocol_md[PAMI_XFER_SCATTERV_INT][0].name);
-      my_scatterv = comm_ptr->mpid.opt_protocol[PAMI_XFER_SCATTERV_INT][0];
-      my_scatterv_md = &comm_ptr->mpid.opt_protocol_md[PAMI_XFER_SCATTERV_INT][0];
-      queryreq = comm_ptr->mpid.must_query[PAMI_XFER_SCATTERV_INT][0];
+         mpid->opt_protocol_md[PAMI_XFER_SCATTERV_INT][0].name);
+      my_scatterv = mpid->opt_protocol[PAMI_XFER_SCATTERV_INT][0];
+      my_scatterv_md = &mpid->opt_protocol_md[PAMI_XFER_SCATTERV_INT][0];
+      queryreq = mpid->must_query[PAMI_XFER_SCATTERV_INT][0];
    }
    else
    {
       TRACE_ERR("User selected %s for scatterv\n",
-         comm_ptr->mpid.user_selected[PAMI_XFER_SCATTERV_INT]);
-      my_scatterv = comm_ptr->mpid.user_selected[PAMI_XFER_SCATTERV_INT];
-      my_scatterv_md = &comm_ptr->mpid.user_metadata[PAMI_XFER_SCATTERV_INT];
-      queryreq = comm_ptr->mpid.user_selected_type[PAMI_XFER_SCATTERV_INT];
+      mpid->user_selected[PAMI_XFER_SCATTERV_INT]);
+      my_scatterv = mpid->user_selected[PAMI_XFER_SCATTERV_INT];
+      my_scatterv_md = &mpid->user_metadata[PAMI_XFER_SCATTERV_INT];
+      queryreq = selected_type;
    }
 
    if((recvbuf != MPI_IN_PLACE) && MPIDI_Datatype_to_pami(recvtype, &rtype, -1, NULL, &tmp) != MPI_SUCCESS)
@@ -299,9 +305,9 @@ int MPIDO_Scatterv(const void *sendbuf,
    if(MPIDI_Datatype_to_pami(sendtype, &stype, -1, NULL, &tmp) != MPI_SUCCESS)
       pamidt = 0;
 
-   if(pamidt == 0 || comm_ptr->mpid.user_selected_type[PAMI_XFER_SCATTERV_INT] == MPID_COLL_USE_MPICH)
+   if(pamidt == 0 || selected_type == MPID_COLL_USE_MPICH)
    {
-     if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+     if(unlikely(verbose))
        fprintf(stderr,"Using MPICH scatterv algorithm\n");
       TRACE_ERR("Scatterv using MPICH\n");
       MPIDI_Update_last_algorithm(comm_ptr, "SCATTERV_MPICH");
@@ -314,13 +320,13 @@ int MPIDO_Scatterv(const void *sendbuf,
    sbuf = (char *)sendbuf + send_true_lb;
    rbuf = recvbuf;
 
-   if(comm_ptr->rank == root)
+   if(rank == root)
    {
       if(recvbuf == MPI_IN_PLACE) 
       {
-        if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL))
+        if(unlikely(verbose))
           fprintf(stderr,"scatterv MPI_IN_PLACE buffering\n");
-        rbuf = (char *)sendbuf + ssize*displs[comm_ptr->rank] + send_true_lb;
+        rbuf = (char *)sendbuf + ssize*displs[rank] + send_true_lb;
       }
       else
       {  
@@ -352,13 +358,18 @@ int MPIDO_Scatterv(const void *sendbuf,
       TRACE_ERR("bitmask: %#X\n", result.bitmask);
       if(!result.bitmask)
       {
-         fprintf(stderr,"Query failed for %s\n", my_scatterv_md->name);
+        if(unlikely(verbose))
+          fprintf(stderr,"Query failed for %s\n", my_scatterv_md->name);
+        MPIDI_Update_last_algorithm(comm_ptr, "SCATTERV_MPICH");
+        return MPIR_Scatterv(sendbuf, sendcounts, displs, sendtype,
+                             recvbuf, recvcount, recvtype,
+                             root, comm_ptr, mpierrno);
       }
    }
 
    MPIDI_Update_last_algorithm(comm_ptr, my_scatterv_md->name);
 
-   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+   if(unlikely(verbose))
    {
       unsigned long long int threadID;
       MPIU_Thread_id_t tid;
@@ -398,8 +409,8 @@ int MPIDO_Scatterv(const void *sendbuf,
     * optscatterv[2] == sum of sendcounts
     */
 
-   optscatterv[0] = !comm_ptr->mpid.scattervs[0];
-   optscatterv[1] = !comm_ptr->mpid.scattervs[1];
+   optscatterv[0] = !mpid->scattervs[0];
+   optscatterv[1] = !mpid->scattervs[1];
    optscatterv[2] = 1;
 
    if(rank == root)
@@ -444,7 +455,7 @@ int MPIDO_Scatterv(const void *sendbuf,
   /* Make sure parameters are the same on all the nodes */
   /* specifically, noncontig on the receive */
   /* set the internal control flow to disable internal star tuning */
-   if(comm_ptr->mpid.preallreduces[MPID_SCATTERV_PREALLREDUCE])
+   if(mpid->preallreduces[MPID_SCATTERV_PREALLREDUCE])
    {
      TRACE_ERR("%s scatterv pre-allreduce\n", MPIDI_Process.context_post.active>0?"Posting":"Invoking");
      MPIDI_Post_coll_t allred_post;
@@ -499,7 +510,7 @@ int MPIDO_Scatterv(const void *sendbuf,
    } /* nothing valid to try, go to mpich */
    else
    {
-     if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
+     if(unlikely(verbose))
        fprintf(stderr,"Using MPICH scatterv algorithm\n");
       MPIDI_Update_last_algorithm(comm_ptr, "SCATTERV_MPICH");
       return MPIR_Scatterv(sendbuf, sendcounts, displs, sendtype,

http://git.mpich.org/mpich.git/commitdiff/4b9c2b4be5574f7df76589c2666abc6eaf51b20a

commit 4b9c2b4be5574f7df76589c2666abc6eaf51b20a
Author: Qi QC Zhang <keirazhang at cn.ibm.com>
Date:   Fri Jun 29 01:25:55 2012 -0400

    7 MPI-COM error injection cases core dump with MPICH2
    
    (ibm) D183554
    (ibm) Rh62qdr
    (ibm) 05023445d12c781486885571b98ad98db9162d98
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_constants.h b/src/mpid/pamid/include/mpidi_constants.h
index 0372434..0a5d997 100644
--- a/src/mpid/pamid/include/mpidi_constants.h
+++ b/src/mpid/pamid/include/mpidi_constants.h
@@ -84,4 +84,15 @@ enum
  };
 /** \} */
 
+
+enum
+{
+MPID_EPOTYPE_NONE      = 0,       /**< No epoch in effect */
+MPID_EPOTYPE_LOCK      = 1,       /**< MPI_Win_lock access epoch */
+MPID_EPOTYPE_START     = 2,       /**< MPI_Win_start access epoch */
+MPID_EPOTYPE_POST      = 3,       /**< MPI_Win_post exposure epoch */
+MPID_EPOTYPE_FENCE     = 4,       /**< MPI_Win_fence access/exposure epoch */
+MPID_EPOTYPE_REFENCE   = 5,       /**< MPI_Win_fence possible access/exposure epoch */
+};
+
 #endif
diff --git a/src/mpid/pamid/include/mpidi_datatypes.h b/src/mpid/pamid/include/mpidi_datatypes.h
index 271a882..0edc6d9 100644
--- a/src/mpid/pamid/include/mpidi_datatypes.h
+++ b/src/mpid/pamid/include/mpidi_datatypes.h
@@ -402,6 +402,9 @@ struct MPIDI_Win
     uint32_t assert; /**< MPI_MODE_* bits asserted at epoch start              */
 #endif
 
+    volatile int origin_epoch_type; /**< current epoch type for origin */
+    volatile int target_epoch_type; /**< current epoch type for target */
+
     /* These fields are reset by the sync functions */
     uint32_t          total;    /**< The number of PAMI requests that we know about (updated only by calling thread) */
     volatile uint32_t started;  /**< The number of PAMI requests made (updated only in the context_post callback) */
diff --git a/src/mpid/pamid/src/onesided/mpid_win_accumulate.c b/src/mpid/pamid/src/onesided/mpid_win_accumulate.c
index 59a4b32..85f42e2 100644
--- a/src/mpid/pamid/src/onesided/mpid_win_accumulate.c
+++ b/src/mpid/pamid/src/onesided/mpid_win_accumulate.c
@@ -167,6 +167,18 @@ MPID_Accumulate(void         *origin_addr,
   req->win          = win;
   req->type         = MPIDI_WIN_REQUEST_ACCUMULATE;
 
+  if(win->mpid.sync.origin_epoch_type == win->mpid.sync.target_epoch_type &&
+     win->mpid.sync.origin_epoch_type == MPID_EPOTYPE_REFENCE){
+     win->mpid.sync.origin_epoch_type = MPID_EPOTYPE_FENCE;
+     win->mpid.sync.target_epoch_type = MPID_EPOTYPE_FENCE;
+  }
+
+  if(win->mpid.sync.origin_epoch_type == MPID_EPOTYPE_NONE ||
+     win->mpid.sync.origin_epoch_type == MPID_EPOTYPE_POST){
+    MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                        return mpi_errno, "**rmasync");
+  }
+
   req->offset = target_disp * win->mpid.info[target_rank].disp_unit;
 
   if (origin_datatype == MPI_DOUBLE_INT)
@@ -251,6 +263,13 @@ MPID_Accumulate(void         *origin_addr,
 
   pami_result_t rc;
   pami_task_t task = MPID_VCR_GET_LPID(win->comm_ptr->vcr, target_rank);
+  if (win->mpid.sync.origin_epoch_type == MPID_EPOTYPE_START &&
+    !MPIDI_valid_group_rank(task, win->mpid.sync.sc.group))
+  {
+       MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                          return mpi_errno, "**rmasync");
+  }
+
   rc = PAMI_Endpoint_create(MPIDI_Client, task, 0, &req->dest);
   MPID_assert(rc == PAMI_SUCCESS);
 
diff --git a/src/mpid/pamid/src/onesided/mpid_win_fence.c b/src/mpid/pamid/src/onesided/mpid_win_fence.c
index 465130a..a2245ab 100644
--- a/src/mpid/pamid/src/onesided/mpid_win_fence.c
+++ b/src/mpid/pamid/src/onesided/mpid_win_fence.c
@@ -27,6 +27,31 @@ MPID_Win_fence(int       assert,
                MPID_Win *win)
 {
   int mpi_errno = MPI_SUCCESS;
+  static char FCNAME[] = "MPID_Win_fence";
+
+  if(win->mpid.sync.origin_epoch_type != win->mpid.sync.target_epoch_type){
+       MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                        return mpi_errno, "**rmasync");
+  }
+
+  if ((assert & MPI_MODE_NOPRECEDE) &&
+            win->mpid.sync.origin_epoch_type != MPID_EPOTYPE_NONE) {
+        /* --BEGIN ERROR HANDLING-- */
+        MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                return mpi_errno, "**rmasync");
+        /* --END ERROR HANDLING-- */
+  }
+
+  if (!(assert & MPI_MODE_NOPRECEDE) &&
+            win->mpid.sync.origin_epoch_type != MPID_EPOTYPE_FENCE &&
+            win->mpid.sync.origin_epoch_type != MPID_EPOTYPE_REFENCE &&
+            win->mpid.sync.origin_epoch_type != MPID_EPOTYPE_NONE) {
+        /* --BEGIN ERROR HANDLING-- */
+        MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                return mpi_errno, "**rmasync");
+        /* --END ERROR HANDLING-- */
+  }
+
 
   struct MPIDI_Win_sync* sync = &win->mpid.sync;
   MPID_PROGRESS_WAIT_WHILE(sync->total != sync->complete);
@@ -34,6 +59,16 @@ MPID_Win_fence(int       assert,
   sync->started  = 0;
   sync->complete = 0;
 
+  if(assert & MPI_MODE_NOSUCCEED)
+  {
+    win->mpid.sync.origin_epoch_type = MPID_EPOTYPE_NONE;
+    win->mpid.sync.target_epoch_type = MPID_EPOTYPE_NONE;
+  }
+  else{
+    win->mpid.sync.origin_epoch_type = MPID_EPOTYPE_REFENCE;
+    win->mpid.sync.target_epoch_type = MPID_EPOTYPE_REFENCE;
+  }
+
   mpi_errno = MPIR_Barrier_impl(win->comm_ptr, &mpi_errno);
   return mpi_errno;
 }
diff --git a/src/mpid/pamid/src/onesided/mpid_win_free.c b/src/mpid/pamid/src/onesided/mpid_win_free.c
index f9ec386..a71e039 100644
--- a/src/mpid/pamid/src/onesided/mpid_win_free.c
+++ b/src/mpid/pamid/src/onesided/mpid_win_free.c
@@ -38,6 +38,12 @@ MPID_Win_free(MPID_Win **win_ptr)
   MPID_Win *win = *win_ptr;
   size_t rank = win->comm_ptr->rank;
 
+  if(win->mpid.sync.origin_epoch_type != win->mpid.sync.target_epoch_type ||
+     (win->mpid.sync.origin_epoch_type != MPID_EPOTYPE_NONE &&
+      win->mpid.sync.origin_epoch_type != MPID_EPOTYPE_REFENCE)){
+    MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC, return mpi_errno, "**rmasync");
+  }
+
   mpi_errno = MPIR_Barrier_impl(win->comm_ptr, &mpi_errno);
   if (mpi_errno != MPI_SUCCESS)
     return mpi_errno;
diff --git a/src/mpid/pamid/src/onesided/mpid_win_get.c b/src/mpid/pamid/src/onesided/mpid_win_get.c
index 831e6fd..aa8114b 100644
--- a/src/mpid/pamid/src/onesided/mpid_win_get.c
+++ b/src/mpid/pamid/src/onesided/mpid_win_get.c
@@ -228,6 +228,18 @@ MPID_Get(void         *origin_addr,
   req->win          = win;
   req->type         = MPIDI_WIN_REQUEST_GET;
 
+  if(win->mpid.sync.origin_epoch_type == win->mpid.sync.target_epoch_type &&
+     win->mpid.sync.origin_epoch_type == MPID_EPOTYPE_REFENCE){
+     win->mpid.sync.origin_epoch_type = MPID_EPOTYPE_FENCE;
+     win->mpid.sync.target_epoch_type = MPID_EPOTYPE_FENCE;
+  }
+
+  if(win->mpid.sync.origin_epoch_type == MPID_EPOTYPE_NONE ||
+     win->mpid.sync.origin_epoch_type == MPID_EPOTYPE_POST){
+    MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                        return mpi_errno, "**rmasync");
+  }
+
   req->offset = target_disp * win->mpid.info[target_rank].disp_unit;
 
   MPIDI_Win_datatype_basic(origin_count,
@@ -293,6 +305,13 @@ MPID_Get(void         *origin_addr,
 
   pami_result_t rc;
   pami_task_t task = MPID_VCR_GET_LPID(win->comm_ptr->vcr, target_rank);
+  if (win->mpid.sync.origin_epoch_type == MPID_EPOTYPE_START &&
+    !MPIDI_valid_group_rank(task, win->mpid.sync.sc.group))
+  {
+       MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                          return mpi_errno, "**rmasync");
+  }
+
   rc = PAMI_Endpoint_create(MPIDI_Client, task, 0, &req->dest);
   MPID_assert(rc == PAMI_SUCCESS);
 
diff --git a/src/mpid/pamid/src/onesided/mpid_win_lock.c b/src/mpid/pamid/src/onesided/mpid_win_lock.c
index e137b25..af2c14a 100644
--- a/src/mpid/pamid/src/onesided/mpid_win_lock.c
+++ b/src/mpid/pamid/src/onesided/mpid_win_lock.c
@@ -167,6 +167,13 @@ MPID_Win_lock(int       lock_type,
 {
   int mpi_errno = MPI_SUCCESS;
   struct MPIDI_Win_sync_lock* slock = &win->mpid.sync.lock;
+  static char FCNAME[] = "MPID_Win_lock";
+
+  if(win->mpid.sync.origin_epoch_type != MPID_EPOTYPE_NONE &&
+     win->mpid.sync.origin_epoch_type != MPID_EPOTYPE_REFENCE){
+    MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                        return mpi_errno, "**rmasync");
+   }
 
   MPIDI_WinLock_info info = {
   .done = 0,
@@ -178,6 +185,8 @@ MPID_Win_lock(int       lock_type,
   MPIDI_Context_post(MPIDI_Context[0], &info.work, MPIDI_WinLockReq_post, &info);
   MPID_PROGRESS_WAIT_WHILE(!slock->remote.locked);
 
+  win->mpid.sync.origin_epoch_type = MPID_EPOTYPE_LOCK;
+
   return mpi_errno;
 }
 
@@ -187,6 +196,12 @@ MPID_Win_unlock(int       rank,
                 MPID_Win *win)
 {
   int mpi_errno = MPI_SUCCESS;
+  static char FCNAME[] = "MPID_Win_unlock";
+
+  if(win->mpid.sync.origin_epoch_type != MPID_EPOTYPE_LOCK){
+    MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                        return mpi_errno, "**rmasync");
+   }
 
   struct MPIDI_Win_sync* sync = &win->mpid.sync;
   MPID_PROGRESS_WAIT_WHILE(sync->total != sync->complete);
@@ -202,5 +217,12 @@ MPID_Win_unlock(int       rank,
   MPIDI_Context_post(MPIDI_Context[0], &info.work, MPIDI_WinUnlock_post, &info);
   MPID_PROGRESS_WAIT_WHILE(!info.done);
   sync->lock.remote.locked = 0;
+
+  if(win->mpid.sync.target_epoch_type == MPID_EPOTYPE_REFENCE)
+  {
+    win->mpid.sync.origin_epoch_type = MPID_EPOTYPE_REFENCE;
+  }else{
+    win->mpid.sync.origin_epoch_type = MPID_EPOTYPE_NONE;
+  }
   return mpi_errno;
 }
diff --git a/src/mpid/pamid/src/onesided/mpid_win_pscw.c b/src/mpid/pamid/src/onesided/mpid_win_pscw.c
index 6d3eec3..221e316 100644
--- a/src/mpid/pamid/src/onesided/mpid_win_pscw.c
+++ b/src/mpid/pamid/src/onesided/mpid_win_pscw.c
@@ -99,14 +99,25 @@ MPID_Win_start(MPID_Group *group,
                MPID_Win   *win)
 {
   int mpi_errno = MPI_SUCCESS;
+  static char FCNAME[] = "MPID_Win_start";
+  if(win->mpid.sync.origin_epoch_type != MPID_EPOTYPE_NONE &&
+    win->mpid.sync.origin_epoch_type != MPID_EPOTYPE_REFENCE)
+  {
+    MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                        return mpi_errno, "**rmasync");
+  }
+
   MPIR_Group_add_ref(group);
 
   struct MPIDI_Win_sync* sync = &win->mpid.sync;
   MPID_PROGRESS_WAIT_WHILE(group->size != sync->pw.count);
   sync->pw.count = 0;
 
-  MPID_assert(win->mpid.sync.sc.group == NULL);
+  MPIU_ERR_CHKORASSERT(win->mpid.sync.sc.group == NULL,
+                       mpi_errno, MPI_ERR_GROUP, return mpi_errno, "**group");
+
   win->mpid.sync.sc.group = group;
+  win->mpid.sync.origin_epoch_type = MPID_EPOTYPE_START;
 
   return mpi_errno;
 }
@@ -116,6 +127,11 @@ int
 MPID_Win_complete(MPID_Win *win)
 {
   int mpi_errno = MPI_SUCCESS;
+  static char FCNAME[] = "MPID_Win_complete";
+  if(win->mpid.sync.origin_epoch_type != MPID_EPOTYPE_START){
+     MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                        return mpi_errno, "**rmasync");
+  }
 
   struct MPIDI_Win_sync* sync = &win->mpid.sync;
   MPID_PROGRESS_WAIT_WHILE(sync->total != sync->complete);
@@ -130,6 +146,13 @@ MPID_Win_complete(MPID_Win *win)
   MPIDI_Context_post(MPIDI_Context[0], &info.work, MPIDI_WinComplete_post, &info);
   MPID_PROGRESS_WAIT_WHILE(!info.done);
 
+  if(win->mpid.sync.target_epoch_type == MPID_EPOTYPE_REFENCE)
+  {
+    win->mpid.sync.origin_epoch_type = MPID_EPOTYPE_REFENCE;
+  }else{
+    win->mpid.sync.origin_epoch_type = MPID_EPOTYPE_NONE;
+  }
+
   MPIR_Group_release(sync->sc.group);
   sync->sc.group = NULL;
   return mpi_errno;
@@ -142,9 +165,17 @@ MPID_Win_post(MPID_Group *group,
               MPID_Win   *win)
 {
   int mpi_errno = MPI_SUCCESS;
+  static char FCNAME[] = "MPID_Win_post";
+  if(win->mpid.sync.target_epoch_type != MPID_EPOTYPE_NONE &&
+     win->mpid.sync.target_epoch_type != MPID_EPOTYPE_REFENCE){
+       MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                        return mpi_errno, "**rmasync");
+  }
+
   MPIR_Group_add_ref(group);
 
-  MPID_assert(win->mpid.sync.pw.group == NULL);
+  MPIU_ERR_CHKORASSERT(win->mpid.sync.pw.group == NULL,
+                       mpi_errno, MPI_ERR_GROUP, return mpi_errno,"**group");
   win->mpid.sync.pw.group = group;
 
   MPIDI_WinPSCW_info info = {
@@ -154,6 +185,8 @@ MPID_Win_post(MPID_Group *group,
   MPIDI_Context_post(MPIDI_Context[0], &info.work, MPIDI_WinPost_post, &info);
   MPID_PROGRESS_WAIT_WHILE(!info.done);
 
+  win->mpid.sync.target_epoch_type = MPID_EPOTYPE_POST;
+
   return mpi_errno;
 }
 
@@ -162,14 +195,26 @@ int
 MPID_Win_wait(MPID_Win *win)
 {
   int mpi_errno = MPI_SUCCESS;
-
+  static char FCNAME[] = "MPID_Win_wait";
   struct MPIDI_Win_sync* sync = &win->mpid.sync;
+
+  if(win->mpid.sync.target_epoch_type != MPID_EPOTYPE_POST){
+    MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                        return mpi_errno, "**rmasync");
+  }
+
   MPID_Group *group = sync->pw.group;
   MPID_PROGRESS_WAIT_WHILE(group->size != sync->sc.count);
   sync->sc.count = 0;
   sync->pw.group = NULL;
 
   MPIR_Group_release(group);
+
+  if(win->mpid.sync.origin_epoch_type == MPID_EPOTYPE_REFENCE){
+    win->mpid.sync.target_epoch_type = MPID_EPOTYPE_REFENCE;
+  }else{
+    win->mpid.sync.target_epoch_type = MPID_EPOTYPE_NONE;
+  }
   return mpi_errno;
 }
 
@@ -179,8 +224,14 @@ MPID_Win_test(MPID_Win *win,
               int      *flag)
 {
   int mpi_errno = MPI_SUCCESS;
-
+  static char FCNAME[] = "MPID_Win_test";
   struct MPIDI_Win_sync* sync = &win->mpid.sync;
+
+  if(win->mpid.sync.target_epoch_type != MPID_EPOTYPE_POST){
+    MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                        return mpi_errno, "**rmasync");
+  }
+
   MPID_Group *group = sync->pw.group;
   if (group->size == sync->sc.count)
     {
@@ -188,6 +239,11 @@ MPID_Win_test(MPID_Win *win,
       sync->pw.group = NULL;
       *flag = 1;
       MPIR_Group_release(group);
+      if(win->mpid.sync.origin_epoch_type == MPID_EPOTYPE_REFENCE){
+        win->mpid.sync.target_epoch_type = MPID_EPOTYPE_REFENCE;
+      }else{
+        win->mpid.sync.target_epoch_type = MPID_EPOTYPE_NONE;
+      }
     }
   else
     {
diff --git a/src/mpid/pamid/src/onesided/mpid_win_put.c b/src/mpid/pamid/src/onesided/mpid_win_put.c
index 6fce2c7..1d4f428 100644
--- a/src/mpid/pamid/src/onesided/mpid_win_put.c
+++ b/src/mpid/pamid/src/onesided/mpid_win_put.c
@@ -234,6 +234,18 @@ MPID_Put(void         *origin_addr,
   req->win          = win;
   req->type         = MPIDI_WIN_REQUEST_PUT;
 
+  if(win->mpid.sync.origin_epoch_type == win->mpid.sync.target_epoch_type &&
+     win->mpid.sync.origin_epoch_type == MPID_EPOTYPE_REFENCE){
+     win->mpid.sync.origin_epoch_type = MPID_EPOTYPE_FENCE;
+     win->mpid.sync.target_epoch_type = MPID_EPOTYPE_FENCE;
+  }
+
+  if(win->mpid.sync.origin_epoch_type == MPID_EPOTYPE_NONE ||
+     win->mpid.sync.origin_epoch_type == MPID_EPOTYPE_POST){
+    MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                        return mpi_errno, "**rmasync");
+  }
+
   req->offset = target_disp * win->mpid.info[target_rank].disp_unit;
 
   MPIDI_Win_datatype_basic(origin_count,
@@ -299,6 +311,13 @@ MPID_Put(void         *origin_addr,
 
   pami_result_t rc;
   pami_task_t task = MPID_VCR_GET_LPID(win->comm_ptr->vcr, target_rank);
+  if (win->mpid.sync.origin_epoch_type == MPID_EPOTYPE_START &&
+    !MPIDI_valid_group_rank(task, win->mpid.sync.sc.group))
+  {
+       MPIU_ERR_SETANDSTMT(mpi_errno, MPI_ERR_RMA_SYNC,
+                          return mpi_errno, "**rmasync");
+  }
+
   rc = PAMI_Endpoint_create(MPIDI_Client, task, 0, &req->dest);
   MPID_assert(rc == PAMI_SUCCESS);
 

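The epoch checks added to MPID_Put and MPID_Get above all follow one pattern: promote a re-entrant fence (REFENCE) epoch to an active FENCE epoch on first use, then reject RMA operations issued with no epoch open (or during a target-only POST epoch). A minimal standalone sketch of that logic; the `epoch_t`/`win_sync_t` names are illustrative stand-ins for the pamid `MPID_EPOTYPE_*` values kept in `win->mpid.sync`:

```c
/* Illustrative stand-ins for the pamid epoch-type machinery. */
typedef enum {
  EPOTYPE_NONE,    /* no epoch in effect                    */
  EPOTYPE_LOCK,    /* passive target, MPI_Win_lock          */
  EPOTYPE_START,   /* active target, MPI_Win_start          */
  EPOTYPE_POST,    /* active target, MPI_Win_post (target)  */
  EPOTYPE_FENCE,   /* MPI_Win_fence epoch                   */
  EPOTYPE_REFENCE  /* fence epoch that may be re-entered    */
} epoch_t;

typedef struct { epoch_t origin, target; } win_sync_t;

/* Mirrors the check inserted into MPID_Put/MPID_Get: a REFENCE epoch
 * is promoted to FENCE on first use, and an RMA operation with no
 * epoch open (or inside a POST-only epoch) is a sync error that the
 * real code reports as MPI_ERR_RMA_SYNC ("**rmasync"). */
static int rma_op_allowed(win_sync_t *s)
{
  if (s->origin == s->target && s->origin == EPOTYPE_REFENCE) {
    s->origin = EPOTYPE_FENCE;
    s->target = EPOTYPE_FENCE;
  }
  if (s->origin == EPOTYPE_NONE || s->origin == EPOTYPE_POST)
    return 0;  /* caller raises the RMA synchronization error */
  return 1;
}
```

The same promote-then-validate shape also appears in MPID_Win_fence; only the set of legal entry epochs differs per call.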
http://git.mpich.org/mpich.git/commitdiff/b55e2ffda51d29261fcd2b230c2ff5276ccff743

commit b55e2ffda51d29261fcd2b230c2ff5276ccff743
Author: Su Huang <suhuang at us.ibm.com>
Date:   Fri Nov 9 11:21:01 2012 -0500

    mpich2 doesn't support mpc_statistics_write/zero in MPI programs
    
     * Charles:  Build for bluegene, no impact to bluegene, committing
       and signing off without BG team approval.
    
    (ibm) D187240
    (ibm) 3b6577e99204fe78b894e5cde8dca5af552f9ad5
    
    Signed-off-by: Charles Archer <archerc at us.ibm.com>
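The mpix.c hunk below maps the legacy PE entry points (mpc_statistics_write, mp_statistics_zero_, etc.) onto the new MPIX_/MPIXF_ implementations with GCC symbol aliases rather than wrapper functions. A sketch of that technique with hypothetical names (`old_entry`, `new_entry`); it assumes a GCC-compatible compiler, since `__attribute__((alias))` is not standard C:

```c
/* GCC alias attribute: both symbols resolve to the same definition,
 * so callers of the legacy name reach the new implementation with no
 * extra call frame. The alias target must be defined in the same
 * translation unit. */
int new_entry(void) { return 42; }
int old_entry(void) __attribute__ ((alias("new_entry")));
```

This keeps old binaries linking against the legacy names working while the real logic lives under one symbol.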

diff --git a/src/mpid/pamid/include/mpidi_externs.h b/src/mpid/pamid/include/mpidi_externs.h
index 1e12009..cb64f00 100644
--- a/src/mpid/pamid/include/mpidi_externs.h
+++ b/src/mpid/pamid/include/mpidi_externs.h
@@ -33,8 +33,6 @@ extern pami_client_t  MPIDI_Client;
 extern pami_context_t MPIDI_Context[];
 
 extern MPIDI_Process_t MPIDI_Process;
-extern int MPIX_Statistics_write (FILE *);
-extern int MPIX_Statistics_zero ();
 
 
 #endif
diff --git a/src/mpid/pamid/include/mpidi_util.h b/src/mpid/pamid/include/mpidi_util.h
index 2f5c6e9..6665571 100644
--- a/src/mpid/pamid/include/mpidi_util.h
+++ b/src/mpid/pamid/include/mpidi_util.h
@@ -147,7 +147,6 @@ typedef struct {
 } MPIX_stats_t;
 
 extern MPIDI_printenv_t *mpich_env;
-extern MPIX_stats_t mpid_statistics;
 extern MPIX_stats_t *mpid_statp;
 extern int   prtStat;
 extern int   prtEnv;
@@ -155,6 +154,7 @@ extern void set_mpich_env(int *,int*);
 extern int numTasks;
 extern void MPIDI_open_pe_extension();
 extern void MPIDI_close_pe_extension();
+extern int MPIDI_Statistics_write(FILE *);
 /*************************************************************
  *    MPIDI_STATISTICS
  *************************************************************/
diff --git a/src/mpid/pamid/src/mpidi_util.c b/src/mpid/pamid/src/mpidi_util.c
index fb116bf..60892a8 100644
--- a/src/mpid/pamid/src/mpidi_util.c
+++ b/src/mpid/pamid/src/mpidi_util.c
@@ -773,7 +773,7 @@ void MPIDI_print_statistics() {
   if ((MPIDI_Process.mp_statistics) ||
        (MPIDI_Process.mp_printenv)) {
        if (MPIDI_Process.mp_statistics) {
-           MPIX_Statistics_write(stdout);
+           MPIDI_Statistics_write(stdout);
            if (mpid_statp) MPIU_Free(mpid_statp);
        }
     if (MPIDI_Process.mp_printenv) {
diff --git a/src/mpid/pamid/src/mpix/mpix.c b/src/mpid/pamid/src/mpix/mpix.c
index 74da3b5..5dca860 100644
--- a/src/mpid/pamid/src/mpix/mpix.c
+++ b/src/mpid/pamid/src/mpix/mpix.c
@@ -160,15 +160,24 @@ MPIX_Hardware(MPIX_Hardware_t *hw)
 }
 
 #if (MPIDI_PRINTENV || MPIDI_STATISTICS || MPIDI_BANNER)
+void mpc_statistics_write() __attribute__ ((alias("MPIX_statistics_write")));
+void mp_statistics_write() __attribute__ ((alias("MPIXF_statistics_write")));
+void mp_statistics_write_() __attribute__ ((alias("MPIXF_statistics_write")));
+void mp_statistics_write__() __attribute__ ((alias("MPIXF_statistics_write")));
+void mpc_statistics_zero() __attribute__ ((alias("MPIX_statistics_zero")));
+void mp_statistics_zero() __attribute__ ((alias("MPIXF_statistics_zero")));
+void mp_statistics_zero_() __attribute__ ((alias("MPIXF_statistics_zero")));
+void mp_statistics_zero__() __attribute__ ((alias("MPIXF_statistics_zero")));
+
   /* ------------------------------------------- */
-  /* - mpid_statistics_zero  and        -------- */
-  /* - mpid_statistics_write can be     -------- */
-  /* - called during init and finalize  -------- */
-  /* - PE utiliti routines              -------- */
-  /* ------------------------------------------- */
+  /* - MPIDI_Statistics_zero  and        -------- */
+  /* - MPIDI_Statistics_write can be     -------- */
+  /* - called during init and finalize   -------- */
+  /* - PE utility routines               -------- */
+  /* -------------------------------------------- */
 
 int
-MPIX_Statistics_zero(void)
+MPIDI_Statistics_zero(void)
 {
     int rc=0;
 
@@ -184,8 +193,41 @@ MPIX_Statistics_zero(void)
 
    return (rc); /* to map with current PE support */
 }
+ /***************************************************************************
+ Function Name: _MPIX_statistics_zero
+
+ Description: Call the corresponding MPIDI_Statistics_zero function to initialize/clear
+              the statistics counters.
+
+ Parameters:
+ Name               Type         I/O
+ void
+ int                >0           Success
+                    <0           statistics not enabled
+ ***************************************************************************/
+
+int _MPIX_statistics_zero (void)
+{
+    int rc = MPIDI_Statistics_zero();
+    if (rc < 0) {
+        MPID_assert(rc == PAMI_SUCCESS);
+    }
+    return(rc);
+}
+
+int MPIX_statistics_zero(void)
+{
+    return(_MPIX_statistics_zero());
+}
+
+void MPIXF_statistics_zero(int *rc)
+{
+    *rc = _MPIX_statistics_zero();
+}
+
+
 int
-MPIX_Statistics_write (FILE *statfile) {
+MPIDI_Statistics_write(FILE *statfile) {
 
     int rc=-1;
     int i;
@@ -206,15 +248,15 @@ MPIX_Statistics_write (FILE *statfile) {
     sprintf(time_buf, __DATE__" "__TIME__);
     mpid_statp->sendWaitsComplete =  mpid_statp->sends - mpid_statp->sendsComplete;
     fprintf(statfile,"Start of task (pid=%d) statistics at %s \n", getpid(), time_buf);
-    fprintf(statfile, "PAMID: sends = %ld\n", mpid_statp->sends);
-    fprintf(statfile, "PAMID: sendsComplete = %ld\n", mpid_statp->sendsComplete);
-    fprintf(statfile, "PAMID: sendWaitsComplete = %ld\n", mpid_statp->sendWaitsComplete);
-    fprintf(statfile, "PAMID: recvs = %ld\n", mpid_statp->recvs);
-    fprintf(statfile, "PAMID: recvWaitsComplete = %ld\n", mpid_statp->recvWaitsComplete);
-    fprintf(statfile, "PAMID: earlyArrivals = %ld\n", mpid_statp->earlyArrivals);
-    fprintf(statfile, "PAMID: earlyArrivalsMatched = %ld\n", mpid_statp->earlyArrivalsMatched);
-    fprintf(statfile, "PAMID: lateArrivals = %ld\n", mpid_statp->lateArrivals);
-    fprintf(statfile, "PAMID: unorderedMsgs = %ld\n", mpid_statp->unorderedMsgs);
+    fprintf(statfile, "MPICH: sends = %ld\n", mpid_statp->sends);
+    fprintf(statfile, "MPICH: sendsComplete = %ld\n", mpid_statp->sendsComplete);
+    fprintf(statfile, "MPICH: sendWaitsComplete = %ld\n", mpid_statp->sendWaitsComplete);
+    fprintf(statfile, "MPICH: recvs = %ld\n", mpid_statp->recvs);
+    fprintf(statfile, "MPICH: recvWaitsComplete = %ld\n", mpid_statp->recvWaitsComplete);
+    fprintf(statfile, "MPICH: earlyArrivals = %ld\n", mpid_statp->earlyArrivals);
+    fprintf(statfile, "MPICH: earlyArrivalsMatched = %ld\n", mpid_statp->earlyArrivalsMatched);
+    fprintf(statfile, "MPICH: lateArrivals = %ld\n", mpid_statp->lateArrivals);
+    fprintf(statfile, "MPICH: unorderedMsgs = %ld\n", mpid_statp->unorderedMsgs);
     fflush(statfile);
     memset(&query_stat,0, sizeof(query_stat));
     query_stat.name =  (pami_attribute_name_t)PAMI_CONTEXT_STATISTICS ;
@@ -262,9 +304,80 @@ n",rc);
         }
    return (rc);
 }
+ /***************************************************************************
+ Function Name: _MPIX_statistics_write
+ Description: Call MPIDI_Statistics_write to write statistical
+              information to the specified file descriptor.
+ Parameters:
+ Name               Type         I/O
+ fptr               FILE*        I    File pointer, can be stdout or stderr.
+                                      If it is to a file, user has to open
+                                      the file.
+ rc (Fortran only)  int          0    Return sum from MPIDI_Statistics_write calls
+ <returns> (C only)  0                Both MPICH and PAMI statistics
+ ***************************************************************************/
+int _MPIX_statistics_write(FILE* fptr)
+{
+    int rc = MPIDI_Statistics_write(fptr);
+    if (rc < 0) {
+        MPID_assert(rc == PAMI_SUCCESS);
+    }
+    return(rc);
+}
+
+int MPIX_statistics_write(FILE* fptr)
+{
+    return(_MPIX_statistics_write(fptr));
+}
+
+/* Fortran:  fdes is pointer to a file descriptor.
+ *           rc   is pointer to buffer for storing return code.
+ *
+ * Note: Fortran app. will convert a Fortran I/O unit to a file
+ *       descriptor by calling Fortran utilities, flush_ and getfd.
+ *       When fdes=1, output is to STDOUT.  When fdes=2, output is to STDERR.
+ */
+
+void MPIXF_statistics_write(int *fdes, int *rc)
+{
+    FILE *fp;
+    int  dup_fd;
+    int  closefp=0;
 
+    /* Convert the DUP file descriptor to a FILE pointer */
+    dup_fd = dup(*fdes);
+    if ( (fp = fdopen(dup_fd, "a")) != NULL )
+       closefp = 1;
+    else
+       fp = stdout;    /* If fdopen failed then default to stdout */
+
+    *rc = _MPIX_statistics_write(fp);
+
+    /* The check is because I don't want to close stdout. */
+    if ( closefp ) fclose(fp);
+}
+
+void MPIXF_statistics_write_(int *fdes, int *rc)
+{
+    FILE *fp;
+    int  dup_fd;
+    int  closefp=0;
+
+    /* Convert the DUP file descriptor to a FILE pointer */
+    dup_fd = dup(*fdes);
+    if ( (fp = fdopen(dup_fd, "a")) != NULL )
+       closefp = 1;
+    else
+       fp = stdout;    /* If fdopen failed then default to stdout */
+
+    *rc = _MPIX_statistics_write(fp);
+
+    /* The check is because I don't want to close stdout. */
+    if ( closefp ) fclose(fp);
+}
 #endif
 
+
 #ifdef __BGQ__
 
 int

http://git.mpich.org/mpich.git/commitdiff/eebe16d1604f4df9562e469631288e80b7026bf7

commit eebe16d1604f4df9562e469631288e80b7026bf7
Author: Su Huang <suhuang at us.ibm.com>
Date:   Thu Oct 25 10:09:03 2012 -0400

    Memory management and token flow control for early arrivals
    
     * D187119: Check in the fixes from code review - memory management and token flow control
     * Fix some white space issues
     * Fix build break
     * Data integrity error w/ mpich2 MPI_Reduce & MPI_Allreduce. Put
       MPIDI_Request_setPeerRank_pami under OUT_OF_ORDER_HANDLING
    
    (ibm) F182398
    (ibm) D187119
    (ibm) D187228
    (ibm) D186242
    (ibm) 7ae1e0584637dfc8b285f38b98582880b8a10c26
    (ibm) ddf821de992bf60d90468094858613daae192e73
    (ibm) e3b66ad45c1286a14ef76dd6fdc029d1c0dfc612
    (ibm) f304257166639b4bac1e5d8971e6fb29d7ffb7d9
    
    Signed-off-by: Charles Archer <archerc at us.ibm.com>
    Signed-off-by: Su Huang <suhuang at us.ibm.com>
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>
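The token flow control introduced by this commit piggybacks a small token count on every message header: the `tokens:4` bit-field in MPIDI_MsgInfo holds at most `TOKENS_BITMASK` (15) tokens per message, which is why TOKENS_BIT in mpidi_platform.h must stay consistent with the header layout. A sketch of the masking arithmetic, using the macros from the diff (`pack_tokens` is an illustrative helper, not pamid code):

```c
/* From mpidi_platform.h in this commit: the token count travels in a
 * 4-bit header field, so at most 15 tokens fit per message. */
#define TOKENS_BIT     (4)
#define TOKENS_BITMASK ((1 << TOKENS_BIT) - 1)   /* 0xF */

/* Clamp a token count into the 4-bit header field; counts above the
 * mask wrap, so larger returns must be split across messages (or use
 * the separate MPIDI_CONTROL_RETURN_TOKENS path with `alltokens`). */
static unsigned pack_tokens(unsigned tokens)
{
  return tokens & TOKENS_BITMASK;
}
```

This is also why the diff adds the full-width `alltokens` field under TOKEN_FLOW_CONTROL: bulk returns cannot fit in the 4-bit piggyback field.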

diff --git a/src/mpid/common/datatype/dataloop/veccpy.h b/src/mpid/common/datatype/dataloop/veccpy.h
index 12aa9e8..ea138a8 100644
--- a/src/mpid/common/datatype/dataloop/veccpy.h
+++ b/src/mpid/common/datatype/dataloop/veccpy.h
@@ -56,7 +56,7 @@
     type * tmp_src = l_src;                                     \
     register int _i, j, k;		                        \
     unsigned long total_count = count * nelms;                  \
-    const int l_stride = stride;				\
+    const DLOOP_Offset l_stride = stride;				\
                                                                 \
     if (nelms == 1) {                                           \
         for (_i = total_count; _i; _i--) {			        \
@@ -168,7 +168,7 @@
     type * tmp_src = l_src;                                     \
     register int _i, j, k;		                        \
     unsigned long total_count = count * nelms;                  \
-    const int l_stride = stride;				\
+    const DLOOP_Offset l_stride = stride;				\
                                                                 \
     if (nelms == 1) {                                           \
         for (_i = total_count; _i; _i--) {			        \
@@ -280,7 +280,7 @@
     type * tmp_dest = l_dest;                                   \
     register int _i, j, k;		                        \
     unsigned long total_count = count * nelms;                  \
-    const int l_stride = stride;				\
+    const DLOOP_Offset l_stride = stride;				\
                                                                 \
     if (nelms == 1) {                                           \
         for (_i = total_count; _i; _i--) {			        \
@@ -392,7 +392,7 @@
     type * tmp_dest = l_dest;                                   \
     register int _i, j, k;		                        \
     unsigned long total_count = count * nelms;                  \
-    const int l_stride = stride;				\
+    const DLOOP_Offset l_stride = stride;				\
                                                                 \
     if (nelms == 1) {                                           \
         for (_i = total_count; _i; _i--) {			        \
diff --git a/src/mpid/pamid/include/mpidi_datatypes.h b/src/mpid/pamid/include/mpidi_datatypes.h
index a6baa07..271a882 100644
--- a/src/mpid/pamid/include/mpidi_datatypes.h
+++ b/src/mpid/pamid/include/mpidi_datatypes.h
@@ -81,6 +81,10 @@ typedef struct
     MPIDI_pt2pt_limits_t limits;
   } pt2pt;
   unsigned disable_internal_eager_scale; /**< The number of tasks at which point eager will be disabled */
+#if TOKEN_FLOW_CONTROL
+  unsigned long long mp_buf_mem;
+  unsigned is_token_flow_control_on;
+#endif
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
   unsigned mp_infolevel;
   unsigned mp_statistics;     /* print pamid statistcs data                           */
@@ -164,6 +168,7 @@ typedef enum
     MPIDI_CONTROL_CANCEL_ACKNOWLEDGE,
     MPIDI_CONTROL_CANCEL_NOT_ACKNOWLEDGE,
     MPIDI_CONTROL_RENDEZVOUS_ACKNOWLEDGE,
+    MPIDI_CONTROL_RETURN_TOKENS,
   } MPIDI_CONTROL;
 
 
@@ -206,12 +211,18 @@ typedef struct
       unsigned control:3;  /**< message type for control protocols */
       unsigned isSync:1;   /**< set for sync sends     */
       unsigned isRzv :1;   /**< use pt2pt rendezvous   */
+      unsigned    noRDMA:1;    /**< msg sent via shm or mem reg. fails */
+      unsigned    reserved:6;  /**< unused bits                        */
+      unsigned    tokens:4;    /**< tokens to be returned              */
     } __attribute__ ((__packed__));
   };
 
 #ifdef OUT_OF_ORDER_HANDLING
   unsigned    MPIseqno;    /**< match seqno            */
 #endif
+#if TOKEN_FLOW_CONTROL
+  unsigned    alltokens;   /* control:MPIDI_CONTROL_RETURN_TOKENS  */
+#endif
 } MPIDI_MsgInfo;
 
 /** \brief Full Rendezvous msg info to be set as two quads of unexpected data. */
diff --git a/src/mpid/pamid/include/mpidi_macros.h b/src/mpid/pamid/include/mpidi_macros.h
index 6a8ab4c..3e1798d 100644
--- a/src/mpid/pamid/include/mpidi_macros.h
+++ b/src/mpid/pamid/include/mpidi_macros.h
@@ -31,6 +31,8 @@
 #include "mpidi_datatypes.h"
 #include "mpidi_externs.h"
 
+#define TOKEN_FLOW_CONTROL_ON (TOKEN_FLOW_CONTROL && MPIU_Token_on())
+
 #ifdef TRACE_ON
 #ifdef __GNUC__
 #define TRACE_ALL(fd, format, ...) fprintf(fd, "%s:%u (%d) " format, __FILE__, __LINE__, MPIR_Process.comm_world->rank, ##__VA_ARGS__)
@@ -45,6 +47,11 @@
 #define TRACE_ERR(format...)
 #endif
 
+#if TOKEN_FLOW_CONTROL
+#define MPIU_Token_on() (MPIDI_Process.is_token_flow_control_on)
+#else
+#define MPIU_Token_on() (0)
+#endif
 
 /**
  * \brief Gets significant info regarding the datatype
diff --git a/src/mpid/pamid/include/mpidi_platform.h b/src/mpid/pamid/include/mpidi_platform.h
index 61a001d..6ce3125 100644
--- a/src/mpid/pamid/include/mpidi_platform.h
+++ b/src/mpid/pamid/include/mpidi_platform.h
@@ -76,6 +76,7 @@
 #define PAMIX_IS_LOCAL_TASK
 #define PAMIX_IS_LOCAL_TASK_STRIDE  (4)
 #define PAMIX_IS_LOCAL_TASK_SHIFT   (6)
+#define TOKEN_FLOW_CONTROL    0
 
 /*
  * Enable both the 'internal vs application' and the 'local vs remote'
@@ -107,6 +108,8 @@ static const char _ibm_release_version_[] = "V1R2M0";
 #ifdef __PE__
 #undef USE_PAMI_CONSISTENCY
 #define USE_PAMI_CONSISTENCY PAMI_HINT_DISABLE
+#undef  MPIDI_SHORT_LIMIT
+#define MPIDI_SHORT_LIMIT (256 - sizeof(MPIDI_MsgInfo))
 #undef  MPIDI_EAGER_LIMIT
 #define MPIDI_EAGER_LIMIT 65536
 #undef  MPIDI_EAGER_LIMIT_LOCAL
@@ -119,6 +122,7 @@ static const char _ibm_release_version_[] = "V1R2M0";
 #define RDMA_FAILOVER
 #define MPIDI_BANNER          1
 #define MPIDI_NO_ASSERT       1
+#define TOKEN_FLOW_CONTROL    1
 
 /* 'is local task' extension and limits */
 #define PAMIX_IS_LOCAL_TASK
@@ -155,5 +159,17 @@ static const char _ibm_release_version_[] = "%W%";
 
 #endif
 
+#if TOKEN_FLOW_CONTROL
+#define BUFFER_MEM_DEFAULT (1<<26)          /* 64MB                         */
+#define BUFFER_MEM_MAX     (1<<26)          /* 64MB                         */
+#define ONE_SHARED_SEGMENT (1<<28)          /* 256MB                        */
+#define EAGER_LIMIT_DEFAULT     65536
+#define MAX_BUF_BKT_SIZE        (1<<18)     /* Max eager_limit is 256K     */
+#define MIN_BUF_BKT_SIZE        (64)
+#define TOKENS_BIT         (4)              /* 4 bits piggybacked to sender */
+                                            /* must be consistent with the
+                                               tokens field in MPIDI_MsgInfo */
+#define TOKENS_BITMASK ((1 << TOKENS_BIT)-1)
+#endif
 
 #endif
diff --git a/src/mpid/pamid/include/mpidi_trace.h b/src/mpid/pamid/include/mpidi_trace.h
new file mode 100644
index 0000000..143cae3
--- /dev/null
+++ b/src/mpid/pamid/include/mpidi_trace.h
@@ -0,0 +1,159 @@
+/*  (C)Copyright IBM Corp.  2007, 2011  */
+/**
+ * \file include/mpidi_trace.h
+ * \brief Record trace information for pt2pt communication.
+ */
+/*
+ *
+ *
+ */
+
+
+#ifndef __include_mpidi_trace_h__
+#define __include_mpidi_trace_h__
+
+#include <sys/time.h>
+#include <sys/param.h>
+
+#ifdef MPIDI_TRACE
+#define N_MSGS    1024
+#define SEQMASK   (N_MSGS-1)
+typedef struct {
+   void  *req;            /* address of request                      */
+   void  *bufadd;         /* user's receive buffer address           */
+   uint  msgid;           /* msg seqno.                              */
+   unsigned short ctx;    /* mpi context id                          */
+   unsigned short dummy;  /* reserved                                */
+   uint nMsgs;            /* highest msg seqno that arrived in order */
+   int        tag;
+   int        len;
+   int        rsource;    /* source of the message arrived           */
+   int        rtag;       /* tag of a received message               */
+   int        rlen;       /* len of a received message               */
+   int        rctx;       /* context of a received message           */
+   uint       posted:1;   /* has the receive been posted?            */
+   uint       rzv:1;      /* rendezvous message ?                    */
+   uint       sync:1;     /* synchronous message?                    */
+   uint       sendAck:1;  /* send ack?                               */
+   uint       sendFin:1;  /* send complete info?                     */
+   uint       HH:1;       /* header handler                          */
+   uint       ool:1;      /* the msg arrived out of order            */
+   uint       matchedInOOL:1;/* found a match in out of order list   */
+   uint       comp_in_HH:4;  /* the msg completed in header handler  */
+   uint       comp_in_HHV_noMatch:1;/* no match in header handler EA */
+   uint       sync_com_in_HH:1; /* sync msg completed in header handler*/
+   uint       matchedInHH:1; /* found a match in header handler      */
+   uint       matchedInComp:1;/* found a match in completion handler */
+   uint       matchedInUQ:2; /* found a match in unexpected queue    */
+   uint       matchedInUQ2:2;/* found a match in unexpected queue    */
+   uint       matchedInWait:1;/* found a match in MPI_Wait() etc.    */
+   uint       ReadySend:1;   /* a ready send message                 */
+   uint       persist:1;     /* persist communication                */
+   uint       reserve:9;
+   void *     matchedHandle; /* a message with multiple handles      */
+} recv_status;
+
+typedef struct {
+   void       *req;          /* address of request                   */
+   void       *bufaddr;      /* address of user's send buffer        */
+   int        dest;          /* destination of a message             */
+   int        rank;          /* rank in a communicator               */
+   int        mode;          /* protocol used                        */
+   uint       msgid;         /* message sequence no.                 */
+   unsigned short  sctx;     /* context id                           */
+   unsigned short dummy;
+   int        tag;           /* tag of a message                     */
+   int        len;           /* length of a message                  */
+   uint       blocking:1;    /* blocking send ?                      */
+   uint       sync:1;        /* sync message                         */
+   uint       sendEnvelop:1; /* envelope send?                       */
+   uint       sendShort:1;   /* send immediate                       */
+   uint       sendEager:1;   /* eager send                           */
+   uint       sendRzv:1;     /* send via rendezvous protocol         */
+   uint       memRegion:1;   /* memory is registered                 */
+   uint       use_pami_get:1;/* use only PAMI_Get()                  */
+   uint       NoComp:4;      /* no completion handler                */
+   uint       sendComp:1;    /* send complete                        */
+   uint       recvAck:1;     /* recv an ack from the receiver        */
+   uint       recvFin:1;     /* recv complete information            */
+   uint       complSync:1;   /* complete sync                        */
+   uint       ReadySend:1;   /* ready send                           */
+   uint       reqXfer:1;     /* request message transfer             */
+   uint       persist:1;     /* persistent communication             */
+   uint       reserved:15;
+} send_status;
+
+typedef struct {
+   void  *req;         /* address of a request                 */
+   void  *bufadd;      /* address of user receive buffer       */
+   int    src_task;    /* source PAMI task id                  */
+   int    rank;        /* rank in a communicator               */
+   int    tag;         /* tag of a posted recv                 */
+   int    count;       /* count of a specified datatype        */
+   int    datatype;
+   int    len;         /* length of a receive message          */
+   uint  nMsgs;        /* number of messages received          */
+   uint  msgid;        /* msg seqno of the matched message     */
+   uint  sendCtx:16;   /* context of incoming msg              */
+   uint  recvCtx:16;   /* context of a posted receive          */
+   uint  lw:4;         /* use lw protocol immediate send       */
+   uint  persist:4;    /* persistent communication             */
+   uint  blocking:2;   /* blocking receive                     */
+   uint  reserve:22;
+} posted_recv;
+
+#define MPIDI_SET_PR_REC(rreq,buf,ct,ll,dt,pami_id,rank,tag,comm,is_blk) { \
+        int idx,src,seqNo,x;                                      \
+        if (pami_id != MPI_ANY_SOURCE)                            \
+            src=pami_id;                                          \
+        else {                                                    \
+            src= MPIR_Process.comm_world->rank;                   \
+        }                                                         \
+        MPIDI_Trace_buf[src].totPR++ ;                            \
+        seqNo=MPIDI_Trace_buf[src].totPR;                         \
+        idx = (seqNo & SEQMASK);                                  \
+        bzero(&MPIDI_Trace_buf[src].PR[idx],sizeof(posted_recv)); \
+        MPIDI_Trace_buf[src].PR[idx].src_task= pami_id;           \
+        MPIDI_Trace_buf[src].PR[idx].rank   = rank;               \
+        MPIDI_Trace_buf[src].PR[idx].bufadd = buf;                \
+        MPIDI_Trace_buf[src].PR[idx].msgid = seqNo;               \
+        MPIDI_Trace_buf[src].PR[idx].count = ct;                  \
+        MPIDI_Trace_buf[src].PR[idx].len   = ll;                  \
+        MPIDI_Trace_buf[src].PR[idx].datatype = dt;               \
+        MPIDI_Trace_buf[src].PR[idx].tag=tag;                     \
+        MPIDI_Trace_buf[src].PR[idx].sendCtx=comm->context_id;    \
+        MPIDI_Trace_buf[src].PR[idx].recvCtx=comm->recvcontext_id;\
+        MPIDI_Trace_buf[src].PR[idx].blocking=is_blk;             \
+        rreq->mpid.PR_idx=idx;                                    \
+}
+
+#define MPIDI_GET_S_REC(sreq,ctx,isSync,dataSize) {             \
+        send_status *sstatus;                                   \
+        int dest=sreq->mpid.partner_id;                         \
+        int seqNo=sreq->mpid.envelope.msginfo.MPIseqno;         \
+        int idx = (seqNo & SEQMASK);                            \
+        bzero(&MPIDI_Trace_buf[dest].S[idx],sizeof(send_status));\
+        sstatus=&MPIDI_Trace_buf[dest].S[idx];                  \
+        sstatus->req    = (void *)sreq;                         \
+        sstatus->tag    = sreq->mpid.envelope.msginfo.MPItag;   \
+        sstatus->dest   = sreq->mpid.peer_pami;                 \
+        sstatus->rank   = sreq->mpid.peer_comm;                 \
+        sstatus->msgid = seqNo;                                 \
+        sstatus->sync = isSync;                                 \
+        sstatus->sctx = ctx;                                    \
+        sstatus->len= dataSize;                                 \
+        sreq->mpid.idx=idx;                                     \
+}
+
+typedef struct MPIDI_Trace_buf {
+    recv_status *R;     /* record incoming messages    */
+    posted_recv *PR;    /* record posted receive       */
+    send_status *S;     /* send messages               */
+    int  totPR;         /* total no. of posted receives */
+} MPIDI_Trace_buf_t;
+
+MPIDI_Trace_buf_t  *MPIDI_Trace_buf;
+
+#endif  /* MPIDI_TRACE             */
+#endif   /* include_mpidi_trace_h  */
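The `MPIDI_SET_PR_REC` macro above files each posted-receive record at slot `seqNo & SEQMASK`, i.e. a fixed-size per-source ring that silently overwrites old entries once the sequence number wraps. A minimal sketch of that indexing (the real `SEQMASK` is defined elsewhere in this header; a 256-slot ring is an illustrative assumption):

```c
/* Illustrative stand-in for the trace-ring indexing used by
 * MPIDI_SET_PR_REC; DEMO_SEQMASK is an assumed value, not the
 * real SEQMASK from mpidi_trace.h. */
#define DEMO_SEQMASK 0xFF   /* assumed: 256 slots per source */

static int ring_slot(int seqNo)
{
    /* record number seqNo lands in (and overwrites) this slot */
    return seqNo & DEMO_SEQMASK;
}
```

With a 256-slot ring, record 257 reuses the slot of record 1, so only the most recent 256 posted receives per source are retained for tracing.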
diff --git a/src/mpid/pamid/include/mpidi_util.h b/src/mpid/pamid/include/mpidi_util.h
index ea30836..2f5c6e9 100644
--- a/src/mpid/pamid/include/mpidi_util.h
+++ b/src/mpid/pamid/include/mpidi_util.h
@@ -83,6 +83,7 @@ typedef struct {
         int interrupts;
         uint  polling_interval;
         int eager_limit;
+        int use_token_flow_control;
         char wait_mode[8];
         int use_shmem;
         uint retransmit_interval;
diff --git a/src/mpid/pamid/src/Makefile.mk b/src/mpid/pamid/src/Makefile.mk
index 69aacbf..247812a 100644
--- a/src/mpid/pamid/src/Makefile.mk
+++ b/src/mpid/pamid/src/Makefile.mk
@@ -40,6 +40,7 @@ include $(top_srcdir)/src/mpid/pamid/src/pt2pt/Makefile.mk
 
 lib_lib at MPILIBNAME@_la_SOURCES +=               \
     src/mpid/pamid/src/mpid_buffer.c            \
+    src/mpid/pamid/src/mpidi_bufmm.c            \
     src/mpid/pamid/src/mpid_finalize.c          \
     src/mpid/pamid/src/mpid_init.c              \
     src/mpid/pamid/src/mpid_iprobe.c            \
diff --git a/src/mpid/pamid/src/mpid_finalize.c b/src/mpid/pamid/src/mpid_finalize.c
index a09eda2..549853f 100644
--- a/src/mpid/pamid/src/mpid_finalize.c
+++ b/src/mpid/pamid/src/mpid_finalize.c
@@ -22,6 +22,9 @@
 #include <mpidimpl.h>
 
 
+#if TOKEN_FLOW_CONTROL
+extern void MPIDI_close_mm();
+#endif
 
 #ifdef MPIDI_STATISTICS
 extern pami_extension_t pe_extension;
@@ -86,5 +89,18 @@ int MPID_Finalize()
   MPIU_Free(MPIDI_Out_cntr);
 #endif
 
+ if (TOKEN_FLOW_CONTROL_ON)
+   {
+     #if TOKEN_FLOW_CONTROL
+     extern char *EagerLimit;
+     
+     if (EagerLimit) MPIU_Free(EagerLimit);
+     MPIU_Free(MPIDI_Token_cntr);
+     MPIDI_close_mm();
+     #else
+     MPID_assert_always(0);
+     #endif
+   }
+
   return MPI_SUCCESS;
 }
diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index 88a9abe..e4ac607 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -27,6 +27,11 @@
 #include "mpidi_platform.h"
 #include "onesided/mpidi_onesided.h"
 
+#if TOKEN_FLOW_CONTROL
+  extern int MPIDI_mm_init(int,uint *,unsigned long *);
+  extern int MPIDI_tfctrl_enabled;
+#endif
+
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
   pami_extension_t pe_extension;
 #endif
@@ -88,6 +93,10 @@ MPIDI_Process_t  MPIDI_Process = {
     },
   },
   .disable_internal_eager_scale = MPIDI_DISABLE_INTERNAL_EAGER_SCALE,
+#if TOKEN_FLOW_CONTROL
+  .mp_buf_mem          = BUFFER_MEM_DEFAULT,
+  .is_token_flow_control_on = 0,
+#endif
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
   .mp_infolevel          = 0,
   .mp_statistics         = 0,
@@ -380,6 +389,23 @@ MPIDI_PAMI_context_init(int* threading)
   memset((void *) MPIDI_In_cntr,0, sizeof(MPIDI_In_cntr_t));
   memset((void *) MPIDI_Out_cntr,0, sizeof(MPIDI_Out_cntr_t));
 #endif
+
+if (TOKEN_FLOW_CONTROL_ON)
+  {
+    #if TOKEN_FLOW_CONTROL
+    int i;
+    MPIDI_mm_init(numTasks,&MPIDI_Process.pt2pt.limits.application.eager.remote,&MPIDI_Process.mp_buf_mem);
+    MPIDI_Token_cntr = MPIU_Calloc0(numTasks, MPIDI_Token_cntr_t);
+    memset((void *) MPIDI_Token_cntr,0, (sizeof(MPIDI_Token_cntr_t) * numTasks));
+    for (i=0; i < numTasks; i++)
+      {
+        MPIDI_Token_cntr[i].tokens=MPIDI_tfctrl_enabled;
+      }
+    #else
+    MPID_assert_always(0);
+    #endif
+}
+
 #ifdef MPIDI_TRACE
       int i; 
       for (i=0; i < numTasks; i++) {
@@ -561,6 +587,8 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
              "  rma_pending           : %u\n"
              "  shmem_pt2pt           : %u\n"
              "  disable_internal_eager_scale : %u\n"
+             "  mp_buf_mem               : %u\n"
+             "  is_token_flow_control_on : %u\n"
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
              "  mp_infolevel : %u\n"
              "  mp_statistics: %u\n"
@@ -586,6 +614,13 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
              MPIDI_Process.rma_pending,
              MPIDI_Process.shmem_pt2pt,
              MPIDI_Process.disable_internal_eager_scale,
+#if TOKEN_FLOW_CONTROL             
+             MPIDI_Process.mp_buf_mem,
+             MPIDI_Process.is_token_flow_control_on,
+#else
+             0,
+             0,
+#endif
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
              MPIDI_Process.mp_infolevel,
              MPIDI_Process.mp_statistics,
diff --git a/src/mpid/pamid/src/mpid_recvq.h b/src/mpid/pamid/src/mpid_recvq.h
index b1d605e..b566f21 100644
--- a/src/mpid/pamid/src/mpid_recvq.h
+++ b/src/mpid/pamid/src/mpid_recvq.h
@@ -179,6 +179,16 @@ MPIDI_Recvq_FDU_or_AEP(MPID_Request *newreq, int source, pami_task_t pami_source
   return rreq;
 }
 
+#if TOKEN_FLOW_CONTROL
+typedef struct MPIDI_Token_cntr {
+    uint16_t unmatched;          /* no. of unmatched EA messages              */
+    uint16_t rettoks;            /* no. of tokens to be returned              */
+    int  tokens;                 /* no. of tokens available-pairwise          */
+    int  n_tokenStarved;         /* no. of times token starvation occured     */
+} MPIDI_Token_cntr_t;
+
+MPIDI_Token_cntr_t  *MPIDI_Token_cntr;
+#endif
 
 #ifdef OUT_OF_ORDER_HANDLING
 
@@ -294,6 +304,9 @@ MPIDI_Recvq_FDP(size_t source, pami_task_t pami_source, int tag, int context_id,
         rreq->mpid.idx=idx;
         rreq->mpid.partner_id=pami_source;
 #endif
+#ifdef OUT_OF_ORDER_HANDLING
+        MPIDI_Request_setPeerRank_pami(rreq, pami_source);
+#endif
         MPIDI_Recvq_remove(MPIDI_Recvq.posted, rreq, prev_rreq);
 #ifdef USE_STATISTICS
         MPIDI_Statistics_time(MPIDI_Statistics.recvq.unexpected_search, search_length);
diff --git a/src/mpid/pamid/src/mpid_request.h b/src/mpid/pamid/src/mpid_request.h
index c663fcb..ba17b78 100644
--- a/src/mpid/pamid/src/mpid_request.h
+++ b/src/mpid/pamid/src/mpid_request.h
@@ -38,7 +38,10 @@
 
 
 extern MPIU_Object_alloc_t MPID_Request_mem;
-
+#if TOKEN_FLOW_CONTROL
+extern void MPIDI_mm_free(void *,size_t);
+#endif
+typedef enum {mpiuMalloc=1,mpidiBufMM} MPIDI_mallocType;
 
 void    MPIDI_Request_uncomplete(MPID_Request *req);
 #if (MPIU_HANDLE_ALLOCATION_METHOD == MPIU_HANDLE_ALLOCATION_THREAD_LOCAL) && defined(__BGQ__)
@@ -263,7 +266,16 @@ MPID_Request_release_inline(MPID_Request *req)
     if (req->comm)              MPIR_Comm_release(req->comm, 0);
     if (req->greq_fns)          MPIU_Free(req->greq_fns);
     if (req->mpid.datatype_ptr) MPID_Datatype_release(req->mpid.datatype_ptr);
-    if (req->mpid.uebuf_malloc) MPIU_Free(req->mpid.uebuf);
+    if (req->mpid.uebuf_malloc== mpiuMalloc) {
+        MPIU_Free(req->mpid.uebuf);
+    }
+#if TOKEN_FLOW_CONTROL
+    else if (req->mpid.uebuf_malloc == mpidiBufMM) {
+        MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+        MPIDI_mm_free(req->mpid.uebuf,req->mpid.uebuflen);
+        MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+    }
+#endif
     MPIDI_Request_tls_free(req);
   }
 }
@@ -273,7 +285,16 @@ MPID_Request_release_inline(MPID_Request *req)
 static inline void
 MPID_Request_discard_inline(MPID_Request *req)
 {
-    if (req->mpid.uebuf_malloc) MPIU_Free(req->mpid.uebuf);
+    if (req->mpid.uebuf_malloc == mpiuMalloc) {
+        MPIU_Free(req->mpid.uebuf);
+    }
+#if TOKEN_FLOW_CONTROL
+    else if (req->mpid.uebuf_malloc == mpidiBufMM) {
+        MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+        MPIDI_mm_free(req->mpid.uebuf,req->mpid.uebuflen);
+        MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+    }
+#endif
     MPIDI_Request_tls_free(req);
 }
 
diff --git a/src/mpid/pamid/src/mpidi_bufmm.c b/src/mpid/pamid/src/mpidi_bufmm.c
new file mode 100644
index 0000000..213f434
--- /dev/null
+++ b/src/mpid/pamid/src/mpidi_bufmm.c
@@ -0,0 +1,657 @@
+/* begin_generated_IBM_copyright_prolog                             */
+/*                                                                  */
+/* This is an automatically generated copyright prolog.             */
+/* After initializing,  DO NOT MODIFY OR MOVE                       */
+/*  --------------------------------------------------------------- */
+/* Licensed Materials - Property of IBM                             */
+/* Blue Gene/Q 5765-PER 5765-PRP                                    */
+/*                                                                  */
+/* (C) Copyright IBM Corp. 2011, 2012 All Rights Reserved           */
+/* US Government Users Restricted Rights -                          */
+/* Use, duplication, or disclosure restricted                       */
+/* by GSA ADP Schedule Contract with IBM Corp.                      */
+/*                                                                  */
+/*  --------------------------------------------------------------- */
+/*                                                                  */
+/* end_generated_IBM_copyright_prolog                               */
+/*  (C)Copyright IBM Corp.  2007, 2011  */
+/**
+ * \file src/mpidi_bufmm.c
+ * \brief Memory management for early arrivals
+ */
+
+ /*******************************************************************/
+ /*   DESCRIPTION:                                                  */
+ /*   Dynamic memory manager which allows allocation and            */
+ /*   deallocation for early arrivals sent via eager protocol.      */
+ /*                                                                 */
+ /*   The basic method is for buffers of size between MIN_SIZE      */
+ /*   and MAX_SIZE. The allocation scheme is a modified version     */
+ /*   of Knuth's buddy algorithm. Regardless of what size buffer    */
+ /*   is requested the size is rounded up to the nearest power of 2.*/
+ /*   Note that there is a buddy_header overhead per buffer (8 bytes*/
+ /*   in 32 bit and 16 bytes in 64 bit). So, for example, if a      */
+ /*   256 byte buffer is requested that would require 512 bytes     */
+ /*   of memory. A 248-byte buffer needs 256 bytes of space.        */
+ /*   Only for the maxsize buffers, it is guaranteed that the       */
+ /*   allocation is a power of two, since typically applications    */
+ /*   have such requirements.                                       */
+ /*                                                                 */
+ /*   To speed up the buddy algorithm there are some preallocated   */
+ /*   buffers. There are FLEX_NUM number of buffers from the        */
+ /*   FLEX_COUNT number of smallest buffers. So, for example, if    */
+ /*   MIN_SIZE is 16, FLEX_COUNT is 4, and FLEX_NUM is 256 then     */
+ /*   there are buffers of size 16, 32, 64, and 128 preallocated    */
+ /*   (256 buffers each). These buffers are arranged into stacks.   */
+ /*                                                                 */
+ /*   If the system runs out of preallocated buffers or the size    */
+ /*   is bigger than the biggest preallocated one then the buddy    */
+ /*   algorithm is applied. Originally there is a list of MAX_SIZE  */
+ /*   buffers.  (The size is MAX_SIZE + the 8 or 16 byte overhead.) */
+ /*   These are not merged as the traditional buddy system would    */
+ /*   require, since we never need bigger buffers than these.       */
+ /*   Originally, the lists of smaller size buffers are empty. When */
+ /*   there is an allocation request the program searches for the   */
+ /*   smallest free buffer available in the lists. If it is bigger  */
+ /*   than the requested one then it is repeatedly split into half. */
+ /*   The other halves are inserted into the appropriate list of    */
+ /*   free buffers. At deallocation the program attempts to merge   */
+ /*   the buffer with its buddy repeatedly to get the largest       */
+ /*   buffer possible.                                              */
+ /*******************************************************************/
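The size examples in the description above can be checked directly: a request is charged the per-buffer header overhead, then rounded up to the next power of two. A minimal sketch, assuming the 8-byte (32-bit) header mentioned in the comment:

```c
#include <stddef.h>

/* round n up to the next power of two */
static size_t round_pow2(size_t n)
{
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

/* total space charged for a request: header overhead, then power-of-2
 * rounding. overhead = 8 models the 32-bit buddy_header case above. */
static size_t charged_size(size_t request, size_t overhead)
{
    return round_pow2(request + overhead);
}
```

With an 8-byte header this reproduces the comment's examples: a 256-byte request needs 264 bytes and rounds to 512, while a 248-byte request fits exactly in 256.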
+
+#include <mpidimpl.h>
+
+#define NO  0
+#define YES 1
+int application_set_eager_limit=0;
+
+#if TOKEN_FLOW_CONTROL
+#define BUFFER_MEM_MAX    (1<<26)   /* 64MB */
+#define MAX_BUF_BKT_SIZE  (1<<18)   /* Max eager_limit is 256K              */
+#define MIN_BUF_BKT_SIZE  (64)
+#define MIN_LOG2SIZE        6       /* minimum buffer size log-2            */
+#define MIN_SIZE           64       /* minimum buffer size                  */
+#define MAX_BUCKET         13       /* log of maximum buffer size/MIN_SIZE  */
+#define MAX_SIZE          (MIN_SIZE<<(MAX_BUCKET-1)) /* maximum buffer size */
+#define FLEX_COUNT          5       /* num of buf types to preallocate      */
+#define FLEX_NUM           32       /* num of preallocated buffers          */
+#define MAX_BUDDIES        50       /* absolutely maximum number of buddies */
+#define BUDDY               1
+#define FLEX                0
+#define MAX_MALLOCS        10
+
+/* OVERHEAD: per-buffer header cost (buddy_header minus its two list ptrs) */
+#define OVERHEAD        (sizeof(buddy_header) - 2 * sizeof (void *))
+#define TAB_SIZE        (MAX_BUCKET+1)
+#define ALIGN8(x)       ((((unsigned long)(x)) + 0x7) & ~0x7L)
+#ifndef max
+#define max(a,b)      ((a) > (b) ? (a) : (b))
+#endif
+#ifndef min
+#define min(a,b)      ((a) < (b) ? (a) : (b))
+#endif
+
+/* normalize the size to number of MIN_SIZE blocks required to hold  */
+/* required size. This includes the OVERHEAD                         */
+#define NORMSIZE(sz) (((sz)+MIN_SIZE+OVERHEAD-1) >> MIN_LOG2SIZE)
+
+typedef struct bhdr_struct{
+   char                 buddy;           /* buddy or flex alloc     */
+   char                 free;            /* available or not        */
+   char                 bucket;          /* log of buffer size      */
+   char                 *base_buddy;     /* addr of max size buddy  */
+   struct bhdr_struct   *next;           /* list ptr, used if free  */
+   struct bhdr_struct   *prev;           /* list bptr, used if free */
+} buddy_header;
+
+typedef struct fhdr_struct{
+   char                 buddy;           /* buddy or flex alloc     */
+   char                 ind;             /* which stack             */
+                                         /* bucket - min_bucket     */
+} flex_header;
+
+typedef struct{
+  void  *ptr;                             /* malloc ptr need to be freed  */
+  int   size;                             /* space allocated              */
+  int   type;                             /* FLEX or BUDDY                */
+} malloc_list_t;
+malloc_list_t  *malloc_list;
+
+static int    nMallocs;                   /* no. of malloc() issued       */
+static int    maxMallocs;                 /* max. no. of malloc() allowed */
+static int    max_bucket;                 /* max. number of "buddy" bucket*/
+static int    flex_count;                 /* pre-allocated bucket         */
+static size_t max_size;                   /* max. size for each msg       */
+static size_t flex_size;                  /* size for flex slot           */
+static char  *buddy_heap_ptr;             /* ptr points to beg. of buddy  */
+static char  *end_heap_ptr;               /* ptr points to end of buddy   */
+static char  *heap;                       /* begin. address of flex stack */
+static uint mem_inuse;                    /* memory in use                */
+static uint mem_hwmark;                   /* highest memory usage         */
+
+
+static int  sizetable[TAB_SIZE + 1];      /* from bucket to size            */
+static int  sizetrans[(1<<(MAX_BUCKET-1))+2];/* from size to power of 2 size*/
+
+static char** flex_stack[TAB_SIZE];       /* for flex size alloc            */
+static int    flex_sp   [TAB_SIZE];       /* flex stack pointer             */
+
+static buddy_header *free_buddy[TAB_SIZE];/* for buddy alloc                */
+
+int MPIDI_tfctrl_enabled=0;                     /* token flow control enabled     */
+int MPIDI_tfctrl_hwmark=0;                      /* high water mark for tfc        */
+int application_set_buf_mem=0;            /* MP_BUFFER_MEM set by the user? */
+char *EagerLimit=NULL;                    /* export MP_EAGER_LIMIT if the   */
+                                          /* number is adjusted             */
+
+/***************************************************************************/
+/*  calculate number of tokens available for each pair-wise communication. */
+/***************************************************************************/
+
+void MPIDI_calc_tokens(int nTasks,uint *eager_limit_in, unsigned long *buf_mem_in )
+{
+ char *cp;
+ unsigned long new_buf_mem_max,buf_mem_max;
+ int  val;
+ int  rc;
+ int  i;
+
+       /* Round up passed eager limit to power of 2 */
+    buf_mem_max= *buf_mem_in;
+
+    if (*eager_limit_in != 0) {
+       for (val=1 ; val < *eager_limit_in ; val *= 2);
+       if (val > MAX_BUF_BKT_SIZE) {   /* Maximum eager_limit is 256K */
+           val = MAX_BUF_BKT_SIZE;
+       }
+       if (val < MIN_BUF_BKT_SIZE) {   /* Minimum eager_limit is 64   */
+           val = MIN_BUF_BKT_SIZE;
+       }
+       MPIDI_tfctrl_enabled = buf_mem_max / ((long)nTasks * val);
+
+       /* We need to have a minimum of 2 tokens.  If number of tokens is
+        * less than 2, re-calculate by reducing the eager-limit.
+        * If the number of tokens is still less than 2 then suggest a
+        * new minimum buf_mem.
+        */
+       if (MPIDI_tfctrl_enabled < 2) {
+          for ( ; val >= MIN_BUF_BKT_SIZE; val /= 2) {
+             MPIDI_tfctrl_enabled = (buf_mem_max) / ((long)nTasks * val);
+             if (MPIDI_tfctrl_enabled >= 2) {
+                break;
+             }
+          }
+          /* If the number of flow control tokens is still less than two   */
+          /* even with the eager_limit being reduced to 64, calculate a    */
+          /* new buf_mem value for 2 tokens and eager_limit = 64.       */
+          /* This will only happen if tasks>4K and original buf_mem=1M. */
+          if (MPIDI_tfctrl_enabled < 2) {
+             /* Sometimes we are off by 1 - due to integer arithmetic. */
+             new_buf_mem_max = (2 * nTasks * MIN_BUF_BKT_SIZE);
+             if ( new_buf_mem_max <= BUFFER_MEM_MAX ) {
+                 MPIDI_tfctrl_enabled = 2;
+                 /* Reset val to the minimum (64) because the for loop    */
+                 /* above would have changed it to 32.                    */
+                 val = MIN_BUF_BKT_SIZE;
+                 buf_mem_max = new_buf_mem_max;
+                 if ( application_set_buf_mem ) {
+                     printf("informational message \n"); fflush(stdout);
+                 }
+             }
+             else {
+                 /* Still not enough ...... Turn off eager send protocol */
+                 MPIDI_tfctrl_enabled = 0;
+                 val = 0;
+             }
+          }
+       }
+       MPIDI_tfctrl_hwmark  = (MPIDI_tfctrl_enabled+1) / 2;  /* high water mark         */
+       /* Eager_limit may have been changed -- either rounded up or reduced */
+       if ( *eager_limit_in != val ) {
+          if ( application_set_eager_limit && (*eager_limit_in > val)) {
+             /* Only give warning on reduce. */
+             printf("warning message if eager limit is reduced \n"); fflush(stdout);
+          }
+          *eager_limit_in = val;
+
+          /* putenv MP_EAGER_LIMIT if it is changed                          */
+          /* MP_EAGER_LIMIT always has a value.                              */
+          /* need to check whether MP_EAGER_LIMIT has been exported         */
+          EagerLimit = (char*)MPIU_Malloc(32 * sizeof(char));
+          sprintf(EagerLimit, "MP_EAGER_LIMIT=%d",val);
+          rc = putenv(EagerLimit);
+          if (rc !=0) {
+              printf("PUTENV with Eager Limit failed \n"); fflush(stdout);
+          }
+       }
+      }
+    else {
+       /* Eager_limit = 0, all messages will be sent using rendezvous protocol */
+       MPIDI_tfctrl_enabled = 0;
+       MPIDI_tfctrl_hwmark = 0;
+    }
+    /* user may want to set MP_EAGER_LIMIT to 0 or less than 256 */
+    if (*eager_limit_in < MPIDI_Process.pt2pt.limits.application.immediate.remote)
+        MPIDI_Process.pt2pt.limits.application.immediate.remote= *eager_limit_in;
+
+}
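The calculation in `MPIDI_calc_tokens` above can be summarized: round the eager limit up to a power of two, clamp it to [64, 256K], compute the per-pair token count as `buf_mem / (nTasks * limit)`, and halve the limit (down to 64) until at least 2 tokens fit. A condensed, hedged sketch of that core path (it omits the buf_mem-growing and protocol-disable fallbacks the real function also handles):

```c
/* Condensed model of the token sizing in MPIDI_calc_tokens; the
 * names and the omitted fallback paths make this a sketch, not
 * the real MPICH routine. Callers handle a zero limit separately. */
static unsigned round_up_pow2(unsigned v)
{
    unsigned p = 1;
    while (p < v) p <<= 1;
    return p;
}

static int calc_tokens(int ntasks, unsigned *limit, unsigned long buf_mem)
{
    unsigned val = round_up_pow2(*limit);
    if (val > (1u << 18)) val = 1u << 18;   /* MAX_BUF_BKT_SIZE: 256K */
    if (val < 64)         val = 64;         /* MIN_BUF_BKT_SIZE: 64   */

    int tokens = (int)(buf_mem / ((unsigned long)ntasks * val));
    while (tokens < 2 && val > 64) {        /* shrink limit until >=2 fit */
        val /= 2;
        tokens = (int)(buf_mem / ((unsigned long)ntasks * val));
    }
    *limit = val;
    return tokens;
}
```

For example, 4 tasks with a 4K eager limit and the 64MB buffer cap yields 64MB / (4 * 4K) = 4096 tokens per pair; a 100-byte limit first rounds up to 128.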
+
+
+
+/***************************************************************************/
+/*  Initialize flex stack and stack pointers for each of FLEX_NUM slots    */
+/***************************************************************************/
+
+static char *MPIDI_init_flex(char *hp)
+{
+   int i,j,fcount;
+   char** area;
+   char *temp;
+   int  size;
+   int  kk;
+
+   fcount = flex_count;
+   if ((fcount = flex_count) == 0) {
+       flex_size = 0;
+       return hp;
+   }
+#  ifdef DUMP_MM
+   printf("fcount=%d sizetable[fcount+1]=%d sizetable[1]=%d FLEX_NUM=%d overhead=%d\n",
+         fcount,sizetable[fcount+1],sizetable[1],FLEX_NUM,OVERHEAD); fflush(stdout);
+#  endif
+   flex_size = (sizetable[fcount+1] - sizetable[1]) *FLEX_NUM +
+       OVERHEAD * fcount * FLEX_NUM;
+   kk=fcount *FLEX_NUM *sizeof(char *);
+   size = ALIGN8(kk);
+   area = (char **) MPIU_Malloc(size);
+   malloc_list[nMallocs].ptr=(void *) area;
+   malloc_list[nMallocs].size=size;
+   malloc_list[nMallocs].type=FLEX;
+   nMallocs++;
+
+
+   flex_stack[0] = NULL;
+   for(i =1; i <=fcount;  hp +=FLEX_NUM *(OVERHEAD +sizetable[i]), i++) {
+#  ifdef DEBUG
+       flex_heap[i] =hp;
+#  endif
+       flex_stack[i] =area;
+       area += FLEX_NUM;
+       flex_sp[i] =0;
+#  ifdef DUMP_MM
+       printf("MPIDI_init_flex() i=%d FLEX_NUM=%d fcount=%d i=%d flex_stack[i]=%p OVERHEAD=%d\n",
+               i,FLEX_NUM,fcount,i,flex_stack[i],OVERHEAD); fflush(stdout);
+#  endif
+       for(j =0; j <FLEX_NUM; j++){
+           flex_stack[i][j] =hp +j *(OVERHEAD +sizetable[i]);
+#  ifdef DUMP_MM
+            printf("j=%d hp=%p advance =%d final=%x\n", j,(int *)hp,j *(OVERHEAD +sizetable[i]),
+                   (int *) (hp +j *(OVERHEAD +sizetable[i])));
+            fflush(stdout);
+#  endif
+           ((flex_header *)flex_stack[i][j])->buddy  =FLEX;
+           ((flex_header *)flex_stack[i][j])->ind    =i;
+#  ifdef DUMP_MM
+           printf("j=%d  flex_stack[%02d][%02d] = %p advance=%d flag buddy=0x%x ind=0x%x\n",
+                   j,i,j,((int *)(flex_stack[i][j])),(sizetable[i]+OVERHEAD),
+                   ((flex_header *)flex_stack[i][j])->buddy,
+                   ((flex_header *)flex_stack[i][j])->ind); fflush(stdout);
+#endif
+       }
+   }
+   return hp;
+}
+
+/***************************************************************************/
+/*  Initialize buddy heap for each of MAX_BUDDIES  slots                   */
+/***************************************************************************/
+
+
+static void MPIDI_alloc_buddies(int nums, int *space)
+{
+    int i;
+    uint size;
+    char *buddy,*prev;
+
+    size = nums * (max_size +OVERHEAD);
+    buddy = buddy_heap_ptr;
+    if ((buddy_heap_ptr + size) > end_heap_ptr) {
+      /* preallocated space is exhausted, the caller needs to make */
+      /* a malloc() call for storing the message                   */
+        *space=NO;
+        return;
+    }
+    buddy_heap_ptr += size;
+    free_buddy[max_bucket] = (buddy_header *)buddy;
+    for(i =0, prev =NULL; i <nums; i++){
+        ((buddy_header *)buddy)->buddy      =BUDDY;
+        ((buddy_header *)buddy)->free       =1;
+        ((buddy_header *)buddy)->bucket    =max_bucket;
+        ((buddy_header *)buddy)->base_buddy =buddy;
+        ((buddy_header *)buddy)->prev       =(buddy_header *)prev;
+        prev =buddy;
+#      ifdef DUMP_MM
+       printf("ALLOC_BUDDY i=%2d buddy=%d free=%d bucket=%d base_buddy=%p prev=%p next=%p max_size=%d \n",
+              i,(int)((buddy_header *)buddy)->buddy,(int)((buddy_header *)buddy)->free,
+              (int)((buddy_header *)buddy)->bucket,(int *) ((buddy_header *)buddy)->base_buddy,
+              (int *)((buddy_header *)buddy)->prev,(int *)(buddy + max_size +OVERHEAD),max_size);
+              fflush(stdout);
+#      endif
+        buddy +=max_size +OVERHEAD;
+        ((buddy_header *)prev)->next        =(buddy_header *)buddy;
+    }
+    ((buddy_header *)prev)->next =NULL;
+}
+
+/***************************************************************************/
+/*  Initialize each of buddy slot                                          */
+/***************************************************************************/
+
+static void MPIDI_init_buddy(unsigned long buf_mem)
+{
+    int i;
+    int buddy_num;
+    size_t size;
+    int space=YES;
+
+#   ifdef DEBUG
+    buddy_heap =buddy_heap_ptr;
+#   endif
+    for(i =0; i <= max_bucket; i++)
+        free_buddy[i] =NULL;
+
+    /* figure out how many buddies we want to preallocate
+     * size = BUFFER_MEM_SIZE >> 2;
+     * size -= flex_size;
+     */
+    size = buf_mem;
+    size = size / (max_size + OVERHEAD);
+    size = (size == 0) ? 1 : (size > MAX_BUDDIES) ? MAX_BUDDIES : size;
+    MPIDI_alloc_buddies(size,&space);
+    if ( space == NO ) {
+        printf("ERROR  line=%d\n",__LINE__); fflush(stdout);
+    }
+/*    printf("MPI-MM flex=%ld  #buddy=%ld\n",flex_size,size); */
+}
+
+
+/***************************************************************************/
+/* initialize memory buffer for eager messages                             */
+/***************************************************************************/
+
+int MPIDI_mm_init(int nTasks,uint *eager_limit_in,unsigned long *buf_mem_in)
+{
+    int    i, bucket;
+    size_t size;
+    unsigned int    eager_limit;
+    unsigned long buf_mem;
+    unsigned long  buf_mem_max;
+    int	   need_allocation = 1;
+
+    MPIDI_calc_tokens(nTasks,eager_limit_in, buf_mem_in);
+    buf_mem = *buf_mem_in;
+    eager_limit = *eager_limit_in;
+#   ifdef DEBUG
+    printf("Eager Limit=%d  buf_mem=%ld tokens=%d hwmark=%d\n",
+            eager_limit,buf_mem, MPIDI_tfctrl_enabled,MPIDI_tfctrl_hwmark);
+    fflush(stdout);
+#   endif
+    if (eager_limit == 0) return 0;  /* no EA buffer is needed */
+    maxMallocs=MAX_MALLOCS;
+    malloc_list=(malloc_list_t *) MPIU_Malloc(maxMallocs * sizeof(malloc_list_t));
+    if (malloc_list == NULL) return errno;
+    nMallocs=0;
+    mem_inuse=0;
+    mem_hwmark=0;
+
+
+    for (max_bucket=0,size=1 ; size < eager_limit ; max_bucket++,size *= 2);
+    max_size = 1 << max_bucket;
+    max_bucket -= (MIN_LOG2SIZE - 1);
+    flex_count = min(FLEX_COUNT,max_bucket);
+    for(i =1, size =MIN_SIZE; i <=MAX_BUCKET+1; i++, size =size << 1)
+        sizetable[i] =size;
+    sizetable[0] = 0;
+
+    for (bucket=1, size = 1, i=1 ; bucket <= max_bucket; ) {
+        sizetrans[i++] = bucket;
+        if (i > size) {
+            size *= 2;
+            bucket++;
+        }
+    }
+    sizetrans[i] = sizetrans[i-1];
+    /* 65536 is for flex stack which is not part of buf_mem_size */
+    heap = MPIU_Malloc(buf_mem + 65536);
+    if (heap == NULL) return errno;
+    malloc_list[nMallocs].ptr=(void *) heap;
+    malloc_list[nMallocs].size=buf_mem + 65536;
+    malloc_list[nMallocs].type=BUDDY;
+    buddy_heap_ptr = heap;
+    end_heap_ptr   = heap + buf_mem + 65536;
+#   ifdef DUMP_MM
+    printf("OVERHEAD=%d MAX_BUCKET=%d  TAB_SIZE=%d buddy_header size=%d mem_size=%ld\n",
+            OVERHEAD,MAX_BUCKET,TAB_SIZE,sizeof(buddy_header),buf_mem); fflush(stdout);
+#   endif
+#   ifdef DEBUG
+    printf("nMallocs=%d ptr=%p  size=%d  type=%d bPtr=%p ePtr=%p\n",
+    nMallocs,(void *)malloc_list[nMallocs].ptr,malloc_list[nMallocs].size,
+    malloc_list[nMallocs].type,buddy_heap_ptr,end_heap_ptr);
+    fflush(stdout);
+#   endif
+    nMallocs++;
+
+    buddy_heap_ptr = MPIDI_init_flex(heap);
+    MPIDI_init_buddy(buf_mem);
+#   ifdef MPIMM_DEBUG
+    if (mpimm_std_debug) {
+        printf("DevMemMgr uses\n");
+        printf("\tmem-size=%d, maxbufsize=%ld\n",mem_size,max_size);
+        printf("\tfcount=%d fsize=%ld, max_bucket=%d\n",
+               flex_count,flex_size,max_bucket);
+        fflush(stdout);
+    }
+#   endif
+    return 0;
+}
+
+
+void MPIDI_close_mm()
+{
+    int i;
+
+    if (nMallocs != 0) {
+      for (i=0; i< nMallocs; i++) {
+        MPIU_Free((void *) (malloc_list[i].ptr));
+      }
+     MPIU_Free(malloc_list);
+    }
+}
+
+/****************************************************************************/
+/* macros for MPIDI_mm_alloc() and MPIDI_mm_free()                          */
+/****************************************************************************/
+
+#define MPIDI_flex_alloc(bucket)   ((flex_sp[bucket] >=FLEX_NUM) ?          \
+                                 NULL : (char *)(flex_stack[bucket]         \
+                                          [flex_sp[bucket]++]) +OVERHEAD)
+
+#define MPIDI_flex_free(ptr)                                                \
+                                 int n;                                     \
+                                 ptr =(char *)ptr -OVERHEAD;                \
+                                 n =((flex_header *)ptr)->ind;              \
+                                 flex_stack[n][--flex_sp[n]] =(char *)ptr;
+
+
+#define MPIDI_remove_head(ind)                                             \
+                                 {                                         \
+                                    if((free_buddy[ind] =                  \
+                                          free_buddy[ind]->next) !=NULL)   \
+                                       free_buddy[ind]->prev =NULL;        \
+                                 }
+
+#define MPIDI_remove(bud)                                                  \
+                                 {                                         \
+                                    if(bud->prev !=NULL)                   \
+                                       bud->prev->next =bud->next;         \
+                                    else                                   \
+                                       free_buddy[bud->bucket] =bud->next; \
+                                    if(bud->next !=NULL)                   \
+                                       bud->next->prev =bud->prev;         \
+                                 }
+
+#define MPIDI_add_head(bud,ind)                                            \
+                                 {                                         \
+                                    if((bud->next =free_buddy[ind]) !=NULL)\
+                                       free_buddy[ind]->prev =bud;         \
+                                    free_buddy[ind] =bud;                  \
+                                    bud->prev =NULL;                       \
+                                 }
+
+#define MPIDI_fill_header(bud,ind,base)                                    \
+                                 {                                         \
+                                    bud->buddy        =BUDDY;              \
+                                    bud->free         =1;                  \
+                                    bud->bucket       =ind;                \
+                                    bud->base_buddy   =base;               \
+                                 }
+
+#define MPIDI_pair(bud,size) (((char*)(bud) - (bud)->base_buddy) & (size) ? \
+                        (buddy_header *)((char*)(bud) - (size)) :           \
+                        (buddy_header *)((char*)(bud) + (size)))
+
+static buddy_header *MPIDI_split_buddy(int big,int bucket)
+{
+   buddy_header *bud,*buddy;
+   char *base;
+   int i;
+
+   bud =free_buddy[big];
+   MPIDI_remove_head(big);
+   base =bud->base_buddy;
+   for(i =big -1; i >=bucket; i--){
+      buddy =(buddy_header *)((char*)bud +sizetable[i]);
+      MPIDI_fill_header(buddy,i,base);
+      MPIDI_add_head(buddy,i);
+   }
+   bud->bucket =bucket;
+   bud->free =0;
+   return bud;
+}
+
+static void *MPIDI_buddy_alloc(int bucket)
+{
+   int i;
+   buddy_header *bud;
+   int space=YES;
+
+#  ifdef TRACE
+   printf("(buddy) ");
+#  endif
+   if((bud =free_buddy[i =bucket]) ==NULL){
+       i++;
+       do {
+           for(; i <=max_bucket; i++)
+               if(free_buddy[i] !=NULL){
+                   bud =MPIDI_split_buddy(i,bucket);
+                   return (char *)bud +OVERHEAD;
+               }
+           MPIDI_alloc_buddies(1,&space);
+           if (space == NO)
+              return NULL;
+           i = max_bucket;
+       } while (1);
+   }
+   else{
+      MPIDI_remove_head(bucket);
+      bud->free =0;
+      return (char *)bud +OVERHEAD;
+   }
+}
+
+static buddy_header *MPIDI_merge_buddy(buddy_header *bud)
+{
+   buddy_header *buddy;
+   int size;
+
+   while( (bud->bucket <max_bucket)      &&
+          (size = sizetable[bud->bucket]) &&
+          ((buddy =MPIDI_pair(bud,size))->free)  &&
+          (buddy->bucket ==bud->bucket) )
+       {
+           MPIDI_remove(buddy);
+           bud =min(bud,buddy);
+           bud->bucket++;
+       }
+   return bud;
+}
+
+static void MPIDI_buddy_free(void *ptr)
+{
+   buddy_header *bud;
+
+#  ifdef TRACE
+   printf("(buddy) ");
+#  endif
+   bud =(buddy_header *)((char *)ptr -OVERHEAD);
+   if(bud->bucket <max_bucket)
+      bud =MPIDI_merge_buddy(bud);
+   bud->free =1;
+   MPIDI_add_head(bud,bud->bucket);
+}
+
+void *MPIDI_mm_alloc(size_t size)
+{
+   void *pt;
+   int bucket,tmp;
+   int  nTimes=0;
+
+   MPID_assert(size <= max_size);
+   tmp = NORMSIZE(size);
+   tmp =bucket =sizetrans[tmp];
+   if(bucket >flex_count || (pt =MPIDI_flex_alloc(tmp)) ==NULL) {
+      pt =MPIDI_buddy_alloc(bucket);
+      nTimes++;
+   }
+   if (pt == NULL) {
+       pt=MPIU_Malloc(size);
+       if (MPIDI_Process.statistics) {
+           mem_inuse = mem_inuse + sizetable[tmp];
+          if (mem_inuse > mem_hwmark)
+             mem_hwmark = mem_inuse;
+       }
+       if (pt == NULL) {
+           printf("ERROR  line=%d\n",__LINE__); fflush(stdout);
+       }
+   }
+#  ifdef TRACE
+   printf("MPIDI_mm_alloc(%4zu): %p\n",size,pt);
+   nAllocs++;
+#  endif
+   return pt;
+}
+
+void MPIDI_mm_free(void *ptr, size_t size)
+{
+   int tmp,bucket;
+
+   if (size > MAX_SIZE) {
+      printf("ERROR  line=%d\n",__LINE__); fflush(stdout);
+      exit(1);
+   }
+   if ((ptr >= (void *) heap) && (ptr < (void *)end_heap_ptr)) {
+     if(*((char *)ptr -OVERHEAD) ==FLEX){
+        MPIDI_flex_free(ptr);
+     }
+     else
+        MPIDI_buddy_free(ptr);
+   } else {
+       printf("ERROR free %s(%d)\n",__FILE__,__LINE__); fflush(stdout);
+   }
+#  ifdef TRACE
+   nFrees++;
+   printf("MPIDI_mm_free:     %p \n",ptr);
+#  endif
+}
+#endif   /* #if TOKEN_FLOW_CONTROL   */
diff --git a/src/mpid/pamid/src/mpidi_env.c b/src/mpid/pamid/src/mpidi_env.c
index fb540c3..1eab34a 100644
--- a/src/mpid/pamid/src/mpidi_env.c
+++ b/src/mpid/pamid/src/mpidi_env.c
@@ -337,6 +337,13 @@ extern MPIDI_printenv_t  *mpich_env;
 #endif
 
 #define ENV_Deprecated(a, b, c, d, e) ENV_Deprecated__(a, b, c, d, e)
+
+#ifdef TOKEN_FLOW_CONTROL
+ extern int MPIDI_get_buf_mem(unsigned long *);
+ extern int MPIDI_atoi(char* , int* );
+#endif
+ extern int application_set_eager_limit;
+
 static inline void
 ENV_Deprecated__(char* name[], unsigned num_supported, unsigned* deprecated, int rank, int NA)
 {
@@ -694,6 +701,16 @@ MPIDI_Env_setup(int rank, int requested)
     ENV_Unsigned(names, &MPIDI_Process.pt2pt.limits.application.eager.remote, 3, &found_deprecated_env_var, rank);
     ENV_Unsigned(names, &MPIDI_Process.pt2pt.limits.internal.eager.remote, 3, NULL, rank);
   }
+#if TOKEN_FLOW_CONTROL
+  /* Determine if users set eager limit  */
+  {
+    MPIDI_set_eager_limit(&MPIDI_Process.pt2pt.limits.application.eager.remote);
+  }
+  /* Determine buffer memory for early arrivals */
+  {
+    MPIDI_get_buf_mem(&MPIDI_Process.mp_buf_mem);
+  }
+#endif
 
   /*
    * Determine 'local' eager limit
@@ -981,6 +998,14 @@ MPIDI_Env_setup(int rank, int requested)
     }
 #endif
   }
+    {
+#if TOKEN_FLOW_CONTROL
+      char* names[] = {"MP_USE_TOKEN_FLOW_CONTROL", NULL};
+      ENV_Char(names, &MPIDI_Process.is_token_flow_control_on);
+      if (!MPIDI_Process.is_token_flow_control_on)
+           MPIDI_Process.mp_buf_mem=0;
+#endif
+    }
   /* Exit if any deprecated environment variables were specified. */
   if (found_deprecated_env_var)
     {
@@ -994,3 +1019,61 @@ MPIDI_Env_setup(int rank, int requested)
       }
     }
 }
+
+
+int  MPIDI_set_eager_limit(unsigned int *eager_limit)
+{
+     char *cp;
+     int  val;
+     cp = getenv("MP_EAGER_LIMIT");
+     if (cp)
+       {
+         application_set_eager_limit=1;
+         if ( MPIDI_atoi(cp, &val) == 0 )
+           *eager_limit=val;
+       }
+     return 0;
+}
+
+#if TOKEN_FLOW_CONTROL
+   /*****************************************************************/
+   /* Check for MP_BUFFER_MEM; if the user did not set a value,     */
+   /* fall back to the default of 64 MB.                            */
+   /*****************************************************************/
+int  MPIDI_get_buf_mem(unsigned long *buf_mem) {
+     char *cp;
+     int  i;
+     int args_in_error=0;
+     char pre_alloc_buf[25], buf_max[25];
+     char *buf_max_cp;
+     int pre_alloc_val = 0;
+     unsigned long buf_max_val = 0;
+     int  has_error = 0;
+
+     if ((cp = getenv("MP_BUFFER_MEM")) != NULL) {
+         pre_alloc_buf[24] = '\0';
+         buf_max[24] = '\0';
+         if ( (buf_max_cp = strchr(cp, ',')) ) {
+              printf("No max buffer mem support in MPICH2 \n"); fflush(stdout);
+         } else {
+            /* Old single value format  */
+            if ( MPIDI_atoi(cp, &pre_alloc_val) == 0 )
+               buf_max_val = (unsigned long)pre_alloc_val;
+            else
+               has_error = 1;
+         }
+         if ( has_error == 0) {
+             *buf_mem     = (unsigned long) pre_alloc_val;
+             if (buf_max_val > ONE_SHARED_SEGMENT)
+                 *buf_mem = ONE_SHARED_SEGMENT;
+         } else {
+            args_in_error += 1;
+            printf("ERROR in MP_BUFFER_MEM %s(%d)\n",__FILE__,__LINE__); fflush(stdout);
+         }
+     } else {
+         /* MP_BUFFER_MEM was not specified by the user */
+         *buf_mem     = BUFFER_MEM_DEFAULT;
+     }
+  return 0;
+}
+#endif
diff --git a/src/mpid/pamid/src/mpidi_util.c b/src/mpid/pamid/src/mpidi_util.c
index c729d58..fb116bf 100644
--- a/src/mpid/pamid/src/mpidi_util.c
+++ b/src/mpid/pamid/src/mpidi_util.c
@@ -52,6 +52,7 @@ void MPIDI_Set_mpich_env(int rank, int size) {
      mpich_env->this_task = rank;
      mpich_env->nprocs  = size;
      mpich_env->eager_limit=MPIDI_Process.pt2pt.limits.application.eager.remote;
+     mpich_env->use_token_flow_control=MPIDI_Process.is_token_flow_control_on;
      mpich_env->mp_statistics=MPIDI_Process.mp_statistics;
      if (mpich_env->polling_interval == 0) {
             mpich_env->polling_interval = 400000;
@@ -595,6 +596,7 @@ int MPIDI_Print_mpenv(int rank,int size)
         sender.mp_statistics = mpich_env->mp_statistics;
         sender.polling_interval = mpich_env->polling_interval;
         sender.eager_limit = mpich_env->eager_limit;
+        sender.use_token_flow_control=MPIDI_Process.is_token_flow_control_on;
         sender.retransmit_interval = mpich_env->retransmit_interval;
 
         /* Get shared memory  */
@@ -737,6 +739,7 @@ int MPIDI_Print_mpenv(int rank,int size)
                 MATCHB(interrupts,"Adapter Interrupts Enabled (MP_CSS_INTERRUPT):");
                 MATCHI(polling_interval,"Polling Interval (MP_POLLING_INTERVAL/usec):");
                 MATCHI(eager_limit,"Message Eager Limit (MP_EAGER_LIMIT/Bytes):");
+                MATCHI(use_token_flow_control,"Use token flow control:");
                 MATCHC(wait_mode,"Message Wait Mode(MP_WAIT_MODE):",8);
                 MATCHI(retransmit_interval,"Retransmit Interval (MP_RETRANSMIT_INTERVAL/count):");
                 MATCHB(use_shmem,"Shared Memory Enabled (MP_SHARED_MEMORY):");
diff --git a/src/mpid/pamid/src/pt2pt/mpidi_callback_eager.c b/src/mpid/pamid/src/pt2pt/mpidi_callback_eager.c
index 1438e77..eef73b8 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_callback_eager.c
+++ b/src/mpid/pamid/src/pt2pt/mpidi_callback_eager.c
@@ -44,6 +44,11 @@ MPIDI_RecvCB(pami_context_t    context,
              pami_recv_t     * recv)
 {
   const MPIDI_MsgInfo *msginfo = (const MPIDI_MsgInfo *)_msginfo;
+#if TOKEN_FLOW_CONTROL
+  int          rettoks=0;
+  void         *uebuf;
+  int          source;
+#endif
   if (recv == NULL)
     {
       if (msginfo->isSync)
@@ -81,11 +86,29 @@ MPIDI_RecvCB(pami_context_t    context,
   unsigned context_id = msginfo->MPIctxt;
 
   MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+  if (TOKEN_FLOW_CONTROL_ON)
+    {
+      #if TOKEN_FLOW_CONTROL
+      source=PAMIX_Endpoint_query(sender);
+      MPIDI_Receive_tokens(msginfo,source);
+      #else
+      MPID_assert_always(0);
+      #endif
+    }
 #ifndef OUT_OF_ORDER_HANDLING
   rreq = MPIDI_Recvq_FDP(rank, tag, context_id);
 #else
   rreq = MPIDI_Recvq_FDP(rank, PAMIX_Endpoint_query(sender), tag, context_id, msginfo->MPIseqno);
 #endif
+  if ((TOKEN_FLOW_CONTROL_ON) && (MPIDI_MUST_RETURN_TOKENS(sender)))
+    {
+      #if TOKEN_FLOW_CONTROL
+      rettoks=MPIDI_Token_cntr[sender].rettoks;
+      MPIDI_Token_cntr[sender].rettoks=0;
+      #else
+      MPID_assert_always(0);
+      #endif
+    }
 
   /* Match not found */
   if (unlikely(rreq == NULL))
@@ -97,14 +120,38 @@ MPIDI_RecvCB(pami_context_t    context,
       MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
       MPID_Request *newreq = MPIDI_Request_create2();
       MPID_assert(newreq != NULL);
+      if (TOKEN_FLOW_CONTROL_ON)
+        {
+          #if TOKEN_FLOW_CONTROL
+          MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+          #else
+          MPID_assert_always(0);
+          #endif
+        }
+
       if (sndlen)
       {
         newreq->mpid.uebuflen = sndlen;
-        newreq->mpid.uebuf = MPIU_Malloc(sndlen);
+        if (!(TOKEN_FLOW_CONTROL_ON))
+          {
+            newreq->mpid.uebuf = MPIU_Malloc(sndlen);
+            newreq->mpid.uebuf_malloc = mpiuMalloc ;
+          }
+        else
+          {
+            #if TOKEN_FLOW_CONTROL
+            newreq->mpid.uebuf = MPIDI_mm_alloc(sndlen);
+            newreq->mpid.uebuf_malloc = mpidiBufMM;
+            #else
+            MPID_assert_always(0);
+            #endif
+          }
         MPID_assert(newreq->mpid.uebuf != NULL);
-        newreq->mpid.uebuf_malloc = 1;
       }
-      MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+      if (!TOKEN_FLOW_CONTROL_ON)
+        {
+          MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+        }
 #ifndef OUT_OF_ORDER_HANDLING
       rreq = MPIDI_Recvq_FDP(rank, tag, context_id);
 #else
@@ -115,6 +162,14 @@ MPIDI_RecvCB(pami_context_t    context,
       {
         MPIDI_Callback_process_unexp(newreq, context, msginfo, sndlen, sender, sndbuf, recv, msginfo->isSync);
         int completed = MPID_Request_is_complete(newreq);
+        if (TOKEN_FLOW_CONTROL_ON)
+          {
+            #if TOKEN_FLOW_CONTROL
+            MPIDI_Token_cntr[sender].unmatched++;
+            #else
+            MPID_assert_always(0);
+            #endif
+          }
         MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
         if (completed) MPID_Request_release(newreq);
         goto fn_exit_eager;
@@ -128,8 +183,16 @@ MPIDI_RecvCB(pami_context_t    context,
   else
     {
 #if (MPIDI_STATISTICS)
-        MPID_NSTAT(mpid_statp->earlyArrivalsMatched);
+      MPID_NSTAT(mpid_statp->earlyArrivalsMatched);
 #endif
+      if (TOKEN_FLOW_CONTROL_ON)
+        {
+          #if TOKEN_FLOW_CONTROL
+          MPIDI_Update_rettoks(sender);
+          #else
+          MPID_assert_always(0);
+          #endif
+        }
       MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
     }
 
@@ -219,9 +282,21 @@ MPIDI_RecvCB(pami_context_t    context,
       rreq->mpid.uebuflen = sndlen;
       if (sndlen)
         {
-          rreq->mpid.uebuf    = MPIU_Malloc(sndlen);
+          if (!TOKEN_FLOW_CONTROL_ON)
+            {
+              rreq->mpid.uebuf    = MPIU_Malloc(sndlen);
+              rreq->mpid.uebuf_malloc = mpiuMalloc;
+            }
+          else
+            {
+              #if TOKEN_FLOW_CONTROL
+              MPIDI_Alloc_lock(&rreq->mpid.uebuf,sndlen);
+              rreq->mpid.uebuf_malloc = mpidiBufMM;
+              #else
+              MPID_assert_always(0);
+              #endif
+            }
           MPID_assert(rreq->mpid.uebuf != NULL);
-          rreq->mpid.uebuf_malloc = 1;
         }
       /* -------------------------------------------------- */
       /*  Let PAMI know where to put the rest of the data.  */
@@ -234,6 +309,7 @@ MPIDI_RecvCB(pami_context_t    context,
 #endif
 
  fn_exit_eager:
+ MPIDI_Return_tokens(context, source, rettoks);
   /* ---------------------------------------- */
   /*  Signal that the recv has been started.  */
   /* ---------------------------------------- */
diff --git a/src/mpid/pamid/src/pt2pt/mpidi_callback_rzv.c b/src/mpid/pamid/src/pt2pt/mpidi_callback_rzv.c
index 2afd2f0..296890e 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_callback_rzv.c
+++ b/src/mpid/pamid/src/pt2pt/mpidi_callback_rzv.c
@@ -45,6 +45,10 @@ MPIDI_RecvRzvCB_impl(pami_context_t    context,
 
   MPID_Request * rreq = NULL;
   int found;
+  pami_task_t source;
+#if TOKEN_FLOW_CONTROL
+  int  rettoks=0;
+#endif
 
   /* -------------------- */
   /*  Match the request.  */
@@ -55,10 +59,11 @@ MPIDI_RecvRzvCB_impl(pami_context_t    context,
 
   MPID_Request *newreq = MPIDI_Request_create2();
   MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+  source = PAMIX_Endpoint_query(sender);
+  MPIDI_Receive_tokens(msginfo,source);
 #ifndef OUT_OF_ORDER_HANDLING
   rreq = MPIDI_Recvq_FDP_or_AEU(newreq, rank, tag, context_id, &found);
 #else
-  pami_task_t source = PAMIX_Endpoint_query(sender);
   rreq = MPIDI_Recvq_FDP_or_AEU(newreq, rank, source, tag, context_id, msginfo->MPIseqno, &found);
 #endif
   TRACE_ERR("RZV CB for req=%p remote-mr=0x%llx bytes=%zu (%sfound)\n",
@@ -112,6 +117,15 @@ MPIDI_RecvRzvCB_impl(pami_context_t    context,
       MPIDI_In_cntr[source].R[(rreq->mpid.idx)].rlen=envelope->length;
       MPIDI_In_cntr[source].R[(rreq->mpid.idx)].sync=msginfo->isSync;
 #endif
+     if ((TOKEN_FLOW_CONTROL_ON) && (MPIDI_MUST_RETURN_TOKENS(sender)))
+       {
+         #if TOKEN_FLOW_CONTROL
+         rettoks=MPIDI_Token_cntr[sender].rettoks;
+         MPIDI_Token_cntr[sender].rettoks=0;
+         #else
+         MPID_assert_always(0);
+         #endif
+       }
     }
   /* ----------------------------------------- */
   /* figure out target buffer for request data */
@@ -166,7 +180,9 @@ MPIDI_RecvRzvCB_impl(pami_context_t    context,
 #endif
       MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
     }
-
+#if TOKEN_FLOW_CONTROL
+  MPIDI_Return_tokens(context, source, rettoks);
+#endif
   /* ---------------------------------------- */
   /*  Signal that the recv has been started.  */
   /* ---------------------------------------- */
diff --git a/src/mpid/pamid/src/pt2pt/mpidi_callback_short.c b/src/mpid/pamid/src/pt2pt/mpidi_callback_short.c
index 2b927b2..2b2aed6 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_callback_short.c
+++ b/src/mpid/pamid/src/pt2pt/mpidi_callback_short.c
@@ -53,6 +53,10 @@ MPIDI_RecvShortCB(pami_context_t    context,
 
   const MPIDI_MsgInfo *msginfo = (const MPIDI_MsgInfo *)_msginfo;
   MPID_Request * rreq = NULL;
+  pami_task_t source;
+#if TOKEN_FLOW_CONTROL
+  int          rettoks=0;
+#endif
 
   /* -------------------- */
   /*  Match the request.  */
@@ -62,10 +66,11 @@ MPIDI_RecvShortCB(pami_context_t    context,
   unsigned context_id = msginfo->MPIctxt;
 
   MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+  source = PAMIX_Endpoint_query(sender);
+  MPIDI_Receive_tokens(msginfo,source);
 #ifndef OUT_OF_ORDER_HANDLING
   rreq = MPIDI_Recvq_FDP(rank, tag, context_id);
 #else
-  pami_task_t source = PAMIX_Endpoint_query(sender);
   rreq = MPIDI_Recvq_FDP(rank, source, tag, context_id, msginfo->MPIseqno);
 #endif
 
@@ -73,17 +78,43 @@ MPIDI_RecvShortCB(pami_context_t    context,
   if (unlikely(rreq == NULL))
     {
 #if (MPIDI_STATISTICS)
-        MPID_NSTAT(mpid_statp->earlyArrivals);
+         MPID_NSTAT(mpid_statp->earlyArrivals);
 #endif
+     if (TOKEN_FLOW_CONTROL_ON)
+       {
+         #if TOKEN_FLOW_CONTROL
+         if (MPIDI_MUST_RETURN_TOKENS(source))
+           {
+             rettoks=MPIDI_Token_cntr[source].rettoks;
+             MPIDI_Token_cntr[source].rettoks=0;
+           }
+         #else
+         MPID_assert_always(0);
+         #endif
+     }
       MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
       MPID_Request *newreq = MPIDI_Request_create2();
       MPID_assert(newreq != NULL);
       if (sndlen)
       {
         newreq->mpid.uebuflen = sndlen;
-        newreq->mpid.uebuf = MPIU_Malloc(sndlen);
+        if (!TOKEN_FLOW_CONTROL_ON)
+          {
+            newreq->mpid.uebuf = MPIU_Malloc(sndlen);
+            newreq->mpid.uebuf_malloc = mpiuMalloc;
+          }
+        else
+          {
+            #if TOKEN_FLOW_CONTROL
+            MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+            newreq->mpid.uebuf = MPIDI_mm_alloc(sndlen);
+            newreq->mpid.uebuf_malloc = mpidiBufMM;
+            MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+            #else
+            MPID_assert_always(0);
+            #endif
+          }
         MPID_assert(newreq->mpid.uebuf != NULL);
-        newreq->mpid.uebuf_malloc = 1;
       }
       MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
 #ifndef OUT_OF_ORDER_HANDLING
@@ -96,6 +127,14 @@ MPIDI_RecvShortCB(pami_context_t    context,
       {
         MPIDI_Callback_process_unexp(newreq, context, msginfo, sndlen, sender, sndbuf, NULL, isSync);
         /* request is always complete now */
+        if (TOKEN_FLOW_CONTROL_ON && sndlen)
+          {
+            #if TOKEN_FLOW_CONTROL
+            MPIDI_Token_cntr[source].unmatched++;
+            #else
+            MPID_assert_always(0);
+            #endif
+          }
         MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
         MPID_Request_release(newreq);
         goto fn_exit_short;
@@ -109,9 +148,16 @@ MPIDI_RecvShortCB(pami_context_t    context,
   else
     {
 #if (MPIDI_STATISTICS)
-        MPID_NSTAT(mpid_statp->earlyArrivalsMatched);
+     MPID_NSTAT(mpid_statp->earlyArrivalsMatched);
 #endif
-
+      if (TOKEN_FLOW_CONTROL_ON && sndlen)
+        {
+          #if TOKEN_FLOW_CONTROL
+          MPIDI_Update_rettoks(source);
+          #else
+          MPID_assert_always(0);
+          #endif
+        }
       MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
     }
 
@@ -177,6 +223,7 @@ MPIDI_RecvShortCB(pami_context_t    context,
 #endif
 
  fn_exit_short:
+ MPIDI_Return_tokens(context, source, rettoks);
   /* ---------------------------------------- */
   /*  Signal that the recv has been started.  */
   /* ---------------------------------------- */
diff --git a/src/mpid/pamid/src/pt2pt/mpidi_callback_util.c b/src/mpid/pamid/src/pt2pt/mpidi_callback_util.c
index 5fb0c5a..8a82e64 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_callback_util.c
+++ b/src/mpid/pamid/src/pt2pt/mpidi_callback_util.c
@@ -127,7 +127,7 @@ MPIDI_Callback_process_trunc(pami_context_t  context,
       rreq->mpid.uebuflen = rreq->status.count;
       rreq->mpid.uebuf    = MPIU_Malloc(rreq->status.count);
       MPID_assert(rreq->mpid.uebuf != NULL);
-      rreq->mpid.uebuf_malloc = 1;
+      rreq->mpid.uebuf_malloc = mpiuMalloc;
 
       recv->addr = rreq->mpid.uebuf;
     }
diff --git a/src/mpid/pamid/src/pt2pt/mpidi_control.c b/src/mpid/pamid/src/pt2pt/mpidi_control.c
index 6fbfa2a..dec4384 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_control.c
+++ b/src/mpid/pamid/src/pt2pt/mpidi_control.c
@@ -397,6 +397,13 @@ MPIDI_ControlCB(pami_context_t    context,
     case MPIDI_CONTROL_RENDEZVOUS_ACKNOWLEDGE:
       MPIDI_RzvAck_proc(context, msginfo, senderrank);
       break;
+#if TOKEN_FLOW_CONTROL
+    case MPIDI_CONTROL_RETURN_TOKENS:
+      MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+      MPIDI_Token_cntr[sender].tokens += msginfo->alltokens;
+      MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+      break;
+#endif
     default:
       fprintf(stderr, "Bad msginfo type: 0x%08x  %d\n",
               msginfo->control,
diff --git a/src/mpid/pamid/src/pt2pt/mpidi_done.c b/src/mpid/pamid/src/pt2pt/mpidi_done.c
index e02a544..9864322 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_done.c
+++ b/src/mpid/pamid/src/pt2pt/mpidi_done.c
@@ -105,7 +105,7 @@ MPIDI_RecvDoneCB(pami_context_t   context,
   MPIDI_Request_complete_norelease(rreq);
   /* caller must release rreq, after unlocking MSGQUEUE (if held) */
 #ifdef OUT_OF_ORDER_HANDLING
-  int source;
+  pami_task_t source;
   source = MPIDI_Request_getPeerRank_pami(rreq);
   if (MPIDI_In_cntr[source].n_OutOfOrderMsgs > 0) {
      MPIDI_Recvq_process_out_of_order_msgs(source, context);
@@ -162,6 +162,16 @@ void MPIDI_Recvq_process_out_of_order_msgs(pami_task_t src, pami_context_t conte
 
       if (matched)  {
         /* process a completed message i.e. data is in EA   */
+        if (TOKEN_FLOW_CONTROL_ON) {
+           #if TOKEN_FLOW_CONTROL
+           if ((ooreq->mpid.uebuflen) && (!(ooreq->mpid.envelope.msginfo.isRzv))) {
+               MPIDI_Token_cntr[src].unmatched--;
+               MPIDI_Update_rettoks(src);
+           }
+           #else
+           MPID_assert_always(0);
+           #endif
+         }
         if (MPIDI_Request_getMatchSeq(ooreq) == (in_cntr->nMsgs+ 1))
           in_cntr->nMsgs++;
 
diff --git a/src/mpid/pamid/src/pt2pt/mpidi_recv.h b/src/mpid/pamid/src/pt2pt/mpidi_recv.h
index bfc77ba..5916255 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_recv.h
+++ b/src/mpid/pamid/src/pt2pt/mpidi_recv.h
@@ -31,6 +31,80 @@
 #endif*/
 
 
+#if TOKEN_FLOW_CONTROL
+extern MPIDI_Out_cntr_t *MPIDI_Out_cntr;
+extern int MPIDI_tfctrl_hwmark;
+extern void *MPIDI_mm_alloc(size_t);
+extern void  MPIDI_mm_free(void *, size_t);
+extern int tfctrl_enabled;
+extern char *EagerLimit;
+#define MPIDI_Return_tokens        MPIDI_Return_tokens_inline
+#define MPIDI_Receive_tokens       MPIDI_Receive_tokens_inline
+#define MPIDI_Update_rettoks       MPIDI_Update_rettoks_inline
+#define MPIDI_Alloc_lock           MPIDI_Alloc_lock_inline
+
+#define MPIDI_MUST_RETURN_TOKENS(dd)                                          \
+    (MPIDI_Token_cntr[(dd)].rettoks                                           \
+     && (MPIDI_Token_cntr[(dd)].rettoks + MPIDI_Token_cntr[(dd)].unmatched    \
+         >= MPIDI_tfctrl_hwmark))
+
+static inline void
+MPIDI_Return_tokens_inline(pami_context_t context, int dest, int tokens)
+{
+   MPIDI_MsgInfo  tokenInfo;
+   if (tokens) {
+       memset(&tokenInfo,0, sizeof(MPIDI_MsgInfo));
+       tokenInfo.control=MPIDI_CONTROL_RETURN_TOKENS;
+       tokenInfo.alltokens=tokens;
+       pami_send_immediate_t params = {
+           .dispatch = MPIDI_Protocols_Control,
+           .dest     = dest,
+           .header   = {
+              .iov_base = &tokenInfo,
+              .iov_len  = sizeof(MPIDI_MsgInfo),
+           },
+           .data     = {
+             .iov_base = NULL,
+             .iov_len  = 0,
+           },
+         };
+         pami_result_t rc;
+         rc = PAMI_Send_immediate(context, &params);
+         MPID_assert(rc == PAMI_SUCCESS);
+     }
+}
+
+static inline void
+MPIDI_Receive_tokens_inline(const MPIDI_MsgInfo *m, int dest)
+{
+  if (m->tokens)
+    {
+      MPIDI_Token_cntr[dest].tokens += m->tokens;
+    }
+}
+
+static inline void
+MPIDI_Update_rettoks_inline(int source)
+{
+  MPIDI_Token_cntr[source].rettoks++;
+}
+
+static inline void
+MPIDI_Alloc_lock_inline(void **buf, size_t size)
+{
+  MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+  *buf = MPIDI_mm_alloc(size);
+  MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+}
+
+#else
+#define MPIDI_Return_tokens(x,y,z)
+#define MPIDI_Receive_tokens(x,y)
+#define MPIDI_Update_rettoks(x)
+#define MPIDI_MUST_RETURN_TOKENS(x) (0)
+#define MPIDI_Alloc_lock(x,y)
+#endif
+
 /**
  * \brief ADI level implementation of MPI_(I)Recv()
  *
@@ -127,6 +201,16 @@ MPIDI_Recv(void          * buf,
     {
       MPIDI_RecvMsg_Unexp(rreq, buf, count, datatype);
       mpi_errno = rreq->status.MPI_ERROR;
+      if (TOKEN_FLOW_CONTROL_ON) {
+         #if TOKEN_FLOW_CONTROL
+         if ((rreq->mpid.uebuflen) && (!(rreq->mpid.envelope.msginfo.isRzv))) {
+           MPIDI_Token_cntr[(rreq->mpid.peer_pami)].unmatched--;
+           MPIDI_Update_rettoks(rreq->mpid.peer_pami);
+         }
+         #else
+         MPID_assert_always(0);
+         #endif
+      }
       MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
       MPID_Request_discard(newreq);
     }
@@ -147,11 +231,12 @@ MPIDI_Recv(void          * buf,
   *request = rreq;
   if (status != MPI_STATUS_IGNORE)
     *status = rreq->status;
-  #ifdef MPIDI_STATISTICS
-    if (!(MPID_cc_is_complete(&rreq->cc))) {
+#ifdef MPIDI_STATISTICS
+    if (!(MPID_cc_is_complete(&rreq->cc)))
+    {
         MPID_NSTAT(mpid_statp->recvWaitsComplete);
     }
-  #endif
+#endif
 
   return mpi_errno;
 }
diff --git a/src/mpid/pamid/src/pt2pt/mpidi_rendezvous.c b/src/mpid/pamid/src/pt2pt/mpidi_rendezvous.c
index ca73210..7a89023 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_rendezvous.c
+++ b/src/mpid/pamid/src/pt2pt/mpidi_rendezvous.c
@@ -95,7 +95,7 @@ MPIDI_RendezvousTransfer(pami_context_t   context,
       MPID_assert(rcvbuf != NULL);
       rreq->mpid.uebuf    = rcvbuf;
       rreq->mpid.uebuflen = rcvlen;
-      rreq->mpid.uebuf_malloc = 1;
+      rreq->mpid.uebuf_malloc = mpiuMalloc;
     }
 
   /* ---------------------------------------------------------------- */
diff --git a/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c b/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c
index 939deb6..3e043e4 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c
+++ b/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c
@@ -20,7 +20,24 @@
  * \brief Funnel point for starting all MPI messages
  */
 #include <mpidimpl.h>
+#ifndef min
+#define min(a,b) (((a) < (b)) ? (a) : (b))
+#endif
 
+#if TOKEN_FLOW_CONTROL
+#define MPIDI_Piggy_back_tokens    MPIDI_Piggy_back_tokens_inline
+static inline void
+MPIDI_Piggy_back_tokens_inline(int dest,MPID_Request *shd,size_t len)
+  {
+         int rettoks=0;
+         if (MPIDI_Token_cntr[dest].rettoks)
+         {
+             rettoks=min(MPIDI_Token_cntr[dest].rettoks, TOKENS_BITMASK);
+             MPIDI_Token_cntr[dest].rettoks -= rettoks;
+             shd->mpid.envelope.msginfo.tokens = rettoks;
+          }
+  }
+#endif
 
 static inline void
 MPIDI_SendMsg_short(pami_context_t    context,
@@ -57,12 +74,12 @@ MPIDI_SendMsg_short(pami_context_t    context,
 #endif
   MPID_assert(rc == PAMI_SUCCESS);
 #ifdef MPIDI_TRACE
-  MPIDI_Out_cntr[dest].S[(sreq->mpid.idx)].mode=params.dispatch;
+ MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].mode=params.dispatch;
  if (!isSync) {
-     MPIDI_Out_cntr[dest].S[(sreq->mpid.idx)].NoComp=1;
-     MPIDI_Out_cntr[dest].S[(sreq->mpid.idx)].sendShort=1;
+     MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].NoComp=1;
+     MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].sendShort=1;
  } else
-     MPIDI_Out_cntr[dest].S[(sreq->mpid.idx)].sendEnvelop=1;
+     MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].sendEnvelop=1;
 
 #endif
 
@@ -112,8 +129,8 @@ MPIDI_SendMsg_eager(pami_context_t    context,
   rc = PAMI_Send(context, &params);
   MPID_assert(rc == PAMI_SUCCESS);
 #ifdef MPIDI_TRACE
-  MPIDI_Out_cntr[dest].S[(sreq->mpid.idx)].mode=MPIDI_Protocols_Eager;
-  MPIDI_Out_cntr[dest].S[(sreq->mpid.idx)].sendEager=1;
+  MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].mode=MPIDI_Protocols_Eager;
+  MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].sendEager=1;
 #endif
 }
 
@@ -162,7 +179,7 @@ MPIDI_SendMsg_rzv(pami_context_t    context,
 #else
   sreq->mpid.envelope.memregion_used = 0;
 #ifdef OUT_OF_ORDER_HANDLING
-  if ((!MPIDI_Process.mp_s_use_pami_get) && (!sreq->mpid.shm))
+  if ((!MPIDI_Process.mp_s_use_pami_get) && (!sreq->mpid.envelope.msginfo.noRDMA))
 #else
   if (!MPIDI_Process.mp_s_use_pami_get)
 #endif
@@ -185,7 +202,9 @@ MPIDI_SendMsg_rzv(pami_context_t    context,
 	  sreq->mpid.envelope.memregion_used = 1;
 	}
         sreq->mpid.envelope.data   = sndbuf;
-    } else {
+    }
+    else
+    {
       TRACE_ERR("RZV send (failed registration for sreq=%p addr=%p *addr[0]=%#016llx *addr[1]=%#016llx bytes=%u\n",
 		sreq,sndbuf,
 		*(((unsigned long long*)sndbuf)+0),
@@ -221,12 +240,12 @@ MPIDI_SendMsg_rzv(pami_context_t    context,
   rc = PAMI_Send_immediate(context, &params);
   MPID_assert(rc == PAMI_SUCCESS);
 #ifdef MPIDI_TRACE
-  MPIDI_Out_cntr[dest].S[(sreq->mpid.idx)].bufaddr=sreq->mpid.envelope.data;
-  MPIDI_Out_cntr[dest].S[(sreq->mpid.idx)].mode=MPIDI_Protocols_RVZ;
-  MPIDI_Out_cntr[dest].S[(sreq->mpid.idx)].sendRzv=1;
-  MPIDI_Out_cntr[dest].S[(sreq->mpid.idx)].sendEnvelop=1;
-  MPIDI_Out_cntr[dest].S[(sreq->mpid.idx)].memRegion=sreq->mpid.envelope.memregion_used;
-  MPIDI_Out_cntr[dest].S[(sreq->mpid.idx)].use_pami_get=MPIDI_Process.mp_s_use_pami_get;
+  MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].bufaddr=sreq->mpid.envelope.data;
+  MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].mode=MPIDI_Protocols_RVZ;
+  MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].sendRzv=1;
+  MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].sendEnvelop=1;
+  MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].memRegion=sreq->mpid.envelope.memregion_used;
+  MPIDI_Trace_buf[dest].S[(sreq->mpid.idx)].use_pami_get=MPIDI_Process.mp_s_use_pami_get;
 #endif
 }
 
@@ -336,7 +355,7 @@ MPIDI_SendMsg_process_userdefined_dt(MPID_Request      * sreq,
           MPID_Abort(NULL, MPI_ERR_NO_SPACE, -1,
                      "Unable to allocate non-contiguous buffer");
         }
-      sreq->mpid.uebuf_malloc = 1;
+      sreq->mpid.uebuf_malloc = mpiuMalloc;
 
       DLOOP_Offset last = data_sz;
       MPID_Segment_init(sreq->mpid.userbuf,
@@ -397,7 +416,7 @@ MPIDI_SendMsg(pami_context_t   context,
   MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
   MPIDI_Request_setMatchSeq(sreq, out_cntr->nMsgs);
 #endif
-
+if (!TOKEN_FLOW_CONTROL_ON) {
   size_t   data_sz;
   void   * sndbuf;
   if (likely(HANDLE_GET_KIND(sreq->mpid.datatype) == HANDLE_KIND_BUILTIN))
@@ -411,11 +430,11 @@ MPIDI_SendMsg(pami_context_t   context,
     }
 #ifdef MPIDI_TRACE
    sreq->mpid.partner_id=dest;
-   GET_REC_S(sreq,context,isSync,data_sz)
+   MPIDI_GET_S_REC(sreq,context,isSync,data_sz);
 #endif
 
 #ifdef OUT_OF_ORDER_HANDLING
-  sreq->mpid.shm=0;
+  sreq->mpid.envelope.msginfo.noRDMA=0;
 #endif
 
   const unsigned isLocal = PAMIX_Task_is_local(dest_tid);
@@ -460,7 +479,7 @@ MPIDI_SendMsg(pami_context_t   context,
     {
       TRACE_ERR("Sending(rendezvous%s%s) bytes=%u (eager_limit=%u)\n", isInternal==1?",internal":"", isLocal==1?",intranode":"", data_sz, MPIDI_PT2PT_EAGER_LIMIT(isInternal,isLocal));
 #ifdef OUT_OF_ORDER_HANDLING
-      sreq->mpid.shm=isLocal;
+      sreq->mpid.envelope.msginfo.noRDMA=isLocal;
 #endif
       if (likely(data_sz > 0))
         {
@@ -482,6 +501,134 @@ MPIDI_SendMsg(pami_context_t   context,
         }
 #endif
     }
+    }
+  else
+    {  /* TOKEN_FLOW_CONTROL_ON  */
+    #if TOKEN_FLOW_CONTROL
+    if (!(sreq->mpid.userbufcount))
+       {
+#ifdef MPIDI_TRACE
+        sreq->mpid.partner_id=dest;
+        MPIDI_GET_S_REC(sreq,context,isSync,0);
+#endif
+        TRACE_ERR("Sending(short,intranode) bytes=%u (short_limit=%u)\n", data_sz, MPIDI_Process.short_limit);
+        MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+        MPIDI_Piggy_back_tokens(dest,sreq,0);
+        MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+        MPIDI_SendMsg_short(context,
+                            sreq,
+                            dest,
+                            sreq->mpid.userbuf,
+                            0,
+                            isSync);
+       }
+     else
+       {
+       size_t   data_sz;
+       void   * sndbuf;
+       int      noRDMA=0;
+       if (likely(HANDLE_GET_KIND(sreq->mpid.datatype) == HANDLE_KIND_BUILTIN))
+         {
+           sndbuf   = sreq->mpid.userbuf;
+           data_sz  = sreq->mpid.userbufcount * MPID_Datatype_get_basic_size(sreq->mpid.datatype);
+         }
+       else
+        {
+          MPIDI_SendMsg_process_userdefined_dt(sreq, &sndbuf, &data_sz);
+         }
+#ifdef MPIDI_TRACE
+       sreq->mpid.partner_id=dest;
+       MPIDI_GET_S_REC(sreq,context,isSync,data_sz);
+#endif
+       if (unlikely(PAMIX_Task_is_local(dest_tid) != 0))  noRDMA=1;
+
+       MPIU_THREAD_CS_ENTER(MSGQUEUE,0);
+       if ((!isSync) && MPIDI_Token_cntr[dest].tokens >= 1)
+        {
+          if (data_sz <= MPIDI_Process.pt2pt.limits.application.immediate.remote)
+             {
+             TRACE_ERR("Sending(short,intranode) bytes=%u (short_limit=%u)\n", data_sz,MPIDI_Process.pt2pt.limits.application.immediate.remote);
+             --MPIDI_Token_cntr[dest].tokens;
+             MPIDI_Piggy_back_tokens(dest,sreq,data_sz);
+             MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+             MPIDI_SendMsg_short(context,
+                                 sreq,
+                                 dest,
+                                 sndbuf,
+                                 data_sz,
+                                 isSync);
+             }
+           else if (data_sz <= MPIDI_Process.pt2pt.limits.application.eager.remote)
+             {
+              TRACE_ERR("Sending(eager) bytes=%u (eager_limit=%u)\n", data_sz, MPIDI_Process.pt2pt.limits.application.eager.remote);
+              --MPIDI_Token_cntr[dest].tokens;
+              MPIDI_Piggy_back_tokens(dest,sreq,data_sz);
+              MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+              MPIDI_SendMsg_eager(context,
+                                 sreq,
+                                 dest,
+                                 sndbuf,
+                                 data_sz);
+#ifdef MPIDI_STATISTICS
+                    if (MPID_cc_is_complete(&sreq->cc)) {
+                        MPID_NSTAT(mpid_statp->sendsComplete);
+                    }
+#endif
+
+              }
+            else   /* rendezvous message  */
+              {
+                TRACE_ERR("Sending(RZV) bytes=%u (eager_limit=%u)\n", data_sz, MPIDI_Process.pt2pt.limits.application.eager.remote);
+                MPIDI_Piggy_back_tokens(dest,sreq,data_sz);
+                MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+                sreq->mpid.envelope.msginfo.noRDMA=noRDMA;
+                MPIDI_SendMsg_rzv(context,
+                                  sreq,
+                                  dest,
+                                  sndbuf,
+                                  data_sz);
+#ifdef MPIDI_STATISTICS
+                       if (MPID_cc_is_complete(&sreq->cc))
+                       {
+                          MPID_NSTAT(mpid_statp->sendsComplete);
+                       }
+#endif
+              }
+       }
+     else
+      {   /* no tokens, all messages use rendezvous protocol */
+        if ((data_sz <= MPIDI_Process.pt2pt.limits.application.eager.remote) && (!isSync)) {
+             ++MPIDI_Token_cntr[dest].n_tokenStarved;
+              sreq->mpid.envelope.msginfo.noRDMA=1;
+        }
+        else sreq->mpid.envelope.msginfo.noRDMA=noRDMA;
+        MPIDI_Piggy_back_tokens(dest,sreq,data_sz);
+        MPIU_THREAD_CS_EXIT(MSGQUEUE,0);
+        TRACE_ERR("Sending(RZV) bytes=%u (eager_limit=%u)\n", data_sz, MPIDI_Process.pt2pt.limits.application.eager.remote);
+        if (likely(data_sz > 0))
+           {
+             MPIDI_SendMsg_rzv(context,
+                                  sreq,
+                                  dest,
+                                sndbuf,
+                              data_sz);
+           }
+          else
+            {
+              MPIDI_SendMsg_rzv_zerobyte(context, sreq, dest);
+            }
+#ifdef MPIDI_STATISTICS
+               if (MPID_cc_is_complete(&sreq->cc))
+                {
+                   MPID_NSTAT(mpid_statp->sendsComplete);
+                }
+#endif
+    }
+  }
+    #else
+    MPID_assert_always(0);
+    #endif /* TOKEN_FLOW_CONTROL */
+ }
 }
 
 

http://git.mpich.org/mpich.git/commitdiff/6d89f69981a1d00f992272062493e2594c997045

commit 6d89f69981a1d00f992272062493e2594c997045
Author: Su Huang <suhuang at us.ibm.com>
Date:   Tue Oct 16 13:21:44 2012 -0400

    fix the build - replace malloc/free by MPIU_Malloc/MPIU_Free
    
    (ibm) 135d4666d3ebe4c847cd61420ef171c009a7bdd2
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index 6838678..88a9abe 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -253,13 +253,13 @@ MPIDI_PAMI_client_init(int* rank, int* size, int threading)
     if (env != NULL)
       {
         size_t i, n = strlen(env);
-        char * tmp = (char *) malloc(n+1);
+        char * tmp = (char *) MPIU_Malloc(n+1);
         strncpy(tmp,env,n);
         if (n>0) tmp[n]=0;
 
         MPIDI_atoi(tmp, &MPIDI_Process.disable_internal_eager_scale);
 
-        free (tmp);
+        MPIU_Free(tmp);
       }
 
     if (MPIDI_Process.disable_internal_eager_scale <= *size)
diff --git a/src/mpid/pamid/src/mpidi_env.c b/src/mpid/pamid/src/mpidi_env.c
index 0009c9b..fb540c3 100644
--- a/src/mpid/pamid/src/mpidi_env.c
+++ b/src/mpid/pamid/src/mpidi_env.c
@@ -755,7 +755,7 @@ MPIDI_Env_setup(int rank, int requested)
     if (env != NULL)
       {
         size_t i, n = strlen(env);
-        char * tmp = (char *) malloc(n+1);
+        char * tmp = (char *) MPIU_Malloc(n+1);
         strncpy(tmp,env,n);
         if (n>0) tmp[n]=0;
 
@@ -779,7 +779,7 @@ MPIDI_Env_setup(int rank, int requested)
               }
           }
 
-        free (tmp);
+        MPIU_Free (tmp);
       }
   }
 

http://git.mpich.org/mpich.git/commitdiff/48c81cff1b5dd8b33c39387e4d7e6b9278bfdb20

commit 48c81cff1b5dd8b33c39387e4d7e6b9278bfdb20
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Mon Oct 15 14:46:47 2012 -0500

    attempt to fix pe compile error.
    
    (ibm) aca0d365ad4b3bda829483d55de6c9725fc32d40
    
    Signed-off-by: Su Huang <suhuang at us.ibm.com>

diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index 08688d6..6838678 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -19,6 +19,10 @@
  * \file src/mpid_init.c
  * \brief Normal job startup code
  */
+
+#include <stdlib.h>
+#include <string.h>
+
 #include <mpidimpl.h>
 #include "mpidi_platform.h"
 #include "onesided/mpidi_onesided.h"

http://git.mpich.org/mpich.git/commitdiff/26f8ef856e978b6e98838280c9228ebdbc627c58

commit 26f8ef856e978b6e98838280c9228ebdbc627c58
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Fri Oct 12 09:53:13 2012 -0500

    New environment variable: 'PAMID_DISABLE_INTERNAL_EAGER_TASK_LIMIT'
    
    This new environment variable overrides the default task limit at
    which the internal eager protocols are disabled.
    
    (ibm) 994e4e9a813f10aec687dde462a44bdd07da59dd
    
    Signed-off-by: Su Huang <suhuang at us.ibm.com>

diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index ce0ece3..08688d6 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -244,13 +244,28 @@ MPIDI_PAMI_client_init(int* rank, int* size, int threading)
   /* Determine if the eager point-to-point protocol for internal mpi */
   /* operations should be disabled.                                  */
   /* --------------------------------------------------------------- */
-  if (MPIDI_Process.disable_internal_eager_scale <= *size)
-    {
-      MPIDI_Process.pt2pt.limits.internal.eager.remote     = 0;
-      MPIDI_Process.pt2pt.limits.internal.eager.local      = 0;
-      MPIDI_Process.pt2pt.limits.internal.immediate.remote = 0;
-      MPIDI_Process.pt2pt.limits.internal.immediate.local  = 0;
-    }
+  {
+    char * env = getenv("PAMID_DISABLE_INTERNAL_EAGER_TASK_LIMIT");
+    if (env != NULL)
+      {
+        size_t i, n = strlen(env);
+        char * tmp = (char *) malloc(n+1);
+        strncpy(tmp,env,n);
+        if (n>0) tmp[n]=0;
+
+        MPIDI_atoi(tmp, &MPIDI_Process.disable_internal_eager_scale);
+
+        free (tmp);
+      }
+
+    if (MPIDI_Process.disable_internal_eager_scale <= *size)
+      {
+        MPIDI_Process.pt2pt.limits.internal.eager.remote     = 0;
+        MPIDI_Process.pt2pt.limits.internal.eager.local      = 0;
+        MPIDI_Process.pt2pt.limits.internal.immediate.remote = 0;
+        MPIDI_Process.pt2pt.limits.internal.immediate.local  = 0;
+      }
+  }
 }
 
 

http://git.mpich.org/mpich.git/commitdiff/9132e28ea86e5855d175048488224de575ba7ae7

commit 9132e28ea86e5855d175048488224de575ba7ae7
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Tue Oct 9 15:33:31 2012 -0500

    Create a task scaling threshold at which point internal eager is disabled
    
    A new mpidi_platform.h define sets the job size threshold at which
    point the default 'internal eager' limits are set to zero. This has the
    same effect as specifying the environment variable:
    
      PAMID_PT2PT_LIMITS=::::0:0:0:0
    
    The current threshold for bgq is 512k tasks. The default threshold for
    all other platforms is 'max unsigned int', which effectively disables
    this threshold check.
    
    This 'disable internal eager' threshold check is done before any
    environment variable processing.
    
    (ibm) ad02891421a83ed1a57d9da7d5ecaa675e572cfb
    
    Signed-off-by: Su Huang <suhuang at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_datatypes.h b/src/mpid/pamid/include/mpidi_datatypes.h
index a428e51..a6baa07 100644
--- a/src/mpid/pamid/include/mpidi_datatypes.h
+++ b/src/mpid/pamid/include/mpidi_datatypes.h
@@ -80,6 +80,7 @@ typedef struct
     unsigned             limits_lookup[2][2][2];
     MPIDI_pt2pt_limits_t limits;
   } pt2pt;
+  unsigned disable_internal_eager_scale; /**< The number of tasks at which point eager will be disabled */
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
   unsigned mp_infolevel;
   unsigned mp_statistics;     /* print pamid statistcs data                           */
diff --git a/src/mpid/pamid/include/mpidi_platform.h b/src/mpid/pamid/include/mpidi_platform.h
index b0e5918..61a001d 100644
--- a/src/mpid/pamid/include/mpidi_platform.h
+++ b/src/mpid/pamid/include/mpidi_platform.h
@@ -33,6 +33,9 @@
 #define MPIDI_EAGER_LIMIT  2049
 /** This is set to 0 which effectively disables the eager protocol for local transfers */
 #define MPIDI_EAGER_LIMIT_LOCAL  0
+/** This is set to 'max unsigned' which effectively never disables internal eager at scale */
+#define MPIDI_DISABLE_INTERNAL_EAGER_SCALE ((unsigned)-1)
+
 /* Default features */
 #define USE_PAMI_RDMA 1
 #define USE_PAMI_CONSISTENCY PAMI_HINT_ENABLE
@@ -64,6 +67,8 @@
 #ifdef __BGQ__
 #undef  MPIDI_EAGER_LIMIT_LOCAL
 #define MPIDI_EAGER_LIMIT_LOCAL  64
+#undef  MPIDI_DISABLE_INTERNAL_EAGER_SCALE
+#define MPIDI_DISABLE_INTERNAL_EAGER_SCALE (512*1024)
 #define MPIDI_MAX_THREADS     64
 #define MPIDI_MUTEX_L2_ATOMIC 1
 #define MPIDI_OPTIMIZED_COLLECTIVE_DEFAULT 1
diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index 6f1f99d..ce0ece3 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -63,26 +63,27 @@ MPIDI_Process_t  MPIDI_Process = {
     .limits = {
       .application = {
         .eager = {
-          .remote          = MPIDI_EAGER_LIMIT,
-          .local           = MPIDI_EAGER_LIMIT_LOCAL,
+          .remote        = MPIDI_EAGER_LIMIT,
+          .local         = MPIDI_EAGER_LIMIT_LOCAL,
         },
         .immediate = {
-          .remote          = MPIDI_SHORT_LIMIT,
-          .local           = MPIDI_SHORT_LIMIT,
+          .remote        = MPIDI_SHORT_LIMIT,
+          .local         = MPIDI_SHORT_LIMIT,
         },
       },
       .internal = {
         .eager = {
-          .remote          = MPIDI_EAGER_LIMIT,
-          .local           = MPIDI_EAGER_LIMIT_LOCAL,
+          .remote        = MPIDI_EAGER_LIMIT,
+          .local         = MPIDI_EAGER_LIMIT_LOCAL,
         },
         .immediate = {
-          .remote          = MPIDI_SHORT_LIMIT,
-          .local           = MPIDI_SHORT_LIMIT,
+          .remote        = MPIDI_SHORT_LIMIT,
+          .local         = MPIDI_SHORT_LIMIT,
         },
       },
     },
   },
+  .disable_internal_eager_scale = MPIDI_DISABLE_INTERNAL_EAGER_SCALE,
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
   .mp_infolevel          = 0,
   .mp_statistics         = 0,
@@ -238,6 +239,18 @@ MPIDI_PAMI_client_init(int* rank, int* size, int threading)
   *rank = PAMIX_Client_query(MPIDI_Client, PAMI_CLIENT_TASK_ID  ).value.intval;
   MPIR_Process.comm_world->rank = *rank; /* Set the rank early to make tracing better */
   *size = PAMIX_Client_query(MPIDI_Client, PAMI_CLIENT_NUM_TASKS).value.intval;
+
+  /* --------------------------------------------------------------- */
+  /* Determine if the eager point-to-point protocol for internal mpi */
+  /* operations should be disabled.                                  */
+  /* --------------------------------------------------------------- */
+  if (MPIDI_Process.disable_internal_eager_scale <= *size)
+    {
+      MPIDI_Process.pt2pt.limits.internal.eager.remote     = 0;
+      MPIDI_Process.pt2pt.limits.internal.eager.local      = 0;
+      MPIDI_Process.pt2pt.limits.internal.immediate.remote = 0;
+      MPIDI_Process.pt2pt.limits.internal.immediate.local  = 0;
+    }
 }
 
 
@@ -360,7 +373,6 @@ MPIDI_PAMI_context_init(int* threading)
       }
 #endif
 
-
   /* ----------------------------------- */
   /*  Create the communication contexts  */
   /* ----------------------------------- */
@@ -529,6 +541,7 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
              "        remote, local   : %u, %u\n"
              "  rma_pending           : %u\n"
              "  shmem_pt2pt           : %u\n"
+             "  disable_internal_eager_scale : %u\n"
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
              "  mp_infolevel : %u\n"
              "  mp_statistics: %u\n"
@@ -553,6 +566,7 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
              MPIDI_Process.pt2pt.limits_array[7],
              MPIDI_Process.rma_pending,
              MPIDI_Process.shmem_pt2pt,
+             MPIDI_Process.disable_internal_eager_scale,
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
              MPIDI_Process.mp_infolevel,
              MPIDI_Process.mp_statistics,

http://git.mpich.org/mpich.git/commitdiff/6d73db2b6cea0f2f680f68800a6f7bdd11f8b47d

commit 6d73db2b6cea0f2f680f68800a6f7bdd11f8b47d
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Thu Oct 11 14:19:48 2012 -0500

    Simplify and clarify the 'MPIDI_PT2PT_LIMIT' macro by splitting it in two.
    
    (ibm) 35001dac7082ff96a7ce3ced2d9787512a120556
    
    Signed-off-by: Su Huang <suhuang at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_platform.h b/src/mpid/pamid/include/mpidi_platform.h
index 96c0487..b0e5918 100644
--- a/src/mpid/pamid/include/mpidi_platform.h
+++ b/src/mpid/pamid/include/mpidi_platform.h
@@ -42,13 +42,21 @@
 #define ASYNC_PROGRESS_MODE_DEFAULT 0
 
 /*
- * The default behavior is to disable (ignore) 'internal vs application' and
- * 'local vs remote' point-to-point limits. The only limits provided are the
- * 'immediate' and  'eager (rzv)' limits.
+ * The default behavior is to disable (ignore) the 'internal vs application' and
+ * the 'local vs remote' point-to-point eager limits.
  */
-#define MPIDI_PT2PT_LIMIT(is_internal,is_eager,is_local)                        \
+#define MPIDI_PT2PT_EAGER_LIMIT(is_internal,is_local)                           \
 ({                                                                              \
-  MPIDI_Process.pt2pt.limits_lookup[0][is_eager][0];                            \
+  MPIDI_Process.pt2pt.limits_lookup[0][0][0];                                   \
+})
+
+/*
+ * The default behavior is to disable (ignore) the 'internal vs application' and
+ * the 'local vs remote' point-to-point short limits.
+ */
+#define MPIDI_PT2PT_SHORT_LIMIT(is_internal,is_local)                           \
+({                                                                              \
+  MPIDI_Process.pt2pt.limits_lookup[0][1][0];                                   \
 })
 
 
@@ -65,13 +73,23 @@
 #define PAMIX_IS_LOCAL_TASK_SHIFT   (6)
 
 /*
- * Enable both 'internal vs application' and 'local vs remote' point-to-point
- * limits, in addition to the 'immediate' and 'eager (rzv)' point-to-point limits.
+ * Enable both the 'internal vs application' and the 'local vs remote'
+ * point-to-point eager limits.
  */
-#undef MPIDI_PT2PT_LIMIT
-#define MPIDI_PT2PT_LIMIT(is_internal,is_eager,is_local)                        \
+#undef MPIDI_PT2PT_EAGER_LIMIT
+#define MPIDI_PT2PT_EAGER_LIMIT(is_internal,is_local)                           \
 ({                                                                              \
-  MPIDI_Process.pt2pt.limits_lookup[is_internal][is_eager][is_local];           \
+  MPIDI_Process.pt2pt.limits_lookup[is_internal][0][is_local];                  \
+})
+
+/*
+ * Enable both the 'internal vs application' and the 'local vs remote'
+ * point-to-point short limits.
+ */
+#undef MPIDI_PT2PT_SHORT_LIMIT
+#define MPIDI_PT2PT_SHORT_LIMIT(is_internal,is_local)                           \
+({                                                                              \
+  MPIDI_Process.pt2pt.limits_lookup[is_internal][1][is_local];                  \
 })
 
 
@@ -103,13 +121,21 @@ static const char _ibm_release_version_[] = "V1R2M0";
 #define PAMIX_IS_LOCAL_TASK_SHIFT   (0)
 
 /*
- * Enable only the 'local vs remote' point-to-point limits, in addition to the
- * 'immediate' and 'eager (rzv)' point-to-point limits.
+ * Enable only the 'local vs remote' point-to-point eager limits.
+ */
+#undef MPIDI_PT2PT_EAGER_LIMIT
+#define MPIDI_PT2PT_EAGER_LIMIT(is_internal,is_local)                           \
+({                                                                              \
+  MPIDI_Process.pt2pt.limits_lookup[0][0][is_local];                            \
+})
+
+/*
+ * Enable only the 'local vs remote' point-to-point short limits.
  */
-#undef MPIDI_PT2PT_LIMIT
-#define MPIDI_PT2PT_LIMIT(is_internal,is_eager,is_local)                        \
+#undef MPIDI_PT2PT_SHORT_LIMIT
+#define MPIDI_PT2PT_SHORT_LIMIT(is_internal,is_local)                           \
 ({                                                                              \
-  MPIDI_Process.pt2pt.limits_lookup[0][is_eager][is_local];                     \
+  MPIDI_Process.pt2pt.limits_lookup[0][1][is_local];                            \
 })
 
 
diff --git a/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c b/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c
index d5b3985..939deb6 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c
+++ b/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c
@@ -419,15 +419,13 @@ MPIDI_SendMsg(pami_context_t   context,
 #endif
 
   const unsigned isLocal = PAMIX_Task_is_local(dest_tid);
-  const unsigned testImmediate = 1;
-  const unsigned testEager     = 0;
 
   /*
    * Always use the short protocol when data_sz is small.
    */
-  if (likely(data_sz < MPIDI_PT2PT_LIMIT(isInternal,testImmediate,isLocal)))
+  if (likely(data_sz < MPIDI_PT2PT_SHORT_LIMIT(isInternal,isLocal)))
     {
-      TRACE_ERR("Sending(short%s%s) bytes=%u (short_limit=%u)\n", isInternal==1?",internal":"", isLocal==1?",intranode":"", data_sz, MPIDI_PT2PT_LIMIT(isInternal,testImmediate,isLocal));
+      TRACE_ERR("Sending(short%s%s) bytes=%u (short_limit=%u)\n", isInternal==1?",internal":"", isLocal==1?",intranode":"", data_sz, MPIDI_PT2PT_SHORT_LIMIT(isInternal,isLocal));
       MPIDI_SendMsg_short(context,
                           sreq,
                           dest,
@@ -438,9 +436,9 @@ MPIDI_SendMsg(pami_context_t   context,
   /*
    * Use the eager protocol when data_sz is less than the eager limit.
    */
-  else if (data_sz < MPIDI_PT2PT_LIMIT(isInternal,testEager,isLocal))
+  else if (data_sz < MPIDI_PT2PT_EAGER_LIMIT(isInternal,isLocal))
     {
-      TRACE_ERR("Sending(eager%s%s) bytes=%u (eager_limit=%u)\n", isInternal==1?",internal":"", isLocal==1?",intranode":"", data_sz, MPIDI_PT2PT_LIMIT(isInternal,testEager,isLocal));
+      TRACE_ERR("Sending(eager%s%s) bytes=%u (eager_limit=%u)\n", isInternal==1?",internal":"", isLocal==1?",intranode":"", data_sz, MPIDI_PT2PT_EAGER_LIMIT(isInternal,isLocal));
       MPIDI_SendMsg_eager(context,
                           sreq,
                           dest,
@@ -460,7 +458,7 @@ MPIDI_SendMsg(pami_context_t   context,
    */
   else
     {
-      TRACE_ERR("Sending(rendezvous%s%s) bytes=%u (eager_limit=%u)\n", isInternal==1?",internal":"", isLocal==1?",intranode":"", data_sz, MPIDI_PT2PT_LIMIT(isInternal,testEager,isLocal));
+      TRACE_ERR("Sending(rendezvous%s%s) bytes=%u (eager_limit=%u)\n", isInternal==1?",internal":"", isLocal==1?",intranode":"", data_sz, MPIDI_PT2PT_EAGER_LIMIT(isInternal,isLocal));
 #ifdef OUT_OF_ORDER_HANDLING
       sreq->mpid.shm=isLocal;
 #endif

http://git.mpich.org/mpich.git/commitdiff/3999e397ed27f6fc30ea628daa5f659ed6d3c8f0

commit 3999e397ed27f6fc30ea628daa5f659ed6d3c8f0
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Tue Oct 9 14:39:20 2012 -0500

    Add 'PAMID_PT2PT_LIMITS' env var to specify *all* point-to-point limit overrides
    
    The entire point-to-point limit set is determined by three boolean
    configuration values:
    - 'is non-local limit'   vs 'is local limit'
    - 'is eager limit'       vs 'is immediate limit'
    - 'is application limit' vs 'is internal limit'
    
    The point-to-point configuration limit values are specified in order and
    are delimited by ':' characters. If a value is not specified for a given
    configuration then the limit is not changed. Not all eight configuration
    values are required to be specified, although in order to set the
    last (eighth) configuration value the previous seven configurations must
    be listed. The 'k', 'K', 'm', and 'M' multipliers may be specified. For
    example:
    
       PAMID_PT2PT_LIMITS=":::::::10k"
    
    The configuration entries can be described as:
       0 - remote eager     application limit
       1 - local  eager     application limit
       2 - remote immediate application limit
       3 - local  immediate application limit
       4 - remote eager     internal    limit
       5 - local  eager     internal    limit
       6 - remote immediate internal    limit
       7 - local  immediate internal    limit
    
    Examples:
    
       "10K"
         - sets the application internode eager (the "normal" eager limit)
    
       "10240::64"
         - sets the application internode eager and immediate limits
    
       "::::0:0:0:0"
         - disables 'eager' and 'immediate' for all internal point-to-point
    
    (ibm) d39d893d857c4ffe7e860b7566592f0ca61ee484
    
    Signed-off-by: Su Huang <suhuang at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_datatypes.h b/src/mpid/pamid/include/mpidi_datatypes.h
index 01653e0..a428e51 100644
--- a/src/mpid/pamid/include/mpidi_datatypes.h
+++ b/src/mpid/pamid/include/mpidi_datatypes.h
@@ -45,6 +45,26 @@ typedef struct
 } MPIDI_RequestHandle_t;
 #endif
 
+#define MPIDI_PT2PT_LIMIT_SET(is_internal,is_immediate,is_local,value)		\
+  MPIDI_Process.pt2pt.limits_lookup[is_internal][is_immediate][is_local] = value\
+
+typedef struct
+{
+  unsigned remote;
+  unsigned local;
+} MPIDI_remote_and_local_limits_t;
+
+typedef struct
+{
+  MPIDI_remote_and_local_limits_t eager;
+  MPIDI_remote_and_local_limits_t immediate;
+} MPIDI_immediate_and_eager_limits_t;
+
+typedef struct
+{
+  MPIDI_immediate_and_eager_limits_t application;
+  MPIDI_immediate_and_eager_limits_t internal;
+} MPIDI_pt2pt_limits_t;
 
 /**
  * \brief MPI Process descriptor
@@ -54,9 +74,12 @@ typedef struct
 typedef struct
 {
   unsigned avail_contexts;
-  unsigned short_limit;
-  unsigned eager_limit;
-  unsigned eager_limit_local;
+  union
+  {
+    unsigned             limits_array[8];
+    unsigned             limits_lookup[2][2][2];
+    MPIDI_pt2pt_limits_t limits;
+  } pt2pt;
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
   unsigned mp_infolevel;
   unsigned mp_statistics;     /* print pamid statistcs data                           */
diff --git a/src/mpid/pamid/include/mpidi_platform.h b/src/mpid/pamid/include/mpidi_platform.h
index 5f5995b..96c0487 100644
--- a/src/mpid/pamid/include/mpidi_platform.h
+++ b/src/mpid/pamid/include/mpidi_platform.h
@@ -41,16 +41,40 @@
 
 #define ASYNC_PROGRESS_MODE_DEFAULT 0
 
+/*
+ * The default behavior is to disable (ignore) 'internal vs application' and
+ * 'local vs remote' point-to-point limits. The only limits provided are the
+ * 'immediate' and  'eager (rzv)' limits.
+ */
+#define MPIDI_PT2PT_LIMIT(is_internal,is_eager,is_local)                        \
+({                                                                              \
+  MPIDI_Process.pt2pt.limits_lookup[0][is_eager][0];                            \
+})
+
+
+
 #ifdef __BGQ__
 #undef  MPIDI_EAGER_LIMIT_LOCAL
 #define MPIDI_EAGER_LIMIT_LOCAL  64
 #define MPIDI_MAX_THREADS     64
 #define MPIDI_MUTEX_L2_ATOMIC 1
 #define MPIDI_OPTIMIZED_COLLECTIVE_DEFAULT 1
+
 #define PAMIX_IS_LOCAL_TASK
 #define PAMIX_IS_LOCAL_TASK_STRIDE  (4)
 #define PAMIX_IS_LOCAL_TASK_SHIFT   (6)
 
+/*
+ * Enable both 'internal vs application' and 'local vs remote' point-to-point
+ * limits, in addition to the 'immediate' and 'eager (rzv)' point-to-point limits.
+ */
+#undef MPIDI_PT2PT_LIMIT
+#define MPIDI_PT2PT_LIMIT(is_internal,is_eager,is_local)                        \
+({                                                                              \
+  MPIDI_Process.pt2pt.limits_lookup[is_internal][is_eager][is_local];           \
+})
+
+
 #undef ASYNC_PROGRESS_MODE_DEFAULT
 #define ASYNC_PROGRESS_MODE_DEFAULT 1
 
@@ -72,10 +96,23 @@ static const char _ibm_release_version_[] = "V1R2M0";
 #define RDMA_FAILOVER
 #define MPIDI_BANNER          1
 #define MPIDI_NO_ASSERT       1
+
+/* 'is local task' extension and limits */
 #define PAMIX_IS_LOCAL_TASK
 #define PAMIX_IS_LOCAL_TASK_STRIDE  (1)
 #define PAMIX_IS_LOCAL_TASK_SHIFT   (0)
 
+/*
+ * Enable only the 'local vs remote' point-to-point limits, in addition to the
+ * 'immediate' and 'eager (rzv)' point-to-point limits.
+ */
+#undef MPIDI_PT2PT_LIMIT
+#define MPIDI_PT2PT_LIMIT(is_internal,is_eager,is_local)                        \
+({                                                                              \
+  MPIDI_Process.pt2pt.limits_lookup[0][is_eager][is_local];                     \
+})
+
+
 #undef ASYNC_PROGRESS_MODE_DEFAULT
 #define ASYNC_PROGRESS_MODE_DEFAULT 2
 
diff --git a/src/mpid/pamid/src/mpid_init.c b/src/mpid/pamid/src/mpid_init.c
index 24b72b3..6f1f99d 100644
--- a/src/mpid/pamid/src/mpid_init.c
+++ b/src/mpid/pamid/src/mpid_init.c
@@ -59,9 +59,30 @@ MPIDI_Process_t  MPIDI_Process = {
     },
   },
 #endif
-  .short_limit           = MPIDI_SHORT_LIMIT,
-  .eager_limit           = MPIDI_EAGER_LIMIT,
-  .eager_limit_local     = MPIDI_EAGER_LIMIT_LOCAL,
+  .pt2pt = {
+    .limits = {
+      .application = {
+        .eager = {
+          .remote          = MPIDI_EAGER_LIMIT,
+          .local           = MPIDI_EAGER_LIMIT_LOCAL,
+        },
+        .immediate = {
+          .remote          = MPIDI_SHORT_LIMIT,
+          .local           = MPIDI_SHORT_LIMIT,
+        },
+      },
+      .internal = {
+        .eager = {
+          .remote          = MPIDI_EAGER_LIMIT,
+          .local           = MPIDI_EAGER_LIMIT_LOCAL,
+        },
+        .immediate = {
+          .remote          = MPIDI_SHORT_LIMIT,
+          .local           = MPIDI_SHORT_LIMIT,
+        },
+      },
+    },
+  },
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
   .mp_infolevel          = 0,
   .mp_statistics         = 0,
@@ -388,7 +409,7 @@ MPIDI_PAMI_dispath_set(size_t              dispatch,
   TRACE_ERR("Immediate-max query:  dispatch=%zu  got=%zu  required=%zu\n",
             dispatch, im_max, proto->immediate_min);
   MPID_assert_always(proto->immediate_min <= im_max);
-  if (immediate_max != NULL)
+  if ((immediate_max != NULL) && (im_max < *immediate_max))
     *immediate_max = im_max;
 }
 
@@ -407,21 +428,25 @@ MPIDI_PAMI_dispath_init()
     if ( rc == PAMI_SUCCESS )
       {
         TRACE_ERR("PAMI_DISPATCH_SEND_IMMEDIATE_MAX=%d.\n", config.value.intval, rc);
-        MPIDI_Process.short_limit = config.value.intval;
+        MPIDI_Process.pt2pt.limits_array[2] = config.value.intval;
       }
     else
       {
         TRACE_ERR((" Attention: PAMI_Client_query(DISPATCH_SEND_IMMEDIATE_MAX=%d) rc=%d\n", config.name, rc));
-        MPIDI_Process.short_limit = 256;
+        MPIDI_Process.pt2pt.limits_array[2] = 256;
       }
+
+    MPIDI_Process.pt2pt.limits_array[3] = MPIDI_Process.pt2pt.limits_array[2];
+    MPIDI_Process.pt2pt.limits_array[6] = MPIDI_Process.pt2pt.limits_array[2];
+    MPIDI_Process.pt2pt.limits_array[7] = MPIDI_Process.pt2pt.limits_array[2];
   }
 #endif
   /* ------------------------------------ */
   /*  Set up the communication protocols  */
   /* ------------------------------------ */
-  unsigned pami_short_limit[2] = {MPIDI_Process.short_limit, MPIDI_Process.short_limit};
-  MPIDI_PAMI_dispath_set(MPIDI_Protocols_Short,     &proto_list.Short,     pami_short_limit+0);
-  MPIDI_PAMI_dispath_set(MPIDI_Protocols_ShortSync, &proto_list.ShortSync, pami_short_limit+1);
+  unsigned send_immediate_max_bytes = (unsigned) -1;
+  MPIDI_PAMI_dispath_set(MPIDI_Protocols_Short,     &proto_list.Short,     &send_immediate_max_bytes);
+  MPIDI_PAMI_dispath_set(MPIDI_Protocols_ShortSync, &proto_list.ShortSync, &send_immediate_max_bytes);
   MPIDI_PAMI_dispath_set(MPIDI_Protocols_Eager,     &proto_list.Eager,     NULL);
   MPIDI_PAMI_dispath_set(MPIDI_Protocols_RVZ,       &proto_list.RVZ,       NULL);
   MPIDI_PAMI_dispath_set(MPIDI_Protocols_Cancel,    &proto_list.Cancel,    NULL);
@@ -443,13 +468,19 @@ MPIDI_PAMI_dispath_init()
    *
    * - We use the min of the results just to be safe.
    */
-  pami_short_limit[0] -= (sizeof(MPIDI_MsgInfo) - 1);
-  if (MPIDI_Process.short_limit > pami_short_limit[0])
-    MPIDI_Process.short_limit = pami_short_limit[0];
-  pami_short_limit[1] -= (sizeof(MPIDI_MsgInfo) - 1);
-  if (MPIDI_Process.short_limit > pami_short_limit[1])
-    MPIDI_Process.short_limit = pami_short_limit[1];
-  TRACE_ERR("pami_short_limit[2] = [%u,%u]\n", pami_short_limit[0], pami_short_limit[1]);
+  send_immediate_max_bytes -= (sizeof(MPIDI_MsgInfo) - 1);
+
+  if (MPIDI_Process.pt2pt.limits.application.immediate.remote > send_immediate_max_bytes)
+    MPIDI_Process.pt2pt.limits.application.immediate.remote = send_immediate_max_bytes;
+
+  if (MPIDI_Process.pt2pt.limits.application.immediate.local > send_immediate_max_bytes)
+    MPIDI_Process.pt2pt.limits.application.immediate.local = send_immediate_max_bytes;
+
+  if (MPIDI_Process.pt2pt.limits.internal.immediate.remote > send_immediate_max_bytes)
+    MPIDI_Process.pt2pt.limits.internal.immediate.remote = send_immediate_max_bytes;
+
+  if (MPIDI_Process.pt2pt.limits.internal.immediate.local > send_immediate_max_bytes)
+    MPIDI_Process.pt2pt.limits.internal.immediate.local = send_immediate_max_bytes;
 }
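The clamping above takes the min of each configured immediate limit and the transport's send-immediate maximum, reduced by the protocol header. A minimal standalone sketch of that min-reduction, with hypothetical names ('header_size' stands in for sizeof(MPIDI_MsgInfo)):

```c
#include <assert.h>

/* Clamp each configured immediate limit to the largest payload the
 * transport can send inline once the protocol header is accounted for.
 * Returns the usable maximum for convenience. */
static unsigned clamp_immediate_limits(unsigned limits[], int count,
                                       unsigned transport_max,
                                       unsigned header_size)
{
  unsigned usable = transport_max - (header_size - 1);
  int i;
  for (i = 0; i < count; i++)
    if (limits[i] > usable)
      limits[i] = usable;   /* take the min, just to be safe */
  return usable;
}
```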
 
 
@@ -485,9 +516,17 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
              "  contexts              : %u\n"
              "  async_progress        : %u\n"
              "  context_post          : %u\n"
-             "  short_limit           : %u\n"
-             "  eager_limit           : %u\n"
-             "  eager_limit_local     : %u\n"
+             "  pt2pt.limits\n"
+             "    application\n"
+             "      eager\n"
+             "        remote, local   : %u, %u\n"
+             "      short\n"
+             "        remote, local   : %u, %u\n"
+             "    internal\n"
+             "      eager\n"
+             "        remote, local   : %u, %u\n"
+             "      short\n"
+             "        remote, local   : %u, %u\n"
              "  rma_pending           : %u\n"
              "  shmem_pt2pt           : %u\n"
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
@@ -504,9 +543,14 @@ MPIDI_PAMI_init(int* rank, int* size, int* threading)
              MPIDI_Process.avail_contexts,
              MPIDI_Process.async_progress.mode,
              MPIDI_Process.perobj.context_post.requested,
-             MPIDI_Process.short_limit,
-             MPIDI_Process.eager_limit,
-             MPIDI_Process.eager_limit_local,
+             MPIDI_Process.pt2pt.limits_array[0],
+             MPIDI_Process.pt2pt.limits_array[1],
+             MPIDI_Process.pt2pt.limits_array[2],
+             MPIDI_Process.pt2pt.limits_array[3],
+             MPIDI_Process.pt2pt.limits_array[4],
+             MPIDI_Process.pt2pt.limits_array[5],
+             MPIDI_Process.pt2pt.limits_array[6],
+             MPIDI_Process.pt2pt.limits_array[7],
              MPIDI_Process.rma_pending,
              MPIDI_Process.shmem_pt2pt,
 #if (MPIDI_STATISTICS || MPIDI_PRINTENV)
@@ -625,7 +669,6 @@ int MPID_Init(int * argc,
   /* ------------------------------------------------------------------------------- */
   MPIDI_PAMI_client_init(&rank, &size, requested);
 
-
   /* ------------------------------------ */
   /*  Get new defaults from the Env Vars  */
   /* ------------------------------------ */
diff --git a/src/mpid/pamid/src/mpidi_env.c b/src/mpid/pamid/src/mpidi_env.c
index 5ce40e0..0009c9b 100644
--- a/src/mpid/pamid/src/mpidi_env.c
+++ b/src/mpid/pamid/src/mpidi_env.c
@@ -402,7 +402,8 @@ static inline void
 ENV_Unsigned__(char* name[], unsigned* val, char* string, unsigned num_supported, unsigned* deprecated, int rank, int NA)
 {
   /* Check for deprecated environment variables. */
-  ENV_Deprecated(name, num_supported, deprecated, rank, NA);
+  if (deprecated != NULL)
+    ENV_Deprecated(name, num_supported, deprecated, rank, NA);
 
   char * env;
 
@@ -660,23 +661,126 @@ MPIDI_Env_setup(int rank, int requested)
     TRACE_ERR("MPIDI_Process.async_progress.mode=%u\n", MPIDI_Process.async_progress.mode);
   }
 
-  /* Determine short limit */
+  /*
+   * Determine 'short' limit
+   * - sets both the 'local' and 'remote' short limit, and
+   * - sets both the 'application' and 'internal' short limit
+   *
+   * Identical to setting the PAMID_PT2PT_LIMITS environment variable as:
+   *
+   *   PAMID_PT2PT_LIMITS="::x:x:::x:x"
+   */
   {
     /* THIS ENVIRONMENT VARIABLE NEEDS TO BE DOCUMENTED ABOVE */
     char* names[] = {"PAMID_SHORT", "MP_S_SHORT_LIMIT", "PAMI_SHORT", NULL};
-    ENV_Unsigned(names, &MPIDI_Process.short_limit, 2, &found_deprecated_env_var, rank);
+    ENV_Unsigned(names, &MPIDI_Process.pt2pt.limits.application.immediate.remote, 2, &found_deprecated_env_var, rank);
+    ENV_Unsigned(names, &MPIDI_Process.pt2pt.limits.application.immediate.local, 2, NULL, rank);
+    ENV_Unsigned(names, &MPIDI_Process.pt2pt.limits.internal.immediate.remote, 2, NULL, rank);
+    ENV_Unsigned(names, &MPIDI_Process.pt2pt.limits.internal.immediate.local, 2, NULL, rank);
   }
 
-  /* Determine eager limit */
+  /*
+   * Determine 'remote' eager limit
+   * - sets both the 'application' and 'internal' remote eager limit
+   *
+   * Identical to setting the PAMID_PT2PT_LIMITS environment variable as:
+   *
+   *   PAMID_PT2PT_LIMITS="x::::x:::"
+   *   -- or --
+   *   PAMID_PT2PT_LIMITS="x::::x"
+   */
   {
     char* names[] = {"PAMID_EAGER", "PAMID_RZV", "MP_EAGER_LIMIT", "PAMI_RVZ", "PAMI_RZV", "PAMI_EAGER", NULL};
-    ENV_Unsigned(names, &MPIDI_Process.eager_limit, 3, &found_deprecated_env_var, rank);
+    ENV_Unsigned(names, &MPIDI_Process.pt2pt.limits.application.eager.remote, 3, &found_deprecated_env_var, rank);
+    ENV_Unsigned(names, &MPIDI_Process.pt2pt.limits.internal.eager.remote, 3, NULL, rank);
   }
 
-  /* Determine 'local' eager limit */
+  /*
+   * Determine 'local' eager limit
+   * - sets both the 'application' and 'internal' local eager limit
+   *
+   * Identical to setting the PAMID_PT2PT_LIMITS environment variable as:
+   *
+   *   PAMID_PT2PT_LIMITS=":x::::x::"
+   *   -- or --
+   *   PAMID_PT2PT_LIMITS=":x::::x"
+   */
   {
     char* names[] = {"PAMID_RZV_LOCAL", "PAMID_EAGER_LOCAL", "MP_EAGER_LIMIT_LOCAL", "PAMI_RVZ_LOCAL", "PAMI_RZV_LOCAL", "PAMI_EAGER_LOCAL", NULL};
-    ENV_Unsigned(names, &MPIDI_Process.eager_limit_local, 3, &found_deprecated_env_var, rank);
+    ENV_Unsigned(names, &MPIDI_Process.pt2pt.limits.application.eager.local, 3, &found_deprecated_env_var, rank);
+    ENV_Unsigned(names, &MPIDI_Process.pt2pt.limits.internal.eager.local, 3, NULL, rank);
+  }
+
+  /*
+   * Determine *all* point-to-point limit overrides.
+   *
+   * The entire point-to-point limit set is determined by three boolean
+   * configuration values:
+   * - 'is non-local limit'   vs 'is local limit'
+   * - 'is eager limit'       vs 'is immediate limit'
+   * - 'is application limit' vs 'is internal limit'
+   *
+   * The point-to-point configuration limit values are specified in order and
+   * are delimited by ':' characters. If a value is not specified for a given
+   * configuration then the limit is not changed. Not all eight configuration
+   * values are required to be specified, although in order to set the
+   * last (eighth) configuration value the previous seven configurations must
+   * be listed. For example:
+   *
+   *    PAMID_PT2PT_LIMITS=":::::::10240"
+   *
+   * The configuration entries can be described as:
+   *    0 - remote eager     application limit
+   *    1 - local  eager     application limit
+   *    2 - remote immediate application limit
+   *    3 - local  immediate application limit
+   *    4 - remote eager     internal    limit
+   *    5 - local  eager     internal    limit
+   *    6 - remote immediate internal    limit
+   *    7 - local  immediate internal    limit
+   *
+   * Examples:
+   *
+   *    "10240"
+   *      - sets the application internode eager (the "normal" eager limit)
+   *
+   *    "10240::64"
+   *      - sets the application internode eager and immediate limits
+   *
+   *    "::::0:0:0:0"
+   *      - disables 'eager' and 'immediate' for all internal point-to-point
+   */
+  {
+    char * env = getenv("PAMID_PT2PT_LIMITS");
+    if (env != NULL)
+      {
+        size_t i, n = strlen(env);
+        char * tmp = (char *) malloc(n+1);
+        memcpy(tmp,env,n);
+        tmp[n]=0; /* 'tmp' has n+1 bytes, so the terminator always fits */
+
+        char * tail  = tmp;
+        char * token = tail;
+        for (i = 0; i < 8; i++)
+          {
+            while (*tail != 0 && *tail != ':') tail++;
+            const int at_end = (*tail == 0);
+            *tail = 0; /* NUL-terminate the token for MPIDI_atoi() */
+            if (token != tail)
+              MPIDI_atoi(token, &MPIDI_Process.pt2pt.limits_array[i]);
+            if (at_end)
+              break;
+            tail++;
+            token = tail;
+          }
+
+        free (tmp);
+      }
   }
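The PAMID_PT2PT_LIMITS parsing described above can be sketched as a standalone routine (hypothetical name; plain strtoul() stands in for MPIDI_atoi(), so K/M suffixes are not handled here):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse up to eight ':'-delimited limit fields. An empty field leaves
 * the corresponding limit unchanged; parsing stops at end of string. */
static void parse_pt2pt_limits(const char *env, unsigned limits[8])
{
  size_t n = strlen(env);
  char *tmp = (char *) malloc(n + 1);
  char *token, *tail;
  int i;

  memcpy(tmp, env, n);
  tmp[n] = 0;

  token = tail = tmp;
  for (i = 0; i < 8; i++)
    {
      while (*tail != 0 && *tail != ':') tail++;
      int at_end = (*tail == 0);
      *tail = 0;                 /* terminate the current token */
      if (token != tail)         /* empty field: keep the old limit */
        limits[i] = (unsigned) strtoul(token, NULL, 10);
      if (at_end)
        break;
      tail++;
      token = tail;
    }
  free(tmp);
}
```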
 
   /* Set the maximum number of outstanding RDMA requests */
diff --git a/src/mpid/pamid/src/mpidi_util.c b/src/mpid/pamid/src/mpidi_util.c
index b1b2da5..c729d58 100644
--- a/src/mpid/pamid/src/mpidi_util.c
+++ b/src/mpid/pamid/src/mpidi_util.c
@@ -51,7 +51,7 @@ void MPIDI_Set_mpich_env(int rank, int size) {
 
      mpich_env->this_task = rank;
      mpich_env->nprocs  = size;
-     mpich_env->eager_limit=MPIDI_Process.eager_limit;
+     mpich_env->eager_limit=MPIDI_Process.pt2pt.limits.application.eager.remote;
      mpich_env->mp_statistics=MPIDI_Process.mp_statistics;
      if (mpich_env->polling_interval == 0) {
             mpich_env->polling_interval = 400000;
diff --git a/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c b/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c
index 85d7cdc..d5b3985 100644
--- a/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c
+++ b/src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c
@@ -418,152 +418,72 @@ MPIDI_SendMsg(pami_context_t   context,
   sreq->mpid.shm=0;
 #endif
 
-#ifdef WORKAROUND_UNIMPLEMENTED_SEND_IMMEDIATE_OVERFLOW
-  if (isInternal == 0)
-#endif
+  const unsigned isLocal = PAMIX_Task_is_local(dest_tid);
+  const unsigned testImmediate = 1;
+  const unsigned testEager     = 0;
+
+  /*
+   * Always use the short protocol when data_sz is small.
+   */
+  if (likely(data_sz < MPIDI_PT2PT_LIMIT(isInternal,testImmediate,isLocal)))
     {
-      if (unlikely(PAMIX_Task_is_local(dest_tid) != 0))
-        {
-          /*
-           * Always use the short protocol when data_sz is small.
-           */
-          if (likely(data_sz < MPIDI_Process.short_limit))
-            {
-              TRACE_ERR("Sending(short,intranode) bytes=%u (short_limit=%u)\n", data_sz, MPIDI_Process.short_limit);
-              MPIDI_SendMsg_short(context,
-                                  sreq,
-                                  dest,
-                                  sndbuf,
-                                  data_sz,
-                                  isSync);
-            }
-          /*
-           * Use the eager protocol when data_sz is less than the 'local' eager limit.
-           */
-          else if (data_sz < MPIDI_Process.eager_limit_local)
-            {
-              TRACE_ERR("Sending(eager,intranode) bytes=%u (eager_limit_local=%u)\n", data_sz, MPIDI_Process.eager_limit_local);
-              MPIDI_SendMsg_eager(context,
-                                  sreq,
-                                  dest,
-                                  sndbuf,
-                                  data_sz);
-            }
-          /*
-           * Use the default rendezvous protocol (glue implementation that
-           * guarantees no unexpected data).
-           */
-          else
-            {
-              TRACE_ERR("Sending(RZV,intranode) bytes=%u (eager_limit=%u)\n", data_sz, MPIDI_Process.eager_limit);
-#ifdef OUT_OF_ORDER_HANDLING
-              sreq->mpid.shm=1;
-#endif
-              MPIDI_SendMsg_rzv(context,
-                                sreq,
-                                dest,
-                                sndbuf,
-                                data_sz);
-            }
-        }
-      /*
-       * Always use the short protocol when data_sz is small.
-       */
-      else if (likely(data_sz < MPIDI_Process.short_limit))
-        {
-          TRACE_ERR("Sending(short) bytes=%u (eager_limit=%u)\n", data_sz, MPIDI_Process.eager_limit);
-          MPIDI_SendMsg_short(context,
-                              sreq,
-                              dest,
-                              sndbuf,
-                              data_sz,
-                              isSync);
-        }
-      /*
-       * Use the eager protocol when data_sz is less than the eager limit.
-       */
-      else if (data_sz < MPIDI_Process.eager_limit)
-        {
-          TRACE_ERR("Sending(eager) bytes=%u (eager_limit=%u)\n", data_sz, MPIDI_Process.eager_limit);
-          MPIDI_SendMsg_eager(context,
-                              sreq,
-                              dest,
-                              sndbuf,
-                              data_sz);
+      TRACE_ERR("Sending(short%s%s) bytes=%u (short_limit=%u)\n", isInternal==1?",internal":"", isLocal==1?",intranode":"", data_sz, MPIDI_PT2PT_LIMIT(isInternal,testImmediate,isLocal));
+      MPIDI_SendMsg_short(context,
+                          sreq,
+                          dest,
+                          sndbuf,
+                          data_sz,
+                          isSync);
+    }
+  /*
+   * Use the eager protocol when data_sz is less than the eager limit.
+   */
+  else if (data_sz < MPIDI_PT2PT_LIMIT(isInternal,testEager,isLocal))
+    {
+      TRACE_ERR("Sending(eager%s%s) bytes=%u (eager_limit=%u)\n", isInternal==1?",internal":"", isLocal==1?",intranode":"", data_sz, MPIDI_PT2PT_LIMIT(isInternal,testEager,isLocal));
+      MPIDI_SendMsg_eager(context,
+                          sreq,
+                          dest,
+                          sndbuf,
+                          data_sz);
 #ifdef MPIDI_STATISTICS
-          if (MPID_cc_is_complete(&sreq->cc))
-            {
-              MPID_NSTAT(mpid_statp->sendsComplete);
-            }
-#endif
-        }
-      /*
-       * Use the default rendezvous protocol (glue implementation that
-       * guarantees no unexpected data).
-       */
-      else
+      if (!isLocal && MPID_cc_is_complete(&sreq->cc))
         {
-          TRACE_ERR("Sending(RZV) bytes=%u (eager_limit=%u)\n", data_sz, MPIDI_Process.eager_limit);
-          if (likely(data_sz > 0))
-            {
-              MPIDI_SendMsg_rzv(context,
-                                sreq,
-                                dest,
-                                sndbuf,
-                                data_sz);
-            }
-          else
-            {
-              MPIDI_SendMsg_rzv_zerobyte(context, sreq, dest);
-            }
-#ifdef MPIDI_STATISTICS
-          if (MPID_cc_is_complete(&sreq->cc))
-            {
-              MPID_NSTAT(mpid_statp->sendsComplete);
-            }
-#endif
+          MPID_NSTAT(mpid_statp->sendsComplete);
         }
+#endif
     }
-
-#ifdef WORKAROUND_UNIMPLEMENTED_SEND_IMMEDIATE_OVERFLOW
-  /* internal only == no send immediate */
+  /*
+   * Use the default rendezvous protocol implementation that guarantees
+   * no unexpected data and does not complete the send until the remote
+   * receive is posted.
+   */
   else
     {
-      const unsigned eager_limit =
-        PAMIX_Task_is_local(dest_tid)==0?
-          MPIDI_Process.eager_limit:
-          MPIDI_Process.eager_limit_local;
-
-      if (data_sz < eager_limit)
-        {
-          TRACE_ERR("Sending(eager) bytes=%u (eager_limit=%u)\n", data_sz, eager_limit);
-          MPIDI_SendMsg_eager(context,
-                              sreq,
-                              dest,
-                              sndbuf,
-                              data_sz);
-        }
-      else
-        {
-          TRACE_ERR("Sending(RZV) bytes=%u (eager_limit=NA)\n", data_sz);
+      TRACE_ERR("Sending(rendezvous%s%s) bytes=%u (eager_limit=%u)\n", isInternal==1?",internal":"", isLocal==1?",intranode":"", data_sz, MPIDI_PT2PT_LIMIT(isInternal,testEager,isLocal));
 #ifdef OUT_OF_ORDER_HANDLING
-          sreq->mpid.shm=(PAMIX_Task_is_local(dest_tid)==0);
+      sreq->mpid.shm=isLocal;
 #endif
+      if (likely(data_sz > 0))
+        {
           MPIDI_SendMsg_rzv(context,
                             sreq,
                             dest,
                             sndbuf,
                             data_sz);
         }
+      else
+        {
+          MPIDI_SendMsg_rzv_zerobyte(context, sreq, dest);
+        }
 
 #ifdef MPIDI_STATISTICS
-      if (MPID_cc_is_complete(&sreq->cc))
+      if (!isLocal && MPID_cc_is_complete(&sreq->cc))
         {
           MPID_NSTAT(mpid_statp->sendsComplete);
         }
 #endif
     }
-#endif /* WORKAROUND_UNIMPLEMENTED_SEND_IMMEDIATE_OVERFLOW */
 }
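The rewritten MPIDI_SendMsg above reduces protocol choice to two limit lookups. Assuming MPIDI_PT2PT_LIMIT indexes limits_array as 4*isInternal + 2*isImmediate + isLocal (consistent with the 0..7 entry order documented for PAMID_PT2PT_LIMITS), the selection logic can be sketched as:

```c
#include <assert.h>

enum proto { PROTO_SHORT, PROTO_EAGER, PROTO_RZV };

/* Hypothetical stand-in for MPIDI_PT2PT_LIMIT: index the flat limits
 * array by (internal, immediate, local), matching the documented
 * PAMID_PT2PT_LIMITS entry order. */
#define PT2PT_LIMIT(arr, internal, immediate, local) \
  ((arr)[4*(internal) + 2*(immediate) + (local)])

static enum proto choose_protocol(const unsigned limits[8],
                                  unsigned data_sz,
                                  unsigned is_internal,
                                  unsigned is_local)
{
  if (data_sz < PT2PT_LIMIT(limits, is_internal, 1, is_local))
    return PROTO_SHORT;   /* small enough to send immediate */
  if (data_sz < PT2PT_LIMIT(limits, is_internal, 0, is_local))
    return PROTO_EAGER;   /* below the eager limit */
  return PROTO_RZV;       /* rendezvous: no unexpected data */
}
```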
 
 

http://git.mpich.org/mpich.git/commitdiff/6b41fba7799a84ee348551409e19aa18de984c6a

commit 6b41fba7799a84ee348551409e19aa18de984c6a
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Sun Apr 7 12:04:55 2013 -0500

    Utility functions originally from "7 MPI-COM error injection cases core dump with MPICH2"
    
        (ibm) D183554
        (ibm) Rh62qdr
        (ibm) 05023445d12c781486885571b98ad98db9162d98

diff --git a/src/mpid/pamid/src/mpidi_util.c b/src/mpid/pamid/src/mpidi_util.c
index fa9bc63..b1b2da5 100644
--- a/src/mpid/pamid/src/mpidi_util.c
+++ b/src/mpid/pamid/src/mpidi_util.c
@@ -31,6 +31,13 @@
 #include <mpidimpl.h>
 #include "mpidi_util.h"
 
+/* Short hand for sizes */
+#define ONE  (1)
+#define ONEK (1<<10)
+#define ONEM (1<<20)
+#define ONEG (1<<30)
+
+
 
 #if (MPIDI_PRINTENV || MPIDI_STATISTICS || MPIDI_BANNER)
 MPIDI_printenv_t  *mpich_env;
@@ -773,3 +780,368 @@ void MPIDI_print_statistics() {
 }
 
 #endif  /* MPIDI_PRINTENV || MPIDI_STATISTICS         */
+
+/**
+ * \brief Validate whether an lpid is in a given group
+ *
+ * Searches the group lpid list for a match.
+ *
+ * \param[in] lpid  World rank of the node in question
+ * \param[in] grp   Group to validate against
+ * \return TRUE if the lpid is in the group
+ */
+
+int MPIDI_valid_group_rank(int lpid, MPID_Group *grp) {
+        int size = grp->size;
+        int z;
+
+        for (z = 0; z < size &&
+                lpid != grp->lrank_to_lpid[z].lpid; ++z);
+        return (z < size);
+}
+
+/****************************************************************/
+/* function MPIDI_toupper converts a passed string to uppercase */
+/****************************************************************/
+void MPIDI_toupper(char *s)
+{
+   int i;
+   if (s != NULL) {
+      for(i=0;i<strlen(s);i++) s[i] = toupper(s[i]);
+   }
+}
+
+/*
+  -----------------------------------------------------------------
+  Name:           MPIDI_scan_str()
+
+  Function:       Scan a flag string for 2 of the 3 possible
+                  unit characters (K, M, G). Return 1 if neither
+                  character is found; otherwise return the character
+                  as the multiplier, along with a buffer containing
+                  the string without the character.
+
+  Description:    search string for character or end of string
+                  if string contains either entered character
+                    check which char it is, set multiplier
+                  if no chars found, return error
+
+  Parameters:     A0 = MPIDI_scan_str(A1, A2, A3, A4, A5)
+
+                  A1    string to scan                char *
+                  A2    first char to scan for        char
+                  A3    second char to scan for       char
+                  A4    multiplier                    char *
+                  A5    returned string               char *
+
+                  A0    Return Code                   int
+
+
+  Return Codes:   0 OK
+                  1 input chars not found
+  ------------------------------------------------------------
+*/
+int MPIDI_scan_str(char *my_str, char fir_c, char sec_c, char *multiplier, char *tempbuf)
+{
+   int str_ptr;           /*index counter into string*/
+   int found;             /*indicates whether one of input chars found*/
+   int len_my_str;        /*length of string with size and units*/
+
+   str_ptr = 0;           /*start at beginning of string*/
+   found = 0;             /*no chars found yet*/
+
+   len_my_str = strlen(my_str);
+
+   /* first check if all 'characters' of *my_str are digits,  */
+   /* str_ptr points to the first occurrence of a character   */
+   for (str_ptr=0; str_ptr<len_my_str; str_ptr++) {
+      if (str_ptr == 0) {   /* there can be a '+' or a '-' in the first position   */
+                            /* but I do not allow a negative value because there's */
+                            /* no negative amount of memory...                     */
+         if (my_str[0] == '+') {
+            tempbuf[0] = my_str[0];  /* copy sign */
+            /* this is ok but a digit MUST follow */
+            str_ptr++;
+            /* If only a '+' was entered the next character is '\0'. */
+            /* This is not a digit so the error message shows up     */
+         }
+      }
+      if (!isdigit(my_str[str_ptr])) {
+         break;
+      }
+      tempbuf[str_ptr] = my_str[str_ptr]; /* copy to return string */
+   } /* endfor */
+
+   tempbuf[str_ptr] = 0x00;       /* terminate return string, this was NOT done before this modification! */
+
+   if((my_str[str_ptr] == fir_c) || (my_str[str_ptr] == sec_c)) {
+      /*check which char it is, then set multiplier and indicate char found*/
+      switch(my_str[str_ptr]) {
+        case 'K':
+          *multiplier = 'K';
+          found++;
+          break;
+        case 'M':
+          *multiplier = 'M';
+          found++;
+          break;
+        case 'G':
+          *multiplier = 'G';
+          found++;
+          break;
+      }
+  /*    my_str[str_ptr] = 0; */  /*change char in string to end of string char*/
+   }
+  if (found == 0) {             /*if input chars not found, indicate error*/
+    return(1); }
+  else {
+    /* K, M or G should be the last character, something like 64M55 is invalid */
+    if (str_ptr == len_my_str-1) {
+       return(0);                 /*if input chars found, return good status*/
+    } else {
+       /* I only allow a 'B' to follow. This is not documented but reflects the */
+       /* behaviour of earlier poe parsing. 64MB is valid, but after 'B' the    */
+       /* string must end */
+       if (my_str[str_ptr+1] == 'B' && (str_ptr+1) == (len_my_str-1)) {
+          return(0);                 /*if input chars found, return good status*/
+       } else {
+          return(1);
+       } /* endif */
+    } /* endif */
+  }
+}
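The scanning behavior described above (a digit prefix, one unit character, optionally followed by a single 'B') can be sketched independently; this is a simplified stand-in, not the MPIDI_scan_str implementation itself:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Split a size string such as "64M" or "64MB" into its digit prefix
 * and unit character. Returns 0 on success, 1 when no valid unit is
 * found or when anything other than a single 'B' follows the unit. */
static int scan_size_str(const char *s, char *multiplier, char *digits)
{
  size_t i = 0, n = strlen(s), d = 0;

  if (n > 0 && s[0] == '+')
    digits[d++] = s[i++];                  /* a leading '+' is allowed */
  while (i < n && isdigit((unsigned char) s[i]))
    digits[d++] = s[i++];                  /* copy the digit prefix */
  digits[d] = 0;

  if (i >= n || strchr("KMG", s[i]) == NULL)
    return 1;                              /* no unit character found */
  *multiplier = s[i++];
  if (i == n)
    return 0;                              /* "64M" */
  if (s[i] == 'B' && i + 1 == n)
    return 0;                              /* "64MB" is also accepted */
  return 1;                                /* trailing junk: "64M55" */
}
```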
+/*
+  -----------------------------------------------------------------
+  Name:           MPIDI_scan_str3()
+
+  Function:       Scan a flag string for 3 out of 3 possible
+                  characters (K, M, G). Return a 1 if neither
+                  character is found otherwise return the character
+                  along with a buffer containing the string without
+                  the character.
+                  value are valid. If they are valid, the
+                  multiplication of the number and the units
+                  will be returned as an unsigned int. If the
+                  number and units are invalid, a 1 will be returned.
+
+  Description:    search string for character or end of string
+                  if string contains either entered character
+                    check which char it is, set multiplier
+                  if no chars found, return error
+
+  Parameters:     A0 = MPIDI_scan_str3(A1, A2, A3, A4, A5, A6)
+
+                  A1    string to scan                char *
+                  A2    first char to scan for        char
+                  A3    second char to scan for       char
+                  A4    third char to scan for        char
+                  A5    multiplier                    char *
+                  A6    returned string               char *
+
+                  A0    Return Code                   int
+
+  Return Codes:   0 OK
+                  1 input chars not found
+  ------------------------------------------------------------
+*/
+int MPIDI_scan_str3(char *my_str, char fir_c, char sec_c, char thr_c, char *multiplier, char *tempbuf)
+{
+
+   int str_ptr;           /*index counter into string*/
+   int found;             /*indicates whether one of input chars found*/
+   int len_my_str;        /*length of string with size and units*/
+
+   str_ptr = 0;           /*start at beginning of string*/
+   found = 0;             /*no chars found yet*/
+
+   len_my_str = strlen(my_str);
+
+   /* first check if all 'characters' of *my_str are digits,  */
+   /* str_ptr points to the first occurrence of a character   */
+   for (str_ptr=0; str_ptr<len_my_str; str_ptr++) {
+      if (str_ptr == 0) {   /* there can be a '+' or a '-' in the first position   */
+                            /* but I do not allow a negative value because there's */
+                            /* no negative amount of memory...                     */
+         if (my_str[0] == '+') {
+            tempbuf[0] = my_str[0];  /* copy sign */
+            /* this is ok but a digit MUST follow */
+            str_ptr++;
+            /* If only a '+' was entered the next character is '\0'. */
+            /* This is not a digit so the error message shows up     */
+         }
+      }
+      if (!isdigit(my_str[str_ptr])) {
+         break;
+      }
+      tempbuf[str_ptr] = my_str[str_ptr]; /* copy to return string */
+   } /* endfor */
+
+   tempbuf[str_ptr] = 0x00;       /* terminate return string, this was NOT done before this modification! */
+
+   if((my_str[str_ptr] == fir_c) || (my_str[str_ptr] == sec_c) || (my_str[str_ptr] == thr_c)) {
+      /*check which char it is, then set multiplier and indicate char found*/
+      switch(my_str[str_ptr]) {
+        case 'K':
+          *multiplier = 'K';
+          found++;
+          break;
+        case 'M':
+          *multiplier = 'M';
+          found++;
+          break;
+        case 'G':
+          *multiplier = 'G';
+          found++;
+          break;
+      }
+  /*    my_str[str_ptr] = 0; */  /*change char in string to end of string char*/
+   }
+  if (found == 0) {             /*if input chars not found, indicate error*/
+    return(1); }
+  else {
+    /* K, M or G should be the last character, something like 64M55 is invalid */
+    if (str_ptr == len_my_str-1) {
+       return(0);                 /*if input chars found, return good status*/
+    } else {
+       /* I only allow a 'B' to follow. This is not documented but reflects the */
+       /* behaviour of earlier poe parsing. 64MB is valid, but after 'B' the    */
+       /* string must end */
+       if (my_str[str_ptr+1] == 'B' && (str_ptr+1) == (len_my_str-1)) {
+          return(0);                 /*if input chars found, return good status*/
+       } else {
+          return(1);
+       } /* endif */
+    } /* endif */
+  }
+}
+
+/*
+  -----------------------------------------------------------------
+  Name:           MPIDI_checkit()
+
+  Function:       Determine whether a given number and units
+                  value are valid. If they are valid, the
+                  multiplication of the number and the units
+                  will be returned as an unsigned int. If the
+                  number and units are invalid, a 1 will be returned.
+
+  Description:    if units is G
+                    if value is > 4 return error
+                    else multiplier is 1G
+                  else if units is M
+                    if value is > 4K return error
+                    else multiplier is 1M
+                  else if units is K
+                    if value is > 4M return error
+                    else multiplier is 1K
+                  if value < 1 return error
+                  else
+                    multiply value by multiplier
+                    return result
+
+  Parameters:     A0 = MPIDI_checkit(A1, A2, A3)
+
+                  A1    given value                   int
+                  A2    given units                   char *
+                  A3    result                        unsigned int *
+
+                  A0    Return Code                   int
+
+  Return Codes:   0 OK
+                  1 bad value
+  ------------------------------------------------------------
+*/
+int MPIDI_checkit(int myval, char myunits, unsigned int *mygoodval)
+{
+  int multiplier = ONE;             /*units multiplier for entered value*/
+
+  if (myunits == 'G') {             /*if units is G*/
+    if (myval>4) return 1;          /*entered value can't be greater than 4*/
+    else multiplier = ONEG;         /*if OK, mult value by units*/
+  }
+  else if (myunits == 'M') {        /*if units is M*/
+    if (myval > (4*ONEK)) return 1;   /*value can't be > 4096*/
+    else multiplier = ONEM;         /*if OK, mult value by units*/
+  }
+  else if (myunits == 'K') {        /*if units is K*/
+    if (myval > (4*ONEM)) return 1; /*value can't be > 4M*/
+    else multiplier = ONEK;         /*if OK, mult value by units*/
+  }
+  if (myval < 1) return 1;          /*value can't be less than 1*/
+
+  *mygoodval = myval * multiplier;  /*do multiplication*/
+  return 0;                         /*good return*/
+
+}
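The range checks in MPIDI_checkit cap value*multiplier at 4G so the product stays within an unsigned int. A compact sketch of the same guard (hypothetical name, same caps):

```c
#include <assert.h>

#define ONEK (1u << 10)
#define ONEM (1u << 20)
#define ONEG (1u << 30)

/* Validate a (value, units) pair and produce value*units.
 * Returns 0 on success, 1 when the value is out of range. */
static int checkit_sketch(int myval, char myunits, unsigned int *goodval)
{
  unsigned int multiplier = 1;

  if (myunits == 'G') {
    if (myval > 4) return 1;          /* cap the product at 4G */
    multiplier = ONEG;
  } else if (myunits == 'M') {
    if (myval > 4*ONEK) return 1;     /* cap the product at 4G */
    multiplier = ONEM;
  } else if (myunits == 'K') {
    if (myval > 4*ONEM) return 1;     /* cap the product at 4G */
    multiplier = ONEK;
  }
  if (myval < 1) return 1;            /* sizes must be positive */

  *goodval = (unsigned int) myval * multiplier;
  return 0;
}
```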
+
+
+
+ /***************************************************************************
+ Function Name: MPIDI_atoi
+
+ Description:   Convert a string into an integer.  The string can be all
+                digits or include the unit symbols 'K' or 'M'.
+
+ Parameters:    char * -- string to be converted
+                unsigned int  * -- result val (caller to cast to int* or long*)
+
+ Return:        int    0 on success,
+                       nonzero on error.
+ ***************************************************************************/
+int MPIDI_atoi(char* str_in, unsigned int* val)
+{
+   char tempbuf[256];
+   char size_mult;                 /* multiplier for size strings */
+   int  i, tempval;
+   int  letter=0, retval=0;
+
+   /***********************************/
+   /* Check for letter                */
+   /***********************************/
+   for (i=0; i<strlen(str_in); i++) {
+      if (!isdigit(str_in[i])) {
+         letter = 1;
+         break;
+      }
+   }
+   if (!letter) {    /* only digits */
+      errno = 0;     /*  should set errno to 0 before atoi() call */
+      *val = atoi(str_in);
+      if (errno) {   /* no check for negative integer, there's no '-' in the string */
+         retval = errno;
+      }
+   }
+   else {
+      /***********************************/
+      /* Check for K or M.               */
+      /***********************************/
+      MPIDI_toupper(str_in);
+      retval = MPIDI_scan_str(str_in, 'M', 'K', &size_mult, tempbuf);
+
+      if ( retval == 0) {
+         tempval = atoi(tempbuf);
+
+         /***********************************/
+         /* If 0 K or 0 M entered, set to 0 */
+         /* otherwise, do conversion.       */
+         /***********************************/
+         if (tempval != 0)
+            retval = MPIDI_checkit(tempval, size_mult, (unsigned int*)val);
+         else
+            *val = 0;
+      }
+      else
+         *val = 0;
+   }
+
+   return retval;
+}
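Putting the pieces together, the conversion MPIDI_atoi performs (digits with an optional K/M suffix) can be sketched as one simplified routine; this is not the real implementation, which routes through MPIDI_scan_str and MPIDI_checkit:

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Convert "100", "64K", or "2M" to an unsigned byte count.
 * Returns 0 on success, nonzero on a malformed string. */
static int size_atoi(const char *str, unsigned int *val)
{
  char *end;
  unsigned long num = strtoul(str, &end, 10);

  if (end == str)
    return 1;                          /* no digits at all */
  switch (toupper((unsigned char) *end)) {
    case 0:   break;                   /* plain number */
    case 'K': num <<= 10; end++; break;
    case 'M': num <<= 20; end++; break;
    default:  return 1;                /* unknown suffix */
  }
  if (*end != 0)
    return 1;                          /* trailing junk */
  *val = (unsigned int) num;
  return 0;
}
```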

http://git.mpich.org/mpich.git/commitdiff/1fd977b986e704c92304711195b42cbfb0aae30c

commit 1fd977b986e704c92304711195b42cbfb0aae30c
Author: Charles Archer <archerc at us.ibm.com>
Date:   Mon Oct 29 16:35:40 2012 -0400

    Added cross file for 32 bit
    
    (ibm) cd0bf3564d3506c1a2b5dbc69433cc06e5495b08
    
    Signed-off-by: Bob Cernohous <bobc at us.ibm.com>

diff --git a/src/mpid/pamid/cross/pe4 b/src/mpid/pamid/cross/pe4
new file mode 100644
index 0000000..2fc1d9e
--- /dev/null
+++ b/src/mpid/pamid/cross/pe4
@@ -0,0 +1,27 @@
+# begin_generated_IBM_copyright_prolog
+#
+# This is an automatically generated copyright prolog.
+# After initializing,  DO NOT MODIFY OR MOVE
+#  ---------------------------------------------------------------
+# Licensed Materials - Property of IBM
+# Blue Gene/Q 5765-PER 5765-PRP
+#
+# (C) Copyright IBM Corp. 2011, 2012 All Rights Reserved
+# US Government Users Restricted Rights -
+# Use, duplication, or disclosure restricted
+# by GSA ADP Schedule Contract with IBM Corp.
+#
+#  ---------------------------------------------------------------
+#
+# end_generated_IBM_copyright_prolog
+CROSS_F77_SIZEOF_INTEGER=4
+CROSS_F77_SIZEOF_REAL=4
+CROSS_F77_SIZEOF_DOUBLE_PRECISION=8
+CROSS_F90_ADDRESS_KIND=4
+CROSS_F90_OFFSET_KIND=8
+CROSS_F90_INTEGER_KIND=4
+CROSS_F90_REAL_MODEL=6,37
+CROSS_F90_DOUBLE_MODEL=15,307
+CROSS_F90_INTEGER_MODEL_MAP={9,4,4},
+CROSS_F77_TRUE_VALUE=1
+CROSS_F77_FALSE_VALUE=0
\ No newline at end of file

http://git.mpich.org/mpich.git/commitdiff/7e9856376c646160f33a272bb53bac7080c24a97

commit 7e9856376c646160f33a272bb53bac7080c24a97
Author: Charles Archer <archerc at us.ibm.com>
Date:   Thu Oct 25 20:43:58 2012 -0400

    MPI_File_get_position_shared fails in mpich2
    
    A variable could be passed into the I/O routines uninitialized.
    A file read was supposed to write the current value into the
    uninitialized variable, but under certain circumstances the read
    would be 0 bytes, so the read call returned 0 and left the value
    uninitialized.
    
    Simple fix is to always init to zero. This preserves the original
    meaning of a 0 byte read.
    
    (ibm) D187003
    (ibm) 6b32b4c853d24ba8f39ebe030688e3fde6be6db8
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpi/romio/adio/common/ad_get_sh_fp.c b/src/mpi/romio/adio/common/ad_get_sh_fp.c
index 2a6bc5b..d3fc3f5 100644
--- a/src/mpi/romio/adio/common/ad_get_sh_fp.c
+++ b/src/mpi/romio/adio/common/ad_get_sh_fp.c
@@ -28,6 +28,13 @@ void ADIO_Get_shared_fp(ADIO_File fd, int incr, ADIO_Offset *shared_fp,
     ADIO_Offset new_fp;
     MPI_Comm dupcommself;
 
+    /* Set the shared_fp in case this comes from an uninitialized stack variable
+       The read routines will not read into the address of this variable if the file
+       size of a shared pointer is 0, and if incr is always zero, this value will remain
+       uninitialized.  Initialize it here to prevent incorrect values
+    */
+    *shared_fp = 0;
+
 #ifdef ROMIO_NFS
     if (fd->file_system == ADIO_NFS) {
 	ADIOI_NFS_Get_shared_fp(fd, incr, shared_fp, error_code);
@@ -54,7 +61,6 @@ void ADIO_Get_shared_fp(ADIO_File fd, int incr, ADIO_Offset *shared_fp,
 				     MPI_INFO_NULL, 
 				     ADIO_PERM_NULL, error_code);
 	if (*error_code != MPI_SUCCESS) return;
-	*shared_fp = 0;
 	ADIOI_WRITE_LOCK(fd->shared_fp_fd, 0, SEEK_SET, sizeof(ADIO_Offset));
 	ADIO_ReadContig(fd->shared_fp_fd, shared_fp, sizeof(ADIO_Offset), 
 		       MPI_BYTE, ADIO_EXPLICIT_OFFSET, 0, &status, error_code);
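The failure mode the fix addresses is worth seeing in miniature: an output parameter that the callee only conditionally writes must be initialized by the caller, or a "nothing to read" path silently returns stack garbage. A reduced sketch (read_shared_fp() and get_position() are illustrative names, not the ROMIO API):

```c
/* read_shared_fp() writes into *fp only when there is data, mirroring
 * how ADIO_ReadContig can perform a 0-byte read and leave the output
 * parameter untouched. */
static int read_shared_fp(const long *file, int nbytes, long *fp)
{
    if (nbytes == 0)
        return 0;           /* 0-byte read: *fp is never written */
    *fp = *file;
    return nbytes;
}

static long get_position(const long *file, int nbytes)
{
    long fp = 0;            /* the fix: initialize before the read, so a */
                            /* 0-byte read still yields a defined value  */
    (void)read_shared_fp(file, nbytes, &fp);
    return fp;
}
```

Initializing at the top of ADIO_Get_shared_fp (rather than inside one branch, as the removed line did) covers every file-system path through the function, which is why the original `*shared_fp = 0;` deeper in the NFS-less branch could be deleted.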

http://git.mpich.org/mpich.git/commitdiff/87b2176055388b1e7219f76d883996475659c8f5

commit 87b2176055388b1e7219f76d883996475659c8f5
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Tue Oct 23 09:36:51 2012 -0500

    fix bug in the 'is_local_task' extension.
    
    The extension was reporting a non-zero value when the specified task is
    local, but src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c expects a boolean
    value to be used in the point-to-point eager limit lookup.
    
    On bgq, the reported value was '64' which caused the eager limit lookup
    to go off the end of the array.
    
    The pe 'shift value' is 0, which results in the expression
    (1UL << 0) and should be optimized out by the compiler.
    
    (ibm) Issue 8879
    (ibm) 9c643ab134c3dfda3fac6ad5e3f77a51ac805372
    
    Signed-off-by: Charles Archer <archerc at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_platform.h b/src/mpid/pamid/include/mpidi_platform.h
index 48e3eb7..5f5995b 100644
--- a/src/mpid/pamid/include/mpidi_platform.h
+++ b/src/mpid/pamid/include/mpidi_platform.h
@@ -49,7 +49,7 @@
 #define MPIDI_OPTIMIZED_COLLECTIVE_DEFAULT 1
 #define PAMIX_IS_LOCAL_TASK
 #define PAMIX_IS_LOCAL_TASK_STRIDE  (4)
-#define PAMIX_IS_LOCAL_TASK_BITMASK (0x40)
+#define PAMIX_IS_LOCAL_TASK_SHIFT   (6)
 
 #undef ASYNC_PROGRESS_MODE_DEFAULT
 #define ASYNC_PROGRESS_MODE_DEFAULT 1
@@ -74,7 +74,7 @@ static const char _ibm_release_version_[] = "V1R2M0";
 #define MPIDI_NO_ASSERT       1
 #define PAMIX_IS_LOCAL_TASK
 #define PAMIX_IS_LOCAL_TASK_STRIDE  (1)
-#define PAMIX_IS_LOCAL_TASK_BITMASK (0x01)
+#define PAMIX_IS_LOCAL_TASK_SHIFT   (0)
 
 #undef ASYNC_PROGRESS_MODE_DEFAULT
 #define ASYNC_PROGRESS_MODE_DEFAULT 2
diff --git a/src/mpid/pamid/include/pamix.h b/src/mpid/pamid/include/pamix.h
index 08cdfb7..eedbf58 100644
--- a/src/mpid/pamid/include/pamix.h
+++ b/src/mpid/pamid/include/pamix.h
@@ -115,18 +115,18 @@ int PAMIX_Torus2task(size_t coords[], pami_task_t* task_id);
 #endif
 
 #ifdef PAMIX_IS_LOCAL_TASK
-#if defined(PAMIX_IS_LOCAL_TASK_STRIDE) && defined(PAMIX_IS_LOCAL_TASK_BITMASK)
+#if defined(PAMIX_IS_LOCAL_TASK_STRIDE) && defined(PAMIX_IS_LOCAL_TASK_SHIFT)
 #define PAMIX_Task_is_local(task_id)                                           \
-  (PAMIX_IS_LOCAL_TASK_BITMASK &                                               \
+  (((1UL << PAMIX_IS_LOCAL_TASK_SHIFT) &                                       \
     *(PAMIX_Extensions.is_local_task.base +                                    \
-    task_id * PAMIX_IS_LOCAL_TASK_STRIDE))
+    task_id * PAMIX_IS_LOCAL_TASK_STRIDE)) >> PAMIX_IS_LOCAL_TASK_SHIFT)
 #else
 #define PAMIX_Task_is_local(task_id)                                           \
-  (PAMIX_Extensions.is_local_task.base &&                                      \
+  ((PAMIX_Extensions.is_local_task.base &&                                     \
     (PAMIX_Extensions.is_local_task.bitmask &                                  \
       *(PAMIX_Extensions.is_local_task.base +                                  \
-        task_id * PAMIX_Extensions.is_local_task.stride)))
-#endif /* PAMIX_IS_LOCAL_TASK_STRIDE && PAMIX_IS_LOCAL_TASK_BITMASK */
+        task_id * PAMIX_Extensions.is_local_task.stride))) > 0)
+#endif /* PAMIX_IS_LOCAL_TASK_STRIDE && PAMIX_IS_LOCAL_TASK_SHIFT */
 #else
 #define PAMIX_Task_is_local(task_id) (0)
 #endif /* PAMIX_IS_LOCAL_TASK */
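The bug the macro change fixes is a classic mask-versus-boolean confusion: `mask & byte` evaluates to the raw masked value (64 for BG/Q's 0x40 bit), not 0/1, and that 64 was then used as an array index. A minimal sketch using the BG/Q constants from the diff (macro names here are illustrative):

```c
/* Old form: yields the raw masked value, e.g. 64 for a local task. */
#define IS_LOCAL_OLD(byte)  (0x40 & (byte))

/* New form: test the bit, then shift it back down to a clean 0/1. */
#define IS_LOCAL_SHIFT      6
#define IS_LOCAL_NEW(byte)  (((1UL << IS_LOCAL_SHIFT) & (byte)) >> IS_LOCAL_SHIFT)
```

Any expression that normalizes to 0/1 would do (`!!(0x40 & byte)` is the common idiom); expressing the constant as a shift count also lets the pe case, where the shift is 0, collapse to a plain `& 1` at compile time.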

http://git.mpich.org/mpich.git/commitdiff/1cc25348873276e804d1ebc551916dd6f75e5157

commit 1cc25348873276e804d1ebc551916dd6f75e5157
Author: Bob Cernohous <bobc at us.ibm.com>
Date:   Fri Oct 12 16:55:48 2012 -0500

    Handle count 0 allreduce
    
    (ibm) Issue 8805
    (ibm) 57cdcba6d9cd6858ed460f836c69dd052628c950
    
    Signed-off-by: sssharka <sssharka at us.ibm.com>

diff --git a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
index fa723fc..f20842e 100644
--- a/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
+++ b/src/mpid/pamid/src/coll/allreduce/mpido_allreduce.c
@@ -74,11 +74,12 @@ int MPIDO_Allreduce(const void *sendbuf,
    else rc = MPIDI_Datatype_to_pami(dt, &pdt, op, &pop, &mu);
 
   if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
-      fprintf(stderr,"allred rc %u, Datatype %p, op %p, mu %u, selectedvar %u != %u\n",
-              rc, pdt, pop, mu, 
-              (unsigned)comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE],MPID_COLL_USE_MPICH);
+      fprintf(stderr,"allred rc %u,count %d, Datatype %p, op %p, mu %u, selectedvar %u != %u, sendbuf %p, recvbuf %p\n",
+              rc, count, pdt, pop, mu, 
+              (unsigned)comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE],MPID_COLL_USE_MPICH, sendbuf, recvbuf);
       /* convert to metadata query */
-  if(unlikely(rc != MPI_SUCCESS || 
+  /* Punt count 0 allreduce to MPICH. Let them do whatever's 'right' */
+  if(unlikely(rc != MPI_SUCCESS || (count==0) ||
 	      comm_ptr->mpid.user_selected_type[PAMI_XFER_ALLREDUCE] == MPID_COLL_USE_MPICH))
    {
       if(unlikely(MPIDI_Process.verbose >= MPIDI_VERBOSE_DETAILS_ALL && comm_ptr->rank == 0))
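The change folds the zero-count case into the existing "fall back to MPICH" condition rather than teaching the optimized PAMI path about empty reductions. A reduced model of that selection logic (function and enum names are illustrative, not the pamid API):

```c
enum path { PATH_OPTIMIZED, PATH_MPICH };

/* Punt to the generic MPICH implementation whenever datatype/op
 * conversion failed, the count is zero, or MPICH was selected. */
static enum path select_allreduce_path(int rc, int count, int use_mpich)
{
    if (rc != 0 || count == 0 || use_mpich)
        return PATH_MPICH;      /* let MPICH do whatever is 'right' */
    return PATH_OPTIMIZED;
}
```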

http://git.mpich.org/mpich.git/commitdiff/35b7af16e01290c95aa61c8b31ce0acb94f92706

commit 35b7af16e01290c95aa61c8b31ce0acb94f92706
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Tue Oct 16 11:34:18 2012 -0500

    Issue 6292: romio ad_bg bug fix for uninitialized struct
    
    When the communicator size is 1 the pset processing is skipped, and the
    proc structure must be initialized with appropriate values.
    
    (ibm) Issue 6292
    (ibm) 853a826f74e577224dc100ff959b562cc60cbabb
    
    Signed-off-by: Su Huang <suhuang at us.ibm.com>

diff --git a/src/mpi/romio/adio/ad_bg/ad_bg_pset.c b/src/mpi/romio/adio/ad_bg/ad_bg_pset.c
index 6194fd3..14c5ebc 100644
--- a/src/mpi/romio/adio/ad_bg/ad_bg_pset.c
+++ b/src/mpi/romio/adio/ad_bg/ad_bg_pset.c
@@ -100,6 +100,7 @@ ADIOI_BG_persInfo_init(ADIOI_BG_ConfInfo_t *conf,
    if(size == 1)
    {
       proc->iamBridge = 1;
+      proc->bridgeRank = rank;
 
       /* Set up the other parameters */
       proc->myIOSize = size;

http://git.mpich.org/mpich.git/commitdiff/ed96b182c00bea496e4f3acbb6c7471ea288ab93

commit ed96b182c00bea496e4f3acbb6c7471ea288ab93
Author: Su Huang <suhuang at us.ibm.com>
Date:   Tue Oct 9 13:14:19 2012 -0400

    Enable PAMIX extensions to enable/disable interrupts
    
    (ibm) F182395
    (ibm) 03ffc63daea2f65a6e51947043f38c73a02b9d25
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/include/mpidi_datatypes.h b/src/mpid/pamid/include/mpidi_datatypes.h
index 2d37fc3..01653e0 100644
--- a/src/mpid/pamid/include/mpidi_datatypes.h
+++ b/src/mpid/pamid/include/mpidi_datatypes.h
@@ -61,6 +61,7 @@ typedef struct
   unsigned mp_infolevel;
   unsigned mp_statistics;     /* print pamid statistcs data                           */
   unsigned mp_printenv; ;     /* print env data                                       */
+  unsigned mp_interrupts; ;   /* interrupts                                           */
 #endif
 #ifdef RDMA_FAILOVER
   unsigned mp_s_use_pami_get; /* force the PAMI_Get path instead of PAMI_Rget         */
diff --git a/src/mpid/pamid/src/mpidi_env.c b/src/mpid/pamid/src/mpidi_env.c
index 6e3243e..5ce40e0 100644
--- a/src/mpid/pamid/src/mpidi_env.c
+++ b/src/mpid/pamid/src/mpidi_env.c
@@ -842,6 +842,7 @@ MPIDI_Env_setup(int rank, int requested)
       ENV_Char(names, &mpich_env->interrupts);
       if (mpich_env->interrupts == 1)      /* force on  */
       {
+        MPIDI_Process.mp_interrupts=1;
         MPIDI_Process.perobj.context_post.requested = 0;
         MPIDI_Process.async_progress.mode    = ASYNC_PROGRESS_MODE_TRIGGER;
 #if (MPIU_THREAD_GRANULARITY == MPIU_THREAD_GRANULARITY_PER_OBJECT)
diff --git a/src/mpid/pamid/src/mpix/mpix.c b/src/mpid/pamid/src/mpix/mpix.c
index 8b661e3..74da3b5 100644
--- a/src/mpid/pamid/src/mpix/mpix.c
+++ b/src/mpid/pamid/src/mpix/mpix.c
@@ -399,3 +399,181 @@ MPIX_Get_last_algorithm_name(MPI_Comm comm, char *protocol, int length)
 
 
 #endif
+
+#ifdef __PE__
+void mpc_disableintr() __attribute__ ((alias("MPIX_disableintr")));
+void mp_disableintr() __attribute__ ((alias("MPIXF_disableintr")));
+void mp_disableintr_() __attribute__ ((alias("MPIXF_disableintr")));
+void mp_disableintr__() __attribute__ ((alias("MPIXF_disableintr")));
+void mpc_enableintr() __attribute__ ((alias("MPIX_enableintr")));
+void mp_enableintr() __attribute__ ((alias("MPIXF_enableintr")));
+void mp_enableintr_() __attribute__ ((alias("MPIXF_enableintr")));
+void mp_enableintr__() __attribute__ ((alias("MPIXF_enableintr")));
+void mpc_queryintr() __attribute__ ((weak,alias("MPIX_queryintr")));
+void mp_queryintr() __attribute__ ((alias("MPIXF_queryintr")));
+void mp_queryintr_() __attribute__ ((alias("MPIXF_queryintr")));
+void mp_queryintr__() __attribute__ ((alias("MPIXF_queryintr")));
+
+ /***************************************************************************
+ Function Name: MPIX_disableintr
+
+ Description: Call the pamid layer to disable interrupts.
+              (Similar to setting MP_CSS_INTERRUPT to "no")
+
+ Parameters: The Fortran versions have an int* parameter used to pass the
+             return code to the calling program.
+
+ Returns: 0     Success
+         <0     Failure
+ ***************************************************************************/
+
+int
+_MPIDI_disableintr()
+{
+        return(MPIDI_disableintr());
+}
+
+int
+MPIX_disableintr()
+{
+        return(_MPIDI_disableintr());
+}
+
+void
+MPIXF_disableintr(int *rc)
+{
+        *rc = _MPIDI_disableintr();
+}
+
+void
+MPIXF_disableintr_(int *rc)
+{
+        *rc = _MPIDI_disableintr();
+}
+
+/*
+ ** Called by: _mp_disableintr
+ ** Purpose : Disables interrupts
+ */
+int
+MPIDI_disableintr()
+{
+    pami_result_t rc=0;
+    int i;
+
+    MPIR_ERRTEST_INITIALIZED_ORDIE();
+    if (MPIDI_Process.mp_interrupts!= 0)
+       {
+         TRACE_ERR("Async advance beginning...\n");
+         /* Enable async progress on all contexts.*/
+         for (i=0; i<MPIDI_Process.avail_contexts; ++i)
+         {
+             PAMIX_Progress_disable(MPIDI_Context[i], PAMIX_PROGRESS_ALL);
+          }
+         TRACE_ERR("Async advance disabled\n");
+         MPIDI_Process.mp_interrupts=0;
+       }
+    return(rc);
+}
+ /***************************************************************************
+ Function Name: MPIX_enableintr
+
+ Description: Call the pamid-layer function to enable interrupts.
+              (Similar to setting MP_CSS_INTERRUPT to "yes")
+
+ Parameters: The Fortran versions have an int* parameter used to pass the
+             return code to the calling program.
+
+ Returns: 0     Success
+         <0     Failure
+ ***************************************************************************/
+int
+_MPIDI_enableintr()
+{
+       return(MPIDI_enableintr());
+}
+
+/* C callable version           */
+int
+MPIX_enableintr()
+{
+        return(_MPIDI_enableintr());
+}
+
+/* Fortran callable version     */                  
+void 
+MPIXF_enableintr(int *rc)
+{
+        *rc = _MPIDI_enableintr();
+}
+
+/* Fortran callable version for -qEXTNAME support  */
+void 
+MPIXF_enableintr_(int *rc)
+{
+        *rc = _MPIDI_enableintr();
+}
+
+int
+MPIDI_enableintr()
+{
+    pami_result_t rc=0;
+    int i;
+
+    MPIR_ERRTEST_INITIALIZED_ORDIE();
+    if (MPIDI_Process.mp_interrupts == 0)
+       {
+         /* Enable async progress on all contexts.*/
+         for (i=0; i<MPIDI_Process.avail_contexts; ++i)
+         {
+             PAMIX_Progress_enable(MPIDI_Context[i], PAMIX_PROGRESS_ALL);
+          }
+         TRACE_ERR("Async advance enabled\n");
+         MPIDI_Process.mp_interrupts=1;
+       }
+    MPID_assert(rc == PAMI_SUCCESS);
+    return(rc);
+}
+
+ /***************************************************************************
+ Function Name: MPIX_queryintr
+
+ Description: Call the pamid-layer function to determine if
+              interrupts are currently on or off.
+
+ Parameters: The Fortran versions have an int* parameter used to pass the
+             current interrupt setting to the calling program.
+ Returns: 0     Indicates interrupts are currently off
+          1     Indicates interrupts are currently on
+         <0     Failure
+ ***************************************************************************/
+int
+MPIDI_queryintr()
+{
+        return(MPIDI_Process.mp_interrupts);
+}
+
+int
+_MPIDI_queryintr()
+{
+        return(MPIDI_queryintr());
+}
+
+int
+MPIX_queryintr()
+{
+        return(_MPIDI_queryintr());
+}
+
+void
+MPIXF_queryintr(int *rc)
+{
+        *rc = _MPIDI_queryintr();
+}
+
+void
+MPIXF_queryintr_(int *rc)
+{
+        *rc = _MPIDI_queryintr();
+}
+#endif
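The block of `__attribute__ ((alias(...)))` declarations at the top of the diff is how one C entry point is exported under the several names Fortran compilers may emit (bare, single underscore, double underscore). A minimal sketch of that pattern, assuming a GCC-compatible compiler (the function bodies here are simplified placeholders, not the pamid implementations):

```c
static int interrupts_on = 0;

/* C-callable entry point. */
int MPIX_enableintr(void)
{
    interrupts_on = 1;
    return 0;
}

/* Fortran-callable wrapper: return code goes out via an int*. */
void MPIXF_enableintr(int *rc)
{
    *rc = MPIX_enableintr();
}

/* Export the wrapper under the common Fortran mangling variants.
 * GCC's alias attribute requires the target to be defined in this
 * translation unit. */
void mp_enableintr(int *rc)   __attribute__ ((alias("MPIXF_enableintr")));
void mp_enableintr_(int *rc)  __attribute__ ((alias("MPIXF_enableintr")));
void mp_enableintr__(int *rc) __attribute__ ((alias("MPIXF_enableintr")));
```

All four symbols resolve to the same code, so a Fortran caller links against whichever spelling its compiler generates without any per-compiler stub files.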

-----------------------------------------------------------------------

Summary of changes:
 src/include/mpiimpl.h                              |    5 +
 src/mpi/attr/attrutil.c                            |    1 +
 src/mpi/comm/comm_create.c                         |    7 +-
 src/mpi/romio/adio/ad_bg/ad_bg_aggrs.c             |   79 +-
 src/mpi/romio/adio/ad_bg/ad_bg_pset.c              |   26 +-
 .../adio/ad_bglockless/ad_bglockless_features.c    |   19 +
 src/mpi/romio/adio/ad_ufs/ad_ufs_open.c            |   13 +
 src/mpi/romio/adio/common/ad_fstype.c              |   30 +-
 src/mpi/romio/adio/common/ad_get_sh_fp.c           |   16 +-
 src/mpi/romio/adio/common/ad_set_sh_fp.c           |    7 +
 src/mpi/romio/adio/common/cb_config_list.c         |    2 +-
 src/mpi/romio/adio/common/lock.c                   |    2 +-
 src/mpi/romio/adio/include/adio.h                  |    1 +
 src/mpi/romio/adio/include/adioi_fs_proto.h        |    5 +
 src/mpi/romio/configure.ac                         |    6 +-
 src/mpi/romio/mpi-io/get_size.c                    |   11 +
 src/mpid/common/datatype/dataloop/veccpy.h         |    8 +-
 src/mpid/pamid/cross/pe4                           |   27 +
 src/mpid/pamid/include/mpidi_constants.h           |   32 +
 src/mpid/pamid/include/mpidi_datatypes.h           |  114 +-
 src/mpid/pamid/include/mpidi_externs.h             |    2 -
 src/mpid/pamid/include/mpidi_hooks.h               |   10 +-
 src/mpid/pamid/include/mpidi_macros.h              |   20 +-
 src/mpid/pamid/include/mpidi_platform.h            |   94 +-
 src/mpid/pamid/include/mpidi_prototypes.h          |   64 +-
 src/mpid/pamid/include/mpidi_trace.h               |  159 ++
 src/mpid/pamid/include/mpidi_util.h                |    8 +-
 src/mpid/pamid/include/mpidimpl.h                  |  123 ++
 src/mpid/pamid/include/mpidpost.h                  |    5 +
 src/mpid/pamid/include/mpidpre.h                   |    4 +
 src/mpid/pamid/include/mpix.h                      |   91 +
 src/mpid/pamid/include/pamix.h                     |   12 +-
 src/mpid/pamid/src/Makefile.mk                     |    5 +-
 .../pamid/src/coll/allgather/mpido_allgather.c     |  440 ++++--
 .../pamid/src/coll/allgatherv/mpido_allgatherv.c   |  384 ++++-
 .../pamid/src/coll/allreduce/mpido_allreduce.c     |  671 ++++---
 src/mpid/pamid/src/coll/alltoall/mpido_alltoall.c  |  179 ++-
 .../pamid/src/coll/alltoallv/mpido_alltoallv.c     |  203 ++-
 src/mpid/pamid/src/coll/barrier/mpido_barrier.c    |   76 +-
 src/mpid/pamid/src/coll/bcast/mpido_bcast.c        |  274 +++-
 src/mpid/pamid/src/coll/gather/mpido_gather.c      |  305 +++-
 src/mpid/pamid/src/coll/gatherv/mpido_gatherv.c    |  218 ++-
 src/mpid/pamid/src/coll/reduce/mpido_reduce.c      |  245 ++-
 src/mpid/pamid/src/coll/scan/mpido_scan.c          |  186 ++-
 src/mpid/pamid/src/coll/scatter/mpido_scatter.c    |  216 ++-
 src/mpid/pamid/src/coll/scatterv/mpido_scatterv.c  |  221 ++-
 src/mpid/pamid/src/comm/mpid_comm.c                |  176 ++-
 src/mpid/pamid/src/comm/mpid_optcolls.c            | 1497 ++++++++--------
 src/mpid/pamid/src/comm/mpid_selectcolls.c         |  126 ++-
 src/mpid/pamid/src/dyntask/Makefile.mk             |   32 +
 src/mpid/pamid/src/dyntask/mpid_comm_disconnect.c  |  510 ++++++
 .../pamid/src/dyntask/mpid_comm_spawn_multiple.c   |  399 +++++
 src/mpid/pamid/src/dyntask/mpid_port.c             |  293 +++
 src/mpid/pamid/src/dyntask/mpidi_pg.c              | 1061 +++++++++++
 src/mpid/pamid/src/dyntask/mpidi_port.c            | 1772 +++++++++++++++++++
 src/mpid/pamid/src/misc/mpid_abort.c               |    3 +
 src/mpid/pamid/src/misc/mpid_get_universe_size.c   |   66 +-
 src/mpid/pamid/src/misc/mpid_unimpl.c              |    3 +-
 src/mpid/pamid/src/mpid_finalize.c                 |   58 +-
 src/mpid/pamid/src/mpid_init.c                     |  670 +++++++-
 src/mpid/pamid/src/mpid_recvq.c                    |   48 +-
 src/mpid/pamid/src/mpid_recvq.h                    |   13 +
 src/mpid/pamid/src/mpid_request.h                  |   27 +-
 src/mpid/pamid/src/mpid_vc.c                       |  260 +++-
 src/mpid/pamid/src/mpidi_bufmm.c                   |  709 ++++++++
 src/mpid/pamid/src/mpidi_env.c                     |  353 ++++-
 src/mpid/pamid/src/mpidi_nbc_sched.c               |   65 +
 src/mpid/pamid/src/mpidi_util.c                    |  489 +++++-
 src/mpid/pamid/src/mpix/mpix.c                     |  673 +++++++-
 src/mpid/pamid/src/onesided/mpid_win_accumulate.c  |   28 +-
 src/mpid/pamid/src/onesided/mpid_win_fence.c       |   35 +
 src/mpid/pamid/src/onesided/mpid_win_free.c        |   11 +
 src/mpid/pamid/src/onesided/mpid_win_get.c         |   28 +-
 src/mpid/pamid/src/onesided/mpid_win_lock.c        |   25 +-
 src/mpid/pamid/src/onesided/mpid_win_pscw.c        |   67 +-
 src/mpid/pamid/src/onesided/mpid_win_put.c         |   28 +-
 src/mpid/pamid/src/onesided/mpidi_onesided.h       |    4 +
 src/mpid/pamid/src/onesided/mpidi_win_control.c    |   56 +-
 src/mpid/pamid/src/pt2pt/mpidi_callback_eager.c    |   96 +-
 src/mpid/pamid/src/pt2pt/mpidi_callback_rzv.c      |   20 +-
 src/mpid/pamid/src/pt2pt/mpidi_callback_short.c    |   59 +-
 src/mpid/pamid/src/pt2pt/mpidi_callback_util.c     |    2 +-
 src/mpid/pamid/src/pt2pt/mpidi_control.c           |   14 +
 src/mpid/pamid/src/pt2pt/mpidi_done.c              |   19 +-
 src/mpid/pamid/src/pt2pt/mpidi_recv.h              |   95 +-
 src/mpid/pamid/src/pt2pt/mpidi_rendezvous.c        |    2 +-
 src/mpid/pamid/src/pt2pt/mpidi_sendmsg.c           |  359 +++--
 src/mpid/pamid/subconfigure.m4                     |   49 +-
 src/pm/hydra/mpichprereq                           |    4 +-
 src/pmi/pmi2/Makefile.mk                           |   17 +-
 src/pmi/pmi2/README                                |  115 --
 src/pmi/pmi2/poe/Makefile.mk                       |   15 +
 src/pmi/pmi2/poe/poe2pmi.c                         |  451 +++++
 src/pmi/pmi2/poe/subconfigure.m4                   |   24 +
 src/pmi/pmi2/simple/Makefile.mk                    |   21 +
 src/pmi/pmi2/simple/README                         |  115 ++
 src/pmi/pmi2/{ => simple}/pmi2compat.h             |    0
 src/pmi/pmi2/simple/simple2pmi.c                   | 1867 ++++++++++++++++++++
 src/pmi/pmi2/simple/simple2pmi.h                   |  123 ++
 src/pmi/pmi2/simple/simple_pmiutil.c               |  297 ++++
 src/pmi/pmi2/simple/simple_pmiutil.h               |  255 +++
 src/pmi/pmi2/simple/subconfigure.m4                |   86 +
 src/pmi/pmi2/simple2pmi.c                          | 1867 --------------------
 src/pmi/pmi2/simple2pmi.h                          |  123 --
 src/pmi/pmi2/simple_pmiutil.c                      |  297 ----
 src/pmi/pmi2/simple_pmiutil.h                      |  255 ---
 src/pmi/pmi2/subconfigure.m4                       |    4 +
 src/util/param/params.yml                          |    2 +-
 108 files changed, 16117 insertions(+), 4747 deletions(-)
 create mode 100644 src/mpid/pamid/cross/pe4
 create mode 100644 src/mpid/pamid/include/mpidi_trace.h
 create mode 100644 src/mpid/pamid/src/dyntask/Makefile.mk
 create mode 100644 src/mpid/pamid/src/dyntask/mpid_comm_disconnect.c
 create mode 100644 src/mpid/pamid/src/dyntask/mpid_comm_spawn_multiple.c
 create mode 100644 src/mpid/pamid/src/dyntask/mpid_port.c
 create mode 100644 src/mpid/pamid/src/dyntask/mpidi_pg.c
 create mode 100644 src/mpid/pamid/src/dyntask/mpidi_port.c
 create mode 100644 src/mpid/pamid/src/mpidi_bufmm.c
 create mode 100644 src/mpid/pamid/src/mpidi_nbc_sched.c
 delete mode 100644 src/pmi/pmi2/README
 create mode 100644 src/pmi/pmi2/poe/Makefile.mk
 create mode 100644 src/pmi/pmi2/poe/poe2pmi.c
 create mode 100644 src/pmi/pmi2/poe/subconfigure.m4
 create mode 100644 src/pmi/pmi2/simple/Makefile.mk
 create mode 100644 src/pmi/pmi2/simple/README
 rename src/pmi/pmi2/{ => simple}/pmi2compat.h (100%)
 create mode 100644 src/pmi/pmi2/simple/simple2pmi.c
 create mode 100644 src/pmi/pmi2/simple/simple2pmi.h
 create mode 100644 src/pmi/pmi2/simple/simple_pmiutil.c
 create mode 100644 src/pmi/pmi2/simple/simple_pmiutil.h
 create mode 100644 src/pmi/pmi2/simple/subconfigure.m4
 delete mode 100644 src/pmi/pmi2/simple2pmi.c
 delete mode 100644 src/pmi/pmi2/simple2pmi.h
 delete mode 100644 src/pmi/pmi2/simple_pmiutil.c
 delete mode 100644 src/pmi/pmi2/simple_pmiutil.h


hooks/post-receive
-- 
MPICH primary repository


More information about the commits mailing list