[mpich-commits] [mpich] MPICH primary repository branch, master, updated. v3.1-260-g6b5993a

Service Account noreply at mpich.org
Thu May 22 09:44:36 CDT 2014


This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "MPICH primary repository".

The branch, master has been updated
       via  6b5993af5cd4aadd6648c024a6b815749c35f8a6 (commit)
      from  98b5e585a61a8eccbd0224b64c66c505fa5ddf0e (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
http://git.mpich.org/mpich.git/commitdiff/6b5993af5cd4aadd6648c024a6b815749c35f8a6

commit 6b5993af5cd4aadd6648c024a6b815749c35f8a6
Author: Su Huang <suhuang at us.ibm.com>
Date:   Thu May 22 09:29:57 2014 -0400

    pamid: task 0 hang in MPI_Init() if MP_PRINTENV=yes
    
    MPIDI_Print_mpenv() calls MPIR_Gather_impl to gather the MP environment variables from
    all tasks in a job, but the errflag variable was not initialized to 0 before its address
    was passed to the routine:
           mpi_errno = MPIR_Gather_impl(&sender, sizeof(MPIDI_printenv_t), MPI_BYTE, gatherer,
                                        sizeof(MPIDI_printenv_t),MPI_BYTE, 0,comm_ptr,
                                        (int *) &errflag);
    
    To process the Gather collective call, each task issued MPIC_Recv, MPIC_Send and MPIC_Wait.
    
    MPIC_Send() sends a message with MPIR_GATHER_TAG (defined as 0x3). Since the routine had a
    non-zero errflag passed in,
    
        if (*errflag && MPIR_CVAR_ENABLE_COLL_FT_RET)
            MPIR_TAG_SET_ERROR_BIT(tag);
    
    bit 30 of the tag (MPIR_TAG_ERROR_BIT, defined as 1 << 30) was set to 1. Therefore, the
    tag was changed from 0x3 to 0x40000003.
    
    On task 1, a message with this modified tag was sent to task 0. By the time the message
    arrived at task 0, the receive with the original tag of 0x3 had already been posted.
    Because the tag of the arrived message differed from the tag of the posted receive,
    no match was found for the message, which was the root cause of the hang.
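
    The tag arithmetic and the resulting mismatch can be sketched in a few lines of C. This is
    only an illustration: the constants 0x3 and (1 << 30) come from the description above, while
    outgoing_tag(), tags_match() and demo_tag_mismatch() are hypothetical helpers, and exact tag
    equality is an assumed simplification of the real matching logic, which is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values quoted in the commit message. */
#define GATHER_TAG    0x3        /* MPIR_GATHER_TAG */
#define TAG_ERROR_BIT (1 << 30)  /* MPIR_TAG_ERROR_BIT */

/* Hypothetical stand-in for the sender-side check: the error bit is
 * OR-ed into the tag when errflag is nonzero and the fault-tolerance
 * setting is enabled. */
static int32_t outgoing_tag(int32_t tag, int errflag, int coll_ft_enabled)
{
    if (errflag && coll_ft_enabled)
        tag |= TAG_ERROR_BIT;
    return tag;
}

/* Assumed simplification: a posted receive matches an arriving message
 * only when the tags are identical. */
static bool tags_match(int32_t posted_tag, int32_t arrived_tag)
{
    return posted_tag == arrived_tag;
}

/* Walks through the failing and the fixed case. */
static void demo_tag_mismatch(void)
{
    /* Arbitrary nonzero value standing in for whatever an
     * uninitialized errflag happened to contain. */
    int stale_errflag = 0x7f12;
    int32_t sent = outgoing_tag(GATHER_TAG, stale_errflag, 1);

    assert(sent == 0x40000003);            /* 0x3 with bit 30 set */
    assert(!tags_match(GATHER_TAG, sent)); /* posted 0x3 never matches */

    /* With errflag initialized to 0, the tag is left untouched. */
    int32_t fixed = outgoing_tag(GATHER_TAG, 0, 1);
    assert(fixed == GATHER_TAG);
    assert(tags_match(GATHER_TAG, fixed));
}
```

    Calling demo_tag_mismatch() from a test program exercises both paths: any nonzero errflag
    produces a tag that the posted receive cannot match, and errflag == 0 restores the match.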
    
    MPIR_TAG_SET_ERROR_BIT was added for MPI 3.0 (pe rbrew and beyond), which explains why
    the job does not fail with prior releases.
    
     (ibm) D197745
    
    Signed-off-by: Michael Blocksome <blocksom at us.ibm.com>

diff --git a/src/mpid/pamid/src/mpidi_util.c b/src/mpid/pamid/src/mpidi_util.c
index aba5bb6..45a36d0 100644
--- a/src/mpid/pamid/src/mpidi_util.c
+++ b/src/mpid/pamid/src/mpidi_util.c
@@ -35,7 +35,7 @@
 #include "mpidi_util.h"
 
 #define PAMI_TUNE_MAX_ITER 2000
-
+#define _DEBUG  1
 /* Short hand for sizes */
 #define ONE  (1)
 #define ONEK (1<<10)
@@ -461,7 +461,7 @@ int MPIDI_Print_mpenv(int rank,int size)
         char *popenptr;
         char tempstr[128];
         int  mpi_errno;
-        int  errflag;
+        int  errflag=0;
 
         MPIDI_Set_mpich_env(rank,size);
         memset(&sender,0,sizeof(MPIDI_printenv_t));

-----------------------------------------------------------------------

Summary of changes:
 src/mpid/pamid/src/mpidi_util.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)


hooks/post-receive
-- 
MPICH primary repository

