[mpich-commits] [mpich] MPICH primary repository branch, master, updated. v3.1.2-113-gf7ad217

Service Account noreply at mpich.org
Mon Aug 25 22:17:45 CDT 2014


This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "MPICH primary repository".

The branch, master has been updated
       via  f7ad217b693fef4c93f43a9e7c3aecc29ae5598d (commit)
       via  7b13db5d8fa60d32e411ec93f821d6e449c54e1d (commit)
       via  a184bd016a1f270b0a1d0242873587fc67780c5b (commit)
       via  cf1240d657e9ff907670e3915fe5b9d36dbbd133 (commit)
       via  92ff146e205da604b1f1015e1b5a26997cf8333b (commit)
       via  07e6da06dbaa35a2f4f51df79d490bb749d2ca56 (commit)
       via  14fd9c432eac9b743ebabaf98b34da214541df12 (commit)
      from  0ddd8086a842302d296f4bdb719b36a33ba284d6 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
http://git.mpich.org/mpich.git/commitdiff/f7ad217b693fef4c93f43a9e7c3aecc29ae5598d

commit f7ad217b693fef4c93f43a9e7c3aecc29ae5598d
Author: Wesley Bland <wbland at anl.gov>
Date:   Thu Aug 21 11:08:41 2014 -0500

    Expand search for comm_ptr
    
    When searching for a corresponding comm_ptr, we should also check the
    node_comm and node_roots_comm if they exist.
    
    Signed-off-by: Huiwei Lu <huiweilu at mcs.anl.gov>

diff --git a/src/mpid/ch3/src/ch3u_comm.c b/src/mpid/ch3/src/ch3u_comm.c
index e39cd89..871b1f2 100644
--- a/src/mpid/ch3/src/ch3u_comm.c
+++ b/src/mpid/ch3/src/ch3u_comm.c
@@ -340,8 +340,10 @@ void MPIDI_CH3I_Comm_find(MPIR_Context_id_t context_id, MPID_Comm **comm)
     MPIDI_FUNC_ENTER(MPIDI_STATE_MPIDI_CH3I_COMM_FIND);
 
     COMM_FOREACH((*comm)) {
-        if ((*comm)->context_id == context_id) {
-            MPIU_DBG_MSG_D(CH3_OTHER,VERBOSE,"Found matching context id: %d", context_id);
+        if ((*comm)->context_id == context_id || ((*comm)->context_id + MPID_CONTEXT_INTRA_COLL) == context_id ||
+            ((*comm)->node_comm && ((*comm)->node_comm->context_id == context_id || ((*comm)->node_comm->context_id + MPID_CONTEXT_INTRA_COLL) == context_id)) ||
+            ((*comm)->node_roots_comm && ((*comm)->node_roots_comm->context_id == context_id || ((*comm)->node_roots_comm->context_id + MPID_CONTEXT_INTRA_COLL) == context_id)) ) {
+            MPIU_DBG_MSG_D(CH3_OTHER,VERBOSE,"Found matching context id: %d", (*comm)->context_id);
             break;
         }
     }

http://git.mpich.org/mpich.git/commitdiff/7b13db5d8fa60d32e411ec93f821d6e449c54e1d

commit 7b13db5d8fa60d32e411ec93f821d6e449c54e1d
Author: Wesley Bland <wbland at anl.gov>
Date:   Thu Aug 21 11:01:18 2014 -0500

    Revoke any node aware "subcommunicators" as well
    
    When revoking a communicator, if it has node aware communicators attached to
    it (as node_comm and node_roots_comm), revoke those as well.
    
    Signed-off-by: Huiwei Lu <huiweilu at mcs.anl.gov>

diff --git a/src/mpid/ch3/src/mpid_comm_revoke.c b/src/mpid/ch3/src/mpid_comm_revoke.c
index 7a10b7d..63993ae 100644
--- a/src/mpid/ch3/src/mpid_comm_revoke.c
+++ b/src/mpid/ch3/src/mpid_comm_revoke.c
@@ -35,6 +35,8 @@ int MPID_Comm_revoke(MPID_Comm *comm_ptr, int is_remote)
     if (0 == comm_ptr->revoked) {
         /* Mark the communicator as revoked locally */
         comm_ptr->revoked = 1;
+        if (comm_ptr->node_comm) comm_ptr->node_comm->revoked = 1;
+        if (comm_ptr->node_roots_comm) comm_ptr->node_roots_comm->revoked = 1;
 
         /* Keep a reference to this comm so it doesn't get destroyed while
          * it's being revoked */

http://git.mpich.org/mpich.git/commitdiff/a184bd016a1f270b0a1d0242873587fc67780c5b

commit a184bd016a1f270b0a1d0242873587fc67780c5b
Author: Wesley Bland <wbland at anl.gov>
Date:   Thu Aug 21 10:19:20 2014 -0500

    Don't put revoked requests into the receive queue
    
    When a message is received, if the communicator has already been revoked, we
    shouldn't bother keeping the message since it's now invalid (unless it's for an
    AGREE or SHRINK request). Instead, just drop the request and return a null
    request to signal the calling function that the request was ignored.
    
    Signed-off-by: Huiwei Lu <huiweilu at mcs.anl.gov>

diff --git a/src/mpid/ch3/channels/nemesis/src/mpid_nem_lmt.c b/src/mpid/ch3/channels/nemesis/src/mpid_nem_lmt.c
index de31188..5c10104 100644
--- a/src/mpid/ch3/channels/nemesis/src/mpid_nem_lmt.c
+++ b/src/mpid/ch3/channels/nemesis/src/mpid_nem_lmt.c
@@ -176,6 +176,14 @@ static int pkt_RTS_handler(MPIDI_VC_t *vc, MPIDI_CH3_Pkt_t *pkt, MPIDI_msg_sz_t
     rreq = MPIDI_CH3U_Recvq_FDP_or_AEU(&rts_pkt->match, &found);
     MPIU_ERR_CHKANDJUMP1(!rreq, mpi_errno,MPI_ERR_OTHER, "**nomemreq", "**nomemuereq %d", MPIDI_CH3U_Recvq_count_unexp());
 
+    /* If the completion counter is 0, that means that the communicator to
+     * which this message is being sent has been revoked and we shouldn't
+     * bother finishing this. */
+    if (!found && rreq->cc == 0) {
+        *rreqp = NULL;
+        goto fn_exit;
+    }
+
     set_request_info(rreq, rts_pkt, MPIDI_REQUEST_RNDV_MSG);
 
     rreq->ch.lmt_req_id = rts_pkt->sender_req_id;
diff --git a/src/mpid/ch3/src/ch3u_eager.c b/src/mpid/ch3/src/ch3u_eager.c
index bbcd179..92440a0 100644
--- a/src/mpid/ch3/src/ch3u_eager.c
+++ b/src/mpid/ch3/src/ch3u_eager.c
@@ -305,6 +305,14 @@ int MPIDI_CH3_PktHandler_EagerShortSend( MPIDI_VC_t *vc, MPIDI_CH3_Pkt_t *pkt,
     rreq = MPIDI_CH3U_Recvq_FDP_or_AEU(&eagershort_pkt->match, &found);
     MPIU_ERR_CHKANDJUMP1(!rreq, mpi_errno,MPI_ERR_OTHER, "**nomemreq", "**nomemuereq %d", MPIDI_CH3U_Recvq_count_unexp());
 
+    /* If the completion counter is 0, that means that the communicator to
+     * which this message is being sent has been revoked and we shouldn't
+     * bother finishing this. */
+    if (!found && rreq->cc == 0) {
+        *rreqp = NULL;
+        goto fn_fail;
+    }
+
     (rreq)->status.MPI_SOURCE = (eagershort_pkt)->match.parts.rank;
     (rreq)->status.MPI_TAG    = (eagershort_pkt)->match.parts.tag;
     MPIR_STATUS_SET_COUNT((rreq)->status, (eagershort_pkt)->data_sz);
@@ -610,6 +618,14 @@ int MPIDI_CH3_PktHandler_EagerSend( MPIDI_VC_t *vc, MPIDI_CH3_Pkt_t *pkt,
 	    
     rreq = MPIDI_CH3U_Recvq_FDP_or_AEU(&eager_pkt->match, &found);
     MPIU_ERR_CHKANDJUMP1(!rreq, mpi_errno,MPI_ERR_OTHER, "**nomemreq", "**nomemuereq %d", MPIDI_CH3U_Recvq_count_unexp());
+
+    /* If the completion counter is 0, that means that the communicator to
+     * which this message is being sent has been revoked and we shouldn't
+     * bother finishing this. */
+    if (!found && rreq->cc == 0) {
+        *rreqp = NULL;
+        goto fn_fail;
+    }
     
     set_request_info(rreq, eager_pkt, MPIDI_REQUEST_EAGER_MSG);
     
@@ -686,6 +702,14 @@ int MPIDI_CH3_PktHandler_ReadySend( MPIDI_VC_t *vc, MPIDI_CH3_Pkt_t *pkt,
 	    
     rreq = MPIDI_CH3U_Recvq_FDP_or_AEU(&ready_pkt->match, &found);
     MPIU_ERR_CHKANDJUMP1(!rreq, mpi_errno,MPI_ERR_OTHER, "**nomemreq", "**nomemuereq %d", MPIDI_CH3U_Recvq_count_unexp());
+
+    /* If the completion counter is 0, that means that the communicator to
+     * which this message is being sent has been revoked and we shouldn't
+     * bother finishing this. */
+    if (!found && rreq->cc == 0) {
+        *rreqp = NULL;
+        goto fn_fail;
+    }
     
     set_request_info(rreq, ready_pkt, MPIDI_REQUEST_EAGER_MSG);
     
diff --git a/src/mpid/ch3/src/ch3u_eagersync.c b/src/mpid/ch3/src/ch3u_eagersync.c
index f88501c..48cd5eb 100644
--- a/src/mpid/ch3/src/ch3u_eagersync.c
+++ b/src/mpid/ch3/src/ch3u_eagersync.c
@@ -235,6 +235,14 @@ int MPIDI_CH3_PktHandler_EagerSyncSend( MPIDI_VC_t *vc, MPIDI_CH3_Pkt_t *pkt,
 	    
     rreq = MPIDI_CH3U_Recvq_FDP_or_AEU(&es_pkt->match, &found);
     MPIU_ERR_CHKANDJUMP1(!rreq, mpi_errno,MPI_ERR_OTHER, "**nomemreq", "**nomemuereq %d", MPIDI_CH3U_Recvq_count_unexp());
+
+    /* If the completion counter is 0, that means that the communicator to
+     * which this message is being sent has been revoked and we shouldn't
+     * bother finishing this. */
+    if (!found && rreq->cc == 0) {
+        *rreqp = NULL;
+        goto fn_fail;
+    }
     
     set_request_info(rreq, es_pkt, MPIDI_REQUEST_EAGER_MSG);
 
diff --git a/src/mpid/ch3/src/ch3u_recvq.c b/src/mpid/ch3/src/ch3u_recvq.c
index 1e1dbb8..f1f2443 100644
--- a/src/mpid/ch3/src/ch3u_recvq.c
+++ b/src/mpid/ch3/src/ch3u_recvq.c
@@ -808,6 +808,26 @@ MPID_Request * MPIDI_CH3U_Recvq_FDP_or_AEU(MPIDI_Message_match * match,
     }
     MPIR_T_PVAR_TIMER_END(RECVQ, time_failed_matching_postedq);
 
+    /* If we didn't match the request, look to see if the communicator is
+     * revoked. If so, just throw this request away since it won't be used
+     * anyway. */
+    {
+        MPID_Comm *comm_ptr;
+        int mpi_errno;
+
+        MPIDI_CH3I_Comm_find(match->parts.context_id, &comm_ptr);
+
+        if (comm_ptr && comm_ptr->revoked && MPIR_TAG_MASK_ERROR_BIT(match->parts.tag) != MPIR_AGREE_TAG &&
+                        comm_ptr->revoked && MPIR_TAG_MASK_ERROR_BIT(match->parts.tag) != MPIR_SHRINK_TAG) {
+            *foundp = FALSE;
+            MPIDI_Request_create_null_rreq( rreq, mpi_errno, found=FALSE;goto lock_exit );
+
+            MPIU_DBG_MSG_FMT(CH3_OTHER, VERBOSE,
+                (MPIU_DBG_FDEST, "RECEIVED MESSAGE FOR REVOKED COMM (tag=%d,src=%d,cid=%d)\n", MPIR_TAG_MASK_ERROR_BIT(match->parts.tag), match->parts.rank, comm_ptr->context_id));
+            return rreq;
+        }
+    }
+
     /* A matching request was not found in the posted queue, so we 
        need to allocate a new request and add it to the unexpected queue */
     {
diff --git a/src/mpid/ch3/src/ch3u_rndv.c b/src/mpid/ch3/src/ch3u_rndv.c
index 4ab7478..6861f44 100644
--- a/src/mpid/ch3/src/ch3u_rndv.c
+++ b/src/mpid/ch3/src/ch3u_rndv.c
@@ -127,6 +127,14 @@ int MPIDI_CH3_PktHandler_RndvReqToSend( MPIDI_VC_t *vc, MPIDI_CH3_Pkt_t *pkt,
     MPIU_THREAD_CS_ENTER(MSGQUEUE,);
     rreq = MPIDI_CH3U_Recvq_FDP_or_AEU(&rts_pkt->match, &found);
     MPIU_ERR_CHKANDJUMP1(!rreq, mpi_errno,MPI_ERR_OTHER, "**nomemreq", "**nomemuereq %d", MPIDI_CH3U_Recvq_count_unexp());
+
+    /* If the completion counter is 0, that means that the communicator to
+     * which this message is being sent has been revoked and we shouldn't
+     * bother finishing this. */
+    if (!found && rreq->cc == 0) {
+        *rreqp = NULL;
+        goto fn_fail;
+    }
     
     set_request_info(rreq, rts_pkt, MPIDI_REQUEST_RNDV_MSG);
 
diff --git a/src/mpid/ch3/src/mpidi_isend_self.c b/src/mpid/ch3/src/mpidi_isend_self.c
index 2961a07..b003e6e 100644
--- a/src/mpid/ch3/src/mpidi_isend_self.c
+++ b/src/mpid/ch3/src/mpidi_isend_self.c
@@ -56,6 +56,16 @@ int MPIDI_Isend_self(const void * buf, int count, MPI_Datatype datatype, int ran
     }
     /* --END ERROR HANDLING-- */
 
+    /* If the completion counter is 0, that means that the communicator to
+     * which this message is being sent has been revoked and we shouldn't
+     * bother finishing this. */
+    if (!found && rreq->cc == 0) {
+        MPIU_Object_set_ref(sreq, 0);
+        MPIDI_CH3_Request_destroy(sreq);
+        sreq = NULL;
+        goto fn_exit;
+    }
+
     MPIDI_Comm_get_vc_set_active(comm, rank, &vc);
     MPIDI_VC_FAI_send_seqnum(vc, seqnum);
     MPIDI_Request_set_seqnum(sreq, seqnum);

http://git.mpich.org/mpich.git/commitdiff/cf1240d657e9ff907670e3915fe5b9d36dbbd133

commit cf1240d657e9ff907670e3915fe5b9d36dbbd133
Author: Wesley Bland <wbland at anl.gov>
Date:   Mon Aug 18 14:06:47 2014 -0500

    Fix error case for MPIDI_Request_create_null_rreq
    
    For some reason, the error case code in MPIDI_Request_create_rreq and
    MPIDI_Request_create_null_rreq was different. This is odd, because both macros
    take FAIL_ as an argument, which is executed directly in the error case of
    create_rreq but not in create_null_rreq. This commit makes the two behave the
    same and updates the only two uses of the macro that existed in the code.
    
    Signed-off-by: Huiwei Lu <huiweilu at mcs.anl.gov>

diff --git a/src/mpid/ch3/include/mpidimpl.h b/src/mpid/ch3/include/mpidimpl.h
index 5e3f82f..b6ee2ef 100644
--- a/src/mpid/ch3/include/mpidimpl.h
+++ b/src/mpid/ch3/include/mpidimpl.h
@@ -400,7 +400,9 @@ extern MPIDI_Process_t MPIDI_Process;
             MPIR_Status_set_procnull(&(rreq_)->status);                    \
         }                                                                  \
         else {                                                             \
-            MPIU_ERR_SETANDJUMP(mpi_errno_,MPI_ERR_OTHER,"**nomemreq");    \
+            MPIU_DBG_MSG(CH3_CHANNEL,TYPICAL,"unable to allocate a request");\
+            (mpi_errno_) = MPIR_ERR_MEMALLOCFAILED;                        \
+            FAIL_;                                                         \
         }                                                                  \
     } while (0)
 
diff --git a/src/mpid/ch3/src/mpid_imrecv.c b/src/mpid/ch3/src/mpid_imrecv.c
index 0acb680..7097fe1 100644
--- a/src/mpid/ch3/src/mpid_imrecv.c
+++ b/src/mpid/ch3/src/mpid_imrecv.c
@@ -22,7 +22,7 @@ int MPID_Imrecv(void *buf, int count, MPI_Datatype datatype,
      * upper level */
     if (message == NULL)
     {
-        MPIDI_Request_create_null_rreq(rreq, mpi_errno, fn_fail);
+        MPIDI_Request_create_null_rreq(rreq, mpi_errno, goto fn_fail);
         *rreqp = rreq;
         goto fn_exit;
     }
diff --git a/src/mpid/ch3/src/mpid_irecv.c b/src/mpid/ch3/src/mpid_irecv.c
index 1c75d20..7a30ad4 100644
--- a/src/mpid/ch3/src/mpid_irecv.c
+++ b/src/mpid/ch3/src/mpid_irecv.c
@@ -27,7 +27,7 @@ int MPID_Irecv(void * buf, int count, MPI_Datatype datatype, int rank, int tag,
 
     if (rank == MPI_PROC_NULL)
     {
-        MPIDI_Request_create_null_rreq(rreq, mpi_errno, fn_fail);
+        MPIDI_Request_create_null_rreq(rreq, mpi_errno, goto fn_fail);
         goto fn_exit;
     }
 

http://git.mpich.org/mpich.git/commitdiff/92ff146e205da604b1f1015e1b5a26997cf8333b

commit 92ff146e205da604b1f1015e1b5a26997cf8333b
Author: Wesley Bland <wbland at anl.gov>
Date:   Thu Aug 21 10:02:53 2014 -0500

    Fix some problems with get_all_failed at larger scales
    
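    After some more testing on fusion, some problems transmitting the failed
    procs bitarray sprang up. This change parenthesizes the bitarray size
    computation, uses uint32_t and MPI_UINT32_T for the bitarray, and replaces
    MPIC_Ssend with MPIC_Send, which seems to solve those problems now.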
    
    Signed-off-by: Huiwei Lu <huiweilu at mcs.anl.gov>

diff --git a/src/mpid/ch3/src/mpid_comm_get_all_failed_procs.c b/src/mpid/ch3/src/mpid_comm_get_all_failed_procs.c
index c8eb7f6..9d29cf6 100644
--- a/src/mpid/ch3/src/mpid_comm_get_all_failed_procs.c
+++ b/src/mpid/ch3/src/mpid_comm_get_all_failed_procs.c
@@ -14,7 +14,7 @@
 /* Generates a bitarray based on orig_comm where all procs in group are marked with 1 */
 static int *group_to_bitarray(MPID_Group *group, MPID_Comm *orig_comm) {
     uint32_t *bitarray, mask;
-    int bitarray_size = orig_comm->local_size / 8 + orig_comm->local_size % 8 ? 1 : 0;
+    int bitarray_size = (orig_comm->local_size / 8) + (orig_comm->local_size % 8 ? 1 : 0);
     int *group_ranks, *comm_ranks, i, index;
 
     bitarray = (int *) MPIU_Malloc(sizeof(int) * bitarray_size);
@@ -46,7 +46,7 @@ static int *group_to_bitarray(MPID_Group *group, MPID_Comm *orig_comm) {
 }
 
 /* Generates an MPID_Group from a bitarray */
-static MPID_Group *bitarray_to_group(MPID_Comm *comm_ptr, int *bitarray)
+static MPID_Group *bitarray_to_group(MPID_Comm *comm_ptr, uint32_t *bitarray)
 {
     MPID_Group *ret_group;
     MPID_Group *comm_group;
@@ -59,7 +59,7 @@ static MPID_Group *bitarray_to_group(MPID_Comm *comm_ptr, int *bitarray)
 
     /* Converts the bitarray into a utarray */
     for (i = 0; i < comm_ptr->local_size; i++) {
-        if (bitarray[i/32] & (0x80000000 >> i % 32)) {
+        if (bitarray[i/32] & (0x80000000 >> (i % 32))) {
             utarray_push_back(ranks_array, &i);
             found++;
         }
@@ -85,8 +85,8 @@ int MPID_Comm_get_all_failed_procs(MPID_Comm *comm_ptr, MPID_Group **failed_grou
 {
     int mpi_errno = MPI_SUCCESS;
     int errflag = 0;
-    int i, j;
-    int *bitarray, *remote_bitarray, bitarray_size;
+    int i, j, bitarray_size;
+    uint32_t *bitarray, *remote_bitarray;
     MPID_Group *local_fail;
     MPIDI_STATE_DECL(MPID_STATE_MPID_COMM_GET_ALL_FAILED_PROCS);
 
@@ -104,26 +104,29 @@ int MPID_Comm_get_all_failed_procs(MPID_Comm *comm_ptr, MPID_Group **failed_grou
 
     /* Generate a bitarray based on the list of failed procs */
     bitarray = group_to_bitarray(local_fail, comm_ptr);
-    bitarray_size = comm_ptr->local_size / 8 + comm_ptr->local_size % 8 ? 1 : 0;
-    remote_bitarray = MPIU_Malloc(sizeof(int) * bitarray_size);
+    bitarray_size = (comm_ptr->local_size / 8) + (comm_ptr->local_size % 8 ? 1 : 0);
+    remote_bitarray = MPIU_Malloc(sizeof(uint32_t) * bitarray_size);
 
     /* For now, this will be implemented as a star with rank 0 serving as
      * the source */
     if (comm_ptr->rank == 0) {
         for (i = 1; i < comm_ptr->local_size; i++) {
             /* Get everyone's list of failed processes to aggregate */
-            mpi_errno = MPIC_Recv(remote_bitarray, bitarray_size, MPI_INT,
+            mpi_errno = MPIC_Recv(remote_bitarray, bitarray_size, MPI_UINT32_T,
                 i, tag, comm_ptr->handle, MPI_STATUS_IGNORE, &errflag);
             if (mpi_errno) continue;
 
             /* Combine the received bitarray with my own */
-            for (j = 0; j < bitarray_size; j++)
-                bitarray[j] |= remote_bitarray[j];
+            for (j = 0; j < bitarray_size; j++) {
+                if (remote_bitarray[j] != 0) {
+                    bitarray[j] |= remote_bitarray[j];
+                }
+            }
         }
 
         for (i = 1; i < comm_ptr->local_size; i++) {
             /* Send the list to each rank to be processed locally */
-            mpi_errno = MPIC_Ssend(bitarray, bitarray_size, MPI_INT, i,
+            mpi_errno = MPIC_Send(bitarray, bitarray_size, MPI_UINT32_T, i,
                 tag, comm_ptr->handle, &errflag);
             if (mpi_errno) errflag = 1;
         }
@@ -132,12 +135,12 @@ int MPID_Comm_get_all_failed_procs(MPID_Comm *comm_ptr, MPID_Group **failed_grou
         *failed_group = bitarray_to_group(comm_ptr, bitarray);
     } else {
         /* Send my bitarray to rank 0 */
-        mpi_errno = MPIC_Ssend(bitarray, bitarray_size, MPI_INT, 0,
+        mpi_errno = MPIC_Send(bitarray, bitarray_size, MPI_UINT32_T, 0,
             tag, comm_ptr->handle, &errflag);
         if (mpi_errno) errflag = 1;
 
         /* Get the resulting bitarray back from rank 0 */
-        mpi_errno = MPIC_Recv(remote_bitarray, bitarray_size, MPI_INT, 0,
+        mpi_errno = MPIC_Recv(remote_bitarray, bitarray_size, MPI_UINT32_T, 0,
             tag, comm_ptr->handle, MPI_STATUS_IGNORE, &errflag);
         if (mpi_errno) errflag = 1;
 

http://git.mpich.org/mpich.git/commitdiff/07e6da06dbaa35a2f4f51df79d490bb749d2ca56

commit 07e6da06dbaa35a2f4f51df79d490bb749d2ca56
Author: Huiwei Lu <huiweilu at mcs.anl.gov>
Date:   Fri Aug 8 15:53:24 2014 -0500

    Adds a canceling case for barrier_smp_intra
    
    barrier_smp_intra completes the barrier in two steps: first within each
    SMP node, then across SMP nodes. It uses an additional node_comm for the
    intra-node barrier.
    
    Requests on this node_comm should also be cancelled inside
    MPIDI_CH3U_Clean_recvq when the communicator is revoked.
    
    Signed-off-by: Wesley Bland <wbland at anl.gov>

diff --git a/src/mpi/comm/commutil.c b/src/mpi/comm/commutil.c
index cae702f..8713a2f 100644
--- a/src/mpi/comm/commutil.c
+++ b/src/mpi/comm/commutil.c
@@ -500,6 +500,7 @@ int MPIR_Comm_commit(MPID_Comm *comm)
             comm->node_comm->comm_kind = MPID_INTRACOMM;
             comm->node_comm->hierarchy_kind = MPID_HIERARCHY_NODE;
             comm->node_comm->local_comm = NULL;
+            MPIU_DBG_MSG_D(CH3_OTHER,VERBOSE,"Create node_comm=%p\n", comm->node_comm);
 
             comm->node_comm->local_size  = num_local;
             comm->node_comm->remote_size = num_local;
diff --git a/src/mpid/ch3/src/ch3u_recvq.c b/src/mpid/ch3/src/ch3u_recvq.c
index eddfe94..1e1dbb8 100644
--- a/src/mpid/ch3/src/ch3u_recvq.c
+++ b/src/mpid/ch3/src/ch3u_recvq.c
@@ -945,6 +945,21 @@ int MPIDI_CH3U_Clean_recvq(MPID_Comm *comm_ptr)
             }
         }
 
+        if (MPIR_CVAR_ENABLE_SMP_COLLECTIVES && MPIR_Comm_is_node_aware(comm_ptr)) {
+            int offset = (comm_ptr->comm_kind == MPID_INTRACOMM) ?  MPID_CONTEXT_INTRA_COLL : MPID_CONTEXT_INTER_COLL;
+            match.parts.context_id = comm_ptr->recvcontext_id + MPID_CONTEXT_INTRANODE_OFFSET + offset;
+
+            if (MATCH_WITH_LEFT_RIGHT_MASK(rreq->dev.match, match, mask)) {
+                if (rreq->dev.match.parts.tag != MPIR_AGREE_TAG && rreq->dev.match.parts.tag != MPIR_SHRINK_TAG) {
+                    MPIU_DBG_MSG_FMT(CH3_OTHER,VERBOSE,(MPIU_DBG_FDEST,
+                                "cleaning up unexpected collective pkt rank=%d tag=%d contextid=%d",
+                                rreq->dev.match.parts.rank, rreq->dev.match.parts.tag, rreq->dev.match.parts.context_id));
+                    dequeue_and_set_error(&rreq, prev_rreq, &recvq_unexpected_head, &recvq_unexpected_tail, &error, MPI_PROC_NULL);
+                    continue;
+                }
+            }
+        }
+
         prev_rreq = rreq;
         rreq = rreq->dev.next;
     }
@@ -959,7 +974,7 @@ int MPIDI_CH3U_Clean_recvq(MPID_Comm *comm_ptr)
 
         if (MATCH_WITH_LEFT_RIGHT_MASK(rreq->dev.match, match, mask)) {
             MPIU_DBG_MSG_FMT(CH3_OTHER,VERBOSE,(MPIU_DBG_FDEST,
-                        "cleaning up unexpected pt2pt pkt rank=%d tag=%d contextid=%d",
+                        "cleaning up posted pt2pt pkt rank=%d tag=%d contextid=%d",
                         rreq->dev.match.parts.rank, rreq->dev.match.parts.tag, rreq->dev.match.parts.context_id));
             dequeue_and_set_error(&rreq, prev_rreq, &recvq_posted_head, &recvq_posted_tail, &error, MPI_PROC_NULL);
             continue;
@@ -970,13 +985,28 @@ int MPIDI_CH3U_Clean_recvq(MPID_Comm *comm_ptr)
         if (MATCH_WITH_LEFT_RIGHT_MASK(rreq->dev.match, match, mask)) {
             if (rreq->dev.match.parts.tag != MPIR_AGREE_TAG && rreq->dev.match.parts.tag != MPIR_SHRINK_TAG) {
                 MPIU_DBG_MSG_FMT(CH3_OTHER,VERBOSE,(MPIU_DBG_FDEST,
-                            "cleaning up unexpected collective pkt rank=%d tag=%d contextid=%d",
+                            "cleaning up posted collective pkt rank=%d tag=%d contextid=%d",
                             rreq->dev.match.parts.rank, rreq->dev.match.parts.tag, rreq->dev.match.parts.context_id));
                 dequeue_and_set_error(&rreq, prev_rreq, &recvq_posted_head, &recvq_posted_tail, &error, MPI_PROC_NULL);
                 continue;
             }
         }
 
+        if (MPIR_CVAR_ENABLE_SMP_COLLECTIVES && MPIR_Comm_is_node_aware(comm_ptr)) {
+            int offset = (comm_ptr->comm_kind == MPID_INTRACOMM) ?  MPID_CONTEXT_INTRA_COLL : MPID_CONTEXT_INTER_COLL;
+            match.parts.context_id = comm_ptr->recvcontext_id + MPID_CONTEXT_INTRANODE_OFFSET + offset;
+
+            if (MATCH_WITH_LEFT_RIGHT_MASK(rreq->dev.match, match, mask)) {
+                if (rreq->dev.match.parts.tag != MPIR_AGREE_TAG && rreq->dev.match.parts.tag != MPIR_SHRINK_TAG) {
+                    MPIU_DBG_MSG_FMT(CH3_OTHER,VERBOSE,(MPIU_DBG_FDEST,
+                                "cleaning up posted collective pkt rank=%d tag=%d contextid=%d",
+                                rreq->dev.match.parts.rank, rreq->dev.match.parts.tag, rreq->dev.match.parts.context_id));
+                    dequeue_and_set_error(&rreq, prev_rreq, &recvq_posted_head, &recvq_posted_tail, &error, MPI_PROC_NULL);
+                    continue;
+                }
+            }
+        }
+
         prev_rreq = rreq;
         rreq = rreq->dev.next;
     }

http://git.mpich.org/mpich.git/commitdiff/14fd9c432eac9b743ebabaf98b34da214541df12

commit 14fd9c432eac9b743ebabaf98b34da214541df12
Author: Wesley Bland <wbland at anl.gov>
Date:   Fri Aug 1 12:49:22 2014 -0500

    Correctly report the error class in receive queue
    
    The receive queue had some hacky ways of reporting errors related to process
    failure that didn't really match up with the way the codes should be returned
    correctly. This patch sets the correct error class in the correct place and
    doesn't require extra logic in dequeue_and_set_error to set the class itself.
    
    This seems to get a couple of the tests to pass in non-debug mode.
    
    Signed-off-by: Huiwei Lu <huiweilu at mcs.anl.gov>

diff --git a/src/mpid/ch3/src/ch3u_recvq.c b/src/mpid/ch3/src/ch3u_recvq.c
index ea8bf34..eddfe94 100644
--- a/src/mpid/ch3/src/ch3u_recvq.c
+++ b/src/mpid/ch3/src/ch3u_recvq.c
@@ -863,13 +863,6 @@ static inline int req_uses_vc(const MPID_Request* req, const MPIDI_VC_t *vc)
 static inline void dequeue_and_set_error(MPID_Request **req,  MPID_Request *prev_req, MPID_Request **head, MPID_Request **tail, int *error, int rank)
 {
     MPID_Request *next = (*req)->dev.next;
-
-    if (*error == MPI_SUCCESS) {
-        if (rank == MPI_PROC_NULL)
-            MPIU_ERR_SET(*error, MPIX_ERR_PROC_FAILED, "**comm_fail");
-        else
-            MPIU_ERR_SET1(*error, MPIX_ERR_PROC_FAILED, "**comm_fail", "**comm_fail %d", rank);
-    }
     
     /* remove from queue */
     if (*head == *req) {
@@ -907,7 +900,7 @@ static inline void dequeue_and_set_error(MPID_Request **req,  MPID_Request *prev
 int MPIDI_CH3U_Clean_recvq(MPID_Comm *comm_ptr)
 {
     int mpi_errno = MPI_SUCCESS;
-    int error = MPIX_ERR_REVOKED;
+    int error = MPI_SUCCESS;
     MPID_Request *rreq, *prev_rreq = NULL;
     MPIDI_Message_match match;
     MPIDI_Message_match mask;
@@ -917,6 +910,8 @@ int MPIDI_CH3U_Clean_recvq(MPID_Comm *comm_ptr)
 
     MPIU_THREAD_CS_ASSERT_HELD(MSGQUEUE);
 
+    MPIU_ERR_SETSIMPLE(error, MPIX_ERR_REVOKED, "**revoked");
+
     rreq = recvq_unexpected_head;
     mask.parts.context_id = ~0;
     mask.parts.rank = mask.parts.tag = 0;
@@ -1004,7 +999,9 @@ int MPIDI_CH3U_Complete_disabled_anysources(void)
 
     MPIDI_FUNC_ENTER(MPID_STATE_MPIDI_CH3U_COMPLETE_DISABLED_ANYSOURCES);
     MPIU_THREAD_CS_ENTER(MSGQUEUE,);
-    
+
+    MPIU_ERR_SETSIMPLE(error, MPIX_ERR_PROC_FAILED_PENDING, "**failure_pending");
+
     /* Check each request in the posted queue, and complete-with-error any
        anysource requests posted on communicators that have disabled
        anysources */
@@ -1044,6 +1041,8 @@ int MPIDI_CH3U_Complete_posted_with_error(MPIDI_VC_t *vc)
 
     MPIU_THREAD_CS_ENTER(MSGQUEUE,);
 
+    MPIU_ERR_SETSIMPLE(error, MPIX_ERR_PROC_FAILED, "**proc_failed");
+
     /* check each req in the posted queue and complete-with-error any requests
        using this VC. */
     req = recvq_posted_head;
diff --git a/test/mpi/ft/testlist b/test/mpi/ft/testlist
index b60859e..3a89979 100644
--- a/test/mpi/ft/testlist
+++ b/test/mpi/ft/testlist
@@ -13,6 +13,6 @@ reduce 4 mpiexecarg=-disable-auto-cleanup resultTest=TestStatusNoErrors strict=f
 bcast 4 mpiexecarg=-disable-auto-cleanup resultTest=TestStatusNoErrors strict=false timeLimit=10 xfail=ticket1945
 scatter 4 mpiexecarg=-disable-auto-cleanup resultTest=TestStatusNoErrors strict=false timeLimit=10 xfail=ticket1945
 anysource 3 mpiexecarg=-disable-auto-cleanup resultTest=TestStatusNoErrors strict=false timeLimit=10 xfail=ticket1945
-revoke_nofail 4 mpiexecarg=-disable-auto-cleanup resultsTest=TestStatusNoErrors strict=false timelimit=10 xfail=ticket1945
-shrink 8 mpiexecarg=-disable-auto-cleanup resultTest=TestStatusNoErrors strict=false timeLimit=10 xfail=ticket1945
+revoke_nofail 4 mpiexecarg=-disable-auto-cleanup resultTest=TestStatusNoErrors strict=false timeLimit=10 xfail=ticket1945
+shrink 8 mpiexecarg=-disable-auto-cleanup resultTest=TestStatusNoErrors strict=false timeLimit=10
 agree 4 mpiexecarg=-disable-auto-cleanup resultTest=TestStatusNoErrors strict=false timeLimit=10 xfail=ticket1945

-----------------------------------------------------------------------

Summary of changes:
 src/mpi/comm/commutil.c                           |    1 +
 src/mpid/ch3/channels/nemesis/src/mpid_nem_lmt.c  |    8 +++
 src/mpid/ch3/include/mpidimpl.h                   |    4 +-
 src/mpid/ch3/src/ch3u_comm.c                      |    6 +-
 src/mpid/ch3/src/ch3u_eager.c                     |   24 +++++++
 src/mpid/ch3/src/ch3u_eagersync.c                 |    8 +++
 src/mpid/ch3/src/ch3u_recvq.c                     |   71 +++++++++++++++++---
 src/mpid/ch3/src/ch3u_rndv.c                      |    8 +++
 src/mpid/ch3/src/mpid_comm_get_all_failed_procs.c |   29 +++++----
 src/mpid/ch3/src/mpid_comm_revoke.c               |    2 +
 src/mpid/ch3/src/mpid_imrecv.c                    |    2 +-
 src/mpid/ch3/src/mpid_irecv.c                     |    2 +-
 src/mpid/ch3/src/mpidi_isend_self.c               |   10 +++
 test/mpi/ft/testlist                              |    4 +-
 14 files changed, 148 insertions(+), 31 deletions(-)


hooks/post-receive
-- 
MPICH primary repository

