[mpich-commits] [mpich] MPICH primary repository branch, master, updated. v3.1-246-g3e57215

Service Account noreply at mpich.org
Thu May 15 15:46:24 CDT 2014


This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "MPICH primary repository".

The branch, master, has been updated
       via  3e57215ed2d5e40ba97805a24fa1ab4112b114e9 (commit)
      from  20ff1b2995ad8cb78f8913152074b71b19787b6f (commit)

Those revisions listed above that are new to this repository have
not appeared in any other notification email, so we list those
revisions in full below.

- Log -----------------------------------------------------------------
http://git.mpich.org/mpich.git/commitdiff/3e57215ed2d5e40ba97805a24fa1ab4112b114e9

commit 3e57215ed2d5e40ba97805a24fa1ab4112b114e9
Author: Michael Blocksome <blocksom at us.ibm.com>
Date:   Tue May 13 16:00:43 2014 -0500

    pamid: force context post on when using 'per-object' locks
    
    The previous code would attempt to run in 'latency optimization mode' if
    the application requested !MPI_THREAD_MULTIPLE yet was still linked
    against a 'per-object' mpich library, which is optimized for throughput.
    This meant that with !MPI_THREAD_MULTIPLE:
    - context post was disabled
    - async progress was disabled
    - using multiple contexts was disabled
    
    This attempt to give users "better" performance in a fundamentally
    flawed configuration would cause the pamid adi to hang while acquiring a
    context lock. For example, consider:
    1. work function is posted to a context
    2. thread acquires the context lock
    3. thread advances the context
    4. work function invoked, then attempts to initiate communication
    5. WITH CONTEXT POST OFF the thread will attempt to acquire the context
       lock before it directly invokes the communication function
    6. HANG (see the sketch below)
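    
    To make the failure concrete, here is a minimal, self-contained sketch
    of the self-deadlock. It is illustrative only: the real PAMI context
    lock is modeled below with a plain non-recursive pthread mutex, and
    work_fn() and initiate_communication() are hypothetical stand-ins, not
    pamid functions.
    
      #include <pthread.h>
    
      /* Stand-in for the non-recursive PAMI context lock. Re-locking a
       * default pthread mutex from the same thread deadlocks on typical
       * implementations, mirroring PAMI_Context_lock() semantics. */
      static pthread_mutex_t context_lock = PTHREAD_MUTEX_INITIALIZER;
    
      /* With context post off, communication is initiated directly on the
       * context, which first requires the context lock. */
      static void initiate_communication(void)
      {
        pthread_mutex_lock(&context_lock);   /* lock already held: step 6, HANG */
        /* ... drive the transfer on the context ... */
        pthread_mutex_unlock(&context_lock);
      }
    
      /* Hypothetical work function previously posted to the context (step 1). */
      static void work_fn(void)
      {
        initiate_communication();            /* step 4 */
      }
    
      int main(void)
      {
        pthread_mutex_lock(&context_lock);   /* step 2: advancing thread locks */
        work_fn();                           /* step 3: advance runs the work */
        pthread_mutex_unlock(&context_lock); /* never reached */
        return 0;
      }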
    
    A complete solution would be to identify all code paths that might
    result in this situation and instead avoid acquiring a lock that is
    already held.
    
    It should be noted that this is a bug in the pamid adi and NOT in the
    PAMI library itself. The PAMI interface is explicit that
    PAMI_Context_lock() is non-recursive, for performance reasons.

diff --git a/src/mpid/pamid/include/mpidi_macros.h b/src/mpid/pamid/include/mpidi_macros.h
index 2720149..b8a1b45 100644
--- a/src/mpid/pamid/include/mpidi_macros.h
+++ b/src/mpid/pamid/include/mpidi_macros.h
@@ -168,23 +168,24 @@ MPIDI_Context_post(pami_context_t       context,
                    void               * cookie)
 {
 #if (MPIU_THREAD_GRANULARITY == MPIU_THREAD_GRANULARITY_PER_OBJECT)
-  if (likely(MPIDI_Process.perobj.context_post.active > 0))
-    {
-      pami_result_t rc;
-      rc = PAMI_Context_post(context, work, fn, cookie);
-      MPID_assert(rc == PAMI_SUCCESS);
-    }
-  else
-    {
-      /* Lock access to the context. For more information see discussion of the
-       * simplifying assumption that the "per object" mpich lock mode does not
-       * expect a completely single threaded run environment in the file
-       * src/mpid/pamid/src/mpid_progress.h
-       */
-       PAMI_Context_lock(context);
-       fn(context, cookie);
-       PAMI_Context_unlock(context);
-    }
+  /* It is possible that a work function posted to a context may attempt to
+   * initiate a communication operation and, if context post were disabled, that
+   * operation would be performed directly on the context BY TAKING A LOCK that
+   * is already held by the thread that is advancing the context. This will
+   * result in a hang.
+   *
+   * A solution would be to identify all code flows where this situation might
+   * occur and then change the code to avoid taking a lock that is already held.
+   *
+   * Another solution is to always disable the "non-context-post" configuration
+   * when compiled with per-object locking. This would only occur if the
+   * application requested !MPI_THREAD_MULTIPLE and the "pretend single threaded
+   * by disabling async progress, context post, and multiple contexts" optimization
+   * was in effect.
+   */
+  pami_result_t rc;
+  rc = PAMI_Context_post(context, work, fn, cookie);
+  MPID_assert(rc == PAMI_SUCCESS);
 #else /* (MPIU_THREAD_GRANULARITY != MPIU_THREAD_GRANULARITY_PER_OBJECT) */
   /*
    * It is not necessary to lock the context before access in the "global"
diff --git a/src/mpid/pamid/src/mpidi_env.c b/src/mpid/pamid/src/mpidi_env.c
index 34f3244..dd1912f 100644
--- a/src/mpid/pamid/src/mpidi_env.c
+++ b/src/mpid/pamid/src/mpidi_env.c
@@ -548,32 +548,27 @@ MPIDI_Env_setup(int rank, int requested)
     char* names[] = {"PAMID_THREAD_MULTIPLE", NULL};
     ENV_Unsigned(names, &value, 1, &found_deprecated_env_var, rank);
 #if (MPIU_THREAD_GRANULARITY == MPIU_THREAD_GRANULARITY_PER_OBJECT)
+    /* Any mpich work function posted to a context that eventually initiates
+     * other communication transfers will hang on a lock attempt if the
+     * 'context post' feature is not enabled. Until this code flow is fixed
+     * the context post must not be disabled.
+     *
+     * See discussion in src/mpid/pamid/include/mpidi_macros.h
+     * -> MPIDI_Context_post()
+     */
+    MPIDI_Process.perobj.context_post.requested = 1;
     if (value == 1)      /* force on  */
     {
-      /* The default value for context post should be enabled only if the
-       * default async progress mode is the 'locked' mode.
-       */
-      MPIDI_Process.perobj.context_post.requested =
-        (ASYNC_PROGRESS_MODE_DEFAULT == ASYNC_PROGRESS_MODE_LOCKED);
-
       MPIDI_Process.async_progress.mode    = ASYNC_PROGRESS_MODE_DEFAULT;
       MPIDI_Process.avail_contexts         = MPIDI_MAX_CONTEXTS;
     }
     else if (value == 0) /* force off */
     {
-      MPIDI_Process.perobj.context_post.requested = 0;
       MPIDI_Process.async_progress.mode    = ASYNC_PROGRESS_MODE_DISABLED;
       MPIDI_Process.avail_contexts         = 1;
     }
     else if (requested != MPI_THREAD_MULTIPLE)
     {
-      /* The PAMID_THREAD_MULTIPLE override was not set, yet the application
-       * requested a thread mode other than MPI_THREAD_MULTIPLE. Therefore,
-       * assume that the application prefers the 'latency optimization' over
-       * the 'throughput optimization' and set the async progress configuration
-       * to disable context post.
-       */
-      MPIDI_Process.perobj.context_post.requested = 0;
 #ifdef BGQ_SUPPORTS_TRIGGER_ASYNC_PROGRESS
       MPIDI_Process.async_progress.mode    = ASYNC_PROGRESS_MODE_TRIGGER;
       MPIDI_Process.avail_contexts         = MPIDI_MAX_CONTEXTS;

-----------------------------------------------------------------------
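
For context, the fix funnels all work through PAMI_Context_post() under
per-object locking. The following minimal sketch shows why posting avoids
the re-lock; the names (struct context, context_post, context_advance,
sample_work) are illustrative stand-ins, not the real PAMI API. The key
point is that the advancing thread already owns the context lock and runs
queued work functions directly, so no second acquisition is attempted.

  #include <pthread.h>
  #include <stddef.h>

  typedef void (*work_fn_t)(void *context, void *cookie);

  /* Hypothetical context with a single-slot post queue for brevity; a real
   * implementation would use a lock-free multi-producer queue. */
  struct context {
    pthread_mutex_t lock;
    work_fn_t       pending_fn;
    void           *pending_cookie;
  };

  /* Posting never touches the context lock; it only enqueues the work. */
  static void context_post(struct context *ctx, work_fn_t fn, void *cookie)
  {
    ctx->pending_cookie = cookie;
    ctx->pending_fn     = fn;
  }

  /* The advancing thread takes the lock once and runs queued work under it,
   * so a work function that initiates communication on this context never
   * needs to re-acquire the (non-recursive) lock. */
  static void context_advance(struct context *ctx)
  {
    pthread_mutex_lock(&ctx->lock);
    if (ctx->pending_fn != NULL) {
      work_fn_t fn = ctx->pending_fn;
      ctx->pending_fn = NULL;
      fn(ctx, ctx->pending_cookie);  /* runs under the held lock */
    }
    pthread_mutex_unlock(&ctx->lock);
  }

  static void sample_work(void *context, void *cookie)
  {
    (void)context; (void)cookie;
    /* Safe place to initiate further communication: the advancing thread
     * holds the lock for us, and we never try to take it again. */
  }

  int main(void)
  {
    struct context ctx;
    pthread_mutex_init(&ctx.lock, NULL);
    ctx.pending_fn = NULL;
    ctx.pending_cookie = NULL;
    context_post(&ctx, sample_work, NULL);
    context_advance(&ctx);  /* executes sample_work under the lock */
    pthread_mutex_destroy(&ctx.lock);
    return 0;
  }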

Summary of changes:
 src/mpid/pamid/include/mpidi_macros.h |   35 +++++++++++++++++----------------
 src/mpid/pamid/src/mpidi_env.c        |   23 ++++++++-------------
 2 files changed, 27 insertions(+), 31 deletions(-)


hooks/post-receive
-- 
MPICH primary repository

