[mpich-commits] [mpich] MPICH primary repository branch, master, updated. v3.2rc1-13-ge2628ab
Service Account
noreply at mpich.org
Mon Oct 26 09:44:36 CDT 2015
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "MPICH primary repository".
The branch, master has been updated
via e2628abe8dc97b400e290deab0ed9e1b321491f2 (commit)
from 7f2ccc24437f6e3bee25b4212e8695cdbb699360 (commit)
Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.
- Log -----------------------------------------------------------------
http://git.mpich.org/mpich.git/commitdiff/e2628abe8dc97b400e290deab0ed9e1b321491f2
commit e2628abe8dc97b400e290deab0ed9e1b321491f2
Author: Sangmin Seo <sseo at anl.gov>
Date: Tue Oct 20 18:26:52 2015 -0500
Add the use_rmk parameter to HYDT_bsci_launch_procs.
The RMK (resource management kernel) allocates nodes on a system
before a job is launched. If the user does not specify it in the
arguments to the UI process (e.g., mpiexec), it defaults to the same
value as the job launcher (e.g., SLURM or PBS). This works for the
initial launch, but it breaks when a process creates other processes
at run time, e.g., by calling MPI_Comm_spawn(): the job launcher does
not know which node(s) the new processes should use, so it simply
allocates nodes according to its own allocation policy. That can
conflict with the process management policy of the Hydra framework,
which independently chooses the target nodes on which to create proxy
processes and spawn new processes. When the conflict occurs, the
spawned processes cannot communicate correctly.
To resolve this problem, a 'use_rmk' parameter is added to the
HYDT_bsci_launch_procs function. If it is HYD_TRUE, the RMK is used to
allocate nodes; HYD_TRUE is passed when HYDT_bsci_launch_procs is
called from the UI process. If it is HYD_FALSE, the RMK is not used,
regardless of how it was configured; HYD_FALSE is passed from the PMI
spawn functions.
Signed-off-by: Pavan Balaji <balaji at anl.gov>
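Editor's illustration (not part of the commit): the decision that the
SLURM and PBS launchers make after this change can be sketched in
isolation as below. The helper name needs_explicit_node_list and the
numeric values of HYD_TRUE and HYD_FALSE are assumptions made for this
sketch; only the shape of the condition is taken from the diff.

```c
#include <string.h>

/* Stand-ins for Hydra's boolean constants; the values are assumed here. */
#define HYD_FALSE 0
#define HYD_TRUE  1

/*
 * Hypothetical helper mirroring the condition in the patched launchers:
 *
 *     if (use_rmk == HYD_FALSE || strcmp(HYDT_bsci_info.rmk, "slurm"))
 *
 * Returns 1 when the launcher has to be told explicitly which nodes to
 * use (for example through a node list built from the proxy list), and
 * 0 when it can rely on the allocation made by the RMK.
 */
int needs_explicit_node_list(int use_rmk, const char *rmk, const char *launcher)
{
    /* Either the caller forbids the RMK (the spawn path), or the RMK is
     * a different system from the launcher and cannot place the job. */
    return use_rmk == HYD_FALSE || strcmp(rmk, launcher) != 0;
}
```

Under these assumptions, an initial launch with a matching RMK and
launcher (HYD_TRUE, "slurm", "slurm") yields 0, while the spawn path
(HYD_FALSE with the same strings) yields 1, so the nodes that Hydra
picked are passed to the launcher explicitly.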
diff --git a/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c b/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c
index c705c33..304f728 100644
--- a/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c
+++ b/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c
@@ -131,7 +131,7 @@ HYD_status HYD_pmci_launch_procs(void)
control_fd[i] = HYD_FD_UNSET;
status = HYDT_bsci_launch_procs(proxy_stash.strlist, HYD_server_info.pg_list.proxy_list,
- control_fd);
+ HYD_TRUE, control_fd);
HYDU_ERR_POP(status, "launcher cannot launch processes\n");
for (i = 0, proxy = HYD_server_info.pg_list.proxy_list; proxy; proxy = proxy->next, i++)
diff --git a/src/pm/hydra/pm/pmiserv/pmiserv_pmi_v1.c b/src/pm/hydra/pm/pmiserv/pmiserv_pmi_v1.c
index a855582..ba96f84 100644
--- a/src/pm/hydra/pm/pmiserv/pmiserv_pmi_v1.c
+++ b/src/pm/hydra/pm/pmiserv/pmiserv_pmi_v1.c
@@ -591,7 +591,7 @@ static HYD_status fn_spawn(int fd, int pid, int pgid, char *args[])
status = HYD_pmcd_pmi_fill_in_exec_launch_info(pg);
HYDU_ERR_POP(status, "unable to fill in executable arguments\n");
- status = HYDT_bsci_launch_procs(proxy_stash.strlist, pg->proxy_list, NULL);
+ status = HYDT_bsci_launch_procs(proxy_stash.strlist, pg->proxy_list, HYD_FALSE, NULL);
HYDU_ERR_POP(status, "launcher cannot launch processes\n");
{
diff --git a/src/pm/hydra/pm/pmiserv/pmiserv_pmi_v2.c b/src/pm/hydra/pm/pmiserv/pmiserv_pmi_v2.c
index 00e6377..3904164 100644
--- a/src/pm/hydra/pm/pmiserv/pmiserv_pmi_v2.c
+++ b/src/pm/hydra/pm/pmiserv/pmiserv_pmi_v2.c
@@ -706,7 +706,7 @@ static HYD_status fn_spawn(int fd, int pid, int pgid, char *args[])
status = HYD_pmcd_pmi_fill_in_exec_launch_info(pg);
HYDU_ERR_POP(status, "unable to fill in executable arguments\n");
- status = HYDT_bsci_launch_procs(proxy_stash.strlist, pg->proxy_list, NULL);
+ status = HYDT_bsci_launch_procs(proxy_stash.strlist, pg->proxy_list, HYD_FALSE, NULL);
HYDU_ERR_POP(status, "launcher cannot launch processes\n");
{
diff --git a/src/pm/hydra/tools/bootstrap/external/common.h b/src/pm/hydra/tools/bootstrap/external/common.h
index bb20606..2e01351 100644
--- a/src/pm/hydra/tools/bootstrap/external/common.h
+++ b/src/pm/hydra/tools/bootstrap/external/common.h
@@ -20,6 +20,6 @@ int HYDTI_bscd_env_is_avail(const char *env_name);
int HYDTI_bscd_in_env_list(const char *env_name, const char *env_list[]);
HYD_status HYDT_bscd_common_launch_procs(char **args, struct HYD_proxy *proxy_list,
- int *control_fd);
+ int use_rmk, int *control_fd);
#endif /* COMMON_H_INCLUDED */
diff --git a/src/pm/hydra/tools/bootstrap/external/external_common_launch.c b/src/pm/hydra/tools/bootstrap/external/external_common_launch.c
index 7ad0a54..9534d46 100644
--- a/src/pm/hydra/tools/bootstrap/external/external_common_launch.c
+++ b/src/pm/hydra/tools/bootstrap/external/external_common_launch.c
@@ -97,7 +97,8 @@ static HYD_status sge_get_path(char **path)
goto fn_exit;
}
-HYD_status HYDT_bscd_common_launch_procs(char **args, struct HYD_proxy *proxy_list, int *control_fd)
+HYD_status HYDT_bscd_common_launch_procs(char **args, struct HYD_proxy *proxy_list, int use_rmk,
+ int *control_fd)
{
int num_hosts, idx, i, host_idx, fd, exec_idx, offset, lh, len, rc, autofork;
int *pid, *fd_list, *dummy;
diff --git a/src/pm/hydra/tools/bootstrap/external/ll.h b/src/pm/hydra/tools/bootstrap/external/ll.h
index 22d8db7..11876e6 100644
--- a/src/pm/hydra/tools/bootstrap/external/ll.h
+++ b/src/pm/hydra/tools/bootstrap/external/ll.h
@@ -11,7 +11,8 @@
HYD_status HYDTI_bscd_ll_query_node_count(int *count);
-HYD_status HYDT_bscd_ll_launch_procs(char **args, struct HYD_proxy *proxy_list, int *control_fd);
+HYD_status HYDT_bscd_ll_launch_procs(char **args, struct HYD_proxy *proxy_list, int use_rmk,
+ int *control_fd);
HYD_status HYDT_bscd_ll_query_proxy_id(int *proxy_id);
HYD_status HYDT_bscd_ll_query_native_int(int *ret);
HYD_status HYDT_bscd_ll_query_node_list(struct HYD_node **node_list);
diff --git a/src/pm/hydra/tools/bootstrap/external/ll_launch.c b/src/pm/hydra/tools/bootstrap/external/ll_launch.c
index d4dc9c9..04ed4b3 100644
--- a/src/pm/hydra/tools/bootstrap/external/ll_launch.c
+++ b/src/pm/hydra/tools/bootstrap/external/ll_launch.c
@@ -12,7 +12,8 @@
static int fd_stdout, fd_stderr;
-HYD_status HYDT_bscd_ll_launch_procs(char **args, struct HYD_proxy *proxy_list, int *control_fd)
+HYD_status HYDT_bscd_ll_launch_procs(char **args, struct HYD_proxy *proxy_list, int use_rmk,
+ int *control_fd)
{
int idx, i, total_procs, node_count;
int *pid, *fd_list, exec_idx;
diff --git a/src/pm/hydra/tools/bootstrap/external/pbs.h b/src/pm/hydra/tools/bootstrap/external/pbs.h
index bf846ab..f039f8f 100644
--- a/src/pm/hydra/tools/bootstrap/external/pbs.h
+++ b/src/pm/hydra/tools/bootstrap/external/pbs.h
@@ -20,7 +20,8 @@ struct HYDT_bscd_pbs_sys_s {
extern struct HYDT_bscd_pbs_sys_s *HYDT_bscd_pbs_sys;
-HYD_status HYDT_bscd_pbs_launch_procs(char **args, struct HYD_proxy *proxy_list, int *control_fd);
+HYD_status HYDT_bscd_pbs_launch_procs(char **args, struct HYD_proxy *proxy_list, int use_rmk,
+ int *control_fd);
HYD_status HYDT_bscd_pbs_query_env_inherit(const char *env_name, int *ret);
HYD_status HYDT_bscd_pbs_wait_for_completion(int timeout);
HYD_status HYDT_bscd_pbs_launcher_finalize(void);
diff --git a/src/pm/hydra/tools/bootstrap/external/pbs_launch.c b/src/pm/hydra/tools/bootstrap/external/pbs_launch.c
index 1a9196c..27245d8 100644
--- a/src/pm/hydra/tools/bootstrap/external/pbs_launch.c
+++ b/src/pm/hydra/tools/bootstrap/external/pbs_launch.c
@@ -35,7 +35,8 @@ static HYD_status find_pbs_node_id(const char *hostname, int *node_id)
goto fn_exit;
}
-HYD_status HYDT_bscd_pbs_launch_procs(char **args, struct HYD_proxy *proxy_list, int *control_fd)
+HYD_status HYDT_bscd_pbs_launch_procs(char **args, struct HYD_proxy *proxy_list, int use_rmk,
+ int *control_fd)
{
int proxy_count, i, args_count, err, hostid;
struct HYD_proxy *proxy;
@@ -46,7 +47,7 @@ HYD_status HYDT_bscd_pbs_launch_procs(char **args, struct HYD_proxy *proxy_list,
/* If the RMK is not PBS, query for the PBS node list, and convert
* the user-specified node IDs to PBS node IDs */
- if (strcmp(HYDT_bsci_info.rmk, "pbs")) {
+ if (use_rmk == HYD_FALSE || strcmp(HYDT_bsci_info.rmk, "pbs")) {
status = HYDT_bscd_pbs_query_node_list(&pbs_node_list);
HYDU_ERR_POP(status, "error querying PBS node list\n");
}
diff --git a/src/pm/hydra/tools/bootstrap/external/slurm.h b/src/pm/hydra/tools/bootstrap/external/slurm.h
index f4a7c85..25a21f0 100644
--- a/src/pm/hydra/tools/bootstrap/external/slurm.h
+++ b/src/pm/hydra/tools/bootstrap/external/slurm.h
@@ -9,7 +9,8 @@
#include "hydra.h"
-HYD_status HYDT_bscd_slurm_launch_procs(char **args, struct HYD_proxy *proxy_list, int *control_fd);
+HYD_status HYDT_bscd_slurm_launch_procs(char **args, struct HYD_proxy *proxy_list, int use_rmk,
+ int *control_fd);
HYD_status HYDT_bscd_slurm_query_proxy_id(int *proxy_id);
HYD_status HYDT_bscd_slurm_query_native_int(int *ret);
HYD_status HYDT_bscd_slurm_query_node_list(struct HYD_node **node_list);
diff --git a/src/pm/hydra/tools/bootstrap/external/slurm_launch.c b/src/pm/hydra/tools/bootstrap/external/slurm_launch.c
index ad62e21..b88009c 100644
--- a/src/pm/hydra/tools/bootstrap/external/slurm_launch.c
+++ b/src/pm/hydra/tools/bootstrap/external/slurm_launch.c
@@ -59,7 +59,8 @@ static HYD_status proxy_list_to_node_str(struct HYD_proxy *proxy_list, char **no
goto fn_exit;
}
-HYD_status HYDT_bscd_slurm_launch_procs(char **args, struct HYD_proxy *proxy_list, int *control_fd)
+HYD_status HYDT_bscd_slurm_launch_procs(char **args, struct HYD_proxy *proxy_list, int use_rmk,
+ int *control_fd)
{
int num_hosts, idx, i;
int *pid, *fd_list;
@@ -83,7 +84,7 @@ HYD_status HYDT_bscd_slurm_launch_procs(char **args, struct HYD_proxy *proxy_lis
idx = 0;
targs[idx++] = HYDU_strdup(path);
- if (strcmp(HYDT_bsci_info.rmk, "slurm")) {
+ if (use_rmk == HYD_FALSE || strcmp(HYDT_bsci_info.rmk, "slurm")) {
targs[idx++] = HYDU_strdup("--nodelist");
status = proxy_list_to_node_str(proxy_list, &node_list_str);
diff --git a/src/pm/hydra/tools/bootstrap/include/bsci.h b/src/pm/hydra/tools/bootstrap/include/bsci.h
index ac079ad..820e20f 100644
--- a/src/pm/hydra/tools/bootstrap/include/bsci.h
+++ b/src/pm/hydra/tools/bootstrap/include/bsci.h
@@ -51,7 +51,8 @@ struct HYDT_bsci_fns {
/* Launcher functions */
/** \brief Launch processes */
- HYD_status(*launch_procs) (char **args, struct HYD_proxy * proxy_list, int *control_fd);
+ HYD_status(*launch_procs) (char **args, struct HYD_proxy * proxy_list, int use_rmk,
+ int *control_fd);
/** \brief Finalize the bootstrap control device */
HYD_status(*launcher_finalize) (void);
@@ -94,6 +95,7 @@ HYD_status HYDT_bsci_init(const char *rmk, const char *launcher,
*
* \param[in] args Arguments to be used for the launched processes
* \param[in] proxy_list List of proxies to launch
+ * \param[in] use_rmk Force not to use RMK if HYD_FALSE
* \param[out] control_fd Control socket to communicate with the launched process
*
* This function appends a proxy ID to the end of the args list and
@@ -106,8 +108,28 @@ HYD_status HYDT_bsci_init(const char *rmk, const char *launcher,
* perform parallel launches should set the proxy ID string to "-1",
* but allow proxies to query their ID information on each node using
* the HYDT_bsci_query_proxy_id function.
+ *
+ * Background of use_rmk: RMK is used to allocate nodes on a system
+ * before launching a job. If it is not specified in the user
+ * arguments for the UI process (e.g., mpiexec), it is set to the same
+ * as a job launcher (e.g., SLURM or PBS). This works fine for
+ * launching processes on nodes for the first time, but it has
+ * a problem when a process dynamically creates other processes at run
+ * time, e.g., by calling MPI_Comm_spawn(), because the job launcher
+ * does not know which node(s) will be used for new processes and it
+ * just allocates nodes based on its allocation policy. This may
+ * conflict with the process management policy of the Hydra framework,
+ * since the Hydra independently decides target nodes to create a
+ * proxy process and to spawn new processes. When the conflict
+ * happens the spawned processes cannot communicate correctly.
+ * To resolve this problem, a parameter 'use_rmk' is added to this
+ * launch function. If it is HYD_TRUE, RMK is used to allocate nodes.
+ * HYD_TRUE is passed when this function is called from the UI
+ * process. On the other hand, if it is HYD_FALSE, we force not to
+ * use RMK. HYD_FALSE is passed in PMI spawn functions.
*/
-HYD_status HYDT_bsci_launch_procs(char **args, struct HYD_proxy *proxy_list, int *control_fd);
+HYD_status HYDT_bsci_launch_procs(char **args, struct HYD_proxy *proxy_list, int use_rmk,
+ int *control_fd);
/**
diff --git a/src/pm/hydra/tools/bootstrap/persist/persist_client.h b/src/pm/hydra/tools/bootstrap/persist/persist_client.h
index ee2c92b..6ec59f9 100644
--- a/src/pm/hydra/tools/bootstrap/persist/persist_client.h
+++ b/src/pm/hydra/tools/bootstrap/persist/persist_client.h
@@ -12,7 +12,7 @@
#include "persist.h"
HYD_status HYDT_bscd_persist_launch_procs(char **args, struct HYD_proxy *proxy_list,
- int *control_fd);
+ int use_rmk, int *control_fd);
HYD_status HYDT_bscd_persist_wait_for_completion(int timeout);
extern int *HYDT_bscd_persist_control_fd;
diff --git a/src/pm/hydra/tools/bootstrap/persist/persist_launch.c b/src/pm/hydra/tools/bootstrap/persist/persist_launch.c
index c2faae4..b24db99 100644
--- a/src/pm/hydra/tools/bootstrap/persist/persist_launch.c
+++ b/src/pm/hydra/tools/bootstrap/persist/persist_launch.c
@@ -59,7 +59,7 @@ static HYD_status persist_cb(int fd, HYD_event_t events, void *userp)
}
HYD_status HYDT_bscd_persist_launch_procs(char **args, struct HYD_proxy *proxy_list,
- int *control_fd)
+ int use_rmk, int *control_fd)
{
struct HYD_proxy *proxy;
int idx, i;
diff --git a/src/pm/hydra/tools/bootstrap/src/bsci_launch.c b/src/pm/hydra/tools/bootstrap/src/bsci_launch.c
index dfe8c5a..e25f963 100644
--- a/src/pm/hydra/tools/bootstrap/src/bsci_launch.c
+++ b/src/pm/hydra/tools/bootstrap/src/bsci_launch.c
@@ -7,13 +7,14 @@
#include "hydra.h"
#include "bsci.h"
-HYD_status HYDT_bsci_launch_procs(char **args, struct HYD_proxy *proxy_list, int *control_fd)
+HYD_status HYDT_bsci_launch_procs(char **args, struct HYD_proxy *proxy_list, int use_rmk,
+ int *control_fd)
{
HYD_status status = HYD_SUCCESS;
HYDU_FUNC_ENTER();
- status = HYDT_bsci_fns.launch_procs(args, proxy_list, control_fd);
+ status = HYDT_bsci_fns.launch_procs(args, proxy_list, use_rmk, control_fd);
HYDU_ERR_POP(status, "launcher returned error while launching processes\n");
fn_exit:
-----------------------------------------------------------------------
Summary of changes:
src/pm/hydra/pm/pmiserv/pmiserv_pmci.c | 2 +-
src/pm/hydra/pm/pmiserv/pmiserv_pmi_v1.c | 2 +-
src/pm/hydra/pm/pmiserv/pmiserv_pmi_v2.c | 2 +-
src/pm/hydra/tools/bootstrap/external/common.h | 2 +-
.../bootstrap/external/external_common_launch.c | 3 +-
src/pm/hydra/tools/bootstrap/external/ll.h | 3 +-
src/pm/hydra/tools/bootstrap/external/ll_launch.c | 3 +-
src/pm/hydra/tools/bootstrap/external/pbs.h | 3 +-
src/pm/hydra/tools/bootstrap/external/pbs_launch.c | 5 ++-
src/pm/hydra/tools/bootstrap/external/slurm.h | 3 +-
.../hydra/tools/bootstrap/external/slurm_launch.c | 5 ++-
src/pm/hydra/tools/bootstrap/include/bsci.h | 26 ++++++++++++++++++-
.../hydra/tools/bootstrap/persist/persist_client.h | 2 +-
.../hydra/tools/bootstrap/persist/persist_launch.c | 2 +-
src/pm/hydra/tools/bootstrap/src/bsci_launch.c | 5 ++-
15 files changed, 49 insertions(+), 19 deletions(-)
hooks/post-receive
--
MPICH primary repository