[mpich-discuss] resource allocation and multiple mpi_comm_spawn's

Arjen van Elteren info at arjenvanelteren.com
Tue Jun 16 07:43:50 CDT 2015


Hello,

I've patched the hydra process manager (mpich release 3.1) to support
my case (see the attached diff).

It's a bit ugly (I have not prefixed the sort and compare function names,
since they are local to alloc.c), but it works for me and makes the
resource allocation for consecutive MPI_Comm_spawn calls fairer.

All I had to patch was 'utils/alloc/alloc.c' in the hydra source code.

The new procedure for allocating proxies and assigning executables is:

1. create the proxies (NEW: create a proxy for every node, and do not
   stop once the requested number of processes is reached)
2. (NEW) sort the proxies by the number of active processes on their
   node, in increasing order, so the first proxy is on the node with
   the fewest active processes
3. allocate the executables to the proxies
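
For what it's worth, a trivial worker that reports where it runs is
enough to see the effect. This is just an illustration, not part of the
patch, and 'worker.c' is only a name I use here:

  /* worker.c - each spawned worker prints the node it landed on */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      char host[MPI_MAX_PROCESSOR_NAME];
      int rank, len;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Get_processor_name(host, &len);
      printf("worker %d runs on %s\n", rank, host);
      MPI_Finalize();
      return 0;
  }

With the patch applied, two consecutive single-process spawns of this
worker should report node02 and node03 instead of node02 twice.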

Should I create a bug report for this? (I can't find a login button on
the trac website)

Kind regards,

Arjen

On 12-06-15 11:23, Arjen van Elteren wrote:
> Hello,
>
> I'm working with an application that invokes multiple MPI_Comm_spawn calls.
>
> I'm using mpiexec on a cluster without a resource manager or job queue,
> so everything is managed by mpich itself through plain ssh and fork calls.
>
> It looks like mpiexec (both hydra and mpd) re-reads the hostfile from the
> top for every spawn (and does not take already allocated resources or
> used nodes into account).
>
> For example, I have a hostfile like this:
>
> node01:1
> node02:1
> node03:1
>
> When I run a call like this:
>
>  MPI_Comm_spawn(cmd, MPI_ARGV_NULL, number_of_workers,  /* maxprocs, here 2 */
>                 MPI_INFO_NULL, 0,                       /* root rank in MPI_COMM_SELF */
>                 MPI_COMM_SELF, &worker,
>                 MPI_ERRCODES_IGNORE);
>
> I get an allocation like this:
>
> node              process
> ---------------   ------------------
> node01            manager
> node02            worker 1
> node03            worker 2
>
> Which is what I expected.
>
> But when I instead do two calls like this (i.e., each worker has one
> process, but there are two workers in total):
>
>  MPI_Comm_spawn(cmd, MPI_ARGV_NULL, number_of_workers,  /* maxprocs, here 1 */
>                 MPI_INFO_NULL, 0,                       /* root rank in MPI_COMM_SELF */
>                 MPI_COMM_SELF, &worker,
>                 MPI_ERRCODES_IGNORE);
>
>  MPI_Comm_spawn(cmd, MPI_ARGV_NULL, number_of_workers,  /* again 1 */
>                 MPI_INFO_NULL, 0,
>                 MPI_COMM_SELF, &worker,
>                 MPI_ERRCODES_IGNORE);
>
> I get an allocation like this (both hydra and mpd):
>
> node              process
> ---------------   ------------------
> node01            manager
> node02            worker 1 + worker 2
> node03            (idle)
>
> Which is not what I expected at all!
>
> In fact, when I try this with a more complex example, I conclude that the
> hostfile is simply re-interpreted for every MPI_Comm_spawn call and that
> previous allocations in the same application are not taken into account.
>
> I know I could set the "host" info key in the MPI_Comm_spawn call, but then
> I'm moving deployment information into my application (and I don't want to
> recompile or add a command-line argument for something that should be
> handled by mpiexec).
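>
> For completeness, the workaround I'd rather avoid looks roughly like
> this (using the reserved "host" info key for MPI_Comm_spawn; the node
> name is only an example):
>
>  MPI_Info info;
>  MPI_Info_create(&info);
>  MPI_Info_set(info, "host", "node03");   /* pin the worker to node03 */
>  MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 1,
>                 info, 0,
>                 MPI_COMM_SELF, &worker,
>                 MPI_ERRCODES_IGNORE);
>  MPI_Info_free(&info);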
>
> Is there an option or an easy fix for this problem? I looked at the hydra
> code, but I'm unsure how the different proxies and processes divide this
> spawning work between them (I could not easily find one "grand master"
> that does the allocation).
>
> Kind regards,
>
> Arjen
>
>

-------------- next part --------------
--- alloc_org.c	2015-06-16 13:20:30.402577990 +0200
+++ alloc.c	2015-06-16 14:26:58.438649633 +0200
@@ -344,6 +344,59 @@
     goto fn_exit;
 }
 
+
+/**
+ * Sort the proxy list so that the node with the fewest active processes
+ * comes first.
+ */
+static int compare_function(const void *a, const void *b) {
+    struct HYD_proxy **node1 = (struct HYD_proxy **) a, **node2 = (struct HYD_proxy **) b;
+    return (*node1)->node->active_processes - (*node2)->node->active_processes;
+}
+
+HYD_status sort_proxy_list(struct HYD_proxy * input, struct HYD_proxy ** new_head) {    
+    HYD_status status = HYD_SUCCESS;
+    
+    struct HYD_proxy *node= NULL, **list = NULL, *tmp = NULL;
+    int num_nodes = 0, i = 0;
+    
+    if (input->next == NULL) {
+        /* Special case: there is only one proxy, so no sorting required */
+        *new_head = input;
+        return status;
+    }
+    
+    for (node = input; node; node = node->next) {
+        num_nodes++;
+    }
+    
+    
+    HYDU_MALLOC(list, struct HYD_proxy **, num_nodes * sizeof(struct HYD_proxy*), status);
+    //list = (struct HYD_proxy **)malloc(num_nodes * sizeof(struct HYD_proxy *));
+    i = 0;
+    for (node = input; node; node = node->next) {
+        list[i] = node;
+        i++;
+    }
+    qsort(list, num_nodes, sizeof(struct HYD_proxy *), compare_function);
+    tmp = list[0];
+    
+    for (i = 1; i < num_nodes; i++) {
+        tmp->next = list[i];
+        tmp = tmp->next;
+    }
+    tmp->next = NULL;
+    *new_head = list[0];
+    
+    HYDU_FREE(list);
+    
+  fn_exit:
+    return status;
+
+  fn_fail:
+    goto fn_exit;
+}
+
 HYD_status HYDU_create_proxy_list(struct HYD_exec *exec_list, struct HYD_node *node_list,
                                   struct HYD_pg *pg)
 {
@@ -395,10 +448,13 @@
             last_proxy->next = proxy;
         last_proxy = proxy;
 
-        if (allocated_procs >= pg->pg_process_count)
-            break;
+        //if (allocated_procs >= pg->pg_process_count)
+        //    break;
     }
 
+    status = sort_proxy_list(pg->proxy_list, &pg->proxy_list);
+    HYDU_ERR_POP(status, "unable to sort the proxy list\n");
+    
     /* If all proxies have as many filler processes as the number of
      * cores, we can reduce those filler processes */
     if (dummy_fillers)