[mpich-discuss] resource allocation and multiple mpi_comm_spawn's

Arjen van Elteren info at arjenvanelteren.com
Fri Jun 12 04:23:25 CDT 2015


Hello,

I'm working with an application that makes multiple MPI_Comm_spawn calls.

I'm using mpiexec on a cluster without a resource manager or job queue,
so everything is handled by MPICH itself via plain ssh and fork calls.

It looks like mpiexec (both hydra and mpd) re-reads the hostfile from the
start for every spawn and does not take already allocated resources/used
nodes into account.

For example I have a hostfile like this:

node01:1
node02:1
node03:1
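
The manager itself is started with a plain mpiexec invocation, roughly
like this (the command line below is only a sketch; "manager" stands for
the driver executable and -f is hydra's hostfile option):

 mpiexec -f hostfile -n 1 ./manager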

When I run a single spawn like this (number_of_workers is 2):

 MPI_Comm_spawn(cmd, MPI_ARGV_NULL, number_of_workers,  /* number_of_workers == 2 */
                MPI_INFO_NULL, 0,                        /* root 0 of MPI_COMM_SELF */
                MPI_COMM_SELF, &worker,
                MPI_ERRCODES_IGNORE);

I get an allocation like this:

node         process
---------    ------------------
node01       manager
node02       worker 1
node03       worker 2

Which is what I expected.
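
For completeness: the worker program behind cmd does nothing special. A
minimal sketch would be the skeleton below, which only connects back to
the manager and exits; the details are not important for the allocation
question.

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);    /* intercommunicator back to the spawning manager */

    /* ... actual work would go here ... */

    if (parent != MPI_COMM_NULL)
        MPI_Comm_disconnect(&parent);
    MPI_Finalize();
    return 0;
}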

But when I instead do 2 calls like this (i.e. number_of_workers is now 1,
so each spawn creates a single worker process and there are 2 workers in
total):

 MPI_Comm_spawn(cmd, MPI_ARGV_NULL, number_of_workers,  /* number_of_workers == 1 */
                MPI_INFO_NULL, 0,
                MPI_COMM_SELF, &worker,
                MPI_ERRCODES_IGNORE);

 MPI_Comm_spawn(cmd, MPI_ARGV_NULL, number_of_workers,
                MPI_INFO_NULL, 0,
                MPI_COMM_SELF, &worker,
                MPI_ERRCODES_IGNORE);

I get an allocation like this (both hydra and mpd):

node         process
---------    ------------------
node01       manager
node02       worker 1 + worker 2
node03       (idle)

Which is not what I expected at all!

In fact, after trying a more complex example, I conclude that the hostfile
is simply re-interpreted from the start for every MPI_Comm_spawn call, and
allocations made by previous spawns in the same application are not
accounted for.

I know I could set the "host" info key in the MPI_Comm_spawn call, but
then I'd be moving deployment information into my application (and I don't
want to recompile or add a command-line argument for something that should
be handled by mpiexec).
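
(To be explicit about the workaround I mean: one can create an MPI_Info
with the standard "host" key and pass it to MPI_Comm_spawn, roughly as
sketched below. The hostname "node02" is just a stand-in for whatever node
the application would have to know about, which is exactly the deployment
knowledge I'd rather keep out of the code.)

 MPI_Info info;
 MPI_Info_create(&info);
 MPI_Info_set(info, "host", "node02");  /* deployment detail hard-coded in the application */

 MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 1,
                info, 0,
                MPI_COMM_SELF, &worker,
                MPI_ERRCODES_IGNORE);

 MPI_Info_free(&info);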

Is there an option or an easy fix for this problem? (I looked at the hydra
code, but I'm unsure how the different proxies and processes divide this
spawning work between them; I could not easily identify one "grand master"
that does the allocation.)

Kind regards,

Arjen


_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

