[mpich-discuss] mpiexec ignoring host setting in configfile

Alexander Rast alex.rast.technical at gmail.com
Fri Jun 23 10:19:45 CDT 2017

What I need at this point is a better understanding of the implementation
philosophy of MPICH and Hydra because I fear that without it I'm going to
run into a series of endless surprises that will generate questions that
might never have arisen if I'd known what the internal thinking of the
MPICH group was from the outset. Forgive me if some of what I say or ask
for here is quite general, but I'm trying to put MPICH in enough context
that I can reasonably predict how it will behave in various scenarios.

I'm giving you my own thinking here so you have some context on where I'm
coming from, not as an expectation or demand that things should work this
way but as a description of what I'd imagine in the absence of any better
information. It's going to be rather long and I apologise for that but I
think you might need it in order to make sense of what I'm saying.

Also posting this to the mailing list, because if productive information
comes out of this exchange it may be useful for others in future as a sort
of supplementary documentation.

On Thu, Jun 22, 2017 at 6:57 PM, Kenneth Raffenetti <raffenet at mcs.anl.gov>
wrote:

> On 06/22/2017 11:16 AM, Alexander Rast wrote:
>> On Thu, Jun 22, 2017 at 4:14 PM, Kenneth Raffenetti <raffenet at mcs.anl.gov
>> <mailto:raffenet at mcs.anl.gov>> wrote:
>>     -host(s) is a global option. You can only specify it once.
>> ? This doesn't make sense to me. Example 8.13 from the MPI specification
>> has:
>> mpiexec myprog infile1 : myprog infile2 : myprog infile3, corresponding
>> to Form A
>> mpiexec {<above arguments>} : {...} : {...} : ... : { ...}
>> which invokes potentially multiple programs with different arguments (of
>> which host is one)
>> and then mentions that Form B: mpiexec -configfile <filename> has lines
>> of the form separated by the colons in Form A.
>> So there's no reason, I would think, that a Form B file can't have
>> multiple lines with different hosts specified for the same executable.
> The reason is that Hydra does not support it. You are allowed one -host(s)
> argument. If you want to submit a patch to allow more than one instance, we
> will consider it.

I have to concede that this comes as a surprise. I didn't see anything
previously that said only one -host argument is allowed, and if I read
through the MPI specification on mpiexec and the definition of the -host
argument I would definitely come to the opposite conclusion. It seemed
clear that -host would be allowed multiple times for multiple
specifications. Now, the specification of mpiexec is optional, but as I
understood it, if mpiexec is supported then the option settings described
in the specification should operate like an MPI_COMM_SPAWN command. From my
point of view it would have been surprising and unexpected behaviour for
-host to behave differently on the first invocation of MPI_COMM_SPAWN than
it did on subsequent invocations. A given implementation would be free to
choose not to support -host as an option switch, but if this were the case,
I'd expect it to raise an error or at least a warning rather than silently
ignoring the switch. The particular interpretation MPICH seems to have
taken as to how to handle this switch is one that would never have occurred
to me under any circumstances, and as you can see, would never have
occurred to me *even after the fact*, without explicit indication that this
was the path chosen, at which point as you can also see I would still have
responded with utter astonishment. I don't understand the rationale.
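
To make that reading of the specification concrete: under it, each
colon-separated section of a Form A command line simply becomes one line of
a Form B configfile. Here is a minimal sketch of that correspondence. The
function name is hypothetical, and per Ken's reply above Hydra accepts only
one -host(s) argument, so this illustrates the reading of the specification,
not anything Hydra will honour.

```python
def form_a_to_configfile(argv):
    """Turn Form A arguments (sections separated by ':') into configfile text.

    argv is the argument list that would follow 'mpiexec' on the command line.
    Illustrative only: it mirrors the Form A / Form B wording of the MPI
    specification and says nothing about what a given mpiexec accepts.
    """
    lines = []
    current = []
    for token in argv:
        if token == ":":
            # A lone colon closes the current section.
            lines.append(" ".join(current))
            current = []
        else:
            current.append(token)
    lines.append(" ".join(current))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    args = ["-n", "1", "-host", "Shakespeare", "./mpi_io_test", ":",
            "-n", "1", "-host", "Burns", "./mpi_io_test"]
    print(form_a_to_configfile(args), end="")
```

The two lines this produces are exactly the kind of per-host configfile
lines I had assumed would be accepted.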

MPICH supports a different option: a hosts file specified using -f, that
much I understood. The MPI specification says that mpiexec may support any
number of additional implementation-dependent options, whose behaviour is
entirely dependent upon the implementation, but that would simply imply to
me that there is an alternative way to specify hosts, users, process
mappings, etc. Likewise a -hosts option is possible but I would have
considered that an entirely separate case (with possibly different
behaviour) from -host: just because 2 switches look similar doesn't mean
they necessarily are the same. I am now given to understand that in Hydra
the two are synonyms, but that is certainly not what I would have expected
a priori.

One of the problems is that the MPI documentation itself is very complete,
but documentation on implementations is very thin (MPICH is not alone in
this respect), and a lot of the details of MPI are
implementation-dependent. This leaves the user struggling with best guesses
as to what an implementation will do.

Use of a hosts file carries with it its own ambiguities. As I understood
mpiexec as described in the specification for MPI, the -n switch only
corresponds to a request to start that many processes; it doesn't come with
any guarantees. If everything is working according to plan then I would
expect n processes will be started, but I wasn't ready to count on that as
a firm guarantee. This means that the behaviour of the system when either
fewer processes than asked for could be started, or when the host-processes
mapping couldn't be satisfied, is unclear. For instance, suppose I had
-n 10 specified and a hosts file with A:3 B:2 C:3 D:2, but in actual fact
only 7 processes could be started. Would it have been safe to assume that
the resulting mapping was A: 0-2, B: 3-4, C: 5-6? That's one
possibility but not the only one. Or what if for some reason A were only
able to run 2 processes? Would the system remap so that 10 were supported,
e.g. A: 0-1, B: 2-4, C: 5-6, D: 7-9? Or some other combination? Or fail
altogether? Maybe the combinations are manageable in small configurations
but if you had 1000 hosts supporting some number of processes between 0 and
100, for an application distributing a total of 25,000 processes, the
possible combinations become hopeless to calculate. Internally Hydra may
have a deterministic assignment algorithm, but without delving deeply into
the internals this wouldn't be easy to determine.
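
To make the question concrete, here is a minimal sketch of one candidate
rule: fill each host's slots in hosts-file order, and wrap around to the
first host if more processes are requested than there are slots. This is an
assumption for illustration only. I am not claiming it is what Hydra does,
and the function name is hypothetical.

```python
def block_mapping(hosts, n):
    """Assign ranks 0..n-1 to hosts in file order, one block per host.

    hosts is a list of (name, slots) pairs in hosts-file order. If n exceeds
    the total slot count, assignment wraps around to the first host again.
    ASSUMPTION: this is one plausible rule, not Hydra's documented behaviour.
    """
    if n > 0 and sum(slots for _, slots in hosts) <= 0:
        raise ValueError("no slots available")
    mapping = {}
    for name, _ in hosts:
        mapping.setdefault(name, [])
    rank = 0
    while rank < n:
        for name, slots in hosts:
            for _ in range(slots):
                if rank == n:
                    break
                mapping[name].append(rank)
                rank += 1
    return mapping


if __name__ == "__main__":
    hostfile = [("A", 3), ("B", 2), ("C", 3), ("D", 2)]
    print(block_mapping(hostfile, 10))
    print(block_mapping(hostfile, 7))
```

Under this rule the 7-process case gives A: 0-2, B: 3-4, C: 5-6 and leaves D
empty, but a round-robin rule, or one that rebalances when a host offers
fewer slots than listed, would give a different answer from the same inputs.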

At any rate, from the point of view of a user, I'd have considered a fixed
mapping algorithm, where one were at the mercy of what Hydra or any other
process manager decided to do in terms of which process ended up where, to
be an exceptionally onerous restriction. From my point of view it would be
fairly fundamental that the user be able to specify, if they want to,
exactly which process maps to which hardware resource - after all, there's
no way an external process manager can know the details of what hardware
capabilities may be at each node and there may be very strong reasons why
the mapping needs to be as the user wants. This is one of the main reasons
I'd have expected that if the -host option were even allowed it would be
usable in a line-by-line (or colon-separated arglist) mode.

It looks as if by a combination of hosts file and mpiexec options you can
*probably* achieve something similar but this is frightfully implicit. On a
purely personal level I find very implicit specifications - where a
resultant configuration can be *deduced* from the combination of inputs to
the configuration generator, but isn't already explicitly plain in the
inputs themselves - to be confusing and difficult to work with. In this
case you have a result which is dependent not merely on the options passed
to mpiexec and the lines of the hosts file, but on the order of lines
within the hosts file and the order of executable specifications in the
mpiexec invocation. As noted I also have concerns about what happens if the
mapping can't be achieved exactly as the implicit specification suggests
but would be in principle feasible with a different mapping.

So what I'm looking for, again, is some background on the implementation
and expected usage model for MPICH in the sense not of an empirical 'do
this because that's what works' but rather a more top-down 'we imagine
problems cast in this form, and so have configured MPICH to work like this'
so that all these choices can be put into a context I can make sense of.
Maybe you're not the best placed to answer this sort of question. Can
someone else in the discussion list provide any insights?

>>     Further, your configfile does not make much sense since you are
>>     using the same program on both lines. You could rather just do:
>>        mpiexec -n 4 -ppn 2 -hosts Shakespeare,Burns ./mpi_io_test
>> I have not seen the -ppn switch listed anywhere. What does this do? Is
>> this intended to be a 'limit number of processes to n on the hosts
>> specified'?
> $ mpiexec -h | grep ppn
>     -ppn                             processes per node
>> Notwithstanding this isn't what is needed anyway, because what if I want
>> specifically to run at most 2 processes on Shakespeare but 5 on Burns
>> (imagine, for example, that Shakespeare is already loaded with other
>> processes, so I need to limit it for the executable in concern)?
> Irregular mappings like this can be done with a hostfile.
>> In any case the point is, using the same program on both lines to me
>> seems straightforward enough given that there are numerous instances (as
>> ex. 8.13 above) where various switches and options might be different.
>> I note also that -hosts is different from the -host switch listed as
>> pre-defined for mpiexec.
> Hydra supports both -host and -hosts. They mean the same thing.
>> In my case we have a system where the objective is to direct specific
>> instances of an application or possibly several to specific hosts - so that
>> the mapping can be fixed. In the limit this might consist of -n 1 -host <x>
>> <executable> lines, each directing a specific instance of the executable to
>> a specific host.
> Again, this will not work with today's Hydra. I would suggest playing
> around some more with the available options to see if another solution
> might work for your use-case.

Back to the real situation. What we (will) have is a machine with a large
number of compute nodes. Simply by force of numbers these compute nodes are
going to be unreliable: some of them will work, some not, although we do
assume that at actual run-time the list of working nodes can be gained by
discovery from a root process and configured as a static mapping for the
purposes of any given run. The critical point though is that it's only at
system initialisation time, not before, that this information can be known.
A root process will take a problem (i.e. an application) specification and
determine a mapping for the application based on the available hardware
topology, the problem topology, and the predicted patterns of use. It then
will need to run an MPI_COMM_SPAWN_MULTIPLE to distribute the problem to
the working nodes, and we need to make sure the distribution is exactly as
specified or the system simply won't work - problem components will be
distributed differently from the expected topology, large chunks of data
might end up in entirely unexpected places and the application will fall
down almost immediately (deadlock is very likely as well). The main thing I
need to understand is exactly what the MPI_COMM_SPAWN_MULTIPLE will do. As
you can see there are a lot of secondary questions related to things like
where I/O goes, what user is on which host, Fortran support etc. but these
are only sub-problems of the main issue which is getting to grips with

As you may be able to see, using a hosts file would be awkward at best,
because we're not going to know until application startup what the hosts
would actually be. If it really is the case that this is the only way to
map processes to hosts, the best solution I can come up with is a 2-stage
program that does discovery and then pipes the result into a second stage
invoked through mpiexec. I'll need to go away and see whether there are
any interactions with MPI_COMM_SPAWN, like the ones I found related to
different users, to know if this is going to work. I'll run some tests and
let you know what I find.
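
As a rough sketch of the first stage, assuming discovery has already
produced a list of working nodes with the process count wanted on each, the
hand-off could look like the following. The function names and the hosts
file path are hypothetical. The host:count line format and the -f and -n
switches are the ones Hydra provides, but whether rank placement then
follows the file order exactly is precisely what I still need to test.

```python
def hostfile_text(working):
    """Render discovered nodes as 'host:count' lines, one per usable node.

    working is a list of (hostname, process_count) pairs in the order the
    root process wants them used; nodes with a count of 0 are skipped.
    """
    return "".join(f"{host}:{count}\n" for host, count in working if count > 0)


def mpiexec_command(hostfile_path, working, program):
    """Build the argument list for the second stage (it is not run here)."""
    total = sum(count for _, count in working if count > 0)
    return ["mpiexec", "-f", hostfile_path, "-n", str(total), program]


if __name__ == "__main__":
    # Example values taken from earlier in this thread.
    discovered = [("Shakespeare", 2), ("Burns", 5)]
    print(hostfile_text(discovered), end="")
    print(" ".join(mpiexec_command("hosts.txt", discovered, "./mpi_io_test")))
```

The first stage would write the text to the hosts file and then launch the
returned command, for instance with subprocess.run.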

Thanks for your help.

> Ken