[mpich-discuss] mpiexec ignoring host setting in configfile

Fri Jun 23 13:31:52 CDT 2017

On 06/23/2017 10:19 AM, Alexander Rast wrote:
> What I need at this point is a better understanding of the 
> implementation philosophy of MPICH and Hydra because I fear that without 
> it I'm going to run into a series of endless surprises that will 
> generate questions that might never have arisen if I'd known what the 
> internal thinking of the MPICH group was from the outset. Forgive me if 
> some of what I say or ask for here is quite general, but I'm trying to 
> put MPICH in enough context that I can reasonably predict how it will 
> behave in various scenarios.
> 
> I'm giving you my own thinking here so you have some context on where 
> I'm coming from, not as an expectation or demand that things should work 
> this way but as a description of what I'd imagine in the absence of any 
> better information. It's going to be rather long and I apologise for 
> that but I think you might need it in order to make sense of what I'm 
> saying.
> 
> Also posting this to the mailing list, because if productive information 
> comes out of this exchange it may be useful for others in future as a 
> sort of supplementary documentation.
> 
> On Thu, Jun 22, 2017 at 6:57 PM, Kenneth Raffenetti 
> <raffenet at mcs.anl.gov> wrote:
> 
>     On 06/22/2017 11:16 AM, Alexander Rast wrote:
> 
>         On Thu, Jun 22, 2017 at 4:14 PM, Kenneth Raffenetti
>         <raffenet at mcs.anl.gov> wrote:
> 
>              -host(s) is a global option. You can only specify it once.
> 
> 
>         ? This doesn't make sense to me. Example 8.13 from the MPI
>         specification has:
> 
>         mpiexec myprog infile1 : myprog infile2 : myprog infile3,
>         corresponding to Form A
> 
>         mpiexec {<above arguments>} : {...} : {...} : ... : { ...}
> 
>         which invokes potentially multiple programs with different
>         arguments (of which host is one)
> 
>         and then mentions that Form B: mpiexec -configfile <filename>
>         has lines of the form separated by the colons in Form A.
> 
>         So there's no reason, I would think, that a Form B file can't
>         have multiple lines with different hosts specified for the same
>         executable.
> 
> 
>     The reason is that Hydra does not support it. You are allowed one
>     -host(s) argument. If you want to submit a patch to allow more than
>     one instance, we will consider it.
> 
> 
> I have to concede that this comes as a surprise. I didn't see anything 
> previously that said only one -host argument is allowed, and if I read 
> through the MPI specification on mpiexec and the definition of the -host 
> argument I would definitely come to the opposite conclusion. It seemed 
> clear that -host would be allowed multiple times for multiple 
> specifications. Now, the specification of mpiexec is optional but as I 
> understood it if mpiexec is supported then the option settings described 
> in the specification should operate like an MPI_COMM_SPAWN command. From 
> my point of view it would have been surprising and unexpected behaviour 
> for -host to behave differently on the first invocation of 
> MPI_COMM_SPAWN than it did on subsequent invocations. A given 
> implementation would be free to choose not to support -host as an option 
> switch, but if this were the case, I'd expect it to raise an error or at 
> least a warning rather than silently ignoring the switch. The particular 
> interpretation MPICH seems to have taken as to how to handle this switch 
> is one that would never have occurred to me under any circumstances, and 
> as you can see, would never have occurred to me *even after the fact*, 
> without explicit indication that this was the path chosen, at which 
> point as you can also see I would still have responded with utter 
> astonishment. I don't understand the rationale.
> 
> MPICH supports a different option: a hosts file specified using -f, that 
> much I understood. The MPI specification says that mpiexec may support 
> any number of additional implementation-dependent options, whose 
> behaviour is entirely dependent upon the implementation, but that would 
> simply imply to me that there is an alternative way to specify hosts, 
> users, process mappings, etc. Likewise a -hosts option is possible but I 
> would have considered that an entirely separate case (with possibly 
> different behaviour) from -host: just because 2 switches look similar 
> doesn't mean they necessarily are the same. With information now 
> available I am given to understand that this is not the case but it is 
> certainly not what I would have expected a priori.
> 
> One of the problems is that MPI documentation itself is very complete 
> but documentation on implementations is very thin, MPICH is not alone in 
> this aspect, and a lot of the details of MPI are 
> implementation-dependent. This leaves the user struggling with best 
> guesses as to what an implementation will do.
> 
> Use of a hosts file carries with it its own ambiguities. As I understood 
> mpiexec as described in the specification for MPI, the -n switch only 
> corresponds to a request to start that many processes; it doesn't come 
> with any guarantees. If everything is working according to plan then I 
> would expect n processes will be started, but I wasn't ready to count on 
> that as a firm guarantee. This means that the behaviour of the system 
> when either fewer processes than asked for could be started, or when the 
> host-processes mapping couldn't be satisfied, is unclear. For instance, 
> if I had -n 10 specified and a hosts file with A:3 B:2 C:3 D:2 then 
> suppose in actual fact only 7 processes could be started? Would it have 
> been safe to assume that the mapping in actual fact were A: 0-2, B: 3-4, 
> C: 5-6? That's one possibility but not the only one. Or what if for 
> some reason A were only able to run 2 processes? Would the system remap 
> so that 10 were supported, e.g. A: 0-1, B: 2-4, C: 5-6, D: 7-9? Or some 
> other combination? Or fail altogether? Maybe the combinations are 
> manageable in small configurations but if you had 1000 hosts supporting 
> some number of processes between 0 and 100, for an application 
> distributing a total of 25,000 processes, the possible combinations 
> become hopeless to calculate. Internally Hydra may have a deterministic 
> assignment algorithm, but without delving deeply into the internals this 
> wouldn't be easy to determine.
> 
> At any rate, from the point of view of a user, I'd have considered a 
> fixed mapping algorithm, where one were at the mercy of what Hydra or 
> any other process manager decided to do in terms of which process ended 
> up where, to be an exceptionally onerous restriction. It would be from 
> my point of view sort of fundamental that the user be able to specify, if 
> they want to, exactly which process maps to which hardware resource - 
> after all, there's no way an external process manager can know the 
> details of what hardware capabilities may be at each node and there may 
> be very strong reasons why the mapping needs to be as the user wants. 
> This is one of the main reasons I'd have expected that if the -host 
> option were even allowed it would be usable in a line-by-line (or 
> colon-separated arglist) mode.
> 
> It looks as if by a combination of hosts file and mpiexec options you 
> can *probably* achieve something similar but this is frightfully 
> implicit. On a purely personal level I find very implicit specifications 
> - where a resultant configuration can be *deduced* from the combination 
> of inputs to the configuration generator, but isn't already explicitly 
> plain in the inputs themselves - to be confusing and difficult to work 
> with. In this case you have a result which is dependent not merely on 
> the options passed to mpiexec and the lines of the hosts file, but on 
> the order of lines within the hosts file and the order of executable 
> specifications in the mpiexec invocation. As noted I also have concerns 
> about what happens if the mapping can't be achieved exactly as the 
> implicit specification suggests but would be in principle feasible with 
> a different mapping.
> 
> So what I'm looking for, again, is some background on the implementation 
> and expected usage model for MPICH in the sense not of an empirical 'do 
> this because that's what works' but rather a more top-down 'we imagine 
> problems cast in this form, and so have configured MPICH to work like 
> this' so that all these choices can be put into a context I can make 
> sense of. Maybe you're not the best placed to answer this sort of 
> question. Can someone else in the discussion list provide any insights?

If you are looking for background on expected usage of MPI(CH), you may 
find this book useful: https://mitpress.mit.edu/using-MPI-3ed

As for why Hydra was implemented the way it is, I'll only add that 
features are implemented based on common practice and user demand. What 
you are asking for is neither, from my experience.

> 
> 
> 
>              Further, your configfile does not make much sense since you are
>              using the same program on both lines. You could rather just do:
> 
>                 mpiexec -n 4 -ppn 2 -hosts Shakespeare,Burns ./mpi_io_test
> 
> 
>         I have not seen the -ppn switch listed anywhere. What does this
>         do? Is this intended to be a 'limit number of processes to n on
>         the hosts specified'?
> 
> 
>     $ mpiexec -h | grep ppn
>          -ppn                             processes per node
> 
>         Notwithstanding this isn't what is needed anyway, because what
>         if I want specifically to run at most 2 processes on Shakespeare
>         but 5 on Burns (imagine, for example, that Shakespeare is
>         already loaded with other processes, so I need to limit it for
>         the executable in concern)?
> 
> 
>     Irregular mappings like this can be done with a hostfile.
> 
> 
>         In any case the point is, using the same program on both lines
>         to me seems straightforward enough given that there are numerous
>         instances (as ex. 8.13 above) where various switches and options
>         might be different.
> 
>         I note also that -hosts is different from the -host switch
>         listed as pre-defined for mpiexec.
> 
> 
>     Hydra supports both -host and -hosts. They mean the same thing.
> 
> 
>         In my case we have a system where the objective is to direct
>         specific instances of an application or possibly several to
>         specific hosts - so that the mapping can be fixed. In the limit
>         this might consist of -n 1 -host <x> <executable> lines, each
>         directing a specific instance of the executable to a specific host.
> 
> 
>     Again, this will not work with today's Hydra. I would suggest
>     playing around some more with the available options to see if
>     another solution might work for your use-case.
> 
> 
> Back to the real situation. What we (will) have is a machine with a 
> large number of compute nodes. Simply by force of numbers these compute 
> nodes are going to be unreliable: some of them will work, some not, 
> although we do assume that at actual run-time the list of working nodes 
> can be gained by discovery from a root process and configured as a 
> static mapping for the purposes of any given run. The critical point 
> though is that it's only at system initialisation time, not before, that 
> this information can be known. A root process will take a problem (i.e. 
> an application) specification and determine a mapping for the 
> application based on the available hardware topology, the problem 
> topology, and the predicted patterns of use. It then will need to run an 
> MPI_COMM_SPAWN_MULTIPLE to distribute the problem to the working nodes, 
> and we need to make sure the distribution is exactly as specified or the 
> system simply won't work - problem components will be distributed 
> differently from the expected topology, large chunks of data might end 
> up in entirely unexpected places and the application will fall down 
> almost immediately (deadlock is very likely as well). The main thing I 
> need to understand is exactly what the MPI_COMM_SPAWN_MULTIPLE will do. 
> As you can see there are a lot of secondary questions related to things 
> like where I/O goes, what user is on which host, Fortran support etc. 
> but these are only sub-problems of the main issue which is getting to 
> grips with MPI_COMM_SPAWN_MULTIPLE.
> 
> As you may be able to see, using a hosts file would be awkward at best, 
> because we're not going to know until application startup what the hosts 
> would actually be. The best solution I can come up with if it really is 
> the case that this is the only way to map processes to hosts is to have 
> a 2-stage program that does discovery and then pipes the result into a 
> second stage invoked through mpiexec but I'll need to go away and see if 
> there are any interactions with MPI_COMM_SPAWN like what I found related 
> to different users to know if this is going to work. I'll run some tests 
> and let you know what I find.

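
A minimal sketch of the two-stage idea described above, in Python. It
assumes stage one has already produced a list of working hosts with a
slot count for each (the `discovered` list below is made up, reusing the
A:3 B:2 C:3 D:2 example). Stage two renders that list in the host:count
hostfile form and builds an mpiexec command line using the -f option
mentioned earlier. The `expected_layout` function is only a prediction:
it assumes ranks fill each host's slots in file order and wrap around
when -n exceeds the total, which should be checked against the Hydra
version in use. The function names are hypothetical.

```python
def hostfile_text(slots):
    """Render (host, count) pairs as hostfile lines, one 'host:count' per line."""
    return "".join(f"{host}:{count}\n" for host, count in slots)


def expected_layout(slots, nprocs):
    """Predict which host each rank lands on.

    ASSUMPTION: ranks fill each host's slots in file order and wrap
    around to the first host when nprocs exceeds the total slot count.
    """
    flat = [host for host, count in slots for _ in range(count)]
    if not flat:
        raise ValueError("no usable slots were discovered")
    return [flat[rank % len(flat)] for rank in range(nprocs)]


def mpiexec_command(hostfile_path, nprocs, program):
    """Build the argument list for the second stage."""
    return ["mpiexec", "-f", hostfile_path, "-n", str(nprocs), program]


if __name__ == "__main__":
    # Made-up discovery result; a real stage one would probe the nodes.
    discovered = [("A", 3), ("B", 2), ("C", 3), ("D", 2)]
    print(hostfile_text(discovered), end="")
    print(expected_layout(discovered, 10))
    print(" ".join(mpiexec_command("hosts.txt", 10, "./mpi_io_test")))
```

Writing the hostfile at startup keeps the mapping explicit in one place,
and comparing `expected_layout` against what the ranks actually report
is one way to detect a launch that did not match the plan.
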
This is more of an application development issue, and not really in the 
scope of this list. Nevertheless, have you considered allowing processes 
to determine their computational role *after* initialization rather than 
launching new binaries?
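
As a rough illustration of that idea, here is a sketch in plain Python
(no MPI calls, so it can be run anywhere). It assumes every rank has
already reported the host it started on, for example by gathering the
result of MPI_Get_processor_name at the root, and that the root has a
table of which roles it wants on which host. The role names and the
`assign_roles` function are hypothetical.

```python
from collections import deque


def assign_roles(rank_hosts, roles_by_host):
    """Give each rank a role based on where it actually started.

    rank_hosts: host name reported by each rank, indexed by rank.
    roles_by_host: host name -> list of roles wanted on that host.
    Returns a list of roles indexed by rank; None marks a spare rank.
    Raises RuntimeError if some role has no rank on its required host.
    """
    pending = {host: deque(roles) for host, roles in roles_by_host.items()}
    assigned = []
    for host in rank_hosts:
        queue = pending.get(host)
        # Hand out the next role wanted on this host, if any remain.
        assigned.append(queue.popleft() if queue else None)
    unmet = {host: list(queue) for host, queue in pending.items() if queue}
    if unmet:
        raise RuntimeError(f"roles with no rank on the required host: {unmet}")
    return assigned


if __name__ == "__main__":
    # Made-up report of where four ranks ended up.
    rank_hosts = ["Shakespeare", "Shakespeare", "Burns", "Burns"]
    wanted = {"Shakespeare": ["root", "worker-a"], "Burns": ["worker-b"]}
    print(assign_roles(rank_hosts, wanted))
```

The point of the design is that the launcher only has to start enough
processes on each host; which process does what is decided by the
application once it knows the real placement.
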

Ken