[mpich-discuss] Strange, inconsistent behaviour with MPI_Comm_spawn
Kenneth Raffenetti
raffenet at mcs.anl.gov
Thu Jun 15 10:17:51 CDT 2017
You can specify alternate usernames in a hostfile. See link:
https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager#Using_Hydra_on_Machines_with_Different_User_Names
You may also want to set HYDRA_LAUNCHER_SSH_ENABLE_WARNINGS=1 to see if
SSH will dump out more useful information.
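(For reference, a sketch of the hostfile form described on that wiki page - the hostnames, username, and process counts below are examples, assuming the `user@host` syntax Hydra accepts:)

```shell
# hostfile for mpiexec -f hostfile ...
# "user@" overrides the login name Hydra's ssh launcher uses for that host.
pi@Burns:4        # log in to Burns as user 'pi', up to 4 processes
pi@Smithers:4     # second worker node, same remote user
```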
Ken
On 06/14/2017 09:56 AM, Alexander Rast wrote:
> More conjecture on my part: I think I understand what's going on
> with the 'gethostbyname failed' message, but I have no idea how one
> would go about fixing it.
>
> The exact message is 'gethostbyname failed, pi@Burns (errno 1)'. I
> think this is a case of the underlying implementation not being
> intelligent enough to strip the username off before resolving the
> host address. In other words, it's looking for a host named 'pi@Burns'
> rather than a host Burns to which it will authenticate as user pi.
>
> If you try to use the hostname without the user, this also fails,
> because then it tries to authenticate using the current username from
> the shell that invoked mpiexec (in my case on what will eventually
> become a 'root host' machine - an ordinary Ubuntu desktop PC), and
> that username doesn't exist on the other MPI hosts. I have
> authentication set up using RSA keys, which authenticates the local
> user on the 'root host' desktop PC as the 'working user' on the
> remote MPI hosts, but it seems MPI_Comm_spawn isn't using that
> authentication path.
>
> What it looks like is that the MPICH implementation may require you
> to have an identical username on each host you hope to spawn a
> process on. I very much *hope* this isn't the case, because it would
> be exceptionally awkward: each user wanting to run MPI jobs on the
> remote cluster would need their own private account and directory on
> every node in the cluster.
>
> So: 1) is it possible, and 2) how would I, configure the
> MPI_Comm_spawn command or the associated hosts so that you
> authenticate as user X on host Y (X@Y) for each spawn host, using
> authorized keys from your user W on machine Z? That is, on machine Y
> the file /home/X/.ssh/authorized_keys contains an entry for W@Z.
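(One standard way to express exactly this user-X-on-host-Y mapping is on the launching machine, in ~/.ssh/config, which OpenSSH consults whenever Hydra invokes ssh - a sketch, using the Burns/pi names from this thread as the example host and remote user:)

```shell
# ~/.ssh/config on the launching machine (user W on machine Z)
Host Burns
    User pi                      # authenticate as 'pi' when ssh-ing to Burns
    IdentityFile ~/.ssh/id_rsa   # key whose .pub sits in /home/pi/.ssh/authorized_keys
```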
>
> On Tue, Jun 13, 2017 at 5:15 PM, Alexander Rast
> <alex.rast.technical at gmail.com> wrote:
>
> OK, some progress. After more work I determined that the bulk of the
> truly bizarre inconsistencies was down to obnoxious, intrusive
> behaviour in Ubuntu of gnome-keyring-daemon, for example see the
> following post:
>
> https://askubuntu.com/questions/564821/why-cant-i-interact-with-my-ssh-agent-e-g-ssh-add-d-doesnt-work
>
> which, by the way, has not been fixed as of Ubuntu 16.04.
>
> It seems gnome-keyring-daemon is a particularly badly-behaved
> utility, and doesn't help itself with multiple autoload attempts at
> Ubuntu startup. You don't know which ssh-agent is actually loaded or
> which RSA keys it has cached. It's also apparently very difficult to
> get rid of, although there are ways; I eventually got it to stop
> loading itself. (Perhaps the MPI community might whinge to Ubuntu
> about this behaviour? Many people have complained about the
> antisocial behaviour of gnome-keyring-daemon, but so far Ubuntu's
> response has been: 'we can't see why this should be considered a
> problem, and we have doubts about what you're trying to achieve'.)
>
> So now the problem has got to the point where there are 2
> alternative error responses, both occurring at the MPI_Comm_spawn
> call. I've included typical error outputs for both scenarios, using
> the code posted earlier. The 2 errors occur with slightly different
> versions of the configuration file used to spawn the processes,
> which I'm also including; obviously the _2 files go together.
>
> Any thoughts now on what might be causing either of these 2
> problems? I find the 'gethostbyname failed' messages particularly
> perplexing, since I'm able to ssh into the machines themselves
> without difficulty, either by name or by IP address.
>
> On Fri, Jun 9, 2017 at 1:53 PM, Alexander Rast
> <alex.rast.technical at gmail.com> wrote:
>
> I've reached a limit of mystification. Attempting to run an MPI
> application using MPI_Comm_spawn from a host is resulting in
> bizarre, inconsistent behaviour of ssh and ssh-askpass.
>
> What I did: I created an RSA keypair using ssh-keygen, copied
> the public key into the ~/.ssh directories on the machines I'll
> be running MPI on, added it to the authorized_keys file, placed
> all the machines in the known_hosts file on the launcher host
> (which issues the MPI_Comm_spawn), then ran eval `ssh-agent` and
> added the id_rsa file to the agent on the launcher host.
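(As a sketch, the setup steps described above correspond to something like the following commands on the launcher host - the hostname 'Burns' and user 'pi' are the examples from this thread, and paths assume OpenSSH defaults:)

```shell
# Generate a keypair on the launcher host (no passphrase here, for brevity).
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N ''

# Install the public key into ~/.ssh/authorized_keys on each worker;
# ssh-copy-id also records the host in known_hosts on first contact.
ssh-copy-id pi@Burns

# Start an agent in this shell and cache the private key with it.
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa

# Sanity check: this should log in and exit without prompting for a password.
ssh pi@Burns true
```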
>
> I can verify that this part of the system is working, because I
> can use ssh directly to access the worker machines that will be
> running the application.
>
> But when I actually try to run the MPI application, when it gets
> to the spawn, all sorts of weird and wild stuff happens.
> Sometimes a dialogue (which aggressively grabs focus) comes up
> asking for a password (OpenSSH Authentication). Other times the
> same program has just said that the identity/authenticity of the
> target machine can't be established - do I want to continue?
> (Answering yes causes authentication to fail.) In still other
> cases, it appeared to open the connection but then MPI crashed
> saying it couldn't get the host by name (yes, every machine has
> the hostnames of every other machine in its hosts file). And in
> yet another case, it seemed to try to run but then crashed saying
> unexpected end-of-file. And so on. There seems to be no rhyme or
> reason to the errors: I can't reproduce anything, and each time I
> try some new and surprising behaviour comes up. What's happening?
> Do I have to do something unusual with machine
> configuration/environment?
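(For reference, a minimal sketch of the kind of spawn call under discussion. The standard "host" info key directs the spawned processes to a particular machine; note there is no info key for a username, which is why the user has to come from the hostfile or ssh configuration. The hostname 'Burns' and the './worker' binary name are assumptions for illustration; compile with mpicc and launch with mpiexec.)

```c
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* The reserved info key "host" names the machine on which the
     * children should run; the launcher (Hydra over ssh) decides
     * separately which remote user to authenticate as. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "Burns");       /* example hostname */

    MPI_Comm intercomm;
    int errcodes[2];
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL,  /* "./worker" is hypothetical */
                   2, info, 0, MPI_COMM_SELF, &intercomm, errcodes);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```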
>
> Here are the associated MPI files if anyone wants to look for
> errors. In fact there are probably some errors in the code
> itself, because it's never been possible to debug it (because of
> this weird behaviour), but I am fairly sure at least the sequence
> through to the spawn call is OK. All help appreciated...
>
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss