[mpich-discuss] Strange, inconsistent behaviour with MPI_Comm_spawn

Kenneth Raffenetti raffenet at mcs.anl.gov
Thu Jun 15 10:17:51 CDT 2017

You can specify alternate usernames in a hostfile. See link:


You may also want to set HYDRA_LAUNCHER_SSH_ENABLE_WARNINGS=1 to see if 
SSH will dump out more useful information.
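Since the linked page isn't reproduced in the archive: a hedged sketch of a Hydra hostfile using the user@host form (hostnames and process counts below are taken from this thread or invented). If this form doesn't match your MPICH version, the Hydra documentation referenced above is authoritative.

```
# hostfile -- one host per line, optional :N for process slots
# "pi" is the remote username from this thread; counts are made up
pi@Burns:4
pi@Smithers:4
```

Passed to the launcher as `mpiexec -f hostfile -n 1 ./launcher`.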


On 06/14/2017 09:56 AM, Alexander Rast wrote:
> Still more conjecture on my part: I think I understand what's going on 
> with the gethostbyname failed message, but I have no idea how one would 
> go about fixing it.
> The exact message gives 'gethostbyname failed, pi at Burns (errno 1)'. I 
> think this is a case of the underlying implementation not being 
> intelligent enough to strip the username off in order to resolve the 
> host address. In other words, it's looking for a host named 'pi at Burns' 
> rather than a host Burns to which it's going to authenticate as user pi.
> If you try to use the hostname without user this fails, because then it 
> tries to authenticate using the current username from the shell that 
> invoked mpiexec (in my case on what will eventually become a 'root host' 
> machine - an ordinary Ubuntu desktop PC) which fails because on the 
> other MPI hosts that username doesn't exist. I have authentication set 
> up using RSA keys which authenticates the local user on the 'root host' 
> desktop PC into the 'working user' on the remote MPI hosts, but it seems 
> the MPI_Comm_spawn function isn't using that authentication path.
> What it looks like is that the MPICH implementation may be such that in 
> order to run MPI_Comm_spawn you have to have an identical username on 
> each host you hope to spawn a process on. I very much *hope* this isn't 
> the case, because it would be exceptionally awkward: each user wanting 
> to run MPI jobs on the remote cluster would need their own private user 
> account and home directory on every node in the cluster.
> So: 1) is it possible to, and 2) how would I, configure the 
> MPI_Comm_spawn call or the associated hosts so that you can 
> authenticate as user X on 
> host Y (X at Y) for each spawn host using authorized keys from your user W 
> on machine Z, i.e. on machine Y you have in directory 
> /home/X/.ssh/authorized_keys an entry for W at Z?
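One approach worth noting, since Hydra's ssh launcher goes through the ordinary ssh client: per-host usernames and keys can be set in ~/.ssh/config on the launching machine, so identical usernames on every node shouldn't be required. A hedged sketch using the placeholder names from the question above (W, X, Y, Z are not real accounts):

```
# ~/.ssh/config on machine Z, in user W's home directory
Host Y
    HostName Y
    User X
    IdentityFile ~/.ssh/id_rsa
```

With this in place, `ssh Y` (and hence the launcher) connects as user X, authenticating against /home/X/.ssh/authorized_keys on Y.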
> On Tue, Jun 13, 2017 at 5:15 PM, Alexander Rast 
> <alex.rast.technical at gmail.com> wrote:
>     OK, some progress. After more work I determined that the bulk of the
>     truly bizarre inconsistencies was down to the obnoxious, intrusive
>     behaviour of gnome-keyring-daemon in Ubuntu; for example, see the
>     following post:
>     https://askubuntu.com/questions/564821/why-cant-i-interact-with-my-ssh-agent-e-g-ssh-add-d-doesnt-work
>     which, by the way, has not been fixed as of Ubuntu 16.04.
>     It seems gnome-keyring-daemon is a particularly badly-behaved
>     utility and doesn't help itself with multiple autoload attempts at
>     Ubuntu startup. You don't know which ssh agent is actually loaded and
>     which RSA keys it has cached. It's also apparently very difficult to
>     get rid of, although there are ways. I got it eventually to stop
>     loading itself. (Perhaps the MPI community might whinge to Ubuntu
>     about this behaviour? Many people have complained about the
>     antisocial behaviour of gnome-keyring-daemon but so far Ubuntu's
>     response has been: 'we can't see why this should be considered a
>     problem. And we have doubts about what you're trying to achieve')
>     So now the problem has got to the point where there are 2
>     alternative error responses, both occurring at the MPI_Comm_spawn
>     call. I've included typical error outputs for both scenarios,
>     using the code posted earlier. The 2 errors occur with slightly
>     different versions of the configuration file used to spawn the
>     processes, which I'm also including; the files with the _2 suffix
>     go together.
>     Any thoughts now on what might be causing either of these 2
>     problems? I find the gethostbyname failed messages particularly
>     perplexing, since I'm able to ssh into the machines themselves
>     without difficulty either by name or IP address.
>     On Fri, Jun 9, 2017 at 1:53 PM, Alexander Rast
>     <alex.rast.technical at gmail.com> wrote:
>         I've reached a limit of mystification. Attempting to run an MPI
>         application using MPI_Comm_spawn from a host is resulting in
>         bizarre, inconsistent behaviour of ssh and ssh-askpass.
>         What I did is, I created an RSA keypair using ssh-keygen, copied
>         the public key into the .ssh directories on the machines I'll
>         be running MPI on, put it in the authorized_keys file, placed
>         all the machines in the known_hosts file on the launcher host
>         (which is starting MPI_Comm_spawn), then ran eval ssh-agent and
>         added the id_rsa file to the agent on the launcher host.
>         This part of the system is verifiably working: I can use ssh
>         directly to access the worker machines that will be running
>         the application.
>         But when I actually try to run the MPI application, when it gets
>         to the spawn, all sorts of weird and wild stuff happens.
>         Sometimes a dialogue (which aggressively grabs focus) comes up
>         asking for a password (OpenSSH Authentication). Other times the
>         same program has just said that the identity/authenticity of the
>         target machine can't be established - do I want to continue? (A
>         yes causes authentication to fail). In still other cases, it
>         appeared to open the connection but then MPI crashed saying it
>         couldn't get the host by name. (yes, every machine has the
>         hostnames of every other machine in its hosts file). And in yet
>         another case, it seemed to try to run but then crashed saying
>         unexpected end-of-file. And so on. There seems to be no rhyme or
>         reason to the errors, I can't reproduce anything, each time I
>         try it some new and surprising behaviour comes up. What's
>         happening? Do I have to do something unusual with machine
>         configuration/environment?
>         Here are the associated MPI files if anyone wants to look for
>         errors. In fact there are probably some errors in the code
>         itself, because it has never been possible to debug it (because
>         of this weird behaviour), but I am fairly sure at least the
>         sequence through to the spawn call is OK. All help appreciated...
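For reference against the attached code: a minimal, hedged sketch of an MPI_Comm_spawn call that requests placement on a named host via the standard "host" info key. It requires an MPI installation and a reachable remote host, so it is untested here; "./worker" is a placeholder executable name and "Burns" is the hostname from this thread. Note the info key only controls placement; authentication is still handled by the launcher (ssh), not by MPI itself.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    /* Ask the process manager to place the children on host "Burns" */
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "Burns");

    /* Spawn 2 copies of a placeholder worker executable */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, info, 0,
                   MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```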
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss