[mpich-discuss] Running a program on multiple computers
raffenet at mcs.anl.gov
Thu Oct 27 22:17:59 CDT 2016
Apologies for the delay in response. My comments are inline below.
On 10/15/2016 07:40 PM, Mahdi, Sam wrote:
> Hello everyone,
> I am attempting to run a single program on 32 cores split across 4
> computers (So each computer has 8 cores). I am attempting to use mpich
> for this. I currently am just testing on 2 computers, I have the program
> installed on both, as well as mpich installed on both. I have created a
> register key and can log in using ssh to the other computer without
> a password. I have come across 2 problems. One, when I attempt to
> connect using the mpirun -np 3 --host a (the IP of the computer I am
> attempting to connect to) hostname
> I receive the error
> unable to connect from "localhost.localdomain" to "localhost.localdomain"
> This indicates that my computer's "localhost.localdomain" is attempting to
> connect to another "localhost.localdomain". How can I change this so
> that it connects via my IP to the other computer's IP?
> Secondly, I attempted to use a host file instead using the hydra process
> wiki. I created a hosts file with just the IP of the computer I am
> attempting to connect to. When I type in the command mpiexec -f hosts -n
> 4 ./applic
> I get this error
> [mpiexec at localhost.localdomain] HYDU_parse_hostfile
> (./utils/args/args.c:323): unable to open host file: hosts
> along with other errors of unable to parse hostfile, match handler etc.
> I assume this is all due to it being unable to read the host file. Is
> there any specific place I should save my hosts file? I have it saved
> directly on my Desktop. I have attempted to indicate the full path where
> it is located, but I still get the same error.
There is no required location for the hosts file. If you are specifying
the full path and there are still issues, it may be a formatting issue. Can
you paste or attach the contents of your hosts file so we can confirm
the format is good?
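
In the meantime, a quick local check may help narrow this down. The sketch
below is not part of MPICH; check_hostfile is a hypothetical helper name. It
assumes the simple one-entry-per-line form ("host" or "host:N", with optional
"#" comments) and is deliberately narrower than whatever Hydra itself may
accept. It only reports whether the file can be found at the path given and
whether each entry matches that form.

```python
import re
import sys
from pathlib import Path

# Assumption: entries are "host" or "host:N" (N = number of processes).
# This pattern is intentionally strict; it is not Hydra's own parser.
HOST_LINE = re.compile(r"^[A-Za-z0-9._-]+(:\d+)?$")


def check_hostfile(path):
    """Return a list of problems found; an empty list means none were found."""
    p = Path(path).expanduser()
    if not p.is_file():
        # Covers a wrong directory, a misspelled name, or a hidden extension.
        return [f"cannot find a regular file at {p}"]

    problems = []
    raw = p.read_bytes()
    if b"\r" in raw:
        problems.append("file contains carriage returns (Windows line endings)")

    text = raw.decode("utf-8", errors="replace")
    for number, line in enumerate(text.splitlines(), start=1):
        entry = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not entry:
            continue  # blank or comment-only line
        if not HOST_LINE.match(entry):
            problems.append(f"line {number}: unexpected entry {entry!r}")
    return problems


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "hosts"
    for problem in check_hostfile(target):
        print(problem)
```

Running it with the same path you pass to mpiexec -f shows whether the file
is visible from the directory you launch from. If it reports that the file
cannot be found, the name or location is the likely culprit (for example, an
editor having saved it with an extension) rather than the contents.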
> For the first problem, I have read that I need to change /etc/hosts
> manually by using the sudo command to manually enter the IP of the
> computer I am attempting to connect to in the /etc/hosts file. I assume
> the computer is attempting to connect to itself (set up the program
> first on its own core, then send it to another, hence attempting to
> start it on localhost.localdomain).
> For the second problem, I have attempted to run the command
> mpirun --host my computer IP, the other computer IP ./program
This format should be okay for your purposes. What happens if you try:
mpirun --host my computer IP, the other computer IP /bin/hostname
If the hostname of each host is echoed to the command line, then job
launch is successful and the issue is during connection setup during