[mpich-discuss] MPID_nem_tcp_connpoll(1835): Communication error with rank 1: Connection timed out

amelie chi zhou amelie.czhou at gmail.com
Fri Mar 18 02:09:29 CDT 2016


Hi, Pavan,

Attached is a little summary on "How to run MPICH on Amazon EC2". I'm not
sure whether it's clear enough. Please check.

Regards,
Amelie

On Thu, Mar 17, 2016 at 9:22 AM, amelie chi zhou <amelie.czhou at gmail.com>
wrote:

> Sure. I'm glad to.
>
> Amelie
>
> > On 17 Mar 2016, at 1:58 AM, Balaji, Pavan <balaji at anl.gov> wrote:
> >
> > Hi Amelie,
> >
> > Would you be willing to write up some documentation on "How to use MPICH
> > on Amazon EC2" including details on using servers in a single region vs.
> > multiple regions?  We'd like to put this up on our FAQ page.
> >
> > Thanks,
> >
> >  -- Pavan
> >
> >> On Mar 16, 2016, at 2:53 AM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
> >>
> >> Hi, Pavan,
> >>
> >> Thanks a lot. It does work now!
> >>
> >> Best Regards,
> >> Amelie
> >>
> >> On Wed, Mar 16, 2016 at 3:26 PM, Balaji, Pavan <balaji at anl.gov> wrote:
> >> Amelie,
> >>
> >> OK, I just tried it across two different subnets.  Here's the problem --
> >>
> >> The amazon compute nodes hide their public IP addresses from the list
> >> of IP addresses visible locally.  So when each node queries for its local
> >> IP address, it only gets its private IP address (which is obviously useless
> >> for other processes to connect).
> >>
> >> You can work around this by making two changes to your environment:
> >>
> >> 1. Explicitly use the public IP addresses directly instead of the
> >> hostnames in your hostfile.  That is, instead of
> >> "ec2-52-36-15-57.us-west-2.compute.amazonaws.com", use "52.36.15.57".
> >>
> >> 2. Pass the -localhost option to mpiexec to give the public IP address
> >> of the host from which you are running mpiexec.
> >>
> >> I created two VM instances, one on the west subnet and the other on the
> >> east subnet:
> >>
> >> ec2-52-35-56-228.us-west-2.compute.amazonaws.com
> >> ec2-54-172-35-159.compute-1.amazonaws.com
> >>
> >> To run my application, I do this:
> >>
> >> % ./install/bin/mpiexec -localhost 52.35.56.228 -hosts 52.35.56.228,54.172.35.159 -n 2 ./examples/cpi
> >>
> >> Let us know if that works.
> >>
> >>  -- Pavan
> >>
> >>> On Mar 16, 2016, at 12:24 AM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
> >>>
> >>> Hi, Pavan,
> >>>
> >>> Thanks a lot for trying that.
> >>>
> >>> I have enabled inbound traffic for all types of protocols including
> >>> tcp, udp and icmp for all ports (0 - 65535) and for all ip addresses. I
> >>> noticed that the two instances you created are from the same region (us
> >>> west I suppose). The thing is, for instances in the same region, mpiexec
> >>> can run successfully with no problem in my setup. But when I run mpi
> >>> programs across regions, in my case, between an instance in us east and an
> >>> instance in us west, the error in MPI_Send appears.
> >>> It seems that there might be some problems with the firewall or
> >>> network interfaces, but I have checked and ruled out those possibilities
> >>> (instances in different regions can ssh and scp to each other and there's
> >>> no dropping rule in my firewall setting). So that's where I'm confused.
> >>>
> >>> Regards,
> >>> Amelie
> >>>
> >>>> On Wed, Mar 16, 2016 at 1:01 PM, Balaji, Pavan <balaji at anl.gov> wrote:
> >>>>
> >>>> On Mar 15, 2016, at 10:26 PM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
> >>>> Here is the full output info. Thanks!
> >>>
> >>> The IP addresses and ports seem to be correctly setup, so that's not
> >>> the problem.
> >>>
> >>> I created my own amazon instances to see what the problem is.  It
> >>> looks like the instances are not able to communicate even though there's no
> >>> explicit firewall enabled that is shown inside the Linux instance.  I did
> >>> some digging and found the "Security group" settings and found that the
> >>> inbound rules only allowed ssh.  I changed it to "All traffic" and now I
> >>> can run my jobs fine.
> >>>
> >>> % ./install/bin/mpiexec -hosts ec2-52-36-15-57.us-west-2.compute.amazonaws.com,ec2-52-37-222-189.us-west-2.compute.amazonaws.com -n 4 ./examples/cpi
> >>> Process 3 of 4 is on ip-172-31-28-127
> >>> Process 2 of 4 is on ip-172-31-21-12
> >>> Process 1 of 4 is on ip-172-31-28-127
> >>> Process 0 of 4 is on ip-172-31-21-12
> >>> pi is approximately 3.1415926544231243, Error is 0.0000000008333312
> >>> wall clock time = 0.010181
> >>>
> >>> Can you try that?
> >>>
> >>>  -- Pavan
> >>>
> >>> _______________________________________________
> >>> discuss mailing list     discuss at mpich.org
> >>> To manage subscription options or unsubscribe:
> >>> https://lists.mpich.org/mailman/listinfo/discuss
-------------- next part --------------
To run MPI programs on Amazon EC2, you first need an Amazon Web Services (AWS) account.
Assuming you are familiar with AWS and know how to start, stop, and terminate Amazon EC2 instances, follow the steps below to run MPI programs within a single Amazon EC2 region or across regions.


<h>Running MPI in a Single Amazon EC2 Region</h>

Step 1: Install MPI in your cluster

Start an MPI cluster using Amazon EC2 instances and make sure MPICH is properly installed on every instance. You can download MPICH here (http://www.mpich.org/downloads/) and install it with the following steps.
>>> tar -xzf mpich-3.2.tar.gz
>>> cd mpich-3.2
>>> ./configure
>>> make
>>> sudo make install
If your build was successful, you should be able to see your installed version by typing
>>> mpiexec --version


Step 2: Configure your MPI cluster

You should have a key pair file, yourkey.pem, attached to the instances you created.
First set up your environment variables as follows. You can find the access key and secret key in your AWS account.

export AWS_ACCESS_KEY_ID=[ Your access key ]
export AWS_SECRET_ACCESS_KEY=[ Your access key secret ]

Try to ssh to the other instances in your cluster to make sure that your setup works.
>>> ssh -i path-to-the-key/yourkey.pem username@other-instance-ip
To enable password-less ssh between instances, do the following on each instance in your cluster.
>>> cp path-to-the-key/yourkey.pem ~/.ssh/id_rsa
Then you should be able to ssh by typing
>>> ssh username@other-instance-ip
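
One detail worth checking after copying the key: OpenSSH refuses to use a private key file that other users can read. The sketch below shows one way to copy a key into place with owner-only permissions. The helper name install_key and the scratch directory are made up for illustration, and the demonstration uses a dummy file rather than a real key, so the real ~/.ssh is not touched.

```shell
# Copy a private key into place and restrict it to the owner.
# install_key is a hypothetical helper; both arguments are placeholders.
install_key() {
    src=$1
    dest_dir=$2
    mkdir -p "$dest_dir"
    chmod 700 "$dest_dir"
    cp "$src" "$dest_dir/id_rsa"
    chmod 600 "$dest_dir/id_rsa"
}

# Demonstration against a scratch directory and a dummy key file,
# so the real ~/.ssh is left untouched.
scratch=$(mktemp -d)
printf 'dummy key material\n' > "$scratch/yourkey.pem"
install_key "$scratch/yourkey.pem" "$scratch/dot_ssh"

# Show the permission bits of the installed copy.
key_mode=$(ls -l "$scratch/dot_ssh/id_rsa" | cut -c1-10)
echo "$key_mode"
```

On a real instance the equivalent call would be install_key path-to-the-key/yourkey.pem ~/.ssh.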


 
Step 3: Run MPI in your cluster

Save the IP addresses of the instances in a file named host_file, one address per line. You can then compile your MPI program and execute it as follows.
>>> mpicc mpi_example.c -o example
>>> mpiexec -n 2 -f host_file ./example
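
As a rough illustration of the host file step, the sketch below writes two addresses to a host file and builds the matching mpiexec command line. The two private IP addresses are only placeholders (they mirror the ip-172-31-... host names seen earlier in this thread), and the command is printed rather than executed so that the sketch does not require MPICH to be installed.

```shell
# Placeholder private IPs of two instances in the same region.
hosts="172.31.21.12 172.31.28.127"

# Keep the example file in a scratch directory.
workdir=$(mktemp -d)
host_file="$workdir/host_file"

# Write one address per line.
: > "$host_file"
for h in $hosts; do
    printf '%s\n' "$h" >> "$host_file"
done

# Count the hosts and show the launch command instead of running it.
nhosts=$(wc -l < "$host_file" | tr -d ' ')
launch_cmd="mpiexec -n $nhosts -f $host_file ./example"
echo "$launch_cmd"
```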



<h>Running MPI across Multiple Amazon EC2 Regions</h>

If you want to run your code across multiple cloud regions, some modifications are required to enable network communications.

Step 1: Start your MPI cluster

Create instances in your preferred regions. Make sure the security groups of your instances allow inbound traffic from the other regions.
Install MPI on all your cluster machines as described above. In Amazon EC2, you can set up one instance with all the required packages and use an image of that instance to create the other machines, which saves some effort.

Step 2: Configuration

Suppose, for example, that you want to run across the us east and us west regions. Note that you should now have two keys, one for each region.
To enable password-less ssh between instances in different regions, configure the following files.

For instances in the us east region:
>>> cp path-to-the-key/us-east-key.pem ~/.ssh/id_rsa
Concatenate the ~/.ssh/authorized_keys files from the us east and us west regions and scp the concatenated file to all instances.
>>> scp -i path-to-the-key/us-west-key.pem username@us-west-instance-ip:~/.ssh/authorized_keys ~/.ssh/authorized_keys_uw
>>> cat ~/.ssh/authorized_keys_uw >> ~/.ssh/authorized_keys
>>> scp -i path-to-the-key/us-west-key.pem ~/.ssh/authorized_keys username@us-west-instance-ip:~/.ssh/

For instances in the us west region:
>>> cp path-to-the-key/us-west-key.pem ~/.ssh/id_rsa
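
If the concatenation step is repeated, plain appending adds the same key lines again. The sketch below shows one possible way to merge two key lists while keeping each line only once. The helper name merge_keys, the file names, and the key strings are all made up for illustration, and it works on scratch files rather than the real ~/.ssh directory.

```shell
# Merge two key lists, dropping blank lines and keeping each key
# line once, in first-seen order. merge_keys is a hypothetical helper.
merge_keys() {
    awk 'NF && !seen[$0]++' "$1" "$2"
}

# Dummy key lists standing in for the files from the two regions.
scratch=$(mktemp -d)
printf 'ssh-rsa AAAAeast us-east-key\n' > "$scratch/east_keys"
printf 'ssh-rsa AAAAwest us-west-key\nssh-rsa AAAAeast us-east-key\n' > "$scratch/west_keys"

merge_keys "$scratch/east_keys" "$scratch/west_keys" > "$scratch/merged_keys"
merged_count=$(wc -l < "$scratch/merged_keys" | tr -d ' ')
echo "$merged_count"
```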

Step 3: Run MPI in your cluster

Save the *public* IP addresses of the instances, which look like 52.19.100.32, in your hostfile. Do not use the public DNS names, which usually look like ec2-54-100-1-1.compute-1.amazonaws.com.
You can run your mpi program by typing:
>>> mpiexec -localhost host_node_ip -n 2 -f host_file ./example
Here host_node_ip is the public IP address of your master node, that is, the node from which you run mpiexec.
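
If you only have the public DNS names at hand, the public IP address can usually be read off the name itself. The sketch below assumes the name embeds the address in the ec2-A-B-C-D form, as the examples in this thread do. The helper name dns_to_ip is made up for illustration, and the launch command is printed rather than executed.

```shell
# Turn an EC2 public DNS name of the form ec2-A-B-C-D.<rest> into A.B.C.D.
# Names that do not match the pattern are passed through unchanged.
# dns_to_ip is a hypothetical helper.
dns_to_ip() {
    printf '%s\n' "$1" |
        sed -E 's/^ec2-([0-9]+)-([0-9]+)-([0-9]+)-([0-9]+)\..*$/\1.\2.\3.\4/'
}

# The two instance names quoted earlier in this thread.
west_ip=$(dns_to_ip ec2-52-35-56-228.us-west-2.compute.amazonaws.com)
east_ip=$(dns_to_ip ec2-54-172-35-159.compute-1.amazonaws.com)

# Build the launch line; the first host is assumed to be the one
# where mpiexec is started.
launch_cmd="mpiexec -localhost $west_ip -hosts $west_ip,$east_ip -n 2 ./example"
echo "$launch_cmd"
```

This uses -hosts instead of a host file, following the command shown earlier in the thread; either form should work with the converted addresses.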

Done.