[mpich-discuss] MPID_nem_tcp_connpoll(1835): Communication error with rank 1: Connection timed out

Kenneth Raffenetti raffenet at mcs.anl.gov
Fri Mar 18 10:41:39 CDT 2016


Thanks for this, Amelie. I've created a wiki page from your instructions
and linked it from the FAQ.

https://wiki.mpich.org/mpich/index.php/Using_MPICH_in_Amazon_EC2

Ken

On 03/18/2016 02:09 AM, amelie chi zhou wrote:
> Hi, Pavan,
>
> Attached is a little summary on "How to run MPICH on Amazon EC2". I'm
> not sure whether it's clear enough. Please check.
>
> Regards,
> Amelie
>
> On Thu, Mar 17, 2016 at 9:22 AM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
>
>     Sure. I'm glad to.
>
>     Amelie
>
>      > On 17 Mar 2016, at 1:58 AM, Balaji, Pavan <balaji at anl.gov> wrote:
>      >
>      > Hi Amelie,
>      >
>      > Would you be willing to write up some documentation on "How to
>     use MPICH on Amazon EC2" including details on using servers in a
>     single region vs. multiple regions?  We'd like to put this up on our
>     FAQ page.
>      >
>      > Thanks,
>      >
>      >  -- Pavan
>      >
>      >> On Mar 16, 2016, at 2:53 AM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
>      >>
>      >> Hi, Pavan,
>      >>
>      >> Thanks a lot. It does work now!
>      >>
>      >> Best Regards,
>      >> Amelie
>      >>
>      >> On Wed, Mar 16, 2016 at 3:26 PM, Balaji, Pavan <balaji at anl.gov> wrote:
>      >> Amelie,
>      >>
>      >> OK, I just tried it across two different subnets.  Here's the
>     problem --
>      >>
>      >> The Amazon compute nodes hide their public IP addresses from the
>     list of IP addresses visible locally.  So when each node queries for
>     its local IP address, it only gets its private IP address (which
>     processes outside its private network cannot connect to).
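
To illustrate why the locally visible address is of no use to peers in another region, here is a minimal shell sketch that checks whether an IPv4 address lies in one of the RFC 1918 private ranges. The function name is_private_ipv4 is hypothetical (it is not part of MPICH or any AWS tool), and it assumes a well-formed dotted-quad argument.

```shell
# Return success (0) if the dotted-quad IPv4 address in $1 falls in an
# RFC 1918 private range (10/8, 172.16/12, 192.168/16), failure (1) otherwise.
# Hypothetical helper for illustration; assumes $1 is a well-formed address.
is_private_ipv4() {
    case "$1" in
        10.*|192.168.*) return 0 ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}

# 172.31.28.127 is the kind of address the instances in this thread see locally.
is_private_ipv4 172.31.28.127 && echo "172.31.28.127 is private"
is_private_ipv4 52.36.15.57 || echo "52.36.15.57 is public"
```

An address that passes this check is only routable inside its own private network, so a process in a different region cannot use it to connect back.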
>      >>
>      >> You can work around this by making two changes to your environment:
>      >>
>      >> 1. Explicitly use the public IP addresses directly instead of
>     the hostnames in your hostfile.  That is, instead of
>     "ec2-52-36-15-57.us-west-2.compute.amazonaws.com", use
>     "52.36.15.57".
>      >>
>      >> 2. Pass the -localhost option to mpiexec to give the public IP
>     address of the host from which you are running mpiexec.
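
The first change above can be scripted when a hostfile already lists EC2 public DNS names. The following is a rough sketch, assuming the names follow the ec2-A-B-C-D.<suffix> pattern seen in this thread; ec2_host_to_ip is a hypothetical helper name, and names that do not match the pattern are rejected rather than guessed at.

```shell
# Convert an EC2 public DNS name of the form ec2-A-B-C-D.<suffix> into the
# dotted-quad address A.B.C.D. Prints nothing and returns 1 if the name does
# not match that pattern. Hypothetical helper; relies only on POSIX sed.
ec2_host_to_ip() {
    ip=$(printf '%s\n' "$1" |
        sed -n 's/^ec2-\([0-9]\{1,3\}\)-\([0-9]\{1,3\}\)-\([0-9]\{1,3\}\)-\([0-9]\{1,3\}\)\..*$/\1.\2.\3.\4/p')
    [ -n "$ip" ] || return 1
    printf '%s\n' "$ip"
}

ec2_host_to_ip ec2-52-36-15-57.us-west-2.compute.amazonaws.com
# prints: 52.36.15.57
```

This only rewrites the name textually; it does not query DNS, so it works the same from inside or outside EC2.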
>      >>
>      >> I created two VM instances, one on the west subnet and the other
>     on the east subnet:
>      >>
>      >> ec2-52-35-56-228.us-west-2.compute.amazonaws.com
>      >> ec2-54-172-35-159.compute-1.amazonaws.com
>      >>
>      >> To run my application, I do this:
>      >>
>      >> % ./install/bin/mpiexec -localhost 52.35.56.228 -hosts 52.35.56.228,54.172.35.159 -n 2 ./examples/cpi
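
For more than two nodes, the same command line can be assembled from a list of public IPs. This is a sketch only: build_mpiexec_cmd is a hypothetical helper, it assumes the first IP given belongs to the node where mpiexec is launched, and it uses just the -localhost, -hosts, and -n options shown above. It prints the command instead of running it so the result can be inspected first.

```shell
# Print an mpiexec command line for the given process count, program, and
# public IPs. The first IP is assumed to be the node running mpiexec, so it
# is passed to -localhost and also listed first in -hosts.
build_mpiexec_cmd() {
    nprocs=$1
    program=$2
    shift 2
    [ "$#" -ge 1 ] || return 1
    launch_ip=$1
    hosts=$1
    shift
    # Append the remaining IPs as a comma-separated list.
    for ip in "$@"; do
        hosts="$hosts,$ip"
    done
    printf 'mpiexec -localhost %s -hosts %s -n %s %s\n' \
        "$launch_ip" "$hosts" "$nprocs" "$program"
}

build_mpiexec_cmd 2 ./examples/cpi 52.35.56.228 54.172.35.159
# prints: mpiexec -localhost 52.35.56.228 -hosts 52.35.56.228,54.172.35.159 -n 2 ./examples/cpi
```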
>      >>
>      >> Let us know if that works.
>      >>
>      >>  -- Pavan
>      >>
>      >>> On Mar 16, 2016, at 12:24 AM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
>      >>>
>      >>> Hi, Pavan,
>      >>>
>      >>> Thanks a lot for trying that.
>      >>>
>      >>> I have enabled inbound traffic for all protocols, including
>     TCP, UDP, and ICMP, on all ports (0 - 65535) and from all IP
>     addresses. I noticed that the two instances you created are in the
>     same region (US West, I suppose). For instances in the same region,
>     mpiexec runs without problems in my setup. But when I run MPI
>     programs across regions, in my case between an instance in US East
>     and an instance in US West, the error in MPI_Send appears.
>      >>> It seems that there might be some problems with the firewall or
>     network interfaces, but I have checked and ruled out those
>     possibilities (instances in different regions can ssh and scp to
>     each other and there's no dropping rule in my firewall setting). So
>     that's where I'm confused.
>      >>>
>      >>> Regards,
>      >>> Amelie
>      >>>
>      >>>> On Wed, Mar 16, 2016 at 1:01 PM, Balaji, Pavan <balaji at anl.gov> wrote:
>      >>>>
>      >>>> On Mar 15, 2016, at 10:26 PM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
>      >>>> Here is the full output info. Thanks!
>      >>>
>      >>> The IP addresses and ports seem to be correctly set up, so
>     that's not the problem.
>      >>>
>      >>> I created my own Amazon instances to see what the problem is.
>     It looks like the instances are not able to communicate even though
>     no firewall is visibly enabled inside the Linux instance.  I did
>     some digging, found the "Security group" settings, and saw that the
>     inbound rules only allowed ssh.  I changed it to "All traffic" and
>     now I can run my jobs fine.
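
Because a working ssh session only shows that port 22 is open, probing some other TCP port from a peer node is one way to spot a security group that allows ssh alone. The sketch below assumes bash (for its /dev/tcp pseudo-device) and the coreutils timeout command are available; tcp_port_open is a hypothetical helper, and port 50000 in the commented usage is an arbitrary example, not a port MPICH is known to use.

```shell
# Return success if a TCP connection to host $1, port $2 can be opened
# within $3 seconds (default 5). A refused or timed-out connection returns
# non-zero. Assumes bash and coreutils timeout are installed.
tcp_port_open() {
    timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example usage, run from one instance against the other, with something
# listening on the chosen port on the far side:
#   tcp_port_open 54.172.35.159 22    && echo "ssh port reachable"
#   tcp_port_open 54.172.35.159 50000 || echo "port 50000 blocked or closed"
```

A failure here does not by itself distinguish a blocked port from one with no listener, so it is best used with a known listener running on the remote side.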
>      >>>
>      >>> % ./install/bin/mpiexec -hosts ec2-52-36-15-57.us-west-2.compute.amazonaws.com,ec2-52-37-222-189.us-west-2.compute.amazonaws.com -n 4 ./examples/cpi
>      >>> Process 3 of 4 is on ip-172-31-28-127
>      >>> Process 2 of 4 is on ip-172-31-21-12
>      >>> Process 1 of 4 is on ip-172-31-28-127
>      >>> Process 0 of 4 is on ip-172-31-21-12
>      >>> pi is approximately 3.1415926544231243, Error is 0.0000000008333312
>      >>> wall clock time = 0.010181
>      >>>
>      >>> Can you try that?
>      >>>
>      >>>  -- Pavan
>      >>>
>      >>> _______________________________________________
>      >>> discuss mailing list discuss at mpich.org <mailto:discuss at mpich.org>
>      >>> To manage subscription options or unsubscribe:
>      >>> https://lists.mpich.org/mailman/listinfo/discuss
>      >>>

