[mpich-discuss] MPID_nem_tcp_connpoll(1835): Communication error with rank 1: Connection timed out
Kenneth Raffenetti
raffenet at mcs.anl.gov
Fri Mar 18 10:41:39 CDT 2016
Thanks for this, Amelie. I've created a wiki page from your instructions
and linked it from the FAQ.
https://wiki.mpich.org/mpich/index.php/Using_MPICH_in_Amazon_EC2
Ken
On 03/18/2016 02:09 AM, amelie chi zhou wrote:
> Hi, Pavan,
>
> Attached is a little summary on "How to run MPICH on Amazon EC2". I'm
> not sure whether it's clear enough. Please check.
>
> Regards,
> Amelie
>
> On Thu, Mar 17, 2016 at 9:22 AM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
>
> Sure. I'm glad to.
>
> Amelie
>
> > On 17 Mar 2016, at 1:58 AM, Balaji, Pavan <balaji at anl.gov> wrote:
> >
> > Hi Amelie,
> >
> > Would you be willing to write up some documentation on "How to use
> > MPICH on Amazon EC2", including details on using servers in a single
> > region vs. multiple regions? We'd like to put this up on our FAQ page.
> >
> > Thanks,
> >
> > -- Pavan
> >
> >> On Mar 16, 2016, at 2:53 AM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
> >>
> >> Hi, Pavan,
> >>
> >> Thanks a lot. It does work now!
> >>
> >> Best Regards,
> >> Amelie
> >>
> >> On Wed, Mar 16, 2016 at 3:26 PM, Balaji, Pavan <balaji at anl.gov> wrote:
> >> Amelie,
> >>
> >> OK, I just tried it across two different subnets. Here's the problem:
> >>
> >> The Amazon compute nodes hide their public IP addresses from the
> >> list of IP addresses visible locally. So when each node queries for
> >> its local IP address, it gets only its private IP address, which
> >> processes outside its private network cannot connect to.
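
One way to see the mismatch described above on an instance is to compare the public address with the addresses bound locally. The following is only a sketch under assumptions: the helper names are hypothetical, and it assumes the EC2 instance metadata endpoint answers without a session token, which may not hold on newer instances.

```shell
# Print the public IPv4 address as reported by the EC2 instance metadata
# service. Assumption: the endpoint is reachable without a session token.
ec2_public_ip() {
    curl -s --max-time 2 http://169.254.169.254/latest/meta-data/public-ipv4
}

# Hypothetical helper: print "yes" if address $1 appears in the
# whitespace-separated address list $2, otherwise print "no".
addr_in_list() {
    for a in $2; do
        if [ "$a" = "$1" ]; then
            echo yes
            return 0
        fi
    done
    echo no
}
```

On a Linux instance where `hostname -I` is available, `addr_in_list "$(ec2_public_ip)" "$(hostname -I)"` would be expected to print "no" in the situation described above, since only the private address is bound to a local interface.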
> >>
> >> You can work around this by making two changes to your environment:
> >>
> >> 1. Use the public IP addresses directly instead of the hostnames
> >> in your hostfile. That is, instead of
> >> "ec2-52-36-15-57.us-west-2.compute.amazonaws.com", use "52.36.15.57".
> >>
> >> 2. Pass the -localhost option to mpiexec to give the public IP
> >> address of the host from which you are running mpiexec.
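
For step 1, the public IP appears to be embedded in the EC2 public hostname, so an existing hostfile could be converted mechanically. This is a minimal sketch, assuming every hostname follows the ec2-A-B-C-D.<...> pattern seen in this thread; the function names are hypothetical and lines that do not match are passed through unchanged.

```shell
# Hypothetical helper: extract the IPv4 address embedded in an EC2 public
# DNS name such as ec2-52-36-15-57.us-west-2.compute.amazonaws.com.
# Prints nothing if the name does not match the assumed pattern.
ec2_host_to_ip() {
    printf '%s\n' "$1" |
        sed -n -E 's/^ec2-([0-9]+)-([0-9]+)-([0-9]+)-([0-9]+)\..*$/\1.\2.\3.\4/p'
}

# Read a hostfile on stdin (one host per line) and print it with EC2
# hostnames replaced by their embedded IPs; other lines are kept as-is.
hostfile_to_ips() {
    while IFS= read -r host; do
        ip=$(ec2_host_to_ip "$host")
        printf '%s\n' "${ip:-$host}"
    done
}
```

For example, `hostfile_to_ips < hostfile > hostfile.ip` would write a converted copy, leaving the original file untouched.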
> >>
> >> I created two VM instances, one on the west subnet and the other on
> >> the east subnet:
> >>
> >> ec2-52-35-56-228.us-west-2.compute.amazonaws.com
> >> ec2-54-172-35-159.compute-1.amazonaws.com
> >>
> >> To run my application, I do this:
> >>
> >> % ./install/bin/mpiexec -localhost 52.35.56.228 -hosts 52.35.56.228,54.172.35.159 -n 2 ./examples/cpi
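
With more than a couple of hosts, the command above could be assembled from a file of public IPs. The sketch below only prints the command line rather than running it; the function names are hypothetical, and it assumes the launching host's public IP is supplied by hand as the first argument.

```shell
# Join newline-separated hosts from stdin into a comma-separated list.
join_hosts() {
    paste -sd, -
}

# Print (do not run) an mpiexec command line in the form shown above.
# stdin: one public IP per line
# $1: public IP of the launching host, $2: process count, $3: program
print_mpiexec_cmd() {
    printf 'mpiexec -localhost %s -hosts %s -n %s %s\n' \
        "$1" "$(join_hosts)" "$2" "$3"
}
```

For example, `print_mpiexec_cmd 52.35.56.228 2 ./examples/cpi < hostfile.ip` prints a command that can be reviewed before it is run.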
> >>
> >> Let us know if that works.
> >>
> >> -- Pavan
> >>
> >>> On Mar 16, 2016, at 12:24 AM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
> >>>
> >>> Hi, Pavan,
> >>>
> >>> Thanks a lot for trying that.
> >>>
> >>> I have enabled inbound traffic for all protocols, including TCP,
> >>> UDP, and ICMP, on all ports (0 - 65535) and from all IP addresses. I
> >>> noticed that the two instances you created are in the same region
> >>> (US West, I suppose). For instances in the same region, mpiexec
> >>> runs with no problem in my setup. But when I run MPI programs
> >>> across regions, in my case between an instance in US East and an
> >>> instance in US West, the error in MPI_Send appears.
> >>> It seems that there might be some problem with the firewall or
> >>> network interfaces, but I have checked and ruled out those
> >>> possibilities (instances in different regions can ssh and scp to
> >>> each other, and there's no dropping rule in my firewall settings).
> >>> So that's where I'm confused.
> >>>
> >>> Regards,
> >>> Amelie
> >>>
> >>>> On Wed, Mar 16, 2016 at 1:01 PM, Balaji, Pavan <balaji at anl.gov> wrote:
> >>>>
> >>>> On Mar 15, 2016, at 10:26 PM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
> >>>> Here is the full output info. Thanks!
> >>>
> >>> The IP addresses and ports seem to be set up correctly, so that's
> >>> not the problem.
> >>>
> >>> I created my own Amazon instances to see what the problem is. It
> >>> looks like the instances are not able to communicate even though no
> >>> explicit firewall is shown as enabled inside the Linux instance. I
> >>> did some digging, found the "Security group" settings, and saw that
> >>> the inbound rules allowed only ssh. I changed the rule to "All
> >>> traffic" and now I can run my jobs fine.
> >>>
> >>> % ./install/bin/mpiexec -hosts ec2-52-36-15-57.us-west-2.compute.amazonaws.com,ec2-52-37-222-189.us-west-2.compute.amazonaws.com -n 4 ./examples/cpi
> >>> Process 3 of 4 is on ip-172-31-28-127
> >>> Process 2 of 4 is on ip-172-31-21-12
> >>> Process 1 of 4 is on ip-172-31-28-127
> >>> Process 0 of 4 is on ip-172-31-21-12
> >>> pi is approximately 3.1415926544231243, Error is 0.0000000008333312
> >>> wall clock time = 0.010181
> >>>
> >>> Can you try that?
> >>>
> >>> -- Pavan
> >>>
> >>> _______________________________________________
> >>> discuss mailing list discuss at mpich.org <mailto:discuss at mpich.org>
> >>> To manage subscription options or unsubscribe:
> >>> https://lists.mpich.org/mailman/listinfo/discuss