[mpich-discuss] MPID_nem_tcp_connpoll(1835): Communication error with rank 1: Connection timed out

amelie chi zhou amelie.czhou at gmail.com
Wed Mar 16 02:53:16 CDT 2016


Hi, Pavan,

Thanks a lot. It does work now!

Best Regards,
Amelie

On Wed, Mar 16, 2016 at 3:26 PM, Balaji, Pavan <balaji at anl.gov> wrote:

> Amelie,
>
> OK, I just tried it across two different subnets.  Here's the problem --
>
> The Amazon compute nodes hide their public IP addresses from the list of
> IP addresses visible locally.  So when each node queries for its local IP
> address, it only gets its private IP address (which is obviously useless
> for processes in another region to connect to).
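
To illustrate why the locally visible address cannot serve cross-region connections, here is a minimal Python sketch. It is not part of MPICH, and the helper names are made up for illustration. It assumes the internal host names follow the ip-A-B-C-D pattern that shows up in the cpi output further down in this thread, and it uses the standard ipaddress module to tell private addresses from public ones.

```python
import ipaddress


def private_hostname_to_ip(hostname: str) -> str:
    """Turn an internal name such as 'ip-172-31-21-12' into '172.31.21.12'.

    Assumption: the name encodes the address as 'ip-' plus dash-separated octets.
    """
    prefix = "ip-"
    if not hostname.startswith(prefix):
        raise ValueError(f"unexpected internal host name: {hostname!r}")
    return ".".join(hostname[len(prefix):].split("-"))


def is_publicly_routable(address: str) -> bool:
    """Return False for private-range (and loopback) addresses, True otherwise."""
    return not ipaddress.ip_address(address).is_private


if __name__ == "__main__":
    # Internal names taken from the cpi output quoted later in the thread.
    for name in ("ip-172-31-21-12", "ip-172-31-28-127"):
        addr = private_hostname_to_ip(name)
        print(name, addr, is_publicly_routable(addr))
    # One of the public addresses used in the workaround.
    print("52.35.56.228", is_publicly_routable("52.35.56.228"))
```

Both 172.31.x.x addresses fall in the private 172.16.0.0/12 range, so a peer in another region has no route to them, while peers inside the same private network can still reach them.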
>
> You can work around this by making two changes to your environment:
>
> 1. Explicitly use the public IP addresses directly instead of the
> hostnames in your hostfile.  That is, instead of
> "ec2-52-36-15-57.us-west-2.compute.amazonaws.com", use "52.36.15.57".
>
> 2. Pass the -localhost option to mpiexec to give the public IP address of
> the host from which you are running mpiexec.
>
> I created two VM instances, one on the west subnet and the other on the
> east subnet:
>
> ec2-52-35-56-228.us-west-2.compute.amazonaws.com
> ec2-54-172-35-159.compute-1.amazonaws.com
>
> To run my application, I do this:
>
> % ./install/bin/mpiexec -localhost 52.35.56.228 -hosts 52.35.56.228,54.172.35.159 -n 2 ./examples/cpi
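
The two changes above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, and the conversion assumes the public DNS name embeds the public IP as ec2-A-B-C-D, as in the example given in step 1. The only mpiexec options used are the ones that appear in the command above (-localhost, -hosts, -n).

```python
import re

# Assumption: a public EC2 DNS name starts with 'ec2-A-B-C-D.' where A.B.C.D
# is the public IP address, as in the example from step 1.
_EC2_PUBLIC_NAME = re.compile(r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.")


def public_ip_from_ec2_name(hostname: str) -> str:
    """Extract the dotted public IP embedded in an EC2 public DNS name."""
    match = _EC2_PUBLIC_NAME.match(hostname)
    if match is None:
        raise ValueError(f"not an EC2 public DNS name: {hostname!r}")
    return ".".join(match.groups())


def build_mpiexec_args(mpiexec, launch_host, hosts, nprocs, program):
    """Build an argument list that uses public IPs instead of host names."""
    ips = [public_ip_from_ec2_name(h) for h in hosts]
    return [
        mpiexec,
        "-localhost", public_ip_from_ec2_name(launch_host),  # host running mpiexec
        "-hosts", ",".join(ips),
        "-n", str(nprocs),
        program,
    ]


if __name__ == "__main__":
    hosts = [
        "ec2-52-35-56-228.us-west-2.compute.amazonaws.com",
        "ec2-54-172-35-159.compute-1.amazonaws.com",
    ]
    args = build_mpiexec_args(
        "./install/bin/mpiexec", hosts[0], hosts, 2, "./examples/cpi"
    )
    print(" ".join(args))
```

With the two instance names listed above, the printed line is the same command that was run by hand. The list form can be handed to subprocess.run without shell quoting concerns.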
>
> Let us know if that works.
>
>   -- Pavan
>
> > On Mar 16, 2016, at 12:24 AM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
> >
> > Hi, Pavan,
> >
> > Thanks a lot for trying that.
> >
> > I have enabled inbound traffic for all protocols, including TCP, UDP,
> > and ICMP, on all ports (0 - 65535) and from all IP addresses. I noticed
> > that the two instances you created are in the same region (US West, I
> > suppose). For instances in the same region, mpiexec runs successfully
> > in my setup. But when I run MPI programs across regions, in my case
> > between an instance in US East and an instance in US West, the error in
> > MPI_Send appears.
> > It seems that there might be a problem with the firewall or the network
> > interfaces, but I have checked and ruled out those possibilities:
> > instances in different regions can ssh and scp to each other, and there
> > is no dropping rule in my firewall settings. So that's where I'm confused.
> >
> > Regards,
> > Amelie
> >
> > On Wed, Mar 16, 2016 at 1:01 PM, Balaji, Pavan <balaji at anl.gov> wrote:
> >
> > > On Mar 15, 2016, at 10:26 PM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
> > > Here is the full output info. Thanks!
> >
> > The IP addresses and ports seem to be set up correctly, so that's not
> > the problem.
> >
> > I created my own Amazon instances to see what the problem is.  It looks
> > like the instances are not able to communicate even though no explicit
> > firewall is enabled inside the Linux instance.  I did some digging, found
> > the "Security group" settings, and saw that the inbound rules only
> > allowed ssh.  I changed it to "All traffic" and now I can run my jobs
> > fine.
> >
> > % ./install/bin/mpiexec -hosts ec2-52-36-15-57.us-west-2.compute.amazonaws.com,ec2-52-37-222-189.us-west-2.compute.amazonaws.com -n 4 ./examples/cpi
> > Process 3 of 4 is on ip-172-31-28-127
> > Process 2 of 4 is on ip-172-31-21-12
> > Process 1 of 4 is on ip-172-31-28-127
> > Process 0 of 4 is on ip-172-31-21-12
> > pi is approximately 3.1415926544231243, Error is 0.0000000008333312
> > wall clock time = 0.010181
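
A working ssh session only shows that port 22 is open, so an inbound rule limited to ssh can still block the TCP connections MPI processes open between themselves. A rough way to probe another port from the peer instance is sketched below in Python. The function name is hypothetical, and the thread does not say which ports MPICH picked, so any host and port passed in are placeholders to be replaced with real values.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Local self-check: listen on an ephemeral port on this machine, then
    # probe it. For a real test, run a listener on one instance and call
    # tcp_port_open(<peer public IP>, <port>) from the other.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))
        server.listen(1)
        port = server.getsockname()[1]
        print(tcp_port_open("127.0.0.1", port))
```

A timeout here, as opposed to an immediate refusal, is the typical sign that a filter in front of the instance is dropping packets, which matches the "Connection timed out" in the subject line.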
> >
> > Can you try that?
> >
> >   -- Pavan
> >
> > _______________________________________________
> > discuss mailing list     discuss at mpich.org
> > To manage subscription options or unsubscribe:
> > https://lists.mpich.org/mailman/listinfo/discuss