[mpich-discuss] MPID_nem_tcp_connpoll(1835): Communication error with rank 1: Connection timed out

Balaji, Pavan balaji at anl.gov
Wed Mar 16 02:26:26 CDT 2016


Amelie,

OK, I just tried it across two different subnets.  Here's the problem --

The Amazon compute nodes hide their public IP addresses from the list of IP addresses visible locally.  So when each node queries for its local IP address, it only gets its private IP address, which processes outside that private network cannot connect to.
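
To make the private/public distinction concrete, here is a minimal sketch using Python's standard ipaddress module.  The helper name is hypothetical, and I'm assuming the "ip-172-31-28-127" style hostnames printed by the instances encode their private addresses:

```python
import ipaddress


def is_private_address(addr: str) -> bool:
    """Return True if addr falls in a range reserved for private networks."""
    return ipaddress.ip_address(addr).is_private


# Address assumed to be what a node sees locally (from "ip-172-31-28-127").
print(is_private_address("172.31.28.127"))
# Public address of one of the test instances.
print(is_private_address("52.35.56.228"))
```

An address for which this returns True is only routable inside its own private network, so a process in another region cannot connect to it.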

You can work around this by making two changes to your environment:

1. Explicitly use the public IP addresses directly instead of the hostnames in your hostfile.  That is, instead of "ec2-52-36-15-57.us-west-2.compute.amazonaws.com", use "52.36.15.57".

2. Pass the -localhost option to mpiexec to give the public IP address of the host from which you are running mpiexec.
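
If you have many hosts, the rewrite in step 1 can be scripted.  This is only a rough sketch: it assumes the public DNS name embeds the public IP the way the example above does ("ec2-A-B-C-D.<region>..."), and the function name is made up for illustration.  If a name doesn't follow that pattern, look up the public IP in the EC2 console instead.

```python
import re

# Matches the leading "ec2-A-B-C-D." part of an EC2 public DNS name.
_EC2_NAME = re.compile(r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.")


def ec2_hostname_to_ip(hostname: str) -> str:
    """Extract the dotted public IP embedded in an EC2 public DNS name."""
    match = _EC2_NAME.match(hostname)
    if match is None:
        raise ValueError(f"not an ec2-A-B-C-D style hostname: {hostname!r}")
    return ".".join(match.groups())


print(ec2_hostname_to_ip("ec2-52-36-15-57.us-west-2.compute.amazonaws.com"))
```

The function does a pure string conversion and never resolves the name, because resolving it from inside EC2 can hand back the private address again.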

I created two VM instances, one on the west subnet and the other on the east subnet:

ec2-52-35-56-228.us-west-2.compute.amazonaws.com
ec2-54-172-35-159.compute-1.amazonaws.com

To run my application, I do this:

% ./install/bin/mpiexec -localhost 52.35.56.228 -hosts 52.35.56.228,54.172.35.159 -n 2 ./examples/cpi
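
If you launch from a script, the same command line can be assembled from a list of public IPs.  A minimal sketch, assuming Python 3.8+ for shlex.join; the function name and its parameters are hypothetical, and the only mpiexec options used are the ones shown in the command above:

```python
import shlex


def build_mpiexec_command(mpiexec, local_ip, host_ips, nprocs, program):
    """Build the mpiexec command line using public IPs only."""
    argv = [
        mpiexec,
        "-localhost", local_ip,        # public IP of the launching host
        "-hosts", ",".join(host_ips),  # public IPs instead of EC2 hostnames
        "-n", str(nprocs),
        program,
    ]
    return shlex.join(argv)


print(build_mpiexec_command(
    "./install/bin/mpiexec",
    "52.35.56.228",
    ["52.35.56.228", "54.172.35.159"],
    2,
    "./examples/cpi",
))
```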

Let us know if that works.

  -- Pavan

> On Mar 16, 2016, at 12:24 AM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
> 
> Hi, Pavan,
> 
> Thanks a lot for trying that. 
> 
> I have enabled inbound traffic for all protocols, including TCP, UDP and ICMP, on all ports (0 - 65535) and from all IP addresses. I noticed that the two instances you created are in the same region (US West, I suppose). For instances in the same region, mpiexec runs successfully in my setup. But when I run MPI programs across regions, in my case between an instance in US East and an instance in US West, the error in MPI_Send appears.
> It looks like a problem with the firewall or the network interfaces, but I have checked and ruled out those possibilities (instances in different regions can ssh and scp to each other, and there is no drop rule in my firewall settings). So that's where I'm confused.
> 
> Regards,
> Amelie
> 
> On Wed, Mar 16, 2016 at 1:01 PM, Balaji, Pavan <balaji at anl.gov> wrote:
> 
> > On Mar 15, 2016, at 10:26 PM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
> > Here is the full output info. Thanks!
> 
> The IP addresses and ports seem to be set up correctly, so that's not the problem.
> 
> I created my own Amazon instances to see what the problem is.  It looks like the instances are not able to communicate even though no explicit firewall shows up as enabled inside the Linux instance.  I did some digging in the "Security group" settings and found that the inbound rules only allowed ssh.  I changed them to "All traffic" and now I can run my jobs fine.
> 
> % ./install/bin/mpiexec -hosts ec2-52-36-15-57.us-west-2.compute.amazonaws.com,ec2-52-37-222-189.us-west-2.compute.amazonaws.com -n 4 ./examples/cpi
> Process 3 of 4 is on ip-172-31-28-127
> Process 2 of 4 is on ip-172-31-21-12
> Process 1 of 4 is on ip-172-31-28-127
> Process 0 of 4 is on ip-172-31-21-12
> pi is approximately 3.1415926544231243, Error is 0.0000000008333312
> wall clock time = 0.010181
> 
> Can you try that?
> 
>   -- Pavan
> 
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
> 
