[mpich-discuss] MPID_nem_tcp_connpoll(1835): Communication error with rank 1: Connection timed out

amelie chi zhou amelie.czhou at gmail.com
Wed Mar 16 00:24:24 CDT 2016


Hi, Pavan,

Thanks a lot for trying that.

I have already enabled inbound traffic for all protocols (TCP, UDP, and
ICMP), on all ports (0-65535), and from all IP addresses. I noticed that
the two instances you created are in the same region (US West, I suppose).
In my setup, mpiexec also runs without problems when both instances are in
the same region. The error in MPI_Send only appears when I run MPI programs
across regions, in my case between an instance in US East and an instance
in US West.

This looks like a firewall or network interface problem, but as far as I
can tell I have ruled those out: instances in different regions can ssh and
scp to each other, and there are no drop rules in my firewall settings. So
that's where I'm confused.
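
One caveat on the ssh/scp check: it only exercises port 22, so it does not
show that other TCP ports are reachable between the regions. Below is a
minimal sketch of a probe for an arbitrary TCP port, assuming Python 3 is
available on both instances. The function name can_connect is a
hypothetical helper for illustration, not part of MPICH.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

To use it, start a listener on some port on one instance and call
can_connect from the instance in the other region with that instance's
address and port (both are placeholders to fill in). If the call returns
False while port 22 still works, the path between the regions is blocking
or misrouting that port.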

Regards,
Amelie

On Wed, Mar 16, 2016 at 1:01 PM, Balaji, Pavan <balaji at anl.gov> wrote:

>
> > On Mar 15, 2016, at 10:26 PM, amelie chi zhou <amelie.czhou at gmail.com> wrote:
> > Here is the full output info. Thanks!
>
> The IP addresses and ports seem to be set up correctly, so that's not the
> problem.
>
> I created my own Amazon instances to see what the problem is. It looks
> like the instances are not able to communicate, even though no explicit
> firewall shows up inside the Linux instance. I did some digging in the
> "Security group" settings and found that the inbound rules only allowed
> ssh. I changed them to "All traffic" and now I can run my jobs fine.
>
> % ./install/bin/mpiexec \
>     -hosts ec2-52-36-15-57.us-west-2.compute.amazonaws.com,ec2-52-37-222-189.us-west-2.compute.amazonaws.com \
>     -n 4 ./examples/cpi
> Process 3 of 4 is on ip-172-31-28-127
> Process 2 of 4 is on ip-172-31-21-12
> Process 1 of 4 is on ip-172-31-28-127
> Process 0 of 4 is on ip-172-31-21-12
> pi is approximately 3.1415926544231243, Error is 0.0000000008333312
> wall clock time = 0.010181
>
> Can you try that?
>
>   -- Pavan
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>