[mpich-discuss] MPID_nem_tcp_connpoll(1835): Communication error with rank 1: Connection timed out
Kenneth Raffenetti
raffenet at mcs.anl.gov
Tue Mar 15 11:11:01 CDT 2016
I suspect that there is still a firewall in the way given that the EC2
instances are in different regions. One way to test your security group
rules without MPI would be to try to establish a connection between the
2 machines on a high TCP port (e.g. 10000) with a simple utility like
netcat (https://en.wikipedia.org/wiki/Netcat).
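
If netcat is not installed on the instances, the same check can be sketched in C with plain POSIX sockets. This is only an illustration under stated assumptions: `tcp_probe` is a hypothetical helper name (it is not part of MPICH), and the port and timeout you pass are arbitrary choices. It attempts a TCP connection and returns the resulting errno value, so a port that is silently filtered would be expected to show up as ETIMEDOUT, much like the MPICH error, while a reachable host with nothing listening would typically give ECONNREFUSED.

```c
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of MPICH): try to open a TCP connection to
 * host:port, giving up after timeout_sec seconds.
 * Returns 0 on success, an errno value (e.g. ETIMEDOUT, ECONNREFUSED) if the
 * connection fails, or -1 if the host/port could not be resolved. */
int tcp_probe(const char *host, const char *port, int timeout_sec)
{
    struct addrinfo hints, *res, *ai;
    int result = -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;     /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* TCP */
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    /* Try each resolved address until one connects. */
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0) {
            result = errno;
            continue;
        }

        /* Non-blocking connect so we can enforce our own timeout. */
        int flags = fcntl(fd, F_GETFL, 0);
        fcntl(fd, F_SETFL, flags | O_NONBLOCK);

        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
            result = 0; /* connected immediately */
        } else if (errno == EINPROGRESS) {
            fd_set wfds;
            struct timeval tv = { timeout_sec, 0 };
            FD_ZERO(&wfds);
            FD_SET(fd, &wfds);
            int n = select(fd + 1, NULL, &wfds, NULL, &tv);
            if (n > 0) {
                /* Socket became writable: read the final connect status. */
                int err = 0;
                socklen_t len = sizeof err;
                if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
                    err = errno;
                result = err;
            } else if (n == 0) {
                result = ETIMEDOUT; /* no answer within timeout_sec */
            } else {
                result = errno;     /* select() itself failed */
            }
        } else {
            result = errno;         /* immediate failure */
        }

        close(fd);
        if (result == 0)
            break;
    }

    freeaddrinfo(res);
    return result;
}
```

To use it, start any TCP listener on port 10000 on one instance, then call tcp_probe() from a small main() on the other instance with the first instance's address, "10000", and a timeout of a few seconds. Passing the return value to strerror() gives a readable message when it is positive.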
Ken
On 03/15/2016 10:38 AM, amelie chi zhou wrote:
> Hi, Ken,
>
> Thanks for the reply.
> What kind of problem are you referring to?
> In the security group rules, I allow TCP connections from all IP addresses on all ports. The two machines can also ssh and scp to each other with no problem. Security is not my major concern in this simple test.
>
> Regards,
> Amelie
>> On 15 Mar 2016, at 10:23 PM, Kenneth Raffenetti <raffenet at mcs.anl.gov> wrote:
>>
>> The different regions are a problem in this setup. Note that security groups in EC2 are *per region*.
>>
>> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#default-security-group
>>
>> I'll note that using MPI across the internet like this is a bad idea if you have concerns about security.
>>
>> Ken
>>
>>> On 03/15/2016 06:16 AM, amelie chi zhou wrote:
>>> Hi,
>>>
>>> I configured two virtual machines on Amazon EC2 to run mpich-3.2. The
>>> system is Ubuntu 12.04.2 LTS.
>>>
>>> The two virtual machines can ssh to each other successfully
>>> (passwordless) and I can run a simple hello world program using the two
>>> machines.
>>>
>>> ubuntu at ip-10-169-125-85:~$ mpiexec -n 2 -f host_file ./hello_world
>>> Hello world from processor ip-10-169-125-85, rank 1 out of 2 processors
>>> Hello world from processor ip-10-235-37-156, rank 0 out of 2 processors
>>>
>>> Then I ran a simple program that uses MPI_Send and MPI_Recv to communicate
>>> between the two VMs. The core of the program is as follows.
>>>
>>> if (world_rank == 0) {
>>>     // If we are rank 0, set the number to -1 and send it to process 1
>>>     number = -1;
>>>     MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
>>> } else if (world_rank == 1) {
>>>     // If we are rank 1, block until the number arrives from process 0
>>>     MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>     printf("Process 1 received number %d from process 0\n", number);
>>> }
>>>
>>>
>>> The following is the error message I encountered.
>>>
>>> ubuntu at ip-10-169-125-85:~$ mpiexec -n 2 -f host_file ./send_recv
>>> Fatal error in MPI_Send: Unknown error class, error stack:
>>> MPI_Send(174)..............: MPI_Send(buf=0x7fff49f2759c, count=1, MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed
>>> MPID_nem_tcp_connpoll(1835): Communication error with rank 1: Connection timed out
>>>
>>>
>>> I googled similar errors and have made sure that: 1) there are no rules in
>>> my firewall settings, 2) there is a TCP port listening on both sides when
>>> the send_recv program runs. I cannot think of any other possible way to
>>> fix this problem. BTW, the two virtual machines are in two different
>>> regions of Amazon EC2 and are not in VPCs. Please help. Thanks!
>>>
>>> Regards,
>>> Amelie
>>>
>>>
>>> _______________________________________________
>>> discuss mailing list discuss at mpich.org
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss