[mpich-discuss] MPID_nem_tcp_connpoll(1835): Communication error with rank 1: Connection timed out

amelie chi zhou amelie.czhou at gmail.com
Tue Mar 15 10:38:56 CDT 2016


Hi, Ken,

Thanks for the reply.
What kind of problem are you referring to? 
In the security group rules, I allow TCP connections from all IP addresses on all ports. The two machines can also ssh and scp to each other without any problem. Security is not my main concern in this simple test.

Regards,
Amelie
> On 15 Mar 2016, at 10:23 PM, Kenneth Raffenetti <raffenet at mcs.anl.gov> wrote:
> 
> The different regions are a problem in this setup. Note that security groups in EC2 are *per region*.
> 
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#default-security-group
> 
> I'll note that using MPI across the internet like this is a bad idea if you have concerns about security.
> 
> Ken
> 
>> On 03/15/2016 06:16 AM, amelie chi zhou wrote:
>> Hi,
>> 
>> I configured two virtual machines on Amazon EC2 to run mpich-3.2. The
>> system is Ubuntu 12.04.2 LTS.
>> 
>> The two virtual machines can ssh to each other successfully
>> (passwordless) and I can run a simple hello world program using the two
>> machines.
>> 
>> ubuntu@ip-10-169-125-85:~$ mpiexec -n 2 -f host_file ./hello_world
>> Hello world from processor ip-10-169-125-85, rank 1 out of 2 processors
>> Hello world from processor ip-10-235-37-156, rank 0 out of 2 processors
>> 
>> Then I ran a simple program that uses MPI_Send and MPI_Recv to communicate
>> between the two VMs. The core of the program is:
>> 
>>   if (world_rank == 0) {
>>     // If we are rank 0, set the number to -1 and send it to process 1
>>     number = -1;
>>     MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
>>   } else if (world_rank == 1) {
>>     MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>     printf("Process 1 received number %d from process 0\n", number);
>>   }
>> 
>> 
>> This is the error message I encountered:
>> 
>> ubuntu@ip-10-169-125-85:~$ mpiexec -n 2 -f host_file ./send_recv
>> Fatal error in MPI_Send: Unknown error class, error stack:
>> MPI_Send(174)..............: MPI_Send(buf=0x7fff49f2759c, count=1, MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed
>> MPID_nem_tcp_connpoll(1835): Communication error with rank 1: Connection timed out
>> 
>> 
>> I searched for similar errors and have made sure that: 1) no rules are set
>> in my firewall, and 2) a TCP port is listening on both sides while
>> the send_recv program runs. I cannot think of any other way to
>> fix this problem. BTW, the two virtual machines are in two different
>> regions of Amazon EC2 and are not in VPCs. Please help. Thanks!
>> 
>> Regards,
>> Amelie
>> 
>> 
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss