[mpich-discuss] Amazon ec2 Windows machine

Jayesh Krishna jayesh at mcs.anl.gov
Fri Feb 15 12:00:16 CST 2013


Hi,
Good to know MPICH is working for you. Are the hostnames of the two EC2 instances the same?
If you try to use ssm/shm with 1.4.1p1, mpiexec should abort with an error message.

(PS: MPICH_NO_LOCAL=1 forces the communication in nemesis to go through TCP sockets.)
Regards,
Jayesh
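
The hostname question above matters if, as seems likely, nemesis decides which ranks are "local" (and so reachable through shared memory) by comparing node names. The Python sketch below only illustrates that idea and is not MPICH code. The function name local_peers and the sample hostname are hypothetical, and the no_local flag merely mimics the effect described for MPICH_NO_LOCAL=1.

```python
from collections import defaultdict


def local_peers(rank_hosts, no_local=False):
    """Map each rank to the other ranks it would treat as shared-memory peers.

    rank_hosts: list where rank_hosts[i] is the hostname reported by rank i.
    no_local:   mimics MPICH_NO_LOCAL=1, i.e. no rank is treated as local.
    """
    if no_local:
        return {rank: [] for rank in range(len(rank_hosts))}

    # Group ranks by (case-insensitive) hostname.
    by_host = defaultdict(list)
    for rank, host in enumerate(rank_hosts):
        by_host[host.lower()].append(rank)

    # A rank's local peers are the other ranks that reported the same hostname.
    return {
        rank: [peer for peer in by_host[host.lower()] if peer != rank]
        for rank, host in enumerate(rank_hosts)
    }


if __name__ == "__main__":
    # Two different machines that happen to report the same hostname.
    print(local_peers(["WIN-EXAMPLE", "WIN-EXAMPLE"]))
    print(local_peers(["WIN-EXAMPLE", "WIN-EXAMPLE"], no_local=True))
```

Under this assumption, two separate instances with identical hostnames would each look for a shared-memory segment that exists only on the other machine, which would be consistent with the OpenFileMapping failure reported at the start of the thread.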
----- Original Message -----
From: "Nicholas Sgro" <nsgro060 at gmail.com>
To: "Jayesh Krishna" <jayesh at mcs.anl.gov>
Cc: discuss at mpich.org
Sent: Thursday, February 14, 2013 2:34:21 PM
Subject: Re: [mpich-discuss] Amazon ec2 Windows machine

Hi, 

Setting MPICH_NO_LOCAL to 1 and using nemesis seems to have solved the problem. If it interests you, I tried the sock channel, and that still does not work. As far as I am concerned, there are no longer any problems. 

I am fairly certain that I am using version 1.4.1p1 (according to wmpiconfig), so I am not sure why I can choose the shm and ssm channels (they are options in wmpiexec, and I don't get an error from the command line). 

Thanks for your help, 
Nicholas 


On Mon, Feb 11, 2013 at 12:12 PM, Jayesh Krishna < jayesh at mcs.anl.gov > wrote: 


Hi, 
The latest version of MPICH2 on Windows (1.4.1p1) does not have support for the shm and ssm channels. Are you sure you are using the latest version of MPICH2 on your machines? Please use the "-channel" option to select the channel ("nemesis"/"sock"). 
Can you try running the job by setting the environment variable "MPICH_NO_LOCAL" to 1 (mpiexec -n 2 -env MPICH_NO_LOCAL 1 c:\Progra~1\MPICH2\examples\cpi.exe)? This option should force all communication to go via TCP sockets. 
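
For anyone scripting these runs, here is a minimal Python sketch that assembles the same command line. It uses only the options mentioned in this thread (-machinefile, -channel, -n, -env). The helper name build_mpiexec_args is hypothetical, and the position chosen for -channel is an assumption rather than something confirmed here.

```python
def build_mpiexec_args(program, nprocs=2, channel=None, no_local=True,
                       machinefile=None):
    """Return the mpiexec argument list for a run like the one suggested above."""
    args = ["mpiexec"]
    if machinefile:
        # File listing the hosts to run on.
        args += ["-machinefile", machinefile]
    if channel:
        # "nemesis" or "sock"; placement before -n is an assumption.
        args += ["-channel", channel]
    args += ["-n", str(nprocs)]
    if no_local:
        # Ask nemesis to route all communication over TCP sockets.
        args += ["-env", "MPICH_NO_LOCAL", "1"]
    args.append(program)
    return args


if __name__ == "__main__":
    cpi = r"c:\Progra~1\MPICH2\examples\cpi.exe"
    print(" ".join(build_mpiexec_args(cpi)))
    print(" ".join(build_mpiexec_args(cpi, channel="nemesis",
                                      machinefile="machines.txt")))
```

The returned list can be handed to subprocess.run on a machine where mpiexec is on the PATH.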


Regards, 
Jayesh 

----- Original Message ----- 
From: "Nicholas Sgro" < nsgro060 at gmail.com > 
To: "Rayson Ho" < raysonlogin at gmail.com > 
Cc: discuss at mpich.org 
Sent: Sunday, February 10, 2013 3:46:27 PM 
Subject: Re: [mpich-discuss] Amazon ec2 Windows machine 


Both instances are part of the same security group, and I made sure all inbound traffic was allowed. 

StarCluster looks very useful. Can you recommend something similar to StarCluster but for Windows instances? The software I am using is only available for Windows. 

thanks, 
Nicholas 


On Sun, Feb 10, 2013 at 1:57 PM, Rayson Ho < raysonlogin at gmail.com > wrote: 


How did you configure the EC2 security groups? By default, EC2 
instances have their inbound traffic blocked, and you will need to 
configure security group rules to enable inbound traffic. 

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html 
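
One quick way to confirm that a security group rule actually lets traffic through is to attempt a plain TCP connection from one instance to the other. The Python sketch below does only that. The address 10.0.0.12 is a made-up placeholder, and port 8676 is assumed here to be the port the smpd process manager listens on; substitute whatever your installation uses.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical values: replace with the other instance's private address
    # and the port you want to test (8676 is an assumed smpd port).
    peer = "10.0.0.12"
    print(peer, "reachable:", can_connect(peer, 8676))
```

A False result does not say which layer blocked the connection; it only shows that the port is not reachable from where the script runs.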

Also, any reason you are manually creating EC2 HPC clusters instead of 
using a toolkit? We are fans of MIT's StarCluster -- with it we can 
start up and shut down clusters very quickly (usually a few minutes). 
It is Linux based, with MPICH (and/or Open MPI), Open Grid Scheduler / 
Grid Engine, and many tools needed for doing HPC in EC2: 

http://star.mit.edu/cluster/ 

And we built a 10,000-node cluster in EC2 based on StarCluster late 
last year, during SC12: 

http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html 

Rayson 

================================================== 
Open Grid Scheduler - The Official Open Source Grid Engine 
http://gridscheduler.sourceforge.net/ 




On Fri, Feb 8, 2013 at 5:02 PM, Nicholas Sgro < nsgro060 at gmail.com > wrote: 
> Hi, 
> This is the command I'm using: 
> 
> mpiexec.exe -machinefile machines.txt -env MPICH2_CHANNEL sock -n 2 cpi.exe 
> 
> I have tried using both a machine file and hosts on the command line, but I 
> get the same results. The program runs on a single instance with any number 
> of processors. I tried running mpiexec on one instance and using the other 
> as a single host and that also works. 
> 
> -Nicholas 
> 
> 
> On Fri, Feb 8, 2013 at 12:04 PM, Jayesh Krishna < jayesh at mcs.anl.gov > wrote: 
>> 
>> Hi, 
>> How are you running your job (mpiexec command)? Did you try using a 
>> machine file to specify the hostnames when running the job? 
>> Does the program (cpi) execute correctly on a single ec2 instance? 
>> 
>> Regards, 
>> Jayesh 
>> 
>> ----- Original Message ----- 
>> From: "Nicholas Sgro" < nsgro060 at gmail.com > 
>> To: "Jayesh Krishna" < jayesh at mcs.anl.gov > 
>> Sent: Thursday, February 7, 2013 9:57:55 PM 
>> Subject: Re: [mpich-discuss] Amazon ec2 Windows machine 
>> 
>> I'm using version 1.4.1p1. I tried the sock channel. It doesn't seem to 
>> work either. With sock, I get to the point where I enter the number of 
>> intervals, but then it does nothing. 
>> 
>> Do you know any reason it wouldn't work with ec2 instances? 
>> 
>> 
>> 
>> On Thu, Feb 7, 2013 at 4:29 PM, Jayesh Krishna < jayesh at mcs.anl.gov > 
>> wrote: 
>> 
>> 
>> Hi, 
>> Which version of MPICH2 are you using? Did you try the "sock" channel (See 
>> if it works)? 
>> 
>> (PS: We haven't tested MPICH2 on Windows with ec2 instances.) 
>> Regards, 
>> Jayesh 
>> 
>> 
>> ----- Original Message ----- 
>> From: "Nicholas Sgro" < nsgro060 at gmail.com > 
>> To: discuss at mpich.org 
>> Sent: Thursday, February 7, 2013 11:29:57 AM 
>> Subject: [mpich-discuss] Amazon ec2 Windows machine 
>> 
>> 
>> Hi all, 
>> 
>> I am trying to run the example cpi.exe across 2 Amazon EC2 instances 
>> running Windows. I have different problems depending on the channel I 
>> choose. If I try nemesis, I get the following error: 
>> 
>> Fatal error in MPI_Init: Other MPI error, error stack: 
>> MPIR_Init_thread(392).................: 
>> MPID_Init(139)........................: channel initialization failed 
>> MPIDI_CH3_Init(38)....................: 
>> MPID_nem_init(196)....................: 
>> MPIDI_CH3I_Seg_commit(366)............: 
>> MPIU_SHMW_Hnd_deserialize(324)........: 
>> MPIU_SHMW_Seg_open(863)...............: 
>> MPIU_SHMW_Seg_create_attach_templ(763): unable to allocate shared memory - 
>> OpenFileMapping The system cannot find the file specified. 
>> 
>> If I try to use shm, cpi.exe uses 100% of the processors on both machines, 
>> but makes no progress and I have to cancel the job. 
>> 
>> I am attaching logs from smpd from both machines from the runs with 
>> nemesis and shm. 
>> 
>> I don't have any experience with MPICH, so I have no idea what the problem 
>> is. Any guidance would be appreciated. 
>> 
>> Thanks 
>> 
>> 
>> _______________________________________________ 
>> discuss mailing list discuss at mpich.org 
>> To manage subscription options or unsubscribe: 
>> https://lists.mpich.org/mailman/listinfo/discuss 
>> 

