[mpich-discuss] Fwd: MPICH fault tolerance and resiliency

Guo, Yanfei yguo at anl.gov
Fri May 26 16:00:06 CDT 2017


Hi Sanjeev,

Please see the inline comments below.

Yanfei Guo
Postdoctoral Researcher
MCS Division, ANL


On 5/26/17, 10:40 AM, "sanjeev s" <snjv.workmail at gmail.com> wrote:

> Hi,


> In the dynamic process model, I read about two approaches: client/server and parent/child.


> 1) Client/server: We need dedicated threads for both the client and the server. Since all our instances are identical, we would end up creating a lot of threads in addition to our application's worker threads.

I assume that by client/server you mean the connect/accept model. In that case, you may need one thread on the server side to handle incoming connections, because the `MPI_Comm_accept` call is blocking. A dedicated thread is not necessary on the client. It is your application’s job to decide which process acts as the server.

> Moreover, when one instance (app) goes down, we want that instance to come back up without much manual work. We don't want to build this logic into our application.

I did not follow what you were trying to do here. Can you be more specific?

> Also, when I queried the size (the number of instances in that comm), the count did not include the client instance. To distribute tasks, I would need additional logic in my application to handle this case.

Can you be more specific about which comm you were referring to?

> 2) Parent/child: Suppose we have started 4 instances on 4 different machines. Now we need to add another server. I don't think parent/child or client/server is a good option in this case.


> We don't want to build process management capabilities into our application. We are looking for process management in MPI itself (e.g., in Hydra) so that we can leverage it.


> Please correct me if I am missing something in my understanding of the dynamic process model.


> Regards
> Sanjeev Sinha





On Fri, May 26, 2017 at 8:46 PM, Halim Amer <aamer at anl.gov> wrote:

Sanjeev,

> More precisely, suppose I have started 4 instances of my
> application. Now I want to add one more instance dynamically to this set.

From my understanding, dynamic processes would work fine for this case. Could you elaborate on why the dynamic process model is not sufficient for your needs?

Halim
www.mcs.anl.gov/~aamer <http://www.mcs.anl.gov/~aamer>

On 5/26/17 9:11 AM, sanjeev s wrote:


Hi mpich,

I have a requirement where we need to add, start, and stop application
instances on the fly before starting a job. Is there any MPICH service
available for this? I looked through the dynamic process model, but it does
not meet our needs.

More precisely, suppose I have started 4 instances of my
application. Now I want to add one more instance dynamically to this set.

Is there any tool which MPICH supports for fault tolerance behavior?

Thanks
Sanjeev





_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss


