[mpich-discuss] Fwd: MPICH fault tolerance and resiliency

sanjeev s snjv.workmail at gmail.com
Sun May 28 22:18:57 CDT 2017


Hi,

Thanks for the reply. I have given the detailed use case below.

Regards
Sanjeev



On Sat, May 27, 2017 at 2:30 AM, Guo, Yanfei <yguo at anl.gov> wrote:

> Hi Sanjeev,
>
> Please see the inline comments below.
>
> Yanfei Guo
> Postdoctoral Researcher
> MCS Division, ANL
>
>
> On 5/26/17, 10:40 AM, "sanjeev s" <snjv.workmail at gmail.com> wrote:
>
> > Hi,
>
>
> > In the dynamic process model, I read about two models: client/server and
> > parent/child.
>
>
> > Client/server: We need dedicated threads for both the client and the
> > server. Since all our instances are identical, we will end up creating a
> > lot of threads in addition to our application worker threads.
>
> I assume you are referring to the connect/accept model by client/server.
> In that case, you may need one thread on the server side to handle the
> incoming connection because the `MPI_Comm_accept` call is blocking. A
> dedicated thread is not necessary in the client. It is your application’s
> job to decide who is the server.
>
>
Yes, I am referring to the same connect/accept model. I don't want to put
this logic in our application. I want an external process manager (like
Hydra) that can handle this for me. Please see below for the detailed use case.



> > Moreover, when one instance (app) goes down, we want that instance to
> > come back up without much manual work. We don't want to bundle this logic
> > into our application.
>
> I did not follow what you were trying to do here. Can you be more
> specific?
>
Suppose I have an application that accepts requests from a client and runs
an MPI job when a request arrives (lazy loading of the MPI library). I start
this application through mpiexec on 4 machines. *mpiexec starts my
application on the 4 machines, not the MPI jobs*. When a request arrives, the
master node (one of the 4, which I designate as master) does some processing
in my application and then distributes MPI jobs to this cluster. Now I want
to add one more machine (running my application) to this cluster without
stopping the applications that are already running. My master process should
be able to see this 5th machine and distribute tasks to it. Is there any way
to achieve this?
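
To make the distribution step concrete, here is a minimal sketch in C. It is
an illustration only and not part of MPICH: `task_range` and
`tasks_for_worker` are hypothetical names, and the sketch assumes the master
already knows the current worker count (how it learns about the 5th machine
is exactly the open question above). It only shows that the split itself can
adapt when the count changes from 4 to 5.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of task indices assigned to one worker.
 * Hypothetical type, introduced only for this sketch. */
typedef struct {
    size_t begin;
    size_t end;
} task_range;

/*
 * Split `total_tasks` tasks as evenly as possible over `workers` workers
 * and return the range owned by `worker` (0-based). The first
 * `total_tasks % workers` workers each receive one extra task.
 */
task_range tasks_for_worker(size_t total_tasks, size_t workers, size_t worker)
{
    assert(workers > 0 && worker < workers);

    size_t base  = total_tasks / workers;   /* tasks every worker gets   */
    size_t extra = total_tasks % workers;   /* leftovers, one per worker */

    task_range r;
    /* Workers before this one hold `base` tasks each, plus one extra task
     * for every earlier worker that received a leftover. */
    r.begin = worker * base + (worker < extra ? worker : extra);
    r.end   = r.begin + base + (worker < extra ? 1 : 0);
    return r;
}
```

With 10 tasks, `tasks_for_worker(10, 4, 0)` covers indices 0 to 2. Once a
fifth worker is known, `tasks_for_worker(10, 5, 4)` covers indices 8 and 9,
so only the worker count passed in has to change.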


> > Also, when I query the size (number of instances for that comm), I do not
> > get the count for the client instance. To distribute the task, I need
> > additional logic in my application to handle this case.
>
> Can you be more specific about which comm you were referring to?
>
I am referring to MPI_COMM_WORLD. Is there any resize-style API that will
tell me how many jobs there are on this comm after accepting a connection?


> > 2) Parent/child: Suppose we have started 4 instances on 4 different
> > machines. Now we need to add another server. I don't think parent/child or
> > client/server is a good option in this case.
>
>
> > We don't want to build process management capabilities into our
> > application. We are looking for process management in MPI itself (e.g., in
> > Hydra) so that we can leverage it.
>
>
> > Please correct me if I am missing something in my understanding of the
> > dynamic process model.
>
>
> Regards
> Sanjeev Sinha
>
>
>
>
>
> On Fri, May 26, 2017 at 8:46 PM, Halim Amer
> <aamer at anl.gov> wrote:
>
> Sanjeev,
>
> > More precisely, my requirement is this: suppose I started 4 instances of
> > my application. Now I want to add one more instance dynamically to this set.
>
> From my understanding, dynamic processes would work fine for this case.
> Could you elaborate on why the dynamic process model is not sufficient for
> your needs?
>
> Halim
> www.mcs.anl.gov/~aamer <http://www.mcs.anl.gov/~aamer>
>
> On 5/26/17 9:11 AM, sanjeev s wrote:
>
>
> Hi mpich,
>
> I have a requirement wherein we need to add, start, and stop application
> instances on the fly before starting a job. Is there any MPICH service
> available for this? I looked through the dynamic process model, but it does
> not meet our needs.
>
> More precisely, my requirement is this: suppose I started 4 instances of my
> application. Now I want to add one more instance dynamically to this set.
>
> Is there any tool which MPICH supports for fault tolerance behavior?
>
> Thanks
> Sanjeev
>
>
>
>
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>