[mpich-discuss] Fwd: MPICH fault tolerance and resiliency

sanjeev s snjv.workmail at gmail.com
Thu Jun 1 00:28:27 CDT 2017


Thanks for the reply and valuable suggestion.

One more question:

Is there any way to avoid bringing down (cleaning up) all MPI jobs when one job dies?


Regards
Sanjeev


On Thu, Jun 1, 2017 at 10:25 AM, Guo, Yanfei <yguo at anl.gov> wrote:

> On 5/30/17, 10:19 PM, "sanjeev s" <snjv.workmail at gmail.com> wrote:
>
> > Hi,
>
>
> > I have added the detailed use case..
>
>
> > > I did not follow what you were trying to do here.  Can you be more
> > > specific?
>
> > Suppose I have an application that accepts requests from a client and
> > runs the MPI job when a request arrives (lazy load of the MPI library).
> > I started this application through mpiexec on 4 machines, so mpiexec
> > starts my application on the 4 machines, not the MPI jobs. When a
> > request arrives, my application on the master node (1 out of the 4,
> > which I designate as master) does some processing and then distributes
> > MPI jobs to this cluster. Now I want to add one more machine (running
> > my application) to this cluster without stopping the applications that
> > are already running. My master process should be able to see this 5th
> > machine and distribute tasks to it. Is there any way to achieve this?
>
> Is this "application" an MPI program? If so, there are two options:
> 1. Start the MPI process on the 5th machine and have it connect to your
>    existing processes (with MPI_Comm_connect/MPI_Comm_accept).
> 2. Let the existing processes spawn new processes on the new machine
>    (with MPI_Comm_spawn).
> You can check out the book "Using Advanced MPI" for details and examples.
> If the aforementioned application is not an MPI program, then what you
> are trying to do is beyond the scope of MPI.
>
>
>
> > > Also, when I took the size (number of instances for that comm), I am
> > > not getting the count for the client instance. To distribute the task,
> > > I need additional logic to handle this case in my application.
>
> > Can you be more specific about which comm you were referring to?
>
>
>
> > I am referring to MPI_COMM_WORLD. Is there any resize-like API that
> > will tell me how many jobs there are on this comm after doing one accept
> > connection?
>
> MPI_COMM_WORLD is created during MPI_Init. It does not change when a new
> process group is connected. The communicator that accepts the connection
> will give you an intercommunicator, which allows the accepting processes
> to talk to the connecting processes. You can merge this intercommunicator
> into an intracommunicator that contains all the processes.
>
> > Hope this helps in understanding the problem.
>
>
> > Regards
> > Sanjeev
>
> Yanfei Guo
>
>
>
>
> On Mon, May 29, 2017 at 8:48 AM, sanjeev s
> <snjv.workmail at gmail.com> wrote:
>
> Hi,
>
>
> Thanks for the reply. I have given the detailed use case below.
>
>
> Regards
> Sanjeev
>
>
>
>
>
> On Sat, May 27, 2017 at 2:30 AM, Guo, Yanfei
> <yguo at anl.gov> wrote:
>
> Hi Sanjeev,
>
> Please see the inline comments below.
>
> Yanfei Guo
> Postdoctoral Researcher
> MCS Division, ANL
>
>
> On 5/26/17, 10:40 AM, "sanjeev s" <snjv.workmail at gmail.com> wrote:
>
> > Hi,
>
>
> > In the dynamic process model, I read about two approaches: client server
> > and parent child.
>
>
> > Client server: We need dedicated threads for both client and server.
> > Now, considering all instances to be the same, we will end up doing a lot
> > of thread creation apart from our application worker threads.
>
> I assume you are referring to the connect/accept model by client/server.
> In that case, you may need one thread on the server side to handle the
> incoming connection because the `MPI_Comm_accept` call is blocking. A
> dedicated thread is not necessary in the client. It is your application’s
> job to decide who is the server.
>
>
>
>
>
> Yes, I am referring to the same connect/accept model. I don't want to put
> this logic in our application. I want some external process manager (like
> Hydra) that can handle this for me. Please see below for the detailed use
> case.
>
>
>
>
> > Moreover, when 1 instance (app) goes down, we want that instance to come
> > back up without much manual work. We don't want to put this logic in
> > our application.
>
> I did not follow what you were trying to do here. Can you be more specific?
>
>
>
> Suppose I have an application that accepts requests from a client and
> runs the MPI job when a request arrives (lazy load of the MPI library).
> I started this application through mpiexec on 4 machines, so mpiexec
> starts my application on the 4 machines, not the MPI jobs. When a
> request arrives, my application on the master node (1 out of the 4,
> which I designate as master) does some processing and then distributes
> MPI jobs to this cluster. Now I want to add one more machine (running
> my application) to this cluster without stopping the applications that
> are already running. My master process should be able to see this 5th
> machine and distribute tasks to it. Is there any way to achieve this?
>
>
>
>
>
> > Also, when I took the size (number of instances for that comm), I am
> > not getting the count for the client instance. To distribute the task,
> > I need additional logic to handle this case in my application.
>
> Can you be more specific about which comm you were referring to?
>
>
>
> I am referring to MPI_COMM_WORLD. Is there any resize-like API that
> will tell me how many jobs there are on this comm after doing one accept
> connection?
>
>
> > 2) Parent child: Suppose we have started 4 instances on 4 different
> > machines. Now we need to add another server. I don't think parent child/
> > client server is a good option in this case.
>
>
> > We don't want to build process management capabilities into our
> > application. We are looking for process management in MPI itself (e.g. in
> > Hydra) so that we can leverage that.
>
>
> > Please correct me if I am missing something in my understanding of
> > the dynamic model.
>
>
> Regards
> Sanjeev Sinha
>
>
>
>
>
> On Fri, May 26, 2017 at 8:46 PM, Halim Amer
> <aamer at anl.gov> wrote:
>
> Sanjeev,
>
> > More precisely, my requirement is this: suppose I started 4 instances of
> > my application. Now I want to add one more instance dynamically to this
> > set.
>
> From my understanding, dynamic processes would work fine for this case.
> Could you elaborate on why the dynamic process model is not sufficient for
> your needs?
>
> Halim
> www.mcs.anl.gov/~aamer <http://www.mcs.anl.gov/~aamer>
>
> On 5/26/17 9:11 AM, sanjeev s wrote:
>
>
> Hi mpich,
>
> I have a requirement wherein we need to add, start, and stop application
> instances on the fly before starting a job. Is there any MPICH service
> available for this? I looked through the dynamic process model, but it
> does not meet our needs.
>
> More precisely, my requirement is this: suppose I started 4 instances of
> my application. Now I want to add one more instance dynamically to this
> set.
>
> Is there any tool which MPICH supports for fault tolerance behavior?
>
> Thanks
> Sanjeev
>
>
>
>
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss