[mpich-discuss] Fwd: MPICH fault tolerance and resiliency

sanjeev s snjv.workmail at gmail.com
Tue May 30 21:19:03 CDT 2017


Hi,

I have added the detailed use case below.

> I did not follow what you were trying to do here. Can you be more specific?

Suppose I have an application that accepts requests from a client and runs
an MPI job when a request arrives (the MPI library is loaded lazily). I
start this application through mpiexec on 4 machines. *mpiexec starts my
application on the 4 machines, not the MPI jobs*. When a request arrives,
my application on the master node (I designate 1 of the 4 as the master
node) does some processing and then distributes MPI jobs to this cluster.
Now I want to add one more machine (also running my application) to this
cluster without stopping the applications that are already running. My
master process should be able to see this 5th machine and distribute tasks
to it. Is there any way to achieve this?


> > Also, when I took the size (the number of instances in that comm), I did
> > not get the count for the client instance. To distribute the task, I need
> > additional logic in my application to handle this case.
>
> Can you be more specific about which comm you were referring to?

I am referring to MPI_COMM_WORLD. Is there any resize-like API that will
tell me how many jobs there are on this comm after one accept connection?

Hope this helps in understanding the problem.

Regards
Sanjeev


On Mon, May 29, 2017 at 8:48 AM, sanjeev s <snjv.workmail at gmail.com> wrote:

> Hi,
>
> Thanks for the reply. I have given the detailed use case below.
>
> Regards
> Sanjeev
>
>
>
> On Sat, May 27, 2017 at 2:30 AM, Guo, Yanfei <yguo at anl.gov> wrote:
>
>> Hi Sanjeev,
>>
>> Please see the inline comments below.
>>
>> Yanfei Guo
>> Postdoctoral Researcher
>> MCS Division, ANL
>>
>>
>> On 5/26/17, 10:40 AM, "sanjeev s" <snjv.workmail at gmail.com> wrote:
>>
>> > Hi,
>>
>>
>> > In the dynamic process model, I read about two approaches: client/server
>> > and parent/child.
>>
>>
>> > Client/server: We need dedicated threads for both the client and the
>> > server. Since all instances are the same, we will end up creating a lot
>> > of threads in addition to our application worker threads.
>>
>> I assume you are referring to the connect/accept model by client/server.
>> In that case, you may need one thread on the server side to handle the
>> incoming connection because the `MPI_Comm_accept` call is blocking. A
>> dedicated thread is not necessary in the client. It is your
>> application’s job to decide who is the server.
>>
>>
> Yes, I am referring to the same connect/accept model. I don't want to put
> this logic in our application. I want an external process manager (like
> Hydra) that can handle this for me. Please see below for the detailed use
> case.
>
>
>
>> > Moreover, when 1 instance (app) goes down, we want that instance to
>> > come back up without much manual work. We don't want to bundle this
>> > logic into our application.
>>
>> I did not follow what you were trying to do here. Can you be more
>> specific?
>>
> Suppose I have an application that accepts requests from a client and runs
> an MPI job when a request arrives (the MPI library is loaded lazily). I
> start this application through mpiexec on 4 machines. *mpiexec starts my
> application on the 4 machines, not the MPI jobs*. When a request arrives,
> my application on the master node (I designate 1 of the 4 as the master
> node) does some processing and then distributes MPI jobs to this cluster.
> Now I want to add one more machine (also running my application) to this
> cluster without stopping the applications that are already running. My
> master process should be able to see this 5th machine and distribute tasks
> to it. Is there any way to achieve this?
>
>
>> > Also, when I took the size (the number of instances in that comm), I did
>> > not get the count for the client instance. To distribute the task, I
>> > need additional logic in my application to handle this case.
>>
>> Can you be more specific about which comm you were referring to?
>>
> I am referring to MPI_COMM_WORLD. Is there any resize-like API that will
> tell me how many jobs there are on this comm after one accept connection?
>
>
>> > 2) Parent/child: Suppose we have started 4 instances on 4 different
>> > machines. Now we need to add another server. I don't think parent/child
>> > or client/server is a good option in this case.
>>
>>
>> > We don't want to build process management capabilities into our
>> > application. We are looking for process management in MPI itself (e.g.
>> > in Hydra) so that we can leverage it.
>>
>>
>> > Please correct me if I am missing something in my understanding of the
>> > dynamic process model.
>>
>>
>> Regards
>> Sanjeev Sinha
>>
>>
>>
>>
>>
>> On Fri, May 26, 2017 at 8:46 PM, Halim Amer
>> <aamer at anl.gov> wrote:
>>
>> Sanjeev,
>>
>> > More precisely, my requirement is this: suppose I started 4 instances of
>> > my application. Now I want to add one more instance dynamically to this
>> > set.
>>
>> From my understanding, dynamic processes would work fine for this case.
>> Could you elaborate on why the dynamic process model is not sufficient for
>> your needs?
>>
>> Halim
>> www.mcs.anl.gov/~aamer <http://www.mcs.anl.gov/~aamer>
>>
>> On 5/26/17 9:11 AM, sanjeev s wrote:
>>
>>
>> Hi mpich,
>>
>> I have a requirement wherein we need to add, start, and stop application
>> instances on the fly before starting a job. Is there any MPICH service
>> available for this? I looked through the dynamic process model, but it
>> does not meet our needs.
>>
>> More precisely, my requirement is this: suppose I started 4 instances of
>> my application. Now I want to add one more instance dynamically to this
>> set.
>>
>> Is there any tool which MPICH supports for fault tolerance behavior?
>>
>> Thanks
>> Sanjeev
>>
>>
>>
>>
>>
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss