[mpich-discuss] Fwd: MPICH fault tolerance and resiliency

Halim Amer aamer at anl.gov
Fri May 26 15:38:03 CDT 2017


I still cannot follow what you are trying to do. The terminology is 
confusing (what are "app instance", machine, etc.). Using MPI/Hydra 
terminology would help. Maybe MPI process instead of instance and 
cluster node instead of machine? Also, I would like to know how MPI 
terminology maps to the models you described. Do you mean using 
comm_spawn for parent-child and comm_connect/comm_accept for the 
client-server model?

Having some source code would be much more productive too.

Halim
www.mcs.anl.gov/~aamer

On 5/26/17 10:40 AM, sanjeev s wrote:
> Hi,
>
> For dynamic processes, I read about two models: client-server and parent-child.
>
> 1) Client-server: We need dedicated threads for the client and the
> server roles. Since all instances are identical, we would end up creating
> many threads in addition to our application worker threads. Moreover, when
> one instance (app) goes down, we want it to come back up without much
> manual work, and we don't want to build this logic into our application.
> Also, when I query the size (the number of instances in that comm), the
> client instances are not counted. To distribute tasks, I would need
> additional logic in my application to handle this case.
>
> 2) Parent-child: Suppose we have started 4 instances on 4 different
> machines. Now we need to add another server. I don't think parent-child or
> client-server is a good option in this case.
>
> We don't want to build process management capabilities into our application.
> We are looking for process management in MPI itself (e.g. in Hydra) so that
> we can leverage it.
>
> Please correct me if I am missing something in my understanding of the
> dynamic process model.
>
> Regards
> Sanjeev Sinha
>
>
>
> On Fri, May 26, 2017 at 8:46 PM, Halim Amer <aamer at anl.gov> wrote:
>
>> Sanjeev,
>>
>>> More precisely, my requirement is this: suppose I have started 4 instances
>>> of my application. Now I want to add one more instance dynamically to this set.
>>
>> From my understanding, dynamic processes would work fine for this case.
>> Could you elaborate on why the dynamic process model is not sufficient for
>> your needs?
>>
>> Halim
>> www.mcs.anl.gov/~aamer
>>
>>
>> On 5/26/17 9:11 AM, sanjeev s wrote:
>>
>>> Hi mpich,
>>>
>>> I have a requirement wherein we need to add, start, and stop application
>>> instances on the fly before starting a job. Is there any MPICH service
>>> available for this? I looked through the dynamic process model, but it does
>>> not meet our needs.
>>>
>>> More precisely, my requirement is this: suppose I have started 4 instances
>>> of my application. Now I want to add one more instance dynamically to this set.
>>>
>>> Is there any tool that MPICH supports for fault-tolerant behavior?
>>>
>>> Thanks
>>> Sanjeev
>>>
>>>
>>>
>>> _______________________________________________
>>> discuss mailing list     discuss at mpich.org
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss
>>>
>>
>
>
>
>

