[mpich-discuss] Can the MPICH job pause and resumed

Jeff Hammond jeff.science at gmail.com
Sat Oct 10 16:09:33 CDT 2015


Yeah, I think checkpoint-restart is probably the best way to do what this
user wants.
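
If it helps, here is a minimal sketch of application-level
checkpoint-restart in plain C. None of this is MPICH functionality. The
helper names (block_range, save_checkpoint, load_slice) and the file layout
(step counter, element count, then the global array) are hypothetical and
only for illustration. The point is that a checkpoint stored in global order
does not depend on the process count, so a restarted job with more ranks can
recompute its own slice.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: half-open range [lo, hi) of global indices owned by
 * `rank` when n items are block-distributed over nranks processes. */
static void block_range(long n, int nranks, int rank, long *lo, long *hi)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    long base = n / nranks, extra = n % nranks;
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}

/* Write the step counter, the global size, and the whole array in global
 * order. Assumed layout: [long step][long n][n doubles]. */
static int save_checkpoint(const char *path, long step,
                           const double *data, long n)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int ok = fwrite(&step, sizeof step, 1, f) == 1 &&
             fwrite(&n, sizeof n, 1, f) == 1 &&
             fwrite(data, sizeof *data, (size_t)n, f) == (size_t)n;
    return (fclose(f) == 0 && ok) ? 0 : -1;
}

/* On restart, read only the slice that `rank` owns under the NEW process
 * count. The caller frees *slice. */
static int load_slice(const char *path, int nranks, int rank,
                      long *step, double **slice, long *count)
{
    long n, lo, hi;
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    if (fread(step, sizeof *step, 1, f) != 1 ||
        fread(&n, sizeof n, 1, f) != 1) {
        fclose(f);
        return -1;
    }
    block_range(n, nranks, rank, &lo, &hi);
    *count = hi - lo;
    *slice = malloc((size_t)(*count > 0 ? *count : 1) * sizeof **slice);
    if (!*slice) {
        fclose(f);
        return -1;
    }
    /* The header has just been read, so skip lo doubles from here. */
    if (fseek(f, lo * (long)sizeof(double), SEEK_CUR) != 0 ||
        fread(*slice, sizeof **slice, (size_t)*count, f) != (size_t)*count) {
        free(*slice);
        fclose(f);
        return -1;
    }
    fclose(f);
    return 0;
}
```

In a real MPI code, each process would call load_slice with the values it
gets from MPI_Comm_rank and MPI_Comm_size after relaunching on the larger
allocation, and would probably use MPI-IO rather than stdio for the file
access.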

Jeff

On Thursday, October 8, 2015, Rob Latham <robl at mcs.anl.gov> wrote:

>
>
> On 10/08/2015 08:18 AM, Jeff Hammond wrote:
>
>> Nodes and cores are *hardware* concepts.  If you need to add hardware
>> resources to your job, that is a resource manager issue.  If you want to
>> create new MPI processes _that will run on those new nodes/cores_, you
>> can use MPI_Comm_spawn for that.
>>
>> As for pausing a job, I'm not sure why you want to do that.  Are you
>> trying to suspend the job until the new hardware resources become
>> available?
>>
>
> Probably not what Wahi has in mind, but back on the old Crays (old
> vector-era Crays, I mean), when checkpoint/restart was an OS-level service,
> the job scheduler used that feature: when a new job came in with higher
> priority, it checkpointed the old job; when the new job finished, it
> restarted the old job.
>
> ==rob
>
>
>> On Wed, Oct 7, 2015 at 10:08 PM, wahi <wahi at sci.am> wrote:
>>
>>     Hi Jeff,
>>
>>     Thank you for your reply. I think MPI_Comm_spawn is for
>>     generating new processes, but I need to add more nodes or cores
>>     after pausing the job. Am I right?
>>
>>
>>     On 10/08/2015 08:11 AM, Jeff Hammond wrote:
>>
>>>     See MPI_Comm_spawn(_multiple).
>>>
>>>     Jeff
>>>
>>>     On Wednesday, October 7, 2015, Zhao, Xin
>>>     <xinzhao3 at illinois.edu> wrote:
>>>
>>>         Hi Wahi,
>>>
>>>         Do you mean you want to change the number of hosts during one
>>>         MPI execution? If so, we think that is not possible.
>>>
>>>         Xin
>>>         ________________________________________
>>>         From: wahi [wahi at sci.am]
>>>         Sent: Wednesday, October 07, 2015 2:43 AM
>>>         To: discuss at mpich.org
>>>         Subject: [mpich-discuss] Can the MPICH job pause and resumed
>>>
>>>         Hi,
>>>
>>>         I would like to know whether it is possible to pause an MPICH
>>>         job and restart it with more nodes.
>>>
>>>         Thanks in advance for any help or suggestions.
>>>
>>>         Regards,
>>>           Wahi
>>>         _______________________________________________
>>>         discuss mailing list discuss at mpich.org
>>>         To manage subscription options or unsubscribe:
>>>         https://lists.mpich.org/mailman/listinfo/discuss
>>>
>>>
>>>
>>>     --
>>>     Jeff Hammond
>>>     jeff.science at gmail.com
>>>     http://jeffhammond.github.io/
>>>
>>>
>>>
>>
>>
>>
>>
>>
>>
>> --
>> Jeff Hammond
>> jeff.science at gmail.com
>> http://jeffhammond.github.io/
>>
>>
>>
>>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
>


-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/