[mpich-discuss] ckpoint-num error

john donald johnd9886 at gmail.com
Fri Jun 14 15:58:24 CDT 2013


The checkpoint file size is 121.6 MB after I raised the interval to 20 seconds.
My test application is an MPI/C integer sort program with 5000 iterations.
Sorry for the trivial question, but how can I tell whether the checkpoint file
is empty or not? How can I open it?
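
For the emptiness question, a file's size is a quick first check. Below is a minimal sketch, assuming the checkpoint files sit in the directory named by the -ckpoint-prefix used in this thread (/home/john/ckpts). The function name check_ckpts is hypothetical and is not part of MPICH or BLCR. A non-zero size only shows that something was written, not that the checkpoint can be restarted.

```shell
# Hypothetical helper: report which regular files in a directory are empty.
# Usage: check_ckpts <directory>
check_ckpts() {
    dir="$1"
    for f in "$dir"/*; do
        # Skip anything that is not a regular file (subdirectories, no match).
        [ -f "$f" ] || continue
        # test -s is true when the file exists and has a size greater than zero.
        if [ -s "$f" ]; then
            echo "non-empty: $f"
        else
            echo "EMPTY: $f"
        fi
    done
}
```

Running `check_ckpts /home/john/ckpts` prints one line per file, and `ls -lh /home/john/ckpts` shows the actual sizes.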



2013/6/11 Wesley Bland <wbland at mcs.anl.gov>

> Did you check if there's actually anything in the checkpoint files? If
> they're empty, that probably means that you're checkpointing too frequently.
>
> On Jun 10, 2013, at 5:17 PM, john donald <johnd9886 at gmail.com> wrote:
>
> I raised it to 20 seconds but got the same results.
> Sorry, I am new to checkpoint/restart.
> I am trying this initially on one multicore PC.
> What should it look like if the restart succeeds? Should the application
> run in the same terminal in which I run the restart command?
> My test app has 5000 iterations. If a checkpoint is taken at iteration 300,
> for example, and I choose to restart from that checkpoint file, should it
> resume near iteration 300?
>
>
> 2013/6/6 Wesley Bland <wbland at mcs.anl.gov>
>
>> Is there actually anything in those checkpoints? With a checkpoint
>> happening every 4 seconds you may be overdoing it.
>>
>> Wesley
>>
>> On Jun 5, 2013, at 2:14 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>>
>> > I don't know, but see if anything on this page helps:
>> > http://wiki.mpich.org/mpich/index.php/Checkpointing
>> >
>> > On Jun 5, 2013, at 4:09 PM, john donald wrote:
>> >
>> >>
>> >>
>> >> ---------- Forwarded message ----------
>> >> From: john donald <johnd9886 at gmail.com>
>> >> Date: 2013/6/3
>> >> Subject: ckpoint-num error
>> >> To: mpich-discuss at mcs.anl.gov
>> >>
>> >>
>> >> I used mpiexec with checkpointing and it created two checkpoint files:
>> >>
>> >> mpiexec -ckpointlib blcr -ckpoint-prefix /home/john/ckpts/app.ckpoint -ckpoint-interval 4 -n 4 /home/john/app/md
>> >>
>> >> context-num0-0-0
>> >> context-num1-0-0
>> >>
>> >>
>> >> I am trying to restart:
>> >> mpiexec -ckpointlib blcr -ckpoint-prefix /home/john/ckpts/app.ckpoint -n 4 -ckpoint-num 1
>> >>
>> >> but nothing happens; it just hangs.
>> >> I also tried:
>> >> mpiexec -ckpointlib blcr -ckpoint-prefix /home/john/ckpts/app.ckpoint -n 4 -ckpoint-num 0-0-0
>> >> which also hangs.
>> >>
>> >> _______________________________________________
>> >> discuss mailing list     discuss at mpich.org
>> >> To manage subscription options or unsubscribe:
>> >> https://lists.mpich.org/mailman/listinfo/discuss
>> >
>>
>
>
>

