<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Did you check whether there's actually anything in the checkpoint files? If they're empty, you're probably checkpointing too frequently.<div><br><div><div>On Jun 10, 2013, at 5:17 PM, john donald &lt;<a href="mailto:johnd9886@gmail.com">johnd9886@gmail.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div dir="rtl"><div dir="ltr">I raised the interval to 20 seconds but got the same results.<br></div><div dir="ltr">Sorry, I am new to checkpoint/restart.<br></div><div dir="ltr">I am trying this initially on a single multicore PC.<br></div><div dir="ltr">
What should it look like if the restart succeeds? Should the application run in the same terminal in which I run the restart command?<br></div><div dir="ltr">My test app has 5000 iterations. If a checkpoint is taken at, for example, iteration 300 and I restart from that checkpoint file, should it resume near iteration 300?<br>
</div><div class="gmail_extra"><br><br><div class="gmail_quote"><div dir="ltr">2013/6/6 Wesley Bland <span dir="ltr">&lt;<a href="mailto:wbland@mcs.anl.gov" target="_blank">wbland@mcs.anl.gov</a>&gt;</span></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Is there actually anything in those checkpoints? With a checkpoint happening every 4 seconds you may be overdoing it.<br>
<span class="HOEnZb"><font color="#888888"><br>
Wesley<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
On Jun 5, 2013, at 2:14 PM, Rajeev Thakur &lt;<a href="mailto:thakur@mcs.anl.gov">thakur@mcs.anl.gov</a>&gt; wrote:<br>
<br>
> I don't know, but see if anything on this page helps:<br>
> <a href="http://wiki.mpich.org/mpich/index.php/Checkpointing" target="_blank">http://wiki.mpich.org/mpich/index.php/Checkpointing</a><br>
><br>
> On Jun 5, 2013, at 4:09 PM, john donald wrote:<br>
><br>
>><br>
>><br>
>> ---------- Forwarded message ----------<br>
>> From: john donald &lt;<a href="mailto:johnd9886@gmail.com">johnd9886@gmail.com</a>&gt;<br>
>> Date: 2013/6/3<br>
>> Subject: ckpoint-num error<br>
>> To: <a href="mailto:mpich-discuss@mcs.anl.gov">mpich-discuss@mcs.anl.gov</a><br>
>><br>
>><br>
>> I ran mpiexec with checkpointing enabled, and it created two checkpoint files:<br>
>><br>
>> mpiexec -ckpointlib blcr -ckpoint-prefix /home/john/ckpts/app.ckpoint -ckpoint-interval 4 -n 4 /home/john/app/md<br>
>><br>
>> context-num0-0-0<br>
>> context-num1-0-0<br>
>><br>
>><br>
>> I am now trying to restart from a checkpoint:<br>
>> mpiexec -ckpointlib blcr -ckpoint-prefix /home/john/ckpts/app.ckpoint -n 4 -ckpoint-num 1<br>
>><br>
>> but nothing happens; it just hangs.<br>
>> I also tried:<br>
>> mpiexec -ckpointlib blcr -ckpoint-prefix /home/john/ckpts/app.ckpoint -n 4 -ckpoint-num 0-0-0<br>
>> This also hangs.<br>
>><br>
>> _______________________________________________<br>
>> discuss mailing list <a href="mailto:discuss@mpich.org">discuss@mpich.org</a><br>
>> To manage subscription options or unsubscribe:<br>
>> <a href="https://lists.mpich.org/mailman/listinfo/discuss" target="_blank">https://lists.mpich.org/mailman/listinfo/discuss</a><br>
><br>
</div></div></blockquote></div><br></div></div>
</blockquote></div><br></div></body></html>