[mpich-discuss] MPICH 2.1.5 checkpoint

Bland, Wesley B. wbland at anl.gov
Thu Jul 17 10:02:48 CDT 2014


I’m not sure I understand your question, but I believe the way the checkpointing works (or worked; I don’t think it’s functional in recent versions) is that it periodically freezes the execution to write a checkpoint of all of the data, at an interval given by the user. If that interval is shorter than the time it takes to write a checkpoint, it behaves poorly (possibly freezing the in-progress checkpoint to start another one, then freezing that one, and so on). Honestly, the person who wrote this code in MPICH is gone and it’s not working in the current version, so I can’t give you much more information than that. Fixing this is on the roadmap for future versions, but for now all I can tell you is to extend the time between checkpoints, because the checkpointing system isn’t robust enough to do much more than the simple thing.
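
As a rough illustration of why a short interval hurts, here is a toy model in shell. It is not MPICH or Hydra code, and the timer rule in it (a checkpoint is due at every multiple of the interval in wall-clock time) is an assumption made only for this sketch; progress_seconds is a made-up helper name.

```shell
#!/bin/sh
# Toy model only: this is NOT how MPICH/Hydra is implemented.
# Assumption: a checkpoint is due at every multiple of `interval` seconds of
# wall-clock time, and the application is frozen for `cost` seconds per checkpoint.
# Prints the seconds of application progress made in `total` wall-clock seconds.
progress_seconds() {
    interval=$1 cost=$2 total=$3
    now=0 next_due=$interval progress=0
    while [ "$now" -lt "$total" ]; do
        if [ "$now" -ge "$next_due" ]; then
            # A checkpoint is due (or overdue): freeze for `cost` seconds.
            now=$((now + cost))
            next_due=$((next_due + interval))
        else
            # No checkpoint pending: the application runs for one second.
            now=$((now + 1))
            progress=$((progress + 1))
        fi
    done
    echo "$progress"
}

progress_seconds 120 150 1200   # checkpoint slower than the interval
progress_seconds 600 150 1200   # interval comfortably longer than the checkpoint
```

Under these assumptions the first call reports 120 seconds of progress out of 1200 (the application never resumes after the first checkpoint), while the second reports 1050. The 150-second checkpoint cost is a made-up number.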

Thanks,
Wesley

On Jul 17, 2014, at 9:52 AM, Marcela Castro León <mcastrol at gmail.com> wrote:

But do you know why, after the first checkpoint, the execution stays frozen for a time similar to the interval and then a new checkpoint is triggered?
Thank you.


2014-07-17 16:43 GMT+02:00 Bland, Wesley B. <wbland at anl.gov>:
You’ll probably need to somehow get your application to run longer then. Unfortunately, MPICH doesn’t support manually starting a checkpoint at this time.
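
A minimal sketch of that constraint, assuming you have measured the total run time and the time one checkpoint takes yourself. suggest_interval is a hypothetical helper, not part of MPICH, and the numbers are made up; the mpiexec flags are the ones from the original command.

```shell
#!/bin/sh
# Sketch only: the runtime and checkpoint cost are values you must measure yourself;
# suggest_interval is a hypothetical helper, not part of MPICH.
# Prints an interval (in seconds) that places the first checkpoint near the middle
# of the run, or fails if the run is too short for that checkpoint to finish.
suggest_interval() {
    runtime=$1 ckpt_cost=$2
    interval=$((runtime / 2))
    # The checkpoint has to finish before the next one is due.
    if [ "$interval" -le "$ckpt_cost" ]; then
        echo "run too short: half the runtime must exceed ${ckpt_cost} s" >&2
        return 1
    fi
    echo "$interval"
}

# Example with made-up numbers: a 900 s run and a 150 s checkpoint.
# The command is only printed here, not executed.
if interval=$(suggest_interval 900 150); then
    echo mpiexec -ckpointlib blcr -ckpoint-prefix /partnfs/mpichchk \
        -ckpoint-interval "$interval" -f maquinas -n 16 ./bt.C.16
fi
```

Because the application is paused while the checkpoint is written, a further checkpoint may still be attempted before the run ends; this sketch only aims to let the first one complete.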


On Jul 17, 2014, at 9:29 AM, Marcela Castro León <mcastrol at gmail.com> wrote:

Hi

I just want to take a single checkpoint in the middle of the execution.
We are trying to observe the I/O so that we can reduce the checkpoint time using a parallel file system.

Thanks.


2014-07-17 13:30 GMT+02:00 Bland, Wesley B. <wbland at anl.gov>:
Is there a reason you can't just take the checkpoints less frequently?

On Jul 17, 2014, at 4:41 AM, "Marcela Castro León" <mcastrol at gmail.com> wrote:

Hi,
I'm using MPICH2 1.5 compiled with BLCR checkpointing support.
I'm having problems with the checkpoint interval.
When I execute:
mpiexec -ckpointlib blcr -ckpoint-prefix /partnfs/mpichchk -ckpoint-interval 120 -f maquinas -n 16 ./bt.C.16

At second 120 the execution is indeed interrupted for checkpointing, but because the checkpoint lasts more than 120 seconds, another checkpoint is triggered immediately instead of the application resuming.
I can only get a checkpoint to complete by setting the interval so that it falls almost at the end of the execution, but that is not useful.

How can I solve it?
Also, is there a way to find out how long a checkpoint takes?

Thank you very much.

Marcela



mpiexec -info
HYDRA build details:
    Version:                                 1.5
    Release Date:                            Mon Oct  8 14:00:48 CDT 2012
    CC:                              gcc
    CXX:                             c++
    F77:                             gfortran
    F90:                             gfortran
    Configure options:                       '--disable-option-checking' '--prefix=/soft/mpich2/mpich2-1.5-blcr8.3' '--with-hydra-ckpointlib=blcr' '--with-blcr=/soft/blcr' '--enable-checkpointing' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS= -L/soft/blcr/lib64 -L/soft/blcr/lib' 'LIBS=-lrt -lcr -lpthread ' 'CPPFLAGS= -I/SRC/mpi/mpich2-1.5/src/mpl/include -I/SRC/mpi/mpich2-1.5/src/mpl/include -I/SRC/mpi/mpich2-1.5/src/openpa/src -I/SRC/mpi/mpich2-1.5/src/openpa/src -I/SRC/mpi/mpich2-1.5/src/mpi/romio/include -I/soft/blcr/include'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs
    Checkpointing libraries available:       blcr
    Demux engines available:                 poll select


_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss
