[mpich-discuss] MPICH 2.1.5 checkpoint

Bland, Wesley B. wbland at anl.gov
Thu Jul 17 06:30:36 CDT 2014


Is there a reason you can't just take the checkpoints less frequently?
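
As a rough illustration of why a longer interval helps, the sketch below estimates how much of each interval is left for the application. It assumes (not confirmed for Hydra) that the interval timer keeps running while a checkpoint is being written, which matches the behavior described in the quoted message. The function name `useful_fraction` and the 150-second checkpoint duration are hypothetical.

```python
def useful_fraction(interval_s: float, checkpoint_s: float) -> float:
    """Fraction of each interval left for application work.

    Assumes the timer keeps running while a checkpoint is being written,
    so a checkpoint that outlasts the interval leaves no time at all.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return max(0.0, interval_s - checkpoint_s) / interval_s


# 150 s is an illustrative checkpoint duration, not a measured value.
for interval in (120, 600, 1800):
    print(interval, round(useful_fraction(interval, 150.0), 2))
```

Under that assumption, any interval shorter than the checkpoint duration leaves the application no time to make progress, so the interval needs to be several times the measured checkpoint time.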

On Jul 17, 2014, at 4:41 AM, "Marcela Castro León" <mcastrol at gmail.com> wrote:

Hi,
I'm using MPICH2 1.5 compiled with BLCR checkpointing support.
I'm having problems with the checkpoint interval.
When I execute:
mpiexec -ckpointlib blcr -ckpoint-prefix /partnfs/mpichchk -ckpoint-interval 120 -f maquinas -n 16 ./bt.C.16

At second 120 the execution is indeed interrupted for checkpointing but, because the checkpoint takes more than 120 seconds, another checkpoint is triggered immediately instead of the application resuming.
I can only get a complete checkpoint by setting the checkpoint interval close to the end of the execution, which is not useful.

How can I solve this?
Also, is there a way to find out how long a checkpoint takes?
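
One rough way to estimate this from outside the MPI job is to look at file modification times under the checkpoint prefix. The sketch below is only a proxy: it assumes the directory holds the files of a single checkpoint and that they are written over the course of that checkpoint. The helper name `checkpoint_write_span` is hypothetical and is not part of MPICH or BLCR.

```python
import os


def checkpoint_write_span(directory: str) -> float:
    """Seconds between the oldest and newest file mtime under `directory`.

    This is a lower bound on the checkpoint time, and only meaningful if
    the directory contains the files of exactly one checkpoint.
    """
    mtimes = [
        os.path.getmtime(os.path.join(root, name))
        for root, _dirs, files in os.walk(directory)
        for name in files
    ]
    if not mtimes:
        raise FileNotFoundError(f"no files under {directory}")
    return max(mtimes) - min(mtimes)
```

Calling `checkpoint_write_span("/partnfs/mpichchk")` after a single checkpoint has completed would give an approximate write time to compare against the chosen interval.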

Thank you very much.

Marcela



mpiexec -info
HYDRA build details:
    Version:                                 1.5
    Release Date:                            Mon Oct  8 14:00:48 CDT 2012
    CC:                              gcc
    CXX:                             c++
    F77:                             gfortran
    F90:                             gfortran
    Configure options:                       '--disable-option-checking' '--prefix=/soft/mpich2/mpich2-1.5-blcr8.3' '--with-hydra-ckpointlib=blcr' '--with-blcr=/soft/blcr' '--enable-checkpointing' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS= -L/soft/blcr/lib64 -L/soft/blcr/lib' 'LIBS=-lrt -lcr -lpthread ' 'CPPFLAGS= -I/SRC/mpi/mpich2-1.5/src/mpl/include -I/SRC/mpi/mpich2-1.5/src/mpl/include -I/SRC/mpi/mpich2-1.5/src/openpa/src -I/SRC/mpi/mpich2-1.5/src/openpa/src -I/SRC/mpi/mpich2-1.5/src/mpi/romio/include -I/soft/blcr/include'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs
    Checkpointing libraries available:       blcr
    Demux engines available:                 poll select

