[mpich-discuss] MPICH 2.1.5 checkpoint

Marcela Castro León mcastrol at gmail.com
Thu Jul 17 04:41:28 CDT 2014

I'm using MPICH2 1.5 built with BLCR checkpointing support, and
I'm having problems with the checkpoint interval.
I run:
mpiexec -ckpointlib blcr -ckpoint-prefix /partnfs/mpichchk
-ckpoint-interval 120 -f maquinas -n 16 ./bt.C.16

At the 120-second mark, execution is interrupted for checkpointing as
expected. However, because the checkpoint takes more than 120 seconds,
another checkpoint is triggered immediately instead of the application
resuming. The only way I have managed to obtain a checkpoint is to set the
interval so that it fires close to the end of the run, which is not useful.
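
To illustrate the symptom, here is a toy model in Python. It is only a sketch of the behaviour I observe, not Hydra's actual scheduling code: it assumes the interval timer keeps counting wall-clock time while a checkpoint is being written, so a tick that expires during a checkpoint fires as soon as that checkpoint finishes. The function names and the 150-second checkpoint duration are made up for the example; I only know that my checkpoint takes more than 120 seconds.

```python
def useful_time_per_cycle(interval_s: float, checkpoint_s: float) -> float:
    """Seconds of application progress between two consecutive checkpoints.

    Toy model (an assumption, not Hydra's real logic): the timer fires every
    interval_s seconds of wall-clock time, including time spent checkpointing.
    """
    if checkpoint_s >= interval_s:
        # The next tick has already expired when the checkpoint ends,
        # so the application never gets to resume.
        return 0.0
    return interval_s - checkpoint_s


def min_interval_for_overhead(checkpoint_s: float, max_overhead: float) -> float:
    """Smallest interval for which checkpointing uses at most max_overhead
    (a fraction between 0 and 1) of each cycle in the same toy model."""
    if not 0.0 < max_overhead < 1.0:
        raise ValueError("max_overhead must be strictly between 0 and 1")
    return checkpoint_s / max_overhead


# Hypothetical checkpoint duration of 150 s with the 120 s interval from my command.
print(useful_time_per_cycle(120, 150))
# Same hypothetical duration with a 600 s interval.
print(useful_time_per_cycle(600, 150))
# Interval needed to keep checkpointing at or below 25% of each cycle.
print(min_interval_for_overhead(150, 0.25))
```

If this model is right, any interval shorter than the checkpoint duration gives no progress at all, which is why I need to know how long a checkpoint actually takes.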

How can I solve this?
Also, is there a way to find out how long a checkpoint takes?

Thank you very much.


mpiexec -info
HYDRA build details:
    Version:                                 1.5
    Release Date:                            Mon Oct  8 14:00:48 CDT 2012
    CC:                              gcc
    CXX:                             c++
    F77:                             gfortran
    F90:                             gfortran
    Configure options:                       '--disable-option-checking'
'--prefix=/soft/mpich2/mpich2-1.5-blcr8.3' '--with-hydra-ckpointlib=blcr'
'--with-blcr=/soft/blcr' '--enable-checkpointing' '--cache-file=/dev/null'
'--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS= -L/soft/blcr/lib64
-L/soft/blcr/lib' 'LIBS=-lrt -lcr -lpthread ' 'CPPFLAGS=
-I/SRC/mpi/mpich2-1.5/src/mpl/include -I/SRC/mpi/mpich2-1.5/src/mpl/include
-I/SRC/mpi/mpich2-1.5/src/openpa/src -I/SRC/mpi/mpich2-1.5/src/openpa/src
-I/SRC/mpi/mpich2-1.5/src/mpi/romio/include -I/soft/blcr/include'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge
manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs
    Checkpointing libraries available:       blcr
    Demux engines available:                 poll select
