[mpich-discuss] mpich2 - checkpointing error

Wesley Bland wbland at anl.gov
Mon Apr 7 07:16:31 CDT 2014


Unfortunately, this is a known problem at the moment. BLCR checkpointing
has been broken in MPICH for the last few releases. We are working on a
fix for a future release.

Thanks,
Wesley

On Monday, April 7, 2014, Marcelo Paiva Ramos <marcelo.paiva at cptec.inpe.br>
wrote:

>  Hi,
> Can you help me to solve this problem?
>
>  cat /etc/issue
> CentOS release 6.5 (Final)
> Kernel \r on an \m
>
>  uname -a
> Linux server 2.6.32-431.11.2.el6.x86_64 #1 SMP Tue Mar 25 19:59:55 UTC
> 2014 x86_64 x86_64 x86_64 GNU/Linux
>
>  *INSTALL: blcr-0.8.5*
>  tar xzvf blcr-0.8.5.tar.gz
> cd blcr-0.8.5
> mkdir builddir
> cd builddir
> ../configure --prefix=/opt/blcr
> make
> make install
> /sbin/insmod /opt/blcr/lib/blcr/2.6.32-431.11.2.el6.x86_64/blcr_imports.ko
> /sbin/insmod /opt/blcr/lib/blcr/2.6.32-431.11.2.el6.x86_64/blcr.ko
>  uname -r
> 2.6.32-431.11.2.el6.x86_64
>  lsmod | grep blcr
> blcr                  115465  0
> blcr_imports           10715  1 blcr
>  ldconfig -p | grep blcr
>  libcr_run.so.0 (libc6,x86-64) => /opt/blcr/lib/libcr_run.so.0
>  libcr_run.so (libc6,x86-64) => /opt/blcr/lib/libcr_run.so
>  libcr_omit.so.0 (libc6,x86-64) => /opt/blcr/lib/libcr_omit.so.0
>  libcr_omit.so (libc6,x86-64) => /opt/blcr/lib/libcr_omit.so
>  libcr.so.0 (libc6,x86-64) => /opt/blcr/lib/libcr.so.0
>  libcr.so (libc6,x86-64) => /opt/blcr/lib/libcr.so
> chkconfig --list | grep blcr
> blcr               0:off    1:off    2:on    3:on    4:on    5:on    6:off
>
>  *INSTALL: mpich-3.1*
>  tar xzvf mpich-3.1.tar.gz
> cd mpich-3.1
> ./configure --disable-fast CFLAGS=-O2 FFLAGS=-O2 CXXFLAGS=-O2 FCFLAGS=-O2 \
>     --prefix=/opt/mpich2/ CC=/opt/intel/bin/icc FC=/opt/intel/bin/ifort \
>     F77=/opt/intel/bin/ifort --enable-checkpointing \
>     --with-hydra-ckpointlib=blcr --with-blcr=/opt/blcr \
>     --with-blcr-include=/opt/blcr/include --with-blcr-lib=/opt/blcr/lib
> make
> make install
>
>  *.bashrc*
>  export PATH=$PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/blcr/bin:/opt/mpich2/bin
> export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib64:/opt/intel/lib/intel64:/opt/blcr/lib:/opt/mpich2/lib
>
>  *mpiexec -info*
> HYDRA build details:
>     Version:                                 3.1
>     Release Date:                            Thu Feb 20 11:41:13 CST 2014
>     CC:                              /opt/intel/bin/icc  -O2
>     CXX:                             g++  -O2
>     F77:                             /opt/intel/bin/ifort -O2
>     F90:                             /opt/intel/bin/ifort -O2
>     Configure options:                       '--disable-option-checking'
> '--prefix=/opt/mpich2' '--disable-fast' 'CFLAGS=-O2 -O0' 'FFLAGS=-O2 -O0'
> 'CXXFLAGS=-O2 ' 'FCFLAGS=-O2 ' 'CC=/opt/intel/bin/icc'
> 'FC=/opt/intel/bin/ifort' 'F77=/opt/intel/bin/ifort'
> '--enable-checkpointing' '--with-hydra-ckpointlib=blcr'
> '--with-blcr=/opt/blcr' '--with-blcr-include=/opt/blcr/include'
> '--with-blcr-lib=/opt/blcr/lib' '--cache-file=/dev/null' '--srcdir=.'
> 'LDFLAGS= -L/opt/blcr/lib' 'LIBS=-lrt -lcr -lpthread ' 'CPPFLAGS=
> -I/root/mpich-3.1/src/mpl/include -I/root/mpich-3.1/src/mpl/include
> -I/root/mpich-3.1/src/openpa/src -I/root/mpich-3.1/src/openpa/src
> -I/root/mpich-3.1/src/mpi/romio/include -I/opt/blcr/include'
>     Process Manager:                         pmi
>     Launchers available:                     ssh rsh fork slurm ll lsf sge
> pbs manual persist
>     Topology libraries available:            hwloc
>     Resource management kernels available:   user slurm ll lsf sge pbs
> cobalt
>     Checkpointing libraries available:       blcr
>     Demux engines available:                 poll select
>
>
>  *ERROR*
>  mpiexec -n 1 -ckpointlib blcr -ckpoint-interval 20 -ckpoint-prefix \
>     /home/marcelo/TESTE/ ./teste
> [proxy:0:0@server] requesting checkpoint
> [proxy:0:0@server] checkpoint completed
> [proxy:0:0@server] HYDT_ckpoint_blcr_checkpoint
> (tools/ckpoint/blcr/ckpoint_blcr.c:241): Checkpointing failed.  Make sure
> BLCR kernel module is loaded. Unknown error 2356
> [proxy:0:0@server] ckpoint_thread (tools/ckpoint/ckpoint.c:76): blcr
> checkpoint returned error
> [proxy:0:0@server] requesting checkpoint
> [proxy:0:0@server] checkpoint completed
> [proxy:0:0@server] HYDT_ckpoint_blcr_checkpoint
> (tools/ckpoint/blcr/ckpoint_blcr.c:241): Checkpointing failed.  Make sure
> BLCR kernel module is loaded. Unknown error 2356
> [proxy:0:0@server] ckpoint_thread (tools/ckpoint/ckpoint.c:76): blcr
> checkpoint returned error
>
>  Best regards,
>  Marcelo.
>
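
The error text asks whether the BLCR kernel module is loaded, so that is
worth ruling out first. The following is a minimal sketch of such a check,
assuming a POSIX shell with awk available. The function name
blcr_modules_loaded is hypothetical (it is not part of BLCR or MPICH); it
reads lsmod-style output on stdin and succeeds only when both the blcr and
blcr_imports modules are listed.

```shell
# Hypothetical helper, not shipped with BLCR or MPICH.
# Reads lsmod-style output on stdin; exit status 0 only if both
# "blcr" and "blcr_imports" appear in the first column.
blcr_modules_loaded() {
    awk '$1 == "blcr"         { core = 1 }
         $1 == "blcr_imports" { imports = 1 }
         END { exit !(core && imports) }'
}

# Example use on a live system:
#   lsmod | blcr_modules_loaded && echo "BLCR modules are loaded"
```

Applied to the lsmod output quoted above, the check succeeds, which is
consistent with the failure being the known MPICH problem rather than a
missing kernel module.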