[mpich-discuss] mpich2 - checkpointing error
Wesley Bland
wbland at anl.gov
Mon Apr 7 07:16:31 CDT 2014
Unfortunately, this is a known problem at the moment. BLCR checkpointing in
MPICH has been broken for the past few releases. We're working on a fix for
a future release.
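
In the meantime, it may be worth confirming that BLCR itself works on your
machine, outside of MPICH. The following is only a rough sketch: it assumes
BLCR's cr_run, cr_checkpoint and cr_restart tools are on your PATH, and the
checkpoint file name /tmp/blcr_test.ckpt and the helper blcr_modules_loaded
are made up for illustration.

```shell
#!/bin/sh
# Sketch: exercise BLCR on a plain (non-MPI) process.
# Assumptions: cr_run and cr_checkpoint are on PATH; the checkpoint
# file name below is hypothetical.

# Return 0 if both BLCR kernel modules appear in lsmod-style text on stdin.
blcr_modules_loaded() {
    awk '$1 == "blcr"         { core = 1 }
         $1 == "blcr_imports" { imports = 1 }
         END { exit !(core && imports) }'
}

if ! command -v cr_run >/dev/null 2>&1; then
    echo "BLCR tools not found on PATH; skipping standalone test"
elif ! lsmod | blcr_modules_loaded; then
    echo "BLCR kernel modules are not loaded"
else
    ckpt=/tmp/blcr_test.ckpt
    cr_run sleep 60 &          # start an ordinary process under BLCR
    pid=$!
    sleep 1                    # give the process a moment to start
    # Write a checkpoint to $ckpt, then terminate the process.
    if cr_checkpoint --term -f "$ckpt" "$pid"; then
        echo "standalone checkpoint OK: $ckpt"
    else
        echo "standalone checkpoint failed"
        kill "$pid" 2>/dev/null
    fi
fi
```

If the checkpoint succeeds, running cr_restart on the resulting file should
resume the sleep. That would suggest the BLCR installation is fine and the
failure is on the MPICH side.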
Thanks,
Wesley
On Monday, April 7, 2014, Marcelo Paiva Ramos <marcelo.paiva at cptec.inpe.br>
wrote:
> Hi,
> Can you help me solve this problem?
>
> cat /etc/issue
> CentOS release 6.5 (Final)
> Kernel \r on an \m
>
> uname -a
> Linux server 2.6.32-431.11.2.el6.x86_64 #1 SMP Tue Mar 25 19:59:55 UTC
> 2014 x86_64 x86_64 x86_64 GNU/Linux
>
> *INSTALL: blcr-0.8.5*
> tar xzvf blcr-0.8.5.tar.gz
> cd blcr-0.8.5
> mkdir builddir
> cd builddir
> ../configure --prefix=/opt/blcr
> make
> make install
> /sbin/insmod /opt/blcr/lib/blcr/2.6.32-431.11.2.el6.x86_64/blcr_imports.ko
> /sbin/insmod /opt/blcr/lib/blcr/2.6.32-431.11.2.el6.x86_64/blcr.ko
> uname -r
> 2.6.32-431.11.2.el6.x86_64
> lsmod | grep blcr
> blcr 115465 0
> blcr_imports 10715 1 blcr
> ldconfig -p | grep blcr
> libcr_run.so.0 (libc6,x86-64) => /opt/blcr/lib/libcr_run.so.0
> libcr_run.so (libc6,x86-64) => /opt/blcr/lib/libcr_run.so
> libcr_omit.so.0 (libc6,x86-64) => /opt/blcr/lib/libcr_omit.so.0
> libcr_omit.so (libc6,x86-64) => /opt/blcr/lib/libcr_omit.so
> libcr.so.0 (libc6,x86-64) => /opt/blcr/lib/libcr.so.0
> libcr.so (libc6,x86-64) => /opt/blcr/lib/libcr.so
> chkconfig --list | grep blcr
> blcr 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>
> *INSTALL: mpich-3.1*
> tar xzvf mpich-3.1.tar.gz
> cd mpich-3.1
> ./configure --disable-fast CFLAGS=-O2 FFLAGS=-O2 CXXFLAGS=-O2 FCFLAGS=-O2 \
>     --prefix=/opt/mpich2/ CC=/opt/intel/bin/icc FC=/opt/intel/bin/ifort \
>     F77=/opt/intel/bin/ifort --enable-checkpointing \
>     --with-hydra-ckpointlib=blcr --with-blcr=/opt/blcr \
>     --with-blcr-include=/opt/blcr/include --with-blcr-lib=/opt/blcr/lib
> make
> make install
>
> *.bashrc*
> export PATH=$PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/blcr/bin:/opt/mpich2/bin
> export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib64:/opt/intel/lib/intel64:/opt/blcr/lib:/opt/mpich2/lib
>
> *mpiexec -info*
> HYDRA build details:
> Version: 3.1
> Release Date: Thu Feb 20 11:41:13 CST 2014
> CC: /opt/intel/bin/icc -O2
> CXX: g++ -O2
> F77: /opt/intel/bin/ifort -O2
> F90: /opt/intel/bin/ifort -O2
> Configure options: '--disable-option-checking'
> '--prefix=/opt/mpich2' '--disable-fast' 'CFLAGS=-O2 -O0' 'FFLAGS=-O2 -O0'
> 'CXXFLAGS=-O2 ' 'FCFLAGS=-O2 ' 'CC=/opt/intel/bin/icc'
> 'FC=/opt/intel/bin/ifort' 'F77=/opt/intel/bin/ifort'
> '--enable-checkpointing' '--with-hydra-ckpointlib=blcr'
> '--with-blcr=/opt/blcr' '--with-blcr-include=/opt/blcr/include'
> '--with-blcr-lib=/opt/blcr/lib' '--cache-file=/dev/null' '--srcdir=.'
> 'LDFLAGS= -L/opt/blcr/lib' 'LIBS=-lrt -lcr -lpthread ' 'CPPFLAGS=
> -I/root/mpich-3.1/src/mpl/include -I/root/mpich-3.1/src/mpl/include
> -I/root/mpich-3.1/src/openpa/src -I/root/mpich-3.1/src/openpa/src
> -I/root/mpich-3.1/src/mpi/romio/include -I/opt/blcr/include'
> Process Manager: pmi
> Launchers available: ssh rsh fork slurm ll lsf sge
> pbs manual persist
> Topology libraries available: hwloc
> Resource management kernels available: user slurm ll lsf sge pbs
> cobalt
> Checkpointing libraries available: blcr
> Demux engines available: poll select
>
>
> *ERROR*
> mpiexec -n 1 -ckpointlib blcr -ckpoint-interval 20 \
>     -ckpoint-prefix /home/marcelo/TESTE/ ./teste
> [proxy:0:0@server] requesting checkpoint
> [proxy:0:0@server] checkpoint completed
> [proxy:0:0@server] HYDT_ckpoint_blcr_checkpoint (tools/ckpoint/blcr/ckpoint_blcr.c:241): Checkpointing failed. Make sure BLCR kernel module is loaded. Unknown error 2356
> [proxy:0:0@server] ckpoint_thread (tools/ckpoint/ckpoint.c:76): blcr checkpoint returned error
> [proxy:0:0@server] requesting checkpoint
> [proxy:0:0@server] checkpoint completed
> [proxy:0:0@server] HYDT_ckpoint_blcr_checkpoint (tools/ckpoint/blcr/ckpoint_blcr.c:241): Checkpointing failed. Make sure BLCR kernel module is loaded. Unknown error 2356
> [proxy:0:0@server] ckpoint_thread (tools/ckpoint/ckpoint.c:76): blcr checkpoint returned error
>
> Best regards,
> Marcelo.
>