[mpich-discuss] mpich2 - checkpointing error
Marcelo Paiva Ramos
marcelo.paiva at cptec.inpe.br
Tue Apr 8 06:26:24 CDT 2014
Does it work with another version of MPICH2 and BLCR?
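
Since the error text says "Make sure BLCR kernel module is loaded", one way to narrow this down might be to test BLCR on its own, outside of mpiexec. The following is only a rough sketch under assumptions: the function names blcr_modules_loaded and standalone_checkpoint are made up for illustration, the 5-second delay is arbitrary, and it assumes the BLCR tools (cr_run, cr_checkpoint) from /opt/blcr/bin are on the PATH as in the .bashrc quoted below. It also assumes that cr_run executes the program in place, so that $! is the PID to checkpoint.

```shell
#!/bin/sh
# Sketch only: check BLCR independently of MPICH/Hydra.

# Return 0 if both BLCR kernel modules appear in lsmod-style
# output read from stdin, non-zero otherwise.
blcr_modules_loaded() {
    awk '$1 == "blcr"         { a = 1 }
         $1 == "blcr_imports" { b = 1 }
         END { exit !(a && b) }'
}

# Hypothetical helper: start a program under cr_run and take one
# checkpoint with cr_checkpoint. By default BLCR writes the image
# to context.<pid> in the current directory.
# Assumption: cr_run keeps the PID it was started with.
standalone_checkpoint() {
    prog=$1
    cr_run "$prog" &
    pid=$!
    sleep 5                      # arbitrary delay before checkpointing
    cr_checkpoint "$pid" && echo "checkpoint written: context.$pid"
}
```

Possible usage: `lsmod | blcr_modules_loaded && standalone_checkpoint ./teste`, followed by `cr_restart context.<pid>` to see whether the image can be restarted. If this works while mpiexec still fails, the problem is more likely on the MPICH side than in the BLCR installation.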
Best regards,
Marcelo.
On 07-04-2014 09:16, Wesley Bland wrote:
> Unfortunately, this is a known problem at the moment. BLCR
> checkpointing hasn't worked for a few versions now. It's something
> we're working to fix in a future version.
>
> Thanks,
> Wesley
>
> On Monday, April 7, 2014, Marcelo Paiva Ramos
> <marcelo.paiva at cptec.inpe.br> wrote:
>
> Hi,
> Can you help me to solve this problem?
>
> cat /etc/issue
> CentOS release 6.5 (Final)
> Kernel \r on an \m
>
> uname -a
> Linux server 2.6.32-431.11.2.el6.x86_64 #1 SMP Tue Mar 25 19:59:55
> UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>
> *INSTALL: blcr-0.8.5*
> tar xzvf blcr-0.8.5.tar.gz
> cd blcr-0.8.5
> mkdir builddir
> cd builddir
> ../configure --prefix=/opt/blcr
> make
> make install
> /sbin/insmod /opt/blcr/lib/blcr/2.6.32-431.11.2.el6.x86_64/blcr_imports.ko
> /sbin/insmod /opt/blcr/lib/blcr/2.6.32-431.11.2.el6.x86_64/blcr.ko
> uname -r
> 2.6.32-431.11.2.el6.x86_64
> lsmod | grep blcr
> blcr 115465 0
> blcr_imports 10715 1 blcr
> ldconfig -p | grep blcr
> libcr_run.so.0 (libc6,x86-64) => /opt/blcr/lib/libcr_run.so.0
> libcr_run.so (libc6,x86-64) => /opt/blcr/lib/libcr_run.so
> libcr_omit.so.0 (libc6,x86-64) => /opt/blcr/lib/libcr_omit.so.0
> libcr_omit.so (libc6,x86-64) => /opt/blcr/lib/libcr_omit.so
> libcr.so.0 (libc6,x86-64) => /opt/blcr/lib/libcr.so.0
> libcr.so (libc6,x86-64) => /opt/blcr/lib/libcr.so
> chkconfig --list | grep blcr
> blcr 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>
> *INSTALL: mpich-3.1*
> tar xzvf mpich-3.1.tar.gz
> cd mpich-3.1
> ./configure --disable-fast CFLAGS=-O2 FFLAGS=-O2 CXXFLAGS=-O2
> FCFLAGS=-O2 --prefix=/opt/mpich2/ CC=/opt/intel/bin/icc
> FC=/opt/intel/bin/ifort F77=/opt/intel/bin/ifort
> --enable-checkpointing --with-hydra-ckpointlib=blcr
> --with-blcr=/opt/blcr --with-blcr-include=/opt/blcr/include
> --with-blcr-lib=/opt/blcr/lib
> make
> make install
>
> *.bashrc*
> export PATH=$PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/blcr/bin:/opt/mpich2/bin
> export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib64:/opt/intel/lib/intel64:/opt/blcr/lib:/opt/mpich2/lib
>
> *mpiexec -info*
> HYDRA build details:
> Version: 3.1
> Release Date: Thu Feb 20 11:41:13 CST 2014
> CC: /opt/intel/bin/icc -O2
> CXX: g++ -O2
> F77: /opt/intel/bin/ifort -O2
> F90: /opt/intel/bin/ifort -O2
> Configure options: '--disable-option-checking'
> '--prefix=/opt/mpich2' '--disable-fast' 'CFLAGS=-O2 -O0'
> 'FFLAGS=-O2 -O0' 'CXXFLAGS=-O2 ' 'FCFLAGS=-O2 '
> 'CC=/opt/intel/bin/icc' 'FC=/opt/intel/bin/ifort'
> 'F77=/opt/intel/bin/ifort' '--enable-checkpointing'
> '--with-hydra-ckpointlib=blcr' '--with-blcr=/opt/blcr'
> '--with-blcr-include=/opt/blcr/include'
> '--with-blcr-lib=/opt/blcr/lib' '--cache-file=/dev/null'
> '--srcdir=.' 'LDFLAGS= -L/opt/blcr/lib' 'LIBS=-lrt -lcr -lpthread
> ' 'CPPFLAGS= -I/root/mpich-3.1/src/mpl/include
> -I/root/mpich-3.1/src/mpl/include -I/root/mpich-3.1/src/openpa/src
> -I/root/mpich-3.1/src/openpa/src
> -I/root/mpich-3.1/src/mpi/romio/include -I/opt/blcr/include'
> Process Manager: pmi
> Launchers available: ssh rsh fork slurm ll
> lsf sge pbs manual persist
> Topology libraries available: hwloc
> Resource management kernels available: user slurm ll lsf sge
> pbs cobalt
> Checkpointing libraries available: blcr
> Demux engines available: poll select
>
>
> *ERROR*
> mpiexec -n 1 -ckpointlib blcr -ckpoint-interval 20 -ckpoint-prefix
> /home/marcelo/TESTE/ ./teste
> [proxy:0:0 at server] requesting checkpoint
> [proxy:0:0 at server] checkpoint completed
> [proxy:0:0 at server] HYDT_ckpoint_blcr_checkpoint
> (tools/ckpoint/blcr/ckpoint_blcr.c:241): Checkpointing failed.
> Make sure BLCR kernel module is loaded. Unknown error 2356
> [proxy:0:0 at server] ckpoint_thread (tools/ckpoint/ckpoint.c:76):
> blcr checkpoint returned error
> [proxy:0:0 at server] requesting checkpoint
> [proxy:0:0 at server] checkpoint completed
> [proxy:0:0 at server] HYDT_ckpoint_blcr_checkpoint
> (tools/ckpoint/blcr/ckpoint_blcr.c:241): Checkpointing failed.
> Make sure BLCR kernel module is loaded. Unknown error 2356
> [proxy:0:0 at server] ckpoint_thread (tools/ckpoint/ckpoint.c:76):
> blcr checkpoint returned error
>
> Best regards,
> Marcelo.
>
>
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss