[mpich-discuss] mpich3.1.4 checkpoint restart failed error.
Pankaj Karoriya
pankaj2karoriya at gmail.com
Fri Jun 12 01:55:18 CDT 2015
I am using MPICH 3.1.4 on RHEL 6.2 with BLCR 0.8.5. I configured MPICH with
the BLCR library using the following steps:
1. ./configure --prefix=/usr/local/mpich-3.0.4-gcc --enable-cc --enable-fc
--with-device=ch3:sock CC=gcc FC=gfortran --enable-checkpointing
--with-hydra-ckpointlib=blcr --with-blcr=/usr/local/blcr-0.8.5
LD_LIBRARY_PATH=/usr/local/blcr-0.8.5/lib
2. make
3. make install
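
One way to confirm that the installed Hydra launcher was actually built with BLCR support is to inspect the output of `mpiexec -info`, which lists the checkpointing libraries it knows about. The following is only a minimal sketch: the helper name `hydra_has_blcr` is hypothetical, and the install path in the usage comment is taken from the `--prefix` above.

```shell
#!/bin/sh
# Hypothetical helper: reads `mpiexec -info` output on stdin and succeeds
# only if the "Checkpointing libraries available" line mentions blcr.
hydra_has_blcr() {
    grep -i 'checkpointing libraries available' | grep -qi 'blcr'
}

# Usage (against the installation configured above):
#   /usr/local/mpich-3.0.4-gcc/bin/mpiexec -info | hydra_has_blcr \
#       && echo "blcr checkpointing is compiled in"
```

If the check fails, the `--with-hydra-ckpointlib=blcr` option probably did not take effect at configure time, and `config.log` would be the next place to look.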
After installation, running an MPI program with checkpointing enabled creates
the context files for the MPI process.
When I then try to restart the program from a checkpoint, it fails with this error:
No alternate input file when restoring an external pipe
kernel: blcr: cr_restore_all_files [32120]: Unable to restore fd 0 (type=4,err=-9)
kernel: blcr: cr_rstrt_child [32120]: Unable to restore files! (err=-9)
What can I do to resolve this problem?
Thanks, Pankaj
-------------- next part --------------
Step 1: Compile the MPI program
mpicc -o mprime mpi-prime.c -lm -L/usr/local/blcr-0.8.5/lib -lcr_run
Step 2: Run the MPI program with periodic checkpointing
mpiexec -ckpointlib blcr -ckpoint-prefix /home/kpankaj/my -ckpoint-interval 3 -np 4 /home/kpankaj/my/mprime
Program output:
Using 4 tasks to scan 50000000 numbers
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
Done. Largest prime is 49999991 Total primes 3001134
Wallclock time elapsed: 13.57 seconds
Using 4 tasks to scan 100000000 numbers for second round
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
Done. Largest prime is 99999989 Total primes for second round 5761455
Wallclock time elapsed for second round: 50.10 seconds
Checkpoint Files:
context-num0-0-0 context-num13-0-0 context-num18-0-0 context-num22-0-0 context-num27-0-0 context-num7-0-0
context-num1-0-0 context-num14-0-0 context-num19-0-0 context-num23-0-0 context-num3-0-0 context-num8-0-0
context-num10-0-0 context-num15-0-0 context-num2-0-0 context-num24-0-0 context-num4-0-0 context-num9-0-0
context-num11-0-0 context-num16-0-0 context-num20-0-0 context-num25-0-0 context-num5-0-0 mfile
context-num12-0-0 context-num17-0-0 context-num21-0-0 context-num26-0-0 context-num6-0-0
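
To choose a value for `-ckpoint-num` it can help to list which checkpoint numbers are present under the prefix directory. This is a minimal sketch under one assumption: the first number in each `context-num<N>-0-0` file name is the checkpoint number, which the listing above suggests but the thread does not confirm. The function name `list_ckpoint_nums` is hypothetical.

```shell
#!/bin/sh
# Print the checkpoint numbers found under a checkpoint prefix directory,
# sorted numerically. Assumes files are named context-num<N>-0-0.
list_ckpoint_nums() {
    prefix=$1
    for f in "$prefix"/context-num*-0-0; do
        [ -e "$f" ] || continue      # skip the literal pattern if nothing matched
        name=${f##*/}                # strip the directory part
        num=${name#context-num}      # drop the leading "context-num"
        num=${num%%-*}               # keep only the digits before the first "-"
        echo "$num"
    done | sort -n
}

# Usage:
#   list_ckpoint_nums /home/kpankaj/my              # all checkpoint numbers
#   list_ckpoint_nums /home/kpankaj/my | tail -n 1  # highest one
```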
Restarting the MPI program:
mpiexec -ckpointlib blcr -ckpoint-prefix /home/kpankaj/my -ckpoint-num 3 -np 4
/var/log/messages:
Jun 11 14:21:37 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:37 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:37 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:37 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:37 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:37 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:37 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:23:07 tbed2 kernel: blcr: No alternate input file when restoring an external pipe
Jun 11 14:23:07 tbed2 kernel: blcr: cr_restore_all_files [21426]: Unable to restore fd 0 (type=4,err=-9)
Jun 11 14:23:07 tbed2 kernel: blcr: cr_rstrt_child [21426]: Unable to restore files! (err=-9)
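
The last three kernel messages above say that fd 0 (standard input) could not be restored and mention an external pipe, which suggests that stdin of the checkpointed process was a pipe. As a diagnostic only, the sketch below reports what stdin is attached to in the shell that launches `mpiexec`. It assumes a Linux system where `/dev/stdin` refers to fd 0, and the function name `stdin_kind` is hypothetical. Whether changing how stdin is attached (for example redirecting it from a file) avoids the BLCR error is not established anywhere in this thread.

```shell
#!/bin/sh
# Report what kind of object standard input (fd 0) is in the current shell.
stdin_kind() {
    if [ -t 0 ]; then
        echo tty        # interactive terminal
    elif [ -p /dev/stdin ]; then
        echo pipe       # anonymous or named pipe
    elif [ -f /dev/stdin ]; then
        echo file       # regular file
    else
        echo other      # e.g. /dev/null or a socket
    fi
}

# Usage:
#   stdin_kind                 # run in the same shell that starts mpiexec
#   echo | stdin_kind          # prints "pipe"
#   stdin_kind < /dev/null     # prints "other"
```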
-------------- next part --------------
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss