[mpich-discuss] mpich3.1.4 checkpoint restart failed error.

Pankaj Karoriya pankaj2karoriya at gmail.com
Fri Jun 12 01:55:18 CDT 2015


I am using MPICH 3.1.4 on RHEL 6.2 with BLCR 0.8.5. I configured MPICH
with the BLCR library, using the following steps:

1. ./configure --prefix=/usr/local/mpich-3.0.4-gcc --enable-cc --enable-fc
   --with-device=ch3:sock CC=gcc FC=gfortran --enable-checkpointing
   --with-hydra-ckpointlib=blcr --with-blcr=/usr/local/blcr-0.8.5
   LD_LIBRARY_PATH=/usr/local/blcr-0.8.5/lib
2. make
3. make install

After installation, running the MPI program with checkpointing enabled
creates the context files for the MPI processes.
But when I try to restart the program from a checkpoint, it fails with
these errors:
  No alternate input file when restoring an external pipe
  kernel: blcr: cr_restore_all_files [32120]:  Unable to restore fd 0 (type=4,err=-9)
  kernel: blcr: cr_rstrt_child [32120]:  Unable to restore files!  (err=-9)
What should I do to fix this problem?

Thanks,
Pankaj
-------------- next part --------------
Step 1: Compile the MPI program
mpicc -o mprime mpi-prime.c -lm -L/usr/local/blcr-0.8.5/lib -lcr_run
Step 2: Run the MPI program
mpiexec -ckpointlib blcr -ckpoint-prefix /home/kpankaj/my -ckpoint-interval 3 -np 4 /home/kpankaj/my/mprime
Program output:
Using 4 tasks to scan 50000000 numbers
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
Done. Largest prime is 49999991 Total primes 3001134
Wallclock time elapsed: 13.57 seconds

 Using 4 tasks to scan 100000000 numbers for second round 
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
[proxy:0:0 at tbed2] requesting checkpoint
[proxy:0:0 at tbed2] checkpoint completed
Done. Largest prime is 99999989 Total primes for second round 5761455
Wallclock time elapsed for second round: 50.10 seconds 




Checkpoint Files:
context-num0-0-0   context-num13-0-0  context-num18-0-0  context-num22-0-0  context-num27-0-0  context-num7-0-0
context-num1-0-0   context-num14-0-0  context-num19-0-0  context-num23-0-0  context-num3-0-0   context-num8-0-0
context-num10-0-0  context-num15-0-0  context-num2-0-0   context-num24-0-0  context-num4-0-0   context-num9-0-0
context-num11-0-0  context-num16-0-0  context-num20-0-0  context-num25-0-0  context-num5-0-0   mfile
context-num12-0-0  context-num17-0-0  context-num21-0-0  context-num26-0-0  context-num6-0-0
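
As a reading aid for the listing above, here is a minimal Python sketch that groups the context files by checkpoint number. It assumes (this is not confirmed anywhere in the post) that the first number in a `context-numN-...` name is the checkpoint index that `-ckpoint-num` refers to. `checkpoints_by_number` is a hypothetical helper name, not part of MPICH or BLCR.

```python
import re
from collections import defaultdict

# File name pattern seen in the listing, e.g. "context-num13-0-0".
# Assumption: the first number is the checkpoint index; the meaning of
# the other two fields is not stated in the post and is not interpreted here.
CONTEXT_RE = re.compile(r"^context-num(\d+)-(\d+)-(\d+)$")


def checkpoints_by_number(filenames):
    """Group context file names by their leading number, in numeric order."""
    groups = defaultdict(list)
    for name in filenames:
        match = CONTEXT_RE.match(name)
        if match:  # unrelated files such as "mfile" are skipped
            groups[int(match.group(1))].append(name)
    return dict(sorted(groups.items()))
```

Calling `checkpoints_by_number(os.listdir("/home/kpankaj/my"))` (after `import os`) would show whether a file exists for the index passed to `-ckpoint-num`, for example index 3 in the restart command.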

Restarting the MPI program:
mpiexec -ckpointlib blcr -ckpoint-prefix /home/kpankaj/my -ckpoint-num 3 -np 4

/var/log/messages:

Jun 11 14:21:37 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:37 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:37 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:37 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:37 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:37 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:37 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:21:40 tbed2 kernel: blcr: warning: skipped a socket.
Jun 11 14:23:07 tbed2 kernel: blcr: No alternate input file when restoring an external pipe
Jun 11 14:23:07 tbed2 kernel: blcr: cr_restore_all_files [21426]:  Unable to restore fd 0 (type=4,err=-9)
Jun 11 14:23:07 tbed2 kernel: blcr: cr_rstrt_child [21426]:  Unable to restore files!  (err=-9)
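
To make the `err=-9` values in the last two log lines easier to read, the following Python sketch decodes them with the standard `errno` module. It relies on the Linux kernel convention of reporting errors as negative errno values, so `-9` would correspond to `EBADF` (bad file descriptor). The regular expression is written only against the two message shapes quoted above, and `decode_blcr_line` is a hypothetical helper name; neither comes from BLCR itself.

```python
import errno
import os
import re

# Matches the two BLCR error lines quoted in the log above, e.g.
# "blcr: cr_restore_all_files [21426]:  Unable to restore fd 0 (type=4,err=-9)"
BLCR_ERR = re.compile(
    r"blcr: (\w+) \[(\d+)\]:\s+(.*?)\s*\((?:type=(\d+),)?err=(-?\d+)\)"
)


def decode_blcr_line(line):
    """Return a dict describing a BLCR error line, or None if it does not match."""
    match = BLCR_ERR.search(line)
    if match is None:
        return None
    func, pid, message, _ftype, err = match.groups()
    code = abs(int(err))  # assumption: the value is a negated errno
    return {
        "function": func,
        "pid": int(pid),
        "message": message,
        "errno_name": errno.errorcode.get(code, "UNKNOWN"),
        "errno_text": os.strerror(code),
    }
```

Lines that do not carry an `err=` field, such as the `skipped a socket` warnings, simply return `None`.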