[mpich-discuss] Error while checkpointing an MPI application

Pavan Balaji balaji at mcs.anl.gov
Sun Dec 23 21:15:52 CST 2012


It looks like your blcr installation is not functioning correctly.  Did
you try it outside of the mpich environment to make sure it's installed
correctly?

 -- Pavan
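
For anyone following the thread, a standalone check along these lines may help separate BLCR problems from MPICH ones. It is only a sketch: it assumes the stock BLCR utilities (cr_run, cr_checkpoint, cr_restart) and a kernel module named blcr, and the function names and the sleep-based smoke test are made up for illustration. If any piece is missing, it reports that and skips the checkpoint/restart step.

```shell
#!/bin/sh
# Standalone BLCR sanity check, independent of MPICH/Hydra.
# Assumption: a stock BLCR install (tools cr_run/cr_checkpoint/cr_restart,
# kernel module "blcr"). Function names here are illustrative only.

# Report whether each BLCR command-line tool is on PATH.
# Returns 0 only if all three are found.
check_blcr_tools() {
    missing=0
    for tool in cr_run cr_checkpoint cr_restart; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: MISSING"
            missing=1
        fi
    done
    return "$missing"
}

# Report whether the BLCR kernel module is loaded (Linux only).
check_blcr_module() {
    if grep -q '^blcr ' /proc/modules 2>/dev/null; then
        echo "blcr module: loaded"
    else
        echo "blcr module: NOT loaded"
        return 1
    fi
}

# Checkpoint and restart a trivial non-MPI process.
smoke_test() {
    ckpt="${TMPDIR:-/tmp}/blcr_smoke.$$"
    cr_run sleep 60 &            # start a checkpointable process
    pid=$!
    sleep 1
    # Write the checkpoint to $ckpt and terminate the original process.
    if ! cr_checkpoint --term -f "$ckpt" "$pid"; then
        echo "checkpoint FAILED"
        kill "$pid" 2>/dev/null
        return 1
    fi
    wait "$pid" 2>/dev/null
    cr_restart "$ckpt" &         # bring the process back from the file
    rpid=$!
    sleep 1
    if kill -0 "$rpid" 2>/dev/null; then
        echo "restart OK"
        kill "$rpid" 2>/dev/null
    else
        echo "restart FAILED"
    fi
    rm -f "$ckpt"
}

if check_blcr_tools && check_blcr_module; then
    smoke_test
else
    echo "BLCR installation incomplete; skipping smoke test"
fi
```

If the smoke test fails here as well, the problem is likely in the BLCR installation itself (for example, modules built for a different kernel) rather than in the mpirun checkpoint options.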

On 09/21/2012 05:03 AM US Central Time, Manisha Chauhan wrote:
> Hi all
> 
> I mailed you earlier as well but did not get a reply to my query. I am
> hoping to get one this time. I installed the BLCR tool with MPICH2 and
> the Hydra module.
> 
> To checkpoint an MPI application, I ran this command:
> 
> mpirun -ckpointlib blcr -ckpoint-prefix
> /home/superusr/bhavya/b_eff_io/tmp/app.ckpoint -ckpoint-interval 10 -np
> 4 ./beff_out1 -MB 2730 -MT 32768 -T 20 -p ../test -f
> ../test_out/ser11_9_2012_20sec3
> 
> [proxy:0:0@power1.cdacb.in] requesting checkpoint
> [proxy:0:0@power1.cdacb.in] checkpoint completed
> [proxy:0:0@power1.cdacb.in] HYDT_ckpoint_blcr_checkpoint
> (./tools/ckpoint/blcr/ckpoint_blcr.c:244): cr_request_checkpoint failed,
> Unknown error 2356
> [proxy:0:0@power1.cdacb.in] ckpoint_thread
> (./tools/ckpoint/ckpoint.c:72): blcr checkpoint returned error
> [the same four messages repeat 11 more times]
> 
>  b_eff_io =  611.691 MB/s on 4 processes with 2730 MByte/PE, scheduled
> time=0.1 Min, on Linux power1.cdacb.in 2.6.32-279.2.1.el6.x86_64 #1 SMP
> Thu Jul 5 21:08:58 EDT 2012 x86_64, NOT VALID (see above)
> 
> It shows an error at each checkpoint, and when I try to restart the
> application it shows "Killed". Please help me.
> 
> Thanks & Regards
> Manisha Chauhan

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji


