[mpich-discuss] checkpoint error

Jeff Hammond jhammond at alcf.anl.gov
Sun May 19 17:33:13 CDT 2013


What Ralf was trying to point out is that you used /home/buntinas in a
path.  That home directory was likely copied and pasted from an
example created by Darius Buntinas, a developer of MPICH.  It is almost
certainly not a valid path on your machine.  Your home directory
appears to be /home/basma/, which makes sense given your name.

When Ralf said "It sounds like that path should be something like
/home/basma/ckpts/app.ckpoint instead," he was not suggesting that you
blindly copy and paste that path and assume it would work.  The use of
"something like" was meant to provoke some analysis on your part that
would lead you to identify the appropriate path to use there.
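
To make that analysis concrete: the error in your log says that the
stat of the checkpoint prefix failed with "No such file or directory",
so nothing exists at that path yet.  Below is a minimal sketch of
checking the path before launching.  It assumes the prefix is meant to
be an existing, writable directory (my assumption, not something stated
in this thread), and CKPT_DIR is only an illustrative variable name.

```shell
# Hypothetical checkpoint directory; substitute a path that makes sense
# on your machine. CKPT_DIR is an illustrative name, not an MPICH setting.
CKPT_DIR="${CKPT_DIR:-$HOME/ckpts/app.ckpoint}"

# Create the directory (and any missing parents) if it is not there yet.
mkdir -p "$CKPT_DIR"

# Confirm it is a writable directory before handing it to mpiexec.
if [ -d "$CKPT_DIR" ] && [ -w "$CKPT_DIR" ]; then
    echo "checkpoint prefix OK: $CKPT_DIR"
else
    echo "checkpoint prefix not usable: $CKPT_DIR" >&2
fi

# Then rerun the original command with the verified path, for example:
# mpiexec -ckpointlib blcr -ckpoint-prefix "$CKPT_DIR" -ckpoint-interval 1 \
#     -n 4 /home/basma/NPB3.3/NPB3.3/NPB3.3-MPI/bin/is.A.4
```

The mpiexec line is left commented out so the check can be run on its
own first; only the path handling is new, the flags are the ones from
your original command.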

I am tempted to suggest that you use "-ckpoint-prefix /dev/null",
which will probably work (just not the way you might like), but I am
afraid that you will not internalize what I'm trying to say and
actually follow that advice.

Jeff

On Sun, May 19, 2013 at 4:33 PM, basma a.azeem
<basmaabdelazeem at hotmail.com> wrote:
> same error
>
> basma at basma-Satellite-A500:~$ mpiexec -ckpointlib blcr -ckpoint-prefix
> /home/basma/ckpts/app.ckpoint -ckpoint-interval 1  -n 4
> /home/basma/NPB3.3/NPB3.3/NPB3.3-MPI/bin/is.A.4
>
>
>
>  NAS Parallel Benchmarks 3.3 -- IS Benchmark
>
>  Size:  8388608  (class A)
>  Iterations:   100
>  Number of processes:     4
>
>    iteration
>         1
>         2
>         3
>         4
>         5
>         6
>         7
>         8
>         9
>         10
>         11
>
> [proxy:0:0 at basma-Satellite-A500] requesting checkpoint
> [proxy:0:0 at basma-Satellite-A500] HYDT_ckpoint_checkpoint
> (./tools/ckpoint/ckpoint.c:106): Failed to stat checkpoint prefix
> "/home/basma/ckpts/app.ckpoint": No such file or directory
>
> [proxy:0:0 at basma-Satellite-A500] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
> [proxy:0:0 at basma-Satellite-A500] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at basma-Satellite-A500] main (./pm/pmiserv/pmip.c:206): demux
> engine error waiting for event
> [mpiexec at basma-Satellite-A500] control_cb (./pm/pmiserv/pmiserv_cb.c:202):
> assert (!closed) failed
> [mpiexec at basma-Satellite-A500] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at basma-Satellite-A500] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
> [mpiexec at basma-Satellite-A500] main (./ui/mpich/mpiexec.c:331): process
> manager error waiting for completion
> basma at basma-Satellite-A500:~$
>
>
> ________________________________
> Date: Sun, 19 May 2013 16:26:32 -0500
> From: correac2 at illinois.edu
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] checkpoint error
>
>
>
> On May 19, 2013 4:02 PM, "basma a.azeem" <basmaabdelazeem at hotmail.com>
> wrote:
>>
>> Thank you for your help
>>
>> I am using mpich-3.0.3 with blcr-0.8.5 to checkpoint the Integer Sort
>> app of NPB, and it gives me the following error:
>>
>>
>> basma at basma-Satellite-A500:~$ mpiexec -ckpointlib blcr -ckpoint-prefix
>> /home/buntinas/ckpts/app.ckpoint -ckpoint-interval 1  -n 4
>> /home/basma/NPB3.3/NPB3.3/NPB3.3-MPI/bin/is.A.4
> It sounds like that path should be something like
> /home/basma/ckpts/app.ckpoint instead.
>
>>
>>
>>  NAS Parallel Benchmarks 3.3 -- IS Benchmark
>>
>>  Size:  8388608  (class A)
>>  Iterations:   100
>>  Number of processes:     4
>>
>>    iteration
>>         1
>>         2
>>         3
>>         4
>>         5
>>         6
>>         7
>>         8
>>         9
>>         10
>> [proxy:0:0 at basma-Satellite-A500] requesting checkpoint
>> [proxy:0:0 at basma-Satellite-A500] HYDT_ckpoint_checkpoint
>> (./tools/ckpoint/ckpoint.c:106): Failed to stat checkpoint prefix
>> "/home/buntinas/ckpts/app.ckpoint": No such file or directory
>> [proxy:0:0 at basma-Satellite-A500] HYD_pmcd_pmip_control_cmd_cb
>> (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
>> [proxy:0:0 at basma-Satellite-A500] HYDT_dmxu_poll_wait_for_event
>> (./tools/demux/demux_poll.c:77): callback returned error status
>> [proxy:0:0 at basma-Satellite-A500] main (./pm/pmiserv/pmip.c:206): demux
>> engine error waiting for event
>> [mpiexec at basma-Satellite-A500] control_cb (./pm/pmiserv/pmiserv_cb.c:202):
>> assert (!closed) failed
>> [mpiexec at basma-Satellite-A500] HYDT_dmxu_poll_wait_for_event
>> (./tools/demux/demux_poll.c:77): callback returned error status
>> [mpiexec at basma-Satellite-A500] HYD_pmci_wait_for_completion
>> (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
>> [mpiexec at basma-Satellite-A500] main (./ui/mpich/mpiexec.c:331): process
>> manager error waiting for completion
>> basma at basma-Satellite-A500:~$
>>
>>
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
ALCF docs: http://www.alcf.anl.gov/user-guides


