[mpich-discuss] BLCR restart makes processes hang

Bland, Wesley B. wbland at anl.gov
Mon Aug 18 08:58:59 CDT 2014


Unfortunately, you’re correct that there isn’t currently a solution. Until we fix that ticket, checkpointing does not work in MPICH. Fixing it is on the roadmap, along with some new fault tolerance features, but it’s not there yet.

Thanks,
Wesley

On Aug 18, 2014, at 8:51 AM, myself <chcdlf at 126.com> wrote:

I tried to use BLCR with MPICH3, but it does not seem to work. I compiled BLCR on CentOS, and `make test` reported no failing tests. Then I compiled MPICH with BLCR support. The build information is as follows:

$ mpichversion
MPICH Version:     3.1.2
MPICH Release date: Mon Jul 21 16:00:21 CDT 2014
MPICH Device:     ch3:nemesis
MPICH configure: --prefix=/home/test/develop/mpich3-blcr --with-device=ch3:nemesis CFLAGS=-fPIC --enable-checkpointing --with-blcr=/home/test/develop/blcr-0.8.5 --with-hydra-ckpointlib=blcr
MPICH CC: gcc -fPIC   -O2
MPICH CXX: g++   -O2
MPICH F77: gfortran   -O2
MPICH FC: gfortran   -O2

After that, I compiled my application like this:

$ mpicc mpiblcr.c -o mpiblcr -lcr

When I first run the application, it appears to write the checkpoint files correctly, for example context-num0-0-0.

$ mpiexec -ckpointlib blcr -ckpoint-prefix `pwd` -ckpoint-interval 2 -n 2 ./mpiblcr
5411) Step 0
5410) Step 0
5410) Step 1
5411) Step 1
[proxy:0:0 at node1] requesting checkpoint
[proxy:0:0 at node1] checkpoint completed
5410) Step 2
5411) Step 2
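
Before attempting a restart, it can help to confirm which checkpoint numbers actually exist in the prefix directory. The sketch below is only an illustration: the helper name list_ckpoint_nums is hypothetical, and it assumes the files follow a context-num<N>-<pgid>-<rank> pattern, which is inferred solely from the single example name context-num0-0-0 above.

```shell
# List the distinct checkpoint numbers found in a checkpoint directory.
# ASSUMPTION: files are named context-num<N>-<pgid>-<rank>; this layout is
# inferred from the one example name in this thread, not from MPICH docs.
list_ckpoint_nums() {
    dir=${1:-.}
    for f in "$dir"/context-num*-*-*; do
        [ -e "$f" ] || continue      # the glob matched nothing
        name=${f##*/}                # strip the directory part
        rest=${name#context-num}     # drop the fixed prefix
        echo "${rest%%-*}"           # keep the text before the first '-'
    done | sort -n -u
}
```

Running `list_ckpoint_nums "$(pwd)"` prints one number per line. If the value passed to -ckpoint-num on restart is not in that list, that is worth ruling out first, although it may well be unrelated to the hang described in this thread.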

However, when I try to restart the processes from the checkpoint, the job hangs and nothing is printed.

$ mpiexec -ckpointlib blcr -ckpoint-prefix `pwd` -n 2 -ckpoint-num 1

The pstree output shows that hydra_pmi_proxy has started the application process:

     ├─sshd─┬─3*[sshd───sshd───bash]
     │      ├─sshd───sshd───bash───mpiexec───hydra_pmi_proxy───mpiblcr
     │      └─sshd───sshd───bash───pstree

and `ps aux` shows that the process is defunct:

$ ps aux | grep mpiblcr
test   15290  0.0  0.0      0     0 ?        Z    21:44   0:00 [mpiblcr] <defunct>
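
A defunct entry means the process has exited but its parent has not yet collected its exit status. As a rough sketch for narrowing this down, the filter below keeps only zombie rows from `ps` output and prints each one with its parent PID, so the parent (here presumably hydra_pmi_proxy) can be inspected. The function name filter_zombies is hypothetical, and the column layout assumes the procps `ps -eo pid,ppid,stat,comm` format found on Linux.

```shell
# Keep only zombie (state Z) rows from `ps -eo pid,ppid,stat,comm` output.
# Prints: <pid> <ppid> <command> for each zombie; skips the header line.
filter_zombies() {
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}

# Usage on a live system:
ps -eo pid,ppid,stat,comm | filter_zombies
```

The second column of each printed line is the parent PID, which can then be looked up with `ps -p <ppid>`.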

I don't know how to diagnose this problem. I also see that someone reported the same problem several years ago in ticket #1144 <http://trac.mpich.org/projects/mpich/ticket/1144>, but no solution was posted there.





