[mpich-discuss] BLCR restart makes processes hang

myself chcdlf at 126.com
Mon Aug 18 08:51:45 CDT 2014


I tried to use BLCR with MPICH 3, but it does not seem to work. I built BLCR on CentOS, and `make test` reported no failed tests. I then built MPICH with BLCR support. The build information is as follows:


$ mpichversion 
MPICH Version:    3.1.2
MPICH Release date:Mon Jul 21 16:00:21 CDT 2014
MPICH Device:    ch3:nemesis
MPICH configure: --prefix=/home/test/develop/mpich3-blcr --with-device=ch3:nemesis CFLAGS=-fPIC --enable-checkpointing --with-blcr=/home/test/develop/blcr-0.8.5 --with-hydra-ckpointlib=blcr
MPICH CC: gcc -fPIC   -O2
MPICH CXX: g++   -O2
MPICH F77: gfortran   -O2
MPICH FC: gfortran   -O2


After that, I compiled my application like this:


$ mpicc mpiblcr.c -o mpiblcr -lcr


On the first run, the application seems to work and checkpoint files such as context-num0-0-0 are created.


$ mpiexec -ckpointlib blcr -ckpoint-prefix `pwd` -ckpoint-interval 2 -n 2 ./mpiblcr
5411) Step 0
5410) Step 0
5410) Step 1
5411) Step 1
[proxy:0:0 at node1] requesting checkpoint
[proxy:0:0 at node1] checkpoint completed
5410) Step 2
5411) Step 2
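
Before attempting a restart, it may help to confirm which checkpoint images were actually written under the prefix directory. The following is a minimal sketch, assuming the images follow the `context-num*` naming seen in this run (such as context-num0-0-0); `list_checkpoints` is a hypothetical helper name, not an MPICH or BLCR command.

```shell
# List checkpoint images in a directory (default: current directory).
# Assumption: image names start with "context-num", as observed in this run.
list_checkpoints() {
    dir=${1:-.}
    for f in "$dir"/context-num*; do
        # An unmatched glob stays literal, so skip names that do not exist.
        if [ -e "$f" ]; then
            printf '%s\n' "${f##*/}"
        fi
    done
}

# Usage: check the directory that was passed as -ckpoint-prefix.
list_checkpoints "$(pwd)"
```

The names printed can then be compared against the checkpoint number given to `-ckpoint-num` on restart.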


However, when I try to restart the processes from the checkpoint, it hangs and no output is printed.


$ mpiexec -ckpointlib blcr -ckpoint-prefix `pwd` -n 2 -ckpoint-num 1


pstree shows that the PMI proxy has started the application process:


     ├─sshd─┬─3*[sshd───sshd───bash]
     │      ├─sshd───sshd───bash───mpiexec───hydra_pmi_proxy───mpiblcr
     │      └─sshd───sshd───bash───pstree


and `ps aux` shows that the process is defunct:


$ ps aux | grep mpiblcr
test   15290  0.0  0.0      0     0 ?        Z    21:44   0:00 [mpiblcr] <defunct>
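
To see which process is the parent of the defunct mpiblcr (and so which one has not reaped it), the zombie entries can be listed together with their parent PIDs. This is a minimal sketch, assuming a procps-style `ps` that accepts `-eo pid,ppid,stat,comm`; `filter_zombies` is a hypothetical helper name, not part of MPICH or BLCR.

```shell
# Keep the header line plus every row whose STAT column starts with Z (zombie).
# Expects input in the column order: PID PPID STAT COMMAND.
filter_zombies() {
    awk 'NR == 1 || $3 ~ /^Z/'
}

# Usage: show defunct processes and their parent PIDs on this node.
ps -eo pid,ppid,stat,comm | filter_zombies
```

The PPID column of a defunct row identifies the process that should have collected its exit status.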


I don't know how to diagnose this problem. I also see that someone reported the same problem several years ago (#1144), but no solution was posted.




