[mpich-discuss] BLCR restart makes processes hang
myself
chcdlf at 126.com
Mon Aug 18 08:51:45 CDT 2014
I tried to use BLCR with MPICH3, but it does not seem to work. I compiled BLCR on CentOS, and `make test` reported no failed tests. I then compiled MPICH with BLCR support. The build information is as follows:
$ mpichversion
MPICH Version: 3.1.2
MPICH Release date:Mon Jul 21 16:00:21 CDT 2014
MPICH Device: ch3:nemesis
MPICH configure: --prefix=/home/test/develop/mpich3-blcr --with-device=ch3:nemesis CFLAGS=-fPIC --enable-checkpointing --with-blcr=/home/test/develop/blcr-0.8.5 --with-hydra-ckpointlib=blcr
MPICH CC: gcc -fPIC -O2
MPICH CXX: g++ -O2
MPICH F77: gfortran -O2
MPICH FC: gfortran -O2
After that, I compiled my application like this:
$ mpicc mpiblcr.c -o mpiblcr -lcr
The first time I run the application, checkpointing appears to work, and checkpoint files such as context-num0-0-0 are created.
$ mpiexec -ckpointlib blcr -ckpoint-prefix `pwd` -ckpoint-interval 2 -n 2 ./mpiblcr
5411) Step 0
5410) Step 0
5410) Step 1
5411) Step 1
[proxy:0:0 at node1] requesting checkpoint
[proxy:0:0 at node1] checkpoint completed
5410) Step 2
5411) Step 2
However, when I try to restart the processes from the checkpoint, the job hangs and nothing is printed.
$ mpiexec -ckpointlib blcr -ckpoint-prefix `pwd` -n 2 -ckpoint-num 1
pstree shows that the PMI proxy has started the application process:
├─sshd─┬─3*[sshd───sshd───bash]
│ ├─sshd───sshd───bash───mpiexec───hydra_pmi_proxy───mpiblcr
│ └─sshd───sshd───bash───pstree
and `ps aux` shows that the process is defunct:
$ ps aux | grep mpiblcr
test 15290 0.0 0.0 0 0 ? Z 21:44 0:00 [mpiblcr] <defunct>
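A minimal sketch of how the defunct process could be listed together with its parent PID, which may help confirm that the zombie belongs to hydra_pmi_proxy. It assumes a procps-style `ps` that accepts `-eo`; `filter_defunct` and `list_defunct` are hypothetical helper names, not MPICH or BLCR tools.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of MPICH or BLCR.

# filter_defunct: read "PID PPID STAT COMMAND" rows on stdin and keep
# only rows whose state starts with "Z" (zombie / defunct).
filter_defunct() {
    awk '$3 ~ /^Z/ { print $1, $2, $3, $4 }'
}

# list_defunct: feed the current process table into the filter.
# The trailing "=" on each column name suppresses the ps header line.
list_defunct() {
    ps -eo pid=,ppid=,stat=,comm= | filter_defunct
}

# Usage: run list_defunct, then compare the PPID column with the
# PID of hydra_pmi_proxy.
```

If the parent PID matches hydra_pmi_proxy, the restarted process has exited and the proxy has not yet reaped it, which would be consistent with the hang described above.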
I do not know how to diagnose this problem. It seems someone reported the same problem several years ago (#1144), but no solution was given.