[mpich-discuss] BLCR restart makes processes hang

myself chcdlf at 126.com
Mon Aug 18 09:11:18 CDT 2014


Understood, thanks! I look forward to the release that includes the fault tolerance support.





At 2014-08-18 09:58:59, "Bland, Wesley B." <wbland at anl.gov> wrote:
Unfortunately, you’re correct that there is no solution yet. Until we fix that ticket, checkpointing does not work in MPICH. A fix is on the roadmap, along with some new fault tolerance features, but it’s not there yet.


Thanks,
Wesley



On Aug 18, 2014, at 8:51 AM, myself <chcdlf at 126.com> wrote:


I tried to use BLCR with MPICH 3, but it does not seem to work. I compiled BLCR on CentOS, and `make test` reported no failed tests. I then built MPICH with BLCR support. The build information is as follows:


$ mpichversion 
MPICH Version:    3.1.2
MPICH Release date:Mon Jul 21 16:00:21 CDT 2014
MPICH Device:    ch3:nemesis
MPICH configure: --prefix=/home/test/develop/mpich3-blcr --with-device=ch3:nemesis CFLAGS=-fPIC --enable-checkpointing --with-blcr=/home/test/develop/blcr-0.8.5 --with-hydra-ckpointlib=blcr
MPICH CC: gcc -fPIC   -O2
MPICH CXX: g++   -O2
MPICH F77: gfortran   -O2
MPICH FC: gfortran   -O2


After that, I compiled my application like this:


$ mpicc mpiblcr.c -o mpiblcr -lcr


When I first run the application, it appears to write the checkpoint files correctly, for example context-num0-0-0.


$ mpiexec -ckpointlib blcr -ckpoint-prefix `pwd` -ckpoint-interval 2 -n 2 ./mpiblcr
5411) Step 0
5410) Step 0
5410) Step 1
5411) Step 1
[proxy:0:0@node1] requesting checkpoint
[proxy:0:0@node1] checkpoint completed
5410) Step 2
5411) Step 2


However, when I try to restart the processes from the checkpoint, it hangs and prints nothing.


$ mpiexec -ckpointlib blcr -ckpoint-prefix `pwd` -n 2 -ckpoint-num 1


`pstree` shows that the PMI proxy has started the application process:


     ├─sshd─┬─3*[sshd───sshd───bash]
     │      ├─sshd───sshd───bash───mpiexec───hydra_pmi_proxy───mpiblcr
     │      └─sshd───sshd───bash───pstree


and `ps aux` shows that the process is defunct:


$ ps aux | grep osu_bw
test   15290  0.0  0.0      0     0 ?        Z    21:44   0:00 [mpiblcr] <defunct>
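

A defunct (zombie) process has already exited but has not yet been reaped by its parent, so it helps to see the zombie and its parent PID side by side. The following is a minimal sketch of one way to do that, assuming a Linux `ps` that accepts the `stat` and `comm` output fields. The function names `filter_zombies` and `list_zombies` are hypothetical helpers introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helpers (not part of MPICH or BLCR).

# Read rows of "PID PPID STAT COMMAND" on stdin and keep only the rows
# whose state field starts with Z (zombie/defunct).
# Prints: PID PPID COMMAND
filter_zombies() {
    awk '$3 ~ /^Z/ { print $1, $2, $4 }'
}

# List every zombie on this host with its parent PID.
# Assumption: ps supports -e and the pid, ppid, stat and comm fields;
# the trailing "=" suppresses the header for each column.
list_zombies() {
    ps -e -o pid= -o ppid= -o stat= -o comm= | filter_zombies
}
```

Running `list_zombies` while the restart is hung should show whether the parent of the defunct `mpiblcr` is `hydra_pmi_proxy`, as the `pstree` output above suggests.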


I don't know how to diagnose this problem. I also see that someone reported the same problem several years ago (ticket #1144), but no solution was posted.








