[mpich-discuss] Status of checkpointing mechanisms with Slurm + MPICH + BLCR

Wesley Bland wbland at anl.gov
Thu Oct 23 10:25:17 CDT 2014


Hi Manuel,

Unfortunately, the situation for checkpointing hasn’t changed much since that last email. Checkpointing has been a relatively low-priority item for us, since full-system checkpointing is used by so few people now (most have moved to application-centric checkpointing), and we have very limited bandwidth to figure out what needs to happen to support it again.

I’m not sure which version of MPICH was the last to work with checkpointing (probably something in the MPICH2-1.4 series, if I had to guess). You’re welcome to try it out, but there’s a possibility that you’ll run into issues running such an old version against recent versions of SLURM and BLCR.

It’s possible that we’ll get around to fixing checkpointing in the future, but there’s no timeline for that. AFAIK, pretty much all of the big MPI implementations have dropped support for it for the same reasons (see https://www.open-mpi.org/faq/?category=ft).
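
To illustrate what application-centric checkpointing means in general, here is a minimal sketch. It is a generic, single-process example in Python, not something MPICH provides: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the state layout, and the `stop_after` parameter used to simulate a killed job are all made up for illustration. In an MPI program, each rank would presumably write its own file in a similar way at a point where all ranks agree to checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Atomically write `state` (a JSON-serializable dict) to `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, checkpoint_every, stop_after=None):
    """Toy iterative computation that resumes from `path` if it exists.

    `stop_after` (hypothetical) simulates the job being killed after
    that many steps in the current run.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return state  # simulated interruption; unsaved steps are lost
        state["acc"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

The application decides what state matters and when to save it, so nothing here depends on the MPI library or the kernel. The cost is that the restart logic has to be written into the application itself.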

Thanks,
Wesley

> On Oct 23, 2014, at 10:18 AM, Manuel Rodríguez Pascual <manuel.rodriguez.pascual at gmail.com> wrote:
> 
> Good afternoon all,
> 
> I am a newbie in the MPICH world. I am trying to set up a cluster with MPICH that is able to checkpoint parallel tasks.
> 
> My original idea was a software stack based on SLURM 14.03.8 + MPICH 3.1.3 + BLCR 0.8.5. These are supposed to integrate well with one another, and the configuration process has been quite smooth so far.
> 
> I have found, however, that checkpointing MPICH tasks does not work. At first I thought it was my fault (configuration issues or similar), because the MPICH wiki says that BLCR integration is possible: https://wiki.mpich.org/mpich/index.php/Checkpointing
> 
> However, while looking for a solution I found this thread on this same mailing list:
> http://lists.mpich.org/pipermail/discuss/2014-April/002498.html
> 
> saying "BLCR checkpointing hasn't worked for a few versions now. It's something we're working to fix in a future version".
> 
> My questions are, then:
> 
> -Is it possible right now to checkpoint MPICH with BLCR?
> 
> -If not, is there any working checkpoint mechanism that you can suggest?
> 
> -If not, are you aware of a previous MPICH version where BLCR does work? Are there any drawbacks to using it while the new one is being fixed? (Is the new one being fixed?)
> 
>  Thanks for your attention. Best regards,
> 
> 
> 
> Manuel
> 
> 
> -- 
> Dr. Manuel Rodríguez-Pascual
> skype: manuel.rodriguez.pascual
> phone: (+34) 913466173 // (+34) 679925108
>  
> CIEMAT-Moncloa
> Edificio 22, desp. 1.25
> Avenida Complutense, 40 
> 28040- MADRID
> SPAIN
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
