[mpich-discuss] Checkpoint/Restart problem

Husen R hus3nr at gmail.com
Mon Apr 4 21:08:09 CDT 2016


Dear Pavan,

Thank you for your reply and information.
I know you are one of the authorities on MPICH, as your name appears in
the userguide. In my previous email, I just wanted to make sure.
Unfortunately, I really need this feature. Do you have any suggestions
besides SCR? As far as I know, SCR requires modifications to the code.

Thank you in advance.

Regards,

Husen

On Mon, Apr 4, 2016 at 11:54 PM, Balaji, Pavan <balaji at anl.gov> wrote:

>
> I'm sure about the fact that MPICH doesn't support BLCR anymore.  :-)
> I'll make sure it's removed in future versions of the userguide.
> Thanks for pointing it out.
>
> My comment about BLCR kernel support was based on my experience from about
> a year ago, so things might have changed since then.
>
>   -- Pavan
>
> > On Apr 3, 2016, at 8:36 PM, Husen R <hus3nr at gmail.com> wrote:
> >
> > Hello Pavan,
> >
> > Thank you for your reply.
> >
> > Are you sure about that?
> > I have downloaded MPICH-3.2 (the latest version). According to the
> > userguide, MPICH supports checkpoint/rollback fault tolerance, and
> > currently only the BLCR checkpointing library is supported. This
> > information is in chapter 8 of the userguide:
> > https://www.mpich.org/static/downloads/3.2/mpich-3.2-userguide.pdf
> >
> > In addition, I have also downloaded the latest version of BLCR, which
> > supports recent Linux kernels.
> >
> > -Husen
> >
> > On Sun, Apr 3, 2016 at 10:20 AM, Balaji, Pavan <balaji at anl.gov> wrote:
> > Hello,
> >
> > BLCR checkpointing is no longer supported in MPICH.  The BLCR kernel
> > module hasn't been updated for recent Linux kernels, and most folks have
> > moved to alternate checkpointing infrastructure such as FTI or SCR.  You
> > might want to consider doing that as well.
> >
> >   -- Pavan
> >
> > > On Apr 1, 2016, at 9:51 PM, Husen R <hus3nr at gmail.com> wrote:
> > >
> > > Dear all,
> > >
> > > Could anyone please tell me how to manually checkpoint mpiexec?
> > > I have followed the instructions at this link:
> > > https://wiki.mpich.org/mpich/index.php/Checkpointing
> > > I used the following command to send SIGUSR1, but nothing happened.
> > >
> > > kill -SIGUSR1 [pid of mpiexec]
> > >
> > > Thank you in advance.
> > >
> > > Regards,
> > >
> > >
> > > Husen
> > > _______________________________________________
> > > discuss mailing list     discuss at mpich.org
> > > To manage subscription options or unsubscribe:
> > > https://lists.mpich.org/mailman/listinfo/discuss