[mpich-discuss] BLCR kernel module not present

michael michael.bane at manchester.ac.uk
Fri Apr 26 15:39:43 CDT 2013


Many thanks, Pavan
I shall have to try... I'd previously been led to believe that there was
no way of checkpointing an MPI code under, say, SGE.
Yours, Michael

On Fri, 2013-04-26 at 15:25 -0500, Pavan Balaji wrote:

> I can't speak for mvapich, of course.  I'm only speaking about mpich.
> However, most of our derivatives don't destroy the features that are in
> stock mpich.  So I'd think it'll work fine with mvapich as well.
> 
>  -- Pavan
> 
> On 04/26/2013 03:20 PM US Central Time, michael wrote:
> > Thanks Pavan
> > 
> > Just to be clear, you're saying that if I use mvapich with BLCR, a
> > running multi-node MPI job that is killed by a batch scheduler (e.g.,
> > because it ran out of time) can be restarted (from which checkpoint?),
> > provided it doesn't have open files?
> > 
> > Many thanks, M
> > 
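Michael's "(from which checkpoint?)" question is left open in the thread. As a rough illustration only, assume that checkpoints are written every `-ckpoint-interval` seconds after launch (3600 in basma's command further down), that each one completes instantly, and that a restart resumes from the most recent one. Under those assumptions, the work lost to a scheduler kill is simply the remainder of the elapsed time. This is a simplified model, not a description of how Hydra actually schedules or restores checkpoints, and `checkpoints_before_kill` is a hypothetical helper.

```python
from __future__ import annotations


def checkpoints_before_kill(elapsed_s: int, interval_s: int) -> tuple[int, int]:
    """Return (checkpoints written, seconds of work lost) for a job killed at elapsed_s.

    Hypothetical model: a checkpoint is written at every multiple of interval_s
    after launch, completes instantly, and a restart resumes from the latest one.
    With zero checkpoints written, the job has to restart from the beginning.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if elapsed_s < 0:
        raise ValueError("elapsed_s must be non-negative")
    written = elapsed_s // interval_s          # checkpoints completed so far
    lost = elapsed_s - written * interval_s    # work done since the last checkpoint
    return written, lost


if __name__ == "__main__":
    # A job using -ckpoint-interval 3600 that the scheduler kills after 10000 s.
    print(checkpoints_before_kill(10000, 3600))  # (2, 2800)
```

Under this model, the worst case loses almost a full interval, which is one reason to choose an interval well below the batch wall-time limit.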
> > On Fri, 2013-04-26 at 15:12 -0500, Pavan Balaji wrote:
> >> Michael,
> >>
> >> BLCR support for mpich should work fine; if something is broken, please
> >> let us know.
> >>
> >> However, the core BLCR group itself hadn't released updates in a while,
> >> primarily because they didn't have direct funding for it.  I believe
> >> that's fixed now and they are working on newer releases.
> >>
> >>  -- Pavan
> >>
> >> On 04/26/2013 03:07 PM US Central Time, michael wrote:
> >> > Hi folks
> > >> > I was wondering what the state of BLCR for mpich/mvapich is, e.g.,
> > >> > how reliable can one presume it to be?
> >> > Thanks, Michael
> >> > 
> >> > 
> >> > On Fri, 2013-04-26 at 14:33 -0500, Wesley Bland wrote:
> >> >> It looks like you might have missed installing the kernel module for
> >> >> BLCR. What is the output of `lsmod`? 
> >> >>
> >> >>
> > >> >> Alternatively, if you installed BLCR using apt-get on Ubuntu, you
> > >> >> should be able to use dkms to manage your kernel modules
> > >> >> automatically. Make sure you have the package 'blcr-dkms' installed
> > >> >> (you should be able to check this by typing `dkms status`).
> >> >>
> >> >>
> >> >> Do either of those solutions solve your issue? 
> >> >>
> >> >>
> >> >> Wesley 
> >> >>
> >> >> On Apr 26, 2013, at 1:55 PM, basma a.azeem
> > >> >> <basmaabdelazeem at hotmail.com> wrote: 
> >> >>
> >> >>>
> >> >>>
> >> >>> Thank you for your help
> >> >>>
> >> >>>
> > >> >>> I installed BLCR 0.8.5 on my Ubuntu 12.10 to be used with MPICH 3.0.3;
> > >> >>> this version of BLCR should support kernels through 3.7.1.
> >> >>>
> >> >>> when i run the command :
> > >> >>> basma@basma-Satellite-A500:~$ mpiexec --info
> >> >>>
> >> >>> results:
> >> >>>
> >> >>> HYDRA build details:
> >> >>>     Version:                                 3.0.3
> >> >>>     Release Date:                            Thu Mar 28 16:01:21 CDT 2013
> >> >>>     CC:                              gcc    
> >> >>>     CXX:                             c++    
> >> >>>     F77:                             no   
> >> >>>     F90:                             no   
> >> >>>     Configure options:                      
> >> >>> '--disable-option-checking' '--prefix=/home/basma/mpich2-install'
> >> >>> '--disable-f77' '--disable-fc' '--enable-checkpointing'
> >> >>> '--with-hydra-ckpointlib=blcr' '--cache-file=/dev/null' '--srcdir=.'
> >> >>> 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS= ' 'LIBS=-lrt -lcr -lpthread '
> >> >>> 'CPPFLAGS= -I/home/basma/libraries/mpich-3.0.3/src/mpl/include
> >> >>> -I/home/basma/libraries/mpich-3.0.3/src/mpl/include
> >> >>> -I/home/basma/libraries/mpich-3.0.3/src/openpa/src
> >> >>> -I/home/basma/libraries/mpich-3.0.3/src/openpa/src
> >> >>> -I/home/basma/libraries/mpich-3.0.3/src/mpi/romio/include'
> >> >>>     Process Manager:                         pmi
> > >> >>>     Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
> > >> >>>     Topology libraries available:            hwloc
> > >> >>>     Resource management kernels available:   user slurm ll lsf sge pbs cobalt
> >> >>>     Checkpointing libraries available:       blcr
> >> >>>     Demux engines available:                 poll select
> >> >>>
> > >> >>> So I thought that everything was OK, but when I tried to run mpiexec,
> > >> >>> it failed:
> >> >>>
> > >> >>> basma@basma-Satellite-A500:~$ mpiexec -ckpointlib blcr \
> > >> >>>     -ckpoint-prefix /home/business/ckpts/app.ckpoint \
> > >> >>>     -ckpoint-interval 3600 -n 4 /home/basma/libraries/mpich-3.0.3/examples/cpi
> >> >>>
> >> >>> results:
> >> >>>
> >> >>> Fatal error in MPI_Init: Other MPI error, error stack:
> >> >>> MPIR_Init_thread(433)...: 
> >> >>> MPID_Init(151)..........: channel initialization failed
> >> >>> MPIDI_CH3_Init(70)......: 
> >> >>> MPID_nem_init(379)......: 
> >> >>> MPIDI_nem_ckpt_init(153): BLCR kernel module not present
> >> >>> Fatal error in MPI_Init: Other MPI error, error stack:
> >> >>> MPIR_Init_thread(433)...: 
> >> >>> MPID_Init(151)..........: channel initialization failed
> >> >>> MPIDI_CH3_Init(70)......: 
> >> >>> MPID_nem_init(379)......: 
> >> >>> MPIDI_nem_ckpt_init(153): BLCR kernel module not present
> >> >>> Fatal error in MPI_Init: Other MPI error, error stack:
> >> >>> MPIR_Init_thread(433)...: 
> >> >>> MPID_Init(151)..........: channel initialization failed
> >> >>> MPIDI_CH3_Init(70)......: 
> >> >>> MPID_nem_init(379)......: 
> >> >>> MPIDI_nem_ckpt_init(153): BLCR kernel module not present
> >> >>> Fatal error in MPI_Init: Other MPI error, error stack:
> >> >>> MPIR_Init_thread(433)...: 
> >> >>> MPID_Init(151)..........: channel initialization failed
> >> >>> MPIDI_CH3_Init(70)......: 
> >> >>> MPID_nem_init(379)......: 
> >> >>> MPIDI_nem_ckpt_init(153): BLCR kernel module not present
> >> >>>
> >> >>> ===================================================================================
> >> >>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> >> >>> =   EXIT CODE: 1
> >> >>> =   CLEANING UP REMAINING PROCESSES
> >> >>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> >> >>> ===================================================================================
> >> >>>
> >> >>>
> >> >>>  
> > >> >>> I am a Linux and parallel programming beginner.
> >> >>>
> >> >>> Thank you
> >> >>>
> >> >>>
> >> >>>
> >> >>> _______________________________________________
> > >> >>> discuss mailing list     discuss at mpich.org
> >> >>> To manage subscription options or unsubscribe:
> >> >>> https://lists.mpich.org/mailman/listinfo/discuss 
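The `mpiexec --info` output above lists `blcr` under "Checkpointing libraries available", but that section is headed "HYDRA build details": it reflects how Hydra was configured (note `--with-hydra-ckpointlib=blcr`), not whether the BLCR kernel module is loaded when the job starts. That is why the run can still fail. The sketch below separates the two checks; `checkpoint_libraries` and `blcr_module_missing` are hypothetical helpers that read the build-time list from `--info` text and scan the ranks' error output for the run-time message seen in this thread.

```python
from __future__ import annotations

INFO_KEY = "Checkpointing libraries available:"
RUNTIME_ERROR = "BLCR kernel module not present"


def checkpoint_libraries(info_text: str) -> list[str]:
    """Return the libraries on the 'Checkpointing libraries available:' line.

    This is build-time information from `mpiexec --info`; it says nothing
    about whether the BLCR kernel module is loaded on the compute node.
    """
    for line in info_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(INFO_KEY):
            return stripped[len(INFO_KEY):].split()
    return []


def blcr_module_missing(stderr_text: str) -> bool:
    """True if any rank reported the MPI_Init error seen in this thread."""
    return RUNTIME_ERROR in stderr_text
```

Fed the output from this thread, `checkpoint_libraries` returns `['blcr']` while `blcr_module_missing` is `True`, which is exactly the mismatch basma ran into.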
> >> >>
> >> >>
> >> > 
> >> > 
> >> > 
> >>
> > 
> > 
> > 
> 