[mpich-discuss] BLCR kernel module not present

Michael Bane michael.bane at manchester.ac.uk
Fri Apr 26 17:55:14 CDT 2013


Thanks for that Raghu

On 26 Apr 2013, at 21:56, Raghunath wrote:

> Michael,
> 
> The BLCR support in MVAPICH works fine as well.  MVAPICH implements
> its own Checkpoint-Restart mechanism for the CH3-IB and Nemesis-IB
> channels. The MPICH design for the Nemesis-TCP channel is left
> untouched, as Pavan indicated.
> 
> --
> Raghu
> 
> 
> On Fri, Apr 26, 2013 at 4:25 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>> 
>> I can't speak for mvapich, of course.  I'm only speaking about mpich.
>> However, most of our derivatives don't destroy the features that are in
>> stock mpich.  So I'd think it'll work fine with mvapich as well.
>> 
>> -- Pavan
>> 
>> On 04/26/2013 03:20 PM US Central Time, michael wrote:
>>> Thanks Pavan
>>> 
>>> Just to be clear, you're saying if I use mvapich with blcr then a
>>> running multi-node MPI job when killed (eg out of time) by a batch
>>> scheduler can be restarted (from which checkpoint?) presuming it doesn't
>>> have open files?
>>> 
>>> Many thanks, M
>>> 
>>> On Fri, 2013-04-26 at 15:12 -0500, Pavan Balaji wrote:
>>>> Michael,
>>>> 
>>>> BLCR support for mpich should work fine; if something is broken, please
>>>> let us know.
>>>> 
>>>> However, the core BLCR group itself hadn't released updates in a while,
>>>> primarily because they didn't have direct funding for it.  I believe
>>>> that's fixed now and they are working on newer releases.
>>>> 
>>>> -- Pavan
>>>> 
>>>> On 04/26/2013 03:07 PM US Central Time, michael wrote:
>>>>> Hi folks
>>>>> I was wondering what the state of BLCR support for mpich/mvapich is,
>>>>> e.g. how reliable can one presume it to be?
>>>>> Thanks, Michael
>>>>> 
>>>>> 
>>>>> On Fri, 2013-04-26 at 14:33 -0500, Wesley Bland wrote:
>>>>>> It looks like you might have missed installing the kernel module for
>>>>>> BLCR. What is the output of `lsmod`?
>>>>>> 
>>>>>> 
>>>>>> Alternatively, if you installed BLCR by using apt-get in Ubuntu, you
>>>>>> should be able to use dkms to manage your kernel modules
>>>>>> automatically. Make sure you have the package 'blcr-dkms' installed
>>>>>> (you should be able to check this by typing `dkms status`).
>>>>>> 
>>>>>> 
>>>>>> Do either of those solutions solve your issue?
>>>>>> 
>>>>>> 
>>>>>> Wesley
>>>>>> 
>>>>>> On Apr 26, 2013, at 1:55 PM, basma a.azeem
>>>>>> <basmaabdelazeem at hotmail.com> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Thank you for your help
>>>>>>> 
>>>>>>> 
>>>>>>> I installed BLCR 0.8.5 on my Ubuntu 12.10 to be used with MPICH 3.0.3.
>>>>>>> This version of BLCR should support kernels through 3.7.1.
>>>>>>> 
>>>>>>> When I run the command:
>>>>>>> basma at basma-Satellite-A500:~$ mpiexec --info
>>>>>>> 
>>>>>>> results:
>>>>>>> 
>>>>>>> HYDRA build details:
>>>>>>>    Version:                                 3.0.3
>>>>>>>    Release Date:                            Thu Mar 28 16:01:21 CDT 2013
>>>>>>>    CC:                              gcc
>>>>>>>    CXX:                             c++
>>>>>>>    F77:                             no
>>>>>>>    F90:                             no
>>>>>>>    Configure options:
>>>>>>> '--disable-option-checking' '--prefix=/home/basma/mpich2-install'
>>>>>>> '--disable-f77' '--disable-fc' '--enable-checkpointing'
>>>>>>> '--with-hydra-ckpointlib=blcr' '--cache-file=/dev/null' '--srcdir=.'
>>>>>>> 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS= ' 'LIBS=-lrt -lcr -lpthread '
>>>>>>> 'CPPFLAGS= -I/home/basma/libraries/mpich-3.0.3/src/mpl/include
>>>>>>> -I/home/basma/libraries/mpich-3.0.3/src/mpl/include
>>>>>>> -I/home/basma/libraries/mpich-3.0.3/src/openpa/src
>>>>>>> -I/home/basma/libraries/mpich-3.0.3/src/openpa/src
>>>>>>> -I/home/basma/libraries/mpich-3.0.3/src/mpi/romio/include'
>>>>>>>    Process Manager:                         pmi
>>>>>>>    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
>>>>>>>    Topology libraries available:            hwloc
>>>>>>>    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
>>>>>>>    Checkpointing libraries available:       blcr
>>>>>>>    Demux engines available:                 poll select
>>>>>>> 
>>>>>>> So I thought that everything was OK, but when I tried to run mpiexec it
>>>>>>> failed:
>>>>>>> 
>>>>>>> basma at basma-Satellite-A500:~$ mpiexec -ckpointlib blcr
>>>>>>> -ckpoint-prefix /home/business/ckpts/app.ckpoint -ckpoint-interval
>>>>>>> 3600  -n 4 /home/basma/libraries/mpich-3.0.3/examples/cpi
>>>>>>> 
>>>>>>> results:
>>>>>>> 
>>>>>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>>>>>> MPIR_Init_thread(433)...:
>>>>>>> MPID_Init(151)..........: channel initialization failed
>>>>>>> MPIDI_CH3_Init(70)......:
>>>>>>> MPID_nem_init(379)......:
>>>>>>> MPIDI_nem_ckpt_init(153): BLCR kernel module not present
>>>>>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>>>>>> MPIR_Init_thread(433)...:
>>>>>>> MPID_Init(151)..........: channel initialization failed
>>>>>>> MPIDI_CH3_Init(70)......:
>>>>>>> MPID_nem_init(379)......:
>>>>>>> MPIDI_nem_ckpt_init(153): BLCR kernel module not present
>>>>>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>>>>>> MPIR_Init_thread(433)...:
>>>>>>> MPID_Init(151)..........: channel initialization failed
>>>>>>> MPIDI_CH3_Init(70)......:
>>>>>>> MPID_nem_init(379)......:
>>>>>>> MPIDI_nem_ckpt_init(153): BLCR kernel module not present
>>>>>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>>>>>> MPIR_Init_thread(433)...:
>>>>>>> MPID_Init(151)..........: channel initialization failed
>>>>>>> MPIDI_CH3_Init(70)......:
>>>>>>> MPID_nem_init(379)......:
>>>>>>> MPIDI_nem_ckpt_init(153): BLCR kernel module not present
>>>>>>> 
>>>>>>> ===================================================================================
>>>>>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>>>>> =   EXIT CODE: 1
>>>>>>> =   CLEANING UP REMAINING PROCESSES
>>>>>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>>>>> ===================================================================================
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> I am a Linux and parallel programming beginner.
>>>>>>> 
>>>>>>> Thank you
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> discuss mailing list     discuss at mpich.org
>>>>>>> To manage subscription options or unsubscribe:
>>>>>>> https://lists.mpich.org/mailman/listinfo/discuss
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> --
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji
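
For anyone hitting the same "BLCR kernel module not present" error: the
`lsmod` check Wesley suggests in the thread above can be scripted roughly
as follows. This is only a sketch. It assumes the BLCR module appears in
`lsmod` under the name `blcr` (with `blcr_imports` as a separate helper
module) and that `modprobe blcr` is the right way to load it on your
system. The function name `blcr_module_loaded` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: reads `lsmod`-style text on stdin and succeeds only
# if a module named exactly "blcr" appears in the first column.
# Assumption: the helper module blcr_imports alone does not count.
blcr_module_loaded() {
    awk '$1 == "blcr" { found = 1 } END { exit !found }'
}

# Only run the live check where lsmod is actually available.
if command -v lsmod >/dev/null 2>&1; then
    if lsmod | blcr_module_loaded; then
        echo "blcr kernel module is loaded"
    else
        # Assumption: the module was installed (e.g. via blcr-dkms) so that
        # modprobe can find it.
        echo "blcr kernel module is NOT loaded; try: sudo modprobe blcr"
    fi
fi
```

If the module is reported as not loaded, `dkms status` should show whether
the module was built for the running kernel at all.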



