[mpich-discuss] BLCR kernel module not present
Michael Bane
michael.bane at manchester.ac.uk
Fri Apr 26 17:55:14 CDT 2013
Thanks for that Raghu
On 26 Apr 2013, at 21:56, Raghunath wrote:
> Michael,
>
> The BLCR support in MVAPICH works fine as well. MVAPICH implements
> its own Checkpoint-Restart mechanism for the CH3-IB and Nemesis-IB
> channels. The MPICH design for the Nemesis-TCP channel is left
> untouched, as Pavan indicated.
>
> --
> Raghu
>
>
> On Fri, Apr 26, 2013 at 4:25 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>>
>> I can't speak for mvapich, of course. I'm only speaking about mpich.
>> However, most of our derivatives don't destroy the features that are in
>> stock mpich. So I'd think it'll work fine with mvapich as well.
>>
>> -- Pavan
>>
>> On 04/26/2013 03:20 PM US Central Time, michael wrote:
>>> Thanks Pavan
>>>
>>> Just to be clear, you're saying if I use mvapich with blcr then a
>>> running multi-node MPI job when killed (eg out of time) by a batch
>>> scheduler can be restarted (from which checkpoint?) presuming it doesn't
>>> have open files?
>>>
>>> Many thanks, M
>>>
>>> On Fri, 2013-04-26 at 15:12 -0500, Pavan Balaji wrote:
>>>> Michael,
>>>>
>>>> BLCR support for mpich should work fine; if something is broken, please
>>>> let us know.
>>>>
>>>> However, the core BLCR group itself hadn't released updates in a while,
>>>> primarily because they didn't have direct funding for it. I believe
>>>> that's fixed now and they are working on newer releases.
>>>>
>>>> -- Pavan
>>>>
>>>> On 04/26/2013 03:07 PM US Central Time, michael wrote:
>>>>> Hi folks
>>>>> I was wondering what the state of BLCR support for mpich/mvapich is,
>>>>> e.g. how reliable can one presume it to be?
>>>>> Thanks, Michael
>>>>>
>>>>>
>>>>> On Fri, 2013-04-26 at 14:33 -0500, Wesley Bland wrote:
>>>>>> It looks like you might have missed installing the kernel module for
>>>>>> BLCR. What is the output of `lsmod`?
>>>>>>
>>>>>>
>>>>>> Alternatively, if you installed BLCR by using apt-get in Ubuntu, you
>>>>>> should be able to use dkms to manage your kernel modules
>>>>>> automatically. Make sure you have the package 'blcr-dkms' installed
>>>>>> (you should be able to check this by typing `dkms status`).
>>>>>>
>>>>>>
>>>>>> Do either of those solutions solve your issue?
>>>>>>
>>>>>>
>>>>>> Wesley
>>>>>>
>>>>>> On Apr 26, 2013, at 1:55 PM, basma a.azeem
>>>>>> <basmaabdelazeem at hotmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thank you for your help
>>>>>>>
>>>>>>>
>>>>>>> I installed BLCR 0.8.5 on my Ubuntu 12.10 system to be used with MPICH 3.0.3.
>>>>>>> This version of BLCR should support kernels through 3.7.1.
>>>>>>>
>>>>>>> When I run the command:
>>>>>>> basma at basma-Satellite-A500:~$ mpiexec --info
>>>>>>>
>>>>>>> results:
>>>>>>>
>>>>>>> HYDRA build details:
>>>>>>> Version: 3.0.3
>>>>>>> Release Date: Thu Mar 28 16:01:21 CDT 2013
>>>>>>> CC: gcc
>>>>>>> CXX: c++
>>>>>>> F77: no
>>>>>>> F90: no
>>>>>>> Configure options:
>>>>>>> '--disable-option-checking' '--prefix=/home/basma/mpich2-install'
>>>>>>> '--disable-f77' '--disable-fc' '--enable-checkpointing'
>>>>>>> '--with-hydra-ckpointlib=blcr' '--cache-file=/dev/null' '--srcdir=.'
>>>>>>> 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS= ' 'LIBS=-lrt -lcr -lpthread '
>>>>>>> 'CPPFLAGS= -I/home/basma/libraries/mpich-3.0.3/src/mpl/include
>>>>>>> -I/home/basma/libraries/mpich-3.0.3/src/mpl/include
>>>>>>> -I/home/basma/libraries/mpich-3.0.3/src/openpa/src
>>>>>>> -I/home/basma/libraries/mpich-3.0.3/src/openpa/src
>>>>>>> -I/home/basma/libraries/mpich-3.0.3/src/mpi/romio/include'
>>>>>>> Process Manager: pmi
>>>>>>> Launchers available: ssh rsh fork slurm ll
>>>>>>> lsf sge manual persist
>>>>>>> Topology libraries available: hwloc
>>>>>>> Resource management kernels available: user slurm ll lsf sge
>>>>>>> pbs cobalt
>>>>>>> Checkpointing libraries available: blcr
>>>>>>> Demux engines available: poll select
>>>>>>>
>>>>>>> So I thought that everything was OK, but when I tried to run mpiexec it
>>>>>>> failed:
>>>>>>>
>>>>>>> basma at basma-Satellite-A500:~$ mpiexec -ckpointlib blcr
>>>>>>> -ckpoint-prefix /home/business/ckpts/app.ckpoint -ckpoint-interval
>>>>>>> 3600 -n 4 /home/basma/libraries/mpich-3.0.3/examples/cpi
>>>>>>>
>>>>>>> results:
>>>>>>>
>>>>>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>>>>>> MPIR_Init_thread(433)...:
>>>>>>> MPID_Init(151)..........: channel initialization failed
>>>>>>> MPIDI_CH3_Init(70)......:
>>>>>>> MPID_nem_init(379)......:
>>>>>>> MPIDI_nem_ckpt_init(153): BLCR kernel module not present
>>>>>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>>>>>> MPIR_Init_thread(433)...:
>>>>>>> MPID_Init(151)..........: channel initialization failed
>>>>>>> MPIDI_CH3_Init(70)......:
>>>>>>> MPID_nem_init(379)......:
>>>>>>> MPIDI_nem_ckpt_init(153): BLCR kernel module not present
>>>>>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>>>>>> MPIR_Init_thread(433)...:
>>>>>>> MPID_Init(151)..........: channel initialization failed
>>>>>>> MPIDI_CH3_Init(70)......:
>>>>>>> MPID_nem_init(379)......:
>>>>>>> MPIDI_nem_ckpt_init(153): BLCR kernel module not present
>>>>>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>>>>>> MPIR_Init_thread(433)...:
>>>>>>> MPID_Init(151)..........: channel initialization failed
>>>>>>> MPIDI_CH3_Init(70)......:
>>>>>>> MPID_nem_init(379)......:
>>>>>>> MPIDI_nem_ckpt_init(153): BLCR kernel module not present
>>>>>>>
>>>>>>> ===================================================================================
>>>>>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>>>>> = EXIT CODE: 1
>>>>>>> = CLEANING UP REMAINING PROCESSES
>>>>>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>>>>> ===================================================================================
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I am a beginner with Linux and parallel programming.
>>>>>>>
>>>>>>> Thank you
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> discuss mailing list discuss at mpich.org <mailto:discuss at mpich.org> <mailto:discuss at mpich.org>
>>>>>>> To manage subscription options or unsubscribe:
>>>>>>> https://lists.mpich.org/mailman/listinfo/discuss
>>
>> --
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji
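
The `lsmod` check Wesley suggests in the quoted thread can be sketched
as a small shell function. This is only an illustration: the function
name `blcr_loaded` is made up, and it assumes the main BLCR kernel
module is listed under the name `blcr`, which may differ on some
installs.

```shell
#!/bin/sh
# Sketch: succeed (exit status 0) if lsmod-style text on stdin lists a
# module whose name is exactly "blcr". The module name is an assumption
# about a typical BLCR install; the function name is hypothetical.
blcr_loaded() {
    # lsmod prints one module per line with the module name in column 1.
    awk '$1 == "blcr" { found = 1 } END { exit (found ? 0 : 1) }'
}

# Possible use on a live system (not executed here):
#   lsmod | blcr_loaded && echo "blcr module present" \
#                       || echo "blcr module missing"
```

If the function reports the module as missing, that would be consistent
with the "BLCR kernel module not present" error in the quoted output,
and the module would need to be built and loaded before retrying
mpiexec.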