Some time ago, I discovered that the version of the kernel that Amazon uses for its current infrastructure (Linux 2.6.16) contains a bug in the LVM modules. This was a bummer to see, since we are using LVM snapshotting facilities to realize sub-second database backups. The bug was only triggered under specific load conditions, but when we're talking database backups, nobody likes getting kernel panics at the most inappropriate times. The interesting thing is that this particular bug has been fixed for a while now in newer kernel versions, but we (i.e., all EC2 users) cannot benefit from these kernel sub-release patches since we depend on the kernel version that Amazon installs in all instances.

After some research it became clear that in order to successfully use our fast snapshotting facilities on EC2 (or, for that matter, for anybody to use LVM-related tools on EC2), patching the LVM kernel modules was a requirement.

The first thing to do was to find out the exact version of the kernel that Amazon is installing, the sub-release version of the kernel that contains the required patch, and the version of Xen that Amazon uses to patch their kernel. At the time of writing, Amazon's kernel is based on a vanilla 2.6.16 kernel, patched with an unknown version of Xen (at least I couldn't find out which version it is; perhaps it's customized by Amazon). It turns out that the fix for the LVM bug I was triggering was applied in 2.6.16.12. Therefore the task was to recompile the kernel modules for a Xen-patched kernel of version >= 2.6.16.12. There doesn't seem to be much information out there on how to do this, so at first I feared it might be an ugly or esoteric process, but fortunately it turns out to be quite simple!
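As a quick reference point, the base version is easy to confirm from any running instance (the "-xenU" suffix denotes the Xen unprivileged-guest build):

uname -r
2.6.16-xenU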

The next paragraphs describe the rationale and the steps for recompiling kernel modules that are ready to be used on EC2 instances.

Preparing the sources and compiler setup

The first thing to know is that kernel modules must be compiled with the same gcc version as the kernel they will run on. Since Amazon originally compiled the kernel, we need to determine which gcc version they used. Luckily, that is a simple task since this information is saved in the compiled modules, so we can find out by issuing the following command on an unmodified running EC2 instance:

[root@ src]# modinfo dm_mod
filename:       /lib/modules/2.6.16-xenU/kernel/drivers/md/dm-mod.ko
license:        GPL
author:         Joe Thornber <dm-devel@redhat.com>
description:    device-mapper driver
depends:        
vermagic:       2.6.16-xenU SMP 686 gcc-4.0
parm:           major:The major number of the device mapper (uint)

It turns out that the kernel (and its modules) were compiled with gcc-4.0 for the 686 architecture. Now we must bring up an instance that has that version of gcc installed. In my case I believe I booted Amazon's developer image (ami-26b6534f), but you can pick any other that comes with gcc 4.0.
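Before going any further, it's worth a quick sanity check that the build instance really does ship the expected compiler (the exact sub-release doesn't matter, but the major version should read 4.0):

gcc --version | head -1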

Once the instance with the right compiler is up, we need to copy the kernel sources and patches. The Amazon kernel sources (patched with Xen) can be found at http://s3.amazonaws.com/ec2-downloads/linux-2.6.16-ec2.tgz and patches for a given sub-release version can be found at http://www.kernel.org/pub/linux/kernel/v2.6/.

Our latest CentOS RightImages already provide an untarred copy of the Amazon kernel sources in /usr/src/linux-2.6.16-xenU, so there’s no need to download it. Therefore, the only thing I needed to download was the latest existing linux patch, which happened to be 2.6.16.53.
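Fetching the patch onto the instance is a one-liner; a sketch, assuming the instance has outbound HTTP access and wget installed:

cd /tmp
wget http://www.kernel.org/pub/linux/kernel/v2.6/patch-2.6.16.53.bz2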

Once we have these files on the ec2 instance, we are ready to configure and patch the kernel, and then recompile the modules.

Configuring, patching and compiling the new kernel and modules

To configure the kernel, we can use the built-in config facility of the running Amazon kernel. For that, simply uncompress the original Amazon sources and construct the “.config” file from the instance’s /proc filesystem. For example:

cd /usr/src/linux-2.6.16-xenU/
gunzip < /proc/config.gz > .config

Then, apply the latest kernel patch on top of that. The tricky part here is that we'll be applying a patch prepared for the vanilla kernel on top of a Xen-modified tree, so it will likely produce conflicts when applied as is. While the patch I applied didn't result in any conflict I couldn't easily resolve, this might not always be the case. If you know what you are doing and the extent of the code you want to fix (or upgrade), you can just patch the affected files (usually only modules) and forget about any core kernel fixes. Remember that any kernel upgrades/fixes outside a loadable module won't be visible anyway, since Amazon will always replace the kernel of an instance before booting.

For example, applying the complete patch to the Amazon kernel looks something like this:

bzip2 -d /tmp/patch-2.6.16.53.bz2
cd /usr/src/linux-2.6.16-xenU/
patch  -p1 < /tmp/patch-2.6.16.53  
find . -name '*.rej'
./arch/x86_64/ia32/Makefile.rej
./arch/i386/kernel/vm86.c.rej
./net/core/skbuff.c.rej
./Makefile.rej
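Alternatively, if you'd rather not touch anything outside the device-mapper code at all, the patch can be narrowed down before applying it. A minimal sketch of that approach, assuming the filterdiff tool from the patchutils package is installed (this is not what I did above, where I applied the full patch and resolved the conflicts by hand):

bzip2 -d /tmp/patch-2.6.16.53.bz2
cd /usr/src/linux-2.6.16-xenU/
# keep only the hunks that touch drivers/md (the dm/LVM modules) and apply those
filterdiff -p1 -i 'drivers/md/*' /tmp/patch-2.6.16.53 | patch -p1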

Once the conflicts are resolved, we're ready to compile and install:

make
make modules_install

If something broke, go back to the conflicts and fix whatever is broken. Once it all compiles, you should have the brand new modules installed in the "/lib/modules/2.6.16-xenU" directory! At that point you can take them for a spin and see whether they load correctly. In the case of LVM, we can unload the existing 'dm' modules first (if any were loaded) and then load our new ones. If they load correctly, we have brand new, bug-fixed kernel modules at work for us.
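A minimal smoke test at this point could look like the following, assuming no LVM volumes are actually in use on the build instance (otherwise the unload will fail):

lsmod | grep '^dm_'      # see which device-mapper modules are loaded, if any
modprobe -r dm_mod       # unload the old module (only works if nothing uses it)
modprobe dm_mod          # load the freshly installed one
dmesg | tail             # check that it registered without complaints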

Packaging the new modules to use for any future ec2 instance

The next step is to take these newly compiled modules and package them properly so we can use them in any EC2 instance we wish. In my case, I used our RightScript infrastructure, which allowed me to upgrade any of the templates that use LVM tools within minutes.

All I had to do was package the kernel modules in a .tgz file and attach it to a new boot RightScript. This boot RightScript installs (i.e., replaces) the modules upon boot, removes any pre-loaded 'dm' modules, and loads the newly installed ones. Here is the complete script:

#!/bin/bash -e
# Copyright (c) 2007 by RightScale Inc., all rights reserved

# First upgrade the kernel modules with some lvm fixes
# Try to unload the dm modules if any are loaded (hopefully none will be in use)
echo "Unloading DM modules:" 
for m in `cat /proc/modules | grep ^dm_ | cut -d' ' -f1`; do echo $m; modprobe -q -r $m; done

echo "Installing new/custom kernel modules..." 
(cd /lib/modules/ && tar xzf $ATTACH_DIR/modules-2.6.16.53-xenU.tgz )
echo "Loading the device mapper driver..." 
modprobe dm_mod

If you are not familiar with our scripts: any attachment uploaded to the web site is automatically sent to the booting instance, and the ATTACH_DIR environment variable is set to point at that temporary directory so that RightScripts can locate it. In this case, only a single .tgz file containing the modules was attached.
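For completeness, the attached tarball itself is nothing special: it just has to unpack relative to /lib/modules, matching what the script above expects. A sketch of producing it on the build instance (here I only pack the device-mapper modules; packing the whole 2.6.16-xenU tree works the same way):

cd /lib/modules
tar czf /tmp/modules-2.6.16.53-xenU.tgz 2.6.16-xenU/kernel/drivers/md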

Now that I have this RightScript (I called it "upgrade LVM kernel modules"), I can seamlessly patch all my server templates by adding it to their list of boot scripts. Voila! Without any other changes, I've ensured that the next time any of these templates is instantiated, it will use the latest kernel modules, with all the nice enhancements and bug fixes that come with them. My database backups are a lot happier now without kernel panics!