Software Freedom Law Center

Bradley M. Kuhn's Technology Blog


January 24, 2008

When your apt-mirror is always downloading

When I started building our apt-mirror, I ran into a problem: ubuntu.com's servers were throttling the machine, and even though I had completed much of the download (which took weeks across multiple distributions), it still wasn't finished. I really wanted to roll out the solution quickly, particularly because the throttling that the mirroring triggered made service from the remote servers worse than ever. But with the mirror incomplete, I couldn't simply point users at repositories that were missing files.

The solution was simply to let Apache send users on to the real servers whenever the mirror doesn't have a file. The first order of business is to rewrite and redirect URLs when files aren't found locally, which is a straightforward bit of Apache configuration:

   RewriteEngine on
   RewriteLogLevel 0
   RewriteCond %{REQUEST_FILENAME} !^/cgi/
   RewriteCond /var/spool/apt-mirror/mirror/archive.ubuntu.com%{REQUEST_FILENAME} !-F
   RewriteCond /var/spool/apt-mirror/mirror/archive.ubuntu.com%{REQUEST_FILENAME} !-d
   RewriteCond %{REQUEST_URI} !(Packages|Sources)\.bz2$
   RewriteCond %{REQUEST_URI} !/index\.[^/]*$ [NC]
   RewriteRule ^(http://%{HTTP_HOST})?/(.*) http://91.189.88.45/$2 [P]
 

Note a few things there:

  • I have to hard-code an IP number, because as I mentioned in the last post on this subject, I've faked out DNS for archive.ubuntu.com and other sites I'm mirroring. (Note: this has the unfortunate side-effect that I can't easily take advantage of round-robin DNS on the other side.)

  • I avoid taking Packages.bz2 from the other site, because apt-mirror actually doesn't mirror the bz2 files (although I've submitted a patch to it so it will eventually).

  • I make sure that index files get built by my Apache and not redirected.

  • I am using Apache proxying, which gives me Yet Another type of cache temporarily while I'm still downloading the other packages. (I should actually work out a way to have these caches used by apt-mirror itself in case a user has already requested a new package while waiting for apt-mirror to get it.)
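A quick sanity check of that fallback is to request, from a VPN client, a file that apt-mirror hasn't fetched yet and watch the response headers (the package path below is only an illustration):

# archive.ubuntu.com resolves to the mirror on the VPN; -S prints the response
# headers, so a 200 here means the proxy fallback kicked in even though the
# local mirror doesn't have the file yet.
wget -S http://archive.ubuntu.com/ubuntu/pool/main/h/hello/hello_2.2-1_i386.deb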

Once I do a rewrite like this for each of the hosts I'm replacing with a mirror, I'm almost done. The problem is that if for any reason my site needs to give a 403 to the clients, I'd actually like to double-check first that the URL doesn't happen to work at the place I'm mirroring from.

My hope was that I could write a RewriteRule keyed on the HTTP return code the request would end up with, but that seemed really hard, perhaps undoable. The quickest solution I found was to write a CGI script to do the redirect. So, in the Apache config I have:

ErrorDocument 403 /cgi/redirect-forbidden.cgi

And, the CGI script looks like this:

#!/usr/bin/perl
# Redirect a request that our mirror refused back to the real upstream server.
# Apache sets REDIRECT_SCRIPT_URI when this script runs as an ErrorDocument.

use strict;
use CGI qw(:standard);

my $val = $ENV{REDIRECT_SCRIPT_URI};

# Strip off the local mirror hostname, keeping only the path; $1 remembers
# which mirror host (e.g. "ubuntu-security") the request came in on.
$val =~ s%^http://(\S+)\.sflc\.info(/.*)$%$2%;
if ($1 eq "ubuntu-security") {
   $val = "http://91.189.88.37$val";   # security.ubuntu.com
} else {
   $val = "http://91.189.88.45$val";   # archive.ubuntu.com
}

print redirect($val);

With these changes, users are redirected to the original servers when files aren't available on the mirror, and as the mirror becomes more complete, they'll get more and more files from the mirror.

I still have problems if for any reason the user gets a Packages or Sources file from the original site before the mirror is synchronized, but this rarely happens since apt-mirror is pretty careful. The only time it might happen is if the user did an apt-get update when not connected to our VPN and only a short time later did one while connected.
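That window stays small as long as apt-mirror itself runs regularly during off hours; a nightly cron entry is enough (a sketch, using what I believe are the Debian package's default paths and user):

# /etc/cron.d/apt-mirror
0 4 * * *  apt-mirror  /usr/bin/apt-mirror > /var/spool/apt-mirror/var/cron.log 2>&1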

Posted by Bradley M. Kuhn on January 24, 2008

January 16, 2008

apt-mirror and Other Caching for Debian/Ubuntu Repositories

Working for a small non-profit, everyone has to wear lots of hats, and one that I have to wear from time to time (since no one else here can) is “sysadmin”. One of the perennial rules of system administration is: you can never give users enough bandwidth. The problem is, they eventually learn how fast your connection to the outside is, and then complain any time a download doesn't run at that speed. Of course, if you have a T1 or better, it's usually the other side that's the problem. So, I look to use our extra bandwidth during off hours to cache large pools of data that are often downloaded. With an organization full of Ubuntu machines, the Ubuntu repositories are an important target for caching.

apt-mirror is a program that mirrors large Debian-based repositories, including the Ubuntu ones. There are already tutorials available on how to set it up. What I'm writing about here is a way to “force” users to use that repository.
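For context only (the tutorials cover the details), the heart of an apt-mirror setup is an /etc/apt/mirror.list along these lines; this is a rough sketch, with the distributions and components trimmed down:

set base_path /var/spool/apt-mirror
set nthreads  20

deb http://archive.ubuntu.com/ubuntu gutsy main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu gutsy-updates main restricted universe multiverse
deb http://security.ubuntu.com/ubuntu gutsy-security main restricted universe multiverse

clean http://archive.ubuntu.com/ubuntu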

The obvious way, of course, is to make everyone's /etc/apt/sources.list point at the mirrored repository. That often isn't a good option. Aside from the servers, the user base here is all laptops, which means the machines are often on networks that may actually be closer to some other package repository, and I'd rather not interfere with that. (Although, given that I can usually serve almost any IP number in the world better than the 30 kb/sec that ubuntu.com's servers seem to quickly throttle down to, that probably doesn't matter so much.)

The bigger problem is that I don't want to be married to the idea that the apt-mirror is part of our essential 24/7 infrastructure. I don't want an angry late-night call from a user because they can't install a package, and I want the complete freedom to discontinue the server at any time, if I find it to be unreliable. I can't do this easily if sources.list files on traveling machines are hard-coded with the apt-mirror server's name or address, especially when I don't know when exactly they'll connect back to our VPN.

The easier solution is to fake out the DNS lookups via the DNS server used by the VPN and the internal network. This way, users only get the mirror when they are connected to the VPN or in the office; otherwise, they get the normal Ubuntu servers. I had actually forgotten you could fake out DNS on a per-host basis, but asking my friend Paul reminded me quickly. In /etc/bind/named.conf.local (on Debian/Ubuntu), I just add:

zone "archive.ubuntu.com"      {
        type master;
        file "/etc/bind/db.archive.ubuntu-fake";
};

And in /etc/bind/db.archive.ubuntu-fake:

$TTL    604800
@ IN SOA archive.ubuntu.com.  root.vpn. (
       2008011001  ; serial number                                              
       10800 3600 604800 3600)
     IN NS my-dns-server.vpn.

;                                                                               
;  Begin name records                                                           
;                                                                               
archive.ubuntu.com.  IN A            MY.EXTERNAL.FACING.IP

And there I have it; I just do one of those for each address I want to replace (e.g., security.ubuntu.com). Now, when client machines look up archive.ubuntu.com (et al.), they'll get MY.EXTERNAL.FACING.IP, but only when my-dns-server.vpn is first in their resolv.conf.
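A quick way to confirm the split view is behaving is to compare what the VPN's name server answers with what any outside resolver says (the second answer will just be whatever the real Ubuntu round-robin returns):

$ dig +short archive.ubuntu.com @my-dns-server.vpn
MY.EXTERNAL.FACING.IP

$ dig +short archive.ubuntu.com @some-outside-resolver
91.189.88.45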

Next time, I'll talk about some other ideas on how I make the apt-mirror even better.

Posted by Bradley M. Kuhn on January 16, 2008

January 9, 2008

Postfix Trick to Force Secondary MX to Deliver Locally

Suppose you have a domain name, example.org, that has a primary MX host (mail.example.org) that does most of the delivery. However, one of the users, who works at example.com, actually gets delivery of <user@example.org> at work (from the primary MX for example.com, mail.example.com). Of course, a simple .forward or /etc/aliases entry would work, but this would pointlessly push email back and forth between the two mail servers — in some cases, up to three pointless passes before the final destination! That's particularly an issue in today's SPAM-laden world. Here's how to solve this waste of bandwidth using Postfix.

This tutorial assumes you have some reasonable background knowledge of Postfix MTA administration. If you don't, it might go a bit fast for you.

To begin, first note that this setup assumes that you have something like this with regard to your MX setup:

$ host -t mx example.org
example.org mail is handled by 10 mail.example.org.
example.org mail is handled by 20 mail.example.com.
$ host -t mx example.com
example.com mail is handled by 10 mail.example.com.

Our first task is to avoid example.org SPAM backscatter on mail.example.com. To do that, we make a file with all the valid accounts for example.org and put it in mail.example.com:/etc/postfix/relay_recipients. (For more information, read the Postfix docs or various tutorials about this.) After that, we have something like this in mail.example.com:/etc/postfix/main.cf:

relay_domains = example.org
relay_recipient_maps = hash:/etc/postfix/relay_recipients

And this in /etc/postfix/transport:

example.org     smtp:[mail.example.org]
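The relay_recipients file itself is just the list of valid addresses mapped to a dummy value, and both it and the transport file are hash maps that need a postmap run after any change (this sketch assumes main.cf also points transport_maps at the transport file):

# /etc/postfix/relay_recipients
user@example.org             OK
user.lastname@example.org    OK

# rebuild the hashed maps and reload after editing either file
postmap /etc/postfix/relay_recipients
postmap /etc/postfix/transport
postfix reload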

This will give proper delivery for our friend <user@example.org> (assuming mail.example.org is forwarding that address properly to <user@example.com>), but mail will still be pushed back and forth unnecessarily when mail.example.com gets a message for <user@example.org>. What we actually want is to wise up mail.example.com so it “knows” that mail for <user@example.org> is ultimately going to be delivered locally on that server.

To do this, we add <user@example.org> to the virtual_alias_maps, with an entry like:

user@example.org      user

so that the key user@example.org resolves to the local username user. Fortunately, Postfix is smart enough to consult the virtual table before performing a relay.
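For completeness, here's a minimal sketch of the pieces involved (the file name /etc/postfix/virtual is just the conventional one):

# in mail.example.com:/etc/postfix/main.cf
virtual_alias_maps = hash:/etc/postfix/virtual

# in /etc/postfix/virtual
user@example.org      user

# rebuild the map and reload
postmap /etc/postfix/virtual
postfix reload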

Now, what about aliases like <user.lastname@example.org>, which actually forwards to <user@example.org>? That will have the same pointless forwarding from server to server unless we address it specifically. To do so, we use the transport file. Of course, we should already have that catch-all entry there to do the relaying:

example.org     smtp:[mail.example.org]

But we can also add address-specific entries for certain addresses in the example.org domain. Fortunately, full email address matches in the transport table take precedence over whole-domain entries (see the transport(5) man page for details). Therefore, we simply add an entry to that transport file like this for each of user's aliases:

user.lastname@example.org    local:user

(Note: that assumes you have a delivery method in master.cf called local. Use whatever transport you typically use to force local delivery.)
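On a stock Debian/Ubuntu Postfix install, master.cf should already contain a suitable local entry, which looks roughly like this:

# service type  private unpriv  chroot  wakeup  maxproc command + args
local     unix  -       n       n       -       -       local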

And there you have it! If you have (those admittedly rare) friendly and appreciative users, they will thank you for the slightly quicker mail delivery, and you'll be glad you aren't pointlessly shipping SPAM back and forth between MXes.

Posted by Bradley M. Kuhn on January 9, 2008

January 1, 2008

Apache 2.0 → 2.2 LDAP Changes on Ubuntu

I thought the following might be of use to those of you who are still using Apache 2.0 with LDAP and wish to upgrade to 2.2. I found this basic information scattered around online, but I had to search pretty hard for it. Perhaps presenting it in a more straightforward way will help the next searcher find an answer more quickly. It's probably only of interest if you are using LDAP as your authentication system with an older Apache (e.g., 2.0) and have upgraded to 2.2 on an Ubuntu or Debian system (such as upgrading from dapper to gutsy).

When running dapper on my intranet web server with Apache 2.0.55-4ubuntu2.2, I had something like this:

     <Directory /var/www/intranet>
           Order allow,deny
           Allow from 192.168.1.0/24 

           Satisfy All
           AuthLDAPEnabled on
           AuthType Basic
           AuthName "Example.Org Intranet"
           AuthLDAPAuthoritative on
           AuthLDAPBindDN uid=apache,ou=roles,dc=example,dc=org
           AuthLDAPBindPassword APACHE_BIND_ACCT_PW
           AuthLDAPURL ldap://127.0.0.1/ou=staff,ou=people,dc=example,dc=org?cn
           AuthLDAPGroupAttributeIsDN off
           AuthLDAPGroupAttribute memberUid

           require valid-user
    </Directory>

I upgraded that server to gutsy (via dapper → edgy → feisty → gutsy in succession, just because it's safer), and it now has Apache 2.2.4-3build1. The method for doing LDAP authentication is a bit more straightforward now, but it does require this change:

    <Directory /var/www/intranet>
        Order allow,deny
        Allow from 192.168.1.0/24 

        AuthType Basic
        AuthName "Example.Org Intranet"
        AuthBasicProvider ldap
        AuthzLDAPAuthoritative on
        AuthLDAPBindDN uid=apache,ou=roles,dc=example,dc=org
        AuthLDAPBindPassword APACHE_BIND_ACCT_PW
        AuthLDAPURL ldap://127.0.0.1/ou=staff,ou=people,dc=example,dc=org

        require valid-user
        Satisfy all
    </Directory>
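(As an aside: the 2.2 configuration above relies on mod_authnz_ldap and mod_ldap, so if they aren't already enabled, something like this is needed on Debian/Ubuntu before reloading Apache.)

a2enmod ldap
a2enmod authnz_ldap
/etc/init.d/apache2 force-reload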

However, this wasn't enough. When I set this up, I got rather strange error messages such as:

[error] [client MYIP] GROUP: USERNAME not in required group(s).

I found somewhere online (I've now lost the link!) that you can't have standard PAM auth competing with the LDAP authentication. This seemed strange to me, since I'd told Apache that I wanted authentication provided by LDAP, but anyway, doing the following on the system:

a2dismod auth_pam
a2dismod auth_sys_group

solved the problem. I decided to move on rather than dig deeper into the true reasons. Sometimes, administration life is actually better with a mystery about.

Posted by Bradley M. Kuhn on January 1, 2008

November 21, 2007

stet and AGPLv3

Many people don't realize that the GPLv3 process actually began long before the November 2005 announcement. For me and a few others, the GPLv3 process started much earlier. Also, in my view, it didn't actually end until this week, when SFLC's client (the FSF) released the AGPLv3. Today, I'm particularly proud that SFLC is releasing the first software covered by the terms of that license.

The GPLv3 process focused on the idea of community, and a community is built from bringing together many individual experiences. I am grateful for all my personal experiences throughout this process. Indeed, I would guess that other GPL fans like myself remember, as I do, the first time they heard the phrase “GPLv3”. For me, it was a bit early — on Tuesday 8 January 2002 in a conference room at MIT. On that day, Richard Stallman, Eben Moglen and I sat down to have an all-day meeting that included discussions regarding updating the GPL. A key issue that we sought to address was (in those days) called the “Application Service Provider (ASP) problem” — now called “Software as a Service (SaaS)”.

A few weeks later, on the telephone with Eben one morning, as I stood in my kitchen making oatmeal, we discussed this problem. I pointed out the oft-forgotten section 2(c) of the GPL [version 2]. I argued that contrary to popular belief, it does have restrictions on some minor modifications. Namely, you have to maintain those print statements for copyright and warranty disclaimer information. It's reasonable, in other words, to restrict some minor modifications to defend freedom.

We also talked about that old Computer Science problem of having a program print its own source code. I proposed that maybe we needed a section 2(d) requiring that, if a program prints its own source to the user, you can't remove that feature, and that the feature must always print the complete and corresponding source.

Within two months, Affero GPLv1 was published — an authorized fork of the GPL to test the idea. From then until AGPLv3, that “Affero clause” has had many changes, iterations and improvements, and I'm grateful for all the excellent feedback, input and improvements that have gone into it. The result, the Affero GPLv3 (AGPLv3) released on Monday, is an excellent step forward for software freedom licensing. While the community process indicated that the preference was for the Affero clause to be part of a separate license, I'm nevertheless elated that the clause continues to live on and be part of the licensing infrastructure defending software freedom.

Other than coining the Affero clause, my other notable personal contribution to the GPLv3 was management of a software development project to create the online public commenting system. To do the programming, we contracted with Orion Montoya, who has extensive experience doing semantic markup of source texts from an academic perspective. Orion gave me my first introduction to the whole “Web 2.0” thing, and I was amazed how useful the result was; it helped the leaders of the process easily grok the public response. For example, the intensity highlighting — which shows the hot spots in the text that received the most comments — gives a very quick picture of sections that are really of concern to the public. In reviewing the drafts today, I was reminded that the big red area in section 1 about “encryption and authorization codes” is substantially changed and less intensely highlighted by draft 4. That quick-look gives a clear picture of how the community process operated to get a better license for everyone.

Orion, a Classics scholar as an undergrad, named the software stet for its original Latin definition: “let it stand as it is”. It was his hope that stet (the software) would help along the GPLv3 process so that our whole community, after filing comments on each successive draft, could look at the final draft and simply say: Stet!

Stet has a special place in software history, I believe, even if it's just a purely geeky one. It is the first software system in history to be meta-licensed. Namely, it was software whose output was its own license. It's with that exciting hacker concept that I put up today a Trac instance for stet, licensed under the terms of the AGPLv3 1.

Stet is by no means ready for drop-in production. Like most software projects, we didn't estimate perfectly how much work would be needed. We got lazy about organization early on, which means it still requires a by-hand install, and new texts must be carefully marked up by hand. We've moved on to other projects, but I'm happy to host the Trac instance here at SFLC indefinitely so that other developers can make it better. That's what copylefted FOSS is all about — even when it's SaaS.


1Actually, it's under AGPLv3 plus an exception to allow for combining with the GPLv2-only Request Tracker, with which parts of stet combine.

Posted by Bradley M. Kuhn on November 21, 2007

August 24, 2007

More Xen Tricks

In my previous post about Xen, I talked about how easy Xen is to configure and set up, particularly on Ubuntu and Debian. I'm still grateful that Xen remains easy; however, I've lately had a few Xen-related challenges that needed attention. In particular, I've needed to create some surprisingly messy solutions when using vif-route to route multiple IP numbers on the same network through the dom0 to a domU.

I tend to use vif-route rather than vif-bridge, as I like the control it gives me in the dom0. The dom0 becomes a very traditional packet-forwarding firewall that can decide whether or not to forward packets to each domU host. However, I recently found some deep weirdness in IP routing when I use this approach while needing multiple Ethernet interfaces on the domU. Here's an example:

Multiple IP numbers for Apache

Suppose the domU host, called webserv, hosts a number of websites, each with a different IP number, so that I have Apache doing something like1:

Listen 192.168.0.200:80
Listen 192.168.0.201:80
Listen 192.168.0.202:80
...
NameVirtualHost 192.168.0.200:80
<VirtualHost 192.168.0.200:80>
...
NameVirtualHost 192.168.0.201:80
<VirtualHost 192.168.0.201:80>
...
NameVirtualHost 192.168.0.202:80
<VirtualHost 192.168.0.202:80>
...

The Xen Configuration for the Interfaces

Since I'm serving all three of those sites from webserv, I need all those IP numbers to be real, live IP numbers on the local machine as far as webserv is concerned. So, in dom0:/etc/xen/webserv.cfg I list something like:

vif  = [ 'mac=de:ad:be:ef:00:00, ip=192.168.0.200',
         'mac=de:ad:be:ef:00:01, ip=192.168.0.201',
         'mac=de:ad:be:ef:00:02, ip=192.168.0.202' ]

… And then make webserv:/etc/iftab look like:

eth0 mac de:ad:be:ef:00:00 arp 1
eth1 mac de:ad:be:ef:00:01 arp 1
eth2 mac de:ad:be:ef:00:02 arp 1

… And make webserv:/etc/network/interfaces (this is probably Ubuntu/Debian-specific, BTW) look like:

auto lo
iface lo inet loopback
auto eth0
iface eth0 inet static
 address 192.168.0.200
 netmask 255.255.255.0
auto eth1
iface eth1 inet static
 address 192.168.0.201
 netmask 255.255.255.0
auto eth2
iface eth2 inet static
 address 192.168.0.202
 netmask 255.255.255.0

Packet Forwarding from the Dom0

But, this doesn't get me the whole way there. My next step is to make sure that the dom0 is routing the packets properly to webserv. Since my dom0 is heavily locked down, all packets are dropped by default, so I have to let through explicitly anything I'd like webserv to be able to process. So, I add some code to my firewall script on the dom0 that looks like:2

webIpAddresses="192.168.0.200 192.168.0.201 192.168.0.202"
UNPRIVPORTS="1024:65535"

for dport in 80 443;
do
  for sport in $UNPRIVPORTS 80 443 8080;
  do
    for ip in $webIpAddresses;
    do
      /sbin/iptables -A FORWARD -i eth0 -p tcp -d $ip \
        --syn -m state --state NEW \
        --sport $sport --dport $dport -j ACCEPT

      /sbin/iptables -A FORWARD -i eth0 -p tcp -d $ip \
        --sport $sport --dport $dport \
        -m state --state ESTABLISHED,RELATED -j ACCEPT

      /sbin/iptables -A FORWARD -o eth0 -s $ip \
        -p tcp --dport $sport --sport $dport \
        -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
    done  
  done
done

Phew! So at this point, I thought I was done. The packets should find their way forwarded through the dom0 to the Apache instance running on the domU, webserv. While that much was true, I now had the additional problem that packets were getting lost in a bit of a black hole on webserv. When I discovered the black hole, I quickly realized why. It was somewhat atypical, from webserv's point of view, to have three “real” and different Ethernet devices with three different IP numbers, all talking to the exact same network. More intelligent routing was needed.3

Routing in the domU

While most non-sysadmins still use the route command to set up local IP routes on a GNU/Linux host, iproute2 (available via the ip command) has been a standard part of GNU/Linux distributions and supported by Linux for nearly ten years. To properly support the situation of multiple (from webserv's point of view, at least) physical interfaces on the same network, some special iproute2 code is needed. Specifically, I set up separate route tables for each device. I first encoded their names in /etc/iproute2/rt_tables (the numbers 16-18 are arbitrary, BTW):

16      eth0-200
17      eth1-201
18      eth2-202

And here are the ip commands that I thought would work (but didn't, as you'll see next):

/sbin/ip route del default via 192.168.0.1

for table in eth0-200 eth1-201 eth2-202;
do
   iface=`echo $table | perl -pe 's/^(\S+)\-.*$/$1/;'`
   ipEnding=`echo $table | perl -pe 's/^.*\-(\S+)$/$1/;'`
   ip=192.168.0.$ipEnding
   /sbin/ip route add 192.168.0.0/24 dev $iface table $table

   /sbin/ip route add default via 192.168.0.1 table $table
   /sbin/ip rule add from $ip table $table
   /sbin/ip rule add to 0.0.0.0 dev $iface table $table
done

/sbin/ip route add default via 192.168.0.1 

The idea is that each table will use rules to force all traffic coming in on the given IP number and/or interface to always go back out on the same, and vice versa. The key is these two lines:

   /sbin/ip rule add from $ip table $table
   /sbin/ip rule add to 0.0.0.0 dev $iface table $table

The first rule says that when traffic is coming from the given IP number, $ip, the routing rules in table $table should be used. The second says that traffic to anywhere, when bound for interface $iface, should use table $table.

The tables themselves are set up to always make sure the local network traffic goes through the proper associated interface, and that the network router (in this case, 192.168.0.1) is always used for foreign networks, but that it is reached via the correct interface.

This is all well and good, but it doesn't work. Certain instructions fail with the message RTNETLINK answers: Network is unreachable, because the 192.168.0.0 network cannot be found while the instructions are running. Perhaps there is an elegant solution; I couldn't find one. Instead, I temporarily set up “dummy” global routes in the main route table and deleted them once the table-specific ones were created. Here's the new bash script that does that (the added lines are the temporary ip route add/del 192.168.0.0/24 … src … pairs):

/sbin/ip route del default via 192.168.0.1
for table in eth0-200 eth1-201 eth2-202;
do
   iface=`echo $table | perl -pe 's/^(\S+)\-.*$/$1/;'`
   ipEnding=`echo $table | perl -pe 's/^.*\-(\S+)$/$1/;'`
   ip=192.168.0.$ipEnding
   /sbin/ip route add 192.168.0.0/24 dev $iface table $table

   /sbin/ip route add 192.168.0.0/24 dev $iface src $ip

   /sbin/ip route add default via 192.168.0.1 table $table
   /sbin/ip rule add from $ip table $table
   /sbin/ip rule add to 0.0.0.0 dev $iface table $table

   /sbin/ip route del 192.168.0.0/24 dev $iface src $ip
done
/sbin/ip route add 192.168.0.0/24 dev eth0 src 192.168.0.200
/sbin/ip route add default via 192.168.0.1 
/sbin/ip route del 192.168.0.0/24 dev eth0 src 192.168.0.200

I am pretty sure I'm missing something here — there must be a better way to do this, but the above actually works, even if it's ugly.
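For what it's worth, iproute2 can show exactly what ended up installed, which helps when debugging this sort of setup:

# list the rules; the per-IP "from" and per-device "to" entries should appear
/sbin/ip rule show

# dump one per-device table by the name given in /etc/iproute2/rt_tables
/sbin/ip route show table eth0-200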

Alas, Only Three

There was one additional confusion I put myself through while implementing the solution. I was actually trying to route four separate IP addresses into webserv, but discovered this error message (found via dmesg on the domU): netfront can't alloc rx grant refs. A quick google around turned up the XenFaq, which says that Xen 3 cannot handle more than three network interfaces per domU. That seems strangely arbitrary to me; I'd love to hear why it cuts off at three. I can imagine limits at one and two, but it seems that once you can do three, n should be possible (perhaps still with linear slowdown or some such). I'll have to ask the Xen developers (or UTSL) some day to find out what makes it possible to have three work but not four.


1Yes, I know I could rely on client-provided Host: headers and do this with full name-based virtual hosting, but I don't like to do that for good reason (as outlined in the Apache docs).

2Note that the above firewall code must run on dom0, which has one real Ethernet device (its eth0) that is connected properly to the wide 192.168.0.0/24 network, and should have some IP number of its own there — say 192.168.0.100. And, don't forget that dom0 is configured for vif-route, not vif-bridge. Finally, for brevity, I've left out some of the firewall code that FORWARDs through key stuff like DNS. If you are interested in it, email me or look it up in a firewall book.

3I was actually a bit surprised at this, because I often have multiple IP numbers serviced from the same computer and physical Ethernet interface. However, in those cases, I use virtual interfaces (eth0:0, eth0:1, etc.). On a normal system, Linux does the work of properly routing the IP numbers when you attach multiple IP numbers virtually to the same physical interface. However, in Xen domUs, the physical interfaces are locked by Xen to only permit specific IP numbers to come through, and while you can set up all the virtual interfaces you want in the domU, it will only get packets destined for the IP number specified in the vif section of the configuration file. That's why I added my three different “actual” interfaces in the domU.

Posted by Bradley M. Kuhn on August 24, 2007

June 12, 2007

Virtually Reluctant

Way back when User Mode Linux (UML) was the “only way” the Free Software world did anything like virtualization, I was already skeptical. Those of us who lived through the coming of age of Internet security — with a remote root exploit for every day of the week — became obsessed with the chroot and its ultimate limitations. Each possible upgrade to a better, more robust virtual environment was met with suspicion on the security front. I joined the many who doubted that you could truly secure a machine that offered disjoint services provisioned on the same physical machine. I've recently revisited this position. I won't say that Xen has completely changed my mind, but I am open-minded enough again to experiment.

For more than a decade, I have used chroots as a mechanism to segment a service that needed to run on a given box. In the old days of ancient BINDs and sendmails, this was often the best we could do when living with a program we didn't fully trust to be clean of remotely exploitable bugs.

I suppose those days gave us all a rather strange sense of computer security. I constantly have the sense that two services running on the same box always endanger each other in some fundamental way. It therefore took me a while before I was comfortable with the resurgence of virtualization.

However, what ultimately drew me in was the simple fact that modern hardware is just too darn fast. It's tough to get a machine these days that isn't ridiculously overpowered for most tasks you put in front of it. CPUs sit idle; RAM sits empty. We should make more efficient use of the hardware we have.

Even with that reality, I might have given up if it wasn't so easy. I found a good link about Debian on Xen, a useful entry in the Xen Wiki, and some good network and LVM examples. I also quickly learned how to use RAID/LVM together for disk redundancy inside Xen instances. I even got bonded ethernet working with some help to add additional network redundancy.

So, one Saturday morning, I headed into the office, and left that afternoon with two virtual servers running. It helped that Xen 3.0 is packaged properly for recent Ubuntu versions, and a few obvious apt-get installs get you what you need on edgy and feisty. In fact, I only struggled (and only just a bit) with the network, but quickly discovered two important facts:

  • VIF network routing in my opinion is a bit easier to configure and more stable than VIF bridging, even if routing is a bit slower.
  • sysctl -w net.ipv4.conf.DEVICE.proxy_arp=1 is needed to make the network routing down into the instances work properly.
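Those settings can also be made persistent across dom0 reboots in /etc/sysctl.conf (a sketch; it assumes eth0 is the dom0's outward-facing device, and IP forwarding must also be on for the routed setup to pass packets at all):

# /etc/sysctl.conf on the dom0
net.ipv4.ip_forward = 1
net.ipv4.conf.eth0.proxy_arp = 1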

I'm not completely comfortable yet with the security of virtualization. Of course, locking down the dom0 is absolutely essential, because there lie the keys to your virtual kingdom. I lock it down with iptables so that only SSH from a few trusted hosts comes in, and even services as fundamental as DNS can only be had from a few trusted places. But, I still find myself imagining ways people can bust through the instance kernels and find their way to the hypervisor.

I'd really love to see a strong line-by-line code audit of the hypervisor and related utilities to be sure we've got something we can trust. However, in the meantime, I certainly have been sold on the value of this approach, and am glad it's so easy to set up.

Posted by Bradley M. Kuhn on June 12, 2007

May 8, 2007

Tools for Investigating Copyright Infringement, Part 1

Nearly all software developers know that software is covered by copyright. Many know that copyright covers the expression of an idea fixed in a medium (such as a series of bytes), and that the copyright rules govern the copying, modifying and distributing of the work. However, only a very few have considered the questions that arise when trying to determine if one work infringes the copyright of another.

Indeed, in the world of software freedom, copyright is seen as a system we have little choice but to tolerate. Many Free Software developers dislike the copyright system we have, so it is little surprise that developers want to spend minimal time thinking about it. Nevertheless, the copyright system is the foremost legal framework that governs software1, and we have to live within it for the moment.

My fellow developers have asked me for years what constitutes copyright infringement. In turn, for years, I have asked the lawyers I worked with to give me guidelines to pass on to the Free Software development community. I've discovered that it's difficult to adequately describe the nature of copyright infringement to software developers. While it is easy to give pathological examples of obvious infringement (such as taking someone's work, removing their copyright notices and distributing it as your own), it quickly becomes difficult to give definitive answers about whether some particular real-world activity constitutes infringement.

In fact, in nearly every GPL enforcement case that I've worked on in my career, the fact that infringement had occurred was never in dispute. The typical GPL violator started with a work under GPL, made some modifications to a small portion of the codebase, and then distributed the whole work in binary form only. It is virtually impossible to act in that way and still not infringe the original copyright.

Usually, the cases of “hazy” copyright infringement come up the other way around: when a Free Software program is accused of infringing the copyright of some proprietary work. The most famous accusation of this nature came from Darl McBride and his colleagues at SCO, who claimed that something called “Linux” infringed his company's rights. We now know that there was no copyright infringement (BTW, whether McBride meant to accuse the GNU/Linux operating system or the kernel named Linux, we'll never actually know). However, the SCO situation educated the Free Software community that we must strive to answer quickly and definitively when such accusations arise. The burden of proof is usually on the accuser, but being able to make a preemptive response to even the hint of an allegation is always advantageous when fighting FUD in the court of public opinion.

Finally, issues of “would-be” infringement detection come up for companies during due diligence work. Ideally, there should be an easy way for companies to confirm which parts of their systems are derivatives of Free Software systems, which would make compliance with licenses easy. A few proprietary software companies provide this service; however, there should also be readily available Free Software tools (just as there should be for all tasks one might want to perform with a computer).

It is not so easy to create such tools. Copyright infringement is not trivially defined; in fact, most non-trivial situations require a significant amount of both technical and legal judgement. Software tools cannot make a legal conclusion regarding copyright infringement. Rather, successful tools will guide an expert's analysis of a situation. Such systems will immediately identify the rarely-found obvious indications of infringement, bring to the forefront facts that need an exercise of judgement, and leave everything else in the background.

In this multi-part series of blog entries, I will discuss the state of the art in these Free Software systems for infringement analysis and what plans our community should make for the creation of Free systems that address this problem.


1 Copyright is the legal system that non-lawyers usually identify most readily as governing software, but the patent system (unfortunately) also governs software in many countries, and many non-Free Software licenses (and a few of the stranger Free Software ones) also operate under contract law as well as copyright law. Trade secrets are often involved with software as well. Nevertheless, in the Software Freedom world, copyright is the legal system of primary attention on a daily basis.

Posted by Bradley M. Kuhn on May 8, 2007

April 17, 2007

Remember the Verbosity (A Brief Note)

I don't remember exactly when it happened, but sometime in the past four years the Makefiles for the kernel named Linux changed, and the build output stopped looking like what I remember from 1991 and started looking like this:

CC arch/i386/kernel/semaphore.o
CC arch/i386/kernel/signal.o

This is a heck of a lot easier to read, but there was something cool about having make display the whole gcc command lines, like this:

gcc -m32 -Wp,-MD,arch/i386/kernel/.semaphore.o.d -nostdinc -isystem /usr/lib/gcc/i486-linux-gnu/4.0.3/include -D__KERNEL__ -Iinclude -include include/linux/autoconf.h -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -pipe -msoft-float -mpreferred-stack-boundary=2 -march=i686 -mtune=pentium4 -Iinclude/asm-i386/mach-default -Wdeclaration-after-statement -Wno-pointer-sign -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(semaphore)" -D"KBUILD_MODNAME=KBUILD_STR(semaphore)" -c -o arch/i386/kernel/semaphore.o arch/i386/kernel/semaphore.c
gcc -m32 -Wp,-MD,arch/i386/kernel/.signal.o.d -nostdinc -isystem /usr/lib/gcc/i486-linux-gnu/4.0.3/include -D__KERNEL__ -Iinclude -include include/linux/autoconf.h -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -ffreestanding -Os -fomit-frame-pointer -pipe -msoft-float -mpreferred-stack-boundary=2 -march=i686 -mtune=pentium4 -Iinclude/asm-i386/mach-default -Wdeclaration-after-statement -Wno-pointer-sign -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(signal)" -D"KBUILD_MODNAME=KBUILD_STR(signal)" -c -o arch/i386/kernel/signal.o arch/i386/kernel/signal.c

I never gave it much thought, since the new form was easier to read. I figured that those folks who still eat kernel code for breakfast knew about this change well ahead of time. Of course, they were the only ones who needed to see the verbose output of the gcc command lines. I could live with seeing the simpler CC lines for my purposes, until today.

I was compiling kernel code and for the first time since this change in the Makefiles, I was using a non-default gcc to build Linux. I wanted to double-check that I'd given the right options to make throughout the process. I therefore found myself looking for a way to see the full output again (and for the first time). It was easy enough to figure out: giving the variable setting V=1 to make gives you the verbose version. For you Debian folks like me, we're using make-kpkg, so the line we need looks like: MAKEFLAGS="V=1" make-kpkg kernel_image.
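In other words, the two forms are:

# in a plain kernel source tree
make V=1

# for a Debian/Ubuntu kernel-package build
MAKEFLAGS="V=1" make-kpkg kernel_image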

It's nice sometimes to pretend I'm compiling 0.99pl12 again and not 2.6.20.7. :) No matter which options you give make, it is still a whole lot easier to bootstrap Linux these days.

Posted by Bradley M. Kuhn on April 17, 2007

April 10, 2007

User-Empowered Security via encfs

Eventually, I hope to post about the more involved work that we are doing at SFLC to run the entire organization on Free Software. However, I think that I'll start off with a few posts about smaller useful items that we've put in place for our users. In this entry, I'll talk about one of the solutions we're using to address the issue of compromised data on lost or stolen laptops.

The SFLC staff uses only laptops for daily computing; not a single member of our team has an organization-provided desktop. The mobility of taking the same computer home that one uses on the desk at work is just too important to everyone. One of my biggest worries in this environment — especially given that client confidential material is on every laptop via SVN checkouts of documents — is that data can suddenly become available to anyone in the world if a laptop is lost or stolen. I was reminded of this during the mainstream media coverage1 of this issue last year.

There's the old security through obscurity perception of running GNU/Linux systems. Proponents of this theory argue that most thieves (or impromptu thieves, who find a lost laptop but decide not to return it to its owner) aren't likely to know how to use a GNU/Linux system, and will probably wipe the drive before selling it or using it. However, with the popularity of Free Software rising, this old standby (which never should have been a standby anyway, of course) doesn't even give an illusion of security anymore.

I have been known as a computer security paranoid in my time, and I keep a rather strict regimen of protocols for my own personal computer security. But, I don't like to inflict new onerous security procedures on the otherwise unwilling. Generally, people will find methods around security procedures when they aren't fully convinced they are necessary, and you're often left with a situation just as bad or worse than when you started implementing your new procedures.

My solution for the lost/stolen laptop security problem was therefore two-fold: (a) education among the userbase about how common it is to have a laptop lost or stolen, and (b) providing a simple user-space mechanism for encrypting sensitive data on the laptop. Since (a) is somewhat obvious, I'll talk about (b) in detail.

I was fortunate that, in parallel, my friend Paul and one of the lawyers here (James Vasile) discovered how easy it is to use encfs and told me about it. encfs uses the Filesystem in Userspace (FUSE) to store encrypted data right in a user's own home directory. And, it is trivially easy to set up! I used Paul's tutorial myself, but there are many published all over the Internet.
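For the curious, the whole thing really is only a few commands. Here's a sketch of the steps (directory names are arbitrary, and on the Ubuntu releases of this era the user may also need to be added to the fuse group first):

sudo apt-get install encfs
encfs ~/.crypt-raw ~/crypt     # offers to create both directories, then asks for a passphrase
mv ~/confidential-stuff ~/crypt/
fusermount -u ~/crypt          # unmount; re-run the same encfs command later to mount again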

My favorite part of this solution is that rather than an onerous mandated procedure, encfs turns security into user empowerment. My colleague James wrote up a tutorial for our internal Wiki, and I've simply encouraged users to take a look and consider encrypting their confidential data. Even though not everyone has taken it up yet, many already have. When a new security measure requires substantial change in behavior of the user, the measure works best when users are given an opportunity to adopt it at their own pace. FUSE deserves a lot of credit in this regard, since it lets users switch their filesystem to encryption in pieces (unlike other cryptographic filesystems that require some planning ahead). For my part, I've been slowly moving parts of my filesystem into an encrypted area as I move aside old habits gradually.

I should note that this solution isn't completely without cost. First, there is no metadata encryption, but I am really not worried about interlopers finding out how big our nameless files and directories are and who created them (anyway, with an SVN checkout, the interesting metadata is in .svn, so it's encrypted in this case). Second, we've found that I/O-intensive file operations take approximately twice as long (both under ext3 and XFS) when using encfs. I haven't moved my email archives to my encrypted area yet because of the latter drawback. However, for all my other sensitive data (confidential text documents, IRC chat logs, financial records, ~/.mozilla, etc.), I don't really notice the slow-down using a 1.6 GHz CPU with ample free RAM. YMMV.
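(If you want to gauge the overhead on your own hardware, a crude comparison is enough; for example, time the same large tar extraction inside and outside the encfs mount. The archive name here is just a placeholder.)

time tar -xf big-archive.tar -C ~/scratch          # plain filesystem
time tar -xf big-archive.tar -C ~/crypt/scratch    # same extraction inside the encfs mount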


1 BTW, I'm skeptical about the FBI's claim in that old Washington Post article which states “review of the equipment by computer forensic teams has determined that the data base remains intact and has not been accessed since it was stolen”. I am mostly clueless about computer forensics; however, barring any sort of physical seal on the laptop or hard drive casing, could a forensics expert tell if someone had pulled out the drive, put it in another computer, did a dd if=/dev/hdb of=/dev/hda, and then put it back as it was found?

Posted by Bradley M. Kuhn on April 10, 2007
