Too Perfect a Mirror

(Update)

I haven’t modified the text of the original post below, but I did write an update that addresses, in decent detail, the most common feedback that came our way regarding backups, our failures, and so on. If you read this article and think “um, why didn’t they have backups?”, then you should probably read that follow-up.

Also, one thing I forgot to mention in the original post: we actually had tarballs of all repositories. We create them every few days, but they are not perfect backups. They’re detailed further in the update post.


I’m writing this as a post-mortem on what almost became The Great KDE Disaster Of 2013. You can see the early, semi-panicked blog posts (“git.kde.org down…” and “git.kde.org will be back and what you should be aware of”) that describe the situation.

Here’s what happened.

“…What the hell?”

On 2013-03-22, the server that hosts the git.kde.org virtual machine was taken down for security updates. Both virtual machines running on the server were shut down without incident, security updates were applied to the host, and the machine was rebooted.

When the host came back up and started the VMs, the VMs immediately showed evidence of file system corruption (the file system in question was ext4). It is not known at this time (and we’ll probably never know) whether this corruption had been silently ongoing for a long period of time, or was the result of something specific that occurred during the shutdown or reboot of the VM or host. There is some evidence to suggest the former, but nothing concrete.

As most of you reading this are well aware, KDE has a series of “anongit” machines whose purpose is to spread the load of serving the 1500 hosted Git repositories and to act as backups for the main server. However, when we checked the anongit machines, every single one of them had severely corrupted repositories, and many or all repositories were missing entirely.

How could this happen?

A Perfect Mirror

Like all software, our mirroring system had bugs; and like many bugs, we didn’t know they existed until disaster struck.

The root of both bugs was a design flaw: the decision that git.kde.org was always to be considered the trusted, canonical source. The rationale behind this decision is relatively obvious; it’s a locked-down, authenticated resource that runs customized hooks to validate the code being pushed to it. It’s perfectly reasonable to decide that it should be considered to be correct.

To this end, a mirroring system was set up that essentially tries to make the anongits look exactly like git.kde.org within a reasonable amount of time. This includes not only the code in the Git repositories, but also the various bits of metadata that we use for administration. Syncing happens to each anongit within 20 minutes or so, and at that time the mirroring system makes the anongit look just like git.kde.org, ready to sync all repositories back upstream if upstream were to die and be replaced.
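
For the curious, here’s a minimal sketch of what the per-repository part of a sync pass could look like, assuming a hypothetical upstream URL, local path, and helper name; this is an illustration of the “make the anongit look like git.kde.org” idea, not our actual script.

    #!/usr/bin/env python3
    # Hypothetical sketch of the per-repository part of an anongit sync pass.
    # Paths, URLs, and helper names are illustrative, not KDE's actual scripts.
    import subprocess
    from pathlib import Path

    UPSTREAM = "git://git.kde.org"           # assumed upstream URL
    LOCAL_ROOT = Path("/srv/anongit/repos")  # assumed local mirror root

    def sync_repo(name):
        """Make the local copy of one repository look exactly like upstream."""
        dest = LOCAL_ROOT / (name + ".git")
        if dest.exists():
            # An existing mirror just fetches; the mirror refspec lets forced
            # updates through, and --prune drops refs deleted upstream.
            subprocess.run(
                ["git", "--git-dir", str(dest), "fetch", "--prune", "origin"],
                check=True)
        else:
            # A new repository gets a bare mirror clone of upstream.
            subprocess.run(
                ["git", "clone", "--mirror", UPSTREAM + "/" + name + ".git",
                 str(dest)],
                check=True)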

However, while we planned for the case of the server losing a disk or entirely biting the dust, or the total loss of the VM’s filesystem, we didn’t plan for the case of filesystem corruption, and the way the corruption affected our mirroring system triggered some very unforeseen and pathological conditions.

The sequence of events:

  1. The VM came up. Its projects file, which contains a complete list of repositories in the system, was corrupt.
  2. The anongits performed a sync. As part of this sync, they retrieved the new projects file, which (somewhat interestingly) appears to have been different for each of the anongits (although corrupt each time). The nature of the corruption was such that the majority or all of the valid repositories in the system were no longer in the projects file.
  3. Each anongit then removed every local repository that was no longer listed in the projects file. This is normally a valid action: it’s how repositories that have genuinely been deleted on the server get cleaned up locally (a sketch of this pruning step follows the list).
  4. Through some mechanism that is not clear (but is potentially due to some anongits syncing more often than others), some of the anongits then started re-cloning some of the repositories. They cloned the corruption – more on this very important point later.
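
To make step 3 concrete, here’s a minimal sketch of that pruning logic, again with hypothetical paths and names; the point is simply that a near-empty projects file turns a routine cleanup step into mass deletion.

    # Hypothetical sketch of step 3 above: dropping local repositories that no
    # longer appear in the (possibly corrupt) projects file. Paths are illustrative.
    import shutil
    from pathlib import Path

    LOCAL_ROOT = Path("/srv/anongit/repos")  # assumed local mirror root

    def prune_deleted(projects):
        wanted = {name + ".git" for name in projects}
        for repo in LOCAL_ROOT.glob("*.git"):
            if repo.name not in wanted:
                # With a corrupt, near-empty projects file, this loop happily
                # deletes almost everything -- which is exactly what happened.
                shutil.rmtree(repo)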

And so, the corruption was perfectly mirrored…or rather, due to its nature, imperfectly mirrored. And all data on the anongits was lost.

Lucky. Lucky. Lucky lucky lucky lucky lucky lucky

We got lucky.

The server that hosts projects.kde.org – and a syncing (although not user-facing) anongit – is located in a Hetzner data center and up to this point had been making good use of a block of static IPv4 addresses allocated to it three years ago. Hetzner recently decided that, due to the IPv4 shortage, it was going to start charging fairly hefty sums for the use of IPv4 address blocks. As the hardware was getting rather old, this was a great impetus to move to much newer and better hardware at the same cost and to migrate projects.kde.org with it.

Just one day before this all happened, the anongit cloning system had been set up on the new server in preparation for the migration. That was by no means the only piece of luck: this single server happened to have the start of its syncing window – which comes around once every twenty minutes on this box – fall right in the middle of the server reboot. As a result, the command to fetch the latest projects list timed out, and the script passed over it and simply attempted to fetch the latest revisions from the repositories on the server, which failed because the server could not produce a valid pack.
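
In effect, the sync script treats a failed or timed-out fetch of the projects list as “keep what you have”. A rough sketch of that fallback, with a hypothetical URL and paths:

    # Rough sketch of the fallback that saved this box: if fetching the projects
    # list fails or times out, keep the previous list and carry on. The URL and
    # paths are hypothetical.
    import os
    import subprocess

    PROJECTS_URL = "https://projects.kde.org/projects.list"  # hypothetical location

    def refresh_projects_list(dest):
        tmp = dest + ".new"
        try:
            # Download to a temporary file first; only replace on success.
            subprocess.run(["curl", "-sSf", "-o", tmp, PROJECTS_URL],
                           check=True, timeout=60)
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            # Timed out or failed: silently keep the existing projects list and
            # go on to fetch repository updates as usual.
            return
        os.replace(tmp, dest)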

Where we should have had four or five complete copies of the KDE Git repositories, with git.kde.org and every other anongit completely dead, this new server alone retained a pristine copy of all 1500 of them.

We ran git fsck on every repository, and they all checked out. We were (besides being much relieved) then able to re-provision git.kde.org with these repositories, including metadata (by reversing the script usually used to sync metadata to the anongits).

Not everything was quite so simple. Our Gitolite repository was corrupt too. Rather than deal with reconstructing the rather out-of-date Gitolite version we had been using, along with our custom changes, we took the opportunity to start fresh and perform a much-needed, long-overdue upgrade. This brings with it the much nicer version 3 of Gitolite, which has some really interesting capabilities that may prove useful. It also brings a few changes in the syntax of the user-facing commands; I’ll write another blog post with the details soon, once all of the commands are ported.

In particular, note that the clone command is not currently ported. I need to discuss with Sitaram Chamarty, the author of Gitolite, some of the custom things we’d done with it, to ensure that the porting is performed properly.

One more thing about Gitolite, before I move on: during our panicked re-deployment of Gitolite, we ran into some troubles. Sitaram put everything aside to help us out. He’s been a great friend to KDE, and gets my thanks yet again.

Mitigation

Now that we’re aware of the problem, we need to solve it. Unfortunately, it’s not so easy.

Some things are not too bad. One immediate action we’ve taken is to put a check on the projects file. If a new projects file is generated and is more than 1% different from the previous file, the new file is rejected and the previous file is kept intact (at 1500 repositories, that means 15 repositories would have to be created or deleted in the span of three minutes, which is extremely unlikely). This check needs to be brought to the anongits as well, although that’s part of a larger, ongoing consideration.
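
A minimal sketch of that check, assuming the projects file is one repository path per line; the file handling and threshold constant here are illustrative, not our production script:

    # Minimal sketch of the projects-file sanity check, assuming one repository
    # path per line; file handling here is illustrative, not the production script.
    from pathlib import Path

    MAX_CHANGE_RATIO = 0.01  # reject changes touching more than 1% of repositories

    def accept_new_projects_file(old_path, new_path):
        old = set(Path(old_path).read_text().splitlines())
        new = set(Path(new_path).read_text().splitlines())
        if not old:
            return True  # nothing to compare against yet
        changed = len(old ^ new)  # repositories added plus repositories removed
        return changed / len(old) <= MAX_CHANGE_RATIO

    # If the check fails, the new file is discarded and the previous one kept.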

The larger issue, though, is detection of corruption. Even if the projects file doesn’t change too much, minor corruption could result in, say, two repositories being deleted from the anongits. Even if we detected that and re-cloned, the anongits could have corrupted clones. Which brings me to…

Git Isn’t as Safe as You Think

(Update: After I wrote this section, some Git developers worked through the process I used for testing. Some of my findings are particular to how I was testing, which is using --no-hardlinks instead of --no-local due to some confusion over the eventual behavior of each; the information in this section is therefore somewhat invalid. That said, there are still definite cases when Git will give you an exit code of zero on a clone even when it knows without a doubt that something is wrong. See this thread for more details. The original text still follows.)

Git is pretty safe. Usually. But it turns out that you can do some things that will cause it to be relatively quiet about problems, which can make it appear to you, the casual user or sysadmin, as if all is well. I just completed a bunch of testing in an attempt to understand how the repositories that did get re-cloned onto the anongits could have been corrupt, and here’s what I found:

Corruption of Commit Objects

If a commit object is corrupt, you can still make a mirror clone of the repository without any complaints (and with an exit code of zero). Attempting to walk the tree at this point will eventually error out at the corrupt commit. However, there’s an important caveat: it will error out only if you’re walking a path on the tree that contains that commit. This isn’t rocket science, and it’s obvious why this would be the case, but it means that if you want to attempt walking the tree to verify validity, you have to walk the tree starting at every ref.
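
As a rough illustration (not our actual tooling), a verification pass along these lines has to start a history walk at every ref and treat any failure as corruption; the helper below is hypothetical and relies on git rev-list erroring out when it reaches a broken commit:

    # Sketch: start a history walk at every ref, so that a corrupt commit on any
    # branch or tag is actually reached. Relies on `git rev-list` erroring out
    # when it hits a broken commit object; the helper name is hypothetical.
    import subprocess

    def walk_all_refs(git_dir):
        refs = subprocess.run(
            ["git", "--git-dir", git_dir, "for-each-ref", "--format=%(refname)"],
            capture_output=True, text=True, check=True).stdout.split()
        ok = True
        for ref in refs:
            # Walking only HEAD (or a single branch) would miss corruption that
            # sits on a ref you never traverse.
            result = subprocess.run(
                ["git", "--git-dir", git_dir, "rev-list", ref, "--"],
                capture_output=True, text=True)
            if result.returncode != 0:
                print("history walk failed starting at " + ref)
                ok = False
        return ok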

It takes a bit of polish off of Linus Torvalds’ statement at his Google Tech Talk:

If you have disc corruption, if you have RAM corruption, if you have any kind of problems at all, git will notice them. It’s not a question of if. It’s a guarantee. You can have people who try to be malicious. They won’t succeed. You need to know exactly 20 bytes, you need to know 160-bit SHA-1 name of the top of your tree, and if you know that, you can trust your tree, all the way down, the whole history. You can have 10 years of history, you can have 100,000 files, you can have millions of revisions, and you can trust every single piece of it. Because git is so reliable and all the basic data structures are really really simple. And we check checksums.

Linus is right – you can walk your tree and trust it all the way down, the whole history. But your Git repository can have a tree starting at one of a large number of refs, and unless you verify all of them, you can’t trust your repository as a whole.

git fsck will find these problems.

Corruption of Blob Objects

This is broadly similar, with an important difference. In this scenario, you can again make a mirror clone of the repository without any complaints, but attempting to walk the tree will happily produce an entire revlist all the way back to the first commit, with no hint that anything is wrong (and, again, this applies only to a ref/tree that sees this blob).

git fsck will find these problems, too.
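
Along the lines of what we did during the recovery, a periodic sweep could simply run git fsck over every hosted repository and flag anything that fails; a minimal sketch, with an assumed repository root:

    # Sketch: sweep every hosted repository with `git fsck --full` and report
    # anything that fails. The repository root is an assumption.
    import subprocess
    from pathlib import Path

    REPO_ROOT = Path("/srv/git/repositories")  # assumed location of the bare repos

    def fsck_everything():
        bad = []
        for repo in sorted(REPO_ROOT.glob("*.git")):
            result = subprocess.run(
                ["git", "--git-dir", str(repo), "fsck", "--full"],
                capture_output=True, text=True)
            if result.returncode != 0:
                bad.append(repo)
        return bad

    if __name__ == "__main__":
        broken = fsck_everything()
        print("%d repositories failed fsck" % len(broken))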

Mirror clones

Mirror clones (cloning with the --mirror flag) are designed to ensure that the clone stays current with respect to the refs from the upstream repository. New refs are pulled down to the downstream repository; all updates, including forced updates, are mirrored to the downstream repository.

However, it seems that making a mirror clone also entails a different mechanism of cloning, where objects are simply copied straight over rather than a custom pack being formed. As a result, making a mirror clone essentially bypasses the safety checks in the repository. Corruption upstream becomes corruption downstream, with an exit code of zero.
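
Given that, one obvious precaution is to never trust the exit code of a fresh mirror clone on its own, and to fsck the result before serving it; a small sketch, with a hypothetical URL and destination:

    # Sketch: never trust a fresh mirror clone's exit code on its own; fsck the
    # result before serving it. The URL and destination are hypothetical.
    import subprocess

    def mirror_clone_verified(url, dest):
        subprocess.run(["git", "clone", "--mirror", url, dest], check=True)
        # A zero exit code from the clone is not proof of a healthy repository.
        fsck = subprocess.run(["git", "--git-dir", dest, "fsck", "--full"])
        return fsck.returncode == 0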

Way Forward

As you can imagine, the KDE Sysadmin team is having to juggle a number of considerations now as we think about how to prevent this in the future:

  • One thing that will be put into place as a first effort is that one anongit will keep a 24-hour-old sync; in the case of recent corruption, this can allow repositories to be recovered with relatively recent revisions. The machine that projects.kde.org is migrating to has a ZFS filesystem; snapshots will be taken after every sync, up to some reasonable maximum number of snapshots, which should allow us to recover the repositories at points in time with relatively fine granularity (a rough sketch of such a snapshot rotation follows this list).

However, both of these are just patchwork solutions. Corruption that occurred on the server long enough ago may still end up being propagated to the anongits as new machines replace old ones, or as existing ones need a full re-clone for some reason. We have plans to limit the number of repositories synced to only those that have changed since the last run, but that’s more of a speed optimization than anything; a repository could be created and corrupted before it has ever been synced. And even though this could also keep corruption that shows up as rewinds from propagating down to the anongits, that’s the one type of corruption we can already deal with, using the Gitolite ref logs. (Of course, we weren’t actually backing these up regularly – but we will. There’s no reason they shouldn’t be backed up as often as the syncing occurs, and be snapshotted too, just in case.)

  • We could stop using mirrored clones. Originally, mirrored clones were in fact not used, but non-mirrored clones on the anongits come with their own set of issues, and are more prone to getting stopped up by legitimate, authenticated force pushes, ref deletions, and so on – and if we set the refspec such that those are allowed through silently, we don’t gain much. A hybrid approach of a non-mirror initial clone followed by a shift to mirror mode could force the server to validate the entire repository as it packs it, so that is something worth investigating.

  • We could keep a log of actions to each repository, and thus attempt to detect whether rewinds being synced down from the server are legitimate. While it could be done, it’s a fairly complex solution, and there are likely to be a lot of edge cases.

  • We could run git fsck on the server and on the anongits. A lot. If we are taking snapshots (such as on the projects box), doing full git fsck runs before the snapshot discard period would ensure that every repository has a consistent snapshot it can roll back to. Otherwise, though, this is a very resource-intensive process to be running near-constantly, which is what we would need to do to detect corruption as soon as possible.

  • We could use ZFS on git.kde.org the same way it’s being used on the projects.kde.org box. ZFS has checksumming that detects errors at both the hardware and filesystem level (it is designed to operate on the very valid assumption that disks and memory are, or will eventually go, bad; it’s relatively famous for detecting bad hardware that had previously been silently bit-flipping data for years). Its RAID-Z mechanism doesn’t suffer from the RAID write hole, it has excellent read performance given enough RAM, and it supports cheap copy-on-write snapshotting. The total size of all KDE Git repositories is about 25GB, so it wouldn’t take a ton of space for snapshotting directly on the server to provide a decent benefit. I’d love to see this in use, but I should admit that, after a couple of years of excellent experiences with it on Linux, I’m a ZFS fanboy at this point; I also don’t know how well it’s supported on SUSE, which is the distribution git.kde.org runs (although I’ve run it on Gentoo, Debian, and Ubuntu without any problems).
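
As mentioned in the first bullet above, snapshots after every sync with bounded retention are part of the plan for the projects box. A rough sketch of what that rotation might look like, with an assumed dataset name and retention count (roughly 24 hours of 20-minute syncs):

    # Rough sketch of snapshot rotation after each sync on the ZFS box. The
    # dataset name and retention count are assumptions, not our real configuration.
    import subprocess
    import time

    DATASET = "tank/anongit"  # assumed ZFS dataset holding the repositories
    MAX_SNAPSHOTS = 72        # assumed retention: about 24 hours of 20-minute syncs

    def snapshot_after_sync():
        name = DATASET + "@sync-" + time.strftime("%Y%m%d-%H%M%S")
        subprocess.run(["zfs", "snapshot", name], check=True)

        # List sync snapshots oldest-first and trim down to the retention limit.
        out = subprocess.run(
            ["zfs", "list", "-t", "snapshot", "-H", "-o", "name",
             "-s", "creation", "-r", DATASET],
            capture_output=True, text=True, check=True).stdout.split()
        ours = [s for s in out if "@sync-" in s]
        for old in ours[:-MAX_SNAPSHOTS]:
            subprocess.run(["zfs", "destroy", old], check=True)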

As you can see, we have a lot of things to think about.
