After my previous post came out, a lot of the reaction in the comments and on Hacker News, Slashdot, and Reddit Programming/Reddit Linux was essentially the claim that mirrors aren’t backups. Here are some quotes:
My thinking here is screw the mirrors.
No. A hundred times no. Mirrors, RAID, ZFS, versioning systems, etc are not backups.
The main point is this: they broke an ops 101 rule.
Sysadmins: MIRRORING IS NOT A BACKUP SOLUTION. STOP DOING THIS!!!
I think quite a few commenters didn’t understand the import of the section I wrote about the unexpected behavior of git clone --mirror – something that we were not aware of, and that in fact has more implications for a backup strategy than we realized. If you haven’t read that section, read it, because it’s very relevant.
I also think a lot of commenters saw the word “mirror” and made assumptions (LVM/RAID/DRBD/etc.) that aren’t true. In this case a “mirror” means a special kind of Git clone.
Regardless, it’s true that in most cases mirrors aren’t backups. But with Git repositories, traditional backups aren’t backups either. So I’m going to address various backup strategies and the problems/challenges of making them work with a Git system.
Clearing a Few Things Up
I want to clear a couple things up before proceeding:
We Had Tarballs
I forgot to mention that we actually have tarballs of every single repository, updated every couple of days and also transferred to the anongits. These are not full, true mirrors of the repositories, simply normal clones, and they’re designed to make it easier for our contributors or users in low-bandwidth environments to fetch initial clones. They don’t contain our full repository metadata, so we would have lost all of that. And it’s not a perfect backup strategy by any means, as I’ll detail below. But it’s something.
Due to Git’s distributed nature, we could of course have gone through a Herculean effort to find contributors with the latest local clones of repositories and fetch the repositories from them. This is not ideal for a number of reasons, chief among them that we may never find some repositories if nobody has cloned them recently, we may never find repositories anywhere near their current state, and we cannot validate the integrity of said repositories. But the option exists, and at much smaller scales it would work fairly decently as a simple, first-order safety mechanism.
I’m going to go through some of the various backup strategies that we thought about, and the problems with them, so that anyone interested can understand why we had settled on a system of using the anongits as the backups.
What’s really important is to remember that we have 1500 repositories and growing, and we have contributors on everything from university LANs to dial-up modems. This brings up a very important point:
At any given time, for any given repository, we cannot assume that the repository is not in the process of being updated. To guarantee that, we would have to put in a number of kludgy, invasive steps: mark the repository as disabled, wait for some period of time while we hope that people on slow/lossy/often-unavailable connections are not still uploading new commits or packs (which can be very large, as these contributors tend to commit more locally at a time while waiting for connections to be available), and only then perform a backup.
As for RAID, LVM, DRBD and the like: I don’t really need to go into these here. They are useful for keeping your server up and running, not for preventing corruption, and block-level mirroring is not the type of mirroring my post talked about.
So let’s assume we want to do a normal files backup using a tarball of the repository. Keeping in mind what I said about the various contributor connections, how do we do this in a way that doesn’t effectively keep our various repositories unavailable in a rolling fashion?
One way is to do a (normal) clone and make a tarball of that clone. This is in fact what we do to create the tarballs as an alternate repository distribution mechanism. And this is sane, because the normal clone mechanism walks through the objects to generate a pack, and will find corruption along the way.
But you can’t just tar.gz up the bare repositories on the server and hope for the best. Maybe a given repository will be in a valid state; maybe it won’t. If you lose the server in a disk crash and try to restore a repository that was in the process of garbage collecting or writing a new pack when you tar’d it up, your backup is likely useless. Some backup.
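To make the distinction concrete, here is a minimal sketch of the clone-then-tar approach. The paths and toy repository are invented for the demo; the key point is that cloning via the full transport makes Git generate a pack, so the tarball only ever contains data Git has just verified.

```shell
#!/bin/sh
# Sketch: back up a repository by cloning it first, so Git walks and packs
# the objects (failing loudly on corruption), instead of tarring possibly
# inconsistent files straight off the disk. All paths are throwaway.
set -e
WORK=/tmp/backup-demo
REPO=$WORK/source.git
rm -rf "$WORK" && mkdir -p "$WORK"

# fabricate a server-side bare repository with one commit
git init --quiet --bare "$REPO"
git clone --quiet "$REPO" "$WORK/seed" 2>/dev/null
( cd "$WORK/seed" \
  && git -c user.email=a@b -c user.name=demo commit --quiet --allow-empty -m init \
  && git push --quiet origin HEAD:master )

# the actual backup step: clone first, then tar the clone, never the live
# repo. file:// forces the full pack-generating transport instead of the
# local hardlink shortcut, so every object is actually read and checked.
git clone --quiet --bare "file://$REPO" "$WORK/backup.git"
tar -C "$WORK" -czf "$WORK/backup.tar.gz" backup.git
echo "backup ok"
```

If any object in the source repository were corrupt, the clone step would abort and no tarball would be produced, which is exactly the behavior you want from a backup job.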
This isn’t even like backing up a database. Sane databases have a way to export a backup consistent at the point at which you began the export. Git has that, in the sense that you can clone from the server at any time; however, we didn’t know that a --mirror clone of a repository can copy objects over without verifying them. If we had known that, it would have changed our thinking right at the very beginning. And this isn’t an oversight on our part… it’s not detailed in the man page. The man page essentially says that the refs will mirror the refs on the server. That’s not the same as saying that all objects, whether corrupt or not, will mirror the state on the server. (UPDATE: in fact, it appears that the problem was our testing methodology after the fact, where we should have used --local but instead used --no-hardlinks; the man page is consistent. However, there are still failure states you can hit when using mirrors that will cause git clone to exit with a status code of zero. See this mail.)
Once you have a mirror clone, future updates on a repository, if inconsistent due to corruption on the server, will be detected, because the server won’t be able to generate the pack successfully. But that’s only in the future.
So let’s say that we’d created a mirrored clone of each repository and made a tarball out of that, on a filesystem that is slowly corrupting objects. When things started to go south, these tarballs would have been mirroring the corruption, regardless of where we put the tarballs in the end. And again, we had no idea that --mirror would behave that way.
Yes, we could run git fsck on every repository beforehand, but that significantly increases the amount of time the backups take. It’s possible that, given the heavy I/O load on the server, taking a day’s worth of backups would actually have taken more than a day. I’m not sure, but I do know it would take a very long time.
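For reference, the fsck sweep described above would look something like this sketch (paths are invented, and the loop body is deliberately simple; the expensive part is the full object walk per repository, multiplied by 1500 repositories):

```shell
#!/bin/sh
# Sketch of an "fsck everything before backing up" pass over a repository
# root. With a throwaway demo repository standing in for the real tree.
set -e
REPOROOT=/tmp/fsck-demo            # stand-in for the server's repo root
rm -rf "$REPOROOT" && mkdir -p "$REPOROOT"
git init --quiet --bare "$REPOROOT/example.git"

failed=0
for repo in "$REPOROOT"/*.git; do
    # --full walks every object in the repository; this per-repo walk is
    # what makes the approach so expensive under heavy I/O load
    if ! git --git-dir="$repo" fsck --full >/dev/null 2>&1; then
        echo "CORRUPT: $repo"
        failed=1
    fi
done
[ "$failed" -eq 0 ] && echo "all repositories passed fsck"
```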
(UPDATE: in fact, we were right in thinking that a dumb tarball is not always safe, although the probability of problems is relatively low. See this mail.)
What about the venerable rsync? It has the same problem as simply making live tarballs of the repository. Rsync over files in the middle of being written, and your backup could be corrupt even if the repository on the server is sane. Rsync over corrupted files, and your backup is worthless, because unlike Git, rsync never does sanity checking – not on the initial copy, not on later syncs.
How Long? How Much?
So assume that we were to make sane tarballs – which in the past we might unwittingly have failed to do, since we’d likely have used --mirror to get all non-normal refs – and store them. How long? How much?
There is some evidence that some corruption started as early as February 22nd. While our anongits would already have had pristine objects for previously cloned repositories – and corruption in new objects would have been detected – making filesystem-level backups of these repositories would not have helped unless we’d had more than 30 days of previous backups stored somewhere.
KDE isn’t considered a sexy open source project. We’re one of the largest open source projects in the world, in terms of active contributors, but we don’t get much corporate investment. Nearly all of our resources are volunteer. Several of our machines are VMs running on people’s personal servers – resources donated to the project by willing individuals (including myself) that have extra capacity. The money that does come in goes, by and large, towards helping our contributors attend sprints and our annual conference.
So, let’s say that we wanted to keep 30 days of backups (which, again, might in this case not even have been sufficient, and which even assumes that our tarballs were consistent, which may not have been the case using --mirror due to its unexpected behavior). That’s about 900 GB of data. I’m not sure we have that much space lying around anywhere, much less on git.kde.org.
One commenter said that S3 is cheap and we ought to be using it. It’s true that the Glacier tier is relatively cost-effective. When we began, S3 was not an option, as prices were higher and Glacier was not available. It’s a potential option if we decide tarballs are the way to go, but as pointed out earlier, filesystem-level backups cannot be assumed to be valid unless we spend the extra resources on constant fscking before every repo is backed up, every day – whereas normal git fetches (and non-mirror clones) give us that for free.
So we could store backups going back for a while, sure, but where is the cutoff? How far back? How many? As pointed out here and in the previous post, even 30 days ago, without consistent git fsck runs, our tarball backups could have been bogus.
(Update: People have mentioned tools like tardiff and rdiff-backup. Those can certainly help cut down storage space, but they don’t deal with the other problems. They could be run against mirrored clones after a git fsck, but that whole process would still have to take place, and with this many repositories it is a real burden in terms of resources. More below on our current line of thinking.)
Were We Wrong?
It’s super easy to say that the KDE sysadmins skipped ops 101 by structuring our backups how we did. But we didn’t. We thought about all sorts of problems that could come up with different backup strategies, many of which I’ve outlined in this post.
In the workflow we had designed for the anongits, everything would have used Git’s built-in consistency checking. The idea is that the initial clone ensures everything is consistent (as we now know, that’s not valid for mirrored clones). When fetches happen, the fetch causes the server to walk through the relevant objects, which will detect corrupt objects at that time and not sync them down to the mirror (which, as far as I can tell, is still valid with mirrored clones; it’s only the initial clone that doesn’t check this).
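Here’s a toy demonstration of that fetch-time property, with invented throwaway paths: a mirror clone is taken while the server is healthy, the server’s loose objects are then corrupted on disk, and the next fetch fails because the server must walk those objects to build the pack it sends.

```shell
#!/bin/sh
# Demo: corruption introduced on the server AFTER the initial clone is
# caught by the next fetch. All paths are throwaway; this is a sketch.
set -e
DEMO=/tmp/fetch-demo
rm -rf "$DEMO" && mkdir -p "$DEMO"

git init --quiet --bare "$DEMO/server.git"
git clone --quiet "$DEMO/server.git" "$DEMO/pusher" 2>/dev/null
( cd "$DEMO/pusher" \
  && git -c user.email=a@b -c user.name=demo commit --quiet --allow-empty -m one \
  && git push --quiet origin HEAD:master )

# the "anongit" mirror clone, taken while the server is still healthy
git clone --quiet --mirror "$DEMO/server.git" "$DEMO/anongit.git"

# a new commit arrives on the server...
( cd "$DEMO/pusher" \
  && git -c user.email=a@b -c user.name=demo commit --quiet --allow-empty -m two \
  && git push --quiet origin HEAD:master )

# ...and then the filesystem silently corrupts the server's loose objects
find "$DEMO/server.git/objects" -type f ! -path '*/pack/*' \
  -exec sh -c 'chmod u+w "$1" && printf garbage > "$1"' _ {} \;

# the next fetch forces the server to read those objects to build a pack
if git --git-dir="$DEMO/anongit.git" fetch --quiet origin 2>/dev/null; then
    echo "fetch succeeded"
else
    echo "fetch detected the corruption"
fi
```

The fetch exits non-zero, and crucially the mirror’s existing good objects are left untouched, which is exactly why the anongits were useful as backups at all.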
The idea that we failed to take backups, that mirroring isn’t backing up, shows a lack of thinking about the problem in Git terms. We thought a lot about backups, and how to ensure the backups are consistent so that they are useful, and how to fit this paradigm into the resources available to us. And we came up with a plan that ensures that we have multiple copies of our data, spread around the world, and actually takes advantage of the problem presented to us and uses Git’s own mechanisms to ensure integrity.
After this ordeal, I don’t think our plan was wrong outright – but it had bugs in it. The main bug was placing our trust in the list of valid repositories coming from the server. We spent time thinking of all the ways that we could handle repository corruption, and didn’t think about corruption to the master project list. We also then failed to sanity-check this, for instance by making sure that the changes in this project list were reasonable. This is what led to the mirrors removing the repositories from their local systems. This obviously needs to be changed.
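The missing check could be as simple as a reasonableness test on the incoming project list. A minimal sketch follows; the file names, the fabricated lists, and the 5% threshold are all invented for illustration, not our actual tooling:

```shell
#!/bin/sh
# Sketch: before acting on a new project list from the server, refuse
# wholesale removals. A list that suddenly loses a large fraction of its
# repositories is more likely corrupt than a genuine mass deletion.
DEMO=/tmp/sanity-demo
OLD=$DEMO/projects.old
NEW=$DEMO/projects.new
rm -rf "$DEMO" && mkdir -p "$DEMO"

# fabricate an old list of 100 repositories and a new one missing just one
i=1
while [ "$i" -le 100 ]; do echo "repo-$i"; i=$((i + 1)); done > "$OLD"
grep -vx 'repo-7' "$OLD" > "$NEW"

old_count=$(wc -l < "$OLD")
removed=$(grep -cvxFf "$NEW" "$OLD")   # lines in OLD absent from NEW

# allow at most 5% of repositories to vanish in a single sync
if [ $((removed * 100)) -gt $((old_count * 5)) ]; then
    echo "REFUSING sync: $removed of $old_count repositories removed"
else
    echo "project list change looks reasonable ($removed of $old_count removed)"
fi
```

Combined with archiving removed repositories for a grace period instead of deleting them, even a check this crude would have stopped the mirrors from wiping their local copies.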
The second bug was thinking that git clone --mirror behaves the way we expected it to behave based on the man page. It doesn’t, and we didn’t know that – and not because we didn’t RTFM.
So there are two clear action items (and a number of smaller ones I won’t list here):
- We need to ensure that the projects file is properly sanity checked. In addition, any repositories that the server says should be removed should instead probably be archived for some period of time.
- When we perform a fresh mirror clone of a repository from the server, we must immediately run a git fsck on it and ensure that its integrity is intact. If not, we need to let our sysadmins know, right now.
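As a rough illustration of that second action item, here is a minimal sketch. The paths are throwaway, the toy repository is fabricated for the demo, and the notify_sysadmins hook is a placeholder for whatever alerting mechanism is actually in place:

```shell
#!/bin/sh
# Sketch: mirror-clone a repository, then fsck it immediately; alert a
# human if the integrity check fails instead of trusting the clone.
set -e
DEMO=/tmp/mirror-fsck-demo
rm -rf "$DEMO" && mkdir -p "$DEMO"

# stand-in for a repository on the central server
git init --quiet --bare "$DEMO/server.git"
git clone --quiet "$DEMO/server.git" "$DEMO/seed" 2>/dev/null
( cd "$DEMO/seed" \
  && git -c user.email=a@b -c user.name=demo commit --quiet --allow-empty -m init \
  && git push --quiet origin HEAD:master )

# fresh mirror clone, followed by an immediate integrity check
git clone --quiet --mirror "$DEMO/server.git" "$DEMO/mirror.git"
if git --git-dir="$DEMO/mirror.git" fsck --full >/dev/null 2>&1; then
    echo "mirror clone verified"
else
    echo "mirror clone FAILED fsck" >&2
    # notify_sysadmins "$DEMO/mirror.git"   # hypothetical alerting hook
    exit 1
fi
```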
One More Thing
No. A hundred times no. Mirrors, RAID, ZFS, versioning systems, etc are not backups.
This comment stuck with me, because it shows that people assume backups for our repositories are like backups for anything else. Those things are completely valid as components of a backup system. Backup systems have to be tailored to the data you’re working with and the resources you have, and statements like this ignore the fact that backing up a heavily used server, with its specific data types, specific resources, and specific users, is not like backing up anything else.
In fact, that comment is specifically wrong in this case. I mentioned that the new server for projects.kde.org is storing its mirrored repos on a ZFS filesystem and that it will be snapshotting those repositories after each sync, where feasible. Imagine if one of the existing anongits had already had that system in place: rather than relying on a stroke of luck that kept all of our repositories intact on a tangential server, we could simply have rolled back to the snapshot from an hour or two before, and then transferred those repositories up to the main git.kde.org server. The flaws in our mirroring scripts wouldn’t even have mattered, although we would have discovered them anyway.
I don’t know what anyone else calls that, but I call that a backup.