Distillation - Blanket Fort

Wow, things got crazy with my two previous posts about KDE’s Git corruption troubles.

Unfortunately, what became obvious from the comments on this blog (and, I assume, elsewhere, although I didn’t read comments on any other sites) was that the essential message was, almost universally, completely lost. I wrote the original post because KDE is an open-source project and we’ve never been about hiding issues from the community at large, so I felt it was perfectly fair to be open and honest about the troubles we had, in the hope that it could help other projects avoid encountering them. Rather than take something useful away from it, most people seemed to take the Gawker approach. That’s fine, and I take no offense at people shooting the messenger when it’s clear they didn’t actually read past the headlines, but the point was to make people, especially other open-source projects, think about their own systems and their procedures. If I helped one other project avoid data loss because they reexamined their own systems, then great.

So, I’m redirecting my previous two posts, about KDE’s Git troubles, to this post, which I’m going to keep relatively short — because I want to make sure the lessons I was trying to put out there for other open-source projects are very clear.

First the facts. Then the lessons.

The Facts

Three of them:

We needed a Git mirroring system because KDE’s SCM system has a lot of disk load. So we set up a mirroring system, not a backup system (yes, we are well aware of the difference). We carefully thought through many issues relating to mirroring (methodology, dealing with corruption, dealing with network outages). In a pinch, it could have served as a backup system in some situations, but that was not its design goal. Not everything went smoothly at first; for instance, notifying the anongits to pull when new objects were pushed to the main server often proved unreliable, so we had to switch to a time-based polling method. This meant lots of time spent getting the kinks worked out.
This happened during the transition to Git itself, which means that at the same time we were setting up: Identity (the authentication service); ~~gitweb~~, then ~~cgit~~, then GitPHP for quickgit.kde.org; Redmine (now ChiliProject) for projects.kde.org; the quick-link generator at commits.kde.org that lets you get an instant link when pushing; Gitolite itself; custom hooks for Gitolite to allow for our particular access control and repo location needs; a system to poke anongits to update certain repos outside their normal schedule; the tarball generator and hosting; Smart HTTP support; Reviewboard (Git instance); plus a ton of non-user-facing things like the scripts that synchronize SSH keys around, the scripts that generate the XML files from ChiliProject to use for various other purposes; and so on, and on, and on, and on. Not to mention normal requests/requirements, sysadmin bug handling, account handling, etc. This meant lots of time spent…on everything, really. There are thousands of contributors and millions of users; sometimes hundreds of emails go to the sysadmin mailing list in a day.
The sysadmin team is entirely volunteer, and real life gets in the way. For the past year and a half or two years, for instance, I’ve mostly only had time to keep a few things humming along, due to increased time pressures in real life. That’s put even more of a burden on the remaining sysadmins, and the team is always overburdened, because in addition to normal maintenance requests, server problems, and so on, there is a list of things we want to do that is 8 km long.

Interlude

A corrupt filesystem on the server led to a corrupt master projects file, which led to the anongit mirrors removing repositories locally. Sanity checking the projects file on the anongits might have helped prevent that propagation (hence the original title of my first post, “Too Perfect A Mirror”), but we didn’t do that…because the worst-comes-to-worst scenario was always wiping out the local anongit mirror and re-mirroring, which didn’t take all that long.
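
As a rough illustration of the kind of sanity check mentioned above, here is a minimal Python sketch. It is not what KDE's anongits actually run; the function name, the exception, and the 5% threshold are all hypothetical, chosen only to show the idea: before acting on a new project list, refuse to delete anything if the list is empty or would remove an implausibly large share of the local repositories.

```python
from typing import Iterable, Set


class SuspiciousProjectList(Exception):
    """Raised when a new project list looks too different to trust."""


def repos_safe_to_remove(
    local_repos: Iterable[str],
    new_project_list: Iterable[str],
    max_removal_fraction: float = 0.05,  # hypothetical threshold, tune to taste
) -> Set[str]:
    """Return the local repos missing from the new list, or raise if the
    new list looks corrupt (empty, or dropping too many repos at once)."""
    local = set(local_repos)
    # Ignore blank lines and stray whitespace in the incoming list.
    listed = {name.strip() for name in new_project_list if name.strip()}

    if local and not listed:
        raise SuspiciousProjectList("new project list is empty")

    to_remove = local - listed
    if local and len(to_remove) / len(local) > max_removal_fraction:
        raise SuspiciousProjectList(
            f"new list would remove {len(to_remove)} of {len(local)} repos"
        )
    return to_remove
```

A mirror using something like this would skip the sync and alert a human when the exception is raised, rather than deleting anything. The trade-off is that a legitimate mass cleanup then needs a manual override.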

We got lucky, because one of the anongits had failed to synchronize its master projects list and thus had clones of all 1500 repositories still on-disk. Otherwise, we would have had to resort to more painful ways of getting the data back, such as using our distribution tarballs (which we use for developers that can resume an HTTP connection but due to disadvantaged Internet connections have trouble making initial clones of repositories) which are also mirrored, but at a slower pace than the main repository syncing; or having the community at large reconstruct the various repositories. Keep in mind that half of those were personal clones or scratch repositories that had not been updated in a very long time and probably would not have caused anyone to blink if they’d been lost. So we had other avenues to get the data back, but they were much more painful.

The Lessons

Two lessons:

Overloaded teams can easily overlook things, even obvious things.

Setting up all of the various bits and bobs of the Git infrastructure was at least six months of non-stop work, whenever any of us had any free time. This also meant putting off necessary things like machine migrations/upgrades, service replacements/upgrades, and so on, so even when the Git work slowed, there were tons of other things waiting that had grown more urgent with time. It’s not a huge logical leap to realize that when demand far outstrips resources, even basic things can be forgotten, or, as a result of being basic, assumed to have been done. I personally am a backup fiend with my own data; I have three copies of the pictures and documents I really care about in places around the world. But when I had time for KDE things, I was thinking about the things specifically on my plate (during that period of time it was not uncommon to sign onto IRC and have three or four highlights with things that needed to be done ASAP), and six or nine months into the huge amount of work that took place at the time of the Git transition, the idea that backups hadn’t been set up didn’t even occur to me. It’s normally something you’d set up at the beginning, but the beginning for us didn’t even slow down for half a year.

I haven’t asked, but I’m guessing this was the case with the other sysadmins too. They’re extremely talented, responsible people that care a lot about KDE and dedicate huge amounts of time towards making things work smoothly. We have a backup server that nightly backups of various KDE services are sent to, and setting up backups is routine on new systems, including many new systems installed or set up since the Git transition. But, the work never slows and the list of things to do only gets longer, and it can be hard to find the time to go back to everything that isn’t currently on fire or in dire need of upgrading and make sure all of the boxes are properly ticked.

So think carefully about the things that are most important to your project. Double check what you think you’ve done, and ensure that it matches what you have actually done. If your list of things to do only gets longer, you might as well let it grow (it will anyways) and turn a critical eye backwards.

The distributed nature of open-source can help things fall through the cracks.

Like many open-source projects, much of the discussion and planning between the KDE sysadmins happens on IRC. This means that things are discussed at ad-hoc times and sometimes without all parties present, which can make it far easier to fall into the trap of thinking that somebody else has taken care of things. It’s easy to see how missing an IRC conversation can keep you uninformed about state; but the opposite edge of that sword is thinking that things are being taken care of when you’re not around watching the conversations. Maybe the things you thought would happen happened; maybe they didn’t; maybe something else happened instead.

Recently, we’ve been using better tools to track what needs to be done, but outside of Bugzilla (which is not really great as a To-Do list) we didn’t have those tools two or three years ago, and never really fell into the pattern of using Bugzilla as an internal task tracker…we use it for sysadmin bug requests, but in my view it’s a bit unwieldy for keeping track of internal state.

When you don’t have a daily scrum or weekly group meetings, it can be too easy for miscommunication to lead to later problems. Make sure to find ways to bridge that communication gap, because that lack of communication can lead to assumptions, or simply things falling through the cracks, and both can be deadly.