Database programming is fun!

In recent years, there’s been a resurgence of interest in the low-level internals of databases, most often in the form of key-value stores. My favourite amongst the current crop is Redis, because of its emphasis on simplicity, and the fact that it is a data-structure store: Redis’ values are drawn from a small but well-chosen set of data structures, which you can operate on with a no-nonsense API. In this post, I’m going to talk about my experiences writing a key-value store; not because the result was particularly interesting (if you are looking for one, just use Redis!) – but because the process I went through was so enjoyable, I want other people to have a go too!

RAM == Hipster Disk?

Way back in mid-2008, I read an article that amounted to ‘Disk is the new tape; RAM is the new disk’, or something like that. I’m not even sure I read the actual text – but the title captivated me. The prospect of writing an ACID database with disk-based B-trees did not appeal in the slightest; but a RAM-based data store – well, I wrote those every day in my games! I took this to heart, and decided that LBP(2) should use an all-in-memory, single-threaded, event-driven key-value store to hold its data – mainly because I thought it would be interesting, yet simple, and secondly because I didn’t like the fact that the first game’s servers relied heavily on code I couldn’t understand, build, or maintain (being hosted outside of MM, written in Java, and relying heavily on middleware that was never designed for an LBP type of game). Classic ‘not invented here’ syndrome, and I’m happy with that ;)
At the time, there wasn’t anything quite right, so I thought I’d write my own database in C. This is only slightly insane – the key point that many people miss when dismissing such a decision is that a game development studio (as opposed to a web-startup, say), almost always, is chock full of people who enjoy solving difficult problems from first principles, normally using C. This is our bread & butter, and it’s what we do best. Why not apply it to the servers, as well as the client, of your game?
The other decisions were not entirely arbitrary: single-threaded & event-driven meant I could avoid worrying about threading & locking, which I didn’t fancy tackling this time around; key-value, because I didn’t want to write an SQL parser, and I knew pretty much exactly what all my data usage patterns were going to be, having shipped LBP1 (beta) already. I also knew I would need replication & sharding (there were about half a billion key/value pairs stored in the LBP production database at that time – had it been a key/value store, that is! – and that wasn’t going to fit into the target 32 GB RAM machines) – but I also wanted efficient range queries (scoreboards, hot levels, activity streams, etc). I won’t cover all of that here – it’s just an exercise in data structures, for which, read my last post ;) – instead, I’m just going to cover a couple of observations and lessons learnt. The bottom line really is that I’d encourage all low-level-C-coding junkies to have a go in this space if you’re looking for a side-project – it’s massively illuminating (I learnt a lot), requires careful data structure design (lovely!), and is entirely non-trivial while also being small enough to be tractable as a side-project. Hence the large number of startups in this space, I suspect!
The experiences described here are those of a first-time-database-engine programmer; I wouldn’t do it exactly the same way a second time. The intention here is not to provide a recipe, but to provide enough off-hand comments and tantalising keywords that maybe one reader will be intrigued to go off and do something different & cool. For what it’s worth, after a good year or two of heavy iteration, the key-value store that embarrassingly became known as ‘alexdb’ is now happily in production, serving several hundred million queries a day on a cluster of servers somewhere in Ireland. Joy. Many thanks to my co-workers at the MM & Sony server teams (James, Mike, Jon, Ben, Gavin, Sunmee) for helping make that happen.

SpiritOf(DB code) == SpiritOf(Game engine code)?

As a graphics programmer by trade, the parallels between GPU coding / embedded systems / console programming and database engine programming were striking (and not obvious to me before embarking on this journey). Consider that a decent K-V store should be able to handle on the order of 100,000 requests per second, and deal with gigabytes of data (say goodbye to your L2 cache!). That means that you have to be squarely in ‘console programming mode’: lean, simple code is all that can possibly run in time, to pull a request ‘off the wire’, parse it, look up the answer, serialise the result back onto the wire and write anything to disk that you need to, 100,000 times a second. There is no room for sloppy coding, excessive dynamic memory allocation, or ‘lumpy’ performance. Indeed, it felt even more constraining than the realtime code I’m used to writing; 33ms feels like a luxury when you’re trying to fit a request in 0.01ms!

Thou shalt not block, EVER.

The first lesson I learnt, that isn’t written about nearly enough, is that in a single-threaded / event-driven server, you simply can NEVER, EVER afford to block, for even a millisecond. Well, lots of people say that, but they don’t mention how hard it is to pull off. If you do block, you’ll get lots of requests queuing up, and then huge lumps in your performance; the latency of the system will shoot up to the duration of your block, and this will apply to ALL the queued requests. Concretely, if you have 100k requests per second, and just one of them blocks for 10ms (not a long time!), then the 1,000 requests following the blocky one will ALL appear to be as slow as the block. And that’s assuming you don’t block again. OUCH!
Fine, you say. I won’t block! Easier said than done. humble malloc() on linux can easily take many milliseconds to run, depending on what other processes are doing (OS locks, disk, VM get in the way) – and you’ll only find this out when the load average is nice and high; a simple pointer dereference that causes a VM page fault causing the OS to go out to disk for memory can block you for AGES (mechanical disks); and most databases will at some point want to write to disk (access logs, database writes…). And don’t get me started on other POSIX calls that unexpectedly do disk access to look up locale information…!
The bottom line is that on a heavily loaded machine, it’s entirely possible that simple syscalls will take an AGE – occasionally. So, no mallocs, disk access, logging or even printf’ing should be done on your main processing thread. (See? it’s like game programming!) It’s absolutely fine for a particular request to take a long time, so long as it makes way for other work to be done in the meantime – you’ll want to break out threads here, or as I did, write some kind of coroutine-like thing. Mine was based on Simon Tatham’s delightfully ugly ‘Coroutines in C’ – chosen because each coroutine doesn’t need to allocate its own stack. All disk access and other stuff is then shoved off onto worker threads that do the slow stuff at their leisure. All rather similar to systems you probably have in place for kicking off SPU jobs etc… I didn’t successfully eradicate all slow, blocking calls – and the remaining lumps are the curse of our load-tests. On my tombstone, I will write: ‘he wrote an evented server; but he blocked for 10ms; what a b**tard, now 50,000 children have to wait bloody ages for their cool levels’.
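For the curious, the Tatham trick abuses switch/case to resume a function mid-body; a minimal sketch (the macro names and the toy counter are mine, not alexdb’s):

```c
#include <assert.h>

/* Stackless coroutines via switch/case, after Simon Tatham's
 * 'Coroutines in C'. All state lives in an explicit struct,
 * not a per-coroutine stack -- which is the whole appeal. */
#define CR_BEGIN(ctx)    switch ((ctx)->line) { case 0:
#define CR_YIELD(ctx, v) do { (ctx)->line = __LINE__; return (v); \
                              case __LINE__:; } while (0)
#define CR_END(ctx)      } (ctx)->line = 0; return -1

struct counter { int line; int i; };

/* Yields 0, 1, 2, then -1; each call resumes where it left off. */
static int count_to_three(struct counter *c)
{
    CR_BEGIN(c);
    for (c->i = 0; c->i < 3; c->i++)
        CR_YIELD(c, c->i);
    CR_END(c);
}
```

The catch (as Tatham notes) is that locals don’t survive a yield – hence the loop counter living in the struct.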

1,000,000,000 > 1

My second and final lesson for this blog post was that 1 billion is a really, really big number. Game programming operates at a certain scale (gradually growing every year), but it’s just nothing like ‘webscale’. Writing and tuning a database only really starts making sense when you are testing on real data – and real data means regularly loading and saving on the order of billions of items.
The first attempt at my database would write to an append-only log file, and occasionally ‘checkpoint’ its state to a big file. This is fairly standard practice. However, on 50 GB datasets, this checkpointing took a significant amount of time (20 minutes or more, I forget – too painful to remember!); this led to two problems: consistency of data during the long checkpoint, and speeding the damn thing up.
First, consistency: to stay true to the simplicity of avoiding the need for locks, I wondered how I could write a consistent snapshot to disk, while the data was still being mutated by incoming writes. I’d read Poul-Henning Kamp’s amusing ‘what’s wrong with 1975 programming’ rant (also, take a gander at this unrelated but good article by the same chap) and suddenly had a brainwave: Linux implements ‘copy-on-write’ semantics for memory pages when a process forks. That means that my database could just fork() when it wanted to save a checkpoint. The child process gets an unchanging copy of all of the memory, just for the cost of a page-table copy in the OS; any writes in the parent process just cause the OS to do all the work for us, and duplicate the pages that are changing. The child can checkpoint at its leisure.
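In code, the idea is roughly this (a sketch under my own names and error handling, not alexdb’s actual code):

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Checkpoint by fork(): the child inherits a copy-on-write
 * snapshot of the parent's memory, so the parent keeps serving
 * writes while the child serialises a frozen image to disk. */
static void checkpoint(const void *data, size_t len, const char *path)
{
    pid_t pid = fork();
    if (pid == 0) {                /* child: sees the frozen snapshot */
        FILE *f = fopen(path, "wb");
        if (f) {
            fwrite(data, 1, len, f);
            fclose(f);
        }
        _exit(f ? 0 : 1);          /* _exit: skip parent's atexit handlers */
    }
    /* parent: carries straight on; reap the child later
     * (e.g. via SIGCHLD or a periodic waitpid). */
}
```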

all fork()s lead to Redis

The simplicity of this was striking – the entire checkpoint algorithm effectively boiled down to fork(); fopen(); fwrite(); exit(); – but, not being a Linux guru, I was worried there must be something wrong that I was missing. A bit of googling led me straight to the then-nascent Redis project, which uses the technique itself (or at least used to): back then it was a single source file of elegant C and hadn’t really been ‘noticed’. I’ve been following it ever since, and I really think its creator @antirez is a bit of a genius…
Anyway, beyond finding a cool project to follow, the Redis source gave me hope that the fork() technique was going to work; now I just had to tackle the long loading times (ooh, game programming parallel again!).
AlexDB (cringe) has a text serialisation format, making it very easy to read, debug, and filter its write-log files. Originally, the checkpoint files were written & read in this format. However, the time taken to read, parse and act on a 500,000,000-line file was proving prohibitive – and you paid that cost on every boot. That made the compile-edit-test cycle somewhat awful.
The solution I present here as much for your amusement as for your edification:

malloc(34359738368)

I decided that the server should just dump its state to disk with a single fwrite()-type call. That way, I can just be disk-bound. But how to deal with pointers? In true embedded-system style, the database program calls malloc() once on boot, grabbing a single 32 GB block of RAM. All pointers in the program were replaced (with a bit of C++ template magic and some macros for the C bits) with 32-bit indices, indexing from the single base pointer in steps of 8 bytes. This made the entire memory image completely relocatable, and had the added advantage that all pointers were now only 4 bytes instead of 8, while still allowing 32 GB to be addressed. (This was a huge win, in a system where memory savings are EVERYTHING.) The checkpoint code became the desired single fwrite() (in fact, a gzwrite – we ended up compressing, trading a bit of CPU usage to avoid being quite so disk-write-bound) – and the system felt suitably simple, eccentric, and fast that I could happily move on to the next problem.
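The indexing scheme looks roughly like this in plain C (names are mine; the real thing wrapped it in C++ templates, and naturally everything in the arena must be 8-byte aligned):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One big arena; 'pointers' are 32-bit indices counted in 8-byte
 * steps from its base, so 2^32 * 8 bytes = 32 GB is addressable
 * with only 4 bytes per pointer -- and the whole memory image is
 * relocatable, since only g_base is an absolute address. */
static uint8_t *g_base;            /* set once at boot */

typedef uint32_t rel_ptr;          /* relocatable 'pointer' */

static rel_ptr rel_encode(void *p)
{
    return (rel_ptr)(((uint8_t *)p - g_base) >> 3);
}

static void *rel_decode(rel_ptr r)
{
    return g_base + ((size_t)r << 3);
}
```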
(You may ask why I didn’t use mmap() to just keep the big 32 GB block mapped to a file. The answer is – I wasn’t sure how that would work, or how you might control the rate at which the OS flushed pages to disk. I also wanted to be able to have a ‘history’ of checkpoints, written to a temporary file and atomically renamed when they were known good – so that I could avoid coming across ‘half-baked’ checkpoints. But I’m sure one could come up with a nice system built around something like mmap. Go ahead!)
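The write-then-rename dance, sketched (my names, my error handling; a production version would also fsync() the file, and the directory, before trusting the rename):

```c
#include <assert.h>
#include <stdio.h>

/* Write a checkpoint to a temporary file, then rename() it into
 * place. rename() atomically replaces the target on POSIX
 * filesystems, so readers see either the old complete checkpoint
 * or the new complete one -- never a half-baked file. */
static int checkpoint_atomic(const void *data, size_t len, const char *path)
{
    char tmp[512];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;
    size_t wrote = fwrite(data, 1, len, f);
    if (fclose(f) != 0 || wrote != len) {
        remove(tmp);               /* don't leave a corpse behind */
        return -1;
    }
    return rename(tmp, path);      /* the atomic 'commit' */
}
```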
AlexDB went on sprouting more and more features – as these things do – growing HTTP clients and servers, efficient resource serving, image manipulation, and even an embedded ruby interpreter; and while such a large system inevitably involves its fair share of screaming and hate, overall, I’m very happy that we put the effort in to build & understand such a system. The resulting code-base is still smaller (by LOC) and cleaner than its predecessor, despite implementing everything itself, and can be built very simply with few dependencies.
What was the point of this post again? To share a few small insights gleaned while trying to write an efficient database; to praise the genius of the authors & maintainers of such brilliant bits of software as Redis & PostgreSQL – for whom I have a new, experience-based respect; and to encourage game engine programmers to venture into the muddy waters of online server development – there’s so much FUD around it, the only way to dispel the fog is to dive in and find out for yourself.

hey h8ers, you gonna h8 this. so comment!

If you’re the sort of person who’d rather use someone else’s graphics engine, you’d probably rather use someone else’s database too – and that’s fine! I’m surprised you’ve read this far; thank you! But, if you get a kick out of understanding your whole stack, top to bottom – in the time-honoured old-school tradition of game development – I can’t recommend the world of data storage programming enough. After all, online, data-driven and community-creating features are going to be writ large in the future of pretty much every type of game. And understanding how your MySQL database works from top to bottom, or writing your own, is no bad thing.