minimising downtime for schema changes with postgresql

Edits: Corrected the description of the Slony bug, and noted that there is a typo on the lazr_postgresql PyPI page.

Two years ago Launchpad did schema changes once a month. Everyone would cross their fingers and hope while the system administrators took all the application servers offline, patched the database with a month's worth of work and brought up the servers again running the new QA'd codebase.

This had two problems:

  1. Due to the complexity of the system – something like 300 processes have to be stopped or inhibited to take everything offline – the downtime was often about 90 minutes, irrespective of how long the schema patch itself took. [Some of the processes don't like being interrupted at all].
  2. We simply could not deliver any change in less than a week, and the average latency for something that jumped all the queues was still two weeks.

About a year ago we wanted to increase the rate at which schema changes could be carried out – the efforts to speed Launchpad up had consumed most of the low hanging fruit, and more and more schema patches were required. We didn't want to introduce additional 90 minute downtime windows, though. Adopting incremental migrations – the sort of change process described in various places on the internet – seemed like a good way to make it possible to apply the schema changes without this slow shutdown-and-restart step, which was required because the pre-patch codebase couldn't speak to the new schema. We could optimise each patch to be very fast by avoiding anything that causes a full table scan or table rewrite (such as adding indices, or adding columns with a non-NULL default value). That would let us avoid the 90 minutes of downtime caused by stopping and restarting everything.

However, that wasn't sufficient – the reason Launchpad ended up doing monthly downtime is that previous attempts to do more frequent schema changes had too high a failure rate. A key reason for patch deployment time blowing out when everything wasn't shut down was that Launchpad is a very busy system – with Slony, schema changes require an exclusive lock on all tables. [More recent versions of Slony only lock some tables, but they still require very widespread locks for most DDL operations]. We're doing nearly 10 thousand transactions per minute; at any point in time there are always locks open on some table in the system, so it was highly improbable, and effectively impossible, for slonik to get an exclusive lock on all tables in a reasonable timeframe. Background tasks that take many minutes to complete exacerbate this – we can't just block new transactions long enough to deliver all the in-flight web pages and let the locks clear that way.
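To make that concrete, a query along the following lines (my own illustration, not something from Launchpad's tooling) lists the relation-level locks held at any moment; on a busy system it essentially never comes back empty, which is why a blanket exclusive lock can't be acquired quickly. It assumes psycopg2, a placeholder DSN, and the pg_stat_activity column names used since PostgreSQL 9.2 (9.1 spelled the pid column procpid):

import psycopg2

conn = psycopg2.connect("dbname=launchpad_prod")  # placeholder DSN
cur = conn.cursor()
cur.execute("""
    SELECT a.pid, l.relation::regclass AS locked_table, l.mode, a.xact_start
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE l.granted AND l.relation IS NOT NULL
    ORDER BY a.xact_start
""")
for pid, table, mode, started in cur.fetchall():
    print(pid, table, mode, started)
conn.close()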

PGBouncer turns out to be an ideal tool here. If you route all your connections through PGBouncer, you have a single point you can deliberately interrupt to clear all database locks in a second or so (it takes time for backends to all notice that their clients have gone).
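For illustration only (this isn't the Launchpad code), severing every client at once can be as simple as talking to PgBouncer's admin console. This sketch assumes psycopg2, an admin user listed in PgBouncer's admin_users, and placeholder connection details:

import psycopg2

ADMIN_DSN = "dbname=pgbouncer host=127.0.0.1 port=6432 user=pgbouncer"  # placeholders

def sever_all_clients():
    conn = psycopg2.connect(ADMIN_DSN)
    conn.autocommit = True  # the admin console doesn't accept transactions
    cur = conn.cursor()
    try:
        # SHUTDOWN stops PgBouncer, dropping every client connection at once;
        # the database backends then notice their clients are gone and
        # release their locks.
        cur.execute("SHUTDOWN;")
    except psycopg2.OperationalError:
        pass  # expected: our own admin connection is dropped too
    finally:
        conn.close()

(PAUSE/RESUME are gentler admin-console alternatives, but they wait for in-flight work to finish rather than interrupting it.)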

So we combined these things to get what we called ‘Fast Down Time’ or FDT.  We set the following rules for developers:

  1. Any schema patch had to complete in <= 15 seconds in our schema staging environment (which has a full copy of the production DB), or we’d roll it back and redesign.
  2. Any patch could change either code or schema, never both. Schema patches were to land on a separate branch and would be promoted to trunk only after deployment. That branch also receives automated merges from trunk after every commit to trunk, so it's always running the latest code.

This meant that we could be confident in QA: we would QA the new schema and the application process with the current live code (we deploy trunk multiple times a day). We published some documentation about how to write fast schema patches to help socialise the approach.

Then we wrote an automated tool that would:

  1. Check for known fragile processes and abort if any were found.
  2. Check for very long transactions and abort if any were found.
  3. Shut down pgbouncer, disconnecting all clients instantly.
  4. Use slonik to apply one or more schema patches.
  5. Start pgbouncer back up again.

The code for this (call it FDTv1) is in the Launchpad source code history – it's pretty entangled, but it's there for grabbing if you need it. Read on to see why it's only available in the history 🙂
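If you just want the shape of it, here is a rough sketch of that sequence (my assumptions, not FDTv1 itself). The DSN, the psql and service invocations and the 60-second threshold are all placeholders, and the fragile-process check is site-specific so it's omitted:

import subprocess
import sys

import psycopg2

DB_DSN = "dbname=launchpad_prod"  # placeholder


def abort_if_long_transactions(max_seconds=60):
    # Step 2: bail out if anything has been in a transaction for too long.
    conn = psycopg2.connect(DB_DSN)
    cur = conn.cursor()
    cur.execute(
        "SELECT count(*) FROM pg_stat_activity "
        "WHERE xact_start < now() - %s * interval '1 second'",
        (max_seconds,))
    (count,) = cur.fetchone()
    conn.close()
    if count:
        sys.exit("long-running transactions present; aborting")


def apply_patches(patch_files):
    abort_if_long_transactions()
    # Step 3: stop pgbouncer so every client is disconnected at once.
    subprocess.check_call(["service", "pgbouncer", "stop"])
    try:
        # Step 4: apply the patches.  With Slony this was done via slonik;
        # plain psql is shown here purely for brevity.
        for patch in patch_files:
            subprocess.check_call(["psql", "-f", patch, "launchpad_prod"])
    finally:
        # Step 5: bring pgbouncer back so the application can reconnect.
        subprocess.check_call(["service", "pgbouncer", "start"])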

The result was wonderful – we were immediately able to deploy schema changes with <= 90 seconds of downtime, which was significantly less than the 5 minutes our stakeholders had agreed to as a benchmark – if we were under 5 minutes, we could schedule downtime once a day rather than once a month. We had to fix some API client code to retry more reliably, and likewise fix a few minor bugs in the database connection handling logic in the appservers, but all in all it was a pretty smooth project. Along the way we spun off a small Python helper to run and control pgbouncer, which let us write effective tests for the connection handling code paths.

This gave us the following workflow for making schema changes:

  1. Land and deploy an incremental schema change.
  2. Land and deploy any indices that need to be added – these are deployed live using CREATE INDEX CONCURRENTLY.
  3. Land and deploy code changes to populate any additional fields/tables from both application servers, and from cron – we do a bulk backfill that does many small transactions while walking over the entire dataset that needs to be updated / populated.
  4. Land and deploy code changes to drop references to the old schema, whatever it was.
  5. Land and deploy an incremental schema change to finalise the change – such as making a new column NOT NULL once the backfill is complete.

This looks long and unwieldy, but it's worth noting that it's actually just repeated applications of a smaller primitive:

  1. Make a schema change that is fast and compatible with existing code.
  2. Change code to take advantage of the changed schema.

Pretty much any change that is desired can be done using this single primitive.
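For instance, adding a new mandatory column might play out as follows. This is only a sketch with made-up table and column names (ticket.external_ref); in reality each numbered step would be its own landed-and-deployed change, and the autocommit connection means every statement – including each backfill batch – runs in its own small transaction:

import psycopg2

conn = psycopg2.connect("dbname=launchpad_prod")  # placeholder DSN
conn.autocommit = True  # each statement is its own short transaction
cur = conn.cursor()

# 1. Fast, compatible schema change: a nullable column needs no table rewrite.
cur.execute("ALTER TABLE ticket ADD COLUMN external_ref text")

# 2. Index added live, outside any downtime window.
cur.execute("CREATE INDEX CONCURRENTLY ticket__external_ref__idx "
            "ON ticket (external_ref)")

# 3. Backfill with a placeholder default in many small batches, so no lock
#    is held for long.
while True:
    cur.execute(
        "UPDATE ticket SET external_ref = '' "
        "WHERE id IN (SELECT id FROM ticket "
        "             WHERE external_ref IS NULL LIMIT 1000)")
    if cur.rowcount == 0:
        break

# 4./5. Once old code no longer writes NULLs and the backfill is complete,
#       finalise the change.
cur.execute("ALTER TABLE ticket ALTER COLUMN external_ref SET NOT NULL")

conn.close()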

We wanted to go further, though – the multiple stages required for complex migrations became a burden with one change a day. Fortunately PostgreSQL now includes its own replication engine, which streams the write-ahead log (WAL) rather than installing triggers on all tables the way Slony does.

Stuart, our intrepid DBA, migrated Launchpad to PostgreSQL 9.1, updated the FDT tool to work with native replication, and migrated Launchpad off of Slony. The result is again wonderful – the overhead in doing a schema patch, with all the protection I described above, is now ~5 seconds. We can do incremental changes in less time than it takes your browser to figure out that a given server is offline. We're now negotiating with the Launchpad stakeholders to get multiple downtime windows each day, with this almost unnoticeable, super reliable process in place.
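As an aside, with native replication a pre-flight health check is a one-liner against pg_stat_replication. The snippet below is my own illustration rather than part of the FDT tool, with a placeholder DSN:

import psycopg2

conn = psycopg2.connect("dbname=launchpad_prod")  # placeholder DSN
cur = conn.cursor()
cur.execute("SELECT application_name, client_addr, state FROM pg_stat_replication")
for name, addr, state in cur.fetchall():
    print(name, addr, state)  # each standby should report state 'streaming'
conn.close()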

Reliability-wise, FDT has been superb. We've had two failures: one where we believe we encountered a bug in Slony when we dropped the id column from two tables in one patch (we replaced the autoincrement column used as the primary key with a naturally unique column), and one where we landed a patch that worked on staging but led to lock contention in production – so the patch applied, but the system was very unhealthy after that until we fixed it. That's after doing approximately 60 patches over a one-year period.

We’re partway through extracting the patching logic from Launchpad’s code base into a reusable tool, but the basic principles will apply to any PostgreSQL environment. Note that there is a typo on the PYPI page – the actual Launchpad project is at https://launchpad.net/lazr.postgresql.
