What do I do @ work?

13Sep10

I recently moved within Canonical from being a paid developer of Bazaar to take on a larger challenge: Technical Architect for Launchpad. It’s been two months now, and it’s time to put my head up from the coal face, have a look around and regroup.

When I worked on Bazaar, every day when I started work I was working on a tool anyone can use, designed for collaboration on source code, for people writing software. This is a toolchain component right at the heart of the free software world. Bazaar and tools like it get used every day to manage, distribute and collaborate on the source code that makes up the components of Ubuntu, Debian, Fedora and so forth. Every time someone new started using Bazaar for a new free or open source project, I felt happy – happy that in my own small way I’m helping with this revolution we’re carrying out.

Launchpad is pretty similar to Bazaar in some ways. Obviously they are both free software, both are written in Python, and both are sponsored by Canonical, my employer. And they both are designed to assist in collaboration and communication between free software developers – albeit in rather different ways.

Bazaar is a tool anyone can install locally, run from the command line, as a GUI, or as a local webserver, and use to share code either centrally (e.g. by pushing to Launchpad) or in a peer-to-peer fashion, acting as their own server.

Launchpad, by contrast, is a website which folk will (usually) use as a service – in their browser, from the command line via FTP (for package building) or ssh (for Bazaar branch pushing or pulling), or even from local GUI programs using the Launchpad API service. This makes it more approachable for first-time collaborators, but it’s less able to be used offline, and it has all the usual caveats of websites: it needs a username and password, and its availability depends on the operators – on the team I’m part of. So there’s a lot less room for error: if we do something wrong, the system is unavailable, and users can’t just ‘apt-get install’ an older release.

With Launchpad our goal is to get all the infrastructure that open source projects need out of the way, so that they can focus on their code, collaboration within their team and – almost uniquely – collaboration with other teams. As well as being open source, Launchpad is free for all open source projects to use. Ubuntu is our single biggest user: they use it for all bug tracking, translation and package building, and account for a huge fraction of the total storage overhead in the database.

Launchpad is a pretty nice system, so people use it, and as a result (on a technical basis) it is suffering from its own success: small corner cases in the code turn up every day or two, and code written years ago to deal with a relatively small data set now has to deal with data sets a thousand or more times larger (one table, for instance, has over 600,000,000 rows in it).

For the last two months, then, I’ve been working on Launchpad. As Technical Architect, I need to ensure that the things that we (users, stakeholders and developers of Launchpad) want to do are supported by the structure of the system: the platform(s) we’re building on, the way we approach problems, coding standards and diagnostic tools. That sounds pretty dry and hands-off, but I’m finding it’s actually very balanced. I wrote a presentation when I started the job, which encapsulated the challenges I saw in front of the team on this purely technical front, and what I thought I needed to do.

I think I was about right in my expectations: On a typical day, I’ll be hands on in a problem helping get it diagnosed, talking long term structural changes with someone around how to make things more efficient / flexible / maintainable, and writing a small patch here or there to help move things along.

In the two months since I took on this challenge, we’ve made significant headway on the problem of performance for Launchpad: many inefficient code paths have been identified and removed, some new infrastructure has been created and is being rolled out to make individual pages faster, and we’ve massively increased the diagnostic data we get when things go wrong. We’ve introduced facilities for responding more rapidly to issues in the software (but they have to be rolled out across the system) and I hope that over the next 4 months we’ll reach the first of my performance goals: any webpage in Launchpad will complete rendering within 5 seconds 99% of the time. (Note that we already meet this goal if you measure the whole system, but that figure is biased by some pages being both very frequently hit and very small.)
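To illustrate why the per-page goal is stricter than the whole-system figure, here is a minimal Python sketch. The page names, timings and helper functions are all made up for illustration; this is not how Launchpad actually collects or reports its numbers.

```python
from collections import defaultdict


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def pages_missing_goal(requests, limit=5.0, pct=99):
    """Return {page: percentile time} for pages whose pct-th percentile
    render time is above the limit (in seconds)."""
    by_page = defaultdict(list)
    for page, seconds in requests:
        by_page[page].append(seconds)
    slow = {}
    for page, times in by_page.items():
        value = percentile(times, pct)
        if value > limit:
            slow[page] = value
    return slow


# Hypothetical traffic: one small page hit very often, one large page hit rarely.
requests = (
    [("small-popular-page", 0.2)] * 9900
    + [("large-rare-page", 3.0)] * 50
    + [("large-rare-page", 9.0)] * 50
)

# Measured across the whole system, the popular page dominates the result.
overall_p99 = percentile([seconds for _, seconds in requests], 99)

# Measured page by page, the rarely hit page clearly misses the goal.
slow_pages = pages_missing_goal(requests)

print(overall_p99)
print(slow_pages)
```

With this invented data the whole-system 99th percentile sits comfortably under 5 seconds, while the rarely hit page on its own does not, which is the bias described above.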

[edited to correct a typo and a missing '5 seconds']



7 Responses to “What do I do @ work?”

  1. 1 Torsten Werner

    My favorite error message which is displayed for many months now (https://launchpad.net/~openjdk/+mailinglist-moderate):

    Timeout error

    Sorry, something just went wrong in Launchpad.

    We’ve recorded what happened, and we’ll fix it as soon as possible. Apologies for the inconvenience.

    Trying again in a couple of minutes might work.

    (Error ID: OOPS-1717K407)

    • 2 rbtcollins

      That’s https://bugs.launchpad.net/launchpad-registry/+bug/627412 which, as you can see, is marked high and tagged timeout.

      The discussion there suggests that there is a backlog of messages and too many are present to show. You may find reloading gets it to work until we have a full fix in place.

      I encourage you to click the ‘me too’ button on the bug report, I’m going to look at your OOPS and see if I can add more detail to the report (to confirm the theory therein).

      • 3 rbtcollins

        I’ve updated the bug; the OOPS you linked was very helpful – it clearly shows that reading all the messages off disk is the dominating factor there and we need to batch the page.

      • 4 rbtcollins

        The email moderation timeout is now fixed on both our edge and production servers.

  2. What do you mean by “complete rendering in 99% of the time”? It sounds like you’re trying to make every page exactly 1% faster, but I’m not sure that’s what you meant.

    • 6 rbtcollins

      I missed ‘5 seconds’ there. 99% of requests should complete within 5 seconds.

      This is a better metric than saying ‘the average should be 2.5 seconds’, because averages don’t help you with spread.

  3. 7 w1ngnut

    Hi. Nice to know we have work being done on improving LP. I started using it a few months ago and still find it a little confusing. Hope, other than improving the performance issues – which are noticeable – you achieve some improvements on the overall usability.

    Cheers.

