Adam Frisby

Archive for the ‘mysql’ tag

Inside the Asset Server

with 6 comments

The single most misunderstood bit of technology in both OpenSim and Second Life is the thing called the asset server; while my understanding of the backend of Second Life’s asset cluster is limited (Squid+Isilon AFAIK), my understanding of OpenSim’s is fairly comprehensive.

Let’s start with the basics – the asset server is a gigantic document storage server. In our cases, documents are things such as primitives, textures, sounds, avatars – the works.

Each document is referred to by a specific unique filename called a UUID – which is a 128-bit number (think 3 with 38 zeros behind it – a big number). A UUID looks something like this when represented in hexadecimal notation: "d61c1990-79b9-11de-8a39-0800200c9a66". You have probably seen them around a lot just from using OpenSim or Second Life.

A UUID is an excellent choice for a filename, because you can make them up randomly and statistically guarantee they haven't been used before (the chance of a duplicate for a good random UUID is about 1 in a very, very, very, very large number). This means you can have multiple asset servers on one grid – each upload can be given an ID randomly, without the need for a central authority to hand them out.
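
For example, in C# (the language OpenSim itself is written in), allocating an ID for a new upload is just a call to the runtime's random UUID generator – no lookups and no coordination with anyone else. A minimal sketch, not OpenSim's actual upload code:

using System;

class UploadExample
{
    static void Main()
    {
        // Each upload gets a freshly generated random (version 4) UUID.
        // No central authority is consulted - the odds of a collision are
        // so small that we simply assume the ID has never been used before.
        Guid assetId = Guid.NewGuid();
        Console.WriteLine("New asset stored as: " + assetId);
    }
}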

They also optimize extremely well – each bit is a single yes-or-no question; to find out exactly where an asset is located, you can play a game of 20 questions, except that with 128 bits you can ask up to 128 questions. For example, in a server farm with 4 machines (labelled 1, 2, 3 & 4), you could find which server an asset is located on with the following question chain…

  1. Is the asset on a server labelled closer to 4 than to 1? (yes – servers 3 and 4 are still possibilities)
  2. Is the asset on an odd-numbered server? (yes – located on server 3)

Asset found on server #3. It’s not very long, is it? Even with, say, 300 servers, you can find the answer within 9 questions. Some further searching needs to be done to locate the asset on the physical disk itself, but even then you will be asking a very small number of questions – far fewer than the maximum 128 allowed.
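
In code, that question chain amounts to nothing more than taking a few bits of the UUID and mapping them onto your list of servers. A rough C# illustration – the server list and the "use the first byte" rule are invented for the example, not how any particular grid actually does it:

using System;

class AssetRouter
{
    // Hypothetical pool of asset servers - in practice this comes from configuration.
    static readonly string[] Servers =
    {
        "assets1.example.com",
        "assets2.example.com",
        "assets3.example.com",
        "assets4.example.com"
    };

    // Each bit of the UUID answers one yes/no question; using a handful of them
    // to pick a server requires no central lookup at all.
    static string ServerFor(Guid assetId)
    {
        byte firstByte = assetId.ToByteArray()[0];
        return Servers[firstByte % Servers.Length];
    }

    static void Main()
    {
        Guid id = new Guid("d61c1990-79b9-11de-8a39-0800200c9a66");
        Console.WriteLine("Asset " + id + " lives on " + ServerFor(id));
    }
}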

To put this into a diagram, fetching an asset directly by its indexed UUID is equivalent to something like this:

Fig 1. How asset UUIDs divide and conquer

Given that computers are capable of asking and answering questions very, very quickly (several billion per second), and that none of the above questions require a central server, you can expand your asset server farm in a fairly linear fashion without scaling constraints, provided that you structure your data appropriately.

128 questions also means that you can define very, very precise questions – so many that if you created 100 trillion assets every second, the sun would be lifeless before you ran out of address space that can be answered by that many questions. (Although a piece of statistical mathematics called the Birthday Paradox means duplicates become likely well before you get to that point, making the address space significantly less usable in practice.)

At this point, we know how assets are stored and retrieved – files are uploaded into the server farm and given a UUID; that UUID can be used to find out precisely where the asset is located and to return the data later. However, when you access an item off the asset server, there’s only a modest chance you are accessing the asset by its UUID directly.

A lot of the time, you will be accessing the asset through a layer of redirection called the Inventory Server – if you have a texture within your inventory, there are actually two components: the first is the data itself (the asset), and the second is the inventory shell (uploader, time uploaded, permissions, reference to the asset, etc.). When you lose an item of inventory (such as during a transaction), it is often not the asset server’s fault, but the inventory server’s.

The inventory server exists so that if you give a copy of a texture to a friend, you and that friend do not make a duplicate copy of the heavier portion (the asset) – both inventory items, yours and your friend’s, will have the same underlying asset ID. This also means that copying an item within your inventory will not protect it from an asset glitch (as both copies point at the same data).
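
Conceptually, the two records look something like this – the field names here are illustrative rather than OpenSim’s exact schema:

using System;

// The heavy part: the actual bytes, stored once on the asset server.
class Asset
{
    public Guid AssetId;       // the UUID the data is filed under
    public string ContentType; // texture, sound, script, etc.
    public byte[] Data;
}

// The light part: one per inventory entry, stored on the inventory server.
class InventoryItem
{
    public Guid ItemId;        // the inventory entry's own UUID
    public Guid AssetId;       // reference to the shared asset - copies reuse this
    public Guid OwnerId;
    public string Name;
    public DateTime Uploaded;
    public uint Permissions;
}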

In Second Life®, the inventory server uses MySQL (and probably InnoDB), and the asset server uses IsilonFS (a custom hardware-appliance-based system); in OpenSim this will vary a lot depending on the provider’s configuration – most grids use a SQL backend for storing both assets and inventory. By default, OpenSim will use a small embedded database called SQLite for both inventory and assets. For grids, we recommend something more robust.

In grid mode, MySQL is still an appropriate solution for inventory; however, asset data will often exceed the normal operating tolerances of MySQL and lead to frequent table corruption, table-locking issues (which in turn make performance suck) and other nasties. No large grid should be using MySQL for assets (small home and private grids are, however, fine).

Grids such as ReactionGrid which use MS-SQL can get a bit further than MySQL when storing asset data (for inventory they are fundamentally the same); MS-SQL, like most sane database engines, stores BLOB data separately in a BSP-tree-indexed system, which means it can scale up pretty far (although at about the point of clustering, things degrade). Postgres can do the same – however, as best I am aware, Postgres is not yet properly supported in OpenSim.

For bigger grids like OSGrid (where we have hundreds of gigabytes of assets) a dedicated solution is ideal – unlike Second Life, we can’t afford to purchase high-end NAS equipment for our storage, so we have built a distributed system called Fragstore which has two big features.

  1. A configurable backend – it can use distributed storage systems (such as Project Voldemort or Hadoop), or export to a filesystem (which can be a distributed filesystem via projects like KosmosFS).
  2. A duplicate detection system – so files uploaded twice only get stored once.

The duplicate detection is achieved via a secondary layer of indirection: when we allocate a UUID, we hash the incoming data (a 256-bit SHA-256 hash) and store the UUID and the hash in a small database table (currently MySQL based); when we receive future uploads, if the hash matches an existing one, the data is only stored once on disk.
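
A minimal sketch of that dedup path – the class and method names here are invented for illustration and Fragstore’s real code differs, but the shape is the same:

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

class ContentAddressableStore
{
    // Stand-in for the small MySQL table mapping asset UUID -> content hash.
    readonly Dictionary<Guid, string> _idToHash = new Dictionary<Guid, string>();
    // Hashes we have already written to the backend store.
    readonly HashSet<string> _knownHashes = new HashSet<string>();
    readonly string _root;

    public ContentAddressableStore(string root) { _root = root; }

    public void Store(Guid assetId, byte[] data)
    {
        string hash;
        using (SHA256 sha = SHA256.Create())
            hash = BitConverter.ToString(sha.ComputeHash(data)).Replace("-", "");

        _idToHash[assetId] = hash;            // every upload gets a table row...

        if (_knownHashes.Add(hash))           // ...but identical content is only written once
            File.WriteAllBytes(Path.Combine(_root, hash + ".blob"), data);
    }
}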

The final point I would like to touch on is asset transmission and security – assets are transmitted in both SL and OpenSim over HTTP; the reason this works and is secure is that the UUIDs used in the requests are unguessable. Even for a computer making millions of guesses per second, the chance of hitting a valid address is very, very small (the same probability as generating a duplicate UUID).

Between the region server and the asset server, the request is something in the form of http://assetserver.com/<uuid>/data – which will return the asset in an encoded container (usually packaging a little more information about the asset such as the content type, etc.). Using /metadata instead of /data will get you a JSON-encoded package with a bit more information about the asset (creation date, specific asset type, etc.).
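
From a region server’s point of view, fetching an asset is then just a plain HTTP GET against those two URLs. The hostname below is the placeholder from the example above, and the code is a bare-bones sketch rather than OpenSim’s actual asset client:

using System;
using System.Net;

class AssetFetch
{
    static void Main()
    {
        Guid assetId = new Guid("d61c1990-79b9-11de-8a39-0800200c9a66");

        using (WebClient client = new WebClient())
        {
            // /data returns the asset wrapped in its encoded container;
            // /metadata returns a JSON description (type, creation date, etc.)
            byte[] data = client.DownloadData(
                "http://assetserver.com/" + assetId + "/data");
            string metadata = client.DownloadString(
                "http://assetserver.com/" + assetId + "/metadata");

            Console.WriteLine("Fetched " + data.Length + " bytes; metadata: " + metadata);
        }
    }
}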

From the region servers to the users, this transmission can vary a little. In both SL and OpenSim, the region servers act as proxy caches for the asset server; because assets cannot be updated (only replaced), this works fairly well. If twenty users in a region see a texture, it only needs to be fetched off the core asset service once, because the region will send it twenty times to the users rather than twenty users hitting the core server.
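
A region’s role as a proxy cache amounts to little more than a dictionary sitting in front of that fetch. A toy sketch – the real caches also expire entries, spill to disk and so on:

using System;
using System.Collections.Generic;

class RegionAssetCache
{
    readonly Dictionary<Guid, byte[]> _cache = new Dictionary<Guid, byte[]>();
    readonly Func<Guid, byte[]> _fetchFromAssetServer;

    public RegionAssetCache(Func<Guid, byte[]> fetchFromAssetServer)
    {
        _fetchFromAssetServer = fetchFromAssetServer;
    }

    public byte[] Get(Guid assetId)
    {
        byte[] data;
        if (!_cache.TryGetValue(assetId, out data))
        {
            // The first viewer to ask pays the cost of hitting the core asset server...
            data = _fetchFromAssetServer(assetId);
            _cache[assetId] = data;
        }
        // ...everyone after that is served from the region's copy. Because assets are
        // immutable (replaced rather than updated), the cached copy never goes stale.
        return data;
    }
}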

In the case of OSGrid, this means we have 2,500+ reverse proxies sitting in front of the asset server (albeit in a somewhat suboptimal layout); it reduces the bandwidth on the core asset server by approximately 90%+ (you can see our asset stats here), which means we can get away with much lower operating costs (since asset delivery costs are shared with region operators).

Written by Adam Frisby

July 26th, 2009 at 9:43 am

DB’s Considered Harmful [... to my sanity.]

with 6 comments

The following is my log of the OSGrid asset conversion saga.

Day 1: Today we’re taking a third attempt at the big Fragstore conversion for OSGrid. For those not following the Saga of the Asset Server – about two months ago we started noticing major scalability issues surrounding assets. Right now they are thrown into MySQL as a blob table, resulting in large amounts of waste, both in duplicate content and in the fact we’re storing a filesystem inside of a relational database – you can read up on the earlier design decisions leading to FragStore here.

The previous two conversion attempts have suffered “mysterious MySQL glitches” which we assume may be related to various bugs with long-running commands. Apparently the proper course of action when a query takes more than 60 seconds to process is to freeze up entirely and stop processing requests – for now and evermore.

In an attempt to make this run a bit smoother – we’ve broken up the process into 2,000 batches of 1,000 assets each – previously our batch mechanism was using MySQL LIMIT X, Y which has the side effect of getting slower and slower as you progress down the table (thus causing the above); so we’ve shifted to using a numeric ID on the assets table. Putting the numeric ID on there allows us to at least index sequential accesses – LIMIT unfortunately will not use any form of index hinting.

mysql> ALTER TABLE `assets`
    -> ADD COLUMN `numericID`  int(11) UNSIGNED NOT NULL AUTO_INCREMENT AFTER `access_time`,
    -> DROP PRIMARY KEY,
    -> ADD PRIMARY KEY (`numericID`),
    -> ADD UNIQUE INDEX `assetID` (`id`);

Query OK, 1623826 rows affected (1 hour 11 min 2.54 sec)
Records: 1623826  Duplicates: 0  Warnings: 0

An hour and a bit later, the difference between the speed of processing before and after is pretty astounding:

mysql> select id from assets limit 540000,10;
10 rows in set (11.53 sec)

mysql> select id from assets where numericID between 540000 AND 540009;
10 rows in set (0.42 sec)
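
With the numeric ID in place, the converter can walk the table in fixed-size ranges instead of ever-deeper LIMIT offsets. Roughly like this, using the MySQL .NET connector – the connection string and the blob column name (data) are placeholders, and everything after the read is omitted:

using System;
using MySql.Data.MySqlClient;

class AssetBatchWalker
{
    static void Main()
    {
        const int batchSize = 1000;
        const long totalRows = 1623826;

        using (var conn = new MySqlConnection("server=localhost;database=opensim;uid=USER;pwd=PASS"))
        {
            conn.Open();

            // Walk the table in ranges of the indexed numericID column,
            // so each batch is an index seek rather than a growing offset scan.
            for (long start = 1; start <= totalRows; start += batchSize)
            {
                using (var cmd = new MySqlCommand(
                    "SELECT id, data FROM assets WHERE numericID BETWEEN @lo AND @hi", conn))
                {
                    cmd.Parameters.AddWithValue("@lo", start);
                    cmd.Parameters.AddWithValue("@hi", start + batchSize - 1);
                    using (var reader = cmd.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            // hand each row (UUID + blob) to the Fragstore writer here
                        }
                    }
                }
            }
        }
    }
}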

It doesn’t need a whole bunch of explanation to figure out that the above may help with our situation. Running the revised and simplified “AssetConverterMarkII” appears to go without a hitch – data is stored into the database, the metadata table is being filled correctly – all in all it appears to be functional. With one minor teeny little problem.

Only the first 4096 bytes of data are being written to the backend store. The remaining sectors of data are written – but consist entirely of zeros. Retrieving the data results in a buffer of the same length as the original stored asset – but often half the data is completely missing. An hour later, it looks like the data is being sent to the backend voldestore correctly, but either on the way there or on the way back, it loses something. Unfortunately it looks like the problem is outside the purview of the client adapter and is somewhere in the deep murk of the backend storage server.

Day 2: Rethinking time – after spending quite some time hunting for some alternatives, the simplest solution looks to be the best.

While I am keen to use Project Voldemort in the long term, in the short term debugging our implementation details is just not on my agenda. We use an IKVM cross-compiled connector library from Java, and the problem looks like it is sitting in there somewhere. Unfortunately, debugging Java IKVM libraries from within .NET is painful at best, and not something that fits easily into our timescale.

The simplest solution is to throw the asset blobs onto the filesystem – filesystems are, after all, designed to handle tiny little files. Directories will slow down when there are more than about 3,000 entries within them, so we’re breaking storage up into “/b1/b2/hash.blob” – assuming an even distribution, this means approximately 30 files per directory at our current size, scaling us up to a capacity of 100 million assets before we need to rethink the situation.
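
My reading of that scheme is that the first two pieces of the hash become the directory names; sketched in C# below, where the example hash is made up and the exact slicing is an assumption rather than Fragstore’s documented layout:

using System;
using System.IO;

class BlobPath
{
    // "/b1/b2/hash.blob" - the first two hex pairs of the hash are reused as
    // directories, spreading files evenly so no directory grows into the thousands.
    static string PathFor(string hexHash, string root)
    {
        string b1 = hexHash.Substring(0, 2);
        string b2 = hexHash.Substring(2, 2);
        return Path.Combine(root, b1, b2, hexHash + ".blob");
    }

    static void Main()
    {
        string exampleHash = "ab12cd34ef5678901234abcd"; // made-up hash, purely for illustration
        Console.WriteLine(PathFor(exampleHash, "/opt/fragstore"));
        // prints /opt/fragstore/ab/12/ab12cd34ef5678901234abcd.blob (on Linux/Mono)
    }
}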

Distribution and redundancy are both things I am still keen to employ – putting us on the filesystem does allow us to look at things such as KosmosFS which provide transparent distributed filesystems on Linux, and also gives us the opportunity to look at commercial filestores down the road if we ever win the lottery.

Rewriting fragstore to use filesystem components where voldemort was employed took all of an hour and the asset converter was up and running – a lot faster too. Our conversion transfer rate on Voldemort was 66 assets per second. FragstoreFS?

10,000 Assets Processed (102.04 asset(s)/sec): 0 error(s) so far.

The second thing I wanted to test was just how big a saving we were getting from using Content Addressable Storage – with 10,000 assets processed, we had eliminated 613 duplicates (6.1%). With 20,000 – 1,390 (6.9%); with 90,000 – 8,962 (9.95%). We’re hoping that as the full dataset is processed, the percentage of duplicates eliminated continues to increase.

Fig 1. CAS Duplicate Savings

The next issue to present itself was a slowdown as the conversion occurred – the number above (102.04) held firm for the first 10% of the conversion, then conversion speed began to massively taper off, first down to 71.39/sec, then down to 50.22/sec by 150,000 converted. My fear was a reprise of the situation we thought we had fixed on day 1 – slowdowns on access as we move further down the table.

Nebadon suggested that this might in fact be because, as OSGrid has become more popular, the average size of an asset has increased over time – so we skipped a million rows down the table and started converting some of the later rows. Conversion speed? 68.31/sec. This indicates that yes, later assets are more expensive to process – but the conversion speed should still average the 60 or so per second we need to convert the entire database in under 24 hours.

Somewhat happy with the results, we have started the conversion on the complete database, but we won’t know how well it has worked until tomorrow.

Day 3: Stay tuned!

Written by Adam Frisby

April 30th, 2009 at 6:14 pm

Posted in OpenSim
