Game data archives almost always employ some type of compression for their contents, and sometimes will even mix and match different algorithms for different types of files. Understanding the typical compression formats is therefore crucial to the success of a game hacker.
Moreover, you need to be able to recognize the common algorithms just from their compressed data alone, so when you’re staring at hex dumps, you will know how to proceed. In today’s installment, we’ll go through some of the most popular formats, how they work, and how you can recognize them “in the wild”.
First of all, note that this task is in theory quite hard. The ideal compression algorithm produces data that is essentially indistinguishable from random noise (which counterintuitively contains the highest density of information). Fortunately, games have two additional requirements: to be able to access data quickly, and to be programmed by normal coders as opposed to Ph.D. computer scientists.
Both of these requirements mean games typically use either industry-standard formats or relatively simple, quick algorithms that anyone can code from a book. The former often include extra identifying information we can spot, and the latter compress data poorly enough that it looks like “compressed data” instead of looking like random noise.
Let’s start with the industry standard, zlib. This is an open-source compression library which implements the classic ‘deflate’ algorithm used in .zip and .gz files. It’s pretty fast and pretty decent, and since it’s already written and completely free, it gets used all over the place, including in game archives.
How can you recognize it? 0x78 0x9C. The first two bytes of a standard zlib-compressed chunk of data will be those two bytes, which specify the default settings of the algorithm (’deflate’ with 32KB buffer, default compression level). Alternately, you will often see 0x78 0xDA, which is the same except using the maximum compression level. If you see those two bytes at the start of where you expect a file to be, rejoice, since you’ve just solved a big mystery with next to zero effort.
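If you would rather not eyeball the hex, here is a minimal sketch of a header check in Python. The function name looks_like_zlib is my own invention. It relies on how the zlib format lays out its two header bytes: the low nibble of the first byte is the method (8 for deflate), the high nibble encodes the window size, and the two bytes read as a big-endian number must be a multiple of 31. Treat a match as a strong hint rather than proof, since two bytes of unrelated data can pass by accident.

```python
import zlib

def looks_like_zlib(data: bytes) -> bool:
    """Return True if data starts with a plausible zlib header."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    method = cmf & 0x0F            # 8 means 'deflate'
    window_bits = (cmf >> 4) + 8   # 0x7 + 8 = 15, i.e. a 32KB window
    # The header is valid only if the 16-bit value is a multiple of 31.
    return method == 8 and window_bits <= 15 and (cmf * 256 + flg) % 31 == 0

# Both headers mentioned in the text pass, as does real zlib output.
print(looks_like_zlib(bytes([0x78, 0x9C])))
print(looks_like_zlib(bytes([0x78, 0xDA])))
print(looks_like_zlib(zlib.compress(b"some file contents")))
```

In practice you would run this at every offset the archive index points to, and then confirm by actually trying to decompress.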
Decoding this format is also pretty easy, since virtually every modern language will have a zlib library for it. In C, you want to just link against libz and call:
#include <zlib.h>
uncompress(new_buffer, &new_size, comp_buffer, comp_size);
Be sure to allocate enough memory for your expanded data, and set new_size to the size of that buffer before the call: hopefully the archive index will have already provided you with the original file size. The function will return a status code and additionally update new_size with the amount of data that was uncompressed.
In Python, dealing with zlib is just embarrassingly easy:
import zlib
new_data = zlib.decompress(comp_data)
In Python 2, zlib was also one of the built-in string codecs (alongside ASCII, UTF8, Shift-JIS, etc.), so if you had your data as a string you could just expand it with comp_data.decode('zlib'). Python 3 dropped that shortcut from bytes objects, although codecs.decode(comp_data, 'zlib') still works. The zlib module also has more direct function calls for extra control.
Compressing data is just as easy, with the compress() function in C (or compress2() if you want to specify the compression level) and zlib.compress() in Python.
I don’t want to say much about the inner workings of the deflate algorithm, since that really doesn’t come up very often: you can safely treat it like a black box. However, there is one extra facet I’ve run across: the Adler32 checksum. This is a very simple 32-bit checksum algorithm (like CRC32) which is included in the zlib library, and therefore also gets used by games a bit. Additionally, the zlib format specifies that an Adler checksum is appended to the end of a compressed file for error-checking purposes.
However, some games twist their zlib implementation slightly by either leaving off the checksum (in favor of using their own elsewhere in the archive) or moving it into the archive index instead. This will cause the zlib uncompress call to return an error, even though it actually uncompressed the data successfully.
So, a word to the wise: if you’re sure that the game is using zlib but you keep getting errors when you try to expand the data, look for this case. You may have to do a little twiddling of the compressed data to add the expected checksum at the end, or just ignore the zlib error codes and continue as normal.
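One way to sidestep the checksum problem in Python, sketched below under some assumptions: skip the two-byte zlib header yourself and ask the zlib module for a raw deflate stream (a negative window size does that), so the trailing Adler32 is never examined. The function name inflate_without_checksum is hypothetical, and the sketch assumes the header has no preset-dictionary flag set, which I would expect to be the usual case.

```python
import zlib

def inflate_without_checksum(comp_data: bytes) -> bytes:
    """Expand zlib-style data while ignoring a missing or wrong Adler32."""
    # -15 selects raw deflate with a 32KB window: no header, no trailer.
    inflater = zlib.decompressobj(-15)
    # Skip the 2-byte zlib header; anything after the deflate stream
    # (such as a checksum) ends up in inflater.unused_data.
    return inflater.decompress(comp_data[2:]) + inflater.flush()

payload = b"game data " * 50
packed = zlib.compress(payload)
stripped = packed[:-4]                    # simulate a game that drops the checksum
print(inflate_without_checksum(stripped) == payload)

# If the archive index stores the checksum instead, compare it yourself.
print(hex(zlib.adler32(payload)))
```

Going the other way, zlib.adler32() gives you the value to append (big-endian) if you need to hand the game a stream with the checksum restored.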
Run-length encoding is the simplest kind of home-grown compression you’re likely to run across. It shows up in image compression a lot, sometimes as part of a larger sequence of processing. Basically the idea is to start with a sequence of bytes and chunk them up whenever you run across a repeated value:
31 92 24 24 24 24 24 C5 00 00 = 31 92 5*24 C5 2*00
Exactly how you represent the chunked-up data varies a bit from algorithm to algorithm, depending on what you expect the sequences to look like.
Escape byte. You might designate a byte, say 0xFF as a flag for designating a run of repeated bytes, and follow it by a count and a value. So the above data would be:
31 92 FF 05 24 C5 FF 02 00 = 31 92 5*24 C5 2*00
If the flag byte actually appears in your data, you have to unescape it by, say, having a length of 0 in the next byte.
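As a rough sketch, a decoder for the escape-byte variant might look like this in Python. The function name, the choice of 0xFF as the flag, and the "count of 0 means a literal flag byte" rule are taken from the example above or assumed for illustration; a real game may differ on any of them.

```python
ESCAPE = 0xFF  # flag byte used in the example above

def rle_escape_decode(data: bytes) -> bytes:
    """Decode 'escape, count, value' run-length data."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        i += 1
        if byte != ESCAPE:
            out.append(byte)          # ordinary raw byte
            continue
        count = data[i]
        i += 1
        if count == 0:
            out.append(ESCAPE)        # escaped literal flag byte
        else:
            out.extend(bytes([data[i]]) * count)
            i += 1
    return bytes(out)

print(rle_escape_decode(bytes.fromhex("31 92 FF 05 24 C5 FF 02 00")).hex(" "))
```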
Repeated bytes. Here you just start running through your data normally, and whenever you have two bytes in a row that are the same, you replace all the rest of them (the third and thereafter) with a count byte:
31 92 24 24 03 C5 00 00 00 = 31 92 24 24 3*24 C5 00 00 0*00
If you don’t have a third repeated value, you’ll need to waste a byte to give a count of 0.
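A minimal decoder sketch for the repeated-bytes variant, again in Python with a made-up function name. It assumes the count byte always follows a matching pair and that the comparison restarts after each count, which is my guess at the most natural behavior.

```python
def rle_pairs_decode(data: bytes) -> bytes:
    """Decode data where two equal bytes in a row are followed by a count."""
    out = bytearray()
    prev = None
    i = 0
    while i < len(data):
        byte = data[i]
        i += 1
        out.append(byte)
        if byte == prev:
            count = data[i]           # how many more copies follow the pair
            i += 1
            out.extend(bytes([byte]) * count)
            prev = None               # start fresh after a count byte
        else:
            prev = byte
    return bytes(out)

print(rle_pairs_decode(bytes.fromhex("31 92 24 24 03 C5 00 00 00")).hex(" "))
```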
Alternating types. Here you assume that your data alternates between runs of raw values and runs of repeated bytes, and prepend length counts to each type:
02 31 92 05 24 01 C5 02 00 = 2 (31 92), 5*24, 1 (C5), 2*00
Naturally, if you have two repeated runs in a row, you’ll have to waste a byte to insert a 0-length raw sequence between them. A special case of this I’ve run across is when you expect to have long runs of zero in particular instead of any random byte, so you just alternate between runs of zeroes (with just a bare count value) and runs of raw data.
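Here is a sketch of the alternating variant in Python. It assumes the stream starts with a raw block and that every repeated run is stored as a count followed by one value byte, even when the count is 0; both are assumptions you would need to check against the actual game.

```python
def rle_alternating_decode(data: bytes) -> bytes:
    """Decode data that alternates raw blocks and repeated runs."""
    out = bytearray()
    i = 0
    raw_turn = True                   # assumed: the stream opens with raw data
    while i < len(data):
        count = data[i]
        i += 1
        if raw_turn:
            out.extend(data[i:i + count])   # copy 'count' raw bytes
            i += count
        else:
            out.extend(bytes([data[i]]) * count)
            i += 1
        raw_turn = not raw_turn
    return bytes(out)

print(rle_alternating_decode(bytes.fromhex("02 31 92 05 24 01 C5 02 00")).hex(" "))
```

The zero-run special case mentioned above would drop the value byte from the repeated branch and always emit zeroes.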
Note, of course, that there is some subtlety which can be involved depending on the variant you run across. For instance, it’s often the case that pairs of bytes aren’t efficient to encode, so they’re just treated as raw data. Also, rather than giving lengths themselves, sometimes you encode, say, length-3, if length values of 0, 1, and 2 aren’t ever needed. In some cases you might also run across multi-byte length values (controlled, say, by the high bit in the first length byte).
For images, you may have pixels instead of bytes which are the fundamental unit of repetition. In that case, even two RGB pixels in a row which are the same can be successfully compressed.
In any event, how do you recognize this format? The general principle is that all of these variations have to fall back on including raw bytes in the file a lot, so you want to try to look for those identifiable sequences (RGB triplets in image formats are good to key off of) interspersed with control codes. It’s often helpful to have an uncompressed version of an image to compare against, which you can recover from a screenshot or from snooping the game’s memory in a debugger (a topic for later articles).
One step up from run-length encoding is to be able to do something useful with whole sequences of data that are repeated instead of single bytes. Here, the algorithm keeps track of the data it’s already seen, and if some chunk is repeated, it just encodes a back-reference to that section of the file instead:
I love compression.  This is compressed! = I love compression.  This [is ][compress]ed! = I love compression.  This [3,-3][8,-22]ed!
The bracketed sections indicate runs of characters that have been seen before, so you just give a length and a backwards offset for where to copy them from. (Note the two spaces after the period: they count as bytes, which is why the second offset is -22.) A lot of compression algorithms, zlib included, are based on this general principle, but one version that seems to crop up a lot is LZSS.
The special feature of this format is how it controls switching between raw bytes and back-references. It uses one bit in a control byte to determine this, often a 1 for a raw byte and a 0 for a back-reference sequence. So one control byte will determine the interpretation of the next 8 pieces:
I love compression.  This [3,-3][8,-22]ed! = FF "I love c" FF "ompressi" FF "on.  Thi" 73 "s " 03 03 08 16 "ed!"
The 0xFF control bytes just say “8 raw bytes follow”, and the 0x73 byte is binary 01110011: reading from least-significant bit, that’s 2 raw bytes, 2 back-references, and then 3 raw bytes.
Recognizing this format in the wild rests on the control bytes, and you can spot it most easily in script files. If you see text which looks liFFke this,FF with reFFadable tFFext plus some junk characters in every 9th byte, you’re dealing with LZSS. You can also spot this in image formats, since the natural rhythm of RGB triplets will get interrupted by the control bytes.
Note that the farther in the file you go, the harder this gets to recognize, since the proportion of back-references tends to climb once the algorithm has a larger dictionary of previously-seen data to draw upon.
The major hassle with this format is the nature of the back-references. There are a lot of subtle variants of this. One of the most popular ones uses a 4096-byte sliding window, and encodes back-references as a 12-bit offset in the window and a 4-bit length. However, is the length the real length or length-3? Is the offset relative to the current position, or is it an array offset in a separate 4096-byte ring buffer? Is the control byte read most- or least-significant bit first? I’ve even run across an example where there were several different back-reference formats: a 1-byte one for short runs in a small window, a 2-byte one for medium-length runs in a decent window, and a 3-byte one for large runs over a huge window. You will just need to experiment a little bit to see exactly what the particular game is doing, unfortunately.
One subtle point is that you may be allowed to specify a back-reference which overlaps with data you haven’t seen yet. By that I mean a length larger than the negative offset involved:
This is freeeeeeeaky! = This [is ]fre[eeeeee]aky! = This [3,-3]fre[6,-1]aky!
The [6,-1] back-reference works because you are copying the bytes one at a time: first you copy the second ‘e’ from the first, and now you can copy the third ‘e’ from the second, etc. Be aware of this subtlety when you implement your own algorithms, since (a) this can preclude you from doing certain types of memory copying or string slicing, and (b) not all games will be able to understand this type of reference, so don’t encode that way unless you know yours can.
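To make the control-byte mechanics concrete, here is a small Python decoder for the toy layout used in the examples above: control bits read least-significant first, 1 for a raw byte, 0 for a back-reference stored as a length byte followed by a backwards-offset byte. That two-byte reference layout is only this article's illustration, not any particular game's format, and the function name is made up. Note the byte-at-a-time copy, which is what lets overlapping references work.

```python
def lzss_decode(data: bytes) -> bytes:
    """Decode the toy LZSS layout: control byte, then up to 8 pieces."""
    out = bytearray()
    i = 0
    while i < len(data):
        control = data[i]
        i += 1
        for bit in range(8):              # least-significant bit first
            if i >= len(data):
                break
            if control & (1 << bit):
                out.append(data[i])       # raw byte
                i += 1
            else:
                length, offset = data[i], data[i + 1]
                i += 2
                for _ in range(length):   # one byte at a time, so overlap is fine
                    out.append(out[-offset])
    return bytes(out)

# Two spaces follow the period in this sample, hence the offset of 0x16.
sample = (b"\xFF" + b"I love c" + b"\xFF" + b"ompressi" + b"\xFF" + b"on.  Thi"
          + b"\x73" + b"s " + bytes([3, 3, 8, 0x16]) + b"ed!")
print(lzss_decode(sample))

# Overlapping reference: [6,-1] repeats the single 'e' six more times.
freaky = b"\xDF" + b"This " + bytes([3, 3]) + b"fr" + b"\x3D" + b"e" + bytes([6, 1]) + b"aky!"
print(lzss_decode(freaky))
```

Swapping in a 12-bit offset and 4-bit length, or a ring buffer, only changes the few lines inside the back-reference branch, which is where most of the experimenting happens.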
From one point of view, Huffman encoding is easier than other algorithms since it only works on single bytes (or symbols, in general) at a time, but it’s also more tricky since the compressed data is a bitstream rather than being easy-to-digest bytes and control codes.
It works by figuring out the frequencies of all the bytes in a file, and encoding the more common ones with fewer than 8 bits, and the less common ones with more than 8, so you end up with a smaller file on average. This is very closely related to concepts of entropy, since each symbol ideally gets encoded with a number of bits equal to its own information content, which is -log2 of its frequency.
Let’s be specific. Consider the string “abracadabra”. The letter breakdown is:
a : 5/11 ~ 1.14 bits
b : 2/11 ~ 2.46 bits
c : 1/11 ~ 3.46 bits
d : 1/11 ~ 3.46 bits
r : 2/11 ~ 2.46 bits
Where I’ve given the number of bits of entropy each frequency corresponds to (i.e. if you have a 25% chance of having a certain letter, it has a 2-bit entropy since you need to give one of 4 values, say 00, to specify it out of the other 75% of possibilities: 01, 10, 11). Unfortunately we can’t use fractional bits, so we may have to round up or down from these theoretical values.
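If you want to reproduce those numbers for your own data, the calculation is just -log2 of each symbol's frequency. A short Python sketch (the function name is mine):

```python
from collections import Counter
from math import log2

def information_bits(text):
    """Map each symbol to -log2(frequency), its ideal code length in bits."""
    counts = Counter(text)
    total = len(text)
    return {sym: -log2(n / total) for sym, n in sorted(counts.items())}

bits = information_bits("abracadabra")
for sym, value in bits.items():
    print(f"{sym} : {value:.2f} bits")

# Ideal total size if fractional bits were allowed.
total_bits = sum(bits[ch] for ch in "abracadabra")
print(f"total: {total_bits:.1f} bits")
```

The total comes out a little over 22 bits, which is the floor that any symbol-by-symbol scheme is trying to approach for this string.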
How do we choose the right codes? Well, the best way is to build up a tree, starting from the least-likely values. That is, we treat, say, “c or d” as a single symbol with a frequency of 2/11, and say that if we get that far we know we can just spend one extra bit to figure out whether we mean c or d:
a : 5
0=c + 1=d : 2
b : 2
r : 2
Then we continue doing the same thing. At each step we combine the two least-weight items together, adding one bit to the front of their codes as we go. In the case of ties, we pick the ones with shorter already-assigned codes, or alphabetically first values:
a : 5
0=b + 1=r : 4
0=c + 1=d : 2

00=b + 01=r + 10=c + 11=d : 6
a : 5

000=b + 001=r + 010=c + 011=d + 1=a : 11
So the codes we end up with are:
a : 1
b : 000
c : 010
d : 011
r : 001
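The tree-building steps above can be sketched in Python as follows. The merge order follows the tie rules given in the text (lowest weight, then shorter existing codes, then alphabetical). Which side of each merge gets the 0 bit is not spelled out, so the rule here (the heavier group gets 0, alphabetical on a tie) is my assumption, chosen because it reproduces the table above. A game's own rule may well differ.

```python
from collections import Counter

def build_codes(text):
    """Build Huffman codes by repeatedly merging the two lightest groups."""
    # Each group is [weight, {symbol: code assigned so far}].
    groups = [[n, {sym: ""}] for sym, n in Counter(text).items()]

    def merge_order(group):
        weight, codes = group
        longest = max(len(code) for code in codes.values())
        return (weight, longest, min(codes))

    while len(groups) > 1:
        groups.sort(key=merge_order)
        pair, groups = groups[:2], groups[2:]
        # Assumed rule: heavier group takes the 0 bit, alphabetical on ties.
        zero, one = sorted(pair, key=lambda g: (-g[0], min(g[1])))
        merged = {sym: "0" + code for sym, code in zero[1].items()}
        merged.update({sym: "1" + code for sym, code in one[1].items()})
        groups.append([zero[0] + one[0], merged])
    return groups[0][1]

for sym, code in sorted(build_codes("abracadabra").items()):
    print(sym, ":", code)
```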
You will notice an excellent property of these codes: they are not ambiguous. That is, you don’t have 1 for ‘a’ and 100 for ‘b’… as soon as you hit that first 1, you know you can stop and go on to the next symbol without needing to read any more. Therefore, “abracadabra” just gets encoded as:
a b   r   a c   a d   a b   r   a
1 000 001 1 010 1 011 1 000 001 1
= 10000011 01010111 00000110
= 83 57 06
We’ve compressed 88 bits (11 bytes) down to 23 bits (just under 3 bytes). Almost always the bits are packed most- to least-significant in a byte.
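Packing and unpacking the bitstream is mostly bookkeeping. The Python sketch below hard-codes the code table from this example, packs most-significant bit first, and pads the last byte with zero bits. Returning the bit count alongside the bytes is my own convenience; a real format would store the original length somewhere, such as the archive index.

```python
CODES = {"a": "1", "b": "000", "c": "010", "d": "011", "r": "001"}

def huffman_pack(text, codes):
    """Return (packed bytes, number of meaningful bits)."""
    bits = "".join(codes[ch] for ch in text)
    padded = bits + "0" * (-len(bits) % 8)       # zero-fill the last byte
    packed = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return packed, len(bits)

def huffman_unpack(data, bit_count, codes):
    """Walk the bits, emitting a symbol whenever a full code is matched."""
    lookup = {code: sym for sym, code in codes.items()}
    bits = "".join(f"{byte:08b}" for byte in data)[:bit_count]
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:                    # unambiguous, so stop here
            out.append(lookup[current])
            current = ""
    return "".join(out)

packed, bit_count = huffman_pack("abracadabra", CODES)
print(packed.hex(" "), bit_count)
print(huffman_unpack(packed, bit_count, CODES))
```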
One subtlety is the exact method of tree creation, which assigns the codes. The method described above is “canonical”, but sometimes games will use their own idiosyncratic methods which you will have to match exactly to avoid getting garbage.
How do you recognize this in a data file? Well, the decompressor needs to know the codes, and the easiest way to specify this is to give it the frequencies (or more easily, the bit weights) of the values so it can construct its own tree.
Therefore, the compressed data will usually start with, say, a 256-element table of bit weights. So if you see 256 bytes of 05 06 08 07 0C 0B 06 — values that are around 8 plus or minus a few — followed by horrendous random junk, you’re probably looking at Huffman encoding.
Sometimes instead of bit weights you’ll have the actual frequency counts instead, which might need to have a multi-byte encoding scheme if they’re above 255. In that case, you’re mainly looking for a few hundred bytes of “stuff” followed by a sharp transition to much more random data.
Needless to say, the algorithms covered here are not the full range of compression formats out there. I’ll just briefly mention some others in case you run across them, though I haven’t really seen them in the wild.
Arithmetic encoding. This is vaguely related to Huffman encoding, in that you are working strictly with single bytes (or symbols) and trying to stuff the most frequent ones into fewer bits. However, instead of being restricted to an integral number of bits for each one, here you are allowed to be fractional on average.
This works by breaking up the numerical interval [0,1) into subranges corresponding to each symbol: the more common symbols correspond to larger ranges, in proportion to their frequency. You start with [0,1), and the first byte restricts you to the subrange for that symbol. Then the second byte restricts you to a sub-subrange, the third byte to a sub-sub-subrange, etc. Your final encoded data is any single numerical value inside the tiny range you end up in: just pick the number in that range you can represent in the least number of bits as a binary fractional value.
Needless to say there are some good tricks for implementing this without using ludicrously-high-precision math, but I won’t go into that.
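Purely to illustrate the interval narrowing, here is a Python sketch that does use ludicrously-high-precision math, via exact fractions. It is not how a practical coder works, the function names are hypothetical, and the alphabetical ordering of the subranges is an arbitrary choice.

```python
from collections import Counter
from fractions import Fraction

def build_ranges(text):
    """Give each symbol a slice of [0,1) proportional to its frequency."""
    ranges, low = {}, Fraction(0)
    for sym, n in sorted(Counter(text).items()):
        width = Fraction(n, len(text))
        ranges[sym] = (low, low + width)
        low += width
    return ranges

def narrow(text, ranges):
    """Shrink [0,1) symbol by symbol and return the final interval."""
    low, high = Fraction(0), Fraction(1)
    for sym in text:
        span = high - low
        sym_low, sym_high = ranges[sym]
        low, high = low + span * sym_low, low + span * sym_high
    return low, high

def expand(value, length, ranges):
    """Recover 'length' symbols from any value inside the final interval."""
    out = []
    for _ in range(length):
        for sym, (sym_low, sym_high) in ranges.items():
            if sym_low <= value < sym_high:
                out.append(sym)
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return "".join(out)

ranges = build_ranges("abracadabra")
low, high = narrow("abracadabra", ranges)
print(high - low)                       # width of the final interval
print(expand(low, 11, ranges))
```

The final width is the product of all the symbol frequencies, so the number of bits needed to name a value inside it tracks the total information content of the message.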
LZ77 (Lempel-Ziv ’77)
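This is the core of the zlib deflate algorithm, but you’ll sometimes see variants outside of that standard, so it’s useful to know a little about. Strictly speaking, LZ77 itself is just the sliding-window back-reference scheme that LZSS refines; what deflate does is basically combine those back-references with Huffman encoding. You just treat the back-reference command “copy 8 bytes” as a special symbol, like a byte value of 256+8=264 (deflate’s real symbol numbering differs in the details, but the idea is the same).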
Then, with this mix of raw data bytes and back-reference symbols, you run it through a Huffman encoding to get the final compressed output. Typically you will do something different with the back-reference offsets: either leave them as raw data, or encode them in their own separate Huffman table.
LZW (Lempel-Ziv-Welch). When taught correctly, this is an algorithm with a mind-blowing twist at the end. As it runs through the file, it builds up an incremental dictionary of previously-seen strings and outputs codes corresponding to the dictionary entries. And then, at the end, when you start to wonder how to encode this big dictionary so the decompressor can use it to make sense of the codes, you just throw the dictionary away. Cute.
Of course it turns out that things are cleverly designed so that the decompressor can build up an identical dictionary as it goes along, so there’s no problem. This algorithm was patent-encumbered for a while, so it didn’t get as widely adopted as it might otherwise have been, but you might start seeing more of it these days.
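Here is a sketch of how that dictionary trick plays out, assuming this is the textbook LZW scheme. It works on plain Python integers rather than a packed bitstream, and it leaves out the code-width growth and dictionary resets that real formats add, so treat it as an outline of the idea only. The function names are mine.

```python
def lzw_encode(data: bytes) -> list:
    """Emit dictionary codes, adding one new string per code emitted."""
    table = {bytes([i]): i for i in range(256)}
    codes, current = [], b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            table[candidate] = len(table)   # new entry; never transmitted
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes

def lzw_decode(codes: list) -> bytes:
    """Rebuild the same dictionary while reading the codes."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the entry being created right now.
            entry = previous + previous[:1]
        else:
            raise ValueError("bad LZW code")
        out += entry
        table[len(table)] = previous + entry[:1]
        previous = entry
    return bytes(out)

text = b"abracadabra abracadabra abracadabra"
codes = lzw_encode(text)
print(len(text), "bytes ->", len(codes), "codes")
print(lzw_decode(codes) == text)
```

The elif branch is the one piece that trips people up: the encoder can emit a code one step before the decoder has finished defining it, and the decoder has to work out that entry from what it has just output.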
I’ve focused here on lossless general-purpose compression: the sorts of things that are done to data at the archive level. There is also a stage below this, where data can be compressed before even being put into the archive: making raw images into JPEGs, PNGs, or other compressed image formats, and converting sounds to MP3s, OGGs, and so forth. In many cases those compression steps are just a lossy approximation to the original data, which is okay for graphics and sounds but bad for other files.
In a later installment, I’ll be tackling image formats in particular in more detail, since you will tend to run across custom ones a lot, some of which include image-specific processing steps (like, say, subtracting pixels from their neighbors) which wouldn’t make a lot of sense in a more general-purpose compression algorithm. Encryption is another later topic, since sometimes that will keep you from being able to recognize compressed data for what it is.
And naturally, if you’ve run across other general compression algorithms used in games you’ve looked at, please mention them in the comments, since I don’t pretend to have investigated all the games out there… I’m still being surprised all the time.