Mission: Compressible -- Achieving Full-Motion Video on the Nintendo 64
Resident Evil 2 for the Nintendo 64 was the first game on a cartridge-based console system to deliver full-motion video. Angel Studios' team brought this two-CD game, comprising 1.2GB of data, to a single 64MB cartridge. A significant portion of this data was more than 15 minutes of cutscene video. Achieving this level of compression, meeting the stringent requirement of 30Hz playback, and delivering the best video quality possible was a considerable challenge.
To look at this challenge another way, let's put it into numerical perspective. The original rendered frames of the video sequences were 320x160 pixels at 24-bit color = 153,600 bytes/frame. On the Nintendo 64, Resident Evil 2's approximately 15 minutes of 30Hz video add up to 15 x 60 x 30 x 153,600 = 4,147,200,000 bytes of uncompressed data. Our budget on the cartridge was 25,165,824 bytes (24MB), so I had to achieve a compression ratio of roughly 165:1. Worse still, I had to share this modicum of cartridge real estate with the movie audio.
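As a quick check of the arithmetic above, the following sketch recomputes the figures from the numbers quoted in the text. The variable names are mine, and the per-frame average at the end is only a derived illustration: it ignores the audio that shares the same budget.

```python
# All inputs are the figures quoted in the paragraph above.
WIDTH, HEIGHT = 320, 160
BYTES_PER_PIXEL = 3            # 24-bit color
FPS = 30
MINUTES = 15                   # approximate running time of the video
BUDGET_BYTES = 25_165_824      # 24 * 1024 * 1024

bytes_per_frame = WIDTH * HEIGHT * BYTES_PER_PIXEL
total_frames = MINUTES * 60 * FPS
raw_bytes = bytes_per_frame * total_frames
ratio = raw_bytes / BUDGET_BYTES

# Average compressed size available per frame if video had the whole budget.
avg_bytes_per_frame = BUDGET_BYTES / total_frames

print(f"{bytes_per_frame:,} bytes/frame uncompressed")
print(f"{raw_bytes:,} bytes uncompressed in total")
print(f"compression ratio needed: about {ratio:.0f}:1")
print(f"average compressed frame: about {avg_bytes_per_frame:.0f} bytes")
```

Seen this way, each 153,600-byte frame has to fit, on average, in less than a kilobyte.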
The PlayStation version of Resident Evil 2 displays its video with the assistance of a proprietary MDEC chip, but the N64 has no dedicated decompression hardware, which compounded our challenge further. To better understand the magnitude of the implementation hurdles, consider that the task is analogous to performing full-screen MPEG decompression at 30Hz, in software, on a CPU roughly equivalent in power to an Intel 486. Fortunately, the N64 has a programmable signal processor, the RSP, that can run in parallel with the CPU.
A Brief JPEG Primer
In order to simplify the timing and synchronization issues, I chose an MPEG-1-style (henceforth referred to as MPEG) compression scheme for the video content only. (Audio was handled separately, which I'll discuss later in this article.)
As an introduction to the relatively complex issues of applying MPEG compression to the video sequences of the game, let me present a brief primer on JPEG compression.
First, the image is converted from RGB into YCbCr. This process converts the RGB information into luminance information (Y) and chromaticity (Cb and Cr):
Inverting the coefficient matrix and applying it to YCbCr gives the inverse transformation. This color model exploits properties of our visual system. Since the human eye is more sensitive to changes in luminance than to changes in color, I can devote more of the bandwidth to Y than to Cb and Cr. In fact, I can halve the size of the image with no perceptible loss in image quality by storing only the unweighted average of each 2x2-pixel block of chromaticity information. This way the Cb and Cr planes are each reduced to 25 percent of their original size. If each of the three components (Y, Cb, Cr) represents 1/3 of the original picture information, the subsampled version adds up to 1/3 + 1/12 + 1/12 = 1/2 the original size.
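The color conversion and the 2x2 chroma averaging can be sketched as follows. The coefficients are the widely used JFIF full-range ones, which is an assumption on my part: the article's own matrix is not reproduced in this text and may have differed. The function names are hypothetical.

```python
# Assumed coefficients: the common JFIF full-range RGB <-> YCbCr conversion.
def rgb_to_ycbcr(r, g, b):
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    # Inverse of the matrix above.
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return r, g, b


def subsample_2x2(plane):
    """Replace every 2x2 block of a plane by its unweighted average.

    `plane` is a list of rows; width and height are assumed to be even.
    """
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


# A 4x4 chroma plane shrinks to 2x2, i.e. 25 percent of its samples.
cb_plane = [[1, 3, 10, 10],
            [5, 7, 10, 10],
            [0, 0, 20, 22],
            [0, 0, 24, 26]]
small = subsample_2x2(cb_plane)

# Samples kept for a 4x4 image: full-size Y plus quarter-size Cb and Cr.
kept = 16 + 2 * len(small) * len(small[0])
print(small, kept / (3 * 16))
```

For the 4x4 example, 24 of the original 48 samples survive, which matches the 1/3 + 1/12 + 1/12 = 1/2 figure above.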
Second, each component is broken up into blocks of 8x8 pixels. Each 8x8 block can be represented by 64 point values denoted by this set:
where x and y are the two spatial dimensions. The discrete cosine transform (DCT) transforms these values to the frequency domain as c = g(Fu,Fv), where c is the coefficient and Fu and Fv are the respective spatial frequencies for each direction:
The output of this equation is another set of 64 values known as the DCT coefficients. Each coefficient is the amplitude of a particular spatial frequency, no longer the amplitude of the signal at the sampled position (x,y). The coefficient corresponding to vector (0,0) is the DC coefficient (the DCT coefficient for which the frequency is zero in both dimensions) and the rest are the AC coefficients (DCT coefficients for which the frequency is nonzero in one or both dimensions). Because sample values typically vary gradually from point to point across an image, the DCT processing makes the data compressible by concentrating most of the signal in the lower values of the (u,v) space. For a typical 8x8 sample block, many, if not most, of the (u,v) pairs have zero or near-zero coefficients and therefore need not be encoded. This fact is exploited with run-length encoding.
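To make the energy-compaction claim concrete, here is a minimal sketch of the standard 8x8 forward DCT (the DCT-II with the normalization JPEG uses), written as a direct quadruple loop rather than a fast algorithm. It is an illustration only, not the encoder used for the game, and the names `dct_8x8` and `ramp` are mine.

```python
import math

N = 8


def dct_8x8(block):
    """Forward 2-D DCT-II of an 8x8 block given as 8 rows of 8 samples.

    Returns coeffs[v][u], where u is the horizontal and v the vertical
    frequency index, so coeffs[0][0] is the DC coefficient.
    """
    def scale(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    coeffs = [[0.0] * N for _ in range(N)]
    for v in range(N):
        for u in range(N):
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            coeffs[v][u] = 0.25 * scale(u) * scale(v) * total
    return coeffs


# A smooth horizontal ramp: samples change gradually from left to right.
ramp = [[16 * x for x in range(N)] for _ in range(N)]
coeffs = dct_8x8(ramp)

near_zero = sum(1 for row in coeffs for c in row if abs(c) < 1e-6)
print(f"DC = {coeffs[0][0]:.1f}, near-zero coefficients: {near_zero} of 64")
```

Because the ramp does not vary vertically at all, every coefficient with a nonzero vertical frequency vanishes, and the signal collapses into a handful of values in the first row.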
Next, the 64 output values from the DCT are quantized element by element with an 8x8 quantization matrix: each coefficient is divided by the matching matrix entry and rounded. The quantization compresses the data even further by representing DCT coefficients with precision no greater than is necessary to achieve the desired image quality. This tunable level of precision is what you modify when you move the JPEG compression slider up and down in Photoshop when you save an image.
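A minimal sketch of per-element quantization follows. The quantization matrix here is a made-up one that simply grows with frequency; it is not the table from the JPEG specification, nor the one used in the game, and `make_qmatrix` and its `scale` parameter are hypothetical stand-ins for a quality slider.

```python
def make_qmatrix(scale=1.0):
    """Hypothetical quantization matrix: step size grows with u + v."""
    return [[max(1, round((8 + 4 * (u + v)) * scale)) for u in range(8)]
            for v in range(8)]


def quantize(coeffs, qmatrix):
    """Divide each coefficient by its step size and round to an integer."""
    return [[round(coeffs[v][u] / qmatrix[v][u]) for u in range(8)]
            for v in range(8)]


def dequantize(levels, qmatrix):
    """Decoder side: multiply the integer levels back by the step sizes."""
    return [[levels[v][u] * qmatrix[v][u] for u in range(8)]
            for v in range(8)]


# A sparse set of coefficients, as a smooth block might produce.
coeffs = [[0.0] * 8 for _ in range(8)]
coeffs[0][0] = 448.0   # DC
coeffs[0][1] = 100.0   # low horizontal frequency
coeffs[7][7] = 3.0     # small high-frequency detail

q = make_qmatrix()
levels = quantize(coeffs, q)
restored = dequantize(levels, q)
print(levels[0][0], levels[0][1], levels[7][7])
print(restored[0][0], restored[0][1], restored[7][7])
```

The small high-frequency coefficient rounds to zero and is gone for good, while the low-frequency value comes back slightly altered (100 becomes 96). A larger `scale` widens the steps and discards more.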
In the third step (ignoring the detail that the DC components are difference-encoded), all of the quantized coefficients are ordered into a "zigzag" sequence. Since most of the information in a typical 8x8 block is stored in the top-left corner, this approach maximizes the effectiveness of the subsequent run-length encoding step. Then the data from all blocks is encoded with a Huffman or arithmetic scheme. Figure 1 summarizes this encoding process.
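The zigzag ordering and a simplified run-length pass can be sketched like this. The run-length format, (number of preceding zeros, value) pairs ending in an "EOB" marker, is a simplification of my own for illustration: real JPEG codes the DC term separately and feeds run/size symbols to the entropy coder.

```python
def zigzag_order(n=8):
    """Return (row, col) positions of an n x n block in zigzag order."""
    order = []
    for s in range(2 * n - 1):          # s = row + col selects an anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        diagonal = [(r, s - r) for r in rows]
        # Even diagonals run from bottom-left to top-right,
        # odd diagonals from top-right to bottom-left.
        order.extend(reversed(diagonal) if s % 2 == 0 else diagonal)
    return order


def run_length(sequence):
    """Encode as (zeros_skipped, value) pairs; trailing zeros become 'EOB'."""
    pairs, run = [], 0
    for value in sequence:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    pairs.append("EOB")
    return pairs


# Quantized block with its few nonzero levels near the top-left corner.
levels = [[0] * 8 for _ in range(8)]
levels[0][0] = 56
levels[0][1] = 8
levels[2][0] = -3

sequence = [levels[r][c] for r, c in zigzag_order()]
print(run_length(sequence))
```

Because the zigzag visits low frequencies first, the 61 zeros of this block fall into one short run and one long trailing run, and the 64 values reduce to three pairs plus the end-of-block marker.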
Both JPEG and MPEG are "lossy" compression schemes, meaning that the original image can never be reproduced exactly after being compressed. Information is lost during JPEG compression at several points: chromaticity subsampling, quantization, and floating-point inaccuracy during the DCT.