INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION
CODING OF MOVING PICTURES AND AUDIO
July 2005, Poznan, Poland
Title: Tutorial on MPEG Surround Audio Coding
Low bitrate audio coding has been an active area of research for more than 15 years. MPEG audio technology has been at the forefront of this advance, creating technology such as MPEG-1 Audio (including the well-known MPEG-1 Layer III or “MP3” specification), MPEG-4 Advanced Audio Coding (AAC) and MPEG-4 High-Efficiency AAC (HE-AAC). The most recent effort within MPEG to progress the state of the art is the “MPEG Surround” work item, which provides an extremely efficient method for coding of multi-channel sound via the transmission of a compressed stereo (or even mono) audio program plus a low-rate side-information channel. In this way backward compatibility is retained to pervasive stereo playback systems while the side information permits next-generation players to present a high-quality multi-channel surround experience.
Today, the vast majority of audio playback equipment, both professional and consumer, uses traditional two-channel presentations (stereo). Stereo has been a mainstream consumer format for more than 40 years, and so it is not surprising that there is a search for new technologies that further enhance the listener experience. Along with other types of refinements, such as longer audio sample word lengths and higher sampling rates (“high resolution audio”), the move towards more reproduction channels (“multi-channel audio” or “surround sound”) is quite visible in the marketplace. Consumers can buy inexpensive 5.1-channel playback systems from mass-market retailers and even 7.1-channel systems are becoming common. However, a non-disruptive transition from stereo to multi-channel audio requires media formats that can serve both those using conventional stereo equipment and those using next-generation multi-channel equipment. While some recent consumer media, such as DVD-Video, DVD-Audio and Super Audio CD, resolve the problem by storing both stereo and multi-channel versions of the sound material, this is not a viable option for applications that have to work under severe transmission channel bandwidth limitations, such as digital audio and TV broadcasting or Internet streaming.
This paper describes the new MPEG Surround technology that is in the process of being standardized in MPEG and which promises to provide an efficient bridge between stereo and multi-channel presentations in low-bitrate applications. The MPEG Surround technology supports very efficient parametric coding of multi-channel audio signals, so as to permit transmission of such signals over channels that typically support only the transmission of stereo (or even mono) signals. Moreover, MPEG Surround Coding is able to provide complete backward compatibility with non-multi-channel audio systems: while legacy receivers decode an MPEG Surround bitstream as stereo, enhanced receivers provide multi-channel output.
MPEG Surround coding can be viewed as an enhancement of known techniques, such as a more flexible method of joint stereo coding [1,2], a generalization of MPEG Parametric Stereo [3,4], or an extension of Binaural Cue Coding (BCC) [5,6,7]. Alternatively, MPEG Surround coding can be considered an extension of well-known matrixed surround schemes (e.g. Dolby Surround/Prologic, Logic 7, Circle Surround) [8,9] to include the transmission of spatial cue side information that guides the multi-channel reconstruction process. The advantage of such an extension is that MPEG Surround coding does not require the manipulation of phase differences between the two channels of a stereo signal for encoding spatial information (as does, e.g. Dolby Prologic), and so makes it possible to use even a single (monophonic) audio channel (in which phase differences are undefined) as the basis for reconstructing a multi-channel output.
MPEG Surround coding exploits our ability to perceive sound in three dimensions and captures that perception in a compact set of parameters. Well-known perceptual audio coders, such as MP3, primarily exploit a single channel’s ability to mask its own quantization noise, optionally with some stereo channel coding. In contrast, spatial perception is primarily attributed to three parameters, or cues, describing how humans localize sound in the horizontal plane: inter-aural level differences (ILD), inter-aural time differences (ITD) and inter-aural coherence (IC). These three concepts are illustrated in Figure 1, which schematically shows a human head and a distant sound source. Direct, or first-arrival, wave fronts from the source impinge on the left ear first, while direct sound received by the right ear is diffracted around the head, with an associated time delay and level attenuation. These two effects result in the previously mentioned ITD and ILD cues associated with a given source. Finally, if the sound is from a point source in a reverberant environment, reflected sound may impinge on both ears, or if the sound is from a diffuse source, non-correlated sound may impinge on both ears, either of which gives rise to the previously mentioned IC cue.
Figure 1 - Illustration of ILD, ITD and IC.
MPEG Surround exploits inter-channel differences in level, phase and coherence equivalent to the ILD, ITD and IC cues to capture the spatial image of a multi-channel audio signal relative to a transmitted downmix signal and encodes these cues in a very compact form such that the cues and the transmitted signal can be decoded to synthesize a high quality multi-channel representation. This is illustrated in Figure 2.
Figure 2 – Principles of MPEG Surround Coding
The MPEG Surround encoder receives a multi-channel audio signal, x1 to xN, where N is the number of input channels (e.g. 5.1). A key aspect of the encoding process is that a downmix signal, xt1 and xt2, which is typically stereo (but could also be mono), is derived from the multi-channel input signal, and it is this downmix signal that is compressed for transmission over the channel rather than the multi-channel signal. The encoder may be able to exploit the downmix process to advantage, such that it creates a faithful equivalent of the multi-channel signal in the mono or stereo downmix, and also creates the best possible multi-channel decoding based on the downmix and encoded spatial cues. Alternatively, the downmix could be supplied externally (Artistic Downmix in Figure 2). The MPEG Surround encoding process is agnostic to the compression algorithm used for the transmitted channels (Audio Encoder and Audio Decoder in Figure 2); it could be any of a number of high-performance compression algorithms such as MPEG-1 Layer III, MPEG-4 AAC or MPEG-4 High Efficiency AAC, or it could even be PCM.
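As a rough illustration of how a stereo downmix can be derived from a 5.1 input, the following Python sketch applies a fixed downmix matrix. The channel order and the -3 dB gains (in the style of the common ITU-R BS.775 downmix) are assumptions made for this example only; the text above does not specify the downmix used by the MPEG Surround encoder, and the function name is hypothetical.

```python
import numpy as np

# Assumed channel order for this sketch: L, R, C, LFE, Ls, Rs.
# The -3 dB gains follow the common ITU-R BS.775 style downmix; they are
# an assumption here and are not taken from the MPEG Surround work item.
G = 1.0 / np.sqrt(2.0)
DOWNMIX = np.array([
    # L    R    C   LFE   Ls   Rs
    [1.0, 0.0, G, 0.0, G, 0.0],  # left downmix channel
    [0.0, 1.0, G, 0.0, 0.0, G],  # right downmix channel
])


def downmix_to_stereo(multichannel):
    """Map a (6, num_samples) array to a (2, num_samples) stereo downmix."""
    multichannel = np.asarray(multichannel, dtype=float)
    if multichannel.shape[0] != 6:
        raise ValueError("expected 6 input channels (5.1)")
    return DOWNMIX @ multichannel
```

In this sketch the centre channel is shared equally between the two downmix channels and the LFE channel is discarded; an externally supplied artistic downmix would simply replace the output of this function.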
A key aspect of the MPEG Surround technique is that the transmitted downmix (e.g. stereo) is an excellent stereo version of the multi-channel signal. In the case that an artistic downmix is available, it would be the preferred stereo presentation of the multi-channel signal. As shown in Figure 2, this is available as the output signal from the Audio Decoder. Hence stereo decoder equipment is in no way disadvantaged relative to MPEG Surround decoders. This is vital, since stereo presentation will remain pervasive due to the number of applications in which listening is primarily via headphones, such as portable music players. Additionally, MPEG Surround supports a mode in which the downmix is compatible with popular matrix surround decoders, e.g. Dolby Surround.
The heart of the encoding process is the extraction of the spatial cues from the multi-channel input signal. This captures the most salient perceptual aspects of the multi-channel sound image, including level or intensity differences, phase differences and inter-channel correlation or coherence, which are encoded in a very compact manner and transmitted as side information. In the MPEG Surround decoder, the cue parameters are used to expand the decoded downmix signal into a high quality multi-channel output.
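To make the notion of a spatial cue concrete, the Python sketch below estimates two of the cues named above, a level difference and a coherence value, for one region of two channel signals. It is a minimal illustration under stated assumptions: the function name is hypothetical, coherence is approximated by the normalized correlation at zero lag, phase differences are not estimated, and none of this is taken from the MPEG Surround encoder itself.

```python
import numpy as np


def spatial_cues(ch1, ch2, eps=1e-12):
    """Return (level difference in dB, normalized correlation) for one region.

    ch1 and ch2 are equal-length sample arrays covering the same region.
    eps guards against division by zero for silent input.
    """
    ch1 = np.asarray(ch1, dtype=float)
    ch2 = np.asarray(ch2, dtype=float)
    p1 = np.sum(ch1 ** 2)  # energy of channel 1
    p2 = np.sum(ch2 ** 2)  # energy of channel 2
    level_diff_db = 10.0 * np.log10((p1 + eps) / (p2 + eps))
    coherence = np.sum(ch1 * ch2) / np.sqrt((p1 + eps) * (p2 + eps))
    return level_diff_db, coherence
```

Only two numbers per region are produced, which gives a feel for why such side information can be far more compact than the audio channels it describes.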
Spatial cues are used to upmix the stereo (or mono) transmitted signal to e.g. a 5.1 channel signal. This operation is typically done in the time/frequency domain as is shown in Figure 3. Here an analysis filterbank converts the input signal into two channels of high-resolution time/frequency representation, where the upmix occurs as shown in Figure 4, after which the synthesis filterbank converts the 6 channels of time/frequency data into a 5.1 channel audio signal.
Figure 3 - Block Diagram of Surround Synthesis
The upmix process applies scaling and decorrelation operations to regions of the stereo time/frequency signal to form the appropriate regions of the 5.1 channel time/frequency signals, as shown in Figure 4. This can be performed via two linear matrix operations (M1 and M2) and a set of decorrelation filters (D1-D3). Note that the audio signals (L, R, C, Ls, Rs, lfe) are in the time/frequency domain. Signals res0, res1 and res2 are optional, and support scalability up to transparent quality.
Figure 4 - Block Diagram of Upmix
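The scaling and decorrelation idea can be shown on a much smaller scale than the stereo to 5.1 upmix of Figure 4. The Python sketch below rebuilds two channels from a single signal and a decorrelated copy of it, so that the outputs have a requested level difference and coherence. It is an illustrative analogue only, not the M1/M2 matrices or the D1-D3 filters of the figure; the function name, the gain normalization and the assumption that a suitable decorrelated signal is supplied by the caller are all choices made for this example.

```python
import numpy as np


def upmix_one_to_two(mono, decorrelated, level_diff_db, coherence):
    """Rebuild two channels from one signal plus a decorrelated copy.

    Assumes `decorrelated` has the same power as `mono` and is
    uncorrelated with it (the role a decorrelation filter would play).
    """
    mono = np.asarray(mono, dtype=float)
    decorrelated = np.asarray(decorrelated, dtype=float)
    ratio = 10.0 ** (level_diff_db / 20.0)           # amplitude ratio ch1/ch2
    g2 = np.sqrt(2.0 / (1.0 + ratio ** 2))           # gains are normalized so
    g1 = ratio * g2                                  # that g1**2 + g2**2 == 2
    # Mixing angle chosen so the output correlation equals cos(2 * angle).
    angle = 0.5 * np.arccos(np.clip(coherence, -1.0, 1.0))
    ch1 = g1 * (np.cos(angle) * mono + np.sin(angle) * decorrelated)
    ch2 = g2 * (np.cos(angle) * mono - np.sin(angle) * decorrelated)
    return ch1, ch2
```

With a coherence of 1 the decorrelated signal is not used and the two outputs are scaled copies of the input; lower coherence values mix in progressively more of the decorrelated signal with opposite signs in the two outputs.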
Motivated by the demonstrated potential of what was then called the Spatial Audio Coding approach, ISO/MPEG issued a Call for Proposals on MPEG Spatial Audio Coding in March 2004. There were four responses to the Call, which were evaluated with respect to a number of performance criteria, including the subjective quality of the decoded multi-channel audio signal, the subjective quality of the downmix signals, the spatial cue side information bitrate and other parameters, such as additional functionality and computational complexity.
Based on these performance criteria, MPEG decided that the technology that would be the starting point of the standardization process, called Reference Model 0 (RM0), would be a combination of the submissions from two proponents: Fraunhofer IIS/Agere Systems and Coding Technologies/Philips. These systems not only outperformed the other submissions but also showed complementary performance in terms of per-item quality, bitrate and complexity. The merge to form RM0 has been completed, and listening tests to check the performance of RM0 indicated that the design goals have been met: RM0 successfully combines the best features of the individual systems and provides sound quality substantially surpassing existing matrixed surround solutions, even for the transmission of a mono downmix signal or for spatial cue bitrates as low as 6 kbit/s. It serves as the basis for the further technical development within MPEG. It was furthermore decided that, going forward, the technology would be referred to as MPEG Surround. A description of the RM0 technology can be found in .
The vast majority of current audio decoding and playback systems are stereo, so that any transition to multi-channel audio capability should use existing transmission channels and be compatible with existing playback devices. MPEG Surround satisfies exactly these requirements, in that it transmits a stereo signal plus a small amount of side information which together can easily be carried over a channel that currently carries a compressed stereo signal. For multi-channel applications requiring the lowest possible bitrate, MPEG Surround Coding based on a single transmitted channel can be used. This results in a bitrate saving of approximately 85% as compared to a discrete 5.1 multi-channel transmission. MPEG Surround is appropriate for all applications in which both compression efficiency and backward compatibility are important.
The following are examples of application areas in which MPEG Surround Coding holds promise.
Digital Audio Broadcasting: Due to the relatively small channel bandwidth, the relatively large cost of transmission equipment and transmission licenses and the desire to maximize user choices by providing many programs, the majority of existing or planned digital broadcasting systems cannot provide multi-channel sound to the users. Adding this feature could be a strong motivation for users to make the transition from their traditional FM receivers to new digital receivers. MPEG Surround technology could be a key factor in increasing the attractiveness of digital radio systems since it provides functionality not obtainable in other ways.
Although the tremendous popularity of the DVD and associated home theatre playback systems has given rise to a large installed base of multi-channel capable audio playback systems in the home, perhaps the greatest potential for multi-channel digital audio broadcast is playback in the car, since the majority of radio listening occurs in cars. The automotive environment is ideal for enjoying multi-channel music: newer automobiles typically have five or more loudspeakers plus a sub-woofer, installed in a much better configuration than in a typical home, and the automotive listener is in a more stable and predictable position relative to those loudspeakers. Furthermore, surround sound provides a larger optimum listening area (i.e. sweet spot), which greatly improves the spatial imaging in the case of an off-center listener position. Audio is the time-tested accompaniment to driving, and high quality 5.1 surround is a natural next step. First field tests, based on MPEG Surround Sound Coding with MPEG-1 Layer II and the Eureka 147 DAB system, have been successfully conducted together with a major car manufacturer in the Munich area in Germany.
The backwards compatibility of MPEG Surround Sound Coding to existing stereo digital radio receivers is one of the key factors for existing digital audio broadcasting systems. This compatibility approach has the following advantages:
MPEG Surround Sound Coding is also very well suited to introducing multichannel audio in the new portable and mobile digital multimedia broadcasting systems, such as DMB or DxB. It is very unlikely that the bit-rate required by conventional 5.1 multichannel coding systems will be available on these systems, because most of the available bit-rate will be used for the video and interactive data services. Thus, the new MPEG Surround Sound Coding will be the ideal solution for including 5.1 multichannel audio in DMB or DxB.
Digital TV Broadcasting: Currently, the majority of digital TV broadcasts use stereo audio coding. MPEG Surround constitutes an excellent opportunity to extend these established services to surround sound. For a small additional overhead in bit-rate, MPEG Surround enables stereo audio presentations in existing and new services to be smoothly upgraded to multi-channel audio, while maintaining stereo backwards compatibility without any restriction or quality degradation for the installed receiver base. New receivers incorporating MPEG Surround will be able to deliver multichannel audio using the same transmitted signal. Hence MPEG Surround is of interest to any digital TV broadcasting system, either because of the possibility of performing a backward compatible upgrade of the receivers or because of the high efficiency of the multi-channel transmission.
As an example, there are several advantages of MPEG Surround in the context of the DVB (Digital Video Broadcasting) system. In DVB-T (Terrestrial) applications, high audio quality is required while the bit budget is restricted. Current multi-channel services in DVB employ an additional, optional multi-channel audio coder at 448 kbps next to the stereo audio, which is typically coded at 192-256 kbps using MPEG-1 Layer II. With MPEG Surround operating in its high quality mode using residual coding, the stereo audio can be upgraded to a high quality multi-channel audio signal for as little as 32 kbps extra. This makes the additional, optional coded multi-channel audio and simulcast obsolete, resulting in a huge bit-rate saving while retaining full compatibility with existing stereo receivers. Another significant advantage of avoiding a simulcast of the stereo and multichannel service is the reduction of operational complexity and cost it affords. Specifically, in existing 5.1 multichannel services in DVB, the synchronization of the multichannel service, which is contained in a separate data stream, with the video service is in practice quite a complex issue. For new DVB services requiring very low bit-rates, e.g. in portable applications using DVB-H (Handheld), MPEG-4 HE-AAC is the preferred choice. Here too, a stereo service can be extended to multi-channel using MPEG Surround, in a mode in which the spatial parameters take up as little as a few kbps on top of the stereo encoded signal. Additionally, the specific combination of HE-AAC and MPEG Surround provides a reduction in computational complexity.
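The size of the DVB-T saving can be checked with simple arithmetic on the figures quoted above. The short Python sketch below compares the total audio bit-rate of a stereo plus multi-channel simulcast with that of stereo plus MPEG Surround side information, for both ends of the quoted stereo range; the constant and function names are hypothetical and introduced only for this illustration.

```python
# Figures quoted in the text, all in kbps.
MULTICHANNEL_SIMULCAST = 448  # separate, optional multi-channel coder
SURROUND_SIDE_INFO = 32       # MPEG Surround high quality mode


def compare(stereo_kbps):
    """Return (simulcast total, MPEG Surround total, saving in percent)."""
    simulcast = stereo_kbps + MULTICHANNEL_SIMULCAST
    surround = stereo_kbps + SURROUND_SIDE_INFO
    saving = 100.0 * (simulcast - surround) / simulcast
    return simulcast, surround, saving


for stereo in (192, 256):
    simulcast, surround, saving = compare(stereo)
    print(f"stereo {stereo}: {simulcast} vs {surround} kbps, saving {saving:.0f}%")
```

For stereo at 192 kbps the totals are 640 kbps versus 224 kbps, and at 256 kbps they are 704 kbps versus 288 kbps, so the total audio bit-rate falls by roughly 65% and 59% respectively.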
Music download service: Currently, a number of commercial music download services are available and working with considerable commercial success. Such services could be seamlessly extended to provide multi-channel presentations while remaining compatible with stereo players: on computers with 5.1 channel playback systems the compressed sound files are presented in surround sound while on portable players the same files are reproduced in stereo.
Streaming music service / Internet radio: Many Internet radios operate with severely constrained transmission bandwidth, such that they can offer only mono or stereo content. MPEG Surround Coding technology could extend this to a multi-channel service while still remaining within the permissible operating range of bitrates. Since efficiency is of paramount importance in this application, compression of the transmitted audio signal is vital. Using recent MPEG compression technology (MPEG-4 High Efficiency Profile coding), full MPEG Surround systems have been demonstrated with bitrates as low as 48 kb/s.
Teleconferencing: Teleconferencing is increasingly important in today’s global business environment, but most commercial systems still operate over a single audio channel having rather low bandwidth. MPEG Surround could expand the sound image via a multi-channel presentation and thus afford a better subjective separation of the audio contributions of each speaker in the teleconference.
Audio for Games: Many personal computers are used primarily as gaming engines and are equipped with 5.1 channel audio presentation systems. Synthesizing 5.1 sound from a backward compatible stereo signal would permit efficient storage of multi-channel gaming sound.
MPEG Surround, currently a new MPEG Audio work item, is the latest technology for bitrate-efficient and backward compatible presentation of multi-channel audio. It has been demonstrated to deliver high-quality multi-channel audio at bitrates as low as 48 kbps, which was not conceivable as little as two years ago. Many applications will benefit from this technology, especially those that are currently based on the presentation of mono or stereo signals, since these same transmission channels would now be able to carry multi-channel presentations for new decoder equipment, while retaining high-quality mono or stereo presentation for reproduction by legacy equipment.