HP Image and Data Compression Conference
March 13, 1997, Palo Alto, CA, USA.

Transcoding MPEG Video Streams in the Compressed Domain

Susie J. Wee
Hewlett-Packard Laboratories
1501 Page Mill Road, MS 3U-4
Palo Alto, CA 94304
(415) 857-7585
swee@hpl.hp.com

Bhaskaran Vasudev
Hewlett-Packard Laboratories
1501 Page Mill Road, MS 3U-4
Palo Alto, CA 94304
(415) 857-7153
bhaskara@hpl.hp.com
Sam Liu
Hewlett-Packard Laboratories
1501 Page Mill Road, MS 3U-3
Palo Alto, CA 94304
(415) 857-2812
liu@hpl.hp.com

Abstract

Compressed digital video is quickly becoming ubiquitous for video storage and communication. It is often desirable to perform various video processing tasks; however, operations once considered simple, such as translation and downscaling, become less straightforward when applied to compressed video. The naive solution of completely decompressing the video into the pixel domain, performing the processing, and recompressing the entire video sequence is inefficient and impractical.

We are investigating the problem of transcoding compressed video data from one standards-compliant data stream to another. The goal is to develop video transcoding algorithms that achieve high performance with computational efficiency. We have developed a number of compressed-domain algorithms that achieve this goal. We have implemented a software version of the video transcoder, capable of transcoding MPEG-compliant data streams between a number of modes. We are actively developing our algorithms so that this software will become a tool useful for both practical applications and research.


Introduction

Video compression algorithms are being used to compress digital video for a wide variety of applications, including video delivery over the internet, advanced television broadcasting, and video storage and editing. The performance of modern compression algorithms such as MPEG is quite impressive -- raw video data rates can often be reduced by factors of 15 to 80 without significant loss in reconstructed video quality. However, the use of these compression algorithms often makes other processing tasks quite difficult. For example, many operations once considered simple, such as splicing and downscaling, are much more complicated when applied to compressed video streams.

The goal of transcoding is to process one standards-compliant video stream into another standards-compliant video stream that has properties better suited for a particular application. This is useful for a number of applications. For example, a video server transmitting video over the internet may be restricted by stringent bandwidth requirements. In this scenario, a high-quality compressed bitstream may need to be transcoded to a lower-rate compressed bitstream prior to transmission; this can be achieved by lowering the spatial or temporal resolution of the video or by requantizing the MPEG data. Another application may require MPEG video streams to be transcoded into streams that facilitate video editing functionalities such as splicing or fast-forward and reverse play; this may involve removing the temporal dependencies in the coded data stream. Finally, in a video communication system, the transmitted video stream may be subject to harsh channel conditions resulting in data loss; in this instance it may be useful to create a standards-compliant video stream that is more robust to channel errors.

In this paper, we present a series of compressed-domain image and video transcoding algorithms that were designed with the goal of achieving high performance with computational efficiency. We focus on video compression algorithms that rely on the block discrete cosine transform (DCT) and motion-compensated prediction. This is applicable to a number of predominant image and video coding standards including JPEG, MPEG-1, MPEG-2, H.261, and H.263. In the remainder of this paper we will focus on MPEG; however, many of these results apply to the other standards as well.

This paper begins with a brief description of some aspects of the MPEG video compression algorithm that are relevant to this work. The transcoding problem is then defined and the significance of compressed-domain transcoding is discussed. We provide a brief description of some of the image (intraframe) transcoding algorithms developed in earlier stages of this project. We then describe the video (interframe) transcoding algorithms that are currently under development.


MPEG Video Compression

Our transcoding algorithms are designed to exploit various features of the MPEG video compression standard. Detailed descriptions of the MPEG video compression standard can be found in [Vas95, Mitch97]. In this section, we will briefly discuss some aspects of the standard that are relevant to this work.

MPEG codes video in a hierarchy of units called sequences, groups of pictures (GOPs), pictures, slices, and macroblocks. Each 16 by 16 block of pixels in the original video frames is coded as a macroblock. The macroblocks are scanned in a left-to-right, top-to-bottom fashion, and a series of these macroblocks forms a slice. All the slices in a frame comprise a picture, contiguous pictures form a GOP, and all the GOPs form the entire sequence. The MPEG syntax allows a GOP to contain any number of frames, but typical sizes range from 9 to 15 frames. Each GOP refreshes the temporal prediction by coding the first frame in intraframe mode. The remaining frames in the GOP can be coded with intraframe or interframe (predictive) coding techniques.

The MPEG algorithm allows each frame to be coded in one of three modes: intraframe (I), forward prediction (P), and bidirectional prediction (B). A typical IPB pattern in display order is:

I0 B1 B2 P3 B4 B5 P6 B7 B8 P9 B10 B11 I0 B1 B2 P3 B4 B5 P6 B7 B8 P9 B10 B11 I0

I frames are coded independently of other frames. P frames depend on a prediction based on the preceding I or P frame. B frames depend on a prediction based on the preceding and following I or P frames. Notice that each B frame depends on data from a future frame, i.e. a future frame must be (de)coded prior to (de)coding the current B frame. For this reason, the coding order is distinguished from the display order. The coding order for the sequence shown above is:

I0 P3 B1 B2 P6 B4 B5 P9 B7 B8 I0 B10 B11 P3 B1 B2 P6 B4 B5 P9 B7 B8 I0 B10 B11

MPEG requires the coded video data to be placed in the data stream in coding order.
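The display-to-coding reordering above can be sketched in a few lines. The following is a minimal Python sketch (the function name is ours), assuming each B frame waits only for its next I or P anchor to be coded:

```python
# Reorder a display-order frame list into MPEG coding order. A B frame
# cannot be emitted until its following anchor (I or P) has been emitted.

def display_to_coding_order(frames):
    """frames: list of (type, index) tuples in display order."""
    coding, pending_b = [], []
    for f in frames:
        if f[0] in ('I', 'P'):     # anchor: emit it, then the B frames waiting on it
            coding.append(f)
            coding.extend(pending_b)
            pending_b = []
        else:                      # B frame: held back until its future anchor is coded
            pending_b.append(f)
    coding.extend(pending_b)       # trailing B frames at the end of the sequence
    return coding

display = [('I', 0), ('B', 1), ('B', 2), ('P', 3), ('B', 4), ('B', 5), ('P', 6)]
print(display_to_coding_order(display))
# [('I', 0), ('P', 3), ('B', 1), ('B', 2), ('P', 6), ('B', 4), ('B', 5)]
```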

A GOP always begins with an I frame and typically includes the following (display order) P and B frames that occur before the next I frame, although the syntax allows a GOP to contain multiple I frames. The sequence shown above contains two complete GOPs plus one I frame. The GOP header does not specify the number of I, P, or B frames in the GOP, nor does it specify the structure of the GOP -- these are completely determined by the order of the data in the stream. Thus, there are no rules that restrict the size and structure of the GOP, although care should be taken to ensure that the buffer constraints are satisfied.

We find it useful to define a substructure called a group. A group is a unit that contains two consecutive I or P anchor frames and all the B frames whose prediction is based on those two anchors. The significance of the group is that it represents the smallest unit of dependencies for the B frames.

MPEG uses block motion-compensated prediction to reduce the temporal redundancies inherent to video. In block motion-compensated prediction, the current frame is divided into 16 by 16 pixel units called macroblocks. Each macroblock is compared to a number of 16 by 16 blocks in a previously coded frame, and a single motion vector (MV) is used to represent the displacement to the best-matching block. This best-matching block is used as a prediction of the current macroblock, and only the error in the prediction, called the residual, is coded into the data stream.
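As an illustration of block matching, the following Python sketch performs a full search over a small window; the function name, search range, and sum-of-absolute-differences (SAD) criterion are our choices, not mandated by MPEG:

```python
import numpy as np

# Full-search block matching for one 16x16 macroblock. The search range
# and SAD criterion are illustrative choices; MPEG does not mandate them.

def best_motion_vector(ref, cur, by, bx, search=4):
    """Return the (dy, dx) minimizing the sum of absolute differences (SAD)."""
    block = cur[by:by + 16, bx:bx + 16].astype(int)
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if 0 <= y <= ref.shape[0] - 16 and 0 <= x <= ref.shape[1] - 16:
                sad = np.abs(ref[y:y + 16, x:x + 16].astype(int) - block).sum()
                if sad < best_sad:
                    best_sad, best = sad, (dy, dx)
    return best, best_sad

rng = np.random.default_rng(0)
ref = rng.integers(0, 255, (64, 64))
cur = np.roll(ref, (2, 3), axis=(0, 1))      # cur is ref shifted down 2, right 3
print(best_motion_vector(ref, cur, 16, 16))  # recovers (-2, -3) with SAD 0
```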

The frames of a video sequence can be coded as an I, P, or B frame. In I frames, every macroblock must be coded in intraframe mode, i.e. no prediction is used. In P frames, each macroblock can be coded with forward prediction or in intraframe mode. In B frames, each macroblock can be coded with forward, backward, or bidirectional prediction or in intraframe mode. One MV is specified for each forward- and backward-predicted macroblock while two MVs are specified for each bidirectionally predicted macroblock. Thus, each P frame has a forward motion vector field and one anchor frame, while each B frame has a forward and backward motion vector field and two anchor frames.

MPEG uses DCT transform coding to code the intraframe and residual macroblocks. Four 8 by 8 block DCTs are used to encode the luminance of each macroblock, and the resulting DCT coefficients are quantized. Quantization usually results in a sparse representation of the data, i.e. one in which most of the quantized DCT coefficients are equal to zero. Thus, only the amplitudes and locations of the nonzero coefficients are coded into the data stream.
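The sparsity produced by quantization can be seen numerically. The following Python sketch is our own construction (a single illustrative quantizer step rather than an MPEG quantization matrix): it applies an orthonormal 8 by 8 DCT to a smooth block and counts the surviving coefficients.

```python
import numpy as np

# Orthonormal 8x8 DCT-II matrix: T @ block @ T.T gives the 2-D DCT.
def dct_matrix(n=8):
    T = np.array([[np.cos(np.pi * (2 * j + 1) * i / (2 * n)) for j in range(n)]
                  for i in range(n)]) * np.sqrt(2.0 / n)
    T[0] /= np.sqrt(2)
    return T

T = dct_matrix()
rng = np.random.default_rng(0)
# A smooth (low spatial frequency) block plus mild noise, typical of natural images.
x = np.linspace(0, 1, 8)
block = 100 * np.outer(np.sin(x), np.cos(x)) + rng.normal(0, 1, (8, 8))
coeffs = T @ block @ T.T
quantized = np.round(coeffs / 16)            # one illustrative quantizer step size
print(np.count_nonzero(quantized), "of 64 quantized coefficients are nonzero")
```

Most of the 64 coefficients round to zero, which is exactly what the run-length coding of nonzero amplitudes and locations exploits.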


MPEG Transcoding: Problem Statement

The goal of MPEG transcoding is to process one MPEG-compliant video stream into another MPEG-compliant video stream that has properties better suited for a particular application. Transcoding differs from the encoding and decoding processes in that both the input and output of the transcoder are MPEG video streams. A naive solution to the transcoding problem, shown in the top of Figure 1, involves the following steps: first, the MPEG-coded video stream is completely decompressed into its pixel-domain representation; this pixel-domain video is then processed with the appropriate operation; and finally the processed video is recompressed into a new MPEG video stream. Such solutions are computationally expensive and have large memory requirements. In addition, the quality of the coded video can deteriorate with each recoding cycle.

Figure 1. MPEG Transcoding: the naive solution (top) and the compressed-domain solution (bottom).

A more efficient transcoding solution, shown in the bottom of Figure 1, is to only partially decompress the video stream and perform the processing directly on the compressed-domain data. For some operations, further gains can be achieved by only processing the portions of the video stream that are affected by the transcoding operation. In this work, the affected portion of the video stream is partially decompressed into its motion vector (MV) and discrete cosine transform (DCT) representation. Computational requirements are reduced by eliminating the process of further decompressing the MV and DCT representation into the spatial domain, processing the spatial domain data, and recompressing the result into its new MV/DCT representation. For many compressed-domain transcoding operations, additional gains are achieved by exploiting the sparsity that is typical of quantized DCT coefficients.


Intraframe Transcoding Algorithms

Images and video frames coded with intraframe methods are represented by sets of block DCT coefficients. When using intraframe DCT coding, the original video frame is divided into 8 by 8 blocks, each of which is independently transformed with an 8 by 8 DCT. This imposes an artificial block structure that complicates a number of spatial processing operations, such as translation, downscaling, and filtering, that were considered straightforward in the pixel domain. A number of efficient exact and approximate compressed-domain algorithms have been developed for transcoding DCT-compressed images and video frames. A few of these are briefly described below; the reader is referred to the references for more detailed descriptions [Merh96a, Merh94, Merh95, Shen96, Merh96b, Nat95].

The rigid structure of the 8 by 8 block DCT makes simple operations like translation difficult. The difficulty stems from the fact that the DCT blocks of the processed image do not line up with the DCT blocks of the original image. For example, consider a 16 by 16 block of pixels that is transformed with four 8 by 8 block DCTs. If one wishes to calculate the DCT of an arbitrary 8 by 8 subblock of pixels, one must first calculate the four inverse DCTs, then extract the appropriate subblock, and finally perform the forward DCT operation. The inverse motion compensation developed in [Merh96a] performs this operation directly on the DCT-domain data. This algorithm can be extended to a global translation operation because global translation is a special case of inverse motion compensation in which all the motion vectors have the same value. By exploiting this fact, further optimizations can be performed to increase the efficiency of the translation transcoding operation.
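The subblock-extraction idea can be sketched with fixed shift matrices. This is a simplified Python illustration in the spirit of [Merh96a], not the optimized algorithm itself; all matrix names are ours.

```python
import numpy as np

# DCT-domain extraction of the 8x8 subblock P[v:v+8, h:h+8] from the DCTs
# of the four 8x8 blocks covering the 16x16 area P. Matrix names are ours.

def dct_matrix(n=8):
    T = np.array([[np.cos(np.pi * (2 * j + 1) * i / (2 * n)) for j in range(n)]
                  for i in range(n)]) * np.sqrt(2.0 / n)
    T[0] /= np.sqrt(2)
    return T

T = dct_matrix()
dct = lambda b: T @ b @ T.T
idct = lambda c: T.T @ c @ T

def shift_matrices(v):
    """A1 selects rows v..7 of the upper block, A2 rows 0..v-1 of the lower."""
    A1, A2 = np.zeros((8, 8)), np.zeros((8, 8))
    for i in range(8 - v):
        A1[i, v + i] = 1
    for i in range(v):
        A2[8 - v + i, i] = 1
    return A1, A2

def dct_subblock(C00, C01, C10, C11, v, h):
    """DCT of P[v:v+8, h:h+8], computed entirely from DCT-domain quantities."""
    A1, A2 = shift_matrices(v)                  # vertical (row) selection
    W1, W2 = (m.T for m in shift_matrices(h))   # horizontal selection: transposed shifts
    # Since T is orthonormal, DCT(A X W) = DCT(A) DCT(X) DCT(W).
    dA1, dA2, dW1, dW2 = dct(A1), dct(A2), dct(W1), dct(W2)
    return dA1 @ (C00 @ dW1 + C01 @ dW2) + dA2 @ (C10 @ dW1 + C11 @ dW2)
```

A quick check: for a random 16 by 16 array P, inverse-transforming the result of dct_subblock at offset (3, 5) reproduces P[3:11, 5:13]. The practical gains reported in [Merh96a] come from exploiting coefficient sparsity and precomputing the fixed factors, which this sketch omits.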

Fast DCT-domain downscaling algorithms have been developed for integer subsampling factors [Merh94]. When downscaling, one 8 by 8 DCT block of the downscaled image is determined from multiple 8 by 8 DCT blocks of the original image. In these algorithms, when downscaling by a factor of N by N, each pixel of the downscaled image contains the average value of the corresponding N by N pixels of the original image. An efficient algorithm was also developed for filtering images in the DCT domain [Merh95]. This method can be used to apply two-dimensional symmetric, separable filters to DCT-coded images. In addition, approximate, multiplication-free versions of the transcoding algorithms described above have also been developed [Merh96b, Nat95].
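A 2 by 2 averaging downscaler of this kind can likewise be written as fixed matrix products applied to DCT coefficients. The following Python sketch illustrates the idea behind [Merh94]; the operator matrices are our own construction, not the paper's fast algorithm.

```python
import numpy as np

# 2x2-average downscaling computed on DCT coefficients. The operator Q
# below is our own construction illustrating the DCT-domain approach.

def dct_matrix(n=8):
    T = np.array([[np.cos(np.pi * (2 * j + 1) * i / (2 * n)) for j in range(n)]
                  for i in range(n)]) * np.sqrt(2.0 / n)
    T[0] /= np.sqrt(2)
    return T

T = dct_matrix()
dct = lambda b: T @ b @ T.T
idct = lambda c: T.T @ c @ T

# Q = [Q1 | Q2] averages row pairs: Q[i, 2i] = Q[i, 2i+1] = 1/2.
Q = np.zeros((8, 16))
for i in range(8):
    Q[i, 2 * i] = Q[i, 2 * i + 1] = 0.5
Q1, Q2 = Q[:, :8], Q[:, 8:]
dQ1, dQ2 = dct(Q1), dct(Q2)      # fixed DCT-domain operators, computed once

def dct_downscale(C00, C01, C10, C11):
    """8x8 DCT of the 2x-downscaled version of the 16x16 area behind C00..C11."""
    return (dQ1 @ C00 @ dQ1.T + dQ1 @ C01 @ dQ2.T +
            dQ2 @ C10 @ dQ1.T + dQ2 @ C11 @ dQ2.T)
```

Each output pixel is the average of the corresponding 2 by 2 pixels of the original, as in the algorithms described above, but the computation never leaves the DCT domain.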


Interframe Transcoding Algorithms

Video frames coded with interframe coding techniques are represented with motion vectors and residual DCT coefficients. These frames are coded based on a prediction from one or more previously coded frames; thus, properly decoding one frame requires first decoding one or more other frames. This temporal dependence among frames severely complicates a number of spatial and temporal processing techniques such as translation, downscaling, and splicing. In this section, we describe the interframe transcoding algorithms we have developed for temporal mode conversion, splicing, and reverse play of MPEG video streams.

Temporal Mode Conversion

The ability to transcode between arbitrary temporal modes adds a great deal of flexibility and power to compressed-domain video processing. In addition, it provides a method of trading off parameters to achieve various rate/robustness profiles. For example, an MPEG sequence consisting of all I frames, while least efficient from a compression viewpoint, is most robust to channel impairments in a video communication system. In addition, the I-frame MPEG video stream best facilitates many video editing operations such as splicing, downscaling, and reverse play. Finally, once an I-frame representation is available, the intraframe transcoding algorithms can be applied to each frame of the sequence to achieve the same effect on the entire sequence.

We briefly describe the steps required to transcode an MPEG video stream containing I, P, and B frames to an MPEG video stream containing only I frames. The more general problem of transcoding between arbitrary sets of temporal modes is more difficult and is currently under investigation.

Transcoding an IPB MPEG video stream into an all I-frame MPEG video stream requires three steps:

  1. Calculate the DCT coefficients of the motion-compensated prediction. This can be calculated from the intraframe coefficients of the previously coded frames by using the compressed-domain inverse motion compensation routine described in the section on intraframe transcoding algorithms.

  2. Form the intraframe DCT representation of each frame. This step simply involves adding the predicted DCT coefficients to the residual DCT coefficients.

  3. Reorder the coded data and update the relevant header information. If B-frames are used, the coding order of the IPB MPEG stream will differ from the coding order of the I-only MPEG stream. Thus, the coded data for each frame must be shuffled appropriately. In addition, the appropriate parameters of the header data must be updated.
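Step 2 relies only on the linearity of the DCT: the intraframe coefficients equal the DCT of the prediction plus the DCT of the residual, so no inverse transform is needed. A small numerical check in Python (our own construction):

```python
import numpy as np

# The DCT is linear, so the intraframe coefficients of a predicted block
# are DCT(prediction) + DCT(residual): the sum can be formed coefficient
# by coefficient without leaving the DCT domain.

def dct_matrix(n=8):
    T = np.array([[np.cos(np.pi * (2 * j + 1) * i / (2 * n)) for j in range(n)]
                  for i in range(n)]) * np.sqrt(2.0 / n)
    T[0] /= np.sqrt(2)
    return T

T = dct_matrix()
dct = lambda b: T @ b @ T.T

rng = np.random.default_rng(1)
prediction = rng.integers(0, 255, (8, 8)).astype(float)
residual = rng.normal(0, 5, (8, 8))
intra = dct(prediction) + dct(residual)          # computed in the DCT domain
assert np.allclose(intra, dct(prediction + residual))
```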

Splicing

The goal of the splicing operation is to form a video data stream that contains the first Nhead frames of one video sequence and the last Ntail frames of another video sequence. For uncoded video, the solution is obvious: simply discard the unused frames and concatenate the remaining data. Two properties make this solution obvious: (1) the data needed to represent each frame is self-contained, i.e. it is independent of the data from other frames; and (2) the uncoded video data has the desirable property of original ordering, i.e. the order of the video data corresponds to the display order of the video frames. MPEG-coded video data does not necessarily retain these properties of temporal independence or original ordering (although it can be forced to do so at the expense of compression efficiency). This complicates the task of splicing two MPEG-coded data streams.

We have developed a flexible algorithm that splices two streams directly in the compressed domain. The algorithm allows a natural tradeoff between computational complexity and compression efficiency; thus, it can be tailored to the requirements of a particular system. The proposed algorithm possesses a number of attributes. A minimal number of frames are decoded and processed, thus leading to low computational requirements while preserving compression efficiency. In addition, the head and tail data streams can be processed separately and the final spliced data stream is a simple concatenation of the two streams. Lastly, the order of the coded video data remains intact.

A naive splicing solution is to completely decompress the video, splice the decoded video frames, and recompress the result. With this method, every frame in the spliced video sequence must be recompressed. This method has a number of disadvantages, including high computational requirements, high memory requirements, and low performance (multiple recoding cycles may deteriorate the video data).

The computational requirements are reduced by only processing the frames affected by the splice, and by only decoding the frames needed for that processing. Specifically, the only frames that need to be recoded are those in the groups affected by the head and tail cut points; at most, there will be one such group in the head data stream and one in the tail data stream. Furthermore, the only additional frames that need to be decoded are the I and P frames in the two GOPs affected by the splice.

The proposed algorithm results in an MPEG-compliant data stream with variable-sized GOPs. This exploits the fact that the GOP header does not specify the number of frames in the GOP or its structure; rather, these are fully specified by the order of the data in the coded data stream.

We briefly describe each step of the splicing operation below. Further discussion is included in [Wee97].

  1. Form the head data stream. The simplest case occurs when the cut for the head data occurs immediately after an I or P frame. When this occurs, all the relevant video data is contained in one contiguous portion of the data stream. The irrelevant portion of the data stream can simply be discarded, and the remaining relevant portion does not need to be processed. When the cut occurs immediately after a B frame, some extra processing is required because one or more B-frame predictions will be based on an anchor frame that is not included in the final spliced video sequence.

  2. Form the tail data stream. The simplest case occurs when the cut occurs immediately before an I frame. When this occurs, the video data preceding this frame may be discarded and the remaining portion does not need to be processed. When the cut occurs before a P frame, the P frame must be converted to an I frame, and the remaining data remains intact. The P-to-I frame conversion is straightforward; it simply requires decoding the previous I and P frames in the GOP. When the cut occurs before a B frame, extra processing is required because one of the anchor frames is not included in the spliced sequence. Thus, the affected frames must be adjusted accordingly.

  3. Combine the head and tail data streams. The IPB structure of the head and tail data streams determines the complexity of the combine operation. In the simplest case, it will be a straightforward concatenation of the two streams. In more complicated cases, it will require proper ordering of the data from each frame of the head and tail data streams.

The first two steps may require converting frames between the I, P, and B prediction modes. Converting P or B frames to I frames is quite straightforward; however, conversion between any other set of prediction modes is more complicated. Exact algorithms involve performing motion estimation on the decoded video -- this process can dominate the computational requirements of the algorithm. We are currently developing approximate algorithms that significantly reduce the computations required for this conversion.
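The case analysis for the head cut point in step 1 can be sketched as a small classifier. The frame-type strings, function name, and return labels below are our own illustration, not the actual algorithm of [Wee97]:

```python
# Classify the head-stream cut point of step 1. The frame-type strings and
# labels are our illustration, not the actual algorithm of [Wee97].

def head_cut_processing(types, n_head):
    """types: display-order frame types, e.g. 'IBBPBBPBB'; keep frames 0..n_head-1."""
    last_kept = types[n_head - 1]
    if last_kept in 'IP':
        # Cut right after an anchor: every kept B frame still has both of its
        # anchors, so the relevant data is one contiguous piece of the stream.
        return 'concatenate'
    # Cut after a B frame: the trailing B frames reference a discarded future
    # anchor, so that group must be re-encoded.
    return 'recode-B'

print(head_cut_processing('IBBPBBPBB', 4))  # last kept frame is P3 -> 'concatenate'
print(head_cut_processing('IBBPBBPBB', 5))  # last kept frame is B4 -> 'recode-B'
```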

Reverse Play

The goal of reverse-play transcoding is to create a new MPEG data stream that, when decoded, displays the video frames in the reverse order from the original MPEG data stream. For uncoded video the solution is simple: reorder the video frame data in reverse order. The simplicity of this solution relies on two properties: the data for each video frame is self-contained and it is independent of its placement in the data stream. These properties typically do not hold true for MPEG-coded video data.

Reverse-play transcoding is difficult because MPEG compression is not invariant to changes in frame order, e.g. reversing the order of the input frames will not simply reverse the order of the output MPEG stream. Furthermore, reversing the order of the input video frames does not result in a "reversed" motion vector field. However, if the transcoding is performed carefully, much of the motion vector information contained in the original MPEG video stream can be used to save a significant number of computations.

We have developed a reverse-play transcoding algorithm that operates directly on the compressed-domain data. The proposed algorithm is simple and achieves high performance with low computational and memory requirements. This algorithm only decodes the following data from the original MPEG data stream: I frames must be partially decompressed into their DCT representation and P frames must be partially decompressed to their MV/DCT representation, while for B frames only the forward and backward motion vector fields need to be decoded.

The proposed reverse-play transcoding algorithm has the following steps:

  1. Convert the P frames to I frames. This operation can be performed directly on the DCT coefficients by using the compressed-domain inverse motion compensation algorithm described in the section on intraframe transcoding algorithms.

  2. Exchange the forward and backward motion vector fields used in each B frame. This step exploits the symmetry of the B frame prediction process. In the reversed stream, the B frames will have the same two anchor frames, but in the reverse order. Thus, the forward prediction field can simply be exchanged with the backward prediction field, resulting in significant computational savings. Notice that only the motion vector fields need to be decoded for the B frames.

  3. Properly reorder the frame data and update the relevant header information. If no B frames are used, then the reordering process is quite straightforward. However, when B frames are used, care must be taken to properly reorder the data from the original coding order to the reverse coding order. In addition, the appropriate parameters of the header data must be updated.

We have developed a very simple algorithm for reverse-play transcoding. The resulting MPEG video stream is slightly larger than the original MPEG stream because of the extra data that results from converting the P frames to I frames. However, since much of the coding gain comes from compressing the B frames, the increase in data rate is modest. If extra computational power is available, the compression can be recovered by converting the new I frames to P frames. One approach is to perform the full motion estimation to get the new forward motion vector field; however, this is a computationally expensive operation. We are currently developing approximate algorithms that can significantly reduce these computational costs by exploiting the motion vector information contained in the original MPEG stream.
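The bookkeeping in steps 1-3 can be sketched on a symbolic frame list; the dictionary layout and field names below are ours:

```python
# Symbolic sketch of steps 1-3: build the reversed stream's frame list in
# its new display order. The dictionary layout and field names are ours.

def reverse_stream(frames):
    """frames: display-order list of dicts with 'type' and, for B frames,
    'fwd'/'bwd' motion-vector fields."""
    out = []
    for f in reversed(frames):
        if f['type'] == 'P':
            out.append({'type': 'I'})                  # step 1: convert P to I
        elif f['type'] == 'B':
            out.append({'type': 'B', 'fwd': f['bwd'],  # step 2: the two anchors
                        'bwd': f['fwd']})              # swap roles, so swap fields
        else:
            out.append(dict(f))                        # I frames pass through
    return out

orig = [{'type': 'I'}, {'type': 'B', 'fwd': 'f1', 'bwd': 'b1'}, {'type': 'P'}]
print([f['type'] for f in reverse_stream(orig)])  # ['I', 'B', 'I']
```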

Final Remarks

We have investigated the problem of transcoding MPEG video streams in the compressed domain. In earlier stages of this project, efficient intraframe (image) transcoding algorithms were developed for inverse motion compensation, translation, downscaling, and filtering. More recently, we have developed interframe (video) transcoding algorithms for temporal mode conversion to I frames, splicing, and reverse play.

Bibliography

[Vas95]
B. Vasudev and K. Konstantinides, Image and Video Compression Standards: Algorithms and Architectures, Kluwer Academic Publishers, 1995.
[Mitch97]
J.L. Mitchell, W.B. Pennebaker, C.E. Fogg, and D.J. LeGall, MPEG Video Compression Standard, Digital Multimedia Standards Series, Chapman and Hall, 1997.
[Merh96a]
N. Merhav and B. Vasudev, "Fast Inverse Motion Compensation Algorithms for MPEG-2 and for Partial DCT Information", HP Laboratories Technical Report, Vol. HPL-96-53, April 1996.
[Merh94]
N. Merhav and B. Vasudev, "A Transform Domain Approach to Spatial Domain Image Downscaling", HP Laboratories Technical Report, Vol. HPL-94-116, December 1994.
[Merh95]
R. Kresch and N. Merhav, "Fast DCT Domain Filtering Using the DCT and the DST", HP Laboratories Technical Report, Vol. HPL-95-140, December 1995.
[Shen96]
B. Shen, I.K. Sethi, and B. Vasudev, "Digital Video Blue Screen Editing in Compressed Domain", Proceedings of the IS&T/SPIE Visual Communications and Image Processing Conference, February 1996, San Jose, CA.
[Merh96b]
N. Merhav, "Multiplication-Free Approximate Algorithms for Compressed Domain Linear Operations on Images", HP Laboratories Technical Report, Vol. HPL-96-111, July 1996.
[Nat95]
B.K. Natarajan and B. Vasudev, "A Fast Approximate Algorithm for Scaling Down Digital Images in the DCT Domain", Proceedings of the IEEE International Conference on Image Processing, October 1995, Vol. 2, pp. 241-243, Washington, D.C.
[MPEG2]
MPEG-2 International Standard, Video Recommendation ITU-T H.262, ISO/IEC 13818-2, January 1995.
[Wee97]
S.J. Wee and B. Vasudev, "Splicing MPEG Video Streams in the Compressed Domain", Proceedings of the IEEE International Conference on Multimedia Signal Processing, June 1997, Princeton, NJ.