Abstract
Compressed digital video is quickly becoming ubiquitous for video
storage and communication. It is often desirable to perform various
video processing tasks; however, operations once considered simple,
such as translation and downscaling, become less straightforward when
applied to compressed video. The naive solution of completely
decompressing the video into the pixel domain, performing the
processing, and recompressing the entire video sequence is inefficient
and impractical.
We are investigating the problem of transcoding compressed video data
from one standards-compliant data stream to another. The goal is to
develop video transcoding algorithms that achieve high performance
with computational efficiency. We have developed a number of
compressed-domain algorithms that achieve this goal. We have
implemented a software version of the video transcoder, capable of
transcoding MPEG-compliant data streams between a number of modes. We
are actively developing our algorithms so that this software will
become a tool useful for both practical applications and research.
Video compression algorithms are being used to compress digital video
for a wide variety of applications, including video delivery over the
internet, advanced television broadcasting, as well as video storage
and editing. The performance of modern compression algorithms such as
MPEG is quite impressive -- raw video data rates often can be reduced
by factors of 15-80 without considerable loss in reconstructed video
quality. However, the use of these compression algorithms often makes
other processing tasks quite difficult. For example, many operations
once considered simple, such as splicing and downscaling, are much
more complicated when applied to compressed video streams.
The goal of transcoding is to process one standards-compliant video
stream into another standards-compliant video stream that has
properties better suited to a particular application. This is useful
in a number of scenarios. For example, a video server
transmitting video over the internet may be restricted by stringent
bandwidth requirements. In this scenario, a high-quality compressed
bitstream may need to be transcoded to a lower-rate compressed
bitstream prior to transmission; this can be achieved by lowering the
spatial or temporal resolution of the video or by requantizing the
MPEG data. Another application may require MPEG video streams to be
transcoded into streams that facilitate video editing functionalities
such as splicing or fast-forward and reverse play; this may involve
removing the temporal dependencies in the coded data stream. Finally,
in a video communication system, the transmitted video stream may be
subject to harsh channel conditions resulting in data loss; in this
instance it may be useful to create a standards-compliant video stream
that is more robust to channel errors.
In this paper, we present a series of compressed-domain image and
video transcoding algorithms that were designed with the goal of
achieving high performance with computational efficiency. We focus on
video compression algorithms that rely on the block discrete cosine
transform (DCT) and motion-compensated prediction. This is applicable
to a number of predominant image and video coding standards including
JPEG, MPEG-1, MPEG-2, H.261, and H.263. In the remainder of this
paper we will focus on MPEG; however, many of these results apply to
the other standards as well.
This paper begins with a brief description of some aspects of the MPEG
video compression algorithm that are relevant to this work. The
transcoding problem is then defined and the significance of
compressed-domain transcoding is discussed. We provide a brief
description of some of the image (intraframe) transcoding algorithms
developed in earlier stages of this project. We then describe the
video (interframe) transcoding algorithms that are currently under
development.
Our transcoding algorithms are designed to exploit various features of
the MPEG video compression standard. Detailed descriptions of the
MPEG video compression standard can be found in [Vas95, Mitch97].
In this section, we will briefly discuss some aspects of the standard
that are relevant to this work.
MPEG codes video in a hierarchy of units called sequences, groups of
pictures (GOPs), pictures, slices, and macroblocks. Each 16 by 16
block of pixels in the original video frame is coded as a macroblock.
The macroblocks are scanned in a left-to-right, top-to-bottom fashion,
and a series of these macroblocks forms a slice. All the slices in a frame
comprise a picture, contiguous pictures form a GOP, and all the GOPs
form the entire sequence. The MPEG syntax allows a GOP to contain any
number of frames, but typical sizes range from 9 to 15 frames. Each
GOP refreshes the temporal prediction by coding the first frame in
intraframe mode. The remaining frames in the GOP can be coded with
intraframe or interframe (predictive) coding techniques.
The MPEG algorithm allows each frame to be coded in one of three
modes: intraframe (I), forward prediction (P), and bidirectional
prediction (B). A typical IPB pattern in display order is:
    I0 B1 B2 P3 B4 B5 P6 B7 B8 P9 B10 B11 I0 B1 B2 P3 B4 B5 P6 B7 B8 P9 B10 B11 I0
I frames are coded independently of other frames. P frames depend on
a prediction based on the preceding I or P frame. B frames depend on
a prediction based on the preceding and following I or P frames.
Notice that each B frame depends on data from a future frame, i.e., a
future frame must be (de)coded prior to (de)coding the current B
frame. For this reason, the coding order is distinguished from the
display order. The coding order for the sequence shown above
is:
    I0 P3 B1 B2 P6 B4 B5 P9 B7 B8 I0 B10 B11 P3 B1 B2 P6 B4 B5 P9 B7 B8 I0 B10 B11
MPEG requires the coded video data to be placed in the data stream in
coding order.
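
To make this shuffle concrete, the following Python sketch (our own
illustration, not part of the MPEG specification) converts a
display-order list of frame types into coding order: each B frame is
emitted after the future anchor frame it depends on.

    def display_to_coding_order(frames):
        # frames: list of 'I', 'P', 'B' frame types in display order.
        coded, pending_b = [], []
        for f in frames:
            if f == 'B':
                pending_b.append(f)      # B frames wait for their future anchor
            else:
                coded.append(f)          # emit the anchor first...
                coded.extend(pending_b)  # ...then the B frames that precede it
                pending_b = []
        return coded + pending_b

    # Example: ['I','B','B','P','B','B','P'] -> ['I','P','B','B','P','B','B']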
A GOP always begins with an I frame and typically includes the
following (display order) P and B frames that occur before the next I
frame, although the syntax allows a GOP to contain multiple I frames.
The sequence shown above contains two complete GOPs plus one I frame.
The GOP header does not specify the number of I, P, or B frames in the
GOP, nor does it specify the structure of the GOP -- these are
completely determined by the order of the data in the stream. Thus,
there are no rules that restrict the size and structure of the GOP,
although care should be taken to ensure that the buffer constraints
are satisfied.
We find it useful to define a substructure called a group. A
group is a unit that contains two consecutive I or P anchor
frames and all the B frames whose prediction is based on those two
anchors. The significance of the group is that it represents the
smallest unit of dependencies for the B frames.
MPEG uses block motion-compensated prediction to reduce the temporal
redundancies inherent to video. In block motion-compensated
prediction, the current frame is divided into 16 by 16 pixel units
called macroblocks. Each macroblock is compared to a number of 16 by
16 blocks in a previously coded frame. A single motion vector (MV) is
used to record the position of the best-matching block. This block is used
as a prediction of the current block, and only the error in the
prediction, called the residual, is coded into the data stream.
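
As a concrete illustration of block matching, the Python sketch below
performs a full search for one 16 by 16 macroblock using the sum of
absolute differences as the matching criterion; both the search
strategy and the criterion are illustrative choices, since the
standard specifies only the bitstream, not the encoder's search.

    import numpy as np

    def motion_compensate(cur_mb, ref, top, left, search=8):
        # Full-search matching of one 16x16 macroblock against the
        # reference frame 'ref'; (top, left) is the macroblock position.
        best_sad, best_mv = float('inf'), (0, 0)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + 16 > ref.shape[0] or x + 16 > ref.shape[1]:
                    continue  # candidate block falls outside the frame
                cand = ref[y:y + 16, x:x + 16].astype(int)
                sad = np.abs(cur_mb.astype(int) - cand).sum()
                if sad < best_sad:
                    best_sad, best_mv = sad, (dy, dx)
        dy, dx = best_mv
        pred = ref[top + dy:top + dy + 16, left + dx:left + dx + 16].astype(int)
        residual = cur_mb.astype(int) - pred  # only the MV and residual are coded
        return best_mv, residual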
The frames of a video sequence can be coded as an I, P, or B frame.
In I frames, every macroblock must be coded in intraframe mode,
i.e. no prediction is used. In P frames, each macroblock can be coded
with forward prediction or in intraframe mode. In B frames, each
macroblock can be coded with forward, backward, or bidirectional
prediction or in intraframe mode. One MV is specified for each
forward- and backward-predicted macroblock while two MVs are specified
for each bidirectionally predicted macroblock. Thus, each P frame has
a forward motion vector field and one anchor frame, while each B frame
has a forward and backward motion vector field and two anchor frames.
MPEG uses DCT transform coding to code the intraframe and residual
macroblocks. Four 8 by 8 block DCTs are used to encode each
macroblock and the resulting DCT coefficients are quantized.
Quantization usually results in a sparse representation of the data;
i.e. most of the amplitudes of the quantized DCT coefficients are
equal to zero. Thus, only the amplitudes and locations of the nonzero
coefficients are coded into a data stream.
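
The sketch below illustrates this coding step on a single 8 by 8
block; the flat quantizer step size is a simplifying assumption, as
MPEG actually uses per-coefficient quantization matrices.

    import numpy as np
    from scipy.fft import dctn, idctn

    def code_block(block, q=16):
        # Forward 8x8 block DCT followed by uniform quantization.
        levels = np.round(dctn(block.astype(float), norm='ortho') / q)
        # 'levels' is typically sparse: only the positions and amplitudes
        # of its nonzero entries are entropy-coded into the stream.
        recon = idctn(levels * q, norm='ortho')  # decoder-side reconstruction
        return levels, recon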
The goal of MPEG transcoding is to process one MPEG-compliant video
stream into another MPEG-compliant video stream that has properties
better suited for a particular application. Transcoding differs from
the encoding and decoding processes in that both the input and output
of the transcoder are MPEG video streams. A naive solution to the
transcoding problem, shown in the top of Figure 1, involves the
following steps: first, the MPEG-coded video stream is completely
decompressed into its pixel-domain representation; this pixel-domain
video is then processed with the appropriate operation; and finally
the processed video is recompressed into a new MPEG video stream.
Such solutions are computationally expensive and have large memory
requirements. In addition, the quality of the coded video can
deteriorate with each recoding cycle.
Figure 1. MPEG transcoding: the naive solution (top) and the compressed-domain solution (bottom).
A more efficient transcoding solution, shown in the bottom of Figure
1, is to only partially decompress the video stream and perform the
processing directly on the compressed-domain data. For some
operations, further gains can be achieved by only processing the
portions of the video stream that are affected by the transcoding
operation. In this work, the affected portion of the video stream is
partially decompressed into its motion vector (MV) and discrete cosine
transform (DCT) representation. Computational requirements are
reduced by eliminating the process of further decompressing the MV and
DCT representation into the spatial domain, processing the spatial
domain data, and recompressing the result into its new MV/DCT
representation. For many compressed-domain transcoding operations,
additional gains are achieved by exploiting the sparsity that is
typical of quantized DCT coefficients.
Images and video frames coded with intraframe methods are represented
by sets of block DCT coefficients. When using intraframe DCT coding,
the original video frame is divided into 8 by 8 blocks, each of which
is independently transformed with an 8 by 8 DCT. This imposes an
artificial block structure that complicates a number of spatial
processing operations, such as translation, downscaling, and
filtering, that were considered straightforward in the pixel domain.
A number of efficient exact and approximate compressed-domain
algorithms have been developed for transcoding DCT-compressed images
and video frames. A few of these are briefly described below, the
reader is referred to the references for more detailed descriptions
[Merh96a,
Merh94,
Merh95,
Shen96,
Merh96b,
Nat95].
The rigid structure of the 8 by 8 block DCT makes simple operations
like translation difficult. The difficulty stems from the fact that
the DCT blocks of the processed image do not line up with the DCT
blocks of the original image. For example, consider a 16 by 16 block
of pixels that is transformed with four 8 by 8 block DCTs. If one
wishes to calculate the DCT of an arbitrary 8 by 8 subblock of pixels,
one must first calculate the four inverse DCTs, then extract the
appropriate subblock, and finally perform the forward DCT operation.
The inverse motion compensation developed in
[Merh96a] performs this operation directly on
the DCT-domain data. This algorithm can be extended to a global
translation operation because global translation is a special case of
inverse motion compensation in which all the motion vectors have the
same value. By exploiting this fact, further optimizations can be
performed to increase the efficiency of the translation transcoding
operation.
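
For reference, the sketch below carries out the naive pixel-domain
route just described (four inverse DCTs, subblock extraction, one
forward DCT). The DCT-domain inverse motion compensation of [Merh96a]
produces the same result without leaving the transform domain, which
is what makes the compressed-domain translation efficient.

    import numpy as np
    from scipy.fft import dctn, idctn

    def subblock_dct_reference(dct_blocks, dy, dx):
        # dct_blocks: 2x2 nested list of 8x8 DCT blocks covering a 16x16
        # region; (dy, dx): offset of the desired window, 0 <= dy, dx <= 8.
        region = np.empty((16, 16))
        for i in range(2):
            for j in range(2):
                region[8*i:8*i+8, 8*j:8*j+8] = idctn(dct_blocks[i][j], norm='ortho')
        window = region[dy:dy+8, dx:dx+8]   # extract the displaced subblock
        return dctn(window, norm='ortho')   # forward DCT of the subblock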
Fast DCT-domain downscaling algorithms have been developed for integer
subsampling factors [Merh94]. When
downscaling, one 8 by 8 DCT block of the downscaled image is
determined from multiple 8 by 8 DCT blocks of the original image. In
these algorithms, when downscaling by a factor of N by N, each pixel
of the downscaled image contains the average value of the
corresponding N by N pixels of the original image. An efficient
algorithm was also developed for filtering images in the DCT domain [Merh95]. This method can be used to apply
two-dimensional symmetric, separable filters to DCT-coded images. In
addition, approximate, multiplication-free versions of the transcoding
algorithms described above have also been developed [Merh96b, Nat95].
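
In pixel-domain terms, this downscaling is plain block averaging; the
short sketch below states that reference definition, which the
DCT-domain algorithm of [Merh94] computes directly on the
coefficients.

    import numpy as np

    def downscale_reference(img, n):
        # Each output pixel is the average of the corresponding n-by-n
        # input pixels; crop so the dimensions divide evenly.
        h, w = (img.shape[0] // n) * n, (img.shape[1] // n) * n
        return img[:h, :w].reshape(h // n, n, w // n, n).mean(axis=(1, 3))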
Video frames coded with interframe coding techniques are represented
with motion vectors and residual DCT coefficients. These frames are
coded based on a prediction from one or more previously coded frames;
thus, properly decoding one frame requires first decoding one or more
other frames. This temporal dependence among frames severely
complicates a number of spatial and temporal processing techniques
such as translation, downscaling, and splicing. In this section, we
describe the interframe transcoding algorithms we have developed for
temporal mode conversion, splicing, and reverse play of MPEG video
streams.
The ability to transcode between arbitrary temporal modes adds a great
deal of flexibility and power to compressed-domain video
processing. In addition, it provides a method of trading off
parameters to achieve various rate/robustness profiles. For example,
an MPEG sequence consisting of all I frames, while least efficient
from a compression viewpoint, is most robust to channel impairments in
a video communication system. In addition, the I-frame MPEG video
stream best facilitates many video editing operations such as
splicing, downscaling, and reverse play. Finally, once an I-frame
representation is available, the intraframe transcoding algorithms can
be applied to each frame of the sequence to achieve the same effect on
the entire sequence.
We briefly describe the steps required to transcode an MPEG video
stream containing I, P, and B frames to an MPEG video stream
containing only I frames. The more general problem of transcoding
between any set of temporal modes is a more difficult task and is
currently under investigation.
Transcoding an IPB MPEG video stream into an all I-frame MPEG video
stream requires three steps:
- Calculate the DCT coefficients of the motion-compensated
prediction. These can be computed from the intraframe
coefficients of the previously coded frames by using the
compressed-domain inverse motion compensation routine described in
the intraframe transcoding section above.
- Form the intraframe DCT representation of each frame. This
step simply involves adding the predicted DCT coefficients to the
residual DCT coefficients (see the sketch after this list).
- Reorder the coded data and update the relevant header
information. If B-frames are used, the coding order of the IPB
MPEG stream will differ from the coding order of the I-only MPEG
stream. Thus, the coded data for each frame must be shuffled
appropriately. In addition, the appropriate parameters of the header
data must be updated.
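
The second step rests on the linearity of the DCT: the intraframe
coefficients of a predicted block are the coefficients of its
prediction plus the coded residual coefficients. The sketch below
shows the per-macroblock arithmetic; dct_domain_prediction is a
hypothetical stand-in for the compressed-domain inverse motion
compensation routine.

    def predicted_block_to_intra(residual_dct, mv, ref_dct_blocks):
        # Hypothetical helper: DCT coefficients of the motion-compensated
        # prediction, computed entirely in the compressed domain [Merh96a].
        pred_dct = dct_domain_prediction(ref_dct_blocks, mv)
        # Linearity of the DCT: intraframe DCT = prediction DCT + residual DCT.
        return pred_dct + residual_dct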
The goal of the splicing operation is to form a video data stream that
contains the first N_head frames of one video sequence and
the last N_tail frames of another video sequence. For
uncoded video, the solution is obvious: simply discard the unused
frames and concatenate the remaining data. Two properties make this
solution obvious: (1) the data needed to represent each frame is
self-contained, i.e. it is independent of the data from other frames;
and (2) the uncoded video data has the desirable property of original
ordering, i.e. the order of the video data corresponds to the display
order of the video frames. MPEG-coded video data does not necessarily
retain these properties of temporal independence or original ordering
(although it can be forced to do so at the expense of compression
efficiency). This complicates the task of splicing two MPEG-coded
data streams.
We have developed a flexible algorithm that splices two streams
directly in the compressed domain. The algorithm allows a natural
tradeoff between computational complexity and compression efficiency,
so it can be tailored to the requirements of a particular system.
The proposed algorithm possesses a number of desirable attributes. A minimal
number of frames are decoded and processed, thus leading to low
computational requirements while preserving compression efficiency.
In addition, the head and tail data streams can be processed
separately and the final spliced data stream is a simple concatenation
of the two streams. Lastly, the order of the coded video data remains
intact.
A naive splicing solution is to completely decompress the video,
splice the decoded video frames, and recompress the result. With this
method, every frame in the spliced video sequence must be
recompressed. This method has a number of disadvantages, including
high computational requirements, high memory requirements, and low
performance (multiple recoding cycles may deteriorate the video data).
The computational requirements are reduced by only processing the
frames affected by the splice, and by only decoding the frames needed
for that processing. Specifically, the only frames that need to be
recoded are those in the groups affected by the head and tail cut
points; at most, there will be one such group in the head data stream
and one in the tail data stream. Furthermore, the only additional
frames that need to be decoded are the I and P frames in the two GOPs
affected by the splice.
The proposed algorithm results in an MPEG-compliant data stream with
variable-sized GOPs. This exploits the fact that the GOP header does
not specify the number of frames in the GOP or its structure; rather
these are fully specified by the order of the data in the coded data
stream.
We briefly describe each step of the splicing operation below; a
schematic sketch of the overall flow follows the list. Further
discussion is included in [Wee97].
- Form the head data stream. The simplest case arises when
the head cut point falls immediately after an I or P frame.
In this case, all the relevant video data is contained in one
contiguous portion of the data stream. The irrelevant portion of the
data stream can simply be discarded, and the remaining relevant
portion does not need to be processed. When the cut occurs
immediately after a B frame, some extra processing is required because
one or more B-frame predictions will be based on an anchor frame that
is not included in the final spliced video sequence.
- Form the tail data stream. The simplest case arises when
the cut falls immediately before an I frame. In this case, the
video data preceding this frame may be discarded and the remaining
portion does not need to be processed. When the cut occurs before a P
frame, the P frame must be converted to an I frame, and the remaining
data remains intact. The P-to-I conversion is straightforward; it
simply requires decoding the previous I and P frames in the GOP. When
the cut occurs before a B frame, extra processing is required because
one of the anchor frames is not included in the spliced sequence.
Thus, the affected frames must be adjusted accordingly.
- Combine the head and tail data streams. The IPB structure
of the head and tail data streams determines the complexity of the
combine operation. In the simplest case, it will be a straightforward
concatenation of the two streams. In more complicated cases, it will
require proper ordering of the data from each frame of the head and
tail data streams.
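
At the top level, the splice then has the following shape; the three
helpers below are placeholders for the per-stream processing in the
list above, not actual implementations.

    def splice(head_stream, n_head, tail_stream, n_tail):
        head = form_head(head_stream, n_head)  # recode at most one group at the cut
        tail = form_tail(tail_stream, n_tail)  # P-to-I conversion at the cut if needed
        return combine(head, tail)             # often a plain concatenation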
The first two steps may require converting frames between the I, P,
and B prediction modes. Converting P or B frames to I frames is quite
straightforward; however, conversion between any other set of
prediction modes is more complicated. Exact algorithms involve
performing motion estimation on the decoded video -- this process can
dominate the computational requirements of the algorithm. We are
currently developing approximate algorithms that significantly reduce
the computations required for this conversion.
The goal of reverse-play transcoding is to create a new MPEG data
stream that, when decoded, displays the video frames in the reverse
order from the original MPEG data stream. For uncoded video the
solution is simple: reorder the video frame data in reverse order.
The simplicity of this solution relies on two properties: the data for
each video frame is self-contained and it is independent of its
placement in the data stream. These properties typically do not hold
true for MPEG-coded video data.
Reverse-play transcoding is difficult because MPEG compression is not
invariant to changes in frame order, e.g. reversing the order of the
input frames will not simply reverse the order of the output MPEG
stream. Furthermore, reversing the order of the input video frames
does not result in a "reversed" motion vector field. However, if the
transcoding is performed carefully, much of the motion vector
information contained in the original MPEG video stream can be used to
save a significant number of computations.
We have developed a reverse-play transcoding algorithm that operates
directly on the compressed-domain data. The proposed algorithm is
simple and achieves high performance with low computational and memory
requirements. The algorithm decodes only the following data from the
original MPEG data stream: I frames are partially decompressed
into their DCT representation, P frames are partially decompressed
into their MV/DCT representation, and for B frames only the forward
and backward motion vector fields are decoded.
The proposed reverse-play transcoding algorithm has the following
steps:
- Convert the P frames to I frames. This operation can be
performed directly on the DCT coefficients by using the
compressed-domain inverse motion compensation algorithm described in
the intraframe transcoding section above.
- Exchange the forward and backward motion vector fields used in
each B frame. This step exploits the symmetry of the B frame
prediction process. In the reversed stream, the B frames will have
the same two anchor frames, but in the reverse order. Thus, the
forward prediction field can simply be exchanged with the backward
prediction field, resulting in significant computational savings (see
the sketch after this list).
Notice that only the motion vector fields need to be decoded for the B
frames.
- Properly reorder the frame data and update the relevant header
information. If no B frames are used, then the reordering process
is quite straightforward. However, when B frames are used, care must
be taken to properly reorder the data from the original coding
order to reverse coding order. In addition, the appropriate
parameters of the header data must be updated.
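
The sketch below summarizes the per-frame handling; the frame objects
and the p_to_i and reorder_for_reverse helpers are hypothetical
placeholders for the mechanisms described in the list above.

    def reverse_transcode(frames_in_coding_order):
        out = []
        for f in frames_in_coding_order:
            if f.type == 'P':
                f = p_to_i(f)  # hypothetical: DCT-domain inverse motion compensation
            elif f.type == 'B':
                # Same two anchors, traversed in reverse order: swap MV fields.
                f.fwd_mvs, f.bwd_mvs = f.bwd_mvs, f.fwd_mvs
            out.append(f)
        return reorder_for_reverse(out)  # hypothetical: emit in reverse coding order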
The reverse-play transcoding algorithm described above is very simple.
The resulting MPEG video stream is slightly larger than
the original MPEG stream because of the extra data that results from
converting the P frames to I frames. However, since much of the
coding gain comes from compressing the B frames, the increase in data
rate is modest. If extra computational power is available, the
compression can be recovered by converting the new I frames to P
frames. One approach is to perform the full motion estimation to get
the new forward motion vector field; however, this is a
computationally expensive operation. We are currently developing
approximate algorithms that can significantly reduce these
computational costs by exploiting the motion vector information
contained in the original MPEG stream.
We have investigated the problem of transcoding MPEG video streams in
the compressed domain. In earlier stages of this project, efficient
intraframe (image) transcoding algorithms were developed for inverse
motion compensation, translation, downscaling, and filtering. More
recently, we have developed interframe (video) transcoding algorithms
for temporal mode conversion to I frames, splicing, and reverse play.
References

[Vas95] B. Vasudev and K. Konstantinides, Image and Video Compression Standards: Algorithms and Architectures, Kluwer Academic Publishers, 1995.

[Mitch97] J.L. Mitchell, W.B. Pennebaker, C.E. Fogg, and D.J. LeGall, MPEG Video Compression Standard, Digital Multimedia Standards Series, Chapman and Hall, 1997.

[Merh96a] N. Merhav and B. Vasudev, "Fast Inverse Motion Compensation Algorithms for MPEG-2 and for Partial DCT Information", HP Laboratories Technical Report HPL-96-53, April 1996.

[Merh94] N. Merhav and B. Vasudev, "A Transform Domain Approach to Spatial Domain Image Downscaling", HP Laboratories Technical Report HPL-94-116, December 1994.

[Merh95] R. Kresch and N. Merhav, "Fast DCT Domain Filtering Using the DCT and the DST", HP Laboratories Technical Report HPL-95-140, December 1995.

[Shen96] B. Shen, I.K. Sethi, and B. Vasudev, "Digital Video Blue Screen Editing in Compressed Domain", Proceedings of the IS&T/SPIE Visual Communications and Image Processing Conference, San Jose, CA, February 1996.

[Merh96b] N. Merhav, "Multiplication-Free Approximate Algorithms for Compressed Domain Linear Operations on Images", HP Laboratories Technical Report HPL-96-111, July 1996.

[Nat95] B.K. Natarajan and B. Vasudev, "A Fast Approximate Algorithm for Scaling Down Digital Images in the DCT Domain", Proceedings of the IEEE International Conference on Image Processing, Washington, D.C., October 1995, Vol. 2, pp. 241-243.

[MPEG2] MPEG-2 International Standard, Video Recommendation ITU-T H.262, ISO/IEC 13818-2, January 1995.

[Wee97] S.J. Wee and B. Vasudev, "Splicing MPEG Video Streams in the Compressed Domain", Proceedings of the IEEE International Conference on Multimedia Signal Processing, Princeton, NJ, June 1997.