INTERNATIONAL ORGANIZATION FOR STANDARDIZATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC 1/SC 29/WG 11
CODING OF MOVING PICTURES AND AUDIO

ISO/IEC JTC 1/SC 29/WG 11 N7609

Oct 2005, Nice

Title: MPEG-4 File Formats white paper

Source: Systems

Status: Proposal

Editors: David Singer (Apple), Mohammed Zubair Visharam (Sony)

 

Introduction

Within the ISO/IEC 14496 MPEG-4 standard there are several parts that define file formats for the storage of time-based media (such as audio and video).  They are all based on and derived from the ISO Base Media File Format [1], a structural, media-independent definition that is also published as part of the JPEG 2000 family of standards.

The MP4 file format [2] defines the storage of MPEG-4 audio, scenes and multimedia content using the ISO Base Media File Format. The AVC File Format [3] defines the storage of Advanced Video Coding (ISO/IEC 14496-10/AVC) [4] data within files of the ISO Base Media File Format family.

There are also related file formats that use the structural definition of a box-structured file as defined in the ISO Base Media File Format, but do not use the definitions for time-based media.  The MPEG-21 File Format [5] is one such standard and defines the storage of an MPEG-21 digital item, with some or all of its ancillary data (such as images, movies, or other non-XML data) within the same file.  Other files using this structure include the standard file formats for JPEG 2000 images, such as JP2 [6].

A diagrammatic overview of the relationship between the various file formats and the ISO Base Media File Format is shown in Figure 1.

Figure 1: Relationship between the ISO, MP4, AVC, and MPEG-21 File Formats

Target applications

The family of storage file formats is based on the concept of box-structured files.  A box-structured file consists of a series of boxes (sometimes called atoms), each of which has a size and a type.  The type field is usually four printable characters.  Box-structured files are used in a number of applications, and it is possible to form ‘multi-purpose’ files which contain the boxes required by more than one specification.  Examples include not only the ISO Base Media File Format family described here, but also the JPEG 2000 file format family, which for the most part is a still-image file format.
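The size-and-type layout can be illustrated with a short Python sketch that walks the top-level boxes of a byte stream. It assumes the header layout of the base specification [1] (a 32-bit big-endian size that includes the header, then a four-byte type, with size 1 signalling a 64-bit size and size 0 a box that runs to the end of the file); the names iter_boxes and make_box are hypothetical helpers introduced for this example only.

```python
import io
import struct


def iter_boxes(stream, end=None):
    """Yield (type, payload_offset, payload_size) for each box in the stream."""
    while end is None or stream.tell() < end:
        start = stream.tell()
        header = stream.read(8)
        if len(header) < 8:
            break  # no further complete box header
        size, box_type = struct.unpack(">I4s", header)
        header_len = 8
        if size == 1:
            # Assumed rule: size 1 means a 64-bit size follows the type.
            size = struct.unpack(">Q", stream.read(8))[0]
            header_len = 16
        elif size == 0:
            # Assumed rule: size 0 means the box extends to the end of the file.
            stream.seek(0, io.SEEK_END)
            size = stream.tell() - start
        if size < header_len:
            raise ValueError("box size is smaller than its own header")
        yield box_type.decode("latin-1"), start + header_len, size - header_len
        stream.seek(start + size)  # skip the payload to reach the next box


def make_box(box_type, payload=b""):
    """Build a box with a 32-bit size field (test helper for this sketch)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


data = make_box(b"free", b"\x00" * 4) + make_box(b"mdat", b"media")
boxes = list(iter_boxes(io.BytesIO(data)))
print(boxes)  # [('free', 8, 4), ('mdat', 20, 5)]
```

Because a reader only needs the size to skip a box, it can step over box types it does not recognize, which is what makes multi-purpose files practical.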

The ISO Base Media File Format additionally contains structural and media data information for timed presentations of media data such as audio, video, etc.  This structure is intentionally general, so that by structuring files in different ways the same base specification can be used for files for

More specialized uses include the storage of a partial or complete MPEG-4 scene and its associated object descriptions. This general structure has been adopted not only for the MP4 file format, but also by a number of other standards bodies, trade associations, and companies.

Box Structured Files

The file structure is object-oriented; that is, a file can be decomposed into its constituent objects very simply, and the structure of the objects can be inferred directly from their type and position. The types are 32-bit values, usually chosen to be four printable characters for ease of inspection and editing.  There is provision for using extension boxes with a Universally Unique Identifier (UUID) type [8], and specification text is provided on how to convert all box types into UUIDs.
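As an illustration of that conversion, the sketch below forms a UUID from a four-character box type by appending a fixed 12-byte value. The suffix shown is the ISO-reserved value as we understand it from the base specification [1]; it should be checked against the specification text before use, and box_type_to_uuid is a hypothetical helper name.

```python
import uuid

# Assumed ISO-reserved suffix: XXXXXXXX-0011-0010-8000-00AA00389B71,
# where XXXXXXXX is the four-byte box type.
ISO_UUID_SUFFIX = bytes.fromhex("00110010800000AA00389B71")


def box_type_to_uuid(box_type: bytes) -> uuid.UUID:
    """Map a four-byte box type onto its full 16-byte UUID form."""
    if len(box_type) != 4:
        raise ValueError("a box type is exactly four bytes")
    return uuid.UUID(bytes=box_type + ISO_UUID_SUFFIX)


print(box_type_to_uuid(b"moov"))  # 6d6f6f76-0011-0010-8000-00aa00389b71
```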

All box-structured files start with a file-type box (possibly after a box-structured signature) that defines the best use of the file, and the specifications to which the file complies.  These are documented as ‘brands’.  Brands identify a specification.  The presence of a brand in this box indicates both a claim and a permission; a claim by the file writer that the file complies with the specification, and a permission for a reader, possibly implementing only that specification, to read and interpret the file.
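A reader might apply this claim-and-permission rule as in the following sketch. It assumes the file-type box payload layout of the base specification [1] (a four-byte major brand, a 32-bit minor version, then a list of four-byte compatible brands); parse_file_type and reader_may_open are hypothetical names, and treating the major brand as one of the claimed brands is a choice made for this example.

```python
import struct


def parse_file_type(payload: bytes):
    """Split a file-type box payload into (major_brand, minor_version, compatible_brands)."""
    if len(payload) < 8 or (len(payload) - 8) % 4:
        raise ValueError("malformed file-type payload")
    major, minor_version = struct.unpack(">4sI", payload[:8])
    compatible = [
        payload[i:i + 4].decode("latin-1") for i in range(8, len(payload), 4)
    ]
    return major.decode("latin-1"), minor_version, compatible


def reader_may_open(payload: bytes, supported_brands: set) -> bool:
    """True if the file claims at least one brand that this reader implements."""
    major, _, compatible = parse_file_type(payload)
    return bool(supported_brands & ({major} | set(compatible)))


payload = b"mp42" + struct.pack(">I", 0) + b"mp42" + b"isom"
print(parse_file_type(payload))             # ('mp42', 0, ['mp42', 'isom'])
print(reader_may_open(payload, {"isom"}))   # True
print(reader_may_open(payload, {"3gp4"}))   # False
```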

ISO Base Media File Format

The ISO Base Media File Format is designed to contain timed media information for a presentation in a flexible, extensible format that facilitates interchange, management, editing, and presentation of the media. This presentation may be ‘local’ to the system containing the presentation, or may be via a network or other stream delivery mechanism.

The files have a logical structure, a time structure, and a physical structure, and these structures are not required to be coupled. The logical structure of the file is of a movie that in turn contains a set of time-parallel tracks. The time structure of the file is that the tracks contain sequences of samples in time, and those sequences are mapped into the timeline of the overall movie by optional edit lists.
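The mapping from the movie timeline into a track's own timeline can be sketched as follows. This is a simplified model, not the normative algorithm: it assumes each edit has a duration in movie timescale units and a media start time in media timescale units, that a media time of -1 marks an empty edit, and that playback rate is always 1. The Edit class and movie_to_media_time are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    segment_duration: int  # length on the movie timeline, in movie timescale units
    media_time: int        # start within the track media, in media timescale units; -1 = empty


def movie_to_media_time(edits, movie_time, movie_timescale, media_timescale):
    """Return the media time presented at movie_time, or None if nothing is shown."""
    if not edits:
        # Without an edit list the track is assumed to start at movie time zero.
        return movie_time * media_timescale // movie_timescale
    segment_start = 0
    for edit in edits:
        segment_end = segment_start + edit.segment_duration
        if segment_start <= movie_time < segment_end:
            if edit.media_time == -1:
                return None  # empty edit: the track is silent or blank here
            offset = movie_time - segment_start
            return edit.media_time + offset * media_timescale // movie_timescale
        segment_start = segment_end
    return None  # beyond the last edit


# Half a second of nothing, then two seconds of media starting one second in.
edits = [Edit(500, -1), Edit(2000, 48000)]
print(movie_to_media_time(edits, 250, 1000, 48000))   # None
print(movie_to_media_time(edits, 1500, 1000, 48000))  # 96000
```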

The physical structure of the file separates the data needed for logical, time, and structural decomposition from the media data samples themselves.  This structural information is concentrated in a movie box, possibly extended in time by movie fragment boxes.  The movie box documents the logical and timing relationships of the samples, and also contains pointers to where they are located.  Those pointers may be into the same file or another one, referenced by a URL.
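One way to picture these pointers is the sketch below, which resolves the file offset of every sample in a track. It assumes the table model of the base specification [1] as we understand it: samples are grouped into chunks, a chunk offset table gives the position of each chunk, a sample-to-chunk table gives runs of chunks with the same sample count, and a sample size table gives each sample's length. The function name and the plain-list representation are hypothetical.

```python
def sample_offsets(chunk_offsets, sample_to_chunk, sample_sizes):
    """Compute the file offset of each sample.

    chunk_offsets:   file offset of each chunk, in chunk order
    sample_to_chunk: (first_chunk, samples_per_chunk) runs, first_chunk is 1-based
    sample_sizes:    size in bytes of each sample, in sample order
    """
    offsets = []
    sample_index = 0
    for run_index, (first_chunk, per_chunk) in enumerate(sample_to_chunk):
        # A run lasts until the next run begins, or to the final chunk.
        if run_index + 1 < len(sample_to_chunk):
            last_chunk = sample_to_chunk[run_index + 1][0] - 1
        else:
            last_chunk = len(chunk_offsets)
        for chunk in range(first_chunk, last_chunk + 1):
            position = chunk_offsets[chunk - 1]
            for _ in range(per_chunk):
                if sample_index >= len(sample_sizes):
                    return offsets
                offsets.append(position)
                # Samples inside a chunk are stored back to back.
                position += sample_sizes[sample_index]
                sample_index += 1
    return offsets


result = sample_offsets(
    chunk_offsets=[100, 400, 900],
    sample_to_chunk=[(1, 2), (3, 1)],
    sample_sizes=[10, 20, 30, 40, 50],
)
print(result)  # [100, 110, 400, 430, 900]
```

Keeping these tables apart from the media data is what allows the media bytes to be left untouched, or even to live in another file, while the movie box is rewritten.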

Each media stream is contained in a track specialized for that media type (audio, video etc.), and is further parameterized by a sample entry.  The sample entry contains the ‘name’ of the exact media type (i.e., the type of the decoder needed to decode the stream) and any parameterization of that decoder needed.  The name also takes the form of a four-character code.  There are defined sample entry formats not only for MPEG-4 media, but also for the media types used by other organizations using this file format family.  They are registered at the MP4 registration authority [7].

Protected streams are also supported by the file format (e.g. streams encrypted for use in a digital rights management (DRM) system).  There is a general structure for protected streams, which documents the underlying format, and also documents the protection system applied and any parameters it needs.

Support for meta-data takes two forms.  First, timed meta-data may be stored in an appropriate track, synchronized as desired with the media data it is describing.  Secondly, there is general support for non-timed meta-data attached to the movie or to an individual track.  The structural support is general, and allows, as in the media-data, the storage of meta-data resources elsewhere in the file or in another file.  In addition, these resources may be named, and may be protected.

These generalized meta-data structures may also be used at the file level, above or parallel with or in the absence of the movie box.  In this case, the meta-data box is the primary entry into the presentation.  This structure is used for MPEG-21 files, and other bodies are using it to wrap together other integration specifications (e.g. SMIL [9]) with the media they integrate.

Sometimes the samples within a track have different characteristics or need to be specially identified.  One of the most common and important characteristics is the synchronization point (often a video I-frame).  These points are identified by a special table in each track.  More generally, the nature of dependencies between track samples can also be documented.  Finally, there is a concept of named, parameterized sample groups.  These permit the documentation of arbitrary characteristics that are shared by some of the samples in a track.  In the AVC file format, sample groups are used to support the concept of layering and sub-sequences.
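A typical use of the synchronization point table is random access: to start decoding at a given sample, a player backs up to the nearest preceding sync sample. The sketch below assumes the table is available as a sorted list of 1-based sample numbers; the function name is hypothetical.

```python
import bisect


def sync_sample_at_or_before(sync_samples, target_sample):
    """Return the last sync sample number <= target_sample, or None if there is none."""
    index = bisect.bisect_right(sync_samples, target_sample)
    return sync_samples[index - 1] if index else None


sync_samples = [1, 31, 61]  # e.g. one sync point every 30 samples
print(sync_sample_at_or_before(sync_samples, 45))  # 31
print(sync_sample_at_or_before(sync_samples, 61))  # 61
```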

MP4 File Format

MP4 files are generally used to contain MPEG-4 media, including not only MPEG-4 audio and/or video, but also MPEG-4 presentations.  When a complete or partial presentation is stored in an MP4 file, there are specific structures that document that presentation.

MPEG-4 presentations are scenes, described by the scene language MPEG-4 BIFS.  Within those scenes media objects can be placed; these media objects might be audio, video, or entire sub-scenes.  Each object is described by an object descriptor, and within the object descriptor the streams that make up that object are described. The entire scene is described by an initial object descriptor (IOD).  This is stored in a special box within the movie box in MP4 files.  The scene and the object descriptors it uses are stored in tracks: a scene track and an object descriptor track.  For files that comprise a full MPEG-4 presentation, this IOD and these two tracks are required.

Each stream is described by an elementary stream descriptor.  When a complete scene is delivered, these are delivered as part of the object descriptor stream.  However, for ease of composition, and to manage files that contain only media streams, these elementary stream descriptors are stored with the media streams themselves — in the descriptive track structures — in MP4 files.

MPEG-21 File Format

As described above, the general meta-box can be used at the file level to contain a description and its associated or included data.  This structure is used for MPEG-21 files.  A file-level meta-box is used to hold an MPEG-21 Digital Item Declaration (DID) [10].  The meta-box also contains a list of attached resources, which may have local names, and may be located within the same file or in another file.

Streaming Support

When media is delivered over a streaming protocol, it often must be transformed from the way it is represented in the file.  The most obvious example of this is the way media is transmitted over the Real-time Transport Protocol (RTP) [11].  In the file, for example, each frame of video is stored contiguously as a file-format sample.  In RTP, packetization rules specific to the codec used must be obeyed to place these frames in RTP packets.

A streaming server may calculate such packetization at run-time if it wishes.  However, the file format also provides support to assist streaming servers.  Special tracks called hint tracks may be placed in the files.  Hint tracks contain general instructions for streaming servers as to how to form packet streams, from media tracks, for a specific protocol.  Because the form of these instructions is media-independent, servers do not have to be revised when new codecs are introduced.  In addition, the encoding and editing software can be unaware of streaming servers.  Once editing is finished on a file, a piece of software called a hinter may be used to add hint tracks to the file before it is placed on a streaming server. There is a defined hint track format for RTP streams in the MP4 file format specification.
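The media-independence of hint instructions can be illustrated with a deliberately simplified model. In the sketch below, each packet is described by a list of instructions that either supply literal bytes or copy a byte range from a media sample; the server follows them without knowing anything about the codec. The Immediate and FromSample classes and build_packet are hypothetical and do not reproduce the actual syntax of the RTP hint track format.

```python
from dataclasses import dataclass


@dataclass
class Immediate:
    data: bytes  # bytes carried in the hint itself, e.g. a payload header


@dataclass
class FromSample:
    sample_number: int  # 1-based sample number in the referenced media track
    offset: int         # first byte to copy from that sample
    length: int         # number of bytes to copy


def build_packet(instructions, media_samples):
    """Assemble one packet payload by following the hint instructions in order."""
    payload = bytearray()
    for instruction in instructions:
        if isinstance(instruction, Immediate):
            payload += instruction.data
        else:
            sample = media_samples[instruction.sample_number - 1]
            payload += sample[instruction.offset:instruction.offset + instruction.length]
    return bytes(payload)


media_samples = [b"ABCDEFGH"]  # one video frame stored as a single sample
hints = [
    [Immediate(b"\x01"), FromSample(1, 0, 4)],  # first fragment of the frame
    [Immediate(b"\x02"), FromSample(1, 4, 4)],  # second fragment of the frame
]
packets = [build_packet(h, media_samples) for h in hints]
print(packets)  # [b'\x01ABCD', b'\x02EFGH']
```

The codec-specific knowledge lives entirely in the hinter that wrote the instructions, which is why a server written this way needs no revision for a new codec.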

File Identification

The following table contains a summary of some of the common file types in the ISO Base Media File Format Family.  The formal registration authorities (e.g. the MP4 registration authority [7] for brands, or the Internet Assigned Numbers Authority [12] for MIME types) and the appropriate specifications should be consulted for definitive information.

 

Format            Brand                     Extension  MIME type
MP4               mp41, mp42                .mp4       video/mp4, audio/mp4, application/mp4
3GPP              various, e.g. 3gp4, 3gp5  .3gp       video/3gpp, audio/3gpp
3GPP2             3g2a                      .3g2       video/3gpp2, audio/3gpp2
Motion JPEG 2000  mjp2                      .mj2       video/mj2
QuickTime         "qt  "                    .mov       video/quicktime
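For quick identification, the entries listed above could be held in a small lookup structure, as in the sketch below. The data are copied from this summary only; the 3GPP brand list is explicitly not exhaustive, and the registration authorities remain the definitive source. The names FILE_TYPES and describe_brand are hypothetical.

```python
# (format, brands, extension, MIME types), copied from the summary above.
FILE_TYPES = [
    ("MP4", ("mp41", "mp42"), ".mp4", ("video/mp4", "audio/mp4", "application/mp4")),
    ("3GPP", ("3gp4", "3gp5"), ".3gp", ("video/3gpp", "audio/3gpp")),  # other brands exist
    ("3GPP2", ("3g2a",), ".3g2", ("video/3gpp2", "audio/3gpp2")),
    ("Motion JPEG 2000", ("mjp2",), ".mj2", ("video/mj2",)),
    ("QuickTime", ("qt  ",), ".mov", ("video/quicktime",)),  # brand has two trailing spaces
]


def describe_brand(brand):
    """Return (format, extension, MIME types) for a known brand, else None."""
    for name, brands, extension, mime_types in FILE_TYPES:
        if brand in brands:
            return name, extension, mime_types
    return None


print(describe_brand("mp42"))
print(describe_brand("qt  "))
```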

Registration Authority

There is a registration authority which registers and documents the four-character-code code-points used in this file-format family, as well as some other code-points related to MPEG-4 systems.  The database is publicly viewable and registration is free [7].

References

[1] ISO/IEC 14496-12, ISO Base Media File Format; technically identical to ISO/IEC 15444-12
[2] ISO/IEC 14496-14, MP4 File Format
[3] ISO/IEC 14496-15, Advanced Video Coding (AVC) File Format
[4] ISO/IEC 14496-10, Advanced Video Coding
[5] ISO/IEC 21000-9, MPEG-21 File Format
[6] ISO/IEC 15444-1, JPEG 2000 Image Coding System
[7] The MP4 Registration Authority, http://www.mp4ra.org/
[8] ISO/IEC 9834-8:2004, Information Technology, "Procedures for the operation of OSI Registration of Universally Unique Identifiers (UUIDs) and their use as ASN.1 Object Identifier components"; ITU-T Rec. X.667, 2004
[9] SMIL: Synchronized Multimedia Integration Language; World-Wide Web Consortium (W3C), http://www.w3.org/TR/SMIL2/
[10] ISO/IEC 21000-2, Digital Item Declaration
[11] RTP: A Transport Protocol for Real-Time Applications; IETF RFC 3550, http://www.ietf.org/rfc/rfc3550.txt
[12] The Internet Assigned Numbers Authority, http://www.iana.org/