MPEG-7 is a multimedia content description standard, standardized in ISO/IEC 15938 (Multimedia Content Description Interface). The description is associated with the content itself, to allow fast and efficient searching for material that is of interest to the user. MPEG-7 is formally called Multimedia Content Description Interface. Thus, it is not a standard that deals with the actual encoding of moving pictures and audio, like MPEG-1, MPEG-2, and MPEG-4. It uses XML to store metadata, and can be attached to timecode in order to tag particular events, or to synchronize lyrics to a song, for example.

It was designed to standardize:

  • a set of Description Schemes ("DS") and Descriptors ("D")
  • a language to specify these schemes, called the Description Definition Language ("DDL")
  • a scheme for coding the description

The combination of MPEG-4 and MPEG-7 has sometimes been referred to as MPEG-47.

One common ground between MPEG-4 and MPEG-7 is the focus on audiovisual objects. The main objective of MPEG-7 is to serve the need of audiovisual content-based retrieval (or audiovisual object retrieval) in applications such as digital libraries. Nevertheless, it is certainly not limited to retrieval; it is applicable to any multimedia application involving the generation (content creation) and usage (content consumption) of multimedia data.

MPEG-7 became an international standard in September 2001. Its formal name is Multimedia Content Description Interface, documented in ISO/IEC 15938. The standard's seven parts are Systems, Description Definition Language, Visual, Audio, Multimedia Description Schemes, Reference Software, and Conformance.

MPEG-7 supports a variety of multimedia applications. Its data may include still pictures, graphics, 3D models, audio, speech, video, and composition information (how to combine these elements). These MPEG-7 data elements can be represented in textual or binary format, or both. Part 1 (Systems) specifies the syntax of the Binary format for MPEG-7 (BiM) data. Part 2 (Description Definition Language) specifies the syntax of the textual format, which adopts XML Schema as its language of choice. A bidirectional lossless mapping is defined between the textual and binary representations.
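Textual MPEG-7 descriptions are plain XML, so they can be produced with ordinary XML tooling. The sketch below, using Python's standard library, shows the general shape of such a description; the element names are simplified for illustration and the document is not schema-valid MPEG-7 (the schema-aware binary BiM encoding is not shown).

```python
# Illustrative sketch of a textual, MPEG-7-style XML description.
# Element names (Mpeg7, Description, CreationInformation, Title,
# Creator) are simplified stand-ins, not the exact standard syntax.
import xml.etree.ElementTree as ET

def build_description(title: str, creator: str) -> str:
    root = ET.Element("Mpeg7")
    desc = ET.SubElement(root, "Description")
    creation = ET.SubElement(desc, "CreationInformation")
    ET.SubElement(creation, "Title").text = title
    ET.SubElement(creation, "Creator").text = creator
    return ET.tostring(root, encoding="unicode")

xml_text = build_description("Marine rescue", "News crew")
print(xml_text)
```

Because the textual form is ordinary XML, any XML parser can consume it, and the lossless mapping to BiM means the same description can be shipped compactly in binary.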

The following figure illustrates some possible applications that will benefit from MPEG-7. As shown, features are extracted and used to instantiate MPEG-7 descriptions. They are then coded by the MPEG-7 encoder and sent to storage and transmission media. Various search and query engines issue search and browsing requests, which constitute the pull activities of the Internet, whereas agents filter the numerous materials pushed onto the terminal users and/or the computer systems and applications that consume the data.

For multimedia content description, MPEG-7 has developed Descriptors (D), Description Schemes (DS), and a Description Definition Language (DDL). Following are some of the important terms:

  • Feature. A characteristic of the data

  • Descriptor (D). A definition (syntax and semantics) of the feature

  • Description Scheme (DS). Specification of the structure and relationship between Ds and DSs.

  • Description. A set of instantiated Ds and DSs that describes the structural and conceptual information of the content, storage and usage of the content, and so on

  • Description Definition Language (DDL). Syntactic rules to express and combine DSs and Ds.
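The relationship among these terms can be pictured with a small, purely illustrative data model: a Descriptor defines one feature, while a Description Scheme composes Descriptors and other Description Schemes. None of the class or field names below come from the standard itself.

```python
# Hypothetical model of how the MPEG-7 terms relate to each other.
from dataclasses import dataclass, field

@dataclass
class Descriptor:            # D: syntax and semantics of one feature
    feature: str             # e.g. "color histogram"

@dataclass
class DescriptionScheme:     # DS: structure relating Ds and other DSs
    name: str
    descriptors: list = field(default_factory=list)
    sub_schemes: list = field(default_factory=list)

# A "description" is then a set of instantiated Ds and DSs:
still_region = DescriptionScheme(
    "StillRegion",
    descriptors=[Descriptor("color histogram"), Descriptor("texture")],
)
segment = DescriptionScheme("Segment", sub_schemes=[still_region])
print(segment.sub_schemes[0].name)   # StillRegion
```

The DDL plays the role that the type definitions play here: it is the language in which valid Ds and DSs are expressed and combined.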

Possible Applications using MPEG - 7


The scope of MPEG-7 is to standardize the Ds, DSs, and DDL for descriptions. The mechanism and process of producing and consuming the descriptions are beyond its scope. These are left open for industry innovation and competition and, more importantly, for the arrival of ever-improving new technologies.

Similar to the Simulation Model (SM) in MPEG-1 video, the Test Model (TM) in MPEG-2 video, and the Verification Models (VMs) in MPEG-4 (video, audio, SNHC, and systems), MPEG-7 names its working model the Experimentation Model (XM), an alphabetical pun. The XM provides descriptions of various tools for evaluating the Ds, DSs, and DDL, so that experiments and verifications can be conducted and compared by multiple independent parties around the world. The first set of such experiments is called the core experiments.

Descriptor (D)

MPEG-7 descriptors are designed to describe both low-level features, such as color, texture, shape, and motion, and high-level features of semantic objects, such as events and abstract concepts. As mentioned above, methods and processes for automatic and even semiautomatic feature extraction are not part of the standard. Despite the efforts and progress in the fields of image and video processing, computer vision, and pattern recognition, automatic and reliable feature extraction is not expected in the near future, especially at the high level. The descriptors are chosen based on a comparison of their performance, efficiency, and size. Low-level visual descriptors for basic visual features include:


  • Color space. (a) RGB, (b) YCbCr, (c) HSV (hue, saturation, value), (d) HMMD (HueMaxMinDiff), (e) 3D color space derivable by a 3×3 matrix from RGB, (f) monochrome

  • Color quantization. (a) Linear, (b) nonlinear, (c) lookup tables

  • Dominant colors. A small number of representative colors in each region or image. These are useful for image retrieval based on color similarity

  • Scalable color. A color histogram in HSV color space. It is encoded by a Haar transform and hence is scalable

  • Color layout. Spatial distribution of colors for color - layout - based retrieval

  • Color structure. The frequency of a color structuring element describes both the color content and its structure in the image. The color structure element is composed of several image samples in a local neighborhood that have the same color

  • Group of Frames/Group of Pictures (GoF/GoP) color. Similar to scalable color, except this is applied to a video segment or a group of still images. An aggregated color histogram is obtained by applying average, median, or intersection operations to the respective bins of all color histograms in the GoF/GoP and is then sent to the Haar transform
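To make the scalable-color idea concrete, here is an illustrative 1D Haar transform applied to a toy histogram. Truncating the trailing (difference) coefficients yields a coarser but still usable descriptor, which is what makes the representation scalable. This sketch does not reproduce the exact MPEG-7 bin counts or bit allocation.

```python
# Illustrative 1D Haar decomposition of a color histogram, in the
# spirit of the MPEG-7 scalable color descriptor (simplified: no
# quantization or bit allocation).
def haar_1d(values):
    """Full 1D Haar decomposition of a power-of-two-length list."""
    coeffs = list(values)
    n = len(coeffs)
    while n > 1:
        # Pairwise averages (low-pass) followed by differences (high-pass)
        sums = [(coeffs[2*i] + coeffs[2*i+1]) / 2 for i in range(n // 2)]
        diffs = [(coeffs[2*i] - coeffs[2*i+1]) / 2 for i in range(n // 2)]
        coeffs[:n] = sums + diffs
        n //= 2
    return coeffs

histogram = [4, 4, 2, 2, 6, 6, 0, 0]   # toy 8-bin HSV histogram
coeffs = haar_1d(histogram)
print(coeffs[0])   # 3.0 (overall average of the bins)
```

Keeping only the first few coefficients gives a low-resolution view of the histogram; sending more of them refines it, so the descriptor can be matched at several sizes.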


  • Homogeneous texture. Uses orientation- and scale-tuned Gabor filters that quantitatively represent regions of homogeneous texture. The advantage of Gabor filters is that they provide simultaneous optimal resolution in both the space and spatial-frequency domains. Also, they are bandpass filters that conform to the human visual profile. A filter bank consisting of 30 Gabor filters, at five different scales and six different directions for each scale, is used to extract the texture descriptor

  • Texture browsing. Describes the regularity, coarseness, and directionality of edges used to represent and browse homogeneous textures. Again, Gabor filters are used

  • Edge histogram. Represents the spatial distribution of four directional (0°, 45°, 90°, 135°) edges and one nondirectional edge. Images are divided into small subimages, and an edge histogram with five bins is generated for each subimage.
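The edge-histogram classification step can be sketched as follows: each 2x2 pixel block is assigned to one of the five edge bins by taking the filter with the strongest absolute response. The filter coefficients below follow the commonly cited MPEG-7 formulation; the subimage layout and the no-edge threshold are omitted for brevity.

```python
# Sketch of edge classification for the MPEG-7 edge histogram.
# Each 2x2 block (a b / c d) is scored against five edge filters and
# assigned to the bin with the largest absolute response.
FILTERS = {
    "vertical":       (1, -1, 1, -1),
    "horizontal":     (1, 1, -1, -1),
    "diag45":         (1.414, 0, 0, -1.414),
    "diag135":        (0, 1.414, -1.414, 0),
    "nondirectional": (2, -2, -2, 2),
}

def classify_block(a, b, c, d):
    """Classify a 2x2 block by its strongest edge-filter response."""
    responses = {
        name: abs(a*f0 + b*f1 + c*f2 + d*f3)
        for name, (f0, f1, f2, f3) in FILTERS.items()
    }
    return max(responses, key=responses.get)

# Bright left column, dark right column -> a vertical edge
print(classify_block(200, 10, 200, 10))   # vertical
```

Counting these labels over all blocks of a subimage yields its five-bin histogram; concatenating the histograms of all subimages gives the full descriptor.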


  • Region-based shape. A set of Angular Radial Transform (ART) coefficients is used to describe an object's shape. An object can consist of one or more regions, possibly with holes. The ART is a 2D complex transform defined in terms of polar coordinates on the unit disc. ART basis functions are separable along the angular and radial dimensions. Thirty-six basis functions, 12 angular and 3 radial, are used to extract the shape descriptor

  • Contour-based shape. Uses a curvature scale space (CSS) representation that is invariant to scale and rotation, and robust to nonrigid motion and partial occlusion of the shape

  • 3D shape. Describes 3D mesh models and shape index. The histogram of the shape indices over the entire mesh is used as the descriptor


  • Camera motion. Fixed, pan, tilt, roll, dolly, track, boom.

  • Object motion trajectory. A list of keypoints (x, y, z, t). Optional interpolation functions are used to specify the acceleration along the path.

  • Parametric object motion. The basic model is the 2D affine model for translation, rotation, scaling, shearing, and combinations of these. A planar perspective model and a quadratic model can be used for perspective distortion and more complex movements

  • Motion activity. Provides descriptions such as the intensity, pace, mood, and so on, of the video, for example, "scoring in a hockey game" or "interviewing a person"
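A motion trajectory descriptor is essentially a list of timed keypoints plus an interpolation rule. The sketch below uses 2D keypoints and linear interpolation; the standard also allows higher-order interpolation to capture acceleration along the path, which is omitted here.

```python
# Sketch of an object motion trajectory as (t, x, y) keypoints with
# linear interpolation between them (simplified: 2D, first-order only).
def interpolate(keypoints, t):
    """Linearly interpolate the position at time t between keypoints."""
    keypoints = sorted(keypoints)
    for (t0, x0, y0), (t1, x1, y1) in zip(keypoints, keypoints[1:]):
        if t0 <= t <= t1:
            a = (t - t0) / (t1 - t0)
            return (x0 + a * (x1 - x0), y0 + a * (y1 - y0))
    raise ValueError("t outside trajectory")

trajectory = [(0, 0.0, 0.0), (2, 4.0, 2.0), (4, 4.0, 6.0)]
print(interpolate(trajectory, 1))   # (2.0, 1.0)
```

Storing only keypoints plus an interpolation rule keeps the descriptor compact while still letting a query engine recover the object's position at any time.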


  • Region locator. Specifies the localization of regions in images with a box or a polygon

  • Camera motions: pan, tilt, roll, dolly, track, and boom. (The camera, with effective focal length f, is shown initially at the origin, pointing in the direction of the z-axis.)

  • Spatiotemporal locator. Describes spatiotemporal regions in video sequences. Uses one or more sets of descriptors of regions and their motions


  • Face recognition. A normalized face image is represented as a 1D vector, then projected onto a set of 49 basis vectors representing all possible face vectors

Description Scheme (DS)

This section provides a brief overview of MPEG-7 Description Schemes (DSs) in the areas of Basic elements, Content management, Content description, Navigation and access, Content organization, and User interaction.

  1. Basic elements
    • Datatypes and mathematical structures. Vectors, matrices, histograms, and so on

    • Constructs. Link media files and localize segments, regions, and so on

    • Schema tools. Includes root elements (starting elements of MPEG-7 XML documents and descriptions), top-level elements (organizing DSs for specific content-oriented descriptions), and package tools (grouping related DS components of a description into packages)

  2. Content Management
    • Media Description. Involves a single DS, the Media Information DS, composed of a Media Identification D and one or more Media Profile Ds that contain information such as coding method, transcoding hints, storage and delivery formats, and so on

    • Creation and Production Description. Includes information about creation (title, creators, creation location, date, etc.), classification (genre, language, parental guidance, etc.), and related materials

    • Content Usage Description. Various DSs to provide information about usage rights, usage record, availability, and finance (cost of production, income from content use)

  3. Content Description
    • Structural Description. A Segment DS describes structural aspects of the content. A segment is a section of an audiovisual object. The relationship among segments is often represented as a segment tree. When the relationship is not purely hierarchical, a segment graph is used

      The Segment DS can be implemented as a class object. It has five subclasses: Audiovisual segment DS, Audio segment DS, Still region DS, Moving region DS, and Video segment DS. The subclass DSs can recursively have their own subclasses.

      A Still region DS, for example, can be used to describe an image in terms of its creation (title, creator, date), usage (copyright), media (file format), textual annotation, color histogram, and possibly texture descriptors, and so on. The initial region (image, in this case) can be further decomposed into several regions, which can in turn have their own DSs.

      The following figure shows a Video segment for a marine rescue mission, in which a person was lowered onto a boat from a helicopter. Three moving regions are inside the Video segment. A segment graph can be constructed to include such structural descriptions as the composition of the video frame (helicopter, person, boat) and the spatial relationships and motion (above, on, close-to, move-toward, etc.) of the regions.

    • Conceptual Description. This involves higher-level (nonstructural) description of the content, such as an Event DS for a basketball game or Lakers ballgame, an Object DS for John or person, a State DS for semantic properties at a given time or location, and a Concept DS for abstract notions such as "freedom" or "mystery". As with Segment DSs, the Concept DSs can also be organized in a tree or graph.
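Since the Segment DS can be implemented as a class object with recursive subclasses, the marine-rescue example above can be sketched minimally as follows. The class names mirror the DSs, but the fields are simplified illustrations, not the standard's syntax.

```python
# Illustrative class model of the Segment DS hierarchy: a segment can
# recursively decompose into child segments, as in the segment tree
# described in the text (fields simplified for illustration).
class SegmentDS:
    def __init__(self, annotation, children=None):
        self.annotation = annotation
        self.children = children or []   # recursive decomposition

class VideoSegmentDS(SegmentDS):
    pass

class MovingRegionDS(SegmentDS):
    pass

# The marine-rescue Video segment containing three moving regions
rescue = VideoSegmentDS("rescue mission", children=[
    MovingRegionDS("helicopter"),
    MovingRegionDS("person"),
    MovingRegionDS("boat"),
])
print(len(rescue.children))   # 3
```

A segment graph would additionally record non-hierarchical relations (above, on, close-to) between these nodes, which a plain tree like this cannot express.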

  4. Navigation and access
      MPEG-7 video segment

    • Summaries. These provide a video summary for quick browsing and navigation of the content, usually by presenting only the keyframes. The following DSs are supported: Summarization DS, Hierarchical Summary DS, HighlightLevel DS, and Sequential Summary DS. Hierarchical summaries provide a keyframe hierarchy of multiple levels, whereas sequential summaries often provide a slide show or audiovisual skim, possibly with synchronized audio and text. The following figure illustrates a summary for a video of a "dragon-boat" parade and race in a park. The summary is organized in a three-level hierarchy. Each video segment at each level is depicted by a keyframe of thumbnail size

    • Partitions and Decompositions. This refers to view partitions and decompositions. The view partitions (specified by View DSs) describe different space and frequency views of the audiovisual data, such as a spatial view (a spatial segment of an image), a temporal view (a temporal segment of a video), a frequency view (a wavelet subband of an image), or a resolution view (a thumbnail image). The View decomposition DSs specify different tree or graph decompositions for organizing the views of the audiovisual data, such as a SpaceTree DS (a quad-tree image decomposition)

      A video summary

    • Variations of the Content. A Variation DS specifies a variation from the original data in image resolution, frame rate, color reduction, compression, and so on. It can be used by servers to adapt audiovisual data delivery to network and terminal characteristics for a given Quality of Service (QoS)
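As an illustration of how a server might act on Variation-style metadata, the sketch below picks the richest variation that fits the available bandwidth. The variation records and the selection policy are hypothetical, not MPEG-7 syntax.

```python
# Hypothetical use of Variation-style metadata for QoS adaptation:
# choose the best delivery version that fits the network constraint.
variations = [
    {"name": "original",  "kbps": 4000, "resolution": "1080p"},
    {"name": "reduced",   "kbps": 800,  "resolution": "480p"},
    {"name": "thumbnail", "kbps": 100,  "resolution": "120p"},
]

def pick_variation(available_kbps):
    """Pick the richest variation within the available bandwidth."""
    fitting = [v for v in variations if v["kbps"] <= available_kbps]
    # Fall back to the smallest variation if nothing fits
    return max(fitting, key=lambda v: v["kbps"]) if fitting else variations[-1]

print(pick_variation(1000)["name"])   # reduced
```

The same lookup could equally key on terminal capabilities (screen size, supported codecs) rather than bandwidth; the Variation DS is what makes such substitutions discoverable.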

  5. Content Organization
    • Collections. The Collection Structure DS groups audiovisual contents into clusters. It specifies common properties of the cluster elements and relationships among the clusters.

    • Models. Model DSs include a Probability model DS, an Analytic model DS, and a Classifier DS that extract the models and statistics of the attributes and features of the collections.

  6. User Interaction
    • User Preference. These DSs describe user preferences in the consumption of audiovisual content, such as content types, browsing modes, privacy characteristics, and whether preferences can be altered by an agent that analyzes user behavior.

Description Definition Language (DDL)

MPEG-7 adopted the XML Schema Language, initially developed by the World Wide Web Consortium (W3C), as its Description Definition Language (DDL). Since XML Schema Language was not designed specifically for audiovisual content, some extensions are made to it. Without going into detail, the MPEG-7 DDL has the following components:

  1. XML Schema structure components
    • The Schema — the wrapper around definitions and declarations
    • Primary structural components, such as simple and complex type definitions, and attribute and element declarations
    • Secondary structural components, such as attribute group definitions, identity-constraint definitions, group definitions, and notation declarations.
    • "Helper" components, such as annotations, particles, and wildcards

  2. XML Schema datatype components:
    • Primitive and derived data types
    • Mechanisms for the user to derive new data types
    • Stronger type checking than in XML 1.0

  3. MPEG-7 Extensions
    • Array and matrix data types
    • Multiple media types, including audio, video, and audiovisual presentations
    • Enumerated data types for MimeType, CountryCode, RegionCode, CurrencyCode, and CharacterSetCode
    • Intellectual Property Management and Protection (IPMP) for Ds and DSs
