MPEG-2

MPEG-2 is a standard for "the generic coding of moving pictures and associated audio information". It describes a combination of lossy video compression and lossy audio data compression methods which permit storage and transmission of movies using currently available storage media and transmission bandwidth.

Development of the MPEG-2 standard started in 1990. Unlike MPEG-1, which is basically a standard for storing and playing video on the CD of a single computer at a low bitrate (1.5 Mbps), MPEG-2 is for higher-quality video at a bitrate of more than 4 Mbps. It was initially developed as a standard for digital broadcast TV.

In the late 1980s, Advanced TV (ATV) was envisioned as a means of broadcasting HDTV via terrestrial networks. During the development of MPEG-2, digital ATV finally took precedence over various early attempts at analog solutions to HDTV. MPEG-2 has managed to meet the compression and bitrate requirements of digital TV/HDTV and in fact supersedes a separate standard, MPEG-3, initially thought necessary for HDTV.

The MPEG-2 audio/video compression standard, also referred to as ISO/IEC 13818, was approved by the ISO/IEC Moving Picture Experts Group in November 1994. Similar to MPEG-1, it has parts for Systems, Video, Audio, Conformance, and Software, plus other aspects. MPEG-2 has gained wide acceptance beyond broadcasting digital TV over terrestrial, satellite, or cable networks. Among various applications such as interactive TV, it is also adopted for digital video discs or digital versatile discs (DVDs).

MPEG-2 defined seven profiles aimed at different applications (e.g., low-delay videoconferencing, scalable video, HDTV). The profiles are Simple, Main, SNR scalable, Spatially scalable, High, 4:2:2, and Multiview (where two views would refer to stereoscopic video). Within each profile, up to four levels are defined, but not all profiles have all four levels. For example, the Simple profile has only the Main level, whereas the High profile does not have the Low level.

The following table lists the four levels in the Main profile, with the maximum amount of data and targeted applications. For example, the High level supports a high picture resolution of 1,920 × 1,152, a maximum frame rate of 60 fps, a maximum pixel rate of 62.7 × 10^6 pixels per second, and a maximum data rate after coding of 80 Mbps. The Low level is targeted at SIF video; hence, it provides backward compatibility with MPEG-1. The Main level is for CCIR 601 video, whereas the High 1440 and High levels are aimed at European HDTV and North American HDTV, respectively.
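To make the level constraints concrete, here is a minimal Python sketch (not part of the standard) that tests a video's parameters against the Main-profile High-level figures quoted above; treating those figures as simple upper bounds is an assumption for illustration.

```python
# Illustrative check against the Main-profile High-level limits quoted above.

HIGH_LEVEL_LIMITS = {
    "max_width": 1920,
    "max_height": 1152,
    "max_fps": 60,
    "max_pixel_rate": 62.7e6,   # luma samples per second
    "max_bitrate": 80e6,        # bits per second after coding
}

def fits_high_level(width, height, fps, bitrate):
    """Return True if the video stays within the quoted High-level bounds."""
    limits = HIGH_LEVEL_LIMITS
    return (width <= limits["max_width"]
            and height <= limits["max_height"]
            and fps <= limits["max_fps"]
            and width * height * fps <= limits["max_pixel_rate"]
            and bitrate <= limits["max_bitrate"])

# Example: 1,920 x 1,080 at 30 fps and 20 Mbps stays within the High level.
print(fits_high_level(1920, 1080, 30, 20e6))   # True
```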

The DVD video specification allows only four display resolutions: 720 × 480, 704 × 480, 352 × 480, and 352 × 240. Hence, the DVD video standard uses only a restricted form of the MPEG-2 Main profile at the Main and Low levels.

Table: Four levels in the Main profile of MPEG-2

Supporting Interlaced Video

MPEG-1 supports only noninterlaced (progressive) video. Since MPEG-2 is adopted by digital broadcast TV, it must also support interlaced video, because this is one of the options for digital broadcast TV and HDTV.

As mentioned earlier in the discussion of interlaced video, each frame consists of two fields, referred to as the top-field and the bottom-field. In a frame-picture, all scanlines from both fields are interleaved to form a single frame. This is then divided into 16 × 16 macroblocks and coded using motion compensation. On the other hand, if each field is treated as a separate picture, it is called a field-picture. As the following figure shows, each frame-picture can be split into two field-pictures. The figure shows 16 scanlines from a frame-picture on the left, as opposed to 8 scanlines in each of the two field portions of a field-picture on the right.

We see that, in terms of display area on the monitor/TV, each 16-column × 16-row macroblock in the field-picture corresponds to a 16 × 32 block area in the frame-picture, whereas each 16 × 16 macroblock in the frame-picture corresponds to a 16 × 8 block area in the field-picture. As discussed below, this observation becomes an important factor in developing the different prediction modes for motion-compensation-based video coding.

Field-pictures and field prediction for field-pictures in MPEG-2: (a) frame-picture versus field-pictures; (b) field prediction for field-pictures
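The frame-to-field split described above amounts to simple array slicing. The following is a minimal NumPy sketch (with even scanlines taken as the top-field), an illustration rather than anything prescribed by the standard.

```python
import numpy as np

def split_into_fields(frame: np.ndarray):
    """Split a frame-picture into its top-field and bottom-field pictures."""
    top_field = frame[0::2, :]      # scanlines 0, 2, 4, ...
    bottom_field = frame[1::2, :]   # scanlines 1, 3, 5, ...
    return top_field, bottom_field

def merge_fields(top_field, bottom_field):
    """Re-interleave the two fields back into a frame-picture."""
    frame = np.empty((top_field.shape[0] * 2, top_field.shape[1]),
                     dtype=top_field.dtype)
    frame[0::2, :] = top_field
    frame[1::2, :] = bottom_field
    return frame

frame = np.arange(16 * 16).reshape(16, 16)   # 16 scanlines, as in the figure
top, bottom = split_into_fields(frame)
assert top.shape == bottom.shape == (8, 16)  # 8 scanlines per field portion
assert np.array_equal(merge_fields(top, bottom), frame)
```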

Five Modes of Prediction

MPEG-2 defines frame prediction and field prediction, as well as five different prediction modes, suitable for a wide range of applications in which the requirements for the accuracy and speed of motion compensation vary.

  • Frame prediction for frame-pictures. This is identical to the MPEG-1 motion-compensation-based prediction methods in both P-frames and B-frames. Frame prediction works well for videos containing only slow and moderate object and camera motions.

  • Field prediction for field-pictures. This mode uses a macroblock size of 16 × 16 from field-pictures. For P-field-pictures (the rightmost ones shown in the figure), predictions are made from the two most recently encoded fields. Macroblocks in the top-field picture are forward-predicted from the top-field or bottom-field pictures of the preceding I- or P-frame. Macroblocks in the bottom-field picture are predicted from the top-field picture of the same frame or the bottom-field picture of the preceding I- or P-frame.

    For B-field-pictures, both forward and backward predictions are made from field-pictures of preceding and succeeding I- or P-frames. No regulation requires that field "parity" be maintained; that is, the top-field and bottom-field pictures can be predicted from either the top or bottom fields of the reference pictures.

  • Field prediction for frame-pictures. This mode treats the top-field and bottom-field of a frame-picture separately. Accordingly, each 16 × 16 macroblock from the target frame-picture is split into two 16 × 8 parts, each coming from one field. Field prediction is carried out for these 16 × 8 parts. Besides the smaller block size, the only difference is that the bottom-field will not be predicted from the top-field of the same frame, since we are dealing with frame-pictures now.

    For example, for P-frame-pictures, the bottom 16 × 8 part will instead be predicted from either field of the preceding I- or P-frame. Two motion vectors are thus generated for each 16 × 16 macroblock in the P-frame-picture. Similarly, up to four motion vectors can be generated for each macroblock in the B-frame-picture.

  • 16 × 8 MC for field-pictures. Each 16 × 16 macroblock from the target field-picture is now split into top and bottom 16 × 8 halves, that is, the first eight rows and the next eight rows. Field prediction is performed on each half. As a result, two motion vectors will be generated for each 16 × 16 macroblock in the P-field-picture and up to four motion vectors for each macroblock in the B-field-picture. This mode is good for finer motion compensation when motion is rapid and irregular (the two 16 × 8 splits are contrasted in the sketch after this list).

  • Dual-prime for P-pictures. This is the only mode that can be used for either frame-pictures or field-pictures. At first, field prediction from each previous field with the same parity (top or bottom) is made. Each motion vector MV is then used to derive a calculated motion vector CV in the field with the opposite parity, taking into account the temporal scaling and vertical shift between lines in the top and bottom fields. In this way, the pair MV and CV yields two preliminary predictions for each macroblock. Their prediction errors are averaged and used as the final prediction error. This mode aims to mimic B-picture prediction for P-pictures without adopting backward prediction (and hence with less encoding delay).
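The two 16 × 8 splits used by the third and fourth modes above differ only in which rows form each part. The sketch below contrasts the row bookkeeping; it is an illustration, not an implementation of motion compensation.

```python
import numpy as np

def split_by_field(mb: np.ndarray):
    """Frame-picture macroblock -> one 16 x 8 part per field (alternating rows)."""
    return mb[0::2, :], mb[1::2, :]    # top-field rows, bottom-field rows

def split_by_half(mb: np.ndarray):
    """Field-picture macroblock -> first eight rows and next eight rows."""
    return mb[:8, :], mb[8:, :]

mb = np.arange(16 * 16).reshape(16, 16)
for part in (*split_by_field(mb), *split_by_half(mb)):
    assert part.shape == (8, 16)       # each 16 x 8 part gets its own motion vector
```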

Alternate Scan and Field-DCT

Alternate scan and field-DCT are techniques aimed at improving the effectiveness of the DCT on prediction errors. They are applicable only to frame-pictures in interlaced videos.

After frame prediction in frame-pictures, the prediction error is sent to the DCT, where each block is of size 8 × 8. Due to the nature of interlaced video, the consecutive rows in these blocks are from different fields; hence, there is less correlation between them than between the alternate rows. This suggests that the DCT coefficients at low vertical spatial frequencies tend to have reduced magnitudes, compared to those in noninterlaced video. MPEG-2 therefore allows an alternate scan order, in addition to the MPEG-1 zigzag scan, that reads the coefficients at higher vertical spatial frequencies earlier, as the following figure shows.

(a) Zigzag (progressive) and (b) alternate (interlaced) scans of DCT coefficients for videos in MPEG-2

In MPEG-2, field-DCT can address the same issue. Before applying the DCT, rows in the macroblocks of frame-pictures can be reordered, so that the first eight rows are from the top-field and the last eight are from the bottom-field. This restores the higher spatial redundancy (and correlation) between consecutive rows. The reordering is reversed after the IDCT. Field-DCT is not applicable to chrominance images, where each macroblock has only 8 × 8 pixels.
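The field-DCT row reordering is a pure permutation, so it can be sketched in a few lines. The following NumPy sketch assumes a 16 × 16 luminance macroblock with even scanlines belonging to the top-field.

```python
import numpy as np

def field_dct_reorder(mb: np.ndarray) -> np.ndarray:
    """First eight rows from the top-field, last eight from the bottom-field."""
    return np.vstack((mb[0::2, :], mb[1::2, :]))

def field_dct_restore(mb: np.ndarray) -> np.ndarray:
    """Inverse permutation, applied after the IDCT."""
    out = np.empty_like(mb)
    out[0::2, :] = mb[:8, :]
    out[1::2, :] = mb[8:, :]
    return out

mb = np.arange(256).reshape(16, 16)
assert np.array_equal(field_dct_restore(field_dct_reorder(mb)), mb)
```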

MPEG-2 Scalabilities

As in JPEG2000, scalability is also an important issue for MPEG-2. Since MPEG-2 is designed for a variety of applications, including digital TV and HDTV, the video will often be transmitted over networks with very different characteristics. It is therefore necessary to have a single coded bitstream that is scalable to various bitrates.

MPEG-2 scalable coding is also known as layered coding, in which a base layer and one or more enhancement layers can be defined. The base layer can be independently encoded, transmitted, and decoded, to obtain basic video quality. The encoding and decoding of an enhancement layer, however, depends on the base layer or a previous enhancement layer. Often, only one enhancement layer is employed, which is called two-layer scalable coding.

Scalable coding is suitable for MPEG-2 video transmitted over networks with the following characteristics:

  • Very different bitrates. If the link speed is slow (such as a 56 kbps modem line), only the bitstream from the base layer will be sent. Otherwise, bitstreams from one or more enhancement layers will also be sent, to achieve improved video quality.

  • Variable-bitrate (VBR) channels. When the bitrate of the channel deteriorates, bitstreams from fewer or no enhancement layers will be transmitted, and vice versa.

  • Noisy connections. The base layer can be better protected or sent via channels known to be less noisy.

Moreover, scalable coding is ideal for progressive transmission: bitstreams from the base layer are sent first, to give users a fast and basic view of the video, followed by gradually increasing data and improving quality. This can be useful for delivering compatible digital TV (ATV) and HDTV.

MPEG-2 supports the following scalabilities:

  • SNR scalability. The enhancement layer provides higher SNR.
  • Spatial scalability. The enhancement layer provides higher spatial resolution.
  • Temporal scalability. The enhancement layer facilitates a higher frame rate.
  • Hybrid scalability. This combines any two of the above three scalabilities.
  • Data partitioning. Quantized DCT coefficients are split into partitions.

SNR Scalability The following figure illustrates how SNR scalability works in the MPEG-2 encoder and decoder.

The MPEG-2 SNR scalable encoder generates output bitstreams Bits_base and Bits_enhance at two layers. At the base layer, a coarse quantization of the DCT coefficients is employed, which results in fewer bits and relatively low-quality video. After variable-length coding, the bitstream is called Bits_base.

The coarsely quantized DCT coefficients are then inversely quantized (Q^-1) and fed to the enhancement layer, to be compared with the original DCT coefficients. Their difference is finely quantized to generate a DCT coefficient refinement, which, after variable-length coding, becomes the bitstream called Bits_enhance. The inversely quantized coarse and refined DCT coefficients are added back, and after the inverse DCT (IDCT), they are used for motion-compensated prediction for the next frame. Since the enhancement/refinement over the base layer improves the signal-to-noise ratio, this type of scalability is called SNR scalability.

If, for some reason (e.g., the breakdown of some network channel), Bits_enhance from the enhancement layer cannot be obtained, the above scalable scheme can still work using Bits_base only. In that case, the input from the inverse quantizer (Q^-1) of the enhancement layer is simply treated as zero.
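The coarse-plus-refinement structure can be sketched with two uniform quantizers. The step sizes below are illustrative stand-ins, not MPEG-2 values, and the sketch omits variable-length coding and motion compensation.

```python
import numpy as np

COARSE_STEP = 16   # base layer: few bits, low quality (illustrative value)
FINE_STEP = 4      # enhancement-layer refinement (illustrative value)

def encode_snr(dct_coeffs):
    base = np.round(dct_coeffs / COARSE_STEP)                # -> Bits_base
    reconstructed_base = base * COARSE_STEP                  # Q^-1 at base layer
    refinement = np.round((dct_coeffs - reconstructed_base)
                          / FINE_STEP)                       # -> Bits_enhance
    return base, refinement

def decode_snr(base, refinement=None):
    coeffs = base * COARSE_STEP
    if refinement is not None:          # enhancement layer available
        coeffs = coeffs + refinement * FINE_STEP
    return coeffs                       # these coefficients feed the IDCT

coeffs = np.array([100.0, -37.0, 12.0, -5.0])
base, refine = encode_snr(coeffs)
print(decode_snr(base))                 # Output_base quality
print(decode_snr(base, refine))         # Output_high quality
```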

MPEG-2 SNR scalability: (a) encoder; (b) decoder

The decoder operates in the reverse order of the encoder. Both Bits_base and Bits_enhance are variable-length decoded (VLD) and inversely quantized (Q^-1) before being added together to restore the DCT coefficients. The remaining steps are the same as in any motion-compensation-based video decoder. If both bitstreams (Bits_base and Bits_enhance) are used, the output video is Output_high, with enhanced quality. If only Bits_base is used, the output video Output_base is of basic quality.

Spatial Scalability The base and enhancement layers for MPEG-2 spatial scalability are not as tightly coupled as in SNR scalability; hence, this type of scalability is somewhat less complicated. We will not show the details of both encoder and decoder, as we did above, but will explain only the encoding process, using high-level diagrams.

The base layer is designed to generate a bitstream of reduced-resolution pictures. Combining them with the enhancement layer produces pictures at the original resolution. As the following figure shows, the original video data is spatially decimated by a factor of 2 and sent to the base layer encoder. After the normal coding steps of motion compensation, DCT on prediction errors, quantization, and entropy coding, the output bitstream is Bits_base.

The predicted macroblock from the base layer is now spatially interpolated to resolution 16 × 16. This is then combined with the normal, temporally predicted macroblock from the enhancement layer itself, to form the prediction macroblock for the purpose of motion compensation in this layered coding. The spatial interpolation here adopts bilinear interpolation, as discussed earlier.

The combination of macroblocks uses a simple weight table, where the value of the weight w is in the range [0, 1.0]. If w = 0, no consideration is given to the predicted macroblock from the base layer. If w = 1, the prediction is entirely from the base layer. Normally, both predicted macroblocks are linearly combined, using the weights w and 1 − w, respectively. To achieve minimum prediction errors, MPEG-2 encoders have an analyzer to choose different w values from the weight table on a macroblock basis.
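A minimal sketch of this weighted combination follows. The weight table here is a hypothetical stand-in for the MPEG-2 table, and the analyzer is reduced to a brute-force search over it.

```python
import numpy as np

WEIGHTS = [0.0, 0.25, 0.5, 0.75, 1.0]   # illustrative weight table

def best_prediction(target, base_pred_upsampled, temporal_pred):
    """Pick the weight w minimizing the squared prediction error."""
    best_w, best_pred, best_err = None, None, np.inf
    for w in WEIGHTS:
        # w weights the (interpolated) base-layer prediction, 1 - w the
        # temporal prediction from the enhancement layer itself.
        pred = w * base_pred_upsampled + (1 - w) * temporal_pred
        err = np.sum((target - pred) ** 2)
        if err < best_err:
            best_w, best_pred, best_err = w, pred, err
    return best_w, best_pred

target = np.full((16, 16), 10.0)
w, _ = best_prediction(target,
                       base_pred_upsampled=np.full((16, 16), 12.0),
                       temporal_pred=np.full((16, 16), 9.0))
print(w)   # 0.25: mostly temporal, with some base-layer contribution
```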

Temporal Scalability Temporally scalable coding has both the base and enhancement layers of video at a reduced temporal rate (frame rate). The reduced frame rates for the layers are often the same; however, they can also be different. Pictures from the base layer and enhancement layer(s) have the same spatial resolution as the input video. When combined, they restore the video to its original temporal rate.

Encoder for MPEG-2 spatial scalability: (a) block diagram; (b) combining temporal and spatial predictions for encoding at the enhancement layer

The following figure illustrates the MPEG-2 implementation of temporal scalability. The input video is temporally demultiplexed into two pieces, each carrying half the original frame rate. As before, the base layer encoder carries out the normal single-layer coding procedures for its own input video and yields the output bitstream Bits_base.
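The demultiplexing itself is simple interleaving, as the following sketch shows (sending even-indexed frames to the base layer is an arbitrary choice for illustration).

```python
def temporal_demux(frames):
    """Split a frame sequence into two half-rate streams, one per layer."""
    base_layer = frames[0::2]          # even-indexed frames -> base encoder
    enhancement_layer = frames[1::2]   # odd-indexed frames -> enhancement encoder
    return base_layer, enhancement_layer

def temporal_remux(base_layer, enhancement_layer):
    """Interleave the decoded layers back to the original frame rate."""
    out = []
    for pair in zip(base_layer, enhancement_layer):
        out.extend(pair)
    return out

frames = list(range(10))               # stand-in for decoded pictures
base, enh = temporal_demux(frames)
assert temporal_remux(base, enh) == frames
```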

The prediction of matching macroblocks at the enhancement layer can be obtained in two ways: interlayer motion-compensated prediction, or combined motion-compensated prediction and interlayer motion-compensated prediction.

Interlayer motion-compensated prediction. The macroblocks of B-frames for motion compensation at the enhancement layer are predicted from the preceding and succeeding frames (either I-, P-, or B-) at the base layer, so as to exploit the possible interlayer redundancy in motion compensation.

Combined motion-compensated prediction and interlayer motion-compensated prediction. This further combines the advantages of ordinary forward prediction and the above interlayer prediction. Macroblocks of B-frames at the enhancement layer are forward-predicted from the preceding frame at their own layer and "backward"-predicted from the preceding (or, alternatively, succeeding) frame at the base layer. At the first frame, the P-frame at the enhancement layer adopts only forward prediction from the I-frame at the base layer.

Encoder for MPEG-2 temporal scalability: (a) block diagram; (b) interlayer motion-compensated prediction; (c) combined motion-compensated prediction and interlayer motion-compensated prediction

Hybrid Scalability Any two of the above three scalabilities can be combined to form hybrid scalability. These combinations are

  • Spatial and temporal hybrid scalability
  • SNR and spatial hybrid scalability
  • SNR and temporal hybrid scalability

Usually, a three-layer hybrid coder will be adopted, consisting of the base layer, enhancement layer 1, and enhancement layer 2.

For example, for spatial and temporal hybrid scalability, the base layer and enhancement layer 1 will provide spatial scalability, and enhancement layers 1 and 2 will provide temporal scalability, in which enhancement layer 1 effectively serves as a base layer.

For the encoder, the incoming video data is first temporally demultiplexed into two streams: one going to enhancement layer 2, the other to enhancement layer 1 and the base layer (after further spatial decimation for the base layer).

The encoder generates three output bitstreams: (a) Bits_base from the base layer, (b) the spatially enhanced Bits_enhance1 from enhancement layer 1, and (c) the spatially and temporally enhanced Bits_enhance2 from enhancement layer 2.
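A minimal sketch of this three-way division follows, reusing the demultiplexing and decimation ideas shown earlier; the per-layer encoding steps are omitted, and the `decimate` argument is a hypothetical stand-in for the spatial decimation stage.

```python
def hybrid_split(frames, decimate):
    """Divide the input among the three layers of spatial + temporal
    hybrid scalability."""
    to_enh2 = frames[1::2]                    # half rate -> enhancement layer 2
    to_enh1 = frames[0::2]                    # half rate -> enhancement layer 1
    to_base = [decimate(f) for f in to_enh1]  # spatially decimated -> base layer
    return to_base, to_enh1, to_enh2          # -> Bits_base, Bits_enhance1, Bits_enhance2

frames = list(range(8))
base, enh1, enh2 = hybrid_split(frames, decimate=lambda f: f)  # identity stand-in
assert enh1 == [0, 2, 4, 6] and enh2 == [1, 3, 5, 7]
```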

The implementations of the other two hybrid scalabilities are similar and are left as exercises.

Data Partitioning The compressed video stream is divided into two partitions. The base partition contains lower-frequency DCT coefficients, and the enhancement partition contains high-frequency DCT coefficients. Although the partitions are sometimes also referred to as layers (base layer and enhancement layer), strictly speaking, data partitioning does not perform the same type of layered coding: the single stream of video data is simply divided up, and the enhancement partition does not depend on the base partition. Nevertheless, data partitioning can be useful for transmission over noisy channels and for progressive transmission.
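Since the split is a simple division of one stream, it can be sketched as slicing the scan-ordered coefficients of each block at a break point; the cut-off value below is illustrative, not an MPEG-2 constant.

```python
def partition_block(scanned_coeffs, break_point=10):
    """Split one block's coefficients, in scan order, into two partitions."""
    base_partition = scanned_coeffs[:break_point]         # low-frequency coefficients
    enhancement_partition = scanned_coeffs[break_point:]  # high-frequency coefficients
    return base_partition, enhancement_partition

coeffs = list(range(64, 0, -1))    # 64 coefficients in scan order (stand-in)
base, enh = partition_block(coeffs)
assert base + enh == coeffs        # a plain split, not layered coding
```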

Other Major Differences from MPEG-1

Better resilience to bit errors. Since MPEG-2 video will often be transmitted over various networks, some of them noisy and unreliable, bit errors are inevitable. To cope with this, MPEG-2 systems define two types of streams: Program and Transport. The Program stream is similar to the Systems stream in MPEG-1; hence, it also facilitates backward compatibility with MPEG-1.

The Transport stream aims at providing error resilience and the ability to include multiple programs with independent time bases in a single stream, for asynchronous multiplexing and network transmission. Instead of using long, variable-length packets, as in MPEG-1 and in the MPEG-2 Program stream, it uses fixed-length (188-byte) packets. It also has a new header syntax, for better error checking and correction.
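The fixed packet size is easy to illustrate. The sketch below chops a payload into 188-byte packets; the 4-byte header beginning with the 0x47 sync byte follows the real Transport stream layout, but the remaining header fields here are simplified placeholders, not the full MPEG-2 header syntax.

```python
SYNC_BYTE = 0x47
PACKET_SIZE = 188
HEADER_SIZE = 4
PAYLOAD_SIZE = PACKET_SIZE - HEADER_SIZE   # 184 payload bytes per packet

def packetize(data: bytes):
    """Chop a payload into fixed-length 188-byte packets (simplified header)."""
    packets = []
    for i in range(0, len(data), PAYLOAD_SIZE):
        chunk = data[i:i + PAYLOAD_SIZE].ljust(PAYLOAD_SIZE, b"\xff")  # pad last packet
        header = bytes([SYNC_BYTE, 0, 0, 0])   # placeholder PID/flag fields
        packets.append(header + chunk)
    return packets

for pkt in packetize(b"x" * 1000):
    assert len(pkt) == PACKET_SIZE
```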

Support of 4:2:2 and 4:4:4 chroma subsampling. In addition to 4:2:0 chroma subsampling, as in H.261 and MPEG-1, MPEG-2 also allows 4:2:2 and 4:4:4, to increase color quality. As discussed earlier, each chrominance picture in 4:2:2 is horizontally subsampled by a factor of 2, whereas 4:4:4 is a special case, in which no chroma subsampling actually takes place.
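The three formats differ only in how the chrominance planes are subsampled, as this sketch shows (plain decimation stands in for proper filtering).

```python
import numpy as np

def subsample_chroma(chroma: np.ndarray, fmt: str) -> np.ndarray:
    """Subsample one chrominance plane according to the chroma format."""
    if fmt == "4:4:4":
        return chroma                  # no subsampling
    if fmt == "4:2:2":
        return chroma[:, 0::2]         # horizontal subsampling by 2
    if fmt == "4:2:0":
        return chroma[0::2, 0::2]      # horizontal and vertical subsampling by 2
    raise ValueError(fmt)

chroma = np.zeros((480, 720))
print(subsample_chroma(chroma, "4:2:2").shape)   # (480, 360)
print(subsample_chroma(chroma, "4:2:0").shape)   # (240, 360)
```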

Table: Possible nonlinear scale in MPEG-2

Nonlinear quantization. Quantization in MPEG-2 is similar to that in MPEG-1. Its step size is also determined by the product of Q[i, j] and scale, where Q is one of the default quantization tables for intra- or inter-coding. Two types of scale are allowed. For the first, scale is the same as in MPEG-1: it is an integer in the range [1, 31], and scale_i = i. For the second type, however, a nonlinear relationship exists, that is, scale_i ≠ i, and the ith scale value is looked up from a nonlinear scale table.
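In code, the step-size rule reads as in the sketch below. Both the flat quantization table and the idea of passing the nonlinear table as a lookup dict are illustrative stand-ins, not the actual MPEG-2 values.

```python
def step_size(Q, i, j, scale_index, nonlinear_table=None):
    """Quantizer step size: Q[i][j] * scale, per the rule quoted above."""
    if nonlinear_table is None:
        scale = scale_index                   # first type: scale_i = i, as in MPEG-1
    else:
        scale = nonlinear_table[scale_index]  # second type: scale_i != i in general
    return Q[i][j] * scale

Q_flat = [[16] * 8 for _ in range(8)]         # flat stand-in for a default table
print(step_size(Q_flat, 0, 0, 4))             # linear scale: 16 * 4 = 64
```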

More restricted slice structure. MPEG-1 allows slices to cross macroblock row boundaries. As a result, an entire picture can be a single slice. MPEG-2 slices must start and end in the same macroblock row. In other words, the left edge of a picture always starts a new slice, and the longest slice in MPEG-2 can have only one row of macroblocks.

More flexible video formats. According to the standard, MPEG-2 picture sizes can be as large as 16 k × 16 k pixels. In reality, MPEG-2 is used mainly to support the various picture resolutions defined by DVD, ATV, and HDTV.

Similar to H.261, H.263, and MPEG-1, MPEG-2 specifies only its bitstream syntax and the decoder. This leaves much room for future improvement, especially on the encoder side.

