Object-Based Visual Coding in MPEG-4

MPEG-4 encodes/decodes each VOP separately, instead of considering the whole frame. Hence, its object-based visual coding is also known as VOP-based coding.

VOP-Based Coding vs. Frame-Based Coding

MPEG-1 and 2 do not support the VOP concept; hence, their coding method is referred to as frame-based. Since each frame is divided into many macroblocks on which motion-compensation-based coding is conducted, it is also known as block-based coding.

MPEG-1 and 2 visual coding are concerned only with compression ratio and do not consider the existence of visual objects. Therefore, the motion vectors generated may be inconsistent with the object's motion and would not be useful for object-based video analysis and indexing.

Comparison between block-based coding and object-based coding: (a) a video sequence; (b) MPEG-1 and 2 block-based coding; (c) two potential matches in MPEG-1 and 2; (d) object-based coding in MPEG-4


The above figure illustrates a possible example in which both potential matches yield small prediction errors. If Potential Match 2 yields a (slightly) smaller prediction error than Potential Match 1, MV2 will be chosen as the motion vector for the macroblock in the block-based coding approach, although only MV1 is consistent with the vehicle's direction of motion.

Object-based coding in MPEG-4 aims to solve this problem, in addition to improving compression. As the figure shows, each VOP is of arbitrary shape and will ideally obtain a unique motion vector consistent with the object's motion.

MPEG-4 VOP-based coding also employs the motion compensation technique. An intra-frame-coded VOP is called an I-VOP. Inter-frame-coded VOPs are called P-VOPs if only forward prediction is employed, or B-VOPs if bidirectional predictions are employed. The new difficulty here is that the VOPs may have arbitrary shapes. Therefore, in addition to their texture, their shape information must now be coded.

It is worth noting that texture here actually refers to the visual content, that is, the gray-level and chroma values of the pixels in the VOP. MPEG-1 and 2 do not code shape information, since all frames are rectangular, but they do code the values of the pixels in the frame. In MPEG-1 and 2, this coding was not explicitly referred to as texture coding. The term "texture" comes from computer graphics and shows how this discipline has entered the video coding world with MPEG-4.

Below, we start with a discussion of motion-compensation-based coding for VOPs, followed by introductions to texture coding, shape coding, static texture coding, sprite coding, and global motion compensation.

Motion Compensation

Motion compensation is an algorithmic technique employed in the encoding of video data for video compression, for example in the generation of MPEG-2 files. Motion compensation describes a picture in terms of the transformation of a reference picture to the current picture. The reference picture may be previous in time or even from the future. When images can be accurately synthesized from previously transmitted/stored images, the compression efficiency can be improved.

Since I-VOP coding is relatively straightforward, our discussions will concentrate on coding for P-VOPs and/or B-VOPs, unless I-VOP is explicitly mentioned.

As before, motion-compensation-based VOP coding in MPEG-4 again involves three steps: motion estimation, motion-compensation-based prediction, and coding of the prediction error. To facilitate motion compensation, each VOP is divided into many macroblocks, as in previous frame-based methods. Macroblocks are by default 16 x 16 in luminance images and 8 x 8 in chrominance images, and are treated specially when they straddle the boundary of an arbitrarily shaped VOP.

MPEG-4 defines a rectangular bounding box for each VOP. Its left and top bounds are the left and top bounds of the VOP, which in turn specify the shifted origin for the VOP from the original (0, 0) of the video frame in the absolute (frame) coordinate system. Both horizontal and vertical dimensions of the bounding box must be multiples of 16 in the luminance image. Therefore, the box is usually slightly larger than a conventional bounding box.
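The rounding of the bounding-box dimensions up to multiples of 16 can be sketched in a few lines (an illustrative sketch; the function name is ours, not part of any MPEG-4 API):

```python
def vop_bounding_box(width: int, height: int, block: int = 16):
    """Round VOP bounding-box dimensions up to multiples of the block size."""
    def round_up(n: int) -> int:
        return ((n + block - 1) // block) * block
    return round_up(width), round_up(height)
```

For example, a 50 x 30 VOP would get a 64 x 32 bounding box, slightly larger than the conventional tight box.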

Macroblocks entirely within the VOP are referred to as interior macroblocks. As is apparent from the following figure, many of the macroblocks straddle the boundary of the VOP and are called boundary macroblocks.

Motion compensation for interior macroblocks is carried out in the same manner as in MPEG-1 and 2. However, boundary macroblocks could be difficult to match in motion estimation, since VOPs often have arbitrary (nonrectangular) shapes, and their shape may change from one instant in the video to another. To help match every pixel in the target VOP and meet the mandatory requirement of rectangular blocks in transform coding (e.g., DCT), a preprocessing step of padding is applied to the reference VOPs prior to motion estimation.

Only pixels within the current (target) VOP are considered for matching in motion compensation, and padding takes place only in the reference VOPs.

For better quality, a more sophisticated extrapolation method than padding could have been developed. Padding was adopted in MPEG-4 largely for its simplicity and speed.

Two key steps of motion compensation are discussed below: padding and motion vector coding.

Bounding box and boundary macroblocks of VOP


Padding. For all boundary macroblocks in the reference VOP, horizontal repetitive padding is invoked first, followed by vertical repetitive padding. Afterward, for all exterior macroblocks that are outside of the VOP but adjacent to one or more boundary macroblocks, extended padding is applied.

The horizontal repetitive padding algorithm examines each row in the boundary macroblocks of the reference VOP. Each boundary pixel is replicated to the left and/or right to fill in the values for the interval of pixels outside the VOP in the macroblock. If the interval is bounded by two boundary pixels, their average is adopted.
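The row-by-row rule just described can be sketched as follows (an illustrative Python sketch; the function name and argument layout are our assumptions, not MPEG-4 reference code):

```python
def pad_row(row, inside):
    """Horizontal repetitive padding for one row of a boundary macroblock.

    row    : pixel values in the row
    inside : booleans, True where the pixel belongs to the VOP
    """
    out = list(row)
    n, i = len(row), 0
    while i < n:
        if inside[i]:
            i += 1
            continue
        j = i
        while j < n and not inside[j]:      # interval of pixels outside the VOP
            j += 1
        left = out[i - 1] if i > 0 else None    # boundary pixel on the left
        right = out[j] if j < n else None       # boundary pixel on the right
        if left is not None and right is not None:
            fill = (left + right) // 2          # bounded by two boundary pixels: average
        elif left is not None or right is not None:
            fill = left if left is not None else right
        else:
            i = j                               # no boundary pixels in this row: skip
            continue
        out[i:j] = [fill] * (j - i)
        i = j
    return out
```

With the values from Row 5 of the example below, the interval between boundary pixels 50 and 80 is filled with their average, 65.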

A sequence of paddings for reference VOPs in MPEG-4


An example of repetitive padding in a boundary macroblock of a reference VOP: (a) original pixels within the VOP; (b) after horizontal repetitive padding; (c) followed by vertical repetitive padding




1. Horizontal repetitive padding

Row 0. The rightmost pixel of the VOP is the only boundary pixel. Its intensity value, 60, is used repetitively as the value of the pixels outside the VOP.

Row 1. Similarly, the rightmost pixel of the VOP is the only boundary pixel. Its intensity value, 50, is used repetitively as the pixel value outside of the VOP.

Rows 2 and 3. No horizontal padding, since no boundary pixels exist.

Row 4. There exist two intervals outside the VOP, each bounded by a single boundary pixel. Their intensity values, 60 and 70, are used as the pixel values of the two intervals, respectively.

Row 5. A single interval outside the VOP is bounded by a pair of boundary pixels of the VOP. The average of their intensity values, (50 + 80) / 2 = 65, is used repetitively as the value of the pixels between them.

2. Vertical repetitive padding

Column 0. A single interval is bounded by a pair of boundary pixels of the VOP. One is 42, in the VOP; the other is 60, which just arose from horizontal padding. The average of their intensity values, (42 + 60) / 2 = 51, is used repetitively as the value of the pixels between them.

Columns 1, 2, 3, 4, and 5. These columns are padded similarly to Column 0.

Extended Padding. Macroblocks entirely outside the VOP are exterior macroblocks. Exterior macroblocks immediately next to boundary macroblocks are filled by replicating the values of the border pixels of the boundary macroblock. Note that boundary macroblocks are by now fully padded, so all their horizontal and vertical border pixels have defined values. If an exterior macroblock has more than one boundary macroblock as its immediate neighbor, the boundary macroblock to use for extended padding follows a priority list: left, top, right, and bottom.

Later versions of MPEG-4 allow some average values of these macroblocks to be used. This extended padding process can be repeated to fill in all exterior macroblocks within the rectangular bounding box of the VOP.

Motion Vector Coding. Motion vector coding efficiency is an important issue in low-bitrate video coding, because motion vectors take up an increasing relative portion of the bits. One technique is based on minimum-bitrate prediction: a predicted motion vector is chosen from the three causal neighboring motion vectors so that it produces a minimum bitrate in motion vector difference coding.

The prediction error, or motion vector difference (MVD), and the mode information (MODE) for determining the predicted motion vector at the decoder are then coded and transmitted in order. By sending the bits for MVD ahead of the bits for MODE, the scheme can minimize the bit amount for MODE, taking advantage of the fact that the minimum-bitrate predictor is used for motion vector prediction. By adaptively combining this minimum-bitrate prediction scheme with the conventional model-based prediction scheme, more efficient motion vector coding can be achieved, improving coding efficiency noticeably for various video sequences.

Each macroblock from the target VOP finds a best matching macroblock in the reference VOP through the following motion estimation procedure.

Let C(x + k, y + l) be pixels of the macroblock in the target VOP, and R(x + i + k, y + j + l) be pixels of the macroblock in the reference VOP. Similar to MAD, a Sum of Absolute Differences (SAD) for measuring the difference between the two macroblocks can be defined as

SAD(i, j) = sum_{k=0}^{N-1} sum_{l=0}^{N-1} |C(x + k, y + l) - R(x + i + k, y + j + l)| * Map(x + k, y + l)

where N is the size of the macroblock. Map(p, q) = 1 when C(p, q) is a pixel within the target VOP; otherwise, Map(p, q) = 0. The vector (i, j) that yields the minimum SAD is adopted as the motion vector MV(u, v):

(u, v) = {(i, j) | SAD(i, j) is minimum, i in [-p, p], j in [-p, p]}

where p is the maximal allowable magnitude for u and v.
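As a sketch, the masked SAD and the exhaustive search over (i, j) might look like this (illustrative Python; the names and the plain nested-list image representation are our assumptions, and valid indexing presumes a padded reference):

```python
def sad(C, R, Map, x, y, i, j, N):
    """Masked Sum of Absolute Differences between the target macroblock at (x, y)
    in C and the candidate macroblock displaced by (i, j) in the reference R."""
    total = 0
    for k in range(N):
        for l in range(N):
            if Map[y + l][x + k]:       # count only pixels inside the target VOP
                total += abs(C[y + l][x + k] - R[y + j + l][x + i + k])
    return total

def motion_vector(C, R, Map, x, y, N, p):
    """Exhaustive search for the displacement (u, v) in [-p, p] minimizing SAD."""
    candidates = [(i, j) for i in range(-p, p + 1) for j in range(-p, p + 1)]
    return min(candidates, key=lambda v: sad(C, R, Map, x, y, v[0], v[1], N))
```

A practical encoder would use a fast search rather than this brute-force scan, but the masking by Map is the VOP-specific part.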

For motion compensation, the motion vector MV is coded. As in H.263, the motion vector of the target macroblock is not simply taken as the MV. Instead, MV is predicted from three neighboring macroblocks. The prediction error for the motion vector is then variable-length coded.

The following are some of the advanced motion compensation techniques adopted, similar to the ones in H.263:

  • Four motion vectors (each from an 8 x 8 block) can be generated for each macroblock in the luminance component of a VOP.
  • Motion vectors can have subpixel precision. At half-pixel precision, the range of motion vectors is [-2,048, 2,047]. MPEG-4 also allows quarter-pixel precision in the luminance component of a VOP.
  • Unrestricted motion vectors are allowed: MV can point beyond the boundaries of the reference VOP. When a pixel outside the VOP is referenced, its value is still defined, due to padding.

Texture Coding

Texture refers to gray-level (or chroma) variations and/or patterns in the VOP. Texture coding in MPEG-4 can be based either on DCT or on Shape-Adaptive DCT (SA-DCT).

Texture Coding Based on DCT. In I-VOPs, the gray (or chroma) values of the pixels in each macroblock of the VOP are directly coded using the DCT followed by VLC, similar to what is done in JPEG for still pictures. P-VOPs and B-VOPs use motion-compensation-based coding; hence, it is the prediction error that is sent to DCT and VLC. The following discussion focuses on motion-compensation-based texture coding for P-VOPs and B-VOPs.

Coding for interior macroblocks, each 16 x 16 in the luminance VOP and 8 x 8 in the chrominance VOP, is similar to the conventional motion-compensation-based coding in H.261, H.263, and MPEG-1 and 2. Prediction errors from the six 8 x 8 blocks of each macroblock are obtained after the conventional motion estimation step. These are sent to a DCT routine to obtain six 8 x 8 blocks of DCT coefficients.

For boundary macroblocks, areas outside the VOP in the reference VOP are padded using repetitive padding, as described above. After motion compensation, texture prediction errors within the target VOP are obtained. For portions of the boundary macroblocks in the target VOP outside the VOP, zeros are padded to the block sent to DCT, since ideally, prediction errors would be near zero inside the VOP. Whereas repetitive padding and extended padding were for better matching in motion compensation, this additional zero padding is for better DCT results in texture coding.

The quantization step size for the DC component is 8. For the AC coefficients, one of the following two methods can be employed:

  • The H.263 method, in which all coefficients receive the same quantizer, controlled by a single parameter, and different macroblocks can have different quantizers.
  • The MPEG-2 method, in which DCT coefficients in the same macroblock can have different quantizers and are further controlled by the step-size parameter.

Shape-Adaptive DCT (SA-DCT)-Based Coding for Boundary Macroblocks. SA-DCT is another texture coding method for boundary macroblocks. Due to its effectiveness, SA-DCT has been adopted for coding boundary macroblocks in MPEG-4 Version 2.

1D DCT-N is a variation of the 1D DCT described earlier, in that N elements are used in the transform instead of a fixed N = 8.

1D DCT-N:

F(u) = sqrt(2/N) * C(u) * sum_{i=0}^{N-1} f(i) * cos((2i + 1) * u * pi / (2N)),  u = 0, 1, ..., N - 1

1D IDCT-N:

f(i) = sqrt(2/N) * sum_{u=0}^{N-1} C(u) * F(u) * cos((2i + 1) * u * pi / (2N)),  i = 0, 1, ..., N - 1

where C(u) = 1/sqrt(2) for u = 0, and C(u) = 1 otherwise.
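A direct transcription of the variable-length DCT-N pair in its orthonormal form (illustrative Python, not optimized; function names are ours):

```python
import math

def dct_n(f):
    """Forward 1D DCT-N of a length-N sequence (orthonormal form)."""
    N = len(f)
    F = []
    for u in range(N):
        c = math.sqrt(0.5) if u == 0 else 1.0
        s = sum(f[i] * math.cos((2 * i + 1) * u * math.pi / (2 * N)) for i in range(N))
        F.append(math.sqrt(2.0 / N) * c * s)
    return F

def idct_n(F):
    """Inverse 1D DCT-N, recovering the original sequence."""
    N = len(F)
    f = []
    for i in range(N):
        s = sum((math.sqrt(0.5) if u == 0 else 1.0) * F[u] *
                math.cos((2 * i + 1) * u * math.pi / (2 * N)) for u in range(N))
        f.append(math.sqrt(2.0 / N) * s)
    return f
```

With this scaling the pair is orthonormal, so idct_n(dct_n(f)) reproduces f up to floating-point error, and a constant input produces a single nonzero DC coefficient.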

SA-DCT is a 2D DCT computed as a separable 2D transform in two iterations of DCT-N. The following figure illustrates the process of texture coding for boundary macroblocks using SA-DCT. The transform is applied to each of the 8 x 8 blocks in the boundary macroblock.

The figure shows one of the 8 x 8 blocks of a boundary macroblock, where pixels inside the macroblock, denoted f(x, y), are shown in gray. The gray pixels are first shifted upward to obtain f'(x, y), as the figure shows. In the first iteration, DCT-N is applied to each column of f'(x, y), with N determined by the number of gray pixels in the column. Hence, we use DCT-2, DCT-3, DCT-5, and so on. The resulting DCT-N coefficients are denoted F'(x, v), as the figure shows, where the dark dots indicate the DC coefficients of the DCT-Ns. The elements of F'(x, v) are then shifted to the left to obtain F''(x, v) in the figure.

In the second iteration, DCT-N is applied to each row of F''(x, v) to obtain G(u, v), in which the single dark dot indicates the DC coefficient G(0, 0) of the 2D SA-DCT.
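The two-iteration procedure can be sketched as follows (illustrative Python, self-contained with its own orthonormal DCT-N helper; the ragged-list output layout and function names are our choices, not the normative bitstream layout):

```python
import math

def dct_n(f):
    """Variable-length 1D DCT-N (orthonormal form)."""
    N = len(f)
    return [math.sqrt(2.0 / N) * (math.sqrt(0.5) if u == 0 else 1.0) *
            sum(f[i] * math.cos((2 * i + 1) * u * math.pi / (2 * N)) for i in range(N))
            for u in range(N)]

def sa_dct(block, mask, size=8):
    """SA-DCT sketch: shift VOP pixels upward, DCT-N each column,
    shift coefficients left, then DCT-N each row."""
    # 1) shift the masked pixels of each column upward and transform
    cols = []
    for x in range(size):
        col = [block[y][x] for y in range(size) if mask[y][x]]
        cols.append(dct_n(col) if col else [])
    # 2) shift each coefficient row to the left and transform
    n_rows = max((len(c) for c in cols), default=0)
    out = []
    for v in range(n_rows):
        row = [c[v] for c in cols if v < len(c)]   # left shift skips exhausted columns
        out.append(dct_n(row))
    return out   # ragged coefficient array; out[0][0] is the DC term
```

Note that the number of output coefficients equals the number of masked pixels, which is the shape-adaptive property discussed below.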

Texture coding for boundary macroblocks using the Shape-Adaptive DCT (SA-DCT)


Some coding considerations:

  • The total number of DCT coefficients in G(u, v) is equal to the number of gray pixels inside the 8 x 8 block of the boundary macroblock, which is less than 8 x 8. Hence, the method is shape adaptive and is more efficient to compute.

  • At decoding time, since the array elements must be shifted back properly after each iteration of IDCT-Ns, a binary mask of the original shape is required to decode the texture information coded by SA-DCT. The binary mask is the same as the binary alpha map described below.

Shape Coding

Unlike MPEG-1 and 2, MPEG-4 must code the shape of the VOP, since shape is one of the intrinsic features of visual objects.

MPEG-4 supports two types of shape information: binary and grayscale. Binary shape information can be in the form of a binary map (also known as a binary alpha map) that is of the same size as the VOP's rectangular bounding box. A value of 1 (opaque) or 0 (transparent) in the bitmap indicates whether the pixel is inside or outside the VOP.

The grayscale shape information, in contrast, actually refers to the shape's transparency, with gray values ranging from 0 (transparent) to 255 (opaque).

Binary Shape Coding. To encode the binary alpha map more efficiently, the map is divided into 16 x 16 blocks, also known as Binary Alpha Blocks (BABs). If a BAB is entirely opaque or entirely transparent, it is easy to code, and no special shape-coding technique is necessary. It is the boundary BABs that contain the contour, and hence the shape information, for the VOP. They are the subject of binary shape coding.
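The three-way classification of BABs can be expressed trivially (a minimal sketch; the function and label names are ours):

```python
def classify_bab(bab):
    """Classify a Binary Alpha Block as opaque, transparent, or boundary."""
    flat = [bit for row in bab for bit in row]
    if all(flat):
        return "opaque"       # entirely inside the VOP: trivial to code
    if not any(flat):
        return "transparent"  # entirely outside the VOP: trivial to code
    return "boundary"         # contains part of the VOP contour
```

Only blocks classified as boundary need the shape-coding machinery described next.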

Various contour-based and bitmap-based (or area-based) algorithms have been studied and compared for coding boundary BABs. Two of the finalists were both bitmap-based. One was the Modified Modified READ (MMR) algorithm, which was also an optional enhancement in the fax Group 3 (G3) standard and the mandatory compression method in the Group 4 (G4) standard. The other finalist was Context-based Arithmetic Encoding (CAE), which was initially developed for JBIG. CAE was finally chosen as the binary shape-coding method for MPEG-4 because of its simplicity and compression efficiency.

MMR is basically a series of simplifications of the Relative Element Address Designate (READ) algorithm. The basic idea behind the READ algorithm is to code the current line relative to the pixel locations in the previously coded line. The algorithm starts by identifying five pixel locations in the previous and current lines:

  • a0: the last pixel value known to both the encoder and decoder
  • a1: the transition pixel to the right of a0
  • a2: the second transition pixel to the right of a0
  • b1: the first transition pixel whose color is opposite to that of a0 on the previously coded line
  • b2: the first transition pixel to the right of b1 on the previously coded line

READ works by examining the relative positions of these pixels. At any time, both the encoder and decoder know the position of a0, b1, and b2, while the positions of a1 and a2 are known only to the encoder.

Three coding modes are used:

  • If the run lengths on the previous and the current lines are similar, the distance between a1 and b1 should be much smaller than the distance between a0 and a1. Thus, the vertical mode encodes the current run length as a1 - b1.
  • If the previous line has no similar run length, the current run length is coded using one-dimensional run-length coding. This is called the horizontal mode.
  • If a0 ≤ b1 < b2 < a1, we can simply transmit a codeword indicating pass mode, advance a0 to the position under b2, and continue the coding process.

Some simplifications can be made to the READ algorithm for practical implementation. For example, if |a1 - b1| < 3, it is enough to indicate that the vertical mode applies. Also, to prevent error propagation, a k-factor is defined, such that every k lines must contain at least one line coded using conventional run-length coding. These modifications constitute the Modified READ algorithm used in the G3 standard. The Modified Modified READ (MMR) algorithm simply removes the restrictions imposed by the k-factor.
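The three-way mode decision can be caricatured as follows (a simplified sketch assuming the transition positions are already known; the threshold of 3 follows the simplification just mentioned, and the function name is ours):

```python
def read_mode(a0, a1, b1, b2, k=3):
    """Pick a READ coding mode from the transition-pixel positions (sketch)."""
    if a0 <= b1 < b2 < a1:        # previous line's run ends before a1: pass mode
        return "pass"
    if abs(a1 - b1) < k:          # similar run lengths: code a1 - b1 (vertical mode)
        return "vertical"
    return "horizontal"           # fall back to 1D run-length coding
```

A full codec would then emit the corresponding codeword and advance a0 according to the chosen mode.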

Contexts in CAE for binary shape coding in MPEG-4. O indicates the current pixel, and digits indicate the other pixels in the neighborhood: (a) intra-CAE; (b) inter-CAE


For Context-based Arithmetic Encoding, the above figure illustrates the "context" for a pixel in the boundary BAB. In intra-CAE mode, when only the target alpha map is involved, ten neighboring pixels (numbered from 0 to 9) in the same alpha map form the context. The ten binary numbers associated with these pixels can yield up to 2^10 = 1,024 possible contexts.

Now, it is apparent that certain contexts (e.g., all 1s or all 0s) appear more frequently than others. With some prior statistics, a probability table can be built to indicate the probability of occurrence for each of the 1,024 contexts.

Recall that arithmetic coding is capable of encoding a sequence of probabilistic symbols with a single number. Each pixel can look up the table to find the probability value for its context. CAE simply scans the 16 x 16 pixels in each BAB sequentially and applies arithmetic coding to derive a single floating-point number for the BAB.
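Computing a 10-bit context index from a causal neighborhood might look like the following (illustrative Python; the exact template layout and bit ordering here are our assumptions, not the normative MPEG-4 template, and out-of-range neighbors are treated as 0):

```python
def intra_context(alpha, x, y):
    """10-bit intra-CAE context index for pixel (x, y) of a binary alpha map."""
    # Assumed causal template: 2 pixels on the current row, 5 on the row above,
    # and 3 two rows above (all already decoded when scanning top-to-bottom).
    neigh = [(x - 1, y), (x - 2, y),
             (x + 2, y - 1), (x + 1, y - 1), (x, y - 1),
             (x - 1, y - 1), (x - 2, y - 1),
             (x + 1, y - 2), (x, y - 2), (x - 1, y - 2)]
    ctx = 0
    for k, (px, py) in enumerate(neigh):
        in_range = 0 <= py < len(alpha) and 0 <= px < len(alpha[0])
        bit = alpha[py][px] if in_range else 0
        ctx |= bit << k
    return ctx   # one of up to 2**10 = 1,024 contexts
```

The returned index would be used to look up the context's probability before feeding the pixel to the arithmetic coder.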

Inter-CAE mode is a natural extension of intra-CAE: it involves both the target and reference alpha maps. For each boundary macroblock in the target frame, a process of motion estimation (in integer precision) and compensation is invoked first to locate the matching macroblock in the reference frame. This establishes the corresponding positions for each pixel in the boundary BAB.

As the figure shows, the context of each pixel includes four neighboring pixels from the target alpha map and five pixels from the reference alpha map. According to its context, each pixel in the boundary BAB is assigned one of the 2^9 = 512 probabilities. Afterward, the CAE algorithm is applied.

The 16 x 16 binary map originally contains 256 bits of information. Compressing it to a single floating-point number achieves a substantial saving.

The above CAE method is lossless. The MPEG-4 group also examined some simple lossy versions of this shape-coding method. For example, the binary alpha map can simply be subsampled by a factor of 2 or 4 before arithmetic coding. The tradeoff is, of course, the deterioration of the shape.

Grayscale Shape Coding. The term grayscale shape coding in MPEG-4 could be misleading, because the true shape information is coded in the binary alpha map. Grayscale here describes the transparency of the shape, not the texture.

In addition to the bitplanes for RGB frame buffers, raster graphics uses extra bitplanes for an alpha map, which can be used to describe the transparency of a graphical object. When the alpha map has more than one bitplane, multiple levels of transparency can be introduced: for example, 0 for transparent, 255 for opaque, and any number in between for various degrees of intermediate transparency. The term grayscale is used for transparency coding in MPEG-4 simply because the transparency number happens to be in the range of 0 to 255, the same as conventional 8-bit grayscale intensities.
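Per-pixel blending with such an 8-bit transparency value is straightforward (a minimal sketch; the integer arithmetic and function name are our choices):

```python
def composite(fg, bg, alpha):
    """Blend a foreground sample over a background sample using an
    8-bit alpha value (0 = transparent .. 255 = opaque)."""
    return (fg * alpha + bg * (255 - alpha)) // 255
```

At alpha 0 the background shows through unchanged; at alpha 255 the foreground fully covers it.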

Grayscale shape coding in MPEG-4 employs the same technique as the texture coding described above. It uses the alpha map and block-based motion compensation and encodes prediction errors by DCT. The boundary macroblocks need padding, as before, since not all of their pixels are in the VOP.

Coding of the transparency information (grayscale shape coding) is lossy, as opposed to coding of the binary shape information, which is by default lossless.

Static Texture Coding

MPEG-4 uses wavelet coding for the texture of static objects. This is particularly applicable when the texture is used for mapping onto 3D surfaces.

Wavelet coding can recursively decompose an image into subbands of multiple frequencies. The Embedded Zerotree Wavelet (EZW) algorithm provides a compact representation by exploiting the potentially large number of insignificant coefficients in the subbands.

The coding of subbands in MPEG - 4 static texture coding is conducted as follows:

  • The subbands with the lowest frequency are coded using DPCM. Prediction of each coefficient is based on three neighbors.
  • Coding of other subbands is based on a multiscale zero tree wavelet coding method.

The multiscale zero tree has a parent - child relation (PCR) tree for each coefficient in the lowest frequency subband. As a result, the location information of all coefficients is better tracked.

In addition to the original magnitude of the coefficients, the degree of quantization affects the data rate. If the magnitude of a coefficient is zero after quantization, it is considered insignificant. At first, a large quantizer is used; only the most significant coefficients are selected and subsequently coded using arithmetic coding. The difference between the quantized and the original coefficients is kept in residual subbands, which will be coded in the next iteration in which a smaller quantizer is employed. The process can continue for additional iterations; hence, it is very scalable.
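The iterative quantize-and-keep-residual process can be sketched as follows (illustrative Python; the quantizer schedule and names are assumptions, and the zerotree location coding is omitted):

```python
def multiscale_quantize(coeffs, quantizers):
    """Successive-approximation sketch: quantize with a coarse step first,
    keep the residuals for the next, finer pass."""
    residual = list(coeffs)
    passes = []
    for q in quantizers:
        quantized = [int(c / q) for c in residual]      # current pass (large q first)
        passes.append(quantized)
        residual = [c - v * q for c, v in zip(residual, quantized)]
    return passes, residual
```

Each pass refines the previous one, which is why the scheme is naturally scalable: a decoder can stop after any pass and still reconstruct an approximation.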

Sprite coding: (a) the sprite panoramic image of the background; (b) the foreground object (piper) in a bluescreen image; (c) the composed video scene. Piper image courtesy of the Simon Fraser University Pipe Band


Sprite Coding

Video photography often involves camera movements such as pan, tilt, zoom in/out, and so on. Often, the main objective is to track and examine foreground (moving) objects. Under these circumstances, the background can be treated as a static image. This creates a new VO type, the sprite: a graphic image that can freely move around within a larger graphic image or set of images.

To separate the foreground object from the background, we introduce the notion of a sprite panorama: a still image that describes the static background over a sequence of video frames. It can be generated using image "stitching" and warping techniques. The large sprite panoramic image can be encoded and sent to the decoder only once, at the beginning of the video sequence. When the decoder receives separately coded foreground objects and parameters describing the camera movements thus far, it can efficiently reconstruct the scene.

The above figure shows a sprite that is a panoramic image stitched from a sequence of video frames. By combining the sprite background with the piper from the bluescreen image, the new video scene can readily be decoded with the aid of the sprite code and the additional pan/tilt and zoom parameters. Clearly, foreground objects can either be from the original video scene or newly created, to realize flexible object-based composition of MPEG-4 videos.

Global Motion Compensation

Common camera motions, such as pan, tilt, rotation, and zoom (so-called global motions, since they apply to every block), often cause rapid content change between successive video frames. Traditional block-based motion compensation would result in a large number of significant motion vectors. Also, these types of camera motions cannot all be described using the translational motion model employed by block-based motion compensation. Global motion compensation (GMC) is designed to solve this problem. There are four major components:

  • Global motion estimation. Global motion estimation computes the motion of the current image with respect to the sprite. By "global" is meant the overall change due to camera motion (zooming in, panning to the side, and so on). It is computed by minimizing the sum of squared differences between the sprite S and the global motion-compensated image I':

E = sum_i [S(x_i, y_i) - I'(x_i', y_i')]^2

The idea here is that if the background (possibly stitched) image is a sprite S(x_i, y_i), we expect the new frame to consist mainly of the same background, altered by these global camera motions. To further constrain the global motion estimation problem, the motion over the whole image is parameterized by a perspective motion model using eight parameters, defined as

x' = (a0 + a1*x + a2*y) / (1 + a6*x + a7*y)
y' = (a3 + a4*x + a5*y) / (1 + a6*x + a7*y)

The resulting constrained minimization problem can be solved using a gradient-descent-based method.
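Applying the eight-parameter perspective model to a point can be sketched as follows (illustrative Python; the parameter layout a0..a7, with affine numerators divided by 1 + a6*x + a7*y, is our assumed convention):

```python
def perspective_warp(x, y, a):
    """Map (x, y) through an 8-parameter perspective motion model."""
    a0, a1, a2, a3, a4, a5, a6, a7 = a
    d = 1.0 + a6 * x + a7 * y          # common perspective denominator
    return (a0 + a1 * x + a2 * y) / d, (a3 + a4 * x + a5 * y) / d
```

With a6 = a7 = 0 the model reduces to an affine transform, so pure pan, zoom, and rotation are special cases of the same eight parameters.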

  • Warping and blending. Once the motion parameters are computed, the background images are warped to align with respect to the sprite. The coordinates of the warped image are computed. Afterward, the warped image is blended into the current sprite to produce the new sprite. This can be done using simple averaging or some form of weighted averaging.

  • Motion trajectory coding. Instead of directly transmitting the motion parameters, we encode only the displacements of reference points. This is called trajectory coding. Points at the corners of the VOP bounding box are used as reference points, and their corresponding points in the sprite are calculated. The difference between these two entities is coded and transmitted as differential motion vectors.

  • Choice of local motion compensation (LMC) or GMC. Finally, a decision has to be made whether to use GMC or LMC. For this purpose, we can apply GMC to the moving background and LMC to the foreground. Heuristically (with much detail skipped), if SAD_GMC < SAD_LMC, then use GMC to generate the predicted reference VOP. Otherwise, use LMC as before.

2D Mesh Object Plane (MOP) encoding process

