C-BIRD — A Case Study

Let us consider the specifics of how image queries are carried out. To make the discussion concrete, we use as a running example the image database search engine devised by one of the authors of this text. This system is called Content-Based Image Retrieval from Digital libraries (C-BIRD), an acronym derived from content-based image retrieval, or CBIR.

C-BIRD image-search GUI

The above figure shows the GUI for the C-BIRD system. The online image database can be browsed, or it can be searched using a selection of tools: text annotations, color histograms, illumination-invariant color histograms, color density, color layout, texture layout, and model-based search. Many of the images are keyframes from videos, and a video player is incorporated in the system.

Color Histogram

In C-BIRD, features are precomputed for each image in the database. The most prevalent feature used in image database retrieval is the color histogram, a type of global image feature; that is, the image is not segmented, and every image region is treated equally.

A color histogram counts pixels with a given pixel value in red, green, and blue (RGB). For example, in pseudocode, for images with 8-bit values in each of R, G, and B, we can fill a histogram that has 256³ bins:

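A minimal sketch of this histogram-filling step in Python (illustrative names only; an H x W x 3 array of 8-bit RGB values is assumed):

    import numpy as np

    def rgb_histogram(image):
        """Count pixels for every (R, G, B) value; image is an H x W x 3 uint8 array."""
        hist = np.zeros((256, 256, 256), dtype=np.int64)   # 256^3 bins
        for r, g, b in image.reshape(-1, 3):               # visit every pixel
            hist[r, g, b] += 1
        return hist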

Usually, we do not use histograms with so many bins, in part because fewer bins tend to smooth out differences in similar but unequal images. We also wish to save storage space.

Image search proceeds by matching the feature vector for the sample image, in this case the color histogram, against the feature vector for every image (or at least many of the images) in the database.

C-BIRD calculates a color histogram for each target image as a preprocessing step and then references it in the database for each user query image. The histogram is defined coarsely, with bins quantized to 8 bits in total: 3 bits for each of red and green and 2 bits for blue.
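As an illustration of this 3-3-2 quantization (a sketch only, not C-BIRD's code), each 8-bit (R, G, B) pixel can be mapped to one of the 256 coarse bins as follows:

    def coarse_bin(r, g, b):
        """Pack a pixel into an 8-bit bin index: top 3 bits of R, top 3 bits of G, top 2 bits of B."""
        return ((r >> 5) << 5) | ((g >> 5) << 2) | (b >> 6)

    # Example: coarse_bin(255, 0, 128) == 0b11100010 == 226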

For example, the following figure shows that the user has selected a particular image — one with red flowers. The result obtained, from a database of some 5,000 images, is a set of 60 matching images. Most CBIR systems return as the result set either the top few matches or the match set with a similarity measure above a fixed threshold value. C-BIRD uses the latter approach and thus may return zero search results.

How matching proceeds in practice depends on what measure of similarity we adopt. The standard measure used for color histograms is called the histogram intersection. First, a color histogram Hi is generated for each image i in the database. We like to think of the histogram as a three-index array, but of course the machine thinks of it as a long vector — hence the term "feature vector" for any of these types of measures.

The histogram is normalized, so that its sum (now a double) equals unity. This normalization step is interesting: it effectively removes the size of the image. The reason is that if the image has, say, resolution 640 x 480, then the histogram entries sum to 307,200. But if the image is only one-quarter that size, or 320 x 240, the sum is only 76,800. Division by the total pixel count removes this difference. In fact, the normalized histograms can be viewed as probability density functions (pdfs). The histogram is then stored in the database.

Now suppose we select a "model" image — the new image to match against all possible targets in the database. Its histogram Hm is intersected with all database image histograms Hi, according to the equation

intersection = Σ_{j=1}^{n} min(Hi^j, Hm^j)

Search by color histogram results. (This figure also appears in the color insert section.) Some thumbnail images are from the Corel Gallery and are copyright Corel. All rights reserved

where superscript j denotes histogram bin j, with each histogram having n bins. The closer the intersection value is to 1, the better the images match. This intersection value is fast to compute, but we should note that the intersection value is sensitive to color quantization.
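Putting these pieces together, a sketch of normalized histogram intersection might look as follows (illustrative function names; the histograms can be either the full-resolution or the coarse 3-3-2 version):

    import numpy as np

    def normalize(hist):
        """Convert raw counts into a pdf so that image size drops out of the comparison."""
        return hist.astype(np.float64) / hist.sum()

    def histogram_intersection(h_model, h_target):
        """Sum of bin-wise minima of two normalized histograms; the result lies in [0, 1]."""
        return float(np.minimum(h_model, h_target).sum())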

Color Density

The following figure displays the scheme for specifying color density. The user selects the percentage of the image having any particular color or set of colors, using a color picker and sliders. The user can choose either conjunction (ANDing) or disjunction (ORing) of the simple color-percentage specifications. This is a coarse search method.
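A sketch of how such a color-percentage test might be evaluated (the per-channel tolerance and the exact way colors are compared are assumptions made for illustration):

    import numpy as np

    def color_fraction(rgb, color, tol=30):
        """Fraction of pixels whose RGB value lies within tol of the chosen color in every channel."""
        close = np.all(np.abs(rgb.astype(int) - np.asarray(color)) <= tol, axis=-1)
        return float(close.mean())

    def density_match(rgb, specs, conjunctive=True):
        """specs is a list of (color, required_fraction); AND requires all tests to pass, OR any."""
        tests = [color_fraction(rgb, c) >= p for (c, p) in specs]
        return all(tests) if conjunctive else any(tests)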

Color Layout

The user can set up a scheme of how colors should appear in the image, in terms of coarse blocks of color. The user has a choice of four grid sizes: 1 x 1, 2 x 2, 4 x 4 and 8 x 8. Search is specified on one of the grid sizes, and the grid can be filled with any RGB color value - or no color value at all, to indicate that the cell should not be considered. Every database image is partitioned into windows four times, once for each window size. A clustered color histogram is used inside each window, and the five most frequent colors are stored in the database. Each query cell position and size corresponds to the position and size of a window in the image.
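A rough illustration of the window preprocessing described above (a sketch only: C-BIRD stores the five most frequent colors of a clustered color histogram per window, approximated here by coarse quantization and counting):

    import numpy as np
    from collections import Counter

    def top_colors_per_window(rgb, grid=4, k=5):
        """Split the image into grid x grid windows and keep the k most frequent
        coarsely quantized colors in each window."""
        h, w, _ = rgb.shape
        result = {}
        for i in range(grid):
            for j in range(grid):
                win = rgb[i * h // grid:(i + 1) * h // grid,
                          j * w // grid:(j + 1) * w // grid]
                quant = (win // 32).reshape(-1, 3)            # 3 bits per channel
                counts = Counter(map(tuple, quant))
                result[(i, j)] = [c for c, _ in counts.most_common(k)]
        return result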

Color density query scheme

Texture Layout

Similar to color layout search, this query allows the user to draw the desired texture distribution. Available textures are zero-density texture, medium-density edges in four directions (0°, 45°, 90°, 135°) and combinations of them, and high-density texture in four directions and combinations of them. Texture matching is done by classifying textures according to directionality and density (or separation) and evaluating their correspondence to the texture distribution selected by the user in the texture block layout.

Texture Analysis Details

It is worthwhile considering some of the details of a texture-based content analysis aimed at image search. These details give a taste of the typical techniques systems must employ to work in practical situations.

First, we create a texture histogram. A typical set of indices for characterizing texture is Tamura's. Human perception studies show that "repetitiveness," "directionality," and "granularity" are the most relevant discriminatory factors in human textural perception. Here, we use a two-dimensional texture histogram based on directionality Φ and edge separation ξ, which is closely related to "repetitiveness." Φ measures the edge orientations, and ξ measures the distances between parallel edges.

Color layout grid

To extract an edge map, the image is first converted to luminance Y via Y = 0.299R + 0.587G + 0.114B. A Sobel edge operator is applied to the Y-image by sliding the following 3 x 3 weighting matrices (convolution masks) over the image:

         -1  0  1                1   2   1
    dx:  -2  0  2          dy:   0   0   0
         -1  0  1               -1  -2  -1

If we average around each pixel with these weights, we produce approximations to derivatives.

The edge magnitude D and the edge gradient Φ are given by

D = √(dx² + dy²),        Φ = arctan(dy / dx)

Next, the edges are thinned by suppressing all but maximum values. If a pixel i with edge gradient Φi and edge magnitude Di has a neighbor pixel j along the direction of Φi with gradient Φj ≈ Φi and edge magnitude Dj > Di, then pixel i is suppressed to 0. To make a binary edge image, we set all pixels with D greater than a threshold value to 1 and all others to 0. For edge separation ξ, for each edge pixel i we measure the distance along its gradient Φi to the nearest pixel j having Φj ≈ Φi within 15°. If no such pixel j exists, the separation is considered infinite.
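A compact sketch of the edge-map construction just described, covering the luminance conversion, Sobel filtering, magnitude/orientation, and thresholding (thinning and edge-separation measurement are omitted; the threshold value is an arbitrary placeholder):

    import numpy as np
    from scipy.signal import convolve2d

    SOBEL_DX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    SOBEL_DY = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]], dtype=float)

    def edge_map(rgb, threshold=100.0):
        """Return edge magnitude D, gradient direction phi (degrees), and a binary edge image."""
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        y = 0.299 * r + 0.587 * g + 0.114 * b                      # luminance Y
        dx = convolve2d(y, SOBEL_DX, mode='same', boundary='symm')
        dy = convolve2d(y, SOBEL_DY, mode='same', boundary='symm')
        d = np.hypot(dx, dy)                                       # edge magnitude D
        phi = np.degrees(np.arctan2(dy, dx))                       # edge gradient direction
        return d, phi, (d > threshold).astype(np.uint8)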

Having created edge directionality and edge separation maps, C-BIRD constructs a 2D texture histogram of ξ versus Φ. The initial histogram size is 193 x 180, where separation value ξ = 193 is reserved for a separation of infinity (as well as any ξ > 192). The histogram size is then reduced by a factor of three in each dimension, to size 65 x 60, where joined entries are summed together.

The histogram is "smoothed" by replacing each pixel with a weighted sum of its neighbors and is then reduced again to size 7x 8, with separation value 7 reserved for infinity. At this stage, the texture histogram is also normalized by dividing by the number of pixels in the image segment.

Search by Illumination Invariance

Illumination change can dramatically alter the color measured by camera RGB sensors, from pink under daylight to purple under fluorescent lighting, for example.

To deal with illumination change from the query image to different database images, each color-channel band of each image is first normalized, then compressed to a 36-vector. Normalizing each of the R, G, and B bands of an image serves as a simple yet effective guard against color changes when the lighting color changes. A two-dimensional color histogram is then created using the chromaticity, which is the set of band ratios {R, G}/(R + G + B). Chromaticity is similar to the chrominance in video, in that it captures color information only, not luminance (or brightness).
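A sketch of this normalize-then-chromaticity step (per-band normalization by the band mean is an assumption; the histogram size matches the 128 x 128 figure mentioned in the next paragraph):

    import numpy as np

    def chromaticity_histogram(rgb, bins=128):
        """Normalize each color band, then histogram the chromaticities (r, g) = (R, G)/(R+G+B)."""
        px = rgb.astype(np.float64).reshape(-1, 3)
        px /= px.mean(axis=0, keepdims=True) + 1e-12        # per-band normalization
        total = px.sum(axis=1) + 1e-12
        r, g = px[:, 0] / total, px[:, 1] / total
        hist, _, _ = np.histogram2d(r, g, bins=bins, range=[[0, 1], [0, 1]])
        return hist / max(hist.sum(), 1.0)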

A 128 x 128-bin 2D color histogram can then be treated as an image and compressed using a wavelet-based compression scheme. To further reduce the number of vector components in a feature vector, the DCT coefficients for the smaller histogram are calculated and placed in zigzag order, then all but 36 components are dropped. Matching is performed in the compressed domain by taking the Euclidean distance between two DCT-compressed, 36-component feature vectors.
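A sketch of the compression and matching steps (assuming scipy's dctn for the 2D DCT; the zigzag scan is written out directly, and in practice the histogram would first be reduced by the wavelet step mentioned above):

    import numpy as np
    from scipy.fft import dctn

    def dct_feature(hist2d, n_keep=36):
        """2D DCT of a chromaticity histogram, scanned in zigzag order, truncated to n_keep coefficients."""
        coeffs = dctn(hist2d, norm='ortho')
        h, w = coeffs.shape
        # zigzag: traverse anti-diagonals of the coefficient matrix, alternating direction
        order = sorted(((i, j) for i in range(h) for j in range(w)),
                       key=lambda ij: (ij[0] + ij[1],
                                       ij[1] if (ij[0] + ij[1]) % 2 else ij[0]))
        return np.array([coeffs[i, j] for i, j in order[:n_keep]])

    def match_distance(f1, f2):
        """Euclidean distance between two 36-component feature vectors."""
        return float(np.linalg.norm(f1 - f2))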

Texture layout grid

Search with illumination invariance. Some thumbnail images are from the Corel Gallery and are copyright Corel. All rights reserved

Several of the above types of searches can be done at once by checking multiple checkboxes. This returns a reduced list of images, since the list is the conjunction of all resulting separate return lists for each method.

Search by Object Model

The most important search type C-BIRD supports is the model-based object search. The user picks a sample image and interactively selects a region for object searching. Objects photographed under different scene conditions are still effectively matched. This search type proceeds by the user selecting a thumbnail and clicking the Model tab to enter Object Selection mode. An object is then interactively selected as a portion of the image; this constitutes an object query by example.

The following figure shows a sample object selection. An image region can be selected using primitive shapes such as a rectangle or ellipse, a magic wand tool that is basically a seed-based flooding algorithm, an active contour (a "snake"), or a brush tool, where the painted region is selected. All the selections can be combined with each other using Boolean operations such as union, intersection, or exclusion.

Once the object region is defined to a user's satisfaction, it can be dragged to the right pane, showing all current selections. Multiple regions can be dragged to the selection pane, but only the active object in the selection pane will be searched on. The user can also control parameters such as flooding thresholds, brush size, and active contour curvature.

C-BIRD interface, showing object selection using an ellipse primitive. (This figure also appears in the color insert section.) Image is from the Corel Gallery and is copyright Corel. All rights reserved

Details of the underlying mechanisms of this Search by Object Model are introduced below as an example of a working system. The following figure shows a block diagram of how the algorithm proceeds. First, the user-selected model image is processed and its features are localized (details in the following sections). Color histogram intersection, based on the reduced chromaticity histogram, is then applied as a first "screen." Further steps estimate the pose (scale, translation, rotation) of the object inside a target image from the database. This is followed by verification by intersection of texture histograms and then a final check using an efficient version of a Generalized Hough Transform for shape verification.

Block diagram of object matching steps

A possible model image and one of the target images in the database might be as in the following figure, where the scene in (b) was illuminated with a dim fluorescent light.

Locales in Feature Localization

The Search by Object Model introduced above — finding an object inside a target image — is a desirable yet difficult mechanism for querying multimedia data. An added difficulty is that objects can be photographed under different lighting conditions. Human vision has "color constancy," presumably an invariant processing that allows us to see colors under different lighting as the same. For image indexing, it would be useful to determine instead a covariant processing, one that changes along with the changing light. In that case, we could also aim at recovering the lighting change.

Since object - based search considers objects within an image, we should apply some sort of segmentation to look at regions of objects — say, patches that have about the same color. However, it has been found to be more useful to use a set of rough, possibly overlapping regions (called locales) to express not complete image segmentation but instead a coarser feature localization.

It is worthwhile looking in more detail at this locale-directed search method, which we describe along with the process of feature localization. Since we are interested in lighting change, we also look at a technique to compensate for illumination change, so as to carry out a color covariant search.

Feature Localization versus Image Segmentation

For image segmentation, if Ri is a segmented region:

  1. Ri is usually connected; all pixels in Ri are connected (8-connected or 4-connected).
  2. Ri ∩ Rj = ∅, i ≠ j; regions are disjoint.
  3. ∪_{i=1}^{n} Ri = I, where I is the entire image; the segmentation is complete.

Object retrieval algorithms based on image segmentation permit imprecise regions by allowing a tolerance on the region-matching measure. This accounts for small imprecision in the segmentation but not for over- or under-segmentation, which can be attributed to the pixel-level approach. This works only for simplified images, where object pixels have statistics that are position-invariant.

A coarse localization of image features based on proximity and compactness is likely to be a more effective and attainable process than image segmentation.

Definition: A locale Lf is a local enclosure of feature f. A locale Lf uses blocks of pixels called tiles as its positioning units and has the following descriptors:

  1. Envelope Lf: a set of tiles representing the locality of Lf.
  2. Geometric parameters of the locale, such as its mass and centroid.
  3. Color, texture, and shape parameters of the locale. For example, locale chromaticity, elongation, and locale texture histogram.

    Locales for feature localization

Initially, an image is subdivided into square tiles (e.g., 8 x 8 or 16 x 16). While the pixel is the building unit for image segmentation, the tile is the building unit for feature localization. Tiles group pixels with similar features within their extent and are said to have feature f if enough pixels in them have feature f (e.g., 10%).

Tiles are necessary for good estimation of initial object-level statistics and representation of multiple features at the same location. However, locale geometric parameters are measured in pixels, not tiles. This preserves feature granularity. Hence, feature localization is not merely a reduced-resolution variation on image segmentation.

After a feature localization process, the following can be true:

  1. A locale Lf need not be connected.
  2. Lf ∩ Lg ≠ ∅ is possible; locales need not be disjoint.
  3. ∪f Lf ≠ I in general; the localization need not be complete.

The above figure shows a sketch of two locales for color red and one for color blue. The links represent an association with an envelope, which demonstrates that locales do not have to be connected, disjoint, or complete, yet colors are still localized.

Tile Classification

Before locales can be generated, tiles are first classified as having certain features, for example, red tiles, or red and blue tiles. Since color is most useful for CBIR and is invariant to translations, rotations, and scaling, we will start with color localization, although other features (texture, shape, motion, etc.) can certainly be localized similarly.

Dominant Color Enhancement

To localize on color, we first remove noise and blurring by restoring colors smoothed out during image acquisition. The image is converted from the RGB color space to a chromaticity-luminance color space. For a pixel with color (R, G, B), we define

r = R/(R + G + B),    g = G/(R + G + B),    I = (R + G + B)/3

where the luminance I is separated from the chromaticity (r, g). Clearly, we can also use an approximately illumination-invariant version of color.

Prior to classifying feature tiles, image pixels are classified as having either dominant color or transitional color. Pixels are classified dominant or transitional by examining their neighborhood.

Definition: Dominant colors are pixel colors that do not lie on a slope of color change in their pixel neighborhood. Transitional colors do.

If a pixel does not have a sufficient number of neighbors with similar color values within a threshold, it is considered noise and is also classified as transitional. The uniformity of the dominant colors is enhanced by smoothing the dominant pixels only, using a 5 x 5 averaging filter, with the exception that only dominant pixels having similar color are averaged.

Tile feature list

Tiles have a tile feature list of all the color features associated with the tile and their geometrical statistics. On the first pass, dominant pixels are added to the tile feature list. For each pixel added, if the color is close to a feature on the list within the luminance-chromaticity thresholds, the color and geometrical statistics for the feature are updated. Otherwise, a new color feature is added to the list. This feature list is referred to as the dominant feature list.

On the second pass, all transitional colors are added to the dominant feature list without modifying the color, but updating the geometrical statistics. To determine which dominant feature list node the transitional pixel should merge to, we examine the neighborhood of the transitional pixel and find the closest color that is well represented in the neighborhood. If an associated dominant color doesn't exist, it is necessary to create a second transitional feature list and add the transitional color to it.

The dominant color (ri, gi, Ii) taken on by a transitional pixel tp having color (r, g, I) satisfies the following minimization:

dominant color (ri, gi, Ii)

Smoothing using dominant colors: (a) original image, not smoothed; (b) smoothed image with transitional colors shown in light gray; (c) smoothed image with transitional colors shown in the replacement dominant colors (if possible). The lower row shows detail images

The parameter nc is the number of non-similar colors in the neighborhood of tp. Similar colors are averaged to generate the (ri, gi, Ii) colors. F(ri, gi, Ii) is the frequency of the ith average color, or in other words, the number of similar colors averaged to generate color i. The color that minimizes this equation is the best compromise for dominant color selection for tp, in terms of color similarity and the number of similar colors in the neighborhood. The neighborhood size was chosen to be 5 x 5 in our implementation.

When all pixels have been added to the tiles, the dominant and transitional color feature lists are merged. If a transitional list node is close in color to a dominant list node, the geometrical statistics for the merged node are updated, but only the color from the dominant list is preserved. Otherwise, the nodes from both lists are just concatenated onto the joint list.

Locale Generation

Locales are generated using a dynamic 4 x 4 overlapped pyramid linking procedure. On each level, parent nodes compete for inclusion of child nodes in a fair competition. Image tiles are the bottom-level child nodes of the pyramid, and locales are generated for the entire image when the competition propagates to the top level. The top-level pyramid node has a list of color features with associated envelopes (collections of tiles) and geometrical statistics. Competition on each level is initialized by using a 2 x 2 nonoverlapped linkage structure, where four child nodes are linked with a single parent node.

The LocalesInit initialization proceeds as follows:

PROCEDURE LocalesInit (pseudocode)
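A rough, illustrative reconstruction of what such an initialization could look like (the Node structure, the similarity test, and the mass-weighted merge are assumptions, not C-BIRD's actual pseudocode):

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        """Hypothetical pyramid node: a color feature with its envelope and statistics."""
        color: tuple                      # (r, g, I)
        mass: int                         # number of pixels carrying this feature
        centroid: tuple                   # (x, y) in pixel coordinates
        tiles: set = field(default_factory=set)

    def similar(c1, c2, tol=0.05):
        """Crude color-similarity test in (r, g, I) space; the tolerance is illustrative."""
        return all(abs(a - b) <= tol for a, b in zip(c1, c2))

    def merge(parent, child):
        """Mass-weighted update of color, centroid, envelope, and mass when a child joins a parent."""
        m = parent.mass + child.mass
        parent.color = tuple((parent.mass * p + child.mass * c) / m
                             for p, c in zip(parent.color, child.color))
        parent.centroid = tuple((parent.mass * p + child.mass * c) / m
                                for p, c in zip(parent.centroid, child.centroid))
        parent.tiles |= child.tiles
        parent.mass = m

    def locales_init(children_2x2):
        """Initialize parent-level features from a nonoverlapped 2 x 2 block of child nodes:
        children with similar colors are merged; dissimilar ones start separate features."""
        features = []
        for child in children_2x2:
            for f in features:
                if similar(f.color, child.color):
                    merge(f, child)
                    break
            else:
                features.append(Node(child.color, child.mass, child.centroid, set(child.tiles)))
        return features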

After the pyramid linkage initialization, the competition begins. Since a 4 x 4 overlapped pyramid structure is used, four parents compete for linkage with the child, one of which is already linked to it. This process is illustrated by the Envelope Growing pseudocode:

PROCEDURE EnvelopeGrowing (pseudocode)
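Continuing the sketch above (and reusing its Node and merge helpers), one competition step might be written as follows; the choice of "closest color wins" is an assumption about how the competition is resolved:

    def envelope_growing(child, candidate_parents):
        """One competition step in the 4 x 4 overlapped pyramid: the four overlapping parents
        compete for the child, and the child links to the parent with the closest color."""
        def color_distance(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
        winner = min(candidate_parents, key=lambda p: color_distance(p.color, child.color))
        merge(winner, child)   # mass-weighted update, as in locales_init
        # A full implementation would also remove the child's contribution from the parent
        # it was previously linked to, and iterate until the linkage stabilizes.
        return winner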

Following the pyramidal linking, locales having small mass are removed, since small locales are not accurate enough and are probably either an insignificant part of an object or noise. To increase the efficiency of the search, locales are also sorted according to decreasing mass size.

The color update equation for parent locale j and child locale i at iteration k + 1 is

Color update equation for parent locale j and child locale i at iteration k + 1

and the update equations for the geometrical statistics are

geometrical statistics

The following figure shows how color locales appear for sample model and target images.

Texture Analysis

Every locale is associated with a locale-based texture histogram. Thus a locale-dependent threshold makes more sense in generating the edge map. The threshold is obtained by examining the histogram of the locale edge magnitudes. The texture histogram is smoothed using a Gaussian filter and subsampled to size 8 x 7, then normalized.

The locale-based texture is a more effective measure of texture than a global one, since the locale-dependent thresholds can be adjusted adaptively. The edge maps shown demonstrate that for the lamp and the banana objects, some edge points are missing when using global thresholding, but most of them are present when using locale-based thresholding. To draw the locale-based edge map, the edge pixels generated for any locale are unioned together.

Object Modeling and Matching

Object models in C-BIRD consist of a set of localized features. As shown above, they provide a rich set of statistical measures for later matching. Moreover, their geometric relationships, such as the spatial arrangement of locales, are also extracted. They are best represented using vectors connecting centroids of the respective locales.

The object-search method recovers 2D rigid object translation, scale, and rotation, as well as illumination change. C-BIRD also allows a combination search, where an object search can be combined with other, simpler search types. In that case, the searches are executed in order of decreasing speed. Since object search is the most complex search available, it is executed last, and only on the search results passed on so far by the other search types.

Color locales: (a) color locales for the model image; (b) color locales for a database image. (This figure also appears in the color insert section.)

The object image selected by the user is sent to the server for matching against the locales database. The localization of the submitted model object is considered the appropriate localization for the object, so that image locales need to be found that have a one-to-one correspondence with the model locales. Such a correspondence is called an assignment.

A locale assignment has to pass several screening tests to verify an object match. Screening tests are applied in order of increasing complexity and dependence on previous tests.

Global versus locale-based thresholds: (a) the edge map for the database image using a global threshold; (b) the edge map for the database image using a locale-based threshold

The object match measure Q is formulated as follows:

object match measure Q

where n is the number of locales in the assignment, m is the number of screening tests considered for the measure, Qi is the fitness value of the assignment in screening test i, and Wi are weights that correspond to the importance of the fitness value of each screening test. The Wi can be arbitrary; they do not have to sum to 1. Care has to be taken to normalize the Qi values to lie in the range [0..1], so that they all have the same numerical meaning.

Locales with higher mass (more pixels) statistically have a smaller percentage of localization error. The features are better defined, and small errors average out, so we have higher confidence in locales with large mass. Similarly, assignments with many model locales are preferable to those with few model locales, since the cumulative locale mass is larger and the errors average out.

We try to assign as many locales as possible first, then compute the match measure and check the error using a tight threshold. Locales are removed or changed in the assignment as necessary until a match is obtained. At that point, it is probably the best match measure possible, so it is unnecessary to try other assignments. In this case, all possible permutations of locale assignments do not have to be checked.

In the worst case, when the object model is not present in the search image, we have to test all assignments to determine there is no match. The image locales in the database and the object model locales are sorted according to decreasing mass size.

Matching Steps

The screening tests applied to locales to generate assignments and validate them are:

  • Color-based screening tests (step b)
  • Illumination color covariant screening
  • Chromaticity voting
  • Elastic correlation
  • Estimation of image object pose (step c)
  • Texture support (step d)
  • Shape verification (step e)
  • Recovery of lighting change

The idea of color covariant matching is to realize that colors may change, from model to target, since the lighting may easily change. A diagonal model of lighting change states that the entire red channel responds to lighting change via an overall multiplicative change, as do the green and blue channels, each with their own multiplicative constant.
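Concretely, under the diagonal model a lighting change maps (R, G, B) to (aR, bG, cB) for some per-channel gains (a, b, c), so each pairing of a model locale color with a target locale color implies one candidate set of gains. A minimal sketch with made-up numbers:

    def diagonal_gains(model_color, target_color):
        """Per-channel gains (a, b, c) implied by assigning a model locale color to a target locale color."""
        return tuple(t / m for m, t in zip(model_color, target_color))

    # Example: a model locale with mean RGB (200, 60, 50) paired with a target locale (100, 30, 25)
    # implies gains (0.5, 0.5, 0.5), i.e., the same surface seen under dimmer light.
    print(diagonal_gains((200, 60, 50), (100, 30, 25)))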

Locales vote on the correct lighting change, since each assignment of one model locale color to a target one implies a diagonal lighting shift. Many votes in the same cell of a voting space will imply a probable peak value for lighting change. Using the chromaticity voting scheme, all image locales are paired with all model locales to vote for lighting change values in a voting array.

We can evaluate the feasibility of having an assignment of image locales to model locales using the estimated chromaticity shift parameters by a type of elastic correlation. This computes the probability that there can be a correct assignment and returns the set of possible assignments. Having a candidate set of chromaticity shift parameters, each candidate is successively used to compute the elastic correlation measure. If the measure is high enough (higher than 80%, say), the possible assignments returned by the elastic correlation process are tested for object matching using pose estimation, texture support, and shape verification.

The following figure shows the elastic correlation process applied in the model chromaticity space Ω{r', g'}: the model image has three locale colors at A', B', and C'. All the image locale colors, A, B, C, D, E, and F, are shifted to the model illuminant. Although the locales (A', B', C') and (A, B, C) are supposed to be matching entities, they do not appear at exactly the same location. Instead of a rigid template matching (or correlation) method, we employ elastic correlation, in which the nodes A, B, C are allowed to be located in the vicinity of A', B', C', respectively.

The pose estimation method (step (c)) uses geometrical relationships between locales for establishing pose parameters. For that reason, it has to be performed on a feasible locale assignment. Locale spatial relationships are represented by relationships between their centroids. The number of assigned locales is allowed to be as few as two, which is enough geometry information to drive estimation of a rigid body 2D displacement model with four parameters to recover: x, y translation, rotation R, and scale s.
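For concreteness, here is a minimal sketch of how two centroid correspondences determine the four pose parameters of such a similarity transform (this shows only the underlying geometry, not C-BIRD's actual estimator):

    import math

    def estimate_pose(model_pts, image_pts):
        """Scale s, rotation theta, and translation (tx, ty) from two (model, image) centroid pairs."""
        (p1, p2), (q1, q2) = model_pts, image_pts
        vp = (p2[0] - p1[0], p2[1] - p1[1])
        vq = (q2[0] - q1[0], q2[1] - q1[1])
        s = math.hypot(*vq) / math.hypot(*vp)                        # scale
        theta = math.atan2(vq[1], vq[0]) - math.atan2(vp[1], vp[0])  # rotation
        c, si = math.cos(theta), math.sin(theta)
        tx = q1[0] - s * (c * p1[0] - si * p1[1])                    # translation, from one pair
        ty = q1[1] - s * (si * p1[0] + c * p1[1])
        return s, theta, (tx, ty)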

Elastic correlation in Ω{r', g'}

Results of pose estimation are both the best pose parameters for an assignment and the minimization objective value, which is an indication of how well the locale assignment fits the rigid-body displacement model. If the error is within a small threshold, the pose estimate is accepted.

The texture-support screening test uses a variation of the histogram intersection technique, in which the texture histograms of the locales in the assignment are intersected. If the intersection measure is higher than a threshold, the texture match is accepted.

The final match verification process (step (e)) is shape verification by the method of the Generalized Hough Transform (GHT). The GHT is robust with respect to noise and occlusion. Performing a full GHT search over all possible rotation, scale, and translation parameters is computationally expensive and inaccurate. Such a search is not feasible for large databases.

However, after performing pose estimation, we already know the pose parameters and can apply them to the model reference point to find the estimated reference point in the database image. Hence, the GHT search reduces to a mere confirmation that the number of votes in a small neighborhood around the reference point is indicative of a match. This GHT matching approach takes only a few seconds for a typical search. The reference point used is the model center, since it minimizes voting error caused by errors in edge gradient measurements.

Once we have shape verification, the image is reported as a match, and its match measure Q returned, if Q is large enough. After obtaining match measures Qi for all images in the database, the Qi measures are sorted according to decreasing value. The number of matches can further be restricted to the top k if necessary. An estimate of the correct illumination change follows from correct matches reported.

The following figure shows (a) the GHT voting result for searching the pink book from one of the database images. Darkness indicates the number of votes received, which in turn indicates the likelihood that the object is in the image and at that location. (b) shows the reconstructed edge map for the book. Since the model edge map and the location, orientation, and scale of the object are known now, this reconstruction is entirely automated.

Using the GHT for shape verification: (a) GHT accumulator array image; (b) reconstruction of the detected object using the estimated pose and the GHT template (edge map)

The following figure shows some search results for the pink book in C-BIRD. While C-BIRD is an experimental system, it does provide a proof in principle that the difficult task of search by object model is possible.

Search results for the pink book model

Video Locales

Definition: A video locale is a sequence of image feature locales that share similar features in the spatiotemporal domain of videos.

Like locales in images, video locales have their color, texture, and geometric properties. Moreover, they capture motion parameters, such as motion trajectory and speed, as well as temporal information, such as the lifespan of the video locale and its temporal relationships with respect to other video locales.

Since video proceeds in small time steps, we can also expect to develop new locales from ones already known from previous video frames more easily than simply starting from scratch in each frame.

The following figure shows that predicting locales from previous frames and then refining them (inter-frame) produces little difference compared with generating locales from each image independently (intra-frame), while substantially speeding up locale generation. Video locales provide an effective means toward real-time video object segmentation and tracking.

