Querying on Videos

Video indexing can make use of motion as the salient feature of temporally changing images for various types of queries.

Since temporality is the main difference between a video and a mere collection of images, dealing with the time component is first and foremost in comprehending the indexing, browsing, search, and retrieval of video content. A direction taken by the QBIC group is a new focus on storyboard generation for automatic understanding of video, the so-called "inverse Hollywood" problem. In production of a video, the writer and director start with a visual depiction of how the story proceeds. In a video understanding situation, we would ideally wish to regenerate this storyboard as the starting place for comprehending the video.

The first place to start, then, would be dividing the video into shots, where each shot consists roughly of the video frames between the on and off clicks of the Record button. However, transitions are often placed between shots (fade-in, fade-out, dissolve, wipe, and so on), so detecting shot boundaries may not be as simple as it is for abrupt changes.

Generally, since we are dealing with digital video, we would like to avoid decompressing MPEG files wherever possible, to speed throughput. Therefore, researchers try to work on the compressed video. A simple approach to this idea is to decompress just enough to recover the DC term of each 8 × 8 block, generating a thumbnail with 64 times fewer pixels than the original. Since we must consider P- and B-frames as well as I-frames, even generating a good approximation of the best DC image is itself a complicated problem.
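To see where the factor of 64 comes from, recall that the DC coefficient of an 8 × 8 DCT block is proportional to the mean of the block's 64 pixels. The following is a minimal sketch of that idea only: it assumes an already decoded grayscale frame stored as a list of rows, whereas a real system would read the DC terms from the compressed stream and would have to handle P- and B-frames. The function name `dc_thumbnail` is a hypothetical choice for illustration.

```python
def dc_thumbnail(frame, block=8):
    """Return one value (the block mean) per block x block tile of a frame.

    `frame` is a grayscale image given as a list of equal-length rows.
    Partial tiles at the right and bottom edges are ignored.
    """
    height, width = len(frame), len(frame[0])
    thumb = []
    for by in range(0, height - height % block, block):
        row = []
        for bx in range(0, width - width % block, block):
            total = sum(frame[y][x]
                        for y in range(by, by + block)
                        for x in range(bx, bx + block))
            row.append(total / (block * block))
        thumb.append(row)
    return thumb


# Synthetic 16 x 16 frame: left half dark (10), right half bright (200).
frame = [[10] * 8 + [200] * 8 for _ in range(16)]
thumb = dc_thumbnail(frame)
print(thumb)  # [[10.0, 200.0], [10.0, 200.0]]
```

The 256-pixel frame collapses to 4 values, one per block, which is the 64-fold reduction described above.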

Once DC frames are obtained from the whole video (or, even better, are obtained on the fly), many approaches have been used for finding shot boundaries. Features used have typically been color, texture, and motion vectors, although such concepts as trajectories traversed by objects have also been used.
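As one illustration of a color-based approach, the sketch below compares intensity histograms of consecutive frames and reports a boundary wherever the difference exceeds a threshold. This is a common baseline rather than the method of any particular system mentioned here; the bin count, the threshold of 0.5, and the function names are assumptions made for the example. It detects abrupt cuts only: a fixed threshold of this kind will generally miss gradual transitions such as dissolves.

```python
def histogram(frame, bins=8, max_value=256):
    """Normalized intensity histogram of a frame given as a list of rows."""
    counts = [0] * bins
    for row in frame:
        for value in row:
            counts[min(int(value) * bins // max_value, bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def shot_boundaries(frames, threshold=0.5):
    """Indices i where frame i starts a new shot (abrupt cuts only)."""
    boundaries = []
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        # Half the L1 distance between normalized histograms lies in [0, 1].
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / 2
        if diff > threshold:
            boundaries.append(i)
        prev = cur
    return boundaries


# Tiny synthetic "DC frames": three dark frames followed by two bright ones.
dark = [[20, 25], [30, 22]]
bright = [[220, 230], [210, 225]]
frames = [dark, dark, dark, bright, bright]
print(shot_boundaries(frames))  # [3]
```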

Shots are grouped into scenes. A scene is a collection of shots that belong together and that are contiguous in time. Even higher-level semantics exist in so-called "film grammar". Semantic information such as the basic elements of the story may be obtainable. These are (at the coarsest level) the story's exposition, crisis, climax, and denouement.

Audio information is important for scene grouping. In a typical scene, the audio has no break, even though many shots may take place over the course of the scene. General timing information from movie creation may also be brought to bear.
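One way this observation could be used is sketched below: a shot boundary starts a new scene only if it coincides with a break in the audio. This is a simplified illustration under stated assumptions, not a description of a published method. The shot start times and audio break times are assumed to come from earlier detection stages, and the function name and the 0.5-second tolerance are hypothetical.

```python
def group_shots_into_scenes(shot_starts, audio_breaks, tolerance=0.5):
    """Group shot indices into scenes.

    shot_starts:  sorted start times (seconds) of each shot.
    audio_breaks: times (seconds) at which an audio break was detected.
    A shot opens a new scene when its start time lies within `tolerance`
    seconds of an audio break; otherwise it joins the current scene.
    """
    scenes = [[0]]
    for idx in range(1, len(shot_starts)):
        boundary = shot_starts[idx]
        if any(abs(boundary - t) <= tolerance for t in audio_breaks):
            scenes.append([idx])
        else:
            scenes[-1].append(idx)
    return scenes


shot_starts = [0.0, 4.0, 9.5, 15.0, 21.0]
audio_breaks = [9.4, 30.0]
print(group_shots_into_scenes(shot_starts, audio_breaks))
# [[0, 1], [2, 3, 4]]
```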

Text may indeed be the most useful means of delineating shots and scenes, making use of closed-captioning information already available. However, text cannot always be relied on, since it may not exist, especially for legacy video.

Different schemes have been proposed for organizing and displaying storyboards reasonably succinctly. The most straightforward method is to display a two-dimensional array of keyframes. Just what constitutes a good keyframe has of course been subject to much debate. One approach might be to simply output one frame every few seconds. However, action has a tendency to occur between longer periods of inactive story. Therefore, some kind of clustering method is usually used, so that a single keyframe represents a longer stretch of time over which the content stays more or less the same.
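The difference between fixed-interval sampling and clustering can be seen in a small sketch. It assumes each frame has already been reduced to a feature vector (here, a hypothetical mean RGB color), groups consecutive frames that stay close to the first frame of their segment, and returns the middle frame of each segment as its keyframe. The function name, the Euclidean distance, and the threshold are illustrative assumptions; practical systems use richer features and clustering schemes.

```python
import math


def select_keyframes(features, threshold=30.0):
    """Return one keyframe index per run of similar consecutive frames.

    A new segment starts when a frame's feature vector is farther than
    `threshold` from the first frame of the current segment. The keyframe
    is the middle frame of each segment.
    """
    keyframes = []
    start = 0
    for i in range(1, len(features) + 1):
        if i == len(features) or math.dist(features[i], features[start]) > threshold:
            keyframes.append((start + i - 1) // 2)
            start = i
    return keyframes


# 6 static frames, 2 frames of brief action, then 4 static frames.
features = [(10, 10, 10)] * 6 + [(200, 50, 50)] * 2 + [(20, 180, 40)] * 4

print(select_keyframes(features))          # [2, 6, 9]
print(list(range(0, len(features), 4)))    # [0, 4, 8]
```

Sampling every fourth frame picks two keyframes from the first static segment and none from the brief action in frames 6 and 7, while the segment-based selection yields one keyframe for each of the three segments.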

Some researchers have suggested using a graph-based method. Suppose we have a video of two talking heads, the interviewer and the interviewee. A sensible representation might be a digraph with directed arcs taking us from one person to the other, then back again. In this way, we can encapsulate much information about the video's structure and also have available the arsenal of tools developed for graph pruning and management.
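A minimal sketch of such a digraph for the talking-heads example follows. It assumes each shot has already been assigned a label (for instance by clustering visually similar shots); the labels, the function name, and the choice to store a transition count on each arc are assumptions made for illustration.

```python
from collections import defaultdict


def transition_graph(shot_labels):
    """Build a digraph as {source: {target: count}} from a shot sequence.

    Each node is a shot label; an arc from A to B records how often a shot
    labeled A is immediately followed by a shot labeled B.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst in zip(shot_labels, shot_labels[1:]):
        if src != dst:
            graph[src][dst] += 1
    return {src: dict(dsts) for src, dsts in graph.items()}


shots = ["interviewer", "interviewee", "interviewer",
         "interviewee", "interviewer"]
print(transition_graph(shots))
# {'interviewer': {'interviewee': 2}, 'interviewee': {'interviewer': 2}}
```

Five shots reduce to two nodes and two arcs, which is the kind of compact structural summary the graph representation is meant to provide.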

Other "proxies" have also been developed for representing shots and scenes. A grouping of sets of keyframes may be more representative than just a sequence of keyframes, as may keyframes of variable sizes. Annotation by text or voice of each set of keyframes in a "skimmed" video may be required for sensible understanding of the underlying video.

A mosaic of several frames may be useful, wherein frames are combined into larger ones by matching features over a set of frames. This results in a set of larger keyframes that are perhaps more representative of the video.

An even more radical approach to video representation involves selecting (or creating) a single frame that best represents the entire movie! This could be based on making sure that people are in the frame, that there is action, and so on.

By taking into account skin color and faces, the algorithm increases the likelihood of the selected keyframe including people and portraits, such as close-ups of movie actors, thereby producing interesting keyframes. Skin color is learned using labeled image samples. Face detection is performed using a neural net.

The following figure (a) shows a selection of frames from a video of beach activity. Here, the keyframes in (b) are selected based mainly on color information (but being careful with respect to the changes incurred by changing illumination conditions when videos are shot).

Digital video and associated keyframes, beach video: (a) frames from a digital video; (b) keyframes selected


A more difficult problem arises when changes between shots are gradual and when colors are rather similar overall. The keyframes in the following figure (b) are sufficient to show the development of the whole video sequence.

Other approaches attempt to deal with more profoundly human aspects of video, as opposed to lower-level visual or audio features. Much effort has gone into applying data mining or knowledge-base techniques to classifying videos into such categories as sports, news, and so on, and then subcategories such as football and basketball.
