Research on Video Processing

Video shot segmentation

An automatic threshold detection method for video shot segmentation and classification is proposed in my research. A unified difference (Zn) between frames is defined by integrating the color and gray-level information of the frames. The histogram of Zn is then used for automatic threshold selection. This strategy exploits the advantages, and mitigates the shortcomings, of gray and color histograms for detecting the change from one frame to another. Figs. 1 and 2 present the frame difference evaluations for Zn: Fig. 1 shows the difference between frames using step size 1, and Fig. 2 shows the difference between frames using step size d. The physical meaning of d is the standard number of frames required for a gradual shot transition. This value may vary with different types of video, or it can be set to a default value; in the InsightVideo system, d=10.
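The exact formula that combines the gray and color information into Zn is not reproduced in this summary. The following Python sketch only illustrates the idea, under the assumption that Zn is a weighted sum of a gray-level histogram difference and a color (hue) histogram difference; the weight alpha, the histogram sizes, and the L1 distance are illustrative choices, not the original definition.

```python
import cv2
import numpy as np

def hist_diff(h1, h2):
    """L1 distance between two normalized histograms."""
    return float(np.abs(h1 - h2).sum())

def unified_difference(frame_a, frame_b, alpha=0.5, bins=64):
    """Unified frame difference Zn between two frames.

    Combines a gray-level histogram difference with a color (HSV hue)
    histogram difference.  The weight `alpha`, the histogram spaces and
    the distance measure are assumptions; the original work only states
    that gray and color information are integrated.
    """
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    hg_a = cv2.calcHist([gray_a], [0], None, [bins], [0, 256]).ravel()
    hg_b = cv2.calcHist([gray_b], [0], None, [bins], [0, 256]).ravel()

    hsv_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2HSV)
    hsv_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2HSV)
    hc_a = cv2.calcHist([hsv_a], [0], None, [bins], [0, 180]).ravel()
    hc_b = cv2.calcHist([hsv_b], [0], None, [bins], [0, 180]).ravel()

    # Normalize so the gray and color terms are on a comparable scale.
    for h in (hg_a, hg_b, hc_a, hc_b):
        h /= max(h.sum(), 1e-9)

    return alpha * hist_diff(hg_a, hg_b) + (1 - alpha) * hist_diff(hc_a, hc_b)

def sequence_differences(frames, d=1):
    """Zn for every frame n paired with frame n+d (step size d)."""
    return [unified_difference(frames[n], frames[n + d])
            for n in range(len(frames) - d)]
```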

Since shot cuts produce large differences in Zn between frames, the distribution of these differences helps determine a threshold for detecting shot boundaries automatically. Fig. 3 presents the histogram of the Zn differences between frames (using step size 10). There is a clear separation between the common non-segmentation region and the shot segmentation region (for both break and gradual shots); this separation point is indicated by point D in Fig. 3. A simple rule can be used to detect it: we take the first zero point (scanning from left to right) of the Zn histogram as the shot segmentation threshold. Any frame whose Zn exceeds this threshold may indicate a shot transition.
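A minimal sketch of this threshold selection rule, assuming the Zn values for the whole video have already been computed (the bin count is an arbitrary choice):

```python
import numpy as np

def auto_threshold(z_values, num_bins=100):
    """Pick the shot-segmentation threshold T1 from the histogram of Zn.

    Following the rule described above, scan the Zn histogram from left to
    right and take the first empty bin as the separation point D.
    """
    counts, edges = np.histogram(z_values, bins=num_bins)
    for i, c in enumerate(counts):
        if c == 0:
            return edges[i]          # left edge of the first zero bin
    return edges[-1]                 # degenerate case: no empty bin found
```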

After the threshold T1 has been determined, a binary array O(n) indicates whether the Zn of frame n exceeds T1. The mean and variance of consecutive frames whose Zn is larger than T1 are then used to classify the shot transition into three types: break shot, gradual shot, and camera flash.
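The exact mean/variance decision rules are not reproduced in this summary. The sketch below builds the binary array O(n), groups consecutive above-threshold frames into runs, and applies placeholder rules (and parameter values) for the three transition types; those rules and values are illustrative assumptions only.

```python
import numpy as np

def classify_transitions(z, t1, d=10, flash_max_len=2, var_ratio=0.1):
    """Classify runs of frames whose Zn exceeds T1.

    The rule of thumb used here (very short runs are camera flashes, short
    flat runs are break shots, longer runs are gradual shots) and the
    parameters `flash_max_len` and `var_ratio` are illustrative, not the
    original decision rules.
    """
    o = np.asarray(z) > t1                  # binary array O(n)
    transitions = []
    n = 0
    while n < len(o):
        if not o[n]:
            n += 1
            continue
        start = n
        while n < len(o) and o[n]:
            n += 1
        run = np.asarray(z[start:n], dtype=float)
        mean, var = run.mean(), run.var()
        if len(run) <= flash_max_len:
            label = "camera flash"
        elif len(run) <= d and var <= var_ratio * mean ** 2:
            label = "break shot"
        else:
            label = "gradual shot"
        transitions.append({"start": start, "end": n - 1,
                            "type": label, "mean": mean, "var": var})
    return transitions
```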

Figure 1. Unified frame difference (Zn) between frame n and n+1

Figure 2. Unified frame difference (Zn) between frame n and n+10

Figure 3. Histogram of the unified frame difference (Zn) in Figure 2.

Figure 4. Shot segmentation results

 

Camera motion classification

Motion characterization plays a critical role in content-based video indexing and is an essential step in creating compact video representations automatically. We can imagine the camera as a "narrative eye": camera pans imitate eye movement, either tracking an object or examining a wider view of the scene; freeze frames give the impression that an image should be remembered; close-ups indicate the intensity of impression. There are methods for capturing these impressions compactly based on the camera activity. For example, a mosaic image can represent a panning sequence; a single frame can represent a static sequence; the frames before and after a zoom can represent the zoom sequence; the targeted object can represent a tracking sequence. Thus, an effective characterization of camera motion greatly facilitates video representation, indexing, and retrieval. On the other hand, the camera motions in a shot also help us extract key-frames more efficiently, since camera motion directly implies content change within the shot.

Extensive research has been conducted on extracting camera motion from video by utilizing temporal slices, analyzing optical flow distributions, using transformation models, or applying voting strategies. However, these strategies fail to detect camera rotation. Furthermore, the extracted optical flow or motion vectors may contain considerable noise or error, which significantly reduces their effectiveness. Hence, our research focuses on developing a robust qualitative camera motion classification method. We have found that the statistical information about the mutual relationship between any two motion vectors is relatively robust to noise (see Figure 5). For a given type of camera motion in the current frame, the statistical mutual relationship in the frame shows a distinct distribution tendency. Based on this observation, we propose a qualitative camera motion classification method. In addition to detecting the most common camera motions (pan, tilt, zoom, still), our method can also detect camera rotation efficiently.

Figure 5. Mutual relationship between motion vectors in the MPEG stream

Figure 6. The relationship between camera motion and motion vectors. Columns (a), (b), (c), (d), and (e) show the current P-frame (Pi), the motion vectors in Pi, the succeeding P-frame (Pi+1), the motion vectors in Pi+1, and the 14-bin motion feature vector distribution for (d), respectively. The black blocks among the motion vectors indicate intracoded macroblocks, for which no motion vector is available.

Based on the statistical information about the mutual relationship between motion vectors and the 14-bin motion feature vector constructed for each P-frame, a robust qualitative camera motion classification strategy was developed. Based on the camera motion detected in each shot, a camera motion based video retrieval system was then built.
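The precise definition of the 14 bins and of the classification rules is not given in this summary. The sketch below assumes the mutual relationship is the angle between pairs of motion vectors, quantized into 14 bins, and uses rough illustrative thresholds for the qualitative labels; fully separating zoom from rotation would additionally use the macroblock positions (radial versus tangential vector fields), which is omitted here.

```python
import numpy as np

def motion_feature_vector(mv, num_bins=14, zero_eps=0.5):
    """14-bin feature vector from the mutual relationship of motion vectors.

    `mv` is an (N, 2) array of macroblock motion vectors from one P-frame
    (intracoded blocks excluded).  Here the "mutual relationship" is taken
    to be the angle between pairs of motion vectors, quantized over [0, pi];
    the exact bin semantics of the original 14-bin vector are an assumption.
    """
    mv = np.asarray(mv, dtype=float)
    mag = np.linalg.norm(mv, axis=1)
    moving = mv[mag > zero_eps]                      # ignore near-zero vectors
    hist = np.zeros(num_bins)
    if len(moving) < 2:
        return hist                                  # an (almost) still frame
    unit = moving / np.linalg.norm(moving, axis=1, keepdims=True)
    cosines = np.clip(unit @ unit.T, -1.0, 1.0)      # pairwise cosine of angles
    iu = np.triu_indices(len(moving), k=1)
    angles = np.arccos(cosines[iu])                  # angles in [0, pi]
    idx = np.minimum((angles / np.pi * num_bins).astype(int), num_bins - 1)
    np.add.at(hist, idx, 1)
    return hist / hist.sum()

def classify_camera_motion(mv, zero_eps=0.5):
    """Rough qualitative label for one P-frame (illustrative thresholds only)."""
    mv = np.asarray(mv, dtype=float)
    mag = np.linalg.norm(mv, axis=1)
    if (mag <= zero_eps).mean() > 0.8:               # most blocks barely move
        return "still"
    hist = motion_feature_vector(mv, zero_eps=zero_eps)
    if hist[:2].sum() > 0.7:                         # pairs mostly parallel
        mean_dir = mv[mag > zero_eps].mean(axis=0)
        return "pan" if abs(mean_dir[0]) >= abs(mean_dir[1]) else "tilt"
    if hist[-2:].sum() > 0.3:                        # many opposing pairs
        return "zoom or rotation"
    return "undetermined"
```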

Given any shot in the dataset, we detect the camera motion for each P-frame in the shot. A temporal filtering operation is then applied to eliminate spurious, disordered detections. Finally, a motion histogram of each shot is calculated to establish the motion index for each video. The camera motion based video retrieval is executed as follows.

  • Users input a query motion (e.g., pan, zoom).
  • If any shot in the database has the same camera motion as the query motion, that shot is treated as a result candidate.
  • All candidate shots are ranked by the percentage of frames that fit the query motion: the larger this percentage, the higher the shot ranks in the retrieval result list, as sketched below.
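A small sketch of this ranking step, assuming each shot stores its temporally filtered per-P-frame motion labels (the data layout and shot ids are hypothetical):

```python
from typing import Dict, List, Tuple

def retrieve_by_motion(query_motion: str,
                       shot_motions: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Rank shots for a camera-motion query.

    A shot is a candidate if any of its frames carries the query motion,
    and candidates are ranked by the fraction of frames that fit the query.
    """
    results = []
    for shot_id, labels in shot_motions.items():
        if not labels:
            continue
        fraction = labels.count(query_motion) / len(labels)
        if fraction > 0:
            results.append((shot_id, fraction))
    return sorted(results, key=lambda item: item[1], reverse=True)

# Example with made-up labels:
# retrieve_by_motion("pan", {"shot_01": ["pan", "pan", "still"],
#                            "shot_02": ["zoom", "zoom"]})
# -> [("shot_01", 0.666...)]
```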

Figure 7. Camera motion based video retrieval system

Key-frame extraction

Key-frame(s) summarize the content of a video shot. Other research has addressed automated key-frame extraction using frame differences, clustering, motion information, etc. To extract key-frames with these strategies, the video must be fully decoded. Below, we introduce a threshold-free method that extracts key-frames in the compressed domain. Our method builds on the approach of Wolf (Wayne Wolf, "Key frame selection by motion analysis," in Proc. IEEE ICASSP, pp. 1228-1231, 1996); however, it has several distinguishing elements: (1) our method is executed in the compressed domain (only a very limited number of frames need to be decoded); (2) instead of optical flow, we use the motion vectors from the MPEG video; and (3) instead of a threshold, we use the camera motions in the shot to determine the local maxima or minima.

Our key-frame extraction algorithm is executed in the following steps:
  1. Given any shot Si, use the camera motion classification and temporal motion filter to detect and classify the camera motions in the shot, as shown in Figure 8.
  2. Find the representative frame(s) for each type of motion (see Figure 8). The collection of all representative frames is taken as the key-frames for Si.
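The representative-frame rules for each motion type are summarized in Figure 8 and not repeated here. The sketch below uses plausible stand-in choices (the middle frame of a still segment, both endpoints of a pan/tilt/rotation, the frames before and after a zoom); these choices are assumptions, not the exact rules of Figure 8.

```python
def extract_key_frames(motion_segments):
    """Pick representative frames from the camera-motion segments of one shot.

    `motion_segments` is a list of (label, start_frame, end_frame) tuples
    produced by the camera-motion classifier and temporal filter.
    """
    key_frames = set()
    for label, start, end in motion_segments:
        if label == "still":
            key_frames.add((start + end) // 2)      # one frame is enough
        elif label in ("pan", "tilt", "rotation"):
            key_frames.update((start, end))         # both ends of the sweep
        elif label == "zoom":
            key_frames.update((start, end))         # before and after the zoom
        else:
            key_frames.add((start + end) // 2)
    return sorted(key_frames)

# Example: a shot that starts static, then pans, then zooms in.
# extract_key_frames([("still", 0, 40), ("pan", 41, 90), ("zoom", 91, 130)])
# -> [20, 41, 90, 91, 130]
```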

Figure 8. Camera motion based key-frame extraction strategy

Figure 9. Key-frame extraction results. (C) shows a sampling of the shot with a 15-frame step size, from top left to bottom right; (B) shows the results of another method; (A) shows the results of our method.

Video group and scene detection

Generally, videos can be represented using a hierarchy of five levels (video, scene, group, shot, and key-frame), increasing in granularity from top to bottom. Detecting video groups and scenes helps us acquire the video table of contents. However, since groups and scenes are semantically defined units, there is still a long way to go before satisfactory results can be obtained for semantic video unit detection. Nevertheless, our research has shown that by integrating all available information related to the video, good results can still be achieved. Please see the video data mining section for details.