Results for TCSVT

Introduction:
This page presents the latest results for our submission to TCSVT.

* Note that the TV rights of the soccer video used in the following tests belong to MEDIAPRO, MP, Spain. All videos on this page are used only for research purposes within the APIDIS project, through online streaming. Downloading or redistributing the videos is not allowed.

* Note that the TV rights of the basketball and volleyball videos used in the following tests belong to their respective producers. All videos on this page are used only for research purposes, through online streaming.

0. Overall Framework and Workflow
*Explanation: Interpret summarization as a resource allocation problem.

1. Meta-data Acquisition
Replay Logo Detection

Shot-boundary Detection
We first give some results on the performance of our shot-boundary detector, by comparing it to the method in Ref.[14]. Since these two methods are controlled by different sets of parameters, improperly selected values of parameters might lead to biased comparative results. We therefore collect detection results by evenly sampling the values of parameters in a rather wide range of parameter space, and plot their corresponding positive and false alarm rates in the following figure, where each point represents one sample of parameter values.

If a shot-boundary is detected within the neighborhood of a ground-truth shot-boundary, we regard this ground-truth shot-boundary as successfully detected. Otherwise, the detection is regarded as a false alarm. From a part of the video with 33776 frames, we manually located all 189 shot-boundaries, which include 19 replay-logos. We therefore compute the positive alarm rate based on 170 shot-boundaries, and the false alarm rate based on 33587 non shot-boundary frames. The maximum neighborhood size around each ground-truth shot-boundary defines the bias tolerance threshold. In the following figure, results under two bias tolerance thresholds, i.e., 4 and 9 frames, are depicted. These correspond respectively to a maximum error of 0.16 and 0.36 second in a 25-FPS DVD video.
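
The evaluation protocol above can be sketched as follows. This is only a minimal illustration, not the code used to produce the figure: the function names are hypothetical, and the matching rule (a boundary counts as detected if any detection lies within the tolerance, in frames) is our assumption about how "within the neighborhood" is applied.

```python
def evaluate_detector(detections, ground_truth, non_boundary_frames, tolerance):
    """Return (positive alarm rate, false alarm rate) for one parameter sample.

    detections, ground_truth: frame indices of detected / annotated boundaries.
    non_boundary_frames: number of frames that are not annotated boundaries.
    tolerance: bias tolerance threshold, in frames.
    """
    # A ground-truth boundary is detected if some detection lies within
    # `tolerance` frames of it (assumed matching rule).
    hits = sum(
        1 for t in ground_truth
        if any(abs(d - t) <= tolerance for d in detections)
    )
    # A detection far from every ground-truth boundary is a false alarm.
    false_alarms = sum(
        1 for d in detections
        if not any(abs(d - t) <= tolerance for t in ground_truth)
    )
    return hits / len(ground_truth), false_alarms / non_boundary_frames


def tolerance_in_seconds(tolerance, fps=25.0):
    """Convert a bias tolerance from frames to seconds."""
    return tolerance / fps
```

With the counts given above, `ground_truth` would hold the 170 scored shot-boundaries and `non_boundary_frames` would be 33776 - 189 = 33587. Tolerances of 4 and 9 frames at 25 FPS give 0.16 and 0.36 second. How detections near the 19 replay-logos are handled is not covered by this sketch.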

As shown in the following figure, our method significantly improves the detection rate over the method in Ref. [14] under the same false alarm rate, in both peak and average performance. Because their position is ambiguous, shot-boundaries at smooth scene transitions are more sensitive to changes in the bias tolerance threshold. Therefore, from the fact that the positive alarm rate of our method increases more when the bias tolerance threshold goes higher, we infer that the extra performance gain mainly comes from detecting shot-boundaries at smooth scene transitions. We recommend using a low threshold to maximize the detection rate, which usually leads to over-segmentation of videos. A common cause of such spurious shot-boundaries is that the video is undergoing a long, slow transition between stories. However, our summarization even benefits from this over-segmentation, in the sense that it can provide a finer organization of summaries by cutting a long and slowly evolving story into several shorter clips.
Comparative performance evaluation of our shot-boundary detector against the method in Ref. [14] (denoted as the previous method).

2. Video segmentation based on view type structure
*Explanation: If there is no far view included in the above segment, this segment will be called a dependent segment, and will be merged into the previous one.
The first video player includes all segments in the first half of the soccer game, and the second one includes all segments for the second half of the game.
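
The merging rule in the explanation above can be sketched as follows. This is a minimal illustration under our own assumptions: each segment is represented as a list of view-type labels (`"far"`, `"close"`, `"replay"`), which is a hypothetical data layout, and a dependent segment at the very start of the video is kept as it is because there is no previous segment to merge it into.

```python
def merge_dependent_segments(segments):
    """Merge every segment without a far view into the segment before it.

    segments: list of segments, each a list of view-type labels.
    """
    merged = []
    for shots in segments:
        has_far_view = any(view == "far" for view in shots)
        if has_far_view or not merged:
            # Independent segment (or nothing to merge into yet): keep it.
            merged.append(list(shots))
        else:
            # Dependent segment: append its shots to the previous segment.
            merged[-1].extend(shots)
    return merged
```

For example, `[["far", "close"], ["close", "replay"], ["far"], ["replay"]]` becomes two segments, because the second and fourth segments contain no far view.
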
Soccer
a) First Half b) Second Half

Volleyball
a) First Set b) Second Set c) Third Set
d) Fourth Set e) Fifth Set

Basketball
a) First Quarter b) Second Quarter c) Third Quarter
d) Fourth Quarter

3. Local Story Organization and Evaluation
*Explanation: Solve this resource allocation problem by Lagrange relaxation.
a) Basic benefit from expansion of clip interest within each segment
b) Extra benefit from local story organization
c) Different parameters lead to different candidate sets of local stories
Extra Experimental Results on Volleyball

4. Global Story Generation
*Explanation: Solve this resource allocation problem by Lagrange relaxation.
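
To make the idea concrete, here is a minimal sketch of solving a resource allocation problem by Lagrangian relaxation. It is not the formulation used in our system: the data layout (one list of `(duration, benefit)` candidates per segment), the function name, and the bisection on the multiplier are all illustrative assumptions. For a fixed multiplier, each segment independently picks the candidate maximizing benefit minus multiplier times duration, and the multiplier is then tuned until the total duration fits the budget.

```python
def select_stories(candidates, budget, iterations=50):
    """Pick one (duration, benefit) candidate per segment within `budget`.

    candidates: list of segments, each a list of (duration, benefit) pairs.
    Include (0, 0) in a segment if it may be left out of the summary.
    """
    if sum(min(d for d, _ in opts) for opts in candidates) > budget:
        raise ValueError("budget is too small for any selection")

    def solve(lam):
        # For a fixed multiplier the problem decouples across segments.
        choice = [max(opts, key=lambda o: o[1] - lam * o[0]) for opts in candidates]
        return choice, sum(d for d, _ in choice)

    choice, used = solve(0.0)
    if used <= budget:
        return choice  # the unconstrained optimum already fits

    lo, hi = 0.0, 1.0
    while solve(hi)[1] > budget:  # grow hi until the budget is met
        hi *= 2.0
    for _ in range(iterations):  # bisect on the multiplier
        mid = (lo + hi) / 2.0
        if solve(mid)[1] > budget:
            lo = mid
        else:
            hi = mid
    return solve(hi)[0]
```

Note that this kind of relaxation only reaches selections on the convex hull of the benefit/duration trade-off, so the returned selection may leave part of the budget unused.
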
5. Comparison between Key-frame based summarization and our summarization system

* To show the benefit of our proposed framework in dealing with temporal biases in annotations, we borrowed some results from our paper accepted by ICME 2010, where automatically detected hotspots are used instead of accurate manual annotations.

* Naive key-frame based summarization used here: We apply a Gaussian RBF Parzen window around each annotation, and set the interest of each frame (which is in fact a one-second frame slot) to the maximum response over all RBF kernels. We then sort the frames and select them in decreasing order of interest until the required summary length is reached.

* Results on 10 Minutes. In the following figure, the first row of each graph shows how the segment is organized in close, far, and replay views. The second row shows the manual annotation of the segment, while red bars denote the automatically detected audio hot spots. Finally, the third row identifies the frames selected for inclusion in the summary by the naive and the proposed strategy, respectively.

* Note that NO post-processing (such as completion of replay logos) has been applied to the summaries in this sub-section.

* Our system can correct displaced annotations.

Corresponding Videos (Gray for unselected frames and Color for selected frames).

* Our system will not have unmatched replays.

Corresponding Videos (Gray for unselected frames and Color for selected frames).

6. New Results between Four Summarization Methods

Specifically, we compared the performance of our proposed method to that of the following three methods:

*I) a naive key-frame filter, which is a rudimentary system that simply extracts key frames around pre-specified hot-spots. More specifically, this naive method applies a Gaussian RBF Parzen window ($\sigma$ being the standard deviation) around each hot-spot annotation, and sets the interest of each one-second slot to the maximum response resulting from the multiple annotations surrounding the slot. It then selects the slots in decreasing order of interest until the length constraint is reached.

*II) the method of key-frame extraction proposed by Liu, referred to as LIU 2007. The main idea of this method is to find a set of key-frames which minimizes the error of reconstructing the source video from these extracted key-frames. Dynamic programming is used to locate both the key-frames and the shot-boundaries related to each key-frame.

*III) the method of shot selection proposed by Lu, referred to as LU 2004. An optimal summary is found by maximizing the accumulated mutual distance between all pairs of consecutive shots in the summary, subject to a length constraint. The mutual distance is evaluated from two aspects, i.e., their histogram difference and their temporal distance.
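
The naive key-frame filter in I) is simple enough to sketch in a few lines. The following is a minimal illustration, not the exact code used in the experiments: the function name, the argument names, and the tie-breaking rule (equal interests keep chronological order) are our assumptions, and hot-spots are given as slot indices in seconds.

```python
import math


def naive_keyframe_filter(hotspots, num_slots, sigma, summary_length):
    """Return the sorted indices of the one-second slots kept in the summary.

    hotspots: slot indices of the hot-spot annotations.
    sigma: standard deviation of the Gaussian RBF kernel, in slots.
    summary_length: number of one-second slots allowed in the summary.
    """
    def interest(slot):
        # Maximum Gaussian response over all hot-spot annotations.
        return max(
            (math.exp(-((slot - h) ** 2) / (2.0 * sigma ** 2)) for h in hotspots),
            default=0.0,
        )

    # Rank slots by decreasing interest; the sort is stable, so slots with
    # equal interest stay in chronological order.
    ranked = sorted(range(num_slots), key=lambda s: -interest(s))
    # Keep the top slots and restore temporal order for playback.
    return sorted(ranked[:summary_length])
```

For instance, with hot-spots at slots 10 and 50, `sigma=2` and a six-slot budget, the filter keeps slots 9 to 11 and 49 to 51, i.e., a short window centered on each hot-spot regardless of the view-type structure around it.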

Experimental Setup:
To investigate the robustness against biases/errors of event annotations, we use automatically detected hotspots from audio commentaries instead of the manual annotations. (Explanations of audio hot spot detection can be found in our ICME 2010 paper.) To help analyze the resultant summary, each key-frame is represented by a one-second temporal slot instead of a single frame in the key-frame extraction methods, while the computation for each key-frame is performed on the first frame of its 1s slot. From a 1000-second portion of the source video, i.e., from 1300s to 2300s, each method is asked to organize a 150-second summary.

How to read the following graph:
Those resultant summaries are plotted in the following figure. In this figure,
*The first row of each graph defines how the segment is organized in close, far, and replay views.
*The second row defines the manual annotation of the segment, while red vertical bars denote the automatically detected audio hot spots.
*Finally, the following four rows identify the temporal occupancy of the summaries generated by the four tested methods.

Corresponding Videos:
The videos corresponding to those segments are given below. In the first column, we show the produced summary. In the second column, grey-scale images are used when the frame is not included in the summary, while color images indicate inclusion in the summary. In the third column, we show the summary generated by the proposed method with the required post-processing (e.g., replay logos), to illustrate visually comfortable story-telling.

Proposed RA method, selected parts only Proposed RA method, the whole video with selection status Proposed RA method, final results with post-processing

Naive KF Filter, selected parts only Naive KF Filter, the whole video with selection status

LIU 2007, selected parts only LIU 2007, the whole video with selection status

LU 2004, selected parts only LU 2004, the whole video with selection status

Major observations:

1.) Methods "LIU 2007" and "LU 2004" use no information on semantic events. In order to find all representative key-frames that minimize the reconstruction error, key-frames selected by "LIU 2007" are evenly distributed over the whole video. Since close-up views, medium views and replays exhibit more histogram variance than far-view grasslands, "LU 2004" favors those types of shots over far views. However, in team sport videos, far views are essential for the audience to understand the complexity of the teamwork. Neither of these two methods reflects the relative importance of highlighted semantic events in its produced summaries. To provide personalized content that satisfies various semantic user preferences, we prefer to use manually or automatically extracted annotations of semantic events.

2.) Both "LIU 2007" and "LU 2004" penalize continuous content. Results of "LIU 2007" usually consist of short, discontinuous key-frames. Since the mutual distance defined in "LU 2004" is independent of the shot length, "LU 2004" also favors including more short shots over including fewer long shots, so that the accumulated mutual distance can be maximized. However, continuity is important in telling a fluent story. Frequent switching between short clips leads to very annoying visual artifacts in the corresponding videos. In contrast, our proposed method and the naive key-frame filter allow continuous content under certain parameter settings. The proposed framework further considers the role of replays and different view-types in story-telling to satisfy various narrative user preferences.

3.) The proposed method has improved robustness against temporal biases of (automatically detected) annotations. Two examples further demonstrate the benefit arising from the intelligent local story organization considered by our proposed framework. The first example (1900s-1940s) corresponds to a case in which the audio hot-spot instant is somewhat displaced compared to the action of interest. As a consequence, the naive key-frame filtering system ends up selecting frames that do not show the first foul action. In contrast, because it assigns clip interests according to view-type structure analysis, our system shows both fouls of the segment plus the replay of the second one. In the second example (2260s-2300s), the naive system renders the replay of the action that precedes the action of interest, causing a disturbing story-telling artifact. In contrast, as a result of its underlying view-type analysis and associated segmentation, our system restricts the rendering period to the segment of interest, and allocates the remaining time resources to another segment.

All these clearly illustrate the benefit of our segment-based resource allocation framework.

7. Results for Subjective Evaluation of Artifacts (Public Events Suppressed)
Three Minutes, Fewer Replays/Close Views, More Events Three Minutes, More Replays/Close Views, Fewer Events Twelve Minutes, Fewer Replays/Close Views, More Events

8. Extra Results for Basketball Videos
Three Minutes Six Minutes Nine Minutes

9. Extra Results for Volleyball Videos
Three Minutes Six Minutes Nine Minutes

10. Extra Results on the Final Generated Summaries
Result 1: Summaries of Different Lengths with Different Ratios between Game Relevance and Emotional Level
2% (2.5 minutes), Emphasis on Game Evolution 2% (2.5 minutes), Emphasis on Both 2% (2.5 minutes), Emphasis on Emotional Moments
4% (5 minutes), Emphasis on Game Evolution 4% (5 minutes), Emphasis on Both 4% (5 minutes), Emphasis on Emotional Moments
8% (10 minutes), Emphasis on Game Evolution 8% (10 minutes), Emphasis on Both 8% (10 minutes), Emphasis on Emotional Moments
16% (20 minutes), Emphasis on Game Evolution 16% (20 minutes), Emphasis on Both 16% (20 minutes), Emphasis on Emotional Moments

Result 2: Results under different continuity gain Phi
8%(10 minutes), Phi=0.0 8%(10 minutes), Phi=0.1 8%(10 minutes), Phi=0.2
8%(10 minutes), Phi=0.3 8%(10 minutes), Phi=0.4 8%(10 minutes), Phi=0.5

Result 3: Results under different redundancy penalty Gamma
8%(10 minutes), Gamma=0.00 8%(10 minutes), Gamma=0.25 8%(10 minutes), Gamma=0.50

8%(10 minutes), Gamma=0.75 8%(10 minutes), Gamma=1.00

Maintained by Fan.CHEN AT uclouvain.be. Last update 2010-4-19