A framework for flexible summarization of racquet sports video using
multiple modalities
Chunxi Liu a, Qingming Huang a,b,*, Shuqiang Jiang b, Liyuan Xing c, Qixiang Ye a, Wen Gao d
a Graduate University of Chinese Academy of Sciences, No. 19, Yuquan Road, Shijingshan District, Beijing 100049, PR China
b Key Lab of Intell. Info. Process., Inst. of Comput. Tech., Chinese Academy of Sciences, No. 6, Kexueyuan South Road Zhongguancun, Haidian District, Beijing 100190, PR China
c Centre of Quantifiable Quality of Service, Norwegian University of Science and Technology, O.S. Bragstads plass 2E, Trondheim N-7491, Norway
d Peking University, No. 5, Summer Palace Road, Haidian District, Beijing 100871, PR China
article info
Article history:
Received 26 September 2007
Accepted 18 August 2008
Available online 29 August 2008
Keywords:
Sports video summarization
Scene segmentation
Temporal voting strategy
Highlight ranking
abstract
While most existing sports video research focuses on detecting events in soccer, baseball and similar
sports, little work has addressed flexible content summarization of racquet sports video, e.g. tennis
and table tennis. By taking advantage of the periodicity of video shot content and audio keywords in
racquet sports video, we propose a novel flexible video content summarization framework. Our approach
combines a structure event detection method with a highlight ranking algorithm. First, unsupervised
shot clustering and supervised audio classification are performed to obtain the visual and audio
mid-level patterns, respectively. Then, a temporal voting scheme for structure event detection is proposed
that exploits the correspondence between audio and video content. Finally, using affective features
extracted from the detected events, a linear highlight model is adopted to rank the detected events by
how exciting they are. Experimental results show that the proposed approach is effective.
© 2008 Elsevier Inc. All rights reserved.
1. Introduction
Sports video plays an important role in our daily life and has a
wide range of audiences. On the one hand, a large volume of sports
video data is produced every day. On the other hand, sports
video content is redundant and the highlights in the video
are sparse. Many people have no time to watch the whole game
or just want to see the highlights. Therefore, from the users’
perspective, it is necessary to develop a system that automatically
analyzes sports video and generates highlight summaries
so that the audience can browse what they want.
Because of the great commercial potential of sports video
analysis, a lot of research has been devoted to it. Existing
work on sports video content summarization can be divided into
two classes: event detection and highlight summarization.
Single-modal features, such as image or audio, and multi-modal features
that combine image, audio and text are employed for
these tasks. In the following, we review the existing work
on each of these two tasks.
For event detection, many methods have been proposed.
We review them according to the features they use,
which range from a single modality to multiple modalities. Some previous
work employed single-modal features, such as image or audio,
for event detection. For example, for the image modality, Gong
et al. [1] used player, ball, line mark and motion features to detect
special events in soccer programs. Xie et al. [2] proposed a method
to segment soccer video into play and break segments for content
abstraction by using a Hidden Markov Model (HMM), where video
dominant color and motion activity were extracted as low-level
features. In [3], cinematic features such as shot type and replays, together with
object features, were integrated into a Bayesian Network classifier
to identify goal events in broadcast soccer video. For the audio
modality, Rui et al. [4] used the announcers’ speech pitch and baseball
batting sounds to detect exciting segments in baseball games. Xu
et al. [5] built audio keywords for event detection in soccer video.
Xiong et al. [6] proposed a unified framework to extract highlights
from baseball, golf and soccer by detecting cheering and applause.
The content of sports video is intrinsically multi-modal: each
modality plays a different role and can compensate for the limitations
of the other modalities. Therefore, integrating multiple modalities in
one framework has become a direction for event detection in recent years,
and many multi-modal approaches have been proposed. Snoek
and Worring [7] categorized multi-modal approaches as simultaneous
or sequential in terms of content segmentation, statistical or
knowledge-based in terms of classification method, and iterated or
non-iterated in terms of processing cycle. It is also mentioned that
1077-3142/$ - see front matter © 2008 Elsevier Inc. All rights reserved.
doi:10.1016/j.cviu.2008.08.002
This work is supported by the National Hi-Tech Development Program (863
Program) of China under Grant 2006AA01Z117 and the National Natural Science
Foundation of China under Grants 60773136 and 60702035.
* Corresponding author. Address: Key Lab of Intell. Info. Process., Inst. of Comput.
Tech., Chinese Academy of Sciences, No. 6, Kexueyuan South Road Zhongguancun,
Haidian District, Beijing 100190, PR China.
E-mail addresses: cxliu@jdl.ac.cn (C. Liu), qmhuang@jdl.ac.cn (Q. Huang).
Computer Vision and Image Understanding 113 (2009) 415–424