The Visual Computer
A unified model for human activity recognition using spatial
distribution of gradients and difference of Gaussian kernel
Dinesh Kumar Vishwakarma
· Chhavi Dhiman
© Springer-Verlag GmbH Germany, part of Springer Nature 2018
Understanding human actions and activities from video data is a growing field that has received rapid attention due to
applications in surveillance, security, entertainment and personal logging. In this work, a new hybrid technique is proposed
for the description of human actions and activities in video sequences. The unified framework yields a robust feature vector
encapsulating both global and local information, strengthening the discriminative depiction of actions. Initially, entropy-based
texture segmentation is used for human silhouette extraction, followed by the construction of average energy silhouette images
(AEIs). An AEI is a 2D binary projection of the human silhouette frames of a video sequence, which reduces the time
complexity of feature vector generation. Spatial Distribution of Gradients (SDGs) are computed at different resolution levels
of the sub-images of the AEI, capturing the overall shape variations of the human silhouette during the activity. Owing to the
scale, rotation and translation invariance of spatio-temporal interest points (STIPs), a vocabulary of Difference of Gaussian
(DoG)-based STIPs is created using vector quantization, which is unique for each activity class. Extensive experiments
are conducted to validate the performance of the proposed approach on four standard benchmarks, i.e., Weizmann, KTH,
Ballet Movements and Multi-view IXMAS. Promising results are obtained when compared with similar state-of-the-art
methods, demonstrating the robustness of the proposed hybrid feature vector against the different challenges posed by the
datasets, such as illumination and view variations.
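The AEI construction summarized in the abstract (averaging the binary silhouette frames of a sequence into a single 2D projection) can be sketched as follows. This is a minimal illustration assuming binary silhouette masks have already been extracted; the function name and threshold value are illustrative, not taken from the paper.

```python
import numpy as np

def average_energy_image(silhouettes, threshold=0.5):
    """Build an average energy image (AEI) from binary silhouette frames.

    silhouettes: sequence of 2D binary masks (H x W), one per frame.
    Returns the per-pixel average occupancy and its binarized 2D projection.
    """
    stack = np.asarray(silhouettes, dtype=np.float64)   # shape (T, H, W)
    energy = stack.mean(axis=0)                          # average occupancy per pixel
    binary = (energy >= threshold).astype(np.uint8)      # 2D binary projection
    return energy, binary

# Toy example: a 2x2 "body" patch shifting right across 3 frames.
frames = [np.zeros((4, 4), dtype=np.uint8) for _ in range(3)]
for t, f in enumerate(frames):
    f[1:3, t:t + 2] = 1
energy, aei = average_energy_image(frames, threshold=0.5)
```

Because an entire sequence collapses into one image, later descriptors (such as the SDGs) are computed once per sequence rather than once per frame, which is where the reduction in time complexity comes from.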
Keywords Human activity recognition · Average energy image (AEI) · Spatial Distribution of Gradients (SDGs) ·
Spatio-temporal interest points (STIP) · Difference of Gaussian (DoG)
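The vocabulary step described in the abstract (vector quantization of DoG-based STIP descriptors into a per-class codebook) can be sketched as below. This assumes STIP descriptors have already been detected and described; plain k-means is used as the quantizer, since the paper's exact clustering settings are not given here, and all names are illustrative.

```python
import numpy as np

def build_vocabulary(descriptors, k, n_iter=20, seed=0):
    """Vector-quantize local STIP descriptors into a k-word vocabulary
    using plain k-means (an illustrative choice of quantizer)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(descriptors, dtype=np.float64)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                 # nearest visual word
        for j in range(k):
            members = X[labels == j]
            if len(members):                          # skip empty clusters
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    """Normalized histogram of visual-word assignments (bag-of-words)."""
    X = np.asarray(descriptors, dtype=np.float64)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(centers))
    return hist / hist.sum()

# Toy usage: two well-separated descriptor clusters, k = 2 words.
X = np.vstack([np.zeros((10, 3)), 5.0 * np.ones((10, 3))])
centers = build_vocabulary(X, k=2)
hist = bow_histogram(X, centers)
```

A codebook built this way per activity class gives each class its own characteristic distribution over visual words, which is what makes the vocabulary discriminative.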
Dinesh Kumar Vishwakarma
Department of Information Technology, Delhi Technological
University, New Delhi 110042, India
Department of Electronics and Communication Engineering,
Delhi Technological University, New Delhi 110042, India

With more videos flourishing on the Internet, intelligent
video analysis applications have drawn substantial attention
from the academic, engineering and multimedia communities.
Human action recognition, being one of the fundamental
tasks in video analytics, provides vital visual cues.
Hence, interest in video-based human action recognition has
been renewed in order to understand human actions better.
Recent surveys [1–4] acknowledge the fact that it is still a
challenging task to develop a discriminative action
representation for realistic videos. This is because of
the presence of different viewpoints, illumination variation,
visual appearance (such as colour and texture of clothing),
scale (due to different human body sizes or distances from
the camera), cluttered backgrounds and varying speeds of action. These
challenges demand a generalized approach to efficient
human action recognition that can adapt to these variations.
A number of frameworks [5–10] have been proposed to model both
global and local features for videos. Global features represent
the human body structure, shape and movements, and help
preserve the spatial and temporal characteristics of an action.
However, global features are too rigid to adapt to variations
of an action due to viewpoint, appearance and occlusion.
Therefore, current research is delving into local and
deep features. At the same time, it is important to acknowledge
the strengths of global features for spatio-temporal
action description, which is why deep features are designed
to extract the spatial and temporal details of an action. Hence,
the integration of multiple features provides a rich description of an action