Context based object categorization: A critical survey
Carolina Galleguillos a,*, Serge Belongie a,b
a University of California San Diego, Department of Computer Science and Engineering, La Jolla, CA 92093, USA
b California Institute of Technology, Department of Electrical Engineering, Pasadena, CA 91125, USA
article info
Article history:
Received 30 September 2008
Accepted 22 February 2010
Available online 1 March 2010
Keywords:
Object recognition
Context
Object categorization
Computer vision systems
abstract
The goal of object categorization is to locate and identify instances of an object category within an image.
Recognizing an object in an image is difficult when images include occlusion, poor quality, noise or
background clutter, and this task becomes even more challenging when many objects are present in the same
scene. Several models for object categorization use appearance and context information from objects to
improve recognition accuracy. Appearance information, based on visual cues, can successfully identify
object classes up to a certain extent. Context information, based on the interaction among objects in
the scene or global scene statistics, can help successfully disambiguate appearance inputs in recognition
tasks. In this work we address the problem of incorporating different types of contextual information for
robust object categorization in computer vision. We review different ways of using contextual
information in the field of object categorization, considering the most common levels of extraction of context and
the different levels of contextual interactions. We also examine common machine learning models that
integrate context information into object recognition frameworks and discuss scalability, optimizations
and possible future approaches.
© 2010 Elsevier Inc. All rights reserved.
1. Introduction
Traditional approaches to object categorization use appearance
features as the main source of information for recognizing object
classes in real world images. Appearance features, such as color,
edge responses, texture and shape cues, can capture variability in
object classes up to a certain extent. In the face of clutter, noise and
variation in pose and illumination, object appearance can be
disambiguated by the coherent composition of objects that real world
scenes often exhibit. An example of this situation is presented in
Fig. 1.
Information about typical configurations of objects in a scene
has been studied in psychology and computer vision for years, in
order to understand its effects on visual search, localization and
recognition performance [1,3,4,19,23]. Biederman et al. [3]
proposed five different classes of relations between an object and its
surroundings: interposition, support, probability, position and
familiar size. These classes characterize the organization of objects in
real world scenes. Classes corresponding to interposition and
support can be coded by reference to physical space. Probability,
position and size are defined as semantic relations because they
require access to the referential meaning of the object. Semantic
relations include information about detailed interactions among
objects in the scene and they are often used as contextual features.
Several different models [6,7,13,25,36] in the computer vision
community have exploited these semantic relations in order to
improve recognition. Semantic relations, also known as context
features, can reduce processing time and disambiguate low quality
inputs in object recognition tasks. As an example of this idea,
consider the flow chart in Fig. 2. An input image containing an
aeroplane, trees, sky and grass (top left) is first processed through a
segmentation-based object recognition engine. The recognizer
outputs an ordered shortlist of possible object labels; only the best
match is shown for each segment. Without appealing to context,
several mistakes are evident. Semantic context (probability) in
the form of object co-occurrence allows one to correct the label
of the aeroplane, but leaves the labels of the sky, grass and plant
incorrect. Spatial context (position) asserts that sky is more likely
to appear above grass than vice versa, correcting the labels of those
segments. Finally, scale context (size) corrects the segment labeled
as "plant", assigning it the label of tree, since plants are typically
smaller than trees and the rest of the objects in the scene.
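The co-occurrence step of this example can be illustrated with a minimal sketch. It is not the model of any surveyed paper: the shortlists, appearance scores, affinity table and function names below are all hypothetical, and the joint labeling is found by brute force, which is only feasible for a handful of segments. The sketch assumes that each segment comes with a shortlist of (label, score) pairs with positive scores, and that pairwise co-occurrence affinities are given.

```python
import itertools
import math

# Hypothetical recognizer output: one shortlist of (label, appearance score)
# per segment. All numbers are made up for illustration.
SHORTLISTS = [
    [("bird", 0.55), ("aeroplane", 0.45)],
    [("sky", 0.90), ("water", 0.10)],
    [("grass", 0.80), ("water", 0.20)],
]

# Hypothetical symmetric co-occurrence affinities in (0, 1].
# Pairs that are not listed fall back to DEFAULT_AFFINITY.
COOCCURRENCE = {
    frozenset(("aeroplane", "sky")): 0.9,
    frozenset(("aeroplane", "grass")): 0.6,
    frozenset(("bird", "sky")): 0.5,
    frozenset(("bird", "grass")): 0.4,
    frozenset(("sky", "grass")): 0.9,
}
DEFAULT_AFFINITY = 0.1


def affinity(label_a, label_b, table):
    """Look up the affinity of an unordered label pair."""
    return table.get(frozenset((label_a, label_b)), DEFAULT_AFFINITY)


def appearance_only(shortlists):
    """Pick the best-scoring label of each segment independently."""
    return [max(candidates, key=lambda pair: pair[1])[0]
            for candidates in shortlists]


def relabel_with_cooccurrence(shortlists, table):
    """Pick the joint labeling that maximizes the sum of log appearance
    scores plus the sum of log pairwise co-occurrence affinities."""
    best_labels, best_score = None, -math.inf
    for combo in itertools.product(*shortlists):
        labels = [label for label, _ in combo]
        score = sum(math.log(s) for _, s in combo)
        score += sum(math.log(affinity(a, b, table))
                     for a, b in itertools.combinations(labels, 2))
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels


print(appearance_only(SHORTLISTS))
print(relabel_with_cooccurrence(SHORTLISTS, COOCCURRENCE))
```

With these made-up numbers, appearance alone prefers "bird" for the first segment, while the joint score prefers "aeroplane" because of its stronger affinity with "sky" and "grass". Spatial and scale context would need additional terms (for example, relative vertical position or relative segment size) that this sketch does not model.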
In this paper, we review a variety of approaches to
context based object categorization. In Section 2 we assess
different types of contextual features used in object categorization:
semantic, spatial and scale context. In Section 3 we review the use
of context information from a global and local image level.
Section 4 presents four different types of local and global contextual
1077-3142/$ - see front matter © 2010 Elsevier Inc. All rights reserved.
doi:10.1016/j.cviu.2010.02.004
* Corresponding author.
E-mail addresses: cgallegu@cse.ucsd.edu (C. Galleguillos), sjb@cse.ucsd.edu
(S. Belongie).
Computer Vision and Image Understanding 114 (2010) 712–722