Invariant Visual Object and Face Recognition: Neural and Computational Bases, and a Model, VisNet

Invariant Visual Object and Face Recognition: Neural and Computational Bases, and a Model, VisNet REVIEW ARTICLE published: 19 June 2012 COMPUTATIONAL NEUROSCIENCE doi: 10.3389/fncom.2012.00035 Invariant visual object and face recognition: neural and computational bases, and a model, VisNet 1,2 EdmundT. Rolls * Oxford Centre for Computational Neuroscience, Oxford, UK Department of Computer Science, University of Warwick, Coventry, UK Edited by: Neurophysiological evidence for invariant representations of objects and faces in the pri- Evgeniy Bart, Palo Alto Research mate inferior temporal visual cortex is described. Then a computational approach to how Center, USA invariant representations are formed in the brain is described that builds on the neuro- Reviewed by: physiology. A feature hierarchy model in which invariant representations can be built by Alexander G. Dimitrov, Washington self-organizing learning based on the temporal and spatial statistics of the visual input pro- State University Vancouver, USA Jay Hegdé, Georgia Health Sciences duced by objects as they transform in the world is described. VisNet can use temporal University, USA continuity in an associative synaptic learning rule with a short-term memory trace, and/or *Correspondence: it can use spatial continuity in continuous spatial transformation learning which does not Edmund T. Rolls, Department of require a temporal trace. The model of visual processing in the ventral cortical stream can Computer Science, University of build representations of objects that are invariant with respect to translation, view, size, and Warwick, Coventry CV4 7AL, UK. e-mail: [email protected] also lighting. The model has been extended to provide an account of invariant representa- tions in the dorsal visual system of the global motion produced by objects such as looming, rotation, and object-based movement. 
The model has been extended to incorporate top-down feedback connections to model the control of attention by biased competition in, for example, spatial and object search tasks. The approach has also been extended to account for how the visual system can select single objects in complex visual scenes, and how multiple objects can be represented in a scene. The approach has also been extended to provide, with an additional layer, for the development of representations of spatial scenes of the type found in the hippocampus.

Keywords: VisNet, invariance, face recognition, object recognition, inferior temporal visual cortex, trace learning rule, hippocampus, spatial scene representation

1. INTRODUCTION
One of the major problems that is solved by the visual system in the cerebral cortex is the building of a representation of visual information which allows object and face recognition to occur relatively independently of size, contrast, spatial-frequency, position on the retina, angle of view, lighting, etc. These invariant representations of objects, provided by the inferior temporal visual cortex (Rolls, 2008b), are extremely important for the operation of many other systems in the brain, for if there is an invariant representation, it is possible to learn on a single trial about reward/punishment associations of the object, the place where that object is located, and whether the object has been seen recently, and then to generalize correctly to other views, etc., of the same object (Rolls, 2008b). The way in which these invariant representations of objects are formed is a major issue in understanding brain function, for with this type of learning we must not only store and retrieve information, but must solve in addition the major computational problem of how all the different images on the retina (position, size, view, etc.) of an object can be mapped to the same representation of that object in the brain. It is this process with which we are concerned in this paper.

In Section 2 of this paper, I summarize some of the evidence on the nature of the invariant representations of objects and faces found in the inferior temporal visual cortex as shown by neuronal recordings. A fuller account is provided in Memory, Attention, and Decision-Making, Chapter 4 (Rolls, 2008b). Then I build on that foundation a closely linked computational theory of how these invariant representations of objects and faces may be formed by self-organizing learning in the brain, which has been investigated by simulations in a model network, VisNet (Rolls, 1992, 2008b; Wallis and Rolls, 1997; Rolls and Milward, 2000).

This paper reviews this combined neurophysiological and computational neuroscience approach developed by the author, which leads to a theory of invariant visual object recognition, and relates this approach to other research.

2. INVARIANT REPRESENTATIONS OF FACES AND OBJECTS IN THE INFERIOR TEMPORAL VISUAL CORTEX
2.1. PROCESSING TO THE INFERIOR TEMPORAL CORTEX IN THE PRIMATE VISUAL SYSTEM
A schematic diagram to indicate some aspects of the processing involved in object identification from the primary visual cortex, V1, through V2 and V4 to the posterior inferior temporal cortex (TEO) and the anterior inferior temporal cortex (TE) is shown in Figure 1 (Rolls and Deco, 2002; Rolls, 2008b; Blumberg and Kreiman, 2010; Orban, 2011).
The approximate location of these visual cortical areas on the brain of a macaque monkey is shown in Figure 2, which also shows that TE has a number of different subdivisions. The different TE areas all contain visually responsive neurons, as do many of the areas within the cortex in the superior temporal sulcus (Baylis et al., 1987). For the purposes of this summary, these areas will be grouped together as the anterior inferior temporal cortex (IT), except where otherwise stated.

The object and face-selective neurons described in this paper are found mainly between 7 and 3 mm posterior to the sphenoid reference, which in a 3–4 kg macaque corresponds to approximately 11–15 mm anterior to the interaural plane (Baylis et al., 1987; Rolls, 2007a,b, 2008b). For comparison, the "middle face patch" of Tsao et al. (2006) was at A6, which is probably part of the posterior inferior temporal cortex (Tsao and Livingstone, 2008). In the anterior inferior temporal cortex areas we have investigated, there are separate regions specialized for face identity in areas TEa and TEm on the ventral lip of the superior temporal sulcus and the adjacent gyrus, for face expression and movement in the cortex deep in the superior temporal sulcus (Baylis et al., 1987; Hasselmo et al., 1989a; Rolls, 2007b), and separate neuronal clusters for objects (Booth and Rolls, 1998; Kriegeskorte et al., 2008; Rolls, 2008b). A possible way in which VisNet could produce separate representations of face identity and expression has been investigated (Tromans et al., 2011). Similarly, in humans there are a number of separate visual representations of faces and other body parts (Spiridon et al., 2006; Weiner and Grill-Spector, 2011), with the clustering together of neurons with similar responses influenced by the self-organizing map processes that are a result of cortical design (Rolls, 2008b).

FIGURE 1 | Convergence in the visual system. Right – as it occurs in the brain. V1, visual cortex area V1; TEO, posterior inferior temporal cortex; TE, inferior temporal cortex (IT). Left – as implemented in VisNet. Convergence through the network is designed to provide fourth layer neurons with information from across the entire input retina.

FIGURE 2 | Lateral view of the macaque brain (left hemisphere) showing the different architectonic areas (e.g., TEm, TEa) in and bordering the anterior part of the superior temporal sulcus (STS) of the macaque (see text). The STS has been drawn opened to reveal the cortical areas inside it, and is circumscribed by a thick line.

2.2. TRANSLATION INVARIANCE AND RECEPTIVE FIELD SIZE
There is convergence from each small part of a region to the succeeding region (or layer in the hierarchy) in such a way that the receptive field sizes of neurons (for example, 1˚ near the fovea in V1) become larger by a factor of approximately 2.5 with each succeeding stage. (The typical parafoveal receptive field sizes found would not be inconsistent with the calculated approximations of, for example, 8˚ in V4, 20˚ in TEO, and 50˚ in inferior temporal cortex; Boussaoud et al., 1991; see Figure 1.) Such zones of convergence would overlap continuously with each other (see Figure 1).
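The roughly geometric growth of receptive field diameter described above can be made concrete with a short calculation (a sketch only: the stage names, the factor of 2.5, and the 1˚ foveal V1 starting value are taken from the text; everything else is illustrative):

```python
# Receptive field diameter growing by ~2.5x per stage of the
# ventral-stream hierarchy, starting from ~1 degree in V1.
SCALE_PER_STAGE = 2.5
STAGES = ["V1", "V2", "V4", "TEO", "IT"]

def receptive_field_sizes(start_deg=1.0, scale=SCALE_PER_STAGE):
    """Return {area: approximate receptive field diameter in degrees}."""
    return {area: start_deg * scale ** i for i, area in enumerate(STAGES)}

sizes = receptive_field_sizes()
for area, deg in sizes.items():
    print(f"{area}: ~{deg:.1f} deg")
```

The computed values (V4 ≈ 6˚, TEO ≈ 16˚, IT ≈ 39˚) are of the same order as the measured parafoveal approximations quoted in the text (8˚, 20˚, and 50˚), which is all the "would not be inconsistent" claim requires.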
This connectivity provides part of the basis for the fact that many neurons in the temporal cortical visual areas respond to a stimulus relatively independently of where it is in their receptive field, and moreover maintain their stimulus selectivity when the stimulus appears in different parts of the visual field (Gross et al., 1985; Tovee et al., 1994; Rolls et al., 2003). This is called translation or shift invariance. In addition to having topologically appropriate connections, it is necessary for the connections to have the appropriate synaptic weights to perform the mapping of each set of features, or object, to the same set of neurons in IT. How this could be achieved is addressed in the computational neuroscience models described later in this paper.

2.3. REDUCED TRANSLATION INVARIANCE IN NATURAL SCENES, AND THE SELECTION OF A REWARDED OBJECT
Until recently, research on translation invariance considered the case in which there is only one object in the visual field. What happens in a cluttered, natural environment? Do all objects that can activate an inferior temporal neuron do so whenever they are anywhere within the large receptive fields of inferior temporal neurons (Sato, 1989; Rolls and Tovee, 1995a)? If so, the output of the visual system might be confusing for structures that receive inputs from the temporal cortical visual areas. If one of the objects in the visual field was associated with reward, and another with punishment, would the output of the inferior temporal visual cortex to emotion-related brain systems be an amalgam of both stimuli? If so, how would we be able to choose between the stimuli, have an emotional response to one but not perhaps the other, and select one for action and not the other (see Figure 3)?

FIGURE 3 | Objects shown in a natural scene, in which the task was to search for and touch one of the stimuli. The objects in the task as run were smaller. The diagram shows that if the receptive fields of inferior temporal cortex neurons are large in natural scenes with multiple objects (in this scene, bananas and a face), then any receiving neuron in structures such as the orbitofrontal cortex and amygdala would receive information from many stimuli in the field of view, and would not be able to provide evidence about each of the stimuli separately.

To investigate how information is passed from the inferior temporal cortex (IT) to other brain regions to enable stimuli to be selected from natural scenes for action, Rolls et al. (2003) analyzed the responses of single and simultaneously recorded IT neurons to stimuli presented in complex natural backgrounds. In one situation, a visual fixation task was performed in which the monkey fixated at different distances from the effective stimulus. In another situation the monkey had to search for two objects on a screen, and a touch of one object was rewarded with juice, and of another object was punished with saline (see Figure 3 for a schematic overview and Figure 30 for the actual display). In both situations neuronal responses to the effective stimuli for the neurons were compared when the objects were presented in the natural scene or on a plain background. It was found that the overall response of the neuron to objects was sometimes somewhat reduced when they were presented in natural scenes, though the selectivity of the neurons remained. However, the main finding was that the magnitudes of the responses of the neurons typically became much less in the real scene the further the monkey fixated in the scene away from the object (see Figures 4 and 31 and Section 5.8.1).

It is proposed that this reduced translation invariance in natural scenes helps an unambiguous representation of an object which may be the target for action to be passed to the brain regions that receive from the primate inferior temporal visual cortex. It helps with the binding problem, by reducing in natural scenes the effective receptive field of inferior temporal cortex neurons to approximately the size of an object in the scene. The computational utility and basis for this is considered in Section 5.8 and by Rolls and Deco (2002), Trappenberg et al. (2002), Deco and Rolls (2004), Aggelopoulos and Rolls (2005), and Rolls and Deco (2006), and includes an advantage for what is at the fovea because of the large cortical magnification of the fovea, and shunting interactions between representations weighted by how far they are from the fovea.

FIGURE 4 | Firing of a temporal cortex cell to an effective stimulus presented either in a blank background or in a natural scene, as a function of the angle in degrees at which the monkey was fixating away from the effective stimulus. The task was to search for and touch the stimulus. (After Rolls et al., 2003.)

These findings suggest that the principle of providing strong weight to whatever is close to the fovea is an important principle governing the operation of the inferior temporal visual cortex, and in general of the output of the ventral visual system in natural environments. This principle of operation is very important in interfacing the visual system to action systems, because the effective stimulus in making inferior temporal cortex neurons fire is in natural scenes usually on or close to the fovea. This means that the spatial coordinates of where the object is in the scene do not have to be represented in the inferior temporal visual cortex, nor passed from it to the action selection system, as the latter can assume that the object making IT neurons fire is close to the fovea in natural scenes. Thus the position in visual space being fixated provides part of the interface between sensory representations of objects and their coordinates as targets for actions in the world. The small receptive fields of IT neurons in natural scenes make this possible. After this, local, egocentric, processing implemented in the dorsal visual processing stream using, e.g., stereodisparity may be used to guide action toward objects being fixated (Rolls and Deco, 2002).

The reduced receptive field size in complex natural scenes also enables emotions to be selective to just what is being fixated, because this is the information that is transmitted by the firing of IT neurons to structures such as the orbitofrontal cortex and amygdala.

There is an important comparison to be made here with some approaches in engineering in which attempts are made to analyze a whole visual scene at once. This is a massive computational problem, not yet solved in engineering. It is very instructive to see that this is not the approach taken by the (primate and human) brain, which instead analyses in complex natural scenes what is close to the fovea, thereby massively reducing the computational problems, including the feature binding problems. The brain then deals with a complex scene by fixating different parts serially, using processes such as bottom-up saliency to guide where fixations should occur (Itti and Koch, 2000; Zhao and Koch, 2011).

Interestingly, although the size of the receptive fields of inferior temporal cortex neurons becomes reduced in natural scenes so that neurons in IT respond primarily to the object being fixated, there is nevertheless frequently some asymmetry in the receptive fields (see Section 5.9 and Figure 35). This provides a partial solution to how multiple objects and their positions in a scene can be captured with a single glance (Aggelopoulos and Rolls, 2005).

2.4. SIZE AND SPATIAL-FREQUENCY INVARIANCE
Some neurons in the inferior temporal visual cortex and cortex in the anterior part of the superior temporal sulcus (IT/STS) respond relatively independently of the size of an effective face stimulus, with a mean size-invariance (to a half maximal response) of 12 times (3.5 octaves; Rolls and Baylis, 1986). An example of the responses of an inferior temporal cortex face-selective neuron to faces of different sizes is shown in Figure 5. This is not a property of a simple single-layer network (see Figure 7), nor of neurons in V1, which respond best to small stimuli, with a typical size-invariance of 1.5 octaves. Also, the neurons typically responded to a face when the information in it had been reduced from 3D to a 2D representation in gray on a monitor, with a response that was on average 0.5 of that to a real face.

FIGURE 5 | Typical response of an inferior temporal cortex face-selective neuron to faces of different sizes. The size subtended at the retina in degrees is shown. (From Rolls and Baylis, 1986.)

Another transform over which recognition is relatively invariant is spatial-frequency. For example, a face can be identified when it is blurred (when it contains only low spatial frequencies), and when it is high-pass spatial-frequency filtered (when it looks like a line drawing). If the face images to which these neurons respond are low-pass filtered in the spatial-frequency domain (so that they are blurred), then many of the neurons still respond when the images contain frequencies only up to 8 cycles per face. Similarly, the neurons still respond to high-pass filtered images (with only high-spatial-frequency edge information) when frequencies down to only 8 cycles per face are included (Rolls et al., 1985). Face recognition shows similar invariance with respect to spatial-frequency (see Rolls et al., 1985).
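The octave figures quoted above are simply base-2 logarithms of size ratios, so they are easy to check (illustrative only, using the numbers given in the text):

```python
import math

def octaves(size_ratio):
    """Number of octaves spanned by a size ratio (one octave = one doubling)."""
    return math.log2(size_ratio)

print(octaves(12))    # about 3.58, i.e., the "3.5 octaves" quoted for IT/STS neurons
print(octaves(2.83))  # a 2.83x range is about 1.5 octaves, the typical V1 figure
```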
Further analysis of these neurons with narrow (octave) bandpass spatial-frequency filtered face stimuli shows that the responses of these neurons to an unfiltered face cannot be predicted from a linear combination of their responses to the narrow band stimuli (Rolls et al., 1987). This lack of linearity of these neurons, and their responsiveness to a wide range of spatial frequencies (see also their broad critical band masking; Rolls, 2008a), indicate that in at least this part of the primate visual system recognition does not occur using Fourier analysis of the spatial-frequency components of images.

The utility of this representation for memory systems in the brain is that the output of the visual system will represent an object invariantly with respect to position on the retina, size, etc., and this simplifies the functionality required of the (multiple) memory systems, which need then simply associate the object representation with reward (orbitofrontal cortex and amygdala), associate it with position in the environment (hippocampus), recognize it as familiar (perirhinal cortex), associate it with a motor response in a habit memory (basal ganglia), etc. (Rolls, 2008b). The associations can be relatively simple, involving, for example, Hebbian associativity (Rolls, 2008b).

Some neurons in the temporal cortical visual areas actually represent the absolute size of objects such as faces independently of viewing distance (Rolls and Baylis, 1986). This could be called neurophysiological size constancy. The utility of this representation by a small population of neurons is that the absolute size of an object is a useful feature to use as an input to neurons that perform object recognition. Faces only come in certain sizes.

2.5. COMBINATIONS OF FEATURES IN THE CORRECT SPATIAL CONFIGURATION
Many neurons in this ventral processing stream respond to combinations of features (including objects), but not to single features presented alone, and the features must have the correct spatial arrangement. This has been shown, for example, with faces, for which it has been shown by masking out or presenting parts of the face (for example, eyes, mouth, or hair) in isolation, or by jumbling the features in faces, that some cells in the cortex in IT/STS respond only if two or more features are present, and are in the correct spatial arrangement (Perrett et al., 1982; Rolls et al., 1994; Freiwald et al., 2009; Rolls, 2011b). Figure 6 shows examples of four neurons, the top one of which responds only if all the features are present, and the others of which respond not only to the full face, but also to one or more features. Corresponding evidence has been found for non-face cells. For example, Tanaka et al. (1990) showed that some posterior inferior temporal cortex neurons might respond to the combination of an edge and a small circle only if they were in the correct spatial relationship to each other. Consistent evidence for face part configuration sensitivity has been found in human fMRI studies (Liu et al., 2010).

FIGURE 6 | Responses of four temporal cortex neurons to whole faces and to parts of faces. The mean firing rate ± sem are shown. The responses are shown as changes from the spontaneous firing rate of each neuron. Some neurons respond to one or several parts of faces presented alone. Other neurons (of which the top one is an example) respond only to the combination of the parts (and only if they are in the correct spatial configuration with respect to each other, as shown by Rolls et al., 1994). The control stimuli were non-face objects. (After Perrett et al., 1982.)

These findings are important for the computational theory, for they show that neurons selective to feature combinations are part of the process by which the cortical hierarchy operates, and this is incorporated into VisNet (Elliffe et al., 2002).

Evidence consistent with the suggestion that neurons are responding to combinations of a few variables represented at the preceding stage of cortical processing is that some neurons in V2 and V4 respond to end-stopped lines, to tongues flanked by inhibitory subregions, to combinations of lines, to combinations of colors, or to surfaces (Hegde and Van Essen, 2000, 2003, 2007; Ito and Komatsu, 2004; Brincat and Connor, 2006; Anzai et al., 2007; Orban, 2011). In the inferior temporal visual cortex, some neurons respond to spatial configurations of surface fragments to help specify the three-dimensional structure of objects (Yamane et al., 2008).

2.6. A VIEW-INVARIANT REPRESENTATION
For recognizing and learning about objects (including faces), it is important that an output of the visual system should be not only translation and size invariant, but also relatively view-invariant. In an investigation of whether there are such neurons, we found that some temporal cortical neurons reliably responded differently to the faces of two different individuals independently of viewing angle (Hasselmo et al., 1989b), although in most cases (16/18 neurons) the response was not perfectly view-independent. Mixed together in the same cortical regions there are neurons with view-dependent responses (for example, Hasselmo et al., 1989b; Rolls and Tovee, 1995b). Such neurons might respond, for example, to a view of a profile of a monkey but not to a full-face view of the same monkey (Perrett et al., 1985; Hasselmo et al., 1989b).
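The feature-combination selectivity described above can be caricatured by a toy threshold unit that fires only when two features are both present and in the right relative position (a deliberately minimal sketch; the feature names, positions, tolerance, and threshold behavior are all invented for illustration):

```python
# Toy "combination" neuron: responds only when an eye feature and a
# mouth feature are both present AND the mouth lies below the eye,
# mimicking sensitivity to spatial configuration (all values invented).
def combination_neuron(features):
    """features: dict mapping feature name -> (x, y) position, y increasing downward."""
    if "eye" not in features or "mouth" not in features:
        return 0.0  # single features alone do not drive this neuron
    ex, ey = features["eye"]
    mx, my = features["mouth"]
    correct_configuration = my > ey and abs(mx - ex) < 2.0
    return 1.0 if correct_configuration else 0.0

print(combination_neuron({"eye": (0, 0), "mouth": (0, 3)}))  # correct arrangement
print(combination_neuron({"mouth": (0, 3)}))                 # single feature alone
print(combination_neuron({"eye": (0, 3), "mouth": (0, 0)}))  # jumbled arrangement
```

The point of the caricature is only that a conjunction detector with a configuration test responds to the whole but to neither part alone, like the top neuron in Figure 6.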
These findings of view-dependent, partially view-independent, and view-independent representations in the same cortical regions are consistent with the hypothesis discussed below that view-independent representations are being built in these regions by associating together the outputs of neurons that have different view-dependent responses to the same individual. These findings also provide evidence that one output of the visual system includes representations of what is being seen, in a view-independent way that would be useful for object recognition and for learning associations about objects; and that another output is a view-based representation that would be useful in social interactions to determine whether another individual is looking at one, and for selecting details of motor responses, for which the orientation of the object with respect to the viewer is required (Rolls, 2008b).

Further evidence that some neurons in the temporal cortical visual areas have object-based rather than view-based responses comes from a study of a population of neurons that responds to moving faces (Hasselmo et al., 1989b). For example, four neurons responded vigorously to a head undergoing ventral flexion, irrespective of whether the view of the head was full-face, of either profile, or even of the back of the head. These different views could only be specified as equivalent in object-based coordinates. Further, the movement specificity was maintained across inversion, with neurons responding, for example, to ventral flexion of the head irrespective of whether the head was upright or inverted. In this procedure, retinally encoded or viewer-centered movement vectors are reversed, but the object-based description remains the same.

Also consistent with object-based encoding is the finding of a small number of neurons that respond to images of faces of a given absolute size, irrespective of the retinal image size or distance (Rolls and Baylis, 1986).

Neurons with view-invariant responses to objects seen naturally by macaques have also been described (Booth and Rolls, 1998). The stimuli were presented for 0.5 s on a color video monitor while the monkey performed a visual fixation task. The stimuli were images of 10 real plastic objects that had been in the monkey's cage for several weeks, to enable him to build view-invariant representations of the objects. Control stimuli were views of objects that had never been seen as real objects. The neurons analyzed were in the TE cortex in and close to the ventral lip of the anterior part of the superior temporal sulcus. Many neurons were found that responded to some views of some objects. However, for a smaller number of neurons, the responses occurred only to a subset of the objects (using ensemble encoding), irrespective of the viewing angle. Moreover, the firing of a neuron on any one trial, taken at random and irrespective of the particular view of any one object, provided information about which object had been seen, and this information increased approximately linearly with the number of neurons in the sample. This is strong quantitative evidence that some neurons in the inferior temporal cortex provide an invariant representation of objects. Moreover, the results of Booth and Rolls (1998) show that the information is available in the firing rates, and has all the desirable properties of distributed representations, including exponentially high coding capacity and rapid speed of read-out of the information (Rolls, 2008b; Rolls and Treves, 2011).

Further evidence consistent with these findings is that some studies have shown that the responses of some visual neurons in the inferior temporal cortex do not depend on the presence or absence of critical features for maximal activation (Perrett et al., 1982; Tanaka, 1993, 1996). For example, neuron 4 in Figure 6 responded to several of the features in a face when these features were presented alone (Perrett et al., 1982). In another example, Mikami et al. (1994) showed that some TE cells respond to partial views of the same laboratory instrument(s), even when these partial views contain different features. Such functionality is important for object recognition when part of an object is occluded by, for example, another object. In a different approach, Logothetis et al. (1994) have reported that in monkeys extensively trained (over thousands of trials) to treat different views of computer-generated wire-frame "objects" as the same, a small population of neurons in the inferior temporal cortex did respond to different views of the same wire-frame object (see also Logothetis and Sheinberg, 1996). However, extensive training is not necessary for invariant representations to be formed, and indeed no explicit training in invariant object recognition was given in the experiment by Booth and Rolls (1998), as Rolls' hypothesis (Rolls, 1992) is that view-invariant representations can be learned by associating together the different views of objects as they are moved and inspected naturally in a period that may be in the order of a few seconds. Evidence for this is described in Section 2.7.

2.7. LEARNING OF NEW REPRESENTATIONS IN THE TEMPORAL CORTICAL VISUAL AREAS
To investigate the idea that visual experience might guide the formation of the responsiveness of neurons so that they provide an economical and ensemble-encoded representation of items actually present in the environment (and indeed any rapid learning found might help in the formation of invariant representations), the responses of inferior temporal cortex face-selective neurons have been analyzed while a set of new faces were shown. Some of the neurons studied in this way altered the relative degree to which they responded to the different members of the set of novel faces over the first few (1–2) presentations of the set (Rolls et al., 1989). If in a different experiment a single novel face was introduced when the responses of a neuron to a set of familiar faces were being recorded, the responses to the set of familiar faces were not disrupted, while the responses to the novel face became stable within a few presentations. Alteration of the tuning of individual neurons in this way may result in a good discrimination over the population as a whole of the faces known to the monkey. This evidence is consistent with the categorization being performed by self-organizing competitive neuronal networks, as described elsewhere (Rolls and Treves, 1998; Rolls, 2008b). Further evidence has been found to support the hypothesis (Rolls, 1992, 2008b) that unsupervised natural experience rapidly alters invariant object representation in the visual cortex (Li and DiCarlo, 2008; Li et al., 2011; cf. Folstein et al., 2010).
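The self-organizing competitive learning invoked here can be sketched in a few lines: a winner-take-all layer with Hebbian updates and weight renormalization, so that different neurons come to specialize on different input patterns (a minimal sketch only; the layer size, learning rate, and input patterns are invented for illustration):

```python
import random

def normalize(w):
    """Scale a weight vector to unit length."""
    norm = sum(x * x for x in w) ** 0.5
    return [x / norm for x in w]

def competitive_step(weights, x, lr=0.1):
    """One winner-take-all competitive learning step (Hebbian update + renormalization)."""
    activations = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    winner = max(range(len(weights)), key=lambda i: activations[i])
    updated = [wi + lr * xi for wi, xi in zip(weights[winner], x)]
    weights[winner] = normalize(updated)  # keeps weight vectors on the unit sphere
    return winner

random.seed(0)
# Two competing neurons, four input dimensions.
weights = [normalize([random.random() for _ in range(4)]) for _ in range(2)]
pattern_a = [1.0, 1.0, 0.0, 0.0]
pattern_b = [0.0, 0.0, 1.0, 1.0]
for _ in range(20):
    competitive_step(weights, pattern_a)
    competitive_step(weights, pattern_b)

# After training, the two patterns recruit different winners.
winner_a = competitive_step(weights, pattern_a)
winner_b = competitive_step(weights, pattern_b)
print(winner_a, winner_b)
```

Because only the winner's weights move toward the current input, each neuron's tuning to its own category sharpens without disrupting the other neuron's tuning, paralleling the finding that learning a novel face did not disrupt responses to familiar faces.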
Further evidence that these neurons can learn new representations very rapidly comes from an experiment in which binarized black and white (two-tone) images of faces that blended with the background were used. These did not activate face-selective neurons. Full gray-scale images of the same photographs were then shown for ten 0.5 s presentations. In a number of cases, if the neuron happened to be responsive to that face, when the binarized version of the same face was shown next, the neurons responded to it (Tovee et al., 1996). This is a direct parallel to the same phenomenon that is observed psychophysically, and provides dramatic evidence that these neurons are influenced by only a very few seconds (in this case 5 s) of experience with a visual stimulus. We have shown a neural correlate of this effect using similar stimuli and a similar paradigm in a PET (positron emission tomography) neuroimaging study in humans, with a region showing an effect of the learning found for faces in the right temporal lobe, and for objects in the left temporal lobe (Dolan et al., 1997).

Once invariant representations of objects have been learned in the inferior temporal visual cortex based on the statistics of the spatio-temporal continuity of objects in the visual world (Rolls, 1992, 2008b; Yi et al., 2008), later processes may be required to categorize objects based on properties other than their properties as objects. One such property is that certain objects may need to be treated as similar for the correct performance of a task, and others as different, and that demand can influence the representations of objects in a number of brain areas (Fenske et al., 2006; Freedman and Miller, 2008; Kourtzi and Connor, 2011). That process may in turn influence representations in the inferior temporal visual cortex, for example, by top-down bias (Rolls and Deco, 2002; Rolls, 2008b,c).

2.8. DISTRIBUTED ENCODING
An important question for understanding brain function is whether a particular object (or face) is represented in the brain by the firing of one or a few gnostic (or "grandmother") cells (Barlow, 1972), or whether instead the firing of a group or ensemble of cells each with somewhat different responsiveness provides the representation. Advantages of distributed codes include generalization and graceful degradation (fault tolerance), and a potentially very high capacity in the number of stimuli that can be represented (that is, exponential growth of capacity with the number of neurons in the representation; Rolls and Treves, 1998, 2011; Rolls, 2008b). If the ensemble encoding is sparse, this provides a good input to an associative memory, for then large numbers of stimuli can be stored (Rolls, 2008b; Rolls and Treves, 2011).

There is little information in whether IT neurons fire synchronously or not (Aggelopoulos et al., 2005; Rolls and Treves, 2011), so that temporal syntactic binding (Singer, 1999) may not be part of the mechanism. Each neuron has an approximately exponential probability distribution of firing rates in a sparse distributed representation (Franco et al., 2007; Rolls and Treves, 2011). These generic properties are described in detail elsewhere (Rolls, 2008b; Rolls and Treves, 2011), as are their implications for understanding brain function (Rolls, 2012), and so are not further described here. They are incorporated into the design of VisNet, as will become evident.

It is consistent with this general conceptual background that Kreiman et al. (2000) have described some neurons in the human temporal lobe that seem to respond selectively to an object. This is consistent with the principles just described, though the brain areas in which these recordings were made may be beyond the inferior temporal visual cortex and the tuning appears to be more specific, perhaps reflecting backprojections from language or other cognitive areas concerned, for example, with tool use that might influence the categories represented in high-order cortical areas (Farah et al., 1996; Farah, 2000; Rolls, 2008b).

3. APPROACHES TO INVARIANT OBJECT RECOGNITION
A goal of my approach is to provide a biologically based and biologically plausible approach to how the brain computes invariant representations for use by other brain systems (Rolls, 2008b). This leads me to propose a hierarchical feed-forward series of competitive networks using convergence from stage to stage, and the use of a modified Hebb synaptic learning rule that incorporates a short-term memory trace of previous neuronal activity to help learn the invariant properties of objects from the temporo-spatial statistics produced by the normal viewing of objects (Wallis and Rolls, 1997; Rolls and Milward, 2000; Stringer and Rolls, 2000, 2002; Rolls and Stringer, 2001, 2006; Elliffe et al., 2002; Rolls and Deco, 2002; Deco and Rolls, 2004; Rolls, 2008b). In Sections 3.1–3.5 I summarize some other approaches to invariant object recognition, and in Section 3.6 I introduce feature hierarchies as part of the background to VisNet, which is described starting in Section 4.

I start by emphasizing that generalization to different positions, sizes, views, etc. of an object is not a simple property of one-layer neural networks. Although neural networks do generalize well, the type of generalization they show naturally is to vectors which have a high dot product or correlation with what they have already learned. To make this clear, Figure 7 is a reminder that the activation h of each neuron is computed as
We have shown that in the inferior temporal visual cortex and cortex in h D x w (1) i j ij the anterior part of the superior temporal sulcus (IT/STS), there is a sparse distributed representation in the firing rates of neurons about faces and objects (Rolls, 2008b; Rolls and Treves, 2011). where the sum is over the C input axons, indexed by j. Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 7 Rolls Invariant visual object recognition an object recognition system might not distinguish a normal car from a car with the back wheels removed and placed on the roof. Such systems do not therefore perform shape recognition (where shape implies something about the spatial arrangement of features within an object, see further Ullman, 1996), and something more is needed, and is implemented in the primate visual system. How- ever, I note that the features that are present in objects, e.g., a furry texture, are useful to incorporate in object recognition systems, and the brain may well use, and the model VisNet in principle can use, evidence from which features are present in an object as part of the evidence for identification of a particular object. I note that the features might consist also of, for example, the pattern of movement that is characteristic of a particular object (such as a buzzing fly), and might use this as part of the input to final object identification. The capacity to use shape in invariant object recognition is fundamental to primate vision, but may not be used or fully implemented in the visual systems of some other animals with less developed visual systems. For example, pigeons may correctly identify pictures containing people, a particular person, trees, FIGURE 7 | A neuron that computes a dot product of the input pattern pigeons, etc. 
but may fail to distinguish a figure from a scrambled with its synaptic weight vector generalizes well to other patterns version of a figure (Herrnstein, 1984; Cerella, 1986). Thus their based on their similarity measured in terms of dot product or object recognition may be based more on a collection of parts than correlation, but shows no translation (or size, etc.) invariance. on a direct comparison of complete figures in which the relative positions of the parts are important. Even if the details of the con- Now consider translation (or shift) of the input (random clusions reached from this research are revised (Wasserman et al., binary) pattern vector by one position. The dot product will now 1998), it nevertheless does appear that at least some birds may use drop to a low-level, and the neuron will not respond, even though computationally simpler methods than those needed for invariant it is the same pattern, just shifted by one location. This makes shape recognition. For example, it may be that when some birds the point that special processes are needed to compute invariant are trained to discriminate between images in a large set of pic- representations. Network approaches to such invariant pattern tures, they tend to rely on some chance detail of each picture (such recognition are described in this paper. Once an invariant rep- as a spot appearing by mistake on the picture), rather than on resentation has been computed by a sensory system, it is in a recognition of the shapes of the object in the picture (Watanabe form that is suitable for presentation to a pattern association or et al., 1993). autoassociation neural network (Rolls, 2008b). 3.2. STRUCTURAL DESCRIPTIONS AND SYNTACTIC PATTERN 3.1. 
FEATURE SPACES RECOGNITION One very simple possibility for performing object classification is A second approach to object recognition is to decompose the based on feature spaces, which amount to lists of (the extent to object or image into parts, and to then produce a structural which) different features are present in a particular object. The description of the relations between the parts. The underlying features might consist of textures, colors, areas, ratios of length to assumption is that it is easier to capture object invariances at a level width, etc. The spatial arrangement of the features is not taken where parts have been identified. This is the type of scheme for into account. If n different properties are used to characterize an which Marr and Nishihara (1978) and Marr (1982) opted (Rolls, object, each viewed object is represented by a set of n real numbers. 2011a). The particular scheme (Binford, 1981) they adopted con- It then becomes possible to represent an object by a point R in an sists of generalized cones, series of which can be linked together to n-dimensional space (where R is the resolution of the real num- form structural descriptions of some, especially animate, stimuli bers used). Such schemes have been investigated (Gibson, 1950, (see Figure 8). 1979; Selfridge, 1959; Tou and Gonzalez, 1974; Bolles and Cain, Such schemes assume that there is a 3D internal model (struc- 1982; Mundy and Zisserman, 1992; Mel, 1997), but, because the tural description) of each object. Perception of the object consists relative positions of the different parts are not implemented in of parsing or segmenting the scene into objects, and then into the object recognition scheme, are not sensitive to spatial jum- parts, then producing a structural description of the object, and bling of the features. 
For example, if the features consisted of then testing whether this structural description matches that of any nose, mouth, and eyes, such a system would respond to faces with known object stored in the system. Other examples of structural jumbled arrangements of the eyes, nose, and mouth, which does description schemes include those of Sutherland (1968), Winston not match human vision, nor the responses of macaque inferior (1975), and Milner (1974). The relations in the structural descrip- temporal cortex neurons, which are sensitive to the spatial arrange- tion may need to be quite complicated, for example, “connected ment of the features in a face (Rolls et al., 1994). Similarly, such together,” “inside of,” “larger than,” etc. Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 8 Rolls Invariant visual object recognition FIGURE 8 | A 3D structural description of an object-based on axes on the right. In addition, some component axes have 3D models generalized cone parts. Each box corresponds to a 3D model, with its associated with them, as indicated by the way the boxes overlap. (After model axis on the left side of the box and the arrangement of its component Marr and Nishihara, 1978.) Perhaps the most developed model of this type is the recogni- For example, the structural description of many four-legged ani- tion by components (RBC) model of Biederman (1987), imple- mals is rather similar. Rather more than a structural description mented in a computational model by Hummel and Biederman seems necessary to identify many objects and animals. (1992). His small set (less than 50) of primitive parts named A third difficulty, which applies especially to biological sys- “geons” includes simple 3D shapes such as boxes, cylinders, tems, is the difficulty of implementing the syntax needed to hold and wedges. 
Objects are described by a syntactically linked list the structural description as a 3D model of the object, of produc- of the relations between each of the geons of which they are ing a syntactic structural description on the fly (in real time, and composed. Describing a table in this way (as a flat top sup- with potentially great flexibility of the possible arrangement of ported by three or four legs) seems quite economical. Other the parts), and of matching the syntactic description of the object schemes use 2D surface patches as their primitives (Dane and in the image to all the stored representations in order to find a Bajcsy, 1982; Brady et al., 1985; Faugeras and Hebert, 1986; match. An example of a structural description for a limb might be Faugeras, 1993). When 3D objects are being recognized, the body> thigh> shin> foot> toes. In this description> means “is implication is that the structural description is a 3D descrip- linked to,” and this link must be between the correct pair of descrip- tion. This is in contrast to feature hierarchical systems, in which tors. If we had just a set of parts, without the syntactic or relational recognition of a 3D object from any view might be accom- linking, then there would be no way of knowing whether the toes plished by storing a set of associated 2D views (see below, are attached to the foot or to the body. In fact, worse than this, Section 3.6). there would be no evidence about what was related to what, just There are a number of difficulties with schemes based on a set of parts. Such syntactical relations are difficult to implement structural descriptions, some general, and some with particular in any biologically plausible neuronal networks used in vision, reference to the potential difficulty of their implementation in the because if the representations of all the features or parts just men- brain. 
First, it is not always easy to decompose the object into tioned were active simultaneously, how would the spatial relations its separate parts, which must be performed before the structural between the features also be encoded? (How would it be apparent description can be produced. For example, it may be difficult to just from the firing of neurons that the toes were linked to the rest produce a structural description of a cat curled up asleep from of the foot but not to the body?) It would be extremely difficult separately identifiable parts. Identification of each of the parts to implement this “on the fly” syntactic binding in a biologically is also frequently very difficult when 3D objects are seen from plausible network (though cf. Hummel and Biederman, 1992), and different viewing angles, as key parts may be invisible or highly the only suggested mechanism for flexible syntactic binding, tem- distorted. This is particularly likely to be difficult in 3D shape poral synchronization of the firing of different neurons, is not well perception. It appears that being committed to producing a cor- supported as a quantitatively important mechanism for informa- rect description of the parts before other processes can operate tion encoding in the ventral visual system, and would have major is making too strong a commitment early on in the recognition difficulties in implementing correct, relational, syntactic binding process. (Section 5.4.1; Rolls, 2008b; Rolls and Treves, 2011). A second difficulty is that many objects or animals that can be A fourth difficulty of the structural description approach is correctly recognized have rather similar structural descriptions. 
that segmentation into objects must occur effectively before object recognition, so that the linked structural description list can be of one object. Given the difficulty of segmenting objects in typical natural cluttered scenes (Ullman, 1996), and the compounding problem of overlap of parts of objects by other objects, segmentation as a first necessary stage of object recognition adds another major difficulty for structural description approaches.

A fifth difficulty is that metric information, such as the relative size of the parts that are linked syntactically, needs to be specified in the structural description (Stankiewicz and Hummel, 1994), which complicates the parts that have to be syntactically linked.

It is because of these difficulties that even in artificial vision systems implemented on computers, where almost unlimited syntactic binding can easily be implemented, the structural description approach to object recognition has not yet succeeded in producing a scheme which actually works in more than an environment in which the types of objects are limited, and the world is far from the natural world, consisting, for example, of 2D scenes (Mundy and Zisserman, 1992).

Although object recognition in the brain is unlikely to be based on the structural description approach, for the reasons given above, and the fact that the evidence described in this paper supports a feature hierarchy rather than the structural description implementation in the brain, it is certainly the case that humans can provide verbal, syntactic, descriptions of objects in terms of the relations of their parts, and that this is often a useful type of description. Humans may therefore, it is suggested, supplement a feature hierarchical object recognition system built into their ventral visual system with the additional ability to use the type of syntax that is necessary for language to provide another level of description of objects. This ability is useful in, for example, engineering applications.

3.3. TEMPLATE MATCHING AND THE ALIGNMENT APPROACH
Another approach is template matching, comparing the image on the retina with a stored image or picture of an object. This is conceptually simple, but there are in practice major problems. One major problem is how to align the image on the retina with the stored images, so that all possible images on the retina can be compared with the stored template or templates of each object. The basic idea of the alignment approach (Ullman, 1996) is to compensate for the transformations separating the viewed object and the corresponding stored model, and then compare them. For example, the image and the stored model may be similar, except for a difference in size. Scaling one of them will remove this discrepancy and improve the match between them. For a 2D world, the possible transforms are translation (shift), scaling, and rotation. Given, for example, an input letter of the alphabet to recognize, the system might, after segmentation (itself a very difficult process if performed independently of (prior to) object recognition), compensate for translation by computing the center of mass of the object, and shifting the character to a "canonical location." Scale might be compensated for by calculating the convex hull (the smallest envelope surrounding the object), and then scaling the image. Of course how the shift and scaling would be accomplished is itself a difficult point – easy to perform on a computer using matrix multiplication as in simple computer graphics, but not the sort of computation that could be performed easily or accurately by any biologically plausible network. Compensating for rotation is even more difficult (Ullman, 1996). All this has to happen before the segmented canonical representation of the object is compared to the stored object templates with the same canonical representation. The system of course becomes vastly more complicated when the recognition must be performed of 3D objects seen in a 3D world, for now the particular view of an object after segmentation must be placed into a canonical form, regardless of which view, or how much of any view, may be seen in a natural scene with occluding contours. However, this process is helped, at least in computers that can perform high-precision matrix multiplication, by the fact that (for many continuous transforms such as 3D rotation, translation, and scaling) all the possible views of an object transforming in 3D space can be expressed as the linear combination of other views of the same object (see Chapter 5 of Ullman, 1996; Koenderink and van Doorn, 1991; Koenderink, 1990).

This alignment approach is the main theme of the book by Ullman (1996), and there are a number of computer implementations (Lowe, 1985; Grimson, 1990; Huttenlocher and Ullman, 1990; Shashua, 1995). However, as noted above, it seems unlikely that the brain is able to perform the high-precision calculations needed to perform the transforms required to align any view of a 3D object with some canonical template representation. For this reason, and because the approach also relies on segmentation of the object in the scene before the template alignment algorithms can start, and because key features may need to be correctly identified to be used in the alignment (Edelman, 1999), this approach is not considered further here.

We may note here in passing that some animals with a less computationally developed visual system appear to attempt to solve the alignment problem by actively moving their heads or eyes to see what template fits, rather than starting with an image on the eye and attempting to transform it into canonical coordinates. This "active vision" approach used, for example, by some invertebrates has been described by Land (1999) and Land and Collett (1997).

3.4. SOME FURTHER MACHINE LEARNING APPROACHES
Learning the transformations and invariances of the signal is another approach to invariant object recognition at the interface of machine learning and theoretical neuroscience. For example, rather than focusing on the templates, "map-seeking circuit theory" focuses on the transforms (Arathorn, 2002, 2005). The theory provides a general computational mechanism for discovery of correspondences in massive transformation spaces by exploiting an ordering property of superpositions. The latter allows a set of transformations of an input image to be formed into a sequence of superpositions which are then "culled" to a composition of single mappings by a competitive process which matches each superposition against a superposition of inverse transformations of memory patterns. Earlier work considered how to minimize the variance in the output when the image transformed (Leen, 1995). Another approach is to add transformation invariance to mixture models, by approximating the non-linear transformation manifold by a discrete set of points (Frey and Jojic, 2003). They showed how the expectation maximization algorithm can be used to jointly learn clusters, while at the same time inferring the transformation associated with each input.
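The core move in this family of approaches — approximating the transformation manifold by a discrete set of transforms and jointly inferring the best (template, transform) pair for each input — can be sketched in a few lines. This is my own toy illustration of the idea (cyclic shifts standing in for the transform set, maximum dot product standing in for the full EM machinery), not the Frey and Jojic implementation.

```python
import numpy as np

def shifts(v):
    """All cyclic shifts of a 1D pattern: the discrete set of transforms."""
    return [np.roll(v, k) for k in range(len(v))]

def match(x, templates):
    """Return the (template index, shift) pair that best explains input x."""
    best, best_score = (None, None), -np.inf
    for t_idx, t in enumerate(templates):
        for k, shifted in enumerate(shifts(t)):
            score = float(np.dot(x, shifted))
            if score > best_score:
                best, best_score = (t_idx, k), score
    return best

template = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
x = np.roll(template, 3)             # the template, translated by 3 positions

t_idx, k = match(x, [template])
# The matcher recovers both the cluster identity (0) and the transform (3).
```

In the full EM scheme, the inferred transforms would then be inverted to align the inputs, and the aligned inputs averaged to re-estimate the cluster templates; the sketch above corresponds only to the inference ("E") step.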
In another approach, an unsupervised algorithm for learning Lie group operators for in-plane transforms from input data was described (Rao and Ruderman, 1999).

3.5. NETWORKS THAT CAN RECONSTRUCT THEIR INPUTS
Hinton et al. (1995) and Hinton and Ghahramani (1997) have argued that cortical computation is invertible, so that, for example, the forward transform of visual information from V1 to higher areas loses no information, and there can be a backward transform from the higher areas to V1. A comparison of the reconstructed representation in V1 with the actual image from the world might in principle be used to correct all the synaptic weights between the two (in both the forward and the reverse directions), in such a way that there are no errors in the transform (Hinton, 2010). This suggested reconstruction scheme would seem to involve non-local synaptic weight correction (though see Hinton and Sejnowski, 1986, and O'Reilly and Munakata, 2000, for a suggested, although still biologically implausible, neural implementation, contrastive Hebbian learning), or other biologically implausible operations. The scheme also does not seem to provide an account for why or how the responses of inferior temporal cortex neurons become the way they are (providing information about which object is seen relatively independently of position on the retina, size, or view). The whole forward transform performed in the brain seems to lose much of the information about the size, position, and view of the object, as it is evidence about which object is present invariant of its size, view, etc. that is useful to the stages of processing about objects that follow (Rolls, 2008b). Because of these difficulties, and because the backprojections are needed for processes such as recall (Rolls, 2008b), this approach is not considered further here.

In the context of recall, if the visual system were to perform a reconstruction in V1 of a visual scene from what is represented in the inferior temporal visual cortex, then it might be supposed that remembered visual scenes might be as information-rich (and subjectively as full of rich detail) as seeing the real thing. This is not the case for most humans, and indeed this point suggests that at least what reaches consciousness from the inferior temporal visual cortex (which is activated during the recall of visual memories) is the identity of the object (as made explicit in the firing rate of the neurons), and not the low-level details of the exact place, size, and view of the object in the recalled scene, even though, according to the reconstruction argument, that information should be present in the inferior temporal visual cortex.

3.6. FEATURE HIERARCHIES AND 2D VIEW-BASED OBJECT RECOGNITION
Another approach, and one that is much closer to what appears to be present in the primate ventral visual system (Wurtz and Kandel, 2000a; Rolls and Deco, 2002; Rolls, 2008b), is a feature hierarchy system (see Figure 9).

FIGURE 9 | The feature hierarchy approach to object recognition. The inputs may be neurons tuned to oriented straight line segments. In early intermediate layers, neurons respond to a combination of these inputs in the correct spatial position with respect to each other. In further intermediate layers, of which there may be several, neurons respond with some invariance to the feature combinations represented early, and form higher order feature combinations. Finally, in the top layer, neurons respond to combinations of what is represented in the preceding intermediate layer, and thus provide evidence about objects in a position (and scale and even view) invariant way. Convergence through the network is designed to provide top layer neurons with information from across the entire input retina, as part of the solution to translation invariance, and other types of invariance are treated similarly.

In this approach, the system starts with some low-level description of the visual scene, in terms, for example, of oriented straight line segments of the type that are represented in the responses of primary visual cortex (V1) neurons, and then builds in repeated hierarchical layers features based on what is represented in previous layers. A feature may thus be defined as a combination of what is represented in the previous layer. For example, after V1, features might consist of combinations of straight lines, which might represent longer curved lines (Zucker et al., 1989), or terminated lines (in fact represented in V1 as end-stopped cells), corners, "T" junctions which are characteristic of obscuring edges, and (at least in humans) the arrow and "Y" vertices which are characteristic properties of man-made environments. Evidence that such feature combination neurons are present in V2 is that some neurons respond to combinations of line elements that join at different angles (Hegde and Van Essen, 2000, 2003, 2007; Ito and Komatsu, 2004; Anzai et al., 2007). (An example of this might be a neuron responding to a "V" shape at a particular orientation.) As one ascends the hierarchy, neurons might respond to more complex trigger features. For example, two parts of a complex figure may need to be in the correct spatial arrangement with respect to each other, as shown by Tanaka (1996) for V4 and posterior inferior temporal cortex neurons. In another example, V4 neurons may respond to the curvature of the elements of a stimulus (Carlson et al., 2011). Further on, neurons might respond to combinations of several such intermediate-level feature combination neurons, and thus come to respond systematically differently to different objects, and thus to convey information about which object is present. This approach received neurophysiological support early on from the results of Hubel and Wiesel (1962) and Hubel and Wiesel (1968) in the cat and monkey, and many of the data described in Chapter 5 of Rolls and Deco (2002) are consistent with this scheme.

A number of problems need to be solved for such feature hierarchy visual systems to provide a useful model of object recognition in the primate visual system. It is shown in Section 5.4 that feature hierarchy systems can solve the fourth of these problems (described below) by forming feature combination neurons at an early stage of processing (e.g., V1 or V2 in the brain) that respond with high spatial precision to the local arrangement of features. Such neurons would respond differently, for example, to L, C, and T if they receive inputs from two line-responding neurons.

First, some way needs to be found to keep the number of feature combination neurons realistic at each stage, without undergoing a combinatorial explosion. If a separate feature combination neuron was needed to code for every possible combination of n types of feature each with a resolution of 2 levels (binary encoding) in the preceding stage, then 2^n neurons would be needed.
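The scale of that combinatorial explosion, and the economy of restricting neurons to low-order combinations of features, can be checked with a little arithmetic (my own illustration of the argument, with an arbitrary example value of n):

```python
from math import comb

n = 64                       # feature types available from the preceding stage

# One neuron per possible binary combination of the n features:
exhaustive = 2 ** n          # ~1.8e19 neurons: biologically absurd

# Neurons tuned only to low-order combinations of 2-4 features:
low_order = sum(comb(n, k) for k in range(2, 5))

print(exhaustive)            # 18446744073709551616
print(low_order)             # 679056
```

A few hundred thousand low-order conjunction neurons, rather than ~10^19 exhaustive ones, sits comfortably within the tens of millions of neurons estimated below for a cortical stage such as V4.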
The suggestion that is made in Section 4 is that by forming neurons that respond to low-order combinations of features (neurons that respond to just say 2–4 features from the preceding stage), the number of actual feature analyzing neurons can be kept within reasonable numbers. By reasonable we mean the number of neurons actually found at any one stage of the visual system, which, for V4, might be on the order of 6 × 10^7 neurons (assuming a volume for macaque V4 of approximately 2,000 mm^3, and a cell density of 20,000–40,000 neurons per mm^3, Rolls, 2008b). This is certainly a large number; but the fact that a large number of neurons is present at each stage of the primate visual system is in fact consistent with the hypothesis that feature combination neurons are part of the way in which the brain solves object recognition. A factor which also helps to keep the number of neurons under control is the statistics of the visual world, which contain great redundancies. The world is not random, and indeed the statistics of natural images are such that many regularities are present (Field, 1994), and not every possible combination of pixels on the retina needs to be separately encoded. A third factor which helps to keep the number of connections required onto each neuron under control is that in a multilayer hierarchy each neuron can be set up to receive connections from only a small region of the preceding layer. Thus an individual neuron does not need to have connections from all the neurons in the preceding layer. Over multiple layers, the required convergence can be produced so that the same neurons in the top layer can be activated by an image of an effective object anywhere on the retina (see Figure 1).

A second problem of feature hierarchy approaches is how to map all the different possible images of an individual object through to the same set of neurons in the top layer by modifying the synaptic connections (see Figure 1).

It is shown in Section 5.4 that at later layers of the hierarchy, where some of the intermediate-level feature combination neurons are starting to show translation invariance, then correct object recognition may still occur because only one object contains just those sets of intermediate-level neurons in which the spatial representation of the features is inherent in the encoding.

The type of representation developed in a hierarchical object recognition system, in the brain, and by VisNet as described in the rest of this paper, would be suitable for recognition of an object, and for linking associative memories to objects, but would be less good for making actions in 3D space to particular parts of, or inside, objects, as the 3D coordinates of each part of the object would not be explicitly available. It is therefore proposed that visual fixation is used to locate in foveal vision part of an object to which movements must be made, and that local disparity and other measurements of depth (made explicit in the dorsal visual system) then provide sufficient information for the motor system to make actions relative to the small part of space in which a local, view-dependent, representation of depth would be provided (cf. Ballard, 1990).

One advantage of feature hierarchy systems is that they can operate fast (Rolls, 2008b).

A second advantage is that the feature analyzers can be built out of the rather simple competitive networks (Rolls, 2008b) which use a local learning rule, and have no external teacher, so that they are rather biologically plausible. Another advantage is that, once trained on subset features common to most objects, the system can then learn new objects quickly.
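A minimal sketch of such a competitive layer (my own illustration under simple assumptions, not the VisNet code): dot-product activations, winner-take-all competition, and a local Hebb-like update, with no external teacher.

```python
import numpy as np

def competitive_step(w, x, lr=0.1):
    """One self-organizing update of a competitive layer.

    w : (neurons, inputs) weight matrix, one row per neuron
    x : (inputs,) firing-rate input vector
    """
    h = w @ x                               # dot-product activations
    winner = int(np.argmax(h))              # competition: winner-take-all
    w[winner] += lr * (x - w[winner])       # local Hebb-like rule, no teacher
    w[winner] /= np.linalg.norm(w[winner])  # keep synaptic strength bounded
    return winner

# Two recurring input patterns ("features" from the preceding layer):
a = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2.0)
b = np.array([0.0, 0.0, 1.0, 1.0]) / np.sqrt(2.0)

# Slightly asymmetric start (a dead-unit "conscience" mechanism, often
# added in practice, is omitted here for brevity):
w = np.array([[0.6, 0.5, 0.4, 0.5],
              [0.5, 0.4, 0.5, 0.6]])
w /= np.linalg.norm(w, axis=1, keepdims=True)

for _ in range(50):
    competitive_step(w, a)
    competitive_step(w, b)

# The layer has allocated a different neuron to each pattern, unsupervised.
assert competitive_step(w, a) != competitive_step(w, b)
```

Soft rather than hard competition, as discussed below, lets the layer express graded hypotheses rather than a single winner.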
The solution discussed in are allocated by self-organization to represent just the features Sections 4, 5.1.1, and 5.3 is the use of a synaptic modification rule present in the natural statistics of real images (cf. Field, 1994), and with a short-term memory trace of the previous activity of the not every possible feature that could be constructed by random neuron, to enable it to learn to respond to the now transformed combinations of pixels on the retina. version of what was seen very recently, which, given the statistics A related fourth advantage of feature hierarchy networks is of looking at the visual world, will probably be an input from the that because they can utilize competitive networks, they can still same object. produce the best guess at what is in the image under non-ideal A third problem of feature hierarchy approaches is how they conditions, when only parts of objects are visible because, for can learn in just a few seconds of inspection of an object to recog- example, of occlusion by other objects, etc. The reasons for this nize it in different transforms, for example, in different positions are that competitive networks assess the evidence for the presence on the retina in which it may never have been presented during of certain “features” to which they are tuned using a dot prod- training. A solution to this problem is provided in Section 5.4, uct operation on their inputs, so that they are inherently tolerant in which it is shown that this can be a natural property of fea- of missing input evidence; and reach a state that reflects the best ture hierarchy object recognition systems, if they are trained first hypothesis or hypotheses (with soft competition) given the whole for all locations on the intermediate-level feature combinations of set of inputs, because there are competitive interactions between which new objects will simply be a new combination, and therefore the different neurons (Rolls, 2008b). 
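The tolerance of dot-product competitive networks to missing input evidence can be illustrated with a minimal NumPy sketch (the network size, weight patterns, and the softmax form of soft competition are illustrative assumptions, not VisNet's actual implementation): a neuron whose weight vector matches an input still wins the competition when half of its input evidence is occluded, because the dot product degrades gracefully rather than failing outright.

```python
import numpy as np

def soft_competition(weights, x, beta=8.0):
    """Dot-product activations followed by soft competition (a softmax here)."""
    a = weights @ x                       # each neuron's evidence for its feature set
    e = np.exp(beta * (a - a.max()))
    return e / e.sum()                    # graded firing rates after competition

# Four neurons, each tuned to a distinct block of four "features".
weights = np.zeros((4, 16))
for i in range(4):
    weights[i, 4 * i: 4 * i + 4] = 0.5    # unit-length weight vectors

full = np.zeros(16)
full[4:8] = 1.0                           # all four features of neuron 1's pattern
occluded = np.zeros(16)
occluded[4:6] = 1.0                       # only two of the four features visible

print(soft_competition(weights, full).argmax())      # neuron 1 wins
print(soft_competition(weights, occluded).argmax())  # neuron 1 still wins
```

With the full pattern the dot products are [0, 2, 0, 0]; with half the features occluded they fall to [0, 1, 0, 0], so the response is weaker but the same hypothesis still wins the soft competition.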
requiring learning only in the upper layers of the hierarchy. A fifth advantage of a feature hierarchy system is that, as shown A fourth potential problem of feature hierarchy systems is that in Section 5.5, the system does not need to perform segmentation when solving translation invariance they need to respond to the into objects as part of pre-processing, nor does it need to be able same local spatial arrangement of features (which are needed to to identify parts of an object, and can also operate in cluttered specify the object), but to ignore the global position of the whole scenes in which the object may be partially obscured. The reason Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 12 Rolls Invariant visual object recognition for this is that once trained on objects, the system then oper- After the immediately following description of early models of ates somewhat like an associative memory, mapping the image a feature hierarchy approach implemented in the Cognitron and properties forward onto whatever it has learned about before, and Neocognitron, we turn for the remainder of this paper to analy- then by competition selecting just the most likely output to be ses of how a feature hierarchy approach to invariant visual object activated. Indeed, the feature hierarchy approach provides a mech- recognition might be implemented in the brain, and how key com- anism by which processing at the object recognition level could putational issues could be solved by such a system. The analyses feed back using backprojections to early cortical areas to provide are developed and tested with a model, VisNet, which will shortly top-down guidance to assist segmentation. Although backprojec- be described. 
Much of the data we have on the operation of the tions are not built into VisNet2 (Rolls and Milward, 2000), they high-order visual cortical areas (Section 2; Rolls and Deco, 2002; have been added when attentional top-down processing must be Anzai et al., 2007; Rolls, 2008b) suggest that they implement a fea- incorporated (Deco and Rolls, 2004), are present in the brain, ture hierarchy approach to visual object recognition, as is made and are incorporated into the models described elsewhere (Rolls, evident in the remainder of this paper. 2008b). Although the operation of the ventral visual system can proceed as a feed-forward hierarchy, as shown by backward mask- 3.6.1. The cognitron and neocognitron ing experiments (Rolls and Tovee, 1994; Rolls et al., 1999; Rolls, An early computational model of a hierarchical feature-based 2003, 2006), top-down influences can of course be implemented approach to object recognition, joining other early discussions by the backprojections, and may be useful in further shaping the of this approach (Selfridge, 1959; Sutherland, 1968; Barlow, 1972; activity of neurons at lower levels in the hierarchy based on the Milner, 1974), was proposed by Fukushima (1975, 1980, 1989, neurons firing at a higher level as a result of dynamical interactions 1991). His model used two types of cell within each layer to of neurons at different layers of the hierarchy (Rolls, 2008b; Jiang approach the problem of invariant representations. In each layer, et al., 2011). a set of “simple cells,” with defined position, orientation, etc. sen- A sixth advantage of feature hierarchy systems is that they can sitivity for the stimuli to which they responded, was followed by naturally utilize features in the images of objects which are not a set of “complex cells,” which generalized a little over position, strictly part of a shape description scheme, such as the fact that orientation, etc. 
This simple cell – complex cell pairing within different objects have different textures, colors, etc. Feature hierar- each layer provided some invariance. When a neuron in the net- chy systems, because they utilize whatever is represented at earlier work using competitive learning with its stimulus set, which was stages in forming feature combination neurons at the next stage, typically letters on a 16 16 pixel array, learned that a particular naturally incorporate such “feature list” evidence into their analy- feature combination had occurred, that type of feature analyzer sis, and have the advantages of that approach (see Section 3.1 and was replicated in a non-local manner throughout the layer, to pro- also Mel, 1997). Indeed, the feature space approach can utilize a vide further translation invariance. Invariant representations were hybrid representation, some of whose dimensions may be discrete thus learned in a different way from VisNet. Up to eight layers were and defined in structural terms, while other dimensions may be used. The network could learn to differentiate letters, even with continuous and defined in terms of metric details, and others may some translation, scaling, or distortion. Although internally it is be concerned with non-shape properties such as texture and color organized and learns very differently to VisNet, it is an indepen- (cf. Edelman, 1999). dent example of the fact that useful invariant pattern recognition A seventh advantage of feature hierarchy systems is that they can be performed by multilayer hierarchical networks. A major do not need to utilize “on the fly” or run-time arbitrary binding of biological implausibility of the system is that once one neuron features. Instead, the spatial syntax is effectively hard-wired into within a layer learned, other similar neurons were set up through- the system when it is trained, in that the feature combination neu- out the layer by a non-local process. 
A second biological limitation rons have learned to respond to their set of features when they are was that no learning rule or self-organizing process was specified in a given spatial arrangement on the retina. as to how the complex cells can provide translation-invariant rep- An eighth advantage of feature hierarchy systems is that they resentations of simple cell responses – this was simply handwired. can self-organize (given the right functional architecture, trace Solutions to both these issues are provided by VisNet. synaptic learning rule, and the temporal statistics of the normal visual input from the world), with no need for an external teacher 4. HYPOTHESES ABOUT THE COMPUTATIONAL to specify that the neurons must learn to respond to objects. The MECHANISMS IN THE VISUAL CORTEX FOR OBJECT correct, object, representation self-organizes itself given rather RECOGNITION economically specified genetic rules for building the network (cf. The neurophysiological findings described in Section 2, and wider Rolls and Stringer, 2000). considerations on the possible computational properties of the Ninth, it is also noted that hierarchical visual systems may rec- cerebral cortex (Rolls, 1992, 2000, 2008b; Rolls and Treves, 1998; ognize 3D objects based on a limited set of 2D views of objects, Rolls and Deco, 2002), lead to the following outline working and that the same architectural rules just stated and implemented hypotheses on object recognition by visual cortical mechanisms in VisNet will correctly associate together the different views of (see Rolls, 1992). The principles underlying the processing of faces an object. 
It is part of the concept (see below), and consistent and other objects may be similar, but more neurons may become with neurophysiological data (Tanaka, 1996), that the neurons allocated to represent different aspects of faces because of the need in the upper layers will generalize correctly within a view (see to recognize the faces of many different individuals, that is to Section 5.6). identify many individuals within the category faces. Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 13 Rolls Invariant visual object recognition Cortical visual processing for object recognition is considered have utility in maximizing the number of memories that can be to be organized as a set of hierarchically connected cortical regions stored when, toward the end of the visual system, the visual repre- consisting at least of V1, V2, V4, posterior inferior temporal cor- sentation of objects is interfaced to associative memory (Rolls and tex (TEO), inferior temporal cortex (e.g., TE3, TEa, and TEm), Treves, 1998; Rolls, 2008b). and anterior temporal cortical areas (e.g., TE2 and TE1). (This Translation invariance would be computed in such a system by stream of processing has many connections with a set of cortical utilizing competitive learning to detect regularities in inputs when areas in the anterior part of the superior temporal sulcus, includ- real objects are translated in the physical world. The hypothesis ing area TPO.) 
There is convergence from each small part of a is that because objects have continuous properties in space and region to the succeeding region (or layer in the hierarchy) in such time in the world, an object at one place on the retina might acti- a way that the receptive field sizes of neurons (e.g., 1˚ near the vate feature analyzers at the next stage of cortical processing, and fovea in V1) become larger by a factor of approximately 2.5 with when the object was translated to a nearby position, because this each succeeding stage (and the typical parafoveal receptive field would occur in a short period (e.g., 0.5 s), the membrane of the sizes found would not be inconsistent with the calculated approx- post-synaptic neuron would still be in its “Hebb-modifiable” state imations of, e.g., 8˚ in V4, 20˚ in TEO, and 50˚ in the inferior (caused, for example, by calcium entry as a result of the voltage- temporal cortex Boussaoud et al., 1991; see Figure 1). Such zones dependent activation of NMDA receptors), and the presynaptic of convergence would overlap continuously with each other (see afferents activated with the object in its new position would thus Figure 1). This connectivity would be part of the architecture by become strengthened on the still-activated post-synaptic neuron. which translation-invariant representations are computed. It is suggested that the short temporal window (e.g., 0.5 s) of Hebb- Each layer is considered to act partly as a set of local modifiability helps neurons to learn the statistics of objects moving self-organizing competitive neuronal networks with overlapping in the physical world, and at the same time to form different rep- inputs. 
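The quoted receptive field sizes can be checked against the stated growth factor with a line or two of arithmetic (the V1 starting diameter of 1.3˚ is an assumption chosen here so that the geometric progression lands near the quoted parafoveal figures):

```python
# Receptive field diameters growing by a factor of ~2.5 per stage.
factor = 2.5
rf = {"V1": 1.3}   # illustrative parafoveal starting diameter in degrees
for prev, nxt in [("V1", "V2"), ("V2", "V4"), ("V4", "TEO"), ("TEO", "IT")]:
    rf[nxt] = rf[prev] * factor

for area, size in rf.items():
    print(f"{area}: ~{size:.1f} deg")
```

This yields roughly 8˚ for V4, 20˚ for TEO, and 51˚ for inferior temporal cortex, close to the approximate figures given in the text.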
(The region within which competition would be imple- resentations of different feature combinations or objects, as these mented would depend on the spatial properties of inhibitory are physically discontinuous and present less regular correlations interneurons, and might operate over distances of 1–2 mm in the to the visual system. Földiák (1991) has proposed computing an cortex.) These competitive nets operate by a single set of forward average activation of the post-synaptic neuron to assist with the inputs leading to (typically non-linear, e.g., sigmoid) activation same problem. One idea here is that the temporal properties of of output neurons; of competition between the output neurons the biologically implemented learning mechanism are such that it mediated by a set of feedback inhibitory interneurons which is well suited to detecting the relevant continuities in the world of receive from many of the principal (in the cortex, pyramidal) cells real objects. Another suggestion is that a memory trace for what in the net and project back (via inhibitory interneurons) to many has been seen in the last 300 ms appears to be implemented by of the principal cells and serve to decrease the firing rates of the less a mechanism as simple as continued firing of inferior temporal active neurons relative to the rates of the more active neurons; and neurons after the stimulus has disappeared, as has been found in then of synaptic modification by a modified Hebb rule, such that masking experiments (Rolls and Tovee, 1994; Rolls et al., 1994, synapses to strongly activated output neurons from active input 1999; Rolls, 2003). axons strengthen, and from inactive input axons weaken (Rolls, I also suggested (Rolls, 1992) that other invariances, for exam- 2008b). A biologically plausible form of this learning rule that ple, size, spatial-frequency, and rotation invariance, could be operates well in such networks is learned by a comparable process. 
(Early processing in V1 which enables different neurons to represent inputs at different spatial scales would allow combinations of the outputs of such neurons w D y .x w / (2) ij i j ij to be formed at later stages. Scale invariance would then result from detecting at a later stage which neurons are almost conjunc- where w is the change of the synaptic weight, is a learning rate ij constant, y is the firing rate of the i th postsynaptic neuron, and tively active as the size of an object alters.) It is suggested that this process takes place at each stage of the multiple-layer cortical pro- x and w are in appropriate units (Rolls, 2008b). Such compet- j ij itive networks operate to detect correlations between the activity cessing hierarchy, so that invariances are learned first over small regions of space, and then over successively larger regions. This of the input neurons, and to allocate output neurons to respond to each cluster of such correlated inputs. These networks thus limits the size of the connection space within which correlations must be sought. act as categorizers. In relation to visual information processing, they would remove redundancy from the input representation, and Increasing complexity of representations could also be built in would develop low-entropy representations of the information (cf. such a multiple-layer hierarchy by similar mechanisms. At each stage or layer the self-organizing competitive nets would result in Barlow, 1985; Barlow et al., 1989). Such competitive nets are bio- logically plausible, in that they utilize Hebb-modifiable forward combinations of inputs becoming the effective stimuli for neu- rons. In order to avoid the combinatorial explosion, it is proposed, excitatory connections, with competitive inhibition mediated by cortical inhibitory neurons. The competitive scheme I suggest following Feldman (1985), that low-order combinations of inputs would be what is learned by each neuron. 
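The learning rule of Equation (2) can be implemented in a few lines. A minimal sketch (illustrative values throughout, and with the graded inhibition described in the text simplified to a hard winner-take-all):

```python
import numpy as np

def competitive_step(W, x, alpha=0.1):
    """One step of the rule dw_ij = alpha * y_i * (x_j - w_ij).

    Competition is reduced here to a hard winner-take-all (y is 1 for the most
    activated neuron, 0 otherwise); in the cortex this competition would be the
    graded inhibition supplied by feedback inhibitory interneurons.
    """
    y = np.zeros(W.shape[0])
    y[np.argmax(W @ x)] = 1.0
    W += alpha * y[:, None] * (x[None, :] - W)   # the winner moves toward the input
    return W

# Two orthogonal input patterns ("clusters" of correlated inputs) and two output
# neurons that initially only weakly prefer one pattern each.
c0 = np.array([1.0, 1.0, 0.0, 0.0])
c1 = np.array([0.0, 0.0, 1.0, 1.0])
W = np.array([[0.6, 0.5, 0.4, 0.4],
              [0.4, 0.4, 0.5, 0.6]])

for _ in range(100):                 # alternate presentations of the two patterns
    W = competitive_step(W, c0)
    W = competitive_step(W, c1)

print(np.round(W, 2))                # each neuron's weights settle onto one cluster
```

The (x_j − w_ij) term moves the winning neuron's weight vector toward the current input, so over presentations the output neurons distribute themselves across the input clusters, acting as the categorizers described above.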
(Each input would not would not result in the formation of “winner-take-all” or “grand- mother” cells, but would instead result in a small ensemble of active be represented by activity in a single input axon, but instead by activity in a set of active input axons.) Evidence consistent with neurons representing each input (Rolls and Treves, 1998; Rolls, 2008b). The scheme has the advantages that the output neurons this suggestion that neurons are responding to combinations of a few variables represented at the preceding stage of cortical pro- learn better to distribute themselves between the input patterns (cf. Bennett, 1990), and that the sparse representations formed cessing is that some neurons in V1 respond to combinations of Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 14 Rolls Invariant visual object recognition bars or edges (Shevelev et al., 1995; Sillito et al., 1995); V2 and coordinates of each part of the object would not be explicitly avail- V4 respond to end-stopped lines, to angles formed by a combi- able. It is therefore proposed that visual fixation is used to locate nation of lines, to tongues flanked by inhibitory subregions, or to in foveal vision part of an object to which movements must be combinations of colors (Hegde and Van Essen, 2000, 2003, 2007; made, and that local disparity and other measurements of depth Ito and Komatsu, 2004; Anzai et al., 2007; Orban, 2011); in poste- then provide sufficient information for the motor system to make rior inferior temporal cortex to stimuli which may require two or actions relative to the small part of space in which a local, view- more simple features to be present (Tanaka et al., 1990); and in the dependent, representation of depth would be provided (cf. Ballard, temporal cortical face processing areas to images that require the 1990). 
presence of several features in a face (such as eyes, hair, and mouth) The computational processes proposed above operate by an in order to respond (Perrett et al., 1982; Yamane et al., 1988; Rolls, unsupervised learning mechanism, which utilizes statistical regu- 2011b; see Figure 6). (Precursor cells to face-responsive neurons larities in the physical environment to enable representations to might, it is suggested, respond to combinations of the outputs of be built. In some cases it may be advantageous to utilize some the neurons in V1 that are activated by faces, and might be found form of mild teaching input to the visual system, to enable it to in areas such as V4.) It is an important part of this suggestion that learn, for example, that rather similar visual inputs have very dif- some local spatial information would be inherent in the features ferent consequences in the world, so that different representations which were being combined. For example, cells might not respond of them should be built. In other cases, it might be helpful to bring to the combination of an edge and a small circle unless they were representations together, if they have identical consequences, in in the correct spatial relation to each other. (This is in fact consis- order to use storage capacity efficiently. It is proposed elsewhere tent with the data of Tanaka et al. (1990), and with our data on (Rolls, 1989a,b, 2008b; Rolls and Treves, 1998) that the backpro- face neurons, in that some face neurons require the face features jections from each adjacent cortical region in the hierarchy (and to be in the correct spatial configuration, and not jumbled, Rolls from the amygdala and hippocampus to higher regions of the et al. (1994).) 
The local spatial information in the features being visual system) play such a role by providing guidance to the com- combined would ensure that the representation at the next level petitive networks suggested above to be important in each cortical would contain some information about the (local) arrangement area. This guidance, and also the capability for recall, are it is sug- of features. Further low-order combinations of such neurons at gested implemented by Hebb-modifiable connections from the the next stage would include sufficient local spatial information so backprojecting neurons to the principal (pyramidal) neurons of that an arbitrary spatial arrangement of the same features would the competitive networks in the preceding stages (Rolls, 1989a,b, not activate the same neuron, and this is the proposed, and lim- 2008b; Rolls and Treves, 1998). ited, solution which this mechanism would provide for the feature The computational processes outlined above use sparse distrib- binding problem (Elliffe et al., 2002; cf. von der Malsburg, 1990). uted coding with relatively finely tuned neurons with a graded By this stage of processing a view-dependent representation of response region centered about an optimal response achieved objects suitable for view-dependent processes such as behavioral when the input stimulus matches the synaptic weight vector on a responses to face expression and gesture would be available. neuron. The distributed nature of the coding but with fine tuning It is suggested that view-independent representations could be would help to limit the combinatorial explosion, to keep the num- formed by the same type of computation, operating to combine a ber of neurons within the biological range. The graded response limited set of views of objects. 
The plausibility of providing view- region would be crucial in enabling the system to generalize cor- independent recognition of objects by combining a set of different rectly to solve, for example, the invariances. However, such a system views of objects has been proposed by a number of investigators would need many neurons, each with considerable learning capac- (Koenderink and Van Doorn, 1979; Poggio and Edelman, 1990; ity, to solve visual perception in this way. This is fully consistent Logothetis et al., 1994; Ullman, 1996). Consistent with the sug- with the large number of neurons in the visual system, and with gestion that the view-independent representations are formed by the large number of, probably modifiable, synapses on each neu- combining view-dependent representations in the primate visual ron (e.g., 10,000). Further, the fact that many neurons are tuned system, is the fact that in the temporal cortical areas, neurons in different ways to faces is consistent with the fact that in such a with view-independent representations of faces are present in the computational system, many neurons would need to be sensitive same cortical areas as neurons with view-dependent representa- (in different ways) to faces, in order to allow recognition of many tions (from which the view-independent neurons could receive individual faces when all share a number of common properties. inputs; Perrett et al., 1985; Hasselmo et al., 1989b; Booth and Rolls, 1998). This solution to “object-based” representations is very dif- 5. 
THE FEATURE HIERARCHY APPROACH TO INVARIANT ferent from that traditionally proposed for artificial vision systems, OBJECT RECOGNITION: COMPUTATIONAL ISSUES in which the coordinates in 3D space of objects are stored in a data- The feature hierarchy approach to invariant object recognition base, and general-purpose algorithms operate on these to perform was introduced in Section 3.6, and advantages and disadvantages transforms such as translation, rotation, and scale change in 3D of it were discussed. Hypotheses about how object recognition space (e.g., Marr, 1982). In the present, much more limited but could be implemented in the brain which are consistent with more biologically plausible scheme, the representation would be much of the neurophysiology discussed in Section 2 and by Rolls suitable for recognition of an object, and for linking associative and Deco (2002) and Rolls (2008b) were set out in Section 4. memories to objects, but would be less good for making actions These hypotheses effectively incorporate a feature hierarchy sys- in 3D space to particular parts of, or inside, objects, as the 3D tem while encompassing much of the neurophysiological evidence. Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 15 Rolls Invariant visual object recognition In this Section (5), we consider the computational issues that arise general architecture simulated in VisNet, and the way in which in such feature hierarchy systems, and in the brain systems that it allows natural images to be used as stimuli, has been chosen to implement visual object recognition. The issues are considered enable some comparisons of neuronal responses in the network with the help of a particular model, VisNet, which requires precise and in the brain to similar stimuli to be made. specification of the hypotheses, and at the same time enables them to be explored and tested numerically and quantitatively. However, 5.1.1. 
The trace rule I emphasize that the issues to be covered in Section 5 are key and The learning rule implemented in the VisNet simulations utilizes major computational issues for architectures of this feature hierar- the spatio-temporal constraints placed upon the behavior of “real- chical type (Rolls, 2008b), and are very relevant to understanding world” objects to learn about natural object transformations. By how invariant object recognition is implemented in the brain. presenting consistent sequences of transforming objects the cells VisNet is a model of invariant object recognition based on Rolls’ in the network can learn to respond to the same object through all (Rolls, 1992) hypotheses. It is a computer simulation that allows of its naturally transformed states, as described by Földiák (1991), hypotheses to be tested and developed about how multilayer hier- Rolls (1992), Wallis et al. (1993), and Wallis and Rolls (1997). The archical networks of the type believed to be implemented in the learning rule incorporates a decaying trace of previous cell activity visual cortical pathways operate. The architecture captures a num- and is henceforth referred to simply as the “trace” learning rule. ber of aspects of the architecture of the visual cortical pathways, The learning paradigm we describe here is intended in principle and is described next. The model of course, as with all mod- to enable learning of any of the transforms tolerated by inferior els, requires precise specification of what is to be implemented, temporal cortex neurons, including position, size, view, lighting, and at the same time involves specified simplifications of the real and spatial-frequency (Rolls, 1992, 2000, 2008b; Rolls and Deco, architecture, as investigations of the fundamental aspects of the 2002). 
information processing being performed are more tractable in To clarify the reasoning behind this point, consider the situa- a simplified and at the same time quantitatively specified model. tion in which a single neuron is strongly activated by a stimulus First the architecture of the model is described, and this is followed forming part of a real-world object. The trace of this neuron’s acti- by descriptions of key issues in such multilayer feature hierarchical vation will then gradually decay over a time period in the order of models, such as the issue of feature binding, the optimal form of 0.5 s. If, during this limited time window, the net is presented with training rule for the whole system to self-organize, the operation of a transformed version of the original stimulus then not only will the network in natural environments and when objects are partly the initially active afferent synapses modify onto the neuron, but so occluded, how outputs about individual objects can be read out also will the synapses activated by the transformed version of this from the network, and the capacity of the system. stimulus. In this way the cell will learn to respond to either appear- ance of the original stimulus. Making such associations works in 5.1. THE ARCHITECTURE OF VisNet practice because it is very likely that within short-time periods Fundamental elements of Rolls’ (1992) theory for how cortical net- different aspects of the same object will be being inspected. The works might implement invariant object recognition are described cell will not, however, tend to make spurious links across stimuli in Section 4. They provide the basis for the design of VisNet, and that are part of different objects because of the unlikelihood in the can be summarized as: real-world of one object consistently following another. 
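This temporal association can be sketched with the trace rule in scalar form (a toy example: α and η are illustrative, and the firing values are contrived so that only the first transform drives the neuron directly, leaving the decaying trace to do the associating):

```python
import numpy as np

def trace_learn(w, x, y, y_trace, alpha=0.05, eta=0.6):
    """One step of the trace rule:
       y_trace(t) = (1 - eta) * y(t) + eta * y_trace(t-1)
       dw_j       = alpha * y_trace(t) * x_j
    """
    y_trace = (1.0 - eta) * y + eta * y_trace
    w = w + alpha * y_trace * x
    return w, y_trace

# Three successive retinal "transforms" of one object, each activating a
# different afferent (one-hot inputs; a fourth afferent is never active).
transforms = [np.array([1.0, 0.0, 0.0, 0.0]),
              np.array([0.0, 1.0, 0.0, 0.0]),
              np.array([0.0, 0.0, 1.0, 0.0])]

w = np.zeros(4)
y_trace = 0.0
for t, x in enumerate(transforms):
    y = 1.0 if t == 0 else float(w @ x)   # the neuron fires to the first transform
    w, y_trace = trace_learn(w, x, y, y_trace)

print(np.round(w, 4))   # all three active afferents strengthened; the fourth is not
```

The decaying trace carries the neuron's response to the first transform across the later presentations, so the synapses carrying the transformed versions of the stimulus are strengthened onto the same neuron, with progressively smaller increments as the trace decays.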
A series of competitive networks, organized in hierarchical layers, exhibiting mutual inhibition over a short range within each layer. These networks allow combinations of features or inputs occurring in a given spatial arrangement to be learned by neurons, ensuring that higher order spatial properties of the input stimuli are represented in the network.

A convergent series of connections from a localized population of cells in preceding layers to each cell of the following layer, thus allowing the receptive field size of cells to increase through the visual processing areas or layers.

A modified Hebb-like learning rule incorporating a temporal trace of each cell's previous activity, which, it is suggested, will enable the neurons to learn transform invariances.

Various biological bases for this temporal trace have been advanced as follows. [The precise mechanisms involved may alter the precise form of the trace rule which should be used. Földiák (1992) describes an alternative trace rule which models individual NMDA channels. Equally, a trace implemented by extended cell firing should be reflected in representing the trace as an external firing rate, rather than an internal signal.]

The persistent firing of neurons for as long as 100–400 ms observed after presentations of stimuli for 16 ms (Rolls and Tovee, 1994) could provide a time window within which to associate subsequent images. Maintained activity may potentially be implemented by recurrent connections between as well as within cortical areas (Rolls and Treves, 1998; Rolls and Deco, 2002; Rolls, 2008b). [The prolonged firing of inferior temporal cortex neurons during memory delay periods of several seconds, and the associative links reported to develop between stimuli presented several seconds apart (Miyashita, 1988), are on too long a time scale to be immediately relevant to the present theory. In fact, associations between visual events occurring several seconds apart would, under normal environmental conditions, be detrimental to the operation of a network of the type described here, because they would probably arise from different objects. In contrast, the system described benefits from associations between visual events which occur close in time (typically within 1 s), as they are likely to be from the same object.]

The binding period of glutamate in the NMDA channels, which may last for 100 ms or more, may implement a trace rule by producing a narrow time window over which the average activity at each presynaptic site affects learning (Hestrin et al., 1990; Földiák, 1992; Rhodes, 1992; Rolls, 1992; Spruston et al., 1995).

Chemicals such as nitric oxide may be released during high neural activity and gradually decay in concentration over a short time window during which learning could be enhanced (Montague et al., 1991; Földiák, 1992; Garthwaite, 2008).

The first two elements of Rolls' theory are used to constrain the general architecture of a network model, VisNet, of the processes just described that is intended to learn invariant representations of objects. The simulation results described in this paper using VisNet show that invariant representations can be learned by the architecture. It is moreover shown that successful learning depends crucially on the use of the modified Hebb rule.

Frontiers in Computational Neuroscience | www.frontiersin.org | June 2012 | Volume 6 | Article 35

The trace update rule used in the baseline simulations of VisNet (Wallis and Rolls, 1997) is equivalent both to Földiák's rule used in the context of translation invariance (Wallis et al., 1993) and to the earlier rule of Sutton and Barto (1981) explored in the context of modeling the temporal properties of classical conditioning, and can be summarized as follows:

δw_j = α ȳ^τ x_j^τ    (3)

where

ȳ^τ = (1 − η) y^τ + η ȳ^(τ−1)    (4)

and

x_j: jth input to the neuron.
y^τ: output from the neuron at time step τ.
ȳ^τ: trace value of the output of the neuron at time step τ.
w_j: synaptic weight between the jth input and the neuron.
α: learning rate. Annealed between unity and zero.
η: trace value. The optimal value varies with presentation sequence length.

To bound the growth of each neuron's synaptic weight vector, w_i for the ith neuron, its length is explicitly normalized (a method similarly employed by von der Malsburg (1973) which is commonly used in competitive networks; Rolls, 2008b). An alternative, more biologically relevant implementation, using a local weight bounding operation which utilizes a form of heterosynaptic long-term depression (Rolls, 2008b), has in part been explored using a version of the Oja (1982) rule (see Wallis and Rolls, 1997).

5.1.2. The network implemented in VisNet
The network itself is designed as a series of hierarchical, convergent, competitive networks, in accordance with the hypotheses advanced above. The actual network consists of a series of four layers, constructed such that the convergence of information from the most disparate parts of the network's input layer can potentially influence firing in a single neuron in the final layer – see Figure 1. This corresponds to the scheme described by many researchers (Rolls, 1992, 2008b; Van Essen et al., 1992) as present in the primate visual system. The forward connections to a cell in one layer are derived from a topologically related and confined region of the preceding layer. The choice of whether a connection between neurons in adjacent layers exists or not is based upon a Gaussian distribution of connection probabilities which rolls off radially from the focal point of connections for each neuron. (A minor extra constraint precludes the repeated connection of any pair of cells.) In particular, the forward connections to a cell in one layer come from a small region of the preceding layer defined by the radius in Table 1 which will contain approximately 67% of the connections from the preceding layer. Table 1 shows the dimensions for VisNetL, the system we are currently using (Perry et al., 2010), which is a 16× larger version of the VisNet used in most of our previous investigations, which utilized 32 × 32 neurons per layer. Figure 1 shows the general convergent network architecture used. Localization and limitation of connectivity in the network is intended to mimic cortical connectivity, partially because of the clear retention of retinal topology through regions of visual cortex. This architecture also encourages the gradual combination of features from layer to layer, which has relevance to the binding problem, as described in Section 5.4.

Table 1 | VisNet dimensions.

              Dimensions       # Connections   Radius
Layer 4       128 × 128        100             48
Layer 3       128 × 128        100             36
Layer 2       128 × 128        100             24
Layer 1       128 × 128        272             24
Input layer   256 × 256 × 32   –               –

Modeling topological constraints in connectivity leads to an issue concerning neurons at the edges of the network layers. In principle these neurons may either receive no input from beyond the edge of the preceding layer, or have their connections repeatedly sample neurons at the edge of the previous layer. In practice either solution is liable to introduce artificial weighting on the few active inputs at the edge and hence cause the edge to have unwanted influence over the development of the network as a whole. In the real brain such edge-effects would be naturally smoothed by the transition of the locus of cellular input from the fovea to the lower acuity periphery of the visual field. However, it poses a problem here because we are in effect only simulating the small high-acuity foveal portion of the visual field in our simulations. As an alternative to the former solutions, Wallis and Rolls (1997) elected to form the connections into a toroid, such that connections wrap back onto the network from opposite sides. This wrapping happens at all four layers of the network, and in the way an image on the "retina" is mapped to the input filters. This solution has the advantage of making all of the boundaries effectively invisible to the network. Further, this procedure does not itself introduce problems into evaluation of the network for the problems set, as many of the critical comparisons in VisNet involve comparisons between a network with the same architecture trained with the trace rule, or with the Hebb rule, or not trained at all. In practice, it is shown below that only the network trained with the trace rule solves the problem of forming invariant representations.
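The trace update of equations (3) and (4), together with the explicit weight-vector normalization used to bound weight growth, can be sketched as follows (a minimal NumPy illustration under our own function and parameter names, not the VisNet source code):

```python
import numpy as np

def trace_update(w, x, y, y_trace_prev, eta=0.8, alpha=0.1):
    """One presentation step of the trace rule, equations (3) and (4),
    followed by explicit normalization of the weight-vector length.
    eta = 0 reduces the rule to a standard Hebb rule."""
    y_trace = (1.0 - eta) * y + eta * y_trace_prev   # eq. (4): exponentially decaying trace
    w = w + alpha * y_trace * x                      # eq. (3): Hebb-like update on the trace
    return w / np.linalg.norm(w), y_trace            # bound the growth of the weight vector
```

Because the trace of equation (4) mixes the current output with the previous trace, repeated application weights past presentations with exponentially decreasing strength, which is the decay discussed in Section 5.2.1.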
5.1.3. Competition and lateral inhibition
In order to act as a competitive network some form of mutual inhibition is required within each layer, which should help to ensure that all stimuli presented are evenly represented by the neurons in each layer. This is implemented in VisNet by a form of lateral inhibition. The idea behind the lateral inhibition, apart from this being a property of cortical architecture in the brain, was to prevent too many neurons that received inputs from a similar part of the preceding layer responding to the same activity patterns. The purpose of the lateral inhibition was to ensure that different receiving neurons coded for different inputs. This is important in reducing redundancy (Rolls, 2008b). The lateral inhibition is conceived as operating within a radius that was similar to that of the region within which a neuron received converging inputs from the preceding layer (because activity in one zone of topologically organized processing within a layer should not inhibit processing in another zone in the same layer, concerned perhaps with another part of the image). [Although the extent of the lateral inhibition actually investigated by Wallis and Rolls (1997) in VisNet operated over adjacent pixels, the lateral inhibition introduced by Rolls and Milward (2000) in what they named VisNet2, and which has been used in subsequent simulations, operates over a larger region, set within a layer to approximately half of the radius of convergence from the preceding layer. Indeed, Rolls and Milward (2000) showed, in a problem in which invariant representations over 49 locations were being learned with a 17 face test set, that the best performance was with intermediate-range lateral inhibition, using the parameters for σ shown in Table 3. These values of σ set the lateral inhibition radius within a layer to be approximately half that of the spread of the excitatory connections from the preceding layer.]

The lateral inhibition and contrast enhancement just described are actually implemented in VisNet2 (Rolls and Milward, 2000) and VisNetL (Perry et al., 2010) in two stages, to produce filtering of the type illustrated in Figure 10. The lateral inhibition is implemented by convolving the activation of the neurons in a layer with a spatial filter, I, where δ controls the contrast and σ controls the width, and a and b index the distance away from the center of the filter:

I_(a,b) = −δ e^(−(a² + b²)/σ²)          if a ≠ 0 or b ≠ 0,
I_(0,0) = 1 − Σ_(a≠0,b≠0) I_(a,b)       if a = 0 and b = 0.    (5)

This is a filter that leaves the average activity unchanged. A modified version of this filter, designed as a difference of Gaussians with the same inhibition but shorter range local excitation, is being tested to investigate whether the self-organizing maps that this promotes (Rolls, 2008b) help the system to provide some continuity in the representations formed. The concept is that this may help the system to code efficiently for large numbers of untrained stimuli that fall between trained stimuli in similarity space.

FIGURE 10 | Contrast-enhancing filter, which has the effect of local lateral inhibition. The parameters δ and σ are variables used in equation (5) to modify the amount and extent of inhibition, respectively.

The second stage involves contrast enhancement. In VisNet (Wallis and Rolls, 1997), this was implemented by raising the neuronal activations to a fixed power and normalizing the resulting firing within a layer to have an average firing rate equal to 1.0. In VisNet2 (Rolls and Milward, 2000) and in subsequent simulations a more biologically plausible form of the activation function, a sigmoid, was used:

y = f_sigmoid(r) = 1 / (1 + e^(−2β(r − α)))    (6)

where r is the activation (or firing rate) of the neuron after the lateral inhibition, y is the firing rate after the contrast enhancement produced by the activation function, β is the slope or gain, and α is the threshold or bias of the activation function. The sigmoid bounds the firing rate between 0 and 1 so global normalization is not required. The slope and threshold are held constant within each layer. The slope is constant throughout training, whereas the threshold is used to control the sparseness of firing rates within each layer. The (population) sparseness of the firing within a layer is defined (Rolls and Treves, 1998, 2011; Franco et al., 2007; Rolls, 2008b) as:

a = (Σ_i y_i / n)² / (Σ_i y_i² / n)    (7)

where n is the number of neurons in the layer. To set the sparseness to a given value, e.g., 5%, the threshold is set to the value of the 95th percentile point of the activations within the layer. (Unless otherwise stated here, the neurons used the sigmoid activation function as just described.)

In most simulations with VisNet2 and later, the sigmoid activation function was used with parameters (selected after a number of optimization runs) as shown in Table 2. In addition, the lateral inhibition parameters normally used in VisNet2 simulations are as shown in Table 3. (Where a power activation function was used in the simulations of Wallis and Rolls (1997), the power for layer 1 was 6, and for the other layers was 2.)
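The two stages can be sketched as follows (an illustrative NumPy fragment with our own function names, not the VisNet code): the first function builds the filter of equation (5), and the second applies the sigmoid of equation (6) with its threshold α set from a percentile of the activations, which is how the sparseness is controlled:

```python
import numpy as np

def lateral_inhibition_filter(delta, sigma, radius):
    """Filter of equation (5): a negative Gaussian surround (delta sets the
    contrast, sigma the width), with the center weight chosen so that the
    filter sums to 1 and therefore leaves the average activity unchanged."""
    a = np.arange(-radius, radius + 1)
    A, B = np.meshgrid(a, a, indexing="ij")
    I = -delta * np.exp(-(A**2 + B**2) / sigma**2)
    I[radius, radius] = 1.0 - (I.sum() - I[radius, radius])
    return I

def sigmoid_activation(r, beta, percentile):
    """Equation (6), with the threshold alpha set at a given percentile of
    the activations r, which sets the sparseness within the layer."""
    alpha = np.percentile(r, percentile)
    arg = np.clip(-2.0 * beta * (r - alpha), -700.0, 700.0)  # avoid overflow
    return 1.0 / (1.0 + np.exp(arg))
```

Setting the percentile to 95, for example, leaves roughly 5% of the neurons above the half-activation point, corresponding to the 5% sparseness example in the text.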
5.1.4. The input to VisNet
VisNet is provided with a set of input filters which can be applied to an image to produce inputs to the network which correspond to those provided by simple cells in visual cortical area 1 (V1). The purpose of this is to enable within VisNet the more complicated response properties of cells between V1 and the inferior temporal cortex (IT) to be investigated, using as inputs natural stimuli such as those that could be applied to the retina of the real visual system. This is to facilitate comparisons between the activity of neurons in VisNet and those in the real visual system to the same stimuli. In VisNet no attempt is made to train the response properties of simple cells, but instead we start with a defined series of filters to perform fixed feature extraction to a level equivalent to that of simple cells in V1, as have other researchers in the field (Fukushima, 1980; Buhmann et al., 1991; Hummel and Biederman, 1992), because we wish to simulate the more complicated response properties of cells between V1 and the inferior temporal cortex (IT). The elongated orientation-tuned input filters used accord with the general tuning profiles of simple cells in V1 (Hawken and Parker, 1987) and in earlier versions of VisNet were computed by weighting the difference of two Gaussians by a third orthogonal Gaussian, as described in detail elsewhere (Wallis and Rolls, 1997; Rolls and Milward, 2000; Perry et al., 2010).

Table 2 | Sigmoid parameters for the runs with 25 locations by Rolls and Milward (2000).

Layer        1      2     3     4
Percentile   99.2   98    88    91
Slope β      190    40    75    26

Table 3 | Lateral inhibition parameters for the 25-location runs.

Layer        1      2     3     4
Radius, σ    1.38   2.7   4.0   6.0
Contrast, δ  1.5    1.5   1.6   1.4

Each individual filter is tuned to spatial-frequency (0.0039–0.5 cycles/pixel over eight octaves); orientation (0–135˚ in steps of 45˚); and sign (±1). Of the 272 layer 1 connections, the number to each group in VisNetL is as shown in Table 4. In VisNet2 (Rolls and Milward, 2000; used for most VisNet simulations) only even symmetric – "bar detecting" – filter shapes are used, which take the form of a Gaussian shape along the axis of orientation tuning for the filter, and a difference of Gaussians along the perpendicular axis. This filter is referred to as an oriented difference of Gaussians, or DOG filter. Any zero D.C. filter can of course produce a negative as well as a positive output, which would mean that this simulation of a simple cell would permit negative as well as positive firing. In contrast to some other models, the response of each filter is zero thresholded and the negative results are used to form a separate anti-phase input to the network. The filter outputs are also normalized across scales to compensate for the low-frequency bias in the images of natural objects.

However, Gabor filters have also been tested, also produce good results with VisNet (Deco and Rolls, 2004), and are what we implement at present in VisNetL. Feed-forward connections to a layer of V1 neurons perform the extraction of simple features like bars at different locations, orientations, and sizes. Realistic receptive fields for V1 neurons that extract these simple features can be represented by 2D-Gabor wavelets. Following Daugman (1988), the receptive fields of the simple cell-like input neurons are modeled by 2D-Gabor functions. The Gabor receptive fields have five degrees of freedom, given essentially by the product of an elliptical Gaussian and a complex plane wave. The first two degrees of freedom are the 2D-locations of the receptive field's center; the third is the size of the receptive field; the fourth is the orientation of the boundaries separating excitatory and inhibitory regions; and the fifth is the symmetry. This fifth degree of freedom is given in the standard Gabor transform by the real and imaginary part, i.e., by the phase of the complex function representing it, whereas in a biological context this can be done by combining pairs of neurons with even and odd receptive fields. This design is supported by the experimental work of Pollen and Ronner (1981), who found simple cells in quadrature-phase pairs. Even more, Daugman (1988) proposed that an ensemble of simple cells is best modeled as a family of 2D-Gabor wavelets sampling the frequency domain in a log-polar manner as a function of eccentricity. Experimental neurophysiological evidence constrains the relation between the free parameters that define a 2D-Gabor receptive field (De Valois and De Valois, 1988). There are three constraints fixing the relation between the width, height, orientation, and spatial-frequency (Lee, 1996). The first constraint posits that the aspect ratio of the elliptical Gaussian envelope is 2:1. The second constraint postulates that the plane wave tends to have its propagating direction along the short axis of the elliptical Gaussian. The third constraint assumes that the half-amplitude bandwidth of the frequency response is about 1–1.5 octaves along the optimal orientation. Further, we assume that the mean is zero in order to have an admissible wavelet basis (Lee, 1996).

In more detail, the Gabor filters are constructed as follows (Deco and Rolls, 2004). We consider a pixelized gray-scale image given by an N × N matrix Γ^orig_ij. The subindices ij denote the spatial position of the pixel. Each pixel value is given a gray-level brightness value coded in a scale between 0 (black) and 255 (white). The first step in the pre-processing consists of removing the DC component of the image (i.e., the mean value of the gray-scale intensity of the pixels). (The equivalent in the brain is the low-pass filtering performed by the retinal ganglion cells and lateral geniculate cells. The visual representation in the LGN is essentially a contrast-invariant pixel representation of the image, i.e., each neuron encodes the relative brightness value at one location in visual space referred to the mean value of the image brightness.) We denote this contrast-invariant LGN representation by the N × N matrix Γ_ij defined by the equation

Γ_ij = Γ^orig_ij − (1/N²) Σ_(i=1..N) Σ_(j=1..N) Γ^orig_ij.    (8)

Lee (1996) derived a family of discretized 2D-Gabor wavelets that satisfy the wavelet theory and the neurophysiological constraints for simple cells mentioned above. They are given by an expression of the form

G_pqkl(x, y) = a^(−k) Ψ_Θ(a^(−k) x − 2p, a^(−k) y − 2q)    (9)

where

Ψ_Θ(x, y) = Ψ(x cos(lΘ₀) + y sin(lΘ₀), −x sin(lΘ₀) + y cos(lΘ₀)),    (10)

and the mother wavelet is given by

Ψ(x, y) = (1/√(2π)) e^(−(4x² + y²)/8) [e^(iκx) − e^(−κ²/2)].    (11)

In the above equations Θ₀ = π/L denotes the step size of each angular rotation; l the index of rotation corresponding to the preferred orientation Θ_l = lπ/L; k the octave; and the indices pq the position of the receptive field center at c_x = 2p and c_y = 2q. In this form, the receptive fields at all levels cover the spatial domain in the same way, i.e., by always overlapping the receptive fields in the same fashion. In the model we use a = 2, b = 1, and κ = π, corresponding to a spatial-frequency bandwidth of one octave. We now use in VisNetL both symmetric and asymmetric filters (as both are present in V1; Ringach, 2002), with the angular spacing between the different orientations set to 45˚, with 8 filter frequencies spaced one octave apart starting with 0.5 cycles per pixel, and with the sampling from the spatial frequencies set as shown in Table 4. Cells of layer 1 receive a topologically consistent, localized, random selection of the filter responses in the input layer, under the constraint that each cell samples every filter spatial-frequency and receives a constant number of inputs. Figure 11 shows pictorially the general filter sampling paradigm.
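The mother wavelet of equation (11) and the discretized family of equations (9) and (10) can be sketched as follows (an illustration assuming κ = π and a = 2, with our own function names rather than those of the VisNet source):

```python
import numpy as np

KAPPA = np.pi  # corresponds to a spatial-frequency bandwidth of one octave

def mother_wavelet(x, y, kappa=KAPPA):
    """Mother wavelet of equation (11): an elliptical Gaussian (2:1 aspect
    ratio) times a complex plane wave, with the DC response subtracted so
    that the filter has zero mean (an admissible wavelet)."""
    envelope = np.exp(-(4.0 * x**2 + y**2) / 8.0) / np.sqrt(2.0 * np.pi)
    return envelope * (np.exp(1j * kappa * x) - np.exp(-kappa**2 / 2.0))

def gabor(x, y, k, l, p, q, a=2.0, L=4):
    """Discretized family of equations (9) and (10): octave k, orientation
    index l (steps of pi/L), receptive-field center indexed by (p, q)."""
    theta = l * np.pi / L
    xs, ys = a**-k * x - 2.0 * p, a**-k * y - 2.0 * q
    xr = xs * np.cos(theta) + ys * np.sin(theta)
    yr = -xs * np.sin(theta) + ys * np.cos(theta)
    return a**-k * mother_wavelet(xr, yr)
```

The real part of the complex output gives the even-symmetric ("bar detecting") filter and the imaginary part the odd-symmetric filter, corresponding to the quadrature-phase pairs noted above.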
Table 4 | VisNet layer 1 connectivity. The frequency is in cycles per pixel.

Frequency      0.5   0.25   0.125   0.0625   0.03125   0.0156   0.0078   0.0039
# Connections  180   45     12      7        7         7        7        7

FIGURE 11 | The filter sampling paradigm. Here each square represents the retinal image presented to the network after being filtered by a Gabor filter of the appropriate orientation, sign, and frequency. The circles represent the consistent retinotopic coordinates used to provide input to a layer 1 cell. The filters double in spatial-frequency toward the reader. Left to right the orientation tuning increases from 0˚ in steps of 45˚, with segregated pairs of positive (P) and negative (N) filter responses.

5.1.5. Measures for network performance
A neuron can be said to have learnt an invariant representation if it discriminates one set of stimuli from another set, across all transformations. For example, a neuron's response is translation-invariant if its response to one set of stimuli irrespective of presentation location is consistently higher than for all other stimuli irrespective of presentation location. Note that we state "set of stimuli" since neurons in the inferior temporal cortex are not generally selective for a single stimulus but rather for a subpopulation of stimuli (Baylis et al., 1985; Abbott et al., 1996; Rolls et al., 1997b; Rolls and Treves, 1998, 2011; Rolls and Deco, 2002; Franco et al., 2007; Rolls, 2007b, 2008b). The measure of network performance used in VisNet1 (Wallis and Rolls, 1997), the "Fisher metric" (referred to in some figure labels as the Discrimination Factor), reflects how well a neuron discriminates between stimuli, compared to how well it discriminates between different locations (or more generally the images used rather than the objects, each of which is represented by a set of images, over which invariant stimulus or object representations must be learned). The Fisher measure is very similar to taking the ratio of the two F values in a two-way ANOVA, where one factor is the stimulus shown, and the other factor is the position in which a stimulus is shown. The measure takes a value greater than 1.0 if a neuron has more different responses to the stimuli than to the locations. That is, values greater than 1 indicate invariant representations when this measure is used in the following figures. Further details of how the measure is calculated are given by Wallis and Rolls (1997).

Measures of network performance based on information theory, and similar to those used in the analysis of the firing of real neurons in the brain (Rolls, 2008b; Rolls and Treves, 2011), were introduced by Rolls and Milward (2000) for VisNet2, and are used in later papers. A single cell information measure was introduced which is the maximum amount of information the cell has about any one stimulus/object independently of which transform (e.g., position on the retina) is shown. Because the competitive algorithm used in VisNet tends to produce local representations (in which single cells become tuned to one stimulus or object), this information measure can approach log₂ N_s bits, where N_s is the number of different stimuli. Indeed, it is an advantage of this measure that it has a defined maximal value, which enables how well the network is performing to be quantified. Rolls and Milward (2000) showed that the Fisher and single cell information measures were highly correlated, and given the advantage just noted of the information measure, it was adopted in Rolls and Milward (2000) and subsequent papers. Rolls and Milward (2000) also introduced a multiple cell information measure, which has the advantage that it provides a measure of whether all stimuli are encoded by different neurons in the network.

If all the output cells of VisNet learned to respond to the same stimulus, then the information about the set of stimuli S would be very poor, and would not reach its maximal value of the log of the number of stimuli (in bits). The second measure that is used here is therefore the information provided by a set of cells about the stimulus set, using the procedures described by Rolls et al. (1997b) and Rolls and Milward (2000). The multiple cell information is the mutual information between the whole set of stimuli S and of responses R, calculated using a decoding procedure in which the stimulus s that gave rise to the particular firing rate response vector on each trial is estimated. (The decoding step is needed because the high dimensionality of the response space would lead to an inaccurate estimate of the information if the responses were used directly, as described by Rolls et al. (1997b) and Rolls and Treves (1998).) A probability table is then constructed of the real stimuli s and the decoded stimuli s′. From this probability table, the mutual information between the set of actual stimuli S and the decoded estimates S′ is calculated as

I(S, S′) = Σ_(s,s′) P(s, s′) log₂ [ P(s, s′) / (P(s) P(s′)) ].    (13)

This was calculated for the subset of cells which had, as single cells, the most information about which stimulus was shown. In particular, in Rolls and Milward (2000) and subsequent papers, the multiple cell information was calculated from the first five cells for each stimulus that had maximal single cell information about that stimulus, that is from a population of 35 cells if there were seven stimuli (each of which might have been shown in, for example, 9 or 25 positions on the retina).
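Equation (13), applied to the probability table of real and decoded stimuli, can be sketched as follows (an illustrative NumPy fragment; the decoding step that produces the table is omitted):

```python
import numpy as np

def mutual_information(P):
    """Equation (13): mutual information (bits) between the actual stimuli S
    and the decoded stimuli S', from the joint probability table P[s, s']."""
    Ps = P.sum(axis=1, keepdims=True)    # marginal P(s)
    Psp = P.sum(axis=0, keepdims=True)   # marginal P(s')
    nz = P > 0                           # 0 log 0 is taken as 0
    return float((P[nz] * np.log2(P[nz] / (Ps * Psp)[nz])).sum())
```

With perfect decoding of N equiprobable stimuli the table is diagonal and the measure reaches its maximal value of log₂ N bits, the property used in the text to quantify how well the network is performing.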
Again, a high value of this measure indicates good performance.

For completeness, we provide further specification of the two information theoretic measures, which are described in detail by Rolls and Milward (2000) (see Rolls, 2008b, and Rolls and Treves, 2011, for an introduction to the concepts). The measures assess the extent to which either a single cell, or a population of cells, responds to the same stimulus invariantly with respect to its location, yet responds differently to different stimuli. The measures effectively show what one learns about which stimulus was presented from a single presentation of the stimulus at any randomly chosen location. Results for top (4th) layer cells are shown. High information measures thus show that cells fire similarly to the different transforms of a given stimulus (object), and differently to the other stimuli. The single cell stimulus-specific information, I(s, R), is the amount of information the set of responses, R, has about a specific stimulus, s (see Rolls et al., 1997c; Rolls and Milward, 2000). I(s, R) is given by

I(s, R) = Σ_(r∈R) P(r|s) log₂ [ P(r|s) / P(r) ]    (12)

where r is an individual response from the set of responses R of the neuron. For each cell the performance measure used was the maximum amount of information a cell conveyed about any one stimulus. This (rather than the mutual information, I(S, R), where S is the whole set of stimuli s) is appropriate for a competitive network in which the cells tend to become tuned to one stimulus. (I(s, R) has more recently been called the stimulus-specific surprise (DeWeese and Meister, 1999; Rolls and Treves, 2011). Its average across stimuli is the mutual information I(S, R).)

5.2. INITIAL EXPERIMENTS WITH VisNet
Having established a network model, Wallis and Rolls (1997), following a first report by Wallis et al. (1993), described four experiments in which the theory of how invariant representations could be formed was tested using a variety of stimuli undergoing a number of natural transformations. In each case the network produced neurons in the final layer whose responses were largely invariant across a transformation and highly discriminating between stimuli or sets of stimuli. A summary showing how the network performed is presented here, with much more evidence of the factors that influence the network's performance described elsewhere (Wallis and Rolls, 1997; Rolls, 2008b).

5.2.1. "T," "L," and "C" as stimuli: learning translation invariance
One of the classical properties of inferior temporal cortex face cells is their invariant response to face stimuli translated across the visual field (Tovee et al., 1994). In this first experiment, the learning of translation-invariant representations by VisNet was investigated.

In order to test the network, a set of three stimuli, based upon probable 3D edge cues – consisting of a "T," "L," and "C" shape – was constructed. (Chakravarty (1979) describes the application of these shapes as cues for the 3D interpretation of edge junctions, and Tanaka et al. (1991) have demonstrated the existence of cells responsive to such stimuli in IT.) These stimuli were chosen partly because of their significance as form cues, but on a more practical note because they each contain the same fundamental features – namely a horizontal bar conjoined with a vertical bar. In practice this means that the oriented simple cell filters of the input layer cannot distinguish these stimuli on the basis of which features are present. As a consequence of this, the representation of the stimuli received by the network is non-orthogonal and hence considerably more difficult to classify than was the case in earlier experiments involving the trace rule described by Földiák (1991). The expectation is that layer 1 neurons would learn to respond to spatially selective combinations of the basic features, thereby helping to distinguish these non-orthogonal stimuli. It is important that neurons at early stages of feature hierarchy networks respond to combinations of features in defined relative spatial positions, before invariance is built into the system, as this is part of the way that the binding problem is solved, as described in more detail in Section 5.4 and by Elliffe et al. (2002). The feature combination tuning is illustrated by the VisNet layer 1 neuron shown in Figures 12 and 13.

The trajectory followed by each stimulus consisted of sweeping left to right horizontally across three locations in the top row, then sweeping back, right to left, across the middle row, before returning to the right hand side across the bottom row – tracing out a "Z" shape path across the retina. Unless stated otherwise, this pattern of nine presentation locations was adopted in all image translation experiments described by Wallis and Rolls (1997). Training was carried out by permutatively presenting all stimuli in each location a total of 800 times. The sequence described above was followed for each stimulus, with the sequence start point and direction of sweep being chosen at random for each of the 800 training trials.

Figures 12 and 13 show the response after training of a first layer neuron selective for the "T" stimulus. The weighted sum of all filter inputs reveals the combination of horizontally and vertically tuned filters in identifying the stimulus. In this case many connections to the lower frequency filters have been reduced to zero by the learning process, except at the relevant orientations. This contrasts strongly with the random wiring present before training (Wallis and Rolls, 1997; Rolls, 2008b).

The results for layer 4 neurons are illustrated in Figure 14. By this stage translation-invariant, stimulus-identifying, cells have emerged. The response profiles confirm the high level of neural selectivity for a particular stimulus irrespective of location. Neurons in layers 2 and 3 of VisNet had intermediate levels of translation invariance to those illustrated for layer 1 and layer 4: the tolerance to shifts of the preferred stimulus gradually builds up through the layers.

The trace used in VisNet enables successive features that, based on the natural statistics of the visual input, are likely to be from the same object or feature complex to be associated together. For good performance, the temporal trace needs to be sufficiently long that it covers the period in which features seen by a particular neuron in the hierarchy are likely to come from the same object. On the other hand, the trace should not be so long that it produces associations between features that are parts of different objects, seen when, for example, the eyes move to another object. One possibility is to reset the trace during saccades between different objects. If explicit trace resetting is not implemented, then the trace should, to optimize the compromise implied by the above, lead to strong associations between temporally close stimuli, and increasingly weaker associations between temporally more distant stimuli. In fact, the trace implemented in VisNet has an exponential decay, and it has been shown that this form is optimal in the situation where the exact duration over which the same object is being viewed varies, and where the natural statistics of the visual input happen also to show a decreasing probability that the same object is being viewed as the time period in question increases (Wallis and Baddeley, 1997). Moreover, performance can be enhanced if the duration of the trace does at the same time approximately match the period over which the input stimuli are likely to come from the same object or feature complex (Wallis and Rolls, 1997; Rolls, 2008b). Nevertheless, good performance can be obtained in conditions under which the trace rule allows associations to be formed only between successive items in the visual stream (Rolls and Milward, 2000; Rolls and Stringer, 2001).

FIGURE 12 | The left graph shows the response of a layer 1 neuron to the three training stimuli for the nine training locations. Alongside this are the results of summating all the filter inputs to the neuron. The discrimination factor for this cell was 1.04.

FIGURE 13 | The connections to a single cell in layer 1 of VisNet from the filters after training in the T, L, and C stimulus set, represented by plotting the receptive fields of every input layer cell connected to the particular layer 1 cell. Separate input layer cells have activity that represents a positive (P) or negative (N) output from the bank of filters, which have different orientations in degrees (the columns) and different spatial frequencies (the rows). Here the overall receptive field of the layer 1 cell is centered just below the center-point of the retina. The connection scheme allows for relatively fewer connections to lower frequency cells than to high-frequency cells in order to cover a similar region of the input at each frequency. The blank squares indicate that no connection exists between the layer 1 cell chosen and the filters of that particular orientation, sign, and spatial-frequency.
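The "Z"-shaped training sweep with randomized start point and direction can be sketched as follows (an illustrative fragment; the grid coordinates and helper name are hypothetical, and the original implementation may have differed in detail):

```python
import random

# Nine training locations on a 3 x 3 grid (row, column); the "Z"-shaped
# sweep runs along the top row, back across the middle row, and along
# the bottom row.
Z_PATH = [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (1, 0), (2, 0), (2, 1), (2, 2)]

def training_sweep(rng=random):
    """One training trial: the Z-shaped path with the direction of sweep
    and the start point chosen at random, as in the 800-trial protocol."""
    path = list(Z_PATH)
    if rng.random() < 0.5:                # random direction of sweep
        path.reverse()
    start = rng.randrange(len(path))
    return path[start:] + path[:start]    # random start point (wrapping)
```

The point of the sweep is that temporally adjacent presentations are spatially adjacent transforms of the same stimulus, which is exactly the statistic the trace rule exploits.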
FIGURE 14 | Response profiles for two fourth layer neurons – discrimination factors 4.07 and 3.62 – in the L, T, and C experiment. formed only between successive items in the visual stream (Rolls times, in order to learn about the larger scale properties that char- and Milward, 2000; Rolls and Stringer, 2001). acterize individual objects, including, for example, different views It is also the case that the optimal value of  in the trace rule is of objects observed as an object turns or is turned. Thus the sug- likely to be different for different layers of VisNet, and for cortical gestion is made that the temporal trace could be effectively longer processing in the “what” visual stream. For early layers of the sys- at later stages (e.g., inferior temporal visual cortex) compared to tem, small movements of the eyes might lead to different feature early stages (e.g., V2 and V4) of processing in the visual system. In combinations providing the input to cells (which at early stages addition, as will be shown in Section 5.4, it is important to form have small receptive fields), and a short duration of the trace would feature combinations with high-spatial precision before invariance be optimal. However, these small eye movements might be around learning supported by a temporal trace starts, in order that the fea- the same object, and later layers of the architecture would bene- ture combinations and not the individual features have invariant fit from being able to associate together their inputs over longer representations. This leads to the suggestion that the trace rule Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 23 Rolls Invariant visual object recognition should either not operate, or be short, at early stages of cortical to look more like any other image in that same location than the visual processing such as V1. This is reflected in the operation of same image presented elsewhere. 
A simple competitive network VisNet2, which does not use a temporal trace in layer 1 (Rolls and using just Hebbian learning will thus tend to categorize images by Milward, 2000). where they are rather than what they are – the exact opposite of what the net was intended to learn. This comparison thus indi- cates that a small memory trace acting in the standard Hebbian 5.2.2. Faces as stimuli: translation invariance learning paradigm can radically alter the normal vector averaging, The aim of the next set of experiments described by Wallis and image classification, performed by a Hebbian-based competitive Rolls (1997) was to start to address the issues of how the net- network. work operates when invariant representations must be learned for In order to check that there was an invariant representation a larger number of stimuli, and whether the network can learn in layer 4 of VisNet that could be read by a receiving popula- when much more complicated, real biological stimuli, faces, are tion of neurons, a fifth layer was added to the net which fully used. sampled the fourth layer cells. This layer was in turn trained in Figure 15 contrasts the measure of invariance, or discrimi- a supervised manner using gradient descent or with a Hebbian nation factor, achieved by cells in the four layers, averaged over associative learning rule. (Wallis and Rolls, 1997) showed that the five separate runs of the network (Wallis and Rolls, 1997; Rolls, object classification performed by the layer 5 network was better 2008b). Translation invariance clearly increases through the layers, if the network had been trained with the trace rule than when it as expected. was untrained or was trained with a Hebb rule. Having established that invariant cells have emerged in the final layer, we now consider the role of the trace rule, by assessing the network tested under two new conditions. Firstly, the performance 5.2.3. 
Faces as stimuli: view-invariance of the network was measured before learning occurs, that is with Given that the network had been shown to be able to operate its initially random connection weights. Secondly, the network usefully with a more difficult translation invariance problem, we was trained with  in the trace rule set to 0, which causes learn- next addressed the question of whether the network can solve ing to proceed in a traceless, standard Hebbian, fashion. (Hebbian other types of transform invariance, as we had intended. The next learning is purely associative Rolls, 2008b.) Figure 16 shows the experiment addressed this question, by training the network on the results under the three training conditions. The results show that problem of 3D stimulus rotation, which produces non-isomorphic the trace rule is the decisive factor in establishing the invariant transforms, to determine whether the network can build a view- responses in the layer 4 neurons. It is interesting to note that the invariant categorization of the stimuli (Wallis and Rolls, 1997). Hebbian learning results are actually worse than those achieved by The trace rule learning paradigm should, in conjunction with the chance in the untrained net. In general, with Hebbian learning, architecture described here, prove capable of learning any of the the most highly discriminating cells barely rate higher than 1. This transforms tolerated by IT neurons, so long as each stimulus is pre- value of discrimination corresponds to the case in which a cell sented in short sequences during which the transformation occurs responds to only one stimulus and in only one location. The poor and can be learned. This experiment continued with the use of performance with the Hebb rule comes as a direct consequence faces but now presented them centrally in the retina in a sequence of the presentation paradigm being employed. If we consider an of different views of a face (Wallis and Rolls, 1997; Rolls, 2008b). 
image as representing a vector in multidimensional space, a partic- The faces were again smoothed at the edges to erase the harsh ular image in the top left-hand corner of the input retina will tend image boundaries, and the D.C. term was removed. During the 800 epochs of learning, each stimulus was chosen at random, and FIGURE 15 | Variation in network performance for the top 30 most FIGURE 16 | Variation in network performance for the top 30 most highly discriminating cells through the four layers of the network, highly discriminating cells in the fourth layer for the three training averaged over five runs of the network. The net was trained on 7 faces regimes, averaged over five runs of the network. The net was trained on each in 9 locations. 7 faces each in 9 locations. Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 24 Rolls Invariant visual object recognition a sequence of preset views of it was shown, rotating the face either where the trace yN is updated according to to the left or to the right. yN D .1 / y C yN . (15) Although the actual number of images being presented is smaller, some 21 views in all, there is good reason to think that The parameter 2 [0, 1] controls the relative contributions to this problem may be harder to solve than the previous transla- the trace yN from the instantaneous firing rate y and the trace at tion experiments. This is simply due to the fact that all 21 views 1 the previous time step yN , where for D 0 we have yN D y and exactly overlap with one another. The net was indeed able to solve equation (14) becomes the standard Hebb rule the invariance problem, with examples of invariant layer 4 neuron response profiles appearing in Figure 17. w D y x . (16) Further analyses confirmed the good performance on view- At the start of a series of investigations of different forms of the invariance learning (Wallis and Rolls, 1997; Rolls, 2008b). 
trace-learning rule (Rolls and Milward, 2000) demonstrated that 5.3. DIFFERENT FORMS OF THE TRACE-LEARNING RULE, AND THEIR VisNet’s performance could be greatly enhanced (see Figure 18) RELATION TO ERROR CORRECTION AND TEMPORAL DIFFERENCE with a modified Hebbian trace-learning rule (equation (17)) that LEARNING incorporated a trace of activity from the preceding time steps, with The original trace-learning rule used in the simulations of Wallis no contribution from the activity being produced by the stimulus and Rolls (1997) took the form at the current time step. This rule took the form w D yN x (14) w D y x . (17) j j j j FIGURE 17 | Response profiles for cells in the last two layers of the network – discrimination factors 11.12 and 12.40 – in the experiment with seven different views of each of three faces. FIGURE 18 | Numerical results with the standard trace rule (14), the trained on 7 faces in 9 locations: single cell information measure (left), modified trace-learning rule (17), the Hebb rule (16), and random weights, multiple cell information measure (right). (After Rolls and Stringer, 2001a.) Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 25 Rolls Invariant visual object recognition The trace shown in equation (17) is in the post-synaptic term, w D yN .1 / y x and similar effects were found if the trace was in the presynaptic term, or in both the pre- and the post-synaptic terms. The crucial 1  1 (19) D yN y x difference from the earlier rule (see equation (14)) was that the trace should be calculated up to only the preceding timestep, with D O yN y x no contribution to the trace from the firing on the current trial to the current stimulus. How might this be understood? where O D and D . 
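To make the trace computation concrete, the trace update of equation (15) and the learning rules (14), (16), and (17) can be sketched in a few lines of Python. This is a minimal single-neuron illustration, not the actual VisNet code: the learning rate, the value of η, and the toy input vectors are illustrative assumptions.

```python
import numpy as np

def update_trace(y, y_bar_prev, eta=0.8):
    """Trace update, equation (15): y_bar(tau) = (1 - eta) * y(tau) + eta * y_bar(tau - 1)."""
    return (1.0 - eta) * y + eta * y_bar_prev

def trace_rule(w, x, y_bar, alpha=0.1):
    """Original trace rule, equation (14): dw_j = alpha * y_bar * x_j.
    With eta = 0 the trace equals the current rate y, giving the Hebb rule (16)."""
    return w + alpha * y_bar * x

# Toy sequence: two transforms of the same stimulus seen in succession.
rng = np.random.default_rng(0)
w = rng.uniform(size=4)
transforms = [np.array([1.0, 0.0, 1.0, 0.0]),  # stimulus at position 1
              np.array([0.0, 1.0, 1.0, 0.0])]  # same stimulus at position 2
y_bar = 0.0
for tau, x in enumerate(transforms):
    y = float(w @ x)                 # linear activation, for illustration only
    if tau > 0:
        # Modified rule, equation (17): only the trace from tau - 1 is used,
        # with no contribution from the current firing y.
        w = w + 0.1 * y_bar * x
    y_bar = update_trace(y, y_bar)
```

Note that with the modified rule the weight update for the second transform is driven entirely by the firing produced by the first transform, which is what associates the two transforms onto the same output neuron.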
The trace shown in equation (17) is in the post-synaptic term, and similar effects were found if the trace was in the presynaptic term, or in both the pre- and the post-synaptic terms. The crucial difference from the earlier rule (see equation (14)) was that the trace should be calculated up to only the preceding timestep, with no contribution to the trace from the firing on the current trial to the current stimulus. How might this be understood?

One way to understand this is to note that the trace rule is trying to set up the synaptic weight on trial τ based on whether the neuron, based on its previous history, is responding to that stimulus (in other transforms, e.g., positions). Use of the trace rule at τ − 1 does this; that is, it takes into account the firing of the neuron on previous trials, with no contribution from the firing being produced by the stimulus on the current trial. On the other hand, use of the trace at time τ in the update takes into account the current firing of the neuron to the stimulus in that particular position, which is not a good estimate of whether that neuron should be allocated to invariantly represent that stimulus. Effectively, using the trace at time τ introduces a Hebbian element into the update, which tends to build position-encoded analyzers rather than stimulus-encoded analyzers. (The argument has been phrased for a system learning translation invariance, but applies to the learning of all types of invariance.) A particular advantage of using the trace at τ − 1 is that the trace will then on different occasions (due to the randomness in the location sequences used) reflect previous histories with different sets of positions, enabling the learning of the neuron to be based on evidence from the stimulus present in many different positions. Using a term from the current firing in the trace (i.e., the trace calculated at time τ) means that this desirable effect always carries an undesirable element from the current firing of the neuron to the stimulus in its current position.

5.3.1. The modified Hebbian trace rule and its relation to error correction
The rule of equation (17) corrects the weights using a post-synaptic trace obtained from the previous firing (produced by other transforms of the same stimulus), with no contribution to the trace from the current post-synaptic firing (produced by the current transform of the stimulus). Indeed, insofar as the current firing y^τ is not the same as ȳ^(τ−1), this difference can be thought of as an error. This leads to a conceptualization of using the difference between the current firing and the preceding trace as an error correction term, as noted in the context of modeling the temporal properties of classical conditioning by Sutton and Barto (1981), and developed next in the context of invariance learning (see Rolls and Stringer, 2001).

First, we re-express the rule of equation (17) in an alternative form as follows. Suppose we are at timestep τ and have just calculated a neuronal firing rate y^τ and the corresponding trace ȳ^τ from the trace update equation (15). If we assume η ∈ (0, 1), then rearranging equation (15) gives

ȳ^(τ−1) = (1/η) [ȳ^τ − (1 − η) y^τ] ,    (18)

and substituting equation (18) into equation (17) gives

δw_j = (α/η) [ȳ^τ − (1 − η) y^τ] x_j^τ
     = (α(1 − η)/η) [ȳ^τ/(1 − η) − y^τ] x_j^τ    (19)
     = α̂ [ŷ^τ − y^τ] x_j^τ ,

where α̂ = α(1 − η)/η and ŷ^τ = ȳ^τ/(1 − η). The modified Hebbian trace-learning rule (17) is thus equivalent to equation (19), which is in the general form of an error correction rule (Hertz et al., 1991). That is, rule (19) involves the subtraction of the current firing rate y^τ from a target value, in this case ŷ^τ.

Although above we have referred to rule (17) as a modified Hebbian rule, we note that it is only associative in the sense of associating previous cell firing with the current cell inputs. In the next section we continue to explore the error correction paradigm, examining five alternative examples of this sort of learning rule.

5.3.2. Five forms of error correction learning rule
Error correction learning rules are derived from gradient descent minimization (Hertz et al., 1991), and continually compare the current neuronal output to a target value t, adjusting the synaptic weights at a particular timestep according to

δw_j = α (t − y^τ) x_j^τ .    (20)

In this usual form of gradient descent by error correction, the target t is fixed. However, in keeping with our aim of encouraging neurons to respond similarly to images that occur close together in time, it seems reasonable to set the target at a particular timestep, t^τ, to be some function of cell activity occurring close in time, because encouraging neurons to respond to temporal classes will tend to make them respond to the different variants of a given stimulus (Földiák, 1991; Rolls, 1992; Wallis and Rolls, 1997). For this reason, Rolls and Stringer (2001) explored a range of error correction rules in which the targets t^τ are based on the trace of neuronal activity calculated according to equation (15). We note that although the target is not a fixed value as in standard error correction learning, the new learning rules nevertheless perform gradient descent on each timestep, as elaborated below. Although the target may vary early on in learning, as learning proceeds the target is expected to become more and more constant, as neurons settle to respond invariantly to particular stimuli. The first set of five error correction rules we discuss are as follows:

δw_j = α (ȳ^(τ−1) − y^τ) x_j^τ ,    (21)
δw_j = α (y^(τ−1) − y^τ) x_j^τ ,    (22)
δw_j = α (ȳ^τ − y^τ) x_j^τ ,    (23)
δw_j = α (ȳ^(τ+1) − y^τ) x_j^τ ,    (24)
δw_j = α (y^(τ+1) − y^τ) x_j^τ ,    (25)

where updates (21–23) are performed at timestep τ, and updates (24) and (25) are performed at timestep τ + 1. (The reason for adopting this convention is that the basic form of the error correction rule (20) is kept, with the five different rules simply replacing the term t.) It may be readily seen that equations (22) and (25) are special cases of equations (21) and (24), respectively, with η = 0. These rules are all similar except for their targets t^τ, which are all functions of a temporally nearby value of cell activity. In particular, rule (23) is directly related to rule (19), but is more general in that the parameter α̂ = α(1 − η)/η is replaced by an unconstrained parameter α. In addition, we also note that rule (21) is closely related to a rule developed by Peng et al. (1998) for view-invariance learning. The above five error correction rules are biologically plausible in that the targets t^τ are all local cell variables (see Rolls and Treves, 1998 and Rolls, 2008b). In particular, rule (23) uses the trace ȳ^τ from the current time level τ, and rules (22) and (25) do not need exponential trace values ȳ, relying instead only on the instantaneous firing rates at the current and immediately preceding timesteps. However, all five error correction rules involve decrementing of synaptic weights according to an error which is calculated by subtracting the current activity from a target.

Numerical results with the error correction rules trained on 7 faces in 9 locations are presented by Rolls and Stringer (2001). For all the results the synaptic weights were clipped to be positive during the simulation, because it is important to test whether decrementing synaptic weights purely within the positive interval w ∈ [0, 1] will provide significantly enhanced performance. That is, it is important to show that error correction rules do not necessarily require possibly biologically implausible modifiable negative weights. For each of the rules (21–25), the parameter α was individually optimized to the following respective values: 4.9, 2.2, 2.2, 3.8, 2.2. All five error correction rules offer considerably improved performance over both the standard trace rule (14) and rule (17). Networks trained with rule (21) performed best, and this is probably due to two reasons. Firstly, rule (21) incorporates an exponential trace ȳ^(τ−1) in its target t^τ, and we would expect this to help neurons to learn more quickly to respond invariantly to a class of inputs that occur close together in time. Hence, setting η = 0 as in rule (22) results in reduced performance. Secondly, unlike rules (23) and (24), rule (21) does not contain any component of y^τ in its target. If we examine rules (23) and (24), we see that their respective targets ȳ^τ and ȳ^(τ+1) contain significant components of y^τ.
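As an illustration, the generic error-correction step of equation (20), the five targets of rules (21)–(25), and the positive weight clipping described above can be sketched as follows. This is a schematic single-step sketch under assumed toy firing rates and an assumed learning rate, not the simulation code of Rolls and Stringer (2001).

```python
import numpy as np

def error_correction_step(w, x, y, target, alpha=1.0):
    """Generic error-correction update, equation (20): dw_j = alpha * (t - y) * x_j,
    followed by clipping the weights into the positive interval [0, 1]."""
    return np.clip(w + alpha * (target - y) * x, 0.0, 1.0)

# The five targets of rules (21)-(25), expressed as choices among the
# instantaneous rates y and the exponentially decaying traces y_bar:
def target_rule_21(a): return a["y_bar_prev"]  # trace at tau - 1
def target_rule_22(a): return a["y_prev"]      # rate at tau - 1 (eta = 0 case of 21)
def target_rule_23(a): return a["y_bar_now"]   # trace at tau
def target_rule_24(a): return a["y_bar_next"]  # trace at tau + 1 (applied at tau + 1)
def target_rule_25(a): return a["y_next"]      # rate at tau + 1 (applied at tau + 1)

# Toy usage of rule (22): pull the current firing toward the firing produced
# by the previous transform of the same stimulus.
activity = {"y_prev": 0.9, "y_bar_prev": 0.0, "y_bar_now": 0.0,
            "y_bar_next": 0.0, "y_next": 0.0}
w = np.array([0.2, 0.2, 0.2])
x = np.array([1.0, 0.0, 1.0])
y = float(w @ x)                                 # current firing, 0.4
w = error_correction_step(w, x, y, target_rule_22(activity), alpha=0.5)
# w is now [0.45, 0.2, 0.45]: weights from active inputs moved toward the target.
```

Rules (24) and (25) would be applied one step later, when the firing at τ + 1 is available, matching the convention in the text.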
Relationship to temporal difference learning weighting on the vectorsr y . Equation (28) represents a much Rolls and Stringer (2001) not only considered the relationship of broader class of learning rules than the more usual gradient rule (17) to error correction, but also considered how the error cor- descent-based rule (27), which is in fact the special case TD(1). rection rules shown in equations (21–25) are related to temporal A further special case of equation (28) is for D 0, i.e., TD(0), difference learning (Sutton, 1988; Sutton and Barto, 1998). Sut- as follows ton (1988) described temporal difference methods in the context C1 of prediction learning. These methods are a class of incremen- w D .y y /r y . (29) tal learning techniques that can learn to predict final outcomes through comparison of successive predictions from the preceding But for problems where y is a linear function of x and w, we time steps. This is in contrast to traditional supervised learning, haver y D x , and so equation (29) becomes which involves the comparison of predictions only with the final C1 outcome. Consider a series of multistep prediction problems in w D y y x . (30) which for each problem there is a sequence of observation vectors, 1 2 m x , x , :::, x , at successive timesteps, followed by a final scalar If we assume the prediction process is being performed by a neuron outcome z. For each sequence of observations temporal differ- with a vector of inputs x , synaptic weight vector w, and output 1 2 m  T ence methods form a sequence of predictions y , y , :::, y , each y D w x , then we see that the TD(0) algorithm (30) is identical of which is a prediction of z. These predictions are based on the to the error correction rule (25) with D 1. 
In understanding this Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 27 Rolls Invariant visual object recognition k k comparison with temporal difference learning, it may be useful to where the term  x is a weighted sum of the vectors kD1 note that the firing at the end of a sequence of the transformed x . This suggests generalizing the original five error correction exemplars of a stimulus is effectively the temporal difference tar- rules (21–25) by replacing the term x by a weighted sum xO D get z. This establishes a link to temporal difference learning (Rolls, k x with 2 [0, 1]. In Sutton (1988) xO is calculated kD1 j j 2008b). Further, we note that from learning epoch to learning according to epoch, the target z for a given neuron will gradually settle down to be more and more fixed as learning proceeds. xO D x C  xO (33) j j j We now explore in more detail the relation between the error correction rules described above and temporal difference learn- ing. For each sequence of observations with a single outcome the with xO  0. This gives the following five temporal difference- temporal difference method (30), when viewed as an error correc- inspired error correction rules C1 tion rule, is attempting to adapt the weights such that y D y for all successive pairs of time steps – the same general idea w D yN y xO , (34) underlying the error correction rules (21–25). Furthermore, in Sutton and Barto (1998), where temporal difference methods are w D y y xO , (35) applied to reinforcement learning, the TD() approach is again w D y y xO , (36) C1 j further generalized by replacing the target y by any weighted C1 average of predictions y from arbitrary future timesteps, e.g., w D y y xO , (37) 1 C3 1 C7 t D y C y , including an exponentially weighted average C1 2 2 w D y y xO , (38) extending forward in time. 
So a more general form of the temporal difference algorithm has the form where it may be readily seen that equation (35) and (38) are spe- cial cases of equations (34) and (37), respectively, with D 0. As w D t y x , (31) with the trace yN , the term xO is reset to zero when a new stimulus is presented. These five rules can be related to the more general where here the target t is an arbitrary weighted average of the TD() algorithm, but continue to be biologically plausible using predictions y over future timesteps. Of course, with standard tem- only local cell variables. Setting D 0 in rules (34–38), gives us poral difference methods the target t is always an average over back the original error correction rules (21–25) which may now future timesteps kD C 1,  C 2, etc. But in the five error cor- be related to TD(0). rection rules this is only true for the last exemplar (25). This is Numerical results with error correction rules (34–38), and because with the problem of prediction, for example, the ultimate xO calculated according to equation (33) with D 1, with pos- 1 m mC1 j target of the predictions y ,:::,y is a final outcome y z. itive clipping of weights, trained on 7 faces in 9 locations are However, this restriction does not apply to our particular applica- presented by Rolls and Stringer (2001). For each of the rules tion of neurons trained to respond to temporal classes of inputs (34–38), the parameter has been individually optimized to 1 m within VisNet. Here we only wish to set the firing rates y ,:::,y the following respective values: 1.7, 1.8, 1.5, 1.6, 1.8. Compar- to the same value, not some final given value z. However, the more ing these five temporal difference-inspired rules it was found general error correction rules clearly have a close relationship to that the best performance is obtained with rule (38) where standard temporal difference algorithms. 
For example, it can be many more cells reach the maximum level of performance pos- seen that equation (22) with D 1 is in some sense a temporal sible with respect to the single cell information measure. In mirror image of equation (30), particularly if the updates w are fact, this rule offered the best such results. This may well be added to the weights w only at the end of a sequence. That is, rule due to the fact that this rule may be directly compared to the 1 m 0 (22) will attempt to set y ,:::,y to an initial value y  0. This standard TD(1) learning rule, which itself may be related to relationship to temporal difference algorithms allows us to begin classical supervised learning for which there are well known to exploit established temporal difference analyses to investigate optimality results, as discussed further by Rolls and Stringer the convergence properties of the error correction methods (Rolls (2001). and Stringer, 2001). From the simulations described by Rolls and Stringer (2001) Although the main aim of Rolls and Stringer (2001) in relat- it appears that the form of optimization described above associ- ing error correction rules to temporal difference learning was ated with TD(1) rather than TD(0) leads to better performance to begin to exploit established temporal difference analyses, they within VisNet. The TD(1)-like rule (38) with D 1.0 and D 1.8 observed that the most general form of temporal difference learn- gave considerably superior results to the TD(0)-like rule (25) with ing, TD(), in fact suggests an interesting generalization to the D 2.2. In fact, the former of these two rules provided the best existing error correction learning rules for which we currently have single cell information results in these studies. We hypothesize that D 0. 
Assuming y D w x andr y D x , the general equation these results are related to the fact that only a finite set of image (28) for TD() becomes sequences is presented to VisNet, and so the type of optimization performed by TD(1) for repeated presentations of a finite data set C1  k k is more appropriate for this problem than the form of optimization w D y y  x (32) performed by TD(0). kD1 Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 28 Rolls Invariant visual object recognition 5.3.4. Discussion of the different training rules single neuron with a number of inputs x and output y D w x , j j j In terms of biological plausibility, we note the following. First, where w are the synaptic weights. We assume that there are a all the learning rules investigated by Rolls and Stringer (2001) number of input patterns and that for the k th input pattern, k k k T k k are local learning rules, and in this sense are biologically plausi- x D Tx , x , ...U , the output y has a target value t . Hence an 1 2 ble (Rolls and Treves, 1998; Rolls, 2008b). (The rules are local in error measure or cost function can be defined as that the terms used to modify the synaptic weights are potentially 0 1 available in the pre- and post-synaptic elements.) X X X 1 1 k k k k @ A e .w/ D t y D t w x . (39) Second we note that all the rules do require some evidence of j 2 2 k k j the activity on one or more previous stimulus presentations to be available when the synaptic weights are updated. Some of the rules, e.g., learning rule (23), use the trace yN from the current This cost function is a function of the input patterns x and the synaptic weight vector wD [w , w ,:::] . With a fixed set of input time level, while rules (22) and (25) do not need to use an expo- 1 2 nential trace of the neuronal firing rate, but only the instantaneous patterns, we can reduce the error measure by employing a gradient firing rates y at two successive time steps. 
It is known that synaptic plasticity does involve a combination of separate processes, each with potentially differing time courses (Koch, 1999), and these different processes could contribute to trace rule learning. Another mechanism suggested for implementing a trace of previous neuronal activity is the continuing firing, often for 300 ms, produced by a short (16 ms) presentation of a visual stimulus (Rolls and Tovee, 1994), which is suggested to be implemented by local cortical recurrent attractor networks (Rolls and Treves, 1998).

Third, we note that in utilizing the trace in the targets t, the error correction (or temporal difference-inspired) rules perform a comparison of the instantaneous firing y with a temporally nearby value of the activity, and this comparison involves a subtraction. The subtraction provides an error, which is then used to increase or decrease the synaptic weights. This is a somewhat different operation from long-term depression (LTD) as well as long-term potentiation (LTP), which are associative changes that depend on the pre- and post-synaptic activity. However, it is interesting to note that an error correction rule which appears to involve a subtraction of current firing from a target might be implemented by a combination of an associative process operating with the trace, and an anti-Hebbian process operating to remove the effects of the current firing. For example, the synaptic update dw_j = beta (t - y) x_j can be decomposed into two separate associative processes, beta t x_j and -beta y x_j, that may occur independently. (The target, t, could in this case be just the trace of previous neural activity from the preceding trials, excluding any contribution from the current firing.) Another way to implement an error correction rule using associative synaptic modification would be to force the post-synaptic neuron to respond to the error term. Although this has been postulated to be an effect which could be implemented by the climbing fiber system in the cerebellum (Ito, 1984, 1989; Rolls and Treves, 1998), there is no similar system known for the neocortex, and it is not clear how this particular implementation of error correction might operate in the neocortex.

In Section 5.3.2 we describe five learning rules as error correction rules. We now discuss an interesting difference of these error correction rules from error correction rules as conventionally applied. It is usual to derive the general form of an error correction learning rule from gradient descent minimization in the following way (Hertz et al., 1991). Consider the idealized situation of a neuron whose output y^k = sum_j w_j x_j^k for input pattern k should match a target t^k; the error e(w) = (1/2) sum_k (t^k - y^k)^2 can then be minimized by using a gradient descent algorithm to calculate an improved set of synaptic weights. Gradient descent achieves this by moving downhill on the error surface defined in w space using the update

    dw_j = -beta de/dw_j = beta sum_k (t^k - y^k) x_j^k.    (40)

If we update the weights after each pattern k, then the update takes the form of an error correction rule

    dw_j = beta (t^k - y^k) x_j^k,    (41)

which is also commonly referred to as the delta rule or Widrow-Hoff rule (see Widrow and Hoff, 1960; Widrow and Stearns, 1985). Error correction rules continually compare the neuronal output with its pre-specified target value and adjust the synaptic weights accordingly. In contrast, the way of utilizing error correction that Rolls and Stringer (2001) introduced is to specify the target as the activity trace based on the firing rate at nearby timesteps. Now the actual firing at those nearby time steps is not a pre-determined fixed target, but instead depends on how the network has actually evolved. This effectively means that the cost function e(w) that is being minimized changes from timestep to timestep. Nevertheless, the concept of calculating an error, and using the magnitude and direction of the error to update the synaptic weights, is the similarity that Rolls and Stringer (2001) drew to gradient descent learning.

To conclude this discussion, the error correction and temporal difference rules explored by Rolls and Stringer (2001) provide interesting approaches to help understand invariant pattern recognition learning. Although we do not know whether the full power of these rules is expressed in the brain, we provided suggestions about how they might be implemented. At the same time, we note that the original trace rule used by Földiák (1991), Rolls (1992), and Wallis and Rolls (1997) is a simple associative rule, is therefore biologically very plausible, and, while not as powerful as many of the other rules introduced by Rolls and Stringer (2001), can nevertheless solve the same class of problem. Rolls and Stringer (2001) also emphasized that although they demonstrated how a number of new error correction and temporal difference rules might play a role in the context of view-invariant object recognition, they may also operate elsewhere where it is important for neurons to learn to respond similarly to temporal classes of inputs that tend to occur close together in time.

Frontiers in Computational Neuroscience | www.frontiersin.org | June 2012 | Volume 6 | Article 35 | 29
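The decomposition of the error correction update of Eq. (41) into an associative process and an anti-Hebbian process can be checked numerically. The following is a minimal toy sketch (not the VisNet implementation; the weight and firing values and the learning rate beta = 0.1 are arbitrary illustrative numbers):

```python
def delta_update(w, x, t, beta=0.1):
    """Error correction (delta / Widrow-Hoff) update of Eq. (41):
    dw_j = beta * (t - y) * x_j, where y = sum_j w_j x_j is the
    neuron's current firing produced by input vector x."""
    y = sum(wj * xj for wj, xj in zip(w, x))
    return [beta * (t - y) * xj for xj in x]

def decomposed_update(w, x, t, beta=0.1):
    """The same update split into an associative term (beta*t*x_j) and
    an anti-Hebbian term (-beta*y*x_j), as discussed in the text."""
    y = sum(wj * xj for wj, xj in zip(w, x))
    return [beta * t * xj - beta * y * xj for xj in x]

w = [0.2, -0.1, 0.4]   # synaptic weights (toy values)
x = [1.0, 0.5, 0.0]    # presynaptic firing
t = 0.7                # target; in the trace-rule variant this would be the
                       # short-term memory trace of the recent firing

combined = delta_update(w, x, t)
split = decomposed_update(w, x, t)
assert all(abs(a - b) < 1e-12 for a, b in zip(combined, split))
```

The assertion confirms that the combined rule and the two separate associative processes produce identical weight changes, which is what allows the two processes to occur independently.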
5.4. THE ISSUE OF FEATURE BINDING, AND A SOLUTION
In this section we investigate two key issues that arise in hierarchical layered network architectures, such as VisNet, other examples of which have been described and analyzed by Fukushima (1980), Ackley et al. (1985), Rosenblatt (1961), and Riesenhuber and Poggio (1999b). One issue is whether the network can discriminate between stimuli that are composed of the same basic alphabet of features. The second issue is whether such network architectures can find solutions to the spatial binding problem. These issues are addressed next and by Elliffe et al. (2002) and Rolls (2008b).

The first issue investigated is whether a hierarchical layered network architecture of the type exemplified by VisNet can discriminate stimuli that are composed of a limited set of features and where the different stimuli include cases where the feature sets are subsets and supersets of those in the other stimuli. An issue is whether, if the network has learned representations of both the parts and the wholes, the network will identify that the whole is present when it is shown, and not just that one or more parts is present. (In many investigations with VisNet, complex stimuli (such as faces) were used where each stimulus might contain unique features not present in the other stimuli.) To address this issue Elliffe et al. (2002) used stimuli that are composed from a set of four features which are designed so that each feature is spatially separate from the other features, and no unique combination of firing caused, for example, by overlap of horizontal and vertical filter outputs in the input representation distinguishes any one stimulus from the others. The results described in Section 5.4.4 show that VisNet can indeed learn correct invariant representations of stimuli which do consist of feature sets where individual features do not overlap spatially with each other and where the stimuli can be composed of sets of features which are supersets or subsets of those in other stimuli. Fukushima and Miyake (1982) did not address this crucial issue where different stimuli might be composed of subsets or supersets of the same set of features, although they did show that stimuli with partly overlapping features could be discriminated by the Neocognitron.

In Section 5.4.5 we address the spatial binding problem in architectures such as VisNet. This computational problem that needs to be addressed in hierarchical networks such as the primate visual system and VisNet is how representations of features can be (e.g., translation) invariant, yet can specify stimuli or objects in which the features must be specified in the correct spatial arrangement. This is the feature binding problem, discussed, for example, by von der Malsburg (1990), and arising in the context of hierarchical layered systems (Rosenblatt, 1961; Fukushima, 1980; Ackley et al., 1985). The issue is whether or not features are bound into the correct combinations in the correct relative spatial positions, or if alternative combinations of known features, or the same features in different relative spatial positions, would elicit the same responses. All this has to be achieved while at the same time producing position-invariant recognition of the whole combination of features, that is, the object. This is a major computational issue that needs to be solved for memory systems in the brain to operate correctly. This can be achieved by what is effectively a learning process that builds into the system a set of neurons in the hierarchical network that enables the recognition process to operate correctly with the appropriate position, size, view, etc. invariances.

5.4.1. Syntactic binding of separate neuronal ensembles by synchronization
The problem of syntactic binding of neuronal representations, in which some features must be bound together to form one object, and other simultaneously active features must be bound together to represent another object, has been addressed by von der Malsburg (1990). He has proposed that this could be performed by temporal synchronization of those neurons that were temporarily part of one representation in a different time slot from other neurons that were temporarily part of another representation. The idea is attractive in allowing arbitrary relinking of features in different combinations. Singer, Engel, Konig, and colleagues (Singer et al., 1990; Engel et al., 1992; Singer and Gray, 1995; Singer, 1999; Fries, 2005, 2009; Womelsdorf et al., 2007), and others (Abeles, 1991) have obtained some evidence that when features must be bound, synchronization of neuronal populations can occur (but see Shadlen and Movshon, 1999), and this has been modeled (Hummel and Biederman, 1992).

Synchronization to implement syntactic binding has a number of disadvantages and limitations (Rolls and Treves, 1998, 2011; Riesenhuber and Poggio, 1999a; Rolls, 2008b). The greatest computational problem is that synchronization does not by itself define the spatial relations between the features being bound, and so as a binding mechanism it is not by itself adequate for shape recognition. For example, temporal binding might enable features 1, 2, and 3, which might define one stimulus, to be bound together and kept separate from, for example, another stimulus consisting of features 2, 3, and 4, but would require a further temporal binding (leading in the end potentially to a combinatorial explosion) to indicate the relative spatial positions of the 1, 2, and 3 in the 123 stimulus, so that it can be discriminated from, e.g., 312.

A second problem with the synchronization approach to the spatial binding of features is that, when stimulus-dependent temporal synchronization has been rigorously tested with information theoretic approaches, it has so far been found that most of the information available is in the number of spikes, with rather little, less than 5% of the total information, in stimulus-dependent synchronization (Franco et al., 2004; Rolls et al., 2004; Aggelopoulos et al., 2005; Rolls, 2008b; Rolls and Treves, 2011). For example, Aggelopoulos et al. (2005) showed that when macaques used object-based attention to search for one of two objects to touch in a complex natural scene, between 94 and 99% of the information was present in the firing rates of inferior temporal cortex neurons, and less than 5% in any stimulus-dependent synchrony that was present between the simultaneously recorded inferior temporal cortex neurons. The implication of these results is that any stimulus-dependent synchrony that is present is not quantitatively important as measured by information theoretic analyses under natural scene conditions when feature binding, segmentation of objects from the background, and attention are required. This has been found for the inferior temporal cortex, a brain region where features are put together to form representations of objects (Rolls and Deco, 2002; Rolls, 2008b), and where attention has strong effects, at least in scenes with blank backgrounds (Rolls et al., 2003). It would of course also be of interest to test the same hypothesis in earlier visual areas, such as V4, with quantitative, information theoretic, techniques (Rolls and Treves, 2011).
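The style of information theoretic comparison described above can be sketched with a toy calculation. The joint probability tables below are invented purely for illustration (they are not the recorded data of Aggelopoulos et al., 2005); the point is only that a response measure whose distribution barely changes with the stimulus carries little Shannon information:

```python
from math import log2

def mutual_information(p_joint):
    """Shannon mutual information I(S;R) in bits from a joint probability
    table p[s][r] over stimuli s and binned neuronal responses r."""
    p_s = [sum(row) for row in p_joint]
    p_r = [sum(col) for col in zip(*p_joint)]
    return sum(p * log2(p / (p_s[i] * p_r[j]))
               for i, row in enumerate(p_joint)
               for j, p in enumerate(row) if p > 0)

# Two stimuli (rows); columns are binned responses. Toy numbers only.
# Spike counts (rate code): the distribution differs strongly across stimuli.
p_rate = [[0.40, 0.10],
          [0.10, 0.40]]
# A nearly stimulus-independent synchrony measure carries little information.
p_sync = [[0.26, 0.24],
          [0.24, 0.26]]

assert mutual_information(p_rate) > 10 * mutual_information(p_sync)
```

With these toy tables the rate code carries roughly 0.28 bits and the synchrony measure about 0.001 bits, mirroring the qualitative finding that most of the information is in the spike counts.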
In connection with rate codes, it should be noted that a rate code implies using the number of spikes that arrive in a given time, and that this time can be very short, as little as 20-50 ms, for very useful amounts of information to be made available from a population of neurons (Tovee et al., 1993; Rolls and Tovee, 1994; Rolls et al., 1994, 1999, 2006a; Tovee and Rolls, 1995; Rolls, 2003, 2008b; Rolls and Treves, 2011).

A third problem with the synchronization or "communication through coherence" approach (Fries, 2005, 2009) is that when information transmission between connected networks is analyzed, synchronization is not produced at the levels of synaptic strength necessary for information transmission between the networks, and indeed does not appear to affect the information transmission between a pair of weakly coupled networks that model weakly coupled cortical networks (Rolls et al., 2012).

In the context of VisNet, and how the real visual system may operate to implement object recognition, the use of synchronization does not appear to match the way in which the visual system is organized. For example, von der Malsburg's argument would indicate that, using only a two-layer network, synchronization could provide the necessary feature linking to perform object recognition with relatively few neurons, because they can be reused again and again, linked differently for different objects. In contrast, the primate uses a considerable part of its cortex, perhaps 50% in monkeys, for visual processing, with therefore what could be in the order of 6 x 10^8 neurons and 6 x 10^12 synapses involved (Rolls, 2008b), so that the solution adopted by the real visual system may be one which relies on many neurons with simpler processing, rather than on the arbitrary syntax implemented by synchronous firing of separate assemblies. On the other hand, a solution such as that investigated by VisNet, which forms low-order combinations of what is represented in previous layers, is very demanding in terms of the number of neurons required, and this matches what is found in the primate visual system.

5.4.2. Sigma-Pi neurons
Another approach to a binding mechanism is to group spatial features based on local mechanisms that might operate for closely adjacent synapses on a dendrite (in what is a Sigma-Pi type of neuron, see Section 7; Finkel and Edelman, 1987; Mel et al., 1998; Rolls, 2008b). A problem for such architectures is how to force one particular neuron to respond to the same feature combination invariantly with respect to all the ways in which that feature combination might occur in a scene.

5.4.3. Binding of features and their relative spatial position by feature combination neurons
The approach to the spatial binding problem that is proposed for VisNet is that individual neurons at an early stage of processing are set up (by learning) to respond to low-order combinations of input features occurring in a given relative spatial arrangement and position on the retina (Rolls, 1992, 1994, 1995; Wallis and Rolls, 1997; Rolls and Treves, 1998; Elliffe et al., 2002; Rolls and Deco, 2002; cf. Feldman, 1985). (By low-order combinations of input features we mean combinations of a few input features. By forming neurons that respond to combinations of a few features in the correct spatial arrangement, the advantages of the scheme for syntactic binding are obtained, yet without the combinatorial explosion that would result if the feature combination neurons responded to combinations of many input features, so producing potentially very specifically tuned neurons which very rarely responded.) Then invariant representations are developed in the next layer from these feature combination neurons, which already contain evidence on the local spatial arrangement of features. Finally, in later layers, only one stimulus would be specified by the particular set of low-order feature combination neurons present, even though each feature combination neuron would itself be somewhat invariant. The overall design of the scheme is shown in Figure 9. Evidence that many neurons in V1 respond to combinations of spatial features with the correct spatial configuration is now starting to appear (see Section 4), and neurons that respond to feature combinations (such as two lines with a defined angle between them, and overall orientation) are found in V2 (Hegde and Van Essen, 2000; Ito and Komatsu, 2004). The tuning of a VisNet layer 1 neuron to a combination of features in the correct relative spatial position is illustrated in Figures 12 and 13.

5.4.4. Discrimination between stimuli with super- and sub-set feature combinations
Some investigations with VisNet (Wallis and Rolls, 1997) have involved groups of stimuli that might be identified by some unique feature common to all transformations of a particular stimulus. This might allow VisNet to solve the problem of transform invariance by simply learning to respond to a unique feature present in each stimulus. For example, even in the case where VisNet was trained on invariant discrimination of T, L, and C, the representation of the T stimulus at the spatial-filter level inputs to VisNet might contain unique patterns of filter outputs where the horizontal and vertical parts of the T join. The unique filter outputs thus formed might distinguish the T from, for example, the L.

Elliffe et al. (2002) tested whether VisNet is able to form transform invariant cells with stimuli that are specially composed from a common alphabet of features, with no stimulus containing any firing in the spatial-filter inputs to VisNet not present in at least one of the other stimuli. The limited alphabet enables the set of stimuli to consist of feature sets which are subsets or supersets of those in the other stimuli.

For these experiments the common pool of stimulus features chosen was a set of two horizontal and two vertical 8 x 1 bars, each aligned with the sides of a 32 x 32 square. The stimuli can be constructed by arbitrary combination of these base level features. We note that effectively the stimulus set consists of four features, a top bar (T), a bottom bar (B), a left bar (L), and a right bar (R). Figure 19 shows the complete set used, containing the possible image feature combinations. Subsequent discussion will group these objects by the number of features each contains: single-, double-, triple-, and quadruple-feature objects correspond to the respective rows of Figure 19. Stimuli are referred to by the list of features they contain; e.g., "LBR" contains the left, bottom, and right features, while "TL" contains top and left only. Further details of how the stimuli were prepared are provided by Elliffe et al. (2002).
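The stimulus alphabet just described can be sketched in a few lines. This is a toy reconstruction assuming one illustrative placement of the bars; the actual stimuli were spatially filtered images whose exact layout is given by Elliffe et al. (2002):

```python
from itertools import combinations

SIZE = 32          # side of the central square on which features sit
FEATURES = "TBLR"  # top, bottom, left, and right 8 x 1 bars

def render(stim):
    """Pixels switched on by a stimulus such as 'LBR'. Bar placement here
    is illustrative only, not the exact filtered images of the paper."""
    span = range(SIZE // 2 - 4, SIZE // 2 + 4)   # 8-pixel bar extent
    pixels = set()
    if "T" in stim: pixels |= {(0, c) for c in span}
    if "B" in stim: pixels |= {(SIZE - 1, c) for c in span}
    if "L" in stim: pixels |= {(r, 0) for r in span}
    if "R" in stim: pixels |= {(r, SIZE - 1) for r in span}
    return pixels

# Single-, double-, triple-, and quadruple-feature objects (rows of Fig. 19).
stimuli = [''.join(c) for n in (1, 2, 3, 4)
           for c in combinations(FEATURES, n)]

assert len(stimuli) == 15                 # 4 + 6 + 4 + 1 objects
assert len(render("TL")) == 16            # two non-overlapping 8-pixel bars
assert render("T") < render("TL") < render("TLB")   # strict subset chain
```

The final assertion makes the subset/superset structure of the stimulus set explicit: the pixels of T are a strict subset of those of TL, which are in turn a strict subset of those of TLB.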
FIGURE 19 | Merged feature objects. All members of the full object set are shown, using a dotted line to represent the central 32 x 32 square on which the individual features are positioned, with the features themselves shown as dark line segments. Nomenclature is by acronym of the features present, where T, top; B, bottom; L, left; and R, right. (After Elliffe et al., 2002.)

To train the network a stimulus was presented in a randomized sequence of nine locations in a square grid across the 128 x 128 input retina of VisNet2. The central location of the square grid was in the center of the "retina," and the eight other locations were offset 8 pixels horizontally and/or vertically from this. Two different learning rules were used, "Hebbian" (16) and "trace" (17), and also an untrained condition with random weights. As in earlier work (Wallis and Rolls, 1997; Rolls and Milward, 2000) only the trace rule led to any cells with invariant responses, and the results shown are for networks trained with the trace rule.

The results with VisNet trained on the set of stimuli shown in Figure 19 with the trace rule are as follows. First, it was found that single neurons in the top layer learned to differentiate between the stimuli, in that the responses of individual neurons were maximal for one of the stimuli and had no response to any of the other stimuli invariantly with respect to location. Moreover, the translation invariance was perfect for every stimulus (by different neurons) over every location (for all stimuli except "RTL" and "TLBR").

The results presented show clearly that the VisNet paradigm can accommodate networks that can perform invariant discrimination of objects that have a subset–superset relationship. The result has important consequences for feature binding and for discriminating stimuli from other stimuli which may be supersets of the first stimulus. For example, a VisNet cell which responds invariantly to the feature combination TL can genuinely signal the presence of exactly that combination, and will not necessarily be activated by T alone, or by TLB. (The basis for this separation by competitive networks of stimuli which are subsets and supersets of each other is described by Rolls and Treves (1998, Section 4.3.6) and by Rolls (2008b).)

5.4.5. Feature binding in a hierarchical network with invariant representations of local feature combinations
In this section we consider the ability of output layer neurons to learn new stimuli if the lower layers are trained solely through exposure to simpler feature combinations from which the new stimuli are composed. A key question we address is how invariant representations of low-order feature combinations in the early layers of the visual system are able to uniquely specify the correct spatial arrangement of features in the overall stimulus and contribute to preventing false recognition errors in the output layer.

The problem, and its proposed solution, can be treated as follows. Consider an object 1234 made from the features 1, 2, 3, and 4. The invariant low-order feature combinations might represent 12, 23, and 34. Then if neurons at the next layer respond to combinations of the activity of these neurons, the only neurons in the next layer that would respond would be those tuned to 1234, not to, for example, 3412, which is distinguished from 1234 by the input of a pair neuron responding to 41 rather than to 23. The argument (Rolls, 1992) is that low-order spatial-feature combination neurons in the early stage contain sufficient spatial information so that a particular combination of those low-order feature combination neurons specifies a unique object, even if the relative positions of the low-order feature combination neurons are not known, because they are somewhat invariant.

The architecture of VisNet is intended to solve this problem partly by allowing high spatial precision combinations of input features to be formed in layer 1. The actual input features in VisNet are, as described above, the output of oriented spatial-frequency tuned filters, and the combinations of these formed in layer 1 might thus be thought of in a simple way as, for example, a T or an L or for that matter a Y. Then in layer 2, application of the trace rule might enable neurons to respond to a T with limited spatial invariance (limited to the size of the region of layer 1 from which layer 2 cells receive their input). Then an "object" such as H might be formed at a higher layer because of a conjunction of two Ts in the same small region.

To show that VisNet can actually solve this problem, Elliffe et al. (2002) performed the experiments described next. They trained the first two layers of VisNet with feature pair combinations, forming representations of feature pairs with some translation invariance in layer 2. Then they used feature triples as input stimuli, allowed no more learning in layers 1 and 2, and investigated whether layers 3 and 4 could be trained to produce invariant representations of the triples, where the triples could only be distinguished if the local spatial arrangement of the features within the triple had effectively to be encoded in order to distinguish the different triples. For this experiment, they needed stimuli that could be specified in terms of a set of different features (they chose vertical (1), diagonal (2), and horizontal (3) bars), each capable of being shown at a set of different relative spatial positions (designated A, B, and C), as shown in Figure 20. The stimuli are thus defined in terms of what features are present and their precise spatial arrangement with respect to each other. The length of the horizontal and vertical feature bars shown in Figure 20 is 8 pixels.
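The pair-coding argument above (that 1234 and 3412 are distinguished by a pair neuron for 41 rather than 23) can be checked in a few lines. The sketch below treats a "pair neuron" simply as an ordered adjacent pair of features; pair_code is a hypothetical helper, not part of VisNet:

```python
def pair_code(obj):
    """The set of adjacent ordered feature pairs ('pair neurons')
    activated by an object such as '1234'."""
    return {obj[i:i + 2] for i in range(len(obj) - 1)}

a, b = "1234", "3412"   # the same four features in two arrangements

assert set(a) == set(b)                      # single features cannot tell them apart
assert pair_code(a) == {"12", "23", "34"}
assert pair_code(b) == {"34", "41", "12"}    # differs by the pair 41 vs 23
assert pair_code(a) != pair_code(b)
```

The single-feature sets of the two objects are identical, but their pair sets differ, so a layer reading only pair-combination activity can still tell the arrangements apart.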
To train the network a stimulus (that is, a pair or triple feature combination) is presented in a randomized sequence of nine locations in a square grid across the 128 x 128 input retina. The central location of the square grid is in the center of the "retina," and the eight other locations are offset 8 pixels horizontally and/or vertically from this. We refer to the two and three feature stimuli as "pairs" and "triples," respectively. Individual stimuli are denoted by three numbers which refer to the individual features present in positions A, B, and C, respectively. For example, a stimulus with positions A and C containing a vertical and diagonal bar, respectively, would be referred to as stimulus 102, where the 0 denotes no feature present in position B. In total there are 18 pairs (120, 130, 210, 230, 310, 320, 012, 013, 021, 023, 031, 032, 102, 103, 201, 203, 301, 302) and 6 triples (123, 132, 213, 231, 312, 321). This nomenclature not only defines which features are present within objects, but also the spatial relationships of their component features.

Then the computational problem can be illustrated by considering the triple 123. If invariant representations are formed of single features, then there would be no way that neurons higher in the hierarchy could distinguish the object 123 from 213 or any other arrangement of the three features. An approach to this problem (see, e.g., Rolls, 1992) is to form early on in the processing neurons that respond to overlapping combinations of features in the correct spatial arrangement, and then to develop invariant representations in the next layer from these neurons, which already contain evidence on the local spatial arrangement of features. An example might be that with the object 123, the invariant feature pairs would represent 120, 023, and 103. Then if neurons at the next layer correspond to combinations of these neurons, the only next layer neurons that would respond would be those tuned to 123, not to, for example, 213. The argument is that the low-order spatial-feature combination neurons in the early stage contain sufficient spatial information so that a particular combination of those low-order feature combination neurons specifies a unique object, even if the relative positions of the low-order feature combination neurons are not known, because these neurons are somewhat translation-invariant (cf. also Fukushima, 1988).

The stimuli used in the experiments of Elliffe et al. (2002) were constructed from pre-processed component features as discussed in Section 5.4.4. That is, base stimuli containing a single feature were constructed and filtered, and then the pairs and triples were constructed by merging these pre-processed single feature images.

In the first experiment layers 1 and 2 of VisNet were trained with the 18 feature pairs, each stimulus being presented in sequences of 9 locations across the input. This led to the formation of neurons that responded to the feature pairs with some translation invariance in layer 2. Then they trained layers 3 and 4 on the 6 feature triples in the same 9 locations, while allowing no more learning in layers 1 and 2, and examined whether the output layer of VisNet had developed transform invariant neurons to the 6 triples. The idea was to test whether layers 3 and 4 could be trained to produce invariant representations of the triples, where the triples could only be distinguished if the local spatial arrangement of the features within the triple had effectively to be encoded in order to distinguish the different triples. The results from this experiment were compared and contrasted with results from three other experiments which involved different training regimes for layers 1, 2 and layers 3, 4. All four experiments are summarized in Table 5. Experiment 2 involved no training in layers 1, 2 and 3, 4, with the synaptic weights left unchanged from their initial random values. These results are included as a baseline performance with which to compare results from the other experiments 1, 3, and 4. The model parameters used in these experiments were as described by Rolls and Milward (2000) and Rolls and Stringer (2001).

FIGURE 20 | Feature combinations for experiments of Section 5.4.5: there are 3 features denoted by 1, 2, and 3 (including a blank space 0) that can be placed in any of 3 positions A, B, and C. Individual stimuli are denoted by three consecutive numbers which refer to the individual features present in positions A, B, and C, respectively. In the experiments in Section 5.4.5, layers 1 and 2 were trained on stimuli consisting of pairs of the features, and layers 3 and 4 were trained on stimuli consisting of triples. Then the network was tested to show whether layer 4 neurons would distinguish between triples, even though the first two layers had only been trained on pairs. In addition, the network was tested to show whether individual cells in layer 4 could distinguish between triples even in locations where the triples were not presented during training. (After Elliffe et al., 2002.)

Table 5 | The different training regimes used in VisNet experiments 1-4 of Section 5.4.5.

               Layers 1, 2         Layers 3, 4
Experiment 1   Trained on pairs    Trained on triples
Experiment 2   No training         No training
Experiment 3   No training         Trained on triples
Experiment 4   Trained on triples  Trained on triples

In the no training condition the synaptic weights were left in their initial untrained random values.

FIGURE 21 | Numerical results for experiments 1-4 as described in Table 5, with the trace-learning rule (17). On the left are single cell information measures, and on the right are multiple cell information measures. (After Elliffe et al., 2002.)
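The pair/triple nomenclature and the position-coded pair argument described above can be verified with a short sketch (illustrative only; pair_neurons is a hypothetical helper standing in for the layer 2 feature combination neurons):

```python
from itertools import combinations, permutations
from math import log2

features = "123"   # vertical, diagonal, and horizontal bars
# Pairs: two features plus one empty position ('0'), in any of the three
# positions A, B, C; triples: all three features, one per position.
pairs = sorted({''.join(p) for p in permutations(features + "0", 3)
                if p.count("0") == 1})
triples = sorted(''.join(p) for p in permutations(features, 3))

assert len(pairs) == 18 and len(triples) == 6
assert "102" in pairs              # vertical at A, diagonal at C, B empty

def pair_neurons(triple):
    """Position-coded pair combinations activated by a triple, e.g.
    '123' -> {'120', '023', '103'}."""
    return {''.join(c if i in keep else "0" for i, c in enumerate(triple))
            for keep in combinations(range(3), 2)}

assert pair_neurons("123") == {"120", "023", "103"}
# Each triple activates a unique set of pair neurons, so pair information
# alone suffices to identify the triple.
assert len({frozenset(pair_neurons(t)) for t in triples}) == 6
# With only 6 discriminable stimuli, the single cell information ceiling
# is log2(6), about 2.6 bits.
assert round(log2(len(triples)), 1) == 2.6
```

The enumeration reproduces the 18 pairs and 6 triples listed in the text, and the signature check makes explicit why neurons reading only pair-combination activity can still distinguish all six spatial arrangements.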
In Figure 21 we present numerical results for the four experiments listed in Table 5. On the left are the single cell information measures for all top (4th) layer neurons ranked in order of their invariance to the triples, while on the right are multiple cell information measures. To help to interpret these results we can compute the maximum single cell information measure according to

    Maximum single cell information = log2(Number of triples),    (42)

where the number of triples is 6. This gives a maximum single cell information measure of 2.6 bits for these test cases. First, comparing the results for experiment 1 with the baseline performance of experiment 2 (no training) demonstrates that even with the first two layers trained to form invariant responses to the pairs, and then only layers 3 and 4 trained on feature triples, layer 4 is indeed capable of developing translation-invariant neurons that can discriminate effectively between the 6 different feature triples. Indeed, from the single cell information measures it can be seen that a number of cells have reached the maximum level of performance in experiment 1. In addition, the multiple cell information analysis presented in Figure 21 shows that all the stimuli could be discriminated from each other by the firing of a number of cells. Analysis of the response profiles of individual cells showed that a fourth layer cell could respond to one of the triple feature stimuli and have no response to any other of the triple feature stimuli, invariantly with respect to location.

A comparison of the results from experiment 1 with those from experiment 3 (see Table 5 and Figure 21) reveals that training the first two layers to develop neurons that respond invariantly to the pairs (performed in experiment 1) actually leads to improved invariance of 4th layer neurons to the triples, as compared with when the first two layers are left untrained (experiment 3).

Two conclusions follow from these results (Elliffe et al., 2002). First, a hierarchical network that seeks to produce invariant representations in the way used by VisNet can solve the feature binding problem. In particular, when feature pairs in layer 2 with some translation invariance are used as the input to later layers, these later layers can nevertheless build invariant representations of objects where all the individual features in the stimulus must occur in the correct spatial position relative to each other. This is possible because the feature combination neurons formed in the first layer (which could be trained just with a Hebb rule) do respond to combinations of input features in the correct spatial configuration, partly because of the limited size of their receptive fields. The second conclusion is that even though early layers can in this case only respond to small feature subsets, these provide, with no further training of layers 1 and 2, an adequate basis for learning to discriminate in layers 3 and 4 stimuli consisting of combinations of larger numbers of features. Indeed, comparing results from experiment 1 with experiment 4 (in which all layers were trained on triples, see Table 5) demonstrates that training the lower layer neurons to develop invariant responses to the pairs offers almost as good performance as training all layers on the triples (see Figure 21).

5.4.6. Stimulus generalization to untrained transforms of new objects
Another important aspect of the architecture of VisNet is that it need not be trained with every stimulus in every possible location. Indeed, part of the hypothesis (Rolls, 1992) is that training early layers (e.g., 1-3) with a wide range of visual stimuli will set up feature analyzers in these early layers which are appropriate later on, with no further training of the early layers, for new objects. For example, presentation of a new object might result in large numbers of low-order feature combination neurons in early layers of VisNet being active, but the particular set of feature combination neurons active would be different for the new object. The later layers of the network (in VisNet, layer 4) would then learn this new set of active layer 3 neurons as encoding the new object. However, if the new object was then shown in a new location, the same set of layer 3 neurons would be active because they respond with spatial invariance to feature combinations, and given that the layer 3-4 connections had already been set up by the new object, the correct layer 4 neurons would be activated by the new object in its new untrained location, without any further training.

To test this hypothesis Elliffe et al. (2002) repeated the general procedure of experiment 1 of Section 5.4.5, training layers 1 and 2 with feature pairs, but then instead trained layers 3 and 4 on the triples in only 7 of the original 9 locations. The crucial test was to determine whether VisNet could form top layer neurons that responded invariantly to the 6 triples when presented over all nine locations, not just the seven locations at which the triples had been presented during training.
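The logic of this generalization test can be sketched with a toy model in which the intermediate code is location-invariant by construction. Here layer3_code and the dictionary readout are illustrative stand-ins for layers 1-3 and the layer 3-4 mapping, not the actual VisNet simulation:

```python
from itertools import permutations

triples = [''.join(p) for p in permutations("123")]
locations = list(range(9))     # the 3 x 3 grid of retinal positions

def layer3_code(triple, location):
    """Toy translation-invariant intermediate code: the adjacent feature
    pairs present in the stimulus. The retinal location is deliberately
    ignored, standing in for the invariance learned in the early layers."""
    return frozenset({triple[:2], triple[1:]})

# 'Train' a layer 4 readout on only 7 of the 9 locations.
readout = {}
for t in triples:
    for loc in locations[:7]:
        readout[layer3_code(t, loc)] = t

# The readout still classifies correctly at the 2 untrained locations,
# because the layer 3 code is identical there.
assert all(readout[layer3_code(t, loc)] == t
           for t in triples for loc in locations[7:])
```

Because the intermediate representation does not change with location, a mapping learned at some locations transfers for free to the untrained ones, which is the effect Elliffe et al. (2002) demonstrated in the full network.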
ant responses over all nine locations, as shown by the single cell The information required to solve the visual feature binding prob- information analysis. The response profiles of individual fourth lem thus becomes encoded by self-organization into what become layer cells showed that they can continue to discriminate between hard-wired properties of the network. In this sense, feature binding the triples even in the two locations where the triples were not is not solved at run-time by the necessity to instantaneously set up presented during training. In addition, the multiple cell analysis arbitrary syntactic links between sets of co-active neurons. The showed that a small population of cells was able to discriminate computational solution proposed to the superset/subset aspect between all of the stimuli irrespective of location, even though for of the binding problem will apply in principle to other multi- two of the test locations the triples had not been trained at those layer competitive networks, although the issues considered here particular locations during the training of layers 3 and 4. have not been explicitly addressed in architectures such as the The use of transformation rules learned by early stages of the Neocognitron (Fukushima and Miyake, 1982). hierarchy to enable later stages to perform correctly on trans- Consistent with these hypotheses about how VisNet operates to formed views never seen before of objects is now being investigated achieve, by layer 4, position-invariant responses to stimuli defined by others (Leibo et al., 2010). by combinations of features in the correct spatial arrangement, investigations of the effective stimuli for neurons in intermediate 5.4.7. Discussion of feature binding in hierarchical layered layers of VisNet showed as follows. In layer 1, cells responded to networks the presence of individual features, or to low-order combinations Elliffe et al. 
(2002) thus first showed (see Section 5.4.4) that hier- of features (e.g., a pair of features) in the correct spatial arrange- archical feature-detecting neural networks can learn to respond ment at a small number of nearby locations. In layers 2 and 3, differently to stimuli that consist of unique combinations of non- neurons responded to single features or to higher order combina- unique input features, and that this extends to stimuli that are tions of features (e.g., stimuli composed of feature triples) in more direct subsets or supersets of the features present in other stimuli. locations. These findings provide direct evidence that VisNet does Second Elliffe et al. (2002) investigated (see Section 5.4.5) operate as described above to solve the feature binding problem. the hypothesis that hierarchical layered networks can produce A further issue with hierarchical multilayer architectures such identification of unique stimuli even when the feature combi- as VisNet is that false binding errors might occur in the following nation neurons used to define the stimuli are themselves partly way (Mozer, 1991; Mel and Fiser, 2000). Consider the output of translation-invariant. The stimulus identification should work one-layer in such a network in which there is information only correctly because feature combination neurons in which the spa- about which pairs are present. How then could a neuron in the tial features are bound together with high-spatial precision are next layer discriminate between the whole stimulus (such as the formed in the first layer. Then at later layers when neurons with triple 123 in the above experiment) and what could be considered a some translation invariance are formed, the neurons neverthe- more distributed stimulus or multiple different stimuli composed less contain information about the relative spatial position of the of the separated subparts of that stimulus (e.g., the pairs 120, 023, original features. 
There is only then one object which will be con- 103 occurring in 3 of the 9 training locations in the above exper- sistent with the set of active neurons at earlier layers, which though iment)? The problem here is to distinguish a single object from somewhat translation-invariant as combination neurons, reflect in multiple other objects containing the same component combi- the activity of each neuron information about the original spatial nations (e.g., pairs). We propose that part of the solution to this position of the features. I note that the trace rule training used in general problem in real visual systems is implemented through lat- early layers (1 and 2) in Experiments 1 and 4 would set up partly eral inhibition between neurons in individual layers, and that this invariant feature combination neurons, and yet the late layers (3 mechanism, implemented in VisNet, acts to reduce the possibility and 4) were able to produce during training neurons in layer 4 of false recognition errors in the following two ways. that responded to stimuli that consisted of unique spatial arrange- First, consider the situation in which neurons in layer N have ments of lower order feature combinations. Moreover, and very learned to represent low-order feature combinations with loca- interestingly Elliffe et al. (2002) were able to demonstrate that Vis- tion invariance, and where a neuron n in layer N C 1 has learned Net layer 4 neurons would respond correctly to visual stimuli at to respond to a particular set  of these feature combinations. untrained locations, provided that the feature subsets had been The problem is that neuron n receives the same input from layer trained in early layers of the network at all locations, and that the N as long as the same set  of feature combinations is present, whole stimulus had been trained at some locations in the later and cannot distinguish between different spatial arrangements of layers of the network. these feature combinations. 
The question is how can neuron n The results described by Elliffe et al. (2002) thus provide one respond only to a particular favored spatial arrangement 9 of solution to the feature binding problem. The solution which has the feature combinations contained within the set . We suggest been shown to work in the model is that in a multilayer competitive that as the favored spatial arrangement 9 is altered by rearranging Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 35 Rolls Invariant visual object recognition the spatial relationships of the component feature combinations, implemented in VisNet. Consistent with this, a considerable part the new feature combinations that are formed in new locations of the non-human primate brain is devoted to visual informa- will stimulate additional neurons nearby in layer N C 1, and these tion processing. The fact that large numbers of neurons and a will tend to inhibit the firing of neuron n. Thus, lateral inhibition multilayer organization are present in the primate ventral visual within a layer will have the effect of making neurons more selective, system is actually thus consistent with the type of model of visual ensuring neuron n responds only to a single spatial arrangement information processing described here. 9 from the set of feature combinations , and hence reducing the possibility of false recognition. 5.5. OPERATION IN A CLUTTERED ENVIRONMENT The second way in which lateral inhibition may help to reduce In this section we consider how hierarchical layered networks of binding errors is through limiting the sparseness of neuronal firing the type exemplified by VisNet operate in cluttered environments. rates within layers. 
In our discussion above the spurious stimuli Although there has been much work involving object recognition we suggested that might lead to false recognition of triples were in cluttered environments with artificial vision systems, many such obtained from splitting up the component feature combinations systems typically rely on some form of explicit segmentation fol- (pairs) so that they occurred in separate training locations. How- lowed by search and template matching procedure (see Ullman, ever, this would lead to an increase in the number of features 1996 for a general review). In natural environments, objects may present in the complete stimulus; triples contain 3 features while not only appear against cluttered (natural) backgrounds, but also their spurious counterparts would contain 6 features (resulting the object may be partially occluded. Biological nervous systems from 3 separate pairs). For this trivial example, the increase in the operate in quite a different manner to those artificial vision sys- number of features is not dramatic, but if we consider, say, stimuli tems that rely on search and template matching, and the way in composed of 4 features where the component feature combina- which biological systems cope with cluttered environments and tions represented by lower layers might be triples, then to form partial occlusion is likely to be quite different also. spurious stimuli we need to use 12 features (resulting from 4 triples One of the factors that will influence the performance of occurring in separate locations). But if the lower layers also rep- the type of architecture considered here, hierarchically organized resented all possible pairs then the number of features required in series of competitive networks, which form one class of approaches the spurious stimuli would increase further. 
In fact, as the size of to biologically relevant networks for invariant object recognition the stimulus increases in terms of the number of features, and as (Fukushima, 1980; Poggio and Edelman, 1990; Rolls, 1992, 2008b; the size of the component feature combinations represented by the Wallis and Rolls, 1997; Rolls and Treves, 1998), is how lateral inhi- lower layers increases, there is a combinatorial explosion in terms bition and competition are managed within a layer. Even if an of the number of features required as we attempt to construct object is not obscured, the effect of a cluttered background will be spurious stimuli to trigger false recognition. And the construction to fire additional neurons, which will in turn to some extent com- of such spurious stimuli will then be prevented through setting a pete with and inhibit those neurons that are specifically tuned to limit on the sparseness of firing rates within layers, which will in respond to the desired object. Moreover, where the clutter is adja- turn set a limit on the number of features that can be represented. cent to part of the object, the feature analyzing neurons activated Lateral inhibition is likely to contribute in both these ways to the against a blank background might be different from those activated performance of VisNet when the stimuli consist of subsets and against a cluttered background, if there is no explicit segmentation supersets of each other, as described in Section 5.4.4. process. We consider these issues next, following investigations of Another way is which the problem of multiple objects is Stringer and Rolls (2000). addressed is by limiting the size of the receptive fields of inferior temporal cortex neurons so that neurons in IT respond primarily 5.5.1. 
VisNet simulations with stimuli in cluttered backgrounds to the object being fixated, but with nevertheless some asymme- In this section we show that recognition of objects learned previ- try in the receptive fields (see Section 5.9). Multiple objects are ously against a blank background is hardly affected by the presence then “seen” by virtue of being added to a visuo-spatial scratchpad of a natural cluttered background. We go on to consider what hap- (Rolls, 2008b). pens when VisNet is set the task of learning new stimuli presented A related issue that arises in this class of network is whether against cluttered backgrounds. forming neurons that respond to feature combinations in the way The images used for training and testing VisNet in the sim- described here leads to a combinatorial explosion in the number ulations described next performed by Stringer and Rolls (2000) of neurons required. The solution to this issue that is proposed were specially constructed. There were 7 face stimuli approxi- is to form only low-order combinations of features at any one mately 64 pixels in height constructed without backgrounds. In stage of the network (Rolls, 1992; cf. Feldman, 1985). Using low- addition there were 3 possible backgrounds: a blank background order combinations limits the number of neurons required, yet (gray-scale 127, where the range is 0–255), and two cluttered back- enables the type of computation that relies on feature combina- grounds as shown in Figure 22 which are 128 128 pixels in size. tion neurons that is analyzed here to still be performed. The actual Each image presented to VisNet’s 128 128 input retina was com- number of neurons required depends also on the redundancies posed of a single face stimulus positioned at one of 9 locations on present in the statistics of real-world images. Even given these fac- either a blank or cluttered background. 
The cluttered background tors, it is likely that a large number of neurons would be required was intended to be like the background against which an object if the ventral visual system performs the computation of invari- might be viewed in a natural scene. If a background is used in an ant representations in the manner captured by the hypotheses experiment described here, the same background is always used, Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 36 Rolls Invariant visual object recognition This is an interesting and important result, for it shows that after learning, special mechanisms for segmentation and for attention are not needed in order for neurons already tuned by previous learning to the stimuli to be activated correctly in the output layer. Although the experiments described here tested for posi- tion invariance, we predict and would expect that the same results would be demonstrable for size and view-invariant representations of objects. In experiments 3 and 4 of Stringer and Rolls (2000), VisNet was trained with the 7 face stimuli presented on either one of the 2 cluttered backgrounds, but tested with the faces presented on a blank background. Results for this experiment showed poor performance. The results of experiments 3 and 4 suggest that in FIGURE 22 | Cluttered backgrounds used in VisNet simulations: order for a cell to learn invariant responses to different transforms backgrounds 1 and 2 are on the left and right, respectively. of a stimulus when it is presented during training in a cluttered background, some form of segmentation is required in order to and it is always in the same position, with stimuli moved to dif- separate the Figure (i.e., the stimulus or object) from the back- ferent positions on it. The 9 stimulus locations are arranged in a ground. 
This segmentation might be performed using evidence in square grid across the background, where the grid spacings are 32 the visual scene about different depths, motions, colors, etc. of the pixels horizontally or vertically. Before images were presented to object from its background. In the visual system, this might mean VisNet’s input layer they were pre-processed by the standard set combining evidence represented in different cortical areas, and of input filters which accord with the general tuning profiles of might be performed by cross-connections between cortical areas simple cells in V1 (Hawken and Parker, 1987); full details are given to enable such evidence to help separate the representations of in Rolls and Milward (2000). To train the network a sequence of objects from their backgrounds in the form-representing cortical images is presented to VisNet’s retina that corresponds to a single areas. stimulus occurring in a randomized sequence of the 9 locations Another mechanism that helps the operation of architectures across a background. At each presentation the activation of indi- such as VisNet and the primate visual system to learn about new vidual neurons is calculated, then their firing rates are calculated, objects in cluttered scenes is that the receptive fields of inferior and then the synaptic weights are updated. After a stimulus has temporal cortex neurons become much smaller when objects are been presented in all the training locations, a new stimulus is cho- seen against natural backgrounds (Sections 5.8.1 and 5.8). This sen at random and the process repeated. The presentation of all the will help greatly to learn about new objects that are being fix- stimuli across all locations constitutes 1 epoch of training. In this ated, by reducing responsiveness to other features elsewhere in the manner the network is trained one-layer at a time starting with scene. layer 1 and finishing with layer 4. 
In the investigations described Another mechanism that might help the learning of new objects in this subsection, the numbers of training epochs for layers 1–4 in a natural scene is attention. An attentional mechanism might were 50, 100, 100, and 75, respectively. highlight the current stimulus being attended to and suppress the In this experiment (see Stringer and Rolls, 2000, experiment 2), effects of background noise, providing a training representation of VisNet was trained with the 7 face stimuli presented on a blank the object more like that which would be produced when it is pre- background, but tested with the faces presented on each of the 2 sented against a blank background. The mechanisms that could cluttered backgrounds. implement such attentional processes are described elsewhere The single and multiple cell information showed perfect per- (Rolls, 2008b). If such attentional mechanisms do contribute to formance. Compared to performance when shown against a blank the development of view-invariance, then it follows that cells in the background, there was very little deterioration in performance temporal cortex may only develop transform invariant responses when testing with the faces presented on either of the two cluttered to objects to which attention is directed. backgrounds. Part of the reason for the poor performance in experiments 3 This is an interesting result to compare with many artificial and 4 was probably that the stimuli were always presented against vision systems that would need to carry out computationally inten- the same fixed background (for technical reasons), and thus the sive serial searching and template matching procedures in order neurons learned about the background rather than the stimuli. to achieve such results. 
In contrast, the VisNet neural network Part of the difficulty that hierarchical multilayer competitive net- architecture is able to perform such recognition relatively quickly works have with learning in cluttered environments may more through a simple feed-forward computation. generally be that without explicit segmentation of the stimulus Further results from this experiment showed that different neu- from its background, at least some of the features that should be rons can achieve excellent invariant responses to each of the 7 formed to encode the stimuli are not formed properly, because the faces even with the faces presented on a cluttered background. neurons learn to respond to combinations of inputs which come The response profiles are independent of location but differenti- partly from the stimulus, and partly from the background. To ate between the faces in that the responses are maximal for only investigate this Stringer and Rolls (2000) performed experiment 5 one of the faces and minimal for all other faces. in which layers 1–3 were pre-trained with stimuli to ensure that Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 37 Rolls Invariant visual object recognition good feature combination neurons for stimuli were available, and objects from their backgrounds in the form-representing cortical then allowed learning in only layer 4 when stimuli were presented areas. in the cluttered backgrounds. Layer 4 was then trained in the usual A second way in which training a feature hierarchy network in way with the 7 faces presented against a cluttered background. The a cluttered natural scene may be facilitated follows from the find- results showed that prior random exposure to the face stimuli led ing that the receptive fields of inferior temporal cortex neurons to much improved performance. 
shrink from in the order of 70˚ in diameter when only one object These results demonstrated that the problem of developing is present in a blank scene to much smaller values of as little as 5– position-invariant neurons to stimuli occurring against clut- 10˚ close to the fovea in complex natural scenes (Rolls et al., 2003). tered backgrounds may be ameliorated by the prior existence of The proposed mechanism for this is that if there is an object at the stimulus-tuned feature-detecting neurons in the early layers of the fovea, this object, because of the high-cortical magnification fac- visual system, and that these feature-detecting neurons may be tor at the fovea, dominates the activity of neurons in the inferior set up through previous exposure to the relevant class of objects. temporal cortex by competitive interactions (Trappenberg et al., When tested in cluttered environments, the background clutter 2002; Deco and Rolls, 2004; see Section 5.8). This allows primarily may of course activate some other neurons in the output layer, but the object at the fovea to be represented in the inferior temporal at least the neurons that have learned to respond to the trained cortex, and, it is proposed, for learning to be about this object, and stimuli are activated. The result of this activity is sufficient for not about the other objects in a whole scene. the activity in the output layer to be useful, in the sense that Third, top-down spatial attention (Deco and Rolls, 2004, 2005a; it can be read-off correctly by a pattern associator connected to Rolls, 2008b) could bias the competition toward a region of visual the output layer. Indeed, Stringer and Rolls (2000) tested this by space where the object to be learned is located. connecting a pattern associator to layer 4 of VisNet. 
The pattern Fourth, if object 1 is presented during training with other associator had seven neurons, one for each face, and 1,024 inputs, different objects present on different trials, then the competitive one from each neuron in layer 4 of VisNet. The pattern associ- networks that are part of VisNet will learn to represent each object ator learned when trained with a simple associative Hebb rule separately, because the features that are part of each object will (equation (16)) to activate the correct output neuron whenever be much more strongly associated together, than are those fea- one of the faces was shown in any position in the uncluttered tures with the other features present in the different objects seen environment. This ability was shown to be dependent on invari- on some trials during training (Stringer et al., 2007; Stringer and ant neurons for each stimulus in the output layer of VisNet, for Rolls, 2008). It is a natural property of competitive networks that the pattern associator could not be taught the task if VisNet had input features that co-occur very frequently together are allocated not been previously trained with a trace-learning rule to produce output neurons to represent the pattern as a result of the learn- invariant representations. Then it was shown that exactly the cor- ing. Input features that do not co-occur frequently, may not have rect neuron was activated when any of the faces was shown in output neurons allocated to them. This principle may help feature any position with the cluttered background. This read-off by a hierarchy systems to learn representations of individual objects, pattern associator is exactly what we hypothesize takes place in even when other objects with some of the same features are present the brain, in that the inferior temporal visual cortex (where neu- in the visual scene, but with different other objects on different tri- rons with invariant responses are found) projects to structures als. 
With this fundamental and interesting property of competitive such as the orbitofrontal cortex and amygdala, where associations networks, it has now become possible for VisNet to self-organize between the invariant visual representations and stimuli such as invariant representations of individual objects, even though each taste and touch are learned (Rolls and Treves, 1998; Rolls, 1999, object is always presented during training with at least one other 2005, 2008b, 2013; Rolls and Grabenhorst, 2008; Grabenhorst and object present in the scene (Stringer et al., 2007; Stringer and Rolls, Rolls, 2011). Thus testing whether the output of an architecture 2008). This has been extended to learning separate representations such as VisNet can be used effectively by a pattern associator is a of face expression and face identity from the same set of images, very biologically relevant way to evaluate the performance of this depending on the statistics with which the images are presented class of architecture. (Tromans et al., 2011); and learning separate representations of independently rotating objects (Tromans et al., 2012). 5.5.2. Learning invariant representations of an object with multiple objects in the scene and with cluttered backgrounds 5.5.3. VisNet simulations with partially occluded stimuli The results of the experiments just described suggest that in order In this section we examine the recognition of partially occluded for a neuron to learn invariant responses to different transforms stimuli. Many artificial vision systems that perform object recog- of a stimulus when it is presented during training in a cluttered nition typically search for specific markers in stimuli, and hence background, some form of segmentation is required in order to their performance may become fragile if key parts of a stimulus separate the figure (i.e., the stimulus or object) from the back- are occluded. However, in contrast we demonstrate that the model ground. 
This segmentation might be performed using evidence in of invariance learning in the brain discussed here can continue the visual scene about different depths, motions, colors, etc. of the to offer robust performance with this kind of problem, and that object from its background. In the visual system, this might mean the model is able to correctly identify stimuli with considerable combining evidence represented in different cortical areas, and flexibility about what part of a stimulus is visible. might be performed by cross-connections between cortical areas In these simulations (Stringer and Rolls, 2000), training to enable such evidence to help separate the representations of and testing was performed with a blank background to avoid Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 38 Rolls Invariant visual object recognition confounding the two separate problems of occlusion and back- visible (option (ii)) than the lower half (option (i)). When the ground clutter. In object recognition tasks, artificial vision systems top halves of the faces are occluded the multiple cell information may typically rely on being able to locate a small number of key measure asymptotes to a suboptimal value reflecting the difficulty markers on a stimulus in order to be able to identify it. This of discriminating between these more difficult images. approach can become fragile when a number of these markers Thus this model of the ventral visual system offers robust become obscured. In contrast, biological vision systems may gen- performance with this kind of problem, and the model is able eralize or complete from a partial input as a result of the use of to correctly identify stimuli with considerable flexibility about distributed representations in neural networks, and this could lead what part of a stimulus is visible, because it is effectively using to greater robustness in situations of partial occlusion. 
distributed representations and associative processing. In this experiment (experiment 6 of Stringer and Rolls, 2000), the network was first trained with the 7 face stimuli without occlusion, but during testing there were two options: either (i) the top halves of all the faces were occluded, or (ii) the bottom halves of all the faces were occluded. Since VisNet was tested with either the top or the bottom half of the stimuli, no stimulus features were common to the two test options. This ensures that if performance is good with both options, the performance cannot be based on the use of a single feature to identify a stimulus. Results for this experiment are shown in Figure 23, with single and multiple cell information measures on the left and right, respectively. When compared with the performance without occlusion (Stringer and Rolls, 2000), Figure 23 shows that there is only a modest drop in performance in the single cell information measures when the stimuli are partially occluded. For both options (i) and (ii), even with partially occluded stimuli, a number of cells continue to respond maximally to one preferred stimulus in all locations, while responding minimally to all other stimuli. However, comparing results from options (i) and (ii) shows that the network performance is better when the bottom half of the faces is occluded. This is consistent with psychological results showing that face recognition is performed more easily when the top halves of faces are visible rather than the bottom halves (see Bruce, 1988). The top half of a face will generally contain salient features, e.g., eyes and hair, that are particularly helpful for recognition of the individual, and it is interesting that these simulations appear to further demonstrate this point. Furthermore, the multiple cell information measures confirm that performance is better with the upper half of the face visible.

FIGURE 23 | Effects of partial occlusion of a stimulus: numerical results for experiment 6 of Stringer and Rolls (2000), with the 7 faces presented on a blank background during both training and testing. Training was performed with the whole face. However, during testing there are two options: either (i) the top half of all the faces are occluded, or (ii) the bottom half of all the faces are occluded. On the left are single cell information measures, and on the right are multiple cell information measures.

5.6. LEARNING 3D TRANSFORMS
In this section we describe investigations of Stringer and Rolls (2002) which show that trace-learning can, in the VisNet architecture, solve the problem of in-depth rotation-invariant object recognition by developing representations of the transforms which features undergo when they are on the surfaces of 3D objects. Moreover, it is shown that having learned how features on 3D objects transform as the object is rotated in-depth, the network can correctly recognize novel 3D variations within a generic view of an object which is composed of previously learned feature combinations.

Rolls' hypothesis of how object recognition could be implemented in the brain postulates that trace rule learning helps invariant representations to form in two ways (Rolls, 1992, 1994, 1995, 2000). The first process enables associations to be learned between different generic 3D views of an object where there are different qualitative shape descriptors. One example of this would be the front and back views of an object, which might have very different shape descriptors. Another example is provided by considering how the shape descriptors typical of 3D shapes, such as Y vertices, arrow vertices, cusps, and ellipse shapes, alter when 3D objects are rotated in 3 dimensions. At some point in the 3D rotation, there is a catastrophic rearrangement of the shape descriptors as a new generic view can be seen (Koenderink, 1990). An example of a catastrophic change to a new generic view is when a cup being viewed from slightly below is rotated so that one can see inside the cup from slightly above. The bottom surface disappears, the top surface of the cup changes from a cusp to an ellipse, and the inside of the cup with a whole set of new features comes into view.

The second process is that within a generic view, as the object is rotated in-depth, there will be no catastrophic changes in the qualitative 3D shape descriptors, but instead the quantitative values of the shape descriptors alter. For example, while the cup is being rotated within a generic view seen from somewhat below, the curvature of the cusp forming the top boundary will alter, but the qualitative shape descriptor will remain a cusp.

Trace-learning could help with both processes. That is, trace-learning could help to associate together qualitatively different sets of shape descriptors that occur close together in time, and that describe, for example, the generically different views of a cup. Trace-learning could also help with the second process, and learn to associate together the different quantitative values of shape descriptors that typically occur when objects are rotated within a generic view.

We note that there is evidence that some neurons in the inferior temporal cortex may show the two types of 3D invariance. First, Booth and Rolls (1998) showed that some inferior temporal cortex neurons can respond to different generic views of familiar 3D objects. Second, some neurons do generalize across quantitative changes in the values of 3D shape descriptors while faces (Hasselmo et al., 1989b) and objects (Logothetis et al., 1995; Tanaka, 1996) are rotated within-generic views.
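The trace rule invoked throughout this discussion can be stated compactly: the post-synaptic trace ȳ is an exponentially decaying average of recent firing, ȳ(τ) = (1 − η)·y(τ) + η·ȳ(τ−1), and the weight change is Δw = α·ȳ·x. The following is a minimal sketch only: the exponential-trace form follows the VisNet literature, but the competitive dynamics that would actually determine the post-synaptic firing are deliberately omitted (the neuron is simply assumed to fire for every transform of its object), so this is an illustration rather than the published implementation.

```python
# Sketch of trace-rule learning. Assumption (not in the original model):
# the post-synaptic neuron is taken to fire (y = 1) for every transform of
# its object; in VisNet this would be determined by competition in the layer.

def trace_weights(views, n_inputs, eta=0.8, alpha=0.1):
    """Strengthen one neuron's weights onto every transform of an object.

    views: input vectors for successive transforms, seen close together in time.
    """
    w = [0.0] * n_inputs
    ybar = 0.0                                # short-term memory trace
    for x in views:
        y = 1.0                               # assumed post-synaptic firing
        ybar = (1.0 - eta) * y + eta * ybar   # exponential trace update
        w = [wj + alpha * ybar * xj for wj, xj in zip(w, x)]
    return w

# Two transforms of one object activate disjoint input lines (e.g., the same
# feature at two positions). After learning, both sets of inputs are
# strengthened onto the same neuron, so either transform alone activates it.
w = trace_weights([[1, 0, 0, 0], [0, 1, 0, 0]], n_inputs=4)
print(w)   # weights onto both transforms' inputs are now positive
```

Because the trace ȳ is still non-zero when the second transform appears, inputs that never co-occur in any single image nevertheless strengthen onto the same neuron — the temporal-continuity mechanism that both processes described above rely on.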
FIGURE 24 | Learning 3D perspectival transforms of features. Representations of the 6 visual stimuli with 3 surface features (triples) presented to VisNet during the simulations described in Section 5.6. Each stimulus is a sphere that is uniquely identified by a unique combination of three surface features (a vertical, diagonal, and horizontal arc), which occur in 3 relative positions A, B, and C. Each row shows one of the stimuli rotated through the 5 different rotational views in which the stimulus is presented to VisNet. From left to right the rotational views shown are: (i) −60˚, (ii) −30˚, (iii) 0˚ (central position), (iv) +30˚, and (v) +60˚. (After Stringer and Rolls, 2002.)

Indeed, Logothetis et al. (1995) showed that a few inferior temporal cortex neurons can generalize to novel (untrained) values of the quantitative shape descriptors typical of within-generic view object rotation.

In addition to the qualitative shape descriptor changes that occur catastrophically between different generic views of an object, and the quantitative changes of 3D shape descriptors that occur within a generic view, there is a third type of transform that must be learned for correct invariant recognition of 3D objects as they rotate in-depth. This third type of transform is that which occurs to the surface features on a 3D object as it transforms in-depth. The main aim here is to consider mechanisms that could enable neurons to learn this third type of transform, that is, how to generalize correctly over the changes in the surface markings on 3D objects that are typically encountered as 3D objects rotate within a generic view. Examples of the types of changes that the features show as objects are rotated are shown in Figure 24. Surface markings on the sphere that consist of combinations of three features in different spatial arrangements undergo characteristic transforms as the sphere is rotated from 0˚ to −60˚ and +60˚. We investigated whether the class of architecture exemplified by VisNet, and the trace-learning rule, can learn about the transforms that surface features of 3D objects typically undergo during 3D rotation in such a way that the network generalizes across the change of the quantitative values of the surface features produced by the rotation, and yet still discriminates between the different objects (in this case spheres). In the cases being considered, each object is identified by surface markings that consist of a different spatial arrangement of the same three features (a horizontal, vertical, and diagonal line, which become arcs on the surface of the object).

We note that it has been suggested that the finding that neurons may offer some degree of 3D rotation invariance after training with a single view (or limited set of views) represents a challenge for existing trace-learning models, because these models assume that an initial exposure is required during learning to every transformation of the object to be recognized (Riesenhuber and Poggio, 1998). Stringer and Rolls (2002) showed, as described here, that this is not the case, and that such models can generalize to novel within-generic views of an object provided that the characteristic perspectival transforms investigated have been learned previously for the sets of features when they are present in different objects.

Elliffe et al. (2002) demonstrated for a 2D system how the existence of translation-invariant representations of low-order feature combinations in the early layers of the visual system could allow correct stimulus identification in the output layer even when the stimulus was presented in a novel location where the stimulus had not previously occurred during learning. The proposal was that the low-order spatial-feature combination neurons in the early stages contain sufficient spatial information so that a particular combination of those low-order feature combination neurons specifies a unique object, even if the relative positions of the low-order feature combination neurons are not known, because these neurons are somewhat translation-invariant (see Section 5.4.5). Stringer and Rolls (2002) extended this analysis to feature combinations on 3D objects, and indeed in their simulations described in this section therefore used surface markings for the 3D objects that consisted of triples of features.

The images used for training and testing VisNet were specially constructed for the purpose of demonstrating how the trace-learning paradigm might be further developed to give rise to neurons that are able to respond invariantly to novel within-generic view perspectives of an object, obtained by rotations in-depth of up to 30˚ from any perspectives encountered during learning. The stimuli take the form of the surface feature combinations of 3-dimensional rotating spheres, with each image presented to VisNet's retina being a 2-dimensional projection of the surface features of one of the spheres.
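The deformation that such a 2-dimensional projection imposes on a surface feature can be made concrete with a few lines of geometry. The sketch below is an illustration only (it does not reproduce the actual stimulus-generation code of Stringer and Rolls, 2002): it orthographically projects a short horizontal arc of surface points on a unit sphere and shows how the arc's projected width shrinks as the sphere is rotated in depth — exactly the kind of quantitative feature transform the network must learn to tolerate.

```python
import math

def project(lon, lat, rot_deg):
    """Orthographic image coordinates of a unit-sphere surface point after
    the sphere is rotated in depth about its vertical axis.
    lon/lat in degrees; rot_deg is the in-depth rotation."""
    lam = math.radians(lon + rot_deg)
    phi = math.radians(lat)
    x = math.cos(phi) * math.sin(lam)   # horizontal image coordinate
    y = math.sin(phi)                   # vertical image coordinate
    return x, y

# A short horizontal arc of feature points near the sphere's equator.
arc = [(-20, 10), (-10, 10), (0, 10), (10, 10), (20, 10)]

widths = []
for rot in (0, 30, 60):                 # in-depth rotations as in Figure 24
    xs = [project(lon, lat, rot)[0] for lon, lat in arc]
    widths.append(max(xs) - min(xs))
print([round(wd, 3) for wd in widths])  # projected width shrinks with rotation
```

Note that the vertical image coordinate of each point is unchanged by the rotation while the horizontal extent is foreshortened, so the deformation is anisotropic — as can be seen across the rows of Figure 24.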
Each stimulus is uniquely identified by two or three surface features, where the surface features are (1) vertical, (2) diagonal, and (3) horizontal arcs, and where each feature may be centered at three different spatial positions, designated A, B, and C, as shown in Figure 24. The stimuli are thus defined in terms of what features are present and their precise spatial arrangement with respect to each other. We refer to the two and three feature stimuli as "pairs" and "triples," respectively. Individual stimuli are denoted by three numbers which refer to the features present in positions A, B, and C, respectively. For example, a stimulus with positions A and C containing a vertical and a diagonal bar, respectively, would be referred to as stimulus 102, where the 0 denotes no feature present in position B. In total there are 18 pairs (120, 130, 210, 230, 310, 320, 012, 013, 021, 023, 031, 032, 102, 103, 201, 203, 301, 302) and 6 triples (123, 132, 213, 231, 312, 321).

To train the network, each stimulus was presented to VisNet in a randomized sequence of five orientations with respect to VisNet's input retina, where the different orientations are obtained from successive in-depth rotations of the stimulus through 30˚. That is, each stimulus was presented to VisNet's retina from the following rotational views: (i) −60˚, (ii) −30˚, (iii) 0˚ (central position with surface features facing directly toward VisNet's retina), (iv) +30˚, and (v) +60˚. Figure 24 shows representations of the 6 visual stimuli with 3 surface features (triples) presented to VisNet during the simulations. (For the actual simulations described here, the surface features and their deformations were what VisNet was trained and tested with, and the remaining blank surface of each sphere was set to the same gray-scale as the background.) Each row of Figure 24 shows one of the stimuli rotated through the 5 different rotational views in which the stimulus is presented to VisNet. At each presentation the activation of individual neurons is calculated, then the neuronal firing rates are calculated, and then the synaptic weights are updated. Each time a stimulus has been presented in all the training orientations, a new stimulus is chosen at random and the process repeated. The presentation of all the stimuli through all 5 orientations constitutes 1 epoch of training. In this manner the network was trained one layer at a time, starting with layer 1 and finishing with layer 4. In the investigations described here, the numbers of training epochs for layers 1–4 were 50, 100, 100, and 75, respectively.

In experiment 1, VisNet was trained in two stages. In the first stage, the 18 feature pairs were used as input stimuli, with each stimulus being presented to VisNet's retina in sequences of five orientations as described above. However, during this stage, learning was only allowed to take place in layers 1 and 2. This led to the formation of neurons which responded to the feature pairs with some rotation invariance in layer 2. In the second stage, we used the 6 feature triples as stimuli, with learning only allowed in layers 3 and 4. However, during this second training stage, the triples were only presented to VisNet's input retina in the first 4 orientations (i–iv). After the two stages of training were completed, Stringer and Rolls (2002) examined whether the output layer of VisNet had formed top layer neurons that responded invariantly to the 6 triples when presented in all 5 orientations, not just the 4 in which the triples had been presented during training. To provide baseline results for comparison, the results from experiment 1 were compared with results from experiment 2, which involved no training in layers 1–4, with the synaptic weights left unchanged from their initial random values.

In Figure 25 numerical results are given for the experiments described. On the left are the single cell information measures for all top (4th) layer neurons ranked in order of their invariance to the triples, while on the right are multiple cell information measures. To help to interpret these results we can compute the maximum single cell information measure according to

    Maximum single cell information = log2 (Number of triples),    (43)

where the number of triples is 6. This gives a maximum single cell information measure of 2.6 bits for these test cases. The information results from the experiment demonstrate that even with the triples presented to the network in only four of the five orientations during training, layer 4 is indeed capable of developing rotation-invariant neurons that can discriminate effectively between the 6 different feature triples in all 5 orientations, that is, with correct recognition from all five perspectives. In addition, the multiple cell information for the experiment reaches the maximal level of 2.6 bits, indicating that the network as a whole is capable of perfect discrimination between the 6 triples in any of the 5 orientations. These results may be compared with the very poor baseline performance from the control experiment, where no learning was allowed before testing.

FIGURE 25 | Learning 3D perspectival transforms of features. Numerical results for experiments 1 and 2: on the left are single cell information measures, and on the right are multiple cell information measures. (After Stringer and Rolls, 2002.)

Stringer and Rolls (2002) also performed a control experiment to show that the network really had learned invariant representations specific to the kinds of 3D deformations undergone by the surface features as the objects rotated in-depth. In the control experiment the network was trained on "spheres" with non-deformed surface features; and then, as predicted, the network failed to operate correctly when it was tested with objects with the features present in the transformed way that they appear on the surface of a real 3D object.

Stringer and Rolls (2002) were thus able to show how trace-learning can form neurons that respond invariantly to novel rotational within-generic view perspectives of an object, obtained by within-generic view 3D rotations of up to 30˚ from any view encountered during learning. They were able to show in addition that this could occur for a novel view of an object which was not an interpolation from previously shown views. This was possible given that the low-order feature combination sets from which an object was composed had been learned about in early layers of VisNet previously. The within-generic view transform-invariant object recognition described was achieved through the development of true 3-dimensional representations of objects based on 3-dimensional features and feature combinations, which, unlike 2-dimensional feature combinations, are invariant under moderate in-depth rotations of the object. Thus, in a sense, these rotation-invariant representations encode a form of 3-dimensional knowledge with which to interpret the visual input from the real world, and are able to provide a basis for robust rotation-invariant object recognition with novel perspectives. The particular finding in the work described here was that VisNet can learn how the surface features on 3D objects transform as the object is rotated in-depth, and can use knowledge of the characteristics of the transforms to perform 3D object recognition. The knowledge embodied in the network is knowledge of the 3D properties of objects, and in this sense assists the recognition of 3D objects seen from different views.

The process investigated by Stringer and Rolls (2002) will only allow invariant object recognition over moderate 3D object rotations, since rotating an object through a large angle may lead to a catastrophic change in the appearance of the object that requires the new qualitative 3D shape descriptors to be associated with those of the former view. In that case, invariant object recognition must rely on the first process referred to at the start of this section (Section 5.6) in order to associate together the different generic views of an object to produce view-invariant object identification. For that process, association of a few cardinal or generic views is likely to be sufficient (Koenderink, 1990). The process described in this section of learning how surface features transform is likely to make a major contribution to the within-generic view transform invariance of object identification and recognition.

5.7. CAPACITY OF THE ARCHITECTURE, AND INCORPORATION OF A TRACE RULE INTO A RECURRENT ARCHITECTURE WITH OBJECT ATTRACTORS
One issue that has not been considered extensively so far is the capacity of hierarchical feed-forward networks of the type exemplified by VisNet that are used for invariant object recognition. One approach to this issue is to note that VisNet operates in the general mode of a competitive network, and that the number of different stimuli that can be categorized by a competitive network is in the order of the number of neurons in the output layer (Rolls, 2008b). Given that the successive layers of the real visual system (V1, V2, V4, posterior inferior temporal cortex, anterior inferior temporal cortex) are of the same order of magnitude, VisNet is designed to work with the same number of neurons in each successive layer. (Of course the details are worth understanding further. V1 is, for example, somewhat larger than earlier layers, but on the other hand serves the dorsal as well as the ventral stream of visual cortical processing.) The hypothesis is that because of redundancies in the visual world, each layer of the system, by its convergence and competitive categorization, can capture sufficient of the statistics of the visual input at each stage to enable correct specification of the properties of the world that specify objects. For example, V1 does not compute all possible combinations of a few lateral geniculate inputs, but instead represents linear series of geniculate inputs to form edge-like and bar-like feature analyzers, which are the dominant arrangement of pixels found at the small scale in natural visual scenes. Thus the properties of the visual world at this stage can be captured by a small proportion of the total number of combinations that would be needed if the visual world were random. Similarly, at a later stage of processing, just a subset of all possible combinations of line or edge analyzers would be needed, partly because some combinations are much more frequent in the visual world, and partly because the coding, because of convergence, means that what is represented is for a larger area of visual space (that is, the receptive fields of the neurons are larger), which also leads to economy and limits what otherwise would be a combinatorial need for feature analyzers at later layers. The hypothesis thus is that the effects of redundancies in the input space of stimuli that result from the statistical properties of natural images (Field, 1987), together with the convergent architecture with competitive learning at each stage, produce a system that can perform invariant object recognition for large numbers of objects. Large in this case could be within one or two orders of magnitude of the number of neurons in any one layer of the network (or cortical area in the brain). The extent to which this can be realized can be explored with simulations of the type implemented in VisNet, in which the network can be trained with natural images which therefore reflect fully the natural statistics of the stimuli presented to the real brain.
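The statement that a competitive network can categorize on the order of as many stimuli as it has output neurons can be illustrated with a minimal winner-take-all learner. This is a sketch only: the deterministic, slightly asymmetric initial weights are chosen purely so the example is reproducible (VisNet itself uses random initial weights, soft competition, and other details not modeled here).

```python
def normalize(v):
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v]

def train_competitive(patterns, n_out, lr=0.5, epochs=10):
    """Winner-take-all competitive learning with weight normalization."""
    dim = len(patterns[0])
    # Deterministic, slightly asymmetric start (an assumption for this demo;
    # in practice the initial weights would be random).
    w = [normalize([1.0 + (0.1 if j == i else 0.0) for j in range(dim)])
         for i in range(n_out)]
    for _ in range(epochs):
        for p in patterns:
            # The neuron with the largest dot product wins...
            win = max(range(n_out),
                      key=lambda i: sum(a * b for a, b in zip(w[i], p)))
            # ...and only the winner moves its weights toward the input.
            w[win] = normalize([a + lr * b for a, b in zip(w[win], p)])
    return w

# 4 orthogonal stimuli, 4 output neurons: each stimulus captures its own
# neuron, so a layer of n neurons can categorize on the order of n stimuli.
pats = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
w = train_competitive(pats, n_out=4)
winners = [max(range(4), key=lambda i: sum(a * b for a, b in zip(w[i], p)))
           for p in pats]
print(winners)  # → [0, 1, 2, 3]
```

With correlated natural inputs rather than orthogonal ones, the same mechanism allocates neurons to the frequent input combinations, which is the statistical-capture argument made in the text.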
We should note that a rich variety of information in perceptual space may be represented by subtle differences in the distributed representation provided by the output of the visual system. At the same time, the actual number of different patterns that may be stored in, for example, a pattern associator connected to the output of the visual system is limited by the number of input connections per neuron from the output neurons of the visual system (Rolls, 2008b). One essential function performed by the ventral visual system is to provide an invariant representation which can be read by a pattern associator in such a way that if the pattern associator learns about one view of the object, then the visual system allows generalization to another view of the same object, because the same output neurons are activated by the different view. In the sense that any view can and must activate the same output neurons of the visual system (the input to the associative network), we can say the invariance is made explicit in the representation. Making some properties of an input representation explicit in an output representation has the major function of enabling associative networks that use visual inputs in, for example, recognition, episodic memory, emotion, and motivation to generalize correctly, that is, invariantly with respect to image transforms that are all consistent with the same object in the world (Rolls and Treves, 1998).

Another approach to the issue of the capacity of networks that use trace learning to associate together different instances (e.g., views) of the same object is to reformulate the issue in the context of autoassociation (attractor) networks, where analytic approaches to the storage capacity of the network are well developed (Amit, 1989; Rolls and Treves, 1998; Rolls, 2008b). This approach to the storage capacity of networks that associate together different instantiations of an object to form invariant representations has been developed by Parga and Rolls (1998) and Elliffe et al. (2000), and is described next.

In this approach, the storage capacity of a recurrent network which performs, for example, view-invariant recognition of objects by associating together different views of the same object which tend to occur close together in time was studied (Parga and Rolls, 1998; Elliffe et al., 2000). The architecture with which the invariance is computed is a little different to that described earlier. In the model of Rolls (1992, 1994, 1995), Wallis and Rolls (1997), Rolls and Milward (2000), and Rolls and Stringer (2006), the post-synaptic memory trace enabled different afferents from the preceding stage to modify their synapses onto the same post-synaptic neuron (see Figure 26). In that model there were no recurrent connections between the neurons, although such connections were one way in which it was postulated the memory trace might be implemented, by simply keeping the representation of one view or aspect active until the next view appeared. Then an association would occur between representations that were active close together in time (within, e.g., 100–300 ms).

FIGURE 26 | The learning scheme implemented in VisNet. A trace-learning rule is implemented in the feed-forward inputs to a competitive network.

In the model developed by Parga and Rolls (1998) and Elliffe et al. (2000), there is a set of inputs with fixed synaptic weights to a network. The network itself is a recurrent network, with a trace rule incorporated in the recurrent collaterals (see Figure 27). When different views of the same object are presented close together in time, the recurrent collaterals learn using the trace rule that the different views are of the same object. After learning, presentation of any of the views will cause the network to settle into an attractor that represents all the views of the object, that is, one which is a view-invariant representation of the object. (In this section, the different exemplars of an object which need to be associated together are called views, for simplicity, but could at earlier stages of the hierarchy represent, for example, similar feature combinations (derived from the same object) in different positions in space.)

FIGURE 27 | The learning scheme considered by Parga and Rolls (1998) and Elliffe et al. (2000). There are inputs to the network from the preceding stage via unmodifiable synapses, and a trace or pairwise associative learning rule is implemented in the recurrent collateral synapses of an autoassociative memory to associate together the different exemplars (e.g., views) of the same object.

We envisage a set of neuronal operations which set up a synaptic weight matrix in the recurrent collaterals by associating together, because of their closeness in time, the different views of the same object. In more detail, Parga and Rolls (1998) considered two main approaches. First, one could store in a synaptic weight matrix the s views of an object. This consists of equally associating all the views to each other, including the association of each view with itself. Choosing in Figure 28 an example such that objects are defined in terms of five different views, this might produce (if each view produced firing of one neuron at a rate of 1) a block of 5 × 5 pairs of views contributing to the synaptic efficacies, each with value 1. Object 2 might produce another block of synapses of value 1 further along the diagonal, and symmetric about it. Each object or memory could then be thought of as a single attractor with a distributed representation involving five elements (each element representing a different view).

FIGURE 28 | A schematic illustration of the first type of associations contributing to the synaptic matrix considered by Parga and Rolls (1998). Object 1 (O1) has five views labeled v1 to v5, etc. The matrix is formed by associating the pattern presented in the columns with itself, that is, with the same pattern presented as rows.

Then the capacity of the system in terms of the number P_o of objects that can be stored is just the number of separate attractors which can be stored in the network. For random fully distributed patterns this is, as shown numerically by Hopfield (1982),

    P_o = 0.14 C,    (44)

where there are C inputs per neuron (and N = C neurons if the network is fully connected). Now the synaptic matrix envisaged here does not consist of random fully distributed binary elements, but instead we will assume has a sparseness a = s/N, where s is the number of views stored for each object, from any of which the whole representation of the object must be recognized. In this case, one can show (Gardner, 1988; Tsodyks and Feigel'man, 1988; Treves and Rolls, 1991) that the number of objects that can be stored and correctly retrieved is

    P = k C / (a ln(1/a)),    (45)

where C is the number of synapses on each neuron devoted to the recurrent collaterals from other neurons in the network, and k is a factor that depends weakly on the detailed structure of the rate distribution, on the connectivity pattern, etc., but is approximately in the order of 0.2–0.3. A problem with this proposal is that as the number of views of each object increases to a large number (e.g., >20), the network will fail to retrieve correctly the internal representation of the object starting from any one view (which is only a fraction 1/s of the length of the stored pattern that represents an object).

The second approach, taken by Parga and Rolls (1998) and Elliffe et al. (2000), is to consider the operation of the network when the associations between pairs of views can be described by a matrix that has the general form shown in Figure 29. Such an association matrix might be produced by different views of an object appearing after a given view with equal probability, and synaptic modification occurring of the view with itself (giving rise to the diagonal term), and of any one view with that which immediately follows it.

FIGURE 29 | A schematic illustration of the second and main type of associations contributing to the synaptic matrix considered by Parga and Rolls (1998) and Elliffe et al. (2000). Object 1 (O1) has five views labeled v1 to v5, etc. The association of any one view with itself has strength 1, and of any one view with another view of the same object has strength b.

The same weight matrix might be produced not only by pairwise association of successive views, because the association rule allows for associations over the short-time scale of, e.g., 100–200 ms, but might also be produced if the synaptic trace had an exponentially decaying form over several hundred milliseconds, allowing associations with decaying strength between views separated by one or more intervening views. The existence of a regime, for values of the coupling parameter between pairs of views in a finite interval, such that the presentation of any of the views of one object leads to the same attractor regardless of the particular view chosen as a cue, is one of the issues treated by Parga and Rolls (1998) and Elliffe et al. (2000). A related problem also dealt with was the capacity of this type of synaptic matrix: how many objects can be stored and retrieved correctly in a view-invariant way? Parga and Rolls (1998) and Elliffe et al. (2000) showed that the number grows linearly with the number of recurrent collateral connections received by each neuron. To be particular, the number of objects that can be stored is 0.081 N/5 when there are five views of each object, and 0.073 N/11 when there are eleven views of each object. This is an interesting result in network terms, in that s views each represented by an independent random set of active neurons can, in the network described, be present in the same "object" attraction basin. It is also an interesting result in neurophysiological terms, in that the number of objects that can be represented in this network scales linearly with the number of recurrent connections per neuron. That is, the number of objects P that can be stored is approximately

    P = k C / s,    (46)

where C is the number of synapses on each neuron devoted to the recurrent collaterals from other neurons in the network, s is the number of views of each object, and k is a factor that is in the region of 0.07–0.09 (Parga and Rolls, 1998).
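To put rough numbers on equations (44)–(46), the following sketch evaluates them directly. The value C = 10,000 recurrent synapses per neuron is an assumed illustrative figure (not from the original analysis), and k is only known to within a range, so the defaults below are mid-range choices.

```python
import math

def hopfield_capacity(C):
    """Eq. (44): separate attractors for fully distributed random patterns."""
    return 0.14 * C

def sparse_capacity(C, a, k=0.25):
    """Eq. (45): capacity with sparseness a = s/N; k roughly 0.2-0.3."""
    return k * C / (a * math.log(1.0 / a))

def object_capacity(C, s, k=0.08):
    """Eq. (46): view-invariant objects of s views each; k roughly 0.07-0.09."""
    return k * C / s

C = 10_000                                  # assumed synapses per neuron
print(round(hopfield_capacity(C)))          # → 1400 attractors
print(round(object_capacity(C, s=5)))       # → 160 five-view objects
# With N = C, eq. (46) with k = 0.081 reproduces the quoted 0.081*N/5:
print(round(0.081 * C / 5))                 # → 162
```

The linear scaling with C, and the division by the number of views s, are the two properties the text emphasizes: capacity is set by connections per neuron, and is traded off against the number of exemplars folded into each object attractor.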
Although the explicit numerical calculation was done for a rather small number of views for each object (up to 11), the basic result, that the network can support this kind of "object" phase, is expected to hold for any number of views (the only requirement being that it does not increase with the number of neurons). This is of course enough: once an object is defined by a set of views, when the network is presented with a somewhat different stimulus or a noisy version of one of them, it will still be in the attraction basin of the object attractor. Some of the groundwork for this approach was laid by the work of Amit and collaborators (Amit, 1989; Griniasty et al., 1993).

A variant of the second approach is to consider that the remaining entries in the matrix shown in Figure 29 all have a small value. This would be produced by the fact that sometimes a view of one object would be followed by a view of a different object when, for example, a large saccade was made, with no explicit resetting of the trace. On average, any one object would follow another rarely, and so the case is considered when all the remaining associations between pairs of views have a low value.

Parga and Rolls (1998) and Elliffe et al. (2000) were able to show that invariant object recognition is feasible in attractor neural networks in the way described. The system is able to store and retrieve in a view-invariant way an extensive number of objects, defined by a finite set of views. What is implied by extensive is that the number of objects is proportional to the size of the network. The crucial factor that defines this size is the number of connections per neuron. In the case of the fully connected networks considered in this section, the size is thus proportional to the number of neurons. Parga and Rolls (1998) thus showed that multiple (e.g., "view") patterns could be within the basin of attraction of a shared (e.g., "object") representation, and that the capacity of the system was proportional to the number of synapses per neuron divided by the number of views of each object.

Elliffe et al. (2000) extended the analysis of Parga and Rolls (1998) by showing that correct retrieval could occur where the retrieval "view" cues were distorted; where there was some association between the views of different objects; and where there was only partial, and indeed asymmetric, connectivity provided by the associatively modified recurrent collateral connections in the network. The simulations also extended the analysis by showing that the system can work well with sparse patterns, and indeed that the use of sparse patterns increases (as expected) the number of objects that can be stored in the network.

Taken together, the work described by Parga and Rolls (1998) and Elliffe et al. (2000) introduced the idea that the trace rule used to build invariant representations could be implemented in the recurrent collaterals of a neural network (as well as, or as an alternative to, its incorporation in the forward connections from one layer to another in VisNet), and provided a precise analysis of the capacity of the network if it operated in this way. In the brain, it is likely that the recurrent collateral connections between cortical pyramidal cells in visual cortical areas do contribute to building invariant representations, in that if they are associatively modifiable, as seems likely, and because there is continuing firing for typically 100–300 ms after a stimulus has been shown, associations between different exemplars of the same object that occur together close in time would almost necessarily become built into the recurrent synaptic connections between pyramidal cells.

This mechanism would be appropriate when a small number of representations need to be associated together to represent an object. One example is associating together what is seen when an object is viewed from different perspectives. Another example is scale, with respect to which neurons early in the visual system tolerate scale changes of approximately 1.5 octaves, so that the whole scale range could be covered by associating together a limited number of such representations (see Chapter 5 of Rolls and Deco (2002) and Figure 1). The mechanism would not be so suitable when a large number of different instances would need to be associated together to form an invariant representation of objects, as might be needed for translation invariance. For the latter, the standard model of VisNet with the associative trace-learning rule implemented in the feed-forward connections (or trained by continuous spatial transformation learning as described in Section 5.10) would be more appropriate. However, both types of mechanism, with the trace rule in the feed-forward or in the recurrent collateral synapses, could contribute (separately or together) to achieve invariant representations. Part of the interest of the attractor approach described in this section is that it allows analytic investigation.

Invariant representation of faces in the context of attractor neural networks has also been discussed by Bartlett and Sejnowski (1997) in terms of a model where different views of faces are presented in a fixed sequence (Griniasty et al., 1993). This is not, however, the general situation; normally any pair of views can be seen consecutively, and they will become associated. The model described by Parga and Rolls (1998) treats this more general situation. I wish to note the different nature of the invariant object recognition problem studied here and that of the paired associate learning task studied by Miyashita (1988), Miyashita and Chang (1988), and Sakai and Miyashita (1991). In the invariant object recognition case no particular learning protocol is required to produce an activity of the inferior temporal cortex cells responsible for invariant object recognition that is maintained for 300 ms. The learning can occur rapidly, and the learning occurs between stimuli (e.g., different views) which occur with no intervening delay. In the paired associate task, which had the aim of providing a model of semantic memory, the monkeys must learn to associate together two stimuli that are separated in time (by a number of seconds), and this type of learning can take weeks to train.

Another approach to training invariance is the purely associative mechanism of continuous spatial transformation learning, described in Section 5.10. With this training procedure, the capacity is increased with respect to the number of training locations, with, for example, 169 training locations producing translation-invariant representations for two face stimuli (Perry et al., 2010). When we scaled up the 32 × 32 VisNet used for most of the investigations described here to 128 × 128 neurons per layer in the VisNetL specified in Table 1, it was demonstrated that perfect translation-invariant representations were produced over at least 1,089 locations for 5 objects. Thus the indications are that scaling up the size of VisNet does markedly improve performance, and in this case allows invariant representations for 5 objects across more than 1,000 locations to be trained with continuous spatial transformation learning (Perry et al., 2010). It will be of interest in future research to investigate how the VisNet architecture, whether trained with a trace or a purely associative rule, scales up with respect to capacity as the number of neurons in the system increases further.
More distributed repre- period the sustained activity is rather low in the experiments, and sentations in the output layer may also help to increase the capacity. thus the representation of the first stimulus that remains is weak, In recent investigations, we have been able to train VisNetL (i.e., and can only poorly be associated with the second stimulus. How- 128 128 neurons in each layer, a 256 256 input image, and 8 ever, formally the learning mechanism could be treated in the same spatial frequencies for the Gabor filters as shown in Table 4) on way as that used by Parga and Rolls (1998) for invariant object a view-invariance learning problem, and have found good scal- recognition. The experimental difference is just that in the paired ing up with respect to the original VisNet (i.e., 32 32 neurons associate task used by Miyashita et al., it is the weak memory of in each layer, a 64 64 input image, and 4 spatial frequencies for the first stimulus that is associated with the second stimulus. In the filters). For example, VisNetL can learn with the trace rule contrast, in the invariance learning, it would be the firing activity perfect invariant representations of 32 objects each shown in 24 being produced by the first stimulus (not the weak memory of views (T. J. Webb and E. T. Rolls, recent observations). The objects the first stimulus) that can be associated together. It is possible were made with Blender 3D modeling software, so the image views that the perirhinal cortex makes a useful contribution to invariant generated were carefully controlled for lighting, background inten- object recognition by providing a short-term memory that helps sity, etc. When trained on half of these views for each object, successive views of the same objects to become associated together with the other half used for cross-validation testing, the perfor- (Buckley et al., 2001; Rolls et al., 2005a). 
mance was reasonable at approximately 68% correct for the 32 The mechanisms described here using an attractor network objects, and having the full set of 8 spatial frequencies did improve with a trace associative learning rule would apply most naturally performance. Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 46 Rolls Invariant visual object recognition 5.8. VISION IN NATURAL SCENES – EFFECTS OF BACKGROUND in complex natural backgrounds. The monkey had to search for VERSUS ATTENTION two objects on a screen, and a touch of one object was rewarded Object-based attention refers to attention to an object. For exam- with juice, and of another object was punished with saline (see ple, in a visual search task the object might be specified as what Figure 3 for a schematic illustration and Figure 30 for a version of should be searched for, and its location must be found. In spa- the display with examples of the stimuli shown to scale). Neuronal tial attention, a particular location in a scene is pre-cued, and the responses to the effective stimuli for the neurons were compared object at that location may need to be identified. Here we consider when the objects were presented in the natural scene or on a plain some of the neurophysiology of object selection and attention in background. It was found that the overall response of the neuron the context of a feature hierarchy approach to invariant object to objects was hardly reduced when they were presented in natural recognition. The computational mechanisms of attention, includ- scenes, and the selectivity of the neurons remained. However, the ing top-down biased competition, are described elsewhere (Rolls main finding was that the magnitudes of the responses of the neu- and Deco, 2002; Deco and Rolls, 2005b; Rolls, 2008b). rons typically became much less in the real scene the further the monkey fixated in the scene away from the object (see Figure 4). A 5.8.1. 
Neurophysiology of object selection and translation small receptive field size has also been found in inferior temporal invariance in the inferior temporal visual cortex cortex neurons when monkeys have been trained to discriminate Much of the neurophysiology, psychophysics, and modeling of closely spaced small visual stimuli (DiCarlo and Maunsell, 2003). attention has been with a small number, typically two, of objects It is proposed that this reduced translation invariance in natural in an otherwise blank scene. In this Section, I consider how atten- scenes helps an unambiguous representation of an object which tion operates in complex natural scenes, and in particular describe may be the target for action to be passed to the brain regions how the inferior temporal visual cortex operates to enable the that receive from the primate inferior temporal visual cortex. It selection of an object in a complex natural scene (see also Rolls helps with the binding problem, by reducing in natural scenes the and Deco, 2006). The inferior temporal visual cortex contains dis- effective receptive field of at least some inferior temporal cortex tributed and invariant representations of objects and faces (Rolls neurons to approximately the size of an object in the scene. and Baylis, 1986; Hasselmo et al., 1989a; Tovee et al., 1994; Rolls It is also found that in natural scenes, the effect of object-based and Tovee, 1995b; Rolls et al., 1997b; Booth and Rolls, 1998; Rolls, attention on the response properties of inferior temporal cortex 2000, 2007a,b,c, 2011b; Rolls and Deco, 2002; Rolls and Treves, neurons is relatively small, as illustrated in Figure 31 (Rolls et al., 2011). 2003). To investigate how attention operates in complex natural scenes, and how information is passed from the inferior temporal 5.8.2. 
Attention and translation invariance in natural scenes – a cortex (IT) to other brain regions to enable stimuli to be selected computational account from natural scenes for action, Rolls et al. (2003) analyzed the The results summarized in Figure 31 for 5˚ stimuli show that the responses of inferior temporal cortex neurons to stimuli presented receptive fields were large (77.6˚) with a single stimulus in a blank FIGURE 30 | The visual search task. The monkey had to search for and object is present (a bottle) which the monkey must not touch. The stimuli touch an object (in this case a banana) when shown in a complex natural are shown to scale. The screen subtended 70˚ 55˚ (After Rolls et al., scene, or when shown on a plain background. In each case a second 2003.) Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 47 Rolls Invariant visual object recognition firing rate in the complex background when the effective stimu- lus was selected for action (bottom right, 19.2˚), and when it was not (middle right, 15.6˚; Rolls et al., 2003). (For comparison, the effects of attention against a blank background were much larger, with the receptive field increasing from 17.2˚ to 47.0˚ as a result of object-based attention, as shown in Figure 31, left middle and bottom.) Trappenberg et al. (2002) have suggested what underlying mechanisms could account for these findings, and simulated a model to test the ideas. The model utilizes an attractor network representing the inferior temporal visual cortex (implemented by the recurrent connections between inferior temporal cortex neurons), and a neural input layer with several retinotopically organized modules representing the visual scene in an earlier visual cortical area such as V4 (see Figure 32). 
The attractor network aspect of the model produces the property that the receptive fields of IT neurons can be large in blank scenes by enabling a weak input in the periphery of the visual field to act as a retrieval cue for the object attractor. On the other hand, when the object is shown in a complex background, the object closest to the fovea tends to act as the retrieval cue for the attractor, because the fovea is given increased weight in activating the IT module because the magni- tude of the input activity from objects at the fovea is greatest due to the higher magnification factor of the fovea incorporated into the model. This results in smaller receptive fields of IT neurons in complex scenes, because the object tends to need to be close to the fovea to trigger the attractor into the state representing that object. (In other words, if the object is far from the fovea, then it will not trigger neurons in IT which represent it, because neurons in IT are preferentially being activated by another object at the fovea.) This may be described as an attractor model in which the competition for which attractor state is retrieved is weighted toward objects at the fovea. Attentional top-down object-based inputs can bias the com- petition implemented in this attractor model, but have relatively minor effects (in, for example, increasing receptive field size) when they are applied in a complex natural scene, as then as usual the stronger forward inputs dominate the states reached. In this net- work, the recurrent collateral connections may be thought of as implementing constraints between the different inputs present, to help arrive at firing in the network which best meets the con- FIGURE 31 | Summary of the receptive field sizes of inferior temporal straints. In this scenario, the preferential weighting of objects close cortex neurons to a 5˚ effective stimulus presented in either a blank background (blank screen) or in a natural scene (complex background). 
to the fovea because of the increased magnification factor at the The stimulus that was a target for action in the different experimental fovea is a useful principle in enabling the system to provide use- conditions is marked by T. When the target stimulus was touched, a reward ful output. The attentional object biasing effect is much more was obtained. The mean receptive field diameter of the population of marked in a blank scene, or a scene with only two objects present neurons analyzed, and the mean firing rate in spikes/s, is shown. The at similar distances from the fovea, which are conditions in which stimuli subtended 5˚ 3.5˚ at the retina, and occurred on each trial in a random position in the 70˚ 55˚ screen. The dashed circle is proportional to attentional effects have frequently been examined. The results of the receptive field size. Top row: responses with one visual stimulus in a the investigation (Trappenberg et al., 2002) thus suggest that top- blank (left) or complex (right) background. Middle row: responses with two down attention may be a much more limited phenomenon in stimuli, when the effective stimulus was not the target of the visual search. complex, natural, scenes than in reduced displays with one or two Bottom row: responses with two stimuli, when the effective stimulus was objects present. The results also suggest that the alternative prin- the target of the visual search. (After Rolls et al., 2003.) ciple, of providing strong weight to whatever is close to the fovea, is an important principle governing the operation of the inferior background (top left), and were greatly reduced in size (to 22.0˚) temporal visual cortex, and in general of the output of the visual when presented in a complex natural scene (top right). The results system in natural environments. 
This principle of operation is also show that there was little difference in receptive field size or very important in interfacing the visual system to action systems, Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 48 Rolls Invariant visual object recognition actions are made to locations not being looked at. However, the simulations described in this section suggest that in any case covert attention is likely to be a much less significant influence on visual processing in natural scenes than in reduced scenes with one or two objects present. Given these points, one might question why inferior temporal cortex neurons can have such large receptive fields, which show translation invariance. At least part of the answer to this may be that inferior temporal cortex neurons must have the capability to be large if they are to deal with large objects. A V1 neuron, with its small receptive field, simply could not receive input from all the features necessary to define an object. On the other hand, inferior temporal cortex neurons may be able to adjust their size to approximately the size of objects, using in part the interactive effects involved in attention (Rolls, 2008b), and need the capabil- ity for translation invariance because the actual relative positions of the features of an object could be at different relative positions in the scene. For example, a car can be recognized whichever way it is viewed, so that the parts (such as the bonnet or hood) must be identifiable as parts wherever they happen to be in the image, though of course the parts themselves also have to be in the correct relative positions, as allowed for by the hierarchical feature analysis architecture described in this paper. Some details of the simulations follow. 
Each independent mod- ule within “V4” in Figure 32 represents a small part of the visual field and receives input from earlier visual areas represented by an input vector for each possible location which is unique for each object. Each module was 6˚ in width, matching the size of the objects presented to the network. For the simulations Trappen- berg et al. (2002) chose binary random input vectors representing V4 V4 objects with N a components set to ones and the remaining V4 V4 V4 N (1 a ) components set to zeros. N is the number of nodes V4 in each module and a is the sparseness of the representation V4 which was set to be a D 0.2 in the simulations. The structure labeled “IT” represents areas of visual association cortex such as the inferior temporal visual cortex and cortex in the anterior part of the superior temporal sulcus in which neurons provide distributed representations of faces and objects (Booth and Rolls, 1998; Rolls, 2000). Nodes in this structure are governed by leaky integrator dynamics with time constant IT dh .t/ FIGURE 32 | The architecture of the inferior temporal cortex (IT) model i IT IT IT IT D h .t/C w c y .t/ i ij j of Trappenberg et al. (2002) operating as an attractor network with dt inputs from the fovea given preferential weighting by the greater magnification factor of the fovea. The model also has a top-down ITV4 V4 IT_BIAS OBJ C w y .t/C k I . (47) k i ik object-selective bias input. The model was used to analyze how object vision and recognition operate in complex natural scenes. IT The firing rate y of the i th node is determined by a sigmoidal IT because the effective stimulus in making inferior temporal cortex function from the activation h as follows neurons fire is in natural scenes usually on or close to the fovea. 
This means that the spatial coordinates of where the object is in IT y .t/ D   , (48) IT the scene do not have to be represented in the inferior temporal 1C exp 2 h .t/ visual cortex, nor passed from it to the action selection system, as the latter can assume that the object making IT neurons fire is where the parameters D 1 and D 1 represent the gain and the close to the fovea in natural scenes. bias, respectively. There may of course be in addition a mechanism for object The recognition functionality of this structure is modeled as an selection that takes into account the locus of covert attention when attractor neural network (ANN) with trained memories indexed Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 49 Rolls Invariant visual object recognition by  representing particular objects. The memories are formed through Hebbian learning on sparse patterns, IT IT IT IT w D k  a  a , (49) ij i j IT where k (set to 1 in the simulations) is a normalization constant IT that depends on the learning rate, a D 0.2 is the sparseness of the training pattern in IT, and  are the components of the pattern IT used to train the network. The constant c in equation (47) rep- resents the strength of the activity-dependent global inhibition simulating the effects of inhibitory interneurons. The external FIGURE 33 | Correlations as measured by the normalized dot product OBJ between the object vector used to train IT and the state of the IT “top-down” input vector I produces object-selective inputs, network after settling into a stable state with a single object in the which are used as the attentional drive when a visual search task visual scene (blank background) or with other trained objects at all is simulated. The strength of this object bias is modulated by the possible locations in the visual scene (natural background). There is no IT_BIAS value of k in equation (47). 
object bias included in the results shown in graph (A), whereas an object ITV4 IT_BIAS bias is included in the results shown in (B) with k D 0.7 in the The weights w between the V4 nodes and IT nodes were ij IT_BIAS experiments with a natural background and k D 0.1 in the experiments trained by Hebbian learning of the form with a blank background. (After Trappenberg et al., 2002.) ITV4 ITV4 V 4 IT w D k .k/  a  a . (50) ij i j IT_BIAS value of the object bias k was set to 0 in these simulations. to produce object representations in IT based on inputs in V4. The Good object retrieval (indicated by large correlations) was found ITV4 normalizing modulation factor k (k ) allows the gain of inputs even when the object was far from the fovea, indicating large IT to be modulated as a function of their distance from the fovea, and receptive fields with a blank background. The reason that any drop depends on the module k to which the presynaptic node belongs. is seen in performance as a function of eccentricity is because flip- The model supports translation-invariant object recognition of a ping 2% of the bits outside the object introduces some noise into single object in the visual field if the normalization factor is the the recall process. This demonstrates that the attractor dynamics same for each module and the model is trained with the objects can support translation-invariant object recognition even though placed at every possible location in the visual field. The translation the translation-invariant weight vectors between V4 and IT are ITV4 invariance of the weight vectors between each “V4” module and explicitly modulated by the modulation factor k derived from the IT nodes is however explicitly modulated in the model by the the cortical magnification factor. ITV4 module-dependent modulation factor k (k ) as indicated in In a second simulation individual objects were placed at all Figure 32 by the width of the lines connecting V4 with IT. 
The possible locations in a natural and cluttered visual scene. The strength of the foveal V4 module is strongest, and the strength resulting correlations between the target pattern and the asymp- decreases for modules representing increasing eccentricity. The totic IT state are shown in Figure 33A with the line labeled “natural form of this modulation factor was derived from the parameter- background.” Many objects in the visual scene are now competing ization of the cortical magnification factors given by Dow et al. for recognition by the attractor network, and the objects around (1981). the foveal position are enhanced through the modulation fac- To study the ability of the model to recognize trained objects tor derived from the cortical magnification factor. This results at various locations relative to the fovea the system was trained in a much smaller size of the receptive field of IT neurons when on a set of objects. The network was then tested with distorted measured with objects in natural backgrounds. versions of the objects, and the “correlation” between the target In addition to this major effect of the background on the size object and the final state of the attractor network was taken as a of the receptive field, which parallels and may account for the measure of the performance. The correlation was estimated from physiological findings outlined above and in Section 5.8.1, there the normalized dot product between the target object vector that is also a dependence of the size of the receptive fields on the level was used during training the IT network, and the state of the IT of object bias provided to the IT network. Examples are shown in network after a fixed amount of time sufficient for the network Figure 33B where an object bias was used. The object bias biases to settle into a stable state. 
The objects were always presented on the IT network toward the expected object with a strength deter- ITBIAS backgrounds with some noise (introduced by flipping 2% of the mined by the value of k , and has the effect of increasing the bits in the scene which were not the test stimulus) in order to utilize size of the receptive fields in both blank and natural backgrounds the properties of the attractor network, and because the input to (see Figure 33B compared to Figure 33A). This models the effect IT will inevitably be noisy under normal conditions of operation. found neurophysiologically (Rolls et al., 2003). In the first simulation only one object was present in the visual Some of the conclusions are as follows (Trappenberg et al., scene in a plain (blank) background at different eccentricities from 2002). When single objects are shown in a scene with a blank the fovea. As shown in Figure 33A by the line labeled “blank back- background, the attractor network helps neurons to respond to an ground,” the receptive fields of the neurons were very large. The object with large eccentricities of this object relative to the fovea Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 50 Rolls Invariant visual object recognition of the agent. When the object is presented in a natural scene, other the inferior temporal cortex neuron specific for the object tested. neurons in the inferior temporal cortex become activated by the When no attentional object bias was introduced, a shrinkage of other effective stimuli present in the visual field, and these forward the receptive field size was observed in the complex vs the blank inputs decrease the response of the network to the target stimu- background. When attentional object bias was introduced, the lus by a competitive process. 
The results found fit well with the shrinkage of the receptive field due to the complex background was neurophysiological data, in that IT operates with almost complete somewhat reduced. This is consistent with the neurophysiological translation invariance when there is only one object in the scene, results (Rolls et al., 2003). In the framework of the model (Deco and reduces the receptive field size of its neurons when the object and Rolls, 2004), the reduction of the shrinkage of the receptive is presented in a cluttered environment. The model described here field is due to the biasing of the competition in the inferior tem- provides an explanation of the responses of real IT neurons in poral cortex layer in favor of the specific IT neuron tested, so that natural scenes. it shows more translation invariance (i.e., a slightly larger recep- In natural scenes, the model is able to account for the neuro- tive field). The increase of the receptive field size of an IT neuron, physiological data that the IT neuronal responses are larger when although small, produced by the external top-down attentional the object is close to the fovea, by virtue of fact that objects close to bias offers a mechanism for facilitation of the search for specific the fovea are weighted by the cortical magnification factor related objects in complex natural scenes (Rolls, 2008b). ITV4 modulation k . I note that it is possible that a “spotlight of attention” (Desi- The model accounts for the larger receptive field sizes from the mone and Duncan, 1995) can be moved covertly away from the fovea of IT neurons in natural backgrounds if the target is the fovea (Rolls, 2008b). However, at least during normal visual search object being selected compared to when it is not selected (Rolls tasks in natural scenes, the neurons are sensitive to the object at et al., 2003). 
which the monkey is looking, that is primarily to the object that is on the fovea, as shown by Rolls et al. (2003) and Aggelopoulos and Rolls (2005), and described in Sections 1 and 9.

The model accounts for this by an effect of top-down bias which simply biases the neurons toward particular objects, compensating for their decreasing inputs produced by the decreasing magnification factor modulation with increasing distance from the fovea. Such object-based attention signals could originate in the prefrontal cortex and could provide the object bias for the inferior temporal visual cortex (Renart et al., 2000; Rolls, 2008b).

Important properties of the architecture for obtaining the results just described are the high magnification factor at the fovea and the competition between the effects of different inputs, implemented in the above simulation by the competition inherent in an attractor network.

We have also been able to obtain similar results in a hierarchical feed-forward network in which each layer operates as a competitive network (Deco and Rolls, 2004). This network thus captures many of the properties of our hierarchical model of invariant object recognition (Rolls, 1992; Wallis and Rolls, 1997; Rolls and Milward, 2000; Stringer and Rolls, 2000, 2002; Rolls and Stringer, 2001, 2006, 2007; Elliffe et al., 2002; Rolls and Deco, 2002; Stringer et al., 2006), but incorporates in addition a foveal magnification factor and top-down projections with a dorsal visual stream so that attentional effects can be studied, as shown in Figure 34.

Deco and Rolls (2004) trained the network shown in Figure 34 with two objects, and used the trace-learning rule (Wallis and Rolls, 1997; Rolls and Milward, 2000) in order to achieve translation invariance. In a first experiment we placed only one object on the retina at different distances from the fovea (i.e., different eccentricities relative to the fovea). This corresponds to the blank background condition. In a second experiment, we also placed the object at different eccentricities relative to the fovea, but on a cluttered natural background. Larger receptive fields were found with the blank as compared to the cluttered natural background.

Deco and Rolls (2004) also studied the influence of object-based attentional top-down bias on the effective size of the receptive field of an inferior temporal cortex neuron for the case of an object in a blank or a cluttered background. To do this, they repeated the two simulations but now considered a non-zero top-down bias coming from prefrontal area 46v and impinging on the inferior temporal cortex.

Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 51

FIGURE 34 | Cortical architecture for hierarchical and attention-based visual perception after Deco and Rolls (2004). The system is essentially composed of five modules structured such that they resemble the two known main visual paths of the mammalian visual cortex. Information from the retino-geniculo-striate pathway enters the visual cortex through area V1 in the occipital lobe and proceeds into two processing streams. The occipital-temporal stream leads ventrally through V2–V4 and IT (inferior temporal visual cortex), and is mainly concerned with object recognition. The occipito-parietal stream leads dorsally into PP (posterior parietal complex), and is responsible for maintaining a spatial map of an object's location. The solid lines with arrows between levels show the forward connections, and the dashed lines the top-down backprojections. Short-term memory systems in the prefrontal cortex (PF46) apply top-down attentional bias to the object or spatial processing streams. (After Deco and Rolls, 2004.)

5.9. THE REPRESENTATION OF MULTIPLE OBJECTS IN A SCENE
When objects have distributed representations, there is a problem of how multiple objects (whether the same or different) can be represented in a scene, because the distributed representations overlap, and it may not be possible to determine whether one has an amalgam of several objects, or a new object (Mozer, 1991), or multiple instances of the same object, let alone the relative spatial positions of the objects in a scene. Yet humans can determine the relative spatial locations of objects in a scene even in short presentation times without eye movements (Biederman, 1972; and this has been held to involve some spotlight of attention). Aggelopoulos and Rolls (2005) analyzed this issue by recording from single inferior temporal cortex neurons with five objects simultaneously present in the receptive field. They found that although all the neurons responded to their effective stimulus when it was at the fovea, some could also respond to their effective stimulus when it was in some but not other parafoveal positions 10˚ from the fovea. An example of such a neuron is shown in Figure 35. The asymmetry is much more evident in a scene with 5 images present (Figure 35A) than when only one image is shown on an otherwise blank screen (Figure 35B). Competition between different stimuli in the receptive field thus reveals the asymmetry in the receptive field of inferior temporal visual cortex neurons.

FIGURE 35 | (A) The responses (firing rate with the spontaneous rate subtracted, means ± sem) of an inferior temporal cortex neuron when tested with 5 stimuli simultaneously present in the close (10˚) configuration with the parafoveal stimuli located 10˚ from the fovea. (B) The responses of the same neuron when only the effective stimulus was presented in each position. The firing rate for each position is that when the effective stimulus (in this case the hand) for the neuron was in that position. The p value is that from the ANOVA calculated over the four parafoveal positions. (After Aggelopoulos and Rolls, 2005.)

The asymmetry provides a way of encoding the position of multiple objects in a scene. Depending on which asymmetric neurons are firing, the population of neurons provides information to the next processing stage not only about which image is present at or close to the fovea, but where it is with respect to the fovea.

Simulations with VisNet with an added layer to simulate hippocampal scene memory have demonstrated that receptive field asymmetry appears when multiple objects are simultaneously present because of the probabilistic connectivity from the preceding stage, which introduces asymmetry that becomes revealed by the enhanced lateral inhibition when multiple objects are presented simultaneously (Rolls et al., 2008).

The information in the inferior temporal visual cortex is provided by neurons that have firing rates that reflect the relevant information, and stimulus-dependent synchrony is not necessary (Aggelopoulos and Rolls, 2005). Top-down attentional biasing input could thus, by biasing the appropriate neurons, facilitate bottom-up information about objects without any need to alter the time relations between the firing of different neurons. The exact position of the object with respect to the fovea, and effectively thus its spatial position relative to other objects in the scene, would then be made evident by the subset of asymmetric neurons firing. This is thus the solution that these experiments (Aggelopoulos and Rolls, 2005; Rolls et al., 2008) indicate is used for the representation of multiple objects in a scene, an issue that has previously been difficult to account for in neural systems with distributed representations (Mozer, 1991) and for which "attention" has been a proposed solution.

The learning of invariant representations of objects when multiple objects are present in a scene is considered in Section 5.5.2.

5.10. LEARNING INVARIANT REPRESENTATIONS USING SPATIAL CONTINUITY: CONTINUOUS SPATIAL TRANSFORMATION LEARNING
The temporal continuity typical of objects has been used in an associative learning rule with a short-term memory trace to help build invariant object representations in the networks described previously in this paper. Stringer et al. (2006) showed that spatial continuity can also provide a basis for helping a system to self-organize invariant representations. They introduced a new learning paradigm, "continuous spatial transformation (CT) learning," which operates by mapping spatially similar input patterns to the same post-synaptic neurons in a competitive learning system. As the inputs move through the space of possible continuous transforms (e.g., translation, rotation, etc.), the active synapses are modified onto the set of post-synaptic neurons. Because other transforms of the same stimulus overlap with previously learned exemplars, a common set of post-synaptic neurons is activated by the new transforms, and learning of the new active inputs onto the same post-synaptic neurons is facilitated.

The concept is illustrated in Figure 36. During the presentation of a visual image at one position on the retina that activates neurons in layer 1, a small winning set of neurons in layer 2 will modify (through associative learning) their afferent connections from layer 1 to respond well to that image in that location.
When the same image appears later at nearby locations, so that there is spatial continuity, the same neurons in layer 2 will be activated because some of the active afferents are the same as when the image was in the first position. The key point is that if these afferent connections have been strengthened sufficiently while the image is in the first location, then these connections will be able to continue to activate the same neurons in layer 2 when the image appears in overlapping nearby locations. Thus the same neurons in the output layer have learned to respond to inputs that have similar vector elements in common.

As can be seen in Figure 36, the process can be continued for subsequent shifts, provided that a sufficient proportion of input cells stay active between individual shifts. This whole process is repeated throughout the network, both horizontally as the image moves on the retina, and hierarchically up through the network. Over a series of stages, transform-invariant (e.g., location-invariant) representations of images are successfully learned, allowing the network to perform invariant object recognition. A similar CT learning process may operate for other kinds of transformation, such as change in view or size.

FIGURE 36 | An illustration of how continuous spatial transformation (CT) learning would function in a network with a single layer of forward synaptic connections between an input layer of neurons and an output layer. Initially the forward synaptic weights are set to random values. The top part (A) shows the initial presentation of a stimulus to the network in position 1. Activation from the (shaded) active input cells is transmitted through the initially random forward connections to stimulate the cells in the output layer. The shaded cell in the output layer wins the competition in that layer. The weights from the active input cells to the active output neuron are then strengthened using an associative learning rule. The bottom part (B) shows what happens after the stimulus is shifted by a small amount to a new partially overlapping position 2. As some of the active input cells are the same as those that were active when the stimulus was presented in position 1, the same output cell is driven by these previously strengthened afferents to win the competition again. The rightmost shaded input cell activated by the stimulus in position 2, which was inactive when the stimulus was in position 1, now has its connection to the active output cell strengthened (denoted by the dashed line). Thus the same neuron in the output layer has learned to respond to the two input patterns that have similar vector elements in common. As can be seen, the process can be continued for subsequent shifts, provided that a sufficient proportion of input cells stay active between individual shifts. (After Stringer et al., 2006.)

Stringer et al. (2006) demonstrated that VisNet can be trained with continuous spatial transformation learning to form view-invariant representations. They showed that CT learning requires the training transforms to be relatively close together spatially, so that spatial continuity is present in the training set, and that the order of stimulus presentation is not crucial, with even interleaving with other objects possible during training, because it is spatial continuity rather than temporal continuity that drives the self-organizing learning with the purely associative synaptic modification rule.

Perry et al. (2006) extended these simulations with VisNet of view-invariant learning using CT to more complex 3D objects, and, using the same training images in human psychophysical investigations, showed that view-invariant object learning can occur when spatial but not temporal continuity applies in a training condition in which the images of different objects were interleaved. However, they also found that the human view-invariance learning was better if sequential presentation of the images of an object was used, indicating that temporal continuity is an important factor in human invariance learning.

Perry et al. (2010) extended the use of continuous spatial transformation learning to translation invariance. They showed that translation-invariant representations can be learned by continuous spatial transformation learning; that the transforms must be close for this to occur; that the temporal order of presentation of each transformed image during training is not crucial for learning to occur; that relatively large numbers of transforms can be learned; and that such continuous spatial transformation learning can be usefully combined with temporal trace training.
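The CT mechanism described above can be illustrated with a toy one-layer competitive network. This is a minimal sketch, not the actual VisNet or Stringer et al. (2006) code; the layer sizes, learning rate, and stimulus width are illustrative assumptions. Purely associative learning plus winner-take-all competition is enough for one output neuron to come to respond to all overlapping shifts of a stimulus:

```python
import numpy as np

rng = np.random.default_rng(0)

N_IN, N_OUT = 100, 20     # input and output layer sizes (toy values)
STIM_W = 10               # width of the active block of input cells
ETA = 0.5                 # learning rate (illustrative)

# Initially random forward weights, kept at unit length so that the
# competition compares pattern match rather than raw weight size.
W = rng.random((N_OUT, N_IN))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def stimulus(pos):
    """Binary input vector: STIM_W contiguous active cells starting at pos."""
    x = np.zeros(N_IN)
    x[pos:pos + STIM_W] = 1.0
    return x

def present(x):
    """One competitive-learning step: winner-take-all, then Hebbian update."""
    winner = int(np.argmax(W @ x))
    W[winner] += ETA * x                     # associative strengthening
    W[winner] /= np.linalg.norm(W[winner])   # synaptic weight normalization
    return winner

# Present the stimulus at a sequence of overlapping positions (shift of 1,
# so 9 of the 10 active cells are shared between successive transforms).
winners = [present(stimulus(pos)) for pos in range(0, 30)]
print(set(winners))
```

Because successive transforms share most of their active inputs, the synapses strengthened at one position keep the same output neuron winning at the next position, so a single translation-invariant output neuron emerges with no temporal trace required.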
5.11. LIGHTING INVARIANCE
Object recognition should occur correctly even despite variations of lighting. In an investigation of this, Rolls and Stringer (2006) trained VisNet on a set of 3D objects generated with OpenGL in which the viewing angle and lighting source could be independently varied (see Figure 37). After training with the trace rule on all the 180 views (separated by 1˚, and rotated about the vertical axis in Figure 37) of each of the four objects under the left lighting condition, we tested whether the network would recognize the objects correctly when they were shown again, but with the source of the lighting moved to the right so that the objects appeared different (see Figure 37). With this protocol, lighting-invariant object recognition by VisNet was demonstrated (Rolls and Stringer, 2006).
FIGURE 37 | Lighting invariance. VisNet was trained on a set of 3D objects (cube, tetrahedron, octahedron, and torus) generated with OpenGL, in which for training the objects had left lighting, and for testing the objects had right lighting. Just one view of each object is shown in the Figure, but for training and testing 180 views of each object separated by 1˚ were used. (After Rolls and Stringer, 2006.)

Some insight into the good performance with a change of lighting is that some neurons in the inferior temporal visual cortex respond to the outlines of 3D objects (Vogels and Biederman, 2002), and these outlines will be relatively consistent across lighting variations. Although the features about the object represented in VisNet will include more than the representations of the outlines, the network may, because it uses distributed representations of each object, generalize correctly provided that some of the features are similar to those present during training. Under very difficult lighting conditions, it is likely that the performance of the network could be improved by including variations in the lighting during training, so that the trace rule could help to build representations that are explicitly invariant with respect to lighting.

5.12. INVARIANT GLOBAL MOTION IN THE DORSAL VISUAL SYSTEM
A key issue in understanding the cortical mechanisms that underlie motion perception is how we perceive the motion of objects such as a rotating wheel invariantly with respect to position on the retina, and size. For example, we perceive the wheel shown in Figure 38A rotating clockwise independently of its position on the retina. This occurs even though the local motion for the wheels in the different positions may be opposite. How could this invariance of the visual motion perception of objects arise in the visual system?

Invariant motion representations are known to be developed in the cortical dorsal visual system. Motion-sensitive neurons in V1 have small receptive fields (in the range 1–2˚ at the fovea), and can therefore not detect global motion; this is part of the aperture problem (Wurtz and Kandel, 2000b). Neurons in MT, which receives inputs from V1 and V2, have larger receptive fields (e.g., 5˚ at the fovea), and are able to respond to planar global motion, such as a field of small dots in which the majority (in practice as few as 55%) move in one direction, or to the overall direction of a moving plaid, the orthogonal grating components of which have motion at 45˚ to the overall motion (Movshon et al., 1985; Newsome et al., 1989). Further on in the dorsal visual system, some neurons in macaque visual area MST (but not MT) respond to rotating flow fields or looming with considerable translation invariance (Graziano et al., 1994; Geesaman and Andersen, 1996). In the cortex in the anterior part of the superior temporal sulcus, which is a convergence zone for inputs from the ventral and dorsal visual systems, some neurons respond to object-based motion, for example, to a head rotating clockwise but not anticlockwise, independently of whether the head is upright or inverted, which reverses the optic flow across the retina (Hasselmo et al., 1989b).

In a unifying hypothesis with the design of the ventral cortical visual system, Rolls and Stringer (2007) proposed that the dorsal visual system uses a hierarchical feed-forward network architecture (V1, V2, MT, MSTd, parietal cortex) with training of the connections with a short-term memory trace associative synaptic modification rule to capture what is invariant at each stage. The principle is illustrated in Figure 38A. Simulations showed that the proposal is computationally feasible, in that invariant representations of the motion flow fields produced by objects self-organize in the later layers of the architecture (see examples in Figures 38B–E). The model produces invariant representations of the motion flow fields produced by global in-plane motion of an object, in-plane rotational motion, and looming vs receding of the object. The model also produces invariant representations of object-based rotation about a principal axis. Thus it is proposed that the dorsal and ventral visual systems may share some unifying computational principles (Rolls and Stringer, 2007). Indeed, the simulations of Rolls and Stringer (2007) used a standard version of VisNet, with the exception that instead of using oriented bar receptive fields as the input to the first layer, local motion flow fields provided the inputs.

FIGURE 38 | (A) Two rotating wheels at different locations rotating in opposite directions. The local flow field is ambiguous. Clockwise or counterclockwise rotation can only be diagnosed by a global flow computation, and it is shown how the network is expected to solve the problem to produce position-invariant global motion-sensitive neurons. One rotating wheel is presented at any one time, but the need is to develop a representation of the fact that in the case shown the rotating flow field is always clockwise, independently of the location of the flow field. (B–D) Translation invariance, with training on 9 locations. (B) Single cell information measures showing that some layer 4 neurons have perfect performance of 1 bit (clockwise vs anticlockwise) after training with the trace rule, but not with random initial synaptic weights in the untrained control condition. (C) The multiple cell information measure shows that small groups of neurons have perfect performance. (D) Position invariance illustrated for a single cell from layer 4, which responded only to the clockwise rotation, and for every one of the 9 positions. (E) Size invariance illustrated for a single cell from layer 4, which after training with three different radii of rotating wheel responded only to anticlockwise rotation, independently of the size of the rotating wheels. (After Rolls and Stringer, 2007.)

6. LEARNING INVARIANT REPRESENTATIONS OF SCENES AND PLACES
The primate hippocampal system has neurons that respond to a view of a spatial scene, or when that location in a scene is being looked at in the dark or when it is obscured (Rolls et al., 1997a, 1998; Robertson et al., 1998; Georges-François et al., 1999; Rolls and Xiang, 2006; Rolls, 2008b). The representation is relatively invariant with respect to the position of the macaque in the environment, and of head direction, and eye position. The requirement for these spatial view neurons is that a position in the spatial scene is being looked at. (There is an analogous set of place neurons in the rat hippocampus that respond in this case when the rat is in a given position in space, relatively invariantly with respect to head direction (McNaughton et al., 1983; O'Keefe, 1984; Muller et al., 1991).) How might these spatial view neurons be set up in primates?

Before addressing this, it is useful to consider the difference between a spatial view or scene representation, and an object representation. An object can be moved to different places in space or in a spatial scene. An example is a motor car that can be moved to different places in space. The object is defined by a combination of features or parts in the correct relative spatial position, but its representation is independent of where it is in space. In contrast, a representation of space has objects in defined relative spatial positions, which cannot be moved relative to one another in space. An example might be Trafalgar Square, in which Nelson's column is in the middle, and the National Gallery and St Martin's in the Fields church are at set relative locations in space, and cannot be moved relative to one another. This draws out the point that there may be some computational similarities between the construction of an object and of a scene or a representation of space, but there are also important differences in how they are used. In the present context we are interested in how the brain may set up a spatial view representation in which the relative position of the objects in the scene defines the spatial view. That spatial view representation may be relatively invariant with respect to the exact position from which the scene is viewed (though extensions are needed if there are central objects in a space through which one moves).

It is now possible to propose a unifying hypothesis of the relation between the ventral visual system and primate hippocampal spatial view representations (Rolls, 2008b; Rolls et al., 2008). Let us consider a computational architecture in which a fifth layer is added to the VisNet architecture, as illustrated in Figure 39. In the anterior inferior temporal visual cortex, which corresponds to the fourth layer of VisNet, neurons respond to objects, but several objects close to the fovea (within approximately 10˚) can be represented because many object-tuned neurons have asymmetric receptive fields with respect to the fovea (Aggelopoulos and Rolls, 2005; see Section 5.9). If the fifth layer of VisNet performs the same operation as previous layers, it will form neurons that respond to combinations of objects in the scene with the positions of the objects relative spatially to each other incorporated into the representation (as described in Section 5.4). The result will be spatial view neurons in the case of primates, when the visual field of the primate has a narrow focus (due to the high-resolution fovea), and place cells when, as in the rat, the visual field is very wide (De Araujo et al., 2001; Rolls, 2008b). The trace-learning rule in layer 5 should help the spatial view or place fields that develop to be large and single, because of the temporal continuity that is inherent when the agent moves from one part of the view or place space to another, in the same way as has been shown for the entorhinal grid cell to hippocampal place cell mapping (Rolls et al., 2006b; Rolls, 2008b).

FIGURE 39 | Adding a fifth layer, corresponding to the parahippocampal gyrus/hippocampal system, after the inferior temporal visual cortex (corresponding to layer 4) may lead to the self-organization of spatial view/place cells in layer 5 when whole scenes are presented (see text). Left – as implemented in VisNet (layers 1–4). Convergence through the network is designed to provide fourth layer neurons with information from across the entire input retina. Right – as it occurs in the brain. V1, visual cortex area V1; TEO, posterior inferior temporal cortex; TE, inferior temporal cortex (IT). Convergence in the visual system is shown in the earlier layers.

The hippocampal dentate granule cells form a network expected to be important in this competitive learning of spatial view or place representations based on visual inputs. As the animal navigates through the environment, different spatial view cells would be formed. Because of the overlapping fields of adjacent spatial view neurons, and hence their coactivity as the animal navigates, recurrent collateral associative connections at the next stage of the system, CA3, could form a continuous attractor representation of the environment (Rolls, 2008b). We thus have a hypothesis for how the spatial representations are formed as a natural extension of the hierarchically organized competitive networks in the ventral visual system. The expression of such spatial representations in CA3 may be particularly useful for associating those spatial representations with other inputs, such as objects or rewards (Rolls, 2008b).

We have performed simulations to test this hypothesis with VisNet simulations with conceptually a fifth layer added (Rolls et al., 2008).
Training now with whole scenes that consist of a set of objects in a given fixed spatial relation to each other results in neurons in the added layer that respond to one of the trained whole scenes, but do not respond if the objects in the scene are rearranged to make a new scene from the same objects. The formation of these scene-specific representations in the added layer is related to the fact that in the inferior temporal cortex (Aggelopoulos and Rolls, 2005), and in the VisNet model (Rolls et al., 2008), the receptive fields of inferior temporal cortex neurons shrink and become asymmetric when multiple objects are present simultaneously in a natural scene. This also provides a solution to the issue of the representation of multiple objects, and their relative spatial positions, in complex natural scenes (Rolls, 2008b).

Consistently, in a more artificial network trained by gradient ascent with a goal function that included forming relatively time-invariant representations and decorrelating the responses of neurons within each layer of the 5-layer network, place-like cells were formed at the end of the network when the system was trained with a real or simulated robot moving through spatial environments (Wyss et al., 2006), and slowness as an asset in learning spatial representations has also been investigated by others (Wiskott and Sejnowski, 2002; Wiskott, 2003; Franzius et al., 2007). It will be interesting to test whether spatial view cells develop in a VisNet fifth layer if trained with foveate views of the environment, or place cells if trained with wide-angle views of the environment (cf. De Araujo et al., 2001); the utility of testing this with a VisNet-like architecture is that it embodies a biologically plausible implementation based on neuronally plausible competitive learning and a short-term memory trace-learning rule.

It is an interesting part of the hypothesis just described that, because spatial views and places are defined by the relative spatial positions of fixed landmarks (such as buildings), slow learning of such representations over a number of trials might be useful, so that the neurons come to represent spatial views or places, and do not learn to represent a random collection of moveable objects seen once in conjunction. In this context, an alternative brain region to the dentate gyrus for this next layer of VisNet-like processing might be the parahippocampal areas that receive from the inferior temporal visual cortex. Spatial view cells are present in the parahippocampal areas (Rolls et al., 1997a, 1998, 2005b; Robertson et al., 1998; Georges-François et al., 1999), and neurons with place-like fields (though in some cases as a grid, Hafting et al., 2005) are found in the rat medial entorhinal cortex (Moser and Moser, 1998; Brun et al., 2002; Fyhn et al., 2004; Moser, 2004). These spatial view and place-like representations could be formed in these regions as, effectively, an added layer to VisNet. Moreover, these cortical regions have recurrent collateral connections that could implement a continuous attractor representation. Alternatively, it is possible that these parahippocampal spatial representations reflect the effects of backprojections from the hippocampus to the entorhinal cortex and thus to parahippocampal areas. In either case, it is an interesting and unifying hypothesis that an effect of adding an additional layer to VisNet-like ventral stream visual processing might, with training in a natural environment, lead to the self-organization, using the same principles as in the ventral visual stream, of spatial view or place representations in parahippocampal or hippocampal areas (Rolls, 2008b; Rolls et al., 2008). Such spatial view representations are relatively invariant with respect to the position from which the scene is viewed (Georges-François et al., 1999), but are selective to the relative spatial position of the objects that define the spatial view (Rolls, 2008b; Rolls et al., 2008).

7. FURTHER APPROACHES TO INVARIANT OBJECT RECOGNITION
A related approach to invariant object recognition is described by Riesenhuber and Poggio (1999b), and builds on the hypothesis that not just shift invariance (as implemented in the Neocognitron of Fukushima (1980)), but also other invariances such as scale, rotation, and even view, could be built into a feature hierarchy system, as suggested by Rolls (1992) and incorporated into VisNet (Wallis et al., 1993; Wallis and Rolls, 1997; Rolls and Milward, 2000; Rolls and Stringer, 2007; Rolls, 2008b; see also Perrett and Oram, 1993). The approach of Riesenhuber and Poggio (1999b) and its developments (Riesenhuber and Poggio, 1999a, 2000; Serre et al., 2007a,b,c) is a feature hierarchy approach that uses alternate "simple cell" and "complex cell" layers in a way analogous to Fukushima (1980; see Figure 40).

The function of each "S" cell layer is to build more complicated features from the inputs, and it works by template matching. The function of each "C" cell layer is to provide some translation invariance over the features discovered in the preceding simple cell layer (as in Fukushima, 1980), and it operates by performing a MAX function on the inputs. The non-linear MAX function makes a complex cell respond only to whatever is the highest activity input being received, and is part of the process by which invariance is achieved according to this proposal. This C layer process involves "implicitly scanning over afferents of the same type differing in the parameter of the transformation to which responses should be invariant (for instance, feature size for scale invariance), and then selecting the best-matching afferent" (Riesenhuber and Poggio, 1999b). Brain mechanisms by which this computation could be set up are not part of the scheme, and the model does not incorporate learning in its architecture, so it does not yet provide a biologically plausible model of invariant object recognition. The model receives as its inputs a set of symmetric spatial-frequency filters that are closely spaced in spatial frequency, and maps these through pairs of convergence followed by MAX function layers, without learning. Whatever output appears in the final layer is then tested with a support vector machine to measure how well the output can be used by this very powerful subsequent learning stage to categorize different types of image. Whether that is a good test of invariance learning is a matter for discussion (Pinto et al., 2008; see Section 8). The approach taken in VisNet is that, instead of using a benchmark test of image exemplars from which to learn categories (Serre et al., 2007a,b,c), VisNet is trained to generalize across transforms of the objects that provide the training set. However, the fact that the model of Poggio, Riesenhuber, Serre and colleagues does use a hierarchical approach to object recognition does represent useful convergent thinking toward how invariant object recognition may be implemented in the brain.

FIGURE 40 | Sketch of Riesenhuber and Poggio's (1999a,b) model of invariant object recognition. The model includes layers of "S" cells which perform template matching (solid lines), and "C" cells (solid lines) which pool information by a non-linear MAX function to achieve invariance (see text). (After Riesenhuber and Poggio, 1999a,b.)
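The MAX pooling ascribed to the "C" cells above can be stated compactly. The sketch below uses made-up toy response values, not the HMAX implementation, to show why the pooled response is unchanged when the feature moves to a different position covered by the pool:

```python
import numpy as np

def c_cell_response(s_responses):
    """A 'C' cell returns the MAX over its pool of 'S' afferents, which are
    tuned to the same feature at different positions (or scales)."""
    return float(np.max(s_responses))

# Toy 'S' responses to the same feature shown at two retinal positions:
# a different afferent is active in each case, but the pooled MAX
# response of the 'C' cell is identical, giving translation invariance.
s_at_pos1 = np.array([0.9, 0.1, 0.1])
s_at_pos2 = np.array([0.1, 0.1, 0.9])
print(c_cell_response(s_at_pos1), c_cell_response(s_at_pos2))  # 0.9 0.9
```

The contrast with VisNet is that here the invariance is wired in by the pooling operation rather than discovered by synaptic learning.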
Similarly, the approach of training a five-layer network by a more artificial gradient ascent method, with a goal function that does, however, include forming relatively time-invariant representations and decorrelating the responses of neurons within each layer (Wyss et al., 2006; both processes that have their counterpart in VisNet), also reflects convergent thinking.

Further evidence consistent with the approach developed in the investigations of VisNet described in this paper comes from psychophysical studies. Wallis and Bülthoff (1999) and Perry et al. (2006) describe psychophysical evidence for learning of view-invariant representations by experience, in that the learning can be shown in special circumstances to be affected by the temporal sequence in which different views of objects are seen.

Another related approach, from the machine learning area, is that of convolutional networks. Convolutional networks are a biologically inspired trainable architecture that can learn invariant features. Each stage in a ConvNet is composed of a filter bank, some non-linearities, and feature pooling layers. With multiple stages, a ConvNet can learn multi-level hierarchies of features (LeCun et al., 2010). Non-linearities that include rectification and local contrast normalization are important in such systems (Jarrett et al., 2009; and are of course properties of VisNet). Applications have been developed to visual object recognition and vision navigation for off-road mobile robots. Ullman has considered the use of features in a hierarchy to help with processes such as segmentation and object recognition (Ullman, 2007).

Another approach to the implementation of invariant representations in the brain is the use of neurons with Sigma-Pi synapses. Sigma-Pi synapses effectively allow one input to a synapse to be multiplied or gated by a second input to the synapse (Rolls, 2008b). The multiplying input might gate the appropriate set of the other inputs to a synapse to produce the shift or scale change required. For example, the multiplying input could be a signal that varies with the shift required to compute translation invariance, effectively mapping the appropriate set of x inputs through to the output neurons depending on the shift required (Olshausen et al., 1993, 1995; Mel et al., 1998; Mel and Fiser, 2000). Local operations on a dendrite could be involved in such a process (Mel et al., 1998). The explicit neural implementation of the gating mechanism seems implausible, given the need to multiply, and thus remap, large parts of the retinal input depending on shift and scale by modifying connections to a particular set of output neurons. Moreover, the explicit control signal to set the multiplication required in V1 has not been identified. Moreover, if this were the solution used by the brain, the whole problem of shift and scale invariance could in principle be solved in one layer of the system, rather than with the multiple hierarchically organized set of layers actually used in the brain, as shown schematically in Figure 1. The multiple layers actually used in the brain are much more consistent with the type of scheme incorporated in VisNet. Moreover, if a multiplying system of the type hypothesized by Olshausen et al. (1993), Mel et al. (1998), and Olshausen et al. (1995) were implemented in a multilayer hierarchy with the shift and scale change emerging gradually, then the multiplying control signal would need to be supplied to every stage of the hierarchy.
such approaches is how the system is trained in the first place. 8.2. THE HMAX MODELS USED FOR COMPARISON WITH VISNETL The performance of VisNetL was compared against a standard 8. MEASURING THE CAPACITY OF VisNet HMAX model (Serre et al., 2007b,c; Mutch and Lowe, 2008), and For a theory of the brain mechanisms of invariant object recog- a HMAX model scaled down to have a comparable complexity (in nition, it is important that the system should scale up, so that terms, for example, of the number of neurons) to that of VisNetL. if a model such as VisNet was the size of the human visual sys- The scaled down HMAX model is referred to as HMAX_min. The tem, it would have comparable performance. Most of the research current HMAX family models have in the order of 10 million com- with VisNet to date has focused on the principles of operation of putational units (Serre et al., 2007b), which is at least 100 times the the system, and what aspects of invariant object recognition the number contained within the current implementation of VisNetL model can solve (Rolls, 2008b). In this section I consider how the (which uses 128 128 neurons in each of 4 layers, i.e., 65,536 system performs in its scaled up version (VisNetL, with 128 128 neurons). In producing HMAX_min, we aimed to maintain the neurons in each of 4 layers). I compare the capacity of VisNetL architectural features of HMAX, and primarily to scale it down. with that of another model, HMAX, as that has been described HMAX_min is based upon the “base” implementation of Mutch as competing with state of the art systems (Serre et al., 2007a,b,c; and Lowe (2008) . The minimal version used in the comparisons Mutch and Lowe, 2008), and I raise interesting issues about how differs from this base HMAX implementation in two significant to measure the capacity of systems for invariant object recognition ways. First, HMAX_min has only 4 scales compared to the 10 in natural scenes. scales of HMAX. 
(Care was taken to ensure that HMAX_min still The tests (performed by L. Robinson of the Department of covered the same image size range – 256, 152, 90, and 53 pixels.) Computer Science, University of Warwick, UK and E. T. Rolls) Second, the number of distinct units in the S2 “template matching” utilized a benchmark approach incorporated in the work of Serre, layer was limited to only 25 in HMAX_min, compared to 2,000 in Mutch, Poggio and colleagues (Serre et al., 2007b,c; Mutch and HMAX. This results in a scaled down model HMAX_min, with Lowe, 2008) and indeed typical of many standard approaches in approximately 12,000 units in the C1 layer, 75,000 units in the S2 computer vision. This uses standard datasets such as the Caltech- layer, and 25 in the upper C2 layer, which is much closer to the 256 (Griffin et al., 2007) in which sets of images from different 65,536 neurons of VisNetL. (The 75,000 units in S2 allow for every categories are to be classified. C2 neuron to be connected by its own weight to a C1 neuron.; When counting the number of neurons in the models, the num- 8.1. OBJECT BENCHMARK DATABASES ber of neurons in S1 is not included, as they just provide the inputs The Caltech-256 dataset (Griffin et al., 2007) is comprised of 256 to the models.) object classes made up of images that have many aspect ratios, sizes and differ quite significantly in quality (having being manually 8.3. PERFORMANCE ON A CALTECH-256 TEST collated from web searches). The objects within the images show VisNetL and the two HMAX models were trained to discrimi- significant intra-class variation and have a variety of poses, illumi- nate between two object classes from the Caltech-256 database, nation, scale, and occlusion as expected from natural images (see the teddy-bear and cowboy-hat (see examples in Figure 41). Sixty examples in Figure 41). 
In this sense, the Caltech-256 database is image examples of each class were rescaled to 256 256 and considered to be a difficult challenge to object recognition systems. converted to gray-scale, so that shape recognition was being inves- I come to the conclusion below that the benchmarking approach tigated. The 60 images from each class were randomly partitioned with this type of dataset is not useful for training a system that into training and testing sets, with the training set size ranging must learn invariant object representations. The reason for this is over 1, 5, 15 and 30 images, and the corresponding testing set being that the exemplars of each category in the Caltech-256 dataset are the remainder of the 60 images in the cross-validation design. A too discontinuous to provide a basis for learning invariant object linear support vector machine (libSVM, Chang and Lin, 2011) representations. For example, the exemplars within a category in approach operating on the output of layer 4 of VisnetL was used these datasets may be very different indeed. to compare the categorization of the trained images with that of Partly because of the limitations of the Caltech-256 database the test images, as that is the approach used by HMAX (Serre et al., for training in invariant object recognition, we also investigated 2007b,c; Mutch and Lowe, 2008). The standard default parameters training with the Amsterdam Library of Images (ALOI; Geuse- of the support vector machine were used in identical form for the broek et al., 2005) database . The ALOI database takes a different VisNetL and HMAX tests. 1 2 http://staff.science.uva.nl/aloi/ http://cbcl.mit.edu/jmutch/cns/index.html Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 60 Rolls Invariant visual object recognition FIGURE 41 | Example images from the Caltech-256 database for two object classes, teddy-bears and cowboy-hats. 
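The readout protocol of Section 8.3 (a linear support vector machine with default parameters trained on output-layer firing rates, then evaluated on the held-out remainder of the 60 images per class) can be sketched as follows. This is a minimal illustration, not the actual experiment: the random "layer 4" firing-rate vectors, class prototypes, noise level, and feature dimensionality below are assumptions standing in for real VisNetL activity.

```python
# Hedged sketch of the Section 8.3 readout: a linear SVM trained on
# output-layer firing rates, tested on held-out images. The random
# "layer 4" vectors are illustrative stand-ins for real VisNetL activity.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_images, n_rates = 60, 256          # 60 images per class, as in the text

def layer4_rates(prototype, n):
    """Non-negative firing-rate vectors scattered around a class prototype."""
    return np.clip(prototype + 0.3 * rng.standard_normal((n, prototype.size)), 0.0, None)

class_a = layer4_rates(rng.random(n_rates), n_images)   # e.g., "teddy-bear"
class_b = layer4_rates(rng.random(n_rates), n_images)   # e.g., "cowboy-hat"
X = np.vstack([class_a, class_b])
y = np.repeat([0, 1], n_images)

# Random partition: 30 training images per class; the remaining 30
# of each 60 form the cross-validation test set.
perm = np.concatenate([rng.permutation(n_images), n_images + rng.permutation(n_images)])
train = np.concatenate([perm[:30], perm[n_images:n_images + 30]])
test = np.setdiff1d(np.arange(2 * n_images), train)

clf = LinearSVC().fit(X[train], y[train])   # default parameters, as in the text
accuracy = clf.score(X[test], y[test])
```

The same readout can be retrained on the activity of each layer in turn, which is how the layer-by-layer comparison of Section 8.5 is obtained.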
FIGURE 42 | Example images from the two object classes within the ALOI database, (A) 90 (rubber duck) and (B) 93 (black shoe). Only the 45˚ increments are shown.

Figure 43 shows the performance of all three models when performing the task with the Caltech-256 dataset. It is clear that VisNetL performed better than HMAX_min as soon as there were reasonable numbers of training images, and this was confirmed statistically using the Chi-square test. It is also shown that the full HMAX model (as expected, given its very large number of neurons) exhibits higher performance than that of VisNetL and HMAX_min.

FIGURE 43 | Performance of VisNetL, HMAX, and HMAX_min on the classification task using the Caltech-256 dataset. The error bars show the standard error of the means over 5 cross-validation trials with different images chosen at random for the training set on each trial. It is clear that VisNetL performs better than HMAX_min, and this was confirmed statistically using the Chi-square test performed with 30 training images and 30 cross-validation test images in each of two categories (Chi-square = 8.09, df = 1, p = 0.0025).

8.4. PERFORMANCE WITH THE AMSTERDAM LIBRARY OF IMAGES
Eight classes of object (with designations 36, 90, 93, 103, 138, 156, 203, 161) from the dataset were chosen (see Figure 42 for examples). Each class comprises 72 images taken at 5˚ increments through the full 360˚ out of plane rotation. Three sets of training images were used. (1) Three training images per class were taken at 315, 0, and 45˚. (2) Eight training images encompassing the entire rotation of the object were taken in 45˚ increments. (3) Eighteen training images, also encompassing the entire rotation of the object, were taken in 20˚ increments. The testing set consisted, for each object, of the remaining orientations from the set of 72 that were not present in the particular training set. The aim of using the different training sets was to investigate how close in viewing angle the training images need to be, and also to investigate the effects of using different numbers of training images.

Figure 44 shows that VisNetL performed better than HMAX_min as soon as there were even a few training images, with HMAX as expected performing better. VisNetL performed almost as well as the very much larger HMAX as soon as there were reasonable numbers of training images.

What VisNetL can do here is to learn view-invariant representations, using its trace-learning rule to build feature analyzers that reflect the similarity across at least adjacent views of the training set. Very interestingly, with 8 training images the view spacing of the training images was 45˚, and the test images in the cross-validation design were the intermediate views, 22.5˚ away from the nearest trained view. This is promising, for it shows that enormous numbers of training images with many different closely spaced views are not necessary for VisNetL. Even 8 training views spaced 45˚ apart produced reasonable training.

FIGURE 44 | Performance of VisNetL, HMAX_min, and HMAX on the classification task with 8 classes using the Amsterdam Library of Images dataset. It is clear that VisNetL performs better than HMAX_min, and this was confirmed statistically using the Chi-square test performed with 18 training images 20˚ apart in view and 54 cross-validation testing images 5˚ apart in each of eight categories (Chi-square = 110.58, df = 1, p < 0.0001).

8.5. INDIVIDUAL LAYER PERFORMANCE
To test whether the VisNet hierarchy is actually performing useful computations with these datasets, the simulations were re-run.
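The trace-learning rule invoked in Section 8.4 can be illustrated with a minimal sketch. A common form of the rule makes the weight change proportional to the current input and to an exponentially decaying trace of the postsynaptic activity, so that views of one object presented in temporal sequence become associated onto the same output neuron. The one-neuron demonstration below, with its toy "views", learning rate, and trace constant, is an illustrative assumption rather than the actual VisNet implementation:

```python
import numpy as np

def trace_learning_update(w, x_seq, alpha=0.05, eta=0.8):
    """Associative learning with a short-term memory trace of postsynaptic activity.

    One common form of the trace rule:
        y_bar(t) = (1 - eta) * y(t) + eta * y_bar(t - 1)
        dw       = alpha * y_bar(t) * x(t)
    """
    y_bar = 0.0
    for x in x_seq:                           # x: input firing-rate vector for one view
        y = float(np.dot(w, x))               # postsynaptic rate (linear neuron here)
        y_bar = (1 - eta) * y + eta * y_bar   # exponentially decaying memory trace
        w = w + alpha * y_bar * x             # associative update uses the trace
        w = w / np.linalg.norm(w)             # weight normalization, as in competitive nets
    return w

# Toy demo: two "objects", each a cluster of overlapping noisy "views".
rng = np.random.default_rng(0)
proto_a, proto_b = np.zeros(50), np.zeros(50)
proto_a[:25] = 1.0
proto_b[25:] = 1.0
views_a = [np.clip(proto_a + 0.2 * rng.standard_normal(50), 0, None) for _ in range(20)]
views_b = [np.clip(proto_b + 0.2 * rng.standard_normal(50), 0, None) for _ in range(20)]

w = rng.random(50)
w /= np.linalg.norm(w)
for _ in range(10):                           # object A's views shown in temporal sequence
    w = trace_learning_update(w, views_a)

resp_a = np.mean([np.dot(w, v) for v in views_a])
resp_b = np.mean([np.dot(w, v) for v in views_b])
# After training, the neuron responds more strongly to any view of object A.
```

Because the trace bridges successive frames, views separated in angle but adjacent in time are bound together, which is the property exploited when training with 45˚-spaced ALOI views.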
This time, instead of only training the SVM on the activity generated in the final layer, four identical SVMs were trained independently on the activities of each of the four layers. If the VisNet hierarchy is actually forming useful representations with these datasets, then we should see the discriminatory power of the SVMs trained on each layer increase as we traverse the hierarchy.

When the Caltech-256 dataset was used to train VisNetL, there was very little difference in the measured performance of classifiers trained on each layer. This is revealing, for it shows that the Caltech-256 dataset does not have sufficient similarity between the exemplars within a given class for the trace-learning rule utilized in VisNet to perform useful learning. Thus, at least with a convergent feature hierarchy network trained in this way, there is insufficient similarity and information in the exemplars of each category of the Caltech-256 to learn to generalize in a view-invariant way to further exemplars of that category.

In contrast, when the ALOI dataset was used to train VisNetL, the later layers performed better (layer 2, 72% correct; layer 3, 84% correct; layer 4, 86% correct; p < 0.001). Thus there is sufficient continuity in the images in the ALOI dataset to support view-invariance learning in this feature hierarchy network.

8.6. EVALUATION
One conclusion is that VisNetL performs comparably to a scaled down version of HMAX on benchmark tests. This is reassuring, for HMAX has been described as competing with state of the art systems (Serre et al., 2007a,b,c; Mutch and Lowe, 2008).

A second conclusion is that image databases such as the Caltech-256 that are used to test the performance of object recognition systems (Serre et al., 2007a,b,c; Mutch and Lowe, 2008; and many computer vision approaches) are inappropriate as training sets for systems that must perform invariant visual object recognition. Instead, for such systems, it will be much more relevant to train on image sets in which the image exemplars within a class show much more continuous variation. This provides the system with the opportunity to learn invariant representations, instead of just doing its best to categorize images into classes from relatively limited numbers of images that do not allow the system to learn the rules of the transforms that objects undergo in the real world, rules that can be used to help object recognition when objects may be seen from different views. This is an important conclusion for research in the area. Consistently, others are realizing that invariant visual object recognition is a hard problem (Pinto et al., 2008; DiCarlo et al., 2012). In this context, the hypotheses presented in this paper are my theory of how invariant visual object recognition is performed by the brain (Rolls, 1992, 2008b), and the model VisNet tests those hypotheses and provides a model of how invariant visual object representations can be learned (Rolls, 2008b).

Third, the findings described here are encouraging with respect to training view-invariant representations, in that the training images with the ALOI dataset could be separated by as much as 45˚ and still provide for view-invariant object recognition with cross-validation images that were never closer than 22.5˚ to a training image. This is helpful, for it is an indication that large numbers of different views will not need to be trained with the VisNet architecture in order to achieve good view-invariant object recognition.

9. DIFFERENT PROCESSES INVOLVED IN DIFFERENT TYPES OF OBJECT IDENTIFICATION
To conclude this paper, it is proposed that there are (at least) three different types of process that could be involved in object identification. The first is the simple situation where different objects can be distinguished by different non-overlapping sets of features (see Section 3.1). An example might be a banana and an orange, where the list of features of the banana might include yellow, elongated, and smooth surface; and of the orange its orange color, round shape, and dimpled surface. Such objects could be distinguished just on the basis of a list of the properties, which could be processed appropriately by a competitive network, pattern associator, etc. No special mechanism is needed for view-invariance, because the list of properties is very similar from most viewing angles. Object recognition of this type may be common in animals, especially those with visual systems less developed than those of primates. However, this approach does not describe the shape and form of objects, and is insufficient to account for primate vision. Nevertheless, the features present in objects are valuable cues to object identity, and are naturally incorporated into the feature hierarchy approach.

A second type of process might involve the ability to generalize across a small range of views of an object, that is, within a generic view, where cues of the first type cannot be used to solve the problem. An example might be generalization across a range of views of a cup when looking into the cup, from just above the near lip until the bottom inside of the cup comes into view. This type of process includes the learning of the transforms of the surface markings on 3D objects which occur when the object is rotated, as described in Section 5.6. Such generalization would work because the neurons are tuned as filters to accept a range of variation of the input within parameters such as the relative size and orientation of the components of the features. Generalization of this type would not be expected to work when there is a catastrophic change in the features visible, as, for example, occurs when the cup is rotated so that one can suddenly no longer see inside it, and the outside bottom of the cup comes into view.

The third type of process is one that can deal with the sudden catastrophic change in the features visible when an object is rotated to a completely different view, as in the cup example just given (cf. Koenderink, 1990). Another example, quite extreme to illustrate the point, might be when a card with different images on its two sides is rotated so that one face and then the other is in view. This makes the point that this third type of process may involve arbitrary pairwise association learning, to learn which features and views are different aspects of the same object. Another example occurs when only some parts of an object are visible. For example, a red-handled screwdriver may be recognized either from its round red handle, or from its elongated silver-colored blade. The full view-invariant recognition of objects that occurs even when the objects share the same features, such as color, texture, etc., is an especially computationally demanding task, which the primate visual system is able to perform with its highly developed temporal lobe cortical visual areas. The neurophysiological evidence and the neuronal network analyses described here and elsewhere (Rolls, 2008b) provide clear hypotheses about how the primate visual system may perform this task.

10. CONCLUSION
We have seen that the feature hierarchy approach has a number of advantages in performing object recognition over other approaches (see Section 3), and that some of the key computational issues that arise in these architectures have solutions (see Sections 4 and 5). The neurophysiological and computational approach taken here focuses on a feature hierarchy model in which invariant representations can be built by self-organizing learning based on the statistics of the visual input.

The model can use temporal continuity in an associative synaptic learning rule with a short-term memory trace, and/or it can use spatial continuity in continuous spatial transformation learning.

The model of visual processing in the ventral cortical stream can build representations of objects that are invariant with respect to translation, view, size, and lighting.

The model uses a feature combination neuron approach, with the relative spatial positions of the objects specified in the feature combination neurons, and this provides a solution to the binding problem.

The model has been extended to provide an account of invariant representations in the dorsal visual system of the global motion produced by objects, such as looming, rotation, and object-based movement.

The model has been extended to incorporate top-down feedback connections to model the control of attention by biased competition in, for example, spatial and object search tasks (Deco and Rolls, 2004; Rolls, 2008b).

The model has also been extended to account for how the visual system can select single objects in complex visual scenes, how multiple objects can be represented in a scene, and how invariant representations of single objects can be learned even when multiple objects are present in the scene.

It has also been suggested in a unifying proposal that adding a fifth layer to the model and training the system in spatial environments will enable hippocampus-like spatial view neurons or place cells to develop, depending on the size of the field of view (Section 6).

We have thus seen how many of the major computational issues that arise when formulating a theory of object recognition in the ventral visual system (such as feature binding, invariance learning, the recognition of objects when they are in cluttered natural scenes, the representation of multiple objects in a scene, and learning invariant representations of single objects when there are multiple objects in the scene) could be solved in the brain, with tests of the hypotheses performed by simulations that are consistent with complementary neurophysiological results.

The approach described here is unifying in a number of ways. First, a set of simple organizational principles, involving a hierarchy of cortical areas with convergence from stage to stage, and competitive learning using a modified associative learning rule with a short-term memory trace of preceding neuronal activity, provide a basis for understanding much processing in the ventral visual stream, from V1 to the inferior temporal visual cortex. Second, the same principles help to understand some of the processing in the dorsal visual stream by which invariant representations of the global motion of objects may be formed. Third, the same principles continued from the ventral visual stream onward to the hippocampus help to show how spatial view and place representations may be built from the visual input. Fourth, in all these cases the learning is possible because the system is able to extract invariant representations because it can utilize the spatio-temporal continuities and statistics in the world that help to define objects, moving objects, and spatial scenes. Fifth, a great simplification and economy in terms of brain design is that the computational principles need not be different in each of the cortical areas in these hierarchical systems, for some of the important properties of the processing in these systems to be performed.

In conclusion, we have seen how the invariant recognition of objects involves not only the storage and retrieval of information, but also major computations to produce invariant representations. Once these invariant representations have been formed, they are used for many processes, including not only recognition memory (Rolls, 2008b), but also associative learning of the rewarding and punishing properties of objects for emotion and motivation (Rolls, 2005, 2008b, 2013), the memory for the spatial locations of objects and rewards, the building of spatial representations based on visual input, and as an input to short-term memory, attention, decision, and action selection systems (Rolls, 2008b).

ACKNOWLEDGMENTS
Edmund T. Rolls is grateful to Larry Abbott, Nicholas Aggelopoulos, Roland Baddeley, Francesco Battaglia, Michael Booth, Gordon Baylis, Hugo Critchley, Gustavo Deco, Martin Elliffe, Leonardo Franco, Michael Hasselmo, Nestor Parga, David Perrett, Gavin Perry, Leigh Robinson, Simon Stringer, Martin Tovee, Alessandro Treves, James Tromans, and Tristan Webb for contributing to many of the collaborative studies described here. Professor R. Watt, of Stirling University, is thanked for assistance with the implementation of the difference of Gaussian filters used in many experiments with VisNet and VisNet2. Support from the Medical Research Council, the Wellcome Trust, the Oxford McDonnell Centre in Cognitive Neuroscience, and the Oxford Centre for Computational Neuroscience (www.oxcns.org, where .pdfs of papers are available) is acknowledged.

REFERENCES
Abbott, L. F., Rolls, E. T., and Tovee, M. J. (1996). Representational capacity of face coding in monkeys. Cereb. Cortex 6, 498–505.
Abeles, M. (1991). Corticonics: Neural Circuits of the Cerebral Cortex. Cambridge: Cambridge University Press.
Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cogn. Sci. 9, 147–169.
Aggelopoulos, N. C., Franco, L., and Rolls, E. T. (2005). Object perception in natural scenes: encoding by inferior temporal cortex simultaneously recorded neurons. J. Neurophysiol. 93, 1342–1357.
Aggelopoulos, N. C., and Rolls, E. T. (2005). Natural scene perception: inferior temporal cortex neurons encode the positions of different objects in the scene. Eur. J. Neurosci. 22, 2903–2916.
Amit, D. J. (1989). Modelling Brain Function. New York: Cambridge University Press.
Anzai, A., Peng, X., and Van Essen, D. C. (2007). Neurons in monkey visual area V2 encode combinations of orientations. Nat. Neurosci. 10, 1313–1321.
Arathorn, D. (2002). Map-Seeking Circuits in Visual Cognition: A Computational Mechanism for Biological and Machine Vision. Stanford, CA: Stanford University Press.
Arathorn, D. (2005). "Computation in the higher visual cortices: map-seeking circuit theory and application to machine vision," in Proceedings of the AIPR 2004: 33rd Applied Imagery Pattern Recognition Workshop, 73–78.
Ballard, D. H. (1990). "Animate vision uses object-centred reference frames," in Advanced Neural Computers, ed. R. Eckmiller (Amsterdam: Elsevier), 229–236.
Barlow, H. B. (1972). Single units and sensation: a neuron doctrine for perceptual psychology. Perception 1, 371–394.
Barlow, H. B. (1985). "Cerebral cortex as model builder," in Models of the Visual Cortex, eds D. Rose and V. G. Dobson (Chichester: Wiley), 37–46.
Barlow, H. B., Kaushal, T. P., and Mitchison, G. J. (1989). Finding minimum entropy codes. Neural Comput. 1, 412–423.
Bartlett, M. S., and Sejnowski, T. J. (1997). "Viewpoint invariant face recognition using independent component analysis and attractor networks," in Advances in Neural Information Processing Systems, Vol. 9, eds M. Mozer, M. Jordan, and T. Petsche (Cambridge, MA: MIT Press), 817–823.
Baylis, G. C., Rolls, E. T., and Leonard, C. M. (1985). Selectivity between faces in the responses of a population of neurons in the cortex in the superior temporal sulcus of the monkey. Brain Res. 342, 91–102.
Baylis, G. C., Rolls, E. T., and Leonard, C. M. (1987). Functional subdivisions of temporal lobe neocortex. J. Neurosci. 7, 330–342.
Bennett, A. (1990). Large competitive networks. Network 1, 449–462.
Biederman, I. (1972). Perceiving real-world scenes. Science 177, 77–80.
Biederman, I. (1987). Recognition-by-components: a theory of human image understanding. Psychol. Rev. 94, 115–147.
Binford, T. O. (1981). Inferring surfaces from images. Artif. Intell. 17, 205–244.
Blumberg, J., and Kreiman, G. (2010). How cortical neurons help us see: visual recognition in the human brain. J. Clin. Invest. 120, 3054–3063.
Bolles, R. C., and Cain, R. A. (1982). Recognizing and locating partially visible objects: the local-feature-focus method. Int. J. Robot. Res. 1, 57–82.
Booth, M. C. A., and Rolls, E. T. (1998). View-invariant representations of familiar objects by neurons in the inferior temporal visual cortex. Cereb. Cortex 8, 510–523.
Boussaoud, D., Desimone, R., and Ungerleider, L. G. (1991). Visual topography of area TEO in the macaque. J. Comp. Neurol. 306, 554–575.
Brady, M., Ponce, J., Yuille, A., and Asada, H. (1985). Describing surfaces, A. I. Memo 882. Artif. Intell. 17, 285–349.
Brincat, S. L., and Connor, C. E. (2006). Dynamic shape synthesis in posterior inferotemporal cortex. Neuron 49, 17–24.
Bruce, V. (1988). Recognising Faces. Hillsdale, NJ: Erlbaum.
Brun, V. H., Otnass, M. K., Molden, S., Steffenach, H. A., Witter, M. P., Moser, M. B., and Moser, E. I. (2002). Place cells and place recognition maintained by direct entorhinal–hippocampal circuitry. Science 296, 2243–2246.
Buckley, M. J., Booth, M. C. A., Rolls, E. T., and Gaffan, D. (2001). Selective perceptual impairments following perirhinal cortex ablation. J. Neurosci. 21, 9824–9836.
Buhmann, J., Lange, J., von der Malsburg, C., Vorbrüggen, J. C., and Würtz, R. P. (1991). "Object recognition in the dynamic link architecture: parallel implementation of a transputer network," in Neural Networks for Signal Processing, ed. B. Kosko (Englewood Cliffs, NJ: Prentice-Hall), 121–159.
Carlson, E. T., Rasquinha, R. J., Zhang, K., and Connor, C. E. (2011). A sparse object coding scheme in area V4. Curr. Biol. 21, 288–293.
Cerella, J. (1986). Pigeons and perceptrons. Pattern Recognit. 19, 431–438.
Chakravarty, I. (1979). A generalized line and junction labeling scheme with applications to scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1, 202–205.
Chang, C.-C., and Lin, C.-J. (2011). LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27.
Dane, C., and Bajcsy, R. (1982). "An object-centred three-dimensional model builder," in Proceedings of the 6th International Conference on Pattern Recognition, Munich, 348–350.
Daugman, J. (1988). Complete discrete 2D-Gabor transforms by neural networks for image analysis and compression. IEEE Trans. Acoust. 36, 1169–1179.
De Araujo, I. E. T., Rolls, E. T., and Stringer, S. M. (2001). A view model which accounts for the response properties of hippocampal primate spatial view cells and rat place cells. Hippocampus 11, 699–706.
De Valois, R. L., and De Valois, K. K. (1988). Spatial Vision. New York: Oxford University Press.
Deco, G., and Rolls, E. T. (2004). A neurodynamical cortical model of visual attention and invariant object recognition. Vision Res. 44, 621–644.
Deco, G., and Rolls, E. T. (2005a). Attention, short term memory, and action selection: a unifying theory. Prog. Neurobiol. 76, 236–256.
Deco, G., and Rolls, E. T. (2005b). Neurodynamics of biased competition and cooperation for attention: a model with spiking neurons. J. Neurophysiol. 94, 295–313.
Desimone, R., and Duncan, J. (1995). Neural mechanisms of selective visual attention. Annu. Rev. Neurosci. 18, 193–222.
DeWeese, M. R., and Meister, M. (1999). How to measure the information gained from one symbol. Network 10, 325–340.
DiCarlo, J. J., and Maunsell, J. H. R. (2003). Anterior inferotemporal neurons of monkeys engaged in object recognition can be highly sensitive to object retinal position. J. Neurophysiol. 89, 3264–3278.
DiCarlo, J. J., Zoccolan, D., and Rust, N. C. (2012). How does the brain solve visual object recognition? Neuron 73, 415–434.
Dolan, R. J., Fink, G. R., Rolls, E. T., Booth, M., Holmes, A., Frackowiak, R. S. J., and Friston, K. J. (1997). How the brain learns to see objects and faces in an impoverished context. Nature 389, 596–599.
Dow, B. W., Snyder, A. Z., Vautin, R. G., and Bauer, R. (1981). Magnification factor and receptive field size in foveal striate cortex of the monkey. Exp. Brain Res. 44, 213–218.
Edelman, S. (1999). Representation and Recognition in Vision. Cambridge, MA: MIT Press.
Elliffe, M. C. M., Rolls, E. T., Parga, N., and Renart, A. (2000). A recurrent model of transformation invariance by association. Neural Netw. 13, 225–237.
Elliffe, M. C. M., Rolls, E. T., and Stringer, S. M. (2002). Invariant recognition of feature combinations in the visual system. Biol. Cybern. 86, 59–71.
Engel, A. K., Konig, P., Kreiter, A. K., Schillen, T. B., and Singer, W. (1992). Temporal coding in the visual system: new vistas on integration in the nervous system. Trends Neurosci. 15, 218–226.
Farah, M. J. (2000). The Cognitive Neuroscience of Vision. Oxford: Blackwell.
Farah, M. J., Meyer, M. M., and McMullen, P. A. (1996). The living/nonliving dissociation is not an artifact: giving an a priori implausible hypothesis a strong test. Cogn. Neuropsychol. 13, 137–154.
Faugeras, O. D. (1993). The Representation, Recognition and Location of 3-D Objects. Cambridge, MA: MIT Press.
Faugeras, O. D., and Hebert, M. (1986). The representation, recognition and location of 3-D objects. Int. J. Robot. Res. 5, 27–52.
Feldman, J. A. (1985). Four frames suffice: a provisional model of vision and space. Behav. Brain Sci. 8, 265–289.
Fenske, M. J., Aminoff, E., Gronau, N., and Bar, M. (2006). Top-down facilitation of visual object recognition: object-based and context-based contributions. Prog. Brain Res. 155, 3–21.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A 4, 2379–2394.
Field, D. J. (1994). What is the goal of sensory coding? Neural Comput. 6, 559–601.
Finkel, L. H., and Edelman, G. M. (1987). "Population rules for synapses in networks," in Synaptic Function, eds G. M. Edelman, W. E. Gall, and W. M. Cowan (New York: John Wiley & Sons), 711–757.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Comput. 3, 193–199.
Földiák, P. (1992). Models of Sensory Coding. Technical Report CUED/F–INFENG/TR 91. Cambridge: Department of Engineering, University of Cambridge.
Folstein, J. R., Gauthier, I., and Palmeri, T. J. (2010). Mere exposure alters category learning of novel objects. Front. Psychol. 1:40. doi:10.3389/fpsyg.2010.00040
Franco, L., Rolls, E. T., Aggelopoulos, N. C., and Jerez, J. M. (2007). Neuronal selectivity, population sparseness, and ergodicity in the inferior temporal visual cortex. Biol. Cybern. 96, 547–560.
Franco, L., Rolls, E. T., Aggelopoulos, N. C., and Treves, A. (2004). The use of decoding to analyze the contribution to the information of the correlations between the firing of simultaneously recorded neurons. Exp. Brain Res. 155, 370–384.
Franzius, M., Sprekeler, H., and Wiskott, L. (2007). Slowness and sparseness lead to place, head-direction, and spatial-view cells. PLoS Comput. Biol. 3, e166. doi:10.1371/journal.pcbi.0030166
Freedman, D. J., and Miller, E. K. (2008). Neural mechanisms of visual categorization: insights from neurophysiology. Neurosci. Biobehav. Rev. 32, 311–329.
Freiwald, W. A., Tsao, D. Y., and Livingstone, M. S.
Geesaman, B. J., and Andersen, R. A. (1996). The analysis of complex motion patterns by form/cue invariant MSTd neurons. J. Neurosci. 16, 4716–4732.
Georges-François, P., Rolls, E. T., and Robertson, R. G. (1999). Spatial view cells in the primate hippocampus: allocentric view not head direction or eye position or place. Cereb. Cortex 9, 197–212.
Geusebroek, J.-M., Burghouts, G. J., and Smeulders, A. W. M. (2005). The Amsterdam library of object images.
Hegde, J., and Van Essen, D. C. (2000). Selectivity for complex shapes in primate visual area V2. J. Neurosci. 20, RC61.
Hegde, J., and Van Essen, D. C. (2003). Strategies of shape representation in macaque visual area V2. Vis. Neurosci. 20, 313–328.
Hegde, J., and Van Essen, D. C. (2007). A comparative study of shape representation in macaque visual areas V2 and V4. Cereb. Cortex 17,
Huttenlocher, D. P., and Ullman, S. (1990). Recognizing solid objects by alignment with an image. Int. J. Comput. Vis. 5, 195–212.
Ito, M. (1984). The Cerebellum and Neural Control. New York: Raven Press.
Ito, M. (1989). Long-term depression. Annu. Rev. Neurosci. 12, 85–102.
Ito, M., and Komatsu, H. (2004). Representation of angles embedded within contour stimuli in area V2 of macaque monkeys. J. Neurosci. 24, 3313–3324.
stone, M. S. (2009). A face feature images. Int. J. Comput. Vis. 61, 1100–1116. Itti, L., and Koch, C. (2000). A saliency- space in the macaque temporal lobe. 103–112. Herrnstein, R. J. (1984). “Objects, cat- based search mechanism for overt Nat. Neurosci. 12, 1187–1196. Gibson, J. J. (1950). The Perception of egories, and discriminative stimuli,” and covert shifts of visual attention. Frey, B. J., and Jojic, N. (2003). the Visual World. Boston: Houghton in Animal Cognition, Chap. 14, eds Vision Res. 40, 1489–1506. Transformation-invariant clustering Mifflin. H. L. Roitblat, T. G. Bever, and H. Jarrett, K., Kavukcuoglu, K., Ranzato, using the EM algorithm. IEEE Trans. Gibson, J. J. (1979). The Ecologi- S. Terrace (Hillsdale, NJ: Lawrence M., and Lecun, Y. (2009). “What Pattern Anal. Mach. Intell. 25, 1–17. cal Approach to Visual Perception. Erlbaum and Associates), 233–261. is the best multi-stage architec- Fries, P. (2005). A mechanism for cog- Boston: Houghton Mifflin. Hertz, J. A., Krogh, A., and Palmer, R. ture for object recognition?” in nitive dynamics: neuronal commu- Grabenhorst, F., and Rolls, E. T. (2011). G. (1991). Introduction to the Theory 2009 IEEE 12th International Con- nication through neuronal coher- Value, pleasure, and choice systems of Neural Computation. Wokingham: ference on Computer Vision (ICCV), ence. Trends Cogn. Sci. (Regul. Ed.) in the ventral prefrontal cortex. Addison-Wesley. 2146–2153. 9, 474–480. Trends Cogn. Sci. (Regul. Ed.) 15, Hestrin, S., Sah, P., and Nicoll, R. Jiang, F., Dricot, L., Weber, J., Righi, Fries, P. (2009). Neuronal gamma- 56–67. (1990). Mechanisms generating the G., Tarr, M. J., Goebel, R., and Ros- band synchronization as a funda- Graziano, M. S. A., Andersen, R. A., time course of dual component exci- sion, B. (2011). Face categorization mental process in cortical com- and Snowden, R. J. (1994). Tuning tatory synaptic currents recorded in visual scenes may start in a higher putation. Annu. Rev. Neurosci. 
32, of MST neurons to spiral motions. J. in hippocampal slices. Neuron 5, order area of the right fusiform 209–224. Neurosci. 14, 54–67. 247–253. gyrus: evidence from dynamic visual Fukushima, K. (1975). Cognitron: a Griffin, G., Holub, A., and Perona, Hinton, G. E. (2010). Learning to rep- stimulation in neuroimaging. J. self-organizing neural network. Biol. P. (2007). The Caltech-256. Caltech resent visual input. Philos. Trans. R. Neurophysiol. 106, 2720–2736. Cybern. 20, 121–136. Technical Report, Los Angeles, 1–20. Soc. Lond. B Biol. Sci. 365, 177–184. Koch, C. (1999). Biophysics of Com- Fukushima, K. (1980). Neocognitron: Grimson, W. E. L. (1990). Object Recog- Hinton, G. E., Dayan, P., Frey, B. J., and putation. Oxford: Oxford University a self-organizing neural network nition by Computer. Cambridge, Neal, R. M. (1995). The “wake-sleep” Press. model for a mechanism of pattern MA: MIT Press. algorithm for unsupervised neural Koenderink, J. J. (1990). Solid Shape. recognition unaffected by shift in Griniasty, M., Tsodyks, M. V., and Amit, networks. Science 268, 1158–1161. Cambridge, MA: MIT Press. position. Biol. Cybern. 36, 193–202. D. J. (1993). Conversion of temporal Hinton, G. E., and Ghahramani, Z. Koenderink, J. J., and Van Doorn, A. J. Fukushima, K. (1988). Neocognitron: a correlations between stimuli to spa- (1997). Generative models for dis- (1979). The internal representation hierarchical neural network model tial correlations between attractors. covering sparse distributed repre- of solid shape with respect to vision. capable of visual pattern recogni- Neural Comput. 35, 1–17. sentations. Philos. Trans. R. Soc. Biol. Cybern. 32, 211–217. tion unaffected by shift in position. Gross, C. G., Desimone, R., Albright, Lond. B Biol. Sci. 352, 1177–1190. Koenderink, J. J., and van Doorn, A. Neural Netw. 1, 119–130. T. D., and Schwartz, E. L. (1985). Hinton, G. E., and Sejnowski, T. J. J. (1991). Affine structure from Fukushima, K. (1989). 
Analysis of the Inferior temporal cortex and pat- (1986). “Learning and relearning in motion. J. Opt. Soc. Am. A 8, process of visual pattern recognition tern recognition. Exp. Brain Res. Boltzmann machines,” in Parallel 377–385. by the neocognitron. Neural Netw. 2, 11(Suppl.), 179–201. Distributed Processing, Vol. 1, Chap. Kourtzi, Z., and Connor, C. E. (2011). 413–420. Hafting, T., Fyhn, M., Molden, S., Moser, 7, eds D. Rumelhart and J. L. McClel- Neural representations for object Fukushima, K. (1991). Neural networks M. B., and Moser, E. I. (2005). land (Cambridge, MA: MIT Press), perception: structure, category, and for visual pattern recognition. IEEE Microstructure of a spatial map in 282–317. adaptive coding. Annu. Rev. Neu- Trans. E 74, 179–190. the entorhinal cortex. Nature 436, Hopfield, J. J. (1982). Neural networks rosci. 34, 45–67. Fukushima, K., and Miyake, S. (1982). 801–806. and physical systems with emer- Kriegeskorte, N., Mur, M., Ruff, D. A., Neocognitron: a new algorithm Hasselmo, M. E., Rolls, E. T., and gent collective computational abili- Kiani, R., Bodurka, J., Esteky, H., for pattern recognition tolerant of Baylis, G. C. (1989a). The role ties. Proc. Natl. Acad. Sci. U. S. A. 79, Tanaka, K., and Bandettini, P. A. deformations and shifts in position. of expression and identity in the 2554–2558. (2008). Matching categorical object Pattern Recognit. 15, 455–469. face-selective responses of neurons Hubel, D. H., and Wiesel, T. N. (1962). representations in inferior temporal Fyhn, M., Molden, S., Witter, M. P., in the temporal visual cortex of Receptive fields, binocular interac- cortex of man and monkey. Neuron Moser, E. I., and Moser, M.-B. the monkey. Behav. Brain Res. 32, tion, and functional architecture in 60, 1126–1141. (2004). Spatial representation in 203–218. the cat’s visual cortex. J. Physiol. 160, Krieman, G., Koch, C., and Fried, the entorhinal cortex. Science 2004, Hasselmo, M. E., Rolls, E. T., Baylis, 106–154. I. (2000). 
Category-specific visual 1258–1264. G. C., and Nalwa, V. (1989b). Hubel, D. H., and Wiesel, T. N. responses of single neurons in the Gardner, E. (1988). The space of inter- Object-centered encoding by face- (1968). Receptive fields and func- human medial temporal lobe. Nat. actions in neural network models. J. selective neurons in the cortex in tional architecture of monkey striate Neurosci. 3, 946–953. Phys. A Math. Gen. 21, 257–270. the superior temporal sulcus of cortex. J. Physiol. 195, 215–243. Land, M. F. (1999). Motion and vision: Garthwaite, J. (2008). Concepts of the monkey. Exp. Brain Res. 75, Hummel, J. E., and Biederman, I. why animals move their eyes. J. neural nitric oxide-mediated 417–429. (1992). Dynamic binding in a neural Comp. Physiol. A 185, 341–352. transmission. Eur. J. Neurosci. 27, Hawken, M. J., and Parker, A. J. (1987). network for shape recognition. Psy- Land, M. F., and Collett, T. S. (1997). 2783–3802. Spatial properties of the monkey chol. Rev. 99, 480–517. “A survey of active vision in inverte- Geesaman, B. J., and Andersen, R. A. striate cortex. Proc. R. Soc. Lond. B Huttenlocher, D. P., and Ullman, S. brates,” in From Living Eyes to See- (1996). The analysis of complex Biol. Sci. 231, 251–288. (1990). Recognizing solid objects by ing Machines, eds M. V. Srinivasan Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 66 Rolls Invariant visual object recognition and S. Venkatesh (Oxford: Oxford Mel, B. W., and Fiser, J. (2000). Mini- Newsome, W. T., Britten, K. H., and visual object recognition hard? University Press), 16–36. mizing binding errors using learned Movshon, J. A. (1989). Neuronal PLoS Comput. Biol. 4, e27. LeCun, Y., Kavukcuoglu, K., and Fara- conjunctive features. Neural Com- correlates of a perceptual decision. doi:10.1371/journal.pcbi.0040027 bet, C. (2010). “Convolutional net- put. 12, 731–762. Nature 341, 52–54. Poggio, T., and Edelman, S. (1990). 
works and applications in vision,” in Mel, B. W., Ruderman, D. L., and Oja, E. (1982). A simplified neuron A network that learns to recognize 2010 IEEE International Symposium Archie, K. A. (1998). Translation- model as a principal component three-dimensional objects. Nature on Circuits and Systems, 253–256. invariant orientation tuning in analyzer. J. Math. Biol. 15, 267–273. 343, 263–266. Lee, T. S. (1996). Image representa- visual “complex” cells could derive O’Keefe, J. (1984). “Spatial memory Pollen, D., and Ronner, S. (1981). Phase tion using 2D Gabor wavelets. IEEE from intradendritic computations. J. within and without the hippocampal relationship between adjacent sim- Trans. Pattern Anal. Mach. Intell. 18, Neurosci. 18, 4325–4334. system,” in Neurobiology of the Hip- ple cells in the visual cortex. Science 959–971. Mikami, A., Nakamura, K., and Kub- pocampus, ed. W. Seifert (London: 212, 1409–1411. Leen, T. K. (1995). From data distri- ota, K. (1994). Neuronal responses Academic Press), 375–403. Rao, R. P. N., and Ruderman, D. butions to regularization in invari- to photographs in the superior tem- Olshausen, B. A., Anderson, C. H., and L. (1999). “Learning lie groups ant learning. Neural Comput. 7, poral sulcus of the rhesus monkey. Van Essen, D. C. (1993). A neurobio- for invariant visual perception,” in 974–981. Behav. Brain Res. 60, 1–13. logical model of visual attention and Advances in Neural Information Pro- Leibo, J. Z., Mutch, J., Rosasco, Milner, P. (1974). A model for visual invariant pattern recognition based cessing Systems, Vol. 11, eds M. S. L., Ullman, S., and Poggio, T. shape recognition. Psychol. Rev. 81, on dynamic routing of information. Kearns, S. A. Solla, and D. A. Cohn (2010). Learning Generic Invari- 521–535. J. Neurosci. 13, 4700–4719. (Cambridge: MIT Press), 810–816. ances in Object Recognition: Trans- Miyashita, Y. (1988). Neuronal corre- Olshausen, B. A., Anderson, C. H., and Renart, A., Parga, N., and Rolls, E. 
lation and Scale. MIT-CSAIL-TR- late of visual associative long-term Van Essen, D. C. (1995). A multiscale T. (2000). “A recurrent model of 2010-061, Cambridge. memory in the primate temporal dynamic routing circuit for forming the interaction between the pre- Li, N., and DiCarlo, J. J. (2008). Unsu- cortex. Nature 335, 817–820. size- and position-invariant object frontal cortex and inferior temporal pervised natural experience rapidly Miyashita, Y., and Chang, H. S. representations. J. Comput. Neurosci. cortex in delay memory tasks,” in alters invariant object representa- (1988). Neuronal correlate of picto- 2, 45–62. Advances in Neural Information tion in visual cortex. Science 321, rial short-term memory in the pri- Orban, G. A. (2011). The extraction Processing Systems, Vol. 12, eds S. 1502–1507. mate temporal cortex. Nature 331, of 3D shape in the visual system Solla, T. Leen, and K.-R. Mueller Li, S., Mayhew, S. D., and Kourtzi, Z. 68–70. of human and nonhuman primates. (Cambridge, MA: MIT Press), (2011). Learning shapes spatiotem- Montague, P. R., Gally, J. A., and Edel- Annu. Rev. Neurosci. 34, 361–388. 171–177. poral brain patterns for flexible cat- man, G. M. (1991). Spatial signalling O’Reilly, J., and Munakata, Y. (2000). Rhodes, P. (1992). The open time of egorical decisions. Cereb. Cortex. in the development and function of Computational Explorations in Cog- the NMDA channel facilitates the doi: 10.1093/cercor/bhr309. [Epub neural connections. Cereb. Cortex 1, nitive Neuroscience. Cambridge, MA: self-organisation of invariant object ahead of print]. 199–220. MIT Press. responses in cortex. Soc. Neurosci. Liu, J., Harris, A., and Kanwisher, N. Moser, E. I. (2004). Hippocampal place Parga, N., and Rolls, E. T. (1998). Abstr. 18, 740. (2010). Perception of face parts and cells demand attention. Neuron 42, Transform invariant recognition by Riesenhuber, M., and Poggio, T. (1998). face configurations: an fMRI study. 183–185. 
association in a recurrent network. “Just one view: invariances in infer- J. Cogn. Neurosci. 22, 203–211. Moser, M. B., and Moser, E. I. Neural Comput. 10, 1507–1525. otemporal cell tuning,” in Advances Logothetis, N. K., Pauls, J., Bulthoff, (1998). Functional differentiation in Peng, H. C., Sha, L. F., Gan, Q., and in Neural Information Processing H. H., and Poggio, T. (1994). View- the hippocampus. Hippocampus 8, Wei, Y. (1998). Energy function Systems, Vol. 10, eds M. I. Jor- dependent object recognition by 608–619. for learning invariance in multi- dan, M. J. Kearns, and S. A. monkeys. Curr. Biol. 4, 401–414. Movshon, J. A., Adelson, E. H., Gizzi, M. layer perceptron. Electron. Lett. 34, Solla (Cambridge, MA: MIT Press), Logothetis, N. K., Pauls, J., and Pog- S., and Newsome, W. T. (1985). “The 292–294. 215–221. gio, T. (1995). Shape representation analysis of moving visual patterns,” Perrett, D. I., and Oram, M. W. Riesenhuber, M., and Poggio, T. (1999a). in the inferior temporal cortex of in Pattern Recognition Mechanisms, (1993). Neurophysiology of shape Are cortical models really bound by monkeys. Curr. Biol. 5, 552–563. eds C. Chagas, R. Gattass, and C. G. processing. Image Vis. Comput. 11, the “binding problem”? Neuron 24, Logothetis, N. K., and Sheinberg, D. Gross (New York: Springer-Verlag), 317–333. 87–93. L. (1996). Visual object recognition. 117–151. Perrett, D. I., Rolls, E. T., and Caan, W. Riesenhuber, M., and Poggio, T. Annu. Rev. Neurosci. 19, 577–621. Mozer, M. C. (1991). The Perception (1982). Visual neurons responsive to (1999b). Hierarchical models of Lowe, D. (1985). Perceptual Organiza- of Multiple Objects: A Connection- faces in the monkey temporal cortex. object recognition in cortex. Nat. tion and Visual Recognition. Boston: ist Approach. Cambridge, MA: MIT Exp. Brain Res. 47, 329–342. Neurosci. 2, 1019–1025. Kluwer. Press. Perrett, D. I., Smith, P. A. J., Potter, D. Riesenhuber, M., and Poggio, T. (2000). Marr, D. (1982). Vision. 
San Francisco: Muller, R. U., Kubie, J. L., Bostock, E. D., Mistlin, A. J., Head, A. S., Mil- Models of object recognition. Nat. Freeman. M., Taube, J. S., and Quirk, G. J. ner, D., and Jeeves, M. A. (1985). Neurosci. 3(Suppl.), 1199–1204. Marr, D., and Nishihara, H. K. (1978). (1991). “Spatial firing correlates of Visual cells in temporal cortex sensi- Ringach, D. L. (2002). Spatial struc- Representation and recognition of neurons in the hippocampal forma- tive to face view and gaze direction. ture and symmetry of simple-cell the spatial organization of three tion of freely moving rats,” in Brain Proc. R. Soc. Lond. B Biol. Sci. 223, receptive fields in macaque primary dimensional structure. Proc. R. Soc. and Space, ed. J. Paillard (Oxford: 293–317. visual cortex. J. Neurophysiol. 88, Lond. B Biol. Sci. 200, 269–294. Oxford University Press), 296–333. Perry, G., Rolls, E. T., and Stringer, S. 455–463. McNaughton, B. L., Barnes, C. A., and Mundy, J., and Zisserman, A. (1992). M. (2006). Spatial vs temporal conti- Robertson, R. G., Rolls, E. T., and O’Keefe, J. (1983). The contributions “Introduction – towards a new nuity in view invariant visual object Georges-François, P. (1998). Spa- of position, direction, and velocity to framework for vision,” in Geometric recognition learning. Vision Res. 46, tial view cells in the primate hip- single unit activity in the hippocam- Invariance in Computer Vision, eds 3994–4006. pocampus: effects of removal of pus of freely-moving rats. Exp. Brain J. Mundy and A. Zisserman (Cam- Perry, G., Rolls, E. T., and Stringer, S. view details. J. Neurophysiol. 79, Res. 52, 41–49. bridge, MA: MIT Press), 1–39. M. (2010). Continuous transforma- 1145–1156. Mel, B. W. (1997). SEEMORE: com- Mutch, J., and Lowe, D. G. (2008). tion learning of translation invariant Rolls, E. T. (1989a). “Functions of neu- bining color, shape, and texture his- Object class recognition and local- representations. Exp. Brain Res. 
204, ronal networks in the hippocam- togramming in a neurally-inspired ization using sparse features with 255–270. pus and neocortex in memory,” in approach to visual object recogni- limited receptive fields. Int. J. Com- Pinto, N., Cox, D. D., and DiCarlo, Neural Models of Plasticity: Experi- tion. Neural Comput. 9, 777–804. put. Vis. 80, 45–57. J. J. (2008). Why is real-world mental and Theoretical Approaches, Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 67 Rolls Invariant visual object recognition Chap. 13, eds J. H. Byrne and W. Approach. Oxford: Oxford Univer- neurophysiological and compu- Rolls, E. T., and Stringer, S. M. (2007). O. Berry (San Diego, CA: Academic sity Press. tational bases. Neural Netw. 19, Invariant global motion recognition Press), 240–265. Rolls, E. T. (2008c). Top-down con- 1383–1394. in the dorsal visual system: a uni- Rolls, E. T. (1989b). “The representation trol of visual perception: attention Rolls, E. T., Franco, L., Aggelopoulos, N. fying theory. Neural Comput. 19, and storage of information in neu- in natural vision. Perception 37, C., and Jerez, J. M. (2006a). Infor- 139–169. ronal networks in the primate cere- 333–354. mation in the first spike, the order Rolls, E. T., and Tovee, M. J. (1994). Pro- bral cortex and hippocampus,” in Rolls, E. T. (2011a). David Marr’s vision: of spikes, and the number of spikes cessing speed in the cerebral cortex The Computing Neuron, Chap. 8, eds floreat computational neuroscience. provided by neurons in the inferior and the neurophysiology of visual R. Durbin, C. Miall, and G. Mitchi- Brain 134, 913–916. temporal visual cortex. Vision Res. masking. Proc. R. Soc. Lond. B Biol. son (Wokingham: Addison-Wesley), Rolls, E. T. (2011b). “Face neurons,” in 46, 4193–4205. Sci. 257, 9–15. 125–159. The Oxford Handbook of Face Percep- Rolls, E. T., Stringer, S. M., and Elliot, T. Rolls, E. T., and Tovee, M. J. (1995a). Rolls, E. T. (1992). 
Neurophysiologi- tion, Chap. 4, eds A. J. Calder, G. (2006b). Entorhinal cortex grid cells The responses of single neurons in cal mechanisms underlying face pro- Rhodes, M. H. Johnson, and J. V. can map to hippocampal place cells the temporal visual cortical areas of cessing within and beyond the tem- Haxby (Oxford: Oxford University by competitive learning. Network 17, the macaque when more than one poral cortical visual areas. Philos. Press), 51–75. 447–465. stimulus is present in the visual field. Trans. R. Soc. Lond. B Biol. Sci. 335, Rolls, E. T. (2012). Neuroculture: On Rolls, E. T., Franco, L., and Stringer, S. Exp. Brain Res. 103, 409–420. 11–21. the Implications of Brain Science. M. (2005a). The perirhinal cortex Rolls, E. T., and Tovee, M. J. (1995b). Rolls, E. T. (1994). Brain mecha- Oxford: Oxford University Press. and long-term familiarity memory. Sparseness of the neuronal represen- nisms for invariant visual recogni- Rolls, E. T. (2013). Emotion and Q. J. Exp. Psychol. B. 58, 234–245. tation of stimuli in the primate tem- tion and learning. Behav. Processes Decision-Making Explained. Oxford: Rolls, E. T., Xiang, J.-Z., and Franco, L. poral visual cortex. J. Neurophysiol. 33, 113–138. Oxford University Press. (2005b). Object, space and object- 73, 713–726. Rolls, E. T. (1995). Learning mech- Rolls, E. T., Aggelopoulos, N. C., Franco, space representations in the primate Rolls, E. T., Tovee, M. J., and Panzeri, anisms in the temporal lobe L., and Treves, A. (2004). Informa- hippocampus. J. Neurophysiol. 94, S. (1999). The neurophysiology of visual cortex. Behav. Brain Res. 66, tion encoding in the inferior tem- 833–844. backward visual masking: informa- 177–185. poral visual cortex: contributions of Rolls, E. T., and Grabenhorst, F. tion analysis. J. Cogn. Neurosci. 11, Rolls, E. T. (1999). The Brain and the firing rates and the correlations (2008). The orbitofrontal cor- 335–346. Emotion. Oxford: Oxford University between the firing of neurons. Biol. 
tex and beyond: from affect to Rolls, E. T., Tovee, M. J., Purcell, D. Press. Cybern. 90, 19–32. decision-making. Prog. Neurobiol. G., Stewart, A. L., and Azzopardi, Rolls, E. T. (2000). Functions of the pri- Rolls, E. T., Aggelopoulos, N. C., and 86, 216–244. P. (1994). The responses of neurons mate temporal lobe cortical visual Zheng, F. (2003). The receptive fields Rolls, E. T., and Milward, T. (2000). A in the temporal cortex of primates, areas in invariant visual object of inferior temporal cortex neurons model of invariant object recogni- and face identification and detec- and face recognition. Neuron 27, in natural scenes. J. Neurosci. 23, tion in the visual system: learning tion. Exp. Brain Res. 101, 474–484. 205–218. 339–348. rules, activation functions, lateral Rolls, E. T., and Treves, A. (1998). Rolls, E. T. (2003). Consciousness Rolls, E. T., and Baylis, G. C. (1986). inhibition, and information-based Neural Networks and Brain Function. absent and present: a neurophysio- Size and contrast have only small performance measures. Neural Com- Oxford: Oxford University Press. logical exploration. Prog. Brain Res. effects on the responses to faces put. 12, 2547–2572. Rolls, E. T., and Treves, A. (2011). The 144, 95–106. of neurons in the cortex of Rolls, E. T., Robertson, R. G., and neuronal encoding of information Rolls, E. T. (2005). Emotion Explained. the superior temporal sulcus of Georges-François, P. (1997a). Spa- in the brain. Prog. Neurobiol. 95, Oxford: Oxford University Press. the monkey. Exp. Brain Res. 65, tial view cells in the primate 448–490. Rolls, E. T. (2006). “Consciousness 38–48. hippocampus. Eur. J. Neurosci. 9, Rolls, E. T., Treves, A., Robertson, R. G., absent and present: a neurophysio- Rolls, E. T., Baylis, G. C., Hasselmo, M., 1789–1794. Georges-François, P., and Panzeri, logical exploration of masking,” in and Nalwa, V. (1989). “The repre- Rolls, E. T., Treves, A., and Tovee, S. (1998). 
Information about spatial The First Half Second, Chap. 6, eds H. sentation of information in the tem- M. J. (1997b). The representational view in an ensemble of primate hip- Ogmen and B. G. Breitmeyer (Cam- poral lobe visual cortical areas of capacity of the distributed encoding pocampal cells. J. Neurophysiol. 79, bridge, MA: MIT Press), 89–108. macaque monkeys,” in Seeing Con- of information provided by popula- 1797–1813. Rolls, E. T. (2007a). “Invariant represen- tour and Colour, eds J. Kulikowski, C. tions of neurons in the primate tem- Rolls, E. T., Tromans, J. M., and Stringer, tations of objects in natural scenes Dickinson, and I. Murray (Oxford: poral visual cortex. Exp. Brain Res. S. M. (2008). Spatial scene repre- in the temporal cortex visual areas,” Pergamon). 114, 149–162. sentations formed by self-organizing in Representation and Brain, Chap. 3, Rolls, E. T., Baylis, G. C., and Has- Rolls, E. T., Treves, A., Tovee, M., and learning in a hippocampal extension ed. S. Funahashi (Tokyo: Springer), selmo, M. E. (1987). The responses Panzeri, S. (1997c). Information in of the ventral visual system. Eur. J. 47–102. of neurons in the cortex in the the neuronal representation of indi- Neurosci. 28, 2116–2127. Rolls, E. T. (2007b). The representation superior temporal sulcus of the vidual stimuli in the primate tempo- Rolls, E. T., Webb, T. J., and Deco, of information about faces in the monkey to band-pass spatial fre- ral visual cortex. J. Comput. Neurosci. G. (2012). Communication before temporal and frontal lobes of pri- quency filtered faces. Vision Res. 27, 4, 309–333. coherence. Eur. J. Neurosci. (in mates including humans. Neuropsy- 311–326. Rolls, E. T., and Stringer, S. M. (2000). press). chologia 45, 124–143. Rolls, E. T., Baylis, G. C., and Leonard, C. On the design of neural networks in Rolls, E. T., and Xiang, J.-Z. (2006). Spa- Rolls, E. T. (2007c). Sensory processing M. (1985). Role of low and high spa- the brain by genetic evolution. Prog. 
tial view cells in the primate hip- in the brain related to the control tial frequencies in the face-selective Neurobiol. 61, 557–579. pocampus, and memory recall. Rev. of food intake. Proc. Nutr. Soc. 66, responses of neurons in the cortex in Rolls, E. T., and Stringer, S. M. (2001). Neurosci. 17, 175–200. 96–112. the superior temporal sulcus. Vision Invariant object recognition in the Rosenblatt, F. (1961). Principles of Neu- Rolls, E. T. (2008a). Face representations Res. 25, 1021–1035. visual system with error correction rodynamics: Perceptrons and the The- in different brain areas, and criti- Rolls, E. T., and Deco, G. (2002). and temporal difference learning. ory of Brain Mechanisms. Washing- cal band masking. J. Neuropsychol. 2, Computational Neuroscience of Network 12, 111–129. ton, DC: Spartan. 325–360. Vision. Oxford: Oxford University Rolls, E. T., and Stringer, S. M. (2006). Sakai, K., and Miyashita, Y. (1991). Rolls, E. T. (2008b). Memory, Atten- Press. Invariant visual object recognition: Neural organisation for the long- tion, and Decision-Making. A Uni- Rolls, E. T., and Deco, G. (2006). a model, with lighting invariance. J. term memory of paired associates. fying Computational Neuroscience Attention in natural scenes: Physiol. Paris 100, 43–62. Nature 354, 152–155. Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 68 Rolls Invariant visual object recognition Sato, T. (1989). Interactions of visual Annual Conference of the Cognitive Tovee, M. J., Rolls, E. T., and Azzopardi, visual system: an integrated systems stimuli in the receptive fields of infe- Science Society, ed. G. W. Cottrell P. (1994). Translation invariance perspective. Science 255, 419–423. rior temporal neurons in macaque. (San Diego: Erlbaum), 254–259. and the responses of neurons in Vogels, R., and Biederman, I. (2002). Exp. Brain Res. 77, 23–30. Stringer, S. M., Perry, G., Rolls, E. 
Selfridge, O. G. (1959). "Pandemonium: a paradigm for learning," in The Mechanization of Thought Processes, eds D. Blake and A. Uttley (London: H. M. Stationery Office), 511–529.
Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., and Poggio, T. (2007a). A quantitative theory of immediate visual recognition. Prog. Brain Res. 165, 33–56.
Serre, T., Oliva, A., and Poggio, T. (2007b). A feedforward architecture accounts for rapid categorization. Proc. Natl. Acad. Sci. U.S.A. 104, 6424–6429.
Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., and Poggio, T. (2007c). Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell. 29, 411–426.
Shadlen, M. N., and Movshon, J. A. (1999). Synchrony unbound: a critical evaluation of the temporal binding hypothesis. Neuron 24, 67–77.
Shashua, A. (1995). Algebraic functions for recognition. IEEE Trans. Pattern Anal. Mach. Intell. 17, 779–789.
Shevelev, I. A., Novikova, R. V., Lazareva, N. A., Tikhomirov, A. S., and Sharaev, G. A. (1995). Sensitivity to cross-like figures in cat striate neurons. Neuroscience 69, 51–57.
Sillito, A. M., Grieve, K. L., Jones, H. E., Cudeiro, J., and Davis, J. (1995). Visual cortical mechanisms detecting focal orientation discontinuities. Nature 378, 492–496.
Singer, W. (1999). Neuronal synchrony: a versatile code for the definition of relations? Neuron 24, 49–65.
Singer, W., and Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci. 18, 555–586.
Singer, W., Gray, C., Engel, A., Konig, P., Artola, A., and Brocher, S. (1990). Formation of cortical cell assemblies. Cold Spring Harb. Symp. Quant. Biol. 55, 939–952.
Spiridon, M., Fischl, B., and Kanwisher, N. (2006). Location and spatial profile of category-specific regions in human extrastriate cortex. Hum. Brain Mapp. 27, 77–89.
Spruston, N., Jonas, P., and Sakmann, B. (1995). Dendritic glutamate receptor channel in rat hippocampal CA3 and CA1 pyramidal neurons. J. Physiol. 482, 325–352.
Stankiewicz, B., and Hummel, J. (1994). "Metricat: a representation for basic and subordinate-level classification," in Proceedings of the 18th …
Stringer, S. M., Perry, G., Rolls, E. T., and Proske, J. H. (2006). Learning invariant object recognition in the visual system with continuous transformations. Biol. Cybern. 94, 128–142.
Stringer, S. M., and Rolls, E. T. (2000). Position invariant recognition in the visual system with cluttered environments. Neural Netw. 13, 305–315.
Stringer, S. M., and Rolls, E. T. (2002). Invariant object recognition in the visual system with novel views of 3D objects. Neural Comput. 14, 2585–2596.
Stringer, S. M., and Rolls, E. T. (2008). Learning transform invariant object recognition in the visual system with multiple stimuli present during training. Neural Netw. 21, 888–903.
Stringer, S. M., Rolls, E. T., and Tromans, J. M. (2007). Invariant object recognition with trace learning and multiple stimuli present during training. Network 18, 161–187.
Sutherland, N. S. (1968). Outline of a theory of visual pattern recognition in animal and man. Proc. R. Soc. Lond. B Biol. Sci. 171, 297–317.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9–44.
Sutton, R. S., and Barto, A. G. (1981). Towards a modern theory of adaptive networks: expectation and prediction. Psychol. Rev. 88, 135–170.
Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning. Cambridge, MA: MIT Press.
Tanaka, K. (1993). Neuronal mechanisms of object recognition. Science 262, 685–688.
Tanaka, K. (1996). Inferotemporal cortex and object vision. Annu. Rev. Neurosci. 19, 109–139.
Tanaka, K., Saito, C., Fukada, Y., and Moriya, M. (1990). "Integration of form, texture, and color information in the inferotemporal cortex of the macaque," in Vision, Memory and the Temporal Lobe, Chap. 10, eds E. Iwai and M. Mishkin (New York: Elsevier), 101–109.
Tanaka, K., Saito, H., Fukada, Y., and Moriya, M. (1991). Coding visual images of objects in the inferotemporal cortex of the macaque monkey. J. Neurophysiol. 66, 170–189.
Tou, J. T., and Gonzalez, A. G. (1974). Pattern Recognition Principles. Reading, MA: Addison-Wesley.
Tovee, M. J., and Rolls, E. T. (1995). Information encoding in short firing rate epochs by single neurons in the primate temporal visual cortex. Vis. Cogn. 2, 35–58.
Tovee, M. J., Rolls, E. T., and Azzopardi, P. (1994). Translation invariance in the responses to faces of single neurons in the temporal visual cortical areas of primates. J. Neurophysiol. 72, 1049–1060.
Tovee, M. J., Rolls, E. T., and Ramachandran, V. S. (1996). Rapid visual learning in neurones of the primate temporal visual cortex. Neuroreport 7, 2757–2760.
Tovee, M. J., Rolls, E. T., Treves, A., and Bellis, R. P. (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. J. Neurophysiol. 70, 640–654.
Trappenberg, T. P., Rolls, E. T., and Stringer, S. M. (2002). "Effective size of receptive fields of inferior temporal visual cortex neurons in natural scenes," in Advances in Neural Information Processing Systems, Vol. 14, eds T. G. Dietterich, S. Becker, and Z. Ghahramani (Cambridge, MA: MIT Press), 293–300.
Treves, A., and Rolls, E. T. (1991). What determines the capacity of autoassociative memories in the brain? Network 2, 371–397.
Tromans, J. M., Harris, M., and Stringer, S. M. (2011). A computational model of the development of separate representations of facial identity and expression in the primate visual system. PLoS ONE 6, e25616. doi:10.1371/journal.pone.0025616
Tromans, J. M., Page, J. I., and Stringer, S. M. (2012). Learning separate visual representations of independently rotating objects. Network. PMID: 22364581. [Epub ahead of print].
Tsao, D. Y., Freiwald, W. A., Tootell, R. B., and Livingstone, M. S. (2006). A cortical region consisting entirely of face-selective cells. Science 311, 617–618.
Tsao, D. Y., and Livingstone, M. S. (2008). Mechanisms of face perception. Annu. Rev. Neurosci. 31, 411–437.
Tsodyks, M. V., and Feigel'man, M. V. (1988). The enhanced storage capacity in neural networks with low activity level. Europhys. Lett. 6, 101–105.
Ullman, S. (1996). High-Level Vision: Object Recognition and Visual Cognition. Cambridge, MA: Bradford/MIT Press.
Ullman, S. (2007). Object recognition and segmentation by a fragment-based hierarchy. Trends Cogn. Sci. (Regul. Ed.) 11, 58–64.
Van Essen, D., Anderson, C. H., and Felleman, D. J. (1992). Information processing in the primate visual system: an integrated systems perspective. Science 255, 419–423.
Vogels, R., and Biederman, I. (2002). Effects of illumination intensity and direction on object coding in macaque inferior temporal cortex. Cereb. Cortex 12, 756–766.
von der Malsburg, C. (1973). Self-organization of orientation-sensitive columns in the striate cortex. Kybernetik 14, 85–100.
von der Malsburg, C. (1990). "A neural architecture for the representation of scenes," in Brain Organization and Memory: Cells, Systems and Circuits, Chap. 18, eds J. L. McGaugh, N. M. Weinberger, and G. Lynch (Oxford: Oxford University Press), 356–372.
Wallis, G., and Baddeley, R. (1997). Optimal unsupervised learning in invariant object recognition. Neural Comput. 9, 883–894.
Wallis, G., and Bülthoff, H. (1999). Learning to recognize objects. Trends Cogn. Sci. (Regul. Ed.) 3, 22–31.
Wallis, G., and Rolls, E. T. (1997). Invariant face and object recognition in the visual system. Prog. Neurobiol. 51, 167–194.
Wallis, G., Rolls, E. T., and Foldiak, P. (1993). Learning invariant responses to the natural transformations of objects. Proc. Int. Jt. Conf. Neural Netw. 2, 1087–1090.
Wasserman, E., Kirkpatrick-Steger, A., and Biederman, I. (1998). Effects of geon deletion, scrambling, and movement on picture identification in pigeons. J. Exp. Psychol. Anim. Behav. Process. 24, 34–46.
Watanabe, S., Lea, S. E. G., and Dittrich, W. H. (1993). "What can we learn from experiments on pigeon discrimination?" in Vision, Brain, and Behavior in Birds, eds H. P. Zeigler and H.-J. Bischof (Cambridge, MA: MIT Press), 351–376.
Weiner, K. S., and Grill-Spector, K. (2011). Neural representations of faces and limbs neighbor in human high-level visual cortex: evidence for a new organization principle. Psychol. Res. PMID: 22139022. [Epub ahead of print].
Widrow, B., and Hoff, M. E. (1960). "Adaptive switching circuits," in 1960 IRE WESCON Convention Record, Part 4 (New York: IRE), 96–104. [Reprinted in Anderson and Rosenfeld, 1988].
Widrow, B., and Stearns, S. D. (1985). Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.
Winston, P. H. (1975). "Learning structural descriptions from examples," in The Psychology of Computer Vision, ed. P. H. Winston (New York: McGraw-Hill), 157–210.
Wiskott, L. (2003). Slow feature analysis: a theoretical analysis of optimal free responses. Neural Comput. 15, 2147–2177.
Wiskott, L., and Sejnowski, T. J. (2002). Slow feature analysis: unsupervised learning of invariances. Neural Comput. 14, 715–770.
Womelsdorf, T., Schoffelen, J. M., Oostenveld, R., Singer, W., Desimone, R., Engel, A. K., and Fries, P. (2007). Modulation of neuronal interactions through neuronal synchronization. Science 316, 1609–1612.
Wurtz, R. H., and Kandel, E. R. (2000a). "Central visual pathways," in Principles of Neural Science, 4th Edn, Chap. 27, eds E. R. Kandel, J. H. Schwartz, and T. M. Jessell (New York: McGraw-Hill), 543–547.
Wurtz, R. H., and Kandel, E. R. (2000b). "Perception of motion depth and form," in Principles of Neural Science, 4th Edn, Chap. 28, eds E. R. Kandel, J. H. Schwartz, and T. M. Jessell (New York: McGraw-Hill), 548–571.
Wyss, R., Konig, P., and Verschure, P. F. (2006). A model of the ventral visual system based on temporal stability and local memory. PLoS Biol. 4, e120. doi:10.1371/journal.pbio.0040120
Yamane, S., Kaji, S., and Kawano, K. (1988). What facial features activate face neurons in the inferotemporal cortex of the monkey? Exp. Brain Res. 73, 209–214.
Yamane, Y., Carlson, E. T., Bowman, K. C., Wang, Z., and Connor, C. E. (2008). A neural code for three-dimensional object shape in macaque inferotemporal cortex. Nat. Neurosci. 11, 1352–1360.
Yi, D. J., Turk-Browne, N. B., Flombaum, J. I., Kim, M. S., Scholl, B. J., and Chun, M. M. (2008). Spatiotemporal object continuity in human ventral visual cortex. Proc. Natl. Acad. Sci. U.S.A. 105, 8840–8845.
Zhao, Q., and Koch, C. (2011). Learning a saliency map using fixated locations in natural scenes. J. Vis. 11, 9.
Zucker, S. W., Dobbins, A., and Iverson, L. (1989). Two stages of curve detection suggest two styles of visual computation. Neural Comput. 1, 68–81.

Conflict of Interest Statement: The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Received: 01 November 2011; accepted: 23 May 2012; published online: 19 June 2012.

Citation: Rolls ET (2012) Invariant visual object and face recognition: neural and computational bases, and a model, VisNet. Front. Comput. Neurosci. 6:35. doi: 10.3389/fncom.2012.00035

Copyright © 2012 Rolls. This is an open-access article distributed under the terms of the Creative Commons Attribution Non Commercial License, which permits non-commercial use, distribution, and reproduction in other forums, provided the original authors and source are credited.

Frontiers in Computational Neuroscience | www.frontiersin.org | June 2012 | Volume 6 | Article 35

Invariant Visual Object and Face Recognition: Neural and Computational Bases, and a Model, VisNet

Frontiers in Computational Neuroscience, Volume 6 – Jun 19, 2012



Copyright
Copyright © 2012 Rolls.
ISSN
1662-5188
eISSN
1662-5188
DOI
10.3389/fncom.2012.00035

Abstract

REVIEW ARTICLE, published: 19 June 2012, doi: 10.3389/fncom.2012.00035

Invariant visual object and face recognition: neural and computational bases, and a model, VisNet

Edmund T. Rolls 1,2 *
1 Oxford Centre for Computational Neuroscience, Oxford, UK
2 Department of Computer Science, University of Warwick, Coventry, UK

Edited by: Evgeniy Bart, Palo Alto Research Center, USA
Reviewed by: Alexander G. Dimitrov, Washington State University Vancouver, USA; Jay Hegdé, Georgia Health Sciences University, USA
*Correspondence: Edmund T. Rolls, Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK. e-mail: [email protected]

Neurophysiological evidence for invariant representations of objects and faces in the primate inferior temporal visual cortex is described. Then a computational approach to how invariant representations are formed in the brain is described that builds on the neurophysiology. A feature hierarchy model in which invariant representations can be built by self-organizing learning based on the temporal and spatial statistics of the visual input produced by objects as they transform in the world is described. VisNet can use temporal continuity in an associative synaptic learning rule with a short-term memory trace, and/or it can use spatial continuity in continuous spatial transformation learning which does not require a temporal trace. The model of visual processing in the ventral cortical stream can build representations of objects that are invariant with respect to translation, view, size, and also lighting. The model has been extended to provide an account of invariant representations in the dorsal visual system of the global motion produced by objects such as looming, rotation, and object-based movement. The model has been extended to incorporate top-down feedback connections to model the control of attention by biased competition in, for example, spatial and object search tasks.
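The associative learning rule with a short-term memory trace mentioned above can be sketched as follows. This is an illustrative Python sketch, not code from the paper: the learning rate `alpha`, the trace parameter `eta`, and the weight-normalization step are assumed values in the style of competitive networks.

```python
import numpy as np

# The postsynaptic trace y_bar mixes the current firing y with its own
# previous value,
#     y_bar(t) = (1 - eta) * y(t) + eta * y_bar(t - 1)
# so successive transforms of the same object, seen close together in time,
# strengthen onto the same output neuron via the Hebb-like update
#     dw = alpha * y_bar(t) * x(t)

def trace_update(w, x, y, y_bar_prev, alpha=0.1, eta=0.8):
    """One timestep of trace learning; returns (new_weights, new_trace)."""
    y_bar = (1.0 - eta) * y + eta * y_bar_prev  # short-term memory trace
    w = w + alpha * y_bar * x                   # Hebb-like associative update
    return w / np.linalg.norm(w), y_bar         # keep the weight vector length 1
```

Presenting the successive views of one object as a temporal sequence of input vectors `x`, with `y` the firing of an output neuron, makes the weights of that neuron grow onto features shared across the views.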
The approach has also been extended to account for how the visual system can select single objects in complex visual scenes, and how multiple objects can be represented in a scene. The approach has also been extended to provide, with an additional layer, for the development of representations of spatial scenes of the type found in the hippocampus. Keywords: VisNet, invariance, face recognition, object recognition, inferior temporal visual cortex, trace learning rule, hippocampus, spatial scene representation 1. INTRODUCTION and faces found in the inferior temporal visual cortex as shown One of the major problems that is solved by the visual system in by neuronal recordings. A fuller account is provided in Memory, the cerebral cortex is the building of a representation of visual Attention, and Decision-Making, Chapter 4 (Rolls, 2008b). Then information which allows object and face recognition to occur rel- I build on that foundation a closely linked computational the- atively independently of size, contrast, spatial-frequency, position ory of how these invariant representations of objects and faces on the retina, angle of view, lighting, etc. These invariant rep- may be formed by self-organizing learning in the brain, which resentations of objects, provided by the inferior temporal visual has been investigated by simulations in a model network, VisNet cortex (Rolls, 2008b), are extremely important for the operation (Rolls, 1992, 2008b; Wallis and Rolls, 1997; Rolls and Milward, of many other systems in the brain, for if there is an invari- 2000). 
ant representation, it is possible to learn on a single trial about This paper reviews this combined neurophysiological and com- reward/punishment associations of the object, the place where putational neuroscience approach developed by the author which that object is located, and whether the object has been seen leads to a theory of invariant visual object recognition, and relates recently, and then to correctly generalize to other views, etc. of this approach to other research. the same object (Rolls, 2008b). The way in which these invariant representations of objects are formed is a major issue in under- 2. INVARIANT REPRESENTATIONS OF FACES AND standing brain function, for with this type of learning, we must OBJECTS IN THE INFERIOR TEMPORAL VISUAL CORTEX not only store and retrieve information, but we must solve in 2.1. PROCESSING TO THE INFERIOR TEMPORAL CORTEX IN THE addition the major computational problem of how all the differ- PRIMATE VISUAL SYSTEM ent images on the retina (position, size, view, etc.) of an object A schematic diagram to indicate some aspects of the processing can be mapped to the same representation of that object in the involved in object identification from the primary visual cor- brain. It is this process with which we are concerned in this tex, V1, through V2 and V4 to the posterior inferior temporal paper. cortex (TEO) and the anterior inferior temporal cortex (TE) is In Section 2 of this paper, I summarize some of the evi- shown in Figure 1 (Rolls and Deco, 2002; Rolls, 2008b; Blumberg dence on the nature of the invariant representations of objects and Kreiman, 2010; Orban, 2011). 
The approximate location of Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 1 Rolls Invariant visual object recognition these visual cortical areas on the brain of a macaque monkey is approximately 11–15 mm anterior to the interaural plane (Baylis shown in Figure 2, which also shows that TE has a number of et al., 1987; Rolls, 2007a,b, 2008b). For comparison, the “middle different subdivisions. The different TE areas all contain visually face patch” of Tsao et al. (2006) was at A6, which is probably responsive neurons, as do many of the areas within the cortex in part of the posterior inferior temporal cortex (Tsao and Liv- the superior temporal sulcus (Baylis et al., 1987). For the pur- ingstone, 2008). In the anterior inferior temporal cortex areas poses of this summary, these areas will be grouped together as we have investigated, there are separate regions specialized for the anterior inferior temporal cortex (IT), except where otherwise face identity in areas TEa and TEm on the ventral lip of the stated. superior temporal sulcus and the adjacent gyrus, for face expres- The object and face-selective neurons described in this paper sion and movement in the cortex deep in the superior tem- are found mainly between 7 and 3 mm posterior to the sphe- poral sulcus (Baylis et al., 1987; Hasselmo et al., 1989a; Rolls, noid reference, which in a 3–4 kg macaque corresponds to 2007b), and separate neuronal clusters for objects (Booth and FIGURE 1 | Convergence in the visual system. Right – as it occurs VisNet. Convergence through the network is designed to provide in the brain. V1, visual cortex area V1; TEO, posterior inferior temporal fourth layer neurons with information from across the entire input cortex; TE, inferior temporal cortex (IT). Left – as implemented in retina. 
FIGURE 2 | Lateral view of the macaque brain (left hemisphere) showing the different architectonic areas (e.g., TEm, TEa) in and bordering the anterior part of the superior temporal sulcus (STS) of the macaque (see text). The STS has been drawn opened to reveal the cortical areas inside it, and is circumscribed by a thick line. Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 2 Rolls Invariant visual object recognition Rolls, 1998; Kriegeskorte et al., 2008; Rolls, 2008b). A possi- ble way in which VisNet could produce separate representations of face identity and expression has been investigated (Tromans et al., 2011). Similarly, in humans there are a number of sepa- rate visual representations of faces and other body parts (Spiridon et al., 2006; Weiner and Grill-Spector, 2011), with the clustering together of neurons with similar responses influenced by the self- organizing map processes that are a result of cortical design (Rolls, 2008b). 2.2. TRANSLATION INVARIANCE AND RECEPTIVE FIELD SIZE There is convergence from each small part of a region to the suc- ceeding region (or layer in the hierarchy) in such a way that the receptive field sizes of neurons (for example, 1˚ near the fovea in V1) become larger by a factor of approximately 2.5 with each suc- ceeding stage. (The typical parafoveal receptive field sizes found would not be inconsistent with the calculated approximations of, for example, 8˚ in V4, 20˚ in TEO, and 50˚ in inferior tem- poral cortex Boussaoud et al., 1991; see Figure 1). Such zones of convergence would overlap continuously with each other (see Figure 1). 
This connectivity provides part of the basis for the fact that many neurons in the temporal cortical visual areas respond to a stimulus relatively independently of where it is in their receptive field, and moreover maintain their stimulus selec- FIGURE 3 | Objects shown in a natural scene, in which the task was to tivity when the stimulus appears in different parts of the visual search for and touch one of the stimuli. The objects in the task as run field (Gross et al., 1985; Tovee et al., 1994; Rolls et al., 2003). were smaller. The diagram shows that if the receptive fields of inferior temporal cortex neurons are large in natural scenes with multiple objects This is called translation or shift invariance. In addition to hav- (in this scene, bananas, and a face), then any receiving neuron in structures ing topologically appropriate connections, it is necessary for the such as the orbitofrontal cortex and amygdala would receive information connections to have the appropriate synaptic weights to perform from many stimuli in the field of view, and would not be able to provide the mapping of each set of features, or object, to the same set evidence about each of the stimuli separately. of neurons in IT. How this could be achieved is addressed in the computational neuroscience models described later in this paper. In another situation the monkey had to search for two objects on a screen, and a touch of one object was rewarded with juice, 2.3. REDUCED TRANSLATION INVARIANCE IN NATURAL SCENES, AND and of another object was punished with saline (see Figure 3 for THE SELECTION OF A REWARDED OBJECT a schematic overview and Figure 30 for the actual display). In Until recently, research on translation invariance considered the both situations neuronal responses to the effective stimuli for the case in which there is only one object in the visual field. What neurons were compared when the objects were presented in the happens in a cluttered, natural, environment? 
Do all objects that natural scene or on a plain background. It was found that the can activate an inferior temporal neuron do so whenever they are overall response of the neuron to objects was sometimes some- anywhere within the large receptive fields of inferior temporal neu- what reduced when they were presented in natural scenes, though rons (Sato, 1989; Rolls and Tovee, 1995a)? If so, the output of the the selectivity of the neurons remained. However, the main finding visual system might be confusing for structures that receive inputs was that the magnitudes of the responses of the neurons typically from the temporal cortical visual areas. If one of the objects in the became much less in the real scene the further the monkey fixated visual field was associated with reward, and another with punish- in the scene away from the object (see Figures 4 and 31 and Section ment, would the output of the inferior temporal visual cortex to 5.8.1). emotion-related brain systems be an amalgam of both stimuli? If It is proposed that this reduced translation invariance in natural so, how would we be able to choose between the stimuli, and have scenes helps an unambiguous representation of an object which an emotional response to one but not perhaps the other, and select may be the target for action to be passed to the brain regions one for action and not the other (see Figure 3). that receive from the primate inferior temporal visual cortex. It To investigate how information is passed from the inferior helps with the binding problem, by reducing in natural scenes temporal cortex (IT) to other brain regions to enable stimuli the effective receptive field of inferior temporal cortex neurons to to be selected from natural scenes for action, Rolls et al. (2003) approximately the size of an object in the scene. 
The computa- analyzed the responses of single and simultaneously recorded IT tional utility and basis for this is considered in Section 5.8 and neurons to stimuli presented in complex natural backgrounds. In by Rolls and Deco (2002), Trappenberg et al. (2002), Deco and one situation, a visual fixation task was performed in which the Rolls (2004), Aggelopoulos and Rolls (2005), and Rolls and Deco monkey fixated at different distances from the effective stimulus. (2006), and includes an advantage for what is at the fovea because Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 3 Rolls Invariant visual object recognition of the large cortical magnification of the fovea, and shunting inter- the fovea, just massively reducing the computational including fea- actions between representations weighted by how far they are from ture binding problems. The brain then deals with a complex scene the fovea. by fixating different parts serially, using processes such as bottom- These findings suggest that the principle of providing strong up saliency to guide where fixations should occur (Itti and Koch, weight to whatever is close to the fovea is an important princi- 2000; Zhao and Koch, 2011). ple governing the operation of the inferior temporal visual cortex, Interestingly, although the size of the receptive fields of inferior and in general of the output of the ventral visual system in natural temporal cortex neurons becomes reduced in natural scenes so environments. This principle of operation is very important in that neurons in IT respond primarily to the object being fixated, interfacing the visual system to action systems, because the effec- there is nevertheless frequently some asymmetry in the receptive tive stimulus in making inferior temporal cortex neurons fire is in fields (see Section 5.9 and Figure 35). This provides a partial solu- natural scenes usually on or close to the fovea. 
This means that the tion to how multiple objects and their positions in a scene can be spatial coordinates of where the object is in the scene do not have captured with a single glance (Aggelopoulos and Rolls, 2005). to be represented in the inferior temporal visual cortex, nor passed from it to the action selection system, as the latter can assume that 2.4. SIZE AND SPATIAL-FREQUENCY INVARIANCE the object making IT neurons fire is close to the fovea in natural Some neurons in the inferior temporal visual cortex and cortex in scenes. Thus the position in visual space being fixated provides the anterior part of the superior temporal sulcus (IT/STS) respond part of the interface between sensory representations of objects relatively independently of the size of an effective face stimulus, and their coordinates as targets for actions in the world. The small with a mean size-invariance (to a half maximal response) of 12 receptive fields of IT neurons in natural scenes make this possible. times (3.5 octaves; Rolls and Baylis, 1986). An example of the After this, local, egocentric, processing implemented in the dorsal responses of an inferior temporal cortex face-selective neuron to faces of different sizes is shown in Figure 5. This is not a property visual processing stream using, e.g., stereodisparity may be used to guide action toward objects being fixated (Rolls and Deco, 2002). of a simple single-layer network (see Figure 7), nor of neurons in V1, which respond best to small stimuli, with a typical size- The reduced receptive field size in complex natural scenes also enables emotions to be selective to just what is being fixated, invariance of 1.5 octaves. Also, the neurons typically responded to because this is the information that is transmitted by the firing a face when the information in it had been reduced from 3D to a of IT neurons to structures such as the orbitofrontal cortex and 2D representation in gray on a monitor, with a response that was amygdala. 
on average 0.5 of that to a real face. There is an important comparison to be made here with some Another transform over which recognition is relatively invari- approaches in engineering in which attempts are made to analyze a ant is spatial-frequency. For example, a face can be identified when whole visual scene at once. This is a massive computational prob- it is blurred (when it contains only low-spatial frequencies), and lem, not yet solved in engineering. It is very instructive to see that when it is high-pass spatial-frequency filtered (when it looks like a this is not the approach taken by the (primate and human) brain, line drawing). If the face images to which these neurons respond which instead analyses in complex natural scenes what is close to are low-pass filtered in the spatial-frequency domain (so that they are blurred), then many of the neurons still respond when the images contain frequencies only up to 8 cycles per face. Similarly, FIGURE 4 | Firing of a temporal cortex cell to an effective stimulus presented either in a blank background or in a natural scene, as a FIGURE 5 | Typical response of an inferior temporal cortex function of the angle in degrees at which the monkey was fixating away from the effective stimulus. The task was to search for and touch face-selective neuron to faces of different sizes. The size subtended at the retina in degrees is shown. (From Rolls and Baylis, 1986.) the stimulus. (After Rolls et al., 2003.) Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 4 Rolls Invariant visual object recognition the neurons still respond to high-pass filtered images (with only high-spatial-frequency edge information) when frequencies down to only 8 cycles per face are included (Rolls et al., 1985). Face recog- nition shows similar invariance with respect to spatial-frequency (see Rolls et al., 1985). 
Further analysis of these neurons with narrow (octave) bandpass spatial-frequency filtered face stimuli shows that the responses of these neurons to an unfiltered face can not be predicted from a linear combination of their responses to the narrow bandstimuli (Rolls et al., 1987). This lack of linearity of these neurons, and their responsiveness to a wide range of spa- tial frequencies (see also their broad critical bandmasking Rolls, 2008a), indicate that in at least this part of the primate visual system recognition does not occur using Fourier analysis of the spatial-frequency components of images. The utility of this representation for memory systems in the brain is that the output of the visual system will represent an object invariantly with respect to position on the retina, size, etc. and this simplifies the functionality required of the (multiple) memory sys- tems, which need then simply associate the object representation with reward (orbitofrontal cortex and amygdala), associate it with position in the environment (hippocampus), recognize it as famil- iar (perirhinal cortex), associate it with a motor response in a habit memory (basal ganglia), etc. (Rolls, 2008b). The associations can be relatively simple, involving, for example, Hebbian associativity (Rolls, 2008b). Some neurons in the temporal cortical visual areas actually rep- resent the absolute size of objects such as faces independently of viewing distance (Rolls and Baylis, 1986). This could be called neu- rophysiological size constancy. The utility of this representation by a small population of neurons is that the absolute size of an object is a useful feature to use as an input to neurons that perform object recognition. Faces only come in certain sizes. 2.5. 
COMBINATIONS OF FEATURES IN THE CORRECT SPATIAL CONFIGURATION Many neurons in this ventral processing stream respond to com- FIGURE 6 | Responses of four temporal cortex neurons to whole faces binations of features (including objects), but not to single features and to parts of faces. The mean firing rate sem are shown. The presented alone, and the features must have the correct spatial responses are shown as changes from the spontaneous firing rate of each neuron. Some neurons respond to one or several parts of faces presented arrangement. This has been shown, for example, with faces, for alone. Other neurons (of which the top one is an example) respond only to which it has been shown by masking out or presenting parts of the combination of the parts (and only if they are in the correct spatial the face (for example, eyes, mouth, or hair) in isolation, or by configuration with respect to each other as shown by Rolls et al., 1994). The jumbling the features in faces, that some cells in the cortex in control stimuli were non-face objects. (After Perrett et al., 1982.) IT/STS respond only if two or more features are present, and are in the correct spatial arrangement (Perrett et al., 1982; Rolls et al., 1994; Freiwald et al., 2009; Rolls, 2011b). Figure 6 shows exam- of the process by which the cortical hierarchy operates, and this is ples of four neurons, the top one of which responds only if all incorporated into VisNet (Elliffe et al., 2002). the features are present, and the others of which respond not Evidence consistent with the suggestion that neurons are only to the full-face, but also to one or more features. Corre- responding to combinations of a few variables represented at the sponding evidence has been found for non-face cells. For example preceding stage of cortical processing is that some neurons in Tanaka et al. 
(1990) showed that some posterior inferior tempo- V2 and V4 respond to end-stopped lines, to tongues flanked by ral cortex neurons might only respond to the combination of an inhibitory subregions, to combinations of lines, to combinations edge and a small circle if they were in the correct spatial relation- of colors, or to surfaces (Hegde and Van Essen, 2000, 2003, 2007; ship to each other. Consistent evidence for face part configuration Ito and Komatsu, 2004; Brincat and Connor, 2006; Anzai et al., sensitivity has been found in human fMRI studies (Liu et al., 2007; Orban, 2011). In the inferior temporal visual cortex, some 2010). neurons respond to spatial configurations of surface fragments to These findings are important for the computational theory, for help specify the three-dimensional structure of objects (Yamane they show that neurons selective to feature combinations are part et al., 2008). Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 5 Rolls Invariant visual object recognition 2.6. A VIEW-INVARIANT REPRESENTATION number of neurons, the responses occurred only to a subset of For recognizing and learning about objects (including faces), it is the objects (using ensemble encoding), irrespective of the viewing important that an output of the visual system should be not only angle. Moreover, the firing of a neuron on any one trial, taken at translation and size invariant, but also relatively view-invariant. random and irrespective of the particular view of any one object, In an investigation of whether there are such neurons, we found provided information about which object had been seen, and this that some temporal cortical neurons reliably responded differently information increased approximately linearly with the number of to the faces of two different individuals independently of viewing neurons in the sample. 
Mixed together in the same cortical regions there are neurons with view-dependent responses (for example, Hasselmo et al., 1989b; Rolls and Tovee, 1995b). Such neurons might respond, for example, to a view of a profile of a monkey but not to a full-face view of the same monkey (Perrett et al., 1985; Hasselmo et al., 1989b).

These findings of view-dependent, partially view-independent, and view-independent representations in the same cortical regions are consistent with the hypothesis discussed below that view-independent representations are being built in these regions by associating together the outputs of neurons that have different view-dependent responses to the same individual. These findings also provide evidence that one output of the visual system includes representations of what is being seen, in a view-independent way that would be useful for object recognition and for learning associations about objects; and that another output is a view-based representation that would be useful in social interactions to determine whether another individual is looking at one, and for selecting details of motor responses, for which the orientation of the object with respect to the viewer is required (Rolls, 2008b).

Further evidence that some neurons in the temporal cortical visual areas have object-based rather than view-based responses comes from a study of a population of neurons that responds to moving faces (Hasselmo et al., 1989b). For example, four neurons responded vigorously to a head undergoing ventral flexion, irrespective of whether the view of the head was full-face, of either profile, or even of the back of the head. These different views could only be specified as equivalent in object-based coordinates. Further, the movement specificity was maintained across inversion, with neurons responding, for example, to ventral flexion of the head irrespective of whether the head was upright or inverted. In this procedure, retinally encoded or viewer-centered movement vectors are reversed, but the object-based description remains the same.

Also consistent with object-based encoding is the finding of a small number of neurons that respond to images of faces of a given absolute size, irrespective of the retinal image size or distance (Rolls and Baylis, 1986).

Neurons with view-invariant responses to objects seen naturally by macaques have also been described (Booth and Rolls, 1998). The stimuli were presented for 0.5 s on a color video monitor while the monkey performed a visual fixation task. The stimuli were images of 10 real plastic objects that had been in the monkey's cage for several weeks, to enable him to build view-invariant representations of the objects.
Control stimuli were views of objects that had never been seen as real objects. The neurons analyzed were in the TE cortex in and close to the ventral lip of the anterior part of the superior temporal sulcus. Many neurons were found that responded to some views of some objects. However, for a smaller number of neurons, the responses occurred only to a subset of the objects (using ensemble encoding), irrespective of the viewing angle. Moreover, the firing of a neuron on any one trial, taken at random and irrespective of the particular view of any one object, provided information about which object had been seen, and this information increased approximately linearly with the number of neurons in the sample. This is strong quantitative evidence that some neurons in the inferior temporal cortex provide an invariant representation of objects. Moreover, the results of Booth and Rolls (1998) show that the information is available in the firing rates, and has all the desirable properties of distributed representations, including exponentially high coding capacity and rapid speed of read-out of the information (Rolls, 2008b; Rolls and Treves, 2011).

Further evidence consistent with these findings is that some studies have shown that the responses of some visual neurons in the inferior temporal cortex do not depend on the presence or absence of critical features for maximal activation (Perrett et al., 1982; Tanaka, 1993, 1996). For example, neuron 4 in Figure 6 responded to several of the features in a face when these features were presented alone (Perrett et al., 1982). In another example, Mikami et al. (1994) showed that some TE cells respond to partial views of the same laboratory instrument(s), even when these partial views contain different features. Such functionality is important for object recognition when part of an object is occluded by, for example, another object. In a different approach, Logothetis et al. (1994) have reported that in monkeys extensively trained (over thousands of trials) to treat different views of computer-generated wire-frame "objects" as the same, a small population of neurons in the inferior temporal cortex did respond to different views of the same wire-frame object (see also Logothetis and Sheinberg, 1996). However, extensive training is not necessary for invariant representations to be formed, and indeed no explicit training in invariant object recognition was given in the experiment by Booth and Rolls (1998), as Rolls' hypothesis (Rolls, 1992) is that view-invariant representations can be learned by associating together the different views of objects as they are moved and inspected naturally in a period that may be in the order of a few seconds. Evidence for this is described in Section 2.7.

2.7. LEARNING OF NEW REPRESENTATIONS IN THE TEMPORAL CORTICAL VISUAL AREAS

To investigate the idea that visual experience might guide the formation of the responsiveness of neurons so that they provide an economical and ensemble-encoded representation of items actually present in the environment (and indeed any rapid learning found might help in the formation of invariant representations), the responses of inferior temporal cortex face-selective neurons have been analyzed while a set of new faces was shown. Some of the neurons studied in this way altered the relative degree to which they responded to the different members of the set of novel faces over the first few (1–2) presentations of the set (Rolls et al., 1989). If in a different experiment a single novel face was introduced when the responses of a neuron to a set of familiar faces were being recorded, the responses to the set of familiar faces were not disrupted, while the responses to the novel face became stable within a few presentations. Alteration of the tuning of individual neurons in this way may result in a good discrimination over the population as a whole of the faces known to the monkey. This evidence is consistent with the categorization being performed by self-organizing competitive neuronal networks, as described elsewhere (Rolls and Treves, 1998; Rolls, 2008b). Further evidence has been found to support the hypothesis (Rolls, 1992, 2008b) that unsupervised natural experience rapidly alters invariant object representation in the visual cortex (Li and DiCarlo, 2008; Li et al., 2011; cf. Folstein et al., 2010).

Further evidence that these neurons can learn new representations very rapidly comes from an experiment in which binarized black and white (two-tone) images of faces that blended with the background were used. These did not activate face-selective neurons. Full gray-scale images of the same photographs were then shown for ten 0.5 s presentations. In a number of cases, if the neuron happened to be responsive to that face, when the binarized version of the same face was shown next, the neurons responded to it (Tovee et al., 1996). This is a direct parallel to the same phenomenon that is observed psychophysically, and provides dramatic evidence that these neurons are influenced by only a very few seconds (in this case 5 s) of experience with a visual stimulus. We have shown a neural correlate of this effect using similar stimuli and a similar paradigm in a PET (positron emission tomography) neuroimaging study in humans, with a region showing an effect of the learning found for faces in the right temporal lobe, and for objects in the left temporal lobe (Dolan et al., 1997).

Once invariant representations of objects have been learned in the inferior temporal visual cortex based on the statistics of the spatio-temporal continuity of objects in the visual world (Rolls, 1992, 2008b; Yi et al., 2008), later processes may be required to categorize objects based on properties other than their properties as objects. One such property is that certain objects may need to be treated as similar for the correct performance of a task, and others as different, and that demand can influence the representations of objects in a number of brain areas (Fenske et al., 2006; Freedman and Miller, 2008; Kourtzi and Connor, 2011). That process may in turn influence representations in the inferior temporal visual cortex, for example, by top-down bias (Rolls and Deco, 2002; Rolls, 2008b,c).
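The self-organizing competitive learning invoked above can be sketched in a few lines of Python. This toy network is my own illustration (the sizes, learning rate, and stimuli are arbitrary, not taken from the papers cited): a winner-take-all layer of dot-product neurons comes to respond stably to a novel stimulus within a few presentations, after having been familiarized with a small stimulus set.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(w):
    # Keep weight vectors on the unit sphere, as in standard competitive learning
    return w / np.linalg.norm(w, axis=-1, keepdims=True)

n_inputs, n_neurons = 50, 10
W = normalize(rng.standard_normal((n_neurons, n_inputs)))

familiar = normalize(rng.standard_normal((3, n_inputs)))
novel = normalize(rng.standard_normal(n_inputs))

def present(x, W, lr=0.3):
    h = W @ x                    # dot-product activations of all neurons
    winner = int(np.argmax(h))   # winner-take-all competition
    # Hebb-like update: only the winner's weights move toward the input
    W[winner] = normalize(W[winner] + lr * (x - W[winner]))
    return winner

# Familiarize the network with three stimuli
for _ in range(20):
    for f in familiar:
        present(f, W)

# A few presentations of a novel stimulus recruit one neuron, whose
# response to that stimulus then remains stable from trial to trial
winners = [present(novel, W) for _ in range(5)]
print(winners)
```

Because only the winning neuron's weight vector moves toward the novel input, its activation for that input can only grow, so the same neuron wins on every subsequent presentation, a minimal analog of the rapid, stable tuning described above.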
2.8. DISTRIBUTED ENCODING

An important question for understanding brain function is whether a particular object (or face) is represented in the brain by the firing of one or a few gnostic (or "grandmother") cells (Barlow, 1972), or whether instead the firing of a group or ensemble of cells, each with somewhat different responsiveness, provides the representation. Advantages of distributed codes include generalization and graceful degradation (fault tolerance), and a potentially very high capacity in the number of stimuli that can be represented (that is, exponential growth of capacity with the number of neurons in the representation; Rolls and Treves, 1998, 2011; Rolls, 2008b). If the ensemble encoding is sparse, this provides a good input to an associative memory, for then large numbers of stimuli can be stored (Rolls, 2008b; Rolls and Treves, 2011). We have shown that in the inferior temporal visual cortex and cortex in the anterior part of the superior temporal sulcus (IT/STS), there is a sparse distributed representation in the firing rates of neurons about faces and objects (Rolls, 2008b; Rolls and Treves, 2011). The information from a single cell is informative about a set of stimuli, but the information increases approximately linearly with the number of neurons in the ensemble, and can be read moderately efficiently by dot product decoding. This is what neurons can do: produce in their depolarization or firing rate a synaptically weighted sum of the firing rate inputs that they receive from other neurons (Rolls, 2008b). This property is fundamental to the mechanisms implemented in VisNet. There is little information in whether IT neurons fire synchronously or not (Aggelopoulos et al., 2005; Rolls and Treves, 2011), so that temporal syntactic binding (Singer, 1999) may not be part of the mechanism. Each neuron has an approximately exponential probability distribution of firing rates in a sparse distributed representation (Franco et al., 2007; Rolls and Treves, 2011).

These generic properties are described in detail elsewhere (Rolls, 2008b; Rolls and Treves, 2011), as are their implications for understanding brain function (Rolls, 2012), and so are not further described here. They are incorporated into the design of VisNet, as will become evident.

It is consistent with this general conceptual background that Kreiman et al. (2000) have described some neurons in the human temporal lobe that seem to respond selectively to an object. This is consistent with the principles just described, though the brain areas in which these recordings were made may be beyond the inferior temporal visual cortex, and the tuning appears to be more specific, perhaps reflecting backprojections from language or other cognitive areas concerned, for example, with tool use, that might influence the categories represented in high-order cortical areas (Farah et al., 1996; Farah, 2000; Rolls, 2008b).

3. APPROACHES TO INVARIANT OBJECT RECOGNITION

A goal of my approach is to provide a biologically based and biologically plausible approach to how the brain computes invariant representations for use by other brain systems (Rolls, 2008b). This leads me to propose a hierarchical feed-forward series of competitive networks using convergence from stage to stage, and the use of a modified Hebb synaptic learning rule that incorporates a short-term memory trace of previous neuronal activity to help learn the invariant properties of objects from the temporo-spatial statistics produced by the normal viewing of objects (Wallis and Rolls, 1997; Rolls and Milward, 2000; Stringer and Rolls, 2000, 2002; Rolls and Stringer, 2001, 2006; Elliffe et al., 2002; Rolls and Deco, 2002; Deco and Rolls, 2004; Rolls, 2008b). In Sections 3.1–3.5, I summarize some other approaches to invariant object recognition, and in Section 3.6 I introduce feature hierarchies as part of the background to VisNet, which is described starting in Section 4.

I start by emphasizing that generalization to different positions, sizes, views, etc. of an object is not a simple property of one-layer neural networks. Although neural networks do generalize well, the type of generalization they show naturally is to vectors which have a high dot product or correlation with what they have already learned. To make this clear, Figure 7 is a reminder that the activation $h_i$ of each neuron is computed as

$h_i = \sum_j x_j w_{ij}$    (1)

where the sum is over the $C$ input axons, indexed by $j$.

FIGURE 7 | A neuron that computes a dot product of the input pattern with its synaptic weight vector generalizes well to other patterns based on their similarity measured in terms of dot product or correlation, but shows no translation (or size, etc.) invariance.
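The dot-product activation of Eq. (1), and the limitation illustrated in Figure 7, can be checked numerically. The following sketch (my own illustration, not code from the model) stores a random binary pattern as a neuron's weight vector and shows that a one-position shift of the same pattern sharply reduces the activation:

```python
import numpy as np

rng = np.random.default_rng(1)

# A random binary input pattern, stored as the neuron's synaptic weight vector
x = rng.integers(0, 2, size=100).astype(float)
w = x.copy()

# Activation h = sum_j x_j * w_j (Eq. 1): maximal for the stored pattern
h_same = float(w @ x)

# The same pattern shifted by one position
x_shifted = np.roll(x, 1)
h_shifted = float(w @ x_shifted)

print(h_same, h_shifted)
```

For a random binary pattern, the shifted dot product is far smaller than the match to the stored pattern, so the neuron fails to respond to its own pattern in a new position: exactly the failure of translation invariance that the feature-hierarchy architecture described later is designed to overcome.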
Now consider translation (or shift) of the input (random binary) pattern vector by one position. The dot product will now drop to a low level, and the neuron will not respond, even though it is the same pattern, just shifted by one location. This makes the point that special processes are needed to compute invariant representations. Network approaches to such invariant pattern recognition are described in this paper. Once an invariant representation has been computed by a sensory system, it is in a form that is suitable for presentation to a pattern association or autoassociation neural network (Rolls, 2008b).

3.1. FEATURE SPACES

One very simple possibility for performing object classification is based on feature spaces, which amount to lists of (the extent to which) different features are present in a particular object. The features might consist of textures, colors, areas, ratios of length to width, etc. The spatial arrangement of the features is not taken into account. If n different properties are used to characterize an object, each viewed object is represented by a set of n real numbers, and it then becomes possible to represent an object as a point in an n-dimensional space (where R is the resolution of the real numbers used). Such schemes have been investigated (Gibson, 1950, 1979; Selfridge, 1959; Tou and Gonzalez, 1974; Bolles and Cain, 1982; Mundy and Zisserman, 1992; Mel, 1997), but, because the relative positions of the different parts are not implemented in the object recognition scheme, they are not sensitive to spatial jumbling of the features. For example, if the features consisted of nose, mouth, and eyes, such a system would respond to faces with jumbled arrangements of the eyes, nose, and mouth, which matches neither human vision nor the responses of macaque inferior temporal cortex neurons, which are sensitive to the spatial arrangement of the features in a face (Rolls et al., 1994). Similarly, such an object recognition system might not distinguish a normal car from a car with the back wheels removed and placed on the roof. Such systems do not therefore perform shape recognition (where shape implies something about the spatial arrangement of features within an object; see further Ullman, 1996), and something more is needed, and is implemented in the primate visual system. However, I note that the features that are present in objects, e.g., a furry texture, are useful to incorporate in object recognition systems, and the brain may well use, and the model VisNet in principle can use, evidence about which features are present in an object as part of the evidence for identification of a particular object. The features might also consist of, for example, the pattern of movement that is characteristic of a particular object (such as a buzzing fly), which might be used as part of the input to final object identification.

The capacity to use shape in invariant object recognition is fundamental to primate vision, but may not be used or fully implemented in the visual systems of some other animals with less developed visual systems. For example, pigeons may correctly identify pictures containing people, a particular person, trees, pigeons, etc., but may fail to distinguish a figure from a scrambled version of the figure (Herrnstein, 1984; Cerella, 1986). Thus their object recognition may be based more on a collection of parts than on a direct comparison of complete figures in which the relative positions of the parts are important. Even if the details of the conclusions reached from this research are revised (Wasserman et al., 1998), it nevertheless does appear that at least some birds may use computationally simpler methods than those needed for invariant shape recognition. For example, it may be that when some birds are trained to discriminate between images in a large set of pictures, they tend to rely on some chance detail of each picture (such as a spot appearing by mistake on the picture), rather than on recognition of the shapes of the object in the picture (Watanabe et al., 1993).

3.2. STRUCTURAL DESCRIPTIONS AND SYNTACTIC PATTERN RECOGNITION

A second approach to object recognition is to decompose the object or image into parts, and then to produce a structural description of the relations between the parts. The underlying assumption is that it is easier to capture object invariances at a level where parts have been identified. This is the type of scheme for which Marr and Nishihara (1978) and Marr (1982) opted (Rolls, 2011a). The particular scheme (Binford, 1981) they adopted consists of generalized cones, series of which can be linked together to form structural descriptions of some, especially animate, stimuli (see Figure 8).

Such schemes assume that there is a 3D internal model (structural description) of each object. Perception of the object consists of parsing or segmenting the scene into objects, and then into parts, then producing a structural description of the object, and then testing whether this structural description matches that of any known object stored in the system. Other examples of structural description schemes include those of Sutherland (1968), Winston (1975), and Milner (1974). The relations in the structural description may need to be quite complicated, for example, "connected together," "inside of," "larger than," etc.

FIGURE 8 | A 3D structural description of an object based on generalized cone parts. Each box corresponds to a 3D model, with its model axis on the left side of the box and the arrangement of its component axes on the right. In addition, some component axes have 3D models associated with them, as indicated by the way the boxes overlap. (After Marr and Nishihara, 1978.)
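The insensitivity of a pure feature-space scheme to spatial jumbling (Section 3.1) is easy to demonstrate. In this toy sketch (my own illustration), an image is reduced to a bag of named features, and a face and a jumbled face then receive identical descriptions:

```python
from collections import Counter

# A toy "image": a grid of named features at particular positions
face = [["eye", "eye"],
        ["nose", None],
        ["mouth", None]]

# The same features in a jumbled spatial arrangement
jumbled = [["mouth", None],
           ["eye", "nose"],
           ["eye", None]]

def feature_space(image):
    # Pure feature list: which features are present; spatial arrangement discarded
    return Counter(f for row in image for f in row if f is not None)

# The feature-space descriptions are identical, so a feature-space classifier
# must respond equally to the face and the jumbled face, unlike IT/STS neurons,
# which are sensitive to the spatial configuration of the features.
print(feature_space(face) == feature_space(jumbled))
```

Any classifier operating only on such a description inherits this limitation, which is why the relative positions of features must be represented somewhere in the system.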
Perhaps the most developed model of this type is the recognition by components (RBC) model of Biederman (1987), implemented in a computational model by Hummel and Biederman (1992). His small set (fewer than 50) of primitive parts named "geons" includes simple 3D shapes such as boxes, cylinders, and wedges. Objects are described by a syntactically linked list of the relations between each of the geons of which they are composed. Describing a table in this way (as a flat top supported by three or four legs) seems quite economical. Other schemes use 2D surface patches as their primitives (Dane and Bajcsy, 1982; Brady et al., 1985; Faugeras and Hebert, 1986; Faugeras, 1993). When 3D objects are being recognized, the implication is that the structural description is a 3D description. This is in contrast to feature hierarchical systems, in which recognition of a 3D object from any view might be accomplished by storing a set of associated 2D views (see below, Section 3.6).

There are a number of difficulties with schemes based on structural descriptions, some general, and some with particular reference to the potential difficulty of their implementation in the brain.

First, it is not always easy to decompose the object into its separate parts, which must be performed before the structural description can be produced. For example, it may be difficult to produce a structural description of a cat curled up asleep from separately identifiable parts. Identification of each of the parts is also frequently very difficult when 3D objects are seen from different viewing angles, as key parts may be invisible or highly distorted. This is particularly likely to be difficult in 3D shape perception. It appears that being committed to producing a correct description of the parts before other processes can operate is making too strong a commitment early on in the recognition process.

A second difficulty is that many objects or animals that can be correctly recognized have rather similar structural descriptions. For example, the structural description of many four-legged animals is rather similar. Rather more than a structural description seems necessary to identify many objects and animals.

A third difficulty, which applies especially to biological systems, is the difficulty of implementing the syntax needed to hold the structural description as a 3D model of the object, of producing a syntactic structural description on the fly (in real time, and with potentially great flexibility of the possible arrangement of the parts), and of matching the syntactic description of the object in the image to all the stored representations in order to find a match. An example of a structural description for a limb might be body > thigh > shin > foot > toes. In this description ">" means "is linked to," and this link must be between the correct pair of descriptors. If we had just a set of parts, without the syntactic or relational linking, then there would be no way of knowing whether the toes are attached to the foot or to the body. In fact, worse than this, there would be no evidence about what was related to what, just a set of parts. Such syntactical relations are difficult to implement in any biologically plausible neuronal networks used in vision, because if the representations of all the features or parts just mentioned were active simultaneously, how would the spatial relations between the features also be encoded? (How would it be apparent just from the firing of neurons that the toes were linked to the rest of the foot but not to the body?) It would be extremely difficult to implement this "on the fly" syntactic binding in a biologically plausible network (though cf. Hummel and Biederman, 1992), and the only suggested mechanism for flexible syntactic binding, temporal synchronization of the firing of different neurons, is not well supported as a quantitatively important mechanism for information encoding in the ventral visual system, and would have major difficulties in implementing correct, relational, syntactic binding (Section 5.4.1; Rolls, 2008b; Rolls and Treves, 2011).

A fourth difficulty of the structural description approach is that segmentation into objects must occur effectively before object recognition, so that the linked structural description list can be of one object. Given the difficulty of segmenting objects in typical natural cluttered scenes (Ullman, 1996), and the compounding problem of overlap of parts of objects by other objects, segmentation as a first necessary stage of object recognition adds another major difficulty for structural description approaches.

A fifth difficulty is that metric information, such as the relative size of the parts that are linked syntactically, needs to be specified in the structural description (Stankiewicz and Hummel, 1994), which complicates the parts that have to be syntactically linked.
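The syntactic-binding point can be made concrete with the limb example. In this hypothetical sketch (my own illustration, not an implementation from the literature), the structural description is a set of explicit "is linked to" relations; discard the relations, and only an uninformative bag of parts remains:

```python
# Structural description of a limb: parts plus explicit "is linked to" relations
parts = {"body", "thigh", "shin", "foot", "toes"}
links = [("body", "thigh"), ("thigh", "shin"), ("shin", "foot"), ("foot", "toes")]

def linked(a, b, link_list):
    # The relational information: which pairs of parts are joined
    return (a, b) in link_list or (b, a) in link_list

# With the links, correct and incorrect attachments are distinguishable
print(linked("toes", "foot", links))   # correct attachment
print(linked("toes", "body", links))   # incorrect attachment

# A scrambled limb, with the toes attached to the body
scrambled_links = [("body", "toes"), ("toes", "thigh"),
                   ("thigh", "shin"), ("shin", "foot")]

# Without the links, both limbs reduce to exactly the same set of parts,
# so the relational structure is simply not represented
assert {p for link in scrambled_links for p in link} == parts
```

The biological difficulty described above is that simultaneous firing of the neurons for "toes," "foot," and "body" carries only the bag-of-parts information in the last line; some additional mechanism would be needed to encode, on the fly, which part is linked to which.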
It is because of these difficulties that, even in artificial vision systems implemented on computers, where almost unlimited syntactic binding can easily be implemented, the structural description approach to object recognition has not yet succeeded in producing a scheme which actually works in more than an environment in which the types of objects are limited, and the world is far from the natural world, consisting, for example, of 2D scenes (Mundy and Zisserman, 1992).

Although object recognition in the brain is unlikely to be based on the structural description approach, for the reasons given above, and given that the evidence described in this paper supports a feature hierarchy rather than a structural description implementation in the brain, it is certainly the case that humans can provide verbal, syntactic descriptions of objects in terms of the relations of their parts, and this is often a useful type of description. Humans may therefore, it is suggested, supplement a feature hierarchical object recognition system built into their ventral visual system with the additional ability to use the type of syntax that is necessary for language to provide another level of description of objects. This ability is useful in, for example, engineering applications.

3.3. TEMPLATE MATCHING AND THE ALIGNMENT APPROACH

Another approach is template matching, comparing the image on the retina with a stored image or picture of an object. This is conceptually simple, but there are in practice major problems. One major problem is how to align the image on the retina with the stored images, so that all possible images on the retina can be compared with the stored template or templates of each object. The basic idea of the alignment approach (Ullman, 1996) is to compensate for the transformations separating the viewed object and the corresponding stored model, and then compare them. For example, the image and the stored model may be similar, except for a difference in size. Scaling one of them will remove this discrepancy and improve the match between them. For a 2D world, the possible transforms are translation (shift), scaling, and rotation. Given, for example, an input letter of the alphabet to recognize, the system might, after segmentation (itself a very difficult process if performed independently of, that is prior to, object recognition), compensate for translation by computing the center of mass of the object and shifting the character to a "canonical location." Scale might be compensated for by calculating the convex hull (the smallest envelope surrounding the object), and then scaling the image. Of course, how the shift and scaling would be accomplished is itself a difficult point: easy to perform on a computer using matrix multiplication as in simple computer graphics, but not the sort of computation that could be performed easily or accurately by any biologically plausible network. Compensating for rotation is even more difficult (Ullman, 1996). All this has to happen before the segmented canonical representation of the object is compared to the stored object templates with the same canonical representation. The system of course becomes vastly more complicated when the recognition must be performed of 3D objects seen in a 3D world, for now the particular view of an object after segmentation must be placed into a canonical form, regardless of which view, or how much of any view, may be seen in a natural scene with occluding contours. However, this process is helped, at least in computers that can perform high-precision matrix multiplication, by the fact that (for many continuous transforms such as 3D rotation, translation, and scaling) all the possible views of an object transforming in 3D space can be expressed as a linear combination of other views of the same object (see Chapter 5 of Ullman, 1996; Koenderink and van Doorn, 1991; Koenderink, 1990).

This alignment approach is the main theme of the book by Ullman (1996), and there are a number of computer implementations (Lowe, 1985; Grimson, 1990; Huttenlocher and Ullman, 1990; Shashua, 1995). However, as noted above, it seems unlikely that the brain is able to perform the high-precision calculations needed to perform the transforms required to align any view of a 3D object with some canonical template representation. For this reason, and because the approach also relies on segmentation of the object in the scene before the template alignment algorithms can start, and because key features may need to be correctly identified to be used in the alignment (Edelman, 1999), this approach is not considered further here.

We may note here in passing that some animals with a less computationally developed visual system appear to attempt to solve the alignment problem by actively moving their heads or eyes to see what template fits, rather than starting with an image on the eye and attempting to transform it into canonical coordinates. This "active vision" approach, used, for example, by some invertebrates, has been described by Land (1999) and Land and Collett (1997).

3.4. SOME FURTHER MACHINE LEARNING APPROACHES

Learning the transformations and invariances of the signal is another approach to invariant object recognition at the interface of machine learning and theoretical neuroscience. For example, rather than focusing on the templates, "map-seeking circuit theory" focuses on the transforms (Arathorn, 2002, 2005). The theory provides a general computational mechanism for discovery of correspondences in massive transformation spaces by exploiting an ordering property of superpositions. The latter allows a set of transformations of an input image to be formed into a sequence of superpositions, which are then "culled" to a composition of single mappings by a competitive process which matches each superposition against a superposition of inverse transformations of memory patterns. Earlier work considered how to minimize the variance in the output when the image is transformed (Leen, 1995). Another approach is to add transformation invariance to mixture models, by approximating the non-linear transformation manifold by a discrete set of points (Frey and Jojic, 2003). They showed how the expectation maximization algorithm can be used to jointly learn clusters while at the same time inferring the transformation associated with each input.
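The "canonical location" step of the alignment approach (Section 3.3) can be sketched for the simplest case, translation: shift a binary image so that its center of mass lies at the image center before comparing it with a stored template. This is my own minimal illustration of that one step; a real alignment system must also handle scale, rotation, and 3D view:

```python
import numpy as np

def to_canonical_location(img):
    # Shift a binary image so that its center of mass lies at the image center
    ys, xs = np.nonzero(img)
    center_of_mass = np.array([ys.mean(), xs.mean()])
    image_center = (np.array(img.shape) - 1) / 2.0
    shift = np.round(image_center - center_of_mass).astype(int)
    return np.roll(img, tuple(shift), axis=(0, 1))

# An "L"-shaped object and a translated copy of it
img = np.zeros((9, 9), dtype=int)
img[2:5, 2] = 1
img[4, 2:4] = 1
shifted = np.roll(img, (2, 3), axis=(0, 1))

# After canonicalization, the two images coincide, so a stored template
# can be compared by direct matching
a = to_canonical_location(img)
b = to_canonical_location(shifted)
print(np.array_equal(a, b))
```

Even this trivial step depends on prior segmentation (the center of mass is meaningless if other objects contribute pixels), which is one of the reasons given above for doubting that the brain uses this approach.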
In another approach, an unsupervised algorithm for learning Lie group operators for in-plane transforms from input data was described (Rao and Ruderman, 1999).

Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 10

3.5. NETWORKS THAT CAN RECONSTRUCT THEIR INPUTS
Hinton et al. (1995) and Hinton and Ghahramani (1997) have argued that cortical computation is invertible, so that, for example, the forward transform of visual information from V1 to higher areas loses no information, and there can be a backward transform from the higher areas to V1. A comparison of the reconstructed representation in V1 with the actual image from the world might in principle be used to correct all the synaptic weights between the two (in both the forward and the reverse directions), in such a way that there are no errors in the transform (Hinton, 2010). This suggested reconstruction scheme would seem to involve non-local synaptic weight correction (though see Hinton and Sejnowski, 1986, and O'Reilly and Munakata, 2000, for a suggested, although still biologically implausible, neural implementation, contrastive Hebbian learning), or other biologically implausible operations. The scheme also does not seem to provide an account for why or how the responses of inferior temporal cortex neurons become the way they are (providing information about which object is seen relatively independently of position on the retina, size, or view). The whole forward transform performed in the brain seems to lose much of the information about the size, position, and view of the object, as it is evidence about which object is present invariant of its size, view, etc. that is useful to the stages of processing about objects that follow (Rolls, 2008b). Because of these difficulties, and because the backprojections are needed for processes such as recall (Rolls, 2008b), this approach is not considered further here.

In the context of recall, if the visual system were to perform a reconstruction in V1 of a visual scene from what is represented in the inferior temporal visual cortex, then it might be supposed that remembered visual scenes might be as information-rich (and subjectively as full of rich detail) as seeing the real thing. This is not the case for most humans, and indeed this point suggests that at least what reaches consciousness from the inferior temporal visual cortex (which is activated during the recall of visual memories) is the identity of the object (as made explicit in the firing rate of the neurons), and not the low-level details of the exact place, size, and view of the object in the recalled scene, even though, according to the reconstruction argument, that information should be present in the inferior temporal visual cortex.

3.6. FEATURE HIERARCHIES AND 2D VIEW-BASED OBJECT RECOGNITION
Another approach, and one that is much closer to what appears to be present in the primate ventral visual system (Wurtz and Kandel, 2000a; Rolls and Deco, 2002; Rolls, 2008b), is a feature hierarchy system (see Figure 9).

FIGURE 9 | The feature hierarchy approach to object recognition. The inputs may be neurons tuned to oriented straight line segments. In early intermediate layers, neurons respond to a combination of these inputs in the correct spatial position with respect to each other. In further intermediate layers, of which there may be several, neurons respond with some invariance to the feature combinations represented earlier, and form higher order feature combinations. Finally, in the top layer, neurons respond to combinations of what is represented in the preceding intermediate layer, and thus provide evidence about objects in a position (and scale and even view) invariant way. Convergence through the network is designed to provide top layer neurons with information from across the entire input retina, as part of the solution to translation invariance, and other types of invariance are treated similarly.

In this approach, the system starts with some low-level description of the visual scene, in terms, for example, of oriented straight line segments of the type that are represented in the responses of primary visual cortex (V1) neurons, and then builds in repeated hierarchical layers features based on what is represented in previous layers. A feature may thus be defined as a combination of what is represented in the previous layer. For example, after V1, features might consist of combinations of straight lines, which might represent longer curved lines (Zucker et al., 1989), or terminated lines (in fact represented in V1 as end-stopped cells), corners, "T" junctions which are characteristic of obscuring edges, and (at least in humans) the arrow and "Y" vertices which are characteristic properties of man-made environments. Evidence that such feature combination neurons are present in V2 is that some neurons respond to combinations of line elements that join at different angles (Hegde and Van Essen, 2000, 2003, 2007; Ito and Komatsu, 2004; Anzai et al., 2007). (An example of this might be a neuron responding to a "V" shape at a particular orientation.) As one ascends the hierarchy, neurons might respond to more complex trigger features. For example, two parts of a complex figure may need to be in the correct spatial arrangement with respect to each other, as shown by Tanaka (1996) for V4 and posterior inferior temporal cortex neurons. In another example, V4 neurons may respond to the curvature of the elements of a stimulus (Carlson et al., 2011). Further on, neurons might respond to combinations of several such intermediate-level feature combination neurons, and thus come to respond systematically differently to different objects, and thus to convey information about which object is present. This approach received neurophysiological support early on from the results of Hubel and Wiesel (1962, 1968) in the cat and monkey, and many of the data described in Chapter 5 of Rolls and Deco (2002) are consistent with this scheme.

A number of problems need to be solved for such feature hierarchy visual systems to provide a useful model of object recognition in the primate visual system.
First, some way needs to be found to keep the number of feature combination neurons realistic at each stage, without undergoing a combinatorial explosion. If a separate feature combination neuron was needed to code for every possible combination of n types of feature, each with a resolution of 2 levels (binary encoding), in the preceding stage, then 2^n neurons would be needed. The suggestion that is made in Section 4 is that by forming neurons that respond to low-order combinations of features (neurons that respond to just say 2–4 features from the preceding stage), the number of actual feature analyzing neurons can be kept within reasonable numbers. By reasonable we mean the number of neurons actually found at any one stage of the visual system, which for V4 might be in the order of 60 × 10^6 neurons (assuming a volume for macaque V4 of approximately 2,000 mm^3, and a cell density of 20,000–40,000 neurons per mm^3; Rolls, 2008b). This is certainly a large number; but the fact that a large number of neurons is present at each stage of the primate visual system is in fact consistent with the hypothesis that feature combination neurons are part of the way in which the brain solves object recognition. A factor which also helps to keep the number of neurons under control is the statistics of the visual world, which contain great redundancies. The world is not random, and indeed the statistics of natural images are such that many regularities are present (Field, 1994), and not every possible combination of pixels on the retina needs to be separately encoded. A third factor which helps to keep the number of connections required onto each neuron under control is that in a multilayer hierarchy each neuron can be set up to receive connections from only a small region of the preceding layer. Thus an individual neuron does not need to have connections from all the neurons in the preceding layer. Over multiple layers, the required convergence can be produced so that the same neurons in the top layer can be activated by an image of an effective object anywhere on the retina (see Figure 1).
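The scale of the saving from coding only low-order combinations, rather than every possible combination, can be illustrated with a toy count (the value n = 16 below is an arbitrary assumption for illustration, not a figure from the text):

```python
from math import comb

n = 16  # hypothetical number of binary feature types in the preceding stage

# One neuron per arbitrary combination of n binary features:
exhaustive = 2 ** n

# One neuron per low-order combination of just 2-4 features,
# as suggested in the text:
low_order = sum(comb(n, k) for k in range(2, 5))

print(exhaustive, low_order)  # 65536 2500
```

Even for this small n, exhaustive coding needs over 26 times as many neurons as low-order combination coding, and the gap widens exponentially as n grows.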
A second problem of feature hierarchy approaches is how to map all the different possible images of an individual object through to the same set of neurons in the top layer by modifying the synaptic connections (see Figure 1). The solution discussed in Sections 4, 5.1.1, and 5.3 is the use of a synaptic modification rule with a short-term memory trace of the previous activity of the neuron, to enable it to learn to respond to the now transformed version of what was seen very recently, which, given the statistics of looking at the visual world, will probably be an input from the same object.
A third problem of feature hierarchy approaches is how they can learn in just a few seconds of inspection of an object to recognize it in different transforms, for example, in different positions on the retina in which it may never have been presented during training. A solution to this problem is provided in Section 5.4, in which it is shown that this can be a natural property of feature hierarchy object recognition systems, if they are trained first for all locations on the intermediate-level feature combinations of which new objects will simply be a new combination, therefore requiring learning only in the upper layers of the hierarchy.

A fourth potential problem of feature hierarchy systems is that when solving translation invariance they need to respond to the same local spatial arrangement of features (which are needed to specify the object), but to ignore the global position of the whole object. It is shown in Section 5.4 that feature hierarchy systems can solve this problem by forming feature combination neurons at an early stage of processing (e.g., V1 or V2 in the brain) that respond with high spatial precision to the local arrangement of features. Such neurons would respond differently, for example, to L, C, and T if they receive inputs from two line-responding neurons. It is shown in Section 5.4 that at later layers of the hierarchy, where some of the intermediate-level feature combination neurons are starting to show translation invariance, correct object recognition may still occur because only one object contains just those sets of intermediate-level neurons in which the spatial representation of the features is inherent in the encoding.
The type of representation developed in a hierarchical object recognition system, in the brain, and by VisNet as described in the rest of this paper would be suitable for recognition of an object, and for linking associative memories to objects, but would be less good for making actions in 3D space to particular parts of, or inside, objects, as the 3D coordinates of each part of the object would not be explicitly available. It is therefore proposed that visual fixation is used to locate in foveal vision part of an object to which movements must be made, and that local disparity and other measurements of depth (made explicit in the dorsal visual system) then provide sufficient information for the motor system to make actions relative to the small part of space in which a local, view-dependent, representation of depth would be provided (cf. Ballard, 1990).

One advantage of feature hierarchy systems is that they can operate fast (Rolls, 2008b).

A second advantage is that the feature analyzers can be built out of the rather simple competitive networks (Rolls, 2008b) which use a local learning rule, and have no external teacher, so that they are rather biologically plausible. Another advantage is that, once trained on subset features common to most objects, the system can then learn new objects quickly.

A related third advantage is that, if implemented with competitive nets as in the case of VisNet (see Section 5), then neurons are allocated by self-organization to represent just the features present in the natural statistics of real images (cf. Field, 1994), and not every possible feature that could be constructed by random combinations of pixels on the retina.

A related fourth advantage of feature hierarchy networks is that because they can utilize competitive networks, they can still produce the best guess at what is in the image under non-ideal conditions, when only parts of objects are visible because, for example, of occlusion by other objects, etc. The reasons for this are that competitive networks assess the evidence for the presence of certain "features" to which they are tuned using a dot product operation on their inputs, so that they are inherently tolerant of missing input evidence; and reach a state that reflects the best hypothesis or hypotheses (with soft competition) given the whole set of inputs, because there are competitive interactions between the different neurons (Rolls, 2008b).

A fifth advantage of a feature hierarchy system is that, as shown in Section 5.5, the system does not need to perform segmentation into objects as part of pre-processing, nor does it need to be able to identify parts of an object, and it can also operate in cluttered scenes in which the object may be partially obscured. The reason for this is that once trained on objects, the system then operates somewhat like an associative memory, mapping the image properties forward onto whatever it has learned about before, and then by competition selecting just the most likely output to be activated. Indeed, the feature hierarchy approach provides a mechanism by which processing at the object recognition level could feed back using backprojections to early cortical areas to provide top-down guidance to assist segmentation. Although backprojections are not built into VisNet2 (Rolls and Milward, 2000), they have been added when attentional top-down processing must be incorporated (Deco and Rolls, 2004), are present in the brain, and are incorporated into the models described elsewhere (Rolls, 2008b). Although the operation of the ventral visual system can proceed as a feed-forward hierarchy, as shown by backward masking experiments (Rolls and Tovee, 1994; Rolls et al., 1999; Rolls, 2003, 2006), top-down influences can of course be implemented by the backprojections, and may be useful in further shaping the activity of neurons at lower levels in the hierarchy based on the neurons firing at a higher level, as a result of dynamical interactions of neurons at different layers of the hierarchy (Rolls, 2008b; Jiang et al., 2011).

A sixth advantage of feature hierarchy systems is that they can naturally utilize features in the images of objects which are not strictly part of a shape description scheme, such as the fact that different objects have different textures, colors, etc. Feature hierarchy systems, because they utilize whatever is represented at earlier stages in forming feature combination neurons at the next stage, naturally incorporate such "feature list" evidence into their analysis, and have the advantages of that approach (see Section 3.1 and also Mel, 1997). Indeed, the feature space approach can utilize a hybrid representation, some of whose dimensions may be discrete and defined in structural terms, while other dimensions may be continuous and defined in terms of metric details, and others may be concerned with non-shape properties such as texture and color (cf. Edelman, 1999).

A seventh advantage of feature hierarchy systems is that they do not need to utilize "on the fly" or run-time arbitrary binding of features. Instead, the spatial syntax is effectively hard-wired into the system when it is trained, in that the feature combination neurons have learned to respond to their set of features when they are in a given spatial arrangement on the retina.

An eighth advantage of feature hierarchy systems is that they can self-organize (given the right functional architecture, trace synaptic learning rule, and the temporal statistics of the normal visual input from the world), with no need for an external teacher to specify that the neurons must learn to respond to objects. The correct, object, representation self-organizes itself given rather economically specified genetic rules for building the network (cf. Rolls and Stringer, 2000).

Ninth, it is also noted that hierarchical visual systems may recognize 3D objects based on a limited set of 2D views of objects, and that the same architectural rules just stated and implemented in VisNet will correctly associate together the different views of an object. It is part of the concept (see below), and consistent with neurophysiological data (Tanaka, 1996), that the neurons in the upper layers will generalize correctly within a view (see Section 5.6).

After the immediately following description of early models of a feature hierarchy approach implemented in the Cognitron and Neocognitron, we turn for the remainder of this paper to analyses of how a feature hierarchy approach to invariant visual object recognition might be implemented in the brain, and how key computational issues could be solved by such a system. The analyses are developed and tested with a model, VisNet, which will shortly be described. Much of the data we have on the operation of the high-order visual cortical areas (Section 2; Rolls and Deco, 2002; Anzai et al., 2007; Rolls, 2008b) suggest that they implement a feature hierarchy approach to visual object recognition, as is made evident in the remainder of this paper.

3.6.1. The cognitron and neocognitron
An early computational model of a hierarchical feature-based approach to object recognition, joining other early discussions of this approach (Selfridge, 1959; Sutherland, 1968; Barlow, 1972; Milner, 1974), was proposed by Fukushima (1975, 1980, 1989, 1991). His model used two types of cell within each layer to approach the problem of invariant representations. In each layer, a set of "simple cells," with defined position, orientation, etc. sensitivity for the stimuli to which they responded, was followed by a set of "complex cells," which generalized a little over position, orientation, etc. This simple cell – complex cell pairing within each layer provided some invariance. When a neuron in the network, using competitive learning with its stimulus set (typically letters on a 16 × 16 pixel array), learned that a particular feature combination had occurred, that type of feature analyzer was replicated in a non-local manner throughout the layer, to provide further translation invariance. Invariant representations were thus learned in a different way from VisNet. Up to eight layers were used. The network could learn to differentiate letters, even with some translation, scaling, or distortion. Although internally it is organized and learns very differently from VisNet, it is an independent example of the fact that useful invariant pattern recognition can be performed by multilayer hierarchical networks. A major biological implausibility of the system is that once one neuron within a layer had learned, other similar neurons were set up throughout the layer by a non-local process. A second biological limitation was that no learning rule or self-organizing process was specified as to how the complex cells can provide translation-invariant representations of simple cell responses – this was simply hand-wired. Solutions to both these issues are provided by VisNet.
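The simple cell – complex cell pairing described above can be sketched in one dimension. This is a minimal illustrative version, not Fukushima's actual architecture: the template, the inputs, and the use of max-pooling are assumptions made here for clarity.

```python
import numpy as np

def simple_cells(signal, template):
    """'S cells': match a feature template at every position, so each
    response has a defined position sensitivity."""
    n = len(signal) - len(template) + 1
    return np.array([float(signal[i:i + len(template)] @ template) for i in range(n)])

def complex_cells(s, pool=3):
    """'C cells': pool (here with max) over neighboring S cells, so the
    response generalizes a little over position."""
    return np.array([s[i:i + pool].max() for i in range(0, len(s) - pool + 1, pool)])

template = np.array([1.0, -1.0])            # a hypothetical edge-like feature
img_a = np.array([0.0, 1, -1, 0, 0, 0])     # feature at one position
img_b = np.array([0.0, 0, 1, -1, 0, 0])     # same feature shifted by one sample

ca = complex_cells(simple_cells(img_a, template))
cb = complex_cells(simple_cells(img_b, template))
# ca and cb are equal: the C-cell response tolerates the one-sample shift
```

Stacking several such S/C pairs, as Fukushima did, compounds the small per-layer position tolerance into substantial translation invariance at the top.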
4. HYPOTHESES ABOUT THE COMPUTATIONAL MECHANISMS IN THE VISUAL CORTEX FOR OBJECT RECOGNITION
The neurophysiological findings described in Section 2, and wider considerations on the possible computational properties of the cerebral cortex (Rolls, 1992, 2000, 2008b; Rolls and Treves, 1998; Rolls and Deco, 2002), lead to the following outline working hypotheses on object recognition by visual cortical mechanisms (see Rolls, 1992). The principles underlying the processing of faces and other objects may be similar, but more neurons may become allocated to represent different aspects of faces because of the need to recognize the faces of many different individuals, that is, to identify many individuals within the category faces.

Cortical visual processing for object recognition is considered to be organized as a set of hierarchically connected cortical regions consisting at least of V1, V2, V4, posterior inferior temporal cortex (TEO), inferior temporal cortex (e.g., TE3, TEa, and TEm), and anterior temporal cortical areas (e.g., TE2 and TE1). (This stream of processing has many connections with a set of cortical areas in the anterior part of the superior temporal sulcus, including area TPO.) There is convergence from each small part of a region to the succeeding region (or layer in the hierarchy) in such a way that the receptive field sizes of neurons (e.g., 1° near the fovea in V1) become larger by a factor of approximately 2.5 with each succeeding stage (and the typical parafoveal receptive field sizes found would not be inconsistent with the calculated approximations of, e.g., 8° in V4, 20° in TEO, and 50° in the inferior temporal cortex; Boussaoud et al., 1991; see Figure 1). Such zones of convergence would overlap continuously with each other (see Figure 1). This connectivity would be part of the architecture by which translation-invariant representations are computed.
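The convergence factor can be sketched numerically. The 1.3° starting value below is an assumption chosen only so that repeated multiplication by 2.5 reproduces the approximate parafoveal sizes quoted above; it is not a figure from the text.

```python
# Receptive field diameter growing by a factor of ~2.5 at each stage of the
# ventral-stream hierarchy, starting from an assumed ~1.3 degrees in V1.
stages = ["V1", "V2", "V4", "TEO", "IT"]
size = 1.3
rf = {}
for stage in stages:
    rf[stage] = size
    size *= 2.5

for stage in stages:
    print(f"{stage}: ~{rf[stage]:.0f} deg")
```

Four doublings-and-a-half of this kind are enough to take a top-layer neuron's receptive field from a degree or so in V1 to roughly the 50° coverage reported for inferior temporal cortex, which is the sense in which the convergent connectivity supports translation invariance.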
Each layer is considered to act partly as a set of local self-organizing competitive neuronal networks with overlapping inputs. (The region within which competition would be implemented would depend on the spatial properties of inhibitory interneurons, and might operate over distances of 1–2 mm in the cortex.) These competitive nets operate by a single set of forward inputs leading to (typically non-linear, e.g., sigmoid) activation of output neurons; of competition between the output neurons mediated by a set of feedback inhibitory interneurons which receive from many of the principal (in the cortex, pyramidal) cells in the net and project back (via inhibitory interneurons) to many of the principal cells, and serve to decrease the firing rates of the less active neurons relative to the rates of the more active neurons; and then of synaptic modification by a modified Hebb rule, such that synapses to strongly activated output neurons from active input axons strengthen, and from inactive input axons weaken (Rolls, 2008b). A biologically plausible form of this learning rule that operates well in such networks is

δw_ij = α y_i (x_j − w_ij)    (2)

where δw_ij is the change of the synaptic weight w_ij, α is a learning rate constant, y_i is the firing rate of the ith postsynaptic neuron, and x_j and w_ij are in appropriate units (Rolls, 2008b). Such competitive networks operate to detect correlations between the activity of the input neurons, and to allocate output neurons to respond to each cluster of such correlated inputs. These networks thus act as categorizers. In relation to visual information processing, they would remove redundancy from the input representation, and would develop low-entropy representations of the information (cf. Barlow, 1985; Barlow et al., 1989). Such competitive nets are biologically plausible, in that they utilize Hebb-modifiable forward excitatory connections, with competitive inhibition mediated by cortical inhibitory neurons. The competitive scheme I suggest would not result in the formation of "winner-take-all" or "grandmother" cells, but would instead result in a small ensemble of active neurons representing each input (Rolls and Treves, 1998; Rolls, 2008b). The scheme has the advantages that the output neurons learn better to distribute themselves between the input patterns (cf. Bennett, 1990), and that the sparse representations formed have utility in maximizing the number of memories that can be stored when, toward the end of the visual system, the visual representation of objects is interfaced to associative memory (Rolls and Treves, 1998; Rolls, 2008b).
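A minimal sketch of one learning step under Eq. (2) follows. The hard winner-take-all used here is a deliberate simplification of the soft competition via feedback inhibition described above, and the layer sizes and toy patterns are assumptions for illustration.

```python
import numpy as np

def competitive_step(x, W, alpha=0.1):
    """One learning step of a competitive network, using the rule of Eq. (2):
    dw_ij = alpha * y_i * (x_j - w_ij).
    Competition is simplified here to hard winner-take-all; the soft
    competition via inhibitory interneurons is a graded version of this.
    """
    y = np.zeros(W.shape[0])
    y[np.argmax(W @ x)] = 1.0                    # inhibition leaves the most activated neuron firing
    W += alpha * y[:, None] * (x[None, :] - W)   # winner's weight vector moves toward the input
    return y

# Toy demo: two orthogonal input patterns standing in for correlated input clusters
rng = np.random.default_rng(0)
W = rng.random((4, 8))                           # 4 output neurons, 8 input axons
c1 = np.array([1.0, 1, 1, 1, 0, 0, 0, 0])
c2 = np.array([0.0, 0, 0, 0, 1, 1, 1, 1])
for _ in range(50):
    competitive_step(c1, W)
    competitive_step(c2, W)
```

Because each update pulls the winning neuron's weight vector toward the current input, output neurons gravitate toward the input correlations that are actually present, which is the sense in which these networks act as categorizers.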
Translation invariance would be computed in such a system by utilizing competitive learning to detect regularities in inputs when real objects are translated in the physical world. The hypothesis is that because objects have continuous properties in space and time in the world, an object at one place on the retina might activate feature analyzers at the next stage of cortical processing, and when the object was translated to a nearby position, because this would occur in a short period (e.g., 0.5 s), the membrane of the post-synaptic neuron would still be in its "Hebb-modifiable" state (caused, for example, by calcium entry as a result of the voltage-dependent activation of NMDA receptors), and the presynaptic afferents activated with the object in its new position would thus become strengthened on the still-activated post-synaptic neuron. It is suggested that the short temporal window (e.g., 0.5 s) of Hebb-modifiability helps neurons to learn the statistics of objects moving in the physical world, and at the same time to form different representations of different feature combinations or objects, as these are physically discontinuous and present less regular correlations to the visual system. Földiák (1991) has proposed computing an average activation of the post-synaptic neuron to assist with the same problem. One idea here is that the temporal properties of the biologically implemented learning mechanism are such that it is well suited to detecting the relevant continuities in the world of real objects. Another suggestion is that a memory trace for what has been seen in the last 300 ms appears to be implemented by a mechanism as simple as continued firing of inferior temporal neurons after the stimulus has disappeared, as has been found in masking experiments (Rolls and Tovee, 1994; Rolls et al., 1994, 1999; Rolls, 2003).

I also suggested (Rolls, 1992) that other invariances, for example, size, spatial-frequency, and rotation invariance, could be learned by a comparable process. (Early processing in V1 which enables different neurons to represent inputs at different spatial scales would allow combinations of the outputs of such neurons to be formed at later stages. Scale invariance would then result from detecting at a later stage which neurons are almost conjunctively active as the size of an object alters.) It is suggested that this process takes place at each stage of the multiple-layer cortical processing hierarchy, so that invariances are learned first over small regions of space, and then over successively larger regions. This limits the size of the connection space within which correlations must be sought.

Increasing complexity of representations could also be built in such a multiple-layer hierarchy by similar mechanisms. At each stage or layer the self-organizing competitive nets would result in combinations of inputs becoming the effective stimuli for neurons. In order to avoid the combinatorial explosion, it is proposed, following Feldman (1985), that low-order combinations of inputs would be what is learned by each neuron. (Each input would not be represented by activity in a single input axon, but instead by activity in a set of active input axons.) Evidence consistent with this suggestion that neurons are responding to combinations of a few variables represented at the preceding stage of cortical processing is that some neurons in V1 respond to combinations of bars or edges (Shevelev et al., 1995; Sillito et al., 1995); V2 and V4 respond to end-stopped lines, to angles formed by a combination of lines, to tongues flanked by inhibitory subregions, or to combinations of colors (Hegde and Van Essen, 2000, 2003, 2007; Ito and Komatsu, 2004; Anzai et al., 2007; Orban, 2011); in posterior inferior temporal cortex to stimuli which may require two or more simple features to be present (Tanaka et al., 1990); and in the temporal cortical face processing areas to images that require the presence of several features in a face (such as eyes, hair, and mouth) in order to respond (Perrett et al., 1982; Yamane et al., 1988; Rolls, 2011b; see Figure 6).
(Precursor cells to face-responsive neurons might, it is suggested, respond to combinations of the outputs of the neurons in V1 that are activated by faces, and might be found in areas such as V4.) It is an important part of this suggestion that some local spatial information would be inherent in the features which were being combined. For example, cells might not respond to the combination of an edge and a small circle unless they were in the correct spatial relation to each other. (This is in fact consistent with the data of Tanaka et al. (1990), and with our data on face neurons, in that some face neurons require the face features to be in the correct spatial configuration, and not jumbled; Rolls et al., 1994.)
The local spatial information in the features being combined would ensure that the representation at the next level would contain some information about the (local) arrangement of features. Further low-order combinations of such neurons at the next stage would include sufficient local spatial information so that an arbitrary spatial arrangement of the same features would not activate the same neuron, and this is the proposed, and limited, solution which this mechanism would provide for the feature binding problem (Elliffe et al., 2002; cf. von der Malsburg, 1990). By this stage of processing, a view-dependent representation of objects suitable for view-dependent processes such as behavioral responses to face expression and gesture would be available.

It is suggested that view-independent representations could be formed by the same type of computation, operating to combine a limited set of views of objects.
The plausibility of providing view- region would be crucial in enabling the system to generalize cor- independent recognition of objects by combining a set of different rectly to solve, for example, the invariances. However, such a system views of objects has been proposed by a number of investigators would need many neurons, each with considerable learning capac- (Koenderink and Van Doorn, 1979; Poggio and Edelman, 1990; ity, to solve visual perception in this way. This is fully consistent Logothetis et al., 1994; Ullman, 1996). Consistent with the sug- with the large number of neurons in the visual system, and with gestion that the view-independent representations are formed by the large number of, probably modifiable, synapses on each neu- combining view-dependent representations in the primate visual ron (e.g., 10,000). Further, the fact that many neurons are tuned system, is the fact that in the temporal cortical areas, neurons in different ways to faces is consistent with the fact that in such a with view-independent representations of faces are present in the computational system, many neurons would need to be sensitive same cortical areas as neurons with view-dependent representa- (in different ways) to faces, in order to allow recognition of many tions (from which the view-independent neurons could receive individual faces when all share a number of common properties. inputs; Perrett et al., 1985; Hasselmo et al., 1989b; Booth and Rolls, 1998). This solution to “object-based” representations is very dif- 5. 
THE FEATURE HIERARCHY APPROACH TO INVARIANT ferent from that traditionally proposed for artificial vision systems, OBJECT RECOGNITION: COMPUTATIONAL ISSUES in which the coordinates in 3D space of objects are stored in a data- The feature hierarchy approach to invariant object recognition base, and general-purpose algorithms operate on these to perform was introduced in Section 3.6, and advantages and disadvantages transforms such as translation, rotation, and scale change in 3D of it were discussed. Hypotheses about how object recognition space (e.g., Marr, 1982). In the present, much more limited but could be implemented in the brain which are consistent with more biologically plausible scheme, the representation would be much of the neurophysiology discussed in Section 2 and by Rolls suitable for recognition of an object, and for linking associative and Deco (2002) and Rolls (2008b) were set out in Section 4. memories to objects, but would be less good for making actions These hypotheses effectively incorporate a feature hierarchy sys- in 3D space to particular parts of, or inside, objects, as the 3D tem while encompassing much of the neurophysiological evidence. Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 15 Rolls Invariant visual object recognition In this Section (5), we consider the computational issues that arise general architecture simulated in VisNet, and the way in which in such feature hierarchy systems, and in the brain systems that it allows natural images to be used as stimuli, has been chosen to implement visual object recognition. The issues are considered enable some comparisons of neuronal responses in the network with the help of a particular model, VisNet, which requires precise and in the brain to similar stimuli to be made. specification of the hypotheses, and at the same time enables them to be explored and tested numerically and quantitatively. However, 5.1.1. 
However, I emphasize that the issues to be covered in Section 5 are key and major computational issues for architectures of this feature hierarchical type (Rolls, 2008b), and are very relevant to understanding how invariant object recognition is implemented in the brain.

VisNet is a model of invariant object recognition based on Rolls' (1992) hypotheses. It is a computer simulation that allows hypotheses to be tested and developed about how multilayer hierarchical networks of the type believed to be implemented in the visual cortical pathways operate. The architecture captures a number of aspects of the architecture of the visual cortical pathways, and is described next. The model of course, as with all models, requires precise specification of what is to be implemented, and at the same time involves specified simplifications of the real architecture, as investigations of the fundamental aspects of the information processing being performed are more tractable in a simplified and at the same time quantitatively specified model. First the architecture of the model is described, and this is followed by descriptions of key issues in such multilayer feature hierarchical models, such as the issue of feature binding, the optimal form of training rule for the whole system to self-organize, the operation of the network in natural environments and when objects are partly occluded, how outputs about individual objects can be read out from the network, and the capacity of the system.

5.1. THE ARCHITECTURE OF VisNet

Fundamental elements of Rolls' (1992) theory for how cortical networks might implement invariant object recognition are described in Section 4. They provide the basis for the design of VisNet, and can be summarized as:

• A series of competitive networks, organized in hierarchical layers, exhibiting mutual inhibition over a short range within each layer. These networks allow combinations of features or inputs occurring in a given spatial arrangement to be learned by neurons, ensuring that higher order spatial properties of the input stimuli are represented in the network.
• A convergent series of connections from a localized population of cells in preceding layers to each cell of the following layer, thus allowing the receptive field size of cells to increase through the visual processing areas or layers.
• A modified Hebb-like learning rule incorporating a temporal trace of each cell's previous activity, which, it is suggested, will enable the neurons to learn transform invariances.

The first two elements of Rolls' theory are used to constrain the general architecture of a network model, VisNet, of the processes just described that is intended to learn invariant representations of objects. The simulation results described in this paper using VisNet show that invariant representations can be learned by the architecture. It is moreover shown that successful learning depends crucially on the use of the modified Hebb rule. The general architecture simulated in VisNet, and the way in which it allows natural images to be used as stimuli, has been chosen to enable some comparisons of neuronal responses in the network and in the brain to similar stimuli to be made.

5.1.1. The trace rule

The learning rule implemented in the VisNet simulations utilizes the spatio-temporal constraints placed upon the behavior of "real-world" objects to learn about natural object transformations. By presenting consistent sequences of transforming objects, the cells in the network can learn to respond to the same object through all of its naturally transformed states, as described by Földiák (1991), Rolls (1992), Wallis et al. (1993), and Wallis and Rolls (1997). The learning rule incorporates a decaying trace of previous cell activity and is henceforth referred to simply as the "trace" learning rule. The learning paradigm we describe here is intended in principle to enable learning of any of the transforms tolerated by inferior temporal cortex neurons, including position, size, view, lighting, and spatial-frequency (Rolls, 1992, 2000, 2008b; Rolls and Deco, 2002).

To clarify the reasoning behind this point, consider the situation in which a single neuron is strongly activated by a stimulus forming part of a real-world object. The trace of this neuron's activation will then gradually decay over a time period in the order of 0.5 s. If, during this limited time window, the net is presented with a transformed version of the original stimulus, then not only will the initially active afferent synapses be modified onto the neuron, but so also will the synapses activated by the transformed version of this stimulus. In this way the cell will learn to respond to either appearance of the original stimulus. Making such associations works in practice because it is very likely that within short time periods different aspects of the same object will be being inspected. The cell will not, however, tend to make spurious links across stimuli that are part of different objects, because of the unlikelihood in the real world of one object consistently following another.

Various biological bases for this temporal trace have been advanced as follows. [The precise mechanisms involved may alter the precise form of the trace rule which should be used. Földiák (1992) describes an alternative trace rule which models individual NMDA channels. Equally, a trace implemented by extended cell firing should be reflected in representing the trace as an external firing rate, rather than an internal signal.]

• The persistent firing of neurons for as long as 100–400 ms observed after presentations of stimuli for 16 ms (Rolls and Tovee, 1994) could provide a time window within which to associate subsequent images. Maintained activity may potentially be implemented by recurrent connections between as well as within cortical areas (Rolls and Treves, 1998; Rolls and Deco, 2002; Rolls, 2008b). [The prolonged firing of inferior temporal cortex neurons during memory delay periods of several seconds, and the associative links reported to develop between stimuli presented several seconds apart (Miyashita, 1988), are on too long a time scale to be immediately relevant to the present theory. In fact, associations between visual events occurring several seconds apart would, under normal environmental conditions, be detrimental to the operation of a network of the type described here, because they would probably arise from different objects. In contrast, the system described benefits from associations between visual events which occur close in time (typically within 1 s), as they are likely to be from the same object.]
• The binding period of glutamate in the NMDA channels, which may last for 100 ms or more, may implement a trace rule by producing a narrow time window over which the average activity at each presynaptic site affects learning (Hestrin et al., 1990; Földiák, 1992; Rhodes, 1992; Rolls, 1992; Spruston et al., 1995).
• Chemicals such as nitric oxide may be released during high neural activity and gradually decay in concentration over a short-time window during which learning could be enhanced (Montague et al., 1991; Földiák, 1992; Garthwaite, 2008).

The trace update rule used in the baseline simulations of VisNet (Wallis and Rolls, 1997) is equivalent both to Földiák's, used in the context of translation invariance (Wallis et al., 1993), and to the earlier rule of Sutton and Barto (1981), explored in the context of modeling the temporal properties of classical conditioning, and can be summarized as follows:

\delta w_j = \alpha \, \bar{y}^{\tau} x_j   (3)

where

\bar{y}^{\tau} = (1 - \eta) \, y^{\tau} + \eta \, \bar{y}^{\tau - 1}   (4)

and

x_j: jth input to the neuron.
y^τ: output from the neuron.
ȳ^τ: trace value of the output of the neuron at time step τ.
w_j: synaptic weight between the jth input and the neuron.
α: learning rate; annealed between unity and zero.
η: trace value; the optimal value varies with presentation sequence length.

To bound the growth of each neuron's synaptic weight vector, w_i for the ith neuron, its length is explicitly normalized (a method similarly employed by von der Malsburg (1973) which is commonly used in competitive networks; Rolls, 2008b). An alternative, more biologically relevant implementation, using a local weight bounding operation which utilizes a form of heterosynaptic long-term depression (Rolls, 2008b), has in part been explored using a version of the Oja (1982) rule (see Wallis and Rolls, 1997).

5.1.2. The network implemented in VisNet

The network itself is designed as a series of hierarchical, convergent, competitive networks, in accordance with the hypotheses advanced above. The actual network consists of a series of four layers, constructed such that the convergence of information from the most disparate parts of the network's input layer can potentially influence firing in a single neuron in the final layer – see Figure 1. This corresponds to the scheme described by many researchers (Rolls, 1992, 2008b; Van Essen et al., 1992) as present in the primate visual system – see Figure 1. The forward connections to a cell in one layer are derived from a topologically related and confined region of the preceding layer. The choice of whether a connection between neurons in adjacent layers exists or not is based upon a Gaussian distribution of connection probabilities which rolls off radially from the focal point of connections for each neuron. (A minor extra constraint precludes the repeated connection of any pair of cells.) In particular, the forward connections to a cell in one layer come from a small region of the preceding layer defined by the radius in Table 1, which will contain approximately 67% of the connections from the preceding layer. Table 1 shows the dimensions for VisNetL, the system we are currently using (Perry et al., 2010), which is a 16× larger version of the VisNet used in most of our previous investigations, which utilized 32 × 32 neurons per layer. Figure 1 shows the general convergent network architecture used. Localization and limitation of connectivity in the network is intended to mimic cortical connectivity, partially because of the clear retention of retinal topology through regions of visual cortex. This architecture also encourages the gradual combination of features from layer to layer, which has relevance to the binding problem, as described in Section 5.4.

Modeling topological constraints in connectivity leads to an issue concerning neurons at the edges of the network layers. In principle these neurons may either receive no input from beyond the edge of the preceding layer, or have their connections repeatedly sample neurons at the edge of the previous layer. In practice either solution is liable to introduce artificial weighting on the few active inputs at the edge, and hence cause the edge to have unwanted influence over the development of the network as a whole. In the real brain such edge-effects would be naturally smoothed by the transition of the locus of cellular input from the fovea to the lower acuity periphery of the visual field. However, it poses a problem here because we are in effect only simulating the small high-acuity foveal portion of the visual field in our simulations. As an alternative to the former solutions, Wallis and Rolls (1997) elected to form the connections into a toroid, such that connections wrap back onto the network from opposite sides. This wrapping happens at all four layers of the network, and in the way an image on the "retina" is mapped to the input filters. This solution has the advantage of making all of the boundaries effectively invisible to the network. Further, this procedure does not itself introduce problems into evaluation of the network for the problems set, as many of the critical comparisons in VisNet involve comparisons between a network with the same architecture trained with the trace rule, or with the Hebb rule, or not trained at all. In practice, it is shown below that only the network trained with the trace rule solves the problem of forming invariant representations.

Table 1 | VisNet dimensions.

             Dimensions       # Connections   Radius
Layer 4      128 × 128        100             48
Layer 3      128 × 128        100             36
Layer 2      128 × 128        100             24
Layer 1      128 × 128        272             24
Input layer  256 × 256 × 32   –               –
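As an illustration, the trace update of equations (3) and (4), together with the explicit weight-vector normalization just described, can be sketched as follows. This is a minimal sketch, not the published VisNet code, and the parameter values are illustrative only:

```python
import numpy as np

def trace_learning(w, x_sequence, alpha=0.1, eta=0.8):
    """Sketch of the trace rule of equations (3) and (4).

    w          : synaptic weight vector of one neuron
    x_sequence : input firing-rate vectors, one per transform of the same
                 object, presented in temporal succession
    alpha      : learning rate
    eta        : trace parameter (eta = 0 reduces to a plain Hebb rule)
    """
    y_trace = 0.0
    for x in x_sequence:
        y = float(w @ x)                           # postsynaptic activation
        y_trace = (1.0 - eta) * y + eta * y_trace  # eq. (4): decaying trace
        w = w + alpha * y_trace * x                # eq. (3): weight update
        w = w / np.linalg.norm(w)                  # explicit length normalization
    return w
```

Because the trace carries activity forward from earlier presentations, inputs that occur close together in time (different transforms of one object) strengthen onto the same neuron, which is the mechanism for invariance learning described in Section 5.1.1.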
5.1.3. Competition and lateral inhibition

In order to act as a competitive network some form of mutual inhibition is required within each layer, which should help to ensure that all stimuli presented are evenly represented by the neurons in each layer. This is implemented in VisNet by a form of lateral inhibition. The idea behind the lateral inhibition, apart from this being a property of cortical architecture in the brain, was to prevent too many neurons that received inputs from a similar part of the preceding layer responding to the same activity patterns. The purpose of the lateral inhibition was to ensure that different receiving neurons coded for different inputs. This is important in reducing redundancy (Rolls, 2008b). The lateral inhibition is conceived as operating within a radius that was similar to that of the region within which a neuron received converging inputs from the preceding layer (because activity in one zone of topologically organized processing within a layer should not inhibit processing in another zone in the same layer, concerned perhaps with another part of the image). [Although the extent of the lateral inhibition actually investigated by Wallis and Rolls (1997) in VisNet operated over adjacent pixels, the lateral inhibition introduced by Rolls and Milward (2000) in what they named VisNet2, and which has been used in subsequent simulations, operates over a larger region, set within a layer to approximately half of the radius of convergence from the preceding layer. Indeed, Rolls and Milward (2000) showed, in a problem in which invariant representations over 49 locations were being learned with a 17 face test set, that the best performance was with intermediate-range lateral inhibition, using the parameters for σ shown in Table 3. These values of σ set the lateral inhibition radius within a layer to be approximately half that of the spread of the excitatory connections from the preceding layer.]

The lateral inhibition and contrast enhancement just described are actually implemented in VisNet2 (Rolls and Milward, 2000) and VisNetL (Perry et al., 2010) in two stages, to produce filtering of the type illustrated in Figure 10. The lateral inhibition is implemented by convolving the activation of the neurons in a layer with a spatial filter, I, where δ controls the contrast and σ controls the width, and a and b index the distance away from the center of the filter:

I_{a,b} = \begin{cases} -\delta \, e^{-(a^2 + b^2)/\sigma^2} & \text{if } a \neq 0 \text{ or } b \neq 0, \\ 1 - \sum_{a \neq 0, b \neq 0} I_{a,b} & \text{if } a = 0 \text{ and } b = 0. \end{cases}   (5)

This is a filter that leaves the average activity unchanged. A modified version of this filter, designed as a difference of Gaussians with the same inhibition but shorter range local excitation, is being tested to investigate whether the self-organizing maps that this promotes (Rolls, 2008b) help the system to provide some continuity in the representations formed. The concept is that this may help the system to code efficiently for large numbers of untrained stimuli that fall between trained stimuli in similarity space.

FIGURE 10 | Contrast-enhancing filter, which has the effect of local lateral inhibition. The parameters δ and σ are variables used in equation (5) to modify the amount and extent of inhibition, respectively.

The second stage involves contrast enhancement. In VisNet (Wallis and Rolls, 1997), this was implemented by raising the neuronal activations to a fixed power and normalizing the resulting firing within a layer to have an average firing rate equal to 1.0. In VisNet2 (Rolls and Milward, 2000) and in subsequent simulations a more biologically plausible form of the activation function, a sigmoid, was used:

y = f^{sigmoid}(r) = \frac{1}{1 + e^{-2\beta(r - \alpha)}}   (6)

where r is the activation (or firing rate) of the neuron after the lateral inhibition, y is the firing rate after the contrast enhancement produced by the activation function, β is the slope or gain, and α is the threshold or bias of the activation function. The sigmoid bounds the firing rate between 0 and 1, so global normalization is not required. The slope and threshold are held constant within each layer. The slope is constant throughout training, whereas the threshold is used to control the sparseness of firing rates within each layer. The (population) sparseness of the firing within a layer is defined (Rolls and Treves, 1998, 2011; Franco et al., 2007; Rolls, 2008b) as:

a = \frac{\left( \sum_i y_i / n \right)^2}{\sum_i y_i^2 / n}   (7)

where n is the number of neurons in the layer. To set the sparseness to a given value, e.g., 5%, the threshold is set to the value of the 95th percentile point of the activations within the layer. (Unless otherwise stated here, the neurons used the sigmoid activation function as just described.)

In most simulations with VisNet2 and later, the sigmoid activation function was used with parameters (selected after a number of optimization runs) as shown in Table 2. In addition, the lateral inhibition parameters normally used in VisNet2 simulations are as shown in Table 3. (Where a power activation function was used in the simulations of Wallis and Rolls (1997), the power for layer 1 was 6, and for the other layers was 2.)
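A minimal sketch of the two competition stages just described: the lateral inhibition filter of equation (5), the sigmoid of equation (6) with its threshold set from a percentile of the activations, and the population sparseness of equation (7). The filter half-width and the default parameter values (taken from the layer 1 columns of Tables 2 and 3) are illustrative assumptions:

```python
import numpy as np

def lateral_inhibition_filter(sigma=1.38, delta=1.5, half_width=4):
    """Equation (5): negative Gaussian surround, with the center weight
    chosen so that the filter sums to 1 and so leaves the average
    activity unchanged."""
    ab = np.arange(-half_width, half_width + 1)
    A, B = np.meshgrid(ab, ab, indexing="ij")
    I = -delta * np.exp(-(A**2 + B**2) / sigma**2)
    I[half_width, half_width] = 0.0
    I[half_width, half_width] = 1.0 - I.sum()   # center term of eq. (5)
    return I

def sigmoid_activation(r, beta=190.0, sparseness=0.05):
    """Equation (6), with the threshold alpha set at the percentile of the
    activations r that yields the target sparseness (the 95th percentile
    for 5% sparseness)."""
    alpha = np.percentile(r, 100.0 * (1.0 - sparseness))
    z = -2.0 * beta * (r - alpha)
    return 1.0 / (1.0 + np.exp(np.clip(z, -500.0, 500.0)))  # clip avoids overflow

def population_sparseness(y):
    """Equation (7): (mean of y)^2 divided by the mean of y^2."""
    y = np.asarray(y, dtype=float)
    return y.mean() ** 2 / (y ** 2).mean()
```

By construction `lateral_inhibition_filter().sum()` is 1.0, and with a steep slope the sigmoid drives approximately the target fraction of neurons above 0.5, giving the intended sparse firing within a layer.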
5.1.4. The input to VisNet

VisNet is provided with a set of input filters which can be applied to an image to produce inputs to the network which correspond to those provided by simple cells in visual cortical area 1 (V1). The purpose of this is to enable, within VisNet, the more complicated response properties of cells between V1 and the inferior temporal cortex (IT) to be investigated, using as inputs natural stimuli such as those that could be applied to the retina of the real visual system. This is to facilitate comparisons between the activity of neurons in VisNet and those in the real visual system, to the same stimuli. In VisNet no attempt is made to train the response properties of simple cells, but instead we start with a defined series of filters to perform fixed feature extraction to a level equivalent to that of simple cells in V1, as have other researchers in the field (Fukushima, 1980; Buhmann et al., 1991; Hummel and Biederman, 1992), because we wish to simulate the more complicated response properties of cells between V1 and the inferior temporal cortex (IT). The elongated orientation-tuned input filters used accord with the general tuning profiles of simple cells in V1 (Hawken and Parker, 1987), and in earlier versions of VisNet were computed by weighting the difference of two Gaussians by a third orthogonal Gaussian, as described in detail elsewhere (Wallis and Rolls, 1997; Rolls and Milward, 2000; Perry et al., 2010).

Table 2 | Sigmoid parameters for the runs with 25 locations (Rolls and Milward, 2000).

Layer        1     2    3    4
Percentile   99.2  98   88   91
Slope        190   40   75   26

Table 3 | Lateral inhibition parameters for the 25-location runs.

Layer        1     2    3    4
Radius, σ    1.38  2.7  4.0  6.0
Contrast, δ  1.5   1.5  1.6  1.4

Each individual filter is tuned to spatial-frequency (0.0039–0.5 cycles/pixel over eight octaves); orientation (0°–135° in steps of 45°); and sign (±1). Of the 272 layer 1 connections, the number to each group in VisNetL is as shown in Table 4. In VisNet2 (Rolls and Milward, 2000; used for most VisNet simulations) only even symmetric – "bar detecting" – filter shapes are used, which take the form of a Gaussian shape along the axis of orientation tuning for the filter, and a difference of Gaussians along the perpendicular axis. This filter is referred to as an oriented difference of Gaussians, or DOG filter. Any zero D.C. filter can of course produce a negative as well as a positive output, which would mean that this simulation of a simple cell would permit negative as well as positive firing. In contrast to some other models, the response of each filter is zero thresholded and the negative results are used to form a separate anti-phase input to the network. The filter outputs are also normalized across scales to compensate for the low-frequency bias in the images of natural objects.

The Gabor receptive fields have five degrees of freedom, given essentially by the product of an elliptical Gaussian and a complex plane wave. The first two degrees of freedom are the 2D-locations of the receptive field's center; the third is the size of the receptive field; the fourth is the orientation of the boundaries separating excitatory and inhibitory regions; and the fifth is the symmetry. This fifth degree of freedom is given in the standard Gabor transform by the real and imaginary part, i.e., by the phase of the complex function representing it, whereas in a biological context this can be done by combining pairs of neurons with even and odd receptive fields. This design is supported by the experimental work of Pollen and Ronner (1981), who found simple cells in quadrature-phase pairs. Even more, Daugman (1988) proposed that an ensemble of simple cells is best modeled as a family of 2D-Gabor wavelets sampling the frequency domain in a log-polar manner as a function of eccentricity. Experimental neurophysiological evidence constrains the relation between the free parameters that define a 2D-Gabor receptive field (De Valois and De Valois, 1988). There are three constraints fixing the relation between the width, height, orientation, and spatial-frequency (Lee, 1996). The first constraint posits that the aspect ratio of the elliptical Gaussian envelope is 2:1. The second constraint postulates that the plane wave tends to have its propagating direction along the short axis of the elliptical Gaussian. The third constraint assumes that the half-amplitude bandwidth of the frequency response is about 1–1.5 octaves along the optimal orientation. Further, we assume that the mean is zero in order to have an admissible wavelet basis (Lee, 1996).

In more detail, the Gabor filters are constructed as follows (Deco and Rolls, 2004). We consider a pixelized gray-scale image given by an N × N matrix Γ^{orig}_{ij}. The subindices ij denote the spatial position of the pixel. Each pixel value is given a gray-level brightness value coded in a scale between 0 (black) and 255 (white). The first step in the pre-processing consists of removing the DC component of the image (i.e., the mean value of the gray-scale intensity of the pixels). (The equivalent in the brain is the low-pass filtering performed by the retinal ganglion cells and lateral geniculate cells. The visual representation in the LGN is essentially a contrast-invariant pixel representation of the image, i.e., each neuron encodes the relative brightness value at one location in visual space referred to the mean value of the image brightness.) We denote this contrast-invariant LGN representation by the N × N matrix Γ_{ij} defined by the equation

\Gamma_{ij} = \Gamma^{orig}_{ij} - \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \Gamma^{orig}_{ij}.   (8)

Feed-forward connections to a layer of V1 neurons perform the extraction of simple features like bars at different locations, orientations, and sizes.
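A sketch of this pre-processing and of the 2D-Gabor wavelet family of equations (9)–(11): DC removal by mean subtraction as in equation (8), and construction of the mother wavelet with its rotated, dilated, and translated versions. This is an illustrative sketch, not the published filter bank; the center spacing and default arguments are assumptions:

```python
import numpy as np

def remove_dc(image):
    """Equation (8): subtract the mean gray level, giving the
    contrast-invariant "LGN" representation described above."""
    image = np.asarray(image, dtype=float)   # gray levels 0 (black)..255 (white)
    return image - image.mean()

def mother_wavelet(x, y, kappa=np.pi):
    """Equation (11): elliptical Gaussian envelope (2:1 aspect ratio) times
    a complex carrier, with the term e^{-kappa^2/2} subtracted so that the
    filter has zero mean (an admissible wavelet)."""
    envelope = np.exp(-(4.0 * x**2 + y**2) / 8.0) / np.sqrt(2.0 * np.pi)
    return envelope * (np.exp(1j * kappa * x) - np.exp(-kappa**2 / 2.0))

def gabor_wavelet(x, y, p=0, q=0, k=0, l=0, L=4, a=2.0):
    """Equations (9) and (10): rotate to the preferred orientation l*pi/L,
    then dilate by a^{-k} and translate to the receptive field center."""
    theta = l * np.pi / L
    xr = x * np.cos(theta) + y * np.sin(theta)    # eq. (10) rotation
    yr = -x * np.sin(theta) + y * np.cos(theta)
    s = a ** (-float(k))
    return s * mother_wavelet(s * xr - 2.0 * p, s * yr - 2.0 * q)
```

With kappa = pi, the real part of the mother wavelet integrates to zero, so the filter gives no response to uniform illumination, consistent with the zero-mean constraint stated above.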
However, Gabor filters have also been tested, also produce good results with VisNet (Deco and Rolls, 2004), and are what we implement at present in VisNetL. Following Daugman (1988), the receptive fields of the simple cell-like input neurons are modeled by 2D-Gabor functions.

Realistic receptive fields for V1 neurons that extract these simple features can be represented by 2D-Gabor wavelets. Lee (1996) derived a family of discretized 2D-Gabor wavelets that satisfy the wavelet theory and the neurophysiological constraints for simple cells mentioned above. They are given by an expression of the form

G_{pqkl}(x, y) = a^{-k} \Psi_{\theta}\left( a^{-k}x - 2p, \; a^{-k}y - 2q \right)   (9)

where

\Psi_{\theta}(x, y) = \Psi\left( x \cos(l\theta_0) + y \sin(l\theta_0), \; -x \sin(l\theta_0) + y \cos(l\theta_0) \right)   (10)

and the mother wavelet is given by

\Psi(x, y) = \frac{1}{\sqrt{2\pi}} \, e^{-(4x^2 + y^2)/8} \left( e^{i\kappa x} - e^{-\kappa^2/2} \right).   (11)

In the above equations θ_0 = π/L denotes the step size of each angular rotation; l the index of rotation corresponding to the preferred orientation θ_l = lπ/L; k denotes the octave; and the indices pq the position of the receptive field center at c_x = 2p and c_y = 2q. In this form, the receptive fields at all levels cover the spatial domain in the same way, i.e., by always overlapping the receptive fields in the same fashion. In the model we use a = 2, b = 1, and κ = π, corresponding to a spatial-frequency bandwidth of one octave. We now use in VisNetL both symmetric and asymmetric filters (as both are present in V1; Ringach, 2002), with the angular spacing between the different orientations set to 45°, and with 8 filter frequencies spaced one octave apart starting with 0.5 cycles per pixel, with the sampling from the spatial frequencies set as shown in Table 4.

Cells of layer 1 receive a topologically consistent, localized, random selection of the filter responses in the input layer, under the constraint that each cell samples every filter spatial-frequency and receives a constant number of inputs. Figure 11 shows pictorially the general filter sampling paradigm.

Table 4 | VisNet layer 1 connectivity.

Frequency      0.5  0.25  0.125  0.0625  0.03125  0.0156  0.0078  0.0039
# Connections  180  45    12     7       7        7       7       7

The frequency is in cycles per pixel.

FIGURE 11 | The filter sampling paradigm. Here each square represents the retinal image presented to the network after being filtered by a Gabor filter of the appropriate orientation, sign, and frequency. The circles represent the consistent retinotopic coordinates used to provide input to a layer 1 cell. The filters double in spatial-frequency toward the reader. Left to right the orientation tuning increases from 0° in steps of 45°, with segregated pairs of positive (P) and negative (N) filter responses.

5.1.5. Measures for network performance

A neuron can be said to have learnt an invariant representation if it discriminates one set of stimuli from another set, across all transformations. For example, a neuron's response is translation-invariant if its response to one set of stimuli irrespective of presentation location is consistently higher than for all other stimuli irrespective of presentation location. Note that we state "set of stimuli" since neurons in the inferior temporal cortex are not generally selective for a single stimulus but rather for a subpopulation of stimuli (Baylis et al., 1985; Abbott et al., 1996; Rolls et al., 1997b; Rolls and Treves, 1998, 2011; Rolls and Deco, 2002; Franco et al., 2007; Rolls, 2007b, 2008b).

The measure of network performance used in VisNet1 (Wallis and Rolls, 1997), the "Fisher metric" (referred to in some figure labels as the Discrimination Factor), reflects how well a neuron discriminates between stimuli, compared to how well it discriminates between different locations (or more generally between the images used rather than the objects, each of which is represented by a set of images, over which invariant stimulus or object representations must be learned). The Fisher measure is very similar to taking the ratio of the two F values in a two-way ANOVA, where one factor is the stimulus shown, and the other factor is the position in which a stimulus is shown. The measure takes a value greater than 1.0 if a neuron has more different responses to the stimuli than to the locations. That is, values greater than 1 indicate invariant representations when this measure is used in the following figures. Further details of how the measure is calculated are given by Wallis and Rolls (1997).

Measures of network performance based on information theory, similar to those used in the analysis of the firing of real neurons in the brain (Rolls, 2008b; Rolls and Treves, 2011), were introduced by Rolls and Milward (2000) for VisNet2, and are used in later papers.
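The Fisher-style comparison just described can be sketched as the ratio of the two F values from a two-way ANOVA on one cell's responses indexed by stimulus and location. This is a simplified illustration of that idea, not the exact Wallis and Rolls (1997) formula:

```python
import numpy as np

def fisher_ratio(responses):
    """Ratio of the stimulus F value to the location F value in a two-way
    ANOVA (no replication). responses[s, l] is the firing of one cell to
    stimulus s at location l; values > 1 suggest stimulus selectivity that
    is invariant across locations."""
    r = np.asarray(responses, dtype=float)
    S, L = r.shape
    grand = r.mean()
    mean_s = r.mean(axis=1)                      # per-stimulus means
    mean_l = r.mean(axis=0)                      # per-location means
    ss_stim = L * np.sum((mean_s - grand) ** 2)
    ss_loc = S * np.sum((mean_l - grand) ** 2)
    resid = r - mean_s[:, None] - mean_l[None, :] + grand
    ms_res = np.sum(resid ** 2) / ((S - 1) * (L - 1))
    f_stim = (ss_stim / (S - 1)) / ms_res
    f_loc = (ss_loc / (L - 1)) / ms_res
    return f_stim / f_loc
```

A cell whose response depends mainly on which stimulus is shown, and hardly on where it is shown, gives a ratio well above 1; a cell driven mainly by location gives a ratio below 1.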
A single cell information measure was introduced which is the maximum amount of information the cell has about any one stimulus/object independently of which transform (e.g., position on the retina) is shown. Because the competitive algorithm used in VisNet tends to produce local representations (in which single cells become tuned to one stimulus or object), this information measure can approach log_2 N_s bits, where N_s is the number of different stimuli. Indeed, it is an advantage of this measure that it has a defined maximal value, which enables how well the network is performing to be quantified. Rolls and Milward (2000) showed that the Fisher and single cell information measures were highly correlated, and given the advantage just noted of the information measure, it was adopted in Rolls and Milward (2000) and subsequent papers. Rolls and Milward (2000) also introduced a multiple cell information measure, which has the advantage that it provides a measure of whether all stimuli are encoded by different neurons in the network. Again, a high value of this measure indicates good performance.

For completeness, we provide further specification of the two information theoretic measures, which are described in detail by Rolls and Milward (2000) (see Rolls, 2008b, and Rolls and Treves, 2011, for an introduction to the concepts). The measures assess the extent to which either a single cell, or a population of cells, responds to the same stimulus invariantly with respect to its location, yet responds differently to different stimuli. The measures effectively show what one learns about which stimulus was presented from a single presentation of the stimulus at any randomly chosen location. Results for top (4th) layer cells are shown. High information measures thus show that cells fire similarly to the different transforms of a given stimulus (object), and differently to the other stimuli.

The single cell stimulus-specific information, I(s, R), is the amount of information the set of responses, R, has about a specific stimulus, s (see Rolls et al., 1997c; Rolls and Milward, 2000). I(s, R) is given by

I(s, R) = \sum_{r \in R} P(r|s) \log_2 \frac{P(r|s)}{P(r)}   (12)

where r is an individual response from the set of responses R of the neuron. For each cell the performance measure used was the maximum amount of information a cell conveyed about any one stimulus. This (rather than the mutual information, I(S, R), where S is the whole set of stimuli s) is appropriate for a competitive network in which the cells tend to become tuned to one stimulus. (I(s, R) has more recently been called the stimulus-specific surprise (DeWeese and Meister, 1999; Rolls and Treves, 2011). Its average across stimuli is the mutual information I(S, R).)

If all the output cells of VisNet learned to respond to the same stimulus, then the information about the set of stimuli S would be very poor, and would not reach its maximal value of log_2 of the number of stimuli (in bits). The second measure that is used here is the information provided by a set of cells about the stimulus set, using the procedures described by Rolls et al. (1997b) and Rolls and Milward (2000). The multiple cell information is the mutual information between the whole set of stimuli S and of responses R, calculated using a decoding procedure in which the stimulus s' that gave rise to the particular firing rate response vector on each trial is estimated. (The decoding step is needed because the high dimensionality of the response space would lead to an inaccurate estimate of the information if the responses were used directly, as described by Rolls et al. (1997b) and Rolls and Treves (1998).) A probability table is then constructed of the real stimuli s and the decoded stimuli s'. From this probability table, the mutual information between the set of actual stimuli S and the decoded estimates S' is calculated as

I(S, S') = \sum_{s, s'} P(s, s') \log_2 \frac{P(s, s')}{P(s) P(s')}   (13)

This was calculated for the subset of cells which had as single cells the most information about which stimulus was shown. In particular, in Rolls and Milward (2000) and subsequent papers, the multiple cell information was calculated from the first five cells for each stimulus that had maximal single cell information about that stimulus, that is, from a population of 35 cells if there were seven stimuli (each of which might have been shown in, for example, 9 or 25 positions on the retina).

5.2. INITIAL EXPERIMENTS WITH VisNet

Having established a network model, Wallis and Rolls (1997), following a first report by Wallis et al. (1993), described four experiments in which the theory of how invariant representations could be formed was tested using a variety of stimuli undergoing a number of natural transformations. In each case the network produced neurons in the final layer whose responses were largely invariant across a transformation and highly discriminating between stimuli or sets of stimuli. A summary showing how the network performed is presented here, with much more evidence of the factors that influence the network's performance described elsewhere (Wallis and Rolls, 1997; Rolls, 2008b).

5.2.1. "T," "L," and "C" as stimuli: learning translation invariance

One of the classical properties of inferior temporal cortex face cells is their invariant response to face stimuli translated across the visual field (Tovee et al., 1994). In this first experiment, the learning of translation-invariant representations by VisNet was investigated.

In order to test the network a set of three stimuli, based upon probable 3D edge cues – consisting of a "T," "L," and "C" shape – was constructed. (Chakravarty (1979) describes the application of these shapes as cues for the 3D interpretation of edge junctions, and Tanaka et al. (1991) have demonstrated the existence of cells responsive to such stimuli in IT.) These stimuli were chosen partly because of their significance as form cues, but on a more practical

Elliffe et al. (2002). The feature combination tuning is illustrated by the VisNet layer 1 neuron shown in Figures 12 and 13.
note because they each contain the same fundamental features – The results for layer 4 neurons are illustrated in Figure 14. namely a horizontal bar conjoined with a vertical bar. In practice By this stage translation-invariant, stimulus-identifying, cells have this means that the oriented simple cell filters of the input layer emerged. The response profiles confirm the high level of neural cannot distinguish these stimuli on the basis of which features are selectivity for a particular stimulus irrespective of location. Neu- present. As a consequence of this, the representation of the stimuli rons in layers 2 and 3 of VisNet had intermediate-levels of received by the network is non-orthogonal and hence considerably translation invariance to those illustrated for layer 1 and layer more difficult to classify than was the case in earlier experiments 4. The gradual increase in the invariance that the tolerance to involving the trace rule described by Földiák (1991). The expec- shifts of the preferred stimulus gradually builds up through the tation is that layer 1 neurons would learn to respond to spatially layers. selective combinations of the basic features thereby helping to The trace used in VisNet enables successive features that, based distinguish these non-orthogonal stimuli. The trajectory followed on the natural statistics of the visual input, are likely to be from the by each stimulus consisted of sweeping left to right horizontally same object or feature complex to be associated together. For good across three locations in the top row, and then sweeping back, right performance, the temporal trace needs to be sufficiently long that to left across the middle row, before returning to the right hand it covers the period in which features seen by a particular neuron side across the bottom row – tracing out a “Z” shape path across in the hierarchy are likely to come from the same object. On the the retina. 
Unless stated otherwise this pattern of nine presenta- other hand, the trace should not be so long that it produces asso- tion locations was adopted in all image translation experiments ciations between features that are parts of different objects, seen described by Wallis and Rolls (1997). when, for example, the eyes move to another object. One possibil- Training was carried out by permutatively presenting all stimuli ity is to reset the trace during saccades between different objects. If in each location a total of 800 times. The sequence described above explicit trace resetting is not implemented, then the trace should, was followed for each stimulus, with the sequence start point and to optimize the compromise implied by the above, lead to strong direction of sweep being chosen at random for each of the 800 associations between temporally close stimuli, and increasingly training trials. weaker associations between temporally more distant stimuli. In Figures 12 and 13 shows the response after training of a first fact, the trace implemented in VisNet has an exponential decay, layer neuron selective for the “T” stimulus. The weighted sum of all and it has been shown that this form is optimal in the situation filter inputs reveals the combination of horizontally and vertically where the exact duration over which the same object is being tuned filters in identifying the stimulus. In this case many connec- viewed varies, and where the natural statistics of the visual input tions to the lower frequency filters have been reduced to zero by the happen also to show a decreasing probability that the same object learning process, except at the relevant orientations. This contrasts is being viewed as the time period in question increases (Wallis strongly with the random wiring present before training (Wallis and Baddeley, 1997). Moreover, performance can be enhanced if and Rolls, 1997; Rolls, 2008b). 
It is important that neurons at early the duration of the trace does at the same time approximately stages of feature hierarchy networks respond to combinations of match the period over which the input stimuli are likely to come features in defined relative spatial positions, before invariance is from the same object or feature complex (Wallis and Rolls, 1997; built into the system, as this is part of the way that the binding Rolls, 2008b). Nevertheless, good performance can be obtained in problem is solved, as described in more detail in Section 5.4 and by conditions under which the trace rule allows associations to be FIGURE 12 | The left graph shows the response of a layer 1 neuron to the three training stimuli for the nine training locations. Alongside this are the results of summating all the filter inputs to the neuron. The discrimination factor for this cell was 1.04. Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 22 Rolls Invariant visual object recognition FIGURE 13 | The connections to a single cell in layer 1 of VisNet receptive field of the layer 1 cell is centered just below the center-point from the filters after training in the T, L, and C stimulus set, of the retina. The connection scheme allows for relatively fewer represented by plotting the receptive fields of every input layer cell connections to lower frequency cells than to high-frequency cells in connected to the particular layer 1 cell. Separate input layer cells order to cover a similar region of the input at each frequency. The blank have activity that represents a positive (P) or negative (N) output from squares indicate that no connection exists between the layer 1 cell the bank of filters which have different orientations in degrees (the chosen and the filters of that particular orientation, sign, and columns) and different spatial frequencies (the rows). Here the overall spatial-frequency. 
formed only between successive items in the visual stream (Rolls and Milward, 2000; Rolls and Stringer, 2001).

FIGURE 14 | Response profiles for two fourth layer neurons – discrimination factors 4.07 and 3.62 – in the L, T, and C experiment.

It is also the case that the optimal value of η in the trace rule is likely to be different for different layers of VisNet, and for cortical processing in the "what" visual stream. For early layers of the system, small movements of the eyes might lead to different feature combinations providing the input to cells (which at early stages have small receptive fields), and a short duration of the trace would be optimal. However, these small eye movements might be around the same object, and later layers of the architecture would benefit from being able to associate together their inputs over longer times, in order to learn about the larger scale properties that characterize individual objects, including, for example, different views of objects observed as an object turns or is turned. Thus the suggestion is made that the temporal trace could be effectively longer at later stages (e.g., inferior temporal visual cortex) compared to early stages (e.g., V2 and V4) of processing in the visual system. In addition, as will be shown in Section 5.4, it is important to form feature combinations with high-spatial precision before invariance learning supported by a temporal trace starts, in order that the feature combinations and not the individual features have invariant representations. This leads to the suggestion that the trace rule should either not operate, or be short, at early stages of cortical visual processing such as V1. This is reflected in the operation of VisNet2, which does not use a temporal trace in layer 1 (Rolls and Milward, 2000).

5.2.2. Faces as stimuli: translation invariance
The aim of the next set of experiments described by Wallis and Rolls (1997) was to start to address the issues of how the network operates when invariant representations must be learned for a larger number of stimuli, and whether the network can learn when much more complicated, real biological stimuli, faces, are used.

Figure 15 contrasts the measure of invariance, or discrimination factor, achieved by cells in the four layers, averaged over five separate runs of the network (Wallis and Rolls, 1997; Rolls, 2008b). Translation invariance clearly increases through the layers, as expected.

Having established that invariant cells have emerged in the final layer, we now consider the role of the trace rule, by testing the network under two new conditions. Firstly, the performance of the network was measured before learning occurs, that is with its initially random connection weights. Secondly, the network was trained with η in the trace rule set to 0, which causes learning to proceed in a traceless, standard Hebbian, fashion. (Hebbian learning is purely associative; Rolls, 2008b.) Figure 16 shows the results under the three training conditions. The results show that the trace rule is the decisive factor in establishing the invariant responses in the layer 4 neurons. It is interesting to note that the Hebbian learning results are actually worse than those achieved by chance in the untrained net. In general, with Hebbian learning, the most highly discriminating cells barely rate higher than 1. This value of discrimination corresponds to the case in which a cell responds to only one stimulus and in only one location. The poor performance with the Hebb rule comes as a direct consequence of the presentation paradigm being employed. If we consider an image as representing a vector in multidimensional space, a particular image in the top left-hand corner of the input retina will tend to look more like any other image in that same location than the same image presented elsewhere. A simple competitive network using just Hebbian learning will thus tend to categorize images by where they are rather than what they are – the exact opposite of what the net was intended to learn. This comparison thus indicates that a small memory trace acting in the standard Hebbian learning paradigm can radically alter the normal vector averaging, image classification, performed by a Hebbian-based competitive network.

In order to check that there was an invariant representation in layer 4 of VisNet that could be read by a receiving population of neurons, a fifth layer was added to the net which fully sampled the fourth layer cells. This layer was in turn trained in a supervised manner using gradient descent or with a Hebbian associative learning rule. Wallis and Rolls (1997) showed that the object classification performed by the layer 5 network was better if the network had been trained with the trace rule than when it was untrained or was trained with a Hebb rule.

5.2.3. Faces as stimuli: view-invariance
Given that the network had been shown to be able to operate usefully with a more difficult translation invariance problem, we next addressed the question of whether the network can solve other types of transform invariance, as we had intended. The next experiment addressed this question by training the network on the problem of 3D stimulus rotation, which produces non-isomorphic transforms, to determine whether the network can build a view-invariant categorization of the stimuli (Wallis and Rolls, 1997). The trace rule learning paradigm should, in conjunction with the architecture described here, prove capable of learning any of the transforms tolerated by IT neurons, so long as each stimulus is presented in short sequences during which the transformation occurs and can be learned. This experiment continued with the use of faces but now presented them centrally in the retina in a sequence of different views of a face (Wallis and Rolls, 1997; Rolls, 2008b). The faces were again smoothed at the edges to erase the harsh image boundaries, and the D.C. term was removed. During the 800 epochs of learning, each stimulus was chosen at random, and a sequence of preset views of it was shown, rotating the face either to the left or to the right.

FIGURE 15 | Variation in network performance for the top 30 most highly discriminating cells through the four layers of the network, averaged over five runs of the network. The net was trained on 7 faces each in 9 locations.

FIGURE 16 | Variation in network performance for the top 30 most highly discriminating cells in the fourth layer for the three training regimes, averaged over five runs of the network. The net was trained on 7 faces each in 9 locations.

Although the actual number of images being presented is smaller, some 21 views in all, there is good reason to think that this problem may be harder to solve than the previous translation experiments. This is simply due to the fact that all 21 views exactly overlap with one another. The net was indeed able to solve the invariance problem, with examples of invariant layer 4 neuron response profiles appearing in Figure 17. Further analyses confirmed the good performance on view-invariance learning (Wallis and Rolls, 1997; Rolls, 2008b).

5.3. DIFFERENT FORMS OF THE TRACE-LEARNING RULE, AND THEIR RELATION TO ERROR CORRECTION AND TEMPORAL DIFFERENCE LEARNING
The original trace-learning rule used in the simulations of Wallis and Rolls (1997) took the form

δw_j = α ȳ^τ x_j^τ   (14)

where the trace ȳ^τ is updated according to

ȳ^τ = (1 − η) y^τ + η ȳ^(τ−1).   (15)

The parameter η ∈ [0, 1] controls the relative contributions to the trace ȳ^τ from the instantaneous firing rate y^τ and the trace at the previous time step ȳ^(τ−1), where for η = 0 we have ȳ^τ = y^τ and equation (14) becomes the standard Hebb rule

δw_j = α y^τ x_j^τ.   (16)

At the start of a series of investigations of different forms of the trace-learning rule, Rolls and Milward (2000) demonstrated that VisNet's performance could be greatly enhanced (see Figure 18) with a modified Hebbian trace-learning rule (equation (17)) that incorporated a trace of activity from the preceding time steps, with no contribution from the activity being produced by the stimulus at the current time step. This rule took the form

δw_j = α ȳ^(τ−1) x_j^τ.   (17)

FIGURE 17 | Response profiles for cells in the last two layers of the network – discrimination factors 11.12 and 12.40 – in the experiment with seven different views of each of three faces.

FIGURE 18 | Numerical results with the standard trace rule (14), the modified trace-learning rule (17), the Hebb rule (16), and random weights, trained on 7 faces in 9 locations: single cell information measure (left), multiple cell information measure (right). (After Rolls and Stringer, 2001a.)

The trace shown in equation (17) is in the post-synaptic term, and similar effects were found if the trace was in the presynaptic term, or in both the pre- and the post-synaptic terms. The crucial difference from the earlier rule (see equation (14)) was that the trace should be calculated up to only the preceding timestep, with no contribution to the trace from the firing on the current trial to the current stimulus. How might this be understood?

One way to understand this is to note that the trace rule is trying to set up the synaptic weight on trial τ based on whether the neuron, based on its previous history, is responding to that stimulus (in other transforms, e.g., position). Use of the trace rule at τ − 1 does this, that is, it takes into account the firing of the neuron on previous trials, with no contribution from the firing being produced by the stimulus on the current trial. On the other hand, use of the trace at time τ in the update takes into account the current firing of the neuron to the stimulus in that particular position, which is not a good estimate of whether that neuron should be allocated to invariantly represent that stimulus. Effectively, using the trace at time τ introduces a Hebbian element into the update, which tends to build position-encoded analyzers, rather than stimulus-encoded analyzers. (The argument has been phrased for a system learning translation invariance, but applies to the learning of all types of invariance.) A particular advantage of using the trace at τ − 1 is that the trace will then on different occasions (due to the randomness in the location sequences used) reflect previous histories with different sets of positions, enabling the learning of the neuron to be based on evidence from the stimulus present in many different positions. Using a term from the current firing in the trace (i.e., the trace calculated at time τ) results in this desirable effect always having an undesirable element from the current firing of the neuron to the stimulus in its current position.

5.3.1. The modified Hebbian trace rule and its relation to error correction
The rule of equation (17) corrects the weights using a post-synaptic trace obtained from the previous firing (produced by other transforms of the same stimulus), with no contribution to the trace from the current post-synaptic firing (produced by the current transform of the stimulus). Indeed, insofar as the current firing y^τ is not the same as the preceding trace, this difference can be thought of as an error. This leads to a conceptualization of using the difference between the current firing and the preceding trace as an error correction term, as noted in the context of modeling the temporal properties of classical conditioning by Sutton and Barto (1981), and developed next in the context of invariance learning (see Rolls and Stringer, 2001).

First, we re-express the rule of equation (17) in an alternative form as follows. Suppose we are at timestep τ and have just calculated a neuronal firing rate y^τ and the corresponding trace ȳ^τ from the trace update equation (15). If we assume η ∈ (0, 1), then rearranging equation (15) gives

ȳ^(τ−1) = (1/η) [ȳ^τ − (1 − η) y^τ],   (18)

and substituting equation (18) into equation (17) gives

δw_j = (α/η) [ȳ^τ − (1 − η) y^τ] x_j^τ
     = (α (1 − η)/η) [ (1/(1 − η)) ȳ^τ − y^τ ] x_j^τ   (19)
     = α̂ (β ȳ^τ − y^τ) x_j^τ

where α̂ = α (1 − η)/η and β = 1/(1 − η). The modified Hebbian trace-learning rule (17) is thus equivalent to equation (19), which is in the general form of an error correction rule (Hertz et al., 1991). That is, rule (19) involves the subtraction of the current firing rate y^τ from a target value, in this case β ȳ^τ.

Although above we have referred to rule (17) as a modified Hebbian rule, we note that it is only associative in the sense of associating previous cell firing with the current cell inputs. In the next section we continue to explore the error correction paradigm, examining five alternative examples of this sort of learning rule.

5.3.2. Five forms of error correction learning rule
Error correction learning rules are derived from gradient descent minimization (Hertz et al., 1991), and continually compare the current neuronal output to a target value t and adjust the synaptic weights according to the following equation at a particular timestep

δw_j = α (t − y^τ) x_j^τ.   (20)

In this usual form of gradient descent by error correction, the target t is fixed. However, in keeping with our aim of encouraging neurons to respond similarly to images that occur close together in time, it seems reasonable to set the target at a particular timestep, t^τ, to be some function of cell activity occurring close in time, because encouraging neurons to respond to temporal classes will tend to make them respond to the different variants of a given stimulus (Földiák, 1991; Rolls, 1992; Wallis and Rolls, 1997). For this reason, Rolls and Stringer (2001) explored a range of error correction rules where the targets t^τ are based on the trace of neuronal activity calculated according to equation (15). We note that although the target is not a fixed value as in standard error correction learning, nevertheless the new learning rules perform gradient descent on each timestep, as elaborated below. Although the target may be varying early on in learning, as learning proceeds the target is expected to become more and more constant, as neurons settle to respond invariantly to particular stimuli. The first set of five error correction rules we discuss are as follows:

δw_j = α (β ȳ^(τ−1) − y^τ) x_j^τ,   (21)
δw_j = α (β y^(τ−1) − y^τ) x_j^τ,   (22)
δw_j = α (β ȳ^τ − y^τ) x_j^τ,   (23)
δw_j = α (β ȳ^(τ+1) − y^τ) x_j^τ,   (24)
δw_j = α (β y^(τ+1) − y^τ) x_j^τ,   (25)

where updates (21–23) are performed at timestep τ, and updates (24) and (25) are performed at timestep τ + 1. (The reason for adopting this convention is that the basic form of the error correction rule (20) is kept, with the five different rules simply replacing the term t.) It may be readily seen that equations (22) and (25) are special cases of equations (21) and (24), respectively, with η = 0. These rules are all similar except for their targets t^τ, which are all functions of a temporally nearby value of cell activity. In particular, rule (23) is directly related to rule (19), but is more general in that the parameter β = 1/(1 − η) is replaced by an unconstrained parameter β. In addition, we also note that rule (21) is closely related to a rule developed in Peng et al. (1998) for view-invariance learning. The above five error correction rules are biologically plausible in that the targets t^τ are all local cell variables (see Rolls and Treves, 1998, and Rolls, 2008b). In particular, rule (23) uses the trace ȳ^τ from the current time level τ, and rules (22) and (25) do not need exponential trace values ȳ, instead relying only on the instantaneous firing rates at the current and immediately preceding timesteps. However, all five error correction rules involve decrementing of synaptic weights according to an error which is calculated by subtracting the current activity from a target.

Numerical results with the error correction rules trained on 7 faces in 9 locations are presented by Rolls and Stringer (2001). For all the results the synaptic weights were clipped to be positive during the simulation, because it is important to test that decrementing synaptic weights purely within the positive interval w ∈ [0, 1] will provide significantly enhanced performance. That is, it is important to show that error correction rules do not necessarily require possibly biologically implausible modifiable negative weights. For each of the rules (21–25), the parameter β has been individually optimized to the following respective values: 4.9, 2.2, 2.2, 3.8, 2.2. All five error correction rules offer considerably improved performance over both the standard trace rule (14) and rule (17). Networks trained with rule (21) performed best, and this is probably due to two reasons. Firstly, rule (21) incorporates an exponential trace ȳ^(τ−1) in its target t^τ, and we would expect this to help neurons to learn more quickly to respond invariantly to a class of inputs that occur close together in time. Hence, setting η = 0 as in rule (22) results in reduced performance. Secondly, unlike rules (23) and (24), rule (21) does not contain any component of y^τ in its target. If we examine rules (23) and (24), we see that their respective targets β ȳ^τ and β ȳ^(τ+1) contain significant components of y^τ.

5.3.3. Relationship to temporal difference learning
Rolls and Stringer (2001) not only considered the relationship of rule (17) to error correction, but also considered how the error correction rules shown in equations (21–25) are related to temporal difference learning (Sutton, 1988; Sutton and Barto, 1998). Sutton (1988) described temporal difference methods in the context of prediction learning. These methods are a class of incremental learning techniques that can learn to predict final outcomes through comparison of successive predictions from the preceding time steps. This is in contrast to traditional supervised learning, which involves the comparison of predictions only with the final outcome. Consider a series of multistep prediction problems in which for each problem there is a sequence of observation vectors, x^1, x^2, ..., x^m, at successive timesteps, followed by a final scalar outcome z. For each sequence of observations temporal difference methods form a sequence of predictions y^1, y^2, ..., y^m, each of which is a prediction of z. These predictions are based on the observation vectors x^τ and a vector of modifiable weights w; i.e., the prediction at time step τ is given by y^τ(x^τ, w), and for a linear dependency the prediction is given by y^τ = w^T x^τ. (Note here that w^T is the transpose of the weight vector w.) The problem of prediction is to calculate the weight vector w such that the predictions y^τ are good estimates of the outcome z.

The supervised learning approach to the prediction problem is to form pairs of observation vectors x^τ and outcome z for all time steps, and compute an update to the weights according to the gradient descent equation

δw = α (z − y^τ) ∇_w y^τ,   (26)

where α is a learning rate parameter and ∇_w indicates the gradient with respect to the weight vector w. However, this learning procedure requires all calculation to be done at the end of the sequence, once z is known. To remedy this, it is possible to replace method (26) with a temporal difference algorithm that is mathematically equivalent but allows the computational workload to be spread out over the entire sequence of observations. Temporal difference methods are a particular approach to updating the weights based on the values of successive predictions, y^τ, y^(τ+1). Sutton (1988) showed that the following temporal difference algorithm is equivalent to method (26)

δw = α (y^(τ+1) − y^τ) Σ_{k=1}^{τ} ∇_w y^k,   (27)

where y^(m+1) ≡ z. However, unlike method (26) this can be computed incrementally at each successive time step, since each update depends only on y^(τ+1), y^τ and the sum of ∇_w y^k over previous time steps k. The next step taken in Sutton (1988) is to generalize equation (27) to the following final form of temporal difference algorithm, known as "TD(λ)"

δw = α (y^(τ+1) − y^τ) Σ_{k=1}^{τ} λ^(τ−k) ∇_w y^k,   (28)

where λ ∈ [0, 1] is an adjustable parameter that controls the weighting on the vectors ∇_w y^k. Equation (28) represents a much broader class of learning rules than the more usual gradient descent-based rule (27), which is in fact the special case TD(1). A further special case of equation (28) is for λ = 0, i.e., TD(0), as follows

δw = α (y^(τ+1) − y^τ) ∇_w y^τ.   (29)

But for problems where y^τ is a linear function of x^τ and w, we have ∇_w y^τ = x^τ, and so equation (29) becomes

δw = α (y^(τ+1) − y^τ) x^τ.   (30)

If we assume the prediction process is being performed by a neuron with a vector of inputs x^τ, synaptic weight vector w, and output y^τ = w^T x^τ, then we see that the TD(0) algorithm (30) is identical to the error correction rule (25) with β = 1.
In understanding this Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 27 Rolls Invariant visual object recognition k k comparison with temporal difference learning, it may be useful to where the term  x is a weighted sum of the vectors kD1 note that the firing at the end of a sequence of the transformed x . This suggests generalizing the original five error correction exemplars of a stimulus is effectively the temporal difference tar- rules (21–25) by replacing the term x by a weighted sum xO D get z. This establishes a link to temporal difference learning (Rolls, k x with 2 [0, 1]. In Sutton (1988) xO is calculated kD1 j j 2008b). Further, we note that from learning epoch to learning according to epoch, the target z for a given neuron will gradually settle down to be more and more fixed as learning proceeds. xO D x C  xO (33) j j j We now explore in more detail the relation between the error correction rules described above and temporal difference learn- ing. For each sequence of observations with a single outcome the with xO  0. This gives the following five temporal difference- temporal difference method (30), when viewed as an error correc- inspired error correction rules C1 tion rule, is attempting to adapt the weights such that y D y for all successive pairs of time steps – the same general idea w D yN y xO , (34) underlying the error correction rules (21–25). Furthermore, in Sutton and Barto (1998), where temporal difference methods are w D y y xO , (35) applied to reinforcement learning, the TD() approach is again w D y y xO , (36) C1 j further generalized by replacing the target y by any weighted C1 average of predictions y from arbitrary future timesteps, e.g., w D y y xO , (37) 1 C3 1 C7 t D y C y , including an exponentially weighted average C1 2 2 w D y y xO , (38) extending forward in time. 
So a more general form of the temporal difference algorithm has the form

  δw^τ = α(t^τ − y^τ) x^τ,  (31)

where here the target t^τ is an arbitrary weighted average of the predictions y over future timesteps. Of course, with standard temporal difference methods the target t^τ is always an average over future timesteps k = τ+1, τ+2, etc. But in the five error correction rules this is only true for the last exemplar (25). This is because with the problem of prediction, for example, the ultimate target of the predictions y^1, …, y^m is a final outcome y^{m+1} ≡ z. However, this restriction does not apply to our particular application of neurons trained to respond to temporal classes of inputs within VisNet. Here we only wish to set the firing rates y^1, …, y^m to the same value, not to some final given value z. However, the more general error correction rules clearly have a close relationship to standard temporal difference algorithms. For example, it can be seen that equation (22) with β = 1 is in some sense a temporal mirror image of equation (30), particularly if the updates δw are added to the weights w only at the end of a sequence. That is, rule (22) will attempt to set y^1, …, y^m to an initial value y^0 ≈ 0. This relationship to temporal difference algorithms allows us to begin to exploit established temporal difference analyses to investigate the convergence properties of the error correction methods (Rolls and Stringer, 2001).

Although the main aim of Rolls and Stringer (2001) in relating error correction rules to temporal difference learning was to begin to exploit established temporal difference analyses, they observed that the most general form of temporal difference learning, TD(λ), in fact suggests an interesting generalization to the existing error correction learning rules, for which we currently have λ = 0. Assuming y = w^T x and ∇_w y^τ = x^τ, the general equation (28) for TD(λ) becomes

  δw^τ = α(y^{τ+1} − y^τ) Σ_{k=1}^{τ} λ^{τ−k} x^k,  (32)

where the term Σ_{k=1}^{τ} λ^{τ−k} x^k is a weighted sum of the vectors x^k. This suggests generalizing the original five error correction rules (21–25) by replacing the term x^τ by a weighted sum x̂^τ = Σ_{k=1}^{τ} λ^{τ−k} x^k with λ ∈ [0, 1]. In Sutton (1988) x̂^τ is calculated according to

  x̂_j^τ = x_j^τ + λ x̂_j^{τ−1},  (33)

with x̂_j^0 ≡ 0. This gives the following five temporal difference-inspired error correction rules:

  δw^τ = α(β ȳ^{τ−1} − y^τ) x̂^τ,  (34)
  δw^τ = α(β y^{τ−1} − y^τ) x̂^τ,  (35)
  δw^τ = α(β ȳ^{τ} − y^τ) x̂^τ,  (36)
  δw^τ = α(β ȳ^{τ+1} − y^τ) x̂^τ,  (37)
  δw^τ = α(β y^{τ+1} − y^τ) x̂^τ,  (38)

where it may be readily seen that equations (35) and (38) are special cases of equations (34) and (37), respectively, with η = 0. As with the trace ȳ, the term x̂ is reset to zero when a new stimulus is presented. These five rules can be related to the more general TD(λ) algorithm, but continue to be biologically plausible, using only local cell variables. Setting λ = 0 in rules (34–38) gives us back the original error correction rules (21–25), which may now be related to TD(0).

Numerical results with the error correction rules (34–38), with x̂_j calculated according to equation (33) with λ = 1, and with positive clipping of the weights, trained on 7 faces in 9 locations, are presented by Rolls and Stringer (2001). For each of the rules (34–38), the parameter β was individually optimized to the following respective values: 1.7, 1.8, 1.5, 1.6, 1.8. Comparing these five temporal difference-inspired rules, it was found that the best performance is obtained with rule (38), where many more cells reach the maximum level of performance possible with respect to the single cell information measure. In fact, this rule offered the best such results. This may well be due to the fact that this rule may be directly compared to the standard TD(1) learning rule, which itself may be related to classical supervised learning, for which there are well known optimality results, as discussed further by Rolls and Stringer (2001).

From the simulations described by Rolls and Stringer (2001) it appears that the form of optimization described above associated with TD(1), rather than TD(0), leads to better performance within VisNet. The TD(1)-like rule (38) with λ = 1.0 and β = 1.8 gave considerably superior results to the TD(0)-like rule (25) with β = 2.2. In fact, the former of these two rules provided the best single cell information results in these studies. We hypothesize that these results are related to the fact that only a finite set of image sequences is presented to VisNet, and so the type of optimization performed by TD(1) for repeated presentations of a finite data set is more appropriate for this problem than the form of optimization performed by TD(0).
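Rule (38) with the input trace of equation (33) can be sketched as follows. This is an illustrative reconstruction, not the authors' simulation code; the names alpha (learning rate), beta (target scaling), and lam (λ) follow the symbols used above, and the positive clipping of weights follows the description of the simulations.

```python
import numpy as np

# Sketch of the temporal difference-inspired rule (38) using the input
# trace of equation (33): x_hat^tau = x^tau + lam * x_hat^(tau-1),
# with x_hat reset to zero at the start of each stimulus sequence.

def train_rule38(w, x_seq, alpha=0.1, beta=1.0, lam=1.0):
    w = np.array(w, dtype=float)
    x_hat = np.zeros_like(w)              # trace reset for a new stimulus
    for tau in range(len(x_seq) - 1):
        x_hat = x_seq[tau] + lam * x_hat  # equation (33)
        y_now = w @ x_seq[tau]
        y_next = w @ x_seq[tau + 1]       # instantaneous firing y^(tau+1)
        w += alpha * (beta * y_next - y_now) * x_hat   # rule (38)
        w = np.maximum(w, 0.0)            # positive clipping of weights
    return w
```

With beta = 1 and lam = 0 the update reduces to the TD(0)-like rule (25), as stated in the text.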
5.3.4. Discussion of the different training rules
In terms of biological plausibility, we note the following. First, all the learning rules investigated by Rolls and Stringer (2001) are local learning rules, and in this sense are biologically plausible (Rolls and Treves, 1998; Rolls, 2008b). (The rules are local in that the terms used to modify the synaptic weights are potentially available in the pre- and post-synaptic elements.)

Second, we note that all the rules do require some evidence of the activity on one or more previous stimulus presentations to be available when the synaptic weights are updated. Some of the rules, e.g., learning rule (23), use the trace ȳ from the current time level, while rules (22) and (25) do not need to use an exponential trace of the neuronal firing rate, but only the instantaneous firing rates y at two successive time steps.
It is known that synaptic plasticity does involve a combination of separate processes, each with potentially differing time courses (Koch, 1999), and these different processes could contribute to trace rule learning. Another mechanism suggested for implementing a trace of previous neuronal activity is the continuing firing for often 300 ms produced by a short (16 ms) presentation of a visual stimulus (Rolls and Tovee, 1994), which is suggested to be implemented by local cortical recurrent attractor networks (Rolls and Treves, 1998).

Third, we note that in utilizing the trace in the targets t^τ, the error correction (or temporal difference-inspired) rules perform a comparison of the instantaneous firing y^τ with a temporally nearby value of the activity, and this comparison involves a subtraction. The subtraction provides an error, which is then used to increase or decrease the synaptic weights. This is a somewhat different operation from long-term depression (LTD) as well as long-term potentiation (LTP), which are associative changes that depend on the pre- and post-synaptic activity. However, it is interesting to note that an error correction rule which appears to involve a subtraction of current firing from a target might be implemented by a combination of an associative process operating with the trace, and an anti-Hebbian process operating to remove the effects of the current firing. For example, the synaptic updates δw_j = α(t − y)x_j can be decomposed into two separate associative processes, α t x_j and −α y x_j, that may occur independently. (The target, t, could in this case be just the trace of previous neural activity from the preceding trials, excluding any contribution from the current firing.) Another way to implement an error correction rule using associative synaptic modification would be to force the post-synaptic neuron to respond to the error term. Although this has been postulated to be an effect which could be implemented by the climbing fiber system in the cerebellum (Ito, 1984, 1989; Rolls and Treves, 1998), there is no similar system known for the neocortex, and it is not clear how this particular implementation of error correction might operate in the neocortex.
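The decomposition just described can be checked numerically. This is an illustrative sketch, not code from the paper; the function names are assumptions.

```python
import numpy as np

# The error correction update alpha*(t - y)*x equals the sum of a
# Hebb-like associative term alpha*t*x and an anti-Hebbian term
# -alpha*y*x, so the two could in principle occur as independent
# processes, provided y is the firing computed before either update.

def error_correction_update(w, x, t, alpha):
    y = w @ x
    return w + alpha * (t - y) * x

def two_process_update(w, x, t, alpha):
    y = w @ x                 # current firing, computed first
    w = w + alpha * t * x     # associative process with the target t
    w = w - alpha * y * x     # anti-Hebbian removal of current firing
    return w
```

Both functions produce identical weight vectors for any inputs, which is the algebraic point being made in the text.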
In Section 5.3.2 we describe five learning rules as error correction rules. We now discuss an interesting difference between these error correction rules and error correction rules as conventionally applied. It is usual to derive the general form of an error correction learning rule from gradient descent minimization in the following way (Hertz et al., 1991). Consider the idealized situation of a single neuron with a number of inputs x_j and output y = Σ_j w_j x_j, where the w_j are the synaptic weights. We assume that there are a number of input patterns and that for the kth input pattern, x^k = [x_1^k, x_2^k, …]^T, the output y^k has a target value t^k. Hence an error measure or cost function can be defined as

  e(w) = (1/2) Σ_k (t^k − y^k)² = (1/2) Σ_k (t^k − Σ_j w_j x_j^k)².  (39)

This cost function is a function of the input patterns x^k and the synaptic weight vector w = [w_1, w_2, …]^T. With a fixed set of input patterns, we can reduce the error measure by employing a gradient descent algorithm to calculate an improved set of synaptic weights. Gradient descent achieves this by moving downhill on the error surface defined in w space, using the update

  δw_j = −α ∂e/∂w_j = α Σ_k (t^k − y^k) x_j^k.  (40)

If we update the weights after each pattern k, then the update takes the form of an error correction rule

  δw_j = α (t^k − y^k) x_j^k,  (41)

which is also commonly referred to as the delta rule or Widrow–Hoff rule (see Widrow and Hoff, 1960; Widrow and Stearns, 1985). Error correction rules continually compare the neuronal output with its pre-specified target value and adjust the synaptic weights accordingly. In contrast, the way Rolls and Stringer (2001) introduced of utilizing error correction is to specify the target as the activity trace based on the firing rate at nearby timesteps. Now the actual firing at those nearby time steps is not a pre-determined fixed target, but instead depends on how the network has actually evolved. This effectively means that the cost function e(w) that is being minimized changes from timestep to timestep. Nevertheless, the concept of calculating an error, and using the magnitude and direction of the error to update the synaptic weights, is the similarity Rolls and Stringer (2001) drew to gradient descent learning.
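A minimal sketch of the delta (Widrow–Hoff) rule of equation (41), iterated over a fixed pattern set. This is an illustration, not code from the paper; the function name and the parameter values are assumptions.

```python
import numpy as np

# Per-pattern error correction (delta rule), equation (41):
#   delta_w_j = alpha * (t^k - y^k) * x_j^k
# for a linear neuron y = w . x, cycled over the pattern set for a
# number of epochs.

def delta_rule(w, patterns, targets, alpha=0.5, epochs=60):
    w = np.array(w, dtype=float)
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            y = w @ x
            w += alpha * (t - y) * x   # equation (41)
    return w
```

For orthogonal input patterns the updates do not interfere, and each weight converges geometrically to the value that reproduces its target exactly, the classical supervised-learning behavior referred to in the text.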
To conclude this discussion, the error correction and temporal difference rules explored by Rolls and Stringer (2001) provide interesting approaches to help understand invariant pattern recognition learning. Although we do not know whether the full power of these rules is expressed in the brain, we provided suggestions about how they might be implemented. At the same time, we note that the original trace rule used by Földiák (1991), Rolls (1992), and Wallis and Rolls (1997) is a simple associative rule, is therefore biologically very plausible, and, while not as powerful as many of the other rules introduced by Rolls and Stringer (2001), can nevertheless solve the same class of problem. Rolls and Stringer (2001) also emphasized that although they demonstrated how a number of new error correction and temporal difference rules might play a role in the context of view-invariant object recognition, these rules may also operate elsewhere where it is important for neurons to learn to respond similarly to temporal classes of inputs that tend to occur close together in time.

5.4. THE ISSUE OF FEATURE BINDING, AND A SOLUTION
In this section we investigate two key issues that arise in hierarchical layered network architectures such as VisNet, other examples of which have been described and analyzed by Fukushima (1980), Ackley et al. (1985), Rosenblatt (1961), and Riesenhuber and Poggio (1999b). One issue is whether the network can discriminate between stimuli that are composed of the same basic alphabet of features. The second issue is whether such network architectures can find solutions to the spatial binding problem. These issues are addressed next and by Elliffe et al. (2002) and Rolls (2008b).
The first issue investigated is whether a hierarchical layered network architecture of the type exemplified by VisNet can discriminate stimuli that are composed of a limited set of features, where the different stimuli include cases in which the feature sets are subsets or supersets of those in the other stimuli. An issue is that, if the network has learned representations of both the parts and the wholes, will the network identify that the whole is present when it is shown, and not just that one or more of the parts is present? (In many investigations with VisNet, complex stimuli (such as faces) were used where each stimulus might contain unique features not present in the other stimuli.) To address this issue Elliffe et al. (2002) used stimuli composed from a set of four features designed so that each feature is spatially separate from the other features, and so that no unique combination of firing, caused, for example, by overlap of horizontal and vertical filter outputs in the input representation, distinguishes any one stimulus from the others. The results described in Section 5.4.4 show that VisNet can indeed learn correct invariant representations of stimuli which do consist of feature sets where individual features do not overlap spatially with each other, and where the stimuli can be composed of sets of features which are supersets or subsets of those in other stimuli. Fukushima and Miyake (1982) did not address this crucial issue where different stimuli might be composed of subsets or supersets of the same set of features, although they did show that stimuli with partly overlapping features could be discriminated by the Neocognitron.

In Section 5.4.5 we address the spatial binding problem in architectures such as VisNet. The computational problem that needs to be addressed in hierarchical networks such as the primate visual system and VisNet is how representations of features can be (e.g., translation) invariant, yet can specify stimuli or objects in which the features occur in the correct spatial arrangement. This is the feature binding problem, discussed, for example, by von der Malsburg (1990), and arising in the context of hierarchical layered systems (Rosenblatt, 1961; Fukushima, 1980; Ackley et al., 1985). The issue is whether features are bound into the correct combinations in the correct relative spatial positions, or whether alternative combinations of known features, or the same features in different relative spatial positions, would elicit the same responses. All this has to be achieved while at the same time producing position-invariant recognition of the whole combination of features, that is, of the object. This is a major computational issue that needs to be solved for memory systems in the brain to operate correctly. It can be achieved by what is effectively a learning process that builds into the hierarchical network a set of neurons that enables the recognition process to operate correctly with the appropriate position, size, view, etc. invariances.

5.4.1. Syntactic binding of separate neuronal ensembles by synchronization
The problem of syntactic binding of neuronal representations, in which some features must be bound together to form one object, and other simultaneously active features must be bound together to represent another object, has been addressed by von der Malsburg (1990). He proposed that this could be performed by temporal synchronization of those neurons that were temporarily part of one representation, in a different time slot from other neurons that were temporarily part of another representation. The idea is attractive in allowing arbitrary relinking of features in different combinations. Singer, Engel, König, and colleagues (Singer et al., 1990; Engel et al., 1992; Singer and Gray, 1995; Singer, 1999; Fries, 2005, 2009; Womelsdorf et al., 2007), and others (Abeles, 1991), have obtained some evidence that when features must be bound, synchronization of neuronal populations can occur (but see Shadlen and Movshon, 1999), and this has been modeled (Hummel and Biederman, 1992).

Synchronization to implement syntactic binding has a number of disadvantages and limitations (Rolls and Treves, 1998, 2011; Riesenhuber and Poggio, 1999a; Rolls, 2008b). The greatest computational problem is that synchronization does not by itself define the spatial relations between the features being bound, so as a binding mechanism it is not by itself adequate for shape recognition. For example, temporal binding might enable features 1, 2, and 3, which might define one stimulus, to be bound together and kept separate from, for example, another stimulus consisting of features 2, 3, and 4, but it would require a further temporal binding (leading in the end potentially to a combinatorial explosion) to indicate the relative spatial positions of the 1, 2, and 3 in the 123 stimulus, so that it can be discriminated from, e.g., 312.

A second problem with the synchronization approach to the spatial binding of features is that, when stimulus-dependent temporal synchronization has been rigorously tested with information theoretic approaches, it has so far been found that most of the information available is in the number of spikes, with rather little, less than 5% of the total information, in stimulus-dependent synchronization (Franco et al., 2004; Rolls et al., 2004; Aggelopoulos et al., 2005; Rolls, 2008b; Rolls and Treves, 2011). For example, Aggelopoulos et al. (2005) showed that when macaques used object-based attention to search for one of two objects to touch in a complex natural scene, between 94 and 99% of the information was present in the firing rates of inferior temporal cortex neurons, and less than 5% in any stimulus-dependent synchrony that was present between the simultaneously recorded inferior temporal cortex neurons. The implication of these results is that any stimulus-dependent synchrony that is present is not quantitatively important, as measured by information theoretic analyses, under natural scene conditions when feature binding, segmentation of objects from the background, and attention are required. This has been found for the inferior temporal cortex, a brain region where features are put together to form representations of objects (Rolls and Deco, 2002; Rolls, 2008b), and where attention has strong effects, at least in scenes with blank backgrounds (Rolls et al., 2003). It would of course also be of interest to test the same hypothesis in earlier visual areas, such as V4, with quantitative, information theoretic, techniques (Rolls and Treves, 2011).
In connection with rate codes, it should be noted that a rate code implies using the number of spikes that arrive in a given time, and that this time can be very short, as little as 20–50 ms, for very useful amounts of information to be made available from a population of neurons (Tovee et al., 1993; Rolls and Tovee, 1994; Rolls et al., 1994, 1999, 2006a; Tovee and Rolls, 1995; Rolls, 2003, 2008b; Rolls and Treves, 2011).

A third problem with the synchronization or "communication through coherence" approach (Fries, 2005, 2009) is that, when information transmission between connected networks is analyzed, synchronization is not produced at the levels of synaptic strength necessary for information transmission between the networks, and indeed does not appear to affect the information transmission between a pair of weakly coupled networks that model weakly coupled cortical networks (Rolls et al., 2012).

In the context of VisNet, and of how the real visual system may operate to implement object recognition, the use of synchronization does not appear to match the way in which the visual system is organized. For example, von der Malsburg's argument would indicate that, using only a two-layer network, synchronization could provide the necessary feature linking to perform object recognition with relatively few neurons, because they can be reused again and again, linked differently for different objects. In contrast, the primate uses a considerable part of its cortex, perhaps 50% in monkeys, for visual processing, with therefore what could be of the order of 6 × 10^8 neurons and 6 × 10^12 synapses involved (Rolls, 2008b), so that the solution adopted by the real visual system may be one which relies on many neurons with simpler processing than the arbitrary syntax implemented by synchronous firing of separate assemblies suggests. On the other hand, a solution such as that investigated by VisNet, which forms low-order combinations of what is represented in previous layers, is very demanding in terms of the number of neurons required, and this matches what is found in the primate visual system.

5.4.2. Sigma-Pi neurons
Another approach to a binding mechanism is to group spatial features based on local mechanisms that might operate for closely adjacent synapses on a dendrite (in what is a Sigma-Pi type of neuron; see Section 7; Finkel and Edelman, 1987; Mel et al., 1998; Rolls, 2008b). A problem for such architectures is how to force one particular neuron to respond to the same feature combination invariantly with respect to all the ways in which that feature combination might occur in a scene.

5.4.3. Binding of features and their relative spatial position by feature combination neurons
The approach to the spatial binding problem that is proposed for VisNet is that individual neurons at an early stage of processing are set up (by learning) to respond to low-order combinations of input features occurring in a given relative spatial arrangement and position on the retina (Rolls, 1992, 1994, 1995; Wallis and Rolls, 1997; Rolls and Treves, 1998; Elliffe et al., 2002; Rolls and Deco, 2002; cf. Feldman, 1985). (By low-order combinations of input features we mean combinations of a few input features. By forming neurons that respond to combinations of a few features in the correct spatial arrangement, the advantages of the scheme for syntactic binding are obtained, yet without the combinatorial explosion that would result if the feature combination neurons responded to combinations of many input features, so producing potentially very specifically tuned neurons which very rarely responded.) Invariant representations are then developed in the next layer from these feature combination neurons, which already contain evidence on the local spatial arrangement of features. Finally, in later layers, only one stimulus would be specified by the particular set of low-order feature combination neurons present, even though each feature combination neuron would itself be somewhat invariant. The overall design of the scheme is shown in Figure 9. Evidence that many neurons in V1 respond to combinations of spatial features with the correct spatial configuration is now starting to appear (see Section 4), and neurons that respond to feature combinations (such as two lines with a defined angle between them, and an overall orientation) are found in V2 (Hegde and Van Essen, 2000; Ito and Komatsu, 2004). The tuning of a VisNet layer 1 neuron to a combination of features in the correct relative spatial position is illustrated in Figures 12 and 13.

5.4.4. Discrimination between stimuli with super- and sub-set feature combinations
Some investigations with VisNet (Wallis and Rolls, 1997) have involved groups of stimuli that might be identified by some unique feature common to all transformations of a particular stimulus. This might allow VisNet to solve the problem of transform invariance by simply learning to respond to a unique feature present in each stimulus. For example, even in the case where VisNet was trained on invariant discrimination of T, L, and C, the representation of the T stimulus at the spatial-filter level inputs to VisNet might contain unique patterns of filter outputs where the horizontal and vertical parts of the T join. The unique filter outputs thus formed might distinguish the T from, for example, the L.

Elliffe et al. (2002) tested whether VisNet is able to form transform invariant cells with stimuli that are specially composed from a common alphabet of features, with no stimulus containing any firing in the spatial-filter inputs to VisNet that is not present in at least one of the other stimuli. The limited alphabet enables the set of stimuli to consist of feature sets which are subsets or supersets of those in the other stimuli.

For these experiments the common pool of stimulus features chosen was a set of two horizontal and two vertical 8 × 1 bars, each aligned with the sides of a 32 × 32 square. The stimuli can be constructed by arbitrary combination of these base level features. We note that effectively the stimulus set consists of four features, a top bar (T), a bottom bar (B), a left bar (L), and a right bar (R). Figure 19 shows the complete set used, containing the possible image feature combinations. Subsequent discussion will group these objects by the number of features each contains: single-, double-, triple-, and quadruple-feature objects correspond to the respective rows of Figure 19. Stimuli are referred to by the list of features they contain; e.g., "LBR" contains the left, bottom, and right features, while "TL" contains top and left only. Further details of how the stimuli were prepared are provided by Elliffe et al. (2002).

FIGURE 19 | Merged feature objects. All members of the full object set are shown, using a dotted line to represent the central 32 × 32 square on which the individual features are positioned, with the features themselves shown as dark line segments. Nomenclature is by acronym of the features present, where T, top; B, bottom; L, left; and R, right. (After Elliffe et al., 2002.)

To train the network, a stimulus was presented in a randomized sequence of nine locations in a square grid across the 128 × 128 input retina of VisNet2. The central location of the square grid was in the center of the "retina," and the eight other locations were offset 8 pixels horizontally and/or vertically from this. Two different learning rules were used, "Hebbian" (16) and "trace" (17), and also an untrained condition with random weights. As in earlier work (Wallis and Rolls, 1997; Rolls and Milward, 2000), only the trace rule led to any cells with invariant responses, and the results shown are for networks trained with the trace rule.

The results with VisNet trained on the set of stimuli shown in Figure 19 with the trace rule are as follows. First, it was found that single neurons in the top layer learned to differentiate between the stimuli, in that the responses of individual neurons were maximal for one of the stimuli and absent for any of the other stimuli, invariantly with respect to location. Moreover, the translation invariance was perfect for every stimulus (by different neurons) over every location (for all stimuli except "RTL" and "TLBR").

The results presented show clearly that the VisNet paradigm can accommodate networks that can perform invariant discrimination of objects that have a subset–superset relationship. The result has important consequences for feature binding and for discriminating a stimulus from other stimuli which may be supersets of it. For example, a VisNet cell which responds invariantly to feature combination TL can genuinely signal the presence of exactly that combination, and will not necessarily be activated by T alone, or by TLB. The basis for this separation by competitive networks of stimuli which are subsets and supersets of each other is described by Rolls and Treves (1998, Section 4.3.6) and by Rolls (2008b).

5.4.5. Feature binding in a hierarchical network with invariant representations of local feature combinations
In this section we consider the ability of output layer neurons to learn new stimuli if the lower layers are trained solely through exposure to simpler feature combinations from which the new stimuli are composed.
A key question we address is how invariant representations of low-order feature combinations in the early layers of the visual system are able to uniquely specify the correct spatial arrangement of features in the overall stimulus, and so contribute to preventing false recognition errors in the output layer.

The problem, and its proposed solution, can be treated as follows. Consider an object 1234 made from the features 1, 2, 3, and 4. The invariant low-order feature combinations might represent 12, 23, and 34. Then if neurons at the next layer respond to combinations of the activity of these neurons, the only neurons in the next layer that would respond would be those tuned to 1234, and not, for example, to 3412, which is distinguished from 1234 by the input of a pair neuron responding to 41 rather than to 23. The argument (Rolls, 1992) is that low-order spatial-feature combination neurons in the early stage contain sufficient spatial information so that a particular combination of those low-order feature combination neurons specifies a unique object, even if the relative positions of the low-order feature combination neurons are not known, because they are somewhat invariant.
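The combinatorial point can be made concrete with a small sketch (illustrative only, not from the paper): a code made of adjacent low-order feature pairs distinguishes the arrangement 1234 from 3412, which a bag of single invariant features cannot.

```python
# Encode an object (a string of feature labels) in two ways.

def single_feature_code(obj):
    # invariant single features: all arrangement information is lost
    return set(obj)

def pair_code(obj):
    # neurons tuned to adjacent feature pairs (e.g. '12', '23', '34'),
    # each pair carrying the local spatial arrangement of its features
    return {obj[i:i + 2] for i in range(len(obj) - 1)}
```

Here single_feature_code gives the same set for "1234" and "3412", whereas pair_code for "3412" contains the pair "41" where "1234" contains "23", so a next-layer neuron reading the pair code can separate the two arrangements.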
The central location of the square grid are, as described above, the output of oriented spatial-frequency was in the center of the “retina,” and the eight other locations were tuned filters, and the combinations of these formed in layer 1 offset 8 pixels horizontally and/or vertically from this. Two differ- might thus be thought of in a simple way as, for example, a T or ent learning rules were used, “Hebbian” (16), and “trace” (17), and an L or for that matter a Y. Then in layer 2, application of the trace also an untrained condition with random weights. As in earlier rule might enable neurons to respond to a T with limited spatial work (Wallis and Rolls, 1997; Rolls and Milward, 2000) only the invariance (limited to the size of the region of layer 1 from which trace rule led to any cells with invariant responses, and the results layer 2 cells receive their input). Then an “object” such as H might shown are for networks trained with the trace rule. be formed at a higher layer because of a conjunction of two Ts in The results with VisNet trained on the set of stimuli shown in the same small region. Figure 19 with the trace rule are as follows. First, it was found that To show that VisNet can actually solve this problem, Elliffe et al. single neurons in the top layer learned to differentiate between the (2002) performed the experiments described next. They trained stimuli in that the responses of individual neurons were maximal the first two layers of VisNet with feature pair combinations, for one of the stimuli and had no response to any of the other stim- forming representations of feature pairs with some translation uli invariantly with respect to location. Moreover, the translation invariance in layer 2. Then they used feature triples as input stimuli, invariance was perfect for every stimulus (by different neurons) allowed no more learning in layers 1 and 2, and then investi- over every location (for all stimuli except “RTL” and “TLBR”). 
gated whether layers 3 and 4 could be trained to produce invariant representations of the triples, where the triples could only be distinguished if the local spatial arrangement of the features within the triple had effectively to be encoded in order to distinguish the different triples. For this experiment, they needed stimuli that could be specified in terms of a set of different features (they chose vertical (1), diagonal (2), and horizontal (3) bars), each capable of being shown at a set of different relative spatial positions (designated A, B, and C), as shown in Figure 20.

The results presented in Section 5.4.4 showed clearly that the VisNet paradigm can accommodate networks that can perform invariant discrimination of objects that have a subset–superset relationship. That result has important consequences for feature binding, and for discriminating stimuli from other stimuli which may be supersets of the first stimulus. For example, a VisNet cell which responds invariantly to the feature combination TL can genuinely signal the presence of exactly that combination, and will not necessarily be activated by other stimuli of which TL is a subset.

The stimuli are thus defined in terms of what features are present and their precise spatial arrangement with respect to each other. The length of the horizontal and vertical feature bars shown in Figure 20 is 8 pixels. To train the network, a stimulus (that is, a pair or triple feature combination) is presented in a randomized sequence of nine locations in a square grid across the 128 × 128 input retina. The central location of the square grid is in the center of the "retina," and the eight other locations are offset 8 pixels horizontally and/or vertically from this. We refer to the two- and three-feature stimuli as "pairs" and "triples," respectively. Individual stimuli are denoted by three numbers which refer to the individual features present in positions A, B, and C, respectively. For example, a stimulus with positions A and C containing a vertical and a diagonal bar, respectively, would be referred to as stimulus 102, where the 0 denotes no feature present in position B. In total there are 18 pairs (120, 130, 210, 230, 310, 320, 012, 013, 021, 023, 031, 032, 102, 103, 201, 203, 301, 302) and 6 triples (123, 132, 213, 231, 312, 321). This nomenclature not only defines which features are present within objects, but also the spatial relationships of their component features.

Then the computational problem can be illustrated by considering the triple 123. If invariant representations are formed of single features, then there would be no way that neurons higher in the hierarchy could distinguish the object 123 from 213 or any other arrangement of the three features. An approach to this problem (see, e.g., Rolls, 1992) is to form, early on in the processing, neurons that respond to overlapping combinations of features in the correct spatial arrangement, and then to develop invariant representations in the next layer from these neurons, which already contain evidence on the local spatial arrangement of features. An example might be that with the object 123, the invariant feature pairs would represent 120, 023, and 103. Then if neurons at the next layer correspond to combinations of these neurons, the only next-layer neurons that would respond would be those tuned to 123, not to, for example, 213. The argument is that the low-order spatial-feature combination neurons in the early stage contain sufficient spatial information so that a particular combination of those low-order feature combination neurons specifies a unique object, even if the relative positions of the low-order feature combination neurons are not known, because these neurons are somewhat translation-invariant (cf. also Fukushima, 1988).

The stimuli used in the experiments of Elliffe et al. (2002) were constructed from pre-processed component features as discussed in Section 5.4.4. That is, base stimuli containing a single feature were constructed and filtered, and then the pairs and triples were constructed by merging these pre-processed single-feature images.

In the first experiment, layers 1 and 2 of VisNet were trained with the 18 feature pairs, each stimulus being presented in sequences of 9 locations across the input. This led to the formation of neurons that responded to the feature pairs with some translation invariance in layer 2. Then they trained layers 3 and 4 on the 6 feature triples in the same 9 locations, while allowing no more learning in layers 1 and 2, and examined whether the output layer of VisNet had developed transform-invariant neurons to the 6 triples. The idea was to test whether layers 3 and 4 could be trained to produce invariant representations of the triples in a situation where the local spatial arrangement of the features within each triple had effectively to be encoded in order to distinguish the different triples. The results from this experiment were compared and contrasted with results from three other experiments which involved different training regimes for layers 1, 2 and layers 3, 4. All four experiments are summarized in Table 5. Experiment 2 involved no training in layers 1, 2, 3, and 4, with the synaptic weights left unchanged from their initial random values. These results are included as a baseline performance with which to compare the results from experiments 1, 3, and 4. The model parameters used in these experiments were as described by Rolls and Milward (2000) and Rolls and Stringer (2001).
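The pair/triple nomenclature just described can be enumerated mechanically, which also gives the single cell information ceiling quoted below (log2 of the number of triples). The following is a minimal sketch in Python; the function and variable names are illustrative and are not taken from the VisNet code:

```python
import math
from itertools import product

# Features: 1 = vertical, 2 = diagonal, 3 = horizontal bar; '0' = empty position.
# A stimulus is a string of three digits giving the features at positions A, B, C.

def stimuli(n_features):
    """Return every stimulus containing exactly n_features distinct bars."""
    out = []
    for digits in product("0123", repeat=3):
        bars = [d for d in digits if d != "0"]
        if len(bars) == n_features and len(set(bars)) == n_features:
            out.append("".join(digits))
    return out

pairs = stimuli(2)    # the 18 pairs, e.g. '102' = vertical at A, diagonal at C
triples = stimuli(3)  # the 6 triples: the permutations of '123'

# Ceiling on the single cell information measure when discriminating the triples:
max_info = math.log2(len(triples))  # log2(6) = 2.58... bits, the '2.6 bits' quoted later
```

Run as-is, this reproduces the 18 pairs and 6 triples listed above, confirming that the nomenclature encodes both feature identity and relative spatial position.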
FIGURE 20 | Feature combinations for the experiments of Section 5.4.5: there are 3 features denoted by 1, 2, and 3 (together with a blank space, 0) that can be placed in any of 3 positions A, B, and C. Individual stimuli are denoted by three consecutive numbers which refer to the individual features present in positions A, B, and C, respectively. In the experiments in Section 5.4.5, layers 1 and 2 were trained on stimuli consisting of pairs of the features, and layers 3 and 4 were trained on stimuli consisting of triples. Then the network was tested to show whether layer 4 neurons would distinguish between triples, even though the first two layers had only been trained on pairs. In addition, the network was tested to show whether individual cells in layer 4 could distinguish between triples even in locations where the triples were not presented during training. (After Elliffe et al., 2002.)

Table 5 | The different training regimes used in VisNet experiments 1–4 of Section 5.4.5.

                Layers 1, 2          Layers 3, 4
Experiment 1    Trained on pairs     Trained on triples
Experiment 2    No training          No training
Experiment 3    No training          Trained on triples
Experiment 4    Trained on triples   Trained on triples

In the no-training condition the synaptic weights were left at their initial untrained random values.

FIGURE 21 | Numerical results for experiments 1–4 as described in Table 5, with the trace-learning rule (17). On the left are single cell information measures, and on the right are multiple cell information measures. (After Elliffe et al., 2002.)

In Figure 21 we present numerical results for the four experiments listed in Table 5. On the left are the single cell information measures for all top (4th) layer neurons ranked in order of their
invariance to the triples, while on the right are multiple cell information measures. To help to interpret these results we can compute the maximum single cell information measure according to

    Maximum single cell information = log2(Number of triples),    (42)

where the number of triples is 6. This gives a maximum single cell information measure of 2.6 bits for these test cases. First, comparing the results for experiment 1 with the baseline performance of experiment 2 (no training) demonstrates that even with the first two layers trained to form invariant responses to the pairs, and then only layers 3 and 4 trained on the feature triples, layer 4 is indeed capable of developing translation-invariant neurons that can discriminate effectively between the 6 different feature triples. Indeed, from the single cell information measures it can be seen that a number of cells have reached the maximum level of performance in experiment 1. In addition, the multiple cell information analysis presented in Figure 21 shows that all the stimuli could be discriminated from each other by the firing of a number of cells. Analysis of the response profiles of individual cells showed that a fourth layer cell could respond to one of the triple feature stimuli and have no response to any other of the triple feature stimuli, invariantly with respect to location.

A comparison of the results from experiment 1 with those from experiment 3 (see Table 5 and Figure 21) reveals that training the first two layers to develop neurons that respond invariantly to the pairs (performed in experiment 1) actually leads to improved invariance of 4th layer neurons to the triples, as compared with when the first two layers are left untrained (experiment 3).

Two conclusions follow from these results (Elliffe et al., 2002). First, a hierarchical network that seeks to produce invariant representations in the way used by VisNet can solve the feature binding problem. In particular, when feature pairs in layer 2 with some translation invariance are used as the input to later layers, these later layers can nevertheless build invariant representations of objects where all the individual features in the stimulus must occur in the correct spatial position relative to each other. This is possible because the feature combination neurons formed in the first layer (which could be trained just with a Hebb rule) do respond to combinations of input features in the correct spatial configuration, partly because of the limited size of their receptive fields. The second conclusion is that even though early layers can in this case only respond to small feature subsets, these provide, with no further training of layers 1 and 2, an adequate basis for learning to discriminate in layers 3 and 4 stimuli consisting of combinations of larger numbers of features. Indeed, comparing results from experiment 1 with experiment 4 (in which all layers were trained on triples, see Table 5) demonstrates that training the lower layer neurons to develop invariant responses to the pairs offers almost as good performance as training all layers on the triples (see Figure 21).

5.4.6. Stimulus generalization to untrained transforms of new objects
Another important aspect of the architecture of VisNet is that it need not be trained with every stimulus in every possible location. Indeed, part of the hypothesis (Rolls, 1992) is that training the early layers (e.g., 1–3) with a wide range of visual stimuli will set up feature analyzers in these early layers which are appropriate later on, with no further training of the early layers, for new objects. For example, presentation of a new object might result in large numbers of low-order feature combination neurons in the early layers of VisNet being active, but the particular set of feature combination neurons active would be different for the new object. The later layers of the network (in VisNet, layer 4) would then learn this new set of active layer 3 neurons as encoding the new object. However, if the new object was then shown in a new location, the same set of layer 3 neurons would be active, because they respond with spatial invariance to feature combinations, and given that the layer 3–4 connections had already been set up by the new object, the correct layer 4 neurons would be activated by the new object in its new untrained location, without any further training.

To test this hypothesis, Elliffe et al. (2002) repeated the general procedure of experiment 1 of Section 5.4.5, training layers 1 and 2 with feature pairs, but then trained layers 3 and 4 on the triples in only 7 of the original 9 locations. The crucial test was to determine whether VisNet could form top layer neurons that responded invariantly to the 6 triples when presented over all nine locations, not just the seven locations at which the triples had been presented during training.

It was found that VisNet is still able to develop some fourth layer neurons with perfect invariance, that is, with invariant responses over all nine locations, as shown by the single cell information analysis. The response profiles of individual fourth layer cells showed that they can continue to discriminate between the triples even in the two locations where the triples were not presented during training. In addition, the multiple cell analysis showed that a small population of cells was able to discriminate between all of the stimuli irrespective of location, even though for two of the test locations the triples had not been trained at those particular locations during the training of layers 3 and 4.

The use of transformation rules learned by early stages of the hierarchy to enable later stages to perform correctly on transformed views of objects never seen before is now being investigated by others (Leibo et al., 2010).

5.4.7. Discussion of feature binding in hierarchical layered networks
Elliffe et al. (2002) thus first showed (see Section 5.4.4) that hierarchical feature-detecting neural networks can learn to respond differently to stimuli that consist of unique combinations of non-unique input features, and that this extends to stimuli that are direct subsets or supersets of the features present in other stimuli.

Second, Elliffe et al. (2002) investigated (see Section 5.4.5) the hypothesis that hierarchical layered networks can produce identification of unique stimuli even when the feature combination neurons used to define the stimuli are themselves partly translation-invariant. The stimulus identification should work correctly because feature combination neurons in which the spatial features are bound together with high spatial precision are formed in the first layer. Then at later layers, when neurons with some translation invariance are formed, the neurons nevertheless contain information about the relative spatial position of the original features. There is then only one object which will be consistent with the set of active neurons at earlier layers, which, though somewhat translation-invariant as combination neurons, reflect in the activity of each neuron information about the original spatial position of the features. I note that the trace rule training used in the early layers (1 and 2) in experiments 1 and 4 would set up partly invariant feature combination neurons, and yet the late layers (3 and 4) were able to produce during training neurons in layer 4 that responded to stimuli that consisted of unique spatial arrangements of lower order feature combinations. Moreover, and very interestingly, Elliffe et al. (2002) were able to demonstrate that VisNet layer 4 neurons would respond correctly to visual stimuli at untrained locations, provided that the feature subsets had been trained in the early layers of the network at all locations, and that the whole stimulus had been trained at some locations in the later layers of the network.

The results described by Elliffe et al. (2002) thus provide one solution to the feature binding problem. The solution which has been shown to work in the model is that in a multilayer competitive network, feature combination neurons which encode the spatial arrangement of the bound features are formed at intermediate layers of the network. Then neurons at later layers of the network which respond to combinations of active intermediate-layer neurons do contain sufficient evidence about the local spatial arrangement of the features to identify stimuli, because the local spatial arrangement is encoded by the intermediate-layer neurons. The information required to solve the visual feature binding problem thus becomes encoded by self-organization into what become hard-wired properties of the network. In this sense, feature binding is not solved at run-time by the necessity to instantaneously set up arbitrary syntactic links between sets of co-active neurons. The computational solution proposed for the superset/subset aspect of the binding problem will apply in principle to other multilayer competitive networks, although the issues considered here have not been explicitly addressed in architectures such as the Neocognitron (Fukushima and Miyake, 1982).

Consistent with these hypotheses about how VisNet operates to achieve, by layer 4, position-invariant responses to stimuli defined by combinations of features in the correct spatial arrangement, investigations of the effective stimuli for neurons in intermediate layers of VisNet showed the following. In layer 1, cells responded to the presence of individual features, or to low-order combinations of features (e.g., a pair of features) in the correct spatial arrangement at a small number of nearby locations. In layers 2 and 3, neurons responded to single features or to higher order combinations of features (e.g., stimuli composed of feature triples) in more locations. These findings provide direct evidence that VisNet does operate as described above to solve the feature binding problem.

A further issue with hierarchical multilayer architectures such as VisNet is that false binding errors might occur in the following way (Mozer, 1991; Mel and Fiser, 2000). Consider the output of one layer in such a network in which there is information only about which pairs are present. How then could a neuron in the next layer discriminate between the whole stimulus (such as the triple 123 in the above experiment) and what could be considered a more distributed stimulus, or multiple different stimuli, composed of the separated subparts of that stimulus (e.g., the pairs 120, 023, and 103 occurring in 3 of the 9 training locations in the above experiment)? The problem here is to distinguish a single object from multiple other objects containing the same component combinations (e.g., pairs). We propose that part of the solution to this general problem in real visual systems is implemented through lateral inhibition between neurons in individual layers, and that this mechanism, implemented in VisNet, acts to reduce the possibility of false recognition errors in the following two ways.

First, consider the situation in which neurons in layer N have learned to represent low-order feature combinations with location invariance, and where a neuron n in layer N + 1 has learned to respond to a particular set Ω of these feature combinations. The problem is that neuron n receives the same input from layer N as long as the same set Ω of feature combinations is present, and cannot distinguish between different spatial arrangements of these feature combinations. The question is how neuron n can respond only to a particular favored spatial arrangement Ψ of the feature combinations contained within the set Ω. We suggest that as the favored spatial arrangement Ψ is altered by rearranging the spatial relationships of the component feature combinations, the new feature combinations that are formed in new locations will stimulate additional neurons nearby in layer N + 1, and these will tend to inhibit the firing of neuron n. Thus, lateral inhibition within a layer will have the effect of making neurons more selective, ensuring that neuron n responds only to a single spatial arrangement Ψ from the set of feature combinations Ω, and hence reducing the possibility of false recognition.

The second way in which lateral inhibition may help to reduce binding errors is through limiting the sparseness of neuronal firing rates within layers. In our discussion above, the spurious stimuli that we suggested might lead to false recognition of triples were obtained from splitting up the component feature combinations (pairs) so that they occurred in separate training locations. However, this would lead to an increase in the number of features present in the complete stimulus; triples contain 3 features, while their spurious counterparts would contain 6 features (resulting from 3 separate pairs). For this trivial example the increase in the number of features is not dramatic, but if we consider, say, stimuli composed of 4 features where the component feature combinations represented by the lower layers might be triples, then to form spurious stimuli we would need to use 12 features (resulting from 4 triples occurring in separate locations). But if the lower layers also represented all possible pairs, then the number of features required in the spurious stimuli would increase further. In fact, as the size of the stimulus increases in terms of the number of features, and as the size of the component feature combinations represented by the lower layers increases, there is a combinatorial explosion in the number of features required as we attempt to construct spurious stimuli that trigger false recognition. The construction of such spurious stimuli will then be prevented by setting a limit on the sparseness of firing rates within layers, which will in turn set a limit on the number of features that can be represented. Lateral inhibition is likely to contribute in both these ways to the performance of VisNet when the stimuli consist of subsets and supersets of each other, as described in Section 5.4.4.

Another way in which the problem of multiple objects is addressed is by limiting the size of the receptive fields of inferior temporal cortex neurons, so that neurons in IT respond primarily to the object being fixated, but with nevertheless some asymmetry in the receptive fields (see Section 5.9). Multiple objects are then "seen" by virtue of being added to a visuo-spatial scratchpad (Rolls, 2008b).

A related issue that arises in this class of network is whether forming neurons that respond to feature combinations in the way described here leads to a combinatorial explosion in the number of neurons required. The solution proposed for this issue is to form only low-order combinations of features at any one stage of the network (Rolls, 1992; cf. Feldman, 1985). Using low-order combinations limits the number of neurons required, yet enables the type of computation that relies on feature combination neurons analyzed here still to be performed. The actual number of neurons required depends also on the redundancies present in the statistics of real-world images. Even given these factors, it is likely that a large number of neurons would be required if the ventral visual system performs the computation of invariant representations in the manner captured by the hypotheses implemented in VisNet. Consistent with this, a considerable part of the non-human primate brain is devoted to visual information processing. The fact that large numbers of neurons and a multilayer organization are present in the primate ventral visual system is thus consistent with the type of model of visual information processing described here.

5.5. OPERATION IN A CLUTTERED ENVIRONMENT
In this section we consider how hierarchical layered networks of the type exemplified by VisNet operate in cluttered environments. Although there has been much work involving object recognition in cluttered environments with artificial vision systems, many such systems typically rely on some form of explicit segmentation followed by a search and template matching procedure (see Ullman, 1996 for a general review). In natural environments, objects may not only appear against cluttered (natural) backgrounds, but the object may also be partially occluded. Biological nervous systems operate in quite a different manner from those artificial vision systems that rely on search and template matching, and the way in which biological systems cope with cluttered environments and partial occlusion is likely to be quite different also.

One of the factors that will influence the performance of the type of architecture considered here, hierarchically organized series of competitive networks, which form one class of approaches to biologically relevant networks for invariant object recognition (Fukushima, 1980; Poggio and Edelman, 1990; Rolls, 1992, 2008b; Wallis and Rolls, 1997; Rolls and Treves, 1998), is how lateral inhibition and competition are managed within a layer. Even if an object is not obscured, the effect of a cluttered background will be to fire additional neurons, which will in turn to some extent compete with and inhibit those neurons that are specifically tuned to respond to the desired object. Moreover, where the clutter is adjacent to part of the object, the feature analyzing neurons activated against a blank background might be different from those activated against a cluttered background, if there is no explicit segmentation process. We consider these issues next, following the investigations of Stringer and Rolls (2000).

5.5.1. VisNet simulations with stimuli in cluttered backgrounds
In this section we show that recognition of objects learned previously against a blank background is hardly affected by the presence of a natural cluttered background. We go on to consider what happens when VisNet is set the task of learning new stimuli presented against cluttered backgrounds.

The images used for training and testing VisNet in the simulations described next, performed by Stringer and Rolls (2000), were specially constructed. There were 7 face stimuli, approximately 64 pixels in height, constructed without backgrounds. In addition there were 3 possible backgrounds: a blank background (gray-scale 127, where the range is 0–255), and two cluttered backgrounds, shown in Figure 22, which are 128 × 128 pixels in size. Each image presented to VisNet's 128 × 128 input retina was composed of a single face stimulus positioned at one of 9 locations on either a blank or cluttered background.
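The training sweep described in the surrounding text (each stimulus presented in a randomized sequence of the 9 locations, with a short-term memory trace of each cell's firing bridging the transforms, and layers trained one at a time) can be sketched as follows. This is a toy sketch only: the trace-rule form follows the standard formulation of Rolls and Milward (2000), but the layer size, the parameter values `eta` and `alpha`, and the crude mean-subtraction competition step are illustrative assumptions, not the actual VisNet implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_layer(weights, stimulus_sequences, eta=0.8, alpha=0.1):
    """Hebbian learning with a short-term memory trace of each cell's firing.

    weights: (n_cells, n_inputs) synaptic matrix, modified in place.
    stimulus_sequences: list of sequences; each sequence holds the input
    vectors produced by one stimulus as it moves across the 9 locations.
    """
    for sequence in stimulus_sequences:
        trace = np.zeros(weights.shape[0])           # reset trace per stimulus
        for x in sequence:
            y = weights @ x                          # activations
            y = np.maximum(y - y.mean(), 0.0)        # crude competition/inhibition
            trace = (1.0 - eta) * y + eta * trace    # memory trace of recent firing
            weights += alpha * np.outer(trace, x)    # trace-rule synaptic update
            norms = np.linalg.norm(weights, axis=1, keepdims=True)
            weights /= np.maximum(norms, 1e-12)      # keep weight vectors bounded
    return weights

# Toy usage: 10 cells, 64 inputs, 2 stimuli swept over 9 locations of random input.
w = rng.random((10, 64))
seqs = [[rng.random(64) for _ in range(9)] for _ in range(2)]
train_layer(w, seqs)
```

Because the trace carries over between successive transforms of the same stimulus but is reset between stimuli, the update associates the different views of one object onto the same output cells, which is the core of the temporal-continuity learning the text describes.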
The cluttered background was intended to be like the background against which an object might be viewed in a natural scene. If a background is used in an experiment described here, the same background is always used, and it is always in the same position, with stimuli moved to different positions on it. The 9 stimulus locations are arranged in a square grid across the background, where the grid spacings are 32 pixels horizontally or vertically. Before images were presented to VisNet's input layer, they were pre-processed by the standard set of input filters which accord with the general tuning profiles of simple cells in V1 (Hawken and Parker, 1987); full details are given in Rolls and Milward (2000). To train the network, a sequence of images is presented to VisNet's retina that corresponds to a single stimulus occurring in a randomized sequence of the 9 locations across a background. At each presentation the activations of individual neurons are calculated, then their firing rates are calculated, and then the synaptic weights are updated. After a stimulus has been presented in all the training locations, a new stimulus is chosen at random and the process is repeated. The presentation of all the stimuli across all locations constitutes 1 epoch of training. In this manner the network is trained one layer at a time, starting with layer 1 and finishing with layer 4. In the investigations described in this subsection, the numbers of training epochs for layers 1–4 were 50, 100, 100, and 75, respectively.

FIGURE 22 | Cluttered backgrounds used in VisNet simulations: backgrounds 1 and 2 are on the left and right, respectively.

In this experiment (see Stringer and Rolls, 2000, experiment 2), VisNet was trained with the 7 face stimuli presented on a blank background, but tested with the faces presented on each of the 2 cluttered backgrounds. The single and multiple cell information measures showed perfect performance. Compared to performance when shown against a blank background, there was very little deterioration in performance when testing with the faces presented on either of the two cluttered backgrounds.

This is an interesting result to compare with many artificial vision systems, which would need to carry out computationally intensive serial searching and template matching procedures in order to achieve such results. In contrast, the VisNet neural network architecture is able to perform such recognition relatively quickly through a simple feed-forward computation.

Further results from this experiment showed that different neurons can achieve excellent invariant responses to each of the 7 faces even with the faces presented on a cluttered background. The response profiles are independent of location but differentiate between the faces, in that the responses are maximal for only one of the faces and minimal for all the other faces.

This is an interesting and important result, for it shows that after learning, special mechanisms for segmentation and for attention are not needed in order for neurons already tuned by previous learning to the stimuli to be activated correctly in the output layer. Although the experiments described here tested for position invariance, we predict and would expect that the same results would be demonstrable for size- and view-invariant representations of objects.

In experiments 3 and 4 of Stringer and Rolls (2000), VisNet was trained with the 7 face stimuli presented on either one of the 2 cluttered backgrounds, but tested with the faces presented on a blank background. The results for this experiment showed poor performance. The results of experiments 3 and 4 suggest that in order for a cell to learn invariant responses to different transforms of a stimulus when it is presented during training in a cluttered background, some form of segmentation is required in order to separate the figure (i.e., the stimulus or object) from the background. This segmentation might be performed using evidence in the visual scene about the different depths, motions, colors, etc., of the object from its background. In the visual system, this might mean combining evidence represented in different cortical areas, and might be performed by cross-connections between cortical areas to enable such evidence to help separate the representations of objects from their backgrounds in the form-representing cortical areas.

Another mechanism that helps the operation of architectures such as VisNet and the primate visual system to learn about new objects in cluttered scenes is that the receptive fields of inferior temporal cortex neurons become much smaller when objects are seen against natural backgrounds (Sections 5.8.1 and 5.8). This will help greatly in learning about new objects that are being fixated, by reducing responsiveness to other features elsewhere in the scene.

Another mechanism that might help the learning of new objects in a natural scene is attention. An attentional mechanism might highlight the current stimulus being attended to and suppress the effects of background noise, providing a training representation of the object more like that which would be produced when it is presented against a blank background. The mechanisms that could implement such attentional processes are described elsewhere (Rolls, 2008b). If such attentional mechanisms do contribute to the development of view-invariance, then it follows that cells in the temporal cortex may only develop transform-invariant responses to objects to which attention is directed.

Part of the reason for the poor performance in experiments 3 and 4 was probably that the stimuli were always presented against the same fixed background (for technical reasons), and thus the neurons learned about the background rather than the stimuli. Part of the difficulty that hierarchical multilayer competitive networks have with learning in cluttered environments may more generally be that, without explicit segmentation of the stimulus from its background, at least some of the features that should be formed to encode the stimuli are not formed properly, because the neurons learn to respond to combinations of inputs which come partly from the stimulus and partly from the background. To investigate this, Stringer and Rolls (2000) performed experiment 5, in which layers 1–3 were pre-trained with stimuli to ensure that good feature combination neurons for the stimuli were available, and learning was then allowed in only layer 4 while stimuli were presented in the cluttered backgrounds. Layer 4 was then trained in the usual way with the 7 faces presented against a cluttered background. The results showed that prior random exposure to the face stimuli led to much improved performance.

These results demonstrated that the problem of developing position-invariant neurons to stimuli occurring against cluttered backgrounds may be ameliorated by the prior existence of stimulus-tuned feature-detecting neurons in the early layers of the visual system, and that these feature-detecting neurons may be set up through previous exposure to the relevant class of objects. When tested in cluttered environments, the background clutter may of course activate some other neurons in the output layer, but at least the neurons that have learned to respond to the trained stimuli are activated. The resulting activity in the output layer is sufficient to be useful, in the sense that it can be read off correctly by a pattern associator connected to the output layer. Indeed, Stringer and Rolls (2000) tested this by connecting a pattern associator to layer 4 of VisNet. The pattern associator had seven neurons, one for each face, and 1,024 inputs, one from each neuron in layer 4 of VisNet. When trained with a simple associative Hebb rule (equation (16)), the pattern associator learned to activate the correct output neuron whenever one of the faces was shown in any position in the uncluttered environment. This ability was shown to depend on there being invariant neurons for each stimulus in the output layer of VisNet, for the pattern associator could not be taught the task if VisNet had not previously been trained with a trace-learning rule to produce invariant representations. It was then shown that exactly the correct neuron was activated when any of the faces was shown in any position on the cluttered background. This read-off by a pattern associator is exactly what we hypothesize takes place in the brain, in that the inferior temporal visual cortex (where neurons with invariant responses are found) projects to structures such as the orbitofrontal cortex and amygdala, where associations between the invariant visual representations and stimuli such as taste and touch are learned (Rolls and Treves, 1998; Rolls, 1999, 2005, 2008b, 2013; Rolls and Grabenhorst, 2008; Grabenhorst and Rolls, 2011). Thus testing whether the output of an architecture such as VisNet can be used effectively by a pattern associator is a very biologically relevant way to evaluate the performance of this class of architecture.

5.5.2. Learning invariant representations of an object with multiple objects in the scene and with cluttered backgrounds
The results of the experiments just described suggest that in order for a neuron to learn invariant responses to different transforms of a stimulus when it is presented during training in a cluttered background, some form of segmentation is required in order to separate the figure (i.e., the stimulus or object) from the background. This segmentation might be performed using evidence in the visual scene about the different depths, motions, colors, etc., of the object from its background. In the visual system, this might mean combining evidence represented in different cortical areas, and might be performed by cross-connections between cortical areas to enable such evidence to help separate the representations of objects from their backgrounds in the form-representing cortical areas.

A second way in which training a feature hierarchy network in a cluttered natural scene may be facilitated follows from the finding that the receptive fields of inferior temporal cortex neurons shrink from in the order of 70° in diameter, when only one object is present in a blank scene, to values as small as 5–10° close to the fovea in complex natural scenes (Rolls et al., 2003). The proposed mechanism for this is that if there is an object at the fovea, this object, because of the high cortical magnification factor at the fovea, dominates the activity of neurons in the inferior temporal cortex through competitive interactions (Trappenberg et al., 2002; Deco and Rolls, 2004; see Section 5.8). This allows primarily the object at the fovea to be represented in the inferior temporal cortex, and, it is proposed, allows learning to be about this object, and not about the other objects in a whole scene.

Third, top-down spatial attention (Deco and Rolls, 2004, 2005a; Rolls, 2008b) could bias the competition toward a region of visual space where the object to be learned is located.

Fourth, if object 1 is presented during training with other, different objects present on different trials, then the competitive networks that are part of VisNet will learn to represent each object separately, because the features that are part of each object will be much more strongly associated together than those features are with the other features present in the different objects seen on some trials during training (Stringer et al., 2007; Stringer and Rolls, 2008). It is a natural property of competitive networks that input features which co-occur very frequently together are allocated output neurons to represent the pattern as a result of the learning. Input features that do not co-occur frequently may not have output neurons allocated to them. This principle may help feature hierarchy systems to learn representations of individual objects, even when other objects with some of the same features are present in the visual scene, but with different other objects on different trials. With this fundamental and interesting property of competitive networks, it has now become possible for VisNet to self-organize invariant representations of individual objects, even though each object is always presented during training with at least one other object present in the scene (Stringer et al., 2007; Stringer and Rolls, 2008). This has been extended to learning separate representations of face expression and face identity from the same set of images, depending on the statistics with which the images are presented (Tromans et al., 2011), and to learning separate representations of independently rotating objects (Tromans et al., 2012).

5.5.3. VisNet simulations with partially occluded stimuli
In this section we examine the recognition of partially occluded stimuli. Many artificial vision systems that perform object recognition typically search for specific markers in stimuli, and hence their performance may become fragile if key parts of a stimulus are occluded. However, in contrast, we demonstrate that the model of invariance learning in the brain discussed here can continue to offer robust performance with this kind of problem, and that the model is able to correctly identify stimuli with considerable flexibility about what part of a stimulus is visible.

In these simulations (Stringer and Rolls, 2000), training and testing were performed with a blank background to avoid confounding the two separate problems of occlusion and background clutter. In object recognition tasks, artificial vision systems may typically rely on being able to locate a small number of key markers on a stimulus in order to be able to identify it. This approach can become fragile when a number of these markers become obscured. In contrast, biological vision systems may generalize or complete from a partial input as a result of the use of distributed representations in neural networks, and this could lead to greater robustness in situations of partial occlusion.

visible (option (ii)) than the lower half (option (i)). When the top halves of the faces are occluded, the multiple cell information measure asymptotes to a suboptimal value, reflecting the difficulty of discriminating between these more difficult images. Thus this model of the ventral visual system offers robust performance with this kind of problem, and the model is able to correctly identify stimuli with considerable flexibility about what part of a stimulus is visible, because it is effectively using
distributed representations and associative processing. In this experiment (6 of Stringer and Rolls, 2000), the network was first trained with the 7 face stimuli without occlusion, but dur- 5.6. LEARNING 3D TRANSFORMS ing testing there were two options: either (i) the top halves of all the In this section we describe investigations of Stringer and Rolls faces were occluded or (ii) the bottom halves of all the faces were (2002) which show that trace-learning can in the VisNet archi- occluded. Since VisNet was tested with either the top or bottom tecture solve the problem of in-depth rotation invariant object half of the stimuli no stimulus features were common to the two recognition by developing representations of the transforms which test options. This ensures that if performance is good with both features undergo when they are on the surfaces of 3D objects. options, the performance cannot be based on the use of a single Moreover, it is shown that having learned how features on 3D feature to identify a stimulus. Results for this experiment are shown objects transform as the object is rotated in-depth, the network in Figure 23, with single and multiple cell information measures can correctly recognize novel 3D variations within a generic view on the left and right, respectively. When compared with the per- of an object which is composed of previously learned feature formance without occlusion (Stringer and Rolls, 2000), Figure 23 combinations. shows that there is only a modest drop in performance in the single Rolls’ hypothesis of how object recognition could be imple- cell information measures when the stimuli are partially occluded. mented in the brain postulates that trace rule learning helps For both options (i) and (ii), even with partially occluded stim- invariant representations to form in two ways (Rolls, 1992, 1994, uli, a number of cells continue to respond maximally to one 1995, 2000). 
The first process enables associations to be learned preferred stimulus in all locations, while responding minimally between different generic 3D views of an object where there are to all other stimuli. However, comparing results from options different qualitative shape descriptors. One example of this would (i) and (ii) shows that the network performance is better when be the front and back views of an object, which might have very the bottom half of the faces is occluded. This is consistent with different shape descriptors. Another example is provided by con- psychological results showing that face recognition is performed sidering how the shape descriptors typical of 3D shapes, such as more easily when the top halves of faces are visible rather than Y vertices, arrow vertices, cusps, and ellipse shapes, alter when the bottom halves (see Bruce, 1988). The top half of a face will most 3D objects are rotated in 3 dimensions. At some point in the generally contain salient features, e.g., eyes and hair, that are 3D rotation, there is a catastrophic rearrangement of the shape particularly helpful for recognition of the individual, and it is descriptors as a new generic view can be seen (Koenderink, 1990). interesting that these simulations appear to further demonstrate An example of a catastrophic change to a new generic view is when this point. Furthermore, the multiple cell information measures a cup being viewed from slightly below is rotated so that one can see confirm that performance is better with the upper half of the face inside the cup from slightly above. The bottom surface disappears, FIGURE 23 | Effects of partial occlusion of a stimulus: numerical are two options: either (i) the top half of all the faces are occluded, or (ii) results for experiment 6 of Stringer and Rolls (2000), with the 7 faces the bottom half of all the faces are occluded. On the left are single cell presented on a blank background during both training and testing. 
information measures, and on the right are multiple cell information Training was performed with the whole face. However, during testing there measures. Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 39 Rolls Invariant visual object recognition the top surface of the cup changes from a cusp to an ellipse, and the inside of the cup with a whole set of new features comes into view. The second process is that within a generic view, as the object is rotated in-depth, there will be no catastrophic changes in the qualitative 3D shape descriptors, but instead the quantitative val- ues of the shape descriptors alter. For example, while the cup is being rotated within a generic view seen from somewhat below, the curvature of the cusp forming the top boundary will alter, but the qualitative shape descriptor will remain a cusp. Trace-learning could help with both processes. That is, trace-learning could help to associate together qualitatively different sets of shape descrip- tors that occur close together in time, and describe, for example, the generically different views of a cup. Trace-learning could also help with the second process, and learn to associate together the different quantitative values of shape descriptors that typically occur when objects are rotated within a generic view. We note that there is evidence that some neurons in the inferior temporal cortex may show the two types of 3D invariance. First Booth and Rolls (1998) showed that some inferior temporal cor- tex neurons can respond to different generic views of familiar 3D objects. Second, some neurons do generalize across quantitative changes in the values of 3D shape descriptors while faces (Has- FIGURE 24 | Learning 3D perspectival transforms of features. selmo et al., 1989b) and objects (Logothetis et al., 1995; Tanaka, Representations of the 6 visual stimuli with 3 surface features (triples) 1996) are rotated within-generic views. 
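The trace-learning rule that is central to these simulations can be sketched in a few lines. The following is a minimal illustration only, not the actual VisNet implementation: the trace parameter eta, the learning rate, the linear activation, and the toy input patterns are all illustrative assumptions. A postsynaptic trace mixes the firing evoked by the current input with the trace left by the preceding transforms of the same object, and a Hebb-like update associates that trace with the current input, so that temporally adjacent views strengthen onto the same postsynaptic neuron.

```python
import numpy as np

def trace_learning(views, w, eta=0.8, alpha=0.1):
    """Hebbian learning with a short-term memory trace (a sketch of the
    trace rule: y_trace(t) = (1 - eta) * y(t) + eta * y_trace(t - 1),
    delta_w = alpha * y_trace * x).  Not the actual VisNet code."""
    y_trace = 0.0
    for x in views:                      # successive transforms of one object
        y = float(w @ x)                 # illustrative linear activation
        y_trace = (1.0 - eta) * y + eta * y_trace
        w += alpha * y_trace * x         # associate the trace with the current input
        w /= np.linalg.norm(w)           # weight normalization, as in competitive nets
    return w

# Two overlapping "views" of the same object, shown close together in time.
rng = np.random.default_rng(0)
w = rng.random(8)
view_a = np.array([1., 1., 1., 1., 0., 0., 0., 0.])
view_b = np.array([0., 0., 1., 1., 1., 1., 0., 0.])
for _ in range(20):                      # the object transforms repeatedly over trials
    w = trace_learning([view_a, view_b], w)
# After learning, the same neuron responds to both views of the object.
```

Because the trace carries activity across the presentations, the weight vector comes to overlap both views, which is the sense in which temporal continuity binds different transforms of one object onto one neuron.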
FIGURE 24 | Learning 3D perspectival transforms of features. Representations of the 6 visual stimuli with 3 surface features (triples) presented to VisNet during the simulations described in Section 5.6. Each stimulus is a sphere that is uniquely identified by a unique combination of three surface features (a vertical, diagonal, and horizontal arc), which occur in 3 relative positions A, B, and C. Each row shows one of the stimuli rotated through the 5 different rotational views in which the stimulus is presented to VisNet. From left to right the rotational views shown are: (i) –60˚, (ii) –30˚, (iii) 0˚ (central position), (iv) +30˚, and (v) +60˚. (After Stringer and Rolls, 2002.)

Indeed, Logothetis et al. (1995) showed that a few inferior temporal cortex neurons can generalize to novel (untrained) values of the quantitative shape descriptors typical of within-generic view object rotation.

In addition to the qualitative shape descriptor changes that occur catastrophically between different generic views of an object, and the quantitative changes of 3D shape descriptors that occur within a generic view, there is a third type of transform that must be learned for correct invariant recognition of 3D objects as they rotate in-depth. This third type of transform is that which occurs to the surface features on a 3D object as it transforms in-depth. The main aim here is to consider mechanisms that could enable neurons to learn this third type of transform, that is, how to generalize correctly over the changes in the surface markings on 3D objects that are typically encountered as 3D objects rotate within a generic view. Examples of the types of perspectival transforms investigated are shown in Figure 24. Surface markings on the sphere that consist of combinations of three features in different spatial arrangements undergo characteristic transforms as the sphere is rotated from 0˚ to –60˚ and +60˚. We investigated whether the class of architecture exemplified by VisNet, and the trace-learning rule, can learn about the transforms that surface features of 3D objects typically undergo during 3D rotation in such a way that the network generalizes across the change of the quantitative values of the surface features produced by the rotation, and yet still discriminates between the different objects (in this case spheres). In the cases being considered, each object is identified by surface markings that consist of a different spatial arrangement of the same three features (a horizontal, vertical, and diagonal line, which become arcs on the surface of the object).

We note that it has been suggested that the finding that neurons may offer some degree of 3D rotation invariance after training with a single view (or limited set of views) represents a challenge for existing trace-learning models, because these models assume that an initial exposure is required during learning to every transformation of the object to be recognized (Riesenhuber and Poggio, 1998). Stringer and Rolls (2002) showed, as described here, that this is not the case, and that such models can generalize to novel within-generic views of an object provided that the characteristic changes that the features show as objects are rotated have been learned previously for the sets of features when they are present in different objects.

Elliffe et al. (2002) demonstrated for a 2D system how the existence of translation-invariant representations of low-order feature combinations in the early layers of the visual system could allow correct stimulus identification in the output layer even when the stimulus was presented in a novel location where the stimulus had not previously occurred during learning. The proposal was that the low-order spatial-feature combination neurons in the early stages contain sufficient spatial information so that a particular combination of those low-order feature combination neurons specifies a unique object, even if the relative positions of the low-order feature combination neurons are not known, because these neurons are somewhat translation-invariant (see Section 5.4.5). Stringer and Rolls (2002) extended this analysis to feature combinations on 3D objects, and indeed in their simulations described in this section therefore used surface markings for the 3D objects that consisted of triples of features.

The images used for training and testing VisNet were specially constructed for the purpose of demonstrating how the trace-learning paradigm might be further developed to give rise to neurons that are able to respond invariantly to novel within-generic view perspectives of an object, obtained by rotations in-depth up to 30˚ from any perspectives encountered during learning. The stimuli take the form of the surface feature combinations of 3-dimensional rotating spheres, with each image presented to VisNet's retina being a 2-dimensional projection of the surface features of one of the spheres.
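As a concrete check on the stimulus set used in these simulations — each stimulus is labeled by three digits giving the feature ((1) vertical, (2) diagonal, (3) horizontal) in positions A, B, and C, with 0 marking an empty position — the pairs and triples can be enumerated mechanically. This is just a bookkeeping sketch, not part of the original simulations:

```python
from itertools import permutations

# A stimulus label is a string of three digits: the feature in positions
# A, B, C, with 0 denoting an empty position.
pairs = sorted(
    "".join(map(str, p))
    for p in permutations((0, 1, 2, 3), 3)
    if p.count(0) == 1          # exactly one empty position -> a feature pair
)
triples = sorted("".join(map(str, p)) for p in permutations((1, 2, 3)))

print(len(pairs), len(triples))   # -> 18 6
print(triples)                    # -> ['123', '132', '213', '231', '312', '321']
```

The counts match the 18 pairs and 6 triples listed in the text, since a pair places two of the three distinct features into two of the three positions, and a triple is a permutation of all three features.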
Each stimulus is uniquely identified by two or three surface features, where the surface features are (1) vertical, (2) diagonal, and (3) horizontal arcs, and where each feature may be centered at three different spatial positions, designated A, B, and C, as shown in Figure 24. The stimuli are thus defined in terms of what features are present and their precise spatial arrangement with respect to each other. We refer to the two and three feature stimuli as "pairs" and "triples," respectively. Individual stimuli are denoted by three numbers which refer to the individual features present in positions A, B, and C, respectively. For example, a stimulus with positions A and C containing a vertical and diagonal bar, respectively, would be referred to as stimulus 102, where the 0 denotes no feature present in position B. In total there are 18 pairs (120, 130, 210, 230, 310, 320, 012, 013, 021, 023, 031, 032, 102, 103, 201, 203, 301, 302) and 6 triples (123, 132, 213, 231, 312, 321).

To train the network, each stimulus was presented to VisNet in a randomized sequence of five orientations with respect to VisNet's input retina, where the different orientations are obtained from successive in-depth rotations of the stimulus through 30˚. That is, each stimulus was presented to VisNet's retina from the following rotational views: (i) –60˚, (ii) –30˚, (iii) 0˚ (central position with surface features facing directly toward VisNet's retina), (iv) +30˚, and (v) +60˚. Figure 24 shows representations of the 6 visual stimuli with 3 surface features (triples) presented to VisNet during the simulations. (For the actual simulations described here, the surface features and their deformations were what VisNet was trained and tested with, and the remaining blank surface of each sphere was set to the same gray-scale as the background.) Each row shows one of the stimuli rotated through the 5 different rotational views in which the stimulus is presented to VisNet. At each presentation the activation of individual neurons is calculated, then the neuronal firing rates are calculated, and then the synaptic weights are updated. Each time a stimulus has been presented in all the training orientations, a new stimulus is chosen at random and the process repeated. The presentation of all the stimuli through all 5 orientations constitutes 1 epoch of training. In this manner the network was trained one layer at a time, starting with layer 1 and finishing with layer 4. In the investigations described here, the numbers of training epochs for layers 1–4 were 50, 100, 100, and 75, respectively.

In experiment 1, VisNet was trained in two stages. In the first stage, the 18 feature pairs were used as input stimuli, with each stimulus being presented to VisNet's retina in sequences of five orientations as described above. However, during this stage, learning was only allowed to take place in layers 1 and 2. This led to the formation of neurons which responded to the feature pairs with some rotation invariance in layer 2. In the second stage, we used the 6 feature triples as stimuli, with learning only allowed in layers 3 and 4. However, during this second training stage, the triples were only presented to VisNet's input retina in the first 4 orientations (i–iv). After the two stages of training were completed, Stringer and Rolls (2002) examined whether the output layer of VisNet had formed top layer neurons that responded invariantly to the 6 triples when presented in all 5 orientations, not just the 4 in which the triples had been presented during training. To provide baseline results for comparison, the results from experiment 1 were compared with results from experiment 2, which involved no training in layers 1, 2 and 3, 4, with the synaptic weights left unchanged from their initial random values.

In Figure 25 numerical results are given for the experiments described. On the left are the single cell information measures for all top (4th) layer neurons ranked in order of their invariance to the triples, while on the right are multiple cell information measures. To help to interpret these results we can compute the maximum single cell information measure according to

Maximum single cell information = log2 (Number of triples),    (43)

where the number of triples is 6. This gives a maximum single cell information measure of 2.6 bits for these test cases.

The information results from the experiment demonstrate that even with the triples presented to the network in only four of the five orientations during training, layer 4 is indeed capable of developing rotation invariant neurons that can discriminate effectively between the 6 different feature triples in all 5 orientations, that is, with correct recognition from all five perspectives. In addition, the multiple cell information for the experiment reaches the maximal level of 2.6 bits, indicating that the network as a whole is capable of perfect discrimination between the 6 triples in any of the 5 orientations. These results may be compared with the very poor baseline performance from the control experiment, where no learning was allowed before testing.

Stringer and Rolls (2002) also performed a control experiment to show that the network really had learned invariant representations specific to the kinds of 3D deformations undergone by the surface features as the objects rotated in-depth. In the control experiment the network was trained on "spheres" with non-deformed surface features; and then, as predicted, the network failed to operate correctly when it was tested with objects with the features present in the transformed way that they appear on the surface of a real 3D object.

Stringer and Rolls (2002) were thus able to show how trace-learning can form neurons that can respond invariantly to novel rotational within-generic view perspectives of an object, obtained by within-generic view 3D rotations up to 30˚ from any view encountered during learning. They were able to show in addition that this could occur for a novel view of an object which was not an interpolation from previously shown views. This was possible given that the low-order feature combination sets from which an object was composed had been learned about in early layers of VisNet previously.
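The information ceiling in equation (43) can be illustrated directly: an idealized output neuron that fires at a distinct, perfectly view-invariant rate for each of the 6 triples carries log2(6) ≈ 2.58 bits of mutual information about which stimulus was shown. The calculation below is an illustrative sketch, not the measure applied to the actual simulation data:

```python
import math
from collections import Counter

def mutual_information(responses):
    """I(S;R) in bits for a table responses[stimulus][view] of firing
    rates, with stimuli and views treated as equiprobable."""
    samples = [(s, r) for s, views in enumerate(responses) for r in views]
    n = len(samples)
    p_sr = Counter(samples)
    p_s = Counter(s for s, _ in samples)
    p_r = Counter(r for _, r in samples)
    return sum(
        (c / n) * math.log2((c / n) / ((p_s[s] / n) * (p_r[r] / n)))
        for (s, r), c in p_sr.items()
    )

# An ideal cell: a distinct rate for each of the 6 triples, identical
# across all 5 rotational views (perfect selectivity and invariance).
ideal = [[rate] * 5 for rate in range(6)]
print(round(mutual_information(ideal), 2))   # -> 2.58  (the log2 6 ceiling of equation (43))
```

A cell whose rate did not distinguish the stimuli, or varied with view rather than stimulus, would fall below this ceiling, which is why the 2.6-bit asymptote in Figure 25 indicates essentially perfect invariant discrimination.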
The within-generic view transform invariant object recognition described was achieved through the development of true 3-dimensional representations of objects based on 3-dimensional features and feature combinations, which, unlike 2-dimensional feature combinations, are invariant under moderate in-depth rotations of the object. Thus, in a sense, these rotation invariant representations encode a form of 3-dimensional knowledge with which to interpret the visual input from the real world, that is able to provide a basis for robust rotation invariant object recognition with novel perspectives. The particular finding in the work described here was that VisNet can learn how the surface features on 3D objects transform as the object is rotated in-depth, and can use knowledge of the characteristics of the transforms to perform 3D object recognition. The knowledge embodied in the network is knowledge of the 3D properties of objects, and in this sense assists the recognition of 3D objects seen from different views.

The process investigated by Stringer and Rolls (2002) will only allow invariant object recognition over moderate 3D object rotations, since rotating an object through a large angle may lead to a catastrophic change in the appearance of the object that requires the new qualitative 3D shape descriptors to be associated with those of the former view. In that case, invariant object recognition must rely on the first process referred to at the start of this section (Section 5.6) in order to associate together the different generic views of an object to produce view-invariant object identification. For that process, association of a few cardinal or generic views is likely to be sufficient (Koenderink, 1990). The process described in this section of learning how surface features transform is likely to make a major contribution to the within-generic view transform invariance of object identification and recognition.

FIGURE 25 | Learning 3D perspectival transforms of features. Numerical results for experiments 1 and 2: on the left are single cell information measures, and on the right are multiple cell information measures. (After Stringer and Rolls, 2002.)

5.7. CAPACITY OF THE ARCHITECTURE, AND INCORPORATION OF A TRACE RULE INTO A RECURRENT ARCHITECTURE WITH OBJECT ATTRACTORS

One issue that has not been considered extensively so far is the capacity of hierarchical feed-forward networks of the type exemplified by VisNet that are used for invariant object recognition. One approach to this issue is to note that VisNet operates in the general mode of a competitive network, and that the number of different stimuli that can be categorized by a competitive network is in the order of the number of neurons in the output layer (Rolls, 2008b). Given that the successive layers of the real visual system (V1, V2, V4, posterior inferior temporal cortex, anterior inferior temporal cortex) are of the same order of magnitude, VisNet is designed to work with the same number of neurons in each successive layer. (Of course the details are worth understanding further. V1 is, for example, somewhat larger than earlier layers, but on the other hand serves the dorsal as well as the ventral stream of visual cortical processing.) The hypothesis is that because of redundancies in the visual world, each layer of the system by its convergence and competitive categorization can capture sufficient of the statistics of the visual input at each stage to enable correct specification of the properties of the world that specify objects. For example, V1 does not compute all possible combinations of a few lateral geniculate inputs, but instead represents linear series of geniculate inputs to form edge-like and bar-like feature analyzers, which are the dominant arrangement of pixels found at the small scale in natural visual scenes. Thus the properties of the visual world at this stage can be captured by a small proportion of the total number of combinations that would be needed if the visual world were random. Similarly, at a later stage of processing, just a subset of all possible combinations of line or edge analyzers would be needed, partly because some combinations are much more frequent in the visual world, and partly because the coding, because of convergence, means that what is represented is for a larger area of visual space (that is, the receptive fields of the neurons are larger), which also leads to economy and limits what otherwise would be a combinatorial need for feature analyzers at later layers. The hypothesis thus is that the effects of redundancies in the input space of stimuli that result from the statistical properties of natural images (Field, 1987), together with the convergent architecture with competitive learning at each stage, produces a system that can perform invariant object recognition for large numbers of objects. Large in this case could be within one or two orders of magnitude of the number of neurons in any one layer of the network (or cortical area in the brain). The extent to which this can be realized can be explored with simulations of the type implemented in VisNet, in which the network can be trained with natural images which therefore reflect fully the natural statistics of the stimuli presented to the real brain.
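The point that a competitive layer allocates output neurons to the input combinations that actually occur frequently, rather than to all possible combinations, can be illustrated with a minimal winner-take-all sketch. The update rule, learning rate, and toy patterns below are illustrative assumptions, not VisNet's parameters:

```python
import numpy as np

def train_competitive(patterns, n_out=4, lr=0.3, epochs=50, seed=1):
    """Minimal winner-take-all competitive learning: on each presentation
    the most strongly activated output neuron moves its weight vector
    toward the current input pattern."""
    rng = np.random.default_rng(seed)
    w = rng.random((n_out, patterns.shape[1]))
    for _ in range(epochs):
        for x in patterns:
            winner = int(np.argmax(w @ x))
            w[winner] += lr * (x - w[winner])   # move the winner toward the input
    return w

# Two feature combinations that occur frequently in the "visual world".
patterns = np.array([[1., 1., 0., 0.],
                     [0., 0., 1., 1.]])
w = train_competitive(patterns)
winners = [int(np.argmax(w @ x)) for x in patterns]
# Each frequently occurring combination ends up with its own output neuron.
```

Because only encountered combinations recruit neurons, the layer's capacity is spent on the statistics of the actual input, which is the economy argued for in the text.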
We should note that a rich variety of information in perceptual space may be represented by subtle differences in the distributed representation provided by the output of the visual system. At the same time, the actual number of different patterns that may be stored in, for example, a pattern associator connected to the output of the visual system is limited by the number of input connections per neuron from the output neurons of the visual system (Rolls, 2008b). One essential function performed by the ventral visual system is to provide an invariant representation which can be read by a pattern associator in such a way that if the pattern associator learns about one view of the object, then the visual system allows generalization to another view of the same object, because the same output neurons are activated by the different view. In the sense that any view can and must activate the same output neurons of the visual system (the input to the associative network), then we can say the invariance is made explicit in the representation. Making some properties of an input representation explicit in an output representation has a major function of enabling associative networks that use visual inputs in, for example, recognition, episodic memory, emotion and motivation to generalize correctly, that is invariantly with respect to image transforms that are all consistent with the same object in the world (Rolls and Treves, 1998).

Another approach to the issue of the capacity of networks that use trace learning to associate together different instances (e.g., views) of the same object is to reformulate the issue in the context of autoassociation (attractor) networks, where analytic approaches to the storage capacity of the network are well developed (Amit, 1989; Rolls and Treves, 1998; Rolls, 2008b). This approach to the storage capacity of networks that associate together different instantiations of an object to form invariant representations has been developed by Parga and Rolls (1998) and Elliffe et al. (2000), and is described next.

In this approach, the storage capacity of a recurrent network which performs, for example, view-invariant recognition of objects by associating together different views of the same object which tend to occur close together in time, was studied (Parga and Rolls, 1998; Elliffe et al., 2000). The architecture with which the invariance is computed is a little different to that described earlier. In the model of Rolls (1992, 1994, 1995), Wallis and Rolls (1997), Rolls and Milward (2000), and Rolls and Stringer (2006), the post-synaptic memory trace enabled different afferents from the preceding stage to modify onto the same post-synaptic neuron (see Figure 26). In that model there were no recurrent connections between the neurons, although such connections were one way in which it was postulated the memory trace might be implemented, by simply keeping the representation of one view or aspect active until the next view appeared. Then an association would occur between representations that were active close together in time (within, e.g., 100–300 ms).

FIGURE 26 | The learning scheme implemented in VisNet. A trace-learning rule is implemented in the feed-forward inputs to a competitive network.

In the model developed by Parga and Rolls (1998) and Elliffe et al. (2000), there is a set of inputs with fixed synaptic weights to a network. The network itself is a recurrent network, with a trace rule incorporated in the recurrent collaterals (see Figure 27). When different views of the same object are presented close together in time, the recurrent collaterals learn using the trace rule that the different views are of the same object. After learning, presentation of any of the views will cause the network to settle into an attractor that represents all the views of the object, that is, which is a view-invariant representation of an object. (In this Section, the different exemplars of an object which need to be associated together are called views, for simplicity, but could at earlier stages of the hierarchy represent, for example, similar feature combinations (derived from the same object) in different positions in space.)

FIGURE 27 | The learning scheme considered by Parga and Rolls (1998) and Elliffe et al. (2000). There are inputs to the network from the preceding stage via unmodifiable synapses, and a trace or pairwise associative learning rule is implemented in the recurrent collateral synapses of an autoassociative memory to associate together the different exemplars (e.g., views) of the same object.

We envisage a set of neuronal operations which set up a synaptic weight matrix in the recurrent collaterals by associating together, because of their closeness in time, the different views of the same object.

In more detail, Parga and Rolls (1998) considered two main approaches. First, one could store in a synaptic weight matrix the s views of an object. This consists of equally associating all the views to each other, including the association of each view with itself. Choosing in Figure 28 an example such that objects are defined in terms of five different views, this might produce (if each view produced firing of one neuron at a rate of 1) a block of 5 × 5 pairs of views contributing to the synaptic efficacies, each with value 1. Object 2 might produce another block of synapses of value 1 further along the diagonal, and symmetric about it. Each object or memory could then be thought of as a single attractor with a distributed representation involving five elements (each element representing a different view).

FIGURE 28 | A schematic illustration of the first type of associations contributing to the synaptic matrix considered by Parga and Rolls (1998). Object 1 (O_1) has five views labeled v_1 to v_5, etc. The matrix is formed by associating the pattern presented in the columns with itself, that is, with the same pattern presented as rows.

Then the capacity of the system in terms of the number P_o of objects that can be stored is just the number of separate attractors which can be stored in the network. For random fully distributed patterns this is as shown numerically by Hopfield (1982)

P_o = 0.14 C,    (44)

where there are C inputs per neuron (and N = C neurons if the network is fully connected). Now the synaptic matrix envisaged here does not consist of random fully distributed binary elements, but instead we will assume has a sparseness a = s/N, where s is the number of views stored for each object, from any of which the whole representation of the object must be recognized.
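The capacity estimates of equations (44)–(46) are simple enough to tabulate numerically. In the sketch below, the values of k are taken from the approximate ranges quoted in the text, so the numbers are order-of-magnitude illustrations only:

```python
import math

def hopfield_capacity(C):
    """Equation (44): attractors storable with random fully distributed
    patterns, with C recurrent inputs per neuron (Hopfield, 1982)."""
    return 0.14 * C

def sparse_capacity(C, a, k=0.25):
    """Equation (45): capacity with sparseness a = s/N; k is approximately
    0.2-0.3, depending on the rate distribution and connectivity."""
    return (k * C) / (a * math.log(1.0 / a))

def view_invariant_capacity(C, s, k=0.08):
    """Equation (46): objects storable when each object has s views;
    k is in the region of 0.07-0.09 (Parga and Rolls, 1998)."""
    return (k * C) / s

N = C = 1000                                  # fully connected network, N = C
print(round(hopfield_capacity(C)))            # -> 140
print(round(view_invariant_capacity(C, 5, k=0.081), 1))   # -> 16.2  (i.e., 0.081 N / 5)
print(round(sparse_capacity(C, a=5 / N), 1))  # sparse-coding estimate for s = 5 views
```

The last function makes the linear scaling explicit: for a fixed number of views s per object, the number of storable objects grows in proportion to the number of recurrent collateral connections C per neuron.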
In this case, one can show (Gardner, 1988; Tsodyks and Feigel’man, 1988; Treves and Rolls, 1991) that the number of objects that can be stored and correctly retrieved is

P_o = k C / (a ln(1/a))    (45)

where C is the number of synapses on each neuron devoted to the recurrent collaterals from other neurons in the network, and k is a factor that depends weakly on the detailed structure of the rate distribution, on the connectivity pattern, etc., but is approximately in the order of 0.2–0.3. A problem with this proposal is that as the number of views of each object increases to a large number (e.g., >20), the network will fail to retrieve correctly the internal representation of the object starting from any one view (which is only a fraction 1/s of the length of the stored pattern that represents an object).

The second approach, taken by Parga and Rolls (1998) and Elliffe et al. (2000), is to consider the operation of the network when the associations between pairs of views can be described by a matrix that has the general form shown in Figure 29. Such an association matrix might be produced by different views of an object appearing after a given view with equal probability, with synaptic modification occurring of the view with itself (giving rise to the diagonal term), and of any one view with that which immediately follows it.

FIGURE 29 | A schematic illustration of the second and main type of associations contributing to the synaptic matrix considered by Parga and Rolls (1998) and Elliffe et al. (2000). Object 1 (O_1) has five views labeled v_1 to v_5, etc. The association of any one view with itself has strength 1, and of any one with another view of the same object has strength b.

To be particular, the number of objects that can be stored is 0.081 N/5 when there are five views of each object, and 0.073 N/11 when there are eleven views of each object. This is an interesting result in network terms, in that s views each represented by an independent random set of active neurons can, in the network described, be present in the same "object" attraction basin. It is also an interesting result in neurophysiological terms, in that the number of objects that can be represented in this network scales linearly with the number of recurrent connections per neuron. That is, the number of objects P_o that can be stored is approximately

P_o = k C / s    (46)

where C is the number of synapses on each neuron devoted to the recurrent collaterals from other neurons in the network, s is the number of views of each object, and k is a factor that is in the region of 0.07–0.09 (Parga and Rolls, 1998).

The same weight matrix might be produced not only by pairwise association of successive views, because the association rule allows for associations over the short time scale of, e.g., 100–200 ms, but might also be produced if the synaptic trace had an exponentially decaying form over several hundred milliseconds, allowing associations with decaying strength between views separated by one or more intervening views. The existence of a regime, for values of the coupling parameter between pairs of views in a finite interval, such that the presentation of any of the views of one object leads to the same attractor regardless of the particular view chosen as a cue, is one of the issues treated by Parga and Rolls (1998) and Elliffe et al. (2000). A related problem also dealt with was the capacity of this type of synaptic matrix: how many objects can be stored and retrieved correctly in a view-invariant way? Parga and Rolls (1998) and Elliffe et al. (2000) showed that the number grows linearly with the number of recurrent collateral connections received by each neuron.
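The capacity expressions (44)–(46) can be illustrated numerically. The constants k below are the approximate values quoted in the text; the network size C is an arbitrary example.

```python
import math

def hopfield_capacity(C):
    """Eq. (44): random fully distributed patterns, P = 0.14 C."""
    return 0.14 * C

def sparse_pattern_capacity(C, a, k=0.25):
    """Eq. (45): sparse patterns of sparseness a, P_o = k C / (a ln(1/a)),
    with k approximately 0.2-0.3."""
    return k * C / (a * math.log(1.0 / a))

def object_capacity(C, s, k=0.08):
    """Eq. (46): objects defined by s views each, P_o = k C / s,
    with k approximately 0.07-0.09."""
    return k * C / s

C = 10_000                                   # recurrent synapses per neuron
print(hopfield_capacity(C))                  # 1400.0
print(object_capacity(C, s=5, k=0.081))      # 162.0  (i.e., 0.081 N / 5 for N = C)
print(object_capacity(C, s=11, k=0.073))     # ~66.4  (i.e., 0.073 N / 11)
```

Note how, for sparse patterns, eq. (45) gives a much larger number of storable patterns than eq. (44), while eq. (46) shows the cost of grouping s views into one object attractor: the object capacity falls linearly with s.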
Some of the groundwork for this approach was laid by the work of Amit and collaborators (Amit, 1989; Griniasty et al., 1993).

A variant of the second approach is to consider that the remaining entries in the matrix shown in Figure 29 all have a small value. This would be produced by the fact that sometimes a view of one object would be followed by a view of a different object, when, for example, a large saccade was made, with no explicit resetting of the trace. On average, any one object would follow another rarely, and so the case is considered when all the remaining associations between pairs of views have a low value.

Parga and Rolls (1998) and Elliffe et al. (2000) were able to show that invariant object recognition is feasible in attractor neural networks in the way described. The system is able to store and retrieve in a view-invariant way an extensive number of objects, defined by a finite set of views. What is implied by extensive retrieval is that the number of objects is proportional to the size of the network. The crucial factor that defines this size is the number of connections per neuron. In the case of the fully connected networks considered in this section, the size is thus proportional to the number of neurons.

Although the explicit numerical calculation was done for a rather small number of views for each object (up to 11), the basic result, that the network can support this kind of "object" phase, is expected to hold for any number of views (the only requirement being that it does not increase with the number of neurons). This is of course enough: once an object is defined by a set of views, when the network is presented with a somewhat different stimulus or a noisy version of one of them, it will still be in the attraction basin of the object attractor.

Parga and Rolls (1998) thus showed that multiple (e.g., "view") patterns could be within the basin of attraction of a shared (e.g., "object") representation, and that the capacity of the system was proportional to the number of synapses per neuron divided by the number of views of each object. Elliffe et al. (2000) extended the analysis of Parga and Rolls (1998) by showing that correct retrieval could occur where the "view" cues were distorted; where there was some association between the views of different objects; and where there was only partial, and indeed asymmetric, connectivity provided by the associatively modified recurrent collateral connections in the network. The simulations also extended the analysis by showing that the system can work well with sparse patterns, and indeed that the use of sparse patterns increases (as expected) the number of objects that can be stored in the network.

Taken together, the work described by Parga and Rolls (1998) and Elliffe et al. (2000) introduced the idea that the trace rule used to build invariant representations could be implemented in the recurrent collaterals of a neural network (as well as, or as an alternative to, its incorporation in the forward connections from one layer to another in VisNet), and provided a precise analysis of the capacity of the network if it operated in this way. Such an attractor mechanism would apply most naturally when a small number of representations need to be associated together to represent an object. One example is associating together what is seen when an object is viewed from different perspectives. Another example is scale, with respect to which neurons early in the visual system tolerate scale changes of approximately 1.5 octaves, so that the whole scale range could be covered by associating together a limited number of such representations (see Chapter 5 of Rolls and Deco (2002) and Figure 1).
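The view-association matrix of Figure 29, and the requirement that the coupling b lie in the right range for a single view to retrieve the whole object, can be illustrated with a toy simulation. The simple threshold dynamics and the values of b and the threshold below are assumptions for illustration, not the analysis of Parga and Rolls (1998).

```python
import numpy as np

n_objects, n_views = 3, 5
b, theta = 0.6, 0.5                      # view-view coupling and firing threshold
n_units = n_objects * n_views            # one unit per (object, view) pair

# Figure 29-style matrix: self-association 1 on the diagonal, coupling b
# between the views of the same object, 0 between different objects.
W = np.zeros((n_units, n_units))
for o in range(n_objects):
    blk = slice(o * n_views, (o + 1) * n_views)
    W[blk, blk] = b
np.fill_diagonal(W, 1.0)

# Cue with a single view (view 2 of object 1, i.e., unit index 7) and iterate.
y = np.zeros(n_units)
y[7] = 1.0
for _ in range(5):
    y = (W @ y > theta).astype(float)

retrieved = np.flatnonzero(y)            # indices of the active units
```

With b above the threshold, the cue recruits all five views of object 1 (units 5 to 9) and no views of the other objects; with b below the threshold, only the cued view remains active. This is the "object phase" regime in miniature: one view acts as a retrieval cue for the whole object representation.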
The mechanism would not be so suitable when a large number of different instances would need to be associated together to form an invariant representation of objects, as might be needed for translation invariance. For the latter, the standard model of VisNet with the associative trace-learning rule implemented in the feed-forward connections (or trained by continuous spatial transformation learning as described in Section 5.10) would be more appropriate. However, both types of mechanism, with the trace rule in the feed-forward or in the recurrent collateral synapses, could contribute (separately or together) to achieving invariant representations. Part of the interest of the attractor approach described in this section is that it allows analytic investigation.

In the brain, it is likely that the recurrent collateral connections between cortical pyramidal cells in visual cortical areas do contribute to building invariant representations, in that they appear to be associatively modifiable and, because there is continuing firing for typically 100–300 ms after a stimulus has been shown, associations between different exemplars of the same object that occur close together in time would almost necessarily become built into the recurrent synaptic connections between pyramidal cells.

Invariant representation of faces in the context of attractor neural networks has also been discussed by Bartlett and Sejnowski (1997), in terms of a model where different views of faces are presented in a fixed sequence (Griniasty et al., 1993). This is not however the general situation: normally any pair of views can be seen consecutively, and they will become associated. The model described by Parga and Rolls (1998) treats this more general situation.

I wish to note the different nature of the invariant object recognition problem studied here and of the paired associate learning task studied by Miyashita (1988), Miyashita and Chang (1988), and Sakai and Miyashita (1991). In the invariant object recognition case, no particular learning protocol is required to produce activity of the inferior temporal cortex cells responsible for invariant object recognition that is maintained for 300 ms. The learning can occur rapidly, and it occurs between stimuli (e.g., different views) that follow each other with no intervening delay. In the paired associate task, which had the aim of providing a model of semantic memory, the monkeys must learn to associate together two stimuli that are separated in time (by a number of seconds), and this type of learning can take weeks to train. During the delay period the sustained activity is rather low in the experiments, and thus the representation of the first stimulus that remains is weak, and can only poorly be associated with the second stimulus. However, formally the learning mechanism could be treated in the same way as that used by Parga and Rolls (1998) for invariant object recognition. The experimental difference is just that in the paired associate task used by Miyashita et al., it is the weak memory of the first stimulus that is associated with the second stimulus, whereas in the invariance learning it would be the firing activity being produced by the first stimulus (not a weak memory of it) that is associated. It is possible that the perirhinal cortex makes a useful contribution to invariant object recognition by providing a short-term memory that helps successive views of the same object to become associated together (Buckley et al., 2001; Rolls et al., 2005a).

Another approach to training invariance is the purely associative mechanism of continuous spatial transformation learning, described in Section 5.10. With this training procedure, the capacity is increased with respect to the number of training locations: for example, 169 training locations produced translation-invariant representations for two face stimuli (Perry et al., 2010). When we scaled up the 32 × 32 VisNet used for most of the investigations described here to 128 × 128 neurons per layer in the VisNetL specified in Table 1, it was demonstrated that perfect translation-invariant representations were produced over at least 1,089 locations for 5 objects. Thus the indications are that scaling up the size of VisNet does markedly improve performance, and in this case allows invariant representations for 5 objects across more than 1,000 locations to be trained with continuous spatial transformation learning (Perry et al., 2010).

It will be of interest in future research to investigate how the VisNet architecture, whether trained with a trace or a purely associative rule, scales up with respect to capacity as the number of neurons in the system increases further. More distributed representations in the output layer may also help to increase the capacity. In recent investigations, we have been able to train VisNetL (i.e., 128 × 128 neurons in each layer, a 256 × 256 input image, and 8 spatial frequencies for the Gabor filters, as shown in Table 4) on a view-invariance learning problem, and have found good scaling up with respect to the original VisNet (i.e., 32 × 32 neurons in each layer, a 64 × 64 input image, and 4 spatial frequencies for the filters). For example, VisNetL can learn with the trace rule perfect invariant representations of 32 objects each shown in 24 views (T. J. Webb and E. T. Rolls, recent observations). The objects were made with Blender 3D modeling software, so the image views generated were carefully controlled for lighting, background intensity, etc.
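The Gabor input filtering just mentioned (4 spatial frequencies in VisNet, 8 in VisNetL) can be sketched as a small filter bank. The kernel sizes, wavelengths, and bandwidths below are illustrative assumptions, not the actual Table 4 specification.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, phase=0.0):
    """2D Gabor filter: a cosine carrier under a Gaussian envelope.
    All parameter values used below are illustrative only."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    xr = xs * np.cos(theta) + ys * np.sin(theta)    # rotate coordinates
    yr = -xs * np.sin(theta) + ys * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * xr / wavelength + phase)
    return envelope * carrier

# A bank covering several spatial frequencies (octave spacing, since each
# wavelength doubles) and four orientations, as in feature-hierarchy inputs.
wavelengths = (4, 8, 16, 32)
orientations = np.deg2rad((0, 45, 90, 135))
bank = [gabor_kernel(31, wl, th, sigma=0.5 * wl)
        for wl in wavelengths for th in orientations]
```

Convolving an input image with such a bank gives the oriented, multi-scale input representation on which the feed-forward layers of a VisNet-style hierarchy operate; adding more wavelengths extends the range of spatial frequencies covered, as in the move from 4 to 8 frequencies described above.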
When trained on half of these views for each object, with the other half used for cross-validation testing, the performance was reasonable at approximately 68% correct for the 32 objects, and having the full set of 8 spatial frequencies did improve performance.

5.8. VISION IN NATURAL SCENES – EFFECTS OF BACKGROUND VERSUS ATTENTION
Object-based attention refers to attention to an object. For example, in a visual search task the object might be specified as what should be searched for, and its location must be found. In spatial attention, a particular location in a scene is pre-cued, and the object at that location may need to be identified. Here we consider some of the neurophysiology of object selection and attention in the context of a feature hierarchy approach to invariant object recognition. The computational mechanisms of attention, including top-down biased competition, are described elsewhere (Rolls and Deco, 2002; Deco and Rolls, 2005b; Rolls, 2008b).

In the experiments of Rolls et al. (2003), neuronal responses were recorded while objects were shown in complex natural backgrounds. The monkey had to search for two objects on a screen: a touch of one object was rewarded with juice, and of another object was punished with saline (see Figure 3 for a schematic illustration and Figure 30 for a version of the display with examples of the stimuli shown to scale). Neuronal responses to the effective stimuli for the neurons were compared when the objects were presented in the natural scene or on a plain background. It was found that the overall response of the neurons to objects was hardly reduced when they were presented in natural scenes, and the selectivity of the neurons remained. However, the main finding was that the magnitudes of the responses of the neurons typically became much less in the real scene the further the monkey fixated in the scene away from the object (see Figure 4). A small receptive field size has also been found in inferior temporal cortex neurons when monkeys have been trained to discriminate closely spaced small visual stimuli (DiCarlo and Maunsell, 2003).

5.8.1.
Neurophysiology of object selection and translation invariance in the inferior temporal visual cortex
Much of the neurophysiology, psychophysics, and modeling of attention has been carried out with a small number, typically two, of objects in an otherwise blank scene. In this section, I consider how attention operates in complex natural scenes, and in particular describe how the inferior temporal visual cortex operates to enable the selection of an object in a complex natural scene (see also Rolls and Deco, 2006). The inferior temporal visual cortex contains distributed and invariant representations of objects and faces (Rolls and Baylis, 1986; Hasselmo et al., 1989a; Tovee et al., 1994; Rolls and Tovee, 1995b; Rolls et al., 1997b; Booth and Rolls, 1998; Rolls, 2000, 2007a,b,c, 2011b; Rolls and Deco, 2002; Rolls and Treves, 2011).

It is proposed that the reduced translation invariance observed in natural scenes helps an unambiguous representation of an object which may be the target for action to be passed to the brain regions that receive from the primate inferior temporal visual cortex. It helps with the binding problem, by reducing in natural scenes the effective receptive field of at least some inferior temporal cortex neurons to approximately the size of an object in the scene. It is also found that in natural scenes the effect of object-based attention on the response properties of inferior temporal cortex neurons is relatively small, as illustrated in Figure 31 (Rolls et al., 2003). To investigate how attention operates in complex natural scenes, and how information is passed from the inferior temporal cortex (IT) to other brain regions to enable stimuli to be selected from natural scenes for action, Rolls et al. (2003) analyzed the responses of inferior temporal cortex neurons to stimuli presented in complex natural backgrounds, as described above.

5.8.2.
Attention and translation invariance in natural scenes – a computational account
The results summarized in Figure 31 for 5˚ stimuli show that the receptive fields were large (77.6˚) with a single stimulus in a blank background (top left), and were greatly reduced in size (to 22.0˚) when the stimulus was presented in a complex natural scene (top right). The results also show that there was little difference in receptive field size or firing rate in the complex background when the effective stimulus was selected for action (bottom right, 19.2˚) and when it was not (middle right, 15.6˚; Rolls et al., 2003). (For comparison, the effects of attention against a blank background were much larger, with the receptive field increasing from 17.2˚ to 47.0˚ as a result of object-based attention, as shown in Figure 31, left middle and bottom.)

FIGURE 30 | The visual search task. The monkey had to search for and touch an object (in this case a banana) when shown in a complex natural scene, or when shown on a plain background. In each case a second object is present (a bottle) which the monkey must not touch. The stimuli are shown to scale. The screen subtended 70˚ × 55˚. (After Rolls et al., 2003.)

Trappenberg et al. (2002) have suggested what underlying mechanisms could account for these findings, and simulated a model to test the ideas. The model utilizes an attractor network representing the inferior temporal visual cortex (implemented by the recurrent connections between inferior temporal cortex neurons), and a neural input layer with several retinotopically organized modules representing the visual scene in an earlier visual cortical area such as V4 (see Figure 32).
The attractor network aspect of the model produces the property that the receptive fields of IT neurons can be large in blank scenes, by enabling a weak input in the periphery of the visual field to act as a retrieval cue for the object attractor. On the other hand, when the object is shown in a complex background, the object closest to the fovea tends to act as the retrieval cue for the attractor, because the fovea is given increased weight in activating the IT module: the magnitude of the input activity from objects at the fovea is greatest, due to the higher magnification factor of the fovea incorporated into the model. This results in smaller receptive fields of IT neurons in complex scenes, because the object tends to need to be close to the fovea to trigger the attractor into the state representing that object. (In other words, if the object is far from the fovea, it will not trigger the neurons in IT which represent it, because neurons in IT are preferentially being activated by another object at the fovea.) This may be described as an attractor model in which the competition for which attractor state is retrieved is weighted toward objects at the fovea.

Attentional top-down object-based inputs can bias the competition implemented in this attractor model, but have relatively minor effects (for example, in increasing receptive field size) when they are applied in a complex natural scene, as then, as usual, the stronger forward inputs dominate the states reached. In this network, the recurrent collateral connections may be thought of as implementing constraints between the different inputs present, to help arrive at firing in the network which best meets the constraints.

FIGURE 31 | Summary of the receptive field sizes of inferior temporal cortex neurons to a 5˚ effective stimulus presented in either a blank background (blank screen) or in a natural scene (complex background). The stimulus that was a target for action in the different experimental conditions is marked by T. When the target stimulus was touched, a reward was obtained. The mean receptive field diameter of the population of neurons analyzed, and the mean firing rate in spikes/s, are shown. The stimuli subtended 5˚ × 3.5˚ at the retina, and occurred on each trial in a random position in the 70˚ × 55˚ screen. The dashed circle is proportional to the receptive field size. Top row: responses with one visual stimulus in a blank (left) or complex (right) background. Middle row: responses with two stimuli, when the effective stimulus was not the target of the visual search. Bottom row: responses with two stimuli, when the effective stimulus was the target of the visual search. (After Rolls et al., 2003.)

In this scenario, the preferential weighting of objects close
to the fovea because of the increased magnification factor at the fovea is a useful principle in enabling the system to provide useful output. The attentional object biasing effect is much more marked in a blank scene, or in a scene with only two objects present at similar distances from the fovea, which are the conditions in which attentional effects have frequently been examined. The results of the investigation (Trappenberg et al., 2002) thus suggest that top-down attention may be a much more limited phenomenon in complex, natural scenes than in reduced displays with one or two objects present. The results also suggest that the alternative principle, of providing strong weight to whatever is close to the fovea, is an important principle governing the operation of the inferior temporal visual cortex, and in general of the output of the visual system in natural environments.
This principle of operation is also very important in interfacing the visual system to action systems, because the effective stimulus in making inferior temporal cortex neurons fire is, in natural scenes, usually on or close to the fovea. This means that the spatial coordinates of where the object is in the scene do not have to be represented in the inferior temporal visual cortex, nor passed from it to the action selection system, as the latter can assume that the object making IT neurons fire is close to the fovea in natural scenes.

There may of course be in addition a mechanism for object selection that takes into account the locus of covert attention when actions are made to locations not being looked at. However, the simulations described in this section suggest that in any case covert attention is likely to be a much less significant influence on visual processing in natural scenes than in reduced scenes with one or two objects present.

Given these points, one might question why inferior temporal cortex neurons can have such large receptive fields, which show translation invariance. At least part of the answer may be that inferior temporal cortex neurons must have the capability to be large if they are to deal with large objects. A V1 neuron, with its small receptive field, simply could not receive input from all the features necessary to define an object. On the other hand, inferior temporal cortex neurons may be able to adjust their size to approximately the size of objects, using in part the interactive effects involved in attention (Rolls, 2008b), and they need the capability for translation invariance because the features of an object could be at different positions in the scene. For example, a car can be recognized whichever way it is viewed, so that the parts (such as the bonnet or hood) must be identifiable as parts wherever they happen to be in the image, though of course the parts themselves also have to be in the correct relative positions, as allowed for by the hierarchical feature analysis architecture described in this paper.

Some details of the simulations follow.
Each independent module within "V4" in Figure 32 represents a small part of the visual field, and receives input from earlier visual areas represented by an input vector for each possible location which is unique for each object. Each module was 6˚ in width, matching the size of the objects presented to the network. For the simulations, Trappenberg et al. (2002) chose binary random input vectors representing objects, with N^V4 a^V4 components set to one and the remaining N^V4 (1 − a^V4) components set to zero, where N^V4 is the number of nodes in each module and a^V4 is the sparseness of the representation, which was set to a^V4 = 0.2 in the simulations.

The structure labeled "IT" represents areas of visual association cortex such as the inferior temporal visual cortex and cortex in the anterior part of the superior temporal sulcus, in which neurons provide distributed representations of faces and objects (Booth and Rolls, 1998; Rolls, 2000). Nodes in this structure are governed by leaky integrator dynamics with time constant τ:

τ dh_i^IT(t)/dt = −h_i^IT(t) + Σ_j (w_ij^IT − c^IT) y_j^IT(t) + Σ_k w_ik^ITV4 y_k^V4(t) + k^IT_BIAS I_i^OBJ    (47)

FIGURE 32 | The architecture of the inferior temporal cortex (IT) model of Trappenberg et al. (2002), operating as an attractor network with inputs from the fovea given preferential weighting by the greater magnification factor of the fovea. The model also has a top-down object-selective bias input. The model was used to analyze how object vision and recognition operate in complex natural scenes.

The firing rate y_i^IT of the ith node is determined by a sigmoidal function of the activation h_i^IT as follows:
y_i^IT(t) = 1 / (1 + exp(−2β(h_i^IT(t) − α)))    (48)

where the parameters β = 1 and α = 1 represent the gain and the bias, respectively.

The recognition functionality of this structure is modeled as an attractor neural network (ANN) with trained memories indexed by μ representing particular objects. The memories are formed through Hebbian learning on sparse patterns,

w_ij^IT = k^IT Σ_μ (ξ_i^μ − a^IT)(ξ_j^μ − a^IT)    (49)

where k^IT (set to 1 in the simulations) is a normalization constant that depends on the learning rate, a^IT = 0.2 is the sparseness of the training patterns in IT, and ξ^μ are the components of the patterns used to train the network. The constant c^IT in equation (47) represents the strength of the activity-dependent global inhibition simulating the effects of inhibitory interneurons. The external "top-down" input vector I^OBJ produces object-selective inputs, which are used as the attentional drive when a visual search task is simulated. The strength of this object bias is modulated by the value of k^IT_BIAS in equation (47).
FIGURE 33 | Correlations, as measured by the normalized dot product, between the object vector used to train IT and the state of the IT network after settling into a stable state, with a single object in the visual scene (blank background) or with other trained objects at all possible locations in the visual scene (natural background). There is no object bias included in the results shown in graph (A), whereas an object bias is included in the results shown in (B), with k^IT_BIAS = 0.7 in the experiments with a natural background and k^IT_BIAS = 0.1 in the experiments with a blank background. (After Trappenberg et al., 2002.)

The weights w_ij^ITV4 between the V4 nodes and the IT nodes were trained by Hebbian learning of the form

w_ij^ITV4 = k^ITV4(k) Σ_μ (ξ_i^μ,IT − a^IT)(ξ_j^μ,V4 − a^V4)    (50)

to produce object representations in IT based on inputs in V4. The normalizing modulation factor k^ITV4(k) allows the gain of inputs to IT to be modulated as a function of their distance from the fovea, and depends on the module k to which the presynaptic node belongs. The model supports translation-invariant object recognition of a single object in the visual field if the normalization factor is the same for each module and the model is trained with the objects placed at every possible location in the visual field. The translation invariance of the weight vectors between each "V4" module and the IT nodes is however explicitly modulated in the model by the module-dependent modulation factor k^ITV4(k), as indicated in Figure 32 by the width of the lines connecting V4 with IT.
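The core of equations (47)–(49), with a foveal weighting of the inputs, can be sketched as a minimal simulation. This is not the Trappenberg et al. (2002) model itself: the layer size, time constants, inhibition strength, and the smooth fall-off used in place of the Dow et al. (1981) magnification-factor parameterization are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_it, a_it = 400, 0.2                       # IT layer size and sparseness
patterns = (rng.random((5, n_it)) < a_it).astype(float)

# Hebbian covariance weights, as in eq. (49), with k_IT = 1.
W = np.zeros((n_it, n_it))
for xi in patterns:
    W += np.outer(xi - a_it, xi - a_it)
np.fill_diagonal(W, 0.0)

def foveal_gain(eccentricity_deg, e0=5.0):
    """Assumed smooth fall-off standing in for the cortical
    magnification-factor parameterization of the real model."""
    return 1.0 / (1.0 + eccentricity_deg / e0)

def settle(cue, gain, beta=1.0, alpha=1.0, c_inh=0.5, tau=10.0, dt=1.0, steps=200):
    """Leaky-integrator dynamics in the spirit of eq. (47), with the
    sigmoidal rate function of eq. (48)."""
    h = np.zeros(n_it)
    ext = gain * cue                        # V4 input scaled by foveal weighting
    y = np.zeros(n_it)
    for _ in range(steps):
        y = 1.0 / (1.0 + np.exp(-2.0 * beta * (h - alpha)))      # eq. (48)
        recurrent = (W @ y) / (n_it * a_it * (1.0 - a_it))
        h += (dt / tau) * (-h + recurrent - c_inh * y.mean() + ext)
    return y

def overlap(y, xi):
    """Normalized dot product between a network state and a trained pattern."""
    return float(y @ xi) / float(xi.sum())

y_fovea = settle(patterns[0], foveal_gain(0.0))     # object at the fovea
y_periph = settle(patterns[0], foveal_gain(20.0))   # same object at 20 degrees
```

Running this, the state retrieved for a foveal object matches its trained pattern much better than the state for the same object at 20˚ eccentricity, and better than it matches any other trained object, which is the qualitative behavior behind the "blank background" versus "natural background" receptive field results discussed next.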
The strength of the foveal V4 module is strongest, and the strength decreases for modules representing increasing eccentricity. The form of this modulation factor was derived from the parameterization of the cortical magnification factors given by Dow et al. (1981).

To study the ability of the model to recognize trained objects at various locations relative to the fovea, the system was trained on a set of objects. The network was then tested with distorted versions of the objects, and the "correlation" between the target object and the final state of the attractor network was taken as a measure of the performance. The correlation was estimated from the normalized dot product between the target object vector that was used during training of the IT network and the state of the IT network after a fixed amount of time sufficient for the network to settle into a stable state. The objects were always presented on backgrounds with some noise (introduced by flipping 2% of the bits in the scene which were not the test stimulus), in order to utilize the properties of the attractor network, and because the input to IT will inevitably be noisy under normal conditions of operation.

In the first simulation, only one object was present in the visual scene, in a plain (blank) background, at different eccentricities from the fovea. As shown in Figure 33A by the line labeled "blank background," the receptive fields of the neurons were very large. The value of the object bias k^IT_BIAS was set to 0 in these simulations. Good object retrieval (indicated by large correlations) was found even when the object was far from the fovea, indicating large receptive fields with a blank background. The reason that any drop is seen in performance as a function of eccentricity is that flipping 2% of the bits outside the object introduces some noise into the recall process. This demonstrates that the attractor dynamics can support translation-invariant object recognition even though the weight vectors between V4 and IT are explicitly modulated by the modulation factor k^ITV4 derived from the cortical magnification factor.

In a second simulation, individual objects were placed at all possible locations in a natural and cluttered visual scene. The resulting correlations between the target pattern and the asymptotic IT state are shown in Figure 33A by the line labeled "natural background." Many objects in the visual scene are now competing for recognition by the attractor network, and the objects around the foveal position are enhanced through the modulation factor derived from the cortical magnification factor. This results in a much smaller size of the receptive field of IT neurons when measured with objects in natural backgrounds.

In addition to this major effect of the background on the size of the receptive field, which parallels and may account for the physiological findings outlined above and in Section 5.8.1, there is also a dependence of the size of the receptive fields on the level of object bias provided to the IT network. Examples are shown in Figure 33B, where an object bias was used. The object bias biases the IT network toward the expected object, with a strength determined by the value of k^IT_BIAS, and has the effect of increasing the size of the receptive fields in both blank and natural backgrounds (see Figure 33B compared to Figure 33A). This models the effect found neurophysiologically (Rolls et al., 2003).

Some of the conclusions are as follows (Trappenberg et al., 2002). When single objects are shown in a scene with a blank background, the attractor network helps neurons to respond to an object at large eccentricities of the object relative to the fovea of the agent. When the object is presented in a natural scene, other neurons in the inferior temporal cortex become activated by the other effective stimuli present in the visual field, and these forward inputs decrease the response of the network to the target stimulus by a competitive process.
The results found fit well with the shrinkage of the receptive field due to the complex background was neurophysiological data, in that IT operates with almost complete somewhat reduced. This is consistent with the neurophysiological translation invariance when there is only one object in the scene, results (Rolls et al., 2003). In the framework of the model (Deco and reduces the receptive field size of its neurons when the object and Rolls, 2004), the reduction of the shrinkage of the receptive is presented in a cluttered environment. The model described here field is due to the biasing of the competition in the inferior tem- provides an explanation of the responses of real IT neurons in poral cortex layer in favor of the specific IT neuron tested, so that natural scenes. it shows more translation invariance (i.e., a slightly larger recep- In natural scenes, the model is able to account for the neuro- tive field). The increase of the receptive field size of an IT neuron, physiological data that the IT neuronal responses are larger when although small, produced by the external top-down attentional the object is close to the fovea, by virtue of fact that objects close to bias offers a mechanism for facilitation of the search for specific the fovea are weighted by the cortical magnification factor related objects in complex natural scenes (Rolls, 2008b). ITV4 modulation k . I note that it is possible that a “spotlight of attention” (Desi- The model accounts for the larger receptive field sizes from the mone and Duncan, 1995) can be moved covertly away from the fovea of IT neurons in natural backgrounds if the target is the fovea (Rolls, 2008b). However, at least during normal visual search object being selected compared to when it is not selected (Rolls tasks in natural scenes, the neurons are sensitive to the object at et al., 2003). 
The model accounts for this by an effect of top-down which the monkey is looking, that is primarily to the object that bias which simply biases the neurons toward particular objects is on the fovea, as shown by Rolls et al. (2003) and Aggelopoulos compensating for their decreasing inputs produced by the decreas- and Rolls (2005), and described in Sections 1 and 9. ing magnification factor modulation with increasing distance from the fovea. Such object-based attention signals could originate in 5.9. THE REPRESENTATION OF MULTIPLE OBJECTS IN A SCENE the prefrontal cortex and could provide the object bias for the When objects have distributed representations, there is a prob- inferior temporal visual cortex (Renart et al., 2000; Rolls, 2008b). lem of how multiple objects (whether the same or different) can Important properties of the architecture for obtaining the be represented in a scene, because the distributed representa- results just described are the high magnification factor at the fovea tions overlap, and it may not be possible to determine whether and the competition between the effects of different inputs, imple- one has an amalgam of several objects, or a new object (Mozer, mented in the above simulation by the competition inherent in an 1991), or multiple instances of the same object, let alone the attractor network. relative spatial positions of the objects in a scene. Yet humans We have also been able to obtain similar results in a hierarchical can determine the relative spatial locations of objects in a scene feed-forward network where each layer operates as a competitive even in short presentation times without eye movements (Bieder- network (Deco and Rolls, 2004). This network thus captures many man, 1972; and this has been held to involve some spotlight of of the properties of our hierarchical model of invariant object attention). 
Aggelopoulos and Rolls (2005) analyzed this issue by recognition (Rolls, 1992; Wallis and Rolls, 1997; Rolls and Mil- recording from single inferior temporal cortex neurons with five ward, 2000; Stringer and Rolls, 2000, 2002; Rolls and Stringer, objects simultaneously present in the receptive field. They found 2001, 2006, 2007; Elliffe et al., 2002; Rolls and Deco, 2002; Stringer that although all the neurons responded to their effective stim- et al., 2006), but incorporates in addition a foveal magnification ulus when it was at the fovea, some could also respond to their factor and top-down projections with a dorsal visual stream so effective stimulus when it was in some but not other parafoveal that attentional effects can be studied, as shown in Figure 34. positions 10˚ from the fovea. An example of such a neuron is shown Deco and Rolls (2004) trained the network shown in Figure 34 in Figure 35. The asymmetry is much more evident in a scene with two objects, and used the trace-learning rule (Wallis and with 5 images present (Figure 35A) than when only one image is Rolls, 1997; Rolls and Milward, 2000) in order to achieve trans- shown on an otherwise blank screen (Figure 35B). Competition lation invariance. In a first experiment we placed only one object between different stimuli in the receptive field thus reveals the on the retina at different distances from the fovea (i.e., different asymmetry in the receptive field of inferior temporal visual cortex eccentricities relative to the fovea). This corresponds to the blank neurons. background condition. In a second experiment, we also placed the The asymmetry provides a way of encoding the position of object at different eccentricities relative to the fovea, but on a clut- multiple objects in a scene. Depending on which asymmetric neu- tered natural background. 
Larger receptive fields were found with rons are firing, the population of neurons provides information to the blank as compared to the cluttered natural background. the next processing stage not only about which image is present at Deco and Rolls (2004) also studied the influence of object- or close to the fovea, but where it is with respect to the fovea. based attentional top-down bias on the effective size of the recep- Simulations with VisNet with an added layer to simulate hip- tive field of an inferior temporal cortex neuron for the case of pocampal scene memory have demonstrated that receptive field an object in a blank or a cluttered background. To do this, they asymmetry appears when multiple objects are simultaneously repeated the two simulations but now considered a non-zero top- present because of the probabilistic connectivity from the preced- down bias coming from prefrontal area 46v and impinging on ing stage which introduces asymmetry, which becomes revealed Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 51 Rolls Invariant visual object recognition FIGURE 34 | Cortical architecture for hierarchical and attention-based temporal visual cortex), and is mainly concerned with object recognition. The visual perception after Deco and Rolls (2004). The system is essentially occipito-parietal stream leads dorsally into PP (posterior parietal complex), composed of five modules structured such that they resemble the two known and is responsible for maintaining a spatial map of an object’s location. The main visual paths of the mammalian visual cortex. Information from the solid lines with arrows between levels show the forward connections, and retino-geniculo-striate pathway enters the visual cortex through area V1 in the the dashed lines the top-down backprojections. Short-term memory systems occipital lobe and proceeds into two processing streams. 
The in the prefrontal cortex (PF46) apply top-down attentional bias to the object or occipital-temporal stream leads ventrally through V2–V4 and IT (inferior spatial processing streams. (After Deco and Rolls, 2004.) Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 52 Rolls Invariant visual object recognition FIGURE 35 | (A) The responses (firing rate with the spontaneous rate stimulus was presented in each position. The firing rate for each subtracted, means sem) of an inferior temporal cortex neuron when position is that when the effective stimulus (in this case the hand) for tested with 5 stimuli simultaneously present in the close (10˚) the neuron was in that position. The p value is that from the ANOVA configuration with the parafoveal stimuli located 10˚ from the fovea. calculated over the four parafoveal positions. (After Aggelopoulos and (B) The responses of the same neuron when only the effective Rolls, 2005.) by the enhanced lateral inhibition when multiple objects are The learning of invariant representations of objects when mul- presented simultaneously (Rolls et al., 2008). tiple objects are present in a scene is considered in Section 5.5.2. The information in the inferior temporal visual cortex is pro- 5.10. LEARNING INVARIANT REPRESENTATIONS USING SPATIAL vided by neurons that have firing rates that reflect the relevant information, and stimulus-dependent synchrony is not necessary CONTINUITY: CONTINUOUS SPATIAL TRANSFORMATION (Aggelopoulos and Rolls, 2005). Top-down attentional biasing LEARNING input could thus, by biasing the appropriate neurons, facilitate The temporal continuity typical of objects has been used in an bottom-up information about objects without any need to alter associative learning rule with a short-term memory trace to help the time relations between the firing of different neurons. 
The build invariant object representations in the networks described exact position of the object with respect to the fovea, and effec- previously in this paper. Stringer et al. (2006) showed that spa- tial continuity can also provide a basis for helping a system to tively thus its spatial position relative to other objects in the scene, would then be made evident by the subset of asymmetric neurons self-organize invariant representations. They introduced a new learning paradigm “continuous spatial transformation (CT) learn- firing. This is thus the solution that these experiments (Aggelopoulos ing” which operates by mapping spatially similar input patterns to the same post-synaptic neurons in a competitive learning sys- and Rolls, 2005; Rolls et al., 2008) indicate is used for the represen- tation of multiple objects in a scene, an issue that has previously tem. As the inputs move through the space of possible continuous transforms (e.g., translation, rotation, etc.), the active synapses been difficult to account for in neural systems with distributed representations (Mozer, 1991) and for which “attention” has been are modified onto the set of post-synaptic neurons. Because other a proposed solution. transforms of the same stimulus overlap with previously learned Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 53 Rolls Invariant visual object recognition exemplars, a common set of post-synaptic neurons is activated by the new transforms, and learning of the new active inputs onto the same post-synaptic neurons is facilitated. The concept is illustrated in Figure 36. During the presenta- tion of a visual image at one position on the retina that activates neurons in layer 1, a small winning set of neurons in layer 2 will modify (through associative learning) their afferent connections from layer 1 to respond well to that image in that location. 
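This competitive mapping can be illustrated with a toy simulation (a minimal sketch with illustrative layer sizes and learning rate, not the actual VisNet implementation): an output layer trained with winner-take-all competition and a purely associative rule comes to respond with the same winning neuron as a stimulus drifts across overlapping retinal positions.

```python
import random

# Minimal sketch of continuous spatial transformation (CT) learning:
# one layer of feed-forward connections trained by winner-take-all
# competition plus associative strengthening. All sizes and the
# learning rate are illustrative assumptions, not values from VisNet.

N_IN, N_OUT = 12, 4      # input "retina" cells, output layer cells
LEARNING_RATE = 2.0

random.seed(1)
# w[j][i]: initially random weight from input cell i to output cell j
w = [[random.random() for _ in range(N_IN)] for _ in range(N_OUT)]

def stimulus(position, width=4):
    """Binary input vector: a bar of `width` active cells starting at `position`."""
    return [1.0 if position <= i < position + width else 0.0 for i in range(N_IN)]

def winner(x):
    """Index of the output cell with the largest activation (the competition winner)."""
    acts = [sum(wj[i] * xi for i, xi in enumerate(x)) for wj in w]
    return max(range(N_OUT), key=lambda j: acts[j])

def present_and_learn(x):
    """Associative (Hebbian) strengthening of the winner's active synapses."""
    j = winner(x)
    for i, xi in enumerate(x):
        if xi > 0:
            w[j][i] += LEARNING_RATE
    return j

# Present the stimulus at successive overlapping positions (a shift of one
# cell, so most active inputs are shared between consecutive presentations).
winners = [present_and_learn(stimulus(p)) for p in range(6)]
print(winners)  # the same output cell wins at every shifted position
```

Because consecutive transforms share most of their active inputs, the synapses strengthened at one position are enough to make the same output cell win at the next; note that only the spatial overlap matters here, not the temporal order of the presentations, which is why interleaving other stimuli during training does not disrupt CT learning.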
When the same image appears later at nearby locations, so that there is spatial continuity, the same neurons in layer 2 will be activated because some of the active afferents are the same as when the image was in the first position. The key point is that if these afferent connections have been strengthened sufficiently while the image is in the first location, then these connections will be able to continue to activate the same neurons in layer 2 when the image appears in overlapping nearby locations. Thus the same neurons in the output layer have learned to respond to inputs that have similar vector elements in common.

FIGURE 36 | An illustration of how continuous spatial transformation (CT) learning would function in a network with a single layer of forward synaptic connections between an input layer of neurons and an output layer. Initially the forward synaptic weights are set to random values. The top part (A) shows the initial presentation of a stimulus to the network in position 1. Activation from the (shaded) active input cells is transmitted through the initially random forward connections to stimulate the cells in the output layer. The shaded cell in the output layer wins the competition in that layer. The weights from the active input cells to the active output neuron are then strengthened using an associative learning rule. The bottom part (B) shows what happens after the stimulus is shifted by a small amount to a new partially overlapping position 2. As some of the active input cells are the same as those that were active when the stimulus was presented in position 1, the same output cell is driven by these previously strengthened afferents to win the competition again. The rightmost shaded input cell activated by the stimulus in position 2, which was inactive when the stimulus was in position 1, now has its connection to the active output cell strengthened (denoted by the dashed line). Thus the same neuron in the output layer has learned to respond to the two input patterns that have similar vector elements in common. As can be seen, the process can be continued for subsequent shifts, provided that a sufficient proportion of input cells stay active between individual shifts. (After Stringer et al., 2006.)

As can be seen in Figure 36, the process can be continued for subsequent shifts, provided that a sufficient proportion of input cells stay active between individual shifts. This whole process is repeated throughout the network, both horizontally as the image moves on the retina, and hierarchically up through the network. Over a series of stages, transform-invariant (e.g., location-invariant) representations of images are successfully learned, allowing the network to perform invariant object recognition. A similar CT learning process may operate for other kinds of transformation, such as change in view or size.

Stringer et al. (2006) demonstrated that VisNet can be trained with continuous spatial transformation learning to form view-invariant representations. They showed that CT learning requires the training transforms to be relatively close together spatially, so that spatial continuity is present in the training set; and that the order of stimulus presentation is not crucial, with even interleaving with other objects possible during training, because it is spatial continuity rather than temporal continuity that drives the self-organizing learning with the purely associative synaptic modification rule.

Perry et al. (2006) extended these simulations of view-invariant learning with VisNet using CT to more complex 3D objects, and, using the same training images in human psychophysical investigations, showed that view-invariant object learning can occur when spatial but not temporal continuity applies, in a training condition in which the images of different objects were interleaved. However, they also found that the human view-invariance learning was better if sequential presentation of the images of an object was used, indicating that temporal continuity is an important factor in human invariance learning.

Perry et al. (2010) extended the use of continuous spatial transformation learning to translation invariance. They showed that translation-invariant representations can be learned by continuous spatial transformation learning; that the transforms must be close for this to occur; that the temporal order of presentation of each transformed image during training is not crucial for learning to occur; that relatively large numbers of transforms can be learned; and that such continuous spatial transformation learning can be usefully combined with temporal trace training.

5.11. LIGHTING INVARIANCE
Object recognition should occur correctly even despite variations of lighting. In an investigation of this, Rolls and Stringer (2006) trained VisNet on a set of 3D objects generated with OpenGL in which the viewing angle and lighting source could be independently varied (see Figure 37). After training with the trace rule on all 180 views (separated by 1˚, and rotated about the vertical axis in Figure 37) of each of the four objects under the left lighting condition, we tested whether the network would recognize the objects correctly when they were shown again, but with the source of the lighting moved to the right so that the objects appeared different (see Figure 37). With this protocol, lighting-invariant object recognition by VisNet was demonstrated (Rolls and Stringer, 2006).

FIGURE 37 | Lighting invariance. VisNet was trained on a set of 3D objects (cube, tetrahedron, octahedron, and torus) generated with OpenGL in which for training the objects had left lighting, and for testing the objects had right lighting. Just one view of each object is shown in the Figure, but for training and testing 180 views of each object separated by 1˚ were used. (After Rolls and Stringer, 2006.)

Some insight into the good performance with a change of lighting is that some neurons in the inferior temporal visual cortex respond to the outlines of 3D objects (Vogels and Biederman, 2002), and these outlines will be relatively consistent across lighting variations. Although the features of the object represented in VisNet will include more than the representations of the outlines, the network may, because it uses distributed representations of each object, generalize correctly provided that some of the features are similar to those present during training. Under very difficult lighting conditions, it is likely that the performance of the network could be improved by including variations in the lighting during training, so that the trace rule could help to build representations that are explicitly invariant with respect to lighting.

5.12. INVARIANT GLOBAL MOTION IN THE DORSAL VISUAL SYSTEM
A key issue in understanding the cortical mechanisms that underlie motion perception is how we perceive the motion of objects such as a rotating wheel invariantly with respect to position on the retina, and size. For example, we perceive the wheel shown in Figure 38A rotating clockwise independently of its position on the retina. This occurs even though the local motion for the wheels in the different positions may be opposite. How could this invariance of the visual motion perception of objects arise in the visual system?

Invariant motion representations are known to be developed in the cortical dorsal visual system. Motion-sensitive neurons in V1 have small receptive fields (in the range 1–2˚ at the fovea), and can therefore not detect global motion; this is part of the aperture problem (Wurtz and Kandel, 2000b). Neurons in MT, which receives inputs from V1 and V2, have larger receptive fields (e.g., 5˚ at the fovea), and are able to respond to planar global motion, such as a field of small dots in which the majority (in practice as few as 55%) move in one direction, or to the overall direction of a moving plaid, the orthogonal grating components of which have motion at 45˚ to the overall motion (Movshon et al., 1985; Newsome et al., 1989). Further on in the dorsal visual system, some neurons in macaque visual area MST (but not MT) respond to rotating flow fields or looming with considerable translation invariance (Graziano et al., 1994; Geesaman and Andersen, 1996). In the cortex in the anterior part of the superior temporal sulcus, which is a convergence zone for inputs from the ventral and dorsal visual systems, some neurons respond to object-based motion, for example to a head rotating clockwise but not anticlockwise, independently of whether the head is upright or inverted, which reverses the optic flow across the retina (Hasselmo et al., 1989b).

In a unifying hypothesis with the design of the ventral cortical visual system, Rolls and Stringer (2007) proposed that the dorsal visual system uses a hierarchical feed-forward network architecture (V1, V2, MT, MSTd, parietal cortex) with training of the connections with a short-term memory trace associative synaptic modification rule to capture what is invariant at each stage. The principle is illustrated in Figure 38A. Simulations showed that the proposal is computationally feasible, in that invariant representations of the motion flow fields produced by objects self-organize in the later layers of the architecture (see examples in Figures 38B–E). The model produces invariant representations of the motion flow fields produced by global in-plane motion of an object, by in-plane rotational motion, and by looming vs receding of the object. The model also produces invariant representations of object-based rotation about a principal axis. Thus it is proposed that the dorsal and ventral visual systems may share some unifying computational principles (Rolls and Stringer, 2007). Indeed, the simulations of Rolls and Stringer (2007) used a standard version of VisNet, with the exception that instead of using oriented bar receptive fields as the input to the first layer, local motion flow fields provided the inputs.

FIGURE 38 | (A) Two rotating wheels at different locations rotating in opposite directions. The local flow field is ambiguous. Clockwise or counterclockwise rotation can only be diagnosed by a global flow computation, and it is shown how the network is expected to solve the problem to produce position-invariant global motion-sensitive neurons. One rotating wheel is presented at any one time, but the need is to develop a representation of the fact that in the case shown the rotating flow field is always clockwise, independently of the location of the flow field. (B–D) Translation invariance, with training on 9 locations. (B) Single cell information measures showing that some layer 4 neurons have perfect performance of 1 bit (clockwise vs anticlockwise) after training with the trace rule, but not with random initial synaptic weights in the untrained control condition. (C) The multiple cell information measure shows that small groups of neurons have perfect performance. (D) Position invariance illustrated for a single cell from layer 4, which responded only to the clockwise rotation, and for every one of the 9 positions. (E) Size invariance illustrated for a single cell from layer 4, which after training with three different radii of rotating wheel responded only to anticlockwise rotation, independently of the size of the rotating wheels. (After Rolls and Stringer, 2007.)

6. LEARNING INVARIANT REPRESENTATIONS OF SCENES AND PLACES
The primate hippocampal system has neurons that respond to a view of a spatial scene, or when that location in a scene is being looked at in the dark or when it is obscured (Rolls et al., 1997a, 1998; Robertson et al., 1998; Georges-François et al., 1999; Rolls and Xiang, 2006; Rolls, 2008b). The representation is relatively invariant with respect to the position of the macaque in the environment, and to head direction and eye position. The requirement for these spatial view neurons is that a position in the spatial scene is being looked at. (There is an analogous set of place neurons in the rat hippocampus that respond in this case when the rat is in a given position in space, relatively invariantly with respect to head direction (McNaughton et al., 1983; O'Keefe, 1984; Muller et al., 1991).) How might these spatial view neurons be set up in primates?

Before addressing this, it is useful to consider the difference between a spatial view or scene representation and an object representation. An object can be moved to different places in space or in a spatial scene; an example is a motor car that can be moved to different places in space. The object is defined by a combination of features or parts in the correct relative spatial positions, but its representation is independent of where it is in space. In contrast, a representation of space has objects in defined relative spatial positions, which cannot be moved relative to one another in space. An example might be Trafalgar Square, in which Nelson's column is in the middle, and the National Gallery and St Martin-in-the-Fields church are at set relative locations in space, and cannot be moved relative to one another. This draws out the point that there may be some computational similarities between the construction of an object and of a scene or a representation of space, but there are also important differences in how they are used. In the present context we are interested in how the brain may set up a spatial view representation in which the relative positions of the objects in the scene define the spatial view. That spatial view representation may be relatively invariant with respect to the exact position from which the scene is viewed (though extensions are needed if there are central objects in a space through which one moves).

It is now possible to propose a unifying hypothesis of the relation between the ventral visual system and primate hippocampal spatial view representations (Rolls, 2008b; Rolls et al., 2008). Let us consider a computational architecture in which a fifth layer is added to the VisNet architecture, as illustrated in Figure 39. In the anterior inferior temporal visual cortex, which corresponds to the fourth layer of VisNet, neurons respond to objects, but several objects close to the fovea (within approximately 10˚) can be represented because many object-tuned neurons have asymmetric receptive fields with respect to the fovea (Aggelopoulos and Rolls, 2005; see Section 5.9). If the fifth layer of VisNet performs the same operation as previous layers, it will form neurons that respond to combinations of objects in the scene, with the positions of the objects relative spatially to each other incorporated into the representation (as described in Section 5.4). The result will be spatial view neurons in the case of primates, where the visual field of the primate has a narrow focus (due to the high-resolution fovea), and place cells when, as in the rat, the visual field is very wide (De Araujo et al., 2001; Rolls, 2008b). The trace-learning rule in layer 5 should help the spatial view or place fields that develop to be large and single, because of the temporal continuity that is inherent when the agent moves from one part of the view or place space to another, in the same way as has been shown for the entorhinal grid cell to hippocampal place cell mapping (Rolls et al., 2006b; Rolls, 2008b).

FIGURE 39 | Adding a fifth layer, corresponding to the parahippocampal gyrus/hippocampal system, after the inferior temporal visual cortex (corresponding to layer 4) may lead to the self-organization of spatial view/place cells in layer 5 when whole scenes are presented (see text). Left – as implemented in VisNet (layers 1–4). Convergence through the network is designed to provide fourth layer neurons with information from across the entire input retina. Right – as it occurs in the brain. Convergence in the visual system is shown in the earlier layers. V1, visual cortex area V1; TEO, posterior inferior temporal cortex; TE, inferior temporal cortex (IT).

The hippocampal dentate granule cells form a network expected to be important in this competitive learning of spatial view or place representations based on visual inputs. As the animal navigates through the environment, different spatial view cells would be formed. Because of the overlapping fields of adjacent spatial view neurons, and hence their coactivity as the animal navigates, recurrent collateral associative connections at the next stage of the system, CA3, could form a continuous attractor representation of the environment (Rolls, 2008b). We thus have a hypothesis for how the spatial representations are formed as a natural extension of the hierarchically organized competitive networks in the ventral visual system. The expression of such spatial representations in CA3 may be particularly useful for associating those spatial representations with other inputs, such as objects or rewards (Rolls, 2008b).

We have performed simulations to test this hypothesis with VisNet simulations with conceptually a fifth layer added (Rolls et al., 2008). Training now with whole scenes that consist of a set of objects in a given fixed spatial relation to each other results in neurons in the added layer that respond to one of the trained whole scenes, but do not respond if the objects in the scene are rearranged to make a new scene from the same objects. The formation of these scene-specific representations in the added layer is related to the fact that in the inferior temporal cortex (Aggelopoulos and Rolls, 2005), and in the VisNet model (Rolls et al., 2008), the receptive fields of inferior temporal cortex neurons shrink and become asymmetric when multiple objects are present simultaneously in a natural scene. This also provides a solution to the issue of the representation of multiple objects, and their relative spatial positions, in complex natural scenes (Rolls, 2008b).

Consistently, in a more artificial network trained by gradient ascent with a goal function that included forming relatively time-invariant representations and decorrelating the responses of neurons within each layer of the 5-layer network, place-like cells were formed at the end of the network when the system was trained with a real or simulated robot moving through spatial environments (Wyss et al., 2006), and slowness as an asset in learning spatial representations has also been investigated by others (Wiskott and Sejnowski, 2002; Wiskott, 2003; Franzius et al., 2007). It will be interesting to test whether spatial view cells develop in a VisNet fifth layer if trained with foveate views of the environment, or place cells if trained with wide-angle views of the environment (cf. De Araujo et al., 2001), and the utility of testing this with a VisNet-like architecture is that it embodies a biologically plausible implementation based on neuronally plausible competitive learning and a short-term memory trace-learning rule.

It is an interesting part of the hypothesis just described that, because spatial views and places are defined by the relative spatial positions of fixed landmarks (such as buildings), slow learning of such representations over a number of trials might be useful, so that the neurons come to represent spatial views or places, and do not learn to represent a random collection of moveable objects seen once in conjunction. In this context, an alternative brain region to the dentate gyrus for this next layer of VisNet-like processing might be the parahippocampal areas that receive from the inferior temporal visual cortex. Spatial view cells are present in the parahippocampal areas (Rolls et al., 1997a, 1998, 2005b; Robertson et al., 1998; Georges-François et al., 1999), and neurons with place-like fields (though in some cases as a grid, Hafting et al., 2005) are found in the rat medial entorhinal cortex (Moser and Moser, 1998; Brun et al., 2002; Fyhn et al., 2004; Moser, 2004). These spa… entorhinal cortex and thus to parahippocampal areas. In either case, it is an interesting and unifying hypothesis that an effect of adding an additional layer to VisNet-like ventral stream visual processing might, with training in a natural environment, lead to the self-organization, using the same principles as in the ventral visual stream, of spatial view or place representations in parahippocampal or hippocampal areas (Rolls, 2008b; Rolls et al., 2008). Such spatial view representations are relatively invariant with respect to the position from which the scene is viewed (Georges-François et al., 1999), but are selective to the relative spatial positions of the objects that define the spatial view (Rolls, 2008b; Rolls et al., 2008).

7. FURTHER APPROACHES TO INVARIANT OBJECT RECOGNITION
A related approach to invariant object recognition is described by Riesenhuber and Poggio (1999b), and builds on the hypothesis that not just shift invariance (as implemented in the Neocognitron of Fukushima (1980)), but also other invariances such as scale, rotation, and even view, could be built into a feature hierarchy system, as suggested by Rolls (1992) and incorporated into VisNet (Wallis et al., 1993; Wallis and Rolls, 1997; Rolls and Milward, 2000; Rolls and Stringer, 2007; Rolls, 2008b; see also Perrett and Oram, 1993). The approach of Riesenhuber and Poggio (1999b) and its developments (Riesenhuber and Poggio, 1999a, 2000; Serre et al., 2007a,b,c) is a feature hierarchy approach that uses alternate "simple cell" and "complex cell" layers in a way analogous to Fukushima (1980; see Figure 40).

The function of each S cell layer is to build more complicated features from the inputs, and it works by template matching. The function of each "C" cell layer is to provide some translation invariance over the features discovered in the preceding simple cell layer (as in Fukushima, 1980), and it operates by performing a MAX function on the inputs. The non-linear MAX function makes a complex cell respond only to whatever is the highest-activity input being received, and is part of the process by which invariance is achieved according to this proposal. This C layer process involves "implicitly scanning over afferents of the same type differing in the parameter of the transformation to which responses should be invariant (for instance, feature size for scale invariance), and then selecting the best-matching afferent" (Riesenhuber and Poggio, 1999b). Brain mechanisms by which this computation could be set up are not part of the scheme, and the model does not incorporate learning in its architecture, so it does not yet provide a biologically plausible model of invariant object recognition. The model receives as its inputs a set of symmetric spatial-frequency filters that are closely spaced in spatial-frequency, and maps these through pairs of convergence followed by MAX function layers, without learning. Whatever output appears in the final layer is then tested with a support vector machine to measure how well the output can be used by this very powerful subsequent learning stage to categorize different types of image.
Whether that is a good tial view and place-like representations could be formed in these test of invariance learning is a matter for discussion (Pinto et al., regions as, effectively, an added layer to VisNet. Moreover, these 2008; see Section 8). The approach taken in VisNet is that instead cortical regions have recurrent collateral connections that could of using a benchmark test of image exemplars from which to learn implement a continuous attractor representation. Alternatively, categories (Serre et al., 2007a,b,c), instead VisNet is trained to gen- it is possible that these parahippocampal spatial representations eralize across transforms of objects that provide the training set. reflect the effects of backprojections from the hippocampus to the However, the fact that the model of Poggio, Riesenhuber, Serre and Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 58 Rolls Invariant visual object recognition FIGURE 40 | Sketch of Riesenhuber and Poggio’s (1999a,b) model of invariant object recognition. The model includes layers of “S” cells which perform template matching (solid lines), and “C” cells (solid lines) which pool information by a non-linear MAX function to achieve invariance (see text). (After Riesenhuber and Poggio, 1999a,b.) colleagues does use a hierarchical approach to object recognition developed to visual object recognition and vision navigation for does represent useful convergent thinking toward how invariant off-road mobile robots. Ullman has considered the use of features object recognition may be implemented in the brain. Similarly, in a hierarchy to help with processes such as segmentation and the approach of training a five-layer network with a more artificial object recognition (Ullman, 2007). 
gradient ascent approach with a goal function that does how- Another approach to the implementation of invariant represen- ever include forming relatively time invariant representations and tations in the brain is the use of neurons with Sigma-Pi synapses. decorrelating the responses of neurons within each layer (Wyss Sigma-Pi synapses effectively allow one input to a synapse to be et al., 2006; both processes that have their counterpart in VisNet), multiplied or gated by a second input to the synapse (Rolls, 2008b). also reflects convergent thinking. The multiplying input might gate the appropriate set of the other Further evidence consistent with the approach developed in the inputs to a synapse to produce the shift or scale change required. investigations of VisNet described in this paper comes from psy- For example, the multiplying input could be a signal that varies chophysical studies. Wallis and Bülthoff (1999) and Perry et al. with the shift required to compute translation invariance, effec- (2006) describe psychophysical evidence for learning of view- tively mapping the appropriate set of x inputs through to the invariant representations by experience, in that the learning can output neurons depending on the shift required (Olshausen et al., be shown in special circumstances to be affected by the temporal 1993, 1995; Mel et al., 1998; Mel and Fiser, 2000). Local opera- sequence in which different views of objects are seen. tions on a dendrite could be involved in such a process (Mel et al., Another related approach, from the machine learning area, is 1998). The explicit neural implementation of the gating mecha- that of convolutional networks. Convolutional Networks are a bio- nism seems implausible, given the need to multiply and thus remap logically inspired trainable architecture that can learn invariant large parts of the retinal input depending on shift and scale modi- features. 
Each stage in a ConvNet is composed of a filter bank, some fying connections to a particular set of output neurons. Moreover, non-linearities, and feature pooling layers. With multiple stages, a the explicit control signal to set the multiplication required in V1 ConvNet can learn multi-level hierarchies of features (LeCun et al., has not been identified. Moreover, if this was the solution used by 2010). Non-linearities that include rectification and local contrast the brain, the whole problem of shift and scale invariance could normalization are important in such systems (Jarrett et al., 2009; in principle be solved in one-layer of the system, rather than with and are of course properties of VisNet). Applications have been the multiple hierarchically organized set of layers actually used Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 59 Rolls Invariant visual object recognition in the brain, as shown schematically in Figure 1. The multiple- approach to the Caltech-256, and instead of focusing on a set of layers actually used in the brain are much more consistent with natural images within a category, provides images with a system- the type of scheme incorporated in VisNet. Moreover, if a multi- atic variation of pose and illumination for 1,000 small objects. plying system of the type hypothesized by Olshausen et al. (1993), Each object is placed onto a turntable and photographed in con- Mel et al. (1998), and Olshausen et al. (1995) was implemented sistent conditions at 5˚ increments, resulting in a set of images in a multilayer hierarchy with the shift and scale change emerging that not only show the whole object (with regard to out of plane gradually, then the multiplying control signal would need to be rotations), but does so with some continuity from one image to supplied to every stage of the hierarchy. A further problem with the next (see examples in Figure 42). 
such approaches is how the system is trained in the first place. 8.2. THE HMAX MODELS USED FOR COMPARISON WITH VISNETL The performance of VisNetL was compared against a standard 8. MEASURING THE CAPACITY OF VisNet HMAX model (Serre et al., 2007b,c; Mutch and Lowe, 2008), and For a theory of the brain mechanisms of invariant object recog- a HMAX model scaled down to have a comparable complexity (in nition, it is important that the system should scale up, so that terms, for example, of the number of neurons) to that of VisNetL. if a model such as VisNet was the size of the human visual sys- The scaled down HMAX model is referred to as HMAX_min. The tem, it would have comparable performance. Most of the research current HMAX family models have in the order of 10 million com- with VisNet to date has focused on the principles of operation of putational units (Serre et al., 2007b), which is at least 100 times the the system, and what aspects of invariant object recognition the number contained within the current implementation of VisNetL model can solve (Rolls, 2008b). In this section I consider how the (which uses 128 128 neurons in each of 4 layers, i.e., 65,536 system performs in its scaled up version (VisNetL, with 128 128 neurons). In producing HMAX_min, we aimed to maintain the neurons in each of 4 layers). I compare the capacity of VisNetL architectural features of HMAX, and primarily to scale it down. with that of another model, HMAX, as that has been described HMAX_min is based upon the “base” implementation of Mutch as competing with state of the art systems (Serre et al., 2007a,b,c; and Lowe (2008) . The minimal version used in the comparisons Mutch and Lowe, 2008), and I raise interesting issues about how differs from this base HMAX implementation in two significant to measure the capacity of systems for invariant object recognition ways. First, HMAX_min has only 4 scales compared to the 10 in natural scenes. scales of HMAX. 
(Care was taken to ensure that HMAX_min still The tests (performed by L. Robinson of the Department of covered the same image size range – 256, 152, 90, and 53 pixels.) Computer Science, University of Warwick, UK and E. T. Rolls) Second, the number of distinct units in the S2 “template matching” utilized a benchmark approach incorporated in the work of Serre, layer was limited to only 25 in HMAX_min, compared to 2,000 in Mutch, Poggio and colleagues (Serre et al., 2007b,c; Mutch and HMAX. This results in a scaled down model HMAX_min, with Lowe, 2008) and indeed typical of many standard approaches in approximately 12,000 units in the C1 layer, 75,000 units in the S2 computer vision. This uses standard datasets such as the Caltech- layer, and 25 in the upper C2 layer, which is much closer to the 256 (Griffin et al., 2007) in which sets of images from different 65,536 neurons of VisNetL. (The 75,000 units in S2 allow for every categories are to be classified. C2 neuron to be connected by its own weight to a C1 neuron.; When counting the number of neurons in the models, the num- 8.1. OBJECT BENCHMARK DATABASES ber of neurons in S1 is not included, as they just provide the inputs The Caltech-256 dataset (Griffin et al., 2007) is comprised of 256 to the models.) object classes made up of images that have many aspect ratios, sizes and differ quite significantly in quality (having being manually 8.3. PERFORMANCE ON A CALTECH-256 TEST collated from web searches). The objects within the images show VisNetL and the two HMAX models were trained to discrimi- significant intra-class variation and have a variety of poses, illumi- nate between two object classes from the Caltech-256 database, nation, scale, and occlusion as expected from natural images (see the teddy-bear and cowboy-hat (see examples in Figure 41). Sixty examples in Figure 41). 
In this sense, the Caltech-256 database is image examples of each class were rescaled to 256 256 and considered to be a difficult challenge to object recognition systems. converted to gray-scale, so that shape recognition was being inves- I come to the conclusion below that the benchmarking approach tigated. The 60 images from each class were randomly partitioned with this type of dataset is not useful for training a system that into training and testing sets, with the training set size ranging must learn invariant object representations. The reason for this is over 1, 5, 15 and 30 images, and the corresponding testing set being that the exemplars of each category in the Caltech-256 dataset are the remainder of the 60 images in the cross-validation design. A too discontinuous to provide a basis for learning invariant object linear support vector machine (libSVM, Chang and Lin, 2011) representations. For example, the exemplars within a category in approach operating on the output of layer 4 of VisnetL was used these datasets may be very different indeed. to compare the categorization of the trained images with that of Partly because of the limitations of the Caltech-256 database the test images, as that is the approach used by HMAX (Serre et al., for training in invariant object recognition, we also investigated 2007b,c; Mutch and Lowe, 2008). The standard default parameters training with the Amsterdam Library of Images (ALOI; Geuse- of the support vector machine were used in identical form for the broek et al., 2005) database . The ALOI database takes a different VisNetL and HMAX tests. 1 2 http://staff.science.uva.nl/aloi/ http://cbcl.mit.edu/jmutch/cns/index.html Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 60 Rolls Invariant visual object recognition FIGURE 41 | Example images from the Caltech-256 database for two object classes, teddy-bears and cowboy-hats. 
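The cross-validation protocol described above (60 images per class, training-set sizes of 1, 5, 15, and 30, with the remainder held out for testing) can be sketched in a few lines. The following is an illustrative stand-in, not the published pipeline: synthetic Gaussian feature vectors take the place of VisNetL layer-4 firing rates, and a nearest-centroid classifier takes the place of the linear SVM (libSVM) used in the actual tests; the function names `make_features` and `run_fold` are hypothetical.

```python
import random

def make_features(n_per_class, centers, noise, rng):
    """Synthetic stand-in for layer-4 activity vectors: one Gaussian
    cloud of feature vectors per object class."""
    data = []
    for label, center in enumerate(centers):
        for _ in range(n_per_class):
            data.append(([c + rng.gauss(0.0, noise) for c in center], label))
    return data

def run_fold(data, n_train, rng):
    """Partition each class into n_train training images and held-out test
    images, then classify the test images by nearest class centroid
    (the published tests used a linear SVM with default parameters here)."""
    by_class = {}
    for vec, label in data:
        by_class.setdefault(label, []).append(vec)
    centroids, test = {}, []
    for label, vecs in by_class.items():
        vecs = vecs[:]
        rng.shuffle(vecs)                      # random train/test partition
        train, held_out = vecs[:n_train], vecs[n_train:]
        dim = len(train[0])
        centroids[label] = [sum(v[i] for v in train) / n_train for i in range(dim)]
        test.extend((v, label) for v in held_out)

    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    correct = sum(
        min(centroids, key=lambda l: dist2(vec, centroids[l])) == label
        for vec, label in test
    )
    return correct / len(test)

rng = random.Random(0)
# two well-separated classes, 60 "images" each, 8-dimensional features
data = make_features(60, [[0.0] * 8, [3.0] * 8], noise=1.0, rng=rng)
accuracies = {n: run_fold(data, n, rng) for n in (1, 5, 15, 30)}
```

With a real feature extractor in place of the synthetic vectors, the nearest-centroid step would be replaced by a linear SVM trained on the layer-4 output, as in the comparisons reported here.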
FIGURE 42 | Example images from the two object classes within the ALOI database, (A) 90 (rubber duck) and (B) 93 (black shoe). Only the 45˚ increments are shown.

Figure 43 shows the performance of all three models when performing the task with the Caltech-256 dataset. It is clear that VisNetL performed better than HMAX_min as soon as there were reasonable numbers of training images, and this was confirmed statistically using the Chi-square test. It is also shown that the full HMAX model (as expected, given its very large number of neurons) exhibits higher performance than that of VisNetL and HMAX_min.

8.4. PERFORMANCE WITH THE AMSTERDAM LIBRARY OF IMAGES

Eight classes of object (with designations 36, 90, 93, 103, 138, 156, 203, 161) from the dataset were chosen (see Figure 42, for example). Each class comprises 72 images taken at 5˚ increments through the full 360˚ out-of-plane rotation. Three sets of training images were used. (1) Three training images per class were taken at 315, 0, and 45˚. (2) Eight training images encompassing the entire rotation of the object were taken in 45˚ increments. (3) Eighteen training images, also encompassing the entire rotation of the object, were taken in 20˚ increments. The testing set consisted, for each object, of the remaining orientations from the set of 72 that were not present in the particular training set. The aim of using the different training sets was to investigate how close in viewing angle the training images need to be, and also to investigate the effects of using different numbers of training images.

Figure 44 shows that VisNetL performed better than HMAX_min as soon as there were even a few training images, with HMAX as expected performing better. VisNetL performed almost as well as the very much larger HMAX as soon as there were reasonable numbers of training images.

What VisNetL can do here is to learn view-invariant representations using its trace-learning rule to build feature analyzers that reflect the similarity across at least adjacent views of the training set. Very interestingly, with 8 training images the view spacing of the training images was 45˚, and the test images in the cross-validation design were the intermediate views, 22.5˚ away from the nearest trained view. This is promising, for it shows that enormous numbers of training images with many different closely spaced views are not necessary for VisNetL. Even 8 training views spaced 45˚ apart produced reasonable training.

8.5. INDIVIDUAL LAYER PERFORMANCE

To test whether the VisNet hierarchy is actually performing useful computations with these datasets, the simulations were re-run,

FIGURE 43 | Performance of VisNetL, HMAX, and HMAX_min on the classification task using the Caltech-256 dataset. The error bars show the standard error of the means over 5 cross-validation trials with different images chosen at random for the training set on each trial. It is clear that VisNetL performs better than HMAX_min, and this was confirmed statistically using the Chi-square test performed with 30 training images and 30 cross-validation test images in each of two categories (Chi-square = 8.09, df = 1, p = 0.0025).

FIGURE 44 | Performance of VisNetL, HMAX_min, and HMAX on the classification task with 8 classes using the Amsterdam Library of Images dataset. It is clear that VisNetL performs better than HMAX_min, and this was confirmed statistically using the Chi-square test performed with 18 training images 20˚ apart in view and 54 cross-validation testing images 5˚ apart in each of eight categories (Chi-square = 110.58, df = 1, p = 10^−…).
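The Chi-square comparisons reported in the figure legends above are Pearson tests on a 2 × 2 table of correct vs. incorrect cross-validation classifications from two models (df = 1). A minimal sketch of that statistic, using illustrative counts rather than the published data:

```python
def chi_square_2x2(correct_a, wrong_a, correct_b, wrong_b):
    """Pearson Chi-square statistic (df = 1) for a 2x2 table of correct
    vs. incorrect test classifications from two models."""
    table = [[correct_a, wrong_a], [correct_b, wrong_b]]
    total = sum(map(sum, table))
    row_sums = [sum(r) for r in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # expected count under independence of model and outcome
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# e.g., model A correct on 30/40 test images, model B correct on 20/40
stat = chi_square_2x2(30, 10, 20, 20)
```

The resulting statistic would then be compared against the Chi-square distribution with 1 degree of freedom to obtain the p-value.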
though this time instead of only training the SVM on the activity generated in the final layer, four identical SVMs were trained independently on the activities of each of the four layers. If the VisNet hierarchy is actually forming useful representations with these datasets, then we should see the discriminatory power of SVMs trained on each layer increase as we traverse the hierarchy.

When the Caltech-256 dataset was used to train VisNetL, there was very little difference in the measured performance of classifiers trained on each layer. This is revealing, for it shows that the Caltech-256 dataset does not have sufficient similarity between the exemplars within a given class for the trace-learning rule utilized in VisNet to perform useful learning. Thus, at least with a convergent feature hierarchy network trained in this way, there is insufficient similarity and information in the exemplars of each category of the Caltech-256 to learn to generalize in a view-invariant way to further exemplars of that category.

In contrast, when the ALOI dataset was used to train VisNetL, the later layers performed better (layer 2, 72% correct; layer 3, 84% correct; layer 4, 86% correct; p < 0.001). Thus there is sufficient continuity in the images in the ALOI dataset to support view-invariance learning in this feature hierarchy network.

8.6. EVALUATION

One conclusion is that VisNetL performs comparably to a scaled-down version of HMAX on benchmark tests. This is reassuring, for HMAX has been described as competing with state-of-the-art systems (Serre et al., 2007a,b,c; Mutch and Lowe, 2008).

A second conclusion is that image databases such as the Caltech-256 that are used to test the performance of object recognition systems (Serre et al., 2007a,b,c; Mutch and Lowe, 2008; and in many computer vision approaches) are inappropriate as training sets for systems that perform invariant visual object recognition. Instead, for such systems, it will be much more relevant to train on image sets in which the image exemplars within a class show much more continuous variation. This provides the system with the opportunity to learn invariant representations, instead of just doing its best to categorize images into classes from relatively limited numbers of images that do not allow the system to learn the rules of the transforms that objects undergo in the real world, and that can be used to help object recognition when objects may be seen from different views. This is an important conclusion for research in the area. Consistently, others are realizing that invariant visual object recognition is a hard problem (Pinto et al., 2008; DiCarlo et al., 2012). In this context, the hypotheses presented in this paper are my theory of how invariant visual object recognition is performed by the brain (Rolls, 1992, 2008b), and the model VisNet tests those hypotheses and provides a model for how invariant visual object representations can be learned (Rolls, 2008b).

Third, the findings described here are encouraging with respect to training view-invariant representations, in that the training images with the ALOI dataset could be separated by as much as 45˚ and still provide for view-invariant object recognition with cross-validation images that were never closer than 22.5˚ to a training image. This is helpful, for it is an indication that large numbers of different views will not need to be trained with the VisNet architecture in order to achieve good view-invariant object recognition.

9. DIFFERENT PROCESSES INVOLVED IN DIFFERENT TYPES OF OBJECT IDENTIFICATION

To conclude this paper, it is proposed that there are (at least) three different types of process that could be involved in object identification. The first is the simple situation where different objects can be distinguished by different non-overlapping sets of features (see Section 3.1). An example might be a banana and an orange, where the list of features of the banana might include yellow, elongated, and smooth surface; and of the orange its orange color, round shape, and dimpled surface. Such objects could be distinguished just on the basis of a list of the properties, which could be processed appropriately by a competitive network, pattern associator, etc. No special mechanism is needed for view-invariance, because the list of properties is very similar from most viewing angles. Object recognition of this type may be common in animals, especially those with visual systems less developed than those of primates. However, this approach does not describe the shape and form of objects, and is insufficient to account for primate vision. Nevertheless, the features present in objects are valuable cues to object identity, and are naturally incorporated into the feature hierarchy approach.

A second type of process might involve the ability to generalize across a small range of views of an object, that is, within a generic view, where cues of the first type cannot be used to solve the problem. An example might be generalization across a range of views of a cup when looking into the cup, from just above the near lip until the bottom inside of the cup comes into view. This type of process includes the learning of the transforms of the surface markings on 3D objects which occur when the object is rotated, as described in Section 5.6. Such generalization would work because the neurons are tuned as filters to accept a range of variation of the input within parameters such as the relative size and orientation of the components of the features. Generalization of this type would not be expected to work when there is a catastrophic change in the features visible, as, for example, occurs when the cup is rotated so that one can suddenly no longer see inside it, and the outside bottom of the cup comes into view.

The third type of process is one that can deal with the sudden catastrophic change in the features visible when an object is rotated to a completely different view, as in the cup example just given (cf. Koenderink, 1990). Another example, quite extreme to illustrate the point, might be when a card with different images on its two sides is rotated so that one face and then the other is in view. This makes the point that this third type of process may involve arbitrary pairwise association learning, to learn which features and views are different aspects of the same object. Another example occurs when only some parts of an object are visible. For example, a red-handled screwdriver may be recognized either from its round red handle, or from its elongated silver-colored blade. The full view-invariant recognition of objects that occurs even when the objects share the same features, such as color, texture, etc., is an especially computationally demanding task, which the primate visual system is able to perform with its highly developed temporal lobe cortical visual areas. The neurophysiological evidence and the neuronal network analyses described here and elsewhere (Rolls, 2008b) provide clear hypotheses about how the primate visual system may perform this task.

10. CONCLUSION

We have seen that the feature hierarchy approach has a number of advantages in performing object recognition over other approaches (see Section 3), and that some of the key computational issues that arise in these architectures have solutions (see Sections 4 and 5). The neurophysiological and computational approach taken here focuses on a feature hierarchy model in which invariant representations can be built by self-organizing learning based on the statistics of the visual input.

The model can use temporal continuity in an associative synaptic learning rule with a short-term memory trace, and/or it can use spatial continuity in continuous spatial transformation learning.

The model of visual processing in the ventral cortical stream can build representations of objects that are invariant with respect to translation, view, size, and lighting.

The model uses a feature combination neuron approach, with the relative spatial positions of the objects specified in the feature combination neurons, and this provides a solution to the binding problem.

The model has been extended to provide an account of invariant representations in the dorsal visual system of the global motion produced by objects, such as looming, rotation, and object-based movement.

The model has been extended to incorporate top-down feedback connections to model the control of attention by biased competition in, for example, spatial and object search tasks (Deco and Rolls, 2004; Rolls, 2008b).

The model has also been extended to account for how the visual system can select single objects in complex visual scenes, how multiple objects can be represented in a scene, and how invariant representations of single objects can be learned even when multiple objects are present in the scene.

It has also been suggested, in a unifying proposal, that adding a fifth layer to the model and training the system in spatial environments will enable hippocampus-like spatial view neurons or place cells to develop, depending on the size of the field of view (Section 6).

We have thus seen how many of the major computational issues that arise when formulating a theory of object recognition in the ventral visual system (such as feature binding, invariance learning, the recognition of objects when they are in cluttered natural scenes, the representation of multiple objects in a scene, and learning invariant representations of single objects when there are multiple objects in the scene) could be solved in the brain, with tests of the hypotheses performed by simulations that are consistent with complementary neurophysiological results.

The approach described here is unifying in a number of ways. First, a set of simple organizational principles, involving a hierarchy of cortical areas with convergence from stage to stage, and competitive learning using a modified associative learning rule with a short-term memory trace of preceding neuronal activity, provide a basis for understanding much processing in the ventral visual stream, from V1 to the inferior temporal visual cortex. Second, the same principles help to understand some of the processing in the dorsal visual stream by which invariant representations of the global motion of objects may be formed. Third, the same principles continued from the ventral visual stream onward to the hippocampus help to show how spatial view and place representations may be built from the visual input. Fourth, in all these cases the learning is possible because the system is able to extract invariant representations by utilizing the spatio-temporal continuities and statistics in the world that help to define objects, moving objects, and spatial scenes. Fifth, a great simplification and economy in terms of brain design is that the computational principles need not be different in each of the cortical areas in these hierarchical systems, for some of the important properties of the processing in these systems to be performed.

In conclusion, we have seen how the invariant recognition of objects involves not only the storage and retrieval of information, but also major computations to produce invariant representations. Once these invariant representations have been formed, they are used for many processes, including not only recognition memory (Rolls, 2008b), but also associative learning of the rewarding and punishing properties of objects for emotion and motivation (Rolls, 2005, 2008b, 2013), the memory for the spatial locations of objects and rewards, the building of spatial representations based on visual input, and as an input to short-term memory, attention, decision, and action selection systems (Rolls, 2008b).

ACKNOWLEDGMENTS

Edmund T. Rolls is grateful to Larry Abbott, Nicholas Aggelopoulos, Roland Baddeley, Francesco Battaglia, Michael Booth, Gordon Baylis, Hugo Critchley, Gustavo Deco, Martin Elliffe, Leonardo Franco, Michael Hasselmo, Nestor Parga, David Perrett, Gavin Perry, Leigh Robinson, Simon Stringer, Martin Tovee, Alessandro Treves, James Tromans, and Tristan Webb for contributing to many of the collaborative studies described here. Professor R. Watt, of Stirling University, is thanked for assistance with the implementation of the difference of Gaussian filters used in many experiments with VisNet and VisNet2. Support from the Medical Research Council, the Wellcome Trust, the Oxford McDonnell Centre in Cognitive Neuroscience, and the Oxford Centre for Computational Neuroscience (www.oxcns.org, where .pdfs of papers are available) is acknowledged.

REFERENCES

Abbott, L. F., Rolls, E. T., and Tovee, M. J. (1996). Representational capacity of face coding in monkeys. Cereb. Cortex 6, 498–505.
Abeles, M. (1991). Corticonics: Neural Circuits of the Cerebral Cortex. Cambridge: Cambridge University Press.
Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cogn. Sci. 9, 147–169.
Aggelopoulos, N. C., Franco, L., and Rolls, E. T. (2005). Object perception in natural scenes: encoding by inferior temporal cortex simultaneously recorded neurons. J. Neurophysiol. 93, 1342–1357.
Aggelopoulos, N. C., and Rolls, E. T. (2005). Natural scene perception: inferior temporal cortex neurons encode the positions of different objects in the scene. Eur. J. Neurosci. 22, 2903–2916.
Amit, D. J. (1989). Modelling Brain Function. New York: Cambridge University Press.
Anzai, A., Peng, X., and Van Essen, D. C. (2007). Neurons in monkey visual area V2 encode combinations of orientations. Nat. Neurosci. 10, 1313–1321.
Arathorn, D. (2002). Map-Seeking Circuits in Visual Cognition: A Computational Mechanism for Biological and Machine Vision. Stanford, CA: Stanford University Press.
Arathorn, D. (2005). "Computation in the higher visual cortices: map-seeking circuit theory and application to machine vision," in Proceedings of the AIPR 2004: 33rd Applied Imagery Pattern Recognition Workshop, 73–78.
Ballard, D. H. (1990). "Animate vision uses object-centred reference frames," in Advanced Neural Computers, ed. R. Eckmiller (Amsterdam: Elsevier), 229–236.
Barlow, H. B. (1972). Single units and sensation: a neuron doctrine for perceptual psychology. Perception 1, 371–394.
Barlow, H. B. (1985). "Cerebral cortex as model builder," in Models of the Visual Cortex, eds D. Rose and V. G. Dobson (Chichester: Wiley), 37–46.
Barlow, H. B., Kaushal, T. P., and Mitchison, G. J. (1989). Finding minimum entropy codes. Neural Comput. 1, 412–423.
Bartlett, M. S., and Sejnowski, T. J. (1997). "Viewpoint invariant face recognition using independent component analysis and attractor networks," in Advances in Neural Information Processing Systems, Vol. 9, eds M. Mozer, M. Jordan, and T. Petsche (Cambridge, MA: MIT Press), 817–823.
Baylis, G. C., Rolls, E. T., and Leonard, C. M. (1985). Selectivity between faces in the responses of a population of neurons in the cor-
… topography of area TEO in the macaque. J. Comp. Neurol. 306, 554–575.
Brady, M., Ponce, J., Yuille, A., and Asada, H. (1985). Describing surfaces, A. I. Memo 882. Artif. Intell. 17, 285–349.
Brincat, S. L., and Connor, C. E. (2006). Dynamic shape synthesis in posterior inferotemporal cortex. Neuron 49, 17–24.
Bruce, V. (1988). Recognising Faces. Hillsdale, NJ: Erlbaum.
Brun, V. H., Otnass, M. K., Molden, S., Steffenach, H. A., Witter, M. P., Moser, M. B., and Moser, E. I. (2002). Place cells and place recognition maintained by direct entorhinal–hippocampal circuitry. Science 296, 2243–2246.
Buckley, M. J., Booth, M. C. A., Rolls, E. T., and Gaffan, D. (2001). Selective perceptual impairments following perirhinal cortex ablation. J. Neurosci. 21, 9824–9836.
Buhmann, J., Lange, J., von der Malsburg, C., Vorbrüggen, J. C., and Würtz, R. P. (1991). "Object recognition in the dynamic link architecture: parallel implementation of a transputer network," in Neural Networks for Signal Processing, ed. B. Kosko (Englewood Cliffs, NJ: Prentice-Hall),
De Valois, R. L., and De Valois, K. K. (1988). Spatial Vision. New York: Oxford University Press.
Deco, G., and Rolls, E. T. (2004). A neurodynamical cortical model of visual attention and invariant object recognition. Vision Res. 44, 621–644.
Deco, G., and Rolls, E. T. (2005a). Attention, short term memory, and action selection: a unifying theory. Prog. Neurobiol. 76, 236–256.
Deco, G., and Rolls, E. T. (2005b). Neurodynamics of biased competition and cooperation for attention: a model with spiking neurons. J. Neurophysiol. 94, 295–313.
Desimone, R., and Duncan, J. (1995). Neural mechanisms of selective visual attention. Annu. Rev. Neurosci. 18, 193–222.
DeWeese, M. R., and Meister, M. (1999). How to measure the information gained from one symbol. Network 10, 325–340.
DiCarlo, J. J., and Maunsell, J. H. R. (2003). Anterior inferotemporal neurons of monkeys engaged in object recognition can be highly sensitive to object retinal position. J. Neurophysiol. 89, 3264–3278.
DiCarlo, J. J., Zoccolan, D., and Rust, N.
Farah, M. J. (2000). The Cognitive Neuroscience of Vision. Oxford: Blackwell.
Farah, M. J., Meyer, M. M., and McMullen, P. A. (1996). The living/nonliving dissociation is not an artifact: giving an a priori implausible hypothesis a strong test. Cogn. Neuropsychol. 13, 137–154.
Faugeras, O. D. (1993). The Representation, Recognition and Location of 3-D Objects. Cambridge, MA: MIT Press.
Faugeras, O. D., and Hebert, M. (1986). The representation, recognition and location of 3-D objects. Int. J. Robot. Res. 5, 27–52.
Feldman, J. A. (1985). Four frames suffice: a provisional model of vision and space. Behav. Brain Sci. 8, 265–289.
Fenske, M. J., Aminoff, E., Gronau, N., and Bar, M. (2006). Top-down facilitation of visual object recognition: object-based and context-based contributions. Prog. Brain Res. 155, 3–21.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A 4, 2379–2394.
Field, D. J. (1994). What is the goal of sensory coding?
Neural Comput. 6, tex in the superior temporal sul- 121–159. C. (2012). How does the brain solve 559–601. cus of the monkey. Brain Res. 342, Carlson, E. T., Rasquinha, R. J., Zhang, visual object recognition? Neuron Finkel, L. H., and Edelman, G. 91–102. K., and Connor, C. E. (2011). A 73, 415–434. M. (1987). “Population rules for Baylis, G. C., Rolls, E. T., and Leonard, sparse object coding scheme in area Dolan, R. J., Fink, G. R., Rolls, E. T., synapses in networks,” in Synaptic C. M. (1987). Functional subdivi- v4. Curr. Biol. 21, 288–293. Booth, M., Holmes, A., Frackowiak, Function, eds G. M. Edelman, W. E. sions of temporal lobe neocortex. J. Cerella, J. (1986). Pigeons and percep- R. S. J., and Friston, K. J. (1997). How Gall, and W. M. Cowan (New York: Neurosci. 7, 330–342. trons. Pattern Recognit. 19, 431–438. the brain learns to see objects and John Wiley & Sons), 711–757. Bennett, A. (1990). Large competitive Chakravarty, I. (1979). A generalized faces in an impoverished context. Földiák, P. (1991). Learning invariance networks. Network 1, 449–462. line and junction labeling scheme Nature 389, 596–599. from transformation sequences. Biederman, I. (1972). Perceiving real- with applications to scene analy- Dow, B. W., Snyder, A. Z., Vautin, R. Neural Comput. 3, 193–199. world scenes. Science 177, 77–80. sis. IEEE Trans. Pattern Anal. Mach. G., and Bauer, R. (1981). Magni- Földiák, P. (1992). Models of Sensory Biederman, I. (1987). Recognition-by- Intell. 1, 202–205. fication factor and receptive field Coding. Technical Report CUED/F– components: a theory of human Chang, C.-C., and Lin, C.-J. (2011). LIB- size in foveal striate cortex of INFENG/TR 91. Department of image understanding. Psychol. Rev. SVM: a library for support vector the monkey. Exp. Brain Res. 44, Engineering, University of Cam- 94, 115–147. machines. ACM Trans. Intell. Syst. 213–218. bridge, Cambridge. Binford, T. O. (1981). Inferring sur- Technol. 2, 27. Edelman, S. (1999). 
Representation and Folstein, J. R., Gauthier, I., and faces from images. Artif. Intell. 17, Dane, C., and Bajcsy, R. (1982). “An Recognition in Vision. Cambridge, Palmeri, T. J. (2010). Mere expo- 205–244. object-centred three-dimensional MA: MIT Press. sure alters category learning of Blumberg, J., and Kreiman, G. (2010). model builder,” in Proceedings of Elliffe, M. C. M., Rolls, E. T., Parga, novel objects. Front. Psychol. 1:40. How cortical neurons help us see: the 6th International Conference N., and Renart, A. (2000). A recur- doi:10.3389/fpsyg.2010.00040 visual recognition in the human on Pattern Recognition, Munich, rent model of transformation invari- Franco, L., Rolls, E. T., Aggelopoulos, brain. J. Clin. Invest. 120, 3054–3063. 348–350. ance by association. Neural Netw. 13, N. C., and Jerez, J. M. (2007). Neu- Bolles, R. C., and Cain, R. A. (1982). Daugman, J. (1988). Complete discrete 225–237. ronal selectivity, population sparse- Recognizing and locating partially 2D-Gabor transforms by neural net- Elliffe, M. C. M., Rolls, E. T., and ness, and ergodicity in the inferior visible objects: the local-feature- works for image analysis and com- Stringer, S. M. (2002). Invariant temporal visual cortex. Biol. Cybern. focus method. Int. J. Robot. Res. 1, pression. IEEE Trans. Acoust. 36, recognition of feature combinations 96, 547–560. 57–82. 1169–1179. in the visual system. Biol. Cybern. 86, Franco, L., Rolls, E. T., Aggelopou- Booth, M. C. A., and Rolls, E. T. De Araujo, I. E. T., Rolls, E. T., 59–71. los, N. C., and Treves, A. (2004). (1998). View-invariant representa- and Stringer, S. M. (2001). A Engel, A. K., Konig, P., Kreiter, A. K., The use of decoding to analyze tions of familiar objects by neu- view model which accounts for the Schillen, T. B., and Singer, W. (1992). the contribution to the informa- rons in the inferior temporal visual response properties of hippocam- Temporal coding in the visual sys- tion of the correlations between the cortex. 
Cereb. Cortex 8, 510–523. pal primate spatial view cells and tem: new vistas on integration in the firing of simultaneously recorded Boussaoud, D., Desimone, R., and rat place cells. Hippocampus 11, nervous system. Trends Neurosci. 15, neurons. Exp. Brain Res. 155, Ungerleider, L. G. (1991). Visual 699–706. 218–226. 370–384. Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 65 Rolls Invariant visual object recognition Franzius, M., Sprekeler, H., and motion patterns by form/cue invari- Hegde, J., and Van Essen, D. C. (2000). alignment with an image. Int. J. Wiskott, L. (2007). Slowness ant MSTd neurons. J. Neurosci. 16, Selectivity for complex shapes in pri- Comput. Vis. 5, 195–212. and sparseness lead to place, 4716–4732. mate visual area V2. J. Neurosci. 20, Ito, M. (1984). The Cerebellum and head-direction, and spatial-view Georges-François, P., Rolls, E. T., and RC61. Neural Control. New York: Raven cells. PLoS Comput. Biol. 3, e166. Robertson, R. G. (1999). Spatial view Hegde, J., and Van Essen, D. C. (2003). Press. doi:10.1371/journal.pcbi.0030166 cells in the primate hippocampus: Strategies of shape representation in Ito, M. (1989). Long-term depression. Freedman, D. J., and Miller, E. K. (2008). allocentric view not head direction macaque visual area V2. Vis. Neu- Annu. Rev. Neurosci. 12, 85–102. Neural mechanisms of visual cate- or eye position or place. Cereb. Cor- rosci. 20, 313–328. Ito, M., and Komatsu, H. (2004). gorization: insights from neurophys- tex 9, 197–212. Hegde, J., and Van Essen, D. C. (2007). Representation of angles embedded iology. Neurosci. Biobehav. Rev. 32, Geusebroek, J.-M., Burghouts, G. J., A comparative study of shape rep- within contour stimuli in area V2 of 311–329. and Smeulders, A. W. M. (2005). resentation in macaque visual areas macaque monkeys. J. Neurosci. 24, Freiwald, W. A., Tsao, D. Y., and Living- The Amsterdam library of object V2 and V4. Cereb. Cortex 17, 3313–3324. 
stone, M. S. (2009). A face feature images. Int. J. Comput. Vis. 61, 1100–1116. Itti, L., and Koch, C. (2000). A saliency- space in the macaque temporal lobe. 103–112. Herrnstein, R. J. (1984). “Objects, cat- based search mechanism for overt Nat. Neurosci. 12, 1187–1196. Gibson, J. J. (1950). The Perception of egories, and discriminative stimuli,” and covert shifts of visual attention. Frey, B. J., and Jojic, N. (2003). the Visual World. Boston: Houghton in Animal Cognition, Chap. 14, eds Vision Res. 40, 1489–1506. Transformation-invariant clustering Mifflin. H. L. Roitblat, T. G. Bever, and H. Jarrett, K., Kavukcuoglu, K., Ranzato, using the EM algorithm. IEEE Trans. Gibson, J. J. (1979). The Ecologi- S. Terrace (Hillsdale, NJ: Lawrence M., and Lecun, Y. (2009). “What Pattern Anal. Mach. Intell. 25, 1–17. cal Approach to Visual Perception. Erlbaum and Associates), 233–261. is the best multi-stage architec- Fries, P. (2005). A mechanism for cog- Boston: Houghton Mifflin. Hertz, J. A., Krogh, A., and Palmer, R. ture for object recognition?” in nitive dynamics: neuronal commu- Grabenhorst, F., and Rolls, E. T. (2011). G. (1991). Introduction to the Theory 2009 IEEE 12th International Con- nication through neuronal coher- Value, pleasure, and choice systems of Neural Computation. Wokingham: ference on Computer Vision (ICCV), ence. Trends Cogn. Sci. (Regul. Ed.) in the ventral prefrontal cortex. Addison-Wesley. 2146–2153. 9, 474–480. Trends Cogn. Sci. (Regul. Ed.) 15, Hestrin, S., Sah, P., and Nicoll, R. Jiang, F., Dricot, L., Weber, J., Righi, Fries, P. (2009). Neuronal gamma- 56–67. (1990). Mechanisms generating the G., Tarr, M. J., Goebel, R., and Ros- band synchronization as a funda- Graziano, M. S. A., Andersen, R. A., time course of dual component exci- sion, B. (2011). Face categorization mental process in cortical com- and Snowden, R. J. (1994). Tuning tatory synaptic currents recorded in visual scenes may start in a higher putation. Annu. Rev. Neurosci. 
32, of MST neurons to spiral motions. J. in hippocampal slices. Neuron 5, order area of the right fusiform 209–224. Neurosci. 14, 54–67. 247–253. gyrus: evidence from dynamic visual Fukushima, K. (1975). Cognitron: a Griffin, G., Holub, A., and Perona, Hinton, G. E. (2010). Learning to rep- stimulation in neuroimaging. J. self-organizing neural network. Biol. P. (2007). The Caltech-256. Caltech resent visual input. Philos. Trans. R. Neurophysiol. 106, 2720–2736. Cybern. 20, 121–136. Technical Report, Los Angeles, 1–20. Soc. Lond. B Biol. Sci. 365, 177–184. Koch, C. (1999). Biophysics of Com- Fukushima, K. (1980). Neocognitron: Grimson, W. E. L. (1990). Object Recog- Hinton, G. E., Dayan, P., Frey, B. J., and putation. Oxford: Oxford University a self-organizing neural network nition by Computer. Cambridge, Neal, R. M. (1995). The “wake-sleep” Press. model for a mechanism of pattern MA: MIT Press. algorithm for unsupervised neural Koenderink, J. J. (1990). Solid Shape. recognition unaffected by shift in Griniasty, M., Tsodyks, M. V., and Amit, networks. Science 268, 1158–1161. Cambridge, MA: MIT Press. position. Biol. Cybern. 36, 193–202. D. J. (1993). Conversion of temporal Hinton, G. E., and Ghahramani, Z. Koenderink, J. J., and Van Doorn, A. J. Fukushima, K. (1988). Neocognitron: a correlations between stimuli to spa- (1997). Generative models for dis- (1979). The internal representation hierarchical neural network model tial correlations between attractors. covering sparse distributed repre- of solid shape with respect to vision. capable of visual pattern recogni- Neural Comput. 35, 1–17. sentations. Philos. Trans. R. Soc. Biol. Cybern. 32, 211–217. tion unaffected by shift in position. Gross, C. G., Desimone, R., Albright, Lond. B Biol. Sci. 352, 1177–1190. Koenderink, J. J., and van Doorn, A. Neural Netw. 1, 119–130. T. D., and Schwartz, E. L. (1985). Hinton, G. E., and Sejnowski, T. J. J. (1991). Affine structure from Fukushima, K. (1989). 
Analysis of the Inferior temporal cortex and pat- (1986). “Learning and relearning in motion. J. Opt. Soc. Am. A 8, process of visual pattern recognition tern recognition. Exp. Brain Res. Boltzmann machines,” in Parallel 377–385. by the neocognitron. Neural Netw. 2, 11(Suppl.), 179–201. Distributed Processing, Vol. 1, Chap. Kourtzi, Z., and Connor, C. E. (2011). 413–420. Hafting, T., Fyhn, M., Molden, S., Moser, 7, eds D. Rumelhart and J. L. McClel- Neural representations for object Fukushima, K. (1991). Neural networks M. B., and Moser, E. I. (2005). land (Cambridge, MA: MIT Press), perception: structure, category, and for visual pattern recognition. IEEE Microstructure of a spatial map in 282–317. adaptive coding. Annu. Rev. Neu- Trans. E 74, 179–190. the entorhinal cortex. Nature 436, Hopfield, J. J. (1982). Neural networks rosci. 34, 45–67. Fukushima, K., and Miyake, S. (1982). 801–806. and physical systems with emer- Kriegeskorte, N., Mur, M., Ruff, D. A., Neocognitron: a new algorithm Hasselmo, M. E., Rolls, E. T., and gent collective computational abili- Kiani, R., Bodurka, J., Esteky, H., for pattern recognition tolerant of Baylis, G. C. (1989a). The role ties. Proc. Natl. Acad. Sci. U. S. A. 79, Tanaka, K., and Bandettini, P. A. deformations and shifts in position. of expression and identity in the 2554–2558. (2008). Matching categorical object Pattern Recognit. 15, 455–469. face-selective responses of neurons Hubel, D. H., and Wiesel, T. N. (1962). representations in inferior temporal Fyhn, M., Molden, S., Witter, M. P., in the temporal visual cortex of Receptive fields, binocular interac- cortex of man and monkey. Neuron Moser, E. I., and Moser, M.-B. the monkey. Behav. Brain Res. 32, tion, and functional architecture in 60, 1126–1141. (2004). Spatial representation in 203–218. the cat’s visual cortex. J. Physiol. 160, Krieman, G., Koch, C., and Fried, the entorhinal cortex. Science 2004, Hasselmo, M. E., Rolls, E. T., Baylis, 106–154. I. (2000). 
Category-specific visual 1258–1264. G. C., and Nalwa, V. (1989b). Hubel, D. H., and Wiesel, T. N. responses of single neurons in the Gardner, E. (1988). The space of inter- Object-centered encoding by face- (1968). Receptive fields and func- human medial temporal lobe. Nat. actions in neural network models. J. selective neurons in the cortex in tional architecture of monkey striate Neurosci. 3, 946–953. Phys. A Math. Gen. 21, 257–270. the superior temporal sulcus of cortex. J. Physiol. 195, 215–243. Land, M. F. (1999). Motion and vision: Garthwaite, J. (2008). Concepts of the monkey. Exp. Brain Res. 75, Hummel, J. E., and Biederman, I. why animals move their eyes. J. neural nitric oxide-mediated 417–429. (1992). Dynamic binding in a neural Comp. Physiol. A 185, 341–352. transmission. Eur. J. Neurosci. 27, Hawken, M. J., and Parker, A. J. (1987). network for shape recognition. Psy- Land, M. F., and Collett, T. S. (1997). 2783–3802. Spatial properties of the monkey chol. Rev. 99, 480–517. “A survey of active vision in inverte- Geesaman, B. J., and Andersen, R. A. striate cortex. Proc. R. Soc. Lond. B Huttenlocher, D. P., and Ullman, S. brates,” in From Living Eyes to See- (1996). The analysis of complex Biol. Sci. 231, 251–288. (1990). Recognizing solid objects by ing Machines, eds M. V. Srinivasan Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 66 Rolls Invariant visual object recognition and S. Venkatesh (Oxford: Oxford Mel, B. W., and Fiser, J. (2000). Mini- Newsome, W. T., Britten, K. H., and visual object recognition hard? University Press), 16–36. mizing binding errors using learned Movshon, J. A. (1989). Neuronal PLoS Comput. Biol. 4, e27. LeCun, Y., Kavukcuoglu, K., and Fara- conjunctive features. Neural Com- correlates of a perceptual decision. doi:10.1371/journal.pcbi.0040027 bet, C. (2010). “Convolutional net- put. 12, 731–762. Nature 341, 52–54. Poggio, T., and Edelman, S. (1990). 
works and applications in vision,” in Mel, B. W., Ruderman, D. L., and Oja, E. (1982). A simplified neuron A network that learns to recognize 2010 IEEE International Symposium Archie, K. A. (1998). Translation- model as a principal component three-dimensional objects. Nature on Circuits and Systems, 253–256. invariant orientation tuning in analyzer. J. Math. Biol. 15, 267–273. 343, 263–266. Lee, T. S. (1996). Image representa- visual “complex” cells could derive O’Keefe, J. (1984). “Spatial memory Pollen, D., and Ronner, S. (1981). Phase tion using 2D Gabor wavelets. IEEE from intradendritic computations. J. within and without the hippocampal relationship between adjacent sim- Trans. Pattern Anal. Mach. Intell. 18, Neurosci. 18, 4325–4334. system,” in Neurobiology of the Hip- ple cells in the visual cortex. Science 959–971. Mikami, A., Nakamura, K., and Kub- pocampus, ed. W. Seifert (London: 212, 1409–1411. Leen, T. K. (1995). From data distri- ota, K. (1994). Neuronal responses Academic Press), 375–403. Rao, R. P. N., and Ruderman, D. butions to regularization in invari- to photographs in the superior tem- Olshausen, B. A., Anderson, C. H., and L. (1999). “Learning lie groups ant learning. Neural Comput. 7, poral sulcus of the rhesus monkey. Van Essen, D. C. (1993). A neurobio- for invariant visual perception,” in 974–981. Behav. Brain Res. 60, 1–13. logical model of visual attention and Advances in Neural Information Pro- Leibo, J. Z., Mutch, J., Rosasco, Milner, P. (1974). A model for visual invariant pattern recognition based cessing Systems, Vol. 11, eds M. S. L., Ullman, S., and Poggio, T. shape recognition. Psychol. Rev. 81, on dynamic routing of information. Kearns, S. A. Solla, and D. A. Cohn (2010). Learning Generic Invari- 521–535. J. Neurosci. 13, 4700–4719. (Cambridge: MIT Press), 810–816. ances in Object Recognition: Trans- Miyashita, Y. (1988). Neuronal corre- Olshausen, B. A., Anderson, C. H., and Renart, A., Parga, N., and Rolls, E. 
lation and Scale. MIT-CSAIL-TR- late of visual associative long-term Van Essen, D. C. (1995). A multiscale T. (2000). “A recurrent model of 2010-061, Cambridge. memory in the primate temporal dynamic routing circuit for forming the interaction between the pre- Li, N., and DiCarlo, J. J. (2008). Unsu- cortex. Nature 335, 817–820. size- and position-invariant object frontal cortex and inferior temporal pervised natural experience rapidly Miyashita, Y., and Chang, H. S. representations. J. Comput. Neurosci. cortex in delay memory tasks,” in alters invariant object representa- (1988). Neuronal correlate of picto- 2, 45–62. Advances in Neural Information tion in visual cortex. Science 321, rial short-term memory in the pri- Orban, G. A. (2011). The extraction Processing Systems, Vol. 12, eds S. 1502–1507. mate temporal cortex. Nature 331, of 3D shape in the visual system Solla, T. Leen, and K.-R. Mueller Li, S., Mayhew, S. D., and Kourtzi, Z. 68–70. of human and nonhuman primates. (Cambridge, MA: MIT Press), (2011). Learning shapes spatiotem- Montague, P. R., Gally, J. A., and Edel- Annu. Rev. Neurosci. 34, 361–388. 171–177. poral brain patterns for flexible cat- man, G. M. (1991). Spatial signalling O’Reilly, J., and Munakata, Y. (2000). Rhodes, P. (1992). The open time of egorical decisions. Cereb. Cortex. in the development and function of Computational Explorations in Cog- the NMDA channel facilitates the doi: 10.1093/cercor/bhr309. [Epub neural connections. Cereb. Cortex 1, nitive Neuroscience. Cambridge, MA: self-organisation of invariant object ahead of print]. 199–220. MIT Press. responses in cortex. Soc. Neurosci. Liu, J., Harris, A., and Kanwisher, N. Moser, E. I. (2004). Hippocampal place Parga, N., and Rolls, E. T. (1998). Abstr. 18, 740. (2010). Perception of face parts and cells demand attention. Neuron 42, Transform invariant recognition by Riesenhuber, M., and Poggio, T. (1998). face configurations: an fMRI study. 183–185. 
association in a recurrent network. “Just one view: invariances in infer- J. Cogn. Neurosci. 22, 203–211. Moser, M. B., and Moser, E. I. Neural Comput. 10, 1507–1525. otemporal cell tuning,” in Advances Logothetis, N. K., Pauls, J., Bulthoff, (1998). Functional differentiation in Peng, H. C., Sha, L. F., Gan, Q., and in Neural Information Processing H. H., and Poggio, T. (1994). View- the hippocampus. Hippocampus 8, Wei, Y. (1998). Energy function Systems, Vol. 10, eds M. I. Jor- dependent object recognition by 608–619. for learning invariance in multi- dan, M. J. Kearns, and S. A. monkeys. Curr. Biol. 4, 401–414. Movshon, J. A., Adelson, E. H., Gizzi, M. layer perceptron. Electron. Lett. 34, Solla (Cambridge, MA: MIT Press), Logothetis, N. K., Pauls, J., and Pog- S., and Newsome, W. T. (1985). “The 292–294. 215–221. gio, T. (1995). Shape representation analysis of moving visual patterns,” Perrett, D. I., and Oram, M. W. Riesenhuber, M., and Poggio, T. (1999a). in the inferior temporal cortex of in Pattern Recognition Mechanisms, (1993). Neurophysiology of shape Are cortical models really bound by monkeys. Curr. Biol. 5, 552–563. eds C. Chagas, R. Gattass, and C. G. processing. Image Vis. Comput. 11, the “binding problem”? Neuron 24, Logothetis, N. K., and Sheinberg, D. Gross (New York: Springer-Verlag), 317–333. 87–93. L. (1996). Visual object recognition. 117–151. Perrett, D. I., Rolls, E. T., and Caan, W. Riesenhuber, M., and Poggio, T. Annu. Rev. Neurosci. 19, 577–621. Mozer, M. C. (1991). The Perception (1982). Visual neurons responsive to (1999b). Hierarchical models of Lowe, D. (1985). Perceptual Organiza- of Multiple Objects: A Connection- faces in the monkey temporal cortex. object recognition in cortex. Nat. tion and Visual Recognition. Boston: ist Approach. Cambridge, MA: MIT Exp. Brain Res. 47, 329–342. Neurosci. 2, 1019–1025. Kluwer. Press. Perrett, D. I., Smith, P. A. J., Potter, D. Riesenhuber, M., and Poggio, T. (2000). Marr, D. (1982). Vision. 
San Francisco: Muller, R. U., Kubie, J. L., Bostock, E. D., Mistlin, A. J., Head, A. S., Mil- Models of object recognition. Nat. Freeman. M., Taube, J. S., and Quirk, G. J. ner, D., and Jeeves, M. A. (1985). Neurosci. 3(Suppl.), 1199–1204. Marr, D., and Nishihara, H. K. (1978). (1991). “Spatial firing correlates of Visual cells in temporal cortex sensi- Ringach, D. L. (2002). Spatial struc- Representation and recognition of neurons in the hippocampal forma- tive to face view and gaze direction. ture and symmetry of simple-cell the spatial organization of three tion of freely moving rats,” in Brain Proc. R. Soc. Lond. B Biol. Sci. 223, receptive fields in macaque primary dimensional structure. Proc. R. Soc. and Space, ed. J. Paillard (Oxford: 293–317. visual cortex. J. Neurophysiol. 88, Lond. B Biol. Sci. 200, 269–294. Oxford University Press), 296–333. Perry, G., Rolls, E. T., and Stringer, S. 455–463. McNaughton, B. L., Barnes, C. A., and Mundy, J., and Zisserman, A. (1992). M. (2006). Spatial vs temporal conti- Robertson, R. G., Rolls, E. T., and O’Keefe, J. (1983). The contributions “Introduction – towards a new nuity in view invariant visual object Georges-François, P. (1998). Spa- of position, direction, and velocity to framework for vision,” in Geometric recognition learning. Vision Res. 46, tial view cells in the primate hip- single unit activity in the hippocam- Invariance in Computer Vision, eds 3994–4006. pocampus: effects of removal of pus of freely-moving rats. Exp. Brain J. Mundy and A. Zisserman (Cam- Perry, G., Rolls, E. T., and Stringer, S. view details. J. Neurophysiol. 79, Res. 52, 41–49. bridge, MA: MIT Press), 1–39. M. (2010). Continuous transforma- 1145–1156. Mel, B. W. (1997). SEEMORE: com- Mutch, J., and Lowe, D. G. (2008). tion learning of translation invariant Rolls, E. T. (1989a). “Functions of neu- bining color, shape, and texture his- Object class recognition and local- representations. Exp. Brain Res. 
204, ronal networks in the hippocam- togramming in a neurally-inspired ization using sparse features with 255–270. pus and neocortex in memory,” in approach to visual object recogni- limited receptive fields. Int. J. Com- Pinto, N., Cox, D. D., and DiCarlo, Neural Models of Plasticity: Experi- tion. Neural Comput. 9, 777–804. put. Vis. 80, 45–57. J. J. (2008). Why is real-world mental and Theoretical Approaches, Frontiers in Computational Neuroscience www.frontiersin.org June 2012 | Volume 6 | Article 35 | 67 Rolls Invariant visual object recognition Chap. 13, eds J. H. Byrne and W. Approach. Oxford: Oxford Univer- neurophysiological and compu- Rolls, E. T., and Stringer, S. M. (2007). O. Berry (San Diego, CA: Academic sity Press. tational bases. Neural Netw. 19, Invariant global motion recognition Press), 240–265. Rolls, E. T. (2008c). Top-down con- 1383–1394. in the dorsal visual system: a uni- Rolls, E. T. (1989b). “The representation trol of visual perception: attention Rolls, E. T., Franco, L., Aggelopoulos, N. fying theory. Neural Comput. 19, and storage of information in neu- in natural vision. Perception 37, C., and Jerez, J. M. (2006a). Infor- 139–169. ronal networks in the primate cere- 333–354. mation in the first spike, the order Rolls, E. T., and Tovee, M. J. (1994). Pro- bral cortex and hippocampus,” in Rolls, E. T. (2011a). David Marr’s vision: of spikes, and the number of spikes cessing speed in the cerebral cortex The Computing Neuron, Chap. 8, eds floreat computational neuroscience. provided by neurons in the inferior and the neurophysiology of visual R. Durbin, C. Miall, and G. Mitchi- Brain 134, 913–916. temporal visual cortex. Vision Res. masking. Proc. R. Soc. Lond. B Biol. son (Wokingham: Addison-Wesley), Rolls, E. T. (2011b). “Face neurons,” in 46, 4193–4205. Sci. 257, 9–15. 125–159. The Oxford Handbook of Face Percep- Rolls, E. T., Stringer, S. M., and Elliot, T. Rolls, E. T., and Tovee, M. J. (1995a). Rolls, E. T. (1992). 
Neurophysiologi- tion, Chap. 4, eds A. J. Calder, G. (2006b). Entorhinal cortex grid cells The responses of single neurons in cal mechanisms underlying face pro- Rhodes, M. H. Johnson, and J. V. can map to hippocampal place cells the temporal visual cortical areas of cessing within and beyond the tem- Haxby (Oxford: Oxford University by competitive learning. Network 17, the macaque when more than one poral cortical visual areas. Philos. Press), 51–75. 447–465. stimulus is present in the visual field. Trans. R. Soc. Lond. B Biol. Sci. 335, Rolls, E. T. (2012). Neuroculture: On Rolls, E. T., Franco, L., and Stringer, S. Exp. Brain Res. 103, 409–420. 11–21. the Implications of Brain Science. M. (2005a). The perirhinal cortex Rolls, E. T., and Tovee, M. J. (1995b). Rolls, E. T. (1994). Brain mecha- Oxford: Oxford University Press. and long-term familiarity memory. Sparseness of the neuronal represen- nisms for invariant visual recogni- Rolls, E. T. (2013). Emotion and Q. J. Exp. Psychol. B. 58, 234–245. tation of stimuli in the primate tem- tion and learning. Behav. Processes Decision-Making Explained. Oxford: Rolls, E. T., Xiang, J.-Z., and Franco, L. poral visual cortex. J. Neurophysiol. 33, 113–138. Oxford University Press. (2005b). Object, space and object- 73, 713–726. Rolls, E. T. (1995). Learning mech- Rolls, E. T., Aggelopoulos, N. C., Franco, space representations in the primate Rolls, E. T., Tovee, M. J., and Panzeri, anisms in the temporal lobe L., and Treves, A. (2004). Informa- hippocampus. J. Neurophysiol. 94, S. (1999). The neurophysiology of visual cortex. Behav. Brain Res. 66, tion encoding in the inferior tem- 833–844. backward visual masking: informa- 177–185. poral visual cortex: contributions of Rolls, E. T., and Grabenhorst, F. tion analysis. J. Cogn. Neurosci. 11, Rolls, E. T. (1999). The Brain and the firing rates and the correlations (2008). The orbitofrontal cor- 335–346. Emotion. Oxford: Oxford University between the firing of neurons. Biol. 
Conflict of Interest Statement: The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Received: 01 November 2011; accepted: 23 May 2012; published online: 19 June 2012.

Citation: Rolls ET (2012) Invariant visual object and face recognition: neural and computational bases, and a model, VisNet. Front. Comput. Neurosci. 6:35. doi: 10.3389/fncom.2012.00035

Copyright © 2012 Rolls. This is an open-access article distributed under the terms of the Creative Commons Attribution Non Commercial License, which permits non-commercial use, distribution, and reproduction in other forums, provided the original authors and source are credited.