A Neural Network Architecture for Learning Word-Referent Associations in Multiple Contexts
Bassani, Hansenclever F.; Araujo, Aluizio F. R.
2019-05-20
This article proposes a biologically inspired neurocomputational architecture which learns associations between words and referents in different contexts, considering evidence collected from the literature of Psycholinguistics and Neurolinguistics. The multi-layered architecture takes as input raw images of objects (referents) and streams of word phonemes (labels), builds an adequate representation, recognizes the current context, and associates labels with referents incrementally, by employing a Self-Organizing Map which creates new association nodes (prototypes) as required, adjusts the existing prototypes to better represent the input stimuli, and removes prototypes that become obsolete/unused. The model takes into account the current context to retrieve the correct meaning of words with multiple meanings. Simulations show that the model can reach up to 78% word-referent association accuracy in ambiguous situations and approximates well the learning rates of humans as reported by three different authors in five Cross-Situational Word Learning experiments, also displaying similar learning patterns in the different learning conditions.

Keywords: Self-Organizing Maps, Cross-Situational Word Learning, Context, Learning Representations, Neurocomputational Model.

Preprint submitted to Neural Networks, May 22, 2019. arXiv:1905.08300v1 [cs.LG] 20 May 2019.

1. Introduction

Language is surely a vital and distinctive trait of human beings. Even though language acquisition by young children is an active research topic in cognitive sciences, a number of open issues persist, despite the achievements of the field. For instance, we do not know exactly how humans acquire the meaning of words, an essential part of the language acquisition process. In this article, we propose a word learning model composed of a set of neural modules, or schemes (Arbib, 2008), that simultaneously compete and cooperate to perform higher-level tasks. The model was proposed considering the evidence brought by the literature of neurolinguistics and psycholinguistics about the characteristics of the word learning capabilities displayed by humans. With that, the proposed model is able to simulate multiple statistical characteristics displayed by humans when they learn new words.

We assume that word learning may be studied disregarding the interference of other aspects of language acquisition, such as the acquisition of grammar, semantics, and pragmatics. Therefore, according to Bloom (2002), in order to learn the meaning of a word, an individual must learn three different elements: (i) the concept or meaning of the word (referent); (ii) the sound or lexical representation of the word (label); and (iii) the association between referent and label. Each of these challenging tasks will be addressed in this article.

A classic example (Quine, 1960) illustrates the difficulties that children and foreign language learners must handle to correctly match words and referents. When a native speaker of an unknown language sees a white rabbit and pronounces "gavagai", one might understand this as clear evidence that the word "gavagai" means rabbit. However, such a sound could also mean "white", "furry", "food", "let's go hunting", or even something completely unrelated to rabbits, such as "it is going to rain". The expression "gavagai" could even be a composition of two or three words with their own meanings.

One possible strategy to address the problem described by Quine (1960) is known as "cross-situational word learning" (CSWL) (Yu & Smith, 2007). In this type of learning, words would not be learned after a single exposure. The learning process would consider information from multiple learning trials. Thus, a learner who is unable to decide unambiguously the meaning of a word after a single trial would form new knowledge subject to be further strengthened or weakened upon new evidence.

Currently, we can argue that word learning requires a set of cognitive abilities that are not yet fully understood (Bloom, 2002), such as theory of mind (the ability to simulate and understand the thought of others), concept acquisition, and fast mapping (the ability to associate referents and labels with few, or even one, trials). In this article, we focus on the last two abilities of this list.

Concept acquisition may be seen as the ability to recognize and group similar referents together so that the category itself (concept) can be further associated with a label. Harnad (2005) points out that "To Cognize is to Categorize", and Perlovsky (2006) describes the mind as a hierarchy of multiple layers of concept-models, from simple elements like edges or moving dots to more abstract concept-models of objects, relationships, complete scenes, and so on. The proposed model is compatible with these views because it defines the learning tasks mentioned above as a subspace clustering problem (Kriegel et al., 2005; Bassani & Araujo, 2015; Hu & Pei, 2018), in which the cluster prototypes capture the concept-models.

At the current state, the model focuses on the lower levels of the concept-model hierarchy mentioned by Perlovsky, learning the referents, labels, and their associations for concrete nouns that can be depicted in static images, such as chair, table, and pen, in their different usage contexts (basic concept-models). The model learns such elements incrementally by creating new prototype nodes as required, adjusting the existing prototypes to better represent the auditory and visual input stimuli, or removing prototypes that become obsolete/unused.
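The cross-situational strategy outlined above can be illustrated with a minimal associative learner that simply accumulates word-referent co-occurrence counts across ambiguous trials. This sketch only illustrates the generic statistical idea, not the architecture proposed in this article; the pseudowords and referent names are made up for the example.

```python
from collections import defaultdict

# Toy cross-situational learner: each trial presents several words and
# several referents with no explicit pairing; co-occurrence counts
# accumulate across trials until the correct pairings dominate.
cooccurrence = defaultdict(lambda: defaultdict(int))

trials = [
    (["bosa", "gasser"], ["dog", "cat"]),
    (["bosa", "manu"],   ["dog", "bird"]),
    (["gasser", "manu"], ["cat", "bird"]),
]

for words, referents in trials:
    for w in words:
        for r in referents:
            cooccurrence[w][r] += 1  # strengthen every co-occurring pair

def best_referent(word):
    """Read out the referent with the strongest accumulated association."""
    candidates = cooccurrence[word]
    return max(candidates, key=candidates.get)

# After three ambiguous trials, each word has co-occurred twice with its
# true referent but only once with each distractor.
print(best_referent("bosa"))  # prints: dog
```

No single trial above disambiguates any word, yet the aggregate statistics do; this is precisely the kind of knowledge that is "strengthened or weakened upon new evidence".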
To achieve this, we specify a neurocomputational architecture composed of four layers: (i) the first layer extracts the perceptions from raw visual data (the referents) and auditory data (the labels); (ii) the second layer creates a more suitable representation for labels and referents; (iii) the third layer recognizes the current context; and (iv) the fourth layer creates the associations between labels and referents in the different contexts in which they are used, thus forming the prototypes representing the basic concept-models learned by the model.

Figure 1: Illustration of a trial in the 4x4 condition. The pictures of four objects (referents) are shown on the monitor while the sound of four pseudowords (labels) is presented auditorily over the speakers.

In order to evaluate the proposed model, we simulate the CSWL experiments carried out with human beings by Yu & Smith (2007), Yurovsky et al. (2013), and Trueswell et al. (2013). These experiments provide sound evidence on the operation of word learning mechanisms. Any model aiming to represent the functioning of these learning mechanisms must be able to reproduce, to some extent, the word learning patterns described in the following paragraphs.

Yu & Smith (2007) designed experiments to evaluate the abilities of humans in acquiring correct word-referent pairings, and they found compelling evidence that adult humans are able to learn label-referent pairings through CSWL. In their experiments, the stimuli consisted of slides containing 2, 3, or 4 pictures of unusual objects paired with 2, 3, or 4 pseudowords presented in the auditory form. These artificial words were generated by a computer program using standard phonemes in English. In this case, the label-referent pairs were formed by single and unique objects randomly chosen, used in three different training conditions of ambiguity. The training conditions differ only in the number of labels and referents simultaneously presented to the subjects. Figure 1 illustrates a 4x4 condition, in which four objects (referents) were presented simultaneously on the screen, while the sound of 4 pseudowords (labels) was heard from the speakers. The results showed that the individuals were able to discover on average more than 16 out of the 18 pairs in the 2x2 condition and more than 13 out of the 18 pairs in the 3x3 condition.

Yurovsky et al. (2013) expanded the previous experiment, including situations in which labels could be associated with more than one referent. They were interested in evaluating whether there was competition occurring in the learning process and whether it was local (among referents presented in the same trial) or global (among referents presented in different trials). Their results suggested that global competition is most likely to occur.

The computational models proposed in the literature for CSWL can be divided into two categories (Yu & Smith, 2007): the Hypothesis-Testing Models, in which the learner maintains a list of hypothesized pairings to be further confirmed or rejected due to a mutual exclusivity constraint, and the Associative Models, in which a basic form of Hebbian learning strengthens associations between observed word-referent pairs.

Trueswell et al. (2013) designed experiments to compare the two hypotheses, and their results suggested that subjects did not keep track of multiple candidate meanings for each label; hence, according to the authors, such experiments weaken the hypothesis that humans employ some kind of statistical learning of the word-referent pairings.

Current studies have focused on comparing these two modeling approaches in terms of how well they fit experimental data, but no consensus has emerged yet. For instance, Kachergis et al. (2017) found that an associative model which includes competition between familiarity and uncertainty biases better reproduces the individual and combined effects of frequency and contextual diversity on human learning. Khoe et al. (2019) found that this associative model better captures the full range of individual differences and conditions when learning is cross-situational, although the hypothesis-testing approach outperforms it when there is no referential ambiguity during training.

The model proposed in this article differs from these studies by focusing on dealing with real-world data (raw images and phoneme sequences) and on employing a neural network architecture that can be used to simulate models of both categories, though in the present work the associative approach was considered.

The obtained results show that the proposed model is able to replicate the patterns of CSWL presented by humans. Additionally, the proposed model was also tested in scenarios in which there was ambiguity about the correct word-referent pairings, with more than one association. We show that the model can take into account the context to solve ambiguity and choose the correct referent for ambiguous words.

The following sections of this article are structured as follows: Section 2 discusses the Associationism theory and presents the experimental evidence on word-referent associations. Section 3 describes correlated models for language acquisition. Section 4 presents the proposed modular architecture for replicating the CSWL experiments, while Sections 5 and 6 detail the two neural network models employed in the learning tasks, LARFDSSOM and ART2 with Context. Section 7 describes the CSWL experiments performed by Yu & Smith (2007), Yurovsky et al. (2013), and Trueswell et al. (2013), along with the simulations carried out with the proposed model for replicating them. Finally, Section 8 discusses and summarizes the main conclusions drawn from the obtained results.

2. Associationism and Experimental Evidence About How Humans Learn Word-Referent Associations

Associationism is one of the most widely held theories of learning, dating back to John Locke (1700). According to it, learning is based on the human brain's sensibility to covariation. Richards & Goldfarb (1986) proposed that children could learn the meaning of a word by repeatedly associating its verbal label with their perceptual experience at the time the label is used. For those perceptual properties that repeatedly co-occur with the label, the association strengthens.

We can find several pieces of evidence supporting Associationism in word learning. For instance, children's first words often refer to things that they can see and touch; words are learned best in conditions in which an associative match would be easier to make. Additionally, the results of cross-situational word learning show that adults can learn word-referent associations with repeated co-occurrence. However, Associationism cannot explain all the observed word learning phenomena. Below, we list the most significant points collected by Bloom (2002) against a pure associationist theory of word learning.

1. Associationism requires that label and referent are simultaneously present in the environment. However, studies show that about 30-50% of the time a word is used, young children are not attending to the object the adult is talking about (Collins, 1977; Harris et al., 1983; Bunce & Scott, 2017).

2. Associationism predicts that before children have enough data to retrieve the right associations, they would often make mapping errors unless they wait until having collected strong statistical evidence. However, it was observed that in certain situations children can learn a new word even after a single exposition (Markson & Bloom, 1997; Frank & Goodman, 2014).

3. Association between labels and perceptions does not explain how children learn labels of more abstract referents that they cannot see or touch. A significant number of children's words refer to abstract conceptual categories such as "morning" or "day" (Nelson et al., 1993; Feijoo et al., 2017).

The view of the authors of this work is that the capability of statistical association is necessary, though not sufficient, for word learning, and that it can serve as a basis for other higher cognitive functions. We are interested in verifying how well we can model human word learning behavior in cross-situational word learning with a modular neural network that learns statistical correlations.

This modular network was built considering evidence collected from the literature of Psycholinguistics and Neurolinguistics, organized in a modular architecture which presents similarities to those employed in Computational Linguistics (Allen, 1994).

Below, we present the evidence that we collected from the literature, separated by field. In Section 4, we present the proposed architecture and discuss how each piece of evidence was taken into account in its specification.

2.1. Evidence from Psycholinguistics

Cross-situational word learning: There is plenty of work (Yu & Smith, 2007; Yurovsky et al., 2013; Trueswell et al., 2013; Bunce & Scott, 2017) showing that human adults can robustly figure out the correct word-referent associations in ambiguous learning situations, in which the correct mapping of a word to an intended referent cannot be guaranteed. The learning rates and patterns presented by humans in different conditions of ambiguity provide valuable information for evaluating word learning models.

Correcting feedback is not a requirement: Correcting feedback may help learning; however, children do not require it to learn word meanings. Lieven (1994) reviews works showing that there are cultures in which adults do not even speak directly to children until they are using words in a meaningful manner. This suggests a computational model considering unsupervised or reinforcement learning.

"New word, new object" preference: Studies suggest that children are biased to consider that each word is associated with a single referent (Kagan, 1981; Markman & Wachtel, 1988). Therefore, if they are presented with a new word, they will prefer to associate it with a currently unlabeled referent. This is also known as "mutual exclusivity".

Object categorization can be biased by labels: Most labels are associated not with a singular object but with a category of similar objects (that share certain properties). For instance, the word "car" refers to a set of different types of vehicles that share certain features. Plunkett et al. (2008) show that the choice of which labels are presented to children when naming new objects can affect how they categorize these objects, biasing them to create certain categories that they would not create otherwise. Mayor & Plunkett (2010) created a neurocomputational model that successfully reproduced this behavior for simulated data.
Different features are relevant for each category: The properties young children attend to when categorizing a novel entity depend on its type: object versus non-solid substance (Soja et al., 1991), plant or rock (Keil, 1994), real or toy monkey (Carey, 1995), animal or tool (Becker & Ward, 1991). This suggests the employment of subspace clustering methods in the categorization of items to form the referent concepts. In subspace clustering, certain attributes can be more relevant than others for each category, and an item may belong to more than one category. For instance, consider the categorization of a red hexagon. This object belongs to different categories depending on the features that are taken into account. Regarding its color, it belongs to the category of red objects, while regarding its shape it belongs to the category of hexagonal objects. Finally, it belongs to a third category when taking both features into account.

Fast Mapping: Other studies (Carey & Bartlett, 1978; Dollaghan, 1985; Heibeck & Markman, 1987; Rice, 1990; Markson & Bloom, 1997) show that children and adults can learn word-referent associations after a few exposures (even one), without any explicit training or feedback, and even without any explicit act of naming.

Context can affect retrieved memories: Brainerd and Reyna (1998, 2008) have shown that in experiments in which a list of words with a shared central meaning is presented for subjects to memorize, the subjects are afterward induced to recognize as having seen on the list words related to this central meaning, even when they were not on the list (false memories). These experiments suggest that the contextual meaning formed during the pattern presentations plays an important role in memorization and is taken into account during recognition (Matzen & Benjamin, 2009). This behavior was modeled and reproduced by Araujo et al. (2010) with a modular neural network.

2.2. Evidence from Neurolinguistics

Hierarchical perceptual processing: Sensory information is processed to extract information that is relevant to the individual (perceptions), through innate or self-adaptive processes, probably in inferior cortical regions such as the visual cortex (Miikkulainen et al., 2005) and the auditory cortex (Pasley et al., 2012). Superior cortical areas, such as V5 and the posterior parietal cortex, integrate information to form more complete perceptions (Udesen & Madsen, 1992; Born & Bradley, 2005).

Mirror neurons: Certain neurons respond to correlated perceptual information from different modalities, such as verbal, visual, and motor information about the same action or event, as observed in the sensory-motor cortex (Rizzolatti & Craighero, 2004; Pulvermuller, 2005).

Context recognition: The hippocampus and the amygdala keep a historical record of the input stimuli, forming a kind of context (Fletcher et al., 1997; Aggleton & Brown, 1999).

Topographic-preserving input mapping: Nearby neurons in the brain respond to inputs with similar features, as in certain areas of the brain where topographic maps are formed, especially in the primary motor, visual, and somatosensory cortical areas (Haykin, 1998; Spitzer, 1999; Miikkulainen et al., 2005).

3. Previous Language Acquisition Models Based on Self-Organizing Maps

Considering that children are able to acquire language without explicit feedback, several language acquisition models are based on unsupervised learning methods. Self-Organizing Maps (SOM) (Kohonen, 1982) and Adaptive Resonance Theory (ART) (Grossberg, 1976a,b) are two of the most prominent unsupervised learning neural networks. ART was employed for modeling human behavior in the task of memorization of word lists (Pacheco, 2004; Araujo et al., 2010), while several computational models for word learning are based on SOM (Ritter & Kohonen, 1989; Miikkulainen, 1997; Plunkett et al., 1992; Plunkett, 1997; Li et al., 2004; Silberman et al., 2007; Li et al., 2007). Refer to Li & Zhao (2013) for a review of SOM-based language acquisition models.

Ritter & Kohonen (1989) applied SOM to capture the semantic structure of words. Their pioneering work showed that implicit categories in the linguistic environment can be recognized by SOM.

Guenther & Gjaja (1996) have shown that a SOM fed with a formant representation of different phonemic categories can simulate the perceptual magnet effect (Kuhl, 1991), an effect characterized by a warping of the perceptual space near a central phonemic category, which allows certain sound categories to be considered more similar to each other than to those patterns further away from the center.

Miikkulainen (1997) introduced the DISLEX model to simulate dyslexia and aphasia. The model was the first to connect different SOMs through associative links. Each SOM represents a different type of linguistic information, such as phonological, orthographic, and semantic. DISLEX has also been shown to be able to simulate patterns of bilingual language recovery in aphasic patients (Kiran et al., 2013).

Following this structure, two models, DevLex (Li et al., 2004) and DevLex II (Li et al., 2007), were proposed to simulate children's early lexical development. Instead of employing maps with a fixed structure, in the DevLex family new nodes are inserted in the map when required, to improve the accuracy of learning. DevLex has been shown to model patterns of lexical confusion as a function of word density and semantic similarity, simulating age-of-acquisition effects while learning a growing lexicon. DevLex II has been shown to simulate several empirical phenomena, including patterns of vocabulary spurt, the relationship between comprehension and production, fast mapping, lexical category development, and lexical overextension.

The associative hypothesis is explicitly modeled by Hebbian learning in the DISLEX, DevLex, and DevLex II models. The basic idea is that the activation of co-occurring lexical and semantic representations in each map leads to an adaptive formation of associative connections between them.

Silberman et al. (2007) employed a single-layer SOM for simulating the associations between words and concepts in a semantic network that extracts semantic information from the CHILDES database (MacWhinney, 2010). The model was able to replicate learning patterns such as the effects of semantic priming, which indicate a faster response when recognizing a word semantically related to the information in episodic memory than when recognizing unrelated words.

Mayor & Plunkett (2010) presented a model for simulating fast mapping in early word learning. Their model included two SOMs, one fed with visual input representing artificial objects and the other fed with acoustic input representing words. The connections between the two SOMs were also adjusted by Hebbian learning. The model displayed learning patterns of early lexical category development, such as the tendency to attribute to a new object a known name of another object in the same category.

Figure 2: Illustration of the processing layers of the model (A - Perception: interest-point and phoneme extraction; B - Representation: LARFDSSOM for words and images; C - Context: ART2 with Context; D - Association: LARFDSSOM).

Despite the acknowledgeable achievements of these models, none of them was designed to replicate the CSWL experiments, which are an excellent source of data about word-referent associations. In this regard, Yu & Smith (2012) described and compared two competing types of models for CSWL:
A - Perception Hypothesis-Testing Models and Associative Models. In Asso- Acquisition; B - Representation; C - Context formation and recognition; and D ciative Models (Yu & Smith, 2007), the representation is a large - Association and context-dependent recognition. word-object matrix in which each cell contains the associative strength between one word and one object and a basic form of phoneme may carry little meaning, however, a sequence of Hebbian learning is employed to strength associations between phonemes could represent a word or a lemme (temporal observed word-referent pairs. In the Hypothesis-Testing Mod- consolidation). Similarly, in the visual processing, the els Medina et al. (2011); Trueswell et al. (2013), the learner description of a small patch of an image may carry little maintains a list of hypothesized pairings (a single hypothesis for meaning, however, the description of a set of patches can each word) to be further confirmed or rejected due to a mutual carry information enough to represent an object or a scene exclusivity constraint. Both types of models were shown to be (spacial consolidation). able to replicate the patterns of CSWL and the main conclusion C – Context: This layer contains the context module that of the authors was that it is necessary to look at the components receives the multisensory perceptions as input, accumulates of models to understand how they contribute to overall learning. sequences of these inputs, and clusters them to form a "tem- Such models, however, were not modular and were not poral context" that can be recognized afterward. The context developed to work with real-world input data, such as images recognition is important, for instance, to disambiguate the and sounds. This limits their ability to replicate the details meaning of homophone/homograph words, such as mouse of experiments carried out with humans. The next section (animal or computer device). 
The recognized context is describes the modular architecture we proposed to address forwarded to the next layer together with the inputs received. those issues. D – Association: The module in this layer, associates (or integrates) the perceptions of words, visual objects and 4. Proposed Modular Architecture contexts. This association is achieved by the means of perception clustering. Therefore, each cluster represents an Figure 2 illustrates the proposed architecture, which is strati- association. For instance, each word can be associated with fied in four layers. The first two layers are comprised of parallel dierent meanings that occur in dierent contexts by being modules that are specialized for each kind of stimuli (auditory represented in more than one cluster. In the same way, a or visual), while the third and fourth layers present one module visual object can be associated with more than one word by each performing multisensory integration. Below we present a being represented in more than one cluster. For instance, the general description of each layer: object car can be associated with the words car and vehicle, A – Perception: It extracts relevant information (perceptions) in two dierent clusters. from the sensory data. The sensory-perception mapping Figure 2 indicates the learning models employed in each modules present in this layer are specialized for each kind of module, as well as how the information flows through the whole input. The auditory module extracts phonemes from a sound architecture. In the following subsections, we describe each (or from a text, for convenience), while the visual module module in more detail. The learning models are described extracts descriptions of interest points from image patches. afterward. B – Representation: It consolidates perceptions that are dis- tributed in space or time, creating a representation that is 4.1. Sensory-Perceptive Mapping Modules suitable for understanding a given stimulus. 
In the CSWL experiments, visual and auditory stimuli are simultaneously presented to the subjects, as depicted in Figure 1. In the proposed model, these two kinds of stimulus are processed in parallel in the first layer to produce a numeric representation of the perceptions as output, as described in the following subsections.

4.1.1. The Auditory Sensory-Perceptive Mapping
The auditory input data consists of a stream of text representing the name of each object displayed on the screen. For instance, the string "mixer, canister, rasp, goblet" would describe the objects in Figure 1.
In order to obtain a numeric representation of the auditory data, we followed a procedure similar to that described by Araujo et al. (2010). First, we convert each word to its respective phonetic representation. This step employs the CMU Pronouncing Dictionary (CMUdict) (Lenzo, 2007). Therefore, the example above is translated into "K AE N AH S T ER, R AE S P, G AA B L AH T, M IH K S ER", in which each phoneme is represented by its ARPAbet symbol, separated by spaces.
Afterward, each phoneme is translated into a vector of 12 real values ranging from -1 to +1 (see Table B.2 in Appendix B). This numeric representation was built considering the place of pronunciation of each phoneme in the International Phonetic Alphabet (IPA) charts for vowels and consonants, encoding specific features for vowels (4 of them) and for consonants (8 of them). Therefore, when a vowel is represented, the features for consonants are set to zero, and when a consonant is represented, the features for vowels are set to zero. The rationale behind this procedure is to obtain similar representations for phonemes with similar sounds.
Finally, the representation of any sequence of words is a list of vectors, each vector describing the characteristics of one phoneme in the sequence. This list represents the perception output by the Auditory Sensory-Perceptive Mapping.

4.1.2. The Visual Sensory-Perceptive Mapping
The extraction of visual perceptions consists of detecting and numerically describing the parts of the object present in the image. In this article, we follow the literature of Unsupervised Object Discovery (Weber et al., 2000; Tuytelaars et al., 2010; Kinnunen et al., 2012), and we use the Scale Invariant Feature Transform (SIFT) to detect Points of Interest (POIs) and describe each POI as a vector of 128 values (Lowe, 1999), called a "POI descriptor". These POI descriptors are normalized by an L2 normalization.
In this module, each object on the screen is represented by the list of descriptors of the POIs detected and described by SIFT. For instance, in the 4x4 condition exemplified in Figure 1, we have four objects on the screen, which will result in four lists of POI descriptor vectors, one list per object.

4.2. Representation Modules
This layer contains representation modules specialized for each type of stimulus (visual or auditory). A feature vector produced by either module described above, considered in isolation, is not enough to identify the auditory or visual elements. For instance, one phoneme is not enough to identify a word; analogously, the descriptor of one POI of an image cannot identify an object. Therefore, it is necessary to compose the information from several feature vectors to properly describe an element of interest, thus allowing its recognition.
The basic idea employed in this module is to build a Bag-of-Features (BoF) representation, by determining and stringing the features distributed in space or time. This approach was used for Unsupervised Visual Object Discovery (UVOC) from images by Tuytelaars et al. (2010) and Kinnunen et al. (2012). It derives from the Bag-of-Words (BoW) approach, a way to represent text (Salton & McGill, 1986) for categorization tasks. The BoF approach consists of two steps: first, similar features are clustered to create a dictionary of features called a "codebook", in which the number of clusters determines the size of the feature vector produced to represent the objects. After creating this dictionary, the objects are described by counting the number of features mapped to each cluster, thus forming a histogram of occurrences, which is usually normalized.
In Tuytelaars et al. (2010), several clustering methods and types of histogram normalization were evaluated. The authors concluded that when there is one object category per image, even k-means yields good results, being outperformed only by spectral clustering. Kinnunen et al. (2012) considered SOM to be a viable alternative clustering method for BoF. The authors obtained results similar to those presented by Tuytelaars et al. (2010); however, they found SOM to be more robust to the type of normalization applied to the histogram.
Instead of the traditional SOM, we employ LARFDSSOM in the representation layer to generate the codebook. LARFDSSOM is a suitable method for this task because it is capable of subspace clustering and it employs a locally weighted distance metric to adjust the relevances of the input dimensions. This is an important property when the input data present high dimensionality, since it is able to identify, for instance, which kinds of image patches are relevant for determining each object category and its associated phonetic representation. A detailed description of LARFDSSOM is provided in Section 5.
The representation module maps were pre-trained to learn a codebook, forming 28 clusters in the phonetic representation map and 37 clusters in the visual representation map. This training occurred in advance, since these maps represent the previous knowledge that each individual has about the phonetic structure of their native language and about the basic perceptual elements necessary to recognize objects.
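To make the two BoF steps concrete, the sketch below clusters toy descriptors into a codebook and builds a normalized histogram of occurrences for one object. It is a minimal illustration, not the model's actual pipeline: plain k-means (with farthest-point initialization) stands in for the LARFDSSOM codebook learning, the 2-D toy descriptors replace 128-D SIFT vectors, and the function names are ours.

```python
import numpy as np

def build_codebook(descriptors, n_clusters, n_iters=20):
    """Step 1: cluster feature descriptors to create the 'codebook'.
    Farthest-point initialization followed by standard k-means updates."""
    centers = [descriptors[0]]
    for _ in range(n_clusters - 1):
        d = np.min([np.linalg.norm(descriptors - c, axis=1) for c in centers], axis=0)
        centers.append(descriptors[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iters):
        d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = descriptors[labels == k].mean(axis=0)
    return centers

def bof_histogram(object_descriptors, centers):
    """Step 2: count the object's descriptors mapped to each cluster,
    then normalize the histogram of occurrences."""
    d = np.linalg.norm(object_descriptors[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# toy "POI descriptors": two well-separated groups of 2-D points
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(3, 0.1, (50, 2))])
codebook = build_codebook(data, n_clusters=2)
h = bof_histogram(data[:60], codebook)   # one "object" with 60 descriptors
```

The resulting vector `h` has one entry per codebook cluster and sums to one, which is the fixed-size object representation the text describes.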
These lists represent the perception output by the Visual Sensory-Perceptive Mapping. The outputs of both Sensory-Perceptive Mapping modules in Layer I are then fed as inputs to the respective Representation Modules in Layer II.

4.3. Context Module
This module should associate a context with each newly received input, in a way to distinguish the same stimulus presented under distinct contexts, and also to approximate different inputs presented in similar contexts. In the brain, this role is played by the hippocampus, where several recurrent connections to the cortical regions of memorization are observed; hence, recurrent neural networks seem to be a suitable approach. For this reason, we applied the ART2 with Context described in Section 6.
The visual and auditory representations are given as input to the ART2 with Context, which recognizes the current context or creates a new context if necessary. The outputs of the context module consist of the visual and auditory inputs, unchanged, associated with the context representation recognized by the ART2 with Context and stored by its context units, UC.

4.4. Association Module
The Association Module takes as input the three outputs of the context module (visual, auditory, and contextual information) and associates them. In this article, this task is also carried out by a LARFDSSOM. The map computes the activation of all existing nodes, and the node with the highest activation, the winner node, represents the best association found. If its activation is above the threshold parameter a_t, this node is updated to slightly modify the previous association. Otherwise, a new node is inserted in the map to represent the new association presented at its inputs.
It is worth pointing out that, as the nodes on the map are updated, they learn which input dimensions are relevant. This allows the nodes to take into account only the aspects of the visual, auditory, and contextual information that present a certain level of correlation. For instance, if a word occurs frequently with the same sound in several different contexts, the node can learn that the context is irrelevant for this association. In another example, if certain aspects of the image correlate with a certain sound while others do not, the uncorrelated aspects are taken as irrelevant.
In our simulations, the LARFDSSOM was initialized with a single neuron randomly positioned in the input space, and no limit was applied to the number of nodes created, so that the network could grow as much as required to represent the associations found. The output of the association module is the activation of the winner node. If this value is above the threshold parameter a_t, it indicates that the pattern presented at the inputs of the network was recognized; thus, the visual, contextual, and auditory information are considered associated, and the higher this value, the stronger the association made by the map. This allows us to compare object-sound associations in different contexts and to identify the strongest association. During the recognition phase of the cross-situational word-learning simulations, all pairings of objects and sounds are presented as input to the model, and the pair with the highest activation is considered the strongest association made by the network.

4.5. How the Evidence was Taken Into Account
Each piece of evidence collected in the literature and described in Sections 2.1 and 2.2 was somehow taken into account in the proposition of the architecture, as indicated below:
Cross-situational word learning: The proposed model was designed to replicate the CSWL experiments, while keeping the main aspects of the structure of previous SOM-based language acquisition models.
Correcting feedback is not a requirement: The proposed model was developed based on unsupervised learning models; therefore, it does not require correcting feedback.
"New word, new object" preference: Though this was not evaluated in our experiments, the similarity-based competition employed in the learning model used in the association layer (LARFDSSOM) makes stimuli significantly different from what was previously seen (novel stimuli) tend to be stored in new association nodes.
Object categorization can be biased by labels: The proposed architecture was specially designed to take this into account by making both labels and referents inputs to the association layer. This allows labels to affect the categorization of referents by making their representations more similar/different. The same holds for the contextual information.
Different features are relevant for each category: This is a feature of LARFDSSOM, which learns the relevance of each input dimension for each category during the self-organization process.
Fast Mapping: LARFDSSOM can learn new associations in one shot.
Context can affect retrieved memories: In the proposed architecture, the current context is recognized and affects the information stored and retrieved, since it is part of the representation sent to the association layer.
Hierarchical perceptual processing: This inspired the layered architecture proposed, which takes raw sensory data as input, extracts perceptions, converts them to a more suitable representation fed to the context formation layer, and finally forwards it to the association layer.
Mirror neurons: The nodes in the association layer perform multisensory integration and can be activated by information of different modalities, similarly to mirror neurons.
Topographic-preserving input mapping: This inspired the employment of a SOM-based model with a topographic-preserving characteristic in layers B and D.
The following two sections provide details about the implementation of LARFDSSOM (employed in the representation and association layers) and of ART2 with Context (employed in the Context Layer). All the source code and datasets produced in the Perception Layer are available online (on GitHub: https://github.com/hfbassani/word-referent-association).

5. Subspace Clustering with Self-Organizing Maps
The Self-Organizing Map (SOM) proposed by Kohonen (1982) is a neural network trained with unlabeled data (unsupervised learning). It maps high-dimensional data into a lower-dimensional (usually bi-dimensional) grid of N nodes (or neurons), compressing information while preserving the topological relationships of the original data. The following characteristics of SOM are worth highlighting here:
- It creates an abstraction and a simplified representation of the input data distribution (Haykin, 1998). Each node can be seen as a prototype representing similar input data.
- Its topological properties correlate with what is observed in the sensory processing regions of the brain, where the input stimuli are represented in topologically ordered neural maps (Miikkulainen et al., 2005). In particular, sensory inputs such as tactile (Kaas et al., 1983), visual (Hubel & Wiesel, 1962, 1977), and acoustic (Suga, 1985) inputs are mapped to different areas of the cerebral cortex in a topologically orderly manner.
- SOM-based models were applied to a variety of problems involving sensory processing, including voice recognition and image processing (Kangas, 1991; Venkateswarlu & Kumari, 2011; Abdelsamea et al., 2015; Chen et al., 2017).
These characteristics have made SOM a good candidate for modeling the processing of perceptions. However, as mentioned in the previous section, traditional clustering algorithms (SOM included) are not adequate to create abstract representations in the form of perceptions, because they weight all input dimensions equally and because they map each input stimulus to a single cluster. These limitations prevent SOM from correctly clustering this kind of data and from creating prototypes that represent the several possible abstractions associated with the same stimulus, as in the example of the red hexagon given above. Therefore, other SOM-based subspace clustering methods that address these limitations are considered here.
The Dimension Selective Self-Organizing Map (DSSOM) (Bassani & Araujo, 2012) was one step towards making SOM adequate for subspace clustering. By using a weighted Euclidean distance (Equation 1) to compare samples and prototypes, it is able to adjust, for each grid node, the relevance of each dimension in determining the winning node. Thus, the model allows the weights of some dimensions to be even zeroed, so that these dimensions do not influence the selection of the data clustered by a given node. The adjustment of these weights is done adaptively during the self-organization process.

    [D_ω(x, c_j)]² = Σ_{i=1}^{n} ω_{ji}² (x_i − c_{ji})²    (1)

where x is an input stimulus, c_j is the j-th prototype on the map, and ω_ji ∈ [0, 1] is the weighting factor that the j-th prototype applies to the i-th input dimension.
These weighting factors are estimated from the variance of the input patterns clustered by each node on the grid: the higher the variance, the lower the weighting factor. Moreover, DSSOM allows more than one node to win for a given input stimulus, so that nodes that apply a set of weighting factors different from those considered by the previous winners can also group that stimulus.
DSSOM presented solid results, comparable to or better than those of previous subspace clustering methods from the data mining field. However, the fixed topology of DSSOM (an N×N grid) requires strong knowledge about the data and may not adequately represent the neighborhood topology of clusters that live in different subspaces. This issue was addressed in the map described in the next section, which is the method we have chosen to employ in the proposed model for learning word-referent associations.

5.1. Local Adaptive Receptive Field Dimension Selective Self-Organizing Map - LARFDSSOM
LARFDSSOM (Bassani & Araujo, 2015) preserves the main characteristics of SOM and DSSOM. However, in LARFDSSOM the nodes are not organized in a fixed grid. Instead, it introduces a time-varying structure with a mechanism that inserts new nodes into the map whenever the winner node is not similar enough to the current input pattern. In order to achieve this, it defines an activation function (Equation 2), inversely related to the distance presented in Equation 1, and a threshold parameter (a_t). When the activation of the winner node in response to an input pattern is below this threshold, a new node is inserted into the map, at the position of the input pattern.

    ac(D_ω(x, c_j), ω_j) = 1 / (1 + D_ω(x, c_j) / (‖ω_j‖ + ε))    (2)

where ε is a small value to avoid division by zero, ‖ω_j‖ is the norm of the relevance vector, and D_ω(x, c_j) is the weighted distance function shown in Equation 1.
The relevance vector is computed as an inverse function of the average distance of each node to the input patterns that it clusters, δ_j, i.e., the greater the average distance in a dimension, the smaller the respective relevance (Equation 3):

    c_j(n+1) = c_j(n) + e (x − c_j(n))
    δ_j(n+1) = (1 − e β) δ_j(n) + e β |x − c_j(n)|
    ω_ji = 1 / (1 + exp((δ_ji − δ_ji mean) / (s (δ_ji max − δ_ji min)))),  if δ_ji min ≠ δ_ji max
    ω_ji = 1,  otherwise    (3)

where e is the learning rate, given by e = e_b if j is the winner node and e = e_n if j is a neighbor of the winner node; δ_ji max, δ_ji min, and δ_ji mean are, respectively, the maximum, the minimum, and the mean of the components of the distance vector δ_j; and e_b, e_n, s, β ∈ [0, 1] are parameters.
Also, in LARFDSSOM, nodes that do not cluster a minimum percentage (lp) of the input patterns are periodically removed from the map (every maxcomp competitions). Additionally, the neighborhood connects only nodes that take into account a similar subset of the input dimensions.
The operation of the map comprises three phases: organization, convergence, and clustering. In the organization phase, the nodes compete to cluster each new input pattern, so that the winner and its neighbors are updated to approximate it, and new nodes are created whenever the most activated node does not reach the threshold a_t. The convergence phase is similar to the organization phase, except that node insertion is not allowed. Finally, in the clustering phase, the consolidated map is not changed anymore, being used only for clustering.

Algorithm 1: Self-Organization Phase
1  Initialize parameters a_t, lp, nwins, maxcomp, ...;
2  Initialize the map with one node with c_j initialized at the first input stimulus, δ_j ← 0, ω_j ← 1, and wins_j ← 0;
3  Initialize the variable nwins ← 1;
4  foreach input stimulus (x) do
5      Present x to the map;
6      Compute the activation of all nodes (Equation 2);
7      Find the winner s with the highest activation (a_s);
8      if a_s < a_t and N < N_max then
9          Create a new node j setting: c_j ← x, δ_j ← 0, ω_j ← 1, and wins_j ← lp × nwins;
10         Set up the neighborhood of node j;
11     else
12         Update the vectors c, δ, and ω of the winner and of its neighbors (Equation 3);
13         Set wins_s ← wins_s + 1;
14     end
15     if nwins = maxcomp then
16         Remove nodes with wins < lp × maxcomp;
17         Update the connections of the remaining nodes;
18         Reset the number of wins of the remaining nodes: wins_j ← 0;
19         nwins ← 0;
20     end
21     nwins ← nwins + 1;
22 end

Algorithm 2: Clustering with LARFDSSOM
1  foreach input pattern (x) in the dataset do
2      Present x to the map;
3      Compute the activation of all nodes (Equation 2);
4      Find the winner s with the highest activation (a_s);
5      if a_s ≥ a_t then
6          repeat
7              Assign x to the cluster of the winner node s;
8              Find the next winner s, disregarding the previous winners;
9          until a_s < a_t;
10     else
11         x was not recognized;
12     end
13 end
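Equations 1-3 can be transcribed in code as follows. This is a minimal numpy sketch of the three update rules (the function names and toy values are ours), not the reference implementation of LARFDSSOM.

```python
import numpy as np

EPS = 1e-9  # small value to avoid division by zero (the paper's epsilon)

def weighted_distance(x, c_j, w_j):
    """Equation 1: [D_w(x, c_j)]^2 = sum_i w_ji^2 (x_i - c_ji)^2."""
    return np.sqrt(np.sum((w_j ** 2) * (x - c_j) ** 2))

def activation(x, c_j, w_j):
    """Equation 2: ac = 1 / (1 + D_w(x, c_j) / (||w_j|| + eps))."""
    return 1.0 / (1.0 + weighted_distance(x, c_j, w_j) / (np.linalg.norm(w_j) + EPS))

def update_node(x, c_j, delta_j, e, beta, s):
    """Equation 3: move the prototype toward x, update the moving average
    of per-dimension distances, and recompute the relevance vector so that
    dimensions with larger average distance get smaller relevance."""
    c_new = c_j + e * (x - c_j)
    delta_new = (1 - e * beta) * delta_j + e * beta * np.abs(x - c_j)
    d_min, d_max, d_mean = delta_new.min(), delta_new.max(), delta_new.mean()
    if d_max > d_min:
        w_new = 1.0 / (1.0 + np.exp((delta_new - d_mean) / (s * (d_max - d_min))))
    else:
        w_new = np.ones_like(delta_new)
    return c_new, delta_new, w_new
```

Note that a node sitting exactly on the input pattern reaches the maximum activation of 1, and any mismatch in a dimension with nonzero relevance lowers it; this is the value compared against the threshold a_t.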
In this article, for simulating the learning process of a subject going through the CSWL experiments, we employ the organization phase (shown in Alg. 1), without limiting the number of nodes in the map and with nodes being updated as per Equation 3, while the convergence phase is not used. The clustering phase (shown in Alg. 2) is used for testing what the simulated subjects have learned to recognize.

6. Context Formation and Recognition with ART2
Since words can have different meanings in different contexts, taking context into account when recognizing words is a fundamental task in word learning. In this work, we employ for this task a neural network called ART2 with Context (Araujo et al., 2010), based on the ART2 of Carpenter & Grossberg (1987), a model from the Adaptive Resonance Theory. Such an unsupervised incremental learning model is capable of grouping patterns, associates stimuli of different natures, adjusts the degree of similarity of the grouped patterns, works with plasticity and stability, and presents some plausibility. Araujo et al. (2010) adapted ART2 by inserting context units with recurrent connections. These context units aim to store a history of the input patterns and make this context affect both the pattern search and recognition phases.
The ART2 with Context, shown in Figure 3, presents the same input (F1) and output (F2) layers of ART2; however, context units UC and PC with recurrent connections were added to the model. Each UC unit contains a kind of average of the input values. Each U unit stores the intensity of the occurrence of a property in the input pattern, in the internal representation of the ART2 network, i.e., properly rescaled and with noise suppression. Each UC unit receives two connections: the new input pattern from U, and a feedback from itself with its own previously stored value. This feedback has a back parameter, which controls the weight of the previous value of each UC unit. At the end of the presentation of a sequence of stimuli, it is expected that the context formed and stored in the UC units approximates an average representation of the similar stimuli present in the sequence. The PC units serve as an interface between the F1 and F2 layers, and they have a role equivalent to that of the P units of the original ART2 model.
Algorithm 3 presents all the steps needed for training the ART2 with Context included in the proposed model, for which the parameters are:
- n: number of units in the F1 layer; it is equal to the number of semantic features.
- a and b: fixed weights in F1. We set a = 10 and b = 10.
- c: fixed weight used by the reset condition, in the [0,1] interval.
- d: activation of the winner unit in F2, within the [0,1] interval. The value 0.9 was used.
- e: parameter to avoid division by zero when the norm of a vector is zero. The value 0.0001 was used.
- θ: noise suppression parameter, typically 1/√n. Input vector components with values lower than θ have their values taken to zero.
- λ: learning rate. Used value: 0.001.
- ρ: surveillance (vigilance) parameter, which determines the number of groups to be formed. Values in the [0.7,1] interval produce effective control over the number of groups formed. Used value: 1.
- epochs: maximum number of epochs. We set it to 1.
- n_iter: maximum number of iterations. Used value: 1.
- back: weight of the context, in the [0,1] interval. Used value: 0.9.
- cw: influence rate of the contextual information over the reset mechanism, in the [0,1] interval. Used value: 0.
- d_ctx: equivalent in effect to d, used for the context units. Used value: 0.9.
- λ_ctx: context learning rate. Used value: 0.8.
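The recurrent behavior of the UC units (leaky averaging of the noise-suppressed internal representation, followed by rescaling) can be sketched as below. This is a minimal illustration with assumed values for back and the noise-suppression threshold θ, not the full ART2 with Context.

```python
import numpy as np

EPS = 1e-4   # the paper's e parameter, used to avoid division by zero
THETA = 0.2  # noise-suppression threshold (theta); a placeholder value here

def f(x, theta=THETA):
    """Noise suppression: components below theta are zeroed."""
    return np.where(x >= theta, x, 0.0)

def update_context(uc, u, back=0.9):
    """UC update: a leaky average of the (noise-suppressed) internal
    representation u, followed by rescaling, as in lines 10-12 of
    Algorithm 3."""
    uc = back * uc + (1 - back) * f(u)
    return uc / (EPS + np.linalg.norm(uc))

# presenting a sequence of similar stimuli drives UC toward their
# shared direction, forming the context representation
uc = np.zeros(3)
for _ in range(10):
    uc = update_context(uc, np.array([0.8, 0.6, 0.05]))
```

With back close to 1, the context changes slowly across stimuli, which is what lets UC summarize a whole sequence rather than track each individual input.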
The variables p_i, q_i, r_i, s_i, u_i, v_i, w_i, x_i, y_i, uc_i, and pc_i are the i-th elements of the vectors P, Q, R, S, U, V, W, X, Y, UC, and PC, respectively. J is the node in F2 with the highest activation; reset indicates whether the winner node in the F2 layer cannot learn the presented pattern; T is the top-down matrix of weights and B is the bottom-up matrix of weights. The f(x) function is defined as:

    f(x) = x, if x ≥ θ
    f(x) = 0, if x < θ

Algorithm 3: Training ART2 with Context
1  Initialize: a, b, c, d, e, θ, ρ, λ, epochs, n_iter, n, back, cw, d_ctx, λ_ctx;
2  for epochs do
3      for each input stimulus s do
4          Initialize activations in the F1 layer:
5          u_i = 0; w_i = s_i; p_i = 0; q_i = 0; x_i = s_i/(e + ‖s‖); v_i = f(x_i);
6          Update activations in the F1 layer:
7          u_i = v_i/(e + ‖v‖); w_i = s_i + a·u_i; p_i = u_i;
8          x_i = w_i/(e + ‖w‖); q_i = p_i/(e + ‖p‖); v_i = f(x_i) + b·f(q_i);
9          Propagate values to UC:
10         uc_i = (back)·uc_i + (1 − back)·f(u_i);
11         Rescale the context units:
12         uc_i = uc_i/(e + ‖uc‖);
13         Propagate the context values to PC:
14         pc_i = uc_i;
15         Update activations in the F2 layer:
16         y_j = (1 − cw)·Σ_i b_{i,j}·p_i + cw·Σ_i b_{i+n,j}·pc_i;
17         reset = true;
18         while reset do
19             Find the unit in F2 with the highest activation y_J:
20             y_J = max_j [y_j];
21             if y_J = −1 then
22                 J = an unused unit;
23                 reset = false;
24             end
25             if reset then
26                 u_i = v_i/(e + ‖v‖); p_i = u_i + d·t_{J,i}; pc_i = t_{J,i+n};
27                 r_i = (u_i + c·p_i + cw·pc_i)/(e + ‖u‖ + c‖p‖ + cw‖pc‖);
28                 if ‖r‖ < (ρ − e) then
29                     reset = true; y_J = −1;
30                 else
31                     reset = false; w_i = s_i + a·u_i; x_i = w_i/(e + ‖w‖);
32                     q_i = p_i/(e + ‖p‖); v_i = f(x_i) + b·f(q_i);
33                 end
34             else
35                 for n_iter do
36                     Update the weights of the winner unit J:
37                     t_{J,i} = λ·d·u_i + [1 + λ·d·(d − 1)]·t_{J,i};
38                     b_{i,J} = λ·d·u_i + [1 + λ·d·(d − 1)]·b_{i,J};
39                     t_{J,i+n} = λ_ctx·d_ctx·uc_i + [1 + λ_ctx·d_ctx·(d_ctx − 1)]·t_{J,i+n};
40                     b_{i+n,J} = λ_ctx·d_ctx·uc_i + [1 + λ_ctx·d_ctx·(d_ctx − 1)]·b_{i+n,J};
41                     Rescale the updated vectors:
42                     t_{J,i} = t_{J,i}/‖t_J‖; b_{i,J} = b_{i,J}/‖b_J‖;
43                     t_{J,i+n} = t_{J,i+n}/‖t_J‖; b_{i+n,J} = b_{i+n,J}/‖b_J‖;
44                     Update activations in the F1 layer:
45                     u_i = v_i/(e + ‖v‖); w_i = s_i + a·u_i; p_i = u_i + d·t_{J,i};
46                     x_i = w_i/(e + ‖w‖); q_i = p_i/(e + ‖p‖);
47                     v_i = f(x_i) + b·f(q_i);
48                 end
49             end
50         end
51     end
52 end

The training algorithm (Algorithm 3) works as follows. After the variable initializations (line 1), a loop is executed for each training epoch. For each input pattern, the activations of the units in layers U, W, P, Q, X, and V are initialized (line 5) and updated to reflect the effects of the input pattern (lines 7 and 8). Then, the values computed are propagated to the context units UC (line 10), and the new values are rescaled (line 12) and copied to the PC units (line 14). Next, the values stored in the P and PC units are propagated to the F2 layer, where a competition occurs among the groups. Each group responds with an activation y_j (line 16), and the loop started in line 18 repeats until a winner group is defined and updated. First, the group J with the highest activation is found (line 20). If this group is disabled (activation = −1), all groups were deactivated because of a reset sign, and a new group is created (lines 21-24). Otherwise, it is verified whether the winner group is similar enough to the presented pattern (using the ρ parameter). If not, the group is disabled and a reset occurs, so that another group can be found (lines 25-33). If the winner group is considered similar enough to the input pattern, it is approximated to it (lines 36-40), the updated vectors are normalized (lines 42 and 43), and, finally, the activations in layer F1 are updated (lines 45-47).
The pattern recognition is done in a way very similar to the network training.
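The vigilance test that triggers a reset (lines 27-28 of Algorithm 3) can be sketched as below. This is a minimal illustration with toy 2-D vectors and illustrative values for c, cw, and ρ; the function name is ours.

```python
import numpy as np

EPS = 1e-4  # the paper's e parameter

def reset_check(u, p, pc, rho, c=0.1, cw=0.0):
    """Vigilance test: build r from the input representation u, the
    top-down expectation p, and the context interface pc, then signal a
    reset when ||r|| drops below rho - e, i.e. when the winner's
    expectation disagrees too much with the input."""
    r = (u + c * p + cw * pc) / (EPS + np.linalg.norm(u)
                                 + c * np.linalg.norm(p)
                                 + cw * np.linalg.norm(pc))
    return float(np.linalg.norm(r)) < (rho - EPS)

u = np.array([1.0, 0.0])
match = reset_check(u, u, u, rho=0.95)                        # expectation agrees
mismatch = reset_check(u, np.array([0.0, 1.0]), u, rho=0.95)  # disagreement
```

When u and p point in the same direction, ‖r‖ stays near 1 and the winner is accepted; an orthogonal expectation pulls ‖r‖ down and forces the search to continue with another group.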
The main difference is that there is no storage in the F2 layer. Moreover, an adaptation of the ρ parameter is done: it starts with an initial value close to 1 and is slightly reduced until a group is found in the F2 layer.

Figure 3: Architecture of the ART2 with Context, composed of two layers: F1 is the input layer and F2 is the output layer; and the context units: UC, with a feedback loop, responsible for creating the context representation, and the PC units, serving as an interface between the F1 and F2 layers.

The next section describes the simulations carried out with the proposed model.

7. Simulations
The simulations aimed to reproduce the CSWL experiments available in the literature, following the methodology introduced by Yu & Smith (2007) and further extended by others. Sections 7.3 to 7.7 describe the CSWL experiments considered in this work and the respective simulations carried out with the proposed model, while Section 7.2 describes the dataset used in the simulations. Notice that we employ the term "Experiment" to refer to the actual experiments carried out with humans by Yu & Smith (2007), Yurovsky et al. (2013), and Trueswell et al. (2013), whereas the term "Simulation" refers to the simulations carried out with the proposed model, aiming to replicate each particular experiment. From Section 7.3 to Section 7.7, we describe very briefly the considered CSWL experiments and their respective simulations; the details of the mentioned experiments are described in Appendix A. The subsections describing the simulations are divided into the following parts: first, (i) a detailed description of the experiment conducted by the authors is presented; then, (ii) the procedures used to simulate the experiment are described; and, finally, (iii) the results produced by the simulations are presented in comparison with the results obtained in the original experiments.
Furthermore, in Section 7.8, the model with the adjusted set of parameters is evaluated in a last simulation, which aims to cover a part of the model that was not evaluated in the previous experiments: the Context Module and its role in providing the correct meaning for words with different meanings in different contexts. Since no work with this objective was found in the literature, an experimental design is firstly proposed to evaluate this ability in individuals; then, the results produced by the simulations of this experiment are presented.
Figure 4 illustrates the workflow for simulating the experiments. We start these subsections with the description of the parametric setup.

Figure 4: Workflow of the simulations: the process for generating the visual and auditory representations is illustrated in the Dataset preparation box. This process was executed only once, and the same representations were used in all experiments. The Experiment simulation box illustrates the process for simulating an experiment. This process was repeated for each experiment with different inputs, selected from the representations dataset according to the experiment design.

7.1. Parameter Adjustment
The parameters of each module of the proposed model were adjusted only once, to minimize the differences between the results of all experiments and their respective simulations. The exploration of possible parameter values was made by employing Latin Hypercube Sampling (LHS) (Saltelli et al., 2009), and the best parameter set is presented in Table 1. The dataset used in the experiments and the way each stimulus was presented to the model are detailed in the next section.

7.2. The Real World Object Image and Label Dataset
In order to simulate the stimuli provided to the participants in the experiments of Yu & Smith (2007), we used 18 words for objects commonly found at home (armoire, bed, bowl, canister, chair, clock, computer, cooker, cup, desk, door, dresser, fork, knife, refrigerator, sofa, spoon, and telephone). In addition, 18 object images associated with these names were obtained from Google Image Search, using the respective word as the search term.
Figure 1 displays a sample of the object images collected. The complete dataset is available online (see footnote 1). This dataset was used in all simulations presented in the following subsections.

7.3. Experiment 1: Word Learning Under Uncertainty
Yu & Smith (2007) evaluated the CSWL abilities of 38 undergraduate students dealing with slides containing pictures of unusual objects paired with pseudowords presented in auditory form. There were 3 groups of 18 pairs trained under different conditions concerning the number of labels and pictures presented per trial (2 and 2, 3 and 3, or 4 and 4). In the test, each subject was presented with 1 word and 4 pictures and asked to choose the picture labeled by that word. The details of the experiment are given in Appendix A.1.

7.3.1. Procedures for Simulation 1
In the cross-situational experiments, the auditory stimuli (the sounds of the words) formed a unique stream; thus, in each trial, a single auditory representation was created by chaining the representations of the sequences of phonemes of the words presented. For example, assuming that the following four words are used in a trial: bed, chair, bowl, and fork, the representation of the respective sequence of phonemes, /b e d t S e @ b @ U f O k/, formed the auditory input, as described in Section 4.1.1.
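The construction of a trial's inputs described above can be sketched as follows: the phoneme vectors of all spoken words are chained into a single auditory stream, which is then paired with each visual representation. The phoneme table, dimensions, and function names here are illustrative stand-ins for the actual representations.

```python
import numpy as np

PHONEME_DIM = 12  # each phoneme maps to a 12-value vector (Table B.2)

def auditory_stream(words, phoneme_codes):
    """Chain the phoneme vectors of all words spoken in a trial
    into a single auditory input, as in Section 4.1.1."""
    vecs = [phoneme_codes[ph] for w in words for ph in w]
    return np.vstack(vecs)

def trial_inputs(words, images, phoneme_codes):
    """Pair the single auditory stream with each visual stimulus:
    an NxN trial yields N (auditory, visual) input pairs."""
    stream = auditory_stream(words, phoneme_codes)
    return [(stream, img) for img in images]

# toy phoneme table (a stand-in for the ARPAbet-to-vector mapping)
codes = {"B": np.full(PHONEME_DIM, 0.1), "EH": np.full(PHONEME_DIM, -0.2),
         "D": np.full(PHONEME_DIM, 0.3)}
words = [["B", "EH", "D"]]                 # "bed" as an ARPAbet sequence
images = [np.zeros(4), np.ones(4)]         # two toy visual representations
pairs = trial_inputs(words, images, codes)
```

Each resulting pair shares the same auditory stream but a different visual representation, which is exactly the input set presented to the model in one trial of the 2x2 condition.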
On the other hand, in Yu & Smith In order to simulate the stimuli provided to the participants (2007) individuals could pay attention to each image at a time, in the experiments of Yu & Smith (2007), we used 18 words of observing them individually. Moreover, since there is not a objects commonly found at home (armoire, bed, bowl, canister, strong correlation between the images, they make more sense 11 Table 1: Best parameter values obtained with the LHS adjustment. four images. One of them is the correct association and the others are randomly chosen foils. The input stimuli are built Parameter Value similarly as in the training, with the only dierence that now Visual Representation Module – LARFDSSOM there is only one word, which its representation is paired with the representation of each one of the four images. To identify Activation threshold (a ) 0.985 the association made by the model, each input pair is presented Lowest cluster percentage (l p) 0.15% in a random sequence and the level of activity of the winner Relevance rate ( ) 0.10 node in the association layer is registered. Then, the input Max competitions (maxcom p) 0.021S pair that produced the highest activation is considered as the Winner learning rate (e ) 5 10 Neighbors learning rate (e ) 12 10 e strongest association made by the model. n b Relevance smoothness (s) 0.007581760 The model was trained and tested 38 times, initialized with a Connection threshold (c) 0.50 dierent random seed, representing 38 dierent individuals. Auditory Representation Module – LARFDSSOM 7.3.2. 
7.3.2. Results of Experiment 1 and Simulation 1

Figure 5 shows that, in the results obtained by Yu & Smith (2007), in all conditions the individuals correctly guessed significantly more pairs (0.889 ± 0.07 in condition 2x2, 0.778 ± 0.10 in 3x3, and 0.556 ± 0.00 in 4x4) than they would have by chance (1/4 = 0.25). Even in the most difficult condition (4x4), with 16 possible associations per trial, the individuals guessed on average 10 of the 18 pairs (0.55). The authors argue that humans are good at guessing the correct word-referent associations in situations of ambiguity, and the results clearly show that the increase in the level of ambiguity inside the trials negatively affects the learning. This is confirmed by comparing the averages in conditions 2x2 and 4x4 in a t-test with a significance level of 1%.

Figure 5: Experiments of Yu & Smith (2007) in comparison with the results of our simulations. The strong horizontal dashed line indicates the probability of guessing by chance, while the error bars indicate the standard deviation.

Table 1 (continued):

Auditory Representation Module – LARFDSSOM
  Activation threshold (a_t): 0.935
  Lowest cluster percentage (lp): 0.001%
  Relevance rate: 0.10
  Max competitions (maxcomp): 2S
  Winner learning rate (e_b): 0.10
  Neighbors learning rate (e_n): 14·10
  Relevance smoothness (s): 0.00394
  Connection threshold (c): 0.50

Context Module – ART2 with Context
  Fixed weight in F1 (a): 10
  Fixed weight in F1 (b): 10
  Reset weight condition (c): 0.10
  Winning unit activity in F2 (d): 0.9
  Parameter to avoid division by zero (e): 0.0001
  Noise suppression parameter: 0.0739221
  Learning rate: 0.8
  Vigilance parameter: 0.999
  Number of epochs (epochs): 1
  Number of iterations (n_iter): 1
  Backpropagation context parameter (back): 0.90
  Context influence above reset mechanism (cw): 0.0002
  Winner unit activity in F2 for the context (d_ctx): 0.9
  Context learning rate: 0.80

Association Module – LARFDSSOM
  Activation threshold (a_t): 0.999
  Lowest cluster percentage (lp): 17.5211%
  Relevance rate: 0.870879
  Max competitions (maxcomp): 10000
  Winner learning rate (e_b): 0.465091
  Neighbors learning rate (e_n): 0.0134102
  Relevance smoothness (s): 1.31357
  Connection threshold (c): 0.986745

Although there are visible differences, analogous conclusions can be drawn from the results of our simulations. The model could guess the correct associations better than chance and displayed a similar pattern of decay of learning as a function of the ambiguity inside trials (0.778 ± 0.044 in condition 2x2, 0.700 ± 0.061 in 3x3, and 0.567 ± 0.084 in 4x4). The most significant difference is observed in condition 2x2, in which the model learns around 78% of the pairs on average, while the individuals were able to learn about 89%. Yet, the same t-test confirms that the learning rates in conditions 2x2 and 4x4 are statistically different.
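The statistical treatment used throughout these comparisons (error bars as standard deviation or standard error, and two-sample t-tests at a 1% significance level) can be illustrated with a short sketch. The accuracy values below are illustrative only, not the data reported in the experiments:

```python
import numpy as np

# Per-individual accuracies in two conditions (illustrative values only;
# these are NOT the data collected in the experiments).
acc_2x2 = np.array([0.80, 0.75, 0.78, 0.72, 0.83, 0.77, 0.79, 0.74])
acc_4x4 = np.array([0.58, 0.55, 0.60, 0.52, 0.57, 0.61, 0.54, 0.56])

# Error bars: standard deviation, or standard error SE = SD / sqrt(n).
sd = acc_2x2.std(ddof=1)
se = sd / np.sqrt(len(acc_2x2))  # 0.0125 for these numbers

# Two-sample Student's t-test (pooled variance) comparing the conditions.
n1, n2 = len(acc_2x2), len(acc_4x4)
pooled_var = ((n1 - 1) * acc_2x2.var(ddof=1) +
              (n2 - 1) * acc_4x4.var(ddof=1)) / (n1 + n2 - 2)
t = (acc_2x2.mean() - acc_4x4.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-tailed critical value for df = 14 at a 1% significance level
# (from a t-table): 2.977. |t| above it marks a significant difference.
significant = abs(t) > 2.977
```

The same computation is what `scipy.stats.ttest_ind` performs; it is written out here only to make the quantities behind the reported t(47) values explicit.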
7.4. Experiment 2: Word Learning with More Than One Referent

The experiments of Yurovsky et al. (2013) aimed to assess the behavior of individuals for words with two correct associations. A total of 48 students were tested on 18 word-referent pairs under 3 distinct conditions: each set of 6 words was associated with 1, 2, or no referents. In each of the 27 learning trials, the subject had to deal with 3 different word combinations. Then, each test consists of providing the subjects with 4 word-referent pairs to rank the most likely associations. The details of the experiments are in Appendix A.2.

7.4.1. Procedures for Simulation 2

In order to simulate the stimuli given to the participants of this experiment, the same 18 objects of the previous experiment were used. The six single words were bed, chair, bowl, fork, door, and canister, presented together with their respective images. The six double words were clock, computer, desk, refrigerator, sofa, and cooker, with their six respective images used as their first referents. The second referents of double words were images of different objects: respectively, goblet, mat, mixer, crib, blender, and shaker. Finally, the six noise words were spoon, telephone, knife, armoire, cup, and dresser.

The paired input stimuli were built exactly as in the 4x4 condition of Experiment 1, and, in each testing trial, each one of the four testing words was selected (in a random order) and paired with each one of the four referents. The stimulus built for each pair was presented as input to the model, and the activation of the winner node in the association layer was computed. Then, the activation levels were used to rank the pairs for computing the single, double, and either scores. This training and testing procedure was repeated 48 times with random initializations, representing the 48 participants.

7.4.2. Results of Experiment 2 and Simulation 2

The results obtained by Yurovsky et al. (2013) (Figure 6) show that participants displayed a better-than-chance knowledge of the referents of single words (0.454 ± 0.264 > 0.25), of one of the referents of double words (0.698 ± 0.210 > 0.5), and even of both referents of double words (0.301 ± 0.146 > 0.17), a difference statistically verified by a t-test with a significance level of 1%.

Figure 6: Comparison of the results obtained by Yurovsky et al. (2013) with the results of the simulation with the proposed model in Experiment 2. Dashed lines indicate the chance levels of performance. The error bars indicate the Standard Error (SE), not the Standard Deviation (SD), where SE = SD/√(# of samples).

Yurovsky et al. (2013) found that participants were significantly less likely to learn both referents of a double word than one referent of single words (t(47) = 3.68, p < .001). This suggests that two mappings composed of a single word and two different referents do not act like two independent mappings (two words and two different referents). This suggests the occurrence of some kind of competition for the mappings of a word.

The same conclusions can be drawn from our simulations for single words (0.372 ± 0.126 > 0.25), one referent (0.622 ± 0.169 > 0.5), and both referents (0.278 ± 0.135 > 0.17) of double words. The model was also less likely to learn both referents of double words than one referent of single words (t(47) = 3.5267, p < .001).

Yurovsky et al. (2013) also pointed out that, while this experiment allows concluding that there is some kind of competition for the mappings, it is not clear which type of competition, local (within trials) or global (across trials), since both referents were shown in each trial. The next experiment addresses this issue.

7.5. Experiment 3: Local vs Global Competition

Yurovsky et al. (2013) ran experiments with 48 subjects who were trained with a single correct referent of double words. The individuals were asked to do the same test of the previous experiment. The details of the experiments are in Appendix A.3.

7.5.1. Procedures for Simulation 3

Analogously as in Simulation 2, the six single words (bed, chair, bowl, fork, door, canister) and double words (clock, computer, desk, refrigerator, sofa, and cooker) were the same, with their respective images, and the images of the same different objects (goblet, mat, mixer, crib, blender, and shaker) were used as the second meaning of double words. Noise words were not used, and the testing procedure was kept the same as in Simulation 2.

7.5.2. Results of Experiment 3 and Simulation 3

The results of this experiment (Figure 7) showed that, although participants knew all types of mappings above chance (single words: 0.400 ± 0.247 > 0.25; double words, one referent: 0.580 ± 0.277 > 0.5; and both referents: 0.240 ± 0.203 > 0.17), they again showed better knowledge of single word referents than of both word referents (t(47) = 3.81, p < 0.001). This result suggests competition across trials.

Figure 7: Comparison of results obtained by Yurovsky et al. (2013) with the results obtained with the model in Experiment 3. Dashed lines indicate the chance levels of performance. The error bars indicate the Standard Error.

The simulations presented an analogous behavior, with above-chance accuracy (single words: 0.478 ± 0.115 > 0.25; double words, one referent: 0.594 ± 0.142 > 0.5; and both referents: 0.367 ± 0.101 > 0.17).
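The ranking-based scoring used in Simulations 2 and 3 orders the candidate referents of each word by the winner-node activation. A sketch of one plausible implementation follows; the "both referents" criterion shown here is an assumption for illustration, not the paper's exact scoring rule:

```python
import numpy as np

def rank_referents(activations):
    """Order the candidate referents of a word from the highest to the
    lowest winner-node activation."""
    return [int(i) for i in np.argsort(activations)[::-1]]

def knows_both(activations, ref_a, ref_b):
    """One plausible 'both referents' criterion (assumed for illustration):
    a double word counts as fully known when its two correct referents
    occupy the two top ranks."""
    return set(rank_referents(activations)[:2]) == {ref_a, ref_b}

# Toy activations of one word paired with four candidate referents.
acts = [0.91, 0.40, 0.87, 0.12]
ranking = rank_referents(acts)   # [0, 2, 1, 3]
both = knows_both(acts, 0, 2)    # True: referents 0 and 2 are ranked on top
```

The single and either scores can be computed analogously, checking only whether one designated referent reaches the top rank.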
The highest difference was observed for the recognition rate of both referents of double words, which could not be considered statistically equivalent to the results displayed by humans. In spite of that, the simulated participants also showed a better knowledge of single word referents than of both word referents (t(47) = 4.5613, p < 0.0001), which also points to global competition.

7.6. Experiment 4: Online vs Batch Learning

Yurovsky et al. (2013) designed an experiment similar to Experiment 3 to assess the degree of globality of the competition process, i.e., they evaluated the influence of the temporal order of the individual trials upon accuracy. The details of the experiments are in Appendix A.4.

7.6.1. Procedures for Simulation 4

For the simulations, the same stimuli of the previous experiment were used for training and testing. The only change was in the order of presentation of the double word referents along the trials: one of the referents of each double word was randomly chosen to be presented earlier, while the second referent was presented only after all presentations of the first referent.

7.6.2. Results of Experiment 4 and Simulation 4

Figure 8 shows the obtained results. Participants displayed similar results for single words (0.450 ± 0.300 > 0.25) and for one referent of double words (0.730 ± 0.240 > 0.5). However, they learned both referents of double words (0.400 ± 0.300 > 0.17) as well as the referents of single words. Therefore, in contrast with the previous experiments, the results did not show evidence of competition. A possible explanation, given by Yurovsky et al. (2013), is that while global competition protects old mappings from noisy information, local competition leverages prior mapping knowledge to speed up the acquisition of new mappings.

The simulations have shown similar results for single words (0.500 ± 0.146 > 0.25) and for one referent of double words (0.650 ± 0.139 > 0.5). However, differently from what the participants have shown, the model did perform worse for both referents of double words (0.283 ± 0.129 > 0.17). This was actually an expected result, since the model, in its present form, does not take any advantage of known mappings to speed up the acquisition of new mappings.

Figure 8: Comparison of results obtained by Yurovsky et al. (2013) with the results obtained with the model in Experiment 4 for single, either, and both word learning accuracy. Dashed lines indicate the chance levels of performance. The error bars indicate the Standard Error.

Regarding the ordering factor, the results presented in Figure 9 show that when participants picked up both correct referents of double words, they were slightly more likely (t(47) = 1.55, p = 0.08) to rank the early referent first (0.24 ± 0.23) than the late referent (0.16 ± 0.20). The model presented a similar pattern, though more strongly (early first: 0.217 ± 0.12; late first: 0.067 ± 0.063; t(47) = 7.8206, p < 0.0001).

Figure 9: Comparison of results obtained by Yurovsky et al. (2013) with the results obtained with the model in Experiment 4 for the frequency with which Early and Late referents were ranked first. The error bars indicate the Standard Error, and the dashed lines indicate the chance levels of performance in the experiment with humans (0.2) and in the simulation (0.142).

In the next section, we evaluate the capability of the model of reproducing the results of the experiments designed by Trueswell et al. (2013), to verify other learning aspects.

7.7. Experiment 5: Statistical Association vs Propose-but-Verify

Trueswell et al. (2013) proposed the "Propose-but-Verify" hypothesis, in which learning results from a one-trial procedure that links word-referent pairs that can be unlinked after opposite observations. To prove it, they designed experiments to verify whether participants retain one or more association mappings for each word. They used 50 students, who heard sentences and chose the objects referred to by them; the individuals were supposed to learn the associations between the phrases and images. The details of the experiments are in Appendix A.5.

7.7.1. Procedures for Simulation 5

For simulating the stimuli given to the participants in this experiment, the following 12 randomly chosen words were used among the 18 of Experiment 3: bed, chair, bowl, fork, door, canister, clock, computer, desk, refrigerator, sofa, and cooker. Also, the same 12 respective images were used as referents for these words.
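Across the simulations, the combined input stimuli of a trial pair the word representation with the representation of each candidate referent. A minimal sketch, assuming the pairing is a simple concatenation of the two representations (an assumption for illustration, not necessarily the exact combination used by the model):

```python
import numpy as np

def build_pair_inputs(word_vec, image_vecs):
    """Combined input stimuli of one trial: the word representation is
    paired with the representation of each candidate referent (here by
    concatenation, an assumption for illustration). A 1x5 trial thus
    yields five combined inputs."""
    return [np.concatenate([word_vec, img]) for img in image_vecs]

word = np.array([0.2, 0.8])                        # toy word representation
images = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
          np.array([0.5, 0.5])]                    # toy referents (1x3 trial)
pairs = build_pair_inputs(word, images)
```

Each combined input is then presented to the association layer, whose winner-node activation is registered per pair, as described below for the 1x5 condition.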
In each trial, the model was trained similarly as in the previous experiments. The input stimuli were produced exactly as before, combining the representation of the word with the representation of each referent, though now they were presented in the 1x5 condition, i.e., five combined input stimuli per trial.

Differing from the previous simulations, this time, in order to match the procedure of the experiment, the level of activation of the winner node in the association layer was registered after each input stimulus. The referent that resulted in the highest activation is considered the choice of the model for the best association in the trial. The simulation was repeated 50 times with different random seeds to simulate the 50 participants.

7.7.2. Results of Experiment 5 and Simulation 5

Figure 10 shows the percentage of correct answers along the five learning cycles. As expected for a 1x5 condition, the average results suggest that the learning was more difficult than in the previous experiments, though still viable. With the analysis of the growth of the learning curve, Trueswell et al. (2013) have shown that there was a significant increase in accuracy throughout the learning cycles. The simulations presented an analogous behavior. A t-test with a 1% significance level confirms that both the participants and the model present an accuracy above chance at the last learning cycle.

Figure 10: Comparison of the accuracy growth through the learning cycles obtained by Trueswell et al. (2013) with the results obtained with the model in Experiment 5. Dashed lines indicate the chance level of performance. The error bars show a confidence interval of 95%.

Since the previous result shows that learning still occurs in the 1x5 condition, the next step was to evaluate the hypothesis raised about the type of learning (Statistical Association vs Propose-but-Verify). As can be seen in Figure 11, the participants identified the correct referent with an above-chance accuracy (0.47 ± 0.14) only after assigning the correct referent in the previous cycle. When the participants missed the correct referent in the previous cycle, they seem to choose a random referent (0.208 ± 0.038 ≈ 0.20), presenting an accuracy near 1 in 5 (randomly guessing). Therefore, even when a word has co-occurred before with the correct referent, participants show no sign of remembering it when they missed it in a previous cycle.

Figure 11: Comparison of the results shown by Trueswell et al. (2013) with the results obtained with the model in Experiment 5. The "Wrong" label indicates the accuracy displayed when the wrong referent was chosen in the previous learning cycle, and the "Right" label indicates the accuracy displayed when the correct referent was previously chosen. Dashed lines indicate the chance level of performance. The error bars show a confidence interval of 95%.

With this, Trueswell et al. (2013) conclude that participants did not retain multiple associations through the learning cycles. However, the proposed model presented an analogous behavior, displaying an above-chance accuracy only in the "Right" condition (0.407 ± 0.134), while in the "Wrong" condition the results approach a random guess (0.232 ± 0.069).

We know, though, that the model can generate multiple hypotheses in each trial (up to five, in this case). This way, we are left with two possibilities: (a) the model did not generate multiple associations in each trial, or (b) the model did generate multiple associations, but they were not strong enough to affect the accuracy significantly in the following cycle. In the model, the number of new associations generated in each trial is represented by the number of nodes created in the Association Module. Therefore, by observing this value, we can elucidate what has actually happened.

Figure 12: Number of nodes created in the Association Module after each trial (1-5), through the learning cycles in Experiment 5. The error bars show the standard deviation.

Figure 12 shows the evolution of the number of nodes created in the Association Module in each trial, through the learning cycles. In the first cycle, the model creates about 3 nodes per trial on average, and this number decays through the learning cycles, reaching a value below 1 in the last cycle. This is an expected behavior since, after some learning, the model has already created the associations required to represent most mappings. This indicates that hypothesis (b) is the correct one: the model generates multiple associations, though not as many as possible, but they are not strong enough to affect the accuracy in the next learning cycle.

We argue that two factors may explain the results observed by Trueswell et al. (2013) without disregarding the hypothesis of multiple associations. One is that global competition may insert noise in the associations formed, degrading weak associations (seen only once).
The other factor is that, in the experimental design of Trueswell et al. (2013), the number of incorrect associations is computed from the second to the fifth cycle, when the number of associations created per trial may have decreased, as our simulations suggest. Therefore, in our model, a chance-level accuracy for words that were incorrectly associated in the previous cycle is not a result of retaining a single association hypothesis.

7.8. Experiment 6 Design: The Role of the Context in Word Disambiguation

In all previous simulations, the context module was active and functional. The results obtained in those simulations show that it does not interfere with the learning in the evaluated conditions. However, the role of the context module itself was not directly evaluated. When a word has different meanings, they are usually employed in different situations (contexts). Therefore, our hypothesis is that learning the context in which words are used can help the model to learn their different meanings.

In order to evaluate this, we designed the following experiment, based on the 1x5 condition proposed by Trueswell et al. (2013). The stimuli are composed of two lists of six words (A and B), sharing exactly one word, the ambiguous label (AL), that is associated with a different referent in each list, i.e., in list A the label AL is associated with the referent RA, while in list B it is associated with a different referent, RB.

The training should be carried out in six cycles of 14 trials each: the odd cycles are done with words from list A (including AL), while the even cycles are done with words from list B (including AL). Therefore, the cycles are intercalated in the form (A, B, A, B, A, B). This training aims to induce the creation of two different contexts associated with the words in each list. Since the context changes slowly, the associations created with words in the same list tend to be similar, since they are consecutively presented.

The testing procedure aims to verify whether the recovered referent for the AL matches the context induced by the stimuli previously given as input, and how many stimuli are necessary to induce the context. Thus, in order to induce the context, six trials in the condition 1x4 are done using words from one of the lists (excluding AL), before testing the association of AL, also in condition 1x4, with both referents (RA and RB) and two other randomly chosen referents (lures), one from list A and the other from list B. The context-inducing conditions are: 3a+3b, 3b+3a, 4a+2b, 4b+2a, 5a+1b, and 5b+1a. The condition 3a+3b, for instance, indicates that three trials with labels and referents from list A were presented, followed by three trials with labels and referents from list B. In this condition, it is expected that the context of list B (the late list) is induced; thus, the correct association to be retrieved is with referent RB.

In each testing trial, three results are possible: (i) the referent of the list presented later is chosen (the expected association); (ii) the referent from the list presented earlier is chosen; or (iii) one of the lures is chosen. The prior for situation (i) is 0.25 (one in four referents), while the prior for selecting one of both associations, (i) or (ii), is 0.5 (two in four).

7.8.1. Procedures for Simulation 6

The words chosen to simulate Experiment 6 were: armoire, snake, dog, cat, cheese, trap, and mouse for the first list, and speaker, printer, computer, notebook, monitor, keyboard, and mouse for the second list. Note that the ambiguous label, AL, is the word mouse. The referents for both lists consisted again of images downloaded from Google Image Search, using the respective word as the search term. The two referents for the AL consisted of an image of the animal (RA) and an image of the computer device (RB).

Training and testing were done according to the experimental design described above, and the input stimuli combining the word representation with the representation of each referent were produced exactly as in the previous experiments. The simulation was repeated 48 times with different random seeds to simulate 48 participants.

7.8.2. Results of Simulation 6

The obtained results are shown in Figure 13. In conditions 3+3 (3a+3b and 3b+3a) and 4+2 (4a+2b and 4b+2a), the context was effective to induce the recovery of the correct referent with a high accuracy (respectively, 0.937 ± 0.167 and 0.739 ± 0.252), with an expected decay in accuracy from conditions 3+3 to 4+2. In conditions 5+1 (5a+1b and 5b+1a), however, the accuracy falls to 0.5 ± 0.145, which means that the contextual information is not enough to induce the correct association. The model seems to have difficulty choosing between the two possible referents RA and RB, though it can easily discard the lures. A t-test with a 1% significance level confirms that these results are different between them and that they are above chance.

Figure 13: Accuracy of the model in choosing the referent induced by the context in each condition: in 3+3, the last three words induce the desired context; in 4+2, the last two words induce the desired context; and in 5+1, only the last word induces the desired context.

These results emphasize the role of the context, showing that it can help to recover the correct meaning of ambiguous words.

8. Discussion

The experimental paradigm of cross-situational word learning has shown to be a very useful tool for evaluating the hypotheses about the mechanisms that allow us to learn word-referent associations. The model described in this article has been proposed considering pieces of evidence accumulated in the studies of psycholinguistics and neurolinguistics,
organized in a modular architecture that allows us to better understand and communicate about the functions required for word-referent associations.

The results obtained in Experiment 1 are similar in terms of accuracy to the results of the models evaluated by Yu & Smith (2012). However, this article also considered other conditions not evaluated in previous works and introduces advances in terms of model architecture in comparison with previous models. One improvement is the use of a Time-Varying Self-Organizing Map (LARFDSSOM) as the point of connection between the visual and auditory layers. In previous models, this was done via associative connections trained by Hebbian learning. In the proposed model, LARFDSSOM learns the correlations between the different input dimensions from the co-variations observed in the input data by means of its relevance learning mechanism. This is similar to Hebbian learning; however, it has other useful features, such as the topological representation of the input data and the activation levels produced by the nodes, which allowed us to model the CSWL experiments. Moreover, the same kind of map is used in different levels of the architecture, which seems to be more plausible.

The proposed architecture is far from being a complete model of cross-situational word learning. It is, however, a step towards a model that allows us to simulate and evaluate hypotheses about the mechanisms behind this characteristic of human nature. The fact that it can deal with real-world inputs, images for referents and text or sound for words, gives it an enormous flexibility for simulating more accurately several types of experiments carried out with human beings. We notice that in several conditions its association accuracy is a little below the accuracy of humans. For instance, in the 2x2 condition of Exp. 1, humans can reach 89% of accuracy while the model achieved only 78%. This could be due to the fact that handcrafted feature extractors were used to represent images and words in the first layer. This can be further improved by employing modern representation techniques such as word embeddings (Mikolov et al., 2013) and convolutional neural networks, already shown to work well in combination with LARFDSSOM (Medeiros et al., 2019) and to achieve human-level performance in certain image classification scenarios (He et al., 2015).

In spite of that, the main conclusions obtained by Yu & Smith (2007), Yurovsky et al. (2013), and Trueswell et al. (2013) with their experiments, including those in Experiment 4, could also be drawn from the simulation results, summarized below:

Exp. 1: The model was able to simulate the remarkable ability of participants in learning associations between labels and referents under different levels of ambiguity, including the fact that learning decays with the increase of ambiguity;

Exp. 2: The model was able to replicate the greater difficulty that participants present to learn two referents of the same label than to learn only one referent;

Exp. 3: Global competition is also the most relevant type of interference that degrades the learning of the model, as seems to be the case with the individuals;

Exp. 4: The model employs online learning instead of batch learning, which matches the type of learning identified by Yurovsky et al. (2013) in their experiments. However, the model does not benefit from the knowledge of previously known mappings when forming new associations, which prevented it from reproducing part of the results observed by the authors;

Exp. 5: Though more difficult, learning still occurs in the high-ambiguity condition 1x5, and the model could replicate this fact accurately in each one of the five learning cycles.

It is also worth noting that in Experiment 5 the simulations allowed us to verify how many associations were created in each trial, which has shown that it is possible for a method that makes multiple associations to achieve the results observed by Trueswell et al. (2013), in opposition to the authors' assumption. This is an example of how this kind of modeling can be useful in the evaluation of new hypotheses.

Another improvement of the current architecture, in relation to previous models, was the introduction of the context module. With it, we could evaluate how context can be used to retrieve the correct meaning of ambiguous words. Therefore, the results obtained in the simulation of Experiment 6 can and should be evaluated in experiments with human beings, to verify how well the model predicts the effects of the context in the disambiguation of word meanings.

Finally, the proposed model may be applied in the proposition and testing of new hypotheses and experimental paradigms, contributing to the understanding of the mechanisms involved in word learning, and can be used as a component for developing agents that learn natural language.

In spite of that, the simulations show that the model is suitable for replicating most of the experiments considered in this work, allowing us to draw similar conclusions. However, in Experiment 4, it seems that it is much easier for humans to learn a second referent of a double word after having learned the first referent than to learn both simultaneously, while this was not the case for the proposed model. We evaluate that this happens because the model cannot take advantage of other known associations within a trial to reduce ambiguity.
For instance, in a 4x4 condition, if a human participant knows three of the four associations, he might easily guess the fourth association by disregarding the words and referents in the other three. The model, otherwise, only strengthens the current strongest association for each pair. Therefore, in future work, the model should be modified to take this into account.

Still, it is important to emphasize that although this model was developed taking into consideration the current knowledge provided by neuroscience and cognitive psychology, it is a high-level computational model and may not reflect the real learning and representation mechanisms that occur in the brain.

Acknowledgment

The authors would like to thank CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico) and FACEPE (Fundação de Amparo à Ciência e Tecnologia do Estado de Pernambuco) for supporting project #APQ-0880-1.03/14.

References

Abdelsamea, M. M., Mohamed, M. H., & Bamatraf, M. (2015). An effective image feature classification using an improved SOM. CoRR, abs/1501.01723.
Aggleton, J. P., & Brown, M. W. (1999). Episodic memory, amnesia, and the hippocampal-anterior thalamic axis. Behavioral and Brain Sciences, 22, 425–44.
Allen, J. (1994). Natural Language Understanding (2nd Edition). Addison-Wesley.
Araujo, A. F. R., Bassani, H. F., & Pacheco, R. F. (2010). Occurrence of false memories: A neural module considering context for memorization of word lists. In IEEE International Joint Conference on Neural Networks (pp. 1–8).
Arbib, M. A. (2008). From grasp to language: Embodied concepts and the challenge of abstraction. Journal of Physiology-Paris, 102, 4–20. Links and Interactions Between Language and Motor Systems in the Brain.
Bassani, H. F., & Araujo, A. F. R. (2012). Dimension selective self-organizing maps for clustering high dimensional data. In IEEE International Joint Conference on Neural Networks, Brisbane.
Bassani, H. F., & Araujo, A. F. R. (2015). Dimension selective self-organizing maps with time-varying structure for subspace and projected clustering. IEEE Transactions on Neural Networks and Learning Systems, 26, 458–471.
Becker, A. H., & Ward, T. B. (1991). Children's use of shape in extending novel labels to animate objects: Identity versus postural change. Cognitive Development, 6, 3–16.
Bloom, P. (2002). How Children Learn the Meanings of Words. The MIT Press.
Born, R. T., & Bradley, D. C. (2005). Structure and function of visual area MT. Annu. Rev. Neurosci., 28, 157–189.
Bunce, J. P., & Scott, R. M. (2017). Finding meaning in a noisy world: exploring the effects of referential ambiguity and competition on 2.5-year-olds' cross-situational word learning. Journal of Child Language, 44, 650–676.
Carey, S. (1995). Conceptual Change in Childhood. MIT Press.
Carey, S., & Bartlett, E. (1978). Acquiring a single new word. Papers and Reports on Child Language Development, 15, 17–29.
Carpenter, G. A., & Grossberg, S. (1987). ART 2: self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26, 4919–4930.
Chen, J.-H., Su, M.-C., Cao, R., Hsu, S.-C., & Lu, J.-C. (2017). A self-organizing map optimization based image recognition and processing model for bridge crack inspection. Automation in Construction, 73, 58–66.
Collins, G. (1977). Visual co-orientation and maternal speech. In H. R. Schaffer (Ed.), Studies in Mother-Infant Interaction. London: Academic Press.
Dollaghan, C. (1985). Child meets word: "fast mapping" in preschool children. J Speech Hear Res, 28, 449–454.
Feijoo, S., Muñoz, C., Amadó, A., & Serrat, E. (2017). When meaning is not enough: Distributional and semantic cues to word categorization in child-directed speech. Frontiers in Psychology, 8, 1242.
Fletcher, P. C., Frith, C. D., & Rugg, M. D. (1997). The functional neuroanatomy of episodic memory. Trends in Neurosciences, 20, 213–218.
Frank, M. C., & Goodman, N. D. (2014). Inferring word meanings by assuming that speakers are informative. Cognitive Psychology, 75, 80–96.
Grossberg, S. (1976a). Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134.
Grossberg, S. (1976b). Adaptive pattern classification and universal recoding:
Harnad, S. (2005). To cognize is to categorize: Cognition is categorization. In H. Cohen & C. Lefebvre (Eds.), Handbook of Categorization in Cognitive Science (pp. 20–46). Elsevier Science.
Harris, M., Jones, D., & Grant, J. (1983). The nonverbal content of mothers' speech to infants. First Language, 4, 21–31.
Haykin, S. (1998). Neural Networks: A Comprehensive Foundation. Prentice Hall.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) (pp. 1026–1034). Washington, DC, USA: IEEE Computer Society.
Heibeck, T., & Markman, E. (1987). Word learning in children: An examination of fast mapping. Child Development, 58, 1021–1034.
Hu, J., & Pei, J. (2018). Subspace multi-clustering: a review. Knowledge and Information Systems, 56, 257–284.
Hubel, D., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154.
Hubel, D., & Wiesel, T. N. (1977). Functional architecture of macaque visual cortex. Proceedings of the Royal Society B, 198, 1–59.
Kaas, J. H., Merzenich, M. M., & Killackey, H. P. (1983). The reorganization of somatosensory cortex following peripheral nerve damage in adult and developing mammals. Annual Review of Neuroscience, 6, 325–356.
Kachergis, G., Yu, C., & Shiffrin, R. M. (2017). A bootstrapping model of frequency and context effects in word learning. Cognitive Science, 41, 590–622.
Kagan, J. (1981). The Second Year. Cambridge, MA: Harvard University Press.
Kangas, J. (1991). Time-dependent self-organizing maps for speech recognition. In T. Kohonen, K. Mäkisara, O. Simula, & J. Kangas (Eds.), Artificial Neural Networks (pp. 1591–1594). Amsterdam: North-Holland.
Keil, F. C. (1994). Explanation, association, and the acquisition of word meaning. Lingua, 92, 169–196.
Khoe, Y. H., Perfors, A., & Hendrickson, A. T. (2019). Modeling individual performance in cross-situational word learning.
Kinnunen, T., Kamarainen, J.-K., Lensu, L., & Kälviäinen, H. (2012). Unsupervised object discovery via self-organisation. Pattern Recognition Letters, 33, 2102–2112.
Kiran, S., Grasemann, U., Sandberg, C., & Miikkulainen, R. (2013). A computational account of bilingual aphasia rehabilitation. Biling (Camb Engl), 16, 325–342.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Kriegel, H. P., Kroger, P., Renz, M., & Wurst, S. (2005). A generic framework for efficient subspace clustering of high-dimensional data. In ICDM (pp. 250–257).
Kuhl, P. K. (1991). Human adults and human infants show a "perceptual magnet effect" for the prototypes of speech categories, monkeys do not. Percept Psychophys, 50, 93–107.
Lenzo, K. (2007). The CMU Pronouncing Dictionary.
Li, P., Farkas, I., & MacWhinney, B. (2004). Early lexical development in a self-organizing neural network. Neural Networks, 17, 1345–1362. New Developments in Self-Organizing Systems.
Li, P., & Zhao, X. (2013). Self-organizing map models of language acquisition. Frontiers in Psychology, 4.
Li, P., Zhao, X., & MacWhinney, B. (2007). Dynamic self-organization and early lexical development in children. Cognitive Science, 31, 581–612.
Lieven, E. (1994). Crosslinguistic and crosscultural aspects of language addressed to children. In C. Gallaway & B. J. Richards (Eds.), Input and Interaction in Language Acquisition. Cambridge: Cambridge University Press.
Lowe, D. (1999). Object recognition from local scale-invariant features. In IEEE International Conference on Computer Vision (ICCV) (pp. 1150–1157, vol. 2).
MacWhinney, B. (2010). Computational models of child language learning: an introduction. J Child Lang, 37, 477–485.
Markman, E. M., & Wachtel, G. F. (1988). Children's use of mutual exclusivity to constrain the meaning of words.
Cognitive Psychology, 20, 121–157. II. Feedback, expectation, olfaction, illusions. Biological Cybernetics, 23, Markson, L., & Bloom, P. (1997). Evidence against a dedicated system for 187–202. word learning in children. Nature, 385, 813–815. Guenther, F. H., & Gjaja, M. N. (1996). The perceptual magnet eect as Matzen, L. E., & Benjamin, A. S. (2009). Remembering words not presented in an emergent property of neural map formation. J. Acoust. Soc. Am., 100, sentences: How study context changes patterns of false memories. Memory 1111–1121. 18 & Cognition, 37, 52–64. Venkateswarlu, R. L. K., & Kumari, R. V. (2011). Novel approach for speech Mayor, J., & Plunkett, K. (2010). A neurocomputational account of taxonomic recognition by using self — organized maps. In 2011 International Con- responding and fast mapping in early word learning. Psychol Rev, 117, ference on Emerging Trends in Networks and Computer Communications 1–31. (ETNCC) (pp. 215–222). Medeiros, H. R., de Oliveira, F. D., Bassani, H. F., & Araujo, A. F. (2019). Weber, M., Welling, M., & Perona, P. (2000). Unsupervised learning of models Dynamic topology and relevance learning som-based algorithm for image for recognition. In European Conference on Computer Vision - ECCV, Part clustering tasks. Computer Vision and Image Understanding, 179, 19 – 30. I (pp. 18–32). London, UK: Springer-Verlag. Medina, T. N., Snedeker, J., Trueswell, J. C., & Gleitman, L. R. (2011). Yu, C., & Smith, L. B. (2007). Rapid word learning under uncertainty via How words can and cannot be learned by observation. Proceedings of the cross-situational statistics. Psychol Sci, 18, 414–420. National Academy of Sciences, 108, 9014–9019. Yu, C., & Smith, L. B. (2012). Modeling cross-situational word-referent Miikkulainen, R. (1997). Dyslexic and category-specific aphasic impairments learning: prior questions. Psychol Rev, 119, 21–39. in a self-organizing feature map model of the lexicon. Brain and Language, Yurovsky, D., Yu, C., & Smith, L. 
B. (2013). Competitive processes in (pp. 334–366). cross-situational word learning. Cognitive Science, 37, 891–921. Miikkulainen, R., Bednar, J. A., Choe, Y., & Sirosh, J. (2005). Computational Maps in the Visual Cortex volume 1. Springer. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Appendix A. More Information on CSWL Experiments Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Appendix A.1. Exp.1: Word Learning Under Uncertainty Processing Systems - Volume 2 NIPS’13 (pp. 3111–3119). USA: Curran Associates Inc. Yu & Smith (2007) evaluated the CSWL abilities of 38 Nelson, K., Hampson, J., & Shaw, L. K. (1993). Nouns in early lexicons: undergraduate students. The stimuli provided consisted of evidence, explanations and implications. J Child Lang, 20, 61–84. Pacheco, R. F. (2004). Módulos neurais para modelagem de falsas memórias. slides containing 2, 3 or 4 pictures of unusual objects paired Ph.D. thesis Universidade Federal de São Carlos. respectively with 2, 3 or 4 pseudowords presented in auditory Pasley, B. N., David, S. V., Mesgarani, N., Flinker, A., Shamma, S. A., Crone, form. These artificial words were generated by a computer N. E., Knight, R. T., & Chang, E. F. (2012). Reconstructing speech from program using standard phonemes in the English language, the human auditory cortex. PLoS Biology, 10. Perlovsky, L. I. (2006). Modeling field theory of higher cognitive functions. In native language of the participants. In this case, there were A. Loula (Ed.), Artificial Cognition Systems (pp. 65–106). IGI Global. 54 label-referent pairs formed by single and unique objects Plunkett, K. (1997). Theories of early language acquisition. Trends in Cognitive randomly chosen and divided into three groups of 18 pairs, Sciences, 1, 146–153. which were used in three dierent training conditions. Plunkett, K., Hu, J., & Cohen, L. B. (2008). 
Labels can override perceptual categories in early infancy. Cognition, 106, 665 – 681. The distinct training conditions dier only in the number Plunkett, K., Sinha, C., Moller, M., & Strandsby, O. (1992). Symbol grounding of labels and referents simultaneously presented to the test or the emergence of symbols? Vocabulary growth in children and a subjects. In the 2x2 condition, two labels and two pictures connectionist net. Connection Science, 4, 293–312. were presented in each trial; in the 3x3 condition, three labels Pulvermuller, F. (2005). Brain mechanisms linking language and action. Nature Reviews Neuroscience, 6, 576–582. and three pictures were presented in each trial; and, in the Quine, W. (1960). Word and object. Cambridge, MA: MIT Press. 4x4 conditions, four labels and four pictures were presented Rice, M. (1990). Preschooler’s QUIL: Quick incidental learning of words. In in each trial. During the trials, there was no indication of N. E. Hillsdale (Ed.), In G. Conti-Ramsden & C. Snow (Eds.), Children’s language (Vol. 7). which label goes with each picture. However, in the underlying Richards, D. D., & Goldfarb, J. (1986). The episodic memory model of label-referent mappings it is guaranteed that an individual label conceptual development: An integrative viewpoint. Cognitive Development, was present in a training trial, if and only if the referent was 1, 183–219. also present. Figure 1 illustrates a 4x4 condition. Ritter, H., & Kohonen, T. (1989). Self-organizing semantic maps. Biological Cybernetics, 61, 241–254. In the test procedures, the participants were told that multiple Rizzolatti, G., & Craighero, L. (2004). The mirror-neuron system. Annual words and pictures would co-occur in each trial and that their Review of Neuroscience, 27, 169–192. task was to figure out across trials which word went with which Saltelli, A., Chan, K., & Scott, E. M. (2009). Sensitivity Analysis. Wiley. picture. 
They were not told that there was one referent per Salton, G., & McGill, M. J. (1986). Introduction to Modern Information Retrieval. New York, NY, USA: McGraw-Hill, Inc. word. After training in each condition, subjects received a Silberman, Y., Bentin, S., & Miikkulainen, R. (2007). Semantic boost on four-alternative forced-choice test of learning, in which, they episodic associations: An empirically-based computational model. Cog- were presented with 1 word and 4 pictures and asked to indicate nitive Science, 31, 645–671. the picture named by that word. The target picture and the 3 Soja, N. N., Carey, S., & Spelke, E. S. (1991). Ontological categories guide young children’s inductions of word meaning: object terms and substance foils were all drawn from a set of 18 training pictures. terms. Cognition, 38, 179–211. Spitzer, M. (1999). The Mind Within the Net: Models of Learning, Thinking, Appendix A.2. Exp.2: Word Learning with More Than One and Acting. A Bradford book. MIT Press. Suga, N. (1985). The extent to which bisonar information is represented in the Referent bat auditory cortex. In W. G. G.M. Edelman, & W. Cowan (Eds.), Dynamic Yurovsky et al. (2013) performed a series of experiments to Aspects of Neocortical Function (pp. 653–695). Willey (Interscience). Trueswell, J. C., Medina, T. N., Hafri, A., & Gleitman, L. R. (2013). Propose evaluate the behavior of individuals when there are two correct but verify: fast mapping meets cross-situational word learning. Cognitive associations. In the first experiment, 48 grad students were Psychology, 66, 126–156. evaluated, also with 18 word-referent pairs. However, the pairs Tuytelaars, T., Lampert, C. H., Blaschko, M. B., & Buntine, W. (2010). were split in dierent conditions: six words were associated Unsupervised object discovery: A comparison. Int. J. Comput. Vision, 88, 284–302. with a single referent (single words), six words were associated Udesen, H., & Madsen, A. L. (1992). 
Balint’s syndrome–visual disorientation. with two referents (double words), and the last six words had Ugeskr. Laeg., 154, 1492–1494. no associated referents (noise words). 19 The single words play the same role as those in the previous two single words and two double words were presented in each experiment, always co-occurring with their referents in each trial with their respective referents (4x4 condition). trial. The double words, however, co-occur with both referents ... ... Labels: ... ... in each trial. Since both single and double words co-occur six a b b c c d h h a g g f m m a ii f g g h h i i a a c c e g g times with their referents, the total number of occurrences is the B B D1 G G F2 A1 F1 II A2 C C E1 same for both types of words. The noise words occur with the Referents: ... ... ... ... A1 C C A2 H H M M II H H G G A1 G G same frequency for all referents, thus, they are not consistently mapped to any referent. They serve only for producing an equal Trials Trials number of words in all the trials. Each trial consists of presenting the stimuli in the 4x4 Figure A.15: Structure of Experiment 3. Dierently from Experiment 2, in condition. From a total of 27 trials (Figure A.14), in two this experiment, only one correct referent of double words is presented in each trial. The co-occurrence frequency of correct associations was the same of of them the stimuli were composed of four single words; in Experiment 2. 14 trials the stimuli were composed of two single words, one double word, and one noise word; and in 11 trials the stimuli Appendix A.4. Exp.4: Online vs Bach Learning were composed of two double words, and two noise words. Yurovsky et al. 
(2013) conjectured that if the competition is This way, although in all trials there were always four words primarily global, and occurs only after all training information and four referents, the mapping structure varied considerably has been accumulated (batch learning), there should be no eect across the trials, and in only two of them it consisted exclusively of the temporal order of the individual trials. However, if global of one-to-one mappings. competition emerges trial-by-trial (online learning), and does ... ... Labels: ... ... not interact with other local mappings within a trial, then it is a b b c c d e g f h a j c i k k b i b i l l g m f n expected that a decrement of the accuracy will be observed for B B A2 F1 E2 A1 II L L K K F1 N2 the second referents of double words presented later in relation Referents: ... ... ... ... A1 C C E1 F2 A2 C C B B II F2 N1 to the knowledge of referents presented earlier. Yurovsky et al. (2013) designed an experiment to evaluate Trials Trials this, with the organization shown in Figure A.16 for a new group of participants. Notice that this experiment is similar to Figure A.14: Structure of Experiment 2. The lowercase letters represent words Experiment 3. However, one of the referents of each double and the uppercase letters represent referents. Single words are in bold (ex.: b-B and c-C), double words are in white (ex.: a-A1 and a-A2, f-F1 and f-F2), and word is randomly chosen to be presented earlier, while the noise words are in gray (ex.: d and g). second referent is presented only after all co-occurrences with the first referent have been carried out. Notice also that both After the learning trials, the learning rates of each individual referents have the exact same frequency of co-occurrence with were evaluated similarly as in Yu & Smith (2007). Every single their respective double word. 
word was presented with its referent and three other randomly chosen referents and each double word was presented with both Labels: ... ... ... ... a c c e g g m a i f g h i a a b b c c d m i g h i h h a g g f their referents and two other randomly chosen referents. The individuals were asked to rank the four objects from the C E1 B B D1 C A1 F1 II A2 G G F2 Referents: ... ... ... ... most to the least likely meaning of the word. To compute the A1 G A1 C C G M M II H H G G A2 H H scores of single words, one correct guess is computed when the Anterior referents Posterior referents Anterior referents Posterior referents correct referent was ranked first. For double words, two types of scores were computed: a Double score is computed when Figure A.16: Structure of Experiment 4. Dierently from Experiment 3, in the participant ranks both correct referents (in either order) in this experiment, one of the referents of double words is presented first (A1), the first and second positions, and a Either score is computed while the other is presented in later trials (A2). The co-occurrence frequency of when the participant ranks one of the correct referents in the correct associations was the same of Experiments 2 and 3. first position and an incorrect referent in the second position. Appendix A.5. Exp.5: Statistical Association vs Appendix A.3. Exp.3: Local vs Global Competition Propose-but-Verify To evaluate if global or local competition has occurred, Although the results of previous experiments suggest that Yurovsky et al. (2013), in this experiment, another 48 partici- learning under such conditions derives from some kind of pants were exposed to only one correct referent of double words statistical-associative learning mechanism, as the one the pro- in each trial, while the testing procedure was the same of the posed model employs, Trueswell et al. (2013) suggest the previous experiment. 
If only local competition during training hypothesis that learning is instead the product of a one-trial was occurring, then the participants of this trial should be able procedure in which a single hypothesized word-referent pairing to learn both referents of double words as well as they learn the is made in one shot and retained across learning instances, referent of single words. Otherwise, global competition was being abandoned only if a subsequent observation fails to occurring. The stimuli were presented as illustrated in Figure confirm the pairing. The authors called this hypothesis A.15. Noise words were not necessary for this experiment since “Propose-but-Verify”. 20 In order to test this, Trueswell et al. (2013) designed ex- periments to explicitly verify if participants retain a set of association mappings for each word or if they keep a single conjecture about the association. In each of the trials designed by Trueswell et al. (2013), five images were used as referents, while the auditory stimuli consisted of phrases such as “Oh! look, a ...!” with one label (condition 1x5). In total, 12 artificial words were used as labels and 12 images of objects were used as referents. In such a scenario, there is a high degree of uncertainty about the correct referent. The trials were divided into five learning cycles. In each cycle, each word was presented once in a random order. The other four cycles are repetitions of the first cycle in the same order. Fifty undergrad students participated in the tests. They were instructed that, after hearing the phrase, they should click on the object referred by the phrase. Since the participants were tested in every trial, this allowed the authors to register the evolution of the learning rates of the individuals after each learning cycle. 
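The contrast between the two hypotheses can be sketched in a few lines of code (a minimal illustration under our own assumptions; the class names, trial generator, and random seed are ours, not the authors' materials):

```python
import random

random.seed(0)
WORDS = [f"w{i}" for i in range(12)]              # 12 artificial labels
REFS = {w: f"r{i}" for i, w in enumerate(WORDS)}  # true one-to-one mapping

def make_trial(word):
    """1x5 condition: the correct referent shown among four random foils."""
    foils = random.sample([r for r in REFS.values() if r != REFS[word]], 4)
    options = foils + [REFS[word]]
    random.shuffle(options)
    return options

class ProposeButVerify:
    """Keeps a single conjecture per word; replaces it only when disconfirmed."""
    def __init__(self):
        self.guess = {}

    def choose(self, word, options):
        g = self.guess.get(word)
        if g is None or g not in options:  # no guess yet, or guess disconfirmed
            g = random.choice(options)     # propose a new single pairing
            self.guess[word] = g
        return g                           # otherwise retain the old pairing

class AssociativeLearner:
    """Keeps graded co-occurrence counts over all candidate referents."""
    def __init__(self):
        self.counts = {}

    def choose(self, word, options):
        c = self.counts.setdefault(word, {})
        best = max(options, key=lambda r: c.get(r, 0))  # answer from evidence so far
        for r in options:                  # then accumulate evidence for every option
            c[r] = c.get(r, 0) + 1
        return best
```

Under Propose-but-Verify, a disconfirmed word carries no memory of earlier alternatives, so the next choice is at chance; the associative learner keeps graded evidence for every referent seen with the word, which is what would produce the above-chance bias probed by Trueswell et al. (2013).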
The rationale is that if participants store only one association and the referent is not the correct one, then, when finding the same word in a subsequent trial, they should choose randomly among the available referents and should not show any bias for the correct referent, since there should be no trace of such an association in memory. A bias for the correct referent should be observed if the participants can keep track of multiple possible associations.

Appendix B. Numeric Representation of Phonemes

The numeric representation of the auditory data was constructed following a procedure similar to the one described in Araujo et al. (2010). First, each word is converted to its respective phonetic representation according to the CMU Pronouncing Dictionary (Lenzo, 2007), in which each phoneme is represented by its ARPAbet symbol (Table B.2). Then, each phoneme is translated into a vector of 12 real values ranging from -1 to +1, according to its place of pronunciation in the International Phonetic Alphabet (IPA) charts for vowels (4 features) and consonants (8 features). For example, the word "ball" is converted as follows:

ball → B AO L →
  B:   0     0   0   0   1     -1   1  -1   1  -1  -1  -1
  AO:  0.33  1   1   1   0      0   0   0   0   0   0   0
  L:   0     0   0   0   0.45  -1  -1  -1   1  -1  -1   1

Table B.2: Correspondence between IPA and ARPAbet symbols and the respective numeric representation of each phoneme.

Example      IPA    ARPAbet  Numeric representation
father       ɑ      AA        1     0.5   1    -1    0     0   0   0   0   0   0   0
at           æ      AE        1    -0.5  -1    -1    0     0   0   0   0   0   0   0
but, sofa    ʌ, ə   AH        0.67  0    -1    -1    0     0   0   0   0   0   0   0
o            ɔ      AO        0.33  1     1     1    0     0   0   0   0   0   0   0
how          aʊ     AW        0     0.5   0     0    0     0   0   0   0   0   0   0
my           aɪ     AY        0     0    -0.5   0    0     0   0   0   0   0   0   0
red          ɛ      EH        0.33 -0.5  -1    -1    0     0   0   0   0   0   0   0
her, coward  ɝ, ɚ   ER        0.33  0     1     0    0     0   0   0   0   0   0   0
big          ɪ      IH       -0.67 -0.5  -1    -1    0     0   0   0   0   0   0   0
bee          i      IY       -1    -1     1    -1    0     0   0   0   0   0   0   0
boy          ɔɪ     OY        0     0     0     0    0     0   0   0   0   0   0   0
show         oʊ     OW       -0.33  1     1     1    0     0   0   0   0   0   0   0
say          eɪ     EY       -0.33 -1     1     0    0     0   0   0   0   0   0   0
should       ʊ      UH       -0.67  0.5  -1     0    0     0   0   0   0   0   0   0
you          u      UW       -1     1     1     1    0     0   0   0   0   0   0   0
buy          b      B         0     0     0     0    1    -1   1  -1   1  -1  -1  -1
chair        tʃ     CH        0     0     0     0    0.27 -1  -1   0  -1  -1  -1  -1
day          d      D         0     0     0     0    0.45 -1   1  -1   1  -1  -1  -1
that         ð      DH        0     0     0     0    0.64 -1  -1   1   1  -1  -1  -1
for          f      F         0     0     0     0    0.82 -1  -1   1  -1  -1  -1  -1
go           ɡ      G         0     0     0     0   -0.27 -1   1  -1   1  -1  -1  -1
house        h      HH        0     0     0     0   -1    -1  -1   1   0  -1  -1  -1
just         dʒ     JH        0     0     0     0    0.45 -1  -1   0   1  -1  -1  -1
key          k      K         0     0     0     0   -0.27 -1   1  -1  -1  -1  -1  -1
late         l      L         0     0     0     0    0.45 -1  -1  -1   1  -1  -1   1
man          m      M         0     0     0     0    1     1  -1  -1   1  -1  -1  -1
knee         n      N         0     0     0     0    0.45  1  -1  -1   1  -1  -1  -1
sing         ŋ      NG        0     0     0     0   -0.27  1  -1  -1   1  -1  -1  -1
pay          p      P         0     0     0     0    1    -1   1  -1  -1  -1  -1  -1
run          ɹ      R         0     0     0     0    0.27 -1  -1  -1   1  -1   1  -1
say          s      S         0     0     0     0    0.45 -1  -1   1  -1  -1  -1  -1
show         ʃ      SH        0     0     0     0    0.27 -1  -1   1  -1  -1  -1  -1
take         t      T         0     0     0     0    0.45 -1   1  -1  -1  -1  -1  -1
thanks       θ      TH        0     0     0     0    0.64 -1  -1   1  -1  -1  -1  -1
very         v      V         0     0     0     0    0.82 -1  -1   1   1  -1  -1  -1
way          w      W         0     0     0     0    1    -1  -1  -1   1   1  -1  -1
yes          j      Y         0     0     0     0   -0.09 -1  -1  -1   1   1  -1  -1
zoo          z      Z         0     0     0     0    0.45 -1  -1   1   1  -1  -1  -1
measure      ʒ      ZH        0     0     0     0    0.27 -1  -1   1   1  -1  -1  -1
(silent)     –      #         0     0     0     0    0     0   0   0   0   0   0   0
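The conversion described above can be sketched as follows (an illustrative reimplementation; the function and dictionary names are ours, and only the three Table B.2 rows needed for the "ball" example are reproduced):

```python
# 12-dimensional phoneme vectors from Table B.2 (subset for illustration):
# 4 vowel features followed by 8 consonant features.
PHONEME_VECTORS = {
    "B":  [0, 0, 0, 0, 1.00, -1, 1, -1, 1, -1, -1, -1],
    "AO": [0.33, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "L":  [0, 0, 0, 0, 0.45, -1, -1, -1, 1, -1, -1, 1],
}

# Tiny stand-in for the CMU Pronouncing Dictionary lookup (Lenzo, 2007);
# a real implementation would query the full dictionary.
CMU_DICT = {"ball": ["B", "AO", "L"]}

def encode_word(word):
    """Map a word to its sequence of 12-dimensional phoneme vectors."""
    return [PHONEME_VECTORS[p] for p in CMU_DICT[word.lower()]]

vectors = encode_word("ball")
print(len(vectors))      # 3 phonemes: B, AO, L
print(len(vectors[0]))   # 12 features per phoneme
```

The resulting sequence of vectors is what the phoneme input layer of the architecture would receive, one 12-dimensional vector per phoneme.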