TY - JOUR
AU1 - Ferrand, Romain
AU2 - Baronig, Maximilian
AU3 - Unger, Florian
AU4 - Legenstein, Robert
AB -

Introduction

Memory is an integral component of biological neuronal systems. It underlies behavior at many levels, from basic fear memory to complex cognitive processes such as language understanding. Experimental results have provided ample evidence that memories are stored in so-called memory engrams. The main assumption of the memory engram theory is that learning induces persistent changes in specific brain cells that retain information and are subsequently reactivated upon appropriate retrieval conditions [1–3]. A host of experimental evidence supports the hypothesis that synaptic plasticity is essential for memory storage. However, some recent results indicate that non-synaptic plasticity, such as the regulation of neuronal membrane properties, also contributes to the creation of memory engrams [4–8]. In fact, there has been some scepticism about the role of synaptic plasticity in memory formation [6,9,10]. One important argument is that synaptic plasticity, in particular long-term potentiation (LTP), typically requires repetitive stimulation, whereas learning and memory can arise from single experiences. Another suggestion was that non-synaptic plasticity may act as a permissive signal for synaptic changes [4]. In this view, rapid excitability changes of neurons could set the stage for later synaptic reconfiguration. An intriguing experimental finding was that excitability changes are crucial for learning in trace conditioning experiments [11,12], an effect that was shown to synergize with synaptic plasticity in a modelling study [13]. Taken together, the available studies support the hypothesis that synaptic and non-synaptic plasticity co-operate to construct memory engrams. However, as stated in [4], the question about the functional role of the plasticity of neuronal membrane characteristics has remained open.

In this article, we show that fast changes of the intrinsic excitability in the apical dendritic trunk, which we call trunk strength plasticity (TSP) in the following, can give rise to enhanced learning capabilities of neuronal networks. Our hypothesis is that this non-synaptic TSP, which can be rapidly induced, provides the network with instantaneous memory. This memory is used by the network for memory-dependent computation. Synaptic plasticity, on the other hand, operates on a slower timescale. The role of synaptic plasticity is thus to learn to make use of the memorization capabilities introduced by TSP for the computational task at hand. In that sense, the two plasticity processes synergize in the search for a task solution (see e.g. [14–16] for other synergistic plasticity approaches). Our theoretical analysis shows that this division of labor gives rise to temporally and spatially local synaptic plasticity rules. We find that despite the locality of the synaptic and non-synaptic plasticity processes, networks equipped with such mechanisms exhibit remarkable learning capabilities. This is demonstrated for reward-based learning tasks that necessitate instantaneous memory, such as a radial maze task, as well as for a complex question-answering task [17], thus illustrating that basic language understanding can be acquired through local learning.

Results

A model for memory processing in pyramidal neurons

Pyramidal neurons are the principal cells of many memory-related brain areas such as the hippocampus, amygdala, and prefrontal cortex.
They have a characteristic bipolar shape, with basal dendrites close to the soma and apical dendrites that are separated from the soma by an extended apical trunk (green schematic cell in Fig 1A). These two components establish two sites of synaptic integration: they independently integrate synaptic input in a non-linear manner, and the two resulting signals are combined at the soma to determine the action potential output of the neuron [18,19].

For our network model, we were inspired by memory-augmented neural networks (MANNs), as e.g. in [20,21]. These models use a memory module in which key vectors can be associated with value vectors. When a fact is stored in memory (memorization operation), a key and a value vector are computed and this key-value pair is memorized. When information is retrieved from memory (memory recall operation), a query vector is computed and a value vector is retrieved based on the similarity between the query vector and all stored key vectors. The retrieved value vector is then used in a single-layer neural network to compute the network output. In our model, memories are stored in a memory layer that is a population of m model pyramidal neurons, see Fig 1B.

Fig 1. Network model. (A) Schematic of our simple pyramidal cell model (green) consisting of an apical and a basal compartment with their respective activations. The excitability of the apical trunk is variable and indicated in dark green. The neuron output is projected to an output layer (yellow). Prediction errors generate learning signals which are fed back via randomly initialized feedback weights. (B) Network view. Input neurons (blue) project to the apical and basal compartments of a population of pyramidal cells (green). Pyramidal output is projected to a layer of linear output neurons (yellow) producing the network output. https://doi.org/10.1371/journal.pone.0313331.g001

The memory layer receives input from two input populations with activities xa,t and xb,t at time t, which project to the apical and basal compartments of the pyramidal neurons via synaptic weight matrices Wapical and Wbasal, respectively. In each pyramidal neuron i of the memory layer, the resulting basal and apical activations are given by (1), with the rectified linear nonlinearity σ(s) = max{0, s} and pre-activations (2). The vector of apical activations is thus akin to the key vector and the vector of basal activations is akin to the value vector in a MANN. In the following, we suppress the neuron index i in our model description in order to simplify notation. The real-valued output of a pyramidal neuron is then given by a linear combination of these activations (3), in which a scalar factor denotes the branch strength of the apical trunk. As we describe below in detail, we model this trunk strength as a dynamic variable that is used to memorize information about the apical and basal activations (i.e., the key and value vectors). Hence, this layer of pyramidal cells implements a simple memory module. The output of the pyramidal cell layer is projected to c output neurons (the output layer) via a weight matrix, which produces the network output. In addition, the network output can be used to determine learning signals that are used as feedback signals for learning (red in Fig 1; see below). A sketch of this forward pass is given below.
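To make the model concrete, the following minimal Python (NumPy) sketch implements the forward pass of Eqs (1)–(3) as we read them from the text. Since the equation bodies are not reproduced here, the exact form of the somatic combination is an assumption: we take the output as the basal activation plus the apical activation weighted by the trunk strength, and all variable names (W_apical, W_basal, beta) are ours.

```python
import numpy as np

def relu(s):
    return np.maximum(0.0, s)

def memory_layer_forward(x_apical, x_basal, W_apical, W_basal, beta):
    """One time step of the memory layer.

    x_apical, x_basal : input population activities, shape (d,)
    W_apical, W_basal : synaptic weight matrices, shape (m, d)
    beta              : vector of trunk strengths, shape (m,), the fast memory variable
    """
    a_apical = relu(W_apical @ x_apical)   # apical activation ("key"), Eqs (1)-(2)
    a_basal = relu(W_basal @ x_basal)      # basal activation ("value")
    y = a_basal + beta * a_apical          # assumed somatic combination, Eq (3)
    return y, a_apical, a_basal

# Example: m = 200 memory neurons, d = 128 inputs per compartment
rng = np.random.default_rng(0)
m, d = 200, 128
W_apical = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))
W_basal = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))
beta = np.zeros(m)                         # trunk strengths start unpotentiated
x = rng.random(d)
y, a_ap, a_ba = memory_layer_forward(x, x, W_apical, W_basal, beta)
```

During a recall, x_basal is a zero vector, so under this assumed combination the output reduces to the trunk-gated apical activation.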
A large variety of non-synaptic plasticity mechanisms exist in pyramidal neurons [22–26]. Changes of neuronal membrane characteristics do not necessarily only change the global neuronal excitability or the action potential initiation threshold; they can also be confined to dendritic subunits [23,26]. In particular, Losonczy et al. found that local branch activity led to the potentiation of the branch strength when paired either with the cholinergic agonist carbachol or — more importantly — with two or three backpropagating action potentials, indicating an intra-neuron Hebbian type of branch plasticity [26]. In addition, experimental work [23] has shown that ionic currents can be down-regulated in a Hebbian manner in CA1 pyramidal neurons. Such down-regulation can in turn increase the dendro-somatic coupling of the apical dendrites [27]. Motivated by this finding, we assume a Hebbian-type trunk plasticity (4)(5)(6), with σ the ReLU non-linearity, a set of plasticity parameters, and a maximum trunk strength. Hence, in our model, the trunk strength is potentiated when both the apical and basal compartments are activated by synaptic input. This potentiation saturates at the maximum trunk strength. The ReLU nonlinearity ensures that the trunk strength remains non-negative. We have also included a depression term that depends quadratically on the apical activation, similar to Oja's Hebbian rule [28]. The depression term can be interpreted as a homeostatic term, as it reduces the trunk excitability when there is large activity [25]. In the particular implementation that we considered, the depression depends quadratically on the apical activity. This homeostasis can reset a memory by depressing the trunk strengths of neurons where an apical activation is not paired with a basal one. Other options would be to use the basal activity or a combination of both for homeostasis. It turns out that the actual choice is not crucial as long as there is homeostasis that depends on some activation, see Section A in S1 Appendix.

From a functional perspective, this plasticity enables the neurons in the memory layer to implement a simple memory system. At each time step t, the network can perform either a memorization or a memory recall operation. In a memorization operation, both input populations xa,t and xb,t are active, and the trunk strength is updated. The layer thus memorizes at which neurons both the apical and basal dendrites were activated. At a memory recall, we assume that only the apical input population xa,t is active. The activated apical compartments will induce neuronal activity only in those cells in which the trunk strength was potentiated, thus reading out a trace of the memory. The output of the memory layer activates the neurons in the output layer, a learning signal is computed and fed back to the memory layer, and synaptic weights are updated according to the plasticity rules described below. Note that the network architecture is purely feed-forward, without any recurrent synaptic connections. The trunk strength, however, implements an implicit recurrence through its update rule. This dynamic state variable can be used by the network to store information about previous inputs. In principle, the apical and basal input populations could provide different aspects of the input at a memorization event. For simplicity, however, in our simulations they exhibited the same activity pattern xa,t = xb,t at memorization events, which turned out to work well. A sketch of the trunk plasticity update is given below.
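The trunk strength plasticity of Eqs (4)–(6) can be sketched as follows. Only the qualitative structure is taken from the text (Hebbian potentiation that saturates at a maximum trunk strength, an Oja-like depression quadratic in the apical activity, and a ReLU that keeps the trunk strength non-negative); the coefficient names and default values are assumptions.

```python
import numpy as np

def relu(s):
    return np.maximum(0.0, s)

def update_trunk_strength(beta, a_apical, a_basal, beta_max=2.0,
                          c_pot=0.5, c_dep=0.1):
    """One TSP step for a vector of trunk strengths beta, shape (m,).

    c_pot, c_dep, beta_max are assumed plasticity parameters, not the paper's values.
    """
    potentiation = c_pot * a_apical * a_basal * (beta_max - beta)  # Hebbian term, saturates at beta_max
    depression = c_dep * (a_apical ** 2) * beta                    # Oja-like homeostatic depression
    return relu(beta + potentiation - depression)                  # trunk strength stays non-negative
```

During a memorization step both compartments are driven, so neurons with overlapping apical and basal activation are potentiated; during a recall step the basal activation is zero, so only the depression term acts, which is what resets the memories of neurons whose apical input is not paired with basal input.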
Local synaptic plasticity for memory-dependent processing with TSP

In our model, memory is encoded in the vector of trunk strengths of the memory layer. This memory is constructed on the fast timescale of single neuronal activations. According to our hypothesis, synaptic plasticity in contrast works on a slower timescale and is used to learn how to make use of this memory process in the specific task context. Similar ideas have been put forward in a number of models termed memory-augmented neural networks [20,21,29,30]. In these networks, however, synaptic weights are optimized with backpropagation through time (BPTT), a complex and biologically highly implausible learning algorithm [31]. In contrast, it turns out that if trunk strengths are used for rapid information storage, one can derive local synaptic plasticity rules, see Methods. The derivation takes advantage of the eligibility propagation (e-prop) algorithm [32]. The resulting on-line plasticity rules approximate BPTT using synaptic eligibility traces [33] in combination with learning signals that are directly fed back from the network output to the network neurons. This is illustrated in Fig 1 for a supervised learning scenario (red).

Consider a network with c output neurons and corresponding outputs. For given target outputs, we obtain c learning signals, which are fed back to the memory layer neurons through a feedback weight matrix. To simplify notation, we express the following equations for a single neuron and drop the post-synaptic neuron index. Consider a neuron in the memory layer with its vector of feedback weights. The weighted learning signals are summed to obtain the neuron-specific learning signal (7). The synaptic plasticity rules for synapses at the basal and apical dendrites are then given by (8)(9) (see Methods for a derivation), where η is a learning rate, each basal and apical synapse carries its own eligibility trace, and H is the Heaviside step function: H(s) = 0 for s ≤ 0 and H(s) = 1 otherwise. In general, the plasticity rules combine the neuron-specific learning signal, that is, the error assigned to the neuron, with the synapse-specific eligibility trace. This eligibility trace records the eligibility of the synapse for changes in the trunk strength. For example, if a large error was assigned to the neuron and its trunk strength was high, synapses that led to a trunk strength increase are eligible for that error and are changed such that this increase will be smaller in the future. The eligibility is computed by filtering information locally available to the neuron (10), with a decay function and increase functions defined below. Note that the update rules (8) and (9) as well as the updates of the eligibility traces (10) need only temporally and spatially local information, that is, information that is available at the post-synaptic neuron at the current or previous time step. Hence, this update could in principle be implemented by pyramidal neurons.

The function f in (10) dynamically controls the decay of the eligibility, and it is the same for both the apical and the basal synapses (11), where H again denotes the Heaviside step function. Hence, f (as well as the other functions, see below) is gated by the trunk strength being positive, meaning that the eligibility trace is zero if the trunk strength is zero. This reflects the fact that the apical dendrite has no influence on the neuron output when the trunk strength is zero, and thus the network output does not depend on the input to this neuron during a recall (note that during a recall, only the apical inputs are active). The decay factor f is 1 (no decay) in the absence of apical activity and is reduced by the squared apical activity and by the product of basal and apical activity.
In other words, the synapse stays fully eligible as long as the trunk strength is not altered. If the trunk strength is altered due to apical activity, the eligibility is reduced, as other synapses may become eligible for these changes. The two remaining functions in (10) modulate the increase of the eligibility trace at synaptic input (12)(13). Both are gated by activity in their corresponding compartments. Hence, if a compartment is inactive, the eligibility is not changed even if there is synaptic input, as the compartment did not contribute to any changes of the trunk strength. Also, both include a term which takes into account that it is harder to increase the trunk strength when it is already close to its maximum value. The function for basal synapses is linearly dependent on the apical activation, as the potentiation of the trunk strength is proportional to apical activity. The function for apical synapses is similar, with an additional term that records eligibility according to the decrease of the trunk strength due to apical activation.

The update dynamics of eligibility traces and weights of basal synapses are illustrated in Fig 2 (in the setup shown there, the apical dynamics are similar). Here, a neuron receives two apical inputs and two basal inputs (Fig 2 right and panel A). First, the first apical and basal inputs are co-active, leading to an increase in the trunk strength (panel B). The eligibility trace for the first basal synapse is thus increased, as it caused this trunk potentiation (panel C). Later, there is a longer co-activation of the other synapse pair, leading to stronger trunk potentiation. Now, the second basal synapse increases its eligibility trace, while the trace of the first one is decreased. At the final time step, a learning signal appears, paired with apical activity. This leads to weight changes that are proportional to the eligibility traces (panel E).

Fig 2. Illustration of eligibility trace dynamics and synaptic plasticity at the basal compartment in our model. Simulation of a single neuron with two apical and two basal synapses, each having unit weight. The first apical and basal synapse is initially activated for one time step, followed by the activation of the second apical and basal synapse for three consecutive time steps (panel A). These co-activations lead to increases in the branch strength (panel B), as well as to changes in the eligibility traces (panel C, see text). Then, a recall is performed at time step 19 where both apical synapses are activated, and a learning signal is received (panel D). Changes of basal weights are then given by the product of the eligibility traces with the learning signal and apical activation (panel E). https://doi.org/10.1371/journal.pone.0313331.g002

In summary, the eligibility traces record first- and second-order terms of apical and basal activity, together with some gating and a dependence on the trunk strength. Although the update rules for the traces are far from simple, they include only terms available at the post-synaptic neuron and, in particular, are local in time — in contrast to BPTT. Hence, they can in principle be computed at the synapse. A schematic sketch of the learning signal, trace, and weight updates is given below.
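The following schematic sketch puts the pieces of Eqs (7)–(13) together. Only the overall structure is taken from the text: a neuron-specific learning signal obtained through fixed random feedback weights, an eligibility trace per synapse that decays via f and grows via compartment-gated increase functions, and weight updates given by the product of learning signal, apical activity, and trace, gated by a positive trunk strength. The concrete functional forms, coefficients, and the dependence on the presynaptic input below are our qualitative reconstruction, not the exact rules derived in Methods.

```python
import numpy as np

def heaviside(s):
    return (s > 0).astype(float)

def learning_signal(B, y_out, y_target):
    """Neuron-specific learning signals (Eq 7): output errors fed back through random weights B, shape (m, c)."""
    return B @ (y_out - y_target)

def update_traces(e_basal, e_apical, a_apical, a_basal, beta, x_a, x_b,
                  beta_max=2.0, c_pot=0.5, c_dep=0.1):
    """One step of the eligibility-trace dynamics (Eqs 10-13).

    Shapes: e_basal, e_apical (m, d); a_apical, a_basal, beta (m,); x_a, x_b (d,).
    Coefficients are the assumed TSP parameters from the trunk-plasticity sketch above.
    """
    gate = heaviside(beta)                                                    # traces vanish while beta == 0
    f = gate * (1.0 - c_dep * a_apical ** 2 - c_pot * a_apical * a_basal)     # decay factor (Eq 11)
    inc_b = gate * heaviside(a_basal) * (beta_max - beta) * a_apical          # basal increase (Eq 12)
    inc_a = gate * heaviside(a_apical) * ((beta_max - beta) * a_basal
                                          - c_dep * beta * a_apical)          # apical increase (Eq 13)
    e_basal = f[:, None] * e_basal + inc_b[:, None] * x_b[None, :]
    e_apical = f[:, None] * e_apical + inc_a[:, None] * x_a[None, :]
    return e_basal, e_apical

def weight_update(L, a_apical, beta, e, lr=1e-3):
    """Weight change (Eqs 8/9): learning signal x apical activity x trace, gated by beta > 0."""
    return -lr * (L * a_apical * heaviside(beta))[:, None] * e
```

A basal weight change would then be applied as weight_update(L, a_apical, beta, e_basal) at a recall step where a learning signal is available, and analogously for the apical weights.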
When we compare the synaptic and the non-synaptic plasticity mechanisms in our model, we observe that they operate on very different timescales with complementary functions. TSP acts very fast in order to memorize relevant information about the current input in the trunk strengths of the memory layer neurons. The local synaptic plasticity dynamics, on the other hand, approximate gradient descent and optimize the synaptic weights over many learning episodes in order to minimize the network error on the specific task. Their roles are thus complementary, but they synergize in the following way. TSP provides general memory capabilities that help to learn rather arbitrary memory-related tasks. Synaptic plasticity then utilizes this memory and adapts synaptic weights such that the relevant information is memorized and retrieved for the task at hand. In the simulations reported below, we turned off synaptic plasticity in the testing phase after the model was trained. This shows that while synaptic plasticity is needed for task acquisition, it is not necessary for inference after proper synaptic weights have been determined.

In the following, we evaluated the above-described network on a variety of sequence processing tasks that all require some form of memory. In each task, one episode consists of a sequence of time steps t, in each of which the trunk strength and the eligibility traces are updated in each neuron of the memory layer. Only after the entire episode are synaptic weight changes applied, based on the accumulated eligibility traces, and only if the model is in the training phase. In the testing phase, eligibility traces are not recorded, since no synaptic weight update is performed. The eligibility traces can hence be viewed as auxiliary quantities, keeping track of the network activity over time for an accurate approximation of BPTT. Note that eligibility traces have no direct influence on the network output; hence, changes in trunk strength are the only available resource to account for within-episode short-term memorization.

Learning associations and stimulus comparisons with local synaptic plasticity and TSP

We first tested whether the model is able to learn general associations between sensory input patterns. To this end, we generated two disjoint sets R and S, each consisting of n Omniglot [34] characters. The network had n output neurons, one associated with each element in S. Before each episode, we randomly drew a bijective mapping between these two sets to generate n stimulus pairs from R × S such that each element from the two sets appeared in exactly one pair. These pairs were shown to the network in random order, representing facts to be memorized by the network, see Fig 3A. Afterwards, an input consisting of one element of R was shown as a query, and the network had to activate the output neuron associated with the paired stimulus.

Fig 3. Learning associations with local plasticity. (A) An agent observes a sequence of stimulus pairs. After being cued by one of the observed stimuli, it has to indicate the associated one. (B) Number of training episodes needed until the network achieved an accuracy of 80% as a function of the number of association pairs to be remembered (mean and SD over 16 training trials). https://doi.org/10.1371/journal.pone.0313331.g003

The network consisted of d = 128 input neurons in each of the apical and basal input populations, 200 neurons in the memory layer, and n output neurons. A sketch of how an episode of this task can be generated is shown below.
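As an illustration of the task structure just described, the sketch below generates one association episode, assuming each character is already given as a 64-dimensional embedding (produced by the convolutional network described next); the function and variable names are ours.

```python
import numpy as np

def make_episode(R_embed, S_embed, rng):
    """R_embed, S_embed: (n, 64) arrays of character embeddings for the two sets R and S."""
    n = len(R_embed)
    mapping = rng.permutation(n)                 # random bijection R -> S, redrawn every episode
    order = rng.permutation(n)                   # pairs are presented in random order
    facts = [np.concatenate([R_embed[i], S_embed[mapping[i]]]) for i in order]  # 128-dim fact inputs
    query_idx = rng.integers(n)
    query = np.concatenate([R_embed[query_idx], np.zeros(64)])   # apical-only query input
    target = mapping[query_idx]                  # index of the associated output neuron
    return facts, query, target
```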
In order to generate a reasonable higher-level representation of the Omniglot characters, each character of the presented pair was first embedded in a 64-dimensional space with a convolutional network pre-trained using a prototypical objective [35] (the pre-training was done on a subset of the Omniglot classes which were not used in R and S). The embeddings were then concatenated into a 128-dimensional vector, and both the apical and basal input populations were activated with this vector, i.e., xa,t and xb,t were both set to this embedding. At query time, one character from R was embedded and concatenated with a 64-dimensional zero vector to obtain xa,t as the input to the apical compartment, while there was no input to the basal compartment (xb,t = 0). We trained the network using a reward-based paradigm in which it received a positive reward for the correct response at query time and a negative reward otherwise. In this reward-based setting, we used the standard proximal policy optimization (PPO) objective with an entropy bonus [36] (see section 'General simulation details' in Methods for a definition) to compute the learning signals used for the synaptic weight updates.

The network achieved 100% accuracy on this task after about 2000 episodes for n = 5 association pairs. Fig 3B shows how learning time scales with the number of associations to be learned (blue bars). A few hundred episodes suffice for two associations. For comparison, we also trained the network with direct supervision (error signals determined from the target response; orange bars; see section 'General simulation details' in Methods). As expected, the increase in learning time is milder, but shows a similar tendency.

Neural network models can be sensitive to parameter settings. In order to test whether our model is robust to parameter changes, we analyzed the impact of the hyperparameters of the trunk plasticity rule as well as the memory layer size on network performance. We found that network performance is stable over a large range of these parameters. See Section B in S1 Appendix for details of the analysis.

Another memory-related task frequently used in experiments is the classical delayed match-to-sample task [37]. Here, the animal observes two stimuli separated in time and must produce an action depending on whether or not the two stimuli are equal, see Fig 4A. We modeled this task in the setup described above, with a pre-trained convolutional network to embed the stimuli in a 64-dimensional space and 200 neurons in the memory layer. The agent first observed one out of five Omniglot characters, followed by eight steps in which white noise input was shown. Finally, another character was shown, which was chosen to be the same as the first one with probability 0.5 and one of the other characters with probability 0.125 each (see the sketch below). The output of the network was then interpreted as an action a ∈ {left, right} indicating a match or non-match. A reward was delivered accordingly, which was used to compute the learning signals for the synaptic weight updates. Then another episode started.

Fig 4. Learning of a delayed match-to-sample task with local plasticity. (A) Task schema. The agent observes a stimulus followed by 8 white noise inputs and another stimulus. The agent should choose the left action when the initial stimulus matches the query stimulus. (B) Learning progress in terms of choice accuracy. Green: only one character instantiation per class for training and testing. Blue: network is tested on a character not seen during training. Brown: LSTM in the fixed setting (16 trials; shading indicates standard deviation). https://doi.org/10.1371/journal.pone.0313331.g004
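A sketch of one delayed match-to-sample episode, under the assumption that the five candidate characters are given as 64-dimensional embeddings and that the white-noise inputs are standard-normal vectors; names are ours.

```python
import numpy as np

def dms_episode(char_embed, rng, delay=8):
    """char_embed: (5, 64) embeddings of the five candidate characters."""
    sample = rng.integers(5)
    if rng.random() < 0.5:                       # probe matches the sample with probability 0.5
        probe = sample
    else:                                        # otherwise one of the other four (p = 0.125 each)
        probe = rng.choice([i for i in range(5) if i != sample])
    noise = [rng.standard_normal(64) for _ in range(delay)]   # white-noise delay inputs
    inputs = [char_embed[sample]] + noise + [char_embed[probe]]
    target_action = "left" if probe == sample else "right"
    return inputs, target_action
```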
In this task, we also tested whether the network can cope with variance in the input stimuli. In Omniglot, each character class consists of 20 drawings of the character by different people, with significant variance. We compared network performance in a setting where the specific instantiation of the presented character was fixed to a setting where it was drawn randomly from the set in each episode. In particular, for performance evaluation, we used a character instantiation that was not used for training (see Methods for details). Training progress is shown in Fig 4B. The network achieved an accuracy of 95 ± 1.8% on this task, with no significant difference between the fixed-character and sampled-character settings. We wondered whether the generalization capabilities of the network in the sampled-character setting could be fully attributed to a writer-independent representation of the characters in the embedding of the convolutional network. We therefore visualized the variance of embeddings for different samples of a given character using principal component analysis and t-SNE [38], see Section C in S1 Appendix. We found that there is still significant variance in these embeddings. Although the convolutional embedding certainly helps generalization, this shows that the inputs do not need to be rigidly symbolic. Rather, the network can deal with the variability in the input representation provided by the convolutional network.

We also tested the performance of a long short-term memory (LSTM) network [39] trained with BPTT, i.e., with non-local plasticity, see Fig 4B. Interestingly, the LSTM was not able to learn this task consistently (it reached performances of around 90% in some trials but failed in others). We considered one LSTM with the same number of neurons and one with the same number of parameters as our network; Fig 4B shows the better-performing one. This shows that TSP improves the learning capabilities of neural networks with local synaptic plasticity on this task. While the trunk strength can in principle hold information over arbitrary durations, the noise in the input during the delay period induces trunk strength changes which perturb the memory, leading to a drop in accuracy for increasing delays. See Section D in S1 Appendix for an analysis and discussion.

Learning context-dependent reward associations with local synaptic plasticity and TSP

We next tested whether the network trained with local synaptic plasticity rules was also able to learn a more complex radial maze task [40]. In this task, the animal is located in an eight-armed radial maze (see Fig 5). It observes one out of four context inputs (in our model, characters from the Omniglot data set), indicating one pair of arms that can be entered (indicated by color in Fig 5A). For each context, one of the two arms contains a reward. At the beginning of each episode, the arm containing the reward is randomly assigned for each of the contexts and held fixed for the episode duration. The animal first has to explore in which of the arms the reward is located and then remember this information for each of the four contexts separately throughout the episode, requiring memorization abilities of the animal. Each episode in this task consisted of 40 trials (i.e., 40 context stimuli and arm choices).
In each episode, the reward locations were chosen randomly at the beginning and stayed constant throughout the 40-trial episode. The task hence demands memorization of the reward locations within each episode. The memorized information can then be used to choose the rewarded arm in the remaining trials of the episode.

Fig 5. Context-dependent reward associations. (A) Schema of the radial maze task. In each trial, one arm pair is accessible to the agent (yellow in the example) and the context cue is presented (Omniglot character). The agent then has to choose the correct arm (left or right) to obtain the reward. (B) Fraction of rewarded actions over learning episodes in the basic radial maze task (blue) and the same task where the rewarding arm is switched after a visit (green; mean and SD over 16 runs). Red: maximum achievable performance. Orange: LSTM in the basic radial maze task. https://doi.org/10.1371/journal.pone.0313331.g005

We modelled this task using a network as described above, where the visual context was embedded in a 64-dimensional vector using the same pre-trained convolutional network, and a 200-neuron memory layer. At the beginning of a trial, one arm pair was chosen randomly out of the four possible pairs and the context stimulus c was presented to the network. The network output was interpreted as an action a ∈ {left, right} to choose one of the available arms. The network then received either a positive or a negative reward, which was used to compute learning signals and update synaptic weights. Afterwards, the network observed a summary of this trial through a triple (c, a, r), consisting of the context stimulus c, the chosen action a, and a binary variable r ∈ {0, 1} indicating the received reward. This information could be used by the network to memorize the rewarding action in the given context. Further details regarding the task setup can be found in Methods. A sketch of one such episode is given below.

We measured the performance of the network through choice accuracy, that is, the average fraction of rewarding choices within episodes (Fig 5B). Since the agent initially has to guess the rewarding action for each context, the maximum expected accuracy is 0.95. The network learned this task perfectly within about 2000 episodes. Note that this task also includes a basic form of counterfactual reasoning, since the agent can infer the reward location when visiting a non-rewarded arm. We also tested a more complex variant of this task where the reward in the visited context switches to the other arm after the visit. This task, too, could be learned perfectly with local synaptic plasticity, within approximately twice as many episodes as the non-switching case. An analysis of the learned network solution can be found in Section E in S1 Appendix. We also evaluated the performance of an LSTM network with BPTT on the basic version of the task, see Fig 5B. The LSTM converged towards a solution, but learned much more slowly.
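The sketch below runs one episode of the basic (non-switching) radial maze task along these lines. The policy choose_action and the feedback callback observe_summary stand in for the network's action selection and for presenting the (c, a, r) summary; both are placeholders, and representing contexts as 64-dimensional embeddings is an assumption.

```python
import numpy as np

def radial_maze_episode(context_embed, choose_action, observe_summary, rng, n_trials=40):
    """context_embed: (4, 64) embeddings of the four context cues (one per arm pair)."""
    rewarded_arm = rng.integers(2, size=4)          # per-context reward location, fixed for the episode
    rewards = []
    for _ in range(n_trials):
        c = rng.integers(4)                         # one of the four arm pairs / contexts
        a = choose_action(context_embed[c])         # network chooses 0 ('left') or 1 ('right')
        r = int(a == rewarded_arm[c])               # binary reward
        observe_summary(context_embed[c], a, r)     # trial summary (c, a, r) fed back to the network
        rewards.append(r)
    return np.mean(rewards)                         # fraction of rewarded choices in the episode
```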
Learning question answering tasks with local synaptic plasticity and TSP

In the above simulations, we tested our model on standard experimental paradigms: a delayed match-to-sample task and a radial maze task. We next asked whether local synaptic plasticity rules could learn to harness TSP to solve more complex cognitive tasks. One standard benchmark for memory-augmented neural networks is the bAbI task set [17]. It consists of 20 question-answering tasks, where each task is composed of a story consisting of a sequence of up to 325 sentences, followed by a question whose answer can be inferred from the information contained in the story. See Section F in S1 Appendix for example tasks. For our experiments, we used the 10k bAbI dataset to train a network with 200 neurons in the memory layer. According to the benchmark guideline, a task is considered solved if the error rate is less than or equal to 5%. Each sentence of a story was embedded in an 80-dimensional vector, and the sequence of these embeddings was presented to the network sequentially. We first considered a random embedding, where we generated an 80-dimensional random vector for each word using the He-uniform variance scaling method [41]. For a given sentence, the vectors of all words in the sentence were then linearly combined with a position encoding that encodes the position of each word in the sentence, as in [20] (see the sketch at the end of this subsection). Further details regarding the task setup can be found in Methods. In Table 1 we report the mean error rate of the model over 5 runs for each of the 20 bAbI tasks.

Table 1. Comparison of mean errors of a network trained with BPTT vs. our local synaptic plasticity on the bAbI 10k tasks (mean and SD over 5 trials). Error rates for tasks solved using our local synaptic plasticity rules are printed in bold face. BPTT: backpropagation through time; LSP: local synaptic plasticity; LSP joint: joint training where a single network was trained to perform all tasks concurrently. https://doi.org/10.1371/journal.pone.0313331.t001

Using the random embedding, the network was able to learn 13 tasks using local synaptic plasticity (column 3). As a baseline, we considered the same network architecture trained with BPTT (column 2). Notably, all tasks for which the network could be optimized with BPTT could also be learned by our temporally and spatially local synaptic plasticity rules, showing the effectiveness of local learning in our model. Here, the model was trained separately on each of the tasks, resulting in one network per task. We next tested whether a single network was able to solve all the tasks that could be solved by individual networks, by training one model jointly on these tasks (column 4). We found that this was the case, and that error rates on a majority of these tasks even improved, indicating a knowledge transfer between them during learning. In order to test how much a more task-specific sentence representation would improve the results, we also considered a pre-trained embedding which is optimized for the task. Here, we used the random embedding as initialization and trained the embedding end-to-end using BPTT on the task at hand. Then the embedding was fixed, and a fresh network was trained with local synaptic plasticity on this input representation. We tested all tasks that were not solved with the random embedding and found that three additional tasks could be solved with this better input representation (column 6). Again, all tasks that could be solved with BPTT could also be learned with local synaptic plasticity.
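The random sentence embedding described above can be sketched as He-uniform word vectors combined with a positional weighting. The specific position-encoding formula below is the one commonly used for bAbI memory networks and is an assumption here (reference [20] is only cited, not reproduced), as is the choice of the embedding dimension as fan-in for the He-uniform limit.

```python
import numpy as np

def he_uniform_embedding(vocab_size, dim=80, rng=None):
    """Random word embedding with He-uniform variance scaling (fan-in assumed to be dim)."""
    if rng is None:
        rng = np.random.default_rng()
    limit = np.sqrt(6.0 / dim)
    return rng.uniform(-limit, limit, size=(vocab_size, dim))

def embed_sentence(word_ids, E):
    """Position-weighted sum of the word vectors of one sentence."""
    J, d = len(word_ids), E.shape[1]
    j = np.arange(1, J + 1)[:, None]                 # word position 1..J
    k = np.arange(1, d + 1)[None, :]                 # embedding dimension 1..d
    l = (1 - j / J) - (k / d) * (1 - 2 * j / J)      # assumed position-encoding weights
    return np.sum(l * E[word_ids], axis=0)           # 80-dim sentence vector
```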
Overlapping assembly representations emerge through local learning

How does the network solve such tasks after training? In order to answer this question, we analyzed the behavior of a trained network in the "Single supporting fact" task, see Fig 6. This task involves simple person-location relations such as "John moved to the kitchen" among several persons and possible locations. Although the results are presented with the verb "moved", we note that the stories actually contain various phrases to indicate such person-location relations. These variations do not alter the person-location relation, and the model should thus learn to treat them as equal. To analyze stories, we recorded the vectors of trunk strengths that arose from the dataset's stories. These vectors represent the memory state of the memory layer throughout the stories. Subsequently, we applied a non-negative matrix factorization (NMF) [42] to project these memory states into a 20-dimensional space. We then projected the memory states, keys, values, and recall keys into this space.

Fig 6. Network analysis for the Single Supporting Fact task. (A) Projection of keys, values, and recall keys based on a non-negative matrix factorization of memory traces after network training. Keys are shown for specific persons, with representations averaged over locations and verbs. The key for John clearly activates 6 components corresponding to possible locations for John. Value representations are shown for specific locations, with representations averaged over persons and verbs. (B) Story sample along with its respective key (top, outer ring), value (top, inner ring), and memory state after memorization (bottom). Each key and value pair predominantly overlaps in a single component, which is then memorized. Additionally, the change in John's location in the last fact is accurately updated from component 14 to 1, causing component 14 to be deactivated due to the negative term in our Oja-type Hebbian rule. https://doi.org/10.1371/journal.pone.0313331.g006

Fig 6A (top) shows the average activity in the key layer for a given person, where the average was taken over all possible locations and action verbs. For example, the leftmost visualized vector shows sentences of the form "john moved to the ...", where the location is marginalized out. When comparing the representations for the four persons, one observes that keys effectively discriminate between persons, with close to orthogonal representations. When we performed the same analysis for a given location, averaging over persons, we did not observe such a structure (not shown). In contrast, such orthogonal representations can be observed for specific locations in the value layer when the average is taken over persons and action verbs (Fig 6A, middle). We found that the orthogonality of representations depends on the memory size and is crucial for task performance, see Section G in S1 Appendix. In summary, we observed that keys effectively discriminate between persons, while values indicate locations. We thus define each person's key vector as the average of the key vector projections over all possible variations of sentences for this person. Similarly, we define a location's value vector as the average of the value vector projections over all possible variations of sentences for this location. While key and value representations are close to orthogonal to the other representations within the same layer, we found a systematic overlap between key representations and value representations.
For instance, John's key vector primarily overlaps with specific locations indicated by the value vectors (components 1, 4, 7, and 14 overlap with office, kitchen, bedroom, and hallway, respectively; Fig 6A, compare top and middle). This analysis can similarly be conducted based on the value vectors. For example, the most activated components of the office's value overlap primarily with specific people indicated by the keys (components 1, 15, 17, and 19 overlap with John, Mary, Daniel, and Sandra, respectively). The bottom row of Fig 6A shows recall keys during queries for the location of persons (e.g., "where is john"). We observe that these recall keys are very similar to the keys during storage operations for facts that include the same person.

This representation, which has been learned through local synaptic plasticity in our model, can be used to store relevant information in the trunk strengths of the pyramidal neurons of the memory layer. This is illustrated in Fig 6B, where we show the processing of a simple story. In the first sentence, "Mary moved to the bedroom", the overlap between the key and value representations potentiates the trunk strength of the corresponding pyramidal neurons (component 3 in our projection), which can be observed when projecting the memory state after the Hebbian update into the low-dimensional space (leftmost bottom representation in Fig 6B). This potentiated state is retained after the next two sentences, and new memories are added according to the presented facts. At the last presented fact, "john moved to the office", John changes his location from the hallway to the office. This change is accurately recorded in the memory: the overlap of component 1 in the key and value represents the new location, while the deactivation of component 14, corresponding to the previous location, occurs due to the negative term in the Hebbian rule. The final state of the memory is then combined with a specific recall key from the question "Where is John?". The answer is determined by the overlap between their activated components, which is the newly activated component 1, part of the office representation, reflecting John's last change of location. Hence, the readout can easily determine office as the correct answer.

In summary, the model has learned assembly representations for entities. These representations are partly orthogonal and partly overlapping. An overlap defines a potential association that can be stored in the neurons of the memory layer. Experimental studies in humans have found clear evidence for assembly representations of celebrities and popular places in the medial temporal lobe, with partial overlap [43,44]. According to our model, overlapping assemblies emerge through learning because they are needed for the storage of associations in the memory layer.
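The kind of analysis described in this section can be sketched as follows, assuming the trunk-strength vectors recorded over the stories are stacked row-wise into a non-negative matrix; we use scikit-learn's NMF, and all function and variable names are ours.

```python
import numpy as np
from sklearn.decomposition import NMF

def project_memory_states(memory_states, keys, values, recall_keys, n_components=20):
    """memory_states: (n_samples, m) non-negative trunk-strength vectors recorded over stories."""
    nmf = NMF(n_components=n_components, init="nndsvda", max_iter=500)
    nmf.fit(memory_states)                           # learn the 20-dimensional factorization
    # Project memory states, keys, values, and recall keys into the same component space
    return {name: nmf.transform(np.maximum(X, 0.0))  # clip for safety; ReLU outputs are non-negative anyway
            for name, X in [("memory", memory_states), ("keys", keys),
                            ("values", values), ("recall_keys", recall_keys)]}
```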
For our network model, we were inspired by memory-augmented neural networks (MANNs) as e.g. in [20,21]. These models use a memory module where key vectors from can be associated with value vectors from . When some fact is stored in memory (memorization operation), a key and a value vector is computed and this key-value pair is memorized. When some information is retrieved from memory (memory recall operation), a query vector is computed and a value vector is retrieved based on the similarity between the query vector and all stored key vectors. The retrieved value vector is then used in a single layer neural network to compute the network output. In our model, memories are stored in a memory layer that is a population of m model pyramidal neurons, see Fig 1B. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 1. Network model. (A) Schematic of our simple pyramidal cell model (green) consisting of an apical and basal compartment with activations and respectively. The excitability of the apical trunk is variable and indicated in dark green. The neuron output is projected to an output layer (yellow). Prediction errors generate learning signals which are fed back via randomly initialized feedback weights . (B) Network view. Input neurons (blue) project to apical and basal compartments of a population of pyramidal cells (green). Pyramidal output is projected to a layer of linear output neurons (yellow) producing the network output . https://doi.org/10.1371/journal.pone.0313331.g001 The memory layer receives input from two input populations with activities and at time t, which project to the apical and basal compartments of the pyramidal neurons via synaptic weight matrices Wapical and Wbasal respectively. In each pyramidal neuron i of the memory layer, the resulting basal activation and apical activation is given by (1) with the rectified linear nonlinearity, σ ( s ) = max ⁡ { 0 , s } , and pre-activations (2) The vector of apical activations of neurons is thus akin to the key vector and the vector of the basal activations is akin to the value vector in a MANN. In the following, we will suppress the neuron index i in our model description in order to simplify notation. The real valued output of a pyramidal neuron is then given by a linear combination of these activations, (3) where the scalar denotes the branch strength of the apical trunk. As we will describe below in detail, we model the trunk strength as a dynamic variable , used to memorize information about the apical and basal activations (i.e., the key and value vectors). Hence, this layer of pyramidal cells implements a simple memory module. The output of the pyramidal cell layer is projected to c output neurons (the output layer) via the weight matrix , which produces the network output . In addition, the network output can be used to determine learning signals that are used as feedback signals for learning (red in Fig 1; see below). A large variety of non-synaptic plasticity mechanisms exist in pyramidal neurons [22–26]. Changes of neuronal membrane characteristics are not necessarily only changing the global neuronal excitability or the action potential initiation threshold, they can also be confined to dendritic subunits [23,26]. In particular, Losonczy et al. 
found that local branch activity led to the potentiation of the branch strength when paired either with the cholinergic agonist carbachol or — more importantly — with two or three backpropagating action potentials, indicating an intra-neuron Hebbian type of branch plasticity [26]. In addition, experimental work [23] has shown that currents can be down-regulated in a Hebbian manner in CA1 pyramidal neurons. Such down-regulation can in turn increase the dendro-somatic coupling of the apical dendrites [27]. Motivated by this finding, we assume a Hebbian-type trunk plasticity: (4)(5)(6) with σ the ReLu non-linearity, parameters , and is the maximum trunk strength. Hence, in our model, the trunk strength is potentiated when both the apical and basal compartments are activated by synaptic input (term in ). This potentiation saturates at . The ReLU nonlinearity in assures that the trunk strength remains non-negative. We have also included a depression term that depends quadratically on the apical activation (last term in ) similar to Oja’s Hebbian rule [28]. The depression term can be interpreted as a homeostatic term as it reduces the trunk excitability when there is large activity [25]. In the particular implementation that we considered, the depression depends quadratically on the apical activity. This homeostasis can reset a memory by depressing trunk strengths of neurons where an apical activation is not paired with a basal one. Other options would be to use the basal activity or a combination of both for homeostasis. It turns out that the actual choice is not crucial as long as there is homeostasis that depends on some activation, see Section A in S1 Appendix. From a functional perspective, this plasticity enables the neurons in the memory layer to implement a simple memory system. At each time step t, the network can perform either a memorization or a memory recall operation. In a memorization operation, both input populations xa , t and xb , t are active, and the trunk strength is updated. The layer thus memorizes at which neurons both the apical and basal dendrites were activated. At a memory recall, we assume that only the apical input population xa , t is active. The activated apical compartments will induce neuronal activity only in those cells in which the trunk strength was potentiated, thus reading out a trace of the memory. The output of the memory layer activates the neurons in the output layer, a learning signal is computed, fed back to the memory layer and synaptic weights are updated according to the plasticity rules described below. Note that the network architecture is purely feed-forward, without any recurrent synaptic connections. The trunk strength however implements an implicit recurrence, see . This dynamic state variable can be used by the network to store information about previous inputs. In principle, the apical and basal input populations could provide different aspects of the input at a memorization event. For simplicity however, in our simulations they exhibited the same activity pattern xa , t = xb , t at memorization events, which turned out to work well in our simulations. Local synaptic plasticity for memory-dependent processing with TSP In our model, memory is encoded in the network in the vector of trunk strengths . This memory is constructed on the fast timescale of single neuronal activations. 
According to our hypothesis, synaptic plasticity in contrast works on a slower timescale and is used to learn how to make use of this memory process in the specific task context. Similar ideas have been put forward in a number of models termed memory-augmented neural networks [20,21,29,30]. In these networks however, synaptic weights are optimized with backpropagation through time (BPTT), a complex and biologically highly implausible learning algorithm [31]. In contrast, it turns out that if trunk strengths are used for rapid information storage, one can derive local synaptic plasticity rules, see Methods. The derivation takes advantage of the eligibility propagation (e-prop) algorithm [32]. The resulting on-line plasticity rules approximate BPTT using synaptic eligibility traces [33] in combination with learning signals that are directly fed back from the network output to the network neurons. This is illustrated in Fig 1 for a supervised learning scenario (red). Consider a network with c output neurons and corresponding outputs . For given target outputs , we obtain c learning signals which are fed back to the memory layer neurons through a feedback weight matrix . To simplify notation, we express the following equations for a single neuron and drop the post-synaptic neuron index. Consider a neuron in the memory layer with feedback weight vector . The weighted learning signals are summed to obtain the neuron-specific learning signal : (7) The synaptic plasticity rules for synapses at the basal and apical dendrites are then given by (see Methods for a derivation) (8)(9) where η is a learning rate, and denote the eligibility traces of the basal and apical synapses respectively, and H is the Heaviside step function: H ( s ) = 0 for s ≤ 0 and H ( s ) = 1 otherwise. In general, the plasticity rules combine the neuron-specific learning signal , that is the assigned error to the neuron, with the synapse specific eligibility trace. This eligibility trace records the eligibility of the synapse for changes in the trunk strength. For example, if there was a large error assigned to the neuron and its trunk strength was high, synapses that led to a trunk strength increase are eligible for that error and are changed such that this increase will be smaller in future. The eligibility is computed by filtering information locally available to the neuron: (10) for H ( s ) = 1 and functions . Note that the update rules (8) and (9) as well as the updates of eligibility traces (10) need only temporally and spatially local information, that is, information that is available at the post-synaptic neuron at the current or previous time step. Hence, this update could in principle be implemented by pyramidal neurons. The function f in dynamically controls the decay of the eligibility, and it is the same for both the apical and the basal synapses: (11) where H denotes the Heaviside step function. Hence, f (as well as the other functions, see below) are gated by , meaning that the eligibility trace is zero if the trunk strength is zero. This arises from the fact that the apical dendrite has no influence on the neuron output when , and thus the network output does not depend on the input to this neuron during a recall (note that during a recall, only the apical inputs are active). It is 1 (no decay) in the absence of apical activity and reduced by the squared apical activity and the product of basal and apical activity. In other words, the synapse stays fully eligible as long as the trunk strength is not altered. 
If the trunk strength is altered due to apical activity, the eligibility is reduced, as other synapses may become eligible for these changes. The functions in modulate the increase of the eligibility trace at synaptic input: (12)(13) Both and are gated by activity in their corresponding compartments. Hence, if the compartment is inactive, the eligibility is not changed even if there is synaptic input, as the compartment did not contribute to any changes of the trunk strength. Also, both include a term which takes into account that it is harder to increase the trunk strength when it is already close to its maximum value. is then linearly dependent on the apical activation, as a potentiation of the trunk strength is proportional to apical activity (). is similar, with an additional term that records eligibility according to the decrease of the trunk strength due to apical activation. The update dynamics of eligibility traces and weights of basal synapses are illustrated in Fig 2 (in the setup shown there, the apical dynamics are similar). Here, a neuron receives two apical inputs , and two basal inputs , (Fig 2 right and panel A). First, and are co-active leading to an increase in the trunk strength (panel B). The eligibility-trace for the first basal synapse is thus increased as it caused this trunk potentiation (panel C). Later, there is a longer co-activation of the other synapse pair and , leading to stronger trunk potentiation. Now, the second basal synapse increases its eligibility trace, while the trace of the first one is decreased. At the final time step, a learning signal appears, paired with apical activity. This leads to weight changes that are proportional to the eligibility trace (panel E). Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 2. Illustration of eligibility trace dynamics and synaptic plasticity at the basal compartment in our model. imulation of a single neuron with two apical and basal synapses, each having unit weight. The first apical and basal synapse is initially activated for one time step, followed by the activation of the second apical and basal synapse for three consecutive time steps (panel A). These co-activations lead to increases in the branch strength (panel B), as well as to changes in the eligibility traces (panel C, see text). Then, a recall is performed at time step 19 where both apical synapses are activated. Further, a learning signal is received (panel D). Changes of basal weights are then given by the product of the eligibility traces with the learning signal and apical activation (panel E, see . https://doi.org/10.1371/journal.pone.0313331.g002 In summary, the eligibility traces record first and second order terms of apical and basal activity, together with some gating and a dependence on the trunk strength. Although the update rules for the traces are far from simple, they include only terms available at the post-synaptic neuron and in particular are local in time — in contrast to BPTT. Hence, they can in principle be computed at the synapse. When we compare the synaptic and non-synaptic plasticity mechanism in our model, we observe that they operate on very different timescales with complementary functions. TSP acts very fast in order to memorize relevant information about the current input in the trunk strengths of the memory layer neurons. 
On the other hand, the local synaptic plasticity dynamics approximate gradient descent and optimize the synaptic weights over many learning episodes in order to minimize network error on the specific task. Their roles are thus complementary, but they synergize in the following way. TSP provides general memory capabilities that help to learn rather arbitrary memory-related tasks. Synaptic plasticity then utilizes this memory and adapts synaptic weights such that the relevant information is memorized and retrieved for the task at hand. In the simulations reported below, we turned off synaptic plasticity in the testing phase after the model was trained. This shows that while the synaptic plasticity is needed for task acquisition, it is not necessary for inference after proper synaptic weights have been determined. In the following, we evaluated the above described network on a variety of sequence processing tasks that all require some form of memory. In each task, one episode consists of a sequence of time steps t, in each of which the trunk strength and the eligibility traces get updated in each neuron of the memory layer. Only after the entire episode, synaptic weight changes are applied based on the accumulated eligibility traces, if the model is in the training phase. For the testing phase, eligibility traces are not recorded, since no synaptic weight update is performed. The eligibility traces can hence be viewed as auxiliary functions, keeping track of the network activity over time for an accurate approximation of BPTT. Note, that eligibility traces have no direct influence on the network output, hence, changes in trunk strength are the only available resource to account for within-episode short-term memorization. Learning associations and stimulus comparisons with local synaptic plasticity and TSP We first tested whether the model is able to learn general associations between sensory input patterns. To this end, we generated two disjunct sets , , each consisting of n Omniglot [34] characters. The network had n output neurons, one for each element in S and each element in Swas associated with one output neuron. Before each episode, we randomly drew a bijective mapping between these two sets to generate n stimulus pairs from, R × S such that each element from the two sets appeared in exactly one pair. These pairs were shown to the network in random order, representing facts to be memorized by the network, see Fig 3A. Afterwards, an input consisting of one element of Rwas shown as a query and the network had to activate the output neuron associated with the paired stimulus. Download: PPT PowerPoint slide PNG larger image TIFF original image Fig 3. Learning associations with local plasticity. (A) An agent observes a sequence of stimulus pairs. After being cued by one of the observed stimuli, it has to indicate the associated one. (B) Number of training episodes needed until the network achieved an accuracy of 80% as a function of association pairs to be remembered (mean and SD over 16 training trials). https://doi.org/10.1371/journal.pone.0313331.g003 The network consisted of d = 128 input neurons in each of the apical and basal input populations, 200 neurons in the memory layer, and n output neurons. 
In order to generate a reasonable higher-level representation of the Omniglot characters, each character of the presented pair was first embedded in 64-dimensional space with a convolutional network pre-trained using a prototypical objective [35] (the pre-training was done on a subset of the Omniglot classes which were not used in R and S). The embeddings were then concatenated into a 128-dimensional vector. Both the apical and basal input populations were then activated with this vector, i.e., xa,t and xb,t were both set to this vector. At query time, one character from R was embedded and concatenated with a 64-dimensional zero vector to obtain xa,t, providing the input to the apical compartment, while there was no input to the basal compartment (xb,t = 0). We trained the network using a reward-based paradigm where it received a positive reward for the correct response at query time and a negative reward otherwise. In this reward-based setting, we used the standard proximal policy optimization (PPO) objective with an entropy bonus [36] (see section ’General simulation details’ in Methods for a definition) to compute the learning signals used for the synaptic weight updates. The network achieved 100% accuracy on this task after about 2000 episodes for n = 5 association pairs. Fig 3B shows how learning time scales with the number of associations to be learned (blue bars). A few hundred episodes suffice for two associations. For comparison, we also trained the network with direct supervision (error signals are determined from the target response; orange bars; see section ’General simulation details’ in Methods). As expected, the increase in learning time is milder, but with a similar tendency. Neural network models can be sensitive to parameter settings. In order to test whether our model is robust to parameter changes, we analyzed the impact of the model hyperparameters as well as the memory layer size on network performance. We found that network performance is stable over a large range of these parameters. See Section B in S1 Appendix for details of the analysis. Another memory-related task frequently used in experiments is the classical delayed match-to-sample task [37]. Here, the animal observes two stimuli separated in time and must produce an action depending on whether or not the two stimuli are equal, see Fig 4A. We modeled this task in the setup described above with a pre-trained convolutional network to embed the stimulus in 64-dimensional space and 200 neurons in the memory layer. The agent first observed one out of five Omniglot characters, followed by eight steps where white noise input was shown. Finally, another character was shown, which was chosen to be the same as the first with probability 0.5 and one of the other characters with probability 0.125 each. The output of the network was then interpreted as an action a ∈ {left, right} indicating a match or non-match. A reward was delivered accordingly, which was used to compute the learning signals for synaptic weight updates. Then, another episode started. Fig 4. Learning of a delayed match-to-sample task with local plasticity. (A) Task schema. The agent observes a stimulus followed by 8 white noise inputs and another stimulus. The agent should choose the left action when the initial stimulus matches the query stimulus. (B) Learning progress in terms of choice accuracy. Green: Only one character instantiation per class for training and testing. Blue: Network is tested on a character not seen during training.
Brown: LSTM in the fixed setting (16 trials; shading indicates standard deviation). https://doi.org/10.1371/journal.pone.0313331.g004 In this task, we also tested whether the network can cope with variance in the input stimuli. In Omniglot, each character class consists of 20 drawings of the character from different people, with significant variance. We compared network performance in a setting where the specific instantiation of the presented character was fixed with a setting where it was drawn randomly from the set in each episode. In particular, for performance evaluation, we used a character instantiation that was not used for training (see Methods for details). Training progress is shown in Fig 4B. The network achieved an accuracy of 95 ± 1.8% on this task, with no significant difference between the fixed-character and sampled-character settings. We wondered whether the generalization capabilities of the network in the sampled-character setting could be fully attributed to a writer-independent representation of the characters in the embedding of the convolutional network. We therefore visualized the variance of embeddings for different samples of a given character using principal component analysis and t-SNE [38], see Section C in S1 Appendix. We found that there is still significant variance in these embeddings. Although the convolutional embedding certainly helps for generalization, this shows that the inputs do not need to be rigidly symbolic. Rather, the network can deal with variability in the input representation provided by the convolutional network. We also tested the performance of a long short-term memory (LSTM) network [39] trained with BPTT, i.e., with non-local plasticity, see Fig 4B. Interestingly, the LSTM was not able to learn this task consistently (it reached performances of around 90% in some trials but failed in others). We considered one LSTM with the same number of neurons and one with the same number of parameters as our network. Fig 4B shows the better performing one. This shows that TSP improves the learning capabilities of neural networks with local synaptic plasticity on this task. While the trunk strength can in principle hold information over arbitrary durations, the noise in the input during the delay period induces trunk strength changes which perturb the memory, leading to a drop in accuracy for increasing delays. See Section D in S1 Appendix for an analysis and discussion. Learning context-dependent reward associations with local synaptic plasticity and TSP We next tested whether the network trained with local synaptic plasticity rules was also able to learn a more complex radial maze task [40]. In this task, the animal is located in an eight-armed radial maze (see Fig 5). It observes one out of four context inputs (in our model, characters from the Omniglot data set), indicating one pair of arms that can be entered (indicated by color in Fig 5A). For each context, one of the two arms contains a reward. At the beginning of each episode, the branch containing the reward is randomly assigned for each of the contexts and held fixed for the episode duration. The animal has to first explore in which of the branches the reward is located and then remember this information for each of the four contexts separately throughout the episode, requiring memorization abilities of the animal. Each episode in this task consisted of 40 trials (i.e., 40 context stimuli and arm choices).
In each episode, the reward locations were chosen randomly at the start and stayed constant throughout that 40-trial episode. Hence, the task demands memorization of the reward location within each episode. The memorized information can then be used to choose the rewarded location in the remaining trials of the episode. Fig 5. Context-dependent reward associations. (A) Schema of the radial maze task. In each trial, one arm pair is accessible to the agent (yellow in the example) and the context cue is presented (Omniglot character). The agent then has to choose the correct arm (left or right) to obtain the reward. (B) Fraction of rewarded actions over learning episodes in the basic radial maze task (blue) and the same task where the rewarding arm is switched after each visit (green; mean and SD over 16 runs). Red: maximum achievable performance. Orange: LSTM in the basic radial maze task. https://doi.org/10.1371/journal.pone.0313331.g005 We modelled this task using a network as described above, where the visual context was embedded in a 64-dimensional vector using the same pre-trained convolutional network, and a 200-neuron memory layer. At the beginning of a trial, one arm pair was chosen randomly out of the four possible pairs and the context stimulus c was presented to the network. The network output was interpreted as an action a ∈ {left, right} to choose one of the available arms. The network then received either a positive or negative reward, which was used to compute learning signals and update synaptic weights. Afterwards, the network observed a summary of this trial through a triple (c, a, r), consisting of the context stimulus c, the chosen action a, and the binary variable r ∈ {0, 1} indicating the received reward. This information could be used by the network to memorize the rewarding action in the given context. Further details regarding the task setup can be found in Methods. We measured the performance of the network through choice accuracy, that is, the average fraction of rewarding choices within episodes (Fig 5B). Since the agent has to guess the rewarding action per context initially, the maximum expected accuracy is 0.95. The network learned this task perfectly within about 2000 episodes. Note that this task also includes some basic form of counterfactual reasoning, since the agent can reason about the reward location when visiting a non-rewarded arm. We also tested a more complex variant of this task where the reward in the visited context switches to the other arm after the visit. This task, too, could be learned perfectly with local synaptic plasticity, within approximately twice as many episodes compared to the non-switching case. An analysis of the learned network solution can be found in Section E in S1 Appendix. We also evaluated the performance of an LSTM network with BPTT on the basic version of the task, see Fig 5B. The LSTM converged towards a solution, but learned much more slowly. Learning question answering tasks with local synaptic plasticity and TSP In the above simulations, we tested our model on standard experimental paradigms: a delayed match-to-sample task and a radial maze task. We next asked whether local synaptic plasticity rules could learn to harness TSP to solve more complex cognitive tasks. One standard benchmark for memory-augmented neural networks is the bAbI task set [17].
It consists of 20 question-answering tasks, where each task is composed of a story consisting of a sequence of up to 325 sentences, followed by a question whose answer can be inferred from the information contained in the story. See Section F in S1 Appendix for example tasks. For our experiments, we used the 10k bAbI dataset to train a network with 200 neurons in the memory layer. According to the benchmark guideline, a task is considered solved if the error rate is less than or equal to 5%. Each sentence of a story was embedded in an 80-dimensional vector, and the sequence of these embeddings was presented to the network sequentially. We first considered a random embedding, where we generated an 80-dimensional random vector for each word using the He-uniform variance scaling method [41]. For a given sentence, the vectors of all words in the sentence were then linearly combined with a position encoding that encodes the position of each word in the sentence, as in [20]. Further details regarding the task setup can be found in Methods. In Table 1 we report the mean error rate of the model over 5 runs for each of the 20 bAbI tasks. Table 1. Comparison of mean errors of a network trained with BPTT vs our local synaptic plasticity on the bAbI tasks 10k (mean and SD over 5 trials). Error rates for tasks solved by using our local synaptic plasticity rules are printed in bold face. BPTT: backpropagation through time; LSP: local synaptic plasticity; LSP joint: joint training where a single network was trained to perform all tasks concurrently. https://doi.org/10.1371/journal.pone.0313772.t001 Using the random embedding, the network was able to learn 13 tasks using local synaptic plasticity (column 3). As a baseline, we considered the same network architecture trained with BPTT (column 2). Notably, all tasks for which the network could be optimized by BPTT could also be learned by our temporally and spatially local synaptic plasticity rules, showing the effectiveness of local learning in our model. Here, the model was trained separately on each of the tasks, resulting in one network for each task. We next tested whether a single network was able to solve all the tasks that could be solved by individual networks by training one model jointly on these tasks (column 4). We found that this was the case and that error rates on a majority of these tasks were even improved, indicating a knowledge transfer between them during learning. In order to test how much a more task-specific sentence representation would improve the results, we also considered a pre-trained embedding which is optimized for the task. Here, we used the random embedding as initialization and trained the embedding end-to-end using BPTT on the task at hand. Then the embedding was fixed, and a fresh network was trained with local synaptic plasticity on this input representation. We tested all tasks that were not solved with the random embedding, and found that three additional tasks could be solved with a better input representation (column 6). Again, all tasks that could be solved with BPTT could also be learned with local synaptic plasticity. Overlapping assembly representations emerge through local learning How does the network solve such tasks after training? In order to answer this question, we analyzed the behavior of a trained network in the “Single supporting fact” task, see Fig 6.
This task involves simple person-location relations such as “John moved to the kitchen” among several persons and possible locations. Although the results are presented with the verb “moved”, we note that there are actually various phrases in the stories that indicate such person-location relations. Those variations do not alter the person-location relation, and the model should thus learn to treat them as equal. To analyze the trained network, we recorded the vectors of trunk strengths that arose from the dataset’s stories. These vectors thus represent the memory state of the memory layer throughout the stories. Subsequently, we applied a non-negative matrix factorization (NMF) [42] to project these memory states into a 20-dimensional space. We then projected the memory states, keys, values, and recall keys into this space. Fig 6. Network analysis for the Single Supporting Fact task. (A) Projection of keys, values, and recall keys based on a non-negative matrix factorization of memory traces after network training. Keys are shown for specific persons, with representations averaged over locations and verbs. The key for John clearly activates 6 components corresponding to possible locations for John. Value representations are shown for specific locations, with representations averaged over persons and verbs. (B) Story sample along with its respective key (top, outer ring), value (top, inner ring), and memory state after memorization (bottom). Each key and value pair predominantly overlaps in a single component, which is then memorized. Additionally, the change in John’s location in the last fact is accurately updated from component 14 to 1, causing component 14 to be deactivated due to the negative term in our Oja-type Hebbian rule. https://doi.org/10.1371/journal.pone.0313331.g006 Fig 6A (top) shows the average activity in the key layer for a given person, where the average was taken over all possible locations and action verbs. For example, the leftmost visualized vector shows sentences of the form “john moved to the ...”, where the location is marginalized out. When comparing the representations for four persons, one observes that keys effectively discriminate between persons with close to orthogonal representations. When we performed the same analysis for a given location, averaging over persons, we did not observe such a structure (not shown). In contrast, such orthogonal representations can be observed for specific locations in the value layer when an average was taken over persons and action verbs (Fig 6A, middle). We found that the orthogonality of representations depends on the memory size and is crucial for task performance, see Section G in S1 Appendix. In summary, we observed that keys effectively discriminate between persons, while values indicate locations. We thus define each person’s key vector as the average of the key vector projections from all possible variations of sentences for this person. Similarly, we define a location’s value vector as the average of the value vector projections from all possible variations of sentences for this location. While key- and value-representations are orthogonal to those of the same layer, we found a systematic overlap between key representations and value representations.
For instance, John’s key vector primarily overlaps with specific locations indicated by the value vectors (components 1, 4, 7, and 14 overlap with office, kitchen, bedroom, and hallway, respectively; Fig 6A, compare top and middle). This analysis can be similarly conducted based on the value vectors. For example, the most activated components of the office’s value overlap primarily with specific people indicated by the keys (components 1, 15, 17, and 19 overlap with John, Mary, Daniel, and Sandra, respectively). The bottom row of Fig 6A shows recall keys during queries for the location of persons (e.g., “where is john”). We observe that these recall keys are very similar to keys during storage operations for facts that include the same person. This representation, which has been learned through local synaptic plasticity in our model, can be used to store relevant information in the trunk strengths of the pyramidal neurons of the memory layer. This is illustrated in Fig 6B, where we illustrate the processing of a simple story. In the first sentence, “Mary moved to the bedroom”, the overlap between the key- and value-representations potentiates the trunk strength of the corresponding pyramidal neurons (component 3 in our projection), which can be observed when projecting the memory state after the Hebbian update into the low-dimensional space (leftmost bottom representation in Fig 6B). This potentiated state is retained after the next two sentences, and new memories are added according to the presented facts. At the last presented fact, “john moved to the office”, John changes his location from the hallway to the office. This change is accurately recorded in the memory: the overlap of component 1 in the key and value represents the new location, while the deactivation of component 14, corresponding to the previous location, occurs due to the negative term in the Hebbian rule. The final state of the memory is then combined with a specific recall key from the question, “Where is John?” The answer is determined by the overlap between their activated components, effectively corresponding to the newly activated component 1, which is part of the office representation – John’s last change of location. Hence, the readout can easily determine office as the correct answer. In summary, the model has learned assembly representations for entities. These representations are partly orthogonal and partly overlapping. An overlap defines a potential association that can be stored in the neurons of the memory layer. Experimental studies in humans have found clear evidence for assembly representations of celebrities and popular places in the medial temporal lobe, with partial overlap [43,44]. According to our model, overlapping assemblies emerge through learning because they are needed for the storage of associations in the memory layer. Discussion We propose in this article that memory-dependent neural processing is jointly shaped by non-synaptic and synaptic plasticity. The interaction between non-synaptic intrinsic and synaptic plasticity has been studied in previous work. These works, however, considered intrinsic plasticity as a homeostatic mechanism to maximize information transfer [45,46], to support unsupervised learning [15,16], or for associative learning [13]. The cooperation of intrinsic and synaptic plasticity has also been studied for information theoretic learning in artificial neural networks [47].
Another aspect of intrinsic plasticity is that it can regulate the dynamical behavior of recurrent network models [48–50]. In contrast to these works, our model shows that non-synaptic plasticity can be used as a memory buffer that is utilized by synaptic plasticity processes. Local plasticity for two-compartment neurons has been studied in [51], but in the context of representation learning. The idea that fast synaptic plasticity can underlie working memory has been proposed already in [52], see [53] for a recent review. This idea has been extensively used in memory-augmented neural networks [20,21,29,30,54–56]. In these models, however, synaptic plasticity was used as a fast memory buffer and the networks were trained with backpropagation through time. Fast synaptic Hebbian plasticity was combined with reward-based learning in [57] to learn a navigation task. From the neuroscience perspective, our use of non-synaptic plasticity as a fast memory is consistent with the observation that non-synaptic plasticity can act on a fast timescale [6,9,10]. But what could be functional advantages of non-synaptic neuron-specific plasticity? Our study indicates one potential advantage: local learning rules for synaptic connections can be used to shape the circuitry around the memory units. The proposed network model could be improved in several directions. While we were able to show that local learning works very well when compared with BPTT (see e.g. Table 1), the simple memory architecture is limiting. Hence, when compared to H-Mem [21], a brain-inspired memory network without local learning, our model fails on a number of bAbI tasks which can be solved by H-Mem. The fast memory in models such as H-Mem is implemented by a fully connected synaptic weight matrix between a neuron layer that represents the key vector and a neuron layer that represents the value vector. This more complex memory architecture prevents the derivation of local plasticity rules with the method utilized in this article. It is an interesting question whether biologically plausible learning mechanisms could be derived for this case as well. Since Hebbian branch plasticity similar to our trunk strength plasticity has been found in oblique dendrites of pyramidal cells, it would be interesting to investigate models where each oblique branch acts as a memory unit similar to our trunk strength. This could significantly boost the memory capacity of the model. Another limitation of our model is that learning can take long. In the pattern association task (Fig 3), learning of associations of two pattern-pairs needs a few hundred episodes, which increases with the number of associated pairs (Fig 3B). Here, curriculum setups and more specialized circuit architectures are potential candidates for improvement. We emphasize, however, that unseen associations between previously observed patterns can be memorized instantly after training by our model. One basic assumption of our model is that the coupling strength between the apical dendrites and the soma can be modified by sub-cellular processes. Local plasticity of excitability of oblique dendrites in hippocampal pyramidal cells has been reported in [26]. The regulation of the somatic coupling of apical dendrites, which we refer to as trunk strength plasticity in this article, could be implemented through the regulation of hyperpolarization-activated channels, which generate the Ih current [23]. It has been shown that these channels strongly influence the dendro-somatic coupling of apical dendrites [27].
Other potential mechanisms include sodium-, calcium-, or potassium-conductances that facilitate calcium spike generation. These mechanisms would result in a change of calcium dynamics at specific locations of the pyramidal cell through learning, which could be observed with imaging techniques in behaving animals. Our derived learning rules can be classified as extended three-factor learning rules with eligibility traces [33]. Three-factor rules implement learning rules of the form (14)(15) where ej is the eligibility trace for synapse j, the third factor is a neuron-wide signal such as a neuromodulatory input, α and β are constants, and f and g are functions of the presynaptic input and the postsynaptic activity, respectively, that are specific to the considered three-factor rule [33]. Typically, the eligibility trace decays (since 0 < α < 1), and is increased by coincident pre- and postsynaptic activity. There is experimental evidence for such rules in various brain areas such as striatum [58], cortex [59], and hippocampus [60–62]. One prominent three-factor rule with strong experimental support is the synaptic tagging-and-capture hypothesis [63]. There, a synapse is tagged by joint pre- and postsynaptic activity, and consolidation occurs up to an hour later via a third factor that is a neuromodulatory input. Three-factor rules are also quite often used in models for reward-based learning [64–67]. Our rules necessarily extend these rules, since we have apical and basal postsynaptic activations which enter the rule, whereas the standard three-factor formulation is based on point neuron models. We would like to compare this general framework to our plasticity rule for basal synapses. To simplify the discussion, we re-formulate the rule by neglecting the impact of h and the homeostatic terms in (10)–(12) to obtain (16)(17) Comparing to the generic formulation, we see a similar structure, with the learning signal taking the role of the third factor. In addition, weight changes are modulated by apical activation. This is reminiscent of behavioral timescale plasticity [62], which can also be interpreted as a three-factor rule where the calcium plateau potential (complex spike) is the third factor [33]. The eligibility trace update (17) and its counterpart in the generic formulation also have a very similar structure. We can however observe two differences. First, eligibility traces do not decay exponentially; rather, the decay depends on coincident basal and apical activity. Second, the eligibility trace is increased in (17) when presynaptic input is paired with basal and apical activity. Hence, in general the rules fit well into the three-factor framework, with an extension that takes the separation of the basal and apical compartments into account. While the derived eligibility traces satisfy temporal locality and the utilized signals are local to the neurons, those traces depend on these signals in a complex manner. In particular, they depend on both the apical and basal activities. Although not both of these signals are directly local to the respective synapse, evidence for bi-directional communication between the soma and distal apical dendrites has been observed [68,69], a mechanism that is thought to support coincidence detection between projected feed-forward basal and feed-back apical inputs. Such coincidence-based plasticity appears in our rules as potentiation of the synapse for coincident apical and basal activation.
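For reference, the generic three-factor structure described above (the content of Eqs (14) and (15)) can be sketched in discrete time as follows; the symbol names are ours, and the precise formulation in [33] may differ in detail.

```latex
% Generic three-factor rule with an eligibility trace (discrete-time sketch).
% e_{j,t}: eligibility trace of synapse j;  M_t: third factor (e.g., neuromodulation);
% x_{j,t}: presynaptic input;  y_t: postsynaptic activity;  0 < \alpha < 1,  \beta > 0.
\begin{align}
  e_{j,t}        &= \alpha \, e_{j,t-1} + \beta \, f(x_{j,t}) \, g(y_t), \\
  \Delta w_{j,t} &= M_t \, e_{j,t}.
\end{align}
```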
While it is unlikely that exact implementations of our proposed synaptic plasticity rules exist in the brain, it would be interesting to investigate to what degree biological plasticity processes overlap with the synaptic plasticity rules proposed in this work. In our model, a fact is processed such that a key representation is provided to the apical dendrites of pyramidal cells in the memory layer and a value representation is provided to the basal compartments. Experimental evidence suggests that the apical region of cortical pyramidal cells receives top-down input while sensory feed-forward information arrives at the basal region [70–73]. From this perspective, keys in our model could be interpreted as a higher-level, abstracted conceptual representation of the associated information, whereas values represent the lower-level associated stimuli. Upon recall, we only activate the apical dendrites. Hence, from this perspective, a query in our model is composed of higher-level conceptual representations that recall a lower-level representation. We note, however, that the distinction between bottom-up and top-down input is less clear in hippocampal networks. We have found that task-dependent learning can lead to overlapping assembly representations (see Fig 6), which is reminiscent of representations in the human medial temporal lobe [43,44]. More recent work has shown that memories of temporally close events tend to be stored in more overlapping neural assemblies than memories of temporally more distant events [74]. This effect was examined in a modeling study [75]. Our model was not designed to generate time-distance dependent assembly overlap. The overlap that we observe in Fig 6 was learned by the model since it has a functional role for this specific task. Nevertheless, we performed additional simulations and did observe a function-independent overlap that reduces for longer temporal distances, see Section H in S1 Appendix. In summary, we have shown that non-synaptic plasticity enables memory-dependent learning with local synaptic plasticity rules. The resulting learning architecture is quite powerful, leading to results comparable to BPTT on bAbI. To the best of our knowledge, our model is the first network that is able to learn bAbI tasks with local learning rules. The involvement of non-synaptic plasticity in memory formation has been demonstrated experimentally [6,76]. Our model proposes a functional role for it in synergy with synaptic plasticity. Materials and methods In this section, we provide detailed information about the model and derive the local synaptic plasticity rules. Memory network model Our network model consists of 2d input neurons separated into apical and basal populations with activities xa,t and xb,t, respectively. The input neurons project to the apical and basal compartments of m memory neurons in a fully-connected manner via weight matrices Wapical and Wbasal, respectively. For the i-th neuron in this memory layer, the resulting basal activation and apical activation are given by (18) with the rectified linear nonlinearity σ(s) = max{0, s} and pre-activations (19) with xa,t and xb,t the apical and basal input vectors, respectively, at time step t. We included a bias in each neuron by providing a constant input. In this way, the derivation of the update rule for the bias can be performed together with that of the synaptic weights. In the resulting rules, only the corresponding input has to be set to 1 in order to obtain the simplified bias updates.
The output of each cell is then given by a linear combination of these two activations, whereof the apical activation is multiplied with the trunk excitability: (20) The trunk excitability is updated according to the Hebbian update (21) with σ the ReLU non-linearity, constant parameters, and a maximum trunk strength. The network output was then computed as (22) for output weight matrix Wout, where the memory layer output is determined by (20). Note that in the reinforcement learning case, we further split this readout weight matrix into wV and Wπ, respectively the value and policy weights, but here this separation is omitted to simplify notation. For these cases, Wout can be interpreted as a concatenation of wV and Wπ into a single matrix. We distinguish between a memorization event, where the memory is updated according to (21) and the model output is discarded, and a memory recall event, where the model produces an output zt based on the memory state ht as in (20), but no memory update is performed. During memory recall events, only the apical input population is active; the basal input population is defined to be silent, i.e., xb,t = 0. In the reinforcement learning setup, we execute a memorization event followed by a recall event at every single time step. The network output of the recall event is then used to update the state value estimator. In the supervised learning setup, network outputs are only needed when an action is demanded (we call this a query step and the input at this step a query). Therefore, we perform memory recall events at query steps, and memorization events are performed at the other time steps. Derivation of local synaptic plasticity rules Here, we derive gradient-based local plasticity rules for the slow synaptic weights Wout in the output layer and Wapical, Wbasal in the memory layer (in contrast to the fast trunk strengths in the memory layer neurons). In vanilla backpropagation through time (as commonly used for training recurrent networks), symmetric feedback connections are used to back-propagate the error gradients through the unrolled network. These symmetric feedback weights are generally considered biologically implausible due to the weight transport problem [77]. In this work we use the e-prop framework [32] to derive local learning rules for every synapse that require neither the application of symmetric weights nor the propagation of errors to preceding time steps. Output layer. For the readout layer weights Wout, local learning signals are already available because the output of this layer is directly used to obtain the error signal E. We can therefore update the output weight from memory layer neuron i to output neuron k with (23) where η > 0 denotes the learning rate. Basic e-prop framework and feedback alignment. To perform gradient descent on weights in the memory layer, we need to compute the gradients (24) with η > 0. The second term can be expressed as an eligibility trace, see below. The first term, however, necessitates the backpropagation of error signals through time. One core idea of the e-prop formalism [32] is to replace this gradient by a temporally local approximation, (25)(26) where the index k denotes the k-th neuron in the output layer. The resulting quantity is interpreted as a neuron-specific learning signal for neuron i in the memory layer. In the above equations, we distinguish between total and partial derivatives. A function f may depend on many variables, many of which may themselves depend on a variable y.
The total derivative takes all these dependencies into account, while the partial derivative considers only the direct dependency of f on y. More details about this notation can be found in [32], where the e-prop formalism was proposed. This approximation implements feedback alignment, where the symmetric feedback weights are replaced by randomly chosen weights [78]. In our simulations, we used the adaptive e-prop method [32], where the feedback weights are initialized randomly and then undergo the same weight updates as the corresponding weights in Wout. In addition, these two weight matrices are subject to L2 weight decay. This ensures that, after some iterations, Wout and the feedback weights converge to similar values. Memory layer. By making use of the learning signals from the previous section, we can find local synaptic plasticity rules to update the synaptic weights Wapical and Wbasal of the apical and basal compartments, respectively. We derive local gradient approximations to minimize the error E through the e-prop formalism [32]. With these local gradients, it is not necessary to back-propagate the error signal through all time steps of the computation. Instead, eligibility traces are forward-propagated and used in conjunction with learning signals to determine the gradient required for the weight updates. Apical compartment. To find the derivative of the error function E with respect to an apical weight, we factorize (27) Now we use our learning signal to approximate the first factor, which is a core idea in the e-prop formalism [32]. We also expand the remaining term by applying the product rule, resulting in (28) Unrolling in time, we obtain (29) We can re-write this definition as a recursive function and hereby introduce our eligibility trace: (30) We observe that this recurrent definition only depends on partial derivatives that are local in time and can therefore be easily derived as (31)(32) where H(x) is the Heaviside step function, used as the derivative of the ReLU function. By plugging (31) and (32) into (30), we obtain (33) with (34)(35) as stated in the main text. We finally use this eligibility trace as a substitution in the factorized gradient: (36) where the second term is the derivative of our apical activation with respect to the apical weight. Basal compartment. Analogously, we compute the derivative of the error E with respect to our basal weight matrix Wbasal. We again start off by inserting our learning signal: (37)(38) Because of our assumptions, the learning signal is only non-zero during recall events, and the basal compartment does not receive any input at these time steps (i.e., xb,t = 0); the corresponding direct derivative term therefore vanishes. Analogously to the apical case, we can express the remaining term recurrently via an eligibility trace: (39) where the local derivative was already defined above. To obtain the remaining factor, we can simply compute the corresponding local derivative: (40) Inserting this back into (39) yields (41), with terms analogous to the apical case. Taken together, our approximated derivative reads (42) Weight updates in simulations The above derived approximate gradients were accumulated over mini-batches of training examples and used for parameter updates in our model. In practice, we used these approximate gradients in the Adam [79] algorithm, which amounts to a synapse-specific learning rate based on a global learning rate, along with a momentum update. We used the default values for the other hyperparameters of Adam. We also used an L2 regularization term for each of the synaptic weight matrices. The hyperparameters for learning are specified in subsection ’Hyperparameters’ below.
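To make the overall procedure concrete, the following Python sketch puts the model and learning scheme together for a single supervised episode. All variable names are ours, the Oja-type form of the trunk update and the squared-error readout are simplifying assumptions, and the eligibility-trace computations of Eqs (30)–(42) are only indicated by comments; this is not the released implementation (see the Code source section).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

class TSPMemoryLayer:
    """Minimal sketch of the memory layer with trunk strength plasticity."""

    def __init__(self, d, m, eta_h=0.5, h_max=2.0, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W_apical = rng.standard_normal((m, d)) / np.sqrt(d)
        self.W_basal = rng.standard_normal((m, d)) / np.sqrt(d)
        self.h = np.zeros(m)                    # trunk strengths (fast memory)
        self.eta_h, self.h_max = eta_h, h_max

    def _activations(self, x_a, x_b):
        a = relu(self.W_apical @ x_a)           # apical activation ("key")
        b = relu(self.W_basal @ x_b)            # basal activation ("value")
        return a, b

    def memorize(self, x_a, x_b):
        # Memorization event: Hebbian update of the trunk strengths.
        # An Oja-type form with saturation is assumed here; the exact rule
        # of Eq (21) may differ in its details.
        a, b = self._activations(x_a, x_b)
        self.h = np.clip(self.h + self.eta_h * a * (b - self.h * a), 0.0, self.h_max)

    def recall(self, x_a):
        # Recall event: only apical input, basal input silent, no memory update.
        a, b = self._activations(x_a, np.zeros(self.W_basal.shape[1]))
        return b + self.h * a                   # per-neuron output z_t

# Schematic supervised episode (squared-error readout assumed for simplicity).
d, m, c = 128, 200, 5
rng = np.random.default_rng(1)
layer = TSPMemoryLayer(d, m, rng=rng)
W_out = rng.standard_normal((c, m)) / np.sqrt(m)
B = rng.standard_normal((c, m)) / np.sqrt(m)    # random feedback weights (e-prop)

facts = [rng.standard_normal(2 * d) for _ in range(5)]      # placeholder fact inputs
query, target = rng.standard_normal(d), np.eye(c)[2]        # placeholder query/target

for x in facts:                                 # memorization events
    layer.memorize(x[:d], x[d:])
    # Eligibility traces for W_apical and W_basal would be updated here
    # according to Eqs (30)-(42); omitted in this sketch.

z = layer.recall(query)                         # recall at the query step
err = W_out @ z - target                        # output error
W_out -= 0.01 * np.outer(err, z)                # Eq (23)-style readout update
learning_signal = B.T @ err                     # neuron-specific learning signals
# The memory-layer weight updates would combine learning_signal with the
# accumulated eligibility traces and be applied after the episode.
```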
For efficiency reasons, we accumulate the synaptic weight changes over mini-batches of episodes before applying them, which is common practice in neural network training. Temporal normalization of input neuron activity In order to stabilize training, activity normalization techniques such as batch normalization and layer normalization are often used in artificial neural networks. In biological networks, such normalization may be carried out through inhibitory networks. In contrast to standard techniques, we have to ensure in our model that no information can be inferred from the future; therefore, normalization needs to be performed in an online manner. We normalized the apical and basal input activity vectors component-wise over the temporal dimension t. In our case, we use an online mean-variance computation based on [80]: (43)(44)(45) with the input at time step t, an online approximation of its mean, and the cumulative squared difference between input and mean; the normalized inputs are then defined from these quantities. For each task, we calculated these statistics online during the training phase across all training samples and applied them as constants for the normalization during inference. This procedure is common practice and coherent with standard normalization techniques like Batch Normalization [81]. General simulation details Hyperparameter decay. We applied an exponential decay to some of the hyperparameters. The decay function of a hyperparameter Θ was defined as (46) with a constant decay rate and delay. This decay was used for the general learning rate of the Adam optimizer, the entropy coefficient ξ used in the reinforcement learning setups (these parameters are described below), and the Hebbian parameters. Reinforcement learning. For the reinforcement learning experiments, we used Proximal Policy Optimization (PPO) [82] with an entropy bonus [83]. For this setup, the output layer weights Wout were split into value weights wV and policy weights Wπ, with l different available actions. The former are used to compute a scalar state value estimate, the latter to compute the policy vector from the network output. The policy vector represents a probability distribution over the l possible actions at time step t. To train both outputs jointly, we defined an error function consisting of an additive combination of a policy error and a state value estimation error. This corresponds to an actor-critic setup, where the network learned to represent both actor (via the policy) and critic (via the state value estimate). The overall error E could therefore be defined as (47) with the state value error and policy error defined below. State value error. The error for our state value estimate was computed by using the temporal difference (TD) error [84]: (48) with the reward, the discount factor γ, and the model parameters; the error is measured with the squared L2 norm. A stop-gradient notation is used to indicate that the corresponding term is treated as a constant for gradient computations, and therefore no gradients are propagated through it. Policy error. The policy error was defined as the negative of the PPO objective: (49) Here, ξ > 0 is the entropy coefficient, decaying exponentially according to the decay schedule above, and a clipping parameter defines the interval for clipping the probability ratio as in [82]. Again, the stop-gradient notation indicates that the corresponding term was treated as a constant during calculation of the error gradient. The advantage estimate for the action taken at time step t is defined by (50) using a temporal difference residual [85] to ensure temporal locality of the learning signal. For the reinforcement learning tasks, we wanted to enable the model to write and read at every time step.
Hence, we let the model perform a memorization event followed by a recall event at each time step t. Supervised learning. In the supervised learning case, the network output is computed as described above. The supervised error function was defined as the cross-entropy between a softmax of the model output vector at time step t and the target vector, (51). Note that, in contrast to the reinforcement learning setup where an error is available at each time step t, here the error can only be non-zero at the time step where the target signal is received. Hyperparameters. For all our experiments we used 200 memory neurons, a batch size of 32, a global norm gradient clip of 10, and L2 regularization. For our reinforcement learning experiments we used 0.9 as the discount rate γ and 0.2 as the PPO clip ratio. The task-dependent hyperparameters, as well as the hyperparameters used to obtain the LSTM baseline, are reported in Tables 2, 3 and 4. The hyperparameters for both the memory network model and the LSTM were tuned using Hyperband [86], a standard hyperparameter optimization technique, with a similar computational budget. To provide a fair comparison, in addition to hyperparameter tuning, we considered two LSTM configurations: one where the number of LSTM hidden units was the same as in our model but with more trainable parameters, and another with the same number of trainable parameters. We report the configuration with the same number of trainable parameters that had the best performance. Table 2. Hyperparameters: Supervised learning. https://doi.org/10.1371/journal.pone.0313772.t002 Table 3. Hyperparameters: Reinforcement learning. https://doi.org/10.1371/journal.pone.0313772.t003 Table 4. Hyperparameters: LSTM. https://doi.org/10.1371/journal.pone.0313772.t004 Datasets Omniglot dataset. Omniglot [34] (under MIT License) is a dataset of 1625 handwritten characters from 50 different alphabets, where each character has been drawn by 20 different people. Hence, there are 1625 classes with 20 samples each in this dataset. We pre-trained a 64-dimensional embedding using a 1-shot prototypical network [35] on 965 Omniglot classes. For our experiments, the remaining 660 classes (that were not in the pre-training dataset) were then used, embedded by the previously pre-trained network. Note that these classes were not just held-out samples, but completely new characters. bAbI dataset. The bAbI question answering tasks dataset [17] (under BSD License) is a collection of 20 different tasks, each of which consists of stories written in natural language. A sample can be seen as a sequence of statements (a story), where at the end of each story a question is asked about the content of the previously shown sentences. Depending on the task, the questions often require the combination of multiple statements in order to provide a correct answer. These tasks were designed in a way that they require different skills of natural language processing like induction, deduction, and chaining of facts, which are natural for humans, but hard for machine learning models. Simulation details for association learning task The association learning task aimed to test whether the model was capable of learning to memorize and recall association pairs.
We performed this task in two different learning scenarios: a reinforcement learning scenario where the model only obtained a reward indicating whether the output action was correct, and a supervised learning setup where the cross-entropy error between the network output and the target output was computed. For this task we compiled two different sets R and S, each consisting of n Omniglot [34] characters from different classes. Each trial in this task consisted of n associations between two stimuli, bijectively mapping each single character from set R to a single character from set S. Our input vectors for time steps t ∈ {1, …, n} were assembled by concatenating two 64-dimensional embedded vectors (embeddings calculated from the prototypical network, see section ’Omniglot dataset’ in Methods for details), representing one character from set R and one character from set S, respectively. This resulted in an input time series of n such pair vectors, followed by a query vector. In this query, a character that is the same as one of the previously shown characters from R (with index τ ∈ {1, …, n}) was presented, concatenated with a zero vector instead of its associated character from S. At each trial, the bijection between R and S as well as the sequential order of characters were randomly drawn to ensure an unpredictable set of associations and sequential presentation at each trial. Reinforcement learning variant. At query time, the model produced an n-dimensional output where each element is assigned to a specific character in S. If the model successfully predicted the previously associated character, it obtained a reward of +1; if it failed, it instead received a reward of −1. For time steps t ≤ n, where the associative stimuli were presented to the network, no reward was available, since no output action needed to be taken. This corresponds to a reinforcement learning setup with a delayed reward, where we masked the network outputs during the time steps where the associations were presented, since no meaningful policy error can be computed when no action is required. The value estimate output, however, was not masked, because it was used to calculate the error for the synaptic weight update. The PPO objective was computed as explained in the ’General simulation details’ section. Supervised learning variant. In the supervised case, we provided, as a label, the index of the character in S bijectively associated with the query character. The error was then computed for the model prediction at the query presentation. The supervised objective was computed as explained in the ’General simulation details’ section. Simulation details for delayed-match-to-sample task In the delayed match-to-sample task, a single Omniglot character from a pre-generated set P of 5 different characters (from different classes) was drawn and presented as a 64-dimensional embedded input vector in the first time step t = 1, followed by 8 time steps of white random noise (with zero mean and standard deviation of 1). The query stimulus was again a character from the same pre-generated set, either matching or not matching the previously presented class. We considered two different setups (see below for details), where in one setup (the fixed case) the input set P contained a single sample for each class, whereas in the second (the sampled) case, P contained multiple different instances for each class. The model output was a 2-dimensional policy vector representing the match and no-match actions.
If the correct action was selected, a reward of +1 was obtained, otherwise −1. As for the association learning task, the policy output was masked for all time steps t where no action was required. We asked whether the model could generalize to the case where input stimulus and query stimulus have to be matched even if they are not identical, but rather two different samples from the same class. Therefore, we evaluated this task in a fixed and a sampled version. In the fixed case, a single instance of each class was added to the input set P. In the case of a match between the two stimuli, the input stimulus and query stimulus were hence always the exact same sample (i.e., the same Omniglot character drawn by the same person). In the sampled case, multiple instances of the same class were added to P. Hence, the input stimulus and query stimulus could be different samples, even though they were from the same class. In this case, the model should also output a match decision. Moreover, in the sampled case, we used different, non-overlapping input sets for training and testing, further probing the generalization capabilities of the model. To accomplish this, we split our set P into two different subsets: one containing 18 samples for each of the 5 different classes, and one containing two samples for each of the same 5 classes that were not contained in the first subset. This way we could make sure that the model sees the same classes during training and testing, but not the same instances. The PPO objective was computed as explained in the ’General simulation details’ section. Simulation details for context-dependent reward association (radial maze) task Task setup. In this task we modelled a radial maze consisting of 8 branches, organized into 4 pairs of consecutive branches, see Fig 5A in the main text. Each pair of branches is associated with a distinct visual context represented by an Omniglot character. During the maze’s initialization, only one branch in each pair contains a reward, creating a challenging memorization task for the agent. Each learning episode consisted of 40 consecutive trials, where, before each episode, the locations of the rewarding objects were shuffled. Four different Omniglot characters were randomly chosen and embedded in 64-dimensional space (using a pre-trained prototypical network, see section ’Omniglot dataset’ in Methods) to indicate the contexts. Each single trial consisted of two time steps: a query step, where the model had to take a decision based on the context, and a fact step, where the model received information about the outcome of the action performed in response to the query. During a query step at time t for trial s, the input vector was compiled by concatenating the 64-dimensional context vector with a 3-dimensional zero vector. The model computed a 2-dimensional policy vector deciding between the two actions left and right. A positive reward was given if the rewarding branch was chosen, otherwise a negative reward was given. After obtaining the reward, the model received a new input vector for the fact step. This vector consisted of the concatenation of the context vector, the previously chosen action as a 2-dimensional one-hot vector, and a scalar value indicating whether the chosen branch was rewarding or not. The episode then continued with the query step of the next trial using a new pair with a different context c. More explicitly, the input time series consisted of alternating query and fact inputs over S = 40 trials.
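The structure of a single radial-maze trial, as described above, can be sketched as follows; the function and variable names are ours, and the reward magnitudes of ±1 are placeholders for the positive and negative rewards mentioned in the text.

```python
import numpy as np

def radial_maze_trial(context_vec, rewarded_arm, policy, rng):
    """One trial of the radial maze task: a query step followed by a fact step.

    context_vec: 64-dimensional context embedding; rewarded_arm: 0 (left) or 1 (right);
    policy: callable mapping an input vector to a 2-dimensional probability vector.
    """
    # Query step: context embedding concatenated with a 3-dimensional zero vector.
    query_input = np.concatenate([context_vec, np.zeros(3)])
    action = rng.choice(2, p=policy(query_input))       # choose left or right
    reward = 1.0 if action == rewarded_arm else -1.0    # placeholder magnitudes
    # Fact step: context, one-hot action, and a scalar reward indicator.
    fact_input = np.concatenate([context_vec, np.eye(2)[action],
                                 [1.0 if reward > 0 else 0.0]])
    return query_input, action, reward, fact_input

# usage sketch with a uniform random policy
rng = np.random.default_rng(0)
out = radial_maze_trial(rng.standard_normal(64), rewarded_arm=1,
                        policy=lambda x: np.array([0.5, 0.5]), rng=rng)
```

Note that padding the query input with a 3-dimensional zero vector makes query and fact inputs the same length (64 + 3 = 64 + 2 + 1), presumably so that both step types can be fed to the same input population.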
We tried two different variants of this task: one where the locations of the rewarding objects for each context c stayed fixed for the whole episode, and one where, after the fact step, the location of the rewarding object for the previously presented context was switched. The PPO objective was computed as explained in the ’General simulation details’ section. Simulation details for question answering task (bAbI task) Each task consists of 10,000 data samples, split into a training and a test set. A sample (or story) consists of a sequence of s sentences called facts, followed by a query sentence q and a target answer a. The j-th word of the i-th sentence in a story can be described as a one-hot encoded vector of size V, where V is the vocabulary size. We consider a sentence i as consisting of w such word vectors. In order to embed each sentence i into a d-dimensional vector, we first generate a word embedding matrix A using the He-uniform variance scaling method [41]. We then create a sentence representation that encodes the position of words within a sentence (as proposed in [20]). The authors call this type of representation position encoding (PE). For a given sentence, the representation is a sum over the word embeddings, each weighted element-wise (where ∘ denotes the Hadamard product) by a position-dependent column vector. The query sentence q is also embedded using A. In our experiments, we applied two types of embedding strategies: first, we generated an embedding as described above and kept it fixed for the entire training. We term this method random embedding because the word embedding matrix A is randomly sampled for each run. For the second strategy, we still generated an embedding as described above, but trained the embedding end-to-end with our memory model for each task using BPTT; the embedding (i.e., the word and sentence embedding) was then extracted and used as a pre-trained, fixed embedding for the same model trained with local synaptic plasticity. We term this method learned embedding. We trained our model on each of the 20 bAbI tasks [17] separately at first, then identified which of the tasks were solved (with a test error < 5%), and finally trained the same model again jointly on these tasks. In the joint setup, the stories from the identified tasks were collected together into one training set. Stories were then shuffled before each training epoch. The network was then trained on this training set, without receiving explicit information about which of the tasks each story originated from. The readout matrix Wout was designed such that the computed network output vector y matched the vocabulary size V, meaning that the network output (after applying a softmax) was interpreted as a probability vector over the word dictionary. The error was then computed according to the cross-entropy objective defined above. In most of the tasks, the output is only a single word, but, for example in task 19, the answer can be “n,w”, which is technically two words (‘n’ for north, ‘w’ for west) concatenated using a comma. To not violate the constraint that we interpret the output of the model as a choice of a single word, we defined every n-gram appearing as an answer as a single word and added it to the dictionary. In the joint setup, we used a single softmax output layer representing the vocabulary over all identified tasks; this single output layer was shared across all tasks. Code source Source code under GNU General Public License v3.0 is available at https://github.com/igiTUGraz/tsp.
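As an illustration of the position-encoding sentence embedding described above, the following sketch shows our reading of the scheme from [20]; the indexing convention, the shape of A, and the He-uniform-style initialization are assumptions and may differ from the actual implementation.

```python
import numpy as np

def position_encoding(num_words, dim):
    """Position-encoding weights l[k, j] as proposed in [20] (our sketch)."""
    J, d = num_words, dim
    j = np.arange(1, J + 1)[None, :]                 # word position within the sentence
    k = np.arange(1, d + 1)[:, None]                 # embedding dimension index
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)   # shape (d, J)

def embed_sentence(word_ids, A):
    """Embed a sentence as a position-weighted sum of its word embeddings.

    A: word embedding matrix of shape (d, V); word_ids: word indices into the vocabulary.
    """
    words = A[:, word_ids]                           # (d, J) word embeddings
    l = position_encoding(len(word_ids), A.shape[0]) # (d, J) position weights
    return (l * words).sum(axis=1)                   # Hadamard product, summed over words

# usage with a random He-uniform-style embedding (V = 30 words, d = 80)
rng = np.random.default_rng(0)
V, d = 30, 80
A = rng.uniform(-np.sqrt(6 / V), np.sqrt(6 / V), size=(d, V))
sentence_vec = embed_sentence([3, 7, 1, 12], A)      # embeds a 4-word sentence
```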
Memory network model Our network model consists of 2d input neurons separated into apical and basal populations with activities xa , t and, xb , t respectively. The input neurons project to apical and basal compartments of m memory neurons in a fully-connected manner via weight matrices Wapical and, Wbasal respectively. For the i-th neuron in this memory layer, the resulting basal activation and apical activation are thus given by: (18) with the rectified linear nonlinearity, σ ( s ) = max ⁡ { 0 , s } , and pre-activations (19) with xa , t and xb , t as apical and basal input vectors respectively, at time step t. We included a bias in each neuron by providing a constant input . In this way, the derivation of the update rule for the bias can be performed together with the synaptic weights. In the resulting rules, only the input has to be set to 1 in order to obtain the simplified bias updates. The output of each cell is then given by a linear combination of these two activations, whereof is multiplied with the trunk excitability : (20) The trunk excitability is updated according to the Hebbian update (21) with σ the ReLu non-linearity, parameters , and is the maximum trunk strength. The network output was then computed as (22) for output weight matrix Wout, where is determined by (20). Note, that in the reinforcement learning case, we further split this readout weight matrix into wV and Wπ, respectively, the value and policy weights, but here this separation is omitted to simplify notation. For these cases, Wout can be interpreted as a concatenation of wV and Wπ into a single matrix. We distinguish between a memorization event, where the memory is updated according to and the model output is discarded, and a memory recall event, where the model produces an output zt based on the memory state ht as in , but no memory update is performed. During memory recall events, only the apical input population is active, the basal input population is defined to be silent, i.e. xb , t = 0. In the reinforcement learning setup, we execute a memorization event followed by a memory recall event at every single time step. The network output of the recall event is then used to update the state value estimator. In the supervised learning setup, network outputs are only needed when an action is demanded (we call this a query step and the input at this step a query). Therefore, we perform memory recall events at query steps and memorization events are performed for other time steps. Derivation of local synaptic plasticity rules Here, we derive gradient-based local plasticity rules for the slow synaptic weights Wout in the output layer and Wapical , Wbasal in the memory layer (in contrast to the fast trunk strengths in the memory layer neurons). In vanilla backpropagation through time (as commonly used for training recurrent networks), symmetrical feedback connections are used to back-propagate the error gradients through the unrolled network. These symmetrical feedback weights are generally considered biologically implausible due to the weight transport problem [77]. In this work we use the e-prop framework [32] to derive local learning rules for every synapse, that require neither the application of symmetric weights, nor the propagation of errors to preceding time steps. Output layer. For the readout layer weights Wout, local learning signals are already available because the output of this layer is directly used to obtain the error signal E. 
Derivation of local synaptic plasticity rules Here, we derive gradient-based local plasticity rules for the slow synaptic weights Wout in the output layer and Wapical and Wbasal in the memory layer (in contrast to the fast trunk strengths in the memory layer neurons). In vanilla backpropagation through time (as commonly used for training recurrent networks), symmetric feedback connections are used to back-propagate the error gradients through the unrolled network. These symmetric feedback weights are generally considered biologically implausible due to the weight transport problem [77]. In this work we use the e-prop framework [32] to derive local learning rules for every synapse that require neither symmetric feedback weights nor the propagation of errors to preceding time steps. Output layer. For the readout layer weights Wout, local learning signals are already available because the output of this layer is directly used to obtain the error signal E. We can therefore update the output weight from memory layer neuron i to output neuron k with (23), where η > 0 denotes the learning rate. Basic e-prop framework and feedback alignment. To perform gradient descent on the weights in the memory layer, we need to compute the gradients (24), with learning rate η > 0. The second term can be expressed as an eligibility trace, see below. The first term, however, necessitates the backpropagation of error signals through time. One core idea of the e-prop formalism [32] is to replace this gradient by a temporally local approximation, (25)(26), where the index k denotes the k-th neuron in the output layer. Here, the resulting quantity is interpreted as a neuron-specific learning signal for neuron i in the memory layer. In the above equations, we distinguish between the total derivative and the partial derivative of a function f with respect to a variable y. The function f may depend on many variables, several of which may themselves depend on y. The total derivative takes all these dependencies into account, whereas the partial derivative considers only the direct dependency of f on y. More details about this notation can be found in [32], where the e-prop formalism was proposed. This approximation implements feedback alignment, where the symmetric feedback weights are replaced by randomly chosen weights [78]. In our simulations, we used the adaptive e-prop method [32], where the feedback weights are initialized randomly and then undergo the same weight updates as the weights in Wout. In addition, these two weight matrices are subject to L2 weight decay. This ensures that, after some iterations, Wout and the feedback weights converge to similar values.
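As an illustration of how the learning signals described above can be formed without weight transport, the sketch below projects the output error back through a random feedback matrix B (feedback alignment) and applies the adaptive e-prop idea of giving B the same gradient-based update and L2 decay as Wout. The squared-error choice, the dimensions, and the learning rates are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
m, c = 200, 10
W_out = rng.normal(0.0, 0.1, (c, m))
B = rng.normal(0.0, 0.1, (c, m))     # random feedback weights (feedback alignment)

def output_error(y, y_target):
    # For a squared error E = 0.5 * ||y - y_target||^2, dE/dy = y - y_target.
    return y - y_target

def learning_signal(dE_dy):
    # Neuron-specific learning signal for the memory layer: the error is
    # projected back through the random feedback matrix B instead of W_out.T,
    # avoiding the weight transport problem.
    return B.T @ dE_dy               # shape (m,)

def adaptive_eprop_update(z, dE_dy, eta=1e-3, weight_decay=1e-4):
    # Adaptive e-prop: W_out and B receive the same gradient-based update and
    # the same L2 decay, so they converge toward similar values over time.
    global W_out, B
    grad = np.outer(dE_dy, z)        # dE/dW_out for the linear readout
    W_out -= eta * grad + weight_decay * W_out
    B -= eta * grad + weight_decay * B

z = rng.normal(size=m)               # memory layer output at a recall step
err = output_error(W_out @ z, np.zeros(c))
L = learning_signal(err)
adaptive_eprop_update(z, err)
print(L.shape)                       # (200,)
```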
Memory layer. By making use of the learning signals from the previous section, we can find local synaptic plasticity rules to update the synaptic weights Wapical and Wbasal of the apical and basal compartment respectively. We derive local gradients to minimize the error E through the e-prop formalism [32]. With these local gradients, it is not necessary to back-propagate the error signal through all time steps of the computation. Instead, eligibility traces are forward-propagated and used in conjunction with the learning signals to determine the gradient required for the weight updates. Apical compartment. To find the derivative of the error function E with respect to an apical weight, we factorize (27). Now we use our learning signal to approximate the first factor, which is a core idea of the e-prop formalism [32]. We also expand the second factor by applying the product rule, resulting in (28). Unrolling in time, we obtain (29). We can re-write this definition as a recursive function and thereby introduce our eligibility trace: (30). We observe that this recurrent definition only depends on partial derivatives that are local in time and can therefore be derived easily: (31)(32), where H(x) is the Heaviside step function used as the derivative of the ReLU function. By plugging (31) and (32) into (30), we obtain (33) with (34)(35), as stated in the main text. We finally use this eligibility trace as a substitute in the factorization above: (36), where the second term is the derivative of the apical activation with respect to the apical weight. Basal compartment. Analogously, we compute the derivative of the error E with respect to the basal weight matrix Wbasal. We again start off by inserting our learning signal: (37)(38). Because of our assumptions, the learning signal is only non-zero during recall events, and the basal compartment does not receive any input at these time steps (i.e. xb,t = 0), so the corresponding direct term vanishes. Analogous to the apical case, we can also express this derivative recurrently via an eligibility trace: (39), where the local derivative was already defined above. To obtain the remaining factor, we simply compute the corresponding local derivative: (40). Inserting this back into (39) yields (41), with the remaining terms equivalent to those of the apical case. Taken together, our approximated derivative reads: (42).
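Structurally, the eligibility-trace computation of (27)–(42) amounts to a forward recursion: a per-synapse trace carries the influence of the weight on the memorized trunk state forward in time and is combined with the temporally local learning signal. The sketch below shows only this generic structure; the placeholder derivative terms (dh_dh_prev, dh_dlocal, dz_dh) stand in for the concrete partial derivatives (31)–(35) and (40)–(41), whose exact form depends on the Hebbian rule, so this is not the paper's exact gradient computation.

```python
import numpy as np

def eligibility_gradient(dh_dh_prev, dh_dlocal, dz_dh, learning_signals):
    """Generic forward-propagated eligibility traces in the spirit of e-prop.

    dh_dh_prev[t]       : derivative of the memorized state h_t w.r.t. h_{t-1} (scalar per step)
    dh_dlocal[t]        : local derivative of h_t w.r.t. each synaptic weight, shape (n_syn,)
    dz_dh[t]            : derivative of the neuron output z_t w.r.t. h_t (scalar per step)
    learning_signals[t] : neuron-specific learning signal L_t (scalar per step)
    All four are placeholders for the task-specific terms (31)-(35) and (40)-(41).
    """
    n_steps, n_syn = dh_dlocal.shape
    e = np.zeros(n_syn)              # eligibility trace, one entry per synapse
    grad = np.zeros(n_syn)           # accumulated approximate gradient
    for t in range(n_steps):
        # Recursive trace update, cf. (30)/(39): the previous trace is carried
        # through h_{t-1} -> h_t, plus the temporally local weight contribution.
        e = dh_dh_prev[t] * e + dh_dlocal[t]
        # Combine the temporally local learning signal with the trace, cf. (36)/(42).
        grad += learning_signals[t] * dz_dh[t] * e
    return grad

T, n_syn = 5, 3
g = eligibility_gradient(dh_dh_prev=0.9 * np.ones(T),
                         dh_dlocal=0.1 * np.ones((T, n_syn)),
                         dz_dh=np.ones(T),
                         learning_signals=np.ones(T))
print(g)
```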
Weight updates in simulations The approximate gradients derived above were accumulated over mini-batches of training examples and used for parameter updates of our model. In practice, we used these approximate gradients in the Adam algorithm [79], which amounts to synapse-specific learning rates derived from a global learning rate, together with momentum updates; we used the default values for the remaining Adam hyperparameters. We also used an L2 regularization term for each of the synaptic weight matrices. The hyperparameters for learning are specified in the subsection ’Hyperparameters’ below. For efficiency reasons, we accumulate the synaptic weight changes over mini-batches of episodes before applying them, which is common practice in neural network training. Temporal normalization of input neuron activity In order to stabilize training, activity normalization techniques such as batch normalization and layer normalization are often used in artificial neural networks. In biological networks, such normalization may be carried out through inhibitory networks. In contrast to standard techniques, we have to ensure in our model that no information can be inferred from the future; therefore, normalization needs to be performed in an online manner. We normalized the apical and basal input activity vectors component-wise over the temporal dimension t. In our case, we use an online mean-variance computation based on [80]: (43)(44)(45), with the input at time t, the online approximated mean, and the cumulative squared difference between input and mean, from which the normalized activity is obtained. For each task, we calculated the mean and the cumulative squared difference online during the training phase across all training samples, and applied the resulting statistics as constants for the normalization during inference. This procedure is common practice and consistent with standard normalization techniques like batch normalization [81].
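A minimal sketch of the online normalization is given below, using a Welford-style running mean and cumulative squared difference, which matches the quantities described above; the epsilon constant and the exact variant of the streaming update in [80] are assumptions of this sketch.

```python
import numpy as np

class OnlineNormalizer:
    """Component-wise online normalization over the temporal dimension.

    Keeps a running mean and a cumulative squared difference (Welford-style),
    so no information from future time steps is needed. After training, the
    accumulated statistics can be frozen and reused as constants at inference.
    """
    def __init__(self, dim, eps=1e-8):
        self.t = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)        # cumulative squared difference to the mean
        self.eps = eps

    def update(self, x):
        self.t += 1
        delta = x - self.mean
        self.mean += delta / self.t
        self.m2 += delta * (x - self.mean)

    def normalize(self, x, update=True):
        if update:                     # online mode during training
            self.update(x)
        var = self.m2 / max(self.t, 1)
        return (x - self.mean) / np.sqrt(var + self.eps)

norm = OnlineNormalizer(dim=4)
rng = np.random.default_rng(0)
for _ in range(3):
    print(norm.normalize(rng.normal(size=4)))
# At inference time, call norm.normalize(x, update=False) to use the frozen statistics.
```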
General simulation details Hyperparameter decay. We applied an exponential decay to some of the hyperparameters. The decay function of a hyperparameter Θ was defined as (46), with a constant decay rate and a delay. This decay was used for the general learning rate of the Adam optimizer, the entropy coefficient ξ used in the reinforcement learning setups (these parameters are described below), and the Hebbian parameters. Reinforcement learning. For the reinforcement learning experiments, we used Proximal Policy Optimization (PPO) [82] with an entropy bonus [83]. For this setup, the output layer weights Wout were split into the value weights wV and the policy weights Wπ, with l different available actions. wV is used to compute a scalar state value estimate and Wπ to compute the policy vector from the network output. The policy vector represents a probability distribution over all l possible actions at time step t. To train both outputs jointly, we defined an error function as an additive combination of a policy error and a state value error. This corresponds to an actor-critic setup, where the network learns to represent both the actor (via the policy) and the critic (via the state value estimate). The overall error E is therefore defined as (47), the sum of the state value error and the policy error. State value error. The error for the state value estimate was computed using the temporal difference (TD) error [84]: (48), with the reward at time t, a constant discount factor γ, and the model parameters; ‖·‖² denotes the squared L2 norm. The stop-gradient notation indicates that the corresponding term is treated as a constant for gradient computations, and therefore no gradients are propagated through it. Policy error. The policy error was defined as the negative of the PPO objective: (49). Here, ξ > 0 is the entropy coefficient, which decays exponentially according to (46), and the clip parameter defines the interval within which the probability ratio is clipped, as in [82]. Again, the stop-gradient notation indicates that this term was treated as a constant during calculation of the error gradient. The advantage estimate for the action taken at time step t is defined by (50), using a temporal difference residual [85] to ensure temporal locality of the learning signal. For the reinforcement learning tasks, we wanted to enable the model to write and read at every time step. Hence, we let the model perform a memorization event followed by a recall event at each time step t.
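For concreteness, the sketch below assembles the combined actor-critic error for a single time step from a clipped PPO surrogate with an entropy bonus and a squared TD-based value error, treating the old policy, the advantage, and the TD target (the reward plus the discounted next value, computed beforehand) as constants in the spirit of the stop-gradient notation. The relative weighting of the two error terms and all variable names are assumptions of this sketch.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def actor_critic_error(pi_logits, old_pi, action, advantage, value, td_target,
                       clip_eps=0.2, entropy_coef=0.01, value_coef=1.0):
    """Scalar error E = E_V + E_pi for one time step.

    old_pi, advantage and td_target are treated as constants (no gradients would
    flow through them), mirroring the stop-gradient notation in the text.
    """
    pi = softmax(pi_logits)
    ratio = pi[action] / old_pi[action]                    # probability ratio
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    ppo_objective = min(ratio * advantage, clipped * advantage)
    entropy = -np.sum(pi * np.log(pi + 1e-12))             # entropy bonus
    policy_error = -(ppo_objective + entropy_coef * entropy)
    value_error = value_coef * (value - td_target) ** 2    # squared TD error
    return value_error + policy_error

err = actor_critic_error(
    pi_logits=np.array([0.3, -0.1]), old_pi=np.array([0.6, 0.4]),
    action=0, advantage=1.5, value=0.2, td_target=1.0)
print(err)
```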
Supervised learning. In the supervised learning case, the network output is computed as in (22). The supervised error function was defined as the cross-entropy between a softmax of the model output vector at time step t and the target vector, (51). Note that, in contrast to the reinforcement learning setup where an error is available at each time step t, here the error can only be non-zero at the time step where the target signal is received. Hyperparameters. For all our experiments we used 200 memory neurons, a batch size of 32, a global gradient-norm clip of 10, and an L2 regularization term. For the reinforcement learning experiments we used a discount factor γ of 0.9 and a PPO clip ratio of 0.2. The task-dependent hyperparameters, as well as the hyperparameters used to obtain the LSTM baseline, are reported in Tables 2, 3 and 4. The hyperparameters of both the memory network model and the LSTM were tuned using Hyperband [86], a standard hyperparameter optimization technique, with a similar computational budget. To provide a fair comparison, in addition to hyperparameter tuning, we considered two LSTM configurations: one where the number of LSTM hidden units was the same as in our model (resulting in more trainable parameters), and one with the same number of trainable parameters as our model. We report the parameter-matched configuration, which had the best performance. Table 2. Hyperparameters: Supervised learning. https://doi.org/10.1371/journal.pone.0313772.t002 Table 3. Hyperparameters: Reinforcement learning. https://doi.org/10.1371/journal.pone.0313772.t003 Table 4. Hyperparameters: LSTM. https://doi.org/10.1371/journal.pone.0313772.t004
Datasets Omniglot dataset. Omniglot [34] (under MIT License) is a dataset of 1625 handwritten characters from 50 different alphabets, where each character has been drawn by 20 different people. Hence, there are 1625 classes with 20 samples each in this dataset. We pre-trained a 64-dimensional embedding using a 1-shot prototypical network [35] on 965 Omniglot classes. For our experiments, the remaining 660 classes (which were not in the pre-training dataset) were then used, embedded by the previously pre-trained network. Note that these classes were not just held-out samples, but completely new characters. bAbI dataset. The bAbI question answering tasks dataset [17] (under BSD License) is a collection of 20 different tasks, each of which consists of stories written in natural language. A sample can be seen as a sequence of statements (a story), where at the end of each story a question is asked about the content of the previously shown sentences. Depending on the task, the questions often require the combination of multiple statements in order to provide a correct answer. These tasks were designed such that they require different natural language processing skills, such as induction, deduction, and chaining of facts, which are natural for humans but hard for machine learning models. Simulation details for association learning task The association learning task aimed to test whether the model was capable of learning to memorize and recall association pairs.
We performed this task in two different learning scenarios: a reinforcement learning scenario, where the model only obtained a reward indicating whether the output action was correct or not, and a supervised learning setup, where the cross-entropy error between the network output and the target output was computed. For this task we compiled two different sets R and S, each consisting of n Omniglot [34] characters from different classes. Each trial in this task consisted of n associations between two stimuli, bijectively mapping each single character from set R to a single character from set S. Our input vectors for time steps t ∈ {1, …, n} were assembled by concatenating two 64-dimensional embedded vectors (embeddings calculated with the prototypical network, see section ’Omniglot dataset’ in Methods for details), representing one character from set R and one character from set S respectively. This resulted in an input time series of n association steps, followed by a query vector. In the query, a character from R that was identical to one of the previously shown characters (presented at some step τ ∈ {1, …, n}) was concatenated with a placeholder vector instead of its associated character from S. At each trial, the bijection between R and S as well as the sequential order of characters were randomly drawn, ensuring an unpredictable set of associations and presentation order in each trial. * Reinforcement learning variant. At query time, the model produced an n-dimensional output where each element is assigned to a specific character in S. If the model successfully predicted the previously associated character, it obtained a reward of +1. If it failed, it instead received a reward of –1. For time steps t ≤ n, where the associative stimuli were presented to the network, no reward was available, since no output action needed to be taken. This amounted to a reinforcement learning setup with a delayed reward, where we masked the network outputs during the time steps in which the associations were presented, since no action was required and hence no meaningful policy error could be computed. The value estimate output, however, was not masked, because it was used to calculate the error for the synaptic weight update. The PPO objective was computed as explained in the ’General simulation details’ section. * Supervised learning variant. In the supervised case, we provided, as a label, the index of the character in S bijectively associated with the query character from R. The error was then computed for the model prediction at the query presentation. The supervised objective was computed as explained in the ’General simulation details’ section.
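The trial construction can be sketched as follows: n random (R, S) pairs of precomputed 64-dimensional embeddings are presented in random order, followed by a query in which the S part is replaced by a placeholder (here a zero vector, which is an assumption, as are the function and variable names).

```python
import numpy as np

def build_association_trial(embed_R, embed_S, rng):
    """embed_R, embed_S: arrays of shape (n, 64) with precomputed Omniglot embeddings.

    Returns the input time series (n association steps + 1 query step, each 128-dim)
    and the index of the correct character in S for the query.
    """
    n, d = embed_R.shape
    bijection = rng.permutation(n)                    # random R -> S mapping
    order = rng.permutation(n)                        # random presentation order
    steps = [np.concatenate([embed_R[i], embed_S[bijection[i]]]) for i in order]
    query_idx = rng.integers(n)                       # which R character is queried
    query = np.concatenate([embed_R[query_idx], np.zeros(d)])  # placeholder for the S part (assumption)
    target = bijection[query_idx]                     # index of the associated S character
    return np.stack(steps + [query]), target

rng = np.random.default_rng(2)
inputs, target = build_association_trial(rng.normal(size=(5, 64)),
                                          rng.normal(size=(5, 64)), rng)
print(inputs.shape, target)                           # (6, 128) and an index in [0, 5)
```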
Simulation details for delayed-match-to-sample task In the delayed-match-to-sample task, a single Omniglot character from a pre-generated set P of 5 different characters (from different classes) was drawn and presented as a 64-dimensional embedded input vector at the first time step t = 1, followed by 8 time steps of white noise (zero mean, standard deviation 1). The query stimulus was again a character from the same pre-generated set, either matching or not matching the previously presented class. We consider two different setups (see below for details): in one setup (the fixed case) the input set P contained a single sample for each class, whereas in the second (the sampled case) P contained multiple different instances of each class. The model output was a 2-dimensional policy vector representing the match and no-match actions. If the correct action was selected, a reward of +1 was obtained, otherwise –1. As for the association learning task, the policy output was masked for all time steps t where no action was required. We asked whether the model can generalize to the case where input stimulus and query stimulus have to be matched even if they are not identical, but rather two different samples from the same class. Therefore, we evaluated this task in a fixed and a sampled version. In the fixed case, a single instance of each class was added to the input set P. In the case of a match between the two stimuli, input stimulus and query stimulus were hence always the exact same sample (i.e. the same Omniglot character drawn by the same person). In the sampled case, multiple instances of the same class were added to P. Hence, input stimulus and query stimulus could be different samples, even though they were from the same class. In this case, the model should also output a match decision. Moreover, in the sampled case we used different, non-overlapping input sets for training and testing, further probing the generalization capabilities of the model. To accomplish this, we split our set P into two different subsets: a training subset containing 18 samples for each of the 5 classes, and a test subset containing two samples for each of the same 5 classes, none of which were contained in the training subset. This way we could make sure that the model sees the same classes during training and testing, but not the same instances. The PPO objective was computed as explained in the ’General simulation details’ section. Simulation details for context-dependent reward association (radial maze) task Task setup. In this task we modelled a radial maze consisting of 8 branches organized into 4 pairs of adjacent branches, see Fig 5A in the main text. Each pair of branches is associated with a distinct visual context, represented by an Omniglot character. At the maze’s initialization, only one branch in each pair contains a reward, creating a challenging memorization task for the agent. Each learning episode consisted of 40 consecutive trials, and before each episode the locations of the rewarding objects were shuffled. Four different Omniglot characters were randomly chosen and embedded in 64-dimensional space (using a pre-trained prototypical network, see section ’Omniglot dataset’ in Methods) to indicate the contexts. Each single trial consisted of two time steps: a query step, where the model had to take a decision based on the context, and a fact step, where the model received information about the outcome of the action performed in response to the query. During a query step at time t for trial s, the input vector was compiled by concatenating the 64-dimensional context vector with a 3-dimensional placeholder vector. The model computed a 2-dimensional policy vector deciding between the two actions left and right. A reward was given if the rewarding branch was chosen, and a lower reward otherwise. After obtaining the reward, the model received a new input vector for the fact step. This vector consisted of the concatenation of the context vector, the previously chosen action as a 2-dimensional one-hot vector, and a scalar value indicating whether the chosen branch was rewarding or not. The episode then continued with the query step of the next trial, using a new pair with a different context c. More explicitly, the input time series alternated between query and fact steps over the S = 40 trials. We tried two different variants of this task: one where the locations of the rewarding objects for each context c stayed fixed for the whole episode, and one where, after the fact step, the location of the rewarding object for the previously presented context was switched. The PPO objective was computed as explained in the ’General simulation details’ section.
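The query/fact structure of an episode can be sketched as below, including the switching variant described above. The reward values of +1/−1, the zero placeholder in the query input, and the agent interface are assumptions of this sketch; the actual reward magnitudes are those specified for the task.

```python
import numpy as np

def run_radial_maze_episode(agent, contexts, rng, n_trials=40, switch_after_fact=False):
    """contexts: array of shape (4, 64), one embedded Omniglot character per branch pair.
    agent(x) is assumed to return an action in {0, 1} (left/right); rewards of +1/-1
    and the zero placeholder are assumptions of this sketch."""
    rewarding = rng.integers(0, 2, size=len(contexts))   # rewarding branch per pair
    total_reward = 0.0
    for s in range(n_trials):
        c = rng.integers(len(contexts))
        # Query step: context + 3-dimensional placeholder; the agent chooses left/right.
        x_query = np.concatenate([contexts[c], np.zeros(3)])
        action = agent(x_query)
        reward = 1.0 if action == rewarding[c] else -1.0
        total_reward += reward
        # Fact step: context + one-hot action + scalar reward outcome (memorized only).
        one_hot = np.eye(2)[action]
        x_fact = np.concatenate([contexts[c], one_hot, [reward]])
        agent(x_fact)
        if switch_after_fact:
            rewarding[c] = 1 - rewarding[c]               # switching variant
    return total_reward

rng = np.random.default_rng(3)
print(run_radial_maze_episode(lambda x: int(rng.integers(2)), rng.normal(size=(4, 64)), rng))
```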
Simulation details for question answering task (bAbI task) Each task consists of 10,000 data samples, split into a training and a test set. A sample (or story) consists of a sequence of s sentences called facts, followed by a query sentence q and a target answer a. The j-th word of the i-th sentence in a story can be described as a one-hot encoded vector of size V, where V is the vocabulary size, and we consider a sentence i as consisting of w such word vectors. In order to embed each sentence i into a d-dimensional vector, we first generate a word embedding matrix A using the He-uniform variance scaling method [41]. We then create a sentence representation that encodes the position of words within a sentence (as proposed in [20]). The authors call this type of representation position encoding (PE). For a sentence, the representation is obtained by weighting each word embedding element-wise (Hadamard product ∘) with a position-dependent column vector and summing the results over the words of the sentence. The query sentence q is also embedded using A.
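A sketch of the position-encoding sentence embedding is shown below: each word's d-dimensional embedding is weighted element-wise by a position-dependent vector and the weighted embeddings are summed. The specific weighting formula used here is the standard PE weighting from the memory-network literature and is assumed to correspond to the elided expression above; the uniform initialization bound standing in for the He-uniform scheme and the variable names are likewise assumptions.

```python
import numpy as np

def position_encoding(num_words, d):
    """Position-encoding weights l_{kj} in the PE scheme of [20]:
    l_{kj} = (1 - j/J) - (k/d) * (1 - 2j/J), with j = word position, k = embedding index."""
    J = num_words
    j = np.arange(1, J + 1)[:, None]       # word positions 1..J
    k = np.arange(1, d + 1)[None, :]       # embedding dimensions 1..d
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)   # shape (J, d)

def embed_sentence(word_ids, A):
    """Sum of position-weighted word embeddings: m_i = sum_j l_j ∘ A x_ij."""
    E = A[word_ids]                        # (J, d) word embeddings (A indexed by word id)
    return (position_encoding(len(word_ids), A.shape[1]) * E).sum(axis=0)

rng = np.random.default_rng(4)
V, d = 30, 20
limit = np.sqrt(6.0 / V)                   # He-uniform-style bound (assumption)
A = rng.uniform(-limit, limit, size=(V, d))
sentence = [3, 17, 5, 9]                   # word indices of one sentence
print(embed_sentence(sentence, A).shape)   # (20,)
```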
In our experiments, we applied two types of embedding strategies. First, we generated an embedding as described above and kept it fixed for the entire training. We term this method random embedding, because the word embedding matrix A is randomly sampled for each run. For the second strategy, we still generated an embedding as described above, but trained the embedding end-to-end with our memory model for each task using BPTT; the embedding (i.e. the word and sentence embedding) was then extracted and used as a pre-trained, fixed embedding for the same model trained with local synaptic plasticity. We term this method learned embedding. We first trained our model on each of the 20 bAbI tasks [17] separately, then identified which of the tasks were solved (with a test error < 5%), and finally trained the same model again jointly on these tasks. In the joint setup, the stories from the identified tasks were collected into one training set, and the stories were shuffled before each training epoch. The network was then trained on this training set without receiving explicit information about which task each story originated from. The readout matrix Wout was designed such that the network output vector y matched the vocabulary size V, meaning that the network output (after applying a softmax) was interpreted as a probability vector over the word dictionary. The error was then computed according to the cross-entropy objective (51). In most of the tasks, the output is only a single word, but, for example, in task 19 the answer could be “n,w”, which is technically two words (’n’ for north, ’w’ for west) concatenated with a comma. To not violate the constraint that the model output is interpreted as the choice of a single word, we defined every n-gram appearing as an answer as a single word and added it to the dictionary. In the joint setup, we used a single softmax output layer representing the vocabulary over all identified tasks; this single output layer was used for all tasks. Code source Source code under GNU General Public License v3.0 is available at https://github.com/igiTUGraz/tsp. Supporting information S1 Appendix. Additional simulation results. https://doi.org/10.1371/journal.pone.0313331.s001 (PDF) TI - Non-synaptic plasticity enables memory-dependent local learning JF - PLoS ONE DO - 10.1371/journal.pone.0313331 DA - 2025-03-17 UR - https://www.deepdyve.com/lp/public-library-of-science-plos-journal/non-synaptic-plasticity-enables-memory-dependent-local-learning-EhtIGS2EL1 SP - e0313331 VL - 20 IS - 3 DP - DeepDyve ER -