Labuschagne, Willem

Abstract

This paper looks at the representation of supra-classical, non-monotonic (SCNM) logic by an artificial neural network. It identifies the features of defeasibility in this logic related to inference in the context of common-sense reasoning. It considers the machine characteristics that make a representation possible, with reference to previous literature. We describe a theoretical environment for investigating the representation and provide experimental evidence confirming that a Boltzmann machine is a suitable network representation. A Boltzmann machine can learn an input distribution corresponding to a preference relation and explicitly retrieve appropriate model states, constituting one-to-many mappings, entailed by the uncertain information contained in a premiss. The place of the Boltzmann machine in knowledge representation is discussed. In future papers, this neural network model of SCNM logic will serve as an experimental gateway for the exploration of typicality and belief revision.

1 Introduction

This paper explores the connectionist representation of supra-classical, non-monotonic (SCNM) logic and thereby seeks to contribute to some reconciliation between the connectionist and symbolic paradigms of artificial intelligence. Our motivation comes not only from a hypothesis that SCNM logic is an appropriate formalism of defeasible belief but also from our view that, as the predominant approach of modern logic, it merits exploration by machine representation. While the motivation for the paper was diverse, the paper is primarily concerned with the representation of supra-classical logical inference by neural networks. Its aim is to demonstrate that the Boltzmann machine, a stochastic recurrent neural network, is a faithful model of inference in supra-classical logic utilizing preferential semantics.
Further, it aims to show that the association between the network representation and symbolic logic can be utilized for the benefit of both paradigms. Fodor and Pylyshyn [25] viewed the two paradigms of artificial intelligence as being in conflict. However, many authors have argued that they are complementary. We are not trying to translate a system of SCNM logic into a discrete machine network. We are trying to show, by providing evidence from typical cases, that the Boltzmann architecture can learn information from (micro-world) environments that allow it to faithfully represent the inference that logic would dictate in those environments. We explicitly accept the view of Gärdenfors [29], as demonstrated by [14, 26]: that neural networks can learn the systematic structure of logic because systematicity is intrinsically part of any world that is represented by language. Logic offers a framework, at an abstract symbolic level, for the understanding of human reasoning [92, Chp 8–10]. At a concrete biological level, neuroscience offers an increasingly sophisticated understanding of the functioning of the brain. Somewhere between the two, connectionist artificial neural network systems provide a computational framework that is inspired by the brain. It seems just as relevant today that a fuller understanding of human cognition should elucidate the relationship between these different levels of description, as captured by Hinton’s comments, made originally in 1984, even prior to the introduction of modern non-monotonic logic [55, KLM]: ‘Ultimately it will be necessary to bridge the gap between hardware-oriented connectionist descriptions and the more abstract symbol manipulation models that have proved to be an extremely powerful and pervasive way of describing human information processing.’ [44]. The challenge of embodiment, connecting symbolic logic and neural networks, was regarded by many authors in the 1990s [29, 44, 49, 84] as one of the ultimate challenges of artificial intelligence.
It may seem that the moment for this challenge has passed, even though only a partial resolution was previously found, and only in the domain of classical logic. We aim to extend this work in classical logic by reconciling theoretical formalizations of inference in SCNM logic with experimental observations from an artificial neural network. Specifically, we aim to demonstrate that a Boltzmann machine can be used as a faithful representation of SCNM logic. Preferential semantics in SCNM logic is now a conventional approach to logical inference, established in the landmark paper of Kraus, Lehmann and Magidor [55, KLM]. These authors realized that inference needs to be more flexible, cognizant of solutions that are less preferred or likely. The semantics of SCNM logic require that a conclusion be a set of preferentially ordered states, not just the single best/correct inference of classical logic. Without the retention of these less preferred solutions, a logical agent cannot adapt in the face of change. It is not intended that this paper examine the details, context or application of SCNM logic to artificial intelligence: such a goal would be beyond the scope of a single text. Although the choice of this particular logic may be controversial, it is of significance because common-sense reasoning, as exemplified by SCNM inference, is felt by many authors [55, 59, 60, 92] to be one of the hallmarks of human cognition. A critical analysis of the requirements of the rational consequence relation in SCNM logic, from the neural network perspective, is an original contribution of this paper.
As discussed below, we suggest that the network representation should be able to perform the same tasks that are fundamental to the logical process of inference:

Learn a preference relation on the set of model states: The origin of the preference relation in the logic remains undefined: it could be generated from a system of heuristics or learned directly from the environment. But generally, it is a ranking on the probability of states in the world. From a machine perspective, the most common states in the world will be seen most often and will be learned preferentially as local minima. This generative case represents the entire ranking, the complete joint probability table, $p(x)$. In this paper, we confirm Hinton’s assertion that the Boltzmann machine can cycle through the entire joint probability table when presented with a null premiss: i.e. it can retrieve a generative model when blinded about the state of the world.

Select appropriate model states based on limited information (a premiss): Inference is generally based on limited information about the world: a discriminative case, $p(x|u)$. The machine representation must be able to retrieve only those model states that are entailed by the premiss (those local minima that are subsets of the input) and retrieve them in the ranking of the preference relation. In this paper, we confirm that the Boltzmann machine can retrieve a preferentially ranked subset of states entailed by a premiss.

Compare selected model states (usually premiss and conclusion): The final requirement of inference is not explicitly examined in this paper: it involves simple comparison of the supplied input and the retrieved output states, or testing inclusion in a subset. We have suggested that a feed-forward network could easily implement this final phase of inference and we have implied such comparison in our presentation of logical properties (pages 29–33).

Adapt to new information: Adaptation is at the heart of defeasibility.
In an uncertain world, a logical agent faced with new information is often forced to revise its view of the truth, accepting previous exceptions. This issue constitutes an entire domain of modern logic: belief change. We intend to examine this topic in future work. The first two requirements of defeasible SCNM logic are examined in this paper. We suggest they can be implemented in a single architecture, a Boltzmann machine. We also argue that the connectionist representation of SCNM logic should not be seen as an end point in itself, but as a tool to be utilized in examining common issues in cognition.

2 Background literature

This section is an attempt to broadly cover research in two domains of artificial intelligence. It presents an outline of the literature in SCNM logic and its relevance to common-sense reasoning in human cognition. The end of the section focuses in detail on research specifically concerning the connectionist representation of SCNM logic in artificial neural networks. A discussion of the Boltzmann machine and optimization by simulated annealing is included in the next section, on methodology.

2.1 Supra-classical, non-monotonic logic

Non-monotonic logic, as formalized in the KLM preferential semantics [55], is now a core philosophy of applied logic. It emerged out of the realization that classical logic was too inflexible to represent common-sense reasoning [28, 66]. This evolution in logic has been described as ‘a journey from the absolute to the relative’ [38]. Classical logic regards truth as absolute, permanent: preserved in the face of change. Non-monotonic logic attempts to capture the concept of defeasible inference. For example, knowing that Tweety is a bird and that birds fly, we might reasonably conclude that Tweety flies: a conclusion we might choose to retract on finding that Tweety was a penguin.
The common-sense notion is that agents may tentatively draw conclusions given incomplete (uncertain) information and have the ability to retract them in the light of new evidence [54, 103]. In this context, adding premises available for inference can lead to a loss of conclusions [65]. Within the prevailing framework of preferential semantics, non-monotonic logic is implicitly supra-classical. It permits us to infer more from a set of premises than classical propositional logic would, by generating a preference ranking on model states, including less preferred exceptions. This ability to tolerate counter-examples is a prime characteristic of supra-classicality [59] and part of an even wider context of para-consistent logics that specifically support inconsistency [33, 86]. Huge advances in logic were made at the beginning of the 20th century by Gottlob Frege [112], Bertrand Russell [47], Kurt Gödel [50] and Alfred Tarski [34, 106]. There is no space to discuss this pre-existing classical logic, nor is there room to discuss the birth of SCNM logic in the early papers of McDermott and Doyle [69], Reiter [88], McCarthy [68] and Shoham [99]. It has been necessary to draw a historical line in the sand; only the briefest outline of SCNM logic is provided here.

2.1.1 KLM: preferential semantics

Let us first look at properties related to defeasibility: the intuition we would like to encapsulate in logic. In the definitions that follow, $\leftrightarrow$ denotes logical equivalence (alternatively, $\equiv$), $\models$ denotes classical entailment and $\wedge$ denotes logical ‘And’. The symbol $\mid\sim$ denotes defeasible entailment by rational consequence. Well-formed formulae in the language are denoted by Greek letters: it is intended that $\alpha, \beta$ represent existing information and that $\gamma$ represents new information.
$$\begin{align*} & \textbf{Properties~Related~to~Defeasible~Entailment}\ \mid\sim \\ & \textrm{If }\ \models \alpha \leftrightarrow \alpha^{\prime}\ \textrm{and}\ \models \beta \leftrightarrow \beta^{\prime}\ \textrm{and}\ \alpha \mid\sim \beta \text{, then}\ \alpha^{\prime} \mid\sim \beta^{\prime} && \text{(Well-Behaved Equivalence)} \\ & \textrm{If }\ \alpha \models \beta \text{, then}\ \alpha \wedge \gamma \models \beta &&\text{(Monotonicity---Classical)} \\ & \textrm{If }\ \alpha \mid\sim \beta \text{, then for some}\ \gamma,\ \alpha \wedge \gamma{{\mathrel{| \!{\diagup} \!\!\!\!\!{\sim}}}} \beta &&\text{(Non-monotonicity)} \\ & \textrm{If }\ \alpha \models \beta \text{, then}\ \alpha \mid\sim \beta && \text{(Supra-classicality)} \end{align*}$$ Firstly, defeasible entailment should be a well-behaved semantic equivalence, independent of syntactic change. The second property, monotonicity, is stated in the formalism of classical logic: $\models$. As previously discussed, this property is inappropriately strong for the context of common-sense reasoning: defeasible truth is not absolute. Surprisingly, the third property, non-monotonicity, is too weak: it results in systems that are irrational, since every time new information is received, the agent must revise all the pre-existing assertions. By default, artificial neural networks are strictly non-monotonic; they irrationally forget past learned assertions: catastrophic forgetting [89]. In fact, common-sense reasoning, as represented by defeasibility, lies somewhere between monotonic and non-monotonic. The fourth property states that any information captured by classical entailment is at least defeasible, and so defeasibility is part of a broader framework of supra-classical logics. The landmark paper of Kraus, Lehmann and Magidor [55, KLM] was published in the journal Artificial Intelligence in 1990. It presents a sequence of five systems that are possible candidates for defeasible reasoning.
From weakest (least rational) non-monotonic to the strongest monotonic, these systems in order are C Cumulative, CL Cumulative with Loop, P Preferential, CM Cumulative Monotonic and M Monotonic. For each system, the paper separately considers the proof theoretic properties, the semantics and the resulting consequence relations. The weakest of the systems, C Cumulative, contains the basic properties of all the others. $$\begin{align*} & \textbf{KLM~Cumulative~Properties} && \\ & \text{1. } \alpha \mid\sim \alpha &&\text{(Reflexivity)} \\ & \text{2. If }\ \models \alpha \leftrightarrow \beta\ \textrm{and}\ \alpha \mid\sim \gamma \text{, then } \beta \mid\sim \gamma &&\text{(Left Equivalence)} \\ & \text{3. If }\ \alpha \mid\sim \beta \ \textrm{and}\ \beta \models \gamma \text{, then } \alpha \mid\sim \gamma &&\text{(Right Weakening)} \\ & \text{4. If }\ \alpha \wedge \gamma \mid\sim \beta \ \textrm{and}\ \alpha \mid\sim \gamma \text{, then } \alpha \mid\sim \beta &&\text{(Cut)} \\ & \text{5. If }\ \alpha \mid\sim \beta \ \textrm{and}\ \alpha \mid\sim \gamma \text{, then } \alpha \mid\sim \beta \wedge \gamma &&\text{(And, derived)} \\ & \text{6. If }\ \alpha \mid\sim \beta \ \textrm{and}\ \alpha \mid\sim \gamma \text{, then } \alpha \wedge \gamma \mid\sim \beta &&\text{(Cautious Monotonicity)} \end{align*}$$ Reflexivity is a universal requirement. Logical Equivalence can be derived from Left Equivalence and expresses the concept of syntactic independence. Right Weakening states that plausible consequences should include those that are strictly classical. Cut expresses the idea that information that is separately entailed can be removed without loss of assertions. It formalizes the concept of foundational information in the knowledge base. Cautious Monotonicity goes some way towards re-establishing the strength of classical entailment within the system. $$\begin{align*} & \textbf{Classical~Properties} \\ & \text{1. 
If } \alpha \models \beta \ \textrm{and}\ \beta \models \gamma \text{, then } \alpha \models \gamma &&\text{(Transitivity)} \\ & \text{2. If } \alpha \models \beta \text{, then } \alpha \wedge \gamma \models \beta &&\text{(Monotonicity)} \\ & \text{3. If } \alpha \models \beta \text{, then } \neg \beta \models \neg \alpha &&\text{(Contraposition)} \end{align*}$$ The CL Cumulative Loop system adds a transitive Loop property to the base set of cumulative properties. This property is important in preference ranking. However, defeasible entailment is not in itself transitive. Both the CM and M systems are too strongly classical in nature to be candidates for defeasible reasoning, adding respectively monotonicity and, finally, contraposition, the strongest of the classical attributes. The P Preferential system ‘occupies the central position in the hierarchy of non-monotonic’ reasoning. Its semantics were described by Shoham [99] and it was considered by Adams [3] and Pearl and Geffner [80], in the context of conditional assertion and probabilistic logic, as the ‘conservative core of a non-monotonic reasoning system’. $$\begin{align*} & \textbf{Preferential: additional~property} && \\ & \text{1. If } \alpha \mid\sim \gamma \ \& \ \beta \mid\sim \gamma \text{, then } \alpha \vee \beta \mid\sim \gamma &&\text{(Or)} \\ & \textbf{Supplementary~property} && \\ & \text{2. If } \alpha \mid\sim \beta \text{, then either } \alpha \wedge \gamma \mid\sim \beta \textrm{ or } \alpha \mid\sim \neg \gamma &&\textbf{(Rational~Monotonicity)} \end{align*}$$ System P adds the Or property to the base set of cumulative properties and includes the And property, which can be derived via Cautious Monotonicity and Cut. The original paper [55] discusses three further properties: Negation Rationality, Disjunctive Rationality and Rational Monotonicity, which might potentially be added to strengthen the P system (make it more classical).
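These properties can be checked mechanically. The sketch below is our own illustration, not from the paper: the four-state micro-world and its ranks are hypothetical. It identifies each formula with its set of models and exhaustively verifies the System P properties, plus Rational Monotonicity, over a ranked preference relation:

```python
from itertools import product, combinations

# Hypothetical ranked micro-world: states are pairs of +1/-1 truth values,
# and a lower rank means a more preferred (more normal) model state.
STATES = list(product([1, -1], repeat=2))
RANK = {(1, 1): 0, (-1, -1): 0, (1, -1): 1, (-1, 1): 2}

# Identify each formula with its set of models (all 16 subsets of STATES).
FORMULAS = [frozenset(c) for r in range(len(STATES) + 1)
            for c in combinations(STATES, r)]

def maximal(ms):
    """Most preferred (minimal-rank) models of a formula."""
    if not ms:
        return frozenset()
    best = min(RANK[s] for s in ms)
    return frozenset(s for s in ms if RANK[s] == best)

def entails(a, b):
    """Defeasible entailment a |~ b: maximal models of a are models of b."""
    return maximal(a) <= b

ALL = frozenset(STATES)
for a in FORMULAS:
    assert entails(a, a)                                  # Reflexivity
    for b in FORMULAS:
        for c in FORMULAS:
            if entails(a, b) and b <= c:                  # Right Weakening
                assert entails(a, c)
            if entails(a & c, b) and entails(a, c):       # Cut
                assert entails(a, b)
            if entails(a, b) and entails(a, c):           # Cautious Monotonicity
                assert entails(a & c, b)
            if entails(a, c) and entails(b, c):           # Or
                assert entails(a | b, c)
            if entails(a, b):                             # Rational Monotonicity
                assert entails(a & c, b) or entails(a, ALL - c)
print("System P and Rational Monotonicity hold for this ranked model")
```

All the assertions pass here because the preference relation is ranked (modular); a non-modular preferential order would still satisfy System P but could violate Rational Monotonicity.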
In a subsequent paper, Lehmann and Magidor [60] concentrate on Rational Monotonicity, which can be envisaged as the upper boundary of defeasibility, above which the logic becomes too strongly classical (monotonic). The definition of Rational Monotonicity can be formulated in the context of common-sense reasoning: when un-surprising new information ($\gamma$) is received, the agent need not revise previous assertions. Rational consequence formalizes single predictive inferences, where the limited information supplied by the premiss is amplified by information provided by a default rule. A default rule (and preference relation) may be based on factors such as past experience or observation of the frequency of states in the environment. Rather than a theoretical discussion of the preferential semantics and consequence relations, which would involve considerable space and technical detail, we present a simple example in the hope that it may be more informative. Let us consider a traffic intersection with a light and a car. In sentences of the language, the fixed order of the propositions (the light, then the car) will be maintained. Truth valuations on atoms will be denoted as follows: $L$ the light is green, $\neg L$ the light is red, $C$ the car goes through the intersection and $\neg C$ the car stops. We employ a binary logic where ‘+1’ stands for true and ‘-1’ stands for false. An example default rule is: ‘Cars normally stop for red lights’. This rule is not a universal generalization; the consequent entailment is defeasible, and so supra-classical logic incorporates exceptions to the single conclusions of classical monotonic logic. Cars may exceptionally ‘run a red light’. It is convenient to represent the default rule by means of an ordering on the (valuations on) states of the relevant micro-world, as illustrated in Figure 1. Such an ordering stratifies or ranks the states into layers, as described by Kraus et al. and by Lehmann and Magidor [55, 60].
These layers are often indexed with an ordinal value. It is conventional to refer to the ordering as a preference relation. Traditionally, the most preferred (plausible or normal) model states have been called the minimal models. However, throughout this paper, we choose the intuitive alternative of calling them the maximal (preferred) models.

Figure 1. An example micro-world model of a preference relation: there are three levels of preference in this system, shown with an ordinal index of ranks and the most preferred or common model states as maximal.

The maximal model states in this example world are the states with valuations ‘1 1’, where the light is green and the car goes through, and ‘-1 -1’, where the light is red and the car stops. The model state with valuations ‘1 -1’, where the light is green but the car stalls, happens occasionally but is less preferred, and ‘-1 1’, where the car runs a red light, is least preferred. If in this example scenario the agent received information that the light was green, it would be plausible to conclude, based on rational consequence, that the car went through the intersection, although it may have stalled. This defeasible conclusion is reached by selecting the maximal models of the premiss from the preference ranking. Given a default rule represented as such a preference relation, the corresponding rational consequence relation sanctions the defeasible entailment of $\beta$ by $\alpha$, if and only if every maximally preferred model of $\alpha$ is also a model of $\beta$ [38].
$$\begin{align*} & \alpha \mid\sim \beta \quad \longleftrightarrow \quad \textrm{Maximal Models}\,\lbrack \alpha \rbrack \ \subseteq \ \textrm{Models}\,\lbrack \beta \rbrack \end{align*}$$

In general, defeasible entailment is determined by the maximally preferred models.

2.2 Common-sense reasoning

‘Arguably, the most important characteristic of non-monotonic logic is not its non-monotonicity, but its supra-classicality’ [59]. Supra-classical logics allow for non-preferred conclusions and can employ a range of consequence relations that tolerate counter-examples, in contrast to the restrictive view of classical logic. A conclusion in supra-classical logic involves an ordered set of model states entailed by a premiss. In the domain of statistics, these states are counter-factuals [78]. In the connectionist paradigm, states correspond to stable minima. We assert that a supra-classical conclusion (a ‘problem solution’) is a set of ranked minima, not just the single global minimum. We have chosen this broader context of rational consequence in supra-classical logic (preferential semantics) because many authors regard this ability to learn exceptions as a key feature of common-sense reasoning [38, 55, 59, 60]. Indeed, Pearl [78, 79] proposes that these counter-factuals form the basis of reasoning about causality, as opposed to simple statistical association. Although there is no accepted practical definition of common-sense reasoning, many authors believe that SCNM logic is a credible formalism [55, 59, 60, 92]. Common-sense reasoning is a term typically applied to the menial and yet extraordinarily complex activities that are commonplace, such as tying your shoelace [22]. The general knowledge base required for these activities has so far eluded artificial intelligence, despite significant initiatives such as the Cyc Project developed by Lenat [64] and the Open Mind Project developed by Minsky and Singh [100].
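Returning briefly to the traffic example of Section 2.1.1, the maximal-models retrieval can be sketched in code. This is our own illustrative Python, using the hypothetical ranks of Figure 1: given a premiss, it returns the entailed model states grouped by preference rank, the one-to-many mapping that a network representation must reproduce.

```python
from itertools import groupby

# Ranks from the traffic micro-world (Figure 1): state = (L, C) in {+1, -1},
# lower rank = more preferred.
RANK = {(1, 1): 0, (-1, -1): 0, (1, -1): 1, (-1, 1): 2}

def retrieve(premiss):
    """Models of the premiss grouped by rank, most preferred group first."""
    ms = sorted((s for s in RANK if premiss(s)), key=RANK.get)
    return [list(g) for _, g in groupby(ms, key=RANK.get)]

light_green = lambda s: s[0] == 1   # premiss: the light is green
car_through = lambda s: s[1] == 1   # candidate conclusion

ranked = retrieve(light_green)
# Preferred conclusion first; the stalled car survives as a ranked exception.
assert ranked == [[(1, 1)], [(1, -1)]]

# L |~ C: every maximal model of the premiss is a model of the conclusion.
assert all(car_through(s) for s in ranked[0])
```

The less preferred group is retained rather than discarded, which is exactly the counter-example tolerance that distinguishes a supra-classical conclusion from a single classical one.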
It would be incomplete to consider common-sense reasoning without reference to psychology, previously the only experimental framework available. Surprisingly, given the separation between the disciplines of mathematical logic and behavioural psychology, the empirical psychological evidence is very supportive of SCNM logic as a surrogate for human reasoning. The earliest and most influential result was reported by Wason [110, 111]. Wason’s selection test involved human subjects given four cards, their facing and concealed sides potentially supporting some relationship: an inference made under the rules of classical logic. The subjects were able to select independently all the inferences that they felt were correct. Ninety percent of Wason’s test subjects consistently chose the positive inference correctly, in the style of modus ponens: if $\alpha$ and $\alpha \rightarrow \beta$, then $\beta$. Thirty-five percent of subjects mistakenly affirmed the consequent, a positive fallacy: an indication of the degree of possible human error. But only four percent of subjects chose the classical negative inference correctly, thus rebutting reasoning in the style of modus tollens: if $\neg \beta$ and $\alpha \rightarrow \beta$, then $\neg \alpha$. There has been prolonged discussion about this result, with the suggestion that it simply confirms human error. Cheng and Holyoak [16] showed that placing the test in a familiar setting or prompting the subjects with a concrete rationalization (introducing new information) improved the selection of the negative inference to almost 90%. However, the result can be interpreted in another way: modus tollens is implicitly based on the strongest classical property, contraposition. Contraposition was specifically removed from the formalism of SCNM logic because it prevents defeasibility. Further validation of the tenets of defeasible inference has been provided in a range of studies.
Stenning and van Lambalgen [102] repeated Wason’s selection task in a context of improved explanation, making the point that context and familiarity are important to human cognition, and generally supporting the framework of defeasible inference. Similarly, Byrne [13] extended the selection task, demonstrating the importance of non-monotonicity in common-sense reasoning. Neves, Bonnefon and Raufaste [74] argue that many of the properties of defeasible inference, including rational monotonicity, were corroborated in a verbal selection test. Pfeifer and Kleiter [81] examined the corroboration of KLM System P through their subjects’ estimates of probability intervals, concluding that the results supported the foundational properties, including logical equivalence (syntactic independence).

2.3 Representation in artificial neural networks

The earliest work on the representation of logic within neural networks began in the 1980s. Nilsson [75] presents the fundamental mapping of truth in classical logic to the probabilities 0 and 1, using binary semantic trees for sentence analysis. The intuitive extension of this approach to preference relations in non-monotonic logic simply utilizes the range of probability values for different preference levels. The work of Bacchus [6–9], although not in the mainstream of non-monotonic logic, develops probabilistic logics from a consideration of statistical knowledge bases. Leitgeb [62] sets out the properties of logic and inference, considering why embodiment is possible. In discussion, he theorizes in detail about the structure and behaviour of agents that would have properties compatible with the requirements of the logic. He concludes that ‘dynamical agents’, such as ‘simple inhibition nets’, are viable candidates for the representation of logic. He proceeds to prove the properties of these inhibitory networks theoretically.
Leitgeb has many valuable ideas [61, 63], particularly his emphasis on discrete binary states and inhibitory constraints. However, Leitgeb’s ‘simple inhibition nets’ lack a true distributed representation; there are no connection weights and no discussion of how the networks would learn or adapt. From an experimental perspective, there is no evidence that they were ever built or tested.

2.3.1 Statistical relational learning

Many authors [32, 51–53, 78] have published in the field of statistical relational learning (SRL). This domain can be seen as an extension of Bacchus’s work on probabilistic logic. Statistical relationships within the data, represented in the joint probability distribution, are viewed from the perspective of database theory (as entity relationships) and are modelled graphically using Bayesian and Markov networks. The concept of using both directed and undirected graphs to represent probabilistic data is not new. Markov networks are a more expressive super-set of the Bayesian models [78]. SRL specifically addresses the difficulty of representing one-to-many associations within the data, which are not mathematical functions. These associations between facts in the data form the default rule, which is the basis of the rational consequence relation in SCNM logic. These one-to-many factual relationships, as distinct from object relationships, are very important for common-sense reasoning and are difficult to capture in any variety of classical logic. In the context of SRL, these Bayesian and Markov networks are built by a process of ‘inductive logic programming’, using either algorithmic or manual construction. Weight calculation or parameter learning is then performed directly by calculation from the log likelihood of the data. This calculation is known to be an NP-hard problem, so only an approximation of the data distribution is possible.
Finally, inference in these networks is achieved by implementing Gibbs sampling, where the network nodes are set to the observed inputs or randomized for the un-observed inputs. This process is analogous to ‘clamping’ in the initial phase of a Boltzmann learning algorithm, where Gibbs sampling is required to retrieve the output of the network at equilibrium. However, there are important differences between such abstract graphical models, which are constructed and analysed algorithmically, and neural networks, which use distributed representations and are capable of learning and processing using mechanisms based on neurobiology. In the following sections, we discuss literature that has explicitly explored the neural network implementation of logical inference.

2.3.2 Early symmetric networks

The symmetric (recurrent) neural networks (SNNs) are an important class of networks derived from the domain of statistical mechanics. Following the seminal papers published by Hopfield [46] and Hinton et al. [44, 45], Balkenius and Gärdenfors [10] were the first to recognize the unique characteristics of the newly formulated networks in relation to the representation of propositional logic. They specifically emphasize the property of constraint satisfaction with regard to the ability of these networks to find single solutions in classical logic. They use the term ‘resonance states’ to refer to energy minima in the network, corresponding to solutions in classical logic. They demonstrated theoretically that simple SNNs could replicate the conclusions of logical schemata. Their detailed description of logical schemata links them to concepts used by Rumelhart, Smolensky, McClelland, and the PDP Research Group [91, Vol 1, Chp 14: Schemata] and [71, 72].
Unfortunately, there is a confusing collection of related concepts in the literature, and the domains of logic and neural networks are full of overlapping and contradictory vocabulary: Gärdenfors’ logical schemata, an epistemic state in modern logic, a generative model or frame in computer science and a joint probability distribution in statistics. In this paper, we have chosen the terminology ‘logical micro-world’, used by Frank et al. [26]; see Section 3.2. Jagota [49] examined the stable storage of database tuple information in Hopfield-style networks serving as associative memory. He related this to Boolean formulae and regular expressions, although not directly to inference. He theorized about the storage capacity of these networks, which is now known to be limited arithmetically by the number of nodes. Pinkas [83, 84] strengthened the work of Balkenius and Gärdenfors [10] by providing a mathematical foundation, demonstrating an equivalence between two fundamental ideas: a solution in logic and energy minima in a SNN. Pinkas indicates that SNNs are capable of learning an ordering on states (a preference ranking) consistent with the concepts of non-monotonic logic [55, 60]. Although the paper deals with many SNNs, Pinkas specifically acknowledges that only the Boltzmann machine is capable of searching multiple energy minima simultaneously and learning a ‘strongly equivalent’ or ‘magnitude preserving’ ranking on model states. Further work by Pinkas [85] emphasizes his focus on the global minimum of the networks, at the expense of local minima. There are two major issues with these experiments in the context of SCNM logic and common-sense reasoning.

• Single Problems: Pinkas’ networks are engineered/designed from individual sentences or formulae in the logic. They represent specific single problems with narrow applicability; each solution found is appropriate only to that specific sentence.
We have focused on a generic network that can represent its environment, and any property of SCNM logic that is a consequence of that environment. |$\bullet $| Single Solutions: Pinkas’ penalty logic was originally theoretically discussed in the context of preferential semantics and the translated networks were stochastic. However, there is no experimental verification of these results in terms of SCNM logic in the paper. The reported results are 3-SAT problems, single solutions in the realm of classical logic. There is no experimental evidence of one-to-many relationships. This would require not just the global minima, but also an ordering on all the sets of local minima: our focus includes the retrieval of exceptions. Asymmetric Hopfield networks are known to produce multiple outputs, potentially one-to-many mappings, by way of chaotic or cyclical attractors. However, we were not aware of any underlying mathematical principles in an asymmetric network governing the retrieval of these minima (preferred states). Our preliminary testing with a symmetrical variety of Hopfield network disappointingly only returned single outputs [12]. This is in contrast to the view of Pinkas [84], who suggests that Hopfield style networks should be at least ‘preference preserving’: maintaining an ordering on states but not the magnitude of the ordering. 2.3.3 Neural-symbolic integration The SHRUTI system, as proposed in the field of neural-symbolic integration (NSI), has been offered as a model of human cognition and by implication a representation of symbolic logic [96–98]. It primarily attempts to solve the issue of dynamic variable binding in predicate calculus. The system is based on feed-forward networks, which reach a deterministic conclusion with an explicit probability. Without alteration of its nodal activations to some stochastic function, it is difficult to see how the system could retrieve multiple ranked counter-examples to a preferred conclusion. 
The addition of Hebbian learning moves the SHRUTI system closer to a SNN, but there is limited discussion of how the network obtains and stores the necessary cross-firing statistics to manage such learning (cf. a Boltzmann machine). d'Avila Garcez, Lamb, and Gabbay [20, 21] have extended the approach of representation in classical logic from Pinkas [84] to non-symmetric neural systems, utilizing large ensembles of feed-forward networks: the CILP system. CILP can be seen as a hybrid system where the network is first constructed around a specific logical problem using a translation algorithm, analogous to Pinkas' penalty logic for SNNs. The authors demonstrate the practical capacity of these networks in a variety of settings including first-order, temporal and modal logics. Further, they consider the challenge of relational associations and dynamic variable binding using predicates, in the context of a specific problem: Michalski’s east-west trains [21, Chp 10]. However, these publications, like those of Pinkas, are limited by the issues previously stated: single problems and single solutions. These authors have published recently on the representational power of neural networks under the framework of neurosymbolic AI [19]. The Garcez and Lamb paper raises many issues of importance to us, particularly in relation to non-monotonic logic, implicitly based on probability, as an implementation of common-sense reasoning: |$\bullet $| its potential contribution to robustness, |$\bullet $| its role in retrieving counterfactual explanations, |$\bullet $| the representation of compositionality, |$\bullet $| the concept of relationship learning, as implemented by one-to-many mappings, |$\bullet $| the requirements of adaptation in an uncertain environment, |$\bullet $| the learning of a joint probability table, equivalent to a preference relation. 
The distinction between classical and SCNM logic is that a solution set of less preferred model states is required to address the changing of maximal models under conditions of defeasibility. A neural network representing this logic must learn and retrieve one-to-many mappings rather than act as a function approximator, reaching a single best solution. This contrast with the setting of our current paper can be seen most clearly in the examples: Pinkas’ [84] 3-SAT problems and Garcez’s [21] Michalski’s trains. Although both these systems theoretically deal with non-monotonic logic, the specific results reported in the papers are many-to-one functions. Our paper exclusively examines SCNM logic, where representation of less preferred model states (exceptions) is thought to be characteristic of common-sense reasoning. We contend that in this context, Shastri’s SHRUTI and Garcez’s CILP systems suffer from the same two problems, in regard to the scope of SCNM logic, which also applied to Pinkas’s work. They address the single problems for which they were designed (algorithmically engineered from the logic), and they supply single solutions (finding global optima) in the domain of classical logic. 2.3.4 Representation in the Boltzmann machine Having considered a range of network architectures, the Boltzmann machine was adopted as the platform for this research. In general, the Boltzmann machine inherits all the favourable characteristics of the SNNs, with their logical equivalence demonstrated in classical logic via Pinkas’s penalty logic. It also learns and can simultaneously search multiple local or global minima. The minima can be interpreted as corresponding to solutions or conclusions in the logic. This is not only in the classical realm of SAT-problems but also in the wider context of supra-classical logic, which requires evaluation of counter-examples related to common-sense reasoning. 
The machine is also the neural network equivalent of the graphical models (Bayesian and Markov) used in SRL (Section 2.3). These models are theoretically designed from statistical associations between factual observations that form the basis of the preference relation in SCNM logic. Briefly, the Boltzmann machine uses symmetric, non-reflexive connections, stochastic activation functions, and requires simulated annealing for sampling of local cross-firing. There is no fixed architecture for the machine, but the hidden layer is usually fully interconnected. The summed input to a node is typically called its energy. The change in energy at node |$i$| is related to the sum of the input from connected nodes |$j$|⁠, where |$w_{ij}$| is the related weight and |$s_j$| the nodal state at |$j$|⁠. |$\theta _i$| is the bias or threshold of node |$i$|⁠. $$\begin{align*} & \varDelta Enet_i = \sum_{j} w_{ij} s_j - \theta_i \end{align*}$$ The Boltzmann machine uses a sigmoid activation function. Unusually, the activation is stochastic: the activation function specifies the probability of activation of node |$i$| given threshold |$T$| (⁠|$P_{i|T}$|⁠) rather than the actual output. The firing is ‘all or nothing’ (1 or 0). |$\varDelta Enet_i$| is the input summation for node |$i$| as above. Rather than the traditional analogy of |$k_BT$|⁠, the product of Boltzmann constant and temperature, |$T$| should be regarded as the optimization threshold. $$\begin{align*} & P_{i|T} \ = \ \frac{1} {1 + e^{(\frac{ -\varDelta Enet_i}{T})}} .\end{align*}$$ As a consequence of its stochastic nodal activation and optimization by simulated annealing, the Boltzmann machine is able to cycle through multiple states at equilibrium, searching multiple energy minima simultaneously [84]. These constitute the one-to-many relationships entailed by a premiss. Further, because of its probabilistic learning, it is theoretically able to represent a complete joint probability table: a generative model. 
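As a concrete illustration, the stochastic nodal update described by the two equations above might be sketched as follows. This is a minimal sketch: the function names, the toy weights and the use of the ±1 state convention (rather than 1/0 firing) are our own illustrative assumptions, not details of any particular implementation.

```python
import math
import random

def delta_energy(i, states, weights, bias):
    """Energy gap at node i: weighted sum of inputs from connected nodes minus the bias."""
    return sum(weights[i][j] * states[j]
               for j in range(len(states)) if j != i) - bias[i]

def stochastic_update(i, states, weights, bias, T):
    """Fire node i ('all or nothing') with probability given by the sigmoid
    of its energy gap at optimization threshold T."""
    p = 1.0 / (1.0 + math.exp(-delta_energy(i, states, weights, bias) / T))
    states[i] = 1 if random.random() < p else -1  # illustrative +/-1 convention
    return p
```

At high |$T$| the update is close to random (probability near 0.5); as |$T$| is annealed towards zero the node behaves increasingly deterministically, which is what allows the machine to settle into, and cycle among, multiple energy minima.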
This generative model can be explicitly retrieved from the machine by time-slicing through its learned input distribution, given a completely neutral premiss: clamping with a null input [44]. As a consequence, it is the only SNN Pinkas regards as capable of representing a ‘strong equivalence’ relation: not only order preserving but also magnitude preserving. These favourable properties are encapsulated in the Boltzmann distribution equation, which relates the probability of output states at equilibrium to the relative entropy (learned preference) of these states. $$\begin{align*} & P_{k|T} \ \ \equiv \ \ \frac{e^{(\frac{- Enet_k}{T})}} {\sum_{l} \ e^{(\frac{- Enet_l}{T} )}} \end{align*}$$ Many authors have utilized different varieties of Boltzmann machine for practical problems in the realm of categorization or function approximation [15, 23, 24, 41, 77, 82, 93, 94, 107]. We, however, examine the use of the Boltzmann machine in a different context, confirming that the machine can explicitly retrieve the distribution of its training set when presented with a neutral (non-specified, null) input and select appropriately ranked model states given incomplete or partial information (a premiss). In short, while the methods discussed in Sections 2.3.2 and 2.3.3 above represent single problems and find single solutions, a Boltzmann machine trained in an environment is able to represent multiple problems from that environment and find multiple solutions (an output distribution). This multi-modal output, the retrieval of one-to-many relationships, constitutes a ranking of counter-examples: the essence of SCNM logic. 2.3.5 Biological plausibility While biological realism is not our principal focus, as noted above, this research occurs in the broad context of reconciling different levels of description. Evidence from the fields of psychology and neurobiology is relevant. 
Any network that intends to embody SCNM logic, and indirectly account for some aspect of human cognition, should at least consider the underlying biology. A number of authors have published on the topic of biological plausibility [48, 67]. In particular, O'Reilly [76] has identified six principal requirements. The Boltzmann machine is clearly consistent with the first five criteria: distributed representation, present in most neural networks; inhibitory competition, provided by recurrent connections; bi-directional activation, requiring symmetric weights; error-driven task learning, based on supervised learning; and Hebbian learning, based on weight change directed via cross-firing. The sixth requirement, ‘biological realism’, is rather vague. However, we note the following two properties of the Boltzmann machine: |$\bullet $| The biphasic nature of the learning algorithm, which has a possible correlation to REM sleep, referenced in Section 5.1. |$\bullet $| The machine’s stochastic nature, driven by simulated annealing: although the learning algorithm is inefficient, it specifically equips this network for complex biological tasks. Engineering in artificial intelligence has moved to the more efficient version, the restricted Boltzmann machine. 3 Methods This section begins with a brief definition of the SCNM logic utilized in the paper. It provides a description of the micro-world experimental environments used for testing candidate neural-network representations against SCNM logic, with an account of the mapping from micro-world states to patterns of activation in a neural network. It then summarizes the requirements sufficient for the representation of the logic. It concludes with a description of the Boltzmann machine network implementation and training. 3.1 Logical preliminaries We give a brief definition of the SCNM logic utilized in this paper, based on the rational consequence relation first introduced in Section 2.1. 
A propositional SCNM logic is generated by a finite set of atomic propositions with conventional propositional and set connectives (⁠|$\neg , \vee , \wedge , \rightarrow , \leftrightarrow ; \cap , \cup , \subseteq , \supseteq $|⁠). Let |$\top $| stand for truth, the set of tautologies, and |$\bot $| stand for falsity, the set of contradictions. The syntax of the language is not transparent, i.e. the atoms cannot be decomposed and do not involve predicates (object relationships). However, the data have factual associations or dependencies deliberately included, as discussed previously in Section 2.3 and below in Section 3.2. The semantics of the logic are based on a finite set of states in a micro-world. For simplicity, we identify states with the assignment of truth-values (true 1, false -1) to atomic propositions. A state in which a proposition |$\alpha $| is true is a model of |$\alpha $|⁠. A proposition |$\alpha $| classically entails a proposition |$\beta $| if and only if every model of |$\alpha $| is also a model of |$\beta $|⁠. $$\begin{align*} & \alpha \models \beta \quad \longleftrightarrow \quad Models \lbrack \alpha \rbrack \subseteq Models \lbrack \beta \rbrack \end{align*}$$ Classical logic is explicitly monotonic (see the discussion in Section 2.1) and presumes the absolute nature of truth. Classical entailment is very restrictive and fails to capture much of everyday common-sense reasoning. In the example of the traffic light (Section 2.1.1), classical inference from the observation that the traffic light for oncoming traffic is red, to the conclusion that the oncoming car will stop, would result in pedestrians stepping into the path of an oncoming car that has exceptionally ‘run a red light’. This contrasts with most common-sense reasoners' understanding that truth is relative. 
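Because the micro-world semantics are finite, the classical entailment just defined can be checked mechanically by enumerating states. The following sketch uses an illustrative two-atom world (Light, Fan); the helper names are ours, not part of any formal apparatus in this paper.

```python
from itertools import product

ATOMS = ('L', 'F')  # illustrative two-atom micro-world: Light, Fan
# A state assigns a truth-value to each atom: true 1, false -1.
STATES = list(product((1, -1), repeat=len(ATOMS)))

def models(prop):
    """Models of a proposition: the states in which the proposition is true."""
    return {s for s in STATES if prop(dict(zip(ATOMS, s)))}

def entails(alpha, beta):
    """Classical entailment: every model of alpha is also a model of beta."""
    return models(alpha) <= models(beta)
```

For example, ‘the light is on and the fan is on’ classically entails ‘the light is on’, but the converse fails, since the state ‘1 -1’ is a model of the premiss only.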
Accordingly, we utilize a logic equipped with a more generous entailment relation known as a rational consequence relation: rational consequence formalizes single predictive inferences. In these inferences, the limited information supplied by the premiss is amplified by information provided by a default rule. Past experience, observation of the frequency of states in the environment, commonly forms the basis for the default rule. In the example, the default rule is that ‘Cars normally stop for red lights’. These default rules are not universal generalizations; the consequent entailment is defeasible and so supra-classical logic incorporates exceptions to the single conclusions of classical monotonic logic. In the example, cars may exceptionally ‘run a red light’. It is convenient to represent the default rule by means of an ordering on the (valuations on) states of the relevant micro-world. Such an ordering stratifies or ranks the states into layers, as described by Kraus, Lehmann and Magidor [55, 60]. These layers may be assigned an ordinal value. It is conventional to refer to the ordering as a preference relation. Traditionally, the most preferred model states are denoted as the minimal models. In this paper, these most preferred models are given the intuitive denotation of the ‘maximal’ models. Given a default rule represented as such a preference relation, the corresponding rational consequence relation sanctions the defeasible entailment of |$\beta $| by |$\alpha $|⁠, if and only if every maximally preferred model of |$\alpha $| is also a model of |$\beta $| [38]. $$\begin{align*} & \alpha \mid\sim \beta \quad \longleftrightarrow \quad Maximal\ Models \lbrack \alpha \rbrack \subseteq Models \lbrack \beta \rbrack \end{align*}$$ We have previously argued in Section 2.2 that supra-classicality, with its ability to recall less preferred model states (exceptions), is the cardinal property of defeasibility. 
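The rational consequence relation can likewise be checked mechanically once a preference ranking on states is given. In this sketch, `rank` maps each state to its preference level (level 0 most preferred, in keeping with the 'maximal' terminology above); the example ranking encodes an illustrative default, ‘when the light is on, the fan is normally on’, and is not taken from the paper's micro-worlds.

```python
from itertools import product

ATOMS = ('L', 'F')
STATES = list(product((1, -1), repeat=len(ATOMS)))

def models(prop):
    return {s for s in STATES if prop(dict(zip(ATOMS, s)))}

def defeasibly_entails(alpha, beta, rank):
    """Rational consequence: every maximally preferred model of alpha is a model of beta."""
    alpha_models = models(alpha)
    if not alpha_models:
        return True  # an unsatisfiable premiss defeasibly entails anything
    best = min(rank[s] for s in alpha_models)  # level 0 is most preferred
    maximal = {s for s in alpha_models if rank[s] == best}
    return maximal <= models(beta)

# Illustrative default rule: 'when the light is on, the fan is normally on'.
rank = {(1, 1): 0, (1, -1): 1, (-1, 1): 1, (-1, -1): 0}
```

Here |$L \mid\sim F$| holds even though |$L$| does not classically entail |$F$|: the exceptional state ‘1 -1’ is retained at a less preferred level rather than excluded, and a more specific premiss can still retrieve it, which is exactly the non-monotonic behaviour described above.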
We note that a conclusion in supra-classical logic involves an ordered set of model states entailed by a premiss; in the domain of statistics these states are counter-factuals [78], and in the connectionist paradigm they correspond to stable minima. We have suggested that a supra-classical conclusion (a ‘problem solution’) would be a set of ranked minima, not just the single global minimum. Further, we re-assert that many authors regard this ability to learn exceptions as a key feature of common-sense reasoning [38, 55, 59, 60] and possibly the basis of reasoning about causality [78, 79]. 3.2 Micro-world schemata Micro-worlds, a term first coined by Minsky and Papert [72], are the experimental sand-boxes of this paper. They are simplified, defined environments about which we may logically reason at the symbolic level. In these environments, our candidate neural network representations are trained and tested against the expectations of SCNM logic. Micro-worlds are equivalent to the schemata of Balkenius and Gärdenfors [10] and were elegantly utilized by Frank et al. [26] in examining the connectionist learning of language. Here, we are using them to explore the rational consequence relation and implement preferential semantics as a proxy for common-sense reasoning. Because of the combinatorial nature of the environments related to even a 4-atom micro-world (there are more than 16!, i.e. trillions of, possible variants), it would be intractable to examine more than a representative selection of them. We have chosen to present a single example in a 4-atom micro-world for consistency. Logical discussions are frequently based around small examples, the rationality and assumptions of which are easier to examine. There was no benchmark to use in the testing of the candidate machine. So, our 4-atom micro-world was carefully designed to include a variety of factual dependencies (opposition, dependency and independence). 
These atomic relationships dictate the final probability of model states (SRL). The 4-atom micro-world requires the machine to learn a distribution over 16 states, which provides a rational semantic background for preference and inference in the example. To ensure the robustness of our results, we examined further examples in 3, 4, 5 and 6-atom micro-worlds, with alterations to the default rules and the ordering of the atoms, and variations on the assumptions as outlined below. The full results of all these variants, which are consistent with those presented in this paper, are set out in our technical report [12]. The larger 6-atom micro-worlds examined in the technical report require the machine to learn a ranking over 64 states, a non-trivial problem from the perspective of logic. From a machine perspective, it requires recall of ordered sets of local minima. We believe our experimental results support the conclusion that the Boltzmann architecture is suitable for this task, as theoretically asserted by Pinkas and Hinton, although there is no experimental evidence of this in their papers. The micro-worlds presented in this paper, used to train and test candidate machines, were incremental extensions of the simple ‘Light-Fan System’ in traditional logical usage. The logic appropriate for the basic Light-Fan System has just two atomic propositions, L standing for ‘the light is on’ and F standing for ‘the fan is on’. The states of this micro-world are then the four possible functions assigning true or false to each atom. It is convenient to depict such a function as a sequence of its outputs, which is possible if we take the order of the atomic propositions to be fixed. Thus, the state in which the light is on but the fan is off can be depicted by the binary sequence ‘1 -1’, showing that L is true (value 1) and F is false (value -1). Note that for convenience states are often labelled in an abbreviated decimal form, e.g. 
‘1 -1’ is labelled 2 and ‘1 1’ is 3. Candidate networks were trained using this binary logic. However, the machines were tested using a ternary logic, where inputs of zero stand for not observed or unknown. For example, an input premiss of ‘1 0’ stands for ‘Light on, Fan not observed’: such an input is ternary rather than binary. In this paper, we focus on micro-worlds having 4 and 6 atomic propositions or components. The additional atomic propositions were as follows: H the heater (is on), W the window (is open), A the air-conditioning (is on) and O the open fire (is lit). The motivating analogy for these micro-worlds was that of a temperature-controlled room. Factual associations or dependencies between the data elements arise from this semantic analogy (see Section 2.3). Active cooling is produced by the fan and air-conditioner, active heating by the heater and the open fire, passive dependent cooling by the window and independent illumination by the light. The analogy and its factual dependencies generate the default rule that is represented by the preference ranking on the states. The ranking represents the likelihood/frequency of model states in the micro-world; each of these states is learned as a local minimum. Figure 2 illustrates a 4-atom micro-world, whose atomic propositions are Light, Fan, Heater and Window. We revisit this example world many times in the experiments reported below. The example default rule represented by the preference relation incorporates a set of specified, ranked observations: |$\bullet $| Components with a high energy cost, the fan and heater, would typically be off. |$\bullet $| The environment is warm and therefore the fan is more likely to be on than the heater. |$\bullet $| With active cooling the window is likely to be open, whereas with active heating, the window will typically be closed. The window is a dependent component. |$\bullet $| The light may be on or off, independent of the other components. 
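The state labelling and ternary test inputs described above might be sketched as follows; the helper names are illustrative, not taken from our implementation.

```python
def label(state):
    """Decimal label of a state, reading 1 as binary 1 and -1 as binary 0."""
    return int(''.join('1' if v == 1 else '0' for v in state), 2)

def premiss(observed, atoms):
    """Ternary input vector: 1/-1 for observed atoms, 0 for 'not observed'."""
    return tuple(observed.get(a, 0) for a in atoms)
```

With atoms ordered (L, F), `label((1, -1))` is 2 and `label((1, 1))` is 3, matching the labels above, while `premiss({'L': 1}, ('L', 'F'))` gives the ternary input ‘1 0’, i.e. ‘Light on, Fan not observed’.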
Figure 2. An example micro-world consisting of four atoms (Light, Fan, Heater, Window), with semantics as described in the text, where the most preferred states are shown at the top of the ranking. States (patterns) are shown in binary, e.g. ‘1 -1 -1 -1’, and for convenience labelled with the equivalent decimal value, e.g. 8. The number (#) and frequency (%) of each state in the training set are given for each level, e.g. pattern ‘1 -1 -1 -1’ (8) is seen 16 times (16.7% of the total). There are (⁠|$4 \times 16) + (2 \times 8) + (2 \times 4) + (8 \times 1) = 96$| total patterns in this example training set. This example world is revisited many times in the experiments reported below. From these four heuristics we have constructed a rational preference relation. This preference relation denotes the probability of states in the data set from a physical micro-world, as depicted in Figure 2. Specifically, from the first observation, it follows that the most preferred Level 0 consists of all states where atoms F and H are false, ‘-1’. From the second and third observations, it follows that Level 1 consists of states where F is true, ‘1’ and H is false, ‘-1’. The dependent component W is true, ‘1’ when the fan is on. 
From the third observation, it follows that Level 2 consists of states where F is false, ‘-1’ and H is true, ‘1’. The dependent component W is false, ‘-1’ when the heater is on. From the fourth observation, it follows that Levels 0–2 contain all relevant variants of L is true, ‘1’, or L is false, ‘-1’. By default, Level 3 consists of all the remaining states, which are inconsistent with the observations that constitute the default rule, particularly those irrational states where the heater and fan are both on: Fan true ‘1’, Heater true ‘1’. Figure 2 illustrates a single example epistemic state with its preference relation in a 4-atom micro-world. This example micro-world has been deliberately designed to demonstrate a range of associations between the atomic propositions: independence of the Light, dependence of the Window on the Heater and Fan, and opposition of the Heater and Fan. The most preferred states are observed more frequently, such as the state ‘1 -1 -1 -1’ in this example, where only the light is on. Regardless of any semantic analogy used to conceptualize this micro-world, it is only a single example of the trillions of permutations possible on factual associations between 4 atoms. 3.2.1 Mapping states to network activations As noted above, we identify states with the assignment of truth-values to atomic propositions. A state in which a proposition |$\alpha $| is true is a model of |$\alpha $|⁠. In our Boltzmann machine networks, logical states are represented directly as the patterns of activation on the input or output units of the network, one unit per atom. In other words, in the example 4-atom micro-world just discussed, the logical state ‘1 -1 -1 -1’ (a model state where the Light is on, Fan off, Heater off, Window closed) is represented as the pattern of activation ‘1 -1 -1 -1’. While emphasizing different aspects of context, the terms state and pattern are effectively interchangeable in our networks. 
In effect, this correspondence in language extends further: a sentence or formula or a model state in the logic and an activation pattern in the machine are all analogous denotations. In order for a network to learn a preference order on states, training sets are designed which, in effect, allow the machine to observe a distribution of states from the environment of the micro-world. This statistical distribution arising from the factual dependencies corresponds to the preference relation in the logic. Although the logic does not require the numerical exactness of probability, it still maintains some notion of magnitude, the distance between preference levels [84]. This concept of magnitude is particularly important when considering belief revision [101]. Furthermore, all worlds share certain design assumptions: the least preferred states are usually included once in the training set (they thus have a frequency of roughly 1% of the total distribution) and there is usually an exponential change in pattern frequencies between preference levels (a doubling in frequency between levels). 3.3 Logical requirements One-to-many relationships, which map single inputs to multiple outputs, are not mathematical functions and have been disparagingly called ‘ill-posed problems’ [90, 105]. One-to-many relationships are, however, common in the real world; typical examples are kinematic solutions to the positioning of robot limbs and diagnostic classifications [27, 108]. In our case, the example is SCNM logic, which also requires a ranking of the outputs within the one-to-many relationship. We want the machine representation to accept a single input (a premiss) and to provide as output the models of the premiss distributed identically to the preference ordering in the logic. 
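As a concrete illustration of the training-set design described above, the following sketch expands a preference ranking into a training multiset; applied to the Figure 2 frequencies it reproduces the 96-pattern set. The function name and the placeholder state labels are ours.

```python
def make_training_set(levels, freq):
    """Expand a preference ranking into a training multiset:
    every state at level lvl is repeated freq[lvl] times."""
    data = []
    for lvl, states in sorted(levels.items()):
        for s in states:
            data.extend([s] * freq[lvl])
    return data

# Figure 2 example: 4, 2, 2 and 8 states at levels 0-3,
# seen 16, 8, 4 and 1 times respectively (96 patterns in total).
levels = {0: ['s%d' % i for i in range(4)],
          1: ['s%d' % i for i in range(4, 6)],
          2: ['s%d' % i for i in range(6, 8)],
          3: ['s%d' % i for i in range(8, 16)]}
freq = {0: 16, 1: 8, 2: 4, 3: 1}
```

The machine never sees the levels explicitly; it only observes the resulting frequency distribution, from which the preference relation is learned.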
It is not sufficient for the machine to learn a generative model of the joint probability distribution; we also require it to retrieve a multi-modal output given uncertain information: a discriminative case. As described in the introduction, we suggest that the necessary logical requirements of a network representation of preferential semantics (under rational consequence) are the ability to: (1) Learn a preference relation on the set of model states: a generative model, |$p(x)$|⁠. Preference is a probabilistic ranking of model states, analogous to the joint probability table. (2) Select appropriate model states based on limited information (a premiss): a discriminative model, |$p(x|u)$|⁠. These local minima are subsets entailed by the input premiss, retrieved in a magnitude-preserving ordering. (3) Compare selected model states (usually premiss and conclusion): a comparison of the input premiss and the retrieved output states. (4) Adapt its preference relation to new information: in an uncertain world, a logical agent faced with new information must often revise its view of the truth. The first three requirements encompass the process of inference in SCNM logic with rational consequence and are the subject of this paper. We specifically concentrate on the first two components of the rational consequence relation (Section 4) because learning of the preference relation and the selection of appropriate model states are related neural network tasks, which we demonstrate can be satisfied by a single machine. A separate, feed-forward network could compare the outputs of maximally preferred model states: the third requirement. The fourth requirement is the subject of a future paper. Returning to the concept of a micro-world, Figure 3 illustrates the relationship between the logical and machine levels of description. The logic exists throughout the micro-world, but its semantics and properties can only be proven mathematically at an abstract level. 
The machine in this context is not intended to be a theorem prover, and its representations are not open to symbolic interpretation [40] as was the method of previous research in reconciling the logical and neural network paradigms, Section 2.3. We are only interested in the machine outputs: that the experimental evidence they provide fits with the expected conclusions of the logic, within the context of a micro-world. We suggest that it is sufficient to show that it fulfils the requirements identified above. If the machine representation is able to faithfully learn the preference ordering and select maximally preferred models entailed by a premiss, then we assert it will always find solutions that match the properties provable in the logic, as a consequence of the shared preference relation. Figure 3. The structure of a micro-world: the shared preference relation provides a connection between the design level logic and the neural network machine. ANN = Artificial Neural Network. 3.4 Network implementation and training The network architecture used in this research is the Boltzmann machine, as discussed in Section 2.3.4. Preliminary comparison with a multi-layer perceptron (MLP) showed that the MLP could not faithfully learn an input distribution; it could not rank the appropriate one-to-many relationships as expected from the logic [12]. We were also unable to retrieve the appropriate ranking of output states from a Hopfield network. Our initial implementation of the Boltzmann machine was based on the technical descriptions of the 424-Encoder from Hinton’s papers [39, 44, 45] and the work of Aarts and Korst [1, 2]. This network was then tested on a variety of abstract and real data sets. 
During the process of applying the Boltzmann machine to the micro-worlds described above, some modifications to the Hinton architecture and learning algorithm were made. These modifications, as described below, can be regarded as minor and within the natural range of variation when implementing specific versions of a generic Boltzmann machine. Full details of the design and implementation (object-oriented code that implements parallel threads so as to collect statistical results from multiple machines) can be found in our technical report [12]. 3.4.1 Architecture The number of input and output units in our networks is dictated by the micro-world being implemented; each atom is represented by one input and one output unit. Brief results are presented in this section for 3, 4, 5 and 6 atoms, while the more detailed results in Section 4 focus on 4 and 6 atom micro-worlds. A range of numbers of hidden units was explored and was not found to be critical to performance. The final numbers of hidden nodes used were: 4 in the 3-atom worlds, 6 in the 4-atom worlds, 8 in the 5-atom worlds and 10 in the 6-atom worlds. An exception to this was found in the 3-atom world, where learning of the least preferred states or patterns was improved by adding more hidden nodes (see the result in Table 1 and the discussion following). Compared to the Hinton networks, the intra-layer connections were removed from the input and output layers after experimentation indicated they were not significantly helping performance. However, the hidden-layer intra-connections were maintained for biological plausibility, in keeping with the work of Hinton [45], Balkenius and Gärdenfors [10] and Leitgeb [62], who placed importance on inhibitory constraints. We have labelled this architecture ‘Hidden Layer Rich’, a denotation we will use in future papers. The discrete layers of our networks are similar to simple feed-forward neural network architectures. 
It was more convenient to divide the visible nodes into input and output layers. During training, in the clamped phase of each cycle, the atoms of a premiss/state (e.g. '1 1 -1 1') are clamped on both the input and output units (an auto-associative task). During testing, the specified atoms of the premiss are clamped on specific input units (+1 true and -1 false). Unspecified units are clamped with zero (an indeterminate value) and we examine the distribution of states created on the output units. The typical architecture of a network in a 4-atom micro-world is illustrated in Figure 4. Figure 4. Architecture for a modified HLR Boltzmann machine network in a 4-atom micro-world. The network is a standard 'Hinton' machine, symmetrically connected, layered with a standard bias unit in each layer. Intra-layer connectivity is maintained only in the hidden layer. Input data are clamped and output data are sampled (see text). 3.4.2 Learning algorithm Learning was carried out in accordance with the standard Boltzmann machine learning algorithm of Hinton [45]. The algorithm has two alternating phases (Figure 5): a clamped phase, where external input is applied to the visible nodes, and a resting phase, where there is no input (the network runs free). The basis of its learning (weight adjustment) is the comparison of cross-firing statistics in the clamped and resting phases. Cross-firing |$\rho _{ij}$| is determined by the product of the nodal states |$\tilde{s_i}$| and |$\tilde{s_j}$|, averaged over a large number of samples.
$$\begin{align*} & \rho_{ij} \ = \ \tilde{s_i} \times \tilde{s_j} \end{align*}$$ $$\begin{align*} & \varDelta w_{ij} \ \ = \ \ \eta \ (\rho_{ij}^+ - \rho_{ij}^-) \end{align*}$$ Figure 5. The Hebbian nature of the Boltzmann learning algorithm, including pseudo-code, where |$\eta $| is the learning rate and |$\mu $| is the momentum; |$\rho $|+ and |$\rho $|- are the cross-firing statistics across a specific weight in the clamped and free phases of the algorithm (see text). The change in weight |$w_{ij}$| between two nodes |$i$| and |$j$| is related to the difference in cross-firing between the clamped phase |$\rho _{ij}^+$| and the resting phase |$\rho _{ij}^-$|, multiplied by a learning rate |$\eta $|. In effect, this comparison is a variety of error correction (supervision) utilized to model the network states in the clamped phase. This localized Hebbian learning is biologically plausible. However, the algorithm requires simulated annealing to retrieve the cross-firing statistics, particularly in the resting phase. Simulated annealing might be regarded as the hallmark of Boltzmann learning: both a blessing and a curse. It theoretically enables optimal solutions, given a sufficiently long (possibly infinite) time. The Boltzmann distribution, from statistical mechanics, characterizes the energy distribution in the network at equilibrium, achieved by simulated annealing. An excellent discussion of simulated annealing as an optimization method can be found in Aarts and Korst [1, 2].
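The weight update above can be sketched in a few lines; this is an illustrative reading of the rule, not the authors' implementation (the helper names and list-based representation are our own):

```python
def cross_firing(samples):
    """Average co-firing rho_ij = s_i * s_j over a list of +/-1 state vectors."""
    n = len(samples[0])
    rho = [[0.0] * n for _ in range(n)]
    for s in samples:
        for i in range(n):
            for j in range(n):
                rho[i][j] += s[i] * s[j]
    return [[v / len(samples) for v in row] for row in rho]

def boltzmann_update(w, clamped_samples, free_samples, eta=0.3):
    """Delta w_ij = eta * (rho+_ij - rho-_ij): clamped minus free statistics."""
    rp = cross_firing(clamped_samples)
    rm = cross_firing(free_samples)
    n = len(w)
    return [[w[i][j] + eta * (rp[i][j] - rm[i][j]) for j in range(n)]
            for i in range(n)]
```

Note that the update needs only quantities local to the two nodes concerned, which is the sense in which the learning is Hebbian.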
Simulated annealing is a generalization of local search: $$\begin{align*} & P_{j|T} \ \ = \ \ \begin{cases} 1 & \quad \mbox{if }\ E_j < E_i,\\ \exp \left( (E_i - E_j)/k_B T \right) & \quad \mbox{if }\ E_j \geq E_i. \end{cases} \end{align*}$$ As for other methods of local search, if the new state |$j$| has less energy (is more optimal) than the old state |$i$|, the new state is always accepted. However, if the new state |$j$| is locally less optimal than the old, it may still be accepted with a probability (|$P_{j|T}$|) related to the energy difference between the states, |$E_i - E_j$|, at threshold (temperature) |$T$| (where |$k_B$| is the Boltzmann constant). Aarts and Korst regard simulated annealing as the parent algorithm for all threshold optimization methods: when the threshold (temperature) is set close to zero, the method becomes deterministic gradient descent in local search. The great benefit of simulated annealing is its up-hill search at initially high thresholds; the method has the ability to overcome local minima. However, the schedule for lowering the threshold (temperature) is critical and has to be performed slowly. $$\begin{align*} & T_k \ \ = \ \ \frac{c}{\log(k + 1)} \end{align*}$$ Here the threshold for the |$k^{th}$| iteration of the schedule is derived from the inverse log of |$k$|. As stated previously, minor modifications, which can be regarded as within the generic nature of the Boltzmann machine, were made in our implementation to either improve performance or reduce complexity: |$\bullet $| Annealing was removed from the clamped phase. During this phase the hidden nodal states are largely determined by the clamped nature of the visible nodes. |$\bullet $| Layered annealing, rather than pooled annealing, was used in the free phase. |$\bullet $| Our annealing schedules were designed based on the inverse log function from thermodynamics [31, 70].
|$\bullet $| Annealing during training and testing was configured at slightly different 'temperature' ranges, using the same inverse log function. Weight decay [56] and sparsity [41] were tested experimentally in an attempt to mitigate the Hebbian characteristic of weight saturation, but ultimately only a standard implementation of momentum was retained in our final version of the learning algorithm. A summary of our slightly modified version of the Boltzmann machine learning algorithm is presented in Figure 5. Tuning of the learning process was time consuming: details of the annealing schedules can critically influence results. A wide variety of schemes was examined: from high temperature ranges (40 |$\rightarrow $| 10) to low temperature ranges (5 |$\rightarrow $| 1), for varying temperature points and cycles at each temperature point (5–30). There was no single correct schedule. The other tuning parameters were on average: training time 2,000 epochs, learning rate 0.3, momentum 0.7 and 20 samples per pattern (for estimating |$\rho $|+ and |$\rho $|-, Figure 5). 3.4.3 Training Crucially, the output of a Boltzmann machine is not static or deterministic; it continually cycles between various states. Thus the representative sample output from a machine is a time slice, at equilibrium, of all the output states. Given the stochastic nature of the machine, the performance of individual Boltzmann machines can vary widely, with different weight configurations serving as alternative potential solutions. For this reason, our results include the accumulated output sampled from multiple machines. When looking at retrieving the whole preference relation (Section 4.1 below), we have taken 300,000 output samples over 300 machines. When looking at a single premiss, here and in Section 4.2, we have taken 60,000 output samples over 60 machines. These raw sample distributions are then converted to a percentage distribution, dividing by the total number of samples.
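The annealing step and the inverse-log schedule described earlier in this section can be sketched as follows; this is an illustrative reading only, with the Boltzmann constant folded into the temperature and the constant c and step index k treated as free parameters:

```python
import math
import random

def accept(e_old, e_new, temperature, rng=random):
    """Annealing acceptance rule: always take a lower-energy state;
    take a higher-energy state with probability exp((E_i - E_j)/T)."""
    if e_new < e_old:
        return True
    return rng.random() < math.exp((e_old - e_new) / temperature)

def schedule(c, k):
    """Inverse-log cooling: temperature for the k-th schedule step (k >= 1)."""
    return c / math.log(k + 1)
```

At high temperature the uphill branch fires often (escaping local minima); as the schedule lowers the temperature the rule approaches deterministic gradient descent.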
Data sets for Boltzmann machine training were derived from the preference relation in the logic, analogous to the machine being able to observe the frequency of states (patterns) within the environment. During training the input and output units are set to the same patterns (an auto-associative task). The distribution of patterns constitutes all of the environmental/training information and, for testing (given a specific input), the behaviour of interest is the distribution of output patterns. For each micro-world four training sets were constructed: two with an exponential increase in pattern frequencies between preference levels and two with an arithmetic increase. For each of these pairs, one training set had the least preferred patterns absent and the other had the least preferred patterns present for a single instance (a very small proportion of their training). An example of a training set with an exponential pattern distribution and least preferred patterns present was illustrated above in Figure 2 and will be used extensively in the discussion of results in Section 4.

Table 1. A basic test of training and recall: results for testing Boltzmann machines against fully specified premises, showing the frequency of the correct state in the output distribution. Results are given for most preferred and least preferred states (see text).

Testing against fully specified premises
Micro-world   Most preferred state       Output   Least preferred state     Output
3-Atom        '-1 -1 1' (1)              95%      '1 1 -1' (6)              54%
4-Atom        '-1 -1 -1 -1' (0)          92%      '-1 1 1 -1' (6)           85%
5-Atom        '1 -1 -1 -1 -1' (16)       97%      '1 1 -1 1 1' (27)         89%
6-Atom        '1 -1 -1 -1 -1 -1' (32)    96%      '-1 1 1 1 1 1' (31)       93%

The adequacy of training in each of the micro-worlds can be demonstrated by looking at the results from testing machines against fully specified premises (complete model states). When a complete state is clamped on the input units of a well-trained machine, exactly that same state should dominate the output distribution. A brief summary of the adequacy of training across a range of micro-worlds is presented in Table 1, considering one most preferred and one least preferred state in each atomic variety of micro-world. These results are for 'exponential' training sets with least preferred patterns present. Preferred states are produced as the correct output more reliably than less preferred states, which are seen less frequently in training.
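The recall test just described can be sketched as follows; the helper name and the sample data are illustrative (real samples come from a trained machine at equilibrium, here we simply echo the 4-atom row of Table 1):

```python
from collections import Counter

def percentage_distribution(samples):
    """Tally raw output samples (tuples of +/-1 unit states) and convert
    the counts to a percentage distribution."""
    counts = Counter(tuple(s) for s in samples)
    return {state: 100.0 * n / len(samples) for state, n in counts.items()}

# Hypothetical output samples for a machine clamped with a complete state:
samples = [(-1, -1, -1, -1)] * 92 + [(-1, 1, -1, -1)] * 8
dist = percentage_distribution(samples)
dominant = max(dist, key=dist.get)
```

For a well-trained machine the clamped state itself should be the dominant entry of `dist`.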
Figure 6 shows a specific example in more detail: the accumulated results for a run of 60 machines trained on the example 4-atom micro-world illustrated in Figure 2. These machines are tested against two fully specified premises: one at high preference, '-1 -1 -1 -1' (0), and one at low preference, '-1 1 1 -1' (6). The machine output in both cases is almost entirely the expected model state. Figure 6. Results for testing 60 Boltzmann machines against two fully specified premises in the example 4-atom micro-world. The preference relation has the most preferred models at the top. Model states are listed as decimal labels with their expected frequency. 4 Results We have deliberately limited the experimental data presented in this paper, as we felt the topic, a hybrid of logic and neural networks, could be confusing. In a separate technical report, we present an extensive collection of data from testing the Boltzmann machine in a selection of 3, 4, 5 and 6-atom micro-worlds [12]. The technical report supports the conclusion drawn in this paper: that the Boltzmann machine is a faithful representation of inference in SCNM logic. Here, we present an overview and samples of our results, drawn mostly from the exemplar 4-atom micro-world illustrated in Figure 2, together with a summary table of the accumulated errors relative to micro-world atomic size. We realize it is unusual to present just statistical summaries of the accumulated errors, along with representative raw data. However, we encourage the reader to look at the raw data, as this is the best way to consider whether the machine output pattern matches the logical expectation.
Sections 4.1 and 4.2 present core properties of the logic: representation of the preference relation as a generative case, and selection of the appropriate ranked model states entailed by a premiss as a discriminative case, as outlined in Section 3.3. Two important logical properties are then briefly explored in Section 4.3, as a practical illustration of inference by the Boltzmann machine. It is not necessary to prove these logical properties: as argued previously, if the machine fulfils the first two requirements we have identified above, then it can only retrieve model states expected by the logic and will therefore follow any properties held by the logic. 4.1 Preference relation: the generative case The first component we have identified as a requirement of predictive inference is the generation of a complete joint probability distribution. This preference relation can be retrieved from the machine by testing it against a neutral or null premiss: an input that contains no observed information about the state of the micro-world, '0 0 … 0'. When clamped with this input and sampled at equilibrium, the machine cycles through all the micro-world states, retrieving the learned distribution of its training set. When testing machines against the neutral premiss, the output frequencies obtained for each state can be directly compared with the expected training frequencies of each whole state, derived from the preference relation in the logic. The overall accuracy of a machine is simply indicated by the absolute percentage error at each state, compared across the whole distribution. We have presented this error estimate averaged per state |$\pm $| one standard deviation. The results confirm that the machine can learn a generative model, the whole joint probability distribution p(x,h). This model is a close approximation of the preference relation expected by SCNM logic.
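The accuracy measure just described can be sketched as follows; this is our reading of the metric, with illustrative names and distributions (the real inputs are the expected training frequencies and the machine's sampled output frequencies):

```python
import statistics

def error_per_state(expected, actual):
    """expected, actual: dicts mapping model state -> percentage frequency.
    Returns the mean absolute error per state and one standard deviation."""
    errors = [abs(expected[s] - actual.get(s, 0.0)) for s in expected]
    return statistics.mean(errors), statistics.stdev(errors)
```

Applied to the whole distribution retrieved against the neutral premiss, this yields the per-state error |$\pm $| one standard deviation reported below.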
Table 2 summarizes the Boltzmann machine's good performance, across a variety of micro-worlds with 3, 4, 5 and 6 atoms, using the metric of average error per state. As a generalization, this error per state is less than 2% |$\pm $| 0.05% and is a consequence of the stochastic design of the machine.

Table 2. Results for retrieval of the preference relation, after testing against the neutral premiss ('0 0 0 0') across a range of atomically variant micro-worlds. The low average error indicates that the preference relation (output distribution) is correctly retrieved.

Testing against the neutral premiss
Micro-world   Least preferred patterns   Average error per state (%)   |$\pm $| STDev
3-Atom        Absent                     0.9                           0.11
              Present                    1.8                           0.11
4-Atom        Absent                     0.6                           0.07
              Present                    1.6                           0.04
5-Atom        Absent                     1.6                           0.04
              Present                    1.8                           0.02
6-Atom        Absent                     0.5                           0.03
              Present                    1.1                           0.03

4.1.1 Single vs. ensemble machines These experimental results were obtained by running five sets of machines, with 60 machines in each set: a total of 300 separately trained machines. The output samples within each run were accumulated. In effect, each run of machines acted as an ensemble with joint input and output layers, the hidden layer consisting of 60 parallel machines. This architecture, implementing physical accumulation of output samples, neutralizes absolute errors on opposite sides of the mean that would otherwise result from the random seeding of the initial machine weights. We therefore present a more detailed account of the errors for both single and accumulated-output (ensemble) machines. For consistency, we have chosen to focus on the 4-atom micro-world illustrated in Figure 2. The difference between the training and the output distributions (percentage error across the whole distribution) is illustrated in the detailed statistics presented for 6,000 single machines and 100 ensemble machines with 60 parallel hidden layers, shown in Table 3 and Figures 7 and 8. Table 3. Descriptive statistics for the error between input and output distributions, comparing single and ensemble machines of similar overall size, in the 4-atom micro-world model from Figure 2. Statistics via [18].
Statistics for % error across the whole distribution
Statistic    Single (6,000)             Ensemble (60 |$\times $| 100)
Range        70.0: Min 4.4, Max 74.4    4.9: Min 8.4, Max 13.3
Quartiles    1st 19.8, 3rd 30.4         1st 9.7, 3rd 11.0
Centre       Mean 25.4, Median 24.8     Mean 10.4, Median 10.4
Variation    SE 0.10, StDev 7.8         SE 0.09, StDev 0.95

Figure 7. Histogram of % error for single HLR architecture machines (across the whole distribution in a 4-atom micro-world). Figure 8. Histogram of % error for ensemble HLR architecture machines, with 60 parallel hidden layers (across the whole distribution in a 4-atom micro-world).
There are some important issues identified in these results. Note that the total sample size for the single machines is the same as for the ensembles of 60; given this, the central limit theorem in statistics [5] cannot explain the improvement in the mean error. |$\bullet $| There are almost ideal single machines, with very low error (4.4%) across the whole distribution. However, they are rare and there is no efficient procedure for generating them. |$\bullet $| There is a huge variation in the error of single machines (total range: 70%). The mean error is moderately acceptable (25.4%) but there is a large positive skew. |$\bullet $| The best results for the ensemble machines are not as good as the best single machines. However, the mean result for even a small ensemble is considerably better (10.4%) than for a much larger group of single machines. The error approximates a normal distribution, with a very narrow variance (total range: less than 5%). |$\bullet $| The processing time for any ensemble is the same as for a single machine; all the hidden layers can be run in parallel. In summary, although almost ideal single machines exist, they are rare and there is no reliable procedure for producing them. Any small ensemble will produce fast, robust results, with low mean error and narrow variance. Further experimentation demonstrated that little benefit was obtained by increasing the ensemble size above 60 parallel hidden layers (Figure 9). Figure 9. Boxplot of % error vs. increasing number of parallel hidden layers, in HLR architecture Boltzmann machines.
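The ensemble accumulation described in this section can be sketched as follows; the data are invented to show the cancellation effect (two machines biased in opposite directions around a 50/50 target):

```python
from collections import Counter

def ensemble_distribution(per_machine_samples):
    """Pool output samples from several independently seeded machines into
    a single percentage distribution, so that errors on opposite sides of
    the mean tend to cancel."""
    pooled = Counter()
    total = 0
    for samples in per_machine_samples:
        pooled.update(samples)
        total += len(samples)
    return {state: 100.0 * n / total for state, n in pooled.items()}

machine_1 = ['a'] * 6 + ['b'] * 4   # overestimates state 'a'
machine_2 = ['a'] * 4 + ['b'] * 6   # underestimates state 'a'
dist = ensemble_distribution([machine_1, machine_2])
```

Pooling happens before the distribution is computed, which is why the ensemble mean error is much lower than the single-machine mean error despite an identical total sample size.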
4.1.2 A detailed example Looking at the averaged error between input and output distributions gives only an overview of the performance of the machine against the logic. Figure 10 presents a complete output for the specific example 4-atom micro-world, utilizing two different training sets from this micro-world: '88-0', with least preferred states absent, and '96-1', with least preferred states present. Five runs of ensemble machines, each with 60 parallel hidden layers, are tested against the neutral premiss ('0 0 0 0'). The actual output can be compared with the expected values present in the training set in the left column. Comparing the expected and actual output frequencies state by state illustrates that the network representation is able to separate states correctly according to their preference, across the spectrum of preference levels. These results demonstrate that the Boltzmann machine faithfully represents the shape of the learned input distribution. Figure 10. Example results for retrieving the preference relations, on two training sets, against the neutral premiss ('0 0 0 0') in the example 4-atom micro-world. The actual output distribution is a good match for the full preference relation/expected distribution. Because of its stochastic error, the machine is only able to usefully separate 4 or 5 levels of preference when there are more than a dozen model states. In the most complex 6-atom micro-world, with 64 model states [12], the machine was just able to maintain a separation between preference levels, because the differences in input frequencies were close to its stochastic error.
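The separability point above can be given a toy formalization; the threshold and the level frequencies here are invented for illustration, not measured values:

```python
def separable_levels(level_freqs, stochastic_error):
    """level_freqs: expected % frequency per preference level, most preferred
    first. Adjacent levels are usefully separated only if their frequencies
    differ by more than the machine's stochastic error per state."""
    return all(a - b > stochastic_error
               for a, b in zip(level_freqs, level_freqs[1:]))
```

As the number of model states grows, the per-level frequency differences shrink toward the stochastic error and the levels blur together, which is the effect observed in the 6-atom micro-world.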
The large number of non-preferred states with frequencies of zero tends to dilute the error on the few most preferred states. This favourable result would be reduced if an information-theoretic divergence [57] had been used to measure the difference between the distributions. However, that metric, drawn from the engineering domain, does not favour the representation of the least preferred model states, which are required for exception processing in SCNM logic. On the training sets where the least preferred patterns are present for just a single instance, the machine has a larger error per state: on average 1.6% when present versus 0.9% when absent. Looking at training set '96-1' in Figure 10, for example, the reader can appreciate that the machine overestimates the probability of the least preferred input patterns and underestimates the frequency of the most preferred patterns, by 3–5%. This is typical of all the situations where the machine trains on the least preferred patterns. This movement of the sample toward a central mean is not an ideal characteristic for representation of the logic. However, it may be biologically plausible given, as noted previously, that the least frequent/preferred model states are disproportionately important for exception processing. 4.2 Model selection: the discriminative case Recall that in Section 2.1 we suggest that a conclusion in supra-classical logic involves an ordered set of model states entailed by a premiss. While retrieval of the complete probability distribution is statistically important to confirm learning, specific selection of model states based on a premiss is at the heart of inference. This corresponds to the second component that we have identified as a requirement of predictive inference under the rational consequence relation (Section 3.3). This specific selection of model states provides evidence related to the conditional probability |$p(x|u)$|: a discriminative model.
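As an idealized illustration of the discriminative case (this is the logic-side computation on a known joint distribution, not the machine's sampling procedure; the states and probabilities are invented): restricting a joint distribution to the states consistent with a partial premiss and renormalizing yields the conditional p(x|u).

```python
def conditional_distribution(joint, premiss):
    """joint: dict of model-state tuple -> probability. premiss: tuple with
    +1/-1 for specified atoms and 0 for unspecified. Keep the states
    consistent with the premiss and renormalize."""
    consistent = {s: p for s, p in joint.items()
                  if all(c == 0 or c == v for c, v in zip(premiss, s))}
    total = sum(consistent.values())
    return {s: p / total for s, p in consistent.items()}

# Toy 2-atom joint distribution, conditioned on 'first atom true':
joint = {(1, 1): 0.5, (1, -1): 0.3, (-1, 1): 0.2}
cond = conditional_distribution(joint, (1, 0))
```

The machine approximates this conditional implicitly: clamping the premiss and sampling the outputs plays the role of the restriction and renormalization above.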
We have proposed that this model selection corresponds to a set of ranked states or energy minima output by a Boltzmann machine. It is difficult to provide a metric of the machine's performance in this context; for individual specific premises, numerical analysis can be misleading. Although the expected training distributions are supplied, it is qualitatively more appropriate to compare the output distribution directly to the preference relation expected from the logic, in order to consider the pattern of the results. Example results are provided. Figure 11 shows four examples in the 4-atom micro-world instance (the micro-world illustrated in Figure 2): for premiss '0 -1 1 0', the machine returns states 2 and 10 versus 3 and 11, correctly separating states at different levels in the low orders of preference. Figure 12 shows two examples in a 6-atom micro-world: for premiss '0 -1 0 -1 0 1', the machine returns states 9 and 41, correctly placing these states of like ranking at the same level. A much larger collection of examples of model selection based on individual premises in a range of micro-worlds is presented in our technical report [12]. In all the examples presented, the 'exponential' training sets utilized had the least preferred patterns present, i.e. machines are being tested in the most unfavourable circumstances. Figure 11. Results for selection of model states from four partially specified premises in the example 4-atom micro-world. Figure 12. Results for selection of model states from two partially specified premises in a 6-atom micro-world. The outputs on contiguous groups of least preferred states are accumulated: 5–8, 10–16, 18–31 etc.
In summary, the results demonstrate that, based on a partially specified premiss, the Boltzmann machine is able to: select the appropriate model states entailed by the input premiss; place these states in the correct preference ranking; and separate these states with a distance proportional to their preference level. This type of ranking equivalence is termed 'strong or magnitude preserving' [84], meaning that the equivalence maintains the appropriate states themselves, the correct ordering on the states and the correct magnitude of separation between levels of preference. It is important to remember that the logic does not require an exact probability metric [84]. With this in mind, we have provided the expected state distributions for these results as a means of qualitatively considering their logical correctness. The details of the whole preference ranking in these larger micro-worlds are available in our technical report [12] but are not necessary given the expected distributions provided. The machine often succeeds in model selection when 'asked' to rank a limited set of model states, where its performance against the neutral premiss, with the entire ordering, may have been marginal. We have not presented the average error metrics with these results, because they could not provide evidence related to the output pattern. There were results where the average error was large but the machine faithfully reproduced the preferences of the logic: the machine output pattern was correct. There were also a few results where the machine output pattern was wrong; in these cases the numerical error was small, but the machine seemed to defy the logic.
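The qualitative ranking comparison used throughout this section can be sketched as follows; the helper and the frequencies are hypothetical (the state labels echo the Figure 11 example, the numbers do not):

```python
def same_ranking(expected, actual, states):
    """expected, actual: dicts of model state -> frequency. states: the
    model states entailed by a premiss. The output is order preserving if
    ranking the entailed states by actual frequency reproduces the
    expected preference ordering."""
    by_expected = sorted(states, key=lambda s: expected[s], reverse=True)
    by_actual = sorted(states, key=lambda s: actual[s], reverse=True)
    return by_expected == by_actual

expected = {2: 25.0, 10: 20.0, 3: 6.0, 11: 4.0}
good = {2: 27.0, 10: 19.0, 3: 5.0, 11: 3.0}   # same ordering, different magnitudes
bad = {2: 10.0, 10: 19.0, 3: 5.0, 11: 3.0}    # swaps the top two states
```

This captures why a large numerical error can coexist with a correct output pattern: the ordering, not the exact magnitudes, is what the logic requires.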
The unexpected results just mentioned occurred where selection of states at the same high level of preference required a secondary selection based on a dependent variable. An example can be seen in the next section, 4.3 (Figure 15, premiss '0 0 -1 0'), where states 0 and 8 have different output frequencies from 1 and 9, despite being at the same preference level. In these circumstances, the machine seems to perform a tie-break conditional on the atomic distribution of the dependent variable. These results, and the surprising, fundamental issues they raise, will be examined in a future paper. 4.3 Logical properties We present two examples of properties provable in SCNM logic within the context of these micro-worlds. These specific properties were chosen because of their importance to the logical formalization of defeasibility. We present them in relation to our assertion that the Boltzmann machine will always provide support for the logical properties that hold in a micro-world, given the two required characteristics that we have identified and demonstrated (Sections 4.1 and 4.2). Part of the motivation for presenting them is to demonstrate that the third component of inference (Section 3.3), the comparison of selected model states, could easily be achieved. 4.3.1 Non-monotonicity Non-monotonicity is a refutation of the absolute truth of classical logic and can be seen as a foundational property of SCNM logic (Section 2.1). Yet in isolation, the property might be regarded as too weak (irrational).
Recalling the definition of defeasible entailment: $$\begin{align*} & \alpha \mid\sim \beta \quad \longleftrightarrow \quad Maximal\ Models \lbrack \alpha \rbrack \subseteq Models \lbrack \beta \rbrack \end{align*}$$ The principle of non-monotonicity states that: $$\begin{align*} & \textrm{Given}\ \alpha \mid\sim \beta \textrm{, then for some}\ \gamma, \ (\alpha \wedge \gamma) {{\mathrel{| \!{\diagup} \!\!\!\!\!{\sim}}}} \beta \end{align*}$$ where semantically |$\gamma $| represents new information, a condition that is difficult to denote within the language. An example of this property in our 4-atom micro-world is $$\begin{align*} & F \mid\sim W \ \textrm{but}\ (F \wedge H) {{\mathrel{| \!{\diagup} \!\!\!\!\!{\sim}}}} W. \end{align*}$$ The Boltzmann machine is able to provide supporting evidence for this property; see Figure 13. Figure 13. Non-monotonicity: output model states for three premises in the example 4-atom world, where '>>>' indicates support for, and '!!!' indicates failure of, entailment of W by the most preferred models of F |$\wedge $| H. Consider the two premises F ('0 1 0 0') and W ('0 0 0 1'). The machine returns the maximally preferred models of F as states 5 and 13; these models are clearly a subset of the models of W (as returned by the machine in the middle panel). Whereas if we look at the premiss F |$\wedge $| H ('0 1 1 0'), the maximally preferred models returned by the machine include 6 and 14; these model states are not a subset of the models of W (as returned by the machine in the middle panel).
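The definition recalled above can be checked mechanically; a sketch on an abstract toy ordering (the states and frequencies are invented for illustration, not taken from the micro-world):

```python
def maximal_models(models, preference):
    """models: set of states satisfying a premiss. preference: dict of
    state -> frequency/rank, higher being more preferred. Returns the
    maximally preferred states among the given models."""
    best = max(preference[m] for m in models)
    return {m for m in models if preference[m] == best}

def defeasibly_entails(models_alpha, models_beta, preference):
    """alpha |~ beta iff Maximal Models[alpha] is a subset of Models[beta]."""
    return maximal_models(models_alpha, preference) <= models_beta

preference = {'a': 3, 'b': 3, 'c': 1}
models_alpha = {'a', 'b', 'c'}
models_beta = {'a', 'b'}
models_alpha_and_gamma = {'c'}   # new information excludes the preferred states
```

The toy case mirrors non-monotonicity: alpha defeasibly entails beta, yet conjoining gamma, which leaves only the dispreferred state, defeats the entailment.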
4.3.2 Rational monotonicity Rational monotonicity is a more complex retraction and goes some way toward re-establishing classical entailment: it provides a rational boundary to the disorder that would ensue from unchecked non-monotonicity. It allows unrestricted update in situations of independence between atoms: doxastic independence (discussed in a future paper). Rational monotonicity is required by defeasible reasoning in addition to the six other preferential properties of KLM: Reflexivity, Left Logical Equivalence, Right Weakening, Cut, Cautious Monotonicity and Or [60]. The principle of rational monotonicity has two components: $$\begin{align*} & \textrm{Given}\ \alpha \mid\sim \beta \textrm{, then:} \\ & \quad \quad \textrm{either}\ (\alpha \wedge \gamma) \mid\sim \beta \\ & \quad \quad \textrm{or, if}\ \alpha \mid\sim \neg \gamma \textrm{, then}\ (\alpha \wedge \gamma) {{\mathrel{| \!{\diagup} \!\!\!\!\!{\sim}}}} \beta \end{align*}$$ Examples of the two components of this property in our 4-atom micro-world are: $$\begin{align*} & F \mid\sim W \ \textrm{and}\ (F \wedge L) \mid\sim W \\ & F \mid\sim W \ \textrm{but}\ (F \wedge H) {{\mathrel{| \!{\diagup} \!\!\!\!\!{\sim}}}} W, \ \textrm{as}\ F \mid\sim \neg H \end{align*}$$ For the first component, see Figure 14. From the machine output for the premises F (‘0 1 0 0’) and W (‘0 0 0 1’) we can see, as previously, that the machine supports the defeasible entailment F |$\mid \sim $| W. Considering the premiss F |$\wedge $| L (‘1 1 0 0’), we can see that the machine returns 13 as the maximally preferred model state, which is a subset of the models of W (as returned by the machine in the middle panel). The light (L) is a doxastically independent atom/proposition, which does not affect the rationality of the previous defeasible entailment. Figure 14. Rational monotonicity, Part 1. Output model states for premises in the example 4-atom world (Light, Fan, Heater, Window), where ‘>>>’ supports entailment by the most preferred models. Figure 15. Rational monotonicity, Part 2. Where ‘>>>’ supports entailment by the most preferred models. For the second component, we only need to add evidence of F |$\mid \sim \neg $|H to the outputs already given in Figure 13, which already illustrate (F |$\wedge $| H) |${ {\mathrel{| \!{\diagup } \!\!\!\!\!{\sim }}}}$| W. The outputs from the machine for the premiss |$\neg $|H (‘0 0 -1 0’) are shown in Figure 15; they include the maximally preferred models of premiss F (‘0 1 0 0’), which are model states 5 and 13. The rationality is that the heater (H) and the fan (F) are in active opposition to each other. It is not intended that the exposition of these logical properties provide any additional evidence of veracity for the machine as a representation of the logic. However, they illustrate: |$\bullet $| The practical utility of the machine as a representation of the logic. |$\bullet $| That two important logical properties can be emulated by the machine, based on our assertion, as a consequence of the shared preference relation. |$\bullet $| The ease with which output model states from two separate sentences could be compared by an ‘observer machine’, to complete the requirements of inference. 4.4 Summary of results In the introduction and in Section 3.3, we identified four necessary logical requirements of a neural network to represent inference under the rational consequence relation in SCNM logic.
The experimental results presented do not constitute a mathematical proof, but they do provide compelling evidence that the first two of the necessary requirements are fulfilled by the Boltzmann machine. These requirements constitute the greatest challenge to such a machine representation. Seen from the standpoint of the machine, they are: the ability to learn an input probability distribution constituting a preference relation; and the ability to retrieve one-to-many mappings constituting appropriate model selection entailed by a partially specified premiss. In the experimental context of these moderate-sized logical micro-worlds, the Boltzmann machine is a faithful representation. It is able to learn a preference relation with numerical accuracy (average error per state |$\leq $| 2%) and to select appropriate model states based on the limited information available in a premiss. It maintains a ‘strongly magnitude preserving equivalence’. Our experimental evidence is supported theoretically by the work of Pinkas [84] in specific SNNs, and indirectly by SRL utilizing Markov models. The properties of SCNM logic arise from the preferential semantics of the micro-world, combined with the inference process we outlined as requirements in Section 3.3. If the machine can learn the preferential semantics of an environment and follow the process of inference, in terms of appropriate selection of model states from a premiss, then it can only retrieve the solutions expected by SCNM logic. It should not be surprising that the Boltzmann machine is able to retrieve the results of its probabilistic learning since, at equilibrium, the Boltzmann distribution around which the machine is constructed represents the likelihood of the learned states. Yet we suggest that these properties, the ability to retrieve a ranked set of output states characterizing a generative model and the retrieval of one-to-many relationships in the context of a discriminative model, are rare among neural networks.
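The equilibrium argument above can be made concrete with a few lines of arithmetic. In the sketch below, the per-state energies are assumed illustrative values, not learned ones: the Boltzmann distribution ranks states exactly inversely to their energy, and clamping a premiss simply renormalizes over the consistent states, so the states retrieved most often are the maximally preferred models.

```python
# Sketch: why equilibrium retrieval reflects a learned preference relation.
# p(s) = exp(-E(s)/T) / Z ranks states inversely to energy.
import math

# Assumed energies over some 4-bit states (bits: Light, Fan, Heater, Window);
# lower energy = more preferred. These values are invented for illustration.
energies = {"0101": 0.0, "1101": 0.0, "0100": 1.0, "1100": 1.0,
            "0111": 2.0, "1111": 2.0, "0000": 0.5, "0001": 0.5}
T = 1.0

Z = sum(math.exp(-e / T) for e in energies.values())
p = {s: math.exp(-e / T) / Z for s, e in energies.items()}

# Clamp the premiss F ('0 1 0 0'): keep only states with the Fan bit on,
# then renormalize -- the conditional distribution the machine samples.
consistent = {s: q for s, q in p.items() if s[1] == "1"}
zc = sum(consistent.values())
posterior = {s: q / zc for s, q in consistent.items()}

# The most probable retrieved states are the minimum-energy models of F.
best = max(posterior.values())
top = sorted(s for s, q in posterior.items() if abs(q - best) < 1e-9)
print(top)   # ['0101', '1101']
```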
Further, these properties are fundamental requirements of any network aspiring to embody inference in SCNM logic. In a broader context of common-sense reasoning, there are additional considerations that this paper only introduces. In the section on network requirements (Section 3.3) there were two further items. The third requirement, comparing model states, can, we suggest, be easily implemented using a separate, feed-forward network. The fourth requirement, the ability to adapt to new information, will be addressed in a future paper. There are also biological prerequisites for any network hoping to offer some insight into human cognition (Section 2.3.5). 5 Discussion In conclusion, we would like to place this paper in the general context of cognition and to present a brief overview of our research, including future papers. 5.1 Logic, cognition and the Boltzmann machine The brains of all living creatures consist of neural networks: massively interconnected collections of individual nodal cells. While the exact mechanics of these networks are still in doubt, and are likely to be varied, they retain a distributed representation of the information that they learn [104]. While abstractions can be of value in providing potential representational models, any conjecture about cognition should at least have a possible basis in neural network implementation. Harnad’s tripartite level theory of cognition [35, 36] is a widely cited conjecture [30, 58]. It postulates three levels of processing: an iconic level of representation at the sensory boundary with the environment, a categorical level of invariance detection, and a higher symbolic level where actual reasoning takes place. Harnad regards networks as purely syntactic structures and is reluctant to credit any role for a neural network implementation at the symbolic level: ‘connectionist networks do not have the systematic semantic properties that cognitive phenomena (possess)’ [36].
This view is contradicted, however, by the seminal experiments of Frank et al. [26]. Frank demonstrated that artificial neural networks can represent the physical semantics of an environment, including predicate relationships, even when trained on a syntactically incomplete language. In presenting a level-based framework similar to that of Harnad, Radermacher [87] reaches the conclusion: ‘In humans, the logical and symbolic functions of the brain are realized within a biological neural network’. We believe this to be the majority view. Currently, neural networks are the only practical models available for the implementation of cognition. The purpose of this paper is to provide a link between one abstract representation of cognition, SCNM logic, and another slightly less abstract model, a Boltzmann neural network: to demonstrate that these models are complementary. We do not propose that a Boltzmann-like network is the sole cognitive mechanism of common-sense reasoning. Even within the limited scope of this paper, the Boltzmann network needs to be extended to capture the third requirement of inference: the comparison of output model states (Section 3.3). However, the machine’s stochastic activation functions and probabilistic learning are likely requirements for a distributional representation of the environment, which is the statistical basis for logical preference [43]. In this paper, we have demonstrated that these characteristics, learning a generative model and selection of model states (‘strong equivalence’), are the prerequisites for representing SCNM logical inference. This functionality could be seen as a necessary component of a larger network structure implementing rational consequence. Looking at the individual characteristics of the Boltzmann machine for biological plausibility (Section 2.3.5), there are other favourable properties, particularly in relation to Hebbian learning.
The Boltzmann machine is one of the few networks that relies on cross-firing statistics: it is a truly remarkable algorithm that directs the adaptation of the entire network based solely on local information. It is likely that this adaptation, via cross-firing, has a simple correlate in potentiation across a synapse (Hebb [37]) and is related to long-term memory [95, 102, Section 8: Implementing Reasoning in Neural Networks]. It may be physically correlated with growth and ‘pruning’ of the dendritic tree. The division of the Boltzmann learning algorithm into two phases, one requiring no external input into the network, is also extremely biologically plausible. Most complex organisms sleep. Sleep may be the physical correlate of a dual-phase learning algorithm required for weight update in the consolidation of memory [95]. Simulated annealing would seem to be the most biologically implausible of the machine’s properties. However, from an optimization point of view, simulated annealing is the most comprehensive and adaptable of the threshold optimization algorithms [1]. It has the ability to overcome irregularities in the solution space, which are typical of real biological tasks. In fact, it is implausible that a deterministic variant of threshold optimization, strictly limited to gradient descent, would have evolved as the primary means of representing a biological world. We can only speculate about the neurobiological basis of annealing at a molecular level, because of our incomplete understanding of complex biological mechanisms. However, simulated annealing, like cross-firing, has the advantage that it only requires local implementation. Recurrent firing would only need to produce local changes in neuromodulator chemicals that alter excitability across a synapse. Such rapid (time scale of seconds) synaptic chemical changes have been proposed by other authors [102, 109] as a basis of fast functional linkage and short-term memory.
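The locality of the two-phase rule can be illustrated on a toy, fully visible Boltzmann machine (no hidden layer). In the sketch below, the positive phase takes cross-firing statistics directly from clamped data, the negative phase estimates them by annealed Gibbs sampling, and each weight changes using only the firing statistics of the two units it connects. The data, learning rate and cooling schedule are invented for the example, not the paper's experimental settings.

```python
# Sketch: two-phase, purely local Boltzmann learning on a 3-unit, fully
# visible machine. All hyperparameters here are illustrative assumptions.
import math
import random

random.seed(0)
N = 3
w = [[0.0] * N for _ in range(N)]                  # symmetric, zero diagonal
data = [(1, 1, 0), (1, 1, 0), (0, 0, 1)]           # toy environment

def gibbs_sweep(state, T):
    """One sweep of stochastic unit updates at temperature T."""
    s = list(state)
    for i in range(N):
        net = sum(w[i][j] * s[j] for j in range(N) if j != i)
        p_on = 1.0 / (1.0 + math.exp(-net / T))
        s[i] = 1 if random.random() < p_on else 0
    return tuple(s)

def free_correlations(n_samples=2000):
    """Negative phase: sample the unclamped machine, cooling T from 4 to 1."""
    corr = [[0.0] * N for _ in range(N)]
    s = tuple(random.randint(0, 1) for _ in range(N))
    for k in range(n_samples):
        T = max(1.0, 4.0 - 3.0 * k / n_samples)    # simple annealing schedule
        s = gibbs_sweep(s, T)
        for i in range(N):
            for j in range(N):
                corr[i][j] += s[i] * s[j] / n_samples
    return corr

def clamped_correlations():
    """Positive phase: cross-firing statistics taken directly from the data."""
    corr = [[0.0] * N for _ in range(N)]
    for v in data:
        for i in range(N):
            for j in range(N):
                corr[i][j] += v[i] * v[j] / len(data)
    return corr

eps = 0.2
for _ in range(30):                                # local, Hebbian-style updates
    plus, minus = clamped_correlations(), free_correlations()
    for i in range(N):
        for j in range(N):
            if i != j:
                w[i][j] += eps * (plus[i][j] - minus[i][j])

# Units 0 and 1 co-fire in the data, so their weight grows positive, while
# unit 2 never co-fires with unit 0, so that weight is driven negative.
print(w[0][1] > 0, w[0][2] < 0)
```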
Taking this family of networks even closer to the neurobiology, a version of the Boltzmann machine based on spiking activation (where node activation is represented by firing spikes rather than a single real number) was first proposed by Hinton and Brown [42]. Good performance of this spiking Boltzmann model has been demonstrated in visual recognition tasks [17, 73]. 5.2 I: Representation The current paper contributes to the domain of cognition by advancing a connectionist model for the representation of SCNM logic. SCNM logic can be seen as a formalism of common-sense reasoning. It specifically requires a ranking of preferred conclusions in the context of inference under the rational consequence relation. This ranking of model states incorporates the less preferred counter-examples, which are the basis of exception processing and possibly of reasoning about causality. In probabilistic terms, this ranking is the theoretical equivalent of the energy minima within a symmetric neural network. Information and energy: in the context of an SNN, these concepts are alternative characterizations. From the previous literature, particularly the research of Pinkas, we know that logical formulae can be mathematically translated to a specific neural network structure, so that conclusions in the logic are represented by energy states in the network. The Boltzmann machine can learn multiple optima that are represented by the partition energy function at equilibrium: the Boltzmann distribution. The Boltzmann machine may be unique in its ability to retrieve a distribution ‘strongly equivalent’ to its training set. Probability and stochastic activation: ranking with ‘strong equivalence’ can only be retrieved by a neural network with stochastic activation functions, sampling from an energy distribution: any such network is analogous to the Boltzmann machine. The default rule, and the consequent preference relation of SCNM logic, are a qualitative counterpart of a generative model in probability.
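The Pinkas-style translation mentioned above can be illustrated in a few lines: a propositional formula is compiled into an energy function built from clause-violation penalties, and its global minima coincide with the formula's models. The formula and penalty construction below are an arbitrary example, not taken from the paper.

```python
# Sketch: compiling a formula to an energy function whose global minima are
# exactly its models (after the Pinkas-style translation; example invented).
from itertools import product

# Formula: (A or B) and (not A or C), over atoms A, B, C.
def energy(a, b, c):
    # Each violated clause contributes a positive penalty term; the terms
    # are products of unit states, i.e. realizable as symmetric connections.
    return (1 - a) * (1 - b) + a * (1 - c)

states = list(product((0, 1), repeat=3))
minima = {s for s in states if energy(*s) == 0}
models = {s for s in states if (s[0] or s[1]) and ((not s[0]) or s[2])}
print(minima == models)   # True: energy minima coincide with the models
```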
Stochastic activation functions are the underlying basis of probabilistic representation in neural networks, be they symmetrically recurrent or feed-forward. The stochastic activation allows the network to sample a distribution in the manner of a Markov model. There were no traditional benchmarks for SCNM logic with which to test the hypothesized Boltzmann model. This paper therefore utilized logical micro-world environments so that the outputs of the network model could be compared to inferential conclusions in the logic; other authors have used similar schemata. The experimental results from a variety of generic micro-worlds, with incremental numbers of atoms, supported the view that the Boltzmann machine is a faithful model of SCNM logic. It was able to learn a preference relation and to retrieve appropriately ranked model states entailed by a premiss, in the context of inference under rational consequence. Ideal single machines with very small errors per state are possible, but difficult to generate efficiently during training. The paper offers a solution of ensemble machines, accumulating a single output from multiple parallel hidden layers. This construction reduces the state errors and enables fast, robust learning. Further, we have shown that this machine emulates two important example properties of the logic (Section 4.3), as a demonstration of the practical utility of the representation. In terms of human cognition, there is evidence from the domain of neuroscience that spiking neural networks, utilizing the restricted Boltzmann machine, can be trained using an event-driven variant of Hebbian learning in large neuromorphic systems. Plausibility versus efficiency: the biological and engineering domains often have competing requirements. Hinton’s original version of the Boltzmann machine had a rich hidden layer, consistent with the role of inhibitory constraint within the biological cortex. It implements simulated annealing to achieve optimization.
Simulated annealing can be viewed as the parent algorithm of threshold optimization, inherently suitable for a disordered biological environment. Both of these implementations make the standard machine less computationally efficient than the restricted machine. However, in the context of a massively parallel, biological system, they may not be such a disadvantage. Scope: the current paper is limited to a propositional syntax, without predicates, because we were attempting to examine the broad domain of supra-classical logic for the first time. Incorporating the requirements of generic learning and the tolerance of counter-examples in this broader context considerably narrows the field of candidate neural network representations for SCNM logic. The paper offers the Boltzmann machine as a practical representation of SCNM logic. It supports the place of a Boltzmann-like mechanism in spiking neural networks with stochastic Hebbian learning. Further, the model will enable the experimental investigation of domains, such as typicality and belief revision, which are current areas of mathematical conjecture in logic. We plan two future papers building on this Boltzmann machine model, for which the experimentation is largely completed. The first concerns the nature of the knowledge representation within the model, and the way in which different interpretations are incongruent. The second is an exploration of the network’s ability to adapt to new information. These are summarized in the sections below. 5.3 II: Incongruence Closer examination of our initial results with the Boltzmann machine revealed that every training set of vectors contains dual information: about whole-state frequencies and about atomic activation frequencies. The whole-state frequency distribution is equivalent to whole-state preference in logic. However, a separate state distribution can be reconstructed from the product of atomic frequencies. These dual information distributions are often incongruent.
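The incongruence can be seen directly in a two-atom toy training set, an assumption constructed for illustration: when the atoms are perfectly correlated, the whole-state frequency of a state and the product of its atomic frequencies disagree.

```python
# Sketch: the "dual information" in a training set. Whole-state frequencies
# and the distribution reconstructed from atomic (per-bit) frequencies
# generally disagree; the toy vectors below are chosen to make this visible.
from collections import Counter
from itertools import product

train = [(1, 1), (1, 1), (0, 0), (0, 0)]   # two atoms, perfectly correlated

n = len(train)
whole = {s: c / n for s, c in Counter(train).items()}       # state frequencies
marg = [sum(v[i] for v in train) / n for i in range(2)]     # atomic frequencies

# Reconstruct a state distribution as the product of atomic frequencies.
atomic = {s: (marg[0] if s[0] else 1 - marg[0]) *
             (marg[1] if s[1] else 1 - marg[1])
          for s in product((0, 1), repeat=2)}

print(whole[(1, 1)])    # 0.5  -- whole-state preference
print(atomic[(1, 1)])   # 0.25 -- product of atomic frequencies
```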
Experimentation with the architecture of the Boltzmann machine produced a machine with a restricted hidden layer that more closely selected the whole-state preference ranking of traditional logic. Preference versus typicality: this future paper proposes that the distribution of states reconstructed from atomic frequency is the probabilistic equivalent of typicality in logic. Although the term typicality is widely used in the domain of logic, the traditional view of preferential semantics in non-monotonic logic does not utilize it. We believe that there is no adequate definition in logic based on compositional, atomic characteristics. We argue, by counter-example, that representation of typicality by ‘minimal model semantics’ is incorrect and that atomic typicality requires a separate ranking from whole-state preference. The paper attempts to provide an atomic definition of typicality based on the experimental results from the Boltzmann model. 5.4 III: Adaptation Any cognitive agent must be able not only to draw inferences from the environment but also to adapt to changes in the environment. Belief revision in logic formalizes the structure of adaptation. It is a relatively young domain; the landmark AGM paper was published in 1985 [4] and there is still a surfeit of competing theories. In general, neural networks have difficulty retaining previous learning when exposed to new data. They are, by default, irrationally non-monotonic. This future paper utilizes a variation of the Boltzmann machine learning algorithm to implement pseudo-rehearsal [89], which allows the network to maintain past learning during re-training. It enables experimental examination of the machine plausibility of current theories of belief revision in the logic. Order versus chaos: the experimental results from re-training the Boltzmann machine suggest that the current approaches to belief revision in logic only apply in a limited number of worlds that are rationally congruent.
In the majority of worlds, which are chaotic and incongruent, typicality provides a ranking of states based on their individual atomic probabilities. In these disordered worlds, inconsistencies in whole-state preference, between pre-existing rules and the revised models, cannot be rationally accommodated with reference to a unitary state exemplar. It is hypothesized that exception processing using atomic typicality is the evolutionary basis of biological adaptation.

Dedication

Dedicated to our friend and colleague, Willem Labuschagne, who died on 19 March 2019.

References

[1] E. Aarts and J. Korst. Simulated Annealing and Boltzmann Machines. Inter-science Series in Mathematics and Optimization. John Wiley and Sons, 1990.
[2] E. Aarts and J. Korst. Simulated annealing. In Local Search in Combinatorial Optimization, pp. 91–120. John Wiley and Sons, 1997.
[3] E. Adams. The logic of conditionals. Inquiry, 8, 166–197, 1965.
[4] C. Alchourron, P. Gärdenfors and D. Makinson. On the logic of theory change: partial meet contraction and revision functions. Journal of Symbolic Logic, 50, 510–530, 1985.
[5] C. Annis. Central limit theorem (summary). Statistical Engineering, 2014. http://www.statisticalengineering.com/central_limit_theorem_(summary).htm
[6] F. Bacchus. Representing and Reasoning With Probabilistic Knowledge. PhD Thesis, University of Alberta, 1988.
[7] F. Bacchus. A logic for representing reasoning with statistical knowledge. Computational Intelligence, 6, 209–231, 1990.
[8] F. Bacchus. Default reasoning from statistics. In Proceedings AAAI, pp. 392–398, 1991.
[9] F. Bacchus. From statistical knowledge bases to degrees of belief. Artificial Intelligence, 87, 75–143, 1996.
[10] C. Balkenius and P. Gärdenfors. Non-monotonic inferences in neural networks. In Principles of Knowledge Representation and Reasoning, pp. 32–39, 1991.
[11] G. Blanchette. The Boltzmann Machine: A Connectionist Model for Supra-Classical Logic. PhD Thesis, Otago University, New Zealand, 2018. https://ourarchive.otago.ac.nz/handle/10523/8312
[12] G. Blanchette, B. McCane, W. Labuschagne and A. Robins. Towards a representation of non-monotonic inference in an artificial neural network. Technical Report, Otago University Press, Computer Science, 2015. http://www.cs.otago.ac.nz/research/publications/OUCS-2014-04.pdf
[13] R. Byrne. Suppressing valid inferences with conditionals. Cognition, 31, 61–83, 1989.
[14] F. Chang. Symbolically speaking: a connectionist model of sentence production. Cognitive Science, 93, 1–43, 2002.
[15] H. Chen and A. Murray. Continuous restricted Boltzmann machine with an implementable training algorithm. IEEE Proceedings of Visual Image Processing, 150, 153–158, 2003.
[16] P. Cheng and K. Holyoak. Pragmatic reasoning schemas. Cognitive Psychology, 17, 391–416, 1985.
[17] A. Courville, J. Bergstra and Y. Bengio. A spike and slab restricted Boltzmann machine. Artificial Intelligence and Statistics, 1, 233–241, 2011.
[18] CRAN. A language and environment for statistical computing. 2014. www.r-project.org/
[19] A. d’Avila Garcez and L. Lamb. Neurosymbolic AI: the 3rd wave. Technical Report, Cornell University, 2020. https://arxiv.org/abs/2012.05876
[20] A. d’Avila Garcez, L. Lamb and D. Gabbay. Connectionist modal logic: representing modalities in neural networks. Theoretical Computer Science, 371, 34–53, 2007.
[21] A. d’Avila Garcez, L. Lamb and D. Gabbay. Neural-Symbolic Cognitive Reasoning. Cognitive Technologies. Springer, 2009.
[22] E. Davis and L. Morgenstern. Introduction: progress in formal common-sense reasoning. Artificial Intelligence, 153, 1–12, 2004.
[23] M. Egger. The Boltzmann machine: a survey and generalization. Technical Report TR 805, Massachusetts Institute of Technology, 1988.
[24] S. Eslami, N. Heess and J. Win. The shape Boltzmann machine: a strong model of object shape. In IEEE Computer Vision and Pattern Recognition, pp. 406–413, 2012.
[25] J. Fodor and Z. Pylyshyn. Connectionism and cognitive architecture: a critical analysis. Cognition, 28, 3–71, 1988.
[26] S. Frank, W. Haselager and I. van Rooij. Connectionist semantic systematicity. Cognition, 110, 358–379, 2009.
[27] H. Freeman. Neural Networks: Algorithms, Applications and Programming Techniques. Addison-Wesley, 1994.
[28] D. Gabbay, C. Hogger and J. Robinson. Handbook of Logic in Artificial Intelligence and Logic Programming, Volume 3: Nonmonotonic Reasoning and Uncertain Reasoning. Oxford UP, Oxford, 1994.
[29] P. Gärdenfors. How logic emerges from the dynamics of information. In Logic and Information Flow, pp. 49–77. MIT Press, Cambridge, 1994.
[30] P. Gärdenfors. Conceptual Spaces. MIT Press, Cambridge, 2004.
[31] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741, 1984.
[32] L. Getoor and B. Taskar. Introduction to Statistical Relational Learning. MIT Press, Cambridge, 2007.
[33] P. Girard and K. Tanaka. Paraconsistent logics. Synthese, 193, 1–14, 2016.
[34] M. Gomez-Torrente. Alfred Tarski. Stanford Encyclopedia of Philosophy, 2015. http://plato.stanford.edu/archives/win2015/entries/tarski/
[35] S. Harnad. Categorical Perception: The Groundwork of Cognition. Cambridge University Press, 1987.
[36] S. Harnad. The symbol grounding problem. Physica D, 42, 335–346, 1990.
[37] D. Hebb. The Organisation of Behaviour. John Wiley and Sons, 1949.
[38] J. Heidema and W. Labuschagne. Knowledge and belief: the agent-oriented view. In Culture in Retrospect, pp. 194–214. UNISA Press, 2001.
[39] G. Hinton. Deterministic Boltzmann learning performs steepest descent in weight space. Neural Computation, 1, 143–150, 1989.
[40] G. Hinton. Preface to the special issue on connectionist symbol processing. Artificial Intelligence, 46, 1–4, 1990.
[41] G. Hinton. A practical guide to training restricted Boltzmann machines. Technical Report TR 2010-003, University of Toronto, Machine Learning, 2010.
[42] G. Hinton and A. Brown. Spiking Boltzmann machines. NIPS, 122–128, 1999.
[43] G. Hinton, S. Osindero and Y. Teh. What kind of graphical model is the brain? 2000. www.cs.toronto.edu/hinton/talks/ijcai3.ppt
[44] G. Hinton, T. Sejnowski and D. Ackley. Boltzmann machines: constraint satisfaction networks that learn. Technical Report TR 84-119, Carnegie-Mellon University, Computer Science, 1984.
[45] G. Hinton, T. Sejnowski and D. Ackley. A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169, 1985.
[46] J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554–2558, 1982.
[47] A. Irvine. Bertrand Russell. Stanford Encyclopedia of Philosophy, 2015. http://plato.stanford.edu/archives/win2015/entries/russell/
[48] E. Izhikevich. Which model to use for cortical spiking neurons? IEEE Transactions on Neural Networks, 15, 1063–1070, 2004.
[49] A. Jagota. Representing discrete structures in a Hopfield-style network. In Neural Networks for Knowledge Representation and Inference, pp. 123–142. Lawrence Erlbaum, 1994.
[50] J. Kennedy. Kurt Gödel. Stanford Encyclopedia of Philosophy, 2016. http://plato.stanford.edu/archives/win2016/entries/goedel/
[51] K. Kersting, L. De Raedt and T. Raiko. Logical hidden Markov models. Journal of Artificial Intelligence Research, 25, 425–456, 2006.
[52] H. Khosravi and B. Bina. A survey on statistical relational learning. Canadian Artificial Intelligence, LNAI 6085, 256–268, 2010.
[53] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proceedings AAAI, vol. 15, pp. 580–587, 1998.
[54] R. Koons. Defeasible reasoning. Stanford Encyclopedia of Philosophy, 2014. https://plato.stanford.edu/archives/spr2014/entries/reasoning-defeasible/
[55] S. Kraus, D. Lehmann and M. Magidor. Non-monotonic reasoning, preferential models and cumulative logics. Artificial Intelligence, 44, 167–207, 1990.
[56] A. Krogh and J. Hertz. Simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, vol. 4, pp. 950–957. Morgan Kaufmann, 1995.
[57] S. Kullback and R. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86, 1951.
[58] W. Labuschagne and J. Heidema. Towards agent-oriented logic: (i–ii) variations on the theme of logical consequence. Technical Report, Otago University, 2010.
[59] W. Labuschagne, J. Heidema and K. Britz. Supra-classical consequence relations: tolerating rare counter-examples. In Advances in AI, Springer LNAI, pp. 326–337, 2013.
[60] D. Lehmann and M. Magidor. What does a conditional knowledge base entail? Artificial Intelligence, 55, 1–60, 1992.
[61] H. Leitgeb. Nonmonotonic reasoning by inhibition nets. Artificial Intelligence, 128, 161–201, 2001.
[62] H. Leitgeb. Inference on a Low Level: An Investigation into Deduction, Non-Monotonic Reasoning and the Philosophy of Cognition. Applied Logic Series, vol. 30. Kluwer Academic, 2004.
[63] H. Leitgeb. Neural network models of conditionals: an introduction. In International Workshop on Logic and Philosophy of Knowledge, X. Arrazola and J. Larrazabal, eds, pp. 191–223, 2007.
[64] D. Lenat. The Cyc project. www.Cyc.com, 2016.
[65] D. Makinson. Bridges between classical and nonmonotonic logic. Journal of the IGPL, 11, 69–96, 2003.
[66] D. Makinson. Bridges from Classical to Nonmonotonic Logic. King’s College Publications, 2005.
[67] P. Mazzoni, R. Anderson and M. Jordan. A more biologically plausible learning rule for neural networks. Proceedings of the National Academy of Sciences, 88, 4433–4437, 1991.
[68] J. McCarthy. Circumscription—a form of non-monotonic reasoning. Artificial Intelligence, 13, 27–39, 1980.
[69] D. McDermott and J. Doyle. Non-monotonic logic I. Artificial Intelligence, 13, 41–72, 1980.
[70] N. Metropolis and A. Rosenbluth. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092, 1953.
[71] M. Minsky. A framework for representing knowledge. Technical Report 306, MIT AI Laboratory, 1974.
[72] M. Minsky and S. Papert. Progress report on artificial intelligence. 1971. web.media.mit.edu/minsky/papers/PR1971.html
[73] E. Neftci, S. Das, B. Pedroni, K. Kreutz-Delgado and G. Cauwenberghs. Event-driven contrastive divergence for spiking neuromorphic systems. Frontiers in Neuroscience, 7, 74–87, 2014.
[74] S. Neves, J. Bonnefon and E. Raufaste. An empirical test of patterns for non-monotonic inference. Annals of Mathematics and Artificial Intelligence, 34, 107–130, 2002.
[75] N. Nilsson. Probabilistic logic. Artificial Intelligence, 28, 71–87, 1986.
[76] R. O’Reilly. Six principles for biologically based computational models of cortical cognition. Trends in Cognitive Sciences, 11, 455–462, 1998.
[77] J. Ortega and J. Parrilla. Adaptive cooperation between processors in a parallel Boltzmann machine implementation. In Lecture Notes in Computer Science, vol. 1607, pp. 208–218. Springer, 1999.
[78] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Representation and Reasoning. Morgan Kaufmann, 1997.
[79] J. Pearl. An Introduction to Causal Inference. ISBN 1507894295, 2015.
[80] J. Pearl and H. Geffner. Probabilistic semantics for a subset of default reasoning. Technical Report CSD-8700XX, R-93-III, Computer Science, UCLA, 1988.
[81] N. Pfeifer and G. Kleiter. Coherence and non-monotonicity in human reasoning. Synthese, 146, 93–109, 2005.
[82] K. Pfleger. Categorical Boltzmann machines. Technical Report TR 98-05, Stanford University, Knowledge Systems Laboratory, 1998.
[83] G. Pinkas. Propositional logic, non-monotonic reasoning and symmetric networks—on bridging the gap between symbolic and connectionist knowledge representation. In Neural Networks for Knowledge Representation and Inference, pp. 175–203. Lawrence Erlbaum, 1994.
[84] G. Pinkas. Reasoning, non-monotonicity and learning in connectionist networks that capture propositional knowledge. Artificial Intelligence, 77, 203–247, 1995.
[85] G. Pinkas and R. Dechter. Improving connectionist energy minimization. Journal of Artificial Intelligence Research, 3, 223–248, 1995.
[86] G. Priest, K. Tanaka and Z. Weber. Paraconsistent logic. Stanford Encyclopedia of Philosophy, 2016. https://plato.stanford.edu/archives/win2016/entries/logic-paraconsistent/
[87] F. Radermacher. Cognition in systems. Cybernetics and Systems, 27, 1–41, 1996.
[88] R. Reiter. A logic for default reasoning. Artificial Intelligence, 13, 81–132, 1980.
[89] A. Robins. Catastrophic forgetting, rehearsal and pseudo-rehearsal. Connection Science: Journal of Neural Computing, Artificial Intelligence and Cognitive Research, 7, 123–146, 1995.
[90] R. Rosales and S. Sclaroff. Combining generative and discriminative models in a framework for articulated pose estimation. International Journal of Computer Vision, 67, 251–276, 2006.
[91] D. Rumelhart, P. Smolensky, J. McClelland and the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations. MIT Press, Cambridge, MA, 1986.
Google Scholar Crossref Search ADS Google Preview WorldCat COPAC [92] S. Russell and P. Norvig Artificial Intelligence: A Modern Approach . Prentice Hall , 2003 . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC [93] S. Sathasivam Boltzmann machine and new activation . Applied Mathematical Sciences , 78 , 3853 – 3860 , 2011 . Google Scholar OpenURL Placeholder Text WorldCat [94] T. Sejnowski High order Boltzmann machines . In Neural Networks for Computing , vol. 151 of American Institute of Physics Conference Proceedings 151 , pp. 398 – 395 . 1986 . [95] T. Sejnowski and A. Destexhe Why do we sleep? Brain Research , 886 , 208 – 223 , 2000 . Google Scholar Crossref Search ADS PubMed WorldCat [96] L. Shastri SHRUTI: A Neurally Motivated Architecture for Rapid, Scalable Inference , vol. 77 of Perspectives in Neural Symbolic Integration , chapter 8 , pp. 183 – 204 . Springer , 2007 . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC [97] L. Shastri and V. Ajanagadde From simple associations to systematic reasoning: a connectionist representation of rules, variables and dynamic bindings using temporal synchrony . Behavioural and Brain Sciences , 16 , 417 – 494 , 1993 . Google Scholar Crossref Search ADS WorldCat [98] L. Shastri and C. Wendelken Probabilistic inference and learning in a connectionist causal network . Technical Report . International Computer Science Institute , 2000 . [99] Y. Sholam A semantical approach to non-monotonic logics . In Readings in Non-Monotonic Reasoning , pp. 227 – 249 . Morgan Kaufmann , 1987 . [100] P. Singh The open mind common-sense project . 2002 . http://www.kurzweilai.net/the-open-mind-common-sense-project [101] W. Spohn Ordinal conditional functions: a dynamic theory of epistemic states . In Causation in Decision, Belief Change and Statistics , Harper & Skyrms , ed., vol. 11 , pp. 105 – 134 . Kluwer Academic , 1988 . 
Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC [102] K. Stenning and M. Van Lambalgen Human Reasoning and Cognitive Science . MIT Press , Cambridge , 2008 . Google Scholar Crossref Search ADS Google Preview WorldCat COPAC [103] C. Strasser and G. Antonelli Non-monotonic logic . Stanford Encyclopedia of Philosophy . 2016 . https://plato.stanford.edu/archives/win2016/entries/logic-nonmonotonic/ [104] G. Striedter Neurobiology . Oxford University Press , 2016 . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC [105] K. Swingler Applying Neural Networks: A Practical Guide . Academic Press Inc , 1996 . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC [106] A. Tarski Logic, Semantics, Meta-Mathematics: Papers from 1923 to 1938 . Clarendon Press , 1956 . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC [107] A. Tichnor and H. Barret Optical implementations in Boltzmann machines . Optical Engineering , 26 , 16 – 21 , 1987 . Google Scholar OpenURL Placeholder Text WorldCat [108] UCI . Machine learning repository . 2013 . http://archive.ics.uci.edu/ml/datasets.html [109] C. von der Malsburg and D. Willshaw Co-operativity and the brain . Trends in Neurosciences , 4 , 80 – 83 , 1981 . Google Scholar Crossref Search ADS WorldCat [110] P. Wason Reasoning . In New Horizons in Psychology I , pp. 135 – 151 . Penguin , Harmondsworth , 1966 . Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC [111] P. Wason Regression in reasoning ? British Journal of Psychology , 60 , 471 – 480 , 1969 . Google Scholar Crossref Search ADS PubMed WorldCat [112] E. Zalta Gottlob Frege . Stanford Encyclopedia of Philosophy . 2016 . http://plato.stanford.edu/archives/win2016/entries/frege/ © The Author(s) 2021. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permission@oup.com. 
Modelling supra-classical logic in a Boltzmann neural network: I representation. Journal of Logic and Computation, Advance Article, 2 September 2021. DOI: 10.1093/logcom/exab054.