Applied Artificial Intelligence, 22:309–330
Copyright © 2008 Taylor & Francis Group, LLC
ISSN: 0883-9514 print / 1087-6545 online
DOI: 10.1080/08839510801972801

UNSUPERVISED ANOMALY DETECTION IN LARGE DATABASES USING BAYESIAN NETWORKS

Antonio Cansado and Alvaro Soto
Pontificia Universidad Católica de Chile, Santiago, Chile

Today, there has been a massive proliferation of huge databases storing valuable information. The opportunities for an effective use of these new data sources are enormous; however, the huge size and dimensionality of current large databases call for new ideas to scale up current statistical and computational approaches. This article presents an application of artificial intelligence technology to the problem of automatic detection of candidate anomalous records in a large database. We build our approach with three main goals in mind: 1) an effective detection of the records that are potentially anomalous; 2) a suitable selection of the subset of attributes that explains what makes a record anomalous; and 3) an efficient implementation that allows us to scale the approach to large databases. Our algorithm, called Bayesian network anomaly detector (BNAD), uses the joint probability density function (pdf) provided by a Bayesian network (BN) to achieve these goals. By using appropriate data structures, advanced caching techniques, the flexibility of Gaussian mixture models, and the efficiency of BNs to model joint pdfs, BNAD manages to efficiently learn a suitable BN from a large dataset. We test BNAD using synthetic and real databases, the latter from the fields of manufacturing and astronomy, obtaining encouraging results.

We would like to thank astronomers Felipe Barrientos and David Schade for their contribution to the results presented in the section "Strange Objects Detection in Astronomy." This work was partially funded by FONDECYT grant 1030336. Address correspondence to Alvaro Soto, Departamento de Ciencia de la Computación, Pontificia Universidad Católica de Chile, Casilla 306, Santiago 22, Chile. E-mail: asoto@ing.puc.cl

INTRODUCTION

Today, technology is changing the way people produce and handle information. From business to science and engineering, there has been a massive proliferation of huge databases storing valuable information. The opportunities for an effective use of these new data sources, Gigabytes up to Terabytes, are enormous; traditional data analysis techniques, however, have not kept pace with the type of processing needed by the new volumes of data.

The huge size and dimensionality of current large databases require new ideas to scale up current statistical and computational approaches. This is especially critical when the system needs to interact with a human expert, as is usually the case. If a data analysis tool takes weeks or months to return a result, the interaction between the analyst and the data is severely limited.

In this article we present an application of artificial intelligence (AI) technology to the problem of automatic detection of anomalous records in a large database. Although we refer to the problem as anomaly detection, this problem has been cast under different names, such as outlier detection, novelty detection, noise detection, deviation detection, or exception mining (Hodge and Austin 2004). The detection of anomalies in a database plays a critical role in many tasks.
Depending on the application, these anomalies may correspond to fraudulent transactions in a financial database, strange celestial objects in an astrophysical data catalog, or records of faulty products in a production database (see Hodge and Austin (2004) for an extensive list of applications).

We build our approach with three main goals in mind:

Goal-1: an effective detection of the records that are potentially anomalous;
Goal-2: a suitable selection of the subset of attributes that explains what makes a record anomalous;
Goal-3: an efficient implementation that allows us to scale the approach to large databases.

We follow a probabilistic approach by modeling the joint probability density function (pdf) of the attributes of the records in the database. This function provides a straightforward method to rank the records according to their oddness (Goal-1). While highly common records, well explained by the model, receive a high likelihood, strange records, poorly explained by the model, receive a low likelihood.

Although the probabilistic approach seems to be fruitful, the estimation of a joint pdf in a high dimensional space is an extremely challenging task. Any direct attempt to model the joint pdf would require fitting a prohibitively large number of parameters. Furthermore, searching for models in a large parameter space increases the chances of getting stuck in a local optimum (Mitchell 1997).

To handle the complexity associated with the estimation of the joint pdf, we use a Bayesian network (BN) (Pearl 1988; Lauritzen 1996; Jensen 2001; Neapolitan 2004). A BN provides an efficient graphical representation of a joint pdf by taking advantage of conditional independence relations inherent to most high dimensional datasets. These conditional independence relations among the attributes of the records in the database provide a suitable factorization of the joint pdf in terms of simpler local conditional probability functions, whose reduced dimensionality simplifies the estimation process.

Another key advantage of using BNs is the straightforward evaluation of the likelihood of each data point, which otherwise can be the bottleneck for the detection of strange objects. Furthermore, the local structure and conditional pdfs embedded in the network provide relevant information about which groups of attributes are related to other groups, to what extent they are related, and under what circumstances. These features provide key information to find an explanation about the set of attributes and conditions that make an object anomalous (Goal-2).

In general, finding an appropriate BN to model the joint pdf involves two main steps: learning the structure of the network and learning the conditional pdfs that relate the nodes in the network. To learn the structure of the network, we extend the sparse candidate algorithm (SCA) (Friedman, Nachman, and Pe'er 1999) to the case of continuous variables. We choose SCA due to its ability to scale to large datasets. Instead of using traditional greedy hill climbing (GHC) search over the space of possible structures (Chickering 1996b), where local changes such as adding, deleting, or reversing arcs require $O(n^2)$ possible changes to each network of $n$ variables, SCA efficiently shrinks the search space by selecting only the most relevant parents for each variable. The selection of relevant parents is guided by a scoring metric based on information theory (Friedman et al. 1999).
To learn the conditional pdfs that relate each node with its parents, we use a Gaussian mixture model (GMM) trained with an accelerated version of the expectation maximization (EM) algorithm (Dempster, Laird, and Rubin 1977). Our accelerated version of EM, described in Soto, Zavala, and Araneda (2007), is based on the use of data summarization techniques, such as those used in Zhang, Ramakrishnan, and Livny (1996) and Moore (1999). By using these techniques, EM does not need to sweep over all the observations and dimensions during each iteration, but just over their summaries. This provides a great efficiency gain that allows us to scale the approach to large databases (Goal-3).

The use of GMMs to model the local conditional pdfs at each node provides a straightforward method to increase the extent of the search of the original SCA. This is provided by particular properties of GMMs. Following the regular search of SCA, after we run intensive computations to obtain a local joint pdf among a set of variables, it is possible to use properties of GMMs to obtain any marginal or conditional density among these variables by simply applying basic matrix operations, such as swapping or deleting columns and rows in the covariance matrix. Using such properties, we can increase the scope of the search for network structures without adding a significant computational load to our algorithm.

We call our algorithm Bayesian network anomaly detector (BNAD). In this article, we present the main components of BNAD, and we also evaluate its performance in detecting anomalies in synthetic and real databases.

BACKGROUND

In this section, we start by briefly describing the main elements of a BN along with the notation we use in the article. Afterwards, we review the scoring functions used to explore the space of possible BN structures that model the data. Finally, we describe GMMs and their properties that are relevant to this work.

Bayesian Networks (BNs)

Consider a set $x = \{x_1, \ldots, x_n\}$ of continuous random variables. We use lowercase boldface letters, such as $x$, to denote sets of random variables, and plain letters, such as $x_i$, to denote single random variables. A BN, also called a belief network, is a directed acyclic graph $G$ that represents independencies embodied in a given joint probability distribution over the set of variables $x_i$, $i \in \{1, \ldots, n\}$. The graph $G$ associates a vertex or node $V_i$ to each variable $x_i$. The joint pdf of the variables $x$ can be factorized as $\prod_i p(x_i \mid \mathrm{Pa}^G(x_i))$, where $\mathrm{Pa}^G(x_i)$ is the set of direct parents of $x_i$ in $G$. Given the values of $\mathrm{Pa}^G(x_i)$, $x_i$ is conditionally independent of all the variables that are not descendants of $V_i$ in the graph $G$. Therefore, edges define dependency relations among variables. We denote by $B = \langle G, \theta \rangle$ the BN represented by graph $G$, where the parameters of each factor $p(x_i \mid \mathrm{Pa}^G(x_i))$ are contained in $\theta$. We also denote by $\mathrm{De}^G(x_i)$ the set of descendants of $x_i$ in $G$.

In general, finding an optimal structure for a BN is NP-complete (Chickering 1996a). Thus, it is not possible to search for an optimal structure using the whole space of feasible networks. Fortunately, studies have shown experimentally that a "good" network structure is often enough to provide a suitable approximation to the underlying model (Heckerman 1996).
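To make the factorization concrete, the following minimal sketch evaluates the log-likelihood of a single record as the sum of the local conditional log-densities, one per node. The data structures and names are illustrative assumptions, not the paper's implementation.

```python
import math
from scipy.stats import norm

def log_likelihood(record, parents, local_pdf):
    """Log-likelihood of one record under a BN factorization.

    record    -- dict: variable name -> value
    parents   -- dict: variable name -> list of parent names
    local_pdf -- dict: variable name -> callable f(value, parent_values)
                 returning the local conditional density p(x_i | Pa(x_i))
    """
    total = 0.0
    for var, pa in parents.items():
        pa_values = [record[p] for p in pa]
        total += math.log(local_pdf[var](record[var], pa_values))
    return total

# Toy usage with two Gaussian factors: x1 ~ N(0, 1), x2 | x1 ~ N(0.8*x1, 1).
parents = {"x1": [], "x2": ["x1"]}
local_pdf = {
    "x1": lambda v, pa: norm.pdf(v, 0.0, 1.0),
    "x2": lambda v, pa: norm.pdf(v, 0.8 * pa[0], 1.0),
}
print(log_likelihood({"x1": 0.5, "x2": 0.3}, parents, local_pdf))
```

Ranking all records by this quantity and inspecting the lowest-scoring ones is exactly the detection mechanism of Goal-1.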
Scoring Functions

A well-known function to measure the distance from a "true" probability distribution $p(x)$ to an arbitrary probability distribution $q(x)$ is the Kullback-Leibler divergence (KLD) (Kullback and Leibler 1951), defined as

$$\mathrm{KLD}(p(x) \,\|\, q(x)) = \int_x p(x) \log \frac{p(x)}{q(x)} \, dx. \quad (1)$$

Although not a symmetric function, and therefore not a true metric distance, the KLD satisfies many important mathematical properties. In particular, it can be used to measure the degree of dependency between two random variables, $x_i$ and $x_j$, by measuring the dissimilarity between the joint distribution $p(x_i, x_j)$ and the product of its marginal distributions $p(x_i)$ and $p(x_j)$:

$$\mathrm{KLD}(p(x_i, x_j) \,\|\, p(x_i)p(x_j)) = \int_{x_i, x_j} p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)p(x_j)} \, dx_i \, dx_j. \quad (2)$$

Equation (2) is also known as the mutual information or relative entropy between the joint and the product distribution (Cover and Thomas 1991). As in the case of SCA, we use mutual information as the key metric to select the initial set of candidate parents to build the BN structure.

Using the KLD, Friedman et al. (1999) define the discrepancy between the empirical joint density of $x_i$ and $x_j$, $\hat p(x_i, x_j)$, and the corresponding joint density implied by the current estimation $B$ of the BN structure, $p_B(x_i, x_j)$, as

$$\mathrm{Disc}(x_i, x_j) = \mathrm{KLD}(\hat p(x_i, x_j) \,\|\, p_B(x_i, x_j)). \quad (3)$$

Discrepancy is the key score used by SCA to select a new candidate parent for a node. In this work we also use this metric to explore the space of BN structures.

Gaussian Mixture Model (GMM)

Given a finite set $x = \{x_1, \ldots, x_n\}$ of continuous random variables, finding their joint probability distribution $p(x)$ in a large dimensional space is usually a challenging task. Fortunately, it has been proven (Silverman 1986) that such a distribution can be approximated with arbitrary accuracy using a sum of Gaussian distributions weighted by membership probabilities $w_h$:

$$p(x) = \sum_{h=1}^{k} w_h \, p_h(x \mid \mu_h, \Sigma_h), \quad x \in \mathbb{R}^n, \quad (4)$$

such that $\sum_{h=1}^{k} w_h = 1$, and where $p_h(\cdot \mid \mu_h, \Sigma_h)$ corresponds to the Gaussian distribution with mean $\mu_h$ and covariance matrix $\Sigma_h$,

$$p_h(x \mid \mu_h, \Sigma_h) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_h|}} \exp\left( -\frac{1}{2} (x - \mu_h)^t \Sigma_h^{-1} (x - \mu_h) \right). \quad (5)$$

The number of Gaussian components, $k$, is determined by the complexity of $p(\cdot)$ and the desired accuracy of the approximation. The distribution in Equation (4) is called a GMM. In general, the parameters of the model (means, covariance matrices, and membership probabilities) are unknown but can be found using the EM algorithm (Dempster et al. 1977).

Useful properties of GMMs arise from those of Gaussian distributions. In particular, given a GMM with $k$ components for $(x, y)$, the conditional distribution of $x$ given $y$ corresponds to a GMM with the same number of components. In effect, we see that

$$p(x \mid y) \propto p(x, y) = \sum_{h=1}^{k} w_h \, p_h(x, y) = \sum_{h=1}^{k} w_h \, p_h(y) \, p_h(x \mid y).$$

Given the properties of Gaussian distributions, the conditional distributions $p_h(x \mid y)$ are also Gaussian, and thus we obtain

$$p(x \mid y) \propto \sum_{h=1}^{k} w_h \, p_h(y) \, p_h(x \mid y), \quad (6)$$

which is the form of a GMM with $k$ components. Expressions for the conditional means and covariance matrices involved in Equation (6) are available in closed form (Anderson and Moore 1979). We will use the expression in Equation (6) to efficiently increase the search space of BNAD.
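Equation (6), together with the closed-form Gaussian conditioning of Anderson and Moore (1979), fits in a few lines of numpy. The sketch below is a hedged illustration of the property, not the paper's implementation; the function name and array layout are our own choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_conditional(weights, means, covs, x_idx, y_idx, y_value):
    """Condition a joint GMM on observed coordinates y (Equation (6)).

    weights -- (k,) mixture weights
    means   -- (k, n) component means
    covs    -- (k, n, n) component covariances
    x_idx   -- indices of the free variables x
    y_idx   -- indices of the observed variables y
    y_value -- observed values, shape (len(y_idx),)

    Returns the weights, means, and covariances of the conditional GMM
    p(x | y), obtained with basic matrix operations only.
    """
    new_w, new_mu, new_cov = [], [], []
    for w, mu, S in zip(weights, means, covs):
        Sxx = S[np.ix_(x_idx, x_idx)]
        Sxy = S[np.ix_(x_idx, y_idx)]
        Syy = S[np.ix_(y_idx, y_idx)]
        gain = Sxy @ np.linalg.inv(Syy)
        # Per-component Gaussian conditioning (closed form).
        new_mu.append(mu[x_idx] + gain @ (y_value - mu[y_idx]))
        new_cov.append(Sxx - gain @ Sxy.T)
        # Reweight by the component's marginal likelihood of y.
        new_w.append(w * multivariate_normal.pdf(y_value, mu[y_idx], Syy))
    new_w = np.array(new_w)
    return new_w / new_w.sum(), np.array(new_mu), np.array(new_cov)
```

Marginalization is even simpler: dropping the rows and columns of the marginalized variables from each mean and covariance yields the marginal GMM directly.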
PREVIOUS WORK

This section discusses relevant previous work. We first discuss the problem of detecting anomalous records in databases, focusing mainly on works in the AI domain, in particular the machine-learning community. We then describe relevant works related to learning the structure of a BN from data.

Anomaly Detection

In the AI community, there have been several attempts to tackle the problem of detecting anomalies in databases (Anderson, Frivold, and Valdes 1995; Lewis 1993). Most of these approaches are built upon either knowledge-based systems, in particular expert systems (Jackson 1998), or case-based reasoning techniques (Aamodt and Plaza 1994). The expert system approach encodes knowledge about possible anomalies as if-then rules. The case-based reasoning approach bases the detection on the estimated distances to a set of known anomalies. Although the two approaches have shown success when applied to small problems, in the case of large-scale problems the huge amount of rules and domain knowledge required to implement suitable solutions makes these approaches too complex to be implemented.

The machine learning and the related knowledge discovery in databases (KDD) communities have also tackled the problem, motivated mainly by applications in fraud detection (see Kou, Lu, Sirwongwattana, and Huang (2004) and Hodge and Austin (2004) for reviews). Most of these applications are based on supervised learning techniques, such as supervised neural networks and decision trees. Their need for labeled data, however, limits the convenience of these approaches when dealing with complex, large dimensional domains.

Unsupervised learning techniques, such as our approach, have also been used, clustering techniques being the favorite tool. Using clustering, anomalies are detected as small clusters or isolated points located in low density regions of the feature space (Kou et al. 2004). Although there have been efforts to scale clustering approaches to large databases (Bradley, Fayyad, and Reina 1998; Moore 1999), the main problem arises with dimensionality. In a large dimensional space, many unknown relations or patterns may be hidden in arbitrary subspaces, making it difficult to find suitable metrics for the clusters. While it is possible to use dimensionality reduction techniques, such as principal components analysis (Morrison 2004), the problem of finding the relevant subspaces in large dimensional datasets is still a dilemma for traditional clustering techniques (see Parsons, Haque, and Liu (2004) for a recent review).

The last observation highlights one of the main strengths of our approach, which is given by the factorization of the joint pdf provided by the BN. If we consider a BN from a clustering point of view, the factorization of the joint pdf provided by the BN can be understood as model fitting in selective dimensions or subspaces. In effect, each factor in the joint pdf is given by a local conditional pdf over a subset of variables. These subsets of variables correspond to relevant subspaces of the feature space. In each of these subspaces, we use GMMs to fit the main clusters and to find anomalies by identifying points with low likelihood. This selective model fitting is the main mechanism by which, focusing on key reduced subspaces, our approach avoids the curse of dimensionality problem (Mitchell 1997).
Furthermore, by using this mechanism, our approach does not operate as a black box, but provides a generative probabilistic model. This model helps to explain the sources of an anomaly by identifying the factors that account for a low likelihood.

Bayesian Network Structure Learning

The problem of learning the structure of a BN from data has been intensively investigated (Cooper and Herskovits 1992; Chickering, Geiger, and Heckerman 1995; Friedman et al. 1999; Spirtes, Glymour, and Scheines 2000; Friedman and Koller 2003; Moore and Wong 2003; Teyssier and Koller 2005). In general, these works can be classified into two main approaches: the constraint-based approach and the scoring-based approach. The constraint-based learning approach uses a statistical test to find dependency relations among the variables, which are then mapped to the network structure (Spirtes et al. 2000). The scoring-based learning approach considers the search as an optimization problem, where the exploration of the space of network structures is guided by a statistically motivated metric that quantifies how well each network models the data (Cooper and Herskovits 1992). In practice, due to the complexities associated with finding and applying a robust test to detect dependencies, the scoring-based approach is currently the most used option, and it is the approach we follow in this work.

Under the scoring-based approach, the basic idea is to search for network structures using an adequate heuristic, and to assign to each structure a score that penalizes model complexity. These scores include penalized log-likelihoods, such as Bayesian scores (Cooper and Herskovits 1992), and information theory-based scores (Friedman et al. 1999). A commonly used scoring function is the Bayesian information criterion (BIC) (Hoeting et al. 1999). The BIC evaluates the fit of a BN to the data through the likelihood function, penalizing for additional model complexity (a small sketch of the score appears at the end of this discussion). In this work we use the BIC scoring function. This score has a suitable connection to the probability that a given BN is the true model underlying the data. Furthermore, it can be decomposed to facilitate its calculation.

As we noted previously, we base our approach on the SCA (Friedman et al. 1999). SCA is an iterative algorithm that uses statistically motivated scores and a current estimate of the BN to restrict the possible parent set of each variable. By restricting the number of parents, the algorithm avoids examining candidates that are extremely unreasonable. At each iteration, the restricting phase is followed by a maximization step, where GHC is used to greedily keep the best arc addition within the set of candidate parents. In our work, we use a restricting phase similar to SCA. We use, however, the properties of GMMs to work with continuous variables and to increase the search space of network structures without adding a significant computational load. Furthermore, we use special implementations of the discrepancy and BIC scores that allow us to cache computations in order to scale the approach to large and high dimensional databases.
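For concreteness, the BIC referenced above has the standard Schwarz form: the data log-likelihood minus half the parameter count times the logarithm of the number of records. The paper does not spell out its exact parameter counting, so the helpers below are a hedged sketch with names of our own.

```python
import numpy as np

def bic_score(per_record_loglik, n_params):
    """Schwarz-form BIC: fit minus a complexity penalty (higher is better).

    per_record_loglik -- array of per-record log-likelihoods under the model
    n_params          -- number of free parameters of the model
    """
    m = len(per_record_loglik)
    return float(np.sum(per_record_loglik) - 0.5 * n_params * np.log(m))

def gmm_n_params(k, d):
    """Free parameters of a k-component, d-dimensional full-covariance GMM:
    k means (d each), k symmetric covariances (d(d+1)/2 each), and k - 1
    independent mixture weights."""
    return k * d + k * d * (d + 1) // 2 + (k - 1)
```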
Another state-of-the-art algorithm to find network structures is the optimal reinsertion algorithm (ORA) (Moore and Wong 2003). The ORA succeeds in speeding up BN structure-learning while achieving higher scoring networks than GHC. Using a precomputed cache, found by using AD-search (Anderson and Moore 1998), ORA avoids the high computational costs required by adding, deleting, or inverting edges. Unfortunately, AD-search cannot be used with continuous attributes, as is our interest. Without the benefits of AD-search, ORA would require a high computational complexity to fit models to local joint pdfs, making it hardly useful for large databases.

Teyssier and Koller (2005) propose the ordering-based search (OBS) algorithm. This algorithm is based on searching over the space of orderings of the variables in the network, rather than over the standard space of network structures. A similar idea was used by Larrañaga, Kuijpers, Murga, and Yurramendi (1996), who apply genetic algorithms to conduct the search. Teyssier and Koller (2005) claim that the OBS algorithm reaches network scores similar to those of the ORA, but with a better computational performance. Although OBS seems to be a good alternative, its main disadvantage when scaling to large datasets is the calculation of the scores for the potential successors of the initial ordering, which requires the calculation of a large set of sufficient statistics.

BAYES NETWORK ANOMALY DETECTOR (BNAD)

This section describes the main steps followed by BNAD to find a suitable BN to model the data. The algorithm works in an unsupervised way with unlabeled data and assumes that there is no missing data. The main goal is to find a BN able to detect potential anomalous records in large databases. We achieve this goal by integrating and developing several AI and data processing technologies. Specifically, by using appropriate data structures, advanced caching techniques, the efficiency of BNs to model joint pdfs, and the properties of GMMs, our algorithm manages to learn a suitable BN from a large dataset. The next section describes the basic search strategy used by BNAD, which is mainly based on SCA, and the subsequent section explains how we use properties of GMMs to increase the scope of the search for network structures.

Basic Search Strategy

Traditionally, scoring-based learning algorithms use GHC to explore the space of network structures. In this work, as in the case of SCA, we efficiently shrink the search space by statistically selecting the most probable parents for each variable. Furthermore, our algorithm searches the space using an accelerated implementation of EM to train GMMs (Soto et al. 2007).

Let $\hat B = \langle \hat G, \hat\theta \rangle$ be the initial estimation of the BN that models the data. In this initial estimation all the variables are considered independent, i.e., there are no edges in $\hat G$. Following Friedman et al. (1999), BNAD searches for new network structures by iterating two basic steps, restrict and maximize:

• Restrict: For each $x_i \in x$, compute the quantity $\mathrm{Disc}(x_i, x_j)$, where $x_j$ is a candidate parent for $x_i$ in $\hat G$, i.e., $x_j \in x \setminus \{\mathrm{Pa}^{\hat G}(x_i) \cup x_i \cup \mathrm{De}^{\hat G}(x_i)\}$. Afterwards, form the set $d$ with the $D$ variables $x_j$ having the highest discrepancies. To compute $\mathrm{Disc}(x_i, x_j)$, we estimate $\hat p(x_i, x_j)$ by the empirical distribution and $p_B(x_i, x_j)$ by sampling a fixed number of observations from $\hat B$, adding the assumption that $x_j$ is a parent of $x_i$. In other words, we estimate $p_B(x_i, x_j)$ by sampling $x_i$ from $p(x_i \mid \mathrm{Pa}^{\hat G}(x_i), x_j)$. In our experiments, we use a set of 5000 samples from $p_B$ to estimate $p_B(x_i, x_j)$. We limit the maximal number of parents for each node to 5.
• Maximize: For each new candidate parent $x_j$ in the set $d$, independently compute $\mathrm{BIC}(B')$, where $B'$ is given by $\langle \hat G \cup \{x_j \to x_i\}, \theta' \rangle$ and the parameters in $\theta'$ are equal to those in $\hat\theta$, except for those related to variable $x_i$, which now has an additional parent $x_j$. If, for the highest scoring BN $B'_{\max}$, $|\mathrm{BIC}(B'_{\max}) - \mathrm{BIC}(\hat B)| \le \delta$, the algorithm stops, where $\delta$ is a user-defined convergence parameter. Otherwise, $B'_{\max}$ is used as $\hat B$ in the next iteration.

This algorithm requires $O(mn^2)$ operations to collect the required statistics, where $m$ is the number of records in the database. It is important, however, to notice that a single pass over the $m$ records is sufficient to compute every pairwise frequency in constant time. In order to efficiently compute the discrepancy, we discretize the observations into $t$ bins. Equally sized bins (Scott 1992) are used in the discretization as an approximation to the real values of the discrepancy (a sketch of this discretized estimate appears at the end of this subsection).

As we mentioned, BNAD uses BIC as the scoring function. Although other model selection scores could be used, the key issue is to choose a scoring metric that can be decomposed according to the BN factorization of the joint pdf. In this way, we can reuse most partial computations. In the case of BIC, it is possible to show that

$$\mathrm{BIC}(B) = \sum_i \mathrm{BIC}_c(x_i), \quad (7)$$

where $\mathrm{BIC}_c(x_i)$ corresponds to the BIC score of node $V_i$, conditional on its parents $\mathrm{Pa}^G(x_i)$. Using this additive decomposition, we only need to compute BICs for the nodes whose parents have changed, which in BNAD occurs one at a time. The important observation is that all partial log-likelihoods used in the calculation of BIC can be cached for later use, avoiding $O(am)$ operations for each reused calculation, where $a$ is the number of attributes needed to compute the partial log-likelihood.

BNAD detects candidate strange objects by evaluating the likelihood of each object and selecting the worst $a\%$ of them. Deciding the correct value of $a$ depends directly on the capacity of the BN to fit the data. If the training of the BN is successful, most objects will be accurately modeled, and only relevant anomalies will be displayed as low probability objects. It is also possible to find objects with very low probability but of no interest to the user, due to noise and other external factors.
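As a concrete reading of the discretized estimate of Equation (3), the sketch below histograms the empirical pairs and the model samples on shared equal-width bins and evaluates the discrete KLD. The function name and the small epsilon guard against empty bins are our own choices, not the paper's code.

```python
import numpy as np

def discrepancy(data_ij, model_samples_ij, t=10, eps=1e-9):
    """Discrete approximation of Disc(x_i, x_j) from Equation (3).

    data_ij          -- (m, 2) observed values of (x_i, x_j)
    model_samples_ij -- (s, 2) samples of (x_i, x_j) drawn from the
                        current BN estimate (the paper uses s = 5000)
    t                -- equally sized bins per dimension (the paper uses 10)
    """
    # Shared equal-width bin edges so the two histograms are comparable.
    lo = np.minimum(data_ij.min(0), model_samples_ij.min(0))
    hi = np.maximum(data_ij.max(0), model_samples_ij.max(0))
    edges = [np.linspace(lo[d], hi[d], t + 1) for d in range(2)]
    p_hat, _, _ = np.histogram2d(data_ij[:, 0], data_ij[:, 1], bins=edges)
    p_b, _, _ = np.histogram2d(model_samples_ij[:, 0],
                               model_samples_ij[:, 1], bins=edges)
    # Normalize to probabilities; eps guards against log(0) on empty bins.
    p_hat = p_hat / p_hat.sum() + eps
    p_b = p_b / p_b.sum() + eps
    return float(np.sum(p_hat * np.log(p_hat / p_b)))
```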
Extended Search Strategy

When searching for network structures, the estimation of the local conditional pdfs is the bottleneck that limits a further search. As an example, when we test our approach by running a regular implementation of EM to find a BN that models the database described in a subsequent section, it takes 5 days to find a suitable structure. In contrast, when we use our accelerated version of EM, it takes just 232 minutes. Although the accelerated version of EM helps to reduce the computational burden, our experiments indicate that the running time of this algorithm consumes approximately 90% of the total processing time. In this section we show how we are able to increase the scope of the search for network structures while avoiding extra expensive calls to EM.

In our search for network structures, we use our version of EM to estimate the joint pdf that relates each variable $x_i$ with its candidate set of parents, $p(x_i, \mathrm{Pa}^{\hat G}(x_i))$. After estimating this joint density, we use the property in Equation (6) to obtain the relevant conditional pdf for the network under consideration, i.e., $p(x_i \mid \mathrm{Pa}^{\hat G}(x_i))$. The interesting fact is that after estimating $p(x_i, \mathrm{Pa}^{\hat G}(x_i))$, we can use properties of GMMs not only to estimate $p(x_i \mid \mathrm{Pa}^{\hat G}(x_i))$ but also to find any marginal or conditional distribution among these variables by simply applying basic matrix operations, such as swapping or deleting columns and rows.

As an example of the previous property, suppose that the restrict phase of BNAD requires the estimation of $p(x_1 \mid x_2, x_3, x_4)$. To achieve this, BNAD first calls the accelerated version of EM to estimate $p(x_1, x_2, x_3, x_4)$. After estimating this joint density, we can compute any conditional or marginal, such as $p(x_2 \mid x_1, x_3, x_4)$, without the computational cost of a new EM call, but just using basic matrix operations. In this case, the calculation of $p(x_2 \mid x_1, x_3, x_4)$ can be seen as an edge inversion process that allows us to explore a structure where $x_2$ is a parent of $x_1$. BNAD takes advantage of such properties to explore additional relations in the space of network structures without adding a significant computational load.

When we use properties of GMMs to expand the search, we need to take some considerations into account. First, we need to test only new edges that do not produce cycles in the network (see the sketch after this paragraph). Also, if a node has a set of pre-existing parents, these edges need to be removed in order to avoid an additional call to EM. Finally, besides the calculation of the new conditional pdf, there is an additional computation due to the calculation of the BIC scores for the new structure, which requires at least $O(am)$ operations, as it requires a new partial log-likelihood calculation.
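The acyclicity test mentioned above can be done with a simple ancestor search over the parent sets. This is a minimal helper of our own, not taken from the paper:

```python
def creates_cycle(parents, u, v):
    """True if adding the edge u -> v would create a directed cycle.

    parents -- dict mapping each node to the set of its current parents.
    A cycle appears exactly when a directed path v ~> u already exists,
    i.e., when v is an ancestor of u; we check this by climbing the
    parent links starting from u.
    """
    stack, seen = [u], set()
    while stack:
        node = stack.pop()
        if node == v:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents.get(node, ()))
    return False
```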
To expand the search space of BNAD using the properties previously mentioned, we need to modify the basic search scheme presented in the section entitled "Basic Search Strategy." The restrict phase is the same as before. At each iteration, it provides the set $d$ of the best $D$ candidate parents according to the discrepancy. The maximize phase needs to be modified in order to evaluate the BIC scores for the new network structures that expand the search. These new network structures result from calculating marginal and conditional densities from the joint pdfs obtained for the $D$ candidate parents found in the restrict phase. In particular, if the restrict phase specifies a candidate parent $x_j$ for $x_i$, then besides exploring the regular BN with the local relation $p(x_i \mid x_j, \mathrm{Pa}^{\hat G}(x_i))$, we also calculate the BIC score for the related structures that satisfy the local relation $p(x_j \mid x_i, \mathrm{Pa}^{\hat G}(x_i))$ or the local relations $p(x_k \mid x_i, x_j, \mathrm{Pa}^{\hat G}(x_i) \setminus (x_k))$, where $x_k \in \mathrm{Pa}^{\hat G}(x_i)$. It is important to note that in this process we only consider new structures that do not contain cycles and do not require extra calls to EM. According to this, the new maximize phase is given by the following:

• Maximize: For each new candidate parent $x_j$ in the set $d$ obtained in the restrict phase, compute the BIC score for the following BNs:

  - $B'$ given by $\langle \hat G \cup \{x_j \to x_i\}, \theta' \rangle$. To obtain $\theta'$, update in $\hat\theta$ the local conditional relation for $x_i$ by computing $p(x_i \mid x_j, \mathrm{Pa}^{\hat G}(x_i))$. Afterwards, compute $\mathrm{BIC}(B')$.

  - $B'$ given by $\langle G' \cup \{x_i \to x_j\} \cup \{\mathrm{Pa}^{\hat G}(x_i) \to x_j\}, \theta' \rangle$. To obtain $G'$, remove the existing parents of $x_j$ in $\hat G$. Then, add $\{x_i \cup \mathrm{Pa}^{\hat G}(x_i)\}$ as new parents of $x_j$. If the resulting network $B'$ is cyclic, roll back. Otherwise, obtain $\theta'$ by updating in $\hat\theta$ the local conditional relation for $x_j$. To do this, use the properties of GMMs to calculate $p(x_j \mid x_i, \mathrm{Pa}^{\hat G}(x_i))$ from $p(x_i, x_j, \mathrm{Pa}^{\hat G}(x_i))$. Afterwards, compute $\mathrm{BIC}(B')$.

  - $B'$ given by $\langle G' \cup \{x_i \to x_k\} \cup \{\mathrm{Pa}^{\hat G}(x_i) \setminus (x_k) \to x_k\} \cup \{x_j \to x_k\}, \theta' \rangle$. To obtain $G'$, remove the existing parents of $x_k$ in $\hat G$. Then, add $\{x_i \cup x_j \cup \mathrm{Pa}^{\hat G}(x_i) \setminus (x_k)\}$ as new parents of $x_k$. If the resulting network $B'$ is cyclic, roll back. Otherwise, obtain $\theta'$ by updating in $\hat\theta$ the local conditional relation for $x_k$. To do this, use the properties of GMMs to calculate $p(x_k \mid x_i, x_j, \mathrm{Pa}^{\hat G}(x_i) \setminus (x_k))$ from $p(x_i, x_j, \mathrm{Pa}^{\hat G}(x_i))$. Afterwards, compute $\mathrm{BIC}(B')$.

As before, keep just the highest scoring BN, $B'_{\max}$. If $|\mathrm{BIC}(B'_{\max}) - \mathrm{BIC}(\hat B)| \le \delta$, the algorithm stops; otherwise, $B'_{\max}$ is used as $\hat B$ in the next iteration.

This algorithm can be seen as a particular edge-reversing strategy using GMM properties. Due to the new calculations, mainly the new BIC scores, the computational cost will be slightly higher than that of the regular search strategy of SCA. The search space, however, becomes wider without the need for new expensive EM calls. This may help to escape local maxima of the scoring function.

EXPERIMENTAL RESULTS

In this section, we use synthetic and real databases to illustrate the main features of our approach. First, we use synthetic databases to perform three experiments oriented towards demonstrating the abilities of BNAD to achieve each of the three main goals pursued by this work: 1) effective detection of potential anomalous records; 2) selection of attributes that explain the source of an anomaly; and 3) efficient implementation to scale to large databases. Afterwards, we test the abilities of BNAD to detect anomalous records in two real databases.

In all the experiments, we use the accelerated version of EM. We keep all caches of EM calculations and BIC functions on disk for computational efficiency. To compute the discrepancy metric score, we notice that using fewer than five bins tends to increase the error, while using more than 15 bins does not provide additional benefits. Thus, we choose to work with $t = 10$ bins. Also, we use 5000 samples in the calculation of the discrepancy, since we notice that a greater number of samples does not produce major changes in the final network structure. Furthermore, the memory needed to create the frequency tables is quite modest, being approximately 16 MB for a configuration with 200 variables and 10 bins each, and 256 MB for 400 variables and 20 bins. All the experiments were conducted using an Intel Pentium D processor running at 3.2 GHz, except experiments 2 and 3, which use an AMD Athlon 3200+ processor running at 2.2 GHz.

Synthetic Databases

We create synthetic databases using the following strategy (a code sketch follows the list). Given the number of components $k$ of a GMM and a set of $n$ variables $x_i$, $i \in \{1, \ldots, n\}$:

• For $h = 1, \ldots, k$, create a random vector $\mu_h$ with the mean value of each GMM component.
• For $h = 1, \ldots, k$, create an orthogonal base of $n$ vectors that represents the covariance matrix $\Sigma_h$ of each GMM component. This matrix is positive-definite by construction.
• Create a random membership vector $w$, such that $\sum_{h=1}^{k} w_h = 1$.
• Let $G$ be an empty graph. For each node $x_i \in x$, select a random number $l$ of parent nodes $x_j$, such that $0 \le l < i - 1$ and $1 \le j < i - 1$. Add all the resulting relations $x_j \to x_i$ to $G$.
• For each node $x_i \in x$, create a conditional GMM, $p(x_i \mid \mathrm{Pa}^G(x_i))$, that satisfies the restrictions imposed by $G$ and the joint pdf given by $\Sigma_h$, $\mu_h$, and $w_h$, $1 \le h \le k$.

This algorithm creates an acyclic BN with consistent conditional probabilities. No cycles are possible due to the restriction on edge order. Consistency is guaranteed by taking all conditional pdfs from partitions of the same joint pdf modeled through a GMM with $k$ components.
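The following sketch mirrors that recipe under stated assumptions: the scale constants, the exact parent-count distribution, and the function names are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_gmm(n, k):
    """Random k-component GMM over n variables; scales are arbitrary."""
    means = rng.normal(0.0, 5.0, size=(k, n))
    covs = []
    for _ in range(k):
        # An orthogonal basis Q with positive eigenvalues e gives a
        # positive-definite covariance Q diag(e) Q^T by construction.
        q, _ = np.linalg.qr(rng.normal(size=(n, n)))
        e = rng.uniform(0.5, 2.0, size=n)
        covs.append(q @ np.diag(e) @ q.T)
    w = rng.dirichlet(np.ones(k))  # membership vector summing to one
    return w, means, np.array(covs)

def random_dag(n):
    """Each node draws parents only among lower-indexed nodes, so the
    resulting graph is acyclic by construction."""
    parents = {}
    for i in range(n):
        l = int(rng.integers(0, i)) if i > 0 else 0
        parents[i] = sorted(rng.choice(i, size=l, replace=False)) if l else []
    return parents

def sample_joint(w, means, covs, m):
    """Draw m records from the joint GMM."""
    comps = rng.choice(len(w), size=m, p=w)
    return np.stack([rng.multivariate_normal(means[c], covs[c])
                     for c in comps])
```

The conditional GMM at each node can then be obtained from the joint GMM with the conditioning helper sketched earlier, which is what keeps all the local factors mutually consistent.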
Experiment 1: Detection of Anomalous Records

We first evaluate the ability of BNAD to detect anomalies, and we also compare its performance with respect to GHC. For this purpose, we generate $m = 10{,}000$ observations of a BN with $n = 20$ nodes and $k = 10$ Gaussian components in the GMMs. We artificially add to this database anomalous records consisting of a valid instance of the database for which $c$ attributes have been modified with random values taken within the range of the variable. We accept as anomalies only records that are located in areas of low density under the generating model.

For this experiment, we test two cases. In both, we artificially introduce 10 anomalies in the database. In one case, we modify $c = 4$ variables, and in the other we modify $c = 8$ variables. The results are shown in Table 1. As can be seen, the models obtained by BNAD and GHC present similar BIC scores. These scores are also similar to the score obtained for the true underlying model.

TABLE 1 Performance of BNAD compared to GHC on synthetic databases. Cut-off points are determined such that sensitivity equals 90%; both methods attain a specificity of 90%.

Algorithm    BIC      Cut-off (c = 4)  Cut-off (c = 8)  EM calls
GHC          425.075  933              38               850
BNAD         422.317  942              25               572
True model   458.023  1008             98               –

In order to evaluate the quality of the classifications obtained by both algorithms, we use the concepts of sensitivity and specificity, which correspond to the percentage of well-classified anomalies and well-classified regular observations, respectively. We set the sensitivity to 90%, and we define the cut-off point as the minimum number of observations we need to explore, in increasing order of likelihood, to attain that sensitivity (a sketch of this cut-off computation is given after this experiment). For both algorithms, this strategy determines a specificity of 90%.

The cut-off points in Table 1 show that both algorithms are able to rank the 10 anomalies among the group of lowest likelihood records. In both cases, we need to explore a similar number of observations to detect 90% of the anomalies. As expected, the number of observations we need to explore to detect the anomalies decreases considerably when anomalies become more evident, as in the case where we modify eight of the attributes instead of four. In this case, 90% of the anomalies are properly detected with less than 1% false positives.

The big advantage of BNAD over GHC is in terms of efficiency. In effect, BNAD was able to reduce by 33% the number of EM calls made by GHC. It is important to notice that classification based on the true model attained similar results. This implies that errors in classification using GHC and BNAD are mainly due to the natural variability of the data. The section on Experiment 3 shows further results that compare the performance of BNAD with respect to GHC for a database with $m = 20{,}000$ observations of a BN with $n = 50$ nodes. In particular, Figure 3 shows the complete receiver operating characteristic (ROC) curve for this case.
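A minimal sketch of the cut-off computation described above, assuming per-record log-likelihoods and boolean ground-truth labels are available (our own helper, not the paper's code):

```python
import numpy as np

def cutoff_for_sensitivity(loglik, is_anomaly, target=0.90):
    """Smallest number of lowest-likelihood records to inspect so that
    `target` of the true anomalies are covered; also returns specificity.

    loglik     -- (m,) per-record log-likelihood under the learned BN
    is_anomaly -- (m,) boolean ground-truth labels (numpy array)
    """
    order = np.argsort(loglik)                 # increasing likelihood
    hits = np.cumsum(is_anomaly[order])        # anomalies found so far
    needed = int(np.ceil(target * is_anomaly.sum()))
    cutoff = int(np.searchsorted(hits, needed) + 1)  # records to explore
    false_pos = cutoff - int(hits[cutoff - 1])
    specificity = 1.0 - false_pos / (~is_anomaly).sum()
    return cutoff, float(specificity)
```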
Experiment 2: Detection of the Sources of an Anomaly

The previous analysis bases the detection of a candidate anomaly on the evaluation of the complete log-likelihood under the estimated BN. As mentioned earlier, one of the major benefits of using a BN for anomaly detection is the factorization of the joint pdf of the attributes in the intended database. One of our hypotheses is that, for each candidate anomaly, the relative values of the log-likelihood associated with each factor provide important information about the sources of the anomaly.

To test this hypothesis, we conduct an experiment dedicated to exploring which factors of the BN representation most strongly influence a low log-likelihood value for each anomaly. To perform this test, we construct a synthetic database consisting of $m = 10{,}000$ observations of a BN with $n = 20$ nodes and $k = 10$ Gaussian components in the joint GMM. Afterwards, we randomly modify the values of each of the 20 attributes in the database, one at a time. Once attribute $i$ has been modified, we measure the impact of the modification in three ways. We measure its total influence, which is the effect the modification has over the complete log-likelihood of the observations. We measure its direct influence, which is the effect the modification has over the partial log-likelihood of that node in particular. Finally, we measure its indirect influence, which is the effect the modification has over those partial log-likelihoods that include $x_i$ as a parent. (A sketch of these three measurements appears after this discussion.)

The results of the previous experiment are shown in Figure 1.

FIGURE 1 Effect of the presence of anomalies over total and partial log-likelihoods on a set of 20 variables. The x axis shows the attribute that was modified. The y axis shows the difference in log-likelihood between original and modified values divided by the original value.

The figure shows that a small effect on the overall log-likelihood becomes larger when we consider only those partial log-likelihoods that include the source of the anomaly. We can appreciate that the direct influence is notable and could be easily detected by statistics. Less important, but still notable, is the indirect influence. As for total log-likelihoods, although we can appreciate differences, there are neither direct nor indirect strong influences. As a consequence, if we only used total values, we would hardly differentiate between anomalous and regular records. In contrast, by using the information in the relevant factors, it is possible to increase the sensitivity of the detector and also to identify the causes of an anomaly. Therefore, BN factorization can be very effective in anomaly detection, and its importance has been overlooked in previous works.
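A hedged sketch of the three influence measures for one record, assuming a `partial_loglik(node, record)` callable that evaluates a single BN factor; all names here are illustrative assumptions:

```python
def influences(record, modified, var, parents, partial_loglik):
    """Relative change in total, direct, and indirect log-likelihood
    when attribute `var` of `record` is replaced by its value in
    `modified` (the three measures of Experiment 2).

    partial_loglik -- callable f(node, record) giving the log-likelihood
                      of that node's factor p(node | Pa(node)).
    """
    nodes = list(parents)
    children = [n for n in nodes if var in parents[n]]

    def rel(before, after):
        # Relative change; guard against an exactly-zero baseline.
        return (after - before) / before if before else 0.0

    total = rel(sum(partial_loglik(n, record) for n in nodes),
                sum(partial_loglik(n, modified) for n in nodes))
    direct = rel(partial_loglik(var, record), partial_loglik(var, modified))
    indirect = rel(sum(partial_loglik(n, record) for n in children),
                   sum(partial_loglik(n, modified) for n in children))
    return total, direct, indirect
```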
Experiment 3: Scaling to a Large Database

In order to test the ability of BNAD to scale to a large database, we generate a synthetic database consisting of 500,000 records and 50 dimensions. In this case, we obtain the samples from a BN with local conditional probabilities modeled as a GMM with 10 components. As in the previous cases, we insert in this database a set of 500 artificial anomalies for which $c = 1$ attribute has been modified with random values taken within the range of the variable.

We use subsets of the previous database to test how BNAD scales with respect to the number of records. In particular, we run BNAD using subsets consisting of 20,000, 80,000, 160,000, and 500,000 records, all of them with 50 dimensions. The results are shown in Figure 2.

FIGURE 2 Scalability of BNAD with respect to the number of records for databases with 50 dimensions.

This figure shows that, due to the EM implementation used, BNAD scales almost linearly with respect to the number of records in the database.

To compare the processing time of BNAD with respect to GHC, we run GHC on the subset of 20,000 records and 50 dimensions (we avoid running GHC on the other datasets given that it takes more than a week to converge). Figure 3 shows the resulting ROC curves for BNAD and GHC in this case.

FIGURE 3 Performance of BNAD compared to GHC measured by a ROC curve in a database with 20,000 records and 50 variables.

In terms of accuracy in the detection of the anomalies, both algorithms show a very similar performance. Nonetheless, the main difference lies in the computational load: while GHC takes 64 hours to find a suitable BN, BNAD takes just 12 hours to find a BN with a similar performance. These processing times are strongly related to the number of EM calls performed by each algorithm, precisely 5203 for GHC and 929 for BNAD.

Real Databases

Flaw Detection in Metallic Pieces

We test our algorithm in a flaw detection application using a database containing information extracted from x-ray images of regular and faulty metallic pieces (Mery, da Silva, Caloba, and Rebello 2003). The database consists of 28 variables and 22,936 observations, where each record contains information about visual features of a specific region of each metallic piece. The database was previously labeled by a human expert, who added to each record a binary attribute indicating whether the record corresponds to a regular or faulty piece. The total number of faulty records in the database was 60. Given that our approach corresponds to an unsupervised method, we use the classification labels only as ground truth to evaluate the ability of our algorithm to detect the true anomalies.

Again, we compare the performance of BNAD with respect to GHC. Figure 4 shows the ROC curves for both algorithms.

FIGURE 4 Performance of BNAD compared to GHC measured by a ROC curve on the metallic pieces database.

In this case, BNAD outperformed GHC: it ranked all 60 flaws within the 1079 elements with lowest likelihood, whereas GHC needed 1684 to collect all flaws. This shows the advantage of using a wider search space, which to some extent avoids local optima. Another important advantage of BNAD is again the computational load. Concretely, BNAD performed 250 calls to EM, whereas GHC performed 991, which correspond to processing times of 15 and 119 minutes, respectively.

It is interesting to mention that these results are comparable to the performance of the supervised classifier documented in Mery et al. (2003). In that work, a neural network classifier trained with the same database was able to achieve a classification accuracy of 95%. It needed, however, to be provided with true labeled data during training, which is a relevant disadvantage with respect to our method, which operates in a totally unsupervised mode.

One important note is that we did not design BNAD to work as a flaw detection algorithm, only as an anomaly or outlier detector; however, it was still useful as a filter for such applications.
The main reason is that only a few objects were indeed flaws; therefore, it was possible to consider them as anomalies. We think that using BNAD as a filter of candidate flaws may simplify the labeling process needed to train a supervised classifier.

Strange Objects Detection in Astronomy

We also test our algorithm using an astronomical database from the Canada-France-Hawaii Telescope (CFHT). The database consists of 79 variables and 104,386 objects, corresponding to information related to the color and shape of galaxies. Following the advice of the astronomers who provided the database, we only consider 15 attributes. The idea is to find unusual shapes and colors that could describe interesting new galaxies.

Given that in this case we do not have ground truth data about real strange objects in the database, we artificially insert anomalous records using a similar strategy. In this case, we modify four variables of each of the 104 anomalies inserted in the database. After modeling, we find that 80.7% of the artificially inserted anomalies appear among the 1% of the objects with lowest likelihood, corresponding to a rate of 0.97% false positives. It is important to note that these false positives might be real strange objects within the database.

In the case of real anomalies, we lack the domain knowledge to quantify the relevance of the candidate anomalies detected by BNAD. However, we sent a list of the 1000 records with lowest likelihood to experts of the Canada-France Legacy Project, who provided the database. As feedback, the experts confirmed that there were indeed a large number of objects interesting to them among the 100 records with lowest likelihood.

CONCLUSIONS

In this article, we presented a new algorithm for the detection of candidate anomalies in large databases composed of thousands of records and high dimensional datasets with floating-point domains. The representational power of Gaussian mixture models, together with an optimized version of the expectation maximization algorithm and caching strategies throughout the whole implementation, provided an efficient algorithm for Bayesian network structure-learning. The results of our proposed algorithm, BNAD, on synthetic and real databases indicate that it is possible to use a simple, yet powerful, model for estimating the joint probability density function (pdf) of the domain variables. We showed that this joint pdf can be used to effectively detect anomalies as low likelihood elements. In fact, in the real case of flaw detection in metallic pieces, all flaws were found within the least probable (lowest likelihood) 4.7% of elements. Moreover, as likelihood evaluation is fast once the BN is trained, it can be used as a real-time filter of candidate anomalies, which extends its use in real-life applications.

Following the results of Friedman et al. (1999), we use a similar scoring metric that favors relevant areas of the search space by pruning BNs with unlikely related variables. In this article, we showed that these inherited results apply not only to discrete data, but also to data with continuous domains. Furthermore, BNAD was able to achieve results similar to those of GHC at a lower computational cost. We also showed that the joint probability factorization provided by the BN can help in the detection of rare objects.
It profits from the conditional probability of each variable given its parents to signal the most distinguishing attributes for a given record. This constitutes a major advantage of BNs with respect to other "black-box" alternatives. We believe that further research in this field could point out even more impressive results.

In terms of computational complexity, by including an accelerated version of EM in combination with caching techniques, BNAD shows attractive results in scalability, being quasi-linear in the number of elements. In this way, BNAD is able to reduce processing time from several days to hours, making the processing of larger datasets feasible. Furthermore, BNAD also increases the search space for BN structures without the burden of additional EM calls, by using GMM properties that provide marginal and conditional distributions by simply applying basic matrix operations.

As future work, we are currently exploring the use of active learning techniques to add semantic feedback from an expert to efficiently search the set of candidate anomalies provided by BNAD. We are also exploring the use of subspace clustering techniques to directly find relevant subspaces in which to search for anomalous records.

REFERENCES

Aamodt, A. and E. Plaza. 1994. Case-based reasoning: Foundational issues, methodological variations, and system approaches. Artificial Intelligence Communications 7(1):39–59.
Anderson, B. and A. Moore. 1998. AD-trees for fast counting and for fast learning of association rules. In: Proceedings of 4th International Conference on Knowledge Discovery and Data Mining, pp. 134–138.
Anderson, B. D. and J. B. Moore. 1979. Optimal Filtering. Englewood Cliffs, NJ: Prentice Hall.
Anderson, D., T. Frivold, and A. Valdes. 1995. Next-generation intrusion detection expert system (NIDES): A summary. Technical Report SRI-CSL-95-07, Computer Science Laboratory, Menlo Park, CA: SRI International.
Bradley, P. S., U. Fayyad, and C. Reina. 1998. Scaling EM (expectation maximization) clustering to large databases. Technical Report MSR-TR-98-35, Redmond, WA: Microsoft Research.
Chickering, D. M. 1996a. Learning Bayesian networks is NP-complete. In: Learning from Data: Artificial Intelligence and Statistics V, eds. D. H. Fisher and H.-J. Lenz, pp. 121–130. New York, NY: Springer-Verlag.
Chickering, D. M. 1996b. Learning equivalence classes of Bayesian network structures. In: Proceedings of 12th Conference on Uncertainty in Artificial Intelligence, pp. 150–157.
Chickering, D. M., D. Geiger, and D. Heckerman. 1995. Learning Bayesian networks: Search methods and experimental results. In: Proceedings of 5th Conference on Artificial Intelligence and Statistics, pp. 112–128.
Cooper, G. F. and E. Herskovits. 1992. A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9(4):309–347.
Cover, T. M. and J. A. Thomas. 1991. Elements of Information Theory. New York, NY: John Wiley and Sons, Inc.
Dempster, A., N. Laird, and D. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B 39(1):1–38.
Friedman, N. and D. Koller. 2003. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning 50(1–2):95–125.
Friedman, N., I. Nachman, and D. Pe'er. 1999. Learning Bayesian network structure from massive datasets: The sparse candidate algorithm. In: Proceedings of 15th Conference on Uncertainty in Artificial Intelligence, pp. 206–215.
Heckerman, D. 1996. Bayesian networks for knowledge discovery. In: Advances in Knowledge Discovery and Data Mining, eds. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, pp. 273–305. Cambridge, MA: MIT Press.
Hodge, V. J. and J. Austin. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22(2):85–126.
Hoeting, J. A., D. Madigan, A. E. Raftery, and C. T. Volinsky. 1999. Bayesian model averaging: A tutorial. Statistical Science 14(4):382–417.
Jackson, P. 1998. Introduction to Expert Systems. Reading, MA: Addison Wesley.
Jensen, F. V. 2001. Bayesian Networks and Decision Graphs. New York, NY: Springer-Verlag.
Kou, Y., C.-T. Lu, S. Sirwongwattana, and Y.-P. Huang. 2004. Survey of fraud detection techniques. In: Proceedings of the 2004 IEEE International Conference on Networking, Sensing and Control, pp. 749–754.
Kullback, S. and R. A. Leibler. 1951. On information and sufficiency. Annals of Mathematical Statistics 22(1):79–86.
Larrañaga, P., C. M. Kuijpers, R. H. Murga, and Y. Yurramendi. 1996. Learning Bayesian network structures by searching for the best ordering with genetic algorithms. IEEE Transactions on Systems, Man and Cybernetics 26(4):487–493.
Lauritzen, S. L. 1996. Graphical Models. Oxford: Clarendon Press.
Lewis, L. M. 1993. A case based reasoning approach to the management of faults in communication networks. In: Proceedings of 12th Annual Joint Conference of the IEEE Computer and Communications Societies, INFOCOM 1993, pp. 1422–1429.
Mery, D., R. R. da Silva, L. P. Caloba, and J. M. Rebello. 2003. Pattern recognition in the automatic inspection of aluminium castings. Insight 45(7):431–439.
Mitchell, T. 1997. Machine Learning. New York, NY: McGraw Hill.
Moore, A. 1999. Very fast EM-based mixture model clustering using multiresolution KD-trees. In: Proceedings of the 11th Conference on Advances in Neural Information Processing Systems, NIPS, pp. 543–549.
Moore, A. and W.-K. Wong. 2003. Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning. In: Proceedings of 20th International Conference on Machine Learning, ICML, pp. 552–559. Menlo Park, CA: AAAI Press.
Morrison, D. F. 2004. Multivariate Statistical Methods. Duxbury Advanced Series.
Neapolitan, R. E. 2004. Learning Bayesian Networks. Prentice Hall.
Parsons, L., E. Haque, and H. Liu. 2004. Subspace clustering for high dimensional data: A review. ACM SIGKDD Explorations Newsletter 6(1):90–105.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.
Scott, D. W. 1992. Multivariate Density Estimation: Theory, Practice, and Visualization. New York, NY: John Wiley and Sons, Inc.
Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. London, UK: Chapman and Hall.
Soto, A., F. Zavala, and A. Araneda. 2007. An accelerated algorithm for density estimation in large databases using Gaussian mixtures. Cybernetics and Systems 38(2):123–139.
Spirtes, P., C. Glymour, and R. Scheines. 2000. Causation, Prediction, and Search. Cambridge, MA: The MIT Press.
Teyssier, M. and D. Koller. 2005. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In: Proceedings of 21st Conference on Uncertainty in Artificial Intelligence, pp. 584–590.
Zhang, T., R. Ramakrishnan, and M. Livny. 1996. BIRCH: An efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pp. 103–114.