Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Genetic network inference: from co-expression clustering to reverse engineering

Genetic network inference: from co-expression clustering to reverse engineering Vol. 16 no. 8 2000 BIOINFORMATICS Pages 707–726 Genetic network inference: from co-expression clustering to reverse engineering 1 2, 3 3 Patrik D’haeseleer , Shoudan Liang and Roland Somogyi University of New Mexico, Department of Computer Science, Albuquerque, NM 87131, USA, NASA Ames Research Center, Moffett Field, CA 94035, USA and Incyte Pharmaceuticals, Inc., 3174 Porter Dr, Palo Alto, CA 94304, USA Received on December 20, 1999; revised on February 29, 2000; accepted on March 3, 2000 Abstract • Which genes are responsible for this disease? motivation: Advances in molecular biological, analytical • Which drugs will treat this disease? and computational technologies are enabling us to sys- tematically investigate the complex molecular processes Beginning with gene sequencing, we are identifying the underlying biological systems. In particular, using high- structure of thousands of genes, and a variety of structural throughput gene expression assays, we are able to measure and regulatory features that provide functional clues. the output of the gene regulatory network. We aim here to However, only the molecular machinery of the cell is review datamining and modeling approaches for concep- able to consistently interpret the sequence information tualizing and unraveling the functional relationships im- into the functions which determine the complex genetic plicit in these datasets. Clustering of co-expression pro- and biochemical networks that define the behavior of an files allows us to infer shared regulatory inputs and func- organism. Since we ultimately seek understanding of the tional pathways. We discuss various aspects of clustering, regulatory skeleton of these networks, we are also taking ranging from distance measures to clustering algorithms steps to monitor the molecular activities on a global level and multiple-cluster memberships. More advanced anal- to reflect the effective functional state of a biological ysis aims to infer causal connections between genes di- system. Several technologies, ranging from hybridization rectly, i.e. who is regulating whom and how. We discuss microarrays, automated RTPCR, to two-dimensional gel several approaches to the problem of reverse engineer- electrophoresis and antibody arrays, allow us to assay ing of genetic networks, from discrete Boolean networks, RNA and protein expression profiles with differing levels to continuous linear and non-linear models. We conclude of precision and depth. that the combination of predictive modeling with system- But how should we organize this process of activity- atic experimental verification will be required to gain a data acquisition? How should we interpret these results deeper insight into living organisms, therapeutic targeting to enhance our understanding of living organisms, and and bioengineering. identify therapeutic targets and critical processes for Contact: [email protected]; [email protected]; bioengineering? Here it is crucial to find the proper [email protected] abstractions around which to build modeling frameworks and data analysis tools. These abstractions must be Introduction centered on two important principles: Novel, high-throughput technologies are opening global perspectives of living organisms on the molecular level. 1. Genetic information flow: Defining the mapping Together with a vast experimental literature on biomolec- from sequence space to functional space. ular processes, these data are now providing us with the The genome contains the information for constructing the challenge of multifunctionality, implying regulatory net- complex molecular features of an organism, as reflected works as opposed to isolated, linear pathways of causality in the process of development. In information terms, the (Szallasi, 1999). Questions which have traditionally been complexity of the fully developed organism is contained posed in the singular are now being addressed in the plu- in that of the genome. But which processes link these two ral. types of information, in other words, what are the codes that translate sequence into structure and function? We • What are the functions of this gene? need to represent these codes in a form understandable • Which genes regulate this gene? to us, so that we may apply them to model building and c Oxford University Press 2000 707 P.D’haeseleer et al. prediction. We therefore seek methods that allow us to One way to make progress in understanding the princi- ples of network behavior is to radically simplify the indi- extract these codes from gene sequence and activity data. vidual molecular interactions, and focus on the collective 2. Complex dynamic systems: From states to outcome. Boolean networks (Kauffman, 1969) represent trajectories to attractors. such a simplification: each gene is considered as a binary variable—either ON or OFF—regulated by other genes When addressing biological function, we usually refer through logical or Boolean functions (as can be found in to functions in time, i.e. causality and dynamics. On a the biochemical literature: see e.g. Yuh et al., 1998). Even molecular level, function is manifested in the behavior with this simplification, the network behavior is already of complex networks. The dynamics of these networks extremely rich (Kauffman, 1993). Many useful concepts resemble trajectories of state transitions—this is what we naturally emerge from such a simple mathematical model. are monitoring when conducting temporal gene expression For example, cell differentiation corresponds to transitions surveys. The concept of attractors is what really lends from one global gene expression pattern to another. Stabil- meaning to these trajectories, i.e. the attractors are the ity of global gene expression patterns can be understood as high-dimensional dynamic molecular representations of a consequence of the dynamic properties of the network, stable phenotypic structures such as differentiated cells namely that all networks fall into one or more attractors, and tissues, either healthy or diseased (Kauffman, 1993; representing stable states of cell differentiation, adaptation Somogyi and Sniegoski, 1996). Our goal is to understand or disease. the dynamics to the point where we can predict the For a Boolean network with N genes, the total number attractor of a molecular network, and know enough about of global gene expression patterns can be very large the network architecture to direct these networks to even for moderate N (2 : each gene can be either on attractors of choice, e.g. from a cancerous cell type to a or off independently). We assume that each gene is benign cell type, from degeneration to regeneration etc. controlled by up to K other genes in the network. The The goal of this review is to discuss principles of genetic connectivity K is a very important parameter for network network organization, and computational methods for ex- dynamics (with large K , the dynamics tends to be more tracting network architectures from experimental data. In chaotic). In Random Boolean Network models, these K Section A conceptual approach to complex network dy- inputs, and a K -input Boolean function, are chosen at namics we will present a brief introduction to Boolean random for each gene (for a justification of this choice, Networks, as a useful conceptual framework to think about see Kauffman, 1993). The Boolean variables (ON/OFF the dynamic behavior of large, interacting regulatory net- states of the genes) at time t + 1 are determined by the works. Section Inference of regulation through cluster- state of the network at time t through the K inputs as ing of gene expression data describes methods and ap- well as the logical function assigned to each gene. An plications for the clustering of genes based on their ex- excellent tool for calculating and visualizing the dynamics pression patterns. In Section Modeling methodologies, of these networks is the DDLAB software (Wuensche, we give an overview of the different modeling method- 1993, 1998). ologies that may be used to model regulatory networks. Because the total number of expression patterns is finite, Finally, Section Gene network inference: reverse engi- the system will eventually return to an expression pattern neering shows how genetic regulatory networks may be it has visited earlier. Since the system is deterministic, it inferred directly from large-scale gene expression data, in- will from then on keep following the exact same cycle cluding estimates of how much data is needed to do so. of expression patterns. This periodic state cycle is called the attractor of the network. For example, in Figure 1 we A conceptual approach to complex network show the repeating 6-state cycle attractor pattern within dynamics the trajectory of a 12-element network (right panel). This In higher metazoa, each gene or protein is estimated on trajectory is only one of many alternative trajectories also average to interact with four to eight other genes (Arnone leading to the same attractor, as shown in the ‘basin of and Davidson, 1997), and to be involved in 10 biological attraction’ graph (lower panel). If a state within the basin functions (Miklos and Rubin, 1996). The global gene of attraction is perturbed to any other state within it, the expression pattern is, therefore, the result of the collective dynamics inexorably converge to the same attractor. This behavior of individual regulatory pathways. In such highly feature helps explain why cell types or cellular adaptations interconnected cellular signaling networks, gene function to particular environments are stable with respect to small depends on its cellular context; thus understanding the perturbations from a variety of internal or external noise network as a whole is essential. However, dynamic sources. systems with large numbers of variables present a difficult The long-term dynamics are determined by the attrac- mathematical problem. tors. In the non-chaotic case, which is the only case of 708 Genetic network inference Fig. 1. Dynamics of the Boolean model of a genetic network illustrated using the DDLAB software (Wuensche, 1993, 1999). Wiring, trajectory and basin of attraction. Top panel, wiring diagram of binary gene interactions. The network consists of hypothetical regulatory gene modules (1,3,6) and dependent modules of structural genes (S, U, T—share identical wiring and rules). Right inset, trajectory. As determined by the wiring and rules, the network passes through a series of states from a given starting state, finally arriving at a repeating attractor pattern, a six-state cycle in this case (dark grey = ON, light grey = OFF; time points numbered at left; modules S, U and T have been collapsed into a single element for simplicity). Bottom panel, basin of attraction. The states of the trajectory shown in the inset are displayed as a series of connected points (labeled by time points) arriving in a cyclic graph resembling the attractor. The additional nodes in the graph resemble other states which also lead to the same attractor, therefore the term ‘basin of attraction’—all states in this graph merge into a single attractor (Somogyi and Sniegoski, 1996). practical interest, the number of states in the attractor is Bhattacharjya and Liang, 1996), rather than exponentially typically only a very small fraction of the total number of with N . The notion that gene expression patterns are states, growing as a square root of N (Kauffman, 1993; constrained is in general agreement with the experimental 709 P.D’haeseleer et al. findings of large-scale gene expression data. For example, been a significant aggregate change over all conditions, genes from the same sequence or functional family do (c) or whether the fluctuation pattern shows high diversity not act independently, but tend to fluctuate in parallel, according to Shannon entropy (Fuhrman et al., 2000). reducing effective N (Wen et al., 1998). In terms of A comparison of these criteria in the analysis for toxic attractor cycles, it is rare for a gene to oscillate more response genes has shown that the Shannon entropy allows than once during the cell cycle in the yeast (Chu et al., the clearest distinction of drug-specific expression patterns 1998). Also, the change in gene expression pattern during (Cunningham et al., 2000). diauxic shift—when yeast metabolism switches from Beyond straightforward scoring methods, we would like glucose to ethanol (DeRisi et al., 1997)—has been found to classify gene expression patterns to explore shared to be very similar to the change in expression pattern functions and regulation. This can be accomplished using during adaptive evolution to growth in a low glucose clustering methods. The simplest approach to clustering, medium (Ferea et al., 1999). sometimes referred to as GBA (Guilt By Association), is The number of distinct attractors of a Random Boolean to select a gene and determine its nearest neighbors in Network also tends to grow as a square root of the number expression space within a certain user-defined distance of genes (Kauffman, 1993). If we equate the attractors cut-off (Walker et al., 1999). Genes sharing the same of the network with individual cell types, as Kauffman expression pattern are likely to be involved in the same suggests, this explains why a large genome of a few billion regulatory process. Clustering allows us to extract groups base pairs is capable of a few hundred stable cell types. of genes that are tightly co-expressed over a range of This convergent behavior implies immense complexity different experiments. If the promoter regions of the genes reduction, convergence and stabilization in networks of are known—as is the case for yeast—it is possible to constrained architecture. identify the cis-regulatory control elements shared by Boolean networks provide a useful conceptual tool for the co-expressed genes. Several algorithms to extract investigating the principles of network organization and common regulatory motifs from gene clusters have been dynamics (Wuensche, 1999). We can study the role of developed (Brazma et al., 1998; Roth et al., 1998; various constraints on global behavior in terms of network van Helden et al., 1998; Wolfsberg et al., 1999). For complexity, stability and evolvability. From experimental example, Tavazoie et al. (1999) identified 18 biologically studies we are learning about constraints in the number significant DNA-motifs in the promoter region of gene of inputs and outputs per gene, input and output sharing clusters based on cell-cycle expression patterns. Most among genes evolved within a gene family or pathway, motifs were highly selective for the cluster in which and restrictions on rule types (thresholding, no ‘exclusive they were found, and seven were known regulatory or’ rules etc.). Investigations into abstract models will motifs for the genes in their respective clusters. In an help us understand the cybernetic significance of network example from brain development (Figure 2), correlations features, and provide meaningful questions for targeted between cis elements and expression profiles can be experimental exploration. established, but are sensitive to the clustering method used. Inference of regulation through clustering of gene We will briefly review important issues involved in expression data clustering and some of the main clustering methods used, as well as a few classical clustering methods which Introduction have not yet been adopted in gene expression analysis. Large-scale gene screening technologies such as mRNA We should caution the reader that different clustering hybridization micro-arrays have dramatically increased methods can have very different results (see for example our ability to explore the living organism at the genomic Figure 2), and—at this point—it is not yet clear which level (Zweiger, 1999). Large amounts of data can be clustering methods are most useful for gene expression routinely generated. In order to identify genes of interest, analysis. Claverie (1999) provides a preliminary review we need software tools capable of selecting and screening of gene expression analysis techniques with a focus on candidate genes for further investigation (Somogyi, 1999). coexpression clustering. Niehrs and Pollet (1999) provide At the simplest level, we can determine which genes an overview of very tightly coexpressed groups of genes show significant expression changes compared with a (which they call ‘synexpression groups’) that have been control group in a pair-wise comparison. As data sets identified based on large-scale gene expression data. become more complex, covering a variety of biological For further reading, some useful textbooks on cluster- conditions or time series, one may consider several scoring methods for selecting the most interesting genes; e.g. ing include Massart and Kaufman (1983), Aldenderfer according to (a) whether there has been a significant and Blashfield (1984) and Kaufman and Rousseeuw change at any one condition, (b) whether there has (1990). 710 Genetic network inference (B) Hippocampal Development Hippocampal Development (A) Min/Max Normalization - Agglomerative Clustering Max Normalization - K-means Clustering Cis- Cis- Cluster Cluster bifurcation Gene Cluster identifier Gene element element no. cyclinA # # ### ### 000 00000000000000000 MOG ###### ### cyclinB # # ## ## 000 00000000000001000 5HT3 # # # # ## # ## G67I8086 # ## ## ### 000 00000000000001010 aFGF ## # # # # # ## G67I86 # ## ## ### 000 00000000000001011 GRb3 # ## # # # # ## Brm # # ### #### 000 00000000000001100 IP3R3 # # ## # # # ## nestin ### ### 000 00000000000001101 PDGFb # # # # # # # ## IGFR2 # ## ### 000 00000000000002000 PDGFR # # # ## ## ## IGFR1 # # # ## # ### 000 00000000000010000 mAChR1 # # # # # # # ## H4 ## # 000 00000100000000000 nAChRa3 # # # ### ## # ODC #### ### # # 000 00000100000000010 trkC # # # # # # # # # IGF2 # # # # 000 00000100000000100 BDNF ## ###### # GAD65 # ## # # # # # # 000 00000100000001000 Brm # # # ## #### GRg3 # # ## # # ## # 000 00000100000100000 cyclinB # # ## ## ## GRa3 # ## # # # 000 00000100000100010 G67I8086 # # ### # ### actin ## ## ## 000 00000100010000000 G67I86 # # # # ### ## cellubrevin ## # # ## 000 00000100010000100 IGFR1 # # # ### ## # GRg2 # ## ## 000 00000100010000110 nestin # ### ### IP3R3 ## ## # # # ## 000 00000100100000000 TH # # ####### 5HT3 # # # # # # ### 000 00000200000000000 ACHE ### ##### # Krox-24 TCP # # # ## # # ## 000 00001000000000000 actin # ###### ## EGF ## # ## # ## # 000 00001000000000010 CCO1 #### # #### GRa5 ## ## ## # # # 001 00000000000000000 CCO2 # # ## ### # # GRa1 ## # # 0 01 00000000000000010 cellubrevin #### # ## # # IP3R2 ## ## ## 001 00000000001000000 cjun ##### # # ## S100beta ## ## # ## ## 001 00000000001000001 CNTF # # # #### # # NFL ## # ## Krox-24 001 00000000001000100 CNTFR ######### CX43 # # ####### nAChRa5 ##### 001 00000000001000101 6 cyclinA # # mGluR3 # # # ## 001 00000000001100000 6 GRa4 # # ## 001 00000000001100001 EGF # # # # # # # ## 6 NT3 # # ## # 001 00000000001100100 FABP ## ##### ## 6 GRb1 ## 0 0 1 00000000001100101 GAD65 ####### ## 6 GAD67 ## # ## # # # # Krox-24 001 00000000001101000 GAP43 ## # ###### 6 GRa2 ## 001 00000000001101010 GMFb ##### # # # # 6 GRg1 # # 001 00000000001101011 GRa3 # ## #### # # 6 5HT1b ## ## # # # # # 001 00000000001102000 GRg2 # ###### ## 6 NFM # # # # # ## # # 001 00000000001110000 GRg3 ## ###### # 6 bFGF ## # Krox-24 001 00000000002000000 H4 ####### ## 6 GFAP ## # # # # 001 00000000003000000 IGF1 # # ##### ## 6 mAChR2 ## ## # # # # # 001 00000000010000000 IGF2 ####### # # 6 GRb2 # # # ## # # # # 001 00000000020000000 IGFR2 ###### ### 6 nAChRa7 # # ## # # ## # Krox-24 001 00000000020000010 InsR ###### ### 6 preGAD67 ## ## # # # # # 001 00000000020000100 MK2 # # ####### 6 NFH # # # # # # 001 00000000100000000 neno # # ## # ## # # Krox-24 synaptophysin ## # # # ## ## 001 00000000100000010 ODC ##### ### # FABP ## # ## # # ## 001 00000010000000000 synaptophysin ## ## # ## ## PDGFb ## # # # # # ## 001 00000010000000010 TCP ##### ## # # PDGFR # ## ## ## ## 001 00000010000001000 GFAP ## # # ### # # GRb3 # # # # # # # ## 001 00000010000010000 mGluR8 # # # ### ## # mAChR1 # # # # # # # ## 001 00000020000000000 NFH # # # ## ## ## ACHE ## ## # # Krox-24 001 00000100000000000 NFL ## # ## #### Krox-24 mGluR8 # # # # # # ## # 001 00000200000000000 S100beta ## ### ## ## 5HT2 # # # # ## # ## 001 00000300000000000 5HT1b ## #### ### CX43 # # ## # ## # # 001 00001000000000000 5HT2 # # #### # ## neno # # # # # ## # # Krox-24 001 00001000000001000 bFGF ## ####### Krox-24 IGF1 # # # # # # # ## 001 00001000000001010 GAD67 ## ##### ## Krox-24 aFGF ## # # # # # ## 002 00000000000000000 GRa1 ## # # ##### GAP43 ## # ###### 003 00000000000000000 GRa2 ## ####### CCO1 # # ## ## 010 00000000000000000 GRa4 # # ##### ## MOG # # # 020 00000000000000000 GRa5 # # ## ## # ## cjun # # # # 020 00000000000010000 GRb1 ## ####### BDNF ## # # ### # # 020 00000000000010010 GRb2 # # #### # ## MK2 # # # # ## # ## 100 00000000000000000 GRg1 # # ##### # # CCO2 # # ## ## # # # 100 00000000000000100 IP3R2 # # ####### TH # # ## ## ### 100 00000000000010000 mAChR2 ## ## # # ## # CNTF # # # ## 100 00000000001000000 mGluR3 # # ##### ## CNTFR # # ## # # ## # 100 00000010000000000 nAChRa5 ## # #### ## GMFb # # # # 100 00000010000001000 nAChRa7 # # #### ### Krox-24 nAChRa3 # #### # 200 00000000000000000 NFM # # ##### # # InsR # # # # # ### 200 00000000000000100 NT3 # # ####### trkC # # # # # # # # # 200 00000000001000000 preGAD67 ## ##### ## Fig. 2. Comparison of clustering methods on hippocampal development gene expression data. Genes were grouped into eight clusters with both methods. Note the relative positions of the genes sharing a Krox-24 transcriptional regulatory element among the clusters. (A) Gene expression patterns were normalized to their respective minima and maxima, and clustered using an agglomerative algorithm. The cluster bifurcation pattern is shown on the left; cluster boundaries were drawn at the depth shaded in grey (left), based on the 20-level cluster identifier that captures the branching pattern for each gene within the dendrogram (right). (B) Gene expression patterns were normalized to their respective maxima, and clustered using a numerical K -means algorithm (Fuhrman et al., 2000). E15 E18 P0 P3 P7 P10 P13 P25 E15 E18 P0 P3 P7 P10 P13 P25 A P.D’haeseleer et al. Figure 2 for an example of a single data set clustered using Distance measures and preprocessing different normalizations and clustering methods. When Most clustering algorithms take a matrix of pairwise using relative expression levels (for example, microarray distances between genes as input. The choice of distance data), the data will tend to be log-normally distributed, measure—used to quantify the difference in expression so the logarithm of the relative expression values should profiles between two genes—may be as important as be used. Califano et al. (2000) suggest using a nonlinear the choice of clustering algorithm. Distance measures transformation into a uniform distribution for each gene can be divided into at least three classes, emphasizing instead, which will tend to spread out the clusters more different regularities present within the data: (a) similarity effectively. according to positive correlations, which may identify similar or identical regulation; (b) similarity according to Clustering algorithms positive and negative correlations, which may also help All clustering algorithms assume the pre-existence of identify control processes that antagonistically regulate groupings of the objects to be clustered. Random noise and downstream pathways; (c) similarity according to mutual other uncertainties have obscured these groupings. The information, which may detect even more complex objectives of the clustering algorithm are to recover the relationships. original grouping among the data. So far, most clustering studies in the gene expression Clustering algorithms can be divided into hierarchical literature use either Euclidean distance or Pearson cor- and non-hierarchical methods. Non-hierarchical methods relation between expression profiles as a distance mea- typically cluster N objects into K groups in an iterative sure. Other measures used include Euclidean distance be- process until certain goodness criteria are optimized. tween expression profiles and slopes (for time series Wen Examples of non-hierarchical methods include K -means, et al., 1998), squared Pearson correlation (D’haeseleer et expectation-maximization (EM) and Autoclass. Hierar- al., 1997), Euclidean distance between pairwise correla- chical methods return an hierarchy of nested clusters, tions to all other genes (Ewing et al., 1999), Spearman where each cluster typically consists of the union of two rank correlation (D’haeseleer et al., 1997), and mutual in- or more smaller clusters. The hierarchical methods can formation (D’haeseleer et al., 1997; Michaels et al., 1998; be further distinguished into agglomerative and divisive Butte and Kohane, 2000). methods, depending on whether they start with single- Conspicuously absent so far are distance measures object clusters and recursively merge them into larger that can deal with the large numbers of highly related clusters, or start with the cluster containing all objects and measurements in the data sets. For example, clustering recursively divide it into smaller clusters. In this section, yeast genes based on all publicly available data will we review several clustering methods for gene expression be highly biased towards the large cell-cycle data sets: (see also Figure 2 for comparison of agglomerative and 73 data points in four time series, containing almost eight K -means clustering). complete cell cycles (Spellman et al., 1998), whereas The K -means algorithm (MacQueen, 1967) can be only a single data point may be present for various used to partition N genes into K clusters, where K is stress conditions, mutations, etc. Correlation between the pre-determined by the user (see Tavazoie et al., 1999) experiments will also lead to highly elliptical clusters, for an application to yeast gene expression). K initial which form a problem for clustering methods that are cluster ‘centroids’ are chosen—either by the user, to biased towards compact, round clusters (such as K - reflect representative expression patterns, or at random— means). A distance measure that can deal with the and each gene is assigned to the cluster with the nearest covariance between experiments in a principled way (e.g. centroid. Next, the centroid for each cluster is recalculated Mahalanobis distance Mahalanobis, 1936) may be more appropriate here. For even longer time series, distance as the average expression pattern of all genes belonging measures based on Fourier or wavelet transforms may be to the cluster, and genes are reassigned to the closest considered. centroid. Membership in the clusters and cluster centroids A related issue is normalization and other preprocessing are updated iteratively until no more changes occur, or of the data. Distance measures that are sensitive to scaling the amount of change falls below a pre-defined threshold. K -means clustering minimizes the sum of the squared and/or offsets (such as Euclidean distance) may require distance to the centroids, which tends to result in round normalization of the data. Normalization can be done with clusters. Different random initial seeds can be tried to respect to the maximum expression level for each gene, assess the robustness of the clustering results. with respect to both minimum and maximum expression levels or with respect to the mean and standard deviation The Self-Organized Map (SOM) method is closely re- of each expression profile. From a statistical point of lated to K -means and has been applied to mRNA expres- view, we recommend using the latter, unless there is a sion data of yeast-cell cycles as well as hematopoietic dif- good reason to preserve the mean expression values. See ferentiation of four well-studied model cell lines (Tamayo 712 Genetic network inference et al., 1999). The method is more structured than K -means the distance between their average expression pattern. in that the cluster centers are located on a grid. At each After N − 1 steps, all the genes are merged together iteration, a randomly selected gene expression pattern at- into an hierarchical tree. Other hierarchical methods may tracts the nearest cluster center, plus some of its neighbors calculate the distance between clusters differently. In in the grid. Over time, fewer cluster centers are updated (unweighted pair-group method using arithmetic aver- at each iteration, until finally only the nearest cluster is ages (UPGMA) Sneath and Sokal, 1973) for example, the drawn towards each gene, placing the cluster centers in the distance between two clusters is defined as the average center of gravity of the surrounding expression patterns. distance between genes in the two clusters. Drawbacks of this method are that the user has to specify Ben-Doret al. (1999) have developed a clustering algo- a priori the number of clusters (as for K -means), as well rithm based on random graph theory. Their method shares as the grid topology, including the dimensions of the grid features with both agglomerative hierarchical clustering (typically one, two or three dimensional) and the number and K -means. Clusters are constructed one at a time. The of clusters in each dimension (e.g. eight clusters could be gene with the largest ‘affinity’ (smallest average distance mapped to a 2×4 two-dimensional grid or a 2×2×2 three- to all other genes in the cluster) is added to the cluster, dimensional cube). The artificial grid structure makes it if the affinity is larger than a cut-off. A gene can also be very easy to visualize the results, but may have residual removed from the cluster if its affinity drops below the cut- effects on the final clustering. Optimization techniques for off. A finite number of clusters are constructed depending selecting the number of clusters developed for K -means on the cut-off. The ability to remove ill-fitting genes from can presumably be used here too. the cluster is an attractive feature of this algorithm. Zhu The Expectation-Maximization (EM) algorithm (Demp- and Zhang (2000) used a similar algorithm to cluster yeast ster et al., 1977) for fitting a mixture of Gaussians (also sporulation data. known as Fuzzy K -Means Bezdek, 1981) is very similar to Alon et al. (1999) used a divisive hierarchical algorithm K -means, and has been used by Mjolsness et al. (1999b) to cluster gene expression data of colon cancer. The to cluster yeast data. Rather than classifying each gene into method relies on the maximum entropy principle and one specific cluster, we can assign membership functions attempts to find the most likely partition of data into (typically Gaussians, or any other parametric probability clusters at a given ‘cost’ (sum of squared within-cluster distribution) to each cluster, allowing each gene to be part distances). Starting from a single cluster with large cost, of several clusters. As in K -means, we alternately update as the allowed cost is lowered, the cluster breaks up the membership for each expression pattern, and then the spontaneously into multiple clusters in order to maximize parameters associated with each cluster: centroid, covari- the entropy for the configuration, within the constraint of ance and mixture weight. Cluster boundaries are sharp and fixed total cost. linear in K -means, smooth and rounded in EM. Califano et al. (2000) have developed a clustering al- Autoclass is also related to EM, in that it finds a mixture gorithm to identify groups of genes which can be used of probability distributions. In addition, it uses Bayesian for phenotype classification of cell types, by searching for methods to derive the maximum posterior probability clas- clusters of microarray samples that are highly correlated sification, and the optimum number of clusters (Cheese- over a subset of genes. Only the most significant clusters man and Stutz, 1996). are returned. The same technique could be used to find Wen et al. (1998) used the FITCH hierarchical cluster- clusters of genes that are highly coexpressed over a sub- ing algorithm (Felsenstein, 1993) to group the expression set of their expression profiles. Han et al. (1997) used a patterns of 112 genes in spinal-cord development, pro- similar, partial matching approach to group objects into a ducing a graph similar to the phylogenetic trees famil- hypergraph based on correlations over subsets of the data. iar to most biologists Sokal and Michener (1958). The In a hypergraph, each hyperedge (corresponding to a sin- expression clusters captured the main waves of gene ex- gle cluster) connects several nodes (genes), so each node pression in development. While the algorithm used in this (gene) can be part of several hyperedges (clusters). Mjol- study minimizes the overall distance in the tree, the com- sness et al. (1999a) developed a hierarchical algorithm that putational requirement grows with the fourth power of the places objects into a directed, acyclic graph, where each number of elements, making it impractical for much larger cluster can be part of several parent clusters. The algo- data sets. rithm optimizes the number of clusters, cluster positions Eisen et al. (1998) applied a standard agglomerative and partial cluster memberships of objects, such as to pro- hierarchical clustering algorithm, average-linkage anal- vide the most compact graph structure. All three clustering ysis, to large-scale gene expression data. Starting with methods allow genes to be part of several clusters, possi- N clusters containing a single gene each, at each step bly coinciding with multiple regulatory motifs or multi- in the iteration the two closest clusters are merged into ple functional classifications for each gene. This makes a larger cluster. Distance between clusters is defined as them especially appropriate for eukaryotic gene expres- 713 P.D’haeseleer et al. sion where genes are controlled by complex inputs from multiple transcription factors and enhancers. Other applications of co-expression clusters Gene expression clustering is potentially useful in at least three areas: (i) extraction of regulatory motifs (co- regulation from co-expression); (ii) inference of functional annotation; (iii) as a molecular signature in distinguishing cell or tissue types. Genes in the same expression cluster will tend to share biological functions. In a system as complex as the developing rat spinal cord, expression clustering clearly led to a segregation according to functional gene families (Wen et al., 1998). Moreover, cluster–function relationships exist over several methods of classification; for example, neurotransmitter receptor ligand classes and sequence/pathway groups each selectively map to expression waves (Figure 3). Tavazoie et al. (1999), used K -means to partition yeast genes during the cell cycle into 30 distinct clusters, and found the members of each cluster to be significantly enriched for genes with similar functions. Functions of unknown genes may be hypothesized from genes with known function within the same cluster. Yeast genes with previously unknown functions have been identified from their temporal pattern Fig. 3. Gene expression clusters reflect gene families and pathways. of expression during spore morphogenesis and their Neurotransmitter receptors follow particular expression waveforms functional role in sporulation has been confirmed in according to ligand and functional class. Waves 1–4 correspond to major expression clusters found in rat spinal-cord development deletion experiments (Chu et al., 1998). (Wen et al., 1998). Note that the early expression waves 1 and 2 mRNA expression can be regarded as a molecular signa- are dominated by ACh and GABA receptors, and by receptor ion- ture of a cell’s phenotype. Clustering of gene expression channels in general. Each line represents a gene, and can be traced patterns helps differentiate different cell types, which is by following its reflection at each node (Agnew, 1998). useful, for example, in recognizing subclasses of cancers (Alon et al., 1999; Golub et al., 1999; Perou et al., 1999; Alizadeh et al., 2000). Two-way clustering of both the of any arbitrary shapes and sizes in a multidi- genes and experiments allows for easy visualization (Eisen mensional pattern space. Each clustering cri- et al., 1998; Alon et al., 1999; Alizadeh et al., 2000; Wein- terion imposes a certain structure on the data, stein et al., 1997). Because activities of genes are often re- and if the data happens to conform to the re- lated to each other, gene expression is highly constrained, quirements of a particular criterion, the true and gene expression patterns under different conditions clusters are recovered. can be very similar. Clustering is necessary for identify- It is impossible to objectively evaluate how ‘good’ a spe- ing the coherent patterns. cific clustering is without referring to what the clustering Which clustering method to use? will be used for. However, once an application has been identified, it may be possible to evaluate objectively the We have discussed several different distance measures quality of the clustering for that particular application. and clustering algorithms. Each combination of distance For example, if we want to extract regulatory motifs measure and clustering algorithm will tend to emphasize from clusters, we can compare clustering methods based different types of regularities in the data. Some may be on the P -values of the resulting motifs. Similarly, for useless for what we want to do. Others may give us functional classification, we can compare P -values asso- complementary pieces of information. Jain and Dubes ciated with enrichment of clusters in certain functional (1988) state: categories. It is unlikely that there would be a single There is no single best criterion for obtaining best-clustering method for all applications. Considering a partition because no precise and workable the overwhelming number of combinations of distance definition of ‘cluster’ exists. Clusters can be measures and clustering algorithms—far too many to 714 Genetic network inference try them all each time—the field is in dire need of a (Arkin et al., 1998) included a total of 67 parameters, and comparison study of the main combinations for some of required supercomputers for its stochastic simulation. the standard applications, such as functional classification In-depth biochemical modeling is very important in un- or extraction of regulatory motifs. If we want to use gene derstanding the precise interactions in common regulatory clusters to infer regulatory interactions, synthetic data mechanisms, but clearly we cannot expect to construct generated from small but detailed models of regulatory such a detailed molecular model of, say, an entire yeast networks could provide a useful touchstone for comparing cell with some 6000 genes by analysing each gene clustering methods. Preliminary results comparing SOM, and determining all the binding and reaction constants K -means, FITCH and Autoclass—all using Euclidean one-by-one. Likewise, from the perspective of drug target distance—showed very poor performance of all clus- identification for human disease, we cannot realistically tering methods in identifying a metabolic pathway with hope to characterize all the relevant molecular interactions associated regulation of the enzymes by the metabolites one-by-one as a requirement for building a predictive (Mendes, 1999). disease model. There is a need for methods that can The greatest challenge in cluster analysis lies in faith- handle large-scale data in a global fashion, and that can fully capturing complex relationships in biological net- analyse these large systems at some intermediate level, works. As stated above, a gene may participate in mul- without going all the way down to the exact biochemical tiple functions, and is controlled by the activities of many reactions. other genes through a variety of cis-regulatory elements. Boolean or continuous Therefore, for complex datasets spanning a variety of bio- logical responses, a gene should by definition be a mem- The Boolean approximation assumes highly cooperative ber of several clusters, each reflecting a particular aspect binding (very ‘sharp’ activation response curves) and/or of its function and control. As more data becomes avail- positive feedback loops to make the variables saturate able to accurately delineate expression behavior under dif- in ON or OFF positions. However, examining real gene ferent conditions, we should consider using some of the expression data, it seems clear that genes spend a lot of clustering methods that partition genes into non-exclusive their time at intermediate values: gene expression levels clusters. Alternatively, several clustering methods could tend to be continuous rather than binary. Furthermore, be used simultaneously, allocating each gene to several important concepts in control theory that seem indis- clusters based on the different regularities emphasized by pensable for gene regulation systems either cannot be each method. implemented with Boolean variables, or lead to a radically different dynamical behavior: amplification, subtraction Modeling methodologies and addition of signals; smoothly varying an internal parameter to compensate for a continuously varying Cluster analysis can help elucidate the regulation (or co- environmental parameter; smoothly varying the period of regulation) of individual genes, but eventually we will a periodic phenomenon like the cell cycle, etc. Feedback have to consider the integrated behavior of networks of control (see e.g. Franklin et al., 1994) is one of the most regulatory interactions. Various types of gene regulation important tools used in control theory to regulate system network models have been proposed, and the model of variables to a desired level, and reduce sensitivity to both choice is often determined by the question one is trying external disturbances and variation of system parameters. to answer. In this section we will briefly address some of Negative feedback with a moderate feedback gain has a the decisions that need to be made when constructing a stabilizing effect on the output of the system. However, network model, and the trade-offs associated with each. negative feedback in Boolean circuits will always cause Level of biochemical detail oscillations, rather than increased stability, because the Gene-regulation models can vary from the very abstract— Boolean transfer function effectively has an infinite slope such as Kauffman’s random Boolean networks (Kauff- (saturating at 0 and 1). Moreover, Savageau (1998) iden- man, 1993)—to the very concrete—such as the full tified several rules for gene circuitry (bacterial operons) biochemical interaction models with stochastic kinetics that can only be captured by continuous analysis methods. In this study, positive and negative modes of regulation in Arkin et al. (1998). The former approach is the most were respectively linked to high and low demand for mathematically tractable, and its simplicity allows ex- expression, and a relationship was established between amination of very large systems (thousands of genes). the coupling of regulator and effector genes and circuit The latter fits the biochemical reality better and may carry more weight with the experimental biologists, capacity and demand. but its complexity necessarily restricts it to very small Some of these problems can be alleviated by hybrid systems. For example, the detailed biochemical model Boolean systems. In particular, Glass (1975, 1978) has of the five-gene lysis–lysogeny switch in Lambda phage proposed sets of piecewise linear differential equations, 715 P.D’haeseleer et al. where each gene has a continuous-valued internal state, proach to unpredictability is to construct models that can and a Boolean external state. Researchers at the Free manipulate entire probability distributions rather than just University of Brussels (Thomas, 1991; Thieffry and single values. Stochastic differential equations could be Thomas, 1998) have proposed an asynchronously updated used for example. Of course, this does add a whole new logic with intermediate threshold values. These systems level of complexity to the models. Alternatively, a deter- allow for easy analysis of certain properties of networks, ministic model can sometimes be extended by a simplified and have been used for qualitative models of small gene analysis of the variance on the expected behavior. networks, but still do not seem appropriate for quantitative Spatial or non-spatial modeling of real, large-scale gene expression data. Spatiality can play an important role, both at the level of Deterministic or stochastic intercellular interactions, and at the level of cell compart- One implicit assumption in continuous-valued models is ments (e.g. nucleus versus cytoplasm versus membrane). that fluctuations in the range of single molecules can Most processes in multicellular organisms, especially be ignored. Differential equations are already widely during development, involve interactions between dif- used to model biochemical systems, and a continuous ferent cells types, or even between cells of the same approach may be sufficient for a large variety of interesting type. Some useful information may be extracted using mechanisms. However, molecules present at only a few a nonspatial model (see for example D’haeseleer et al. copies per cell do play an important role in some (1999); D’haeseleer and Fuhrman (2000) for a non-spatial biological phenomena, such as the lysis–lysogeny switch model of CNS development and injury), but eventually a in Lambda phage (Ptashne, 1992). In that case, it may be spatial model will be needed. impossible to model the behavior of the system exactly Spatiality adds a whole extra dimension of complex- with a purely deterministic model. ity to the models: spatial development, cell-type interac- tions, reservoirs, diffusion constants, etc. In some cases, These stochastic effects—which have mainly been observed in prokaryotes—may not play as much of a the abundance of data—spatial patterns—can more than role in the larger eukaryotic cells. In yeast, most mRNA make up for the extra complexity of the model. For ex- species seem to occur at close to one mRNA copy per cell ample, Mjolsness et al. (1991) used a time series of one- (Velculescu et al., 1997; Holstege et al., 1998a), down to dimensional spatial patterns to fit a simple model of eve 0.1 mRNA/cell or less (i.e. the mRNA is only present 10% stripe formation in Drosophila. Models like the ones pro- of the time or less in any one cell). Low copy numbers like posed by Marnellos and Mjolsness (1998) for the role of these could be due to leaky transcription and not have any lateral interactions in early Drosophila neurogenesis pro- regulatory role. Also, if the half-life of the corresponding vide experimentally testable predictions about potentially protein (typically measured in hours or days) is much important interactions. larger than the half-life of the mRNA (averaging around Data availability 20 min in yeast (Holstege et al., 1998b)), the protein level may not be affected by stochastic fluctuations in In general, we must also realize that molecular activity mRNA. Analysis of mRNA and protein decay rates and measurements are limited and are carried out over popula- abundances may allow us to identify those few genes for tion averages of cells, not on individual cell circuits. Mod- which stochastic modeling may prove necessary. eling methodologies must, therefore, be designed around Particle-based models can keep track of individual the level of organization for which data is actually accessi- molecule counts, and often include much biochemical ble. An exhaustive model must take into account RNA and detail and/or spatial structure. Of course, keeping track protein concentrations, phosphorylation states, molecular of all this detail is computationally expensive, so they are localization and so forth, since each molecular variable typically only used for small systems. A related modeling carries unique information. However, due to present lim- technique is Stochastic Petri Nets (SPNs), which can be itations in measuring technology, these data are not rou- considered a subset of Markov processes, and can be tinely accessible. Modeling is then challenged with pro- used to model molecular interactions (Goss and Peccoud, viding as much predictive power as possible given limited 1998). Whereas fitting the parameters of a general particle data on molecular states. The constraints and redundan- model to real data can be quite difficult, optimization cies in biological networks suggest that much may still be algorithms exist for SPNs. Hybrid Petri Nets (Mounts gained even though not all parameters involved in the pro- and Liebman, 1997; Matsuno et al., 2000) include both cess may be modeled. discrete and continuous variables, allowing them to model Forward and inverse modeling both small-copy number and mass action interactions. Additional sources of unpredictability can include ex- Some of the more detailed modeling methodologies listed ternal noise, or errors on measured data. The Bayesian ap- above have been used to construct computer models of 716 Genetic network inference small, well-described regulatory networks. Of course, covering problem is solved using the branch and bound this requires an extensive knowledge of the system in technique. They devise a perturbation strategy that may be question, often resulting from decades of research. In used by laboratory scientists for systematic experimental this review, we will not focus on this forward modeling design. It should be pointed out that the problem of design- approach, but rather on the inverse modeling,or reverse ing a Boolean circuit that corresponds to certain input– engineering problem: given an amount of data, what output mappings is well studied in electrical engineering, can we deduce about the unknown underlying regulatory and several efficient algorithms exist (e.g. Espresso Bray- network? Reverse engineering typically requires the use ton et al., 1984) that could provide inspiration for reverse of a parametric model, the parameters of which are then engineering of Boolean regulatory networks. fit to the real-world data. If the connection structure of Data requirements the regulatory network (i.e. which genes have a regulatory effect on each other) is unknown, the parametric model To correctly infer the regulation of a single gene, we need will necessarily have to be very general and simplistic. to observe the expression of that gene under many dif- The results of this sort of model only relate to the ferent combinations of expression levels of its regulatory overall network structure. While this will imply little inputs. This implies sampling a wide variety of different about the actual molecular mechanisms involved, much environmental conditions and perturbations. Gene expres- helpful information will be gained on genes critical for a sion time series yield a lot of data, but all the data points biological process, sufficient for the identification of drug tend to be about a single dynamical process in the cell, and targets for example. Once the network structure is well will be related to the surrounding time points. A 10-point known, a more detailed model may be used to estimate time series generally contains less information than a data individual mechanism-related parameters, such as binding set of 10 expression measurements under dissimilar envi- and decay constants. ronmental conditions, or with mutations in different path- ways. The advantage of the time series is that it can pro- Gene network inference: reverse engineering vide insight into the dynamics of the process. On the other hand, data sets consisting of individual measurements pro- Clustering is a relatively easy way to extract useful vide an efficient way to map the attractors of the network. information out of large-scale gene expression data sets, Both types of data, and multiple data sets of each, will be however, it typically only tells us which genes are needed to unravel the regulatory interactions of the genes. co-regulated, not what is regulating what. In network Successful modeling efforts will probably have to use inference, the goal is to construct a coarse-scale model of data from different sources, and will have to be able to deal the network of regulatory interactions between the genes. with different data types such as time series and steady- This requires inference of the causal relationships among state data, different error levels, incomplete data, etc. genes, i.e. reverse engineering the network architecture Whereas clustering methods can use data from different from its activity profiles. As the molecular dynamics data strains, in different growth media etc., combining data sets we acquire becomes more copious and complex, we may for reverse engineering of regulatory networks requires need to routinely consult reverse engineering methods to that differences between the experimental conditions be provide the guiding hypotheses for experimental design. quantified much more precisely. Likewise, data will have One may wonder whether it is at all possible to reverse to be calibrated properly to allow comparison with other engineer a network from its activity profiles. A straight- data sets. In this respect, there is a growing need for forward answer to this question should be obtainable from a reliable reference in relative expression measurements. model networks, e.g. Boolean networks, for which we un- An obvious approach could be to agree on a standard derstand the network architectures and can easily gener- ate activity profiles. In a first attempt, a simple method strain and carefully controlled growth conditions to use was introduced that showed that reverse engineering is in all data collection efforts on the same organism. possible in principle (Somogyi et al., 1997). A more sys- Alternatively, a reference mRNA population could be tematic and general algorithm was developed by Liang et derived directly from the DNA itself. al. (1998), using Mutual Information to identify a mini- Estimates for various network models mal set of inputs that uniquely define the output for each The ambitious goal of network reverse engineering comes gene at the next time step. Akutsu et al. (1999) proposed at the price of requiring more data points. The space of a simplified reverse engineering algorithm and rigorously models to be searched increases exponentially with the examined the input data requirements. For more realis- tic applications, a further modification was introduced by number of parameters of the model, and therefore with Akutsu et al. (2000) that allows the inference of Boolean the number of variables. Narrowing the range of models networks given noisy input data. Ideker et al. (2000) de- by adding extra constraints can simplify the search for the veloped an alternative algorithm in which the minimal set- best model considerably. Including such information into 717 P.D’haeseleer et al. Table 1. Fully connected: each gene can receive regulatory inputs from all 1.0000 other genes 0.1000 Model Data needed Boolean, fully connected 2 0.0100 Boolean, connectivity K 2 (K + log( N )) Boolean, connectivity K , linearly separable K log( N /K ) Continuous, fully connected, additive N + 1 k=1 Continuous, connectivity K , additive K log( N /K)(∗) 0.0010 k=2 Pairwise correlation comparisons (clustering) log( N ) k=3 Connectivity K : at most K regulatory inputs per gene. Additive, linearly 0.0001 separable: regulation can be modeled using a weighted sum. Pairwise 0 2040 6080 100 correlation: significance level for pairwise comparisons based on State transitions correlation must decrease inversely proportional to number of variables. (∗): conjecture. Fig. 4. Dependence of Boolean network reverse engineering algo- rithm on depth of training data. The probability of finding an incor- the inference process is the true art of modeling. rect solution is graphed versus the number of state transition pairs How many data points are needed to infer a gene used as input for the algorithm for a N = 50 element network. More network of N genes depends on the complexity of training data is required for networks of k = 3 inputs per gene than the model used to do the inference. Constraining the for networks of lower connectivity to minimize incorrect solutions. connectivity of the network (number of regulatory inputs However, only a small fraction e.g. 80 state transition pairs from per gene) and the nature of the regulatory interactions can a total of 250 = 1.13 × 1015 is required to obtain reliable results dramatically reduce the amount of data needed. Table 1 (Liang et al., 1998). provides an overview of some of the models considered, and estimates of the amount of data needed for each. These estimates hold for independently chosen data points, with the equivalent Boolean model, we speculate it to and only indicate asymptotic growth rates, ignoring any be of the form (K log( N /K )). A promising avenue of constant factors. further research in this area may be the results on sample To completely specify a fully connected Boolean net- complexity for recurrent neural networks, which have a work model, where the output of each gene is modeled very similar structure to the models presented here (Koiran as a Boolean function of the expression levels of all N and Sontag, 1998; Sontag, 1997). genes, we need to measure all possible 2 input–output Finally, to allow for comparison with gene clustering pairs. This is clearly inconceivable for realistic numbers methods, we examined data requirements for clustering of genes. If we reduce the connectivity of the Boolean based on pairwise correlation comparisons (see Ap- network to an average of K regulatory inputs per gene, pendix Data requirements for pairwise correlation the data requirements decrease significantly. For a slightly comparisons). As the number of genes being compared simpler model, we can derive a lower bound of (2 (K + increases, the number of data points will have to in- log( N ))) (see Appendix Data requirements for sparse crease proportional to log( N ), in order to maintain a Boolean networks), which agrees well with preliminary constant, low level of false positives. Claverie (1999) experimental results by Liang et al. (1998) and Akutsu et arrived at a similar logarithmic scaling for binary data al. (1999) (see Figure 4). Further constraining the Boolean (absent/detected). model to use only linearly separable Boolean functions Note that, for reasonably constrained models, the num- (those that can be implemented using a weighted sum of ber of data points needed will scale with log( N ) rather the inputs, followed by a threshold function) reduces the than N , and that the data requirements for network in- data requirements to (K log( N /K )) (Hertz, 1998). ference are at least a factor K larger than for clustering. For models with continuous expression levels, the In practice, the amount of data may need to be orders of data requirements are less clear. In the case of linear magnitude higher because of non-independence and large (D’haeseleer et al., 1999) or quasi-linear (Weaver et al., measurement errors (see also Szallasi, 1999). Higher ac- 1999) additive models, fitting the model is equivalent curacy methods such as RT-PCR yield more bits of infor- to performing a multiple regression, so at least N + 1 mation per data point than cDNA microarrays or oligonu- data points are needed for a fully connected model of N cleotide chips, so fewer data points may be required to genes. Data requirements for sparse additive regulation achieve the same accuracy in the model. Modeling real models are as yet unknown, but based on the similarity data with Boolean networks discards a lot of information P (incorrect solution) Genetic network inference in the data sets, because the expression levels need to be connectionist model (Mjolsness et al., 1991), linear model discretized to one bit per measurement. Continuous mod- (D’haeseleer et al., 1999), linear transcription model els will tend to use the available information in the data set (Chen et al., 1999), weight matrix model (Weaver et better. al., 1999). The core of these seems to be the additive interaction between the regulatory inputs to each gene, Correlation metric construction so we suggest calling these models collectively additive Adam Arkin and John Ross have been working on a regulation models. method called Correlation Metric Construction (Arkin and In the simplest case, we can think of these models as Ross, 1995; Arkin et al., 1997), to reconstruct reaction being similar to multiple regression: networks from measured time series of the component chemical species. This approach is based in part on x ≈ w x + b i ji j i electronic circuit theory, general systems theory and i multivariate statistics. Although aimed more towards cell or signaling or metabolic networks, the same methodology x (t + t ) = w x (t ) + b i ji j i could be applied to regulatory networks. The system (a reactor vessel with chemicals im- where x is the expression level of gene i at time t , b plementing glycolysis) is driven using random (and i i is a bias term indicating whether gene i is expressed or independent) inputs for some of the chemical species, not in the absence of regulatory inputs, and weight w while the concentration of all the species is monitored ji indicates the influence of gene j on the regulation of over time. A distance matrix is constructed based on gene i . This can be written equivalently as a difference the maximum time-lagged correlation between any two or differential equation. Given an equidistant time series chemical species. This distance matrix is then fed into a simple clustering algorithm to generate a tree of con- of expression levels (or an equidistant interpolation of a nections between the species, and the results are mapped non-equidistant time series), we can use linear algebra into a two-dimensional graph for visualization. It is also to find the least-squares fit to the data. Weaver et al. possible to use the information regarding the time lag (1999) showed how a non-linear transfer function can be between species at which the highest correlation was incorporated into the model as well, and demonstrated found, which could be useful to infer causal relationships. that some randomly generated networks can be accurately More sophisticated methods from general systems theory, reconstructed using this modeling technique. D’haeseleer based on mutual information, could be used to infer et al. (1999); D’haeseleer and Fuhrman (2000) showed dependency. that even a simple linear model can be used to infer biologically relevant regulatory relationships from real Systems of differential equations data sets (Figure 5). Simple systems of differential equations have already Chen et al. (1999) presented a number of linear differen- proven their worth in modeling simple gene regulation tial equation models which included both mRNA and pro- systems. For example, the seminal work of Mjolsness tein levels. They showed how such models can be solved et al. (1991) used a spatial ‘gene circuit’ approach to using linear algebra and Fourier transforms. Interestingly, model a small number of genes involved in pattern they found that mRNA concentrations alone were not suf- formation during the blastoderm stage of development ficient to solve their model, without at least the initial pro- in Drosophila. The change in expression levels at each tein levels. Conversely, the model can be solved given only a time series of protein concentrations. point in time depended on a weighted sum of inputs Models that are more complex may require more general from other genes, and diffusion from neighboring ‘cells’. methods for fitting the parameters to the expression Synchronized cell divisions along a longitudinal axis data. Mjolsness et al. (1991) used Simulated Annealing (under the control of a maternal clock) were alternated to fit their hybrid model—incorporating both reaction- with updating the gene expression levels. The model was able to successfully replicate the pattern of eve stripes in diffusion kinetics and cell divisions—to spatial data. Drosophila, as well as some mutant patterns on which the Mjolsness et al. (1999b) used a similar approach to fit model was not explicitly trained. a recurrent neural network with weight decay to clusters of yeast genes. Genetic algorithms (GAs) have been Additive regulation models used to model the interaction between four ‘waves’ of The differential equation systems described above model coordinately regulated genes (Wahde and Hertz, 1999) gene networks using an update rule based on a weighted previously identified in rat CNS development (Wen et sum of inputs. Several variants of such models have al., 1998). Similarly, Tominaga et al. (1999) used a GA been proposed, with each group coining a different name: to fit a power-law model (Savageau, 1995; Tominaga 719 P.D’haeseleer et al. (A) Nestin 0.5 0.5 GABARalpha4 aFGF 0.5 -10 0 10 20 30 40 50 60 GAD65 (B) GAD67 GRg2 GRa3 GRa2 GRa1 preGAD67 GRb3 GRa4 GRb1 GRb2 GRg1 G67I8086 GRg3 G67I86 nestin GRa5 Fig. 5. Continuous-valued reverse engineering of a CNS genetic network using a linear additive method. (A) Experimental gene expression data (circles; development and injury), and simulation using a linear model (lines). The model faithfully reproduces the time series of the training data sets. Dotted line: spinal cord, starting 11 days before birth. Solid line: hippocampus development, starting seven days before birth. Dashed line: hippocampus kainate injury, starting at postnatal day 25. (B) Hypothetical gene interaction diagram for the GABA signaling family inferred from developmental gene expression data (spinal cord and hippocampus data). While individual proposed interactions have not yet been experimentally verified, the high predicted connectivity within this gene family appears biologically plausible. The positive feedback interaction of the GAD species has been proposed independently in another study (Somogyi et al., 1995). Solid lines correspond to positive interactions, broken lines suggest inhibitory relationships (D’haeseleer et al., 1999). 720 Genetic network inference and Okamoto, 1998) of a small gene network. Networks for clustering according to co-expression profiles with larger numbers of genes will likely require stronger should select the appropriate experimental sets for optimization algorithms. Akutsu et al. (2000) proposes analysis, and provide flexible solutions with multi- using Linear Programming for both power-law models ple cluster memberships that more accurately reflect and qualitative hybrid Boolean-like networks. Efficient the biological reality. Well-designed cluster analysis gradient descent algorithms developed for continuous- promises to identify new pathway relationships time recurrent neural networks (Pearlmutter, 1995) may and gene functions that may be critical to cellular be useful for even larger networks (D’haeseleer et al., control in health and disease. 1999). Alternatively, the size of the problem can be 3. Reverse engineering. Since it is the ultimate goal drastically reduced by combining gene clustering with to identify the causative relationships between network inference, deriving a regulation model only for gene products that determine important phenotypic the cluster centers (Wahde and Hertz, 1999; Mjolsness et parameters, top priority should be given to develop al., 1999b). reverse engineering methods that provide significant predictions. Alternative computational approaches Conclusions and outlook should be applied to given data sets, and their pre- We are participating in the transition of biology into an dictions tested in targeted experiments to identify information-driven science. However, this transition can the most reliable methods. be meaningful only if we focus on generating models that allow us to systematically derive predictions about 4. Integrated modeling. While the current focus is on important biological processes in disease, development the analysis of large-scale gene expression data, and metabolic control. These will find important applica- there are other established sources of information tions in pharmaceutical development and bioengineering on gene function, ranging from sequence homology (Zweiger, 1999). We have reviewed conceptual founda- and cis-regulatory sequences, to disease association tions for understanding complex biological networks, and and a wide variety of functional knowledge from several practical methods for data analysis. There are still targeted experiments. Ideally, all of these categories major challenges ahead, which may be divided into five of information should be included in model build- areas: ing. A major challenge here lies in the reliability and compatibility of these data sets. 1. Measurement quantity, depth and quality. Any at- tempt at predictive data analysis and model building 5. Coupling of modeling with systematic experimental critically depends on the scope and quality of the in- design. Discovery of a novel gene function through put data. Ideally, we would like to gain access to the expression profiling and computational inference activities of all important molecular species in a bio- depends on the optimal coordination of experimen- logical process (ranging from mRNA to metabolites tal technology with data-analysis methods. While and second messengers), with adequate quantitative, data-analysis methods must be centered around anatomical and temporal resolution. However, even data that are realistically accessible, critical pre- though our analytical measurement technologies dictions from the models must guide experimental are undergoing transformations in precision and design. The hope is that progressive iteration of throughput, there will always be limitations to the predictions, experimental measurements and model amount of data and resolution we can acquire and updates will result in increased fidelity and depth of process. Computational data analysis must therefore computational models of biological networks. identify the most essential molecular parameters to guide experimental measurements, and critically Acknowledgements evaluate measurement precision and reproducibility with appropriate statistical measures. P.D. gratefully acknowledges the support of the National Science Foundation (grant IRI-9711199), the Office of 2. Clustering and functional categorization. One Naval Research (grant N00014-99-1-0417), and the Intel priority in this area is to compare the large variety Corporation. S.L. gratefully acknowledges support from of existing clustering methods (including different NASA Collaborative Agreement NCC 2-794. The authors normalizations and distance measures), and identify those that give the most biologically relevant results. would like to express their appreciation to Xiling Wen, Just as a gene can play multiple functional roles in Millicent Dugich and Stefanie Fuhrman for providing data various pathways and is subject to different regula- on hippocampal development and injury, and to Stefanie tory inputs, co-expression patterns vary according Fuhrman for critically reading and commenting on the to the cellular and experimental context. Methods manuscript. 721 P.D’haeseleer et al. algorithms: we say that two genes cluster together if their Appendix A: Data requirements for sparse Boolean correlation is significantly greater (with a significance networks level α) than a certain cut-off value ρ . We test whether To fully specify a Boolean network with limited connec- we can exclude the null hypothesis ρ< ρ based on tivity, we need to specify the connection pattern and the the measured correlation coefficient r over the available rule table for a function of K inputs at each. A lower bound data points. Because of the large number of comparisons of (2 + K log( N )) can be derived using information being made, we need to reduce the significance level for theory. For a slightly simpler model, where we assume the the correlation test, proportional to the number of tests pattern of connectivity is given, we can calculate how the each gene is involved in: α = α / N (this will keep number of independently chosen data points should scale the expected number of false positives for each gene with K and N . Since this is a simpler model, its data re- constant). In order to be able to use the same cut-off value quirements should be a lower bound to the requirements for the measured correlation r to decide whether two for the model with unknown connections. genes cluster together, the number of data points will have Every data point (i.e. every input–output pair, specifying to increase as the significance level for each test grows the state of the entire Boolean network at time t and t + 1), smaller. specifies exactly one of 2 entries in each rule table. Given If the real correlation coefficient ρ is close to 1.0, this particular combination of the K inputs to each gene at the distribution of the measured correlation coefficient time t , the output of the gene is given by its state at time r is very asymmetrical. The following z-transformation, t + 1. We will estimate the probability P that all N rule developed by Fisher et al. (1969), is approximately tables are fully specified by n data points, and calculate normally distributed with mean z(ρ) and variance 1/(n − how the number of data points n needs to scale with P , 3) (with n the number of data points): the number of genes N , and connectivity K.For P ≈ 1 (i.e. we have enough data to have a good chance at a fully 1 1 + r z(r ) = ln . specified model), the probability for a single rule table to 2 1 − r be fully specified by n data points is approximately: We can now devise a single-sided test on z(r ) to answer K −K n 1 − 2 (1 − 2 ) . the following question. If z(r)> z(r ), what is the significance level with which we can reject the hypothesis The probability that all N rule tables are fully specified by z(ρ) < z(ρ ) (and thus ρ< ρ )? At the tail of the normal c c n data points then becomes: distribution, the area under the normal curve to the right K −K n N of z(r ) can be approximated by: P ≈ (1 − 2 (1 − 2 ) ) . α (z−z(ρ )) Taking base-2 logarithms, and further simplifying using 2σ α = √ e dz log (1 − z) ≈−z log (e) for z 1, we find: σ 2π 2 2 z=z(r ) (z(r )−z(ρ )) α c C =− log( P) 2 1 2σ ≈ √ e K −K n 2π(z(r ) − z(ρ )) α c ≈− N log(1 − 2 (1 − 2 ) ) K −K n ≈ N 2 (1 − 2 ) log(e) with σ = 1/ n − 3, and taking natural logs: C =− log(C / log(e)) 2 1 ln(α) ≈ ln(α ) − ln( N ) −K ≈− log( N ) − K − n log(1 − 2 ) √ ≈− ln(n − 3) − ln( 2π(z(r ) − z(ρ ))) −K α c ≈− log( N ) − K + n2 log(e). −(n − 3)(z(r ) − z(ρ )) /2 α c If P ≈ 1, C will be a small, and C a large positive 1 2 n ≈ 3 + (ln( N ) + ln(1/α ) constant. We can now express n, the number of data points (z(r ) − z(ρ )) α c needed, in terms of N , K and C : − ln(n − 3)/2 − ln( 2π(z(r ) − z(ρ )))) α c n ≈ 2 (K + log( N ) + C )/ log(e) = O(log( N )). which is O(2 (K + log( N ))). In other words, if we want to use the same cut-off value r to decide whether ρ> ρ , we need to scale the number α c Appendix B: Data requirements for pairwise of data points logarithmically with the number of genes. correlation comparisons Strictly speaking, this analysis only holds for correlation Let us examine a very simple form of clustering as a tests, but we can expect similar effects to play a role in representative example of the wide variety of clustering other clustering algorithms. 722 Genetic network inference Cheeseman,P. and Stutz,J. (1996) Bayesian classification (auto- References class): theory and results. In Fayyad,U.M., Piatetsky-Shapiro,G., Agnew,B. (1998) NIH plans bioengineering initiative. Science, 280, Smyth,P. and Uthurusamy,R. (eds), Advances in Knowledge Dis- 1516–1518. covery and Data Mining AAAI Press/MIT Press, http://ic-www. Akutsu,T., Miyano,S. and Kuhara,S. (1999) Identification of genetic arc.nasa.gov/ic/projects/bayes-group/images/kdd-95.ps networks from a small number of gene expression patterns under Chen,T., He,H.L. and Church,G.M. (1999) Modeling gene ex- the Boolean network model. Pacific Symposium on Biocomput- pression with differential equations. Pacific Symposium on ing, 4,17–28. http://www.smi.stanford.edu/projects/helix/psb99/ Biocomputing, 4,29–40. http://www.smi.stanford.edu/projects/ Akutsu.pdf helix/psb99/Chen.pdf Akutsu,T., Miyano,S. and Kuhara,S. (2000) Algorithms for infer- Chu,S., DeRisi,J., Eisen,M., Mulholland,J., Botstein,D., Brown,P.O. ring qualitative models of biological networks. Pacific Sympo- and Herskowitz,I. (1998) The transcriptional program of sporu- sium on Biocomputing, 5, 290–301.http://www.smi.stanford.edu/ lation in budding yeast. Science, 282, 699–705. projects/helix/psb00/akutsu.pdf Claverie,J.-M. (1999) Computational methods for the identification Aldenderfer,M.S. and Blashfield,R.K. (1984) Cluster Analysis. Sage of differential and coordinated gene expression. Hum. Mol. Publications, Newbury Park, CA. Genet., 8, 1821–1832. Alizadeh,A.A. et al. (2000) Distinct types of diffuse large B- Cunningham,M.J., Liang,S., Fuhrman,S., Seilhamer,J.J. and Somo- cell lymphoma identified by gene expression profiling. Nature, gyi,R. (2000) Gene expression microarray data analysis for toxi- 403(6769), 503–511. cology profiling. Ann. New York Acad. Sci., in press. Alon,U., Barkai,N., Notterman,D.A., Gish,K., Ybarra,S., Mack,D. Dempster,A.P., Laird,N.M. and Rubin,D.B. (1977) Maximum like- and Levine,A.J. (1999) Broad patterns of gene expression lihood from incomplete data via the EM algorithm. J. R. Stat. revealed by clustering analysis of tumor and normal colon tissues Soc., B39,1–38. probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, DeRisi,J.L., Iyer,V.R. and Brown,P.O. (1997) Exploring the 96(12), 6745–6750. metabolic and genetic control of gene expression on a genomic Arkin,A. and Ross,J. (1995) Statistical construction of chemical scale. Science, 278(5338), 680–686. reaction mechanism from measured time-series. J. Phys. Chem., D’haeseleer,P. and Fuhrman,S. (2000) Gene network inference 99, 970–979. using a linear, additive regulation model. In preparation. Arkin,A., Ross,J. and McAdams,H.H. (1998) Stochastic kinetic D’haeseleer,P., Wen,X., Fuhrman,S. and Somogyi,R. (1997) Mining analysis of developmental pathway bifurcation in phage lambda- the gene expression matrix: inferring gene relationships from- infected Escherichia coli cells. Genetics, 149, 1633–1648. large scale gene expression data. In: M.Holcombe, R.Paton Arkin,A., Shen,P. and Ross,J. (1997) A test case of correlation (eds) Information processing in cells and Tissues. Plenum, metric construction of a reaction pathway from measurements. pp. 203–212. http://www.cs.unm.edu/∼patrik/networks/IPCAT/ Science, 277, 1275–1279. ipcat.html Arnone,M.I. and Davidson,E.H. (1997) The hardwiring of develop- D’haeseleer,P., Wen,X., Fuhrman,S. and Somogyi,R. (1999) Linear ment: organization and function of genomic regulatory systems. modeling of mRNA expression levels during CNS development Development, 124, 1851–1864. and injury. Pacific Symposium on Biocomputing, 4,41–52. http: Ben-Dor,A., Shamir,R. and Yakhini,Z. (1999) Clustering gene //www.smi.stanford.edu/projects/helix/psb99/Dhaeseleer.pdf expression patterns. J. Comput. Biol., 6(3–4), 281–297. Eisen,M.B., Spellman,P.T., Brown,P.O. and Botstein,D. (1998) Bezdek,J.C. (1981) Pattern Recognition with Fuzzy Objective Cluster analysis and display of genome-wide expression patterns. Function Algorithms. Plenum Press, New York. Proc. Natl. Acad. Sci. USA, 95(25), 14863–14868. Bhattacharjya,A. and Liang,S. (1996) Power laws in some random Erb,R.S. and Michaels,G.S. (1999) Sensitivity of biological mod- Boolean networks. Phys. Rev. Lett., 77, 1644. els to errors in parameter estimates. Pacific Symposium on Brayton,R.K., Hachtel,G.D., McMullen,C.T. and Sangiovanni- Biocomputing, 4,53–64. http://www.smi.stanford.edu/projects/ Vincentelli,A.L. (1984) Algorithms for VLSI Synthesis. Kluwer helix/psb99/Erb.pdf Academic Publishers, ftp://ic.eecs.berkeley.edu/pub/Espresso/ Ewing,R.M., Kahla,A.B., Poirot,O., Lopez,F., Audic,S. and Brazma,A., Jonassen,I., Vilo,J. and Ukkonen,E. (1998) Predicting Claverie,J.M. (1999) Large-scale statistical analyses of rice ESTs gene regulatory elements in silico on a genomic scale. Genome reveal correlated patterns of gene expression. Genome Res., 9, 950–959. Res., 8, 1202–1215. Felsenstein,J. (1993) PHYLIP (Phylogeny Inference Package). ver- Brown,P.O. and Botstein,D. (1999) Exploring the new world of sion 3.5c, distributed by the author, Department of Genetics, Uni- genome with DNA microarrays. Nature Genet., 21,33–37. versity of Washington, Seattle. Butte,A.J. and Kohane,I.S. (2000) Mutual information relevance Ferea,T.L., Botstein,D., Brown,P.O. and Rosenzweig,R.F. (1999) networks: functional genomic clustering using pairwise entropy Systematic changes in gene expression patterns following adap- measurements. Pacific Symposium on Biocomputing, 5, 415– tive evolution in yeast. Proc. Natl. Acad. Sci. USA, 96, 17 9721– 426.http://www.smi.stanford.edu/projects/helix/psb00/butte.pdf Califano,A., Stolovitzky,G. and Tu,Y. (2000) Analysis of gene ex- Fisher,R.A., In Kendall,M.G. and Stuart,A. (1969) The Advanced pression microarrays for phenotype classification. 8th Interna- Theory of Statistics. Vol. 1, 3rd edn., Hafner Press, New York, tional Conference on Intelligent Systems for Molecular Biology, pp. 391. in press. 723 P.D’haeseleer et al. Franklin,G.F., Powell,J.D. and Emami-Naeini,A. (1994) Feedback An Introduction to Cluster Analysis. John Wiley and Sons, New Control of Dynamic Systems. 3rd edn, Addison-Wesley. York. Koiran,P. and Sontag,E.D. (1998) Vapnik-Chervonenkis dimension Fuhrman,S., Cunningham,M.J., Wen,X., Zweiger,G., Seilhamer,J.J. of recurrent neural networks. Discrete Appl. Math., 86,63–79. and Somogyi,R. (2000) The application of Shannon entropy in the identification of putative drug targets. Biosystems, 55(1–3), Lance,G.N. and Williams,W.T. (1966) A general theory of classifi- 5–14. catory sorting strategies: 1. hierarchical systems. Comput. J., 9, 373–380. Fuhrman,S., D’haeseleer,P. and Somogyi,R. (1999) Tracing genetic information flow from gene expression to pathways and molec- Liang,S., Fuhrman,S. and Somogyi,R. (1998) REVEAL, a gen- ular networks. 1999 Society for Neuroscience Short Course Syl- eral reverse engineering algorithm for inference of genetic net- labus, DNA Microarrays: The New Frontier in Gene Discovery work architectures. Pacific Symposium on Biocomputing, 3,18– and Gene Expression Analysis. 29. http://www.smi.stanford.edu/projects/helix/psb98/liang.pdf Glass,L. (1975) Combinatorial and topological methods in nonlin- MacQueen,J. (1967) Some methods for classification and anal- ear chemical kinetics. J. Chem. Phys., 63, 1325–1335. ysis of multivariate observation. In Le Cam,L.M. and Nye- man,J. (eds), Proceedings of the Fifth Berkeley Symposium Glass,L. (1978) Stable oscillations in mathematical models of on Mathematical Statistics and Probability Vol. I, University of biological control systems. J. Math. Biol., 6, 207–223. California Press. Golub,T.R., Slonim,D.K., Tamayo,P., Huard,C., Gaasen- Mahalanobis,P.C. (1936) On the generalized distance in statistics. beek,M., Mesirov,J.P., Coller,H., Loh,M.L., Downing,J.R., Proc. Nat. Inst. Sci. India 12, pp. 49–55. Caligiuri,M.A., Bloomfield,C.D. and Lander,E.S. (1999) Molec- ular classification of cancer: class discovery and class prediction Marnellos,G. and Mjolsness,E. (1998) A gene network approach by gene expression monitoring. Science, 286, 531–537. to modeling early neurogenesis in Drosophila. Pacific Sympo- sium on Biocomputing, 3,30–41. http://www.smi.stanford.edu/ Goss,P.J. and Peccoud,J. (1998) Quantitative modeling of stochastic projects/helix/psb98/marnellos.pdf systems in molecular biology by using stochastic Petri nets. Proc. Natl. Acad. Sci. USA, 95, 6750–6755. Massart,D.L. and Kaufman,L. (1983) The Interpretation of Analyt- ical Chemical Data by the Use of Cluster Analysis. John Wiley Han,E.-H., Karypis,G., Kumar,V. and Mobasher,B. (1997) and Sons, New York. Clustering in a high-dimensional space using hyper- graph models. Technical Report # 97-019 Department Matsuno,H., Doi,A. and Nagasaki,M. (2000) Hybrid petri net of Computer Science, University of Minnesota. http: representation of genetic regulatory network. Pacific Sympo- //www.cs.umn.edu/tech reports/1997/TR 97-019 Clustering sium on Biocomputing, 5, 338–349.http://www.smi.stanford.edu/ Based on Association Rule Hypergraphs.html. projects/helix/psb00/matsuno.pdf Hertz,J. (1998) Statistical issues in reverse engineering of genetic Mendes,P. (1999) Metabolic simulation as an aid in understanding networks. Poster for Pacific Symposium on Biocomputing, http: gene expression data. In Bornberg-Bauer,E., De Beuckelaer,A., //www.nordita.dk/∼hertz/papers/dgshort.ps.gz. Kummer,U. and Rost,U. (eds), Workshop on Computation of Biochemical Pathways and Genetic Networks Logos Verlag, Heyer,L.J., Semyon Kruglyak, and Shibu Yooseph, (1999) Explor- Berlin, pp. 27–33. ing expression data: identification and analysis of coexpressed genes. Genome Res., 9(11), 1106–1115. Michaels,G., Carr,D.B., Wen,X., Fuhrman,S., Askenazi,M. and So- mogyi,R. (1998) Cluster analysis and data visualization of large- Holstege,F.C.P., Jennings,E.G., Wyrick,J.J., Lee,T.I., Hengart- scale gene expression data. Pacific Symposium on Biocomput- ner,C.J., Green,M.R., Golub,T.R., Lander,E.S. and Young,R.A. ing, 3,42–53. http://www.smi.stanford.edu/projects/helix/psb98/ (1998a) Dissecting the regulatory circuitry of a eukaryotic michaels.pdf genome. Cell, 95, 717–728. Miklos,G.L. and Rubin,G.M. (1996) The role of the genome project Holstege,F.C.P., Jennings,E.G., Wyrick,J.J., Lee,T.I., Hengart- in determining gene function: insights from model organisms. ner,C.J., Green,M.R., Golub,T.R., Lander,E.S. and Young,R.A. Cell, 86, 4 521–529. (1998b) Genome-Wide Expression Page, http://web.wi.mit.edu/ young/expression/ Mjolsness,E., Castano,R. ˜ and Gray,A. (1999a) Multi-parent clus- tering algorithms for large-scale gene expression analy- Huang,S. (1999) Gene expression profiling, genetic networks, sis. Technical Report JPL-ICTR-99-5 Jet Propulsion Lab- and cellular states: an integrating concept for tumorigenesis oratory Section 367. http://www-aig.jpl.nasa.gov/public/mls/ and drug discovery. J. Mol. Med., 77, 6 469–480. papers/emj/multiparentPreprint.pdf. Ideker,T.E., Thorsson,V. and Karp,R.M. (2000) Discovery of regula- Mjolsness,E., Mann,T., Castano,R. ˜ and Wold,B. (1999b) From co- tory interactions through perturbation: inference and experimen- expression to co-regulation: an approach to inferring transcrip- tal design. Pacific Symposium on Biocomputing, 5, 302–313.http: tional regulation among gene classes from large-scale expres- //www.smi.stanford.edu/projects/helix/psb00/ideker.pdf sion data. Technical Report JPL-ICTR-99-4 Jet Propulsion Jain,A.K. and Dubes,R.C. (1988) Algorithms for Clustering Data. Laboratory Section 365. http://www-aig.jpl.nasa.gov/public/mls/ Prentice Hall, Englewood Cliffs, NJ. papers/emj/GRN99prprnt.pdf. Kauffman,S.A. (1969) Metabolic stability and epigenesis in ran- Mjolsness,E., Sharp,D.H. and Reinitz,J. (1991) A connectionist domly connected nets. J. Theoret. Biol., 22, 437–467. model of development. J. Theor. Biol., 152, 429–454. Kauffman,S.A. (1993) The Origins of Order, Self-Organization and Selection in Evolution. Oxford University Press. Mounts,W.M. and Liebman,M.N. (1997) Application of petri nets Kaufman,L. and Rousseeuw,P.J. (1990) Finding Groups in Data: and stochastic activity nets to modeling biological pathways and 724 Genetic network inference processes. Int. J. Comput. Simul.. Church,G.M. (1999) Systematic determination of genetic network architecture. Nature Genet., 22, 281–285. Niehrs,C. and Pollet,N. Synexpression groups in eukaryotes. Na- ture, 402, 483–487. Thieffry,D. and Thomas,R. (1998) Qualitative analysis of gene networks. Pacific Symposium on Biocomputing, 3,77–88. http: Pearlmutter,B.A. (1995) Gradient calculations for dynamic recur- //www.smi.stanford.edu/projects/helix/psb98/thieffry.pdf rent neural networks: a survey. IEEE Trans. Neural Netw., 6, 1212–1228. Thomas,R. (1991) Regulatory networks seen as asynchronous automata: a logical description. J. Theor. Biol., 153,1–23. Perou,C.M., Jeffrey,S.S., van de Rijn,M., Rees,C.A., Eisen,M.B., Ross,D.T., Pergamenschikov,A., Williams,C.F., Zhu,S.X., Tominaga,D. and Okamoto,M. (1998) Design of canonical model Lee,J.C., Lashkari,D., Shalon,D., Brown,P.O. and Botstein,D. describing complex nonlinear dynamics. In Yoshida,T. and (1999) Distinctive gene expression patterns in human mam- Shioya,S. (eds), Computer Applications in Biotechnology 1998 mary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. Elsevier Science, pp. 85–90. USA, 96, 16 9212–9217. Tominaga,D., Okamoto,M., Kami,Y., Watanabe,S. and Eguchi,Y. Ptashne,M. (1992) A genetic Switch. Cell Press and Blackwell (1999) Nonlinear Numerical Optimization Technique Based scientific publications. on a Genetic Algorithm, http://www.bioinfo.de/isb/gcb99/talks/ tominaga/ Roth,P., Hughes,J.D., Estep,P.W. and Church,G.M. (1998) Finding DNA regulatory motifs within unaligned noncoding sequences van Helden,J., Andre,B. and Collado-Vides,J. (1998) Extracting clustered by whole-genome mRNA quantitation. Nature Biotech- regulatory sites from the upstream region of yeast genes by nol., 16, 939–945. computational analysis of oligonucleotide frequencies. J. Mol. Biol., 281, 827–842. Savageau,M.A. (1995) Power-law formalism: a canonical nonlinear approach to modeling and analysis. In Proceedings of the World Velculescu,V.E., Zhang,L., Zhou,W., Vogelstein,J., Basrai,M.A., Congress of Nonlinear Analysts ’92 pp. 3323–3334. Bassett,D.E.,Jr, Hieter,P., Vogelstein,B. and Kinzler,K.W. (1997) Characterization of the yeast transcriptome. Cell, 88, 243–251. Savageau,M.A. (1998) Rules for the evolution of gene circuitry. Wahde,M. and Hertz,J. (1999) Course-grained reverse engineering Pacific Symposium on Biocomputing, 3,54–65. http://www.smi. stanford.edu/projects/helix/psb98/savageau.pdf of genetic regulatory networks. Proceedings of Information Processing in Cells and Tissues (IPCAT) ’99. Sneath,P.H.A. and Sokal,R.R. (1973) Numerical Taxonomy. Free- man, San Francisco. Walker,M.G., Volkmuth,W., Sprinzak,E., Hodgson,D. and Klin- gler,T. (1999) Prediction of gene function by genome-scale Sokal,R.R. and Michener,C.D. (1958) Univ. Kans. Sci. Bull., 38, expression analysis: prostate-cancer associated genes. Genome 1409–1438. Res., 9, 1198–1203. Somogyi,R. and Sniegoski,C. (1996) Modeling the complexity of Weaver,D.C., Workman,C.T. and Stormo,G.D. (1999) Modeling genetic networks. Complexity, 1(6), 45–63. regulatory networks with weight matrices. Pacific Symposium Somogyi,R., Fuhrman,S., Askenazi,M. and Wuensche,A. (1997) on Biocomputing, 4, 112–123. http://www.smi.stanford.edu/ The gene expression matrix: towards the extraction of genetic projects/helix/psb99/Weaver.pdf network architectures. Nonlinear Analysis, Proc. of Second Weinstein,J.N., Myers,T.G., O’Connor,P.M., Friend,S.H., For- World Cong. of Nonlinear Analysts (WCNA96), 30(3), 1815– nace,A.J.,Jr, Kohn,K.W., Fojo,T., Bates,S.E., Rubinstein,L.V., 1824. http://rsb.info.nih.gov/mol-physiol/reprints/WCNA.pdf Anderson,N.L., Buolamwini,J.K., van Osdol,W.W., Monks,A.P., Somogyi,R., Wen,X., Ma,W. and Barker,J.L. (1995) Developmental Scudiero,D.A., Sausville,E.A., Zaharevitz,D.W., Bunow,B., kinetics of GAD family mRNAs parallel neurogenesis in the rat Viswanadhan,V.N., Johnson,G.S., Wittes,R.E. and Paull,K.D. spinal cord. J. Neurosci., 15, 2575–2591. (1997) An information-intensive approach to the molecular phar- Somogyi,R. (1999) Making sense of gene expression data. Phar- macology of cancer. Science, 275, 343–349. mainformatics: A Trends Guide (Trends Supplement). pp. 17–24. Wen,X., Fuhrman,S., Michaels,G.S., Carr,D.B., Smith,S., Sontag,P. (1997) Shattering all sets of k points in general position re- Barker,J.L. and Somogyi,R. (1998) Large-scale temporal gene quires (k − 1)/2 parameters. Neural Comput., 9, 337–348. expression mapping of central nervous system development. Spellman,P.T., Sherlock,G., Zhang,M.Q., Iyer,V.R., Anders,K., Proc. Natl. Acad. Sci. USA, 95, 1 334–339. Eisen,M.B., Brown,P.O., Botstein,D. and Futcher,B. (1998) Wolfsberg,T.G., Gabrielian,A.E., Campbell,M.J., Cho,R.J., Comprehensive identification of cell cycle-regulated genes of Spouge,J.L. and Landsman,D. (1999) Candidate regulatory the yeast Saccharomyces cerevisiae by microarray hybridization. sequence elements for cell cycle-dependent transcription in Mol. Biol. Cell., 9, 3273–3297. saccharomyces cerevisiae. Genome Res., 9, 775–792. Szallasi,Z. (1999) Genetic network analysis in light of massively Wuensche,A. (1993) Discrete Dynamics Lab (DDLab) http://www. parallel biological data acquisition. Pacific Symposium on Bio- santafe.edu/∼wuensch/ddlab.html. computing, 4,5–16. http://www.smi.stanford.edu/projects/helix/ Wuensche,A. (1998) Genomic regulation modeled as a network psb99/Szallasi.pdf with basins of attraction. Pacific Symposium on Biocom- Tamayo,P., Slonim,D., Mesirov,J., Zhu,Q., Kitareewan,S., Dmitro- puting, 3,89–102. http://www.smi.stanford.edu/projects/helix/ vsky,E., Lander,E.S. and Golub,T.R. (1999) Interpreting patterns psb98/wuensche.pdf of gene expression with self-organizing maps: methods and ap- plication to hematopoietic differentiation. Proc. Natl. Acad. Sci. Wuensche,A. (1999) Classifying Cellular Automata Automatically: USA, 96, 2907. Finding gliders, filtering, and relating space-time patterns, attrac- Tavazoie,S., Hughes,J.D., Campbell,M.J., Cho,R.J. and tor basins, and the Z parameter. Complexity, 4(3), 47–66. http: 725 P.D’haeseleer et al. //www.santafe.edu/∼wuensch/cplex ab.html puting, 5, 476–487.http://www.smi.stanford.edu/projects/helix/ psb00/zhu.pdf Yuh,C.H., Bolouri,H. and Davidson,E.H. (1998) Genomic cis- regulatory logic: experimental and computational analysis of a Zweiger,G. (1999) Knowledge discovery in gene-expression- sea urchin gene. Science, 279, 1896–1902. microarray data: mining the information output of the genome. Trends Biotech., 17, 429–436. Zhu,J. and Zhang,M.Q. (2000) Cluster, function and promoter: analysis of yeast expression array. Pacific Symposium on Biocom- http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

Genetic network inference: from co-expression clustering to reverse engineering

Bioinformatics , Volume 16 (8): 20 – Aug 1, 2000

Loading next page...
 
/lp/oxford-university-press/genetic-network-inference-from-co-expression-clustering-to-reverse-xQPVdFGIuI

References (103)

Publisher
Oxford University Press
Copyright
© Oxford University Press 2000
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/16.8.707
Publisher site
See Article on Publisher Site

Abstract

Vol. 16 no. 8 2000 BIOINFORMATICS Pages 707–726 Genetic network inference: from co-expression clustering to reverse engineering 1 2, 3 3 Patrik D’haeseleer , Shoudan Liang and Roland Somogyi University of New Mexico, Department of Computer Science, Albuquerque, NM 87131, USA, NASA Ames Research Center, Moffett Field, CA 94035, USA and Incyte Pharmaceuticals, Inc., 3174 Porter Dr, Palo Alto, CA 94304, USA Received on December 20, 1999; revised on February 29, 2000; accepted on March 3, 2000 Abstract • Which genes are responsible for this disease? motivation: Advances in molecular biological, analytical • Which drugs will treat this disease? and computational technologies are enabling us to sys- tematically investigate the complex molecular processes Beginning with gene sequencing, we are identifying the underlying biological systems. In particular, using high- structure of thousands of genes, and a variety of structural throughput gene expression assays, we are able to measure and regulatory features that provide functional clues. the output of the gene regulatory network. We aim here to However, only the molecular machinery of the cell is review datamining and modeling approaches for concep- able to consistently interpret the sequence information tualizing and unraveling the functional relationships im- into the functions which determine the complex genetic plicit in these datasets. Clustering of co-expression pro- and biochemical networks that define the behavior of an files allows us to infer shared regulatory inputs and func- organism. Since we ultimately seek understanding of the tional pathways. We discuss various aspects of clustering, regulatory skeleton of these networks, we are also taking ranging from distance measures to clustering algorithms steps to monitor the molecular activities on a global level and multiple-cluster memberships. More advanced anal- to reflect the effective functional state of a biological ysis aims to infer causal connections between genes di- system. Several technologies, ranging from hybridization rectly, i.e. who is regulating whom and how. We discuss microarrays, automated RTPCR, to two-dimensional gel several approaches to the problem of reverse engineer- electrophoresis and antibody arrays, allow us to assay ing of genetic networks, from discrete Boolean networks, RNA and protein expression profiles with differing levels to continuous linear and non-linear models. We conclude of precision and depth. that the combination of predictive modeling with system- But how should we organize this process of activity- atic experimental verification will be required to gain a data acquisition? How should we interpret these results deeper insight into living organisms, therapeutic targeting to enhance our understanding of living organisms, and and bioengineering. identify therapeutic targets and critical processes for Contact: [email protected]; [email protected]; bioengineering? Here it is crucial to find the proper [email protected] abstractions around which to build modeling frameworks and data analysis tools. These abstractions must be Introduction centered on two important principles: Novel, high-throughput technologies are opening global perspectives of living organisms on the molecular level. 1. Genetic information flow: Defining the mapping Together with a vast experimental literature on biomolec- from sequence space to functional space. ular processes, these data are now providing us with the The genome contains the information for constructing the challenge of multifunctionality, implying regulatory net- complex molecular features of an organism, as reflected works as opposed to isolated, linear pathways of causality in the process of development. In information terms, the (Szallasi, 1999). Questions which have traditionally been complexity of the fully developed organism is contained posed in the singular are now being addressed in the plu- in that of the genome. But which processes link these two ral. types of information, in other words, what are the codes that translate sequence into structure and function? We • What are the functions of this gene? need to represent these codes in a form understandable • Which genes regulate this gene? to us, so that we may apply them to model building and c Oxford University Press 2000 707 P.D’haeseleer et al. prediction. We therefore seek methods that allow us to One way to make progress in understanding the princi- ples of network behavior is to radically simplify the indi- extract these codes from gene sequence and activity data. vidual molecular interactions, and focus on the collective 2. Complex dynamic systems: From states to outcome. Boolean networks (Kauffman, 1969) represent trajectories to attractors. such a simplification: each gene is considered as a binary variable—either ON or OFF—regulated by other genes When addressing biological function, we usually refer through logical or Boolean functions (as can be found in to functions in time, i.e. causality and dynamics. On a the biochemical literature: see e.g. Yuh et al., 1998). Even molecular level, function is manifested in the behavior with this simplification, the network behavior is already of complex networks. The dynamics of these networks extremely rich (Kauffman, 1993). Many useful concepts resemble trajectories of state transitions—this is what we naturally emerge from such a simple mathematical model. are monitoring when conducting temporal gene expression For example, cell differentiation corresponds to transitions surveys. The concept of attractors is what really lends from one global gene expression pattern to another. Stabil- meaning to these trajectories, i.e. the attractors are the ity of global gene expression patterns can be understood as high-dimensional dynamic molecular representations of a consequence of the dynamic properties of the network, stable phenotypic structures such as differentiated cells namely that all networks fall into one or more attractors, and tissues, either healthy or diseased (Kauffman, 1993; representing stable states of cell differentiation, adaptation Somogyi and Sniegoski, 1996). Our goal is to understand or disease. the dynamics to the point where we can predict the For a Boolean network with N genes, the total number attractor of a molecular network, and know enough about of global gene expression patterns can be very large the network architecture to direct these networks to even for moderate N (2 : each gene can be either on attractors of choice, e.g. from a cancerous cell type to a or off independently). We assume that each gene is benign cell type, from degeneration to regeneration etc. controlled by up to K other genes in the network. The The goal of this review is to discuss principles of genetic connectivity K is a very important parameter for network network organization, and computational methods for ex- dynamics (with large K , the dynamics tends to be more tracting network architectures from experimental data. In chaotic). In Random Boolean Network models, these K Section A conceptual approach to complex network dy- inputs, and a K -input Boolean function, are chosen at namics we will present a brief introduction to Boolean random for each gene (for a justification of this choice, Networks, as a useful conceptual framework to think about see Kauffman, 1993). The Boolean variables (ON/OFF the dynamic behavior of large, interacting regulatory net- states of the genes) at time t + 1 are determined by the works. Section Inference of regulation through cluster- state of the network at time t through the K inputs as ing of gene expression data describes methods and ap- well as the logical function assigned to each gene. An plications for the clustering of genes based on their ex- excellent tool for calculating and visualizing the dynamics pression patterns. In Section Modeling methodologies, of these networks is the DDLAB software (Wuensche, we give an overview of the different modeling method- 1993, 1998). ologies that may be used to model regulatory networks. Because the total number of expression patterns is finite, Finally, Section Gene network inference: reverse engi- the system will eventually return to an expression pattern neering shows how genetic regulatory networks may be it has visited earlier. Since the system is deterministic, it inferred directly from large-scale gene expression data, in- will from then on keep following the exact same cycle cluding estimates of how much data is needed to do so. of expression patterns. This periodic state cycle is called the attractor of the network. For example, in Figure 1 we A conceptual approach to complex network show the repeating 6-state cycle attractor pattern within dynamics the trajectory of a 12-element network (right panel). This In higher metazoa, each gene or protein is estimated on trajectory is only one of many alternative trajectories also average to interact with four to eight other genes (Arnone leading to the same attractor, as shown in the ‘basin of and Davidson, 1997), and to be involved in 10 biological attraction’ graph (lower panel). If a state within the basin functions (Miklos and Rubin, 1996). The global gene of attraction is perturbed to any other state within it, the expression pattern is, therefore, the result of the collective dynamics inexorably converge to the same attractor. This behavior of individual regulatory pathways. In such highly feature helps explain why cell types or cellular adaptations interconnected cellular signaling networks, gene function to particular environments are stable with respect to small depends on its cellular context; thus understanding the perturbations from a variety of internal or external noise network as a whole is essential. However, dynamic sources. systems with large numbers of variables present a difficult The long-term dynamics are determined by the attrac- mathematical problem. tors. In the non-chaotic case, which is the only case of 708 Genetic network inference Fig. 1. Dynamics of the Boolean model of a genetic network illustrated using the DDLAB software (Wuensche, 1993, 1999). Wiring, trajectory and basin of attraction. Top panel, wiring diagram of binary gene interactions. The network consists of hypothetical regulatory gene modules (1,3,6) and dependent modules of structural genes (S, U, T—share identical wiring and rules). Right inset, trajectory. As determined by the wiring and rules, the network passes through a series of states from a given starting state, finally arriving at a repeating attractor pattern, a six-state cycle in this case (dark grey = ON, light grey = OFF; time points numbered at left; modules S, U and T have been collapsed into a single element for simplicity). Bottom panel, basin of attraction. The states of the trajectory shown in the inset are displayed as a series of connected points (labeled by time points) arriving in a cyclic graph resembling the attractor. The additional nodes in the graph resemble other states which also lead to the same attractor, therefore the term ‘basin of attraction’—all states in this graph merge into a single attractor (Somogyi and Sniegoski, 1996). practical interest, the number of states in the attractor is Bhattacharjya and Liang, 1996), rather than exponentially typically only a very small fraction of the total number of with N . The notion that gene expression patterns are states, growing as a square root of N (Kauffman, 1993; constrained is in general agreement with the experimental 709 P.D’haeseleer et al. findings of large-scale gene expression data. For example, been a significant aggregate change over all conditions, genes from the same sequence or functional family do (c) or whether the fluctuation pattern shows high diversity not act independently, but tend to fluctuate in parallel, according to Shannon entropy (Fuhrman et al., 2000). reducing effective N (Wen et al., 1998). In terms of A comparison of these criteria in the analysis for toxic attractor cycles, it is rare for a gene to oscillate more response genes has shown that the Shannon entropy allows than once during the cell cycle in the yeast (Chu et al., the clearest distinction of drug-specific expression patterns 1998). Also, the change in gene expression pattern during (Cunningham et al., 2000). diauxic shift—when yeast metabolism switches from Beyond straightforward scoring methods, we would like glucose to ethanol (DeRisi et al., 1997)—has been found to classify gene expression patterns to explore shared to be very similar to the change in expression pattern functions and regulation. This can be accomplished using during adaptive evolution to growth in a low glucose clustering methods. The simplest approach to clustering, medium (Ferea et al., 1999). sometimes referred to as GBA (Guilt By Association), is The number of distinct attractors of a Random Boolean to select a gene and determine its nearest neighbors in Network also tends to grow as a square root of the number expression space within a certain user-defined distance of genes (Kauffman, 1993). If we equate the attractors cut-off (Walker et al., 1999). Genes sharing the same of the network with individual cell types, as Kauffman expression pattern are likely to be involved in the same suggests, this explains why a large genome of a few billion regulatory process. Clustering allows us to extract groups base pairs is capable of a few hundred stable cell types. of genes that are tightly co-expressed over a range of This convergent behavior implies immense complexity different experiments. If the promoter regions of the genes reduction, convergence and stabilization in networks of are known—as is the case for yeast—it is possible to constrained architecture. identify the cis-regulatory control elements shared by Boolean networks provide a useful conceptual tool for the co-expressed genes. Several algorithms to extract investigating the principles of network organization and common regulatory motifs from gene clusters have been dynamics (Wuensche, 1999). We can study the role of developed (Brazma et al., 1998; Roth et al., 1998; various constraints on global behavior in terms of network van Helden et al., 1998; Wolfsberg et al., 1999). For complexity, stability and evolvability. From experimental example, Tavazoie et al. (1999) identified 18 biologically studies we are learning about constraints in the number significant DNA-motifs in the promoter region of gene of inputs and outputs per gene, input and output sharing clusters based on cell-cycle expression patterns. Most among genes evolved within a gene family or pathway, motifs were highly selective for the cluster in which and restrictions on rule types (thresholding, no ‘exclusive they were found, and seven were known regulatory or’ rules etc.). Investigations into abstract models will motifs for the genes in their respective clusters. In an help us understand the cybernetic significance of network example from brain development (Figure 2), correlations features, and provide meaningful questions for targeted between cis elements and expression profiles can be experimental exploration. established, but are sensitive to the clustering method used. Inference of regulation through clustering of gene We will briefly review important issues involved in expression data clustering and some of the main clustering methods used, as well as a few classical clustering methods which Introduction have not yet been adopted in gene expression analysis. Large-scale gene screening technologies such as mRNA We should caution the reader that different clustering hybridization micro-arrays have dramatically increased methods can have very different results (see for example our ability to explore the living organism at the genomic Figure 2), and—at this point—it is not yet clear which level (Zweiger, 1999). Large amounts of data can be clustering methods are most useful for gene expression routinely generated. In order to identify genes of interest, analysis. Claverie (1999) provides a preliminary review we need software tools capable of selecting and screening of gene expression analysis techniques with a focus on candidate genes for further investigation (Somogyi, 1999). coexpression clustering. Niehrs and Pollet (1999) provide At the simplest level, we can determine which genes an overview of very tightly coexpressed groups of genes show significant expression changes compared with a (which they call ‘synexpression groups’) that have been control group in a pair-wise comparison. As data sets identified based on large-scale gene expression data. become more complex, covering a variety of biological For further reading, some useful textbooks on cluster- conditions or time series, one may consider several scoring methods for selecting the most interesting genes; e.g. ing include Massart and Kaufman (1983), Aldenderfer according to (a) whether there has been a significant and Blashfield (1984) and Kaufman and Rousseeuw change at any one condition, (b) whether there has (1990). 710 Genetic network inference (B) Hippocampal Development Hippocampal Development (A) Min/Max Normalization - Agglomerative Clustering Max Normalization - K-means Clustering Cis- Cis- Cluster Cluster bifurcation Gene Cluster identifier Gene element element no. cyclinA # # ### ### 000 00000000000000000 MOG ###### ### cyclinB # # ## ## 000 00000000000001000 5HT3 # # # # ## # ## G67I8086 # ## ## ### 000 00000000000001010 aFGF ## # # # # # ## G67I86 # ## ## ### 000 00000000000001011 GRb3 # ## # # # # ## Brm # # ### #### 000 00000000000001100 IP3R3 # # ## # # # ## nestin ### ### 000 00000000000001101 PDGFb # # # # # # # ## IGFR2 # ## ### 000 00000000000002000 PDGFR # # # ## ## ## IGFR1 # # # ## # ### 000 00000000000010000 mAChR1 # # # # # # # ## H4 ## # 000 00000100000000000 nAChRa3 # # # ### ## # ODC #### ### # # 000 00000100000000010 trkC # # # # # # # # # IGF2 # # # # 000 00000100000000100 BDNF ## ###### # GAD65 # ## # # # # # # 000 00000100000001000 Brm # # # ## #### GRg3 # # ## # # ## # 000 00000100000100000 cyclinB # # ## ## ## GRa3 # ## # # # 000 00000100000100010 G67I8086 # # ### # ### actin ## ## ## 000 00000100010000000 G67I86 # # # # ### ## cellubrevin ## # # ## 000 00000100010000100 IGFR1 # # # ### ## # GRg2 # ## ## 000 00000100010000110 nestin # ### ### IP3R3 ## ## # # # ## 000 00000100100000000 TH # # ####### 5HT3 # # # # # # ### 000 00000200000000000 ACHE ### ##### # Krox-24 TCP # # # ## # # ## 000 00001000000000000 actin # ###### ## EGF ## # ## # ## # 000 00001000000000010 CCO1 #### # #### GRa5 ## ## ## # # # 001 00000000000000000 CCO2 # # ## ### # # GRa1 ## # # 0 01 00000000000000010 cellubrevin #### # ## # # IP3R2 ## ## ## 001 00000000001000000 cjun ##### # # ## S100beta ## ## # ## ## 001 00000000001000001 CNTF # # # #### # # NFL ## # ## Krox-24 001 00000000001000100 CNTFR ######### CX43 # # ####### nAChRa5 ##### 001 00000000001000101 6 cyclinA # # mGluR3 # # # ## 001 00000000001100000 6 GRa4 # # ## 001 00000000001100001 EGF # # # # # # # ## 6 NT3 # # ## # 001 00000000001100100 FABP ## ##### ## 6 GRb1 ## 0 0 1 00000000001100101 GAD65 ####### ## 6 GAD67 ## # ## # # # # Krox-24 001 00000000001101000 GAP43 ## # ###### 6 GRa2 ## 001 00000000001101010 GMFb ##### # # # # 6 GRg1 # # 001 00000000001101011 GRa3 # ## #### # # 6 5HT1b ## ## # # # # # 001 00000000001102000 GRg2 # ###### ## 6 NFM # # # # # ## # # 001 00000000001110000 GRg3 ## ###### # 6 bFGF ## # Krox-24 001 00000000002000000 H4 ####### ## 6 GFAP ## # # # # 001 00000000003000000 IGF1 # # ##### ## 6 mAChR2 ## ## # # # # # 001 00000000010000000 IGF2 ####### # # 6 GRb2 # # # ## # # # # 001 00000000020000000 IGFR2 ###### ### 6 nAChRa7 # # ## # # ## # Krox-24 001 00000000020000010 InsR ###### ### 6 preGAD67 ## ## # # # # # 001 00000000020000100 MK2 # # ####### 6 NFH # # # # # # 001 00000000100000000 neno # # ## # ## # # Krox-24 synaptophysin ## # # # ## ## 001 00000000100000010 ODC ##### ### # FABP ## # ## # # ## 001 00000010000000000 synaptophysin ## ## # ## ## PDGFb ## # # # # # ## 001 00000010000000010 TCP ##### ## # # PDGFR # ## ## ## ## 001 00000010000001000 GFAP ## # # ### # # GRb3 # # # # # # # ## 001 00000010000010000 mGluR8 # # # ### ## # mAChR1 # # # # # # # ## 001 00000020000000000 NFH # # # ## ## ## ACHE ## ## # # Krox-24 001 00000100000000000 NFL ## # ## #### Krox-24 mGluR8 # # # # # # ## # 001 00000200000000000 S100beta ## ### ## ## 5HT2 # # # # ## # ## 001 00000300000000000 5HT1b ## #### ### CX43 # # ## # ## # # 001 00001000000000000 5HT2 # # #### # ## neno # # # # # ## # # Krox-24 001 00001000000001000 bFGF ## ####### Krox-24 IGF1 # # # # # # # ## 001 00001000000001010 GAD67 ## ##### ## Krox-24 aFGF ## # # # # # ## 002 00000000000000000 GRa1 ## # # ##### GAP43 ## # ###### 003 00000000000000000 GRa2 ## ####### CCO1 # # ## ## 010 00000000000000000 GRa4 # # ##### ## MOG # # # 020 00000000000000000 GRa5 # # ## ## # ## cjun # # # # 020 00000000000010000 GRb1 ## ####### BDNF ## # # ### # # 020 00000000000010010 GRb2 # # #### # ## MK2 # # # # ## # ## 100 00000000000000000 GRg1 # # ##### # # CCO2 # # ## ## # # # 100 00000000000000100 IP3R2 # # ####### TH # # ## ## ### 100 00000000000010000 mAChR2 ## ## # # ## # CNTF # # # ## 100 00000000001000000 mGluR3 # # ##### ## CNTFR # # ## # # ## # 100 00000010000000000 nAChRa5 ## # #### ## GMFb # # # # 100 00000010000001000 nAChRa7 # # #### ### Krox-24 nAChRa3 # #### # 200 00000000000000000 NFM # # ##### # # InsR # # # # # ### 200 00000000000000100 NT3 # # ####### trkC # # # # # # # # # 200 00000000001000000 preGAD67 ## ##### ## Fig. 2. Comparison of clustering methods on hippocampal development gene expression data. Genes were grouped into eight clusters with both methods. Note the relative positions of the genes sharing a Krox-24 transcriptional regulatory element among the clusters. (A) Gene expression patterns were normalized to their respective minima and maxima, and clustered using an agglomerative algorithm. The cluster bifurcation pattern is shown on the left; cluster boundaries were drawn at the depth shaded in grey (left), based on the 20-level cluster identifier that captures the branching pattern for each gene within the dendrogram (right). (B) Gene expression patterns were normalized to their respective maxima, and clustered using a numerical K -means algorithm (Fuhrman et al., 2000). E15 E18 P0 P3 P7 P10 P13 P25 E15 E18 P0 P3 P7 P10 P13 P25 A P.D’haeseleer et al. Figure 2 for an example of a single data set clustered using Distance measures and preprocessing different normalizations and clustering methods. When Most clustering algorithms take a matrix of pairwise using relative expression levels (for example, microarray distances between genes as input. The choice of distance data), the data will tend to be log-normally distributed, measure—used to quantify the difference in expression so the logarithm of the relative expression values should profiles between two genes—may be as important as be used. Califano et al. (2000) suggest using a nonlinear the choice of clustering algorithm. Distance measures transformation into a uniform distribution for each gene can be divided into at least three classes, emphasizing instead, which will tend to spread out the clusters more different regularities present within the data: (a) similarity effectively. according to positive correlations, which may identify similar or identical regulation; (b) similarity according to Clustering algorithms positive and negative correlations, which may also help All clustering algorithms assume the pre-existence of identify control processes that antagonistically regulate groupings of the objects to be clustered. Random noise and downstream pathways; (c) similarity according to mutual other uncertainties have obscured these groupings. The information, which may detect even more complex objectives of the clustering algorithm are to recover the relationships. original grouping among the data. So far, most clustering studies in the gene expression Clustering algorithms can be divided into hierarchical literature use either Euclidean distance or Pearson cor- and non-hierarchical methods. Non-hierarchical methods relation between expression profiles as a distance mea- typically cluster N objects into K groups in an iterative sure. Other measures used include Euclidean distance be- process until certain goodness criteria are optimized. tween expression profiles and slopes (for time series Wen Examples of non-hierarchical methods include K -means, et al., 1998), squared Pearson correlation (D’haeseleer et expectation-maximization (EM) and Autoclass. Hierar- al., 1997), Euclidean distance between pairwise correla- chical methods return an hierarchy of nested clusters, tions to all other genes (Ewing et al., 1999), Spearman where each cluster typically consists of the union of two rank correlation (D’haeseleer et al., 1997), and mutual in- or more smaller clusters. The hierarchical methods can formation (D’haeseleer et al., 1997; Michaels et al., 1998; be further distinguished into agglomerative and divisive Butte and Kohane, 2000). methods, depending on whether they start with single- Conspicuously absent so far are distance measures object clusters and recursively merge them into larger that can deal with the large numbers of highly related clusters, or start with the cluster containing all objects and measurements in the data sets. For example, clustering recursively divide it into smaller clusters. In this section, yeast genes based on all publicly available data will we review several clustering methods for gene expression be highly biased towards the large cell-cycle data sets: (see also Figure 2 for comparison of agglomerative and 73 data points in four time series, containing almost eight K -means clustering). complete cell cycles (Spellman et al., 1998), whereas The K -means algorithm (MacQueen, 1967) can be only a single data point may be present for various used to partition N genes into K clusters, where K is stress conditions, mutations, etc. Correlation between the pre-determined by the user (see Tavazoie et al., 1999) experiments will also lead to highly elliptical clusters, for an application to yeast gene expression). K initial which form a problem for clustering methods that are cluster ‘centroids’ are chosen—either by the user, to biased towards compact, round clusters (such as K - reflect representative expression patterns, or at random— means). A distance measure that can deal with the and each gene is assigned to the cluster with the nearest covariance between experiments in a principled way (e.g. centroid. Next, the centroid for each cluster is recalculated Mahalanobis distance Mahalanobis, 1936) may be more appropriate here. For even longer time series, distance as the average expression pattern of all genes belonging measures based on Fourier or wavelet transforms may be to the cluster, and genes are reassigned to the closest considered. centroid. Membership in the clusters and cluster centroids A related issue is normalization and other preprocessing are updated iteratively until no more changes occur, or of the data. Distance measures that are sensitive to scaling the amount of change falls below a pre-defined threshold. K -means clustering minimizes the sum of the squared and/or offsets (such as Euclidean distance) may require distance to the centroids, which tends to result in round normalization of the data. Normalization can be done with clusters. Different random initial seeds can be tried to respect to the maximum expression level for each gene, assess the robustness of the clustering results. with respect to both minimum and maximum expression levels or with respect to the mean and standard deviation The Self-Organized Map (SOM) method is closely re- of each expression profile. From a statistical point of lated to K -means and has been applied to mRNA expres- view, we recommend using the latter, unless there is a sion data of yeast-cell cycles as well as hematopoietic dif- good reason to preserve the mean expression values. See ferentiation of four well-studied model cell lines (Tamayo 712 Genetic network inference et al., 1999). The method is more structured than K -means the distance between their average expression pattern. in that the cluster centers are located on a grid. At each After N − 1 steps, all the genes are merged together iteration, a randomly selected gene expression pattern at- into an hierarchical tree. Other hierarchical methods may tracts the nearest cluster center, plus some of its neighbors calculate the distance between clusters differently. In in the grid. Over time, fewer cluster centers are updated (unweighted pair-group method using arithmetic aver- at each iteration, until finally only the nearest cluster is ages (UPGMA) Sneath and Sokal, 1973) for example, the drawn towards each gene, placing the cluster centers in the distance between two clusters is defined as the average center of gravity of the surrounding expression patterns. distance between genes in the two clusters. Drawbacks of this method are that the user has to specify Ben-Doret al. (1999) have developed a clustering algo- a priori the number of clusters (as for K -means), as well rithm based on random graph theory. Their method shares as the grid topology, including the dimensions of the grid features with both agglomerative hierarchical clustering (typically one, two or three dimensional) and the number and K -means. Clusters are constructed one at a time. The of clusters in each dimension (e.g. eight clusters could be gene with the largest ‘affinity’ (smallest average distance mapped to a 2×4 two-dimensional grid or a 2×2×2 three- to all other genes in the cluster) is added to the cluster, dimensional cube). The artificial grid structure makes it if the affinity is larger than a cut-off. A gene can also be very easy to visualize the results, but may have residual removed from the cluster if its affinity drops below the cut- effects on the final clustering. Optimization techniques for off. A finite number of clusters are constructed depending selecting the number of clusters developed for K -means on the cut-off. The ability to remove ill-fitting genes from can presumably be used here too. the cluster is an attractive feature of this algorithm. Zhu The Expectation-Maximization (EM) algorithm (Demp- and Zhang (2000) used a similar algorithm to cluster yeast ster et al., 1977) for fitting a mixture of Gaussians (also sporulation data. known as Fuzzy K -Means Bezdek, 1981) is very similar to Alon et al. (1999) used a divisive hierarchical algorithm K -means, and has been used by Mjolsness et al. (1999b) to cluster gene expression data of colon cancer. The to cluster yeast data. Rather than classifying each gene into method relies on the maximum entropy principle and one specific cluster, we can assign membership functions attempts to find the most likely partition of data into (typically Gaussians, or any other parametric probability clusters at a given ‘cost’ (sum of squared within-cluster distribution) to each cluster, allowing each gene to be part distances). Starting from a single cluster with large cost, of several clusters. As in K -means, we alternately update as the allowed cost is lowered, the cluster breaks up the membership for each expression pattern, and then the spontaneously into multiple clusters in order to maximize parameters associated with each cluster: centroid, covari- the entropy for the configuration, within the constraint of ance and mixture weight. Cluster boundaries are sharp and fixed total cost. linear in K -means, smooth and rounded in EM. Califano et al. (2000) have developed a clustering al- Autoclass is also related to EM, in that it finds a mixture gorithm to identify groups of genes which can be used of probability distributions. In addition, it uses Bayesian for phenotype classification of cell types, by searching for methods to derive the maximum posterior probability clas- clusters of microarray samples that are highly correlated sification, and the optimum number of clusters (Cheese- over a subset of genes. Only the most significant clusters man and Stutz, 1996). are returned. The same technique could be used to find Wen et al. (1998) used the FITCH hierarchical cluster- clusters of genes that are highly coexpressed over a sub- ing algorithm (Felsenstein, 1993) to group the expression set of their expression profiles. Han et al. (1997) used a patterns of 112 genes in spinal-cord development, pro- similar, partial matching approach to group objects into a ducing a graph similar to the phylogenetic trees famil- hypergraph based on correlations over subsets of the data. iar to most biologists Sokal and Michener (1958). The In a hypergraph, each hyperedge (corresponding to a sin- expression clusters captured the main waves of gene ex- gle cluster) connects several nodes (genes), so each node pression in development. While the algorithm used in this (gene) can be part of several hyperedges (clusters). Mjol- study minimizes the overall distance in the tree, the com- sness et al. (1999a) developed a hierarchical algorithm that putational requirement grows with the fourth power of the places objects into a directed, acyclic graph, where each number of elements, making it impractical for much larger cluster can be part of several parent clusters. The algo- data sets. rithm optimizes the number of clusters, cluster positions Eisen et al. (1998) applied a standard agglomerative and partial cluster memberships of objects, such as to pro- hierarchical clustering algorithm, average-linkage anal- vide the most compact graph structure. All three clustering ysis, to large-scale gene expression data. Starting with methods allow genes to be part of several clusters, possi- N clusters containing a single gene each, at each step bly coinciding with multiple regulatory motifs or multi- in the iteration the two closest clusters are merged into ple functional classifications for each gene. This makes a larger cluster. Distance between clusters is defined as them especially appropriate for eukaryotic gene expres- 713 P.D’haeseleer et al. sion where genes are controlled by complex inputs from multiple transcription factors and enhancers. Other applications of co-expression clusters Gene expression clustering is potentially useful in at least three areas: (i) extraction of regulatory motifs (co- regulation from co-expression); (ii) inference of functional annotation; (iii) as a molecular signature in distinguishing cell or tissue types. Genes in the same expression cluster will tend to share biological functions. In a system as complex as the developing rat spinal cord, expression clustering clearly led to a segregation according to functional gene families (Wen et al., 1998). Moreover, cluster–function relationships exist over several methods of classification; for example, neurotransmitter receptor ligand classes and sequence/pathway groups each selectively map to expression waves (Figure 3). Tavazoie et al. (1999), used K -means to partition yeast genes during the cell cycle into 30 distinct clusters, and found the members of each cluster to be significantly enriched for genes with similar functions. Functions of unknown genes may be hypothesized from genes with known function within the same cluster. Yeast genes with previously unknown functions have been identified from their temporal pattern Fig. 3. Gene expression clusters reflect gene families and pathways. of expression during spore morphogenesis and their Neurotransmitter receptors follow particular expression waveforms functional role in sporulation has been confirmed in according to ligand and functional class. Waves 1–4 correspond to major expression clusters found in rat spinal-cord development deletion experiments (Chu et al., 1998). (Wen et al., 1998). Note that the early expression waves 1 and 2 mRNA expression can be regarded as a molecular signa- are dominated by ACh and GABA receptors, and by receptor ion- ture of a cell’s phenotype. Clustering of gene expression channels in general. Each line represents a gene, and can be traced patterns helps differentiate different cell types, which is by following its reflection at each node (Agnew, 1998). useful, for example, in recognizing subclasses of cancers (Alon et al., 1999; Golub et al., 1999; Perou et al., 1999; Alizadeh et al., 2000). Two-way clustering of both the of any arbitrary shapes and sizes in a multidi- genes and experiments allows for easy visualization (Eisen mensional pattern space. Each clustering cri- et al., 1998; Alon et al., 1999; Alizadeh et al., 2000; Wein- terion imposes a certain structure on the data, stein et al., 1997). Because activities of genes are often re- and if the data happens to conform to the re- lated to each other, gene expression is highly constrained, quirements of a particular criterion, the true and gene expression patterns under different conditions clusters are recovered. can be very similar. Clustering is necessary for identify- It is impossible to objectively evaluate how ‘good’ a spe- ing the coherent patterns. cific clustering is without referring to what the clustering Which clustering method to use? will be used for. However, once an application has been identified, it may be possible to evaluate objectively the We have discussed several different distance measures quality of the clustering for that particular application. and clustering algorithms. Each combination of distance For example, if we want to extract regulatory motifs measure and clustering algorithm will tend to emphasize from clusters, we can compare clustering methods based different types of regularities in the data. Some may be on the P -values of the resulting motifs. Similarly, for useless for what we want to do. Others may give us functional classification, we can compare P -values asso- complementary pieces of information. Jain and Dubes ciated with enrichment of clusters in certain functional (1988) state: categories. It is unlikely that there would be a single There is no single best criterion for obtaining best-clustering method for all applications. Considering a partition because no precise and workable the overwhelming number of combinations of distance definition of ‘cluster’ exists. Clusters can be measures and clustering algorithms—far too many to 714 Genetic network inference try them all each time—the field is in dire need of a (Arkin et al., 1998) included a total of 67 parameters, and comparison study of the main combinations for some of required supercomputers for its stochastic simulation. the standard applications, such as functional classification In-depth biochemical modeling is very important in un- or extraction of regulatory motifs. If we want to use gene derstanding the precise interactions in common regulatory clusters to infer regulatory interactions, synthetic data mechanisms, but clearly we cannot expect to construct generated from small but detailed models of regulatory such a detailed molecular model of, say, an entire yeast networks could provide a useful touchstone for comparing cell with some 6000 genes by analysing each gene clustering methods. Preliminary results comparing SOM, and determining all the binding and reaction constants K -means, FITCH and Autoclass—all using Euclidean one-by-one. Likewise, from the perspective of drug target distance—showed very poor performance of all clus- identification for human disease, we cannot realistically tering methods in identifying a metabolic pathway with hope to characterize all the relevant molecular interactions associated regulation of the enzymes by the metabolites one-by-one as a requirement for building a predictive (Mendes, 1999). disease model. There is a need for methods that can The greatest challenge in cluster analysis lies in faith- handle large-scale data in a global fashion, and that can fully capturing complex relationships in biological net- analyse these large systems at some intermediate level, works. As stated above, a gene may participate in mul- without going all the way down to the exact biochemical tiple functions, and is controlled by the activities of many reactions. other genes through a variety of cis-regulatory elements. Boolean or continuous Therefore, for complex datasets spanning a variety of bio- logical responses, a gene should by definition be a mem- The Boolean approximation assumes highly cooperative ber of several clusters, each reflecting a particular aspect binding (very ‘sharp’ activation response curves) and/or of its function and control. As more data becomes avail- positive feedback loops to make the variables saturate able to accurately delineate expression behavior under dif- in ON or OFF positions. However, examining real gene ferent conditions, we should consider using some of the expression data, it seems clear that genes spend a lot of clustering methods that partition genes into non-exclusive their time at intermediate values: gene expression levels clusters. Alternatively, several clustering methods could tend to be continuous rather than binary. Furthermore, be used simultaneously, allocating each gene to several important concepts in control theory that seem indis- clusters based on the different regularities emphasized by pensable for gene regulation systems either cannot be each method. implemented with Boolean variables, or lead to a radically different dynamical behavior: amplification, subtraction Modeling methodologies and addition of signals; smoothly varying an internal parameter to compensate for a continuously varying Cluster analysis can help elucidate the regulation (or co- environmental parameter; smoothly varying the period of regulation) of individual genes, but eventually we will a periodic phenomenon like the cell cycle, etc. Feedback have to consider the integrated behavior of networks of control (see e.g. Franklin et al., 1994) is one of the most regulatory interactions. Various types of gene regulation important tools used in control theory to regulate system network models have been proposed, and the model of variables to a desired level, and reduce sensitivity to both choice is often determined by the question one is trying external disturbances and variation of system parameters. to answer. In this section we will briefly address some of Negative feedback with a moderate feedback gain has a the decisions that need to be made when constructing a stabilizing effect on the output of the system. However, network model, and the trade-offs associated with each. negative feedback in Boolean circuits will always cause Level of biochemical detail oscillations, rather than increased stability, because the Gene-regulation models can vary from the very abstract— Boolean transfer function effectively has an infinite slope such as Kauffman’s random Boolean networks (Kauff- (saturating at 0 and 1). Moreover, Savageau (1998) iden- man, 1993)—to the very concrete—such as the full tified several rules for gene circuitry (bacterial operons) biochemical interaction models with stochastic kinetics that can only be captured by continuous analysis methods. In this study, positive and negative modes of regulation in Arkin et al. (1998). The former approach is the most were respectively linked to high and low demand for mathematically tractable, and its simplicity allows ex- expression, and a relationship was established between amination of very large systems (thousands of genes). the coupling of regulator and effector genes and circuit The latter fits the biochemical reality better and may carry more weight with the experimental biologists, capacity and demand. but its complexity necessarily restricts it to very small Some of these problems can be alleviated by hybrid systems. For example, the detailed biochemical model Boolean systems. In particular, Glass (1975, 1978) has of the five-gene lysis–lysogeny switch in Lambda phage proposed sets of piecewise linear differential equations, 715 P.D’haeseleer et al. where each gene has a continuous-valued internal state, proach to unpredictability is to construct models that can and a Boolean external state. Researchers at the Free manipulate entire probability distributions rather than just University of Brussels (Thomas, 1991; Thieffry and single values. Stochastic differential equations could be Thomas, 1998) have proposed an asynchronously updated used for example. Of course, this does add a whole new logic with intermediate threshold values. These systems level of complexity to the models. Alternatively, a deter- allow for easy analysis of certain properties of networks, ministic model can sometimes be extended by a simplified and have been used for qualitative models of small gene analysis of the variance on the expected behavior. networks, but still do not seem appropriate for quantitative Spatial or non-spatial modeling of real, large-scale gene expression data. Spatiality can play an important role, both at the level of Deterministic or stochastic intercellular interactions, and at the level of cell compart- One implicit assumption in continuous-valued models is ments (e.g. nucleus versus cytoplasm versus membrane). that fluctuations in the range of single molecules can Most processes in multicellular organisms, especially be ignored. Differential equations are already widely during development, involve interactions between dif- used to model biochemical systems, and a continuous ferent cells types, or even between cells of the same approach may be sufficient for a large variety of interesting type. Some useful information may be extracted using mechanisms. However, molecules present at only a few a nonspatial model (see for example D’haeseleer et al. copies per cell do play an important role in some (1999); D’haeseleer and Fuhrman (2000) for a non-spatial biological phenomena, such as the lysis–lysogeny switch model of CNS development and injury), but eventually a in Lambda phage (Ptashne, 1992). In that case, it may be spatial model will be needed. impossible to model the behavior of the system exactly Spatiality adds a whole extra dimension of complex- with a purely deterministic model. ity to the models: spatial development, cell-type interac- tions, reservoirs, diffusion constants, etc. In some cases, These stochastic effects—which have mainly been observed in prokaryotes—may not play as much of a the abundance of data—spatial patterns—can more than role in the larger eukaryotic cells. In yeast, most mRNA make up for the extra complexity of the model. For ex- species seem to occur at close to one mRNA copy per cell ample, Mjolsness et al. (1991) used a time series of one- (Velculescu et al., 1997; Holstege et al., 1998a), down to dimensional spatial patterns to fit a simple model of eve 0.1 mRNA/cell or less (i.e. the mRNA is only present 10% stripe formation in Drosophila. Models like the ones pro- of the time or less in any one cell). Low copy numbers like posed by Marnellos and Mjolsness (1998) for the role of these could be due to leaky transcription and not have any lateral interactions in early Drosophila neurogenesis pro- regulatory role. Also, if the half-life of the corresponding vide experimentally testable predictions about potentially protein (typically measured in hours or days) is much important interactions. larger than the half-life of the mRNA (averaging around Data availability 20 min in yeast (Holstege et al., 1998b)), the protein level may not be affected by stochastic fluctuations in In general, we must also realize that molecular activity mRNA. Analysis of mRNA and protein decay rates and measurements are limited and are carried out over popula- abundances may allow us to identify those few genes for tion averages of cells, not on individual cell circuits. Mod- which stochastic modeling may prove necessary. eling methodologies must, therefore, be designed around Particle-based models can keep track of individual the level of organization for which data is actually accessi- molecule counts, and often include much biochemical ble. An exhaustive model must take into account RNA and detail and/or spatial structure. Of course, keeping track protein concentrations, phosphorylation states, molecular of all this detail is computationally expensive, so they are localization and so forth, since each molecular variable typically only used for small systems. A related modeling carries unique information. However, due to present lim- technique is Stochastic Petri Nets (SPNs), which can be itations in measuring technology, these data are not rou- considered a subset of Markov processes, and can be tinely accessible. Modeling is then challenged with pro- used to model molecular interactions (Goss and Peccoud, viding as much predictive power as possible given limited 1998). Whereas fitting the parameters of a general particle data on molecular states. The constraints and redundan- model to real data can be quite difficult, optimization cies in biological networks suggest that much may still be algorithms exist for SPNs. Hybrid Petri Nets (Mounts gained even though not all parameters involved in the pro- and Liebman, 1997; Matsuno et al., 2000) include both cess may be modeled. discrete and continuous variables, allowing them to model Forward and inverse modeling both small-copy number and mass action interactions. Additional sources of unpredictability can include ex- Some of the more detailed modeling methodologies listed ternal noise, or errors on measured data. The Bayesian ap- above have been used to construct computer models of 716 Genetic network inference small, well-described regulatory networks. Of course, covering problem is solved using the branch and bound this requires an extensive knowledge of the system in technique. They devise a perturbation strategy that may be question, often resulting from decades of research. In used by laboratory scientists for systematic experimental this review, we will not focus on this forward modeling design. It should be pointed out that the problem of design- approach, but rather on the inverse modeling,or reverse ing a Boolean circuit that corresponds to certain input– engineering problem: given an amount of data, what output mappings is well studied in electrical engineering, can we deduce about the unknown underlying regulatory and several efficient algorithms exist (e.g. Espresso Bray- network? Reverse engineering typically requires the use ton et al., 1984) that could provide inspiration for reverse of a parametric model, the parameters of which are then engineering of Boolean regulatory networks. fit to the real-world data. If the connection structure of Data requirements the regulatory network (i.e. which genes have a regulatory effect on each other) is unknown, the parametric model To correctly infer the regulation of a single gene, we need will necessarily have to be very general and simplistic. to observe the expression of that gene under many dif- The results of this sort of model only relate to the ferent combinations of expression levels of its regulatory overall network structure. While this will imply little inputs. This implies sampling a wide variety of different about the actual molecular mechanisms involved, much environmental conditions and perturbations. Gene expres- helpful information will be gained on genes critical for a sion time series yield a lot of data, but all the data points biological process, sufficient for the identification of drug tend to be about a single dynamical process in the cell, and targets for example. Once the network structure is well will be related to the surrounding time points. A 10-point known, a more detailed model may be used to estimate time series generally contains less information than a data individual mechanism-related parameters, such as binding set of 10 expression measurements under dissimilar envi- and decay constants. ronmental conditions, or with mutations in different path- ways. The advantage of the time series is that it can pro- Gene network inference: reverse engineering vide insight into the dynamics of the process. On the other hand, data sets consisting of individual measurements pro- Clustering is a relatively easy way to extract useful vide an efficient way to map the attractors of the network. information out of large-scale gene expression data sets, Both types of data, and multiple data sets of each, will be however, it typically only tells us which genes are needed to unravel the regulatory interactions of the genes. co-regulated, not what is regulating what. In network Successful modeling efforts will probably have to use inference, the goal is to construct a coarse-scale model of data from different sources, and will have to be able to deal the network of regulatory interactions between the genes. with different data types such as time series and steady- This requires inference of the causal relationships among state data, different error levels, incomplete data, etc. genes, i.e. reverse engineering the network architecture Whereas clustering methods can use data from different from its activity profiles. As the molecular dynamics data strains, in different growth media etc., combining data sets we acquire becomes more copious and complex, we may for reverse engineering of regulatory networks requires need to routinely consult reverse engineering methods to that differences between the experimental conditions be provide the guiding hypotheses for experimental design. quantified much more precisely. Likewise, data will have One may wonder whether it is at all possible to reverse to be calibrated properly to allow comparison with other engineer a network from its activity profiles. A straight- data sets. In this respect, there is a growing need for forward answer to this question should be obtainable from a reliable reference in relative expression measurements. model networks, e.g. Boolean networks, for which we un- An obvious approach could be to agree on a standard derstand the network architectures and can easily gener- ate activity profiles. In a first attempt, a simple method strain and carefully controlled growth conditions to use was introduced that showed that reverse engineering is in all data collection efforts on the same organism. possible in principle (Somogyi et al., 1997). A more sys- Alternatively, a reference mRNA population could be tematic and general algorithm was developed by Liang et derived directly from the DNA itself. al. (1998), using Mutual Information to identify a mini- Estimates for various network models mal set of inputs that uniquely define the output for each The ambitious goal of network reverse engineering comes gene at the next time step. Akutsu et al. (1999) proposed at the price of requiring more data points. The space of a simplified reverse engineering algorithm and rigorously models to be searched increases exponentially with the examined the input data requirements. For more realis- tic applications, a further modification was introduced by number of parameters of the model, and therefore with Akutsu et al. (2000) that allows the inference of Boolean the number of variables. Narrowing the range of models networks given noisy input data. Ideker et al. (2000) de- by adding extra constraints can simplify the search for the veloped an alternative algorithm in which the minimal set- best model considerably. Including such information into 717 P.D’haeseleer et al. Table 1. Fully connected: each gene can receive regulatory inputs from all 1.0000 other genes 0.1000 Model Data needed Boolean, fully connected 2 0.0100 Boolean, connectivity K 2 (K + log( N )) Boolean, connectivity K , linearly separable K log( N /K ) Continuous, fully connected, additive N + 1 k=1 Continuous, connectivity K , additive K log( N /K)(∗) 0.0010 k=2 Pairwise correlation comparisons (clustering) log( N ) k=3 Connectivity K : at most K regulatory inputs per gene. Additive, linearly 0.0001 separable: regulation can be modeled using a weighted sum. Pairwise 0 2040 6080 100 correlation: significance level for pairwise comparisons based on State transitions correlation must decrease inversely proportional to number of variables. (∗): conjecture. Fig. 4. Dependence of Boolean network reverse engineering algo- rithm on depth of training data. The probability of finding an incor- the inference process is the true art of modeling. rect solution is graphed versus the number of state transition pairs How many data points are needed to infer a gene used as input for the algorithm for a N = 50 element network. More network of N genes depends on the complexity of training data is required for networks of k = 3 inputs per gene than the model used to do the inference. Constraining the for networks of lower connectivity to minimize incorrect solutions. connectivity of the network (number of regulatory inputs However, only a small fraction e.g. 80 state transition pairs from per gene) and the nature of the regulatory interactions can a total of 250 = 1.13 × 1015 is required to obtain reliable results dramatically reduce the amount of data needed. Table 1 (Liang et al., 1998). provides an overview of some of the models considered, and estimates of the amount of data needed for each. These estimates hold for independently chosen data points, with the equivalent Boolean model, we speculate it to and only indicate asymptotic growth rates, ignoring any be of the form (K log( N /K )). A promising avenue of constant factors. further research in this area may be the results on sample To completely specify a fully connected Boolean net- complexity for recurrent neural networks, which have a work model, where the output of each gene is modeled very similar structure to the models presented here (Koiran as a Boolean function of the expression levels of all N and Sontag, 1998; Sontag, 1997). genes, we need to measure all possible 2 input–output Finally, to allow for comparison with gene clustering pairs. This is clearly inconceivable for realistic numbers methods, we examined data requirements for clustering of genes. If we reduce the connectivity of the Boolean based on pairwise correlation comparisons (see Ap- network to an average of K regulatory inputs per gene, pendix Data requirements for pairwise correlation the data requirements decrease significantly. For a slightly comparisons). As the number of genes being compared simpler model, we can derive a lower bound of (2 (K + increases, the number of data points will have to in- log( N ))) (see Appendix Data requirements for sparse crease proportional to log( N ), in order to maintain a Boolean networks), which agrees well with preliminary constant, low level of false positives. Claverie (1999) experimental results by Liang et al. (1998) and Akutsu et arrived at a similar logarithmic scaling for binary data al. (1999) (see Figure 4). Further constraining the Boolean (absent/detected). model to use only linearly separable Boolean functions Note that, for reasonably constrained models, the num- (those that can be implemented using a weighted sum of ber of data points needed will scale with log( N ) rather the inputs, followed by a threshold function) reduces the than N , and that the data requirements for network in- data requirements to (K log( N /K )) (Hertz, 1998). ference are at least a factor K larger than for clustering. For models with continuous expression levels, the In practice, the amount of data may need to be orders of data requirements are less clear. In the case of linear magnitude higher because of non-independence and large (D’haeseleer et al., 1999) or quasi-linear (Weaver et al., measurement errors (see also Szallasi, 1999). Higher ac- 1999) additive models, fitting the model is equivalent curacy methods such as RT-PCR yield more bits of infor- to performing a multiple regression, so at least N + 1 mation per data point than cDNA microarrays or oligonu- data points are needed for a fully connected model of N cleotide chips, so fewer data points may be required to genes. Data requirements for sparse additive regulation achieve the same accuracy in the model. Modeling real models are as yet unknown, but based on the similarity data with Boolean networks discards a lot of information P (incorrect solution) Genetic network inference in the data sets, because the expression levels need to be connectionist model (Mjolsness et al., 1991), linear model discretized to one bit per measurement. Continuous mod- (D’haeseleer et al., 1999), linear transcription model els will tend to use the available information in the data set (Chen et al., 1999), weight matrix model (Weaver et better. al., 1999). The core of these seems to be the additive interaction between the regulatory inputs to each gene, Correlation metric construction so we suggest calling these models collectively additive Adam Arkin and John Ross have been working on a regulation models. method called Correlation Metric Construction (Arkin and In the simplest case, we can think of these models as Ross, 1995; Arkin et al., 1997), to reconstruct reaction being similar to multiple regression: networks from measured time series of the component chemical species. This approach is based in part on x ≈ w x + b i ji j i electronic circuit theory, general systems theory and i multivariate statistics. Although aimed more towards cell or signaling or metabolic networks, the same methodology x (t + t ) = w x (t ) + b i ji j i could be applied to regulatory networks. The system (a reactor vessel with chemicals im- where x is the expression level of gene i at time t , b plementing glycolysis) is driven using random (and i i is a bias term indicating whether gene i is expressed or independent) inputs for some of the chemical species, not in the absence of regulatory inputs, and weight w while the concentration of all the species is monitored ji indicates the influence of gene j on the regulation of over time. A distance matrix is constructed based on gene i . This can be written equivalently as a difference the maximum time-lagged correlation between any two or differential equation. Given an equidistant time series chemical species. This distance matrix is then fed into a simple clustering algorithm to generate a tree of con- of expression levels (or an equidistant interpolation of a nections between the species, and the results are mapped non-equidistant time series), we can use linear algebra into a two-dimensional graph for visualization. It is also to find the least-squares fit to the data. Weaver et al. possible to use the information regarding the time lag (1999) showed how a non-linear transfer function can be between species at which the highest correlation was incorporated into the model as well, and demonstrated found, which could be useful to infer causal relationships. that some randomly generated networks can be accurately More sophisticated methods from general systems theory, reconstructed using this modeling technique. D’haeseleer based on mutual information, could be used to infer et al. (1999); D’haeseleer and Fuhrman (2000) showed dependency. that even a simple linear model can be used to infer biologically relevant regulatory relationships from real Systems of differential equations data sets (Figure 5). Simple systems of differential equations have already Chen et al. (1999) presented a number of linear differen- proven their worth in modeling simple gene regulation tial equation models which included both mRNA and pro- systems. For example, the seminal work of Mjolsness tein levels. They showed how such models can be solved et al. (1991) used a spatial ‘gene circuit’ approach to using linear algebra and Fourier transforms. Interestingly, model a small number of genes involved in pattern they found that mRNA concentrations alone were not suf- formation during the blastoderm stage of development ficient to solve their model, without at least the initial pro- in Drosophila. The change in expression levels at each tein levels. Conversely, the model can be solved given only a time series of protein concentrations. point in time depended on a weighted sum of inputs Models that are more complex may require more general from other genes, and diffusion from neighboring ‘cells’. methods for fitting the parameters to the expression Synchronized cell divisions along a longitudinal axis data. Mjolsness et al. (1991) used Simulated Annealing (under the control of a maternal clock) were alternated to fit their hybrid model—incorporating both reaction- with updating the gene expression levels. The model was able to successfully replicate the pattern of eve stripes in diffusion kinetics and cell divisions—to spatial data. Drosophila, as well as some mutant patterns on which the Mjolsness et al. (1999b) used a similar approach to fit model was not explicitly trained. a recurrent neural network with weight decay to clusters of yeast genes. Genetic algorithms (GAs) have been Additive regulation models used to model the interaction between four ‘waves’ of The differential equation systems described above model coordinately regulated genes (Wahde and Hertz, 1999) gene networks using an update rule based on a weighted previously identified in rat CNS development (Wen et sum of inputs. Several variants of such models have al., 1998). Similarly, Tominaga et al. (1999) used a GA been proposed, with each group coining a different name: to fit a power-law model (Savageau, 1995; Tominaga 719 P.D’haeseleer et al. (A) Nestin 0.5 0.5 GABARalpha4 aFGF 0.5 -10 0 10 20 30 40 50 60 GAD65 (B) GAD67 GRg2 GRa3 GRa2 GRa1 preGAD67 GRb3 GRa4 GRb1 GRb2 GRg1 G67I8086 GRg3 G67I86 nestin GRa5 Fig. 5. Continuous-valued reverse engineering of a CNS genetic network using a linear additive method. (A) Experimental gene expression data (circles; development and injury), and simulation using a linear model (lines). The model faithfully reproduces the time series of the training data sets. Dotted line: spinal cord, starting 11 days before birth. Solid line: hippocampus development, starting seven days before birth. Dashed line: hippocampus kainate injury, starting at postnatal day 25. (B) Hypothetical gene interaction diagram for the GABA signaling family inferred from developmental gene expression data (spinal cord and hippocampus data). While individual proposed interactions have not yet been experimentally verified, the high predicted connectivity within this gene family appears biologically plausible. The positive feedback interaction of the GAD species has been proposed independently in another study (Somogyi et al., 1995). Solid lines correspond to positive interactions, broken lines suggest inhibitory relationships (D’haeseleer et al., 1999). 720 Genetic network inference and Okamoto, 1998) of a small gene network. Networks for clustering according to co-expression profiles with larger numbers of genes will likely require stronger should select the appropriate experimental sets for optimization algorithms. Akutsu et al. (2000) proposes analysis, and provide flexible solutions with multi- using Linear Programming for both power-law models ple cluster memberships that more accurately reflect and qualitative hybrid Boolean-like networks. Efficient the biological reality. Well-designed cluster analysis gradient descent algorithms developed for continuous- promises to identify new pathway relationships time recurrent neural networks (Pearlmutter, 1995) may and gene functions that may be critical to cellular be useful for even larger networks (D’haeseleer et al., control in health and disease. 1999). Alternatively, the size of the problem can be 3. Reverse engineering. Since it is the ultimate goal drastically reduced by combining gene clustering with to identify the causative relationships between network inference, deriving a regulation model only for gene products that determine important phenotypic the cluster centers (Wahde and Hertz, 1999; Mjolsness et parameters, top priority should be given to develop al., 1999b). reverse engineering methods that provide significant predictions. Alternative computational approaches Conclusions and outlook should be applied to given data sets, and their pre- We are participating in the transition of biology into an dictions tested in targeted experiments to identify information-driven science. However, this transition can the most reliable methods. be meaningful only if we focus on generating models that allow us to systematically derive predictions about 4. Integrated modeling. While the current focus is on important biological processes in disease, development the analysis of large-scale gene expression data, and metabolic control. These will find important applica- there are other established sources of information tions in pharmaceutical development and bioengineering on gene function, ranging from sequence homology (Zweiger, 1999). We have reviewed conceptual founda- and cis-regulatory sequences, to disease association tions for understanding complex biological networks, and and a wide variety of functional knowledge from several practical methods for data analysis. There are still targeted experiments. Ideally, all of these categories major challenges ahead, which may be divided into five of information should be included in model build- areas: ing. A major challenge here lies in the reliability and compatibility of these data sets. 1. Measurement quantity, depth and quality. Any at- tempt at predictive data analysis and model building 5. Coupling of modeling with systematic experimental critically depends on the scope and quality of the in- design. Discovery of a novel gene function through put data. Ideally, we would like to gain access to the expression profiling and computational inference activities of all important molecular species in a bio- depends on the optimal coordination of experimen- logical process (ranging from mRNA to metabolites tal technology with data-analysis methods. While and second messengers), with adequate quantitative, data-analysis methods must be centered around anatomical and temporal resolution. However, even data that are realistically accessible, critical pre- though our analytical measurement technologies dictions from the models must guide experimental are undergoing transformations in precision and design. The hope is that progressive iteration of throughput, there will always be limitations to the predictions, experimental measurements and model amount of data and resolution we can acquire and updates will result in increased fidelity and depth of process. Computational data analysis must therefore computational models of biological networks. identify the most essential molecular parameters to guide experimental measurements, and critically Acknowledgements evaluate measurement precision and reproducibility with appropriate statistical measures. P.D. gratefully acknowledges the support of the National Science Foundation (grant IRI-9711199), the Office of 2. Clustering and functional categorization. One Naval Research (grant N00014-99-1-0417), and the Intel priority in this area is to compare the large variety Corporation. S.L. gratefully acknowledges support from of existing clustering methods (including different NASA Collaborative Agreement NCC 2-794. The authors normalizations and distance measures), and identify those that give the most biologically relevant results. would like to express their appreciation to Xiling Wen, Just as a gene can play multiple functional roles in Millicent Dugich and Stefanie Fuhrman for providing data various pathways and is subject to different regula- on hippocampal development and injury, and to Stefanie tory inputs, co-expression patterns vary according Fuhrman for critically reading and commenting on the to the cellular and experimental context. Methods manuscript. 721 P.D’haeseleer et al. algorithms: we say that two genes cluster together if their Appendix A: Data requirements for sparse Boolean correlation is significantly greater (with a significance networks level α) than a certain cut-off value ρ . We test whether To fully specify a Boolean network with limited connec- we can exclude the null hypothesis ρ< ρ based on tivity, we need to specify the connection pattern and the the measured correlation coefficient r over the available rule table for a function of K inputs at each. A lower bound data points. Because of the large number of comparisons of (2 + K log( N )) can be derived using information being made, we need to reduce the significance level for theory. For a slightly simpler model, where we assume the the correlation test, proportional to the number of tests pattern of connectivity is given, we can calculate how the each gene is involved in: α = α / N (this will keep number of independently chosen data points should scale the expected number of false positives for each gene with K and N . Since this is a simpler model, its data re- constant). In order to be able to use the same cut-off value quirements should be a lower bound to the requirements for the measured correlation r to decide whether two for the model with unknown connections. genes cluster together, the number of data points will have Every data point (i.e. every input–output pair, specifying to increase as the significance level for each test grows the state of the entire Boolean network at time t and t + 1), smaller. specifies exactly one of 2 entries in each rule table. Given If the real correlation coefficient ρ is close to 1.0, this particular combination of the K inputs to each gene at the distribution of the measured correlation coefficient time t , the output of the gene is given by its state at time r is very asymmetrical. The following z-transformation, t + 1. We will estimate the probability P that all N rule developed by Fisher et al. (1969), is approximately tables are fully specified by n data points, and calculate normally distributed with mean z(ρ) and variance 1/(n − how the number of data points n needs to scale with P , 3) (with n the number of data points): the number of genes N , and connectivity K.For P ≈ 1 (i.e. we have enough data to have a good chance at a fully 1 1 + r z(r ) = ln . specified model), the probability for a single rule table to 2 1 − r be fully specified by n data points is approximately: We can now devise a single-sided test on z(r ) to answer K −K n 1 − 2 (1 − 2 ) . the following question. If z(r)> z(r ), what is the significance level with which we can reject the hypothesis The probability that all N rule tables are fully specified by z(ρ) < z(ρ ) (and thus ρ< ρ )? At the tail of the normal c c n data points then becomes: distribution, the area under the normal curve to the right K −K n N of z(r ) can be approximated by: P ≈ (1 − 2 (1 − 2 ) ) . α (z−z(ρ )) Taking base-2 logarithms, and further simplifying using 2σ α = √ e dz log (1 − z) ≈−z log (e) for z 1, we find: σ 2π 2 2 z=z(r ) (z(r )−z(ρ )) α c C =− log( P) 2 1 2σ ≈ √ e K −K n 2π(z(r ) − z(ρ )) α c ≈− N log(1 − 2 (1 − 2 ) ) K −K n ≈ N 2 (1 − 2 ) log(e) with σ = 1/ n − 3, and taking natural logs: C =− log(C / log(e)) 2 1 ln(α) ≈ ln(α ) − ln( N ) −K ≈− log( N ) − K − n log(1 − 2 ) √ ≈− ln(n − 3) − ln( 2π(z(r ) − z(ρ ))) −K α c ≈− log( N ) − K + n2 log(e). −(n − 3)(z(r ) − z(ρ )) /2 α c If P ≈ 1, C will be a small, and C a large positive 1 2 n ≈ 3 + (ln( N ) + ln(1/α ) constant. We can now express n, the number of data points (z(r ) − z(ρ )) α c needed, in terms of N , K and C : − ln(n − 3)/2 − ln( 2π(z(r ) − z(ρ )))) α c n ≈ 2 (K + log( N ) + C )/ log(e) = O(log( N )). which is O(2 (K + log( N ))). In other words, if we want to use the same cut-off value r to decide whether ρ> ρ , we need to scale the number α c Appendix B: Data requirements for pairwise of data points logarithmically with the number of genes. correlation comparisons Strictly speaking, this analysis only holds for correlation Let us examine a very simple form of clustering as a tests, but we can expect similar effects to play a role in representative example of the wide variety of clustering other clustering algorithms. 722 Genetic network inference Cheeseman,P. and Stutz,J. (1996) Bayesian classification (auto- References class): theory and results. In Fayyad,U.M., Piatetsky-Shapiro,G., Agnew,B. (1998) NIH plans bioengineering initiative. Science, 280, Smyth,P. and Uthurusamy,R. (eds), Advances in Knowledge Dis- 1516–1518. covery and Data Mining AAAI Press/MIT Press, http://ic-www. Akutsu,T., Miyano,S. and Kuhara,S. (1999) Identification of genetic arc.nasa.gov/ic/projects/bayes-group/images/kdd-95.ps networks from a small number of gene expression patterns under Chen,T., He,H.L. and Church,G.M. (1999) Modeling gene ex- the Boolean network model. Pacific Symposium on Biocomput- pression with differential equations. Pacific Symposium on ing, 4,17–28. http://www.smi.stanford.edu/projects/helix/psb99/ Biocomputing, 4,29–40. http://www.smi.stanford.edu/projects/ Akutsu.pdf helix/psb99/Chen.pdf Akutsu,T., Miyano,S. and Kuhara,S. (2000) Algorithms for infer- Chu,S., DeRisi,J., Eisen,M., Mulholland,J., Botstein,D., Brown,P.O. ring qualitative models of biological networks. Pacific Sympo- and Herskowitz,I. (1998) The transcriptional program of sporu- sium on Biocomputing, 5, 290–301.http://www.smi.stanford.edu/ lation in budding yeast. Science, 282, 699–705. projects/helix/psb00/akutsu.pdf Claverie,J.-M. (1999) Computational methods for the identification Aldenderfer,M.S. and Blashfield,R.K. (1984) Cluster Analysis. Sage of differential and coordinated gene expression. Hum. Mol. Publications, Newbury Park, CA. Genet., 8, 1821–1832. Alizadeh,A.A. et al. (2000) Distinct types of diffuse large B- Cunningham,M.J., Liang,S., Fuhrman,S., Seilhamer,J.J. and Somo- cell lymphoma identified by gene expression profiling. Nature, gyi,R. (2000) Gene expression microarray data analysis for toxi- 403(6769), 503–511. cology profiling. Ann. New York Acad. Sci., in press. Alon,U., Barkai,N., Notterman,D.A., Gish,K., Ybarra,S., Mack,D. Dempster,A.P., Laird,N.M. and Rubin,D.B. (1977) Maximum like- and Levine,A.J. (1999) Broad patterns of gene expression lihood from incomplete data via the EM algorithm. J. R. Stat. revealed by clustering analysis of tumor and normal colon tissues Soc., B39,1–38. probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, DeRisi,J.L., Iyer,V.R. and Brown,P.O. (1997) Exploring the 96(12), 6745–6750. metabolic and genetic control of gene expression on a genomic Arkin,A. and Ross,J. (1995) Statistical construction of chemical scale. Science, 278(5338), 680–686. reaction mechanism from measured time-series. J. Phys. Chem., D’haeseleer,P. and Fuhrman,S. (2000) Gene network inference 99, 970–979. using a linear, additive regulation model. In preparation. Arkin,A., Ross,J. and McAdams,H.H. (1998) Stochastic kinetic D’haeseleer,P., Wen,X., Fuhrman,S. and Somogyi,R. (1997) Mining analysis of developmental pathway bifurcation in phage lambda- the gene expression matrix: inferring gene relationships from- infected Escherichia coli cells. Genetics, 149, 1633–1648. large scale gene expression data. In: M.Holcombe, R.Paton Arkin,A., Shen,P. and Ross,J. (1997) A test case of correlation (eds) Information processing in cells and Tissues. Plenum, metric construction of a reaction pathway from measurements. pp. 203–212. http://www.cs.unm.edu/∼patrik/networks/IPCAT/ Science, 277, 1275–1279. ipcat.html Arnone,M.I. and Davidson,E.H. (1997) The hardwiring of develop- D’haeseleer,P., Wen,X., Fuhrman,S. and Somogyi,R. (1999) Linear ment: organization and function of genomic regulatory systems. modeling of mRNA expression levels during CNS development Development, 124, 1851–1864. and injury. Pacific Symposium on Biocomputing, 4,41–52. http: Ben-Dor,A., Shamir,R. and Yakhini,Z. (1999) Clustering gene //www.smi.stanford.edu/projects/helix/psb99/Dhaeseleer.pdf expression patterns. J. Comput. Biol., 6(3–4), 281–297. Eisen,M.B., Spellman,P.T., Brown,P.O. and Botstein,D. (1998) Bezdek,J.C. (1981) Pattern Recognition with Fuzzy Objective Cluster analysis and display of genome-wide expression patterns. Function Algorithms. Plenum Press, New York. Proc. Natl. Acad. Sci. USA, 95(25), 14863–14868. Bhattacharjya,A. and Liang,S. (1996) Power laws in some random Erb,R.S. and Michaels,G.S. (1999) Sensitivity of biological mod- Boolean networks. Phys. Rev. Lett., 77, 1644. els to errors in parameter estimates. Pacific Symposium on Brayton,R.K., Hachtel,G.D., McMullen,C.T. and Sangiovanni- Biocomputing, 4,53–64. http://www.smi.stanford.edu/projects/ Vincentelli,A.L. (1984) Algorithms for VLSI Synthesis. Kluwer helix/psb99/Erb.pdf Academic Publishers, ftp://ic.eecs.berkeley.edu/pub/Espresso/ Ewing,R.M., Kahla,A.B., Poirot,O., Lopez,F., Audic,S. and Brazma,A., Jonassen,I., Vilo,J. and Ukkonen,E. (1998) Predicting Claverie,J.M. (1999) Large-scale statistical analyses of rice ESTs gene regulatory elements in silico on a genomic scale. Genome reveal correlated patterns of gene expression. Genome Res., 9, 950–959. Res., 8, 1202–1215. Felsenstein,J. (1993) PHYLIP (Phylogeny Inference Package). ver- Brown,P.O. and Botstein,D. (1999) Exploring the new world of sion 3.5c, distributed by the author, Department of Genetics, Uni- genome with DNA microarrays. Nature Genet., 21,33–37. versity of Washington, Seattle. Butte,A.J. and Kohane,I.S. (2000) Mutual information relevance Ferea,T.L., Botstein,D., Brown,P.O. and Rosenzweig,R.F. (1999) networks: functional genomic clustering using pairwise entropy Systematic changes in gene expression patterns following adap- measurements. Pacific Symposium on Biocomputing, 5, 415– tive evolution in yeast. Proc. Natl. Acad. Sci. USA, 96, 17 9721– 426.http://www.smi.stanford.edu/projects/helix/psb00/butte.pdf Califano,A., Stolovitzky,G. and Tu,Y. (2000) Analysis of gene ex- Fisher,R.A., In Kendall,M.G. and Stuart,A. (1969) The Advanced pression microarrays for phenotype classification. 8th Interna- Theory of Statistics. Vol. 1, 3rd edn., Hafner Press, New York, tional Conference on Intelligent Systems for Molecular Biology, pp. 391. in press. 723 P.D’haeseleer et al. Franklin,G.F., Powell,J.D. and Emami-Naeini,A. (1994) Feedback An Introduction to Cluster Analysis. John Wiley and Sons, New Control of Dynamic Systems. 3rd edn, Addison-Wesley. York. Koiran,P. and Sontag,E.D. (1998) Vapnik-Chervonenkis dimension Fuhrman,S., Cunningham,M.J., Wen,X., Zweiger,G., Seilhamer,J.J. of recurrent neural networks. Discrete Appl. Math., 86,63–79. and Somogyi,R. (2000) The application of Shannon entropy in the identification of putative drug targets. Biosystems, 55(1–3), Lance,G.N. and Williams,W.T. (1966) A general theory of classifi- 5–14. catory sorting strategies: 1. hierarchical systems. Comput. J., 9, 373–380. Fuhrman,S., D’haeseleer,P. and Somogyi,R. (1999) Tracing genetic information flow from gene expression to pathways and molec- Liang,S., Fuhrman,S. and Somogyi,R. (1998) REVEAL, a gen- ular networks. 1999 Society for Neuroscience Short Course Syl- eral reverse engineering algorithm for inference of genetic net- labus, DNA Microarrays: The New Frontier in Gene Discovery work architectures. Pacific Symposium on Biocomputing, 3,18– and Gene Expression Analysis. 29. http://www.smi.stanford.edu/projects/helix/psb98/liang.pdf Glass,L. (1975) Combinatorial and topological methods in nonlin- MacQueen,J. (1967) Some methods for classification and anal- ear chemical kinetics. J. Chem. Phys., 63, 1325–1335. ysis of multivariate observation. In Le Cam,L.M. and Nye- man,J. (eds), Proceedings of the Fifth Berkeley Symposium Glass,L. (1978) Stable oscillations in mathematical models of on Mathematical Statistics and Probability Vol. I, University of biological control systems. J. Math. Biol., 6, 207–223. California Press. Golub,T.R., Slonim,D.K., Tamayo,P., Huard,C., Gaasen- Mahalanobis,P.C. (1936) On the generalized distance in statistics. beek,M., Mesirov,J.P., Coller,H., Loh,M.L., Downing,J.R., Proc. Nat. Inst. Sci. India 12, pp. 49–55. Caligiuri,M.A., Bloomfield,C.D. and Lander,E.S. (1999) Molec- ular classification of cancer: class discovery and class prediction Marnellos,G. and Mjolsness,E. (1998) A gene network approach by gene expression monitoring. Science, 286, 531–537. to modeling early neurogenesis in Drosophila. Pacific Sympo- sium on Biocomputing, 3,30–41. http://www.smi.stanford.edu/ Goss,P.J. and Peccoud,J. (1998) Quantitative modeling of stochastic projects/helix/psb98/marnellos.pdf systems in molecular biology by using stochastic Petri nets. Proc. Natl. Acad. Sci. USA, 95, 6750–6755. Massart,D.L. and Kaufman,L. (1983) The Interpretation of Analyt- ical Chemical Data by the Use of Cluster Analysis. John Wiley Han,E.-H., Karypis,G., Kumar,V. and Mobasher,B. (1997) and Sons, New York. Clustering in a high-dimensional space using hyper- graph models. Technical Report # 97-019 Department Matsuno,H., Doi,A. and Nagasaki,M. (2000) Hybrid petri net of Computer Science, University of Minnesota. http: representation of genetic regulatory network. Pacific Sympo- //www.cs.umn.edu/tech reports/1997/TR 97-019 Clustering sium on Biocomputing, 5, 338–349.http://www.smi.stanford.edu/ Based on Association Rule Hypergraphs.html. projects/helix/psb00/matsuno.pdf Hertz,J. (1998) Statistical issues in reverse engineering of genetic Mendes,P. (1999) Metabolic simulation as an aid in understanding networks. Poster for Pacific Symposium on Biocomputing, http: gene expression data. In Bornberg-Bauer,E., De Beuckelaer,A., //www.nordita.dk/∼hertz/papers/dgshort.ps.gz. Kummer,U. and Rost,U. (eds), Workshop on Computation of Biochemical Pathways and Genetic Networks Logos Verlag, Heyer,L.J., Semyon Kruglyak, and Shibu Yooseph, (1999) Explor- Berlin, pp. 27–33. ing expression data: identification and analysis of coexpressed genes. Genome Res., 9(11), 1106–1115. Michaels,G., Carr,D.B., Wen,X., Fuhrman,S., Askenazi,M. and So- mogyi,R. (1998) Cluster analysis and data visualization of large- Holstege,F.C.P., Jennings,E.G., Wyrick,J.J., Lee,T.I., Hengart- scale gene expression data. Pacific Symposium on Biocomput- ner,C.J., Green,M.R., Golub,T.R., Lander,E.S. and Young,R.A. ing, 3,42–53. http://www.smi.stanford.edu/projects/helix/psb98/ (1998a) Dissecting the regulatory circuitry of a eukaryotic michaels.pdf genome. Cell, 95, 717–728. Miklos,G.L. and Rubin,G.M. (1996) The role of the genome project Holstege,F.C.P., Jennings,E.G., Wyrick,J.J., Lee,T.I., Hengart- in determining gene function: insights from model organisms. ner,C.J., Green,M.R., Golub,T.R., Lander,E.S. and Young,R.A. Cell, 86, 4 521–529. (1998b) Genome-Wide Expression Page, http://web.wi.mit.edu/ young/expression/ Mjolsness,E., Castano,R. ˜ and Gray,A. (1999a) Multi-parent clus- tering algorithms for large-scale gene expression analy- Huang,S. (1999) Gene expression profiling, genetic networks, sis. Technical Report JPL-ICTR-99-5 Jet Propulsion Lab- and cellular states: an integrating concept for tumorigenesis oratory Section 367. http://www-aig.jpl.nasa.gov/public/mls/ and drug discovery. J. Mol. Med., 77, 6 469–480. papers/emj/multiparentPreprint.pdf. Ideker,T.E., Thorsson,V. and Karp,R.M. (2000) Discovery of regula- Mjolsness,E., Mann,T., Castano,R. ˜ and Wold,B. (1999b) From co- tory interactions through perturbation: inference and experimen- expression to co-regulation: an approach to inferring transcrip- tal design. Pacific Symposium on Biocomputing, 5, 302–313.http: tional regulation among gene classes from large-scale expres- //www.smi.stanford.edu/projects/helix/psb00/ideker.pdf sion data. Technical Report JPL-ICTR-99-4 Jet Propulsion Jain,A.K. and Dubes,R.C. (1988) Algorithms for Clustering Data. Laboratory Section 365. http://www-aig.jpl.nasa.gov/public/mls/ Prentice Hall, Englewood Cliffs, NJ. papers/emj/GRN99prprnt.pdf. Kauffman,S.A. (1969) Metabolic stability and epigenesis in ran- Mjolsness,E., Sharp,D.H. and Reinitz,J. (1991) A connectionist domly connected nets. J. Theoret. Biol., 22, 437–467. model of development. J. Theor. Biol., 152, 429–454. Kauffman,S.A. (1993) The Origins of Order, Self-Organization and Selection in Evolution. Oxford University Press. Mounts,W.M. and Liebman,M.N. (1997) Application of petri nets Kaufman,L. and Rousseeuw,P.J. (1990) Finding Groups in Data: and stochastic activity nets to modeling biological pathways and 724 Genetic network inference processes. Int. J. Comput. Simul.. Church,G.M. (1999) Systematic determination of genetic network architecture. Nature Genet., 22, 281–285. Niehrs,C. and Pollet,N. Synexpression groups in eukaryotes. Na- ture, 402, 483–487. Thieffry,D. and Thomas,R. (1998) Qualitative analysis of gene networks. Pacific Symposium on Biocomputing, 3,77–88. http: Pearlmutter,B.A. (1995) Gradient calculations for dynamic recur- //www.smi.stanford.edu/projects/helix/psb98/thieffry.pdf rent neural networks: a survey. IEEE Trans. Neural Netw., 6, 1212–1228. Thomas,R. (1991) Regulatory networks seen as asynchronous automata: a logical description. J. Theor. Biol., 153,1–23. Perou,C.M., Jeffrey,S.S., van de Rijn,M., Rees,C.A., Eisen,M.B., Ross,D.T., Pergamenschikov,A., Williams,C.F., Zhu,S.X., Tominaga,D. and Okamoto,M. (1998) Design of canonical model Lee,J.C., Lashkari,D., Shalon,D., Brown,P.O. and Botstein,D. describing complex nonlinear dynamics. In Yoshida,T. and (1999) Distinctive gene expression patterns in human mam- Shioya,S. (eds), Computer Applications in Biotechnology 1998 mary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. Elsevier Science, pp. 85–90. USA, 96, 16 9212–9217. Tominaga,D., Okamoto,M., Kami,Y., Watanabe,S. and Eguchi,Y. Ptashne,M. (1992) A genetic Switch. Cell Press and Blackwell (1999) Nonlinear Numerical Optimization Technique Based scientific publications. on a Genetic Algorithm, http://www.bioinfo.de/isb/gcb99/talks/ tominaga/ Roth,P., Hughes,J.D., Estep,P.W. and Church,G.M. (1998) Finding DNA regulatory motifs within unaligned noncoding sequences van Helden,J., Andre,B. and Collado-Vides,J. (1998) Extracting clustered by whole-genome mRNA quantitation. Nature Biotech- regulatory sites from the upstream region of yeast genes by nol., 16, 939–945. computational analysis of oligonucleotide frequencies. J. Mol. Biol., 281, 827–842. Savageau,M.A. (1995) Power-law formalism: a canonical nonlinear approach to modeling and analysis. In Proceedings of the World Velculescu,V.E., Zhang,L., Zhou,W., Vogelstein,J., Basrai,M.A., Congress of Nonlinear Analysts ’92 pp. 3323–3334. Bassett,D.E.,Jr, Hieter,P., Vogelstein,B. and Kinzler,K.W. (1997) Characterization of the yeast transcriptome. Cell, 88, 243–251. Savageau,M.A. (1998) Rules for the evolution of gene circuitry. Wahde,M. and Hertz,J. (1999) Course-grained reverse engineering Pacific Symposium on Biocomputing, 3,54–65. http://www.smi. stanford.edu/projects/helix/psb98/savageau.pdf of genetic regulatory networks. Proceedings of Information Processing in Cells and Tissues (IPCAT) ’99. Sneath,P.H.A. and Sokal,R.R. (1973) Numerical Taxonomy. Free- man, San Francisco. Walker,M.G., Volkmuth,W., Sprinzak,E., Hodgson,D. and Klin- gler,T. (1999) Prediction of gene function by genome-scale Sokal,R.R. and Michener,C.D. (1958) Univ. Kans. Sci. Bull., 38, expression analysis: prostate-cancer associated genes. Genome 1409–1438. Res., 9, 1198–1203. Somogyi,R. and Sniegoski,C. (1996) Modeling the complexity of Weaver,D.C., Workman,C.T. and Stormo,G.D. (1999) Modeling genetic networks. Complexity, 1(6), 45–63. regulatory networks with weight matrices. Pacific Symposium Somogyi,R., Fuhrman,S., Askenazi,M. and Wuensche,A. (1997) on Biocomputing, 4, 112–123. http://www.smi.stanford.edu/ The gene expression matrix: towards the extraction of genetic projects/helix/psb99/Weaver.pdf network architectures. Nonlinear Analysis, Proc. of Second Weinstein,J.N., Myers,T.G., O’Connor,P.M., Friend,S.H., For- World Cong. of Nonlinear Analysts (WCNA96), 30(3), 1815– nace,A.J.,Jr, Kohn,K.W., Fojo,T., Bates,S.E., Rubinstein,L.V., 1824. http://rsb.info.nih.gov/mol-physiol/reprints/WCNA.pdf Anderson,N.L., Buolamwini,J.K., van Osdol,W.W., Monks,A.P., Somogyi,R., Wen,X., Ma,W. and Barker,J.L. (1995) Developmental Scudiero,D.A., Sausville,E.A., Zaharevitz,D.W., Bunow,B., kinetics of GAD family mRNAs parallel neurogenesis in the rat Viswanadhan,V.N., Johnson,G.S., Wittes,R.E. and Paull,K.D. spinal cord. J. Neurosci., 15, 2575–2591. (1997) An information-intensive approach to the molecular phar- Somogyi,R. (1999) Making sense of gene expression data. Phar- macology of cancer. Science, 275, 343–349. mainformatics: A Trends Guide (Trends Supplement). pp. 17–24. Wen,X., Fuhrman,S., Michaels,G.S., Carr,D.B., Smith,S., Sontag,P. (1997) Shattering all sets of k points in general position re- Barker,J.L. and Somogyi,R. (1998) Large-scale temporal gene quires (k − 1)/2 parameters. Neural Comput., 9, 337–348. expression mapping of central nervous system development. Spellman,P.T., Sherlock,G., Zhang,M.Q., Iyer,V.R., Anders,K., Proc. Natl. Acad. Sci. USA, 95, 1 334–339. Eisen,M.B., Brown,P.O., Botstein,D. and Futcher,B. (1998) Wolfsberg,T.G., Gabrielian,A.E., Campbell,M.J., Cho,R.J., Comprehensive identification of cell cycle-regulated genes of Spouge,J.L. and Landsman,D. (1999) Candidate regulatory the yeast Saccharomyces cerevisiae by microarray hybridization. sequence elements for cell cycle-dependent transcription in Mol. Biol. Cell., 9, 3273–3297. saccharomyces cerevisiae. Genome Res., 9, 775–792. Szallasi,Z. (1999) Genetic network analysis in light of massively Wuensche,A. (1993) Discrete Dynamics Lab (DDLab) http://www. parallel biological data acquisition. Pacific Symposium on Bio- santafe.edu/∼wuensch/ddlab.html. computing, 4,5–16. http://www.smi.stanford.edu/projects/helix/ Wuensche,A. (1998) Genomic regulation modeled as a network psb99/Szallasi.pdf with basins of attraction. Pacific Symposium on Biocom- Tamayo,P., Slonim,D., Mesirov,J., Zhu,Q., Kitareewan,S., Dmitro- puting, 3,89–102. http://www.smi.stanford.edu/projects/helix/ vsky,E., Lander,E.S. and Golub,T.R. (1999) Interpreting patterns psb98/wuensche.pdf of gene expression with self-organizing maps: methods and ap- plication to hematopoietic differentiation. Proc. Natl. Acad. Sci. Wuensche,A. (1999) Classifying Cellular Automata Automatically: USA, 96, 2907. Finding gliders, filtering, and relating space-time patterns, attrac- Tavazoie,S., Hughes,J.D., Campbell,M.J., Cho,R.J. and tor basins, and the Z parameter. Complexity, 4(3), 47–66. http: 725 P.D’haeseleer et al. //www.santafe.edu/∼wuensch/cplex ab.html puting, 5, 476–487.http://www.smi.stanford.edu/projects/helix/ psb00/zhu.pdf Yuh,C.H., Bolouri,H. and Davidson,E.H. (1998) Genomic cis- regulatory logic: experimental and computational analysis of a Zweiger,G. (1999) Knowledge discovery in gene-expression- sea urchin gene. Science, 279, 1896–1902. microarray data: mining the information output of the genome. Trends Biotech., 17, 429–436. Zhu,J. and Zhang,M.Q. (2000) Cluster, function and promoter: analysis of yeast expression array. Pacific Symposium on Biocom-

Journal

BioinformaticsOxford University Press

Published: Aug 1, 2000

There are no references for this article.