Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Inferring Boolean network structure via correlation

Inferring Boolean network structure via correlation Vol. 27 no. 11 2011, pages 1529–1536 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btr166 Systems biology Advance Access publication April 5, 2011 1,† 2,† 2 1,3,∗ Markus Maucher , Barbara Kracher , Michael Kühl and Hans A. Kestler Research group Bioinformatics and Systems Biology, Clinic for Internal Medicine I, University Medical Center Ulm, 2 3 Institute for Biochemistry and Molecular Biology, Ulm University, and Research Group Bioinformatics and Systems Biology, Institute for Neural Information Processing, Ulm University, Ulm, Germany Associate Editor: Olga Troyanskaya ABSTRACT (Bornholdt, 2005). Nonetheless, it is powerful enough to model the structure of network motifs, basic network components frequently Motivation: Accurate, context-specific regulation of gene found in gene regulatory networks (Alon, 2006; Babu et al., 2004). expression is essential for all organisms. Accordingly, it is very Gene transcription in eukaryotic cells has to be tightly regulated important to understand the complex relations within cellular gene to ensure proper cell function, i.e. according to a particular cellular regulatory networks. A tool to describe and analyze the behavior context (cell cycle phase, cell type, environmental conditions, of such networks are Boolean models. The reconstruction of a developmental stage), a specific subset of genes is expressed Boolean network from biological data requires identification of while the expression of other genes has to be actively repressed. dependencies within the network. This task becomes increasingly This regulation is generally mediated via certain proteins, called computationally demanding with large amounts of data created by transcription factors, that bind to specific sequence motifs in recent high-throughput technologies. Thus, we developed a method the promoter region of a gene and either enhance or inhibit that is especially suited for network structure reconstruction from the transcription of this gene [reviewed e.g. in (Orphanides and large-scale data. In our approach, we took advantage of the fact that Reinberg, 2002; Venters and Pugh, 2009)]. If a promoter region a specific transcription factor often will consistently either activate contains binding motifs for different transcription factors, these or inhibit a specific target gene, and this kind of regulatory behavior factors can either cooperate in the regulation of a certain target can be modeled using monotone functions. gene or counteract each other. Moreover, in some cases, specific Results: To detect regulatory dependencies in a network, we cofactors determine whether a transcription factor acts as activator examined how the expression of different genes correlates to or repressor. Despite the variety in the modes of transcriptional successive network states. For this purpose, we used Pearson regulation, most transcriptional regulators will be either activators correlation as an elementary correlation measure. Given a Boolean or inhibitors of a certain gene in a specific cell type. In this case, the network containing only monotone Boolean functions, we prove that activating or repressing effect of a transcription factor monotonically the correlation of successive states can identify the dependencies depends on its cellular concentration. In other words, an increase in in the network. This method not only finds dependencies in the concentration of an activator will increase but never decrease randomly created artificial networks to very high percentage, but also transcription of its target, while an increase in the concentration reconstructed large fractions of both a published Escherichia coli of a repressor will decrease but never increase transcription of regulatory network from simulated data and a yeast cell cycle its target. This kind of transcriptional regulation can be modeled network from real microarray data. mathematically in a very simplistic manner by the use of monotone Contact: hans.kestler@uni-ulm.de Boolean functions which describe exactly this monotonic relation. Supplementary information: Supplementary data are available at Besides the monotone Boolean functions applied in this work, there Bioinformatics online. exist further related classes of functions describing such monotonic Received on August 20, 2010; revised on February 28, 2011; relations. Among them are, for example, nested canalyzing functions accepted on March 25, 2011 (Kauffman et al., 2003), single-layer perceptrons (Rani et al., 2007) and multilinear functions (Tsukimoto and Hatano, 2003). Gene expression data can be analyzed and visualized on the 1 INTRODUCTION basis of correlations identified in the observed expression patterns Boolean networks were popularized by Stuart Kauffman as models of the analyzed genes. Models from information theory correlate for genetic regulatory networks (Kauffman, 1969). In this kind of expression values from two genes directly and predict an interaction model, only two states are discriminated (active/inactive) for each between two genes to be present if the correlation coefficient is gene. The dynamics of the network can then be described by Boolean exceeding a certain threshold. Algorithms based on this principle are functions. This model works with a very small set of parameters and for example CLR (Faith et al., 2007) or ARACNE (Margolin et al., thus represents a very stringent application of Occam’s Razor, which 2006). However, these correlations generally reflect co-expression makes it especially suitable for modeling large genetic networks of genes and can, under some restrictions to the underlying network like e.g. absence of cyclic dependencies, also give information on direct causal relationships [cf. (Pearl, 2000; Shipley, 2000; Spirtes To whom correspondence should be addressed. The authors wish it to be known that, in their opinion, the first two authors et al., 2000) and references therein]. Moreover, Opgen-Rhein and should be regarded as joint First Authors. Strimmer (2007) and Zoppoli et al. (2010) established algorithms © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com 1529 [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1529 1529–1536 M.Maucher et al. to develop networks depicting dynamic relations from correlation Let D be a product distribution and X be a D-distributed random variable. The influence I (f ) of a variable x on a function f :{0,1} →{0,1} is defined data. Likewise, there exist algorithms for the inference of dynamic D,i i as the probability that a change in x will also lead to a change in f (X ), i.e. Boolean networks from gene expression data, like REVEAL (Liang et al., 1998) or the best fit extension algorithm (Lähdesmäki et al., I (f ) =P f (X )| =f (X )| . D,i D x =0 x =1 i i 2003). These algorithms, however, perform reasonably well only if Here, f (X )| denotes a partial assignment, i.e. x =0 the datasets are not too large. In contrast to existing approaches, we do not correlate expression f (X ,...,X )| =f (X ,...,X ,b,X ,...,X ) . 1 n x =b 1 i−1 i+1 n data of different genes directly, but instead correlate the states of Pearson correlation: given two random variables X and Y , their Pearson particular genes with the successive states of potential target genes, correlation is defined as thus assuming and examining directed causal dependencies. E[(X −E[X])(Y −E[Y ])] Cov(X,Y ) This article is organized as follows: we first prove properties ρ(X,Y ) = = , σ σ σ σ X Y X Y of interaction reconstruction via correlation, assuming monotone Boolean functions. This is followed by a series of experiments where σ denotes the SD of a random variable Z . successively advancing from entirely artificial through to real data. Theorem 1. Given a monotone function f :{0,1} →{0,1} and a random variable X ∈{0,1} that is distributed according to a product distribution D. Then the function f depends on the i-th variable if and only if the Pearson 2 METHODS—CORRELATION AND MONOTONE correlation of X and f (X ) is non-zero for non-zero variances. FUNCTIONS Proof. Let D denote the product distribution D on all variables except Boolean networks: a Boolean network consists of n nodes, numbered from i x . Then 1to n and n functions f ,...,f :{0,1} →{0,1}. The state of the network can i 1 n be described by a Boolean vector x ∈{0,1} , where x describes the state of E [(X −E[X ])(f (X ) −E[f (X )])] D i i the i-th node. The n functions describe the dynamics of the network: If the σ σ X f (X ) network is in state x at time t, it transforms into state (f (x),f (x),...,f (x)) 1 2 n n 1 at time t + 1. We can also combine f ,...,f into one function F :{0,1} → 1 n = E E [(X −E[X ])(f (X ) −E[f (X )])] D X i i i i σ σ {0,1} such that F (x) =f (x). State x then transforms into state F (x). When X f (X ) i i i (j) (1) (2) considering successive states x ,x ,..., the term x will denote the state (j) = E (1 − µ )(−µ ) f (X )| −f (X ) D i i x =0 i i of the i-th node in x . σ σ X f (X ) Relevance of a variable: a function f :{0,1} →{0,1} depends on the i-th +µ (1 − µ ) f (X )| −f (X ) i i x =1 variable if there exist x ,...,x ,x ,...,x such that 1 i−1 i+1 n µ (1 − µ ) i i = E f (X )| −f (X )| f (x ,...,x ,0,x ,...,x ) D x =1 x =0 i i i 1 i−1 i+1 n σ σ X f (X ) =f (x ,...,x ,1,x ,...,x ). 1 i−1 i+1 n X = E f (X )| −f (X )| . D x =1 x =0 i i i f (X ) If f depends on the i-th input variable, we also say that the i-th variable is relevant for f . The set of all relevant variables of f is denoted as rel(f ). The theorem then follows from the fact that f is monotone in x . Reconstructing the dynamics of a Boolean network: in order to reconstruct As derived in Theorem 1, the influence of a variable can be estimated via the functions of a Boolean network, we are given a sequence of m a modified version of the Pearson correlation (1) (2) (m) states x ,x ,...,x along with the corresponding successor states E[(X −E[X])(Y −E[Y ])] Cov(X,Y ) (1) (2) (m) (i) (i) ρ ˜ (X,Y ) = = . F (x ),F (x ),...,F (x ). A tuple (x ,F (x )) is also called an example. 2 2 σ σ X X From these examples, the task is to reconstruct the dependencies in the Boolean network, i.e. the set of variables each of the functions depends The Chernoff-Hoeffding bound (Hoeffding, 1963): given independent on. random variables X ,...,X with a ≤X ≤b for all i, the Chernoff-Hoeffding 1 m i bound can be stated as follows: Monotonicity of functions: a Boolean function f :{0,1} →{0,1} is m m −2m monotonically increasing in the i-th variable if for all x ,...,x ,x ,...,x 2 1 i−1 i+1 n (b−a) P X − E[X ] ≥ m ≤e . i i i=1 i=1 f (x ,...,x ,0,x ,...,x ) 1 i−1 i+1 n Estimating the correlation coefficient: we will use the Chernoff-Hoeffding ≤f (x ,...,x ,1,x ,...,x ), 1 i−1 i+1 n bound for the estimation of a variable’s influence: the function f is monotonically decreasing in the i-th variable if for all Theorem 2. When estimating the influence of a variable on a function, the variables x ,...,x ,x ,...,x 1 i−1 i+1 n estimation error shrinks exponentially in the number of samples. f (x ,...,x ,0,x ,...,x ) 1 i−1 i+1 n Proof. As shown in Theorem 1, the influence of a variable X on a function f (X ) ≥f (x ,...,x ,1,x ,...,x ). 1 i−1 i+1 n f can be estimated via a modified correlation ρ ˜ (X ,f (X )) = ρ(X ,f (X )). i i We will show that when estimating this term the error probability shrinks A function f is monotone if for every variable f is either monotonically exponentially in the number of samples. increasing or monotonically decreasing in that variable. For example, the Via the Chernoff-Hoeffding bound, we can prove that Boolean AND function is monotone, while the XOR function is not. P [|x − µ | ≥  ] ≤ exp(−2m ) x x Influence of a variable: a probability distribution D on {0,1} is called and product distribution if for any D-distributed random variable X the property n 2 P y − µ ≥  ≤ exp(−2m ) , P[X = (x ,...,x )]= P[X =x ] holds, i.e. the X are independent. y y 1 n i i i y i=1 [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1530 1529–1536 Boolean network structure Algorithm 1 Finding all relevant variables of f with influence which implies exceeding a threshold T P[|xy − µ µ |≥  µ +  µ +   ] x y x y y x x y (1) (1) (m) (m) Input: m examples (x ,f (x )),...,(x ,f (x )) of a function f , 2 2 ≤ exp(−2m ) + exp(−2m ) , x y drawn from a product distribution, threshold T Output: Approximation of rel(f ) and we can also show rel ←∅ m 1 m (j) y ← f (x ) 2 m j=1 P x y −E[XY ] ≥  ≤ exp(−2m ) . i i r for i ∈{1,...,n} do i=1 (j) 1 m x ← x j=1 m i We use the fact that the x and y are all Boolean to estimate the sample i i if x(1 −x) > 0 then variance s =x(1 −x). Let δ denote the estimation error for µ , i.e. x = x x {ensures non zero sample variance} µ + δ . That way, m (j) x x 1 (j) x f (x )−xy m j=1 if ≥T then x(1−x) x(1 −x) = (µ + δ )(1 − µ − δ ) x x x x rel ← rel ∪{i} = µ (1 − µ ) + (1 − 2µ − δ )δ . end if x x x x x end if Since |(1 − 2µ − δ )|≤ 1, we can conclude that x x end for return rel |x(1 −x) − µ (1 − µ )|≤|δ |. x x x This means that if the error for estimating µ is at most  then the error for x x 3 SIMULATION RESULTS estimating σ is at most  , too. We can now combine the estimation errors above: estimating the modified correlation coefficient ρ ˜ via Error rates in artificial networks with fixed in-degree: to investigate the performance of our method for concrete values and for controlled x y −xy i i i=1 r ˜(X,Y ) = noise levels, we analyzed the error probability of our inference x(1 −x) algorithm in an artificial network consisting of monotone Boolean functions. For this purpose, we created a set of 50 random Boolean and setting  := ( +  µ +  µ +   +  )/(σ −  ), we can use Lemma r x y y x x y x x networks each containing 80 nodes. For every node in these 1 below to conclude that networks, a monotone function with three input variables was P [|˜r(X,Y )−˜ ρ(X,Y )|≥ ] constructed as follows: first, we created a random truth table and 2 2 2 determined randomly for each input variable if the function should ≤ exp(−2m ) + exp(−2m ) + exp(−2m ) . x y r be monotonically increasing or decreasing in that variable. We then In particular, the error probability shrinks exponentially in the number of altered the initial truth table by correcting all inconsistencies. If samples m. this resulted in a function that did not depend on all of its input variables, we created a new random truth table and repeated the process until every function depended on all of its input variables. ˜ ˜ ˜ ˜ Lemma 1. Let A,A,B,B ∈[−1,1] with B > |A|, |A −A| < and |B −B| < . 1 2 Additionally, for each of the networks generated we analyzed Then ˜ attractors, i.e. single states or sequences of states toward which A A  + 1 2 − ≤ . the networks evolve. The network generation and attractor analysis B ˜ B − was performed with the R package BoolNet (Müssel et al., 2010), with the described modification to create monotone functions. We Proof. It holds that determined the number of attractors for each of these artificial A A A A −  A +  A 1 1 networks via a random sampling of 10 000 starting states and their − ≤ max − , − . B B B +  B −  B B 2 2 resulting attractors (function getAttractors of BoolNet). The number of attractors ranged from 1 to 59 (with a median of 5.5), with With attractor lengths ranging from 1 to 258 (with median 6). From A A −  A +B  + 1 2 1 1 2 − = ≤ each of the 50 random networks, we subsequently generated a batch B B +  B(B +  ) B + 2 2 2 of 1000 simulated time-series datasets, with each dataset covering and A +  A A +B  +  2 time points. In this process, for the generation of each dataset 1 2 1 1 2 − = ≤ B −  B B(B −  ) B −  a new starting configuration of the network was drawn from a 2 2 2 uniform distribution. The datasets generated in this way for each the lemma follows. of the 50 random networks represent synthetic ‘gene expression’ data of the 80 ‘genes’, measured at two successive time points in an Inferring the structure of a Boolean network: combining the results unperturbed system. These datasets were then used to reconstruct above, we can infer the dependency relations within a Boolean graph the dependencies in the underlying artificial networks. If a Boolean with a fast algorithm: Algorithm 1 finds the relevant variables of a function depends on three variables, changing one of these variables monotone function with a chosen minimum influence, given a sequence of from 0 to 1 must have an effect on f for at least one combination of independent examples, by calculating the correlation value of each variable the other two variables. Since the probability of such a combination and identifying all variables with this correlation exceeding a given value. of two variables is 0.25 under a uniform distribution, each of these Running that algorithm for the output function of each node of a Boolean network reveals the structure of the entire network. variables has an influence of at least 0.25. A correlation value was [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1531 1529–1536 M.Maucher et al. Table 1. Runtimes of the correlation method and the best fit extension algorithm, averaged over 20 runs True negative, best fit algorithm Network size Correlation (s) Best Fit Best fit True negative, correlation (max k = 3) (s) (max k = 4) True positive, correlation 50 nodes 0.98 0.79 10.2 s 100 nodes 3.86 11.2 324 s 200 nodes 15.4 176 2 h 50 min The Correlation approach is independent of setting a max k. True positive, best fit algorithm approach, moreover, the threshold for the correlation can be varied to determine whether the algorithm should preferably identify dependencies with a high precision, at the expense of a lower recall, or the other way round (see Supplementary Fig. S5). As additional measure for the reliability of network reconstruction, we further analyzed the F -score, which can be regarded as weighted average 200 400 600 800 1000 of precision and recall (see Supplementary Fig. S6). Number of datasets Runtime: in a second experiment, we measured the running time of the correlation method and the best fit algorithm. The average single- Fig. 1. True positive and true negative rates of an experiment with artificial core runtimes on a 3.0 GHz Xeon CPU are given in Table 1. The data and Gaussian noise with SD σ = 0.4. The correlation algorithm and the random networks for the reconstruction consisted of 50, 100 and 200 best fit extension algorithm were used to infer artificially created networks nodes, respectively. In each case, we measured the running times for consisting of 80 nodes with an in-degree of 3. The plot shows average true reconstruction of the random network from 50 synthetic datasets. positive and true negative rates for the reconstruction of 50 artificial networks from 50 to 1000 datasets each. The values in the table show the running times averaged over 20 runs, where we created a new random Boolean network for each of assumed to signify a dependency when it exceeded a threshold of these runs. As seen in Table 1, a key advantage of the correlation 0.125, i.e. half the minimum influence. method are the considerably shorter running times compared to the Figure 1 shows the average identification rates of our method best fit algorithm. Thus, our correlation approach is especially useful and Lähdesmäki and co-workers’ best fit extension algorithm for large networks in which the nodes have a relatively large in- (Lähdesmäki et al., 2003) for the reconstruction of 50 artificial degree. For example, the runtime of the best fit algorithm for the Boolean networks. The network size of 80 nodes was chosen to reconstruction of a network of 200 nodes with a maximum in-degree be able to compare the performance of our approach with the best of 4 (i.e. each Boolean function depends on no more than four input fit algorithm. Moreover, to obtain data that is closer to experimental variables) is almost 3 h, whereas the correlation approach needs only results, we added a Gaussian noise term with mean 0 and SD 0.4 about 15 s for this task. to the generated data, then rounded to the nearest value in {0,1}. Reconstruction of interactions from an E. coli regulatory network: This corresponds to a probability of about 0.1 for each bit to be we next tested the method on a real biological network. For that changed to the wrong value. The effect of different noise levels on purpose, we chose the integrated Escherichia coli gene regulatory network reconstruction is shown in Supplementary Figure S1. For and metabolic network published by Covert et al. (2004). Based the best fit algorithm, we used the implementation in the BoolNet R- on previously published information, extracted from literature and package. As seen in Figure 1, for a threshold of 0.125, the correlation databases, the authors of this study constructed a network model approach detected 80% and more of the dependencies in the artificial that describes the transcriptional regulation of genes involved network already when less than 200 datasets were used for network in E.coli metabolism by transcription factors and environmental reconstruction, while the best fit algorithm identified a comparable stimuli. The complete model includes a total of 1010 genes and portion of dependencies only when more than 200 datasets were 102 environmental stimuli. Of these, 906 genes code for proteins in used. However, for this threshold, the correlation approach also the E.coli metabolic network and 104 genes code for transcriptional retrieved a fraction of false-positive dependencies, that were not regulators. The expression of 479 of the genes is controlled in the actually present in the underlying networks. For example, when metabolic network, 427 genes were not included in our investigation. using 200 datasets, 25% of the non-present dependencies were So we used a total 685 (= 479 + 104 + 102) variables in the network, wrongly classified as present. Thus, the rate of correctly classified i.e. the regulated 583 genes and the 102 additional inputs. To non-present connections was lower than the true-positive rate, unless reconstruct the regulatory network, including the environmental more than 400 datasets were used for network inference (Fig. 1). stimuli, we used the Boolean functions provided by Covert et al. We further analyzed precision, i.e. the fraction of true dependencies (2004) to generate 20 synthetic time-series datasets covering 3 time among all predicted dependencies, and recall, i.e. the fraction of true points each. For the generation of each of the 20 datasets, a random dependencies that was correctly predicted, for our reconstruction starting configuration of the network was drawn from a uniform approach and the best fit algorithm. As seen in Supplementary distribution. To every value in all datasets, Gaussian noise with Figure S5, precision and recall increase for both algorithms with the mean 0 and SD 0.1 was added, to resemble the noise generally number of datasets used for network inference. In the correlation [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1532 1529–1536 Correctly classified present and non−present connections 0.2 0.4 0.6 0.8 1.0 Boolean network structure Functions depending on Functions depending on True negative, threshold 0.4 one variable two variables True positive, threshold 0.2 True negative, threshold 0.2 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Area under ROC Area under ROC True positive, threshold 0.4 Functions depending on Functions depending on three variables four or more variables 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 5 101520 Area under ROC Area under ROC Number of datasets Fig. 3. Histograms for the areas under the receiver operator curves (AUC). Fig. 2. True positive (solid lines) and true negative (dashed lines) rates of Correlation values were computed from 10 time-series datasets each covering the correlation approach for reconstruction of a published E.coli regulatory three discrete time steps. In total, 25 reconstruction runs were performed, network with 685 nodes. The network was reconstructed from 2 to 20 each time creating new time-series. That way, every dependency in the E.coli synthetic time-series datasets (with 3 time steps), that had been generated network contributes 25 AUC values to the corresponding histogram. The from the network by Covert et al. (2004). Correlations signified a regulatory AUC frequency distributions for functions that depend on 1, 2, 3, or more dependency if they exceeded a predefined threshold of 0.2 (circles) or 0.4 than 3 input variables are shown in four separate histograms. (squares). independent network reconstruction runs. For each reconstruction observed in biological data. Binarization was then performed by run, we generated a fixed number of 10 datasets of length 3 rounding to the nearest Boolean value 0 or 1. We then reconstructed from the Boolean network of Covert et al. (2004) by choosing the dependencies of the published E.coli network from the synthetic starting states randomly. In each run, we calculated the respective datasets using the described correlation method. To investigate correlation values for all potential dependencies. This generates the influence of the number of datasets on the reconstruction, 25 × 583(= 14575) AUC values (583 genes are dependent). One we varied the number of used datasets in the range of 2 to the AUC value is determined by calculating all correlations of one of available 20 (Fig. 2). By comparing the predicted dependencies with the 583 dependent genes to the 685 variables that are potentially the published network, finally, the fraction of correctly classified influencing them and utilizing all possible thresholds. From these present (true positive) and non-present (true negative) dependencies values, we generated histograms separated by the number of either was determined. Additionally, we analyzed precision and recall 1, 2, 3 or more than 3 input variables (Fig. 3). For Boolean functions and F -scores for different thresholds (see Supplementary Figs S7 with only one input variable (representing genes which are regulated and S8). As seen in Figure 2, for a threshold value of 0.2, the true by just one factor), the correlation method identified the relevant positive rate was close to 75% for network reconstruction from at variables with very high reliability. For functions with several least five datasets. The true negative rate, for this threshold, ranged input variables (representing genes which are regulated by several between 50% and 80% depending on the number of datasets used factors), the fraction of smaller AUC values increased, but a large for reconstruction. Thus, at this threshold a large fraction (∼ 75%) fraction is still close to 1. of the actual dependencies was found. To increase precision of the correlation approach, a higher threshold can be applied. So, for a Interactions of the yeast cell cycle transcriptional network: finally, threshold of 0.4, the true positive rate still was close to 60%, and in we used the correlation method to reconstruct a biological network this case the true negative rate was higher than 80% already when from microarray data. For that purpose, we chose the cell cycle five datasets were used for reconstruction (Fig. 2). Examples of transcriptional network of budding yeast, as suggested by Orlando reconstructed interactions for different dataset sizes and thresholds et al. (2008), and used published microarray results from three are given as Supplementary Material II. groups (Cho et al., 1998; Pramila et al., 2006; Spellman et al., 1998) To further evaluate how reliably the correlation of successive for network reconstruction. All three studies represent genome-wide states identified true relevant variables of the examined Boolean analyses of gene expression during the cell cycle in synchronized functions, we additionally computed the area under the receiver yeast cells, but they differ in the applied synchronization methods operator curve (AUC; Fawcett, 2006). The AUC is equal to the and the time intervals at which transcript levels were measured. probability that a randomly chosen present dependency has a higher In a first step, we binarized each of these microarray datasets absolute correlation value than a randomly chosen non-existing with the 2-means algorithm, the version of the k-means clustering dependency. For computation of the AUC values, we performed 25 algorithm that generates k = 2 clusters. Next, we created a set of [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1533 1529–1536 Correctly classified (relative to class) 0.4 0.6 0.8 1.0 Frequency Frequency 0 20 40 60 80 120 0 2000 4000 6000 Frequency Frequency 0 20406080 0 500 1000 1500 2000 M.Maucher et al. that this published network correctly and fully represents the real biological situation, our correlation inference approach correctly identified an average of 74.7% of the present dependencies and HCM1 STB1 SWI5 FKH2/1 59.1% of non-present dependencies for the described threshold. In NDD1 comparison, Lähdesmäki and co-workers’ best fit algorithm found ASH1 61.1% of the present dependencies and 38.8% of the non-present dependencies. This best result was achieved with a maximum in- MCM1 ACE2 MBP1 degree of 5 and interpreting a dependency as present when it SWI4 was found for at least one of the three time lag groups. The SWI6 identification rates of the best fit algorithm for different maximum in- CLN3 degrees are given in the Supplementary Material I. Additionally, we analyzed precision and recall as well as F -scores for correlation and WHI5 YOX1 YHP1 best fit algorithm, applying different thresholds for the correlation. In these analyses we found, that the initially chosen threshold of ‘mean ±1SD’ provided a reasonable balance between recall and precision (see Supplementary Figs S9 and S10). An example G1 S G2/M of the interactions found by the described analysis are given in Supplementary Material III. Here, an interaction was included if it occurred in the majority of the 25 simulations. Finally, we examined Fig. 4. Yeast cell cycle transcription factor network. The diagram shows differences in precision and recall, when each of the three time the yeast transcription factor network as suggested by Orlando et al. lag groups was used separately for network reconstruction (see (2008) with the transcription factors arranged on the cell cycle time line Supplementary Fig. S12). (G1 →S→G2/M) on the basis of their peak transcript levels. Interactions shown as dashed lines are based on a publication by Pramila et al. (2006). Transcriptional activators are depicted as circles, repressors as rectangles 4 DISCUSSION and the cyclin Cln3 as octagon; activating interactions end in arrowheads and inhibitory interactions in squares. The interactions shown as solid and Under the assumption that a Boolean network consists of monotone dashed lines were correctly identified by our approach, while interactions functions, we have shown that its dependencies can be reconstructed shown as dotted lines were not found. via Pearson correlation of subsequent states. Compared for instance to the best fit extension algorithm, correlations have the important advantage that they can be computed very quickly—for each pair example pairs, i.e. each example contained the state of the network of nodes, only two correlations for the two possible directed at a certain time point along with a succeeding state measured dependencies have to be computed, where the running time for after a specific time period. As the time lag between regulator the computation of a single correlation is linear in the number of expression and target expression varies considerably in the data, we samples. This leads to an overall running time of the order O(n m), grouped the sample pairs according to the intervals: the short time where n is the number of nodes and m the number of samples. In lag group contained all examples where the following states were contrast, an exhaustive search algorithm like the best fit algorithm measured after a period of 5–10 min, the intermediate time lag group considers all subsets of genes up to a given size, assuming the contained the examples with the following states measured after 14– functions of the network have an input degree of k, for each node of 21 min, and the long time lag group contained the examples with the network combinations of input nodes have to be considered, the following states measured after 25–30 min. Thus, for network k+1 reconstruction we regarded subsequent states with a time lag of which leads to a running time of the order O(n m). This is also up to about one fourth of the yeast’s cell cycle, which is assumed reflected in Table 1: for k = 3, doubling the number of nodes in to span a time of about 135 min (Orlando et al., 2008). For each the network increases the runtime of the best fit algorithm by a of the 16 genes with known dependencies in the published cell factor of 16, while it increases the runtime of the correlation method cycle transcriptional network, we computed the correlation to the only by a factor of 4. So while the exhaustive search approach is other genes in the cell cycle network and also added 16 randomly only feasible for networks with a small number of nodes or low chosen additional genes contained in all of the microarray datasets. input degrees, the correlation method can also be applied to large This was done to distort the reconstruction process and to evaluate networks. This is particularly interesting for biological applications, the detection of non-present dependencies. A computed correlation as in the context of microarray and deep-sequencing technologies was assumed to represent a dependency, if it exceeded the mean of this type of large-scale data becomes more available. all correlations either by at least 1 SD for at least one of the time Theoretically, dependencies in the examples violate the lag groups or by at least half the SD for at least two of the time assumption of Theorem 1. In an experimental evaluation lag groups. This evaluation was repeated for a total of 25 times, (Supplementary Figs S2–S4), we found that these dependencies each time choosing a new set of 16 additional genes at random. only marginally influence the reconstruction process as long as these For comparison, we also applied the best fit extension algorithm to dependencies are not too strong (visually discernible only for bias the same data. This algorithm was run on the data corresponding probabilities greater than 0.8). to each of the three time lag groups, and for maximum in-degrees While our new method is to some extent similar to using of 3, 4 and 5. the Fourier transform, it does not need the Fourier transform’s Figure 4 shows the correctly identified dependencies in the mathematical overhead (Bshouty and Tamon, 1996; Mossel et al., yeast transcription factor network (Orlando et al., 2008). Assuming 2003). We do not see any specific advantage of moving from the [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1534 1529–1536 Boolean network structure time domain into the spectral domain, as the values are (in the case that the published network whose interactions we are rebuilding of the Fourier transform) hard to interpret. We think that the notion exactly matches the real in vivo situation. Thus, for example, some of a correlation value is much more intuitive. identified dependencies, that were assumed to be false positives, Our experiments on the randomly created artificial networks have might actually be real dependencies. In line with this reasoning, shown that for noisy data and a low number of samples, Pearson the precision of network reconstruction was indeed lower for the correlation still finds a high percentage of the dependencies and, reconstruction from real data, for both the best-fit and the correlation for a range of thresholds, even surpasses the best fit algorithm with algorithm. Furthermore, it has to be kept in mind, that correlations regard to recall, precision and F -score (see Fig. 1, Supplementary representing indirect effects within the cell cycle also were counted Figs S5 and S6). A further advantage of our approach, besides the as false positives, while they do actually confer some biological short running times, is the possibility to vary the threshold according meaning. to the desired outcome, i.e. whether a high recall or rather a high In addition, Pearson correlation is not only suitable for measuring precision are requested. In case of little or no previous knowledge, the dependencies between sequences of binarized values. Especially for example, preferably a high threshold should be applied so that when real-valued examples are difficult to binarize, applying dependencies can be assumed with a high reliability, while a lower Pearson correlation directly on the raw, non-binarized data might threshold can be applied, if more previous knowledge exists that can already give some valuable information about a network. help to (pre-)select meaningful dependencies. For the simulated E.coli network, we also could reconstruct more than 50% of the present and non-present dependencies, already from 5 CONCLUSION a relatively low number of datasets (Fig. 2). The comparison of For a Boolean network consisting only of monotone Boolean the areas under the ROCs indicate that especially for functions functions, we showed that Pearson correlation is a fast method to that depend only on one or two variables, dependencies can be find dependencies in the network. This method makes it possible to found reliably. For functions depending on more variables, the analyze large networks that contain nodes with large input degree. identification of dependencies is more complicated, but the AUCs We could show for both simulated and real microarray data that show that the correlation method still is able to provide information our approach could reconstruct large parts of published regulatory about these dependencies (Fig. 3). One reason for the increased networks. difficulty when reconstructing functions with higher in-degree is the fact that for higher in-degree, variables can have a lower influence and dependencies are therefore harder to detect. In addition, the set ACKNOWLEDGEMENTS of Boolean functions can be partitioned into sets of varying difficulty The authors would especially like to thank Guido Adler for for a learning algorithm [cf. Gordon and Peretto (1990)]. This continuing support. The authors would also like to thank the partition, however, also depends to a large extent on the in-degree anonymous reviewers for their valuable comments. of the functions. Furthermore, we could show that Pearson correlation also Funding: Graduate School of Mathematical Analysis of Evolution, performed reasonably well when reconstructing dependencies from Information and Complexity (to H.A.K.). International Graduate real biological data, reaching identification rates (true positive, School in Molecular Medicine at Ulm University which is funded true negative) similar to those for the simulated E.coli datasets. by the Excellence Initiative of German governments (GSC 270 to Compared to the best fit algorithm, interactions were not only B.K.). Federal ministry of education and research (BMBF) within identified faster using correlation, but the identification was also the framework of the program of medical genome research (PaCa- more reliable. As seen also in Figure 4, almost 75% of the assumed Net; project ID PKB-01GS08, in part). The responsibility for the dependencies could be correctly identified at the chosen threshold, content lies exclusively with the authors. in spite of the regulatory complexity of the yeast cell cycle network, Conflict of Interest: none declared. even for components regulated by more than three factors (like Swi4, Swi6 or Yhp1). Further examining which of the expected dependencies were not identified by our method, we find that REFERENCES in these cases the regulatory mechanisms are based on a quite complex interplay of several factors involved. So, for example, Alon,U. (2006) An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman and Hall/CRC. Cln3 can activate transcription of Swi6 through inactivation of Babu,M. et al. (2004) Structure and evolution of transcriptional regulatory networks. the transcriptional repressor Whi5 as well as independent of Whi5 Curr. Opin. Struct. Biol., 14, 283–291. (Wittenberg and Reed, 2005), and the Whi5-independent mechanism Bornholdt,S. (2005) Systems biology: less is more in modeling large genetic networks. might hamper identification of the inhibitory influence of Whi5 Science, 21, 449–451. on Swi6. A second example is the transcriptional activator Mcm1, Bshouty,N. and Tamon,C. (1996) On the fourier spectrum of monotone functions. J. ACM (JACM), 43, 747–770. which can act in concert with an activating transcription factor Cho,R. et al. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. complex consisting of Fkh2/Ndd1 or the transcriptional repressors Mol. Cell, 2, 65–73. Yhp1 and Yox1 (Wittenberg and Reed, 2005). Accordingly, the Covert,M. et al. (2004) Integrating high-throughput and computational data elucidates influence of Mcm1, which was mostly not identified by our bacterial networks. Nature, 429, 92–96. Faith,J. et al. (2007) Large-scale mapping and validation of escherichia coli method, is strongly dependent on the presence of the respective co- transcriptional regulation from a compendium of expression profiles. PLoS Biol., factors, for which our approach correctly identified the respective 5, e8. dependencies. With regard to the reconstruction from real data, Fawcett,T. (2006) An introduction to ROC analysis. Pattern Recognit. Lett., 27, it has to be kept in mind that we cannot be completely sure 861–874. [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1535 1529–1536 M.Maucher et al. Gordon,M. and Peretto,P. (1990) The statistical distribution of Boolean gates in two- Orphanides,G. and Reinberg,D. (2002) A unified theory of gene expression. Cell, 108, inputs, one-output multilayered neural networks. J. Phys. A Math. Gen., 23, 3061. 439–451. Hoeffding,W. (1963) Probability inequalities for sums of bounded random variables. Pearl,J. (2000) Causality. Cambridge University Press, Cambridge. J. Am. Stat. Assoc., 58, 13–30. Pramila,T. et al. (2006) The forkhead transcription factor hcm1 regulates chromosome Kauffman,S. (1969) Metabolic stability and epigenesis in randomly constructed genetic segregation genes and fills the s-phase gap in the transcriptional circuitry of the cell nets. J. Theor. Biol., 22, 437–467. cycle. Genes Dev., 20, 2266–2278. Kauffman,S. (1993) The Origins of Order: Self-Organization and Selection in Evolution. Rani,T.S. et al. (2007) Analysis of E.coli promoter recognition problem in dinucleotide Oxford University Press, USA. feature space. Bioinformatics, 23, 582–588. Kauffman,S. et al. (2003) Random Boolean network models and the yeast transcriptional Shipley,B. (2000) Cause and Correlation in Biology: A User’s Guide to Path network. Proc. Natl Acad. Sci. USA, 100, 14796–14799. Analysis, Structural Equations and Causal Inference. Cambridge University Press, Lähdesmäki,H. et al. (2003) On learning gene regulatory networks under the boolean Cambridge. network model. Mach. Learn., 52, 147–167. Spellman,P. et al. (1998) Comprehensive identification of cell cycle-regulated genes of Liang,S. et al. (1998) Reveal, a general reverse engineering algorithm for inference of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, genetic network architectures. Pac. Symp. Biocomput., 3, 18–29. 9, 3273. Margolin,A. et al. (2006) Aracne: an algorithm for the reconstruction of gene regulatory Spirtes,P. et al. (2000) Causation, Prediction, and Search., 2nd edn. MIT Press, networks in a mammalian cellular context. BMC Bioinformatics, 7,S7. Cambridge, MA. Mossel,E. et al. (2003) Learning juntas. In STOC ’03: Proceedings of the thirty-fifth Tsukimoto,H. and Hatano,H. (2003) The functional localization of neural networks annual ACM symposium on Theory of Computing, ACM, New York, NY, USA, using genetic algorithms. Neural Networks, 16, 55–67. pp. 206–212. Venters,B.J. and Pugh,B.F. (2009) How eukaryotic genes are transcribed. Crit. Rev. Müssel,C. et al. (2010) BoolNet - an R package for generation, reconstruction, and Biochem. Mol. Biol., 44, 117–141. analysis of Boolean networks. Bioinformatics, 26, 1378–1380. Wittenberg,C. and Reed,S. (2005) Cell cycle-dependent transcription in Opgen-Rhein,R. and Strimmer,K. (2007) Learning causal networks from systems yeast: promoters, transcription factors, and transcriptomes. Oncogene, 24, biology time course data: an effective model selection procedure for the vector 2746–2755. autoregressive process. BMC Bioinformatics, 8 (Suppl. 2), S3. Zoppoli,P. et al. (2010) Timedelay-aracne: Reverse engineering of gene networks Orlando,D. et al. (2008) Global control of cell-cycle transcription by coupled CDK and from time-course data by an information theoretic approach. BMC Bioinformatics, network oscillators. Nature, 453, 944–947. 11, 154. [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1536 1529–1536 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

Inferring Boolean network structure via correlation

Loading next page...
 
/lp/oxford-university-press/inferring-boolean-network-structure-via-correlation-bnHkMkKxYR

References (29)

Publisher
Oxford University Press
Copyright
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/btr166
pmid
21471013
Publisher site
See Article on Publisher Site

Abstract

Vol. 27 no. 11 2011, pages 1529–1536 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btr166 Systems biology Advance Access publication April 5, 2011 1,† 2,† 2 1,3,∗ Markus Maucher , Barbara Kracher , Michael Kühl and Hans A. Kestler Research group Bioinformatics and Systems Biology, Clinic for Internal Medicine I, University Medical Center Ulm, 2 3 Institute for Biochemistry and Molecular Biology, Ulm University, and Research Group Bioinformatics and Systems Biology, Institute for Neural Information Processing, Ulm University, Ulm, Germany Associate Editor: Olga Troyanskaya ABSTRACT (Bornholdt, 2005). Nonetheless, it is powerful enough to model the structure of network motifs, basic network components frequently Motivation: Accurate, context-specific regulation of gene found in gene regulatory networks (Alon, 2006; Babu et al., 2004). expression is essential for all organisms. Accordingly, it is very Gene transcription in eukaryotic cells has to be tightly regulated important to understand the complex relations within cellular gene to ensure proper cell function, i.e. according to a particular cellular regulatory networks. A tool to describe and analyze the behavior context (cell cycle phase, cell type, environmental conditions, of such networks are Boolean models. The reconstruction of a developmental stage), a specific subset of genes is expressed Boolean network from biological data requires identification of while the expression of other genes has to be actively repressed. dependencies within the network. This task becomes increasingly This regulation is generally mediated via certain proteins, called computationally demanding with large amounts of data created by transcription factors, that bind to specific sequence motifs in recent high-throughput technologies. Thus, we developed a method the promoter region of a gene and either enhance or inhibit that is especially suited for network structure reconstruction from the transcription of this gene [reviewed e.g. in (Orphanides and large-scale data. In our approach, we took advantage of the fact that Reinberg, 2002; Venters and Pugh, 2009)]. If a promoter region a specific transcription factor often will consistently either activate contains binding motifs for different transcription factors, these or inhibit a specific target gene, and this kind of regulatory behavior factors can either cooperate in the regulation of a certain target can be modeled using monotone functions. gene or counteract each other. Moreover, in some cases, specific Results: To detect regulatory dependencies in a network, we cofactors determine whether a transcription factor acts as activator examined how the expression of different genes correlates to or repressor. Despite the variety in the modes of transcriptional successive network states. For this purpose, we used Pearson regulation, most transcriptional regulators will be either activators correlation as an elementary correlation measure. Given a Boolean or inhibitors of a certain gene in a specific cell type. In this case, the network containing only monotone Boolean functions, we prove that activating or repressing effect of a transcription factor monotonically the correlation of successive states can identify the dependencies depends on its cellular concentration. In other words, an increase in in the network. This method not only finds dependencies in the concentration of an activator will increase but never decrease randomly created artificial networks to very high percentage, but also transcription of its target, while an increase in the concentration reconstructed large fractions of both a published Escherichia coli of a repressor will decrease but never increase transcription of regulatory network from simulated data and a yeast cell cycle its target. This kind of transcriptional regulation can be modeled network from real microarray data. mathematically in a very simplistic manner by the use of monotone Contact: hans.kestler@uni-ulm.de Boolean functions which describe exactly this monotonic relation. Supplementary information: Supplementary data are available at Besides the monotone Boolean functions applied in this work, there Bioinformatics online. exist further related classes of functions describing such monotonic Received on August 20, 2010; revised on February 28, 2011; relations. Among them are, for example, nested canalyzing functions accepted on March 25, 2011 (Kauffman et al., 2003), single-layer perceptrons (Rani et al., 2007) and multilinear functions (Tsukimoto and Hatano, 2003). Gene expression data can be analyzed and visualized on the 1 INTRODUCTION basis of correlations identified in the observed expression patterns Boolean networks were popularized by Stuart Kauffman as models of the analyzed genes. Models from information theory correlate for genetic regulatory networks (Kauffman, 1969). In this kind of expression values from two genes directly and predict an interaction model, only two states are discriminated (active/inactive) for each between two genes to be present if the correlation coefficient is gene. The dynamics of the network can then be described by Boolean exceeding a certain threshold. Algorithms based on this principle are functions. This model works with a very small set of parameters and for example CLR (Faith et al., 2007) or ARACNE (Margolin et al., thus represents a very stringent application of Occam’s Razor, which 2006). However, these correlations generally reflect co-expression makes it especially suitable for modeling large genetic networks of genes and can, under some restrictions to the underlying network like e.g. absence of cyclic dependencies, also give information on direct causal relationships [cf. (Pearl, 2000; Shipley, 2000; Spirtes To whom correspondence should be addressed. The authors wish it to be known that, in their opinion, the first two authors et al., 2000) and references therein]. Moreover, Opgen-Rhein and should be regarded as joint First Authors. Strimmer (2007) and Zoppoli et al. (2010) established algorithms © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com 1529 [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1529 1529–1536 M.Maucher et al. to develop networks depicting dynamic relations from correlation Let D be a product distribution and X be a D-distributed random variable. The influence I (f ) of a variable x on a function f :{0,1} →{0,1} is defined data. Likewise, there exist algorithms for the inference of dynamic D,i i as the probability that a change in x will also lead to a change in f (X ), i.e. Boolean networks from gene expression data, like REVEAL (Liang et al., 1998) or the best fit extension algorithm (Lähdesmäki et al., I (f ) =P f (X )| =f (X )| . D,i D x =0 x =1 i i 2003). These algorithms, however, perform reasonably well only if Here, f (X )| denotes a partial assignment, i.e. x =0 the datasets are not too large. In contrast to existing approaches, we do not correlate expression f (X ,...,X )| =f (X ,...,X ,b,X ,...,X ) . 1 n x =b 1 i−1 i+1 n data of different genes directly, but instead correlate the states of Pearson correlation: given two random variables X and Y , their Pearson particular genes with the successive states of potential target genes, correlation is defined as thus assuming and examining directed causal dependencies. E[(X −E[X])(Y −E[Y ])] Cov(X,Y ) This article is organized as follows: we first prove properties ρ(X,Y ) = = , σ σ σ σ X Y X Y of interaction reconstruction via correlation, assuming monotone Boolean functions. This is followed by a series of experiments where σ denotes the SD of a random variable Z . successively advancing from entirely artificial through to real data. Theorem 1. Given a monotone function f :{0,1} →{0,1} and a random variable X ∈{0,1} that is distributed according to a product distribution D. Then the function f depends on the i-th variable if and only if the Pearson 2 METHODS—CORRELATION AND MONOTONE correlation of X and f (X ) is non-zero for non-zero variances. FUNCTIONS Proof. Let D denote the product distribution D on all variables except Boolean networks: a Boolean network consists of n nodes, numbered from i x . Then 1to n and n functions f ,...,f :{0,1} →{0,1}. The state of the network can i 1 n be described by a Boolean vector x ∈{0,1} , where x describes the state of E [(X −E[X ])(f (X ) −E[f (X )])] D i i the i-th node. The n functions describe the dynamics of the network: If the σ σ X f (X ) network is in state x at time t, it transforms into state (f (x),f (x),...,f (x)) 1 2 n n 1 at time t + 1. We can also combine f ,...,f into one function F :{0,1} → 1 n = E E [(X −E[X ])(f (X ) −E[f (X )])] D X i i i i σ σ {0,1} such that F (x) =f (x). State x then transforms into state F (x). When X f (X ) i i i (j) (1) (2) considering successive states x ,x ,..., the term x will denote the state (j) = E (1 − µ )(−µ ) f (X )| −f (X ) D i i x =0 i i of the i-th node in x . σ σ X f (X ) Relevance of a variable: a function f :{0,1} →{0,1} depends on the i-th +µ (1 − µ ) f (X )| −f (X ) i i x =1 variable if there exist x ,...,x ,x ,...,x such that 1 i−1 i+1 n µ (1 − µ ) i i = E f (X )| −f (X )| f (x ,...,x ,0,x ,...,x ) D x =1 x =0 i i i 1 i−1 i+1 n σ σ X f (X ) =f (x ,...,x ,1,x ,...,x ). 1 i−1 i+1 n X = E f (X )| −f (X )| . D x =1 x =0 i i i f (X ) If f depends on the i-th input variable, we also say that the i-th variable is relevant for f . The set of all relevant variables of f is denoted as rel(f ). The theorem then follows from the fact that f is monotone in x . Reconstructing the dynamics of a Boolean network: in order to reconstruct As derived in Theorem 1, the influence of a variable can be estimated via the functions of a Boolean network, we are given a sequence of m a modified version of the Pearson correlation (1) (2) (m) states x ,x ,...,x along with the corresponding successor states E[(X −E[X])(Y −E[Y ])] Cov(X,Y ) (1) (2) (m) (i) (i) ρ ˜ (X,Y ) = = . F (x ),F (x ),...,F (x ). A tuple (x ,F (x )) is also called an example. 2 2 σ σ X X From these examples, the task is to reconstruct the dependencies in the Boolean network, i.e. the set of variables each of the functions depends The Chernoff-Hoeffding bound (Hoeffding, 1963): given independent on. random variables X ,...,X with a ≤X ≤b for all i, the Chernoff-Hoeffding 1 m i bound can be stated as follows: Monotonicity of functions: a Boolean function f :{0,1} →{0,1} is m m −2m monotonically increasing in the i-th variable if for all x ,...,x ,x ,...,x 2 1 i−1 i+1 n (b−a) P X − E[X ] ≥ m ≤e . i i i=1 i=1 f (x ,...,x ,0,x ,...,x ) 1 i−1 i+1 n Estimating the correlation coefficient: we will use the Chernoff-Hoeffding ≤f (x ,...,x ,1,x ,...,x ), 1 i−1 i+1 n bound for the estimation of a variable’s influence: the function f is monotonically decreasing in the i-th variable if for all Theorem 2. When estimating the influence of a variable on a function, the variables x ,...,x ,x ,...,x 1 i−1 i+1 n estimation error shrinks exponentially in the number of samples. f (x ,...,x ,0,x ,...,x ) 1 i−1 i+1 n Proof. As shown in Theorem 1, the influence of a variable X on a function f (X ) ≥f (x ,...,x ,1,x ,...,x ). 1 i−1 i+1 n f can be estimated via a modified correlation ρ ˜ (X ,f (X )) = ρ(X ,f (X )). i i We will show that when estimating this term the error probability shrinks A function f is monotone if for every variable f is either monotonically exponentially in the number of samples. increasing or monotonically decreasing in that variable. For example, the Via the Chernoff-Hoeffding bound, we can prove that Boolean AND function is monotone, while the XOR function is not. P [|x − µ | ≥  ] ≤ exp(−2m ) x x Influence of a variable: a probability distribution D on {0,1} is called and product distribution if for any D-distributed random variable X the property n 2 P y − µ ≥  ≤ exp(−2m ) , P[X = (x ,...,x )]= P[X =x ] holds, i.e. the X are independent. y y 1 n i i i y i=1 [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1530 1529–1536 Boolean network structure Algorithm 1 Finding all relevant variables of f with influence which implies exceeding a threshold T P[|xy − µ µ |≥  µ +  µ +   ] x y x y y x x y (1) (1) (m) (m) Input: m examples (x ,f (x )),...,(x ,f (x )) of a function f , 2 2 ≤ exp(−2m ) + exp(−2m ) , x y drawn from a product distribution, threshold T Output: Approximation of rel(f ) and we can also show rel ←∅ m 1 m (j) y ← f (x ) 2 m j=1 P x y −E[XY ] ≥  ≤ exp(−2m ) . i i r for i ∈{1,...,n} do i=1 (j) 1 m x ← x j=1 m i We use the fact that the x and y are all Boolean to estimate the sample i i if x(1 −x) > 0 then variance s =x(1 −x). Let δ denote the estimation error for µ , i.e. x = x x {ensures non zero sample variance} µ + δ . That way, m (j) x x 1 (j) x f (x )−xy m j=1 if ≥T then x(1−x) x(1 −x) = (µ + δ )(1 − µ − δ ) x x x x rel ← rel ∪{i} = µ (1 − µ ) + (1 − 2µ − δ )δ . end if x x x x x end if Since |(1 − 2µ − δ )|≤ 1, we can conclude that x x end for return rel |x(1 −x) − µ (1 − µ )|≤|δ |. x x x This means that if the error for estimating µ is at most  then the error for x x 3 SIMULATION RESULTS estimating σ is at most  , too. We can now combine the estimation errors above: estimating the modified correlation coefficient ρ ˜ via Error rates in artificial networks with fixed in-degree: to investigate the performance of our method for concrete values and for controlled x y −xy i i i=1 r ˜(X,Y ) = noise levels, we analyzed the error probability of our inference x(1 −x) algorithm in an artificial network consisting of monotone Boolean functions. For this purpose, we created a set of 50 random Boolean and setting  := ( +  µ +  µ +   +  )/(σ −  ), we can use Lemma r x y y x x y x x networks each containing 80 nodes. For every node in these 1 below to conclude that networks, a monotone function with three input variables was P [|˜r(X,Y )−˜ ρ(X,Y )|≥ ] constructed as follows: first, we created a random truth table and 2 2 2 determined randomly for each input variable if the function should ≤ exp(−2m ) + exp(−2m ) + exp(−2m ) . x y r be monotonically increasing or decreasing in that variable. We then In particular, the error probability shrinks exponentially in the number of altered the initial truth table by correcting all inconsistencies. If samples m. this resulted in a function that did not depend on all of its input variables, we created a new random truth table and repeated the process until every function depended on all of its input variables. ˜ ˜ ˜ ˜ Lemma 1. Let A,A,B,B ∈[−1,1] with B > |A|, |A −A| < and |B −B| < . 1 2 Additionally, for each of the networks generated we analyzed Then ˜ attractors, i.e. single states or sequences of states toward which A A  + 1 2 − ≤ . the networks evolve. The network generation and attractor analysis B ˜ B − was performed with the R package BoolNet (Müssel et al., 2010), with the described modification to create monotone functions. We Proof. It holds that determined the number of attractors for each of these artificial A A A A −  A +  A 1 1 networks via a random sampling of 10 000 starting states and their − ≤ max − , − . B B B +  B −  B B 2 2 resulting attractors (function getAttractors of BoolNet). The number of attractors ranged from 1 to 59 (with a median of 5.5), with With attractor lengths ranging from 1 to 258 (with median 6). From A A −  A +B  + 1 2 1 1 2 − = ≤ each of the 50 random networks, we subsequently generated a batch B B +  B(B +  ) B + 2 2 2 of 1000 simulated time-series datasets, with each dataset covering and A +  A A +B  +  2 time points. In this process, for the generation of each dataset 1 2 1 1 2 − = ≤ B −  B B(B −  ) B −  a new starting configuration of the network was drawn from a 2 2 2 uniform distribution. The datasets generated in this way for each the lemma follows. of the 50 random networks represent synthetic ‘gene expression’ data of the 80 ‘genes’, measured at two successive time points in an Inferring the structure of a Boolean network: combining the results unperturbed system. These datasets were then used to reconstruct above, we can infer the dependency relations within a Boolean graph the dependencies in the underlying artificial networks. If a Boolean with a fast algorithm: Algorithm 1 finds the relevant variables of a function depends on three variables, changing one of these variables monotone function with a chosen minimum influence, given a sequence of from 0 to 1 must have an effect on f for at least one combination of independent examples, by calculating the correlation value of each variable the other two variables. Since the probability of such a combination and identifying all variables with this correlation exceeding a given value. of two variables is 0.25 under a uniform distribution, each of these Running that algorithm for the output function of each node of a Boolean network reveals the structure of the entire network. variables has an influence of at least 0.25. A correlation value was [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1531 1529–1536 M.Maucher et al. Table 1. Runtimes of the correlation method and the best fit extension algorithm, averaged over 20 runs True negative, best fit algorithm Network size Correlation (s) Best Fit Best fit True negative, correlation (max k = 3) (s) (max k = 4) True positive, correlation 50 nodes 0.98 0.79 10.2 s 100 nodes 3.86 11.2 324 s 200 nodes 15.4 176 2 h 50 min The Correlation approach is independent of setting a max k. True positive, best fit algorithm approach, moreover, the threshold for the correlation can be varied to determine whether the algorithm should preferably identify dependencies with a high precision, at the expense of a lower recall, or the other way round (see Supplementary Fig. S5). As additional measure for the reliability of network reconstruction, we further analyzed the F -score, which can be regarded as weighted average 200 400 600 800 1000 of precision and recall (see Supplementary Fig. S6). Number of datasets Runtime: in a second experiment, we measured the running time of the correlation method and the best fit algorithm. The average single- Fig. 1. True positive and true negative rates of an experiment with artificial core runtimes on a 3.0 GHz Xeon CPU are given in Table 1. The data and Gaussian noise with SD σ = 0.4. The correlation algorithm and the random networks for the reconstruction consisted of 50, 100 and 200 best fit extension algorithm were used to infer artificially created networks nodes, respectively. In each case, we measured the running times for consisting of 80 nodes with an in-degree of 3. The plot shows average true reconstruction of the random network from 50 synthetic datasets. positive and true negative rates for the reconstruction of 50 artificial networks from 50 to 1000 datasets each. The values in the table show the running times averaged over 20 runs, where we created a new random Boolean network for each of assumed to signify a dependency when it exceeded a threshold of these runs. As seen in Table 1, a key advantage of the correlation 0.125, i.e. half the minimum influence. method are the considerably shorter running times compared to the Figure 1 shows the average identification rates of our method best fit algorithm. Thus, our correlation approach is especially useful and Lähdesmäki and co-workers’ best fit extension algorithm for large networks in which the nodes have a relatively large in- (Lähdesmäki et al., 2003) for the reconstruction of 50 artificial degree. For example, the runtime of the best fit algorithm for the Boolean networks. The network size of 80 nodes was chosen to reconstruction of a network of 200 nodes with a maximum in-degree be able to compare the performance of our approach with the best of 4 (i.e. each Boolean function depends on no more than four input fit algorithm. Moreover, to obtain data that is closer to experimental variables) is almost 3 h, whereas the correlation approach needs only results, we added a Gaussian noise term with mean 0 and SD 0.4 about 15 s for this task. to the generated data, then rounded to the nearest value in {0,1}. Reconstruction of interactions from an E. coli regulatory network: This corresponds to a probability of about 0.1 for each bit to be we next tested the method on a real biological network. For that changed to the wrong value. The effect of different noise levels on purpose, we chose the integrated Escherichia coli gene regulatory network reconstruction is shown in Supplementary Figure S1. For and metabolic network published by Covert et al. (2004). Based the best fit algorithm, we used the implementation in the BoolNet R- on previously published information, extracted from literature and package. As seen in Figure 1, for a threshold of 0.125, the correlation databases, the authors of this study constructed a network model approach detected 80% and more of the dependencies in the artificial that describes the transcriptional regulation of genes involved network already when less than 200 datasets were used for network in E.coli metabolism by transcription factors and environmental reconstruction, while the best fit algorithm identified a comparable stimuli. The complete model includes a total of 1010 genes and portion of dependencies only when more than 200 datasets were 102 environmental stimuli. Of these, 906 genes code for proteins in used. However, for this threshold, the correlation approach also the E.coli metabolic network and 104 genes code for transcriptional retrieved a fraction of false-positive dependencies, that were not regulators. The expression of 479 of the genes is controlled in the actually present in the underlying networks. For example, when metabolic network, 427 genes were not included in our investigation. using 200 datasets, 25% of the non-present dependencies were So we used a total 685 (= 479 + 104 + 102) variables in the network, wrongly classified as present. Thus, the rate of correctly classified i.e. the regulated 583 genes and the 102 additional inputs. To non-present connections was lower than the true-positive rate, unless reconstruct the regulatory network, including the environmental more than 400 datasets were used for network inference (Fig. 1). stimuli, we used the Boolean functions provided by Covert et al. We further analyzed precision, i.e. the fraction of true dependencies (2004) to generate 20 synthetic time-series datasets covering 3 time among all predicted dependencies, and recall, i.e. the fraction of true points each. For the generation of each of the 20 datasets, a random dependencies that was correctly predicted, for our reconstruction starting configuration of the network was drawn from a uniform approach and the best fit algorithm. As seen in Supplementary distribution. To every value in all datasets, Gaussian noise with Figure S5, precision and recall increase for both algorithms with the mean 0 and SD 0.1 was added, to resemble the noise generally number of datasets used for network inference. In the correlation [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1532 1529–1536 Correctly classified present and non−present connections 0.2 0.4 0.6 0.8 1.0 Boolean network structure Functions depending on Functions depending on True negative, threshold 0.4 one variable two variables True positive, threshold 0.2 True negative, threshold 0.2 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Area under ROC Area under ROC True positive, threshold 0.4 Functions depending on Functions depending on three variables four or more variables 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 5 101520 Area under ROC Area under ROC Number of datasets Fig. 3. Histograms for the areas under the receiver operator curves (AUC). Fig. 2. True positive (solid lines) and true negative (dashed lines) rates of Correlation values were computed from 10 time-series datasets each covering the correlation approach for reconstruction of a published E.coli regulatory three discrete time steps. In total, 25 reconstruction runs were performed, network with 685 nodes. The network was reconstructed from 2 to 20 each time creating new time-series. That way, every dependency in the E.coli synthetic time-series datasets (with 3 time steps), that had been generated network contributes 25 AUC values to the corresponding histogram. The from the network by Covert et al. (2004). Correlations signified a regulatory AUC frequency distributions for functions that depend on 1, 2, 3, or more dependency if they exceeded a predefined threshold of 0.2 (circles) or 0.4 than 3 input variables are shown in four separate histograms. (squares). independent network reconstruction runs. For each reconstruction observed in biological data. Binarization was then performed by run, we generated a fixed number of 10 datasets of length 3 rounding to the nearest Boolean value 0 or 1. We then reconstructed from the Boolean network of Covert et al. (2004) by choosing the dependencies of the published E.coli network from the synthetic starting states randomly. In each run, we calculated the respective datasets using the described correlation method. To investigate correlation values for all potential dependencies. This generates the influence of the number of datasets on the reconstruction, 25 × 583(= 14575) AUC values (583 genes are dependent). One we varied the number of used datasets in the range of 2 to the AUC value is determined by calculating all correlations of one of available 20 (Fig. 2). By comparing the predicted dependencies with the 583 dependent genes to the 685 variables that are potentially the published network, finally, the fraction of correctly classified influencing them and utilizing all possible thresholds. From these present (true positive) and non-present (true negative) dependencies values, we generated histograms separated by the number of either was determined. Additionally, we analyzed precision and recall 1, 2, 3 or more than 3 input variables (Fig. 3). For Boolean functions and F -scores for different thresholds (see Supplementary Figs S7 with only one input variable (representing genes which are regulated and S8). As seen in Figure 2, for a threshold value of 0.2, the true by just one factor), the correlation method identified the relevant positive rate was close to 75% for network reconstruction from at variables with very high reliability. For functions with several least five datasets. The true negative rate, for this threshold, ranged input variables (representing genes which are regulated by several between 50% and 80% depending on the number of datasets used factors), the fraction of smaller AUC values increased, but a large for reconstruction. Thus, at this threshold a large fraction (∼ 75%) fraction is still close to 1. of the actual dependencies was found. To increase precision of the correlation approach, a higher threshold can be applied. So, for a Interactions of the yeast cell cycle transcriptional network: finally, threshold of 0.4, the true positive rate still was close to 60%, and in we used the correlation method to reconstruct a biological network this case the true negative rate was higher than 80% already when from microarray data. For that purpose, we chose the cell cycle five datasets were used for reconstruction (Fig. 2). Examples of transcriptional network of budding yeast, as suggested by Orlando reconstructed interactions for different dataset sizes and thresholds et al. (2008), and used published microarray results from three are given as Supplementary Material II. groups (Cho et al., 1998; Pramila et al., 2006; Spellman et al., 1998) To further evaluate how reliably the correlation of successive for network reconstruction. All three studies represent genome-wide states identified true relevant variables of the examined Boolean analyses of gene expression during the cell cycle in synchronized functions, we additionally computed the area under the receiver yeast cells, but they differ in the applied synchronization methods operator curve (AUC; Fawcett, 2006). The AUC is equal to the and the time intervals at which transcript levels were measured. probability that a randomly chosen present dependency has a higher In a first step, we binarized each of these microarray datasets absolute correlation value than a randomly chosen non-existing with the 2-means algorithm, the version of the k-means clustering dependency. For computation of the AUC values, we performed 25 algorithm that generates k = 2 clusters. Next, we created a set of [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1533 1529–1536 Correctly classified (relative to class) 0.4 0.6 0.8 1.0 Frequency Frequency 0 20 40 60 80 120 0 2000 4000 6000 Frequency Frequency 0 20406080 0 500 1000 1500 2000 M.Maucher et al. that this published network correctly and fully represents the real biological situation, our correlation inference approach correctly identified an average of 74.7% of the present dependencies and HCM1 STB1 SWI5 FKH2/1 59.1% of non-present dependencies for the described threshold. In NDD1 comparison, Lähdesmäki and co-workers’ best fit algorithm found ASH1 61.1% of the present dependencies and 38.8% of the non-present dependencies. This best result was achieved with a maximum in- MCM1 ACE2 MBP1 degree of 5 and interpreting a dependency as present when it SWI4 was found for at least one of the three time lag groups. The SWI6 identification rates of the best fit algorithm for different maximum in- CLN3 degrees are given in the Supplementary Material I. Additionally, we analyzed precision and recall as well as F -scores for correlation and WHI5 YOX1 YHP1 best fit algorithm, applying different thresholds for the correlation. In these analyses we found, that the initially chosen threshold of ‘mean ±1SD’ provided a reasonable balance between recall and precision (see Supplementary Figs S9 and S10). An example G1 S G2/M of the interactions found by the described analysis are given in Supplementary Material III. Here, an interaction was included if it occurred in the majority of the 25 simulations. Finally, we examined Fig. 4. Yeast cell cycle transcription factor network. The diagram shows differences in precision and recall, when each of the three time the yeast transcription factor network as suggested by Orlando et al. lag groups was used separately for network reconstruction (see (2008) with the transcription factors arranged on the cell cycle time line Supplementary Fig. S12). (G1 →S→G2/M) on the basis of their peak transcript levels. Interactions shown as dashed lines are based on a publication by Pramila et al. (2006). Transcriptional activators are depicted as circles, repressors as rectangles 4 DISCUSSION and the cyclin Cln3 as octagon; activating interactions end in arrowheads and inhibitory interactions in squares. The interactions shown as solid and Under the assumption that a Boolean network consists of monotone dashed lines were correctly identified by our approach, while interactions functions, we have shown that its dependencies can be reconstructed shown as dotted lines were not found. via Pearson correlation of subsequent states. Compared for instance to the best fit extension algorithm, correlations have the important advantage that they can be computed very quickly—for each pair example pairs, i.e. each example contained the state of the network of nodes, only two correlations for the two possible directed at a certain time point along with a succeeding state measured dependencies have to be computed, where the running time for after a specific time period. As the time lag between regulator the computation of a single correlation is linear in the number of expression and target expression varies considerably in the data, we samples. This leads to an overall running time of the order O(n m), grouped the sample pairs according to the intervals: the short time where n is the number of nodes and m the number of samples. In lag group contained all examples where the following states were contrast, an exhaustive search algorithm like the best fit algorithm measured after a period of 5–10 min, the intermediate time lag group considers all subsets of genes up to a given size, assuming the contained the examples with the following states measured after 14– functions of the network have an input degree of k, for each node of 21 min, and the long time lag group contained the examples with the network combinations of input nodes have to be considered, the following states measured after 25–30 min. Thus, for network k+1 reconstruction we regarded subsequent states with a time lag of which leads to a running time of the order O(n m). This is also up to about one fourth of the yeast’s cell cycle, which is assumed reflected in Table 1: for k = 3, doubling the number of nodes in to span a time of about 135 min (Orlando et al., 2008). For each the network increases the runtime of the best fit algorithm by a of the 16 genes with known dependencies in the published cell factor of 16, while it increases the runtime of the correlation method cycle transcriptional network, we computed the correlation to the only by a factor of 4. So while the exhaustive search approach is other genes in the cell cycle network and also added 16 randomly only feasible for networks with a small number of nodes or low chosen additional genes contained in all of the microarray datasets. input degrees, the correlation method can also be applied to large This was done to distort the reconstruction process and to evaluate networks. This is particularly interesting for biological applications, the detection of non-present dependencies. A computed correlation as in the context of microarray and deep-sequencing technologies was assumed to represent a dependency, if it exceeded the mean of this type of large-scale data becomes more available. all correlations either by at least 1 SD for at least one of the time Theoretically, dependencies in the examples violate the lag groups or by at least half the SD for at least two of the time assumption of Theorem 1. In an experimental evaluation lag groups. This evaluation was repeated for a total of 25 times, (Supplementary Figs S2–S4), we found that these dependencies each time choosing a new set of 16 additional genes at random. only marginally influence the reconstruction process as long as these For comparison, we also applied the best fit extension algorithm to dependencies are not too strong (visually discernible only for bias the same data. This algorithm was run on the data corresponding probabilities greater than 0.8). to each of the three time lag groups, and for maximum in-degrees While our new method is to some extent similar to using of 3, 4 and 5. the Fourier transform, it does not need the Fourier transform’s Figure 4 shows the correctly identified dependencies in the mathematical overhead (Bshouty and Tamon, 1996; Mossel et al., yeast transcription factor network (Orlando et al., 2008). Assuming 2003). We do not see any specific advantage of moving from the [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1534 1529–1536 Boolean network structure time domain into the spectral domain, as the values are (in the case that the published network whose interactions we are rebuilding of the Fourier transform) hard to interpret. We think that the notion exactly matches the real in vivo situation. Thus, for example, some of a correlation value is much more intuitive. identified dependencies, that were assumed to be false positives, Our experiments on the randomly created artificial networks have might actually be real dependencies. In line with this reasoning, shown that for noisy data and a low number of samples, Pearson the precision of network reconstruction was indeed lower for the correlation still finds a high percentage of the dependencies and, reconstruction from real data, for both the best-fit and the correlation for a range of thresholds, even surpasses the best fit algorithm with algorithm. Furthermore, it has to be kept in mind, that correlations regard to recall, precision and F -score (see Fig. 1, Supplementary representing indirect effects within the cell cycle also were counted Figs S5 and S6). A further advantage of our approach, besides the as false positives, while they do actually confer some biological short running times, is the possibility to vary the threshold according meaning. to the desired outcome, i.e. whether a high recall or rather a high In addition, Pearson correlation is not only suitable for measuring precision are requested. In case of little or no previous knowledge, the dependencies between sequences of binarized values. Especially for example, preferably a high threshold should be applied so that when real-valued examples are difficult to binarize, applying dependencies can be assumed with a high reliability, while a lower Pearson correlation directly on the raw, non-binarized data might threshold can be applied, if more previous knowledge exists that can already give some valuable information about a network. help to (pre-)select meaningful dependencies. For the simulated E.coli network, we also could reconstruct more than 50% of the present and non-present dependencies, already from 5 CONCLUSION a relatively low number of datasets (Fig. 2). The comparison of For a Boolean network consisting only of monotone Boolean the areas under the ROCs indicate that especially for functions functions, we showed that Pearson correlation is a fast method to that depend only on one or two variables, dependencies can be find dependencies in the network. This method makes it possible to found reliably. For functions depending on more variables, the analyze large networks that contain nodes with large input degree. identification of dependencies is more complicated, but the AUCs We could show for both simulated and real microarray data that show that the correlation method still is able to provide information our approach could reconstruct large parts of published regulatory about these dependencies (Fig. 3). One reason for the increased networks. difficulty when reconstructing functions with higher in-degree is the fact that for higher in-degree, variables can have a lower influence and dependencies are therefore harder to detect. In addition, the set ACKNOWLEDGEMENTS of Boolean functions can be partitioned into sets of varying difficulty The authors would especially like to thank Guido Adler for for a learning algorithm [cf. Gordon and Peretto (1990)]. This continuing support. The authors would also like to thank the partition, however, also depends to a large extent on the in-degree anonymous reviewers for their valuable comments. of the functions. Furthermore, we could show that Pearson correlation also Funding: Graduate School of Mathematical Analysis of Evolution, performed reasonably well when reconstructing dependencies from Information and Complexity (to H.A.K.). International Graduate real biological data, reaching identification rates (true positive, School in Molecular Medicine at Ulm University which is funded true negative) similar to those for the simulated E.coli datasets. by the Excellence Initiative of German governments (GSC 270 to Compared to the best fit algorithm, interactions were not only B.K.). Federal ministry of education and research (BMBF) within identified faster using correlation, but the identification was also the framework of the program of medical genome research (PaCa- more reliable. As seen also in Figure 4, almost 75% of the assumed Net; project ID PKB-01GS08, in part). The responsibility for the dependencies could be correctly identified at the chosen threshold, content lies exclusively with the authors. in spite of the regulatory complexity of the yeast cell cycle network, Conflict of Interest: none declared. even for components regulated by more than three factors (like Swi4, Swi6 or Yhp1). Further examining which of the expected dependencies were not identified by our method, we find that REFERENCES in these cases the regulatory mechanisms are based on a quite complex interplay of several factors involved. So, for example, Alon,U. (2006) An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman and Hall/CRC. Cln3 can activate transcription of Swi6 through inactivation of Babu,M. et al. (2004) Structure and evolution of transcriptional regulatory networks. the transcriptional repressor Whi5 as well as independent of Whi5 Curr. Opin. Struct. Biol., 14, 283–291. (Wittenberg and Reed, 2005), and the Whi5-independent mechanism Bornholdt,S. (2005) Systems biology: less is more in modeling large genetic networks. might hamper identification of the inhibitory influence of Whi5 Science, 21, 449–451. on Swi6. A second example is the transcriptional activator Mcm1, Bshouty,N. and Tamon,C. (1996) On the fourier spectrum of monotone functions. J. ACM (JACM), 43, 747–770. which can act in concert with an activating transcription factor Cho,R. et al. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. complex consisting of Fkh2/Ndd1 or the transcriptional repressors Mol. Cell, 2, 65–73. Yhp1 and Yox1 (Wittenberg and Reed, 2005). Accordingly, the Covert,M. et al. (2004) Integrating high-throughput and computational data elucidates influence of Mcm1, which was mostly not identified by our bacterial networks. Nature, 429, 92–96. Faith,J. et al. (2007) Large-scale mapping and validation of escherichia coli method, is strongly dependent on the presence of the respective co- transcriptional regulation from a compendium of expression profiles. PLoS Biol., factors, for which our approach correctly identified the respective 5, e8. dependencies. With regard to the reconstruction from real data, Fawcett,T. (2006) An introduction to ROC analysis. Pattern Recognit. Lett., 27, it has to be kept in mind that we cannot be completely sure 861–874. [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1535 1529–1536 M.Maucher et al. Gordon,M. and Peretto,P. (1990) The statistical distribution of Boolean gates in two- Orphanides,G. and Reinberg,D. (2002) A unified theory of gene expression. Cell, 108, inputs, one-output multilayered neural networks. J. Phys. A Math. Gen., 23, 3061. 439–451. Hoeffding,W. (1963) Probability inequalities for sums of bounded random variables. Pearl,J. (2000) Causality. Cambridge University Press, Cambridge. J. Am. Stat. Assoc., 58, 13–30. Pramila,T. et al. (2006) The forkhead transcription factor hcm1 regulates chromosome Kauffman,S. (1969) Metabolic stability and epigenesis in randomly constructed genetic segregation genes and fills the s-phase gap in the transcriptional circuitry of the cell nets. J. Theor. Biol., 22, 437–467. cycle. Genes Dev., 20, 2266–2278. Kauffman,S. (1993) The Origins of Order: Self-Organization and Selection in Evolution. Rani,T.S. et al. (2007) Analysis of E.coli promoter recognition problem in dinucleotide Oxford University Press, USA. feature space. Bioinformatics, 23, 582–588. Kauffman,S. et al. (2003) Random Boolean network models and the yeast transcriptional Shipley,B. (2000) Cause and Correlation in Biology: A User’s Guide to Path network. Proc. Natl Acad. Sci. USA, 100, 14796–14799. Analysis, Structural Equations and Causal Inference. Cambridge University Press, Lähdesmäki,H. et al. (2003) On learning gene regulatory networks under the boolean Cambridge. network model. Mach. Learn., 52, 147–167. Spellman,P. et al. (1998) Comprehensive identification of cell cycle-regulated genes of Liang,S. et al. (1998) Reveal, a general reverse engineering algorithm for inference of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, genetic network architectures. Pac. Symp. Biocomput., 3, 18–29. 9, 3273. Margolin,A. et al. (2006) Aracne: an algorithm for the reconstruction of gene regulatory Spirtes,P. et al. (2000) Causation, Prediction, and Search., 2nd edn. MIT Press, networks in a mammalian cellular context. BMC Bioinformatics, 7,S7. Cambridge, MA. Mossel,E. et al. (2003) Learning juntas. In STOC ’03: Proceedings of the thirty-fifth Tsukimoto,H. and Hatano,H. (2003) The functional localization of neural networks annual ACM symposium on Theory of Computing, ACM, New York, NY, USA, using genetic algorithms. Neural Networks, 16, 55–67. pp. 206–212. Venters,B.J. and Pugh,B.F. (2009) How eukaryotic genes are transcribed. Crit. Rev. Müssel,C. et al. (2010) BoolNet - an R package for generation, reconstruction, and Biochem. Mol. Biol., 44, 117–141. analysis of Boolean networks. Bioinformatics, 26, 1378–1380. Wittenberg,C. and Reed,S. (2005) Cell cycle-dependent transcription in Opgen-Rhein,R. and Strimmer,K. (2007) Learning causal networks from systems yeast: promoters, transcription factors, and transcriptomes. Oncogene, 24, biology time course data: an effective model selection procedure for the vector 2746–2755. autoregressive process. BMC Bioinformatics, 8 (Suppl. 2), S3. Zoppoli,P. et al. (2010) Timedelay-aracne: Reverse engineering of gene networks Orlando,D. et al. (2008) Global control of cell-cycle transcription by coupled CDK and from time-course data by an information theoretic approach. BMC Bioinformatics, network oscillators. Nature, 453, 944–947. 11, 154. [15:00 16/5/2011 Bioinformatics-btr166.tex] Page: 1536 1529–1536

Journal

BioinformaticsOxford University Press

Published: Apr 5, 2011

There are no references for this article.