Access the full text.
Sign up today, get DeepDyve free for 14 days.
References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.
Vol. 30 ECCB 2014, pages i430–i437 BIOINFORMATICS doi:10.1093/bioinformatics/btu447 1, 1 1 2 1 Waqar Ali , Tiago Rito ,Gesine Reinert , Fengzhu Sun and Charlotte M. Deane 1 2 Department of Statistics, University of Oxford, Oxford OX1 3TG, UK and Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089-2910, USA ABSTRACT There already are more elegant methods available for construct- ing species trees, e.g. based on genomic sequences. Trees built Motivation: Biological network comparison software largely relies on from PPI networks would span a limited species set and are by the concept of alignment where close matches between the nodes of themselves perhaps not of interest. In contrast, the methodology two or more networks are sought. These node matches are based on that is able to correctly build a phylogenetic tree from interaction sequence similarity and/or interaction patterns. However, because of data would be much of interest, as it should reveal information the incomplete and error-prone datasets currently available, such about the evolutionary mechanisms at play in biological methods have had limited success. Moreover, the results of network networks. Such a method would also be useful in other domains alignment are in general not amenable for distance-based evolutionary where alternative means of generating taxonomies are not analysis of sets of networks. In this article, we describe Netdis, a available. topology-based distance measure between networks, which offers There is currently a lack of consensus regarding likelihood- the possibility of network phylogeny reconstruction. based statistical models for PPI network evolution (Ratmann Results: We first demonstrate that Netdis is able to correctly separate different random graph model types independent of network size and et al., 2009). Moreover, often phylogenetic tree reconstruction density. The biological applicability of the method is then shown by its methods based on distances perform better and are more ability to build the correct phylogenetic tree of species based solely on robust to mis-specification than maximum-likelihood methods the topology of current protein interaction networks. Our results pro- (Gonnet, 2012; Huelsenbeck and Hillis, 1993). Hence, we con- vide new evidence that the topology of protein interaction networks centrate here on the development of a one-dimensional network contains information about evolutionary processes, despite the lack of comparison statistic, which can be used to compile a distance conservation of individual interactions. As Netdis is applicable to all matrix to build trees. networks because of its speed and simplicity, we apply it to a large The most tractable methods for network comparison are those collection of biological and non-biological networks where it clusters which compare at the level of the entire network using statistics diverse networks by type. that describe global properties (e.g. Ratmann et al.,2009), but Availability and implementation: The source code of the program is these statistics are not sensitive enough to be able to reconstruct freely available at http://www.stats.ox.ac.uk/research/proteins/ phylogeny or shed light on evolutionary processes. In contrast, resources. there are several network alignment-based methods that compare Contact: w.ali@stats.ox.ac.uk networks using the properties of the individual proteins (nodes) Supplementary information: Supplementary data are available at e.g. local network similarity and/or protein functional or Bioinformatics online. sequence similarity (Flannick et al., 2009; Phan and Sternberg, 2012; Singh et al., 2008). The aim of these methods is to identify matching proteins/nodes between networks and use these match- 1INTRODUCTION ing nodes to identify exact or close subnetwork matches. A few of these methods have been expanded to the multiple network The ability to recreate evolutionary relationships between biolo- problems (Flannick et al., 2009; Liao et al., 2009). These methods gical objects has driven much of our increased biological under- are usually computationally intensive and tend to yield an align- standing. In particular, the reconstruction of phylogenetic trees ment that contains only a relatively small proportion of the using sequence data is now commonplace and has provided network, although this has been alleviated to some extent in evidence for evolutionary mechanisms such as mutation, inser- more recent methods (Alkan and Erten, 2014; Hu et al., 2014; tion and deletion. Patro and Kingsford, 2012). Similar to sequence data, over the past decade, the amount of The analysis is further confounded by a large number of false available interaction data between proteins has been steadily positives and false negatives thought to be present in current PPI increasing. These data are routinely represented as protein– data. Ali and Deane (2010) studied the effect of errors and in- protein interaction (PPI) networks, with proteins as nodes and completeness in alignments of simulated networks and estimated interactions as edges. Despite the abundance of PPI data now that only nearly complete networks (490%) can produce available, there is no tree reconstruction method based purely on reliable alignments. Finally, PPI network alignment methods these networks. Yet, it is conjectured (Sharan and Ideker, 2006) are all based on the loose premise that the respective orthologs that such trees based on networks may give rise to a step change of two interacting proteins also interact, forming pairs of in biology, much as tree-building methods from sequences did. so-called interologs, and/or that orthologs will share neighbour- hood topology. While there is some evidence for the existence of such conserved interactions across species (Matthews et al., *To whom correspondence should be addressed. 2001), particularly in proteins with high sequence similarity, The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com Alignment-free protein interaction network comparison Lewis et al. (2012) found, taking the noise and incompleteness of 2APPROACH the data into account, that the fraction of correctly transferred 2.1 Neighbourhoods: ego-networks interactions is at most 3%, between reasonably diverese species, Our method for network comparison is based on the core even if orthologs are defined as those proteins matched with a concept that similar networks will, on average, contain similar blast E 10 (Altschul et al., 1997). Moreover, current val local neighbourhoods. We count the occurrence of subgraph evidence points to a far larger rate of change in PPIs than ex- shapes in the local neighbourhoods of all nodes in a network. pected; specific interaction matches seem not to be the rule, but This method was chosen rather than just counting the number of rather the exception (Lewis et al., 2012; Shou et al., 2011). subgraphs in the entire network, as the latter will be strongly Thus, we do not follow the network alignment paradigm, influenced by factors such as network size and density and, there- but instead we take our lead from alignment-free sequence comparison methods that have been used to identify evolution- fore, may be too coarse for many network comparisons. Our notion of the neighbourhood of a protein is that of a ary relationships (Liu et al., 2011a). Alignment-free methods based on k-tuple counts (also called k-grams or k-words) have two-step ego-network, also called ego-network of radius two been applied to construct trees from sequence data (Song et al., (see, for example, Pattison, 1993). The two-step ego-network of 2013). A key feature is the standardization of the counts to a protein/node p is the (sub) network consisting of all nodes separate the signal from the background noise. Inspired by within two edges of p, also including all the edges between alignment-free sequence comparison, we use subgraph counts those nodes. In general, r-step ego networks could be used instead of sequence homology or functional one-to-one matches based on the application area. The radius should be chosen so to compare networks. Yet, even when comparing synthetic that the set of ego-networks displays reasonable variability. Here networks of the same size, generated from the same model, we choose radius two, as the average shortest path length in their small subgraph content can be volatile (Rito et al., 2010). protein interaction networks tend to be of the order four, Our proposed method, Netdis, compares the subgraph content and hence ego-networks with larger radii often contain a large not of the networks themselves but instead of the ensemble of proportion of the overall network. For very dense networks, ego- all protein neighbourhoods (ego-networks) in each network, networks of radius one may be more appropriate. through an averaging many-to-many approach. The comparison between these ensembles is summarized in a Netdis value, which 2.2 An overview of the Netdis measure in turn is used as input for phylogenetic tree reconstruction. Our algorithm starts by extracting for each query network the set The biological intuition for our new approach is based on a of two-step ego-networks of all nodes. For each two-step collection of results. The idea of using the subgraph content to ego-network, we count the number of occurrences of all 3 to build a distance between networks arises because motifs and 5-node induced subgraphs, or graphlets (Przul j, 2007). The modules have long been identified as important components of counting algorithm uses a combinatorial subgraph enumeration biological networks (Wagner et al., 2007; Zhu et al., 2007) and approach (Hoevar and Demar, 2014). Counting induced sub- have been conjectured to play an important role in evolution graphs on k vertices as opposed to any subgraphs on k vertices (Liu et al., 2011b); see also Cootes et al. (2007) and Przul j means that all the edges between the k vertices in the subgraph (2007) for previous explorations of biological network compari- must be present in the larger graph, and absent edges in the son based on subgraph counts. A key idea of our article is not to subgraph must also be absent in the larger graph. compare just the subgraph content of two networks, but instead Every node in the network is hence associated with a k-nodes the ensembles of subgraphs of the two networks, as hinted at in subgraph count vector (k = 3, 4 or 5) for its corresponding two- Rice et al. (2005). This use of subgraph content ensembles means step ego-network (see Fig. 1). One could also use k=6 that Netdis requires only the interaction data and is very differ- and above, but the counts of larger subgraphs can be extremely ent in concept from PPI network alignment methods. low in many networks and also computationally expensive to Netdis differs from standard subgraph count approaches in enumerate. Once the ego-network of a node has been associated two key aspects: it introduces the ensemble view and applies a with a count vector for all subgraphs contained in it, these counts standardization that controls for background noise. are centred according to the size and density of the ego-network. We first use our method on simulated networks using several While ideally the counts would be centred using a suitable random graph models and show that it correctly classifies null model, currently no good probabilistic model for PPI net- networks by model type even when confounded by varying network density and size. We then use it to successfully recon- works is available that replicates the k-subgraph content of a struct the correct phylogenetic tree for the set of organisms for network (Rito et al., 2012). Instead we use the counts from a which significant PPI data are currently available. Our ability to gold-standard network as a proxy for the expected counts, where reconstruct correct phylogenies adds evidence that PPI the size and the graph density of the size-two ego-network are data retain evolutionary information, despite the lack of conser- taken into account. vation of individual interactions. For each k-node subgraph, we sum the centred counts of all We have also investigated our method’s ability to separate two-step ego-networks in the query network. The final sum by type a large set of networks from several biological and vectors have length 2 for k = 3, length 6 for k=4 and length non-biological domains. The resulting tree is found to be 21 for k = 5. These sum vectors are then used as input to a self- highly clustered by domain type, outperforming a recent and standardizing statistic, which we call netd ðkÞ. In a final step, we far more computationally intensive community-detection calculate a symmetric matrix containing the netd ðkÞ values for approach. pair of networks in the set. The resulting matrix can be used to i431 W.Ali et al. scaled N ðQÞ for all q ego-networks in the graph density bin w;i is given by 1 N ðQÞ w;i E ðQ;Þ= : jfi 2f1; :::; qg : d gj n i=1:::q : For a query network G and a given subgraph w on k nodes, we estimate the expected count of w in the ego-network of node i 2 G,with d ðQÞ and n edges, as i i E ðG;Þ= E ðQ;Þ: ð1Þ This estimated expected count serves as our background Fig. 1. Overview of the Netdis method on a pair of networks (G, H). Each network is associated with vectors of subgraph counts, calculated expectation. from all its two-step ego-networks (this figure shows counts for only Now, let A(k) be the setofall w subgraphs with k number subgraphs of three nodes). These count vectors, normalized by a back- of nodes; here k = 3, 4 or 5. For a particular induced subgraph w ground expectation, are then used to calculate the distance measure with k nodes, let N ðGÞ denote its number of occurrences in w;i between the pair of input networks. See Supplementary Section S3 for the ego-network of i 2 G. We proceed as follows: a detailed calculation (1) For each node i in G, cluster the candidate networks and the clustering represented by (a) Build an ego-network of radius 2 around and includ- dendrograms. ing i. (b) Calculate for the ego-network of node i, the number of 2.3 Expectation of subgraph counts in neighbourhoods nodes n , graph density d and the subgraph count i i In alignment-free sequence comparison, centering the counts by vector, N ðGÞ,for each k,with k=3;4and 5. w;i subtracting their means is crucial to avoid measuring back- (c) If d =ðiÞ, calculate E ðG;ðiÞÞ. ground noise instead of signal (Reinert et al., 2009). For Netdis, we are faced with the problem that there is no good (2) Calculate, probabilistic model for subgraph counts in PPI networks avail- able (Rito et al., 2012). Exploratory data analysis shows that the S ðGÞ= ðN ðGÞ E ðG;ðiÞÞÞ: ð2Þ w w;i subgraph counts depend on the graph density, which is the fraction of observed edges over all potential edges. To compare two networks, G and H,we define three netD ðkÞ To gauge the expected number of subgraph occurrences in an statistics by ego-network of a given graph density, we first create a histogram 0 1 of the graph densities of all ego-networks in the gold-standard 1 S ðGÞS ðHÞ network. While the ideal comparison would be between ego- B w w C netD ðkÞ=pﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ ; k=3; 4; 5; @ A networks of the exact same density, there is not enough variabil- MðkÞ 2 2 w2AðkÞ S ðGÞ +S ðHÞ w w ity in the data for every possible density; hence, we use an adaptive binning approach. The ego-networks of the gold stand- where ard are binned according to density, where the number and 0 1 0 1 size of bins is automatically adapted to ensure that each bin 2 2 X X S ðGÞ S ðHÞ B w C B w C contains at least five samples. We then estimate the expectation MðkÞ= qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ @ A @ A 2 2 2 2 w2AðkÞ w2AðkÞ per ego-network of each subgraph in a given graph density bin as S ðGÞ +S ðHÞ S ðGÞ +S ðHÞ w w w w follows. Let Q represent a gold-standard network with q nodes and S is a normalizing constant so that netD ðkÞ2½1; 1 by hence q ego-networks. The ego-network of node i contains, the Cauchy–Schwarz inequality. The corresponding Netdis i statistic is defined as, say, n nodes, e edges and has graph density d =e . i i i i S S netd ðkÞ= ð1 netD ðkÞÞ 2 ½0; 1: ð3Þ 2 2 For an ego-network of Q with graph density d,we write d i i 2 if d is in the graph density bin =ðQÞ.Let w denote The pairwise Netdis values from Equation 3 are then used to a particular induced subgraph with k nodes and N ðQÞ w;i build a distance matrix for all query networks. Note that three its number of occurrences in the ego-network of node i,with n different distance matrices are defined, based on k=3, 4 or 5. We then use Unweighted Pair Group Method with Arithmetic nodes, in Q. We scale these counts by the possible choices Mean (UPGMA; Sokal and Michener, 1958) for building trees of k nodes in the ego-network of node i. The average of the from the Netdis distance matrices. UPGMA is a heuristic greedy i432 Alignment-free protein interaction network comparison Table 1. Network summaries for PPI data method that creates one cluster per network and sequentially merges the nearest pair of clusters by directly using the distance matrix until only two clusters remain. We ignore any branch Species Genes Nodes Edges Coverage 1000 lengths, as they would be difficult to interpret without a statis- tical model and before understanding the effect of errors. We Human 21 224 9223 36 631 43.9 0.8 used the phangorn package (Schliep, 2011) in R (R Core Team, Fly 13 917 7565 22 800 54.3 0.8 2013) to generate trees. Yeast 6692 5078 22 103 86.2 1.7 E.coli 4303 2968 11 604 68.9 2.6 H.pylori 1553 714 1,361 45.9 5.3 3DATA —network density. 3.1 Synthetic networks from random graph models We initially tested Netdis on simulated networks, namely, (dated: February 28, 2012). Table 1 summarizes the five PPI Erdos € –Reny i (ER) random graphs (Erdos € and Re nyi, 1961), datasets used in the study. Erdos € –Reny i graphs with fixed degree distribution (ERDD) (Newman, 2010), geometric random graphs (Penrose, 2003), geo- 3.3 Networks from multiple disciplines metric with gene duplication (Przul j et al., 2010), Chung–Lu In a recent paper (Onnela et al., 2012), the authors constructed a model (Chung and Lu, 2002) and the duplication–divergence taxonomy of a large collection of networks obtained from a growth model (Middendorf et al., 2005). variety of sources. Their method is based on first probing the The particular ER model, which we use here, has n labelled community structure within each network and then using sum- nodes connected by m edges, which are randomly chosen from maries of the community structures to identify similar networks. nðn1Þ the possible edges. The fixed distribution variant (ERDD) The original dataset used by the authors contains 746 networks. is constructed to have not just the same number of nodes and We used all unweighted and undirected networks from this set, edges as a reference network, but also the same degree distribu- resulting in a total of 151 networks. These come from across the tion. Geometric 3-dimensional random graphs (GEO3D) with biological and social domains as well as model simulations (see parameter r are constructed by assigning each node random co- Supplementary Section S3 for full details). ordinates in a 3-dimensional box of unit volume. Two nodes are connected by an edge if the Euclidean distance between them is at most r. A variant of this, incorporating biological intuition is 4RESULTS the geometric with gene duplication model (GeoGD). The In all of the results that follow, unless stated otherwise, Netdis Chung–Lu (or Sticky) model constructs networks by assigning was calculated using k = 4 and the DIP core yeast interaction degðiÞ qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ an index to every node i for all n nodes where = ; i i n dataset (Deane et al., 2002) as the gold standard. To test the degðjÞ j=1 robustness of the method, we also used a simulated gold-stand- here deg(i) is the degree of node i. Each possible edge i, j is then ard network using the ER model with 5000 nodes and 20 000 formed with probability . Finally, the duplication divergence i j edges (see Supplementary Fig. S8 for results). While using Netdis model (DD) grows a network at each iteration t by selecting a with k = 5 is expected to increase the sensitivity of the method, node random and duplicating it with all its edges. We use a preliminary analysis of the datasets used for this study indicated variant of this model that specifies a symmetric node duplication little benefit of the latter, at the cost of substantially more com- model so that edges can be lost from both parent and child nodes puting time spent in induced subgraph counting. on divergence. 4.1 Netdis can separate different random graph model 3.2 Protein interaction data types To test Netdis on experimental data, we downloaded species- We simulated five networks for each of the six models described specific PPI data from the Database of Interacting Proteins in Section 3.1, giving a total of 30 networks. The parameters for (DIP; Salwinski et al., 2004) and Human Protein Reference all of these simulated networks were chosen to have the number Database (HPRD; Keshava Prasad et al., 2009). Only species of nodes and edges match those of the DIP yeast network, al- having at least 500 physical interactions and 415% coverage though some models create self-loops and disconnected nodes, were considered. Coverage is here a rough estimate of how which lead to slight discrepancies. A pairwise distance matrix many proteins have been probed for interactions given the was then constructed between these networks and the resulting expected proteome of the organism. We define it as a percentage tree is shown in Figure 2. We observe a perfect clustering of the by taking the number of nodes in the network divided by the networks according to model type. estimated number of genes in the genome of the organism at In the above analysis, all networks had the same or similar hand. In total, we analyse five species: Saccharomyces cerevisiae number of nodes and graph density. We also investigated the (yeast), Drosophila melanogaster (fly), Homo sapiens (human), effect of varying network size and density on Netdis’s ability to Escherichia coli (E.coli)and Helicobacter pylori (H.pylori). resolve the different models. For three models (ER, DD and Human data came from HPRD (dated: September 2012) Geo3D), we generated samples with either 1000 or 5000 nodes. while the other four datasets were downloaded from DIP We also varied the densities, using values of 0.003, 0.005 and i433 W.Ali et al. DD 1 DD 5000 0.007 DD 3 DD 5000 0.003 DD 5 DD 2 DD 1000 0.005 DD 4 ERDD 2 ER 5000 0.007 ERDD 3 ERDD 1 ER 1000 0.005 ERDD 5 ER 5000 0.003 ERDD 4 ER 5 GEO3D 1000 0.005 ER 4 ER 3 GEO3D 5000 0.007 ER 1 ER 2 GEO3D 5000 0.003 Sticky 4 Sticky 3 Fig. 3. Phylogenetic tree generated by Netdis for simulated networks of Sticky 2 varying size and density. The model name is followed by number of nodes Sticky 1 and network density Sticky 5 GEO3D 5 GEO3D 4 GEO3D 2 serving as proxies for pairwise network distance were then used GEO3D 1 GEO3D 3 to generate phylogenetic trees for the simulated networks GeoGD 2 discussed above. The results (see Supplementary Figs S9 and GeoGD 1 S10) show that while these methods generate the correct cluster- GeoGD 3 GeoGD 4 ing in the simplest case, they fail when the networks are variable GeoGD 5 in size and density, indicating the need for richer measures like Netdis in realistic settings. Fig. 2. Phylogenetic tree of simulated networks generated by Netdis. The method perfectly clusters together samples from the same model While Netdis is completely different in approach to traditional network alignment, it is tempting to compare the results to such methods. Network alignment-based methods typically return re- sults in the form of node or edge mappings and not a distance 0.007. As shown in Figure 3, even with such varying sizes and matrix for a given set of networks. However, one such method, densities, the correct clustering was achieved. MI-GRAAL (Kuchaiev and Prz ulj, 2011) has been used in the The simulated networks discussed so far are error-free. literature to generate a phylogenetic tree of small viral PPI net- Introduction of error should make it harder to distinguish works. MI-GRAAL is capable of solely topological network between networks from the different models. We introduced alignment and its edge correctness score wasusedasaproxy false negatives and false positives in the simulated networks for network similarity measure to generate the distance matrix (see Supplementary Section S2) and observed that for the for the phylogenetic tree. We therefore tested MI-GRAAL on all simple case when all networks have the same size and density, of the above simulated datasets. For all simulated networks with Netdis can recover the correct grouping even with an error rate 5000 or more nodes, we were unable to successfully finish align- of 50% (Supplementary Fig. S11a). However, error has a larger ment using MI-GRAAL. We therefore used only the networks of adverse impact when the networks are already harder to cluster 1000 nodes and densities 0.003, 0.005 and 0.007 and the resulting because of varying size and density (Supplementary Fig. S11b). tree is presented in Supplementary Figure S3. While the tree The sensitivity of the method to errors is likely to be dependent generated by MI-GRAAL is generally correct, two networks on the dataset as well as error characteristics. (ER_1000_0.003 and DD_1000_0.007) are wrongly clustered. Our results on random graph models suggest that the different The tree generated by Netdis for the same dataset, giving perfect model types differ substantially from each other, and thus are clustering, is also given for comparison (Supplementary Fig. S4). perhaps not a rigorous test of the sensitivity of the method. This intuition was borne out when we reanalysed the dataset using 4.2 Phylogenies from protein interaction data modified versions of Netdis with the expectation in Equation 2 set to 0 (Netdis_ ne), or replacing the summation over all ego- The currently accepted phylogeny between the species of Table 1 networks in Equation 2 with just the subgraph count of the is depicted by the tree in Supplementary Figure S5. The tree is whole network (Netdis_ gc) and no correction for background based on the NCBI taxonomy database (Sayers et al., 2009), expectation. Even these non-centred measures successfully gen- which incorporates a variety of phylogenetic resources including erate the correct tree in all of the above cases (see Supplementary molecular and morphological characters. The tree generated by Figs S1 and S2). Netdis is depicted in Figure 4. Instead of ego-network subgraph counts, networks could also Out of the many possible rooted trees with five leaves, the be grouped based on simpler properties such as the distribution correct clustering is obtained with fly next to human and yeast, of the node degrees or local clustering coefficients. To compare and the two bacterial networks in a separate clade. This is despite Netdis with such baseline measures, we defined Ddis and Cdis as the significant differences in number of nodes and edges, cover- the values of the Kolmogorov–Smirnov test statistic between the age and graph density. Moreover, Supplementary Figure S11c empirical distribution of nodes degrees and local clustering coef- shows that even when introducing high levels of error in the PPI ficients, respectively, for a given pair of networks. These values, networks, the correct tree topology is reproduced by Netdis. The i434 Alignment-free protein interaction network comparison hpylori from the null distribution as follows: each null sample was gen- erated by creating a random tree topology with the same number ecoli of leaves as the dataset, and we then calculated the cluster fly similarity with the manual grouping. The P-value for an obser- vation is then the fraction of null samples equal to or greater human than the observed similarity. The best performing method is yeast Netdis, with similarity index 0.011 (P-value: 0). Even without using background expectation (Netdis_ ne), a better grouping Fig. 4. Phylogenetic tree of PPI networks generated by Netdis is achieved (RI: 0.01, P-value: 2 10 ) than Onnela et al.’s method, which has a similarity index of 0.006 (P-value: 0.001). The clustered taxonomies generated by Onnela et al.’s method and Netdis are given in Supplementary Figures S13 and S14. Of importance of the background expectation in Netdis becomes particular interest is the fact that Netdis separates most of the apparent when the tree is generated using Netdis_ ne instead metabolic and protein interaction networks into two distinct (Supplementary Fig. S6a). In this case, the method does not re- groups. The complete analysis for the 151 networks using create the correct phylogeny, and places fly next to H.pylori.The Netdis consumed around 10 h of computing time on a standard same is true when Netdis_ gc is used (Supplementary Fig. S6b). desktop computer, while Onnela et al.’s method consumed 18 h The simple network statistics based methods, Ddis and Cdis, also on the same computer. The analysis could not be replicated using fail to generate the correct tree (Supplementary Figs S9c and MI-GRAAL, as the program failed to provide alignment results S10c). When executing MI-GRAAL using only network top- for a large fraction of the dataset. We note that alignment-based ology on these data, the program failed during the alignment methods are perhaps inherently ill-suited to the task of classify- of the yeast and human network. We therefore only used the ing large sets, which may include dense networks, as generating other four species (fly, yeast, H.pylori and E.coli)to generate all possible pairwise network alignments is a computationally the tree. In this case, MI-GRAAL recreates a generally correct prohibitive task. tree with yeast and fly in a single clade (Supplementary Fig. S7a), but the bacterial species are split into two clades. The tree gen- erated by Netdis for these four species is given in Supplementary 5DISCUSSION Figure S7b. For PPI networks, as a baseline, one could also In this article, we add evidence to the idea that the topology of create phylogenies based on the number of orthologous inter- protein interaction networks alone contains evolutionary infor- actions shared by a pair of networks. The results for such an mation without any additional biological data. Our results reveal approach are shown in Supplementary Figure S6c (see caption that current PPI data are sufficiently abundant to derive correct for method details). The tree is generally correct, although the phylogenetic relationships, at least between the model species. approach is not applicable to other types of networks. From PPI data with genome coverage of at least 15%, our method, Netdis, is able to deduce correct phylogenies between 4.3 Classification of diverse empirical networks species without resorting to sequence homology. As a systematic representation of data amenable to rigorous We emphasize that Netdis is not proposed as a competitive analysis and visualization, networks have become ubiquitous in method for the generation of phylogenetic trees for biological recent years across many scientific and social disciplines. While species; existing techniques based on molecular sequences these networks vary enormously in their detailed properties, already address this particular problem comprehensively. Our methods that can group similar networks together could prove main result is that from the topology of protein interaction to be highly useful. We therefore compared the ability of Netdis networks alone, it is possible to generate a correct phylogenetic with group networks by domain, against the method used by tree. It is therefore worth considering the underlying assumptions Onnela et al. on the data described in Section 3.3. As there is of Netdis explore what they may reveal about protein interaction no agreed true dendogram available for these data, we first network evolution. The core principle of this work is that species manually created a taxonomy by simply grouping the data that are more related will on average share more PPI network based on type and assuming no further branching within neighbourhoods that are topologically similar than unrelated groups. In total, we identify 13 groups, such as protein inter- species do. It is tempting to assign a biological interpretation action, congressional voting, metabolic networks, ER random to the Netdis algorithm: a given biological function is normally graphs, etc. Supplementary Figure S12 presents the complete credited to a community of interacting proteins that act together manual taxonomy. We then created taxonomic trees for these to perform that function in the cell. Closely related species will networks using Netdis and Onnela et al.’s method. The resulting have, on average, more of these communities in common. Our trees were split using the cutree function in R to give 13 groups method detects this phenomenon by comparing ensembles of (cutree can interpret a given tree as a cluster hierarchy that can ego-networks derived from the PPI networks. then be merged/split using the underlying distance matrix to give Netdis could also be used to consider the similarities between the desired number of clusters). Finally, we compared the 13 biological networks during a cellular or adaptive response, where groups from each of the methods with the manually created it is thought that understanding the differences between these groups using the adjusted Rand index (RI) for cluster similarity networks is key (Ideker and Krogan, 2012). Obtaining phyloge- (Hubert and Arabie, 1985). We also generated Monte-Carlo nies of cell-states based on network data would provide a biolo- P-values for the similarity values by generating 50 000 samples gical perspective on these phylogenies, which would be both new i435 W.Ali et al. Cootes,A.P. et al. (2007) The identification of similarities between biological net- and completely free from sequence data. This perspective works: application to the metabolome and interactome. J. Mol. Biol., 369, would further complement phylogenies derived from 1126–1139. phenotypic traits with another type of molecular data. The Deane,C.M. et al. (2002) Protein interactions: two methods for assessment of the method could be refined by adding information on the proteins, reliability of high throughput observations. Mol. Cell. Proteomics, 1, 349–356. such as protein function, structure, subcellular location or age Erdos, € P. and Reny i,A. (1961) On the evolution of random graphs. Bull. Inst. Internat. Statist., 38, 343–347. (Liu et al., 2011b; Rito et al., 2012); we would expect that includ- Flannick,J. et al. (2009) Automatic parameter learning for multiple network ing such information would increase the sensitivity of the alignment. J. Comput. Biol., 16, 1001–1022. method. Gonnet,G. (2012) Surprising results on phylogenetic tree building methods based on One particular focus of our ongoing and future research molecular sequences. BMC Bioinformatics, 13,148. Hoevar,T. and Demar,J. (2014) A combinatorial approach to graphlet counting. efforts will be the derivation of theoretically well-founded Bioinformatics, 30, 559–565. background expectations of subgraph counts used in Netdis. Hu,J. et al. (2014) Netcoffee: a fast and accurate global alignment approach to At present, the method relies on specifying a gold-standard identify functionally conserved proteins in multiple networks. Bioinformatics, dataset to estimate these expectations. While our results so 30, 540–548. far indicate that the method is robust to the choice of gold stand- Hubert,L. and Arabie,P. (1985) Comparing partitions. J. Classif., 2, 193–218. Huelsenbeck,J.P. and Hillis,D.M. (1993) Success of phylogenetic methods in the ard, it seems feasible that derivation of exact expectation for- four-taxon case. Syst. Biol., 42, 247–264. mulas may increase method sensitivity and/or remove potential Ideker,T. and Krogan,N.J. (2012) Differential network biology. Mol. Systems Biol., bias. 8,565. We conclude with noting that, as no other data besides net- Keshava Prasad,T.S. et al. (2009) Human Protein Reference Database–2009 update. work data are used as input to the method, many types of net- Nucleic Acids Res., 37, D767–D772. Kuchaiev,O. and Przulj,N. (2011) Integrative network alignment reveals large work data can be analysed together. This makes the method regions of global network similarity in yeast and human. Bioinformatics, 27, inherently different from the network alignment paradigm. As 1390–1396. a proof of concept, we presented some results indicating its abil- Lewis,A.C.F. et al. (2012) What evidence is there for the homology of protein- ity to correctly classify simulated networks from random graph protein interactions? PLoS Comput. Biol., 8, e1002645. models as well creating a reasonable taxonomy for a diverse Liao,C.-S. et al. (2009) IsoRankN: spectral methods for global alignment of mul- tiple protein networks. Bioinformatics, 25, i253–i258. mixture of empirical networks. The ability to highlight relation- Liu,X. et al. (2011a) New powerful statistics for alignment-free sequence compari- ships between networks from different sources in a systematic son under a pattern transfer model. J. Theor. Biol., 284, 106–116. way may help researchers across different fields to identify em- Liu,Z. et al. (2011b) Evidence for the additions of clustered interacting nodes during pirical analyses or theoretical models which are applicable to the evolution of protein interaction networks from network motifs. BMC Evol. their specific problem. Biol., 11,133. Matthews,L.R. et al. (2001) Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or “interologs”. Genome Res., 11, 2120–2126. ACKNOWLEDGEMENT Middendorf,M. et al. (2005) Inferring network mechanisms: the drosophila mela- The authors would like to thank Mason A. Porter for kindly nogaster protein interaction network. Proc. Natl Acad. Sci. USA, 102, 3192–3197. providing the data from Onnela et al. (2012). The authors Newman,M. (2010) Networks: An Introduction. 1st edn. Oxford University Press, would also like to thank Robert Gaunt, Shuang Tian, Rebecca USA. Walton and the anonymous reviewers, also of previous drafts of Onnela,J.-P. et al. (2012) Taxonomies of networks from community structure. Phys. this article, for their valuable comments. Rev. E, 86, 036104. Patro,R. and Kingsford,C. (2012) Global network alignment using multiscale spec- Funding: This work was supported in part by Fundacao ¸ para a tral signatures. Bioinformatics, 28, 3105–3114. ^ Pattison,P. (1993) Algebraic Models for Social Networks. Structural Analysis in the Ciencia e a Tecnologia (FCT) through a PhD grant, and by the Social Sciences. Cambridge University Press, USA. Biotechnology and Biological Sciences Research Council Penrose,M. (2003) Random Geometric Graphs (Oxford Studies in Probability). (BBSRC) and the Engineering and Physical Sciences Research Oxford University Press, USA. Council (EPSRC) through the Systems Biology Doctoral Phan,H.T.T. and Sternberg,M.J.E. (2012) Pinalog: a novel approach to align Training Center (DTC) and the Oxford Center for Integrated protein interaction networks—implications for complex detection and function prediction. Bioinformatics, 28, 1239–1245. Systems Biology (OCISB). GR, CD and WA acknowledge Przul j,N. (2007) Biological network comparison using graphlet degree distribution. support from EPSRC grant EP/K032402/1. Bioinformatics, 23, e177–e183. Przul j,N. et al. (2010) Geometric Evolutionary Dynamics of Protein Interaction Conflicts of Interest: none declared. Networks. Chapter 20, pp 178–189. R Core Team. (2013) R: A Language and Environment for Statistical Computing.R Foundation for Statistical Computing, Vienna, Austria. REFERENCES Ratmann,O. et al. (2009) From evidence to inference: probing the evolution of protein interaction networks. HFSP J., 3, 290–306. Ali,W. and Deane,C.M. (2010) Evolutionary analysis reveals low coverage as the Reinert,G. et al. (2009) Alignment-free sequence comparison (I): statistics and major challenge for protein interaction network alignment. Mol. Biosyst., 6, power. J. Comput. Biol., 16, 1615–1634. 2296–2304. Rice,J.J. et al. (2005) Lasting impressions: Motifs in protein-protein maps may Alkan,F. and Erten,C. (2014) Beams: backbone extraction and merge strategy for provide footprints of evolutionary events. Proc. Natl Acad. Sci. USA, 102, the global many-to-many alignment of multiple PPI networks. Bioinformatics, 3173–3174. 30, 531–539. Rito,T. et al. (2010) How threshold behaviour affects the use of subgraphs for Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new gen- eration of protein database search programs. Nucleic Acids Res., 25, 3389–3402. network comparison. Bioinformatics, 26, i611–i617. Rito,T. et al. (2012) The importance of age and high degree, in protein-protein Chung,F. and Lu,L. (2002) The average distances in random graphs with given expected degrees. Proc. Natl Acad. Sci. USA, 99, 15879–15882. interaction networks. J. Comput. Biol., 19, 785–795. i436 Alignment-free protein interaction network comparison Salwinski,L. et al. (2004) The database of interacting proteins: 2004 update. Nucleic Singh,R. et al. (2008) Global alignment of multiple protein interaction networks Acids Res., 32 (Suppl. 1), D449–D451. with application to functional orthology detection. Proc. Natl Acad. Sci. USA, Sayers,E.W. et al. (2009) Database resources of the National Center 105, 12763–12768. for Biotechnology Information. Nucleic Acids Res., 37 (Suppl. 1), D5–D15. Sokal,R.R. and Michener,C.D. (1958) A statistical method for evaluating systematic Schliep,K. (2011) phangorn: phylogenetic analysis in r. Bioinformatics, 27, relationships. Univ. Kans. Sci. Bull., 28, 1409–1438. 592–593. Song,K. et al. (2013) Alignment-free sequence comparison based on next-generation Sharan,R. and Ideker,T. (2006) Modeling cellular machinery through biological sequencing reads. J. Comput. Biol., 20, 64–79. network comparison. Nat. Biotechnol., 24, 427–433. Wagner,G.P. et al. (2007) The road to modularity. Nat. Rev. Genet., 8, 921–931. Shou,C. et al. (2011) Measuring the evolutionary rewiring of biological networks. Zhu,X. et al. (2007) Getting connected: analysis and principles of biological net- PLoS Comput. Biol., 7, e1001050. works. Genes Dev., 21, 1010–1024. i437
Bioinformatics – Pubmed Central
Published: Aug 22, 2014
You can share this free article with as many people as you like with the url below! We hope you enjoy this feature!
Read and print from thousands of top scholarly journals.
Already have an account? Log in
Bookmark this article. You can see your Bookmarks on your DeepDyve Library.
To save an article, log in first, or sign up for a DeepDyve account if you don’t already have one.
Copy and paste the desired citation format or use the link below to download a file formatted for EndNote
Access the full text.
Sign up today, get DeepDyve free for 14 days.
All DeepDyve websites use cookies to improve your online experience. They were placed on your computer when you launched this website. You can change your cookie settings through your browser.