Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

An automated method for finding molecular complexes in large protein interaction networks

An automated method for finding molecular complexes in large protein interaction networks Background: Recent advances in proteomics technologies such as two-hybrid, phage display and mass spectrometry have enabled us to create a detailed map of biomolecular interaction networks. Initial mapping efforts have already produced a wealth of data. As the size of the interaction set increases, databases and computational methods will be required to store, visualize and analyze the information in order to effectively aid in knowledge discovery. Results: This paper describes a novel graph theoretic clustering algorithm, "Molecular Complex Detection" (MCODE), that detects densely connected regions in large protein-protein interaction networks that may represent molecular complexes. The method is based on vertex weighting by local neighborhood density and outward traversal from a locally dense seed protein to isolate the dense regions according to given parameters. The algorithm has the advantage over other graph clustering methods of having a directed mode that allows fine-tuning of clusters of interest without considering the rest of the network and allows examination of cluster interconnectivity, which is relevant for protein networks. Protein interaction and complex information from the yeast Saccharomyces cerevisiae was used for evaluation. Conclusion: Dense regions of protein interaction networks can be found, based solely on connectivity data, many of which correspond to known protein complexes. The algorithm is not affected by a known high rate of false positives in data from high-throughput interaction techniques. The program is available from ftp://ftp.mshri.on.ca/pub/BIND/Tools/MCODE. Background method that identified a functional protein complex Recent papers published in Science and Nature among oth- around the yeast protein Las17 that is involved in actin cy- ers describe large-scale proteomics experiments that have toskeleton rearrangement [10]. Here we extend the meth- generated large data sets of protein-protein interactions od to better apply it to the accumulating information in and molecular complexes [1–7]. Protein structure [8] and protein networks. gene expression data [9] is also accumulating at a rapid rate. Bioinformatics systems for storage, management, vis- Currently, most proteomics data is available for the model ualization and analysis of this new wealth of data must organism Saccharomyces cerevisiae, by virtue of the availa- keep pace. We previously published a simple graph theory bility of a defined and relatively stable proteome, full Page 1 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 genome clone libraries [11], established molecular biolo- undirected graph is used. Using this graph representation gy experimental techniques and an assortment of well de- of a biological system allows graph theoretic methods to signed genomics databases [12–14]. Using the be applied to aid in analysis and solve biological prob- Biomolecular Interaction Network Database (BIND – ht- lems. This graph theory approach has been used by other tp://www.bind.ca) [15] as an integration platform, we biomolecular interaction database projects such as DIP have collected 15,143 yeast protein-protein interactions [16], CSNDB [17], TRANSPATH [18], EcoCyc [19] and among 4,825 proteins (about 75% of the yeast pro- WIT [20] and is discussed by Wagner and Fell [21]. teome). Much larger data sets than this will eventually be available for other well studied model organisms as well Algorithms for finding clusters, or locally dense regions, as for the human proteome. These complex data sets of a graph are an ongoing research topic in computer sci- present a formidable challenge for computational biology ence and are often based on network flow/minimum cut to develop automated data mining analyses for knowl- theory [22,23] or more recently, spectral clustering [24]. edge discovery. To find locally dense regions of a graph, MCODE instead uses a vertex-weighting scheme based on the clustering co- , which measures 'cliquishness' of the neigh- Here we present the first report that uses a clustering algo- efficient, C rithm to identify molecular complexes in a large protein borhood of a vertex [25]. C = 2n/k (k -1) where k is the i i i i interaction network derived from heterogeneous experi- vertex size of the neighborhood of vertex i and n is the mental sources. Based on our previous observation that number of edges in the neighborhood (the immediate highly interconnected, or dense, regions of the network neighborhood density of v not including v). A clique is de- may represent complexes [10], the "Molecular Complex fined as a maximally connected graph. There is no stand- Detection" (MCODE) algorithm has been implemented ard graph theory definition of density, but definitions are and evaluated on our yeast protein interaction compila- normally based on the connectivity level of a graph. Den- tion using known molecular complex data from a recent sity of a graph, G = (V,E), with number of vertices, |V|, and systematic mass spectrometry study of the proteome [7] number of edges, |E|, is defined here as |E|; divided by the and from the MIPS database [13]. theoretical maximum number of edges possible for the graph, |E| . For a graph with loops (an edge connecting max back to its originating vertex), |E| = |V| (|V|+1)/2 and Predicting molecular complexes from protein interaction max data is important because it provides another level of for a graph with no loops, |E| = |V| (|V|-1)/2. So, den- max functional annotation above other guilt-by-association sity of G, D = |E|/|E| and is thus a real number rang- G max methods. Since sub-units of a molecular complex general- ing from 0.0 to 1.0. ly function towards the same biological goal, prediction of an unknown protein as part of a complex also allows The first stage of MCODE, vertex weighting, weights all increased confidence in the annotation of that protein. vertices based on their local network density using the highest k-core of the vertex neighborhood. A k-core is a MCODE also makes the visualization of large networks graph of minimal degree k (graph G, for all v in G, deg(v) manageable by extracting the dense regions around a pro- >= k). The highest k-core of a graph is the central most tein of interest. This is important, as it is now obvious that densely connected subgraph. We define here the term the current visualization tools present on many interac- core-clustering coefficient of a vertex, v, to be the density tion databases [15], originally based on the Sun Microsys- of the highest k-core of the immediate neighborhood of v tems embedded spring graph layout Java applet do not (vertices connected directly to v) including v (note that C scale well to large networks (http://java.sun.com/applets/ does not include v). The core-clustering coefficient is used jdk/1.1/demo/GraphLayout/example1.html). here instead of the clustering coefficient because it ampli- fies the weighting of heavily interconnected graph regions Algorithm while removing the many less connected vertices that are The MCODE algorithm operates in three stages, vertex usually part of a biomolecular interaction network, weighting, complex prediction and optionally post- known to be scale-free [6,21,26–29]. A scale-free network processing to filter or add proteins in the resulting com- has a vertex connectivity distribution that follows a power plexes by certain connectivity criteria. law, with relatively few highly connected vertices (high degree) and many vertices having a low degree. A given A network of interacting molecules can be intuitively highly connected vertex, v, in a dense region of a graph modeled as a graph, where vertices are molecules and edg- may be connected to many vertices of degree one (singly es are molecular interactions. If temporal pathway or cell linked vertex). These low degree vertices do not intercon- signalling information is known, it is possible to create a nect within the neighborhood of v and thus would reduce directed graph with arcs representing direction of chemi- the clustering coefficient, but not the core-clustering coef- cal action or direction of information flow, otherwise an ficient. The final weight given to a vertex is the product of Page 2 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 the vertex core-clustering coefficient and the highest k- ified seed is a part of. Typically, when analyzing complex- core level, k , of the immediate neighborhood of the es in a given network, one would find all complexes max vertex. This weighting scheme further boosts the weight of present (undirected mode) and then switch to the directed densely connected vertices. This specific weighting func- mode for the complexes of interest. The directed mode al- tion is based on local network density. Many other func- lows one to experiment with MCODE parameters to fine tions are possible and some may have better performance tune the size of the resulting complex according to exist- for this algorithm but these are not evaluated here. ing biological knowledge of the system. In directed mode, MCODE will first pre-process the input network to ignore The second stage, molecular complex prediction, takes as all vertices with higher vertex weight than the seed vertex. input the vertex weighted graph, seeds a complex with the If this were not done, MCODE would preferentially highest weighted vertex and recursively moves outward branch out to denser regions of the graph, if they exist, from the seed vertex, including vertices in the complex which could belong to separate, but denser complexes. whose weight is above a given threshold, which is a given Thus, a seed vertex for directed mode should always be the percentage away from the weight of the seed vertex. This is highest density vertex among the suspected complex. the vertex weight percentage (VWP) parameter. If a vertex There is an option to turn this pre-processing step off, is included, its neighbours are recursively checked in the which will allow seeded complexes to branch out into same manner to see if they are part of the complex. A ver- denser regions of the graph, if desired. tex is not checked more than once, since complexes can- not overlap in this stage of the algorithm (see below for a The time complexity of the entire algorithm is polynomial possible overlap condition). This process stops once no O(nmh ) where n is the number of vertices, m is the more vertices can be added to the complex based on the number of edges and h is the vertex size of the average ver- given threshold and is repeated for the next highest un- tex neighbourhood in the input graph, G. This comes seen weighted vertex in the network. In this way, the dens- from the vertex-weighting step. Finding a k-core in a graph est regions of the network are identified. The vertex weight proceeds by progressively removing vertices of degree < k threshold parameter defines the density of the resulting until all remaining vertices are connected to each other by complex. A threshold that is closer to the weight of the degree k or more, and is thus O(n ). The highest k-core is found by trying to find k-cores from one up until all verti- seed vertex identifies a smaller, denser network region around the seed vertex. ces have been found and cannot go beyond a number of steps equal to the highest degree in the graph. Thus, the The third stage is post-processing. Complexes are filtered highest k-core step is O(n ). Since this k-core step operates if they do not contain at least a 2-core (graph of minimum only on the neighbourhood of a vertex, the n in this case degree 2). The algorithm may be run with the 'fluff' op- is the number of vertices in the average neighbourhood of tion, which increases the size of the complex according to a vertex, h. The inner loop of the algorithm only operates a given 'fluff' parameter between 0.0 and 1.0. For every twice for every edge in the input graph, thus is O(2mh ). vertex in the complex, v, its neighbors are added to the The outer loop operates once on all vertices in the input complex if they have not yet been seen and if the neigh- graph, thus the entire time complexity of the weighting 3 3 borhood density (including v) is higher than the given stage is O(n2mh ) = O(nmh ). The complex prediction fluff parameter. Vertices that are added by the fluff param- stage is O(n) and the optional post-processing step can be ), where c is the number of complexes that eter are not marked as seen, so there can be overlap among up to O(cs predicted complexes with the fluff parameter set. If the al- were found in the previous step and s is the number of ver- gorithm is run using the 'haircut' option, the resulting tices in the largest complex - O(cs ) to find the 2-core once complexes are 2-cored, thereby removing the vertices that for each complex. are singly connected to the core complex. If both options are specified, fluff is run first, then haircut. Even though the fastest min-cut graph clustering algo- rithms are faster, at O(n logn) [30], MCODE has a Resulting complexes from the algorithm are scored and number of advantages. Since weighting is done once and ranked. The complex score is defined as the product of the comprises most of the time complexity, many algorithm complex subgraph, C = (V,E), density and the number of parameters can be tried, in O(n), once weighting is com- vertices in the complex subgraph (D × |V|). This ranks plete. This is useful when evaluating many different pa- larger more dense complexes higher in the results. Other rameters. MCODE is relatively easy to implement and scoring schemes are possible, but are not evaluated here. since it is local density based, has the advantage of a di- rected mode and a complex connectivity mode. These two MCODE may also be run in a directed mode where a seed modes are generally not useful in typical clustering appli- vertex is specified as a parameter. In this mode, MCODE cations, but are useful for examining molecular interac- only runs once to predict the single complex that the spec- tion networks. Additionally, only those proteins above a Page 3 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 given local density threshold are assigned to complexes. if not already seen v then call: MCODE-FIND-COM- This is in contrast to many clustering applications that PLEX(G, W, d, v) force all data points to be part of clusters, whether they truly should be part of a cluster or not. end for Pseudocode end procedure Stage 1: Vertex Weighting procedure MCODE-VERTEX-WEIGHTING Stage 3: Post-Processing (optional) procedure MCODE-FLUFF-COMPLEX input: graph: G = (V,E) input: graph: G = (V,E); vertex weights: W; for all v in G do fluff density threshold: d; complex graph: C = (U,F) N = find neighbors of v to depth 1 for all u in C do K = Get highest k-core graph from N if weight of u >d then add u to complex C k = Get highest k-core number from N end for d = Get density of K end procedure Set weight of v = k × d procedure MCODE-POST-PROCESS end for input: graph: G = (V,E); vertex weights: W; haircut flag: end procedure h; fluff flag: f; Stage 2: Molecular Complex Prediction fluff density threshold: t; set of predicted complex procedure MCODE-FIND-COMPLEX graphs: C input: graph: G = (V,E); vertex weights: W; for all c in C do vertex weight percentage: d; seed vertex: s if c not 2-core then filter if s already seen then return if h is TRUE then 2-core complex for all v neighbors of s do if f is TRUE then call: MCODE-FLUFF-COMPLEX(G, W, t, c) if weight of v > (weight of s)(1 - d) then add v to com- plex C end for call: MCODE-FIND-COMPLEX (G, W, d, v) end procedure end for Overall Process procedure MCODE end procedure input: graph: G = (V,E); vertex weight percentage: d; procedure MCODE-FIND-COMPLEXES haircut flag: h; fluff flag: f; fluff density threshold: t; input: graph: G = (V,E); vertex weights: W; set of predicted complex graphs: C vertex weight percentage: d call: W = MCODE-VERTEX-WEIGHTING (G) for all v in G do call: C = MCODE-FIND-COMPLEXES (G, W, d) Page 4 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 call: MCODE-POST-PROCESS (G, W, h, f, t, C) Evaluation of MCODE using the Gavin data set of protein interactions and complexes In this study, we wanted to use all forms of protein inter- end procedure action data available, which requires mixing of different Implementation types of experiments, such as yeast two-hybrid and co-im- MCODE has been implemented in ANSI C using the munoprecipitation. Two-hybrid results are inherently cross-platform NCBI Toolkit; http://www.nc- pairwise, whereas copurification results are sets of one or bi.nlm.nih.gov/IEB and the BIND graph library in the more identified proteins. For a copurification result, only SLRI Toolkit; http://sourceforge.net/projects/slritools. a set of size 2 can be directly considered a pairwise inter- Both of these source code libraries are freely available. The action, otherwise it must be modeled as a set of hypothet- actual MCODE source code is not yet freely available. The ical interactions. Biochemical copurifications can be MCODE program has been compiled and tested on UNIX, thought of as populations of complexes with some under- Mac OS X and Windows. Because a yeast gene name dic- lying pairwise protein interaction topology that is un- tionary is used to recognize input and generate output, the known from the experiment. In the general case of the MCODE executable currently only works for yeast pro- purification used by Gavin et al., one affinity tagged pro- teins in a user friendly manner. The algorithm, however is tein was used as bait to pull associated proteins out of a completely general, via the graph theory abstraction, to yeast cell lysate. The two extreme cases for the topology any graph and thus to any biomolecular interaction net- underlying the population of complexes from a single pu- work. MCODE binaries are available from ftp:// rification experiment are a minimally connected 'spoke' ftp.mshri.on.ca/pub/BIND/Tools/MCODE. model, where the data are modeled as direct bait-associat- ed protein pairwise interactions, and a maximally con- Results nected 'matrix' model, where the data are modeled as all Evaluation of MCODE proteins connected to all others in the set. The real topol- The evaluation of MCODE requires a set of experimentally ogy of the set of proteins must lie somewhere between determined biomolecular interactions and a set of associ- these two extremes. ated experimentally determined molecular complexes. Currently, the largest source for such data is for proteins Population of complexes: C = {b, c, d, e} (b = bait) from the budding yeast, Saccharomyces cerevisiae. Recently, a large-scale mass spectrometry study by Gavin et al [7] Spoke model hypothetical interactions: i = {b-c, b-d, b-e} provided a large data set of protein interactions with man- ually annotated molecular complexes. Also available are Matrix model hypothetical interactions; i = {b-b, b-c, b-d, the protein interaction and complex tables of MIPS [13] b-e, c-c, c-d, c-e, d-d, d-e, e-e} and YPD [14]. MCODE was used to automatically predict protein complexes in our collected protein-protein inter- Advantages of the spoke model are that it is biologically action data sets. Resulting complexes were then matched intuitive, biologists often represent their copurification re- to known molecular complexes from Gavin et al. (the sults in this manner, and is about 3 times more accurate Gavin benchmark) and the MIPS benchmark using an than the matrix model [31]. Disadvantages are that it overlap score. Parameter optimization was then used to could misrepresent interactions. The matrix model, alter- maximize the biological relevance of predicted complexes natively, cannot misrepresent interactions, as all possible according to the given benchmarks. YPD was not used as interactions are generated, but this is at the cost of gener- a current version could not be acquired. ating a large number of false interactions. Matrix topolo- gies are also physically implausible for larger complexes To ensure that MCODE is not unduly affected by the ex- because of increased possibility of steric clash if all subu- pected high false-positive rate in large-scale interaction nits are interacting with all others. Ultimately, the spoke data sets, large-scale and literature derived MCODE pre- model should be reasonable for use in evaluating dictions were compared. MCODE was then used to pre- MCODE. dict complexes in the entire set of machine readable protein-protein interactions that we could collect for Gavin et al. raw data from 588 biochemical purifications yeast. Complexes of interest were then further examined were represented using the spoke model, described above, using the directed mode and complex connectivity mode to get 3,225 hypothetical protein-protein interactions of MCODE. among 1,363 proteins for input to MCODE. A list of 232 manually annotated protein complexes based on the orig- inal purification data reported by Gavin et al. was filtered to remove five reported 'complexes' each composed of a single protein and six complexes of two or three proteins Page 5 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 that were already in the data set as part of a larger com- matched complexes declines with increasing ω threshold, plex. This yielded a filtered set of 221 complexes that were as shown in Figure 1. Interestingly, the average and maxi- used to evaluate MCODE, although some of these com- mum number of matched known complexes drops more plexes have significant overlap to other complexes in the quickly from zero until a ω threshold of 0.2 than from 0.2 set. to 0.9 indicating that many predicted complexes only have one or a few proteins that overlap with known com- To evaluate which parameter choice would allow auto- plexes. A ω threshold of 0.2 to 0.3 thus seems to filter out matic prediction of protein complexes from the spoke most predicted complexes that have insignificant overlap modeled Gavin et al. interaction set that best matched the with known complexes. manually annotated complexes, MCODE was run using all four possible combinations of the two Boolean param- Figure 2 shows the range of number of complexes predict- eters (haircut: true/false, fluff: true/false) over a full range ed and number of known complexes matched for the 0.2 of 20 vertex weight percentage (VWP) and fluff parameters ω threshold over all tried MCODE parameters. A y = x line (0 to 0.95 in 0.05 increments). During this parameter op- is also plotted to show that data points tend to be skewed timization process, MCODE was limited to find complex- towards a higher number of matched known complexes es of size two or higher. than predicted complexes because of the redundancy in the Gavin complex benchmark. Data points closest to the A scoring scheme was developed to determine how effec- upper right portion of the graph maximize both number tively an MCODE predicted complex matched a complex of matched known complexes and number of predicted from the benchmark set of complexes. In this case, the complexes. MCODE parameter combinations that result benchmark complex set was the Gavin et al. hand-anno- in these data points therefore optimize MCODE on this tated complex set. The overlap score was defined as ω = i / data set (according to the overlap score threshold). This a*b, where i is the size of the intersection set of a predicted result shows that the number of predicted complexes complex with a known complex, a is the size of the pre- should be similar to the number of matched known com- dicted complex and b is the size of the known complex. A plexes for a parameter choice to be reasonable, although protein is part of the intersection set only if it is present in the number of matched known complexes may be larger, both predicted and known complexes. Thus, a predicted again, because of some commonality among complexes complex that has no proteins in a known complex has ω in the benchmark set. The parameter combination corre- = 0 and a predicted complex that perfectly matches a sponding to the best data point (63,88) at an overlap known complex has ω = 1. Also, predicted complexes that score threshold of 0.2 is haircut = FALSE, fluff = TRUE, fully overlap, but are much larger or much smaller than VWP = 0.05 and a fluff density threshold between 0 and any known complexes will get a low ω. The overlap score 0.1. These parameter optimization results for MCODE of a predicted complex vs. a benchmark complex is then a over this data set were stable over a range of ω thresholds measure of biological significance of the prediction, as- up to 0.5. Above 0.5, the result was not stable as there suming that the benchmark set of complexes is biological- were generally too few predicted complexes with high ly relevant. The best parameter choice for MCODE on this overlap scores (Figure 1). protein interaction data set is one that predicts the largest set of complexes that match the largest number of bench- A specificity versus sensitivity analysis [32] was also per- mark complexes above a threshold ω. Since there is over- formed. Defining the number of true positives (TP) as the lap in the Gavin benchmark complex database, a number of MCODE predicted complexes with ω over a predicted complex may match more than one known threshold value and the number of false positives (FP) as complex with a high ω. the total number of predicted MCODE complexes minus TP. The number of false negatives (FN) equals the number To choose an overlap score that maximizes biological rel- of known benchmark complexes not matched by predict- evance of the predicted complexes without filtering away ed complexes. Sensitivity was defined as [TP/(TP+FN)] too many predictions, each of the 840 parameter combi- and specificity was defined as [TP/(TP+FP)]. The MCODE nations tested during the parameter optimization stage. parameter choice that optimizes both specificity and sen- The number of MCODE predicted complexes was plotted sitivity is the same as from the above analysis. The optimal against the number of matched known complexes over a sensitivity of this analysis was ~0.31 and the correspond- range of ω thresholds from 'no threshold' to 0.1 to 0.9 (in ing specificity was ~0.79. 0.1 increments). If no ω threshold is used, a predicted complex only needs at least one protein in common with The 63 MCODE predicted complexes only matched 88 of a known complex to be considered a match. If predicted the 221 complexes in the known data set indicating that and known complexes are only counted as a match when MCODE could not recapitulate the majority of the Gavin their ω is above a specific threshold, the number of complex benchmark solely using protein connectivity Page 6 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Overlap Score Threshold Avg. Pred Avg. Matched Known Max. Pred Max. Known Figure 1 Effect of Overlap Score Threshold on Number of Predicted and Matched Known Complexes for the Gavin Evaluation Figure legend: Average and maximum number of predicted and matched known complexes seen during MCODE parameter optimization (840 parameter combinations) plotted as a function of overlap score threshold. As the stringency for the closeness that a predicted complex must match a known complex is increased (increase in overlap score), fewer predicted complexes match known complexes. Note that these curves do not correspond to the best parameter set, but rather are an average of results from all tried parameter combinations. information. As mentioned above, there are more model. For example, Cdc3 was used as a bait to co-immu- matched complexes than predicted because of some re- noprecipitate Cdc10, Cdc11, Cdc12 and Ydl225w. A com- dundancy in the benchmark. This low sensitivity is not plex was annotated as containing these five proteins, but surprising, since many of the hand-annotated complexes only Cdc3 was used as bait. If more elements of a complex were created directly from single co-immunoprecipitation are used as baits, the proteins become more interconnect- results, which are not highly interconnected in the spoke ed and more readily predicted by MCODE. A good exam- Page 7 of 27 (page number not for citation purposes) Number of Matched Complexes BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 (63,88) 0 10 2030 4050 6070 Number of MCODE Predicted Complexes Above Overlap Score 0.2 Figure 2 Number of Predicted and Matched Known Complexes at Overlap Score Threshold of 0.2 Figure legend: Number of known complexes matched to MCODE predicted complexes plotted against number of MCODE predicted complexes, both with an overlap score above 0.2. ple of this is the Arp2/3 complex, which is highly baits by Gavin et al. and the resulting benchmark complex conserved in eukaryotes and is involved in actin cytoskel- included the five extra proteins that MCODE also predict- eton rearrangement. The structure of this complex is ed (Nog2, Pfk1, Prt1, Cct8 and Cct5) that are not in the known by X-ray crystallography [33] thus actual protein- crystal structure. Cct5 and Cct8 are known to be involved protein interactions from the structure can be matched up in actin assembly, but Nog2, Pfk1 and Prt1 are not. These to the co-immunoprecipitation results. MCODE predicted extra proteins likely represent non-specific binding in the all seven components of the Arp2/3 complex crystal struc- experimental approach. These two cases are shown dia- ture and five extra proteins using the optimized parame- grammatically in Figure 3. Interestingly, using the haircut ters. Six out of the seven Arp2/3 subunits were used as parameter would remove all five extra proteins that are Page 8 of 27 (page number not for citation purposes) Number of Matched Known Complexes Above Overlap Score 0.2 BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 A B C Pfk1 Prt1 Nog2 Cdc12 Cdc11 Arc18 Arc18 Arp3 Arp2 Cdc3 Cct8 Arp3 Arp2 Arc35 Cdc10 Cct5 Ydl225w Arc40 Arc35 Arc40 Arc19 Arc15 Arc19 Arc15 Figure 3 Examples of Gavin Benchmark Complexes Missed and Hit by MCODE Figure legend: Protein complexes are repre- sented as graphs using the spoke model. Vertices represent proteins and edges represent experimentally determined interac- tions. Blue vertices are baits in the Gavin et al. study. A) A Cdc3 complex hand-annotated by Gavin et al. that was missed by MCODE because of a lack of connectivity information among sub-components. This complex annotation was the result of a single co-immunoprecipitation experiment. B) The Arp2/3 complex as annotated by Gavin et al. and as found by MCODE with parameters optimized to the data set. Note the five extra proteins that have minimal connectivity to main cluster. C) The pro- tein connection map seen from the crystal structure of the Arp2/3 complex. The crystal structure is from Bos taurus (cow), but is assumed to be very similar to yeast based on very high similarity between cow and yeast Arp2/3 subunits. not in the crystal structure, leaving only the seven that are ed at high ω thresholds, but generally reduced the number present. This shows that while the parameter optimiza- of matched known complexes at low ω thresholds (0 to tion allows maximum matching of the hand-annotated 0.1) compared to haircut = FALSE. Since the haircut = known complexes, these complexes may not all be physi- TRUE option removes less-connected proteins on the ologically relevant and thus another parameter set may fringe of a predicted complex and this reduces the number better predict 'real' complexes. of predicted complexes with low overlap scores, these fringe proteins likely contribute to low-level overlap (<0.2 To explore the effect of certain MCODE parameters on re- ω) of the known complexes. sulting predicted complexes, various features of these complexes were examined while changing specific param- We also investigated the effect of changing the fluff densi- eters and keeping all else constant. Linearly increasing the ty threshold when setting fluff = TRUE on the number of VWP parameter increased the size of the predicted com- matched benchmark complexes. Linearly increasing the plexes exponentially while reducing the number of com- fluff density threshold in the MCODE post-processing plexes predicted in a linear fashion. Figure 4 shows this step linearly decreased the number of matched complexes effect with both fluff and haircut parameters turned off. At above an overlap score of 0.2. high VWP values, very large complexes were predicted and Evaluation of MCODE using MIPS data set of protein in- these encompassed most of the data set, thus were not very useful. teractions and complexes Since the Gavin et al. data set was developed by only one Because using haircut = TRUE would have led MCODE to group using a single experimental method, it may not ac- predict the Arp2/3 complex perfectly (according to the curately represent protein complex knowledge for yeast. crystal structure as discussed above), we examined if the The MIPS protein complex catalogue http://mips.gsf.de/ haircut parameter has any general effect on the number of proj/yeast/catalogues/complexes/ is a curated set of 260 matched predicted complexes. Setting haircut = TRUE had protein complexes for yeast that was compiled from the no significant effect on the number of complexes predict- literature and is thus a more realistic data set comprised of Page 9 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Vertex Weight Percentage Threshold Number of Complexes Average Complex Size Largest Complex Size Figure 4 Effect of Vertex Weight Percentage Parameter on Predicted Complex Size Figure legend: As the vertex weight percentage (VWP) parameter of MCODE is increased, the number of predicted complexes steadily decreases and the average and largest size of predicted complexes increases exponentially. The y-axis follows a logarithmic scale. For reference, the aver- age and maximum size of the MIPS benchmark complexes are 6 and 81, respectively and of the Gavin benchmark complexes are 11.8 and 88, respectively. varied experiments from many labs using different tech- able public resource for yeast protein complexes that we niques. After filtering away 50 'complexes' each composed are aware of. of a single protein and 2 highly similar complexes, we were left with 208 complexes for the MIPS known set. This MCODE was run again with a full combination of param- set did not include information from the recent large-scale eters, this time over a set of 9088 protein-protein interac- mass spectrometry studies [6,7]. While the MIPS complex tions among 4379 proteins which did not include the catalogue may be incomplete, it is currently the best avail- recent large-scale mass spectrometry studies but included Page 10 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 all interactions from the MIPS, YPD and PreBIND MCODE predictions on the high-throughput data sets, databases as well as from the majority of large-scale yeast termed 'Gavin Spoke', 'Y2H' and 'HTP only' (see Meth- two-hybrid experiments to date [2–4,10,34]. This interac- ods), are about as specific as the literature derived interac- tion set is termed 'Pre HTMS'. All of the interactions in this tion data set, but not as sensitive (Figure 6A). MCODE set were published before the last update specified on the predictions on interaction data sets containing the litera- MIPS protein complex catalogue and many are included ture derived benchmark, labelled 'Benchmark', 'Pre in the MIPS protein interaction table, thus we assumed HTMS' and 'AllYeast', are generally more sensitive and that the MIPS complex catalogue took into account the in- specific than those containing just the large-scale interac- formation in the known interaction table. Protein com- tion sets. Since the specificity drops from Benchmark to plexes found by MCODE in this set were compared to the Pre HTMS to AllYeast, with increasing amounts of large- MIPS protein complex catalogue to evaluate how well scale data, it could be argued that addition of this data MCODE performed at locating protein complexes ab negatively affects MCODE. However, large-scale data is initio. known to contain a high number of false positives, so it should be expected that these false-positives would not The same evaluation of MCODE that was done using the randomly contribute to the formation of dense regions, Gavin et al. data set was performed with the MIPS data set. which are highly unlikely to occur by chance (see below). From this analysis, including specificity versus sensitivity More complexes should be predicted with the addition of plots (optimized sensitivity = ~0.27 and specificity = the large-scale data, assuming this data explores previous- ~0.31), the MIPS complex benchmark optimized parame- ly unseen regions of the interactome, but the high number ters were haircut = TRUE, fluff = TRUE, VWP = 0.1 and a of false-positives should limit the amount of new com- fluff density threshold of 0.2. This result was stable up to plexes compared to the amount of added interactions. The a ω threshold of 0.6 after which it was difficult to evaluate MIPS complex benchmark used here is not expected to the results, as there were generally too few predicted com- contain complexes newly found in large-scale studies, ex- plexes above the high ω thresholds. This parameter com- plaining the decrease in specificity. This is exactly what oc- bination led MCODE to predict 166 complexes of which curs in our analysis. In an effort to further test the effect of 52 matched 64 MIPS complexes with a ω of at least 0.2. large-scale data on MCODE prediction performance, the Examining the ω distribution for this parameter set reveals Benchmark interaction data set was augmented with the that, even though this prediction is optimized, most of the addition of interactions from large-scale experiments that predicted complexes don't show overlap to those in the only connect proteins in the Benchmark set with each oth- known MIPS set (Figure 5). The complexes predicted here er. Over 3100 interactions were added to the Benchmark are also different from those predicted from the Gavin in- data set to create a set of over 6400 interactions. MIPS teraction data. Nine complexes have an overlap score complex benchmark optimised MCODE predicted 52 above 0.2 between these two sets, with the highest overlap complexes matching 66 MIPS benchmark complexes, al- score being 0.43 and all the rest being below 0.27. This most exactly the same number of complexes found using might signify that either the MIPS complex catalogue is the Benchmark set by itself (Table 1). These analyses not complete, that there is not enough data in the dataset strongly suggest the addition of large-scale experimentally that MCODE was run on, or a human annotated defini- derived interactions does not unduly affect the prediction tion of a complex does not perfectly match with a graph of complexes by MCODE. density based definition. It can be seen from Figure 6B that the Gavin complex The effect of the VWP parameter on complex size and of benchmark set is biased towards the Gavin et al. spoke the haircut and fluff parameters on number of matched modeled interaction data. This is expected and is the main complexes was very similar to that seen when evaluating reason why the less biased MIPS complex set is used MCODE on the Gavin complex benchmark. throughout this work as a benchmark instead of the Gavin set. Effect of data set properties on MCODE Since many large-scale protein interaction data sets from Since the result of a co-immunoprecipitation experiment yeast are known to contain a high level of false positives is a set of proteins, which we model as binary interactions [35], we examined the effect these might have on MCODE using the spoke method, we wished to evaluate whether predictions. Sensitivity vs. specificity was plotted for this affects complex prediction compared to an experi- MCODE predictions, with parameters chosen to maxi- mental system that generates purely binary interaction re- mize these values at ω threshold of 0.2 against the MIPS sults, such as yeast two-hybrid. As can be seen in Table 1, and Gavin complex benchmarks for the various data sets MCODE does find known complexes in the 'Y2H' set of (Figure 6). only yeast two-hybrid results, thus this set does contain dense regions that are known protein complexes. This Page 11 of 27 (page number not for citation purposes) 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 MCODE Predicted Complex Number Pre HTMS AllYeast BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Highest Overlap Score Figure 5 Overlap Score Distributions of Pre HTMS and AllYeast interaction sets with MIPS Complex Benchmark Opti- mized MCODE Parameter Sets Figure legend: The number of MCODE predicted complexes in the pre-large scale mass spectrometry (Pre HTMS) and AllYeast protein-protein interaction sets with a given overlap score threshold compared to the MIPS benchmark complex set is shown. The majority of predicted complexes have an overlap score of zero meaning that they had no overlap with the catalogue of known MIPS protein complexes. Page 12 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 0.9 0.8 0.7 0.6 0.5 Gavin Spoke Benchmark 0.4 Pre HTMS 0.3 Y2H 0.2 AllYeast HTP Only 0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sensitivity Gavin Spoke 0.9 (self) 0.8 0.7 0.6 0.5 0.4 HTP Only 0.3 AllYeast Benchmark 0.2 0.1 Pre HTMS Y2H 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sensitivity Figure 6 Sensitivity vs. Specificity Plots of MCODE Results Among Various Data Sets Figure legend: Specificity is plotted ver- sus sensitivity of the best MCODE results at an overlap score above 0.2 against both the MIPS (Panel A) and Gavin (Panel B) complex benchmarks. Panel A shows that there are no large inherent differences among interaction data sets resulting from significantly different experimental methods (data set: sensitivity, specificity; Y2H:0.10,0.27; Benchmark:0.29,0.36; HTP Only:0.14;0.24; Pre HTMS:0.27,0.31; AllYeast:0.27,0.26; Gavin Spoke:0.10,0.38). Panel B shows that the Gavin benchmark is expectedly biased towards the Gavin interaction data set and thus should not be used as a general benchmark (data set: sensi- tivity, specificity; Y2H:0.03,0.10; Benchmark:0.11,0.16; HTP Only:0.24;0.33; Pre HTMS:0.10,0.13; AllYeast:0.27,0.26; Gavin Spoke:0.31,0.79). Page 13 of 27 (page number not for citation purposes) Specificity Specificity BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Table 1: Summary of MCODE Results with Best Parameters on Various Data Sets. Data Set Number of Number of Number of Predicted MCODE Com- Matched Complex Best MCODE Proteins Interact-ions Complexes plexes Pre- Benchmark Benchmark Parameters dicted Above ω Complexes = 0.2 Gavin Spoke 1363 3225 82 63 88 Gavin hFfT\0.05\0.05 Gavin Spoke 1363 3225 53 20 20 MIPS hTfT\0.1\0.35 Pre HTMS 4379 9088 158 21 28 Gavin hTfT\0\0.2\ Pre HTMS 4379 9088 166 52 64 MIPS hTfT\0.1\0.2 AllYeast 4825 15143 209 52 76 Gavin hFfT\0\0.1 AllYeast 4825 15143 209 54 63 MIPS hTfT\0\0.1 AllYeast 4825 15143 203 80 150 MIPS+Gavin hTfT\0\0.15\ Benchmark 1762 3310 141 23 30 Gavin hTfT\0\0.3 Benchmark 1762 3310 163 58 67 MIPS hTfT\0.1\0.05 HTP Only 4557 12249 138 46 77 Gavin hTfT\0.05\0.1 HTP Only 4557 12249 122 29 35 MIPS hTfT\0.05\0.15 Y2H 3847 6133 73 7 7 Gavin hTfT\0.2\0.1 Y2H 3847 6133 78 21 26 MIPS hTfT\0\0.1 Statistics and a summary of results are shown for the various data sets used to evaluate MCODE. 'Gavin Spoke' is the Gavin et al. data set repre- sented as binary interactions using the spoke model; 'Pre HTMS' is the set of all yeast interaction not including the recent high-throughput mass spectrometry studies [6,7].; 'AllYeast' is the set of all yeast interactions that we could collect; 'Benchmark' is a set of interactions found in the liter- ature from YPD, MIPS and PreBIND; 'HTP Only' is the combination of all large-scale and high-throughput yeast two-hybrid and mass spectrometry data sets; 'Y2H' is the set of all yeast two-hybrid results from large-scale and literature sources. See Methods for full explanation of data sets. The 'Best MCODE Parameters' are formatted as haircut True of False, fluff True or False\VWP\Fluff Density Threshold Parameter. being said, the Y2H set is the least dense of all data sets ex- should be further studied using MCODE in directed mode amined here so is expected to have less dense regions of by specifying a seed vertex and trying different parameters the network and thus less MCODE predictable complexes to examine how large a complex can get before seemingly per protein present in the set. MCODE predicts a similar biologically irrelevant proteins are added (see below). amount of complexes as well as finding a similar amount of known complexes in the Y2H and Gavin Spoke data Figure 5 shows that even when a large set of interactions sets indicating that these data sets are not significantly dif- is used as input to MCODE, most of the MCODE predict- ferent from each other in the amount of dense network re- ed complexes do not match well with known complexes gions that they contain, even though they are different in MIPS. The complex size distribution of MCODE pre- sizes. Taken together, the latter results and those in Figure dicted complexes matches the shape of the MIPS set, but 6B show that the spoke model is a reasonable representa- the MCODE complexes are on average larger (Average tion of the Gavin et al. tandem affinity purification data. MIPS size = 6.0, Average MCODE Predicted size = 9.7). The average number of YPD and GO functional annota- Predicting complexes in the Yeast interactome tion terms per protein in an MCODE predicted complex is Given that MCODE performed reasonably well on test da- similar to that of MIPS complexes (Table 2). This seems to ta, we decided to predict complexes in a much larger indicate that MCODE is predicting complexes that are network. All machine-readable protein-protein interac- functionally relevant. Also, closer examination of the top, tion data from various data sets [2–7,10,13,14]. were col- middle and bottom five scoring MCODE complexes lected and integrated to form a non-redundant set of shows that MCODE can predict biologically relevant com- 15,143 experimentally determined yeast protein interac- plexes (Table 3). tions encompassing 4,825 proteins, or approximately three quarters of the proteome. This set was termed 'AllY- Many of the 209 predicted complexes are of size 2 (9 pre- east'. MCODE was parameter optimized, as above, using dicted complexes) or 3 (54 predicted complexes). Com- the MIPS benchmark. The best resulting parameter set was plexes of this size may not be significant since it is easy to haircut = TRUE, fluff = TRUE, VWP = 0 and a fluff density create high density subgraphs of size 2 or 3, but becomes threshold of 0.1. With these parameters, MCODE combinatorially more difficult to randomly create high predicted 209 complexes, of which 54 matched 63 MIPS density subgraphs as the size of the subgraph increases. To benchmark complexes above an overlap score of 0.2 (see examine the relevance of these small predicted complexes Additional file 1). Complexes found in this manner of size 2 or 3, we calculated the sensitivity and specificity Page 14 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Table 2: Average Number of YPD and GO Annotation Terms in Complex Sets. Data Set YPD Functions YPD Roles GO Components GO Processes MCODE on All Yeast 0.58 0.89 0.39 0.59 Interactions MIPS Complex Database 0.50 0.75 0.39 0.48 MCODE Random Model 0.72 1.24 0.52 0.85 (100 AllYeast network permutations) The average number of YPD and GO functional annotation terms per protein in an MCODE predicted complex is shown for MCODE predicted complexes on the AllYeast set, the MIPS complex database and the MCODE random model. A lower number indicates that the complexes from a set contain more functionally related proteins (or unannotated proteins). In the cases of multiple annotation, all terms are taken into account. Even though there are multiple annotation terms per protein and a variable amount of unannotated proteins per complex, these numbers should perform well in relative comparisons based on the assumption that the distribution of the latter two factors is similar in each data set. of the optimized MCODE predictions against the MIPS Significance of MCODE predictions complex benchmark while disregarding the small com- Naïvely, the chance of randomly picking a known protein plexes. First, complexes of size 2, then of size 3, were re- complex from a protein interaction network depends on moved from the optimized MCODE predicted complex the size of the complex and the network. It is easier to pick set. Removing each of these sets independently resulted in out a smaller known complex by chance from a smaller only small sensitivity and specificity changes. Because network. For instance, in our network of 15,143 interac- both sets overlap the MIPS benchmark, small complexes tions among 4,825 proteins, the chance of picking a spe- have been reported as predictions. Also, because MCODE cific known complex of size three is about one in 1.9 × found these small complexes in regions of high local den- 10 (4,825 choose 3). A more realistic model would as- sity, they may be good cores for further examination with sume that the proteins are connected and thus would only MCODE in directed mode, especially since the haircut op- consider complex choices of size three where all three tion was turned on here to produce them. proteins are connected. The number of choices now de- pends on the topology of the network. In our large net- Complexes that are larger and denser are ranked higher by work, there are 6,799 fully connected subnetworks of size MCODE and these generally correspond to known com- three and 313,057 subnetworks of size three with only plexes (see below). Interestingly, some MCODE two interactions (from the triadic census feature of Pajek). complexes contain unknown proteins that are highly con- Thus now our chance of picking a more realistic complex -6 nected to known complex subunits. For example, the sec- is one out of 319,856 (1/(6,799 + 313,057) = 3.1 × 10 ). ond highest ranked MCODE complex is involved in RNA As the size of the complex increases, the number of possi- processing/modification and contains the known polya- ble complex topologies increases exponentially and, in a denylation factor I complex (Cft1, Cft2, Fip1, Pap1, Pfs2, connected network of some reasonable density, so does Pta1, Ysh1, Yth1 and Ykl059c). Seven other proteins in- the number of possible subgraphs that could represent a volved in mainly RNA processing/modification (Fir1, complex. The density of our large protein interaction net- Hca4, Pcf11, Pti1, Ref2, Rna14, Ssu72) and protein degra- work is 0.0013 and is mostly connected (4,689 proteins dation (Uba2 and Ufd1) are highly connected within this are in one connected component). Thus, it is expected predicted complex. Two unknown proteins Pti1 and that if a complex is found in a network with MCODE that Yor179c are highly connected to RNA processing/modifi- matches a known complex, that the result would be highly cation proteins and are therefore likely involved in the significant. To understand the significance of complex same process (Figure 7). Pti1 may be an unknown compo- prediction further, the topology of the protein interaction rd nent of the polyadenylation factor I complex. The 23 network would have to be understood in general, in order highest ranked predicted complex is interesting in that it to build a null model to compare against. is involved in cell polarity and cytokinesis and contains two proteins of unknown function, Yhr033w and Recent research on modeling complex systems [21,25,27] Yal027w. Yal027w interacts with two kinases, Gin4 and has found that networks such as the world wide web, met- Kcc4, which in turn interact with the components of the abolic networks [26] and protein-protein interaction net- Septin complex (Cdc3, Cdc10, Cdc11 and Cdc12) (Figure works [36] are scale-free. That is, the connectivity 8). distribution of the vertices of the graph follows a power law, with many vertices of low degree and few vertices of high degree. Scale-free networks are known to have large Page 15 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Yor179c RNA processing/modification Pol II transcription Protein modification Protein degradation Pti1 Unknown Ssu72 Ref2 Yth1 Pta1 Fir1 Cft1 Pfs2 Hca4 Ykl059c Pap1 Uba2 Cft2 Fip1 Ysh1 Rna14 Ufd1 Pcf11 Figure 7 The Second Highest Ranked MCODE Predicted Complex is Involved in RNA Processing and Modification . Fig- ure legend: This complex incorporates the known polyadenylation factor I complex (Cft1, Cft2, Fip1, Pap1, Pfs2, Pta1, Ysh1, Yth1 and Ykl059c) and contains other proteins highly connected to this complex, some of unknown function. The fact that the unknown proteins (Yor179c and Pti1) connect more to known RNA processing/modification proteins than to other proteins in the larger data set likely indicates that these proteins function in RNA processing/modification. This complex was ranked second by MCODE from the predicted complexes in the AllYeast interaction set. clustering coefficients, or clustered regions of the graph. In To test the significance of clustered regions in biological biological networks, at least in yeast, these clustered re- networks, 100 random permutations of the large set of all gions seem to correspond to molecular complexes and 15,143 yeast interactions were made. If the graph to be these subgraphs are what MCODE is designed to find. randomised is considered as a set of edges between two Page 16 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Cell Organization/Biogenesis General Metabolism RNA Processing/Localization Smd3 She3 Cell Cycle DNA Damage Response/Repair Unknown Spa2 Gic2 Nfi1 Bem4 Cdc42 Spr28 Gic1 Zds2 Top2 Cdc12 Pdi1 Cdc11 Cla4 Bub2 Yhr033w Elm1 Cdc10 Shs1 Gin4 Acc1 Swe1 Cdc3 Kcc4 Rvb2 Yal027w Msh6 Aat2 Hsl1 Figure 8 An MCODE Predicted Complex Involved in Cytokinesis Figure legend: This predicted complex incorporates the known Septin complex (Cdc3, Cdc10, Cdc11 and Cdc12) involved in cytokinesis and other cytokinesis related proteins. The Yal027w protein is of unknown function, but likely functions in cell cycle control according to this figure, possibly in cytokine- rd sis. This complex was ranked 23 by MCODE from the predicted complexes in the AllYeast interaction set. Page 17 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Table 3: Statistics for Top, Middle and Bottom Five Scoring Optimized MCODE Predicted Complexes Found in All Known Yeast Protein Interaction Data Set Complex Score Proteins Interactions Density Cell Role Cell Localization Rank 1 10.04 46 236 0.22 RNA processing/ Nuclear modification and protein degradation (26S Proteasome) Protein names Dbf2,Ecm29,Gcn4,Hsm3,Hyp2,Lhs1,Mkt1,Nas6,Pre1,Pre2,Pre4,Pre5,Pre6, Pre7,Pre8,Pre9,Pup3,Rad23,Rad24,Rad50,Rfc3,Rfc4,Rpn1,Rpn10,Rpn11, Rpn12,Rpn13,Rpn3,Rpn4,Rpn5,Rpn6,Rpn7,Rpn8,Rpn9,Rpt1,Rpt2,Rpt3,Rpt4, Rpt5,Rpt6,Scl1,Ubp6,Ura7,Ygl004c,Yku70,Ypl070w 2 9 19 90 0.51 RNA processing/ Nuclear modification Protein names Cft1,Cft2,Fip1,Fir1,Hca4,Mpe1,Pap1,Pcf11,Pfs2,Pta1,Pti1,Ref2,Rna14,Ssu72, Uba2,Ufd1,Yor179c,Ysh1,Yth1 3 7.72 56 220 0.14 Pol II transcription Nuclear Protein names Ada2,Adr1,Ahc1,Cdc23,Cdc36,Epl1,Esa1,Fet4,Fun19,Gal4,Gcn5,Hac1,Hfi1, Hhf2,Hht1,Hht2,Ire1,Luc7,Med7,Myo4,Ngg1,Pcf11,Pdr1,Prp40,Rna14,Rpb2, Rpo21,Sap185,Sgf29,Sgf73,Spt15,Spt20,Spt3,Spt7,Spt8,Srb6,Swi5,Taf1,Taf10, Taf11,Taf12,Taf13,Taf14,Taf2,Taf3,Taf5,Taf6,Taf7,Taf8,Taf9,Tra1,Ubp8, Yap1,Yap6,Ybr270c,Yng2 4 7.58 18 72 0.44 Cell cycle control, Nuclear protein degradation, mitosis (Anaphase Promoting Complex) Protein names Apc1,Apc11,Apc2,Apc4,Apc5,Apc9,Cdc16,Cdc23,Cdc26,Cdc27,Dmc1,Doc1, Leu3,Rpt1,Sic1,Spc29,Spt2,Ybr270c 5 7 15 56 0.52 Vesicular transport Golgi (TRAPP Complex) Protein names Bet1,Bet3,Bet5,Fks1,Gsg1,Gyp6,Kre11,Sec22,Trs120,Trs130,Trs20,Trs23, Trs31,Trs33,Uso1 102 3 3 3 1 RNA splicing Nuclear Protein names Msl5,Mud2,Smy2 103 3 3 3 1 Signal transduction, Nuclear Cell cycle control, DNA repair, DNA synthesis Protein names Ptc2,Rad53,Ydr071c 104 3 3 3 1 Cell cycle control, Uknown mating response Protein names Far3,Vps64,Ynl127w 105 3 3 3 1 Chromatin/chromo- Nuclear some structure Protein names Gbp2,Hpr1,Mft1 106 3 3 3 1 Pol II transcription Nuclear Protein names Ctk1,Ctk2,Ctk3 205 2 3 4 1 Vesicular transport ER Protein names Rim20,Snf7,Vps4 Page 18 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Table 3: Statistics for Top, Middle and Bottom Five Scoring Optimized MCODE Predicted Complexes Found in All Known Yeast Protein Interaction Data Set (Continued) 206 2 3 4 1 Protein Cytoplasmic translocation Protein names Srp14,Srp21,Srp54 207 2 3 4 1 Protein Cytoplasmic translocation Protein names Srp54,Srp68,Srp72 208 2 3 4 1 Energy generation Mitochondrial Protein names Atp1,Atp11,Atp2 209 2 4 5 0.67 Nuclear-cytoplas- Varied mic and vesicular transport Protein names Kap123,Nup145,Sec7,Slc1 Score is defined as the product of the complex subgraph density and the number of vertices (proteins) in the complex subgraph (DC × |V|). This ranks larger more dense complexes higher in the results. Density is calculated using the "loop" formula if homodimers exist in the complex, other- wise the "no loop" formula is used. The cell role column is a manual combination of annotation terms for the proteins reported in the complex. vertices (v , v ), a network permutation is made by all yeast interactions are highly unlikely to occur by 1 2 randomly permuting the set of all v vertices. The random chance. networks have the same number of edges and vertices as the original network and follow a power-law connectivity To evaluate the effectiveness of our scoring scheme, which distribution, as do the original data sets [37]. Running scores larger, more dense complexes higher than smaller, MCODE with the same parameters as the original network more sparse complexes, we examined the accuracy of (haircut = TRUE, fluff = TRUE, VWP = 0 and a fluff density MCODE predictions at various score thresholds. As the threshold of 0.1) on the 100 random networks resulted in score threshold for inclusion of complexes is increased, an average of 27.4 (SD = 4.4) complexes per network. The less complexes are included, but a higher percentage of size distribution of complexes found by MCODE did not the included complexes match complexes in the bench- match that of the complexes found in the original net- mark. This is at the expense of sensitivity as many bench- work, as some complexes found in the random networks mark matching complexes are not included at higher were composed of >1500 proteins. One random network score thresholds (Figure 9). For example, of the ten pre- that had an approximately average number of predicted dicted complexes with MCODE score greater or equal to complexes (27) was parameter optimized using the MIPS six, nine match a known complex in either the MIPS or benchmark to see how parameter choice affects the size Gavin benchmark above a 0.2 threshold overlap score, distribution and number of predicted complexes. Param- yielding an accuracy of 90%. 100% of the five complexes eters of haircut = TRUE, fluff = TRUE, VWP = 0.1 and a that had an MCODE score better or equal to seven fluff density threshold of zero produced the maximal matched known complexes. Thus, complexes that score number of 81 complexes for this network, but these com- highly on our simple density based scoring scheme are plexes were composed of on average 27 proteins (without very likely to be real. counting an outlier complex of size 1961), which is much larger than normal (e.g. larger than the MIPS set average Directed mode of MCODE of 6.0). None of these predicted complexes matched any To simulate an obvious example where the directed mode MIPS complexes above an overlap score of 0.1. Also, the of MCODE would be useful, MCODE was run with re- random network complexes had a much higher average laxed parameters (haircut = TRUE, fluff = TRUE, VWP = number of YPD and GO annotation terms per protein per 0.05 and a fluff density threshold of 0.2) compared to the complex than for MIPS or MCODE on the original best parameters on the AllYeast network. The resulting network (Table 2). This indicates, as expected, that the fourth highest ranked complex, when visualized, shows random network complexes are composed of a higher lev- two clustered components and represents two protein el of unrelated proteins than complexes in the original complexes, the proteasome and an RNA processing com- network. Thus, the number, size and functional composi- plex, both found in the nucleus (Figure 10). This is an tion of complexes that MCODE predicts in the large set of example of where a lower VWP parameter would have been superior since it would have divided this large Page 19 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 100% 5 2 75% 50% 25% 0% 012345 6789 10 Complex Score Threshold Figure 9 Effect of Complex Score Threshold on MCODE Prediction Accuracy Figure legend: MCODE complexes equal to or greater than a specific score were compared to a benchmark comprising the combined MIPS and Gavin benchmarks. Accuracy was calculated as the number of known complexes better or equal to the threshold score divided by the total number of pre- dicted complexes (matching and non-matching) at that threshold. A complex was deemed to match a known complex if it had an overlap score above 0.2. The number of predicted complexes that matched known complexes at each score threshold is shown as labels on the plot. complex into two more functionally related complexes. 11A). Using this VWP parameter, combinations of haircut The highest weighted vertices in the center of each of the and fluff parameters were used to further expand the core two dense regions in Figure 10 are the Rpt1 and Lsm4 pro- complex. This process was stopped when the predicted teins. MCODE was run in directed mode starting with complexes began to include proteins of sufficiently these two proteins over a range of VWP parameters from different known biological function to the seed vertex. 0 to 0.2, at 0.05 increments. For Lsm4, the parameter set Proteins, such as Vam6 and Yor320c were included in the of haircut = TRUE, fluff = FALSE, VWP = 0 was used to find complex at moderate fluff parameters (0.4–0.6), but not a core complex, which contained 9 proteins fully connect- at higher fluff parameters, and these are known to be lo- ed to each other (Dcp1, Kem1, Lsm2, Lsm3, Lsm4, Lsm5, calized in membranes outside of the nucleus, thus are Lsm6, Lsm7 and Pat1). Above this VWP parameter, the likely not functionally related to the Lsm complex pro- core complex branched out into proteasome subunit pro- teins. Therefore, the 9 proteins listed above were decided teins, which are not part of the Lsm complex (see Figure Page 20 of 27 (page number not for citation purposes) Accuracy BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Chromatin/chromosome structure Protein degradation Cell cycle control Dop1 RNA processing/modification,RNA splicing Snu23 Unknown Bem3 Dib1 Hsh155 Nog2 Sec21 Snu66 Gcn4 Pre9 Pre4 Scl1 Whi2 Mak31 Smx2 Rri1 Gdb1 Pre2 Pre5 Prp4 Gcr2 Pre1 Smd2 Prp24 Pre6 Pre7 Sme1 Pat1 Rpn13 Sro7 Lsm7 Yfl066c Rpn7 Rpn10 Pup3 Rpn5 Sec26 Kem1 Rpn8 Lsm6 Rad23 Lsm2 Snu114 Lsm4 Hsm3 Lsm8 Rpn6 Dcp1 Lsm5 Ubp6 Rpn4 Gzf3 Rpn12 Ydl175c Rpt2 Ygl004c Lsm3 Ste6 Rpn3 Rps28b Lsm1 Yor320c Rpt6 Rpn9 Mtr3 Ecm29 Rps28a Rpn11 Rpt3 Nas6 Ycr024c Rpt1 Rpt4 Neo1 Ylr269c Rrd2 Rpt5 Rad18 Ris1 Yrb1 Bro1 Iml1 Lhs1 Shm2 Hex3 Ybr094w Yor056c Ubr1 Ctr9 Rfc3 Spt16 Bas1 Paf1 Rad24 Leo1 Figure 10 An MCODE Predicted Complex That is Too Large (Relaxed Parameters) Figure legend: An example of a predicted complex that incorporates two complexes, proteasome (left) and an RNA processing complex (right). These should probably be predicted as separate complexes as can be seen by the clear distinction of biological role annotation on one side of this lay- out compared to the other (purple versus blue). This figure, however, shows the large amount of overall connectivity between these two complexes. This complex was ranked fourth by MCODE from the predicted complexes in the AllYeast interaction set with slightly relaxed parameters compared to the optimized prediction. to be the final complex (Figure 11B). This is intuitive be- proteins (Pre1 to Pre10, Pup1, Pup2, Pup3, Scl1 and cause of their maximal density (a 9-clique). Ump1) of which Pre7, Pre8, Pre10, Pup1, Pup2 and Ump1 are not found with MCODE. The 19S regulatory Using this same method of known biological role "titra- subunit of the proteasome is known to have 21 subunits tion" on Rpt1 found a complex of 34 proteins (Gal4, (Nas6, Rpn1 to Rpn13, Rpt1 to Rpt6 and Ubp6) of which Gcn4, Hsm3, Lhs1, Nas6, Pre1, Pre2, Pre3, Pre4, Pre5, Rpn1, Rpn2, Rpn4, Rpn12 and Rpt5 are not found with Pre6, Pre7, Pre9, Pup3, Rpn10, Rpn11, Rpn13, Rpn3, MCODE. Known complex components not found by Rpn5, Rpn6, Rpn7, Rpn8, Rpn9, Rpt1, Rpt2, Rpt3, Rpt4, MCODE are not present at a high enough local density re- Rpt6, Rri1, Scl1, Sts1, Ubp6, Ydr179c, Ygl004c) and 160 gions of the interaction network, possibly because not interactions using the parameter set haircut = TRUE, fluff enough experiments involving these proteins are present = TRUE, VWP = 0.2 and a fluff density threshold of 0.3. in our data set. Figure 11C shows the final Rpt1 seeded Two regions of density can be seen here corresponding to complex. Of note, Ygl004c is unknown and binds to the two known subunits of the 26S proteasome. The 20S almost every Rpt and Rpn protein in the complex al- proteolytic subunit of the proteasome is comprised of 15 though all of these interactions were from a single immu- Page 21 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 AB Lsm7 Lsm6 Lsm7 Pat1 Dcp1 Lsm5 Rpn10 Dcp1 Lsm2 Lsm3 Rpt3 Rpn6 Kem1 Lsm6 Rpt1 Lsm2 Kem1 Lsm5 Lsm3 Lsm4 Pat1 Lsm4 Rpn7 Pre2 Rpt2 Gcn4 Pre4 Rpt4 Sts1 Pre6 Rpn3 Pre3 Rpn11 Pup3 Rpn10 Lhs1 Rpn6 Pre1 Ubp6 Rpt3 Scl1 Rpt1 Rpn5 Ygl004c Hsm3 Pre7 Pre5 Rpn8 Rpt6 Pre9 Rri1 Rpn13 Ydr179c Nas6 Gal4 Rpn9 Figure 11 MCODE in Directed Mode Figure legend: MCODE was used in directed mode to further study the complex in Figure 10 by using seed vertices from high density regions of the two parts of this complex. A) The result of examining the Lsm complex using MCODE parameters that are too relaxed (haircut = TRUE, fluff = FALSE, VWP = 0.05). B) The final Lsm complex using MCODE parameters of haircut = TRUE, fluff = FALSE and VWP = 0 seeded with Lsm4. C) The final 26S proteasome complex seeded with Rpt1 using MCODE parameters haircut = TRUE, fluff = TRUE and VWP = 0.2. Visible here are two regions of density in this complex corresponding to the 20S proteolytic subunit (left side – mainly Pre proteins) and the 19S regulatory subunit (right side – mainly Rpt and Rpn proteins). noprecipitation experiment [6]. As well, Rri1 and Ydr179c subunits and is involved in DNA mismatch repair path- have unknown function and both bind to each other and ways, but is not known to be part of the proteasome, al- to Rpn5. Thus one would predict that these three un- though all of these Hsm3 interactions are from a known proteins function with or as part of the 26S particular large-scale experiment [7]. Interestingly, Gal4, a proteasome. The protein Hsm3 binds to eight other 19S transcription factor involved in galactose metabolism, is Page 22 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 found to be part of the proteasome complex. While this [40] with the Kamada-Kawai graph layout algorithm [41]. metabolic functionality seems unrelated to protein degra- Kamada-Kawai models the edges in the graph as springs, dation, it has recently been shown that the binding is randomly places the vertices in a high energy state and physiologically relevant [38]. These cases illustrate the then attempts to minimize the energy of the system over a possible unreliability of both functional annotation and number of time steps. The result is that the Euclidean dis- interaction data, but also that seemingly unrelated pro- tance, here in a plane, is close to the graph-theoretic or teins should not be immediately discounted if found to be path distance between the vertices. The vertices are visual- part of a complex by MCODE. ly clustered based on connectivity. Biologically, this visu- alization can allow one to see the rough structural outline Of note, the known topology of the 26S proteasome [39] of large complexes, if enough interactions are known, as compares favourably with the complex visualization of evidenced in the proteasome complex analysis above (Fig- Figure 11C without considering stoichiometry. Thus, if ure 11C). enough interactions are known, visualizing complexes may reveal the rough structural outline of large complex- It is important to note and understand the limitations of es. This should be expected when dealing with actual the current experimental methods (e.g. yeast two-hybrid physical protein-protein interactions since there are few and co-immunoprecipitation) and the protein interaction allowed topologies for large complexes considering the networks that these techniques generate when analyzing specific set of defining interactions and steric clashes be- the resulting data. One common class of false-positive in- tween protein subunits. teractions arising from many different kinds of experi- mental methods is that of indirect interactions. For Complex connectivity instance, an interaction may be seen between two proteins MCODE may also be used to examine the connectivity using a specific experimental method, but in reality, those and relationships between molecular complexes. Once a proteins do not physically bind each other, and one or complex is known using the directed mode, the MCODE more other molecules that are generally part of the same parameters can be relaxed to allow branching out into complex mediate the observed interaction. As can be seen other complexes. The MCODE directed mode preprocess- for the Arp2/3 complex shown in Figure 3, when pairwise ing step must also be turned off to allow MCODE to interactions between all combinations of proteins in a branch into other connected complexes, which may reside complex are studied, this creates a very dense graph. Inter- in denser regions of the graph than the seed vertex. As an estingly, this false-positive effect is normally considered a example, this was done with the Lsm4 seeded complex disadvantage, but is an advantage with MCODE as it in- (Figure 12). MCODE parameters were relaxed to haircut = creases the density in the region of the graph containing a TRUE, fluff = FALSE, VWP = 0.2 although they could be complex, which can then be more easily predicted. further relaxed for greater extension out into the network. Apart from the experimental factors that lead to false-pos- Discussion itive and false-negative interactions, representational lim- This method represents an initial step in taking advantage itations also exist computationally. Temporal and spatial of the protein function data being generated by many information is not currently described in interaction net- large-scale protein interaction studies. As the experimen- works. A complex found by the MCODE approach may tal methods are further developed, an increasing amount not actually exist even though all of the component of data will be produced which will require computation- proteins bind each other in vitro. Those proteins may al methods for efficient interpretation. The algorithm de- never be present at the same time and place. For example, scribed here allows the automated prediction of protein molecular complexes that perform different functions complexes from qualitative protein-protein interaction sometimes have common subunits as with the three types data and is thus able to help predict the function of of eukaryotic RNA polymerases. unknown proteins and aid in the understanding of the functional connectivity of molecular complexes in the Complex stoichiometry, another important aspect of bio- cell. The general nature of this method may allow logical data, is not represented either. While it is possible complex prediction for molecules other than proteins as to include full stoichiometry in a graph representation of well, for example metabolic complexes that include small a biomolecular interaction network, many experimental molecules. methods do not provide this information, so a homo- multimeric complex is normally represented as a simple MCODE cannot stand alone in this task; it must be com- homodimer. When an experiment does provide stoichi- bined with a graph visualization system to ease the under- ometry information, it is not stored in most current standing of the relationships among molecules in the data databases, such as MIPS and YPD. Thus, one is forced to set. We use the Pajek program for large network analysis Page 23 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Yth1 Ykl059c Pap1 Glc7 Ref2 Pfs2 Pti1 Fip1 Cft1 Cft2 Pta1 Rna14 Ysh1 Lsm6 Spt7 Lsm7 Hfi1 Spt20 Taf60 Lsm3 Kem1 Spt3 Tra1 Taf61 Taf25 Lsm5 Pat1 Taf90 Gcn5 Lsm1 Dcp1 Spt8 Taf17 Lsm4 Spt15 Ngg1 Lsm8 Lsm2 Ada2 Rpn8 Rpn6 Cdc23 Apc2 Apc4 Rpt3 Rpn10 Rpt1 Apc1 Doc1 Cdc26 Rpn11 Cdc16 Rpn5 Apc9 Apc11 Rpn3 Rpt6 Cdc27 Apc5 Figure 12 Examining Complex Connectivity with MCODE Figure legend: The complexes shown here are known to be nuclear localized and are involved in protein degradation (19S proteasome subunit), mRNA processing (Lsm complex and mRNA Cleavage/Polyadenylation complex), cell cycle (anaphase promoting complex) and transcription (SAGA transcriptional activa- tion complex). return to the primary literature to extract the data, an ex- to determine a reliability index or p-value on edges in the tremely time-consuming task for large data sets. graph. For instance, one may wish to rank results pub- lished in high-impact journals above other journals (or Some quantitative and statistical information is present vice versa) and rank classical purification methods above when integrating results of large-scale approaches and this high-throughput yeast two-hybrid techniques when deter- is not used in our current graph model. For instance, the mining the quality of the interaction data. It may also be number of different types of experiments that find the possible to weight vertices on the graph by other quality same interaction, the quality of the experiment, the date criteria, such as whether a protein is hypothetical from a the experiment was conducted (newer methods may be gene prediction or not or whether a protein is expressed at superior in certain aspects) and other factors that pertain a particular time and place in the cell. For example, if one to the reliability of the interaction could all be considered were interested in a certain stage of the cell cycle, proteins Page 24 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 that are known to be absent at that stage could be reduced clusion of functional annotation and p-values on edges. in weight (VWP in the case of MCODE) compared to pro- Time, space and stoichiometry should also be represented teins that are present. It should be noted that any weight- on networks and in visualization systems. The process of ing scheme that tries to assess the quality of an interaction 'functional annotation titration' in the directed mode of might make false assumptions that would prevent the dis- MCODE could be automated. covery of new and interesting data. Conclusions This paper shows that the structure of a biological network MCODE effectively finds densely connected regions of a can define complexes, which can be seen as dense regions. molecular interaction network, many of which corre- This may be attributed to indirect interactions accumulat- spond to known molecular complexes, based solely on ing in the literature. Thus, interaction data taken out of connectivity data. Given that this approach to analyzing context may be erroneous. For instance, if one has a col- protein interaction networks performs well using lection of protein interactions from various different ex- minimal qualitative information implies that large periments done at different times in different labs from a amounts of available knowledge is buried in large protein specific complex that form a clique, and if one chooses an interaction networks. More accurate data mining algo- interaction from this clique, then how can one verify if it rithms and systems models could be constructed to un- is indirect or not. We would only begin to know if we had derstand and predict interactions, complexes and a very detailed description of the experiment from the pathways by taking into account more existing biological original papers where we could tell the amount of work knowledge. Structured molecular interaction data resourc- and quality of work that went into measuring each inter- es such as BIND will be vital in creating these resources. action. Thus with only a qualitative view of interactions, in reference to Dobzhansky [42], nothing in the biomo- Methods Data sources lecular interaction network would make sense except in light of molecular complexes and the functional connec- All protein interaction data sets from MIPS [13], Gene On- tions between them. If one had a highly detailed represen- tology [43] and PreBIND http://bioinfo.mshri.on.ca/ tation of each interaction including time, place, prebind/ were collected as described previously [6]. The YPD protein interaction data are from March 2001 and experimental condition, number of experiments, binding sites, chemical actions and chemical state information, were originally requested from Proteome, Inc. http:// one would be able to computationally delve into molecu- www.proteome.com. Other interaction data sets are from lar complexes to resolve topology, structure, function and BIND http://www.bind.ca. A BIND yeast import utility mechanism down to the atomic level. This information was developed to integrate data from SGD [12], RefSeq would also help to judge the biological relevance of an in- [44], Gene Registry http://genome-www.stanford.edu/ teraction. Thus, we require databases like BIND [15] to Saccharomyces/registry.html, the list of essential genes store this information. The integration of known qualita- from the yeast deletion consortium [11] and GO terms tive and quantitative molecular interaction data in a ma- [43]. This database ensures proper matching of yeast gene chine-readable format should allow increasingly accurate names among the multiple data sets that may use different protein interaction, molecular complex and pathway pre- names for the same genes. The yeast proteome used here diction, including actual binding site and mechanism in- is defined by SGD and RefSeq and contains 6,334 ORFs formation in a sequence and structural context. including the mitochondrial chromosome. Before per- forming comparisons, the various interaction data sets Based on our scale-free network analysis, it would seem were entered into a local instance of BIND as pairwise pro- that real biological networks are organized differently tein interaction records. The MIPS complex catalogue was than random models of scale-free networks in that they downloaded in February 2002. have higher clustering coefficients around specific regions (complexes) and the vertices in these regions are related to The protein interaction data sets used here were com- each other, by biological function. Thus, attempts to mod- posed as follows. 'Gavin Spoke' is the spoke model of the el biological networks and their evolution in a global way raw purifications from Gavin et al [7]. 'Y2H' is all known solely using the statistics of scale-free networks may not large-scale [2–5,10] combined with normal yeast two-hy- work, rather modeling should take into account as much brid results from MIPS. 'HTP Only' is only high-through- extant biological knowledge as possible. put or large-scale data [2–7,10] The 'Benchmark' set was constructed from MIPS, YPD and PreBIND as previously Future work on MCODE could include researching differ- described [6]. 'Pre HTMS' was composed of all yeast sets ent, possibly adaptive, vertex scoring functions to take except the recent large-scale mass spectrometry data sets into account, for example, the local density of the network [6,7]. 'AllYeast' was the combination of all above data past the immediate neighborhood of a vertex and the in- sets. All data sets are non-redundant. Page 25 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 10. Tong AH, Drees B, Nardelli G, Bader GD, Brannetti B and Castagnoli Network visualization L A combined experimental and computational strategy to Visualization of networks was performed using the Pajek define protein interaction networks for peptide recognition program for large network analysis [40]http:// modules. Science 2002, 295:321-324 11. Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K and vlado.fmf.uni-lj.si/pub/networks/pajek/ as described pre- Andre B Functional characterization of the S. cerevisiae ge- viously [6,10]. using the Kamada-Kawai graph layout al- nome by gene deletion and parallel analysis. Science 1999, 285:901-906 gorithm followed by manual vertex adjustments and was 12. Chervitz SA, Hester ET, Ball CA, Dolinski K, Dwight SS and Harris formatted using CorelDraw 10. Power law analysis was MA Using the Saccharomyces Genome Database (SGD) for also accomplished as previously described [6]. analysis of protein similarities and structure. Nucleic Acids Res 1999, 27:74-78 13. Mewes HW, Frishman D, Gruber C, Geier B, Haase D and Kaps A Authors' contributions MIPS: a database for genomes and protein sequences. Nucleic GB conceived of the study and carried out all program- Acids Res 2000, 28:37-40 14. Costanzo MC, Crawford ME, Hirschman JE, Kranz JE, Olsen P and ming and analyses as a Ph.D. student in the lab of CH. CH Robertson LS YPD, PombePD and WormPD: model organism supervised the study and provided valuable input for the volumes of the BioKnowledge library, an integrated re- source for protein information. Nucleic Acids Res 2001, 29:75-79 evaluation analyses. 15. Bader GD, Donaldson I, Wolting C, Ouellette BF, Pawson T and Hogue CW BIND-The biomolecular interaction network Additional material database. Nucleic Acids Res 2001, 29:242-245 16. Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM and Eisenberg D DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Ac- Additional File 1 ids Res 2002, 30:303-305 AllYeastPredictedComplexes.zip Zip file containing Pajek .net and anno- 17. Takai-Igarashi T, Nadaoka Y and Kaminuma T A database for cell tation files for all 209 complexes found using MCODE on the set of all signaling networks. J Comput Biol 1998, 5:747-754 18. Wingender E, Chen X, Hehl R, Karas H, Liebich I and Matys V yeast interactions reported here. Various report files from MCODE are TRANSFAC: an integrated system for gene expression also included as well as basic instructions for using Pajek. regulation. Nucleic Acids Res 2000, 28:316-319 Click here for file 19. Karp PD, Riley M, Saier M, Paulsen IT, Paley SM and Pellegrini-Toole [http://www.biomedcentral.com/content/supplementary/1471- A The EcoCyc and MetaCyc databases. Nucleic Acids Res 2000, 2105-4-2-S1.zip] 28:56-59 20. Overbeek R, Larsen N, Pusch GD, D'Souza M, Selkov EJ and Kyrpides N WIT: integrated system for high-throughput genome se- quence analysis and metabolic reconstruction. Nucleic Acids Res 2000, 28:123-125 21. Wagner A and Fell DA The small world inside large metabolic Acknowledgements networks. Proc R Soc Lond B Biol Sci 2001, 268:1803-1810 Michel Dumontier, Sherrie Kelly, Katerina Michalickova, Tony Pawson and 22. Flake GW, Lawrence S, Giles CL and Coetzee FM Self-Organiza- Mike Tyers provided helpful discussion. This work was supported in part tion of the Web and Identification of Communities. IEEE from grants from the Canadian Institutes of Health Research (CIHR), the Computer 2002, 35:66-71 23. Goldberg AV Finding a Maximum Density Subgraph. Technical Ontario Research and Development Challenge Fund and MDS-Sciex to Report UCB/CSD University of California, Berkeley, CA 1984, 84: C.H. G.D.B. is supported by an Ontario Graduate Scholarship (OGS). 24. Ng A, Jordan M and Weiss Y On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 14: Proceedings of the 2001 2001, References 25. Watts DJ and Strogatz SH Collective dynamics of 'small-world' 1. Fields S Proteomics. Proteomics in genomeland. Science 2001, networks. Nature 1998, 393:440-442 291:1221-1224 26. Jeong H, Tombor B, Albert R, Oltvai ZN and Barabasi AL The large- 2. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS and Knight JR A scale organization of metabolic networks. Nature 2000, comprehensive analysis of protein-protein interactions in 407:651-654 Saccharomyces cerevisiae. Nature 2000, 403:623-627 27. Albert R, Jeong H and Barabasi AL Error and attack tolerance of 3. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M and Sakaki Y A com- complex networks. Nature 2000, 406:378-382 prehensive two-hybrid analysis to explore the yeast protein 28. Barabasi AL and Albert R Emergence of scaling in random interactome. Proc Natl Acad Sci U S A 2001, 98:4569-4574 networks. Science 1999, 286:509-512 4. Drees BL, Sundin B, Brazeau E, Caviston JP, Chen GC and Guo W A 29. Fell DA and Wagner A The small world of metabolism. Nat protein interaction map for cell polarity development. J Cell Biotechnol 2000, 18:1121-1122 Biol 2001, 154:549-571 30. Hartuv E and Shamir R A clustering algorithm based on graph 5. Fromont-Racine M, Mayes AE, Brunet-Simon A, Rain JC, Colley A and connectivity. Information processing letters 1999, 76:175-181 Dix I Genome-wide protein interaction screens reveal func- 31. Bader GD and Hogue CW Analyzing yeast protein-protein in- tional networks involving Sm-like proteins. Yeast 2000, 17:95- teraction data obtained from different sources. Nat Biotechnol 2002, 20:991-997 6. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L and Adams SL Sys- 32. Baldi P, Brunak S, Chauvin Y, Andersen CA and Nielsen H Assessing tematic identification of protein complexes in Saccharomy- the accuracy of prediction algorithms for classification: an ces cerevisiae by mass spectrometry. Nature 2002, 415:180-183 overview. Bioinformatics 2000, 16:412-424 7. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M and Bauer A 33. Robinson RC, Turbedsky K, Kaiser DA, Marchand JB, Higgs HN and Functional organization of the yeast proteome by systemat- Choe S Crystal structure of Arp2/3 complex. Science 2001, ic analysis of protein complexes. Nature 2002, 415:141-147 294:1679-1684 8. Christendat D, Yee A, Dharamsi A, Kluger Y, Savchenko A and Cort 34. Mayes AE, Verdone L, Legrain P and Beggs JD Characterization of JR Structural proteomics of an archaeon. Nat Struct Biol 2000, Sm-like proteins in yeast and their association with U6 7:903-909 snRNA. EMBO J 1999, 18:4321-4331 9. Kim SK, Lund J, Kiraly M, Duke K, Jiang M and Stuart JM A gene ex- 35. von Mering C, Krause R, Snel B, Cornell M, Oliver SG and Fields S pression map for Caenorhabditis elegans. Science 2001, Comparative assessment of large-scale data sets of protein- 293:2087-2092 protein interactions. Nature 2002, 417:399-403 Page 26 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 36. Jeong H, Mason SP, Barabasi AL and Oltvai ZN Lethality and cen- trality in protein networks. Nature 2001, 411:41-42 37. Maslov S and Sneppen K Specificity and stability in topology of protein networks. Science 2002, 296:910-913 38. Gonzalez F, Delahodde A, Kodadek T and Johnston SA Recruit- ment of a 19S proteasome subcomplex to an activated promoter. Science 2002, 296:548-550 39. Bochtler M, Ditzel L, Groll M, Hartmann C and Huber R The proteasome. Annu Rev Biophys Biomol Struct 1999, 28:295-317 40. Batagelj V and Mrvar A Pajek – Program for Large Network Analysis. Connections 1998, 2:47-57 41. Kamada T and Kawai S An algorithm for drawing general indi- rect graphs. Information processing letters 1989, 31:7-15 42. Dobzhansky T Nothing in Biology Makes Sense Except in the Light of Evolution. American Biology Teacher 1973, 35:125-129 43. The Gene Ontology Consortium Gene ontology: tool for the uni- fication of biology. Nat Genet 2000, 25:25-29 44. Pruitt KD and Maglott DR RefSeq and LocusLink: NCBI gene- centered resources. Nucleic Acids Res 2001, 29:137-140 Publish with Bio Med Central and every scientist can read your work free of charge "BioMed Central will be the most significant development for disseminating the results of biomedical researc h in our lifetime." Sir Paul Nurse, Cancer Research UK Your research papers will be: available free of charge to the entire biomedical community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours — you keep the copyright BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 27 of 27 (page number not for citation purposes) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals

An automated method for finding molecular complexes in large protein interaction networks

BMC Bioinformatics , Volume 4 (1) – Jan 13, 2003

Loading next page...
 
/lp/springer-journals/an-automated-method-for-finding-molecular-complexes-in-large-protein-zUHtzEuDV5

References (86)

Publisher
Springer Journals
Copyright
Copyright © 2003 by Bader and Hogue; licensee BioMed Central Ltd.
Subject
Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
eISSN
1471-2105
DOI
10.1186/1471-2105-4-2
Publisher site
See Article on Publisher Site

Abstract

Background: Recent advances in proteomics technologies such as two-hybrid, phage display and mass spectrometry have enabled us to create a detailed map of biomolecular interaction networks. Initial mapping efforts have already produced a wealth of data. As the size of the interaction set increases, databases and computational methods will be required to store, visualize and analyze the information in order to effectively aid in knowledge discovery. Results: This paper describes a novel graph theoretic clustering algorithm, "Molecular Complex Detection" (MCODE), that detects densely connected regions in large protein-protein interaction networks that may represent molecular complexes. The method is based on vertex weighting by local neighborhood density and outward traversal from a locally dense seed protein to isolate the dense regions according to given parameters. The algorithm has the advantage over other graph clustering methods of having a directed mode that allows fine-tuning of clusters of interest without considering the rest of the network and allows examination of cluster interconnectivity, which is relevant for protein networks. Protein interaction and complex information from the yeast Saccharomyces cerevisiae was used for evaluation. Conclusion: Dense regions of protein interaction networks can be found, based solely on connectivity data, many of which correspond to known protein complexes. The algorithm is not affected by a known high rate of false positives in data from high-throughput interaction techniques. The program is available from ftp://ftp.mshri.on.ca/pub/BIND/Tools/MCODE. Background method that identified a functional protein complex Recent papers published in Science and Nature among oth- around the yeast protein Las17 that is involved in actin cy- ers describe large-scale proteomics experiments that have toskeleton rearrangement [10]. Here we extend the meth- generated large data sets of protein-protein interactions od to better apply it to the accumulating information in and molecular complexes [1–7]. Protein structure [8] and protein networks. gene expression data [9] is also accumulating at a rapid rate. Bioinformatics systems for storage, management, vis- Currently, most proteomics data is available for the model ualization and analysis of this new wealth of data must organism Saccharomyces cerevisiae, by virtue of the availa- keep pace. We previously published a simple graph theory bility of a defined and relatively stable proteome, full Page 1 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 genome clone libraries [11], established molecular biolo- undirected graph is used. Using this graph representation gy experimental techniques and an assortment of well de- of a biological system allows graph theoretic methods to signed genomics databases [12–14]. Using the be applied to aid in analysis and solve biological prob- Biomolecular Interaction Network Database (BIND – ht- lems. This graph theory approach has been used by other tp://www.bind.ca) [15] as an integration platform, we biomolecular interaction database projects such as DIP have collected 15,143 yeast protein-protein interactions [16], CSNDB [17], TRANSPATH [18], EcoCyc [19] and among 4,825 proteins (about 75% of the yeast pro- WIT [20] and is discussed by Wagner and Fell [21]. teome). Much larger data sets than this will eventually be available for other well studied model organisms as well Algorithms for finding clusters, or locally dense regions, as for the human proteome. These complex data sets of a graph are an ongoing research topic in computer sci- present a formidable challenge for computational biology ence and are often based on network flow/minimum cut to develop automated data mining analyses for knowl- theory [22,23] or more recently, spectral clustering [24]. edge discovery. To find locally dense regions of a graph, MCODE instead uses a vertex-weighting scheme based on the clustering co- , which measures 'cliquishness' of the neigh- Here we present the first report that uses a clustering algo- efficient, C rithm to identify molecular complexes in a large protein borhood of a vertex [25]. C = 2n/k (k -1) where k is the i i i i interaction network derived from heterogeneous experi- vertex size of the neighborhood of vertex i and n is the mental sources. Based on our previous observation that number of edges in the neighborhood (the immediate highly interconnected, or dense, regions of the network neighborhood density of v not including v). A clique is de- may represent complexes [10], the "Molecular Complex fined as a maximally connected graph. There is no stand- Detection" (MCODE) algorithm has been implemented ard graph theory definition of density, but definitions are and evaluated on our yeast protein interaction compila- normally based on the connectivity level of a graph. Den- tion using known molecular complex data from a recent sity of a graph, G = (V,E), with number of vertices, |V|, and systematic mass spectrometry study of the proteome [7] number of edges, |E|, is defined here as |E|; divided by the and from the MIPS database [13]. theoretical maximum number of edges possible for the graph, |E| . For a graph with loops (an edge connecting max back to its originating vertex), |E| = |V| (|V|+1)/2 and Predicting molecular complexes from protein interaction max data is important because it provides another level of for a graph with no loops, |E| = |V| (|V|-1)/2. So, den- max functional annotation above other guilt-by-association sity of G, D = |E|/|E| and is thus a real number rang- G max methods. Since sub-units of a molecular complex general- ing from 0.0 to 1.0. ly function towards the same biological goal, prediction of an unknown protein as part of a complex also allows The first stage of MCODE, vertex weighting, weights all increased confidence in the annotation of that protein. vertices based on their local network density using the highest k-core of the vertex neighborhood. A k-core is a MCODE also makes the visualization of large networks graph of minimal degree k (graph G, for all v in G, deg(v) manageable by extracting the dense regions around a pro- >= k). The highest k-core of a graph is the central most tein of interest. This is important, as it is now obvious that densely connected subgraph. We define here the term the current visualization tools present on many interac- core-clustering coefficient of a vertex, v, to be the density tion databases [15], originally based on the Sun Microsys- of the highest k-core of the immediate neighborhood of v tems embedded spring graph layout Java applet do not (vertices connected directly to v) including v (note that C scale well to large networks (http://java.sun.com/applets/ does not include v). The core-clustering coefficient is used jdk/1.1/demo/GraphLayout/example1.html). here instead of the clustering coefficient because it ampli- fies the weighting of heavily interconnected graph regions Algorithm while removing the many less connected vertices that are The MCODE algorithm operates in three stages, vertex usually part of a biomolecular interaction network, weighting, complex prediction and optionally post- known to be scale-free [6,21,26–29]. A scale-free network processing to filter or add proteins in the resulting com- has a vertex connectivity distribution that follows a power plexes by certain connectivity criteria. law, with relatively few highly connected vertices (high degree) and many vertices having a low degree. A given A network of interacting molecules can be intuitively highly connected vertex, v, in a dense region of a graph modeled as a graph, where vertices are molecules and edg- may be connected to many vertices of degree one (singly es are molecular interactions. If temporal pathway or cell linked vertex). These low degree vertices do not intercon- signalling information is known, it is possible to create a nect within the neighborhood of v and thus would reduce directed graph with arcs representing direction of chemi- the clustering coefficient, but not the core-clustering coef- cal action or direction of information flow, otherwise an ficient. The final weight given to a vertex is the product of Page 2 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 the vertex core-clustering coefficient and the highest k- ified seed is a part of. Typically, when analyzing complex- core level, k , of the immediate neighborhood of the es in a given network, one would find all complexes max vertex. This weighting scheme further boosts the weight of present (undirected mode) and then switch to the directed densely connected vertices. This specific weighting func- mode for the complexes of interest. The directed mode al- tion is based on local network density. Many other func- lows one to experiment with MCODE parameters to fine tions are possible and some may have better performance tune the size of the resulting complex according to exist- for this algorithm but these are not evaluated here. ing biological knowledge of the system. In directed mode, MCODE will first pre-process the input network to ignore The second stage, molecular complex prediction, takes as all vertices with higher vertex weight than the seed vertex. input the vertex weighted graph, seeds a complex with the If this were not done, MCODE would preferentially highest weighted vertex and recursively moves outward branch out to denser regions of the graph, if they exist, from the seed vertex, including vertices in the complex which could belong to separate, but denser complexes. whose weight is above a given threshold, which is a given Thus, a seed vertex for directed mode should always be the percentage away from the weight of the seed vertex. This is highest density vertex among the suspected complex. the vertex weight percentage (VWP) parameter. If a vertex There is an option to turn this pre-processing step off, is included, its neighbours are recursively checked in the which will allow seeded complexes to branch out into same manner to see if they are part of the complex. A ver- denser regions of the graph, if desired. tex is not checked more than once, since complexes can- not overlap in this stage of the algorithm (see below for a The time complexity of the entire algorithm is polynomial possible overlap condition). This process stops once no O(nmh ) where n is the number of vertices, m is the more vertices can be added to the complex based on the number of edges and h is the vertex size of the average ver- given threshold and is repeated for the next highest un- tex neighbourhood in the input graph, G. This comes seen weighted vertex in the network. In this way, the dens- from the vertex-weighting step. Finding a k-core in a graph est regions of the network are identified. The vertex weight proceeds by progressively removing vertices of degree < k threshold parameter defines the density of the resulting until all remaining vertices are connected to each other by complex. A threshold that is closer to the weight of the degree k or more, and is thus O(n ). The highest k-core is found by trying to find k-cores from one up until all verti- seed vertex identifies a smaller, denser network region around the seed vertex. ces have been found and cannot go beyond a number of steps equal to the highest degree in the graph. Thus, the The third stage is post-processing. Complexes are filtered highest k-core step is O(n ). Since this k-core step operates if they do not contain at least a 2-core (graph of minimum only on the neighbourhood of a vertex, the n in this case degree 2). The algorithm may be run with the 'fluff' op- is the number of vertices in the average neighbourhood of tion, which increases the size of the complex according to a vertex, h. The inner loop of the algorithm only operates a given 'fluff' parameter between 0.0 and 1.0. For every twice for every edge in the input graph, thus is O(2mh ). vertex in the complex, v, its neighbors are added to the The outer loop operates once on all vertices in the input complex if they have not yet been seen and if the neigh- graph, thus the entire time complexity of the weighting 3 3 borhood density (including v) is higher than the given stage is O(n2mh ) = O(nmh ). The complex prediction fluff parameter. Vertices that are added by the fluff param- stage is O(n) and the optional post-processing step can be ), where c is the number of complexes that eter are not marked as seen, so there can be overlap among up to O(cs predicted complexes with the fluff parameter set. If the al- were found in the previous step and s is the number of ver- gorithm is run using the 'haircut' option, the resulting tices in the largest complex - O(cs ) to find the 2-core once complexes are 2-cored, thereby removing the vertices that for each complex. are singly connected to the core complex. If both options are specified, fluff is run first, then haircut. Even though the fastest min-cut graph clustering algo- rithms are faster, at O(n logn) [30], MCODE has a Resulting complexes from the algorithm are scored and number of advantages. Since weighting is done once and ranked. The complex score is defined as the product of the comprises most of the time complexity, many algorithm complex subgraph, C = (V,E), density and the number of parameters can be tried, in O(n), once weighting is com- vertices in the complex subgraph (D × |V|). This ranks plete. This is useful when evaluating many different pa- larger more dense complexes higher in the results. Other rameters. MCODE is relatively easy to implement and scoring schemes are possible, but are not evaluated here. since it is local density based, has the advantage of a di- rected mode and a complex connectivity mode. These two MCODE may also be run in a directed mode where a seed modes are generally not useful in typical clustering appli- vertex is specified as a parameter. In this mode, MCODE cations, but are useful for examining molecular interac- only runs once to predict the single complex that the spec- tion networks. Additionally, only those proteins above a Page 3 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 given local density threshold are assigned to complexes. if not already seen v then call: MCODE-FIND-COM- This is in contrast to many clustering applications that PLEX(G, W, d, v) force all data points to be part of clusters, whether they truly should be part of a cluster or not. end for Pseudocode end procedure Stage 1: Vertex Weighting procedure MCODE-VERTEX-WEIGHTING Stage 3: Post-Processing (optional) procedure MCODE-FLUFF-COMPLEX input: graph: G = (V,E) input: graph: G = (V,E); vertex weights: W; for all v in G do fluff density threshold: d; complex graph: C = (U,F) N = find neighbors of v to depth 1 for all u in C do K = Get highest k-core graph from N if weight of u >d then add u to complex C k = Get highest k-core number from N end for d = Get density of K end procedure Set weight of v = k × d procedure MCODE-POST-PROCESS end for input: graph: G = (V,E); vertex weights: W; haircut flag: end procedure h; fluff flag: f; Stage 2: Molecular Complex Prediction fluff density threshold: t; set of predicted complex procedure MCODE-FIND-COMPLEX graphs: C input: graph: G = (V,E); vertex weights: W; for all c in C do vertex weight percentage: d; seed vertex: s if c not 2-core then filter if s already seen then return if h is TRUE then 2-core complex for all v neighbors of s do if f is TRUE then call: MCODE-FLUFF-COMPLEX(G, W, t, c) if weight of v > (weight of s)(1 - d) then add v to com- plex C end for call: MCODE-FIND-COMPLEX (G, W, d, v) end procedure end for Overall Process procedure MCODE end procedure input: graph: G = (V,E); vertex weight percentage: d; procedure MCODE-FIND-COMPLEXES haircut flag: h; fluff flag: f; fluff density threshold: t; input: graph: G = (V,E); vertex weights: W; set of predicted complex graphs: C vertex weight percentage: d call: W = MCODE-VERTEX-WEIGHTING (G) for all v in G do call: C = MCODE-FIND-COMPLEXES (G, W, d) Page 4 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 call: MCODE-POST-PROCESS (G, W, h, f, t, C) Evaluation of MCODE using the Gavin data set of protein interactions and complexes In this study, we wanted to use all forms of protein inter- end procedure action data available, which requires mixing of different Implementation types of experiments, such as yeast two-hybrid and co-im- MCODE has been implemented in ANSI C using the munoprecipitation. Two-hybrid results are inherently cross-platform NCBI Toolkit; http://www.nc- pairwise, whereas copurification results are sets of one or bi.nlm.nih.gov/IEB and the BIND graph library in the more identified proteins. For a copurification result, only SLRI Toolkit; http://sourceforge.net/projects/slritools. a set of size 2 can be directly considered a pairwise inter- Both of these source code libraries are freely available. The action, otherwise it must be modeled as a set of hypothet- actual MCODE source code is not yet freely available. The ical interactions. Biochemical copurifications can be MCODE program has been compiled and tested on UNIX, thought of as populations of complexes with some under- Mac OS X and Windows. Because a yeast gene name dic- lying pairwise protein interaction topology that is un- tionary is used to recognize input and generate output, the known from the experiment. In the general case of the MCODE executable currently only works for yeast pro- purification used by Gavin et al., one affinity tagged pro- teins in a user friendly manner. The algorithm, however is tein was used as bait to pull associated proteins out of a completely general, via the graph theory abstraction, to yeast cell lysate. The two extreme cases for the topology any graph and thus to any biomolecular interaction net- underlying the population of complexes from a single pu- work. MCODE binaries are available from ftp:// rification experiment are a minimally connected 'spoke' ftp.mshri.on.ca/pub/BIND/Tools/MCODE. model, where the data are modeled as direct bait-associat- ed protein pairwise interactions, and a maximally con- Results nected 'matrix' model, where the data are modeled as all Evaluation of MCODE proteins connected to all others in the set. The real topol- The evaluation of MCODE requires a set of experimentally ogy of the set of proteins must lie somewhere between determined biomolecular interactions and a set of associ- these two extremes. ated experimentally determined molecular complexes. Currently, the largest source for such data is for proteins Population of complexes: C = {b, c, d, e} (b = bait) from the budding yeast, Saccharomyces cerevisiae. Recently, a large-scale mass spectrometry study by Gavin et al [7] Spoke model hypothetical interactions: i = {b-c, b-d, b-e} provided a large data set of protein interactions with man- ually annotated molecular complexes. Also available are Matrix model hypothetical interactions; i = {b-b, b-c, b-d, the protein interaction and complex tables of MIPS [13] b-e, c-c, c-d, c-e, d-d, d-e, e-e} and YPD [14]. MCODE was used to automatically predict protein complexes in our collected protein-protein inter- Advantages of the spoke model are that it is biologically action data sets. Resulting complexes were then matched intuitive, biologists often represent their copurification re- to known molecular complexes from Gavin et al. (the sults in this manner, and is about 3 times more accurate Gavin benchmark) and the MIPS benchmark using an than the matrix model [31]. Disadvantages are that it overlap score. Parameter optimization was then used to could misrepresent interactions. The matrix model, alter- maximize the biological relevance of predicted complexes natively, cannot misrepresent interactions, as all possible according to the given benchmarks. YPD was not used as interactions are generated, but this is at the cost of gener- a current version could not be acquired. ating a large number of false interactions. Matrix topolo- gies are also physically implausible for larger complexes To ensure that MCODE is not unduly affected by the ex- because of increased possibility of steric clash if all subu- pected high false-positive rate in large-scale interaction nits are interacting with all others. Ultimately, the spoke data sets, large-scale and literature derived MCODE pre- model should be reasonable for use in evaluating dictions were compared. MCODE was then used to pre- MCODE. dict complexes in the entire set of machine readable protein-protein interactions that we could collect for Gavin et al. raw data from 588 biochemical purifications yeast. Complexes of interest were then further examined were represented using the spoke model, described above, using the directed mode and complex connectivity mode to get 3,225 hypothetical protein-protein interactions of MCODE. among 1,363 proteins for input to MCODE. A list of 232 manually annotated protein complexes based on the orig- inal purification data reported by Gavin et al. was filtered to remove five reported 'complexes' each composed of a single protein and six complexes of two or three proteins Page 5 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 that were already in the data set as part of a larger com- matched complexes declines with increasing ω threshold, plex. This yielded a filtered set of 221 complexes that were as shown in Figure 1. Interestingly, the average and maxi- used to evaluate MCODE, although some of these com- mum number of matched known complexes drops more plexes have significant overlap to other complexes in the quickly from zero until a ω threshold of 0.2 than from 0.2 set. to 0.9 indicating that many predicted complexes only have one or a few proteins that overlap with known com- To evaluate which parameter choice would allow auto- plexes. A ω threshold of 0.2 to 0.3 thus seems to filter out matic prediction of protein complexes from the spoke most predicted complexes that have insignificant overlap modeled Gavin et al. interaction set that best matched the with known complexes. manually annotated complexes, MCODE was run using all four possible combinations of the two Boolean param- Figure 2 shows the range of number of complexes predict- eters (haircut: true/false, fluff: true/false) over a full range ed and number of known complexes matched for the 0.2 of 20 vertex weight percentage (VWP) and fluff parameters ω threshold over all tried MCODE parameters. A y = x line (0 to 0.95 in 0.05 increments). During this parameter op- is also plotted to show that data points tend to be skewed timization process, MCODE was limited to find complex- towards a higher number of matched known complexes es of size two or higher. than predicted complexes because of the redundancy in the Gavin complex benchmark. Data points closest to the A scoring scheme was developed to determine how effec- upper right portion of the graph maximize both number tively an MCODE predicted complex matched a complex of matched known complexes and number of predicted from the benchmark set of complexes. In this case, the complexes. MCODE parameter combinations that result benchmark complex set was the Gavin et al. hand-anno- in these data points therefore optimize MCODE on this tated complex set. The overlap score was defined as ω = i / data set (according to the overlap score threshold). This a*b, where i is the size of the intersection set of a predicted result shows that the number of predicted complexes complex with a known complex, a is the size of the pre- should be similar to the number of matched known com- dicted complex and b is the size of the known complex. A plexes for a parameter choice to be reasonable, although protein is part of the intersection set only if it is present in the number of matched known complexes may be larger, both predicted and known complexes. Thus, a predicted again, because of some commonality among complexes complex that has no proteins in a known complex has ω in the benchmark set. The parameter combination corre- = 0 and a predicted complex that perfectly matches a sponding to the best data point (63,88) at an overlap known complex has ω = 1. Also, predicted complexes that score threshold of 0.2 is haircut = FALSE, fluff = TRUE, fully overlap, but are much larger or much smaller than VWP = 0.05 and a fluff density threshold between 0 and any known complexes will get a low ω. The overlap score 0.1. These parameter optimization results for MCODE of a predicted complex vs. a benchmark complex is then a over this data set were stable over a range of ω thresholds measure of biological significance of the prediction, as- up to 0.5. Above 0.5, the result was not stable as there suming that the benchmark set of complexes is biological- were generally too few predicted complexes with high ly relevant. The best parameter choice for MCODE on this overlap scores (Figure 1). protein interaction data set is one that predicts the largest set of complexes that match the largest number of bench- A specificity versus sensitivity analysis [32] was also per- mark complexes above a threshold ω. Since there is over- formed. Defining the number of true positives (TP) as the lap in the Gavin benchmark complex database, a number of MCODE predicted complexes with ω over a predicted complex may match more than one known threshold value and the number of false positives (FP) as complex with a high ω. the total number of predicted MCODE complexes minus TP. The number of false negatives (FN) equals the number To choose an overlap score that maximizes biological rel- of known benchmark complexes not matched by predict- evance of the predicted complexes without filtering away ed complexes. Sensitivity was defined as [TP/(TP+FN)] too many predictions, each of the 840 parameter combi- and specificity was defined as [TP/(TP+FP)]. The MCODE nations tested during the parameter optimization stage. parameter choice that optimizes both specificity and sen- The number of MCODE predicted complexes was plotted sitivity is the same as from the above analysis. The optimal against the number of matched known complexes over a sensitivity of this analysis was ~0.31 and the correspond- range of ω thresholds from 'no threshold' to 0.1 to 0.9 (in ing specificity was ~0.79. 0.1 increments). If no ω threshold is used, a predicted complex only needs at least one protein in common with The 63 MCODE predicted complexes only matched 88 of a known complex to be considered a match. If predicted the 221 complexes in the known data set indicating that and known complexes are only counted as a match when MCODE could not recapitulate the majority of the Gavin their ω is above a specific threshold, the number of complex benchmark solely using protein connectivity Page 6 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Overlap Score Threshold Avg. Pred Avg. Matched Known Max. Pred Max. Known Figure 1 Effect of Overlap Score Threshold on Number of Predicted and Matched Known Complexes for the Gavin Evaluation Figure legend: Average and maximum number of predicted and matched known complexes seen during MCODE parameter optimization (840 parameter combinations) plotted as a function of overlap score threshold. As the stringency for the closeness that a predicted complex must match a known complex is increased (increase in overlap score), fewer predicted complexes match known complexes. Note that these curves do not correspond to the best parameter set, but rather are an average of results from all tried parameter combinations. information. As mentioned above, there are more model. For example, Cdc3 was used as a bait to co-immu- matched complexes than predicted because of some re- noprecipitate Cdc10, Cdc11, Cdc12 and Ydl225w. A com- dundancy in the benchmark. This low sensitivity is not plex was annotated as containing these five proteins, but surprising, since many of the hand-annotated complexes only Cdc3 was used as bait. If more elements of a complex were created directly from single co-immunoprecipitation are used as baits, the proteins become more interconnect- results, which are not highly interconnected in the spoke ed and more readily predicted by MCODE. A good exam- Page 7 of 27 (page number not for citation purposes) Number of Matched Complexes BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 (63,88) 0 10 2030 4050 6070 Number of MCODE Predicted Complexes Above Overlap Score 0.2 Figure 2 Number of Predicted and Matched Known Complexes at Overlap Score Threshold of 0.2 Figure legend: Number of known complexes matched to MCODE predicted complexes plotted against number of MCODE predicted complexes, both with an overlap score above 0.2. ple of this is the Arp2/3 complex, which is highly baits by Gavin et al. and the resulting benchmark complex conserved in eukaryotes and is involved in actin cytoskel- included the five extra proteins that MCODE also predict- eton rearrangement. The structure of this complex is ed (Nog2, Pfk1, Prt1, Cct8 and Cct5) that are not in the known by X-ray crystallography [33] thus actual protein- crystal structure. Cct5 and Cct8 are known to be involved protein interactions from the structure can be matched up in actin assembly, but Nog2, Pfk1 and Prt1 are not. These to the co-immunoprecipitation results. MCODE predicted extra proteins likely represent non-specific binding in the all seven components of the Arp2/3 complex crystal struc- experimental approach. These two cases are shown dia- ture and five extra proteins using the optimized parame- grammatically in Figure 3. Interestingly, using the haircut ters. Six out of the seven Arp2/3 subunits were used as parameter would remove all five extra proteins that are Page 8 of 27 (page number not for citation purposes) Number of Matched Known Complexes Above Overlap Score 0.2 BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 A B C Pfk1 Prt1 Nog2 Cdc12 Cdc11 Arc18 Arc18 Arp3 Arp2 Cdc3 Cct8 Arp3 Arp2 Arc35 Cdc10 Cct5 Ydl225w Arc40 Arc35 Arc40 Arc19 Arc15 Arc19 Arc15 Figure 3 Examples of Gavin Benchmark Complexes Missed and Hit by MCODE Figure legend: Protein complexes are repre- sented as graphs using the spoke model. Vertices represent proteins and edges represent experimentally determined interac- tions. Blue vertices are baits in the Gavin et al. study. A) A Cdc3 complex hand-annotated by Gavin et al. that was missed by MCODE because of a lack of connectivity information among sub-components. This complex annotation was the result of a single co-immunoprecipitation experiment. B) The Arp2/3 complex as annotated by Gavin et al. and as found by MCODE with parameters optimized to the data set. Note the five extra proteins that have minimal connectivity to main cluster. C) The pro- tein connection map seen from the crystal structure of the Arp2/3 complex. The crystal structure is from Bos taurus (cow), but is assumed to be very similar to yeast based on very high similarity between cow and yeast Arp2/3 subunits. not in the crystal structure, leaving only the seven that are ed at high ω thresholds, but generally reduced the number present. This shows that while the parameter optimiza- of matched known complexes at low ω thresholds (0 to tion allows maximum matching of the hand-annotated 0.1) compared to haircut = FALSE. Since the haircut = known complexes, these complexes may not all be physi- TRUE option removes less-connected proteins on the ologically relevant and thus another parameter set may fringe of a predicted complex and this reduces the number better predict 'real' complexes. of predicted complexes with low overlap scores, these fringe proteins likely contribute to low-level overlap (<0.2 To explore the effect of certain MCODE parameters on re- ω) of the known complexes. sulting predicted complexes, various features of these complexes were examined while changing specific param- We also investigated the effect of changing the fluff densi- eters and keeping all else constant. Linearly increasing the ty threshold when setting fluff = TRUE on the number of VWP parameter increased the size of the predicted com- matched benchmark complexes. Linearly increasing the plexes exponentially while reducing the number of com- fluff density threshold in the MCODE post-processing plexes predicted in a linear fashion. Figure 4 shows this step linearly decreased the number of matched complexes effect with both fluff and haircut parameters turned off. At above an overlap score of 0.2. high VWP values, very large complexes were predicted and Evaluation of MCODE using MIPS data set of protein in- these encompassed most of the data set, thus were not very useful. teractions and complexes Since the Gavin et al. data set was developed by only one Because using haircut = TRUE would have led MCODE to group using a single experimental method, it may not ac- predict the Arp2/3 complex perfectly (according to the curately represent protein complex knowledge for yeast. crystal structure as discussed above), we examined if the The MIPS protein complex catalogue http://mips.gsf.de/ haircut parameter has any general effect on the number of proj/yeast/catalogues/complexes/ is a curated set of 260 matched predicted complexes. Setting haircut = TRUE had protein complexes for yeast that was compiled from the no significant effect on the number of complexes predict- literature and is thus a more realistic data set comprised of Page 9 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Vertex Weight Percentage Threshold Number of Complexes Average Complex Size Largest Complex Size Figure 4 Effect of Vertex Weight Percentage Parameter on Predicted Complex Size Figure legend: As the vertex weight percentage (VWP) parameter of MCODE is increased, the number of predicted complexes steadily decreases and the average and largest size of predicted complexes increases exponentially. The y-axis follows a logarithmic scale. For reference, the aver- age and maximum size of the MIPS benchmark complexes are 6 and 81, respectively and of the Gavin benchmark complexes are 11.8 and 88, respectively. varied experiments from many labs using different tech- able public resource for yeast protein complexes that we niques. After filtering away 50 'complexes' each composed are aware of. of a single protein and 2 highly similar complexes, we were left with 208 complexes for the MIPS known set. This MCODE was run again with a full combination of param- set did not include information from the recent large-scale eters, this time over a set of 9088 protein-protein interac- mass spectrometry studies [6,7]. While the MIPS complex tions among 4379 proteins which did not include the catalogue may be incomplete, it is currently the best avail- recent large-scale mass spectrometry studies but included Page 10 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 all interactions from the MIPS, YPD and PreBIND MCODE predictions on the high-throughput data sets, databases as well as from the majority of large-scale yeast termed 'Gavin Spoke', 'Y2H' and 'HTP only' (see Meth- two-hybrid experiments to date [2–4,10,34]. This interac- ods), are about as specific as the literature derived interac- tion set is termed 'Pre HTMS'. All of the interactions in this tion data set, but not as sensitive (Figure 6A). MCODE set were published before the last update specified on the predictions on interaction data sets containing the litera- MIPS protein complex catalogue and many are included ture derived benchmark, labelled 'Benchmark', 'Pre in the MIPS protein interaction table, thus we assumed HTMS' and 'AllYeast', are generally more sensitive and that the MIPS complex catalogue took into account the in- specific than those containing just the large-scale interac- formation in the known interaction table. Protein com- tion sets. Since the specificity drops from Benchmark to plexes found by MCODE in this set were compared to the Pre HTMS to AllYeast, with increasing amounts of large- MIPS protein complex catalogue to evaluate how well scale data, it could be argued that addition of this data MCODE performed at locating protein complexes ab negatively affects MCODE. However, large-scale data is initio. known to contain a high number of false positives, so it should be expected that these false-positives would not The same evaluation of MCODE that was done using the randomly contribute to the formation of dense regions, Gavin et al. data set was performed with the MIPS data set. which are highly unlikely to occur by chance (see below). From this analysis, including specificity versus sensitivity More complexes should be predicted with the addition of plots (optimized sensitivity = ~0.27 and specificity = the large-scale data, assuming this data explores previous- ~0.31), the MIPS complex benchmark optimized parame- ly unseen regions of the interactome, but the high number ters were haircut = TRUE, fluff = TRUE, VWP = 0.1 and a of false-positives should limit the amount of new com- fluff density threshold of 0.2. This result was stable up to plexes compared to the amount of added interactions. The a ω threshold of 0.6 after which it was difficult to evaluate MIPS complex benchmark used here is not expected to the results, as there were generally too few predicted com- contain complexes newly found in large-scale studies, ex- plexes above the high ω thresholds. This parameter com- plaining the decrease in specificity. This is exactly what oc- bination led MCODE to predict 166 complexes of which curs in our analysis. In an effort to further test the effect of 52 matched 64 MIPS complexes with a ω of at least 0.2. large-scale data on MCODE prediction performance, the Examining the ω distribution for this parameter set reveals Benchmark interaction data set was augmented with the that, even though this prediction is optimized, most of the addition of interactions from large-scale experiments that predicted complexes don't show overlap to those in the only connect proteins in the Benchmark set with each oth- known MIPS set (Figure 5). The complexes predicted here er. Over 3100 interactions were added to the Benchmark are also different from those predicted from the Gavin in- data set to create a set of over 6400 interactions. MIPS teraction data. Nine complexes have an overlap score complex benchmark optimised MCODE predicted 52 above 0.2 between these two sets, with the highest overlap complexes matching 66 MIPS benchmark complexes, al- score being 0.43 and all the rest being below 0.27. This most exactly the same number of complexes found using might signify that either the MIPS complex catalogue is the Benchmark set by itself (Table 1). These analyses not complete, that there is not enough data in the dataset strongly suggest the addition of large-scale experimentally that MCODE was run on, or a human annotated defini- derived interactions does not unduly affect the prediction tion of a complex does not perfectly match with a graph of complexes by MCODE. density based definition. It can be seen from Figure 6B that the Gavin complex The effect of the VWP parameter on complex size and of benchmark set is biased towards the Gavin et al. spoke the haircut and fluff parameters on number of matched modeled interaction data. This is expected and is the main complexes was very similar to that seen when evaluating reason why the less biased MIPS complex set is used MCODE on the Gavin complex benchmark. throughout this work as a benchmark instead of the Gavin set. Effect of data set properties on MCODE Since many large-scale protein interaction data sets from Since the result of a co-immunoprecipitation experiment yeast are known to contain a high level of false positives is a set of proteins, which we model as binary interactions [35], we examined the effect these might have on MCODE using the spoke method, we wished to evaluate whether predictions. Sensitivity vs. specificity was plotted for this affects complex prediction compared to an experi- MCODE predictions, with parameters chosen to maxi- mental system that generates purely binary interaction re- mize these values at ω threshold of 0.2 against the MIPS sults, such as yeast two-hybrid. As can be seen in Table 1, and Gavin complex benchmarks for the various data sets MCODE does find known complexes in the 'Y2H' set of (Figure 6). only yeast two-hybrid results, thus this set does contain dense regions that are known protein complexes. This Page 11 of 27 (page number not for citation purposes) 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 MCODE Predicted Complex Number Pre HTMS AllYeast BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Highest Overlap Score Figure 5 Overlap Score Distributions of Pre HTMS and AllYeast interaction sets with MIPS Complex Benchmark Opti- mized MCODE Parameter Sets Figure legend: The number of MCODE predicted complexes in the pre-large scale mass spectrometry (Pre HTMS) and AllYeast protein-protein interaction sets with a given overlap score threshold compared to the MIPS benchmark complex set is shown. The majority of predicted complexes have an overlap score of zero meaning that they had no overlap with the catalogue of known MIPS protein complexes. Page 12 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 0.9 0.8 0.7 0.6 0.5 Gavin Spoke Benchmark 0.4 Pre HTMS 0.3 Y2H 0.2 AllYeast HTP Only 0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sensitivity Gavin Spoke 0.9 (self) 0.8 0.7 0.6 0.5 0.4 HTP Only 0.3 AllYeast Benchmark 0.2 0.1 Pre HTMS Y2H 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sensitivity Figure 6 Sensitivity vs. Specificity Plots of MCODE Results Among Various Data Sets Figure legend: Specificity is plotted ver- sus sensitivity of the best MCODE results at an overlap score above 0.2 against both the MIPS (Panel A) and Gavin (Panel B) complex benchmarks. Panel A shows that there are no large inherent differences among interaction data sets resulting from significantly different experimental methods (data set: sensitivity, specificity; Y2H:0.10,0.27; Benchmark:0.29,0.36; HTP Only:0.14;0.24; Pre HTMS:0.27,0.31; AllYeast:0.27,0.26; Gavin Spoke:0.10,0.38). Panel B shows that the Gavin benchmark is expectedly biased towards the Gavin interaction data set and thus should not be used as a general benchmark (data set: sensi- tivity, specificity; Y2H:0.03,0.10; Benchmark:0.11,0.16; HTP Only:0.24;0.33; Pre HTMS:0.10,0.13; AllYeast:0.27,0.26; Gavin Spoke:0.31,0.79). Page 13 of 27 (page number not for citation purposes) Specificity Specificity BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Table 1: Summary of MCODE Results with Best Parameters on Various Data Sets. Data Set Number of Number of Number of Predicted MCODE Com- Matched Complex Best MCODE Proteins Interact-ions Complexes plexes Pre- Benchmark Benchmark Parameters dicted Above ω Complexes = 0.2 Gavin Spoke 1363 3225 82 63 88 Gavin hFfT\0.05\0.05 Gavin Spoke 1363 3225 53 20 20 MIPS hTfT\0.1\0.35 Pre HTMS 4379 9088 158 21 28 Gavin hTfT\0\0.2\ Pre HTMS 4379 9088 166 52 64 MIPS hTfT\0.1\0.2 AllYeast 4825 15143 209 52 76 Gavin hFfT\0\0.1 AllYeast 4825 15143 209 54 63 MIPS hTfT\0\0.1 AllYeast 4825 15143 203 80 150 MIPS+Gavin hTfT\0\0.15\ Benchmark 1762 3310 141 23 30 Gavin hTfT\0\0.3 Benchmark 1762 3310 163 58 67 MIPS hTfT\0.1\0.05 HTP Only 4557 12249 138 46 77 Gavin hTfT\0.05\0.1 HTP Only 4557 12249 122 29 35 MIPS hTfT\0.05\0.15 Y2H 3847 6133 73 7 7 Gavin hTfT\0.2\0.1 Y2H 3847 6133 78 21 26 MIPS hTfT\0\0.1 Statistics and a summary of results are shown for the various data sets used to evaluate MCODE. 'Gavin Spoke' is the Gavin et al. data set repre- sented as binary interactions using the spoke model; 'Pre HTMS' is the set of all yeast interaction not including the recent high-throughput mass spectrometry studies [6,7].; 'AllYeast' is the set of all yeast interactions that we could collect; 'Benchmark' is a set of interactions found in the liter- ature from YPD, MIPS and PreBIND; 'HTP Only' is the combination of all large-scale and high-throughput yeast two-hybrid and mass spectrometry data sets; 'Y2H' is the set of all yeast two-hybrid results from large-scale and literature sources. See Methods for full explanation of data sets. The 'Best MCODE Parameters' are formatted as haircut True of False, fluff True or False\VWP\Fluff Density Threshold Parameter. being said, the Y2H set is the least dense of all data sets ex- should be further studied using MCODE in directed mode amined here so is expected to have less dense regions of by specifying a seed vertex and trying different parameters the network and thus less MCODE predictable complexes to examine how large a complex can get before seemingly per protein present in the set. MCODE predicts a similar biologically irrelevant proteins are added (see below). amount of complexes as well as finding a similar amount of known complexes in the Y2H and Gavin Spoke data Figure 5 shows that even when a large set of interactions sets indicating that these data sets are not significantly dif- is used as input to MCODE, most of the MCODE predict- ferent from each other in the amount of dense network re- ed complexes do not match well with known complexes gions that they contain, even though they are different in MIPS. The complex size distribution of MCODE pre- sizes. Taken together, the latter results and those in Figure dicted complexes matches the shape of the MIPS set, but 6B show that the spoke model is a reasonable representa- the MCODE complexes are on average larger (Average tion of the Gavin et al. tandem affinity purification data. MIPS size = 6.0, Average MCODE Predicted size = 9.7). The average number of YPD and GO functional annota- Predicting complexes in the Yeast interactome tion terms per protein in an MCODE predicted complex is Given that MCODE performed reasonably well on test da- similar to that of MIPS complexes (Table 2). This seems to ta, we decided to predict complexes in a much larger indicate that MCODE is predicting complexes that are network. All machine-readable protein-protein interac- functionally relevant. Also, closer examination of the top, tion data from various data sets [2–7,10,13,14]. were col- middle and bottom five scoring MCODE complexes lected and integrated to form a non-redundant set of shows that MCODE can predict biologically relevant com- 15,143 experimentally determined yeast protein interac- plexes (Table 3). tions encompassing 4,825 proteins, or approximately three quarters of the proteome. This set was termed 'AllY- Many of the 209 predicted complexes are of size 2 (9 pre- east'. MCODE was parameter optimized, as above, using dicted complexes) or 3 (54 predicted complexes). Com- the MIPS benchmark. The best resulting parameter set was plexes of this size may not be significant since it is easy to haircut = TRUE, fluff = TRUE, VWP = 0 and a fluff density create high density subgraphs of size 2 or 3, but becomes threshold of 0.1. With these parameters, MCODE combinatorially more difficult to randomly create high predicted 209 complexes, of which 54 matched 63 MIPS density subgraphs as the size of the subgraph increases. To benchmark complexes above an overlap score of 0.2 (see examine the relevance of these small predicted complexes Additional file 1). Complexes found in this manner of size 2 or 3, we calculated the sensitivity and specificity Page 14 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Table 2: Average Number of YPD and GO Annotation Terms in Complex Sets. Data Set YPD Functions YPD Roles GO Components GO Processes MCODE on All Yeast 0.58 0.89 0.39 0.59 Interactions MIPS Complex Database 0.50 0.75 0.39 0.48 MCODE Random Model 0.72 1.24 0.52 0.85 (100 AllYeast network permutations) The average number of YPD and GO functional annotation terms per protein in an MCODE predicted complex is shown for MCODE predicted complexes on the AllYeast set, the MIPS complex database and the MCODE random model. A lower number indicates that the complexes from a set contain more functionally related proteins (or unannotated proteins). In the cases of multiple annotation, all terms are taken into account. Even though there are multiple annotation terms per protein and a variable amount of unannotated proteins per complex, these numbers should perform well in relative comparisons based on the assumption that the distribution of the latter two factors is similar in each data set. of the optimized MCODE predictions against the MIPS Significance of MCODE predictions complex benchmark while disregarding the small com- Naïvely, the chance of randomly picking a known protein plexes. First, complexes of size 2, then of size 3, were re- complex from a protein interaction network depends on moved from the optimized MCODE predicted complex the size of the complex and the network. It is easier to pick set. Removing each of these sets independently resulted in out a smaller known complex by chance from a smaller only small sensitivity and specificity changes. Because network. For instance, in our network of 15,143 interac- both sets overlap the MIPS benchmark, small complexes tions among 4,825 proteins, the chance of picking a spe- have been reported as predictions. Also, because MCODE cific known complex of size three is about one in 1.9 × found these small complexes in regions of high local den- 10 (4,825 choose 3). A more realistic model would as- sity, they may be good cores for further examination with sume that the proteins are connected and thus would only MCODE in directed mode, especially since the haircut op- consider complex choices of size three where all three tion was turned on here to produce them. proteins are connected. The number of choices now de- pends on the topology of the network. In our large net- Complexes that are larger and denser are ranked higher by work, there are 6,799 fully connected subnetworks of size MCODE and these generally correspond to known com- three and 313,057 subnetworks of size three with only plexes (see below). Interestingly, some MCODE two interactions (from the triadic census feature of Pajek). complexes contain unknown proteins that are highly con- Thus now our chance of picking a more realistic complex -6 nected to known complex subunits. For example, the sec- is one out of 319,856 (1/(6,799 + 313,057) = 3.1 × 10 ). ond highest ranked MCODE complex is involved in RNA As the size of the complex increases, the number of possi- processing/modification and contains the known polya- ble complex topologies increases exponentially and, in a denylation factor I complex (Cft1, Cft2, Fip1, Pap1, Pfs2, connected network of some reasonable density, so does Pta1, Ysh1, Yth1 and Ykl059c). Seven other proteins in- the number of possible subgraphs that could represent a volved in mainly RNA processing/modification (Fir1, complex. The density of our large protein interaction net- Hca4, Pcf11, Pti1, Ref2, Rna14, Ssu72) and protein degra- work is 0.0013 and is mostly connected (4,689 proteins dation (Uba2 and Ufd1) are highly connected within this are in one connected component). Thus, it is expected predicted complex. Two unknown proteins Pti1 and that if a complex is found in a network with MCODE that Yor179c are highly connected to RNA processing/modifi- matches a known complex, that the result would be highly cation proteins and are therefore likely involved in the significant. To understand the significance of complex same process (Figure 7). Pti1 may be an unknown compo- prediction further, the topology of the protein interaction rd nent of the polyadenylation factor I complex. The 23 network would have to be understood in general, in order highest ranked predicted complex is interesting in that it to build a null model to compare against. is involved in cell polarity and cytokinesis and contains two proteins of unknown function, Yhr033w and Recent research on modeling complex systems [21,25,27] Yal027w. Yal027w interacts with two kinases, Gin4 and has found that networks such as the world wide web, met- Kcc4, which in turn interact with the components of the abolic networks [26] and protein-protein interaction net- Septin complex (Cdc3, Cdc10, Cdc11 and Cdc12) (Figure works [36] are scale-free. That is, the connectivity 8). distribution of the vertices of the graph follows a power law, with many vertices of low degree and few vertices of high degree. Scale-free networks are known to have large Page 15 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Yor179c RNA processing/modification Pol II transcription Protein modification Protein degradation Pti1 Unknown Ssu72 Ref2 Yth1 Pta1 Fir1 Cft1 Pfs2 Hca4 Ykl059c Pap1 Uba2 Cft2 Fip1 Ysh1 Rna14 Ufd1 Pcf11 Figure 7 The Second Highest Ranked MCODE Predicted Complex is Involved in RNA Processing and Modification . Fig- ure legend: This complex incorporates the known polyadenylation factor I complex (Cft1, Cft2, Fip1, Pap1, Pfs2, Pta1, Ysh1, Yth1 and Ykl059c) and contains other proteins highly connected to this complex, some of unknown function. The fact that the unknown proteins (Yor179c and Pti1) connect more to known RNA processing/modification proteins than to other proteins in the larger data set likely indicates that these proteins function in RNA processing/modification. This complex was ranked second by MCODE from the predicted complexes in the AllYeast interaction set. clustering coefficients, or clustered regions of the graph. In To test the significance of clustered regions in biological biological networks, at least in yeast, these clustered re- networks, 100 random permutations of the large set of all gions seem to correspond to molecular complexes and 15,143 yeast interactions were made. If the graph to be these subgraphs are what MCODE is designed to find. randomised is considered as a set of edges between two Page 16 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Cell Organization/Biogenesis General Metabolism RNA Processing/Localization Smd3 She3 Cell Cycle DNA Damage Response/Repair Unknown Spa2 Gic2 Nfi1 Bem4 Cdc42 Spr28 Gic1 Zds2 Top2 Cdc12 Pdi1 Cdc11 Cla4 Bub2 Yhr033w Elm1 Cdc10 Shs1 Gin4 Acc1 Swe1 Cdc3 Kcc4 Rvb2 Yal027w Msh6 Aat2 Hsl1 Figure 8 An MCODE Predicted Complex Involved in Cytokinesis Figure legend: This predicted complex incorporates the known Septin complex (Cdc3, Cdc10, Cdc11 and Cdc12) involved in cytokinesis and other cytokinesis related proteins. The Yal027w protein is of unknown function, but likely functions in cell cycle control according to this figure, possibly in cytokine- rd sis. This complex was ranked 23 by MCODE from the predicted complexes in the AllYeast interaction set. Page 17 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Table 3: Statistics for Top, Middle and Bottom Five Scoring Optimized MCODE Predicted Complexes Found in All Known Yeast Protein Interaction Data Set Complex Score Proteins Interactions Density Cell Role Cell Localization Rank 1 10.04 46 236 0.22 RNA processing/ Nuclear modification and protein degradation (26S Proteasome) Protein names Dbf2,Ecm29,Gcn4,Hsm3,Hyp2,Lhs1,Mkt1,Nas6,Pre1,Pre2,Pre4,Pre5,Pre6, Pre7,Pre8,Pre9,Pup3,Rad23,Rad24,Rad50,Rfc3,Rfc4,Rpn1,Rpn10,Rpn11, Rpn12,Rpn13,Rpn3,Rpn4,Rpn5,Rpn6,Rpn7,Rpn8,Rpn9,Rpt1,Rpt2,Rpt3,Rpt4, Rpt5,Rpt6,Scl1,Ubp6,Ura7,Ygl004c,Yku70,Ypl070w 2 9 19 90 0.51 RNA processing/ Nuclear modification Protein names Cft1,Cft2,Fip1,Fir1,Hca4,Mpe1,Pap1,Pcf11,Pfs2,Pta1,Pti1,Ref2,Rna14,Ssu72, Uba2,Ufd1,Yor179c,Ysh1,Yth1 3 7.72 56 220 0.14 Pol II transcription Nuclear Protein names Ada2,Adr1,Ahc1,Cdc23,Cdc36,Epl1,Esa1,Fet4,Fun19,Gal4,Gcn5,Hac1,Hfi1, Hhf2,Hht1,Hht2,Ire1,Luc7,Med7,Myo4,Ngg1,Pcf11,Pdr1,Prp40,Rna14,Rpb2, Rpo21,Sap185,Sgf29,Sgf73,Spt15,Spt20,Spt3,Spt7,Spt8,Srb6,Swi5,Taf1,Taf10, Taf11,Taf12,Taf13,Taf14,Taf2,Taf3,Taf5,Taf6,Taf7,Taf8,Taf9,Tra1,Ubp8, Yap1,Yap6,Ybr270c,Yng2 4 7.58 18 72 0.44 Cell cycle control, Nuclear protein degradation, mitosis (Anaphase Promoting Complex) Protein names Apc1,Apc11,Apc2,Apc4,Apc5,Apc9,Cdc16,Cdc23,Cdc26,Cdc27,Dmc1,Doc1, Leu3,Rpt1,Sic1,Spc29,Spt2,Ybr270c 5 7 15 56 0.52 Vesicular transport Golgi (TRAPP Complex) Protein names Bet1,Bet3,Bet5,Fks1,Gsg1,Gyp6,Kre11,Sec22,Trs120,Trs130,Trs20,Trs23, Trs31,Trs33,Uso1 102 3 3 3 1 RNA splicing Nuclear Protein names Msl5,Mud2,Smy2 103 3 3 3 1 Signal transduction, Nuclear Cell cycle control, DNA repair, DNA synthesis Protein names Ptc2,Rad53,Ydr071c 104 3 3 3 1 Cell cycle control, Uknown mating response Protein names Far3,Vps64,Ynl127w 105 3 3 3 1 Chromatin/chromo- Nuclear some structure Protein names Gbp2,Hpr1,Mft1 106 3 3 3 1 Pol II transcription Nuclear Protein names Ctk1,Ctk2,Ctk3 205 2 3 4 1 Vesicular transport ER Protein names Rim20,Snf7,Vps4 Page 18 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Table 3: Statistics for Top, Middle and Bottom Five Scoring Optimized MCODE Predicted Complexes Found in All Known Yeast Protein Interaction Data Set (Continued) 206 2 3 4 1 Protein Cytoplasmic translocation Protein names Srp14,Srp21,Srp54 207 2 3 4 1 Protein Cytoplasmic translocation Protein names Srp54,Srp68,Srp72 208 2 3 4 1 Energy generation Mitochondrial Protein names Atp1,Atp11,Atp2 209 2 4 5 0.67 Nuclear-cytoplas- Varied mic and vesicular transport Protein names Kap123,Nup145,Sec7,Slc1 Score is defined as the product of the complex subgraph density and the number of vertices (proteins) in the complex subgraph (DC × |V|). This ranks larger more dense complexes higher in the results. Density is calculated using the "loop" formula if homodimers exist in the complex, other- wise the "no loop" formula is used. The cell role column is a manual combination of annotation terms for the proteins reported in the complex. vertices (v , v ), a network permutation is made by all yeast interactions are highly unlikely to occur by 1 2 randomly permuting the set of all v vertices. The random chance. networks have the same number of edges and vertices as the original network and follow a power-law connectivity To evaluate the effectiveness of our scoring scheme, which distribution, as do the original data sets [37]. Running scores larger, more dense complexes higher than smaller, MCODE with the same parameters as the original network more sparse complexes, we examined the accuracy of (haircut = TRUE, fluff = TRUE, VWP = 0 and a fluff density MCODE predictions at various score thresholds. As the threshold of 0.1) on the 100 random networks resulted in score threshold for inclusion of complexes is increased, an average of 27.4 (SD = 4.4) complexes per network. The less complexes are included, but a higher percentage of size distribution of complexes found by MCODE did not the included complexes match complexes in the bench- match that of the complexes found in the original net- mark. This is at the expense of sensitivity as many bench- work, as some complexes found in the random networks mark matching complexes are not included at higher were composed of >1500 proteins. One random network score thresholds (Figure 9). For example, of the ten pre- that had an approximately average number of predicted dicted complexes with MCODE score greater or equal to complexes (27) was parameter optimized using the MIPS six, nine match a known complex in either the MIPS or benchmark to see how parameter choice affects the size Gavin benchmark above a 0.2 threshold overlap score, distribution and number of predicted complexes. Param- yielding an accuracy of 90%. 100% of the five complexes eters of haircut = TRUE, fluff = TRUE, VWP = 0.1 and a that had an MCODE score better or equal to seven fluff density threshold of zero produced the maximal matched known complexes. Thus, complexes that score number of 81 complexes for this network, but these com- highly on our simple density based scoring scheme are plexes were composed of on average 27 proteins (without very likely to be real. counting an outlier complex of size 1961), which is much larger than normal (e.g. larger than the MIPS set average Directed mode of MCODE of 6.0). None of these predicted complexes matched any To simulate an obvious example where the directed mode MIPS complexes above an overlap score of 0.1. Also, the of MCODE would be useful, MCODE was run with re- random network complexes had a much higher average laxed parameters (haircut = TRUE, fluff = TRUE, VWP = number of YPD and GO annotation terms per protein per 0.05 and a fluff density threshold of 0.2) compared to the complex than for MIPS or MCODE on the original best parameters on the AllYeast network. The resulting network (Table 2). This indicates, as expected, that the fourth highest ranked complex, when visualized, shows random network complexes are composed of a higher lev- two clustered components and represents two protein el of unrelated proteins than complexes in the original complexes, the proteasome and an RNA processing com- network. Thus, the number, size and functional composi- plex, both found in the nucleus (Figure 10). This is an tion of complexes that MCODE predicts in the large set of example of where a lower VWP parameter would have been superior since it would have divided this large Page 19 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 100% 5 2 75% 50% 25% 0% 012345 6789 10 Complex Score Threshold Figure 9 Effect of Complex Score Threshold on MCODE Prediction Accuracy Figure legend: MCODE complexes equal to or greater than a specific score were compared to a benchmark comprising the combined MIPS and Gavin benchmarks. Accuracy was calculated as the number of known complexes better or equal to the threshold score divided by the total number of pre- dicted complexes (matching and non-matching) at that threshold. A complex was deemed to match a known complex if it had an overlap score above 0.2. The number of predicted complexes that matched known complexes at each score threshold is shown as labels on the plot. complex into two more functionally related complexes. 11A). Using this VWP parameter, combinations of haircut The highest weighted vertices in the center of each of the and fluff parameters were used to further expand the core two dense regions in Figure 10 are the Rpt1 and Lsm4 pro- complex. This process was stopped when the predicted teins. MCODE was run in directed mode starting with complexes began to include proteins of sufficiently these two proteins over a range of VWP parameters from different known biological function to the seed vertex. 0 to 0.2, at 0.05 increments. For Lsm4, the parameter set Proteins, such as Vam6 and Yor320c were included in the of haircut = TRUE, fluff = FALSE, VWP = 0 was used to find complex at moderate fluff parameters (0.4–0.6), but not a core complex, which contained 9 proteins fully connect- at higher fluff parameters, and these are known to be lo- ed to each other (Dcp1, Kem1, Lsm2, Lsm3, Lsm4, Lsm5, calized in membranes outside of the nucleus, thus are Lsm6, Lsm7 and Pat1). Above this VWP parameter, the likely not functionally related to the Lsm complex pro- core complex branched out into proteasome subunit pro- teins. Therefore, the 9 proteins listed above were decided teins, which are not part of the Lsm complex (see Figure Page 20 of 27 (page number not for citation purposes) Accuracy BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Chromatin/chromosome structure Protein degradation Cell cycle control Dop1 RNA processing/modification,RNA splicing Snu23 Unknown Bem3 Dib1 Hsh155 Nog2 Sec21 Snu66 Gcn4 Pre9 Pre4 Scl1 Whi2 Mak31 Smx2 Rri1 Gdb1 Pre2 Pre5 Prp4 Gcr2 Pre1 Smd2 Prp24 Pre6 Pre7 Sme1 Pat1 Rpn13 Sro7 Lsm7 Yfl066c Rpn7 Rpn10 Pup3 Rpn5 Sec26 Kem1 Rpn8 Lsm6 Rad23 Lsm2 Snu114 Lsm4 Hsm3 Lsm8 Rpn6 Dcp1 Lsm5 Ubp6 Rpn4 Gzf3 Rpn12 Ydl175c Rpt2 Ygl004c Lsm3 Ste6 Rpn3 Rps28b Lsm1 Yor320c Rpt6 Rpn9 Mtr3 Ecm29 Rps28a Rpn11 Rpt3 Nas6 Ycr024c Rpt1 Rpt4 Neo1 Ylr269c Rrd2 Rpt5 Rad18 Ris1 Yrb1 Bro1 Iml1 Lhs1 Shm2 Hex3 Ybr094w Yor056c Ubr1 Ctr9 Rfc3 Spt16 Bas1 Paf1 Rad24 Leo1 Figure 10 An MCODE Predicted Complex That is Too Large (Relaxed Parameters) Figure legend: An example of a predicted complex that incorporates two complexes, proteasome (left) and an RNA processing complex (right). These should probably be predicted as separate complexes as can be seen by the clear distinction of biological role annotation on one side of this lay- out compared to the other (purple versus blue). This figure, however, shows the large amount of overall connectivity between these two complexes. This complex was ranked fourth by MCODE from the predicted complexes in the AllYeast interaction set with slightly relaxed parameters compared to the optimized prediction. to be the final complex (Figure 11B). This is intuitive be- proteins (Pre1 to Pre10, Pup1, Pup2, Pup3, Scl1 and cause of their maximal density (a 9-clique). Ump1) of which Pre7, Pre8, Pre10, Pup1, Pup2 and Ump1 are not found with MCODE. The 19S regulatory Using this same method of known biological role "titra- subunit of the proteasome is known to have 21 subunits tion" on Rpt1 found a complex of 34 proteins (Gal4, (Nas6, Rpn1 to Rpn13, Rpt1 to Rpt6 and Ubp6) of which Gcn4, Hsm3, Lhs1, Nas6, Pre1, Pre2, Pre3, Pre4, Pre5, Rpn1, Rpn2, Rpn4, Rpn12 and Rpt5 are not found with Pre6, Pre7, Pre9, Pup3, Rpn10, Rpn11, Rpn13, Rpn3, MCODE. Known complex components not found by Rpn5, Rpn6, Rpn7, Rpn8, Rpn9, Rpt1, Rpt2, Rpt3, Rpt4, MCODE are not present at a high enough local density re- Rpt6, Rri1, Scl1, Sts1, Ubp6, Ydr179c, Ygl004c) and 160 gions of the interaction network, possibly because not interactions using the parameter set haircut = TRUE, fluff enough experiments involving these proteins are present = TRUE, VWP = 0.2 and a fluff density threshold of 0.3. in our data set. Figure 11C shows the final Rpt1 seeded Two regions of density can be seen here corresponding to complex. Of note, Ygl004c is unknown and binds to the two known subunits of the 26S proteasome. The 20S almost every Rpt and Rpn protein in the complex al- proteolytic subunit of the proteasome is comprised of 15 though all of these interactions were from a single immu- Page 21 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 AB Lsm7 Lsm6 Lsm7 Pat1 Dcp1 Lsm5 Rpn10 Dcp1 Lsm2 Lsm3 Rpt3 Rpn6 Kem1 Lsm6 Rpt1 Lsm2 Kem1 Lsm5 Lsm3 Lsm4 Pat1 Lsm4 Rpn7 Pre2 Rpt2 Gcn4 Pre4 Rpt4 Sts1 Pre6 Rpn3 Pre3 Rpn11 Pup3 Rpn10 Lhs1 Rpn6 Pre1 Ubp6 Rpt3 Scl1 Rpt1 Rpn5 Ygl004c Hsm3 Pre7 Pre5 Rpn8 Rpt6 Pre9 Rri1 Rpn13 Ydr179c Nas6 Gal4 Rpn9 Figure 11 MCODE in Directed Mode Figure legend: MCODE was used in directed mode to further study the complex in Figure 10 by using seed vertices from high density regions of the two parts of this complex. A) The result of examining the Lsm complex using MCODE parameters that are too relaxed (haircut = TRUE, fluff = FALSE, VWP = 0.05). B) The final Lsm complex using MCODE parameters of haircut = TRUE, fluff = FALSE and VWP = 0 seeded with Lsm4. C) The final 26S proteasome complex seeded with Rpt1 using MCODE parameters haircut = TRUE, fluff = TRUE and VWP = 0.2. Visible here are two regions of density in this complex corresponding to the 20S proteolytic subunit (left side – mainly Pre proteins) and the 19S regulatory subunit (right side – mainly Rpt and Rpn proteins). noprecipitation experiment [6]. As well, Rri1 and Ydr179c subunits and is involved in DNA mismatch repair path- have unknown function and both bind to each other and ways, but is not known to be part of the proteasome, al- to Rpn5. Thus one would predict that these three un- though all of these Hsm3 interactions are from a known proteins function with or as part of the 26S particular large-scale experiment [7]. Interestingly, Gal4, a proteasome. The protein Hsm3 binds to eight other 19S transcription factor involved in galactose metabolism, is Page 22 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 found to be part of the proteasome complex. While this [40] with the Kamada-Kawai graph layout algorithm [41]. metabolic functionality seems unrelated to protein degra- Kamada-Kawai models the edges in the graph as springs, dation, it has recently been shown that the binding is randomly places the vertices in a high energy state and physiologically relevant [38]. These cases illustrate the then attempts to minimize the energy of the system over a possible unreliability of both functional annotation and number of time steps. The result is that the Euclidean dis- interaction data, but also that seemingly unrelated pro- tance, here in a plane, is close to the graph-theoretic or teins should not be immediately discounted if found to be path distance between the vertices. The vertices are visual- part of a complex by MCODE. ly clustered based on connectivity. Biologically, this visu- alization can allow one to see the rough structural outline Of note, the known topology of the 26S proteasome [39] of large complexes, if enough interactions are known, as compares favourably with the complex visualization of evidenced in the proteasome complex analysis above (Fig- Figure 11C without considering stoichiometry. Thus, if ure 11C). enough interactions are known, visualizing complexes may reveal the rough structural outline of large complex- It is important to note and understand the limitations of es. This should be expected when dealing with actual the current experimental methods (e.g. yeast two-hybrid physical protein-protein interactions since there are few and co-immunoprecipitation) and the protein interaction allowed topologies for large complexes considering the networks that these techniques generate when analyzing specific set of defining interactions and steric clashes be- the resulting data. One common class of false-positive in- tween protein subunits. teractions arising from many different kinds of experi- mental methods is that of indirect interactions. For Complex connectivity instance, an interaction may be seen between two proteins MCODE may also be used to examine the connectivity using a specific experimental method, but in reality, those and relationships between molecular complexes. Once a proteins do not physically bind each other, and one or complex is known using the directed mode, the MCODE more other molecules that are generally part of the same parameters can be relaxed to allow branching out into complex mediate the observed interaction. As can be seen other complexes. The MCODE directed mode preprocess- for the Arp2/3 complex shown in Figure 3, when pairwise ing step must also be turned off to allow MCODE to interactions between all combinations of proteins in a branch into other connected complexes, which may reside complex are studied, this creates a very dense graph. Inter- in denser regions of the graph than the seed vertex. As an estingly, this false-positive effect is normally considered a example, this was done with the Lsm4 seeded complex disadvantage, but is an advantage with MCODE as it in- (Figure 12). MCODE parameters were relaxed to haircut = creases the density in the region of the graph containing a TRUE, fluff = FALSE, VWP = 0.2 although they could be complex, which can then be more easily predicted. further relaxed for greater extension out into the network. Apart from the experimental factors that lead to false-pos- Discussion itive and false-negative interactions, representational lim- This method represents an initial step in taking advantage itations also exist computationally. Temporal and spatial of the protein function data being generated by many information is not currently described in interaction net- large-scale protein interaction studies. As the experimen- works. A complex found by the MCODE approach may tal methods are further developed, an increasing amount not actually exist even though all of the component of data will be produced which will require computation- proteins bind each other in vitro. Those proteins may al methods for efficient interpretation. The algorithm de- never be present at the same time and place. For example, scribed here allows the automated prediction of protein molecular complexes that perform different functions complexes from qualitative protein-protein interaction sometimes have common subunits as with the three types data and is thus able to help predict the function of of eukaryotic RNA polymerases. unknown proteins and aid in the understanding of the functional connectivity of molecular complexes in the Complex stoichiometry, another important aspect of bio- cell. The general nature of this method may allow logical data, is not represented either. While it is possible complex prediction for molecules other than proteins as to include full stoichiometry in a graph representation of well, for example metabolic complexes that include small a biomolecular interaction network, many experimental molecules. methods do not provide this information, so a homo- multimeric complex is normally represented as a simple MCODE cannot stand alone in this task; it must be com- homodimer. When an experiment does provide stoichi- bined with a graph visualization system to ease the under- ometry information, it is not stored in most current standing of the relationships among molecules in the data databases, such as MIPS and YPD. Thus, one is forced to set. We use the Pajek program for large network analysis Page 23 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 Yth1 Ykl059c Pap1 Glc7 Ref2 Pfs2 Pti1 Fip1 Cft1 Cft2 Pta1 Rna14 Ysh1 Lsm6 Spt7 Lsm7 Hfi1 Spt20 Taf60 Lsm3 Kem1 Spt3 Tra1 Taf61 Taf25 Lsm5 Pat1 Taf90 Gcn5 Lsm1 Dcp1 Spt8 Taf17 Lsm4 Spt15 Ngg1 Lsm8 Lsm2 Ada2 Rpn8 Rpn6 Cdc23 Apc2 Apc4 Rpt3 Rpn10 Rpt1 Apc1 Doc1 Cdc26 Rpn11 Cdc16 Rpn5 Apc9 Apc11 Rpn3 Rpt6 Cdc27 Apc5 Figure 12 Examining Complex Connectivity with MCODE Figure legend: The complexes shown here are known to be nuclear localized and are involved in protein degradation (19S proteasome subunit), mRNA processing (Lsm complex and mRNA Cleavage/Polyadenylation complex), cell cycle (anaphase promoting complex) and transcription (SAGA transcriptional activa- tion complex). return to the primary literature to extract the data, an ex- to determine a reliability index or p-value on edges in the tremely time-consuming task for large data sets. graph. For instance, one may wish to rank results pub- lished in high-impact journals above other journals (or Some quantitative and statistical information is present vice versa) and rank classical purification methods above when integrating results of large-scale approaches and this high-throughput yeast two-hybrid techniques when deter- is not used in our current graph model. For instance, the mining the quality of the interaction data. It may also be number of different types of experiments that find the possible to weight vertices on the graph by other quality same interaction, the quality of the experiment, the date criteria, such as whether a protein is hypothetical from a the experiment was conducted (newer methods may be gene prediction or not or whether a protein is expressed at superior in certain aspects) and other factors that pertain a particular time and place in the cell. For example, if one to the reliability of the interaction could all be considered were interested in a certain stage of the cell cycle, proteins Page 24 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 that are known to be absent at that stage could be reduced clusion of functional annotation and p-values on edges. in weight (VWP in the case of MCODE) compared to pro- Time, space and stoichiometry should also be represented teins that are present. It should be noted that any weight- on networks and in visualization systems. The process of ing scheme that tries to assess the quality of an interaction 'functional annotation titration' in the directed mode of might make false assumptions that would prevent the dis- MCODE could be automated. covery of new and interesting data. Conclusions This paper shows that the structure of a biological network MCODE effectively finds densely connected regions of a can define complexes, which can be seen as dense regions. molecular interaction network, many of which corre- This may be attributed to indirect interactions accumulat- spond to known molecular complexes, based solely on ing in the literature. Thus, interaction data taken out of connectivity data. Given that this approach to analyzing context may be erroneous. For instance, if one has a col- protein interaction networks performs well using lection of protein interactions from various different ex- minimal qualitative information implies that large periments done at different times in different labs from a amounts of available knowledge is buried in large protein specific complex that form a clique, and if one chooses an interaction networks. More accurate data mining algo- interaction from this clique, then how can one verify if it rithms and systems models could be constructed to un- is indirect or not. We would only begin to know if we had derstand and predict interactions, complexes and a very detailed description of the experiment from the pathways by taking into account more existing biological original papers where we could tell the amount of work knowledge. Structured molecular interaction data resourc- and quality of work that went into measuring each inter- es such as BIND will be vital in creating these resources. action. Thus with only a qualitative view of interactions, in reference to Dobzhansky [42], nothing in the biomo- Methods Data sources lecular interaction network would make sense except in light of molecular complexes and the functional connec- All protein interaction data sets from MIPS [13], Gene On- tions between them. If one had a highly detailed represen- tology [43] and PreBIND http://bioinfo.mshri.on.ca/ tation of each interaction including time, place, prebind/ were collected as described previously [6]. The YPD protein interaction data are from March 2001 and experimental condition, number of experiments, binding sites, chemical actions and chemical state information, were originally requested from Proteome, Inc. http:// one would be able to computationally delve into molecu- www.proteome.com. Other interaction data sets are from lar complexes to resolve topology, structure, function and BIND http://www.bind.ca. A BIND yeast import utility mechanism down to the atomic level. This information was developed to integrate data from SGD [12], RefSeq would also help to judge the biological relevance of an in- [44], Gene Registry http://genome-www.stanford.edu/ teraction. Thus, we require databases like BIND [15] to Saccharomyces/registry.html, the list of essential genes store this information. The integration of known qualita- from the yeast deletion consortium [11] and GO terms tive and quantitative molecular interaction data in a ma- [43]. This database ensures proper matching of yeast gene chine-readable format should allow increasingly accurate names among the multiple data sets that may use different protein interaction, molecular complex and pathway pre- names for the same genes. The yeast proteome used here diction, including actual binding site and mechanism in- is defined by SGD and RefSeq and contains 6,334 ORFs formation in a sequence and structural context. including the mitochondrial chromosome. Before per- forming comparisons, the various interaction data sets Based on our scale-free network analysis, it would seem were entered into a local instance of BIND as pairwise pro- that real biological networks are organized differently tein interaction records. The MIPS complex catalogue was than random models of scale-free networks in that they downloaded in February 2002. have higher clustering coefficients around specific regions (complexes) and the vertices in these regions are related to The protein interaction data sets used here were com- each other, by biological function. Thus, attempts to mod- posed as follows. 'Gavin Spoke' is the spoke model of the el biological networks and their evolution in a global way raw purifications from Gavin et al [7]. 'Y2H' is all known solely using the statistics of scale-free networks may not large-scale [2–5,10] combined with normal yeast two-hy- work, rather modeling should take into account as much brid results from MIPS. 'HTP Only' is only high-through- extant biological knowledge as possible. put or large-scale data [2–7,10] The 'Benchmark' set was constructed from MIPS, YPD and PreBIND as previously Future work on MCODE could include researching differ- described [6]. 'Pre HTMS' was composed of all yeast sets ent, possibly adaptive, vertex scoring functions to take except the recent large-scale mass spectrometry data sets into account, for example, the local density of the network [6,7]. 'AllYeast' was the combination of all above data past the immediate neighborhood of a vertex and the in- sets. All data sets are non-redundant. Page 25 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 10. Tong AH, Drees B, Nardelli G, Bader GD, Brannetti B and Castagnoli Network visualization L A combined experimental and computational strategy to Visualization of networks was performed using the Pajek define protein interaction networks for peptide recognition program for large network analysis [40]http:// modules. Science 2002, 295:321-324 11. Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K and vlado.fmf.uni-lj.si/pub/networks/pajek/ as described pre- Andre B Functional characterization of the S. cerevisiae ge- viously [6,10]. using the Kamada-Kawai graph layout al- nome by gene deletion and parallel analysis. Science 1999, 285:901-906 gorithm followed by manual vertex adjustments and was 12. Chervitz SA, Hester ET, Ball CA, Dolinski K, Dwight SS and Harris formatted using CorelDraw 10. Power law analysis was MA Using the Saccharomyces Genome Database (SGD) for also accomplished as previously described [6]. analysis of protein similarities and structure. Nucleic Acids Res 1999, 27:74-78 13. Mewes HW, Frishman D, Gruber C, Geier B, Haase D and Kaps A Authors' contributions MIPS: a database for genomes and protein sequences. Nucleic GB conceived of the study and carried out all program- Acids Res 2000, 28:37-40 14. Costanzo MC, Crawford ME, Hirschman JE, Kranz JE, Olsen P and ming and analyses as a Ph.D. student in the lab of CH. CH Robertson LS YPD, PombePD and WormPD: model organism supervised the study and provided valuable input for the volumes of the BioKnowledge library, an integrated re- source for protein information. Nucleic Acids Res 2001, 29:75-79 evaluation analyses. 15. Bader GD, Donaldson I, Wolting C, Ouellette BF, Pawson T and Hogue CW BIND-The biomolecular interaction network Additional material database. Nucleic Acids Res 2001, 29:242-245 16. Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM and Eisenberg D DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Ac- Additional File 1 ids Res 2002, 30:303-305 AllYeastPredictedComplexes.zip Zip file containing Pajek .net and anno- 17. Takai-Igarashi T, Nadaoka Y and Kaminuma T A database for cell tation files for all 209 complexes found using MCODE on the set of all signaling networks. J Comput Biol 1998, 5:747-754 18. Wingender E, Chen X, Hehl R, Karas H, Liebich I and Matys V yeast interactions reported here. Various report files from MCODE are TRANSFAC: an integrated system for gene expression also included as well as basic instructions for using Pajek. regulation. Nucleic Acids Res 2000, 28:316-319 Click here for file 19. Karp PD, Riley M, Saier M, Paulsen IT, Paley SM and Pellegrini-Toole [http://www.biomedcentral.com/content/supplementary/1471- A The EcoCyc and MetaCyc databases. Nucleic Acids Res 2000, 2105-4-2-S1.zip] 28:56-59 20. Overbeek R, Larsen N, Pusch GD, D'Souza M, Selkov EJ and Kyrpides N WIT: integrated system for high-throughput genome se- quence analysis and metabolic reconstruction. Nucleic Acids Res 2000, 28:123-125 21. Wagner A and Fell DA The small world inside large metabolic Acknowledgements networks. Proc R Soc Lond B Biol Sci 2001, 268:1803-1810 Michel Dumontier, Sherrie Kelly, Katerina Michalickova, Tony Pawson and 22. Flake GW, Lawrence S, Giles CL and Coetzee FM Self-Organiza- Mike Tyers provided helpful discussion. This work was supported in part tion of the Web and Identification of Communities. IEEE from grants from the Canadian Institutes of Health Research (CIHR), the Computer 2002, 35:66-71 23. Goldberg AV Finding a Maximum Density Subgraph. Technical Ontario Research and Development Challenge Fund and MDS-Sciex to Report UCB/CSD University of California, Berkeley, CA 1984, 84: C.H. G.D.B. is supported by an Ontario Graduate Scholarship (OGS). 24. Ng A, Jordan M and Weiss Y On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 14: Proceedings of the 2001 2001, References 25. Watts DJ and Strogatz SH Collective dynamics of 'small-world' 1. Fields S Proteomics. Proteomics in genomeland. Science 2001, networks. Nature 1998, 393:440-442 291:1221-1224 26. Jeong H, Tombor B, Albert R, Oltvai ZN and Barabasi AL The large- 2. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS and Knight JR A scale organization of metabolic networks. Nature 2000, comprehensive analysis of protein-protein interactions in 407:651-654 Saccharomyces cerevisiae. Nature 2000, 403:623-627 27. Albert R, Jeong H and Barabasi AL Error and attack tolerance of 3. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M and Sakaki Y A com- complex networks. Nature 2000, 406:378-382 prehensive two-hybrid analysis to explore the yeast protein 28. Barabasi AL and Albert R Emergence of scaling in random interactome. Proc Natl Acad Sci U S A 2001, 98:4569-4574 networks. Science 1999, 286:509-512 4. Drees BL, Sundin B, Brazeau E, Caviston JP, Chen GC and Guo W A 29. Fell DA and Wagner A The small world of metabolism. Nat protein interaction map for cell polarity development. J Cell Biotechnol 2000, 18:1121-1122 Biol 2001, 154:549-571 30. Hartuv E and Shamir R A clustering algorithm based on graph 5. Fromont-Racine M, Mayes AE, Brunet-Simon A, Rain JC, Colley A and connectivity. Information processing letters 1999, 76:175-181 Dix I Genome-wide protein interaction screens reveal func- 31. Bader GD and Hogue CW Analyzing yeast protein-protein in- tional networks involving Sm-like proteins. Yeast 2000, 17:95- teraction data obtained from different sources. Nat Biotechnol 2002, 20:991-997 6. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L and Adams SL Sys- 32. Baldi P, Brunak S, Chauvin Y, Andersen CA and Nielsen H Assessing tematic identification of protein complexes in Saccharomy- the accuracy of prediction algorithms for classification: an ces cerevisiae by mass spectrometry. Nature 2002, 415:180-183 overview. Bioinformatics 2000, 16:412-424 7. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M and Bauer A 33. Robinson RC, Turbedsky K, Kaiser DA, Marchand JB, Higgs HN and Functional organization of the yeast proteome by systemat- Choe S Crystal structure of Arp2/3 complex. Science 2001, ic analysis of protein complexes. Nature 2002, 415:141-147 294:1679-1684 8. Christendat D, Yee A, Dharamsi A, Kluger Y, Savchenko A and Cort 34. Mayes AE, Verdone L, Legrain P and Beggs JD Characterization of JR Structural proteomics of an archaeon. Nat Struct Biol 2000, Sm-like proteins in yeast and their association with U6 7:903-909 snRNA. EMBO J 1999, 18:4321-4331 9. Kim SK, Lund J, Kiraly M, Duke K, Jiang M and Stuart JM A gene ex- 35. von Mering C, Krause R, Snel B, Cornell M, Oliver SG and Fields S pression map for Caenorhabditis elegans. Science 2001, Comparative assessment of large-scale data sets of protein- 293:2087-2092 protein interactions. Nature 2002, 417:399-403 Page 26 of 27 (page number not for citation purposes) BMC Bioinformatics 2003, 4 http://www.biomedcentral.com/1471-2105/4/2 36. Jeong H, Mason SP, Barabasi AL and Oltvai ZN Lethality and cen- trality in protein networks. Nature 2001, 411:41-42 37. Maslov S and Sneppen K Specificity and stability in topology of protein networks. Science 2002, 296:910-913 38. Gonzalez F, Delahodde A, Kodadek T and Johnston SA Recruit- ment of a 19S proteasome subcomplex to an activated promoter. Science 2002, 296:548-550 39. Bochtler M, Ditzel L, Groll M, Hartmann C and Huber R The proteasome. Annu Rev Biophys Biomol Struct 1999, 28:295-317 40. Batagelj V and Mrvar A Pajek – Program for Large Network Analysis. Connections 1998, 2:47-57 41. Kamada T and Kawai S An algorithm for drawing general indi- rect graphs. Information processing letters 1989, 31:7-15 42. Dobzhansky T Nothing in Biology Makes Sense Except in the Light of Evolution. American Biology Teacher 1973, 35:125-129 43. The Gene Ontology Consortium Gene ontology: tool for the uni- fication of biology. Nat Genet 2000, 25:25-29 44. Pruitt KD and Maglott DR RefSeq and LocusLink: NCBI gene- centered resources. Nucleic Acids Res 2001, 29:137-140 Publish with Bio Med Central and every scientist can read your work free of charge "BioMed Central will be the most significant development for disseminating the results of biomedical researc h in our lifetime." Sir Paul Nurse, Cancer Research UK Your research papers will be: available free of charge to the entire biomedical community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours — you keep the copyright BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 27 of 27 (page number not for citation purposes)

Journal

BMC BioinformaticsSpringer Journals

Published: Jan 13, 2003

There are no references for this article.