A first-draft human protein-interaction map

Ben Lehner; Andrew G Fraser

doi:10.1186/gb-2004-5-9-r63

Genome Biology 2004, Volume 5, Issue 9, Article R63 (2004-08-01)
http://genomebiology.com/2004/5/9/R63

Abstract

Background: Protein-interaction maps are powerful tools for suggesting the cellular functions of genes. Although large-scale protein-interaction maps have been generated for several invertebrate species, projects of a similar scale have not yet been described for any mammal. Because many physical interactions are conserved between species, it should be possible to infer information about human protein interactions (and hence protein function) using model organism protein-interaction datasets.

Results: Here we describe a network of over 70,000 predicted physical interactions between around 6,200 human proteins generated using the data from lower eukaryotic protein-interaction maps. The physiological relevance of this network is supported by its ability to preferentially connect human proteins that share the same functional annotations, and we show how the network can be used to successfully predict the functions of human proteins. We find that combining interaction datasets from a single organism (but generated using independent assays) and combining interaction datasets from two organisms (but generated using the same assay) are both very effective ways of further improving the accuracy of protein-interaction maps.

Conclusions: The complete network predicts interactions for a third of human genes, including 448 human disease genes and 1,482 genes of unknown function, and so provides a rich framework for biomedical research.

Background

Physical interactions between proteins underpin most biological processes. For this reason, large-scale protein-interaction mapping projects have been initiated in several model organisms [1-6]. Unfortunately, projects of a similar scale have not yet been described for mammalian systems, with the result that our global understanding of protein function remains less advanced in mammals than in lower eukaryotes. However, many physical interactions are conserved between species, so it should be possible to infer information about human protein interactions and protein function using data from model organism protein-interaction datasets [7,8].

To transfer information on gene function between two genomes requires the identification of orthologous genes in the two genomes (that is, genes that are descended from a common ancestor and share biological functions). However, the identification of gene orthologs is often not a trivial problem; gene duplications can result in a single gene having multiple potential orthologs in a second species. In addition, it is necessary to distinguish true gene orthologs from 'out-paralogs' (that is, genes that arose from a gene-duplication event before the divergence of two species, and so are unlikely to share functions) [9]. One method that addresses both these problems is the InParanoid algorithm, which first identifies potential orthologs by best pairwise similarity searches, and then clusters these orthologs into groups of likely co-orthologs, with each ortholog assigned a score representing the confidence that it is the main ortholog [9]. We have used the orthology relationships identified by the InParanoid algorithm to construct a putative human protein-interaction map based solely on high-throughput interaction datasets from model organisms. We show that this approach successfully identifies functionally related human proteins, and so can be used to assign putative functions to many novel human genes. The resulting network provides a framework for human biology and acts as a guide for a future experimental human protein-interaction mapping project.

Results

Generation of a human protein-interaction map

Protein interactions are often evolutionarily conserved between orthologous proteins from different species [7]. Hence we reasoned that a human protein-interaction map could be constructed using data from model organism protein-interaction mapping projects. We obtained the data from seven experimental and four computationally predicted protein-interaction maps from Saccharomyces cerevisiae [1-4,10,11], Drosophila melanogaster [5] and Caenorhabditis elegans [6]. For each interacting protein, we identified potential human orthologs using the InParanoid algorithm [9]. A human protein interaction is predicted if both interaction partners from a model organism have one or more human orthologs. Using this strategy, we were able to generate a human interaction network comprising 71,496 interactions between 6,231 human proteins. The sources of these predicted interactions are summarized in Table 1 and Figure 1a, and all the interactions are available in Additional data file 1 (available online with this article) and can also be searched or downloaded from our website [12].

Table 1
The number and accuracy of human protein interactions predicted by different model organism protein-interaction datasets

Data source          Predicted human   Interactions sharing GO terms
                     interactions      Number      %
All                  71,496            12,724      24.9
Yeast                55,231            10,727      26.2
Fly                  12,059             1,404      19.0
Worm                  4,494               753      24.4
All core             11,487             3,133      38.1
Core yeast            6,061             2,146      45.4
Core fly              2,889               488      27.8
Core worm             2,701               597      32.3
Two species             288               154      74.8
Two species (core)      160                95      88.0
Two methods           2,166               829      60.6
Random pairs         71,496             6,053      14.6

The table lists the total number of interactions predicted by each interaction dataset, and the number of these interactions that connect proteins that share at least one GO term (at level 3 or deeper in the GO hierarchy). The percentages are relative to the total number of non-self interactions where both proteins have at least one GO annotation. All, all predicted human protein interactions; Yeast/worm/fly, interactions predicted by the yeast, worm or fly interaction maps; All core, all interactions predicted by the high-confidence subsets of each model organism interaction map (see Materials and methods); Two species, interactions predicted by more than one model organism interaction map; Two species (core), interactions predicted by the high-confidence subset of interactions from more than one model organism; Two methods, interactions predicted by data derived from more than one different interaction assay; Random pairs, the data for a randomly generated interaction network.

[Figure 1: Venn diagrams. (a) Complete network (71,496 interactions): Worm (4,494), Yeast (55,252), Fly (12,059). (b) Core network (11,487 interactions): Worm (2,701), Yeast (6,061), Fly (2,889).]

Figure 1
Sources of predicted human protein interactions. (a) The number of human protein interactions predicted by the interaction maps from each model organism. (b) The number of human protein interactions predicted by the core higher-confidence interactions from each organism. As explained in the text, core interactions are those that reconfirmed when retested (worm), or had an interaction score of greater than 0.5 (fly) or were identified more than once in a single assay (yeast, worm).

Assessment of the accuracy of the interaction datasets

In the absence of a comprehensive set of verified human protein interactions, we required another method to assess the accuracy of the interaction network. Proteins that interact physiologically are expected to have related functions. Therefore high-quality interaction datasets should predict a greater proportion of interactions between functionally related proteins than low-quality datasets. The functions of human proteins can be systematically described using the Gene Ontology (GO) annotations [13] available from Ensembl [14-17]. GO annotations provide a hierarchical description of gene functions with general functions described by GO annotations at the top levels of the hierarchy and very precise functions described by terms deeper in the hierarchy. Because physiologically interacting proteins are expected to have related, but non-identical functions, they are expected to share some, but not all GO annotations. Therefore, one method to evaluate an interaction dataset is to count the proportion of interactions that connect proteins that share common GO terms [5]. For the complete predicted human interaction network, 25% of interaction partners share at least one GO term, which is many more than observed with a randomly generated network of the same size (15% of interactions). To confirm that this result did not just apply to quite general GO annotations, we calculated the proportion of interaction partners that share GO annotations at depths 3 to 8 and greater than 8 in the GO hierarchy. We found that the predicted interaction network preferentially connects proteins that share GO annotations at any level of the GO hierarchy (see Figure 2). This suggests that the interaction network indeed preferentially connects functionally related human proteins.

[Figure 2: two line plots, panels (a) and (b). x-axis: Depth of shared GO term (3, 4, 5, 6, 7, 8, >8); y-axis: Percentage non-self interactions sharing GO term.]

Figure 2
Filtering interaction datasets to improve their accuracy. (a) The percentages of interactions sharing GO terms at various depths in the GO hierarchy are compared for interactions predicted by the high-confidence interactions from each model organism (core yeast, core worm and core fly), as well as for the complete datasets from each organism (all yeast, all worm, all fly). For comparison, the percentage of shared GO terms is shown for a randomly generated network of the same size as the complete human network (random pairs). The x-axis indicates the depth in the GO hierarchy being considered, and the y-value the percentage of interaction partners (with known GO annotations) that share GO annotations at this depth or deeper. (b) The percentages of interactions sharing GO terms at different levels in the GO hierarchy are compared for interactions predicted by core interactions in two or more species (two species (core)), by interactions in the complete datasets of two or more species (two species), for interactions predicted by more than one experimental method in yeast (two methods), by any core interaction (all core), by any interaction (all), or by a randomly generated interaction network of the same size as the complete human interaction network (random pairs). All values shown are the percentage of non-self interactions between pairs of proteins that both have at least one associated GO term at the indicated depth in the GO hierarchy.

We then used the same strategy to compare the accuracy of human interactions predicted by data from the three different model organisms. If the interactions from a particular model organism dataset predict fewer interactions between functionally related human proteins than the other datasets, then this dataset should be considered less reliable as a source of candidate human protein interactions. As shown in Table 1 and Figure 2a, interactions predicted by the complete yeast and worm datasets are slightly better at connecting functionally related human proteins than those predicted by the fly dataset, suggesting that these interactions can be considered with higher confidence. This result is especially interesting given that the yeast interaction map is an order of magnitude larger than the fly or worm maps, confirming that the fly and worm interaction maps currently have a relatively low coverage.

Next we asked how the confidence in the assignment of gene orthologs affects the accuracy of an interaction. For each predicted interaction, an orthology confidence score was calculated by summing the InParanoid orthology confidence scores for the two human and two model organism proteins (see Materials and methods). Of the predicted interactions, 24,897 have the maximum possible confidence score of 4. Of these interactions, 28%, 24% and 13% connect proteins that share GO terms at depths of 3, 5 or 7 in the GO hierarchy (excluding proteins without GO annotation). In contrast, for interactions with an orthology confidence score less than 4, these figures are 24%, 20% and 10%. Hence we conclude that the predicted human interactions with high-confidence orthology assignments can be considered more reliable than those interactions with less confidence in their orthology assignments. This confirms that the confidence scores assigned using InParanoid are indeed likely to be useful predictors of functional conservation.

A core dataset of high-confidence protein interactions

The worm and fly interaction mapping projects both defined a subset of high-confidence 'core' interactions that have the greatest experimental support (Figure 1b). For the worm interaction map these were defined as interactions identified more than once, or that reconfirmed when retested in the two-hybrid assay [6]. In the fly interaction map each interaction has an associated confidence score, and interactions with a score greater than 0.5 are considered core interactions (the interaction score mainly depends upon the number of times each interaction was detected, the total number of interactions made by each protein and the local network clustering; see [5]). To generate a similar subset of yeast protein interactions, we defined core yeast protein interactions as those identified more than once by any single assay, consistent with previous analyses of the individual datasets [1-3,11]. As shown in Figure 2a and Table 1, for all three species these core interactions predict a greater proportion of human interactions that share GO terms than the total datasets. Indeed all three core interaction maps are of similar accuracy, so we combine their predicted interactions into a core network of 11,487 higher-confidence human protein interactions (summarized in Table 2 and available as Additional data file 2). Of these core interactions, 38%, 35% and 24% connect proteins that share GO terms at depths of 3, 5 or 7 in the GO hierarchy (excluding proteins with no GO annotations).

Table 2
The number of interactions, genes, novel genes and disease genes in the complete and core human interaction networks

Network     Interactions   Genes    Novel genes   Disease genes
Complete    71,496         6,231    1,482         448
Core        11,487         3,872      864         292

The complete network consists of all human protein interactions predicted by model organism protein-interaction datasets. The core network consists of all the human interactions predicted by the high-confidence subsets of each interaction network (see Materials and methods). Novel genes are defined as those without GO annotations. Disease genes are defined by the OMIM database [25], available from Ensembl [16].

Combining interaction datasets to generate high-confidence networks

It has been shown previously that protein interactions detected by more than one high-throughput interaction assay are more accurate [11]. We find that this is also true for human protein interactions predicted by yeast protein interactions detected by more than one method (see Figure 2b and Table 1). It has also been suggested that protein interactions are more likely to represent physiologically important interactions if they have been detected between orthologous protein pairs from two or more species [7,18]. To test this hypothesis we identified 288 human protein interactions predicted by interactions in two or more model organisms (Figure 1, Table 1). Remarkably, 75%, 70% and 56% of these interactions share GO terms at depths of 3, 5 or 7 in the GO hierarchy, respectively (Figure 2b). Indeed, for interactions derived from core interaction datasets, these figures rise to 88%, 80% and 67% of interactions. Hence, protein interactions predicted by data from multiple species can be considered with very high confidence.

Using the interaction network to predict human gene function

Because physiologically interacting proteins often have similar functions (Figure 2), it should be possible to predict the functions of a novel human protein if it interacts with proteins of known function. To address how well our interaction map could be used for this purpose, we asked whether the known GO terms of a protein could be predicted using only the GO terms of its interaction partners. As shown in Table 3, GO terms associated with at least one of a gene's core interaction partners predict GO terms associated with that gene with an accuracy of around 8%. However, GO terms associated with at least two, three, four, five or six of a gene's interaction partners have 22%, 30%, 37%, 42% and 45% probabilities, respectively, of also being associated with that gene (Table 3). Although these values may vary for different GO terms, as shown in Additional data file 3, the accuracy and coverage of these GO term predictions are very similar for GO terms at different levels in the GO hierarchy, and so can be used as an approximate indication of the confidence in a prediction of gene function. Hence the network can be used to predict GO terms for a human gene of unknown function, with the approximate confidence in the GO prediction determined by the number of interaction partners that share the GO term.

Table 3
The approximate accuracy and coverage of GO terms predicted by the core and complete interaction networks

Number of interactors   Core data                     Complete data
with GO term            Accuracy (%)   Coverage (%)   Accuracy (%)   Coverage (%)
1+                       8             26              3             35
2+                      22             11              8             19
3+                      30              7             11             14
4+                      36              5             15             11
5+                      42              4             18              8
6+                      45              3             20              7

The approximate accuracy and coverage of GO term predictions were calculated for every gene in the core or complete interaction networks with at least one known GO term. The GO terms of a gene are predicted using the GO terms of any of its interaction partners (1+), or GO terms shared by at least two to six of its interaction partners (2+ to 6+). Accuracy is calculated as the number of correctly predicted GO terms divided by the total number of predicted GO terms. Coverage is calculated as the number of correctly predicted GO terms divided by the total number of known GO terms associated with each gene. These values are similar for GO annotations at different levels of the GO hierarchy (see Additional data file 3).

The ability to provide a reasonably accurate prediction of a gene's GO terms means that we can use the interaction network to provide probabilistic gene function predictions for novel human proteins and also to predict additional functions for proteins with some known functions. The core interaction map contains 864 proteins with no functional annotations. About 10% of these proteins interact with two or more proteins that share GO terms. The probabilistic predictions of the functions of these novel proteins are listed in Additional data file 4. Often these predicted functions are also supported by the known functions of the protein domains predicted to be encoded by these novel genes (see Additional data file 4). For example, ENSG00000028310 encodes a bromodomain and interacts with six proteins annotated as 'GO:0006355 regulation of transcription, DNA-dependent', ENSG00000080608 encodes an RNA-binding domain and interacts with five proteins annotated as 'GO:0006364 rRNA processing', and ENSG00000104863 encodes a PDZ domain and interacts with three proteins with the annotations 'GO:0005887 integral to plasma membrane, GO:0007242 intracellular signaling cascade' (Additional data file 4). The complete and core interaction maps also predict interactions for 448 and 292 human disease genes (listed in Additional data file 5), of which 55 interact with two or more proteins in the core interaction network that share a GO annotation. The functional predictions for these 55 genes are listed in Additional data file

Discussion

A framework for human biology

We report here the use of data from model organism protein-interaction mapping projects to predict a network of human protein interactions. This network consists of over 70,000 interactions that connect over one-third of all the predicted human proteins, including 1,482 proteins of unknown function and 448 proteins encoded by human disease genes. The physiological relevance of this network is supported by its ability to preferentially connect human proteins that share biological functions (Figure 2). Indeed the network can be successfully used to predict the functions of a gene using the known functions of its interaction partners (Table 3). As such, the network should provide a rich source of functional hypotheses for researchers interested in the functions of one or many human proteins.

The accuracy and coverage of the interactions predicted in this network depend primarily on two parameters: the quality of the original model organism interaction datasets; and the ability to identify the human orthologs of a model organism protein. Our analysis suggests that the raw yeast and worm protein-interaction datasets are currently slightly more accurate than the raw fly interaction dataset, but that when filtered for high-confidence interactions the three interaction maps are of very similar accuracy (see Table 1 and Figure 2). The fly and worm interaction maps both have a much lower coverage than the yeast interaction network, most probably because they both only represent the results of a single interaction-mapping project. The continuation of these model organism protein-interaction mapping projects to generate higher coverage interaction maps will greatly enhance our ability to predict human protein interactions.

For the identification of gene orthologs, we used the InParanoid algorithm. InParanoid offers several important benefits compared to simple 'reciprocal best hit' sequence-similarity searches [9]. First, many genes from lower eukaryotes have multiple co-orthologs in humans, which can be identified using InParanoid, but not by simple one-to-one sequence-similarity searches. Second, InParanoid can successfully distinguish these true co-orthologs from paralogs that arose before a speciation event (which are unlikely to retain similar functions). Finally, each potential ortholog in a group of co-orthologs identified by InParanoid has an associated score that represents the likelihood that it is the main ortholog of a gene. We have summed these confidence scores to provide an orthology confidence score for each predicted human protein interaction in our network. These high-confidence ortholog interactions connect a greater proportion of functionally related human proteins, suggesting that the InParanoid confidence score is indeed a useful tool for predicting the likely physiological relevance of a predicted protein interaction.

The ability to successfully predict human protein functions using the results of model organism protein-interaction mapping projects highlights both the relevance of model organism protein-interaction mapping projects to understanding human biology and also the benefits that would result from an experimental human protein-interaction mapping project. Although the interaction network can currently accurately predict only a subset of the known functions of a gene, this should improve as more protein-interaction data becomes available. For this reason, we strongly encourage the continuation of model organism protein-interaction mapping projects.

Methods of verifying protein-interaction datasets

We also assessed the relative merits of three different methods to improve the accuracy of protein-interaction maps. The first strategy is to define a subset of interactions detected more than once with a single assay [1-3,6]. We found that this approach leads to an approximately 1.5- to 2.7-fold increase in the proportion of predicted human interactions that share GO terms (Figure 2b). The second strategy is to define a subset of interactions that have been identified by more than one interaction assay. This results in around a 2.3- to 8-fold improvement in the prediction of associations between proteins that share GO terms (Figure 2b). The final strategy is to define a subset of interactions that are predicted by interactions from more than one model organism, which results in around a 3- to 12-fold improvement in the proportion of interactions between proteins sharing GO terms (Figure 2b). With all these filtering methods, the greatest improvements are seen when considering the proportion of interactions that share GO terms deep within the GO hierarchy; that is, the filtering steps dramatically improve the proportion of interactions between proteins with very closely related functions. We conclude that using interaction data derived from a second interaction assay and using data from a second species both represent excellent methods to improve the accuracy of protein-interaction maps. Because of the small number of protein-interaction assays that have been adapted to a high-throughput format, we suggest that constructing a second interaction map in a related organism using the same assay may be an efficient way to produce a high-confidence interaction map. This strategy is somewhat similar to using phylogenetic footprinting to identify functional noncoding DNA, so we suggest it should be named 'interaction footprinting'. Using the relatively low-coverage model organism interaction datasets currently available, only a small proportion of interactions can be verified by interaction footprinting. The continuation of these model organism interaction mapping projects will not only provide a much richer framework of predicted human protein interactions, but will also allow many more interactions to be verified using the interaction footprinting strategy. However, such an approach will be limited to providing information on those proteins and interactions that are conserved between vertebrates and invertebrates.

Strategies for completing the human interaction map

The interactions described here provide a first-draft human protein-interaction map that can be used to predict interactions and functions for genes of interest to a particular researcher. However, the map also provides a framework from which a complete human protein-interaction map could be generated. Firstly, the map could be used to identify subsets of high-confidence, evolutionarily conserved interactions from the results of large- or medium-scale human interaction-mapping projects. For example, the map verifies 51 of 296 yeast two-hybrid interactions detected for human proteins involved in mRNA decay [19]. Alternatively, the interactions predicted here could be directly experimentally validated using an assay that allows rapid testing of binary interactions (such as the yeast or mammalian two-hybrid assays [20] or protein fragment complementation assays [21]). This would represent a cost-effective strategy to produce a high-confidence human protein-interaction map because it massively reduces the number of candidate interactions that need to be tested. Finally, the map identifies 17,300 (23,531 - 6,231) human genes for which no protein interactions are predicted from model organism interaction datasets. Many of these proteins are likely to be vertebrate- or mammalian-specific, and are the most logical choices for bait proteins for the discovery phase of an experimental human protein-interaction mapping project.

Materials and methods

Model organism protein-interaction datasets

The interaction datasets used to generate the draft human protein-interaction network were two-hybrid-based interaction maps for D. melanogaster [5] and C. elegans [6] and a list of S. cerevisiae protein interactions compiled by Von Mering et al. [11] from two two-hybrid [1,2], two complex purification [3,4], one genetic [10], and four in silico-predicted interaction datasets (which used correlated mRNA expressions, conserved gene neighbourhood, gene co-occurrence or gene fusion events to predict protein interactions [11]). Table 4 shows the number of unique interactions in each dataset, the methods used to generate each dataset, and the URLs from which the datasets were obtained.

Table 4
Sources of model organism protein-interaction data

Dataset   Interactions      Type                         Reference   URL
Fly       20,020            Two-hybrid                   [5]         [26]
Worm       4,605            Two-hybrid                   [6]         [27]
Yeast     78,391            Total                        [11]        [11]
           5,125            Two-hybrid
          49,313            Complex purification
             886            Genetic
          23,844 (23,399)   In silico (In silico only)

The table lists the total number of interactions contained in each model organism dataset, together with the method used to identify interactions, the publication reference, and the website (URL) from which the interaction dataset was obtained. For each dataset, the non-redundant number of unique interactions between unambiguously identified proteins is shown. For the yeast interactions, the total number of interactions is shown, as well as the number of interactions identified using each detection method. In silico only are interactions only predicted by in silico methods without any confirmation from the experimental datasets.

Identification of gene orthologs and construction of the interaction network

The human orthologs of yeast, worm and fly genes were identified using the InParanoid algorithm, which is designed to distinguish true orthologs from out-paralogs that arose from gene duplications before the divergence of two species [9]. The InParanoid algorithm first identifies potential orthologs by best pairwise similarity searches, and then clusters these orthologs into groups of probable co-orthologs, with each ortholog assigned a score representing the confidence that it is the main ortholog. For each interaction data source, we obtained SWISS-PROT/TrEMBL accessions for each interacting protein using the Ensmart data-mining tool [16,17] (for worm and fly genes) or both SWISS-PROT [22] and a TrEMBL conversion file kindly provided by Paul Kersey, EBI, Hinxton, UK (for yeast genes). Potential human orthologs of these genes were then identified using the pre-computed InParanoid results (version 2.3, available from [23]), and the results converted to nonredundant Ensembl (v19.34a.1, genome assembly NCBI34) gene IDs using Ensmart (v19.1) [16,17]. In total, InParanoid identifies 9,500 human genes with at least one ortholog in at least one of worm, fly or yeast. For each potential ortholog in a group of co-orthologs, the InParanoid algorithm calculates a score that represents the confidence that it is the main ortholog. In this scoring system, the main ortholog always receives a score of 1, with the other co-orthologs receiving scores ranging between 0 and 1, calculated according to their similarity to the main ortholog [9]. As an indication of the confidence we have in the orthology relationships between a pair of interacting proteins from a model organism and a predicted pair of interacting human proteins, we calculate a confidence score by summing the InParanoid confidence scores for each of the four proteins. Hence, each interaction has an associated score ranging from 0 to 4 that represents the confidence that both human proteins represent the main orthologs of the model organism proteins, and vice versa.

Core interactions were defined as those predicted by worm interactions identified more than once or that reconfirmed when retested in the two-hybrid assay [6], by fly interactions with an interaction score greater than 0.5 [5], or by yeast interactions detected two or more times by a single assay [1-3,11].

Assessment of the interaction data

Human GOs (at levels 3 or deeper in the GO hierarchy) were obtained from Ensembl (v19.34a.1) [14,15] using Ensmart (v19.1) [16,17]. The GO terms 'unknown molecular function/biological process/cellular compartment' were discarded in all subsequent analyses. To validate the accuracy of the interaction data, we calculated the percentage of interactions that shared at least one GO term. To confirm that the results did not just apply to very general GO annotations, we calculated the proportion of interacting proteins that shared a GO annotation at levels 3 to 8 and greater than 8 in the GO hierarchy. For all of these analyses we ignored proteins with no associated GO annotations. Moreover, self-interactions were excluded because they will always share GO terms and so bias the results.

Prediction of gene functions

To predict the GO terms of a protein, we identified all the GO terms associated with x or more of its interaction partners (where x varied from 1 to 6). To validate the accuracy and coverage of this approach we predicted GO terms for genes that already have associated GO terms. The accuracy was calculated as the total number of correct GO term predictions divided by the total number of GO term predictions. The coverage was calculated as the total number of correct GO term predictions divided by the total number of known GO terms. This analysis was repeated, but only considering individually GO terms at depths of 3 to 8 and greater than 8 in the GO hierarchy (see Additional data file 3). To avoid biasing the results we again ignored self-interactions. For the same reason, we also only counted once GO terms associated with more than one interaction partner predicted by the same source interaction from a model organism. The InterPro protein domains [24] encoded by each human gene were obtained from Ensembl using Ensmart. Genes of unknown function were defined as those having no associated GO terms, and disease genes were as defined by Ensembl using the Online Mendelian Inheritance in Man (OMIM) database as a reference [25].

Additional data files

The following additional data files are available with the online version of this article: Additional data file 1 contains a complete list of predicted human protein interactions; this dataset contains every human protein interaction that is predicted by a protein interaction from any of seven experimental and four computationally-predicted protein interaction maps from Saccharomyces cerevisiae [1-4,10,11], Drosophila melanogaster [5] and Caenorhabditis elegans [6]. Additional data file 2 contains a list of all core human protein interactions. This represents a subset of high-confidence human protein interactions that is predicted by model organism protein interactions with greater experimental support. Additional data file 3 lists the accuracy and coverage of GO term predictions at different levels in the GO hierarchy; Additional data file 4 lists gene function predictions for 85 human
genes of unknown function; Additional data file 5 lists human In the worm interaction map, these are defined as interac- disease genes with predicted protein interactions; and tions that reconfirmed when retested in the Y2H assay [6]. In Additional data file 6 lists gene function predictions for 55 the fly interaction map, each interaction has an associated human disease genes. Additi A c A c Cl Additi A A Cl Additi The a el The a el Cl Additi Gene function Gene function Cl Additi Human dis Human dis Cl Additi Gene Gene Cl li li s in s in ick here ick here ick here ick here ick here ick here o o s smple mple t t of of ccur ccur func func f f the G the G onal onal onal onal onal onal u u al al nction nction te te a a l l for for for for for for tion tion cor cor cy cy e e da da da da da da O O list list as as hie hie and cov and cov additi additi additi additi additi additi ta file ta file ta file ta file ta file ta file e gen e gen e e pred pred pre pre of of huma huma r rar ar pre pre d d 1 2 3 4 5 6 i i i ichy chy e e onal dat onal dat onal dat onal dat onal dat onal dat c c c cs w s w ti ti ti ti d d e e n p n p ra ra ons for 85 human gen ons for 85 human gen ons for 55 huma ons for 55 huma i ic ci ite te ge ge th p th p r rote ote d human pr d human pr of GO t of GO t a a a a a a r ri i file file file file file file edic edic n i n in nted p ted p t te e e er r rm p rm p a act ct ote ote r r n dise n dise i iote ote o o r ri in n e e n int n int d d s s i in in n in i ie e c c ase ase ti ti s of unknown s of unknown e e ons at ons at ter ter ra ra g gcti cti e e a anes nes ct ct o o differ differ ions ions ns ns ent ent lev lev- - confidence score, and interactions with a score greater than 0.5 are considered core interactions (the interaction score mainly depends upon the number of times each interaction Acknowledgements We thank the Sanger Institute Web Team for construction of the web was detected, the total number of interactions made by each interface and 
Paul Kersey for providing a list of TrEMBL accessions for yeast protein and the local network clustering [5]). To generate a proteins. B.L. is supported by a Sanger Institute Postdoctoral Fellowship similar subset of yeast protein interactions, we defined core and A.G.F is supported by the Wellcome Trust. yeast protein interactions as those identified more than once by any single assay. Each entry in the core and complete inter- action networks contains the following tab delimited infor- References 1. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lock- mation: Gene 1 Id, Ensembl gene ID for human interaction shon D, Narayan V, Srinivasan M, Pochart P, et al.: A comprehen- partner 1; Gene 1 description, alternative names for human sive analysis of protein-protein interactions in Saccharomyces Gene 1 (from Ensembl); Gene 2 Id, Ensembl gene ID for cerevisiae. Nature 2000, 403:623-627. 2. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A compre- human interaction partner 2; Gene 2 description, alternative hensive two-hybrid analysis to explore the yeast protein names for human Gene 2 (from Ensembl); Source Organism, interactome. Proc Natl Acad Sci USA 2001, 98:4569-4574. 3. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, the model organism protein interaction dataset that predicts Schultz J, Rick JM, Michon AM, Cruciat CM, et al.: Functional organ- this human protein interaction; Ortholog 1, model organism ization of the yeast proteome by systematic analysis of pro- interaction partner 1 from the model organism protein inter- tein complexes. Nature 2002, 415:141-147. 4. 
Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, action that predicts the human protein interaction; Ortholog Taylor P, Bennett K, Boutilier K, et al.: Systematic identification 2, model organism interaction partner 2 from the model of protein complexes in Saccharomyces cerevisiae by mass organism protein interaction that predicts the human protein spectrometry. Nature 2002, 415:180-183. 5. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, interaction; and Ortholog score, a confidence score for the Ooi CE, Godwin B, Vitols E, et al.: A protein interaction map of human protein interaction based on the likelihood that the Drosophila melanogaster. Science 2003, 302:1727-1736. 6. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain two human proteins are the functional orthologs of the two PO, Han JD, Chesneau A, Hao T, et al.: A map of the interactome model organism proteins. The score ranges from 0 (no confi- network of the metazoan C. elegans. Science 2004, 303:540-543. dence) to 4 (high confidence). The score is calculated as the 7. Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M: Identification of potential interaction networks using sum of the Inparanoid confidence scores for each gene sequence-based searches for conserved protein-protein orthology assignment. A score of 4 means that both of the interactions or "interologs". Genome Res 2001, 11:2120-2126. 8. Wojcik J, Boneca IG, Legrain P: Prediction, assessment and vali- human genes and both of the model organism genes are all dation of protein interaction maps in bacteria. J Mol Biol 2002, the main orthologs in their groups of co-orthologs according 323:763-770. to Inparanoid. These represent higher confidence human 9. Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and in-paralogs from pairwise species protein interactions. Description, this field contains the orig- comparisons. J Mol Biol 2001, 314:1041-1052. 
inal annotation for the model organism protein interaction; 10. Tong AH, Evangelista M, Parsons AB, Xu H, Bader GD, Page N, Rob- for worm interactions this indicates whether the interaction is inson M, Raghibizadeh S, Hogue CW, Bussey H, et al.: Systematic genetic analysis with ordered arrays of yeast deletion in the core dataset of interactions found more than once mutants. Science 2001, 294:2364-2368. (CORE_1), or interactions that reconfirmed when retested 11. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of pro- (CORE_2), or non-core interactions that did not reconfirm tein-protein interactions. Nature 2002, 417:399-403. (NON_CORE) [6]. For fly interactions this indicates the 12. The Sanger Institute: Interaction Map [http:// interaction score. This score mainly depends upon the www.sanger.ac.uk/interactionmap] 13. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, number of times each interaction was detected, the total Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: number of interactions made by each protein and the local tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25-29. network clustering, see [5] for details. A score >0.5 is consid- 14. Birney E, Andrews D, Bevan P, Caccamo M, Cameron G, Chen Y, ered high confidence. For yeast protein interactions, these are Clarke L, Coates G, Cox T, Cuff J, et al.: Ensembl 2004. Nucleic Acids the annotations of von Mering et al. [11] and contain the fol- Res 2004, 32 Database issue:D468-D470. 15. Ensembl genome browser [http://www.ensembl.org] lowing information: experimental/computation method (and 16. Ensembl EnsMart genome browser (Martview) [http:// the number of times the interaction was detected); Von Mer- www.ensembl.org/Multi/martview] ing et al.'s confidence assignment; and whether the interac- 17. 
Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic tion was previously known in the literature. For more system for fast and flexible access to biological data. Genome information, please see [11]. Res 2004, 14:160-169. 18. Walhout AJ, Sordella R, Lu X, Hartley JL, Temple GF, Brasch MA, Thi- erry-Mieg N, Vidal M: Protein interaction mapping in C. elegans Genome Biology 2004, 5:R63 comment reviews reports deposited research refereed research interactions information http://genomebiology.com/2004/5/9/R63 Genome Biology 2004, Volume 5, Issue 9, Article R63 Lehner and Fraser R63.9 using proteins involved in vulval development. Science 2000, 287:116-122. 19. Lehner B, Sanderson CM: A protein interaction framework for human RNA degradation. Genome Res 2004, 14:1315-1323. 20. Suzuki H, Fukunishi Y, Kagawa I, Saito R, Oda H, Endo T, Kondo S, Bono H, Okazaki Y, Hayashizaki Y: Protein-protein interaction panel using mouse full-length cDNAs. Genome Res 2001, 11:1758-1765. 21. Remy I, Galarneau A, Michnick SW: Detection and visualization of protein interactions with protein fragment complementa- tion assays. Methods Mol Biol 2002, 185:447-459. 22. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, et al.: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31:365-370. 23. InParanoid: database of pairwise orthologs [http://inpara noid.cgb.ki.se] 24. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Barrell D, Bateman A, Binns D, Biswas M, Bradley P, Bork P, et al.: The InterPro Data- base, 2003 brings increased coverage and new features. Nucleic Acids Res 2003, 31:315-318. 25. Online Mendelian Inheritance in Man [http:// www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM] 26. Welcome to Blueprint [http://www.blueprint.org/bind/bind.php] 27. 
Vidal laboratory [http://vidal.dfci.harvard.edu] Genome Biology 2004, 5:R63 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Genome Biology Springer Journals http://www.deepdyve.com/lp/springer-journals/a-first-draft-human-protein-interaction-map-T9p4ghRJrI
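The procedure described under 'Prediction of gene functions' in Materials and methods can be illustrated with a small Python sketch. It is not the authors' code. The gene names, GO identifiers and function names are made up for illustration, and two simplifications are assumed: every annotated gene contributes to the count of known GO terms, and the rule about counting terms only once per source interaction is omitted.

```python
from collections import Counter


def predict_go_terms(gene, partners, go_terms, min_partners):
    """Return GO terms carried by at least `min_partners` partners of `gene`."""
    counts = Counter()
    for partner in partners.get(gene, set()):
        if partner == gene:
            continue  # self-interactions are ignored, as in the text
        counts.update(go_terms.get(partner, set()))
    return {term for term, n in counts.items() if n >= min_partners}


def accuracy_and_coverage(partners, go_terms, min_partners):
    """Score predictions made for genes that already have GO terms."""
    correct = predicted = known = 0
    for gene, known_terms in go_terms.items():
        if not known_terms:
            continue  # genes without annotation cannot be scored
        guesses = predict_go_terms(gene, partners, go_terms, min_partners)
        correct += len(guesses & known_terms)
        predicted += len(guesses)
        known += len(known_terms)
    # accuracy = correct predictions / all predictions
    accuracy = correct / predicted if predicted else 0.0
    # coverage = correct predictions / all known terms
    coverage = correct / known if known else 0.0
    return accuracy, coverage


# Toy data: hypothetical genes A-D with hypothetical GO identifiers.
toy_go_terms = {
    "A": {"GO:0000001", "GO:0000002"},
    "B": {"GO:0000001"},
    "C": {"GO:0000001", "GO:0000003"},
    "D": {"GO:0000004"},
}
toy_partners = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": {"A"},
    "D": {"D"},  # a self-interaction only, so no prediction is made
}

for x in (1, 2):
    accuracy, coverage = accuracy_and_coverage(toy_partners, toy_go_terms, x)
    print(x, round(accuracy, 3), round(coverage, 3))
```

In this toy example, raising x from 1 to 2 increases accuracy and lowers coverage, which is the trade-off that varying x from 1 to 6 is meant to expose.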



Publisher: Springer Journals
Copyright: 2004 Lehner and Fraser; licensee BioMed Central Ltd.
eISSN: 1474-760X
DOI: 10.1186/gb-2004-5-9-r63

Abstract

Background: Protein-interaction maps are powerful tools for suggesting the cellular functions of genes. Although large-scale protein-interaction maps have been generated for several invertebrate species, projects of a similar scale have not yet been described for any mammal. Because many physical interactions are conserved between species, it should be possible to infer information about human protein interactions (and hence protein function) using model organism protein- interaction datasets. Results: Here we describe a network of over 70,000 predicted physical interactions between around 6,200 human proteins generated using the data from lower eukaryotic protein-interaction maps. The physiological relevance of this network is supported by its ability to preferentially connect human proteins that share the same functional annotations, and we show how the network can be used to successfully predict the functions of human proteins. We find that combining interaction datasets from a single organism (but generated using independent assays) and combining interaction datasets from two organisms (but generated using the same assay) are both very effective ways of further improving the accuracy of protein-interaction maps. Conclusions: The complete network predicts interactions for a third of human genes, including 448 human disease genes and 1,482 genes of unknown function, and so provides a rich framework for biomedical research. human protein interactions and protein function using data Background Physical interactions between proteins underpin most biolog- from model organism protein-interaction datasets [7,8]. ical processes. For this reason, large-scale protein-interaction mapping projects have been initiated in several model organ- To transfer information on gene function between two isms [1-6]. 
Unfortunately, projects of a similar scale have not genomes requires the identification of orthologous genes in yet been described for mammalian systems, with the result the two genomes (that is, genes that are descended from a that our global understanding of protein function remains common ancestor and share biological functions). However, less advanced in mammals than in lower eukaryotes. How- the identification of gene orthologs is often not a trivial prob- ever, many physical interactions are conserved between spe- lem; gene duplications can result in a single gene having mul- cies, so it should be possible to infer information about tiple potential orthologs in a second species. In addition, it is Genome Biology 2004, 5:R63 R63.2 Genome Biology 2004, Volume 5, Issue 9, Article R63 Lehner and Fraser http://genomebiology.com/2004/5/9/R63 necessary to distinguish true gene orthologs from 'out-para- Table 1 logs' (that is, genes that arose from a gene-duplication event The number and accuracy of human protein interactions pre- before the divergence of two species, and so are unlikely to dicted by different model organism protein-interaction datasets share functions) [9]. One method that addresses both these problems is the InParanoid algorithm, which first identifies Data source Predicted Interactions sharing GO terms potential orthologs by best pairwise similarity searches, and human interactions then clusters these orthologs into groups of likely co- orthologs, with each ortholog assigned a score representing Number % the confidence that it is the main ortholog [9]. We have used All 71,496 12,724 24.9 the orthology relationships identified by the InParanoid algo- Yeast 55,231 10,727 26.2 rithm to construct a putative human protein-interaction map Fly 12,059 1,404 19.0 based solely on high-throughput interaction datasets from model organisms. 
We show that this approach successfully Worm 4,494 753 24.4 identifies functionally related human proteins, and so can be All core 11,487 3,133 38.1 used to assign putative functions to many novel human genes. Core yeast 6,061 2,146 45.4 The resulting network provides a framework for human biol- Core fly 2,889 488 27.8 ogy and acts as a guide for a future experimental human pro- Core worm 2,701 597 32.3 tein-interaction mapping project. Two species 288 154 74.8 Two species (core) 160 95 88.0 Two methods 2,166 829 60.6 Results Random pairs 71,496 6,053 14.6 Generation of a human protein-interaction map Protein interactions are often evolutionarily conserved The table lists the total number of interactions predicted by each interaction dataset, and the number of these interactions that connect between orthologous proteins from different species [7]. proteins that share at least one GO term (at level 3 or deeper in the Hence we reasoned that a human protein-interaction map GO hierarchy). The percentages are relative to the total number of could be constructed using data from model organism pro- non-self interactions where both proteins have at least one GO annotation. All, all predicted human protein interactions; Yeast/worm/ tein-interaction mapping projects. We obtained the data from fly, interactions predicted by the yeast, worm or fly interaction maps; seven experimental and four computationally predicted pro- All core, all interactions predicted by the high-confidence subsets of tein-interaction maps from Saccharomyces cerevisiae [1- each model organism interaction map (see Materials and methods); 4,10,11], Drosophila melanogaster [5] and Caenorhabditis Two species, interactions predicted by more than one model organism interaction map; Two species (core), interactions predicted by the elegans [6]. 
For each interacting protein, we identified poten- high-confidence subset of interactions from more than one model tial human orthologs using the InParanoid algorithm [9]. A organism; Two methods, interactions predicted by data derived from human protein interaction is predicted if both interaction more than one different interaction assay; Random pairs, the data for a partners from a model organism have one or more human randomly generated interaction network. orthologs. Using this strategy, we were able to generate a human interaction network comprising 71,496 interactions between 6,231 human proteins. The sources of these pre- dicted interactions are summarized in Table 1 and Figure 1a, and all the interactions are available in Additional data file 1 physiologically interacting proteins are expected to have available online with this article and can also be searched or related, but non-identical functions, they are expected to downloaded from our website [12]. share some, but not all GO annotations. Therefore, one method to evaluate an interaction dataset is to count the pro- Assessment of the accuracy of the interaction datasets portion of interactions that connect proteins that share com- In the absence of a comprehensive set of verified human pro- mon GO terms [5]. For the complete predicted human tein interactions, we required another method to assess the interaction network, 25% of interaction partners share at accuracy of the interaction network. Proteins that interact least one GO term, which is many more than observed with a physiologically are expected to have related functions. There- randomly generated network of the same size (15% of interac- fore high-quality interaction datasets should predict a greater tions). To confirm that this result did not just apply to quite proportion of interactions between functionally related pro- general GO annotations, we calculated the proportion of teins than low quality datasets. 
The functions of human proteins can be systematically described using the Gene Ontology (GO) annotations [13] available from Ensembl [14-17]. GO annotations provide a hierarchical description of gene functions, with general functions described by GO annotations at the top levels of the hierarchy and very precise functions described by terms deeper in the hierarchy. Because [...] interaction partners that share GO annotations at depths 3 to 8 and greater than 8 in the GO hierarchy. We found that the predicted interaction network preferentially connects proteins that share GO annotations at any level of the GO hierarchy (see Figure 2). This suggests that the interaction network indeed preferentially connects functionally related human proteins.

Figure 1. Sources of predicted human protein interactions. (a) The number of human protein interactions predicted by the interaction maps from each model organism. (b) The number of human protein interactions predicted by the core higher-confidence interactions from each organism. As explained in the text, core interactions are those that reconfirmed when retested (worm), or had an interaction score of greater than 0.5 (fly) or were identified more than once in a single assay (yeast, worm). [Venn diagrams not reproduced. Panel (a), complete network (71,496 interactions): worm 4,494; yeast 55,252; fly 12,059. Panel (b), core network (11,487 interactions): worm 2,701; yeast 6,061; fly 2,889.]

Figure 2. Filtering interaction datasets to improve their accuracy. (a) The percentages of interactions sharing GO terms at various depths in the GO hierarchy are compared for interactions predicted by the high-confidence interactions from each model organism (core yeast, core worm and core fly), as well as for the complete datasets from each organism (all yeast, all worm, all fly). For comparison, the percentage of shared GO terms is shown for a randomly generated network of the same size as the complete human network (random pairs). The x-axis indicates the depth in the GO hierarchy being considered, and the y-value the percentage of interaction partners (with known GO annotations) that share GO annotations at this depth or deeper. (b) The percentages of interactions sharing GO terms at different levels in the GO hierarchy are compared for interactions predicted by core interactions in two or more species (two species (core)), by interactions in the complete datasets of two or more species (two species), for interactions predicted by more than one experimental method in yeast (two methods), by any core interaction (all core), by any interaction (all), or by a randomly generated interaction network of the same size as the complete human interaction network (random pairs). All values shown are the percentage of non-self interactions between pairs of proteins that both have at least one associated GO term at the indicated depth in the GO hierarchy. [Plots not reproduced. x-axis: depth of shared GO term (3 to 8 and >8); y-axis: percentage of non-self interactions sharing GO term.]

We then used the same strategy to compare the accuracy of human interactions predicted by data from the three different model organisms. If the interactions from a particular model organism dataset predict fewer interactions between functionally related human proteins than the other datasets, then this dataset should be considered less reliable as a source of candidate human protein interactions. As shown in Table 1 and Figure 2a, interactions predicted by the complete yeast and worm datasets are slightly better at connecting functionally related human proteins than those predicted by the fly dataset, suggesting that these interactions can be considered with higher confidence. This result is especially interesting given that the yeast interaction map is an order of magnitude larger than the fly or worm maps, confirming that the fly and worm interaction maps currently have a relatively low coverage.

Table 2. The number of interactions, genes, novel genes and disease genes in the complete and core human interaction networks

| Network | Interactions | Genes | Novel genes | Disease genes |
|---|---|---|---|---|
| Complete | 71,496 | 6,231 | 1,482 | 448 |
| Core | 11,487 | 3,872 | 864 | 292 |

The complete network consists of all human protein interactions predicted by model organism protein-interaction datasets. The core network consists of all the human interactions predicted by the high-confidence subsets of each interaction network (see Materials and methods). Novel genes are defined as those without GO annotations. Disease genes are defined by the OMIM database [25], available from Ensembl [16].

Next we asked how the confidence in the assignment of gene orthologs affects the accuracy of an interaction. For each predicted interaction, an orthology confidence score was calculated by summing the InParanoid orthology confidence scores for the two human and two model organism proteins (see Materials and methods). Of the predicted interactions, 24,897 have the maximum possible confidence score of 4.
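The orthology confidence score described above is a simple sum of four per-protein scores, which the following minimal Python sketch illustrates. The function names, the tolerance and the example values are hypothetical and chosen only for illustration; they are not taken from the InParanoid software or from the data behind this network.

```python
from typing import Sequence

# Two human and two model organism proteins, each scoring at most 1.
MAX_SCORE = 4.0


def orthology_confidence(scores: Sequence[float]) -> float:
    """Sum four per-protein ortholog scores into one interaction-level score.

    Each score is assumed to lie in [0, 1], with 1 marking the main ortholog.
    """
    if len(scores) != 4:
        raise ValueError("expected four scores: two human and two model organism proteins")
    if any(s < 0.0 or s > 1.0 for s in scores):
        raise ValueError("each ortholog score must lie between 0 and 1")
    return float(sum(scores))


def has_max_confidence(scores: Sequence[float], tol: float = 1e-9) -> bool:
    """True when all four proteins are main orthologs (summed score of 4)."""
    return abs(orthology_confidence(scores) - MAX_SCORE) <= tol


# Made-up example values, not taken from the paper's data.
all_main = (1.0, 1.0, 1.0, 1.0)
one_weak = (1.0, 0.5, 1.0, 1.0)
print(orthology_confidence(all_main), has_max_confidence(all_main))
print(orthology_confidence(one_weak), has_max_confidence(one_weak))
```

Under these assumptions, an interaction reaches the maximum score only when every one of the four proteins is the main ortholog in its group, which is the subset compared against lower-scoring interactions in the text.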
Of Combining interaction datasets to generate high- these interactions, 28%, 24% and 13% connect proteins that confidence networks share GO terms at depths of 3, 5 or 7 in the GO hierarchy It has been shown previously that protein interactions (excluding proteins without GO annotation). In contrast, for detected by more than one high-throughput interaction assay interactions with an orthology confidence score less than 4, are more accurate [11]. We find that this is also true for these figures are 24%, 20% and 10%. Hence we conclude that human protein interactions predicted by yeast protein inter- the predicted human interactions with high-confidence actions detected by more than one method (see Figure 2b and orthology assignments can be considered more reliable than Table 1). It has also been suggested that protein interactions those interactions with less confidence in their orthology are more likely to represent physiologically important inter- assignments. This confirms that the confidence scores actions if they have been detected between orthologous pro- assigned using InParanoid are indeed likely to be useful pre- tein pairs from two or more species [7,18]. To test this dictors of functional conservation. hypothesis we identified 288 human protein interactions pre- dicted by interactions in two or more model organisms (Fig- A core dataset of high-confidence protein interactions ure 1, Table 1). Remarkably, 75%, 70% and 56% of these The worm and fly interaction mapping projects both defined interactions share GO terms at depths of 3, 5 or 7 in the GO a subset of high-confidence 'core' interactions that have the hierarchy, respectively (Figure 2b). Indeed, for interactions greatest experimental support (Figure 1b). For the worm derived from core interaction datasets, these figures rise to interaction map these were defined as interactions identified 88%, 80% and 67% of interactions. 
Hence, protein interac- more than once, or that reconfirmed when retested in the tions predicted by data from multiple species can be consid- two-hybrid assay [6]. In the fly interaction map each interac- ered with very high confidence. tion has an associated confidence score, and interactions with a score greater than 0.5 are considered core interactions (the Using the interaction network to predict human gene interaction score mainly depends upon the number of times function each interaction was detected, the total number of Because physiologically interacting proteins often have simi- interactions made by each protein and the local network clus- lar functions (Figure 2), it should be possible to predict the tering; see [5]). To generate a similar subset of yeast protein functions of a novel human protein if it interacts with pro- interactions, we defined core yeast protein interactions as teins of known function. To address how well our interaction those identified more than once by any single assay, consist- map could be used for this purpose, we asked whether the ent with previous analyses of the individual datasets [1-3,11]. known GO terms of a protein could be predicted using only As shown in Figure 2a and Table 1, for all three species these the GO terms of its interaction partners. As shown in Table 3, core interactions predict a greater proportion of human inter- GO terms associated with at least one of a gene's core interac- actions that share GO terms than the total datasets. Indeed all tion partners predict GO terms associated with that gene with three core interaction maps are of similar accuracy, so we an accuracy of around 8%. However, GO terms associated combine their predicted interactions into a core network of with at least two, three, four or five of a gene's interaction 11,487 higher-confidence human protein interactions (sum- partners have 22%, 30%, 37%, 42% and 45% probabilities, marized in Table 2 and available as Additional data file 2). 
Of respectively, of also being associated with that gene (Table 3). these core interactions, 38%, 35% and 24% connect proteins Although these values may vary for different GO terms, as that share GO terms at depths of 3, 5 or 7 in the GO hierarchy shown in Additional data file 3, the accuracy and coverage of (excluding proteins with no GO annotations). these GO term predictions are very similar for GO terms at Genome Biology 2004, 5:R63 comment reviews reports deposited research refereed research interactions information http://genomebiology.com/2004/5/9/R63 Genome Biology 2004, Volume 5, Issue 9, Article R63 Lehner and Fraser R63.5 Table 3 human disease genes (listed in Additional data file 5), of which 55 interact with two or more proteins in the core inter- The approximate accuracy and coverage of GO terms predicted action network that share a GO annotation. The functional by the core and complete interaction networks predictions for these 55 genes are listed in Additional data file Number of interactors Core data Complete data with GO term Accuracy Coverage Accuracy Coverage Discussion 1+ 8 26 3 35 A framework for human biology We report here the use of data from model organism protein- 2+ 22 11 8 19 interaction mapping projects to predict a network of human 3+ 30711 14 protein interactions. This network consists of over 70,000 4+ 36515 11 interactions that connect over one-third of all the predicted 5+ 424188 human proteins, including 1,482 proteins of unknown func- 6+ 453207 tion and 448 proteins encoded by human disease genes. The The approximate accuracy and coverage of GO term predictions were physiological relevance of this network is supported by its calculated for every gene in the core or complete interaction networks ability to preferentially connect human proteins that share with at least one known GO term. The GO terms of a gene are biological functions (Figure 2). 
Indeed the network can be predicted using the GO terms of any of its interaction partners (1+), or GO terms shared by at least two to six of its interaction partners (2+ successfully used to predict the functions of a gene using the to 6+). Accuracy is calculated as the number of correctly predicted GO known functions of its interaction partners (Table 3). As such, terms divided by the total number of predicted GO terms. Coverage is the network should provide a rich source of functional calculated as the number of correctly predicted GO terms divided by hypotheses for researchers interested in the functions of one the total number of known GO terms associated with each gene. These values are similar for GO annotations at different levels of the GO or many human proteins. hierarchy (see Additional data file 3). The accuracy and coverage of the interactions predicted in this network depend primarily on two parameters: the quality of the original model organism interaction datasets; and the different levels in the GO hierarchy, and so can be used as an ability to identify the human orthologs of a model organism approximate indication of the confidence in a prediction of protein. Our analysis suggests that the raw yeast and worm gene function. Hence the network can be used to predict GO protein-interaction datasets are currently slightly more accu- terms for a human gene of unknown function, with the rate than the raw fly interaction dataset, but that when fil- approximate confidence in the GO prediction determined by tered for high-confidence interactions the three interaction the number of interaction partners that share the GO term. maps are of very similar accuracy (see Table 1 and Figure 2). 
The fly and worm interaction maps both have a much lower The ability to provide a reasonably accurate prediction of a coverage than the yeast interaction network, most probably gene's GO terms means that we can use the interaction net- because they both only represent the results of a single inter- work to provide probabilistic gene function predictions for action-mapping project. The continuation of these model novel human proteins and also to predict additional functions organism protein-interaction mapping projects to generate for proteins with some known functions. The core interaction higher coverage interaction maps will greatly enhance our map contains 864 proteins with no functional annotations. ability to predict human protein interactions. About 10% of these proteins interact with two or more pro- teins that share GO terms. The probabilistic predictions of the For the identification of gene orthologs, we used the InPara- functions of these novel proteins are listed in Additional data noid algorithm. InParanoid offers several important benefits file 4. Often these predicted functions are also supported by compared to simple 'reciprocal best hit' sequence-similarity the known functions of the protein domains predicted to be searches [9]. First, many genes from lower eukaryotes have encoded by these novel genes (see Additional data file 4). For multiple co-orthologs in humans, which can be identified example, ENSG00000028310 encodes a bromodomain and using InParanoid, but not by simple one-to-one sequence- interacts with six proteins annotated as 'GO:0006355 regula- similarity searches. Second, InParanoid can successfully dis- tion of transcription, DNA-dependent', ENSG00000080608 tinguish these true co-orthologs from paralogs that arose encodes an RNA-binding domain and interacts with five pro- before a speciation event (which are unlikely to retain similar teins annotated as 'GO:0006364 rRNA processing', and functions). 
Finally, each potential ortholog in a group of co- ENSG00000104863 encodes a PDZ domain and interacts orthologs identified by InParanoid has an associated score with three proteins with the annotations 'GO:0005887 inte- that represents the likelihood that it is the main ortholog of a gral to plasma membrane, GO:0007242 intracellular signal- gene. We have summed these confidence scores to provide an ing cascade' (Additional data file 4). The complete and core orthology confidence score for each predicted human protein interaction maps also predict interactions for 448 and 292 interaction in our network. These high-confidence ortholog Genome Biology 2004, 5:R63 R63.6 Genome Biology 2004, Volume 5, Issue 9, Article R63 Lehner and Fraser http://genomebiology.com/2004/5/9/R63 interactions connect a greater proportion of functionally human protein interactions, but will also allow many more related human proteins, suggesting that the InParanoid con- interactions to be verified using the interaction footprinting fidence score is indeed a useful tool for predicting the likely strategy. However, such an approach will be limited to pro- physiological relevance of a predicted protein interaction. viding information on those proteins and interactions that are conserved between vertebrates and invertebrates. The ability to successfully predict human protein functions using the results of model organism protein-interaction map- Strategies for completing the human interaction map ping projects highlights both the relevance of model organism The interactions described here provide a first-draft human protein-interaction mapping projects to understanding protein-interaction map that can be used to predict interac- human biology and also the benefits that would result from an tions and functions for genes of interest to a particular experimental human protein-interaction mapping project. researcher. 
However, the map also provides a framework Although the interaction network can currently accurately from which a complete human protein-interaction map could predict only a subset of the known functions of a gene, this be generated. Firstly, the map could be used to identify sub- should improve as more protein-interaction data becomes sets of high-confidence, evolutionarily conserved interactions available. For this reason, we strongly encourage the continu- from the results of large- or medium-scale human interac- ation of model organism protein-interaction mapping tion-mapping projects. For example the map verifies 51 of projects. 296 yeast two-hybrid interactions detected for human pro- teins involved in mRNA decay [19]. Alternatively, the Methods of verifying protein-interaction datasets interactions predicted here could be directly experimentally We also assessed the relative merits of three different meth- validated using an assay that allows rapid testing of binary ods to improve the accuracy of protein-interaction maps. The interactions (such as the yeast or mammalian two-hybrid first strategy is to define a subset of interactions detected assays [20] or protein fragment complementation assays more than once with a single assay [1-3,6]. We found that this [21]). This would represent a cost-effective strategy to pro- approach leads to an approximately 1.5- to 2.7-fold increase duce a high-confidence human protein-interaction map in the proportion of predicted human interactions that share because it massively reduces the number of candidate inter- GO terms (Figure 2b). The second strategy is to define a sub- actions that need to be tested. Finally, the map identifies set of interactions that have been identified by more than one 17,300 (23,531 - 6,231) human genes for which no protein interaction assay. 
This results in around a 2.3- to 8-fold interactions are predicted from model organism interaction improvement in the prediction of associations between pro- datasets. Many of these proteins are likely to be vertebrate- or teins that share GO terms (Figure 2b). The final strategy is to mammalian-specific, and are the most logical choices for bait define a subset of interactions that are predicted by interac- proteins for the discovery phase of an experimental human tions from more than one model organism, which results in protein-interaction mapping project. around a 3- to 12-fold improvement in the proportion of interactions between proteins sharing GO terms (Figure 2b). Materials and methods With all these filtering methods, the greatest improvements Model organism protein-interaction datasets The interaction datasets used to generate the draft human are seen when considering the proportion of interactions that share GO terms deep within the GO hierarchy; that is, the fil- protein-interaction network were two-hybrid-based interac- tering steps dramatically improve the proportion of interac- tion maps for D. melanogaster [5] and C. elegans [6] and a tions between proteins with very closely related functions. We list of S. cerevisiae protein-interactions compiled by Von conclude that using interaction data derived from a second Mering et al. [11] from two two-hybrid [1,2], two complex interaction assay or from a second species both represent purification [3,4], one genetic [10], and four in silico-pre- excellent methods to improve the accuracy of protein-interac- dicted interaction datasets (which used correlated mRNA tion maps. Because of the small number of protein-interac- expressions, conserved gene neighbourhood, gene co-occur- tion assays that have been adapted to a high-throughput rence or gene fusion events to predict protein interactions format, we suggest that constructing a second interaction [11]). 
Table 4 shows the number of unique interactions in map in a related organism using the same assay may be an each dataset, the methods used to generate each dataset, and efficient way to produce a high-confidence interaction map. the URLs from which the datasets were obtained. This strategy is somewhat similar to using phylogenetic foot- printing to identify functional noncoding DNA, so we suggest Identification of gene orthologs and construction of the it should be named 'interaction footprinting'. Using the rela- interaction network tively low-coverage model organism interaction datasets The human orthologs of yeast, worm and fly genes were iden- currently available, only a small proportion of interactions tified using the InParanoid algorithm, which is designed to can be verified by interaction footprinting. The continuation distinguish true orthologs from out-paralogs that arose from of these model organism interaction mapping projects will gene duplications before the divergence of two species [9]. not only provide a much richer framework of predicted The InParanoid algorithm first identifies potential orthologs Genome Biology 2004, 5:R63 comment reviews reports deposited research refereed research interactions information http://genomebiology.com/2004/5/9/R63 Genome Biology 2004, Volume 5, Issue 9, Article R63 Lehner and Fraser R63.7 Table 4 Core interactions were defined as those predicted by worm interactions identified more than once or that reconfirmed Sources of model organism protein-interaction data when retested in the two-hybrid assay [6], by fly interactions with an interaction score greater than 0.5 [5], or by yeast Dataset Interactions Type Reference URL interactions detected two or more times by a single assay [1- Fly 20,020 Two-hybrid [5] [26] 3,11]. 
Worm 4,605 Two-hybrid [6] [27] Assessment of the interaction data Yeast 78,391 Total [11] [11] Human GOs (at levels 3 or deeper in the GO hierarchy) were 5,125 Two-hybrid obtained from Ensembl (v19.34a.1) [14,15] using Ensmart 49,313 Complex purification (v19.1) [16,17]. The GO terms 'unknown molecular function/ 886 Genetic biological process/cellular compartment' were discarded in 23,844 (23,399) In silico (In silico only) all subsequent analyses. To validate the accuracy of the inter- The table lists the total number of interactions contained in each model action data, we calculated the percentage of interactions that organism dataset, together with the method used to identify shared at least one GO term. To confirm that the results did interactions, the publication reference, and the website (URL) from not just apply to very general GO annotations, we calculated which the interaction dataset was obtained. For each dataset, the non- redundant number of unique interactions between unambiguously the proportion of interacting proteins that shared a GO anno- identified proteins is shown. For the yeast interactions, the total tation at levels 3 to 8 and greater than 8 in the GO hierarchy. number of interactions is shown, as well as the number of interactions For all of these analyses we ignored proteins with no associ- identified using each detection method. In silico only are interactions ated GO annotations. Moreover, self-interactions were only predicted by in silico methods without any confirmation from the experimental datasets. excluded because they will always share GO terms and so bias the results. Prediction of gene functions To predict the GO terms of a protein, we identified all the GO terms associated with x or more of its interaction partners by best pairwise similarity searches, and then clusters these (where x varied from 1 to 6). 
To validate the accuracy and cov- orthologs into groups of probable co-orthologs, with each erage of this approach we predicted GO terms for genes that ortholog assigned a score representing the confidence that it already have associated GO terms. The accuracy was calcu- is the main ortholog. For each interaction data source, we lated as the total number of correct GO term predictions obtained SWISS-PROT/TrEMBL accessions for each inter- divided by the total number of GO term predictions. The cov- acting protein using the Ensmart data-mining tool [16,17] (for erage was calculated as the total number of correct GO term worm and fly genes) or both SWISS-PROT [22] and a predictions divided by the total number of known GO terms. TrEMBL conversion file kindly provided by Paul Kersey, EBI, This analysis was repeated, but only considering individually Hinxton, UK (for yeast genes). Potential human orthologs of GO terms at depths of 3 to 8 and greater than 8 in the GO hier- these genes were then identified using the pre-computed archy (see Additional data file 3). To avoid biasing the results InParanoid results (version 2.3, available from [23]), and the we again ignored self-interactions. For the same reason, we results converted to nonredundant Ensembl (v19.34a.1, also only counted once GO terms associated with more than genome assembly NCBI34) gene IDs using Ensmart (v19.1) 1 one interaction partner predicted by the same source interac- [16,17]. In total, InParanoid identifies 9,500 human genes tion from a model organism. The InterPro protein domains with at least one ortholog in at least one of worm, fly or yeast. [24] encoded by each human gene were obtained from For each potential ortholog in a group of co-orthologs, the Ensembl using Ensmart. Genes of unknown function were InParanoid algorithm calculates a score that represents the defined as those having no associated GO terms, and disease confidence that it is the main ortholog. 
In this scoring system, genes were as defined by Ensembl using the Online Mende- the main ortholog always receives a score of 1, with the other lian Inheritance in Man (OMIM) database as a reference [25]. co-orthologs receiving scores ranging between 0 and 1, calcu- lated according to their similarity to the main ortholog [9]. As an indication of the confidence we have in the orthology rela- Additional data files tionships between a pair of interacting proteins from a model The following additional data files are available with the organism and a predicted pair of interacting human proteins, online version of this article: Additional data file 1 contains a we calculate a confidence score by summing the InParanoid complete list of predicted human protein interactions; this confidence scores for each of the four proteins. Hence, each dataset contains every human protein interaction that is pre- interaction has an associated score ranging from 0 to 4 that dicted by a protein interaction from any of seven experimen- represents the confidence that both human proteins repre- tal and four computationally-predicted protein interaction sent the main orthologs of the model organism proteins, and maps from Saccharomyces cerevisiae [1-4,10,11], Drosophila vice versa. melanogaster [5] and Caenorhabditis elegans [6]. Genome Biology 2004, 5:R63 R63.8 Genome Biology 2004, Volume 5, Issue 9, Article R63 Lehner and Fraser http://genomebiology.com/2004/5/9/R63 Additional data file 2 contains a list of all core human protein Additional data file 3 lists the accuracy and coverage of GO interactions. This represents a subset of high-confidence term predictions at different levels in the GO hierarchy; Addi- human protein interactions that is predicted by model organ- tional data file 4 lists gene function predictions for 85 human ism protein interactions with greater experimental support. 
genes of unknown function; Additional data file 5 lists human In the worm interaction map, these are defined as interac- disease genes with predicted protein interactions; and tions that reconfirmed when retested in the Y2H assay [6]. In Additional data file 6 lists gene function predictions for 55 the fly interaction map, each interaction has an associated human disease genes. Additi A c A c Cl Additi A A Cl Additi The a el The a el Cl Additi Gene function Gene function Cl Additi Human dis Human dis Cl Additi Gene Gene Cl li li s in s in ick here ick here ick here ick here ick here ick here o o s smple mple t t of of ccur ccur func func f f the G the G onal onal onal onal onal onal u u al al nction nction te te a a l l for for for for for for tion tion cor cor cy cy e e da da da da da da O O list list as as hie hie and cov and cov additi additi additi additi additi additi ta file ta file ta file ta file ta file ta file e gen e gen e e pred pred pre pre of of huma huma r rar ar pre pre d d 1 2 3 4 5 6 i i i ichy chy e e onal dat onal dat onal dat onal dat onal dat onal dat c c c cs w s w ti ti ti ti d d e e n p n p ra ra ons for 85 human gen ons for 85 human gen ons for 55 huma ons for 55 huma i ic ci ite te ge ge th p th p r rote ote d human pr d human pr of GO t of GO t a a a a a a r ri i file file file file file file edic edic n i n in nted p ted p t te e e er r rm p rm p a act ct ote ote r r n dise n dise i iote ote o o r ri in n e e n int n int d d s s i in in n in i ie e c c ase ase ti ti s of unknown s of unknown e e ons at ons at ter ter ra ra g gcti cti e e a anes nes ct ct o o differ differ ions ions ns ns ent ent lev lev- - confidence score, and interactions with a score greater than 0.5 are considered core interactions (the interaction score mainly depends upon the number of times each interaction Acknowledgements We thank the Sanger Institute Web Team for construction of the web was detected, the total number of interactions made by each interface and 
Paul Kersey for providing a list of TrEMBL accessions for yeast protein and the local network clustering [5]). To generate a proteins. B.L. is supported by a Sanger Institute Postdoctoral Fellowship similar subset of yeast protein interactions, we defined core and A.G.F is supported by the Wellcome Trust. yeast protein interactions as those identified more than once by any single assay. Each entry in the core and complete inter- action networks contains the following tab delimited infor- References 1. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lock- mation: Gene 1 Id, Ensembl gene ID for human interaction shon D, Narayan V, Srinivasan M, Pochart P, et al.: A comprehen- partner 1; Gene 1 description, alternative names for human sive analysis of protein-protein interactions in Saccharomyces Gene 1 (from Ensembl); Gene 2 Id, Ensembl gene ID for cerevisiae. Nature 2000, 403:623-627. 2. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A compre- human interaction partner 2; Gene 2 description, alternative hensive two-hybrid analysis to explore the yeast protein names for human Gene 2 (from Ensembl); Source Organism, interactome. Proc Natl Acad Sci USA 2001, 98:4569-4574. 3. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, the model organism protein interaction dataset that predicts Schultz J, Rick JM, Michon AM, Cruciat CM, et al.: Functional organ- this human protein interaction; Ortholog 1, model organism ization of the yeast proteome by systematic analysis of pro- interaction partner 1 from the model organism protein inter- tein complexes. Nature 2002, 415:141-147. 4. 
Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, et al.: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415:180-183.
5. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, et al.: A protein interaction map of Drosophila melanogaster. Science 2003, 302:1727-1736.
6. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, et al.: A map of the interactome network of the metazoan C. elegans. Science 2004, 303:540-543.
7. Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M: Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res 2001, 11:2120-2126.
8. Wojcik J, Boneca IG, Legrain P: Prediction, assessment and validation of protein interaction maps in bacteria. J Mol Biol 2002, 323:763-770.
9. Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 2001, 314:1041-1052.
10. Tong AH, Evangelista M, Parsons AB, Xu H, Bader GD, Page N, Robinson M, Raghibizadeh S, Hogue CW, Bussey H, et al.: Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 2001, 294:2364-2368.
11. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417:399-403.
12. The Sanger Institute: Interaction Map [http://www.sanger.ac.uk/interactionmap]
13. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25-29.
14. Birney E, Andrews D, Bevan P, Caccamo M, Cameron G, Chen Y, Clarke L, Coates G, Cox T, Cuff J, et al.: Ensembl 2004. Nucleic Acids Res 2004, 32 Database issue:D468-D470.
15. Ensembl genome browser [http://www.ensembl.org]
16. Ensembl EnsMart genome browser (Martview) [http://www.ensembl.org/Multi/martview]
17. Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res 2004, 14:160-169.
18. Walhout AJ, Sordella R, Lu X, Hartley JL, Temple GF, Brasch MA, Thierry-Mieg N, Vidal M: Protein interaction mapping in C. elegans using proteins involved in vulval development. Science 2000, 287:116-122.
19. Lehner B, Sanderson CM: A protein interaction framework for human RNA degradation. Genome Res 2004, 14:1315-1323.
20. Suzuki H, Fukunishi Y, Kagawa I, Saito R, Oda H, Endo T, Kondo S, Bono H, Okazaki Y, Hayashizaki Y: Protein-protein interaction panel using mouse full-length cDNAs. Genome Res 2001, 11:1758-1765.
21. Remy I, Galarneau A, Michnick SW: Detection and visualization of protein interactions with protein fragment complementation assays. Methods Mol Biol 2002, 185:447-459.
22. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, et al.: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31:365-370.
23. InParanoid: database of pairwise orthologs [http://inparanoid.cgb.ki.se]
24. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Barrell D, Bateman A, Binns D, Biswas M, Bradley P, Bork P, et al.: The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res 2003, 31:315-318.
25. Online Mendelian Inheritance in Man [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM]
26. Welcome to Blueprint [http://www.blueprint.org/bind/bind.php]
27. Vidal laboratory [http://vidal.dfci.harvard.edu]

action that predicts the human protein interaction; Ortholog 2, model organism interaction partner 2 from the model organism protein interaction that predicts the human protein interaction; and Ortholog score, a confidence score for the human protein interaction based on the likelihood that the two human proteins are the functional orthologs of the two model organism proteins. The score ranges from 0 (no confidence) to 4 (high confidence). The score is calculated as the sum of the Inparanoid confidence scores for each gene orthology assignment. A score of 4 means that both of the human genes and both of the model organism genes are all the main orthologs in their groups of co-orthologs according to Inparanoid. These represent higher confidence human protein interactions.

Description, this field contains the original annotation for the model organism protein interaction; for worm interactions this indicates whether the interaction is in the core dataset of interactions found more than once (CORE_1), or interactions that reconfirmed when retested (CORE_2), or non-core interactions that did not reconfirm (NON_CORE) [6]. For fly interactions this indicates the interaction score. This score mainly depends upon the number of times each interaction was detected, the total number of interactions made by each protein and the local network clustering, see [5] for details. A score >0.5 is considered high confidence. For yeast protein interactions, these are the annotations of von Mering et al. [11] and contain the following information: experimental/computation method (and the number of times the interaction was detected); von Mering et al.'s confidence assignment; and whether the interaction was previously known in the literature. For more information, please see [11].
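As an illustration of the Ortholog score field described for the additional data file, the following minimal Python sketch sums four Inparanoid confidence scores, one for each of the two human genes and the two model organism genes. It assumes that each individual Inparanoid score lies between 0 and 1, with 1 marking the main ortholog in a group of co-orthologs; this is consistent with the stated 0 to 4 range but is not spelled out in the text. The function names `ortholog_score` and `fly_is_high_confidence` are hypothetical and are not part of the published dataset or of any Inparanoid software. The second function applies only the stated rule that a fly interaction score above 0.5 is considered high confidence.

```python
from typing import Sequence


def ortholog_score(inparanoid_scores: Sequence[float]) -> float:
    """Sum the four Inparanoid confidence scores for one predicted interaction.

    Expects one score per gene: two human genes and two model organism genes.
    Assumption (not stated in the paper): each score lies in [0, 1].
    """
    if len(inparanoid_scores) != 4:
        raise ValueError("expected four scores: two human and two model organism genes")
    for score in inparanoid_scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score {score} is outside the assumed range [0, 1]")
    return float(sum(inparanoid_scores))


def fly_is_high_confidence(interaction_score: float) -> bool:
    """Return True when a fly interaction score exceeds the 0.5 threshold."""
    return interaction_score > 0.5


if __name__ == "__main__":
    # All four genes are the main orthologs in their groups.
    print(ortholog_score([1.0, 1.0, 1.0, 1.0]))
    # One human gene and one model organism gene are lower-scoring co-orthologs.
    print(ortholog_score([1.0, 0.5, 1.0, 0.25]))
    print(fly_is_high_confidence(0.62))
```

Under these assumptions, a predicted interaction in which all four genes are main orthologs receives the maximum score of 4, and any lower-scoring co-ortholog reduces the sum accordingly.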

Journal

Genome Biology – Springer Journals

Published: Aug 1, 2004

Keywords: Animal Genetics and Genomics; Human Genetics; Plant Genetics and Genomics; Microbial Genetics and Genomics; Bioinformatics; Evolutionary Biology
