ConSurf‐DB: An accessible repository for the evolutionary conservation patterns of the majority of PDB proteins

... Molecular Biology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel

School of Molecular Cell Biology & Biotechnology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel

Correspondence: Nir Ben-Tal, Department of Biochemistry and Molecular Biology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel. Email: bental@ashtoret.tau.ac.il

Present address: Haim Ashkenazy, Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen 72076, Germany.

Patterns observed by examining the evolutionary relationships among proteins of common origin can reveal the structural and functional importance of specific residue positions. In particular, amino acids that are highly conserved (i.e., their positions evolve at a slower rate than other positions) are particularly likely to be of biological importance, for example, for ligand binding. ConSurf is a bioinformatics tool for accurately estimating the evolutionary rate of each position in a protein family. Here we introduce a new release of ConSurf-DB, a database of precalculated ConSurf evolutionary conservation profiles for proteins of known structure. ConSurf-DB provides high-accuracy estimates of the evolutionary rates of the amino acids in each protein. A reliable estimate of a query protein's evolutionary rates depends on having a sufficiently large number of effective homologues (i.e., nonredundant yet sufficiently similar). With current sequence data, ConSurf-DB covers 82% of the PDB proteins. It will be updated on a regular basis to ensure that coverage remains high, and that it might even increase. Much effort was dedicated to improving the user experience. The repository is available at https://consurfdb.tau.ac.il/.
Broader audience: By comparing a protein to other proteins of similar origin, it is possible to determine the extent to which each amino acid position in the protein evolved slowly or rapidly. A protein's evolutionary profile can provide valuable insights: For example, amino acid positions that are highly conserved (i.e., evolved slowly) are particularly likely to be of structural and/or functional importance, for example, for ligand binding and catalysis. We introduce here a new and improved version of ConSurf-DB, a continually updated database that provides precalculated evolutionary profiles of proteins with known structure.

KEYWORDS
binding site, ConSurf, ConSurf-DB, evolutionary conservation, evolutionary rate, functional importance

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2019 The Authors. Protein Science published by Wiley Periodicals, Inc. on behalf of The Protein Society.

Protein Science. 2020;29:258–267. wileyonlinelibrary.com/journal/pro
BEN CHORIN ET AL.

1 | INTRODUCTION

The explosion of protein sequence data over recent decades has led to the emergence of numerous databases that organize and characterize protein sequences according to biologically relevant features. These databases enable researchers to extract invaluable information on many proteins quickly and inexpensively. About 12 years ago, our group introduced ConSurf-DB, a database aimed at providing researchers with convenient access to evolutionary data for proteins of known structure. Herein, we present a new version of ConSurf-DB.

In general, evolutionary information serves as a powerful tool in studies of protein structure and function, and it is especially useful for identifying residues with important functional roles. In particular, residues that are involved in functions such as ligand binding and catalysis, or that are necessary for maintaining the protein's structure, tend to be evolutionarily conserved, meaning that during protein evolution their positions tend to change more slowly than other positions. This tendency results from the fact that mutations to functionally important residues may compromise the protein's function and/or structural stability and as such are unlikely to be tolerated. Moreover, once the evolutionary rates of a protein's amino acid positions have been calculated, it can be highly informative to map these rates onto the protein's three-dimensional structure: By observing where conserved residues are located within the protein's structure, researchers may be able to predict what the functional roles of these residues might be. Evolutionary information can also guide experimental effort, such as mutagenesis, to confirm such predictions and to decipher the protein's mechanism of action.

The extraction of evolutionary data for a given query protein is based on the comparison of that protein to its homologues, that is, other proteins of a shared evolutionary origin. There are several different approaches and methods that infer evolutionary information from homologues. These are all based on aligning the query and its homologues to each other in a way that maximizes the total similarity in all the amino acid positions, that is, a multiple sequence alignment (MSA). The simplest estimates are based on the consensus approach: For each position, the amino acid that appears in that position in the greatest number of homologues is identified; then, the evolutionary conservation level of that position is determined according to whether the "frequency" of the amino acid (i.e., the proportion of homologues in which it appears) exceeds a predefined consensus threshold. More sophisticated methods estimate conservation using the entropy of each position, calculated from the collective frequencies of the different amino acids that appear in that position [4,5]. These approaches are highly sensitive to the specific selection of homologues used because they do not account for the phylogenetic relationships among homologues. Thus, results obtained using very close homologues may differ substantially from calculations with a more diverse set. To alleviate this problem, tools such as the Evolutionary Trace Viewer and SiteFiNDER|3D use phylogenetic trees, which reflect the evolutionary relationships between the proteins. Explicit consideration of the evolutionary relationships among the homologues helps to reduce inaccuracies caused by uneven sampling in sequence space and decreases the sensitivity to the choice of homologues. Notably, whereas Evolutionary Trace Viewer and SiteFiNDER|3D are based only on sequence information, an alternative tool, FuncPatch [8,9], also accounts for the three-dimensional structure of the protein. This approach is based on a phylogenetic Gaussian process that accounts for three-dimensional correlation of substitution rates in different positions according to the tertiary structure of the protein.

The most commonly used tool for calculating evolutionary rates on the basis of sequence information, while accounting for the phylogenetic tree, is ConSurf [10,11]. In ConSurf, homologues of the query sequence are detected and aligned, a phylogenetic tree is constructed, and the evolutionary rates of all positions in the query protein are then calculated using the Rate4Site program without explicit use of the three-dimensional structure of the protein. Specifically, Rate4Site estimates the evolutionary rates of the amino acids by taking into account the relationships among the homologues and the evolutionary process, as reflected in the phylogenetic tree. Rate4Site also assigns a credibility interval for the evolutionary rates. The conservation grades (derived from the evolutionary rates) are projected onto the corresponding positions in the query sequence, where each position is colored according to a unique color-coding scale ranging from least to most conserved. ConSurf also maps the conservation grades onto the three-dimensional structure of the protein, if available. This step enables the evolutionary information to be integrated with spatial considerations that are visible only from the structure, for example, the location of binding/catalytic sites and ligand-binding positions.

Though ConSurf carries out its calculations relatively quickly, in certain cases (e.g., high-throughput studies involving many proteins) scholars may prefer to get an instant conservation map of a protein's structure, without having to enter specific calculation parameters. ConSurf-DB was introduced to address these cases.

ConSurf-DB is a repository of precalculated evolutionary rates for the protein structures deposited in the Protein Data Bank (PDB) [13,14], the main resource for experimentally determined protein structures. The PDB is constantly growing; it currently contains nearly 150,000 entries representing protein structures (according to http://www.rcsb.org), three times more than it did 12 years ago, when ConSurf-DB was introduced. To accommodate this growth and to adapt the database to recent methodological developments, we have designed a new release of ConSurf-DB. The current version of the database covers 82% of the PDB and will be periodically updated to include new PDB entries, as well as to exploit the flood of sequence data. ConSurf-DB is available as an online website and does not require local installation.

2 | METHODOLOGY FOR CREATION OF THE CONSURF-DB REPOSITORY

The new version of ConSurf-DB is based on a fully automated process that consists of four main steps: downloading and parsing nonredundant PDB entries, collecting sequence homologues and aligning the sequences, calculating evolutionary rates, and finally formatting the results for presentation in the ConSurf-DB website (Figure 1). Separation of these individual steps provides flexibility and modularity, enabling new data (for example, updates to the PDB) and new features to be integrated efficiently. The repository will be updated frequently, where each update involves making calculations for newly added PDB entries, as well as revisiting old PDB entries that were not eligible for inclusion in previous compilations (e.g., because of an insufficient number of homologues). Once a year, the whole database will be reconstructed for the entire PDB, in order to account for new homologues that have become available as a result of growth in sequence data.

FIGURE 1 A flowchart of the pipeline used to construct ConSurf-DB. The pipeline consists of four steps: retrieving PDB entries, homologue detection and building a multiple sequence alignment, estimating evolutionary conservation, and formatting the results

The first step in building ConSurf-DB is retrieving the PDB entries. Each PDB entry can contain one or more protein chains, which are handled separately in ConSurf-DB. In order to overcome the problem of redundancy in the PDB (i.e., more than one structure for a given protein sequence), the chains are extracted from a PISCES file (downloaded from http://dunbrack.fccc.edu) [15,16]. This file contains all nonredundant (unique) chains in FASTA format, where the header of each unique chain lists all redundant chains, that is, chains with 100% sequence identity. After extraction of the unique chains, their sequences, and their identical chains from the file, the unique chains are filtered using the following criteria: "length", "PDB file" and "modifications". The "length" filter eliminates chains containing fewer than 30 residues, as for shorter chains it can be challenging to collect credible homologues and construct a reliable phylogenetic tree. The "PDB file" filter discards chains that do not have a PDB file, either because the entry has become obsolete or because they are too large (containing 100,000 atoms or more). Large structures are deposited in the PDB only using the mmCIF format, which the ConSurf-DB pipeline cannot handle yet (though it soon will). Finally, the "modifications" filter handles the chains that contain nonstandard amino acids. Each such amino acid is modified to its closest neighbor among the standard amino acids, and if the fraction of these modified residues in the chain exceeds 15%, the chain is filtered out. In any case, the modifications are saved to the chain's data. Following this initial filtration, a directory containing the input data is constructed in the repository for each of the remaining unique chains, and they are associated with their sequences and identical chains. Thus, each unique chain's calculations can easily be mapped to the structures of all its identical chains.

The second step is searching for sequence homologues in UniRef90 [17,18], a clustered version of the UniProt database [19,20]. This is done using one iteration of the homologue search tool HMMER [21,22] with an E-value threshold of 0.0001. The candidate homologues retrieved by HMMER for a certain chain are further filtered according to the following three parameters: (a) sequence identity: first, sequences identical to the query by over 95% are discarded to reduce error due to sample bias; (b) sequence coverage: sequence homologues that cover below 60% of the query protein are filtered; and (c) maximum overlap among homologues: some homologous sequences may overlap. In this case, if the overlap is greater than 10%, the highest scoring homologue is chosen, and the others are discarded. After this filtration process, chains with fewer than 50 homologues are eliminated. In ConSurf, the minimum number of homologues required to calculate evolutionary rates is five; here, we adopt a higher threshold with the aim of ensuring that the estimated evolutionary rates included in ConSurf-DB are more robust. Next, cluster database at high identity with tolerance (CD-HIT) [23,24] removes any redundant homologues with a threshold of 95%. If there are more than 50 homologues after the CD-HIT filtration process, the remaining homologues are sorted by their E value in ascending order, in line with the principle that the lower the E value the more significant the resemblance between the homologue and the query protein. A maximum of 300 homologues are sampled uniformly from the sorted list to create the final list of homologues of the query protein. This is also a higher threshold in comparison to the default threshold used in ConSurf (150 homologues); again, the aim is to increase the robustness of the results. Finally, an MSA of the homologues is constructed using the MAFFT-LINSi procedure [25,26].

The third step is estimating the evolutionary rate at each amino acid position. To this end, the MSA is first used to infer the best amino acid substitution model. This model essentially describes the evolution of the amino acids. Several such models are considered, including the following: JTT [28], LG [29], Dayhoff [30], WAG [31], mtREV [32], and cpREV [33]. Next, a phylogenetic tree is built from the MSA with the Neighbor-Joining method, implemented in Rate4Site. Finally, Rate4Site assigns an evolutionary rate to each position in the query sequence, based on the phylogenetic tree and the substitution model, and using an empirical Bayesian methodology. The evolutionary rates are normalized around zero, where rapidly evolving (variable) positions are assigned positive values and slowly evolving (conserved) positions are assigned negative values. In addition, a confidence interval, estimated using the empirical Bayesian method [36], which represents the extent of credibility of the estimated evolutionary rate, is assigned to each position. Finally, the evolutionary rates are categorized into discrete conservation grades, ranging from 1 to 9, where 1 represents the most highly variable residue positions, 5 represents positions of intermediate conservation, and 9 represents the most highly conserved positions. These grades are then mapped to nine colors, providing a clear and intuitive means of visualizing the conserved and variable regions in the protein. Positions that are assigned grades with low confidence are treated as a separate, tenth, category.

The final step is formatting and visually representing the data, to make the information accessible and user friendly. The conservation grades (colors) are mapped onto the three-dimensional structure of the query protein, which can be viewed using the NGL viewer [37,38] or FirstGlance in Jmol. This visualization is highly enlightening because it emphasizes the important, evolutionarily conserved regions of the protein. The colors are also projected on the query sequence and on the MSA. Moreover, session files presenting the protein structure, colored according to the conservation grades, are created using the PyMOL [40] and UCSF Chimera [41] programs. All visual results are available in two color scales: the default color scale, which is cyan-through-maroon, and the color-blind friendly color scale, which is green-through-purple. These color scales correspond to variable (Grade 1)-through-conserved (Grade 9). Positions with low reliability according to the confidence interval are colored in light yellow in both color scales. Additional nonvisual data are also available to users, as well as links to related sources of information such as PDBsum [42,43] and Proteopedia [44,45]. The repository can be accessed through a website, available at https://consurfdb.tau.ac.il/. To view the results, users need only to provide the PDB ID or sequence of the query protein.

3 | NEW FEATURES

3.1 | Homologue detection using HMMER

In previous releases of ConSurf-DB, the homologues of the query protein were collected using PSI-BLAST. Yet, new sequence search methodologies have developed in recent years, to keep pace with the continuous increase in the number of protein sequences. In the new release of ConSurf-DB (as well as in ConSurf itself), homologues are collected using the more advanced HMMER algorithm. HMMER implements probabilistic inference using profile hidden Markov models. Given a query sequence x and a target sequence y, BLAST calculates the score of the optimal alignment of x and y, whereas HMMER calculates a score that is the sum of scores of all possible alignments of x and y. Because HMMER uses a heuristic acceleration algorithm, it remains similar in speed to BLAST, but with a better rate of correctly detected homologues and a much lower rate of falsely detected hits. Implementation of HMMER in the new release of ConSurf-DB has improved homologue identification.

3.2 | Batch download

Since results are precalculated in ConSurf-DB, we can provide results for several protein structures in a single download. This feature, which was not included in previous versions of ConSurf-DB, is now available on our homepage, and users can access it by uploading a list of desired chains (where each chain appears on a new line).

3.3 | Improved visualization

I. Improving the color scales. In this release of ConSurf-DB, the colors, both in the default and color-blind scales, were refined to allow better distinction between the different conservation grades.

II. Providing PyMOL session files for high-resolution figures. PyMOL is a popular molecular visualization program; it contains various functions that enable users to analyze three-dimensional structures of proteins (e.g., show hydrogen bonds, calculate electrostatic potential), and it can also be used to create high-resolution images of the viewed protein. In previous versions of ConSurf and ConSurf-DB, users were provided with a modified PDB file of their protein, which contained the conservation grades in the temperature factor column. Using this file and a provided script, users were able to color the protein according to its calculated conservation grades. In this version, we provide a complete PyMOL session file, in which the query protein is already colored according to conservation. To create a high-resolution image, the user needs only to open the file with PyMOL and save it as a figure. While working on this feature, we discovered and fixed some issues with the coloring script. We therefore recommend that users who prefer to construct their own ConSurf figures download the revised files provided in this version.

III. Color-blind presentation option for all visual results. In earlier releases of ConSurf-DB, the visual results were presented using only the default conservation color scale. From this version on, all visual results will be available in both the default and the color-blind scales, both for viewing directly and for downloading. The color-blind display can be selected in the homepage, when running a query, or alternatively, in the results page, by clicking a button that enables switching between the two displays.

IV. Supporting the NGL viewer. The page of each entry now includes a visualization of the three-dimensional structure using the NGL viewer. This viewer is very fast and provides many features, such as zooming in on the interactions of the query protein with its cognate ligand, thus highlighting important biological information.

3.4 | Improvements in design and user experience

The new release of ConSurf-DB is considerably more user friendly than the previous release and includes many improvements in the user interface and user experience. In terms of the query process, for example, the list of protein chains is presented in a drop-down menu in the homepage, instead of on a new page. In terms of technical support, a contact form is now available to improve our communication with users. We encourage our users to write, and we would appreciate any feedback.

Moreover, in this version of ConSurf-DB, we present a new design for the website, which should improve clarity of presentation and ease of use. For example, in the new results page, the order of the results is determined by anticipated importance and usefulness, making it easier for users to find what they need. In addition, the names of the result files are much more intuitive and informative, and users can further access a README file that provides detailed information for all results. Finally, the running parameters of ConSurf-DB are presented in the results page, for the user's convenience.

4 | CONSURF-DB IN NUMBERS

The statistics for this version of ConSurf-DB are presented in Table 1. ConSurf-DB was built on the basis of a PISCES file containing 108,958 nonredundant protein chains from the PDB (at 100% sequence identity threshold); the PISCES file was updated in September 2019. Of this initial set, we filtered 7,054 chains shorter than 30 amino acids, 4,629 chains from large structures, and 210 chains with more than 15% modified amino acids, which, as explained above, are not suitable for the calculation. A total of 97,065 nonredundant chains remained after this initial filtration. The homologue search for each of these chains was performed using HMMER v3.2.1 against UniProt/UniRef90 release 07-2019. The homologues were filtered by thresholds and using CD-HIT v4.7 and were then aligned using MAFFT v7.419. The build process was carried out using 150–200 CPUs, with an average CPU time of roughly 15 min per chain. For 7,363 of the 97,065 chains, we failed to find at least 50 homologues and aborted calculation.

In aggregate, as of November 2019, ConSurf-DB covers 89,702 of the 108,958 unique protein chains in the PDB, that is, coverage of 82%, corresponding to a total of 365,218 chains. The vast majority of the calculations are based on large MSAs of 201–300 homologous proteins.

TABLE 1 Statistics of ConSurf-DB

PDB chains
  Total chains found                                    473,197
  Total nonredundant chains found                       108,958
  Filtered
    Chains shorter than 30 amino acids                    7,054
    Chains with large structures                          4,629
    Chains with more than 15% modified residues             210
  Total chains post-initial filtration                  389,863
  Total nonredundant chains post-initial filtration      97,065

MSA sizes
  Chains with fewer than 50 homologues                    7,363
  MSAs created
    Chains with 50–100 homologues                         3,238
    Chains with 101–200 homologues                        4,978
    Chains with 201–300 homologues                       81,486
  Total chains processed                                 89,702

Note: Currently, the database covers 89,702 of the 108,958 protein chains in the nonredundant set, that is, 82%.

5 | EXAMPLES OF APPLICATIONS OF EVOLUTIONARY DATA: ACTIVE SITE ANALYSIS IN ENZYMES AND ANTIBODIES

As discussed above, data regarding the degree of conservation of each position in a protein can be used to predict the biological significance of specific positions, as functionally important positions tend to be more evolutionarily conserved compared with other positions. The high conservation of functional positions results from negative selection on mutations in these positions, as such mutations may result in loss of function.

In enzymes, mutations to catalytic residues are particularly unlikely to be tolerated, as each of these residues is engaged in a very specific function during catalysis (Figure 2). Other residues in the active site may determine the specificity of the enzyme to its cognate substrate. That is, the residues in these positions allow an enzyme to bind and act only on a certain substrate. Such positions are called specificity-determining positions (SDPs). Notably, different forms of a given enzyme (e.g., equivalent enzymes from different species or organs) may have different residues in these positions and thus bind different substrates. Accordingly, SDPs tend to be somewhat less evolutionarily conserved than catalytic positions, which are, in essence, invariant [46,47].

Such a phenomenon can be seen in aminotransferases (also called transaminases), a large group of enzymes that act on different substrates, such as the amino acids alanine, ornithine, aspartate, cysteine, and glutamate [48,49]. Figure 2 shows the conservation patterns of three positions in ornithine-aminotransferase (Or-AT), a member of the (S)-selective ω-aminotransferase enzyme family. The enzyme in this structure is bound to an inhibitor that resembles the substrate. Most of the Or-AT positions around the inhibitor–cofactor conjugate (including the principal catalytic positions, 235 and 292) are highly conserved. Replacement of these positions by mutagenesis is likely to result in considerable loss of enzymatic activity. However, when position 85, which is also in the binding pocket of Or-AT, is replaced, the enzyme remains active yet changes its substrate preference considerably. This suggests that position 85 is an SDP. Indeed, though position 85 is evolutionarily conserved, its conservation grade is lower than the conservation grades of the catalytic positions in the binding pocket. For example, in γ-aminobutyrate-aminotransferase, another member of the ω-aminotransferase family, this position is populated by isoleucine instead of tyrosine, the equivalent residue in Or-AT.

FIGURE 2 Conservation of catalytic and specificity-determining positions (SDPs) in the active site of Or-AT (PDB entry 2oat). (a) Ornithine-aminotransferase, colored by conservation grade and shown in surface representation, together with the inhibitor–cofactor (pyridoxal phosphate) conjugate, colored by atom type and shown as spheres. (b) The catalytic and suspected specificity-determining positions of ornithine-aminotransferase are shown as sticks and colored by conservation grade. For clarity, the backbone of the enzyme is not shown

The decreased conservation of SDPs is particularly pronounced in antibodies. This is because each antibody binds a different substrate and therefore uses different residues in the equivalent substrate-binding positions. The SDPs in antibodies are located in the hypervariable region, at the tip of each "arm" of the antibody (Figure 3). The "stem" of this structure, referred to as the constant region, is similar in many antibodies, and it is therefore more evolutionarily conserved.

FIGURE 3 The conservation pattern of an antibody (PDB entry 1igt). A cartoon representation of an antibody colored according to evolutionary conservation. The constant and hypervariable regions in the structure are annotated. The antigen-binding region (CDR loops) is shown as spheres

Identifying SDPs in an enzyme or antibody is not trivial and requires knowledge of the specific positions interacting with each substrate in each form of the protein. Obtaining this knowledge requires either knowing the three-dimensional structure of the different proteins bound to their cognate substrates or biochemical data (e.g., data obtained from mutagenesis experiments) that implicate specific positions in selective substrate binding. The above examples suggest that evolutionary information, which can be obtained quickly and easily using computational tools such as ConSurf and ConSurf-DB, not only may help researchers pinpoint functionally important positions in proteins but also may help to differentiate between subclasses of such positions (e.g., catalytic positions vs. SDPs).

6 | CONCLUSIONS

Evolutionary information can be used to obtain valuable insights regarding the structure and function of a query protein, and in particular, it can highlight biologically important regions. ConSurf-DB provides such evolutionary information instantly and efficiently for the majority of the proteins included in the PDB. The results are highly robust because particularly stringent thresholds were used in constructing the database. ConSurf-DB will be periodically updated to keep up with the rapid increase in sequence and structure data.

ACKNOWLEDGMENTS

The research was supported by Grant 450/16 of the Israeli Science Foundation (ISF). NB-T's research is supported in part by the Abraham E. Kazan Chair in Structural Biology, Tel Aviv University.

ORCID

Nir Ben-Tal https://orcid.org/0000-0001-6901-832X

REFERENCES

1. Goldenberg O, Erez E, Nimrod G, Ben-Tal N. The ConSurf-DB: Pre-calculated evolutionary conservation profiles of protein structures. Nucleic Acids Res. 2009;37:D323–D327.
2. Kessel A, Ben-Tal N. Introduction to proteins: Structure, function, and motion. 2nd ed. Boca Raton, FL: Chapman and Hall/CRC (Taylor & Francis Group), 2018.
3. Valdar WS. Scoring residue conservation. Proteins. 2002;48:227–241.
4. Mihalek I, Res I, Lichtarge O. A family of evolution-entropy hybrid methods for ranking protein residues by importance. J Mol Biol. 2004;336:1265–1282.
5. Capra JA, Singh M. Predicting functionally important residues from sequence conservation. Bioinformatics. 2007;23:1875–1882.
6. Morgan DH, Kristensen DM, Mittelman D, Lichtarge O. ET viewer: An application for predicting and visualizing functional sites in protein structures. Bioinformatics. 2006;22:2049–2050.
7. Innis CA. siteFiNDER|3D: A web-based tool for predicting the location of functional sites in proteins. Nucleic Acids Res. 2007;35:W489–W494.
8. Huang YF, Golding GB. Phylogenetic Gaussian process model for the inference of functionally important regions in protein tertiary structures. PLoS Comput Biol. 2014;10:e1003429.
9. Huang YF, Golding GB. FuncPatch: A web server for the fast Bayesian inference of conserved functional patches in protein 3D structures. Bioinformatics. 2015;31:523–531.
10. Glaser F, Pupko T, Paz I, et al. ConSurf: Identification of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics. 2003;19:163–164.
11. Ashkenazy H, Abadi S, Martz E, et al. ConSurf 2016: An improved methodology to estimate and visualize evolutionary conservation in macromolecules. Nucleic Acids Res. 2016;44:W344–W350.
12. Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N. Rate4Site: An algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics. 2002;18:S71–S77.
13. Berman HM, Westbrook J, Feng Z, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242.
14. Burley SK, Berman HM, Bhikadiya C, et al. RCSB Protein Data Bank: Biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy. Nucleic Acids Res. 2019;47:D464–D474.
15. Wang G, Dunbrack RL. PISCES: A protein sequence culling server. Bioinformatics. 2003;19:1589–1591.
16. Wang G, Dunbrack RL. PISCES: Recent improvements to a PDB sequence culling server. Nucleic Acids Res. 2005;33:W94–W98.
17. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: Comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23:1282–1288.
18. Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH, UniProt Consortium. UniRef clusters: A comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31:926–932.
19. Apweiler R, Bairoch A, Wu CH, et al. UniProt: The universal protein knowledgebase. Nucleic Acids Res. 2004;32:D115–D119.
20. UniProt Consortium. UniProt: A worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47:D506–D515.
25. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002;30:3059–3066.
26. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol Biol Evol. 2013;30:772–780.
27. Darriba D, Taboada GL, Doallo R, Posada D. ProtTest 3: Fast selection of best-fit models of protein evolution. Bioinformatics. 2011;27:1164–1165.
28. Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. Bioinformatics. 1992;8:275–282.
29. Le SQ, Gascuel O. An improved general amino acid replacement matrix. Mol Biol Evol. 2008;25:1307–1320.
30. Dayhoff MO, Schwartz RM, Orcutt BC. A model of evolutionary change in proteins. In: Dayhoff M, editor. Atlas of protein sequence and structure. Washington, D.C.: National Biomedical Research Foundation, 1978; p. 345–352.
31. Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 2001;18:691–699.
32. Adachi J, Hasegawa M. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol. 1996;42:459–468.
33. Adachi J, Waddell PJ, Martin W, Hasegawa M. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J Mol Evol. 2000;50:348–358.
34. Saitou N, Nei M. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4:406–425.
35. Mayrose I, Graur D, Ben-Tal N, Pupko T. Comparison of site-specific rate-inference methods for protein sequences: Empirical Bayesian methods are superior. Mol Biol Evol. 2004;21:1781–1791.
36. Susko E, Inagaki Y, Field C, Holder ME, Roger AJ. Testing for differences in rates-across-sites distributions in phylogenetic subtrees. Mol Biol Evol. 2002;19:1514–1523.
37. Rose AS, Hildebrand PW. NGL viewer: A web application for molecular visualization. Nucleic Acids Res. 2015;43:W576–W579.
38. Rose AS, Bradley AR, Valasatava Y, Duarte JM, Prlić A, Rose PW. NGL viewer: Web-based molecular graphics for large complexes. Bioinformatics. 2018;34:3755–3758.
39. Martz E (2005) FirstGlance in Jmol. Available from: http://firstglance.jmol.org/.
40. Schrödinger LLC (2015) The PyMOL Molecular Graphics System, Version 2.3.3.
41. Pettersen EF, Goddard TD, Huang CC, et al. UCSF Chimera: A visualization system for exploratory research and analysis. J Comput Chem. 2004;25:1605–1612.
21. Eddy SR. A new generation of homology search tools based on 42.
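The homologue selection thresholds described in Section 2 (E value of at most 0.0001, at most 95% identity to the query, at least 60% coverage of the query, at least 50 homologues, and at most 300 homologues sampled from the E-value-sorted list) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the ConSurf-DB pipeline code: the `Hit` record and the function names are hypothetical, the overlap filter and the CD-HIT clustering step are left out, and "sampled uniformly" is interpreted here as taking evenly spaced ranks from the sorted list.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; whether each bound is inclusive is an assumption.
E_VALUE_MAX = 1e-4      # HMMER E-value threshold
IDENTITY_MAX = 0.95     # hits more than 95% identical to the query are discarded
COVERAGE_MIN = 0.60     # hits covering less than 60% of the query are discarded
MIN_HOMOLOGUES = 50     # chains with fewer homologues are not processed
MAX_HOMOLOGUES = 300    # upper bound on the final homologue list


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one candidate homologue (not a HMMER format)."""
    name: str
    e_value: float
    identity: float   # fraction of residues identical to the query, 0..1
    coverage: float   # fraction of the query covered by the hit, 0..1


def passes_thresholds(hit: Hit) -> bool:
    """Apply the E-value, identity and coverage filters to a single hit."""
    return (
        hit.e_value <= E_VALUE_MAX
        and hit.identity <= IDENTITY_MAX
        and hit.coverage >= COVERAGE_MIN
    )


def sample_uniformly(sorted_hits, k):
    """Take k hits at evenly spaced ranks (assumed meaning of 'uniformly')."""
    n = len(sorted_hits)
    if n <= k:
        return list(sorted_hits)
    return [sorted_hits[(i * n) // k] for i in range(k)]


def select_homologues(hits):
    """Return the final homologue list, or None if the chain has too few hits."""
    kept = [h for h in hits if passes_thresholds(h)]
    if len(kept) < MIN_HOMOLOGUES:
        return None
    kept.sort(key=lambda h: h.e_value)   # ascending: most significant first
    return sample_uniformly(kept, MAX_HOMOLOGUES)
```

Sampling across the whole sorted list, rather than keeping only the top-ranked hits, retains both close and remote homologues, which appears to be the intent of sampling from the E-value-ordered list.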
Laskowski RA, Hutchinson EG, Michie AD, Wallace AC, probabilistic inference. Genome Inform. 2009;23:205–211. Jones ML, Thornton JM. PDBsum: A web-based database of 22. Mistry J, Finn RD, Eddy SR, Bateman A, Punta M. Challenges summaries and analyses of all PDB structures. Trends Biochem in homology search: HMMER3 and convergent evolution of Sci. 1997;22:488–490. coiled-coil regions. Nucleic Acids Res. 2013;41:e121. 43. Laskowski RA, Jabłonska  J, Pravda L, Vařeková RS, 23. Li W, Godzik A. Cd-hit: A fast program for clustering and com- Thornton JM. PDBsum: Structural summaries of PDB entries. paring large sets of protein or nucleotide sequences. Bioinfor- Protein Sci. 2018;27:129–134. 44. Hodis E, Prilusky J, Martz E, Silman I, Moult J, Sussman JL. matics. 2006;22:1658–1659. 24. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: Accelerated for clus- Proteopedia—A scientific ‘wiki’ bridging the rift tering the next-generation sequencing data. Bioinformatics. between three-dimensional structure and function of bio- 2012;28:3150–3152. macromolecules. Genome Biol. 2008;9:R121. BEN CHORIN ET AL. 267 45. Hodis E, Prilusky J, Sussman JL. Proteopedia: A collaborative, 50. Markova M, Peneff C, Hewlins MJE, Schirmer T, John RA. virtual 3D web-resource for protein and biomolecule structure Determinants of substrate specificity in omega-aminotransfer- and function. Biochem Mol Biol Educ. 2010;38:341–342. ases. J Biol Chem. 2005;280:36409–36416. 46. Gerlt JA, Babbitt PC. Mechanistically diverse enzyme super- families: The importance of chemistry in the evolution of catal- ysis. Curr Opin Chem Biol. 1998;2:607–612. How to cite this article: Ben Chorin A, 47. Todd AE, Orengo CA, Thornton JM. Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol. Masrati G, Kessel A, et al. ConSurf-DB: An 2001;307:1113–1143. accessible repository for the evolutionary 48. Nelson DL, Cox M. Lehninger principles of biochemistry. 
5th conservation patterns of the majority of PDB ed. New York, NY: W.H. Freeman, 2008. proteins. Protein Science. 2020;29:258–267. https:// 49. Eliot AC, Kirsch JF. Pyridoxal phosphate enzymes: Mechanis- doi.org/10.1002/pro.3779 tic, structural, and evolutionary considerations. Annu Rev Bio- chem. 2004;73:383–415. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Protein Science : A Publication of the Protein Society Pubmed Central

ConSurf‐DB: An accessible repository for the evolutionary conservation patterns of the majority of PDB proteins

Protein Science: A Publication of the Protein Society, Volume 29 (1) – Nov 22, 2019



Publisher
Pubmed Central
Copyright
© 2019 The Authors. Protein Science published by Wiley Periodicals, Inc. on behalf of The Protein Society.
ISSN
0961-8368
eISSN
1469-896X
DOI
10.1002/pro.3779

Abstract

Patterns observed by examining the evolutionary relationships among proteins of common origin can reveal the structural and functional importance of specific residue positions. In particular, amino acids that are highly conserved (i.e., their positions evolve at a slower rate than other positions) are particularly likely to be of biological importance, for example, for ligand binding. ConSurf is a bioinformatics tool for accurately estimating the evolutionary rate of each position in a protein family. Here we introduce a new release of ConSurf-DB, a database of precalculated ConSurf evolutionary conservation profiles for proteins of known structure. ConSurf-DB provides high-accuracy estimates of the evolutionary rates of the amino acids in each protein. A reliable estimate of a query protein's evolutionary rates depends on having a sufficiently large number of effective homologues (i.e., nonredundant yet sufficiently similar). With current sequence data, ConSurf-DB covers 82% of the PDB proteins. It will be updated on a regular basis to ensure that coverage remains high, and that it might even increase. Much effort was dedicated to improving the user experience. The repository is available at https://consurfdb.tau.ac.il/.

Broader audience: By comparing a protein to other proteins of similar origin, it is possible to determine the extent to which each amino acid position in the protein evolved slowly or rapidly. A protein's evolutionary profile can provide valuable insights: For example, amino acid positions that are highly conserved (i.e., evolved slowly) are particularly likely to be of structural and/or functional importance, for example, for ligand binding and catalysis. We introduce here a new and improved version of ConSurf-DB, a continually updated database that provides precalculated evolutionary profiles of proteins with known structure.

KEYWORDS
binding site, ConSurf, ConSurf-DB, evolutionary conservation, evolutionary rate, functional importance

Affiliations
Molecular Biology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
School of Molecular Cell Biology & Biotechnology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel

Correspondence
Nir Ben-Tal, Department of Biochemistry and Molecular Biology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel. Email: bental@ashtoret.tau.ac.il

Present address
Haim Ashkenazy, Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen 72076, Germany.

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

Protein Science. 2020;29:258–267.

1 | INTRODUCTION

The explosion of protein sequence data over recent decades has led to the emergence of numerous databases that organize and characterize protein sequences according to biologically relevant features. These databases enable researchers to extract invaluable information on many proteins quickly and inexpensively. About 12 years ago, our group introduced ConSurf-DB, a database aimed at providing researchers with convenient access to evolutionary data for proteins of known structure. Herein, we present a new version of ConSurf-DB.

In general, evolutionary information serves as a powerful tool in studies of protein structure and function, and it is especially useful for identifying residues with important functional roles. In particular, residues that are involved in functions such as ligand binding and catalysis, or that are necessary for maintaining the protein's structure, tend to be evolutionarily conserved, meaning that during protein evolution their positions tend to change more slowly than other positions. This tendency results from the fact that mutations to functionally important residues may compromise the protein's function and/or structural stability and as such are unlikely to be tolerated. Moreover, once the evolutionary rates of a protein's amino acid positions have been calculated, it can be highly informative to map these rates onto the protein's three-dimensional structure: By observing where conserved residues are located within the protein's structure, researchers may be able to predict what the functional roles of these residues might be. Evolutionary information can also guide experimental effort, such as mutagenesis, to confirm such predictions and to decipher the protein's mechanism of action.

The extraction of evolutionary data for a given query protein is based on the comparison of that protein to its homologues, that is, other proteins of a shared evolutionary origin. There are several different approaches and methods that infer evolutionary information from homologues. These are all based on aligning the query and its homologues to each other in a way that maximizes the total similarity in all the amino acid positions, that is, a multiple sequence alignment (MSA). The simplest estimates are based on the consensus approach: For each position, the amino acid that appears in that position in the greatest number of homologues is identified; then, the evolutionary conservation level of that position is determined according to whether the "frequency" of the amino acid (i.e., the proportion of homologues in which it appears) exceeds a predefined consensus threshold. More sophisticated methods estimate conservation using the entropy of each position, calculated from the collective frequencies of the different amino acids that appear in that position [4,5]. These approaches are highly sensitive to the specific selection of homologues used because they do not account for the phylogenetic relationships among homologues. Thus, results obtained using very close homologues may differ substantially from calculations with a more diverse set. To alleviate this problem, tools such as the Evolutionary Trace Viewer and SiteFiNDER|3D use phylogenetic trees, which reflect the evolutionary relationships between the proteins. Explicit consideration of the evolutionary relationships among the homologues helps to reduce inaccuracies caused by uneven sampling in sequence space and decreases the sensitivity to the choice of homologues. Notably, whereas Evolutionary Trace Viewer and SiteFiNDER|3D are based only on sequence information, an alternative tool, FuncPatch [8,9], also accounts for the three-dimensional structure of the protein. This approach is based on a phylogenetic Gaussian process that accounts for three-dimensional correlation of substitution rates in different positions according to the tertiary structure of the protein.

The most commonly used tool for calculating evolutionary rates on the basis of sequence information, while accounting for the phylogenetic tree, is ConSurf [10,11]. In ConSurf, homologues of the query sequence are detected and aligned, a phylogenetic tree is constructed, and the evolutionary rates of all positions in the query protein are then calculated using the Rate4Site program without explicit use of the three-dimensional structure of the protein. Specifically, Rate4Site estimates the evolutionary rates of the amino acids, by taking into account the relationships among the homologues and the evolutionary process, as reflected in the phylogenetic tree. Rate4Site also assigns a credibility interval for the evolutionary rates. The conservation grades (derived from the evolutionary rates) are projected onto the corresponding positions in the query sequence, where each position is colored according to a unique color-coding scale ranging from least to most conserved. ConSurf also maps the conservation grades onto the three-dimensional structure of the protein, if available. This step enables the evolutionary information to be integrated with spatial considerations that are visible only from the structure, for example, the location of binding/catalytic sites and ligand-binding positions.

Though ConSurf carries out its calculations relatively quickly, in certain cases (e.g., high-throughput studies involving many proteins) scholars may prefer to get an instant conservation map of a protein's structure, without having to enter specific calculation parameters. ConSurf-DB was introduced to address these cases.

ConSurf-DB is a repository of precalculated evolutionary rates for the protein structures deposited in the Protein Data Bank (PDB) [13,14], the main resource for experimentally determined protein structures. The PDB is constantly growing; it currently contains nearly 150,000 entries representing protein structures (according to http://www.rcsb.org), three times more than it did 12 years ago, when ConSurf-DB was introduced. To accommodate this growth and to adapt the database to recent methodological developments, we have designed a new release of ConSurf-DB. The current version of the database covers 82% of the PDB and will be periodically updated to include new PDB entries, as well as to exploit the flood of sequence data. ConSurf-DB is available as an online website and does not require local installation.

2 | METHODOLOGY FOR CREATION OF THE CONSURF-DB REPOSITORY

The new version of ConSurf-DB is based on a fully automated process that consists of four main steps: downloading and parsing nonredundant PDB entries, collecting sequence homologues and aligning the sequences, calculating evolutionary rates, and finally formatting the results for presentation in the ConSurf-DB website (Figure 1). Separation of these individual steps provides flexibility and modularity, enabling new data (for example, updates to the PDB) and new features to be integrated efficiently. The repository will be updated frequently, where each update involves making calculations for newly added PDB entries, as well as revisiting old PDB entries that were not eligible for inclusion in previous compilations (e.g., because of an insufficient number of homologues). Once a year, the whole database will be reconstructed for the entire PDB, in order to account for new homologues that have become available as a result of growth in sequence data.

The first step in building ConSurf-DB is retrieving the PDB entries. Each PDB entry can contain one or more protein chains, which are handled separately in ConSurf-DB. In order to overcome the problem of redundancy in the PDB (i.e., more than one structure for a given protein sequence), the chains are extracted from a PISCES file [15,16] (downloaded from http://dunbrack.fccc.edu). This file contains all nonredundant (unique) chains in FASTA format, where the header of each unique chain lists all redundant chains, that is, chains with 100% sequence identity. After extraction of the unique chains, their sequences, and their identical chains from the file, the unique chains are filtered using the following criteria: "length", "PDB file" and "modifications". The "length" filter eliminates chains containing fewer than 30 residues, as for shorter chains it can be challenging to collect credible homologues and construct a reliable phylogenetic tree. The "PDB file" filter discards chains that do not have a PDB file, either because the entry has become obsolete or because they are too large (containing 100,000 atoms or more). Large structures are deposited in the PDB only using the mmCIF format, which the ConSurf-DB pipeline cannot handle yet (though it soon will). Finally, the "modifications" filter handles the chains that contain nonstandard amino acids. Each such amino acid is modified to its closest neighbor among the standard amino acids, and if the fraction of these modified residues in the chain exceeds 15%, the chain is filtered out. In any case, the modifications are saved to the chain's data. Following this initial filtration, a directory containing the input data is constructed in the repository for each of the remaining unique chains, and they are associated with their sequences and identical chains. Thus, each unique chain's calculations can easily be mapped to the structures of all its identical chains.

The second step is searching for sequence homologues in UniRef90 [17,18], a clustered version of the UniProt database [19,20]. This is done using one iteration of the homologue search tool HMMER [21,22] with an E-value threshold of 0.0001. The candidate homologues retrieved by HMMER for a certain chain are further filtered according to the following three parameters: (a) sequence identity: first, sequences identical to the query by over 95% are discarded to reduce error due to sample bias; (b) sequence coverage: sequence homologues that cover below 60% of the query protein are filtered; and (c) maximum overlap among homologues: some homologous sequences may overlap. In this case, if the overlap is greater than 10%, the highest scoring homologue is chosen, and the others are discarded. After this filtration process, chains with fewer than 50 homologues are eliminated. In ConSurf, the minimum number of homologues required to calculate evolutionary rates is five; here, we adopt a higher threshold with the aim of ensuring that the estimated evolutionary rates included in ConSurf-DB are more robust. Next, cluster database at high identity with tolerance (CD-HIT) [23,24] removes any redundant homologues with a threshold of 95%. If there are more than 50 homologues after the CD-HIT filtration process, the remaining homologues are sorted by their E value in ascending order, in line with the principle that the lower the E value the more significant the resemblance between the homologue and the query protein. A maximum of 300 homologues are sampled uniformly from the sorted list to create the final list of homologues of the query protein. This is also a higher threshold in comparison to the default threshold used in ConSurf (150 homologues); again, the aim is to increase the robustness of the results. Finally, an MSA of the homologues is constructed using the MAFFT-LINSi procedure [25,26].

The third step is estimating the evolutionary rate at each amino acid position. To this end, the MSA is first used to infer the best amino acid substitution model. This model essentially describes the evolution of the amino acids. Several such models are considered, including the following: JTT [28], LG [29], Dayhoff [30], WAG [31], mtREV [32], and cpREV [33]. Next, a phylogenetic tree is built from the MSA with the Neighbor-Joining method, implemented in Rate4Site. Finally, Rate4Site assigns an evolutionary rate to each position in the query sequence, based on the phylogenetic tree and the substitution model, and using an empirical Bayesian methodology. The evolutionary rates are normalized around zero, where rapidly evolving (variable) positions are assigned positive values and slowly evolving (conserved) positions are assigned negative values. In addition, a confidence interval, estimated using the empirical Bayesian method [36], which represents the extent of credibility of the estimated evolutionary rate, is assigned to each position. Finally, the evolutionary rates are categorized into discrete conservation grades, ranging from 1 to 9, where 1 represents the most highly variable residue positions, 5 represents positions of intermediate conservation, and 9 represents the most highly conserved positions. These grades are then mapped to nine colors, providing a clear and intuitive means of visualizing the conserved and variable regions in the protein. Positions that are assigned grades with low confidence are treated as a separate, tenth, category.

The final step is formatting and visually representing the data, to make the information accessible and user friendly. The conservation grades (colors) are mapped onto the three-dimensional structure of the query protein, which can be viewed using the NGL viewer [37,38] or FirstGlance in Jmol. This visualization is highly enlightening because it emphasizes the important, evolutionarily conserved regions of the protein. The colors are also projected on the query sequence and on the MSA. Moreover, session files presenting the protein structure, colored according to the conservation grades, are created using the PyMOL [40] and UCSF Chimera [41] programs. All visual results are available in two color scales: the default color scale, which is cyan-through-maroon, and the color-blind friendly color scale, which is green-through-purple. These color scales correspond to variable (Grade 1)-through-conserved (Grade 9). Positions with low reliability according to the confidence interval are colored in light yellow in both color scales. Additional nonvisual data are also available to users, as well as links to related sources of information such as PDBsum [42,43] and Proteopedia [44,45]. The repository can be accessed through a website, available at https://consurfdb.tau.ac.il/. To view the results, users need only to provide the PDB ID or sequence of the query protein.

FIGURE 1 A flowchart of the pipeline used to construct ConSurf-DB. The pipeline consists of four steps: retrieving PDB entries, homologue detection and building a multiple sequence alignment, estimating evolutionary conservation, and formatting the results

3 | NEW FEATURES

3.1 | Homologue detection using HMMER

In previous releases of ConSurf-DB, the homologues of the query protein were collected using PSI-BLAST. Yet, new sequence search methodologies have developed in recent years, to keep pace with the continuous increase in the number of protein sequences. In the new release of ConSurf-DB (as well as in ConSurf itself), homologues are collected using the more advanced HMMER algorithm. HMMER implements probabilistic inference using profile hidden Markov models. Given a query sequence x and a target sequence y, BLAST calculates the score of the optimal alignment of x and y, whereas HMMER calculates a score that is the sum of scores of all possible alignments of x and y. Because HMMER uses a heuristic acceleration algorithm, it remains similar in speed to BLAST, but with a better rate of correctly detected homologues and a much lower rate of falsely detected hits. Implementation of HMMER in the new release of ConSurf-DB has improved homologue identification.

3.2 | Batch download

Since results are precalculated in ConSurf-DB, we can provide results for several protein structures in a single download. This feature, which was not included in previous versions of ConSurf-DB, is now available on our homepage, and users can access it by uploading a list of desired chains (where each chain appears on a new line).

3.3 | Improved visualization

I. Improving the color scales. In this release of ConSurf-DB, the colors, both in the default and color-blind scales, were refined to allow better distinction between the different conservation grades.
II. Providing PyMOL session files for high-resolution figures. PyMOL is a popular molecular visualization program; it contains various functions that enable users to analyze three-dimensional structures of proteins (e.g., show hydrogen bonds, calculate electrostatic potential), and it can also be used to create high-resolution images of the viewed protein. In previous versions of ConSurf and ConSurf-DB, users were provided with a modified PDB file of their protein, which contained the conservation grades in the temperature factor column. Using this file and a provided script, users were able to color the protein according to its calculated conservation grades. In this version, we provide a complete PyMOL session file, in which the query protein is already colored according to conservation. To create a high-resolution image, the user needs only to open the file with PyMOL and save it as a figure. While working on this feature, we discovered and fixed some issues with the coloring script. We therefore recommend that users who prefer to construct their own ConSurf figures download the revised files provided in this version.
III. Color-blind presentation option for all visual results. In earlier releases of ConSurf-DB, the visual results were presented using only the default conservation color scale. From this version on, all visual results will be available in both the default and the color-blind scales, both for viewing directly and for downloading. The color-blind display can be selected in the homepage, when running a query, or alternatively, in the results page, by clicking a button that enables switching between the two displays.
IV. Supporting the NGL viewer. The page of each entry now includes a visualization of the three-dimensional structure using the NGL viewer. This viewer is very fast and provides many features, such as zooming in on the interactions of the query protein with its cognate ligand, thus highlighting important biological information.

3.4 | Improvements in design and user experience

The new release of ConSurf-DB is considerably more user friendly than the previous release and includes many improvements in the user interface and user experience. In terms of the query process, for example, the list of protein chains is presented in a drop-down menu in the homepage, instead of on a new page. In terms of technical support, a contact form is now available to improve our communication with users. We encourage our users to write, and we would appreciate any feedback.

Moreover, in this version of ConSurf-DB, we present a new design for the website, which should improve clarity of presentation and ease of use. For example, in the new results page, the order of the results is determined by anticipated importance and usefulness, making it easier for users to find what they need. In addition, the names of the result files are much more intuitive and informative, and users can further access a README file that provides detailed information for all results. Finally, the running parameters of ConSurf-DB are presented in the results page, for the user's convenience.

4 | CONSURF-DB IN NUMBERS

The statistics for this version of ConSurf-DB are presented in Table 1. ConSurf-DB was built on the basis of a PISCES file containing 108,958 nonredundant protein chains from the PDB (at 100% sequence identity threshold); the PISCES file was updated in September 2019. Of this initial set, we filtered 7,054 chains shorter than 30 amino acids, 4,629 chains from large structures, and 210 chains with more than 15% modified amino acids, which, as explained above, are not suitable for the calculation. A total of 97,065 nonredundant chains remained after this initial filtration. The homologue search for each of these chains was performed using HMMER v3.2.1 against UniProt/UniRef90 release 07-2019. The homologues were filtered by thresholds and using CD-HIT v4.7 and were then aligned using MAFFT v7.419. The build process was carried out using 150–200 CPUs, with an average CPU time of roughly 15 min per chain. For 7,363 of the 97,065 chains, we failed to find at least 50 homologues and aborted calculation.

In aggregate, as of November 2019, ConSurf-DB covers 89,702 of the 108,958 unique protein chains in the PDB, that is, coverage of 82%, corresponding to a total of 365,218 chains. The vast majority of the calculations are based on large MSAs of 201–300 homologous proteins.

TABLE 1 Statistics of ConSurf-DB

PDB chains
  Total chains found: 473,197
  Total nonredundant chains found: 108,958
  Filtered
    Chains shorter than 30 amino acids: 7,054
    Chains with large structures: 4,629
    Chains with more than 15% modified residues: 210
  Total chains post-initial filtration: 389,863
  Total nonredundant chains post-initial filtration: 97,065

MSA sizes
  Chains with fewer than 50 homologues: 7,363
  MSAs created
    Chains with 50–100 homologues: 3,238
    Chains with 101–200 homologues: 4,978
    Chains with 201–300 homologues: 81,486
  Total chains processed: 89,702

Note: Currently, the databases cover 89,702 of the 108,958 protein chains in the nonredundant set, that is, 82%.

5 | EXAMPLES OF APPLICATIONS OF EVOLUTIONARY DATA: ACTIVE SITE ANALYSIS IN ENZYMES AND ANTIBODIES

As discussed above, data regarding the degree of conservation of each position in a protein can be used to predict the biological significance of specific positions, as functionally important positions tend to be more evolutionarily conserved compared with other positions. The high conservation of functional positions results from negative selection on mutations in these positions, as such mutations may result in loss of function.

In enzymes, mutations to catalytic residues are particularly unlikely to be tolerated, as each of these residues is engaged in a very specific function during catalysis (Figure 2). Other residues in the active site may determine the specificity of the enzyme to its cognate substrate. That is, the residues in these positions allow an enzyme to bind and act only on a certain substrate. Such positions are called specificity-determining positions (SDPs). Notably, different forms of a given enzyme (e.g., equivalent enzymes from different species or organs) may have different residues in these positions and thus bind different substrates. Accordingly, SDPs tend to be somewhat less evolutionarily conserved than catalytic positions, which are, in essence, invariant [46,47].

Such a phenomenon can be seen in aminotransferases (also called transaminases), a large group of enzymes that act on different substrates, such as the amino acids alanine, ornithine, aspartate, cysteine, and glutamate [48,49]. Figure 2 shows the conservation patterns of three positions in ornithine-aminotransferase (Or-AT), a member of the (S)-selective ω-aminotransferase enzyme family. The enzyme in this structure is bound to an inhibitor that resembles the substrate. Most of the Or-AT positions around the inhibitor–cofactor conjugate (including the principal catalytic positions, 235 and 292) are highly conserved. Replacement of these positions by mutagenesis is likely to result in considerable loss of enzymatic activity. However, when position 85, which is also in the binding pocket of Or-AT, is replaced, the enzyme remains active yet changes its substrate preference considerably. This suggests that position 85 is an SDP. Indeed, though position 85 is evolutionarily conserved, its conservation grade is lower than the conservation grades of the catalytic positions in the binding pocket. For example, in γ-aminobutyrate-aminotransferase, another member of the ω-aminotransferase family, this position is populated by isoleucine instead of tyrosine, the equivalent residue in Or-AT.

The decreased conservation of SDPs is particularly pronounced in antibodies. This is because each antibody binds a different substrate and therefore uses different residues in the equivalent substrate-binding positions. The SDPs in antibodies are located in the hypervariable region, at the tip of each "arm" of the antibody (Figure 3). The "stem" of this structure, referred to as the constant region, is similar in many antibodies, and it is therefore more evolutionarily conserved.

Identifying SDPs in an enzyme or antibody is not trivial and requires knowledge of the specific positions interacting with each substrate in each form of the protein. Obtaining this knowledge requires either knowing the three-dimensional structure of the different proteins bound to their cognate substrates or biochemical data (e.g., data obtained from mutagenesis experiments) that implicate specific positions in selective substrate binding. The above examples suggest that evolutionary information, which can be obtained quickly and easily using computational tools such as ConSurf and ConSurf-DB, not only may help researchers pinpoint functionally important positions in proteins but also may help to differentiate between subclasses of such positions (e.g., catalytic positions vs. SDPs).

FIGURE 2 Conservation of catalytic and specificity-determining positions (SDPs) in the active site of Or-AT (PDB entry 2oat). (a) Ornithine-aminotransferase, colored by conservation grade and shown in surface representation, together with the inhibitor–cofactor (pyridoxal phosphate) conjugate, colored by atom type and shown as spheres. (b) The catalytic and suspected specificity-determining positions of ornithine-aminotransferase are shown as sticks and colored by conservation grade. For clarity, the backbone of the enzyme is not shown

FIGURE 3 The conservation pattern of an antibody (PDB entry 1igt). A cartoon representation of an antibody colored according to evolutionary conservation. The constant and hypervariable regions in the structure are annotated. The antigen-binding region (CDR loops) is shown as spheres
Boca Raton, FL: Chapman and insights regarding the structure and function of a query Hall/CRC (Taylor & Francis Group), 2018. protein, and in particular, it can highlight biologically 3. Valdar WS. Scoring residue conservation. Proteins. 2002;48: important regions. ConSurf-DB provides such evolution- 227–241. ary information instantly and efficiently for the majority 4. Mihalek I, Res I, Lichtarge O. A family of evolution-entropy of the proteins included in the PDB. The results are hybrid methods for ranking protein residues by importance. J Mol Biol. 2004;336:1265–1282. highly robust because particularly stringent thresholds 266 BEN CHORIN ET AL. 5. Capra JA, Singh M. Predicting functionally important 25. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: A novel residues from sequence conservation. Bioinformatics. 2007;23: method for rapid multiple sequence alignment based on fast 1875–1882. Fourier transform. Nucleic Acids Res. 2002;30:3059–3066. 6. Morgan DH, Kristensen DM, Mittelman D, Lichtarge O. ET 26. Katoh K, Standley DM. MAFFT multiple sequence alignment viewer: An application for predicting and visualizing functional software version 7: Improvements in performance and usabil- sites in protein structures. Bioinformatics. 2006;22:2049–2050. ity. Mol Biol Evol. 2013;30:772–780. 7. Innis CA. siteFiNDER|3D: A web-based tool for predicting the 27. Darriba D, Taboada GL, Doallo R, Posada D. ProtTest 3: Fast location of functional sites in proteins. Nucleic Acids Res. 2007; selection of best-fit models of protein evolution. Bioinformatics. 35:W489–W494. 2011;27:1164–1165. 8. Huang YF, Golding GB. Phylogenetic Gaussian process model 28. Jones DT, Taylor WR, Thornton JM. The rapid generation of for the inference of functionally important regions in protein mutation data matrices from protein sequences. Bioinformat- tertiary structures. PLoS Comput Biol. 2014;10:e1003429. ics. 1992;8:275–282. 9. Huang YF, Golding GB. FuncPatch: A web server for the fast 29. 
Le SQ, Gascuel O. An improved general amino acid replace- Bayesian inference of conserved functional patches in protein ment matrix. Mol Biol Evol. 2008;25:1307–1320. 3D structures. Bioinformatics. 2015;31:523–531. 30. Dayhoff MO, Schwartz RM, Orcutt BC. A model of evolution- 10. Glaser F, Pupko T, Paz I, et al. ConSurf: Identification of func- ary change in proteins. In: Dayhoff M, editor. Atlas of protein tional regions in proteins by surface-mapping of phylogenetic sequence and structure. Washington, D.C.: National Biomedi- cal Research Foundation, 1978; p. 345–352. information. Bioinformatics. 2003;19:163–164. 11. Ashkenazy H, Abadi S, Martz E, et al. ConSurf 2016: An 31. Whelan S, Goldman N. A general empirical model of protein improved methodology to estimate and visualize evolutionary evolution derived from multiple protein families using a conservation in macromolecules. Nucleic Acids Res. 2016;44: maximum-likelihood approach. Mol Biol Evol. 2001;18:691–699. W344–W350. 32. Adachi J, Hasegawa M. Model of amino acid substitution in 12. Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N. Rate4Site: proteins encoded by mitochondrial DNA. J Mol Evol. 1996;42: 459–468. An algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants 33. Adachi J, Waddell PJ, Martin W, Hasegawa M. Plastid genome within their homologues. Bioinformatics. 2002;18:S71–S77. phylogeny and a model of amino acid substitution for proteins 13. Berman HM, Westbrook J, Feng Z, et al. The Protein Data encoded by chloroplast DNA. J Mol Evol. 2000;50:348–358. Bank. Nucleic Acids Res. 2000;28:235–242. 34. Saitou N, Nei M. The neighbor-joining method: A new method 14. Burley SK, Berman HM, Bhikadiya C, et al. RCSB protein data for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4: Bank: Biological macromolecular structures enabling research 406–425. and education in fundamental biology, biomedicine, biotech- 35. 
Mayrose I, Graur D, Ben-Tal N, Pupko T. Comparison of site- nology and energy. Nucleic Acids Res. 2019;47:D464–D474. specific rate-inference methods for protein sequences: Empirical 15. Wang G, Dunbrack RL. PISCES: A protein sequence culling Bayesian methods are superior. Mol Biol Evol. 2004;21:1781–1791. server. Bioinformatics. 2003;19:1589–1591. 36. Susko E, Inagaki Y, Field C, Holder ME, Roger AJ. Testing for 16. Wang G, Dunbrack RL. PISCES: Recent improvements to a differences in rates-across-sites distributions in phylogenetic PDB sequence culling server. Nucleic Acids Res. 2005;33: subtrees. Mol Biol Evol. 2002;19:1514–1523. W94–W98. 37. Rose AS, Hildebrand PW. NGL viewer: A web application for 17. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. Uni- molecular visualization. Nucleic Acids Res. 2015;43:W576–W579. Ref: Comprehensive and non-redundant UniProt reference 38. Rose AS, Bradley AR, Valasatava Y, Duarte JM, PrlicA, clusters. Bioinformatics. 2007;23:1282–1288. Rose PW. NGL viewer: Web-based molecular graphics for large 18. Suzek BE, Wang Y, Huang H, PB MG, Wu CH, UniProt Con- complexes. Bioinformatics. 2018;34:3755–3758. sortium. UniRef clusters: A comprehensive and scalable alter- 39. Martz E (2005) FirstGlance in Jmol Available from: http:// native for improving sequence similarity searches. firstglance.jmol.org/. Bioinformatics. 2015;31:926–932. 40. Schrödinger LLC (2015) The PyMOL Molecular Graphics Sys- 19. Apweiler R, Bairoch A, Wu CH, et al. UniProt: The universal tem, Version 2.3.3. protein knowledgebase. Nucleic Acids Res. 2004;32:D115–D119. 41. Pettersen EF, Goddard TD, Huang CC, et al. UCSF chimera— 20. UniProt Consortium. UniProt: A worldwide hub of protein A visualization system for exploratory research and analysis. knowledge. Nucleic Acids Res. 2019;47:D506–D515. J Comput Chem. 2004;25:1605–1612. 21. Eddy SR. A new generation of homology search tools based on 42. 
Laskowski RA, Hutchinson EG, Michie AD, Wallace AC, probabilistic inference. Genome Inform. 2009;23:205–211. Jones ML, Thornton JM. PDBsum: A web-based database of 22. Mistry J, Finn RD, Eddy SR, Bateman A, Punta M. Challenges summaries and analyses of all PDB structures. Trends Biochem in homology search: HMMER3 and convergent evolution of Sci. 1997;22:488–490. coiled-coil regions. Nucleic Acids Res. 2013;41:e121. 43. Laskowski RA, Jabłonska  J, Pravda L, Vařeková RS, 23. Li W, Godzik A. Cd-hit: A fast program for clustering and com- Thornton JM. PDBsum: Structural summaries of PDB entries. paring large sets of protein or nucleotide sequences. Bioinfor- Protein Sci. 2018;27:129–134. 44. Hodis E, Prilusky J, Martz E, Silman I, Moult J, Sussman JL. matics. 2006;22:1658–1659. 24. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: Accelerated for clus- Proteopedia—A scientific ‘wiki’ bridging the rift tering the next-generation sequencing data. Bioinformatics. between three-dimensional structure and function of bio- 2012;28:3150–3152. macromolecules. Genome Biol. 2008;9:R121. BEN CHORIN ET AL. 267 45. Hodis E, Prilusky J, Sussman JL. Proteopedia: A collaborative, 50. Markova M, Peneff C, Hewlins MJE, Schirmer T, John RA. virtual 3D web-resource for protein and biomolecule structure Determinants of substrate specificity in omega-aminotransfer- and function. Biochem Mol Biol Educ. 2010;38:341–342. ases. J Biol Chem. 2005;280:36409–36416. 46. Gerlt JA, Babbitt PC. Mechanistically diverse enzyme super- families: The importance of chemistry in the evolution of catal- ysis. Curr Opin Chem Biol. 1998;2:607–612. How to cite this article: Ben Chorin A, 47. Todd AE, Orengo CA, Thornton JM. Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol. Masrati G, Kessel A, et al. ConSurf-DB: An 2001;307:1113–1143. accessible repository for the evolutionary 48. Nelson DL, Cox M. Lehninger principles of biochemistry. 
5th conservation patterns of the majority of PDB ed. New York, NY: W.H. Freeman, 2008. proteins. Protein Science. 2020;29:258–267. https:// 49. Eliot AC, Kirsch JF. Pyridoxal phosphate enzymes: Mechanis- doi.org/10.1002/pro.3779 tic, structural, and evolutionary considerations. Annu Rev Bio- chem. 2004;73:383–415.
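
The counts reported in Section 4 and Table 1 can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of ConSurf-DB: the numbers are copied from Table 1, while the variable and function names are illustrative assumptions. It recomputes the post-filtration total, the number of processed chains, and the coverage percentage.

```python
"""Cross-check the ConSurf-DB counts reported in Table 1 (illustrative sketch)."""

# Counts copied from Table 1 (nonredundant PDB chains).
TOTAL_NONREDUNDANT = 108_958
FILTERED = {
    "shorter than 30 amino acids": 7_054,
    "large structures": 4_629,
    "more than 15% modified residues": 210,
}
TOO_FEW_HOMOLOGUES = 7_363  # chains with fewer than 50 homologues
MSA_BINS = {
    "50-100 homologues": 3_238,
    "101-200 homologues": 4_978,
    "201-300 homologues": 81_486,
}


def summarize():
    """Return derived totals; the key names are hypothetical, chosen for this sketch."""
    post_filtration = TOTAL_NONREDUNDANT - sum(FILTERED.values())
    processed = post_filtration - TOO_FEW_HOMOLOGUES
    return {
        "post_filtration": post_filtration,
        "processed": processed,
        "msa_total": sum(MSA_BINS.values()),
        "coverage_pct": 100 * processed / TOTAL_NONREDUNDANT,
        "largest_bin_pct": 100 * MSA_BINS["201-300 homologues"] / processed,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        if isinstance(value, float):
            print(f"{key}: {value:.1f}")
        else:
            print(f"{key}: {value:,}")
```

Under these counts, the three filters leave 97,065 nonredundant chains, the MSA size bins sum to the 89,702 processed chains, the coverage rounds to 82%, and the 201–300 homologue bin accounts for roughly 91% of the processed chains, which is consistent with the statement that the vast majority of calculations rest on large MSAs.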

Journal

Protein Science: A Publication of the Protein Society (PubMed Central)

Published: Nov 22, 2019
