Inference of Functional Properties from Large-scale Analysis of Enzyme Superfamilies *

Shoshana D. Brown; Patricia C. Babbitt

doi:10.1074/jbc.r111.283408

MINIREVIEW
THE JOURNAL OF BIOLOGICAL CHEMISTRY VOL. 287, NO. 1, pp. 35–42, January 2, 2012
© 2012 by The American Society for Biochemistry and Molecular Biology, Inc. Published in the U.S.A.
This is an Open Access article under the CC BY license.
Published, JBC Papers in Press, November 8, 2011, DOI 10.1074/jbc.R111.283408

Shoshana D. Brown‡ and Patricia C. Babbitt‡§¶,1

From the Departments of ‡Bioengineering and Therapeutic Sciences and §Pharmaceutical Chemistry, School of Pharmacy, and ¶California Institute for Quantitative Biosciences, University of California, San Francisco, California 94158-2330

As increasingly large amounts of data from genome and other sequencing projects become available, new approaches are needed to determine the functions of the proteins these genes encode. We show how large-scale computational analysis can help to address this challenge by linking functional information to sequence and structural similarities using protein similarity networks. Network analyses using three functionally diverse enzyme superfamilies illustrate the use of these approaches for facile updating and comparison of available structures for a large superfamily, for creation of functional hypotheses for metagenomic sequences, and to summarize the limits of our functional knowledge about even well studied superfamilies.

In the post-genomic era, access to large amounts of gene sequence and protein structure data has become the norm; by mid-2011, the number of protein sequences in the UniProt/TrEMBL Database (1) topped 16 million, whereas the Protein Data Bank (2) contained over 73,000 structures. Additional millions of sequences are becoming available from newer types of genome projects, including metagenomics projects, with one report for the human gut microbiome accounting for an additional 3.3 million microbial genes (3). Because experimental determination of protein function lags far behind the rate of sequence and structure determination, improved computational methods for function prediction are urgently needed to help bridge the gap between sequenced genes and functionally characterized protein products. In response, new methods are rapidly being developed to address these challenges, and community efforts are now under way to increase the pace of experimental and computational prediction of protein function (4, 5). Another large-scale effort (http://www.nigms.nih.gov/News/Results/gluegrant_051510.htm) aims to develop a combined experimental/computational strategy for the prediction of the reaction and substrate specificity of enzymes, the protein class that is the subject of this minireview. Additionally, community challenges such as the Critical Assessment of Function Annotations (CAFA) (Automated Function Prediction 2011) have been mounted to assess and improve the current state of automated prediction of protein function. Viewing the glass as half-full, progress in sequencing and annotation over the last decade led one group to estimate that some functional features can be assigned to as many as 85% of proteins in completely sequenced genomes (6). From a more skeptical perspective, more recent assessments of annotation accuracy suggest that computational approaches are especially prone to misannotation (7, 8), indicating that significant challenges for functional inference remain.

This minireview focuses on how new insights about protein structure-function relationships and functional inference can be obtained from large-scale analyses of proteins, specifically for “functionally diverse” enzyme superfamilies. We define these types of superfamilies as sets of homologous proteins that conserve structural and active site features that can be explicitly associated with a conserved partial reaction or other chemical capability. Within a superfamily and constrained by these superfamily-common features, many divergent families may have evolved that exhibit different reaction and/or substrate specificities (9). (See the Prologue for some definitions of superfamilies, families, and related terms.)

These types of superfamilies provide a useful context for inference of functional properties of members of unknown function (“unknowns”) because the constraints imposed by the structure-function paradigm unique to each superfamily restrict the search space for functional inference of their reaction and substrate specificities, simplifying their functional assignments. Because the number of sequences in each superfamily is still increasing rapidly, large amounts of new data are regularly available to inform these investigations. Moreover, sequence and structural similarities among all of the members of a superfamily can be associated with many types of functional information, allowing us to leverage what is known to guide inference of functional properties of unknowns that are similar. (See the minireview by Gerlt et al. (48) in this thematic series describing strategies for assigning functions in the enolase superfamily for an example.) Furthermore, as our coverage of genome space increases, new “outlier” functions in superfamilies can be identified from specialized environmental niches, extending our estimates of the natural boundaries of functional variation that a particular superfamily supports.

Below, we describe how the continuing increase in sequence and structural data can be used to understand better the evolution of new functions and to improve functional inference, accessed using a relatively new application of network-based methods, protein similarity networks, an attractive approach for investigation of functional properties from the context of sequence and structural similarity. Results from such large-scale studies are reviewed here using examples from three different superfamilies of enzymes: the eukaryotic protein kinase (ePK)-like superfamily, a large group of acid-sugar dehydratases from the enolase superfamily, and the glutathione transferase (GST) superfamily.

Emerging Roles for Large-scale Computational Analysis of Protein Superfamilies

As methods for managing and analyzing sequence and structural data have improved, computational studies can more effectively address broad issues in large-scale mapping of structure-function relationships and deduction of the patterns by which natural evolution has led to the divergence of many functions from an ancestral structural scaffold. For example, for protein kinases, one of the largest and most important enzyme superfamilies, the seminal Manning tree (10) provided a foundation for classification of human kinases and those from other eukaryotes. Likewise, a large-scale study of redox proteins generated a census of sequence, structural, and functional characteristics of the divergent superfamilies of the thioredoxin fold class that are represented in nature (11).

Large-scale analyses have the additional advantage of revealing patterns not easily observable when smaller data sets are examined. For example, comparison of sequence and structural features conserved in the active sites of the members of the large and functionally diverse enolase superfamily allowed the prediction of the specific partial reaction uniting the entire superfamily, the abstraction of an α-proton of a carboxylic acid, thereby restricting the functional prediction problem for the thousands of sequences now identified as superfamily members to consideration of only the overall reactions and substrates consistent with that paradigm (12). Using that structure-function mapping as a foundation, more detailed computational and experimental studies have identified differences among superfamily members that distinguish the reaction and substrate specificities of the 20 constituent families whose functions can now be assigned (see the minireview by Gerlt et al. (48) for a listing). Other notable studies linking structural and mechanistic features across large enzyme superfamilies include analyses of the amidohydrolase (13, 14), enoyl-CoA hydratase (15), nudix (16), haloalkanoic acid dehalogenase (17), and two dinucleotide-binding domain flavoprotein (18) superfamilies, to name a few.

As more powerful tools and computers have been created, the ease of mounting such studies has enabled new types of analyses that provide context for interpreting functional characteristics across homologous members of superfamilies. These include sophisticated algorithms for multiple alignment and phylogenetic inference, both of which have long been used to examine evolutionary relationships among groups of sequences. Especially relevant to this minireview, phylogenomic approaches, first described over a decade ago (19), combine phylogenetic reconstruction with functional assignment of unknowns based on their placement in the tree relative to knowns. Phylogenomic approaches have now been applied extensively to improve the accuracy of homology-based annotation and to distinguish divergent families within enzyme superfamilies (see Ref. 20 for an example). Additionally, searchable online databases such as BRENDA (21) provide access to a large store of enzyme function information, whereas others provide online curation and computational tools created to link enzyme sequence and structural information with functional characteristics and mechanistic properties (22–25).

Network-based Approaches for Large-scale Analysis of Protein Superfamilies

Although large-scale analyses indeed provide a “big picture” perspective that adds much to our understanding of genomic and chemical biology, the growing size of the data sets and their associated metadata continue to raise significant challenges for analysis and dissemination. Network-based analysis represents one approach used to capture biological context, with genetic or protein interaction networks using computational and/or experimental data being among the most common. Sequence and structure similarity networks have also been used for the analysis and visualization of structure-function relationships (26–28). This technique allows users to efficiently and quickly examine similarities of much larger sets of proteins than is generally possible using traditional methods such as phylogenetic trees and multiple alignments. For example, one such study mounted a comparison of over 145,000 sequences to create a map in which proteins are positioned according to sequence relationships and gene functions (29). The recent development of software platforms such as Cytoscape (30) facilitates the use of network methods and algorithms of several types, enabling access to these types of tools by non-experts.

Although they are not a substitute for phylogenetic inference, networks generated from even such simple metrics as all-by-all pairwise comparisons of a large number of divergent sequences have been shown to track well with known relationships and with the clustering provided by trees. Furthermore, they support facile mapping of many types of orthogonal data to proteins clustered by similarity (31). Types of information such as genome/operon context, interaction networks and pathways, and organism-specific information have been shown to enhance the accuracy of functional inference (see Refs. 32 and 33 for relevant reviews). In analogy to phylogenomics, functional information of many types can be associated with nodes (e.g. protein sequences or structures) in a similarity network to improve functional inference and insight. Because protein similarity networks can be quickly generated in interactive formats, users can easily explore these associations by coloring nodes with different combinations of sequence/structural properties and functional information.

Examples illustrating the application of large-scale analysis of structure-function relationships using protein similarity networks are described below. Interactive versions of these networks are available from the authors and can be viewed using the freely available Cytoscape software (30).

Tracking Growth of Structural Coverage: ePK-like Superfamily

The ePK-like superfamily is a large and diverse group of homologous enzymes that share a common protein kinase-like fold (34) and conserved residues associated with ATP-dependent phosphorylation of proteins and small molecules. ePK-like enzymes mediate many important cellular processes, including signal transduction (10). They make up almost 2% of eukaryotic genes and, although present as a smaller percentage of bacterial genes, may be at least as important in bacterial cellular regulation as the structurally unrelated histidine kinases (35).

The size and diversity of the ePK-like superfamily make it hard to generate a global overview of its sequence and structural relationships. As a result, only a small number of groups have attempted the time-consuming task of generating large-scale classifications of the kinases. In one of these studies, Kannan et al. (35) used a library of hidden Markov models (HMMs) to identify 45,000 ePK-like sequences from the NCBI non-redundant database (36) and the Global Ocean Sampling data set (37) and to classify them into 20 families. Examination of this diverse sequence set allowed the identification of 10 residues conserved across most families. Six of these residues were known to be involved in ATP and substrate binding and catalysis, whereas the functional role of the remaining residues had not been established. This study also showed that all but one of these well conserved residues had been lost over the course of evolution in one or more families (in some cases, substituted with changes in other regions of the protein), illustrating the plasticity of the ePK-like fold. Although profile-profile alignments and alignments of conserved motifs could be used to group some families into related clusters, the size and diversity of the superfamily have continued to challenge the construction of a more detailed evolutionary history.

Scheeff and Bourne (38) were able to surmount the problem of low sequence identity across the superfamily by combining sequence and structural information into a single phylogenetic analysis. The results suggested that the tree constructed by this method had some advantages and was more reliable than trees produced using either sequence or structural data alone.

In addition to these types of global analyses, many thousands of detailed studies have been published describing properties of smaller groups and of individual enzymes. However, the sheer number of sequences and structures in this superfamily, coupled with the rate of growth of the sequence and structure databases, makes keeping an up-to-date record of kinase relationships increasingly difficult, even without the inclusion of linked functional information. (The Pfam (39) PKinase clan currently includes nearly 85,000 sequences.) Here, we illustrate the use of similarity networks to keep track of relationships between enzymes in large superfamilies. In this example, networks generated from pairwise structural comparisons provide a current update of the structural coverage of the superfamily.

Fig. 1 shows structure similarity networks for the ePK-like superfamily, colored by Pfam classifications, with Fig. 1A indicating the differences in structural coverage in the years between when the study by Scheeff and Bourne (38) was published (October 2005) and May 2011, respectively. As is clear from these summaries, the structure space has filled out significantly over this 6-year span. Most strikingly, the fructosamine kinase family defined by Pfam, Fructosamin_kin (red oval in Fig. 1A, lower panel), was not represented at all in the network from 2005. Fig. 1B shows the same network as in Fig. 1A (lower panel), but thresholded at a higher stringency scoring cutoff (achieved by increasing the score threshold required for drawing edges between two nodes), enabling a more detailed view of the same structural relationships. Fig. 1B provides a different and somewhat more detailed view of the growth of structural coverage between these two time points. Although these networks use a set of structures that is larger and somewhat different from that used by Scheeff and Bourne, they track reasonably well with those trees (data not shown). Some exceptions include structures for which the position was labeled as uncertain in the Scheeff and Bourne tree. Alternative versions of these networks colored by the Manning classification (10), with the addition of the atypical kinase class used in Ref. 38, are provided in Fig. 2.

FIGURE 1. Structure similarity networks of ePK-like superfamily generated from pairwise comparisons using FAST algorithm. Each node represents a structure. Each edge represents a connection with a FAST N-score better than a given threshold. A, FAST N-score cutoff 11, colored by Pfam family. Upper panel, structures available as of October 2005 (97 nodes). At this cutoff, the average root mean square deviation (r.m.s.d.) is 2.81 Å with 213 Cα atoms aligned. Lower panel, structures available as of May 2011 (295 nodes). At this cutoff, the average r.m.s.d. is 2.98 Å with 207 Cα atoms aligned. B, FAST N-score cutoff 23. At this cutoff, the average r.m.s.d. is 1.97 Å with 247 Cα atoms aligned. Nodes colored green represent structures available in the Protein Data Bank as of October 2005; those colored blue represent structures added to the Protein Data Bank between October 2005 and May 2011 (total of 295 nodes). Nodes were arranged using the yFiles organic layout provided with Cytoscape version 2.7. Lengths of edges are not meaningful except that sequences in tightly clustered groups are relatively more similar to each other than sequences with few connections.

FIGURE 2. Alternative view of structure similarity networks of 86 representative structures in ePK-like superfamily (generated as described for Fig. 1). Nodes are colored according to their Manning/Bourne group classification. Dark gray nodes represent structures that were not classified. A, FAST N-score cutoff 4. B, FAST N-score cutoff 23.

As shown in this example, similarity networks can be used effectively to update relationships among proteins in a superfamily as new structures become available, if, as for the ePK-like superfamily, its structural coverage is good. Sequence networks can also be used to summarize relationships among proteins on a large scale (11), as described below. Although the scale at which networks can easily query such data is still much larger than can generally be accommodated using multiple alignments and trees, the size of networks that can be viewed and manipulated by software such as Cytoscape is limited by the number of edges they contain. In practice, for a superfamily as large as the kinases, only a small proportion of the available sequences can be represented in a single network, typically requiring the use of representative sequences to cover the divergence space. Additionally, because of the diversity of many superfamilies, including the ePK-like superfamily, it is not possible to connect the whole set of sequences at statistically significant scores.

Prediction of New Carbon Sources in Human Gut Microbiome from Comparisons with Acid-sugar Dehydratases of Enolase Superfamily

Microbes residing in the gut have a significant influence on human health. They aid in energy harvest from food and synthesize essential vitamins, and changes in the gut microbial population are associated with medical conditions such as inflammatory bowel disease and obesity (3). Variations in microbiome populations have also been observed following treatment with antibiotics (40). Thus, much interest is now focused on determining the molecular functions and biological roles of the gut metaproteome both in healthy individuals and in those suffering from disease.

One of the most comprehensive studies on the human gut microbiome to date describes a set of 3.3 million microbial genes sequenced and assembled from fecal samples of 124 individuals (3). As expected, the census of protein functions initially identified in this metagenome includes proteins in many central metabolic pathways such as those involved in carbon utilization. We used the information available in the Structure-Function Linkage Database (SFLD) (25) for a large set of acid-sugar dehydratases in the enolase superfamily to probe for additional and possibly unique carbon sources in the microbiome. This was accomplished by identifying putative acid-sugar dehydratases in the gut metagenome that differ from those that had been previously identified, whether of known or unknown specificity.

The substrate specificities of 10 acid-sugar dehydratases have now been biochemically established, allowing functional assignment of specificity to 40% of the 2000 sequences currently represented in this subgroup of the superfamily in SFLD. Although the rest can be assigned with high confidence as likely acid-sugar dehydratases, their substrate specificities remain unknown. Using SFLD tools, protein sequences from the human gut microbiome predicted to be acid-sugar dehydratases were identified and clustered together with the knowns and unknowns of the subgroup already annotated in the database. The results are summarized in the network shown in Fig. 3A. This network is thresholded at a relatively permissive cutoff, where most families are found in one major cluster. Other reaction families that do not show similarities to any of the nodes in this large cluster at a threshold better than the cutoff form smaller clusters arranged randomly at the bottom of Fig. 3A. Simple examination reveals a few emerging clusters in the main cluster and also in the separated clusters (e.g. the circled group in Fig. 3A) that are populated primarily or exclusively by gut metagenomic sequences. Because these sequences are somewhat distant from those with characterized functions (designated by different colors), they may indeed represent unique acid-sugar dehydratases and, hence, new carbon sources not previously associated with the superfamily.

A more detailed examination of this hypothesis can be obtained by visualization of the network at the more stringent e-value cutoff, shown in Fig. 3B. In this view, most of the characterized families within the subgroup have separated into individual clusters, suggesting that this threshold cutoff may be useful for hypothesizing the boundaries of at least some of the functionally distinct families within it. From this view, we can predict the specificity of some of the metagenomic sequences that cluster closely with known families, e.g. fuconate and galactonate dehydratases. The perspective provided in Fig. 3B also lends support to the hypothesis that the separated clusters populated only by gut metagenomic sequences and other uncharacterized sequences from the GenBank™ Data Bank may indeed represent new carbon sources not previously identified as members of the enolase superfamily. Finally, the addition of these metagenomic sequences to the networks helps to fill out the sequence space representing the acid-sugar dehydratases and illustrates more fully the breadth of their natural diversity. It is also interesting that some clusters containing members of characterized families in Fig. 3B have no representatives from the gut microbiome, suggesting that these functions may not be represented in the microorganisms that live in the gut (or those functions are supplied by enzymes from a different evolutionary background).

FIGURE 3. Sequence similarity networks of acid-sugar dehydratases known or predicted to belong to enolase superfamily and human gut microbiome. Networks were generated from all-by-all BLAST comparisons of 1578 sequences representing sequences of eight known acid-sugar dehydratase families and the mandelate racemase family from the mandelate racemase subgroup (see Footnote 5) as defined by SFLD and a filtered set of gut metagenome sequences that showed significant similarity to the members of the subgroup. Each of the 1578 nodes represents a sequence. Larger square nodes represent those that have been experimentally characterized, so their reaction and substrate specificities are known. Brown nodes represent sequences from the human gut metagenome, and white nodes represent SFLD sequences in the subgroup for which the reaction and substrate specificities have not been predicted. The remainder (small nodes) represent sequences for which specificity can be predicted at high confidence, colored by their SFLD family names (see Footnote 4). Nodes were arranged using the yFiles organic layout provided with Cytoscape version 2.7. A, each edge in the network represents a BLAST connection with an e-value of 1e-44 or better. At this cutoff, sequences have a median percent identity and alignment length of 32% and 369, respectively. B, each edge in the network represents a BLAST connection with an e-value of 1e-84 or better. At this cutoff, sequences have a median percent identity and alignment length of 44% and 384, respectively. Lengths of edges are not meaningful except that sequences in tightly clustered groups are relatively more similar to each other than sequences with few connections.

What We Do Not Know About Cytosolic GST Superfamily

GSTs constitute a large class of enzymes that play important biological roles in cell signaling and metabolism of endogenous compounds, drugs, and other xenobiotics. They are ubiquitous in nature (except for archaea) and may represent as much as 0.01% of the enzyme universe. Based on sequence similarities, GSTs have historically been organized into major classes using the names of Greek letters (e.g. Alpha, Pi, Omega, and Theta) (41). Within each major class, subclasses designate functional and other properties. Although a number of GSTs have been experimentally characterized in terms of their general substrate profiles, the physiological substrates and reaction specificities of only a small minority are known. Still, because of their importance to human biology and health, GSTs are among the best studied of enzyme superfamilies, with thousands of publications detailing their biological roles and structural and functional properties. Only a few studies have focused on the GST superfamily on a large scale, however (11, 42, 43). The sequence similarity network shown in Fig. 4 provides an overview of the cytosolic GST superfamily from one of these (42). It compares 622 GSTs representing 6000 sequences and shows that they can be divided into two major groups distinguished by sequence and structural similarity (and also by variations in their active site features). The majority of the enzymes in the smaller of the two groups shown in Fig. 4 (Group 1) are from eukaryotic organisms, whereas those from the larger group (Group 2) are more mixed, but with the largest number coming from bacteria.

The summary of sequence relationships and structural coverage provided in Fig. 4 is the first time that similarity relationships across the entire GST superfamily were captured in a single view. This map shows both the sequences that could be classified as members of one of the major classes (colored nodes) and those that had not even been assigned to one of these general classes (light and dark gray nodes) and had thus far only been identified as belonging to the cytosolic GST superfamily. Remarkably, despite decades of study, these results reveal that the huge majority of GSTs have never been functionally characterized at any level. Furthermore, the representation of the colored nodes in the overall topology suggests that many additional classes likely remain to be defined. The view provided in Fig. 4 thus lays a foundation for choosing new sequences for which functional and structural characterization may be especially valuable for prediction of new functional classes. Many additional GST sequences have recently been identified, so the proportion of GSTs for which no functional information is available continues to increase dramatically.

FIGURE 4. Sequence similarity network of cytosolic GSTs. Similarity is defined by pairwise BLAST alignments better than an e-value cutoff of 1e-12. 622 representative sequences that are a maximum of 40% identical and that span the diversity of 6000 GSTs are shown. Nodes are colored by classification of the sequence in the Swiss-Prot Database (part of the UniProt Database), if available. The 40 large nodes designate sequences with structures. At this cutoff, edges represent alignments with a median 27% identity over 200 residues. This network and legend are adapted from Ref. 42 with permission.

Challenges for Computational Prediction of Functional Properties

The examples provided in this minireview suggest the value of large-scale analyses such as similarity networks for summarizing sequence and structural relationships in large superfamilies and for developing hypotheses about how structure- or sequence-based clustering tracks with functional boundaries. However, like any other method, similarity networks also have some significant limitations, a few of which have been addressed above and others elsewhere (31). Although it is only by experimental investigation that the in vitro and in vivo functions of unknowns can ultimately be validated, the continual growth of sequence data makes it increasingly difficult for either focused or high-throughput experimental studies to keep up. Even a reasonable fallback position requires the development of new strategies for identifying the few experiments that could be most useful for validation of large-scale computational predictions. As illustrated here and elsewhere (44, 45), protein

* This work was supported, in whole or in part, by National Institutes of Health Grants R01 GM60595 and U54 GM093342. This is the fifth article in the Thematic Minireview Series on Enzyme Evolution in the Post-genomic Era.

1 To whom correspondence should be addressed.

2 The abbreviations used are: ePK, eukaryotic protein kinase; GST, glutathione transferase; HMM, hidden Markov model; SFLD, Structure-Function Linkage Database; r.m.s.d., root mean square deviation.

3 For network analysis for the ePK-like superfamily, structures were chosen to include only one structure for each unique UniProt ID, with a preference for 1) structures solved in October 2005 or earlier, 2) wild-type, 3) ligand-bound, and 4) good resolution structures. Using the FAST algorithm (46), each structure in the set was used as a query against a database containing all structures in the set. Networks were created at various N-score cutoffs and visualized using Cytoscape.

4 SFLD is a joint project of the Babbitt laboratory (supported by National Institutes of Health Grant GM60595 and National Science Foundation Grants DBI-0234768 and DBI-0640476) and the UCSF Resource for Biocomputing, Visualization, and Informatics (supported by National Institutes of Health Grant P41 RR001081). Additional support for the creation of networks available at SFLD is provided by the Enzyme Function Initiative (supported by National Institutes of Health Grant U54 GM093342).

5 Of 10 acid-sugar dehydratase families of known reaction specificity in SFLD, only seven are colored in Fig. 3, as two others are not represented in this analysis. The mandelate racemase family, the namesake of the subgroup, is also colored. Although mandelate racemase is not an acid-sugar dehydratase, it is a member of this subgroup by sequence and structural similarity and is therefore included in Fig. 3.

6 For network analysis for the gut metagenome, the sequence set consists of 1) the subgroup from SFLD containing acid-sugar dehydratases (named the mandelate racemase subgroup), filtered to 90% identity, aside from experimentally characterized members, all of which are present, and 2) all gut metagenome sequences that matched either this SFLD subgroup HMM or an SFLD family HMM from a family within the subgroup with an e-value cutoff of at least 1e-2 and that did not better match any other enolase superfamily SFLD HMMs. These sequences were filtered to 90% identity and to remove fragments under 150 amino acids. BLAST analysis (47) was performed using each sequence in the set as a query against a database containing all sequences in the set. Networks were created at two different e-value cutoffs and visualized as described in Footnote 3.

7 H. J. Atkinson and P. C. Babbitt, unpublished data.

8 For network analysis for the GST superfamily, the sequence set was generated, and networks were calculated and visualized as described previously (42).

9 P. C. Babbitt and D. Stryke, unpublished data.

Bertalan, M., Batto, J. M., Hansen, T., Le Paslier, D., Linneberg, A., Nielsen, H. B., Pelletier, E., Renault, P., Sicheritz-Ponten, T., Turner, K., Zhu, H., Yu, C., Li, S., Jian, M., Zhou, Y., Li, Y., Zhang, X., Li, S., Qin, N., Yang, H., Wang, J., Brunak, S., Doré, J., Guarner, F., Kristiansen, K., Pedersen, O., Parkhill, J., Weissenbach, J., Bork, P., Ehrlich, S. D., and Wang, J. (2010) Nature 464, 59–65
4. Roberts, R. J., Chang, Y. C., Hu, Z., Rachlin, J.
N., Anton, B. P., Pokrzywa, similarity networks represent one way to generate the context R. M., Choi, H. P., Faller, L. L., Guleria, J., Housman, G., Klitgord, N., Mazumdar, V., McGettrick, M. G., Osmani, L., Swaminathan, R., Tao, needed for choosing those experiments and interpreting the K. R., Letovsky, S., Vitkup, D., Segrè, D., Salzberg, S. L., Delisi, C., Steffen, results. M., and Kasif, S. (2011) Nucleic Acids Res. 39, D11–D14 REFERENCES 5. Bateman, A. (2010) Bioinformatics 26, 991 6. Raes, J., Harrington, E. D., Singh, A. H., and Bork, P. (2007) Curr. Opin. 1. UniProt Consortium (2011) Nucleic Acids Res. 39, D214–D219 Struct. Biol. 17, 362–369 2. Dutta, S., Burkhardt, K., Young, J., Swaminathan, G. J., Matsuura, T., Hen- 7. Hsiao, T. L., Revelles, O., Chen, L., Sauer, U., and Vitkup, D. (2010) Nat. rick, K., Nakamura, H., and Berman, H. M. (2009) Mol. Biotechnol 42, 1–13 Chem. Biol. 6, 34–40 3. Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K. S., Manichanh, C., 8. Schnoes, A. M., Brown, S. D., Dodevski, I., and Babbitt, P. C. (2009) PLoS Nielsen, T., Pons, N., Levenez, F., Yamada, T., Mende, D. R., Li, J., Xu, J., Li, Comput. Biol. 5, e1000605 S., Li, D., Cao, J., Wang, B., Liang, H., Zheng, H., Xie, Y., Tap, J., Lepage, P., 9. Gerlt, J. A., and Babbitt, P. C. (2001) Annu. Rev. Biochem. 70, 209–246 JANUARY 2, 2012• VOLUME 287 • NUMBER 1 JOURNAL OF BIOLOGICAL CHEMISTRY 41 MINIREVIEW: Functional Inference in Enzyme Superfamilies 10. Manning, G., Whyte, D. B., Martinez, R., Hunter, T., and Sudarsanam, S. ner, G. J., Ideker, T., and Bader, G. D. (2007) Nat. Protoc. 2, 2366–2382 (2002) Science 298, 1912–1934 31. Atkinson, H. J., Morris, J. H., Ferrin, T. E., and Babbitt, P. C. (2009) PLoS 11. Atkinson, H. J., and Babbitt, P. C. (2009) PLoS Comput. Biol. 5, e1000541 ONE 4, e4345 12. Babbitt, P. C., Hasson, M. S., Wedekind, J. E., Palmer, D. R., Barrett, W. C., 32. Frishman, D. (2007) Chem. Rev. 107, 3448–3466 Reed, G. H., Rayment, I., Ringe, D., Kenyon, G. 
L., and Gerlt, J. A. (1996) 33. Rentzsch, R., and Orengo, C. A. (2009) Trends Biotechnol. 27, 210–219 Biochemistry 35, 16489–16501 34. Taylor, S. S., and Radzio-Andzelm, E. (1994) Structure 2, 345–355 13. Holm, L., and Sander, C. (1997) Proteins Struct. Funct. Genet. 28, 72–82 35. Kannan, N., Taylor, S. S., Zhai, Y., Venter, J. C., and Manning, G. (2007) 14. Seibert, C. M., and Raushel, F. M. (2005) Biochemistry 44, 6383–6391 PLoS Biol. 5, e17 15. Holden, H. M., Benning, M. M., Haller, T., and Gerlt, J. A. (2001) Acc. 36. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., and Sayers, Chem. Res. 34, 145–157 E. W. (2011) Nucleic Acids Res. 39, D32–D37 16. Mildvan, A. S., Xia, Z., Azurmendi, H. F., Saraswat, V., Legler, P. M., 37. Yooseph, S., Sutton, G., Rusch, D. B., Halpern, A. L., Williamson, S. J., Massiah, M. A., Gabelli, S. B., Bianchet, M. A., Kang, L. W., and Amzel, Remington, K., Eisen, J. A., Heidelberg, K. B., Manning, G., Li, W., Jaro- L. M. (2005) Arch. Biochem. Biophys. 433, 129–143 szewski, L., Cieplak, P., Miller, C. S., Li, H., Mashiyama, S. T., Joachimiak, 17. Burroughs, A. M., Allen, K. N., Dunaway-Mariano, D., and Aravind, L. M. P., van Belle, C., Chandonia, J. M., Soergel, D. A., Zhai, Y., Natarajan, K., (2006) J. Mol. Biol. 361, 1003–1034 Lee, S., Raphael, B. J., Bafna, V., Friedman, R., Brenner, S. E., Godzik, A., 18. Ojha, S., Meng, E. C., and Babbitt, P. C. (2007) PLoS Comput. Biol. 3, e121 Eisenberg, D., Dixon, J. E., Taylor, S. S., Strausberg, R. L., Frazier, M., and 19. Eisen, J. A. (1998) Genome Res. 8, 163–167 Venter, J. C. (2007) PLoS Biol. 5, e16 20. Brown, D. P., Krishnamurthy, N., and Sjölander, K. (2007) PLoS Comput. 38. Scheeff, E. D., and Bourne, P. E. (2005) PLoS Comput. Biol. 1, e49 Biol. 3, e160 39. Finn, R. D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J. E., Gavin, 21. Scheer, M., Grote, A., Chang, A., Schomburg, I., Munaretto, C., Rother, O. 
L., Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, M., Söhngen, C., Stelzer, M., Thiele, J., and Schomburg, D. (2011) Nucleic E. L., Eddy, S. R., and Bateman, A. (2010) Nucleic Acids Res. 38, Acids Res. 39, D670–D676 D211–D222 22. Gariev, I. A., and Varfolomeev, S. D. (2006) Bioinformatics 22, 2574–2576 40. Dethlefsen, L., and Relman, D. A. (2011) Proc. Natl. Acad. Sci. U.S.A. 108, 23. Holliday, G. L., Almonacid, D. E., Bartlett, G. J., O’Boyle, N. M., Torrance, 4554–4561 J. W., Murray-Rust, P., Mitchell, J. B., and Thornton, J. M. (2007) Nucleic 41. Mannervik, B., Board, P. G., Hayes, J. D., Listowsky, I., and Pearson, W. R. Acids Res. 35, D515–D520 (2005) Methods Enzymol. 401, 1–8 24. Nagano, N. (2005) Nucleic Acids Res. 33, D407–D412 42. Atkinson, H. J., and Babbitt, P. C. (2009) Biochemistry 48, 11108–11116 25. Pegg, S. C., Brown, S. D., Ojha, S., Seffernick, J., Meng, E. C., Morris, J. H., 43. Pearson, W. R. (2005) Methods Enzymol. 401, 186–204 Chang, P. J., Huang, C. C., Ferrin, T. E., and Babbitt, P. C. (2006) Biochem- 44. Hicks, M. A., Barber, A. E. I., Giddings, L. A., Caldwell, J., O’Connor, S. E., istry 45, 2545–2555 and Babbitt, P. C. (2011) Proteins Struct. Funct. Genet. 79, 3082–3098 26. Enright, A. J., and Ouzounis, C. A. (2001) Bioinformatics 17, 853–854 45. Pieper, U., Chiang, R., Seffernick, J. J., Brown, S. D., Glasner, M. E., Kelly, L., 27. Frickey, T., and Lupas, A. (2004) Bioinformatics 20, 3702–3704 Eswar, N., Sauder, J. M., Bonanno, J. B., Swaminathan, S., Burley, S. K., 28. Huttenhower, C., Mehmood, S. O., and Troyanskaya, O. G. (2009) BMC Zheng, X., Chance, M. R., Almo, S. C., Gerlt, J. A., Raushel, F. M., Jacobson, Bioinformatics 10, 417 M. P., Babbitt, P. C., and Sali, A. (2009) J. Struct. Funct. Genomics 10, 29. Adai, A. T., Date, S. V., Wieland, S., and Marcotte, E. M. (2004) J. Mol. Biol. 107–125 340, 179–190 30. Cline, M. S., Smoot, M., Cerami, E., Kuchinsky, A., Landys, N., Workman, 46. Zhu, J., and Weng, Z. 
(2005) Proteins 58, 618–627 C., Christmas, R., Avila-Campilo, I., Creech, M., Gross, B., Hanspers, K., 47. Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, Isserlin, R., Kelley, R., Killcoyne, S., Lotia, S., Maere, S., Morris, J., Ono, K., W., and Lipman, D. J. (1997) Nucleic Acids Res. 25, 3389–3402 Pavlovic, V., Pico, A. R., Vailaya, A., Wang, P. L., Adler, A., Conklin, B. R., 48. Gerlt, J. A., Babbitt, P. C., Jacobson, M. P., and Almo, S. C. (2012) J Biol. Hood, L., Kuiper, M., Sander, C., Schmulevich, I., Schwikowski, B., War- Chem. 287, 29–34 42 JOURNAL OF BIOLOGICAL CHEMISTRY VOLUME 287 • NUMBER 1 •JANUARY 2, 2012 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Journal of Biological Chemistry American Society for Biochemistry and Molecular Biology http://www.deepdyve.com/lp/american-society-for-biochemistry-and-molecular-biology/inference-of-functional-properties-from-large-scale-analysis-of-enzyme-VTcMYiwmJV


References (48)

H. Atkinson, P. Babbitt (2009)
Glutathione Transferases Are Structural and Functional Outliers in the Thioredoxin Fold†
Biochemistry, 48
Shibu Yooseph, G. Sutton, D. Rusch, A. Halpern, S. Williamson, K. Remington, J. Eisen, K. Heidelberg, G. Manning, Weizhong Li, L. Jaroszewski, P. Cieplak, ChristopherA. Miller, Huiying Li, S. Mashiyama, marcin joachimiak, Christopher Belle, J. Chandonia, David Soergel, Yufeng Zhai, K. Natarajan, Shaun Lee, Benjamin Raphael, V. Bafna, R. Friedman, S. Brenner, A. Godzik, D. Eisenberg, J. Dixon, Susan Taylor, R. Strausberg, M. Frazier, J. Venter (2007)
The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families
PLoS Biology, 5
D. Frishman (2007)
Protein annotation at genomic scale: the current status.
Chemical reviews, 107 8
J. Gerlt, P. Babbitt, M. Jacobson, S. Almo (2011)
Divergent Evolution in Enolase Superfamily: Strategies for Assigning Functions*
The Journal of Biological Chemistry, 287
A. Adai, Shailesh Date, Shannon Wieland, E. Marcotte (2004)
LGL: creating a map of protein function with an algorithm for visualizing very large biological networks.
Journal of molecular biology, 340 1
Duncan Brown, Nandini Krishnamurthy, Kimmen Sjölander (2007)
Automated Protein Subfamily Identification and Classification
PLoS Computational Biology, 3
J. Raes, E. Harrington, Amoolya Singh, P. Bork (2007)
Protein function space: viewing the limits or limited by our view?
Current opinion in structural biology, 17 3
H. Holden, M. Benning, T. Haller, J. Gerlt (2001)
The crotonase superfamily: divergently related enzymes that catalyze different reactions involving acyl coenzyme a thioesters.
Accounts of chemical research, 34 2
M. Cline, M. Smoot, E. Cerami, A. Kuchinsky, Nerius Landys, C. Workman, R. Christmas, Iliana Avila-Campilo, Michael Creech, Benjamin Gross, K. Hanspers, Ruth Isserlin, Ryan Kelley, S. Killcoyne, Samad Lotia, Steven Maere, J. Morris, K. Ono, Vuk Pavlovic, A. Pico, A. Vailaya, Peng-Liang Wang, Annette Adler, B. Conklin, L. Hood, Martin Kuiper, C. Sander, Ilya Schmulevich, B. Schwikowski, G. Warner, T. Ideker, Gary Bader (2007)
Integration of biological networks and gene expression data using Cytoscape
Nature Protocols, 2
P. Babbitt, M. Hasson, J. Wedekind, D. Palmer, W. Barrett, G. Reed, I. Rayment, D. Ringe, G. Kenyon, J. Gerlt (1996)
The enolase superfamily: a general strategy for enzyme-catalyzed abstraction of the alpha-protons of carboxylic acids.
Biochemistry, 35 51
Susan Taylor, E. Radzio‐Andzelm (1994)
Three protein kinase structures define a common motif.
Structure, 2 5
R. Roberts, Yi-Chien Chang, Zhenjun Hu, John Rachlin, B. Anton, R. Pokrzywa, Han-Pil Choi, L. Faller, Jyotsna Guleria, Genevieve Housman, Niels Klitgord, Varun Mazumdar, M. McGettrick, Lais Osmani, R. Swaminathan, Kevin Tao, S. Letovsky, Dennis Vitkup, D. Segrè, S. Salzberg, C. DeLisi, Martin Steffen, S. Kasif (2010)
COMBREX: a project to accelerate the functional annotation of prokaryotic genomes
Nucleic Acids Research, 39
L. Dethlefsen, D. Relman (2010)
Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation
Proceedings of the National Academy of Sciences, 108
J. Eisen (1998)
Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis.
Genome research, 8 3
A. Schnoes, Shoshana Brown, Igor Dodevski, P. Babbitt (2009)
Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies
PLoS Computational Biology, 5
B. Mannervik, P. Board, J. Hayes, I. Listowsky, W. Pearson (2005)
Nomenclature for mammalian soluble glutathione transferases.
Methods in enzymology, 401
J. Qin, Ruiqiang Li, J. Raes, Manimozhiyan Arumugam, K. Burgdorf, C. Manichanh, T. Nielsen, N. Pons, F. Levenez, Takuji Yamada, D. Mende, Junhua Li, Junming Xu, Shaochuan Li, Dongfang Li, Jianjun Cao, Bo Wang, Huiqing Liang, Huisong Zheng, Yinlong Xie, J. Tap, P. Lepage, Marcelo Bertalan, Jean-Michel Batto, T. Hansen, D. Paslier, A. Linneberg, H. Nielsen, É. Pelletier, P. Renault, Thomas Sicheritz-Pontén, Keith Turner, Hong-mei Zhu, Chang Yu, Shengting Li, Min Jian, Yan Zhou, Yingrui Li, Xiuqing Zhang, Songgang Li, Nan Qin, Huanming Yang, Jian Wang, S. Brunak, J. Doré, F. Guarner, K. Kristiansen, O. Pedersen, J. Parkhill, J. Weissenbach, P. Bork, S. Ehrlich, Jun Wang (2010)
A human gut microbial gene catalogue established by metagenomic sequencing
Nature, 464
Eric Scheeff, P. Bourne (2005)
Structural Evolution of the Protein Kinase–Like Superfamily
PLoS Computational Biology, 1
W. Pearson (2005)
Phylogenies of glutathione transferase families.
Methods in enzymology, 401
A. Mildvan, Zuyong Xia, H. Azurmendi, V. Saraswat, P. Legler, M. Massiah, S. Gabelli, M. Bianchet, L. Kang, L. Amzel (2005)
Structures and mechanisms of Nudix hydrolases.
Archives of biochemistry and biophysics, 433 1
Gemma Holliday, D. Almonacid, G. Bartlett, Noel O'Boyle, James Torrance, Peter Murray-Rust, John Mitchell, J. Thornton (2006)
MACiE (Mechanism, Annotation and Classification in Enzymes): novel tools for searching catalytic mechanisms
Nucleic Acids Research, 35
S. Ojha, E. Meng, P. Babbitt (2007)
Evolution of Function in the “Two Dinucleotide Binding Domains” Flavoproteins
PLoS Computational Biology, 3
J. Gerlt, P. Babbitt (2001)
Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafamilies.
Annual review of biochemistry, 70
N. Nagano (2004)
EzCatDB: the Enzyme Catalytic-mechanism Database
Nucleic Acids Research, 33
S. Altschul, Thomas Madden, A. Schäffer, Jinghui Zhang, Zheng Zhang, W. Miller, D. Lipman (1997)
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic acids research, 25 17
Maurice Scheer, A. Grote, Antje Chang, I. Schomburg, Cornelia Munaretto, M. Rother, C. Söhngen, M. Stelzer, Juliane Thiele, D. Schomburg (2010)
BRENDA, the enzyme information system in 2011
Nucleic Acids Research, 39
T. Frickey, A. Lupas (2004)
CLANS: a Java application for visualizing protein families based on pairwise similarity
Bioinformatics, 20 18
Anton Enright, C. Ouzounis (2001)
BioLayout-an automatic graph layout algorithm for similarity visualization
Bioinformatics, 17 9
N. Kannan, Susan Taylor, Yufeng Zhai, J. Venter, G. Manning (2007)
Structural and Functional Diversity of the Microbial Kinome
PLoS Biology, 5
A. Morgat, R. Apweiler, M. Martin, C. O’Donovan, M. Magrane, Y. Alam-Faruque, R. Antunes, D. Barrell, B. Bely, M. Bingley, David Binns, Lawrence Bower, Paul Browne, Chan Wm, E. Dimmer, R. Eberhardt, F. Fazzini, A. Fedotov, R. Foulger, J. Garavelli, Castro Lg, R. Huntley, Julius Jacobsen, M. Kleen, K. Laiho, D. Legge, Quan Lin, W. Liu, J. Luo, S. Orchard, S. Patient, K. Pichler, D. Poggioli, Nikolas Pontikos, Manuela Pruess, S. Rosanoff, T. Sawford, H. Sehra, E. Turner, M. Corbett, M. Donnelly, P. VanRensburg, I. Xenarios, L. Bougueleret, A. Auchincloss, Ghislaine Argoud-Puy, K. Axelsen, A. Bairoch, Delphine Baratin, Blatter Mc, B. Boeckmann, Jerven Bolleman, L. Bollondi, E. Boutet, Quintaje Sb, L. Breuza, A. Bridge, E. Decastro, E. Coudert, Isabelle Cusin, M. Doche, D. Dornevil, S. Duvaud, A. Estreicher, L. Famiglietti, M. Feuermann, S. Gehant, Serenella Ferro, E. Gasteiger, A. Gateau, Vivienne Gerritsen, A. Gos, N. Gruaz-Gumowski, U. Hinz, C. Hulo, N. Hulo, J. James, S. Jimenez, F. Jungo, T. Kappler, G. Keller, V. Lara, P. Lemercier, D. Lieberherr, X. Martin, P. Masson, M. Moinat, S. Paesano, I. Pedruzzi, S. Pilbout, S. Poux, Monica Pozzato, Nicole Redaschi, C. Rivoire, B. Roechert, M. Schneider, Christian Sigrist, K. Sonesson, S. Staehli, E. Stanley, A. Stutz, S. Sundaram, M. Tognolli, L. Verbregue, V. Al, Wu Ch, Arighi Cn, L. Arminski, Barker Wc, Chuming Chen, Yingfei Chen, P. Dubey, He Huang, R. Mazumder, P. McGarvey, Natale Da, N. Tg, J. Nchoutmboube, Roberts Nv, Suzek Be, U. Ugochukwu, Vinayak Cr, Qiang Wang, Y. Wang, Yeh Ls, J. Zhang (2010)
Ongoing and future developments at the Universal Protein Resource
Nucleic Acids Research, 39
Tzu-Lin Hsiao, O. Revelles, Lifeng Chen, U. Sauer, Dennis Vitkup (2009)
Automatic policing of biochemical annotations using genomic correlations
Nature chemical biology, 6
C. Seibert, F. Raushel (2005)
Structural and catalytic diversity within the amidohydrolase superfamily.
Biochemistry, 44 17
S. Pegg, Shoshana Brown, S. Ojha, J. Seffernick, E. Meng, J. Morris, Patricia Chang, Conrad Huang, T. Ferrin, P. Babbitt (2006)
Leveraging enzyme structure-function relationships for functional inference and experimental design: the structure-function linkage database.
Biochemistry, 45 8
G. Manning, D. Whyte, R. Martinez, T. Hunter, S. Sudarsanam (2002)
The Protein Kinase Complement of the Human Genome
Science, 298
H. Atkinson, P. Babbitt (2009)
An Atlas of the Thioredoxin Fold Class Reveals the Complexity of Function-Enabling Adaptations
PLoS Computational Biology, 5
Michael Hicks, Alan Barber, L. Giddings, Jenna Caldwell, S. O’Connor, P. Babbitt (2011)
The evolution of function in strictosidine synthase‐like proteins
Proteins: Structure, 79
L. Holm, C. Sander (1997)
An evolutionary treasure: unification of a broad set of amidohydrolases related to urease
Proteins: Structure, 28
H. Atkinson, J. Morris, T. Ferrin, P. Babbitt (2009)
Using Sequence Similarity Networks for Visualization of Relationships Across Diverse Protein Superfamilies
PLoS ONE, 4
Jianhua Zhu, Z. Weng (2004)
FAST: A novel protein structure alignment algorithm
Proteins: Structure, 58
Shuchismita Dutta, K. Burkhardt, Jasmine Young, G. Swaminathan, T. Matsuura, K. Henrick, Haruki Nakamura, H. Berman (2009)
Data Deposition and Annotation at the Worldwide Protein Data Bank
Molecular Biotechnology, 42
A. Bateman (2010)
Curators of the world unite: the International Society of Biocuration
Bioinformatics, 26 8
U. Pieper, Ranyee Chiang, J. Seffernick, Shoshana Brown, M. Glasner, L. Kelly, N. Eswar, J. Sauder, J. Bonanno, S. Swaminathan, S. Burley, Xiaojing Zheng, M. Chance, S. Almo, J. Gerlt, F. Raushel, M. Jacobson, P. Babbitt, A. Sali (2009)
Target selection and annotation for the structural genomics of the amidohydrolase and enolase superfamilies
Journal of Structural and Functional Genomics, 10
I. Gariev, S. Varfolomeev (2006)
Hierarchical classification of hydrolases catalytic sites
Bioinformatics, 22 20
C. Huttenhower, Sajid Mehmood, O. Troyanskaya (2009)
Graphle: Interactive exploration of large, dense graphs
BMC Bioinformatics, 10
R. Rentzsch, C. Orengo (2009)
Protein function prediction--the power of multiplicity.
Trends in biotechnology, 27 4
A. Burroughs, Karen Allen, D. Dunaway-Mariano, L. Aravind (2006)
Evolutionary genomics of the HAD superfamily: understanding the structural adaptations and catalytic diversity in a superfamily of phosphoesterases and allied enzymes.
Journal of molecular biology, 361 5

Publisher: American Society for Biochemistry and Molecular Biology
Copyright: © 2012 Elsevier Inc.
ISSN: 0021-9258
eISSN: 1083-351X
DOI: 10.1074/jbc.r111.283408

Abstract

MINIREVIEW
THE JOURNAL OF BIOLOGICAL CHEMISTRY VOL. 287, NO. 1, pp. 35–42, January 2, 2012
© 2012 by The American Society for Biochemistry and Molecular Biology, Inc. Published in the U.S.A.

Inference of Functional Properties from Large-scale Analysis of Enzyme Superfamilies*

Published, JBC Papers in Press, November 8, 2011, DOI 10.1074/jbc.R111.283408

Shoshana D. Brown‡ and Patricia C. Babbitt‡§¶1

From the ‡Departments of Bioengineering and Therapeutic Sciences and §Pharmaceutical Chemistry, School of Pharmacy, and ¶California Institute for Quantitative Biosciences, University of California, San Francisco, California 94158-2330

* This work was supported, in whole or in part, by National Institutes of Health Grants R01 GM60595 and U54 GM093342. This is the fifth article in the Thematic Minireview Series on Enzyme Evolution in the Post-genomic Era.
1 To whom correspondence should be addressed. E-mail: [email protected]. edu.
The abbreviations used are: ePK, eukaryotic protein kinase; GST, glutathione transferase; HMM, hidden Markov model; SFLD, Structure-Function Linkage Database; r.m.s.d., root mean square deviation.
This is an Open Access article under the CC BY license.

As increasingly large amounts of data from genome and other sequencing projects become available, new approaches are needed to determine the functions of the proteins these genes encode. We show how large-scale computational analysis can help to address this challenge by linking functional information to sequence and structural similarities using protein similarity networks. Network analyses using three functionally diverse enzyme superfamilies illustrate the use of these approaches for facile updating and comparison of available structures for a large superfamily, for creation of functional hypotheses for metagenomic sequences, and to summarize the limits of our functional knowledge about even well studied superfamilies.

In the post-genomic era, access to large amounts of gene sequence and protein structure data has become the norm; by mid-2011, the number of protein sequences in the UniProt/TrEMBL Database (1) topped 16 million, whereas the Protein Data Bank (2) contained over 73,000 structures. Additional millions of sequences are becoming available from newer types of genome projects, including metagenomics projects, with one report for the human gut microbiome accounting for an additional 3.3 million microbial genes (3). Because experimental determination of protein function lags far behind the rate of sequence and structure determination, improved computational methods for function prediction are urgently needed to help bridge the gap between sequenced genes and functionally characterized protein products. In response, new methods are rapidly being developed to address these challenges, and community efforts are now under way to increase the pace of experimental and computational prediction of protein function (4, 5). Another large-scale effort (http://www.nigms.nih.gov/News/Results/gluegrant_051510.htm) aims to develop a combined experimental/computational strategy for the prediction of the reaction and substrate specificity of enzymes, the protein class that is the subject of this minireview. Additionally, community challenges such as the Critical Assessment of Function Annotations (CAFA) (Automated Function Prediction 2011) have been mounted to assess and improve the current state of automated prediction of protein function. Viewing the glass as half-full, progress in sequencing and annotation over the last decade led one group to estimate that some functional features can be assigned to as much as 85% of proteins in completely sequenced genomes (6). From a more skeptical perspective, more recent assessments of annotation accuracy suggest that computational approaches are especially prone to misannotation (7, 8), indicating that significant challenges for functional inference remain.

This minireview focuses on how new insights about protein structure-function relationships and functional inference can be obtained from large-scale analyses of proteins, specifically for “functionally diverse” enzyme superfamilies. We define these types of superfamilies as sets of homologous proteins that conserve structural and active site features that can be explicitly associated with a conserved partial reaction or other chemical capability. Within a superfamily and constrained by these superfamily-common features, many divergent families may have evolved that exhibit different reaction and/or substrate specificities (9). (See the Prologue for some definitions of superfamilies, families, and related terms.)

These types of superfamilies provide a useful context for inference of functional properties of members of unknown function (“unknowns”) because the constraints imposed by the structure-function paradigm unique to each superfamily restrict the search space for functional inference of their reaction and substrate specificities, simplifying their functional assignments. Because the number of sequences in each superfamily is still increasing rapidly, large amounts of new data are regularly available to inform these investigations. Moreover, sequence and structural similarities among all of the members of a superfamily can be associated with many types of functional information, allowing us to leverage what is known to guide inference of functional properties of unknowns that are similar. (See the minireview by Gerlt et al. (48) in this thematic series describing strategies for assigning functions in the enolase superfamily for an example.) Furthermore, as our coverage of genome space increases, new “outlier” functions in superfamilies can be identified from specialized environmental niches, extending our estimates of the natural boundaries of functional variation that a particular superfamily supports.

Below, we describe how the continuing increase in sequence and structural data can be used to understand better the evolution of new functions and to improve functional inference accessed using a relatively new application of network-based methods, protein similarity networks, an attractive approach for investigation of functional properties from the context of sequence and structural similarity. Results from such large-scale studies are reviewed here using examples from three different superfamilies of enzymes: the eukaryotic protein kinase (ePK)-like superfamily, a large group of acid-sugar dehydratases from the enolase superfamily, and the glutathione transferase (GST) superfamily.

Emerging Roles for Large-scale Computational Analysis of Protein Superfamilies

As methods for managing and analyzing sequence and structural data have improved, computational studies can more effectively address broad issues in large-scale mapping of structure-function relationships and deduction of the patterns by which natural evolution has led to the divergence of many functions from an ancestral structural scaffold. For example, for protein kinases, one of the largest and most important enzyme superfamilies, the seminal Manning tree (10) provided a foundation for classification of human kinases and those from other eukaryotes. Likewise, a large-scale study of redox proteins generated a census of sequence, structural, and functional characteristics of the divergent superfamilies of the thioredoxin fold class that are represented in nature (11).

Large-scale analyses have the additional advantage of revealing patterns not easily observable when smaller data sets are examined. For example, comparison of sequence and structural features conserved in the active sites of the members of the large and functionally diverse enolase superfamily allowed the prediction of the specific partial reaction uniting the entire superfamily, the abstraction of an α-proton of a carboxylic acid, thereby restricting the functional prediction problem for the thousands of sequences now identified as superfamily members to consideration of only the overall reactions and substrates consistent with that paradigm (12). Using that structure-function mapping as a foundation, more detailed computational and experimental studies have identified differences among superfamily members that distinguish the reaction and substrate specificities of the 20 constituent families whose functions can now be assigned (see the minireview by Gerlt et al. (48) for a listing). Other notable studies linking structural and mechanistic features across large enzyme superfamilies include analyses of the amidohydrolase (13, 14), enoyl-CoA hydratase (15), nudix (16), haloalkanoic acid dehalogenase (17), and two dinucleotide-binding domain flavoprotein (18) superfamilies, to name a few.

As more powerful tools and computers have been created, the ease of mounting such studies has enabled new types of analyses that provide context for interpreting functional characteristics across homologous members of superfamilies. These include sophisticated algorithms for multiple alignment and phylogenetic inference, both of which have long been used to examine evolutionary relationships among groups of sequences. Especially relevant to this minireview, phylogenomic approaches, first described over a decade ago (19), combine phylogenetic reconstruction with functional assignment of unknowns based on their placement in the tree relative to knowns. Phylogenomic approaches have now been applied extensively to improve the accuracy of homology-based annotation and to distinguish divergent families within enzyme superfamilies (see Ref. 20 for an example). Additionally, searchable online databases such as BRENDA (21) provide access to a large store of enzyme function information, whereas others provide online curation and computational tools created to link enzyme sequence and structural information with functional characteristics and mechanistic properties (22–25).

Network-based Approaches for Large-scale Analysis of Protein Superfamilies

Although large-scale analyses indeed provide a “big picture” perspective that adds much to our understanding of genomic and chemical biology, the growing size of the data sets and their associated metadata continue to raise significant challenges for analysis and dissemination. Network-based analysis represents one approach used to capture biological context, with genetic or protein interaction networks using computational and/or experimental data being among the most common. Sequence and structure similarity networks have also been used for the analysis and visualization of structure-function relationships (26–28). This technique allows users to efficiently and quickly examine similarities of much larger sets of proteins than is generally possible using traditional methods such as phylogenetic trees and multiple alignments. For example, one such study mounted a comparison of over 145,000 sequences to create a map in which proteins are positioned according to sequence relationships and gene functions (29). The recent development of software platforms such as Cytoscape (30) facilitates the use of network methods and algorithms of several types, enabling access to these types of tools by non-experts.

Although they are not a substitute for phylogenetic inference, networks generated from even such simple metrics as all-by-all pairwise comparisons of a large number of divergent sequences have been shown to track well with known relationships and with the clustering provided by trees. Furthermore, they support facile mapping of many types of orthogonal data to proteins clustered by similarity (31). Types of information such as genome/operon context, interaction networks and pathways, and organism-specific information have been shown to enhance the accuracy of functional inference (see Refs. 32 and 33 for relevant reviews). In analogy to phylogenomics, functional information of many types can be associated with nodes (e.g. protein sequences or structures) in a similarity network to improve functional inference and insight. Because protein similarity networks can be quickly generated in interactive formats, users can easily explore these associations by coloring nodes with different combinations of sequence/structural properties and functional information.

Examples illustrating the application of large-scale analysis of structure-function relationships using protein similarity networks are described below. Interactive versions of these networks are available from the authors and can be viewed using the freely available Cytoscape software (30).

Tracking Growth of Structural Coverage: ePK-like Superfamily

The ePK-like superfamily is a large and diverse group of homologous enzymes that share a common protein kinase-like fold (34) and conserved residues associated with ATP-dependent phosphorylation of proteins and small molecules. ePK-like enzymes mediate many important cellular processes, including signal transduction (10). They make up almost 2% of eukaryotic genes and, although present as a smaller percentage of bacterial genes, may be at least as important in bacterial cellular regulation as the structurally unrelated histidine kinases (35).

The size and diversity of the ePK-like superfamily make it hard to generate a global overview of their sequence and structural relationships. As a result, only a small number of groups have attempted the time-consuming task of generating large-scale classifications of the kinases. In one of these studies, Kannan et al. (35) used a library of hidden Markov models (HMMs) to identify 45,000 ePK-like sequences from the NCBI non-redundant database (36) and the Global Ocean Sampling data set (37) and to classify them into 20 families. Examination of this diverse sequence set allowed the identification of 10 residues conserved across most families. Six of these residues were known to be involved in ATP and substrate binding and catalysis, whereas the functional role of the remaining residues had not been established. This study also showed that all but one of these well conserved residues had been lost over the course of evolution in one or more families (in some cases, substituted with changes in other regions of the protein), illustrating the plasticity of the ePK-like fold. Although profile-profile alignments and alignments of conserved motifs could be used to group some families into related clusters, the size and diversity of the superfamily have continued to challenge the construction of a more detailed evolutionary history.

[...] between when the study by Scheeff and Bourne (38) was published (October 2005) and May 2011, respectively. As is clear from these summaries, the structure space has filled out significantly over this 6-year span. Most strikingly, the fructosamine kinase family defined by Pfam, Fructosamin_kin (red oval in Fig. 1A, lower panel), was not represented at all in the network from 2005. Fig. 1B shows the same network as in Fig. 1A (lower panel), but thresholded at a higher stringency scoring cutoff (achieved by increasing the score threshold required for drawing edges between two nodes), enabling a more detailed view of the same structural relationships. Fig. 1B provides a different and somewhat more detailed view of the growth of structural coverage between these two time points. Although these networks use a set of structures that is larger and somewhat different from that used by Scheeff and Bourne, they track reasonably well with those trees (data not shown). Some exceptions include structures for which the position was labeled as uncertain in the Scheeff and Bourne tree. Alternative versions of these networks colored by the Manning classification (10), with the addition of the atypical kinase class used in Ref. 38, are provided in Fig. 2.

As shown in this example, similarity networks can be used effectively to update relationships among proteins in a superfamily as new structures become available, if, as for the ePK-like superfamily, its structural coverage is good. Sequence networks can also be used to summarize relationships among proteins on a large scale (11), as described below. Although the scale at which networks can easily query such data is still much larger [...]
Although these net- to identify 45,000 ePK-like sequences from the NCBI non- works use a set of structures that is larger and somewhat differ- redundant database (36) and the Global Ocean Sampling data ent from that used by Scheeff and Bourne, they track reasonably set (37) and to classify them into 20 families. Examination of well with those trees (data not shown). Some exceptions this diverse sequence set allowed the identification of 10 resi- include structures for which the position was labeled as uncer- dues conserved across most families. Six of these residues were tain in the Scheeff and Bourne tree. Alternative versions of known to be involved in ATP and substrate binding and catal- these networks colored by the Manning classification (10), with ysis, whereas the functional role of the remaining residues had the addition of the atypical kinase class used in Ref. 38, are not been established. This study also showed that all but one of provided in Fig. 2. these well conserved residues had been lost over the course of As shown in this example, similarity networks can be used evolution in one or more families (in some cases, substituted effectively to update relationships among proteins in a super- with changes in other regions of the protein), illustrating the family as new structures become available, if, as for the ePK-like plasticity of the ePK-like fold. Although profile-profile align- superfamily, its structural coverage is good. Sequence networks ments and alignments of conserved motifs could be used to can also be used to summarize relationships among proteins on group some families into related clusters, the size and diversity a large scale (11), as described below. Although the scale at of the superfamily have continued to challenge the construc- which networks can easily query such data is still much larger tion of a more detailed evolutionary history. 
than can generally be accommodated using multiple align- Scheeff and Bourne (38) were able to surmount the problem ments and trees, the size of networks that can be viewed and of low sequence identity across the superfamily by combining manipulated by software such as Cytoscape is limited by the sequence and structural information into a single phylogenetic number of edges they contain. In practice, for a superfamily as analysis. The results suggested that the tree constructed by this large as the kinases, only a small proportion of the available method had some advantages and was more reliable than trees sequences can be represented in a single network, typically produced using either sequence or structural data alone. requiring the use of representative sequences to cover the In addition to these types of global analyses, many thousands divergence space. Additionally, because of the diversity of many of detailed studies have been published describing properties of superfamilies, including the ePK-like superfamily, it is not pos- smaller groups and of individual enzymes. However, the sheer sible to connect the whole set of sequences at statistically sig- number of sequences and structures in this superfamily, cou- nificant scores. pled with the rate of growth of the sequence and structure data- Prediction of New Carbon Sources in Human Gut bases, makes keeping an up-to-date record of kinase relation- Microbiome from Comparisons with Acid-sugar ships increasingly difficult, even without the inclusion of linked Dehydratases of Enolase Superfamily functional information. (The Pfam (39) PKinase clan currently includes nearly 85,000 sequences.) Here, we illustrate the use of Microbes residing in the gut have a significant influence on similarity networks to keep track of relationships between human health. In addition to aiding in energy harvest from food enzymes in large superfamilies. 
In this example, networks gen- and synthesizing essential vitamins, changes in the gut micro- erated from pairwise structural comparisons provide a current bial population are associated with medical conditions such as update of the structural coverage of the superfamily. inflammatory bowel disease and obesity (3). Variations in Fig. 1 shows structure similarity networks for the ePK-like microbiome populations have also been observed following superfamily, colored by Pfam classifications, with Fig. 1A indi- treatment with antibiotics (40). Thus, much interest is now cating the differences in structural coverage in the years focused on determining the molecular functions and biological roles of the gut metaproteome both in healthy individuals and in those suffering from disease. For network analysis for the ePK-like superfamily, structures were chosen to One of the most comprehensive studies on the human gut include only one structure for each unique UniProt ID, with a preference for 1) structures solved October 2005 or previously and 2) wild-type, 3) ligand- microbiome to date describes a set of 3.3 million microbial bound, and 4) good resolution structures. Using the FAST algorithm (46), genes sequenced and assembled from fecal samples of 124 indi- each structure in the set was used as a query against a database containing viduals (3). As expected, the census of protein functions initially all structures in the set. Networks were created at various N-score cutoffs and visualized using Cytoscape. identified in this metagenome includes proteins in many cen- JANUARY 2, 2012• VOLUME 287 • NUMBER 1 JOURNAL OF BIOLOGICAL CHEMISTRY 37 MINIREVIEW: Functional Inference in Enzyme Superfamilies FIGURE 1. Structure similarity networks of ePK-like superfamily generated from pairwise comparisons using FAST algorithm. Each node represents a structure. Each edge represents a connection with a FAST N-score better than a given threshold. 
A, FAST N-score cutoff 11, colored by Pfam family. Upper panel, structures available as of October 2005 (97 nodes). At this cutoff, the average root mean square deviation (r.m.s.d.) is 2.81 Å with 213 C atoms aligned. Lower panel, structures available as of May 2011 (295 nodes). At this cutoff, the average r.m.s.d. is2.98 Å with207 C atoms aligned. B, FAST N-score cutoff 23. At this cutoff, the average r.m.s.d. is 1.97 with 247 C atoms aligned. Nodes colored green represent structures available in the Protein Data Bank as of October 2005; those colored blue represent structures added to the Protein Data Bank between October 2005 and May 2011 (total of 295 nodes). Nodes were arranged using the yFiles organic layout provided with Cytoscape version 2.7. Lengths of edges are not meaningful except that sequences in tightly clustered groups are relatively more similar to each other than sequences with few connections. tral metabolic pathways such as those involved in carbon utili- assignment of specificity to40% of the2000 sequences cur- zation pathways. We used the information available in the rently represented in this subgroup of the superfamily in SFLD. Structure-Function Linkage Database (SFLD) (25) for a large Although the rest can be assigned with high confidence as likely set of acid-sugar dehydratases in the enolase superfamily to acid-sugar dehydratases, their substrate specificities remain probe for additional and possibly unique carbon sources in the unknown. Using SFLD tools, protein sequences from the microbiome. This was accomplished by identifying putative human gut microbiome predicted to be acid-sugar dehydrata- acid-sugar dehydratases in the gut metagenome that differ from ses were identified and clustered together with the knowns and those that had been previously identified, whether of known or unknowns of the subgroup already annotated in the database. unknown specificity. The results are summarized in the network shown in Fig. 3A. 
The substrate specificities of 10 acid-sugar dehydratases This network is thresholded at a relatively permissive cutoff, have now been biochemically established, allowing functional where most families are found in one major cluster. Other reac- tion families that do not show similarities to any of the nodes in SFLD is a joint project of the Babbitt laboratory (supported by National Insti- tutes of Health Grant GM60595 and National Science Foundation Grants For network analysis for the gut metagenome, the sequence set consists of DBI-0234768 and DBI-0640476) and the UCSF Resource for Biocomputing, 1) the subgroup from SFLD containing acid-sugar dehydratases (named Visualization, and Informatics (supported by National Institutes of Health the mandelate racemase subgroup), filtered to 90% identity, aside from Grant P41 RR001081). Additional support for the creation of networks experimentally characterized members, all of which are present, and 2) all available at SFLD is provided by the Enzyme Function Initiative (supported gut metagenome sequences that matched either this SFLD subgroup by National Institutes of Health Grant U54 GM093342). HMM or an SFLD family HMM from a family within the subgroup with an Of 10 acid-sugar dehydratase families of known reaction specificity in SFLD, e-value cutoff of at least 1e2 and that did not better match any other only seven are colored in Fig. 3, as two others are not represented in this enolase superfamily SFLD HMMs. These sequences were filtered to 90% analysis. The mandelate racemase family, the namesake of the subgroup, is identity and to remove fragments under 150 amino acids. BLAST analysis also colored. Although mandelate racemase is not an acid-sugar dehydra- (47) was performed using each sequence in the set as a query against a tase, it is a member of this subgroup by sequence and structural similarity database containing all sequences in the set. Networks were created at and is therefore included in Fig. 3. 
two different e-value cutoffs and visualized as described in Footnote 3. 38 JOURNAL OF BIOLOGICAL CHEMISTRY VOLUME 287 • NUMBER 1 •JANUARY 2, 2012 MINIREVIEW: Functional Inference in Enzyme Superfamilies FIGURE 2. Alternative view of structure similarity networks of 86 representative structures in ePK-like superfamily (generated as described for Fig. 1). Nodes are colored according to their Manning/Bourne group classification. Dark gray nodes represent structures that were not classified. A, FAST N-score cutoff 4. B, FAST N-score cutoff 23. this large cluster at a threshold better than the cutoff form and illustrates more fully the breadth of their natural diversity. smaller clusters arranged randomly at the bottom of Fig. 3A. It is also interesting that some clusters containing members of Simple examination reveals a few emerging clusters in the main characterized families in Fig. 3B have no representatives from cluster and also in the separated clusters (e.g. the circled group the gut microbiome, suggesting that these functions may not be in Fig. 3A) that are populated primarily or exclusively by gut represented in the microorganisms that live in the gut (or those metagenomic sequences. Because these sequences are some- functions are supplied by enzymes from a different evolution- what distant from those with characterized functions (desig- ary background). nated by different colors), they may indeed represent unique What We Do Not Know About Cytosolic GST Superfamily acid-sugar dehydratases and, hence, new carbon sources not GSTs constitute a large class of enzymes that play important previously associated with the superfamily. A more detailed examination of this hypothesis can be biological roles in cell signaling and metabolism of endogenous obtained by visualization of the network at the more stringent compounds, drugs, and other xenobiotics. They are ubiquitous in nature (except for archaea) and may represent as much as e-value cutoff, shown in Fig. 3B. 
In this view, most of the char- 0.01% of the enzyme universe. Based on sequence similarities, acterized families within the subgroup have separated into indi- vidual clusters, suggesting that this threshold cutoff may be GSTs have historically been organized into major classes using useful for hypothesizing the boundaries of at least some of the the names of Greek letters (e.g. Alpha, Pi, Omega, Theta, etc.) (41). Within each major class, subclasses designate functional functionally distinct families within it. From this view, we can and other properties. Although a number of GSTs have been predict the specificity of some of the metagenomic sequences that cluster closely with known families, e.g. fuconate and galac- experimentally characterized in terms of their general substrate tonate dehydratases. The perspective provided in Fig. 3B also profiles, the physiological substrates and reaction specificities of only a small minority are known. Still, because of their lends support to the hypothesis that the separated clusters pop- importance to human biology and health, GSTs are among the ulated only by gut metagenomic sequences and other unchar- TM acterized sequences from the GenBank Data Bank may best studied of enzyme superfamilies, with thousands of publi- indeed represent new carbon sources not previously identified cations detailing their biological roles and structural and func- tional properties. as members of the enolase superfamily. Finally, the addition of these metagenomic sequences to the networks helps to fill out the sequence space representing the acid-sugar dehydratases H. J. Atkinson and P. C. Babbitt, unpublished data. JANUARY 2, 2012• VOLUME 287 • NUMBER 1 JOURNAL OF BIOLOGICAL CHEMISTRY 39 MINIREVIEW: Functional Inference in Enzyme Superfamilies FIGURE 3. Sequence similarity networks of acid-sugar dehydratases known or predicted to belong to enolase superfamily and human gut micro- biome. 
Networks were generated from all-by-all BLAST comparisons of 1578 sequences representing sequences of eight known acid-sugar dehydratase families and the mandelate racemase family from the mandelate racemase subgroup (see Footnote 5) as defined by SFLD and a filtered set of gut metagenome sequences that showed significant similarity to the members of the subgroup. Each of the 1578 nodes represents a sequence. Larger square nodes represent those that have been experimentally characterized, so their reaction and substrate specificities are known. Brown nodes represent sequences from the human gut metagenome, and white nodes represent SFLD sequences in the subgroup for which the reaction and substrate specificities have not been predicted. The remainder (small nodes) represent sequences for which specificity can be predicted at high confidence, colored by their SFLD family names (see Footnote 4). Nodes were arranged using the yFiles organic layout provided with Cytoscape version 2.7. A, each edge in the network represents a BLAST connection with an e-value of 1e44 or better. At this cutoff, sequences have a median percent identity and alignment length of32% and 369, respectively. B, each edge in the network represents a BLAST connection with an e-value of 1e84 or better. At this cutoff, sequences have a median percent identity and alignment length of 44% and 384, respectively. Lengths of edges are not meaningful except that sequences in tightly clustered groups are relatively more similar to each other than sequences with few connections. Only a few studies have focused on the GST superfamily on a acterized at any level. Furthermore, the representation of the large scale, however (11, 42, 43). The sequence similarity net- colored nodes in the overall topology suggests that many addi- work shown in Fig. 4 provides an overview of the cytosolic GST tional classes likely remain to be defined. The view provided in superfamily from one of these (42). 
It compares 622 GSTs rep- Fig. 4 thus lays a foundation for choosing new sequences for resenting6000 sequences and shows that they can be divided which functional and structural characterization may be espe- into two major groups distinguished by sequence and structural cially valuable for prediction of new functional classes. Many similarity (and also by variations in their active site features). additional GST sequences have recently been identified, so the The majority of the enzymes in the smaller of the two groups proportion of GSTs for which no functional information is shown in Fig. 4 (Group 1) are from eukaryotic organisms, available continues to increase dramatically. whereas those from the larger group (Group 2) are more mixed, Challenges for Computational Prediction of Functional but with the largest number coming from bacteria. Properties The summary of sequence relationships and structural cov- erage provided in Fig. 4 is the first time that similarity relation- The examples provided in this minireview suggest the value of large-scale analyses such as similarity networks for summa- ships across the entire GST superfamily were captured in a rizing sequence and structural relationships in large superfami- single view. This map shows both the sequences that could be classified as members of one of the major classes (colored nodes) lies and for developing hypotheses about how structure- or as well as those that had not even been assigned to one of these sequence-based clustering tracks with functional boundaries. However, like any other method, similarity networks also have general classes (light and dark gray nodes) and had thus far only some significant limitations, a few of which have been been identified as belonging to the cytosolic GST superfamily. Remarkably, despite decades of study, these results reveal that addressed above and others elsewhere (31). 
Although it is only the huge majority of GSTs have never been functionally char- by experimental investigation that the in vitro and in vivo func- tions of unknowns can ultimately be validated, the continual For network analysis for the GST superfamily, the sequence set was gener- ated, and networks were calculated and visualized as described previously (42). P. C. Babbitt and D. Stryke, unpublished data. 40 JOURNAL OF BIOLOGICAL CHEMISTRY VOLUME 287 • NUMBER 1 •JANUARY 2, 2012 MINIREVIEW: Functional Inference in Enzyme Superfamilies FIGURE 4. Sequence similarity network of cytosolic GSTs. Similarity is defined by pairwise BLAST alignments better than an e-value cutoff of 1e12. 622 representative sequences that are a maximum of 40% identical and that span the diversity of6000 GSTs are shown. Nodes are colored by classification of the sequence in the Swiss-Prot Database (part of the UniProt Database), if available. The 40 large nodes designate sequences with structures. At this cutoff, edges at this threshold represent alignments with a median 27% identity over 200 residues. This network and legend are adapted from Ref. 42 with permission. Bertalan, M., Batto, J. M., Hansen, T., Le Paslier, D., Linneberg, A., Nielsen, growth of sequence data makes it increasingly difficult for H. B., Pelletier, E., Renault, P., Sicheritz-Ponten, T., Turner, K., Zhu, H., either focused or high-throughput experimental studies to keep Yu, C., Li, S., Jian, M., Zhou, Y., Li, Y., Zhang, X., Li, S., Qin, N., Yang, H., up. Even a reasonable fallback position requires the develop- Wang, J., Brunak, S., Doré, J., Guarner, F., Kristiansen, K., Pedersen, O., ment of new strategies for identifying the few experiments that Parkhill, J., Weissenbach, J., Bork, P., Ehrlich, S. D., and Wang, J. (2010) could be most useful for validation of large-scale computational Nature 464, 59–65 predictions. As illustrated here and elsewhere (44, 45), protein 4. Roberts, R. J., Chang, Y. C., Hu, Z., Rachlin, J. 
N., Anton, B. P., Pokrzywa, similarity networks represent one way to generate the context R. M., Choi, H. P., Faller, L. L., Guleria, J., Housman, G., Klitgord, N., Mazumdar, V., McGettrick, M. G., Osmani, L., Swaminathan, R., Tao, needed for choosing those experiments and interpreting the K. R., Letovsky, S., Vitkup, D., Segrè, D., Salzberg, S. L., Delisi, C., Steffen, results. M., and Kasif, S. (2011) Nucleic Acids Res. 39, D11–D14 REFERENCES 5. Bateman, A. (2010) Bioinformatics 26, 991 6. Raes, J., Harrington, E. D., Singh, A. H., and Bork, P. (2007) Curr. Opin. 1. UniProt Consortium (2011) Nucleic Acids Res. 39, D214–D219 Struct. Biol. 17, 362–369 2. Dutta, S., Burkhardt, K., Young, J., Swaminathan, G. J., Matsuura, T., Hen- 7. Hsiao, T. L., Revelles, O., Chen, L., Sauer, U., and Vitkup, D. (2010) Nat. rick, K., Nakamura, H., and Berman, H. M. (2009) Mol. Biotechnol 42, 1–13 Chem. Biol. 6, 34–40 3. Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K. S., Manichanh, C., 8. Schnoes, A. M., Brown, S. D., Dodevski, I., and Babbitt, P. C. (2009) PLoS Nielsen, T., Pons, N., Levenez, F., Yamada, T., Mende, D. R., Li, J., Xu, J., Li, Comput. Biol. 5, e1000605 S., Li, D., Cao, J., Wang, B., Liang, H., Zheng, H., Xie, Y., Tap, J., Lepage, P., 9. Gerlt, J. A., and Babbitt, P. C. (2001) Annu. Rev. Biochem. 70, 209–246 JANUARY 2, 2012• VOLUME 287 • NUMBER 1 JOURNAL OF BIOLOGICAL CHEMISTRY 41 MINIREVIEW: Functional Inference in Enzyme Superfamilies 10. Manning, G., Whyte, D. B., Martinez, R., Hunter, T., and Sudarsanam, S. ner, G. J., Ideker, T., and Bader, G. D. (2007) Nat. Protoc. 2, 2366–2382 (2002) Science 298, 1912–1934 31. Atkinson, H. J., Morris, J. H., Ferrin, T. E., and Babbitt, P. C. (2009) PLoS 11. Atkinson, H. J., and Babbitt, P. C. (2009) PLoS Comput. Biol. 5, e1000541 ONE 4, e4345 12. Babbitt, P. C., Hasson, M. S., Wedekind, J. E., Palmer, D. R., Barrett, W. C., 32. Frishman, D. (2007) Chem. Rev. 107, 3448–3466 Reed, G. H., Rayment, I., Ringe, D., Kenyon, G. 
L., and Gerlt, J. A. (1996) 33. Rentzsch, R., and Orengo, C. A. (2009) Trends Biotechnol. 27, 210–219 Biochemistry 35, 16489–16501 34. Taylor, S. S., and Radzio-Andzelm, E. (1994) Structure 2, 345–355 13. Holm, L., and Sander, C. (1997) Proteins Struct. Funct. Genet. 28, 72–82 35. Kannan, N., Taylor, S. S., Zhai, Y., Venter, J. C., and Manning, G. (2007) 14. Seibert, C. M., and Raushel, F. M. (2005) Biochemistry 44, 6383–6391 PLoS Biol. 5, e17 15. Holden, H. M., Benning, M. M., Haller, T., and Gerlt, J. A. (2001) Acc. 36. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., and Sayers, Chem. Res. 34, 145–157 E. W. (2011) Nucleic Acids Res. 39, D32–D37 16. Mildvan, A. S., Xia, Z., Azurmendi, H. F., Saraswat, V., Legler, P. M., 37. Yooseph, S., Sutton, G., Rusch, D. B., Halpern, A. L., Williamson, S. J., Massiah, M. A., Gabelli, S. B., Bianchet, M. A., Kang, L. W., and Amzel, Remington, K., Eisen, J. A., Heidelberg, K. B., Manning, G., Li, W., Jaro- L. M. (2005) Arch. Biochem. Biophys. 433, 129–143 szewski, L., Cieplak, P., Miller, C. S., Li, H., Mashiyama, S. T., Joachimiak, 17. Burroughs, A. M., Allen, K. N., Dunaway-Mariano, D., and Aravind, L. M. P., van Belle, C., Chandonia, J. M., Soergel, D. A., Zhai, Y., Natarajan, K., (2006) J. Mol. Biol. 361, 1003–1034 Lee, S., Raphael, B. J., Bafna, V., Friedman, R., Brenner, S. E., Godzik, A., 18. Ojha, S., Meng, E. C., and Babbitt, P. C. (2007) PLoS Comput. Biol. 3, e121 Eisenberg, D., Dixon, J. E., Taylor, S. S., Strausberg, R. L., Frazier, M., and 19. Eisen, J. A. (1998) Genome Res. 8, 163–167 Venter, J. C. (2007) PLoS Biol. 5, e16 20. Brown, D. P., Krishnamurthy, N., and Sjölander, K. (2007) PLoS Comput. 38. Scheeff, E. D., and Bourne, P. E. (2005) PLoS Comput. Biol. 1, e49 Biol. 3, e160 39. Finn, R. D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J. E., Gavin, 21. Scheer, M., Grote, A., Chang, A., Schomburg, I., Munaretto, C., Rother, O. 
L., Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, M., Söhngen, C., Stelzer, M., Thiele, J., and Schomburg, D. (2011) Nucleic E. L., Eddy, S. R., and Bateman, A. (2010) Nucleic Acids Res. 38, Acids Res. 39, D670–D676 D211–D222 22. Gariev, I. A., and Varfolomeev, S. D. (2006) Bioinformatics 22, 2574–2576 40. Dethlefsen, L., and Relman, D. A. (2011) Proc. Natl. Acad. Sci. U.S.A. 108, 23. Holliday, G. L., Almonacid, D. E., Bartlett, G. J., O’Boyle, N. M., Torrance, 4554–4561 J. W., Murray-Rust, P., Mitchell, J. B., and Thornton, J. M. (2007) Nucleic 41. Mannervik, B., Board, P. G., Hayes, J. D., Listowsky, I., and Pearson, W. R. Acids Res. 35, D515–D520 (2005) Methods Enzymol. 401, 1–8 24. Nagano, N. (2005) Nucleic Acids Res. 33, D407–D412 42. Atkinson, H. J., and Babbitt, P. C. (2009) Biochemistry 48, 11108–11116 25. Pegg, S. C., Brown, S. D., Ojha, S., Seffernick, J., Meng, E. C., Morris, J. H., 43. Pearson, W. R. (2005) Methods Enzymol. 401, 186–204 Chang, P. J., Huang, C. C., Ferrin, T. E., and Babbitt, P. C. (2006) Biochem- 44. Hicks, M. A., Barber, A. E. I., Giddings, L. A., Caldwell, J., O’Connor, S. E., istry 45, 2545–2555 and Babbitt, P. C. (2011) Proteins Struct. Funct. Genet. 79, 3082–3098 26. Enright, A. J., and Ouzounis, C. A. (2001) Bioinformatics 17, 853–854 45. Pieper, U., Chiang, R., Seffernick, J. J., Brown, S. D., Glasner, M. E., Kelly, L., 27. Frickey, T., and Lupas, A. (2004) Bioinformatics 20, 3702–3704 Eswar, N., Sauder, J. M., Bonanno, J. B., Swaminathan, S., Burley, S. K., 28. Huttenhower, C., Mehmood, S. O., and Troyanskaya, O. G. (2009) BMC Zheng, X., Chance, M. R., Almo, S. C., Gerlt, J. A., Raushel, F. M., Jacobson, Bioinformatics 10, 417 M. P., Babbitt, P. C., and Sali, A. (2009) J. Struct. Funct. Genomics 10, 29. Adai, A. T., Date, S. V., Wieland, S., and Marcotte, E. M. (2004) J. Mol. Biol. 107–125 340, 179–190 30. Cline, M. S., Smoot, M., Cerami, E., Kuchinsky, A., Landys, N., Workman, 46. Zhu, J., and Weng, Z. 
(2005) Proteins 58, 618–627 C., Christmas, R., Avila-Campilo, I., Creech, M., Gross, B., Hanspers, K., 47. Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, Isserlin, R., Kelley, R., Killcoyne, S., Lotia, S., Maere, S., Morris, J., Ono, K., W., and Lipman, D. J. (1997) Nucleic Acids Res. 25, 3389–3402 Pavlovic, V., Pico, A. R., Vailaya, A., Wang, P. L., Adler, A., Conklin, B. R., 48. Gerlt, J. A., Babbitt, P. C., Jacobson, M. P., and Almo, S. C. (2012) J Biol. Hood, L., Kuiper, M., Sander, C., Schmulevich, I., Schwikowski, B., War- Chem. 287, 29–34 42 JOURNAL OF BIOLOGICAL CHEMISTRY VOLUME 287 • NUMBER 1 •JANUARY 2, 2012
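
The thresholding step described for the sequence similarity networks of Fig. 3, in which an edge is drawn only when a pairwise BLAST e-value is at or better than a chosen cutoff, can be illustrated with a minimal Python sketch. This is not the authors' pipeline (the networks in this minireview were built with BLAST and visualized in Cytoscape); the sequence names and e-values below are hypothetical toy data, and only the two cutoffs (1e-44 and 1e-84) are taken from the Fig. 3 legend. Clusters are approximated here as connected components, which is an assumption made for illustration.

```python
from collections import defaultdict


def build_network(hits, evalue_cutoff):
    """Build an undirected adjacency map from (query, subject, e-value) hits.

    An edge is kept only if the e-value is at or below the cutoff.
    Every sequence is registered as a node, even if it has no edges.
    """
    adjacency = defaultdict(set)
    for query, subject, evalue in hits:
        adjacency[query]    # touch the key so isolated nodes are retained
        adjacency[subject]
        if query != subject and evalue <= evalue_cutoff:
            adjacency[query].add(subject)
            adjacency[subject].add(query)
    return dict(adjacency)


def clusters(adjacency):
    """Return connected components (as sets of node names), largest first."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical all-by-all hits: two characterized families and two
# metagenome sequences (names and scores are invented for illustration).
TOY_HITS = [
    ("famA_1", "famA_2", 1e-120),
    ("famA_2", "famA_3", 1e-95),
    ("famB_1", "famB_2", 1e-110),
    ("famA_3", "famB_1", 1e-50),   # weak link between the two families
    ("gut_1", "gut_2", 1e-90),
    ("gut_1", "famB_2", 1e-10),    # too weak for either cutoff
]

permissive = clusters(build_network(TOY_HITS, 1e-44))
stringent = clusters(build_network(TOY_HITS, 1e-84))

print([sorted(c) for c in permissive])
print([sorted(c) for c in stringent])
```

In this toy example, the permissive cutoff joins the two characterized families into one cluster, whereas the stringent cutoff separates them, mirroring the behavior described for Fig. 3A versus Fig. 3B. The two metagenome sequences remain a separate cluster at both cutoffs, the pattern the text interprets as a candidate for uncharacterized function.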

Journal

Journal of Biological Chemistry – American Society for Biochemistry and Molecular Biology

Published: Jan 2, 2012

Keywords: Bioinformatics; Computational Biology; Enzyme Structure; Enzymes; Protein Evolution; Enzyme Superfamily; Functional Inference; Protein Similarity Network

