MODBASE, a database of annotated comparative protein structure models and associated resources

Nucleic Acids Research, 2009, Vol. 37, Database issue D347–D354
doi:10.1093/nar/gkn791
Published online 23 October 2008

Ursula Pieper¹, Narayanan Eswar¹, Ben M. Webb¹, David Eramian¹,², Libusha Kelly¹,³, David T. Barkan¹,³, Hannah Carter⁴, Parminder Mankoo⁴, Rachel Karchin⁴, Marc A. Marti-Renom⁵, Fred P. Davis⁶ and Andrej Sali¹,*

¹Department of Bioengineering and Therapeutic Sciences, Department of Pharmaceutical Chemistry, and California Institute for Quantitative Biosciences, Byers Hall at Mission Bay, Office 503B, University of California at San Francisco, 1700 4th Street, San Francisco, CA 94158, ²Graduate Group in Biophysics, ³Graduate Group in Bioinformatics, University of California at San Francisco, CA, ⁴Department of Biomedical Engineering, Institute for Computational Medicine, Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218, USA, ⁵Structural Genomics Unit, Bioinformatics & Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Avda. Autopista del Saler 16, Valencia 46012, Spain and ⁶Howard Hughes Medical Institute, Janelia Farm, 19700 Helix Drive, Ashburn, VA 20147, USA

Received September 15, 2008; Accepted October 8, 2008

*To whom correspondence should be addressed. Tel: +1 415 514 4227; Fax: +1 415 514 4231; Email: [email protected]

© 2008 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

MODBASE (http://salilab.org/modbase) is a database of annotated comparative protein structure models. The models are calculated by MODPIPE, an automated modeling pipeline that relies primarily on MODELLER for fold assignment, sequence–structure alignment, model building and model assessment (http://salilab.org/modeller). MODBASE currently contains 5 152 695 reliable models for domains in 1 593 209 unique protein sequences; only models based on statistically significant alignments and/or models assessed to have the correct fold are included. MODBASE also allows users to calculate comparative models on demand, through an interface to the MODWEB modeling server (http://salilab.org/modweb). Other resources integrated with MODBASE include databases of multiple protein structure alignments (DBAli), structurally defined ligand binding sites (LIGBASE), predicted ligand binding sites (AnnoLyze), structurally defined binary domain interfaces (PIBASE) and annotated single nucleotide polymorphisms and somatic mutations found in human proteins (LS-SNP, LS-Mut). MODBASE models are also available through the Protein Model Portal (http://www.proteinmodelportal.org/).

INTRODUCTION

The genome sequencing efforts are providing us with complete genetic blueprints for hundreds of organisms, including humans. We are now faced with the challenge of assigning, investigating and modifying the functions of proteins encoded by these genomes. This task is generally facilitated by 3D structures of the proteins (1–3), which are best determined by experimental methods such as X-ray crystallography and NMR spectroscopy. The number of experimentally determined structures deposited in the Protein Data Bank (PDB) more than doubled from 23 096 to 52 821 over the last 5 years (September 2008) (4). However, the number of sequences in comprehensive sequence databases, such as UniProt (5) and GenPept (6), continues to grow even more rapidly than the number of known protein structures; for example, the number of sequences in UniProt increased from 1.2 million to 6.4 million over the same period. Therefore, protein structure prediction is essential for structural characterization of sequences without experimentally determined structures.

The most accurate models are generally obtained by homology or comparative modeling (7–10), which is applicable when an experimentally determined structure related to the target sequence is available. The fraction of sequences in a genome for which comparative models can be obtained automatically varies from 20% to 75% (11).

The process of comparative modeling usually requires the use of a number of programs to identify template structures, to generate sequence–structure alignments, to build the models and to evaluate them. In addition, various sequence and structure databases that are accessed by these programs are needed. Once an initial model is calculated, it is generally refined and ultimately analyzed in the context of many other related proteins and their functional annotations. Here, we describe MODBASE, a database of comparative protein structure models, and several associated databases and servers that facilitate modeling and analysis tasks for both expert and novice users. We highlight the improvements of MODBASE that were implemented since the last report (11), including updates in the modeling software, user interface and associated annotation tools. We also illustrate the utility of MODBASE by describing several projects depending on large model sets.

CONTENTS

Comparative modeling (MODELLER and MODPIPE)

Models in MODBASE are calculated using MODPIPE, our automated software pipeline for comparative modeling (12). It relies primarily on the various modules of MODELLER (13) for its functionality and is adapted for large-scale operation on a cluster of PCs using scripts written in PERL and Python. Sequence–structure matches are established using a variety of fold-assignment methods, including sequence–sequence (14), profile–sequence (15,16) and profile–profile alignments (16,17). The odds of finding a template structure are increased by using an E-value threshold of 1.0. By default, 10 models are calculated for each of the alignments (13). A representative model for each alignment is then chosen by ranking based on the atomic distance-dependent statistical potential DOPE (18). Finally, the fold of each model is evaluated using a composite model quality criterion that includes the coverage of the modeled sequence, sequence identity implied by the sequence–structure alignment, the fraction of gaps in the alignment, the compactness of the model and various statistical potential Z-scores (18–20). Only models that are assessed to have the correct fold are included in the final model sets.

A key feature of the pipeline is not prejudging the validity of sequence–structure relationships at the fold-assignment stage; instead, sequence–structure matches are assessed after the construction of the models and their evaluation. This approach enables a thorough exploration of fold assignments, sequence–structure alignments and conformations, with the aim of finding the model with the best evaluation score.

Comparative modeling web server (MODWEB)

MODWEB is our comparative modeling web server that is an integral module of MODBASE (http://salilab.org/modweb) (12). MODWEB accepts one or more sequences in the FASTA format and calculates their models using MODPIPE based on the best available templates from the PDB. Alternatively, MODWEB also accepts a protein structure as input, calculates a profile for each identifiable sequence homolog in the UniProt database, followed by modeling these homologs based on detectable templates in the PDB as well as the user-provided structure. Finally, MODWEB proposes a representative model based on model assessment. This module is a useful tool for measuring the impact of new structures, such as those generated by structural genomics efforts (21). The module allows us to assess the impact of a newly determined protein structure on the modeling of sequences of unknown structure. It is also used to identify new members of sequence superfamilies with at least one member of known structure. The results of MODWEB calculations are available to the users through the MODBASE interface as private datasets protected with passwords.

Pairwise and multiple structure alignments (DBAli)

DBAli (http://www.dbali.org/) stores pairwise comparisons of all structures in the PDB calculated using the program MAMMOTH (22), as well as multiple structure alignments generated by the SALIGN module of MODELLER-9 (23). DBAli contains approximately 1.7 billion pairwise comparisons and 12 732 family-based multiple structure alignments for 34 637 nonredundant protein chains out of 96 804 protein chains in the PDB. Additional information is provided by ModDom that assigns domain boundaries from structure and ModClus that allows the user to generate clusters of similar protein structures. These DBAli tools help users to analyze the protein structure space by establishing relationships between protein structures and their fragments in a flexible and dynamic manner.

Ligand binding sites (LIGBASE and AnnoLyze)

The LIGBASE module stores a list of the binding sites of known structure for approximately 230 000 ligands found in the PDB (24). The ligands include small molecules, such as metal ions, nucleotides, saccharides and peptides. Binding sites in all known structures are defined to consist of residues with at least one atom within 5 Å of any ligand atom. For each template structure, MODBASE also contains a list of putative binding sites that were predicted by the AnnoLyze program (25). The predictions are based on inheriting an actual binding site from any related known structure if at least 75% of the binding site residues are within 4 Å of the template residues in a global superposition of the two structures in DBAli and if at least 75% of the binding site residue types are invariant. In addition, the putative ligand binding sites in the models are then mapped via the target–template alignments. The putative ligand binding sites are stored as SITE records and the binding site membership frequency per residue is indicated in the B-factor column of the model coordinate files. Sixty-five percent of MODBASE models have at least one predicted binding site.

Protein interactions (PIBASE)

PIBASE (http://pibase.janelia.org, http://salilab.org/pibase) is a comprehensive database of structurally defined protein interfaces (26). It is composed of binary interfaces between pairs of chains or domains extracted from structures in the PDB and the Probable Quaternary Structure server PQS using domain assignments from the Structural Classification of Proteins and CATH fold classification systems. PIBASE currently contains 269 821 SCOP, 269 438 CATH, and 216 739 chain binary interfaces. A diverse set of geometrical, physiochemical and topological properties are calculated for each complex, its domains, interfaces and binding sites. The database is accessible through the web server and can also be installed locally. The software used to build PIBASE is available for download under an open-source license.

PIBASE is a convenient resource for structural information on protein–protein interactions and is easily integrated with other databases. It is currently used by the AnnoLyze annotation program (27) and the LS-SNP annotation system (28). The complexes stored in PIBASE can also be used as templates to predict the composition and structure of protein complexes using comparative modeling followed by an assessment of the modeled interface (29). This approach was applied to predict host–pathogen interactions for 10 'neglected' human pathogens (30).

Single nucleotide polymorphisms and somatic mutations (LS-SNP and LS-Mut)

LS-SNP [http://karchinlab.org/LS-SNP, http://salilab.org/LS-SNP (28)] and LS-Mut [http://karchinlab.org/LS-Mut, (31,32)] are collections of annotated DNA sequence variants in protein-coding exons that result in an amino acid residue-type substitution. These resources focus on inherited genetic variants and tumor-derived somatic mutations, respectively. For LS-SNP, genomic locations of the variants are taken from the dbSNP database (33) and are mapped onto as many human proteins in the UniProt database (34) as possible. The mapping is achieved via a collection of protein-to-mRNA and mRNA-to-genome alignments produced with the Known Genes algorithm (35). For LS-Mut, somatic mutation data from tumor sequencing projects are used, consisting of transcript identifiers from RefSeq, CCDS and Ensembl (36,37), codon positions and amino acid residue-type substitutions. Our software then maps the mutations onto translated protein sequences. LS-Mut currently includes mutations from 24 advanced pancreatic cancers and 22 glioblastoma multiforme (brain) tumors. For both LS-SNP and LS-Mut, human protein sequences are aligned with homologous proteins of known structure from the PDB, to build comparative protein structure models using MODPIPE. Models are constructed for all significant alignments covering a distinct region of protein sequence (E-value cutoff 0.0001). UCSF Chimera (38) is used to visualize the location of the residue substitutions on the model. We use our software and DSSP (39) to identify secondary structure elements and relative solvent accessibility of the residue positions. Putative protein and small ligand binding sites on the models are annotated with PIBASE and the LIGBASE module of MODBASE, respectively, to infer which SNPs or somatic mutations may destabilize protein quaternary structure or interfere with small molecule ligand binding.

MODBASE MODEL SETS

Models in MODBASE are organized into a number of datasets. The largest dataset contains models of all sequences in the UniProt database that are detectably related to at least one known structure in the PDB from July 2005. Because of the rapid growth of the public sequence databases, we now concentrate our efforts on adding datasets that are useful for specific projects, rather than attempt to model all known protein sequences with detectable template structures. Currently, MODBASE includes datasets of nine archaeal genomes, 13 bacterial genomes and 18 eukaryotic genomes (Table 1). Together with other project-oriented datasets, MODBASE currently contains 5 152 695 models from domains in 1 593 209 unique sequences. Next, we illustrate the utility of MODBASE by outlining several recent projects.

Table 1. MODBASE datasets

Dataset/Project | Taxonomy ID | No. of Transcripts | No. of Sequences modeled | No. of Models | Sequence source
Genomes ( genomes for the TDI)
Archaea
Archaeoglobus fulgidus | 2234 | 2409 | 1794 | 3980 | NCBI
Methanococcus jannaschii | 2190 | 1785 | 1480 | 1707 | NCBI
Nanoarchaeum equitans | 160 232 | 536 | 447 | 496 | NCBI
Picrophilus torridus | 82 076 | 1535 | 1260 | 2902 | NCBI
Pyrobaculum aerophilum | 13 773 | 2600 | 1566 | 3497 | NCBI
Pyrococcus furiosus | 2261 | 2113 | 1524 | 3373 | NCBI
Sulfolobus solfataricus | 2287 | 2922 | 2006 | 4451 | NCBI
Thermoplasma volcanium | 50 339 | 1497 | 1204 | 2806 | NCBI
Thermoplasma acidophilum |  | 1480 | 1220 | 2801 | NCBI
Bacteria
Bacillus subtilis | 1423 | 4105 | 3374 | 9245 | NCBI
Burkholderia mallei | 13 373 | 4798 | 3910 | 23 219 | NCBI
Clostridium tetani | 1513 | 2413 | 2158 | 5864 | NCBI
Escherichia coli | 562 | 4206 | 3150 | 5994 | NCBI
Mycobacterium leprae | 1769 | 1605 | 1178 | 2493 | OrthoMCL-DB
Mycobacterium tuberculosis | 1773 | 3991 | 2808 | 5913 | TubercuList
Mycoplasma pneumoniae | 2104 | 687 | 426 | 857 | NCBI
Pseudomonas aeruginosa | 287 | 5559 | 3806 | 9222 | NCBI
Rickettsia prowazekii | 782 | 835 | 754 | 2136 | NCBI
Staphylococcus aureus MRSA252 | 282 458 | 2635 | 1184 | 3161 | NCBI
Streptococcus pyogenes | 1314 | 1691 | 1440 | 3984 | NCBI
Wolbachia | 953 | 805 | 621 | 1873 | TIGR
Yersinia pestis | 632 | 3882 | 3215 | 8371 | NCBI
Eukaryota
Arabidopsis thaliana | 3702 | 30 707 | 23 807 | 70 494 | ENSEMBL
Brugia malayi | 6279 | 11 397 | 7850 | 23 219 | TIGR
Caenorhabditis elegans | 6239 | 22 698 | 18 996 | 52 235 | NCBI
Canis familiaris | 9615 | 30 264 | 22 614 | 65 617 | ENSEMBL
Cryptosporidium hominis | 237 895 | 3886 | 1614 | 3287 | CryptoDB
Cryptosporidium parvum | 5807 | 3806 | 1918 | 3969 | CryptoDB
Danio rerio |  | Calculation in progress |  |  | ENSEMBL
Drosophila melanogaster | 7227 | 17 104 | 9381 | 24 683 | NCBI
H.sapiens | 9606 | 32 010 | 21 270 | 51 084 | OrthoMCL-DB
Leishmania major | 5664 | 8274 | 3975 | 8285 | GeneDB
Mus musculus | 10 090 | 30 133 | 25 338 | 70 783 | NCBI
Pan troglodytes |  | Calculation in progress |  |  | ENSEMBL
Plasmodium falciparum | 5833 | 5363 | 2599 | 5053 | PlasmoDB
Plasmodium vivax | 5855 | 5342 | 2359 | 4670 | PlasmoDB
Rattus norvegicus |  | Calculation in progress |  |  | ENSEMBL
Saccharomyces cerevisiae | 4932 | 6600 | 3035 | 5543 | NCBI
Schistosoma mansoni | 6183 | 25 304 | 8576 | 26 076 | GeneDB
Toxoplasma gondii | 5811 | 7793 | 1530 | 3064 | ToxoDB
Trypanosoma brucei | 5691 | 9210 | 3900 | 8054 | GeneDB
Trypanosoma cruzi | 5693 | 19 607 | 7390 | 14 858 | GeneDB
Xenopus laevis | 8355 | 27 952 | 25 457 | 69 191 | NCBI
Selected projects
CSMP datasets |  | 195 235 | 184 139 | 690 255 | GENPEPT NR
NYSGXRC datasets |  | 553 537 | 493 672 | 1 415 237 | GENPEPT NR
Enzyme Specificity Project |  | 15 833 | 10 875 | 183 591 | SFLD/NR
ABC Transporter |  | 152 | 85 | 85 |
GPCR |  | 11 586 | 11 551 | 24 272 | UNIPROT
Datasets 2005 |  | 1 742 816 | 1 025 196 | 2 146 830 | UNIPROT
Total (including other datasets) |  | 2 608 987 | 1 593 209 | 5 152 695 |

The sequences were retrieved from ENSEMBL (36), TIGR (50), NCBI-Genbank (6), OrthoMCL-DB (51), TubercuList (52), CryptoDB (53), GeneDB (54), ToxoDB (55), SFLD (56) and UniProt (34).

Structural genomics of the enolase and amidohydrolase superfamilies

Comparative models of enzymes in the amidohydrolase and enolase superfamilies have contributed to studying their substrate specificity by the Enzyme Specificity Consortium (ENSPEC) as well as selecting targets for a structural genomics effort by the New York SGX Research Center for Structural Genomics (NYSGXRC). In particular, we selected 535 target proteins from 130 genomes for high-throughput structure determination by X-ray crystallography, resulting in 61 unique structures thus far. Both template-based modeling and sequence-based modeling were essential in identifying suitable targets.

Structural genomics of membrane proteins

Comparative modeling was also applied to inform target selection for the structural genomics of membrane proteins as part of the Center for Structures of Membrane Proteins (CSMP) at UCSF (40). The goal of CSMP is to express, purify and determine the structures of representative members of integral membrane protein classes. MODBASE models were combined with an interactive web-based target selection tool to facilitate selection of biologically interesting targets with little or no structural data available. In addition, template-based modeling in MODWEB is being used to calculate how many sequences can be modeled based on newly determined CSMP structures.

ABC Transporters

ABC transporters are a large and diverse set of integral membrane proteins that couple the action of ATP binding, hydrolysis and release to substrate transport across a cellular membrane (41). Mutations in 13 of the 48 human ABC transporters are associated with monogenic human disease phenotypes (42). Additional variants are being identified in hundreds of individuals by the Pharmacogenomics of Membrane Transporters (PMT) consortium at UCSF (43). To annotate these variants, we modeled nucleotide binding and membrane spanning domains with detectably related template structures in all human ABC transporters. The dataset also includes models of sequences with disease-associated and polymorphic non-synonymous SNPs found in the nucleotide binding domains. Finally, the incomplete or unsatisfactory modeling coverage was used to suggest specific targets for a structural genomics effort on ABC transporters by CSMP.

Human caspases

Caspases are cysteine proteases involved in multiple apoptotic pathways. An experimental approach was recently developed to identify caspase substrates by biotinylating natural protein N-termini and selecting protein fragments containing unblocked α-amines characteristically generated upon proteolytic cleavage (44). Likely high-accuracy models of protein substrates prior to cleavage were identified in the MODBASE human genome datasets and analysis of the structural properties of the cleavage sites was performed. While these sites often appeared in disordered, solvent accessible regions of the substrate as expected (45), a surprising number were found in α-helices and partially inaccessible regions, information which can now be incorporated into new algorithms for predicting additional caspase substrates.

Binding sites and ligands for the tropical disease initiative

Open source drug discovery is an alternative avenue to conventional patent-based drug development, illustrated by the proposed Tropical Disease Initiative (TDI) (http://tropicaldisease.org) (46). Open source drug discovery involves a decentralized, web-based and community-wide collaboration, in which scientists from laboratories, universities, institutes and corporations volunteer to work together for a common cause. To contribute to this effort, we calculated comparative protein structure models for 10 genomes of organisms that cause 'neglected' tropical diseases (Table 1). We followed up by predicting binding sites for known drugs using the AnnoLyze program (25). These predictions may be used as a starting point for experimentally testing the biological functions of the target proteins and potentially even as leads for drug discovery.

Host–pathogen protein interactions for TDI

Pathogens have evolved numerous strategies to infect their hosts, while hosts have evolved immune responses and other defenses to these foreign challenges. The vast majority of host–pathogen interactions involve protein–protein recognition, yet our current understanding of these interactions is limited. We developed and applied a computational whole-genome protocol that generates testable predictions of host–pathogen protein interactions (30) (http://salilab.org/hostpathogen). The protocol first scans the host and pathogen genomes for proteins with similarity to known protein complexes, then assesses these putative interactions, using structure if available, and, finally, filters the remaining interactions using biological context, such as the stage-specific expression of pathogen proteins and tissue expression of host proteins. The technique was applied to 10 pathogens, using their MODBASE model datasets. Several specific predictions have been made that warrant experimental follow-up, including interactions from previously characterized mechanisms, such as cytoadhesion and protease inhibition, as well as suspected interactions in hypothesized networks, such as apoptotic pathways.

G-Protein Coupled receptors

G-protein coupled receptors (GPCR) are a large family of pharmacologically important transmembrane receptors that are involved in the recognition of a wide variety of extra-cellular ligands. It has been estimated that this family of proteins is the target for about half of all currently marketed drugs. Atomic structures are known for only three sub-families of GPCRs, including light-sensitive rhodopsins, β1 and β2 adrenergic receptors that all belong to the Class A Rhodopsin-like family (GPCRDB nomenclature). The GPCR dataset in MODBASE consists of models for approximately 12 000 UniProt sequences that are related to one of these structures. The models span several sub-families of the Class A Rhodopsin-like family, including aminergic, peptide, hormone, opsin, olfactory and nucleotide receptors. These models are used for ligand docking and virtual screening computations by DOCK (47).

ACCESS AND INTERFACE

The main access to MODBASE is through its web interface at http://salilab.org/modbase, by querying with UniProt and GI identifiers, gene names, annotation keywords, PDB codes, datasets, organisms, sequence similarity to the modeled sequences (BLAST) and model-specific criteria such as model reliability, model size and target–template sequence identity. Additionally, it is possible to retrieve coordinate files, alignment files and ligand-binding information in text files. Select genome datasets are also available from our ftp server (ftp://salilab.org/databases/modbase/projects).

The output of a search is displayed on pages with varying amounts of information about the modeled sequences, template structures, alignments and functional annotations. An example of the output from a search resulting in one model is shown in Figure 1. A ribbon diagram of the model with the highest target–template sequence identity is displayed by default, together with details of the modeling calculation. Ribbon thumbprints of additional models for this sequence link to corresponding pages with more information. The ribbon diagrams are generated on the fly using Molscript (48) and Raster3D (49). A pull-down menu provides links to additional functionality: the ligand-binding module, the SNP module, retrieval of coordinate and alignment files, as well as molecular visualization by Chimera that allows the user to display template and model coordinates together with their alignment. If mutation information is available for a protein sequence, links to the details are provided in the cross-references section. Additionally, cross-references to various other databases, including PDB, UniProt, SwissProt/TrEMBL, PubMed and the UCSC Genome Browser, are given. Other MODBASE pages provide overviews of more than one sequence or structure. All MODBASE pages are interconnected to facilitate easy navigation between different views.

Figure 1. MODBASE Model Details page (Example Q9NP58 from the human genome dataset): this page provides links to all models for this specific sequence. A ribbon diagram of the primary model, database annotations and modeling details are displayed. Links to additional models for different target regions or models from other datasets are displayed as thumbprints. The pull-down menu provides access to alternative MODBASE views and other types of information (if available), such as data about mutations and putative ligand binding sites. The cross-references section contains links to relevant internal and external databases. For this particular sequence, mutation data are available from LS-Mut, LS-SNP and ABC SNPs.

Access through external databases

MODBASE models in academic and public datasets are directly accessible from several other databases, including the SwissProt/TrEMBL sequence pages, UniProt, PIR's iProClass, EBI's InterPro, the UCSC Genome Browser and PubMed (LinkOut). Importantly, MODBASE models are also accessible through the Protein Model Portal (http://proteinmodelportal.org), a module of the Protein Structure Initiative Knowledgebase (PSI KB). The Model Portal has the potential to become the single entry point for users interested in experimentally determined or computationally predicted models. For a user query, the portal will interrogate participating source model databases and modeling servers to provide a comprehensive view of all available models of the query sequence.

FUTURE DIRECTIONS

MODBASE will grow by adding models calculated on demand by external users (using MODWEB) as well as our own calculations of model datasets that are needed for our research projects (using MODPIPE, MODWEB or MODELLER). These updates will reflect improvements in the methods and software used for calculating the models as well as the new template structures in the PDB and new sequences in UniProt. In the future, we expect that most of the users will access MODBASE models through the Protein Model Portal.

CITATION

Users of MODBASE are requested to cite this article in their publications.

ACKNOWLEDGEMENTS

We are grateful to Tom Ferrin, Daniel Greenblatt, Conrad Huang and Tom Goddard for CHIMERA and contributing to the MODBASE/CHIMERA interface. For linking to MODBASE from their databases, we thank Torsten Schwede (Protein Model Portal), David Haussler and Jim Kent (UCSC Genome Browser), Amos Bairoch (SwissProt/TrEMBL), Rolf Apweiler (InterPro), Patsy Babbitt (SFLD) and Cathy Wu (PIR/iProClass). We are also grateful for computing hardware gifts from Mike Homer, Ron Conway, NetApp, IBM, Hewlett Packard and Intel.

FUNDING

National Institutes of Health (R01 GM54762, U54 GM074945, U54 GM074929, U01 GM61390, P01 GM71790 to A.S., GM08284 to D.E., NSF EF 0626651); the Sandler Family Supporting Foundation (to A.S.); Susan G. Komen Foundation (KG080137 to R.K.); Spanish Ministerio de Educacion y Ciencia (BIO2007/66670 to M.A.M-R). Funding for open access charge: U54 GM074945.

REFERENCES

1. Domingues,F.S., Koppensteiner,W.A. and Sippl,M.J. (2000) The role of protein structure in genomics. FEBS Lett., 476, 98–102.
2. Brenner,S.E. and Levitt,M. (2000) Expectations from structural genomics. Protein Sci., 9, 197–200.
3. Skolnick,J., Fetrow,J.S. and Kolinski,A. (2000) Structural genomics and its importance for gene function analysis. Nat. Biotechnol., 18, 283–287.
4. Deshpande,N., Addess,K.J., Bluhm,W.F., Merino-Ott,J.C., Townsend-Merino,W., Zhang,Q., Knezevich,C., Xie,L., Chen,L., Feng,Z. et al. (2005) The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema. Nucleic Acids Res., 33, D233–D237.
5. Bairoch,A., Apweiler,R., Wu,C.H., Barker,W.C., Boeckmann,B., Ferro,S., Gasteiger,E., Huang,H., Lopez,R., Magrane,M. et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res., 33, D154–D159.
6. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J. and Wheeler,D.L. (2008) GenBank. Nucleic Acids Res., 36, D25–D30.
7. Baker,D. and Sali,A. (2001) Protein structure prediction and structural genomics. Science, 294, 93–96.
8. Wallner,B. and Elofsson,A. (2005) All are not equal: a benchmark of different homology modeling programs. Protein Sci., 14, 1315–1327.
9. Hillisch,A., Pineda,L.F. and Hilgenfeld,R. (2004) Utility of homology models in the drug discovery process. Drug Discov. Today, 9, 659–669.
10. Eswar,N., Webb,B., Marti-Renom,M.A., Madhusudhan,M.S., Eramian,D., Shen,M.Y., Pieper,U. and Sali,A. (2007) Comparative protein structure modeling using MODELLER. Curr. Protocols Protein Sci./editorial board, John E. Coligan ... et al., Chapter 2, Unit 29.
11. Pieper,U., Eswar,N., Davis,F.P., Braberg,H., Madhusudhan,M.S., Rossi,A., Marti-Renom,M., Karchin,R., Webb,B.M., Eramian,D. et al. (2006) MODBASE: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res., 34, D291–D295.
12. Eswar,N., John,B., Mirkovic,N., Fiser,A., Ilyin,V.A., Pieper,U., Stuart,A.C., Marti-Renom,M.A., Madhusudhan,M.S., Yerkovich,B. et al. (2003) Tools for comparative protein structure modeling and analysis. Nucleic Acids Res., 31, 3375–3380.
13. Sali,A. and Blundell,T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol., 234, 779–815.
14. Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197.
15. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
16. Eswar,N., Webb,B., Marti-Renom,M.A., Madhusudhan,M.S., Eramian,D., Shen,M.Y., Pieper,U. and Sali,A. (2006) Comparative protein structure modeling using Modeller. Curr. Protocols Bioinformatics/editorial board, Andreas D. Baxevanis ... et al., Chapter 5, Unit 56.
17. Marti-Renom,M.A., Madhusudhan,M.S. and Sali,A. (2004) Alignment of protein sequences by their profiles. Protein Sci., 13, 1071–1087.
18. Shen,M.Y. and Sali,A. (2006) Statistical potential for assessment and prediction of protein structures. Protein Sci., 15, 2507–2524.
19. Eramian,D., Shen,M.Y., Devos,D., Melo,F., Sali,A. and Marti-Renom,M.A. (2006) A composite score for predicting errors in protein structure models. Protein Sci., 15, 1653–1666.
20. Melo,F., Sanchez,R. and Sali,A. (2002) Statistical potentials for fold assessment. Protein Sci., 11, 430–448.
21. Chance,M.R., Fiser,A., Sali,A., Pieper,U., Eswar,N., Xu,G., Fajardo,J.E., Radhakannan,T. and Marinkovic,N. (2004) High-throughput computational and experimental techniques in structural genomics. Genome Res., 14, 2145–2154.
22. Ortiz,A.R., Strauss,C.E. and Olmea,O. (2002) MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci., 11, 2606–2621.
23. Marti-Renom,M.A., Ilyin,V.A. and Sali,A. (2001) DBAli: a database of protein structure alignments. Bioinformatics, 17, 746–747.
24. Stuart,A.C., Ilyin,V.A. and Sali,A. (2002) LigBase: a database of families of aligned ligand binding sites in known protein sequences and structures. Bioinformatics, 18, 200–201.
25. Marti-Renom,M.A., Rossi,A., Al-Shahrour,F., Davis,F.P., Pieper,U., Dopazo,J. and Sali,A. (2007) The AnnoLite and AnnoLyze programs for comparative annotation of protein structures. BMC Bioinformatics, 8(Suppl. 4), S4.
26. Davis,F.P. and Sali,A. (2005) PIBASE: a comprehensive database of structurally defined protein interfaces. Bioinformatics, 21, 1901–1907.
27. Marti-Renom,M.A., Pieper,U., Madhusudhan,M.S., Rossi,A., Eswar,N., Davis,F.P., Al-Shahrour,F., Dopazo,J. and Sali,A. (2007) DBAli tools: mining the protein structure space. Nucleic Acids Res., 35, D393–D397.
28. Karchin,R., Diekhans,M., Kelly,L., Thomas,D.J., Pieper,U., Eswar,N., Haussler,D. and Sali,A. (2005) LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics, 21, 2814–2820.
29. Davis,F.P., Braberg,H., Shen,M.Y., Pieper,U., Sali,A. and Madhusudhan,M.S. (2006) Protein complex compositions predicted by structural similarity. Nucleic Acids Res., 34, 2943–2952.
30. Davis,F.P., Barkan,D.T., Eswar,N., McKerrow,J.H. and Sali,A. (2007) Host pathogen protein interactions predicted by comparative modeling. Protein Sci., 16, 2585–2596.
31. Jones,S., Zhang,X., Parsons,D.W., Lin,J.C., Leary,R.J., Angenendt,P., Mankoo,P., Carter,H., Kamiyama,H., Jimeno,A. et al. (2008) Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science, 321, 1801–1806.
32. Parsons,D.W., Jones,S., Zhang,X., Lin,J.C., Leary,R.J., Angenendt,P., Mankoo,P., Carter,H., Siu,I.M., Gallia,G.L. et al. (2008) An integrated genomic analysis of human Glioblastoma multiforme. Science, 321, 1807–1812.
33. Sherry,S.T., Ward,M.H., Kholodov,M., Baker,J., Phan,L., Smigielski,E.M. and Sirotkin,K. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.
34. Wu,C.H., Apweiler,R., Bairoch,A., Natale,D.A., Barker,W.C., Boeckmann,B., Ferro,S., Gasteiger,E., Huang,H., Lopez,R. et al. (2006) Nucleic Acids Res., 34, D187–191.
35. Hsu,F., Kent,W.J., Clawson,H., Kuhn,R.M., Diekhans,M. and Haussler,D. (2006) The UCSC known genes. Bioinformatics, 22, 1036–1046.
36. Flicek,P., Aken,B.L., Beal,K., Ballester,B., Caccamo,M., Chen,Y., Clarke,L., Coates,G., Cunningham,F., Cutts,T. et al. (2008) Ensembl 2008. Nucleic Acids Res., 36, D707–D714.
37. Wheeler,D.L., Barrett,T., Benson,D.A., Bryant,S.H., Canese,K., Chetvernin,V., Church,D.M., DiCuccio,M., Edgar,R., Federhen,S. et al. (2008) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 36, D13–D21.
38. Pettersen,E.F., Goddard,T.D., Huang,C.C., Couch,G.S., Greenblatt,D.M., Meng,E.C. and Ferrin,T.E. (2004) UCSF Chimera--a visualization system for exploratory research and analysis. J. Comput. Chem., 25, 1605–1612.
39. Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637.
40. Li,M., Hays,F.A., Roe-Zurz,Z., Vuong,L., Kelly,L., Robbins,R., Ho,C.M., Pieper,U., O'Connell,J., Miercke,L.J. et al. (2008) Eukaryotic Integral Membrane Protein Production For Structural Genomics. J. Mol. Biol., in press.
41. Dean,M., Rzhetsky,A. and Allikmets,R. (2001) The human ATP-binding cassette (ABC) transporter superfamily. Genome Res., 11, 1156–1166.
42. Hamosh,A., Scott,A.F., Amberger,J.S., Bocchini,C.A. and McKusick,V.A. (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders.
47. Hermann,J.C., Marti-Arbona,R., Fedorov,A.A., Fedorov,E., Almo,S.C., Shoichet,B.K. and Raushel,F.M. (2007) Structure-based activity prediction for an enzyme of unknown function. Nature, 448, 775–779.
48. Kraulis,P.J. (1991) MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Crystallogr., 24, 946–950.
49. Merritt,E.A. and Bacon,D.J. (1997) Raster3D: photorealistic molecular graphics. Methods Enzymol., 277, 505–524.
50. Ghedin,E., Wang,S., Spiro,D., Caler,E., Zhao,Q., Crabtree,J., Allen,J.E., Delcher,A.L., Guiliano,D.B., Miranda-Saavedra,D. et al. (2007) Draft genome of the filarial nematode parasite Brugia malayi. Science, 317, 1756–1760.
51. Chen,F., Mackey,A.J., Stoeckert,C.J. Jr. and Roos,D.S. (2006) OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res., 34, D363–D368.
52. Cole,S.T. (1999) Learning from the genome sequence of Mycobacterium tuberculosis H37Rv. FEBS Lett., 452, 7–10.
53. Heiges,M., Wang,H., Robinson,E., Aurrecoechea,C., Gao,X.,
Kaluskar,N., Rhodes,P., Wang,S., He,C.Z., Su,Y. et al. (2006) Nucleic Acids Res., 33, D514–D517. CryptoDB: a Cryptosporidium bioinformatics resource update. 43. Leabman,M.K., Huang,C.C., DeYoung,J., Carlson,E.J., Nucleic Acids Res., 34, D419–D422. Taylor,T.R., de la Cruz,M., Johns,S.J., Stryke,D., Kawamoto,M., 54. Hertz-Fowler,C., Peacock,C.S., Wood,V., Aslett,M., Kerhornou,A., Urban,T.J. et al. (2003) Natural variation in human membrane Mooney,P., Tivey,A., Berriman,M., Hall,N., Rutherford,K. et al. transporter genes reveals evolutionary and functional constraints. (2004) GeneDB: a resource for prokaryotic and eukaryotic Proc. Natl Acad. Sci. USA, 100, 5896–5901. organisms. Nucleic Acids Res., 32, D339–D343. 44. Mahrus,S., Trinidad,J.C., Barkan,D.T., Sali,A., Burlingame,A.L. 55. Gajria,B., Bahl,A., Brestelli,J., Dommer,J., Fischer,S., Gao,X., and Wells,J.A. (2008) Global sequencing of proteolytic cleavage Heiges,M., Iodice,J., Kissinger,J.C., Mackey,A.J. et al. (2008) sites in apoptosis by specific labeling of protein N termini. Cell, 134, ToxoDB: an integrated Toxoplasma gondii database resource. 866–876. Nucleic Acids Res., 36, D553–D556. 45. Hubbard,S.J., Campbell,S.F. and Thornton,J.M. (1991) Molecular 56. Pegg,S.C., Brown,S.D., Ojha,S., Seffernick,J., Meng,E.C., recognition. Conformational analysis of limited proteolytic sites Morris,J.H., Chang,P.J., Huang,C.C., Ferrin,T.E. and Babbitt,P.C. and serine proteinase protein inhibitors. J. Mol. Biol., 220, 507–530. (2006) Leveraging enzyme structure-function relationships for 46. Maurer,S.M., Rai,A. and Sali,A. (2004) Finding cures for tropical functional inference and experimental design: the structure- diseases: is open source an answer? PLoS Med., 1, e56. function linkage database. Biochemistry, 45, 2545–2555. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Nucleic Acids Research Oxford University Press


Publisher: Oxford University Press
Copyright: © 2008 The Author(s)
ISSN: 0305-1048
eISSN: 1362-4962
DOI: 10.1093/nar/gkn791
PMID: 18948282

Chance,M.R., Fiser,A., Sali,A., Pieper,U., Eswar,N., Xu,G., Susan G. Komen Foundation (KG080137 to R.K.); Fajardo,J.E., Radhakannan,T. and Marinkovic,N. (2004) Spanish Ministerio de Educacion y Ciencia (BIO2007/ High-throughput computational and experimental techniques in structural genomics. Genome Res., 14, 2145–2154. 66670 to M.A.M-R). Funding for open access charge: 22. Ortiz,A.R., Strauss,C.E. and Olmea,O. (2002) MAMMOTH U54 GM074945. (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci., 11, 2606–2621. 23. Marti-Renom,M.A., Ilyin,V.A. and Sali,A. (2001) DBAli: a database of protein structure alignments. Bioinformatics, 17, REFERENCES 746–747. 1. Domingues,F.S., Koppensteiner,W.A. and Sippl,M.J. (2000) 24. Stuart,A.C., Ilyin,V.A. and Sali,A. (2002) LigBase: a database of The role of protein structure in genomics. FEBS Lett., 476, 98–102. families of aligned ligand binding sites in known protein sequences 2. Brenner,S.E. and Levitt,M. (2000) Expectations from structural and structures. Bioinformatics, 18, 200–201. genomics. Protein Sci., 9, 197–200. 25. Marti-Renom,M.A., Rossi,A., Al-Shahrour,F., Davis,F.P., 3. Skolnick,J., Fetrow,J.S. and Kolinski,A. (2000) Structural genomics Pieper,U., Dopazo,J. and Sali,A. (2007) The AnnoLite and and its importance for gene function analysis. Nat. Biotechnol., 18, AnnoLyze programs for comparative annotation of protein 283–287. structures. BMC Bioinformatics, 8(Suppl. 4), S4. 4. Deshpande,N., Addess,K.J., Bluhm,W.F., Merino-Ott,J.C., 26. Davis,F.P. and Sali,A. (2005) PIBASE: a comprehensive database Townsend-Merino,W., Zhang,Q., Knezevich,C., Xie,L., Chen,L., of structurally defined protein interfaces. Bioinformatics, 21, Feng,Z. et al. (2005) The RCSB Protein Data Bank: a redesigned 1901–1907. query system and relational database based on the mmCIF schema. 27. Marti-Renom,M.A., Pieper,U., Madhusudhan,M.S., Rossi,A., Nucleic Acids Res., 33, D233–D237. 
Eswar,N., Davis,F.P., Al-Shahrour,F., Dopazo,J. and Sali,A. (2007) 5. Bairoch,A., Apweiler,R., Wu,C.H., Barker,W.C., Boeckmann,B., DBAli tools: mining the protein structure space. Nucleic Acids Res., Ferro,S., Gasteiger,E., Huang,H., Lopez,R., Magrane,M. et al. 35, D393–D397. (2005) The Universal Protein Resource (UniProt). Nucleic Acids 28. Karchin,R., Diekhans,M., Kelly,L., Thomas,D.J., Pieper,U., Res., 33, D154–D159. Eswar,N., Haussler,D. and Sali,A. (2005) LS-SNP: large-scale 6. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J. and annotation of coding non-synonymous SNPs based on multiple Wheeler,D.L. (2008) GenBank. Nucleic Acids Res., 36, D25–D30. information sources. Bioinformatics, 21, 2814–2820. 7. Baker,D. and Sali,A. (2001) Protein structure prediction and 29. Davis,F.P., Braberg,H., Shen,M.Y., Pieper,U., Sali,A. and structural genomics. Science, 294, 93–96. Madhusudhan,M.S. (2006) Protein complex compositions predicted 8. Wallner,B. and Elofsson,A. (2005) All are not equal: a benchmark by structural similarity. Nucleic Acids Res., 34, 2943–2952. of different homology modeling programs. Protein Sci., 14, 30. Davis,F.P., Barkan,D.T., Eswar,N., McKerrow,J.H. and Sali,A. 1315–1327. (2007) Host pathogen protein interactions predicted by comparative 9. Hillisch,A., Pineda,L.F. and Hilgenfeld,R. (2004) Utility of modeling. Protein Sci., 16, 2585–2596. homology models in the drug discovery process. Drug Discov. 31. Jones,S., Zhang,X., Parsons,D.W., Lin,J.C., Leary,R.J., Today, 9, 659–669. Angenendt,P., Mankoo,P., Carter,H., Kamiyama,H., Jimeno,A. 10. Eswar,N., Webb,B., Marti-Renom,M.A., Madhusudhan,M.S., et al. (2008) Core signaling pathways in human pancreatic cancers Eramian,D., Shen,M.Y., Pieper,U. and Sali,A. (2007) Comparative revealed by global genomic analyses. Science, 321, 1801–1806. protein structure modeling using MODELLER. Curr. Protocols 32. Parsons,D.W., Jones,S., Zhang,X., Lin,J.C., Leary,R.J., Protein Sci./editorial board, John E. 
Coligan .. . et al., Chapter 2, Angenendt,P., Mankoo,P., Carter,H., Siu,I.M., Gallia,G.L. et al. Unit 29. (2008) An integrated genomic analysis of human Glioblastoma 11. Pieper,U., Eswar,N., Davis,F.P., Braberg,H., Madhusudhan,M.S., multiforme. Science, 321, 1807–1812. Rossi,A., Marti-Renom,M., Karchin,R., Webb,B.M., Eramian,D. 33. Sherry,S.T., Ward,M.H., Kholodov,M., Baker,J., Phan,L., et al. (2006) MODBASE: a database of annotated comparative Smigielski,E.M. and Sirotkin,K. (2001) dbSNP: the NCBI database protein structure models and associated resources. Nucleic Acids of genetic variation. Nucleic Acids Res., 29, 308–311. Res., 34, D291–D295. 34. Wu,C.H., Apweiler,R., Bairoch,A., Natale,D.A., Barker,W.C., 12. Eswar,N., John,B., Mirkovic,N., Fiser,A., Ilyin,V.A., Pieper,U., Boeckmann,B., Ferro,S., Gasteiger,E., Huang,H., Lopez,R. et al. Stuart,A.C., Marti-Renom,M.A., Madhusudhan,M.S., Yerkovich,B. (2006) Nucleic Acids Res., 34, D187–191. et al. (2003) Tools for comparative protein structure modeling and 35. Hsu,F., Kent,W.J., Clawson,H., Kuhn,R.M., Diekhans,M. and analysis. Nucleic Acids Res., 31, 3375–3380. Haussler,D. (2006) The UCSC known genes. Bioinformatics, 22, 13. Sali,A. and Blundell,T.L. (1993) Comparative protein modelling 1036–1046. by satisfaction of spatial restraints. J. Mol. Biol., 234, 779–815. 36. Flicek,P., Aken,B.L., Beal,K., Ballester,B., Caccamo,M., Chen,Y., 14. Smith,T.F. and Waterman,M.S. (1981) Identification of common Clarke,L., Coates,G., Cunningham,F., Cutts,T. et al. (2008) molecular subsequences. J. Mol. Biol., 147, 195–197. Ensembl 2008. Nucleic Acids Res., 36, D707–D714. D354 Nucleic Acids Research, 2009, Vol. 37, Database issue 37. Wheeler,D.L., Barrett,T., Benson,D.A., Bryant,S.H., Canese,K., 47. Hermann,J.C., Marti-Arbona,R., Fedorov,A.A., Fedorov,E., Chetvernin,V., Church,D.M., DiCuccio,M., Edgar,R., Federhen,S. Almo,S.C., Shoichet,B.K. and Raushel,F.M. (2007) Structure-based et al. 
(2008) Database resources of the National Center for activity prediction for an enzyme of unknown function. Nature, 448, Biotechnology Information. Nucleic Acids Res., 36, D13–D21. 775–779. 38. Pettersen,E.F., Goddard,T.D., Huang,C.C., Couch,G.S., 48. Kraulis,P.J. (1991) MOLSCRIPT: a program to produce both Greenblatt,D.M., Meng,E.C. and Ferrin,T.E. (2004) UCSF detailed and schematic plorts of protein structures. J. Appl. Chimera—a visualization system for exploratory research and Crystallogr., 24, 946–950. analysis. J. Comput. Chem., 25, 1605–1612. 49. Merritt,E.A. and Bacon,D.J. (1997) Raster3D: photorealistic 39. Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary molecular graphics. Methods Enzymol., 277, 505–524. structure: pattern recognition of hydrogen-bonded and geometrical 50. Ghedin,E., Wang,S., Spiro,D., Caler,E., Zhao,Q., Crabtree,J., features. Biopolymers, 22, 2577–2637. Allen,J.E., Delcher,A.L., Guiliano,D.B., Miranda-Saavedra,D. et al. 40. Li,M., Hays,F.A., Roe-Zurz,Z., Vuong,L., Kelly,L., Robbins,R., (2007) Draft genome of the filarial nematode parasite Brugia Ho,C.M., Pieper,U., O’Connell,J., Miercke,L.J. et al. (2008) malayi. Science, 317, 1756–1760. Eukaryotic Integral Membrane Protein Production For Structural 51. Chen,F., Mackey,A.J., Stoeckert,C.J. Jr. and Roos,D.S. (2006) Genomics. J. Mol. Biol., in press. OrthoMCL-DB: querying a comprehensive multi-species 41. Dean,M., Rzhetsky,A. and Allikmets,R. (2001) The human collection of ortholog groups. Nucleic Acids Res., 34, ATP-binding cassette (ABC) transporter superfamily. Genome Res., D363–D368. 11, 1156–1166. 52. Cole,S.T. (1999) Learning from the genome sequence of 42. Hamosh,A., Scott,A.F., Amberger,J.S., Bocchini,C.A. and Mycobacterium tuberculosis H37Rv. FEBS Lett., 452, 7–10. McKusick,V.A. (2005) Online Mendelian Inheritance in Man 53. Heiges,M., Wang,H., Robinson,E., Aurrecoechea,C., Gao,X., (OMIM), a knowledgebase of human genes and genetic disorders. 
Kaluskar,N., Rhodes,P., Wang,S., He,C.Z., Su,Y. et al. (2006) Nucleic Acids Res., 33, D514–D517. CryptoDB: a Cryptosporidium bioinformatics resource update. 43. Leabman,M.K., Huang,C.C., DeYoung,J., Carlson,E.J., Nucleic Acids Res., 34, D419–D422. Taylor,T.R., de la Cruz,M., Johns,S.J., Stryke,D., Kawamoto,M., 54. Hertz-Fowler,C., Peacock,C.S., Wood,V., Aslett,M., Kerhornou,A., Urban,T.J. et al. (2003) Natural variation in human membrane Mooney,P., Tivey,A., Berriman,M., Hall,N., Rutherford,K. et al. transporter genes reveals evolutionary and functional constraints. (2004) GeneDB: a resource for prokaryotic and eukaryotic Proc. Natl Acad. Sci. USA, 100, 5896–5901. organisms. Nucleic Acids Res., 32, D339–D343. 44. Mahrus,S., Trinidad,J.C., Barkan,D.T., Sali,A., Burlingame,A.L. 55. Gajria,B., Bahl,A., Brestelli,J., Dommer,J., Fischer,S., Gao,X., and Wells,J.A. (2008) Global sequencing of proteolytic cleavage Heiges,M., Iodice,J., Kissinger,J.C., Mackey,A.J. et al. (2008) sites in apoptosis by specific labeling of protein N termini. Cell, 134, ToxoDB: an integrated Toxoplasma gondii database resource. 866–876. Nucleic Acids Res., 36, D553–D556. 45. Hubbard,S.J., Campbell,S.F. and Thornton,J.M. (1991) Molecular 56. Pegg,S.C., Brown,S.D., Ojha,S., Seffernick,J., Meng,E.C., recognition. Conformational analysis of limited proteolytic sites Morris,J.H., Chang,P.J., Huang,C.C., Ferrin,T.E. and Babbitt,P.C. and serine proteinase protein inhibitors. J. Mol. Biol., 220, 507–530. (2006) Leveraging enzyme structure-function relationships for 46. Maurer,S.M., Rai,A. and Sali,A. (2004) Finding cures for tropical functional inference and experimental design: the structure- diseases: is open source an answer? PLoS Med., 1, e56. function linkage database. Biochemistry, 45, 2545–2555.

Journal

Nucleic Acids Research, Oxford University Press

Published: Jan 23, 2009
