SIMAP—a comprehensive database of pre-calculated protein sequence similarities, domains, annotations and clusters

Published online 11 November 2009. Nucleic Acids Research, 2010, Vol. 38, Database issue, D223–D226. doi:10.1093/nar/gkp949

Thomas Rattei (1,*), Patrick Tischler (1), Stefan Götz (2), Marc-André Jehl (1), Jonathan Hoser (1), Roland Arnold (1), Ana Conesa (2) and Hans-Werner Mewes (1,3)

(1) Technische Universität München, Department of Genome Oriented Bioinformatics, Wissenschaftszentrum Weihenstephan, Freising, Germany
(2) Bioinformatics Department, Centro de Investigación Príncipe Felipe, Valencia, Spain
(3) Institute for Bioinformatics and Systems Biology (MIPS), Helmholtz Zentrum München, German Research Center for Environmental Health (GmbH), Neuherberg, Germany

Received September 15, 2009; Revised October 10, 2009; Accepted October 12, 2009

*To whom correspondence should be addressed. Tel: +49 8161 712136; Fax: +49 8161 712186; Email: [email protected]

The Author(s) 2009. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

The prediction of protein function as well as the reconstruction of evolutionary genesis employing sequence comparison at large is still the most powerful tool in sequence analysis. Due to the exponential growth of the number of known protein sequences and the subsequent quadratic growth of the similarity matrix, the computation of the Similarity Matrix of Proteins (SIMAP) becomes a computationally intensive task. The SIMAP database provides a comprehensive and up-to-date pre-calculation of the protein sequence similarity matrix, sequence-based features and sequence clusters. As of September 2009, SIMAP covers 48 million proteins and more than 23 million non-redundant sequences. Novel features of SIMAP include the expansion of the sequence space by including databases such as ENSEMBL as well as the integration of metagenomes based on their consistent processing and annotation. Furthermore, protein function predictions by Blast2GO are pre-calculated for all sequences in SIMAP and the data access and query functions have been improved. SIMAP helps biologists to query the up-to-date sequence space systematically and facilitates large-scale downstream projects in computational biology. Access to SIMAP is freely provided through the web portal for individuals (http://mips.gsf.de/simap/) and for programmatic access through DAS (http://webclu.bio.wzw.tum.de/das/) and Web-Service (http://mips.gsf.de/webservices/services/SimapService2.0?wsdl).

INTRODUCTION

Protein sequences are of utmost importance for studying the function and evolution of genes and genomes. Evolutionary processes of mutation and selection have shaped the protein sequence space and have become manifest in the protein sequences as well as in their pair-wise and group-wise similarities. Therefore, a rich collection of methods in computational biology relies on the analysis and comparison of protein sequences. Many of these intensively used methods perform sequence similarity searches [e.g. BLAST (1)] or compare protein sequences against secondary databases of protein families [e.g. InterPro (2)].

The fast increasing volume of publicly available protein sequences creates a computational dilemma for bioinformatics tasks that require repeated all-against-all calculations of sequence similarities or sequence features. Such rather straightforward but technically challenging tasks include the annotation of genomes and the clustering of the protein sequence space into protein families. Due to the exponential growth of the number of sequences and the quadratic complexity of the sequence similarity matrix, the computational demand of calculating an all-versus-all sequence matrix of all known proteins easily outgrows available computational resources. Due to the subsequent growth of the secondary databases, a similar problem exists for the prediction of protein domains. As a consequence, any repeated ab initio recalculation of the similarity matrix is highly inefficient, because it recalculates the vast majority of already known sequence similarity relations. Since the number of recently added sequences is always small compared to the bulk of known sequences, such repeated recalculations, frequently performed in many sequence-based projects, waste a remarkable amount of compute resources worldwide.

The Similarity Matrix of Proteins (SIMAP) solves the computational dilemma described above by incrementally pre-calculating the sequence similarities forming the known protein sequence space (3). The comparison of new sequences versus known ones returns symmetric scores that can be updated accordingly in the existing records. Compared to other resources that pre-calculate sequence similarities [e.g. NCBI Blink (4)], the FASTA (5) and Smith–Waterman (6) based similarity calculation in SIMAP is only restricted by a static and sensitive raw score threshold, without limiting the maximal number of hits per sequence. Hence the structure of the sequence similarity matrix is not influenced by the taxonomy and study biases that exist in the major protein sequence databases. The SIMAP database stores raw scores from the calculated alignments. When querying SIMAP, e-values are calculated on-the-fly according to the selected databases and taxa. To complement the pair-wise sequence similarity matrix by position specific searches against known protein families, SIMAP in addition pre-calculates sequence based features such as InterPro matches (2). For SIMAP to provide an efficient alternative to BLAST or InterProScan calculations, a comprehensive representation of the protein sequence space is crucial. Recent improvements in SIMAP have addressed this requirement by further expanding the sequence space and including metagenomic sequences. Further improvements have extended the functional annotation of the protein sequence space in SIMAP by pre-calculated GO annotations and improved the data access and query tools of SIMAP.

NEW FEATURES AND IMPROVEMENTS IN SIMAP

Comprehensive coverage of the protein sequence space

SIMAP represents the known protein sequence space comprehensively and up-to-date. According to this goal, the SIMAP database is synchronized once per month with the major protein sequence databases (Table 1). Each of these databases is included in SIMAP because it provides either unique protein sequences that are not found in other databases [e.g. ENSEMBL (7)], or unique protocols for data processing [e.g. NCBI RefSeq (8)]. Even with SIMAP's incremental implementation, the continuous and rapid growth of the sequence space demands a sophisticated high-performance computing infrastructure to pre-calculate the sequence similarities of all new sequences and their sequence-based features immediately after the import of new sequences. The SIMAPBOINC public resource computing project (9) steadily provides compute power beyond the current need and thus enables rapid updating of SIMAP.

Table 1. Number of protein entries and non-redundant sequences of the major protein sequence databases included in SIMAP as of September 2009

Database             Protein entries   Non-redundant sequences
NCBI GenBank         16 146 018        13 065 886
NCBI RefSeq           8 181 910         6 681 186
Uniprot/TrEMBL        8 926 016         7 586 794
Uniprot/Swissprot       495 880           416 496
PDB                     139 106            41 445
PEDANT                5 480 442         5 389 911
ENSEMBL               1 094 482         1 062 197

Consistent processing and annotation of metagenomes

With the breakthrough of next generation sequencing methods and their application to environmental samples (10), metagenomic sequences have indelibly expanded the protein sequence space to non-culturable organisms and environmental communities. However, the pioneering 'Global ocean sampling' (GOS) project (11) so far remains the only metagenomic dataset of which protein sequences are represented in a major public sequence database [NCBI GenBank (4)]. All other metagenomes are, if at all, deposited in distributed resources such as the 'Whole Genome Shotgun' (wgs) section of NCBI GenBank (4) or the IMG/M database (12). No standardized protocol for gene calling and the annotation of protein-coding sequences has been established so far for these data collections. As the consistent annotation of metagenomes is indispensable for any downstream comparative analysis, such as comparisons of taxonomic or functional profiles between different metagenomes, an extension of SIMAP was implemented that extracts coding sequences from metagenomic sequencing reads, assembled contigs and scaffolds in a consistent way.

This part of SIMAP covering environmental sequence fragments is synchronized monthly with three major repositories of metagenomes (Table 2). Entirely redundant metagenomes are considered only once, whereas redundant representations of the same project differing in their total number of nucleotides (e.g. the whale fall samples in IMG/M and GenBank wgs) are retained.

Table 2. Number of metagenomic samples and extracted protein-coding sequences in SIMAP as of September 2009

Database                    Metagenomic samples   Non-redundant sequences
Camera (JCVI)                54                   6 031 109
NCBI Genbank/wgs section    130                   4 244 008
IMG/M (JGI)                  65                   2 833 359

Similar to the methodology used by the GOS project (13), coding sequences are extracted from the nucleotide sequences in a multi-step procedure:

(1) all open reading frames (ORF) exceeding a length of 90 nt are extracted from the nucleotide sequences of a metagenome,
(2) all-against-all protein sequence similarities between all ORFs in a metagenome are calculated using the SIMAP software (3): first a FASTA (5) similarity search against the low-complexity masked sequences down to the BLOSUM50 (14) score of 80 is performed without restricting the number of hits, thereafter the alignments are re-calculated without low-complexity masking,
(3) ORFs are weighted by the number and score of their sequence alignments; shadow ORFs are detected by their overlap with higher weighted ORFs and removed using the methodology and parameters as in the GOS project (13),
(4) remaining ORFs having a length of at least 60 aa are imported into the main SIMAP database,
(5) all-against-all protein sequence similarities between all ORFs of a metagenome and all other protein sequences in SIMAP are calculated as in step 2,
(6) again, shadow ORFs are removed as in step 3.

Compared to the supervised gene prediction methods used in other metagenomic resources, the procedure applied in SIMAP is not biased towards any taxonomic group (e.g. prokaryotes) and is only limited by the minimal length of open reading frames in steps 1 and 4. The parameters applied in this procedure ensure optimal sensitivity in detecting coding sequences both in single-exon and multi-exon genes.

The derived metagenomic ORFs have almost doubled the volume of the known protein sequence space and thus significantly added valuable information (Figure 1). However, metagenomic sequences exhibit lower accuracy compared to completely sequenced genes and genomes, show fragmentation in case of multi-exon genes and lack knowledge of their taxonomic origin. Therefore, metagenomic sequences can be excluded when retrieving data from SIMAP according to the individual requirements of the user.

[Figure 1: pie chart with the segments 'From complete genomes', 'From genes/unfinished genomes' and 'From metagenomes'; segment labels 20%, 48%, 32%.]
Figure 1. Composition of the non-redundant protein sequence space in SIMAP as of September 2009.

Functional annotation of the protein sequence space

Many computational methods that support the prediction of protein function are computationally expensive and therefore benefit from comprehensive pre-calculation and incremental updates, the basic design principles of SIMAP. SIMAP thus pre-calculates InterPro domains and features (2) for all sequences including metagenomic ORFs (15). New releases of InterPro are incorporated into SIMAP as soon as they become available; SIMAP is regularly updated to the latest InterPro version (currently 22.0).

SIMAP provides an ideal complete resource for the computation of secondary features such as the functional annotation of protein sequences based on information transfer from annotated proteins. Blast2GO may serve as an example that provides various annotation tools for the functional classification of proteins (16,17). Blast2GO achieves the automatic functional annotation of DNA or protein sequences employing the Gene Ontology vocabulary. We have adapted the Blast2GO suite to enable the retrieval of sequence similarities from the SIMAP database instead of performing BLAST (1) searches. This step saves an enormous amount of compute-time compared to BLAST and allows annotating the complete protein sequence space of SIMAP using a few PCs within a week. We have integrated the adapted Blast2GO program into the monthly update workflow of SIMAP in order to keep the pre-calculated Blast2GO annotations complete and up-to-date (Table 3).

Table 3. Pre-calculated functional annotations in SIMAP as of September 2009

Method         Number of pre-calculated features
InterProScan   133 829 528
TargetP         17 205 439
SignalP         11 060 831
TMHMM           15 841 454
Phobius         18 488 832
Blast2GO       190 801 556

High performance data access facilities

All data in SIMAP are freely available. The continuously growing size of SIMAP demands a sophisticated implementation of the database to provide versatile and rapid access to the data with respect to a broad spectrum of use cases. Based on the established database and standard middleware components of SIMAP, we have improved the performance and stability of SIMAP through clustering of two independent database and application servers. This clustering effectively uncouples production and maintenance processes. Each of the servers is ready to process more than 2 million complex queries per day.

Furthermore, we have improved the different data access facilities connecting SIMAP to its users. The versatile web portal allows searching for proteins by text or sequence queries. The matches are starting points for retrieving homologous proteins based on sequence similarity or domain architecture. Protein report pages integrate data from SIMAP, including InterPro and GO annotation, as well as from external resources such as the PEDANT database (18). To facilitate clustering methods, all-against-all matrices of similarity scores can be downloaded for user-supplied groups of proteins. Programmatic access to SIMAP is provided by several SOAP based Web-Services. The SimpAT (Simap Access Tools) allows easy access to the SIMAP database using Web-Service functionality. Recently, we have implemented Distributed Annotation System (DAS) services for SIMAP. These can be accessed via the URL http://webclu.bio.wzw.tum.de/das/ and provide easy and rapid access to the proteins, sequence similarities, InterPro matches and GO annotations from SIMAP. These data, with the exception of the very large similarity matrix itself, can also be downloaded as flat files from the SIMAP web portal. For research projects interested in parts of the similarity matrix, we provide project specific monthly dumps upon request.

DISCUSSION

The SIMAP database is a unique fundamental resource for computational biology that consistently puts the principle of incremental pre-calculation of sequence similarities and sequence based features into practice. As an exhaustive, up-to-date resource to inspect the sequence similarity of any known sequence, SIMAP enables any type of systematic post-processing with respect to the functional or structural classification of proteins.

The recent integration of metagenomic sequences into SIMAP based on a consistent extraction of coding sequences has been beneficial to preserve the comprehensiveness of the sequence space representation in SIMAP. At the same time it makes use of the sequence similarity matrix of SIMAP to resolve overlaps and remove shadow ORFs. SIMAP represents to our knowledge the largest and most homogeneous resource for the annotation of coding sequences in metagenomes. It provides an ideal data repository and speed-up for tools such as MEGAN (19) that extract taxonomic and functional information from similarities between metagenomic ORFs and known proteins in major sequence databases.

The extended functional annotation of the sequence space through the pre-calculation of GO annotations and the improved data access facilities have enhanced the potential of SIMAP in assisting biologists in answering their individual research questions as well as facilitating downstream projects in computational biology at any scale.

ACKNOWLEDGEMENTS

The authors gratefully acknowledge the BOINCSIMAP community for donating their CPU power for the calculation of protein similarities and features. They are grateful to their colleagues at MIPS, in particular Mathias Walter, Martin Muensterkoetter and Manuel Spannagl, for many helpful discussions and suggestions.

FUNDING

SUN Microsystems Inc. (funding a fully equipped X4500 data center server that is hosting parts of the SIMAP database, through a SUN Academic Excellence Grant), European Science Foundation (financial support for Stefan Götz through the activity entitled 'Frontiers of Functional Genomics'). Funding for open access charge: Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg.

Conflict of interest statement. None declared.

REFERENCES

1. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
2. Hunter,S., Apweiler,R., Attwood,T.K., Bairoch,A., Bateman,A., Binns,D., Bork,P., Das,U., Daugherty,L. and Duquenne,L. (2009) InterPro: the integrative protein signature database. Nucleic Acids Res., 37, D211–D215.
3. Arnold,R., Rattei,T., Tischler,P., Truong,M.D., Stumpflen,V. and Mewes,W. (2005) SIMAP-The similarity matrix of proteins. Bioinformatics, 21, ii42–ii46.
4. Sayers,E.W., Barrett,T., Benson,D.A., Bryant,S.H., Canese,K., Chetvernin,V., Church,D.M., DiCuccio,M., Edgar,R., Federhen,S. et al. (2009) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 37, D5–D15.
5. Pearson,W.R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol., 183, 63–98.
6. Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197.
7. Hubbard,T.J.P., Aken,B.L., Ayling,S., Ballester,B., Beal,K., Bragin,E., Brent,S., Chen,Y., Clapham,P. and Clarke,L. (2009) Ensembl 2009. Nucleic Acids Res., 37, D690–D697.
8. Pruitt,K.D., Tatusova,T. and Maglott,D.R. (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 35, D61–D65.
9. Rattei,T., Walter,M., Arnold,R., Anderson,D.P. and Mewes,W. (2007) Using public resource computing and systematic pre-calculation for large scale sequence analysis. Lecture Notes Comp. Sci., 4360, 11–18.
10. Handelsman,J. (2004) Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol. Biol. Rev., 68, 669–685.
11. Rusch,D.B., Halpern,A.L., Sutton,G., Heidelberg,K.B., Williamson,S., Yooseph,S., Wu,D., Eisen,J.A., Hoffman,J.M. and Remington,K. (2007) The Sorcerer II global ocean sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol., 5, e77.
12. Markowitz,V.M., Ivanova,N.N., Szeto,E., Palaniappan,K., Chu,K., Dalevi,D., Chen,I., Min,A., Grechkin,Y. and Dubchak,I. (2008) IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res., 36, D534–D538.
13. Yooseph,S., Sutton,G., Rusch,D.B., Halpern,A.L., Williamson,S.J., Remington,K., Eisen,J.A., Heidelberg,K.B., Manning,G. and Li,W. (2007) The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol., 5, e16.
14. Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci., 89, 10915–10919.
15. Rattei,T., Arnold,R., Tischler,P., Lindner,D., Stumpflen,V. and Mewes,H.W. (2006) SIMAP: the similarity matrix of proteins. Nucleic Acids Res., 34, D252–D256.
16. Conesa,A., Götz,S., Garcia-Gomez,J.M., Terol,J., Talon,M. and Robles,M. (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics, 21, 3674–3676.
17. Götz,S., Garcia-Gomez,J.M., Terol,J., Williams,T.D., Nagaraj,S.H., Nueda,M.J., Robles,M., Talon,M., Dopazo,J. and Conesa,A. (2008) High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Res., 36, 3420–3435.
18. Walter,M.C., Rattei,T., Arnold,R., Guldener,U., Munsterkotter,M., Nenova,K., Kastenmuller,G., Tischler,P., Wolling,A. and Volz,A. (2008) PEDANT covers all complete RefSeq genomes. Nucleic Acids Res., 37, D408–D411.
19. Huson,D.H., Auch,A.F., Qi,J. and Schuster,S.C. (2007) MEGAN analysis of metagenomic data. Genome Res., 17, 377–386.
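
Illustrative sketch of the ORF length filters. The two length thresholds in the metagenome procedure (ORFs exceeding 90 nt in step 1, at least 60 aa in step 4) can be made concrete with a small Python example. This is not the SIMAP implementation. It assumes that an ORF is a stop-codon-free stretch of complete codons in one of the six reading frames under the standard genetic code, a definition the text does not spell out. All function and constant names are hypothetical. The similarity searches and the shadow ORF removal (steps 2, 3, 5 and 6) are not covered.

```python
"""Illustrative six-frame ORF extraction with length filters (not SIMAP code)."""

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
MIN_ORF_NT = 90   # step 1: an ORF must exceed 90 nt
MIN_ORF_AA = 60   # step 4: an ORF must have at least 60 aa

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) of stop-free runs of complete codons in one frame."""
    start = frame
    pos = frame
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            if pos > start:
                yield start, pos
            start = pos + 3
        pos += 3
    if pos > start:  # open-ended run reaching the end of the fragment
        yield start, pos


def extract_orfs(seq, min_nt=MIN_ORF_NT):
    """Collect ORFs longer than min_nt from all six reading frames.

    Coordinates on the '-' strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    found = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(strand_seq, frame):
                if end - start > min_nt:
                    found.append((strand, frame, start, end, strand_seq[start:end]))
    return found


def long_enough_for_import(orf_nt, min_aa=MIN_ORF_AA):
    """Apply the step 4 filter: at least min_aa complete codons."""
    return len(orf_nt) // 3 >= min_aa


if __name__ == "__main__":
    fragment = "TAA" + "GCA" * 31 + "TAA"  # toy sequence, 93 nt between stops
    for strand, frame, start, end, orf in extract_orfs(fragment):
        print(strand, frame, start, end, long_enough_for_import(orf))
```

In this sketch a 93 nt ORF passes the first filter but not the second, which mirrors how the procedure keeps short candidates available for the similarity-based weighting before the stricter length cut-off is applied.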

Publisher
Oxford University Press
Copyright
© The Author(s) 2009. Published by Oxford University Press.
ISSN
0305-1048
eISSN
1362-4962
DOI
10.1093/nar/gkp949
pmid
19906725
middleware components of SIMAP, we have improved The derived metagenomic ORFs have almost doubled the performance and stability of SIMAP through cluster- the volume of the known protein sequence space and thus ing of two independent database and application servers. significantly added valuable information (Figure 1). This clustering effectively uncouples production and main- However, metagenomic sequences exhibit lower accuracy tenance processes. Each of the servers is ready to process compared to completely sequenced genes and genomes, more than 2 million complex queries per day. show fragmentation in case of multi-exon genes and Furthermore, we have improved the different data lack knowledge of their taxonomic origin. Therefore, access facilities connecting SIMAP to its users. The versa- metagenomic sequences can be excluded when retrieving tile web portal allows searching for proteins by text or data from SIMAP according to the individual sequence queries. The matches are starting points requirements of the user. for retrieving homologous proteins based on sequence similarity or domain architecture. Protein report pages Functional annotation of the protein sequence space integrate data from SIMAP including InterPro and GO Many computational methods to support the prediction annotation as well as from external resources as the of protein function are computationally expensive and PEDANT database (18). To facilitate clustering therefore benefit from comprehensive pre-calculation methods, all-against-all matrices of similarity scores can and incremental updates as the basic design principles of be downloaded for user-supplied groups of proteins. SIMAP. SIMAP thus pre-calculates Interpro domains and Programmatic access to SIMAP is provided by several features (2) for all sequences including metagenomic SOAP based Web-Services. The SimpAT (Simap Access ORFs (15). 
New releases of InterPro are incorporated Tools) allows easy access to the SIMAP database using into SIMAP as soon as they become available; SIMAP Web-Service functionality. Recently, we have imple- is regularly updated to the latest InterPro version mented Distributed Annotation System (DAS) services (currently 22.0). for SIMAP. These can be accessed via the URL SIMAP provides an ideal complete resource for the http://webclu.bio.wzw.tum.de/das/ and provide easy and computation of secondary features such as the functional rapid access to the proteins, sequence similarities, InterPro Downloaded from https://academic.oup.com/nar/article-abstract/38/suppl_1/D223/3112298 by Ed 'DeepDyve' Gillespie user on 02 February 2018 D226 Nucleic Acids Research, 2010, Vol. 38, Database issue matches and GO annotations from SIMAP. These data REFERENCES with the exception of the very huge similarity matrix 1. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., itself can also be downloaded as flat files from the Miller,W. and Lipman,D.J. (1997) Gapped BLAST and SIMAP web portal. For research projects interested in PSI-BLAST: a new generation of protein database search parts of the similarity matrix, we provide project specific programs. Nucleic Acids Res., 25, 3389–3402. 2. Hunter,S., Apweiler,R., Attwood,T.K., Bairoch,A., Bateman,A., monthly dumps upon request. Binns,D., Bork,P., Das,U., Daugherty,L. and Duquenne,L. (2009) InterPro: the integrative protein signature database. Nucleic Acids Res., 37, D211–D215. DISCUSSION 3. Arnold,R., Rattei,T., Tischler,P., Truong,M.D., Stumpflen,V. and Mewes,W. (2005) SIMAP-The similarity matrix of proteins. The SIMAP database is a unique fundamental resource Bioinformatics, 21, ii42–ii46. for computational biology that consequently puts the 4. Sayers,E.W., Barrett,T., Benson,D.A., Bryant,S.H., Canese,K., principle of incremental pre-calculation of sequence Chetvernin,V., Church,D.M., DiCuccio,M., Edgar,R., Federhen,S. 
similarities and sequence based features into practice. et al. (2009) Database resources of the national center for SIMAP as an exhaustive, up-to-date resource to inspect biotechnology information. Nucleic Acids Res., 37, D5–D15. 5. Pearson,W.R. (1990) Rapid and sensitive sequence comparison the sequence similarity of any known sequence enables of with FASTP and FASTA. Methods Enzymol., 183, 63–98. any type of systematic post-processing with respect to the 6. Smith,T.F. and Waterman,M.S. (1981) Identification of common functional or structural classification of proteins. molecular subsequences. J. Mol. Bwl, 147, 195–197. The recent integration of metagenomic sequences 7. Hubbard,T.J.P., Aken,B.L., Ayling,S., Ballester,B., Beal,K., into SIMAP based on a consistent extraction of Bragin,E., Brent,S., Chen,Y., Clapham,P. and Clarke,L. (2009) Ensembl 2009. Nucleic Acids Res., 37, D690–D697. coding sequences has been beneficial to preserve the 8. Pruitt,K.D., Tatusova,T. and Maglott,D.R. (2007) NCBI reference comprehensiveness of the sequence space representation sequences (RefSeq): a curated non-redundant sequence database in SIMAP. At the same time it makes use of the of genomes, transcripts and proteins. Nucleic Acids Res., 35, sequence similarity matrix of SIMAP to resolve overlaps D61–D65. and remove shadow ORFs. SIMAP represents to our 9. Rattei,T., Walter,M., Arnold,R., Anderson,D.P. and Mewes,W. (2007) Using public resource computing and systematic knowledge the largest and most homogeneous resource pre-calculation for large scale sequence analysis. Lecture Notes for the annotation of coding sequences in metagenomes. Comp. Sci., 4360, 11–18. It provides an ideal data repository and speed-up for tools 10. Handelsman,J. (2004) Metagenomics: application of genomics as e.g. MEGAN (19) that extract taxonomic and func- to uncultured microorganisms. Microbiol Mol. Biol. Rev., 68, tional information from similarities between metagenomic 669–685. 11. 
Rusch,D.B., Halpern,A.L., Sutton,G., Heidelberg,K.B., ORFs and known proteins in major sequence databases. Williamson,S., Yooseph,S., Wu,D., Eisen,J.A., Hoffman,J.M. and The extended functional annotation of the sequence Remington,K. (2007) The Sorcerer II global ocean sampling space through the pre-calculation of GO annotations expedition: Northwest Atlantic through eastern tropical Pacific. and the improved data access facilities have enhanced PLoS Biol., 5, e77. the potential of SIMAP in assisting biologists in answering 12. Markowitz,V.M., Ivanova,N.N., Szeto,E., Palaniappan,K., Chu,K., Dalevi,D., Chen,I., Min,A., Grechkin,Y. and Dubchak,I. (2008) their individual research questions as well as facilitating IMG/M: a data management and analysis system for metagenomes. downstream projects in computational biology at any Nucleic Acids Res., 36, D534–D538. scale. 13. Yooseph,S., Sutton,G., Rusch,D.B., Halpern,A.L., Williamson,S.J., Remington,K., Eisen,J.A., Heidelberg,K.B., Manning,G. and Li,W. (2007) The Sorcerer II Global Ocean Sampling expedition: ACKNOWLEDGEMENTS expanding the universe of protein families. PLoS Biol., 5, e16. 14. Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution The authors gratefully acknowledge the BOINCSIMAP matrices from protein blocks. Proc. Natl Acad. Sci., 89, community for donating their CPU power for the calcu- 10915–10919. lation of protein similarities and features. They are 15. Rattei,T., Arnold,R., Tischler,P., Lindner,D., Stumpflen,V. and grateful to their colleagues at MIPS, in particular Mewes,H.W. (2006) SIMAP: the similarity matrix of proteins. Nucleic Acids Res., 34, D252–D256. Mathias Walter, Martin Muensterkoetter and Manuel 16. Conesa,A., Go¨ tz,S., Garcia-Gomez,J.M., Terol,J., Talon,M. and Spannagl, for many helpful discussions and suggestions. Robles,M. (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics, 21, 3674–3676. FUNDING 17. 
Go¨ tz,S., Garcia-Gomez,J.M., Terol,J., Williams,T.D., Nagaraj,S.H., Nueda,M.J., Robles,M., Talon,M., Dopazo,J. and SUN Microsystems Inc. (funding a fully equipped X4500 Conesa,A. (2008) High-throughput functional annotation and data center server that is hosting parts of the SIMAP data mining with the Blast2GO suite. Nucleic Acids Res., 36, database, through a SUN Academic Excellence Grant), 3420–3435. European Science Foundation (financial support for 18. Walter,M.C., Rattei,T., Arnold,R., Guldener,U., Munsterkotter,M., Nenova,K., Kastenmuller,G., Tischler,P., Stefan Go¨ tz through the activity entitled ‘Frontiers of Wolling,A. and Volz,A. (2008) PEDANT covers all complete Functional Genomics’). Funding for open access charge: RefSeq genomes. Nucleic Acids Res., 37, D408–D411. Helmholtz Zentrum Mu¨ nchen, German Research Center 19. Huson,D.H., Auch,A.F., Qi,J. and Schuster,S.C. (2007) MEGAN for Environmental Health, Neuherberg. analysis of metagenomic data. Genome Res., 17, 377–386. Conflict of interest statement. None declared. Downloaded from https://academic.oup.com/nar/article-abstract/38/suppl_1/D223/3112298 by Ed 'DeepDyve' Gillespie user on 02 February 2018
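Illustrative sketch. The length filters in steps 1 and 4 of the multi-step procedure in ‘Consistent processing and annotation of metagenomes’ can be made concrete with a few lines of Python. This is a minimal sketch under stated assumptions, not the SIMAP implementation: the article does not say how ORFs are delimited, so an ORF is assumed here to be a stop-to-stop stretch in any of the six reading frames, translated with the standard genetic code. The function names (`extract_orfs`, `filter_by_length`) are hypothetical, and the similarity searches and shadow-ORF removal of steps 2, 3, 5 and 6 are not modelled.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the article does not state which translation table is used).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def extract_orfs(nucleotides, min_nt=90):
    """Step 1 (sketch): stop-to-stop ORFs exceeding min_nt in all six frames.

    Returns a list of (strand, frame, protein) tuples.
    """
    seq = nucleotides.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for segment in translate(s[frame:]).split("*"):
                if len(segment) * 3 > min_nt:
                    orfs.append((strand, frame, segment))
    return orfs


def filter_by_length(orfs, min_aa=60):
    """Step 4 (sketch): keep ORFs whose protein has at least min_aa residues."""
    return [orf for orf in orfs if len(orf[2]) >= min_aa]


if __name__ == "__main__":
    toy = "ATG" + "GCT" * 40 + "TAA"  # made-up sequence, 41 codons before the stop
    candidates = extract_orfs(toy)
    print(len(candidates), "candidate ORFs;",
          len(filter_by_length(candidates)), "pass the 60 aa filter")
```

Keeping the two thresholds in separate functions mirrors the order of the procedure, where the 60 aa cut-off is applied only after shadow ORFs have been removed.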

Journal

Nucleic Acids Research, Oxford University Press

Published: Nov 10, 2009
