IMG/M: a data management and analysis system for metagenomes

Nucleic Acids Research, 2008, Vol. 36, Database issue, D534–D538
Published online 11 October 2007
doi:10.1093/nar/gkm869

Victor M. Markowitz¹, Natalia N. Ivanova⁴, Ernest Szeto¹, Krishna Palaniappan¹, Ken Chu¹, Daniel Dalevi¹, I-Min A. Chen¹, Yuri Grechkin¹, Inna Dubchak², Iain Anderson⁴, Athanasios Lykidis⁴, Konstantinos Mavromatis⁴, Philip Hugenholtz³ and Nikos C. Kyrpides⁴,*

¹Biological Data Management and Technology Center, ²Genomics Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, ³Department of Energy Joint Genome Institute, Microbial Ecology Program and ⁴Department of Energy Joint Genome Institute, Genome Biology Program, 2800 Mitchell Drive, Walnut Creek, USA

Received August 10, 2007; Revised September 22, 2007; Accepted September 24, 2007

*To whom correspondence should be addressed. Tel: 925 296 5718; Fax: 925 296 5666; Email: [email protected]

© 2007 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

IMG/M is a data management and analysis system for microbial community genomes (metagenomes) hosted at the Department of Energy’s (DOE) Joint Genome Institute (JGI). IMG/M consists of metagenome data integrated with isolate microbial genomes from the Integrated Microbial Genomes (IMG) system. IMG/M provides IMG’s comparative data analysis tools extended to handle metagenome data, together with metagenome-specific analysis tools. IMG/M is available at http://img.jgi.doe.gov/m

INTRODUCTION

Studies of the collective genomes (also known as metagenomes) of environmental microbial communities (also known as microbiomes) are expected to lead to advances in environmental cleanup, agriculture, industrial processes, alternative energy production and human health (1). Metagenomes of specific microbiome samples are sequenced by organizations worldwide such as the Department of Energy’s (DOE) Joint Genome Institute (JGI), the Venter Institute, Washington University in St. Louis, and Genoscope using different sequencing strategies, technology platforms and annotation procedures. According to the Genomes OnLine Database, about 25 metagenome studies have been published to date, with over 60 other projects ongoing and more in the process of being launched (2). JGI is one of the major contributors of metagenome sequence data, currently sequencing more than 50% of the reported metagenome projects worldwide.

Due to the higher complexity, inherent incompleteness and lower quality of metagenome sequence data, traditional assembly, gene prediction and annotation methods do not perform as well on these datasets as they do on isolate microbial genome sequences (3,4). In spite of these limitations, metagenome data are amenable to a variety of analyses, as illustrated by several recent studies (5–10). Metagenome data analysis is usually set up in the context of reference isolate genomes and considers the questions of phylogenetic composition and functional or metabolic potential of individual microbiomes, as well as differences between microbiome samples. Such analysis relies on efficient management of genome and metagenome data collected from multiple sources, while taking into account the iterative nature of sequence data generation and processing.

IMG/M aims at providing support for comparative metagenome analysis in the integrated context of microbial genome and metagenome data generated with diverse sequencing technology platforms and data processing methods. IMG/M was initially developed as an experimental system (11). Subsequently, IMG/M has been extended in terms of metagenome data content and metagenome-specific analytical tools, as discussed below.

DATA CONTENT

IMG/M consists of microbial metagenome data integrated with isolate microbial genomes from the Integrated Microbial Genomes (IMG) system (12). The current version of IMG/M (as of September 2007) contains metagenome datasets generated using shotgun sequencing for 10 projects involving a total of 24 microbiome samples, including an acid mine drainage biofilm (5), three isolated deep sea ‘whale fall’ carcass samples, an agricultural soil sample (6), two biological phosphorus removing sludge samples (7), the metagenome of gutless marine worm symbionts (8), two human distal gut samples (9) and obese and lean mouse gut samples (10). Several other metagenome datasets, such as hypersaline microbial mats and termite hindgut metagenomes, are currently being analyzed using an internal version of IMG/M in preparation for publication.

The current version of IMG/M also includes 2301 genomes from IMG 2.0 (released on 1 December 2006), consisting of 595 bacterial, 32 archaeal, 13 eukaryotic and 1661 phage genomes.

Similar to IMG, the data model underlying IMG/M allows recording the primary sequence information and its organization in scaffolds and/or contigs, together with computationally predicted protein-coding sequences and some RNA-coding genes. Protein-coding genes are characterized in terms of additional annotations such as motifs, domains, pathways and orthology relationships, which may serve as an indication of their functions. These annotations are based on diverse data sources such as COG (13), Pfam (14) and KEGG (15). Genes are assigned to COGs and Pfams based on reverse position specific BLAST (RPS-BLAST) and NCBI’s Conserved Domain Database (16). Homologs are computed as unidirectional hits with an E-value of 10 or better, with IMG/M providing support for filtering homolog lists by percent identity, bit score and more stringent E-values.

Isolate organisms are identified via their taxonomic lineage (domain, phylum, class, order, family, genus, species and strain), while individual microbiome samples are treated as ‘meta’ organisms. The sequences of a microbiome sample together with their associated genes and annotations are grouped into ‘bins’ when binning has been performed to assign these sequences to organism types (phylotypes). Both isolate organisms and microbiome samples are characterized by a variety of metadata attributes. Some metadata, such as phenotype, habitat, disease, relevance, temperature and pH, are included from GOLD (2), with additional metadata collected directly from scientists or publications.

DATA ANALYSIS

We briefly review below the IMG/M data analysis tools with emphasis on the support for new metagenome analysis tools developed since IMG/M’s initial public release in 2006 (11).

Data exploration and visualization tools

Data exploration tools in IMG/M help in selecting and examining genomes/metagenomes, genes and functions of interest. Similar to IMG, genes and functions can be selected using keyword searches or functional classification (e.g. COG, Pfam) browsers. Lists of genes and functional annotations of interest can be maintained and further explored using various ‘Analysis Carts’.

Metagenomes and isolate genomes can be selected using a keyword based ‘Genome Search’ tool or a ‘Genome Browser’. Microbiomes can be further examined using the ‘Microbiome Details’, where a user can find relevant metadata such as sample site, as shown in pane 1 of Figure 1, along with various summaries of interest such as the total number of scaffolds and genes or the number of genes associated with functional characterizations (e.g. COG, Pfam). The ‘Phylogenetic Distribution of Genes’, shown in pane 2 of Figure 1, provides an estimate of the phylogenetic composition of a microbiome sample based on the distribution of the best BLAST hits of the protein-coding genes in the sample. The ‘Phylogenetic Distribution of Genes’ consists of a histogram, with counts of protein-coding genes in the sample that have best BLASTp hits to proteins of isolate genomes in each phylum or class with more than 90% identity (right column), 60–90% identity (middle column) and 30–60% identity (left column). The higher the number of hits and percent identity cutoff, the more likely it is that the sample contains close relatives of the sequenced isolate genomes from this phylum/class. Gene counts in the histogram are linked to the corresponding lists of genes, which can then be selected and added to ‘Gene Cart’ or analyzed through their ‘Gene Pages’. For each phylum/class, the phylogenetic distribution of genes can be projected onto the families in that phylum/class; for each family the distribution of genes can be further projected onto the species in that family. Finally, the genes in the sample can be viewed in the context of an individual reference isolate genome, using either the ‘Reference Genome Context Viewer’ as shown in pane 3 of Figure 1, or the ‘Protein Recruitment Plot’, as shown in pane 4 of Figure 1. The ‘Reference Genome Context Viewer’ displays the metagenome genes aligned with their homologous genes of the reference isolate genome. The metagenome genes are color coded to indicate BLAST percent identity (blue for 30%, green for 60% and red for 90%), while the genes of the reference genome are color coded to indicate their COG functional role, and are displayed as they are located along the chromosome. The ‘Protein Recruitment Plot’ displays the BLASTp hits of the metagenome genes against the genes of the reference genome, with the coordinates of the scaffold reference genome and the BLAST percent identities shown on the X and Y axis, respectively.

Figure 1. Metagenome Data Exploration and Visualization Tools. Individual microbiome samples such as the ‘Sludge/Australian’ sample, can be examined using the ‘Microbiome Details’ page, which includes relevant microbiome information (1). The ‘Phylogenetic Distribution of Genes’ tool (2) displays the distribution of best BLAST hits of protein-coding genes in the microbiome as a histogram, with counts of genes that have best BLASTp hits to proteins of isolate genomes in each phylum or class with more than 90% identity, 60–90% identity and 30–60% identity. The distribution of genes for each phylum/class can be projected onto the families in that phylum/class such as Betaproteobacteria, and then further projected onto the species in that family such as Rhodocyclaceae. The genes in the sample can be viewed in the context of an individual reference isolate genome such as Dechloromonas aromatica, using the ‘Reference Genome Context Viewer’ (3), or using a ‘Protein Recruitment Plot’ (4). For each gene on ‘Reference Genome Context Viewer’ and ‘Protein Recruitment Plot’, locus tag and scaffold coordinates are provided locally (by placing the cursor over the gene), while additional information is available in the ‘Gene Details’ page, which is linked to each gene.

Similar to genes of isolate genomes, metagenome genes can be examined using ‘Gene Details’ pages, which include information on locus, biochemical properties of the product, KEGG pathways, as well as evidence for the functional prediction: gene neighborhood, COG and Pfam and precomputed lists of homologs, orthologs and paralogs (for isolate organisms), or intra-metagenome homologs as well as homologs to other genomes and metagenomes (for microbiomes).

For metagenomes that include contigs and scaffolds generated by assembly of individual reads and potentially comprised of sequences from multiple strains, a ‘SNP BLAST’ tool allows users to examine the heterogeneity between the reads contributing to the composite population contigs and scaffolds. This tool allows users to run BLASTn of the query nucleotide sequence of a specific gene or scaffold in the metagenome against a database of sequencing reads. The BLAST output, which shows whether there are any SNPs among the reads corresponding to the query sequence, can be examined using the raw BLAST output or using the ‘SNP VISTA’ viewer (17).

Comparative analysis tools

Comparative analysis of genomes and metagenomes is provided in IMG/M through a number of tools that allow users to examine their gene content and functional capabilities. The differences in gene content of genomes and metagenomes can be examined with a profile-based selection tool (‘Phylogenetic Profiler’) and further explored through gene neighborhood analysis and multiple sequence alignment tools, which are similar to their IMG counterparts (12). Functional capabilities of a microbial community can be examined using several occurrence and abundance profile-based tools. We discuss below in more detail the abundance profile tools that are specific to metagenome data comparative analysis.

Several ‘Abundance Profile’ tools can be used for comparing the functional capabilities of metagenomes and genomes. The ‘Abundance Profile Viewer’ provides an overview of the relative abundance of protein families (COGs and Pfams) and functional families (enzymes) across selected metagenomes and isolate genomes, as illustrated by the example in pane 1 of Figure 2, which shows the abundance profiles of COGs across three whale-fall microbiomes. Abundance of protein/functional families is displayed as a heat map over all families of a specific type (COGs, Pfams, or enzymes), as shown in pane 2 of Figure 2, with red corresponding to the most abundant families. Each column on the map corresponds to a genome or metagenome, while each row corresponds to a family. Clicking on the cell will retrieve the list of genes assigned to this particular family in this genome or metagenome, while clicking on the identifier of the family displayed on the right side of the column (e.g. COG0642) will add the corresponding family to the ‘Function Cart’, as shown in pane 3 of Figure 2. For protein families in the ‘Function Cart’ a selective ‘Function Profile’ can be computed, as shown in pane 4 of Figure 2.

Figure 2. Abundance Profile Tools. The ‘Abundance Profile Viewer’ (1) provides an overview of the relative abundance of protein families (COGs and Pfams) and functional families (enzymes) across selected metagenomes, normalized for genome size or using z-score. Abundance of protein/functional families is displayed as a heat map (2), with each cell hyperlinked to the list of genes assigned to a particular family. A protein family can be saved in the ‘Function Cart’ by clicking its identifier such as COG0642 (3). For protein families in the ‘Function Cart’ a selective ‘Function Profile’ can be also computed (4). The ‘Abundance Comparison’ tool (5) takes into account the stochastic nature of metagenome datasets and tests whether the differences in abundance can be ascribed to chance variation or not. In addition to the gene count based abundance, the results provided by this tool include an assessment of statistical significance in terms of d-score (6) or P-value.

The ‘Abundance Profile Viewer’ and ‘Function Profile’ tools provide a rough estimate of the functional capabilities of metagenomes. When metagenomes are compared to each other or to isolate genomes, statistical tests are needed for estimating the statistical significance of the observed differences. The ‘Abundance Comparison’ tool, illustrated in pane 5 of Figure 2, takes into account the stochastic nature of metagenome datasets and tests whether the differences in abundance can be ascribed to chance variation or not. The results provided by this tool include an assessment of statistical significance in terms of a d-score (that translates into a P-value) in addition to the gene count based abundance, as shown in pane 6 of Figure 2. The d-score is a standard normal statistic derived under a binomial assumption where the corresponding P-value provides support at different levels of significance (e.g. 0.05, 0.01).

FUTURE PLANS

The current version of IMG/M contains data on 2301 isolate genomes, 21 metagenome samples from 9 studies and 3 simulated datasets from a metagenome data processing benchmarking project (4). New metagenome datasets are continuously included into IMG/M from metagenome studies conducted at JGI and other institutes such as the Washington University in St. Louis, while new reference genomes are included from IMG.

New visualization tools are currently being developed in order to improve the efficiency of analyzing large and complex metagenome datasets, including datasets generated with new technology platforms such as the Genome Sequencer 20™ System from 454 Life Sciences. The abundance profile tools will be extended to allow comparison of genomes and metagenomes based on higher-level functional categories such as COG functional categories and KEGG pathways. As the number of analytical tools increases, the organization and documentation of the IMG user interface will be revised in order to improve its usability.

We also plan to extend IMG/M’s capability to capture more detailed metadata attributes characterizing microbiome samples. Such attributes are often specific to a habitat (e.g. biomedical, ecological). Samples are associated with properties used for metagenome analysis such as sample structural and morphological characteristics (e.g. sample site, time of collection) and donor or host data (e.g. demographic and clinical record, including diagnosis, disease, stage of disease and treatment information for human donors). Samples may also be involved in clinical studies and therefore can be grouped into several time/treatment study groups. In addition to extending the data model for supporting sample metadata, we plan to improve the coherence and completeness of these annotations via manual curation. We collaborate with the Genome Standards Consortium (18) in order to ensure high coverage and consistency of microbiome sample metadata.

The current version of IMG/M does not provide support for data curation. We plan to incorporate into IMG/M the annotation capabilities that are available in IMG for isolate genomes, adapted to handle metagenome data.

ACKNOWLEDGEMENTS

We thank Chris Oehmen of the Computational Biology and Bioinformatics group at the Pacific Northwest National Laboratory for his help in carrying out the large-scale gene similarity computations for IMG/M. The work of JGI’s sequencing, assembly and annotation teams is an essential prerequisite for IMG/M. Eddy Rubin and James Bristow provided support, advice and encouragement throughout this project. The work presented in this article was performed under the auspices of the US Department of Energy’s Office of Science, Biological and Environmental Research Program and by the University of California, Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48, Lawrence Berkeley National Laboratory under contract No. DE-AC02-05CH11231 and Los Alamos National Laboratory under contract No. DE-AC02-06NA25396. Funding to pay the Open Access publication charges for this article was provided by the Department of Energy Joint Genome Institute.

Conflict of interest statement. None declared.

REFERENCES

1. National Research Council Committee on Metagenomics. (2007) The New Science of Metagenomics: Revealing the Secrets of our Microbial Planet. National Academies Press, Washington, DC.
2. Liolios,K., Tavernarakis,N., Hugenholtz,P. and Kyrpides,N. (2006) The genomes online database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res., 34, D332–D334.
3. Chen,K. and Pachter,L. (2005) Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS Comput. Biol., 1, 106–112.
4. Mavromatis,K., Ivanova,N., Barry,K., Shapiro,H., Goltsman,E., McHardy,A.C., Rigoutsos,I., Salamov,A., Korzeniewski,F. et al. (2007) On the fidelity of processing metagenomic sequences using simulated dataset. Nat. Meth., 4, 495–500.
5. Tyson,G.W., Chapman,J., Hugenholtz,P., Allen,E.E., Ram,R.J., Richardson,P.M., Solovyev,V.V., Rubin,E., Rokhsar,D.S. et al. (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature, 428, 37–43.
6. Tringe,S., von Mering,C., Kobayashi,A., Salamov,A., Chen,K., Chang,H.W., Podar,M., Short,J.M., Mathur,E.J. et al. (2005) Comparative metagenomics of microbial communities. Science, 308, 554–557.
7. Martin,H.G., Ivanova,N.N., Kunin,V., Warnecke,F., Barry,K.W., McHardy,A.C., Yeates,C., He,S., Salamov,A.A. et al. (2006) Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat. Biotechnol., 24, 1263–1269.
8. Woyke,T., Teeling,H., Ivanova,N.N., Huntemann,M., Richter,M., Gloeckner,F.O., Biffelli,D., Anderson,I., Barry,K.W. et al. (2006) Symbiosis insights through metagenomic analysis of a microbial consortium. Nature, 443, 950–955.
9. Gill,S.R., Pop,M., DeBoy,R.T., Eckburg,P.B., Turnbaugh,P.J., Samuel,B.S., Gordon,J.I., Relman,D.A., Fraser-Liggett,C.M. et al. (2006) Metagenomic analysis of the human distal gut microbiome. Science, 312, 1355–1359.
10. Turnbaugh,P.J., Ley,R.E., Mahowald,M.A., Magrini,V., Mardis,E.R. and Gordon,J.I. (2006) An obesity-associated gut microbiome with increased capacity for energy harvest. Nature, 444, 1027–1031.
11. Markowitz,V.M., Ivanova,N., Korzeniewski,F., Palaniappan,K., Szeto,E., Lykidis,A., Anderson,I., Mavrommatis,K., Kunin,V. et al. (2006) An experimental metagenome data management and analysis system. Bioinformatics, 22, e359–e367.
12. Markowitz,V.M., Szeto,E., Palaniappan,K., Chen,I.A., Grechkin,Y., Chu,K., Dubchak,I., Anderson,I., Lykidis,A. et al. (2008) The integrated microbial genomes (IMG) system in 2007: data content and analysis tool extensions. Nucleic Acids Res., 36.
13. Tatusov,R.L., Koonin,E.V. and Lipman,D.J.A. (1997) Genomic perspective on protein families. Science, 278, 631–637.
14. Bateman,A., Coin,L., Durbin,R., Finn,R.D., Hollich,V., Griffiths-Jones,S., Khanna,A., Marshall,M., Moxon,S. et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.
15. Kanehisa,M., Goto,S., Kawashima,S., Okuno,Y. and Hattori,M. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280.
16. Marchler-Bauer,A., Panchenko,A.R., Shoemaker,B.A., Thiessen,P.A., Geer,L.Y. and Bryant,S.H. (2002) CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res., 30, 281–283.
17. Shah,N., Teplitsky,M.V., Minovitsky,S., Pennacchio,L.A., Hugenholtz,P., Hamann,B. and Dubchak,I.L. (2005) SNP-VISTA: an interactive SNP visualization tool. BMC Bioinformatics, 8, 292.
18. Field,D., Garrity,G., Gray,T., Selengut,J., Sterk,P., Thomson,N., Tatusov,T., Cochrane,G., Glockner,F.O. et al. (2007) eGenomics: cataloguing our complete genome collection III. Comp. Funct. Genomics, 10, 100–104.
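
Illustrative note on the d-score. The article describes the d-score of the ‘Abundance Comparison’ tool only as a standard normal statistic derived under a binomial assumption that translates into a P-value; it does not give the formula. The sketch below shows one common statistic that fits that description, a pooled two-proportion z-test, and should be read as an assumption rather than as IMG/M’s actual implementation. The function names (d_score, two_sided_p_value) and the example counts are hypothetical.

```python
import math


def d_score(count_a, total_a, count_b, total_b):
    """Pooled two-proportion z statistic (an ASSUMED form of the d-score).

    count_x: number of genes assigned to one family (e.g. one COG) in sample x.
    total_x: total number of genes considered in sample x.
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("totals must be positive")
    p_a = count_a / total_a
    p_b = count_b / total_b
    # Under the null hypothesis both samples share one binomial proportion.
    pooled = (count_a + count_b) / (total_a + total_b)
    variance = pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b)
    if variance == 0.0:
        # The family is absent from (or fills) both samples: no difference.
        return 0.0
    return (p_a - p_b) / math.sqrt(variance)


def two_sided_p_value(d):
    """Two-sided P-value of a standard normal statistic."""
    return math.erfc(abs(d) / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical counts: 120 of 10000 genes versus 80 of 10000 genes.
    d = d_score(120, 10000, 80, 10000)
    print(f"d-score = {d:.3f}, P-value = {two_sided_p_value(d):.4f}")
```

With a statistic of this kind, a family would be reported as differentially abundant when the P-value falls below the chosen significance level (e.g. 0.05 or 0.01, the levels mentioned in the text).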


Publisher: Oxford University Press
Copyright: © Published by Oxford University Press.
ISSN: 0305-1048
eISSN: 1362-4962
DOI: 10.1093/nar/gkm869
PMID: 17932063

Functional capabilities of a microbial community retrieve the list of genes assigned to this particular family can be examined using several occurrence and abundance in this genome or metagenome, while clicking on the profile-based tools. We discuss below in more detail the identifier of the family displayed on right side of the abundance profile tools that are specific to metagenome column (e.g. COG0642) will add the corresponding data comparative analysis. family to the ‘Function Cart’, as shown in pane 3 of Several ‘Abundance Profile’ tools can be used for Figure 2. For protein families in the ‘Function Cart’ a comparing the functional capabilities of metagenomes selective ‘Function Profile’ can be computed, as shown and genomes. The ‘Abundance Profile Viewer’ provides an in pane 4 of Figure 2. overview of the relative abundance of protein families The ‘Abundance Profile Viewer’ and ‘Function Profile’ (COGs and Pfams) and functional families (enzymes) tools provide a rough estimate of the functional capabil- across selected metagenomes and isolate genomes, ities of metagenomes. When metagenomes are compared Nucleic Acids Research, 2008, Vol. 36, Database issue D537 Figure 2. Abundance Profile Tools. The ‘Abundance Profile Viewer’ (1) provides an overview of the relative abundance of protein families (COGs and Pfams) and functional families (enzymes) across selected metagenomes, normalized for genome size or using z-score. Abundance of protein/ functional families is displayed as a heat map (2), with each cell hyperlinked to the list of genes assigned to a particular family. A protein family can be saved in the ‘Function Cart’ by clicking its identifier such as COG0642 (3). For protein families in the ‘Function Cart’ a selective ‘Function Profile’ can be also computed (4). The ‘Abundance Comparison’ tool (5) takes into account the stochastic nature of metagenome datasets and tests whether the differences in abundance can be ascribed to chance variation or not. 
In addition to the gene count based abundance, the results provided by this tool include an assessment of statistical significance in terms of D-score (6) or P-value. to each other or to isolate genomes, statistical tests are metagenome studies conducted at JGI and other institutes needed for estimating the statistical significance of the such as the Washington University in St. Louis, while new observed differences. The ‘Abundance Comparison’ tool, reference genomes are included from IMG. illustrated in pane 5 of Figure 2, takes into account the New visualization tools are currently developed in order stochastic nature of metagenome datasets and tests to improve the efficiency of analyzing large and complex whether the differences in abundance can be ascribed to metagenome datasets, including datasets generated with chance variation or not. The results provided by this tool new technology platforms such as the Genome Sequencer include an assessment of statistical significance in terms of TM 20 System from 454 Life Sciences. The abundance a d-score (that translates into a P-value) in addition to the profile tools will be extended to allow comparison of gene count based abundance, as shown in pane 6 of genomes and metagenomes based on higher-level func- Figure 2. The d-score is a standard normal statistics tional categories such as COG functional categories and derived under a binomial assumption where the corre- KEGG pathways. As the number of analytical tools sponding P-value provides support at different levels of increases, the organization and documentation of the significance (e.g. 0.05, 0.01). IMG user interface will be revised in order to improve its usability. FUTURE PLANS We also plan to extend IMG/M’s capability to capture more detailed metadata attributes characterizing micro- The current version of IMG/M contains data on 2301 biome samples. Such attributes are often specific to a isolate genomes, 21 metagenome samples from 9 studies habitat (e.g. 
biomedical, ecological). Samples are asso- and 3 simulated datasets from a metagenome data processing benchmarking project (4). New metagenome ciated with properties used for metagenome analysis such datasets are continuously included into IMG/M from as sample structural and morphological characteristics D538 Nucleic Acids Research, 2008, Vol. 36, Database issue 3. Chen,K. and Pachter,L. (2005) Bioinformatics for whole-genome (e.g. sample site, time of collection) and donor or host shotgun sequencing of microbial communities. PLoS Comput. Biol., data (e.g. demographic and clinical record, including 1, 106–112. diagnosis, disease, stage of disease and treatment informa- 4. Mavromatis,K., Ivanova,N., Barry,K., Shapiro,H., Goltsman,E., tion for human donors). Samples may also be involved in McHardy,A.C., Rigoutsos,I., Salamov,A., Korzeniewski,F. et al. (2007) On the fidelity of processing metagenomic sequences using clinical studies and therefore can be grouped into several simulated dataset. Nat. Meth., 4, 495–500. time/treatment study groups. In addition to extending the 5. Tyson,G.W., Chapman,J., Hugenholtz,P., Allen,E.E., Ram,R.J., data model for supporting sample metadata, we plan to Richardson,P.M., Solovyev,V.V., Rubin,E., Rokhsar,D.S. et al. improve the coherence and completeness of these annota- (2004) Community structure and metabolism through reconstruc- tion of microbial genomes from the environment. Nature, 428, tions via manual curation. We collaborate with the 37–43. Genome Standards Consortium (18) in order to ensure 6. Tringe,S., von Mering,C., Kobayashi,A., Salamov,A., Chen,K., high coverage and consistency of microbiome sample Chang,H.W., Podar,M., Short,J.M., Mathur,E.J. et al. (2005) metadata. Comparative metagenomics of microbial communities. Science, 308, The current version of IMG/M does not provide sup- 554–557. 7. Martin,H.G., Ivanova,N.N., Kunin,V., Warnecke,F., Barry,K.W., port for data curation. 
We plan to incorporate into McHardy,A.C., Yeates,C., He,S., Salamov,A.A. et al. (2006) IMG/M the annotation capabilities that are available in Metagenomic analysis of two enhanced biological phosphorus IMG for isolate genomes, adapted to handle metagenome removal (EBPR) sludge communities. Nat. Biotechnol., 24, data. 1263–1269. 8. Woyke,T., Teeling,H., Ivanova,N.N., Huntemann,M., Richter,M., Gloeckner,F.O., Biffelli,D., Anderson,I., Barry,K.W. et al. (2006) Symbiosis insights through metagenomic analysis of a microbial ACKNOWLEDGEMENTS consortium. Nature, 443, 950–955. 9. Gill,S.R., Pop,M., DeBoy,R.T., Eckburg,P.B., Turnbaugh,P.J., We thank Chris Oehmen of the Computational Biology Samuel,B.S., Gordon,J.I., Relman,D.A., Fraser-Liggett,C.M. et al. and Bioinformatics group at the Pacific Northwest (2006) Metagenomic analysis of the human distal gut microbiome. Science, 312, 1355–1359. National Laboratory for his help in carrying out the 10. Turnbaugh,P.J., Ley,R.E., Mahowald,M.A., Magrini,V., large-scale gene similarity computations for IMG/M. The Mardis,E.R. and Gordon,J.I. (2006) An obesity-associated gut work of JGI’s sequencing, assembly and annotation teams microbiome with increased capacity for energy harvest. Nature, 444, is an essential prerequisite for IMG/M. Eddy Rubin and 1027–1031. 11. Markowitz,V.M., Ivanova,N., Korzeniewski,F., Palaniappan,K., James Bristow provided, support, advice and encourage- Szeto,E., Lykidis,A., Anderson,I., Mavrommatis,K., Kunin,V. et al. ment throughout this project. The work presented in this (2006) An experimental metagenome data management and analysis article was performed under the auspices of the US system. Bioinformatics, 22, e359–e367. Department of Energy’s Office of Science, Biological and 12. Markowitz,V.M., Szeto,E., Palaniappan,K., Chen,I.A., Grechkin,Y., Chu,K., Dubchak,I., Anderson,I., Lykidis,A. et al. 
Environmental Research Program and by the University (2008) The integrated microbial genomes (IMG) system in 2007: of California, Lawrence Livermore National Laboratory data content and analysis tool extensions. Nucleic Acids Res., 36. under contract No. W-7405-Eng-48, Lawrence Berkeley 13. Tatusov,R.L., Koonin,E.V. and Lipman,D.J.A. (1997) Genomic National Laboratory under contract No. DE-AC02- perspective on protein families. Science, 278, 631–637. 14. Bateman,A., Coin,L., Durbin,R., Finn,R.D., Hollich,V., 05CH11231 and Los Alamos National Laboratory under Griffiths-Jones,S., Khanna,A., Marshall,M., Moxon,S. et al. (2004) contract No. DE-AC02-06NA25396. Funding to pay the The Pfam protein families database. Nucleic Acids Res., 32, Open Access publication charges for this article was D138–D141. provided by the Department of Energy Joint Genome 15. Kanehisa,M., Goto,S., Kawashima,S., Okuno,Y. and Hattori,M. Institute. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280. Conflict of interest statement. None declared. 16. Marchler-Bauer,A., Panchenko,A.R., Shoemaker,B.A., Thiessen,P.A., Geer,L.Y. and Bryant,S.H. (2002) CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res., 30, 281–283. REFERENCES 17. Shah,N., Teplitsky,M.V., Minovitsky,S., Pennacchio,L.A., 1. National Research Council Committee on Metagenomics. (2007) Hugenholtz,P., Hamann,B. and Dubchak,I.L. (2005) The New Science of Metagenomics: Revealing the Secrets of our SNP-VISTA: an interactive SNP visualization tool. BMC Microbial Planet. National Academies Press, Washington, DC. Bioinformatics, 8, 292. 2. Liolios,K., Tavernarakis,N., Hugenholtz,P. and Kyrpides,N. (2006) 18. Field,D., Garrity,G., Gray,T., Selengut,J., Sterk,P., Thomson,N., The genomes online database (GOLD) v.2: a Tatusov,T., Cochrane,G., Glockner,F.O. et al. (2007) eGenomics: monitor of genomeprojects worldwide. 
Nucleic Acids Res., 34, cataloguing our complete genome collection III. Comp. Funct. Genomics, 10, 100–104. D332–D334.
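Supplementary illustration. The article characterizes the d-score of the ‘Abundance Comparison’ tool only as a standard normal statistic derived under a binomial assumption, without giving a formula. The sketch below shows one common statistic of that kind, a pooled two-proportion z-test on gene counts. It is an assumption made for illustration and is not confirmed to be the computation used by IMG/M; the function names `d_score` and `two_sided_p_value` and all input counts are hypothetical.

```python
import math


def d_score(count_a, total_a, count_b, total_b):
    """Pooled two-proportion z statistic (illustrative assumption only).

    count_x: genes assigned to one family (e.g. a COG) in dataset x.
    total_x: total number of genes considered in dataset x.
    """
    p_a = count_a / total_a
    p_b = count_b / total_b
    # Pooled proportion under the null hypothesis of equal abundance.
    pooled = (count_a + count_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    if std_err == 0.0:
        # Family absent from (or saturating) both datasets: no evidence of a difference.
        return 0.0
    return (p_a - p_b) / std_err


def two_sided_p_value(z):
    """Two-sided P-value of a standard normal statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical counts: 120 of 10000 genes versus 60 of 10000 genes.
    z = d_score(120, 10000, 60, 10000)
    print(f"d-score = {z:.2f}, P-value = {two_sided_p_value(z):.2e}")
```

Under this assumption, a family would be reported as differentially abundant at the 0.05 or 0.01 level when the two-sided P-value falls below that threshold.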

Journal: Nucleic Acids Research (Oxford University Press)

Published: Jan 11, 2008
