MetQy
A.S.Martinez-Vernon et al.

Summary: With the rapid accumulation of sequencing data from genomic and metagenomic studies, there is an acute need for better tools that facilitate their analyses against biological functions. To this end, we developed MetQy, an open-source R package designed for query-based analysis of functional units in [meta]genomes and/or sets of genes using the Kyoto Encyclopedia of Genes and Genomes (KEGG). Furthermore, MetQy contains visualization and analysis tools and facilitates KEGG's flat file manipulation. Thus, MetQy enables better understanding of metabolic capabilities of known genomes or user-specified [meta]genomes by using the available information and can help guide studies in microbial ecology, metabolic engineering and synthetic biology.

Availability and implementation: The MetQy R package is freely available and can be downloaded from our group's website (http://osslab.lifesci.warwick.ac.uk) or GitHub (https://github.com/OSS-Lab/MetQy).

Contact: [email protected]

© The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

The advent of molecular biology has made the characterization and analysis of genomic sequences a key part of all areas of life sciences research. In the case of single-cell organisms, identification of specific functions within the genome directly influences our ability to assess their fitness in a given environment and their potential roles in biotechnology. In particular, we should theoretically be able to translate genomic data into physiological predictions. Genomic databases are a prerequisite for making such predictions, but their full use also requires computational tools that allow easy access and systematic analyses of the data.

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is one of the oldest and most comprehensive collections of databases. Its primary aim has been the digitising of current knowledge on genes and molecules and their interactions (Kanehisa, 1997; Kanehisa and Goto, 2000) and it includes 16 databases and 3 sequence data collections (Kanehisa et al., 2017). While these data can be analysed via different tools on the KEGG website, the existing web interface allows only specific retrieval of information and analyses. Furthermore, although the whole of the data can be downloaded via (paid) FTP access, the systematic analysis of these data in a user-defined manner remains difficult, and developing computational analysis tools for this purpose remains a niche expertise that is still not available in many research labs.

There are several specific tools that make certain aspects of the KEGG data more available to a wider user base. Examples include PICRUSt (Langille et al., 2013), BlastKOALA and GhostKOALA (Kanehisa et al., 2016), all of which focus on metagenomics data analysis. However, to our knowledge there are no tools that facilitate the analyses and information retrieval from KEGG with regard to studying the relationship between genomic data and physiological function. Therefore, we have developed MetQy, an open-source, easy-to-use and readily expandable R package for such analyses. MetQy uses the R platform because it is commonly used among biologists, it is featured in undergraduate education, and it contains extensive statistical packages which are useful in subsequent data analyses.

MetQy was developed to readily interface between the KEGG orthology, module and genome databases and perform automated cross-analyses on them. It consists of a set of functions that allow querying genes, enzymes and functional modules across genomes and vice versa, thereby enabling better understanding of genotype-phenotype mapping in single-celled organisms and providing guidance for cellular engineering in synthetic biology. MetQy can be used 'as-is', since the relevant components of the KEGG databases (downloaded on 20/02/2018) are included within the package. The included KEGG data constitute only part of the entire encyclopedia and are 'hidden' in the package so that direct access to the data is not possible, complying with the KEGG licence. Users with a paid KEGG subscription can use MetQy parsing functions to update the data that the package uses. The MetQy package and GitHub wiki contain extensive documentation and usage examples for each function.

2 Software features

MetQy contains three main groups of software functions: data query, parsing, and analysis and visualization. These are briefly described below. For more detailed information and usage examples, please see the package documentation and GitHub wiki.

2.1 Metabolic query functions

The query family of functions allows the user to query the KEGG data structures in a systematic (and automated) way. Users without FTP access can analyse the KEGG genome, module and ortholog databases indirectly by using this family of functions on built-in formatted KEGG data which is not directly accessible by the user. Additionally, these functions feature optional arguments that allow users to provide up-to-date data (by using the parsing functions on KEGG FTP data) or their own data structures, such as custom-made KEGG-style modules. Additional query functions can be readily developed by the users, allowing expansion of MetQy. MetQy features five query functions for key functional analyses.

query_genomes_to_modules calculates the module completeness fraction (mcf) given a set of genes or genomes. It returns a matrix showing the mcf for each module. The mcf calculation is based on the block-based, logical KEGG module definition (see GitHub wiki). The function input is the modules to be queried (default is all KEGG modules) and the set of genes to be considered. The gene set can be provided either as a set of KEGG ortholog (K) or Enzyme Commission (EC) numbers, or as genome identifier(s), with the latter case resulting in automatic retrieval of all genes for the genome(s).

While the implementation of the query_genomes_to_modules function is similar to KEGG Mapper [a web interface tool that performs a similar task (http://www.genome.jp/kegg/mapper.html; Kanehisa et al., 2017)], there are several key differences. The KEGG Mapper web interface does not allow for module-specific evaluation nor for automation of the analysis. Our implementation allows for specific KEGG modules to be evaluated, given their ID, name and/or class. It also provides the capacity to determine the mcf of a module, rather than only identifying modules that are complete or that have one block missing. Finally, as EC numbers are widely used in systems biology, we used the KEGG orthology to translate the K number-based module definitions to EC number-based module definitions. This allows for module evaluation based on both K and EC numbers.

query_module_to_genomes determines the KEGG genome(s) in which user-specified module(s) reach a given mcf threshold (default 1, i.e. complete). query_gene_to_modules determines those KEGG modules that feature specific user-specified gene(s). query_genes_to_genomes determines which KEGG genomes contain user-specified gene(s). query_missingGenes_from_module determines the missing gene(s) (K or EC numbers) that would be required to have a complete KEGG module within a genome (or gene set).

2.2 Parsing KEGG databases

MetQy comes with built-in data components of KEGG. It is, however, possible for users with FTP KEGG access to update these data components to their latest version. The MetQy parsing functions allow the production of the updated data, by formatting the relevant KEGG data files into R structures. They can also be used as stand-alone functions to introduce KEGG data into the R environment. All query functions have been designed to take in these updated data.

MetQy features two generic parsing functions that deal with the two main KEGG file types: files without extension (parseKEGG_file) and '.list' files (parseKEGG_file.list). parseKEGG_file.list formats KEGG files containing a mapping between two KEGG database entries into binary matrices. For example, the mapping between K numbers and EC numbers is contained in the 'ko_enzyme.list' file and shows which K numbers correspond to which EC numbers. parseKEGG_file formats a KEGG database file into an R data frame by automatically detecting fields of the KEGG data and transforming these into variables. MetQy also contains file-specific functions that use these generic functions.

2.3 Analysis and visualization

The analysis and visualization family of functions is designed primarily to facilitate the analysis of the output of the query_genomes_to_modules function, which generates a matrix of mcf values for the genomes and modules analysed. There are three analysis and five plot (visualization) functions.

analysis_pca_mean_distance_calculation is designed to process the output of a principal component analysis (PCA) performed on the mcf matrix (this can be done, for example, by applying the R function stats::prcomp). It uses the resulting numeric matrix containing the principal components to calculate the mean Euclidean distance as a measure of spread or variation of the data. This assumes that every row represents a multi-dimensional point (a genome in this case), with coordinates given in the corresponding columns. The mean Euclidean distance of p points is calculated by summing the pairwise Euclidean distances in n dimensions between all the points and dividing by the total number of distances.

analysis_pca_mean_distance_grouping takes in the numeric matrix resulting from performing a PCA on the mcf matrix and a factor, such as genus, to group the rows (genomes) of the matrix and uses the previous function (analysis_pca_mean_distance_calculation) to calculate the mean Euclidean distance for each group.

analysis_genomes_module_output takes in the mcf matrix (genomes and modules as rows and columns, respectively), produces a series of analyses and, by default, generates a report automatically. These analyses comprise: (i) reporting the number of genomes (data sets) and modules analysed, (ii) producing a heatmap of the mcf of all genomes and modules analysed, (iii) a boxplot of the mcf across all genomes for each module, (iv) a scatter plot of the SD of the mcf across all genomes for each module and (v) identifying any modules that have a constant (zero-variance) mcf across all genomes and producing a table. In addition, the function performs, for every factor group specified, the following analyses: (vi) group the genomes according to that factor and create a heatmap of the mean mcf for each module across the genomes that make up each group, (vii) carry out a PCA on all the mcf data, showing the cumulative variance and generating a PC plot, (viii) visualize the PC plot with an overlay of the factor grouping and, finally, (ix) measure the within-group (per factor) variance, using the mean Euclidean distance as a proxy for spread.

plot_heatmap can be used to visualize the mcf calculated by the query_genomes_to_modules function as a colour-mapped matrix (with genomes against modules). plot_scatter_byFactors allows the automatic grouping of data as determined by a factor and produces a scatter plot with groups overlaid by colour. plot_scatter is useful to visualize numerical data associated with data groups generated by a factor. This category-based visualization can be used to plot the SD for each module's mcf or the mean Euclidean distance (see the analysis description above for more details). plot_variance_boxplot takes the mcf matrix and produces a boxplot for each module. plot_sunburst makes a hierarchical arrangement of categorical data, such as KEGG module classes, and represents it in a dartboard style, where the inner ring contains the most general (highest level) information, which can be divided into sub-categories (rings going outwards). The final ring represents the most specific level of information and can be coloured by either the counts of the data or an additional set of values provided by the user (refer to the GitHub wiki for more information).

3 Uses and applications

MetQy facilitates the general usability of the KEGG database and allows users to gain qualitative information about the functional capacity of a given organism or gene set. Anticipated uses of the tool include synthetic biology, where it can facilitate the design and guiding of metabolic engineering studies by identifying missing genes needed for an organism to have a complete KEGG module, and identifying KEGG genomes with desired metabolic capabilities. For systems biology applications, it allows identification of key physiological features of organisms and development of stoichiometric metabolic models by analysing module completeness in specific genomes and identifying transporter modules and carbon utilization routes in genomes. Finally, in microbial ecology, MetQy can allow species-function mappings in metagenomes and insights into functional capabilities of ecological groups by analysing the metabolic capacity of novel genomes from metagenomic studies. Organisms can be put into different functional groups, and the functional profiles of different environments compared.

3.1 Example of usage

To demonstrate some possible uses of MetQy functions, we have included a coded example on the MetQy GitHub wiki pages. This example demonstrates how MetQy can be used to retrieve KEGG genome data and how the metabolic functions of the extracted/matched organisms can be queried/identified in terms of KEGG modules. In the presented example, we evaluate the module completeness fraction (mcf) in methanogen genomes, focusing on sample KEGG modules loosely relating to the anaerobic digestion process (note that any user-specified modules, or all KEGG modules, can be used in a real analysis). We then visualize the results of this analysis as a heatmap using the MetQy function plot_heatmap (Fig. 1A). In this example case, the analysis highlighted a specific module that is expected to be essential for methanogenesis (M00567: Methanogenesis, CO2 => methane) and that was, as expected, almost fully complete in most genomes (mcf ≥ 0.75 in 96% of genomes), but incomplete in some genomes. This prompted us to analyse the genomes that had a lower mcf for this key module. We thus identified the genome T04272 (Methanogenic archaeon ISO4-H5) as an interesting methanogen to focus on and used another MetQy function, plot_sunburst, to analyse the mcf of all of its modules through a sunburst plot (Fig. 1B). Furthermore, we identified the genes that were missing for that module to be complete (for that organism).

Fig. 1. Visualization of some of the results obtained from an example analysis (Section 3.1). (A) Heatmap representation of module completeness fraction (mcf) across selected genomes (y-axis) and modules (x-axis). The mcf value is colour-coded as per the mapping scheme shown. (B) A sunburst diagram showing the mcf of different modules and their functional classes as obtained from the analysis of a specific genome (genome ID: T04272). The mcf value is colour-coded as per the mapping scheme shown. The data for both plots were obtained using the MetQy function 'query_genomes_to_modules'.

While this example highlights how specific MetQy functions can be utilized on their own to develop a specific analysis pipeline, it is also possible to use MetQy functions to perform an automated analysis on a set of genomes grouped by genus (or another grouping factor provided by the user, e.g. species or sample origin) and generate a comprehensive report in an automated fashion (see the description of the analysis_genomes_module_output function, the PDF report file in the GitHub repository and the worked-out example in the GitHub wiki).

Acknowledgement

The authors acknowledge Sean Aller for helpful comments and David Selby for sharing his expertise in developing R packages.

Funding

This work is funded by The University of Warwick and by the Biotechnological and Biological and Engineering and Physical Sciences Research Councils (BB- and EPSRC), with grant IDs: EP/L016494/1 (to the Centre for Doctoral Training in Synthetic Biology, SynBioCDT), BB/K003240/2 (to OSS), BB/M017982/1 (to the Warwick Integrative Synthetic Biology Centre, WISB).

Conflict of Interest: none declared.

References

Kanehisa,M. (1997) A database for post-genome analysis. Trends Genet., 13, 375–376.
Kanehisa,M. and Goto,S. (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res., 28, 27–30.
Kanehisa,M. et al. (2016) BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. J. Mol. Biol., 428, 726–731.
Kanehisa,M. et al. (2017) KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res., 45, D353–D361.
Langille,M.G.I. et al. (2013) Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat. Biotechnol., 31, 814–821.
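
Illustrative sketch of the mean Euclidean distance measure (Section 2.3)

The spread measure described for analysis_pca_mean_distance_calculation and analysis_pca_mean_distance_grouping can be sketched in a few lines. The block below is a minimal Python illustration of the stated definition only (sum of all pairwise Euclidean distances divided by the number of distances); it is not MetQy's R implementation. The function names, the toy coordinates and the choice to return NaN for a group with fewer than two points are assumptions made for this example.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all unordered pairs of points.

    Each point is a sequence of n coordinates, e.g. one genome's
    principal component scores. For p points there are p*(p-1)/2 pairs.
    """
    pairs = list(combinations(points, 2))
    if not pairs:
        # Assumption for this sketch: with fewer than two points no
        # distance exists, so NaN is returned.
        return float("nan")
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def mean_distance_by_group(points, labels):
    """Apply mean_pairwise_distance within each group (e.g. each genus)."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return {label: mean_pairwise_distance(members)
            for label, members in groups.items()}


# Toy principal component coordinates (made up), one row per genome.
pcs = [(0.0, 0.0), (3.0, 4.0), (0.0, 8.0), (10.0, 0.0)]
genus = ["A", "A", "A", "B"]  # hypothetical grouping factor

print(mean_pairwise_distance(pcs[:3]))    # pairwise distances are 5, 8 and 5
print(mean_distance_by_group(pcs, genus))
```

For the three points labelled "A", the pairwise distances are 5, 8 and 5, so the mean is 6. Group "B" contains a single genome and therefore has no pairwise distance in this sketch.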
Bioinformatics – Oxford University Press
Published: Jun 5, 2018