Briefings in Bioinformatics

Publisher:
Oxford University Press
ISSN:
1467-5463
Scimago Journal Rank:
121
journal article
Open Access Collection
Microbial genome analysis: the COG approach

Galperin, Michael Y; Kristensen, David M; Makarova, Kira S; Wolf, Yuri I; Koonin, Eugene V

2019 Briefings in Bioinformatics

doi: 10.1093/bib/bbx117
pmid: 28968633

Abstract

For the past 20 years, the Clusters of Orthologous Genes (COG) database has been a popular tool for microbial genome annotation and comparative genomics. Initially created for the purpose of evolutionary classification of protein families, the COGs have been used, apart from straightforward functional annotation of sequenced genomes, for such tasks as (i) unification of genome annotation in groups of related organisms; (ii) identification of missing and/or undetected genes in complete microbial genomes; (iii) analysis of genomic neighborhoods, in many cases allowing prediction of novel functional systems; (iv) analysis of metabolic pathways and prediction of alternative forms of enzymes; (v) comparison of organisms by COG functional categories; and (vi) prioritization of targets for structural and functional characterization. Here we review the principles of the COG approach and discuss its key advantages and drawbacks in microbial genome analysis.

Keywords: comparative genomics, genome annotation, enzyme evolution, orthologs, paralogs

Introduction

The success of the entire genomic enterprise critically depends on reliable genome annotation, i.e. correct identification of the genes, which includes accurate determination of gene boundaries and functional annotation of the gene product(s). The Clusters of Orthologous Groups of proteins (COGs) database was devised as a way to allow phylogenetic classification of proteins from complete microbial genomes [1]. While the COG system has grown over the years (Figure 1), the goal has always been for each COG to represent a family of orthologous protein-coding genes.
However, when the compared genomes are separated by long evolutionary distances and possess substantially different numbers of genes, evolutionary relationships between these genes are not accurately captured by the straightforward definition of orthology as a one-to-one relationship because of such evolutionary processes as lineage-specific gene duplication and loss, as well as horizontal gene transfer [7, 8]. Owing to these complexities of the evolutionary relationships among genes, the COGs have become families of co-orthologous genes that embody one-to-many and many-to-many relationships. Hence the term ‘orthologous groups’ (of proteins) that embraces such more complex evolutionary relationships among genes and simplifies the assignment of (general) functions to genes and their products. As the genomic community gradually embraced the notion of co-orthologous relationships between genes [7–9], the COGs have been re-branded Clusters of Orthologous Genes [10].

Figure 1. Evolution of the COG system. The numbers in parentheses indicate the number of bacterial, archaeal and eukaryotic genomes, respectively, included in the respective COG release [1–6].

During the 20 years since the inception of the COG project, several alternative systems for orthology analysis have been developed [11–20], some of them implementing genome-wide phylogenetic analysis, which, in principle, is supposed to provide robust resolution of evolutionary relationships between orthologs and paralogs. In practice, however, such methods are computationally expensive and fraught with artifacts at different stages, and therefore, simpler approaches such as the COGs continue to be widely used in microbial genomics.
The popular EggNOG database (‘Evolutionary genealogy of genes: Non-supervised Orthologous Groups’, http://eggnog.embl.de) applies essentially the same approach as COGs to a much greater number of genomes, but fully relies on automated assignment of orthologs and does not annotate the orthologous gene clusters [21, 22].

Here we briefly review the key principles underlying the COG approach and its applications for genome annotation and comparative analysis. Rather than providing a detailed description of the COG construction methods and the resulting collections of (co)orthologous gene families, our goal here is to highlight the unresolved problems in functional annotation and the possible ways to address them. For a description of the COG database per se, the reader is referred to the previous publications [1–6].

Differences between COGs and other collections of gene and protein families

Functional annotation of proteins encoded in sequenced genomes typically relies on BLASTP [23] or, more recently, HMMer [24] search of protein databases for the most similar sequence, followed by a (semi-)automated transfer of the best hit annotation to the new protein. This approach has a number of well-known drawbacks [25–28]. First, if the sequence similarity is low, there is a distinct possibility that the two proteins have different functions; this problem is exacerbated in cases of transitive annotation of multiple proteins in this manner. Second, the reliance on the best hit often results in a protein ending up being annotated as ‘uncharacterized’ and/or ‘putative’ even when the function of a close homolog is already known. Third, differences in domain architectures of homologous proteins often result in erroneous functional assignment.
Given these systematic errors, advanced approaches for functional annotation of proteins increasingly rely on curated databases of protein sequences [29], such as UniProt KnowledgeBase or PANTHER [30, 31], and protein domains, such as Pfam, SMART or SUPERFAMILY [32–34]. Aggregated domain databases InterPro and CDD, which allow an easy comparison of the annotations provided by various databases, often prove to be the most efficient tools [35, 36]. The COG approach shares some features with the curated protein family databases but differs from them in several important aspects.

Use of complete genomes

A distinct feature of the COG approach is the reliance on complete genome (proteome) sequences, which allows relatively simple and reliable recognition of potential orthologs and paralogs among all proteins encoded in the given genome. With incomplete genomes, there always remains the obvious possibility that the true ortholog of the given gene failed to make it into the final assembly. Like other methods for ortholog identification, the COG approach relies on sequence similarity searches against selected proteomes, aimed at the identification of pairwise best hits. However, instead of imposing predetermined similarity scores for delineation of likely homologs, the COG approach extends the popular concept of two-way (often also called bidirectional, symmetric or reciprocal) best BLAST hits in each particular proteome by adding the more stringent requirement of forming a triangle, or three-way set of best BLAST matches (thus forcing the mathematical property of transitivity [7, 9]) to form a new COG. Owing to the presence of potential paralogs from the same lineage (inparalogs [37]), the original approach [1] only required that at least one such triangle be included that represented symmetrical (bidirectional) matches, with that criterion being imposed by manual supervision of groups initially constructed with an automated method.
Later, the process of detecting and collapsing such obvious paralogs was performed by an automated method, introduced in the first major update of the COGs [3] and later codified in the EdgeSearch algorithm [38–40]. Proteins from new genomes can be added to the existing COGs by using the new sequences as queries for an RPS-BLAST search of the collection of position-specific scoring matrices generated from COG-specific multiple sequence alignments [41]. The query is assigned to the COG that yields the best score in this search. Technically, this approach is analogous to that used to search domain databases, such as InterPro and CDD, but because the COGs contain previously identified orthologs, in this case, the best hit gives a strong indication of orthology. A detailed discussion of other methods for ortholog identification can be found e.g. in [7, 9, 42–45]. In addition to sequence similarity and phylogenetic proximity, a potentially useful criterion is genomic synteny [39, 40], which, however, in practice is typically used for manual verification of the existing assignments at the quality-control stage.

Flexible similarity cutoffs

The advantage of the triangle-based approach for orthology inference is that it dispenses with artificially imposed sequence similarity cutoffs for different protein families, some of which evolve at dramatically different rates, and permits creation of COGs from proteins that span the entire range of similarity, from barely detectable to extremely high. For example, Na+-binding c subunits (COG0636) of Na+-translocating ATP synthases from bacteria and archaea have low sequence similarity and might not be recognized as orthologs using arbitrarily high BLAST cutoffs; to further complicate the annotation, the archaeal protein is often referred to as subunit K [46]. With strict BLAST cutoffs, recognition of orthology becomes particularly complicated for short proteins, including some ribosomal proteins.
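The triangle criterion described above can be illustrated with a minimal Python sketch. It is not the COG or EdgeSearch implementation: the input format (tuples of query, query genome, subject, subject genome, score), the function names and the toy data are all hypothetical, within-genome hits (in-paralogs) are simply ignored, and the final step of merging triangles that share a side is an assumption about how seed triangles grow into clusters. Note that no similarity cutoff appears anywhere; only the rank of each hit matters.

```python
from collections import defaultdict
from itertools import combinations


def bidirectional_best_hits(hits):
    """Return reciprocal best-hit pairs from (query, q_genome, subject, s_genome, score) tuples."""
    best = {}       # (query, subject_genome) -> (score, subject)
    genome_of = {}  # protein -> genome
    for q, qg, s, sg, score in hits:
        genome_of[q], genome_of[s] = qg, sg
        if qg == sg:
            continue  # simplification: paralogs within one genome are ignored
        key = (q, sg)
        if key not in best or score > best[key][0]:
            best[key] = (score, s)
    bbh = set()
    for (q, _sg), (_score, s) in best.items():
        back = best.get((s, genome_of[q]))
        if back is not None and back[1] == q:
            bbh.add(frozenset((q, s)))
    return bbh, genome_of


def triangles(bbh, genome_of):
    """Find triples of proteins from three genomes that are all pairwise reciprocal best hits."""
    neighbors = defaultdict(set)
    for pair in bbh:
        a, b = tuple(pair)
        neighbors[a].add(b)
        neighbors[b].add(a)
    found = set()
    for a in neighbors:
        for b, c in combinations(sorted(neighbors[a]), 2):
            if c in neighbors[b] and len({genome_of[a], genome_of[b], genome_of[c]}) == 3:
                found.add(frozenset((a, b, c)))
    return found


def merge_triangles(tris):
    """Merge triangles that share a side (assumed growth rule), using union-find."""
    parent = {t: t for t in tris}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    by_edge = defaultdict(list)
    for t in tris:
        for edge in combinations(sorted(t), 2):
            by_edge[edge].append(t)
    for group in by_edge.values():
        for t in group[1:]:
            parent[find(t)] = find(group[0])
    clusters = defaultdict(set)
    for t in tris:
        clusters[find(t)] |= t
    return sorted(sorted(c) for c in clusters.values())


def demo_hits():
    """Toy, made-up similarity scores for six hypothetical proteins in four genomes."""
    genome = {"a1": "G1", "a2": "G2", "a3": "G3", "a4": "G4", "b1": "G1", "b2": "G2"}
    pairs = [("a1", "a2", 200), ("a1", "a3", 180), ("a2", "a3", 190),
             ("a4", "a1", 90), ("a4", "a2", 85), ("b1", "b2", 300), ("b1", "a2", 40)]
    hits = []
    for x, y, score in pairs:
        hits.append((x, genome[x], y, genome[y], score))
        hits.append((y, genome[y], x, genome[x], score))
    return hits
```

In this toy data set, the low-scoring protein a4 joins the cluster formed by a1, a2 and a3 because it closes a triangle with a1 and a2, whereas the high-scoring pair b1 and b2 forms no triangle and therefore seeds no cluster.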
The COG approach also allows separation of closely related paralogs, such as, for example, 3-isopropylmalate dehydrogenase (LeuB) and isocitrate dehydrogenase (Icd), members of COG0473 and COG0538, respectively, that in most other databases are assigned to the same family (PF00180 in Pfam, SM01329 in SMART, PS00470 in PROSITE, SSF53659 in SUPERFAMILY).

Protein family granularity in COGs

Flexible similarity cutoffs have the built-in advantage of allowing the COGs to be as wide or as narrow as dictated by the evolutionary history of a given gene family. In the above example, the LeuB/Icd family is split into two COGs, which reflects the wide distribution of these enzymes among bacteria and archaea. However, this family also includes two more closely related enzymes. One of these is tartrate dehydrogenase/decarboxylase that has been characterized in Pseudomonas putida and Agrobacterium vitis [47, 48]. This enzyme is closely related to LeuB, still has the isopropylmalate dehydrogenase activity and has probably evolved from LeuB in the course of the adaptation of the host bacteria to life on tartrate-rich grapevine [47]. The fourth member is homoisocitrate dehydrogenase AksF, which participates in the biosynthesis of the methanoarchaeal coenzyme B [49]. Homoisocitrate dehydrogenase has been described in Methanocaldococcus jannaschii, and a variety of methanogenic archaea encode closely related proteins [49]. At this time, there are too few tartrate dehydrogenases to form a separate COG. As for homoisocitrate dehydrogenase, LeuB and AksF are co-orthologs with respect to the bacterial LeuB enzymes. Accordingly, all members of this family are currently assigned to the same COG0473 (LeuB) and the same arCOG01163 in archaeal COGs [10]. In the future, methanogenic homoisocitrate dehydrogenases might form an archaea-specific COG. For now, however, the split of the family into two COGs appears to represent a reasonable compromise.
In contrast, TIGRfams [50] and NCBI Protein Clusters [51] databases divide this family into 6 and 13 clusters, respectively. However, because sequence similarity alone does not allow unequivocal functional assignment, most of these clusters end up with the same functional annotation, either LeuB or Icd.

Phyletic profiles in COGs

An important feature of the COG approach is that a protein (or domain) either belongs or does not belong to a given COG. Accordingly, a genome is either represented in the given COG (by one or more proteins) or it is not. Thus, the COG approach can dispense with the matrix of similarity scores and replace them with the simple yes/no (1 or 0) representation or, alternatively, indicate the number of paralogous members of the given COG in the given genome. Such phyletic patterns, i.e. the patterns of species that are either represented or not represented in the given COGs, are a powerful tool for functional annotation of microbial genomes and evolutionary reconstruction. The most obvious use of phyletic patterns is for identification of supposedly essential genes that are missing in certain genomes [4, 52]. Consistent application of this principle offers an easy way to evaluate genome quality [53, 54], which is why the NCBI’s prokaryotic genome annotation pipeline currently involves routine checking of the submitted genomes for the presence of certain (nearly) universal genes, including those encoding ribosomal proteins and translation system components, as well as RNA polymerase subunits [55, 56]. A conceptually similar application of phyletic patterns involves analysis of metabolic pathways and multi-protein functional systems. Obviously, metabolic pathways should not allow accumulation of any intermediate that cannot be further metabolized and represents a dead end: to avoid poisoning the cell, such an intermediate would have to be exported into the surrounding milieu.
Likewise, an intermediate in a functional metabolic pathway needs to be either imported or synthesized within the cell. Although the possibility of ‘distributed’ pathways cannot be discarded, these simple considerations prove productive when COGs are superimposed on the metabolic map to identify the intermediates that have no known enzymes to produce or metabolize them. Identification of such gaps in pathways often suggests alternative enzymes that can be then identified experimentally [54, 57].

Functional categories of genes in COGs

Another widely used feature of the COG system is the assignment of all COGs to one of the 26 functional categories. These categories have evolved over time, with several of them (B, Y, W, Z) describing functions that are found primarily in eukaryotic cells. The recently added V (Defense mechanisms) and X (Mobilome) categories provide for a more detailed description of the dynamics of bacterial and archaeal genomes. Functional categories are assigned in accordance with the cellular roles of the respective COGs, so that, for example, peptide uptake systems are included into category E (Amino acid transport and metabolism), rather than in general ‘Transport’ or other similar categories. Two functional categories of uncharacterized proteins, R (genes with only a generic functional prediction, typically of the biochemical activity) and S (uncharacterized genes), are particularly useful, as they reflect the current level of understanding of protein function on the proteome level and allow tracing the progress in experimental characterization and computational analysis of widespread protein families. The fraction of proteins from a given genome assigned to certain COG functional categories turned out to be a useful whole-genome feature [58] and has been adopted by the Genome Standards Consortium as an essential characteristic of the newly sequenced genomes (https://standardsingenomics.biomedcentral.com/submission-guidelines).
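Both the phyletic patterns and the per-category fractions discussed above reduce to simple counting over a table of COG assignments. The following Python sketch shows one way to do this; it is an illustration under stated assumptions, not the procedure used by the COG database or by the NCBI pipeline. The input format (tuples of genome, protein and COG identifier), the function names, the `min_fraction` threshold and the identifiers `COGx1` to `COGx3` are hypothetical, and each COG is assumed to carry a single category letter.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: four genomes, three made-up COGs.
GENOMES = ["G1", "G2", "G3", "G4"]
ASSIGNMENTS = [
    ("G1", "g1p1", "COGx1"), ("G1", "g1p2", "COGx2"),
    ("G1", "g1p3", "COGx2"), ("G1", "g1p4", "COGx3"),
    ("G2", "g2p1", "COGx1"), ("G2", "g2p2", "COGx2"),
    ("G3", "g3p1", "COGx1"), ("G3", "g3p2", "COGx3"),
    ("G4", "g4p1", "COGx2"),
]
# Category letters as used in the text: E, V and S.
CATEGORY_OF = {"COGx1": "E", "COGx2": "V", "COGx3": "S"}


def phyletic_patterns(assignments, genomes):
    """Map each COG to a tuple of 1/0 flags, one per genome, in the order of `genomes`."""
    present = defaultdict(set)
    for genome, _protein, cog in assignments:
        present[cog].add(genome)
    return {cog: tuple(int(g in seen) for g in genomes)
            for cog, seen in present.items()}


def missing_widespread_cogs(patterns, genomes, min_fraction=0.9):
    """List, per genome, the COGs found in at least `min_fraction` of genomes but absent from it."""
    missing = defaultdict(list)
    for cog in sorted(patterns):
        pattern = patterns[cog]
        if sum(pattern) / len(genomes) >= min_fraction:
            for genome, flag in zip(genomes, pattern):
                if flag == 0:
                    missing[genome].append(cog)
    return dict(missing)


def category_fractions(assignments, category_of, genome):
    """Fraction of the COG-assigned proteins of one genome that fall in each category."""
    counts = Counter(category_of[cog]
                     for g, _protein, cog in assignments if g == genome)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}
```

With the toy data, `phyletic_patterns(ASSIGNMENTS, GENOMES)["COGx1"]` is `(1, 1, 1, 0)`, and calling `missing_widespread_cogs` with `min_fraction=0.75` flags `COGx1` as absent from `G4`. A flagged gene may be truly absent, replaced by an unrelated enzyme, or simply missed in annotation, so such a list is a starting point for inspection rather than a conclusion.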
Full-length proteins and domains as COG members

Most existing protein family databases include either full-length sequences (NCBI protein database, UniProt, PANTHER, TIGRfams [30, 31]) or separate protein domains (Pfam, SMART, SUPERFAMILY, etc. [32–34]). The COG approach allows a degree of flexibility: conserved domain combinations can be included in separate COGs without the need to split them into individual domains. As an example, along with the COG0784 for individual CheY-like receiver (REC) domains of the two-component signal transduction systems (which also includes stand-alone CheY/Spo0F proteins), the current COG collection includes 15 additional REC-domain COGs, such as COG2197 for DNA-binding response regulators of the NarL/FixJ family, containing REC and helix-turn-helix domains; COG0745 for DNA-binding response regulators of the OmpR/PhoB family, which consist of REC and winged-helix domains; COG3279 for DNA-binding response regulators of LytR/AlgR family, containing REC and LytTR domains, and many others [59, 60]. The discrimination between the architectures of proteins that share a common domain provides for a finer granularity of annotation and allows better characterization of the respective proteins. However, non-critical use of COGs for high-throughput domain annotation can result in egregious errors, whereby a multidomain protein receives a misleading annotation of its best COG hit that has a completely different domain architecture. The recent attempts to identify specific domain architectures and limit annotation transfer to proteins with the same domain combination [36] have the potential to resolve this issue.

COG annotation

Functional annotation of COGs, including assignment of COG names, is based on two key principles. First, reliance on orthologous relationships for the COG construction makes it likely, according to the ‘orthology conjecture’, that members of each COG have equivalent functions [7] (with only rare known exceptions [61]).
Accordingly, experimentally characterized functions of a single member of a given COG often can be used to assign the functional annotation to the entire COG. Indeed, in most cases, subsequent characterization of additional COG members has confirmed the validity of the initial assignment [6]. Second, all COG names are manually curated with the goal of creating the most appropriate annotation, avoiding the common annotation errors [25], as well as over- and under-predictions. Thus, for those COGs whose members have two or more distinct functions, the annotations (COG names) get expanded to cover the entire range of experimental results. In some cases, the growing number of distinct paralogs justifies splitting a COG into two or more separate COGs with higher sequence conservation and more narrowly defined functional annotation. Many COGs, however, do not include any experimentally characterized members so that their annotation has to rely on computational analyses alone. In such cases, inference of a robust annotation requires careful analysis of their sequences, structures, genomic neighborhoods, phyletic patterns and other cues, which requires a substantial effort that, however, often leads to interesting insights [62, 63]. Such efforts are essential for increasing the fraction of proteins that belong to well-characterized COGs beyond the figure of 60–70% that is currently obtained for most bacterial and archaeal genomes [6]. The overall genome coverage by COGs (including the R- and S-type COGs) has stayed largely the same over the years and currently ranges from ∼65% of the total proteomes in Chlamydiae and Planctomycetes to >80% in Synergistetes and Thermotogae (Figure 2). This stable coverage of bacterial and archaeal genomes by COGs, despite the addition of numerous new genomes, is likely to reflect the open pangenomes of most prokaryotes [65–68] and the extremely rapid turnover of the poorly conserved gene class. 
Figure 2. Proteome coverage by the current version of COGs. Archaeal and bacterial phyla and selected classes of Firmicutes and Proteobacteria are listed as in the latest release of the COG database [6]. The orange and blue columns show the fractions of the respective proteomes covered by COGs in each taxonomic group (including R- and S-type COGs that consist of poorly characterized or uncharacterized genes), averaged over the members of that group in the COGs (the respective numbers are shown in parentheses). The ‘Other archaea’ group includes two genomes representing, respectively, Kor- and Nanoarchaeota; the ‘Other bacteria’ group includes members of Deferribacteres, Nitrospirae, Verrucomicrobia and other sparsely sampled phyla, as well as representatives of several candidate phyla. The bright yellow rectangles on top of the archaeal columns indicate the additional coverage of the archaeal proteomes in the latest version of arCOGs [10]. The hatched rectangles indicate the additional coverage of the archaeal and bacterial proteomes in the ATGC-COGs from the latest version of the ATGCs database [64].

Although COG annotations typically describe protein families, in the most recent release of the COG database, owing to the popularity of COG-based annotation, many COG names have been modified to allow functional annotation of individual proteins [6].

Unresolved problems in the COG approach

The wide use of COGs for microbial genome annotation and comparative analysis has illuminated several problems inherent in the COG approach that warrant a brief discussion. These difficulties include, among others, the issues of COG hierarchy, inclusion of paralogs, splitting proteins into separate domains and scalability of the COG approach.

Orthologs, paralogs and xenologs: the missing hierarchy

The very definition of orthology [69] inherently depends on the group of organisms under consideration [7, 9, 37]. For example, in most members of the Crenarchaeota, the family B DNA polymerases are represented by several paralogs which form distinct orthologous families (arCOG00328, arCOG00329, arCOG15272 and others) within this archaeal phylum (all these genes are out-paralogs in Crenarchaeota). In contrast, most of those bacteria that possess the polB gene have a single copy, which is co-orthologous to all archaeal polB genes, so archaea and bacteria share only one orthologous family of polB, COG0417 (all these genes are co-orthologs among prokaryotes with several in-paralogs in archaea).
Such complex relationships among homologous genes confound COG analysis because the definition of orthology becomes mutually dependent with the phyletic patterns (the definition of orthology depends on the list of organisms where these genes are present, which itself depends on which of the homologous genes are considered orthologs and which are not). Several formal and informal empirical rules have been proposed to resolve this conundrum [70]. The hierarchical orthologous groups have been implemented in such databases as EggNOG, OMA and OrthoDB [14, 22, 71]. In most of the current COG collections, all COGs are equal, and there is no hierarchical structure; only in arCOGs, an extra level of super-COGs has been introduced to combine paralogous COGs into higher level clusters. Although the non-hierarchical structure of COG collections is convenient for straightforward genome annotation, it has substantial drawbacks. Some COGs include closely related proteins with similar, if not identical, biochemical activities. In such cases, assignment of a protein to a specific COG can be taken, without justification, as an indication that the respective organism possesses one functionality but not the other. A good example is the case of glutamate and glutamine aminoacyl tRNA-synthetases (COG0008). While most bacteria encode two paralogous enzymes that charge the Glu- and Gln-specific tRNAs, archaea (as well as chlamydia, chlorobi, chloroflexi, cyanobacteria and certain members of other bacterial phyla) encode only glutamate-tRNA synthetase and produce glutaminyl-tRNA by transamidation of misacylated Glu-tRNA(Gln) [72]. Here, both bacterial paralogs are co-orthologs for the archaeal and chlamydial enzymes, which is why they end up in a single COG. Obviously, splitting COG0008 into two subCOGs would have been a better solution, allowing a precise characterization of the respective enzymes.
In some cases, a COG includes a small subgroup with a well-characterized function but the lack of hierarchy results in annotation of generic function only (e.g. an ABC-type transporter). The single-level definition of orthology can even result in annotations that are largely arbitrary. In some cases (e.g. COG0183, Acetyl-CoA acetyltransferase), COGs are overloaded with paralogs because it is practically impossible to track all extant genes to distinct genes in the common ancestor. On other occasions (COG0050, Translation elongation factor EF-Tu, and COG5256, Translation elongation factor EF-1α), lineage-specific COGs are created for genes that are arguably orthologous because they are sufficiently distinct. The absence of multilevel hierarchy dilutes functional annotation of the characterized members of the COG and weakens the evolutionary reconstructions. Developing and implementing a hierarchical framework is one of the most pressing problems in the COG-based approach to gene classification and genome annotation.

Whole proteins versus protein domains

As noted above, COG construction is based on clustering of orthologous domains that are identified as bidirectional best hits in genome-specific BLAST searches. This approach, however, is sensitive to domain rearrangements that occurred after the divergence of the analyzed set of species from their last common ancestor. Particularly severe problems are caused by promiscuous domains, which can attract proteins to spurious COGs through significant but effectively irrelevant sequence similarity to the promiscuous domains. Although this problem can be addressed semi-automatically, e.g. by excluding the hits that cover only a small portion of the protein sequence, precise solutions still require manual intervention. On many occasions, conserved domain architectures allowed construction of consistent COGs that were not substantially affected by the presence of a shared domain (e.g. the widespread helix-turn-helix DNA-binding domain). Conversely, the diversity of domain architectures of proteins involved in microbial signal transduction and containing a number of promiscuous domains (PAS, GAF, CHASE, GGDEF, EAL and others) required splitting some of these proteins into individual domains or domain combinations. As a result, the COGs are a mix of (i) highly specific domain architectures (such as the above-mentioned response regulators), (ii) multiple domain architectures that include a single shared domain and (iii) separate promiscuous domains. To our knowledge, as of this writing, there is no complete, formal solution for optimal dissection of full-length proteins into orthologous domains. At present, for the analysis of multidomain proteins, the best practical approaches are offered by integrated domain identification tools, such as CDD (which includes the COGs) and InterPro.

Scalability of the COG approach and specialized COG collections

The basic COG approach relies first on an exhaustive all-against-all protein comparison that scales as O(n²) with the total number of proteins and then on a search of connected triangles in clusters of reciprocal best hits that scales as O(n³) with the number of proteins in the cluster [38]. Inevitably, the growth of the database outpaces the availability of the computational resources, making regular major updates of the entire COG database impractical. Several divide-and-conquer strategies have been used to circumvent this major difficulty. One approach that has been implemented in several COG updates includes accommodating the new sequences into the existing COGs first, then searching for potential new COGs among the sequences that do not fit the existing ones, and then, moving some sequences from the old COGs to the new ones [10]. The principal direction, however, has involved construction of dedicated COG collections for distinct microbial taxa.
In particular, the COGs for archaea (arCOGs) went through several closely curated releases and remain up to date, having become a widely used framework for archaeal genome annotation and analysis [10, 70, 73]. As illustrated in Figure 2, detailed analysis of archaeal protein families increased the coverage of cren-, eury- and thaumarchaeal genomes by 18–20%, so that arCOGs now cover >92% of the proteins encoded in typical genomes of Crenarchaeota and Euryarchaeota. Separate projects have involved construction and analysis of COGs for Cyanobacteria and Gram-positive bacteria of the order Lactobacillales [74, 75]. The COG approach was also implemented in the database of Alignable Tight Genome Clusters (ATGC) that includes closely related bacterial and archaeal genomes [64, 76]. COGs have been constructed separately for each ATGC. These ATGC-COGs largely avoid the problems inherent in the COG analysis at larger evolutionary distances (lineage-specific paralogy, differential gene loss and differences in domain architectures) and have proved an efficient platform for various types of evolutionary reconstructions [77, 78]. In taxa for which ATGCs are available—i.e. those studied in sufficient depth so that multiple closely related genomes are available—the coverage of genomes is again raised so that ATGC-COGs now cover >95% of the proteins encoded in typical genomes (Figure 2). The COG approach has also been extended beyond cellular organisms to construct COG for viruses that infect bacteria or archaea, and for the large DNA viruses of eukaryotes [79, 80]. The successful application of the early versions of the COGs was to a large extent based on comprehensive manual curation of the COG membership, COG names and supporting information, and a substantial body of computational analysis aimed at predicting functions for poorly characterized COGs. 
This effort has led to several notable breakthroughs that have been validated by subsequent experiments and opened up new research directions, including the characterization of the CRISPR-Cas system [81, 82], prediction of the archaeal exosome [83], identification of the bacterial c-di-GMP-centered signaling network [84, 85], new bacterial toxin-antitoxin systems [86–88] and archaeal type IV secretion systems [89], and has allowed prioritization of uncharacterized proteins (COGs) for further study [90, 91]. However, scaling this labor-consuming approach to accommodate the exponentially growing amount of genomic sequence data is even more challenging than keeping the COGs up to date. The path forward is likely to combine improved automatic approaches to functional annotation with subprojects focusing on specific taxa or functional classes of COGs.

Concluding remarks

The COG approach for identification of orthologous genes was developed as a platform for comparative genomic analysis shortly after the first few microbial genomes had been sequenced. It could have been expected that in 20 years, this simple strategy based on sequence similarity hierarchy would completely give way to more sophisticated, phylogenetic approaches. This, however, is not the case, primarily because the extended orthology conjecture, according to which bidirectional best hits between genomes correspond to orthologs, and the latter possess equivalent functions, largely holds for prokaryotes given the limited extent of lineage-specific paralogy, differential gene loss and domain shuffling. In contrast, in eukaryotes where all these confounding aspects of genome evolution are pervasive, the COG approach encounters great difficulties, and robust, genome-wide orthology assignment does not seem to be feasible without full-scale phylogenomics.
Thus, the COGs are likely to remain an important tool for microbial genome analysis for years to come, so that investment of effort into refinements of this straightforward approach seems to be justified.

Key Points
Robust orthology identification is essential for accurate genome annotation.
Reconstructions of genome evolution are based on orthology and paralogy.
COGs are an essential tool in microbial genomics.
Several specialized COG projects have been developed.

Acknowledgments
The authors would like to thank all former members of the COG team for their contributions to the project.

Funding
The authors are supported by the Intramural Research Program of the US National Institutes of Health at the National Library of Medicine. D.M.K. acknowledges the support of the Department of Biomedical Engineering at the University of Iowa (Iowa City, USA).

Michael Y. Galperin is a Lead Scientist at the NCBI’s (NIH) Computational Biology Branch. He uses comparative genomics to study evolution of membrane energetics and bacterial metabolic and signaling pathways.
David M. Kristensen is an Assistant Professor at the University of Iowa’s Department of Biomedical Engineering. He uses tools of comparative genomics, bioinformatics and systems biology to study evolution of genes in viruses and microbes.
Kira S. Makarova is a Staff Scientist at the NCBI’s Computational Biology Branch. Her area of expertise is comparative genomics and sequence analysis of microbial genomes.
Yuri I. Wolf is a Lead Scientist at the National Center for Biotechnology Information in Bethesda, Maryland. His research is focused on quantitative aspects of evolutionary and comparative genomics.
Eugene V. Koonin is a Senior Investigator and Leader of the Evolutionary Genomics Group at the National Center for Biotechnology Information at the NIH. He studies various aspects of genome evolution.

References
1 Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science 1997;278:631–7.
2 Koonin EV, Tatusov RL, Galperin MY. Beyond complete genomes: from sequence to structure and function. Curr Opin Struct Biol 1998;8:355–63.
3 Tatusov RL, Galperin MY, Natale DA, et al. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 2000;28:33–6.
4 Tatusov RL, Natale DA, Garkavtsev IV, et al. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001;29:22–8.
5 Tatusov RL, Fedorova ND, Jackson JD, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003;4:41.
6 Galperin MY, Makarova KS, Wolf YI, et al. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res 2015;43:D261–9.
7 Koonin EV. Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet 2005;39:309–38.
8 Gabaldon T, Koonin EV. Functional and evolutionary implications of gene orthology. Nat Rev Genet 2013;14:360–6.
9 Kristensen DM, Wolf YI, Mushegian AR, et al. Computational methods for gene orthology inference. Brief Bioinform 2011;12:379–91.
10 Makarova KS, Wolf YI, Koonin EV. Archaeal Clusters of Orthologous Genes (arCOGs): an update and application for analysis of shared features between Thermococcales, Methanococcales, and Methanobacteriales. Life 2015;5:818–40.
11 Kanehisa M, Sato Y, Kawashima M, et al.
KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res 2016;44:D457–62.
12 Chen F, Mackey AJ, Stoeckert CJ Jr, et al. OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res 2006;34:D363–8.
13 Uchiyama I, Mihara M, Nishide H, et al. MBGD update 2015: microbial genome database for flexible ortholog analysis utilizing a diverse set of genomic data. Nucleic Acids Res 2015;43:D270–6.
14 Altenhoff AM, Skunca N, Glover N, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 2015;43:D240–9.
15 Heinicke S, Livstone MS, Lu C, et al. The Princeton Protein Orthology Database (P-POD): a comparative genomics analysis tool for biologists. PLoS One 2007;2:e766.
16 Huerta-Cepas J, Capella-Gutierrez S, Pryszcz LP, et al. PhylomeDB v4: zooming into the plurality of evolutionary histories of a genome. Nucleic Acids Res 2014;42:D897–902.
17 Kriventseva EV, Tegenfeldt F, Petty TJ, et al. OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free software. Nucleic Acids Res 2015;43:D250–6.
18 Powell S, Forslund K, Szklarczyk D, et al. eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res 2014;42:D231–9.
19 Sonnhammer EL, Ostlund G. InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Res 2015;43:D234–9.
20 Kaduk M, Riegler C, Lemp O, et al. HieranoiDB: a database of orthologs inferred by Hieranoid. Nucleic Acids Res 2017;45:D687–90.
21 Jensen LJ, Julien P, Kuhn M, et al. eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res 2008;36:D250–4.
22 Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016;44:D286–93.
23 Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–402.
24 Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol 2011;7:e1002195.
25 Galperin MY, Koonin EV. Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biol 1998;1:55–67.
26 Schnoes AM, Brown SD, Dodevski I, et al. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol 2009;5:e1000605.
27 Gilks WR, Audit B, De Angelis D, et al. Modeling the percolation of annotation errors in a database of protein sequences. Bioinformatics 2002;18:1641–9.
28 Valencia A. Automatic annotation of protein function. Curr Opin Struct Biol 2005;15:267–74.
29 Gaudet P, Livstone MS, Lewis SE, et al. Phylogenetic-based propagation of functional annotations within the Gene Ontology Consortium. Brief Bioinform 2011;12:449–62.
30 Mi H, Huang X, Muruganujan A, et al. PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements. Nucleic Acids Res 2017;45:D183–9.
31 The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2017;45:D158–69.
32 Letunic I, Doerks T, Bork P. SMART: recent updates, new developments and status in 2015. Nucleic Acids Res 2015;43:D257–60.
33 Oates ME, Stahlhacke J, Vavoulis DV, et al. The SUPERFAMILY 1.75 database in 2014: a doubling of data. Nucleic Acids Res 2015;43:D227–33.
34 Finn RD, Coggill P, Eberhardt RY, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 2016;44:D279–85.
35 Finn RD, Attwood TK, Babbitt PC, et al. InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res 2017;45:D190–9.
36 Marchler-Bauer A, Bo Y, Han L, et al. CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Res 2017;45:D200–3.
37 Sonnhammer EL, Koonin EV. Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet 2002;18:619–20.
38 Kristensen DM, Kannan L, Coleman MK, et al.
A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches. Bioinformatics 2010;26:1481–7.
39 Lechner M, Hernandez-Rosales M, Doerr D, et al. Orthology detection combining clustering and synteny for very large datasets. PLoS One 2014;9:e105015.
40 Dewey CN. Positional orthology: putting genomic evolutionary relationships into context. Brief Bioinform 2011;12:401–12.
41 Marchler-Bauer A, Zheng C, Chitsaz F, et al. CDD: conserved domains and protein three-dimensional structure. Nucleic Acids Res 2013;41:D348–52.
42 Alexeyenko A, Tamas I, Liu G, et al. Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics 2006;22:e9–15.
43 Chen F, Mackey AJ, Vermunt JK, et al. Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS One 2007;2:e383.
44 Altenhoff AM, Dessimoz C. Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput Biol 2009;5:e1000262.
45 Altenhoff AM, Dessimoz C. Inferring orthology and paralogy. Methods Mol Biol 2012;855:259–79.
46 Mulkidjanian AY, Galperin MY, Makarova KS, et al. Evolutionary primacy of sodium bioenergetics. Biol Direct 2008;3:13.
47 Tipton PA, Beecher BS. Tartrate dehydrogenase, a new member of the family of metal-dependent decarboxylating R-hydroxyacid dehydrogenases. Arch Biochem Biophys 1994;313:15–21.
48 Salomone JY, Crouzet P, De Ruffray P, et al. Characterization and distribution of tartrate utilization genes in the grapevine pathogen Agrobacterium vitis. Mol Plant Microbe Interact 1996;9:401–8.
49 Howell DM, Graupner M, Xu H, et al. Identification of enzymes homologous to isocitrate dehydrogenase that are involved in coenzyme B and leucine biosynthesis in methanoarchaea. J Bacteriol 2000;182:5013–16.
50 Haft DH, Selengut JD, Richter RA, et al. TIGRFAMs and genome properties in 2013. Nucleic Acids Res 2013;41:D387–95.
51 Klimke W, Agarwala R, Badretdin A, et al. The national center for biotechnology information's protein clusters database. Nucleic Acids Res 2009;37:D216–23.
52 Yutin N, Puigbo P, Koonin EV, et al. Phylogenomics of prokaryotic ribosomal proteins. PLoS One 2012;7:e36972.
53 Natale DA, Galperin MY, Tatusov RL, et al. Using the COG database to improve gene recognition in complete genomes. Genetica 2000;108:9–17.
54 Koonin EV, Galperin MY (2003) Sequence - Evolution - Function: Computational Approaches in Comparative Genomics. Boston: Kluwer Academic.
55 Tatusova T, Ciufo S, Fedorov B, et al. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res 2014;42:D553–9.
56 Tatusova T, DiCuccio M, Badretdin A, et al. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res 2016;44:6614–24.
57 Galperin MY, Koonin EV.
Functional genomics and enzyme evolution. Homologous and analogous enzymes encoded in microbial genomes. Genetica 1999;106:159–70.
58 Galperin MY, Kolker E. New metrics for comparative genomics. Curr Opin Biotechnol 2006;17:440–7.
59 Galperin MY. Structural classification of bacterial response regulators: diversity of output domains and domain combinations. J Bacteriol 2006;188:4169–82.
60 Galperin MY. Diversity of structure and function of response regulator output domains. Curr Opin Microbiol 2010;13:150–9.
61 Diaz R, Vargas-Lagunas C, Villalobos MA, et al. argC orthologs from Rhizobiales show diverse profiles of transcriptional efficiency and functionality in Sinorhizobium meliloti. J Bacteriol 2011;193:460–72.
62 Prunetti L, El Yacoubi B, Schiavon CR, et al. Evidence that COG0325 proteins are involved in PLP homeostasis. Microbiology 2016;162:694–706.
63 Zallot R, Yuan Y, de Crecy-Lagard V. The Escherichia coli COG1738 member YhhQ is involved in 7-cyanodeazaguanine (preQ0) transport. Biomolecules 2017;7:12.
64 Kristensen DM, Wolf YI, Koonin EV. ATGC database and ATGC-COGs: an updated resource for micro- and macro-evolutionary studies of prokaryotic genomes and protein family annotation. Nucleic Acids Res 2017;45:D210–18.
65 Tettelin H, Masignani V, Cieslewicz MJ, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci USA 2005;102:13950–5.
66 Tettelin H, Riley D, Cattuto C, et al. Comparative genomics: the bacterial pan-genome. Curr Opin Microbiol 2008;11:472–7.
67 Puigbo P, Lobkovsky AE, Kristensen DM, et al. Genomes in turmoil: quantification of genome dynamics in prokaryote supergenomes. BMC Biol 2014;12:66.
68 Wolf YI, Makarova KS, Lobkovsky AE, et al. Two fundamentally different classes of microbial genes. Nat Microbiol 2016;2:16208.
69 Fitch WM. Distinguishing homologous from analogous proteins. Syst Zool 1970;19:99–113.
70 Makarova KS, Sorokin AV, Novichkov PS, et al. Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea. Biol Direct 2007;2:33.
71 Zdobnov EM, Tegenfeldt F, Kuznetsov D, et al. OrthoDB v9.1: cataloging evolutionary and functional annotations for animal, fungal, plant, archaeal, bacterial and viral orthologs. Nucleic Acids Res 2017;45:D744–9.
72 Curnow AW, Hong K, Yuan R, et al. Glu-tRNAGln amidotransferase: a novel heterotrimeric enzyme required for correct decoding of glutamine codons during translation. Proc Natl Acad Sci USA 1997;94:11819–26.
73 Wolf YI, Makarova KS, Yutin N, et al. Updated clusters of orthologous genes for Archaea: a complex ancestor of the Archaea and the byways of horizontal gene transfer. Biol Direct 2012;7:46.
74 Mulkidjanian AY, Koonin EV, Makarova KS, et al. The cyanobacterial genome core and the origin of photosynthesis. Proc Natl Acad Sci USA 2006;103:13126–31.
75 Makarova KS, Koonin EV. Evolutionary genomics of lactic acid bacteria. J Bacteriol 2007;189:1199–208.
76 Novichkov PS, Ratnere I, Wolf YI, et al. ATGC: a database of orthologous genes from closely related prokaryotic genomes and a research platform for microevolution of prokaryotes. Nucleic Acids Res 2009;37:D448–54.
77 Novichkov PS, Wolf YI, Dubchak I, et al. Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J Bacteriol 2009;191:65–73.
78 Ran W, Kristensen DM, Koonin EV. Coupling between protein level selection and codon usage optimization in the evolution of bacteria and archaea. MBio 2014;5:e00956-14.
79 Yutin N, Colson P, Raoult D, et al. Mimiviridae: clusters of orthologous genes, reconstruction of gene repertoire evolution and proposed expansion of the giant virus family. Virol J 2013;10:106.
80 Grazziotin AL, Koonin EV, Kristensen DM. Prokaryotic Virus Orthologous Groups (pVOGs): a resource for comparative genomics and protein family annotation. Nucleic Acids Res 2017;45:D491–8.
81 Makarova KS, Aravind L, Grishin NV, et al. A DNA repair system specific for thermophilic Archaea and bacteria predicted by genomic context analysis. Nucleic Acids Res 2002;30:482–96.
82 Makarova KS, Grishin NV, Shabalina SA, et al. A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol Direct 2006;1:7.
83 Koonin EV, Wolf YI, Aravind L. Prediction of the archaeal exosome and its connections with the proteasome and the translation and transcription machineries by a comparative-genomic approach. Genome Res 2001;11:240–52.
84 Galperin MY, Nikolskaya AN, Koonin EV. Novel domains of the prokaryotic two-component signal transduction systems. FEMS Microbiol Lett 2001;203:11–21.
85 Amikam D, Galperin MY. PilZ domain is part of the bacterial c-di-GMP binding protein. Bioinformatics 2006;22:3–6.
86 Makarova KS, Wolf YI, Koonin EV. Comprehensive comparative-genomic analysis of type 2 toxin-antitoxin systems and related mobile stress response systems in prokaryotes. Biol Direct 2009;4:19.
87 Fozo EM, Makarova KS, Shabalina SA, et al. Abundance of type I toxin-antitoxin systems in bacteria: searches for new candidates and discovery of novel families. Nucleic Acids Res 2010;38:3743–59.
88 Makarova KS, Wolf YI, Snir S, et al. Defense islands in bacterial and archaeal genomes and prediction of novel defense systems. J Bacteriol 2011;193:6039–56.
89 Makarova KS, Koonin EV, Albers SV. Diversity and evolution of type IV pili systems in Archaea. Front Microbiol 2016;7:667.
90 Galperin MY, Koonin EV. 'Conserved hypothetical' proteins: prioritization of targets for experimental study. Nucleic Acids Res 2004;32:5452–63.
91 Galperin MY, Koonin EV. From complete genome sequence to 'complete' understanding? Trends Biotechnol 2010;28:398–406.
Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US.
journal article
LitStream Collection
MicroScope—an integrated resource for community expertise of gene functions and comparative analysis of microbial genomic and metabolic data

Médigue, Claudine; Calteau, Alexandra; Cruveiller, Stéphane; Gachet, Mathieu; Gautreau, Guillaume; Josso, Adrien; Lajus, Aurélie; Langlois, Jordan; Pereira, Hugo; Planel, Rémi; Roche, David; Rollin, Johan; Rouy, Zoe; Vallenet, David

2019 Briefings in Bioinformatics

doi: 10.1093/bib/bbx113 pmid: 28968784

Abstract
The ever-growing list of new bacterial genomes becoming available on a daily basis makes accurate genome annotation an essential step that ultimately determines the relevance of the thousands of genomes stored in public databanks. The MicroScope platform (http://www.genoscope.cns.fr/agc/microscope) is an integrative resource that supports systematic and efficient revision of microbial genome annotation, data management and comparative analysis. Starting from the results of our syntactic, functional and relational annotation pipelines, MicroScope provides an integrated environment for the expert annotation and comparative analysis of prokaryotic genomes. It combines tools and graphical interfaces to analyze genomes and to perform manual curation of gene function in a comparative genomics and metabolic context. In this article, we describe the free-of-charge MicroScope services for the annotation and analysis of microbial (meta)genomes, transcriptomic and re-sequencing data. We then present the functionalities of the platform in a way that offers practical guidance to nonspecialists in bioinformatics. Newly integrated analysis tools (i.e. prediction of virulence and resistance genes in bacterial genomes) and a recently developed original method (the pan-genome graph representation) are also described. Integrated environments such as MicroScope clearly help, through their user community, to maintain accurate resources.

Keywords: microbial genome annotation system, gene function curation, comparative genomics, transcriptomics, variant detection, metabolic networks

Introduction
Large-scale genome sequencing and the increasingly massive use of high-throughput approaches produce a vast amount of new information that is transforming our understanding of thousands of species. However, despite the development of powerful bioinformatics approaches, full interpretation of the content of these genomes remains a difficult task.
To address this challenge, several integrated environments that combine and standardize information from a variety of sources and apply uniform (re-)annotation techniques have been developed (e.g. EnsemblGenomes [1], IMG [2], PATRIC [3]). In the context of the French National Sequencing Center (CEA/DRF/Genoscope), we have developed the MicroScope platform, a software environment for the management, annotation, comparative analysis and visualization of microbial genomes. First published in 2006 [4], the platform has been under continuous development within the LABGeM group at CEA, and its capacities are now extensive [5–7]. MicroScope serves several use cases in bioinformatics:
It supports the integration of newly sequenced or already available prokaryotic genomes through a free-of-charge service offered to the scientific community [genome annotation, RNA sequencing (RNA-seq) and variant analyses].
It performs computational inferences, including prediction of metabolic pathways and prediction of the resistome and virulome, which can be used for genome analysis.
It provides tools for (comparative) analyses and visualization of prokaryotic genomes.
It supports collaborative expert annotation processes through the use of specific curation tools and graphical interfaces.
The present article provides a comprehensive description of MicroScope from the point of view of end users. We start with the major objectives for which the platform was designed and give an overview of the main categories of MicroScope users and projects. We then explain how to submit data and interact with the MicroScope team, and how to explore the annotated data, use the various analysis tools and perform expert annotation of gene functions. Technical details on the architecture of the system are given in the last section of this review. Where possible, earlier publications that provide more details are referenced.
We conclude with ongoing work that has led to a promising representation of the pan-genome of thousands of prokaryotic genomes.

Who is using MicroScope and for what purposes?
In the era of high-throughput sequencing technologies, the vast majority of genome sequences receive only automatic annotation, mainly based on sequence similarity, which can give spurious results [8]. Manual expert curation of gene functions is a time-consuming and expensive process, but it undoubtedly adds great value to resources. In knowledge bases such as UniProtKB [9], curation efforts remain restricted to large and widespread protein families, and these resources cannot replace the expert curation carried out by specialized biologists in community systems such as SEED [10], IMG [2] and MicroScope. Our integrated platform supports systematic and efficient revision of microbial genome annotation and data management, together with comparative genomics and metabolic analyses [4–7]. The resource provides data from completed and ongoing genome projects together with post-genomic experiments (i.e. transcriptomics; re-sequencing of evolved strains; mutant collections), allowing users to improve their understanding of gene functions. In comparison with other similar systems, MicroScope enables curation in a rich comparative genomic context and is mainly focused on (re-)annotation projects, which are built in close collaboration with microbiologists working on reference species. MicroScope was initially dedicated to the annotation and analysis of Acinetobacter baylyi ADP1 [11] and was intended for biologists who lack the computing infrastructure required to perform efficient annotation and analyses of newly sequenced bacterial genomes. Our system rapidly became a free-of-charge ‘service’ to the scientific community at large. From <400 user accounts in 2006, MicroScope has grown to >3300 personal accounts at present (Figure 1).
The number of registered users has doubled since 2013, and the platform has widened its international reach, with 64% of accounts held outside France. Many international projects involving users from distant geographic areas are conducted through the platform [7]. Authentication is not required to navigate MicroScope, but it allows users to annotate genes and to save data in their personal session. On average, we count 360 active accounts per month (i.e. accounts whose user logged in at least once during the month) and 2200 authentications among ∼1700 monthly unique visitors.

Figure 1. Evolution of the number of integrated genomes, user accounts and expert annotations stored in MicroScope since 2002. The red scale on the right refers to the number of integrated genomes (red curve) and to the number of user accounts (orange curve). The blue scale on the left refers to the cumulative number of expert annotations.

The platform has been used to perform a complete expert annotation of several reference species such as Escherichia coli [12], Bacillus subtilis 168 [13, 14] and Pseudomonas putida KT2440 [15]. In addition, important pathogens and environmental species have been extensively curated. The MicroScope system is now also used for variant analysis of re-sequenced bacterial strains (for example, in the context of bacterial evolution experiments) and for the analysis of transcriptomic experiments using RNA-seq data [6, 7]. Finally, the platform is also used, in some cases exclusively, for its set of analysis tools pertaining to microbial genomics and metabolism, which have been integrated and made available through the MicroScope Web interface (see next sections). The MicroScope platform has been cited 690 times in 13 years.
As shown in Figure 1, although the number of MicroScope users with a personal account has increased significantly since 2011, the number of expert annotations made each year is clearly decreasing, reaching only 21 600 in 2016 (we registered >100 000 expert annotations in 2009). Last year, about one-tenth of the users curated gene functions, and a third of those made >100 expert annotations. Given the number of prokaryotic genomes being sequenced today, time-consuming expert annotation of every genome is clearly not feasible. This is why our major efforts have focused on developing several key functionalities that ease the expert annotation process and notably improve the final annotation quality of the analyzed genomes, at least for gene functions of interest.

An annotation service to researchers in microbiology

Interface for user data integration
Integration and analysis of genomic data in MicroScope are open and free of charge for the worldwide community of microbiologists. To standardize user submission and make it fully automated, we have developed a dedicated Web interface (https://www.genoscope.cns.fr/agc/microscope/about/services.php). The service is mainly used for the annotation of microbial genomes: both newly sequenced genomes (which remain private until publication of the genome and/or its submission to public databanks) and, for comparative analysis purposes, public prokaryotic genomes (Figure 2). Moreover, three other types of services are provided for the integration of (i) genome assemblies (bins) from metagenomic samples, (ii) RNA-seq data for quantitative transcriptomics and (iii) DNA sequencing (DNA-seq) data to identify genomic variations in evolved strains (Figure 3). To ease data integration and comparative studies, standardization of contextual data about genome sequences is essential.
For metagenomes, we have added a dedicated form that follows the MIMS specifications (minimum information about a metagenome sequence [16]). When submitting assembled metagenomic data to MicroScope, users are invited to select the type of environment (e.g. soil, air, water, human-associated, plant-associated) and to complete the associated fields (e.g. collection date, environment biome, geographic location). These fields are dynamically loaded and displayed when the metagenome type is selected. Indeed, the MicroScope database model is flexible enough to store predefined descriptors, such as MIMS, or descriptors defined by users.
Figure 2. Annotation pipelines for the analysis of newly sequenced genomes and genomes already annotated in public databanks.
Figure 3. Submission of genomic data into the MicroScope platform. Four types of services are provided for the integration of (i) newly sequenced or publicly available genomes (Genome), (ii) genome assemblies/bins from metagenomic samples (Metagenome), (iii) RNA-seq data for quantitative transcriptomics (RNA-Seq) and (iv) DNA-seq data to identify genomic variations in evolved strains (Evolution). Following the three main steps of the procedure, the user is invited to complete the requested metadata describing sequencing, genomes and experimental properties, to upload FASTA (genome assemblies) or FASTQ (RNA-seq or DNA-seq reads) files and, finally, to approve the terms of service.
Users are then informed by e-mail about the progress of their integration request. At present, an average of eight genomes a day are submitted for integration in the platform (this includes bins from metagenomic samples). The resource contains data for >7400 microbial genomes, of which ∼3100 are publicly available. In addition, 607 RNA-seq runs and 756 runs corresponding to the re-sequencing of evolved strains have also been submitted for integration into MicroScope.
Running the annotation pipelines
About 25 analysis workflows include most of the currently used annotation software, plus some in-house tools and/or annotation strategies (Table 1). The newly sequenced (meta)genomes, generally submitted as several contigs that may or may not be ordered along the final chromosome(s), are first analyzed by the syntactic annotation pipeline to identify protein-coding genes, transfer RNA (tRNA), ribosomal RNA (rRNA), noncoding RNA (ncRNA) and repeats (Figure 2, Table 1). For a more accurate prediction of small genes and/or genes of atypical composition, we have developed a strategy that first constructs appropriate gene models taking into account the codon usage of the studied organism. These models are then used in the core of the AMIGene program [17]. Starting with the set of genomic objects identified during the syntactic annotation process, the next step is to infer the biological functions of the predicted genes. Our functional annotation pipeline includes sequence similarity search tools using generalist (e.g. UniProtKB/Swiss-Prot) or specialized (e.g. InterPro, FigFam) databases (Table 1). Results obtained with high-quality manually curated protein sequence data sets (i.e. Swiss-Prot, E. coli K-12, B. subtilis 168 MicroScope-curated genes) are considered first in the final automatic functional annotation procedure. This procedure also takes into account the results obtained from the computation of synteny groups with complete reference prokaryotic genomes and with those available in MicroScope. Indeed, for assigning function to novel proteins, gene context approaches often complement the classical homology-based gene annotation in prokaryotes. The method we have developed offers the possibility of retaining more than one homologous gene (i.e. not only the bidirectional best hit), to allow for multiple correspondences between genes; that way, paralogy relations and/or gene fusions are easily detected [4].
Table 1.
Software and databases integrated in the MicroScope pipelines

| Topic | Name | Software | Database | Description | Internal | URL |
| --- | --- | --- | --- | --- | --- | --- |
| Syntactic annotation | AMIGene | x | | Coding sequences (CDS) prediction | x | http://www.genoscope.cns.fr/agc/tools/amigene |
| | Glimmer | x | | Coding sequences (CDS) prediction | | https://ccb.jhu.edu/software/glimmer |
| | Prodigal | x | | Coding sequences (CDS) prediction | | http://prodigal.ornl.gov |
| | MICheck | x | | INSDC genome CDS re-annotation | x | http://www.genoscope.cns.fr/agc/tools/micheck |
| | tRNAscan-SE | x | | tRNA prediction | | http://eddylab.org/software/tRNAscan-SE |
| | RNAmmer | x | | rRNA prediction | | http://www.cbs.dtu.dk/services/RNAmmer |
| | Rfam/Infernal | x | x | ncRNA families and prediction | | http://rfam.xfam.org, http://eddylab.org/infernal |
| | RepSeek | x | | DNA sequence repeats | | http://wwwabi.snv.jussieu.fr/public/RepSeek |
| | Alien hunter | x | | DNA compositional biases to detect HGT regions | | http://www.sanger.ac.uk/science/tools/alien-hunter |
| | SIGI-HMM | x | | DNA compositional biases to detect HGT regions | | http://www.brinkman.mbb.sfu.ca/~mlangill/sigi-hmm |
| | GenProtFeat | x | | Gene/protein features | x | |
| | Taxonomy | | x | NCBI taxonomy database | | https://www.ncbi.nlm.nih.gov/taxonomy |
| Functional annotation | BLAST+ | x | | DNA/protein sequence alignment | | https://blast.ncbi.nlm.nih.gov |
| | Diamond | x | | DNA/protein sequence alignment | | https://github.com/bbuchfink/diamond |
| | UniProtKB | | x | Protein sequence and function database | | http://www.uniprot.org |
| | InterPro | x | x | Protein signature and family prediction | | https://www.ebi.ac.uk/interpro |
| | COG | x | x | Protein family annotation and prediction | | https://www.ncbi.nlm.nih.gov/COG |
| | FigFam | x | x | Protein family annotation and prediction | | http://www.nmpdr.org/FIG/wiki/view.cgi/FIG/FigFam |
| | MICFAM | | x | Protein sequence family classification with SiLiX | x | |
| | SiLiX | x | | Clustering of protein sequences | | https://lbbe.univ-lyon1.fr/-SiLiX-.html |
| | ENZYME | | x | Enzymatic activity database | | http://enzyme.expasy.org |
| | PRIAM | x | | Enzymatic activity prediction | | http://priam.prabi.fr |
| | dbCAN | x | | Carbohydrate-active enzyme prediction | | http://csbl.bmb.uga.edu/dbCAN/ |
| | SignalP | x | | Signal peptide cleavage site prediction | | http://www.cbs.dtu.dk/services/SignalP |
| | TMHMM | x | | Transmembrane helix prediction | | http://www.cbs.dtu.dk/services/TMHMM |
| | LipoP | x | | Lipoprotein prediction | | http://www.cbs.dtu.dk/services/LipoP |
| | PSORTb | x | | Subcellular localization prediction | | http://www.psort.org |
| | VFDB | | x | Virulence factor database | | http://www.mgc.ac.cn/VFs |
| | VirulenceFinder | | x | Virulence factor database | | https://cge.cbs.dtu.dk/services/VirulenceFinder |
| | CARD/RGI | x | x | Antibiotic resistance database and prediction | | https://card.mcmaster.ca |
| | AutoFassign | x | | Automatic functional annotation of proteins | x | |
| Relational annotation | Syntonizer | x | | Synteny conservation detection | x | http://www.inrialpes.fr/helix/people/viari/cccpart/ |
| | Directon | x | | Operon prediction | x | |
| | PhyloProfile | x | | Phylogenetic profile co-evolution score | x | https://dx.doi.org/10.1186/1471-2164-13-69 |
| | RGP | x | | Genomic plasticity region detection | x | |
| | Pathway synteny | x | | Synteny involved in metabolic pathways | x | |
| | MIBiG/antiSMASH | x | x | Biosynthetic Gene Cluster database and prediction | | http://www.secondarymetabolites.org/ |
| | ChEBI | | x | Chemical compound database | | https://www.ebi.ac.uk/chebi |
| | Rhea | | x | Reaction database | | http://www.rhea-db.org |
| | KEGG | | x | Metabolic pathway database | | http://www.genome.jp/kegg |
| | MetaCyc/Pathway Tools | x | x | Metabolic pathway database and prediction | | https://metacyc.org, http://brg.ai.sri.com/ptools/ |
| Transcriptomics and variant discovery | SSAHA2 | x | | Read mapping | | http://www.sanger.ac.uk/science/tools/ssaha2-0 |
| | BWA | x | | Read mapping | | https://github.com/lh3/bwa |
| | SAMtools | x | | Mapping analysis | | http://www.htslib.org/ |
| | bedtools | x | | Mapping analysis | | http://bedtools.readthedocs.io |
| | PALOMA | x | | Variant detection | x | |
| | DESeq | x | | Differential gene expression analysis | | http://bioconductor.org/packages/release/bioc/html/DESeq.html |

Information from the syntactic and functional annotation pipelines can be placed into a biological context to understand how the predicted objects interact in functional modules such as metabolic pathways. Each genome integrated into MicroScope is processed by an in-house workflow based on the MetaCyc reference database [18] and on the Pathway Tools software [19]. This software creates a Pathway Genome DataBase (PGDB) containing the predicted pathways and reactions of an organism.
It uses a matching procedure to which we directly supply, as input, the official MetaCyc reaction frame identifiers when they are available in the genome annotation; this avoids overpredicted or missed enzymatic reactions [20]. The collection of MicroScope PGDBs is made available at the MicroCyc Web site (http://www.genoscope.cns.fr/agc/microcyc) and in the MicroScope database (see ‘Exploration of metabolic data’ section). Moreover, these metabolic networks are synchronized each night with new MicroScope genomes and expert annotations. When a public prokaryotic genome is integrated into MicroScope, the original annotations are stored in the database, and the syntactic re-annotation process, which uses the MICheck procedure, often identifies missing or wrongly annotated genes [21]. This step is useful to annotate more completely the pseudogenes found in a genome (whether ‘real’ or caused by sequencing errors), an important piece of information when comparing closely related species. Data from genomes available in public databanks generally keep their ‘public’ status in MicroScope.
A MicroScope staff to support and train a user community
As soon as annotations and comparative analysis results have been processed by MicroScope, the user who submitted the genome(s) is alerted by e-mail; they can then use a specific administration tool to grant access to collaborators and to define consultation and modification rights on the sequences (‘User Panel’ menu/‘Access Rights Management’ functionality). Continuing support and assistance to MicroScope users remain an important activity in the context of our services (or collaborative projects). These regular exchanges, together with the satisfaction surveys, are the most efficient way of continually adapting the platform to user needs.
Indeed, in addition to the user-friendliness of the tools integrated into the platform (see below), the short response time and the quality of feedback to individual queries are highly appreciated aspects of the MicroScope service. Microbiologists who have submitted genomic data to the MicroScope platform are warmly invited to attend a training course organized by our team. Using the data related to their own project, attendees learn how to change or correct the current automatic functional annotations, and how to perform effective searches and analyses with the functionalities available through the Web interface. About twice a year, we offer new users a four-and-a-half-day training course, ‘Annotation and analysis of prokaryotic genomes using the MicroScope platform’. Since 2016, we have also offered an advanced course for former trainees, so that they can remain up-to-date on recent developments. Since 2008, 450 users from 20 countries have been trained, and 13 external sessions have been organized in France and abroad (Tunisia, Denmark, Germany, Switzerland, Spain, the Netherlands and China). More information is available on our Web site: http://www.genoscope.cns.fr/agc/microscope/training. Data integration, service continuity and data conservation (backups) are currently provided free of charge. MicroScope services follow the quality management system of our laboratory (ISO 9001:2008 and NF X50-900:2013 standards).
All the data previously described (primarily genomes, analysis results and annotations) should be made appropriately accessible to biologist users, to allow efficient curation of annotations and to develop hypotheses about specific genomes or sets of genes to be experimentally tested. The following sections describe the MicroScope Web interface (http://www.genoscope.cns.fr/agc/microscope), i.e. the components accessible to our users via secure or anonymous connections.
For a description of each functionality in terms of input and output data, a complete tutorial is available at https://microscope.readthedocs.io.
Exploration of the genomic data: simple and advanced queries
The ‘Search/Export’ menu (Figure 4) allows the user to perform Blast and pattern searches in the MicroScope database, and to download sequences, annotation data and metabolic networks in standard file formats (GenBank, EMBL, GFF, etc.). The ‘Search by keywords’ functionality allows the user to identify genes and functions of interest using a variety of selection filters. The ‘single mode’ is used to query only one chromosome and the ‘multiple mode’ to query several replicons (of one organism) and/or several genomes. A basic keyword search enables the user to quickly retrieve genes having a particular function (e.g. ‘kinase’, ‘transporter’). Each kind of precomputed result (e.g. Blast results on various primary data, InterPro and FigFam results) can be queried. Figure 4 shows an example of a query on the similarity searches in the CARD database [22] (‘Resistome’ data set).
Figure 4. MicroScope interface illustrating the ‘Search by keywords’ functionality. In the ‘multiple’ mode, a set of Staphylococcus species has been selected, and the BLASTP similarity results obtained with well-known resistance genes stored in the CARD database are queried using an amino acid identity threshold of at least 80% and the keywords ‘kanamycine tetracycline’. The selection of ‘At least one word’ is required to apply an ‘OR’ between the two keywords.
Keyword searches are useful to compare the current annotation of gene functions with the biological function suggested by a specific analysis method. In addition, the result of a query can be refined with a further query.
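The filtering logic behind such a keyword query can be illustrated with a minimal Python sketch. It is not MicroScope code: the `Hit` record, the `search` function, the gene labels and the hit descriptions are all hypothetical and only mimic the behavior described above (an identity threshold, an ‘OR’ between keywords, and a second query applied to the result of the first).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One precomputed similarity result (hypothetical record layout)."""
    gene: str
    description: str   # annotation of the matched reference protein
    identity: float    # amino acid identity, in percent


def search(hits, keywords, min_identity=0.0, mode="any"):
    """Return the hits whose description matches the keywords.

    mode="any" mimics 'At least one word' (OR between keywords);
    mode="all" requires every keyword (AND between keywords).
    """
    words = [w.lower() for w in keywords.split()]
    combine = any if mode == "any" else all
    return [
        h for h in hits
        if h.identity >= min_identity
        and combine(w in h.description.lower() for w in words)
    ]


# Invented example data, for illustration only.
hits = [
    Hit("SA0001", "tetracycline resistance protein TetK", 99.1),
    Hit("SA0002", "kanamycin nucleotidyltransferase", 85.0),
    Hit("SA0003", "tetracycline efflux pump", 62.4),
    Hit("SA0004", "DNA gyrase subunit A", 97.0),
]

# First query: identity of at least 80% and at least one of the two keywords.
first = search(hits, "kanamycin tetracycline", min_identity=80.0, mode="any")
# Second query: refine the result of the first one.
refined = search(first, "efflux resistance", mode="any")

print([h.gene for h in first])
print([h.gene for h in refined])
```

Because the second call takes the output of the first as its input, each refinement can only narrow the candidate list, which is the behavior expected from chained queries.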
For example, one can search for genes annotated as ‘protein of unknown function’ (first query) and then search for those having significant Blast results with proteins annotated with specific functions (second query). Whatever the query, the output is a list of candidate genes whose genomic contexts can be easily visualized: next to the gene label, a magnifier icon can be clicked to return to the MaGe graphical representation, with the genome browser automatically centered on the gene of interest.
MaGe (Magnifying Genome): a genome browser in the light of synteny results
The MaGe graphical interface is one of the functionalities that has been most positively received by users: this genome browser offers gene context exploration of the studied genome compared with other microbial genomes. The graphical representation of the synteny groups allows the user to quickly see whether part of the genome being annotated shares similarities and locally conserved organization with the selected sequences. As shown in Figure 5, there is a clear synteny break in the visualized part of the E. coli CFT073 strain: the genes located between positions 5116000 and 5131000 share homologs only with the E. coli pathogenic strain ABU and, partially, with the E. coli commensal strain ED1a. The foreign origin of this region is also apparent from the coding prediction curves: the gene model used here does not fit well with the codon usage of the genes annotated in this genomic island. The example shown in Figure 5 also indicates possible paralogy relations through multiple correspondences between genes, and one case of frameshift (or sequencing error) in E. coli 536 for the idnK gene (D-gluconate kinase). With such a graphical representation, the conservation of genomic context is fully integrated in the process of expert curation of gene function.
Figure 5. MicroScope genome browser and synteny map.
The first graphical map contains part of the genome being analyzed (here 30 kb of E. coli CFT073), over which the user can navigate (moving and zooming functionalities). The predicted coding genes are drawn, on the six reading frames, as red rectangles together with the coding prediction curves (computed with the gene model selected by the user; ‘Matrix’ selection menu). Below this genome browser is the synteny map, in which each line shows the similarity results between the genome being annotated (E. coli CFT073) and other selected genomes (here, 11 pathogenic and commensal E. coli strains; the selection is performed using the ‘Options’ functionality). On this map, a rectangle flags the existence of a gene, somewhere in the compared genome, that is homologous to the corresponding gene in the genome browser. If, for several co-localized CDSs on the annotated genome, there are several co-localized homologs on the compared genome, the rectangles are all of the same color; otherwise, the rectangle is white. Thus, in this map, a specific color indicates a synteny group. A rectangle is always of the same size as the reference gene in the genome browser; however, it is colored only on the part of the gene that aligns with the compared protein. This allows the user to visualize situations where the alignment is partial. There is one such case in E. coli 536, indicating that the idnK gene in this strain is a pseudogene compared with the idnK gene in CFT073. In contrast with the genome browser, there is no notion of scale on the synteny maps: to see how homologous genes are organized in a synteny group, the user can click on one rectangle in a given synteny group. Another visualization mode has been added more recently to represent synteny conservation at different taxonomic levels (i.e. phylum, class, order, family or species).
In this ‘taxon-synteny’ mode (obtained by clicking the ‘Switch’ button, Figure 5), each line of the synteny map refers to a taxon, and colored boxes represent the percentage of synteny conservation among organisms of the corresponding taxon.
Comparative genomics tools
Computations of homologs and synteny groups between microbial genomes are the starting point of several comparative methods available in the ‘Comparative Genomics’ menu (Figure 6).
Figure 6. Comparative genomics tools of the MicroScope platform. The figure displays some of the tools available to perform in-depth comparative genomics analyses involving the bacterium of interest and one or a set of organisms: ‘Gene Phyloprofile’ (comparison of five Lactobacillus rhamnosus strains), ‘Line Plot’ (shared synteny groups found in the same DNA strand are colored in green, and in red otherwise), ‘Regions of Genomic Plasticity’ (the predicted genomic island is shown in the second layer of the circular representation), ‘Pan-core genome’ and ‘Resistome’. In this last case, the figure shows Acinetobacter baumannii AYE genes having BLASTP hits with proteins from the CARD database.
First, the ‘Fusion/Fission’ functionality provides a list of candidate genes of the selected genome potentially involved in evolutionary events such as gene fusion or fission. Such events involve so-called ‘Rosetta-stone’ proteins and suggest a high probability of functional interaction between the involved proteins [23]. Second, the ‘Gene phyloprofile’ functionality is used to find unique or common genes in the query genome with respect to other genomes of interest. Homology constraints and synteny group inclusion criteria may be applied to refine queries. Third, the ‘LinePlot’ functionality draws a global graphical representation of conserved syntenies between two selected genomes, and the ‘Regions of Genomic Plasticity (RGP)’ functionality is used to search for potential horizontal gene transfer (HGT).
The method combines (i) the results of algorithms that detect signals in the query sequence indicative of a horizontal transfer origin (tRNA hotspots, mobility genes, compositional bias [24]) and (ii) the identification of synteny breaks in the query genome in comparison with selected closely related microbial genomes. Results are reported in tabular form and on a circular representation of the genome (Figure 6). Finally, the ‘Pan/Core Genome’ functionality dynamically computes the pan-genome and its components (core-genome and variable-genome) for a set of selected organisms (up to 200). The method uses the MicroScope gene families (MICFAM) computed with the SiLiX software [25]. The sets of common (= core-genome), variable and strain-specific genes of each compared genome can be exported in a tabular file format or to a ‘Gene Cart’. Indeed, at any level of the MicroScope Web interface, the gene list that results from a search or analysis can be selected for inclusion in a ‘Gene Cart’. The user can manage several ‘Gene Carts’ at the same time, resulting from different queries. A specific interface has been developed to perform various operations, such as the intersection or the difference between two gene carts, to extract sequences or to run multiple alignments via the integrated Jalview software [26] (‘Gene Carts’ functionality of the ‘User Panel’ menu). Two functionalities of the ‘Comparative Genomics’ menu are more specifically related to pathogen analysis (Figure 6). The ‘Resistome’ functionality uses the Comprehensive Antibiotic Resistance Database (CARD) [22], a manually curated resource containing high-quality reference data on the molecular basis of antimicrobial resistance, together with the Resistance Gene Identifier (RGI) tool, to predict the resistome of a genome.
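Returning briefly to the ‘Pan/Core Genome’ partition described above, the idea can be made concrete with a short Python sketch. It assumes that each genome has already been reduced to the set of gene family identifiers it contains (as a family clustering such as MICFAM would provide); the function name, the strain names and the family identifiers are invented for illustration, and the sketch does not reproduce the actual MicroScope implementation.

```python
from collections import defaultdict


def partition_pangenome(families_by_genome):
    """Split gene families into pan, core, variable and strain-specific sets.

    families_by_genome maps a genome name to the set of family identifiers
    found in that genome (hypothetical input layout).
    """
    n_genomes = len(families_by_genome)
    carriers = defaultdict(set)  # family id -> genomes that carry it
    for genome, families in families_by_genome.items():
        for family in families:
            carriers[family].add(genome)

    pan = set(carriers)                                   # every family seen
    core = {f for f, g in carriers.items() if len(g) == n_genomes}
    variable = pan - core                                 # missing somewhere
    specific = defaultdict(set)                           # found in one genome only
    for family, genomes in carriers.items():
        if len(genomes) == 1:
            specific[next(iter(genomes))].add(family)
    return pan, core, variable, dict(specific)


# Invented example: three strains described by toy family identifiers.
families = {
    "strainA": {"F1", "F2", "F3"},
    "strainB": {"F1", "F2", "F4"},
    "strainC": {"F1", "F5"},
}
pan, core, variable, specific = partition_pangenome(families)
print(sorted(pan), sorted(core), sorted(variable))
print({genome: sorted(fams) for genome, fams in sorted(specific.items())})
```

With sets as the underlying representation, the gene cart operations mentioned above (intersection and difference between two carts) map directly onto the `&` and `-` set operators.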
The ‘Virulome’ functionality gives the results of Blast similarity searches against three distinct data sets of virulence genes: VFDB, which contains experimentally demonstrated virulence genes [27], VirulenceFinder [28] and a subset of the main E. coli virulence genes.
Exploration of metabolic data
The ‘Metabolism’ menu of MicroScope allows the user to explore the predicted metabolic pathways using two main resources, KEGG and MetaCyc, and to use analysis tools (Figure 7).
Figure 7. Tools for the analysis of microbial metabolism. Metabolic data can be explored using the KEGG or MetaCyc metabolic pathway hierarchies. On the left, the figure shows, for one selected MicroScope genome, the mapping of the annotated EC numbers on a KEGG metabolic map (enzymes encoded by genes localized in the current genome browser region are highlighted in yellow, and those encoded by genes localized elsewhere are highlighted in green). PGDBs predicted with the Pathway Tools software are available through the ‘MicroCyc’ functionality. Comparison of metabolic pathways between a set of selected genomes is performed using the ‘Metabolic profiles’ tool: for each metabolic pathway, a completion value is computed, which corresponds to the number of reactions of the pathway found in genome x divided by the total number of reactions in the pathway. This value can take pseudogenes into account or not. It ranges between 0 (absence of the pathway) and 1 (complete pathway). The figure also shows an example of antiSMASH output; antiSMASH predicts Biosynthetic Gene Clusters in prokaryotic genomes. For the NRPS/PKS cluster types, the predicted peptide monomer composition and its corresponding SMILES formula are specified. Below the graphical representation of the predicted antiSMASH cluster, a summary of MIBiG cluster similarities, BGC gene composition and tailoring cluster similarities is given.
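The completion value defined in the caption of Figure 7 is a simple ratio, and a minimal Python sketch may help to fix ideas. The function names, pathway names, reaction identifiers and genome names below are hypothetical; whether reactions supported only by pseudogenes are counted is left to the caller, who decides which reactions to pass in for each genome.

```python
def pathway_completion(pathway_reactions, genome_reactions):
    """Fraction of a pathway's reactions that are found in a genome (0 to 1)."""
    pathway_reactions = set(pathway_reactions)
    if not pathway_reactions:
        raise ValueError("pathway has no reactions")
    found = pathway_reactions & set(genome_reactions)
    return len(found) / len(pathway_reactions)


def metabolic_profile(pathways, reactions_by_genome):
    """Completion value for every (genome, pathway) pair."""
    return {
        genome: {
            name: pathway_completion(reactions, present)
            for name, reactions in pathways.items()
        }
        for genome, present in reactions_by_genome.items()
    }


# Invented example data: two toy pathways and two toy genomes.
pathways = {
    "pathway-1": ["R1", "R2", "R3", "R4"],
    "pathway-2": ["R5", "R6"],
}
reactions_by_genome = {
    "genomeX": {"R1", "R2", "R3", "R6"},
    "genomeY": {"R5", "R6"},
}
profile = metabolic_profile(pathways, reactions_by_genome)
for genome, values in profile.items():
    print(genome, values)
```

The resulting genome-by-pathway table of values between 0 and 1 is the kind of matrix that can then be handed to a clustering tool to group genomes by metabolic capability.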
Starting from the set of predicted and/or validated Enzyme Commission numbers (EC numbers), metabolic maps are dynamically drawn via a request to the KEGG Web server (‘KEGG’ functionality). A color code shows the number of enzymatic activities (i.e. EC numbers) of the annotated genome found in specific metabolic pathways (Figure 7). The interconnected metabolic pathways represented in KEGG are supplemented by the MicroCyc PGDBs built with the Pathway Tools software using MetaCyc as the reference metabolic database (see ‘Running the annotation pipelines’ section). The ‘MicroCyc’ functionality allows the user to browse and query the metabolic network of a target genome using the Pathway Tools Web interface [18]. These two sets of predicted pathways can be used in the ‘Metabolic profiles’ functionality. Starting with a selection of organisms and a subset (or all) of the metabolic pathways from the KEGG or MetaCyc classification, the tool computes a pathway completion value for each metabolic pathway (Figure 7). These values can be used by the MeV statistical tool (a Java Web Start application) to cluster genomes according to their metabolic capabilities. Moreover, the resulting table is a good starting point for finding candidate genes for missing gene–reaction associations in specific pathways (see example in [6]). In the same way, the ‘Pathway Synteny’ functionality follows the ‘guilt by association’ strategy [29]: it combines information on synteny groups and metabolic pathways, i.e. it searches for groups of genes that share conserved synteny and are found in the same metabolic pathway. Using this interface, annotators can quickly check for reaction-hole candidate genes among the conserved misannotated genes of a given group.
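As an illustration only, the pathway completion value used by the ‘Metabolic profiles’ functionality can be sketched in a few lines of Python. This is not the MicroScope implementation: the function names and the genome, pathway and reaction identifiers below are hypothetical, each pathway is assumed to be given as a set of reaction identifiers, and the optional handling of pseudogenes is ignored.

```python
from typing import Dict, Set


def pathway_completion(genome_reactions: Set[str], pathway_reactions: Set[str]) -> float:
    """Return the fraction of a pathway's reactions found in a genome (0.0 to 1.0)."""
    if not pathway_reactions:
        raise ValueError("a pathway must contain at least one reaction")
    # Reactions of the pathway that are also predicted in the genome.
    found = genome_reactions & pathway_reactions
    return len(found) / len(pathway_reactions)


def metabolic_profiles(
    genomes: Dict[str, Set[str]], pathways: Dict[str, Set[str]]
) -> Dict[str, Dict[str, float]]:
    """Build a genome x pathway table of completion values."""
    return {
        genome: {
            name: pathway_completion(reactions, members)
            for name, members in pathways.items()
        }
        for genome, reactions in genomes.items()
    }


# Hypothetical identifiers, for illustration only.
pathways = {"PWY-A": {"R1", "R2", "R3", "R4"}, "PWY-B": {"R5", "R6"}}
genomes = {"genome_1": {"R1", "R2", "R5", "R6"}, "genome_2": {"R7"}}

profiles = metabolic_profiles(genomes, pathways)
print(profiles["genome_1"])  # {'PWY-A': 0.5, 'PWY-B': 1.0}
```

A table of this shape (one row per genome, one column per pathway) is the kind of input a clustering tool would need to group genomes by metabolic capability; a completion value below 1 also flags pathways in which reactions still lack an associated gene.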
Finally, the ‘antiSMASH’ functionality relies on the integration of the antiSMASH (antibiotics and Secondary Metabolite Analysis Shell) program, which enables rapid genome-wide identification, annotation and analysis of secondary metabolite Biosynthesis Gene Clusters (BGCs) in microbial genomes [30]. Each predicted cluster and its genomic context can be explored in a dedicated visualization window, which also shows a graphical representation of the gene domain composition (Figure 7). For nonribosomal peptide synthetase (NRPS) and polyketide synthase (PKS) cluster types, the predicted peptide monomer composition and its corresponding SMILES formula are specified, and the corresponding predicted chemical structure is displayed. For each predicted BGC, a summary of similarities with the reference database MIBiG [31], the BGC gene composition and the tailoring cluster similarities are given. This last item relies on a knowledge database, provided with antiSMASH, of tailoring clusters already described in known BGCs and associated with publications.

Analysis of experimental data

The functionalities available in the ‘Transcriptomics’ and ‘Variant discovery’ menus rely on the results of the pipelines used to analyze data from transcriptomic projects (i.e. RNA-seq experiments) and data from evolution projects (i.e. clones of the same species at different generation times). Exploration of these experimental data has been illustrated in the two most recent publications on the MicroScope platform [6, 7]. The ‘Transcriptomics’ functionality allows users to explore the transcript coverage along the genome, the expression levels of genomic objects (genes, ncRNAs) and differential expression between samples for distinct experimental conditions. All appropriate pairwise comparisons of experimental conditions can be directly queried from the interface.
Differentially expressed genes may be projected on reconstructed metabolic networks to highlight metabolic pathways significantly affected by experimental conditions. The ‘Variant discovery’ functionality offers different tools to explore and analyze the predicted mutations (single nucleotide polymorphisms and small insertions/deletions) in their genomic and functional context. This detection takes into account raw sequencing data and associated read qualities to discriminate between true variations and sequencing errors.

Expert curation of genomic and metabolic data

Using the results of data exploration and of the analysis tools, MicroScope users can review and curate the automatic functional annotation of the genes encoded by their genome of interest. This task is performed using the ‘Gene Editor’, which has been illustrated in the 2013 MicroScope publication [6]. Briefly, it is made of three main sections:

The ‘current annotation’ section allows the user to modify, delete and add information. The functional description of a gene is a free-text field and is therefore prone to inconsistencies across genes and genomes. We have thus also integrated enumerated lists of well-defined and nonredundant terms for the product type field (defined in GenProtEC [32]), for the functional classifications (MultiFun [33] and TIGRFAMs [34]) and for the class field (inspired by the Pseudomonas Genome database [35]), which helps in understanding the origin of the functional annotation (e.g. it comes from the functional description of a homologous gene for which the function has been experimentally demonstrated). The curation of associations between genes coding for enzymatic activities and the biochemical reactions catalyzed by these enzymes is performed using two main resources on enzymatic reactions: MetaCyc [18] and Rhea [36]. Finally, to alert users to possible inconsistencies, the annotation is checked by an automatic procedure launched when the annotation is saved in the database.
The ‘automatic annotation’ section contains the gene function predicted by our automatic functional annotation procedure (‘MicroScope pipeline annotation’), which involves the transfer of reliable, up-to-date reference annotations to ‘strong’ orthologs, if any [4]. For published bacterial genomes integrated in MicroScope, the section also contains the functional annotation from nucleotide sequence databanks and UniProtKB, if available.

The ‘method results’ section provides, for each individual annotation tool executed, a summary of the results in tabulated form (this includes precomputed lists of homologs and synteny groups).

This integrative strategy allows annotators to quickly browse functional evidence, track the history of an annotation and check, for example, the conservation of the gene context with an orthologous gene that has an experimentally demonstrated biological function. Criteria for entering an expert annotation are based on different levels of evidence, from direct experimentation to bioinformatics evidence. The confidence status of each gene annotation is available in the class field of the gene editor. The categories are inspired by the ‘protein name confidence’ defined in PseudoCAP (Pseudomonas aeruginosa community annotation project). A set of rules for choosing this ‘class’ annotation category according to bioinformatics evidence is proposed in our MicroScope tutorial: https://microscope.readthedocs.io/en/latest/content/mage/info.html (‘How to choose the “Class” annotation category?’ and ‘Annotation Rules’ sections). Following the integration of novel functionalities into MicroScope, the ‘Gene Editor’ is constantly evolving. First, new interfaces to ease the curation of resistance and virulence genes are under development, especially using defined ontologies such as ARO, the Antibiotic Resistance Ontology [22].
Second, to fully exploit the results of the different tools dedicated to genomic region analysis (e.g. antiSMASH or RGPfinder), we are currently developing a specific editor to annotate gene clusters such as operons, BGCs, genomic islands, CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) regions, secretion systems and phages. Expert annotations are continuously gathered in the MicroScope database: ∼35 000 annotations are made per year (Figure 1), and >370 000 genes have been curated so far. A third of these annotations correspond to the description of precise molecular functions supported by direct or indirect (i.e. from homology relationships) experimental evidence. Biologists generally focus their annotations on proteins/functions of interest; however, it is interesting to note that about 50 genomes integrated in MicroScope are almost completely curated (≥80% of the genes were expertly annotated), and 124 additional genomes have >300 curated genes. MicroScope annotations are submitted to INSDC databanks when the genomes are published and can be easily downloaded via the Web interface (‘Search/Export->Download Data’ functionality). Moreover, we provide a RESTful API for programmatic access to public genome data, and semantic Web approaches are currently used to work on the interoperability of MicroScope curated data with other European resources such as UniProtKB [9], HAMAP [37], EnsemblBacteria [1] and Rhea [36]. These developments are performed in the context of the ELIXIR bioinformatics infrastructure (https://www.elixir-europe.org).

Software and database architecture

The technical architecture of the MicroScope platform is shown in Figure 8. Its three components have been described and updated in the previous publications of MicroScope [5, 6]. In summary:

Figure 8. Technical architecture of the MicroScope platform.
The MicroScope platform is made of three components: (i) a ‘Process management’ system to organize workflow execution, (ii) a ‘Data management’ system, called PkGDB, to store information from databanks, genomes and computational results and (iii) a ‘Visualization’ system for textual and graphical representation of PkGDB data.

Process management system

The annotation pipelines are organized in a robust automated workflow management system using the jBPM framework (Java Business Process Management; http://jbpm.org), which allows us to handle millions of tasks simultaneously for the analysis of several new microbial genomes. These tasks are parallelized on hundreds of CPU cores using the Pegasus MPI cluster module (https://pegasus.isi.edu). The pipelines for the structural, functional and relational annotation orchestrate >50 external/internal bioinformatics software tools (see section ‘Running the MicroScope pipelines’). A large part of these analyses is updated at regular intervals to take into account the growth of primary databases and new expert annotations.

Data management system

The results of these analysis tools, together with the primary data used as inputs, are stored in a relational database named PkGDB, which is based on the open-source MySQL relational database management system and uses the InnoDB (for continuous data integration and incremental updates) and MyISAM (for large bulk inserts) table engines. The PkGDB architecture supports the integration of automatic and human-curated functional annotations and records a history of all modifications. Finally, for comparative metabolic analyses (see the ‘Metabolic profiles’ functionality in the ‘Exploration of metabolic data’ section), relational tables have been designed in PkGDB to store information from the MicroCyc PGDBs, together with the KEGG metabolic pathways and modules. PkGDB currently holds 1 TB of databank and genome data and 30 TB of computational results (Figure 8).
A single instance of the database gathers all genome analyses, which eases the collaborative annotation process.

The Web visualization component

The MicroScope Web interface (http://www.genoscope.cns.fr/agc/microscope) is developed in PHP, served by Apache, and consists of numerous dynamic Web pages containing textual and graphical representations for accessing and querying data. Several useful graphical applications, such as Artemis [38], MeV [39] and IGV [40], are also available in the MicroScope interface as embedded Java applications. As shown in this article, the tools are organized in a menu bar to facilitate the exploration and curation process. At any level of the interface, a ‘Help’ functionality is available, and a complete tutorial can be found in the ‘About’ menu.

Conclusion

In this article, we have described the MicroScope platform from the point of view of the end user, i.e. following one of the main objectives of our prokaryotic genome annotation and comparative system: to allow biologists to submit their genomic data in a simple way and then to perform analyses and make relevant assessments of the predicted gene functions using (i) the functionalities for querying and browsing the computed data, (ii) the synteny results and metabolic network predictions, the combination of which can be helpful in formulating hypotheses on the biological function of nonannotated genes and (iii) a gene annotation editor giving access to the results of each method applied, together with links to several useful public resources. Among the ongoing developments described in the last update of the platform [7], we have made great progress in the consensus representation of thousands of bacterial genomes to provide a better analysis workflow for prokaryotic species.
The idea is to structure the pan-genome of an organism into the set of ‘persistent’ genes (a relaxed core definition, that is to say genes found in the great majority of the genomes), the ‘shell’, which gathers moderately conserved genes, and the ‘cloud’, which corresponds to rare and unique genes [41]. To organize pangenomic information, we are using a graph data model in which the nodes represent the protein families and the edges represent the genomic co-localization of two protein families (weighted by the number of genomes sharing this co-localization). A statistical method is then used to divide the pan-genome into the three main classes (persistent, shell and cloud). The next step is the integration of this representation in MicroScope to facilitate comparative analysis and data visualization of thousands of strains. We will also add functionalities allowing users to select, at any level of this pan-genome graph, a subpart of the graph and, using one genome as reference, to come back to the MaGe genome browser. We are starting to work on an instance of MicroScope based on this novel pan-genome representation that will contain most of the reference species found in the human gut microbiota.

Key Points

MicroScope is open to microbiologists interested in extended analyses of species of interest.
MicroScope is an integrated environment that allows users to perform comparative genomic and metabolic analyses.
Tools and graphical interfaces for the curation of gene function are part of the specificities of the MicroScope platform.
MicroScope provides a collaborative environment to share and improve knowledge on genomes.

Claudine Médigue, PhD, is a research director at CNRS. She is the head of the Laboratoire d’Analyse Bioinformatiques en Génomique et Métabolisme located at Genoscope. She has worked on the annotation and comparative analysis of prokaryotic genomes for >25 years.
Alexandra Calteau is a senior researcher at CEA.
She contributes to different bioanalysis projects and the development of functionalities in the MicroScope platform, mainly in the Comparative Genomics field. She is responsible for the MicroScope professional training organization, and for the quality management of the LABGeM.
Stéphane Cruveiller is a senior researcher at CEA. He manages the MicroScope services and developments for the analysis of variant discovery, transcriptomics and metagenomics data. He has specific research activities in microbial evolution.
Mathieu Gachet is a master's student at CEA. He is working on the improvement of metagenomic data integration in MicroScope.
Guillaume Gautreau is a PhD student at CEA. He works on the development of pan-genome graphs in MicroScope and their application in metagenomics.
Adrien Josso is an engineer in bioinformatics at CEA. He works on MicroScope software development for workflow management and metabolic data integration.
Aurélie Lajus is an engineer in bioinformatics at CEA. She works mainly on (meta)genome project management and software integration in the MicroScope platform.
Jordan Langlois is an engineer in bioinformatics at CEA. He works on software integration and Web developments in the MicroScope platform.
Hugo Pereira is an engineer in bioinformatics at CEA. He works on MicroScope software development for workflow management of NGS projects.
Rémi Planel is an engineer in bioinformatics at CEA. He works on MicroScope software development for pan-genome computation and Web visualization.
David Roche is an engineer in bioinformatics at CEA. He works on NGS project management, software integration and Web developments in the MicroScope platform. He is also involved in the training of MicroScope users.
Johan Rollin is an engineer in bioinformatics at CEA. He works on software integration and Web developments in the MicroScope platform.
Zoe Rouy is an engineer in bioinformatics at CEA.
She works mainly on (meta-)genome project management, software integration and Web developments in the MicroScope platform.
David Vallenet is a senior researcher at CEA. He manages all the technological developments of the MicroScope platform and has specific research activities in the development of methods for enzyme function prediction and metabolic network analysis.

Acknowledgements

The authors would like to thank all MicroScope users for their feedback, which helped greatly in optimizing and improving many functionalities of the system. The authors also thank the entire IT system team of Genoscope for its essential contribution to the efficiency of the platform.

Funding

French Government ‘Investissements d’Avenir programmes’, namely, FRANCE GENOMIQUE (grant number ANR-10-INBS-09-08); INSTITUT FRANCAIS DE BIOINFORMATIQUE (grant number ANR-11-INBS-0013).

References

1 Kersey PJ, Allen JE, Armean I, et al. Ensembl Genomes 2016: more genomes, more complexity. Nucleic Acids Res 2016;44:D574–80.
2 Chen I-MA, Markowitz VM, Palaniappan K, et al. Supporting community annotation and user collaboration in the integrated microbial genomes (IMG) system. BMC Genomics 2016;17:307.
3 Wattam AR, Davis JJ, Assaf R, et al. Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center. Nucleic Acids Res 2017;45:D535–42.
4 Vallenet D, Labarre L, Rouy Z, et al. MaGe: a microbial genome annotation system supported by synteny results. Nucleic Acids Res 2006;34:53–65.
5 Vallenet D, Engelen S, Mornico D, et al. MicroScope: a platform for microbial genome annotation and comparative genomics. Database 2009;2009:bap021.
6 Vallenet D, Belda E, Calteau A, et al. MicroScope–an integrated microbial resource for the curation and comparative analysis of genomic and metabolic data. Nucleic Acids Res 2013;41:D636–47.
7 Vallenet D, Calteau A, Cruveiller S, et al. MicroScope in 2017: an expanding and evolving integrated resource for community expertise of microbial genomes. Nucleic Acids Res 2017;45:D517–28.
8 Wilson CA, Kreychman J, Gerstein M. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol 2000;297:233–49.
9 The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2017;45:D158–69.
10 Overbeek R, Begley T, Butler RM, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005;33:5691–702.
11 Barbe V, Vallenet D, Fonknechten N, et al. Unique features revealed by the genome sequence of Acinetobacter sp. ADP1, a versatile and naturally transformation competent bacterium. Nucleic Acids Res 2004;32:5766–79.
12 Touchon M, Hoede C, Tenaillon O, et al. Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths. PLoS Genet 2009;5:e1000344.
13 Barbe V, Cruveiller S, Kunst F, et al. From a consortium sequence to a unified sequence: the Bacillus subtilis 168 reference genome a decade later. Microbiology 2009;155:1758–75.
14 Belda E, Sekowska A, Le Fèvre F, et al. An updated metabolic view of the Bacillus subtilis 168 genome. Microbiology 2013;159:757–70.
15 Belda E, van Heck RG, José Lopez-Sanchez M, et al. The revisited genome of Pseudomonas putida KT2440 enlightens its value as a robust metabolic chassis. Environ Microbiol 2016;18:3403–24.
16 Field D, Garrity G, Gray T, et al. The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol 2008;26:541–7.
17 Bocs S, Cruveiller S, Vallenet D, et al. AMIGene: annotation of MIcrobial genes. Nucleic Acids Res 2003;31:3723–6.
18 Caspi R, Billington R, Ferrer L, et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 2016;44:D471–80.
19 Karp PD, Latendresse M, Paley SM, et al. Pathway Tools Version 19.0 update: software for pathway/genome informatics and systems biology. Brief Bioinform 2015;17:877–90.
20 Vieira G, Sabarly V, Bourguignon PY, et al. Core and panmetabolism in Escherichia coli. J Bacteriol 2011;193:1461–72.
21 Cruveiller S, Le Saux J, Vallenet D, et al. MICheck: a web tool for fast checking of syntactic annotations of bacterial genomes. Nucleic Acids Res 2005;33:W471–9.
22 Jia B, Raphenya AR, Alcock B, et al. CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res 2017;45:D566–73.
23 Suhre K. Inference of gene function based on gene fusion events: the Rosetta-Stone method. Methods Mol Biol 2007;396:31–41.
24 Vernikos GS, Parkhill J. Interpolated variable order motifs for identification of horizontally acquired DNA: revisiting the Salmonella pathogenicity islands. Bioinformatics 2006;22:2196–203.
25 Miele V, Penel S, Duret L. Ultra-fast sequence clustering from similarity networks with SiLiX. BMC Bioinformatics 2011;12:116.
26 Waterhouse AM, Procter JB, Martin DM, et al. Jalview Version 2–a multiple sequence alignment editor and analysis workbench. Bioinformatics 2009;25:1189–91.
27 Chen L, Zheng D, Liu B, et al. VFDB 2016: hierarchical and refined dataset for big data analysis–10 years on. Nucleic Acids Res 2016;44:D694–7.
28 Joensen KG, Scheutz F, Lund O, et al. Real-time whole-genome sequencing for routine typing, surveillance, and outbreak detection of verotoxigenic Escherichia coli. J Clin Microbiol 2014;52:1501–10.
29 Aravind L. Guilt by association: contextual information in genome analysis. Genome Res 2000;10:1074–7.
30 Blin K, Wolf T, Chevrette MG, et al. antiSMASH 4.0-improvements in chemistry prediction and gene cluster boundary identification. Nucleic Acids Res 2017;45:36–41.
31 Medema MH, Kottmann R, Yilmaz P, et al. Minimum information about a Biosynthetic Gene cluster. Nat Chem Biol 2015;11:625–31.
32 Serres MH, Goswami S, Riley M. GenProtEC: an updated and improved analysis of functions of Escherichia coli K-12 proteins. Nucleic Acids Res 2004;32:D300–2.
33 Serres MH, Riley M. MultiFun, a multifunctional classification scheme for Escherichia coli K-12 gene products. Microb Comp Genomics 2000;5:205–22.
34 Haft DH, Selengut JD, Richter RA, et al. TIGRFAMs and genome properties in 2013. Nucleic Acids Res 2013;41:D387–95.
35 Winsor GL, Griffiths EJ, Lo R, et al. Enhanced annotations and features for comparing thousands of Pseudomonas genomes in the Pseudomonas genome database. Nucleic Acids Res 2016;44:D646–53.
36 Morgat A, Lombardot T, Axelsen KB, et al. Updates in Rhea – an expert curated resource of biochemical reactions. Nucleic Acids Res 2017;45:4279.
37 Pedruzzi I, Rivoire C, Auchincloss AH, et al. HAMAP in 2015: updates to the protein family classification and annotation system. Nucleic Acids Res 2015;43:D1064–70.
38 Carver T, Harris SR, Berriman M, et al. Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics 2012;28:464–9.
39 Saeed AI, Sharov V, White J, et al. TM4: a free, open-source system for microarray data management and analysis. Biotechniques 2003;34:374–8.
40 Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 2013;14:178–92.
41 Lobkovsky AE, Wolf YI, Koonin EV. Gene frequency distributions reject a neutral model of genome evolution. Genome Biol Evol 2013;5:233–42.

© The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
journal article
The BioCyc collection of microbial genomes and metabolic pathways

Karp, Peter D; Billington, Richard; Caspi, Ron; Fulcher, Carol A; Latendresse, Mario; Kothari, Anamika; Keseler, Ingrid M; Krummenacker, Markus; Midford, Peter E; Ong, Quang; Ong, Wai Kit; Paley, Suzanne M; Subhraveti, Pallavi

2019 Briefings in Bioinformatics

doi: 10.1093/bib/bbx085 pmid: 29447345

Abstract

BioCyc.org is a microbial genome Web portal that combines thousands of genomes with additional information inferred by computer programs, imported from other databases and curated from the biomedical literature by biologist curators. BioCyc also provides an extensive range of query tools, visualization services and analysis software. Recent advances in BioCyc include an expansion in the content of BioCyc in terms of both the number of genomes and the types of information available for each genome; an expansion in the amount of curated content within BioCyc; and new developments in the BioCyc software tools including redesigned gene/protein pages and metabolite pages; new search tools; a new sequence-alignment tool; a new tool for visualizing groups of related metabolic pathways; and a facility called SmartTables, which enables biologists to perform analyses that previously would have required a programmer’s assistance.

Keywords: genome databases, microbial genome databases, metabolic pathway databases

Introduction

BioCyc.org is a microbial genome Web portal that combines thousands of genomes with additional information inferred by computer programs, imported from other databases (DBs) and curated from the biomedical literature by biologist curators. BioCyc also provides an extensive range of query tools, visualization services and analysis software. BioCyc has been developed over a 25-year period, beginning with the EcoCyc DB for Escherichia coli. Over time, the content of BioCyc has expanded in terms of the number of genomes, the types of information available for each genome and the amount of curated content. BioCyc has also grown to include some eukaryotic genomes (although its main emphasis is microbial). The software behind BioCyc, called Pathway Tools [1,2], has also expanded in many ways during this period, for example to support regulatory networks, omics data analysis and metabolic modeling.
Recent enhancements include redesigned gene/protein pages and metabolite pages, new search tools, a new sequence-alignment tool, a new tool for visualizing groups of related metabolic pathways and a facility called SmartTables, which enables biologists to perform analyses that previously would have required a programmer’s assistance.

Expansion of BioCyc DB content

Each BioCyc DB describes one sequenced genome, with the exception of the MetaCyc DB, which describes experimentally studied metabolic pathways from all domains of life. Since 2011, BioCyc has expanded from 1000 genomes to 9300 genomes. The majority of those genomes were obtained from GenBank RefSeq and from the Human Microbiome Project complete genomes DB. As the majority of sequenced microbial genomes are of interest to a relatively small number of researchers, BioCyc emphasizes breadth and quality of information for more highly used genomes at the expense of the number of genomes. To facilitate access to the more commonly used BioCyc Pathway/Genome Databases (PGDBs), we have created the set of home pages listed in Table 1. When entering BioCyc through these home pages, the user’s default organism will be set to the BioCyc PGDB for the primary strain for that species.
Table 1 Home pages for BioCyc organisms Home page Genus ecocyc.org Escherichia coli helicobacter.biocyc.org Helicobacter pylori vibrio.biocyc.org Vibrio cholerae listeria.biocyc.org Listeria monocytogenes salmonella.biocyc.org Salmonella enterica shigella.biocyc.org Shigella flexneri cdifficile.biocyc.org Clostridium difficile mycobacterium.biocyc.org Mycobacterium tuberculosis pseudomonas.biocyc.org Pseudomonas aeruginosa yeast.biocyc.org Saccharomyces cerevisiae Home page Genus ecocyc.org Escherichia coli helicobacter.biocyc.org Helicobacter pylori vibrio.biocyc.org Vibrio cholerae listeria.biocyc.org Listeria monocytogenes salmonella.biocyc.org Salmonella enterica shigella.biocyc.org Shigella flexneri cdifficile.biocyc.org Clostridium difficile mycobacterium.biocyc.org Mycobacterium tuberculosis pseudomonas.biocyc.org Pseudomonas aeruginosa yeast.biocyc.org Saccharomyces cerevisiae Open in new tab Table 1 Home pages for BioCyc organisms Home page Genus ecocyc.org Escherichia coli helicobacter.biocyc.org Helicobacter pylori vibrio.biocyc.org Vibrio cholerae listeria.biocyc.org Listeria monocytogenes salmonella.biocyc.org Salmonella enterica shigella.biocyc.org Shigella flexneri cdifficile.biocyc.org Clostridium difficile mycobacterium.biocyc.org Mycobacterium tuberculosis pseudomonas.biocyc.org Pseudomonas aeruginosa yeast.biocyc.org Saccharomyces cerevisiae Home page Genus ecocyc.org Escherichia coli helicobacter.biocyc.org Helicobacter pylori vibrio.biocyc.org Vibrio cholerae listeria.biocyc.org Listeria monocytogenes salmonella.biocyc.org Salmonella enterica shigella.biocyc.org Shigella flexneri cdifficile.biocyc.org Clostridium difficile mycobacterium.biocyc.org Mycobacterium tuberculosis pseudomonas.biocyc.org Pseudomonas aeruginosa yeast.biocyc.org Saccharomyces cerevisiae Open in new tab Workflow for generation of BioCyc PGDBs To produce new BioCyc PGDBs, we process each BioCyc genome through the computational steps shown in Figure 1 both to 
computationally infer new information for the genome and to integrate additional information from other bioinformatics DBs. The amount of information found by the different import steps will vary for different organisms. Note that we retain the original genome annotation that was present in the downloaded genome file(s) for each organism.

Figure 1 Processing steps involved in generating the BioCyc DBs. Recently added steps are shown in bold. No relative ordering is implied between the steps along the top and the steps along the bottom.

First, Pathway Tools converts the annotated genome from the GenBank format to its internal PGDB format. Next, the computational operations in the upper portion of Figure 1 are performed. Pathway Tools modules make the following predictions [2]. Metabolic and transport reactions and metabolic pathways are predicted [3] from the reactions and pathways in the MetaCyc DB [4]. Next come prediction of pathway hole fillers (genes that code for enzymes catalyzing reactions with no currently assigned enzyme) and prediction of operons using both structural and functional information [5].

Orthologs among BioCyc genomes are computed by software that runs large-scale bidirectional BLAST (version 2.2.23) comparisons among all pairs of proteins in the BioCyc genomes. We use a BLAST E-value cutoff of 0.001, with all other parameters at default settings. We define two proteins A and B as orthologs if protein A from proteome PA and protein B from proteome PB are bidirectional best BLAST hits of one another, meaning that protein B is the best BLAST hit of protein A within proteome PB, and protein A is the best BLAST hit of protein B within proteome PA.
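A minimal Python sketch of this bidirectional-best-hit rule follows. It is not BioCyc's code: the `Hit` record, the function names and the example identifiers are hypothetical, and the input is assumed to be BLAST hits already parsed for a single pair of proteomes. The sketch also applies the tie-breaking by alignment length and then by identical residues that the text goes on to describe, keeping any ties that remain.

```python
from collections import namedtuple

# Hypothetical record for one parsed BLAST hit (not a BioCyc data structure).
Hit = namedtuple("Hit", "query subject evalue align_len identities")


def best_hits(hits, evalue_cutoff=1e-3):
    """Map each query to the set of its best subjects in the other proteome.

    Best = minimal E-value; ties are broken by maximum alignment length,
    then by maximum number of identical residues; remaining ties are kept.
    Treating the cutoff as inclusive (<=) is an assumption of this sketch.
    """
    by_query = {}
    for hit in hits:
        if hit.evalue <= evalue_cutoff:
            by_query.setdefault(hit.query, []).append(hit)

    best = {}
    for query, candidates in by_query.items():
        min_evalue = min(h.evalue for h in candidates)
        tied = [h for h in candidates if h.evalue == min_evalue]
        max_len = max(h.align_len for h in tied)
        tied = [h for h in tied if h.align_len == max_len]
        max_ident = max(h.identities for h in tied)
        tied = [h for h in tied if h.identities == max_ident]
        best[query] = {h.subject for h in tied}
    return best


def bidirectional_best_hits(a_vs_b, b_vs_a, evalue_cutoff=1e-3):
    """Return the set of (A, B) pairs that are best hits of one another."""
    best_ab = best_hits(a_vs_b, evalue_cutoff)
    best_ba = best_hits(b_vs_a, evalue_cutoff)
    return {
        (a, b)
        for a, subjects in best_ab.items()
        for b in subjects
        if a in best_ba.get(b, set())
    }


if __name__ == "__main__":
    # Invented example: B1 and B2 are exact duplicates, B3 aligns over a
    # shorter region, and B4 prefers A3 over A2.
    a_vs_b = [
        Hit("A1", "B1", 1e-50, 300, 280),
        Hit("A1", "B2", 1e-50, 300, 280),
        Hit("A1", "B3", 1e-50, 250, 240),
        Hit("A2", "B4", 1e-20, 100, 90),
        Hit("A3", "B4", 1e-30, 120, 100),
    ]
    b_vs_a = [
        Hit("B1", "A1", 1e-50, 300, 280),
        Hit("B2", "A1", 1e-50, 300, 280),
        Hit("B3", "A1", 1e-40, 250, 240),
        Hit("B4", "A3", 1e-30, 120, 100),
        Hit("B4", "A2", 1e-20, 100, 90),
    ]
    print(sorted(bidirectional_best_hits(a_vs_b, b_vs_a)))
    # [('A1', 'B1'), ('A1', 'B2'), ('A3', 'B4')]
```

In the invented example, A1 ends up with two orthologs because B1 and B2 are indistinguishable by E-value, alignment length and identity, which is the exact-duplication case the text mentions.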
In rare cases, protein A might have multiple orthologs in proteome PB, as explained below. The best hit of protein A in proteome PB is defined by finding the minimal E-value among all hits in proteome PB in the BLAST output, and collecting all the hits for A in proteome PB that have the same minimal E-value. In other words, ties are possible, as in the case of exact gene duplications. We attempt to break ties using two methods: taking the hit with the maximum alignment length; and then taking the hit with the maximum alignment amino acid residue identity. For the first method, we compare the alignment lengths among all the hits of protein A in proteome PB that share the same minimum E-value, and the protein in proteome PB with the maximum alignment length is selected. For the second method, we compare the number of identical amino acid residues in the alignments between protein A and the hits of protein A in proteome PB that share the same minimum E-value, and the protein in proteome PB with the maximum number of identical amino acid residues is selected. In the case that ties still remain (as in the case of exact gene duplications), all ties are included in the final set of orthologs used by BioCyc. Thus, protein A could have multiple orthologs in PB, such as if multiple proteins B1, B2, etc., exist in PB, and have exactly the same regions align against protein A. BioCyc does not calculate paralogs. Pfam [6] domains are identified in BioCyc proteins by running the Pfam software. Finally, zoomable cellular overview (metabolic map) diagrams are generated for each organism. Next, data from several third-party DBs are imported into BioCyc, as shown in the lower portion of Figure 1. Protein-feature data, such as locations of enzyme active sites, phosphorylation sites and metal-ion binding sites, are loaded from UniProt [7], as are Gene Ontology (GO) [8] annotations. Predicted subcellular localizations are loaded from PSORTdb [9]. 
Descriptions of promoters, transcription factor-binding sites and regulatory interactions are loaded from RegTransBase [10]. Organism phenotype data, such as aerobicity, are loaded from the National Center for Biotechnology Information (NCBI) BioSample DB, as are organism metadata, such as the geographical location of the site from which the sequenced organism was collected. Gene essentiality data have been loaded from the OGEE DB [11] and from individual articles. Phenotype microarray data have also been loaded from individual articles. We also generate Web links from BioCyc to other related DBs, such as UniProt, NCBI-Bioproject and BioSample.

BioCyc curation

After the preceding automated processing, some BioCyc DBs receive manual curation to integrate additional information and to remove some false-positive predictions. All in all, the information within the BioCyc DBs has been curated from 80 900 different publications, as shown in Table 2. The BioCyc DBs are organized into three tiers [12] to communicate the amount of manual curation that each DB has received:

Tier 1 PGDBs have received at least one person-year of curation; some PGDBs have received person-decades of curation.
Tier 2 PGDBs have received at least one person-month of curation.
Tier 3 PGDBs have received no manual curation.

Table 2 For those BioCyc version 21.0 PGDBs citing ≥100 references, we list the number of references cited by each PGDB (and from which the information in each PGDB was curated), sorted by number of citations

DB                                      Citations  Tier
MetaCyc                                 52 446     1
Escherichia coli K-12 substr. MG1655    31 555     1
Saccharomyces cerevisiae S288c          12 018     1
Bacillus subtilis subtilis 168          3682       2
Clostridioides difficile 630            2027       2
Mycobacterium tuberculosis H37Rv        1521       2
Chlamydomonas reinhardtii               1233       2
Candida albicans SC5314                 623        2
Streptomyces coelicolor A3(2)           343        2
Synechococcus elongatus PCC 7942        284        2
Agrobacterium fabrum C58                257        2
Leishmania major strain Friedlin        212        1
Corynebacterium glutamicum ATCC 13032   184        2
Listeria monocytogenes 10403S           176        2
Candidatus Evansia muelleri             147        2

Note: In many cases, the curation was performed by BioCyc curators, and in other cases, the curation was performed by other DBs from which information was imported (e.g. from GO term curation or from UniProt protein-feature curation). For these PGDBs, we have removed from the citation counts those references shared with MetaCyc classes and metabolites (which were likely copied from MetaCyc during PGDB creation). MetaCyc and EcoCyc cite a number of common references because the EcoCyc pathway and enzyme data and their references are periodically copied from EcoCyc to MetaCyc.

Some BioCyc PGDBs were contributed by groups outside SRI (for example, the Chlamydomonas reinhardtii PGDB was developed by the Carnegie Institution for Science, and the Streptomyces coelicolor PGDB was developed by the University of Warwick and the John Innes Centre). The authors of each PGDB are listed on the summary page that is displayed when a user changes the current PGDB.

The Clostridioides difficile 630 PGDB has undergone several recent curation enhancements. We updated its genome annotation from the recently revised RefSeq entry, and from the annotation from the MicroScope site [13].
We performed literature searches and curation updates for 213 proteins listed in MicroScope as having experimental evidence for their function in C. difficile or in the Clostridioides genus, as well as other genes encountered during the course of literature searches. Those proteins with experimental evidence in C. difficile are now annotated with experimental evidence codes and contain references to the literature from which their enhanced curation was derived.

Curation adds value to BioCyc PGDBs in many ways, and is a major factor in differentiating BioCyc from other bacterial genome PGDBs. All computational prediction methods make errors, including predictors of gene boundaries, protein function and metabolic pathways. Curators correct errors in those predictions, and they supplement computational predictions with information from the experimental literature. They also annotate experimentally known information with experimental evidence codes and literature citations to indicate high-confidence information. Curators capture a wide variety of information in BioCyc PGDBs (Table 3) including protein functions, metabolic reactions and pathways and regulatory interactions of several types (such as allosteric regulation of enzymes, and control of gene expression via transcription factors and small RNAs).

Table 3 Datatypes available in PGDBs, and statistics on the number of objects of each type in various PGDBs

                                    Escherichia coli     Bacillus subtilis  Synechococcus elongatus  Mycobacterium tuberculosis
                                    K-12 substr. MG1655  subtilis 168       PCC 7942                 Beijing/NITR203
DB tier                             1                    2                  2                        3
Data type
Genome metadata                     2                    2                  4                        0
Genes                               4657                 4440               2719                     4206
Operons                             3564                 1604               1982                     2680
Promoters                           3850                 1193               44                       0
Transcription factor-binding sites  2918                 763                36                       0
Terminators                         303                  1146               0                        0
Proteins                            5719                 4407               2832                     4127
Protein features                    4223                 3029               0                        0
Gene Ontology terms                 5733                 3927               2518                     4
Metabolites                         2758                 942                990                      1169
Metabolic reactions                 1712                 1158               1100                     1450
Metabolic pathways                  396                  269                230                      285
Transport reactions                 1526                 1048               953                      1148
Genetic regulatory networks         3438                 788                41                       0
Evidence codes                      134 561              58 658             15 098                   3137
Growth media                        436                  1                  2                        0
Gene essentiality                   4239                 4217               2421                     0

Note: Different DBs contain different proportions of these datatypes depending on factors such as the amounts of data available in DBs from which BioCyc imports information, and the amount of data curated from the literature. Typically, DBs that have received more curation will have objects of a wider range of datatypes.

Curators author mini-review summaries appearing in the protein, pathway and operon pages, which summarize findings from multiple publications and save users significant amounts of time in poring through the primary literature.
For some BioCyc PGDBs, person-decades of curation work have been performed across tens of thousands of publications, resulting in large volumes of mini-review summaries, measured in textbook page equivalents: EcoCyc version 21.0 contains 2907 textbook-equivalent pages of summaries and MetaCyc version 21.0 contains 7897 such pages. Further, curators enter a wide range of experimentally determined information that cannot be inferred computationally, including enzyme activators and inhibitors, protein subunit structure, enzyme kinetic values, protein features (e.g. active site residues) and transcriptional regulatory interactions. Although automated text mining software has shown gradual improvement over the years, its accuracy is still far from that of human curators. In addition, text mining systems are typically limited to extracting fewer types of data than the wide range of information that BioCyc curators capture. Perhaps most importantly, only human curators can correctly resolve the many disagreements, inconsistencies and errors found in the literature. Many metabolic pathways and enzymes are complex, and earlier reports often contain information that has been later partially or completely invalidated. For example, enzyme commission (EC) 2.3.1.111, mycocerosate synthase, was initially reported to release its product in the form of a coenzyme-A activated compound, but later, it was shown that the products remain bound to the enzyme at the end of catalysis because of a lack of a thioesterase function. A computer program reading through the conflicting reports would have great difficulty in reconciling the information from the different publications. An experienced human curator, on the other hand, can integrate the information and generate a review that consolidates all sources and provides an accurate review of current knowledge. 
Expansion of bioinformatics tools

The BioCyc.org Web site offers, to our knowledge, the most extensive set of bioinformatics tools of any microbial genome portal (Table 4). Many of the tools provide visualization services that aid the user in navigating the large and complex information space within BioCyc.

Table 4 BioCyc software tools

Genome tools
  Gene/Protein/RNA Page             Presents information about individual genes and their products
  Genome Browser                    View genes and other genome regions at variable magnification, from full genome on one screen to sequence level
  Comparative Genome Browser        Align genome regions from multiple organisms around shared orthologous genes
  Genome Poster                     Printable poster containing full-genome diagram
  Gene Ontology Browser             Hierarchical browser for navigating within GO
  Sequence Alignment Viewer         Align nucleotide or amino acid sequences
  BLAST Sequence Search             Search nucleotide or amino acid sequences against individual BioCyc genomes or against all BioCyc genomes
  Sequence Pattern Search           Search short nucleotide or amino acid sequences including wild cards against a BioCyc genome
  Gene Regulatory Network Browser   Visualize and navigate within complete organism regulatory network
  Operon Page                       Presents information about an operon and its regulatory sites and regulators
  Comparative Analysis              Compare genome and pathway information across organisms

Metabolism tools
  Pathway Page                      Presents information about individual metabolic pathways
  Pathway Collage Diagrams          Personalized multi-pathway diagrams: the user chooses a set of pathways, positions them relative to one another and defines connections among them
  Metabolic Network Browser         Zoomable browsing of organism-specific metabolic network diagrams
  Metabolic Network Posters         Printable organism-specific metabolic network diagrams
  Run Metabolic Models              Execute quantitative steady-state metabolic models
  Chokepoint Analysis               Compute potential antimicrobial drug targets based on metabolic network choke points
  Dead-End Metabolite Analysis      Find metabolites that are not producible or not consumable
  Metabolic Route Search            Search for optimal paths through reaction network connecting starting and ending metabolites

Omics Data Analysis
  Paint Data on Metabolic Network   Color full zoomable metabolic network diagram with omics data
  Paint Data on Pathway Diagram     Color individual pathway diagrams with omics data
  Paint Data on Pathway Collage     Color pathway collage diagrams with omics data
  Paint Data on Genome Browser      Color single-screen genome diagram with omics data
  Paint Data on Regulatory Network  Color full regulatory network diagram with omics data
  Enrichment Analysis               Compute statistical enrichment of GO terms, pathways, regulators

Other tools
  Update Notifications              Receive notification of curation updates in declared research interest areas
  Advanced Search                   Create SQL-like complex DB searches
  Cross-Organism Search             Perform text searches across all of BioCyc or specified groups of organisms
  Multi-Organism Route Search       Search for metabolic routes that cross multiple organisms such as the gut microbiome
  SmartTables                       Define groups of genes, pathways, metabolites, etc., and manipulate those groups as a programmer would

This section surveys a number of recent developments in the BioCyc software tools.
For a comprehensive description of the Pathway Tools software behind BioCyc, see [1].

Run Metabolic Model

The ‘Run Metabolic Model’ command allows users to solve steady-state metabolic models based on flux balance analysis [14]. Metabolic models are generated from the reactions stored in a PGDB by the MetaFlux component of Pathway Tools [1]. Users must log in to their BioCyc account to use the ‘Run Metabolic Model’ command. Existing public models, or a user’s own private models, can be executed. For example, after the user selects the Escherichia coli K-12 substr. MG1655 organism, the ‘Run Metabolic Model’ command opens a new Web page listing several public models available for that organism (by the owner ‘BRG SRI’) and any metabolic models that the user has created, some of which may be private. Click the ‘Select’ button of a model to analyze and execute that model. After clicking the ‘Execute’ button, the ‘Results’ tab will provide the biomass flux of that model, three buttons (‘Show Solution File’, ‘Show Log File’ and ‘Show Fluxes on Cellular Overview’) to further analyze the solution and the list of all reactions that are active (i.e. reactions with nonzero fluxes) in the model. The solution file shows much more detail about the solutions, such as the fluxes of all biomass metabolites, nutrients and secreted metabolites. The log file gives a complete list of all reactions that are in the model, including the non-active reactions, the blocked reactions, the instantiated reactions and more. The ‘Show Fluxes on Cellular Overview’ button, if clicked, will open the Cellular Overview metabolic map diagram of the organism with the reactions and pathways highlighted according to the fluxes of the reactions. The model specification can be seen by selecting the tabs ‘Nutrients’, ‘Reactions’, ‘Biomass’ and ‘Secreted Metabolites’.
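As a reminder of what solving such a model involves, flux balance analysis is usually posed as a linear program. The formulation below is the standard textbook form, not a description of MetaFlux internals; the symbols S, v, c, l and u are notation introduced here for illustration.

```latex
\begin{aligned}
\max_{v}\quad & c^{\mathsf{T}} v \\
\text{subject to}\quad & S\,v = 0, \\
& l_j \le v_j \le u_j \quad \text{for every reaction } j.
\end{aligned}
```

Here S is the stoichiometric matrix (metabolites by reactions), v is the vector of reaction fluxes and c selects the objective, typically the biomass flux. The steady-state condition S v = 0 requires that every internal metabolite be produced as fast as it is consumed. Under this reading, the nutrient and secreted-metabolite sets of a model determine which exchange fluxes may be nonzero through the bounds l and u, and the reactions with nonzero v_j in the optimal solution correspond to the ‘active’ reactions listed in the ‘Results’ tab.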
From the Web page displaying one model, you can go back to the list of models by clicking ‘View Models’ near the top left corner of the Web page. The publicly available models can be copied and modified. To copy a model, click the ‘Copy’ button and enter a new name for it. The copy is private in your own account and can be modified and solved at will. For example, the nutrients and secreted metabolites of the model can be modified to execute under different growth conditions (e.g. anaerobic). If desired, by clicking the ‘Make Public’ button, the model can be shared with all the users that have access to the Web server. You can learn more about the Run Metabolic Model tool via the ‘Getting Started Guide’ link on the Web page listing the models available.

New search tools

As the number of organisms in BioCyc has grown, we have introduced new search tools to help users find organisms of interest. Clicking ‘change organism database’ in the upper right corner of the Web site brings up the organism-selector dialog, which enables a user to search for organisms by name, by organism taxonomy and by phenotypic and metadata properties. The name-based search finds any prefix of the genus, species or strain name. The taxonomy search uses a hierarchical browser of the NCBI taxonomy DB. For the phenotypic/metadata search, the user first selects a property (such as ‘biotic relationship’) and then selects the value of interest for that property (such as whether the desired organism is free living, parasitic or symbiotic). Additional available properties include whether the organism is a pathogen of humans, animals or plants; the human microbiome body site from which the organism was collected; and the depth or altitude at which the organism sample was collected. Metadata searches include number of GO terms annotated within the DB and number of regulatory interactions within the DB.
Multiple properties can be queried at once (combined using AND or OR) by clicking the ‘Add Constraint’ button. To facilitate comparative analyses, a new multi-organism search tool is available under Search → Cross Organism Search. The user can specify what set of organisms (DBs) to search in several alternative ways, such as by specifying taxonomic groups (e.g. ‘Archaea’ or ‘Coriobacteriia’), by specifying organisms by names, or by selecting organisms according to their phenotypic properties (e.g. selecting all symbionts). A user can also save lists of organisms for later use within SmartTables. A cross-organism search enables the user to search a designated set of organisms for search terms in specific object types. The user specifies the types of objects to search for (e.g. genes or metabolites), and one or more search terms (e.g. ‘trpA’ or ‘acetaldehyde’). The tool returns a table indicating what objects from what organisms matched the requested search. Redesigned gene/protein pages and metabolite pages We have redesigned BioCyc gene/protein pages to modernize their look and feel and to make it easier for scientists to find the information they seek. The new design provides a summary of commonly used information at the top, with additional information available via the tabs just below the table. For example, the ‘Protein Features’ tab depicts protein features such as metal-ion binding sites and enzyme active sites; the ‘Operons’ tab depicts the operon(s) containing the gene. The ‘Show All’ tab combines information from all tabs into one page, which is convenient when searching the Web page for terms of interest. A new menu, called the right-sidebar menu, is available along the right side of gene/protein pages and most other BioCyc pages. Its content varies depending on the page type currently displayed (e.g. different operations are available for gene pages versus metabolite pages). 
Operations available at gene pages include retrieving the nucleotide and amino acid sequences for the gene/protein, and retrieving arbitrary nucleotide sequences surrounding a gene, or for any region of the genome. Other gene-page operations include creating a multi-genome alignment (using the Pathway Tools comparative genome browser) and a multiple sequence alignment [computed using MUSCLE [15] and displayed using the Sol Genomics Network alignment viewer (https://sgn.cornell.edu/tools/align_viewer/index.pl)] for the current gene and specified orthologs. Metabolite pages have been redesigned along similar lines. They contain a table at the top that summarizes important information, along with tabs to select additional information such as the reactions in which a metabolite occurs. SmartTables SmartTables [16] enable scientists to define and store lists of objects from any BioCyc DB, such as lists of genes, proteins, metabolites, pathways or sequence regions (e.g. SNPs). Using SmartTables, a scientist can browse and explore a group of objects. They can transform a group of objects to a set of related objects (such as transforming a metabolite set to the set of pathways those metabolites are involved in). Users can also perform analyses such as statistical enrichment analysis (e.g. to understand what functional categories are shared by the differentially regulated genes from a transcriptomics experiment). Scientists can share SmartTables with specific colleagues or with the public, and can use them to supplement a publication by providing online gene or metabolite sets. Users must create a BioCyc account to create SmartTables. SmartTables can be created to contain results from different types of BioCyc query operations (look for the button ‘Turn into a SmartTable’). 
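As an illustration of what a statistical enrichment analysis computes, the sketch below evaluates a one-sided hypergeometric p-value for over-representation of one functional category in a gene list. The choice of test and all counts are assumptions made for this example; the text above does not state which statistic BioCyc applies.

```python
from math import comb


def enrichment_p_value(population, category_size, sample_size, hits):
    """Probability of seeing at least `hits` category members in a sample.

    population    -- total number of genes considered
    category_size -- genes in the population that belong to the category
    sample_size   -- genes in the list being tested (e.g. differentially regulated)
    hits          -- genes in the list that belong to the category
    """
    max_hits = min(category_size, sample_size)
    total = comb(population, sample_size)
    tail = sum(
        comb(category_size, k) * comb(population - category_size, sample_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 4000 genes, 40 in the category, 100 in the list, 8 overlap.
print(enrichment_p_value(4000, 40, 100, 8))
```

A small p-value suggests that the category is shared by more genes in the list than expected by chance; in practice a multiple-testing correction would be applied across categories.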
They can also be created from a file: we defined the public SmartTable at https://biocyc.org/group?id=biocyc14-1553-3655492599 by uploading a file listing the essential genes determined for C. difficile R20291 by Dembek et al. [17], and then adding the orthologous genes and gene products in Bacillus subtilis and E. coli as additional columns. We will use this SmartTable to illustrate some general capabilities of SmartTables by investigating the question of which metabolic pathways these essential genes are involved in. We begin by creating a separate SmartTable containing the orthologs from strain 630 of this essential gene set, by clicking the ‘+’ at the top of the column labeled ‘Gene/Locus-Ids in Strain 630’. Next, from the ‘Add Property Column’ menu directly above the SmartTable, select ‘Product’ to add a new column containing the gene products, and from the ‘Add Transform Column’ menu directly above the SmartTable, select ‘Pathways of Gene’ to add a column listing the metabolic pathway(s) (if any) in which these gene products participate; the result is shown in Figure 2. Figure 2 SmartTable showing essential genes and their metabolic pathways in C. difficile 630. To see these same data from a different perspective, click the ‘+’ above the ‘Pathways of Gene’ (third) column, which will create a new SmartTable listing each metabolic pathway, and the essential genes within that pathway. Another way to see the data from the SmartTable in Figure 2 is to run the operation ‘Paint Data → on Cellular Overview’ from the right-sidebar menu (be sure the first column in the SmartTable is selected, by clicking on it). This operation will display the set of genes within the SmartTable on a zoomable metabolic map diagram for strain C. difficile 630.
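The two SmartTable steps just described, adding a ‘Pathways of Gene’ transform column and then viewing the same data by pathway, amount to a mapping followed by its inversion. The Python sketch below shows that idea on placeholder data; the gene and pathway identifiers and the function names are hypothetical and do not correspond to a BioCyc API.

```python
from collections import defaultdict

# Hypothetical gene -> pathways annotations (identifiers are placeholders).
PATHWAYS_OF_GENE = {
    "geneA": ["pathway-1", "pathway-2"],
    "geneB": ["pathway-1"],
    "geneC": [],  # an essential gene with no assigned pathway
}


def add_transform_column(rows, mapping):
    """Pair each row with its transformed values, like a transform column."""
    return [(row, list(mapping.get(row, []))) for row in rows]


def pivot_on_column(table):
    """Invert the table: one row per pathway, listing the genes within it."""
    inverted = defaultdict(list)
    for gene, pathways in table:
        for pathway in pathways:
            inverted[pathway].append(gene)
    return dict(inverted)


gene_table = add_transform_column(["geneA", "geneB", "geneC"], PATHWAYS_OF_GENE)
print(gene_table)
print(pivot_on_column(gene_table))
```

Genes without any pathway, such as `geneC` here, remain visible in the gene-centred table but drop out of the pathway-centred view.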
Among the other operations provided for SmartTables are adding and deleting rows individually, using a filtering operation to remove rows that meet criteria such as containing a search string, and performing set operations such as union and intersection between two SmartTables. SmartTables also offer views of the nucleotide and amino acid sequences of genes and proteins, and of the chemical structures of metabolites. Pathway collages For many years, BioCyc has provided the ability for users to customize its images of metabolic pathways. The command ‘Customize or Overlay Omics Data on Pathway Diagram’ from the right-sidebar menu of any pathway page enables users to control which elements of the pathway diagram are visible (gene names, EC numbers, etc.), and to overlay gene expression, metabolomics or reaction-flux data on the pathway. Pathway collages are a new way of creating diagrams depicting interactions among several metabolic pathways, and were suggested by Prof. Tricia Kiley of the University of Wisconsin. Define a SmartTable containing the pathways to include in the pathway collage (such as by creating a new SmartTable and then adding the pathways by name). Then use the right-sidebar menu command ‘Export → Export Pathways to Pathway Collage’ to create the pathway collage within a Web browser. The commands available within the pathway collage builder include dragging pathways to new positions, creating connection lines between metabolites, changing the visual appearance of gene and metabolite names and adding omics data to the diagram. An example collage of E. coli pathways is shown in Figure 3. Collages can be exported to PNG files for use in publications. Figure 3 Escherichia coli gene expression data from an anaerobic to aerobic transition (Gene Expression Omnibus ID GDS2364) superimposed on a pathway collage. Each horizontal row of squares indicates the expression levels of one gene across the four oxygen levels depicted, where the leftmost square is anaerobic and the rightmost square is the highest oxygen concentration. Blue and purple indicate low expression, gray indicates intermediate expression and orange indicates high expression.
Pan-genome DBs We are introducing pan-genome PGDBs in BioCyc that integrate gene and pathway information from a large number of sequenced strains into one DB. Pan-genome PGDBs illuminate the set of gene families found across the species. Pan-genome PGDBs now exist for Listeria monocytogenes, Mycobacterium tuberculosis, C. difficile and Pseudomonas aeruginosa, and can be found by searching for the phrase ‘pan-genome’ within the organism selector. The following steps are taken to construct a pan-genome PGDB for species S: Create an empty PGDB for S. Choose a set of PGDBs for strains of S for which computed orthologs are available. Choose a so-called ‘lead PGDB’ from the preceding set of available strain-specific PGDBs. For example, we chose M. tuberculosis H37Rv as the lead PGDB for species M. tuberculosis because of its status as a highly studied strain. Import the lead PGDB’s replicons, genes, proteins, reactions and pathways into the pan-genome PGDB. Visit every other strain PGDB from the chosen set.
For each protein-coding gene in that PGDB, check whether it is an ortholog of any gene already residing in the pan-genome PGDB. If so, record the existence of the ortholog in the gene in the pan-genome PGDB. If no ortholog was found, then import the new gene from the other strain PGDB, along with its proteins and any reactions and pathways that are not yet in the pan-genome PGDB. Finally, add the nucleotide sequence of the newly added gene to an ‘artificial replicon’, which accumulates all these other genes (separated by spacers consisting of several N nucleotides). The new gene will thereafter also be checked for orthology in future comparison rounds with additional strain PGDBs. The end result will be that many genes, both on the replicons from the lead PGDB and on the ‘artificial replicon’, will have orthologs recorded, and some genes from the lead PGDB and the other strain PGDBs will be unique and have no orthologs at all. When viewing the Cellular Overview for a pan-genome PGDB, two special highlighting commands are made available. Highlighting the core genes shows all the reactions of the genes that are shared among all the strain PGDBs; in other words, each gene has orthologs to all the other strains. Highlighting the unique genes shows all the reactions of the genes that have no orthologs at all, and are thus contributed by one strain. RouteSearch, atom mappings and Gibbs free energies BioCyc reaction pages depict atom mappings for most reactions. The atom mapping of a reaction identifies for each reactant non-hydrogen atom its corresponding atom in a product compound. For a given reaction in a BioCyc PGDB, atom mapping data are obtained from the same reaction (reaction having the same reaction identifier) in MetaCyc. We computed MetaCyc atom mappings using an algorithm that minimizes the overall cost of bonds broken and made in the reaction, given assigned propensities for bond creation and breakage [18]. 
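The pan-genome construction procedure described above (import the lead PGDB, then check each gene of every other strain for orthologs, adding unmatched genes to an ‘artificial replicon’) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: genes are plain identifiers, orthology is a precomputed set of gene pairs, the spacer length is arbitrary, and the import of proteins, reactions and pathways is omitted. It is not the Pathway Tools implementation.

```python
def are_orthologs(gene_a, gene_b, ortholog_pairs):
    """Look up a precomputed, symmetric orthology relation."""
    return frozenset((gene_a, gene_b)) in ortholog_pairs


def build_pan_genome(lead, other_strains, ortholog_pairs):
    """Merge strains into one pan-genome gene table.

    lead           -- (strain name, list of gene ids) for the lead PGDB
    other_strains  -- list of (strain name, list of gene ids)
    ortholog_pairs -- set of frozenset({gene_a, gene_b})
    """
    lead_name, lead_genes = lead
    pan = {
        gene: {"strain": lead_name, "replicon": "lead", "orthologs": {}}
        for gene in lead_genes
    }
    for strain_name, genes in other_strains:
        for gene in genes:
            match = next(
                (g for g in pan if are_orthologs(gene, g, ortholog_pairs)), None
            )
            if match is not None:
                # Record the ortholog on the gene already in the pan-genome.
                pan[match]["orthologs"][strain_name] = gene
            else:
                # New gene: it goes on the artificial replicon and is itself
                # checked for orthology in later comparison rounds.
                pan[gene] = {"strain": strain_name, "replicon": "artificial",
                             "orthologs": {}}
    return pan


def core_genes(pan, all_strains):
    """Genes represented (directly or via orthologs) in every strain."""
    return [g for g, rec in pan.items()
            if set(rec["orthologs"]) | {rec["strain"]} == set(all_strains)]


def unique_genes(pan):
    """Genes with no recorded orthologs, i.e. contributed by one strain."""
    return [g for g, rec in pan.items() if not rec["orthologs"]]


def artificial_replicon(sequences, spacer_length=10):
    """Concatenate gene sequences separated by runs of N (length assumed)."""
    return ("N" * spacer_length).join(sequences)


pairs = {frozenset(p) for p in
         [("s1_a", "s2_a"), ("s1_a", "s3_a"), ("s2_c", "s3_c")]}
pan = build_pan_genome(("S1", ["s1_a", "s1_b"]),
                       [("S2", ["s2_a", "s2_c"]), ("S3", ["s3_a", "s3_c"])],
                       pairs)
print(core_genes(pan, ["S1", "S2", "S3"]))
print(unique_genes(pan))
```

In the toy data, `s1_a` has orthologs in both other strains and is therefore a core gene, `s1_b` is unique to the lead strain, and `s2_c` lands on the artificial replicon with an ortholog recorded from strain S3.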
Of the 14 051 reactions in MetaCyc, 12 356 (87.9%) have computed atom mappings. Our analysis [18] has found a low rate of errors (<3%) in our computed atom mappings. RouteSearch (see Metabolism → Metabolic Route Search) [19] is a software tool for finding routes in the metabolic reaction network of an organism. Given a starting compound, a target compound and other parameters, the tool finds the best (least cost) routes between these compounds by taking into account atom conservation (routes that conserve more atoms from the starting compound are considered better), reaction path length and the number of foreign reactions that must be added from MetaCyc, which is kept to a minimum. RouteSearch uses the precomputed atom mappings of the reactions involved in the routes to calculate the number of conserved atoms. Gibbs free energies are provided for a large number of reactions and compounds in BioCyc, based on data in the MetaCyc DB. We calculated the standard Gibbs free energy of reaction (ΔrG′°) for reactions and the standard Gibbs free energy of formation (ΔfG′°) for compounds in MetaCyc, at pH 7.3 and ionic strength 0.25. Computational access to BioCyc data A variety of REST-based Web services offer programmatic access to the BioCyc data via HTTP GET or POST requests [20]. A set of defined queries enables retrieval of data for a single object (such as a gene or a reaction) or collection of related objects (such as all the genes in a pathway) in XML format [21]. More complex Web service queries to BioCyc, with expressive power comparable to that of SQL, can be constructed using the BioVelo Query Language [22]. Web services also provide access to pathway data in BioPAX [23] format. Additional Web services enable mapping of identifiers from external DBs, and retrieval of metabolites by chemical formula, InChI key and/or monoisotopic molecular weight. A variety of visualization services and SmartTable manipulation operations provide access to advanced BioCyc capabilities, and are further described at [20].
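As a rough illustration of the REST-style access described above, the Python sketch below builds request URLs for a single-object query and for a query returning a collection of related objects. The host name, the `getxml` and `apixml` endpoint names, the `fn`, `id` and `detail` parameters, and the example identifiers are assumptions recalled from the Web services documentation [20] and should be checked against it before use. Access may also require a logged-in session, which is not handled here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://websvc.biocyc.org"  # assumed host; verify against [20]


def object_url(org_id, object_id, detail="full"):
    """URL for retrieving one object (e.g. a gene or reaction) as XML."""
    query = urlencode({"id": f"{org_id}:{object_id}", "detail": detail}, safe=":")
    return f"{BASE_URL}/getxml?{query}"


def api_url(function, org_id, object_id, detail="low"):
    """URL for a predefined query returning a collection of related objects."""
    query = urlencode(
        {"fn": function, "id": f"{org_id}:{object_id}", "detail": detail},
        safe=":",
    )
    return f"{BASE_URL}/apixml?{query}"


def fetch_xml(url, timeout=30):
    """Perform the HTTP GET and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building URLs needs no network access; fetch_xml() is not called here.
print(object_url("ECOLI", "TRP"))
print(api_url("genes-of-pathway", "ECOLI", "TRPSYN-PWY"))
```

Keeping URL construction separate from the network call makes the request format easy to inspect and test without contacting the server.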
BioCyc data are available for bulk download in several different file formats [24]. In addition to tab-delimited tables and our own internal attribute-value format [25], subsets of the data are made available in SBML [26], BioPAX [27], GO [28], GenBank [29] and FASTA [30] formats. Users who install the Pathway Tools software locally can access and update data directly via our application programming interfaces (APIs), available for Python, R, Java, Perl and Common Lisp. The PythonCyc [31], RCyc [32], JavaCyc [33] and PerlCyc [34] packages, which provide API access from their respective languages, must be downloaded and installed separately from the main Pathway Tools distribution. BioCyc subscription model Model-organism DBs such as EcoCyc, Saccharomyces Genome DB, FlyBase, Mouse Genome DB and Rat Genome DB see high usage rates. Thus, it is fairly clear that curated genome DBs are a critical part of the scientific information infrastructure for sequenced organisms that are studied by large scientific communities and that have important applications (e.g. M. tuberculosis, which is a significant pathogen, and B. subtilis, which sees widespread use in biotechnology). It has also become clear that DB curation is fairly inexpensive and can attain low error rates. For example, the cost of curation for the EcoCyc DB was $219 per curated article over a 5 year period, which is modest when compared with the costs of the projects that generated the research to be curated: for EcoCyc, we estimated the curation cost to be ∼0.088% of the cost of the research projects that generated the research and 6–15% of the cost of open-access publication fees for publishing the curated research [35]. The EcoCyc error rate was measured to be 1.40% [36].
Despite the fact that a number of bioinformatics groups have put forward the preceding arguments over a 15 year period, government funding agencies have not provided funds for additional needed DB curation projects, particularly for bacteria. Thus, in 2016, we decided to convert BioCyc to a subscription model to raise revenue for the curation of BioCyc DBs. Subscriptions to BioCyc are available to individuals and to institutions such as companies and university libraries. Subscription costs are similar to the costs of journal subscriptions, and depend on usage level. Subscription revenues are invested on a nonprofit basis in BioCyc curation, operation, sales and marketing. Access to EcoCyc and MetaCyc DBs remains free because these DBs are still supported by government grants. Conclusions We have outlined some of the recent improvements to BioCyc. Additional improvements to the Pathway Tools software are described in a recent article [2]. The data content and software tools within BioCyc will continue to evolve. The human microbiome and metabolomics data analysis are two major topics of our current grant period. How to learn more A number of online information sources are available for BioCyc, including online instructional videos [37], a how-to guide for the BioCyc Web site [38], a guide to the concepts and methods behind BioCyc [39] and a guide to the data content of BioCyc [40]. To receive monthly updates and explanations regarding new developments in BioCyc, please subscribe to the BioCyc mailing list by sending an e-mail to biocyc-users-request@ai.sri.com with the word ‘subscribe’ in the subject. Key Points BioCyc.org is a microbial genome Web portal that combines sequenced genomes, computationally inferred data and curated information from the scientific literature. BioCyc provides an extensive range of query tools, visualization services and analysis software.
BioCyc SmartTables is a unique tool that enables biologists to perform analyses that previously would have required a programmer’s assistance, such as performing programmatic transformations on sets of objects. Funding This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health (grant numbers R01GM080746, R01GM75742 and R01GM077678). The contents of this article are solely the responsibility of the authors and do not necessarily represent the official views of the National Institutes of Health. Peter Karp is the Director of the SRI Bioinformatics Research Group (BRG). Richard Billington is a Senior Software Engineer at SRI International; he obtained an MS degree from the University of Pennsylvania and a BS degree from the University of California at Santa Cruz. He has held research positions at the University of Michigan, University of Pennsylvania, Georgia Tech, and Sandia National Labs. Ron Caspi has a PhD in Molecular Microbiology. He is currently the sole curator of the MetaCyc database. Carol A. Fulcher is a Scientific Database Curator at SRI International. Mario Latendresse is a Computer Scientist at SRI International. He has worked on the flux balance analysis module, atom mappings, Gibbs free energies computation, BioVelo, Route Search and visualization tools. Anamika Kothari obtained her bachelor's degree from K.C. College Mumbai; she is a Scientific Database Curator at SRI International. Ingrid Keseler is a Scientific Database Curator at SRI International. Markus Krummenacker is a computer scientist. He has worked on the BioCyc genome browser, reaction and compound display pages, and electron transfer diagrams. Peter E. Midford is a Chemo/Bioinformatics Scientist at SRI International. Before joining SRI, he worked on phylogenetic databases and phenotype ontologies for several collaborative projects. Quang Ong received a BS in Industrial Management and is a Scientific Programmer at SRI International.
Wai Kit Ong is a Metabolic Modeler at BRG. He is responsible for building genome-scale metabolic network models of bacteria with a focus on those found in the human gut microbiome. Suzanne Paley is a Senior Software Developer and has been with the BioCyc project since its inception. She is responsible for Pathway Collages, and many other BioCyc query, analysis and visualization tools. Pallavi Subhraveti is the Release Manager at BRG. She is also responsible for building PGDBs in large scale.
References
1 Karp P, Latendresse DM, Paley SM, et al. Pathway Tools version 19.0: software for pathway/genome informatics and systems biology. Brief Bioinform 2016;17:877–90.
2 Karp PD, Latendresse M, Paley SM, et al. Pathway Tools version 19.0 update: software for pathway/genome informatics and systems biology. Brief Bioinform 2015. doi:10.1093/bib/bbv079.
3 Karp PD, Latendresse M, Caspi R. The pathway tools pathway prediction algorithm. Stand Genomic Sci 2011;5(3):424–9.
4 Caspi R, Billington R, Ferrer L, et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res 2016;44(D1):D471–80.
5 Romero P, Karp PD. Using functional and organizational information to improve genome-wide computational prediction of transcription units on Pathway/Genome Databases. Bioinformatics 2004;20:709–17.
6 Finn RD, Coggill P, Eberhardt RY, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 2016;44(D1):D279–85.
7 UniProt Consortium. Update on activities at the universal protein resource (UniProt) in 2013. Nucleic Acids Res 2013;41:D43–7.
8 Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids Res 2015;43:D1049–56.
9 Peabody MA, Laird M, Vlasschaert RC, et al. PSORTdb: expanding the bacteria and archaea protein subcellular localization database to better reflect diversity in cell envelope structures. Nucleic Acids Res 2016;44(D1):D663–8.
10 Cipriano MJ, Novichkov PN, Kazakov AE, et al. RegTransBase–a database of regulatory sequences and interactions based on literature: a resource for investigating transcriptional regulation in prokaryotes. BMC Genomics 2013;14:213.
11 Chen R, Mias GI, Li-Pook-Than J, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012;148(6):1293–307.
12 List of BioCyc Pathway/Genome Databases. https://biocyc.org/biocyc-pgdb-list.shtml.
13 MicroScope Home Page. https://www.genoscope.cns.fr/agc/microscope/home/index.php.
14 Latendresse M, Krummenacker M, Trupp M, et al. Construction and completion of flux balance models from pathway databases. Bioinformatics 2012;28:388–96.
15 Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004;32(5):1792–7.
16 Travers M, Paley SM, Shrager J, et al. Groups: knowledge spreadsheets for symbolic biocomputing. Database 2013.
17 Dembek M, Barquist L, Boinett CJ, et al. High-throughput analysis of gene essentiality and sporulation in Clostridium difficile. mBio 2015;6(2):e02383.
18 Latendresse M, Malerich J, Travers PM, et al. Accurate atom-mapping computation for biochemical reactions. J Chem Inf Model 2012;52:2970–82.
19 Latendresse M, Krummenacker M, Karp PD. Optimal metabolic route search based on atom mappings. Bioinformatics 2014;30:2043–50.
20 Pathway Tools Web Services. https://biocyc.org/web-services.shtml.
21 Guide to ptools-xml. https://biocyc.org/ptools-xml-guide.shtml.
22 The BioVelo Query Language. https://biocyc.org/bioveloLanguage.html.
23 Demir E, et al. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010;28(12):935–42.
24 BioCyc and Pathway Tools Download Information. https://biocyc.org/download.shtml.
25 Pathway Tools Data-File Formats. http://bioinformatics.ai.sri.com/ptools/flatfile-format.html.
26 SBML. http://www.sbml.org/.
27 BioPAX. http://www.biopax.org/.
28 GO Annotation File Formats. http://geneontology.org/page/go-annotation-file-formats.
29 Genbank Format. http://www.ncbi.nlm.nih.gov/collab/FT/#7.1.2.
30 FASTA Format. http://www.ncbi.nlm.nih.gov/blast/fasta.shtml.
31 PythonCyc API to Pathway Tools. http://bioinformatics.ai.sri.com/ptools/pythoncyc.html.
32 RCyc API to Pathway Tools. https://github.com/taltman/RCyc/blob/master/DESCRIPTION.
33 JavaCyc API to Pathway Tools. http://solgenomics.net/downloads/index.pl.
34 PerlCyc API to Pathway Tools. http://solgenomics.net/downloads/perlcyc.pl.
35 Karp PD. How much does curation cost? Database 2016.
36 Keseler IM, Skrzypek M, Weerasinghe D, et al. Curation accuracy of model organism databases. Database 2014:1–6.
37 BioCyc Webinars. https://biocyc.org/webinar.shtml.
38 How to Use a Pathway Tools Website. https://biocyc.org/PToolsWebsiteHowto.shtml.
39 Pathway/Genome Database Concepts Guide. https://biocyc.org/PGDBConceptsGuide.shtml.
40 BioCyc Database Guide. https://biocyc.org/BioCycUserGuide.shtml.
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
journal article
LitStream Collection
PATRIC as a unique resource for studying antimicrobial resistance

Antonopoulos, Dionysios A; Assaf, Rida; Aziz, Ramy Karam; Brettin, Thomas; Bun, Christopher; Conrad, Neal; Davis, James J; Dietrich, Emily M; Disz, Terry; Gerdes, Svetlana; Kenyon, Ronald W; Machi, Dustin; Mao, Chunhong; Murphy-Olson, Daniel E; Nordberg, Eric K; Olsen, Gary J; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D; Santerre, John; Shukla, Maulik; Stevens, Rick L; VanOeffelen, Margo; Vonstein, Veronika; Warren, Andrew S; Wattam, Alice R; Xia, Fangfang; Yoo, Hyunseung

2019 Briefings in Bioinformatics

doi: 10.1093/bib/bbx083 pmid: 28968762

Abstract The Pathosystems Resource Integration Center (PATRIC, www.patricbrc.org) is designed to provide researchers with the tools and services that they need to perform genomic and other ‘omic’ data analyses. In response to mounting concern over antimicrobial resistance (AMR), the PATRIC team has been developing new tools that help researchers understand AMR and its genetic determinants. To support comparative analyses, we have added AMR phenotype data to over 15 000 genomes in the PATRIC database, often assembling genomes from reads in public archives and collecting their associated AMR panel data from the literature to augment the collection. We have also been using this collection of AMR metadata to build machine learning-based classifiers that can predict the AMR phenotypes and the genomic regions associated with resistance for genomes being submitted to the annotation service. Likewise, we have undertaken a large AMR protein annotation effort by manually curating data from the literature and public repositories. This collection of 7370 AMR reference proteins, which contains many protein annotations (functional roles) that are unique to PATRIC and RAST, has been manually curated so that it projects stably across genomes. The collection currently projects to 1 610 744 proteins in the PATRIC database. Finally, the PATRIC Web site has been expanded to enable AMR-based custom page views so that researchers can easily explore AMR data and design experiments based on whole genomes or individual genes. antimicrobial resistance (AMR), antibiotic, genome annotation, minimum inhibitory concentration, RAST, the SEED Background The Pathosystems Resource Integration Center (PATRIC) is one of four bioinformatics resource centers (BRCs) funded by the National Institute of Allergy and Infectious Diseases (NIAID) [1]. 
The BRC program supports research by providing access to data associated with the NIAID Category A–C pathogenic genera [2], with PATRIC serving as the bacterial database. To provide a rich comparative analysis environment, PATRIC provides access to all publicly available genomes and associated metadata for bacterial and archaeal isolates, which includes >104 000 genomes as of June 2017. All of the genomes in PATRIC have been consistently annotated using the Rapid Annotation using Subsystems Technology toolkit (RASTtk) [3, 4]. This annotation consistency and subsequent protein family generation [5] serve as the backbone for many of the comparative analysis tools on the Web site [1]. The PATRIC database retains the annotations and identifiers from both GenBank [6, 7] and RefSeq [8] to facilitate side-by-side comparisons across the public data, allowing researchers to quickly find genomes and genes with information that they have gathered from different resources. PATRIC also provides researchers with a private workspace, where they can access bioinformatics services including genome assembly, annotation, RNA sequencing, variation calling, Tn-Seq, similar genome finder, proteome comparison and metabolic model reconstruction. When a user annotates a private genome with the PATRIC annotation service, they can compare their genome with the public collection. This ‘virtual integration’ provides a unique analysis experience that is not available at a similar scale at any other data repository. Facilitating research on antimicrobial resistance (AMR) has become increasingly important with the recent escalation in resistance and the loss of effectiveness to first-line drugs [9–13]. This resistance has a human cost, with ∼2 million people being sickened and 23 000 dying annually in the United States alone [14]. Here, we describe a set of enhancements introduced to support research on AMR. 
AMR strategy The current strategy for integrating AMR data into PATRIC breaks down roughly into two parts: (1) data collection to support analyses of whole genomes and (2) data collection to support analyses of individual proteins (Figure 1). In both cases, the data are drawn from the literature as well as a number of public resources. Specifics on the data integration, curation and tools are described below. Figure 1 PATRIC annotation process for integrating AMR data in both genomic regions and genes. AMR: integrating data at the genome level Data collection To support an environment for comparative analysis, we integrate metadata associated with the public genomes at GenBank [7] into the PATRIC database. This makes it easy to build sets of genomes that are based on collection date, geographic location, host, isolation source, etc. These metadata fields are incorporated both from BioSample [15] and directly from the GenBank file when an assembled genome is added to PATRIC. In some cases, metadata are acquired first hand from the NIAID-funded genome sequencing centers and from collaborators wishing to make their data public. Given the increasing emphasis on research to combat AMR and the decreasing costs of sequencing, we have been able to collect a large number of genomes with AMR panel data in the form of minimum inhibitory concentrations (MICs) or susceptible, intermediate and resistant (SIR) calls [16]. These panel data provide critical context for AMR research by allowing researchers to quickly build data sets for performing protein and gene comparisons, novel gene discovery, whole-genome variation analyses and machine learning (ML) experiments (described below).
To increase the number of genomes with AMR metadata in PATRIC and expand our ability to support AMR-based comparative analyses, we began searching the literature for studies that included sequenced bacterial genomes and AMR panel data. Often, panel data from these studies were not recorded in the public archives, so PATRIC becomes the only resource where both the assembled genomes and their AMR metadata are available together. If a genome was assembled and deposited in GenBank [7], we attach the AMR metadata directly to the corresponding genome in PATRIC. If the reads for a genome were deposited in the Sequence Read Archive (SRA) or the European Nucleotide Archive (ENA) [17, 18], we assemble and annotate the genome using PATRIC services [1, 4, 19]. We then incorporate the genome into the database along with the metadata (Supplementary Document S1). As laboratory methods for determining MIC values vary, incorporating these data into PATRIC requires a significant manual curation effort. When information is available from the study, we record how the MIC data were generated, including the laboratory method, the units of the measurement and the platform that was used to make the measurements. When an assertion about a phenotype is provided in the form of a SIR call, we record the laboratory standard from the European Committee on Antimicrobial Susceptibility Testing (EUCAST) [20] or the Clinical and Laboratory Standards Institute [21] and the year of the standard. To date, we have attached AMR metadata to ∼9165 PATRIC genomes and have assembled and annotated ∼6122 genomes from SRA and ENA (Supplementary Table S1). Currently, all AMR metadata in PATRIC are phenotypes derived from laboratory analyses. Studies often assert the susceptibility or resistance of an organism based on the presence or absence of key AMR genes; we do not currently incorporate assertions that are based only on genotypic data.
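The curation fields listed above (MIC value, units, laboratory method, platform, SIR call, testing standard and its year) suggest a simple record structure. The Python sketch below is one hypothetical way to represent and check such a record; the class, field names and validation rules are assumptions for illustration and do not reflect the actual PATRIC schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AMRPhenotypeRecord:
    genome_id: str
    antibiotic: str
    evidence: str                         # "laboratory" or "genotype-only"
    mic_value: Optional[float] = None
    mic_units: Optional[str] = None
    lab_method: Optional[str] = None
    platform: Optional[str] = None
    sir_call: Optional[str] = None        # "S", "I" or "R"
    standard: Optional[str] = None        # testing standard, if reported
    standard_year: Optional[int] = None


def validation_problems(record):
    """Return the reasons a record would not be incorporated (empty if none)."""
    problems = []
    if record.evidence != "laboratory":
        problems.append("only laboratory-derived phenotypes are incorporated")
    if record.mic_value is None and record.sir_call is None:
        problems.append("record needs a MIC value or a SIR call")
    if record.sir_call is not None and record.sir_call not in {"S", "I", "R"}:
        problems.append("SIR call must be S, I or R")
    return problems


# Placeholder identifiers and values, for illustration only.
lab_record = AMRPhenotypeRecord(
    genome_id="genome-0001", antibiotic="ciprofloxacin", evidence="laboratory",
    mic_value=4.0, mic_units="mg/L", sir_call="R",
    standard="EUCAST", standard_year=2015,
)
inferred_record = AMRPhenotypeRecord(
    genome_id="genome-0002", antibiotic="ciprofloxacin",
    evidence="genotype-only", sir_call="R",
)
print(validation_problems(lab_record))
print(validation_problems(inferred_record))
```

Optional fields mirror the statement that method details are recorded only when the study reports them.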
The complete collection of AMR data in PATRIC can be downloaded from the PATRIC FTP site: ftp.patricbrc.org/patric2/current_release/RELEASE_NOTES/PATRIC_genomes_AMR.txt.

ML classifiers

As the PATRIC database was rapidly accumulating AMR panel data associated with sequenced genomes, a small number of studies were being published that explored using ML algorithms to study AMR [22–24]. With a sufficient number of genomes and AMR panel data, ML algorithms can be used to predict AMR phenotypes and the genomic regions associated with AMR with no a priori knowledge of the underlying mechanisms. This is an appealing area of exploration for PATRIC because it allows us to leverage our growing metadata collection to predict AMR phenotypes within the annotation service and to identify AMR-associated genomic regions with single-nucleotide polymorphism (SNP)-level resolution, a feature that can be used to inform our ongoing manual protein annotation efforts.

In early 2016, we published a study describing the collection of AMR metadata for genomes and an ML approach that used the AdaBoost algorithm [25, 26] to build classifiers for predicting AMR [16]. At the time, we had sufficient data to make predictions in the species Acinetobacter baumannii, Mycobacterium tuberculosis, Staphylococcus aureus and Streptococcus pneumoniae for nine antibiotics [16] (Table 1). Shortly thereafter, we collaborated with scientists at the Houston Methodist Research Hospital to build classifiers for Klebsiella pneumoniae covering 13 antibiotics using 1777 genomes collected in their hospital system between 2011 and 2015 [27]. Using the same protocol as described in the Davis et al. [16] and Long et al. [27] studies, we added 18 classifiers to the annotation system that have not been previously reported, including classifiers for M. tuberculosis, Peptoclostridium difficile, Pseudomonas aeruginosa, S. aureus and S. pneumoniae (Table 1).
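To make the idea of boosting over k-mer features concrete, the following is a toy sketch of discrete AdaBoost in which each weak learner tests for the presence of a single k-mer. It is not the protocol of the cited studies [16, 27]: the k-mer length of 4, the presence/absence stumps, the number of rounds, the function names and the four training sequences are all assumptions chosen so that the example stays small and runnable.

```python
import math
from typing import List, Set, Tuple

Stump = Tuple[float, str, int]  # (alpha, k-mer, polarity)


def kmers(seq: str, k: int = 4) -> Set[str]:
    """Return the set of overlapping k-mers in a nucleotide sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def stump_vote(kmer: str, polarity: int, sample: Set[str]) -> int:
    """Vote `polarity` when the k-mer is present and `-polarity` when absent."""
    return polarity if kmer in sample else -polarity


def train_adaboost(samples: List[Set[str]], labels: List[int],
                   rounds: int = 3) -> List[Stump]:
    """Discrete AdaBoost; labels are +1 (resistant) or -1 (susceptible)."""
    n = len(samples)
    weights = [1.0 / n] * n
    vocabulary = sorted(set().union(*samples))
    model: List[Stump] = []
    for _ in range(rounds):
        # Choose the stump with the lowest weighted training error.
        best = None
        for kmer in vocabulary:
            for polarity in (1, -1):
                err = sum(w for w, s, y in zip(weights, samples, labels)
                          if stump_vote(kmer, polarity, s) != y)
                if best is None or err < best[0]:
                    best = (err, kmer, polarity)
        err, kmer, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, kmer, polarity))
        # Up-weight misclassified genomes, down-weight correct ones, renormalize.
        weights = [w * math.exp(-alpha * y * stump_vote(kmer, polarity, s))
                   for w, s, y in zip(weights, samples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model: List[Stump], sample: Set[str]) -> int:
    """Weighted vote of all stumps; +1 means predicted resistant."""
    score = sum(a * stump_vote(km, pol, sample) for a, km, pol in model)
    return 1 if score >= 0 else -1


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for the resistant (+1) class: 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up sequences: the "resistant" ones share the 4-mer CCCG.
train_seqs = ["AAACCCGGG", "TTTCCCGGA", "AAACCTGGG", "TTTCCTGGA"]
train_labels = [1, 1, -1, -1]
model = train_adaboost([kmers(s) for s in train_seqs], train_labels)

held_out = [("GGGCCCGTT", 1), ("GGGCCTGTT", -1)]
predictions = [predict(model, kmers(seq)) for seq, _ in held_out]
score = f1_score([label for _, label in held_out], predictions)
```

Because each stump names the k-mer it tests, the highest-weighted stumps point directly at candidate sequence regions, which is what makes this family of models attractive for locating AMR-associated regions as well as predicting phenotypes.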
Receiver operating characteristic (ROC) curves for the newly added classifiers are shown in Figure 2.

Table 1 AMR classifiers in the PATRIC annotation system

| Species | Antibiotic (a) | Resistant genomes (b) | Susceptible genomes (b) | F1 score | Initially described in |
| --- | --- | --- | --- | --- | --- |
| Acinetobacter baumannii | Carbapenem | 122 | 110 | 0.95 | [16] |
| Klebsiella pneumoniae | Amikacin | 1190 | 364 | 0.92 | [27] |
| Klebsiella pneumoniae | Aztreonam | 1377 | 100 | 0.75 | [27] |
| Klebsiella pneumoniae | Cefoxitin | 555 | 976 | 0.80 | [27] |
| Klebsiella pneumoniae | Ciprofloxacin | 119 | 1435 | 0.91 | [27] |
| Klebsiella pneumoniae | Ertapenem | 265 | 178 | 0.96 | [27] |
| Klebsiella pneumoniae | Gentamicin | 786 | 768 | 0.86 | [27] |
| Klebsiella pneumoniae | Imipenem | 1100 | 453 | 0.94 | [27] |
| Klebsiella pneumoniae | Levofloxacin | 246 | 1307 | 0.93 | [27] |
| Klebsiella pneumoniae | Meropenem | 1123 | 430 | 0.92 | [27] |
| Klebsiella pneumoniae | Piperacillin–tazobactam | 322 | 1230 | 0.76 | [27] |
| Klebsiella pneumoniae | Tetracycline | 658 | 896 | 0.79 | [27] |
| Klebsiella pneumoniae | Tobramycin | 501 | 1053 | 0.94 | [27] |
| Klebsiella pneumoniae | Co-trimoxazole | 331 | 1223 | 0.87 | [27] |
| Mycobacterium tuberculosis | Amikacin | 210 | 350 | 0.91 | This study |
| Mycobacterium tuberculosis | Capreomycin | 204 | 350 | 0.83 | This study |
| Mycobacterium tuberculosis | Isoniazid | 250 | 250 | 0.88 | [16] |
| Mycobacterium tuberculosis | Kanamycin | 188 | 250 | 0.87 | [16] |
| Mycobacterium tuberculosis | Ofloxacin | 239 | 250 | 0.79 | [16] |
| Mycobacterium tuberculosis | Rifampicin | 250 | 250 | 0.86 | [16] |
| Mycobacterium tuberculosis | Streptomycin | 250 | 250 | 0.71 | [16] |
| Peptoclostridium difficile | Azithromycin | 213 | 246 | 0.97 | This study |
| Peptoclostridium difficile | Ceftriaxone | 228 | 86 | 0.86 | This study |
| Peptoclostridium difficile | Clarithromycin | 213 | 246 | 0.99 | This study |
| Peptoclostridium difficile | Clindamycin | 310 | 89 | 0.74 | This study |
| Peptoclostridium difficile | Moxifloxacin | 188 | 271 | 0.97 | This study |
| Pseudomonas aeruginosa | Levofloxacin | 192 | 290 | 0.85 | This study |
| Staphylococcus aureus | Ciprofloxacin | 467 | 762 | 0.98 | This study |
| Staphylococcus aureus | Clindamycin | 350 | 274 | 0.97 | This study |
| Staphylococcus aureus | Erythromycin | 484 | 821 | 0.96 | This study |
| Staphylococcus aureus | Gentamicin | 162 | 1144 | 0.98 | This study |
| Staphylococcus aureus | Methicillin | 707 | 886 | 0.99 | [16] |
| Staphylococcus aureus | Penicillin | 886 | 156 | 0.96 | This study |
| Staphylococcus aureus | Tetracycline | 203 | 1029 | 0.97 | This study |
| Staphylococcus aureus | Co-trimoxazole | 142 | 178 | 0.96 | This study |
| Streptococcus pneumoniae | Beta-lactam | 2124 | 584 | 0.90 | [16] |
| Streptococcus pneumoniae | Chloramphenicol | 165 | 289 | 0.94 | This study |
| Streptococcus pneumoniae | Co-trimoxazole | 2124 | 584 | 0.88 | [16] |
| Streptococcus pneumoniae | Erythromycin | 381 | 324 | 0.96 | This study |
| Streptococcus pneumoniae | Tetracycline | 368 | 290 | 0.96 | This study |

(a) AMR data in PATRIC may be described as individual antibiotics or classes of antibiotics.
(b) Used for building the classifiers.

Figure 2 ROC curves for AdaBoost-based AMR classifiers installed in the annotation service since the publication of the Davis et al. [16] and Long et al. papers [27]. Accuracy and F1 scores are displayed in each inset. ROC curves depict classifiers for (A) P. difficile, (B) S. aureus and (C) K. pneumoniae (Kpn), M. tuberculosis (Mtb), P. aeruginosa (Pae) and S. pneumoniae (Spn). Antibiotic abbreviations are: AZM, azithromycin; CC, clindamycin; CIP, ciprofloxacin; CLR, clarithromycin; CRO, ceftriaxone; E, erythromycin; GM, gentamicin; MFX, moxifloxacin; OX, ofloxacin; P, penicillin; SXT, trimethoprim sulfamethoxazole; TE, tetracycline.

To date, we have maintained a policy of adding classifiers to the annotation system when their accuracies and F1 scores exceed 70% and their top feature k-mers relate to known AMR genes. The classifiers built in this project and described in Table 1 and Figure 2 are integrated into the annotation service and can be accessed through PATRIC and RAST. Phenotype predictions and the associated genomic regions are available for browsing on both Web sites and are described in tutorials at http://tutorial.theseed.org/.

Our AMR metadata collection and classifier building efforts are ongoing at PATRIC. In many cases, the AMR metadata available in published studies report pan-resistant strains, which can be difficult to classify. In an effort to improve the accuracy of the classifiers, we are actively seeking strains with AMR metadata that improve the biological diversity of the collection. This includes collecting strains susceptible to many antibiotics. We are also comparing the results from several ML methods and are in the process of adding classifiers based on these other methods when they outperform AdaBoost [25]. In this manner, an antibiotic and species would be paired with the best ML algorithm in the annotation system.

AMR—integrating data at the gene level

Data collection

Starting in 2015, the PATRIC annotation team, which also maintains the SEED [28] and RAST projects [3], began a focused effort to incorporate and manually curate protein functions relating to AMR.
There are several well-known consortia that strive to provide standardized nomenclature for specific groups of antibiotic resistance genes, including tetracycline resistance determinants [29, 30] and different classes of β-lactamases maintained by the Lahey Clinic [31], the University of Stuttgart [32, 33] and the Institut Pasteur [34]. There are also several well-respected databases that provide collections of AMR genes covering broad categories of AMR mechanisms, including the Comprehensive Antibiotic Resistance Database (CARD) [35], the Bacterial Antimicrobial Resistance Reference Gene Database [36] hosted by the National Center for Biotechnology Information as part of the National Database of Antibiotic Resistant Organisms (NDARO) and ResFinder [37]. These resources maintain reference sequences for each AMR gene type, providing each with well-curated informative product names (in the case of NDARO) or a specialized Antibiotic Resistance Ontology (ARO, provided by CARD). These collections enable accurate detection and annotation of specific AMR determinants in pathogen isolates by supporting BLAST-based [38, 39] or hidden Markov model (HMM)-based [40] screening of user-submitted sequences against representative sets of AMR sequences.

However, in many cases, these AMR annotations project ambiguously because newly discovered proteins can match representative proteins with differing annotations at nearly equal BLAST similarities. For example, a novel CTX-M, SHV or TEM β-lactamase could potentially present the researcher with over a hundred nearly equal BLAST hits against highly homologous but clinically different reference sequence variants, making the choice of the most appropriate product name difficult. In many cases, the best choice would be a novel allele designation, rather than one of the existing curated product names.
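The ambiguity described above can be illustrated with a small check that refuses to transfer a product name when near-tied hits disagree. This is a hedged sketch only: the `Hit` type, the function `pick_product_name`, the 0.5 percentage-point tolerance and the hit values are hypothetical and do not correspond to any rule stated by PATRIC or the reference databases.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    """One similarity-search hit against a reference AMR sequence."""
    product_name: str
    percent_identity: float


def pick_product_name(hits: List[Hit], tolerance: float = 0.5) -> Optional[str]:
    """Return the best hit's product name, or None when the call is ambiguous.

    A call is treated as ambiguous when hits within `tolerance` percentage
    points of the best identity carry more than one product name.
    """
    if not hits:
        return None
    best = max(hits, key=lambda h: h.percent_identity)
    near_best_names = {
        h.product_name
        for h in hits
        if best.percent_identity - h.percent_identity <= tolerance
    }
    return best.product_name if len(near_best_names) == 1 else None


# Made-up hits: three reference variants at nearly equal identity.
ambiguous_hits = [
    Hit("beta-lactamase variant A", 99.3),
    Hit("beta-lactamase variant B", 99.0),
    Hit("beta-lactamase variant C", 98.9),
]
clear_hits = [
    Hit("beta-lactamase variant A", 99.3),
    Hit("beta-lactamase variant B", 92.1),
]

ambiguous_call = pick_product_name(ambiguous_hits)  # None: names disagree
clear_call = pick_product_name(clear_hits)          # "beta-lactamase variant A"
```

Returning no name for the ambiguous case mirrors the point made in the text: a near-tie among clinically different variants is a signal that a broader family-level role or a novel allele designation may be more appropriate than any single existing name.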
We believed that a manual curation effort was necessary to integrate AMR sequence variants into distinct functional roles (isofunctional protein families, which are integral for the SEED/PATRIC environment) to ensure that they can be unambiguously projected to the genomes in PATRIC by the annotation service. As many resources focus more heavily on the horizontally transferred AMR genes, we began our curation effort by building functional roles for AMR-related porin and efflux pump proteins described in the literature that are often chromosomally encoded, reasoning that this would rapidly add new value to the scientific community. This naturally led to an effort to incorporate annotations for proteins involved in tetracycline resistance. Efflux pump proteins are known to play an important role in this type of resistance [41], and the community has curated well-described rules for naming them for decades [30, 42]. More recently, we have been annotating class by class using publicly available resources when possible.

Curation process and k-mer projection

Significant manual curation and modification of the existing RAST/RASTtk automatic annotation pipeline were required to accommodate AMR-related functional roles, as their biology differs significantly from ‘classic’ functional roles encoding prokaryotic enzymatic and nonenzymatic housekeeping functions. The process of creating projectable AMR annotations starts with the incorporation of reference proteins from the literature and public resources. BLAST searches are used to compare reference sequences against the SEED database and PATRIC [1]. The matching proteins are used to build alignments and trees, which are manually inspected to understand how specific or general an annotation is, and whether it will project cleanly in the annotation system.
When reference proteins from the literature create ambiguous BLAST matches or split high-similarity clades in the tree, the nomenclature is retained, but then combined into a single annotation that covers the entire clade. The training sets of representative AMR sequence variants from outside sources and the SEED database [28] are then built. They form the basis for each AMR-related functional role. An annotation string for each of the functional roles is assigned, taking into account the SEED database internal nomenclature conventions as well as those developed by the AMR research community and accepted by CARD, ResFinder, NCBI and other resources. Signature k-mers (amino acid 8-mers) are built from these functional roles as described previously [4], and the annotations are then projected to all of the genomes in PATRIC. Trees for the newly annotated AMR proteins are then manually inspected to identify clades that contain multiple annotations, indicating a lack of consistency. Inconsistencies are also identified by comparing the generation of protein families before and after the addition of a new function. The inconsistent proteins are manually re-annotated and this process is iterated until the annotations project stably and accurately across the entire database.

The PATRIC manual curation effort offers a variety of additional benefits to the field of AMR research. For example, this effort is helping to alleviate the well-documented problem of misannotation and overprediction of AMR annotations [43, 44]. We are doing this by systematically removing erroneous annotations that assign antibiotic resistance functions to non-AMR-related proteins, by annotating and attaching literature references to these closely related proteins to prevent over-projection of AMR roles, and then by curating their projection over the PATRIC collection as described above.
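The signature k-mer step mentioned in this section can be pictured with the following simplified sketch. The actual procedure is the one described in [4]; here it is assumed, for illustration only, that a signature is an amino acid 8-mer found in the training proteins of exactly one functional role, and that a query protein receives a role by simple majority vote of its signature hits. The function names, the `min_hits` threshold, the role labels and the two short sequences are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Set

K = 8  # amino acid 8-mers, as stated in the text


def protein_kmers(protein: str, k: int = K) -> Set[str]:
    """Return the set of overlapping k-mers in a protein sequence."""
    return {protein[i:i + k] for i in range(len(protein) - k + 1)}


def build_signature_kmers(training: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each k-mer seen in exactly one role's training set to that role."""
    roles_by_kmer: Dict[str, Set[str]] = defaultdict(set)
    for role, proteins in training.items():
        for protein in proteins:
            for kmer in protein_kmers(protein):
                roles_by_kmer[kmer].add(role)
    return {kmer: next(iter(roles))
            for kmer, roles in roles_by_kmer.items() if len(roles) == 1}


def project_role(protein: str, signatures: Dict[str, str],
                 min_hits: int = 2) -> Optional[str]:
    """Assign the role with the most signature hits, or None if unclear."""
    votes = Counter(signatures[kmer]
                    for kmer in protein_kmers(protein) if kmer in signatures)
    if not votes:
        return None
    ranked = votes.most_common(2)
    top_role, top_hits = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_hits
    if top_hits < min_hits or tied:
        return None  # too little evidence, or two roles fit equally well
    return top_role


# Made-up training proteins: two roles that differ at two central residues.
ROLE_A = "hypothetical role A"
ROLE_B = "hypothetical role B"
training_sets = {
    ROLE_A: ["MKLSTAVLGIFWQRNDEHPY"],
    ROLE_B: ["MKLSTAVLCCFWQRNDEHPY"],
}
signatures = build_signature_kmers(training_sets)

query = "AAMKLSTAVLGIFWQRAA"  # shares the central region of the role A protein
assigned = project_role(query, signatures)
```

K-mers shared by both roles (for example the common N-terminal 8-mer) carry no vote, which is the property that lets closely related AMR and non-AMR families be told apart. A clade that still receives mixed assignments is the kind of inconsistency the manual tree inspection is meant to catch.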
We occasionally discover clades of potential AMR proteins that are surrounded by solid AMR reference sequences, yet have not been described in any reference database. In these cases, we describe the protein as a ‘putative’ AMR protein of a given resistance type if the sequence identity levels are 50% or better over the entire length of the protein, which enables functional projection. These are obvious targets for characterization in the laboratory. However, if a newly discovered hypothetical clade has a sequence identity that is <50%, we use a less specific annotation string for all its members, for example, ‘weak similarity to aminoglycoside N(6')-acetyltransferase’ and ‘weak similarity to aminoglycoside N(3)-acetyltransferase’. Finally, having clean sets of AMR-related functional roles facilitates SNP and other comparative analyses at PATRIC and elsewhere by providing relevant sequence peer groups for variation research.

As of May 2017, the annotation of AMR determinants conferring resistance to tetracycline, β-lactam, aminoglycoside [45, 46], chloramphenicol [47] and MLSKO (macrolides, lincosamides, streptogramins, ketolides and oxazolidinones) [42, 48, 49] antibiotic classes has been completed. These include 450 functional roles for these five major antibiotic classes, as well as 36 roles for closely related non-AMR proteins. This collection comprises a combined set of 7370 reference and SEED proteins with AMR roles and 36 424 proteins with related non-AMR roles. The collection projects consistently to 1 610 744 proteins with AMR roles and 2 518 252 proteins with related non-AMR roles in PATRIC. We have also associated literature references with the majority of the newly curated AMR functional roles in PATRIC, totaling 411 references.
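The naming rule for uncharacterized clades given earlier in this section (a 50% identity threshold over the full protein length) amounts to a simple decision, sketched below. The function name, its parameters and the exact format of the ‘putative’ string are assumptions; only the ‘weak similarity to ...’ wording is quoted in the text. How a clade with high identity over only part of the protein would be named is not stated, so treating it as the weaker case is also an assumption.

```python
def annotation_for_uncharacterized_clade(reference_role: str,
                                         percent_identity: float,
                                         covers_full_length: bool) -> str:
    """Choose an annotation string for a clade absent from reference databases.

    reference_role     -- role of the neighbouring, well-characterized references
    percent_identity   -- identity of the clade to those references
    covers_full_length -- True when that identity holds over the whole protein
    """
    if percent_identity >= 50.0 and covers_full_length:
        # Assumed string format; the text says only that the protein is
        # described as a 'putative' AMR protein of the given resistance type.
        return f"putative {reference_role}"
    # Wording pattern quoted in the text for clades below the threshold.
    # Assumption: partial-length matches are also given this weaker label.
    return f"weak similarity to {reference_role}"


strong = annotation_for_uncharacterized_clade(
    "aminoglycoside N(3)-acetyltransferase", 63.0, True)
weak = annotation_for_uncharacterized_clade(
    "aminoglycoside N(6')-acetyltransferase", 41.5, True)
```

The identity values in the two calls are invented for the example.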
The curation effort is ongoing and is focusing on proteins conveying resistance to quinolone, vancomycin, fosfomycin, rifampin/rifamycin, nitroimidazole, bleomycin and other antibiotic classes.

Visualization of AMR data at PATRIC

Several new interfaces have been developed on the PATRIC Web site to allow researchers to fully explore the AMR data available in the resource. These interfaces include information that is summarized across all genomes for the available antibiotics, at the taxon level, and for individual genomes and genes. Details on each of these interfaces are described below.

Antibiotic view

Data from PubChem [50] are now integrated for nearly 100 specific antibiotics that can be viewed on landing pages designed especially to display this information. Each individual antibiotic has a landing page with several tabs that provide a general overview, specific information on the AMR phenotype, the genes associated with that phenotype and the regions within the individual genes or genomes that are linked to resistance or susceptibility to that specific drug (Figure 3).

Figure 3 Summary information for the antibiotic methicillin at PATRIC. The antibiotic interface provides a summary of the antibiotic, its synonyms and actions, and also provides links via separate tabs for AMR phenotypes, genes and regions across all the data available in PATRIC.

The overview tab includes a general description of the drug, the chemical structure, the mechanism of action, a description of the pharmacological activity and class and known synonyms.
The AMR phenotype tab provides a list of all the genomes that have been identified as being susceptible or resistant to that antimicrobial. This tab also includes the laboratory typing method and platform, and the testing standard if that information is available. A third tab, called AMR genes, displays information on the genes associated with resistance. The final tab, AMR regions, includes the location of the specific k-mers that are associated with the genome’s phenotype.

Taxon-level view

PATRIC organizes relevant data for all the available sequenced bacterial and archaeal genomes according to NCBI taxonomy [51]. Data are summarized at each level, from the highest (the Superkingdoms: Bacteria and Archaea) to the strain (or isolate) from which the genome has been sequenced. For each taxonomic level with associated AMR data, PATRIC provides several summaries. A bar graph summarizing the antibiotics, the AMR phenotype (resistant, intermediate or susceptible) and the number of genomes that match that phenotype is available on the overview tab at the top of the main landing page for each taxon (Figure 4A). Clicking on any of the antibiotics displayed in the graph will open a new page that summarizes all the genomes from that taxon level that have the particular AMR phenotype. An alternate tabular view of the data is also available (Figure 4B). The taxon-level summary page also includes an AMR phenotype tab that lists all of the genomes within the selected taxon that have an AMR phenotype, and the data that are associated with it, including specific treatments, phenotypes or laboratory methods. All tables in PATRIC include a dynamic filter for rapid filtering of the genomes based on metadata selections.

Figure 4 A taxon-level summary on the PATRIC Web site describing AMR phenotype data across all of the genomes that are part of the Staphylococcus genus. (A) A bar graph summarizes the antibiotics, the AMR phenotype (resistant, intermediate or susceptible) and the number of genomes that match that phenotype. (B) The AMR phenotype tabular view, which shows all the genomes that have associated AMR data, includes a dynamic filter for rapid selection of genomes based on the metadata.

Gene view and predicted regions associated with AMR phenotypes

PATRIC provides a summary of data at the gene level, where the physical characteristics of a gene, its functional role(s), available experimental data and associated publications are provided. This view also includes information on homology to genes known to be important in AMR. In addition, PATRIC provides a view for predicted regions within some genes that are associated with AMR phenotypes. The k-mer regions predicted by the ML classifiers are visually indicated and their genomic region can be seen on the genome browser (Figure 5).

Figure 5 AMR predicted regions, located in the genome of S. aureus strain 08S00974, as visualized in the PATRIC JBrowse viewer [57]. These predicted regions, numbered sequentially by their occurrence in the genome as ‘classifier_predicted_regions 12–15’, were predicted by the ML algorithm that is being used to predict AMR phenotypes. The predicted regions are located in and around a gene (fig|1280.11691.peg.56) that is annotated as ‘Tetracycline resistance, MFS efflux pump => Tet(K)’. The annotation for this gene came from the focused manual curation effort at PATRIC to incorporate and propagate information for specific genes that were known to play an important role in AMR.

Future improvements

We continue to peruse resources and publications to identify new genomes and AMR genes to incorporate into PATRIC. These will be used to expand the AMR phenotype predictions and AMR gene analysis to new genera and new antibiotics. We plan to map AMR properties to the genus-specific families (PLfams) to support comparative analysis of AMR genes, incorporate new AMR gene trees and allow users to build nucleotide-based multiple sequence alignments to identify SNPs and their association with AMR phenotypes.

We are acutely aware that several important types of AMR determinants are not amenable to being encoded and automatically propagated via the automated annotation propagation strategy described above. These include antibiotic targets, which are largely cellular proteins performing essential housekeeping functions; such proteins are grouped into ‘classic’ functional roles in SEED/PATRIC. They carry functional annotations that are unrelated to AMR.
Antibiotic susceptibility in these target proteins is determined by a few, or even a single, non-synonymous mutation in the corresponding gene [52–54]. Likewise, single mutations in noncoding DNA regions, including promoters, operators and attenuators, can lead to a dramatic increase in MIC, or an increase in resistance levels to particular antimicrobials [55, 56]. These cases will be treated separately in PATRIC. We are in the process of designing tools specific for SNP detection and analysis targeted at the gene level. While PATRIC does not currently enable examining AMR data from metagenomes or from population-based studies, this is something that we plan to provide in future releases.

Key Points

- PATRIC includes AMR information at both the genome and gene level, and uses manual curation and ML to integrate these data into the annotation service.
- A large collection of AMR-specific functional roles has been manually curated, and this information is propagated by the annotation service.
- With summaries of the available data across all taxonomic levels and new interfaces, researchers can quickly locate and examine these data in their private genomes and compare with the PATRIC collection.

Funding

The NIAID, National Institutes of Health, Department of Health and Human Services (grant number HHSN272201400027C to R.L.S.).

Dionysios A. Antonopoulos is a Microbiologist who is a staff scientist in the Biosciences Division at Argonne National Laboratory and an Assistant Professor in the University of Chicago Department of Medicine in Illinois, USA. Rida Assaf is a PhD student in the Department of Computer Science at the University of Chicago in Illinois, USA. Ramy Karam Aziz is a Professor and Acting Chair at the Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt. His research focuses on microbial and viral genomics and metagenomics. Thomas S.
Brettin is a Strategic Program Manager for Computing and Life Sciences within the Computing, Environmental and Biological Sciences Directorate at Argonne National Laboratory in Illinois, USA. Christopher Bun has a PhD degree in Computational Biology in the Department of Computer Science, University of Chicago in Illinois, USA. Neal Conrad is a Software Engineering Associate at Argonne National Laboratory and the University of Chicago Computation Institute who specializes in Web application development and user experience for bioinformatics. James J. Davis is a Computational Biologist at Argonne National Laboratory and the University of Chicago Computation Institute in Illinois, USA. Emily M. Dietrich is a Coordinating Writer/Editor at Argonne National Laboratory and a joint appointment at the University of Chicago Computation Institute in Illinois, USA. Terrence Disz, PhD, is a Bioinformatics Software Specialist at the Fellowship for Interpretation of Genomes in Illinois, USA. Svetlana Gerdes, PhD, is a Comparative Genomics Specialist at the Fellowship for Interpretation of Genomes in Illinois, USA. Ron Kenyon is a Project Director at the Biocomplexity Institute of Virginia Tech, Blacksburg, Virginia, USA. Dustin Machi is a Senior Software Architect at the Biocomplexity Institute of Virginia Tech, Blacksburg, Virginia, USA. Chunhong Mao is a Research Assistant Professor at the Biocomplexity Institute of Virginia Tech, Blacksburg, Virginia, USA. Daniel E. Murphy-Olson is a Cloud Services Team Lead at Argonne National Laboratory and Joint Staff at the University of Chicago Computation Institute in Illinois, USA. Eric K. Nordberg is a Research Scientist and Software Engineer with the Biocomplexity Institute of Virginia Tech, Blacksburg, Virginia, USA. Gary J. Olsen is a Microbiologist with a particular interest in comparative genome analysis at the University of Illinois at Urbana-Champaign in Illinois, USA. 
Robert Olson is a Senior Software Engineer in the Computing, Environment and Life Sciences Directorate of Argonne National Laboratory and the Computation Institute at the University of Chicago, in Illinois, USA. Ross Overbeek is a Founding Fellow of the Fellowship to Interpret Genomes, as well as Senior Computational Scientist at the Computation Institute, University of Chicago, in Illinois, USA. Bruce Parrello is a Research Professional in the Computing, Environment, and Life Sciences Division at Argonne National Laboratory in Illinois, USA. Gordon D. Pusch has a PhD degree in Physics. He is a member of the Fellowship for Interpretation of Genomes, and is a codeveloper and co-maintainer of the SEED and RAST genome annotation systems. John Santerre is a PhD candidate in Machine Learning in the Department of Computer Science, University of Chicago in Illinois, USA. Maulik Shukla is a Senior Software Engineer, Computing in the Environment and Life Sciences, Argonne National Laboratory in Illinois, USA. Rick L. Stevens is the Associate Laboratory Director for Computing, Environment and Life Sciences Directorate at Argonne National Laboratory and Professor of Computer Science in the Computation Institute at the University of Chicago in Illinois, USA. Margo Van Oeffelen is a Technical Assistant at the Fellowship for Interpretation of Genomes. Veronika Vonstein, PhD, is a Founding Fellow and President of the Fellowship for Interpretation of Genomes. Andrew S. Warren is a Senior Software Architect at the Biocomplexity Institute of Virginia Tech, Blacksburg, Virginia, USA. Alice R. Wattam is a Research Assistant Professor at the Biocomplexity Institute of Virginia Tech, Blacksburg, Virginia, USA. Fangfang Xia is a Computer Scientist in the Computing, Environment and Life Sciences Directorate of Argonne National Laboratory and a Research Fellow at Computation Institute of the University of Chicago in Illinois, USA. 
Hyunseung Yoo is a Software Engineer at Argonne National Laboratory and the University of Chicago Computation Institute in Illinois, USA.

References

1. Wattam AR, Davis JJ, Assaf R, et al. Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center. Nucleic Acids Res 2017;45:D535–42.
2. Greene JM, Collins F, Lefkowitz EJ, et al. National Institute of Allergy and Infectious Diseases bioinformatics resource centers: new assets for pathogen informatics. Infect Immun 2007;75:3212–9.
3. Aziz RK, Bartels D, Best AA, et al. The RAST Server: rapid annotations using subsystems technology. BMC Genomics 2008;9:1.
4. Brettin T, Davis JJ, Disz T, et al. RASTtk: a modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci Rep 2015;5:8365.
5. Davis JJ, Gerdes S, Olsen GJ, et al. PATtyFams: protein families for the microbial genomes in the PATRIC database. Front Microbiol 2016;7:118.
6. Clark K, Karsch-Mizrachi I, Lipman DJ, et al. GenBank. Nucleic Acids Res 2016;44:D67–72.
7. Benson DA, Cavanaugh M, Clark K, et al. GenBank. Nucleic Acids Res 2017;45:D37.
8. O'Leary NA, Wright MW, Brister JR, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 2016;44:D733–45.
9. World Health Organization. Antimicrobial Resistance. Draft Global Action Plan on Antimicrobial Resistance. Geneva: WHO, 2015.
10. Eurosurveillance Editorial Team. WHO member states adopt global action plan on antimicrobial resistance. Euro Surveill 2015;20.
11. Fauci AS, Collins FS. New strategies in battle against antibiotic resistance. https://directorsblog.nih.gov/2014/09/18/new-strategies-in-battle-against-antibiotic-resistance/.
12. Roca I, Akova M, Baquero F, et al. The global threat of antimicrobial resistance: science for intervention. New Microbes New Infect 2015;6:22–9.
13. Chen L. Notes from the field: pan-resistant New Delhi metallo-beta-lactamase-producing Klebsiella pneumoniae – Washoe County, Nevada, 2016. MMWR Morb Mortal Wkly Rep 2017;66:33.
14. Centers for Disease Control and Prevention. Antibiotic resistance threats in the United States, 2013. Centers for Disease Control and Prevention, US Department of Health and Human Services, 2013.
15. Barrett T, Clark K, Gevorgyan R, et al. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res 2012;40:D57–63.
16. Davis JJ, Boisvert S, Brettin T, et al. Antimicrobial resistance prediction in PATRIC and RAST. Sci Rep 2016;6:27930.
17. Kodama Y, Shumway M, Leinonen R. The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res 2012;40:D54–6.
18. Leinonen R, Akhtar R, Birney E, et al. The European Nucleotide Archive. Nucleic Acids Res 2010;gkq967.
19. Bankevich A, Nurk S, Antipov D, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 2012;19:455–77.
20. European Committee on Antimicrobial Susceptibility Testing. EUCAST guidelines for detection of resistance mechanisms and specific resistances of clinical and/or epidemiological importance. EUCAST, Basel, Switzerland, 2013. http://www.eucast.org/clinical_breakpoints.
21. Patel J, Cockerill F, Alder J, et al. Performance standards for antimicrobial susceptibility testing; twenty-fourth informational supplement. In: CLSI Standards for Antimicrobial Susceptibility Testing. Clinical and Laboratory Standards Institute, Wayne, PA, vol. 34, 2014, 1–226.
22. Bradley P, Gordon NC, Walker TM, et al. Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nat Commun 2015;6:10063.
23. Drouin A, Giguère S, Sagatovich V, et al. Learning interpretable models of phenotypes from whole genome sequences with the Set Covering Machine. arXiv, preprint arXiv:1412.1074 [q-bio.GN], 2014.
24. Santerre JW, Davis JJ, Xia F, et al. Machine learning for antimicrobial resistance. arXiv:1607.01224 [stat.ML], 2016.
25. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. In: European Conference on Computational Learning Theory. Springer, 1995, 23–37.
26. Freund Y, Schapire R, Abe N. A short introduction to boosting. J Jpn Soc Artif Intell 1999;14:1612.
27. Long SW, Olsen RJ, Eager TN, et al. Population genomic analysis of 1,777 extended-spectrum beta-lactamase producing Klebsiella pneumoniae, Houston, Texas: unexpected abundance of clonal group 307. mBio, vol. 8, 2017.
28. Overbeek R, Olson R, Pusch GD, et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res 2014;42:D206–14.
29. Levy SB, McMurry LM, Barbosa TM, et al. Nomenclature for new tetracycline resistance determinants. Antimicrob Agents Chemother 1999;43:1523–4.
30. Chopra I, Roberts M. Tetracycline antibiotics: mode of action, applications, molecular biology, and epidemiology of bacterial resistance. Microbiol Mol Biol Rev 2001;65:232–60, second page, table of contents.
31. Bush K, Pazkill T, Jacoby J. β-lactamase classification and amino acid sequences for TEM, SHV and OXA extended-spectrum and inhibitor resistant enzymes. http://www.lahey.org/Studies/.
32. Thai QK, Bös F, Pleiss J. The lactamase engineering database: a critical survey of TEM sequences in public databases. BMC Genomics 2009;10:390.
33. Fischer M, Thai QK, Grieb M, et al. DWARF–a data warehouse system for analyzing protein families. BMC Bioinformatics 2006;7:495.
34. Pasteur I. Klebsiella sequence typing. http://bigsdb.pasteur.fr/klebsiella/klebsiella.html.
35. McArthur AG, Waglechner N, Nizam F, et al. The comprehensive antibiotic resistance database. Antimicrob Agents Chemother 2013;57:3348–57.
36. NCBI. Bacterial antimicrobial resistance reference gene database, 2017. https://www.ncbi.nlm.nih.gov/bioproject/?term=3130472017.
37. Zankari E, Hasman H, Cosentino S, et al. Identification of acquired antimicrobial resistance genes. J Antimicrob Chemother 2012;67:2640–4.
38. Boratyn GM, Camacho C, Cooper PS, et al. BLAST: a more efficient report with usability improvements. Nucleic Acids Res 2013;41:W29–33.
39. Madden T. The BLAST sequence analysis tool. In: The NCBI Handbook [Internet], 2nd ed., National Center for Biotechnology Information, Bethesda, MD, 2013.
40. Haft DH, Selengut JD, Richter RA, et al. TIGRFAMs and genome properties in 2013. Nucleic Acids Res 2013;41:D387–95.
41. Sun J, Deng Z, Yan A. Bacterial multidrug efflux pumps: mechanisms, physiology and pharmacological exploitations. Biochem Biophys Res Commun 2014;453:254–67.
42. Roberts MC, Sutcliffe J, Courvalin P, et al. Nomenclature for macrolide and macrolide-lincosamide-streptogramin B resistance determinants. Antimicrob Agents Chemother 1999;43:2823–30.
43. Furnham N, Garavelli JS, Apweiler R, et al. Missing in action: enzyme functional annotations in biological databases. Nat Chem Biol 2009;5:521–5.
44. Schnoes AM, Brown SD, Dodevski I, et al. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol 2009;5:e1000605.
45. Ramirez MS, Tolmasky ME. Aminoglycoside modifying enzymes. Drug Resist Updat 2010;13:151–71.
46. Shaw KJ, Rather PN, Hare RS, et al. Molecular genetics of aminoglycoside resistance genes and familial relationships of the aminoglycoside-modifying enzymes. Microbiol Rev 1993;57:138–63.
47. van Hoek AH, Mevius D, Guerra B, et al. Acquired antibiotic resistance genes: an overview. Front Microbiol 2011;2:203.
48. Roberts MC. Update on macrolide-lincosamide-streptogramin, ketolide, and oxazolidinone resistance genes. FEMS Microbiol Lett 2008;282:147–59.
49. Roberts MC. Nomenclature for tetracycline genes/nomenclature center for MLS genes, 2017. http://faculty.washington.edu/marilynr/2017.
50. Kim S, Thiessen PA, Bolton EE, et al. PubChem substance and compound databases. Nucleic Acids Res 2015;4:44.
51. Federhen S. Type material in the NCBI taxonomy database. Nucleic Acids Res 2015;43:D1086–98.
52. Maness MJ, Sparling PF. Multiple antibiotic resistance due to a single mutation in Neisseria gonorrhoeae. J Infect Dis 1973;128:321–30.
53. Mac Aogain M, Kilkenny S, Walsh C, et al. Identification of a novel mutation at the primary dimer interface of GyrA conferring fluoroquinolone resistance in Clostridium difficile. J Glob Antimicrob Resist 2015;3:295–9.
54. Santos-Lopez A, Bernabe-Balas C, Ares-Arroyo M, et al. A naturally occurring single nucleotide polymorphism in a multicopy plasmid produces a reversible increase in antibiotic resistance. Antimicrob Agents Chemother 2017;61:e01735–16.
55. Martinez J, Baquero F. Mutation frequencies and antibiotic resistance. Antimicrob Agents Chemother 2000;44:1771–7.
56. Suzuki S, Horinouchi T, Furusawa C. Prediction of antibiotic resistance by gene expression profiles. Nat Commun 2014;5.
57. Buels R, Yao E, Diesh CM, et al. JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol 2016;17:66.

© The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
journal article
Open Access Collection
Recent development of antiSMASH and other computational approaches to mine secondary metabolite biosynthetic gene clusters

Blin, Kai; Kim, Hyun Uk; Medema, Marnix H; Weber, Tilmann

2019 Briefings in Bioinformatics

doi: 10.1093/bib/bbx146 pmid: 29112695

Abstract

Many drugs are derived from small molecules produced by microorganisms and plants, so-called natural products. Natural products have diverse chemical structures, but the biosynthetic pathways producing those compounds are often organized as biosynthetic gene clusters (BGCs) and follow a highly conserved biosynthetic logic. This allows for the identification of core biosynthetic enzymes using genome mining strategies that are based on the sequence similarity of the involved enzymes/genes. However, mining for a variety of BGCs quickly reaches a level of complexity at which manual analyses are no longer feasible and automated genome mining pipelines, such as the antiSMASH software, are required. In this review, we discuss the principles underlying the predictions of antiSMASH and other tools and provide practical advice for their application. Furthermore, we discuss important caveats such as rule-based BGC detection, sequence and annotation quality and cluster boundary prediction, which all have to be considered while planning for, performing and analyzing the results of genome mining studies.

Keywords: genome mining, biosynthetic gene cluster, antibiotics, secondary metabolites, natural products, antiSMASH

Introduction

Most antibiotics, such as penicillin, erythromycin or tetracycline, and also other drugs like acarbose (anti-diabetic), artemisinin (anti-malarial), tacrolimus or cyclosporins (immunosuppressants) are so-called natural products either synthesized by or derived from microorganisms or plants [1]. As the biosynthetic pathways for such compounds are not directly related to growth and reproduction, these compounds are also referred to as ‘secondary metabolites’ or, in newer literature, ‘specialized metabolites’. In bacteria and fungi, the genes required for the biosynthesis of these compounds are usually organized as biosynthetic gene clusters (BGCs).
These clusters contain all genes required for the biosynthesis of precursors, assembly of the compound scaffold, modification of the compound scaffold (also referred to as ‘tailoring’) and often also resistance, export and regulation. This implies that the full pathway can easily be identified if the involvement of one of the genes in biosynthesis can be demonstrated. In plants, only some pathways are organized in BGCs [2]. For other pathways, the biosynthesis genes are scattered across the genome and thus require additional experimental data, such as co-expression analyses [3], for identification. Soon after the first genes encoding natural product biosynthetic enzymes were identified, sequenced and analyzed, it became apparent that the sequences of the corresponding enzymes contain data of highly predictive quality, which can be used to infer key biosynthetic steps. For example, the core scaffolds of the products of canonical modular type I polyketide synthases (PKSs) can be predicted by combining several types of easy-to-obtain data: (a) the content and architecture of individual enzymatic domains within the megaenzymes, which are responsible for the assembly of the molecular scaffold and its modifications (e.g. reduction of the β-carbon), can be identified by using Hidden Markov model (HMM) profiles of such domains; (b) the individual acyl-CoA building blocks for each PKS module (e.g. malonyl-CoA versus methylmalonyl-CoA) can be inferred based on key residues in the active sites of the acyltransferase (AT) domains or by using phylogenetic classification; (c) the stereospecificity mediated by ketoreductase domains can be inferred by key amino acids in the active site motifs. These studies were the starting point in establishing genome mining for secondary metabolite BGCs as one of the recent key technologies in natural products research. 
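As a rough illustration of point (b) above, the following Python sketch guesses the extender unit of an AT domain from a short specificity motif located downstream of the active-site serine. Everything in it is an assumption made for illustration rather than something taken from this review: the HAFH/YASH motif table reflects signatures commonly cited in the PKS literature, the `GHS.G` active-site pattern and the 80 to 120 residue search window are simplifications, and the function name `predict_at_substrate` is hypothetical. Real predictors use larger residue signatures and/or phylogenetic classification.

```python
import re

# Illustrative motif-to-substrate table (an assumption for this sketch;
# production tools rely on richer signatures or phylogeny).
AT_MOTIFS = {
    "HAFH": "malonyl-CoA",
    "YASH": "methylmalonyl-CoA",
}

# Simplified pattern for the conserved active-site serine region (assumption).
ACTIVE_SITE = re.compile(r"GHS.G")


def predict_at_substrate(domain_seq, window=(80, 120)):
    """Return a substrate guess for one AT domain sequence, or 'unknown'.

    The specificity motif is searched only in a window (hypothetical offsets)
    downstream of the first active-site match.
    """
    match = ACTIVE_SITE.search(domain_seq)
    if match is None:
        return "unknown"
    region = domain_seq[match.start() + window[0]:match.start() + window[1]]
    for motif, substrate in AT_MOTIFS.items():
        if motif in region:
            return substrate
    return "unknown"


if __name__ == "__main__":
    # Toy sequences built only to exercise the function, not real proteins.
    toy_malonyl = "M" * 10 + "GHSQG" + "A" * 90 + "HAFH" + "L" * 10
    toy_methylmalonyl = "M" * 10 + "GHSQG" + "A" * 90 + "YASH" + "L" * 10
    for name, seq in [("toy_malonyl", toy_malonyl),
                      ("toy_methylmalonyl", toy_methylmalonyl)]:
        print(name, predict_at_substrate(seq))
```

Restricting the motif search to a window relative to the active site, rather than scanning the whole sequence, is one simple way to reduce chance matches of a four-residue motif.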
One of the first computational tools to make use of such predictions was the proprietary DECIPHER® search engine and database of the former company Ecopia [4] that was first published in 2003. Around the same time, the first publicly available tools were released. For example, SEARCHPKS automated the identification of enzymatic domains in PKSs [5] (for URLs to this and all following Web tools, please see Table 1). However, it took until 2009 for the first open-source genome mining pipelines CLUSEAN [29] and NP.searcher [21] to be published. In 2011, the first version of the open-source genome mining platform antiSMASH was released [30], which combined and extended the functionality of the previous tools and also offered a user-friendly Web interface. For the first time, it became possible for scientists without significant experience in computational biology to perform larger-scale genome mining studies on a free and public Web server. Since then, antiSMASH has been steadily extended [6, 7, 23, 30–33] and currently offers a broad collection of tools and databases for automated genome mining and comparative genomics for a wide variety of different classes of secondary metabolites. The antiSMASH analysis pipeline for bacterial genomes and the pipeline for fungal genomes (recently named ‘fungiSMASH’) are both based on the same codebase. antiSMASH and fungiSMASH use two different Web submission forms, each offering specific options. plantiSMASH [23] is a branch of antiSMASH that includes plant-specific functionality, such as plant-adapted HMM profiles and cluster detection logic, as well as support for coexpression analysis.

Table 1 URLs of Web servers, Web tools and databases referred to in the review

| Tool | Functions | URL | Reference |
| --- | --- | --- | --- |
| antiSMASH 4 | Genome mining; BGC analysis; Domain analysis | http://antismash.secondarymetabolites.org | [6] |
| antiSMASH database | BGC database | http://antismash-db.secondarymetabolites.org | [7] |
| ARTS | Genome mining | http://arts.ziemertlab.com | [8] |
| BAGEL 3 | Genome mining | http://bagel.molgenrug.nl/ | [9] |
| CASSIS | BGC boundary prediction | https://sbi.hki-jena.de/cassis/cassis.php | [10] |
| CRISPy-web | sgRNA design | http://crispy.secondarymetabolites.org | [11] |
| eSNaPD v2 | Genome mining | http://esnapd2.rockefeller.edu | [12] |
| FunGeneClusterS | BGC boundary prediction | https://fungiminions.shinyapps.io/FunGeneClusterS | [13] |
| fungiSMASH | Genome mining; BGC analysis; Domain analysis | http://fungismash.secondarymetabolites.org | [6] |
| GNP | Metabolomics | http://magarveylab.ca/gnp | [14] |
| GRAPE/GARLIC | Genome mining | https://magarveylab.ca/gast/ | [15, 16] |
| MIBiG | BGC database; reference data set | http://mibig.secondarymetabolites.org | [17] |
| NaPDoS | Genome mining | http://napdos.ucsd.edu | [18] |
| NORINE | Nonribosomal peptide database | http://bioinfo.lifl.fr/NRP | [19, 20] |
| NP.searcher | Genome mining; Domain analysis | http://dna.sherman.lsi.umich.edu/ | [21] |
| NRPSpredictor | Domain analysis | http://nrps.informatik.uni-tuebingen.de | [22] |
| plantiSMASH | Genome mining; BGC analysis | http://plantismash.secondarymetabolites.org | [23] |
| PRISM 3 | Genome mining; BGC analysis; Domain analysis | http://magarveylab.ca/prism | [24] |
| RODEO | Genome mining; RiPP analysis | http://www.ripprodeo.org | [25] |
| (SEARCHPKS)/SBSPKS v2 | Domain analysis; BGC database | http://202.54.226.228/~pksdb/sbspks_updated/master.html | [26] |
| Smiles2Monomers | Retro-biosynthetic monomer prediction | http://bioinfo.lifl.fr/norine/smiles2monomers.jsp | [27] |
| SMURF | Genome mining | http://www.jcvi.org/smurf | [28] |

In addition to antiSMASH, other noteworthy tools have also been developed and made available: SMURF [28] offers mining for fungal PKS, nonribosomal peptide synthetase (NRPS) and terpenoid gene clusters; the PRISM tool [24, 34, 35] offers genome mining functionality with a strong focus on predicting chemical
structures of the biosynthetic pathways. PRISM is closely connected to the ‘Genomes-to-Natural Products platform (GNP)’ [14] that matches such predictions with MS/MS data, and to the GRAPE/GARLIC tools [15, 16], which match the predictions to chemical databases. For a comprehensive review describing the history and progress of secondary metabolite genome mining, along with many examples of compounds and BGCs that were identified using genome mining approaches, please see [36]. In this review, we will focus on the general computational approaches to study secondary metabolite biosynthesis and how these are integrated into the current antiSMASH framework (Figure 1). Finally, we will give practical advice for preparing and interpreting genome mining data. Although we focus on antiSMASH as an example, the issues discussed are applicable to natural product genome mining in general, and hence are equally relevant when using other tools. Comprehensive user guides for antiSMASH can be found online (http://docs.antismash.secondarymetabolites.org/using_antismash/) and in [37–39]. For comprehensive reviews on the different genome mining tools and databases on secondary metabolites, the reader is referred to [40–43].

Figure 1 General workflow of an antiSMASH analysis of bacterial, fungal and plant genomes. Computational resources in the left and right boxes have been integrated with antiSMASH 4 for enhanced genome mining performance, whereas those in the box at the bottom correspond to third-party applications that use antiSMASH for the detection of BGCs.
Principles of predicting secondary metabolite biosynthesis

To predict secondary metabolite biosynthesis pathways, genome mining approaches commonly start out by identifying conserved biosynthetic genes. Their gene products are subsequently analyzed to gain information about their putative function in biosynthesis and sometimes their substrate specificity.

To identify conserved biosynthetic genes, gene annotations must be available for the genome of interest. Formats such as NCBI’s GenBank or EBI’s EMBL contain both DNA sequence and gene annotations. GFF3 files can be used to carry the annotations for sequences in FASTA format. antiSMASH accepts input data in all of these formats. If no gene annotations are available, antiSMASH will run a gene finding tool. For the bacterial version, this is Prodigal [44]. For fungal and plant genomes, antiSMASH uses GlimmerHMM [45].

In the next step, BGCs are identified based on core enzymes involved in the biosynthesis of secondary metabolites. Functionally related proteins frequently share common patterns of amino acids, so profile-based methods such as position-specific scoring matrices are a natural way to identify these patterns. HMMs are probabilistic models of linear sequences that provide an algorithmic approach to interpret the scores obtained from the scoring matrix. Profile HMMs (pHMMs) are HMMs designed to represent multiple sequence alignments, including matches, insertions and deletions. The most widely used software for working with pHMMs in biology is HMMer [46]. Many profile databases such as PFAM [47] and TIGRFAMs [48] provide downloadable profiles compatible with HMMer. antiSMASH uses pHMMs with profiles specific to conserved core enzymes of secondary metabolite biosynthesis pathways to run its profile-based BGC detection. Once the core enzymes have been identified, antiSMASH compares co-located core genes with a set of manually curated BGC cluster rules.
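The profile idea described above can be made concrete with a toy position-specific scoring matrix. The Python sketch below is a minimal illustration under stated assumptions, not a description of how HMMer or antiSMASH work internally: it builds log-odds scores from an ungapped alignment with a uniform background and a pseudocount, and it has none of the insertion and deletion states that distinguish a pHMM. The function names (`build_pssm`, `score_window`, `best_hit`) and the five-residue example alignment are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, ungapped aligned sequences.

    A uniform background distribution is assumed for simplicity.
    """
    background = 1.0 / len(ALPHABET)
    pssm = []
    for col in range(len(alignment[0])):
        # Start every residue at the pseudocount so unseen residues
        # receive a finite (negative) score instead of minus infinity.
        counts = {aa: pseudocount for aa in ALPHABET}
        for seq in alignment:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2((counts[aa] / total) / background)
                     for aa in ALPHABET})
    return pssm


def score_window(pssm, window):
    """Sum the per-column log-odds scores for one window of sequence."""
    return sum(column[aa] for column, aa in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, start) of the best-scoring window, or None if too short."""
    width = len(pssm)
    hits = [(score_window(pssm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits) if hits else None


if __name__ == "__main__":
    # Hypothetical alignment of a short conserved motif.
    toy_alignment = ["GHSLG", "GHSMG", "GHSQG", "GHSIG"]
    toy_pssm = build_pssm(toy_alignment)
    print(best_hit(toy_pssm, "MKTAGHSLGEW"))
```

Conserved columns contribute large positive scores and variable columns contribute little, which is why a window containing the motif stands out from the rest of the sequence; a pHMM extends this scheme with position-specific insertion and deletion probabilities.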
These rules comprise Boolean logic regarding domain presence/absence within either a gene or a genomic region of interest. For example, BGCs encoding nonribosomally synthesized peptides (such as the antibiotic vancomycin) can be unambiguously identified if the sequence to be analyzed contains genes encoding proteins that have a combination of one or multiple Condensation, Adenylation (A) and Peptidyl Carrier Protein domains. ‘Negative’ models are also used to discard false positives, e.g. protein sequences that achieve higher scores for profiles of fatty acid synthases (which are homologous to PKSs) than for profiles of PKSs will not lead to the identification of a polyketide BGC. The 2017 version of antiSMASH (version 4) [6] uses such rules for 45 different types/classes of secondary metabolites (Table 2A). The cluster rules are stored in a tab-delimited text file, which can be easily edited to add custom types of gene clusters. Similar rule-based strategies are also used by many other secondary metabolite genome mining tools, such as PRISM [24], SMURF [28] and BAGEL [9]. Table 2 A: BGC types detectable by pHMM-based rules with antiSMASH, PRISM and SMURF. B: Rule-independent methods to detect BGCs A: Rule-based detection of gene clustersa . BGC-type . antiSMASH . PRISM/RiPP PRISM . SMURF . 
Aminocoumarins X X Aminoglycosides/ aminocyclitols X Antimetabolites X Aryl polyenes X X Autoinducing peptide X Bacteriocins X Beta-lactams X X Bottromycin X X Butyrolactones X X ClusterFinder fatty acid X ClusterFinder saccharide X ComX X Cyanobactins X X Ectoines X X Furan X X Fused (pheganomycin-like) X Glycocin X X Head-to-tail cyclized peptide X X Homoserine lactone X X Indoles X X Ladderane lipids X X Lantipeptides class I X X Lantipeptides class II X X Lantipeptides class III/IV X X Lasso peptide X X Linaridin X X Linear azol(in)e-containing X X Melanins X X Microcin X Microviridin X X Nonribosomal peptides X X X Nucleosides X Oligosaccharide X Other (unusual) PKS X Others X Phenazine X X Phosphoglycolipids X X Phosphonate X X Polyunsaturated fatty acids X Prochlorosin X Proteusin X X Sactipeptide X X Non-NRP siderophores X Streptide X Terpene X X Thiopeptides X X Thioviridamide X Trans-AT type I PKS X X Trifolitoxin X Type I PKS X X X Type II PKS X X Type III PKS X X YM-216391 X A: Rule-based detection of gene clustersa . BGC-type . antiSMASH . PRISM/RiPP PRISM . SMURF . 
Table 2 (continued) B: Rule-independent methods
Method | Principle | Implemented in | References
ClusterFinder | HMM-based classification of which PFAM domains are likely to be found inside or outside a BGC | antiSMASH | [6, 49]
EvoMining | Phylogenomic identification of enzymes with expanded substrate spectrum; such enzymes are often found in BGCs | EvoMining | [50]
Resistance gene-based mining | Identification of potential antibiotic resistance genes; often such genes are part of BGCs to provide self-protection of the producing organism | ARTS | [8]
a For details on the pHMMs and specific rules used by the different genome mining programs, please consult the original publications of antiSMASH [6, 32], PRISM [24, 34] or SMURF [28].
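The rule-based detection described before Table 2 can be illustrated with a minimal Python sketch. It is not the antiSMASH implementation: the rule table, the domain names (`Condensation`, `Adenylation`, `PCP`, `PKS_KS`, `FAS_KS`) and the scores are hypothetical placeholders. It only shows how Boolean domain rules and a ‘negative’ model could be combined for a set of co-located genes.

```python
from typing import Callable, Dict, List, Set

# Hypothetical rule table: BGC type -> Boolean test on the domains found in a region.
# Type and domain names are illustrative, not antiSMASH's actual identifiers.
RULES: Dict[str, Callable[[Set[str]], bool]] = {
    "NRPS": lambda d: {"Condensation", "Adenylation", "PCP"} <= d,
    "PKS": lambda d: "PKS_KS" in d,
}


def call_domains(hits: Dict[str, float]) -> Set[str]:
    """Turn one gene's profile hits (profile name -> score) into domain calls."""
    domains = set(hits)
    # 'Negative' model: a fatty acid synthase profile that outscores the PKS
    # profile vetoes the PKS call for this gene.
    if hits.get("FAS_KS", 0.0) > hits.get("PKS_KS", 0.0):
        domains.discard("PKS_KS")
    domains.discard("FAS_KS")  # the FAS profile only serves as a filter
    return domains


def classify_region(genes: List[Dict[str, float]]) -> List[str]:
    """Apply every rule to the pooled domain calls of co-located genes."""
    present: Set[str] = set()
    for hits in genes:
        present |= call_domains(hits)
    return sorted(name for name, rule in RULES.items() if rule(present))


# Toy regions with made-up scores.
nrps_region = [{"Condensation": 120.0, "Adenylation": 310.0}, {"PCP": 45.0}]
fas_like = [{"PKS_KS": 200.0, "FAS_KS": 300.0}]
pks_region = [{"PKS_KS": 250.0, "FAS_KS": 100.0}]
print(classify_region(nrps_region), classify_region(fas_like), classify_region(pks_region))
```

Keeping the rules in a plain lookup table mirrors the idea that cluster rules live in an editable file, so a custom BGC type would be added as one more entry rather than as new code.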
Alternatively, antiSMASH can be set to use a probabilistic method, the ClusterFinder algorithm [49], to detect potential secondary metabolite BGCs. Rather than using explicit rules requiring specific enzymes to be present for a particular class of BGCs, ClusterFinder is based on a model built from a training set of PFAM domains found in BGCs and non-BGC regions. Given this model and a genome of interest with annotated PFAM domains, ClusterFinder then calculates the probability that a stretch of observed PFAM domains constitutes a BGC. In regions where this probability is higher than a configurable threshold, a BGC is predicted. For BGCs encoding NRPS, PKS, terpene or ribosomally synthesized and posttranslationally modified peptides (RiPPs), it is possible to perform some additional analyses to predict further details, such as substrate specificities or product cyclization patterns. To this end, it is sometimes necessary to classify proteins or domains that share a high overall sequence similarity. The differences between the functional classes (e.g. different substrate specificities) are determined by a small number of key amino acids. Sequence-alignment-based methods such as BLAST and profile-based methods like HMMer tend to perform poorly in these cases. As both kinds of methods are designed to score overall sequence similarities, they gloss over the few key differences by design. In such cases, more complex algorithms can be used. Support vector machines (SVMs) are a machine learning approach that uses supervised learning to create nonprobabilistic binary linear classifiers. SVMs classify data points encoded in multidimensional feature vectors by a maximum margin hyperplane. Compared with other machine learning methods such as artificial neural networks, the construction of the SVM hyperplane allows for gaining some insight into which of the input parameters contribute most to the solution.
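The probabilistic idea behind ClusterFinder, as described above, can be sketched with a two-state hidden Markov model over an ordered list of PFAM domains. This is a toy illustration only: the start, transition and emission probabilities below are invented, the PFAM accessions are used merely as example labels, and the real ClusterFinder model and its training data are not reproduced here. The sketch computes, for each domain position, the posterior probability of being inside a BGC (forward-backward algorithm) and reports stretches above a threshold.

```python
from typing import Dict, List, Tuple

STATES = ("bgc", "non_bgc")
# All probabilities below are made-up toy values (assumptions).
START: Dict[str, float] = {"bgc": 0.1, "non_bgc": 0.9}
TRANS: Dict[str, Dict[str, float]] = {
    "bgc": {"bgc": 0.9, "non_bgc": 0.1},
    "non_bgc": {"bgc": 0.05, "non_bgc": 0.95},
}
EMIT: Dict[str, Dict[str, float]] = {
    "bgc": {"PF00109": 0.30, "PF00501": 0.30, "PF00005": 0.10, "other": 0.30},
    "non_bgc": {"PF00109": 0.02, "PF00501": 0.03, "PF00005": 0.20, "other": 0.75},
}


def posterior_bgc(domains: List[str]) -> List[float]:
    """Posterior probability of the 'bgc' state at each domain position."""
    obs = [d if d in EMIT["bgc"] else "other" for d in domains]
    if not obs:
        return []
    # Forward pass, normalised at every position to avoid underflow.
    fwd: List[Dict[str, float]] = []
    for i, o in enumerate(obs):
        if i == 0:
            cur = {s: START[s] * EMIT[s][o] for s in STATES}
        else:
            cur = {s: EMIT[s][o] * sum(fwd[-1][r] * TRANS[r][s] for r in STATES)
                   for s in STATES}
        total = sum(cur.values())
        fwd.append({s: v / total for s, v in cur.items()})
    # Backward pass, normalised the same way.
    bwd: List[Dict[str, float]] = [dict.fromkeys(STATES, 1.0) for _ in obs]
    for i in range(len(obs) - 2, -1, -1):
        cur = {s: sum(TRANS[s][r] * EMIT[r][obs[i + 1]] * bwd[i + 1][r] for r in STATES)
               for s in STATES}
        total = sum(cur.values())
        bwd[i] = {s: v / total for s, v in cur.items()}
    # Combine both passes into per-position posteriors.
    post = []
    for f, b in zip(fwd, bwd):
        joint = {s: f[s] * b[s] for s in STATES}
        post.append(joint["bgc"] / sum(joint.values()))
    return post


def call_regions(post: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index runs where the posterior exceeds the threshold."""
    regions, start = [], None
    for i, p in enumerate(post):
        if p > threshold and start is None:
            start = i
        elif p <= threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(post) - 1))
    return regions


toy_genome = ["PF99999"] * 5 + ["PF00109", "PF00501", "PF00109", "PF00501"] + ["PF99999"] * 5
print(call_regions(posterior_bgc(toy_genome)))
```

Because the model is trained on domains from known pathways, any such sketch inherits the bias of its emission table; changing the `threshold` argument trades sensitivity against the number of spurious regions.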
For the multimodular enzymes involved in NRPS biosynthesis, antiSMASH uses the recently published SANDPUMA tool [51] to predict the substrates of A domains. Knowledge of these substrates and the order of the A domains are then used to predict the backbone structure of the NRPS product. SANDPUMA internally uses a combination of pHMMs and SVMs to obtain the best possible A domain substrate predictions. In RiPP clusters that encode the biosynthesis of, e.g., lanthi-, lasso-, sacti- and thiopeptides, identifying the precursor peptide is key to predicting the cluster product. Here, antiSMASH scores putative precursor peptides using the recently published RODEO tool [25], as well as some custom pHMMs. RODEO also uses both pHMMs and SVMs internally to identify precursor peptides. Tailoring enzymes that further modify the RiPP are also identified using pHMMs. Phylogenetic analysis assists with the classification of enzymes in Clusters of Orthologous Groups and the calculation of phylogenetic distances of genes/enzyme sequences of interest to characterized reference sequences. Multiple methods exist to construct phylogenetic trees based on multiple sequence alignments. Depending on the desired output tree characteristics, the number of input sequences and other constraints, the most appropriate method should be chosen. A popular algorithm among the distance-matrix-based methods is the Neighbor-Joining algorithm, which uses bottom-up clustering to create the tree. Neighbor-Joining is a comparatively fast method, but the correctness of the tree depends on the accuracy and additivity of the underlying distance matrix. Maximum parsimony methods try to identify the tree that uses the smallest number of evolution events to explain the observed sequence data. While maximum parsimony algorithms build accurate trees, their computation tends to be relatively slow compared with distance matrix-based methods. 
Maximum likelihood methods use probability distributions to assess the likelihood of a given phylogenetic tree according to a substitution model. Unfortunately, finding the optimal tree with this method is computationally expensive. Many current tools use a combination of methods. The popular software FastTree [52] first builds rough Neighbor-Joining trees and then refines them using a maximum likelihood scoring of the trees generated in the first pass. In antiSMASH, phylogenetic methods are used in many places. For NRPS clusters, SANDPUMA includes a phylogenetic analysis in the PrediCAT step. A modified version of PrediCAT trained on a recently released data set [53] is also used in terpenoid clusters to further classify terpene synthases. Noncore biosynthetic genes in a BGC are assigned to ‘secondary metabolite clusters of orthologous groups’, for which phylogenies are reconstructed. In addition to BGC type-dependent analyses, antiSMASH also includes general tools providing information on all cluster types. The built-in ClusterBlast module [30] considers the similarity of individual gene products as well as their genomic arrangement. ClusterBlast contains a comprehensive database of all predicted BGCs from publicly available genomes that is searched to identify organisms containing similar BGCs. The same algorithm is used in the ‘SubClusterBlast’ module to identify operons/sets of genes in the query BGC that code for enzymes involved in the biosynthesis of common precursors, for example the nonproteinogenic amino acid 3,5-dihydroxyphenylglycine present in some types of NRPS clusters, such as those of the vancomycin-family glycopeptides. Finally, this strategy is also used to search the Minimum Information about a Biosynthetic Gene cluster (MIBiG) [17] data set with the ‘KnownClusterBlast’ function to provide information about related and well-characterized gene clusters. This function can also be used to perform a sequence-based dereplication, i.e.
the identification of gene clusters that code for already known products. ‘Linked’ tools and resources A general challenge when using comparative approaches to study BGCs is the varying quality of annotation in public sequence databases. Some BGCs that have been extensively studied experimentally are well annotated, whereas others, mostly identified in high-throughput sequencing efforts, were only annotated using standard genome annotation pipelines that do not provide specific annotations of secondary metabolite BGCs. Therefore, a community effort has been established to define a ‘MIBiG’ standard [17] and provide a standardized repository for BGCs that have been experimentally connected to their biosynthetic products. The MIBiG repository currently (as of April 2017) contains 1396 entries of BGCs that are validated to code for a specific biosynthetic pathway. Within this set, 396 of the entries contain comprehensive manually curated annotations of the specific features of the gene clusters, which were provided by the specialists who studied the respective BGCs. This collection now serves as a reference data set for a wide variety of applications and the validation of novel computational tools. In addition to analyses integrated into antiSMASH, the annotation generated by antiSMASH can also be useful as a starting point for further downstream analyses. Therefore, antiSMASH 4 provides an application programming interface that allows third-party software to access antiSMASH annotation for further processing. Examples of such tools are the ‘Antibiotic Resistant Target Seeker’ (ARTS) [8], which predicts potential targets of antibiotics and uses the annotation provided by antiSMASH to mine for BGCs, and CRISpy-web [11], a Web tool that allows user-friendly design of single guide RNAs (sgRNAs) for CRISPR applications on nonmodel organisms.
antiSMASH is a comprehensive genome mining platform, but only provides information on individually submitted genomes and does not offer any integrated search functionality. Therefore, in 2016, the antiSMASH platform was extended with a database containing precomputed antiSMASH annotation on >3900 finished high-quality bacterial genome sequences [7]. Using the Web interface, it is possible to browse secondary metabolite clusters by BGC type or taxonomy of the producer organism. Additionally, custom queries can be constructed using an interactive query builder. This makes it possible to answer research questions such as ‘which clusters of type NRPS contain A domains that select for the nonproteinogenic amino acid 3, 5-dihydroxy-phenylglycine?’ or ‘what BGCs of type RiPP exist in the genus Streptomyces that are not lanthipeptides?’. The results are displayed in the same antiSMASH Web format. They can also be exported in various file formats that allow further processing in other bioinformatics tools. Considerations and caveats for computational genome mining You can only find what you are looking for… Most genome mining platforms, including antiSMASH (with default search options), SMURF [28] and PRISM [24, 34], use a rule-based approach to define what is annotated as a secondary metabolite BGC. These rules are derived from existing knowledge about key biosynthetic steps/principles, which require the activity of individual or combinations of specific enzymes. The genes encoding these are also often referred to as ‘core’ genes and used as anchors or probes to screen the genomic data of interest. 
While this method is highly sensitive and precise for identifying biosynthesis genes for many classes of secondary metabolites, such as polyketides, or nonribosomally synthesized peptides, it of course implies that only pathways for which rules are implemented in the mining software can be detected; all pathways that may use unknown or unrelated alternative enzymes will be missed. As an extension to the rule-based genome mining, antiSMASH optionally provides the possibility to use the ‘ClusterFinder’ method [49]. This algorithm can identify BGCs that are not detected by the expert-generated rule sets described above. However, it should be noted that this method still has some bias, as the source data used to train the HMM determining whether a gene product likely belongs to a BGC are also based on the currently known pathways. To address these limitations, alternative methods are under development to access the ‘biosynthetic dark matter’ and identify novel pathways and enzymes. One promising approach is ‘EvoMining’ [50], which is based on the observation that biosynthetic enzymes and/or resistance genes often evolved by duplication and divergence of primary metabolism enzymes. By detecting divergences in phylogenetic trees of enzymes from the core metabolism shared between many bacterial species, this method can identify enzymes that have likely been repurposed for secondary metabolite biosynthesis [50] or resistance [8]. Once novel pathways have been identified using such methods and experimentally validated, the newly obtained knowledge on the involved enzymes is of course used to refine and extend the rule-based mining methods. The quality of input data is important for getting reliable results One important aspect to be considered when mining genomic data for BGCs using antiSMASH or alternative pipelines, such as PRISM [24, 34], SMURF [28] and ClusterFinder [49], is the quality of the sequence data that is to be analyzed. 
All these tools use either rule-based or statistical approaches to identify the BGCs involved in secondary metabolism. Both methods require that the sequence data to be analyzed are not too fragmented and that the genes of a BGC are not scattered across different contigs in the assembly. Users should be particularly aware of potential quality issues when analyzing genome data generated with short-read sequencing technologies. Special care has to be taken when analyzing type I polyketide or NRPS-containing BGCs; both types of pathways involve large multimodular megaenzymes, whose gene sequences often are highly repetitive and therefore difficult to assemble purely based on short sequencing reads [54]. The same applies to metagenomic data; reliable identification of BGCs, which consist of several genes, is only possible on well-assembled data. Therefore, analyses on the public antiSMASH Web server are limited to sequences of over 1 kb length and the first 1000 contigs. Both limits can be deactivated in the stand-alone version of antiSMASH. To analyze highly fragmented short-read-based assemblies, pipelines focusing on the detection and analysis of individual core domains, such as NaPDoS [18] or eSNaPD [12], should be considered. In general, phylogenomics-based approaches, such as those mentioned above or the one used in EvoMining [50], are excellent alternatives for such fragmented data, as they base their predictions on single enzymes/genes instead of requiring the presence of complete or partial BGCs [55]. Therefore, we recommend first using these tools to identify ‘interesting’ sequence records in such bulk DNA data and then submitting only these records (provided they have the required sequence length) for an analysis with antiSMASH.
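Before submitting a fragmented assembly, it can help to check which contigs would pass the size limits mentioned above. The following Python sketch is a hypothetical pre-filter, not part of antiSMASH: the function names are our own, and the order of operations (length filter first, then the record cap) is an assumption, as the text only states the two limits.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def prefilter(records: Iterable[Record],
              min_length: int = 1000,
              max_records: int = 1000) -> List[Record]:
    """Keep contigs longer than min_length, stopping after max_records of them."""
    kept: List[Record] = []
    for header, seq in records:
        if len(kept) >= max_records:
            break
        if len(seq) > min_length:
            kept.append((header, seq))
    return kept


# Toy assembly: contig lengths are 500, 1500 and 2000 bases.
toy = [">c1", "A" * 500, ">c2", "A" * 800, "C" * 700, ">c3", "G" * 2000]
print([header for header, _ in prefilter(read_fasta(toy))])
```

For a file on disk, `read_fasta(open(path))` would feed the same filter; contigs that fail the length check are candidates for the domain-centered pipelines named above rather than for a full BGC analysis.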
In addition, most algorithms predicting enzyme specificities rely on automatically generated alignments of the user-supplied input data with experimentally characterized ‘reference’ sequences to identify residues of the active sites or the substrate-binding pockets. Depending on the tool used to predict specificities, these alignments are generated using standard multiple sequence alignment software like ClustalW [56] or Muscle [57]. Alternatively, BLAST or HMMer are used to match the query with a custom reference database. Consequently, these tools are sensitive to sequencing errors if these errors occur in or near the active sites or binding pockets. In addition, the accuracy of such computer-generated, nonrefined alignments may suffer if the protein sequence of interest is too dissimilar to the reference data sets. In both cases, this can easily lead to incorrect specificity predictions. When users analyze annotated sequence data, either uploaded as GenBank files or directly retrieved from the NCBI GenBank or RefSeq database, antiSMASH will only consider the annotated genes and will not perform additional gene finding. This also implies that genes annotated as ‘pseudogenes’ are not considered for any prediction. This is noteworthy, as many modular PKS and NRPS gene calls that were generated with the NCBI PGAP [58] pipeline (which is used to annotate all microbial genomes in RefSeq [59]) were inaccurate and the intact genes were labelled as pseudogenes. This bug has been fixed for RefSeq 82, but users who downloaded earlier versions of RefSeq entries should be cautious. Many GenBank records that were annotated with affected versions of PGAP also suffer from this issue. If users supply unannotated sequence data, antiSMASH uses the software Prodigal [44] for bacterial genomes or GlimmerHMM [45] for fungal and plant sequences to automatically identify coding regions.
The downstream genome analyses therefore depend on the accuracy of the automated gene finding, which can vary between different organisms and is also dependent on the sequence quality. If users supply annotated sequence data by uploading GenBank-formatted or FASTA+GFF3 files, antiSMASH uses these gene coordinates. If an annotated and high-quality genome sequence of an organism of interest is available, it is therefore advisable to use the preannotated data. Defining the extent of a secondary metabolite BGC Predicting the boundaries of a BGC solely based on genomic data still remains challenging. For fungal BGCs, conserved binding sites of cluster-specific transcriptional regulators are good indicators to use in defining which genes are co-regulated. If the same regulator binding site is present near the core-genes of a cluster, they probably belong to the same biosynthetic pathway. This approach is used in the CASSIS tool [10], which was recently integrated into version 4 of antiSMASH [6]. In addition, fungal transcriptomics data can also be used to efficiently define the cluster boundaries [60], as implemented in the FunGeneClusterS application [13]. For bacterial sequences, such automated or semi-automated methods are unfortunately not (yet) well established. The presence or absence of BGCs is often strain specific [61, 62]. Comparing genomes between closely related species to identify which genes are highly conserved between these species and which are unique to the strain of interest can indicate the extent of BGCs. In antiSMASH, we have therefore chosen an ‘inclusive’ approach. Genes that are encoded within an empirically defined distance from conserved core genes of a BGC are displayed as a cluster. The distances were selected in a way that we would rather overpredict the distance, i.e. 
include genes in the gene cluster annotation that may belong to the gene cluster border region, than exclude genes that are part of the BGCs but are encoded outside this range from the core biosynthetic genes. Strategies to connect gene clusters to molecules In the end, most users turn to antiSMASH or related tools to accomplish one of two goals: (1) to identify potentially new molecules that could be synthesized by the organism of study based on its genome, or (2) to identify genes involved in the biosynthesis of an already observed molecule. Specific strategies are available for each of these scenarios. When trying to find out what kind of specialized metabolites an organism can produce based on its genome, the starting point is to go over each gene cluster in the genome in detail. First, comparisons with BGCs from MIBiG (in antiSMASH, this is done using the KnownClusterBlast module) will identify BGCs that are either closely or more distantly related to these reference clusters. To determine whether a BGC is likely to produce the exact same molecule, manual inspection is required. It should be checked that all key biosynthetic genes of the reference cluster are also found in the BGC of interest by studying the data of the MIBiG entry and related literature. If so, are any additional enzymes encoded in the BGC of interest that could encode chemical modifications not observed for the known molecule? If the BGC encodes PKSs or NRPSs, do the domain architectures and their corresponding predicted substrate specificities match to those of the known cluster? 
The answers to these questions will determine whether the BGC of interest is likely to encode the biosynthesis of: (a) the same molecule (all relevant genes ‘shared’ with high percent identity, and perfect alignment of chemistry predictions with the structure of the known molecule); (b) a potentially new variant of a known molecule (some enzyme-coding genes are cluster-specific, and/or some substrate specificities are different); (c) a new molecule within a known class of molecules (only a minority or small majority of the genes ‘shared’); or (d) an altogether unknown molecule (no significant similarities). Before it can be concluded that a molecule is unknown, it should be taken into account that some known natural products lack a described BGC; hence, some novel-looking BGCs may still encode the production of molecules for which the chemistry has been long known. For polyketides and nonribosomal peptides, these cases can be assessed with a retro-biosynthetic approach using tools like Smiles2Monomers [27] or GRAPE [15]. These tools predict the potential monomers of a given compound structure, for example derived from a compound database. In a second step, these compounds can be connected to BGCs by mapping the monomer predictions derived from the chemical structure to the monomer predictions derived from the analysis of BGCs. The latter predictions can be made using the antiSMASH database or tools like GARLIC [15]. For nonribosomal peptides, another option is to check for compounds with similar monomers in the NORINE database [19, 20]. antiSMASH provides the appropriate search links from the ‘detailed annotations’ sidebar. If no cluster-wide similarity is observed, it is in any case still a good idea to look for similarities to known clusters at a smaller scale: either per gene or per subcluster. antiSMASH offers functionalities to identify such similarities, using the SubClusterBlast feature and the gene-specific BLAST search of MIBiG [17]. 
This makes it possible to predict the presence of specific chemical moieties or chemical modifications to the molecule, which helps to prioritize the targets or to connect the gene cluster to a molecule observed in metabolomic data. Finally, looking for functional markers can greatly help in prioritizing BGCs, e.g. when the aim of the project is antibiotic discovery, one can look for both general and specific types of antibiotic resistance genes that are often encoded inside a BGC to provide natural self-resistance to the producer [8, 16]. Sometimes, the structure of a molecule has already been elucidated before a genome is sequenced or studied. In such a case, the aim of using antiSMASH or related tools is usually to identify the biosynthetic mechanism of the molecule of interest. If, chemically, the molecule is closely related to other known natural products for which the biosynthesis is known, one would usually be able to find either a single BGC or only a few BGCs with high similarity to the corresponding MIBiG reference cluster. However, this is often not the case. Then, the best strategy is to use ‘exclusion logic’ and step-by-step exclude BGCs that are unlikely to be involved in the biosynthesis of the molecule, thus gradually narrowing down the options to only one or a few gene clusters. First, one would ask: What is the chemical class of the molecule, and, accordingly, what is its expected biosynthetic class? For some chemical classes, there can be multiple biosynthetic options, e.g. peptides can be made in either a ribosomal or nonribosomal fashion. Second, one would ask: What can we specifically predict about the biosynthetic pathway? If it concerns a potential nonribosomal peptide or polyketide, knowledge of the structure would allow predicting the number of modules expected in corresponding NRPSs or PKSs, as well as their substrate specificities. Third, is there specific chemistry seen in the molecule for which enzymatic mechanisms are known? 
If, for example, a peptide is acylated, one could expect the presence of either a CoA-ligase or a Condensation-starter domain in the BGC. Fourth, are any other organisms known to produce this molecule? If so, one could see which BGCs have homologous clusters in each of these known producers. When dealing with larger numbers of genomes, the abovementioned strategies may no longer be feasible. In this case, a targeted search could be done using software like clusterTools [63] or MultiGeneBlast [64] among the entire set of BGCs identified in all genomes. For example, if the presence of a certain (combination of) specific gene(s) is either desired (in case of hunting for new molecules) or expected (in case of trying to connect a known molecule to its BGC), a specific query can be built to search for this. Perspectives With the recent progress in sequencing technologies and the availability of easy-to-use software programs, genome mining for BGCs and evaluating the genetic potential of secondary metabolite producing organisms have matured into an important technology. It complements the classical organic chemistry-centered approach to find, dereplicate and characterize novel bioactive secondary metabolites, and contributes toward the current paradigm-shift that brings natural products once more into focus for future drug discovery [36]. In addition, it also can be used as an effective method to evaluate the safety of biotechnological production organisms, which are used directly in food production or for the production of enzymes or other biochemicals. In this case, genome mining data can be used to demonstrate that a production strain does not contain BGCs coding for the biosynthesis of known hazardous chemicals. 
Increasingly available high-quality genome data, in combination with databases of BGCs of known function, such as sequence data from the MIBiG repository [17], can also be used for dereplication of known or closely related compounds and the identification of unexplored or underexplored gene cluster families. So far, several studies [35, 49, 65–67] have successfully used such approaches to identify novel natural products. In connection with large-scale metabolomics approaches (in which gene cluster data are automatically correlated with information on known or unknown compounds identified by mass spectrometry [14, 15, 67, 68]), these high-quality data now allow for new high-throughput methods to identify novel compounds. Many of the current limitations of automated genome mining approaches are being actively addressed by the international natural product community. The EvoMining strategy has been successfully used [50] to identify new BGCs coding for previously unknown compounds and enzymes. Another promising approach to better predict BGC boundaries is based on comparative genomics by detecting ‘breaks’ in the conserved synteny of related strains; as such breaks are often caused by the insertion and/or horizontal acquisition of BGCs, this approach allows the identification of potential secondary metabolite biosynthetic pathways without relying on previous knowledge of the enzymes involved (SYNTERUPTOR, S. Lautru and J. L. Pernodet; personal communication). Thousands of BGCs already have been identified and the number is still steadily increasing. Tools like CORASON (F. Barona-Gómez, personal communication; https://github.com/nselem/EvoDivMet; as used in [69, 70]), clusterTools [63] and MultiGeneBlast [64] can be used to identify clusters, which share varying degrees of similarity with known BGCs. 
Large-scale clustering of these BGCs is emerging as an important method to compare, classify into gene cluster families, dereplicate and identify novel or—depending on the aim of the study—related BGCs [49, 66, 67]. Novel software packages like BIG-SCAPE (Medema, personal communication; https://git.wageningenur.nl/medema-group/BiG-SCAPE) will help scientists to perform such analyses. Of course, the widespread use of genome mining approaches also raises new challenges. One major bottleneck in such approaches is the frequent observation that the BGCs remain unexpressed (i.e. ‘silent’) in the producer strains under normal laboratory fermentation conditions; in such cases, the compounds cannot be detected or isolated despite the genome containing all the genes required for the biosynthesis. Thus, strategies have to be developed and improved to trigger the expression of such silent BGCs [71, 72]. One important step forward in this regard has been the development of CRISPR-based genome editing tools for important groups of bacterial and fungal secondary metabolite producers [11, 73–75] that can be used to insert promoters to activate the silent BGCs [76] or to ‘repair’ biosynthetic genes [77]. Successful expression of the BGC and isolation of a novel compound should be followed by metabolomics analysis and metabolic engineering that are interconnected with each other. Metabolomics helps with identifying secondary metabolite precursors, and hence provides clues on the use of metabolic pathways. This information in turn facilitates metabolic engineering of the host strain that considers quantitatively optimal production of a target secondary metabolite [78]. Key Points Despite the huge chemical diversity of bioactive secondary metabolites, the enzymes involved in their biosynthesis are often strikingly conserved. The sequence conservation of these enzymes can be exploited by genome mining approaches to identify secondary metabolite BGCs in genome data. 
Genome mining is a powerful method to access the genetic potential of secondary metabolite producers. User-friendly pipelines (e.g. antiSMASH) are available to assist scientists in genome mining. There are caveats that should be considered when designing and interpreting genome mining studies. Acknowledgements The authors would like to thank Simon Shaw for the helpful discussions. Funding The work of T. W., K. B. and H. U. K. is supported by grants of the Novo Nordisk Foundation (CFB and grant number NNF16OC0021746); the Technology Development Program to Solve Climate Change on Systems Metabolic Engineering for Biorefineries from the Ministry of Science and ICT through the National Research Foundation (NRF) of Korea (grant numbers NRF-2012M1A2A2026556 and NRF-2012M1A2A2026557 to H. U. K.); and Veni grant (grant number 863.15.002 to M. H. M.) from The Netherlands Organization for Scientific Research (NWO). Kai Blin is a Postdoctoral Fellow at the Novo Nordisk Foundation Center for Biosustainability of the Technical University of Denmark. He is developing computational biology tools around microbial genome mining for natural products and connected -omics approaches. Hyun Uk Kim is a Research Fellow at KAIST, South Korea, and a visiting Senior Researcher at the Novo Nordisk Foundation Center for Biosustainability, DTU. His research field lies in systems biology, biochemical and metabolic engineering and drug targeting and discovery. Marnix H. Medema is an Assistant Professor in the Bioinformatics Group at Wageningen University. His research group develops and applies computational methodologies to identify and analyze biosynthetic pathways and gene clusters. Tilmann Weber is a Co-Principal Investigator at the Novo Nordisk Foundation Center for Biosustainability of the Technical University of Denmark. 
He is interested in integrating bioinformatics, genome mining and systems biology approaches into Natural Products discovery and characterization and thus bridging the in silico and in vivo world.
References
1 Newman DJ, Cragg GM. Natural products as sources of new drugs over the 30 years from 1981 to 2010. J Nat Prod 2012;75(3):311–35. http://dx.doi.org/10.1021/np200906s
2 Nützmann HW, Huang A, Osbourn A. Plant metabolic clusters – from genetics to genomics. New Phytol 2016;211(3):771–89.
3 Medema MH, Osbourn A. Computational genomic identification and functional reconstitution of plant natural product biosynthetic pathways. Nat Prod Rep 2016;33:951–62. http://dx.doi.org/10.1039/C6NP00035E
4 Zazopoulos E, Huang K, Staffa A, et al. A genomics-guided approach for discovering and expressing cryptic metabolic pathways. Nat Biotechnol 2003;21:187–90.
5 Yadav G, Gokhale RS, Mohanty D. SEARCHPKS: a program for detection and analysis of polyketide synthase domains. Nucleic Acids Res 2003;31(13):3654–8. http://dx.doi.org/10.1093/nar/gkg607
6 Blin K, Wolf T, Chevrette MG, et al. antiSMASH 4.0-improvements in chemistry prediction and gene cluster boundary identification. Nucleic Acids Res 2017;45(W1):W36–41.
7 Blin K, Medema MH, Kottmann R, et al. The antiSMASH database, a comprehensive database of microbial secondary metabolite biosynthetic gene clusters. Nucleic Acids Res 2017;45:D555–9.
8 Alanjary M, Kronmiller B, Adamek M, et al.
The Antibiotic Resistant Target Seeker (ARTS), an exploration engine for antibiotic cluster prioritization and novel drug target discovery. Nucleic Acids Res 2017;45(W1):W42–8.
9 van Heel AJ, de Jong A, Montalbán-López M, et al. BAGEL3: automated identification of genes encoding bacteriocins and (non-)bactericidal posttranslationally modified peptides. Nucleic Acids Res 2013;41:W448–53.
10 Wolf T, Shelest V, Nath N, et al. CASSIS and SMIPS: promoter-based prediction of secondary metabolite gene clusters in eukaryotic genomes. Bioinformatics 2016;32(8):1138–43. http://dx.doi.org/10.1093/bioinformatics/btv713
11 Blin K, Pedersen LE, Weber T, et al. CRISPy-web: an online resource to design sgRNAs for CRISPR applications. Synth Syst Biotechnol 2016;1(2):118–21. http://dx.doi.org/10.1016/j.synbio.2016.01.003
12 Reddy BVB, Milshteyn A, Charlop-Powers Z, et al. eSNaPD: a versatile, web-based bioinformatics platform for surveying and mining natural product biosynthetic diversity from metagenomes. Chem Biol 2014;21:1023–33. http://dx.doi.org/10.1016/j.chembiol.2014.06.007
13 Vesth TC, Brandl J, Andersen MR. FunGeneClusterS: predicting fungal gene clusters from genome and transcriptome data. Synth Syst Biotechnol 2016;1(2):122–9. http://dx.doi.org/10.1016/j.synbio.2016.01.002
14 Johnston CW, Skinnider MA, Wyatt MA, et al. An automated Genomes-to-Natural Products platform (GNP) for the discovery of modular natural products. Nat Commun 2015;6:8421. http://dx.doi.org/10.1038/ncomms9421
15 Dejong CA, Chen GM, Li H, et al.
Polyketide and nonribosomal peptide retro-biosynthesis and global gene cluster matching. Nat Chem Biol 2016;12:1007–14. http://dx.doi.org/10.1038/nchembio.2188
16 Johnston CW, Skinnider MA, Dejong CA, et al. Assembly and clustering of natural antibiotics guides target identification. Nat Chem Biol 2016;12:233–9. http://dx.doi.org/10.1038/nchembio.2018
17 Medema MH, Kottmann R, Yilmaz P, et al. Minimum information about a biosynthetic gene cluster. Nat Chem Biol 2015;11(9):625–31. http://dx.doi.org/10.1038/nchembio.1890
18 Ziemert N, Podell S, Penn K, et al. The natural product domain seeker NaPDoS: a phylogeny based bioinformatic tool to classify secondary metabolite gene diversity. PLoS One 2012;7(3):e34064.
19 Pupin M, Esmaeel Q, Flissi A, et al. Norine: a powerful resource for novel nonribosomal peptide discovery. Synth Syst Biotechnol 2016;1(2):89–94. http://dx.doi.org/10.1016/j.synbio.2015.11.001
20 Caboche S, Pupin M, Leclère V, et al. NORINE: a database of nonribosomal peptides. Nucleic Acids Res 2008;36:D326–31.
21 Li MH, Ung PM, Zajkowski J, et al. Automated genome mining for natural products. BMC Bioinformatics 2009;10:185. http://dx.doi.org/10.1186/1471-2105-10-185
22 Röttig M, Medema MH, Blin K, et al. NRPSpredictor2–a web server for predicting NRPS adenylation domain specificity. Nucleic Acids Res 2011;39:W362–7.
23 Kautsar SA, Suarez Duran HG, Blin K, et al. plantiSMASH: automated identification, annotation and expression analysis of plant biosynthetic gene clusters.
Nucleic Acids Res 2017;45:W55–63.
24 Skinnider MA, Merwin NJ, Johnston CW, et al. PRISM 3: expanded prediction of natural product chemical structures from microbial genomes. Nucleic Acids Res 2017;45(W1):W49–54.
25 Tietz JI, Schwalen CJ, Patel PS, et al. A new genome-mining tool redefines the lasso peptide biosynthetic landscape. Nat Chem Biol 2017;13(5):470–8. http://dx.doi.org/10.1038/nchembio.2319
26 Khater S, Gupta M, Agrawal P, et al. SBSPKSv2: structure-based sequence analysis of polyketide synthases and non-ribosomal peptide synthetases. Nucleic Acids Res 2017;45(W1):W72–9.
27 Dufresne Y, Noé L, Leclère V, et al. Smiles2Monomers: a link between chemical and biological structures for polymers. J Cheminform 2015;7:62.
28 Khaldi N, Seifuddin FT, Turner G, et al. SMURF: genomic mapping of fungal secondary metabolite clusters. Fungal Genet Biol 2010;47:736–41. http://dx.doi.org/10.1016/j.fgb.2010.06.003
29 Weber T, Rausch C, Lopez P, et al. CLUSEAN: a computer-based framework for the automated analysis of bacterial secondary metabolite biosynthetic gene clusters. J Biotechnol 2009;140:13–17. http://dx.doi.org/10.1016/j.jbiotec.2009.01.007
30 Medema MH, Blin K, Cimermancic P, et al. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res 2011;39:W339–46.
31 Blin K, Medema MH, Kazempour D, et al.
antiSMASH 2.0–a versatile platform for genome mining of secondary metabolite producers. Nucleic Acids Res 2013;41:W204–12.
32 Weber T, Blin K, Duddela S, et al. antiSMASH 3.0-a comprehensive resource for the genome mining of biosynthetic gene clusters. Nucleic Acids Res 2015;43:W237–43.
33 Blin K, Kazempour D, Wohlleben W, et al. Improved lanthipeptide detection and prediction for antiSMASH. PLoS One 2014;9(2):e89420.
34 Skinnider MA, Dejong CA, Rees PN, et al. Genomes to natural products PRediction Informatics for Secondary Metabolomes (PRISM). Nucleic Acids Res 2015;43:9645–62.
35 Skinnider MA, Johnston CW, Edgar RE, et al. Genomic charting of ribosomally synthesized natural product chemical space facilitates targeted mining. Proc Natl Acad Sci USA 2016;113(42):E6343–51.
36 Ziemert N, Alanjary M, Weber T. The evolution of genome mining in microbes – a review. Nat Prod Rep 2016;33(8):988–1005.
37 Fedorova ND, Moktali V, Medema MH. Bioinformatics approaches and software for detection of secondary metabolic gene clusters. Methods Mol Biol 2012;944:23–45.
38 Leclère V, Weber T, Jacques P, et al. Bioinformatics tools for the discovery of new nonribosomal peptides. Methods Mol Biol 2016;1401:209–32.
39 Adamek M, Spohn M, Stegmann E, et al. Mining bacterial genomes for secondary metabolite gene clusters. Methods Mol Biol 2017;1520:23–47.
40 Weber T. In silico tools for the analysis of antibiotic biosynthetic pathways.
Int J Med Microbiol 2014;304(3–4):230–5. http://dx.doi.org/10.1016/j.ijmm.2014.02.001
41 Weber T, Kim HU. The secondary metabolite bioinformatics portal: computational tools to facilitate synthetic biology of secondary metabolite production. Synth Syst Biotechnol 2016;1(2):69–79. http://dx.doi.org/10.1016/j.synbio.2015.12.002
42 Medema MH, Fischbach MA. Computational approaches to natural product discovery. Nat Chem Biol 2015;11(9):639–48. http://dx.doi.org/10.1038/nchembio.1884
43 Chavali AK, Rhee SY. Bioinformatics tools for the identification of gene clusters that biosynthesize specialized metabolites. Brief Bioinform 2017. (Epub ahead of print). doi: 10.1093/bib/bbx020.
44 Hyatt D, Chen GL, Locascio PF, et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 2010;11:119. http://dx.doi.org/10.1186/1471-2105-11-119
45 Majoros WH, Pertea M, Salzberg SL. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 2004;20(16):2878–9. http://dx.doi.org/10.1093/bioinformatics/bth315
46 Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol 2011;7(10):e1002195.
47 Finn RD, Coggill P, Eberhardt RY, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 2016;44(D1):D279–85.
48 Haft DH, Selengut JD, Richter RA, et al. TIGRFAMs and genome properties in 2013. Nucleic Acids Res 2013;41:D387–95.
49 Cimermancic P, Medema MH, Claesen J, et al. Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene clusters. Cell 2014;158(2):412–21. http://dx.doi.org/10.1016/j.cell.2014.06.034
50 Cruz-Morales P, Kopp JF, Martínez-Guerrero C, et al. Phylogenomic analysis of natural products biosynthetic gene clusters allows discovery of arseno-organic metabolites in model streptomycetes. Genome Biol Evol 2016;8:1906–16.
51 Chevrette MG, Aicheler F, Kohlbacher O, et al. SANDPUMA: ensemble predictions of nonribosomal peptide chemistry reveals biosynthetic diversity across actinobacteria. Bioinformatics 2017;33(20):3202–10. http://dx.doi.org/10.1093/bioinformatics/btx400
52 Price MN, Dehal PS, Arkin AP. FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS One 2010;5(3):e9490.
53 Dickschat JS. Bacterial terpene cyclases. Nat Prod Rep 2016;33(1):87–110. http://dx.doi.org/10.1039/C5NP00102A
54 Klassen JL, Currie CR. Gene fragmentation in bacterial draft genomes: extent, consequences and mitigation. BMC Genomics 2012;13:14. http://dx.doi.org/10.1186/1471-2164-13-14
55 Cibrián-Jaramillo A, Barona-Gómez F. Increasing metagenomic resolution of microbiome interactions through functional phylogenomics and bacterial sub-communities. Front Genet 2016;7:4.
56 Larkin MA, Blackshields G, Brown NP, et al. Clustal W and Clustal X version 2.0. Bioinformatics 2007;23(21):2947–8.
57 Edgar RC.
MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004;32(5):1792–7. http://dx.doi.org/10.1093/nar/gkh340
58 Tatusova T, DiCuccio M, Badretdin A, et al. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res 2016;44(14):6614–24. http://dx.doi.org/10.1093/nar/gkw569
59 Tatusova T, Ciufo S, Fedorov B, et al. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res 2014;42:D553–9.
60 Andersen MR, Nielsen JB, Klitgaard A, et al. Accurate prediction of secondary metabolite gene clusters in filamentous fungi. Proc Natl Acad Sci USA 2013;110(1):E99–107.
61 Letzel AC, Li J, Amos GCA, et al. Genomic insights into specialized metabolism in the marine actinomycete Salinispora. Environ Microbiol 2017;19:3660–73. http://dx.doi.org/10.1111/1462-2920.13867
62 Cruz-Morales P, Vijgenboom E, Iruegas-Bocardo F, et al. The genome sequence of Streptomyces lividans 66 reveals a novel tRNA-dependent peptide biosynthetic system within a metal-related genomic island. Genome Biol Evol 2013;5:1165–75.
63 de los Santos ELC, Challis GL. clusterTools: proximity searches for functional elements to identify putative biosynthetic gene clusters. bioRxiv 2017. (Epub ahead of print). doi: 10.1101/119214.
64 Medema MH, Takano E, Breitling R. Detecting sequence homology at the gene cluster level with MultiGeneBlast. Mol Biol Evol 2013;30(5):1218–23. http://dx.doi.org/10.1093/molbev/mst025
65 Donia MS, Cimermancic P, Schulze CJ, et al.
A systematic analysis of biosynthetic gene clusters in the human microbiome reveals a common family of antibiotics. Cell 2014;158(6):1402–14. http://dx.doi.org/10.1016/j.cell.2014.08.032
66 Zhang Q, Doroghazi JR, Zhao X, et al. Expanded natural product diversity revealed by analysis of lanthipeptide-like gene clusters in actinobacteria. Appl Environ Microbiol 2015;81:4339–50. http://dx.doi.org/10.1128/AEM.00635-15
67 Doroghazi JR, Albright JC, Goering AW, et al. A roadmap for natural product discovery based on large-scale genomics and metabolomics. Nat Chem Biol 2014;10(11):963–8. http://dx.doi.org/10.1038/nchembio.1659
68 Maansson M, Vynne NG, Klitgaard A, et al. An integrated metabolomic and genomic mining workflow to uncover the biosynthetic potential of bacteria. mSystems 2016;1(3):e00028-15. doi: 10.1128/mSystems.00028-15.
69 Cruz-Morales P, Ramos-Aboites HE, Licona-Cassani C, et al. Actinobacteria phylogenomics, selective isolation from an iron oligotrophic environment and siderophore functional characterization, unveil new desferrioxamine traits. FEMS Microbiol Ecol 2017;93(9). doi: 10.1093/femsec/fix086.
70 Gutiérrez-García K, Neira-González A, Pérez-Gutiérrez RM, et al. Phylogenomics of 2,4-Diacetylphloroglucinol-producing pseudomonas and novel antiglycation endophytes from Piper auritum. J Nat Prod 2017;80:1955–63.
71 Rutledge PJ, Challis GL. Discovery of microbial natural products by activation of silent biosynthetic gene clusters. Nat Rev Microbiol 2015;13(8):509–23. http://dx.doi.org/10.1038/nrmicro3496
72 Ren H, Wang B, Zhao H.
Breaking the silence: new strategies for discovering novel natural products. Curr Opin Biotechnol 2017;48:21–7. http://dx.doi.org/10.1016/j.copbio.2017.02.008
73 Cobb RE, Wang Y, Zhao H. High-efficiency multiplex genome editing of Streptomyces species using an engineered CRISPR/Cas system. ACS Synth Biol 2015;4:723–8. http://dx.doi.org/10.1021/sb500351f
74 Tong Y, Charusanti P, Zhang L, et al. CRISPR-Cas9 based engineering of actinomycetal genomes. ACS Synth Biol 2015;4(9):1020–9. http://dx.doi.org/10.1021/acssynbio.5b00038
75 Nødvig CS, Nielsen JB, Kogle ME, et al. A CRISPR-Cas9 system for genetic engineering of filamentous fungi. PLoS One 2015;10(7):e0133085.
76 Zhang MM, Wong FT, Wang Y, et al. CRISPR-Cas9 strategy for activation of silent Streptomyces biosynthetic gene clusters. Nat Chem Biol 2017;13:607–9. http://dx.doi.org/10.1038/nchembio.2341
77 Weber J, Valiante V, Nødvig CS, et al. Functional reconstitution of a fungal natural product gene cluster by advanced genome editing. ACS Synth Biol 2017;6:62–8.
78 Kim HU, Charusanti P, Lee SY, et al. Metabolic engineering with systems biology tools to optimize production of prokaryotic secondary metabolites. Nat Prod Rep 2016;33:933–41. http://dx.doi.org/10.1039/C6NP00019C
© The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
journal article
LitStream Collection
Recent development of Ori-Finder system and DoriC database for microbial replication origins

Luo, Hao; Quan, Chun-Lan; Peng, Chong; Gao, Feng

2019 Briefings in Bioinformatics

doi: 10.1093/bib/bbx174 pmid: 29329409

Abstract
DNA replication begins at replication origins in all three domains of life. Identifying and characterizing replication origins is important both for understanding their structure and function and for understanding the regulatory mechanisms of the initiation step of DNA replication. The Z-curve method has been used successfully to identify replication origins in archaeal genomes since 2002. Furthermore, the Web servers Ori-Finder and Ori-Finder 2 have been developed to predict replication origins in bacterial and archaeal genomes, respectively, based on the Z-curve method, and manually curated replication origins have been collected in an online database, DoriC. The Ori-Finder system and the DoriC database are currently used in research on DNA replication origins in prokaryotes, including (i) identification of oriC regions in bacterial and archaeal genomes; (ii) discovery and analysis of conserved sequences within oriC regions; and (iii) strand-bias analysis of bacterial genomes. To date, a growing number of Ori-Finder predictions have been supported by subsequent experiments, and the Ori-Finder system has been used to identify the replication origins of >100 newly sequenced prokaryotes in their genome reports. In addition, the data in the DoriC database have been widely used in large-scale analyses of replication origins and strand bias in prokaryotic genomes. Here, we review the development of the Ori-Finder system and the DoriC database as well as their applications. Some future directions for extending the application of Ori-Finder and DoriC are also presented.
Keywords: replication origin, DNA replication, Z-curve, prokaryotic genome
Introduction
DNA replication is one of the basic processes in all three domains of cellular life. The duplication of the genetic information in a cell begins at specific sites on the chromosomes, termed DNA replication origins.
The replication origin regions play significant roles in DNA replication and serve as recognition sites for initiator proteins and as assembly sites for replication forks [1]. The characteristics of replication origin regions vary between prokaryotes and eukaryotes, and their nucleotide sequences also differ among organisms [2, 3]. In most bacteria, two replication forks assemble at the chromosomal replication origin (oriC) and move in opposite directions, leading to bidirectional growth of both daughter strands. The oriC regions contain several DnaA box motifs, which are highly conserved 9-bp consensus sequences. DnaA boxes are the recognition sites for the DnaA protein, which is essential for the initiation of chromosome replication. Moreover, the oriC regions are frequently located next to replication-related genes [4, 5]. Detailed analyses have shown that the consensus sequences of DnaA boxes and the distribution of genes adjacent to oriCs are highly conserved in different phyla [6]. In addition, because the nucleotide composition of prokaryotic chromosomes is asymmetric, replication origins have been identified near the points where the GC or AT skew changes sign [7]. In eukaryotes, because of the large size of the genomes, the chromosomes use multiple dispersed replication origins to initiate DNA replication, ranging from hundreds in yeast to tens of thousands in humans [8–10]. Minichromosome maintenance (MCM) complexes are first loaded at replication origins in the G1 phase of the cell cycle. Then, the origin-bound MCM complexes unwind the double-stranded DNA at the origins, recruit DNA polymerases and initiate DNA synthesis in S phase [11]. The replication origins in the unicellular eukaryotes Saccharomyces cerevisiae and Schizosaccharomyces pombe have been well characterized by microarray and deep sequencing techniques [12, 13].
However, the selection and regulation of DNA replication origins in higher eukaryotes are more complex and diverse [14–16]. As a separate domain in the three-domain system, archaea share features with both bacteria and eukaryotes. The oriC regions in archaea are also located adjacent to replication-related genes, and origin recognition boxes (ORBs) are distributed within the oriC sequences [17]. In some archaea, a single chromosome can use more than one oriC to initiate DNA replication, as in eukaryotes. Furthermore, the origin-binding proteins in archaea are homologous to the eukaryotic Orc1/Cdc6 proteins [18], and a homologous MCM hexameric ring serving as the replicative helicase is loaded by origin-bound initiator proteins [19]. Since the first complete bacterial genome was sequenced in 1995 [20], the available microbial genomic data have increased exponentially with the rapid development of sequencing techniques. In 2016, the White House launched the National Microbiome Initiative (NMI), which aimed to deepen the understanding of the microbes that live in humans, animals, crops, soils, oceans, etc. [21], and other microbial programs, such as MGP, MetaHIT and HMP, have also been carried out in the past decade [22–24]. These microbial projects have produced a large amount of sequence data, thereby creating an opportunity to explore the molecular mechanisms of initiating cellular DNA replication by in vivo experiments as well as by in silico analysis at the genome level. The development of bioinformatics tools to mine useful biological information from microbial genomes will help bridge the gap between genomic data and knowledge discovery. At the same time, the accumulation of genomic data has created great challenges and opportunities for the identification and characterization of replication origins on a large scale.
Identification of the oriC regions will not only provide insights into the structure and functions of replication origins but also facilitate studies of the regulatory mechanisms of DNA replication initiation [25]. Our laboratory has developed Web services and a database in this field based on the Z-curve method. The Z-curve is a three-dimensional curve that uniquely represents a DNA sequence, and its three components represent three independent distributions along the sequence: purine/pyrimidine (R/Y) bases, amino/keto (M/K) bases and strong/weak hydrogen-bond (S/W) bases. The Z-curve method can be used to detect the asymmetric nucleotide distribution around replication origins [26]. With this method, the oriC regions in Methanocaldococcus jannaschii, Methanosarcina mazei, Halobacterium sp. strain NRC-1 and Sulfolobus solfataricus P2 were identified successfully, and the predictions are consistent with subsequent experiments [27, 28]. To facilitate the prediction of oriC regions, a Web-based system, Ori-Finder (http://tubic.tju.edu.cn/Ori-Finder/), was developed to find replication origins in bacterial genomes with high accuracy and reliability [29]. To date, a growing number of Ori-Finder predictions have been supported by subsequent experiments, and the Ori-Finder system has been used to identify the replication origins of >100 newly sequenced prokaryotes in their genome reports [30–32]. Furthermore, we designed Ori-Finder 2, a Web-based tool to identify oriC regions in archaeal genomes [33]. The oriC regions identified by in silico analyses, as well as by in vivo and in vitro experiments, have been organized into DoriC (http://tubic.tju.edu.cn/doric/), a database of oriC regions in bacterial and archaeal genomes [34, 35].
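As a minimal sketch of the Z-curve transformation described above, the following Python function accumulates the three components along a sequence using the standard sign convention, in which purines, amino bases and weak hydrogen-bond bases each count as +1. The function name `z_curve` and the choice to let ambiguous bases contribute nothing are illustrative assumptions; this is not the Ori-Finder implementation.

```python
def z_curve(seq):
    """Return the cumulative Z-curve components (xs, ys, zs) of a DNA string.

    x: purine (A, G) minus pyrimidine (C, T) counts   -> R/Y component
    y: amino (A, C) minus keto (G, T) counts          -> M/K component
    z: weak (A, T) minus strong (G, C) H-bond counts  -> W/S component
    """
    # Per-base increments for the three components.
    steps = {
        "A": (1, 1, 1),
        "C": (-1, 1, -1),
        "G": (1, -1, -1),
        "T": (-1, -1, 1),
    }
    x = y = z = 0
    xs, ys, zs = [], [], []
    for base in seq.upper():
        # Assumption: ambiguous symbols such as N leave all components unchanged.
        dx, dy, dz = steps.get(base, (0, 0, 0))
        x, y, z = x + dx, y + dy, z + dz
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs
```

Plotting each component against sequence position for a whole chromosome is one way to look for the turning points that reflect the asymmetric nucleotide distribution around replication origins.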
DoriC has provided insights into the regulatory mechanisms of the initiation step of DNA replication as well as the molecular mechanisms of strand bias in genomes. Rules derived from the database will also help to develop new algorithms for predicting replication origins and to speed up the experimental confirmation and functional analysis of oriCs in bacterial and archaeal genomes. In this review, we briefly introduce the development of the Ori-Finder system and the DoriC database and review some applications of these tools. Future directions for extending the application of Ori-Finder and DoriC are also presented.

The development of Ori-Finder and DoriC

Ori-Finder: an online tool for oriC prediction in bacteria

DNA replication is a precise and complex process in the life of the cell, during which the cell uses a great many enzymes and proteins to synthesize new DNA strands. An ideal algorithm for predicting oriC regions should therefore take into account as many factors related to the DNA replication process as possible. It is well known that replication asymmetry gives rise to compositional deviations between the leading and lagging strands [36]. As the pioneering in silico approach to identifying bacterial oriC regions, GC skew analysis is based mainly on asymmetric nucleotide composition [7]. Later, other skew methods, such as the cumulative GC skew method without sliding windows and the oligomer-skew method, were proposed to predict oriC regions in bacterial and archaeal genomes [37]. Nevertheless, analysis of asymmetric nucleotide composition can predict only the approximate location of a replication origin, not its exact boundaries. Meanwhile, bacterial replication origins are frequently located in intergenic regions adjacent to replication-related genes, such as dnaA and gidA.
Hence, GC skew analysis combined with the location of the dnaA gene and the distribution of DnaA boxes led to more accurate predictions of oriC regions [38, 39]. However, it is inconvenient for biologists who sequence bacterial genomes to take all the possible characteristics of oriC regions into account, such as the effects of ‘species-specific’ DnaA box motifs, and overlooking them has led to incorrect oriC predictions [40, 41]. Pipelines or Web servers that could automatically predict oriC regions in complete bacterial genomes and visualize the related data were therefore greatly needed. We thus developed Ori-Finder to predict oriC regions in complete bacterial genomes; it integrates gene prediction, analysis of base composition asymmetry, the distribution of DnaA boxes, the occurrence of genes frequently close to oriC regions and phylogenetic relationships [29] (Figure 1A). In addition, Ori-Finder can predict oriC regions in some draft genomes that consist only of contigs or scaffolds. Because it integrates the Z-curve method, Ori-Finder can also be used to separate the leading and lagging strands for strand bias analysis of biological characteristics.

Figure 1. The architecture of the Ori-Finder system and the DoriC database. (A) Ori-Finder, Ori-Finder 2 and PubMed are used as the data sources of the DoriC database. (B) A screenshot of the DoriC main page. (C) A representative record in the DoriC database. The left part shows the Z-curves for the genome sequences, and the right part presents some tools used by DoriC, including BLAST, NCBI genome viewer and REPuter. (D) Future perspectives of Ori-Finder and DoriC, including the newfound characteristics as well as the extended oriC prediction and collection.

DoriC: a database of oriCs

DoriC is a database of manually curated oriC regions, which first became publicly available in 2007 (Figure 1B). At that time, complete bacterial genome data were accumulating rapidly because of advances in high-throughput sequencing technology. However, it is impractical to identify all the replication origins in the sequenced genomes by experiment. Furthermore, some large-scale analyses of bacterial genomes, such as those of replication origins and strand bias, were restricted by the absence of oriC data. Before the construction of DoriC, the Z-curve method had been used to predict oriCs in several archaeal genomes, and some of the results were subsequently confirmed by experimental data. To identify oriCs extensively with high accuracy and reliability, our laboratory developed an integrated in silico method to predict the oriC regions of bacterial genomes, and the predicted oriCs, together with those identified by in vivo or in vitro experiments, were manually curated and collected into the DoriC database. The first public release of the database collected only 478 predicted oriCs in 425 bacterial genomes, and 72.2% of the predicted oriCs shared consistent features, including typical base composition asymmetry, DnaA box distribution and indicator gene positions [34].
Furthermore, the DoriC database presents detailed information on oriC regions, including experimental evidence for the replication origins, the number of DnaA boxes, disparity curves and replication-related genes, and supports retrieval of DoriC entries and BLAST searches against oriC regions. The putative dif (deletion-induced filamentation) sequences, which are associated with the DNA replication terminus, are also added to the oriC records. The accumulation of oriC records in the database makes it possible to explore the characteristics of oriC regions. With the increasing availability of completely sequenced prokaryotic genomes and experimental evidence, we presented an updated version of the database, DoriC 5.0, in 2013. Compared with the initial release, the number of oriC regions in bacterial genomes increased from 425 to 1528 in DoriC 5.0, and the database provides more information on the oriC regions, including repeats identified by REPuter and URLs to the NCBI or UCSC genome browsers, which are useful for exploring the characteristics of the oriC regions [42–44]. In addition, 86 oriC regions in 83 archaeal genomes, identified by in vivo experiments as well as in silico analyses, were added to the database in this version. Currently, the DoriC database contains 3423 manually curated oriC records in >2700 complete RefSeq bacterial genomes and 257 oriC regions in over 200 archaeal genomes. Figure 1C displays a representative record in the DoriC database.

Ori-Finder 2: an online tool for oriC prediction in archaea

Archaea are classified as a separate domain in the three-domain system, and some of them live in extreme environments, such as hot springs and salt lakes [45]. Their unusual habitats make strain collection and cell cultivation difficult, which slowed progress in genome sequencing for a long time. The first archaeal oriC was predicted in Halobacterium sp. strain NRC-1 with the GC skew method and then confirmed by cloning into a nonreplicating plasmid [46]. With the Z-curve method, the oriCs in M. jannaschii, M. mazei, Halobacterium sp. strain NRC-1 and S. solfataricus P2 were identified, and some of the predictions were consistent with subsequent experiments. In 2014, Wu et al. [47] also predicted putative multiple orc1/cdc6-associated oriCs in all the available haloarchaeal genomes. In recent years, the development of high-throughput sequencing technology has led to a rapid increase in archaeal genome projects. Therefore, we developed a Web-based tool, Ori-Finder 2, to predict the oriC regions in archaeal genomes automatically, based on the framework of Ori-Finder. The oriCs in archaea differ significantly from those in bacteria. For example, in contrast to the DnaA boxes in bacteria, the ORB sequences are more diverse across species. Using the archaeal oriCs in the DoriC database, the consensus sequences of ORB motifs were calculated with the Multiple EM for Motif Elicitation (MEME) program for different taxa, including Methanobacteriaceae, Methanomicrobia, Methanococcaceae, Sulfolobaceae and Thermococcaceae [48]. In Ori-Finder 2, intergenic sequences that contain putative ORB sequences and are adjacent to replication-related genes are predicted as oriC regions. Because this criterion may miss oriCs adjacent to uncharacterized genes that might be involved in DNA replication, intergenic sequences that contain more than two putative ORB motifs are also predicted as oriCs. To date, Ori-Finder 2 has been used to identify the oriCs in Pyrococcus chitonophagus DSM 10152, Thermococcus sp. strain 2319x1, Haloarcula hispanica pleomorphic virus 3, Natrinema sp. J7 and Methanobrevibacter ruminantium M1 [49–53]. However, Ori-Finder 2 cannot yet find all the potential origins of replication in genomes with multiple oriCs.
With the increase in experimentally confirmed archaeal oriCs, it will become more accurate and sensitive through continuous improvement.

Prediction of replication origins by Ori-Finder system

Ori-Finder and Ori-Finder 2 have a friendly and intuitive input interface and use an integrated method to predict replication origins in prokaryotic genomes; they are available at http://tubic.tju.edu.cn/Ori-Finder/ and http://tubic.tju.edu.cn/Ori-Finder2/, respectively. Figure 1A presents the submission Web pages of the Web servers. Both Web servers integrate gene-finding pipelines, ZCURVE1.02 or Glimmer3, to perform gene prediction in unannotated genomes [54, 55]. Users can submit an annotated genome sequence by uploading the sequence file in FASTA format together with its protein table (PTT) file, and Ori-Finder 2 also accepts annotated genome files in GenBank format. The BLAST program is installed for the functional annotation of genes by searching against indicator genes (such as dnaA, dnaN, hemE and gidA in bacteria, or cdc6, orc1 and Mc-pRIP in archaea) throughout the genome. Both Web servers enable users to select or type the motif sequences of the DnaA boxes or ORBs used to predict replication origins. The DnaA boxes and ORBs are ‘species-specific’ conserved sequences within the oriC regions that serve as recognition sites for the DnaA proteins and Cdc6/Orc1 proteins, respectively. Ori-Finder provides 16 different types of DnaA box and also allows users to define their own DnaA boxes. In Ori-Finder 2, by contrast, FIMO (Find Individual Motif Occurrences) is used to obtain ORB sequences with the position-specific probability matrices (PSPMs) of five taxonomic clusters, and WebLogos are provided to facilitate the selection of ORBs [56]. Finally, the intergenic regions with the required characteristics are predicted as oriC regions. Note that, because Ori-Finder 2 uses text search in the annotated genome file, some replication-related genes with unclear annotation might be missed.
We therefore recommend that users compare the result with that based on the unannotated sequence. Figure 2 displays the workflow of both Ori-Finder and Ori-Finder 2.

Figure 2. The workflow of the Ori-Finder system for bacterial and archaeal genomes.

On the result Web page, information including the genome size, GC content, the locations of the indicator genes and predicted oriC regions, as well as the Z-curve (AT, GC, RY and MK disparity curves) of the input sequence, is presented as an HTML table. In addition, the locations of DnaA boxes or ORB sequences within the oriC regions and their distribution across the whole genome are available for download from the provided URL. For Ori-Finder 2, the repeats identified by REPuter and the homologs in DoriC are also displayed in the result table. The two Web servers share a common procedure for predicting oriC regions (Figure 2). However, whereas Ori-Finder can identify oriC regions in most bacterial chromosomes, the sensitivity and precision of the predictions by Ori-Finder 2 are only 66.7 and 62.1%, respectively [33]. There are three main reasons for this marked difference in performance. First, it is well known that most oriC regions in prokaryotic genomes are adjacent to replication-related genes. In most bacteria, the dnaA gene plays a key role in the initiation of DNA replication. In addition, other replication-related genes, such as dnaN, hemE and gidA, are also used as indicators in oriC prediction. In archaea, however, besides the orc1/cdc6 gene, there are still many unknown genes involved in DNA replication. This is why intergenic sequences with more than two putative ORBs are also predicted as oriCs by Ori-Finder 2.
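The two-part selection rule described for Ori-Finder 2 can be written down as a small predicate. The sketch below is only an illustration of the rule as stated in the text, not the server's implementation: the IntergenicRegion record, the is_candidate_oric function and the reading of "with the putative ORB sequences" as "at least one ORB" are assumptions made here.

```python
from dataclasses import dataclass

# Archaeal indicator genes named in the text; matching by exact gene name
# is a simplification assumed for this sketch.
ARCHAEAL_INDICATORS = {"cdc6", "orc1", "Mc-pRIP"}


@dataclass
class IntergenicRegion:
    """Hypothetical record for one intergenic region."""
    start: int
    end: int
    orb_count: int          # number of putative ORB motifs found in the region
    flanking_genes: tuple   # names of the genes on either side


def is_candidate_oric(region, indicators=ARCHAEAL_INDICATORS):
    """Apply the two criteria described in the text.

    1. At least one putative ORB and a flanking replication-related gene.
    2. Otherwise, more than two putative ORBs regardless of the neighbours.
    """
    near_indicator = any(gene in indicators for gene in region.flanking_genes)
    if region.orb_count >= 1 and near_indicator:
        return True
    return region.orb_count > 2
```

The second branch is what allows a region next to uncharacterized genes to be reported, at the cost of more false positives, which may partly explain the lower precision quoted above.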
Second, the binding sites of replication proteins, such as DnaA boxes and ORBs, are essential for the initiation of chromosome replication. Although the DnaA boxes show some differences throughout the bacterial kingdom, their sequences are considerably more conserved than the ORB sequences in archaea. Most DnaA boxes are 9 bp sequences derived from the Escherichia coli perfect DnaA box ‘TTATCCACA’ with one or more mismatches. In archaea, the consensus sequence ‘TCCA—GAAAC’ was found by scanning the DoriC database with MEME, and a ‘G-string’ (GGGGT) is clearly observed at the end of the ORB motifs in Methanomicrobia and Sulfolobaceae. Furthermore, some other conserved motifs are found in Sulfolobaceae and Thermococcaceae. Nevertheless, these motifs are more degenerate than DnaA boxes. Finally, the majority of bacteria use one oriC to start DNA replication, whereas some archaea can use multiple oriCs. Moreover, the typical bacterial oriC is located next to the extreme of the GC disparity curve, which clearly has a ‘V’ shape. The GC disparity curves in archaea, however, are more irregular, so the archaeal oriCs are not always near the extremes of the GC disparity curves. Despite these difficulties in predicting archaeal oriCs, Ori-Finder 2 will be improved to be more accurate as the experimental oriC data increase. In addition to the Ori-Finder system, Oriloc is an alternative tool for predicting oriC regions in bacterial chromosomes, which is based mainly on the GC skew method [57]. The shift points of the relative GC skew, defined as (G−C)/(G + C), have been used to identify replication origins and termini. To improve the accuracy of prediction, the cumulative GC skew, calculated as (G−C), was used to eliminate the effect of the window size.
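The two skew definitions given above can be compared directly in code. The following is a minimal sketch that assumes a plain DNA string as input; the function names are illustrative and are not taken from Oriloc or Ori-Finder.

```python
def windowed_gc_skew(seq, window):
    """Relative GC skew (G - C) / (G + C) in consecutive non-overlapping windows."""
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        g, c = chunk.count("G"), chunk.count("C")
        # A window without G or C has no defined skew; 0.0 is used here.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative_gc_skew(seq):
    """Running total of G minus C along the sequence (no window needed)."""
    total = 0
    curve = []
    for base in seq.upper():
        if base == "G":
            total += 1
        elif base == "C":
            total -= 1
        curve.append(total)
    return curve


def index_of_minimum(values):
    """0-based position of the first minimum of a curve."""
    return min(range(len(values)), key=values.__getitem__)
```

In the windowed form the result depends on the chosen window size, whereas the cumulative curve has a single extreme that can be read off directly, which is the point the text makes about eliminating the window effect.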
The turning point of the relative GC skew, which is taken as the predicted location of the oriC region, corresponds to the extreme value of the cumulative GC skew and of the GC disparity curve obtained by the Z-curve method. However, Oriloc mainly uses the cumulative GC skew method to identify oriC regions and cannot provide their exact boundaries. We therefore compared the location of each oriC with the minimum of the GC disparity curve using the records in the DoriC database and found that most oriCs are close to the minimum of the GC disparity curve, but about 7% (246) of all oriCs lie more than one-tenth of the chromosome length away from it. This confirms that oriC prediction in bacteria should take other biological characteristics into consideration, such as the distribution of indicator genes and DnaA boxes. Furthermore, the experimentally confirmed oriCs together with the corresponding predictions by Ori-Finder are summarized at http://tubic.tju.edu.cn/doric/supplementary.php.

Exploration of replication origins with DoriC database

Figure 1A presents the main sources of the DoriC database, namely the predictions by the Ori-Finder system and the experimentally confirmed oriCs from the literature in PubMed. The oriCs in DoriC are distributed across all the bacterial phyla and four archaeal phyla: Crenarchaeota, Euryarchaeota, Korarchaeota and Thaumarchaeota. Figure 3 presents the distribution of oriC records in the DoriC database by phylum in both bacteria and archaea. The oriCs from the phyla Proteobacteria and Firmicutes constitute the largest share in bacteria, while in archaea the oriCs from the phylum Euryarchaeota are in the majority. The DoriC database also includes seven archaeal species from Thaumarchaeota and Korarchaeota, and one unclassified archaeal species, whose oriC regions were identified by Ori-Finder 2. The predicted results are consistent with the typical characteristics of archaeal oriCs.
Unlike in bacteria, a substantial proportion of oriCs in archaea could not be identified in vivo or in silico. As the DoriC database currently covers all the complete RefSeq bacterial genomes, we focus on bacterial oriCs in this section.

Figure 3. The taxonomic distribution of oriC records in the DoriC database. Phyla with fewer than nine oriC records are classified into the ‘other’ subcategory in the pie chart.

In the previous section, we described several typical characteristics of bacterial oriCs, such as asymmetrical nucleotide distributions, replication-related genes and DnaA boxes. These conserved features can now be summarized based on the DoriC database; the statistics were calculated with Python scripts. The DnaA boxes are essential for DNA replication and are enriched in the oriC regions. We retrieved the DnaA boxes in all the oriCs with no more than two mismatches from the E. coli perfect DnaA box (TTATCCACA) and calculated the GC contents of those DnaA boxes and of the corresponding chromosomes. A statistically significant correlation is observed in Figure 4A (R = 0.71), consistent with previous reports [38]. This suggests that the GC content of the chromosome may affect that of the DnaA boxes. In particular, the cytosine positions in TTATCCACA are more conserved in genomes with high GC content. In addition, adenine or thymine at some other positions tends to be replaced by guanine or cytosine in bacteria with high GC content, and vice versa. A cluster of DnaA boxes can facilitate the binding of DnaA proteins to the oriC region and participate in the regulation of chromosome replication.
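The retrieval step described above, DnaA boxes with no more than two mismatches from TTATCCACA, amounts to a Hamming-distance scan. The sketch below is an illustration rather than the authors' Python scripts; scanning both strands and the function names are assumptions made here.

```python
PERFECT_BOX = "TTATCCACA"  # E. coli perfect DnaA box, as given in the text
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_dnaa_boxes(seq, motif=PERFECT_BOX, max_mismatches=2):
    """Return (position, strand, word) for every window within the mismatch limit.

    Both strands are checked (an assumption of this sketch); positions are
    0-based on the forward strand.
    """
    seq = seq.upper()
    k = len(motif)
    motif_rc = reverse_complement(motif)
    hits = []
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if hamming(word, motif) <= max_mismatches:
            hits.append((i, "+", word))
        elif hamming(word, motif_rc) <= max_mismatches:
            hits.append((i, "-", word))
    return hits


def gc_content(seq):
    """Fraction of G and C in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
```

With the boxes of each oriC collected this way, the GC content of the boxes and of the chromosome could be paired per genome and correlated, which is the kind of comparison summarized in Figure 4A.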
Figure 4B presents the percentage of oriC regions with different numbers of DnaA boxes and shows that most oriCs contain more than four DnaA boxes. In addition, the distribution of DnaA boxes throughout the chromosome is presented in the Z-curve figures provided by DoriC, and a marked enrichment in the oriC regions can be observed in most records. Replication-related genes are another important indicator of oriC regions. Figure 4D displays the probabilities of the top 10 genes adjacent to the oriC regions. Nearly half of the oriCs are next to dnaA genes. Moreover, dnaN, rpmH and gidA also frequently flank the oriC regions. It should be noted that some oriCs are bipartite origins [58], which are split into two subregions by the dnaA gene and located in the intergenic regions rpmH-dnaA and dnaA-dnaN (Nocardia farcinica IFM 10152 and Chlorobium chlorochromatii CaD3 in Table 1). In the phylum Cyanobacteria, where the two genes are separated in the chromosome, most oriC regions are adjacent to the dnaN gene rather than the dnaA gene, which is supported by a series of experiments [59–61]. The different mechanisms of DNA replication on the leading and lagging strands lead to the asymmetric nucleotide distribution. As a result, the oriC regions are usually near the switch of the GC skew or the extreme of the GC disparity. The distances between the oriC and the minimum of the GC disparity, relative to the whole chromosome length, are presented in Figure 4E. The majority of oriCs are close to the location of the minimum GC disparity. In addition, the length of the oriCs and the difference between the AT content of the oriC and that of the chromosome are presented in Figure 4C and 4F. Most oriCs are about 500 bp long and have an AT content >10% higher than that of the chromosome, which facilitates DNA melting.
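The two quantities plotted in Figure 4E and 4F can be computed as follows. This is a sketch under the assumption that the chromosome is circular and that positions are given as base coordinates; the function names are illustrative and are not part of DoriC.

```python
def relative_circular_distance(pos_a, pos_b, genome_length):
    """Shortest distance between two positions on a circular chromosome,
    expressed as a fraction of the chromosome length (0.0 to 0.5)."""
    d = abs(pos_a - pos_b) % genome_length
    return min(d, genome_length - d) / genome_length


def at_content(seq):
    """Fraction of A and T in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq) if seq else 0.0


def at_excess(oric_seq, chromosome_seq):
    """AT content of the oriC minus that of the whole chromosome."""
    return at_content(oric_seq) - at_content(chromosome_seq)
```

Under this definition, an oriC would count as far from the GC disparity minimum when relative_circular_distance exceeds 0.1, the one-tenth threshold used in the text.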
Table 1. The oriC information of some representative bacterial and archaeal chromosomes in DoriC

| Organism | RefSeq | Phylum | GC (%) | Adjacent gene cluster | DnaA box or ORB sequence |
| --- | --- | --- | --- | --- | --- |
| Nocardia farcinica IFM 10152 | NC_006361 | Actinobacteria | 70.83 | rnpA_rpmH_oriC1_dnaA_oriC2_dnaN_recF_gyrB | TTGTCCACA |
| Bacteroides helcogenes P 36-108 | NC_014933 | Bacteroidetes | 44.72 | gidA_oriC | TTATACACA |
| Chlamydia trachomatis Sweden2 | NC_017441 | Chlamydiae | 41.31 | ispA_glmU_oriC_hemB_nqrA_greA_aspC | TTATCAACA |
| Chlorobium chlorochromatii CaD3 | NC_007514 | Chlorobi | 44.28 | rnpA_rpmH_oriC1_dnaA_oriC2_dnaN_recF_ | TTATCCACA |
| Dehalococcoides mccartyi DCMB5 | NC_020386 | Chloroflexi | 47.07 | rpsT_lexA_fmt_dnaA_oriC_obgE_nadD_gyrB_relA_hisS | TTATCCAAA |
| Synechococcus sp. WH 7803 | NC_009481 | Cyanobacteria | 60.24 | uvrA_recN_thrC_oriC_dnaN_purL_purF_gyrA | TTTTCCACA |
| Deinococcus radiodurans R1 chr. I and II | NC_001263 | Deinococcus-Thermus | 67.01 | eno_dnaN_oriC_dnaA | TT[AT]TCCACA |
| | NC_001264 | | 66.69 | oriC_parA_parB | |
| Bacillus subtilis subsp. subtilis str. 168 | NC_000964 | Firmicutes | 43.52 | jag_spoIIIJ_rnpA_rpmH_oriC1_dnaA_oriC2_dnaN_yaaA_recF_yaaB_gyrB | TT[AT]TCCACA |
| Brucella abortus biovar 1 str. 9-941 chr. I and II | NC_006932 | Proteobacteria | 57.16 | trmE_rho_hemE_oriC_maf_aroE_coaE_dnaQ | TTATCCACA |
| | NC_006933 | | 57.34 | hemN-2_repA_repB_oriC_repC | |
| Escherichia fergusonii ATCC 35469 | NC_011740 | Proteobacteria | 49.94 | atpE_atpB_atpI_gidB_gidA_oriC_mioC_asnC_asnA_yieM_ravA | TTATCCACA |
| Burkholderia cenocepacia J2315 chr. I, II and III | NC_011000 | Proteobacteria | 66.68 | arsC_oriC_parA_parB_repC | TTATCCACA |
| | NC_011001 | | 67.28 | oriC_repC_parB_repA | TTATGCGCATAA |
| | NC_011002 | | 66.92 | oriC_repC | |
| Geobacter sulfurreducens PCA | NC_002939 | Proteobacteria | 60.94 | yidC_rnpA_rpmH_oriC1_dnaA_oriC2_dnaN_recF_gyrB_gyrA | TTATCCACA |
| Helicobacter pylori 26695 | NC_000915 | Proteobacteria | 38.87 | dnaA_oriC_glmS_thyX | TTATTCACA |
| Vibrio cholerae O1 biovar eltor str. N16961 chr. I and II | NC_002505 | Proteobacteria | 47.7 | parA_parB_gidB_gidA_oriC_yidC_trmE | TTATCCACA |
| | NC_002506 | | 46.91 | parB_parA_oriC | ATGATCAAGAG |
| Leptospira borgpetersenii serovar Hardjo-bovis JB197 chr. I and II | NC_008510 | Spirochaetes | 40.23 | gidA_ilvE_dnaX1_dnaA_oriC_dnaN_recF_gyrB_gyrA | TTTTCCACA |
| | NC_008511 | | 40.43 | oriC_parA_parB | |
| Thermotoga sp. RQ2 | NC_010483 | Thermotogae | 46.18 | oriC_rpmF | AAACCTACCACC |
| Sulfolobus solfataricus P2 | NC_002754 | Crenarchaeota | 35.79 | cdc6_oriC-I, oriC-II, oriC-III_cdc6 | TCCA[AG][AT][TG]GAA[CA][CT][GA]AAGGGGT |
| Pyrococcus abyssi GE5 | NC_000868 | Euryarchaeota | 44.71 | cdc6_oriC | TCCA[CG]T[TG]GAAA[TC][GA]AAGGGGT |
| Methanococcus maripaludis X1 | NC_015847 | Euryarchaeota | 32.94 | oriC_Mc-pRIP | TT[TA][GT]ATTCA[TC][GA]AT[AT]T[AT]T[AT] |

Figure 4. Several characteristics of replication origins. (A) The relationship between the GC content of DnaA boxes and that of the chromosome. (B) The percentages of bacterial oriCs with different numbers of DnaA boxes. (C) The percentages of bacterial oriCs with different lengths. (D) The probabilities of the top 10 replication-related indicator genes adjacent to the oriCs. Note that some oriCs are next to two indicator genes, so the sum of the probabilities is not equal to 100%. (E) The distribution of the relative distance between the oriC and the minimum of the GC disparity. The x axis indicates the proportion of the chromosome size. (F) The difference between the AT content of the oriC region and that of the chromosome.

However, the statistical analyses can reflect only the common characteristics of bacterial oriCs, and exceptions always exist in individual genomes. Consequently, we analyzed the indicator genes and DnaA boxes by phylum based on the records in the DoriC database and found the gene clusters that frequently surround oriCs and the ‘species-specific’ DnaA boxes within oriCs. Table 1 displays the oriC information of some bacterial and archaeal chromosomes in DoriC, which illustrates the common features in their phyla. Besides the genes listed above (dnaA, dnaN, gidA, hemE and rpmH), some other genes, such as rnpA, gyrB and recF, also tend to be near the replication origins.
For bacteria with multiple chromosomes, the replication initiation genes and plasmid partition genes, such as repA, repC, parA and parB, are often the indicators for the oriCs in the extra chromosomes, and these genes have also been verified to be involved in plasmid replication [62]. This suggests that the extra microbial chromosomes may originate from megaplasmids. Moreover, the ‘species-specific’ DnaA boxes that differ from the perfect E. coli DnaA box ‘TTATCCACA’ are also listed in this table. For example, the motif ‘TTTTCCACA’ was found in most species of the phylum Cyanobacteria. Owing to the high GC content of the chromosome, a cluster of ‘TTGTCCACA’ motifs was discovered in the oriC of N. farcinica IFM 10152. For the extra chromosomes, some other motifs entirely different from the classic DnaA box are also listed in this table, and these motifs usually appear as repeats in oriC regions. Currently, the majority of archaeal oriCs in DoriC are orc1/cdc6-associated oriCs, and a distant homolog of the cdc6 gene, named Mc-pRIP for the putative replication initiator protein, was found next to the oriC regions in the order Methanococcales during the update of the DoriC database. In addition, some archaea with multiple oriCs, such as Sulfolobus and Halobacteria, are also summarized. For more details, please refer to the Ori-Finder 2 article [33].

Applications with DoriC database
With the accumulation of prokaryotic oriC records in the DoriC database, it has become possible to determine the conserved features of oriCs, analyze strand bias and search for homologous oriCs. Figure 5 outlines the main applications, and several examples are presented in this section.

Figure 5 Main applications based on DoriC database including data mining, strand-biased analysis and homology search.
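The DnaA box motifs listed in Table 1 lend themselves to a simple scan. The following is a minimal Python sketch, not the Ori-Finder implementation: it searches a sequence for a bracket-style motif such as TT[AT]TCCACA on both strands using exact regular-expression matching. The function names and the toy sequence are hypothetical, and real tools may also tolerate a limited number of mismatches, which this sketch does not.

```python
import re

# Complement table that also swaps brackets, so that a character class
# such as [AT] is still well formed after the string is reversed.
_COMPLEMENT = str.maketrans("ACGT[]", "TGCA][")


def reverse_complement_motif(motif: str) -> str:
    """Reverse-complement a motif that may contain [..] character classes."""
    return motif.translate(_COMPLEMENT)[::-1]


def find_dnaa_boxes(seq: str, motif: str = "TTATCCACA"):
    """Return sorted (start, strand, matched_text) tuples for exact motif hits.

    Positions are 0-based on the given (top) strand. A lookahead is used so
    that overlapping occurrences are also reported. A palindromic motif would
    be reported once per strand; the DnaA box is not palindromic.
    """
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement_motif(motif))):
        for match in re.finditer(f"(?=({pattern}))", seq):
            hits.append((match.start(), strand, match.group(1)))
    return sorted(hits)


# Toy sequence (hypothetical): one box on the top strand, one on the bottom.
toy = "GGTTATCCACAGGTGTGGAAAACC"
for start, strand, text in find_dnaa_boxes(toy, "TT[AT]TCCACA"):
    print(start, strand, text)
```

Writing the motif with the same bracket notation as the table keeps the example close to the text; a position weight matrix would be a more flexible choice if mismatches had to be scored.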
Data mining with DoriC database
By exploring oriCs in the DoriC database, new features associated with oriC regions, such as motif sequences and multiple replication origins in a single bacterial chromosome, have been discovered and supported by other studies. For example, Murray et al. identified a new indispensable bacterial replication origin element, a repeating trinucleotide motif named the DnaA-trio, and demonstrated experimentally that these elements play an important role in the stabilization of DnaA filaments. These elements were then detected throughout the bacterial kingdom by bioinformatics analysis with the DoriC database, indicating that the DnaA-trio is another core oriC element [63]. DNA methylation is an epigenetic mechanism involved in various biological processes in bacteria, including DNA replication. Bendall et al. [64] performed single-molecule real-time sequencing in Shewanella oneidensis MR-1 to reveal methylation of adenine (N6mA) throughout the genome, and found that methylated GATC motifs are enriched in the oriC region. Further comparative analysis of Gammaproteobacteria genomes, including those in the DoriC database, revealed that the oriCs are enriched for GATC motifs in the presence of dam and seqA. It is well known that bacteria typically have a single replication origin per chromosome. However, double replication origins can exist in some artificial biological systems. Several bacterial genomes have been re-engineered by synthetic biology methodologies, and chromosomes carrying more than one wild-type (WT) origin have been extensively characterized [65–67]. This finding indicated that multiple origins could occur on a bacterial chromosome, and several bacterial chromosomes with putative double origins of replication, including Acidaminococcus fermentans DSM 20731, Dehalobacter sp. CF, Ralstonia pickettii 12D chromosome 1 and Ochrobactrum anthropi ATCC 49188 chromosome I, were indeed found in DoriC [68].
Recent work reported that Achromatium oxaliferum and Synechocystis may harbor different replication origins, which also supports our hypothesis [69, 70].

The strand-biased analysis with DoriC database
Strand-biased analyses of biological features are important for understanding the mechanisms of many biological processes. Consequently, the oriC data in the DoriC database have been widely used in a series of comparative genomics studies focused on strand bias, covering nucleotide composition [71–73], codon usage [74, 75], substitution rate [76], gene expression [77] and gene distribution [78, 79]. Here, we introduce a few examples of strand-biased analyses. It is well known that the majority of genes in bacterial chromosomes tend to be located on the leading strand. A number of studies have used the DoriC database to explain this observation. Mao et al. [80] performed a computational study on 725 bacterial genomes and found that genes in different functional categories show different degrees of strand bias. The preference for genes on the leading strand in certain functional categories could enhance the survivability of the host and keep them moving to the more efficient leading strand. The expression level of genes was once considered the main force causing strand bias. However, the analysis of gene distributions in Bacillus subtilis and E. coli showed that essentiality, not expressiveness, is the basis of gene strand bias [81], and our laboratory also confirmed, using the DoriC and DEG databases, the previous finding that essential genes are more frequently situated on the leading strand [78]. Furthermore, only the essential genes in certain COG subcategories showed this preference. These results help to explain the architecture of bacterial chromosomes. This property has also been used in the prediction of gene essentiality in bacterial genomes [82, 83].
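As an illustration of the kind of strand-bias calculation described above, the sketch below classifies genes as leading- or lagging-strand on a circular chromosome, given an origin and a terminus position. It is a sketch under stated assumptions rather than the method of any cited study: the function names, the use of the gene midpoint and the toy coordinates are all hypothetical, and when the terminus is unknown it may have to be approximated, for example as the point opposite the origin.

```python
def on_right_replichore(pos: int, ori: int, ter: int, genome_len: int) -> bool:
    """True if pos lies on the arc running from ori to ter in increasing coordinates."""
    return (pos - ori) % genome_len < (ter - ori) % genome_len


def is_leading(start: int, end: int, strand: str,
               ori: int, ter: int, genome_len: int) -> bool:
    """A gene counts as 'leading' when it is transcribed in the direction of fork movement.

    On the arc from ori to ter in increasing coordinates the fork moves towards
    higher coordinates, so '+' genes are co-oriented with it; on the other arc
    the fork moves towards lower coordinates, so '-' genes are.
    Assumes start <= end, i.e. the gene does not wrap around coordinate 0.
    """
    midpoint = ((start + end) // 2) % genome_len
    right = on_right_replichore(midpoint, ori, ter, genome_len)
    return right == (strand == "+")


def leading_fraction(genes, ori: int, ter: int, genome_len: int) -> float:
    """Fraction of (start, end, strand) genes located on the leading strand."""
    flags = [is_leading(s, e, strand, ori, ter, genome_len) for s, e, strand in genes]
    return sum(flags) / len(flags)


# Toy annotation (hypothetical coordinates on a 1000 bp circular chromosome).
genes = [
    (150, 250, "+"),  # ori-to-ter arc, co-oriented with the fork: leading
    (300, 400, "-"),  # ori-to-ter arc, head-on: lagging
    (700, 800, "-"),  # other arc, co-oriented: leading
    (900, 950, "+"),  # other arc, head-on: lagging
    (20, 80, "-"),    # other arc (past coordinate 0), co-oriented: leading
]
print(leading_fraction(genes, ori=100, ter=600, genome_len=1000))  # prints 0.6
```

Restricting `genes` to a subset, such as essential genes or a single functional category, would give the per-category fractions that the studies above compare.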
Conclusion and future perspectives
In this article, we briefly reviewed the history of the Ori-Finder system and the DoriC database and then outlined the main methodology and applications associated with them. The Ori-Finder system uses an integrated method to identify oriCs in prokaryotic genome sequences, and the oriCs predicted by in silico methods as well as those identified by in vivo or in vitro experiments were collected in the DoriC database after manual curation. The Ori-Finder system has become a popular software tool for predicting prokaryotic replication origins, and some of its predictions have been confirmed by experiments. The DoriC database stores about 3600 records of oriC regions in both bacterial and archaeal genomes, which facilitates large-scale data mining and strand-biased analyses associated with replication origins. Furthermore, we explored the oriC records in the DoriC database and presented the statistical results as well as representative organisms here. However, next-generation sequencing technologies have created a new challenge for the identification of replication origins in different types of genomic data. To address this challenge, the Ori-Finder system will be extended to predict oriC regions in metagenomic sequences in the future, and the new version of the DoriC database will also include information from strand-biased analyses of nucleotide asymmetry, codon usage, gene distribution, etc. The oriC prediction in the RefSeq genomes has laid a firm foundation for the further development of the Ori-Finder system and the DoriC database, which will serve as critical tools in prokaryotic genomics.

Key Points
The Ori-Finder system is designed for oriC prediction in bacterial and archaeal genomes with high accuracy and reliability; it integrates gene prediction, analysis of base composition asymmetry, distribution of DnaA boxes or ORBs, occurrence of genes frequently close to oriC regions and phylogenetic relationships.
The DoriC database contains 3423 oriC records in >2700 complete RefSeq bacterial chromosomes and 257 oriC regions in over 200 archaeal genomes, all manually curated. Detailed information about oriC regions, such as DnaA boxes or ORBs, repeat sequences, replication-related genes and URLs to the genome browser, is provided in the database.
The Ori-Finder system and the DoriC database have been widely used in research on DNA replication origins in prokaryotes, including oriC retrieval, motif sequence discovery and strand-biased analysis.

Acknowledgements
The authors would like to thank Professor Chun-Ting Zhang for the invaluable assistance and inspiring discussions.

Funding
This work was supported by the National Natural Science Foundation of China (grant numbers 31571358, 21621004 and 31171238) and the National High-Tech Research and Development Program (863) of China (grant number 2015AA020101).

Hao Luo is an assistant professor in the Department of Physics, School of Science, Tianjin University, China. His research focuses on DNA replication, gene essentiality and bioinformatics.
Chun-Lan Quan is a graduate student in the Department of Physics, School of Science, Tianjin University, China. Her research interests are bioinformatics and microbial genomics.
Chong Peng is a PhD candidate in the Department of Physics, School of Science, Tianjin University, China. Her research interests are bioinformatics and gene essentiality.
Feng Gao is a professor in the Department of Physics, School of Science, Key Laboratory of Systems Bioengineering (Ministry of Education) and SynBio Research Platform, Collaborative Innovation Center of Chemical Science and Engineering (Tianjin), Tianjin University, China. His research is in the fields of computational biology and bioinformatics, with a special focus on microbial genomics and functional genomics.

References
1 Costa A, Hood IV, Berger JM. Mechanisms for initiating cellular DNA replication.
Annu Rev Biochem 2013; 82: 25–54. http://dx.doi.org/10.1146/annurev-biochem-052610-094414
2 O'Donnell M, Langston L, Stillman B. Principles and concepts of DNA replication in bacteria, archaea, and eukarya. Cold Spring Harb Perspect Biol 2013; 5: a010108.
3 Stillman B. Origin recognition and the chromosome cycle. FEBS Lett 2005; 579(4): 877–84. http://dx.doi.org/10.1016/j.febslet.2004.12.011
4 Mott ML, Berger JM. DNA replication initiation: mechanisms and regulation in bacteria. Nat Rev Microbiol 2007; 5(5): 343–54. http://dx.doi.org/10.1038/nrmicro1640
5 Skarstad K, Katayama T. Regulating DNA replication in bacteria. Cold Spring Harb Perspect Biol 2013; 5(4): a012922.
6 Gao F. Recent advances in the identification of replication origins based on the Z-curve method. Curr Genomics 2014; 15(2): 104–12. http://dx.doi.org/10.2174/1389202915999140328162938
7 Grigoriev A. Analyzing genomes with cumulative skew diagrams. Nucleic Acids Res 1998; 26(10): 2286–90. http://dx.doi.org/10.1093/nar/26.10.2286
8 Gao F, Luo H, Zhang CT. DeOri: a database of eukaryotic DNA replication origins. Bioinformatics 2012; 28(11): 1551–2. http://dx.doi.org/10.1093/bioinformatics/bts151
9 Leonard AC, Méchali M. DNA replication origins. Cold Spring Harb Perspect Biol 2013; 5(10): a010116.
10 Liu F, Ren C, Li H, et al. De novo identification of replication-timing domains in the human genome by deep learning. Bioinformatics 2016; 32(5): 641–9. http://dx.doi.org/10.1093/bioinformatics/btv643
11 Lei M. The MCM complex: its role in DNA replication and implications for cancer therapy. Curr Cancer Drug Targets 2005; 5(5): 365–80. http://dx.doi.org/10.2174/1568009054629654
12 Peng C, Luo H, Zhang X, et al. Recent advances in the genome-wide study of DNA replication origins in yeast. Front Microbiol 2015; 6: 117.
13 Xu J, Yanagisawa Y, Tsankov AM, et al. Genome-wide identification and characterization of replication origins by deep sequencing. Genome Biol 2012; 13(4): R27.
14 Sasaki T, Gilbert DM. Unearthing worm replication origins. Nat Struct Mol Biol 2017; 24(3): 195–6. http://dx.doi.org/10.1038/nsmb.3385
15 Cayrou C, Ballester B, Peiffer I, et al. The chromatin environment shapes DNA replication origin organization and defines origin classes. Genome Res 2015; 25(12): 1873–85. http://dx.doi.org/10.1101/gr.192799.115
16 Costas C, Sanchez MD, Stroud H, et al. Genome-wide mapping of Arabidopsis thaliana origins of DNA replication and their associated epigenetic marks. Nat Struct Mol Biol 2011; 18(3): 395–400. http://dx.doi.org/10.1038/nsmb.1988
17 Wu ZF, Liu JF, Yang HB, et al. DNA replication origins in archaea. Front Microbiol 2014; 5: 179.
18 Barry ER, Bell SD. DNA replication in the archaea. Microbiol Mol Biol Rev 2006; 70(4): 876–87. http://dx.doi.org/10.1128/MMBR.00029-06
19 Samson RY, Abeyrathne PD, Bell SD. Mechanism of archaeal MCM helicase recruitment to DNA replication origins. Mol Cell 2016; 61(2): 287–96. http://dx.doi.org/10.1016/j.molcel.2015.12.005
20 Fleischmann RD, Adams MD, White O, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995; 269(5223): 496–512. http://dx.doi.org/10.1126/science.7542800
21 Bouchie A. White House unveils national microbiome initiative. Nat Biotechnol 2016; 34(6): 580–1. http://dx.doi.org/10.1038/nbt0616-580a
22 Ommen B, El-Sohemy A, Hesketh J, et al. The Micronutrient Genomics Project: a community-driven knowledge base for micronutrient research. Genes Nutr 2010; 5: 285. http://dx.doi.org/10.1007/s12263-010-0192-8
23 Ehrlich SD, Consortium M. MetaHIT: the European Union Project on metagenomics of the human intestinal tract. In: Metagenomics of the Human Body. New York, NY: Springer, 2011, 307–16.
24 Collins FS, Morgan M, Patrinos A. The Human Genome Project: lessons from large-scale biology. Science 2003; 300(5617): 286–90. http://dx.doi.org/10.1126/science.1084564
25 Gao F. Editorial: DNA replication origins in microbial genomes. Front Microbiol 2015; 6: 1545.
26 Zhang R, Zhang C-T. A brief review: the Z-curve theory and its application in genome analysis. Curr Genomics 2014; 15(2): 78–94. http://dx.doi.org/10.2174/1389202915999140328162433
27 Zhang R, Zhang CT. Identification of replication origins in archaeal genomes based on the Z-curve method. Archaea 2005; 1(5): 335–46. http://dx.doi.org/10.1155/2005/509646
28 Soppa J. From genomes to function: haloarchaea as model organisms. Microbiology 2006; 152(Pt 3): 585–90.
29 Gao F, Zhang CT. Ori-Finder: a web-based system for finding oriCs in unannotated bacterial genomes. BMC Bioinformatics 2008; 9(1): 79. http://dx.doi.org/10.1186/1471-2105-9-79
30 Korem T, Zeevi D, Suez J, et al. Microbiome growth dynamics of gut microbiota in health and disease inferred from single metagenomic samples. Science 2015; 349(6252): 1101–6. http://dx.doi.org/10.1126/science.aac4812
31 Xu YX, Ji XF, Chen N, et al. Development of replicative oriC plasmids and their versatile use in genetic manipulation of Cytophaga hutchinsonii. Appl Microbiol Biotechnol 2012; 93(2): 697–705. http://dx.doi.org/10.1007/s00253-011-3572-0
32 Makowski L, Donczew R, Weigel C, et al. Initiation of chromosomal replication in predatory bacterium Bdellovibrio bacteriovorus. Front Microbiol 2016; 7: 1898.
33 Luo H, Zhang CT, Gao F. Ori-Finder 2, an integrated tool to predict replication origins in the archaeal genomes. Front Microbiol 2014; 5: 482.
34 Gao F, Zhang CT. DoriC: a database of oriC regions in bacterial genomes. Bioinformatics 2007; 23(14): 1866–7. http://dx.doi.org/10.1093/bioinformatics/btm255
35 Gao F, Luo H, Zhang C-T. DoriC 5.0: an updated database of oriC regions in both bacterial and archaeal genomes. Nucleic Acids Res 2013; 41: D90–3.
36 Xia X. DNA replication and strand asymmetry in prokaryotic and mitochondrial genomes. Curr Genomics 2012; 13(1): 16–27. http://dx.doi.org/10.2174/138920212799034776
37 Salzberg SL, Salzberg AJ, Kerlavage AR, et al. Skewed oligomers and origins of replication. Gene 1998; 217(1–2): 57–67.
38 Mackiewicz P, Zakrzewska CJ, Zawilak A, et al. Where does bacterial replication start? Rules for predicting the oriC region. Nucleic Acids Res 2004; 32(13): 3781–91. http://dx.doi.org/10.1093/nar/gkh699
39 Sernova NV, Gelfand MS. Identification of replication origins in prokaryotic genomes. Brief Bioinform 2008; 9(5): 376–91. http://dx.doi.org/10.1093/bib/bbn031
40 Gao F, Zhang C-T. Origins of replication in Sorangium cellulosum and Microcystis aeruginosa. DNA Res 2008; 15(3): 169–71. http://dx.doi.org/10.1093/dnares/dsn007
41 Gao F, Zhang C-T. Origins of replication in Cyanothece 51142. Proc Natl Acad Sci USA 2008; 105(52): E125.
42 Kurtz S, Choudhuri JV, Ohlebusch E, et al. REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res 2001; 29(22): 4633–42. http://dx.doi.org/10.1093/nar/29.22.4633
43 Tyner C, Barber GP, Casper J, et al. The UCSC genome browser database: 2017 update. Nucleic Acids Res 2017; 45(D1): D626–34.
44 Coordinators NR. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2017; 45: D12–17.
45 Woese CR, Kandler O, Wheelis ML. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci USA 1990; 87(12): 4576–9. http://dx.doi.org/10.1073/pnas.87.12.4576
46 Myllykallio H, Lopez P, Lopez GP, et al. Bacterial mode of replication with eukaryotic-like machinery in a hyperthermophilic archaeon. Science 2000; 288(5474): 2212–15. http://dx.doi.org/10.1126/science.288.5474.2212
47 Wu Z, Liu J, Yang H, et al. Multiple replication origins with diverse control mechanisms in Haloarcula hispanica. Nucleic Acids Res 2014; 42(4): 2282–94. http://dx.doi.org/10.1093/nar/gkt1214
48 Bailey TL, Boden M, Buske FA, et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 2009; 37: W202–8.
49 Wang Y, Chen B, Sima L, et al. Construction of expression shuttle vectors for the Haloarchaeon Natrinema sp. J7 based on its chromosomal origins of replication. Archaea 2017; 2017: 4237079.
50 Papadimitriou K, Baharidis PK, Georgoulis A, et al. Analysis of the complete genome sequence of the archaeon Pyrococcus chitonophagus DSM 10152 (formerly Thermococcus chitonophagus). Extremophiles 2016; 20(3): 351–361.
51 Gavrilov SN, Stracke C, Jensen K, et al. Isolation and characterization of the first xylanolytic hyperthermophilic euryarchaeon Thermococcus sp. strain 2319x1 and its unusual multidomain glycosidase. Front Microbiol 2016; 7: 552.
52 Demina TA, Atanasova NS, Pietila MK, et al. Vesicle-like virion of Haloarcula hispanica pleomorphic virus 3 preserves high infectivity in saturated salt. Virology 2016; 499: 40–51. http://dx.doi.org/10.1016/j.virol.2016.09.002
53 Bharathi M, Chellapandi P. Intergenomic evolution and metabolic cross-talk between rumen and thermophilic autotrophic methanogenic archaea. Mol Phylogenet Evol 2017; 107: 293–304. http://dx.doi.org/10.1016/j.ympev.2016.11.008
54 Guo FB, Ou HY, Zhang CT. ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes. Nucleic Acids Res 2003; 31(6): 1780–9. http://dx.doi.org/10.1093/nar/gkg254
55 Delcher AL, Bratke KA, Powers EC, et al. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 2007; 23(6): 673–9. http://dx.doi.org/10.1093/bioinformatics/btm009
56 Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics 2011; 27(7): 1017–18. http://dx.doi.org/10.1093/bioinformatics/btr064
57 Frank AC, Lobry JR. Oriloc: prediction of replication boundaries in unannotated bacterial chromosomes. Bioinformatics 2000; 16(6): 560–1. http://dx.doi.org/10.1093/bioinformatics/16.6.560
58 Wolanski M, Donczew R, Zawilak-Pawlik A, et al. oriC-encoded instructions for the initiation of bacterial chromosome replication. Front Microbiol 2015; 5: 735.
59 Zhou Y, Chen WL, Wang L, et al. Identification of the oriC region and its influence on heterocyst development in the filamentous cyanobacterium Anabaena sp. strain PCC 7120. Microbiology 2011; 157: 1910–19.
60 Watanabe S, Ohbayashi R, Shiwa Y, et al. Light-dependent and asynchronous replication of cyanobacterial multi-copy chromosomes. Mol Microbiol 2012; 83(4): 856–65. http://dx.doi.org/10.1111/j.1365-2958.2012.07971.x
61 Huang H, Song CC, Yang ZL, et al. Identification of the replication origins from Cyanothece ATCC 51142 and their interactions with the DnaA protein: from in silico to in vitro studies. Front Microbiol 2015; 6: 1370.
62 Stolz A. Degradative plasmids from sphingomonads. FEMS Microbiol Lett 2014; 350(1): 9–19. http://dx.doi.org/10.1111/1574-6968.12283
63 Richardson TT, Harran O, Murray H. The bacterial DnaA-trio replication origin element specifies single-stranded DNA initiator binding. Nature 2016; 534(7607): 412–16. http://dx.doi.org/10.1038/nature17962
64 Bendall ML, Luong K, Wetmore KM, et al. Exploring the roles of DNA methylation in the metal-reducing bacterium Shewanella oneidensis MR-1. J Bacteriol 2013; 195(21): 4966–74. http://dx.doi.org/10.1128/JB.00935-13
65 Wang X, Lesterlin C, Reyes-Lamothe R, et al. Replication and segregation of an Escherichia coli chromosome with two replication origins. Proc Natl Acad Sci USA 2011; 108(26): E243–50.
66 Liang X, Baek CH, Katzen F. Escherichia coli with two linear chromosomes. ACS Synth Biol 2013; 2(12): 734–40. http://dx.doi.org/10.1021/sb400079u
67 Messerschmidt SJ, Kemter FS, Schindler D, et al. Synthetic secondary chromosomes in Escherichia coli based on the replication origin of chromosome II in Vibrio cholerae. Biotechnol J 2015; 10: 302–14. http://dx.doi.org/10.1002/biot.201400031
68 Gao F. Bacteria may have multiple replication origins. Front Microbiol 2015; 6: 324.
69 Ionescu D, Bizic-Ionescu M, De Maio N, et al. Community-like genome in single cells of the sulfur bacterium Achromatium oxaliferum. Nat Commun 2017; 8(1): 455. http://dx.doi.org/10.1038/s41467-017-00342-9
70 Ohbayashi R, Watanabe S, Ehira S, et al. Diversification of DnaA dependency for DNA replication in cyanobacterial evolution. ISME J 2016; 10(5): 1113–21. http://dx.doi.org/10.1038/ismej.2015.194
71 Zhao HL, Xia ZK, Zhang FZ, et al. Multiple factors drive replicating strand composition bias in bacterial genomes. Int J Mol Sci 2015; 16(9): 23111–26. http://dx.doi.org/10.3390/ijms160923111
72 Zhang G, Gao F. Quantitative analysis of correlation between AT and GC biases among bacterial genomes. PLoS One 2017; 12(2): e0171408.
73 Chen WH, Lu G, Bork P, et al. Energy efficiency trade-offs drive nucleotide usage in transcribed regions. Nat Commun 2016; 7: 11334. http://dx.doi.org/10.1038/ncomms11334
74 Banerjee R, Roy D. Codon usage and gene expression pattern of Stenotrophomonas maltophilia R551-3 for pathogenic mode of living. Biochem Biophys Res Commun 2009; 390(2): 177–81. http://dx.doi.org/10.1016/j.bbrc.2009.09.062
75 Guo FB, Yuan JB. Codon usages of genes on chromosome, and surprisingly, genes in plasmid are primarily affected by strand-specific mutational biases in Lawsonia intracellularis. DNA Res 2009; 16(2): 91–104. http://dx.doi.org/10.1093/dnares/dsp001
76 Khrustalev VV, Barkovsky EV. The probability of nonsense mutation caused by replication-associated mutational pressure is much higher for bacterial genes from lagging than from leading strands. Genomics 2010; 96(3): 173–180. http://dx.doi.org/10.1016/j.ygeno.2010.06.002
77 Sobetzko P, Travers A, Muskhelishvili G. Gene order and chromosome dynamics coordinate spatiotemporal gene expression during the bacterial growth cycle. Proc Natl Acad Sci USA 2012; 109(2): E42–50.
78 Lin Y, Gao F, Zhang CT. Functionality of essential genes drives gene strand-bias in bacterial genomes. Biochem Biophys Res Commun 2010; 396(2): 472–76. http://dx.doi.org/10.1016/j.bbrc.2010.04.119
79 Gao N, Lu G, Lercher MJ, et al. Selection for energy efficiency drives strand-biased gene distribution in prokaryotes. Sci Rep 2017; 7: 10572. http://dx.doi.org/10.1038/s41598-017-11159-3
80 Mao X, Zhang H, Yin Y, et al. The percentage of bacterial genes on leading versus lagging strands is influenced by multiple balancing forces. Nucleic Acids Res 2012; 40(17): 8210–18. http://dx.doi.org/10.1093/nar/gks605
81 Rocha EP, Danchin A. Essentiality, not expressiveness, drives gene-strand bias in bacteria. Nat Genet 2003; 34(4): 377. http://dx.doi.org/10.1038/ng1209
82 Wei W, Ning LW, Ye YN, et al. Geptop: a gene essentiality prediction tool for sequenced bacterial genomes based on orthology and phylogeny. PLoS One 2013; 8(8): e72343.
83 Lin Y, Zhang RR. Putative essential and core-essential genes in Mycoplasma genomes. Sci Rep 2011; 1(1): 53. http://dx.doi.org/10.1038/srep00053

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
journal article
LitStream Collection
A review of methods and databases for metagenomic classification and assembly

Breitwieser, Florian P; Lu, Jennifer; Salzberg, Steven L

2019 Briefings in Bioinformatics

doi: 10.1093/bib/bbx120 pmid: 29028872

Abstract
Microbiome research has grown rapidly over the past decade, with a proliferation of new methods that seek to make sense of large, complex data sets. Here, we survey two of the primary types of methods for analyzing microbiome data, read classification and metagenomic assembly, and we review some of the challenges facing these methods. All of the methods rely on public genome databases, and we also discuss the content of these databases and how their quality has a direct impact on our ability to interpret a microbiome sample.

Keywords: microbiome, microbial genomics, next-generation sequencing, bacteria, databases

Introduction
Microbiome research has been expanding rapidly as a consequence of dramatic improvements in the efficiency of genome sequencing. As the variety and complexity of experiments have grown, so have the methods and databases used to analyze these experiments. Ever-larger data sets present increasing challenges for computational methods, which must minimize processing and memory requirements to provide fast turnaround and to avoid overwhelming the computational resources available to most research laboratories. The rapid increase in the number and variety of genomes also presents many challenges, arising in part from the effort required to fit traditional taxonomic naming schemes onto a microbial world that we now know is vastly richer and more complex than scientists realized when they first created those schemes in the distant past. Additional challenges arise from the rapid pace of ‘draft’ genome sequencing, which has produced tens of thousands of new genomes, many of which are highly fragmented and incomplete. As we discuss below, the variable quality of these genomes can lead to unexpected and erroneous results if the genomes are used without careful vetting.
This review discusses the computational challenges of analyzing metagenomics data, focusing on methods but also including a discussion of microbial taxonomy and genome resources, which are rarely discussed in benchmark studies and tool reviews despite their critical importance. We begin with a review of terminology and a comparison of marker gene sequencing, shotgun metagenome sequencing and meta-transcriptome sequencing, all of which are sometimes included in the term metagenomics.

Metataxonomics, metagenomics, metatranscriptomics
The most widely used sequencing-based approaches for microbiome research are metataxonomics and metagenomics (Table 1). Metataxonomics refers to the sequencing of marker genes, usually regions of the ribosomal RNA (rRNA) gene, which is highly conserved across taxa. Note that there has been some ambiguity in the use of these terms; in the past, marker gene sequencing has also been referred to as metagenomics. In this review, we follow the proposal of Marchesi and Ravel [1] on terminology, and use the term ‘metataxonomics’ for marker gene sequencing. Because it only requires sequence from a single gene, this strategy provides a cost-effective means to identify a wide range of organisms. Metagenomics refers to the random ‘shotgun’ sequencing of microbial DNA, without selecting any particular gene [2]. Both metataxonomics and metagenomics can provide information on the species composition of a microbiome. Another strategy, metatranscriptomics, attempts to capture and sequence all of the RNA in a sample, which can help create a profile of all genes that are actively being transcribed, and may also provide a picture of the relative abundance of those genes [3].
Table 1 Metataxonomics, metagenomics and meta-transcriptomics strategies
Technique | Advantages and challenges | Main applications
Metataxonomics using amplicon sequencing of the 16S or 18S rRNA gene or ITS | + Fast and cost-effective identification of a wide variety of bacteria and eukaryotes; − Does not capture gene content other than the targeted genes; − Amplification bias; − Viruses cannot be captured | * Profiling of what is present; * Microbial ecology; * rRNA-based phylogeny
Metagenomics using random shotgun sequencing of DNA or RNA | + No amplification bias; + Detects bacteria, archaea, viruses and eukaryotes; + Enables de novo assembly of genomes; − Requires high read count; − Many reads may be from host; − Requires reference genomes for classification | * Profiling of what is present across all domains; * Functional genome analyses; * Phylogeny; * Detection of pathogens
Meta-transcriptomics using sequencing of mRNA | + Identifies active genes and pathways; − mRNA is unstable; − Multiple purification and amplification steps can lead to more noise | * Transcriptional profiling of what is active

Complementary approaches that are becoming increasingly popular in microbiome research, but are not further covered in this review, include metaproteomics and metametabolomics [4–6]. Metaproteomics uses mass spectrometry techniques, e.g. liquid chromatography-coupled tandem mass spectrometry, to generate profiles of protein expression and posttranslational modifications of proteins [5]. Typically, genome sequences are required for the mapping of generated mass spectra to proteins, and thus, this field also depends on metagenomics. Metametabolomics attempts to create profiles of metabolites, usually also created using mass spectrometry [6]. Mass spectrometry is more expensive and experimentally challenging than sequencing, although the field is making continual technical improvements [4]. Integrating the data of all these different ‘meta-omics’ approaches is challenging, but it can yield insights not found by looking at just one type of data [7].
Metataxonomics is an invaluable tool for microbial ecology. rRNA gene sequences are the most widely used marker sequences; these include the 16S rRNA gene for bacteria, the 18S rRNA gene for eukaryotes, and the internal transcribed spacer (ITS) regions of the fungal ribosome for fungi [8, 9]. These markers work well for phylogenetic profiling because they are ubiquitously present in the population, they have hypervariable regions that differentiate species and they are flanked by conserved regions that can be targeted by ‘universal’ primers [8].
A major advantage of rRNA analysis is that databases such as Greengenes [10], RDP [11] and SILVA [12] contain genes from millions of species, making them far more comprehensive than genome databases, which contain tens of thousands of species. The workflow for 16S analysis typically includes quality filtering, error correction (sometimes called de-noising), removal of chimeric sequences, clustering of reads into ‘Operational Taxonomic Units’ (OTUs) based on sequence similarity and classification of the OTUs [13–20]. An alternative approach before clustering of reads into OTUs is their direct classification using metagenomics classifiers (see section on ‘metagenomics classification’ and Table 3), as recently compared in [21]. The rest of this review will focus on metagenomics methods; for further discussion of metataxonomic methods, see [22–25].

Table 3 Metagenomic classifiers, aligners and profilers
Tool | Synopsis | Reference | Web site
Kraken | Fast taxonomic classifier using in-memory k-mer search of metagenomics reads against a database built from multiple genomes | [64] | https://ccb.jhu.edu/software/kraken/
Kraken-HLL | Extension of Kraken counting unique k-mers for taxa and allowing multiple databases | | https://github.com/fbreitwieser/kraken-hll
CLARK(-S) | Fast taxonomic classifier using in-memory k-mer search of metagenomics reads against a database built from completed genomes. S extension uses spaced k-mer seeds for better classification | [65, 66] | http://clark.cs.ucr.edu
Kallisto | Taxonomic profiler using pseudo-alignment with k-mers using techniques based on transcript (RNA-seq) quantification | [67] | https://github.com/pachterlab/kallisto
k-SLAM | Taxonomic classifier using database of nonoverlapping k-mers in genomes. Reads are split into k-mers, and overlaps found by lexicographical ordering are pseudo-assembled | [68] | https://github.com/aindj/k-SLAM
Kaiju | Fast taxonomic classifier against protein sequences using FM-index with reduced amino acid alphabet | [69] | https://github.com/bioinformatics-centre/kaiju
DIAMOND | Protein homology search using spaced seeds with a reduced amino acid alphabet, 2000–20 000 times faster than BLASTX | [70] | https://github.com/bbuchfink/diamond
BLAST+ | Highly sensitive nucleotide and translated-nucleotide protein alignment | [61, 71] | https://blast.ncbi.nlm.nih.gov
MEGAN6/CE | Desktop and Web metagenomics analysis suite. Uses BLAST or diamond to match sequences and assigns LCA of matches | [72, 73] | http://ab.inf.uni-tuebingen.de/software/megan6/
DUDes | Top-down assignment of metagenomics reads | [74] | https://sourceforge.net/projects/dudes/
Taxonomer | Web-based metagenomics classifier including binning and visualization | [75] | http://taxonomer.iobio.io/
GOTTCHA | Taxonomic profiler that maps reads against short unique subsequences (‘signature’) at multiple taxonomic ranks | [76] | http://lanl-bioinformatics.github.io/GOTTCHA/
LMAT(-ML) | K-mer-based taxonomic read classifier using extensive database including draft genomes and eukaryotes. ML (Marker Library) extension reduces RAM requirements by stringent pruning of non-informative and overlapping k-mers | [77, 78] | https://sourceforge.net/projects/lmat/
taxator-tk | Uses BLAST or LAST output for binning and taxonomic assignment via overlapping regions and pairwise distance measures | [79] | https://github.com/fungs/taxator-tk
Centrifuge | Fast taxonomic classifier using database compressed with FM-index, database and output format similar to Kraken | [80] | http://ccb.jhu.edu/software/centrifuge/
MetaPhlAn 2 | Marker gene-based taxonomic profiler | [81] | https://bitbucket.org/biobakery/metaphlan2
mOTU | Taxonomic profiler based on a set of 40 prokaryotic marker genes | [82] | http://www.bork.embl.de/software/mOTU/
Mash | MinHash-based taxonomic profiler enabling super-fast overlap estimations | [83] | http://mash.readthedocs.io
sourmash | Alternative implementation of MinHash algorithm using fast searches with sequence bloom trees for taxonomic profiling | [84] | https://github.com/dib-lab/sourmash
PanPhlAn | Pan-genome-based phylogenomic analysis | [2] | http://segatalab.cibio.unitn.it/tools/panphlan/

Marker gene sequencing does have some drawbacks, which explains (in part) the rising popularity of metagenomics. First, marker gene-based methodologies do not capture viruses, which have no conserved genes analogous to 16S or 18S rRNA genes. The use of the 16S rRNA gene itself is imperfect as well: for the recently described Candidate Phyla Radiation, which comprises up to 15% of the bacterial domain [26], it was estimated that >50% of the organisms evaded detection with classical 16S amplicon sequencing [27]. The short reads produced by next-generation sequencers further limit analysis at the species level, although full-length 16S rRNA gene sequencing using long-read sequencers from Pacific Biosciences or Oxford Nanopore might help overcome this limitation [28].
The methodology of an experiment and laboratory-specific factors can also limit the effectiveness of marker gene sequencing approaches, although the same caveat applies to metagenomics [29–32].
Metagenomic analysis Many strategies can be used for analysis of metagenomics shotgun data (Figure 1). A common first step is to run a variety of computational tools for quality control, which identify and remove low-quality sequences and contaminants. These include programs such as FastQC [33], Cutadapt [34], BBDuk [35] and Trimmomatic [36] (Table 2). FastQ Screen [37] matches reads against multiple reference genomes such as human, mouse, Escherichia coli and yeast, and can provide a quick overview of where the reads align. Diginorm [38], implemented in the khmer package [39], can be used to reduce redundancy of reads in high-depth areas by down-sampling reads, and thus normalize coverage and make subsequent analyses computationally cheaper. MultiQC [40] aggregates quality control reports from multiple samples into a single report that can be viewed more easily. If the microbiome comes from a host with a sequenced genome, such as human, it is useful to identify and filter out host reads before further analysis. Alternatively, some taxonomic classifiers can include the host genome in their databases.

Table 2 A selection of quality control software tools for metagenomics data
Tool | Synopsis | Reference | Web site
FastQC | Quality control tool showing statistics such as quality values, sequence length distribution and GC content distribution | [33] | http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FastQ Screen | Screen a library against sequence databases to see if composition of library matches expectations | [37] | http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen
BBtools | BBDuk trims and filters reads using k-mers and entropy information. BBNorm normalizes coverage by down-sampling reads (digital normalization) | [35] | http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/
Trimmomatic | Flexible read trimming tool for Illumina data | [36] | http://www.usadellab.org/cms/?page=trimmomatic
Cutadapt | Find and remove adapter sequences, primers, poly-A tails and other types of unwanted sequence | [34] | https://cutadapt.readthedocs.io
khmer/diginorm | Tools for k-mer error trimming of reads and digital normalization of samples | [38, 39] | http://khmer.readthedocs.io
MultiQC | Summarize results from different analyses (such as FastQC) into one report | [40] | http://multiqc.info
Note: Most of these tools can also be used for other types of genome sequence data, e.g. whole-genome or RNA-seq data.

Figure 1 Common analysis procedures for metagenomics data. Note that the order of some of the analysis steps can be shuffled. For example, reads might be binned before assembly or before taxonomic assignment, so that the downstream algorithms can work only with a subset of the data.

After quality control, the reads can either be assembled into longer contiguous sequences called contigs or passed directly to taxonomic classifiers (Figure 1). Taxonomic classification of every read is a form of binning because it groups reads into bins corresponding to their taxon ID. Binning can also be done using other properties such as composition and co-abundance profiles, although those methods typically require assembly of reads into longer contigs, which provide better statistics for profiling [41]. (See [42] for a review of binning methods.)
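To illustrate what a composition profile can look like, the following minimal sketch computes tetranucleotide frequencies for contigs and compares them with a Euclidean distance. The choice of tetranucleotides and Euclidean distance, the function names `tetranucleotide_profile` and `profile_distance`, and the toy sequences are all assumptions made for illustration; they are not taken from any specific binning tool, and practical binners typically combine composition with co-abundance and more robust statistics.

```python
from itertools import product
import math

# All 256 possible tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def tetranucleotide_profile(seq):
    """Return the normalized tetranucleotide frequencies of a contig."""
    counts = [0] * len(TETRAMERS)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])  # windows containing N etc. are skipped
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(TETRAMERS)


def profile_distance(p, q):
    """Euclidean distance between two composition profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy contigs (hypothetical): two AT-rich sequences and one GC-rich sequence.
contig_a = tetranucleotide_profile("ATATATATTTAAATATAT")
contig_b = tetranucleotide_profile("TATATAAATTTATATATA")
contig_c = tetranucleotide_profile("GCGCGGCCGCGCGGGCCC")
print(profile_distance(contig_a, contig_b) < profile_distance(contig_a, contig_c))
```

Contigs with similar profiles would be candidates for the same bin. Short contigs give noisy profiles, which is one reason why such methods work better on assembled contigs than on individual reads.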
When the analysis only returns the estimated abundances of the different taxa (instead of a classification of each read), we call it taxonomic profiling. The choice of assembly-based analyses versus direct taxonomic classification of reads depends on the research question. Direct taxonomic classification is useful for quantitative community profiling and identification of organisms with close relatives in the database. Compared with marker gene-based community profiling, metagenomic shotgun sequencing alleviates biases from primer choice and enables the detection of organisms across all domains of life, assuming that DNA can be extracted from the target environment. Researchers can quantify the structure of microbial communities using ecological and biogeographic measures such as species diversity, richness and uniformity of the communities [22, 43]. In clinical microbiology, the focus is often on the presence or absence of infectious pathogens, which can be identified by matching reads against a reference database [44–47]. Even though human-associated microbes are comparatively well studied with many completed genomes in the reference database, some pathogens remain unsequenced, and others have only recently been discovered using metagenomics sequencing [48–52]. Insights into the functional potential of a microbiome can be gained by matching the reads against pathway or gene databases [53, 54]. Further discussion of functional analysis in metagenomics and metatranscriptomics can be found in [55]. When no close relative of a species is in the database, as often happens with samples from unexplored ecological niches, assembly and binning of the reads may be useful first steps in the analysis. Analysis of the binned draft genomes allows for a more qualitative understanding of the physiology of the uncultivated microbes. By identifying single-copy and conserved genes in the contig bins, taxonomy, genome completeness as well as contamination can be assessed [41, 56]. 
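The use of single-copy genes to assess a bin can be sketched with simple counting. The example below is a simplified illustration, not the procedure of any particular tool: it assumes that completeness is the fraction of expected single-copy markers seen at least once and that contamination is the number of surplus marker copies relative to the size of the marker set. The function name and the marker gene names are hypothetical.

```python
from collections import Counter


def completeness_and_contamination(expected_markers, observed_markers):
    """Estimate bin quality from single-copy marker genes.

    expected_markers: set of marker names expected exactly once per genome.
    observed_markers: list of marker names found in the bin (repeats allowed).
    Returns (completeness, contamination) as fractions of the expected set.
    """
    if not expected_markers:
        raise ValueError("expected_markers must not be empty")
    counts = Counter(m for m in observed_markers if m in expected_markers)
    n = len(expected_markers)
    found = sum(1 for m in expected_markers if counts[m] >= 1)
    # Every copy beyond the first suggests sequence from a second genome.
    surplus = sum(counts[m] - 1 for m in expected_markers if counts[m] > 1)
    return found / n, surplus / n


# Hypothetical marker set and bin content.
expected = {"rpsB", "rpsC", "rplA", "rplB", "recA"}
observed = ["rpsB", "rpsC", "rplA", "rplA", "recA"]
print(completeness_and_contamination(expected, observed))  # (0.8, 0.2)
```

In this toy bin, four of five markers are present (80% completeness), and one marker occurs twice (20% contamination under the simplified definition assumed here).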
Some recent findings from metagenomic (draft) assemblies include the identification of the enzymes used for oil and paraffin degradation by Smithella spp. [57–59] and insights into metabolic pathways and interactions between microbes in methanogenic bioreactors [60]. Metagenomic classification Metagenomic classification tools match sequences, typically reads or assembled contigs, against a database of microbial genomes to identify the taxon of each sequence. In the early days of metagenomics, the best strategy was to use BLAST [61] to compare each read with all sequences in GenBank. As the reference databases and the size of sequencing data sets have grown, alignment using BLAST has become computationally infeasible, leading to the development of metagenomics classifiers that provide much faster results, although usually with less sensitivity than BLAST. Some programs return an assignment of every read, while others only provide the overall composition of the sample. A variety of strategies have been used for the matching step: aligning reads, mapping k-mers, using complete genomes, aligning marker genes only or translating the DNA and aligning to protein sequences (Table 3). Recent studies have attempted to benchmark the performance of metagenomics classifiers based on both accuracy and speed [62, 63], although these studies are limited by their (unavoidable) reliance on simulated data. Taxonomic profiling with marker gene-based and other approaches Marker gene approaches identify sets of clade-specific, single-copy genes, so that the identification of one of these genes can be used as evidence that a member of the associated clade is present. This allows faster assignment because the database, even with a million or more genes (as in MetaPhlAn [81]), is far smaller than a database containing the full genomes for all species.
The assignment can then be made with fast, sensitive aligners, such as Bowtie2 [85], used by MetaPhlAn, and HMMER [86], used by Phylosift [87] and mOTU [82]. GOTTCHA [76] generates a database with unique genome signatures based on unique 24 base-pair fragments, which it indexes with bwa-mem [88]. GOTTCHA can output either binary classification (presence/absence calls) or a taxonomic profile, which is based on coverage of the genomic signatures. The use of single-copy marker genes should in principle make abundance estimation more precise, although it is impossible to know the copy number of a gene for a species with an incomplete genome. Because marker gene methods identify only a few genes per genome, most of the reads in a sample do not receive a classification at all; instead, these algorithms provide the microbial composition, expressed in terms of relative abundance for all taxa that they recognize in the sample. An alternative approach for metagenomics profiling is using the overlap of MinHash signatures [89] as implemented in Mash [83] and sourmash [84]. MinHashes allow one to estimate the similarity of data sets extremely efficiently, e.g. the overlap between all microbial genomes in GenBank and a metagenomics data set. The MinHash search databases are small and fast to build and search, allowing searches against the entire GenBank database on a laptop. Nucleotide taxonomic classification and quantification Kraken [64] was the first method to provide fast identification of all reads in a metagenomic sample. It accomplishes this using an algorithm that relies on exact k-mer matches, replacing alignment (which requires more computational work) with a simple table lookup. Kraken constructs a database that stores, with every k-mer in every genome, the species identifier (taxonomy ID) for that k-mer. When a k-mer is found in two or more taxa, Kraken stores the lowest-common ancestor (LCA) of those taxa with that k-mer.
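The k-mer-to-LCA idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not Kraken's implementation: the database is a plain Python dictionary, the taxonomy is a hypothetical parent map in which the root is its own parent, reverse complements are ignored, and k = 4 is used only to keep the example readable. All names (`path_to_root`, `lca`, `build_kmer_db`) and sequences are made up.

```python
def path_to_root(taxon, parent):
    """Return the taxa from `taxon` up to the root (inclusive)."""
    path = [taxon]
    while parent[taxon] != taxon:  # the root is its own parent
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lca(a, b, parent):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors_of_a = set(path_to_root(a, parent))
    for taxon in path_to_root(b, parent):
        if taxon in ancestors_of_a:
            return taxon
    raise ValueError("taxa do not share a root")


def build_kmer_db(genomes, parent, k):
    """Map every k-mer to the LCA of all taxa whose genome contains it."""
    db = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            db[kmer] = lca(db[kmer], taxon, parent) if kmer in db else taxon
    return db


# Hypothetical taxonomy and genomes.
parent = {"root": "root", "genus_G": "root", "species_A": "genus_G",
          "species_B": "genus_G", "species_C": "root"}
genomes = {"species_A": "ACGTACGA", "species_B": "ACGTTTGA",
           "species_C": "GGGTTTGA"}
db = build_kmer_db(genomes, parent, k=4)
print(db["CGTA"])  # species_A: unique to one genome
print(db["ACGT"])  # genus_G: shared by species_A and species_B
print(db["GTTT"])  # root: shared by species_B and species_C
```

Looking up a read then amounts to one dictionary query per k-mer; k-mers that are absent from the table simply contribute no evidence.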
Database k-mers and their taxa are saved in a compressed lookup table that can be rapidly queried for exact matches to k-mers found in the reads (or contigs) of a metagenomics data set. CLARK [65] uses a similar approach, building databases of species- or genus-level specific k-mers, and discarding any k-mers mapping to higher levels. Both Kraken and CLARK set k = 31 by default, although the database can be built with any length k-mer. The selection of k reflects an important trade-off between sensitivity and specificity: excessively long k-mers may fail to match because of sequencing errors or genuine differences among species and strains, while overly short k-mers will yield nonspecific (and false) matches to many genomes. An alternative approach to using fixed k-mers is spaced or adaptive (variable-length) seeds, which encode patterns for which only a subset of the bases has to match perfectly [90–92]. An extension of Kraken using spaced seeds shows somewhat better accuracy for family and genus-level classification, but lower precision at the species level [93]. A similar extension was developed for CLARK [66]. Note that Kraken maps reads to the taxonomic tree, not to a specific level such as species or genus. Bracken [94] is an extension of Kraken that estimates species- or genus-level abundance based on a Bayesian probability algorithm. The Livermore Metagenomics Analysis Toolkit (LMAT) [77] is a k-mer-based classifier that uses a smaller default k-mer size (k = 20) than Kraken and CLARK, but stores the list of source genomes with each k-mer instead of their lowest common taxonomic ancestor. LMAT includes microbial draft genomes as well as eukaryotic microbes in its ‘Grand’ database, which requires 500 GB RAM and classifies more reads than a database without draft genomes. LMAT-ML (for Marker Library) [78] implements more stringent k-mer pruning to retain only informative and nonoverlapping k-mers, which reduces the memory requirements to just 16 GB. 
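The idea behind the spaced seeds mentioned above, where only a subset of positions has to match, can be illustrated as follows. This is a minimal sketch under assumptions: the mask pattern, the sequences and the function names are made up for the example and do not correspond to the seed patterns used by any of the cited tools.

```python
from typing import Set


def apply_mask(kmer: str, mask: str) -> str:
    """Keep only the positions marked '1' in the mask; '0' positions are wildcards."""
    if len(kmer) != len(mask):
        raise ValueError("k-mer and mask must have the same length")
    return "".join(base for base, flag in zip(kmer, mask) if flag == "1")


def spaced_seeds(seq: str, mask: str) -> Set[str]:
    """Collect the masked form of every window of len(mask) in the sequence."""
    span = len(mask)
    return {apply_mask(seq[i:i + span], mask) for i in range(len(seq) - span + 1)}


MASK = "1101011"  # hypothetical pattern: 5 positions must match, 2 are free
reference = "ACGTTGCA"
read_with_mismatch = "ACTTTGC"  # differs from reference[0:7] at index 2

# A contiguous 7-mer lookup fails because of the single mismatch ...
exact_match = read_with_mismatch in reference
# ... while the spaced seed still matches, since index 2 is a wildcard position.
seed_match = apply_mask(read_with_mismatch, MASK) in spaced_seeds(reference, MASK)
```

The trade-off is the one described in the text: tolerating mismatches at the wildcard positions raises sensitivity, but fewer positions are checked per seed, so matches are less specific.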
K-mers can also be represented in de Bruijn graphs. Kallisto [67, 95], which was originally developed for RNA-Seq analysis, uses a colored de Bruijn graph [96] in which each edge (i.e. k-mer) is assigned a set of ‘colors’, where a color encodes a genome in which the k-mer has been found. Given a sample read, Kallisto finds approximately matching paths in the colored de Bruijn graph, an approach the authors term ‘pseudo-alignment’. After mapping, each read has a set of genomes associated with it. Kallisto then infers strain abundances using an expectation–maximization (EM) algorithm [67]. k-SLAM [68] is a novel k-mer-based approach that uses local sequence alignments and pseudo-assembly, which generates contigs that can lead to more specific assignments. Centrifuge [80] is a fast and accurate metagenomics classifier using the Burrows–Wheeler transform (BWT) and an FM-index to store and index the genome database. This strategy uses only about one-tenth the space of a Kraken index for the same database. Centrifuge also implements a feature that combines shared sequences from closely related genomes using MUMmer [97]. This greatly reduces redundancy for species where dozens of strains have been sequenced, further reducing the size of the index data structure. MEGAN6/MEGAN-CE [73] and taxator-tk [79] both use the output of a local sequence aligner such as BLAST [61, 71], DIAMOND [70] or LAST [91]. MEGAN uses the LCA of the alignment results as its taxonomic assignment. A Web-based interface allows interactive exploration and functional analysis of its results. Taxator-tk first merges overlapping regions from the query (found by the local alignment) into larger subsequences. The pairwise distances of the subsequences to reference genomes are determined and used for binning and taxonomic assignment. DUDes [74] computes taxonomic abundances from output of read aligners such as bwa-mem [88]. 
DUDes resolves ambiguities in mapping using an iterative approach that analyzes the read coverage of nodes in the taxonomic tree top-down, and uses permutation tests to select significant tree nodes. The algorithm can report multiple probable candidate strains or select the best candidate, instead of reporting just their LCA.

Taxonomer [75] provides a Web-based interface that enables fast classification of most reads. Taxonomer achieves fast classification by first binning reads into broad categories, and then classifying human, bacterial and fungal rRNA, labeling other reads as unknown. The visualization presents the results in interactive sunburst diagrams and enables the download of BIOM-formatted reports.

Fast amino acid database searches

Amino acid sequences are conserved at much greater evolutionary distances than DNA sequences, and this property can be exploited for more sensitive read classification, although the alignment step is slower. Both DIAMOND [70] and Kaiju [69] take this approach, comparing the six-frame translations of reads against protein databases.

DIAMOND uses double-indexing of both a reference protein database and the translated sample reads. Each index contains seed-location pairs, where each seed is an amino acid fragment. After lexicographically ordering each index, DIAMOND traverses both lists in parallel to find matches between the database and the sample. For every match, DIAMOND attempts to align the sequencing read against the database protein and reports high-scoring matches. MEGAN [72] calculates the taxonomic composition of samples from BLAST or DIAMOND results by applying the LCA approach to sequences with multiple matches.

Kaiju indexes the reference protein database using a BWT and saves each sequence in an FM-index table. This efficient database structure, similar to the one used in Centrifuge (described above), allows metagenomic sequences to be searched against a large protein database.
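To make the BWT and FM-index idea more concrete, the sketch below builds a tiny index over one protein-like string and counts exact matches by backward search. It is a textbook-style illustration under assumptions, not the data structure of Kaiju or Centrifuge: the suffix sort is naive, the occurrence table is stored uncompressed, and the sequence and function names are made up.

```python
from collections import Counter
from typing import Dict, List, Tuple


def bwt_from_text(text: str) -> str:
    """Burrows-Wheeler transform via a naive suffix sort; '$' terminates the text."""
    text += "$"
    suffix_order = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character of a suffix is the character preceding it (wrapping around).
    return "".join(text[i - 1] for i in suffix_order)


def build_fm_index(bwt: str) -> Tuple[Dict[str, int], Dict[str, List[int]]]:
    """Return (first, occ): first[c] counts characters smaller than c,
    occ[c][i] counts occurrences of c in bwt[:i]."""
    counts = Counter(bwt)
    first: Dict[str, int] = {}
    total = 0
    for char in sorted(counts):
        first[char] = total
        total += counts[char]
    occ: Dict[str, List[int]] = {char: [0] for char in counts}
    for symbol in bwt:
        for char in counts:
            occ[char].append(occ[char][-1] + (symbol == char))
    return first, occ


def count_matches(pattern: str, first: Dict[str, int],
                  occ: Dict[str, List[int]], n: int) -> int:
    """Backward search: number of exact occurrences of pattern in the indexed text."""
    lo, hi = 0, n  # half-open range of rows in the sorted suffix list
    for char in reversed(pattern):
        if char not in first:
            return 0
        lo = first[char] + occ[char][lo]
        hi = first[char] + occ[char][hi]
        if lo >= hi:
            return 0
    return hi - lo


protein = "MKVLAAGMKVL"  # made-up reference sequence
bwt = bwt_from_text(protein)
first, occ = build_fm_index(bwt)
n_mkvl = count_matches("MKVL", first, occ, len(bwt))
n_aag = count_matches("AAG", first, occ, len(bwt))
n_w = count_matches("W", first, occ, len(bwt))
```

Each pattern character narrows the row range with two table lookups, so the search cost depends on the query length rather than on the size of the indexed database.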
Given a metagenomic sample and the pre-built index, Kaiju first translates every read in all six reading frames, splitting the read at stop codons. Kaiju sorts all of the resulting protein fragments by length and compares each against the protein database, longest to shortest, finding and returning maximum exact matches.

Metagenomic assembly

Illumina sequencing technology, which is the most widely used sequencing method for metagenomics experiments today, generates read lengths in the range of 100–250 bp, with a typical sequencing run producing tens of millions of reads. Metagenomics experiments might generate hundreds of millions or even billions of reads from a single sample. Depending on the number of reads and the complexity of the microbial species in the sample, some genomes might be sequenced deeply, allowing the experimenter to try to assemble the original genome sequence, or parts of it, from the short reads. Genome assembly is a challenging problem, even for single genomes [98]; assembly of a mixed sample with many species in different abundances, as is necessary for a metagenomics sample, is even more complicated, requiring special-purpose assembly algorithms, reviewed and compared in [99, 100].

Perhaps the biggest problem is the highly uneven sequencing depth of different organisms in a metagenomics sample. Standard assemblers assume that depth of coverage is approximately uniform across a genome; this assumption helps the algorithm resolve repeats and remove erroneous reads. Relaxing this assumption means that any techniques within the assembler that rely on depth of coverage will no longer work.

A second issue that makes metagenomics assembly harder is the nonclonal nature of the organisms within a sample. For bacterial assembly (and for some eukaryotic assemblies), the source DNA can be grown up clonally, allowing the assembly algorithm to impose strict requirements for the percent identity between overlapping reads.
In this context, lower sequence identity between two reads implies that they came from two slightly divergent copies of a repeat in the genome. In a metagenomics sample, between-strain differences can look exactly the same as variation between repeats.

Third, the depth of coverage of a particular species is rarely high, unless that species is present in high quantities in the sample. Even with tens of millions of reads, a metagenomics sample is not likely to contain deep coverage of more than one or two species, unless the sample itself is simple, i.e. containing only a few species.

These and other issues mean that the results of metagenomics assembly will never be as good as those from assembly of a single, clonal organism. Nonetheless, assembly and binning of a metagenomics sample often succeed in merging many of the reads, resulting in contigs that are easier to align to a genome database or analyze without alignment. Here, we list current assemblers and contig binners that have been designed for metagenomics, also summarized in Table 4. An overview of the techniques used in assembly is given in [41, 98, 99]. For more discussion on contig binning and curation and validation of reconstructed genome bins, see [41].

Table 4. Tools for whole-genome assembly and metagenomics assembly

| Tool | Synopsis | Reference | Web site |
| --- | --- | --- | --- |
| Megahit | Co-assembly of metagenomic reads with variable k-mer lengths and low memory usage | [101] | https://github.com/voutcn/megahit |
| SPAdes | DBG assembler using multiple k-mers, works also for simple metagenomes | [102] | http://cab.spbu.ru/software/spades |
| MetaSPAdes | Extension of SPAdes with better assemblies with different abundances, conserved regions and strain mixtures | [103] | http://cab.spbu.ru/software/spades/ |
| Ray Meta | DBG assembler with fixed k-mer size | [104] | http://denovoassembler.sourceforge.net/ |
| MetaVelvet(-SL) | DBG assembler using fixed k-mer size. SL extension identifies and splits chimeric nodes | [105, 106] | http://metavelvet.dna.bio.keio.ac.jp |
| IDBA-UD | DBG assembler using multiple k-mer sizes, analyzes coverages between paths to give better assemblies in complex metagenomes with uneven coverage | [107] | http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/ |
| MetAMOS | Framework for metagenomic assembly, analysis and validation | [108] | http://metamos.readthedocs.io |
| MOCAT2 | Pipeline for read filtering, taxonomic profiling, assembly, gene prediction and functional analysis | [109] | http://mocat.embl.de/ |
| Anvi’o | Analysis and visualization platform for metagenomics assembly and binning | [110] | http://merenlab.org/software/anvio/ |
| Contig binning | | | |
| MaxBin | Efficient binning of metagenomic contigs based on EM algorithm using nucleotide composition | [111] | https://downloads.jbei.org/data/microbial_communities/MaxBin/MaxBin.html |
| CONCOCT | Bins contigs using nucleotide composition, coverage data in multiple samples and paired-end read information | [112] | https://github.com/BinPro/CONCOCT |
| COCACOLA | Bins contigs using read coverage, correlation, sequence composition and paired-end read linkage | [113] | https://github.com/younglululu/COCACOLA |
| MetaBAT | Metagenome binning with abundance and tetra-nucleotide frequencies | [114] | https://bitbucket.org/berkeleylab/metabat |
| VizBin | Visualization of metagenomic data based on nonlinear dimension reduction | [115] | http://claczny.github.io/VizBin/ |
| AbundanceBin | Binning method based on k-mer frequency in reads | [116] | http://omics.informatics.indiana.edu/AbundanceBin/ |
| GroopM | Identifies population genomes using differential coverage of contigs | [117] | http://ecogenomics.github.io/GroopM/ |
| MetaCluster | Read and contig binning in two rounds for low- and high-abundance organisms using various k-mer lengths | [118, 119] | http://i.cs.hku.hk/~alse/MetaCluster/ |
| PhyloPythiaS(+) | Assigns contigs to taxonomic bin using support vector machine trained on reference sequences | [120, 121] | https://github.com/algbioi/ppsp/wiki |
| Assembly and binning quality assessment | | | |
| MetaQuast | Evaluate and compare metagenomics assemblies based on alignments with reference genomes | [122] | http://quast.sourceforge.net/metaquast |
| BUSCO | Assess genome assembly and gene set completeness based on single-copy orthologs, also for eukaryotes | [123] | http://busco.ezlab.org/ |
| CheckM | Tools for assessing quality of (meta)genomic assemblies providing genome completion and contamination estimates, especially for bacteria and viruses | [56] | http://ecogenomics.github.io/CheckM/ |

Note: DBG, de Bruijn graph.

Assembly of reads into longer contiguous sequences (contigs)

MetaVelvet [106] and Ray Meta [104] are single k-mer de Bruijn graph assemblers for metagenomics data. MetaVelvet is an extension of the Velvet assembler [124] that decomposes the single de Bruijn graph into multiple subgraphs (ideally corresponding to different organisms) based on coverage information and graph connectivity. MetaVelvet-SL [105] improves the splitting of chimeric nodes (nodes that are shared between subgraphs of closely related species) and thus generates longer scaffolds than MetaVelvet. Ray Meta, conversely, constructs contigs by a heuristics-guided graph traversal.

The choice of k is important for single k-mer de Bruijn graph assemblers. Small k’s are more sensitive in making connections, but fail to resolve repeats. Large k’s may miss connections and are more sensitive to sequencing errors, but usually create longer contigs. Most current metagenomics assemblers thus generate contigs from iteratively constructed and refined de Bruijn graphs using multiple k-mer lengths. The IDBA assembler (Iterative De Bruijn Graph Assembler) [125] first implemented this approach going from small k’s to large k’s, replacing reads with preassembled contigs at each iteration. IDBA-UD [107] is a version of the IDBA assembler modified to tolerate uneven depth of coverage, as occurring in single-cell and metagenomics sequencing experiments.
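The trade-off in the choice of k described above can be seen on a toy example. The sketch below is an assumption-laden illustration rather than the algorithm of any of the cited assemblers: the two reads are made up, overlap by only four bases and both contain the 5-base repeat GATTA, and the helper names are hypothetical. It builds a de Bruijn graph for a small and a large k and reports branching nodes (unresolved repeats) and connected components (missed connections).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Nodes are (k-1)-mers; each k-mer in a read adds one prefix -> suffix edge."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph: Dict[str, Set[str]]) -> List[str]:
    """Nodes with more than one successor, i.e. places where the path is ambiguous."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


def count_components(graph: Dict[str, Set[str]]) -> int:
    """Number of weakly connected components (edge direction ignored)."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for node, successors in graph.items():
        neighbors[node].update(successors)
        for succ in successors:
            neighbors[succ].add(node)
    seen: Set[str] = set()
    components = 0
    for start in neighbors:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(neighbors[node] - seen)
    return components


# Made-up reads that overlap by "ACAG"; "GATTA" occurs once in each read.
READS = ["TCGATTACAG", "ACAGGATTAGC"]
summary = {}
for k in (4, 7):
    graph = de_bruijn(READS, k)
    summary[k] = (branching_nodes(graph), count_components(graph))
```

With k = 4 the two reads are joined into one component, but the repeat leaves a branching node; with k = 7 the repeat is resolved, but the 4-base overlap is too short to connect the reads, so the graph falls into two components. Iterating over several k values, as the multi-k assemblers above do, is a way to get the benefit of both.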
IDBA-UD first generates a de Bruijn graph from the reads using small k-mers (by default k = 20), and—after error correction—extracts contigs that are used as ‘reads’ in the graph construction with the next-higher k-mer size. IDBA-UD detects erroneous k-mers and k-mers from different genomes by looking at deviations from the average multiplicity of k-mers in a contig. This local thresholding allows IDBA-UD to more accurately decompose the de Bruijn graph. MetaSPAdes [103] is an extension of the SPAdes assembler [102], which was originally developed for bacterial genome and single-cell sequencing assembly. SPAdes/MetaSPAdes use an approach similar to IDBA with iterative de Bruijn graph refinement, but keeping the complete read information together with preassembled contigs at each step. MetaSPAdes implements various heuristics for graph simplification, filtering and storage to allow the assembly of large metagenomics data sets. Importantly, MetaSPAdes uses ‘strain-contigs’ to inform the assembly of high-quality consensus backbone sequences, which are often longer than contigs from other assemblers [126]. Megahit [101] is a fast assembler that uses a range of k-mers for iteratively improving the assembly. Megahit (which works for both metagenomics and single-genome sequencing data) uses a memory-efficient succinct de Bruijn graph representation [127] and can optionally run on CUDA-enabled graphics processing units in the graph construction step. By default, Megahit only keeps highly reliable k-mers that appear more than once, but implements a strategy to recover low-depth edges by taking additional k-mers from high-quality reads, which increases the contiguity of low-depth regions (‘mercy k-mers’). The aforementioned assemblers are for the short, accurate reads generated by Illumina sequencers. 
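Megahit's default of keeping only k-mers that appear more than once, as described above, amounts to a simple count filter. The following is a minimal sketch under assumptions: the reads, the value of k and the function name are made up, and the recovery of low-depth ‘mercy k-mers’ is not modeled.

```python
from collections import Counter
from typing import Iterable, Set


def solid_kmers(reads: Iterable[str], k: int, min_count: int = 2) -> Set[str]:
    """Return the k-mers observed at least min_count times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= min_count}


# The third read carries a single-base difference (possibly a sequencing error).
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
kept = solid_kmers(reads, k=4)
```

K-mers that contain the singleton variant are seen only once and are dropped, which removes many sequencing errors but also discards true k-mers from regions covered by a single read.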
Long-read sequencing technologies by Pacific Biosciences and Oxford Nanopore, with read lengths sometimes exceeding 10 000 bp, have great promise for microbial whole-genome sequencing [128], and are now being applied for metagenomics assembly in low-diversity communities [129]. While their lower throughput may limit their usefulness for complex metagenomes in the near future, they are revolutionizing the assembly and structural variant analysis of single genomes. As their throughput improves, these technologies have tremendous potential for metagenomic analysis as well.

Binning of contigs from closely related organisms

Short-read metagenome assemblies are often highly fragmented because of low coverage and interstrain variation, as explained above. Binning algorithms attempt to group contigs or scaffolds from the same or closely related organisms [41, 130], and subsequent analysis, such as taxonomic assignment and functional analysis, is then done on the bins instead of individual contigs [41]. Binning has been shown to cluster contigs even from rare species and can recover draft genomes from previously uncultivated bacteria [131]. The bins are sometimes referred to as ‘population genomes’, as the unsupervised binning usually cannot distinguish the genetic content of closely related organisms (strains) in complex microbial communities.

Binning algorithms can use taxonomic information from a reference database (taxonomy-dependent or supervised binning), or they can cluster sequences using statistical properties and/or contig coverage (unsupervised binning). Many current methods use a combination of these features. For supervised taxonomy-dependent binning, some of the methods described in the previous section on metagenomics classification can be used. When classifying contigs instead of reads, the search space is much smaller, and slower alignment or phylogenetic methods can be used.
For example, taxator-tk [79] uses BLAST, PhyloSift [87] searches for similarities to marker genes using Hidden Markov model profiles with HMMER and PhyloPythiaS(+) [120, 121] assigns reads to bins using a support vector machine model trained on reference sequences.

Taxonomy-independent binning does not require prior knowledge about the genomes in a sample, but relies on features inherent to the sequence set. Composition-based binning is based on the observation that overall genome composition, in terms of G/C content and di- and higher-order nucleotide frequencies, varies between organisms and is often characteristic of taxonomic lineages [132]. Clustering then can be done on sequence composition ‘fingerprints’ of the contigs [133]. MetaCluster [118, 119] bins reads by first grouping them based on long unique k-mers (k > 36) and merging groups based on tetranucleotide or pentanucleotide frequency distribution. MetaCluster 5.0 further uses 16-mer frequencies in a second round to bin contigs from low-abundance species in complex samples. VizBin [115] uses a dimensionality reduction mechanism based on self-organizing maps to visualize as well as cluster contigs into bins. Composition-based binning methods usually require fairly large contigs (>1–2 kb) to generate robust statistics. It can be difficult to separate contigs from closely related microorganisms whose nucleotide frequencies may be similar [134].

Some binning methods use coverage profiles across multiple samples, e.g. MGS Canopy [135] generates abundance profiles of gene calls and clusters them by co-abundance across samples. GroopM [117] identifies population genomes using differential coverage profiles of assembled contigs. CONCOCT [112] combines both tetranucleotide frequencies and differential abundances across multiple samples for binning. COCACOLA [113] works similarly to CONCOCT but uses different distance metrics and different clustering rules.
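The composition ‘fingerprints’ used for binning, as described above, can be illustrated with tetranucleotide frequency vectors. This is a minimal sketch under assumptions: the contigs are short made-up repeats rather than real assemblies, the plain Euclidean distance stands in for the tool-specific distances, and reverse-complement merging of 4-mers is not modeled.

```python
from collections import Counter
from itertools import product
from math import sqrt
from typing import List

# All 256 possible tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_profile(contig: str) -> List[float]:
    """Relative frequency of each 4-mer, in TETRAMERS order."""
    counts = Counter(contig[i:i + 4] for i in range(len(contig) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        raise ValueError("contig contains no unambiguous 4-mers")
    return [counts[t] / total for t in TETRAMERS]


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up contigs: two AT-rich sequences built from the same unit, one GC-rich.
at_unit = "ATATTAATATTTAAATATAT"
contig_a = at_unit * 5
contig_b = (at_unit[7:] + at_unit[:7]) * 5  # same unit, rotated
contig_c = "GCGGCCGCGCCGGCGCGGCC" * 5

profile_a = tetranucleotide_profile(contig_a)
profile_b = tetranucleotide_profile(contig_b)
profile_c = tetranucleotide_profile(contig_c)
d_ab = euclidean(profile_a, profile_b)
d_ac = euclidean(profile_a, profile_c)
```

A clustering step would then group contigs whose profiles are close, here `contig_a` with `contig_b`. On real data the profiles of short contigs are noisy, which is consistent with the contig length requirement noted in the text.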
MetaBAT [114] calculates composite probabilistic distances incorporating models of interspecies and intraspecies distances that were trained on sequenced genomes. MaxBin 2.0 [111] estimates the number of bins by counting single-copy marker genes and iteratively refines binning using an EM algorithm with probabilistic distances.

After binning, reads can be mapped back to the bins, and each bin can be reassembled, which has the potential to produce longer contigs if the binning was successful. Because each bin should contain only one taxonomic group, the reassembly can be done using either a specialized metagenomics assembler, such as those described above, or a single-genome assembler.

Validation of the assembly and binning is an important step in metagenomic genome reconstruction. MetaQUAST [122] computes genome statistics of metagenomics assemblies, and, by aligning against reference genomes, can report the number of misassemblies and mismatches. CheckM [56] and BUSCO [123] estimate both the completeness and the contamination of recovered genomes using lineage-specific single-copy marker genes and single-copy orthologs, respectively. When marker genes are missing, the genome is probably not complete, and if marker genes are present multiple times, it suggests contamination.

Assembly pipelines and analysis tool sets

Metagenomics assembly is a complicated process, involving quality control, assembly, contig binning, mapping of reads back to contigs, reassembly, gene annotation and visualization. Several analysis pipelines and visualization tools have been developed to facilitate this process.

MetAMOS [108] is a comprehensive pipeline for assembly and annotation of metagenomics samples. It can run multiple assemblers to create contigs and scaffolds. It then runs bacterial gene finders on the resulting contigs, and finally searches the predicted genes against a protein database to assign names and functions wherever possible.
Anvi’o [110] is another pipeline that combines assembly, alignment, binning and classification results in an interactive interface that allows one to refine the binning and assembly. MOCAT2 [109] integrates read filtering, taxonomic profiling with mOTU [82], assembly, gene prediction and annotation to output taxonomic as well as functional profiles of metagenomics samples.

Microbial taxonomy and genome resources and their impact on classification

Almost all of the methods described here rely on a database of genomes and on taxonomy of species. The accuracy and reliability of metagenomics analysis rely critically on these data resources. Here, we discuss several issues about both the data themselves (the genomes) and the taxonomy that we use to name and group all living species.

The NCBI Taxonomy database [136] provides the standard nomenclature and hierarchical taxon tree for GenBank, EMBL and DDBJ (which mirror one another, and which together comprise the International Nucleotide Sequence Database Collaboration, INSDC [137]), and thus for most metagenomic classifiers. Metataxonomic classifiers, on the other hand, often use the SILVA, RDP and Greengenes databases of ribosomal genes which, somewhat confusingly, have their own taxonomies [138]. Every sequence deposited within an INSDC database has a taxon identifier based on species information provided by the depositor.

The hierarchical concept of the taxonomy is convenient for benchmarking metagenomics classifiers, but several issues can make evaluation difficult and even misleading. The taxonomy concept was originally developed for multicellular eukaryotes, primarily plants and animals, and a common definition of ‘species’ is a group of organisms that can interbreed and produce fertile offspring [139]. This definition clearly does not work for prokaryotes, which reproduce asexually and have no distinction between somatic and germ line cells.
Making things more complicated is the (relatively rare) process of horizontal gene transfer, which in bacteria and archaea allows for the direct exchange of DNA across species barriers. Metagenomics classifiers may incorporate assumptions that are violated by the taxonomy or by the genome data itself, which will result in sequences being assigned to the wrong taxonomic ID. Here, we discuss some examples of how this can happen.

The same taxonomic level can contain different levels of sequence similarity. Although the set of species under a phylum represents a much wider range of diversity than the species within a genus, the level of similarity at a specific level of the tree is highly variable. A comparison of bacterial genomes present in GenBank (as of September 2014) showed that 6% of genomes with different species assignments have an average nucleotide identity (ANI) >93%, while 15% of genomes within the same species have an ANI <93% [139]. For example, Yersinia pseudotuberculosis and Yersinia pestis, which represent two distinct species, are over 98.5% identical, but Yersinia enterocolitica is <86% identical to either of them. Mycobacterium tuberculosis and Mycobacterium bovis have >99.6% identity, while the ANI of Mycobacterium leprae with either of them is <85%. Notably, the close Y. pestis and Y. pseudotuberculosis species are grouped together in the ‘species group’ Y. pseudotuberculosis complex, and M. tuberculosis and M. bovis are grouped in the species group M. tuberculosis complex. A well-known example of historic misplacement is Shigella [140], a genus that clearly falls within the E. coli species with ANIs above 97%, much higher than the ANIs of, for example, Escherichia fergusonii to E. coli of about 93%.
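To make the ANI comparison concrete, the following Python sketch checks a single fixed species threshold against the pairs quoted above. The ANI values are illustrative numbers placed at the bounds given in the text (for example 98.5 for "over 98.5%"), not measured values, and the last pair is a hypothetical same-species pair standing in for the 15% of within-species genomes with ANI <93%. The function name `disagreements` is likewise an assumption.

```python
# (genome A, genome B, same named species?, illustrative ANI in percent)
PAIRS = [
    ("Y. pestis", "Y. pseudotuberculosis", False, 98.5),
    ("Y. pestis", "Y. enterocolitica", False, 86.0),
    ("M. tuberculosis", "M. bovis", False, 99.6),
    ("M. tuberculosis", "M. leprae", False, 85.0),
    ("Shigella", "E. coli", False, 97.0),
    ("E. fergusonii", "E. coli", False, 93.0),
    ("hypothetical strain A", "hypothetical strain B", True, 92.0),  # assumed pair
]


def disagreements(pairs, threshold):
    """Pairs where 'ANI >= threshold means same species' contradicts the taxonomy."""
    wrong = []
    for a, b, same_species, ani in pairs:
        predicted_same = ani >= threshold
        if predicted_same != same_species:
            wrong.append((a, b, ani))
    return wrong


# Scan integer thresholds and report how many pairs each one gets wrong.
for threshold in (90, 93, 95, 99):
    print(threshold, len(disagreements(PAIRS, threshold)))
```

With these illustrative values, every threshold in the scan mislabels at least one pair: a low cutoff merges distinct named species, and a high cutoff splits the divergent same-species pair.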
The consequence of this variability for computational classifiers is that at the species or genus rank, different levels of sequence similarity in different parts of the taxonomic tree have a different meaning. This makes it impossible, for some taxa, to design consistent rules assigning reads or contigs (even long ones) to a species, and there is clearly no fixed percent-identity threshold that can be used to group sequences into the same species or genus.

The fungal taxonomy sometimes has two species and taxonomy IDs for the same organism. Fungi can have both teleomorphic (sexual reproductive stage) and anamorphic (asexual reproductive stage) phases. Historically, different names were given to the same fungi in the different stages. For example, Fusarium solani is a filamentous fungus whose spores are found in soil and plant debris, and which can cause keratitis [104]. This fungus is assigned to two different species in the NCBI taxonomy database: the anamorph is called F. solani and has taxonomy ID 169388, while the teleomorph is called Nectria haematococca with taxonomy ID 140110. Both taxa are listed as species in the genus Fusarium, and some sequences in GenBank are assigned to one taxonomy ID, and others to the other. (As of 28 May 2017, there were 6765 nucleotide sequences for F. solani and 16 643 for N. haematococca in GenBank.) The rules have since been updated to reflect a ‘one fungus, one name’ system [141], but it may take a long time to resolve the current multiplicity of names [142]. As a consequence, metagenomics classifiers might assign sequences to either taxon, and both would be correct, even though they appear to be different species.

Historically, no official species names were given to unculturable bacteria. Bacterial nomenclature is governed by the International Code of Nomenclature of Bacteria.
In 2001, it was decided that the designation of a new microbial species would require the identification of a type strain representing that species, and that the type strain had to be deposited in at least two different culture collections as pure (axenic) culture [143]. Most bacteria and archaea, though, cannot be cultured with current methods. All of these bacteria are given Candidatus names (i.e. the name Candidatus is prepended to the putative genus and species name) or are named only informally [144, 145], but are not covered by the standard nomenclature [146]. The NCBI taxon ‘unclassified Bacteria’, which contains several candidate divisions, is placed directly under the ‘Bacteria’ taxon node (see next paragraph). As of 28 May 2017, the NCBI taxonomy has 16 400 formal bacterial species and >280 000 informal ones.

Unclassified organism sequences and metagenomes are close to the root of the taxonomy. The NCBI databases contain sequences of bacteria, eukaryotes and viruses that thus far are not placed into the taxonomic hierarchy. As of 21 August 2017, NCBI had 2756 genomes for ‘unclassified bacteria’ (taxonomy ID 2323), 168 genomes for ‘unclassified viruses’ (taxonomy ID 12429) and 4 genomes for ‘unclassified viruses’ (taxonomy ID 12429). All these taxa are at high levels in the taxonomic tree, just below their superkingdoms. Furthermore, GenBank and the BLAST nr/nt database (https://www.ncbi.nlm.nih.gov/books/NBK62345/) contain thousands of ‘unclassified’ sequences (taxonomy ID 12908), especially from metagenomes (e.g. ‘human gut microbiome’, taxonomy ID 408170). Shared sequences of such taxa and properly placed organisms can present a challenge for metagenomics methods that attempt to cluster together sequences or compute the lowest common ancestor. Especially when using the BLAST nr/nt or nr databases, it may be useful to filter unclassified sequences, or include only microbial taxa, as is done by the Kaiju classifier [69] when including eukaryotes from nr.
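The effect of unclassified taxa on a lowest-common-ancestor (LCA) computation can be sketched in a few lines of Python. The parent map below is a deliberately tiny, simplified tree in which intermediate ranks are omitted; it is an assumption made for illustration, not a copy of the NCBI taxonomy, and the helper names `lineage`, `lowest_common_ancestor` and `drop_unclassified` are hypothetical.

```python
# Toy child -> parent map; many intermediate ranks are left out on purpose.
PARENT = {
    1: 1,           # root (its own parent)
    2: 1,           # Bacteria
    2323: 2,        # 'unclassified bacteria' (ID quoted in the text)
    543: 2,         # Enterobacteriaceae
    561: 543,       # Escherichia
    562: 561,       # Escherichia coli
    12908: 1,       # 'unclassified' sequences (ID quoted in the text)
    408170: 12908,  # metagenome entry quoted in the text
}


def lineage(taxid, parent=PARENT):
    """Path from a taxon up to the root, inclusive."""
    path = [taxid]
    while parent[taxid] != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lowest_common_ancestor(taxids, parent=PARENT):
    """Deepest node shared by the lineages of all given taxa."""
    taxids = list(taxids)
    if not taxids:
        raise ValueError("need at least one taxonomy ID")
    shared = set(lineage(taxids[0], parent))
    for taxid in taxids[1:]:
        shared &= set(lineage(taxid, parent))
    # Walking up from the first taxon, the first shared node is the deepest one.
    for node in lineage(taxids[0], parent):
        if node in shared:
            return node


def drop_unclassified(taxids, flagged=(12908, 2323), parent=PARENT):
    """Remove taxa that sit under any of the flagged 'unclassified' nodes."""
    return [t for t in taxids if not any(n in flagged for n in lineage(t, parent))]


hits = [562, 561, 408170]  # a read matching E. coli, Escherichia and a metagenome entry
print(lowest_common_ancestor(hits))                     # collapses to the root
print(lowest_common_ancestor(drop_unclassified(hits)))  # stays within Escherichia
```

In this toy tree, a single hit to the metagenome entry drags the LCA to the root, whereas filtering the flagged subtrees first keeps the assignment at the genus level.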
Taxonomy changes. One solution to some of the problems just listed is to rename or move the species in the microbial taxonomy. This does happen somewhat frequently, but the new names do not automatically percolate outward to every resource that has downloaded the genomes from GenBank. As a result, some benchmark genome sets used in metagenomics comparisons [148] have become outdated because some of the organisms have new names. This in turn can lead to mistaken conclusions when later studies download and reuse the data without going back to retrieve the original genomes from GenBank. NCBI taxonomy does keep track of all previous names of a taxon via synonyms; however, the taxonomy is not versioned, which makes it difficult to track or refer to a specific version.

Viruses and viral taxonomy

Most of the comments about bacterial genomes and taxonomy apply equally well to viruses, which thus far we have not discussed. Viruses do not have universally conserved genes such as the 16S and 18S rRNA genes, making it far more difficult to conduct systematic surveys of diversity. Nonetheless, it appears that the number and diversity of viral species may far exceed those of bacteria. A recent paper, for example, used metagenomic sequencing to discover >125 000 new DNA viruses [149], most of which encode proteins that have no sequence similarity to known isolates. Another study mined public databases to discover >12 000 new viral genomes linked to bacterial and archaeal hosts [150]. Faced with this rapid growth in the variety of viral species, a scientific consortium recently proposed a new framework for incorporating viruses discovered through metagenomic sequencing into the official taxonomy of the International Committee on Taxonomy of Viruses [151]. The relatively sparse sampling of the viral microbiome means that most viral species cannot yet be recognized by alignment of metagenomic samples to databases.
Viruses also mutate much more rapidly than bacteria, so even when a known virus is present, alignment algorithms may need to permit more mismatches to identify it. These and other issues mean that metagenomic analysis of viruses sometimes requires different methods from those used for bacteria, which are beyond the scope of this discussion; a recent review of such methods can be found in [152].

Microbial genome resources

The most commonly used reference genome databases are the complete and draft genomes at GenBank [153], which for more than a quarter century has been the repository for genome sequence data from around the world. Sequence records in GenBank are owned by the submitter, and only the submitter can update them. In the vast majority of cases, DNA sequence records are never altered after their original submission. GenBank relies on correct taxonomic identification and annotation provided by the submitter. Some genomes in GenBank have an incorrect species name, presumably because of labeling errors for bacterial samples. When such an error is discovered, NCBI (the home of GenBank) can request the submitter to update the record, but if the submitter does not respond, then NCBI can only suppress or flag the entry [142]. To avoid such errors, NCBI now performs a variety of quality checks when genomes are submitted to make sure that submitted genomes are not assigned to the wrong species [153].

An even bigger issue than incorrect species labels is contamination. The vast majority of genomes in GenBank today are ‘draft’ genomes (Table 5). These are genomes for which an assembly was generated from one or more sequencing data sets, but where most chromosomes are fragmented into many pieces. It is not uncommon for a draft genome to contain tens of thousands of such contigs. In any draft genome, some of the contigs might be contaminants, i.e. they might not belong to the species that was presumably sequenced, even though every contig is assigned to the same species.
Common contaminants include sequencing vectors and adaptors, nucleic acids that are commonly present in laboratories, such as those from E. coli and PhiX174 (a phage used as an Illumina sequencing control), and of course human DNA, which creeps into many sequencing projects by accident. If the laboratory that created the assembly did not screen out these contaminants, they are submitted to GenBank as part of the organism. GenBank itself runs a contaminant screen on all assemblies, and contigs that appear to be contaminants are reported back to the submitter, who is encouraged to remove them and resubmit. Despite the best efforts of GenBank curators, though, thousands of contaminants have already made their way into the draft genome data.

Table 5 Number of entries in commonly used reference databases

Domain    Level    Draft genomes       Complete genomes¹
                   GenBank   RefSeq    GenBank       RefSeq
Archaea   Entries  859       351       260 (20)      225 (12)
          Species  695       204       209 (14)      178 (7)
Bacteria  Entries  89 730    78 783    7314 (1346)   6973 (1066)
          Species  19 078    11 217    2677 (542)    2586 (406)
Fungi     Entries  1897      191       28 (414)      7 (38)
          Species  997       190       17 (68)       7 (36)
Protists  Entries  430       47        2 (49)        2 (27)
          Species  226       47        2 (38)        2 (26)
Viruses   Entries  3         3         0 (0)         7214 (22)
          Species  1         3         0 (0)         7073 (22)

¹Numbers in parentheses represent incomplete genome assemblies for which at least one chromosome was assembled. Data as of 27 May 2017.

The result of these contaminants is that reads from a metagenomics project will match some draft genomes extremely well because the metagenomics project has some of the same contaminants (e.g. fragments of E. coli or human DNA). This in turn leads to incorrect taxonomic classification, even though the computational tools performed perfectly. For example, a strain of Neisseria gonorrhoeae was found to be contaminated with fragments of cow and sheep DNA [154], a problem that was discovered after a metagenomics study of the cow microbiome detected this particular N. gonorrhoeae strain and reported it to the authors of the Kraken program, who in turn discovered that the mistake was in the data, not the software.

RefSeq provides an alternative.
The RefSeq project takes GenBank sequences and passes them through additional automated filters to produce a more curated genome resource [155]. RefSeq records are owned by NCBI and can be updated as needed to maintain annotation or to incorporate additional information. As shown in Table 5, ∼79 000 of ∼90 000 draft bacterial genomes are in RefSeq (data as of 27 May 2017). There are various reasons why genomes may be excluded from RefSeq, e.g. the assemblies are too highly fragmented. For bacteria, currently, the most common reason that a GenBank genome is not included is that it is derived from a metagenome (about half of the excluded genomes). Note that this is a current policy and inclusion criteria may change in the future. The rate of inclusion into RefSeq has been much slower for eukaryotic microbes; currently, it contains only 191 of 1897 fungal genome assemblies. RefSeq also includes the viral domain, for which it validates and indexes one viral genome per species (and sometimes per serotype). As of May 2017, there are >7000 viral genomes in RefSeq. In addition, the NCBI Viral Genomes Resource (https://www.ncbi.nlm.nih.gov/genome/viruses/) [156] provides links to other validated viral genomes that are ‘neighbors’ (i.e. strains) of viral species in RefSeq.

Genomes are assigned to species or strains. Until 2014, every new microbial genome submitted to NCBI was assigned a new taxonomy ID, even if it was an isolate of an existing species. Owing to the dramatic increase in the number of genome sequences, this policy was changed in 2014, and since then only novel species and higher microbial orders get new taxonomy IDs [147]. Previously assigned strain taxonomy IDs remain in the database, which means that a single species may have genomes both at species and strain levels. For E. coli, for example, RefSeq contains 5596 genomes (as of 28 June 2017), of which 3292 have the taxonomy ID of E. coli, and the remainder have one of 2223 distinct strain-level taxonomy IDs.
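One practical way to handle the mix of species-level and strain-level taxonomy IDs described above is to collapse strain-level assignments to their parent species before comparing abundances. The Python sketch below illustrates this on a toy node table. The two strain IDs (9000001 and 9000002), the rank labels and the function names are hypothetical, and the approach is one possible workaround rather than a method described in this review.

```python
from collections import defaultdict

# Toy node table: taxonomy ID -> (parent ID, rank).
NODES = {
    561: (1, "genus"),
    562: (561, "species"),      # Escherichia coli
    9000001: (562, "strain"),   # hypothetical strain-level ID
    9000002: (562, "strain"),   # hypothetical strain-level ID
}


def to_species(taxid, nodes=NODES):
    """Walk up the tree until a species-rank node is found; None if there is none."""
    seen = set()
    while taxid in nodes and taxid not in seen:
        seen.add(taxid)  # guards against cycles in a malformed table
        parent, rank = nodes[taxid]
        if rank == "species":
            return taxid
        taxid = parent
    return None


def species_counts(read_assignments, nodes=NODES):
    """Sum read counts per species, folding strain-level IDs into their species."""
    counts = defaultdict(int)
    for taxid, n_reads in read_assignments.items():
        counts[to_species(taxid, nodes)] += n_reads
    return dict(counts)


# Reads assigned partly to the species ID and partly to two strain IDs.
assignments = {562: 120, 9000001: 30, 9000002: 50}
print(species_counts(assignments))
```

Collapsing to species level loses strain resolution, so this sketch suits species-level abundance comparisons rather than strain tracking.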
Overall, ∼35% of the bacterial genomes in RefSeq and GenBank have strain-level IDs, and the remaining ∼65% have species-level IDs. This can be challenging for algorithms that try to characterize metagenomic samples at the strain level.

Conclusions

Next-generation sequencing provides a powerful tool to study the microbes in, on, and around us. A great variety of computational tools have been developed to assist in the analysis of metagenomics data sets, which are large and constantly changing as the technology of sequencing improves. Here, we reviewed methods for classification and assembly of metagenomics data. Classification methods determine the mixture of species in a sample, either by using marker genes to estimate their abundance or by assigning a taxonomic identifier to every read. Assembly methods take the raw read data and assemble reads from the same species into larger contigs, which in turn can be assigned taxonomic labels. We also discussed some of the challenges presented by inconsistencies in microbial taxonomy itself, and by contamination in the draft genomes that almost all methods rely on. Many of these problems may be solved over time, but while the data are in a constant state of flux, users need to remain aware of these issues, so that they can avoid potential pitfalls when analyzing large, complex metagenomics data sets.

Key Points

Classification methods for metagenomic reads rely on fast lookup algorithms to handle the enormous data sets generated by next-generation sequencing.
Metagenomic assembly methods can reconstruct large sections of the genomes of some species in a microbial community, if the sequencing depth is sufficient.
Genome databases are growing rapidly, but many draft genomes are contaminated with fragments of sequence from other species, which presents challenges for metagenomic analysis.
Microbial taxonomy is rapidly changing in the genome era, with many species being renamed and grouped into different clades.
Funding

This work was supported in part by the US National Institutes of Health (NIH grants R01-HG006677 and R01-GM083873) and by the US Army Research Office (grant W911NF-1410490).

Florian P. Breitwieser is a postdoctoral fellow at the Center for Computational Biology at Johns Hopkins School of Medicine. His research interests include metagenomics classification and visualization methods and their application to infectious disease diagnosis.

Jennifer Lu is a Biomedical Engineering PhD student in Steven Salzberg’s laboratory at the Center for Computational Biology at Johns Hopkins University. Her research focuses on computational genomics and the usage of sequencing for diagnosing microbial infections relating to human health and diseases.

Steven L. Salzberg is the Bloomberg Distinguished Professor of Biomedical Engineering, Computer Science and Biostatistics at Johns Hopkins University. His laboratory conducts research on DNA and RNA sequence analysis including genome assembly, transcriptome assembly, sequence alignment and metagenomics.

References

1 Marchesi JR, Ravel J. The vocabulary of microbiome research: a proposal. Microbiome 2015;3:31.
2 Scholz M, Ward DV, Pasolli E, et al. Strain-level microbial epidemiology and population genomics from shotgun metagenomics. Nat Methods 2016;13:435–8.
3 Moran MA, Satinsky B, Gifford SM, et al. Sizing up metatranscriptomics. ISME J 2013;7:237–43.
4 Baldrian P, López-Mondéjar R. Microbial genomics, transcriptomics and proteomics: new discoveries in decomposition research using complementary methods. Appl Microbiol Biotechnol 2014;98:1531–7.
5 Wilmes P, Heintz-Buschart A, Bond PL. A decade of metaproteomics: where we stand and what the future holds. Proteomics 2015;15:3409–17.
6 Beale DJ, Karpe AV, Ahmed W. Beyond metabolomics: a review of multi-omics-based approaches. In: Beale DJ, Kouremenos KA, Palombo EA (eds). Microbial Metabolomics: Applications in Clinical, Environmental, and Industrial Microbiology. Switzerland: Springer International Publishing, 2016, 289–312.
7 Franzosa EA, Hsu T, Sirota-Madi A, et al. Sequencing and beyond: integrating molecular ‘omics’ for microbial community profiling. Nat Rev Microbiol 2015;13:360–72.
8 Woese CR, Fox GE. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci USA 1977;74:5088–90.
9 Schoch CL, Seifert KA, Huhndorf S, et al. Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi. Proc Natl Acad Sci USA 2012;109:6241–6.
10 DeSantis TZ, Hugenholtz P, Larsen N, et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 2006;72:5069–72.
11 Cole JR, Chai B, Farris RJ, et al. The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis. Nucleic Acids Res 2005;33:D294–6.
12 Carlton JM, Angiuoli SV, Suh BB, et al. Genome sequence and comparative analysis of the model rodent malaria parasite Plasmodium yoelii yoelii. Nature 2002;419:512–19.
13 Caporaso JG, Kuczynski J, Stombaugh J, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods 2010;7:335–6.
14 Schloss PD, Westcott SL, Ryabin T, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 2009;75:7537–41.
15 Edgar RC. UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nat Methods 2013;10:996–8.
16 Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658–9.
17 Mahe F, Rognes T, Quince C, et al. Swarm: robust and fast clustering method for amplicon-based studies. PeerJ 2014;2:e593.
18 Callahan BJ, McMurdie PJ, Rosen MJ, et al. DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods 2016;13:581–3.
19 Callahan BJ, Sankaran K, Fukuyama JA, et al. Bioconductor workflow for microbiome data analysis: from raw reads to community analyses. F1000Res 2016;5:1492.
20 McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One 2013;8:e61217.
21 Siegwald L, Touzet H, Lemoine Y, et al. Assessment of common and emerging bioinformatics pipelines for targeted metagenomics. PLoS One 2017;12:e0169563.
22 Oulas A, Pavloudi C, Polymenakou P, et al. Metagenomics: tools and insights for analyzing next-generation sequencing data derived from biodiversity studies. Bioinform Biol Insights 2015;9:75–88.
23 D’Amore R, Ijaz UZ, Schirmer M, et al. A comprehensive benchmarking study of protocols and sequencing platforms for 16S rRNA community profiling. BMC Genomics 2016;17:55.
24 Kopylova E, Navas-Molina JA, Mercier C, et al. Open-source sequence clustering methods improve the state of the art. mSystems 2016;1:e00003-15.
25 Nguyen NP, Warnow T, Pop M, et al. A perspective on 16S rRNA operational taxonomic unit clustering using sequence similarity. NPJ Biofilms Microbiomes 2016;2:16004.
26 Brown CT, Hug LA, Thomas BC, et al. Unusual biology across a group comprising more than 15% of domain bacteria. Nature 2015;523:208–11.
27 Eloe-Fadrosh EA, Ivanova NN, Woyke T, et al. Metagenomics uncovers gaps in amplicon-based detection of microbial diversity. Nat Microbiol 2016;1:15032.
28 Shin J, Lee S, Go MJ, et al. Analysis of the mouse gut microbiome using full-length 16S rRNA amplicon sequencing. Sci Rep 2016;6:29681.
29 Salter SJ, Cox MJ, Turek EM, et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol 2014;12:87.
30 Brooks JP, Edwards DJ, Harwich MD Jr, et al. The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies. BMC Microbiol 2015;15:66.
31 Tremblay J, Singh K, Fern A, et al. Primer and platform effects on 16S rRNA tag sequencing. Front Microbiol 2015;6:771.
32 Clooney AG, Fouhy F, Sleator RD, et al. Comparing apples and oranges? Next generation sequencing and its impact on microbiome analysis. PLoS One 2016;11:e0148028.
33 Babraham Bioinformatics. FastQC a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
34 Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17:10–12.
35 DOE Joint Genome Institute. BBDuk guide. http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/
36 Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30:2114–20.
37 Babraham Bioinformatics. FastQ Screen. http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/
38 Titus Brown C, Howe A, Zhang Q, et al. A reference-free algorithm for computational normalization of shotgun sequencing data. arXiv e-prints 2012.
39 Crusoe MR, Alameldin HF, Awad S, et al. The khmer software package: enabling efficient nucleotide sequence analysis. F1000Res 2015;4:900.
40 Ewels P, Magnusson M, Lundin S, et al. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 2016;32:3047–8.
41 Sangwan N, Xia F, Gilbert JA. Recovering complete and draft population genomes from metagenome datasets. Microbiome 2016;4:8.
42 Mande SS, Mohammed MH, Ghosh TS. Classification of metagenomic sequences: methods and challenges. Brief Bioinform 2012;13:669–81.
43 Chiarucci A, Bacaro G, Scheiner SM. Old and new challenges in using species diversity for assessing biodiversity. Philos Trans R Soc Lond B Biol Sci 2011;366:2426–37.
44 Langelier C, Zinter MS, Kalantar K, et al. Metagenomic sequencing detects respiratory pathogens in hematopoietic cellular transplant patients. Am J Respir Crit Care Med 2017 [Epub ahead of print].
45 Salzberg SL, Breitwieser FP, Kumar A, et al. Next-generation sequencing in neuropathologic diagnosis of infections of the nervous system. Neurol Neuroimmunol Neuroinflamm 2016;3:e251.
46 Breitwieser FP, Pardo CA, Salzberg SL. Re-analysis of metagenomic sequences from acute flaccid myelitis patients reveals alternatives to enterovirus D68 infection. F1000Res 2015;4:180.
47 Schlaberg R, Chiu CY, Miller S, et al. Validation of metagenomic next-generation sequencing tests for universal pathogen detection. Arch Pathol Lab Med 2017;141:776–86.
48 Greninger AL, Messacar K, Dunnebacke T, et al. Clinical metagenomic identification of Balamuthia mandrillaris encephalitis and assembly of the draft genome: the continuing case for reference genome sequencing. Genome Med 2015;7:113.
49 Mongkolrattanothai K, Naccache SN, Bender JM, et al. Neurobrucellosis: unexpected answer from metagenomic next-generation sequencing. J Pediatric Infect Dis Soc 2017:piw066.
50 Kandathil AJ, Breitwieser FP, Sachithanandham J, et al. Presence of Human Hepegivirus-1 in a cohort of people who inject drugs. Ann Intern Med 2017;167:1–7.
51 Cuestas ML. New virus discovered in blood supply: Human Hepegivirus-1 (HHpgV-1). Rev Argent Microbiol 2016;48:180–1.
52 Berg MG, Lee D, Coller K, et al. Discovery of a novel human pegivirus in blood associated with hepatitis C virus co-infection. PLoS Pathog 2015;11:e1005325.
53 Truong DT, Tett A, Pasolli E, et al. Microbial strain-level population structure and genetic diversity from metagenomes. Genome Res 2017;27:626–38.
54 Hahn AS, Altman T, Konwar KM, et al. A geographically-diverse collection of 418 human gut microbiome pathway genome databases. Sci Data 2017;4:170035.
55 Niu SY, Yang J, McDermaid A, et al. Bioinformatics tools for quantitative and functional metagenome and metatranscriptome data analysis in microbes. Brief Bioinform 2017:bbx051.
56 Parks DH, Imelfort M, Skennerton CT, et al. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 2015;25:1043–55.
57 Tan B, de Araujo E Silva R, Rozycki T, et al. Draft genome sequences of three Smithella spp. obtained from a methanogenic alkane-degrading culture and oil field produced water. Genome Announc 2014;2:e01085-14.
58 Tan B, Nesbo C, Foght J. Re-analysis of omics data indicates Smithella may degrade alkanes by addition to fumarate under methanogenic conditions. ISME J 2014;8:2353–6.
59 Wawrik B, Marks CR, Davidova IA, et al. Methanogenic paraffin degradation proceeds via alkane addition to fumarate by ‘Smithella’ spp. mediated by a syntrophic coupling with hydrogenotrophic methanogens. Environ Microbiol 2016;18:2604–19.
60 Nobu MK, Narihiro T, Rinke C, et al. Microbial dark matter ecogenomics reveals complex synergistic networks in a methanogenic bioreactor. ISME J 2015;9:1710–22.
61 Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol 1990;215:403–10.
62 Lindgreen S, Adair KL, Gardner PP. An evaluation of the accuracy and speed of metagenome analysis tools. Sci Rep 2016;6:19233.
63 Kelley DR, Salzberg SL. Clustering metagenomic sequences with interpolated Markov models. BMC Bioinformatics 2010;11:544.
64 Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 2014;15:R46.
65 Ounit R, Wanamaker S, Close TJ, et al. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 2015;16:236.
66 Ounit R, Lonardi S. Higher classification sensitivity of short metagenomic reads with CLARK-S. Bioinformatics 2016;32:3823–5.
67 Bray NL, Pimentel H, Melsted P, et al. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 2016;34:525–7.
68 Ainsworth D, Sternberg MJE, Raczy C, et al. k-SLAM: accurate and ultra-fast taxonomic classification and gene identification for large metagenomic data sets. Nucleic Acids Res 2017;45:1649–56.
69 Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun 2016;7:11257.
70 Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods 2015;12:59–60.
Google Scholar Crossref Search ADS PubMed WorldCat 71 Camacho C , Coulouris G , Avagyan V , et al. BLAST+: architecture and applications . BMC Bioinformatics 2009 ; 10 : 421 . Google Scholar Crossref Search ADS PubMed WorldCat 72 Huson DH , Auch AF , Qi J , et al. MEGAN analysis of metagenomic data . Genome Res 2007 ; 17 : 377 – 86 . Google Scholar Crossref Search ADS PubMed WorldCat 73 Huson DH , Beier S , Flade I , et al. MEGAN community edition—interactive exploration and analysis of large-scale microbiome sequencing data . PLoS Comput Biol 2016 ; 12 : e1004957. Google Scholar Crossref Search ADS PubMed WorldCat 74 Piro VC , Lindner MS , Renard BY. DUDes: a top-down taxonomic profiler for metagenomics . Bioinformatics 2016 ; 32 : 2272 – 80 . Google Scholar Crossref Search ADS PubMed WorldCat 75 Flygare S , Simmon K , Miller C , et al. Taxonomer: an interactive metagenomics analysis portal for universal pathogen detection and host mRNA expression profiling . Genome Biol 2016 ; 17 : 111 . Google Scholar Crossref Search ADS PubMed WorldCat 76 Freitas TA , Li PE , Scholz MB , et al. Accurate read-based metagenome characterization using a hierarchical suite of unique signatures . Nucleic Acids Res 2015 ; 43 : e69 . Google Scholar Crossref Search ADS PubMed WorldCat 77 Ames SK , Hysom DA , Gardner SN , et al. Scalable metagenomic taxonomy classification using a reference genome database . Bioinformatics 2013 ; 29 : 2253 – 60 . Google Scholar Crossref Search ADS PubMed WorldCat 78 Gardner SN , Ames SK , Gokhale MB , et al. Searching more genomic sequence with less memory for fast and accurate metagenomic profiling . bioRxiv 2016 . WorldCat 79 Droge J , Gregor I , McHardy AC. Taxator-tk: precise taxonomic assignment of metagenomes by fast approximation of evolutionary neighborhoods . Bioinformatics 2015 ; 31 : 817 – 24 . Google Scholar Crossref Search ADS PubMed WorldCat 80 Kim D , Song L , Breitwieser FP , et al. 
Centrifuge: rapid and sensitive classification of metagenomic sequences . Genome Res 2016 ; 26 : 1721 – 9 . Google Scholar Crossref Search ADS PubMed WorldCat 81 Truong DT , Franzosa EA , Tickle TL , et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling . Nat Methods 2015 ; 12 : 902 – 3 . Google Scholar Crossref Search ADS PubMed WorldCat 82 Sunagawa S , Mende DR , Zeller G , et al. Metagenomic species profiling using universal phylogenetic marker genes . Nat Methods 2013 ; 10 : 1196 – 9 . Google Scholar Crossref Search ADS PubMed WorldCat 83 Ondov BD , Treangen TJ , Melsted P , et al. Mash: fast genome and metagenome distance estimation using MinHash . Genome Biol 2016 ; 17 : 132 . Google Scholar Crossref Search ADS PubMed WorldCat 84 Titus Brown C , Irber L. Sourmash: a library for MinHash sketching of DNA . J Open Source Softw 2016 ; 1 . WorldCat 85 Langmead B , Salzberg SL. Fast gapped-read alignment with Bowtie 2 . Nat Methods 2012 ; 9 : 357 – 9 . Google Scholar Crossref Search ADS PubMed WorldCat 86 Eddy SR. Accelerated profile HMM searches . PLoS Comput Biol 2011 ; 7 : e1002195 . Google Scholar Crossref Search ADS PubMed WorldCat 87 Darling AE , Jospin G , Lowe E , et al. PhyloSift: phylogenetic analysis of genomes and metagenomes . PeerJ 2014 ; 2 : e243 . Google Scholar Crossref Search ADS PubMed WorldCat 88 Li H , Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform . Bioinformatics 2010 ; 26 : 589 – 95 . Google Scholar Crossref Search ADS PubMed WorldCat 89 Broder AZ. On the Resemblance and Containment of Documents . Palo Alto, CA : Digital Systems Research Center , 1998 , 21 – 29 . Google Preview WorldCat COPAC 90 Ma B , Tromp J , Li M. PatternHunter: faster and more sensitive homology search . Bioinformatics 2002 ; 18 : 440 – 5 . Google Scholar Crossref Search ADS PubMed WorldCat 91 Kielbasa SM , Wan R , Sato K , et al. Adaptive seeds tame genomic sequence comparison . Genome Res 2011 ; 21 : 487 – 93 . 
Google Scholar Crossref Search ADS PubMed WorldCat 92 Noé L , Martin DEK. A coverage criterion for spaced seeds and its applications to support vector machine string kernels and k-mer distances . J Comput Biol 2014 ; 21 : 947 – 63 . Google Scholar Crossref Search ADS PubMed WorldCat 93 Břinda K , Sykulski M , Kucherov G. Spaced seeds improve k-mer-based metagenomic classification . Bioinformatics 2015 ; 31 : 3584 – 92 . Google Scholar Crossref Search ADS PubMed WorldCat 94 Lu J , Breitwieser FP , Thielen P , et al. Bracken: estimating species abundance in metagenomics data . PeerJ Comput Sci 2017 ; 3 : e104 . Google Scholar Crossref Search ADS WorldCat 95 Schaeffer L , Pimentel H , Bray N , et al. Pseudoalignment for metagenomic read assignment . Bioinformatics 2017 ; 33 : 2082 – 8 . Google Scholar Crossref Search ADS PubMed WorldCat 96 Iqbal Z , Caccamo M , Turner I , et al. De novo assembly and genotyping of variants using colored de Bruijn graphs . Nat Genet 2012 ; 44 : 226 – 32 . Google Scholar Crossref Search ADS PubMed WorldCat 97 Delcher AL , Phillippy A , Carlton J , et al. Fast algorithms for large-scale genome alignment and comparison . Nucleic Acids Res 2002 ; 30 : 2478 – 83 . Google Scholar Crossref Search ADS PubMed WorldCat 98 Nagarajan N , Pop M. Sequence assembly demystified . Nat Rev Genet 2013 ; 14 : 157 – 67 . Google Scholar Crossref Search ADS PubMed WorldCat 99 Ghurye JS , Cepeda-Espinoza V , Pop M. Metagenomic assembly: overview, challenges and applications . Yale J Biol Med 2016 ; 89 : 353 – 62 . Google Scholar PubMed WorldCat 100 Vollmers J , Wiegand S , Kaster AK. Comparing and evaluating metagenome assembly tools from a microbiologist’s perspective—not only size matters! PLoS One 2017 ; 12 : e0169662. Google Scholar Crossref Search ADS PubMed WorldCat 101 Li D , Liu CM , Luo R , et al. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph . Bioinformatics 2015 ; 31 : 1674 – 6 . 
Google Scholar Crossref Search ADS PubMed WorldCat 102 Bankevich A , Nurk S , Antipov D , et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing . J Comput Biol 2012 ; 19 : 455 – 77 . Google Scholar Crossref Search ADS PubMed WorldCat 103 Nurk S , Meleshko D , Korobeynikov A , et al. metaSPAdes: a new versatile metagenomic assembler . Genome Res 2017 ; 27 : 824 – 34 . Google Scholar Crossref Search ADS PubMed WorldCat 104 Boisvert S , Raymond F , Godzaridis E , et al. Ray Meta: scalable de novo metagenome assembly and profiling . Genome Biol 2012 ; 13 : R122 . Google Scholar Crossref Search ADS PubMed WorldCat 105 Afiahayati Sato K , Sakakibara Y. MetaVelvet-SL: an extension of the Velvet assembler to a de novo metagenomic assembler utilizing supervised learning . DNA Res 2015 ; 22 : 69 – 77 . Google Scholar Crossref Search ADS PubMed WorldCat 106 Namiki T , Hachiya T , Tanaka H , et al. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads . Nucleic Acids Res 2012 ; 40 : e155. Google Scholar Crossref Search ADS PubMed WorldCat 107 Peng Y , Leung HC , Yiu SM , et al. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth . Bioinformatics 2012 ; 28 : 1420 – 8 . Google Scholar Crossref Search ADS PubMed WorldCat 108 Treangen TJ , Koren S , Sommer DD , et al. MetAMOS: a modular and open source metagenomic assembly and analysis pipeline . Genome Biol 2013 ; 14 : R2 . Google Scholar Crossref Search ADS PubMed WorldCat 109 Kultima JR , Coelho LP , Forslund K , et al. MOCAT2: a metagenomic assembly, annotation and profiling framework . Bioinformatics 2016 ; 32 : 2520 – 3 . Google Scholar Crossref Search ADS PubMed WorldCat 110 Eren AM , Esen OC , Quince C , et al. Anvi’o: an advanced analysis and visualization platform for ‘omics data . PeerJ 2015 ; 3 : e1319 . 
Google Scholar Crossref Search ADS PubMed WorldCat 111 Wu YW , Simmons BA , Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets . Bioinformatics 2016 ; 32 : 605 – 7 . Google Scholar Crossref Search ADS PubMed WorldCat 112 Alneberg J , Bjarnason BS , de Bruijn I , et al. Binning metagenomic contigs by coverage and composition . Nat Methods 2014 ; 11 : 1144 – 6 . Google Scholar Crossref Search ADS PubMed WorldCat 113 Lu YY , Chen T , Fuhrman JA , et al. COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment and paired-end read LinkAge . Bioinformatics 2017 ; 33 : 791 – 8 . Google Scholar PubMed WorldCat 114 Kang DD , Froula J , Egan R , et al. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities . PeerJ 2015 ; 3 : e1165. Google Scholar Crossref Search ADS PubMed WorldCat 115 Laczny CC , Sternal T , Plugaru V , et al. VizBin - an application for reference-independent visualization and human-augmented binning of metagenomic data . Microbiome 2015 ; 3 : 1 . Google Scholar Crossref Search ADS PubMed WorldCat 116 Wu YW , Ye Y. A novel abundance-based algorithm for binning metagenomic sequences using l-tuples . J Comput Biol 2011 ; 18 : 523 – 34 . Google Scholar Crossref Search ADS PubMed WorldCat 117 Imelfort M , Parks D , Woodcroft BJ , et al. GroopM: an automated tool for the recovery of population genomes from related metagenomes . PeerJ 2014 ; 2 : e603. Google Scholar Crossref Search ADS PubMed WorldCat 118 Wang Y , Leung HC , Yiu SM , et al. MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample . Bioinformatics 2012 ; 28 : i356 – 62 . Google Scholar Crossref Search ADS PubMed WorldCat 119 Wang Y , Leung HC , Yiu SM , et al. MetaCluster 4.0: a novel binning algorithm for NGS reads and huge number of species . J Comput Biol 2012 ; 19 : 241 – 9 . 
Google Scholar Crossref Search ADS PubMed WorldCat 120 Patil KR , Roune L , McHardy AC. The PhyloPythiaS web server for taxonomic assignment of metagenome sequences . PLoS One 2012 ; 7 : e38581 . Google Scholar Crossref Search ADS PubMed WorldCat 121 Gregor I , Droge J , Schirmer M , et al. PhyloPythiaS+: a self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes . PeerJ 2016 ; 4 : e1603. Google Scholar Crossref Search ADS PubMed WorldCat 122 Mikheenko A , Saveliev V , Gurevich A. MetaQUAST: evaluation of metagenome assemblies . Bioinformatics 2016 ; 32 : 1088 – 90 . Google Scholar Crossref Search ADS PubMed WorldCat 123 Simao FA , Waterhouse RM , Ioannidis P , et al. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs . Bioinformatics 2015 ; 31 : 3210 – 2 . Google Scholar Crossref Search ADS PubMed WorldCat 124 Zerbino DR , Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs . Genome Res 2008 ; 18 : 821 – 9 . Google Scholar Crossref Search ADS PubMed WorldCat 125 Peng Y , Leung HC , Yiu SM , et al. IDBA–a practical iterative de Bruijn graph de novo assembler. In: 14th Annual International Conference, RECOMB 2010, Lisbon, Portugal, 25-28 April 2010 . In Research in Computational Molecular Biology. Springer-Verlag , Berlin Heidelberg , 2010 ; vol. 6044, 426 – 40 . Google Preview WorldCat COPAC 126 Sczyrba A , Hofmann P , Belmann P , et al. Critical assessment of metagenome interpretation—a benchmark of computational metagenomics software . bioRxiv 2017 . WorldCat 127 Bowe A , Onodera T , Sadakane K , et al. Succinct de Bruijn Graphs. In: Algorithms in Bioinformatics . Berlin, Heidelberg : Springer , 2012 , 225 – 35 . Google Preview WorldCat COPAC 128 Koren S , Phillippy AM. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly . Curr Opin Microbiol 2015 ; 23 : 110 – 20 . 
Google Scholar Crossref Search ADS PubMed WorldCat 129 Driscoll CB , Otten TG , Brown NM , et al. Towards long-read metagenomics: complete assembly of three novel genomes from bacteria dependent on a diazotrophic Cyanobacterium in a freshwater lake co-culture . Stand Genomic Sci 2017 ; 12 : 9 . Google Scholar Crossref Search ADS PubMed WorldCat 130 Sedlar K , Kupkova K , Provaznik I. Bioinformatics strategies for taxonomy independent binning and visualization of sequences in shotgun metagenomics . Comput Struct Biotechnol J 2017 ; 15 : 48 – 55 . Google Scholar Crossref Search ADS PubMed WorldCat 131 Albertsen M , Hugenholtz P , Skarshewski A , et al. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes . Nat Biotechnol 2013 ; 31 : 533 – 8 . Google Scholar Crossref Search ADS PubMed WorldCat 132 Land M , Hauser L , Jun SR , et al. Insights from 20 years of bacterial genome sequencing . Funct Integr Genomics 2015 ; 15 : 141 – 61 . Google Scholar Crossref Search ADS PubMed WorldCat 133 Dick GJ , Andersson AF , Baker BJ , et al. Community-wide analysis of microbial genome sequence signatures . Genome Biol 2009 ; 10 : R85 . Google Scholar Crossref Search ADS PubMed WorldCat 134 Vernikos G , Medini D , Riley DR , et al. Ten years of pan-genome analyses . Curr Opin Microbiol 2015 ; 23 : 148 – 54 . Google Scholar Crossref Search ADS PubMed WorldCat 135 Nielsen HB , Almeida M , Juncker AS , et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes . Nat Biotechnol 2014 ; 32 : 822 – 8 . Google Scholar Crossref Search ADS PubMed WorldCat 136 Federhen S. The NCBI taxonomy database . Nucleic Acids Res 2012 ; 40 : D136 – 43 . Google Scholar Crossref Search ADS PubMed WorldCat 137 Cochrane G , Karsch-Mizrachi I , Takagi T , et al. The international nucleotide sequence database collaboration . Nucleic Acids Res 2016 ; 44 : D48 – 50 . 
Google Scholar Crossref Search ADS PubMed WorldCat 138 Balvočiūtė M , Huson DH. SILVA, RDP, Greengenes, NCBI and OTT—how do these taxonomies compare? BMC Genomics 2017 ; 18 : 114 . Google Scholar Crossref Search ADS PubMed WorldCat 139 Rosselló-Móra R , Amann R. Past and future species definitions for Bacteria and Archaea . Syst Appl Microbiol 2015 ; 38 : 209 – 16 . Google Scholar Crossref Search ADS PubMed WorldCat 140 Lan R , Reeves PR. Escherichia coli in disguise: molecular origins of Shigella . Microbes Infect 2002 ; 4 : 1125 – 32 . Google Scholar Crossref Search ADS PubMed WorldCat 141 Taylor JW. One Fungus = One Name: DNA and fungal nomenclature twenty years after PCR . IMA Fungus 2011 ; 2 : 113 – 20 . Google Scholar Crossref Search ADS PubMed WorldCat 142 Federhen S. Type material in the NCBI taxonomy database . Nucleic Acids Res 2015 ; 43 : D1086 – 98 . Google Scholar Crossref Search ADS PubMed WorldCat 143 Lapage SP , Sneath P , Lessel EF , et al. International Code of Nomenclature of Bacteria: Bacteriological Code, 1990 Revision . Washington, DC : ASM Press , 2010 . Google Preview WorldCat COPAC 144 Murray RG , Stackebrandt E. Taxonomic note: implementation of the provisional status Candidatus for incompletely described procaryotes . Int J Syst Bacteriol 1995 ; 45 : 186 – 7 . Google Scholar Crossref Search ADS PubMed WorldCat 145 Konstantinidis KT , Rosselló-Móra R. Classifying the uncultivated microbial majority: a place for metagenomic data in the Candidatus proposal . Syst Appl Microbiol 2015 ; 38 : 223 – 30 . Google Scholar Crossref Search ADS PubMed WorldCat 146 Parker CT , Tindall BJ , Garrity GM. International code of nomenclature of prokaryotes . Int J Syst Evol Microbiol 2015 . doi: 10.1099/ijsem.0.000778. WorldCat 147 Federhen S , Clark K , Barrett T , et al. Toward richer metadata for microbial sequences: replacing strain-level NCBI taxonomy taxids with BioProject, BioSample and Assembly records . Stand Genomic Sci 2014 ; 9 : 1275 – 7 . 
Google Scholar Crossref Search ADS PubMed WorldCat 148 Mende DR , Waller AS , Sunagawa S , et al. Assessment of metagenomic assembly using simulated next generation sequencing data . PLoS One 2012 ; 7 : e31386. Google Scholar Crossref Search ADS PubMed WorldCat 149 Paez-Espino D , Eloe-Fadrosh EA , Pavlopoulos GA , et al. Uncovering Earth’s virome . Nature 2016 ; 536 : 425 – 30 . Google Scholar Crossref Search ADS PubMed WorldCat 150 Roux S , Hallam SJ , Woyke T , et al. Viral dark matter and virus-host interactions resolved from publicly available microbial genomes . Elife 2015 ; 4 :e08490. WorldCat 151 Simmonds P , Adams MJ , Benko M , et al. Consensus statement: virus taxonomy in the age of metagenomics . Nat Rev Microbiol 2017 ; 15 : 161 – 8 . Google Scholar Crossref Search ADS PubMed WorldCat 152 Simmonds P. Methods for virus classification and the challenge of incorporating metagenomic sequence data . J Gen Virol 2015 ; 96 : 1193 – 206 . Google Scholar Crossref Search ADS PubMed WorldCat 153 Benson DA , Cavanaugh M , Clark K , et al. GenBank . Nucleic Acids Res 2017 ; 45 : D37 – 42 . Google Scholar Crossref Search ADS PubMed WorldCat 154 Merchant S , Wood DE , Salzberg SL. Unexpected cross-species contamination in genome sequencing projects . PeerJ 2014 ; 2 : e675. Google Scholar Crossref Search ADS PubMed WorldCat 155 Tatusova T , Ciufo S , Federhen S , et al. Update on RefSeq microbial genomes resources . Nucleic Acids Res 2015 ; 43 : D599 – 605 . Google Scholar Crossref Search ADS PubMed WorldCat 156 Brister JR , Ako-Adjei D , Bao Y , et al. NCBI viral genomes resource . Nucleic Acids Res 2015 ; 43 : D571 – 7 . Google Scholar Crossref Search ADS PubMed WorldCat © The Author 2017. Published by Oxford University Press. All rights reserved. 
For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
journal article
LitStream Collection
Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes

Olson, Nathan D; Treangen, Todd J; Hill, Christopher M; Cepeda-Espinoza, Victoria; Ghurye, Jay; Koren, Sergey; Pop, Mihai

2019 Briefings in Bioinformatics

doi: 10.1093/bib/bbx098 pmid: 28968737

Abstract

Metagenomic samples are snapshots of complex ecosystems at work. They comprise hundreds of known and unknown species, contain multiple strain variants and vary greatly within and across environments. Many microbes found in microbial communities are not easily grown in culture making their DNA sequence our only clue into their evolutionary history and biological function. Metagenomic assembly is a computational process aimed at reconstructing genes and genomes from metagenomic mixtures. Current methods have made significant strides in reconstructing DNA segments comprising operons, tandem gene arrays and syntenic blocks. Shorter, higher-throughput sequencing technologies have become the de facto standard in the field. Sequencers are now able to generate billions of short reads in only a few days. Multiple metagenomic assembly strategies, pipelines and assemblers have appeared in recent years. Owing to the inherent complexity of metagenome assembly, regardless of the assembly algorithm and sequencing method, metagenome assemblies contain errors. Recent developments in assembly validation tools have played a pivotal role in improving metagenomics assemblers. Here, we survey recent progress in the field of metagenomic assembly, provide an overview of key approaches for genomic and metagenomic assembly validation and demonstrate the insights that can be derived from assemblies through the use of assembly validation strategies. We also discuss the potential for impact of long-read technologies in metagenomics. We conclude with a discussion of future challenges and opportunities in the field of metagenomic assembly and validation.

metagenomics, microbiome, metagenomic assembly, assembly validation, variant discovery

Introduction

Shotgun sequencing of microbial communities, metagenomics, has emerged as a key tool for investigating the composition, evolutionary history and function of communities comprising previously uncultured and unsequenced organisms.
Assembling metagenomic sequencing data can provide a more complete picture of a microbial community compared with performing analyses directly on the reads [1–4]. However, assembly and metagenomic assembly are complex computational tasks. The complexity of the assembly problem stems from DNA segments that are repeated within the same organism or shared between distinct organisms. Intragenomic repeats have long been recognized as a challenge in assembly of isolate genomes [5], while metagenomic assembly is further complicated by the combination of intragenomic and intergenomic repeats. It has been shown that assembly complexity is directly tied to the ratio of the sequencing read length and the length of repeats [6]. While intragenomic repeats are generally small (usually under ∼10 000 bp in bacteria [7, 8]), intergenomic repeats can span nearly the entire chromosome for closely related strains. This perhaps unintuitive statement is easiest to see with an example: just one gene (e.g. a virulence gene) may distinguish two closely related bacterial genomes in a community, so close to the entire genome can be viewed as an intergenomic repeat. Sequencing strategies for metagenomics are currently dominated by short, high-throughput sequencing technologies, such as the Illumina NextSeq and HiSeq. These technologies can produce billions of highly accurate 100–300 bp reads within a few days and are cost-effective for most large-scale microbiome research projects. Metagenome sequence assembly algorithms have been largely based on a de Bruijn graph (DBG) paradigm, which is effective for accurate [9] and efficient assembly of large metagenomic data sets [10]. New advances in sequencing methods, such as single-molecule sequencing, synthetic long reads and Hi-C, along with new assembly and scaffolding algorithms, have the potential to significantly improve the contiguity and quality of metagenome assemblies, and are an emerging area of research interest.
To date, however, the use of such technologies in metagenomic settings has been limited because of the complex sample processing requirements and cost. Owing to the complexity of the metagenome assembly problem, regardless of the assembly algorithm used or sequencing method, metagenome assemblies are incomplete and contain errors. Methods for evaluating the quality and completeness of assemblies are critical for informing downstream analyses of the assembled data, and for allowing researchers to compare different tools that could be used for assembly. Assembly validation methods fall into two broad categories: reference-based and de novo. Reference-based validation methods compare the assembly with a database containing previously assembled genes or genomes [11, 12]. They assess as errors any differences identified between the assembled data and the reference collection. In contrast, de novo methods rely on features of the assembled data itself, seeking to identify internal inconsistencies indicative of potential assembly errors. Reference-based methods are particularly effective in benchmarking experiments attempting to reconstruct communities with known composition; however, these methods have limited effectiveness in real data sets. For example, metagenomic segments originating from a genome for which no reference sequence is available cannot be verified through a reference-based approach. It is also difficult to determine whether differences between an assembled contig and a reference sequence are assembly errors or true differences between the reference and its relative within the metagenomic mixture.

Current methods for metagenome assembly

The (meta)genome assembly problem can be formulated as a graph traversal problem, finding a path through a complex graph satisfying the constraints imposed by the data provided to the assembler [13]. Metagenome assembly is accomplished de novo by reconstructing genomes directly from the read data [13].
Despite the development of dozens of implementations for de novo assembly, algorithmic challenges posed by repeats remain prohibitive (Figure 1).

Figure 1. The challenge of repeats in metagenomes. Three genomes are used to depict intragenomic (A) and intergenomic (B) repeats. The dark blue and light blue genomes represent two closely related strains and the green genome an unrelated strain. Within the genomes, the red, orange and tan blocks represent inparalogs. The yellow blocks represent a horizontal gene transfer event between the light blue and green genomes. In traditional assembly, any reads longer than the inparalog blocks (red; orange) would be sufficient to fully resolve the genome. In metagenomic assembly, reads longer than the full syntenic block (gray) would be necessary.

The problem of reconstructing a mixture of genomes is further complicated by uneven and unknown representation of the different organisms within a metagenomic mixture. Owing to uneven sequencing coverage within a metagenome, coverage heuristics used for isolate genome assembly cannot be readily used to accurately disentangle repetitive sequence in metagenomes [14]. This problem is further aggravated by the presence of intergenomic and intragenomic repeats (Figure 1).
Effectively, one can view the task of the assembler as reconstructing not just one path through a graph, but a multitude of paths that come together and split apart at different places. This problem is not exclusive to metagenomic assembly: polyploid plant and animal genomes cause similar issues for isolate genome assembly. Prior work in polyploid genomes has provided methods for estimating ploidy based on sequencing depth, k-mer distribution and genotype heterogeneity [15]. Even so, despite initial attempts, algorithms developed for single-genome assembly have not been successfully applied directly to metagenomics data. Instead, several approaches have been developed that explicitly consider the specific characteristics of metagenomic data. Below, we describe a few of these approaches. The main approaches described here generally try to ‘hide’ the complexity imposed by metagenomic data. We will later discuss in more detail approaches that specifically try to identify strain variants in the data. Before doing so, however, we provide a short overview of the DBG approach for assembly and the impact of the parameters of this approach on the assembly process. Briefly, the reads are converted to a graph as follows. Each read is decomposed into overlapping segments of equal length k, usually termed k-mers. The k-mers become the edges of the graph, and each edge connects the node for the first k−1 bases of its k-mer to the node for the last k−1 bases. It is also possible to define the DBG in a node-centric manner (edges indicate k−1 matching bases between nodes; see [16] for more details). In a metagenomic context, one looks for multiple paths through the graph that collectively ‘explain’ all the edges. While an exact algorithm exists that can solve the traversal problem efficiently, it can only find one out of the many possible traversals of the graph that are consistent with the set of reads, reconstructing a possibly incorrect sequence representing a rearrangement of the genome(s).
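The graph construction just described can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the implementation used by any assembler discussed here: the function names `build_de_bruijn` and `branching_nodes` are hypothetical, the graph is edge-centric (nodes are (k−1)-mers, each k-mer is an edge), and reverse complements and sequencing errors are ignored.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy edge-centric de Bruijn graph (hypothetical helper).

    Nodes are (k-1)-mers. Each k-mer occurrence in a read adds one to the
    multiplicity of the edge from its (k-1)-base prefix to its suffix.
    Reverse complements are ignored to keep the sketch short.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        # A read of length L contributes L - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def branching_nodes(graph):
    """Return nodes with more than one distinct successor.

    These are the points where a traversal is ambiguous, for example
    because of a repeat, a strain variant or a sequencing error.
    """
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Two toy reads that share sequence but diverge after "ACG".
reads = ["ACGTAC", "GTACGA"]
graph = build_de_bruijn(reads, k=3)

for node in sorted(graph):
    print(node, dict(graph[node]))
print("ambiguous nodes:", branching_nodes(graph))
```

In this toy input the node CG has two successors (GT and GA), so any traversal reaching it must choose a branch or stop; real assemblers face the same decision at every repeat boundary.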
The above formulation allows traversals that are not necessarily consistent with the input sequences; adding the reads themselves as a constraint leads to the ‘Eulerian superpaths’ [13] formulation. In general, there are often multiple valid traversals of the DBG, and identifying the correct one is the source of computational complexity in the assembler [6]. Most assemblers rely on heuristics to incorporate further information in the reconstruction process to bias the reconstruction toward the correct sequence. When ambiguities cannot be resolved, the assemblers break the traversal, leading to fragmented reconstructions of the original genome(s). Several factors impact the performance of DBG assemblers: (i) sequencing errors, (ii) repeats, (iii) the presence of strain variants and (iv) the depth of sequencing coverage. The interplay between these factors drives the choice of optimal k-mer size for a specific application as well as the ultimate performance of an assembler. Sequencing errors create ‘false’ k-mers, thereby increasing the complexity of the graph and making it more difficult to identify an unambiguous reconstruction of a sequence. Every error impacts at most k different k-mers; thus, the impact of sequencing errors increases with the size of k. The de Bruijn formulation above assumes perfect data. In practice, sequencing errors introduce false k-mers in the graph, increasing the size of the graph and adding ambiguity in the reconstruction. As a result, assemblers often include a ‘correction’ step or assume precorrected data as input. Initial de Bruijn assemblers used spectral correction [13], which attempts to make a minimum number of changes in a sequence to make it consistent with ‘correct’ or ‘solid’ k-mers [17, 18]. These correction strategies use a fixed k-mer count threshold to define ‘correct’ k-mers, a strategy that is insufficient in metagenomic data sets with varying coverage levels.
More recent approaches to correction have been proposed that can correct data without assuming uniform coverage [10, 19, 20]. Repeats create additional edges in the graph, increasing the number of possible traversals. This creates ambiguity in the reconstruction of the genome, as a larger possible space of solutions must be explored [6]. Without further information, an assembler can either choose one of the branches at random, possibly leading to assembly errors, or simply decide to break the assembly, leading to fragmented results. The larger the value of k, the fewer nodes in the graph are repetitive, and thus, the easier it is to reconstruct large segments of a genome. Strain variants create a challenge similar to that of sequencing errors, and in highly polymorphic samples, the assembly result will likely be fragmented [21]. Finally, the depth of coverage impacts the connectivity of the assembly graph. A path stretching from one read to the next until it covers an entire genome can only be found if adjacent reads share k-mers. At low depths of coverage, adjacent reads are only expected to overlap by a small extent, and as a result, assembly is only possible for small values of k. To summarize the points made above, large values of k reduce the complexity of the graph and the impact of repeats, but using such values requires longer sequences (longer than the k-mer size) and higher depth of coverage, and leads to an increased impact of sequencing errors (each error impacts up to k different k-mers). Assuming uniform error and random sequencing, it is possible to compute the expected surviving coverage for a given k-mer size and input coverage [22]. These trade-offs represent a key component of the algorithmic choices made by assembly software and also guide the empirical choices made by users of assembly tools. In the following section, we highlight three published algorithms developed specifically for metagenome assembly that performed well in a recent review [23].
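The trade-offs summarized above can be made concrete with a small back-of-the-envelope calculation. The Python sketch below is an illustration under simplifying assumptions and is not necessarily the model used in [22]: reads all have the same length, errors are independent with a uniform per-base rate, and the function names `kmers_hit_by_error` and `expected_kmer_coverage` are hypothetical. It uses the common approximation that a read of length L yields L − k + 1 k-mers, so base coverage C corresponds to k-mer coverage C(L − k + 1)/L, of which a fraction (1 − e)^k is expected to be error-free.

```python
def kmers_hit_by_error(read_len, pos, k):
    """Count the k-mers of one read that contain position `pos` (0-based).

    A single erroneous base corrupts each of these k-mers, so the result
    is the number of false k-mers one error can introduce (at most k).
    """
    if not (0 <= pos < read_len) or not (1 <= k <= read_len):
        raise ValueError("need 0 <= pos < read_len and 1 <= k <= read_len")
    first_start = max(0, pos - k + 1)   # earliest k-mer covering pos
    last_start = min(pos, read_len - k)  # latest k-mer covering pos
    return last_start - first_start + 1


def expected_kmer_coverage(base_cov, read_len, k, error_rate=0.0):
    """Approximate expected coverage by error-free k-mers.

    Assumes uniform read length and independent, uniform per-base errors
    (an illustrative model only).
    """
    kmer_cov = base_cov * (read_len - k + 1) / read_len
    return kmer_cov * (1.0 - error_rate) ** k


# Same data set (30x base coverage, 100 bp reads, 1% error), different k.
for k in (21, 55, 91):
    cov = expected_kmer_coverage(30, 100, k, error_rate=0.01)
    print(f"k={k}: up to {kmers_hit_by_error(100, 50, k)} false k-mers "
          f"per mid-read error, ~{cov:.1f}x error-free k-mer coverage")
```

Under these assumptions, increasing k both raises the number of k-mers damaged by each error and lowers the usable k-mer coverage, which is why large k values demand deeper sequencing.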
IDBA-UD

IDBA-UD [24] is part of the IDBA (Iterative De Bruijn Graph De Novo Assembler) [25] suite of assemblers. IDBA assemblers use multiple k-mer sizes to address the trade-offs described above. IDBA-UD iterates through a range of k-mer values in a stepwise fashion to improve the DBG and the resulting assembly. Sequencing errors are corrected at each iteration, reducing their impact. In this way, the assembly graph becomes more and more resolved with increasing k-mer size in each iteration step, leading to a more contiguous assembly result.

MEGAHIT

MEGAHIT [10] relies on the same multiple k-mer strategy as the IDBA assemblers [25]. MEGAHIT is currently the most efficient de novo assembler, largely because of its use of efficient data structures for storing the DBG. Memory requirements are reduced by using a new data structure, a succinct DBG [26]. Memory is also reduced by eliminating k-mers below a defined frequency threshold from the graph. This approach minimizes the negative impact of sequencing errors on the assembly. To retain k-mers from low-abundance organisms, distinguishing them from errors, MEGAHIT reconsiders discarded k-mers in low-coverage regions of the assembly graph.

metaSPAdes

metaSPAdes [9] is a metagenomic-specific version of the SPAdes assembler [27]. A main innovation in these assemblers is the use of paired-end information during the assembly process rather than afterward [28]. This information is incorporated in the graph by using a pair of k-mers separated by an estimated distance. Similar to IDBA-UD and MEGAHIT, SPAdes uses an iterative multiple k-mer approach. However, SPAdes uses the complete read information together with the preassembled contigs at every step. Originally, SPAdes was designed to address two major issues of single-cell sequencing data [27], uneven read coverage and chimeric sequences, issues that are also germane to metagenomic assembly.
In addition, metaSPAdes [9] was extended to handle strain variation. Micro-variations between highly similar ‘strain-contigs’ are combined to form high-quality consensus sequences, aiming at the best possible representation of each species instead of every strain variant. Metagenome scaffolding To improve the continuity of fragmented assemblies, orthogonal information, which is not used in the assembly process, is used to orient and order contigs with respect to each other. The linkage information provides a measure of confidence about the proximity of any two contigs on the genome. Different kinds of linkage information such as optical maps [29], mate pairs [30–33], fosmid clones [34], fosmid clone dilution pool sequencing [35], linked read sequencing [36], synthetic long reads [37] and Hi-C [38–40] have been explored to improve genome assembly quality. Paired-end sequences have a known size distribution, which can give an estimate of the distance between two contigs. Hi-C makes use of the 3D structure of the chromosomes inside a cell nucleus, thereby inferring genome-scale contact information. Development of algorithms, which use one or more types of orthogonal linkage information to get high-quality assemblies, is an emerging area of interest. However, the applicability of these methods in metagenomics has been limited because of difficult and expensive sample processing protocols. Because of this, the information used for scaffolding metagenomes has been limited to paired-ends. Strain resolution and variant detection Unlike isolate genome sequencing, the analysis of microbial communities provides valuable information about the strain structure of mixtures of closely related organisms, making it possible to study how the strain composition changes across time or in response to environmental changes [41–45]. 
This power has been recognized since the early days of the field [42], and software packages have been developed to help scientists discover, characterize and quantify strain-level differences between microbes. A first package in the field, Strainer [46], allowed researchers to manually inspect metagenomic assemblies to identify single-nucleotide variants. The Bambus 2 [47] scaffolding package was the first tool to provide the ability to automatically detect structural variants within metagenomic assemblies. This approach was later extended in Marygold [48] through the use of SPQR trees [49] to allow the efficient discovery of more types of structural variants. Most recently Anvi'o [50] provides a data analytics and visualization environment for comparing strain variants across multiple data sets. Using available infant gut samples [43], Anvi’o allowed the identification of systematic emergence of nucleotide variation in an abundant draft genome bin [50]. Note that the problem of strain resolution in metagenomic data bears strong similarities to the reconstruction of transcript splicing structure in RNA sequencing data [51]. Owing to the much shorter extent of eukaryotic transcripts, and the fact that transcript graphs can be assumed to lack cycles (which is not true in metagenomes), the approaches developed in this field cannot be effectively applied to metagenomic data. Future strategies and approaches High-quality assembly of bacterial genomes has undergone a renaissance with the advent of single-molecule sequences [52, 53], such as the PacBio RSII/Sequel [54] and Oxford Nanopore MinION [55]. An alternate technology is TruSeq Synthetic Long Reads (TSLRs, previously known as Moleculo), which relies on barcodes and pooling to reconstruct long sequences [37, 56, 57]. 
While not yet used for resolving microbial genomes [58] because of difficulty in assembling variable number tandem repeats, the TSLR sequences have high accuracy, allowing strain-level resolution in metagenomes [59]. Single-molecule sequencing is continuing to mature, increasing in both throughput and quality. As the instruments have increased in throughput, they have been applied to assemble eukaryotic genomes [60–67] with improved assembly algorithms [62, 68–70]. In addition to long-read sequencing, novel scaffolding approaches have also had a significant impact on genome assembly. The combination of Hi-C scaffolding and long-read assembly has been particularly powerful, generating chromosome-scale scaffolds [60, 71]. Recent studies have begun to highlight the application of long reads to metagenomics [72] across a broad range of settings, such as the gut microbiome [73], coculture communities [74] and the skin microbiome [75]. However, long-read sequencing has not gained widespread adoption because of three main factors: cost, DNA quality requirements and complexity of DNA preparation. A prerequisite for the effectiveness of both synthetic long-read and single-molecule long-read sequencing is that the DNA fragments provided to the sequencing instrument are sufficiently long. Current DNA extraction procedures, in addition to lysing the cell, also shear the DNA, thereby limiting sequencing read length [76]. This is especially true in the case of gram-positive organisms that are difficult to lyse. While protocols can be effectively optimized on a per-organism basis, this is not true for complex mixtures. Novel DNA extraction methods using agarose plugs, like those used for extracting DNA for optical mapping [77, 78], or enzymatic lysis using cocktails of enzymes [79], may result in the extraction of longer DNA fragments.
Long-range linking technologies, such as Hi-C, have an advantage over long-read sequencing, as these technologies do not require high molecular weight DNA to be generated after cross-linking and cell lysis.

Current methods and strategies for metagenome assembly validation

The validation of genome assemblies has been an active area of interest since the development of the first genome assemblers in the late 1970s [80]. Below, we describe several of the strategies and corresponding metrics used in this context (Table 1).

Table 1 Metrics to evaluate assembly quality

Assembly metric | Description | Features (reference-based, reference-free, measures errors, measures errors + variation)
Contiguity-based metrics
Number of contigs | Total number of assembled contigs reported by each assembler | √ √
Assembly size at 1 Mb | Represents the size of the largest contig C such that the sum of all contigs larger than C exceeds 1 Mb | √
Contig number at 1 Mb | Represents the number of contigs required to exceed 1 Mb | √
Complete genes | Represents the median number of complete genes per sample | √
Complete marker genes | Indicates the median number of fully reconstructed marker genes per sample | √
Reference-based metrics
Genome recovery (%) | Median percentage of each truth genome that is recovered | √ √
Total aligned length | Sum of the length of contigs aligned to the truth genomes | √ √
Total unaligned length | Sum of the length of unaligned contigs | √ √
NGAx | The length of the contig that covers at least half the reference genome, after breaking contigs at mis-assembly events and removing all unaligned bases | √ √
Consistency-based metrics
Depth of coverage | Statistical comparison of global versus local coverage, as signature of compressed/expanded repeats, chimeric contigs | √ √
Consensus | Concordance of consensus to read pileup | √ √
Split-read mapping | Single reads with partial alignments | √ √
Insert size consistency | Concordance of insert size (expanded/collapsed) | √ √

Note: Most commonly used metrics to evaluate metagenomics assembly quality, including contiguity-based, reference-based and consistency-based metrics. Assembly metric column contains commonly used metrics; description column briefly describes the metric; and features column indicates four characteristics of these metrics: reference-based (reference genome required), reference-free (reference genome not required), measures errors, measures errors + variation (biased by real differences in reference genome).

Some of the most intuitive metrics relate to assembly contiguity.
Measures such as the number of contigs and average or maximum contig sizes attempt to assess how far the assembly is from the ideal objective of one contig per chromosome. As most assemblies comprise many small contigs, usually because of sequencing errors or other artifacts, these metrics can be misleading. A more robust measure is the N50 size, defined as the minimum contig length in the set of largest contigs that together comprise over half the assembly (a weighted median contig size). Other metrics refer to the information contained in contigs, such as the number of open reading frames (ORFs, a proxy for genes) or their density (ORFs/Mb). As genes are used to address biological questions, a greater number or density of ORFs results in more information available for testing biological hypotheses. Contig length statistics do not incorporate any correctness information and can be ‘fooled’ by accepting errors (a single long contig can be constructed by concatenating all the reads in an arbitrary order). In contrast, ORF-based statistics capture errors, as errors would disrupt ORFs.

ORF/gene information can also be used to evaluate assembly completeness. Several genes (called marker genes) have been found in all known bacterial genomes and can thus be assumed to exist in a newly assembled sequence. An assembly where some of these genes are missing can be assumed incomplete. A recently developed software package, CheckM [12], relies on marker genes that are specific to a genome-based lineage within a reference tree. CheckM also provides tools for identifying sets of contigs that can be combined to reconstruct individual organisms, based on marker set compatibility, similarity in genomic characteristics and proximity within a reference genome tree.

An alternative approach for validating assemblies assesses the fit of the assembly with a model of the sequencing process.
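Returning briefly to the contiguity metrics discussed above, the N50 size can be computed in a few lines. The sketch below is a minimal illustration, assuming contig lengths are available as a plain list of integers and using the common convention that the running total must reach at least half of the assembly size; the function name is hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from largest to smallest; the N50 is the length of
    the contig at which the running total first reaches half of the total
    assembly size (an "at least half" convention, assumed here).
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig is required")
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20, 10]
    print("N50:", n50(contigs))
    # Joining everything into one contig, correctly or not, inflates the statistic.
    print("N50 after concatenation:", n50([sum(contigs)]))
```

The second call illustrates why contiguity statistics alone cannot certify an assembly: concatenating all contigs raises the N50 without making the assembly any more correct.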
Assembly likelihood estimators have been developed and used to evaluate single-genome assemblies [81, 82], as well as metagenome assemblies [83, 84]. While these metrics do not provide a global confidence value, they can be used to compare and rank different genome assemblies, allowing one to automatically optimize the assembly parameters by iteratively trying different options and selecting the parameter set that maximizes the likelihood [83]. Such approaches are largely used as ‘holistic’ validation strategies, assessing the assembly as a whole rather than identifying specific errors. However, they can be adapted to also highlight regions of the assembly where errors may occur [84]. It is important to note that, although exactly computing the likelihood of an assembly given a model of the sequencing process can be expensive, simple heuristics are remarkably effective; in fact, the likelihood metrics are tightly related to the number of reads and/or paired-ends that can be correctly aligned to the assembly. Essentially, the more information that is ‘explained’ by the assembly, the more likely it is that the assembly is correct.

Characterizing assembly errors

The approaches described so far only implicitly take errors into account. It is often important to determine exactly where errors were introduced in the assembly, either to correct these mistakes or to ensure that the errors do not influence the results of downstream analyses. Figure 2 highlights the four primary types of assembly errors: repeat collapse, insertions, deletions and inversions. These assembly errors can be identified by mapping reads to the assembly and evaluating the coverage, the distance between read pairs and split-read mapping data. Increases in coverage indicate collapsed repeats, while drops in coverage or coverage gaps can indicate break points because of insertions, deletions and inversions.
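The coverage signature described above can be illustrated with a simple windowed scan. The sketch below is a deliberately naive heuristic, not the method of any tool discussed in this review: it assumes a per-base depth vector for a single contig is already available (for example, derived from a read alignment), and the window size and the two fold-change thresholds are arbitrary illustrative choices.

```python
from statistics import median


def flag_coverage_windows(depth, window=100, high_fold=2.0, low_fold=0.5):
    """Flag fixed-size windows whose mean depth departs from the contig median.

    depth     : list of per-base read depths for one contig (assumed input)
    window    : window length in bases (illustrative default)
    high_fold : windows above high_fold * median are labeled "high"
    low_fold  : windows below low_fold * median are labeled "low"
    Returns a list of (start, end, label) tuples with half-open coordinates.
    """
    if not depth:
        return []
    baseline = median(depth)
    flagged = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        mean_depth = sum(chunk) / len(chunk)
        if baseline > 0 and mean_depth > high_fold * baseline:
            flagged.append((start, start + len(chunk), "high"))
        elif mean_depth < low_fold * baseline:
            flagged.append((start, start + len(chunk), "low"))
    return flagged


if __name__ == "__main__":
    # Synthetic depth profile: one elevated region and one depleted region.
    profile = [20] * 300 + [60] * 100 + [20] * 300 + [2] * 100 + [20] * 200
    for start, end, label in flag_coverage_windows(profile):
        print(f"{start}-{end}: {label} coverage")
```

Using the contig's own median as the baseline, rather than a sample-wide value, is one simple way to avoid confusing differences in organism abundance with local anomalies, although it does not address strain variation.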
There are two primary approaches for detecting these assembly errors: (i) reference-based and (ii) consistency-based (Table 1). In reference-based assembly error detection, assembly errors are identified by comparing the assembly to one or more reference genomes. In consistency-based methods, errors are identified by aligning the sequencing reads to the assembly and identifying regions where the mappings are inconsistent with the assembly.

Figure 2 Metagenome assembly error signatures. There are four primary types of assembly errors: repeat collapse, insertions, deletions and inversions. These assembly errors can be identified by mapping reads to the assembly and evaluating the coverage (solid curve), distance between read pairs (boxes labeled H) and split read mapping data (boxes labeled H). Increase in coverage indicates repeat collapse, whereas drops in coverage indicate break points for insertions, deletions and inversions. Shorter than expected distance between read pairs indicates potential repeat collapse or deletion, whereas increase in distance between read pairs indicates a potential insertion. Inconsistency in read pair direction can indicate an inversion. Finally, split-read mapping data, obtained by independently aligning the first and last third of a read, can be used in a similar manner to read pair information to identify assembly errors [85].

Reference-based

MetaQUAST. MetaQUAST [11] is a reference-based method that identifies mis-assemblies and structural variants in an assembly relative to reference genomes. MetaQUAST is a modification of QUAST [86], an isolate genome assembly validation tool that computes alignments of assembled contigs to a single reference genome. For data sets with known reference genomes, MetaQUAST uses the user-provided reference sequences to evaluate the assembly. For data sets where the genomes in the sample are not known, MetaQUAST identifies appropriate reference sequences using a 16S ribosomal RNA database. Additionally, MetaQUAST applies a structural variant finding algorithm to distinguish between structural variants and true assembly errors.

Consistency-based

Within isolate genomes, errors introduce inconsistencies in the placement of reads within the assembly, leading to several signatures that can be detected computationally (Figure 2). In isolate genomes, several assumptions are usually made. First, the ideal sequencing process can broadly be assumed to be uniform [87], i.e. the DNA fragments are equally likely to start at any position in the genome. Strong deviations from the assumption of uniformity frequently correspond to assembly mistakes. Second, the reads agree with the assembled sequence except for potential random sequencing errors. Third, in the case of paired-end data, the distance between the paired reads is consistent with the fragment sizes generated during the sequencing process.
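The third assumption, consistency of paired-read distances with the library's fragment sizes, lends itself to a simple statistical check. The sketch below computes a z-score comparing the mean insert size of pairs spanning one position with the library-wide mean. It is an illustrative simplification in the spirit of compression/expansion tests, not the exact procedure of any tool named in this review, and it assumes the library mean and standard deviation have already been estimated; all names are hypothetical.

```python
from math import sqrt


def insert_size_z(local_inserts, library_mean, library_sd):
    """Z-score of the local mean insert size against the library distribution.

    local_inserts : insert sizes of read pairs spanning one assembly position
    library_mean  : mean insert size of the whole library (assumed known)
    library_sd    : standard deviation of the library insert size (assumed known)
    Strongly positive values suggest stretched pairs, strongly negative
    values suggest compressed pairs.
    """
    n = len(local_inserts)
    if n == 0:
        raise ValueError("no read pairs span this position")
    if library_sd <= 0:
        raise ValueError("library_sd must be positive")
    local_mean = sum(local_inserts) / n
    # Standard error of the mean for n independent pairs.
    return (local_mean - library_mean) / (library_sd / sqrt(n))


if __name__ == "__main__":
    consistent = [500] * 25
    stretched = [560] * 25
    print("consistent region z:", insert_size_z(consistent, 500, 50))
    print("stretched region z:", insert_size_z(stretched, 500, 50))
```

A threshold on the absolute z-score then separates regions worth inspecting from ordinary sampling noise. In a metagenome, the number of spanning pairs varies with organism abundance, so any fixed threshold should be treated with caution.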
Amosvalidate [88] is a de novo pipeline for detecting mis-assemblies that checks all these constraints and reports regions in the assembly characterized by a sufficient deviation from the assumptions outlined above. FRCbam is an approach based on amosvalidate that introduced the concept of feature-response curves (FRCs), which track assembly error across assembled base pairs [89, 90]. REAPR [91] focuses on paired-end constraints, and identifies regions where the distance between the paired-ends is consistently stretched or shrunk, or where the depth of coverage is unusual. Pilon [85] relies on paired-end information (similar to REAPR) as well as on single-base changes and fragmented alignments (called soft-clipping) to identify and correct assembly errors.

None of the consistency-based tools described above are effective in a metagenomic setting, in no small part because the underlying assumptions are incorrect in this context. Strain variants within the community can be mistaken for errors, and the difference in abundance between the organisms in a mixture makes it difficult to assess what deviations from expectation are ‘unusual’. VALET [92] (Figure 3, http://github.com/marbl/VALET) is a de novo pipeline for detecting all types of mis-assemblies in metagenomic data sets (Figure 2). VALET primarily adapts the approaches developed in the context of isolate genomes that we described above. To avoid false positives and false negatives because of uneven depth of coverage, VALET bins contigs by coverage before applying these methods.

Figure 3 Overview of the VALET pipeline.

Possible break points in the assembly are found by examining regions where parts of a large number of reads are unable to align. To identify break points, VALET uses the first and last third of each unaligned read, called sister reads.
The sister reads are aligned independently to the reference genome, and then regions where the sister reads align to nonadjacent segments of the genome are flagged as mis-assemblies. In practice, most mis-assembly signatures have high false-positive rates. This false-positive rate can be reduced by focusing on just the regions where multiple signatures agree. Any window of the assembly (2000 bp in length by default) that contains multiple mis-assembly signatures is marked as suspicious by VALET. The flagged and suspicious regions are stored in a BED file, which allows users to visualize the mis-assemblies using genomic viewers, such as IGV [93]. Excluding from the analysis regions of the assembly where just one type of inconsistency is detected may lead to false negatives. It is important for the user to be aware of this trade-off and use the set of signatures that is most appropriate for their application. VALET provides several visual representations of assembly quality including an FRC plot, which highlights the trade-off between contiguity and accuracy. Exemplar validation of metagenomic data sets Reference-based and de novo evaluation of an HMP data set To evaluate state-of-the-art metagenomic de novo assembly software (IDBA-UD; MEGAHIT; metaSPAdes) on a real data set, we compared the accumulated errors versus cumulative assembly length for multiple assemblies on a Human Microbiome Project (HMP) stool sample (SRS016203) (Figure 4A–C), using FRCs [90]. The results show that different validation metrics provide a different picture of the relative accuracy of the different approaches. In the reference-based MetaQUAST results, MEGAHIT outperforms metaSPAdes and IDBA with metaSPAdes containing the most error (mostly within the shortest contigs). The de novo validation based on coverage only (Figure 4B) favors metaSPAdes, while the validation based on break point events favors MEGAHIT. 
MEGAHIT has fewer structural errors compared with metaSPAdes, especially in the largest contigs, while metaSPAdes has fewer under-collapsed/over-collapsed repeats when compared with MEGAHIT. Taken together, these validation results highlight the need for ‘use-case’ specific metagenomic assembly pipelines. Depending on the application, different types of errors have varying impacts on the final results. For example, coverage errors make it difficult to estimate relative abundances, thereby possibly confounding statistical associations based on the assembled data. On the other hand, errors with respect to a reference database are most relevant when metagenomic assembly is applied to clinical data sets, a setting where most pathogens can be assumed to exist in reference collections.

Figure 4 Reference-based and consistency-based evaluations of an HMP sample. FRC plots produced by MetaQUAST and VALET comparing assemblies of a stool sample (SRS016203) from the HMP (using IDBA-UD [24], metaSPAdes [9] and MEGAHIT [10]). (A) Reference-based (MetaQUAST) validation (all mis-assemblies flagged), (B) consistency-based (VALET) coverage anomalies and (C) consistency-based (VALET) break point errors. The y-axis represents the cumulative assembly size, considering contigs from largest to smallest. The x-axis represents the cumulative number of errors within the contigs comprising the corresponding y-axis value. Curves toward the top and left of the plot represent better assemblies, i.e. fewer errors for the same cumulative assembly size. Depending on the metric, different assemblers perform best: MEGAHIT has the highest consistency with reference genomes and the fewest break points, while metaSPAdes has fewer coverage anomalies.

Reference-based evaluation of long-read assemblies

We next evaluated the promise of long-read assembly using an available HMP synthetic data set sequenced with both PacBio [94] and TSLR [59], assembled with Canu [68]. We compared the results with assemblies of the same data set using short-read data. The full PacBio-only data set (median coverage 192-fold) generates the most complete representation of the data set, reconstructing 67/83 Mb of the data set with several genomes and plasmids in complete circular contigs (Figure 5). This assembly also has a low rate of mis-assembly (0.23/100 kb) and the lowest mismatch rate (1.87/100 kb). It also has a low insertion and deletion (indel) rate (3.83/100 kb), second only to TSLR sequencing. As with clonal bacterial assemblies, despite the high raw error rate of individual sequences, the consensus assembly has high accuracy exceeding that of short-read data sets because of the signal-based polishing of the consensus sequence [96].
A one-tenth subset PacBio-only assembly (median coverage 19.5-fold) is still able to reconstruct 62/83 Mb of the data set with high continuity and low error (0.24 errors/100 kb; 7.54 mismatches/100 kb; 129.77 indels/100 kb). Here, the coverage is insufficient for accurate polishing [97], leading to the highest indel error rate, the predominant error mode of the PacBio sequencer. This result could likely be improved by Illumina-based correction [85]. The remaining assemblies did not recover more than 28/83 Mb of the data set. However, the TSLR assembly had a low error rate across all metrics (0.19 errors/100 kb; 4.95 mismatches/100 kb; 1.65 indels/100 kb).

Figure 5 A Bandage [95] visualization of the Canu graphical assembly output. The unitigs in the assembly were aligned to the reference and assigned to their best match. Unitigs were colored by hand to match their species assignment. There are several large circular structures, which correspond to complete chromosomes. The smaller circles correspond to complete plasmids.

Conclusion

Here, we have highlighted recent developments in metagenomic assembly, both from a computational and technological perspective. Future technological advances are likely to have a significant impact on metagenomics. Our analysis of the mock HMP community highlights the power of single-molecule sequencing to resolve complex repeats and simplify assembly, albeit on a simple data set.
Today, the use of long-read technologies in metagenomics applications is limited, as is the use of technologies generating long-range linking information. Some of these technologies, coupled with advances in sample processing, will likely become sufficiently accurate and cost-effective, allowing for a much more accurate reconstruction of the genomic composition of microbial communities. Combinations of short-read (for coverage) and long-read (for continuity) data are also possible, and several hybrid assemblers have been developed [98–102]. New technologies will also provide new capabilities, such as the ability of the Oxford Nanopore technology to ‘filter’ the reads within the sequencing instrument [103]. While only demonstrated with short targets, it has the potential to transform the way in which sequencing is used in diagnostic/detection applications, allowing researchers to bias the random sequencing process toward the DNA fragments of interest. New algorithms and software tools will continue to be developed to leverage such technological advances.

Effective assembly validation approaches are a critical need for the further progress of the field. Such tools will help researchers evaluate new software tools and sequencing strategies, and will highlight opportunities for further algorithmic development. Both reference-based and de novo validation strategies are important and need to continue to be developed. Reference-based methods provide a valuable lower bound on the performance of tools, while de novo methods allow the validation of assembly results and tuning of parameters even in settings where reference genomes are unavailable.

Key Points

Despite recent advances in metagenomic assembly, assembled contigs are imperfect, and validation is key for moving forward.

There are numerous metagenomic assembly metrics; understanding how to interpret each individual metric is key to evaluating the accuracy of the assembly of genes and genomes from metagenomes.

VALET represents a de novo pipeline for detecting mis-assemblies in metagenomic data sets.

Long-read technologies, such as PacBio RSII/Sequel, Oxford Nanopore, etc., are highly effective in the context of isolate genomes, but technical hurdles remain before they can be used routinely in metagenomic applications.

Acknowledgements

Opinions expressed in this article are the authors’ own and do not necessarily reflect the policies and views of NIST or affiliated venues. Official contribution of NIST; not subject to copyright in the United States.

Funding

The NIH (grant numbers R01-HG-004885 and R01-AI-100947 to M.P.), and the NSF (grant numbers IIS-1117247 and IIS-0812111 to M.P.). The Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health (to S.K.).

Nathan D. Olson is a PhD student at the University of Maryland and researcher at the National Institute of Standards and Technology working on methods for evaluating metagenomic bioinformatic tools.

Todd J. Treangen received his PhD in Computer Science from the Technical University of Catalonia, Barcelona, Spain. His research interests include multiple genome alignment and metagenomic assembly. Currently, he is an Assistant Research Scientist at the Center for Bioinformatics and Computational Biology, University of Maryland, College Park.

Christopher M. Hill received his PhD in Computer Science from the University of Maryland, College Park, where he focused on developing algorithms for assembling and comparing single genome and metagenomic assemblies. Currently, he is a software engineer at Google Inc.

Victoria Cepeda-Espinoza is a PhD student in the Department of Computer Science at the University of Maryland, College Park. Her research interests include developing algorithms for metagenomic assembly.

Jay Ghurye is a PhD student in the Department of Computer Science at the University of Maryland, College Park.
His research interests include developing algorithms for metagenomic assembly.
Sergey Koren received his PhD in Computer Science from the University of Maryland. His research interests include single-molecule sequence assembly and analysis. He is currently a Staff Scientist in the Genome Informatics Section at the National Human Genome Research Institute.
Mihai Pop received his PhD degree in Computer Science from Johns Hopkins University. His research interests include sequence analysis algorithms and metagenomics. Currently, he is a Professor in the Department of Computer Science and the Center for Bioinformatics and Computational Biology, and the Interim Director of the Institute for Advanced Computer Studies at the University of Maryland, College Park.
References
1 Podell S, Ugalde JA, Narasingarao P, et al. Assembly-driven community genomics of a hypersaline microbial ecosystem. PLoS One 2013;8:e61692.
2 Narasingarao P, Podell S, Ugalde JA, et al. De novo metagenomic assembly reveals abundant novel major lineage of Archaea in hypersaline microbial communities. ISME J 2012;6:81–93.
3 Ji P, Zhang Y, Wang J, et al. MetaSort untangles metagenome assembly by reducing microbial community complexity. Nat Commun 2017;8:14306.
4 Sangwan N, Xia F, Gilbert JA. Recovering complete and draft population genomes from metagenome datasets. Microbiome 2016;4:8.
5 Kingsford C, Schatz MC, Pop M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 2010;11:21.
6 Nagarajan N, Pop M. Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J Comput Biol 2009;16:897–908.
7 Treangen TJ, Abraham AL, Touchon M, et al. Genesis, effects and fates of repeats in prokaryotic genomes. FEMS Microbiol Rev 2009;33:539–71.
8 Koren S, Harhay GP, Smith TP, et al. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol 2013;14:R101.
9 Nurk S, Meleshko D, Korobeynikov A, et al. metaSPAdes: a new versatile metagenomic assembler. Genome Res 2017;27:824–34.
10 Li D, Liu CM, Luo R, et al. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 2015;31:1674–6.
11 Mikheenko A, Saveliev V, Gurevich A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics 2016;32:1088–90.
12 Parks DH, Imelfort M, Skennerton CT, et al. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 2015;25:1043–55.
13 Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA 2001;98:9748–53.
14 Namiki T, Hachiya T, Tanaka H, et al. MetaVelvet: an extension of velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res 2012;40:e155.
15 Sohn J, Nam JW. The present and future of de novo whole-genome assembly. Brief Bioinform 2016. doi: 10.1093/bib/bbw096.
16 Tomescu AI, Medvedev P. Safe and complete contig assembly through omnitigs.
J Comput Biol 2017;24:590–602.
17 Nagarajan N, Pop M. Sequence assembly demystified. Nat Rev Genet 2013;14:157–67.
18 Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics 2010;95:315–27.
19 Medvedev P, Scott E, Kakaradov B, et al. Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics 2011;27:i137–41.
20 Song L, Florea L, Langmead B. Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biol 2014;15:509.
21 Morowitz MJ, Denef VJ, Costello EK, et al. Strain-resolved community genomic analysis of gut microbial colonization in a premature infant. Proc Natl Acad Sci USA 2011;108:1128–33.
22 Salmela L, Walve R, Rivals E, et al. Accurate self-correction of errors in long reads using de Bruijn graphs. Bioinformatics 2017;33:799–806.
23 Greenwald WW, Klitgord N, Seguritan V, et al. Utilization of defined microbial communities enables effective evaluation of meta-genomic assemblies. BMC Genomics 2017;18:296.
24 Peng Y, Leung HCM, Yiu SM, et al. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 2012;28:1420–8.
25 Peng Y, Leung HCM, Yiu SM, et al. IDBA – a practical iterative de Bruijn graph de novo assembler. In: Proceedings of the 14th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2010). 2010, 426–40. ACM Press.
26 Bowe A, Onodera T, Sadakane K, et al. Succinct de Bruijn graphs. In: Proceedings of the 12th international conference on Algorithms in Bioinformatics (WABI 2012). 2012, 225–35. Springer.
27 Bankevich A, Nurk S, Antipov D, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 2012;19:455–77.
28 Prjibelski AD, Vasilinetc I, Bankevich A, et al. ExSPAnder: a universal repeat resolver for DNA fragment assembly. Bioinformatics 2014;30:i293–301.
29 Cai W, Aburatani H, Stanton VP, et al. Ordered restriction endonuclease maps of yeast artificial chromosomes created by optical mapping on surfaces. Proc Natl Acad Sci USA 1995;92:5164–8.
30 Gao S, Sung WK, Nagarajan N. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. J Comput Biol 2011;18:1681–91.
31 Gao S, Bertrand D, Chia BK, et al. OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biol 2016;17:102.
32 Salmela L, Mäkinen V, Välimäki N, et al. Fast scaffolding with small independent mixed integer programs. Bioinformatics 2011;27:3259–65.
33 Pop M, Kosack DS, Salzberg SL. Hierarchical scaffolding with Bambus. Genome Res 2004;14:149–59.
34 Gnerre S, Maccallum I, Przybylski D, et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci USA 2011;108:1513–18.
35 Mouse Genome Sequencing Consortium; Waterston RH, Lindblad-Toh K, et al.
Initial sequencing and comparative analysis of the mouse genome. Nature 2002;420:520–62.
36 Zheng GX, Lau BT, Schnall-Levin M, et al. Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol 2016;34:303–11.
37 McCoy RC, Taylor RW, Blauwkamp TA, et al. Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly-repetitive transposable elements. PLoS One 2014;9:e106689.
38 Burton JN, Adey A, Patwardhan RP, et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol 2013;31:1119–25.
39 Kaplan N, Dekker J. High-throughput genome scaffolding from in vivo DNA interaction frequency. Nat Biotechnol 2013;31:1143–7.
40 Lieberman-Aiden E, van Berkum NL, Williams L, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009;326:289–93.
41 Olm MR, Brown CT, Brooks B, et al. Identical bacterial populations colonize premature infant gut, skin, and oral microbiomes and exhibit different in situ growth rates. Genome Res 2017;27:601–12.
42 Baker BJ, Banfield JF. Microbial communities in acid mine drainage. FEMS Microbiol Ecol 2003;44:139–52.
43 Sharon I, Morowitz MJ, Thomas BC, et al. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res 2013;23:111–20.
44 Mackelprang R, Waldrop MP, DeAngelis KM, et al. Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw. Nature 2011;480:368–71.
45 Schloissnig S, Arumugam M, Sunagawa S, et al. Genomic variation landscape of the human gut microbiome. Nature 2013;493:45–50.
46 Eppley JM, Tyson GW, Getz WM, et al. Strainer: software for analysis of population variation in community genomic datasets. BMC Bioinformatics 2007;8:398.
47 Koren S, Treangen TJ, Pop M. Bambus 2: scaffolding metagenomes. Bioinformatics 2011;27:2964–71.
48 Nijkamp JF, Pop M, Reinders MJ, et al. Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold. Bioinformatics 2013;29:2826–34.
49 Gutwenger C, Mutzel P. A linear time implementation of SPQR-trees. In: Proceedings of the 8th International Symposium on Graph Drawing (LNCS, volume 1984). 2001, 70–90. Springer.
50 Eren AM, Esen ÖC, Quince C, et al. Anvi’o: an advanced analysis and visualization platform for ’omics data. PeerJ 2015;3:e1319.
51 Grabherr MG, Haas BJ, Yassour M, et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat Biotechnol 2011;29:644–52.
52 Schneider GF, Dekker C. DNA sequencing with Nanopores. Nat Biotechnol 2012;30:326–8.
53 Eid J, Fehr A, Gray J, et al. Real-time DNA sequencing from single polymerase molecules. Science 2009;323:133–8.
54 http://www.pacb.com/products-and-services/pacbio-systems.
55 https://nanoporetech.com/products.
56 Voskoboynik A, Neff NF, Sahoo D, et al. The genome sequence of the colonial chordate, Botryllus schlosseri. Elife 2013;2:e00569.
57 Bankevich A, Pevzner PA. TruSPAdes: barcode assembly of TruSeq synthetic long reads. Nat Methods 2016;13:248–50.
58 Koren S, Phillippy AM. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr Opin Microbiol 2015;23:110–20.
59 Kuleshov V, Jiang C, Zhou W, et al. Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome. Nat Biotechnol 2016;34:64–9.
60 Bickhart DM, Rosen BD, Koren S, et al. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Genet 2017;49:643–50.
61 Pendleton M, Sebra R, Pang AW, et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nat Methods 2015;12:780–6.
62 Berlin K, Koren S, Chin CS, et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol 2015;33:623–30.
63 Gordon D, Huddleston J, Chaisson MJ, et al. Long-read sequence assembly of the gorilla genome. Science 2016;352:aae0344.
64 Zimin AV, Puiu D, Luo MC, et al.
Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res 2017;27:787–92.
65 Seo JS, Rhie A, Kim J, et al. De novo assembly and phasing of a Korean human genome. Nature 2016;538:243–7.
66 Jarvis DE, Ho YS, Lightfoot DJ, et al. The genome of Chenopodium quinoa. Nature 2017;542:307–12.
67 Jain M, Koren S, Quick J, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. bioRxiv 2017;128835.
68 Koren S, Walenz BP, Berlin K, et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 2017;27:722–36.
69 Chin CS, Peluso P, Sedlazeck FJ, et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods 2016;13:1050–4.
70 Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 2016;32:2103–10.
71 Dudchenko O, Batra SS, Omer AD, et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 2017;356:92–5.
72 White RA, Bottos EM, Roy Chowdhury T, et al. Moleculo long-read sequencing facilitates assembly and genomic binning from complex soil metagenomes. mSystems 2016;1:e00045-16.
73 Kuleshov V, Snyder MP, Batzoglou S. Genome assembly from synthetic long read clouds. Bioinformatics 2016;32:i216–24.
74 Driscoll CB, Otten TG, Brown NM, et al. Towards long-read metagenomics: complete assembly of three novel genomes from bacteria dependent on a diazotrophic cyanobacterium in a freshwater lake co-culture. Stand Genomic Sci 2017;12:9.
75 Tsai YC, Conlan S, Deming C, et al. Resolving the complexity of human skin metagenomes using single-molecule sequencing. MBio 2016;7:e01948-15.
76 Olson ND, Morrow JB. DNA extract characterization process for microbial detection methods development and validation. BMC Res Notes 2012;5:668.
77 Nair S, Karim R, Cardosa MJ, et al. Convenient and versatile DNA extraction using agarose plugs for ribotyping of problematic bacterial species. J Microbiol Methods 1999;38:63–7.
78 Maydan J, Thomas M, Tabanfar L, et al. Electrophoretic high molecular weight DNA purification enables optical mapping. J Biomol Tech 2013;24:S57.
79 Tighe S, Afshinnekoo E, Rock TM, et al. Genomic methods and microbiological technologies for profiling novel and extreme environments for the Extreme Microbiome Project (XMP). J Biomol Tech 2017;28:31–9.
80 Staden R. A strategy of DNA sequencing employing computer programs. Nucleic Acids Res 1979;6:2601–10.
81 Rahman A, Pachter L. CGAL: computing genome assembly likelihoods. Genome Biol 2013;14:R8.
82 Ghodsi M, Hill CM, Astrovskaya I, et al. De novo likelihood-based measures for comparing genome assemblies. BMC Res Notes 2013;6:334.
83 Hill CM, Astrovskaya I, Huang H, et al. De novo likelihood-based measures for comparing metagenomic assemblies. In: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, 2013. 2013, 94–8.
84 Clark SC, Egan R, Frazier PI, et al. ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies. Bioinformatics 2013;29:435–43.
85 Walker BJ, Abeel T, Shea T, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 2014;9:e112963.
86 Gurevich A, Saveliev V, Vyahhi N, et al. QUAST: quality assessment tool for genome assemblies. Bioinformatics 2013;29:1072–5.
87 Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 1988;2:231–9.
88 Phillippy AM, Schatz MC, Pop M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol 2008;9:R55.
89 Narzisi G, Mishra B. Comparing de novo genome assembly: the long and short of it. PLoS One 2011;6:e19175.
90 Vezzi F, Narzisi G, Mishra B. Reevaluating assembly evaluations with feature response curves: GAGE and assemblathons. PLoS One 2012;7:e52210.
91 Hunt M, Kikuchi T, Sanders M, et al. REAPR: a universal tool for genome assembly evaluation. Genome Biol 2013;14:R47.
92 Hill CM. Novel methods for comparing and evaluating single and metagenomic assemblies. PhD Thesis, University of Maryland, 2015.
http://drum.lib.umd.edu/handle/1903/17100
93 Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 2013;14:178–92.
94 DevNet. Human Microbiome Project MockB Shotgun. 2014. https://github.com/PacificBiosciences/DevNet/wiki/Human_Microbiome_Project_MockB_Shotgun (28 July 2017, date last accessed).
95 Wick RR, Schultz MB, Zobel J, et al. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 2015;31:3350–2.
96 Chin CS, Alexander DH, Marks P, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods 2013;10:563–9.
97 Alexander DH. Quiver FAQ. 2016. https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/FAQ.rst
98 Madoui MA, Engelen S, Cruaud C, et al. Genome assembly using Nanopore-guided long and error-free DNA reads. BMC Genomics 2015;16:327.
99 Koren S, Schatz MC, Walenz BP, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol 2012;30:693–700.
100 Antipov D, Korobeynikov A, McLean JS, et al. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 2016;32:1009–15.
101 Ye C, Hill CM, Wu S, et al. DBG2OLC: efficient assembly of large genomes using long erroneous reads of the third generation sequencing technologies. Sci Rep 2016;6:31900.
102 Goodwin S, Gurtowski J, Ethe-Sayers S, et al. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome.
Genome Res 2015;25:1750–6.
103 Loose M, Malla S, Stout M. Real-time selective sequencing using Nanopore technology. Nat Methods 2016;13:751–4.
Author notes
Nathan D. Olson and Todd J. Treangen contributed equally to this work.
© The Author 2017. Published by Oxford University Press. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
journal article
LitStream Collection
MG-RAST version 4—lessons learned from a decade of low-budget ultra-high-throughput metagenome analysis

Meyer, Folker; Bagchi, Saurabh; Chaterji, Somali; Gerlach, Wolfgang; Grama, Ananth; Harrison, Travis; Paczian, Tobias; Trimble, William L; Wilke, Andreas

2019 Briefings in Bioinformatics

doi: 10.1093/bib/bbx105 pmid: 29028869

Abstract
As technologies change, MG-RAST is adapting. Newly available software is being included to improve accuracy and performance. As a computational service constantly running large-volume scientific workflows, MG-RAST is the right location to perform benchmarking and implement algorithmic or platform improvements, in many cases involving trade-offs between specificity, sensitivity and run-time cost. The work in [Glass EM, Dribinsky Y, Yilmaz P, et al. ISME J 2014;8:1–3] is an example; we use existing well-studied data sets as gold standards representing different environments and different technologies to evaluate any changes to the pipeline. Currently, we use well-understood data sets in MG-RAST as a platform for benchmarking. The use of artificial data sets for pipeline performance optimization has not added value, as these data sets do not present the same challenges as real-world data sets. In addition, the MG-RAST team welcomes suggestions for improvements of the workflow. We are currently working on versions 4.02 and 4.1, both of which contain significant input from the community and our partners that will enable double barcoding and stronger inferences supported by longer-read technologies, and will increase throughput while maintaining sensitivity by using Diamond and SortMeRNA. On the technical platform side, the MG-RAST team intends to support the Common Workflow Language as a standard to specify bioinformatics workflows, to facilitate both development and efficient high-performance implementation of the community’s data analysis tasks.
metagenome analysis, cloud, distributed workflows
Introduction
The ever-increasing amount of DNA sequence data [1] has motivated significant developments in biomedical research. Currently, however, many researchers continue to struggle with large-scale computing and data management requirements. Numerous approaches have been proposed and are being pursued to alleviate this burden on application scientists.
The approaches include focusing on the user-interface layer while relying primarily on legacy technology [2]; reimplementing significant chunks of code in new languages [3]; and developing clean-slate designs [4]. Breakthroughs that appreciably reduce computational burden, such as Diamond [5], are the exception. While important, few if any of the solutions contribute to solving the central problem: data analysis is becoming increasingly expensive in terms of both time and cost, with reference databases growing rapidly and data volumes rising. In essence, more and more data are being produced without sufficient resources to analyze the data. All indicators show that this trend will continue in the foreseeable future [1]. We strongly believe that a change in how the research community handles routine data analytics is required. While we cannot predict the outcome of this evolutionary process, scalable, flexible and—most important—efficient platforms will, in our opinion, be part of any ‘new computational ecosystem’. MG-RAST [6] is one such platform that handles hundreds of submissions daily, often aggregating >0.5 terabytes in a 24 h period. MG-RAST is a hosted, open-source, open-submission platform (‘Software as a Service’) that provides robust analysis of environmental DNA data sets (where environment is broadly defined). The system has three main components: a workflow, a data warehouse and an API (with a Web frontend). The workflow combines automated quality control, automated analysis and user-driven parameter- and database-flexible analysis. The data warehouse supports data archiving, discovery and integration. The platform is accessible via a Web interface [7], as well as a RESTful API [8]. Analysis of environmental DNA (i.e. metagenomics) presents a number of challenges, including feature extraction (e.g. 
gene calling) from (mainly unassembled) often lower-quality sequence data; data warehousing; movement of often large data sets to be compared against many equally large data sets; and data discovery. A key insight (see Lessons learned, L1) is that the challenges faced here are distinct from the challenges facing groups that render services for individual genomes or even sets of genomes [9]. Several hosted systems currently provide services in this field: JGI IMG/M [10], EBI MG-Portal [11] and MG-RAST [6]. Myriad stand-alone tools exist, including integrative user-friendly interfaces [12]; feature prediction tools [5, 13]; tools that ‘bin’ individual reads using codon frequencies, read abundance and cross-sample abundance [14–17]; and sets of marker genes reducing the search space for analysis with associated visualization tools [18]. MG-RAST seeks to select the best-in-class implementations and provide a hosted resource-efficient service that implements a balance between custom analysis and one-size-fits-all recipes. MG-RAST achieves this goal by defining parameters late (during download or analysis) rather than a priori, before a set of analysis tools is run. The analysis workflow in MG-RAST is identical across all data sets, except for data set-specific operations such as host DNA removal and variations in filtering to accommodate different user-submitted data types. While many approaches to metagenome analysis exist, we chose an approach that allows large-scale analysis and massive comparisons. The core principle for the design of MG-RAST was to provide consistent analyses as deep and unbiased as possible at affordable computational cost. Other approaches, such as the comprehensive genome and protein binning adopted by IMG/M, or the profile hidden Markov model-based approaches used by MG-Portal, do add value and provide valuable alternative analyses. These portals complement each other’s capabilities, and we routinely share best practices with them.
MG-RAST’s strong suit is handling raw reads directly from a sequencing service. It has been extended to handle assembled metagenomes and metatranscriptomes as well. In its current form, however, it does not support metagenomics assembly or a genome-centric approach to metagenomics (i.e. binning). Like many hosted applications (not just in bioinformatics), MG-RAST started out as a traditional database-oriented system using largely traditional design patterns. While expanding the number of machines able to execute MG-RAST workflows, we learned that data access input and output (I/O) is as limiting a factor as the processing power or memory (see Lessons learned, L2 and L8). MG-RAST has rapidly adapted [19–21] to meet the needs of a growing user community, as well as the changing technology landscape. We have run MG-RAST workflows on several computational platforms, including OpenStack [22], Amazon’s AWS [23], Microsoft’s Azure [24], several local clusters and even laptops on occasion. In many ways, MG-RAST has evolved to be the counterpoint to the now-abundant one-offs that are routinely implemented in many laboratories for sequence analysis. It offers reproducibility and was designed for efficient execution [9] (see Lessons learned, L9). To date, MG-RAST has processed >295 000 data sets from 23 000 researchers. As of June 2017, over 1 trillion individual sequences totaling >40 terabase pairs have been processed, and the total volume of data generated is well over half a petabyte of data. A fair assessment is that we do a lot of the heavy lifting of high-volume automated analysis of amplicon and shotgun metagenomes as well as metatranscriptomes for a large user community. Currently, only 20% (44 000) of the data sets in MG-RAST are publicly available. Data are frequently shared by researchers with only their collaborators. In future releases, we will introduce a series of features to incentivize data publication. 
The vast majority of the data sets in MG-RAST represent user submissions; <3000 are data sets extracted from SRA by the developers. Determining identity between any two data sets is far from trivial if available metadata does not provide sufficient evidence, which is one more reason to incentivize metadata (see Lessons learned, L7). However, the developers are working jointly with researchers at EBI to synchronize the contents of EBI’s ENA with the contents of MG-RAST. Currently, to the best of our combined knowledge, there is little overlap between the data sets in SRA/ENA and those in MG-RAST. The analysis shown in Figure 1 is typical for one class of user queries. We note that in addition to requesting SEED annotations, the user might also request annotations from the M5NR sub-databases (i.e. namespaces) such as KEGG pathways [25], KEGG orthologues [26], COG [27] and RefSeq [28]. Providing a smart data product that can be projected with no computation onto other namespaces (read: annotation databases) saves a significant amount of computational resources (see Lessons learned, L3).
Figure 1. MG-RAST data and analysis results can be reused for other purposes. Here, we show a MUSCLE [29] alignment of the Prodigal translations of filtered sequences from the following unauthenticated API call: http://api.metagenomics.anl.gov//annotation/sequence/mgm4662210.3?evalue=10&type=function&source=Subsystems&filter=Inosine-5.
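The API call quoted in the Figure 1 caption can also be assembled programmatically. The following is a minimal Python sketch under stated assumptions, not code taken from MG-RAST: the host, path and query parameters are copied from that call, whereas the helper name `annotation_sequence_url` and its arguments are hypothetical and introduced only for illustration. The sketch uses a single slash before `annotation` (the quoted call contains a double slash), and the exact meaning of the `evalue` parameter should be checked against the API documentation [8]. No network request is made.

```python
from urllib.parse import urlencode

# Host and path layout copied from the API call quoted in the Figure 1 caption.
API_BASE = "http://api.metagenomics.anl.gov"


def annotation_sequence_url(metagenome_id, evalue, source, filter_text,
                            annotation_type="function"):
    """Build an annotation/sequence request URL (hypothetical helper)."""
    # Parameter names mirror the query string of the Figure 1 call.
    query = urlencode({
        "evalue": evalue,
        "type": annotation_type,
        "source": source,
        "filter": filter_text,
    })
    return f"{API_BASE}/annotation/sequence/{metagenome_id}?{query}"


# The request from Figure 1: soil metagenome mgm4662210.3, Subsystems namespace.
figure1_url = annotation_sequence_url("mgm4662210.3", 10, "Subsystems", "Inosine-5")

# Same data set with a different cut-off: only the request changes.
alternative_url = annotation_sequence_url("mgm4662210.3", 5, "Subsystems", "Inosine-5")

print(figure1_url)
print(alternative_url)
```

Because the cut-off is part of the request rather than of the stored analysis, changing it only changes the URL. The response could then be fetched (for example with `urllib.request.urlopen`) and passed to an in-house filtering or alignment step.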
Compared with previous versions of MG-RAST, the latest version has increased throughput dramatically while using the same amount of resources: ∼22 million core-hours annually are used to run the MG-RAST workflow for user-based submissions. In addition, the RESTful API has allowed a rethinking and restructuring of the user interface (different model–view–controller pattern) and, most important, the reproducibility of results using containers.
Implementation
While MG-RAST started as an application built on a traditional LAMP stack [30], we quickly realized that a single database could not provide sufficient flexibility to support various underlying components at the scale required. Instead, we chose to rely on an open API [8] that provides the ability to change underlying components as required by scale. We note that MG-RAST is not a comprehensive analysis tool offering every conceivable kind of boutique analysis. By making some data-filtering parameters user adjustable at analysis or download time, MG-RAST provides flexibility. Via the API, users and developers can pipe MG-RAST data and results into their in-house analysis procedures. Figure 1 shows an MG-RAST API query for sequences with similarities to proteins with the SEED Subsystem namespace annotation inosine-5′ phosphate dehydrogenase from the soil metagenome mgm4662210.3 that is streamed into a filtering and alignment procedure. A key feature of MG-RAST (see Lessons learned, L6) is its ability to adjust database match parameters at query time, a function frequently not recognized by researchers and in some cases missed even by studies comparing systems [31]. MG-RAST has been designed to treat every data set with the same pipeline. Given the expected volume and variety of data sets, per-data set optimization of parameters has not been a design goal.
The system is optimized for robust handling of a wide variety of input types, and users can perform optimizations within sets of parameters that filter the pipeline results. The automatic setting of, for instance, detection thresholds for dramatically different data types and research questions is not the role of a data analysis platform. While this one-size-fits-all nature of the processing might somewhat limit sensitivity and potentially limit downstream scientific inquiry, these limitations are counterbalanced by the vast scope of the consistently analyzed data universe that the uniformly applied workflows and data management and discovery systems enable researchers to access. We believe that relying on smart data products that enable adjustment of parameters after processing and using custom downstream analysis scripts more than compensate for any reduction in sensitivity (see Lessons learned, L3 and L6).
Backend components
Figure 2 shows the current design of the MG-RAST backend components, using various databases and caching systems [32–35] as appropriate to support the API with the performance needed.
Figure 2. Backend of MG-RAST version 4 using several database systems to enable efficient querying via the API.
A major criterion for success of the workflow is the ability to scale to the throughput levels required. Algorithmic changes (e.g. adoption of Diamond [5]) can help, but the design of the execution environment, most specifically its portability, is the single key to scaling (see Lessons learned, L4 and L5).
Access to data
In computational biology, shared filesystems traditionally are used to serve data to the computational resources. Sharing data between multiple computers is necessary because the data typically require more computational resources than a single machine can provide.
Shared filesystems can render data accessible on several computers. This approach, however, limits the range of available platforms or requires significant time for configuring access or moving data into the platform. In addition, many shared filesystems exhibit poor scaling performance for science applications. Slow or inadequate shared filesystems have been observed by almost every practitioner of bioinformatics (see Lessons learned, L2). This situation has forced the use of complex I/O middleware to transform science I/O workloads into patterns that can scale in various science domains, including quantum chromodynamics and astrophysics [36], molecular dynamics [37], fusion science [38] and climate [39]. Rather than adopting this approach, we conducted a detailed analysis of our workloads, which revealed that individual computational units (e.g. cluster nodes) typically use a small fraction of the data and do not require access to the entire data set. Consequently, we chose to centralize data into a single point and access it in a RESTful way, thus providing efficient access while requiring no configuration for the vast majority of computing systems. A single object store can support distributed streaming analysis of data across many computers (see Lessons learned, L8). The SHOCK object store [40] provides secure access to data and, most important, to subsets of the data. A computational client node can request a number of sequence records or sets of records meeting specific criteria. Data are typically streamed at significant fractions of line speed, and as results are frequently returned as indices that are much smaller than the original data files, writing is extremely efficient. Furthermore, the data are primarily write-once, which significantly simplifies the design of the object store with respect to data consistency. Data in SHOCK is available to third parties via a RESTful API, and thus, SHOCK supports the reuse of both data and results. 
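The idea that a compute node needs only a slice of a data set, rather than the whole file, can be illustrated with a small generic sketch. It is not the SHOCK API: the function names and the in-memory FASTA example are assumptions made purely for illustration. The sketch reads a sequence stream record by record and returns a requested range of records without holding the full data set in memory.

```python
import io
from itertools import islice


def fasta_records(stream):
    """Yield (header, sequence) tuples from a FASTA text stream, one at a time."""
    header, chunks = None, []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def record_subset(stream, start, count):
    """Return `count` records beginning at zero-based record index `start`."""
    return list(islice(fasta_records(stream), start, start + count))


# Hypothetical four-record data set standing in for a much larger file.
demo = io.StringIO(">r1\nACGT\n>r2\nGG\nCC\n>r3\nTTTT\n>r4\nA\n")
subset = record_subset(demo, 1, 2)
print(subset)
```

Because the records are produced lazily, a worker that asks for records 1 and 2 stops reading as soon as the third record has been returned. The same property is what makes subset requests cheap on the client side in a streaming design.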
Execution format
Executing workflows across a number of systems requires that the code be made available in suitable binary form on those platforms. Among the emerging challenges, reproducibility is a key problem for scientific disciplines that rely on the results of sequence analysis at scale without the ability to validate every single computational step in depth. Virtual machines have been used to provide stable and portable execution environments [41] for a number of years. However, because of many technical details (e.g. the significant number of binary formats required to cover all platforms) and significant overhead [42] in execution, containers provide a more suitable platform for most scientific computations. In particular, the relatively recent advent of binary Linux containers (notably, Docker) in computing affords a novel way to distribute execution environments. Containers reduce the set of requirements for any given software package to one: a container. We have devised a scalable system [43] to execute scientific workflows across a number of containers connected only via a RESTful interface to an object store. With increasing numbers of systems supporting containerized execution [44] and with compatibility mechanisms [45] emerging to support legacy installations, Linux containers are quickly becoming the lingua franca of binary execution environments (see Lessons learned, L5). As with all of MG-RAST, the recipes for building the containers (‘Dockerfiles’) are available as open source on GitHub, and the binary containers are available on Docker Hub. The resulting containers are not specific to the MG-RAST systems, and the binary containers and the recipes are available to third parties for their adoption.
Current MG-RAST workflow
MG-RAST has been used for tens of thousands of data sets. This extensive use has led to a level of stability and robustness that few sequence analysis workflows can match.
The workflow (version 4.01) consists of the following logical steps:
Data hygiene: Providing quality control and normalization steps that also include mate pair merging with ea-utils fastq-join [46–48]. The focus, however, is on artifact removal and host DNA removal [48, 49].
Feature extraction: Using a predictor that has been shown to be robust against sequence noise (FragGeneScan [50]) to predict potentially protein-coding features, and using a purposefully simple similarity-based approach to predict ribosomal RNAs using VSEARCH [51]. The similarity-based predictions use a version of M5RNA [52] that was clustered at 70% identity to find candidate ribosomal RNA sequences.
Data reduction: Clustering of predicted features at 90% identity (protein coding) and 97% (ribosomal RNA). Features overlapping with predicted ribosomal RNA (rRNA) sequences are removed. For each cluster, the longest representative is used.
Feature annotation: Using similarity-based mapping of cluster representatives using super nonredundant M5NR [52] with a parallelized version of BLAT [53] for candidate proteins and ribosomal RNAs. This creates ‘annotations’ with M5NR database identifiers only.
Profile creation: Mapping the M5NR identifiers to several functional namespaces (e.g. RefSeq or SEED), hierarchical namespaces (e.g. COG and Subsystems), pivoting into functional and taxonomic categories, and thus creating a reduced fingerprint (‘profile’) for each namespace and hierarchy.
Database load: Uploading profiles to the various MG-RAST backend databases that support the API.
We note that the approach taken to sequence analysis is different from the state of the art for more or less complete microbial genomes [54].
Using data from MG-RAST
A key problem of current big data bioinformatics is the barrier to reuse of data and results.
Comparing results of an expensive computational procedure with results from another laboratory can be problematic if the procedures used are not identical (potentially compromising integrity of the study). Another common approach is to not reuse existing results but to do an expensive reanalysis of both data sets, thus duplicating the work originally performed. One key problem with this approach is that data-driven science is no longer reviewable, as no reviewer can be expected to retrace the steps of the investigators while duplicating their computational work. If the data and results (also intermediate results) were available as reproducible entities, the problem of data uncertainty and costly recomputations would disappear. This exceptional waste of computer time is acceptable default behavior in a discipline that is rich in computation and poor in data. In a data-rich ecosystem, however, either the terms of engagement have to change or the percentage of the research budgets allocated to computational resources has to dramatically increase. One of the key goals of MG-RAST is to provide a wealth of data sets and the underlying analysis results. Both the Web-based user interface and the RESTful API make these results accessible. To get closer to our goal of transparent and reproducible MG-RAST data analysis, we already execute all workflow steps in containers. The missing building block, which we are currently working on and which will enable every interested party to easily execute, compare or modify our analysis pipeline, is support for the Common Workflow Language (CWL) [55] in our workflow engine. We think that producing data with a CWL workflow adds more value because it adds executable provenance (see Lessons learned, L4). Executable provenance is critical, as it allows recreation of the results on a wide variety of computational platforms.
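To make one of the intermediate results mentioned above concrete, consider the data reduction step of the workflow, in which the longest member of each cluster is kept as its representative. The sketch below is an illustration under stated assumptions rather than the pipeline's code: the input layout (cluster identifier mapped to a list of feature identifier and sequence pairs) and the function name are hypothetical, and recording the cluster size next to the representative is an added assumption so that abundance could be carried forward.

```python
def longest_representatives(clusters):
    """Pick the longest sequence of each cluster as its representative.

    `clusters` maps a cluster id to a list of (feature_id, sequence) pairs.
    Returns a dict of cluster id -> (representative feature id, cluster size).
    On a tie in length, the member listed first is kept.
    """
    reps = {}
    for cluster_id, members in clusters.items():
        if not members:
            continue  # an empty cluster has no representative
        rep_id, _ = max(members, key=lambda member: len(member[1]))
        reps[cluster_id] = (rep_id, len(members))
    return reps


# Hypothetical clusters of predicted protein features.
demo_clusters = {
    "c1": [("f1", "MKT"), ("f2", "MKTAYIAK"), ("f3", "MKTAY")],
    "c2": [("f4", "MSL")],
}
representatives = longest_representatives(demo_clusters)
print(representatives)
```

Only the representatives would then be passed to the similarity search, which is where the reduction in computational cost comes from.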
Using profiles generated by MG-RAST
Profiles are the primary data product generated by MG-RAST, and they feed into the Web user interface and the various other tools. They encode the abundance of entities in a given sample combining information from several databases. Most important, profiles include information on the quality of the underlying observation (e.g. results of sequence similarity search) (Figure 3). Profiles are a compressed representation of the environmental samples, allowing large-scale comparisons. Another critical feature is the ability to adjust matching parameters (e.g. minimal alignment length required for inclusion) at analysis time, allowing data reuse without the need for recomputing the profile with different cutoffs. With this ‘smart data product’, data consumers can switch between reference databases and parameter sets without recomputing the underlying sequence similarity searches (see Lessons learned, L3).
Metadata: making data discoverable
A key component of data reuse is the much-discussed ‘metadata’ (or ‘data describing data’). With tens of thousands of data sets available, the ability to identify the relevance of data sets has become critical. Approaches include ‘simple’ machine-readable encoding of data items such as pH, temperature and location and the use of controlled vocabularies to allow unambiguous encoding of, for example, anatomical organs via [56] or geographical features using the ENVO ontology [57]. Machine-readable metadata, such as the concepts championed by the Genomic Standards Consortium (GSC) [58], is key. GSC metadata is intentionally kept as simple and lightweight as possible while trying to meet the needs of the data producers and data consumers. Despite its simplicity, however, for the occasional user (e.g. a scientist depositing data), it is still cumbersome. Tools such as Metazen [59] help bridge the gap between data scientists and occasional users.
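The ‘smart data product’ idea described under ‘Using profiles generated by MG-RAST’ can be illustrated with a small sketch. It assumes a simplified profile in which every entry keeps its abundance together with the match quality of the underlying similarity search; the record layout, field names, annotation labels and default cutoffs are all hypothetical and do not reproduce the MG-RAST profile format. The point is that stricter or looser cutoffs are applied when the profile is read, with no new similarity search.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One profile entry: an annotation plus the quality of its match."""
    annotation: str    # label in some namespace (illustrative)
    abundance: int     # number of sequences supporting this entry
    evalue: float      # e-value of the similarity match
    identity: float    # percent identity of the alignment
    align_len: int     # alignment length


def abundance_by_annotation(hits, max_evalue=1e-5, min_identity=60.0,
                            min_align_len=15):
    """Sum abundances per annotation for entries that pass the cutoffs.

    The default cutoffs are illustrative values only.
    """
    totals = Counter()
    for hit in hits:
        if (hit.evalue <= max_evalue and hit.identity >= min_identity
                and hit.align_len >= min_align_len):
            totals[hit.annotation] += hit.abundance
    return dict(totals)


profile = [
    Hit("IMP dehydrogenase", 12, 1e-30, 92.0, 80),
    Hit("IMP dehydrogenase", 3, 1e-4, 61.0, 20),
    Hit("Ribosomal protein S1", 7, 1e-12, 70.0, 14),
]
strict = abundance_by_annotation(profile)
relaxed = abundance_by_annotation(profile, max_evalue=1e-3, min_align_len=10)
print(strict)
print(relaxed)
```

Both results are derived from the same stored entries, so changing the cutoffs costs a single pass over the profile instead of a recomputation of the searches.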
MG-RAST implements the core MIxS [60] checklist, as well as all available environmental packages [61]. GSC-compliant, machine-readable markup of data sets at the time of upload to or deposition in online resources offers a unique opportunity. Data become discoverable, and analysis is made easier. MG-RAST incentivizes the addition of metadata by offering priority access to the compute resources to data sets with valid GSC metadata (see Lessons learned, L7).
Web user interface
Not all scientists spend a significant fraction of their time on the command line or enjoy using the command line to solve their bioinformatics questions. Extracting and displaying the relative abundance of proteins classified as part of the subsystem class ‘Protein Metabolism’ from the phylum Proteobacteria is simple via the Web interface (Figure 4) but requires many command line invocations. For these users, MG-RAST provides a graphical user interface (GUI) implemented in JavaScript/HTML5. The GUI provides guidance for nontrivial procedures such as data upload and validation, data sharing and data discovery, as well as data analysis (Figure 5A). Data export in various formats is also supported (Figure 5B).
Figure 3. MG-RAST profile encoding abundance and matching parameter information as well as information on the observed entities.
Figure 4. Relative abundance of protein functional classes (‘Subsystems’) in Proteobacteria (‘RefSeq Phylum’) displayed as a waterfall diagram for data sets in study mgp128 as displayed by the version 4.0 MG-RAST graphical user interface.
User’s view of MG-RAST
Every user has a different view of the data in MG-RAST. All users have access to the public metagenomics data, but shared or private data available to the user are linked to the user’s login information. Each data set has a unique identifier and information on visibility; until the data are made publicly available, temporary identifiers are used to minimize the number of data sets mentioned in the literature without being publicly available. Figure 6 provides a comparison of public and private data sets and highlights the sharing and data organization capabilities of the platform.
Figure 5. (A) Heatmap and clustering of the occurrence of Corynebacteria in study mgp128 as displayed by the MG-RAST web frontend. (B) Data export options available for the data and visualization, including sequences and abundance in tabular and JSON format.
Figure 6. Public study (with permanent unique identifier mgp128) and private study set with temporary identifier. A study groups multiple data sets, provides a single identifier and allows sharing via simply providing an email address for the person the data are to be shared with.
A key design feature of MG-RAST is to allow private data sets; users are in charge of uploading, sharing and releasing the data. Once submitted, data are private to the submitting user. The submitting user is reminded to share their data at their earliest convenience. In addition to data, the processing pipeline and the data warehousing, MG-RAST provides an analytical tool set. It is implemented as a user-friendly Web application that consumes the profiles generated by the MG-RAST pipeline.
Future work
As technologies change, MG-RAST is adapting. Newly available software is being included to improve accuracy and performance. As a computational service constantly running large-volume scientific workflows, MG-RAST is the right location to perform benchmarking and implement algorithmic or platform improvements, in many cases involving trade-offs between specificity, sensitivity and run-time cost. The work in [62] is an example. We use existing well-studied data sets as gold standards representing different environments and different technologies to evaluate any changes to the pipeline. Currently, we use well-understood data sets in MG-RAST as a platform for benchmarking. The use of artificial data sets for pipeline performance optimization has not added value because these data sets do not present the same challenges as real-world data sets do. The MG-RAST team welcomes suggestions for improvements of the workflow. We are currently working on versions 4.02 and 4.1, both of which contain significant input from the community and our partners that will enable double barcoding and stronger inferences supported by longer-read technologies and will increase throughput while maintaining sensitivity by using Diamond and SortMeRNA.
On the technical platform side, the MG-RAST team intends to support the CWL as a standard to specify bioinformatics workflows, to facilitate both development and efficient high-performance implementation of the community’s data analysis tasks.
Lessons learned
L1. Analyzing large-scale environmental DNA is different from genomics. Because of the absence of high-quality assembled data (in most projects) and the lack of good models for removing contaminations upstream, a metagenomics portal site has to take over quality control and normalization and become good at it.
L2. Data I/O is as limiting as CPU and RAM. A bad tradition in bioinformatics is ignoring the cost of I/O. Large-scale distributed systems need to model the I/O cost explicitly and design their solution to include I/O cost as well as CPU cost.
L3. Using smart data products helps avoid costly recomputations and empowers downstream tool builders. The bad tradition of downloading raw data and creating spreadsheets with results is not sustainable. While bioinformatics is not yet able to fully rely on disseminating data as research objects [63], we need to move toward them.
L4. The use of reproducible workflows such as CWL [55, 64] is a crucial requirement for any service generating data meant for reuse. Providing a detailed, portable, executable recipe for how the data were generated is important to data consumers. In addition, making the recipes available supports improvement to the workflows by third parties.
L5. Containers should be used to capture the execution environment. Containers (e.g. Linux containers) capture the environment in a reproducible format. Workflows without their environment are less than useful.
L6. Data reuse is critical for saving computational cost.
While the reproducibility resulting from reproducible execution environments is great, providing intermediate results adds significantly more value to reviewers and fosters reuse of computational results for a variety of purposes such as building software to improve existing components (e.g. feature predictors) or using the data for scientific projects.
L7. Metadata is invaluable and should be required. Users require encouragement to provide metadata. We aim to make users submit metadata as early as possible, and to incentivize users, we provide high-quality tools that make metadata collection easy.
L8. The complexity of shared filesystems should be avoided whenever possible. Relying on RESTful interfaces instead of shared filesystems provides cross-cloud execution capabilities, allowing us to run on almost any computational platform including the cheapest computational platform available.
L9. Portals are the right place for performance engineering. While many biomedical informatics groups are computationally proficient, the convergence of large-scale processing and domain expertise makes portal sites an ideal location for optimization. Running many workflows thousands of times and providing services to many other groups is a good platform for accumulating expertise.
Discussion
As more environmental DNA sequence data become available to the research community, a new set of challenges emerges. These challenges require a change in approach to computing at the community level. We describe a domain-specific portal that, like its European companion system [11], acts as an integrator of data and efficiently implements domain-specific workflows. The lessons learned about building scale-out infrastructure dedicated to executing bioinformatics workflows and the resulting middleware systems [19, 20, 40, 59, 65] will benefit both the community of users and researchers attempting to build efficient sequence analytics workflows.
Reproducible efficient execution of domain-specific workflows is a central contribution of the MG-RAST system. Provisioning of data and results via a Web interface and a RESTful API is another key aspect. Encouraging data reuse by provisioning both data and results (as well as intermediate files) via a stable API is a key function that serves the community of bioinformatics developers, who can use precomputed data that are well described by a workflow, rather than implementing their own (frequently subpar) preprocessing steps, and thus can focus on their key mission. By providing preanalyzed data (using an open recipe that is available to the community for discussion and improvement), MG-RAST can help reduce the current ‘method uncertainty’, where individual data sets analyzed with different analysis strategies can lead to dramatically different interpretations. The role of MG-RAST is not one-size-fits-all. Rather than being the one and only analysis mechanism, MG-RAST is a well-designed high-performance system on top of an efficient scale-out platform [66] that can take some of the heavy lifting off the shoulders of individual researchers. Researchers can add their own custom boutique analyses at a fraction of the computational and development cost, allowing them to focus on their specific problem and thus maximizing overall productivity. With the state of the art of sequencing technology shifting, MG-RAST will adapt to extract maximum value by, for example, explicitly supporting value-added information from longer sequences with multiple features (e.g. for taxonomy calling). We also anticipate that the currently used alignment-based methods will be supplemented by profile-based methods for performance reasons within a few years.
Key Points
Analyzing the growing volume of biomedical environmental sequence data requires cost-effective, reproducible and flexible analysis platforms and data reuse and is significantly different from analyzing (almost) complete genomes.
The hosted MG-RAST service provides a Linux container-based workflow system and a RESTful API that allow data and analysis reuse.
Community portals are the right location for performance engineering, as they operate at the required scale.
Folker Meyer is a Senior Computational Biologist at Argonne National Laboratory; a Professor at the Department of Medicine, University of Chicago; and a Senior Fellow at the Computation Institute at the University of Chicago. He is also deputy division director of the Biology Division at Argonne National Laboratory and a senior fellow at the Institute of Genomics and Systems Biology (a joint Argonne National Laboratory and University of Chicago Institute).
Saurabh Bagchi is a Professor in the School of Electrical and Computer Engineering and the Department of Computer Science (by courtesy) at Purdue University. He is the founding Director of CRISP, a university-wide resiliency center at Purdue.
Somali Chaterji is a biomedical engineer and medical data analyst. She is a Research Faculty at Purdue University, specializing in high-performance computing infrastructures and algorithms for synthetic biology and epigenomics.
Wolfgang Gerlach, PhD is a Bioinformatics Senior Software Engineer at the University of Chicago with a joint appointment at Argonne National Laboratory.
Ananth Grama is a Professor of Computer Science at Purdue University. He also serves as the Associate Director of the Center for Science of Information, a Science and Technology Center of the National Science Foundation.
Travis Harrison is a Bioinformatics Senior Software Engineer at the University of Chicago with a joint appointment at Argonne National Laboratory.
Tobias Paczian is a Senior Developer at the University of Chicago with a joint appointment at Argonne National Laboratory. He has more than a decade of experience building User Interfaces for bioinformatics applications.
William L. Trimble, PhD is a postdoctoral researcher at Argonne National Laboratory with a background in physics and data science.
Andreas Wilke is a Principal Bioinformatics Specialist at Argonne National Laboratory with a joint appointment at the University of Chicago. He has more than a decade of experience building bioinformatics applications.
Acknowledgements
The authors thank Dion Antonopoulos, Gail Pieper and Robert Ross for their input and help.
Funding
The work reported in this article was supported in part by National Institutes of Health (NIH) grant 1R01AI123037-01. Work on this article was also supported by NSF award 1645609. This work was supported in part by the NIH award U01HG006537 ‘OSDF: Support infrastructure for NextGen sequence storage, analysis, and management’ and by the Gordon and Betty Moore Foundation with the grant ‘6-34881, METAZen-Going the Last Mile for Solving the Metadata Crisis’. This material was based on work supported by the US Department of Energy, Office of Science, under contract DE-AC02-06CH11357.
References
1 NHGRI. DNA sequencing costs. https://www.genome.gov/sequencingcosts/.
2 Afgan E, Baker D, van den Beek M, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res 2016;44:W3–10.
3 Doring A, Weese D, Rausch T, et al. SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinformatics 2008;9:11.
4 Xia F, Dou Y, Xu J. Families of FPGA-based accelerators for BLAST algorithm with multi-seeds detection and parallel extension. In: Elloumi M, Küng J, Linial M, et al. (eds), Bioinformatics Research and Development: Second International Conference, BIRD 2008 Vienna, Austria, July 7-9, 2008 Proceedings. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, 43–57.
5 Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods 2015;12:59–60.
6 Meyer F, Paarmann D, D'Souza M, et al. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 2008;9:386.
7 Wilke A, Bischof J, Gerlach W, et al. The MG-RAST metagenomics database and portal in 2015. Nucleic Acids Res 2016;44:D590–4.
8 Wilke A, Bischof J, Harrison T, et al. A RESTful API for accessing microbial community data for MG-RAST. PLoS Comput Biol 2015;11:e1004008.
9 Desai N, Antonopoulos D, Gilbert JA, et al. From genomics to metagenomics. Curr Opin Biotechnol 2012;23:72–6.
10 Chen IA, Markowitz VM, Chu K, et al. IMG/M: integrated genome and metagenome comparative data analysis system. Nucleic Acids Res 2017;45:D507–16.
11 Mitchell A, Bucchini F, Cochrane G, et al. EBI metagenomics in 2016 - an expanding and evolving resource for the analysis and archiving of metagenomic data. Nucleic Acids Res 2016;44:D595–603.
12 Huson DH, Weber N. Microbial community analysis using MEGAN. Methods Enzymol 2013;531:465–85.
13 Kopylova E, Noe L, Touzet H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics 2012;28:3211–7.
14 Kang DD, Froula J, Egan R, et al. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 2015;3:e1165.
15 Eren AM, Esen OC, Quince C, et al. Anvi'o: an advanced analysis and visualization platform for 'omics data. PeerJ 2015;3:e1319.
16 Imelfort M, Parks D, Woodcroft BJ, et al. GroopM: an automated tool for the recovery of population genomes from related metagenomes. PeerJ 2014;2:e603.
17 Alneberg J, Bjarnason BS, de Bruijn I, et al. Binning metagenomic contigs by coverage and composition. Nat Methods 2014;11:1144–6.
18 Segata N, Waldron L, Ballarini A, et al. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods 2012;9:811–4.
19 Tang W, Bischof J, Desai N, et al. Workload characterization for MG-RAST metagenomic data analytics service in the cloud. In: Proceedings of IEEE International Conference on Big Data, Washington, DC, USA, 2014. IEEE Press, Piscataway, NJ, USA.
20 Tang W, Wilkening J, Bischof J, et al. Building scalable data management and analysis infrastructure for metagenomics. In: 5th International Workshop on Data-Intensive Computing in the Clouds, Poster at Supercomputing 2013.
21 Wilke A, Wilkening J, Glass EM, et al. An experience report: porting the MG-RAST rapid metagenomics analysis pipeline to the cloud. Concurr Comput 2011;23:2250–7.
22 openstack.org. OpenStack. https://www.openstack.org/.
23 Amazon Inc. Amazon Web Services. https://aws.amazon.com/.
24 Microsoft Inc. Azure. https://azure.microsoft.com/.
25 Kanehisa M, Goto S, Sato Y, et al. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res 2014;42:D199–205.
26 KEGG. KAAS - KEGG automatic annotation server. http://www.genome.jp/kegg/kaas/.
27 Tatusov RL, Fedorova ND, Jackson JD, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003;4:41.
28 O'Leary NA, Wright MW, Brister JR, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 2016;44:D733–45.
29 Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004;32:1792–7.
30 Wikipedia. https://en.wikipedia.org/wiki/LAMP_(software_bundle).
31 Plummer E, Twin J, Bulach DM, et al. A comparison of three bioinformatics pipelines for the analysis of preterm gut microbiota using 16S rRNA gene sequencing data. J Proteom Bioinform 2016;283–91.
32 Alexandre R. Instant Apache Solr for Indexing Data How-to. Packt Publishing Limited, 2013.
33 Elasticsearch BV, Elastic search. https://www.elastic.co/products/elasticsearch.
34 Cassandra. http://cassandra.apache.org/.
35 Inc. O. MySQL. https://www.mysql.com/.
36 Bent J, Gibson G, Grider G, et al. A Checkpoint Filesystem for Parallel Applications. 2009.
37 Jens Freche WF, Sutmann G. High-Throughput Parallel-I/O using SIONlib for Mesoscopic Particle Dynamics Simulations on Massively Parallel Computers, Advances in Parallel Computing; Volume 19: Parallel Computing: From Multicores and GPU's to Petascale, 371–78, DOI: 10.3233/978-1-60750-530-3-371. IOS Press, Amsterdam.
38 Jay FL, Scott K, Karsten S, et al. Flexible IO and integration for scientific codes through the adaptable IO system (ADIOS). In: Proceedings of the 6th International Workshop on Challenges of Large Applications in Distributed Environments. Boston, MA: ACM, 2008, 15–24.
39 Dennis JM, Edwards J, Loy R, et al. An application level parallel I/O library for earth system models. Int J High Perform Comput Appl 2012;26:43–53.
40 Bischof J, Wilke A, Gerlach W, et al. Shock: active storage for multicloud streaming data analysis. In: 2nd IEEE/ACM International Symposium on Big Data Computing. Limassol, Cyprus, 2015.
41 Wilkening J, Wilke A, Desai N, et al. Using Clouds for Metagenomics: A Case Study. CLUSTER. New Orleans, LA: IEEE Computer Society, 2009, 1–6.
42 Felter W, Ferriera A, Rajamony R, et al. An Updated Performance Comparison of Virtual Machines and Linux Containers. http://domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf.
43 Gerlach W, Tang W, Keegan K, et al. Skyport: container-based execution environment management for multi-cloud scientific workflows. In: Proceedings of the 5th International Workshop on Data-Intensive Computing in the Clouds. 2014, 25–32. IEEE Press, Piscataway, NJ, USA.
44 Kurtzer G, Sochat V, Bauer M. Singularity: Scientific containers for mobility of compute. PLoS ONE 2017;12(5):e0177459.
45 udocker. https://github.com/indigo-dc/udocker.
46 Keegan KP, Trimble WL, Wilkening J, et al. A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE.
PLoS Comput Biol 2012 ; 8 : e1002541. Google Scholar Crossref Search ADS PubMed WorldCat 47 Marcais G , Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers . Bioinformatics 2011 ; 27 : 764 – 70 . Google Scholar Crossref Search ADS PubMed WorldCat 48 Aronesty E. Comparison of sequencing utility programs . Open Bioinform J 2013 ; 7 : 1 – 8 . Google Scholar Crossref Search ADS WorldCat 49 Langmead B , Salzberg SL. Fast gapped-read alignment with Bowtie 2 . Nat Methods 2012 ; 9 : 357 – 9 . Google Scholar Crossref Search ADS PubMed WorldCat 50 Rho M , Tang H, Ye Y. FragGeneScan: predicting genes in short and error-prone reads . Nucleic Acids Res 2010 ; 38 : e191. Google Scholar Crossref Search ADS PubMed WorldCat 51 Rognes T , Flouri T, Nichols B, et al. VSEARCH: a versatile open source tool for metagenomics . PeerJ 2016 ; 4 : e2584. Google Scholar Crossref Search ADS PubMed WorldCat 52 Wilke A , Harrison T, Wilkening J, et al. The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools . BMC Bioinformatics 2012 ; 13 : 141. Google Scholar Crossref Search ADS PubMed WorldCat 53 Kent WJ. BLAT–the BLAST-like alignment tool . Genome Res 2002 ; 12 : 656 – 64 . Google Scholar Crossref Search ADS PubMed WorldCat 54 Overbeek R , Bartels D, Vonstein V, et al. Annotation of bacterial and archaeal genomes: improving accuracy and consistency . Chem Rev 2007 ; 107 : 3431 – 47 . Google Scholar Crossref Search ADS PubMed WorldCat 55 Amstutz P , Crusoe MR, Tijanić N, et al. Common Workflow Language, v1.0. https://doi.org/10.6084/m9.figshare.3115156.v2. 56 Mungall CJ , Torniai C, Gkoutos GV, et al. Uberon, an integrative multi-species anatomy ontology . Genome Biol 2012 ; 13 : R5. Google Scholar Crossref Search ADS PubMed WorldCat 57 Buttigieg PL , Morrison N, Smith B, et al. The environment ontology: contextualising biological and biomedical entities . 
J Biomed Semantics 2013 ; 4 : 43. Google Scholar Crossref Search ADS PubMed WorldCat 58 Field D , Sterk P, Kottmann R, et al. Genomic standards consortium projects . Stand Genomic Sci 2014 ; 9 : 599 – 601 . Google Scholar Crossref Search ADS PubMed WorldCat 59 Bischof J , Harrison T, Paczian T, et al. Metazen - metadata capture for metagenomes . Stand Genomic Sci 2014 ; 9 : 18. Google Scholar Crossref Search ADS PubMed WorldCat 60 Yilmaz P , Kottmann R, Field D, et al. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications . Nat Biotechnol 2011 ; 29 : 415 – 20 . Google Scholar Crossref Search ADS PubMed WorldCat 61 Glass EM , Dribinsky Y, Yilmaz P, et al. MIxS-BE: a MIxS extension defining a minimum information standard for sequence data from the built environment . ISME J 2014 ; 8 : 1 – 3 . Google Scholar Crossref Search ADS PubMed WorldCat 62 Trimble WL , Keegan KP, D'Souza M, et al. Short-read reading-frame predictors are not created equal: sequence error causes loss of signal . BMC Bioinformatics 2012 ; 13 : 183. Google Scholar Crossref Search ADS PubMed WorldCat 63 Sean B , Buchan I, De Roure D, et al. Why linked data is not enough for scientists . Fut Gener Comput Syst 2013 ; 29 ( 2 ): 599 – 611 . Google Scholar Crossref Search ADS WorldCat 64 Crusoe MR , Brown CT. Walking the talk: adopting and adapting sustainable scientific software development processes in a small biology lab . J Open Res Softw 2016 ; 4 : e44 . Google Scholar Crossref Search ADS PubMed WorldCat 65 Tang W , Wilkening J, Desai N, et al. A scalable data analysis platform for metagenomics. In: 2013 IEEE International Conference on Big Data, Silicon Valley, CA, USA, 2013 . IEEE Press, Piscataway, NJ, USA. 66 Michael M , Moreira JE, Shiloach D, et al. Scale-up x scale-out: a case study using Nutch/Lucene. In: 2007 IEEE International Parallel and Distributed Processing Symposium. 2007 , 1. 
Published by Oxford University Press on behalf of Entomological Society of America 2017. This work is written by US Government employees and is in the public domain in the US.
journal article
LitStream Collection
MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization

Katoh, Kazutaka; Rozewicki, John; Yamada, Kazunori D

2019 Briefings in Bioinformatics

doi: 10.1093/bib/bbx108 pmid: 28968734

Abstract This article describes several features in the MAFFT online service for multiple sequence alignment (MSA). As a result of recent advances in sequencing technologies, huge numbers of biological sequences are available and the need for MSAs with large numbers of sequences is increasing. To extract biologically relevant information from such data, sophistication of algorithms is necessary but not sufficient. Intuitive and interactive tools that let experimental biologists handle large data semiautomatically are becoming important. We are developing MAFFT in both of these directions. Here, we explain (i) the Web interface for recently developed options for large data and (ii) interactive usage to refine sequence data sets and MSAs.

Keywords: multiple sequence alignment, sequence analysis, phylogenetic tree

Multiple sequence alignment (MSA) is an important step in comparative analyses of biological sequences. We provide an online service for computing MSAs on the Web using MAFFT [1, 2]. MAFFT has several different options for computing large MSAs consisting of thousands of sequences. Our service also has some additional functions (interactive sequence selection and phylogenetic inference) for preprocessing and postprocessing MSAs. Moreover, these processes can be repeated in cycles as necessary. Here, we describe usage of these functions, including recently added ones, and several tips for using our online service.

MSA of large data

The demand for MSAs with a large number of sequences is increasing along with the advance of sequencing technologies. The default option of MAFFT, FFT-NS-2, is applicable to most cases, but MAFFT has more options for constructing large MSAs. They can be selected on a dedicated page for large alignments on the MAFFT server: http://mafft.cbrc.jp/alignment/server/large.html. Below, we briefly explain the options available on this page. Headings (A)–(G) correspond to those in Figure 1.
Benchmark results of these options are shown in Table 1. Commands for locally running those options are available in the last section.

Table 1 Results of two different benchmarks, ContTest (136 entries, 1467–43 912 sequences) [3] and HomFam (89 entries; 93–93 681 sequences) [4], for some MAFFT options available on our online server

| | Method | ContTest: accuracy score | ContTest: CPU time (minutes) | HomFam: accuracy score (SP/TC) | HomFam: CPU time (minutes) |
|---|---|---|---|---|---|
| A | PartTree (partsize = 50) | 0.4103 | 61 | 0.7862/0.5658 | 47 |
| | PartTree (partsize = 1000) | 0.4364 | 140 | 0.8258/0.6377 | 94 |
| | DPPartTree (partsize = 50) | 0.4424 | 210 | 0.8413/0.6597 | 160 |
| | DPPartTree (partsize = 1000) | 0.4632 | 1000 | 0.8541/0.6934 | 820 |
| B | FFT-NS-1 | 0.4856 | 170 | 0.8491/0.6669 | 160 |
| B+C | FFT-NS-1 (memsavetree) | 0.4835 | 280 | 0.8416/0.6667 | 260 |
| D | FFT-NS-2 | 0.4998 | 500 | 0.8759/0.7162 | 460 |
| D+C | FFT-NS-2 (memsavetree) | 0.5099 | 1100 | 0.8611/0.7023 | 990 |
| E | mafft-sparsecore (p = 100) | 0.5153 | 730 | 0.8821/0.7274 | 650 |
| | mafft-sparsecore (p = 500) | 0.5361 | 1200 | 0.8970/0.7586 | 1300 |
| | mafft-sparsecore (p = 1000) | 0.5440 | 3400 | 0.9075/0.7810 | 4400 |
| E+C | mafft-sparsecore (p = 100, memsavetree) | 0.5298 | 1500 | 0.8845/0.7416 | 1300 |
| | mafft-sparsecore (p = 500, memsavetree) | 0.5438 | 2000 | 0.8995/0.7638 | 2000 |
| | mafft-sparsecore (p = 1000, memsavetree) | 0.5428 | 4200 | 0.9052/0.7826 | 5000 |
| F | G-INS-1 | 0.5696 | 55 000 | 0.9306/0.8288 | 49 000 |
| G | Randomchain | 0.5425 | 100 | 0.8349/0.6681 | 88 |

Note: The sum-of-pairs (SP) and total-column (TC) scores for HomFam were calculated by the FastSP program [5]. (A–G) correspond to the techniques explained in the main text. Command-line arguments are displayed after performing the calculation on the online service and also listed in the main text. Random numbers are used in (A), (E) and (G). In this test, only one set of random numbers was used for each method. For (E) and (G), seed of random numbers can be specified in the download version (see the last section in the main text) but cannot be specified in the online version. See https://mafft.sb.ecei.tohoku.ac.jp/ for detailed results.

Figure 1 Screenshot of input page for large MSAs in MAFFT online service. (A–G) are explained in the main text.

A. PartTree and DPPartTree (Figure 1A) [6] are highly approximate options.
These methods recursively cluster sequences and simultaneously compute a distance between the clusters, each of which is represented by a single sequence. The order of the computational time is $O(N \log N)$, where N is the number of sequences. They are fast and applicable to large MSAs, but accuracy is sacrificed because of the approximation of guide tree calculation (Table 1). The PartTree and DPPartTree options share a basic design, but the former uses k-mer-based distance to estimate the similarity between sequences [7], while the latter uses dynamic programming (DP) [8] to estimate the similarity. Accordingly, the latter is slower but more accurate. In the command-line version, the balance between accuracy and speed can also be adjusted by a parameter, partsize, but this parameter is fixed to 1000 in the online service.

B. FFT-NS-1 (Figure 1B): This is another approximate method. Its accuracy is higher than that of PartTree and DPPartTree in benchmark tests (Table 1). The input sequences are progressively aligned using a guide tree [6, 9, 10]. For constructing the guide tree, pairwise distances are computed based on the number of shared k-mers. The k-mer length is 6 for both protein and nucleotide data, but the 20 amino acids are grouped into six physicochemical groups [11], and an amino acid sequence is converted to a sequence composed of six letters. The current version of MAFFT uses the following formula to compute the distance $D_{ij}$ between sequences i and j:

$$D_{ij} = \{1 - S_{ij}/\min(S_{ii}, S_{jj})\} / f(x, y),$$

where $S_{ij}$ is the alignment score between sequences i and j. The factor f(x, y) adjusts the distance to avoid a case where the distance between unrelated sequences happens to become zero when a long sequence and a short sequence are compared:

$$f(x, y) = a\,y/x + b/(x + b) + c,$$

where x and y are the lengths of the longer and the shorter of sequences i and j, respectively. a, b and c are empirically determined parameters; a = 0.1, b = 10 000 (nucleotide) or 2500 (amino acid) and c = 0.01.
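The distance formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MAFFT's implementation: the function names `length_correction` and `mafft_like_distance` are hypothetical, and the scores $S_{ij}$, $S_{ii}$ and $S_{jj}$ are assumed to have been computed beforehand (for FFT-NS-1, from shared k-mers).

```python
def length_correction(x, y, b, a=0.1, c=0.01):
    """f(x, y) = a*y/x + b/(x + b) + c.

    x is the length of the longer sequence, y that of the shorter one.
    a, b and c take the values quoted in the text; the function name
    itself is a hypothetical choice for this sketch.
    """
    if y <= 0 or x < y:
        raise ValueError("expected x >= y > 0")
    return a * y / x + b / (x + b) + c


def mafft_like_distance(s_ij, s_ii, s_jj, len_i, len_j, is_nucleotide=False):
    """D_ij = {1 - S_ij / min(S_ii, S_jj)} / f(x, y).

    The scores are taken as given inputs; how they are obtained is
    outside the scope of this sketch.
    """
    x, y = max(len_i, len_j), min(len_i, len_j)
    b = 10_000 if is_nucleotide else 2_500  # values quoted in the text
    raw = 1.0 - s_ij / min(s_ii, s_jj)
    return raw / length_correction(x, y, b)


if __name__ == "__main__":
    # Two amino acid sequences of equal length versus a long/short pair
    # with the same raw score ratio.
    print(mafft_like_distance(50.0, 100.0, 120.0, 300, 300))
    print(mafft_like_distance(50.0, 100.0, 120.0, 10_000, 100))
```

In this sketch f(x, y) falls well below 1 when a long sequence is paired with a short one, so the same raw score ratio yields a larger distance for the length-mismatched pair than for the equal-length pair.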
As $D_{ij}$ is computed for all sequence pairs, the computational time is proportional to $N^2$, where N is the number of sequences. The space complexity is also $O(N^2)$ by default. To build a guide tree from distances, MAFFT uses a UPGMA-like method with a small modification [12]. When merging clusters L and R into a new cluster P, the distance $D_{PC}$ from P to a third cluster C is calculated as

$$D_{PC} = s\,(D_{LC} + D_{RC})/2 + (1 - s)\,\min(D_{LC}, D_{RC}).$$

The resulting tree becomes more imbalanced [13] with smaller values of the parameter s (0 ≤ s ≤ 1). The default s value has been unchanged from 0.1 since the initial release in 2002, but can be specified with the --mixedlinkage flag in the download version.

C. To compute a guide tree with less RAM, a low-memory mode is available but not enabled by default (Figure 1C). If a calculation in the online service requires more RAM than a threshold, then the calculation is terminated and an error message is returned instructing the user to select the low-memory mode. In this mode, instead of storing a full distance matrix in RAM, distances are calculated twice during the tree building step. Accordingly, the calculation time is longer than in the normal mode.

D. FFT-NS-2 (Figure 1D): This is the default option of MAFFT. In this method, after performing FFT-NS-1, a new distance matrix and guide tree are recalculated based on the MSA, and then the final MSA is built using the new guide tree. In benchmark tests, the accuracy is generally improved by the recalculation of the guide tree as shown in Table 1. This method is at least two times slower than FFT-NS-1. The low-memory mode (Figure 1C) is also available for FFT-NS-2.

E. mafft-sparsecore (Figure 1E) [12] is a combination of the iterative refinement method [14–16] and the progressive method. It aims to improve the alignment accuracy by partly applying the iterative refinement method, which is known to be more accurate than the progressive method. The procedure was described in Yamada et al.
(2016) [12]: (i) the input sequences are sorted by length. From the upper n% of the sorted sequences, p sequences are randomly selected as ‘core’ sequences. The default values of n and p are 50 and 500, respectively. (ii) An MSA of the p core sequences is constructed by an iterative refinement option, G-INS-i. (iii) The remaining sequences are added to the core MSA using the --add option [17], which uses the progressive alignment method. The accuracy and speed are controlled by the parameter p. With larger p, the accuracy is improved, but computational cost becomes higher (Table 1), as more sequences are subjected to the iterative refinement calculation. The memory usage is mainly determined by the progressive alignment stage (iii). The low-memory mode (Figure 1C) is also available for mafft-sparsecore.

F. G-INS-1 (Figure 1F): This gives more accurate MSAs [12, 18] but takes a longer computational time and requires more RAM than other methods. This method uses an accurate guide tree based on all-to-all DP calculation and a scoring function similar to COFFEE [19] in progressive alignment. We are developing a memory-efficient version of G-INS-1, which runs in parallel on distributed memory systems or shared memory systems (manuscript in preparation). This option is experimentally supported at http://mafft.cbrc.jp/alignment/server/large-lsf.html.

G. Pileup (Figure 1G): This is the simplest strategy. The first and the second sequences are first aligned. Then, the other sequences are added to the alignment in the order in the input file. Random chain: This is similar to Pileup, but the order of sequences is randomized. The usefulness of this strategy is controversial [3, 12, 13, 20, 21]. However, because these methods have an advantage in computational simplicity, we have made them available in our service.

Selecting suitable strategies

To select suitable MAFFT options for specific problems, consider the following factors.
For aligning a small number of sequences, the iterative refinement method is known to effectively improve the accuracy, as noted above. However, for large-scale MSAs (the subject of this article), the effect of iterative refinement was recently assessed to be small. More specifically, in Figure 1 in Le et al. [18], the advantage of MAFFT-L-INS-i (an iterative refinement method) over MAFFT-L-INS-1 (a progressive method) was clearly observed for a small number of sequences but not for thousands of sequences. Moreover, a direct application of the iterative refinement method to large sequence data sets is difficult in terms of computational resources.

In benchmarks with ∼1000–100 000 sequences, G-INS-1 outperforms other methods in accuracy as shown in Table 1. The difference is statistically significant in several cases. Thus, this method is the first recommendation if computational resources allow. We are making an effort to decrease the computational resources required by this method. If it is difficult to apply G-INS-1, then the next candidate would be mafft-sparsecore, which uses the advantage of iterative refinement for small MSAs.

These two methods can be applied to typical protein sequences with <10 000 sites, but cannot be applied to long DNA sequences. In such a case, FFT-NS-2 or FFT-NS-1 can be useful, as the computational time is proportional to $L \log L$, where L is sequence length, because of the FFT approximation [1]. However, this holds only when the input sequences share global homology (from the 5′ end to the 3′ end) and the similarity level is high. MAFFT cannot handle data with genomic rearrangements, such as inversions and translocations. Also note that an MSA can be built only when the sequences are all homologous. It does not make sense to construct an MSA of nonhomologous sequences.
For a data set with many more than 100 000 sequences, PartTree and DPPartTree can be applied if the sequences are homologous, as their time complexity is $O(N \log N)$, where N is the number of sequences. However, there are also other popular programs, such as Clustal Omega [4] and UPP [22], for this purpose. The PartTree algorithm contributed to these programs theoretically and/or practically. Clustal Omega uses the mBed algorithm [23] to build a guide tree with a time complexity of $O(N \log N)$. UPP uses PASTA [24] to build a backbone MSA of a small number of sequences and then adds the remaining sequences using hmmalign [25] with a time complexity of $O(N)$. PASTA uses MAFFT-PartTree to generate the initial MSA and MAFFT-L-INS-i (an iterative refinement option for small data) to generate sub-MSAs of closely related sequences. A performance comparison including these methods can be seen at https://mafft.sb.ecei.tohoku.ac.jp/, which also includes detailed benchmark results for subsets with different data sizes.

Similarity level and difference in sequence lengths should also be considered. If the sequences are highly similar to each other and their lengths are also similar, then fast methods, such as FFT-NS-1 or even Pileup, should result in a useful MSA. If the input data have fragmentary sequences and full-length sequences, then a two-step strategy sometimes works well. That is, (i) align the full-length sequences first and then (ii) add the fragmentary sequences to the full-length MSA using the --addfragments option (see next section).

Use of existing MSA

Each step of the calculation of mafft-sparsecore (Figure 1E) can be separately or manually performed. If a reliable MSA and a set of unaligned sequences are given to http://mafft.cbrc.jp/alignment/server/add_sequences.html, then an MSA of all the sequences is returned, in which the existing MSA is preserved as the original one. Several variants, --add, --addfull, --addfragments and --addlong, are available.
They can be selected according to the relative length of new sequences to the existing MSA as illustrated in Figure 2. The four options work similarly to each other. However, sequences added with the --add option are subjected to distance calculation with time complexity of $O(N^2)$, where N is the number of sequences. In the other three options, distances between the sequences in the existing alignment are computed with a time complexity of $O(M^2)$, where M is the number of sequences in the existing MSA, to build a tree of the M sequences using the UPGMA-like method (see above). For each of (N−M) sequences to be added, distances to the M sequences are computed to locate the position of the sequence in the tree, followed by the building of an alignment of (M + 1) sequences. Then, a full MSA is built from the (N−M) MSAs. The latter strategy is useful when the new sequences do not overlap with each other (as in the case of fragmentary sequences) and when the phylogenetic relationship between new sequences is not necessary to consider. There are several other tools, such as hmmalign [25], PaPaRa [26] and PAGAN [27], to add sequences to an existing MSA.

Figure 2 Variants of --add option.

Note that the length of the resulting MSA can differ from that of the original MSA. This is because additional gaps are necessary when new sequences have insertions. All-gap sites, if any, in the original MSA are deleted. As such changes in length are not useful in some cases, we have implemented a new option, --keeplength, in which (1) insertions in the new sequences are deleted and (2) all-gap sites in the original MSA are reinserted as shown at the right end in Figure 2. This option is selectable in the online version and sometimes useful for mapping new sequences to a reference MSA.
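The first effect of --keeplength described above, deleting insertions carried by the new sequences, can be illustrated with a toy Python sketch. It is a simplification and not MAFFT's code: the helper name `keep_length` is hypothetical, the input is assumed to be an already enlarged MSA given as equal-length strings with "-" as the gap character, and the second effect (reinserting all-gap sites of the original MSA) is not modeled.

```python
def keep_length(original_rows, added_rows):
    """Drop every column in which all original sequences have a gap.

    original_rows: rows of the pre-existing sequences in the enlarged MSA.
    added_rows:    rows of the newly added sequences in the same MSA.
    Columns that are gaps in every original row exist only because a new
    sequence has an insertion there, so removing them restores the
    original column count (toy model of effect (1) only).
    """
    if not original_rows:
        raise ValueError("need at least one original sequence")
    width = len(original_rows[0])
    if any(len(row) != width for row in list(original_rows) + list(added_rows)):
        raise ValueError("all rows must have the same length")

    kept = [col for col in range(width)
            if any(row[col] != "-" for row in original_rows)]

    def trim(row):
        return "".join(row[col] for col in kept)

    return [trim(r) for r in original_rows], [trim(r) for r in added_rows]


if __name__ == "__main__":
    # The added sequence has a two-residue insertion (TT) that forced
    # two gap columns into both original sequences.
    originals, added = keep_length(["AC--GT", "AC--GA"], ["ACTTGT"])
    print(originals, added)
```

The trade-off is visible in the toy data: the reference columns are preserved, but the residues inserted in the new sequence are lost, which is why the option suits mapping onto a reference MSA rather than building a complete alignment.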
Interactive sequence choice and visualization

We now have access to huge amounts of sequence data from widely divergent organisms, but the quality of the data is not always high because of the limitations of sequencing technologies. In the case of amino acid sequence data, the difficulty in eukaryotic gene prediction [28–30] also results in errors in data. It might be possible to automatically exclude such problematic data in certain cases, but sometimes, biologically important information is in low-quality sequences, especially when interest is in nonmodel organisms. For such cases, it is necessary to manually choose sequences, but this is becoming difficult because of increasing data size. Therefore, an interactive tool to help this process is necessary.

Our service has some functions for this purpose as explained in Kuraku et al. [31]. Sequences can be selected/unselected one by one in the sequence selection window (Figure 3B). Moreover, a group of sequences in a single phylogenetic cluster can be selected or unselected in a tree viewer. If you click on a node in a tree (Figure 3A), the descendant sequences under the node are selected or unselected together in the list of sequences (Figure 3B). Automated tools for sequence selection, such as CD-HIT [32] and MaxAlign [33], can also run on our service. The selected sequences are subjected to phylogenetic tree inference using the neighbor-joining method [34] or UPGMA [35] with several options, such as distance measure and the number of bootstrap cycles (Figure 3C). Then, the data set can be further refined using the new tree. The maximum-likelihood method is not supported because of the high computational costs. It must be performed locally or using other online services.

Figure 3 Interactive sequence selection. A group of sequences in guide tree (A) is selected at a time in sequence selection window (B). Several options for tree estimation can be selected (C). MSA can be visually checked using MSAViewer (D).

Two tree viewers, Phylo.io [36] and Archaeopteryx [37], are used for sequence selection and visualization of phylogenetic trees. Originally, we used the Archaeopteryx Java plugin, but modern browsers no longer support Java plugins for security reasons. Thus, we recently adopted Phylo.io, which is written in JavaScript and works with most modern browsers. With the addition of Phylo.io to our service, we have added some new features: (i) coloring of sequence titles corresponding to the databases in aLeaves [29]; (ii) interactive sequence selection (see above); and (iii) automatic rooting similar to mid-point rooting. This rooting is just for visualization, without any biological basis. To estimate the position of the root, an outgroup or other additional information is necessary. A JavaScript version of Archaeopteryx is being developed (C. Zmasek, personal communication), and we are planning to use this viewer, too.

To visualize MSAs, two tools, Jalview [38] (as a Java plugin) and MSAViewer [39] (written in JavaScript; Figure 3D), are available on our service.

Necessity of large MSAs

The relationship between alignment accuracy and data size is still unclear. It is naively expected that the accuracy of an MSA is improved with the number of input sequences. However, highly accurate methods cannot be applied to large data because of computational costs. Useful information related to this issue has recently been reported by Le et al. [18]. In their tests, the accuracy of downstream analysis (protein secondary structure prediction in this case) is improved with the increase of sequences for medium-scale data (<1000 sequences), but with more sequences, the accuracy reaches a sort of plateau.
Thus, there may be an optimal data size. Their test also suggested that the accuracy of MSA itself hits a maximum point at a smaller number of sequences (around 200) and that the accuracy of MSA decreases with an increase in the number of sequences. This observation is consistent with Sievers et al. [40]. Such optimal data sizes can differ for different problems. For example, in the case of prediction of contact residues based on co-evolution, larger MSAs are generally thought to be necessary [41, 42].

Command-line options

Each method also runs locally. In the current version (7.310; August 2017), the corresponding commands are as follows:

PartTree (Figure 1A)
    mafft --parttree --partsize 1000 input > output

DPPartTree (Figure 1A)
    mafft --dpparttree --partsize 1000 input > output

FFT-NS-1 (Figure 1B)
    mafft --retree 1 input > output
    mafft --retree 1 --memsavetree input > output (low-memory mode)
    mafft --retree 1 --thread -1 input > output (multithread mode)

With --thread -1, the number of physical cores is automatically counted and all cores are used. See http://mafft.cbrc.jp/alignment/software/multithreading.html for detailed information on multithreading.

FFT-NS-2 (Figure 1D)
    mafft input > output
    mafft --memsavetree input > output (low-memory mode)
    mafft --thread -1 input > output (multithread mode)

mafft-sparsecore (Figure 1E)
    mafft-sparsecore.rb -p p -n n -s s -i input > output
    mafft-sparsecore.rb -p p -n n -s s -A "--memsavetree" -i input > output (low-memory mode)
    mafft-sparsecore.rb -p p -n n -s s -A "--thread -1" -C "--thread -1" -i input > output (multithread mode)

p and n are as explained above, and s is seed for random numbers. Flags for the iterative refinement stage and those for the progressive stage can be specified after -C and -A, respectively. See http://mafft.cbrc.jp/alignment/software/sparsecore.html for detailed information.
G-INS-1 (Figure 1F)
    mafft --globalpair input > output
    mafft --globalpair --thread -1 input > output (multithread mode)

Pileup (Figure 1G)
    mafft --pileup input > output

Random chain (Figure 1G)
    mafft --randomchain --randomseed s input > output

s is seed for random numbers.

Adding new sequences to an MSA
    mafft --add newSequences existingMSA > output
    mafft --addfull newSequences existingMSA > output
    mafft --addlong newSequences existingMSA > output
    mafft --addfragments newSequences existingMSA > output

The --keeplength flag can be added to each command (see above). Add --thread -1 to enable multithreading.

Key Points

MSA is an important step in phylogeny inference, functional prediction and many other analyses.
The demand for MSAs with a large number of sequences is increasing.
MAFFT has different options for computing large MSAs in both the local and online versions.
The online version has additional features for preprocessing and postprocessing MSAs.

Acknowledgement

The authors thank Daron M. Standley, RIMD, Osaka University, for inspiring discussion.

Funding

Japan Society for the Promotion of Science (JSPS) KAKENHI (grant number JP16K07464) and the Platform Project for Supporting Drug Discovery and Life Science Research from Japan Agency for Medical Research and Development (AMED).

Kazutaka Katoh is an associate professor in the Department of Genome Informatics, Research Institute for Microbial Diseases, Osaka University. His research interests are in bioinformatics and molecular evolution.

John Rozewicki is a researcher in the Department of Genome Informatics, Research Institute for Microbial Diseases, Osaka University. His research focus is designing systems to conduct bioinformatics research using high-performance computing and networked server infrastructure.

Kazunori D. Yamada is an assistant professor in graduate school of information sciences, Tohoku University. His research interests include development of amino acid sequence alignment methods and machine learning.
Research Institute for Microbial Diseases, Osaka University, conducts basic research in the areas of infectious disease, immunology and oncology.

References

1. Katoh K, Misawa K, Kuma K, et al. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002;30:3059–66.
2. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 2013;30(4):772–80.
3. Fox G, Sievers F, Higgins DG. Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments. Bioinformatics 2016;32(6):814–20.
4. Sievers F, Wilm A, Dineen D, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 2011;7:539.
5. Mirarab S, Warnow T. FastSP: linear time calculation of alignment accuracy. Bioinformatics 2011;27:3250–8.
6. Katoh K, Toh H. PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics 2007;23:372–4.
7. Higgins DG, Sharp PM. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 1988;73:237–44.
8. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970;48:443–53.
9. Hogeweg P, Hesper B. The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J Mol Evol 1984;20:175–86.
10. Feng DF, Doolittle RF. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 1987;25:351–60.
11. Dayhoff MO, Schwartz RM, Orcutt BC. A model of evolutionary change in proteins. In: Dayhoff MO, Ech RV (eds), Atlas of Protein Sequence and Structure. Maryland: National Biomedical Research Foundation; 1978, 345–52.
12. Yamada KD, Tomii K, Katoh K. Application of the mafft sequence alignment program to large data-reexamination of the usefulness of chained guide trees. Bioinformatics 2016;32(21):3246–51.
13. Boyce K, Sievers F, Higgins DG. Simple chained guide trees give high-quality protein multiple sequence alignments. Proc Natl Acad Sci USA 2014;111:10556–61.
14. Barton GJ, Sternberg MJ. A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. J Mol Biol 1987;198:327–37.
15. Berger MP, Munson PJ. A novel randomized iterative strategy for aligning multiple protein sequences. Comput Appl Biosci 1991;7:479–84.
16. Gotoh O. Optimal alignment between groups of sequences and its application to multiple sequence alignment. Comput Appl Biosci 1993;9:361–70.
17. Katoh K, Frith MC. Adding unaligned sequences into an existing alignment using MAFFT and LAST. Bioinformatics 2012;28:3144–6.
18. Le Q, Sievers F, Higgins DG. Protein multiple sequence alignment benchmarking through secondary structure prediction. Bioinformatics 2017;33(9):1331–7.
19. Notredame C, Holm L, Higgins DG. COFFEE: an objective function for multiple sequence alignments. Bioinformatics 1998;14:407–22.
20. Sievers F, Hughes GM, Higgins DG. Systematic exploration of guide-tree topology effects for small protein alignments. BMC Bioinformatics 2014;15:338.
21. Tan G, Gil M, Löytynoja AP, et al. Simple chained guide trees give poorer multiple sequence alignments than inferred trees in simulation and phylogenetic benchmarks. Proc Natl Acad Sci USA 2015;112:E99–100.
22. Nguyen NPD, Mirarab S, Kumar K, et al. Ultra-large alignments using phylogeny-aware profiles. Genome Biol 2015;16:124.
23. Blackshields G, Sievers F, Shi W, et al. Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol Biol 2010;5:21.
24. Mirarab S, Nguyen N, Guo S, et al. PASTA: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J Comput Biol 2015;22(5):377–86.
25. Finn RD, Clements J, Eddy SR. Hmmer web server: interactive sequence similarity searching. Nucleic Acids Res 2011;39:W29–37.
26. Berger SA, Stamatakis A. Aligning short reads to reference alignments and trees. Bioinformatics 2011;27:2068–75.
27. Löytynoja A, Vilella AJ, Goldman N. Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm. Bioinformatics 2012;28:1684–91.
28. Gotoh O, Morita M, Nelson DR. Assessment and refinement of eukaryotic gene structure prediction with gene-structure-aware multiple protein sequence alignment. BMC Bioinformatics 2014;15:189.
29. Nagy A, Patthy L. MisPred: a resource for identification of erroneous protein sequences in public databases. Database 2013;2013:bat053.
30. Yandell M, Ence D. A beginner’s guide to eukaryotic genome annotation. Nat Rev Genet 2012;13:329–42.
31. Kuraku S, Zmasek CM, Nishimura O, et al. aLeaves facilitates on-demand exploration of metazoan gene family trees on mafft sequence alignment server with enhanced interactivity. Nucleic Acids Res 2013;41:W22–8.
32. Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 2001;17(3):282–3.
33. Gouveia-Oliveira R, Sackett PW, Pedersen AG. MaxAlign: maximizing usable data in an alignment. BMC Bioinformatics 2007;8:312.
34. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987;4(4):406–25.
35. Sokal RR, Michener CD. A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin 1958;28:1409–38.
36. Robinson O, Dylus D, Dessimoz C. Phylo.io: interactive viewing and comparison of large phylogenetic trees on the web. Mol Biol Evol 2016;33(8):2163–6.
37. Han MV, Zmasek CM. phyloXML: XML for evolutionary biology and comparative genomics. BMC Bioinformatics 2009;10:356.
38. Waterhouse AM, Procter JB, Martin DM, et al. Jalview version 2–a multiple sequence alignment editor and analysis workbench. Bioinformatics 2009;25(9):1189–91.
39. Yachdav G, Wilzbach S, Rauscher B, et al. MSAViewer: interactive JavaScript visualization of multiple sequence alignments. Bioinformatics 2016;32(22):3501–3.
40. Sievers F, Dineen D, Wilm A, et al. Making automated multiple alignments of very large numbers of protein sequences. Bioinformatics 2013;29(8):989–95.
41. Kamisetty H, Ovchinnikov S, Baker D. Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci USA 2013;110:15674–9.
42. Marks DS, Hopf TA, Sander C. Protein structure prediction from sequence variation. Nat Biotechnol 2012;30(11):1072–80.

© The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
journal article
LitStream Collection
Towards dynamic genome-scale models

Gilbert, David; Heiner, Monika; Jayaweera, Yasoda; Rohr, Christian

2019 Briefings in Bioinformatics

doi: 10.1093/bib/bbx096 pmid: 29040409

Abstract The analysis of the dynamic behaviour of genome-scale models of metabolism (GEMs) currently presents considerable challenges because of the difficulties of simulating such large and complex networks. Bacterial GEMs can comprise about 5000 reactions and metabolites, and encode a huge variety of growth conditions; such models cannot be used without sophisticated tool support. This article is intended to aid modellers, both specialist and non-specialist in computerized methods, to identify and apply a suitable combination of tools for the dynamic behaviour analysis of large-scale metabolic designs. We describe a methodology and related workflow based on publicly available tools to profile and analyse whole-genome-scale biochemical models. We use an efficient approximative stochastic simulation method to overcome problems associated with the dynamic simulation of GEMs. In addition, we apply simulative model checking using temporal logic property libraries, clustering and data analysis, over time series of reaction rates and metabolite concentrations. We extend this to consider the evolution of reaction-oriented properties of subnets over time, including dead subnets and functional subsystems. This enables the generation of abstract views of the behaviour of these models, which can be large—up to whole genome in size—and therefore impractical to analyse informally by eye. We demonstrate our methodology by applying it to a reduced model of the whole-genome metabolism of Escherichia coli K-12 under different growth conditions. The overall context of our work is in the area of model-based design methods for metabolic engineering and synthetic biology. 
Keywords: whole-genome-scale metabolic models, formal analysis, scalability, approximative stochastic simulation, model checking, reaction profiling, clustering, data analytics, delta leaping, subsystems behaviour, model-based design

Introduction

Currently, models of biochemical networks which can be simulated dynamically are limited in size; for instance, dynamic models for bacterial whole-genome metabolism are the exception [1, 2], with the lack of kinetic data usually given as the main reason. In addition, the dynamic simulation of such large and complex models is currently a bottleneck, presenting considerable difficulties both for stochastic and deterministic methods. These two difficulties together impede progress in the development of dynamic models of these large systems. Even when these models can be simulated to generate dynamic behaviour, there is a further challenge to analyse the large amount of data produced. Bacterial genome-scale models of metabolism (GEMs) can comprise about 5000 reactions and metabolites, and encode a huge variety of growth conditions; such models are far too large and complex in terms of their structure and number of observables (metabolite concentrations and reaction rates) to be checked without sophisticated tool support. Our approach is intended to facilitate in the long term the development of dynamic whole-genome-scale models, by providing means to simulate and analyse these models. It can already be applied to explore properties, which are not dependent on specific kinetic parameters such as dead subnets, and furthermore be used for fine-tuning of kinetic parameters via optimization. In this article, we describe a methodology and related workflow comprising methods and associated software tools to explore the dynamic behaviour of genome-scale metabolic models, and providing a guidance framework for modellers, either specialist or non-specialist in computerized methods.
Our focus here is on analysis and exploration of already constructed models, rather than on the construction process itself. Typically, currently available large genome-scale models are designed to be analysed by constraint-based methods, such as flux balance analysis (FBA), flux variability analysis, etc., which explore steady-state behaviour [3]. We wish to complement these approaches by considering dynamic behaviour, which is required, for example, if the transient behaviour of microorganisms has to be described under changing external conditions, or when cell growth is crucial for metabolic engineering. Our methodology integrates the static analysis of graph properties and dynamic stochastic simulation, within a Petri net setting, exploiting a rich set of associated tools. This requires some model preparation before simulation followed by analysis of the generated time series traces. In this article, we use stochastic simulation based on δ-leaping [4], a recently introduced approach that permits the efficient simulation of these GEMs, enabling the observation of new and complex behaviours. Our approach is general, and not bound to Petri net representations. We demonstrate the use of our techniques to explore model behaviours under different growth conditions, with a focus on the reaction perspective related to functionally meaningful subsystems (pathways). Other application scenarios include the comparison of model versions arrived at by model development, manual curation, variant exploration (knockout, etc.), production target-based optimization (e.g. within the framework of synthetic biology or metabolic engineering) across populations and over generations for evolutionary approaches. The intention is that our techniques will aid the exploration and understanding of large models, and comparison between model versions and configurations. These methods may also support the modification of such models as part of the design process in synthetic biology. 
Novel techniques that we use include simulative model checking using libraries of properties and derived network variables (observers) for reaction behaviour, as well as analysis of the enlarging/shrinking of property-induced subnets over time with respect to functional subsystem or network location, possibly connected with structural network properties. Our methodology involves behaviour-based network decomposition using clustering and model checking, and abstraction over the data based on functional grouping. We illustrate our methods by considering metabolic models based on the whole genome of Escherichia coli K-12, and use a reduced version as running example.

Outline

In the following section, we introduce the kind of models that we consider and our running example. Next, we give an overview of our workflow and a detailed presentation of our core methods. We conclude with a brief summary and outlook on future work.

Materials

Models. The models that we consider in this article comprise networks of biochemical reactions, which are often exchanged via the Systems Biology Markup Language (SBML) [5]. This article builds on experience gained while working with a set of 55 public domain models of whole-genome metabolism [6] of various E. coli and Shigella strains available in SBML [7]. We use a subset of the information contained in these models comprising: compartments, e.g. cytosol, periplasm and extracellular space; metabolites (species), given by their names, initialization values (set to zero), compartment membership and whether they act as inputs or outputs, called boundary conditions; and reactions, given by their names, substrates, products and related stoichiometry, subsystem membership, reversibility information and flux bounds. Kinetic information is not included in the original GEMs. These public domain models incorporate the ability to represent different growth conditions by use of the boundary conditions.
Individual boundary conditions are switched on or off in a Boolean manner by allowing/disabling inflow of particular metabolites. This is achieved by modifying the exchange reactions, which model the transport between the external environment and internal compartments by setting flux values to maximum or zero, corresponding closely to the laboratory experimental protocols used. We replicate this in our approach, although of course we could modulate rates in a continuous manner. The growth conditions include a minimal growth medium based on M9 [8] comprising 25 metabolites, plus a set of additional possible nutritional sources whose exchange reactions are deactivated by default. These public domain models also provide subsystem information. A subsystem is uniquely defined by a set of functionally related reactions, which belong uniquely to one subsystem only, whereas metabolites may be shared between subsystems, acting as communication channels between them, or they may be uniquely members of only one subsystem. A biochemical pathway is a subsystem with a recognized biological function, for example glycolysis or the citric acid cycle. There is no SBML tag to indicate the membership of reactions to subsystems, but this information has been added by the original modellers using a ‘notes’ annotation, which we assume to be correct. Biochemical reaction networks can be considered as classical bipartite graphs, with the two distinct types of nodes representing reactions and metabolites. Hence, they can be immediately encoded as Petri nets, and thus their analysis can benefit from the rich set of Petri net techniques and tools, not considered further in this article, covering both qualitative and (stochastic and deterministic) quantitative aspects. We deploy standard terminology of Petri net theory; see [9] for a general introduction, or the Supplementary Data for this article. Systems biology and Petri nets use related terms, e.g. 
reactions/transitions and metabolites/places, which we use interchangeably; see Table 1 for a quick reference, and Figure 1 for an introductory example. Small networks of this kind can then be composed together to form arbitrarily complex networks.

Table 1. Analogies between Petri nets and systems biology

Petri nets             Systems biology
Place                  Metabolite
Transition             Reaction
Arc weight             Stoichiometry
Tokens                 Mass
Token number           Concentration (a)
Marking                State
(Firing) rate          Flux
Incidence matrix       Stoichiometric matrix (b)
P-invariant            Mass conserving subnet
Minimal T-invariant    Elementary mode, extreme pathway (b)

(a) Up to obvious normalization. (b) Up to reversible reactions.

Figure 1. Petri net building block for mass action equation describing basic enzymatic reaction.
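The analogies in Table 1 can be made concrete with a minimal sketch. It assumes that the basic enzymatic reaction of Figure 1 has the usual form E + S <-> ES -> E + P, with the reversible binding step split into two transitions; the place and transition names, and all function names, are hypothetical and chosen only for illustration. The sketch builds the incidence (stoichiometric) matrix, fires transitions on a marking and checks whether a weighting of places is a P-invariant.

```python
# Hypothetical encoding of an enzymatic building block E + S <-> ES -> E + P.
# Places correspond to metabolites, transitions to reactions (Table 1).
PLACES = ["E", "S", "ES", "P"]

# Each transition maps to (consumed, produced) arc weights, i.e. stoichiometry.
TRANSITIONS = {
    "bind": ({"E": 1, "S": 1}, {"ES": 1}),
    "unbind": ({"ES": 1}, {"E": 1, "S": 1}),
    "convert": ({"ES": 1}, {"E": 1, "P": 1}),
}


def incidence_matrix(places, transitions):
    """Rows are places, columns are transitions; entry = produced - consumed."""
    return [
        [post.get(p, 0) - pre.get(p, 0) for pre, post in transitions.values()]
        for p in places
    ]


def is_enabled(marking, name):
    """A transition is enabled if every input place holds enough tokens."""
    pre, _post = TRANSITIONS[name]
    return all(marking[p] >= w for p, w in pre.items())


def fire(marking, name):
    """Return the marking reached by firing one enabled transition."""
    if not is_enabled(marking, name):
        raise ValueError(f"transition {name!r} is not enabled")
    pre, post = TRANSITIONS[name]
    new_marking = dict(marking)
    for p, w in pre.items():
        new_marking[p] -= w
    for p, w in post.items():
        new_marking[p] += w
    return new_marking


def is_p_invariant(weights, matrix):
    """weights holds one entry per place; invariant iff weights^T * C = 0."""
    n_columns = len(matrix[0])
    return all(
        sum(weights[i] * matrix[i][j] for i in range(len(weights))) == 0
        for j in range(n_columns)
    )


if __name__ == "__main__":
    C = incidence_matrix(PLACES, TRANSITIONS)
    print(C)
    print(is_p_invariant([1, 0, 1, 0], C))  # total enzyme, E + ES
    print(fire({"E": 1, "S": 2, "ES": 0, "P": 0}, "bind"))
```

In this toy net the weighting that selects E and ES is a P-invariant, reflecting that the total amount of enzyme is conserved, whereas a weighting over E and S alone is not.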
Converting an SBML model into a Petri net is done with the Petri net editor and simulator Snoopy [10], and involves two adjustments. As required for any discrete modelling approach, reversible reactions are modelled by two opposite transitions representing the two directions in which a reversible reaction can occur. Metabolites that have been declared as boundary conditions are associated with additional source and sink transitions called boundary transitions (Figure 2). This transforms a place-bordered net into a transition-bordered net, if all boundary places (i.e. source/sink places) have been declared as boundary conditions.

Figure 2. Petri net representation of reduced E. coli K-12 GEM according to [12, 14]; layout generated with Snoopy [10]. Colour code: green: generated boundary transitions, blue: reversible reactions, red: generated reverse direction for reversible reactions, yellow: P-invariants, pink: source places corresponding to the extra growth conditions, which we switch on in enhanced-growth.

A boundary condition is a metabolite whose concentration is assumed to be constant despite any production or consumption by the network, by assuming appropriate in/outflow from/to the environment. To transform a network model generated for constraint-based computation (exploring steady-state behaviour) into a general dynamic model that permits transient behaviour, boundary conditions can basically be treated in two ways. In the first approach, the boundary conditions are kept constant by the simulator.
This is achieved for deterministic simulation by not generating ordinary differential equations for the boundary metabolites. In contrast in the stochastic case, because the simulator operates directly on the network, these conditions have to be flagged to be specially treated during simulation. The second approach is to generate explicit in/outflow transitions for each boundary metabolite, which maps the assumption underlying constraint-based approaches into a corresponding net structure, making the transformed network available for analysis and simulation by general purpose algorithms that do not need to be aware of boundary conditions. We prefer to use the second approach because we wish to be able to apply either continuous or stochastic simulation as appropriate. Models can be interchangeably represented using SBML or Petri nets. We distinguish between strain-specific models, which may be closely related in their metabolic core while differing otherwise; model configurations—modelling the effect of different growth conditions on an individual strain; model variants—describing the effect of genetic modifications, which can act as designs in synthetic biology. In addition, model versions are created by the correction of modelling errors. For all such models, the overall dynamic behaviour is expected to be a flow of metabolites through the network between the boundary conditions. Regarding metabolic networks, we are interested in maintaining flow through the network, and ensuring that all reactions and metabolites are sometimes active under the conditions for maximal growth that the model encodes. A typical bacterium widely used both in modelling and experimentation is the K-12 strain of E. coli, which has >4000 genes, of which some 1400 are involved in metabolism [6]. 
When taking into account the compartmental structure of the organism (periplasm; cytosol; extracellular space), this results in a model comprising around 3000 reactions yielding about 4000 Petri net transitions, and the about 1200 unique metabolites yield about 2300 metabolites (Petri net places) respecting the compartmental structure. Although we can simulate and analyse models of this size, we have chosen to use a smaller example for illustration purposes in this article because it is easier for the reader to reproduce the results. See ‘Discussion and conclusions’ section for a discussion of how our techniques scale up to these large models. Running example. We use a reduced version of the whole-genome metabolism of E. coli K-12 as a running example, which has been developed by Orth [11] to illustrate the basic structure of metabolic networks and their use in metabolic engineering (Figure 2). The reduction was originally done by hand [12] based on an early version of a GEM for E. coli K-12 [13] and subsequently used for comparison with the results of an automated procedure [14]. This model contains 94 reactions of which 46 are reversible, divided into 10 subsystems, and 92 metabolites of which 20 are boundary conditions; the corresponding Petri net model comprises 180 transitions (94 + 46 + 2 ⋅ 20 boundary transitions) and 92 places. The reduced model also contains a subset of the growth conditions of the full model, seven of which comprise the cut-down version of M9 used in the reduced model and are all activated by default, including oxygen resulting in an aerobic environment (CO2, H+, H2O, D-glucose, ammonium, phosphate, O2). We considered the model under different growth conditions, and for simplicity, we report here two cases: the default seven minimal aerobic growth conditions (min-growth), and a version (enhanced-growth), which additionally included four deactivated sources that we selected to turn on (L-malate; L-glutamine; fumarate; D-fructose). 
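The two conversion adjustments described above (splitting reversible reactions and adding a source and a sink transition for each boundary metabolite) can be sketched as follows. This is an illustrative sketch only, not the Snoopy implementation: the `Reaction` class, the function names, the `_rev`, `in_` and `out_` naming scheme and the toy network are all assumptions made for this example. The last helper reproduces the transition count quoted for the running example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reaction:
    """Hypothetical, minimal stand-in for the reaction data taken from SBML."""
    name: str
    substrates: dict = field(default_factory=dict)  # metabolite -> stoichiometry
    products: dict = field(default_factory=dict)
    reversible: bool = False


def to_petri_net(reactions, boundary_metabolites):
    """Return transitions as name -> (pre, post) dictionaries of arc weights."""
    transitions = {}
    for r in reactions:
        transitions[r.name] = (dict(r.substrates), dict(r.products))
        if r.reversible:
            # A reversible reaction becomes two opposite transitions.
            transitions[r.name + "_rev"] = (dict(r.products), dict(r.substrates))
    for m in sorted(boundary_metabolites):
        # Boundary transitions: explicit inflow and outflow for the metabolite.
        transitions["in_" + m] = ({}, {m: 1})
        transitions["out_" + m] = ({m: 1}, {})
    return transitions


def expected_transition_count(n_reactions, n_reversible, n_boundary):
    """Reactions + reverse directions + two boundary transitions each."""
    return n_reactions + n_reversible + 2 * n_boundary


if __name__ == "__main__":
    # Toy network with made-up names: A_ext <-> A -> B, with A_ext and B
    # declared as boundary conditions.
    toy = [
        Reaction("uptake", {"A_ext": 1}, {"A": 1}, reversible=True),
        Reaction("convert", {"A": 1}, {"B": 1}),
    ]
    net = to_petri_net(toy, {"A_ext", "B"})
    print(sorted(net))
    # Running example: 94 reactions, 46 reversible, 20 boundary conditions.
    print(expected_transition_count(94, 46, 20))
```

Because the boundary handling is encoded in the net structure itself, a general-purpose simulator can be applied to the resulting transitions without knowing which metabolites were boundary conditions.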
Our expectation was that more metabolic reactions in the enhanced-growth model would be active compared with the min-growth model based on the assumption that there are pathways that are associated with specific nutritional sources.

Methods

In this section, we describe the elements of our workflow, which starting from an SBML model enables behaviour analysis over dynamic simulation traces; an overview is given in Figure 3.

Figure 3. Workflow, exploiting the following tools: 1: Snoopy [10], 2: graph editing routines in GNU Prolog [15], 3: Charlie [16], 4: Marcie [17], 5: MC2 [18] and 6: R package [19].

Model preparation

There are several steps required in this stage, including three which involve graph-based static analysis (3, 4 and 5).

(1) Rectifying typos: Even if models pass a standard SBML check (sbml.org/validator), there may still be errors such as the incorrect naming of metabolites, usually by mixing them up, for example by confusing metabolites with similar formulae (e.g. H for H2O). Although these are hard to detect, there are published approaches to achieve this, for example [20], and these are not dealt with further in this article.

(2) Converting SBML into internal format: We first need to ensure that the information regarding reversibility and boundary conditions conforms to the SBML standard regarding the use of the appropriate tags, which then allows the use of a standard SBML converter to the required internal format.

(3) Identifying the main connected subgraph and deleting all others: Models may contain disconnected subgraphs, possibly because of gaps in knowledge. Disconnected subgraphs cannot influence each other, and thus can be considered separately.
In the case that the model contains one dominating large component, we remove any trivial small isolated subgraphs before simulation. Structurally disconnected subnets may still be dynamically connected, for instance through modifiers in the context of kinetics. However, we make use of only those parts of SBML that typically occur in constraint-based models; these do not contain initialization of the metabolites or kinetic information. Thus, disconnected subnets that are dynamically connected never occur in the models that we consider. In general, disconnected components can be an indication of gaps in the model; gap filling is a large area in its own right (see [3] and references therein) and is beyond the scope of this article. (4) Rectifying source and sink places: Boundary places that have not been declared as boundary conditions can arise from modelling gaps or errors; this excludes those source places associated with growth conditions. Such places contradict the expected basic behaviour of sustained flow through the network under steady-state conditions. The ideal solution is to fill the gaps; otherwise, possible solutions are to declare them as boundary conditions, to delete them optionally with their associated reactions or to reverse one of their immediately neighbouring reactions. Constraint-based methods implicitly ignore reactions, which can never contribute to steady-state behaviour. We take a more conservative approach, so that we do not generate a network specialized to one specific growth condition. (5) Initializing metabolite concentrations: For dynamic simulation, we need to find appropriate initialization values for all metabolites. In the case of metabolic networks, all initialization values for metabolite concentrations are typically set to zero because the publicly available GEMs have been developed for FBA analysis, which does not require them. 
Options are to assign uniform values, impose some ratio over metabolite categories, e.g. [2], or extract specific values from the literature. Either all metabolites can be initialized to some non-zero value, or just the minimal set identified by P-invariant analysis (mass conservation). A P-invariant defines a set of places, which induce a subnet conserving metabolic mass, i.e. the total mass of the metabolites in the subnet is constant. The (minimal) P-invariants of a network are all those (minimal) sets of metabolites that contribute to the mass conservation of that network. Owing to this conservation behaviour, P-invariants need to be initialized with non-zero mass for dynamic simulation; otherwise, all reactions with a substrate belonging to a mass-conserving subnet would never be able to occur. The P-invariants provide the minimal set of initialized metabolites required to obtain activity in the entire network, and we use it in this work because this makes it easier to distinguish between active and never-active subnets. Initializing all metabolites, which is the common practice, makes liveness analysis more cumbersome because non-activity will take longer to emerge. (6) Adding kinetic rates: All reactions need to have an associated kinetic rate to permit dynamic simulation. Kinetic rates are typically state dependent and are described by kinetic laws, the most basic of which is mass action. The mass action law is defined by the mathematical product of the concentrations of the substrates of a reaction and a reaction-specific kinetic rate constant (also called a kinetic parameter). Note that kinetic rates and kinetic rate constants are related, but different notions; the former generally vary over time, while the latter are always constant. Other laws are approximative kinetic abstractions, Michaelis–Menten [21] or linlog [22], which however are in general not suitable for faithful stochastic simulation [23]. 
Thus, we elect to use the basic mass-action law here as a universal illustration. Because FBA analysis does not require these kinetic rates, the kinetic laws and associated rate constants are typically not specified in GEMs. In this case, we assign mass action laws to reactions, and some default rate constants. Options are to set rate constants as arbitrary uniform values (e.g. 1), or uniform for different categories (transport, etc.), or via literature or based on FBA values (however, there will be different steady-state rates under different target conditions). This implies a constant rate for inflow boundary conditions (as they do not have any substrate). Running example. We encountered one typo, which we corrected, involving the reactions R_EX_h2o_e and R_EX_h_e, where the metabolites for water and hydrogen as products were incorrectly exchanged. Next, we added explicit SBML tags for the boundary conditions, identified by the metabolite naming convention, and adjusted the reversibility tag to be consistent with flux bounds. There is only one connected subgraph and no unintentional source or sink places. There are five P-invariants involving 12 metabolites, computed with Charlie [16] (Supplementary Data B). For simplicity, we initialized all metabolites belonging to a P-invariant with the same value. In the same way as for the large models, the running example did not contain reaction rate information, and we added mass-action kinetics with parameter 1 by default to all reactions. We used this model to test our techniques for behaviour analysis, which we report below; it will be the subject of future work to obtain differential rates by curation or optimization. A particular issue of interest in this respect is the evolution of behavioural properties of networks and their constituent subnets, with respect to reactions as well as metabolites over time. Scalability. 
The only step in the preparation phase, which can be time-consuming, is the computation of the P-invariants because it is known that in the worst case, there can be exponentially many in terms of the net size. However, in practice, for these networks, the time required is manageable: computation of P-invariants for the running example on a standard desktop computer requires <1 min, and 15 min for the full-size E. coli K-12 GEM to detect 17 P-invariants involving 39 metabolites. Model simulation GEMs typically have an infinite state space, which precludes the use of exact analysis methods that build on an exhaustive description of the state space [24]. An obvious choice is thus dynamic simulation, i.e. the generation of a representative (finite) set of finite traces through the infinite state space. Simulation efficiency. These large systems are highly stiff in nature, which causes severe numerical problems for continuous simulation of the set of ordinary differential equations induced by a biochemical reaction network [2] and unacceptably long runtimes for stochastic simulation algorithms (SSAs). For example, Gillespie’s direct method [25] of the GEM for E. coli K-12 takes in the order of 90 min for 1 run of 1000 time points, or 62.5 days for 1000 runs on a single-core workstation; these figures increase by about 50% for τ-leaping SSA [26]. Discrete-time δ-leaping [4] is a method that can be efficiently applied even to large models, typically taking <1 s for 1 run or close to 14 min for 1000 runs for a GEM. It converts the underlying continuous-time Markov chain into an equivalent discrete-time Markov chain and improves the efficiency via discrete-time leaps, even though this results in an approximate simulation method. In SSA, the firing frequency depends solely on the rates, while in δ-leaping, it is a binomially distributed random variable. 
This means that for SSA, reactions with lower rates occur less frequently than reactions with higher rates; reactions with low rates (rare events) occur very infrequently, and are thus hard to observe. In principle, this holds for δ-leaping too, but δ-leaping is much less sensitive to large differences in reaction rates (stiff models) in terms of runtime, so it is able to perform more simulation runs, and hence report more observations, than SSA in the same execution time. The discrete-time leap method is able to reproduce the dynamic behaviour (including the occurrence of switches, oscillations and tipping points) of a stochastic model comparable with the Gillespie algorithm as long as the following condition is fulfilled for all transitions t:
\[ 0 \le \frac{h_t(m)}{ed_t(m)} \cdot \delta \le 1, \]
where \( h_t(m) \) stands for the transition’s rate and \( ed_t(m) \) is its enabledness degree. A violation of this condition would not lead to negative values or incorrect markings (states), i.e. in all cases, a marking is reached that is part of the model’s state space. However, the temporal behaviour of the model, simulated with δ-leaping, would no longer coincide with the behaviour of exact SSAs. This may have two possible causes. First, the model’s timescale is smaller than the chosen δ, i.e. reducing δ would yield correct results. Secondly, some transitions’ rate functions are not scaled correctly, i.e. stochastic reaction rates have to be scaled with respect to their reaction order [27]. The discrete-time leap method is supported by Snoopy and Marcie [17], both of which are publicly available at http://www-dssz.informatik.tu-cottbus.de/DSSZ/Software/Software. Types of traces. Traces are time series reporting the current variable values at the n + 1 time points \( \tau_i \) of a specified output grid, \( i = 0, \dots, n \), typically splitting the simulation time into n equally sized time intervals. We consider two types of traces: Traces of metabolite concentrations, i.e. 
time series of the current concentration of the metabolites at the specified time points of the simulation run: \( s(\tau_i) : P \to \mathbb{N} \), with \( s(\tau_0) = m_0 \); that is, the \( s(\tau_i) \) are place vectors over all metabolite concentrations, indexed by the set of places P. Traces of reaction rates, i.e. time series of the number of occurrences that each individual reaction had in total in the latest time interval: \( v(\tau_i) : T \to \mathbb{N} \), with \( v(\tau_0) = 0 \); that is, the \( v(\tau_i) \) are transition vectors over all reaction rates, indexed by the set of transitions T. Reading a reaction rate vector as a Parikh vector immediately leads us to the state equation specifying the relation between both traces:
\[ s(\tau_i) = s(\tau_{i-1}) + C \cdot v(\tau_i), \quad i = 1, \dots, n, \]
where C is the incidence matrix of a Petri net (Supplementary Data A). Thus, the metabolite trace can be derived from the rate trace, but generally not vice versa. In the stochastic setting, the reaction trace cannot be uniquely deduced from the metabolite trace because of alternative and parallel reactions, which specifically holds for individual traces. Therefore, we directly record the reaction traces during simulation. Often, we consider averaged traces to reduce stochastic noise, even though the average of a set of stochastic traces is itself stochastic (except in the case that the number of traces is infinite). Thus, the individual values at each time point are non-negative real numbers instead of natural numbers. When model checking, rare events are more obvious in an averaged trace than in single traces. Meta model. To facilitate simulating a model under various conditions, we have implemented a meta model, which takes advantage of in-built parameter selection in the simulation engines that we use, Marcie and Snoopy. This enables us to use one model, which can be configured for simulation under different conditions, rather than a set of model variants, one per condition (e.g. aerobic/non-aerobic, min-growth/max-growth, typos-fixed/typos-not-fixed). 
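The state equation relating the two trace types can be illustrated on a toy net. This is a minimal sketch in plain Python, not the implementation used by Snoopy or Marcie; the function name `apply_state_equation` and the two-place example net are assumptions introduced purely for illustration.

```python
def apply_state_equation(m0, incidence, rate_trace):
    """Derive the metabolite trace from a reaction trace.

    m0         -- initial marking, one entry per place (metabolite)
    incidence  -- incidence matrix C as a list of rows (one row per place,
                  one column per transition)
    rate_trace -- list of vectors v(tau_1), ..., v(tau_n), each holding the
                  number of occurrences per transition in the latest interval
    Returns the list s(tau_0), ..., s(tau_n).
    """
    trace = [list(m0)]
    for v in rate_trace:
        previous = trace[-1]
        # s(tau_i) = s(tau_{i-1}) + C . v(tau_i), computed row by row
        current = [
            previous[p] + sum(c * x for c, x in zip(row, v))
            for p, row in enumerate(incidence)
        ]
        trace.append(current)
    return trace


# Hypothetical toy net: transition t0 converts A into B, t1 converts B into A.
C = [[-1, 1],   # row for place A
     [1, -1]]   # row for place B
m0 = [5, 0]
v_trace = [[2, 0], [1, 1], [0, 2]]

s_trace = apply_state_equation(m0, C, v_trace)
print(s_trace)
```

In the second interval of this toy trace, both transitions occur once and the marking does not change, so the same metabolite trace would also result from no activity at all. This is one way to see why the reaction trace generally cannot be recovered from the metabolite trace.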
The meta model approach eliminates the danger of typographical errors, which can creep in when variants of a model are created. See Supplementary Data C for more details. Running example. The analysis techniques discussed in the following were applied to δ-leaping simulation traces averaged over 1000 runs for 1000 time points to reduce the effects of biological noise, but they could also be applied to any kind of stochastic or continuous traces. The window of 1000 time steps has been determined pragmatically, based on a few initial exploratory simulations; there will always be a need to make a decision about the length of the simulation. We consider the two versions of the running example, min-growth and enhanced-growth, focusing on the differences in their behaviours. Scalability. We use δ-leaping on the running example because of its scalability up to (unreduced) GEMs. The size of the generated average trace file from simulating the GEM of E. coli K-12 is 19 MB for reactions and 10 MB for metabolites, compared with 600 and 300 KB, respectively, for the running example. These traces are used in the model checking step for analysing transient behaviour. Simulation times on a standard desktop computer (four cores and eight threads) were 7 s for the running example (1000 runs, 1000 time points and observations, initialized with 100 tokens), and 184 s with the same conditions for the E. coli K-12 GEM. Simulative model checking Basic principles. Model checking permits us to determine if a model fulfils given properties specified in temporal logics, e.g. probabilistic linear-time logic, PLTL [18]; see examples below. In this research, we have used simulative model checking over time series traces of metabolite and reaction behaviours. 
In principle, this could be done over individual runs generated by a stochastic model, yielding probabilistic results, or over one trace averaging the individual runs; although this trace is still stochastic, model checking it will return a Boolean result instead of a probability. In this article, we use the latter approach, which technically belongs to a non-probabilistic subset of PLTL. For consistency with the model checker MC2 [18] that we used, we give the properties in PLTL format, with the results belonging to the set {0, 1} rather than being in the range [0, 1]. Model checking generally requires that the properties of interest are known, often motivated by observations in the wet lab, e.g. the concentration of metabolite A is always above a certain threshold, let us say 10: P≥1[ G(A>10) ]. If one is not so sure about an appropriate threshold, one could use the established concept of a free variable $x [28, 29], which ranges over all possible values, to determine the probability distribution of the threshold, so that A fulfils the property: P≥1[ G(A>$x) ]. In our setting, we do not know (yet) the PLTL properties that certain observables (metabolites, reactions) are supposed to exhibit, which brings us to a new scenario, where we wish to ask: Which variables fulfil a given property pattern? This is expressed as: P≥1[ G(«y»>10) ], with «y» being a meta variable ranging over all entities (metabolites, reactions) in the model. Properties of interest. We assume that there are generally desired behavioural characteristics. Liveness is a well-established notion for reactions (transitions) in Petri net theory. A reaction is live if for any point in time it exhibits future activity, i.e. it occurs. We extend this notion to metabolites (places): a metabolite is live if for any point in time at least one of the reactions involving the metabolite (as substrate or product) is live. A reaction network is reaction-live if all reactions in the network are live. 
Likewise, a network is metabolite-live if all metabolites in the network are live. If a network is reaction-live, then it is metabolite-live, however, generally not vice versa. Thus, metabolite-liveness is less strict than reaction-liveness for networks. A reaction is forever dead from a point in time after which it never exhibits activity. A metabolite is forever dead from a point in time after which all reactions involving the metabolite are dead. The notions of liveness (respectively, deadness) above are defined over infinite traces. In our case, because we are using simulation generating finite traces, we need corresponding notions over time windows, and this is what we call (reaction or metabolite) activity. A reaction is dead over a period of time from t1 to t2 if it does not exhibit any activity from t1 to t2. Similarly, a reaction is active from t1 to t2 if it exhibits activity at some point between t1 and t2. A metabolite is dead over a period of time from t1 to t2 if all of the reactions involving the metabolite are dead in the given time window, and a metabolite is active over a period of time from t1 to t2 if at least one of the reactions involving the metabolite is active in the time window. Property libraries. A library of appropriate property patterns now allows us to categorize all observables into (not necessarily disjunctive) sets fulfilling the individual property patterns. We compiled two such libraries of properties for reaction and metabolite behaviour, which are provided in Supplementary Data D accompanied by descriptions in natural language. The properties were derived from our extensive experience in model checking such models. We have previously used an automated approach using machine learning over sets of examples [30]. In the current case, the properties are so general that an automated approach is not fruitful, as it relies on the selection strategy for the example set. 
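The window-based notions of activity and deadness defined above translate directly into checks over finite traces. The following is a minimal sketch, assuming that a reaction trace is a list of occurrence counts indexed by time point and that the window from t1 to t2 is inclusive; all function and variable names are hypothetical and do not correspond to MC2 or to the property libraries.

```python
def reaction_active(trace, t1, t2):
    """A reaction is active from t1 to t2 if it shows activity at some point."""
    return any(value != 0 for value in trace[t1:t2 + 1])


def reaction_dead(trace, t1, t2):
    """A reaction is dead from t1 to t2 if it shows no activity at all."""
    return not reaction_active(trace, t1, t2)


def metabolite_active(metabolite, involved, rate_traces, t1, t2):
    """A metabolite is active if at least one reaction involving it is active."""
    return any(
        reaction_active(rate_traces[r], t1, t2) for r in involved[metabolite]
    )


def metabolite_dead(metabolite, involved, rate_traces, t1, t2):
    """A metabolite is dead if all reactions involving it are dead."""
    return not metabolite_active(metabolite, involved, rate_traces, t1, t2)


# Hypothetical toy data: two reactions observed at six time points.
rate_traces = {
    "r1": [0, 2, 1, 0, 0, 0],
    "r2": [0, 0, 0, 0, 0, 0],
}
# Which reactions involve each metabolite (as substrate or product).
involved = {"A": ["r1", "r2"], "B": ["r2"]}

print(reaction_active(rate_traces["r1"], 0, 2))           # early window
print(reaction_dead(rate_traces["r1"], 3, 5))             # late window
print(metabolite_dead("B", involved, rate_traces, 0, 5))  # whole trace
```

Because the checks are parameterized by the window, the same functions can be applied to the whole trace or to sub-intervals of it.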
Of course, the current property libraries could be enhanced by automatically derived properties, or hand-crafted properties specific to the set of models under consideration. One desired behavioural characteristic that we are interested in is that under conditions for maximal growth, all reactions and metabolites are active over the period of the trace. In the following, we give two examples from the reaction library, which define dead behaviour. Note the use of the meta variable «x», which ranges over all reactions. (1) Never active reactions, i.e. always dead reactions: P≥1[ G(«x»=0) ], which is equivalent to P≥1[ ¬F(«x»≠0) ]. (2) Reactions with changing activity and finally a steady state of zero activity (d denotes the differential operator): P≥1[ F(d(«x»)≠0) ∧ F(G(«x»=0 ∧ d(«x»)=0)) ]. Running example. Model checking the min-growth and enhanced-growth models, we found that the number of reactions fulfilling property (1) reduced from 13 to 0, and for property (2) from 94 to 4, confirming our earlier expectation that more metabolic reactions in the enhanced-growth model would be active compared with the min-growth model. A closer analysis revealed that these four are rare events. In the ‘Whole system data analytics’ section, we introduce time blocks to better distinguish between rare and zero events. Subnets. We distinguish two categories: property subnets are defined by sets of entities sharing a certain temporal logical property, the composition of which can vary over time, unlike functional subnets (subsystems), which are statically defined by sets of reactions contributing to the same biological function. In this article, we focus on a specific class of property subnets, called dead subnets, which exhibit no activity from the current time point onwards. The existence of such subnets can be an indication of a modelling fault, missing information in the network structure (e.g. 
gaps because of unidentified genes), or unused parts of the network because of the set of environment conditions imposed (e.g. the growth conditions). There are two classes of dead subnets: reaction dead subnets and metabolite dead subnets. A reaction dead subnet over a period of time from t1 to t2 is induced by reactions (transitions), which are dead over that period of time; it includes those reactions and their substrates and products. At least one of the substrates has to be dead, but not all the substrates and products are necessarily dead because they can be involved in alternative pathways. A metabolite dead subnet over a period of time from t1 to t2 is induced by metabolites (places), which are dead over that period of time; all reactions involving the metabolites are dead in that time window. A metabolite that, from some point in time onwards, is always in a steady state (a constant concentration of zero or above) can be so because either (i) its production matches its consumption, and these rates are non-zero, or (ii) it is neither produced nor consumed. We are interested in the latter category, as such metabolites induce dead subnets. A reaction that is in a steady state can be so because either (i) it has a steady non-zero activity, and is ‘ticking over’, or (ii) it is non-active, i.e. with zero activity. We are particularly interested in the latter category because reactions with zero activity belong to dead subnets, and we can directly monitor/observe reaction activity over time. This justifies our focus on reaction dead subnets, because otherwise we could not distinguish between the two cases of zero and non-zero activity. Running example. Figure 4 shows the development of the active subnet over time for the min-growth and enhanced-growth versions. 
Note that in the min-growth model, the active subnets initially increase in extent but then decrease, whereas in the enhanced-growth model, the active subnets increase over time until they cover the entire network, possibly suggesting that more than the minimal medium is required to maintain metabolic activity according to this model.
Figure 4 Dynamic simulation of the min-growth model (left column), and enhanced-growth model (right column); shown are snapshots at three time points (the beginning, middle and end of the simulation), with the active reactions highlighted in blue. Snoopy was used both to generate the layouts automatically and to colour the reactions according to the activity identified over the reaction traces in each time window. These results clearly show that the model predicts under enhanced-growth conditions that metabolic reaction activity increases over the time of simulation, while under min-growth conditions, the reaction activity decreases after an initial peak. This temporal aspect of the transient behaviour can only be observed using dynamic simulation.
Scalability. The library comprises 53 properties for metabolites and 80 properties for reactions, 133 properties in total. 
For the running example (92 metabolites, 194 reactions), the MC2 model checker [18] requires 7.5 s to process the reaction library, 1.6 s for the metabolite library and 8 s for both libraries on a standard desktop computer. For the E. coli K-12 GEM (2133 metabolites, 4162 reactions), the time required is 2 min 3 s for reactions, 24 s for metabolites and 2 min 34 s for both libraries. Checking a model over all properties in both libraries is achieved using a script. In the following sections, we introduce methods to explore the evolution of dead subnets, including the use of observers, i.e. derived variables defined over the model variables, for example the total number of dead reactions. Whole system data analytics Data analytics is a well-established field of research; it can be applied to large data sets to identify trends, which may be buried under the huge amount of data. These techniques provide complementary insights, which are otherwise not obvious from the visual inspection or model checking over time series; this is because of the size and complexity of the models in terms of the number of reactions and metabolites. Data analytics comprises a large number and huge variety of techniques. In the following, we discuss our general approach and a selection of techniques that we found most applicable in our scenario, which are then applied to our running example. (1) Clustering is a learning technique, often unsupervised, for partitioning a set of entities, so that the entities in the same partition (cluster) are more similar to each other than to those in other partitions. This technique can be used to hierarchically cluster the reaction rate traces as well as metabolite traces, using, for example, Euclidean distance; see [31] for a survey of clustering techniques for time series data. Distances can be calculated over the raw data, or over derived values, e.g. derivatives, which obviously might yield different clustering results. 
We choose to illustrate clustering over raw stochastic data in this article; otherwise, smoothing would be required to compute meaningful derivatives. (2) Time division blocks over traces and evolution of properties. To better investigate the evolution of properties over time, we can introduce a level of abstraction by dividing the simulation time into (equal) blocks. Each block can be seen as a mini time series, and treated accordingly, e.g. by model checking or clustering. Alternatively, values for the reaction and metabolite activity can be summed or averaged within each block. Analysis of these variables within the blocks can be achieved using a variety of visualization methods, e.g. scatter plots, density plots and bar charts. These are supported by the statistical package R [19]; the ggplot2 [32] package can be used to generate plots. Running example. For the purpose of the following quantitative analyses, we have combined forward and reverse directions of reversible reactions because we wish to focus on the total metabolic flow carried out through the reactions in the network. For this, we have computed the absolute value of the difference between the individual rates of the two directions. (1) Clustering. We used this technique to hierarchically cluster the reaction rate traces; see Figure 5 for the result for the min-growth model, and applied the average silhouette width measure [33] in the clValid R package [19] to identify the optimal partition of the data set.
Figure 5 Hierarchical clustering using Euclidean distance as dissimilarity measure of the min-growth model based on reaction activity traces; the dendrogram shows two distinctive clusters of behaviour identified using the average silhouette width measure [33] in the clValid R package [19], which are illustrated in the consecutive figures. 
(Left) A cluster of 40 reactions comprising all boundary reactions plus some exchange and transport reactions, which maintain their activity throughout the simulation because they do not rely on the flow of metabolites through the network. (Right) A cluster of 94 reactions, which all reach a steady-state value of < 0.02 occurrences per time unit early in the simulation because of the minimal growth environment.
This yielded two clusters: (i) one cluster of 40 reactions comprising all boundary reactions plus some exchange and transport reactions, which peak at a maximum of 0.7 before reaching a steady state at 0.5 and maintain their activity throughout the simulation because they do not rely on the flow of metabolites through the network, and (ii) a larger cluster of 94 reactions, with a few members showing early high activity peaks, and then all showing low activity after 400 time units because of the minimal growth environment. Note that these clusters would not be obtained easily using model checking because the stochastic nature of the traces makes the detection of peaks and general tendencies (e.g. ‘generally increasing’) difficult. 
For reasons of space, we omit the same analysis for the enhanced-growth model but do treat both versions below. (2) Time division blocks over traces and evolution of properties. For illustration, we take four blocks (i.e. quarters). Our analysis shows that in the min-growth model, the number of dead reactions increases over the time blocks, with by far the greatest increase between quarter 1 and quarter 2. In the enhanced-growth model, there is much less variation in activity in the quarters, with a slightly higher overall activity in quarter 2. See the ordered average reaction activity of all reactions in the scatter plot in Figure 6. We also found that there was a peak at activity value 0.5 for all quarters for both model versions. These are primarily the boundary transitions, responsible for the inflow and outflow of the network; see the density plot in Figure 7 showing how often the values of average reaction activity occur in the four time intervals. The density of low activity reactions in the min-growth model is virtually identical in all four quarters; only the first two quarters exhibit any dead reactions in the enhanced-growth model as metabolism starts up, although in general there is a progressive shift towards lower activity, accompanied by mid-range activity, as the system stabilizes. The bar charts in Figure 8 show how the number of reactions in each reaction category of zero, rare and non-rare varies between the time blocks.
Figure 6 Scatter plots of progression of average reaction activity over time blocks (quarters). (Left) Min-growth model, (Right) enhanced-growth model; in the min-growth model, the number of dead reactions increases over the time blocks, whereas in the enhanced-growth model, there is less variation in activity in the quarters. This is in accordance with the observations in Figure 4. 
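The time-block abstraction can be sketched in a few lines. This is a minimal illustration in plain Python rather than the R workflow used in the article: it combines the two directions of a reversible reaction as the absolute difference of their rates, averages the activity within equal blocks, and assigns the zero, rare and non-rare categories using the thresholds given in the caption of Figure 8. The function names and the toy trace are hypothetical, and trailing time points that do not fill a whole block are simply ignored here.

```python
def combine_directions(forward, reverse):
    """Net activity of a reversible reaction: |forward rate - reverse rate|."""
    return [abs(f - r) for f, r in zip(forward, reverse)]


def block_averages(trace, n_blocks=4):
    """Average activity within each of n_blocks equally sized time blocks."""
    size = len(trace) // n_blocks
    if size == 0:
        raise ValueError("trace is shorter than the number of blocks")
    return [
        sum(trace[b * size:(b + 1) * size]) / size for b in range(n_blocks)
    ]


def categorize(average, rare_threshold=0.01):
    """Categories: activity = 0; 0 < activity <= 0.01; activity > 0.01."""
    if average == 0:
        return "zero"
    if average <= rare_threshold:
        return "rare"
    return "non-rare"


# Hypothetical averaged traces of one reversible reaction at eight time points.
forward = [0.6, 0.5, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0]
reverse = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

net = combine_directions(forward, reverse)
categories = [categorize(a) for a in block_averages(net, n_blocks=4)]
print(categories)
```

Counting the categories per block over all reactions would give the kind of summary shown in the bar charts, while the block averages themselves are the values behind the scatter and density plots.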
Figure 7 Variation in density of average reaction activity over time blocks (quarters). (Left) Min-growth model, (Right) enhanced-growth model. In both versions, as the block time progresses, the activity of reactions gradually decreases. Zero values are not displayed because of log plotting. The results for the min-growth model show that the density of low activity reactions is virtually identical in all four quarters. In contrast, under the enhanced-growth conditions, there is a progressive shift towards lower activity over the four quarters. However, because of the log scale on the X-axis, these low activity reactions do not dominate the overall activity. In both cases, the peaks at 0.5 correspond to the exchange and transport reactions; see Figure 5(Left).
Figure 8 Comparison of variation in membership of reaction categories over time blocks (quarters). (Left) Min-growth model, (Right) enhanced-growth model. Reactions have been categorized into zero, rare and non-rare based on their average reaction activity (activity = 0; 0 < activity ≤ 0.01; activity > 0.01). The number of dead reactions in the min-growth model increases over the time blocks indicating the progression of deadness in the network, whereas the network remains alive in the enhanced-growth model.
Scalability. None of the algorithms involved causes any scalability issues. Subsystem data analytics Our general approach here comprises the following steps: (1) Identifying subsystems using SBML annotations. (2) Abstracting over reactions by representing subsystem behaviour by the averaged reaction activity of constituent reactions to give a set of derived variables. (3) Clustering the subsystems by their average behaviour hierarchically using Euclidean distance and discrete wavelet transform. We have found that dynamic time warping [34] is too inefficient to be used in practice, not completing after 48 h on a standard workstation. (4) Clustering the subsystems according to the degree of their structural inter-connectivity. 
For this, we defined a similarity metric for hierarchical clustering over two subsystems P and Q by

$$\frac{1}{n}\sum_{i=1}^{n}\frac{\min(|X_i|,|Y_i|)}{|X_i \cup Y_i|},$$

where n is the number of connected subgraphs in the network P ⋅ Q induced by the union of their reactions, and for each such connected subgraph i, $X_i$ is the set of reactions in P and $Y_i$ the set of reactions in Q, respectively. The basis of the measure is that two subsystems, which by their definition never overlap, can be connected by shared metabolites to form a connected subgraph induced by the union over their reactions. Note that in the case of two disconnected subsystems, the similarity metric yields zero because either $|X_i|$ or $|Y_i|$ is zero in each subgraph. Similarly, the highest possible value for similarity is 0.5 when two equal-size subsystems are being considered.
(5) Pairwise comparing the clusterings by behaviour and structural inter-connectivity using two well-established measures: (i) the Fowlkes–Mallows (FM) index [35] over the dendrograms produced by hierarchical clustering, implemented in the dendextend package in R [36], (ii) the Mantel test [37] over the dissimilarity matrices, implemented in the ade4 package in R [38], producing a correlation and corresponding P-value indicating the significance of the results.

Running example. To obtain a functional view of model behaviour for our running example, we investigated the effect on subsystem behaviour of exposing the organism to enhanced-growth conditions. As before, in the simulation traces that we consider below, we have combined the rates of the forward and reverse directions of reversible reactions.
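As an illustration of the structural similarity metric defined in step (4) of the general approach, the following minimal sketch computes it for two small subsystems. It assumes that each subsystem is given as a mapping from reaction identifiers to the set of metabolites the reaction touches, and that two reactions are adjacent when they share at least one metabolite; this data layout and all names are hypothetical, chosen only to make the definition concrete.

```python
def connected_components(reaction_metabolites):
    """Group reactions into connected components, linking reactions that share a metabolite."""
    reactions_by_metabolite = {}
    for reaction, metabolites in reaction_metabolites.items():
        for metabolite in metabolites:
            reactions_by_metabolite.setdefault(metabolite, set()).add(reaction)

    seen = set()
    components = []
    for start in reaction_metabolites:
        if start in seen:
            continue
        component = set()
        stack = [start]
        seen.add(start)
        while stack:  # depth-first traversal over shared metabolites
            reaction = stack.pop()
            component.add(reaction)
            for metabolite in reaction_metabolites[reaction]:
                for neighbour in reactions_by_metabolite[metabolite]:
                    if neighbour not in seen:
                        seen.add(neighbour)
                        stack.append(neighbour)
        components.append(component)
    return components


def subsystem_similarity(p, q):
    """Average over connected subgraphs of min(|X_i|, |Y_i|) / |X_i union Y_i|.

    p, q: dicts mapping reaction id -> set of metabolite ids; the reaction ids
          of p and q are assumed to be disjoint (subsystems never overlap).
    """
    union = {**p, **q}
    components = connected_components(union)
    if not components:
        return 0.0
    total = 0.0
    for component in components:
        x = {reaction for reaction in component if reaction in p}
        y = {reaction for reaction in component if reaction in q}
        total += min(len(x), len(y)) / len(x | y)
    return total / len(components)


# Toy subsystems (invented): P and Q are linked through metabolite "c".
subsystem_p = {"p1": {"a", "b"}, "p2": {"b", "c"}}
subsystem_q = {"q1": {"c", "d"}, "q2": {"d", "e"}}
print(subsystem_similarity(subsystem_p, subsystem_q))
```

For the toy input, all four reactions fall into a single connected subgraph with two reactions from each subsystem, which gives the maximum value of 0.5; two subsystems with no shared metabolite give zero, in line with the properties noted for the metric.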
(1) Subsystems identified from the SBML model with number of reactions (ignoring reversibility) in brackets: citric acid cycle (8), glutamate metabolism (4), glycolysis/gluconeogenesis (12), oxidative phosphorylation (9), pentose phosphate pathway (8), pyruvate metabolism (6), anaplerotic reactions (6), inorganic ion transport and metabolism (2), exchange (20), and extracellular transport (19). (2) Abstracting over reactions. The way in which the growth environment differentially affects subsystem behaviour is shown in Figures 1 and 2 in Supplementary Data E which plot the time series average activity of all reactions in a subsystem, for all subsystems under minimal and enhanced-growth conditions. In the case of all but one of the subsystems, the metabolic activity is greatly suppressed under min-growth compared with enhanced-growth conditions. The traces for inorganic ion transport and metabolism exhibit no effective difference between the two growth conditions, indicating that the two reactions involved (ammonia reversible transport and phosphate reversible transport via proton symport) make no contribution to the metabolic activity of the system according to the current model. (3) Clustering by subsystem average behaviour. Both Euclidean distance and discrete wavelet transform [39] yielded the same clusters, suggesting a robustness in the analysis for these data; we present the Euclidean distance version in Figure 9 for the enhanced-growth model and in Supplementary Data E and Figure 3A for the min-growth model. Ignoring inorganic ion transport and metabolism because of the finding in the previous step, the result clearly shows that exchange and oxidative phosphorylation are outliers in both conditions. This can be explained by the fact that oxidative phosphorylation generates ATP and thus powers, amongst other things, import/export systems. Also, the major pathways are grouped within one major cluster, together with extracellular transport. 
Figure 9. Hierarchical clustering of the subsystems in the enhanced-growth model based on averaged activity traces per subsystem using Euclidean distance. For more details, see Figure 3 in the Supplementary Data.

(4) Clustering according to subsystem structural inter-connectivity. As there was no difference between the min-growth and enhanced-growth model in terms of reactions comprising the subsystems, we have only computed one structural diagram for subsystem structural inter-connectivity (Supplementary Data E, Figure 3C). The clusters clearly show that the core metabolic subsystems are closely interconnected in terms of metabolites, for example glutamate metabolism and the pentose phosphate pathway are a closely related pair. Additionally, the externally oriented reactions are clustered apart from the core metabolic subsystems, with exchange and extracellular transport more closely related than with inorganic ion transport and metabolism, which itself is in between the externally oriented reactions and the metabolic core.

(5) Pairwise comparison of the clusterings by behaviour (both min-growth and enhanced-growth model) and structural inter-connectivity. All three dendrograms are shown in Supplementary Data E, Figure 3. The FM values, Mantel correlation and Mantel P-value are shown in Table 2.

Table 2. Three pairwise comparisons of clusterings, using the FM index [35] and the Mantel test [37]

Clustering 1   k   Clustering 2   k   FM     Correlation   P-value   Relatedness
Min            3   Enhanced       3   1.00   0.86          0.01      Related
Min            2   Structure      2   0.61   0.02          0.32      Not related
Enhanced       2   Structure      2   0.61   0.03          0.33      Not related

Note: k, cut value; Correlation and P-value, Mantel test.

Comparison of the min-growth and enhanced-growth dendrograms resulted in the highest FM index value of 1, which indicates that there is significant evidence that the two trees are similar. This conclusion is supported by the results of the Mantel test because the observed correlation of 0.86 and P-value of 0.01 with an associated cut-off alpha of 0.05 suggest that the matrix entries have a strong positive linear association. The low FM index values of 0.61 computed for behaviour against structure comparisons (min-growth – structure and enhanced-growth – structure) provide only weak evidence against the null hypothesis that the two trees are dissimilar. This is also supported by the results of the Mantel test, which has a low correlation of 0.02 and 0.03 for the two conditions, respectively, and corresponding higher P-values of >0.3 for both comparisons. In summary, both the FM index and Mantel test results suggest that for these subsystems, min-growth and enhanced-growth behaviours are related, while for both conditions, the behaviour is not significantly related to structure using the similarity metric defined above.

Scalability.
None of the algorithms involved causes any scalability issues. Discussion and conclusions The analysis of the dynamic behaviour of bacterial GEMs currently presents considerable challenges because of the difficulties of simulating such large and complex networks. Moreover, such models cannot be checked without sophisticated tool support. In this article, we have described a workflow comprising a set of methods and associated tools to analyse the dynamic behaviour of whole-genome bacterial metabolic models, illustrating these on a reduced version for ease of reproducibility and clarity. The workflow is applicable to full-scale GEMs, as illustrated by the discussion of scalability in the ‘Methods’ section. The focus of the running example is the analysis of two configurations of the reduced model, to compare the effects of growth conditions, with a special emphasis on reactions and functional subsystems. We introduced abstract views, which provide complementary insights, which are specifically useful for the differential behaviour analysis of sets of related models. In our example, this enables us to correlate the effect of environmental conditions with pathway activity. The ability to partition a whole-genome metabolic model into its constituent subsystems facilitates the exploration of the behaviour of a subsystem in isolation as well as in combination with other subsystems. We generally expect to obtain different behaviour and more meaningful insights when analysing pathways in context rather than in isolation; high connectivity can compound this effect. The models are currently not dynamically adaptive to environmental conditions in terms of changes in the expression of genes coding for enzymes, and hence changes in the concentration of the corresponding enzymes. However, the models do encode the reactions catalysed by the enzymes, i.e. 
products of gene regulation, for parts of the network that are activated because of environmental inputs, and hence in this sense do encode the relationship between environment, gene regulation and metabolic activity. The concentration of the enzymes is thus represented by the rate constants of the corresponding reactions. To incorporate dynamically changing availability of enzymes, the model would need to include dynamically changing rate constants, or even better explicitly model the enzymes, which is not done at present. The development of a model of such a complex system connecting gene expression to enzyme availability is an interesting topic, which has been explored by e.g. [40]; however, that work used constraint-based methods for the metabolic part, and Petri nets for gene regulation. Our techniques to explore GEMs through transient behaviour and structural characteristics gave us insights into the functional behaviour of the GEM. The metabolic functional pathways become inactive early on in the min-growth model, whereas all of them exhibited activity throughout the behaviour of the enhanced-growth model. Structural analysis showed that the core metabolic subsystems are closely interconnected in terms of metabolites, while the externally oriented reactions are remote from the core metabolic subsystems. Our techniques can be applied to large whole-genome-scale models, exploiting the scalability of our abstraction and approximation techniques. These are general approaches, and not bound to Petri net representations; note that some of what is done here with Snoopy or Marcie could be achieved with Copasi [41], as long as we use standard simulation algorithms. Moreover, the data analytics methods can be applied to any kind of simulation traces, not just approximative ones. The approach can be used to compare the behaviour of sets of related models, for example during model development, automatic repair, manual curation or target-based optimization. 
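To make the comparison of related models more concrete, the two measures used in step (5) of the subsystem analysis (and reported in Table 2) can be sketched in a few lines. This is a simplified, pure-Python illustration and not the dendextend or ade4 implementation used in the article: the FM index is computed here on flat cluster labels (as would result from cutting a dendrogram at some k), and the Mantel test is a one-sided permutation test on the Pearson correlation between the upper triangles of two dissimilarity matrices. All function names, the permutation count and the toy data are assumptions.

```python
import random
from itertools import combinations
from math import sqrt


def fowlkes_mallows(labels_a, labels_b):
    """FM index of two flat clusterings given as equal-length label lists."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            tp += 1
        elif same_a:
            fp += 1
        elif same_b:
            fn += 1
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def upper_triangle(matrix, order):
    """Upper-triangle entries of a square matrix after reordering rows and columns."""
    return [matrix[order[i]][order[j]] for i, j in combinations(range(len(order)), 2)]


def mantel(matrix_a, matrix_b, permutations=999, seed=0):
    """Return (correlation, one-sided permutation P-value) for two dissimilarity matrices."""
    identity = list(range(len(matrix_a)))
    reference = upper_triangle(matrix_a, identity)
    observed = pearson(reference, upper_triangle(matrix_b, identity))
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute rows and columns of matrix_b together
        if pearson(reference, upper_triangle(matrix_b, order)) >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (permutations + 1)
    return observed, p_value


# Toy data (invented): four items placed on a line at positions 0, 1, 3 and 7.
positions = [0, 1, 3, 7]
distances = [[abs(a - b) for b in positions] for a in positions]
print(fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0]))
print(mantel(distances, distances, permutations=199, seed=1))
```

Because the FM index only looks at which pairs of items are co-clustered, it is insensitive to how the clusters are labelled; the Mantel test complements it by working on the full dissimilarity matrices rather than on a single cut of the tree.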
Our overall longer-term goal is to build on our workflow to support the design process in synthetic biology and the sound implementation of valuable engineered organisms. Our tools are ready to be used for metabolic engineering as soon as we have more precise kinetic information available. The determination of these is a challenge, which we are currently addressing. In the course of this, we intend to incorporate constraint-based methods, for example to obtain steady-state fluxes for the derivation of kinetic rate constants, along the lines of [2, 42]. Model modification and configuration, and the behavioural analysis induced by these models, will play a crucial role in predicting the results of genetic engineering or forced evolution in the context of specific nutritional environments. Moreover, the dynamic approach will play a crucial role in the characterization and analysis of genome-scale signal transduction or gene regulatory networks as reliable models increasingly become available in the future. Key Points Simulation of dynamic behaviour of large genome-scale models. Analysis of dynamic behaviour of large genome-scale models using model checking and data analytics. Analysis of the behaviour of functional subsystems. Workflow building on public domain tools, and supporting methodology. Illustrated on non-trivial public domain running example. Funding MH was partially funded by the Brunel University London, College of Engineering, Design and Physical Sciences via the SEED fund to promote multi-disciplinary research. David Gilbert is a Professor of Computing in the Department of Computer Science, Brunel University London, UK. His research interests include systems biology: modelling and analysis of biological systems; synthetic biology: computational design of novel biological systems. Monika Heiner is a Professor of Computer Science in the Department of Computer Science, Brandenburg Technical University, Cottbus, Germany. 
Her research interests include modelling and analysis of technical as well as biological systems using qualitative and quantitative Petri nets, model checking and simulation techniques. Yasoda Jayaweera is a PhD student in the Department of Computer Science, Brunel University London, UK. Her research interests are the application of data analytic techniques in systems and synthetic biology. Christian Rohr has a PhD degree in simulative stochastic methods and is a Research Associate in the Department of Computer Science, Brandenburg Technical University, Cottbus, Germany. His research interests include the simulative analysis of stochastic Petri nets.

References
1 Karr JR, Sanghvi JC, Macklin DN, et al. A whole-cell computational model predicts phenotype from genotype. Cell 2012;150(2):389–401.
2 Smallbone K, Mendes P. Large-scale metabolic models: from reconstruction to differential equations. Industrial Biotechnology 2013;9(4):179–84.
3 Palsson BØ. Systems Biology: Constraint-Based Reconstruction and Analysis. Cambridge, UK: Cambridge University Press, 2015.
4 Rohr C. Simulative analysis of coloured extended stochastic Petri nets. PhD thesis, Department of Computer Science, Brandenburg Technical University, Cottbus, 2017.
5 Hucka M, Finney A, Sauro HM, et al. The Systems Biology Markup Language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19:524–31.
6 Monk JM, Charusanti P, Aziz RK, et al. Genome-scale metabolic reconstructions of multiple Escherichia coli strains highlight strain-specific adaptations to nutritional environments. Proc Natl Acad Sci USA 2013;110(50):20338–43.
7 King ZA, Lu J, Dräger A, et al. BiGG Models: a platform for integrating, standardizing and sharing genome-scale models. Nucleic Acids Res 2016;44(D1):D515–22.
8 Maniatis T, Fritsch EF, Sambrook J, Engel J. Molecular Cloning–A Laboratory Manual. New York, NY: Cold Spring Harbor Laboratory. 1982, 545 S., 1985.
9 Heiner M, Gilbert D, Donaldson R. Petri nets in systems and synthetic biology. In: Formal Methods for Computational Systems Biology, SFM 2008 (LNCS, Vol. 5016). Berlin-Heidelberg: Springer, 2008, 215–64.
10 Heiner M, Herajy M, Liu F, et al. Snoopy - a unifying Petri net tool. In: Proceedings of the International Conference on Application and Theory of Petri Nets and Concurrency (LNCS, Vol. 7347). Berlin-Heidelberg: Springer, 2012, 398–407.
11 Orth JD. Systems biology analysis of Escherichia coli for discovery and metabolic engineering. PhD thesis, University of California, San Diego, 2012.
12 Orth JD, Fleming RMT, Palsson BØ. Reconstruction and use of microbial metabolic networks: the core Escherichia coli metabolic model as an educational guide. EcoSal Plus 2010;4(1). doi: 10.1128/ecosalplus.10.2.1.
13 Orth JD, Conrad TM, Na J, et al. A comprehensive genome-scale reconstruction of Escherichia coli metabolism–2011. Mol Syst Biol 2011;7(535):535.
14 Erdrich P, Steuer R, Klamt S. An algorithm for the reduction of genome-scale metabolic network models to meaningful core models. BMC Syst Biol 2015;9(1):48.
15 Diaz D, Codognet P. The GNU Prolog system and its implementation. In: Proceedings of the 2000 ACM Symposium on Applied Computing, Vol. 2. New York, NY, USA: ACM, ACM Digital Library, 2000, 728–32.
16 Heiner M, Schwarick M, Wegener J. Charlie - an extensible Petri net analysis tool. In: Proceedings of the International Conference on Applications and Theory of Petri Nets and Concurrency (LNCS, Vol. 9115). Cham-Heidelberg-New York-Dordrecht-London: Springer, 2015, 200–211.
17 Heiner M, Rohr C, Schwarick M. MARCIE - model checking and reachability analysis done effiCIEntly. In: Proceedings of the International Conference on Applications and Theory of Petri Nets and Concurrency (LNCS, Vol. 7927). Berlin-Heidelberg: Springer, 2013, 389–99.
18 Donaldson R, Gilbert D. A model checking approach to the parameter estimation of biochemical pathways. In: Proceedings of the International Conference on Computational Methods in Systems Biology (LNCS/LNBI, Vol. 5307). Berlin-Heidelberg: Springer, 2008, 269–87.
19 R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2014, 3–36.
20 Chindelevitch L, Stanley S, Hung D, et al. MetaMerge: scaling up genome-scale metabolic reconstructions with application to Mycobacterium tuberculosis. Genome Biol 2012;13(1):r6.
21 Breitling R, Gilbert D, Heiner M, et al. A structured approach for the engineering of biochemical network models, illustrated for signalling pathways. Brief Bioinform 2008;9(5):404–21.
22 Heijnen JJ. Approximative kinetic formats used in metabolic network modeling. Biotechnol Bioeng 2005;91(5):534–45.
23 Sabouri-Ghomi M, Ciliberto A, Kar S, et al. Antagonism and bistability in protein interaction networks. J Theor Biol 2008;250(1):209–18.
24 Heiner M, Rohr C, Schwarick M, et al. A comparative study of stochastic analysis techniques. In: Proceedings of the 8th International Conference on Computational Methods in Systems Biology. New York, NY, USA: ACM, ACM Digital Library, 2010, 96–106.
25 Gillespie DT. A general method for numerically simulating the stochastic time evolution of coupled chemical species. J Comput Phys 1976;22:403–34.
26 Cao Y, Gillespie DT, Petzold LR. Adaptive explicit-implicit tau-leaping method with automatic tau selection. J Chem Phys 2007;126(22):224101.
27 Wilkinson DJ. Stochastic Modelling for Systems Biology, 1st edn. New York, NY: CRC Press, 2006.
28 Fages F, Rizk A. On the analysis of numerical data time series in temporal logic. In: Proceedings of the International Conference on Computational Methods in Systems Biology (LNCS/LNBI, Vol. 4695). Berlin-Heidelberg-New York: Springer, 2007, 48–63.
29 Donaldson R, Gilbert D. A model checking approach to the parameter estimation of biochemical pathways. In: Proceedings of the International Conference on Computational Methods in Systems Biology. Springer, 2008, 269–87.
30 Maccagnola D, Messina E, Gao Q, et al. A machine learning approach for generating temporal logic classifications of complex model behaviours. In: Proceedings of the Winter Simulation Conference (WSC). IEEE, 2012, 1–12.
31 Liao TW. Clustering of time series data - a survey. Pattern Recognition 2005;38(11):1857–74.
32 Wickham H. ggplot2: Elegant Graphics for Data Analysis. Dordrecht-Heidelberg-London-New York: Springer, 2009.
33 Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 1987;20:53–65.
34 Berndt DJ, Clifford J. Using dynamic time warping to find patterns in time series. In: AAAIWS'94 Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Technical Report WS-94-03. Seattle, WA: AAAI Press, 1994, 359–70.
35 Fowlkes EB, Mallows CL. A method for comparing two hierarchical clusterings. J Am Stat Assoc 1983;78(383):553–69.
36 Galili T. dendextend: Extending R's Dendrogram Functionality, 2015.
37 Mantel N. The detection of disease clustering and a generalized regression approach. Cancer Res 1967;27(2 Pt 1):209–20.
38 Chessel D, Dufour AB, Thioulouse J. The ade4 package - I: one-table methods. R News 2004;4(1):5–10.
39 Chaovalit P, Gangopadhyay A, Karabatis G, et al. Discrete wavelet transform-based time series analysis and mining. ACM Comput Surv 2011;43(2):6.
40 Fisher CP, Plant NJ, Moore JB, et al. QSSPN: dynamic simulation of molecular interaction networks describing gene regulation, signalling and whole-cell metabolism in human cells. Bioinformatics 2013;29:3181–90.
41 Hoops S, Sahle S, Gauges R, et al. Copasi - a complex pathway simulator. Bioinformatics 2006;22(24):3067–74.
42 Machado CD, Costa RS, Rocha M, et al. Model transformation of metabolic networks using a Petri net based framework. In: International Workshop on Biological Processes & Petri Nets (BioPPN 2010). Universidade do Minho, Portugal, 2010, 101–15. http://hdl.handle.net/1822/16761.

© The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
journal article
LitStream Collection
Computational tools for plant small RNA detection and categorization

Morgado, Lionel; Johannes, Frank

2019 Briefings in Bioinformatics

doi: 10.1093/bib/bbx136 pmid: 29059285

Abstract Small RNAs (sRNAs) are important short-length molecules with regulatory functions essential for plant development and plasticity. High-throughput sequencing of total sRNA populations has revealed that the largest share of sRNA remains uncategorized. To better understand the role of sRNA-mediated cellular regulation, it is necessary to create accurate and comprehensive catalogues of sRNA and their sequence features, a task that currently relies on nontrivial bioinformatic approaches. Although a large number of computational tools have been developed to predict features of sRNA sequences, these tools are mostly dedicated to microRNAs and none integrates the functionalities necessary to describe units from all sRNA pathways thus far discovered in plants. Here, we review the different classes of sRNA found in plants and describe available bioinformatics tools that can help in their detection and categorization. small RNA categorization, sRNA structural features, sRNA function prediction, sRNA sequencing Introduction Over the past few years, the scientific community has centered efforts to unravel the complex world of RNA molecules that are not translated into a protein, but that rather have a regulatory function in the cell [1]. Such regulatory RNAs are involved in the control of the concentration of messenger RNA (mRNA) and comprise, among other subclasses, the small RNAs (sRNAs). sRNAs have been shown to have key regulatory functions in development, response to biotic and abiotic stressors, genome stability and transposon control [2]. With the advent of next-generation sequencing of small RNA (sRNA-seq), it has become feasible to survey entire sRNA populations from diverse plant species, cell types, developmental time-points or from different experimental treatments. 
The identification and classification of sRNA from such high-throughput data is a nontrivial computational task, as plants can produce millions of sRNA from diverse pathways, which are collectively captured in a single sequencing experiment. Accurate sorting of sRNA by class requires categorizing sRNA according to their precursors, structural properties of the mature molecules, as well as functional aspects, such as their potential target sites (Figure 1). A number of computational tools have been developed to detect known sRNA in newly synthesized sequencing libraries, and to help in the identification of novel candidates. For many biologists, a key bottleneck in in silico sRNA analysis is finding software that is tailored to their specific research question and to the data type at hand. In this manuscript, we provide an inventory of various computational tools for the identification and categorization of plant sRNA. We start by describing a simplified classification scheme based on structural and functional sRNA properties, which is adopted in subsequent sections to organize currently available computational tools.

Figure 1. A stratified classification scheme for sRNA in plants.

sRNAs in plants

Plant sRNA biology has been extensively reviewed elsewhere [3, 4]. Briefly, in plants, sRNAs are mostly 21–24 nucleotides (nt) in length, and result from cleavage of double-stranded RNA substrates by dicer-like (DCL) enzymes. The RNA substrates, themselves, can originate either from a single-stranded RNA precursor with a stem-loop conformation, or from a double helix. If sRNAs originate from a hairpin structure, they are referred to as hairpin-derived sRNA (hpsRNA), and if they originate from a double helix, they are referred to as small interfering RNA (siRNA).
An hpsRNA can be further classified as a microRNA (miRNA) if the hairpin is processed in such a way that it produces only one or a few functional units. siRNAs comprise all other classes of known sRNA: secondary siRNA such as trans-acting (ta)-siRNA, natural antisense transcript (nat)-siRNA and heterochromatic (hc)-siRNA. In the case of secondary siRNA, two nonmutually exclusive groups can be defined: phased siRNA, which originates from a precursor that is processed in a precise and sequential manner; and ta-siRNA, which is a plant-specific sRNA type with targets located in trans. In the case of nat-siRNA, the precursor double helix is derived from overlapping RNA segments produced independently of each other, while secondary siRNA and hc-siRNA precursors result from the action of an RNA-dependent RNA polymerase (RDR) on single-stranded RNA. Considering the physical distance between NAT-producing loci, two main categories emerge: cis-NAT and trans-NAT. cis-NATs are transcribed from the same genomic loci but typically from opposite DNA strands and thus form perfect pairs, while trans-NATs are transcribed from distant genomic locations.
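The stratified scheme described above can be read as a small decision procedure. The sketch below encodes it under simplifying assumptions: the evidence fields are hypothetical names for properties that a real pipeline would have to infer from sequencing data, and sRNAs from an RDR-derived duplex without phasing or trans-target evidence are left as "secondary siRNA or hc-siRNA" because this sketch uses structural evidence only.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical summary of what is known about one sRNA and its precursor."""
    precursor: str                                     # "hairpin" or "duplex"
    few_functional_units: bool = False                 # hairpin yields one or a few units
    independent_overlapping_transcripts: bool = False  # NAT-type duplex
    phased: bool = False                               # precise, sequential processing
    targets_in_trans: bool = False                     # ta-siRNA-like targeting


def classify(evidence):
    """Return the list of class labels, from most general to most specific."""
    if evidence.precursor == "hairpin":
        labels = ["hpsRNA"]
        if evidence.few_functional_units:
            labels.append("miRNA")
        return labels

    if evidence.precursor == "duplex":
        labels = ["siRNA"]
        if evidence.independent_overlapping_transcripts:
            labels.append("nat-siRNA")
            return labels
        # Otherwise the duplex is assumed to be RDR-derived.
        if evidence.phased or evidence.targets_in_trans:
            labels.append("secondary siRNA")
            if evidence.phased:
                labels.append("phased siRNA")
            if evidence.targets_in_trans:
                labels.append("ta-siRNA")  # not mutually exclusive with phased
        else:
            labels.append("secondary siRNA or hc-siRNA")
        return labels

    raise ValueError(f"unknown precursor type: {evidence.precursor!r}")


print(classify(Evidence("hairpin", few_functional_units=True)))
print(classify(Evidence("duplex", phased=True, targets_in_trans=True)))
```

Returning a list of labels rather than a single class keeps the nested structure of the scheme visible, and allows the phased and ta-siRNA groups to co-occur.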
Cis-NAT overlapping regions do not have a characteristic length and can occur in five orientations [5]:
Head-to-head: the two transcripts overlap at their 5′ ends.
Tail-to-tail: the two transcripts overlap at their 3′ ends.
Completely overlapping: the entire length of one transcript overlaps a transcript on the opposite strand of the genome.
Nearby head-to-head: the two transcripts do not overlap, but the 5′ end of one transcript is near the 5′ end of the other transcript in the genome.
Nearby tail-to-tail: the two transcripts do not overlap, but the 3′ end of one transcript is near the 3′ end of the other transcript in the genome.

To become active in plants, sRNAs must load into Argonaute (AGO) proteins, which guide silencing complexes to their targets according to sequence pairing principles. When associated with AGO, sRNA can regulate genomes at the transcriptional (TS) or posttranscriptional (PTS) level depending on the specific AGO to which the sRNA binds. Both modes of action have been intensively studied, but PTS mechanisms such as mRNA cleavage and translation inhibition are better understood. PTS is typically observed for miRNA, secondary siRNA and nat-siRNA, while TS is often associated with the action of hc-siRNA. Functional siRNA characterization is key to identifying hc-siRNA, as no clear structural features to discriminate between hc-siRNA and other siRNA have been defined to date.

Computational approaches for sRNA detection and categorization

Software for sRNA categorization is typically designed to deal with sequences as input. Some software tools identify precursors in segments with a length of dozens or even hundreds of nucleotides, while others focus on short mature fragments of about 20 nt, and there are also platforms that combine information from both forms. A comprehensive overview of tools is given in the following sections, and further detailed in Table 1.
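The five cis-NAT orientations listed earlier in this section can be derived from transcript coordinates alone. The following minimal sketch assumes two transcripts on opposite strands of the same chromosome, with start < end in genomic coordinates, so that the 5′ end of a plus-strand transcript is its start and the 5′ end of a minus-strand transcript is its end. The `Transcript` type and the `max_gap` parameter (the distance below which non-overlapping transcripts count as "nearby") are hypothetical; the source gives no distance threshold, and the default of 100 nt is arbitrary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcript:
    start: int   # leftmost genomic coordinate
    end: int     # rightmost genomic coordinate (start < end)
    strand: str  # "+" or "-"


def cis_nat_orientation(a: Transcript, b: Transcript, max_gap: int = 100) -> Optional[str]:
    """Classify a pair of opposite-strand transcripts; return None if they do not qualify."""
    if {a.strand, b.strand} != {"+", "-"}:
        return None  # this sketch only considers pairs on opposite strands
    plus, minus = (a, b) if a.strand == "+" else (b, a)

    overlapping = plus.start <= minus.end and minus.start <= plus.end
    if overlapping:
        plus_inside = minus.start <= plus.start and plus.end <= minus.end
        minus_inside = plus.start <= minus.start and minus.end <= plus.end
        if plus_inside or minus_inside:
            return "completely overlapping"
        # Partial overlap: the minus transcript lying to the left means the
        # overlap contains both 5' ends; lying to the right, both 3' ends.
        if minus.start < plus.start:
            return "head-to-head"
        return "tail-to-tail"

    if minus.end < plus.start:  # 5' ends face each other across the gap
        gap = plus.start - minus.end
        return "nearby head-to-head" if gap <= max_gap else None
    gap = minus.start - plus.end  # 3' ends face each other across the gap
    return "nearby tail-to-tail" if gap <= max_gap else None


plus_transcript = Transcript(100, 200, "+")
for minus_transcript in (
    Transcript(50, 150, "-"),
    Transcript(150, 250, "-"),
    Transcript(120, 180, "-"),
    Transcript(20, 60, "-"),
    Transcript(250, 300, "-"),
):
    print(minus_transcript, cis_nat_orientation(plus_transcript, minus_transcript))
```

Normalizing the pair to a plus and a minus transcript first keeps the case analysis short, since each orientation then reduces to a comparison of two coordinates.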
Depending on the application, sRNA analysis can be complex, often involving preliminary steps such as data preprocessing and quality controls, as well as downstream analysis such as gene ontology annotation and pathway discovery. A number of recent computational platforms have tried to integrate various software modules by stringing together existing tools into single analysis pipelines [6–12]. The description of modules that do not directly deal with sRNA identification and categorization is outside of the scope of this document and will not be discussed here.

Table 1 Main features of computational tools for sRNA characterization

Columns: # | Tool | Type | Focus | Input | Analysis | Year | Reference
Type subcolumns: Local | Web server
Analysis subcolumns: Precursor | Mature | Target
Analysis features, in column order: Conservation | Structure | Dicing | Machine learning | Conservation | Isomir detection | Machine learning | PTS | Hairpin | NAT | Precise excision | Phasing | Complementarity | Degradome | Machine learning | Expression

1 | SeqBuster | x | miRNA | SL + TG | x x x x | 2010 | [13]
2 | QuickMIRSeq | x | miRNA | SL + TG | x x x x | 2017 | [14]
3 | IsomiRage | x | miRNA | SL + TG | x x x x | 2014 | [15]
4 | sRNAbench | x | miRNA | SL + TG | x x x x x | 2014 | [16]
5 | isomiRex | x | miRNA | SL + TG | x x x x | 2013 | [17]
6 | isomiRID | x | miRNA | SL + TG | x x x x | 2013 | [18]
7 | RNAFold | x x | hpsRNA | TG | x | 2003 | [19]
8 | UNAFold | x | hpsRNA | TG | x | 2003 | [20]
9 | Mirinho | x | miRNA | SL + TG | x x x | 2015 | [21]
10 | miRNAFold | x x | hpsRNA | TG | x | 2016 | [22]
11 | MIRFINDER | x | miRNA | SS + TG | x x x | 2004 | [23]
12 | MIRcheck | x | miRNA | SS + TG | x x x | 2004 | [24]
13 | microHarvester | x x | miRNA | SS + TG | x x x | 2006 | [25]
14 | MiMatcher | x | miRNA | TS + TG | x x | 2005 | [26]
15 | mirTour | x | miRNA | SS + TG | x x x | 2011 | [27]
16 | C-mii | x | miRNA | SS + TG | x x x x | 2012 | [28]
17 | mirDeepFinder | x | miRNA | SL + TG + DF | x x x x x | 2012 | [29]
18 | PlantMiRNAPred | x | miRNA | TG | x x | 2011 | [30]
19 | plantMirP | x | miRNA | TG | x x | 2016 | [31]
20 | NOVOMIR | x | miRNA | TG | x x | 2010 | [32]
21 | HuntMi | x | miRNA | TG | x x | 2013 | [33]
22 | miRNAPrediction | x | miRNA | SS + TG | + x | 2012 | [34]
23 | SplamiR | x | miRNA | TG | x + x | 2011 | [35]
24 | miPlantPreMat | x | miRNA | TG | x x x | 2014 | [36]
25 | MiRPara | x | miRNA | TG | x x | 2011 | [37]
26 | miRDup | x | miRNA | SS + TG | x x | 2013 | [38]
27 | miRduplexSVM | x | miRNA | TG | x x | 2015 | [39]
28 | MaturePred | x | miRNA | TG | x x | 2011 | [40]
29 | miRLocator | x | miRNA | TG | x x | 2015 | [41]
30 | ShortStack | x | hpsRNA/miRNA/ta-siRNA | SA + TG | x x x x x | 2013 | [42]
31 | mirDeep-P | x | miRNA | SL + TG | x x x | 2011 | [43]
32 | miRPlant | x | miRNA | SL + TG | x x x | 2014 | [44]
33 | miRA | x | miRNA | SA + TG | x x | 2015 | [45]
34 | PIPmiR | x | miRNA | SL + TG | x x x | 2011 | [46]
35 | Mir-PREFeR | x | miRNA | SA + TG | x x | 2014 | [47]
36 | miRCat2 | x | miRNA | SL + TG | x x | 2017 | [48]
37 | miReader | x | miRNA | SL | x x | 2013 | [49]
38 | UEA sRNA Workbench | x | miRNA/ta-siRNA | SL + TG + D | x x x x x x x | 2012 | [7]
39 | pssRNAMiner | x | ta-siRNA | SL + TG + PI | x x | 2008 | [50]
40 | Shortran | x | miRNA/ta-siRNA | SL + TG | x x x x x | 2012 | [51]
41 | TasExpAnalysis | x | ta-siRNA | SL + TG + D | x x x x | 2014 | [52]
42 | PhaseTank | x | Secondary/phased siRNA | SL + TG | x x | 2015 | [53]
43 | NASTI-seq/R | x | nat-siRNA | SL | x | 2013 | [54]
44 | NATpipe | x | nat-siRNA | SL + TG + D | x x x | 2016 | [55]
45 | Cleaveland | x | PTS | SS + TG + DF | x x | 2014 | [56]
46 | PAREsnip | x | PTS | SS + TG + DF | x x | 2012 | [57]
47 | SoMART | x | miRNA/ta-siRNA | SL + TG + DF | x x x x x x x | 2012 | [58]
48 | SeqTar | x | PTS | SL + TG + DF | x | 2012 | [59]
49 | miRNA Digger | x | miRNA | SL + TG + DF | x x x x x x | 2016 | [60]
50 | psRNATarget | x | PTS | SS + TG | x | 2011 | [61]
51 | TAPIR | x x | PTS | SS + TG | x | 2010 | [62]
52 | Targetfinder | x | PTS | SS + TG | x | 2010 | [63]
53 | Target-align | x x | PTS | SS + TG | x | 2010 | [64]
54 | PsRobot | x x | hpsRNA/miRNA | SS + TG | x x x x | 2012 | [65]
55 | p-TAREF | x x | miRNA | SS + TG | x x | 2011 | [66]
56 | microRNA- Target | x | miRNA | SS + TG | x x | 2014 | [67]
57 | PlantMirnaT | x | miRNA | SS + RL + TG + DF | x x x x | 2015 | [68]
58 | MTide | x | miRNA | SL + TG + DF | x x x x x x | 2015 | [69]
59 | sPARTA | x | PTS | SS + TS + DF | x x | 2014 | [70]
60 | imiRTP | x | miRNA | SS + TS + DF | x x | 2011 | [71]

Notes: Modules for downstream analysis or that are not applicable to plants are not mentioned. DF: degradation fragments; RL: RNA sequencing; SA: sRNA-seq alignments; SL: sRNA-seq library; SS: sRNA sequence; PI: phase initiator sequence; PS: precursor sequence; TR: transcript; TG: transcript or genomic sequence.

Detecting known sRNA

An obvious choice when trying to identify new sRNA candidates is to search for known sequences that have been experimentally confirmed. Owing to their popularity, a large number of databases of experimentally validated miRNA have been built up, comprising several species and kingdoms. miRBase [72] is the most famous miRNA repository, and provides extensive information on precursors, mature sequences and their targets. Unfortunately, similarly detailed databases for other sRNA categories do not currently exist. To our knowledge, there is only one public database (tasiRNAdb) for secondary sRNA in plants, which is strictly dedicated to ta-siRNA [52]. tasiRNAdb provides information not only about mature ta-siRNA but also their precursors and targets in 18 plant species. We are not aware of repositories of mature nat-siRNA or hc-siRNA.
Still, there is one database of NATs in plants: PlantNATsDB [73]. This resource contains a large inventory of precomputed NATs for 70 plant species, but it focuses on genes and ignores extensive intergenic regions. Sequence aligners such as BLAST [74] are commonly used to query such databases. In fact, the platforms supporting tasiRNAdb and PlantNATsDB implement their own online BLAST modules. Searches for long precursors can be performed using standard BLAST parameters; however, mature sequences pose additional challenges because of their small size. When searching for mature sequences, requiring perfect matches reduces the odds of getting hits by chance compared with allowing mismatches and gaps. On the other hand, allowing for mismatches and gaps is often necessary when studying close inter-species homologues. While single-nucleotide polymorphisms are a well-known source of genomic variation, mature sRNA can be subject to further modifications. For example, miRNA variants (i.e. ‘isomiRs’) have been identified as a result of inaccurate DCL cleavage, sequence editing events and even nucleotide additions to the mature sRNA [75, 76]. To deal with isomiRs, several tools rely on alignment algorithms and a preprocessing scheme that consists of terminal trimming and nucleotide additions to simulate known sequence modifications. This is the case in applications such as SeqBuster [13], QuickMIRSeq [14], IsomiRage [15], sRNAbench [16], isomiRex [17] and isomiRID [18]. As ‘template isomiRs’ are a result of dicing shifts, they can be detected if perfect complementarity between the sRNA candidate and a known miRNA precursor (or pre-miRNA) is observed. On the other hand, simulating ‘non-template isomiRs’ by creating all possible combinations with 1–3 nt extensions at the 5′ and 3′ ends of known miRNA, and by trimming canonical mature sequences, has been central for their identification.
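The simulation of isomiR candidates by terminal trimming and nucleotide addition can be illustrated with a minimal sketch. It is not the implementation of any of the tools cited above; the function names, the RNA alphabet and the default limits are assumptions chosen for illustration only.

```python
from itertools import product

BASES = "ACGU"  # assumed RNA alphabet


def trimmed_variants(mature, max_trim=3):
    """Sequences obtained by removing 0..max_trim nt from each end."""
    variants = set()
    for five, three in product(range(max_trim + 1), repeat=2):
        if five == 0 and three == 0:
            continue  # skip the canonical sequence itself
        trimmed = mature[five:len(mature) - three]
        if trimmed:
            variants.add(trimmed)
    return variants


def extended_variants(mature, max_add=3):
    """Sequences with 0..max_add arbitrary nt added to each end."""
    variants = set()
    for n5, n3 in product(range(max_add + 1), repeat=2):
        if n5 == 0 and n3 == 0:
            continue  # skip the canonical sequence itself
        for head in product(BASES, repeat=n5):
            for tail in product(BASES, repeat=n3):
                variants.add("".join(head) + mature + "".join(tail))
    return variants


if __name__ == "__main__":
    canonical = "UGACAGAAGAGAGUGAGCAC"  # example 20-nt sequence
    print(len(trimmed_variants(canonical, max_trim=1)))
    print(len(extended_variants(canonical, max_add=1)))
```

The resulting sets can then be matched exactly against sRNA-seq reads. The number of extended variants grows quickly with `max_add`, which is one reason why abundance filters of the kind discussed next are useful.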
To reduce false positives, some of these tools perform additional processing steps, typically exploring features of sRNA-seq libraries. The simplest procedure consists of using read abundance cutoffs, as done in isomiRID. SeqBuster uses several filters to eliminate sequences with low read abundance and computes z-scores to distinguish true isoforms from sequencing errors. QuickMIRSeq uses multiple samples simultaneously, with the rationale that, unlike true miRNA, noisy background reads are not captured consistently across samples, even when the true miRNA show low expression levels.

Computer-aided de novo sRNA categorization

Once known mature sRNAs are identified in sRNA-seq data, the remaining sequences (which usually comprise the large majority of the initial sRNA-seq sets) are typically mapped to a genomic reference (if available) to eliminate sequencing chimeras and artifacts. Mature sequences mapping to previously characterized ribosomal RNA (rRNA), transfer RNA (tRNA), small nucleolar RNA (snoRNA) and small nuclear RNA (from databases such as Rfam [77]) are filtered out by some computational frameworks [29, 57], as these are thought to be mostly non-DCL fragmentation products that have a low chance of entering functional sRNA pathways [57, 78]. Still, doing so can eliminate true sRNA, as tRNA-, rRNA- and snoRNA-derived sRNAs have been identified in multiple plant species [79]. Removing low-complexity and low-copy reads is also a common practice to reduce noisy data [29, 42], but again, care must be taken in doing so because important sequences can be missed (e.g. hc-siRNAs are typically derived from repetitive regions). While BLAST is a popular mapping tool suitable for long precursor sequences, aligners such as BWA [80] and bowtie [81] are primary choices when it comes to mapping libraries of short reads to a reference.
Accurate mapping is important because meaningful clues for sRNA categorization can be obtained from the sequence and chromatin context of the mapped loci. The identification and classification of putatively functional sRNA is a challenging computational task. While the majority of computational tools have been tailored to animal data, several of these tools can also be applied to other species, including plants. However, sRNA biology differs considerably between plants and animals, and several plant-specific computational tools have been developed. Existing computational methods can be broadly divided into five main groups: (i) those that explore conservation principles; (ii) those that rely on structural features such as the spatial conformation of the precursor(s); (iii) those inspired by machine learning; (iv) rule-based methods and (v) target-centered methods. In practice, this distinction can be difficult to draw, as most modern tools consist of pipelines involving a mixture of these methods. Below we discuss computational tools specific to each plant sRNA class. Because sRNA biogenesis and function can be treated separately, emphasis is given to each of these facets in distinct sections.

Identification of hairpin structures and miRNA classification

The root concept underlying hpsRNA and miRNA categorization is the biogenesis from an RNA transcript with the capacity to fold into a hairpin-shaped precursor. As hpsRNA are not well understood and given the popularity of miRNA, most hairpin detectors were developed with pre-miRNA in mind. In truth, a large number of tools branded as miRNA detectors are no more than hairpin or pre-miRNA predictors, as they do not even provide a location for the putative mature miRNA inside the pre-miRNA. These tools must therefore be examined carefully to avoid erroneous conclusions [38]. The inference of RNA secondary structure is central to many computational methods designed to detect hairpin structures.
Algorithms such as RNAfold [19] and UNAFold [20] explore thermodynamic principles applied to RNA folding, under the premise that the minimal folding free energy index of miRNA precursors is significantly lower than that of other products frequently captured during sequencing, such as tRNA, rRNA or mRNA [82]. It is important to recall that plant miRNA stem-loops are more heterogeneous than those of animals (they are usually larger and can contain large bulges); hence, the parameters of the algorithms need to be adjusted according to whether the input data are derived from plant or animal species [83, 84]. Once calibrated, these ab initio methods can predict hairpin structures without additional knowledge. Traditional RNA folders are computationally intensive, characterized by a cubic time complexity, which is suboptimal for large inputs. Mirinho [21] and miRNAFold [22] are folders with a quadratic time complexity, recently introduced to tackle this problem. Because homology search is inherently simpler than folding estimation, most software combines conservation principles with RNA secondary structure predictions to decrease processing time, but also to increase accuracy. In the case of MIRFINDER [23], a miRNA reference set from Arabidopsis is used to tune a pipeline similar to the one described above. This tool performs a search for new miRNA by comparing queries against a reference, and explores three principles: (i) the reference miRNA sequence is conserved between the query and the reference species, regardless of whether the rest of the precursor sequence has diverged; (ii) the precursor sequence must be able to form a stem-loop secondary structure; and (iii) for two miRNA orthologs, the location on the arm of the stem-loop secondary structure is the same in both species. To fulfill the first condition, a search for mature miRNA is made with BLAST, and the second condition is verified with RNAfold.
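Conditions (ii) and (iii) can be sketched on top of a predicted structure. The example below assumes that the structure is available in dot-bracket notation, as produced by folding programs such as RNAfold, and that coordinates are 0-based half-open intervals. The function names are hypothetical, and the code is a simplified illustration rather than the procedure implemented in MIRFINDER.

```python
def is_simple_hairpin(structure):
    """True if a dot-bracket string describes one unbranched stem-loop."""
    brackets = structure.replace(".", "")
    opens = brackets.count("(")
    closes = brackets.count(")")
    # All opening brackets must precede all closing ones, in equal numbers.
    return opens > 0 and opens == closes and brackets == "(" * opens + ")" * closes


def arm_of(structure, start, end):
    """Return '5p' or '3p' for the interval [start, end), else None."""
    segment = structure[start:end]
    opens, closes = segment.count("("), segment.count(")")
    if opens and not closes:
        return "5p"
    if closes and not opens:
        return "3p"
    return None  # unpaired segment, or one that spans the loop


def same_arm(query, reference):
    """Compare arm location for two (structure, start, end) tuples."""
    arm_query, arm_reference = arm_of(*query), arm_of(*reference)
    return arm_query is not None and arm_query == arm_reference


if __name__ == "__main__":
    fold = "((((((....))))))"
    print(is_simple_hairpin(fold), arm_of(fold, 0, 6), arm_of(fold, 10, 16))
```

Bulges and internal loops are tolerated by `is_simple_hairpin`, because unpaired positions are ignored before the bracket order is checked; branched structures are rejected.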
Other tools that use similar comparative genomics principles include MIRcheck [24], microHarvester [25], MiMatcher [26], miRTour [27], C-mii [28] and miRDeepFinder [29]. For example, miRDeepFinder uses a set of miRNA candidates as queries and searches in a reference for segments with potential to form pre-miRNA. The hit sites are extended (700 nt by default) upstream and downstream to capture and examine precursor candidates with the miRNA located in one arm of the stem at either the 5′ or 3′ end. After a miRNA candidate is identified, miRDeepFinder extracts the complementary miRNA* sequence considering a 3′ overhang of 2 nt characteristic of the miRNA-miRNA* duplex. In a slightly different approach, machine learning methods have been used to train classifiers capable of distinguishing plant pre-miRNA from other RNA sequences. One argument in favor of these latter approaches is that comparative methods have limited capacity to detect miRNA sequences and precursors with low similarity to the reference set, while machine learning models can capture more general features that overcome this weakness. PlantMiRNAPred [30] and plantMirP [31] are part of this list, both of which were developed using support vector machines (SVMs). PlantMiRNAPred uses a classifier trained with data from several plant species. A set of 68 features extracted from pre-miRNA and optimized using information gain and feature similarity criteria was considered for training the final classifier, including information about the sequence composition, k-mers, secondary structure, energy and thermodynamics-related parameters. Interestingly, the authors compared PlantMiRNAPred with triplet-SVM [85] and microPred [86], two tools following a similar philosophy but developed using human data. Indeed, these two methods show discriminative capacity when applied to plants, but the accuracy of PlantMiRNAPred is considerably higher, illustrating the need for kingdom-specific tools. 
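To make the notion of sequence-composition features more concrete, the sketch below computes normalized k-mer frequencies and GC content for a candidate precursor. It is a generic illustration and does not reproduce the 68 features of PlantMiRNAPred; the function names and the choice of k are assumptions.

```python
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Relative frequency of every k-mer over the alphabet, in fixed order."""
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def gc_content(seq):
    """Fraction of G and C among all positions of the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def feature_vector(seq, k=2):
    """Concatenate GC content and k-mer frequencies into one list."""
    return [gc_content(seq)] + kmer_frequencies(seq, k)


if __name__ == "__main__":
    print(feature_vector("UGACAGAAGAGAGUGAGCAC", k=1))
```

Vectors of this kind, usually complemented with structural and thermodynamic descriptors, are what a classifier such as an SVM receives as input.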
Other machine learning algorithms have been applied to pre-miRNA detection, including Markov models in NOVOMIR [32], random forest in HuntMi [33] and C5.0 decision trees in miRNAPrediction [34]. In a less usual scheme, SplamiR [35] combines software for detecting primary transcripts that undergo splicing events with a machine learning classification system to identify candidate pre-miRNA among the generated putative pre-miRNA. Searches for genomic sequences with the potential to form fold-back stem-loop structures do not yield high-confidence putatively functional miRNA, as many more inverted repeats can be found than the number of miRNA expected for a given organism. For example, in Arabidopsis thaliana, 138 864 inverted repeat structures have been identified [24], but fewer than 1000 miRNA have been confirmed [72]. To increase miRNA detection accuracy, miPlantPreMat [36] and miRPara [37] feed properties of mature miRNA sequences and their precursors to machine learning models. Both combine SVM classifiers in a hierarchical architecture. miPlantPreMat works with classifiers individually trained to recognize mature and precursor sequences, while miRPara explores inter-kingdom differences. Although the determinants of miRNA location inside a precursor remain poorly understood, efforts have been made to develop computational procedures for their detection. For example, miRDup [38] was designed to infer the precise position and length of mature miRNA within a candidate pre-miRNA through random forest classifiers that use sequence and structural features. In addition, tools such as miRduplexSVM [39], MaturePred [40] and miRLocator [41] use classifiers to extract the position of miRNA duplexes from hairpins. High-confidence miRNA classification requires additional criteria [87]. For example, the precursor must be diced at specific loci, producing only one or a reduced number of mature miRNA. When that happens, piles of sRNA accumulate at these genomic positions.
To inspect this feature, it is necessary to access the layout of sRNAs along the precursor. This can be directly assessed using the short-read alignment patterns from sRNA-seq data. Such functionality can be found in most modern tools, including ShortStack [42], mirDeep-P [43], miRPlant [44], miRA [45], PIPmiR [46], miR-PREFeR [47] and miRCat2 [48]. mirDeep-P and miRPlant are extensions of the popular tool mirDeep [88]. While mirDeep was developed for animal applications, mirDeep-P and miRPlant were specifically designed for plant-based sRNA analysis. Following the mapping of sRNA reads to a reference genome with bowtie, mirDeep-P extracts RNA segments to further determine secondary structure and checks whether sRNA spatial distribution patterns are compatible with Dicer activity. The mature candidates and the respective pre-miRNA are then filtered according to plant-specific criteria based on known properties of plant miRNA genes. A significant difference between the two tools is that in mirDeep-P the precursor region is determined based on the genomic region overlapping reads, while in miRPlant the precursor region is determined based on the most highly expressed read, which is presumably the mature miRNA. The authors of miRPlant argue that this strategy reduces the number of false negatives, as it guarantees that the mature miRNA is located at the end of one arm of the hairpin. In the case of ShortStack, given the mapped reads and a reference as input, a de novo sRNA cluster discovery is performed by analyzing local patterns of read coverage. Each genomic region overlapping an sRNA cluster is then subjected to a hairpin-folding analysis with RNAfold. Afterward, hairpins are annotated either as hpsRNA or miRNA loci, depending on how strong the evidence for precise excision given by local sRNA patterns is.
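The idea of grading the evidence for precise excision can be sketched as the fraction of reads that start at, or very close to, the two most abundant positions on a hairpin (taken here as stand-ins for the miRNA and miRNA* strands). The scoring rule, the tolerance and the 0.8 threshold are illustrative assumptions and are not the criteria applied by ShortStack or any other tool discussed here.

```python
def excision_precision(read_starts, tolerance=1):
    """Fraction of reads starting near the two dominant start positions.

    read_starts maps a 5' start coordinate on the hairpin to a read count.
    """
    total = sum(read_starts.values())
    if total == 0:
        return 0.0
    ranked = sorted(read_starts, key=read_starts.get, reverse=True)
    peaks = [ranked[0]]
    for pos in ranked[1:]:
        if abs(pos - peaks[0]) > tolerance:
            peaks.append(pos)  # second peak, e.g. the opposite arm
            break
    supported = sum(
        count
        for pos, count in read_starts.items()
        if any(abs(pos - peak) <= tolerance for peak in peaks)
    )
    return supported / total


def classify_hairpin(read_starts, min_fraction=0.8):
    """Label a hairpin locus by the precision of its read pile-ups."""
    if excision_precision(read_starts) >= min_fraction:
        return "miRNA-like"
    return "hpsRNA-like"


if __name__ == "__main__":
    precise = {10: 90, 11: 5, 45: 40, 70: 5}
    diffuse = {pos: 10 for pos in range(0, 100, 5)}
    print(classify_hairpin(precise), classify_hairpin(diffuse))
```

A locus where reads pile up at two well-defined positions scores close to 1, whereas a locus with reads scattered along the hairpin scores low and would be kept in the broader hpsRNA class.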
miRA tries to maximize the flexibility in parameter settings to enable a conservation-independent miRNA analysis; the authors argue that the use of standard parameters for all plant species is suboptimal because of the complex and nonhomogeneous nature of miRNA precursors in plants. In miR-PREFeR, expression information from multiple sRNA-seq libraries can additionally be used to decrease false positives and improve the reliability of the predictions. Another, less common solution is miReader [49], which aims at identifying mature miRNA directly from sRNA-seq data, thanks to an embedded algorithm for de novo contig assembly using short reads.

Detection of secondary siRNA and ta-siRNA

In contrast with miRNA-encoding MIR genes, secondary siRNA precursors such as those encoded by ta-siRNA loci or TAS genes lack a specific secondary structure, and thus require alternative computational prediction strategies. The computational identification of new secondary siRNA is strongly focused on the detection of phasing patterns. This kind of analysis requires sRNA-seq data and a genomic reference, and can be executed with tools such as UEA sRNA Workbench [7], ShortStack [42], pssRNAMiner [50] and Shortran [51], which implement variants of the method described in [89]. In this approach, sRNA clusters are determined from the mapped reads, and the occurrence of significant phasing patterns inside these regions (Figure 1) is calculated considering a hypergeometric distribution. The sRNA thought to be phase initiators can also be mapped to the reference to help identify the start and stop coordinates of the precursor, and to restrict the inspection of secondary siRNA candidates to clusters inside that region [50]. In the ‘one hit’ initiator case, functional siRNA must be searched for in both the 5′ and 3′ directions from the initial cleavage coordinate, as phasing is a bidirectional process.
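The hypergeometric test for phasing can be illustrated with a generic upper-tail probability: given a window with a number of possible sRNA start positions, of which some are in phase, it asks how likely it is to observe at least the seen number of occupied in-phase positions by chance. The parameterization below (for example, a 231-nt window with 11 in-phase positions) is an assumption made for illustration and may differ from the exact formulation in [89].

```python
from math import comb


def phasing_pvalue(n_positions, n_phased_positions, n_occupied, n_occupied_phased):
    """Upper-tail hypergeometric probability P(X >= n_occupied_phased).

    n_positions        -- possible sRNA start positions in the window
    n_phased_positions -- positions that are in phase with the register
    n_occupied         -- distinct positions occupied by mapped sRNAs
    n_occupied_phased  -- occupied positions that are in phase
    """
    denominator = comb(n_positions, n_occupied)
    upper = min(n_phased_positions, n_occupied)
    tail = sum(
        comb(n_phased_positions, x)
        * comb(n_positions - n_phased_positions, n_occupied - x)
        for x in range(n_occupied_phased, upper + 1)
    )
    return tail / denominator


if __name__ == "__main__":
    # Hypothetical window: 231 positions, 11 in phase, 20 occupied, 8 in phase.
    print(phasing_pvalue(231, 11, 20, 8))
```

Small values indicate that the in-phase positions are occupied more often than expected if sRNA start sites were placed at random within the window.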
To mimic patterns of DCL slicing, both UEA sRNA Workbench and ShortStack introduce a shift that pushes the start position of the segment located on the opposite strand 2 nt downstream. The TasExpAnalysis module, available online through the tasiRNAdb [52] platform, combines phasing detection with a search for known TAS and ta-siRNA in user-provided sRNA and sequencing reads from endonucleolytic mRNA cleavage products, also known as the degradome. This tool follows a target-centered approach: after mapping sRNA-seq and degradome reads to a TAS candidate, it checks the consistency between the TAS cleavage position and the 5′ end of the degradome fragments. Next, it searches for an sRNA from the provided library that can fit the role of phase initiator. Statistical tests to detect phasing are then performed using the mapped sRNA and assuming a hypergeometric distribution. PhaseTank [53] implements a slightly different methodology. After defining phased clusters that contain at least four phased sRNAs in 84 nt regions, a nonstatistical phased score is computed to express the likelihood that a region produces phased siRNA. This score depends on patterns of sRNA distribution and abundance in the region. The triggering sRNA is then determined following sequence complementarity principles, along with the requirement that the cleavage site must occur at positions 9–11 nt of the sRNA from its 5′ terminus. Some tools, such as UEA sRNA Workbench, do not provide an indication of the phase initiator(s). Using standard tools for PTS target prediction to test diverse sRNA candidates that fall around the initial cleavage site(s) can be a solution. On the other hand, ignoring positional information about the initial cleavage allows a more liberal approach, not restricted to known sRNA but considering potentially unidentified sRNA to be starters of the process [89, 90].
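The cluster definition of at least four phased sRNAs within an 84 nt region can be sketched as follows. The example considers sRNA start coordinates on a single strand, assumes 21-nt phasing and ignores the 2-nt strand offset mentioned above; it is a simplified reading of the criterion and not the PhaseTank implementation.

```python
from collections import defaultdict


def has_phased_cluster(starts, phase_len=21, window=84, min_phased=4):
    """True if min_phased same-register sRNAs fit in one window.

    starts holds start coordinates of sRNAs of length phase_len.
    """
    by_register = defaultdict(list)
    for pos in sorted(set(starts)):
        by_register[pos % phase_len].append(pos)
    for positions in by_register.values():
        for i in range(len(positions) - min_phased + 1):
            first = positions[i]
            last = positions[i + min_phased - 1]
            # The last sRNA must end inside the window opened by the first.
            if last + phase_len - first <= window:
                return True
    return False


if __name__ == "__main__":
    print(has_phased_cluster([100, 121, 142, 163]))
    print(has_phased_cluster([100, 121, 142, 200]))
```

Regions that pass such a filter would then be scored, for instance using the abundance of the phased reads relative to all reads in the region.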
Distinguishing ta-siRNA from other secondary siRNA is done simply by comparing the mapping locations of the siRNA and its target transcript; if they are in trans, the siRNA is assigned to the ta-siRNA group.

Finding NAT pairs and nat-siRNA

Genome-wide identification of NAT from multiple organisms is nowadays possible using the large collections of sequencing data freely available online. Annotated genomes have been used in combination with other highly abundant expressed sequences. In silico methods for detecting NAT suffer from several shortcomings depending on the source of sequence information [91]. For example, mRNA data can provide information about the orientation of the transcripts, but the amount of mRNA sequence information available can be limited, reflecting specific tissues or developmental stages [92]. Either way, computational resources and databases dedicated to NATs are scarce. Current methods for the detection of new NATs are simplistic and based on two main pillars: the sequence complementarity between candidate pairs and the potential for transcripts to hybridize. Although the main criterion for the recognition of NAT pairs is the presence of overlapping transcript clusters, the length of the overlap is an artificially defined parameter that varies from study to study [5]. Other parameters to define NATs have a heuristic basis and lack clear standardization. As an example, in [91], cis-NAT pairs from Arabidopsis were studied using annotated and anchored full-length complementary DNA (cDNA), by applying the following criteria: (i) cDNAs of both transcripts can be uniquely mapped to the genome with at least 96% sequence identity; (ii) the two transcripts are derived from overlapping loci but opposite strands; (iii) the overlapping fragment must be longer than 50 nucleotides; and (iv) the sense and antisense transcripts must have distinct splicing patterns.
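Criteria (i) to (iii) translate naturally into a coordinate-based filter, sketched below. The `Transcript` record and its fields are hypothetical, coordinates are assumed to be 1-based and inclusive, and neither the uniqueness of the mapping nor criterion (iv) on splicing patterns is modeled.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    chrom: str
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    strand: str      # "+" or "-"
    identity: float  # percent identity of the cDNA-to-genome alignment


def overlap_length(a, b):
    """Number of genomic positions shared by two transcripts."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def is_cis_nat_pair(a, b, min_identity=96.0, min_overlap=50):
    """Apply the identity, opposite-strand and overlap-length criteria."""
    return (
        a.identity >= min_identity
        and b.identity >= min_identity
        and a.strand != b.strand
        and overlap_length(a, b) > min_overlap
    )


if __name__ == "__main__":
    sense = Transcript("1", 100, 500, "+", 99.0)
    antisense = Transcript("1", 450, 900, "-", 97.5)
    print(overlap_length(sense, antisense), is_cis_nat_pair(sense, antisense))
```

Because the overlap threshold is a study-specific choice, it is exposed as a parameter rather than fixed in the code.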
Other studies implemented comparable but slightly altered approaches for rice [93, 94] and Arabidopsis [95]. The NASTI-seq R package [54] is one of the few computational tools currently available for NAT discovery. This software specializes in cis-NAT detection using strand-specific RNA sequencing data. It models the probability of finding read enrichment in each strand using a binomial model and identifies cis-NAT conditional on additional spatial criteria such as the location in opposite strands and the proximity in the genome. To our knowledge, the only tool implementing an engine for generic NAT search (including trans-NAT) is NATpipe [55]. This is a pipeline for NAT prediction for organisms without a reference genome. Using transcriptomic data, it performs a BLAST-based search to preselect NAT pair candidates. Then, the annealing potential of these candidates is explored with RNAplex [96]. The secondary structure is analyzed, and instances containing bubbles that comprise >10% of the length of the annealed region are rejected. If sRNA-seq data are available, NATpipe can search for prospective nat-siRNA by looking for phasing patterns in the annealed region, similar to what is done by tools for ta-siRNA detection. To distinguish trans-nat-siRNA from cis-nat-siRNA, it is necessary to keep in mind that the concept of trans implies transcripts not sharing a common genomic location.
Detecting hc-siRNA and the respective generating loci
In plants, hc-siRNAs are typically 24 nt long and mostly derive from transposons, repeats and heterochromatic regions. Their biogenesis is primarily connected with the activity of PolIV-RDR2-DCL3 [99, 101]. hc-siRNAs are central to RNA-directed DNA methylation (RdDM), which is the pathway responsible for de novo DNA methylation. A hallmark of RdDM is the presence of cytosine methylation in all DNA sequence contexts (CG, CHG and CHH, where H can be C, A or T) [99, 103].
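The three sequence contexts named above can be made concrete with a short Python sketch that classifies a cytosine by the two bases that follow it. The function names are hypothetical, only the given strand is scanned, and cytosines too close to the end of the sequence or followed by ambiguous bases are simply left unclassified.

```python
from collections import Counter

H_BASES = "ACT"  # H stands for C, A or T


def cytosine_context(seq, i):
    """Return 'CG', 'CHG', 'CHH' or None for the cytosine at index i (5'->3')."""
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError("position %d is not a cytosine" % i)
    following = seq[i + 1:i + 3]
    if following[:1] == "G":
        return "CG"
    if len(following) < 2 or following[0] not in H_BASES:
        return None  # too close to the end, or an ambiguous base such as N
    if following[1] == "G":
        return "CHG"
    if following[1] in H_BASES:
        return "CHH"
    return None


def context_counts(seq):
    """Count classified cytosine contexts along one strand."""
    counts = Counter()
    for i, base in enumerate(seq.upper()):
        if base == "C":
            context = cytosine_context(seq, i)
            if context is not None:
                counts[context] += 1
    return counts
```

In practice both strands would be scanned, for example by applying the same function to the reverse complement.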
Some transposon families can switch the production of siRNA from 24 to 21–22 nt when methylation is lost [97, 98]. This transition starts with the synthesis of transcripts by Pol II that are afterward degraded into 21–22 nt siRNA. Some of these siRNA can enter a noncanonical RdDM pathway dependent on Pol II, RDR6 and DCL2/4 [98]. To date, there are no public tools specifically developed to detect hc-siRNA. This limitation arises in part because hc-siRNA biology remains unclear, and experimental tests to functionally validate hc-siRNA are difficult to establish. Although certain families of TEs have been described to produce hc-siRNAs when epigenetic marks are changed [97, 98], the reason for their involvement in hc-siRNA biogenesis remains poorly understood. So far, the identification of hc-siRNA has focused mostly on the abundance of 24 nt sRNA mapping to differentially methylated regions that arise in RdDM mutants [97–103], the correlation with the presence or absence of histone marks [104] and variations in gene expression. However, numerous epigenetically activated sRNAs have been identified that seem to lack the functional properties required for inclusion in the hc-siRNA category. Rather, they show PTS activity [97, 105], but the structural features that separate these from true hc-siRNA are unclear. Although the length of mature sRNA sequences is somewhat predictive, and has been taken as a way to discriminate putative hc-siRNA from other types of siRNA (hc-siRNAs are typically 24 nt in length), this feature can be misleading when used by itself. For example, miR163 is a 24 nt miRNA, exceptionally long for a DCL1-dependent sequence, that despite its length primarily binds to AGO1 and exerts posttranscriptional regulation [100].
Function prediction in plants
Established nomenclature for miRNA annotation [78] does not require the identification of a functional target sequence, as target prediction can be notoriously difficult. Moreover, target sequences are not static entities, but can arise de novo or be lost through mutational events over evolutionary time. However, the structural features that distinguish other (non-miRNA) sRNA classes remain obscure and can often only be clearly delineated based on knowledge of their target sequences. For example, sorting sRNA by length and matching them to heterochromatic regions, TEs or repeats are naive approaches often used to identify hc-siRNA, but these approaches are insufficient to discriminate hc-siRNA from epigenetically activated siRNA involved in PTS [98, 105]. Hence, knowing the mode of action of a given sRNA sequence would appear to be a fundamental aspect of sRNA classification.
Methods for PTS target prediction
Posttranscriptional regulation in plants can occur in two ways: target cleavage [106] and translational repression [107]. A negative correlation between sRNA expression levels and those of the target transcript is usually taken as evidence for target cleavage. Alternatively, translational repression can happen after binding of the sRNA-AGO complex to the 5′ untranslated region or the open reading frame of the target RNA, which inhibits the recruitment or movement of ribosomes along the mRNA [108]. Targeting of plant mRNA follows rules that are significantly different from those in animals; therefore, tools developed for the animal kingdom are suboptimal for plants. Studies with miRNA have revealed a number of key differences: in animals, a seed region of around 8 nt demands near-perfect sRNA/mRNA complementarity, while in plants this complementarity must be preserved throughout the complete miRNA; in animals, miRNAs have a positional preference for the 3′-UTR of the target, while in plants this is not observed [83, 84].
Target cleavage can be identified through the analysis of degradation fragments captured by sequencing (i.e. the degradome). The underlying idea is to use experimental evidence given by the degradome to discriminate between random degradation products and RNA segments precisely targeted by AGO proteins. Methods such as CleaveLand [56], PAREsnip [57], SoMART [58], SeqTar [59] and miRNA Digger [60] jointly explore degradome data, sRNA and other transcripts to detect PTS-sRNA and their targets [56, 57]. Taking miRNA Digger as an illustrative example: it starts by scanning the degradome for potential cleavage sites after mapping the degradation segments to a genomic reference. The mapping loci are then tested for the presence of RNA with hairpin-folding capacity. With sRNA-seq data available, it then looks for marks of miRNA-miRNA* duplexes, plus AGO-enriched miRNA(*)s if such libraries are provided. Other prediction-based algorithms, such as psRNATarget [61] and TAPIR [62], only require candidate sRNA-target pairs as input. The analysis is typically performed in two main steps: (i) search for the best sRNA/mRNA complementarity location in the target candidate and (ii) measure target accessibility. The strength of these parameters is in some cases used to discriminate between translational inhibition and mRNA cleavage. For example, in psRNATarget, a modified version of the Smith–Waterman algorithm [109] is used to look for optimal sRNA/mRNA alignments, and the UPE score (the energy required to ‘open’ the secondary structure around the target site on the mRNA) is determined with RNAup [110]. In cases where mismatches are detected in the central complementary region of the sRNA sequence, the software assumes that the sRNA is likely involved in translational inhibition rather than in mRNA cleavage, as cleavage activity is known to be reduced when sRNA-mRNA complementarity is poor.
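The complementarity search and the central-mismatch heuristic described above can be sketched in Python as follows. This is a deliberately simplified stand-in, not the psRNATarget algorithm: it scans ungapped windows instead of running a modified Smith–Waterman alignment, it ignores target accessibility, and the penalty values (1 per mismatch, 0.5 per G:U wobble), the central region (positions 9–11 from the sRNA 5′ end) and all function names are assumptions made for the example.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def to_rna(seq):
    """Upper-case a sequence and convert T to U."""
    return seq.upper().replace("T", "U")


def score_site(srna, site, central=(9, 11)):
    """Score an ungapped duplex; both sequences are written 5'->3'.

    Returns (penalty, central_mismatch). Wobble pairs cost 0.5 and are
    not counted as central mismatches (an assumption of this sketch).
    """
    srna, site = to_rna(srna), to_rna(site)
    if len(srna) != len(site):
        raise ValueError("sRNA and site must have the same length")
    penalty, central_mismatch = 0.0, False
    for pos, base in enumerate(srna, start=1):
        partner = site[len(site) - pos]  # antiparallel pairing
        pair = (base, partner)
        if pair in WATSON_CRICK:
            continue
        if pair in WOBBLE:
            penalty += 0.5
        else:
            penalty += 1.0
            if central[0] <= pos <= central[1]:
                central_mismatch = True
    return penalty, central_mismatch


def best_site(srna, mrna):
    """Return (start, penalty, central_mismatch) of the lowest-penalty window."""
    n, best = len(srna), None
    for start in range(len(mrna) - n + 1):
        penalty, central = score_site(srna, mrna[start:start + n])
        if best is None or penalty < best[1]:
            best = (start, penalty, central)
    return best


def predicted_mode(central_mismatch):
    """Heuristic: a central mismatch suggests inhibition rather than cleavage."""
    return "translational inhibition" if central_mismatch else "cleavage"
```

A fuller implementation would allow gaps and bulges and would combine the penalty with an accessibility term before ranking candidate sites.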
psRNATarget is available via a Web portal, working with an efficient computing back-end pipeline that parallelizes processing on a Linux cluster. TAPIR is another popular tool that follows principles similar to those used in psRNATarget. It allows a fast search using FASTA and, for more precise results, uses RNAhybrid [111]. Targetfinder [63] and Target-align [64] are counterparts that fall in the same category. PsRobot [65] is an interesting example of how to take advantage of the large amount of deep sequencing data currently available in a meta-analysis. Its core includes a modified Smith–Waterman algorithm and a simple scoring methodology to search for candidate targets. The user is offered extra information about the predicted targets, such as their conservation across species, degradome profiles and target expression in diverse sRNA-related mutants, which can help to judge the reliability of the predictions. Machine learning principles have also been used to predict PTS targets: p-TAREF [66] explores dinucleotide variation around the sRNA-target sites using support vector regression, and microRNA-Target [67] implements a PCA-SVM classifier that uses multiple sequence, structure and thermodynamic features to characterize miRNA–target interaction. More recently, a new breed of tools, including PlantMirnaT [68], explores deep sequencing miRNA and mRNA expression profiles to identify condition-specific miRNA-mRNA target pairs. Unfortunately, the methods developed to date for PTS target prediction still suffer from relatively high false-positive rates [64, 112], and inconsistent results across platforms are common. This has spurred the development of pipelines that integrate several of these algorithms to obtain consensus predictions. For example, MTide [69] combines degradome analysis by CleaveLand, target prediction by TAPIR and miRNA prediction using a plant-adapted version of miRDeep2 [113] with a set of rules to determine miRNA–target interactions.
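The idea of a consensus prediction can be illustrated with a minimal Python sketch that keeps only the sRNA-target pairs reported by at least a given number of predictors. It does not reproduce the rules used by MTide or by any other pipeline; the function name, the input layout and the tool labels in the usage example are hypothetical.

```python
from collections import Counter


def consensus_targets(predictions, min_support=2):
    """Keep (sRNA, target) pairs reported by at least `min_support` tools.

    predictions: dict mapping a tool label to an iterable of
    (sRNA, target) pairs. Returns a dict of pair -> number of tools.
    """
    votes = Counter()
    for pairs in predictions.values():
        votes.update(set(pairs))  # each tool votes once per pair
    return {pair: n for pair, n in votes.items() if n >= min_support}


# Usage with made-up labels and identifiers.
example = {
    "tool_a": [("sRNA1", "geneX"), ("sRNA1", "geneY")],
    "tool_b": [("sRNA1", "geneX"), ("sRNA2", "geneZ")],
    "tool_c": [("sRNA1", "geneX"), ("sRNA1", "geneY")],
}
supported = consensus_targets(example, min_support=2)
```

Raising `min_support` trades sensitivity for fewer false positives, which is the motivation given in the text for integrating several algorithms.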
Other platforms that combine multiple software packages to perform a target-centered analysis include sPARTA [70] and imiRTP [71]. We argue that PTS target prediction could be further improved by considering additional biological criteria, such as the capacity of sRNA to load into specific classes of AGO proteins that are known to be required for PTS. Sequence features of sRNA that predict AGO loading have recently been obtained from machine learning approaches applied to AGO-IP sRNA-seq libraries [114]. This information could complement the abovementioned computational tools for PTS target prediction.
Methods for TS target prediction
There is currently no computational tool in the public domain to predict transcriptional silencing targets from genomic data. This kind of inference is still at an early stage of development and is typically based on indirect observations and assumptions about DNA/chromatin properties. For example, the presence/absence of methylation in the CHH sequence context, the correlation with the abundance of 24 nt sRNA mapping in the vicinity of these marks and variations in the concentration of mRNA from the candidate target are used as proxies to predict RdDM targets [97, 101, 102, 105].
Future perspectives
The biology of sRNA is complex and poses numerous computational challenges. The computational categorization of sRNA is far from being solved. sRNA prediction based on sequencing data is either inaccurate or lacks dedicated tools altogether. Although less attention has been paid to plants than to animals, algorithms for predicting various aspects of sRNA biogenesis and function in plants can be found dispersed over the internet. These are mostly individual modules, making sRNA cataloging a hard task for nonspecialists. Future work should focus on incorporating existing tools into a unifying framework. This would aid in the automation of sRNA analysis, and shift focus away from the assembly of pipelines to their applications.
Currently, the majority of tools focus on miRNA, although hc-siRNAs are by far the most numerous. This bias most likely arises because miRNAs have well-defined structural features in comparison with other sRNA categories. In addition, miRNAs are easily validated experimentally, which helps in calibrating computational algorithms for miRNA detection and prediction. The investment in proper software should coevolve with experimental procedures for acquiring sRNA data. This development is necessary to maximize the knowledge that can be extracted from such data.
Key Points
Characterizing sRNA in terms of their biogenesis and function is essential for understanding regulatory mechanisms underlying plant development and adaptation.
Deep sequencing data of total sRNA indicate that a large fraction of sRNA sequences remains to be catalogued.
Numerous computational algorithms have been developed to facilitate the detection and categorization of plant sRNA, but existing software is mostly dedicated to miRNA.
By adapting existing software in combination with public data sources, it is possible to craft more accurate and automated in silico tools applicable to a wider spectrum of sRNA classes.
Funding
Lionel Morgado acknowledges support from the University of Groningen. Frank Johannes acknowledges support from the Technical University of Munich-Institute for Advanced Study funded by the German Excellence Initiative and the European Union Seventh Framework Programme under grant agreement #291763.
Lionel Morgado is a PhD candidate at the Groningen Bioinformatics Centre (University of Groningen, The Netherlands). The focus of his research is the development of high-throughput computational methods to study small RNAs in plants.
Frank Johannes is an assistant professor for population epigenetics and epigenomics at the Technical University of Munich.
He combines bioinformatic and statistical genetic approaches with high-throughput molecular data to characterize patterns of epigenetic variation in populations of plants. The goal is to understand how this variation arises, how stable it is across generations and to what extent it determines agriculturally and evolutionarily relevant plant traits. www.johanneslab.org.
References
1 Bernstein BE, Birney E, Dunham I, et al. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489:57–74.
2 Mirouze M. The small RNA-based odyssey of epigenetic information in plants: from cells to species. DNA Cell Biol 2012;12:1650–6. doi:10.1089/dna.2012.1681.
3 Axtell MJ. Classification and comparison of small RNAs from plants. Annu Rev Plant Biol 2013;64:137–59.
4 Borges F, Martienssen RA. The expanding world of small RNAs in plants. Nat Rev Mol Cell Biol 2015;16:727–41.
5 Osato N, Suzuki Y, Ikeo K, et al. Transcriptional interferences in cis natural antisense transcripts of humans and mice. Genetics 2007;176(2):1299–306. doi:10.1534/genetics.106.069484.
6 Rueda A, Barturen G, Lebrón R, et al. sRNAtoolbox: an integrated collection of small RNA research tools. Nucl Acids Res 2015;43:W467–73. doi:10.1093/nar/gkv555.
7 Stocks MB, Moxon S, Mapleson D, et al. The UEA sRNA workbench: a suite of tools for analysing and visualizing next generation sequencing microRNA and small RNA datasets. Bioinformatics 2012;28:2059–61.
8 Müller S, Rycak L, Winter P, et al. omiRas: a Web server for differential expression analysis of miRNAs derived from small RNA-Seq data.
Bioinformatics 2013;29(20):2651–2.
9 Patra D, Fasold M, Lagenberger D, et al. plantDARIO: web based quantitative and qualitative analysis of small RNA-seq data in plants. Front Plant Sci 2014;5:708.
10 Chen CJ, Servant N, Toedling J, et al. ncPRO-seq: a tool for annotation and profiling of ncRNAs in sRNA-seq data. Bioinformatics 2012;28:3147–9.
11 Icay K, Chen P, Cervera A, et al. SePIA: RNA and small RNA sequence processing, integration, and analysis. BioData Min 2016;9(1):20.
12 Wan C, Gao J, Ban R, et al. CPSS 2.0: a computational platform update for the analysis of small RNA sequencing data. Bioinformatics 2017, in press. doi:10.1093/bioinformatics/btx066.
13 Pantano L, Estivill X, Marti E. SeqBuster, a bioinformatic tool for the processing and analysis of small RNAs datasets, reveals ubiquitous miRNA modifications in human embryonic cells. Nucl Acids Res 2010;38:e34.
14 Zhao S, Gordon W, Du S, et al. QuickMIRSeq: a pipeline for quick and accurate quantification of both known miRNAs and isomiRs by jointly processing multiple samples from microRNA sequencing. BMC Bioinformatics 2017;18:180.
15 Muller H, Marzi MJ, Nicassio F. IsomiRage: from functional classification to differential expression of miRNA isoforms. Front Bioeng Biotechnol 2014;2:38.
16 Barturen G, Rueda A, Hamberg M, et al. sRNAbench: profiling of small RNAs and its sequence variants in single or multi-species high-throughput experiments. Methods Next Gen Seq 2014;43:W467–73.
17 Sablok G, Milev I, Minkov G, et al. isomiRex: web-based identification of microRNAs, isomiR variations and differential expression using next-generation sequencing datasets. FEBS Lett 2013;587:2629–34.
18 De Oliveira LF, Christoff AP, Margis R. isomiRID: a framework to identify microRNA isoforms. Bioinformatics 2013;29:2521–3.
19 Hofacker IL. Vienna RNA secondary structure server. Nucl Acids Res 2003;31:3429–31.
20 Markham NR, Zuker M. UNAFold: software for nucleic acid folding and hybridization. Methods Mol Biol 2008;453:3–31.
21 Higashi S, Fournier C, Gautier C, et al. Mirinho: an efficient and general plant and animal pre-miRNA predictor for genomic and deep sequencing data. BMC Bioinformatics 2015;16:179.
22 Tav C, Tempel S, Poligny L, et al. miRNAFold: a Web server for fast miRNA precursor prediction in genomes. Nucleic Acids Res 2016;44(W1):W181–4.
23 Bonnet E, Wuyts J, Rouze P, et al. Detection of 91 potential conserved plant microRNAs in Arabidopsis thaliana and Oryza sativa identifies important target genes. Proc Natl Acad Sci USA 2004;101:11511–16.
24 Jones-Rhoades MW, Bartel DP. Computational identification of plant microRNAs and their targets, including a stress-induced miRNA. Mol Cell 2004;14(6):787–99.
25 Dezulian T, Remmert M, Palatnik JF, et al. Identification of plant microRNA homologs. Bioinformatics 2006;22:359–60.
26 Lindow M, Krogh A.
Computational evidence for hundreds of non-conserved plant microRNAs. BMC Genomics 2005;6:119–27.
27 Milev I, Yahubyan G, Minkov I, et al. miRTour: plant miRNA and target prediction tool. Bioinformation 2011;6:248–9.
28 Numnark S, Mhuantong W, Ingsriswang S, et al. C-mii: a tool for plant miRNA and target identification. BMC Genomics 2012;13:S16.
29 Xie F, Xiao P, Chen D, et al. miRDeepFinder: a miRNA analysis tool for deep sequencing of plant small RNAs. Plant Mol Biol 2012;80:75–84.
30 Xuan P, Guo M, Liu X, et al. PlantMiRNAPred: efficient classification of real and pseudo plant pre-miRNAs. Bioinformatics 2011;27:1368–76.
31 Yao Y, Ma C, Deng H, et al. plantMirP: an efficient computational program for the prediction of plant pre-miRNA by incorporating knowledge-based energy features. Mol Biosyst 2016;12:3124–31.
32 Teune JH, Steger G. NOVOMIR: de novo prediction of MicroRNA-coding regions in a single plant-genome. J Nucleic Acids 2010;2010:1–10. doi:10.4061/2010/495904 pmid:20871826.
33 Gudys A, Szczesniak M, Sikora M, et al. HuntMi: an efficient and taxon-specific approach in pre-miRNA identification. BMC Bioinformatics 2013;14:83.
34 Williams PH, Eyles R, Weiller G. Plant microRNA prediction by supervised machine learning using C5.0 decision trees. J Nucleic Acids 2012;2012:652979.
35 Thieme CJ, Gramzow L, Lobbes D, et al. SplamiR–prediction of spliced miRNAs in plants. Bioinformatics 2011;27:1215–23.
36 Meng J, Liu D, Sun C, et al. Prediction of plant pre-microRNAs and their microRNAs in genome-scale sequences using structure-sequence features and support vector machine. BMC Bioinformatics 2014;15:6595.
37 Wu Y, Wei B, Liu H, et al. MiRPara: a SVM-based software tool for prediction of most probable microRNA coding regions in genome scale sequences. BMC Bioinformatics 2011;12:107.
38 Leclercq M, Diallo AB, Blanchette M. Computational prediction of the localization of microRNAs within their pre-miRNA. Nucleic Acids Res 2013;41:7200–11.
39 Karathanasis N, Tsamardinos I, Poirazi P. MiRduplexSVM: a high-performing miRNA-duplex prediction and evaluation methodology. PLoS One 2015;10:e0126151.
40 Xuan P, Guo M, Huang Y, et al. MaturePred: efficient identification of MicroRNAs within novel plant pre-miRNAs. PLoS One 2011;6:e27422.
41 Cui H, Zhai J, Ma C. MiRLocator: machine learning-based prediction of mature MicroRNAs within plant pre-miRNA sequences. PLoS One 2015;10:e0142753.
42 Axtell MJ. ShortStack: comprehensive annotation and quantification of small RNA genes. RNA 2013;19:740–51. doi:10.1261/rna.035279.112.
43 Yang X, Li L. miRDeep-P: a computational tool for analyzing the microRNA transcriptome in plants. Bioinformatics 2011;27(18):2614–15. doi:10.1093/bioinformatics/btr430.
44 An J, Lai J, Sajjanhar A, et al. miRPlant: an integrated tool for identification of plant miRNA from RNA sequencing data. BMC Bioinformatics 2014;15:275.
45 Evers M, Huttner M, Dueck A, et al. miRA: adaptable novel miRNA identification in plants using small RNA sequencing data. BMC Bioinformatics 2015;16:370.
46 Breakfield NW, Corcoran DL, Petricka JJ, et al. High-resolution experimental and computational profiling of tissue-specific known and novel miRNAs in Arabidopsis. Genome Res 2012;22(1):163–76.
47 Lei J, Sun Y. miR-PREFeR: an accurate, fast and easy-to-use plant miRNA prediction tool using small RNA-Seq data. Bioinformatics 2014;30:2837–9.
48 Paicu C, Mohorianu I, Stocks MB, et al. miRCat2: accurate prediction of plant and animal microRNAs from next-generation sequencing datasets. Bioinformatics 2017;33:2446–54.
49 Ashwani J, Shankar R. miReader: discovering novel miRNAs in species without sequenced genome. PLoS One 2013;8(6):e66857.
50 Dai X, Zhao PX. pssRNAMiner: a plant short small RNA regulatory cascade analysis server. Nucl Acids Res 2008;36(2):W114–18. doi:10.1093/nar/gkn297.
51 Gupta V, Markmann K, Pedersen CN, et al. Shortran: a pipeline for small RNA-seq data analysis. Bioinformatics 2012;28:2698–700.
52 Zhang C, Li G, Zhu S, et al. tasiRNAdb: a database of ta-siRNA regulatory pathways. Bioinformatics 2014;30(7):1045–6. doi:10.1093/bioinformatics/btt746.
53 Guo Q, Qu X, Weibo J. PhaseTank: genome-wide computational identification of phasiRNAs and their regulatory cascades. Bioinformatics 2015;31:284–6. doi:10.1093/bioinformatics/btu628. pmid:25246430.
54 Li S, Liberman L, Mukherjee N, et al. Integrated detection of natural antisense transcripts using strand-specific RNA sequencing data. Genome Res 2013;23:1730–9. doi:10.1101/gr.149310.112.
55 Yu D, Meng Y, Zuo Z, et al. NATpipe: an integrative pipeline for systematical discovery of natural antisense transcripts (NATs) and phase-distributed nat-siRNAs from de novo assembled transcriptomes. Sci Rep 2016;6:21666. http://dx.doi.org/10.1038/srep21666.
56 Brousse C, Liu Q, Beuclair L, et al. A non-canonical plant microRNA target site. Nucleic Acids Res 2014;42(8):5270–9. doi:10.1093/nar/gku157.
57 Folkes L, Moxon S, Woolfenden HC, et al. PAREsnip: a tool for rapid genome-wide discovery of small RNA/target interactions evidenced through degradome sequencing. Nucl Acids Res 2012;40(13):e103. doi:10.1093/nar/gks277.
58 Li F, Orban R, Baker B. SoMART: a webserver for plant miRNA, tasiRNA and target gene analysis. Plant J 2012;70:891–901.
59 Zheng Y, Li YF, Sunkar R, et al. SeqTar: an effective method for identifying microRNA guided cleavage sites from degradome of polyadenylated transcripts in plants. Nucl Acids Res 2012;40:e28.
60 Yu L, Shao C, Ye X, et al. miRNA Digger: a comprehensive pipeline for genome-wide novel miRNA mining. Sci Rep 2016;6:18901. doi:10.1038/srep18901.
61 Dai X, Zhao PX. psRNATarget: a plant small RNA target analysis server. Nucl Acids Res 2011;39:W155–9. doi:10.1093/nar/GKR319.
62 Bonnet E, He Y, Billiau K, et al.
TAPIR, a web server for the prediction of plant microRNA targets, including target mimics. Bioinformatics 2010;26:1566–8.
63 Fahlgren N, Carrington JC. miRNA target prediction in plants. Methods Mol Biol 2010;592:51–7.
64 Xie F, Zhang B. Target-align: a tool for plant microRNA target identification. Bioinformatics 2010;26:3002–3.
65 Wu HJ, Ma YK, Chen T, et al. PsRobot: a web-based plant small RNA meta-analysis toolbox. Nucl Acids Res 2012;40:W22–8.
66 Jha A, Shankar R. Employing machine learning for reliable miRNA target identification in plants. BMC Genomics 2011;12(1):636.
67 Meng J, Shi L, Luan Y. Plant microRNA-target interaction identification model based on the integration of prediction tools and support vector machine. PLoS One 2014;9(7):e103181.
68 Rhee S, Chae H, Kim S. PlantMirnaT: miRNA and mRNA integrated analysis fully utilizing characteristics of plant sequencing data. Methods 2015;83:80–7.
69 Zhang Z, Jiang L, Wang J, et al. MTide: an integrated tool for the identification of miRNA–target interaction in plants. Bioinformatics 2015;31(2):290–1. doi:10.1093/bioinformatics/btu633.
70 Kakrana A, Hammond R, Patel P, et al. sPARTA: a parallelized pipeline for integrated analysis of plant miRNA and cleaved mRNA data sets, including new miRNA target-identification software. Nucleic Acids Res 2015;42:e139.
71 Ding J, Yu S, Ohler U, et al. imiRTP: an integrated method to identifying miRNA-target interactions in Arabidopsis thaliana.
In: IEEE International Conference on Bioinformatics and Biomedicine. 2011, Atlanta, GA, USA: IEEE, pp. 100–4.
72 Kozomara A, Griffiths-Jones S. miRBase: integrating microRNA annotation and deep-sequencing data. Nucl Acids Res 2011;39(1):D152–7. doi:10.1093/nar/gkq1027.
73 Chen D, Yuan C, Zhang J, et al. PlantNATsDB: a comprehensive database of plant natural antisense transcripts. Nucl Acids Res 2012;40(1):D1187–93.
74 Altschul S, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol 1990;215(3):403–10. doi:10.1016/S0022-2836(05)80360-2.
75 Iida K, Jin H, Zhu JK. Bioinformatics analysis suggests base modification of tRNA and miRNA in Arabidopsis thaliana. BMC Genomics 2009;10:155.
76 Ebhardt HA, Tsang HH, Dai DC, et al. Meta-analysis of small RNA-sequencing errors reveals ubiquitous post-transcriptional RNA modifications. Nucl Acids Res 2009;37:2461–70.
77 Gardner PP, Daub J, Tate JG, et al. Rfam: updates to the RNA families database. Nucl Acids Res 2009;37:D136–40.
78 Meyers BC, Axtell MJ, Bartel B, et al. Criteria for annotation of plant microRNAs. Plant Cell 2008;20:3186–90.
79 Wang Y, Li H, Sun Q, et al. Characterization of small RNAs derived from tRNAs, rRNAs and snoRNAs and their response to heat stress in wheat seedlings. PLoS One 2016;11:e0150933.
80 Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25:1754–60.
81 Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25. doi:10.1186/gb-2009-10-3-r25.
82 Zhang B, Pan X, Cannon CH, et al. Conservation and divergence of plant microRNA genes. Plant J 2006;46(2):243–59.
83 Mendes ND, Freitas AT, Sagot MF. Current tools for the identification of miRNA genes and their targets. Nucl Acids Res 2009;37:2419–33.
84 Gomes CPC, Cho JH, Hood L, et al. A review of computational tools in microRNA discovery. Front Genet 2013;4:81. doi:10.3389/fgene.2013.00081.
85 Xue C, Li F, He T, et al. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics 2005;6:310–16.
86 Batuwita R, Palade V. MicroPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 2009;25:989–95.
87 Kozomara A, Griffiths-Jones S. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res 2014;42:D68–73.
88 Friedlander MR, Chen W, Adamidi C, et al. Discovering microRNAs from deep sequencing data using miRDeep. Nat Biotechnol 2008;26:407–15.
89 Chen HM, Li YH, Wu SH. Bioinformatic prediction and experimental validation of a microRNA-directed tandem trans-acting siRNA cascade in Arabidopsis. Proc Natl Acad Sci USA 2007;104(9):3318–23.
Google Scholar Crossref Search ADS PubMed WorldCat 90 Moxon S , Schwach F, Dalmay T, et al. A toolkit for analyzing large-scale plant small RNA datasets . Bioinformatics 2008 ; 24 ( 19 ): 2252 – 3 . doi:10.1093/bioinformatics/btn428. Google Scholar Crossref Search ADS PubMed WorldCat 91 Wang XJ , Gaasterland T, Chua NH. Genome-wide prediction and identification of cis-natural antisense transcripts in Arabidopsis thaliana . Genome Biol 2005 ; 6 : R30 . doi:10.1186/gb-2005-6-4-r30. Google Scholar Crossref Search ADS PubMed WorldCat 92 Lavorgna G , Dahary D, Lehner B, et al. In search of antisense . Trends Biochem Sci 2004 ; 29 ( 2 ): 88 – 94 . Google Scholar Crossref Search ADS PubMed WorldCat 93 Osato N , Yamada H, Satoh K, et al. Antisense transcripts with rice full-length cDNAs . Genome Biol 2003 ; 5 : R5 . Google Scholar Crossref Search ADS PubMed WorldCat 94 Zhou X , Sunkar R, Jin H, et al. Genome-wide identification and analysis of small RNAs originated from natural antisense transcripts in Oryza sativa . Genome Res 2009 ; 19 ( 1 ): 70 – 8 . Google Scholar Crossref Search ADS PubMed WorldCat 95 Jen CH , Michalopoulos I, Westhead DR, et al. Natural antisense transcripts with coding capacity in Arabidopsis may have a regulatory role that is not linked to double-stranded RNA degradation . Genome Biol 2005 ; 6 : R51 . Google Scholar Crossref Search ADS PubMed WorldCat 96 Tafer H , Hofacker IL. RNAplex: a fast tool for RNA-RNA interaction search . Bioinformatics 2008 ; 24 ( 22 ): 2657 – 63 . doi: 10.1093/bioinformatics/btn193. Google Scholar Crossref Search ADS PubMed WorldCat 97 McCue A , Nuthikattu S, Reeder SH, et al. Gene expression and stress response mediated by the epigenetic regulation of a transposable element small RNA . PLoS Genet 2012 ; 8 : e1002474 . Google Scholar Crossref Search ADS PubMed WorldCat 98 Nuthikattu S , McCue A, Panda K, et al. 
The initiation of epigenetic silencing of active transposable elements is triggered by RDR6 and 21-22 nucleotide small interfering RNAs . Plant Physiol 2013 ; 162 : 116 – 31 . Google Scholar Crossref Search ADS PubMed WorldCat 99 Stroud H , Greenberg MVC, Feng S, et al. Comprehensive analysis of silencing mutants reveals complex regulation of the Arabidopsis methylome . Cell 2013 ; 152 : 352 – 64 . Google Scholar Crossref Search ADS PubMed WorldCat 100 Wu L , Zhou H, Zhang Q, et al. DNA methylation mediated by a microRNA pathway . Mol Cell 2010 ; 38 : 465 – 75 . Google Scholar Crossref Search ADS PubMed WorldCat 101 Mari-Ordonez A , Marchais A, Etcheverry M, et al. Reconstructing de novo silencing of an active plant retrotransposon . Nat Genet 2013 ; 45 : 1029 – 39 . Google Scholar Crossref Search ADS PubMed WorldCat 102 Zhang Q , Wang D, Lang Z, et al. Methylation interactions in Arabidopsis hybrids require RNA-directed DNA methylation and are influenced by genetic variation . Proc Natl Acad Sci USA 2016 ; 113 ( 29 ): E4248 – 56 . doi:10.1073/pnas.1607851113. Google Scholar Crossref Search ADS PubMed WorldCat 103 Lister R , O’Malley RC, Tonti-Filippini J, et al. Highly integrated single-base resolution maps of the epigenome in Arabidopsis . Cell 2008 ; 133 : 523 – 36 . Google Scholar Crossref Search ADS PubMed WorldCat 104 Li X , Wang X, He K, et al. High-resolution mapping of epigenetic modifications of the rice genome uncovers interplay between DNA methylation, histone methylation, and gene expression . Plant Cell 2008 ; 20 : 259 – 76 . Google Scholar Crossref Search ADS PubMed WorldCat 105 Creasey KM , Zhai J, Borges F, et al. MiRNAs trigger widespread epigenetically activated siRNAs from transposons in Arabidopsis . Nature 2014 ; 508 : 411 – 15 . Google Scholar Crossref Search ADS PubMed WorldCat 106 Llave C , Kasschau KD, Rector MA, et al. Endogenous and silencing-associated small RNAs in plants . Plant Cell 2002 ; 14 : 1605 – 19 . 
Google Scholar Crossref Search ADS PubMed WorldCat 107 Brodersen P , Sakvarelidze-Achard L, Bruun-Rasmussen M, et al. Widespread translational inhibition by plant miRNAs and siRNAs . Science 2008 ; 320 ( 5880 ): 1185 – 90 . Google Scholar Crossref Search ADS PubMed WorldCat 108 Iwakawa HO , Tomari Y. Molecular insights into microRNA-mediated translational repression in plants . Mol Cell 2013 ; 52 : 591 – 601 . Google Scholar Crossref Search ADS PubMed WorldCat 109 Smith TF , Waterman MS. Identification of common molecular subsequences . J Mol Biol 1981 ; 147 : 195 – 7 . Google Scholar Crossref Search ADS PubMed WorldCat 110 Lorenz R , Bernhart SH, Hoener zu Siederdissen C, et al. ViennaRNA package 2.0 . Algorithms Mol Biol 2011 ; 6 : 26. doi:10.1186/1748-7188-6-26. Google Scholar Crossref Search ADS PubMed WorldCat 111 Kruger J , Rehmsmeier M. RNAhybrid: microRNA target prediction easy, fast and flexible . Nucl Acids Res 2006 ; 34 : W451 – 4 . Google Scholar Crossref Search ADS PubMed WorldCat 112 Srivastava PK , Moturu TR, Pandey P, et al. A comparison of performance of plant miRNA target prediction tools and the characterization of features for genome-wide target prediction . BMC Genomics 2014 ; 15 : 348 . doi:10.1186/1471-2164-15-348. Google Scholar Crossref Search ADS PubMed WorldCat 113 Friedlander MR , Mackowiak SD, Li N, et al. miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades . Nucl Acids Res 2012 ; 40 : 37 – 52 . Google Scholar Crossref Search ADS PubMed WorldCat 114 Morgado L , Jansen RC, Johannes F. Learning sequence patterns of AGO-sRNA affinity from high-throughput sequencing libraries to improve in silico functional small RNA detection and classification in plants . bioRxiv 2107 : 173575 . doi: https://doi.org/10.1101/173575 Google Scholar OpenURL Placeholder Text WorldCat © The Author 2017. Published by Oxford University Press. 
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
journal article
LitStream Collection
Systematic review of computational methods for identifying miRNA-mediated RNA-RNA crosstalk

Li, Yongsheng; Jin, Xiyun; Wang, Zishan; Li, Lili; Chen, Hong; Lin, Xiaoyu; Yi, Song; Zhang, Yunpeng; Xu, Juan

2019 Briefings in Bioinformatics

doi: 10.1093/bib/bbx137 pmid: 29077860

Abstract Posttranscriptional crosstalk and communication between RNAs yield large regulatory competing endogenous RNA (ceRNA) networks via shared microRNAs (miRNAs), as well as miRNA synergistic networks. The ceRNA crosstalk represents a novel layer of gene regulation that controls both physiological and pathological processes such as development and complex diseases. The rapidly expanding catalogue of ceRNA regulation has provided evidence for exploitation as a general model to predict the ceRNAs in silico. In this article, we first reviewed the current progress of RNA-RNA crosstalk in human complex diseases. Then, the widely used computational methods for modeling ceRNA-ceRNA interaction networks are further summarized into five types: two types of global ceRNA regulation prediction methods and three types of context-specific prediction methods, which are based on miRNA-messenger RNA regulation alone, or by integrating heterogeneous data, respectively. To provide guidance in the computational prediction of ceRNA-ceRNA interactions, we finally performed a comparative study of different combinations of miRNA–target methods as well as five types of ceRNA identification methods by using literature-curated ceRNA regulation and gene perturbation. The results revealed that integration of different miRNA–target prediction methods and context-specific miRNA/gene expression profiles increased the performance for identifying ceRNA regulation. Moreover, different computational methods were complementary in identifying ceRNA regulation and captured different functional parts of similar pathways. We believe that the application of these computational techniques provides valuable functional insights into ceRNA regulation and is a crucial step for informing subsequent functional validation studies. 
miRNA–target identification, ceRNA regulation, computational methods, ensemble method, crosstalk

Introduction

MicroRNAs (miRNAs) are endogenous small noncoding RNAs that regulate gene expression by binding to target RNA transcripts [1, 2]. Emerging knowledge of how miRNAs regulate gene expression has dramatically altered our view of how target genes are controlled. Our knowledge of miRNA functions has been greatly expanded by recent advances in next-generation sequencing, including genome-wide identification of miRNA–gene regulation [3], the application of RNA sequencing [4], the increasing availability of paired miRNA and gene expression data in various complex diseases (such as the TCGA [5] and ICGC [6] projects), and the recognition of miRNA-miRNA synergistic regulation and of competing endogenous RNA (ceRNA) regulation mediated by miRNAs. Mounting evidence indicates that ∼60% of protein-coding genes are regulated by miRNAs [7, 8]. Moreover, a gene can be regulated by more than one miRNA. The functional complexity of miRNAs and the cooperative regulation among them challenge our ability to comprehensively understand miRNA functions in complex diseases [9]. In addition to protein-coding genes, an increasing number of other RNA types have been found to be regulated by miRNAs (Figure 1A), such as long noncoding RNAs (lncRNAs), expressed 3′-untranslated regions (UTRs), pseudogenes and circular RNAs (circRNAs). These complex miRNA-RNA interactions form miRNA-RNA regulatory networks. A comprehensive understanding of miRNA functions in complex diseases will be further aided by analysis of the structure of these networks [10–12].

Figure 1 MiRNA-mediated RNA-RNA crosstalk. (A) The ceRNA model involves different types of RNAs, including coding genes, expressed 3′-UTRs, pseudogenes, lncRNAs and circRNAs. MiRNA-miRNA crosstalk and ceRNA-ceRNA crosstalk play critical roles in human diseases. MiRNA-miRNA synergistic regulation is identified by coregulation and by functional and disease similarity. CeRNA-ceRNA regulation is usually identified based on two principles: regulatory similarity and expression similarity. (B) The flowchart commonly used to identify ceRNA regulation. First, miRNA–target interactions are determined by integrating computational methods and AGO-CLIP-Seq data sets. Second, RNA pairs that are coregulated by miRNAs are identified by ratio or hypergeometric test. Third, expression similarity is evaluated based on genome-wide expression profiles. (C) The aim of this review is to evaluate the performance of different miRNA–gene interaction prediction methods and ceRNA regulation identification methods based on literature-curated ceRNAs.
In this review, we first discuss miRNA synergistic regulation and ceRNA regulation in complex diseases, and highlight ceRNA regulatory network-based methods for understanding miRNA functions. We then summarize recent computational methods for identifying miRNA-mediated ceRNA regulation in complex diseases. Based on literature-curated PTEN-related ceRNA regulation, we systematically compare these methods and give some directions for choosing suitable ones.

MiRNA-miRNA crosstalk in complex diseases

The majority of genes are targeted by more than one miRNA. In addition, approximately two-thirds of miRNAs are encoded in polycistronic clusters [13]. These miRNAs are cotranscribed with their cluster partners, which suggests that they act collectively as cooperative functional units. Such coexpressed miRNAs tend to target the same genes or to regulate different genes in the same functional pathway [14, 15]. These observations support the existence of miRNA-miRNA crosstalk in complex diseases. However, experimental identification of such miRNA cooperation faces many bottlenecks, such as lengthy experimental periods, the need for large amounts of equipment and the large number of possible miRNA combinations. Thus, computational methods have been proposed to identify global or context-specific miRNA-miRNA crosstalk (Figure 1A). We recently reviewed these methods and found that the global methods were mainly based on genomic sequence information, chromatin interaction, miRNA coregulation, coregulation of functional modules or similarity of associated diseases (phenomic similarity) [9, 16]. Given that miRNA synergistic regulation is often reprogrammed in different tissues, or at different developmental stages within the same tissue [11], context-specific miRNA-miRNA networks may provide a better representation.
Context-specific miRNA synergistic regulation is mainly inferred from paired miRNA and gene expression profiles [17]. With the growing number of omics data sets for complex diseases, our understanding of miRNA synergistic regulation is expected to improve greatly.

MiRNA-mediated ceRNA-ceRNA crosstalk in complex diseases

In addition to conventional miRNA-RNA regulation, a growing number of studies have shown that regulation between the miRNA seed region and messenger RNA (mRNA) is not unidirectional; rather, the RNAs in the pool can crosstalk with each other by competing for miRNA binding [18–20]. These ceRNAs act as molecular sponges for a miRNA through their miRNA response elements (MRE), thereby regulating other target genes of the respective miRNAs. Understanding this novel type of RNA crosstalk will lead to significant insights into regulatory networks and has implications for human cancer development and other complex diseases [21, 22]. The ceRNA hypothesis has gained substantial attention, and various types of RNAs, including lncRNAs, pseudogenes and circRNAs, as well as mRNAs, have been shown to act as ceRNAs (Figure 1A). The pseudogene PTENP1 was shown to regulate the expression of its cognate gene PTEN by competing for miRNAs [23]. A lncRNA (linc-MD1) acting as a miRNA sponge was first demonstrated in muscle differentiation [24]. Moreover, linc-RoR was shown to function as a miRNA sponge that protects OCT4, SOX2 and NANOG from miRNA-mediated suppression by competing for miR-145 [25]. It was also shown that the pseudogene KRAS1P can function via a miRNA sponge mechanism [26]. CircRNAs are another recently identified type of noncoding ceRNA, and a growing number of circRNAs (such as CDR1as) have been validated as miRNA sponges [27]. An increasing number of studies have attempted to explore ceRNA-ceRNA regulation in specific cancer types.
However, the majority of these studies focused on the properties of individual ceRNA interactions in a specific cancer type and lacked a global view of the system-level properties of ceRNA regulation across cancer types. To address these needs, we performed a systematic analysis of 5203 cancer samples from 20 cancer types to discover mRNA-related ceRNA regulation [28]. That study highlighted conserved features shared across cancer types and higher similarity among cell types of similar origin. We also found a marked rewiring of the ceRNA program between cancers and cancer subtypes [21], and further revealed conserved and rewired ceRNA network hubs in cancer. Moreover, Wang et al. [29] systematically identified 5119 functional lncRNA-associated triplets (lncACTs) through an integrated pipeline, from which a comprehensive lncACT crosstalk network was constructed. In addition, Chiu et al. [30] introduced a method for simultaneous prediction of miRNA–target interactions and miRNA-mediated ceRNA interactions in breast cancer. Zhou et al. [31] identified breast cancer-specific ceRNA networks by integrating miRNA regulation with paired miRNA/mRNA expression data. Le et al. [32] summarized several computational methods for identifying ceRNA regulation in complex diseases. Together, these studies show that research into miRNA sponges is growing, and that identifying ceRNA regulation is an important route to understanding miRNA functions in complex diseases.

MiRNA-mediated ceRNA-ceRNA dysregulation network in complex diseases

CeRNA interaction activity may change between normal and cancer samples: for example, some ceRNA interactions show competing activity in normal but not in cancer samples, and vice versa. In addition, some ceRNA interactions show competing activity in both normal and cancer contexts, but with opposite expression patterns.
In addition to investigating ceRNA regulation in cancer, exploring the ceRNA regulation that is deregulated in cancer compared with normal conditions helps us to systematically understand the mechanisms of cancer. Shao et al. [33] proposed a computational method to systematically identify genome-wide dysregulated ceRNA-ceRNA interactions in lung cancer. Paci et al. [34] also proposed a computational approach to explore lncRNA-mRNA ceRNA interactions in normal and breast cancer samples. Their results highlight a marked rewiring of the ceRNA interactions between normal and cancer samples, which was documented as an ‘on/off’ switch. Based on mutually exclusive activation, the lncRNA PVT1 was identified as a key lncRNA in breast cancer. Motivated by this study, Zhang et al. [35] systematically integrated multidimensional expression profiles of >5000 samples across 12 cancers to investigate the lncRNA-related ceRNA crosstalk networks in both tumor and normal physiological states. This analysis provided a comprehensive landscape of dysregulated ceRNAs across cancer types. Dysregulated ceRNA networks have also been used to identify clinically relevant biomarkers. For instance, Wang et al. [36] identified glioblastoma (GBM)-related lncRNA-miRNA-mRNA triplets from a differential ceRNA network between GBM and normal tissues. In summary, these studies offer a means of examining the differences in ceRNA interactions between normal and cancer contexts, and provide new tools for elucidating cancer processes as well as new targets for therapy.

Computational methods for identifying miRNA-mediated RNA-RNA crosstalks

Given their critical roles, miRNA-mediated RNA-RNA crosstalks have attracted growing attention from researchers. Several computational methods have emerged for discovering ceRNA-ceRNA interactions [32]. These methods fall mainly into three categories: pair-wise correlation approaches, partial association approaches and mathematical modeling approaches.
Central to the identification of miRNA-mediated ceRNA regulation is the identification of miRNA targets. A number of methods have been proposed over the past decade to identify miRNA targets through integration of sequence-based prediction, conservation, physical association and/or correlative gene expression (Figure 1B). Commonly used methods include TargetScan [37], miRanda [38], PicTar [39], PITA [40] and RNA22 [41]. Although much is known about the principles of miRNA target recognition and some miRNA targets can be predicted by these computational methods, much remains to be learned. Several lines of evidence indicate that the predicted target sets contain a high proportion of false positives. In addition, several biochemical methods have been developed to capture miRNA–target complexes on a global scale. MiRNAs and the targets they are regulating can be coprecipitated with Argonaute (AGO). AGO and miRNA immunoprecipitation or pulldown methodologies, such as AGO CLIP-Seq [42], HITS-CLIP [43] and PAR-CLIP [44], were proposed for genome-wide identification of miRNA targets. Collectively, these strategies offer a major advantage in identifying functional miRNA–target interactions. However, it remains unclear which method is better suited to identifying ceRNA-ceRNA regulation in complex diseases. After the miRNA–target regulation has been assembled, two principles are commonly used to identify miRNA-mediated ceRNA-ceRNA regulation [30, 45]. The central hypothesis of most computational methods is that ceRNA crosstalk increases with the miRNA regulatory similarity between mRNAs and with the strength of their coexpression in a specific context (Figure 1B). In this section, we review the widely used computational methods for identifying ceRNA regulation or miRNA sponge interactions.
Here, we consider five types of methods (Table 1): two types of global ceRNA regulation prediction methods (Figure 2A; ratio based, termed Ratio, and hypergeometric test based, termed HyperT) and three types of context-specific prediction methods [Figure 2A; hypergeometric test plus coexpression, termed HyperC, the sensitivity correlation-based method (SC) and conditional mutual information (CMI)-based methods].

Table 1 Summary of computational approaches for identifying miRNA-mediated RNA-RNA interaction networks

| Methods | Statistical methods | Brief description | Input data | P-value | Global or context-specific |
| --- | --- | --- | --- | --- | --- |
| Ratio | None | Genes are ranked by the proportion of coregulating miRNAs | miRNA–gene regulation | N | Global |
| HyperT | Hypergeometric test | Extracts significant gene pairs by checking, with the hypergeometric cumulative distribution test, whether they share a similar set of miRNAs | miRNA–gene regulation | Y | Global |
| HyperC | Hypergeometric test; correlation coefficient | Considers RNA-RNA pairs sharing a significant overlap of common miRNAs, and uses gene expression to identify the significantly positively coexpressed RNA pairs | miRNA–gene regulation; gene expression | Y | Context-specific |
| SC | Sensitivity correlation coefficient; random test | Integrates miRNA and mRNA expression profiles to compute the sensitivity correlation coefficient, and uses a random test to estimate the significance of the average SC compared with random conditions | miRNA–gene regulation; miRNA expression; gene expression | Y | Context-specific |
| CMI | CMI; random test | Combines paired miRNA and mRNA expression profiles with miRNA–gene regulation to estimate the statistical significance of the information divergence between the mutual information and the CMI, to identify miRNA sponge modulators | miRNA–gene regulation; miRNA expression; gene expression | Y | Context-specific |

Note: Y indicates that the method provides a significance level (P-value); N indicates that it does not.

Figure 2 The flowchart of methods for identifying miRNA-mediated ceRNA interactions. (A) The flowchart for identifying global and context-specific ceRNA regulation. (B) The flowchart for the ratio-based method. (C) The flowchart for the hypergeometric test-based method. (D) The flowchart for the combination of hypergeometric test and coexpression. (E) The flowchart for the SC-based method. (F) The flowchart for CMI-based methods.

Ratio-based prediction

Based on the hypothesis that ceRNA pairs are likely to be regulated by the same miRNAs, this method ranks the candidate genes by the proportion of common miRNAs (Figure 2B) [22].
For instance, to identify the ceRNA partners of gene i among a candidate gene set S, the ratio is calculated as:

$$R_j = \frac{|\mathrm{miRNA}_i \cap \mathrm{miRNA}_j|}{|\mathrm{miRNA}_j|}, \quad j \in S,$$

where $\mathrm{miRNA}_i$ is the set of miRNAs that regulate gene i and $\mathrm{miRNA}_j$ is the set of miRNAs that regulate gene j.

Hypergeometric test-based prediction-HyperT

In addition to ranking genes by the proportion of common miRNAs, the hypergeometric test is commonly used to evaluate whether two genes are significantly coregulated by miRNAs (Figure 2C). This statistical test computes the significance of the shared miRNAs for each RNA pair. The probability P is calculated as:

$$P = 1 - F(N_{XY}-1 \mid N, N_X, N_Y) = 1 - \sum_{t=0}^{N_{XY}-1} \frac{\binom{N_X}{t}\binom{N-N_X}{N_Y-t}}{\binom{N}{N_Y}},$$

where $N$ is the number of all miRNAs in the human genome, $N_X$ and $N_Y$ are the total numbers of miRNAs that regulate RNA X and RNA Y, respectively, and $N_{XY}$ is the number of miRNAs shared between RNA X and RNA Y. All P-values were subject to false discovery rate (FDR) correction, and RNAs were ranked by FDR value.

Hypergeometric test combined with coexpression-based prediction-HyperC

Next, to discover the ceRNA-ceRNA regulatory pairs that are active in a specific context, a common approach is to use the coexpression principle to filter the ceRNA-ceRNA regulation identified by the two global methods above [31, 46]. This method integrates context-specific gene expression profiles. The Pearson correlation coefficient (R) of each candidate ceRNA pair is calculated as:

$$R = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}},$$

where $x_i$ and $y_i$ are the expression levels of RNA X and RNA Y in sample i, $\bar{x}$ and $\bar{y}$ are the average expression levels of RNA X and RNA Y across all tumor samples, and n is the number of samples. In general, genes are first ranked separately by the hypergeometric P-value and by the correlation coefficient. The average of the two ranks is then used to rank the candidate genes (Figure 2D).
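The three scores described above (the ratio, the hypergeometric P-value and the Pearson correlation combined by average rank) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the cited studies: the function names and toy inputs are hypothetical, the FDR correction step is omitted, and ties in the ranking receive no special treatment.

```python
from math import comb, sqrt


def ratio_score(mirnas_i, mirnas_j):
    """R_j = |miRNA_i ∩ miRNA_j| / |miRNA_j| for a candidate gene j."""
    if not mirnas_j:
        return 0.0  # assumption: a gene with no known miRNAs scores zero
    return len(mirnas_i & mirnas_j) / len(mirnas_j)


def hypergeom_pvalue(n_total, n_x, n_y, n_xy):
    """Upper-tail probability of sharing at least n_xy miRNAs.

    Equal to 1 - F(n_xy - 1 | n_total, n_x, n_y); the tail is summed
    directly to avoid subtracting two nearly equal numbers.
    """
    upper = min(n_x, n_y)
    tail = sum(comb(n_x, t) * comb(n_total - n_x, n_y - t)
               for t in range(n_xy, upper + 1))
    return tail / comb(n_total, n_y)


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_rank(pvalues, correlations):
    """Order genes by the mean of their P-value rank (ascending)
    and their correlation rank (descending)."""
    by_p = {g: r for r, g in enumerate(sorted(pvalues, key=pvalues.get), 1)}
    by_r = {g: r for r, g in enumerate(
        sorted(correlations, key=correlations.get, reverse=True), 1)}
    return sorted(pvalues, key=lambda g: (by_p[g] + by_r[g]) / 2)


# Toy example with made-up miRNA names.
mirnas_i = {"miR-a", "miR-b", "miR-c"}
mirnas_j = {"miR-b", "miR-c", "miR-d", "miR-e"}
print(ratio_score(mirnas_i, mirnas_j))   # 0.5
print(hypergeom_pvalue(10, 3, 4, 2))     # P(at least 2 shared of 10 miRNAs)
```

Only the standard library is used here so that the sketch stays self-contained; a statistics package would normally supply the hypergeometric tail and the multiple-testing correction.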
SC-based prediction

Besides genome-wide gene expression profiles, miRNA expression profiles have also been integrated to identify ceRNA regulation in complex diseases. The common approach is to identify highly correlated RNA pairs whose correlation is attributable to the presence of one or more miRNAs (Figure 2E). Paci et al. [34] proposed SC and identified a sponge interaction network between lncRNAs and mRNAs in human breast cancer. Zhang et al. [35] systematically characterized the lncRNA-related ceRNA interactions across 12 major cancer types based on this method. For a candidate pair RNA-A and RNA-B and a coregulating miRNA-M, the partial correlation is calculated as:

$$R_{AB|M} = \frac{R_{AB} - R_{AM}R_{MB}}{\sqrt{1-R_{AM}^2}\,\sqrt{1-R_{MB}^2}},$$

where $R_{AB}$, $R_{AM}$ and $R_{MB}$ are the Pearson correlation coefficients between RNA-A and RNA-B, RNA-A and miRNA-M, and RNA-B and miRNA-M, respectively. The SC of miRNA-M for the candidate ceRNA pair, denoted S, is then calculated as:

$$S = R_{AB} - R_{AB|M}.$$

To assess significance, a random background distribution of S is generated by calculating S for randomly selected RNA-miRNA-RNA combinations.

CMI-based methods

Quantitatively identifying direct dependencies between RNAs is an important step in identifying ceRNA interactions. CMI is widely used to measure RNA-RNA dependence conditional on miRNA expression. Hermes is a widely used method that predicts ceRNA interactions from the expression profiles of candidate RNAs and their common miRNA regulators using CMI (Figure 2F) [47]. This method is similar to MINDY, which is a computational method based on a CMI estimator to identify modulators of transcription factor activity. Both methods rely on the idea that a significant CMI implies that the expression of a modulator M is predictive of changes in the regulatory activity of a regulator R on its targets T. Specifically, Hermes uses two distinct tests to identify ceRNA-ceRNA pairs.
First, the size of the set of common miRNAs, $\pi_{miR}(T_1;T_2) = \pi_{miR}(T_1) \cap \pi_{miR}(T_2)$, is required to be statistically significant relative to the sizes of the two individual miRNA sets; this is assessed by Fisher's exact test. Next, for each miRNA k, Hermes evaluates the statistical significance of the test:

$$I[\mathrm{miR}_k; T \mid M] > I[\mathrm{miR}_k; T],$$

where the variables indicate the expression of the corresponding RNA species. P-values for each ceRNA-miRNA-ceRNA triplet are computed using a randomization test in which the candidate modulator's expression is shuffled 1000 times. The final significance across all miRNAs is then computed by combining the individual P-values for each miRNA k using Fisher's method:

$$\chi^2 = -2\sum_{k=1}^{N} \ln p_k,$$

where $N$ is the total number of miRNAs.

Comparison of computational methods

Although these computational methods have been demonstrated to be useful for identifying ceRNA crosstalk in complex diseases, it is difficult to evaluate their quality. In addition, we lack knowledge of which miRNA–target predictions are useful for this process. Thus, in this section, we conduct an investigative study to compare the relative performance of the five representative methods mentioned above using the ceRNA regulation of PTEN (Figure 1C).

Data sets

PTEN-related ceRNAs. PTEN has been found to be dysregulated in a diverse set of human cancer types. An in-depth understanding of the mechanisms by which PTEN expression is modulated is crucial to achieving a comprehensive knowledge of its biological roles. The competition between PTEN mRNA and other RNAs mediated by miRNAs has emerged as one such mechanism and has been brought into focus. Here, we assembled a list of PTEN-related ceRNAs by curating the available literature (Figure 3A). In addition, we obtained the gene expression profiles for wild-type and PTEN-overexpressed U87 cell lines. The genes were ranked based on the fold change (FC) of expression.
The genes passing different FC thresholds (FC > 1.5 and FC > 2) were regarded as ceRNAs of PTEN.

Figure 3 Comparison of computational methods based on literature-curated ceRNAs. (A) Literature-curated PTEN-related ceRNAs. (B–F) The proportion of validated ceRNAs of PTEN in top-rank candidate ceRNAs identified by different computational methods. (B) Ratio-based. (C) Hypergeometric test. (D) Hypergeometric test and coexpression. (E) SC based. (F) CMI. Different colored lines represent different miRNA–target interaction prediction methods.

MiRNA–target regulation. Assembling the miRNA–target regulation is the first step in identifying ceRNA-ceRNA crosstalk in complex diseases. Here, we assembled genome-wide miRNA–gene regulation using five commonly used methods: TargetScan, miRanda, PicTar, PITA and RNA22. Recently, several studies have demonstrated that cross-linking and AGO immunoprecipitation coupled with high-throughput sequencing can identify endogenous genome-wide interaction maps for miRNAs [43, 48]. Thus, we also integrated the CLIP-Seq data sets that are available in starBase v2.0 [49]. In addition, we considered ensemble miRNA–target regulation sets, each comprising the interactions predicted by at least one, two, three, four or five computational methods (Ensemble-one to Ensemble-five). Because no miRNA targets PTEN in Ensemble-five, nine sets of miRNA–target regulation were considered in total (Table 2).
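The ensemble sets described above amount to counting, for each miRNA–gene pair, how many prediction methods report it. A minimal Python sketch of this vote counting follows. The function name `ensemble_targets`, the input layout (one set of pairs per method) and the toy interactions are assumptions for illustration; the text does not specify the implementation, and the integration of CLIP-Seq support is not modeled here.

```python
from collections import Counter


def ensemble_targets(predictions, min_methods):
    """Return the (miRNA, gene) pairs predicted by at least `min_methods` methods.

    `predictions` maps a method name to the set of (miRNA, gene) pairs it predicts.
    """
    votes = Counter(pair for pairs in predictions.values() for pair in pairs)
    return {pair for pair, n in votes.items() if n >= min_methods}


if __name__ == "__main__":
    # Hypothetical predictions from three methods.
    toy = {
        "TargetScan": {("miR-1", "PTEN"), ("miR-2", "PTEN")},
        "miRanda": {("miR-1", "PTEN"), ("miR-3", "GENE_A")},
        "PicTar": {("miR-1", "PTEN"), ("miR-2", "PTEN")},
    }
    for k in (1, 2, 3):
        print(k, sorted(ensemble_targets(toy, k)))
```

Raising `min_methods` trades coverage for agreement between methods, which is the trade-off between Ensemble-one and Ensemble-five.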
Table 2 The miRNA–gene regulation for 10 computational methods integrated with CLIP-seq data

Methods          Regulation   mRNA     miRNA   PTEN-miRs   Gene2    Gene3    Gene5
TargetScan       98 829       7160     277     56          3944     3611     2655
miRanda          297 873      12 294   249     88          9797     8961     7717
PicTar           82 983       6653     273     73          3750     3297     2514
PITA             63 096       5805     249     60          3330     2752     1957
RNA22            88 480       8777     311     3           87       6        0
Ensemble-one     423 975      13 801   386     104         11 220   10 328   8989
Ensemble-two     117 441      8410     277     74          5163     4510     3478
Ensemble-three   60 038       5830     249     56          3128     2617     1861
Ensemble-four    26 803       3864     244     46          1814     1386     941
Ensemble-five    3004         1100     168     0           0        0        0

Note: PTEN-miRs, the number of miRNAs that regulate PTEN; Gene2, the number of genes coregulated with PTEN by at least two miRNAs; Gene3, the number of genes coregulated with PTEN by at least three miRNAs; Gene5, the number of genes coregulated with PTEN by at least five miRNAs; Ensemble-one to Ensemble-five, the miRNA regulation predicted by at least one to five computational methods.
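The Gene2, Gene3 and Gene5 columns defined in the note above count genes that share a minimum number of miRNAs with PTEN. As a rough sketch of how such counts could be derived from a set of miRNA–gene pairs, consider the following Python function. The name `coregulated_genes`, its arguments and the toy pairs are hypothetical and are not taken from the study.

```python
from collections import Counter


def coregulated_genes(targets, anchor, min_shared):
    """Genes that share at least `min_shared` miRNAs with the `anchor` gene.

    `targets` is a set of (miRNA, gene) pairs.
    """
    anchor_mirnas = {mirna for mirna, gene in targets if gene == anchor}
    shared = Counter(gene for mirna, gene in targets
                     if mirna in anchor_mirnas and gene != anchor)
    return {gene for gene, n in shared.items() if n >= min_shared}


if __name__ == "__main__":
    # Hypothetical interactions: GENE_A shares two miRNAs with PTEN, GENE_B one.
    toy = {
        ("miR-1", "PTEN"), ("miR-2", "PTEN"), ("miR-3", "PTEN"),
        ("miR-1", "GENE_A"), ("miR-2", "GENE_A"),
        ("miR-1", "GENE_B"), ("miR-4", "GENE_B"),
    }
    print(len(coregulated_genes(toy, "PTEN", 2)))
```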
Gene and miRNA expression of glioma. Genome-wide miRNA and gene expression profiles of 541 glioma samples were obtained from the TCGA project [50]. In total, the profiles contain 12 042 genes and 470 miRNAs.

Comparison based on literature-curated PTEN-related ceRNAs

We first compared the five computational methods based on 29 literature-curated ceRNAs of PTEN (Figure 3A).
In addition, eight methods (all except RNA22) for identifying miRNA–gene regulation were considered in this process. To exclude bias arising from the different numbers of predictions made by different methods, all candidate genes were ranked, and we calculated the proportion of literature-curated PTEN ceRNAs among the top-ranked genes (Figure 3B–F). We found that the number of recalled ceRNAs differed among the five computational methods. Globally, the prediction power was higher when the ensemble miRNA–gene regulation was used. The performance was highest when using the miRNA–gene regulation predicted by at least four methods (Figure 3B–F). This was consistent across the five ceRNA prediction methods. This best performance might be explained by the fact that the ensemble method can obtain more functionally enriched targets. In addition, we found that the methods that integrate expression profiles (HyperC, SC and CMI) performed better than the global prediction methods (ratio and HyperT) in identifying ceRNA regulation. Specifically, HyperC and CMI showed the highest performance. Although the two perform similarly, HyperC is easier for biological researchers to understand and requires far fewer computational resources than CMI. These observations indicate that integrating miRNA and gene expression might identify context-specific miRNA–gene regulation, which further increases the performance for identifying functional ceRNA regulation.

Comparison based on overexpression of PTEN

In addition to the literature-curated PTEN-related ceRNAs, we evaluated the performance of these computational methods based on a PTEN-overexpression microarray analysis. Compared with the PTEN wild-type cell line, ∼84% of the literature-curated ceRNAs showed increased expression when PTEN was reintroduced into the cells (Figure 4A).
This observation supported the coexpression principle of ceRNA regulation and also indicated that genes with increased expression were likely to be PTEN-related ceRNAs. Thus, we next used the genes with at least 1.5-fold increased expression as a set of PTEN-related ceRNAs to evaluate the performance of these computational methods. By calculating the proportion of ceRNAs among the top-ranked genes for each method, we found that these methods reached their best performance when using the miRNA–gene regulation retrieved by the ensemble-four method (Figure 4B–F). In contrast, all of these methods showed their poorest performance when using the ensemble-one-based miRNA–gene regulation. This might be because of the high false-positive rate of miRNA–gene regulation when using the union set of different methods. These observations suggest that it is critical to select suitable miRNA–gene regulation for identifying ceRNA regulation. In addition, when using miRNA–gene regulation retrieved from an individual method, we found that PicTar showed higher performance than the other methods. Similarly, the computational methods that integrate miRNA or gene expression data also showed increased performance for identifying ceRNA regulation. We also compared these methods based on genes with >2-fold changes in the PTEN overexpression data. Similar results were obtained (Figure 5A–E), suggesting that it is better to identify ceRNA regulation based on the ensemble-four method together with integration of miRNA and mRNA expression profiles.

Figure 4 Comparison of computational methods based on gene expression perturbation. (A) Literature-curated PTEN-related ceRNAs were highly expressed in the PTEN-overexpressed cell line. (B–F) The proportion of validated ceRNAs of PTEN in top-rank candidate ceRNAs identified by different computational methods. (B) Ratio based. (C) Hypergeometric test. (D) Hypergeometric test and coexpression. (E) SC based. (F) CMI.
Different colored lines represent different miRNA–target interaction prediction methods. This figure is based on 1.5-fold changes.

Figure 5 Comparison of computational methods based on gene expression perturbation. (A–E) The proportion of validated ceRNAs of PTEN in top-rank candidate ceRNAs identified by different computational methods. (A) Ratio based. (B) Hypergeometric test. (C) Hypergeometric test and coexpression. (D) SC based. (E) CMI. Different colored lines represent different miRNA–target interaction prediction methods. This figure is based on 2-fold changes.

Overlap of different computational methods

The number of PTEN-related ceRNAs identified differs among the five methods. Next, we compared the results of the five ceRNA prediction methods based on the ensemble miRNA–gene regulation predicted by at least four target prediction methods.
For the ratio-based and HyperC methods, we took the top-ranked 300 genes as candidate ceRNAs of PTEN. For the HyperT, SC and CMI methods, we retained the ceRNAs with FDR < 0.05 as candidates. We found that only a small fraction of candidate ceRNAs were shared by at least four methods, and no candidate ceRNAs were identified by all five methods (Figure 6A). Specifically, the SC method identified few ceRNA candidates, and the majority of these were not covered by the other methods. The ceRNA candidates identified by the ratio-based and HyperC methods were all covered by at least one of the other methods. These observations imply that different computational methods may have their own merits.

Figure 6 Overlap of candidate ceRNAs identified by different computational methods. (A) The overlap of ceRNAs for different methods. (B) and (C) The pathways and biological processes enriched by candidate ceRNAs identified by the ratio-based, hypergeometric test, hypergeometric test and coexpression, and CMI methods. The size of the circles corresponds to the proportion of candidate ceRNAs, and the colors represent different P-values.

As these computational methods might capture different aspects of ceRNA regulation, we next explored whether the identified candidate ceRNAs were involved in the same or similar biological functions. As few candidate ceRNAs were identified by the SC method, we compared the enriched functions of the other four methods.
Functional enrichment analysis revealed that these ceRNA candidates play critical roles in cancer, and several pathways were shared by more than two methods, such as ‘pathway in cancer’, ‘endocytosis’ and ‘MAPK signaling pathway’ (Figure 6B). In addition, we identified several biological processes shared by different methods; for example, ‘regulation of protein serine/threonine kinase activity’ and ‘regulation of cell cycle’ were shared by all four methods (Figure 6C). These results indicate that these methods are complementary to each other, and that it is best to integrate the results of different methods to identify ceRNA regulation in human complex diseases.

Discussions and future directions

Over the past decades, miRNA-mediated ceRNA crosstalk has been found to be involved in many diseases, including cancer. MiRNA-miRNA crosstalk and ceRNA-ceRNA crosstalk are changing our understanding of the mechanisms of cancer [19]. In this article, we have reviewed the recently developed computational methods for identifying miRNA-mediated ceRNA interactions. With the increasing application of high-throughput sequencing data sets, ceRNA regulation continues to be discovered. However, the gap between identified and functionally characterized ceRNA regulation remains considerably large. One of the challenges in identifying miRNA-mediated ceRNA regulation is the accuracy of the miRNA–gene regulation. Each miRNA–target prediction method considers only a set of possible targets for miRNAs, which also includes false-positive miRNA targets. Our analysis indicated that it is better to integrate different miRNA–target prediction methods. Specifically, the ensemble-four method achieved the best performance in ceRNA prediction. In addition, integration of context-specific miRNA and mRNA expression profiles increased the performance. This suggests that context-specific miRNA–gene regulation is useful for identifying miRNA-mediated ceRNA crosstalk.
However, this might be limited by the small number of samples with paired miRNA and gene expression profiles, especially when multiple types of RNAs are considered. As research into miRNA-mediated ceRNA regulation has only recently emerged, there is no gold-standard set of positive ceRNA regulation with which to validate the proposed computational predictions. Here, we evaluated these methods based on literature-curated PTEN-related ceRNAs and PTEN-overexpression data sets. However, our evaluations in this case study are not sufficient for drawing definite conclusions about the performance of these methods. Our comparison results suggest that these methods may have their own merits, and that they capture ceRNA candidates involved in similar functions. We suggest that it is better to use them complementarily, e.g. by combining them to develop an ensemble method [51]. Aside from the observation that ceRNA activities are affected by the relative abundance of miRNAs and RNAs, the other mechanisms that regulate ceRNA interactions remain poorly understood. Because miRNAs mainly bind to the 3′-UTR of target RNAs to perform their functions, and 3′-UTR length is observed to be highly regulated in various types of cancer [52], it is conceivable that ceRNA interactions could be altered in cancer. However, only a limited number of studies have investigated 3′-UTR-mediated ceRNA rewiring in cancer. In addition, genetic variants have been observed to widely perturb miRNA regulation, which in turn changes the dynamic equilibrium of ceRNA regulation. Recently, one study identified a large number of genetic variants that are associated with ceRNA function [53], which were termed ‘cerQTL’. This study suggests another functional aspect of noncoding regulatory variants. Besides miRNA regulation, other posttranscriptional mechanisms (such as RNA editing and RNA-binding protein regulation) have been the focus of recent studies.
RNA editing could influence miRNA target regulation [54] and thus influence ceRNA interactions. RNA-binding proteins (RBPs) might compete for the binding sites of miRNAs [55], which could also affect ceRNA interactions. However, these types of regulation are not considered in current methods for identifying ceRNA interactions. Moreover, intra-tumor heterogeneity is critical for developing effective therapies. So far, the majority of ceRNA studies have been performed at the cell-population level. The development of single-cell techniques [56] may shed light on the contribution of ceRNA regulation to tumor heterogeneity. Furthermore, although most of the ceRNA interactions identified so far are between binary RNA partners, increasing evidence indicates that ceRNA crosstalk forms large interconnected networks. In addition to direct interactions through shared miRNAs, secondary indirect interactions have been shown to contribute to ceRNA regulation [45]. Moreover, ceRNA regulation and transcriptional regulation have been shown to be tightly coupled, adding to the complexity of ceRNA crosstalk [18]. Evidence has also demonstrated that integrating protein–protein interaction information can help to understand how miRNA sponges influence downstream biological processes [57]. However, how to integrate such functional information into the process of identifying ceRNA interactions remains poorly understood. In summary, analysis of ceRNA interactions and crosstalk in intertwined networks may represent a robust platform for understanding miRNA regulation in human complex diseases. Here, we propose that the application of computational techniques provides valuable functional and mechanistic insight into miRNA-mediated ceRNA regulation. There are both opportunities and challenges in developing computational methods for identifying miRNA-mediated ceRNA crosstalk in future studies.
Key Points

MiRNA-mediated ceRNA regulation plays critical roles in complex diseases.
Computational methods for identification of RNA-RNA crosstalk were reviewed.
Integration of different miRNA target identification methods and context-specific expression facilitates identification of ceRNA regulation.
Different computational methods are complementary for identifying ceRNAs involved in similar biological functions.

Funding

The National Natural Science Foundation of China (grant numbers 31571331 and 61502126), the China Postdoctoral Science Foundation (grant numbers 2016T90309, 2015M571436 and LBH-Z14134), Natural Science Foundation of Heilongjiang Province (grant number QC2015020), Weihan Yu Youth Science Fund Project of Harbin Medical University, Harbin Special Funds of Innovative Talents on Science and Technology Research Project (grant number 2015RAQXJ091).

Yongsheng Li is an associate professor in the College of Bioinformatics Science and Technology at Harbin Medical University, China and Department of Systems Biology, University of Texas MD Anderson Cancer Center, USA. His research interests focus on ncRNA regulation and bioinformatics methods development.

Xiyun Jin is an MS student in the College of Bioinformatics Science and Technology at Harbin Medical University, China. Her research interests focus on ncRNA regulation.

Zishan Wang is a PhD student in the College of Bioinformatics Science and Technology at Harbin Medical University, China. His research interests focus on method development.

Lili Li is an MS student in the College of Bioinformatics Science and Technology at Harbin Medical University, China. Her research interests focus on bioinformatics methods.

Hong Chen is a PhD student in the College of Bioinformatics Science and Technology at Harbin Medical University, China. Her research interests focus on ncRNA regulation.

Xiaoyu Lin is an MS student in the College of Bioinformatics Science and Technology at Harbin Medical University, China.
Her research interests focus on computational biology.

Song Yi is an associate professor in the Department of Systems Biology, University of Texas MD Anderson Cancer Center, USA. His research interests focus on computational system biology in human diseases.

Yunpeng Zhang is an associate professor in the College of Bioinformatics Science and Technology at Harbin Medical University, China. His research interests focus on computational system biology and ncRNA regulation in human diseases.

Juan Xu is an associate professor in the College of Bioinformatics Science and Technology at Harbin Medical University, China. Her research interests focus on ncRNA regulation in complex diseases.

References

1. Esquela-Kerscher A, Slack FJ. Oncomirs—microRNAs with a role in cancer. Nat Rev Cancer 2006;6:259–69.
2. Pasquinelli AE. MicroRNAs and their targets: recognition, regulation and an emerging reciprocal relationship. Nat Rev Genet 2012;13:271–82.
3. Jonas S, Izaurralde E. Towards a molecular understanding of microRNA-mediated gene silencing. Nat Rev Genet 2015;16:421–33.
4. Ozsolak F, Milos PM. RNA sequencing: advances, challenges and opportunities. Nat Rev Genet 2011;12:87–98.
5. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, et al. The cancer genome Atlas Pan-Cancer analysis project. Nat Genet 2013;45:1113–20.
6. Zhang J, Baran J, Cros A, et al. International cancer genome consortium data portal–a one-stop shop for cancer genomics data. Database 2011;2011:bar026.
7. Friedman RC, Farh KK, Burge CB, et al. Most mammalian mRNAs are conserved targets of microRNAs. Genome Res 2009;19:92–105.
8. Bajan S, Hutvagner G. Regulation of miRNA processing and miRNA mediated gene repression in cancer. Microrna 2014;3:10–17.
9. Xu J, Li CX, Li YS, et al. MiRNA-miRNA synergistic network: construction via co-regulating functional modules and disease miRNA topological features. Nucleic Acids Res 2011;39:825–36.
10. Bracken CP, Scott HS, Goodall GJ. A network-biology perspective of microRNA function and dysfunction in cancer. Nat Rev Genet 2016;17:719–32.
11. Li Y, Xu J, Chen H, et al. Comprehensive analysis of the functional microRNA-mRNA regulatory network identifies miRNA signatures associated with glioma malignant progression. Nucleic Acids Res 2013;41:e203.
12. Gosline SJ, Gurtan AM, JnBaptiste CK, et al. Elucidating MicroRNA regulatory networks using transcriptional, post-transcriptional, and histone modification measurements. Cell Rep 2016;14:310–19.
13. Olena AF, Patton JG. Genomic organization of microRNAs. J Cell Physiol 2010;222:540–5.
14. Wang Y, Luo J, Zhang H, et al. microRNAs in the same clusters evolve to coordinately regulate functionally related genes. Mol Biol Evol 2016;33:2232–47.
15. Li Y, Li S, Chen J, et al. Comparative epigenetic analyses reveal distinct patterns of oncogenic pathways activation in breast cancer subtypes. Hum Mol Genet 2014;23:5378–93.
16. Xu J, Shao T, Ding N, et al. miRNA-miRNA crosstalk: from genomics to phenomics. Brief Bioinform 2016, pii: bbw073. [Epub ahead of print].
17. Meng X, Wang J, Yuan C, et al.
CancerNet: a database for decoding multilevel molecular interactions across diverse cancer types. Oncogenesis 2015;4:e177.
18. Karreth FA, Pandolfi PP. ceRNA cross-talk in cancer: when ce-bling rivalries go awry. Cancer Discov 2013;3:1113–21.
19. Wang Y, Hou J, He D, et al. The emerging function and mechanism of ceRNAs in cancer. Trends Genet 2016;32:211–24.
20. Salmena L, Poliseno L, Tay Y, et al. A ceRNA hypothesis: the Rosetta Stone of a hidden RNA language? Cell 2011;146:353–8.
21. Chen J, Xu J, Li Y, et al. Competing endogenous RNA network analysis identifies critical genes among the different breast cancer subtypes. Oncotarget 2017;8:10171–84.
22. Xu J, Feng L, Han Z, et al. Extensive ceRNA-ceRNA interaction networks mediated by miRNAs regulate development in multiple rhesus tissues. Nucleic Acids Res 2016;44:9438–51.
23. Yu G, Yao W, Gumireddy K, et al. Pseudogene PTENP1 functions as a competing endogenous RNA to suppress clear-cell renal cell carcinoma progression. Mol Cancer Ther 2014;13:3086–97.
24. Cesana M, Cacchiarelli D, Legnini I, et al. A long noncoding RNA controls muscle differentiation by functioning as a competing endogenous RNA. Cell 2011;147:358–69.
25. Fu Z, Li G, Li Z, et al. Endogenous miRNA Sponge LincRNA-ROR promotes proliferation, invasion and stem cell-like phenotype of pancreatic cancer cells. Cell Death Discov 2017;3:17004.
26. Qi X, Zhang DH, Wu N, et al. ceRNA in cancer: possible functions and clinical implications.
J Med Genet 2015;52:710–18.
27. Tay Y, Rinn J, Pandolfi PP. The multilayered complexity of ceRNA crosstalk and competition. Nature 2014;505:344–52.
28. Xu J, Li Y, Lu J, et al. The mRNA related ceRNA-ceRNA landscape and significance across 20 major cancer types. Nucleic Acids Res 2015;43:8169–82.
29. Wang P, Ning S, Zhang Y, et al. Identification of lncRNA-associated competing triplets reveals global patterns and prognostic markers for cancer. Nucleic Acids Res 2015;43:3478–89.
30. Chiu HS, Llobet-Navas D, Yang X, et al. Cupid: simultaneous reconstruction of microRNA-target and ceRNA networks. Genome Res 2015;25:257–67.
31. Zhou X, Liu J, Wang W. Construction and investigation of breast-cancer-specific ceRNA network based on the mRNA and miRNA expression data. IET Syst Biol 2014;8:96–103.
32. Le TD, Zhang J, Liu L, et al. Computational methods for identifying miRNA sponge interactions. Brief Bioinform 2017;18:577–90.
33. Shao T, Wu A, Chen J, et al. Identification of module biomarkers from the dysregulated ceRNA-ceRNA interaction network in lung adenocarcinoma. Mol Biosyst 2015;11:3048–58.
34. Paci P, Colombo T, Farina L. Computational analysis identifies a sponge interaction network between long non-coding RNAs and messenger RNAs in human breast cancer. BMC Syst Biol 2014;8:83.
35. Zhang Y, Xu Y, Feng L, et al. Comprehensive characterization of lncRNA-mRNA related ceRNA network across 12 major cancers. Oncotarget 2016;7:64148–67.
36. Wang JB, Liu FH, Chen JH, et al. Identifying survival-associated modules from the dysregulated triplet network in glioblastoma multiforme. J Cancer Res Clin Oncol 2017;143:661–71.
37. Agarwal V, Bell GW, Nam JW, et al. Predicting effective microRNA target sites in mammalian mRNAs. eLife 2015;4:e05005.
38. Betel D, Koppal A, Agius P, et al. Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites. Genome Biol 2010;11:R90.
39. Krek A, Grun D, Poy MN, et al. Combinatorial microRNA target predictions. Nat Genet 2005;37:495–500.
40. Kertesz M, Iovino N, Unnerstall U, et al. The role of site accessibility in microRNA target recognition. Nat Genet 2007;39:1278–84.
41. Miranda KC, Huynh T, Tay Y, et al. A pattern-based method for the identification of MicroRNA binding sites and their corresponding heteroduplexes. Cell 2006;126:1203–17.
42. Clark PM, Loher P, Quann K, et al. Argonaute CLIP-seq reveals miRNA targetome diversity across tissue types. Sci Rep 2014;4:5947.
43. Moore MJ, Zhang C, Gantman EC, et al. Mapping Argonaute and conventional RNA-binding protein interactions with RNA at single-nucleotide resolution using HITS-CLIP and CIMS analysis. Nat Protoc 2014;9:263–93.
44. Friedersdorf MB, Keene JD. Advancing the functional utility of PAR-CLIP by quantifying background binding to mRNAs and lncRNAs. Genome Biol 2014;15:R2.
45. Ala U, Karreth FA, Bosia C, et al.
Integrated transcriptional and competitive endogenous RNA networks are cross-regulated in permissive molecular environments . Proc Natl Acad Sci USA 2013 ; 110 : 7154 – 9 . Google Scholar Crossref Search ADS PubMed WorldCat 46 Chiu YC , Hsiao TH , Chen Y , et al. Parameter optimization for constructing competing endogenous RNA regulatory network in glioblastoma multiforme and other cancers . BMC Genomics 2015 ; 16(Suppl 4) : S1 . Google Scholar Crossref Search ADS PubMed WorldCat 47 Sumazin P , Yang X , Chiu HS , et al. An extensive microRNA-mediated network of RNA-RNA interactions regulates established oncogenic pathways in glioblastoma . Cell 2011 ; 147 : 370 – 81 . Google Scholar Crossref Search ADS PubMed WorldCat 48 Yang JH , Li JH , Shao P , et al. starBase: a database for exploring microRNA-mRNA interaction maps from Argonaute CLIP-Seq and Degradome-seq data . Nucleic Acids Res 2011 ; 39 : D202 – 9 . Google Scholar Crossref Search ADS PubMed WorldCat 49 Li JH , Liu S , Zhou H , et al. starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data . Nucleic Acids Res 2014 ; 42 : D92 – 7 . Google Scholar Crossref Search ADS PubMed WorldCat 50 Cancer Genome Atlas Research Network . Comprehensive genomic characterization defines human glioblastoma genes and core pathways . Nature 2008 ; 455 : 1061 – 8 . Crossref Search ADS PubMed WorldCat 51 Marbach D , Costello JC , Kuffner R , et al. Wisdom of crowds for robust gene network inference . Nat Methods 2012 ; 9 : 796 – 804 . Google Scholar Crossref Search ADS PubMed WorldCat 52 Mayr C , Bartel DP. Widespread shortening of 3'UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells . Cell 2009 ; 138 : 673 – 84 . Google Scholar Crossref Search ADS PubMed WorldCat 53 Li MJ , Zhang J , Liang Q , et al. Exploring genetic associations with ceRNA regulation in the human genome . Nucleic Acids Res 2017 ; 45 : 5653 – 65 . 
Google Scholar Crossref Search ADS PubMed WorldCat 54 Wang Y , Xu X , Yu S , et al. Systematic characterization of A-to-I RNA editing hotspots in microRNAs across human cancers . Genome Res 2017 ; 27 : 1112 – 25 . Google Scholar Crossref Search ADS PubMed WorldCat 55 Treiber T , Treiber N , Plessmann U , et al. A compendium of RNA-binding proteins that regulate MicroRNA biogenesis . Mol Cell 2017 ; 66 : 270 – 84.e213 . Google Scholar Crossref Search ADS PubMed WorldCat 56 Ramskold D , Luo S , Wang YC , et al. Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells . Nat Biotechnol 2012 ; 30 : 777 – 82 . Google Scholar Crossref Search ADS PubMed WorldCat 57 Zhang J , Le TD , Liu L , et al. Inferring miRNA sponge co-regulation of protein-protein interactions in human breast cancer . BMC Bioinformatics 2017 ; 18 : 243 . Google Scholar Crossref Search ADS PubMed WorldCat Author notes Yongsheng Li and Xiyun Jin contributed equally to this work. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
journal article
LitStream Collection
A generally applicable lightweight method for calculating a value structure for tools and services in bioinformatics infrastructure projects

Mayer, Gerhard; Quast, Christian; Felden, Janine; Lange, Matthias; Prinz, Manuel; Pühler, Alfred; Lawerenz, Chris; Scholz, Uwe; Glöckner, Frank Oliver; Müller, Wolfgang; Marcus, Katrin; Eisenacher, Martin

2019 Briefings in Bioinformatics

doi: 10.1093/bib/bbx140; pmid: 29092005

Abstract
Sustainable noncommercial bioinformatics infrastructures are a prerequisite to use and take advantage of the potential of big data analysis for research and economy. Consequently, funders, universities and institutes as well as users ask for a transparent value model for the tools and services offered. In this article, a generally applicable lightweight method is described by which bioinformatics infrastructure projects can estimate the value of tools and services offered without determining exactly the total costs of ownership. Five representative scenarios for value estimation, ranging from a rough estimation to a detailed breakdown of costs, are presented. To account for the diversity in bioinformatics applications and services, the notion of service-specific ‘service provision units’ is introduced together with the factors influencing them and the main underlying assumptions for these ‘value influencing factors’. Special attention is given to how to handle personnel costs and indirect costs such as electricity. Four examples are presented for the calculation of the value of tools and services provided by the German Network for Bioinformatics Infrastructure (de.NBI): one for tool usage, one for (Web-based) database analyses, one for consulting services and one for bioinformatics training events. Finally, from the discussed values, the costs of direct funding and the costs of payment of services by funded projects are calculated and compared.

Keywords: bioinformatics infrastructure, de.NBI, cost factors, value-influencing factors, value estimation for offered tools and services, service provision unit

Introduction
The provision of an effective and sustainable computing and data infrastructure is seen as a prerequisite for further developing an efficient life science industry and harvesting its economic potential [1].
This need was identified early in the past and has led to the foundation of national or transnational bioinformatics institutes like NCBI (1988), EMBL-EBI (1992) or SIB [2] (1998). In the past couple of years, many European countries installed national bioinformatics infrastructure projects, and the most prominent of these in Europe are DTL [3] (The Netherlands, https://www.dtls.nl, accessed June 2017), IFB (France, http://www.france-bioinformatique.fr, accessed June 2017), NBIS (Sweden, https://nbis.se, accessed June 2017), INB (Spain, http://www.inab.org, accessed June 2017) and the German Network for Bioinformatics Infrastructure—de.NBI [4] (Germany, https://www.denbi.de, accessed June 2017). Most of them are partner nodes in ELIXIR [5] (https://www.elixir-europe.org, accessed June 2017), a transnational European-wide distributed life science infrastructure project. These infrastructure projects provide data repositories, software tools and resources for the data management, analysis and interoperability to be used by life science projects producing or analyzing ‘big data’. This also encompasses services for knowledge transfer such as training and consulting for enabling the users to efficiently use the provided resources. Often these resources are deployed in cloud systems to offer highly scalable and high-performance computing environments to the end user. The costs for these services are either charged to the users or the services are offered for free, for example if they are funded by funding institutions. Even if they are offered free of charge, there is a desire of infrastructure providers to estimate the value of the offered tools and services. 
This is necessary to (1) plan the institutional budget to provide required personnel and technical infrastructure, (2) justify research funding and get long-lasting support for hosting services from research projects and (3) render the financial resources needed for the infrastructure transparent to the stakeholders and the general public [6]. Standardized value estimation is a basic requirement to realize funding or payment models toward self-sustainability in the long term. A virtual price tag for the used service will further increase the awareness of the value of bioinformatics work and the compliance of researchers to finance bioinformatics resources. The publicly funded bioinformatics infrastructure projects are set to the technical and scientific provisioning and support of the offered services. Usually, they have no dedicated financial accounting department to determine the exact operational costs for the offered tools and services on the basis of a full-cost pricing [total cost of ownership (TCO)] model. Consequently, there is a need for a generally applicable lightweight cost model. In de.NBI [7], two special interest groups (SIG2 ‘service and service monitoring’ and SIG4 ‘hardware infrastructure and data management’) have developed such a simplified method. Some of the details described may be specific to the de.NBI network institutions concerned (e.g. overhead rates for indirect costs) or to Germany [e.g. value-added tax (VAT) rate] but should in principle be transferable to other institutions and countries. Methods For a complete full-cost accounting, all elements of costs according to a TCO model have to be included. In general, one can distinguish between fixed and running costs. 
Common elements are as follows:

Computer hardware and software
- Hardware costs (computer, printer, …)
- License costs for software
- Schedule of depreciation for all tangible goods
- Utilization rate to account for idle time (mean % CPU usage, software usage days per licensed week, …)

Development costs
- Personnel costs for development, testing and implementation (even in a productive environment with fixed functionality at least the costs for security bug fixes)

Operational expenses
- Direct infrastructure costs (building, i.e. floor and office space, equipment, furniture, …)
- Infrastructure maintenance costs (e.g. janitorial supplies or maintenance, repair and overhaul)
- Consumption costs (e.g. cooling, heating, electricity, phone, office supplies and consumables such as paper, advertising flyers)
- Connectivity/data transfer costs (network, Internet, especially for big data)
- Personnel costs for administration and general support (data backups and recovery, personnel administration)
- Personnel costs for help desk, maintenance, consulting and training
- Support contracts for hardware and licensed software
- Costs for access to journals and books

Long-term expenses
- Replacement costs (estimation of the costs for replacing defective or old hardware)
- Upgrade/scalability expenses (costs for nonlinear growth of service volume)
- Decommissioning (e.g. for hardware at end of lifetime)

Other bioinformatics tool and service providers like the EBI (European Bioinformatics Institute) contracted a consultancy company to estimate the costs and the generated value of their institution [8]. For de.NBI, a lightweight minimal consensus value structure model was developed, which makes some simplifications to the TCO model.
For instance, because the direct infrastructure and infrastructure maintenance costs are considered as a lump sum (‘overhead costs’) in the grants of funding organizations (currently 20% for funding of the BMBF, the German Federal Ministry of Education and Research), they are also included as a lump sum in our value structure (cost Model 1). Considering that a scientist has to pay for a service via invoice (cost Model 2), this lump sum is usually higher (up to >70% depending on the research institution or organizational structure) plus minimum profit (e.g. 4%) plus VAT (currently 19% in Germany). Because fixed costs scale up in steps and running costs scale up linearly, and to allow the inclusion of further improvements of the offered tools and services, the value structure model should reflect such scalability issues. Therefore, we defined different scenarios for the value structure model:

- Scenario 1: Value of the status quo, where neither tool or service improvements nor growth of usage volume is taken into account
- Scenario 2: Value, which includes tool/service improvements, but no growth
- Scenario 3: Scenario 2 plus growth of the usage volume
- Scenario 4: Scenario 3 plus the retrospective development costs for the tools (may have been financed by third-party projects)
- Scenario 5: Scenario 4 plus the expected future hardware exchange or replacement and future development costs

Scenarios with higher ordinal numbers have the potential to reflect the TCO model better, but more assumptions may be necessary. Furthermore, a mix of scenarios is imaginable, e.g. a scenario incorporating expected future hardware costs but without growth of usage volume and without incorporating retrospective development costs. Within de.NBI, we follow a best practice for estimating the Scenario 1 value of a de.NBI service. We defined service-specific ‘service provision units’, which are the basic units of value calculation.
This could be for example ‘one database query’ or ‘one statistical analysis day’ or ‘one training day’. For each offered tool/service, the specific underlying value considerations are explicitly formulated together with the related assumptions (‘value influencing factors’). Only factors with financial influence are taken into account. Other measures of ‘value’ that are typically derived using usage statistics, like for instance citations and value perceived by the users, etc., are regularly monitored inside de.NBI but not considered here. Such factors with financial influence could be, for example, ‘personnel costs for one day database maintenance assuming 100 queries per day’ or ‘personnel costs for a statistician assuming one analysis takes three days’; for more detailed examples, see section below. The value-influencing factors are adapted regularly (e.g. every 6 months), so that with changing knowledge, e.g. about personnel salary increase, or growing experience about the assumptions, e.g. the usage volume or average usage, the calculated values converge more and more to the ‘true’ values. For calculating the personnel costs, as a basis the yearly adapted personnel staff appropriation rates of the DFG are taken (the German Research Foundation http://www.dfg.de/formulare/60_12/60_12_en.pdf, accessed June 2017). If using the average scientist salaries of the institution or the work group or even the known person performing a service, this value-influencing factor will be more exact. Optionally, a ‘scaling limit’ for the usage volume can be specified, up to which the used value structure is reasonable. Results In the following, we show for the four service examples, ‘tool usage’, ‘web-based database query’, ‘bioinformatics consulting’ and ‘training’ how their value can be determined. This is by default done for Scenario 1. This scenario does not consider the cost to implement new features, fix security bugs and to improve the overall performance of the software. 
In particular, this scenario does not consider the lifetime of the software (operating system, libraries or other tools) that is required to offer the service. Some of the examples nevertheless also consider hardware exchange, which is part of Scenario 5. According to the DFG personnel staff appropriation rates 2017, a postdoctoral scientist (PostDoc) or comparable costs €68 400 on average per year, and a nonacademic technical staff member costs €47 100 on average per year. Assuming 220 working days per year (365 days minus weekends, working/public holidays, average sick leave, etc.), the PostDoc working day value is €310.91, and the technician working day value is €214.09. In the Supplemental Material, we provide Excel files for the examples as a starting point for own calculations.

Example 1: Tool usage (analysis via stand-alone executable)
The service provision unit is ‘one analysis via tool usage’, i.e. one analysis performed with a tool that is installed on a service provider’s server and is remotely accessible by the user. The tool usage thus comprises the upload of the data, running the tool and returning the results to the user. We assume that the tool is used this way five times per week. As value-influencing factors, we consider a fraction X of a full-time PostDoc maintaining the tool in its environment (bug fixing, user help desk, e.g. supporting upload/download). Assuming there are five tools to maintain or five other services done by the PostDoc, a first assumption for the fraction X is 20%. We also consider 20% of a technical maintenance person for the server, the remote access and the network. More formally, the value-influencing factors of ‘one analysis via tool usage’ are: (20% of PostDoc week + 20% of technician week) divided by 5 [analyses per week] + 20% overhead for indirect costs.
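The formula above can be sketched in a few lines of Python. This is only a minimal illustration, not part of the de.NBI method itself: the function and parameter names are hypothetical, a 5-day working week is assumed, and daily rates are rounded to cents before further use, which is an assumption about how the worked figures in the text were obtained.

```python
# Sketch of the Scenario 1 value of 'one analysis via tool usage'.
# All names are illustrative; the figures are those quoted in the text.

POSTDOC_YEAR = 68_400.0      # EUR per year (DFG rate 2017, PostDoc)
TECHNICIAN_YEAR = 47_100.0   # EUR per year (DFG rate 2017, technical staff)
WORKING_DAYS_PER_YEAR = 220
DAYS_PER_WEEK = 5            # assumption: 5-day working week
OVERHEAD_FACTOR = 1.2        # 20% overhead for indirect costs


def day_value(yearly_cost: float) -> float:
    """Value of one working day, rounded to cents."""
    return round(yearly_cost / WORKING_DAYS_PER_YEAR, 2)


def tool_usage_value(postdoc_fraction: float = 0.2,
                     technician_fraction: float = 0.2,
                     analyses_per_week: int = 5,
                     hardware_per_analysis: float = 0.0) -> float:
    """Value of one analysis, including the overhead for indirect costs.

    hardware_per_analysis is an optional per-analysis hardware share;
    it stays at 0 for plain Scenario 1.
    """
    postdoc_week = day_value(POSTDOC_YEAR) * DAYS_PER_WEEK
    technician_week = day_value(TECHNICIAN_YEAR) * DAYS_PER_WEEK
    personnel_week = (postdoc_fraction * postdoc_week
                      + technician_fraction * technician_week)
    per_analysis = personnel_week / analyses_per_week + hardware_per_analysis
    return per_analysis * OVERHEAD_FACTOR


if __name__ == "__main__":
    print(f"PostDoc day:    EUR {day_value(POSTDOC_YEAR):.2f}")
    print(f"Technician day: EUR {day_value(TECHNICIAN_YEAR):.2f}")
    print(f"One analysis:   EUR {tool_usage_value():.2f}")
```

Keeping the fractions and the usage volume as parameters makes it easy to re-run the estimate whenever the value-influencing factors are revised.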
With numbers:

$$\frac{0.2 \cdot \text{€}1554.55 + 0.2 \cdot \text{€}1070.45}{5} \cdot 1.2 = (\text{€}62.18 + \text{€}42.82) \cdot 1.2 = \text{€}105.00 \cdot 1.2 = \text{€}126.00$$

As an extension of Scenario 1, we also include costs for hardware renewal. We assume that renewing the dedicated middle-range server hardware costs €3300, and it is renewed after 3 years. Thus, it serves for 780 analyses (3 × 52 weeks × 5 analyses per week) and contributes €4.23 per analysis. With numbers:

$$(\text{€}62.18 + \text{€}42.82 + \text{€}4.23) \cdot 1.2 = \text{€}109.23 \cdot 1.2 = \text{€}131.08$$

A value calculation for Scenario 2 would include efforts to add tool functionality and respective documentation. Depending on the tool, the research domain and the implemented complexity, for example personnel costs (e.g. 2–12 person months), the respective proportional prospective costs for development hardware (client and possibly server computers) plus possibly proportional costs of software licenses for the development environment would be incorporated before applying the overhead for indirect costs. Analogously, a value calculation for Scenario 4 would include the retrospective development costs for the tools (this is not done here because many tools provided as service have been financed by previously funded third-party projects). Analyses by tool usage within workflow systems such as Galaxy or KNIME, or in a cloud system as Docker containers or via other mechanisms, have to be calculated with other assumptions for the ‘value influencing factors’ or even with differing ‘value influencing factors’, and therefore, their value cannot be directly derived from this example.

Example 2: Web query (analysis via browser)
The service provision unit is ‘one analysis against a database’ using a Web-based service. The factors to estimate the value of a single ‘analysis’ include the value to maintain the Web presence and the value to run the analysis. Additionally, costs to create or to license the database may have to be considered. Each of these factors includes both personnel as well as hardware-associated costs.
Further, it is assumed that the hardware is used exclusively to create the database and to provide the service, while the personnel may not be working full-time on the service. The following example is based on the SILVA Web service [9], a service for the analysis of ribosomal RNA sequences. The personnel investment in the daily operation of the Web service is rather small even for a large number of analyses, provided that these jobs run fully automated. Nevertheless, it requires technical and scientific staff to operate the service. The technician is mostly concerned with the operation of the computer hardware, the network and the operating system, implementing security fixes and system updates. These tasks take about 40% of a full-time position; according to the above-mentioned technician salary, this is €18 840 per year. A software developer is required to maintain the custom software developed to run this service, mainly to adapt it to the ever-changing Web environment (Web browsers and the fast-evolving HTML, HTTP, SSL and JavaScript standards and associated libraries). This takes roughly 55% of a full-time PostDoc position, or €37 620 per year. Additionally, a domain expert is needed, mostly to support users, but also to supervise the service and to proactively check for issues that may arise during the operation of the service. These tasks cannot be accomplished by the software developer because of the lack of domain knowledge. This takes up to 33% of a full-time PostDoc position, or €22 572 per year. In total, €79 032 has to be spent on wages per year to operate this service. The cost of the corresponding hardware (Web server as well as associated storage server) is €10 000. For fail-over safety and to reduce the time the users have to wait for their results, a redundant design of both components is necessary. Together, the cost of this setup is €20 000. As previously described, these costs are written off over 3 years, so the costs per year sum up to €6666.67.
The personnel and hardware costs to operate the analysis service sum up to €85 698.67 per year. This sum, however, does not yet include any license fees for third-party software, the preparation of the reference database and other resources, nor does it include the aforementioned overhead factor of 1.2. The reference databases used for the analyses have to be considered as indirect costs, as the databases have to be regarded as a fixed external resource used during the analyses. These costs may either result from licensing the databases or from creating and curating the databases. The calculation below assumes that the databases are maintained by the provider of the analysis service. It includes the computational costs as well as the personnel costs to supervise the preparation of the production databases. However, it does not include any costs for past development of the software to create the databases, nor does it consider costs to further develop this software. In most cases, a compute cluster is needed to create the databases. The specification of the compute nodes highly depends on the data that need to be processed. In the case of the SILVA databases, 20 compute nodes, each equipped with 12 CPU cores and 64 GB of memory, are required to process the raw data in 2 weeks. Each of the compute nodes costs €4500; altogether they cost €90 000, depreciated over 3 years (€30 000 per year). We assume that this hardware is exclusively used for SILVA. If the idle time can be leased, the utilization rate must be taken into account. The preparation of the databases has to be supervised, and an additional 2 weeks have to be invested to prepare all the data. Including the maintenance of the software, 5–6 weeks have to be spent to prepare a single release of the databases. With three database updates a year, this accounts for 45% of a full-time PostDoc position (€30 780) and 20% of a technician position (€9420).
After the data have been processed, they need to be curated by a domain expert, which takes another 2–3 weeks per release. In between the releases, the domain expert has to continue to curate the data and taxonomy and to incorporate up-to-date information published by the research community in preparation of the next database release. Over the course of a year, 66% of a full-time PostDoc position (€45 144) has to be invested. In total, €85 344 has to be spent on personnel for the preparation of the reference databases required by the Web service. Overall, the operation of the Web service and the preparation of the database cost about €241 251.20 including personnel and hardware costs as well as a 1.2 overhead factor. To calculate the cost for the user to run a single analysis, the overall costs have to be divided by the number of jobs served per year. Each compute node can serve a maximum of 105 120 jobs per year, if a job takes on average 5 min to process. Having redundant compute nodes, in theory 210 240 jobs can be served per year, reducing the cost per analysis to about €1.15. However, this assumes that the compute nodes are used 100% of the time on every day of the year, which is a theoretical assumption, as it includes weekends and public holidays where users show less activity. The price of a single analysis must rather be based on the expected number of jobs a service serves per year than on the maximum number of jobs a service may serve. As an example, the SILVA project served 86 560 jobs in 2016. Using this number and anticipating an increase in demand of 5% to about 91 000 for 2017 increases the price per analysis to €2.65. In summary, for the SILVA project to serve 91 000 jobs and to continue to update the reference databases in 2017, two PostDoc positions (€136 800) and 60% of a technician position (€28 260) are required. Additionally, €36 666.67 has to be invested into the hardware.
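The yearly totals summarized above can be combined into a per-analysis value with a short sketch. The function and variable names are hypothetical; the numbers are taken from the SILVA example, and the personnel effort is rounded to two full PostDoc positions as in the summary.

```python
# Sketch of the per-analysis value of a Web service (SILVA example figures).
# Function and variable names are illustrative, not part of the original method.

MINUTES_PER_YEAR = 365 * 24 * 60


def max_jobs_per_node(minutes_per_job: int = 5) -> int:
    """Theoretical yearly job capacity of one node running without idle time."""
    return MINUTES_PER_YEAR // minutes_per_job


def value_per_analysis(postdoc_positions: float,
                       technician_positions: float,
                       hardware_per_year: float,
                       jobs_per_year: int,
                       postdoc_year: float = 68_400.0,
                       technician_year: float = 47_100.0,
                       overhead_factor: float = 1.2) -> float:
    """Yearly personnel and hardware costs, with overhead, divided by jobs."""
    personnel = (postdoc_positions * postdoc_year
                 + technician_positions * technician_year)
    return (personnel + hardware_per_year) * overhead_factor / jobs_per_year


if __name__ == "__main__":
    # Hardware: EUR 20 000 (Web and storage servers) plus EUR 90 000
    # (compute nodes), both written off over 3 years.
    hardware = (20_000 + 90_000) / 3
    print(max_jobs_per_node())                                     # 105120
    print(round(value_per_analysis(2, 0.6, hardware, 91_000), 2))  # 2.66
```

Passing the expected rather than the theoretical maximum number of jobs as `jobs_per_year` reflects the point made above that the price should be based on realistic demand.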
The overall formula to calculate the cost of the SILVA Web service is:

$$\frac{(2 \times \text{€}68\,400_{\text{PostDoc}} + 0.6 \times \text{€}47\,100_{\text{Technician}} + \text{€}36\,666.67_{\text{hardware}}) \times 1.2}{91\,000\ \text{jobs}} = \text{€}2.66$$

Example 3: Bioinformatics consulting (here: bioinformatics for proteomics)
The term ‘consulting’ is used here in contrast to ‘analysis’ because in performing this service, we support the service users in deciding which analysis workflow in which workflow system should be used and/or which (open-source/free-to-use or commercial) tools may be used for analysis by inspecting their experimental hypothesis, planned experimental design, existing (mass spectrometry) technologies and, potentially, already existing data sets. A ‘bioinformatics analysis’ service would have more value-influencing factors such as software licenses. The service provision unit is ‘one bioinformatics for proteomics consulting’. It is assumed that a default ‘bioinformatics for proteomics’ consulting lasts 2 days including communication with the user, collection of existing hypotheses and experimental design, preparation of data files, if already produced, and literature/online search for existing public data. The value-influencing factors are the scientific personnel (a PostDoc) and the renewal costs for a mid-range desktop computer plus monitor for these 2 days. We assume that a mid-range desktop computer plus monitor for the consultant, which is depreciated over 3 years, costs €1000 and therefore €1.52 per working day. Additionally, a factor of 1.2 is used to cover the indirect costs (overhead allowance for consumption costs). Therefore, one bioinformatics consulting (of the default 2 days duration) has a value of

$$2 \times (\text{€}310.91 + \text{€}1.52) \times 1.2 = \text{€}749.83,$$

i.e. €374.92 per consulting day. This shows that the personnel costs are the main value-influencing factor for bioinformatics consulting.
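The consulting calculation can be reproduced with a small sketch. The names are hypothetical; the figures are those of Example 3, and daily rates are rounded to cents first, which is an assumption made here to match the rounded values used in the text.

```python
# Sketch of the value of one 'bioinformatics for proteomics' consulting.
# Names are illustrative; the figures are those used in Example 3.

WORKING_DAYS_PER_YEAR = 220


def per_day(cost: float, years: int = 1) -> float:
    """Cost per working day, rounded to cents, spread over `years` years."""
    return round(cost / (years * WORKING_DAYS_PER_YEAR), 2)


def consulting_value(days: int = 2,
                     postdoc_year: float = 68_400.0,
                     desktop_cost: float = 1_000.0,
                     depreciation_years: int = 3,
                     overhead_factor: float = 1.2) -> float:
    """Value of a consulting of `days` days, including overhead."""
    personnel_day = per_day(postdoc_year)                     # EUR 310.91
    hardware_day = per_day(desktop_cost, depreciation_years)  # EUR 1.52
    return days * (personnel_day + hardware_day) * overhead_factor


if __name__ == "__main__":
    print(f"2-day consulting: EUR {consulting_value():.2f}")
    print(f"per day:          EUR {consulting_value(days=1):.2f}")
```

Comparing `personnel_day` with `hardware_day` makes the dominance of personnel costs directly visible.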
Example 4: de.NBI Training
Extending the assumptions made in Example 3, for the bioinformatics training course we additionally assume that a mid-range server has original costs of €3000 and therefore costs €4.55 per working day. Thin clients, which are used during training events, would be calculated with a purchase price of €300, which equals costs of €300/660 = €0.45 per training day. But we assume here that, in contrast to the desktop computers and servers, these thin clients are exclusively used for 3 preparation days and 6 training days per year, so that the utilization rate must be taken into account. Then, one such thin client costs €300/27 = €11.11 per day used, if a depreciation period of 3 years is assumed. The service provision unit is 1 training day and includes as value-influencing factors the time for preparation and for conducting the actual training, the hardware usage for the training preparation, the teaching hardware and its preparation, flyers and posters for advertising the training event (€100), printed handouts (€4 per participant) and small snacks (€10 per participant) but no travel and accommodation expenses for the participants. In addition, we assume that the training takes place at the institution of the trainers, so that there are no travel and accommodation expenses for them, and the room for the training is provided for free. The number of trainees is assumed to be 20, and we assume that four scientists and one technician are involved in the preparation of the training event. The preparation time is assumed to be 5 working days for each of the four scientists and 1 working day for the technician (to install the needed software on the teaching hardware, i.e. a mid-range server used during the actual teaching and the 10 thin clients available for usage by the trainees). For that preparation, the scientists use mid-range desktop computers. Again, we calculate with a factor of 1.2 for the consumption cost overhead.
Therefore, we can calculate the personnel costs as four scientists times 5 preparation days per scientist times €310.91 per preparation day plus one technician times 1 preparation day per technician times €214.09 per preparation day, which totals $20\ \text{(days)} \times \text{€}310.91 + \text{€}214.09 = \text{€}6432.29$ for all 5 preparation days, and personnel costs of $4\ \text{(PostDocs)} \times \text{€}310.91 = \text{€}1243.64$ for the teaching day, totalling €7675.93 of personnel costs. In addition, we can calculate the hardware costs during the preparation phase as four scientists times 5 preparation days per scientist times €1.52 per preparation day plus one technician times 1 preparation day per technician times $\text{€}4.55\ \text{(server)} + 10\ \text{(thin clients)} \times \text{€}11.11 = \text{€}115.65$ per preparation day. This totals $20 \times \text{€}1.52 + \text{€}115.65 = \text{€}146.05$ for all 5 preparation days. For the 1 server and 10 thin clients used on the teaching day, the costs are calculated as $\text{€}4.55\ \text{(server)} + 10\ \text{(thin clients)} \times \text{€}11.11 = \text{€}115.65$. Then, the overall hardware costs for preparation days and training days sum up to $\text{€}146.05 + \text{€}115.65 = \text{€}261.70$. Finally, for 20 participants, there are other costs of €380 for advertising, handouts and snacks. For a training day with 20 participants, the total costs including the 20% overhead amount to

$$(\text{€}7675.93 + \text{€}261.70 + \text{€}380.00) \times 1.2 = \text{€}8317.63 \times 1.2 = \text{€}9981.16,$$

which equals €499.06 for 1 training day for each of the 20 participants. Figure 1 shows that, because the total costs for a training day are dominated by the fixed personnel costs for the training preparation, the cost for a training day per participant is given by a hyperbolic curve.

Figure 1. Dependence of the ‘per-person value’ for 1 training day on the number of participants.
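The hyperbolic dependence on the number of participants can be illustrated with a brief sketch. The names are hypothetical, the rounded daily rates are those used in the text, and the hardware (1 server, 10 thin clients) is treated as fixed regardless of the number of participants, which is an assumption of this sketch.

```python
# Sketch of the per-participant value of one training day (Example 4 figures).
# Names are illustrative; daily rates are the rounded values used in the text.

POSTDOC_DAY = 310.91      # EUR per PostDoc working day
TECHNICIAN_DAY = 214.09   # EUR per technician working day
DESKTOP_DAY = 1.52        # EUR per day, mid-range desktop
SERVER_DAY = 4.55         # EUR per day, mid-range server
THIN_CLIENT_DAY = 11.11   # EUR per day used, thin client
OVERHEAD_FACTOR = 1.2     # 20% overhead for indirect costs


def fixed_costs(scientists: int = 4, prep_days: int = 5,
                thin_clients: int = 10, advertising: float = 100.0) -> float:
    """Costs that do not depend on the number of participants."""
    personnel = (scientists * prep_days * POSTDOC_DAY   # preparation
                 + TECHNICIAN_DAY                       # 1 technician day
                 + scientists * POSTDOC_DAY)            # teaching day
    teaching_hw = SERVER_DAY + thin_clients * THIN_CLIENT_DAY
    hardware = (scientists * prep_days * DESKTOP_DAY    # desktops during prep
                + teaching_hw                           # technician prep day
                + teaching_hw)                          # training day
    return personnel + hardware + advertising


def per_participant_value(participants: int, handout: float = 4.0,
                          snacks: float = 10.0) -> float:
    """Value of one training day per participant, including overhead."""
    variable = participants * (handout + snacks)
    return (fixed_costs() + variable) * OVERHEAD_FACTOR / participants


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        print(f"{n:3d} participants: EUR {per_participant_value(n):.2f}")
```

Because the fixed part is divided by the number of participants while the variable part stays constant per person, the result has the form a/n + b, which is the hyperbolic shape described for Figure 1.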
Larger training events such as summer schools are separately funded in de.NBI; for those, further value-influencing factors such as presentation rooms, travel, accommodation and lunch/dinner costs for participants and trainers have to be considered.

Taking the funding model into account
Besides using the determined values to calculate the financing needs for direct funding (funding Model 1, ‘infrastructure funding’), one can also estimate the values in case the infrastructure is paid for with compensation fees by users (funding Model 2, ‘contract research’). These two funding models differ by requiring different overhead costs (20% for infrastructure funding, 70% or more for contract research). For the contract research, a minimum profit (to avoid unfair competition with the private sector) and VAT need to be added. In Table 1, the estimates for the two funding models are compared for the four examples described above. We calculated costs × overhead factor (1.2) for funding Model 1 and costs × overhead factor (1.6) × minimum profit margin (1.04) × VAT (1.19) for funding Model 2, i.e. we assumed an overhead of 20% for funding Model 1 and 60% for funding Model 2, so that the estimated value for funding Model 2 is about 65% higher than for funding Model 1:

$$\frac{1.6 \times 1.04 \times 1.19}{1.2} \approx 1.65$$

Table 1. Estimated rounded values of direct infrastructure funding (with an overhead factor of 20%) and contract research (with an overhead factor of 60%) for the four examples described in the ‘Results’ section

| Example | Funding Model 1: Direct infrastructure funding (20% overhead) | Funding Model 2: Contract research (60% overhead + 4% minimum profit + 19% VAT) | Factor Model 2/Model 1 |
|---|---|---|---|
| Example 1: Tool usage | €126.00 | €207.92 | 1.6501 |
| Example 2: Web query | €2.66 | €4.39 | 1.6501 |
| Example 3: Bioinformatics consulting | €374.92 | €618.66 | 1.6501 |
| Example 4: 1 Training day | €499.06 | €823.52 | 1.6501 |

Note: The assumption is a VAT of 19% and a minimum profit margin of 4% for contract research.

Discussion
The presented value structure model does not automatically imply that an infrastructure will charge its users for the tools and services provided. One possibility to avoid charges is to apply for research grants together with researchers who want to use the infrastructure.
Beyond that, a wide range of financing models is conceivable, such as ‘charge all users’, ‘charge only commercial users’ or ‘support the whole research community via infrastructure funding’. The abovementioned value components have been collected from a scientific and IT perspective only. When charging all users per service is considered, diverse challenges from a financial perspective have to be solved: Who issues the legal invoices? Who tracks payments? How is the risk of non-payment calculated? Who takes care of accounting? How is the money flow organized? Can the bank account of the service-providing institution be used, or is a separate legal entity (such as a company or an association) necessary? How high are taxes in both cases? Who files the tax declaration? Solving these challenges also has to be incorporated into the respective value structure (and can significantly increase the payment costs).

An important aspect is the question of liability, up to penalties for nonperformance, for the tools and services offered. In theory, potential compensation for delayed delivery or nondelivery of services can amount to rather large sums of money. In a first, maybe naïve, approach, we assume that the risk can be minimized by adding the respective disclaimers to the terms of use of the services. Owing to the complexity of the issue, we consider a detailed discussion of the problem beyond the scope of this article.

Even if users are not charged for tools and services, it is reasonable to indicate to them the value of the offered service. This helps to increase the awareness of the users that providing these services is not free of cost.

For other scenarios additional challenges arise, e.g. for Web services, one has to cope with the ever-evolving HTML, CSS and JavaScript standards. Web browsers implement these new standards and at some point will stop supporting the old standards, leaving the Web service inaccessible to the users.
Other problems are security bugs, which leave the Web service and the user data vulnerable to attacks. However, the largest problem when deploying software on the Web is the lack of long-term support that implements new Web standards for older versions of the software. As an example, the SILVA website is implemented as an extension of a content management system (CMS). The version of the CMS in which the SILVA extension has been implemented is no longer supported, which means that the SILVA extension has to be completely rewritten for newer versions of the CMS. The effort to rewrite the CMS extension is far greater than has been accounted for in Example 2. It is hard to estimate the exact maintenance cost, as it is hard to predict such breaking changes in the environment in which a Web service runs. However, it exceeds the 55% of a software developer accounted for in Example 2.

Highly significant, but less frequently used Web services will presumably be more expensive per Web query than our SILVA example. A fixed amount of money is always needed to keep such a service available. That fixed amount is mainly caused by the personnel costs for the maintenance of the (Web) software, for the curation of the data and for user support. The variable amount, which increases with the number of service users, is mainly the increased expense for support and for bigger and/or more powerful hardware. Less utilization of a service (because of a smaller scientific community) means higher costs per Web query for offering such a service but is of course not correlated with higher scientific impact.

Key Points

A lightweight model to estimate the value structure of bioinformatics tool usage, services and training was described.
The value model depends on assumptions made for each of the five defined scenarios.
To demonstrate the application of the value model, four examples for the simplest Scenario 1 are given.
With increased experience, the necessary assumptions reflect reality more precisely, and therefore, the estimated values converge more and more to the real costs.
The values should be communicated to the user community to increase their awareness that the provision of bioinformatics services must be acknowledged and rewarded.
The value structure developed provides arguments for ensuring long-lasting support from funding organizations.

Funding

The German Federal Ministry of Education and Research (BMBF). The BMBF grant de.NBI – German Network for Bioinformatics Infrastructure (FKZ 031 A534 A to G.M., FKZ 031A536A to U.S., FKZ 031A532A and B to A.P. and FKZ 031A539B to J.F.). The funding of Martin Eisenacher is related to PURE and VALIBIO, projects of Northrhine-Westphalia. The Max Planck Society (to C.Q. and F.O.G.). The Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben (to M.L.).

Gerhard Mayer is a PhD student in the Medical Bioinformatics group at the Medizinisches Proteom Center (MPC) within the Medical Faculty of the Ruhr-University Bochum (RUB) and works in the de.NBI network. Christian Quast is a postdoc at the Max Planck Institute in Bremen, Germany. He is the project lead of the SILVA project and manages the releases as well as the software development. Additionally, he is heading the implementation of the UniEuk taxonomy framework. Janine Felden is a postdoc at the MARUM – Center for Marine Environmental Sciences, University of Bremen. She works as data and project manager with the Data Publisher for Earth and Environmental Science – PANGAEA. Matthias Lange studied computer science in Magdeburg. Since his PhD thesis in 2006, he has worked as a bioinformatician at the IPK-Gatersleben. His main interests are information retrieval and research data management.
Manuel Prinz is a bioinformatician in the Data Management and Processing IT group in the Department Theoretical Bioinformatics at the German Cancer Research Center (DKFZ) in Heidelberg. Alfred Pühler is a senior research professor at Bielefeld University for Genomics of Industrial Microorganisms. He is also a coordinator of the German Network for Bioinformatics Infrastructure (de.NBI) and Head of Node of ELIXIR-Germany. Chris Lawerenz is a group leader of the Data Management and Processing IT group in the Department Theoretical Bioinformatics at the German Cancer Research Center (DKFZ) in Heidelberg. Uwe Scholz is a group leader of the research group Bioinformatics and Information Technology at the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben and coordinates the plant service unit GCBN within de.NBI. Frank Oliver Glöckner is a group leader of the Microbial Genomics and Bioinformatics Research group at the Max Planck Institute for Marine Microbiology Bremen and Professor of Bioinformatics at Jacobs University Bremen. Wolfgang Müller is a group leader for Scientific Databases and Visualization at the Heidelberg Institute for Theoretical Studies, HITS. Katrin Marcus is a director of the Medizinisches Proteom Center, Ruhr-University Bochum with special expertise in proteomics focusing on neurodegenerative and neuromuscular diseases. She also serves as a steering committee member and chair of the HUPO Brain Proteome Project. Martin Eisenacher is a coordinator of the research unit Medical Bioinformatics at the Medizinisches Proteom Center (MPC) within the Medical Faculty of the Ruhr-University Bochum (RUB).

© The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
journal article
LitStream Collection
A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

Luczak, Brian B; James, Benjamin T; Girgis, Hani Z

2019 Briefings in Bioinformatics

doi: 10.1093/bib/bbx161 pmid: 29220512

Abstract

Motivation

Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed dozens of alignment-free statistics for measuring the similarity between two sequences.

Results

We surveyed dozens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding sequences similar to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover’s distance, which takes the length difference into account, are always among the paired statistics most highly correlated with identity scores. Similarly, paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, which reduces the memory requirement and increases the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists identify efficient alternatives to the costly alignment algorithm, saving thousands of computational hours.

Availability

The source code of the benchmarking tool is available as Supplementary Materials.
alignment-free k-mer statistics, DNA sequence comparison, k-mer histograms, paired statistics

Introduction

Throughout the past decade in the field of bioinformatics, the sheer amount of genomic data being produced has outpaced the rate at which computers can process it. Sequence comparison algorithms are among the most fundamental tools for analyzing the vast amount of DNA sequences. Devised in 1970, the Needleman–Wunsch alignment algorithm [1] was able to align the sequences of two proteins. This algorithm was shown to have extensive applications in determining the similarity between two nucleic acid or amino acid sequences. The alignment method of keeping track of insertions, deletions and substitutions between two sequences has spawned a wave of other ‘alignment-based’ approaches [2, 3] such as the popular BLAST series [4]. However, because the Needleman–Wunsch-based alignment algorithms are quadratic in terms of the sequence length, they become too costly to compute as the sequence length grows and the number of comparisons increases. For example, the deficiencies of alignment-based methods are apparent in next-generation sequencing (NGS) data with millions of reads and in the costly task of whole-genome comparison [5]. Furthermore, rearrangements of entire blocks of base pairs are highly detrimental to the way alignment is calculated [6, 7]. The realization of these issues has led to the development of many efficient ‘alignment-free’ methods [6], which will be reviewed and evaluated on DNA sequences in this study. Although many methods, such as string compression, chaos theory and universal sequence maps exist [6, 8–10], this article focuses on the widely used method of k-mer frequencies (k-tuples) or feature frequency profiles [5, 11–13]. To use this method of k-mer frequencies, a histogram of k-mer counts is generated for each of the sequences that need to be compared [14, 15].
Next, the two histograms are compared using one of the many statistical similarity/distance measures. A variety of review papers have discussed some of these methods [6, 16, 17] along with a statistical physics perspective [18]. However, no attempt has been made to review and evaluate the performance of a large number of alignment-free k-mer statistics. Further, the effects of combining multiple statistics together have not been studied yet. To this end, we have evaluated 33 statistics and the multiplicative combinations of every two statistics. One of the most important strengths of these statistics is their speed and relatively low cost [19, 20]; however, they can sometimes be less sensitive [21]. For this reason, we used the identity score obtained by the Needleman–Wunsch global alignment algorithm as the basis for comparison in several experiments. In addition, one experiment was evaluated according to local alignment identity scores. We propose several application-based methods that specifically measure each statistic’s effectiveness based on its ability to be used instead of the identity score. This manuscript is organized as follows. First, the statistics are surveyed. Next, we describe the data used in the evaluation experiments. Then, the evaluation results are presented. Finally, we conclude.

Survey of alignment-free k-mer statistics

Following earlier work that classified a comprehensive collection of histogram distances [17], we organize the survey by statistical families. The list of families includes Minkowski, match/mismatch, intersection, D2, inner product, squared chord, Markov, divergence and a variety of other statistics. Figure 1 diagrams these families and shows examples. In this section, a summary of each statistic or family is given along with some initial thoughts.

Figure 1 Alignment-free k-mer statistics grouped by statistical families. These families are based on a classification by Cha [17]. This figure provides a visual representation of several alignment-free k-mer statistics from their respective families. Each member of a statistical family shares a common functional element such as a histogram dot-product (Inner Product), minimum/maximum (Intersection), Markov model (Markov family) or overarching radical (Minkowski); although a variety of statistical families exist such as the χ2 family, many recently developed methods do not fall into any specific category, e.g. N2, DMk and EMD.

Discussion on notation

To start, we define some notation concerning k-mer frequencies and histograms, which will be a primary focus throughout the article. Let s and t denote two sequences with corresponding lengths len(s) and len(t). If we consider the set K as the set of all possible words w over the alphabet {A, C, G, T}, then the number of all possible words for DNA sequences is $4^k$, with k representing the length of each word. For the rest of the article, each of these k-length words will be referred to as a k-mer. We associate the sequences s and t with their corresponding histograms, or word-count vectors, $h_s$ and $h_t$, as shown in Equation (1).

$$h_x=\langle c(w_1),c(w_2),c(w_3),\ldots,c(w_{|K|})\rangle. \tag{1}$$

Here, $c(w_i)$ represents the count of the ith k-mer in sequence x.
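As an illustration of the word-count vector in Equation (1), the following minimal Python sketch builds a k-mer histogram by sliding a window of length k along a sequence. The function name `kmer_histogram`, the lexicographic ordering of the words and the decision to skip windows containing symbols other than A, C, G and T are assumptions made for this example; they are not taken from the benchmarking tool described in the article.

```python
from itertools import product

ALPHABET = "ACGT"

def kmer_histogram(seq, k):
    """Word-count vector h_x of Equation (1): one count per possible k-mer."""
    # All 4**k words in lexicographic order (ordering is an assumption of this sketch).
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    hist = [0] * len(words)
    # Slide a window of length k over the sequence (len(seq) - k + 1 windows).
    for i in range(len(seq) - k + 1):
        w = seq[i:i + k]
        if w in index:  # assumption: windows with non-ACGT symbols are ignored
            hist[index[w]] += 1
    return hist

h = kmer_histogram("ACGTAC", 2)
print(len(h), sum(h))  # prints "16 5": 4**2 possible words, 5 windows counted
```

The histogram length depends only on k, not on the sequence length, which is why sequences of different lengths can be compared through vectors of the same dimension.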
Minkowski family

In many areas of math and science, Euclidean distance is one of the most widely known statistics for comparing two sequences, i.e. their corresponding histograms, as shown in Equation (2).

$$\mathrm{Euclidean}(h_s,h_t)=\sqrt{\sum_{w\in K}\big(h_s(w)-h_t(w)\big)^2}. \tag{2}$$

In this equation, $h_s$ and $h_t$ are the two histograms of the sequences s and t. Although the concept of Euclidean distance has been around since the Greek era, it was not until Hermann Minkowski in the late 19th century that variations of this distance were created [17]. These variations include city block distance, which is also known as Manhattan distance, and a generalized form known as Minkowski distance, as shown in Equation (3).

$$\mathrm{Minkowski}(h_s,h_t)=\sqrt[p]{\sum_{w\in K}\big|h_s(w)-h_t(w)\big|^p}. \tag{3}$$

Instead of using the exponents 2 and 1/2 as in Euclidean distance, Minkowski considered a generalized power p. From this general idea, we have city block distance when p = 1 and Chebyshev distance, as shown in Equation (4), when p → ∞ [17].

$$\mathrm{Chebyshev}(h_s,h_t)=\max_{w\in K}\big|h_s(w)-h_t(w)\big|. \tag{4}$$

Additionally, the idea of z-score standardization is important to consider for statistics that are not self-standardizing, such as the Minkowski family. For example, creating the standardized histograms $h_s^z$ and $h_t^z$ by using the mean and SD leads to the definition of EuclideanZ, as shown in Equation (5).

$$\mathrm{EuclideanZ}(h_s,h_t)=\mathrm{Euclidean}(h_s^z,h_t^z). \tag{5}$$

Match/mismatch family

Although there are many ways to compare the two histograms, some of the most efficient methods simply check whether the counts match. As defined in the Deza Encyclopedia [22], Hamming distance counts how many times the k-mer counts match and then divides by the number of possible k-mers, as shown in Equation (6).

$$\mathrm{Hamming}(h_s,h_t)=\frac{1}{4^k}\sum_{w\in K}\big(h_s(w)==h_t(w)\big). \tag{6}$$

Here, the symbol == represents logical equality, evaluating to 1 if the two counts are the same and to 0 otherwise. Jaccard distance is the same as Hamming except that it only examines the nonzero k-mer counts.
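The Minkowski-family definitions in Equations (2)–(4) and the Hamming statistic in Equation (6) can be sketched in a few lines of Python. This is a minimal illustration operating on two equal-length lists of k-mer counts; the function names are chosen for this example, and the absolute value inside the Minkowski sum is assumed so that p = 1 yields city block distance.

```python
def minkowski(hs, ht, p):
    """Equation (3): p-th root of the summed p-th powers of count differences."""
    return sum(abs(a - b) ** p for a, b in zip(hs, ht)) ** (1.0 / p)

def euclidean(hs, ht):
    """Equation (2): the Minkowski distance with p = 2."""
    return minkowski(hs, ht, 2)

def manhattan(hs, ht):
    """City block distance: the Minkowski distance with p = 1."""
    return minkowski(hs, ht, 1)

def chebyshev(hs, ht):
    """Equation (4): the limit of the Minkowski distance as p grows."""
    return max(abs(a - b) for a, b in zip(hs, ht))

def hamming(hs, ht):
    """Equation (6): fraction of k-mers whose counts are equal."""
    return sum(1 for a, b in zip(hs, ht) if a == b) / len(hs)

# Toy histograms (illustrative values only, not data from the article).
hs = [2, 0, 1, 3]
ht = [1, 0, 4, 3]
print(manhattan(hs, ht), chebyshev(hs, ht), hamming(hs, ht))  # 4.0 3 0.5
```

With these toy vectors the count differences are 1, 0, 3 and 0, so the Euclidean distance is the square root of 10, and a large exponent such as p = 50 already gives a Minkowski value close to the Chebyshev value of 3.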
Equation (7) describes another statistic from the match/mismatch family, referred to as Mismatch distance.

$$\mathrm{Mismatch}(h_s,h_t)=\sum_{w\in K}\big(h_s(w)\neq h_t(w)\big). \tag{7}$$

Intersection family

Intersection distance, known as Czekanowski distance [17], is based on the intersection of the frequencies of k-mers divided by the union of the counts. Equation (8) defines this distance. The min function allows this statistic to effectively determine the overlap between the two distributions by recording how many of each k-mer are in both sequences. A few statistics in this same family include Kulczynski Similarity 1 and 2, as shown in Equations (9) and (10).

$$\mathrm{Intersection}(h_s,h_t)=\sum_{w\in K}\frac{2\min\big(h_s(w),h_t(w)\big)}{h_s(w)+h_t(w)}. \tag{8}$$

$$\mathrm{Kulczynski1}(h_s,h_t)=\sum_{w\in K}\frac{\min\big(h_s(w),h_t(w)\big)}{\big|h_s(w)-h_t(w)\big|}. \tag{9}$$

$$\mathrm{Kulczynski2}(h_s,h_t)=A_\mu\sum_{w\in K}\min\big(h_s(w),h_t(w)\big). \tag{10}$$

Here, $A_\mu$ is the scalar value $\frac{4^k(\mu_s+\mu_t)}{2\mu_s\mu_t}$, where $\mu_s$ and $\mu_t$ represent the mean k-mer counts for $h_s$ and $h_t$.

χ2 distance

As defined by Equation (11), χ2 distance is the sum over all the k-mers of the squared difference between the two k-mer counts divided by the sum of the two counts [17].

$$\chi^2(h_s,h_t)=\sum_{w\in K}\frac{\big(h_s(w)-h_t(w)\big)^2}{h_s(w)+h_t(w)}. \tag{11}$$

Canberra distance

Canberra distance, as in Equation (12), is at first glance somewhat of a hybrid between Manhattan distance and χ2 distance [22]. In the original source, absolute value bars are included in the denominator. However, as k-mer counts are never negative, they are not necessary when comparing sequences.

$$\mathrm{Canberra}(h_s,h_t)=\sum_{w\in K}\frac{\big|h_s(w)-h_t(w)\big|}{h_s(w)+h_t(w)}. \tag{12}$$

D2 statistic and its variations

On its own, the D2 statistic is one of the most intuitive ways to find the similarity between two sequences, as shown in Equation (13) [23].

$$D_2(h_s,h_t)=\sum_{w\in K}h_s(w)\,h_t(w). \tag{13}$$

Although taking the inner product between two histograms is time efficient, the results are not standardized, and different pairs of identical sequences can produce entirely different values. For example, taking the dot product of the vector (1, 2, 3) with itself yields 14, whereas (1, 1, 1) · (1, 1, 1) yields 3.
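The per-k-mer sums in Equations (8), (11), (12) and (13) translate directly into code, as the following minimal Python sketch shows. The function names are chosen for this example. One point the equations leave open is what to do with k-mers that are absent from both sequences, where the denominator is zero; skipping those terms is an assumption of this sketch, not a rule stated in the text.

```python
def intersection(hs, ht):
    """Equation (8); terms with a zero denominator are skipped (assumption)."""
    return sum(2 * min(a, b) / (a + b) for a, b in zip(hs, ht) if a + b != 0)

def chi_squared(hs, ht):
    """Equation (11); terms with a zero denominator are skipped (assumption)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(hs, ht) if a + b != 0)

def canberra(hs, ht):
    """Equation (12); terms with a zero denominator are skipped (assumption)."""
    return sum(abs(a - b) / (a + b) for a, b in zip(hs, ht) if a + b != 0)

def d2(hs, ht):
    """Equation (13): plain inner product of the two histograms."""
    return sum(a * b for a, b in zip(hs, ht))

# The two self-comparisons from the text: identical inputs, different D2 values.
print(d2([1, 2, 3], [1, 2, 3]), d2([1, 1, 1], [1, 1, 1]))  # 14 3
```

The last line reproduces the example from the text: both comparisons are of a vector with itself, yet D2 returns 14 in one case and 3 in the other, which is the lack of standardization that the D2 variants aim to correct.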
To fix some of the drawbacks of D2, one method is to use the mean and SD, similar to EuclideanZ in Equation (5) [24]. D2z is the dot product between the two standardized histogram vectors $h_s^z$ and $h_t^z$, as shown in Equation (14).

$$D_2^z(h_s,h_t)=\sum_{w\in K}h_s^z(w)\,h_t^z(w). \tag{14}$$

However, there are some clear ways to improve on the idea [23, 25]. The easiest way is to make a ‘self-standardized version of D2’, which will account for any differences in background noise. To describe Reinert’s statistics $D_2^S$ and, later, $D_2^*$, we must also define a few other terms. Let E(w) denote the expected probability of w, which is calculated by multiplying together the probabilities of the k nucleotides that make up w. As an additional definition, let $\tilde h$ be the updated histogram, which is calculated according to Equation (15). The final definition of $D_2^S$ can be seen in Equation (16).

$$\tilde h_s(w)=h_s(w)-\big(\mathrm{len}(s)-k+1\big)E(w). \tag{15}$$

$$D_2^S(h_s,h_t)=\sum_{w\in K}\frac{\tilde h_s(w)\,\tilde h_t(w)}{\sqrt{\tilde h_s(w)^2+\tilde h_t(w)^2}}. \tag{16}$$

Furthermore, a new word probability measure $\tilde E(w)$ is the expected probability of w in the two sequences concatenated together. This results in the additional statistic $D_2^*$, which is described by Equations (17) and (18).

$$l=\sqrt{\big(\mathrm{len}(s)-k+1\big)\big(\mathrm{len}(t)-k+1\big)}. \tag{17}$$

$$D_2^*(h_s,h_t)=\sum_{w\in K}\frac{\tilde h_s(w)\,\tilde h_t(w)}{l\,\tilde E(w)}. \tag{18}$$

Reinert ultimately concluded that $D_2^S$ and $D_2^*$ are considerably better at calculating sequence similarity because the D2 statistic is ‘measuring the sum of the departure of each sequence from the background rather than the (dis)similarity between the two sequences’ [23]. A few years later, even greater improvements were made in refining the $D_2^S$ and $D_2^*$ statistics by using a pattern transfer model [26]. In one of Wan’s papers, it was shown that the ‘power of D2* and D2S approaches a limit that is generally less than 1 when the sequence tends to infinity’.
The most effective way to combat this limited power and the irregularities of the expected values for $D_2^*$ and $D_2^S$ is to partition the sequence into b equal subintervals [26]. If we then consider $D_{2j}^*$ and $D_{2j}^S$ as calculating $D_2^*$ and $D_2^S$ over the jth subinterval, this leads to two new statistics, $T^*$ and $T^S$, as in Equations (19) and (20).

$$T^*(h_s,h_t)=\sum_{j=1}^{b}D_{2j}^*(h_s,h_t). \tag{19}$$

$$T^S(h_s,h_t)=\sum_{j=1}^{b}D_{2j}^S(h_s,h_t). \tag{20}$$

The N2 neighborhood statistic

Our next alignment-free k-mer statistic uses the novel approach of comparing weighted neighborhood counts instead of fixed k-mer frequencies [27]. Because transcription factor binding sites often do not adhere to a preset combination of k-mers, the adaptable definition of a ‘neighborhood region’ allows for increased efficiency depending on the types of sequences compared. The set of all words in the neighborhood of w is defined as n(w). This definition of the neighborhood can vary depending on the particular types of sequences, e.g. tissue-specific enhancers, that are being compared. Equation (21) is the overall weighted word count c(n(w)) for that neighborhood.

$$c(n(w))=\sum_{w'\in n(w)}a_{w'}\,c(w'), \tag{21}$$

where $a_{w'}$ is the weight associated with each particular k-mer $w'$ in the neighborhood. If each k-mer contributes equally to the neighborhood, this weight value is one. However, the weight can be tailored to the particular application if a specific k-mer is more important than others. Once we have a vector of the neighborhood counts associated with every possible word w, the next step is to standardize the vectors based on the mean and SD and then divide each vector by its norm to obtain the values $S_{c(n(K))}$ and $T_{c(n(K))}$. The final statistic N2 is the inner product between the two ‘normalized standardized neighborhood count vectors’ [27], as in Equation (22).
$$N2(h_s,h_t)=\big\langle S_{c(n(K))},T_{c(n(K))}\big\rangle. \tag{22}$$

When implementing N2 for a proposed problem, there are a variety of potential neighborhood definitions:

$n_{rc}(w)$ is the neighborhood of the word and its reverse complement, rc.
$n_{mm}(w)$ is the neighborhood of the word and all words one mismatch away (specified with Hamming distance).
$n_{mm,rc}(w)$ is the neighborhood consisting of both the reverse complement, rc, and the words one mismatch away, mm.

In addition, we considered $n_r(w)$, which is the word and its reverse, because inversion is common in transposons of the same family.

Dinucleotide absolute frequency distance

When using k-mer frequency methods, increasing k should lead to higher accuracy and increased computational cost. For a large variety of other statistics, trying to find a k value that balances accuracy and efficiency has been an important problem. Although each of the previous statistics depends on the k value, Zhang and Chen [28] created a novel statistic centered around 2-mers and the idea of di-nucleotide absolute frequency (AFd). Similar to constructing histograms, the first step is to record the frequencies of every 2-mer. Let $h_s^p$ and $h_t^p$ denote two probability histograms for sequences s and t, as in Equation (23).

$$h_s^p=\left(\frac{c(AA)}{c(A)},\frac{c(AC)}{c(A)},\ldots,\frac{c(TT)}{c(T)}\right). \tag{23}$$

In the above representation, the first element in $h_s^p$ is the count of the first dinucleotide AA divided by the count of its first nucleotide A. If len(s) is smaller than len(t), then a sliding window b of base pairs with the length of the shorter sequence s is considered first. If we let w be a 2-mer in this case, absolute frequency distance with a given window b can be seen in Equation (24).

$$\mathrm{AFd}_b(h_s,h_t)=\sum_{w\in K}\Big[\big(h_s^p(w)-h_t^p(w)\big)\,f_m\big(h_s^p(w)-h_t^p(w)\big)\Big]^2, \tag{24}$$

where $f_m(x)$ is the stabilizing function as defined in Equation (25).

$$f_m(x)=\frac{1}{(1+x)^m}. \tag{25}$$

By adjusting the window b with a sliding percentage, the final distance measure is the minimum of $\mathrm{AFd}_b$ under each possible window.
There are a variety of potential stabilizing functions; the m value in this case can be optimized to promote performance [28]. Because this article focuses on k-mer histograms rather than on locations, we consider s and t to have approximately the same length and use m = 14 in the stabilizing function [28].

Inner product family

Another example of a common family for histogram/vector comparison is the inner product family. As its name implies, this family of statistics focuses solely on the dot product of two histograms, $h_s\cdot h_t$. The dot product can be applied to either vectors of k-mer counts or probabilities [17]. As defined earlier, consider $h_s$ and $h_t$ as vectors of the counts for each k-mer. This family includes cosine distance, as in Equation (26).

$$\mathrm{Cosine}(h_s,h_t)=1-\frac{h_s\cdot h_t}{\|h_s\|\,\|h_t\|}=1-\cos(\theta). \tag{26}$$

In this equation, θ can be considered as the ‘angle’ between the two histogram vectors in $4^k$-dimensional space. The inner product of two normalized vectors, as in Equation (27), represents the similarity version of this distance.

$$\mathrm{NormVectors}(h_s,h_t)=\frac{h_s\cdot h_t}{\|h_s\|\,\|h_t\|}=\cos(\theta). \tag{27}$$

Because a large number of these statistics end up considering θ through the geometric definition of the dot product, they are also referred to as the ‘Angle Family’. For example, consider Equation (28), which describes correlation distance.

$$\mathrm{Correlation}(h_s,h_t)=1-\frac{(h_s-\mu_s)\cdot(h_t-\mu_t)}{\|h_s-\mu_s\|\,\|h_t-\mu_t\|}=1-\cos(\hat\theta). \tag{28}$$

Here, $\mu_s$ and $\mu_t$ are the means of $h_s$ and $h_t$. Also, by removing the ‘1−’, this equation simply turns into Pearson’s correlation coefficient. In this case, correlation distance is closely related to cosine distance, as shown in Equation (26), if we consider the new angle $\hat\theta$ as the angle between the adjusted histogram vectors $h_s-\mu_s$ and $h_t-\mu_t$. Other inner product statistics such as covariance similarity, as in Equation (29), also use this idea of mean-adjusted histograms [22].
$$\mathrm{Covariance}(h_s,h_t)=\frac{(h_s-\mu_s)\cdot(h_t-\mu_t)}{4^k}. \tag{29}$$

Further, Spearman distance, as referenced through MATLAB’s library, is just a variation on correlation distance (and a relative of Pearson’s). Spearman distance computes 1 minus the cosine of the angle between the tied rank vectors minus the tied rank means. Note that this statistic takes nonlinear time, O(n log n). Additionally, this family includes harmonic mean, as in Equation (30), and similarity ratio [22], as in Equation (31).

$$\mathrm{Harmonic}(h_s,h_t)=2\sum_{w\in K}\frac{h_s(w)\,h_t(w)}{h_s(w)+h_t(w)}. \tag{30}$$

$$\mathrm{SimRatio}(h_s,h_t)=\frac{h_s\cdot h_t}{(h_s\cdot h_t)+\|h_s-h_t\|}. \tag{31}$$

Gapped k-mer inner product

At this point, we have only focused on comparing k-mer histograms. As the k-mer size increases, the comparisons for sequence similarity get more accurate. At the same time, if one of the sequences is a mutated version of the other, long k-mers common to the two sequences should be infrequent. ‘Gapped k-mers’ have been proposed as a way to resolve this problem of long k-mers [29]. On its own, the process of computing the gapped k-mer counts is not too complex [30]. If we consider w to be a gapped k-mer with total length k including gaps and g to be the number of gaps in the word, then for DNA sequences, the total number of words is $|K|=\binom{k}{k-g}4^{k-g}$. The next step is to consider the upgraded histograms $\tilde h_s$ and $\tilde h_t$ of the recorded gapped k-mer frequencies for sequences s and t. Then, the article defines a similarity function, as shown in Equation (32), which is the normalized inner product between the two upgraded histograms.

$$\mathrm{Gapped}(h_s,h_t)=\frac{\tilde h_s\cdot\tilde h_t}{\|\tilde h_s\|\,\|\tilde h_t\|}. \tag{32}$$

However, one potential issue is that the number of gapped k-mers grows extremely quickly as k increases [29]. To increase efficiency, ‘the key idea is that only the full [k]-mers present in the two sequences can contribute to the similarity score via all gapped k-mers derived from them’ [29]. This idea leads to a revised definition of the inner product given in Equation (33).
$$\tilde h_s\cdot\tilde h_t=\sum_{m=0}^{k}z_m(h_s,h_t)\,w_m. \tag{33}$$

Here, m is the number of mismatches between two full k-mers; $z_m(h_s,h_t)$ is the ‘mismatch profile’, which represents the frequency of the pairs of full k-mers with m mismatches; and $w_m$ is a coefficient determined by Equation (34).

$$w_m=\begin{cases}\binom{k-m}{k-g} & \text{if } k-m\ge k-g,\\ 0 & \text{otherwise.}\end{cases} \tag{34}$$

The article asserts that obtaining $z_m(h_s,h_t)$ can be computationally expensive. However, several methods are described in the article that can effectively reduce the run-time.

Squared chord family

Families such as Minkowski use a radical over the entire summation, whereas a key characteristic of the squared chord family, as in Equation (35), is a square root over each histogram independently [17].

$$\mathrm{SquaredChord}(h_s,h_t)=\sum_{w\in K}\Big(\sqrt{h_s(w)}-\sqrt{h_t(w)}\Big)^2. \tag{35}$$

One interesting observation about the squared chord statistic comes from the simplification shown in Equation (36).

$$\mathrm{SquaredChord}(h_s,h_t)=\sum_{w\in K}\Big(h_s(w)+h_t(w)-2\sqrt{h_s(w)\,h_t(w)}\Big). \tag{36}$$

There is a well-known mathematical theorem called the Arithmetic–Geometric Mean Inequality, which states that for $a,b\ge 0$, $\frac{a+b}{2}\ge\sqrt{ab}$ [31]. In other words, $h_s(w)+h_t(w)-2\sqrt{h_s(w)\,h_t(w)}\ge 0$ always holds when $h_s(w),h_t(w)\ge 0$. Overall, the squared chord statistic appears to capture the variation between the arithmetic and geometric means of the two reported frequency vectors. If the two sequences have the same histogram, the geometric mean and the arithmetic mean are the same, resulting in a distance of 0. Equation (37) describes another statistic belonging to the same family [22].

$$\mathrm{Hellinger}(h_s,h_t)=\sqrt{2\sum_{w\in K}\left(\sqrt{\frac{h_s(w)}{\mu_s}}-\sqrt{\frac{h_t(w)}{\mu_t}}\right)^2}. \tag{37}$$

Here, $\mu_s$ and $\mu_t$ are the means of $h_s$ and $h_t$. Next, we discuss other families of alignment-free k-mer statistics that use Markov models.

Markov chain models

The premise for using a Markov chain for sequence similarity comes from the idea of a state machine and conditional probabilities [32, 33].
As we scan along a sequence with a window of size $k$ and record frequencies, it is possible to calculate the probability that the $k$th letter occurs given the current state of the preceding $k-1$ letters. The log of each probability value for the current state is then summed over the entire sequence until the state reaches the end. There is, however, a mathematically equivalent way to calculate this statistic without looking at the particular sequences themselves and only using the k-mer counts. The first step is to construct the conditional probability table based on each group of words. For example, Equation (38) describes the conditional probability when $k = 3$.

$$m_x(\mathrm{AAT}) = p_x(\mathrm{T} \mid \mathrm{AA}) = \frac{c(\mathrm{AAT})}{\sum_{n \in \{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\}} c(\mathrm{AA}n)}. \tag{38}$$

Here, $c(w)$ is the frequency of the word $w$ in the sequence $x$; for example, $c(\mathrm{AAA})$ is the frequency of AAA. The next step is to calculate the probability of the second sequence using the conditional probabilities calculated according to the first sequence, as shown in Equation (39).

$$d_{h_s}(h_t) = \sum_{w \in K} h_t(w) \ln\left(m_s(w)\right). \tag{39}$$

After that, the probability of the first sequence is computed according to the conditional probabilities of the second sequence. The final statistic is the average of $d_{h_s}(h_t)$ and $d_{h_t}(h_s)$, as shown in Equation (40).

$$\mathrm{Markov}(h_s,h_t) = \frac{d_{h_s}(h_t) + d_{h_t}(h_s)}{2}. \tag{40}$$

With the success of Markov models in bioinformatics, many variations were created to expand on the idea. Pham and Zuegg [34] invented a new statistic, called SimMM, based on Markov models. Dai, Yang and Wang [35] described SimMM as a ‘probabilistic measure based on the concept of comparing the similarity/dissimilarity between two constructed Markov models’. Pham and Zuegg started by defining a helper function as in Equation (41).

$$r(h_s,h_t) = \frac{1}{\mathrm{len}(t)} \ln\left( \frac{d_{h_s}(h_t)}{d_{h_t}(h_t)} \right). \tag{41}$$

As the helper function is not symmetric, the average of its two orientations is used in computing the final form of SimMM, as shown in Equation (42).

$$\mathrm{SimMM}(h_s,h_t) = 1 - e^{\frac{r(h_s,h_t) + r(h_t,h_s)}{2}}. \tag{42}$$

In sum, SimMM involves comparing four conditional probabilities.
The final form of the statistic is scaled using an exponential and is subtracted from 1, as shown in Equation (42). Another Markov-based statistic is the revised relative entropy, which was proposed in 2008 and sought to efficiently integrate Markov models and k-mer frequencies [35]. Let $m_s$ and $m_t$ be the conditional probability models created from sequences $s$ and $t$. Equations (43–45) define revised relative entropy for a given k-value and Markov model order $r$.

$$d_1 = \sum_{w \in K} m_s(w) \ln \frac{2\, m_s(w)}{m_s(w) + m_t(w)}. \tag{43}$$

$$d_2 = \sum_{w \in K} m_t(w) \ln \frac{2\, m_t(w)}{m_s(w) + m_t(w)}. \tag{44}$$

$$\mathrm{rre\_k\_r}(h_s,h_t) = \frac{d_1 + d_2}{2}. \tag{45}$$

This statistic is largely based on Jensen–Shannon divergence, which is covered in the next section.

Divergence

Similar to Markov chains, a wide variety of divergence statistics use probabilities and effectively compare two sequences by assessing how far apart they are in the log-probability space. For example, consider Conditional Kullback–Leibler Divergence as shown in Equation (46), also known as conditional relative entropy [36].

$$\mathrm{CKL}(h_s,h_t) = \sum_{w \in N} p_s(w) \sum_{b \in B} m_s(wb) \ln\left( \frac{m_s(wb)}{m_t(wb)} \right). \tag{46}$$

Here, $N$ is the set of all $(k-1)$-mers; $B$ is the set of the four nucleotides A, C, G and T; and $wb$ is the word consisting of the $(k-1)$-mer $w$ followed by the base $b$. Although they do not involve conditional tables, a few other divergence statistics that are commonly used are K as shown in Equation (47), Jensen–Shannon as shown in Equation (48) and Jeffrey divergence as shown in Equation (49). In the equations describing these divergence statistics, $p_s(w)$ is the probability (not the conditional probability) of $w$ under the histogram of sequence $s$, and $v(w)$ is the average probability for $w$ over both histograms.
$$K(h_s,h_t) = \sum_{w \in K} p_s(w) \ln \frac{p_s(w)}{v(w)}. \tag{47}$$

$$\mathrm{JenShan}(h_s,h_t) = \sum_{w \in K} \ln \frac{p_s(w)^{p_s(w)}\, p_t(w)^{p_t(w)}}{v(w)^{p_s(w) + p_t(w)}}. \tag{48}$$

$$\mathrm{Jeff}(h_s,h_t) = \sum_{w \in K} \left( p_s(w) - p_t(w) \right) \ln \frac{p_s(w)}{p_t(w)}. \tag{49}$$

Distance measure based on k-tuples

Although it was originally created for a specific clustering algorithm, distance measure based on k-tuples (DMk) is a novel alignment-free k-mer statistic because it makes use of k-mer counts as well as the locations within the sequence [37]. The first step is to define a term related to the density $\rho$ of the $i$th occurrence of a particular word $w$, as shown in Equation (50).

$$\rho_i(w) = \frac{1}{l_i - l_{i-1}}, \quad 1 \le i \le c(w). \tag{50}$$

Here, $l_i$ is the $i$th location of word $w$, and $c(w)$ is the count of $w$. This $\rho$ statistic captures information about the location where each k-mer occurs as well as information on the previous occurrence. Next, Equation (51) defines $\tilde{\rho}_i$ as a partial sum of the $\rho_i$ starting from the first occurrence up to the $i$th occurrence of a particular word.

$$\tilde{\rho}_i(w) = \sum_{n=1}^{i} \rho_n(w), \quad 1 \le i \le c(w). \tag{51}$$

One major benefit of this statistic is that given the vector $\tilde{\rho}(w) = (\tilde{\rho}_1(w), \tilde{\rho}_2(w), \ldots, \tilde{\rho}_{c(w)}(w))$, one can determine where and how many times $w$ appears in the sequence. Now that we have a vector of densities for each k-mer, the next step is to construct a probability distribution vector $p_i$ by dividing each $\tilde{\rho}_i$ by the sum of $\tilde{\rho}$. After that, these values can be further manipulated by applying Shannon’s entropy as shown in Equation (52).

$$\mathrm{Shan}(w) = -\sum_{i=1}^{c(w)} p_i \log_2 p_i. \tag{52}$$

When this operation is repeated for every k-mer, we have two entropy vectors $E_s$ and $E_t$ for sequences $s$ and $t$. The final statistic, as shown in Equation (53), is then computed as the Euclidean distance between the two entropy vectors:

$$\mathrm{DMk}(h_s,h_t) = \mathrm{Euclidean}(E_s, E_t). \tag{53}$$

Overall, DMk has been shown to be more effective than count-based statistics because it integrates both k-mer locations and ordering [37].
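The DMk steps in Equations (50)–(53) can be sketched in Python as follows. This is an illustrative sketch rather than the implementation of [37]: the function names are hypothetical, positions are taken to be 1-based, $l_0$ is assumed to be 0 (the text does not define it), and a word missing from one sequence is assumed to contribute an entropy of 0.

```python
import math
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each k-mer to the 1-based start positions of its occurrences."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i + 1)
    return positions


def word_entropy(locs):
    """Shannon entropy of one word from its sorted locations, Equations (50)-(52)."""
    prev = 0  # assumption: l_0 = 0, which the text leaves undefined
    running = 0.0
    partial_sums = []
    for loc in locs:
        running += 1.0 / (loc - prev)  # density of Equation (50)
        partial_sums.append(running)   # partial sum of Equation (51)
        prev = loc
    total = sum(partial_sums)
    probs = [x / total for x in partial_sums]
    return -sum(p * math.log2(p) for p in probs)  # Equation (52)


def dmk(s, t, k):
    """Euclidean distance between the two entropy vectors, Equation (53)."""
    pos_s = kmer_positions(s, k)
    pos_t = kmer_positions(t, k)
    squared = 0.0
    for w in set(pos_s) | set(pos_t):
        # assumption: a word absent from a sequence has entropy 0
        e_s = word_entropy(pos_s[w]) if w in pos_s else 0.0
        e_t = word_entropy(pos_t[w]) if w in pos_t else 0.0
        squared += (e_s - e_t) ** 2
    return math.sqrt(squared)
```

Only words observed in at least one of the two sequences are visited, so the full table of $4^k$ words is never materialized. A word that occurs once always has entropy 0 under these assumptions, so differences arise only from words that repeat.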
Earth Mover’s Distance

Earth Mover’s distance (EMD) was originally demonstrated to have applications to image databases and to the transportation problem. It focuses on analyzing the ‘minimum amount of work that must be performed to transform one distribution into another’ [38]. The same principle could also have applications to distributions of k-mers. If we consider $h_s$ as the supply distribution and $h_t$ as the demand, then EMD is effectively measuring the minimum number of k-mer counts that need to be transported from $h_s$ to $h_t$. In some ways, this statistic is similar to Manhattan distance, except that the k-mers or bins of the histogram are no longer compared one-to-one for both sequences [14, 39]. Thus, each k-mer is not treated as independent, which should perform well in the context of DNA sequences with strings of interconnected and repetitive regions. Although the statistic normally has a more complicated derivation when considering multiple dimensions, it mathematically simplifies to Equation (54) when dealing with k-mer histograms.

$$\mathrm{EMD}(h_s,h_t) = \sum_{w \in K} \left| a_s(w) - a_t(w) \right|. \tag{54}$$

In this equation, $a_s$ is the cumulative sum vector of $h_s$, calculated by $a_s(w_i) = h_s(w_1) + h_s(w_2) + \cdots + h_s(w_i)$, where $w_1$ is the first k-mer. Overall, this statistic largely depends on the order of the k-mer bins in the histogram. For our evaluation, we ordered the k-mers alphabetically. In applications involving NGS data, where all reads have the same length, the order of the histogram can be based on the order of k-mers in one of the sequences.

Length difference

Length difference (LD) is the difference in length between two sequences as in Equation (55).

$$\mathrm{LD}(s,t) = \left| \text{length of } s - \text{length of } t \right|. \tag{55}$$

Although it is a simple statistic, it can be used for reducing the number of sequence comparisons in the case of global alignment.
For example, if the minimum desired identity score is 70% and the ratio between the shorter and the longer sequence lengths is <70%, then no alignment of the two sequences can reach that threshold. Therefore, the LD metric is an important measure of sequence similarity.

Materials and methods

Statistics evaluated

The following is a list of the 33 statistics evaluated in this article: Hellinger, Manhattan, Euclidean, $\chi^2$, normalized vectors, harmonic mean, Jeffrey divergence, K-Divergence, Pearson correlation coefficient, squared chord, Kullback–Leibler conditional divergence, Markov similarity, intersection, rre_k_r, D2z, SimMM, EuclideanZ, EMD, Spearman, Jaccard, LD, D2S, AFd, mismatch, Canberra, Kulczynski Similarity 1, Kulczynski Similarity 2, similarity ratio, Jensen–Shannon Divergence, D2*, N2r, N2rc and N2rrc. Primarily, we chose these statistics to cover a variety of families as well as a good number of the latest alignment-free k-mer statistics. Additionally, we adopted the criterion that each statistic must require only k-mer frequencies as input. Any statistic that requires locations, specialized k-mers, or information beyond the scope of word histograms is not considered. Gapped k-mers, T2S, T2* and DMk are not evaluated and are included in this article for reference purposes. Because the number of paired combinations can quickly increase, other statistics are mentioned in their respective families solely for review purposes.

Calculating a k-mer histogram

For any particular k value, a k-mer is a k-length sequence of DNA. Because the ‘alphabet’ for nucleic acids is only four letters (A, C, G and T), each sequence has $4^k$ potential ‘words’. A histogram or word-count vector can be created for each sequence by scanning linearly through each k-window of letters and counting occurrences of each word. Indexing a sequence of k-mers can be implemented efficiently using Horner’s rule [40].
Selection of k

The selection of k determines the success of the alignment-free k-mer statistics. The k must lie in a certain range to ensure that the comparison of histograms is a linear process. We used Equation (56) to find k.

$$k = \left\lceil \log_4\left( \frac{1}{n} \sum_{i \in s} \mathrm{len}(i) \right) \right\rceil - 1. \tag{56}$$

Here, $n$ is the number of sequences in the set $s$. A k that is too short may not provide enough information, whereas a k that is too long increases the comparison time and memory (4-fold per increment of 1). Therefore, an overly long k might not guarantee linear time for comparing two histograms. For example, consider two sequences of length 100. Our formula gives $k = 3$. But if k gets larger, such as $k = 7$, the number of histogram entries to compare exceeds even a quadratic cost ($4^7 = 16384 > 100^2 = 10000$), negating the advantages of alignment-free k-mer statistics.

A note on pseudo-counts

When computing the statistics, many require pseudo-counts within the histograms to prevent a division by 0. This can be accomplished by adding 1 to each of the entries. In addition, these pseudo-counts are needed to allow events that ‘seem’ impossible to occur [41]. In general, most statistics that operate on probability distributions are implemented with pseudo-counts. However, several statistics either require a combination of both or behave the same regardless, i.e. the Minkowski family. Overall, if the statistic requires dividing by k-mer freq