Microbial genome analysis: the COG approach
Galperin, Michael Y; Kristensen, David M; Makarova, Kira S; Wolf, Yuri I; Koonin, Eugene V
2019 Briefings in Bioinformatics
doi: 10.1093/bib/bbx117
pmid: 28968633
Abstract

For the past 20 years, the Clusters of Orthologous Genes (COG) database has been a popular tool for microbial genome annotation and comparative genomics. Initially created for the purpose of evolutionary classification of protein families, the COGs have been used, apart from straightforward functional annotation of sequenced genomes, for such tasks as (i) unification of genome annotation in groups of related organisms; (ii) identification of missing and/or undetected genes in complete microbial genomes; (iii) analysis of genomic neighborhoods, in many cases allowing prediction of novel functional systems; (iv) analysis of metabolic pathways and prediction of alternative forms of enzymes; (v) comparison of organisms by COG functional categories; and (vi) prioritization of targets for structural and functional characterization. Here we review the principles of the COG approach and discuss its key advantages and drawbacks in microbial genome analysis.

Keywords: comparative genomics, genome annotation, enzyme evolution, orthologs, paralogs

Introduction

The success of the entire genomic enterprise critically depends on reliable genome annotation, i.e. correct identification of the genes, which includes accurate determination of gene boundaries and functional annotation of the gene product(s). The Clusters of Orthologous Groups of proteins (COGs) database was devised as a way to allow phylogenetic classification of proteins from complete microbial genomes [1]. While the COG system has grown over the years (Figure 1), the goal has always been for each COG to represent a family of orthologous protein-coding genes.
However, when the compared genomes are separated by long evolutionary distances and possess substantially different numbers of genes, evolutionary relationships between these genes are not accurately captured by the straightforward definition of orthology as a one-to-one relationship because of such evolutionary processes as lineage-specific gene duplication and loss, as well as horizontal gene transfer [7, 8]. Owing to these complexities of the evolutionary relationships among genes, the COGs have become families of co-orthologous genes that embody one-to-many and many-to-many relationships. Hence the term ‘orthologous groups’ (of proteins) that embraces such more complex evolutionary relationships among genes and simplifies the assignment of (general) functions to genes and their products. As the genomic community gradually embraced the notion of co-orthologous relationships between genes [7–9], the COGs have been re-branded Clusters of Orthologous Genes [10].

Figure 1. Evolution of the COG system. The numbers in parentheses indicate the number of bacterial, archaeal and eukaryotic genomes, respectively, included in the respective COG release [1–6].

During the 20 years since the inception of the COG project, several alternative systems for orthology analysis have been developed [11–20], some of them implementing genome-wide phylogenetic analysis, which, in principle, is supposed to provide robust resolution of evolutionary relationships between orthologs and paralogs. In practice, however, such methods are computationally expensive and fraught with artifacts at different stages, and therefore, simpler approaches such as the COGs continue to be widely used in microbial genomics.
The popular EggNOG database (‘Evolutionary genealogy of genes: Non-supervised Orthologous Groups’, http://eggnog.embl.de) applies essentially the same approach as COGs to a much greater number of genomes, but fully relies on automated assignment of orthologs and does not annotate the orthologous gene clusters [21, 22]. Here we briefly review the key principles underlying the COG approach and its applications for genome annotation and comparative analysis. Rather than providing a detailed description of the COG construction methods and the resulting collections of (co)orthologous gene families, our goal here is to highlight the unresolved problems in functional annotation and the possible ways to address them. For a description of the COG database per se, the reader is referred to the previous publications [1–6].

Differences between COGs and other collections of gene and protein families

Functional annotation of proteins encoded in sequenced genomes typically relies on BLASTP [23] or, more recently, HMMer [24] search of protein databases for the most similar sequence, followed by a (semi-)automated transfer of the best hit annotation to the new protein. This approach has a number of well-known drawbacks [25–28]. First, if the sequence similarity is low, there is a distinct possibility that the two proteins have different functions; this problem is exacerbated in cases of transitive annotation of multiple proteins in this manner. Second, the reliance on the best hit often results in a protein ending up being annotated as ‘uncharacterized’ and/or ‘putative’ even when the function of a close homolog is already known. Third, differences in domain architectures of homologous proteins often result in erroneous functional assignment.
Given these systematic errors, advanced approaches for functional annotation of proteins increasingly rely on curated databases of protein sequences [29], such as UniProt KnowledgeBase or PANTHER [30, 31], and protein domains, such as Pfam, SMART or SUPERFAMILY [32–34]. Aggregated domain databases InterPro and CDD, which allow an easy comparison of the annotations provided by various databases, often prove to be the most efficient tools [35, 36]. The COG approach shares some features with the curated protein family databases but differs from them in several important aspects.

Use of complete genomes

A distinct feature of the COG approach is the reliance on complete genome (proteome) sequences, which allows relatively simple and reliable recognition of potential orthologs and paralogs among all proteins encoded in the given genome. With incomplete genomes, there always remains the obvious possibility that the true ortholog of the given gene failed to make it into the final assembly. Like other methods for ortholog identification, the COG approach relies on sequence similarity searches against selected proteomes, aimed at the identification of pairwise best hits. However, instead of imposing predetermined similarity scores for delineation of likely homologs, the COG approach extends the popular concept of two-way (often also called bidirectional, symmetric or reciprocal) best BLAST hits in each particular proteome by adding the more stringent requirement of forming a triangle, or three-way set of best BLAST matches (thus enforcing the mathematical property of transitivity [7, 9]) to form a new COG. Owing to the presence of potential paralogs from the same lineage (inparalogs [37]), the original approach [1] only required that at least one such triangle be included that represented symmetrical (bidirectional) matches, with that criterion being imposed by manual supervision of groups initially constructed with an automated method.
Later, the process of detecting and collapsing such obvious paralogs was performed by an automated method, introduced in the first major update of the COGs [3] and later codified in the EdgeSearch algorithm [38–40]. Proteins from new genomes can be added to the existing COGs by using the new sequences as queries for an RPS-BLAST search of the collection of position-specific scoring matrices generated from COG-specific multiple sequence alignments [41]. The query is assigned to the COG that yields the best score in this search. Technically, this approach is analogous to that used to search domain databases, such as InterPro and CDD, but because the COGs contain previously identified orthologs, in this case, the best hit gives a strong indication of orthology. A detailed discussion of other methods for ortholog identification can be found e.g. in [7, 9, 42–45]. In addition to sequence similarity and phylogenetic proximity, a potentially useful criterion is genomic synteny [39, 40], which, however, in practice is typically used for manual verification of the existing assignments at the quality-control stage.

Flexible similarity cutoffs

The advantage of the triangle-based approach for orthology inference is that it dispenses with artificially imposed sequence similarity cutoffs for different protein families, some of which evolve with dramatically different rates, and permits creation of COGs from proteins that span the entire range of similarity, from barely detectable to extremely high. For example, Na+-binding c subunits (COG0636) of Na+-translocating ATP synthases from bacteria and archaea have low sequence similarity and might not be recognized as orthologs using arbitrarily high BLAST cutoffs; to further complicate the annotation, the archaeal protein is often referred to as subunit K [46]. With strict BLAST cutoffs, recognition of orthology becomes particularly complicated for short proteins, including some ribosomal proteins.
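As a minimal sketch of the triangle criterion described above, the following Python fragment finds bidirectional best hits, assembles them into triangles spanning three different genomes and joins triangles that share a side. The input structures (`best_hit`, `genome_of`), the function names and the toy proteins are hypothetical, and the merging of triangles through a shared side is an assumption made for illustration; this is not the actual COG or EdgeSearch implementation. Note that no similarity score cutoff appears anywhere: only the rank of the hits matters.

```python
from itertools import combinations


def symmetric_best_hits(best_hit, genome_of):
    """Return the set of protein pairs that are each other's best hit.

    best_hit maps (query protein, target genome) -> best-scoring protein
    of that genome; genome_of maps protein -> genome (both hypothetical).
    """
    pairs = set()
    for (query, _target_genome), hit in best_hit.items():
        # The hit must point back to the query in the query's own genome.
        if best_hit.get((hit, genome_of[query])) == query:
            pairs.add(frozenset((query, hit)))
    return pairs


def triangles(pairs, genome_of):
    """Return all triples of mutually symmetric best hits from three genomes."""
    neighbors = {}
    for pair in pairs:
        a, b = tuple(pair)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    found = set()
    for pair in pairs:
        a, b = tuple(pair)
        for c in neighbors[a] & neighbors[b]:
            if len({genome_of[a], genome_of[b], genome_of[c]}) == 3:
                found.add(frozenset((a, b, c)))
    return found


def merge_triangles(tris):
    """Join triangles that share a side (two members) into clusters."""
    tris = sorted(tris, key=sorted)
    parent = list(range(len(tris)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(tris)), 2):
        if len(tris[i] & tris[j]) >= 2:
            parent[find(i)] = find(j)
    clusters = {}
    for i, tri in enumerate(tris):
        clusters.setdefault(find(i), set()).update(tri)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: a2 is a paralog in genome A whose best hits are not reciprocated.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C", "d1": "D"}
best_hit = {
    ("a1", "B"): "b1", ("a1", "C"): "c1", ("a1", "D"): "d1",
    ("a2", "B"): "b1", ("a2", "C"): "c1", ("a2", "D"): "d1",
    ("b1", "A"): "a1", ("b1", "C"): "c1", ("b1", "D"): "d1",
    ("c1", "A"): "a1", ("c1", "B"): "b1", ("c1", "D"): "d1",
    ("d1", "A"): "a1", ("d1", "B"): "b1", ("d1", "C"): "c1",
}
pairs = symmetric_best_hits(best_hit, genome_of)
clusters = merge_triangles(triangles(pairs, genome_of))
```

In this toy case the four mutually best-matching proteins end up in one cluster, whereas `a2` is left out because its hits are not symmetric; in the procedure described in the text, such paralogs are handled in a separate step.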
The COG approach also allows separation of closely related paralogs, such as, for example, 3-isopropylmalate dehydrogenase (LeuB) and isocitrate dehydrogenase (Icd), members of COG0473 and COG0538, respectively, that in most other databases are assigned to the same family (PF00180 in Pfam, SM01329 in SMART, PS00470 in PROSITE, SSF53659 in SUPERFAMILY).

Protein family granularity in COGs

Flexible similarity cutoffs have the built-in advantage of allowing the COGs to be as wide or as narrow as dictated by the evolutionary history of a given gene family. In the above example, the LeuB/Icd family is split into two COGs, which reflects the wide distribution of these enzymes among bacteria and archaea. However, this family also includes two further, closely related enzymes. One of these is tartrate dehydrogenase/decarboxylase that has been characterized in Pseudomonas putida and Agrobacterium vitis [47, 48]. This enzyme is closely related to LeuB, still has the isopropylmalate dehydrogenase activity and has probably evolved from LeuB in the course of the adaptation of the host bacteria to life on tartrate-rich grapevine [47]. The fourth member is homoisocitrate dehydrogenase AksF, which participates in the biosynthesis of the methanoarchaeal coenzyme B [49]. Homoisocitrate dehydrogenase has been described in Methanocaldococcus jannaschii, and a variety of methanogenic archaea encode closely related proteins [49]. At this time, there are too few tartrate dehydrogenases to form a separate COG. As for homoisocitrate dehydrogenase, LeuB and AksF are co-orthologs with respect to the bacterial LeuB enzymes. Accordingly, all members of this family are currently assigned to the same COG0473 (LeuB) and the same arCOG01163 in archaeal COGs [10]. In the future, methanogenic homoisocitrate dehydrogenases might form an archaea-specific COG. For now, however, the split of the family into two COGs appears to represent a reasonable compromise.
In contrast, TIGRfams [50] and NCBI Protein Clusters [51] databases divide this family into 6 and 13 clusters, respectively. However, because sequence similarity alone does not allow unequivocal functional assignment, most of these clusters end up with the same functional annotation, either LeuB or Icd.

Phyletic profiles in COGs

An important feature of the COG approach is that a protein (or domain) either belongs or does not belong to a given COG. Accordingly, a genome is either represented in the given COG (by one or more proteins) or it is not. Thus, the COG approach can dispense with the matrix of similarity scores and replace them with the simple yes/no (1 or 0) representation or, alternatively, indicate the number of paralogous members of the given COG in the given genome. Such phyletic patterns, i.e. the patterns of species that are either represented or not represented in the given COGs, are a powerful tool for functional annotation of microbial genomes and evolutionary reconstruction. The most obvious use of phyletic patterns is for identification of supposedly essential genes that are missing in certain genomes [4, 52]. Consistent application of this principle offers an easy way to evaluate genome quality [53, 54], which is why the NCBI’s prokaryotic genome annotation pipeline currently involves routine checking of the submitted genomes for the presence of certain (nearly) universal genes, including those encoding ribosomal proteins and translation system components, as well as RNA polymerase subunits [55, 56]. A conceptually similar application of phyletic patterns involves analysis of metabolic pathways and multi-protein functional systems. Obviously, metabolic pathways should not allow accumulation of any intermediate that cannot be further metabolized and represents a dead end: to avoid poisoning the cell, such an intermediate would have to be exported into the surrounding milieu.
Likewise, an intermediate in the functional metabolic pathway needs to be either imported or synthesized within the cell. Although the possibility of ‘distributed’ pathways cannot be discarded, these simple considerations prove productive when COGs are superimposed on the metabolic map to identify the intermediates that have no known enzymes to produce or metabolize them. Identification of such gaps in pathways often suggests alternative enzymes that can be then identified experimentally [54, 57].

Functional categories of genes in COGs

Another widely used feature of the COG system is the assignment of all COGs to one of the 26 functional categories. These categories have evolved over time, with several of them (B, Y, W, Z) describing functions that are found primarily in eukaryotic cells. The recently added V (Defense mechanisms) and X (Mobilome) categories provide for a more detailed description of the dynamics of bacterial and archaeal genomes. Functional categories are assigned in accordance with the cellular roles of the respective COGs, so that, for example, peptide uptake systems are included into category E (Amino acid transport and metabolism), rather than in general ‘Transport’ or other similar categories. Two functional categories of uncharacterized proteins, R (genes with only a generic functional prediction, typically of the biochemical activity) and S (uncharacterized genes), are particularly useful, as they reflect the current level of understanding of protein function on the proteome level and allow tracing the progress in experimental characterization and computational analysis of widespread protein families. The fraction of proteins from a given genome assigned to certain COG functional categories turned out to be a useful whole-genome feature [58] and has been adopted by the Genomic Standards Consortium as an essential characteristic of the newly sequenced genomes (https://standardsingenomics.biomedcentral.com/submission-guidelines).
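Two of the whole-genome summaries discussed above, phyletic patterns and the fraction of proteins per functional category, can be illustrated with a short Python sketch. Everything in it is a hypothetical toy: the genome and protein names, the `toyCOG` identifiers, their category letters, the `"-"` label for proteins without a COG and the `min_present` threshold are assumptions made for illustration, not values taken from the COG database.

```python
from collections import Counter


def phyletic_patterns(assignments):
    """Build a 0/1 presence pattern per COG across genomes.

    assignments maps genome -> {protein: COG id or None} (hypothetical format).
    Returns the sorted genome list and {COG id: [0/1 per genome]}.
    """
    genomes = sorted(assignments)
    present = {
        g: {cog for cog in assignments[g].values() if cog} for g in genomes
    }
    all_cogs = sorted(set().union(*present.values()))
    patterns = {
        cog: [int(cog in present[g]) for g in genomes] for cog in all_cogs
    }
    return genomes, patterns


def nearly_universal_gaps(genomes, patterns, min_present):
    """List genomes lacking a COG that is present in at least min_present genomes."""
    gaps = {}
    for cog, row in patterns.items():
        if min_present <= sum(row) < len(genomes):
            gaps[cog] = [g for g, bit in zip(genomes, row) if not bit]
    return gaps


def category_fractions(proteins, category_of):
    """Fraction of a genome's proteins per functional category.

    Proteins without a COG assignment are counted under the label "-".
    """
    counts = Counter(
        category_of.get(cog, "-") if cog else "-" for cog in proteins.values()
    )
    total = len(proteins)
    return {category: n / total for category, n in counts.items()}


# Toy input: three genomes, three made-up COGs with illustrative categories.
assignments = {
    "genome1": {"p1": "toyCOG1", "p2": "toyCOG2", "p3": "toyCOG2", "p4": None},
    "genome2": {"q1": "toyCOG1", "q2": "toyCOG3"},
    "genome3": {"r1": "toyCOG1", "r2": "toyCOG2"},
}
category_of = {"toyCOG1": "E", "toyCOG2": "R", "toyCOG3": "S"}

genomes, patterns = phyletic_patterns(assignments)
gaps = nearly_universal_gaps(genomes, patterns, min_present=2)
fractions = category_fractions(assignments["genome1"], category_of)
```

Here `gaps` flags `toyCOG2` as missing from `genome2`, which is the kind of signal that, in the uses described in the text, prompts a search for an undetected gene or an alternative enzyme.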
Full-length proteins and domains as COG members

Most existing protein family databases include either full-length sequences (NCBI protein database, UniProt, PANTHER, TIGRfams [30, 31]) or separate protein domains (Pfam, SMART, SUPERFAMILY, etc. [32–34]). The COG approach allows a degree of flexibility: conserved domain combinations can be included in separate COGs without the need to split them into individual domains. As an example, along with the COG0784 for individual CheY-like receiver (REC) domains of the two-component signal transduction systems (which also includes stand-alone CheY/Spo0F proteins), the current COG collection includes 15 additional REC-domain COGs, such as COG2197 for DNA-binding response regulators of the NarL/FixJ family, containing REC and helix-turn-helix domains; COG0745 for DNA-binding response regulators of the OmpR/PhoB family, which consist of REC and winged-helix domains; COG3279 for DNA-binding response regulators of the LytR/AlgR family, containing REC and LytTR domains, and many others [59, 60]. The discrimination between the architectures of proteins that share a common domain provides for a finer granularity of annotation and allows better characterization of the respective proteins. However, uncritical use of COGs for high-throughput domain annotation can result in egregious errors, whereby a multidomain protein receives a misleading annotation of its best COG hit that has a completely different domain architecture. The recent attempts to identify specific domain architectures and limit annotation transfer to proteins with the same domain combination [36] have the potential to resolve this issue.

COG annotation

Functional annotation of COGs, including assignment of COG names, is based on two key principles. First, reliance on orthologous relationships for the COG construction makes it likely, according to the ‘orthology conjecture’, that members of each COG have equivalent functions [7] (with only rare known exceptions [61]).
Accordingly, experimentally characterized functions of a single member of a given COG often can be used to assign the functional annotation to the entire COG. Indeed, in most cases, subsequent characterization of additional COG members has confirmed the validity of the initial assignment [6]. Second, all COG names are manually curated with the goal of creating the most appropriate annotation, avoiding the common annotation errors [25], as well as over- and under-predictions. Thus, for those COGs whose members have two or more distinct functions, the annotations (COG names) get expanded to cover the entire range of experimental results. In some cases, the growing number of distinct paralogs justifies splitting a COG into two or more separate COGs with higher sequence conservation and more narrowly defined functional annotation. Many COGs, however, do not include any experimentally characterized members so that their annotation has to rely on computational analyses alone. In such cases, inference of a robust annotation requires careful analysis of their sequences, structures, genomic neighborhoods, phyletic patterns and other cues, which requires a substantial effort that, however, often leads to interesting insights [62, 63]. Such efforts are essential for increasing the fraction of proteins that belong to well-characterized COGs beyond the figure of 60–70% that is currently obtained for most bacterial and archaeal genomes [6]. The overall genome coverage by COGs (including the R- and S-type COGs) has stayed largely the same over the years and currently ranges from ∼65% of the total proteomes in Chlamydiae and Planctomycetes to >80% in Synergistetes and Thermotogae (Figure 2). This stable coverage of bacterial and archaeal genomes by COGs, despite the addition of numerous new genomes, is likely to reflect the open pangenomes of most prokaryotes [65–68] and the extremely rapid turnover of the poorly conserved gene class. 
Figure 2. Proteome coverage by the current version of COGs. Archaeal and bacterial phyla and selected classes of Firmicutes and Proteobacteria are listed as in the latest release of the COG database [6]. The orange and blue columns show the fractions of the respective proteomes covered by COGs in each taxonomic group (including R- and S-type COGs that consist of poorly characterized or uncharacterized genes), averaged over the members of that group in the COGs (the respective numbers are shown in parentheses). The ‘Other archaea’ group includes two genomes representing, respectively, Kor- and Nanoarchaeota; the ‘Other bacteria’ group includes members of Deferribacteres, Nitrospirae, Verrucomicrobia and other sparsely sampled phyla, as well as representatives of several candidate phyla. The bright yellow rectangles on top of the archaeal columns indicate the additional coverage of the archaeal proteomes in the latest version of arCOGs [10]. The hatched rectangles indicate the additional coverage of the archaeal and bacterial proteomes in the ATGC-COGs from the latest version of the ATGCs database [64].

Although COG annotations typically describe protein families, in the most recent release of the COG database, owing to the popularity of COG-based annotation, many COG names have been modified to allow functional annotation of individual proteins [6].

Unresolved problems in the COG approach

The wide use of COGs for microbial genome annotation and comparative analysis has illuminated several problems inherent in the COG approach that warrant a brief discussion. These difficulties include, among others, the issues of COG hierarchy, inclusion of paralogs, splitting proteins into separate domains and scalability of the COG approach.

Orthologs, paralogs and xenologs: the missing hierarchy

The very definition of orthology [69] inherently depends on the group of organisms under consideration [7, 9, 37]. For example, in most members of the Crenarchaeota, the family B DNA polymerases are represented by several paralogs which form distinct orthologous families (arCOG00328, arCOG00329, arCOG15272 and others) within this archaeal phylum (all these genes are out-paralogs in Crenarchaeota). In contrast, most of those bacteria that possess the polB gene have a single copy, which is co-orthologous to all archaeal polB genes, so archaea and bacteria share only one orthologous family of polB, COG0417 (all these genes are co-orthologs among prokaryotes with several in-paralogs in archaea).
Such complex relationships among homologous genes confound COG analysis because the definition of orthology becomes mutually dependent with the phyletic patterns (the definition of orthology depends on the list of organisms where these genes are present, which itself depends on which of the homologous genes are considered orthologs and which are not). Several formal and informal empirical rules have been proposed to resolve this conundrum [70]. The hierarchical orthologous groups have been implemented in such databases as EggNOG, OMA and OrthoDB [14, 22, 71]. In most of the current COG collections, all COGs are equal, and there is no hierarchical structure; only in arCOGs, an extra level of super-COGs has been introduced to combine paralogous COGs into higher level clusters. Although the non-hierarchical structure of COG collections is convenient for straightforward genome annotation, it has substantial drawbacks. Some COGs include closely related proteins with similar, if not identical, biochemical activities. In such cases, assignment of a protein to a specific COG can be taken, without justification, as an indication that the respective organism possesses one functionality but not the other. A good example is the case of glutamate and glutamine aminoacyl tRNA-synthetases (COG0008). While most bacteria encode two paralogous enzymes that charge the Glu- and Gln-specific tRNAs, archaea (as well as chlamydia, chlorobi, chloroflexi, cyanobacteria and certain members of other bacterial phyla) encode only glutamate-tRNA synthetase and produce glutaminyl-tRNA by transamidation of misacylated Glu-tRNAGln [72]. Here, both bacterial paralogs are co-orthologs for the archaeal and chlamydial enzymes, which is why they end up in a single COG. Obviously, splitting COG0008 into two subCOGs would have been a better solution, allowing a precise characterization of the respective enzymes.
In some cases, a COG includes a small subgroup with a well-characterized function but the lack of hierarchy results in annotation of generic function only (e.g. an ABC-type transporter). The single-level definition of orthology can even result in annotations that are largely arbitrary. In some cases (e.g. COG0183, Acetyl-CoA acetyltransferase), COGs are overloaded with paralogs because it is practically impossible to track all extant genes to distinct genes in the common ancestor. On other occasions (COG0050, Translation elongation factor EF-Tu, and COG5256, Translation elongation factor EF-1α), lineage-specific COGs are created for genes that are arguably orthologous because they are sufficiently distinct. The absence of multilevel hierarchy dilutes functional annotation of the characterized members of the COG and weakens the evolutionary reconstructions. Developing and implementing a hierarchical framework is one of the most pressing problems in the COG-based approach to gene classification and genome annotation.

Whole proteins versus protein domains

As noted above, COG construction is based on clustering of orthologous domains that are identified as bidirectional best hits in genome-specific BLAST searches. This approach, however, is sensitive to domain rearrangements that occurred after the divergence of the analyzed set of species from their last common ancestor. Particularly severe problems are caused by promiscuous domains, which can attract proteins to spurious COGs through significant but effectively irrelevant sequence similarity to the promiscuous domains. Although this problem can be addressed semi-automatically, e.g. by excluding the hits that cover only a small portion of the protein sequence, precise solutions still require manual intervention. On many occasions, conserved domain architectures allowed construction of consistent COGs that were not substantially affected by the presence of a shared domain (e.g.
the widespread helix-turn-helix DNA-binding domain). Conversely, the diversity of domain architectures of proteins involved in microbial signal transduction and containing a number of promiscuous domains (PAS, GAF, CHASE, GGDEF, EAL and others) required splitting some of these proteins into individual domains or domain combinations. As a result, the COGs are a mix of (i) highly specific domain architectures (such as the above-mentioned response regulators), (ii) multiple domain architectures that include a single shared domain and (iii) separate promiscuous domains. To our knowledge, as of this writing, there is no complete, formal solution for optimal dissection of full-length proteins into orthologous domains. At present, for the analysis of multidomain proteins, the best practical approaches are offered by integrated domain identification tools, such as CDD (which includes the COGs) and InterPro.

Scalability of the COG approach and specialized COG collections

The basic COG approach relies first on an exhaustive all-against-all protein comparison that scales as O(n^2) with the total number of proteins and then on a search of connected triangles in clusters of reciprocal best hits that scales as O(n^3) with the number of proteins in the cluster [38]. Inevitably, the growth of the database outpaces the availability of the computational resources, making regular major updates of the entire COG database impractical. Several divide-and-conquer strategies have been used to circumvent this major difficulty. One approach that has been implemented in several COG updates includes accommodating the new sequences into the existing COGs first, then searching for potential new COGs among the sequences that do not fit the existing ones, and then, moving some sequences from the old COGs to the new ones [10]. The principal direction, however, has involved construction of dedicated COG collections for distinct microbial taxa.
In particular, the COGs for archaea (arCOGs) went through several closely curated releases and remain up to date, having become a widely used framework for archaeal genome annotation and analysis [10, 70, 73]. As illustrated in Figure 2, detailed analysis of archaeal protein families increased the coverage of cren-, eury- and thaumarchaeal genomes by 18–20%, so that arCOGs now cover >92% of the proteins encoded in typical genomes of Crenarchaeota and Euryarchaeota. Separate projects have involved construction and analysis of COGs for Cyanobacteria and Gram-positive bacteria of the order Lactobacillales [74, 75]. The COG approach was also implemented in the database of Alignable Tight Genome Clusters (ATGC) that includes closely related bacterial and archaeal genomes [64, 76]. COGs have been constructed separately for each ATGC. These ATGC-COGs largely avoid the problems inherent in the COG analysis at larger evolutionary distances (lineage-specific paralogy, differential gene loss and differences in domain architectures) and have proved an efficient platform for various types of evolutionary reconstructions [77, 78]. In taxa for which ATGCs are available (i.e. those studied in sufficient depth so that multiple closely related genomes are available), the coverage of genomes is again raised so that ATGC-COGs now cover >95% of the proteins encoded in typical genomes (Figure 2). The COG approach has also been extended beyond cellular organisms to construct COGs for viruses that infect bacteria or archaea, and for the large DNA viruses of eukaryotes [79, 80]. The successful application of the early versions of the COGs was to a large extent based on comprehensive manual curation of the COG membership, COG names and supporting information, and a substantial body of computational analysis aimed at predicting functions for poorly characterized COGs.
This effort has led to several notable breakthroughs that have been validated by subsequent experiments and opened up new research directions, including the characterization of the CRISPR-Cas system [81, 82], prediction of the archaeal exosome [83], identification of the bacterial c-di-GMP-centered signaling network [84, 85], new bacterial toxin-antitoxin systems [86–88] and archaeal type IV secretion systems [89], and has allowed prioritization of uncharacterized proteins (COGs) for further study [90, 91]. However, scaling this labor-consuming approach to accommodate the exponentially growing amount of genomic sequence data is even more challenging than keeping the COGs up to date. The path forward is likely to combine improved automatic approaches to functional annotation with subprojects focusing on specific taxa or functional classes of COGs.

Concluding remarks

The COG approach for identification of orthologous genes was developed as a platform for comparative genomic analysis shortly after the first few microbial genomes had been sequenced. It could have been expected that in 20 years, this simple strategy based on sequence similarity hierarchy would completely give way to more sophisticated, phylogenetic approaches. This, however, is not the case, primarily because the extended orthology conjecture, according to which bidirectional best hits between genomes correspond to orthologs, and the latter possess equivalent functions, largely holds for prokaryotes given the limited extent of lineage-specific paralogy, differential gene loss and domain shuffling. In contrast, in eukaryotes where all these confounding aspects of genome evolution are pervasive, the COG approach encounters great difficulties, and robust, genome-wide orthology assignment does not seem to be feasible without full-scale phylogenomics.
Thus, the COGs are likely to remain an important tool for microbial genome analysis for years to come, so that investment of effort into refinements of this straightforward approach seems to be justified.

Key Points
Robust orthology identification is essential for accurate genome annotation.
Reconstructions of genome evolution are based on orthology and paralogy.
COGs are an essential tool in microbial genomics.
Several specialized COG projects have been developed.

Acknowledgments
The authors would like to thank all former members of the COG team for their contributions to the project.

Funding
The authors are supported by the Intramural Research Program of the US National Institutes of Health at the National Library of Medicine. D.M.K. acknowledges the support of the Department of Biomedical Engineering at the University of Iowa (Iowa City, USA).

Michael Y. Galperin is a Lead Scientist at the NCBI’s (NIH) Computational Biology Branch. He uses comparative genomics to study evolution of membrane energetics and bacterial metabolic and signaling pathways.
David M. Kristensen is an Assistant Professor at the University of Iowa’s Department of Biomedical Engineering. He uses tools of comparative genomics, bioinformatics and systems biology to study evolution of genes in viruses and microbes.
Kira S. Makarova is a Staff Scientist at the NCBI’s Computational Biology Branch. Her area of expertise is comparative genomics and sequence analysis of microbial genomes.
Yuri I. Wolf is a Lead Scientist at the National Center for Biotechnology Information in Bethesda, Maryland. His research is focused on quantitative aspects of evolutionary and comparative genomics.
Eugene V. Koonin is a Senior Investigator and Leader of the Evolutionary Genomics Group at the National Center for Biotechnology Information at the NIH. He studies various aspects of genome evolution.

References
1 Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science 1997;278:631–7.
2 Koonin EV, Tatusov RL, Galperin MY. Beyond complete genomes: from sequence to structure and function. Curr Opin Struct Biol 1998;8:355–63.
3 Tatusov RL, Galperin MY, Natale DA, et al. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 2000;28:33–6.
4 Tatusov RL, Natale DA, Garkavtsev IV, et al. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001;29:22–8.
5 Tatusov RL, Fedorova ND, Jackson JD, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003;4:41.
6 Galperin MY, Makarova KS, Wolf YI, et al. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res 2015;43:D261–9.
7 Koonin EV. Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet 2005;39:309–38.
8 Gabaldon T, Koonin EV. Functional and evolutionary implications of gene orthology. Nat Rev Genet 2013;14:360–6.
9 Kristensen DM, Wolf YI, Mushegian AR, et al. Computational methods for gene orthology inference. Brief Bioinform 2011;12:379–91.
10 Makarova KS, Wolf YI, Koonin EV. Archaeal Clusters of Orthologous Genes (arCOGs): an update and application for analysis of shared features between Thermococcales, Methanococcales, and Methanobacteriales. Life 2015;5:818–40.
11 Kanehisa M, Sato Y, Kawashima M, et al.
KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res 2016;44:D457–62.
12 Chen F, Mackey AJ, Stoeckert CJ Jr, et al. OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res 2006;34:D363–8.
13 Uchiyama I, Mihara M, Nishide H, et al. MBGD update 2015: microbial genome database for flexible ortholog analysis utilizing a diverse set of genomic data. Nucleic Acids Res 2015;43:D270–6.
14 Altenhoff AM, Skunca N, Glover N, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 2015;43:D240–9.
15 Heinicke S, Livstone MS, Lu C, et al. The Princeton Protein Orthology Database (P-POD): a comparative genomics analysis tool for biologists. PLoS One 2007;2:e766.
16 Huerta-Cepas J, Capella-Gutierrez S, Pryszcz LP, et al. PhylomeDB v4: zooming into the plurality of evolutionary histories of a genome. Nucleic Acids Res 2014;42:D897–902.
17 Kriventseva EV, Tegenfeldt F, Petty TJ, et al. OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free software. Nucleic Acids Res 2015;43:D250–6.
18 Powell S, Forslund K, Szklarczyk D, et al. eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res 2014;42:D231–9.
19 Sonnhammer EL, Ostlund G. InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Res 2015;43:D234–9.
20 Kaduk M, Riegler C, Lemp O, et al. HieranoiDB: a database of orthologs inferred by Hieranoid. Nucleic Acids Res 2017;45:D687–90.
21 Jensen LJ, Julien P, Kuhn M, et al. eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res 2008;36:D250–4.
22 Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016;44:D286–93.
23 Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–402.
24 Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol 2011;7:e1002195.
25 Galperin MY, Koonin EV. Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biol 1998;1:55–67.
26 Schnoes AM, Brown SD, Dodevski I, et al. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol 2009;5:e1000605.
27 Gilks WR, Audit B, De Angelis D, et al. Modeling the percolation of annotation errors in a database of protein sequences. Bioinformatics 2002;18:1641–9.
28 Valencia A. Automatic annotation of protein function. Curr Opin Struct Biol 2005;15:267–74.
29 Gaudet P, Livstone MS, Lewis SE, et al. Phylogenetic-based propagation of functional annotations within the Gene Ontology Consortium. Brief Bioinform 2011;12:449–62.
30 Mi H, Huang X, Muruganujan A, et al. PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements. Nucleic Acids Res 2017;45:D183–9.
31 The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2017;45:D158–69.
32 Letunic I, Doerks T, Bork P. SMART: recent updates, new developments and status in 2015. Nucleic Acids Res 2015;43:D257–60.
33 Oates ME, Stahlhacke J, Vavoulis DV, et al. The SUPERFAMILY 1.75 database in 2014: a doubling of data. Nucleic Acids Res 2015;43:D227–33.
34 Finn RD, Coggill P, Eberhardt RY, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 2016;44:D279–85.
35 Finn RD, Attwood TK, Babbitt PC, et al. InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res 2017;45:D190–9.
36 Marchler-Bauer A, Bo Y, Han L, et al. CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Res 2017;45:D200–3.
37 Sonnhammer EL, Koonin EV. Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet 2002;18:619–20.
38 Kristensen DM, Kannan L, Coleman MK, et al.
A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches. Bioinformatics 2010;26:1481–7.
39 Lechner M, Hernandez-Rosales M, Doerr D, et al. Orthology detection combining clustering and synteny for very large datasets. PLoS One 2014;9:e105015.
40 Dewey CN. Positional orthology: putting genomic evolutionary relationships into context. Brief Bioinform 2011;12:401–12.
41 Marchler-Bauer A, Zheng C, Chitsaz F, et al. CDD: conserved domains and protein three-dimensional structure. Nucleic Acids Res 2013;41:D348–52.
42 Alexeyenko A, Tamas I, Liu G, et al. Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics 2006;22:e9–15.
43 Chen F, Mackey AJ, Vermunt JK, et al. Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS One 2007;2:e383.
44 Altenhoff AM, Dessimoz C. Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput Biol 2009;5:e1000262.
45 Altenhoff AM, Dessimoz C. Inferring orthology and paralogy. Methods Mol Biol 2012;855:259–79.
46 Mulkidjanian AY, Galperin MY, Makarova KS, et al. Evolutionary primacy of sodium bioenergetics. Biol Direct 2008;3:13.
47 Tipton PA, Beecher BS. Tartrate dehydrogenase, a new member of the family of metal-dependent decarboxylating R-hydroxyacid dehydrogenases. Arch Biochem Biophys 1994;313:15–21.
48 Salomone JY, Crouzet P, De Ruffray P, et al. Characterization and distribution of tartrate utilization genes in the grapevine pathogen Agrobacterium vitis. Mol Plant Microbe Interact 1996;9:401–8.
49 Howell DM, Graupner M, Xu H, et al. Identification of enzymes homologous to isocitrate dehydrogenase that are involved in coenzyme B and leucine biosynthesis in methanoarchaea. J Bacteriol 2000;182:5013–16.
50 Haft DH, Selengut JD, Richter RA, et al. TIGRFAMs and genome properties in 2013. Nucleic Acids Res 2013;41:D387–95.
51 Klimke W, Agarwala R, Badretdin A, et al. The national center for biotechnology information's protein clusters database. Nucleic Acids Res 2009;37:D216–23.
52 Yutin N, Puigbo P, Koonin EV, et al. Phylogenomics of prokaryotic ribosomal proteins. PLoS One 2012;7:e36972.
53 Natale DA, Galperin MY, Tatusov RL, et al. Using the COG database to improve gene recognition in complete genomes. Genetica 2000;108:9–17.
54 Koonin EV, Galperin MY (2003) Sequence—Evolution—Function: Computational Approaches in Comparative Genomics. Boston: Kluwer Academic.
55 Tatusova T, Ciufo S, Fedorov B, et al. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res 2014;42:D553–9.
56 Tatusova T, DiCuccio M, Badretdin A, et al. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res 2016;44:6614–24.
57 Galperin MY, Koonin EV.
Functional genomics and enzyme evolution. Homologous and analogous enzymes encoded in microbial genomes. Genetica 1999;106:159–70.
58 Galperin MY, Kolker E. New metrics for comparative genomics. Curr Opin Biotechnol 2006;17:440–7.
59 Galperin MY. Structural classification of bacterial response regulators: diversity of output domains and domain combinations. J Bacteriol 2006;188:4169–82.
60 Galperin MY. Diversity of structure and function of response regulator output domains. Curr Opin Microbiol 2010;13:150–9.
61 Diaz R, Vargas-Lagunas C, Villalobos MA, et al. argC orthologs from Rhizobiales show diverse profiles of transcriptional efficiency and functionality in Sinorhizobium meliloti. J Bacteriol 2011;193:460–72.
62 Prunetti L, El Yacoubi B, Schiavon CR, et al. Evidence that COG0325 proteins are involved in PLP homeostasis. Microbiology 2016;162:694–706.
63 Zallot R, Yuan Y, de Crecy-Lagard V. The Escherichia coli COG1738 member YhhQ is involved in 7-cyanodeazaguanine (preQ0) transport. Biomolecules 2017;7:12.
64 Kristensen DM, Wolf YI, Koonin EV. ATGC database and ATGC-COGs: an updated resource for micro- and macro-evolutionary studies of prokaryotic genomes and protein family annotation. Nucleic Acids Res 2017;45:D210–18.
65 Tettelin H, Masignani V, Cieslewicz MJ, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci USA 2005;102:13950–5.
66 Tettelin H, Riley D, Cattuto C, et al. Comparative genomics: the bacterial pan-genome. Curr Opin Microbiol 2008;11:472–7.
67 Puigbo P, Lobkovsky AE, Kristensen DM, et al. Genomes in turmoil: quantification of genome dynamics in prokaryote supergenomes. BMC Biol 2014;12:66.
68 Wolf YI, Makarova KS, Lobkovsky AE, et al. Two fundamentally different classes of microbial genes. Nat Microbiol 2016;2:16208.
69 Fitch WM. Distinguishing homologous from analogous proteins. Syst Zool 1970;19:99–113.
70 Makarova KS, Sorokin AV, Novichkov PS, et al. Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea. Biol Direct 2007;2:33.
71 Zdobnov EM, Tegenfeldt F, Kuznetsov D, et al. OrthoDB v9.1: cataloging evolutionary and functional annotations for animal, fungal, plant, archaeal, bacterial and viral orthologs. Nucleic Acids Res 2017;45:D744–9.
72 Curnow AW, Hong K, Yuan R, et al. Glu-tRNAGln amidotransferase: a novel heterotrimeric enzyme required for correct decoding of glutamine codons during translation. Proc Natl Acad Sci USA 1997;94:11819–26.
73 Wolf YI, Makarova KS, Yutin N, et al. Updated clusters of orthologous genes for Archaea: a complex ancestor of the Archaea and the byways of horizontal gene transfer. Biol Direct 2012;7:46.
74 Mulkidjanian AY, Koonin EV, Makarova KS, et al. The cyanobacterial genome core and the origin of photosynthesis. Proc Natl Acad Sci USA 2006;103:13126–31.
75 Makarova KS, Koonin EV. Evolutionary genomics of lactic acid bacteria. J Bacteriol 2007;189:1199–208.
76 Novichkov PS, Ratnere I, Wolf YI, et al. ATGC: a database of orthologous genes from closely related prokaryotic genomes and a research platform for microevolution of prokaryotes. Nucleic Acids Res 2009;37:D448–54.
77 Novichkov PS, Wolf YI, Dubchak I, et al. Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J Bacteriol 2009;191:65–73.
78 Ran W, Kristensen DM, Koonin EV. Coupling between protein level selection and codon usage optimization in the evolution of bacteria and archaea. MBio 2014;5:e00956-14.
79 Yutin N, Colson P, Raoult D, et al. Mimiviridae: clusters of orthologous genes, reconstruction of gene repertoire evolution and proposed expansion of the giant virus family. Virol J 2013;10:106.
80 Grazziotin AL, Koonin EV, Kristensen DM. Prokaryotic Virus Orthologous Groups (pVOGs): a resource for comparative genomics and protein family annotation. Nucleic Acids Res 2017;45:D491–8.
81 Makarova KS, Aravind L, Grishin NV, et al. A DNA repair system specific for thermophilic Archaea and bacteria predicted by genomic context analysis. Nucleic Acids Res 2002;30:482–96.
82 Makarova KS, Grishin NV, Shabalina SA, et al. A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol Direct 2006;1:7.
83 Koonin EV, Wolf YI, Aravind L. Prediction of the archaeal exosome and its connections with the proteasome and the translation and transcription machineries by a comparative-genomic approach. Genome Res 2001;11:240–52.
84 Galperin MY, Nikolskaya AN, Koonin EV. Novel domains of the prokaryotic two-component signal transduction systems. FEMS Microbiol Lett 2001;203:11–21.
85 Amikam D, Galperin MY. PilZ domain is part of the bacterial c-di-GMP binding protein. Bioinformatics 2006;22:3–6.
86 Makarova KS, Wolf YI, Koonin EV. Comprehensive comparative-genomic analysis of type 2 toxin-antitoxin systems and related mobile stress response systems in prokaryotes. Biol Direct 2009;4:19.
87 Fozo EM, Makarova KS, Shabalina SA, et al. Abundance of type I toxin-antitoxin systems in bacteria: searches for new candidates and discovery of novel families. Nucleic Acids Res 2010;38:3743–59.
88 Makarova KS, Wolf YI, Snir S, et al. Defense islands in bacterial and archaeal genomes and prediction of novel defense systems. J Bacteriol 2011;193:6039–56.
89 Makarova KS, Koonin EV, Albers SV. Diversity and evolution of type IV pili systems in Archaea. Front Microbiol 2016;7:667.
90 Galperin MY, Koonin EV. ′Conserved hypothetical′ proteins: prioritization of targets for experimental study. Nucleic Acids Res 2004;32:5452–63.
91 Galperin MY, Koonin EV. From complete genome sequence to ′complete′ understanding? Trends Biotechnol 2010;28:398–406.
Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US.
MicroScope—an integrated resource for community expertise of gene functions and comparative analysis of microbial genomic and metabolic data
Médigue, Claudine; Calteau, Alexandra; Cruveiller, Stéphane; Gachet, Mathieu; Gautreau, Guillaume; Josso, Adrien; Lajus, Aurélie; Langlois, Jordan; Pereira, Hugo; Planel, Rémi; Roche, David; Rollin, Johan; Rouy, Zoe; Vallenet, David
2019 Briefings in Bioinformatics
doi: 10.1093/bib/bbx113 pmid: 28968784
Abstract
The overwhelming list of new bacterial genomes becoming available on a daily basis makes accurate genome annotation an essential step that ultimately determines the relevance of thousands of genomes stored in public databanks. The MicroScope platform (http://www.genoscope.cns.fr/agc/microscope) is an integrative resource that supports systematic and efficient revision of microbial genome annotation, data management and comparative analysis. Starting from the results of our syntactic, functional and relational annotation pipelines, MicroScope provides an integrated environment for the expert annotation and comparative analysis of prokaryotic genomes. It combines tools and graphical interfaces to analyze genomes and to perform the manual curation of gene function in a comparative genomics and metabolic context. In this article, we describe the free-of-charge MicroScope services for the annotation and analysis of microbial (meta)genomes, transcriptomic and re-sequencing data. We then present the functionalities of the platform in a way that offers practical guidance to nonspecialists in bioinformatics. Newly integrated analysis tools (i.e. prediction of virulence and resistance genes in bacterial genomes) and a recently developed original method (the pan-genome graph representation) are also described. Integrated environments such as MicroScope clearly contribute, through their user community, to maintaining accurate resources.
microbial genome annotation system, gene function curation, comparative genomics, transcriptomics, variant detection, metabolic networks

Introduction
Large-scale genome sequencing and the increasingly massive use of high-throughput approaches produce a vast amount of new information that completely transforms our understanding of thousands of species. However, despite the development of powerful bioinformatics approaches, full interpretation of the content of these genomes remains a difficult task.
To address this challenge, several integrated environments that combine and standardize information from a variety of sources and apply uniform (re-)annotation techniques have been developed (e.g. EnsemblGenomes [1], IMG [2], PATRIC [3]). In the context of the French National Sequencing Center (CEA/DRF/Genoscope), we have developed the MicroScope platform, a software environment for management, annotation, comparative analysis and visualization of microbial genomes. First published in 2006 [4], the platform has been under continuous development within the LABGeM group at CEA, and its capacities are now extensive [5–7]. MicroScope serves different use cases in bioinformatics:
It supports the integration of newly sequenced or already available prokaryotic genomes by offering a free-of-charge service to the scientific community [genome annotation, RNA sequencing (RNA-seq) and variant analyses].
It performs computational inferences, including prediction of metabolic pathways and prediction of the resistome and virulome, which can be used for genome analysis.
It provides tools for (comparative) analyses and visualization of prokaryotic genomes.
It supports collaborative expert annotation processes through the use of specific curation tools and graphical interfaces.
The present article provides a comprehensive description of MicroScope from the point of view of the end users. We start with the major objectives for which the platform was designed, and we give an overview of the main categories of MicroScope users and projects. Then we explain how to submit data and interact with the MicroScope team, and how to explore the annotated data, use the various analysis tools and perform expert annotation of gene functions. Technical details on the architecture of the system are given in the last section of this review. Where possible, earlier publications that provide more details are referenced.
We conclude with a piece of ongoing work that has led to a promising representation of the pan-genome of thousands of prokaryotic genomes.

Who is using MicroScope and for what purposes?
In the era of high-throughput sequencing technologies, the vast majority of genome sequences receive only automatic annotation, mainly based on sequence similarity, which can give spurious results [8]. Manual expert curation of gene functions is a time-consuming and expensive process, but it undoubtedly adds great value to resources. In knowledge bases such as UniProtKB [9], curation efforts remain restricted to large and widespread protein families, and these resources cannot replace the expert curation performed by specialized biologists in community systems, such as SEED [10], IMG [2] and MicroScope. Our integrated platform supports systematic and efficient revision of microbial genome annotation and data management, together with comparative genomics and metabolic analyses [4–7]. The resource provides data from completed and ongoing genome projects together with post-genomic experiments (i.e. transcriptomics; re-sequencing of evolved strains; mutant collections), allowing users to improve the understanding of gene functions. In comparison with other similar systems, MicroScope enables curation in a rich comparative genomic context and is mainly focused on (re-)annotation projects, which are built in close collaboration with microbiologists working on reference species. Indeed, MicroScope was initially dedicated to the annotation and analysis of Acinetobacter baylyi ADP1 [11] and to biologists who do not have the computing infrastructure required to perform efficient annotation and analyses of newly sequenced bacterial genomes. Our system rapidly became a ‘service’ free of charge to the scientific community at large. From <400 user accounts in 2006, MicroScope has grown to >3300 personal accounts at present (Figure 1).
The number of registered users has doubled since 2013, and the platform has widened its international reach, with 64% of accounts outside France. Many international projects are conducted through the platform, involving users from distant geographic areas [7]. Although authentication is not required to navigate in MicroScope, it allows users to annotate genes and save data in their personal session. On average per month, we count 360 active accounts (i.e. accounts whose user logged in at least once in the month) and 2200 authentications among ∼1700 monthly unique visitors.
Figure 1. Evolution of the number of integrated genomes, user accounts and expert annotations stored in MicroScope since 2002. Red scale on the right refers to the number of integrated genomes (red curve) and to the number of user accounts (orange curve). Blue scale on the left refers to the cumulated number of expert annotations.
The platform has been used to perform a complete expert annotation of several reference species such as Escherichia coli [12], Bacillus subtilis 168 [13, 14] and Pseudomonas putida KT2440 [15]. In addition, important pathogens and environmental species have also been extensively curated. The MicroScope system is now also used for variant analysis of re-sequenced bacterial strains (for example, in the context of bacterial evolution experiments) and for the analysis of transcriptomic experiments using RNA-seq data [6, 7]. Finally, the platform is also (and in some cases, exclusively) used for the set of analysis tools pertaining to microbial genomics and metabolism, which have been integrated and made available through the MicroScope Web interface (see next sections). Indeed, the MicroScope platform has been cited 690 times over the past 13 years.
As shown in Figure 1, although the number of MicroScope users having a personal account has increased significantly since 2011, the number of expert annotations made each year is clearly decreasing, reaching only 21 600 in 2016 (we registered >100 000 expert annotations in 2009). Last year, about one-tenth of the users performed curation of gene function, and a third of them made >100 expert annotations. Obviously, with the number of prokaryotic genomes being sequenced today, the time-consuming task of expert annotation cannot be applied to every genome. This is the reason why our major efforts have been focused on the development of several key functionalities that ease the expert annotation process and notably improve the final annotation quality of the analyzed genomes, at least for gene functions of interest.

An annotation service to researchers in microbiology

Interface for user data integration
Integration and analysis of genomic data into MicroScope are open and free of charge for the worldwide community of microbiologists. To standardize user submission and make it fully automated, we have developed a dedicated Web interface (https://www.genoscope.cns.fr/agc/microscope/about/services.php). The service is mainly used for the annotation of microbial genomes: both newly sequenced genomes (which remain private until the genome publication and/or their submission to public databanks) and, for comparative analysis purposes, public prokaryotic genomes (Figure 2). Moreover, three other types of services are provided for the integration of (i) genome assemblies (bins) from metagenomic samples, (ii) RNA-seq data for quantitative transcriptomics and (iii) DNA sequencing (DNA-seq) data to identify genomic variations in evolved strains (Figure 3). To ease data integration and comparative studies, standardization of contextual data about genome sequences is essential.
For metagenomes, we have added a dedicated form that follows the MIMS specifications (minimum information about a metagenome sequence [16]). When submitting assembled metagenomic data to MicroScope, users are invited to select the type of environment (e.g. soil, air, water, human-associated, plant-associated) and to complete the associated fields (e.g. collection date, environment biome, geographic location). These fields are dynamically loaded and displayed once the metagenome type is selected. Indeed, the MicroScope database model is flexible enough to store predefined descriptors, such as MIMS, as well as user-defined ones.

Figure 2. Annotation pipelines for the analysis of newly sequenced genomes and genomes already annotated in public databanks.

Figure 3. Submission of genomic data into the MicroScope platform. Four types of services are provided for the integration of (i) newly sequenced or publicly available genomes (Genome), (ii) genome assemblies/bins from metagenomic samples (Metagenome), (iii) RNA-seq data for quantitative transcriptomics (RNA-Seq) and (iv) DNA-seq data to identify genomic variations in evolved strains (Evolution). Following the three main steps of the procedure, the user is invited to complete the requested metadata describing sequencing, genomes and experimental properties, to upload FASTA (genome assemblies) or FASTQ (RNA-seq or DNA-seq reads) files and, finally, to approve the terms of service. Users are then informed by e-mail about the progress of their integration request.

At present, an average of eight genomes a day are submitted for integration into the platform (this includes bins from metagenomic samples). The resource contains data for >7400 microbial genomes, of which ∼3100 are publicly available. In addition, 607 RNA-seq runs and 756 runs corresponding to the re-sequencing of evolved strains have also been submitted for integration into MicroScope.
Running the annotation pipelines

About 25 analysis workflows include most of the currently used annotation software, plus some in-house tools and/or annotation strategies (Table 1). The newly sequenced (meta)genomes, generally submitted as several contigs that may or may not be organized into the final chromosome(s), are first analyzed by the syntactic annotation pipeline to identify protein-coding genes, transfer RNA (tRNA), ribosomal RNA (rRNA), noncoding RNA (ncRNA) and repeats (Figure 2, Table 1). For a more accurate prediction of small genes and/or genes of atypical composition, we have developed a strategy that first constructs appropriate gene models taking into account the codon usage of the studied organism. These models are then used in the core of the AMIGene program [17].

Starting with the set of genomic objects identified during the syntactic annotation process, the next step is to infer the biological functions of the predicted genes. Our functional annotation pipeline includes sequence similarity search tools using generalist (e.g. UniProtKB/Swiss-Prot) or specialized (e.g. InterPro, FigFam) databases (Table 1). Results obtained with high-quality manually curated protein sequence data sets (i.e. Swiss-Prot, E. coli K-12, B. subtilis 168 MicroScope-curated genes) are considered first in the final automatic functional annotation procedure. This procedure also takes into account the results obtained from the computation of synteny groups with complete reference prokaryotic genomes and with those available in MicroScope. Indeed, for assigning function to novel proteins, gene context approaches often complement the classical homology-based gene annotation in prokaryotes. The method we have developed offers the possibility of retaining more than one homologous gene (i.e. not only the bidirectional best hit), to allow for multiple correspondences between genes; that way, paralogy relations and/or gene fusions are easily detected [4].

Table 1.
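Software and databases integrated in the MicroScope pipelines

| Topic | Name | Software | Database | Description | Internal | URL |
|---|---|---|---|---|---|---|
| Syntactic annotation | AMIGene | x | | Coding sequences (CDS) prediction | x | http://www.genoscope.cns.fr/agc/tools/amigene |
| Syntactic annotation | Glimmer | x | | Coding sequences (CDS) prediction | | https://ccb.jhu.edu/software/glimmer |
| Syntactic annotation | Prodigal | x | | Coding sequences (CDS) prediction | | http://prodigal.ornl.gov |
| Syntactic annotation | MICheck | x | | INSDC genome CDS re-annotation | x | http://www.genoscope.cns.fr/agc/tools/micheck |
| Syntactic annotation | tRNAscan-SE | x | | tRNA prediction | | http://eddylab.org/software/tRNAscan-SE |
| Syntactic annotation | RNAmmer | x | | rRNA prediction | | http://www.cbs.dtu.dk/services/RNAmmer |
| Syntactic annotation | Rfam/Infernal | x | x | ncRNA families and prediction | | http://rfam.xfam.org, http://eddylab.org/infernal |
| Syntactic annotation | RepSeek | x | | DNA sequence repeats | | http://wwwabi.snv.jussieu.fr/public/RepSeek |
| Syntactic annotation | Alien hunter | x | | DNA compositional biases to detect HGT regions | | http://www.sanger.ac.uk/science/tools/alien-hunter |
| Syntactic annotation | SIGI-HMM | x | | DNA compositional biases to detect HGT regions | | http://www.brinkman.mbb.sfu.ca/~mlangill/sigi-hmm |
| Syntactic annotation | GenProtFeat | x | | Gene/protein features | x | |
| Syntactic annotation | Taxonomy | | x | NCBI taxonomy database | | https://www.ncbi.nlm.nih.gov/taxonomy |
| Functional annotation | BLAST+ | x | | DNA/protein sequence alignment | | https://blast.ncbi.nlm.nih.gov |
| Functional annotation | Diamond | x | | DNA/protein sequence alignment | | https://github.com/bbuchfink/diamond |
| Functional annotation | UniProtKB | | x | Protein sequence and function database | | http://www.uniprot.org |
| Functional annotation | InterPro | x | x | Protein signature and family prediction | | https://www.ebi.ac.uk/interpro |
| Functional annotation | COG | x | x | Protein family annotation and prediction | | https://www.ncbi.nlm.nih.gov/COG |
| Functional annotation | FigFam | x | x | Protein family annotation and prediction | | http://www.nmpdr.org/FIG/wiki/view.cgi/FIG/FigFam |
| Functional annotation | MICFAM | | x | Protein sequence family classification with SiLiX | x | |
| Functional annotation | SiLiX | x | | Clustering of protein sequences | | https://lbbe.univ-lyon1.fr/-SiLiX-.html |
| Functional annotation | ENZYME | | x | Enzymatic activity database | | http://enzyme.expasy.org |
| Functional annotation | PRIAM | x | | Enzymatic activity prediction | | http://priam.prabi.fr |
| Functional annotation | dbCAN | x | | Carbohydrate-active enzyme prediction | | http://csbl.bmb.uga.edu/dbCAN/ |
| Functional annotation | SignalP | x | | Signal peptide cleavage site prediction | | http://www.cbs.dtu.dk/services/SignalP |
| Functional annotation | TMHMM | x | | Transmembrane helix prediction | | http://www.cbs.dtu.dk/services/TMHMM |
| Functional annotation | LipoP | x | | Lipoprotein prediction | | http://www.cbs.dtu.dk/services/LipoP |
| Functional annotation | PSORTb | x | | Subcellular localization prediction | | http://www.psort.org |
| Functional annotation | VFDB | | x | Virulence factor database | | http://www.mgc.ac.cn/VFs |
| Functional annotation | VirulenceFinder | | x | Virulence factor database | | https://cge.cbs.dtu.dk/services/VirulenceFinder |
| Functional annotation | CARD/RGI | x | x | Antibiotic resistance database and prediction | | https://card.mcmaster.ca |
| Functional annotation | AutoFassign | x | | Automatic functional annotation of proteins | x | |
| Relational annotation | Syntonizer | x | | Synteny conservation detection | x | http://www.inrialpes.fr/helix/people/viari/cccpart/ |
| Relational annotation | Directon | x | | Operon prediction | x | |
| Relational annotation | PhyloProfile | x | | Phylogenetic profile co-evolution score | x | https://dx.doi.org/10.1186/1471-2164-13-69 |
| Relational annotation | RGP | x | | Genomic plasticity region detection | x | |
| Relational annotation | Pathway synteny | x | | Synteny involved in metabolic pathways | x | |
| Relational annotation | MIBiG/antiSMASH | x | x | Biosynthetic Gene Cluster database and prediction | | http://www.secondarymetabolites.org/ |
| Relational annotation | ChEBI | | x | Chemical compound database | | https://www.ebi.ac.uk/chebi |
| Relational annotation | Rhea | | x | Reaction database | | http://www.rhea-db.org |
| Relational annotation | KEGG | | x | Metabolic pathway database | | http://www.genome.jp/kegg |
| Relational annotation | MetaCyc/Pathway Tools | x | x | Metabolic pathway database and prediction | | https://metacyc.org, http://brg.ai.sri.com/ptools/ |
| Transcriptomics and variant discovery | SSAHA2 | x | | Read mapping | | http://www.sanger.ac.uk/science/tools/ssaha2-0 |
| Transcriptomics and variant discovery | BWA | x | | Read mapping | | https://github.com/lh3/bwa |
| Transcriptomics and variant discovery | SAMtools | x | | Mapping analysis | | http://www.htslib.org/ |
| Transcriptomics and variant discovery | bedtools | x | | Mapping analysis | | http://bedtools.readthedocs.io |
| Transcriptomics and variant discovery | PALOMA | x | | Variant detection | x | |
| Transcriptomics and variant discovery | DESeq | x | | Differential gene expression analysis | | http://bioconductor.org/packages/release/bioc/html/DESeq.html |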
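The correspondence strategy described in ‘Running the annotation pipelines’ (retaining more than one homologous gene rather than only the bidirectional best hit) can be illustrated with a small Python sketch. This is not the MicroScope implementation: the hit tuples, the `min_score` threshold and the function names are hypothetical and serve only to contrast the two strategies.

```python
from collections import defaultdict

# Hypothetical similarity hits: (query_gene, subject_gene, score); higher is better.

def best_hits(hits):
    """Map each query gene to its single best-scoring subject gene."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}

def multiple_correspondences(a_vs_b, min_score):
    """Keep every subject scoring at least min_score for each query gene."""
    kept = defaultdict(set)
    for query, subject, score in a_vs_b:
        if score >= min_score:
            kept[query].add(subject)
    return dict(kept)

# Toy data: gene a1 has two homologs (b1, b2) in the compared genome.
a_vs_b = [("a1", "b1", 200), ("a1", "b2", 180), ("a2", "b3", 150)]
b_vs_a = [("b1", "a1", 200), ("b2", "a1", 180), ("b3", "a2", 150)]

bbh = bidirectional_best_hits(a_vs_b, b_vs_a)
multi = multiple_correspondences(a_vs_b, min_score=100)
```

In this toy case the bidirectional best hit keeps only the pair (a1, b1), whereas the multiple-correspondence view also retains b2, which is the kind of signal that points to a possible paralog or gene fusion.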
Information from the syntactic and functional annotation pipelines can be placed into a biological context to understand how the predicted objects interact in functional modules such as metabolic pathways. Each genome integrated into MicroScope is processed by an in-house workflow based on the MetaCyc reference database [18] and on the Pathway Tools software [19]. This software creates a Pathway Genome DataBase (PGDB) containing the predicted pathways and reactions of an organism.
It uses a matching procedure for which we directly use as input the official MetaCyc reaction frame identifiers when available in the genome annotation; this avoids overpredicted or missed enzymatic reactions [20]. The collection of MicroScope PGDBs is made available at the MicroCyc Web site (http://www.genoscope.cns.fr/agc/microcyc) and in the MicroScope database (see ‘Exploration of metabolic data’ section). Moreover, these metabolic networks are synchronized each night with new MicroScope genomes and expert annotations.

When a public prokaryotic genome is integrated into MicroScope, the original annotations are stored in the database, and the syntactic re-annotation process, which uses the MICheck procedure, often identifies missing or wrongly annotated genes [21]. This step is useful to annotate more completely the pseudogenes found in a genome (whether ‘real’ or caused by sequencing errors), an important piece of information when comparing closely related species. Data from genomes available in public databanks generally keep their ‘public’ status in MicroScope.

A MicroScope staff to support and train a user community

As soon as annotations and comparative analysis results are processed by MicroScope, the user who submitted the genome(s) is alerted by e-mail; he/she can subsequently use a specific administration tool to grant access to collaborators and to define consultation and modification rights on the sequences (‘User Panel’ menu/‘Access Rights Management’ functionality). Continuing support and assistance to MicroScope users remain an important activity in the context of our services (or collaborative projects). These regular exchanges, together with the satisfaction surveys, are the most efficient way to keep the platform evolving in response to user needs.
Indeed, in addition to the user-friendliness of the tools integrated into the platform (see below), the short response time and the quality of feedback to individual queries are highly appreciated aspects of the MicroScope service.

Microbiologists who have submitted genomic data to the MicroScope platform are warmly invited to follow a training course organized by our team. Using the data related to their own project, attendees learn how to change or correct the current automatic functional annotations, and how to perform effective searches and analyses with the functionalities available through the Web interface. About twice a year, we provide new users with a four-and-a-half-day training course, ‘Annotation and analysis of prokaryotic genomes using the MicroScope platform’. Since 2016, we have also provided an advanced course for former trainees, so that they can remain up-to-date on recent developments. Since 2008, 450 users from 20 countries have been trained, and 13 external sessions have been organized in France and abroad (Tunisia, Denmark, Germany, Switzerland, Spain, the Netherlands and China). More information is available on our Web site: http://www.genoscope.cns.fr/agc/microscope/training.

Data integration, service continuity and data conservation (backups) are currently provided free of charge. MicroScope services follow the quality management system of our laboratory (ISO 9001:2008 and NF X50-900:2013 standards).

All the data previously described (primarily genomes, analysis results and annotations) should be made appropriately accessible to biologist users, to allow efficient curation of annotations and to develop hypotheses about specific genomes or sets of genes to be experimentally tested. The following sections describe the MicroScope Web interface (http://www.genoscope.cns.fr/agc/microscope), i.e. the components accessible to our users, via secure or anonymous connections.
For a complete description of each functionality in terms of input and output data, a complete tutorial is available here: https://microscope.readthedocs.io.

Exploration of the genomic data: simple and advanced queries

The ‘Search/Export’ menu (Figure 4) allows the user to perform BLAST and pattern searches in the MicroScope database, and to download sequences, annotation data and metabolic networks in standard file formats (GenBank, EMBL, GFF, etc.). The ‘Search by keywords’ functionality allows the user to identify genes and functions of interest using a variety of selection filters. The ‘single mode’ is used to query only one chromosome and the ‘multiple mode’ to query several replicons (of one organism) and/or several genomes. A basic keyword search enables the user to quickly retrieve genes having a particular function (e.g. ‘kinase’, ‘transporter’). Each kind of precomputed result (e.g. BLAST results on various primary data, InterPro and FigFam results) can be queried. Figure 4 shows an example of a query on the similarity searches in the CARD database [22] (‘Resistome’ data set).

Figure 4. MicroScope interface illustrating the ‘Search by keywords’ functionality. In the ‘multiple’ mode, a set of Staphylococcus species has been selected, and the BLASTP similarity results obtained with well-known resistance genes stored in the CARD database are queried using an amino acid identity threshold of at least 80% and the keywords ‘kanamycine tetracycline’. The selection of ‘At least one word’ is required to apply an ‘OR’ between the two keywords.

Keyword searches are useful to compare the current annotation of gene functions with the results, in terms of biological function, given by a specific analysis method. Moreover, the result of a query can be refined with a further query.
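The logic of such a keyword query and of its refinement can be sketched in a few lines of Python. The record fields (`gene`, `identity`, `hit_description`), the `mode` argument and the toy data below are hypothetical and are not part of the MicroScope interface; the sketch only mirrors the ‘At least one word’ (OR) option and the identity threshold mentioned in the Figure 4 legend.

```python
def matches_keywords(text, keywords, mode="any"):
    """Case-insensitive keyword test: 'any' applies OR, 'all' applies AND."""
    lowered = text.lower()
    found = [keyword.lower() in lowered for keyword in keywords]
    return any(found) if mode == "any" else all(found)

def query(records, keywords, mode="any", min_identity=0.0):
    """Keep records passing both the identity threshold and the keyword test."""
    return [
        record for record in records
        if record["identity"] >= min_identity
        and matches_keywords(record["hit_description"], keywords, mode)
    ]

# Toy similarity results (percent amino acid identity and hit description).
records = [
    {"gene": "g1", "identity": 92.0, "hit_description": "tetracycline efflux protein"},
    {"gene": "g2", "identity": 75.0, "hit_description": "kanamycin nucleotidyltransferase"},
    {"gene": "g3", "identity": 88.0, "hit_description": "kanamycin kinase"},
    {"gene": "g4", "identity": 95.0, "hit_description": "beta-lactamase"},
]

# First query: at least one keyword, identity of at least 80%.
first = query(records, ["kanamycin", "tetracycline"], mode="any", min_identity=80.0)
# Second query, run on the result of the first one.
refined = query(first, ["kinase"])
```

Because the second call takes the output of the first as its input, each refinement can only narrow the candidate list.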
For example, one can search for genes annotated as ‘protein of unknown function’ (first query) and then search for those having significant BLAST results with proteins annotated with specific functions (second query). Whatever the query, the output is a list of candidate genes whose genomic contexts can be easily visualized: next to the gene label, a magnifying-glass icon can be clicked to return to the MaGe graphical representation, with the genome browser automatically centered on the gene of interest.

MaGe (Magnifying Genome): a genome browser in the light of synteny results

The MaGe graphical interface is one of the functionalities that has been best received by users: this genome browser offers gene context exploration of the studied genome compared against other microbial genomes. The graphical representation of the synteny groups allows the user to quickly see if part of the genome being annotated shares similarities and locally conserved organization with the selected sequences. As shown in Figure 5, there is a clear synteny break in the visualized part of the E. coli CFT073 genome: the genes located between positions 5116000 and 5131000 share homologs only with the E. coli pathogenic strain ABU and, partially, with the E. coli commensal strain ED1a. The foreign origin of this region is also apparent from the coding prediction curves: the gene model used here does not fit well with the codon usage of the genes annotated in this genomic island. The example shown in Figure 5 also indicates possible paralogy relations through multiple correspondences between genes and one case of frameshift (or sequencing error) in E. coli 536 for the idnK gene (D-gluconate kinase; see Figure 5). With such a graphical representation, the conservation of genomic context is fully integrated in the process of expert curation of gene function.

Figure 5. MicroScope genome browser and synteny map.
The first graphical map contains part of the genome being analyzed (here 30 kb of E. coli CFT073), over which the user can navigate (moving and zooming functionalities). The predicted coding genes are drawn, on the six reading frames, as red rectangles together with the coding prediction curves (computed with the gene model selected by the user; ‘Matrix’ selection menu). Below this genome browser is the synteny map, in which each line shows the similarity results between the genome being annotated (E. coli CFT073) and other selected genomes (here 11 pathogenic and commensal E. coli strains; the selection is performed using the ‘Options’ functionality). On this map, a rectangle flags the existence of a gene, somewhere in the compared genome, homologous to the corresponding gene in the genome browser. If, for several co-localized CDSs on the annotated genome, there are several co-localized homologs on the compared genome, the rectangles are all of the same color; otherwise, the rectangle is white. Thus, in this map, a specific color indicates a synteny group. A rectangle is always of the same size as the reference gene in the genome browser; however, it is colored only on the part of the gene that aligns with the compared protein. This allows the user to visualize situations where the alignment is partial. There is one such case in E. coli 536, indicating that the idnK gene in this strain is a pseudogene compared with the idnK gene in CFT073. In contrast with the genome browser, there is no notion of scale on the synteny maps: to see how homologous genes are organized in a synteny group, the user can click on one rectangle in a given synteny group.

Another visualization mode has been added more recently to represent synteny conservation at different taxonomic levels (i.e. phylum, class, order, family or species).
In this ‘taxon-synteny’ mode (obtained by clicking the ‘Switch’ button, Figure 5), each line of the synteny map refers to a taxon, and colored boxes represent the percentage of synteny conservation among organisms of the corresponding taxon.

Comparative genomics tools

Computations of homologs and synteny groups between microbial genomes are the starting point of several comparative methods available in the ‘Comparative Genomics’ menu (Figure 6).

Figure 6. Comparative genomics tools of the MicroScope platform. The figure displays some of the tools available to perform in-depth comparative genomics analyses involving the bacterium of interest and one or a set of organisms: ‘Gene Phyloprofile’ (comparison of five Lactobacillus rhamnosus strains), ‘Line Plot’ (shared synteny groups found in the same DNA strand are colored in green, and in red otherwise), ‘Regions of Genomic Plasticity’ (the predicted genomic island is shown in the second layer of the circular representation), ‘Pan-core genome’ and ‘Resistome’. In this last case, the figure shows Acinetobacter baumannii AYE genes having BLASTP hits with proteins from the CARD database.

First, the ‘Fusion/Fission’ functionality provides a list of candidate genes of the selected genome potentially involved in evolutionary events such as gene fusion or fission. Such events involve so-called ‘Rosetta-stone’ proteins and suggest a high probability of functional interaction between the proteins involved [23]. Second, the ‘Gene phyloprofile’ functionality is used to find unique or common genes in the query genome with respect to other genomes of interest. Homology constraints and synteny group inclusion criteria may be applied to refine queries. Third, the ‘LinePlot’ functionality draws a global graphical representation of conserved syntenies between two selected genomes, and the ‘Regions of Genomic Plasticity (RGP)’ functionality is used to search for potential horizontal gene transfer (HGT).
The method combines (i) the results of algorithms that detect signals in the query sequence indicative of a horizontal transfer origin (tRNA hotspots, mobility genes, compositional bias [24]) and (ii) the identification of synteny breaks in the query genome in comparison with selected closely related microbial genomes. Results are reported in tabular form and on a circular representation of the genome (Figure 6). Finally, the ‘Pan/Core Genome’ functionality dynamically computes the pan-genome and its components (core-genome; variable-genome) for a set of selected organisms (up to 200). The method uses the MicroScope gene families (MICFAM) computed with the SiLiX software [25]. The sets of common (= core-genome), variable and strain-specific genes of each compared genome can be exported in a tabular file format or to a ‘Gene Cart’.

Indeed, at any level of the MicroScope Web interface, the gene list that results from the corresponding search/analysis can be selected for inclusion into a ‘Gene Cart’. The user can manage several ‘Gene Carts’ at the same time, resulting from different queries. A specific interface has been developed to perform various operations such as the intersection or the difference between two gene carts, to extract sequences or to run multiple alignments via the integrated Jalview software [26] (‘Gene Carts’ functionality of the ‘User Panel’ menu).

Two functionalities of the ‘Comparative Genomics’ menu are more specifically related to pathogen analysis (Figure 6). The ‘Resistome’ functionality uses the Comprehensive Antibiotic Resistance Database (CARD) [22], a manually curated resource containing high-quality reference data on the molecular basis of antimicrobial resistance, and the Resistance Gene Identifier (RGI) tool to predict the resistome of a genome.
The ‘Virulome’ functionality gives the results of BLAST similarity searches in three distinct data sets of virulence genes: VFDB, which contains experimentally demonstrated virulence genes [27], VirulenceFinder [28] and a subset of the main E. coli virulence genes.

Exploration of metabolic data

The ‘Metabolism’ menu of MicroScope allows the user to explore the predicted metabolic pathways using two main resources, KEGG and MetaCyc, and to use analysis tools (Figure 7).

Figure 7. Tools for the analysis of microbial metabolism. Metabolic data can be explored using the KEGG or MetaCyc metabolic pathway hierarchies. On the left, the figure shows, for one selected MicroScope genome, the mapping of the annotated EC numbers on a KEGG metabolic map (enzymes encoded by genes localized in the current genome browser region are highlighted in yellow, and those encoded by genes localized elsewhere are highlighted in green). PGDBs predicted with the Pathway Tools software are available through the ‘MicroCyc’ functionality. Comparison of metabolic pathways between a set of selected genomes is performed using the ‘Metabolic profiles’ tool: for each metabolic pathway, a completion value is computed, which corresponds to the number of reactions of the pathway found in a given genome divided by the total number of reactions in the pathway. This value can take pseudogenes into account or not. It ranges between 0 (absence of the pathway) and 1 (complete pathway). The figure also shows an example of output from antiSMASH, which predicts Biosynthetic Gene Clusters (BGCs) in prokaryotic genomes. For the NRPS/PKS cluster types, the predicted peptide monomer composition and its corresponding SMILES formula are specified. Below the graphical representation of the predicted antiSMASH cluster, a summary of MIBiG cluster similarities, BGC gene composition as well as tailoring cluster similarities is given.
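The completion value defined in the Figure 7 legend is a simple ratio, and a minimal Python sketch makes the pseudogene option explicit. The function names, the gene-to-reaction mapping and the reaction identifiers below are hypothetical; only the ratio itself (reactions of the pathway found in the genome divided by the total number of reactions in the pathway) comes from the text.

```python
def reactions_in_genome(gene_reactions, pseudogenes=frozenset(), include_pseudogenes=True):
    """Collect the reactions supported by the genes of a genome.

    gene_reactions maps a gene label to the reactions it is annotated with.
    When include_pseudogenes is False, reactions carried only by pseudogenes
    are not counted.
    """
    found = set()
    for gene, reactions in gene_reactions.items():
        if include_pseudogenes or gene not in pseudogenes:
            found.update(reactions)
    return found

def pathway_completion(pathway_reactions, genome_reactions):
    """Fraction of the pathway's reactions found in the genome (0 to 1)."""
    pathway = set(pathway_reactions)
    if not pathway:
        raise ValueError("pathway has no reactions")
    return len(pathway & set(genome_reactions)) / len(pathway)

# Toy example: a four-reaction pathway, with one reaction carried by a pseudogene.
pathway = ["R1", "R2", "R3", "R4"]
gene_reactions = {"geneA": ["R1"], "geneB": ["R2", "R9"], "geneC": ["R3"]}
pseudogenes = {"geneC"}

with_pseudo = pathway_completion(
    pathway, reactions_in_genome(gene_reactions, pseudogenes, include_pseudogenes=True)
)
without_pseudo = pathway_completion(
    pathway, reactions_in_genome(gene_reactions, pseudogenes, include_pseudogenes=False)
)
```

Reactions outside the pathway (here `R9`) do not affect the value, so it always stays between 0 and 1.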
Starting from the set of predicted and/or validated Enzyme Commission numbers (EC numbers), metabolic maps are dynamically drawn via a request to the KEGG Web server (‘KEGG’ functionality). A color code shows the number of enzymatic activities (i.e. EC numbers) of the annotated genome found in specific metabolic pathways (Figure 7). The interconnected metabolic pathways represented in KEGG are supplemented by the MicroCyc PGDBs built with the Pathway Tools software using MetaCyc as the reference metabolic database (see ‘Running the annotation pipelines’ section). The ‘MicroCyc’ functionality allows the user to browse and query the metabolic network of a target genome using the Pathway Tools Web interface [18]. These two sets of predicted pathways can be used in the ‘Metabolic profiles’ functionality. Starting with a selection of organisms and a subset (or all) of metabolic pathways from the KEGG or MetaCyc classification, the tool computes a pathway completion value for each metabolic pathway (Figure 7). These values can be used by the MeV statistical method (Java Web start application) to cluster genomes according to their metabolic capabilities. Moreover, the resulting table of completion values is a good starting point to find candidate genes for missing gene–reaction associations in specific pathways (see example in [6]). In the same way, the ‘Pathway Synteny’ functionality follows the ‘guilt by association’ strategy [29], as it combines information on synteny groups and metabolic pathways (i.e. it searches for groups of genes that share conserved synteny and are found on the same metabolic pathway). Using this interface, annotators can quickly check for reaction-hole candidate genes among the conserved misannotated genes of a given group.
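The pathway completion value used by the ‘Metabolic profiles’ functionality can be illustrated with a minimal sketch. It assumes that each pathway and each genome is represented simply as a set of reaction identifiers; the function names, identifiers and toy data below are hypothetical and are not taken from MicroScope itself.

```python
from typing import Dict, Set


def pathway_completion(pathway_reactions: Set[str], genome_reactions: Set[str]) -> float:
    """Fraction of a pathway's reactions found in a genome, between 0.0 and 1.0."""
    if not pathway_reactions:
        raise ValueError("a pathway must contain at least one reaction")
    found = pathway_reactions & genome_reactions
    return len(found) / len(pathway_reactions)


def metabolic_profile(
    pathways: Dict[str, Set[str]], genomes: Dict[str, Set[str]]
) -> Dict[str, Dict[str, float]]:
    """Completion value of every pathway in every genome (genome -> pathway -> value)."""
    return {
        genome: {
            name: pathway_completion(reactions, genome_reactions)
            for name, reactions in pathways.items()
        }
        for genome, genome_reactions in genomes.items()
    }


# Hypothetical toy data: reaction identifiers are placeholders.
pathways = {
    "pathway_A": {"R1", "R2", "R3", "R4"},
    "pathway_B": {"R5", "R6"},
}
genomes = {
    "genome_1": {"R1", "R2", "R5", "R6"},
    "genome_2": {"R3"},
}

profile = metabolic_profile(pathways, genomes)
for genome, values in profile.items():
    print(genome, values)
```

A table of this shape (genomes by pathways) is the kind of input that a clustering tool could then use to group genomes by metabolic capability; whether pseudogene-encoded reactions are counted would be decided when building each genome's reaction set.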
Finally, the ‘antiSMASH’ functionality relies on the integration of the antiSMASH (antibiotics and Secondary Metabolite Analysis Shell) program, which enables rapid genome-wide identification, annotation and analysis of secondary metabolite Biosynthesis Gene Clusters (BGCs) in microbial genomes [30]. Each predicted cluster and its genomic context are explored in a dedicated visualization window showing also a graphical representation of the gene domain composition (Figure 7). For nonribosomal peptide synthetase (NRPS) and polyketide synthase (PKS) cluster types, the predicted peptide monomer composition and its corresponding SMILES formula are specified and the corresponding predicted chemical structure is displayed. For each predicted BGC, a summary of similarities with the reference database MIBiG [31], BGC gene composition as well as tailoring cluster similarities is given. This last item relies on a knowledge database provided with antiSMASH about tailoring clusters already described in known BGCs and associated with publications. Analysis of experimental data The functionalities available in the ‘Transcriptomics’ and ‘Variant discovery’ menus rely on the results of the pipelines used to analyze data from transcriptomic projects (i.e. RNA-seq experiments) and data from evolution projects (i.e. clones of the same species at different generation times). Exploration of these experimental data has been illustrated in the two last publications of the MicroScope platform [6, 7]. The ‘Transcriptomics’ functionality allows exploring the transcript coverage along genome, expression levels of genomic objects (genes, ncRNAs) and differential expression between samples for distinct experimental conditions. All appropriate pairwise comparisons of experimental conditions can be directly queried from the interface. 
Differentially expressed genes may be projected on reconstructed metabolic networks to highlight metabolic pathways significantly affected by experimental conditions. The ‘Variant discovery’ functionality offers different tools to explore and analyze the predicted mutations (single nucleotide polymorphisms and small insertions/deletions) in their genomic and functional context. This detection takes into account raw sequencing data and associated read qualities to discriminate between true variations and sequencing errors. Expert curation of genomic and metabolic data From the results of the exploration of data and the analysis tools, MicroScope users can review and curate the automatic functional annotation of genes encoded by its genome of interest. This task is performed using the ‘Gene Editor’, which has been illustrated in the 2013 MicroScope publication [6]. Briefly, it is made of three main sections: The ‘current annotation’ section allows the user to modify, delete and add information. The functional description of gene functions is a free-text field exposed to inconsistencies across genes and genomes. We thus have also integrated enumerated lists of well-defined and nonredundant terms for the product type field (defined in GenProtEC [32]), the functional classifications (MultiFun [33] and TIGRFAMs [34]) and for the class field (inspired from the Pseudomonas Genome database [35]), which helps understanding the origin of the functional annotation (e.g. it comes from the functional description of an homologous gene for which the function has been experimentally demonstrated). The curation of associations between genes coding for enzymatic activities and the biochemical reactions catalyzed by these enzymes is performed using two main enzymatic reactions resources: MetaCyc [18] and Rhea [36]. Finally, to alert users about possible inconsistencies, annotation is checked via an automatic procedure launched when the annotation is saved in the database. 
The ‘automatic annotation’ section contains the gene function predicted by our automatic functional annotation procedure (‘MicroScope pipeline annotation’), which involves the transfer of the reliable up-to-date reference annotations to ‘strong’ orthologs, if any [4]. In case of published bacterial genome integrated in MicroScope, the section contains information on the functional annotation in nucleotide sequence databanks and UniProtKB if available. The ‘method results’ section provides, for each individual annotation tool executed, a summary of the results, visualized in a tabulated form (this includes precomputed lists of homologs and synteny groups). This integrative strategy allows annotators to quickly browse functional evidences, tracking the history of an annotation and checking the gene context conservation with an orthologous gene having an experimentally demonstrated biological function for example. Criteria for entering an expert annotation are based on different level of evidences from direct experimentation to bioinformatics evidences. The confidence status of each gene annotation is available in the class field of the gene editor. The categories are inspired by the ‘protein name confidence’ defined in PseudoCAP (Pseudomonas aeruginosa community annotation project). A set of rules allowing to choose this ‘class’ annotation category according to bioinformatics evidences is proposed in our MicroScope tutorial: https://microscope.readthedocs.io/en/latest/content/mage/info.html (‘How to choose the “Class” annotation category?’ and ‘Annotation Rules’ sections). Following the integration of novel functionalities into MicroScope, the ‘Gene Editor’ is constantly evolving. First, new interfaces allowing to ease the curation of resistance and virulence genes are under development, especially using defined ontologies such as ARO, the Antibiotic Resistance Ontology [22]. 
Second, to fully exploit the results of the different tools dedicated to genomic region analysis (e.g. antiSMASH or RGPfinder), we are currently working on the development of a specific editor to annotate gene clusters such as operons, BGCs, genomic islands, CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) regions, secretion systems and phages. Expert annotations are continuously gathered in the MicroScope database. Indeed, ∼35 000 annotations are made in a year (Figure 1), and >370 000 genes have been curated so far. A third of these annotations correspond to the description of precise molecular functions supported by direct or indirect (i.e. from homology relationships) experimental evidence. Biologists generally focus their annotations on proteins/functions of interest; however, it is interesting to note that about 50 genomes integrated in MicroScope are nearly completely curated (≥80% of the genes were expertly annotated), and 124 additional genomes have >300 curated genes. MicroScope annotations are submitted to INSDC databanks when the genomes are published and can be easily downloaded via the Web interface (‘Search/Export->Download Data’ functionality). Moreover, we provide a RESTful API for programmatic access to public genome data, and semantic Web approaches are currently used to work on the interoperability of MicroScope curated data with other European resources such as UniProtKB [9], HAMAP [37], EnsemblBacteria [1] and Rhea [36]. These developments are performed in the context of the ELIXIR bioinformatics infrastructure (https://www.elixir-europe.org). Software and database architecture The technical architecture of the MicroScope platform is shown in Figure 8. Its three components have been described and updated in the previous publications of MicroScope [5, 6]. In summary:

Figure 8. Technical architecture of the MicroScope platform.
The MicroScope platform is made of three components: (i) a ‘Process management’ system to organize workflow execution, (ii) a ‘Data management’ system, called PkGDB, to store information from databanks, genomes and computational results and (iii) a ‘Visualization’ system for textual and graphical representation of PkGDB data. Process management system The annotation pipelines are organized in a robust automated workflow management system using the jBPM framework (java Business Process Management; http://jbpm.org), which allows us to handle simultaneously millions of tasks for the analysis of several new microbial genomes. These tasks are parallelized on hundreds of CPU cores using Pegasus MPI cluster module (https://pegasus.isi.edu). The pipelines for the structural, functional and relational annotation orchestrate >50 external/internal bioinformatics software (see section ‘Running the MicroScope pipelines’). A large part of these analyses are updated at regular intervals to take into account primary databases growth and new expert annotations. Data management system The results of these analysis tools, together with the primary data used as inputs, are stored in a relational database named PkGDB and based on the open-source MySQL relational database management system and the InnoDB (for continuous data integration and incremental updates) and MyISAM (for large bulk inserts) table engines. The PkGDB architecture supports integration of automatic and human-curated functional annotations and records a history of all the modifications. Finally, for metabolic comparative analyses purposes (see the ‘Metabolic profiles’ functionality in the ‘Exploration of metabolic data’ section), relational tables have been designed in PkGDB to store information of the MicroCyc PGDBs, together with the KEGG metabolic pathways and modules. The size of PkGDB is today 1 TB for databanks and genome data, and 30 TB for the computational results (Figure 8). 
Only one instance of the database gathers all genome analyses, which eases collaborative annotation process. The Web visualization component The MicroScope Web interface (http://www.genoscope.cns.fr/agc/microscope) is developed using the Apache/PHP server-based language and consists of numerous dynamic Web pages containing textual and graphical representations for accessing and querying data. Several useful graphical applications, such as Artemis [38], MeV [39] and IGV [40], are also available in the MicroScope interface through plugged Java applications. As shown in this article, the tools are organized in a menu bar to facilitate the exploration and the curation process. At any level of the interface, a ‘Help’ functionality is available, and a complete tutorial can be found in the ‘About’ menu. Conclusion In this article, we have described the MicroScope platform from the point of view of the end user, i.e. following one of the main objectives of our prokaryotic genome annotation and comparative system: to allow biologists to submit their genomic data in a simple way and, then, to perform analysis and make relevant assessments of the predicted gene functions using (i) the functionalities for querying and browsing the computed data, (ii) the synteny results and metabolic network predictions, the combination of which can be helpful in formulating hypotheses on the biological function of nonannotated genes and (iii) a gene annotation editor giving access to the results of each method applied, together with links to several useful public resources. Among the ongoing developments described in the last update of the platform [7], we have currently made great progresses in the consensus representation of thousands bacterial genomes to provide a better analysis workflow of prokaryotic species. 
The idea is to structure the pan-genome of an organism into the set of ‘persistent’ genes (relaxed core definition, that is to say genes found in the great majority of the genomes), the ‘shell’, which gathers moderately conserved genes and the ‘cloud’ corresponding to rare and unique genes [41]. To organize pangenomic information, we are using a graph data model, where the nodes represent the protein families, and the edges represent the genome co-localization of the two protein families (weighted by the number of the genomes sharing this co-localization). A statistical method is then used to divide the pan-genome into the three main classes (persistent, shell and cloud). The next step is the integration of this representation in MicroScope to facilitate comparative analysis and data visualization of thousands of strains. We will also add functionalities allowing users to select, at any level of this pan-genome graph, a subpart of this graph and, using one genome as reference, to come back to the MaGe genome browser. We are starting to work on an instance of MicroScope based on this novel pan-genome representation that will contain most of the reference species found in the human gut microbiota. Key Points MicroScope is open to microbiologists interested in extended analyses of species of interest. MicroScope is an integrated environment allowing to perform comparative genomic and metabolic analyses. Tools and graphical interfaces for the curation of gene function are part of the specificities of the MicroScope platform. MicroScope provides a collaborative environment to share and improve knowledge on genomes. Claudine Médigue, PhD, is a research director at CNRS. She is the head of the Laboratoire d’Analyse Bioinformatiques en Génomique et Métabolisme located at Genoscope. She has worked on the annotation and comparative analysis of prokaryotic genomes >25 years. Alexandra Calteau is a senior researcher at CEA. 
She contributes to different bioanalysis projects and the development of functionalities in the MicroScope platform, mainly in the Comparative Genomics field. She is responsible for the MicroScope professional training organization, and for the quality management of the LABGeM. Stéphane Cruveiller is a senior researcher at CEA. He is managing the MicroScope services and developments for the analysis of variants discovery, transcriptomics and metagenomics data. He has specific research activities in microbial evolution. Mathieu Gachet is a master student at CEA. He is working on the improvement of metagenomic data integration in MicroScope. Guillaume Gautreau is a PhD student at CEA. He works on the development of pan-genome graphs in MicroScope and their application in metagenomics. Adrien Josso is an engineer in bioinformatics at CEA. He works on MicroScope software development for workflow management and metabolic data integration. Aurélie Lajus is an engineer in bioinformatics at CEA. She works mainly on (meta)genome project management and software integration in the MicroScope platform. Jordan Langlois is an engineer in bioinformatics at CEA. He works on software integration and Web developments in the MicroScope platform. Hugo Pereira is an engineer in bioinformatics at CEA. He works on MicroScope software development for workflow management of NGS projects. Rémi Planel is an engineer in bioinformatics at CEA. He works on MicroScope software development for pan-genome computation and Web visualization. David Roche is an engineer in bioinformatics at CEA. He works on NGS project management, software integration and Web developments in the MicroScope platform. He is also involved in the training of MicroScope users. Johan Rollin is an engineer in bioinformatics at CEA. He works on software integration and Web developments in the MicroScope platform. Zoe Rouy is an engineer in bioinformatics at CEA. 
She works mainly on (meta-)genome project management, software integration and Web developments in the MicroScope platform. David Vallenet is a senior researcher at CEA. He is managing all the technological developments of the MicroScope platform and has specific research activities in the development of methods for enzyme function prediction and metabolic network analysis.

Acknowledgements The authors would like to thank all MicroScope users for their feedback, which helped greatly in optimizing and improving many functionalities of the system. The authors also thank the entire IT system team of Genoscope for its essential contribution to the efficiency of the platform.

Funding French Government ‘Investissements d’Avenir programmes’, namely, FRANCE GENOMIQUE (grant number ANR-10-INBS-09-08); INSTITUT FRANCAIS DE BIOINFORMATIQUE (grant number ANR-11-INBS-0013).

References
1. Kersey PJ, Allen JE, Armean I, et al. Ensembl Genomes 2016: more genomes, more complexity. Nucleic Acids Res 2016;44:D574–80.
2. Chen I-MA, Markowitz VM, Palaniappan K, et al. Supporting community annotation and user collaboration in the integrated microbial genomes (IMG) system. BMC Genomics 2016;17:307.
3. Wattam AR, Davis JJ, Assaf R, et al. Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center. Nucleic Acids Res 2017;45:D535–42.
4. Vallenet D, Labarre L, Rouy Z, et al. MaGe: a microbial genome annotation system supported by synteny results. Nucleic Acids Res 2006;34:53–65.
5. Vallenet D, Engelen S, Mornico D, et al. MicroScope: a platform for microbial genome annotation and comparative genomics. Database 2009;2009:bap021.
6. Vallenet D, Belda E, Calteau A, et al. MicroScope–an integrated microbial resource for the curation and comparative analysis of genomic and metabolic data. Nucleic Acids Res 2013;41:D636–47.
7. Vallenet D, Calteau A, Cruveiller S, et al. MicroScope in 2017: an expanding and evolving integrated resource for community expertise of microbial genomes. Nucleic Acids Res 2017;45:D517–28.
8. Wilson CA, Kreychman J, Gerstein M. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol 2000;297:233–49.
9. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2017;45:D158–69.
10. Overbeek R, Begley T, Butler RM, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005;33:5691–702.
11. Barbe V, Vallenet D, Fonknechten N, et al. Unique features revealed by the genome sequence of Acinetobacter sp. ADP1, a versatile and naturally transformation competent bacterium. Nucleic Acids Res 2004;32:5766–79.
12. Touchon M, Hoede C, Tenaillon O, et al. Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths. PLoS Genet 2009;5:e1000344.
13. Barbe V, Cruveiller S, Kunst F, et al. From a consortium sequence to a unified sequence: the Bacillus subtilis 168 reference genome a decade later. Microbiology 2009;155:1758–75.
14. Belda E, Sekowska A, Le Fèvre F, et al. An updated metabolic view of the Bacillus subtilis 168 genome. Microbiology 2013;159:757–70.
15. Belda E, van Heck RG, José Lopez-Sanchez M, et al. The revisited genome of Pseudomonas putida KT2440 enlightens its value as a robust metabolic chassis. Environ Microbiol 2016;18:3403–24.
16. Field D, Garrity G, Gray T, et al. The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol 2008;26:541–7.
17. Bocs S, Cruveiller S, Vallenet D, et al. AMIGene: annotation of MIcrobial genes. Nucleic Acids Res 2003;31:3723–6.
18. Caspi R, Billington R, Ferrer L, et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 2016;44:D471–80.
19. Karp PD, Latendresse M, Paley SM, et al. Pathway Tools Version 19.0 update: software for pathway/genome informatics and systems biology. Brief Bioinform 2015;17:877–90.
20. Vieira G, Sabarly V, Bourguignon PY, et al. Core and panmetabolism in Escherichia coli. J Bacteriol 2011;193:1461–72.
21. Cruveiller S, Le Saux J, Vallenet D, et al. MICheck: a web tool for fast checking of syntactic annotations of bacterial genomes. Nucleic Acids Res 2005;33:W471–9.
22. Jia B, Raphenya AR, Alcock B, et al. CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res 2017;45:D566–73.
23. Suhre K. Inference of gene function based on gene fusion events: the Rosetta-Stone method. Methods Mol Biol 2007;396:31–41.
24. Vernikos GS, Parkhill J. Interpolated variable order motifs for identification of horizontally acquired DNA: revisiting the Salmonella pathogenicity islands. Bioinformatics 2006;22:2196–203.
25. Miele V, Penel S, Duret L. Ultra-fast sequence clustering from similarity networks with SiLiX. BMC Bioinformatics 2011;12:116.
26. Waterhouse AM, Procter JB, Martin DM, et al. Jalview Version 2–a multiple sequence alignment editor and analysis workbench. Bioinformatics 2009;25:1189–91.
27. Chen L, Zheng D, Liu B, et al. VFDB 2016: hierarchical and refined dataset for big data analysis–10 years on. Nucleic Acids Res 2016;44:D694–7.
28. Joensen KG, Scheutz F, Lund O, et al. Real-time whole-genome sequencing for routine typing, surveillance, and outbreak detection of verotoxigenic Escherichia coli. J Clin Microbiol 2014;52:1501–10.
29. Aravind L. Guilt by association: contextual information in genome analysis. Genome Res 2000;10:1074–7.
30. Blin K, Wolf T, Chevrette MG, et al. antiSMASH 4.0-improvements in chemistry prediction and gene cluster boundary identification. Nucleic Acids Res 2017;45:36–41.
31. Medema MH, Kottmann R, Yilmaz P, et al. Minimum information about a Biosynthetic Gene cluster. Nat Chem Biol 2015;11:625–31.
32. Serres MH, Goswami S, Riley M. GenProtEC: an updated and improved analysis of functions of Escherichia coli K-12 proteins. Nucleic Acids Res 2004;32:D300–2.
33. Serres MH, Riley M. MultiFun, a multifunctional classification scheme for Escherichia coli K-12 gene products. Microb Comp Genomics 2000;5:205–22.
34. Haft DH, Selengut JD, Richter RA, et al. TIGRFAMs and genome properties in 2013. Nucleic Acids Res 2013;41:D387–95.
35. Winsor GL, Griffiths EJ, Lo R, et al. Enhanced annotations and features for comparing thousands of Pseudomonas genomes in the Pseudomonas genome database. Nucleic Acids Res 2016;44:D646–53.
36. Morgat A, Lombardot T, Axelsen KB, et al. Updates in Rhea–an expert curated resource of biochemical reactions. Nucleic Acids Res 2017;45:4279.
37. Pedruzzi I, Rivoire C, Auchincloss AH, et al. HAMAP in 2015: updates to the protein family classification and annotation system. Nucleic Acids Res 2015;43:D1064–70.
38. Carver T, Harris SR, Berriman M, et al. Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics 2012;28:464–9.
39. Saeed AI, Sharov V, White J, et al. TM4: a free, open-source system for microarray data management and analysis. Biotechniques 2003;34:374–8.
40. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 2013;14:178–92.
41. Lobkovsky AE, Wolf YI, Koonin EV. Gene frequency distributions reject a neutral model of genome evolution. Genome Biol Evol 2013;5:233–42.

© The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

The BioCyc collection of microbial genomes and metabolic pathways
Karp, Peter D; Billington, Richard; Caspi, Ron; Fulcher, Carol A; Latendresse, Mario; Kothari, Anamika; Keseler, Ingrid M; Krummenacker, Markus; Midford, Peter E; Ong, Quang; Ong, Wai Kit; Paley, Suzanne M; Subhraveti, Pallavi
2019 Briefings in Bioinformatics
doi: 10.1093/bib/bbx085 pmid: 29447345
Abstract BioCyc.org is a microbial genome Web portal that combines thousands of genomes with additional information inferred by computer programs, imported from other databases and curated from the biomedical literature by biologist curators. BioCyc also provides an extensive range of query tools, visualization services and analysis software. Recent advances in BioCyc include an expansion in the content of BioCyc in terms of both the number of genomes and the types of information available for each genome; an expansion in the amount of curated content within BioCyc; and new developments in the BioCyc software tools including redesigned gene/protein pages and metabolite pages; new search tools; a new sequence-alignment tool; a new tool for visualizing groups of related metabolic pathways; and a facility called SmartTables, which enables biologists to perform analyses that previously would have required a programmer’s assistance. genome databases, microbial genome databases, metabolic pathway databases Introduction BioCyc.org is a microbial genome Web portal that combines thousands of genomes with additional information inferred by computer programs, imported from other databases (DBs) and curated from the biomedical literature by biologist curators. BioCyc also provides an extensive range of query tools, visualization services and analysis software. BioCyc has been developed over a 25 year period, beginning with the EcoCyc DB for Escherichia coli. Over time, the content of BioCyc has expanded in terms of the number of genomes, the types of information available for each genome and the amount of curated content. BioCyc has also grown to include some eukaryotic genomes (although its main emphasis is microbial). The software behind BioCyc, called Pathway Tools [1,2], has also expanded in many ways during this period, such as to support regulatory networks, omics data analysis and metabolic modeling. 
Recent enhancements include redesigned gene/protein pages and metabolite pages, new search tools, a new sequence-alignment tool, a new tool for visualizing groups of related metabolic pathways and a facility called SmartTables, which enables biologists to perform analyses that previously would have required a programmer’s assistance. Expansion of BioCyc DB content Each BioCyc DB describes one sequenced genome, with the exception of the MetaCyc DB, which describes experimentally studied metabolic pathways from all domains of life. Since 2011, BioCyc has expanded from 1000 genomes to 9300 genomes. The majority of those genomes were obtained from Genbank RefSeq and from the Human Microbiome Project complete genomes DB. As the majority of sequenced microbial genomes are of interest to a relatively small number of researchers, BioCyc emphasizes breadth and quality of information for more highly used genomes at the expense of number of genomes. To facilitate access to the more commonly used BioCyc Pathway/Genome Databases (PGDBs), we have created the set of home pages listed in Table 1. When entering BioCyc through these home pages, the user’s default organism will be set to the BioCyc PGDB for the primary strain for that species. 
Table 1 Home pages for BioCyc organisms Home page Genus ecocyc.org Escherichia coli helicobacter.biocyc.org Helicobacter pylori vibrio.biocyc.org Vibrio cholerae listeria.biocyc.org Listeria monocytogenes salmonella.biocyc.org Salmonella enterica shigella.biocyc.org Shigella flexneri cdifficile.biocyc.org Clostridium difficile mycobacterium.biocyc.org Mycobacterium tuberculosis pseudomonas.biocyc.org Pseudomonas aeruginosa yeast.biocyc.org Saccharomyces cerevisiae Home page Genus ecocyc.org Escherichia coli helicobacter.biocyc.org Helicobacter pylori vibrio.biocyc.org Vibrio cholerae listeria.biocyc.org Listeria monocytogenes salmonella.biocyc.org Salmonella enterica shigella.biocyc.org Shigella flexneri cdifficile.biocyc.org Clostridium difficile mycobacterium.biocyc.org Mycobacterium tuberculosis pseudomonas.biocyc.org Pseudomonas aeruginosa yeast.biocyc.org Saccharomyces cerevisiae Open in new tab Table 1 Home pages for BioCyc organisms Home page Genus ecocyc.org Escherichia coli helicobacter.biocyc.org Helicobacter pylori vibrio.biocyc.org Vibrio cholerae listeria.biocyc.org Listeria monocytogenes salmonella.biocyc.org Salmonella enterica shigella.biocyc.org Shigella flexneri cdifficile.biocyc.org Clostridium difficile mycobacterium.biocyc.org Mycobacterium tuberculosis pseudomonas.biocyc.org Pseudomonas aeruginosa yeast.biocyc.org Saccharomyces cerevisiae Home page Genus ecocyc.org Escherichia coli helicobacter.biocyc.org Helicobacter pylori vibrio.biocyc.org Vibrio cholerae listeria.biocyc.org Listeria monocytogenes salmonella.biocyc.org Salmonella enterica shigella.biocyc.org Shigella flexneri cdifficile.biocyc.org Clostridium difficile mycobacterium.biocyc.org Mycobacterium tuberculosis pseudomonas.biocyc.org Pseudomonas aeruginosa yeast.biocyc.org Saccharomyces cerevisiae Open in new tab Workflow for generation of BioCyc PGDBs To produce new BioCyc PGDBs, we process each BioCyc genome through the computational steps shown in Figure 1 both to 
computationally infer new information for the genome and to integrate additional information from other bioinformatics DBs. The amount of information found by the different import steps will vary for different organisms. Note that we retain the original genome annotation that was present in the downloaded genome file(s) for each organism.

Figure 1. Processing steps involved in generating the BioCyc DBs. Recently added steps are shown in bold. No relative ordering is implied between the steps along the top and the steps along the bottom.

First, Pathway Tools converts the annotated genome from the GenBank format to its internal PGDB format. Next, the computational operations in the upper portion of Figure 1 are performed. Pathway Tools modules make the following predictions [2]. Metabolic and transport reactions and metabolic pathways are predicted [3] from the reactions and pathways in the MetaCyc DB [4]. Next occurs prediction of pathway hole fillers (genes that code for enzymes catalyzing reactions with no currently assigned enzyme) and prediction of operons using both structural and functional information [5].

Orthologs among BioCyc genomes are computed by software that runs large-scale bidirectional BLAST (version 2.2.23) comparisons among all pairs of proteins in the BioCyc genomes. We use a BLAST E-value cutoff of 0.001, with all other parameters at default settings. We define two proteins A and B as orthologs if protein A from proteome PA and protein B from proteome PB are bidirectional best BLAST hits of one another, meaning that protein B is the best BLAST hit of protein A within proteome PB, and protein A is the best BLAST hit of protein B within proteome PA.
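The bidirectional-best-hit definition above can be illustrated with a minimal Python sketch. This is not the BioCyc software: the `Hit` record, the function names and the assumption that the BLAST output has already been parsed into (query, subject, E-value) triples are all hypothetical choices made for illustration. The sketch also assumes that a hit passes the cutoff when its E-value is at most 0.001, and it keeps every hit that shares the minimal E-value as a joint best hit.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple

E_VALUE_CUTOFF = 0.001  # cutoff quoted in the text


class Hit(NamedTuple):
    """One parsed BLAST hit (hypothetical record, not a BLAST or BioCyc type)."""
    query: str     # protein ID in the query proteome
    subject: str   # protein ID in the searched proteome
    evalue: float  # E-value reported for the alignment


def best_hits(hits: Iterable[Hit],
              cutoff: float = E_VALUE_CUTOFF) -> Dict[str, Set[str]]:
    """Map each query to the subject(s) with the minimal E-value within the cutoff."""
    best_e: Dict[str, float] = {}
    best: Dict[str, Set[str]] = defaultdict(set)
    for h in hits:
        if h.evalue > cutoff:          # assumption: cutoff is inclusive
            continue
        current = best_e.get(h.query)
        if current is None or h.evalue < current:
            best_e[h.query] = h.evalue
            best[h.query] = {h.subject}    # new strict minimum replaces earlier hits
        elif h.evalue == current:
            best[h.query].add(h.subject)   # tie on the minimal E-value
    return dict(best)


def bidirectional_best_hits(a_vs_b: Iterable[Hit],
                            b_vs_a: Iterable[Hit],
                            cutoff: float = E_VALUE_CUTOFF) -> Set[Tuple[str, str]]:
    """Return (A, B) pairs where B is a best hit of A and A is a best hit of B."""
    forward = best_hits(a_vs_b, cutoff)
    reverse = best_hits(b_vs_a, cutoff)
    return {
        (a, b)
        for a, subjects in forward.items()
        for b in subjects
        if a in reverse.get(b, set())
    }


if __name__ == "__main__":
    # Toy data: proteome PA searched against PB, and PB against PA.
    pa_vs_pb = [Hit("A1", "B1", 1e-50), Hit("A1", "B2", 1e-10),
                Hit("A2", "B2", 1e-30), Hit("A3", "B3", 0.5)]
    pb_vs_pa = [Hit("B1", "A1", 1e-48), Hit("B2", "A1", 1e-9),
                Hit("B2", "A2", 1e-5)]
    print(sorted(bidirectional_best_hits(pa_vs_pb, pb_vs_pa)))
```

In the toy data, A2's best hit is B2, but B2's best hit is A1, so that pair is not reciprocal and is not reported; A3's only hit exceeds the cutoff and is ignored.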
In rare cases, protein A might have multiple orthologs in proteome PB, as explained below. The best hit of protein A in proteome PB is defined by finding the minimal E-value among all hits in proteome PB in the BLAST output, and collecting all the hits for A in proteome PB that have the same minimal E-value. In other words, ties are possible, as in the case of exact gene duplications. We attempt to break ties using two methods, applied in order: first taking the hit with the maximum alignment length, and then taking the hit with the maximum number of identical aligned amino acid residues. For the first method, we compare the alignment lengths among all the hits of protein A in proteome PB that share the same minimum E-value, and the protein in proteome PB with the maximum alignment length is selected. For the second method, we compare the number of identical amino acid residues in the alignments between protein A and the hits of protein A in proteome PB that share the same minimum E-value, and the protein in proteome PB with the maximum number of identical amino acid residues is selected. In the case that ties still remain (as in the case of exact gene duplications), all ties are included in the final set of orthologs used by BioCyc. Thus, protein A could have multiple orthologs in PB, for example when multiple proteins B1, B2, etc., exist in PB and have exactly the same regions align against protein A. BioCyc does not calculate paralogs.

Pfam [6] domains are identified in BioCyc proteins by running the Pfam software. Finally, zoomable cellular overview (metabolic map) diagrams are generated for each organism.

Next, data from several third-party DBs are imported into BioCyc, as shown in the lower portion of Figure 1. Protein-feature data, such as locations of enzyme active sites, phosphorylation sites and metal-ion binding sites, are loaded from UniProt [7], as are Gene Ontology (GO) [8] annotations. Predicted subcellular localizations are loaded from PSORTdb [9].
Descriptions of promoters, transcription factor-binding sites and regulatory interactions are loaded from RegTransBase [10]. Organism phenotype data, such as aerobicity, are loaded from the National Center for Biotechnology Information (NCBI) BioSample DB, as are organism metadata, such as the geographical location of the site from which the sequenced organism was collected. Gene essentiality data have been loaded from the OGEE DB [11] and from individual articles. Phenotype microarray data have also been loaded from individual articles. We also generate Web links from BioCyc to other related DBs, such as UniProt, NCBI-Bioproject and BioSample.

BioCyc curation

After the preceding automated processing, some BioCyc DBs receive manual curation to integrate additional information and to remove some false-positive predictions. All in all, the information within the BioCyc DBs has been curated from 80 900 different publications, as shown in Table 2. The BioCyc DBs are organized into three tiers [12] to communicate the amount of manual curation that each DB has received:

Tier 1 PGDBs have received at least one person-year of curation; some PGDBs have received person-decades of curation.
Tier 2 PGDBs have received at least one person-month of curation.
Tier 3 PGDBs have received no manual curation.

Table 2. For those BioCyc version 21.0 PGDBs citing ≥100 references, we list the number of references cited by each PGDB (and from which the information in each PGDB was curated), sorted by number of citations

DB                                      Citations  Tier
MetaCyc                                 52 446     1
Escherichia coli K-12 substr. MG1655    31 555     1
Saccharomyces cerevisiae S288c          12 018     1
Bacillus subtilis subtilis 168          3682       2
Clostridioides difficile 630            2027       2
Mycobacterium tuberculosis H37Rv        1521       2
Chlamydomonas reinhardtii               1233       2
Candida albicans SC5314                 623        2
Streptomyces coelicolor A3(2)           343        2
Synechococcus elongatus PCC 7942        284        2
Agrobacterium fabrum C58                257        2
Leishmania major strain Friedlin        212        1
Corynebacterium glutamicum ATCC 13032   184        2
Listeria monocytogenes 10403S           176        2
Candidatus Evansia muelleri             147        2

Note: In many cases, the curation was performed by BioCyc curators, and in other cases, the curation was performed by other DBs from which information was imported (e.g. from GO term curation or from UniProt protein-feature curation). For these PGDBs, we have removed from the citation counts those references shared with MetaCyc classes and metabolites (which were likely copied from MetaCyc during PGDB creation). MetaCyc and EcoCyc cite a number of common references because the EcoCyc pathway and enzyme data and their references are periodically copied from EcoCyc to MetaCyc.

Some BioCyc PGDBs were contributed by groups outside SRI (for example, the Chlamydomonas reinhardtii PGDB was developed by the Carnegie Institution for Science, and the Streptomyces coelicolor PGDB was developed by the University of Warwick and the John Innes Centre). The authors of each PGDB are listed on the summary page that is displayed when a user changes the current PGDB.

The Clostridioides difficile 630 PGDB has undergone several recent curation enhancements. We updated its genome annotation from the recently revised RefSeq entry, and from the annotation from the MicroScope site [13].
We performed literature searches and curation updates for 213 proteins listed in MicroScope as having experimental evidence for their function in C. difficile or in the Clostridioides genus, as well as other genes encountered during the course of literature searches. Those proteins with experimental evidence in C. difficile are now annotated with experimental evidence codes and contain references to the literature from which their enhanced curation was derived.

Curation adds value to BioCyc PGDBs in many ways, and is a major factor in differentiating BioCyc from other bacterial genome PGDBs. All computational prediction methods make errors, including predictors of gene boundaries, protein function and metabolic pathways. Curators correct errors in those predictions, and they supplement computational predictions with information from the experimental literature. They also annotate experimentally known information with experimental evidence codes and literature citations to indicate high-confidence information. Curators capture a wide variety of information in BioCyc PGDBs (Table 3) including protein functions, metabolic reactions and pathways and regulatory interactions of several types (such as allosteric regulation of enzymes, and control of gene expression via transcription factors and small RNAs).

Table 3. Datatypes available in PGDBs, and statistics on the number of objects of each type in various PGDBs

Data type                           Escherichia coli     Bacillus subtilis  Synechococcus elongatus  Mycobacterium tuberculosis
                                    K-12 substr. MG1655  subtilis 168       PCC 7942                 Beijing/NITR203
DB tier                             1                    2                  2                        3
Genome metadata                     2                    2                  4                        0
Genes                               4657                 4440               2719                     4206
Operons                             3564                 1604               1982                     2680
Promoters                           3850                 1193               44                       0
Transcription factor-binding sites  2918                 763                36                       0
Terminators                         303                  1146               0                        0
Proteins                            5719                 4407               2832                     4127
Protein features                    4223                 3029               0                        0
Gene Ontology terms                 5733                 3927               2518                     4
Metabolites                         2758                 942                990                      1169
Metabolic reactions                 1712                 1158               1100                     1450
Metabolic pathways                  396                  269                230                      285
Transport reactions                 1526                 1048               953                      1148
Genetic regulatory networks         3438                 788                41                       0
Evidence codes                      134 561              58 658             15 098                   3137
Growth media                        436                  1                  2                        0
Gene essentiality                   4239                 4217               2421                     0

Note: Different DBs contain different proportions of these datatypes depending on factors such as the amounts of data available in DBs from which BioCyc imports information, and the amount of data curated from the literature. Typically, DBs that have received more curation will have objects of a wider range of datatypes.

Curators author mini-review summaries appearing in the protein, pathway and operon pages, which summarize findings from multiple publications and save users significant amounts of time in poring through the primary literature.
For some BioCyc PGDBs, person-decades of curation work have been performed across tens of thousands of publications, resulting in large volumes of mini-review summaries, measured in textbook page equivalents: EcoCyc version 21.0 contains 2907 textbook-equivalent pages of summaries and MetaCyc version 21.0 contains 7897 such pages. Further, curators enter a wide range of experimentally determined information that cannot be inferred computationally, including enzyme activators and inhibitors, protein subunit structure, enzyme kinetic values, protein features (e.g. active site residues) and transcriptional regulatory interactions. Although automated text mining software has shown gradual improvement over the years, its accuracy is still far from that of human curators. In addition, text mining systems are typically limited to extracting fewer types of data than the wide range of information that BioCyc curators capture. Perhaps most importantly, only human curators can correctly resolve the many disagreements, inconsistencies and errors found in the literature. Many metabolic pathways and enzymes are complex, and earlier reports often contain information that has been later p