Recent development of antiSMASH and other computational approaches to mine secondary metabolite biosynthetic gene clusters

Many drugs are derived from small molecules produced by microorganisms and plants, so-called natural products. Natural products have diverse chemical structures, but the biosynthetic pathways producing those compounds are often organized as biosynthetic gene clusters (BGCs) and follow a highly conserved biosynthetic logic. This allows for the identification of core biosynthetic enzymes using genome mining strategies that are based on the sequence similarity of the involved enzymes/genes. However, mining for a variety of BGCs quickly approaches a complexity level where manual analyses are no longer possible and require the use of automated genome mining pipelines, such as the antiSMASH software. In this review, we discuss the principles underlying the predictions of antiSMASH and other tools and provide practical advice for their application. Furthermore, we discuss important caveats such as rule-based BGC detection, sequence and annotation quality and cluster boundary prediction, which all have to be considered while planning for, performing and analyzing the results of genome mining studies.

Key words: genome mining; biosynthetic gene cluster; antibiotics; secondary metabolites; natural products; antiSMASH

Kai Blin is a Postdoctoral Fellow at the Novo Nordisk Foundation Center for Biosustainability of the Technical University of Denmark. He is developing computational biology tools around microbial genome mining for natural products and connected -omics approaches.
Hyun Uk Kim is a Research Fellow at KAIST, South Korea, and a visiting Senior Researcher at the Novo Nordisk Foundation Center for Biosustainability, DTU. His research field lies in systems biology, biochemical and metabolic engineering and drug targeting and discovery.
Marnix H. Medema is an Assistant Professor in the Bioinformatics Group at Wageningen University. His research group develops and applies computational methodologies to identify and analyze biosynthetic pathways and gene clusters.
Tilmann Weber is a Co-Principal Investigator at the Novo Nordisk Foundation Center for Biosustainability of the Technical University of Denmark. He is interested in integrating bioinformatics, genome mining and systems biology approaches into Natural Products discovery and characterization and thus bridging the in silico and in vivo world.

Submitted: 30 May 2017; Received (in revised form): 10 October 2017

© The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

Most antibiotics, such as penicillin, erythromycin or tetracycline, and also other drugs like acarbose (anti-diabetic), artemisinin (anti-malarial), tacrolimus or cyclosporins (immunosuppressants) are so-called natural products either synthesized by or derived from microorganisms or plants [1]. As the biosynthetic pathways for such compounds are not directly related to growth and reproduction, these compounds are also referred to as ‘secondary metabolites’ or, in newer literature, ‘specialized metabolites’. In bacteria and fungi, the genes required for the biosynthesis of these compounds are usually organized as biosynthetic gene clusters (BGCs). These clusters contain all genes required for the biosynthesis of precursors, assembly of the compound scaffold, modification of the compound scaffold (also referred to as ‘tailoring’) and often also resistance, export and regulation. This implies that the full pathway can easily be identified if the involvement of one of the genes in biosynthesis can be demonstrated. In plants, only some pathways are organized in BGCs [2]. For other pathways, the biosynthesis genes are scattered across the genome and thus require additional experimental data, such as co-expression analyses [3], for identification.
Downloaded from https://academic.oup.com/bib/article/20/4/1103/4590131 by DeepDyve user on 16 July 2022

Soon after the first genes encoding natural product biosynthetic enzymes were identified, sequenced and analyzed, it became apparent that the sequences of the corresponding enzymes contain data of highly predictive quality, which can be used to infer key biosynthetic steps. For example, the core scaffolds of the products of canonical modular type I polyketide synthases (PKSs) can be predicted by combining several types of easy-to-obtain data: (a) the content and architecture of individual enzymatic domains within the megaenzymes, which are responsible for the assembly of the molecular scaffold and its modifications (e.g. reduction of the β-carbon), can be identified by using Hidden Markov model (HMM) profiles of such domains; (b) the individual acyl-CoA building blocks for each PKS module (e.g. malonyl-CoA versus methylmalonyl-CoA) can be inferred based on key residues in the active sites of the acyltransferase (AT) domains or by using phylogenetic classification; (c) the stereospecificity mediated by ketoreductase domains can be inferred by key amino acids in the active site motifs. These studies were the starting point in establishing genome mining for secondary metabolite BGCs as one of the recent key technologies in natural products research.

One of the first computational tools to make use of such predictions was the proprietary DECIPHER search engine and database of the former company Ecopia [4] that was first published in 2003. Around the same time, the first publicly available tools were released. For example, SEARCHPKS automated the identification of enzymatic domains in PKSs [5] (for URLs to this and all following Web tools, please see Table 1). However, it took until 2009 for the first open-source genome mining pipelines CLUSEAN [29] and NP.searcher [21] to be published. In 2011, the first version of the open-source genome mining platform antiSMASH was released [30], which combined and extended the functionality of the previous tools and also offered a user-friendly Web interface. For the first time, it became possible for scientists without significant experience in computational biology to perform larger-scale genome mining studies on a free and public Web server. Since then, antiSMASH has been steadily extended [6, 7, 23, 30–33] and currently offers a broad collection of tools and databases for automated genome mining and comparative genomics for a wide variety of different classes of secondary metabolites. The antiSMASH analysis pipeline for bacterial genomes and the pipeline for fungal genomes (recently named ‘fungiSMASH’) are both based on the same codebase. antiSMASH and fungiSMASH use two different Web submission forms, each offering specific options. plantiSMASH [23] is a branch of antiSMASH that includes plant-specific functionality, such as plant-adapted HMM profiles and cluster detection logic, as well as support for coexpression analysis.

In addition to antiSMASH, other noteworthy tools have also been developed and made available: SMURF [28] offers mining for fungal PKS, nonribosomal peptide synthetase (NRPS) and terpenoid gene clusters; the PRISM tool [24, 34, 35] offers genome mining functionality with a strong focus on predicting chemical structures of the biosynthetic pathways. PRISM is closely connected to the ‘Genomes-to-Natural Products platform (GNP)’ [14] that matches such predictions with MS/MS data, and to the GRAPE/GARLIC tools [15, 16], which match the predictions to chemical databases. For a comprehensive review describing the history and progress of secondary metabolite genome mining, along with many examples of compounds and BGCs that were identified using genome mining approaches, please see [36].

In this review, we will focus on the general computational approaches to study secondary metabolite biosynthesis and how these are integrated into the current antiSMASH framework (Figure 1). Finally, we will give practical advice for preparing and interpreting genome mining data. Although we focus on antiSMASH as an example, the issues discussed are applicable to natural product genome mining in general, and hence are equally relevant when using other tools. Comprehensive user guides for antiSMASH can be found online (http://docs.antismash.secondarymetabolites.org/using_antismash/) and in [37–39]. For comprehensive reviews on the different genome mining tools and databases on secondary metabolites, the reader is referred to [40–43].

Principles of predicting secondary metabolite biosynthesis

To predict secondary metabolite biosynthesis pathways, genome mining approaches commonly start out by identifying conserved biosynthetic genes. Their gene products are subsequently analyzed to gain information about their putative function in biosynthesis and sometimes their substrate specificity.

To identify conserved biosynthetic genes, it is necessary to have gene annotations available on the genome of interest. Formats such as NCBI’s GenBank or EBI’s EMBL contain both DNA sequence and gene annotations. GFF3 files can be used to carry the annotations for sequences in FASTA format. antiSMASH accepts input data in all of these formats. If no gene annotations are available, antiSMASH will run a gene finding tool. For the bacterial version, this is Prodigal [44]. For fungal and plant genomes, antiSMASH uses GlimmerHMM [45].

In the next step, BGCs are identified based on core enzymes involved in the biosynthesis of secondary metabolites. Functionally related proteins frequently share common patterns of amino acids. Using profile-based methods like position-specific scoring matrices to identify these patterns seems intuitive. HMMs are probabilistic models of linear sequences that provide an algorithmic approach to interpret the scores obtained from the scoring matrix. Profile HMMs (pHMMs) are HMMs designed to represent multiple sequence alignments, including matches, insertions and deletions. The most commonly used tool around pHMMs in biology is HMMer [46]. Many profile databases such as PFAM [47] and TIGRFAMs [48] provide downloadable profiles compatible with HMMer. antiSMASH uses pHMMs with profiles specific to conserved core enzymes of secondary metabolite biosynthesis pathways to run its profile-based BGC detection. Once the core enzymes have been identified, antiSMASH compares co-located core genes with a set of manually curated BGC cluster rules. These rules comprise Boolean logic regarding domain presence/absence within either a gene or a genomic region of interest. For example, BGCs encoding nonribosomally synthesized peptides (such as the antibiotic vancomycin) can be unambiguously identified if the sequence to be analyzed contains genes encoding proteins that have a combination of one or multiple Condensation, Adenylation (A) and Peptidyl Carrier Protein domains. ‘Negative’ models are also used to discard false positives, e.g. protein sequences that achieve higher scores for profiles of fatty acid synthases (which are homologous to PKSs) than for profiles of PKSs will not lead to the identification of a polyketide BGC. The 2017 version of antiSMASH (version 4) [6] uses such rules for 45 different types/classes of secondary metabolites (Table 2A).
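The Boolean rule logic and the ‘negative’ models described above can be sketched in a few lines of Python. This is a minimal illustration only, not antiSMASH's rule format or code: the rule table, the profile labels (`Condensation`, `Adenylation`, `PCP`, `PKS_KS`, `PKS_AT`, `FAS_KS`) and the helper names `filter_negative` and `classify_genes` are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    gene: str     # gene identifier (hypothetical)
    domain: str   # profile label (hypothetical, not antiSMASH's real profile names)
    score: float  # bit score reported by the profile search


# Hypothetical rules: cluster type -> domains that must all occur in one gene.
RULES = {
    "nrps": {"Condensation", "Adenylation", "PCP"},
    "t1pks": {"PKS_KS", "PKS_AT"},
}

# Hypothetical negative models: a hit to the key profile is discarded when the
# same gene scores higher against the competing (value) profile.
NEGATIVE_MODELS = {"PKS_KS": "FAS_KS"}


def filter_negative(hits):
    """Drop hits that are outscored by their negative model on the same gene."""
    best = {}
    for h in hits:
        key = (h.gene, h.domain)
        best[key] = max(best.get(key, float("-inf")), h.score)
    kept = []
    for h in hits:
        if h.domain in NEGATIVE_MODELS.values():
            continue  # negative profiles never count as positive evidence
        rival = NEGATIVE_MODELS.get(h.domain)
        if rival is not None and best.get((h.gene, rival), float("-inf")) > h.score:
            continue  # e.g. the gene looks more like a fatty acid synthase
        kept.append(h)
    return kept


def classify_genes(hits):
    """Return {gene: sorted cluster types whose rule is satisfied by that gene}."""
    domains_by_gene = {}
    for h in filter_negative(hits):
        domains_by_gene.setdefault(h.gene, set()).add(h.domain)
    result = {}
    for gene, domains in domains_by_gene.items():
        matched = sorted(t for t, required in RULES.items() if required <= domains)
        if matched:
            result[gene] = matched
    return result


if __name__ == "__main__":
    example = [
        DomainHit("g1", "Condensation", 120.0),
        DomainHit("g1", "Adenylation", 200.0),
        DomainHit("g1", "PCP", 50.0),
        DomainHit("g2", "PKS_KS", 80.0),
        DomainHit("g2", "FAS_KS", 150.0),  # outscores PKS_KS: g2 is rejected
        DomainHit("g2", "PKS_AT", 90.0),
    ]
    print(classify_genes(example))
```

A real rule set would also express alternatives, exclusions and the genomic distance between co-located genes; the sketch only shows how presence/absence logic and a competing profile score can be combined.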
Table 1. URLs of Web servers, Web tools and databases referred to in the review

Tool | Functions | URL | Reference
antiSMASH 4 | Genome mining; BGC analysis; Domain analysis | http://antismash.secondarymetabolites.org | [6]
antiSMASH database | BGC database | http://antismash-db.secondarymetabolites.org | [7]
ARTS | Genome mining | http://arts.ziemertlab.com | [8]
BAGEL 3 | Genome mining | http://bagel.molgenrug.nl/ | [9]
CASSIS | BGC boundary prediction | https://sbi.hki-jena.de/cassis/cassis.php | [10]
CRISPy-web | sgRNA design | http://crispy.secondarymetabolites.org | [11]
eSNaPD v2 | Genome mining | http://esnapd2.rockefeller.edu | [12]
FunGeneClusterS | BGC boundary prediction | https://fungiminions.shinyapps.io/FunGeneClusterS | [13]
fungiSMASH | Genome mining; BGC analysis; Domain analysis | http://fungismash.secondarymetabolites.org | [6]
GNP | Metabolomics | http://magarveylab.ca/gnp | [14]
GRAPE/GARLIC | Genome mining | https://magarveylab.ca/gast/ | [15, 16]
MIBiG | BGC database; reference data set | http://mibig.secondarymetabolites.org | [17]
NaPDoS | Genome mining | http://napdos.ucsd.edu | [18]
NORINE | Nonribosomal peptide database | http://bioinfo.lifl.fr/NRP | [19, 20]
NP.searcher | Genome mining; Domain analysis | http://dna.sherman.lsi.umich.edu/ | [21]
NRPSpredictor | Domain analysis | http://nrps.informatik.uni-tuebingen.de | [22]
plantiSMASH | Genome mining; BGC analysis | http://plantismash.secondarymetabolites.org | [23]
PRISM 3 | Genome mining; BGC analysis; Domain analysis | http://magarveylab.ca/prism | [24]
RODEO | Genome mining; RiPP analysis | http://www.ripprodeo.org | [25]
(SEARCHPKS)/SBSPKS v2 | Domain analysis; BGC database | http://202.54.226.228/pksdb/sbspks_updated/master.html | [26]
Smiles2Monomers | Retro-biosynthetic monomer prediction | http://bioinfo.lifl.fr/norine/smiles2monomers.jsp | [27]
SMURF | Genome mining | http://www.jcvi.org/smurf | [28]

Figure 1. General workflow of an antiSMASH analysis of bacterial, fungal and plant genomes. Computational resources in the left and right boxes have been integrated with antiSMASH 4 for enhanced genome mining performance, whereas those in the box in the bottom correspond to third-party applications that use antiSMASH for the detection of BGCs.

The cluster rules are stored in a tab-delimited text file, which can be easily edited to add custom types of gene clusters. Similar rule-based strategies are also used by many other secondary metabolite genome mining tools, such as PRISM [24], SMURF [28] and BAGEL [9].

Alternatively, a probabilistic method to detect potential secondary metabolite BGCs can be selected in antiSMASH that uses the ClusterFinder algorithm [49]. Rather than using explicit rules requiring specific enzymes to be present for a particular class of BGCs, ClusterFinder is based on a model built from a training set of PFAM domains found in BGCs and non-BGC regions. Given this model and a genome of interest with annotated PFAM domains, ClusterFinder then calculates the probability of a stretch of observed PFAM domains to constitute a BGC. In regions where this probability is higher than the configurable threshold, a BGC is predicted.

For BGCs encoding NRPS, PKS, terpene or ribosomally synthesized and posttranslationally modified peptides (RiPPs), it is possible to perform some additional analyses to predict further details, such as substrate specificities or product cyclization patterns. To this end, it is sometimes necessary to classify proteins or domains that share a high overall sequence similarity. The differences between the functional classes (e.g. different substrate specificities) are determined by a small number of key amino acids. Sequence-alignment-based methods such as BLAST and profile-based methods like HMMer tend to perform poorly in these cases. As both kinds of methods are designed to score overall sequence similarities, they, by design, gloss over the few key differences. In such cases, more complex algorithms can be used. Support vector machines (SVMs) are a machine learning approach that uses supervised learning to create nonprobabilistic binary linear classifiers. SVMs classify data points encoded in multidimensional feature vectors by a maximum margin hyperplane. Compared with other machine learning methods such as artificial neural networks, the construction of the SVM hyperplane allows for gaining some insight over which of the input parameters contribute most to the solution.

For the multimodular enzymes involved in NRPS biosynthesis, antiSMASH uses the recently published SANDPUMA tool [51] to predict the substrates of A domains. Knowledge of these substrates and the order of the A domains are then used to predict the backbone structure of the NRPS product. SANDPUMA internally uses a combination of pHMMs and SVMs to obtain the best possible A domain substrate predictions. In RiPP clusters that encode the biosynthesis of, e.g., lanthi-, lasso-, sacti- and thiopeptides, identifying the precursor peptide is key to predicting the cluster product. Here, antiSMASH scores putative precursor peptides using the recently published RODEO tool [25], as well as some custom pHMMs. RODEO also uses both pHMMs and SVMs internally to identify precursor peptides. Tailoring enzymes that further modify the RiPP are also identified using pHMMs.
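The probabilistic detection idea described above (score a stretch of annotated PFAM domains and predict a BGC wherever the probability exceeds a configurable threshold) can be illustrated with a toy two-state HMM. This is a sketch under stated assumptions, not the ClusterFinder implementation, whose model and parameters are described in [49]: the two states, the emission tables, the `p_stay` transition probability and the function names `posterior_bgc` and `call_regions` are all made up for this example.

```python
def posterior_bgc(domains, emit_bgc, emit_bg, p_stay=0.9, p_start=0.5, unseen=1e-3):
    """Posterior P(state = BGC | all domains) per position (forward-backward)."""
    n = len(domains)
    if n == 0:
        return []
    # Per position: (P(domain | BGC), P(domain | background)).
    emit = [(emit_bgc.get(d, unseen), emit_bg.get(d, unseen)) for d in domains]
    # trans[i][j] = P(next state j | current state i); 0 = BGC, 1 = background.
    trans = ((p_stay, 1.0 - p_stay), (1.0 - p_stay, p_stay))

    def normalize(pair):
        total = pair[0] + pair[1]
        return (pair[0] / total, pair[1] / total)

    # Forward pass, rescaled at every position to avoid numerical underflow.
    fwd = [normalize((p_start * emit[0][0], (1.0 - p_start) * emit[0][1]))]
    for t in range(1, n):
        prev = fwd[-1]
        fwd.append(normalize(tuple(
            emit[t][j] * (prev[0] * trans[0][j] + prev[1] * trans[1][j])
            for j in (0, 1))))

    # Backward pass, rescaled in the same way.
    bwd = [(1.0, 1.0)] * n
    for t in range(n - 2, -1, -1):
        nxt = bwd[t + 1]
        bwd[t] = normalize(tuple(
            sum(trans[i][j] * emit[t + 1][j] * nxt[j] for j in (0, 1))
            for i in (0, 1)))

    # Combine both passes; per-position scaling cancels in the ratio.
    return [f[0] * b[0] / (f[0] * b[0] + f[1] * b[1]) for f, b in zip(fwd, bwd)]


def call_regions(posteriors, threshold=0.5):
    """Return (start, end) index pairs of runs whose posterior exceeds threshold."""
    regions, start = [], None
    for i, p in enumerate(posteriors):
        if p > threshold and start is None:
            start = i
        elif p <= threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(posteriors) - 1))
    return regions
```

Raising `threshold` trades sensitivity for precision, which mirrors the role of the configurable probability cutoff mentioned in the text; any bias in the training data ends up in the emission tables.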
Phylogenetic analysis assists with the classification of enzymes in Clusters of Orthologous Groups and the calculation of phylogenetic distances of genes/enzyme sequences of interest to characterized reference sequences. Multiple methods exist to construct phylogenetic trees based on multiple sequence alignments. Depending on the desired output tree characteristics, the number of input sequences and other constraints, the most appropriate method should be chosen. A popular algorithm among the distance-matrix-based methods is the Neighbor-Joining algorithm, which uses bottom-up clustering to create the tree. Neighbor-Joining is a comparatively fast method, but the correctness of the tree depends on the accuracy and additivity of the underlying distance matrix. Maximum parsimony methods try to identify the tree that uses the smallest number of evolution events to explain the observed sequence data. While maximum parsimony algorithms build accurate trees, their computation tends to be relatively slow compared with distance matrix-based methods. Maximum likelihood methods use probability distributions to assess the likelihood of a given phylogenetic tree according to a substitution model. This method unfortunately has a high complexity for computing the optimal tree. Many current tools use a combination of methods. The popular software FastTree [52] first builds rough Neighbor-Joining trees and then refines them using a maximum likelihood scoring of the trees generated in the first pass.

In antiSMASH, phylogenetic methods are used in many places. For NRPS clusters, SANDPUMA includes a phylogenetic analysis in the PrediCAT step. A modified version of PrediCAT trained on a recently released data set [53] is also used in terpenoid clusters to further classify terpene synthases. Noncore biosynthetic genes in a BGC are assigned to ‘secondary metabolite clusters of orthologous groups’, for which phylogenies are reconstructed.

In addition to BGC type-dependent analyses, antiSMASH also includes general tools providing information on all cluster types. The built-in ClusterBlast module [30] considers the similarity of individual gene products as well as their genomic arrangement. ClusterBlast contains a comprehensive database of all predicted BGCs from publicly available genomes that is searched to identify organisms containing similar BGCs. The same algorithm is used in the ‘SubClusterBlast’ module to identify operons/sets of genes in the query BGC that code for enzymes involved in the biosynthesis of common precursors, for example the nonproteinogenic amino acid 3,5-dihydroxyphenylglycine present in some types of NRPS clusters such as the vancomycin-family glycopeptides. Finally, this strategy is also used to search the Minimum Information on Biosynthetic Gene cluster (MIBiG) [17] data set with the ‘KnownClusterBlast’ function to provide information about related and well-characterized gene clusters. This function can also be used to perform a sequence-based dereplication, i.e. the identification of gene clusters that code for already known products.

Table 2. A: BGC types detectable by pHMM-based rules with antiSMASH, PRISM and SMURF. B: Rule-independent methods to detect BGCs

A: Rule-based detection of gene clusters
BGC-type, followed by the marks given in the columns antiSMASH, PRISM/RiPP PRISM and SMURF
Aminocoumarins: X X
Aminoglycosides/aminocyclitols: X
Antimetabolites: X
Aryl polyenes: X X
Autoinducing peptide: X
Bacteriocins: X
Beta-lactams: X X
Bottromycin: X X
Butyrolactones: X X
ClusterFinder fatty acid: X
ClusterFinder saccharide: X
ComX: X
Cyanobactins: X X
Ectoines: X X
Furan: X X
Fused (pheganomycin-like): X
Glycocin: X X
Head-to-tail cyclized peptide: X X
Homoserine lactone: X X
Indoles: X X
Ladderane lipids: X X
Lantipeptides class I: X X
Lantipeptides class II: X X
Lantipeptides class III/IV: X X
Lasso peptide: X X
Linaridin: X X
Linear azol(in)e-containing: X X
Melanins: X X
Microcin: X
Microviridin: X X
Nonribosomal peptides: X X X
Nucleosides: X
Oligosaccharide: X
Other (unusual) PKS: X
Others: X
Phenazine: X X
Phosphoglycolipids: X X
Phosphonate: X X
Polyunsaturated fatty acids: X
Prochlorosin: X
Proteusin: X X
Sactipeptide: X X
Non-NRP siderophores: X
Streptide: X
Terpene: X X
Thiopeptides: X X
Thioviridamide: X
Trans-AT type I PKS: X X
Trifolitoxin: X
Type I PKS: X X X
Type II PKS: X X
Type III PKS: X X
YM-216391: X

For details on the pHMMs and specific rules used by the different genome mining programs, please consult the original publications of antiSMASH [6, 32], PRISM [24, 34] or SMURF [28].

B: Rule-independent methods
Method | Principle | Implemented in | References
ClusterFinder | HMM-based classification of which PFAM domains are likely to be found inside or outside a BGC | antiSMASH | [6, 49]
EvoMining | Phylogenomic identification of enzymes with expanded substrate spectrum; such enzymes are often found in BGCs | EvoMining | [50]
Resistance gene-based mining | Identification of potential antibiotic resistance genes; often such genes are part of BGCs to provide self-protection of the producing organism | ARTS | [8]

‘Linked’ tools and resources

A general challenge when using comparative approaches to study BGCs is the varying quality of annotation in public sequence databases. Some BGCs that have been extensively studied experimentally are well annotated, whereas others, mostly identified in high-throughput sequencing efforts, were only annotated using standard genome annotation pipelines that do not provide specific annotations of secondary metabolite BGCs. Therefore, a community effort has been established to define a ‘MIBiG’ standard [17] and provide a standardized repository for BGCs that have been experimentally connected to their biosynthetic products.
The MIBiG repository currently (as of April 2017) contains 1396 entries of BGCs that are validated to code for a specific biosynthetic pathway. Within this set, 396 of the entries contain comprehensive manually curated annotations of the specific features of the gene clusters, which were provided by the specialists that studied these respective BGCs. This collection now serves as a reference data set for a wide variety of applications and the validation of novel computational tools.

In addition to analyses integrated into antiSMASH, the annotation generated by antiSMASH can also be useful as a starting point for further downstream analyses. Therefore, antiSMASH 4 provides an application programming interface that allows third-party software to access antiSMASH annotation for further processing. Examples of such tools are the ‘Antibiotic Resistant Target Seeker ARTS’ [8], which predicts potential targets of antibiotics and uses the annotation provided by antiSMASH to mine for BGCs, and CRISPy-web [11], a Web tool that allows user-friendly design of single guide RNAs (sgRNAs) for CRISPR applications on nonmodel organisms.

antiSMASH is a comprehensive genome mining platform, but only provides information on individually submitted genomes and does not offer any integrated search functionality. Therefore, in 2016, the antiSMASH platform was extended with a database containing precomputed antiSMASH annotation on >3900 finished high-quality bacterial genome sequences [7]. Using the Web interface, it is possible to browse secondary metabolite clusters by BGC type or taxonomy of the producer organism. Additionally, custom queries can be constructed using an interactive query builder. This makes it possible to answer research questions such as ‘which clusters of type NRPS contain A domains that select for the nonproteinogenic amino acid 3,5-dihydroxyphenylglycine?’ or ‘what BGCs of type RiPP exist in the genus Streptomyces that are not lanthipeptides?’. The results are displayed in the same antiSMASH Web format. They can also be exported in various file formats that allow further processing in other bioinformatics tools.

Considerations and caveats for computational genome mining

You can only find what you are looking for...

Most genome mining platforms, including antiSMASH (with default search options), SMURF [28] and PRISM [24, 34], use a rule-based approach to define what is annotated as a secondary metabolite BGC. These rules are derived from existing knowledge about key biosynthetic steps/principles, which require the activity of individual or combinations of specific enzymes. The genes encoding these are also often referred to as ‘core’ genes and used as anchors or probes to screen the genomic data of interest. While this method is highly sensitive and precise for identifying biosynthesis genes for many classes of secondary metabolites, such as polyketides, or nonribosomally synthesized peptides, it of course implies that only pathways for which rules are implemented in the mining software can be detected; all pathways that may use unknown or unrelated alternative enzymes will be missed.

As an extension to the rule-based genome mining, antiSMASH optionally provides the possibility to use the ‘ClusterFinder’ method [49]. This algorithm can identify BGCs that are not detected by the expert-generated rule sets described above. However, it should be noted that this method still has some bias, as the source data used to train the HMM determining whether a gene product likely belongs to a BGC are also based on the currently known pathways.

To address these limitations, alternative methods are under development to access the ‘biosynthetic dark matter’ and identify novel pathways and enzymes. One promising approach is ‘EvoMining’ [50], which is based on the observation that biosynthetic enzymes and/or resistance genes often evolved by duplication and divergence of primary metabolism enzymes. By detecting divergences in phylogenetic trees of enzymes from the core metabolism shared between many bacterial species, this method can identify enzymes that have likely been repurposed for secondary metabolite biosynthesis [50] or resistance [8]. Once novel pathways have been identified using such methods and experimentally validated, the newly obtained knowledge on the involved enzymes is of course used to refine and extend the rule-based mining methods.

The quality of input data is important for getting reliable results

One important aspect to be considered when mining genomic data for BGCs using antiSMASH or alternative pipelines, such as PRISM [24, 34], SMURF [28] and ClusterFinder [49], is the quality of the sequence data that is to be analyzed. All these tools use either rule-based or statistical approaches to identify the BGCs involved in secondary metabolism. Both methods require that the sequence data to be analyzed are not too fragmented and that the genes of a BGC are not scattered across different contigs in the assembly. Users should be particularly aware of potential quality issues when analyzing genome data generated with short-read sequencing technologies. Special care has to be taken when analyzing type I polyketide or NRPS-containing BGCs; both types of pathways involve large multimodular megaenzymes, whose gene sequences often are highly repetitive and therefore difficult to assemble purely based on short sequencing reads [54]. The same applies to metagenomic data; reliable identification of BGCs (which consist of several genes) is only possible on well-assembled data. Therefore, analyses on the public antiSMASH Web server are limited to sequences of over 1 kb length and the first 1000 contigs. Both limits can be deactivated in the stand-alone version of antiSMASH. To analyze highly fragmented short-read-based assemblies, pipelines focusing on the detection and analysis of individual core domains, such as NaPDoS [18] or eSNaPD [12], should be considered. In general, phylogenomics-based approaches like the abovementioned or as used in EvoMining [50] are excellent alternatives for such fragmented data, as they base their predictions on single enzymes/genes instead of requiring the presence of complete or partial BGCs [55]. Therefore, we recommend first using these tools to identify ‘interesting’ sequence records in such bulk DNA data and then submitting only these records (provided they have the required sequence length) for an analysis with antiSMASH.

In addition, most algorithms predicting enzyme specificities rely on automatically generated alignments of the user-supplied input data with experimentally characterized ‘reference’ sequences to identify residues of the active sites or the substrate-binding pockets. Depending on the tool used to predict specificities, these alignments are generated using standard multiple sequence alignment software like ClustalW [56] or Muscle [57]. Alternatively, BLAST or HMMer are used to match the query with a custom reference database. Consequently, these tools are sensitive to sequencing errors if these errors occur in or near the active sites or binding pockets. In addition, the accuracy of such computer-generated, nonrefined alignments may suffer if the protein sequence of interest is too dissimilar to the reference data sets. In both cases, this can easily lead to incorrect specificity predictions.

In the case where users analyze annotated sequence data, which is uploaded as GenBank files or directly retrieved from the NCBI GenBank or RefSeq database, antiSMASH will only consider the annotated genes and not perform additional gene finding. This also implies that genes annotated as ‘pseudogenes’ are not considered for any prediction. This is noteworthy, as many modular PKS and NRPS gene calls that were generated with the NCBI PGAP [58] pipeline (which is used to annotate all microbial genomes in RefSeq [59]) were inaccurate and the intact genes were labelled as pseudogenes. This bug has been fixed for RefSeq 82, but users that downloaded earlier versions of RefSeq entries should be cautious. Many GenBank records that were annotated with affected versions of PGAP also suffer from this issue.

If users supply unannotated sequence data, antiSMASH uses the software Prodigal [44] for bacterial genomes or GlimmerHMM [45] for fungal and plant sequences to automatically identify coding regions. The downstream genome analyses therefore depend on the accuracy of the automated gene finding, which can vary between different organisms and is also dependent on the sequence quality. If users supply annotated sequence data by uploading GenBank-formatted or FASTA+GFF3 files, antiSMASH uses these gene coordinates. If an annotated and high-quality genome sequence of an organism of interest is available, it is therefore advisable to use the preannotated data.

Defining the extent of a secondary metabolite BGC

Predicting the boundaries of a BGC solely based on genomic data still remains challenging. For fungal BGCs, conserved binding sites of cluster-specific transcriptional regulators are good indicators to use in defining which genes are co-regulated. If the same regulator binding site is present near the core-genes of a cluster, they probably belong to the same biosynthetic pathway. This approach is used in the CASSIS tool [10], which was recently integrated into version 4 of antiSMASH [6]. In addition, fungal transcriptomics data can also be used to efficiently define the cluster boundaries [60], as implemented in the FunGeneClusterS application [13].

For bacterial sequences, such automated or semi-automated methods are unfortunately not (yet) well established. The presence or absence of BGCs is often strain specific [61, 62]. Comparing genomes between closely related species to identify which genes are highly conserved between these species and which are unique to the strain of interest can indicate the extent of BGCs.

[...] cluster. The distances were selected in a way that we would rather overpredict the distance, i.e. include genes in the gene cluster annotation that may belong to the gene cluster border region, than exclude genes that are part of the BGCs but are encoded outside this range from the core biosynthetic genes.

Strategies to connect gene clusters to molecules

In the end, most users turn to antiSMASH or related tools to accomplish one of two goals: (1) to identify potentially new molecules that could be synthesized by the organism of study based on its genome, or (2) to identify genes involved in the biosynthesis of an already observed molecule. Specific strategies are available for each of these scenarios.

When trying to find out what kind of specialized metabolites an organism can produce based on its genome, the starting point is to go over each gene cluster in the genome in detail. First, comparisons with BGCs from MIBiG (in antiSMASH, this is done using the KnownClusterBlast module) will identify BGCs that are either closely or more distantly related to these reference clusters. To determine whether a BGC is likely to produce the exact same molecule, manual inspection is required. It should be checked that all key biosynthetic genes of the reference cluster are also found in the BGC of interest by studying the data of the MIBiG entry and related literature. If so, are any additional enzymes encoded in the BGC of interest that could encode chemical modifications not observed for the known molecule? If the BGC encodes PKSs or NRPSs, do the domain architectures and their corresponding predicted substrate specificities match to those of the known cluster? The answers to these questions will determine whether the BGC of interest is likely to encode the biosynthesis of: (a) the same molecule (all relevant genes ‘shared’ with high percent identity, and perfect alignment of chemistry predictions with the structure of the known molecule); (b) a potentially new variant of a known molecule (some enzyme-coding genes are cluster-specific, and/or some substrate specificities are different); (c) a new molecule within a known class of molecules (only a minority or small majority of the genes ‘shared’); or (d) an altogether unknown molecule (no significant similarities). Before it can be concluded that a molecule is unknown, it should be taken into account that some known natural products lack a described BGC; hence, some novel-looking BGCs may still encode the production of molecules for which the chemistry has been long known. For polyketides and nonribosomal peptides, these cases can be assessed with a retro-biosynthetic approach using tools like Smiles2Monomers [27] or GRAPE [15]. These tools predict the potential monomers of a given compound structure, for example derived from a compound database. In a second step, these compounds can be connected to BGCs by mapping the monomer predictions derived from the chemical structure to the monomer predictions derived from the analysis of BGCs. The latter predictions can be made using the antiSMASH database or tools like GARLIC [15]. For nonribosomal peptides, another option is to check for compounds with similar monomers in the NORINE database [19, 20]. antiSMASH provides the appropriate search links from the ‘detailed annotations’ sidebar. If no cluster-wide similarity is observed, it is in any case still a good idea to look for similarities to known clusters at a smaller scale: either per gene or per subcluster. antiSMASH offers functional-
In antiSMASH, we have therefore chosen an ‘inclusive’ ities to identify such similarities, using the SubClusterBlast fea- approach. Genes that are encoded within an empirically defined ture and the gene-specific BLAST search of MIBiG [17]. This distance from conserved core genes of a BGC are displayed as a makes it possible to predict the presence of specific chemical Downloaded from https://academic.oup.com/bib/article/20/4/1103/4590131 by DeepDyve user on 16 July 2022 1110 | Blin et al. moieties or chemical modifications to the molecule, which BGCs coding for the biosynthesis of known hazardous helps to prioritize the targets or to connect the gene cluster to a chemicals. molecule observed in metabolomic data. Finally, looking for Increasingly available high-quality genome data, in combi- functional markers can greatly help in prioritizing BGCs, e.g. nation with databases of BGCs of known function, such as when the aim of the project is antibiotic discovery, one can look sequence data from the MIBiG repository [17], can also be used for both general and specific types of antibiotic resistance genes for dereplication of known or closely related compounds and that are often encoded inside a BGC to provide natural self- the identification of unexplored or underexplored gene cluster resistance to the producer [8, 16]. families. So far, several studies [35, 49, 65–67] have successfully Sometimes, the structure of a molecule has already been used such approaches to identify novel natural products. In elucidated before a genome is sequenced or studied. In such a connection with large-scale metabolomics approaches (in case, the aim of using antiSMASH or related tools is usually to which gene cluster data are automatically correlated with infor- identify the biosynthetic mechanism of the molecule of inter- mation on known or unknown compounds identified by est. 
If, chemically, the molecule is closely related to other mass spectrometry [14, 15, 67, 68]), these high-quality data now known natural products for which the biosynthesis is known, allow for new high-throughput methods to identify novel one would usually be able to find either a single BGC or only a compounds. few BGCs with high similarity to the corresponding MIBiG refer- Many of the current limitations of automated genome min- ence cluster. However, this is often not the case. Then, the best ing approaches are being actively addressed by the interna- strategy is to use ‘exclusion logic’ and step-by-step exclude tional natural product community. The EvoMining strategy has BGCs that are unlikely to be involved in the biosynthesis of the been successfully used [50] to identify new BGCs coding for pre- molecule, thus gradually narrowing down the options to only viously unknown compounds and enzymes. Another promising one or a few gene clusters. First, one would ask: What is the approach to better predict BGC boundaries is based on compara- chemical class of the molecule, and, accordingly, what is its tive genomics by detecting ‘breaks’ in the conserved synteny of expected biosynthetic class? For some chemical classes, there related strains; as such breaks are often caused by the insertion can be multiple biosynthetic options, e.g. peptides can be made and/or horizontal acquisition of BGCs, this approach allows the in either a ribosomal or nonribosomal fashion. Second, one identification of potential secondary metabolite biosynthetic would ask: What can we specifically predict about the biosyn- pathways without relying on previous knowledge of the thetic pathway? If it concerns a potential nonribosomal peptide enzymes involved (SYNTERUPTOR, S. Lautru and J. L. Pernodet; or polyketide, knowledge of the structure would allow predict- personal communication). 
Thousands of BGCs already have ing the number of modules expected in corresponding NRPSs or been identified and the number is still steadily increasing. Tools PKSs, as well as their substrate specificities. Third, is there spe- like CORASON (F. Barona-Go ´ mez, personal communication; cific chemistry seen in the molecule for which enzymatic mech- https://github.com/nselem/EvoDivMet; as used in [69, 70]), anisms are known? If, for example, a peptide is acylated, one clusterTools [63] and MultiGeneBlast [64] can be used to identify could expect the presence of either a CoA-ligase or a clusters, which share varying degrees of similarity with known Condensation-starter domain in the BGC. Fourth, are any other BGCs. Large-scale clustering of these BGCs is emerging as an organisms known to produce this molecule? If so, one could see important method to compare, classify into gene cluster fami- which BGCs have homologous clusters in each of these known lies, dereplicate and identify novel or—depending on the aim of producers. the study—related BGCs [49, 66, 67]. Novel software packages When dealing with larger numbers of genomes, the above- like BIG-SCAPE (Medema, personal communication; https://git. mentioned strategies may no longer be feasible. In this case, a wageningenur.nl/medema-group/BiG-SCAPE) will help scien- targeted search could be done using software like clusterTools tists to perform such analyses. [63] or MultiGeneBlast [64] among the entire set of BGCs identi- Of course, the widespread use of genome mining approaches fied in all genomes. For example, if the presence of a certain also raises new challenges. One major bottleneck in such (combination of) specific gene(s) is either desired (in case of approaches is the frequent observation that the BGCs remain hunting for new molecules) or expected (in case of trying to con- unexpressed (i.e. 
‘silent’) in the producer strains under normal nect a known molecule to its BGC), a specific query can be built laboratory fermentation conditions; in such cases, the com- to search for this. pounds cannot be detected or isolated despite the genome con- taining all the genes required for the biosynthesis. Thus, Perspectives strategies have to be developed and improved to trigger the expression of such silent BGCs [71, 72]. One important step for- With the recent progress in sequencing technologies and the ward in this regard has been the development of CRISPR-based availability of easy-to-use software programs, genome mining genome editing tools for important groups of bacterial and fun- for BGCs and evaluating the genetic potential of secondary gal secondary metabolite producers [11, 73–75] that can be used metabolite producing organisms have matured into an impor- to insert promoters to activate the silent BGCs [76] or to ‘repair’ tant technology. It complements the classical organic biosynthetic genes [77]. Successful expression of the BGC and chemistry-centered approach to find, dereplicate and character- isolation of a novel compound should be followed by metabolo- ize novel bioactive secondary metabolites, and contributes mics analysis and metabolic engineering that are intercon- toward the current paradigm-shift that brings natural products nected with each other. Metabolomics helps with identifying once more into focus for future drug discovery [36]. In addition, secondary metabolite precursors, and hence provides clues on it also can be used as an effective method to evaluate the safety of biotechnological production organisms, which are used the use of metabolic pathways. This information in turn facili- directly in food production or for the production of enzymes or tates metabolic engineering of the host strain that considers other biochemicals. 
In this case, genome mining data can be quantitatively optimal production of a target secondary metab- used to demonstrate that a production strain does not contain olite [78]. Downloaded from https://academic.oup.com/bib/article/20/4/1103/4590131 by DeepDyve user on 16 July 2022 RecentdevelopmentofantiSMASH | 1111 9. van Heel AJ, de Jong A, Montalba ´ n-Lo ´ pez M, et al. BAGEL3: Key Points automated identification of genes encoding bacteriocins and (non-)bactericidal posttranslationally modified peptides. Despite the huge chemical diversity of bioactive secon- Nucleic Acids Res 2013;41:W448–53. dary metabolites, the enzymes involved in their biosyn- 10. Wolf T, Shelest V, Nath N, et al. CASSIS and SMIPS: promoter- thesis are often strikingly conserved. based prediction of secondary metabolite gene clusters in The sequence conservation of these enzymes can be eukaryotic genomes. Bioinformatics 2016;32(8):1138–43. exploited by genome mining approaches to identify sec- 11. Blin K, Pedersen LE, Weber T, et al. CRISPy-web: an online ondary metabolite BGCs in genome data. resource to design sgRNAs for CRISPR applications. Synth Syst Genome mining is a powerful method to access the Biotechnol 2016;1(2):118–21. genetic potential of secondary metabolite producers. 12. Reddy BVB, Milshteyn A, Charlop-Powers Z, et al. eSNaPD: a User-friendly pipelines (e.g. antiSMASH) are available to versatile, web-based bioinformatics platform for surveying assist scientists in genome mining. and mining natural product biosynthetic diversity from There are caveats that should be considered when metagenomes. Chem Biol 2014;21:1023–33. designing and interpreting genome mining studies. 13. Vesth TC, Brandl J, Andersen MR. FunGeneClusterS: predict- ing fungal gene clusters from genome and transcriptome data. Synth Syst Biotechnol 2016;1(2):122–9. Acknowledgements 14. Johnston CW, Skinnider MA, Wyatt MA, et al. 
An automated Genomes-to-Natural Products platform (GNP) for the discov- The authors would like to thank Simon Shaw for the helpful ery of modular natural products. Nat Commun 2015;6:8421. discussions. 15. Dejong CA, Chen GM, Li H, et al. Polyketide and nonribosomal peptide retro-biosynthesis and global gene cluster matching. Funding Nat Chem Biol 2016;12:1007–14. 16. Johnston CW, Skinnider MA, Dejong CA, et al. Assembly and The work of T. W., K. B. and H. U. K. is supported by grants of clustering of natural antibiotics guides target identification. the Novo Nordisk Foundation (CFB and grant number Nat Chem Biol 2016;12:233–9. NNF16OC0021746); the Technology Development Program to 17. Medema MH, Kottmann R, Yilmaz P, et al. Minimum informa- Solve Climate Change on Systems Metabolic Engineering for tion about a biosynthetic gene cluster. Nat Chem Biol 2015; Biorefineries from the Ministry of Science and ICT through 11(9):625–31. the National Research Foundation (NRF) of Korea (grant num- 18. Ziemert N, Podell S, Penn K, et al. The natural product domain bers NRF-2012M1A2A2026556 and NRF-2012M1A2A2026557 seeker NaPDoS: a phylogeny based bioinformatic tool to classify to H. U. K.); and Veni grant (grant number 863.15.002 to secondary metabolite gene diversity. PLoS One 2012;7(3):e34064. M. H. M.) from The Netherlands Organization for Scientific 19. Pupin M, Esmaeel Q, Flissi A, et al. Norine: a powerful resource for novel nonribosomal peptide discovery. Synth Syst Research (NWO). Biotechnol 2016;1(2):89–94. 20. Caboche S, Pupin M, Lecle `re V, et al. NORINE: a database of References nonribosomal peptides. Nucleic Acids Res 2008;36:D326–31. 1. Newman DJ, Cragg GM. Natural products as sources of new 21. Li MH, Ung PM, Zajkowski J, et al. Automated genome mining drugs over the 30 years from 1981 to 2010. J Nat Prod 2012; for natural products. BMC Bioinformatics 2009;10:185. 75(3):311–35. 22. Ro ¨ ttig M, Medema MH, Blin K, et al. NRPSpredictor2–a web 2. 
Nu ¨ tzmann HW, Huang A, Osbourn A. Plant metabolic clus- server for predicting NRPS adenylation domain specificity. ters—from genetics to genomics. New Phytol 2016;211(3): Nucleic Acids Res 2011;39:W362–7. 771–89. 23. Kautsar SA, Suarez Duran HG, Blin K, et al. plantiSMASH: 3. Medema MH, Osbourn A. Computational genomic identifica- automated identification, annotation and expression analy- tion and functional reconstitution of plant natural product sis of plant biosynthetic gene clusters. Nucleic Acids Res 2017; biosynthetic pathways. Nat Prod Rep 2016;33:951–62. 45:W55–63. 4. Zazopoulos E, Huang K, Staffa A, et al. A genomics-guided 24. Skinnider MA, Merwin NJ, Johnston CW, et al. PRISM 3: approach for discovering and expressing cryptic metabolic expanded prediction of natural product chemical structures pathways. Nat Biotechnol 2003;21:187–90 [Database. from microbial genomes. Nucleic Acids Res 2017;45(W1): 5. Yadav G, Gokhale RS, Mohanty D. SEARCHPKS: a program for W49–54. detection and analysis of polyketide synthase domains. 25. Tietz JI, Schwalen CJ, Patel PS, et al. A new genome-mining Nucleic Acids Res 2003;31(13):3654–8. tool redefines the lasso peptide biosynthetic landscape. Nat 6. Blin K, Wolf T, Chevrette MG, et al. antiSMASH 4.0-improve- Chem Biol 2017;13(5):470–8. ments in chemistry prediction and gene cluster boundary 26. Khater S, Gupta M, Agrawal P, et al. SBSPKSv2: structure- identification. Nucleic Acids Res 2017;45(W1):W36–41. based sequence analysis of polyketide synthases and non- 7. Blin K, Medema MH, Kottmann R, et al. The antiSMASH data- ribosomal peptide synthetases. Nucleic Acids Res 2017;45(W1): base, a comprehensive database of microbial secondary W72–9. metabolite biosynthetic gene clusters. Nucleic Acids Res 2017; 27. Dufresne Y, Noe ´ L, Lecle `re V, et al. Smiles2Monomers: a link 45:D555–9. between chemical and biological structures for polymers. 8. Alanjary M, Kronmiller B, Adamek M, et al. The Antibiotic J Cheminform 2015;7:62. 
Resistant Target Seeker (ARTS), an exploration engine for 28. Khaldi N, Seifuddin FT, Turner G, et al. SMURF: genomic map- antibiotic cluster prioritization and novel drug target discov- ping of fungal secondary metabolite clusters. Fungal Genet Biol ery. Nucleic Acids Res 2017;45(W1):W42–8. 2010;47:736–41. Downloaded from https://academic.oup.com/bib/article/20/4/1103/4590131 by DeepDyve user on 16 July 2022 1112 | Blin et al. 50. Cruz-Morales P, Kopp JF, Martı ´nez-Guerrero C, et al. 29. Weber T, Rausch C, Lopez P, et al. CLUSEAN: a computer- based framework for the automated analysis of bacterial sec- Phylogenomic analysis of natural products biosynthetic gene ondary metabolite biosynthetic gene clusters. J Biotechnol clusters allows discovery of arseno-organic metabolites in 2009;140:13–17. model streptomycetes. Genome Biol Evol 2016;8:1906–16. 30. Medema MH, Blin K, Cimermancic P, et al. antiSMASH: rapid 51. Chevrette MG, Aicheler F, Kohlbacher O, et al. SANDPUMA: identification, annotation and analysis of secondary metabo- ensemble predictions of nonribosomal peptide chemistry lite biosynthesis gene clusters in bacterial and fungal genome reveals biosynthetic diversity across actinobacteria. sequences. Nucleic Acids Res 2011;39:W339–46. Bioinformatics 2017;33(20):3202–10. 31. Blin K, Medema MH, Kazempour D, et al. antiSMASH 2.0–a ver- 52. Price MN, Dehal PS, Arkin AP. FastTree 2–approximately satile platform for genome mining of secondary metabolite maximum-likelihood trees for large alignments. PLoS One producers. Nucleic Acids Res 2013;41:W204–12. 2010;5(3):e9490. 32. Weber T, Blin K, Duddela S, et al. antiSMASH 3.0-a compre- 53. Dickschat JS. Bacterial terpene cyclases. Nat Prod Rep 2016; hensive resource for the genome mining of biosynthetic gene 33(1):87–110. clusters. Nucleic Acids Res 2015;43:W237–43. 54. Klassen JL, Currie CR. Gene fragmentation in bacterial draft 33. Blin K, Kazempour D, Wohlleben W, et al. 
Improved lanthi- genomes: extent, consequences and mitigation. BMC peptide detection and prediction for antiSMASH. PLoS One Genomics 2012;13:14. 2014;9(2):e89420. 55. Cibria ´ n-Jaramillo A, Barona-Go ´ mez F. Increasing metage- 34. Skinnider MA, Dejong CA, Rees PN, et al. Genomes to natural nomic resolution of microbiome interactions through func- products PRediction Informatics for Secondary Metabolomes tional phylogenomics and bacterial sub-communities. Front (PRISM). Nucleic Acids Res 2015;43:9645–62. Genet 2016;7:4. 35. Skinnider MA, Johnston CW, Edgar RE, et al. Genomic charting 56. Larkin MA, Blackshields G, Brown NP, et al. Clustal W and of ribosomally synthesized natural product chemical space Clustal X version 2.0. Bioinformatics 2007;23(21):2947–8. facilitates targeted mining. Proc Natl Acad Sci USA 2016; 57. Edgar RC. MUSCLE: multiple sequence alignment with high 113(42):E6343–51. accuracy and high throughput. Nucleic Acids Res 2004;32(5): 36. Ziemert N, Alanjary M, Weber T. The evolution of genome 1792–7. mining in microbes—a review. Nat Prod Rep 2016;33(8): 58. Tatusova T, DiCuccio M, Badretdin A, et al. NCBI prokaryotic 988–1005. genome annotation pipeline. Nucleic Acids Res 2016;44(14): 37. Fedorova ND, Moktali V, Medema MH. Bioinformatics 6614–24. approaches and software for detection of secondary meta- 59. Tatusova T, Ciufo S, Fedorov B, et al. RefSeq microbial bolic gene clusters. Methods Mol Biol 2012;944:23–45. genomes database: new representation and annotation strat- 38. Lecle ` re V, Weber T, Jacques P, et al. Bioinformatics tools for egy. Nucleic Acids Res 2014;42:D553–9. the discovery of new nonribosomal peptides. Methods Mol Biol 60. Andersen MR, Nielsen JB, Klitgaard A, et al. Accurate predic- 2016;1401:209–32. tion of secondary metabolite gene clusters in filamentous 39. Adamek M, Spohn M, Stegmann E, et al. Mining bacterial fungi. Proc Natl Acad Sci USA 2013;110(1):E99–107. genomes for secondary metabolite gene clusters. 
Methods Mol 61. Letzel AC, Li J, Amos GCA, et al. Genomic insights into special- Biol 2017;1520:23–47. ized metabolism in the marine actinomycete Salinispora. 40. Weber T. In silico tools for the analysis of antibiotic biosyn- Environ Microbiol 2017;19:3660–73. thetic pathways. Int J Med Microbiol 2014;304(3–4):230–5. 62. Cruz-Morales P, Vijgenboom E, Iruegas-Bocardo F, et al. The 41. Weber T, Kim HU. The secondary metabolite bioinformatics genome sequence of Streptomyces lividans 66 reveals a novel portal: computational tools to facilitate synthetic biology of tRNA-dependent peptide biosynthetic system within a secondary metabolite production. Synth Syst Biotechnol 2016; metal-related genomic island. Genome Biol Evol 2013;5: 1(2):69–79. 1165–75. 42. Medema MH, Fischbach MA. Computational approaches to 63. de los Santos ELC, Challis GL. clusterTools: proximity natural product discovery. Nat Chem Biol 2015;11(9):639–48. searches for functional elements to identify putative biosyn- 43. Chavali AK, Rhee SY. Bioinformatics tools for the identifica- thetic gene clusters. bioRxiv 2017. (Epub ahead of print). doi: tion of gene clusters that biosynthesize specialized metabo- 10.1101/119214. lites. Brief Bioinform 2017. (Epub ahead of print). doi: 10.1093/ 64. Medema MH, Takano E, Breitling R. Detecting sequence bib/bbx020. homology at the gene cluster level with MultiGeneBlast. Mol 44. Hyatt D, Chen GL, Locascio PF, et al. Prodigal: prokaryotic gene Biol Evol 2013;30(5):1218–23. recognition and translation initiation site identification. BMC 65. Donia MS, Cimermancic P, Schulze CJ, et al. A systematic anal- Bioinformatics 2010;11:119. ysis of biosynthetic gene clusters in the human microbiome 45. Majoros WH, Pertea M, Salzberg SL. TigrScan and reveals a common family of antibiotics. Cell 2014;158(6): GlimmerHMM: two open source ab initio eukaryotic gene- 1402–14. finders. Bioinformatics 2004;20(16):2878–9. 66. Zhang Q, Doroghazi JR, Zhao X, et al. 
Expanded natural prod- 46. Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol uct diversity revealed by analysis of lanthipeptide-like gene 2011;7(10):e1002195. clusters in actinobacteria. Appl Environ Microbiol 2015;81: 47. Finn RD, Coggill P, Eberhardt RY, et al. The Pfam protein fami- 4339–50. lies database: towards a more sustainable future. Nucleic Acids 67. Doroghazi JR, Albright JC, Goering AW, et al. A roadmap for Res 2016;44(D1):D279–85. natural product discovery based on large-scale genomics and 48. Haft DH, Selengut JD, Richter RA, et al. TIGRFAMs and genome metabolomics. Nat Chem Biol 2014;10(11):963–8. properties in 2013. Nucleic Acids Res 2013;41:D387–95. 68. Maansson M, Vynne NG, Klitgaard A, et al. An integrated 49. Cimermancic P, Medema MH, Claesen J, et al. Insights into metabolomic and genomic mining workflow to uncover the secondary metabolism from a global analysis of prokaryotic biosynthetic potential of bacteria. mSystems 2016;1(3): biosynthetic gene clusters. Cell 2014;158(2):412–21. e00028-15. doi: 10.1128/mSystems.00028–15. Downloaded from https://academic.oup.com/bib/article/20/4/1103/4590131 by DeepDyve user on 16 July 2022 RecentdevelopmentofantiSMASH | 1113 74. Tong Y, Charusanti P, Zhang L, et al. CRISPR-Cas9 based engi- 69. Cruz-Morales P, Ramos-Aboites HE, Licona-Cassani C, et al. Actinobacteria phylogenomics, selective isolation from an neering of actinomycetal genomes. ACS Synth Biol 2015;4(9): iron oligotrophic environment and siderophore functional 1020–9. characterization, unveil new desferrioxamine traits. FEMS 75. Nødvig CS, Nielsen JB, Kogle ME, et al. A CRISPR-Cas9 system Microbiol Ecol 2017;93(9). doi: 10.1093/femsec/fix086. for genetic engineering of filamentous fungi. PLoS One 2015; 70. Gutie ´ rrez-Garcı ´a K, Neira-Gonza ´ lez A, Pe ´ rez-Gutie ´ rrez RM, 10(7):e0133085. et al. Phylogenomics of 2, 4-Diacetylphloroglucinol-producing 76. Zhang MM, Wong FT, Wang Y, et al. 
CRISPR-Cas9 strategy for pseudomonas and novel antiglycation endophytes from Piper activation of silent Streptomyces biosynthetic gene clusters. auritum. J Nat Prod 2017;80:1955–63. Nat Chem Biol 2017;13:607–9. 71. Rutledge PJ, Challis GL. Discovery of microbial natural prod- 77. Weber J, Valiante V, Nødvig CS, et al. Functional reconstitu- ucts by activation of silent biosynthetic gene clusters. Nat Rev tion of a fungal natural product gene cluster by advanced Microbiol 2015;13(8):509–23. genome editing. ACS Synth Biol 2017;6:62–8. 72. Ren H, Wang B, Zhao H. Breaking the silence: new strategies 78. Kim HU, Charusanti P, Lee SY, et al. Metabolic for discovering novel natural products. Curr Opin Biotechnol engineering with systems biology tools to optimize produc- tion of prokaryotic secondary metabolites. Nat Prod Rep 2016; 2017;48:21–7. 73. Cobb RE, Wang Y, Zhao H. High-efficiency multiplex genome 33:933–41. editing of Streptomyces species using an engineered CRISPR/ Cas system. ACS Synth Biol 2015;4:723–8. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Briefings in Bioinformatics Oxford University Press

Recent development of antiSMASH and other computational approaches to mine secondary metabolite biosynthetic gene clusters



Publisher
Oxford University Press
Copyright
Copyright © 2022 Oxford University Press
ISSN
1467-5463
eISSN
1477-4054
DOI
10.1093/bib/bbx146

Abstract

Many drugs are derived from small molecules produced by microorganisms and plants, so-called natural products. Natural products have diverse chemical structures, but the biosynthetic pathways producing those compounds are often organized as biosynthetic gene clusters (BGCs) and follow a highly conserved biosynthetic logic. This allows for the identification of core biosynthetic enzymes using genome mining strategies that are based on the sequence similarity of the involved enzymes/genes. However, mining for a variety of BGCs quickly approaches a complexity level where manual analyses are no longer possible and require the use of automated genome mining pipelines, such as the antiSMASH software. In this review, we discuss the principles underlying the predictions of antiSMASH and other tools and provide practical advice for their application. Furthermore, we discuss important caveats such as rule-based BGC detection, sequence and annotation quality and cluster boundary prediction, which all have to be considered while planning for, performing and analyzing the results of genome mining studies. Key words: genome mining; biosynthetic gene cluster; antibiotics; secondary metabolites; natural products; antiSMASH metabolites’. In bacteria and fungi, the genes required for the bio- Introduction synthesis of these compounds are usually organized as biosyn- thetic gene clusters (BGCs). These clusters contain all genes Most antibiotics, such as penicillin, erythromycin or tetracycline, and also other drugs like acarbose (anti-diabetic), artemisinin required for the biosynthesis of precursors, assembly of the com- pound scaffold, modification of the compound scaffold (also (anti-malarial), tacrolimus or cyclosporins (immunosuppres- sants) are so-called natural products either synthesized by or referred to as ‘tailoring’) and often also resistance, export and reg- ulation. This implies that the full pathway can easily be identified derived from microorganisms or plants [1]. 
As the biosynthetic pathways for such compounds are not directly related to growth if the involvement of one of the genes in biosynthesis can be demonstrated. In plants, only some pathways are organized in and reproduction, these compounds are also referred to as ‘sec- ondary metabolites’ or—in newer literature—‘specialized BGCs [2]. For other pathways, the biosynthesis genes are Kai Blin is a Postdoctoral Fellow at the Novo Nordisk Foundation Center for Biosustainability of the Technical University of Denmark. He is developing computational biology tools around microbial genome mining for natural products and connected -omics approaches. Hyun Uk Kim is a Research Fellow at KAIST, South Korea, and a visiting Senior Researcher at the Novo Nordisk Foundation Center for Biosustainability, DTU. His research field lies in systems biology, biochemical and metabolic engineering and drug targeting and discovery. Marnix H. Medema is an Assistant Professor in the Bioinformatics Group at Wageningen University. His research group develops and applies computa- tional methodologies to identify and analyze biosynthetic pathways and gene clusters. Tilmann Weber is a Co-Principal Investigator at the Novo Nordisk Foundation Center for Biosustainability of the Technical University of Denmark. He is interested in integrating bioinformatics, genome mining and systems biology approaches into Natural Products discovery and characterization and thus bridging the in silico and in vivo world. Submitted: 30 May 2017; Received (in revised form): 10 October 2017 V The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 
scattered across the genome and thus require additional experimental data, such as co-expression analyses [3], for identification.

Downloaded from https://academic.oup.com/bib/article/20/4/1103/4590131 by DeepDyve user on 16 July 2022

Soon after the first genes encoding natural product biosynthetic enzymes were identified, sequenced and analyzed, it became apparent that the sequences of the corresponding enzymes contain data of highly predictive quality, which can be used to infer key biosynthetic steps. For example, the core scaffolds of the products of canonical modular type I polyketide synthases (PKSs) can be predicted by combining several types of easy-to-obtain data: (a) the content and architecture of individual enzymatic domains within the megaenzymes, which are responsible for the assembly of the molecular scaffold and its modifications (e.g. reduction of the β-carbon), can be identified by using Hidden Markov model (HMM) profiles of such domains; (b) the individual acyl-CoA building blocks for each PKS module (e.g. malonyl-CoA versus methylmalonyl-CoA) can be inferred based on key residues in the active sites of the acyltransferase (AT) domains or by using phylogenetic classification; (c) the stereospecificity mediated by ketoreductase domains can be inferred by key amino acids in the active site motifs. These studies were the starting point in establishing genome mining for secondary metabolite BGCs as one of the recent key technologies in natural products research.

One of the first computational tools to make use of such predictions was the proprietary DECIPHER® search engine and database of the former company Ecopia [4] that was first published in 2003. Around the same time, the first publicly available tools were released. For example, SEARCHPKS automated the identification of enzymatic domains in PKSs [5] (for URLs to this and all following Web tools, please see Table 1). However, it took until 2009 for the first open-source genome mining pipelines CLUSEAN [29] and NP.searcher [21] to be published. In 2011, the first version of the open-source genome mining platform antiSMASH was released [30], which combined and extended the functionality of the previous tools and also offered a user-friendly Web interface. For the first time, it became possible for scientists without significant experience in computational biology to perform larger-scale genome mining studies on a free and public Web server. Since then, antiSMASH has been steadily extended [6, 7, 23, 30–33] and currently offers a broad collection of tools and databases for automated genome mining and comparative genomics for a wide variety of different classes of secondary metabolites. The antiSMASH analysis pipeline for bacterial genomes and the pipeline for fungal genomes (recently named ‘fungiSMASH’) are both based on the same codebase. antiSMASH and fungiSMASH use two different Web submission forms, each offering specific options. plantiSMASH [23] is a branch of antiSMASH that includes plant-specific functionality, such as plant-adapted HMM profiles and cluster detection logic, as well as support for coexpression analysis.

In addition to antiSMASH, other noteworthy tools have also been developed and made available: SMURF [28] offers mining for fungal PKS, nonribosomal peptide synthetase (NRPS) and terpenoid gene clusters; the PRISM tool [24, 34, 35] offers genome mining functionality with a strong focus on predicting chemical structures of the biosynthetic pathways. PRISM is closely connected to the ‘Genomes-to-Natural Products platform (GNP)’ [14] that matches such predictions with MS/MS data, and to the GRAPE/GARLIC tools [15, 16], which match the predictions to chemical databases. For a comprehensive review describing the history and progress of secondary metabolite genome mining, along with many examples of compounds and BGCs that were identified using genome mining approaches, please see [36].

In this review, we will focus on the general computational approaches to study secondary metabolite biosynthesis and how these are integrated into the current antiSMASH framework (Figure 1). Finally, we will give practical advice for preparing and interpreting genome mining data. Although we focus on antiSMASH as an example, the issues discussed are applicable to natural product genome mining in general, and hence are equally relevant when using other tools. Comprehensive user guides for antiSMASH can be found online (http://docs.antismash.secondarymetabolites.org/using_antismash/) and in [37–39]. For comprehensive reviews on the different genome mining tools and databases on secondary metabolites, the reader is referred to [40–43].

Principles of predicting secondary metabolite biosynthesis

To predict secondary metabolite biosynthesis pathways, genome mining approaches commonly start out by identifying conserved biosynthetic genes. Their gene products are subsequently analyzed to gain information about their putative function in biosynthesis and sometimes their substrate specificity.

To identify conserved biosynthetic genes, it is necessary to have gene annotations available on the genome of interest. Formats such as NCBI’s GenBank or EBI’s EMBL contain both DNA sequence and gene annotations. GFF3 files can be used to carry the annotations for sequences in FASTA format. antiSMASH accepts input data in all of these formats. If no gene annotations are available, antiSMASH will run a gene finding tool. For the bacterial version, this is Prodigal [44]. For fungal and plant genomes, antiSMASH uses GlimmerHMM [45].

In the next step, BGCs are identified based on core enzymes involved in the biosynthesis of secondary metabolites. Functionally related proteins frequently share common patterns of amino acids. Using profile-based methods like position-specific scoring matrices to identify these patterns seems intuitive. HMMs are probabilistic models of linear sequences that provide an algorithmic approach to interpret the scores obtained from the scoring matrix. Profile HMMs (pHMMs) are HMMs designed to represent multiple sequence alignments, including matches, insertions and deletions. The most commonly used tool around pHMMs in biology is HMMer [46]. Many profile databases such as PFAM [47] and TIGRFAMs [48] provide downloadable profiles compatible with HMMer. antiSMASH uses pHMMs with profiles specific to conserved core enzymes of secondary metabolite biosynthesis pathways to run its profile-based BGC detection. Once the core enzymes have been identified, antiSMASH compares co-located core genes with a set of manually curated BGC cluster rules. These rules comprise Boolean logic regarding domain presence/absence within either a gene or a genomic region of interest. For example, BGCs encoding nonribosomally synthesized peptides (such as the antibiotic vancomycin) can be unambiguously identified if the sequence to be analyzed contains genes encoding proteins that have a combination of one or multiple Condensation, Adenylation (A) and Peptidyl Carrier Protein domains. ‘Negative’ models are also used to discard false positives, e.g. protein sequences that achieve higher scores for profiles of fatty acid synthases (which are homologous to PKSs) than for profiles of PKSs will not lead to the identification of a polyketide BGC. The 2017 version of antiSMASH (version 4) [6] uses such rules

Table 1.
URLs of Web servers, Web tools and databases referred to in the review

Tool | Functions | URL | Reference
antiSMASH 4 | Genome mining; BGC analysis; Domain analysis | http://antismash.secondarymetabolites.org | [6]
antiSMASH database | BGC database | http://antismash-db.secondarymetabolites.org | [7]
ARTS | Genome mining | http://arts.ziemertlab.com | [8]
BAGEL 3 | Genome mining | http://bagel.molgenrug.nl/ | [9]
CASSIS | BGC boundary prediction | https://sbi.hki-jena.de/cassis/cassis.php | [10]
CRISPy-web | sgRNA design | http://crispy.secondarymetabolites.org | [11]
eSNaPD v2 | Genome mining | http://esnapd2.rockefeller.edu | [12]
FunGeneClusterS | BGC boundary prediction | https://fungiminions.shinyapps.io/FunGeneClusterS | [13]
fungiSMASH | Genome mining; BGC analysis; Domain analysis | http://fungismash.secondarymetabolites.org | [6]
GNP | Metabolomics | http://magarveylab.ca/gnp | [14]
GRAPE/GARLIC | Genome mining | https://magarveylab.ca/gast/ | [15, 16]
MIBiG | BGC database; reference data set | http://mibig.secondarymetabolites.org | [17]
NaPDoS | Genome mining | http://napdos.ucsd.edu | [18]
NORINE | Nonribosomal peptide database | http://bioinfo.lifl.fr/NRP | [19, 20]
NP.searcher | Genome mining; Domain analysis | http://dna.sherman.lsi.umich.edu/ | [21]
NRPSpredictor | Domain analysis | http://nrps.informatik.uni-tuebingen.de | [22]
plantiSMASH | Genome mining; BGC analysis | http://plantismash.secondarymetabolites.org | [23]
PRISM 3 | Genome mining; BGC analysis; Domain analysis | http://magarveylab.ca/prism | [24]
RODEO | Genome mining; RiPP analysis | http://www.ripprodeo.org | [25]
(SEARCHPKS)/SBSPKS v2 | Domain analysis; BGC database | http://202.54.226.228/pksdb/sbspks_updated/master.html | [26]
Smiles2Monomers | Retro-biosynthetic monomer prediction | http://bioinfo.lifl.fr/norine/smiles2monomers.jsp | [27]
SMURF | Genome mining | http://www.jcvi.org/smurf | [28]

for 45 different types/classes of secondary metabolites (Table 2A). The cluster rules are stored in a tab-delimited text file, which can be easily edited to add custom types of gene clusters. Similar rule-based strategies are also used by many other secondary metabolite genome mining tools, such as PRISM [24], SMURF [28] and BAGEL [9].

Alternatively, a probabilistic method to detect potential secondary metabolite BGCs can be selected in antiSMASH that uses the ClusterFinder algorithm [49]. Rather than using explicit rules requiring specific enzymes to be present for a particular class of BGCs, ClusterFinder is based on a model built from a training set of PFAM domains found in BGCs and non-BGC regions. Given this model and a genome of interest with annotated PFAM domains, ClusterFinder then calculates the probability of a stretch of observed PFAM domains to constitute a BGC. In regions where this probability is higher than the configurable threshold, a BGC is predicted.

For BGCs encoding NRPS, PKS, terpene or ribosomally synthesized and posttranslationally modified peptides (RiPPs), it is possible to perform some additional analyses to predict further details, such as substrate specificities or product cyclization patterns. To this end, it is sometimes necessary to classify proteins or domains that share a high overall sequence similarity. The differences between the functional classes (e.g. different substrate specificities) are determined by a small number of key amino acids. Sequence-alignment-based methods such as BLAST and profile-based methods like HMMer tend to perform poorly in these cases. As both kinds of methods are designed to score overall sequence similarities, they, by design, gloss over the few key differences. In such cases, more complex algorithms can be used. Support vector machines (SVMs) are a machine learning approach that uses supervised learning to create nonprobabilistic binary linear classifiers. SVMs classify data points encoded in multidimensional feature vectors by a maximum margin hyperplane. Compared with other machine learning methods such as artificial neural networks, the construction of the SVM hyperplane allows for gaining some insight over which of the input parameters contribute most to the solution.

For the multimodular enzymes involved in NRPS biosynthesis, antiSMASH uses the recently published SANDPUMA tool [51] to predict the substrates of A domains. Knowledge of these substrates and the order of the A domains are then used to predict the backbone structure of the NRPS product. SANDPUMA internally uses a combination of pHMMs and SVMs to obtain the best possible A domain substrate predictions. In RiPP clusters that encode the biosynthesis of, e.g., lanthi-, lasso-, sacti- and thiopeptides, identifying the precursor peptide is key to predicting the cluster product. Here, antiSMASH scores putative precursor peptides using the recently published RODEO tool [25], as well as some custom pHMMs. RODEO also uses both pHMMs and SVMs internally to identify precursor peptides. Tailoring enzymes that further modify the RiPP are also identified using pHMMs.

Figure 1. General workflow of an antiSMASH analysis of bacterial, fungal and plant genomes. Computational resources in the left and right boxes have been integrated with antiSMASH 4 for enhanced genome mining performance, whereas those in the box in the bottom correspond to third-party applications that use antiSMASH for the detection of BGCs.

Phylogenetic analysis assists with the classification of enzymes in Clusters of Orthologous Groups and the calculation of phylogenetic distances of genes/enzyme sequences of interest to characterized reference sequences. Multiple methods exist to construct phylogenetic trees based on multiple sequence alignments. Depending on the desired output tree characteristics, the number of input sequences and other constraints, the most appropriate method should be chosen. A popular algorithm among the distance-matrix-based methods is the Neighbor-Joining algorithm, which uses bottom-up clustering to create the tree. Neighbor-Joining is a comparatively fast method, but the correctness of the tree depends on the accuracy and additivity of the underlying distance matrix. Maximum parsimony methods try to identify the tree that uses the smallest number of evolution events to explain the observed sequence data. While maximum parsimony algorithms build accurate trees, their computation tends to be relatively slow compared with distance matrix-based methods. Maximum likelihood methods use probability distributions to assess the likelihood of a given phylogenetic tree according to a substitution model. This method unfortunately has a high complexity for computing the optimal tree. Many current tools use a combination of methods. The popular software FastTree [52] first builds rough Neighbor-Joining trees and then refines them using a maximum likelihood scoring of the trees generated in the first pass.

In antiSMASH, phylogenetic methods are used in many places. For NRPS clusters, SANDPUMA includes a phylogenetic analysis in the PrediCAT step. A modified version of PrediCAT trained on a recently released data set [53] is also used in terpenoid clusters to further classify terpene synthases. Noncore biosynthetic genes in a BGC are assigned to ‘secondary metabolite clusters of orthologous groups’, for which phylogenies are reconstructed.

Table 2. A: BGC types detectable by pHMM-based rules with antiSMASH, PRISM and SMURF. B: Rule-independent methods to detect BGCs

A: Rule-based detection of gene clusters (a)
BGC-type | antiSMASH | PRISM/RiPP PRISM | SMURF
Aminocoumarins X X
Aminoglycosides/aminocyclitols X
Antimetabolites X
Aryl polyenes X X
Autoinducing peptide X
Bacteriocins X
Beta-lactams X X
Bottromycin X X
Butyrolactones X X
ClusterFinder fatty acid X
ClusterFinder saccharide X
ComX X
Cyanobactins X X
Ectoines X X
Furan X X
Fused (pheganomycin-like) X
Glycocin X X
Head-to-tail cyclized peptide X X
Homoserine lactone X X
Indoles X X
Ladderane lipids X X
Lantipeptides class I X X
Lantipeptides class II X X
Lantipeptides class III/IV X X
Lasso peptide X X
Linaridin X X
Linear azol(in)e-containing X X
Melanins X X
Microcin X
Microviridin X X
Nonribosomal peptides X X X
Nucleosides X
Oligosaccharide X
Other (unusual) PKS X
Others X
Phenazine X X

B: Rule-independent methods
Method | Principle | Implemented in | References
ClusterFinder | HMM-based classification of which PFAM domains are likely to be found inside or outside a BGC | antiSMASH | [6, 49]
EvoMining | Phylogenomic identification of enzymes with expanded substrate spectrum; such enzymes are often found in BGCs | EvoMining | [50]
Resistance gene-based mining | Identification of potential antibiotic resistance genes; often such genes are part of BGCs to provide self-protection of the producing organism | ARTS | [8]

(a) For details on the pHMMs and specific rules used by the different genome mining programs, please consult the original publications of antiSMASH [6, 32], PRISM [24, 34] or SMURF [28].

In addition to BGC type-dependent analyses, antiSMASH also includes general tools providing information on all cluster types. The built-in ClusterBlast module [30] considers the similarity of individual gene products as well as their genomic arrangement. ClusterBlast contains a comprehensive database of all predicted BGCs from publicly available genomes that is searched to identify organisms containing similar BGCs.
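As an illustration only, the idea of ranking database clusters by both gene-product similarity and gene order can be sketched in a few lines of Python. The scoring used here (number of query genes with a homolog, plus the number of neighbouring query gene pairs whose homologs are also neighbours) is a simplifying assumption made for this sketch and is not the scoring implemented in ClusterBlast. The function names, the toy gene identifiers and the precomputed `homologs` mapping, which in practice would come from a sequence similarity search, are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def cluster_similarity(query: List[str],
                       candidate: List[str],
                       homologs: Dict[str, Set[str]]) -> Tuple[int, int]:
    """Return (hits, synteny) of an ordered query cluster against one candidate.

    hits    = query genes with at least one homolog in the candidate cluster
    synteny = neighbouring query gene pairs whose homologs are also
              neighbours in the candidate cluster
    (toy scoring, assumed for illustration only)
    """
    position = {gene: index for index, gene in enumerate(candidate)}
    # For every query gene, the candidate positions of its homologs.
    hit_positions = [
        sorted(position[h] for h in homologs.get(q, set()) if h in position)
        for q in query
    ]
    hits = sum(1 for positions in hit_positions if positions)
    synteny = 0
    for left, right in zip(hit_positions, hit_positions[1:]):
        if any(abs(a - b) == 1 for a in left for b in right):
            synteny += 1
    return hits, synteny


def rank_clusters(query: List[str],
                  database: Dict[str, List[str]],
                  homologs: Dict[str, Set[str]]) -> List[str]:
    """Rank database clusters by hits + synteny, best first; drop clusters without hits."""
    scored = []
    for name, genes in database.items():
        hits, synteny = cluster_similarity(query, genes, homologs)
        if hits:
            scored.append((hits + synteny, name))
    return [name for _, name in sorted(scored, key=lambda item: (-item[0], item[1]))]


# Hypothetical toy data: gene order within each cluster is the list order.
query = ["q1", "q2", "q3"]
database = {
    "clusterA": ["a1", "a2", "a3"],
    "clusterB": ["b1", "b2", "b3", "b4"],
    "clusterC": ["c1"],
}
homologs = {"q1": {"a1", "b3"}, "q2": {"a2", "b1"}, "q3": {"a3"}}

print(rank_clusters(query, database, homologs))  # ['clusterA', 'clusterB']
```

In this toy example, clusterA ranks first because all three query genes have homologs there and their order is conserved, whereas clusterB shares only two genes in a different arrangement and clusterC shares none.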
The ClusterBlast algorithm is also used in the ‘SubClusterBlast’ module to identify operons/sets of genes in the query BGC that code for enzymes involved in the biosynthesis of common precursors, for example the nonproteinogenic amino acid 3,5-dihydroxyphenylglycine present in some types of NRPS clusters, such as those of the vancomycin-family glycopeptides. Finally, this strategy is also used to search the Minimum Information on Biosynthetic Gene cluster (MIBiG) [17] data set with the ‘KnownClusterBlast’ function to provide information about related and well-characterized gene clusters. This function can also be used to perform a sequence-based dereplication, i.e. the identification of gene clusters that code for already known products.

Table 2. (continued)

A: Rule-based detection of gene clusters
BGC-type | antiSMASH | PRISM/RiPP PRISM | SMURF
Phosphoglycolipids X X
Phosphonate X X
Polyunsaturated fatty acids X
Prochlorosin X
Proteusin X X
Sactipeptide X X
Non-NRP siderophores X
Streptide X
Terpene X X
Thiopeptides X X
Thioviridamide X
Trans-AT type I PKS X X
Trifolitoxin X
Type I PKS X X X
Type II PKS X X
Type III PKS X X
YM-216391 X

‘Linked’ tools and resources

A general challenge when using comparative approaches to study BGCs is the varying quality of annotation in public sequence databases. Some BGCs that have been extensively studied experimentally are well annotated, whereas others, mostly identified in high-throughput sequencing efforts, were only annotated using standard genome annotation pipelines that do not provide specific annotations of secondary metabolite BGCs. Therefore, a community effort has been established to define a ‘MIBiG’ standard [17] and provide a standardized repository for BGCs that have been experimentally connected to their biosynthetic products. The MIBiG repository currently (as of April 2017) contains 1396 entries of BGCs that are validated to code for a specific biosynthetic pathway. Within this set, 396 of the entries contain comprehensive manually curated annotations of the specific features of the gene clusters, which were provided by the specialists that studied these respective BGCs. This collection now serves as a reference data set for a wide variety of applications and the validation of novel computational tools.

In addition to analyses integrated into antiSMASH, the annotation generated by antiSMASH can also be useful as a starting point for further downstream analyses. Therefore, antiSMASH 4 provides an application programming interface that allows third-party software to access antiSMASH annotation for further processing. Examples of such tools are the ‘Antibiotic Resistant Target Seeker ARTS’ [8], which predicts potential targets of antibiotics and uses the annotation provided by antiSMASH to mine for BGCs, and CRISPy-web [11], a Web tool that allows user-friendly design of single guide RNAs (sgRNAs) for CRISPR applications on nonmodel organisms.

antiSMASH is a comprehensive genome mining platform, but only provides information on individually submitted genomes and does not offer any integrated search functionality. Therefore, in 2016, the antiSMASH platform was extended with a database containing precomputed antiSMASH annotation on >3900 finished high-quality bacterial genome sequences [7]. Using the Web interface, it is possible to browse secondary metabolite clusters by BGC type or taxonomy of the producer organism. Additionally, custom queries can be constructed using an interactive query builder. This makes it possible to answer research questions such as ‘which clusters of type NRPS contain A domains that select for the nonproteinogenic amino acid 3,5-dihydroxyphenylglycine?’ or ‘what BGCs of type RiPP exist in the genus Streptomyces that are not lanthipeptides?’. The results are displayed in the same antiSMASH Web format. They can also be exported in various file formats that allow further processing in other bioinformatics tools.

Considerations and caveats for computational genome mining

You can only find what you are looking for...

Most genome mining platforms, including antiSMASH (with default search options), SMURF [28] and PRISM [24, 34], use a rule-based approach to define what is annotated as a secondary metabolite BGC. These rules are derived from existing knowledge about key biosynthetic steps/principles, which require the activity of individual or combinations of specific enzymes. The genes encoding these are also often referred to as ‘core’ genes and used as anchors or probes to screen the genomic data of interest. While this method is highly sensitive and precise for identifying biosynthesis genes for many classes of secondary metabolites, such as polyketides or nonribosomally synthesized peptides, it of course implies that only pathways for which rules are implemented in the mining software can be detected; all pathways that may use unknown or unrelated alternative enzymes will be missed.

As an extension to the rule-based genome mining, antiSMASH optionally provides the possibility to use the ‘ClusterFinder’ method [49]. This algorithm can identify BGCs that are not detected by the expert-generated rule sets described above. However, it should be noted that this method still has some bias, as the source data used to train the HMM determining whether a gene product likely belongs to a BGC are also based on the currently known pathways.

To address these limitations, alternative methods are under development to access the ‘biosynthetic dark matter’ and identify novel pathways and enzymes. One promising approach is ‘EvoMining’ [50], which is based on the observation that biosynthetic enzymes and/or resistance genes often evolved by duplication and divergence of primary metabolism enzymes. By detecting divergences in phylogenetic trees of enzymes from the core metabolism shared between many bacterial species, this method can identify enzymes that have likely been repurposed for secondary metabolite biosynthesis [50] or resistance [8]. Once novel pathways have been identified using such methods and experimentally validated, the newly obtained knowledge on the involved enzymes is of course used to refine and extend the rule-based mining methods.

The quality of input data is important for getting reliable results

One important aspect to be considered when mining genomic data for BGCs using antiSMASH or alternative pipelines, such as PRISM [24, 34], SMURF [28] and ClusterFinder [49], is the quality of the sequence data that is to be analyzed. All these tools use either rule-based or statistical approaches to identify the BGCs involved in secondary metabolism. Both methods require that the sequence data to be analyzed are not too fragmented and that the genes of a BGC are not scattered across different contigs in the assembly. Users should be particularly aware of potential quality issues when analyzing genome data generated with short-read sequencing technologies. Special care has to be taken when analyzing type I polyketide or NRPS-containing BGCs; both types of pathways involve large multimodular megaenzymes, whose gene sequences often are highly repetitive and therefore difficult to assemble purely based on short sequencing reads [54]. The same applies to metagenomic data; reliable identification of BGCs, which consist of several genes, is only possible on well-assembled data. Therefore, analyses on the public antiSMASH Web server are limited to sequences of over 1 kb length and the first 1000 contigs. Both limits can be deactivated in the stand-alone version of antiSMASH. To analyze highly fragmented short-read-based assemblies, pipelines focusing on the detection and analysis of individual core domains, such as NaPDoS [18] or eSNaPD [12], should be considered. In general, phylogenomics-based approaches like the abovementioned or as used in EvoMining [50] are excellent alternatives for such fragmented data, as they base their predictions on single enzymes/genes instead of requiring the presence of complete or partial BGCs [55]. Therefore, we recommend first using these tools to identify ‘interesting’ sequence records in such bulk DNA data and then submitting only these records (provided they have the required sequence length) for an analysis with antiSMASH.

In addition, most algorithms predicting enzyme specificities rely on automatically generated alignments of the user-supplied input data with experimentally characterized ‘reference’ sequences to identify residues of the active sites or the substrate-binding pockets. Depending on the tool used to predict specificities, these alignments are generated using standard multiple sequence alignment software like ClustalW [56] or Muscle [57]. Alternatively, BLAST or HMMer are used to match the query with a custom reference database. Consequently, these tools are sensitive to sequencing errors if these errors occur in or near the active sites or binding pockets. In addition, the accuracy of such computer-generated, nonrefined alignments may suffer if the protein sequence of interest is too dissimilar to the reference data sets. In both cases, this can easily lead to incorrect specificity predictions.

In the case where users analyze annotated sequence data, which is uploaded as GenBank files or directly retrieved from the NCBI GenBank or RefSeq database, antiSMASH will only consider the annotated genes and not perform additional gene finding. This also implies that genes annotated as ‘pseudogenes’ are not considered for any prediction. This is noteworthy, as many modular PKS and NRPS gene calls that were generated with the NCBI PGAP [58] pipeline (which is used to annotate all microbial genomes in RefSeq [59]) were inaccurate and the intact genes were labelled as pseudogenes. This bug has been fixed for RefSeq 82, but users that downloaded earlier versions of RefSeq entries should be cautious. Many GenBank records that were annotated with affected versions of PGAP also suffer from this issue.

If users supply unannotated sequence data, antiSMASH uses the software Prodigal [44] for bacterial genomes or GlimmerHMM [45] for fungal and plant sequences to automatically identify coding regions. The downstream genome analyses therefore depend on the accuracy of the automated gene finding, which can vary between different organisms and is also dependent on the sequence quality. If users supply annotated sequence data by uploading GenBank-formatted or FASTA+GFF3 files, antiSMASH uses these gene coordinates. If an annotated and high-quality genome sequence of an organism of interest is available, it is therefore advisable to use the preannotated data.

Defining the extent of a secondary metabolite BGC

Predicting the boundaries of a BGC solely based on genomic data still remains challenging. For fungal BGCs, conserved binding sites of cluster-specific transcriptional regulators are good indicators to use in defining which genes are co-regulated. If the same regulator binding site is present near the core genes of a cluster, they probably belong to the same biosynthetic pathway. This approach is used in the CASSIS tool [10], which was recently integrated into version 4 of antiSMASH [6]. In addition, fungal transcriptomics data can also be used to efficiently define the cluster boundaries [60], as implemented in the FunGeneClusterS application [13].

For bacterial sequences, such automated or semi-automated methods are unfortunately not (yet) well established. The presence or absence of BGCs is often strain specific [61, 62]. Comparing genomes between closely related species to identify which genes are highly conserved between these species and which are unique to the strain of interest can indicate the extent of BGCs. In antiSMASH, we have therefore chosen an ‘inclusive’ approach. Genes that are encoded within an empirically defined distance from conserved core genes of a BGC are displayed as a cluster. The distances were selected in a way that we would rather overpredict the distance, i.e. include genes in the gene cluster annotation that may belong to the gene cluster border region, than exclude genes that are part of the BGCs but are encoded outside this range from the core biosynthetic genes.

Strategies to connect gene clusters to molecules

In the end, most users turn to antiSMASH or related tools to accomplish one of two goals: (1) to identify potentially new molecules that could be synthesized by the organism of study based on its genome, or (2) to identify genes involved in the biosynthesis of an already observed molecule. Specific strategies are available for each of these scenarios.

When trying to find out what kind of specialized metabolites an organism can produce based on its genome, the starting point is to go over each gene cluster in the genome in detail. First, comparisons with BGCs from MIBiG (in antiSMASH, this is done using the KnownClusterBlast module) will identify BGCs that are either closely or more distantly related to these reference clusters. To determine whether a BGC is likely to produce the exact same molecule, manual inspection is required. It should be checked that all key biosynthetic genes of the reference cluster are also found in the BGC of interest by studying the data of the MIBiG entry and related literature. If so, are any additional enzymes encoded in the BGC of interest that could encode chemical modifications not observed for the known molecule? If the BGC encodes PKSs or NRPSs, do the domain architectures and their corresponding predicted substrate specificities match those of the known cluster? The answers to these questions will determine whether the BGC of interest is likely to encode the biosynthesis of: (a) the same molecule (all relevant genes ‘shared’ with high percent identity, and perfect alignment of chemistry predictions with the structure of the known molecule); (b) a potentially new variant of a known molecule (some enzyme-coding genes are cluster-specific, and/or some substrate specificities are different); (c) a new molecule within a known class of molecules (only a minority or small majority of the genes ‘shared’); or (d) an altogether unknown molecule (no significant similarities). Before it can be concluded that a molecule is unknown, it should be taken into account that some known natural products lack a described BGC; hence, some novel-looking BGCs may still encode the production of molecules for which the chemistry has been long known. For polyketides and nonribosomal peptides, these cases can be assessed with a retro-biosynthetic approach using tools like Smiles2Monomers [27] or GRAPE [15]. These tools predict the potential monomers of a given compound structure, for example derived from a compound database. In a second step, these compounds can be connected to BGCs by mapping the monomer predictions derived from the chemical structure to the monomer predictions derived from the analysis of BGCs. The latter predictions can be made using the antiSMASH database or tools like GARLIC [15]. For nonribosomal peptides, another option is to check for compounds with similar monomers in the NORINE database [19, 20]. antiSMASH provides the appropriate search links from the ‘detailed annotations’ sidebar. If no cluster-wide similarity is observed, it is in any case still a good idea to look for similarities to known clusters at a smaller scale: either per gene or per subcluster. antiSMASH offers functionalities to identify such similarities, using the SubClusterBlast feature and the gene-specific BLAST search of MIBiG [17]. This makes it possible to predict the presence of specific chemical moieties or chemical modifications to the molecule, which helps to prioritize the targets or to connect the gene cluster to a molecule observed in metabolomic data. Finally, looking for functional markers can greatly help in prioritizing BGCs, e.g. when the aim of the project is antibiotic discovery, one can look for both general and specific types of antibiotic resistance genes that are often encoded inside a BGC to provide natural self-resistance to the producer [8, 16].

Sometimes, the structure of a molecule has already been elucidated before a genome is sequenced or studied. In such a case, the aim of using antiSMASH or related tools is usually to identify the biosynthetic mechanism of the molecule of interest. If, chemically, the molecule is closely related to other known natural products for which the biosynthesis is known, one would usually be able to find either a single BGC or only a few BGCs with high similarity to the corresponding MIBiG reference cluster. However, this is often not the case. Then, the best strategy is to use ‘exclusion logic’ and step-by-step exclude BGCs that are unlikely to be involved in the biosynthesis of the molecule, thus gradually narrowing down the options to only one or a few gene clusters. First, one would ask: What is the chemical class of the molecule, and, accordingly, what is its expected biosynthetic class? For some chemical classes, there can be multiple biosynthetic options, e.g. peptides can be made in either a ribosomal or nonribosomal fashion. Second, one would ask: What can we specifically predict about the biosynthetic pathway? If it concerns a potential nonribosomal peptide or polyketide, knowledge of the structure would allow predicting the number of modules expected in corresponding NRPSs or PKSs, as well as their substrate specificities. Third, is there specific chemistry seen in the molecule for which enzymatic mechanisms are known? If, for example, a peptide is acylated, one could expect the presence of either a CoA-ligase or a Condensation-starter domain in the BGC. Fourth, are any other organisms known to produce this molecule? If so, one could see which BGCs have homologous clusters in each of these known producers.

When dealing with larger numbers of genomes, the abovementioned strategies may no longer be feasible. In this case, a targeted search could be done using software like clusterTools [63] or MultiGeneBlast [64] among the entire set of BGCs identified in all genomes. For example, if the presence of a certain (combination of) specific gene(s) is either desired (in case of hunting for new molecules) or expected (in case of trying to con-

BGCs coding for the biosynthesis of known hazardous chemicals.

Increasingly available high-quality genome data, in combination with databases of BGCs of known function, such as sequence data from the MIBiG repository [17], can also be used for dereplication of known or closely related compounds and the identification of unexplored or underexplored gene cluster families. So far, several studies [35, 49, 65–67] have successfully used such approaches to identify novel natural products. In connection with large-scale metabolomics approaches (in which gene cluster data are automatically correlated with information on known or unknown compounds identified by mass spectrometry [14, 15, 67, 68]), these high-quality data now allow for new high-throughput methods to identify novel compounds.

Many of the current limitations of automated genome mining approaches are being actively addressed by the international natural product community. The EvoMining strategy has been successfully used [50] to identify new BGCs coding for previously unknown compounds and enzymes. Another promising approach to better predict BGC boundaries is based on comparative genomics by detecting ‘breaks’ in the conserved synteny of related strains; as such breaks are often caused by the insertion and/or horizontal acquisition of BGCs, this approach allows the identification of potential secondary metabolite biosynthetic pathways without relying on previous knowledge of the enzymes involved (SYNTERUPTOR, S. Lautru and J. L. Pernodet; personal communication). Thousands of BGCs already have been identified and the number is still steadily increasing. Tools like CORASON (F. Barona-Gómez, personal communication; https://github.com/nselem/EvoDivMet; as used in [69, 70]), clusterTools [63] and MultiGeneBlast [64] can be used to identify clusters that share varying degrees of similarity with known BGCs. Large-scale clustering of these BGCs is emerging as an important method to compare, classify into gene cluster families, dereplicate and identify novel or, depending on the aim of the study, related BGCs [49, 66, 67]. Novel software packages like BIG-SCAPE (Medema, personal communication; https://git.wageningenur.nl/medema-group/BiG-SCAPE) will help scientists to perform such analyses.

Of course, the widespread use of genome mining approaches also raises new challenges. One major bottleneck in such approaches is the frequent observation that the BGCs remain unexpressed (i.e.
‘silent’) in the producer strains under normal nect a known molecule to its BGC), a specific query can be built laboratory fermentation conditions; in such cases, the com- to search for this. pounds cannot be detected or isolated despite the genome con- taining all the genes required for the biosynthesis. Thus, Perspectives strategies have to be developed and improved to trigger the expression of such silent BGCs [71, 72]. One important step for- With the recent progress in sequencing technologies and the ward in this regard has been the development of CRISPR-based availability of easy-to-use software programs, genome mining genome editing tools for important groups of bacterial and fun- for BGCs and evaluating the genetic potential of secondary gal secondary metabolite producers [11, 73–75] that can be used metabolite producing organisms have matured into an impor- to insert promoters to activate the silent BGCs [76] or to ‘repair’ tant technology. It complements the classical organic biosynthetic genes [77]. Successful expression of the BGC and chemistry-centered approach to find, dereplicate and character- isolation of a novel compound should be followed by metabolo- ize novel bioactive secondary metabolites, and contributes mics analysis and metabolic engineering that are intercon- toward the current paradigm-shift that brings natural products nected with each other. Metabolomics helps with identifying once more into focus for future drug discovery [36]. In addition, secondary metabolite precursors, and hence provides clues on it also can be used as an effective method to evaluate the safety of biotechnological production organisms, which are used the use of metabolic pathways. This information in turn facili- directly in food production or for the production of enzymes or tates metabolic engineering of the host strain that considers other biochemicals. 
In this case, genome mining data can be quantitatively optimal production of a target secondary metab- used to demonstrate that a production strain does not contain olite [78]. Downloaded from https://academic.oup.com/bib/article/20/4/1103/4590131 by DeepDyve user on 16 July 2022 RecentdevelopmentofantiSMASH | 1111 9. van Heel AJ, de Jong A, Montalba ´ n-Lo ´ pez M, et al. BAGEL3: Key Points automated identification of genes encoding bacteriocins and (non-)bactericidal posttranslationally modified peptides. Despite the huge chemical diversity of bioactive secon- Nucleic Acids Res 2013;41:W448–53. dary metabolites, the enzymes involved in their biosyn- 10. Wolf T, Shelest V, Nath N, et al. CASSIS and SMIPS: promoter- thesis are often strikingly conserved. based prediction of secondary metabolite gene clusters in The sequence conservation of these enzymes can be eukaryotic genomes. Bioinformatics 2016;32(8):1138–43. exploited by genome mining approaches to identify sec- 11. Blin K, Pedersen LE, Weber T, et al. CRISPy-web: an online ondary metabolite BGCs in genome data. resource to design sgRNAs for CRISPR applications. Synth Syst Genome mining is a powerful method to access the Biotechnol 2016;1(2):118–21. genetic potential of secondary metabolite producers. 12. Reddy BVB, Milshteyn A, Charlop-Powers Z, et al. eSNaPD: a User-friendly pipelines (e.g. antiSMASH) are available to versatile, web-based bioinformatics platform for surveying assist scientists in genome mining. and mining natural product biosynthetic diversity from There are caveats that should be considered when metagenomes. Chem Biol 2014;21:1023–33. designing and interpreting genome mining studies. 13. Vesth TC, Brandl J, Andersen MR. FunGeneClusterS: predict- ing fungal gene clusters from genome and transcriptome data. Synth Syst Biotechnol 2016;1(2):122–9. Acknowledgements 14. Johnston CW, Skinnider MA, Wyatt MA, et al. 
An automated Genomes-to-Natural Products platform (GNP) for the discov- The authors would like to thank Simon Shaw for the helpful ery of modular natural products. Nat Commun 2015;6:8421. discussions. 15. Dejong CA, Chen GM, Li H, et al. Polyketide and nonribosomal peptide retro-biosynthesis and global gene cluster matching. Funding Nat Chem Biol 2016;12:1007–14. 16. Johnston CW, Skinnider MA, Dejong CA, et al. Assembly and The work of T. W., K. B. and H. U. K. is supported by grants of clustering of natural antibiotics guides target identification. the Novo Nordisk Foundation (CFB and grant number Nat Chem Biol 2016;12:233–9. NNF16OC0021746); the Technology Development Program to 17. Medema MH, Kottmann R, Yilmaz P, et al. Minimum informa- Solve Climate Change on Systems Metabolic Engineering for tion about a biosynthetic gene cluster. Nat Chem Biol 2015; Biorefineries from the Ministry of Science and ICT through 11(9):625–31. the National Research Foundation (NRF) of Korea (grant num- 18. Ziemert N, Podell S, Penn K, et al. The natural product domain bers NRF-2012M1A2A2026556 and NRF-2012M1A2A2026557 seeker NaPDoS: a phylogeny based bioinformatic tool to classify to H. U. K.); and Veni grant (grant number 863.15.002 to secondary metabolite gene diversity. PLoS One 2012;7(3):e34064. M. H. M.) from The Netherlands Organization for Scientific 19. Pupin M, Esmaeel Q, Flissi A, et al. Norine: a powerful resource for novel nonribosomal peptide discovery. Synth Syst Research (NWO). Biotechnol 2016;1(2):89–94. 20. Caboche S, Pupin M, Lecle `re V, et al. NORINE: a database of References nonribosomal peptides. Nucleic Acids Res 2008;36:D326–31. 1. Newman DJ, Cragg GM. Natural products as sources of new 21. Li MH, Ung PM, Zajkowski J, et al. Automated genome mining drugs over the 30 years from 1981 to 2010. J Nat Prod 2012; for natural products. BMC Bioinformatics 2009;10:185. 75(3):311–35. 22. Ro ¨ ttig M, Medema MH, Blin K, et al. NRPSpredictor2–a web 2. 
Nu ¨ tzmann HW, Huang A, Osbourn A. Plant metabolic clus- server for predicting NRPS adenylation domain specificity. ters—from genetics to genomics. New Phytol 2016;211(3): Nucleic Acids Res 2011;39:W362–7. 771–89. 23. Kautsar SA, Suarez Duran HG, Blin K, et al. plantiSMASH: 3. Medema MH, Osbourn A. Computational genomic identifica- automated identification, annotation and expression analy- tion and functional reconstitution of plant natural product sis of plant biosynthetic gene clusters. Nucleic Acids Res 2017; biosynthetic pathways. Nat Prod Rep 2016;33:951–62. 45:W55–63. 4. Zazopoulos E, Huang K, Staffa A, et al. A genomics-guided 24. Skinnider MA, Merwin NJ, Johnston CW, et al. PRISM 3: approach for discovering and expressing cryptic metabolic expanded prediction of natural product chemical structures pathways. Nat Biotechnol 2003;21:187–90 [Database. from microbial genomes. Nucleic Acids Res 2017;45(W1): 5. Yadav G, Gokhale RS, Mohanty D. SEARCHPKS: a program for W49–54. detection and analysis of polyketide synthase domains. 25. Tietz JI, Schwalen CJ, Patel PS, et al. A new genome-mining Nucleic Acids Res 2003;31(13):3654–8. tool redefines the lasso peptide biosynthetic landscape. Nat 6. Blin K, Wolf T, Chevrette MG, et al. antiSMASH 4.0-improve- Chem Biol 2017;13(5):470–8. ments in chemistry prediction and gene cluster boundary 26. Khater S, Gupta M, Agrawal P, et al. SBSPKSv2: structure- identification. Nucleic Acids Res 2017;45(W1):W36–41. based sequence analysis of polyketide synthases and non- 7. Blin K, Medema MH, Kottmann R, et al. The antiSMASH data- ribosomal peptide synthetases. Nucleic Acids Res 2017;45(W1): base, a comprehensive database of microbial secondary W72–9. metabolite biosynthetic gene clusters. Nucleic Acids Res 2017; 27. Dufresne Y, Noe ´ L, Lecle `re V, et al. Smiles2Monomers: a link 45:D555–9. between chemical and biological structures for polymers. 8. Alanjary M, Kronmiller B, Adamek M, et al. The Antibiotic J Cheminform 2015;7:62. 
Resistant Target Seeker (ARTS), an exploration engine for 28. Khaldi N, Seifuddin FT, Turner G, et al. SMURF: genomic map- antibiotic cluster prioritization and novel drug target discov- ping of fungal secondary metabolite clusters. Fungal Genet Biol ery. Nucleic Acids Res 2017;45(W1):W42–8. 2010;47:736–41. Downloaded from https://academic.oup.com/bib/article/20/4/1103/4590131 by DeepDyve user on 16 July 2022 1112 | Blin et al. 50. Cruz-Morales P, Kopp JF, Martı ´nez-Guerrero C, et al. 29. Weber T, Rausch C, Lopez P, et al. CLUSEAN: a computer- based framework for the automated analysis of bacterial sec- Phylogenomic analysis of natural products biosynthetic gene ondary metabolite biosynthetic gene clusters. J Biotechnol clusters allows discovery of arseno-organic metabolites in 2009;140:13–17. model streptomycetes. Genome Biol Evol 2016;8:1906–16. 30. Medema MH, Blin K, Cimermancic P, et al. antiSMASH: rapid 51. Chevrette MG, Aicheler F, Kohlbacher O, et al. SANDPUMA: identification, annotation and analysis of secondary metabo- ensemble predictions of nonribosomal peptide chemistry lite biosynthesis gene clusters in bacterial and fungal genome reveals biosynthetic diversity across actinobacteria. sequences. Nucleic Acids Res 2011;39:W339–46. Bioinformatics 2017;33(20):3202–10. 31. Blin K, Medema MH, Kazempour D, et al. antiSMASH 2.0–a ver- 52. Price MN, Dehal PS, Arkin AP. FastTree 2–approximately satile platform for genome mining of secondary metabolite maximum-likelihood trees for large alignments. PLoS One producers. Nucleic Acids Res 2013;41:W204–12. 2010;5(3):e9490. 32. Weber T, Blin K, Duddela S, et al. antiSMASH 3.0-a compre- 53. Dickschat JS. Bacterial terpene cyclases. Nat Prod Rep 2016; hensive resource for the genome mining of biosynthetic gene 33(1):87–110. clusters. Nucleic Acids Res 2015;43:W237–43. 54. Klassen JL, Currie CR. Gene fragmentation in bacterial draft 33. Blin K, Kazempour D, Wohlleben W, et al. 
Improved lanthi- genomes: extent, consequences and mitigation. BMC peptide detection and prediction for antiSMASH. PLoS One Genomics 2012;13:14. 2014;9(2):e89420. 55. Cibria ´ n-Jaramillo A, Barona-Go ´ mez F. Increasing metage- 34. Skinnider MA, Dejong CA, Rees PN, et al. Genomes to natural nomic resolution of microbiome interactions through func- products PRediction Informatics for Secondary Metabolomes tional phylogenomics and bacterial sub-communities. Front (PRISM). Nucleic Acids Res 2015;43:9645–62. Genet 2016;7:4. 35. Skinnider MA, Johnston CW, Edgar RE, et al. Genomic charting 56. Larkin MA, Blackshields G, Brown NP, et al. Clustal W and of ribosomally synthesized natural product chemical space Clustal X version 2.0. Bioinformatics 2007;23(21):2947–8. facilitates targeted mining. Proc Natl Acad Sci USA 2016; 57. Edgar RC. MUSCLE: multiple sequence alignment with high 113(42):E6343–51. accuracy and high throughput. Nucleic Acids Res 2004;32(5): 36. Ziemert N, Alanjary M, Weber T. The evolution of genome 1792–7. mining in microbes—a review. Nat Prod Rep 2016;33(8): 58. Tatusova T, DiCuccio M, Badretdin A, et al. NCBI prokaryotic 988–1005. genome annotation pipeline. Nucleic Acids Res 2016;44(14): 37. Fedorova ND, Moktali V, Medema MH. Bioinformatics 6614–24. approaches and software for detection of secondary meta- 59. Tatusova T, Ciufo S, Fedorov B, et al. RefSeq microbial bolic gene clusters. Methods Mol Biol 2012;944:23–45. genomes database: new representation and annotation strat- 38. Lecle ` re V, Weber T, Jacques P, et al. Bioinformatics tools for egy. Nucleic Acids Res 2014;42:D553–9. the discovery of new nonribosomal peptides. Methods Mol Biol 60. Andersen MR, Nielsen JB, Klitgaard A, et al. Accurate predic- 2016;1401:209–32. tion of secondary metabolite gene clusters in filamentous 39. Adamek M, Spohn M, Stegmann E, et al. Mining bacterial fungi. Proc Natl Acad Sci USA 2013;110(1):E99–107. genomes for secondary metabolite gene clusters. 
Methods Mol 61. Letzel AC, Li J, Amos GCA, et al. Genomic insights into special- Biol 2017;1520:23–47. ized metabolism in the marine actinomycete Salinispora. 40. Weber T. In silico tools for the analysis of antibiotic biosyn- Environ Microbiol 2017;19:3660–73. thetic pathways. Int J Med Microbiol 2014;304(3–4):230–5. 62. Cruz-Morales P, Vijgenboom E, Iruegas-Bocardo F, et al. The 41. Weber T, Kim HU. The secondary metabolite bioinformatics genome sequence of Streptomyces lividans 66 reveals a novel portal: computational tools to facilitate synthetic biology of tRNA-dependent peptide biosynthetic system within a secondary metabolite production. Synth Syst Biotechnol 2016; metal-related genomic island. Genome Biol Evol 2013;5: 1(2):69–79. 1165–75. 42. Medema MH, Fischbach MA. Computational approaches to 63. de los Santos ELC, Challis GL. clusterTools: proximity natural product discovery. Nat Chem Biol 2015;11(9):639–48. searches for functional elements to identify putative biosyn- 43. Chavali AK, Rhee SY. Bioinformatics tools for the identifica- thetic gene clusters. bioRxiv 2017. (Epub ahead of print). doi: tion of gene clusters that biosynthesize specialized metabo- 10.1101/119214. lites. Brief Bioinform 2017. (Epub ahead of print). doi: 10.1093/ 64. Medema MH, Takano E, Breitling R. Detecting sequence bib/bbx020. homology at the gene cluster level with MultiGeneBlast. Mol 44. Hyatt D, Chen GL, Locascio PF, et al. Prodigal: prokaryotic gene Biol Evol 2013;30(5):1218–23. recognition and translation initiation site identification. BMC 65. Donia MS, Cimermancic P, Schulze CJ, et al. A systematic anal- Bioinformatics 2010;11:119. ysis of biosynthetic gene clusters in the human microbiome 45. Majoros WH, Pertea M, Salzberg SL. TigrScan and reveals a common family of antibiotics. Cell 2014;158(6): GlimmerHMM: two open source ab initio eukaryotic gene- 1402–14. finders. Bioinformatics 2004;20(16):2878–9. 66. Zhang Q, Doroghazi JR, Zhao X, et al. 
Expanded natural prod- 46. Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol uct diversity revealed by analysis of lanthipeptide-like gene 2011;7(10):e1002195. clusters in actinobacteria. Appl Environ Microbiol 2015;81: 47. Finn RD, Coggill P, Eberhardt RY, et al. The Pfam protein fami- 4339–50. lies database: towards a more sustainable future. Nucleic Acids 67. Doroghazi JR, Albright JC, Goering AW, et al. A roadmap for Res 2016;44(D1):D279–85. natural product discovery based on large-scale genomics and 48. Haft DH, Selengut JD, Richter RA, et al. TIGRFAMs and genome metabolomics. Nat Chem Biol 2014;10(11):963–8. properties in 2013. Nucleic Acids Res 2013;41:D387–95. 68. Maansson M, Vynne NG, Klitgaard A, et al. An integrated 49. Cimermancic P, Medema MH, Claesen J, et al. Insights into metabolomic and genomic mining workflow to uncover the secondary metabolism from a global analysis of prokaryotic biosynthetic potential of bacteria. mSystems 2016;1(3): biosynthetic gene clusters. Cell 2014;158(2):412–21. e00028-15. doi: 10.1128/mSystems.00028–15. Downloaded from https://academic.oup.com/bib/article/20/4/1103/4590131 by DeepDyve user on 16 July 2022 RecentdevelopmentofantiSMASH | 1113 74. Tong Y, Charusanti P, Zhang L, et al. CRISPR-Cas9 based engi- 69. Cruz-Morales P, Ramos-Aboites HE, Licona-Cassani C, et al. Actinobacteria phylogenomics, selective isolation from an neering of actinomycetal genomes. ACS Synth Biol 2015;4(9): iron oligotrophic environment and siderophore functional 1020–9. characterization, unveil new desferrioxamine traits. FEMS 75. Nødvig CS, Nielsen JB, Kogle ME, et al. A CRISPR-Cas9 system Microbiol Ecol 2017;93(9). doi: 10.1093/femsec/fix086. for genetic engineering of filamentous fungi. PLoS One 2015; 70. Gutie ´ rrez-Garcı ´a K, Neira-Gonza ´ lez A, Pe ´ rez-Gutie ´ rrez RM, 10(7):e0133085. et al. Phylogenomics of 2, 4-Diacetylphloroglucinol-producing 76. Zhang MM, Wong FT, Wang Y, et al. 
CRISPR-Cas9 strategy for pseudomonas and novel antiglycation endophytes from Piper activation of silent Streptomyces biosynthetic gene clusters. auritum. J Nat Prod 2017;80:1955–63. Nat Chem Biol 2017;13:607–9. 71. Rutledge PJ, Challis GL. Discovery of microbial natural prod- 77. Weber J, Valiante V, Nødvig CS, et al. Functional reconstitu- ucts by activation of silent biosynthetic gene clusters. Nat Rev tion of a fungal natural product gene cluster by advanced Microbiol 2015;13(8):509–23. genome editing. ACS Synth Biol 2017;6:62–8. 72. Ren H, Wang B, Zhao H. Breaking the silence: new strategies 78. Kim HU, Charusanti P, Lee SY, et al. Metabolic for discovering novel natural products. Curr Opin Biotechnol engineering with systems biology tools to optimize produc- tion of prokaryotic secondary metabolites. Nat Prod Rep 2016; 2017;48:21–7. 73. Cobb RE, Wang Y, Zhao H. High-efficiency multiplex genome 33:933–41. editing of Streptomyces species using an engineered CRISPR/ Cas system. ACS Synth Biol 2015;4:723–8.
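
As a supplementary illustration, the ‘exclusion logic’ discussed in the text for connecting a known molecule to its BGC can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the `BGC` record, its fields, the `exclude_unlikely` helper and the four example clusters are all hypothetical and do not correspond to any antiSMASH output format or API. It merely shows how candidates might be excluded step by step by expected biosynthetic class, expected module count and the presence of a domain expected for a specific chemical feature.

```python
from dataclasses import dataclass, field


@dataclass
class BGC:
    """Hypothetical, simplified record of one predicted gene cluster."""
    name: str
    bgc_class: str                      # e.g. "NRPS", "PKS", "RiPP"
    n_modules: int = 0                  # number of predicted NRPS/PKS modules
    domains: set = field(default_factory=set)


def exclude_unlikely(bgcs, allowed_classes, expected_modules,
                     required_domains, tolerance=1):
    """Apply the exclusion steps in order and return the surviving BGCs."""
    # Step 1: keep only biosynthetic classes compatible with the chemical class.
    candidates = [b for b in bgcs if b.bgc_class in allowed_classes]
    # Step 2: keep clusters whose module count is close to the expected number.
    candidates = [b for b in candidates
                  if abs(b.n_modules - expected_modules) <= tolerance]
    # Step 3: keep clusters encoding at least one of the domains expected
    # for a specific chemical feature (here: acylation of a peptide).
    candidates = [b for b in candidates if b.domains & required_domains]
    return candidates


# Invented example data, not real predictions.
predicted = [
    BGC("cluster_1", "NRPS", 7, {"C-starter", "A", "PCP", "TE"}),
    BGC("cluster_2", "NRPS", 3, {"A", "PCP", "TE"}),
    BGC("cluster_3", "PKS", 7, {"KS", "AT", "ACP"}),
    BGC("cluster_4", "RiPP", 0, {"LanB", "LanC"}),
]

# Example question: which cluster could make an acylated heptapeptide
# that is assumed to be nonribosomal?
survivors = exclude_unlikely(
    predicted,
    allowed_classes={"NRPS"},
    expected_modules=7,
    required_domains={"CoA-ligase", "C-starter"},
)
print([b.name for b in survivors])
```

In a real study, each filter would be fed by actual annotation results, and the tolerance on the module count would have to be chosen with care, since module predictions can be incomplete in fragmented assemblies. Because peptides can be made ribosomally or nonribosomally, both classes would normally be kept in the first step unless the structure rules one of them out.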

Journal

Briefings in Bioinformatics, Oxford University Press

Published: Jul 19, 2019
