NRPSsp: non-ribosomal peptide synthase substrate predictor

BIOINFORMATICS APPLICATIONS NOTE, Vol. 28 no. 3 2012, pages 426–427
doi:10.1093/bioinformatics/btr659
Sequence analysis
Advance Access publication November 29, 2011

Carlos Prieto*, Carlos García-Estrada, Diego Lorenzana and Juan Francisco Martín
Institute of Biotechnology of Leon, INBIOTEC, Parque Científico de León, 24006 León, Spain
Associate Editor: Alfonso Valencia
*To whom correspondence should be addressed.

ABSTRACT

Summary: Non-ribosomal peptide synthetases (NRPSs) are multi-modular enzymes, which biosynthesize many important peptide compounds produced by bacteria and fungi. Some studies have revealed that an individual domain within the NRPSs shows significant substrate selectivity. The discovery and characterization of non-ribosomal peptides are of great interest for the biotechnological industries. We have applied computational mining methods in order to build a database of NRPS modules that bind to specific substrates. We have used this database to build a hidden Markov model predictor of substrates that bind to a given NRPS.
Availability: The database and the predictor are freely available on an easy-to-use website at www.nrpssp.com.
Contact: carlos.prieto@unileon.es
Supplementary information: Supplementary data is available at Bioinformatics online.

Received on July 12, 2011; revised on November 2, 2011; accepted on November 24, 2011

1 INTRODUCTION

Nonribosomal peptide synthetases (NRPSs) are multi-modular enzymes involved in the biosynthesis of natural products. A minimal NRPS module contains specific functional domains which are able to catalyze several activities, such as amino acid adenylation (A-activation), thioesterification (T-thiolation or acyl carrier domain) and peptide-bond formation (C-condensation domain), allowing elongation of the nascent peptide (Schwarzer and Marahiel, 2001). The primary composition of the final product is determined by the sequential order of the A-domains along the synthetase, because each A-domain recruits a particular type of substrate. The crystal structure of the peptide synthetase GrsA, which was solved with a bound phenylalanine substrate molecule, has enabled the identification of 10 key residues in the A-domain that are important for substrate binding (Conti et al., 1997). Accordingly, these residues can determine the substrate specificity of A-domains, and their extraction from characterized A-domains has yielded a collection of key-residue signatures and general rules for deducing the substrate specificity of non-characterized A-domains (Challis et al., 2000; Stachelhaus et al., 1999). Moreover, machine learning techniques have been applied to build a classifier based on 20 key residues and on the physico-chemical properties of amino acids to gain prediction power (Rausch et al., 2005; Röttig et al., 2011). Consequently, software tools and databases have been developed to collect NRPS products, such as NORINE (Caboche et al., 2008) and NRPS-PKS (Anand et al., 2010), and to predict the binding substrates of an NRPS, such as NRPS-PKS (Ansari et al., 2004), NP.Searcher (Li et al., 2009), PKS/NRPS Analysis (Bachmann and Ravel, 2009) and NRPSpredictor2 (Röttig et al., 2011). These prediction websites are mainly based on the generally accepted rule of analysing the active site within the A-domains. However, it is known that this approach has difficulties analysing certain types of synthetases, especially those which belong to fungal species. Problems can arise because the GrsA crystal seems to be an inadequate model for them (Jenke-Kodama and Dittmann, 2009), or because the large number of sequence variants in the active centre does not allow a correct extraction of the key residues for prediction.

This observation suggests the interest of developing new prediction methods supported by other approaches. One of these approaches could be the use of hidden Markov models (HMMs), which Khurana et al. (2010) have applied to functionally classify the acyl:CoA synthetase superfamily members. That work suggests that the application of HMM profiles to classify this superfamily outperforms the predictions based on a limited number of active site residues (Khurana et al., 2010). The methods stated above can also be applied to a more ambitious goal, such as the determination of the substrate that binds to an adenylation domain.

The current omics era has enabled the exponential growth of the number of sequenced NRPSs. This implies that a tool which could predict the specificity of their A-domains is of increasing interest, and that its training could benefit from the newly annotated NRPSs. These facts, the previous experience of our group in the area, and the cited publications have enabled the presented work, whose ultimate goal is to develop a new bioinformatic tool for the collection, annotation, storage and prediction of substrates which bind to adenylation domains in NRPSs. This open software tool applies a new approach in the following areas: prediction based on HMMs, enlarged training sets obtained by applying mining techniques, the regular update of its database and its design for the functional analysis of incoming NGS data.

2 METHODS

This work has been developed in three phases: (i) construction of a database of adenylation domains which bind a known substrate; (ii) building, training and testing of a computational predictor; and (iii) development of a web tool. The global work flow is represented in Supplementary Figure S1 and a detailed description of the methods is given in the Supplementary Material. In order to construct the database, a semi-automatic annotation protocol was implemented and will be applied regularly in order to update the database (see Supplementary Material for a detailed description of the methods). Regarding the predictor, strategies based on position specific scoring matrices (PSSMs) and HMMs were tested to build the classifier. HMM approaches obtain better results because of the sequence heterogeneity of the A-domains and the consequent difficulty of aligning them. That is why the classification method was developed with HMMs, although this meant abandoning the idea of identifying key residues in input sequences. A cross-validation of the predictor was done by applying a leave-one-out (LOO) test, and a receiver operating characteristic curve of these results was plotted with the R package ROCR (Supplementary Fig. S2). The input sequences of the LOO test were also analysed with NRPSpredictor2 (one class classifier) and PKS/NRPS Analysis in order to compare different approaches. The predictor is available online through the website www.nrpssp.com. It was developed with a LAMP architecture (Linux, Apache, MySQL and PHP).

3 RESULTS

The large increase of sequenced proteins that has occurred in recent years has enabled the collection of more data than in previous studies. Initially, 37 126 proteins were considered, which were annotated as NRPSs or had at least one A-domain. However, only a small subset of these proteins is fully annotated (only 721 proteins are in the Swissprot subset), and a small fraction of this subset was useful for building the database. The automatic annotation of substrates with their corresponding A-domains obtained 1490 entries. Then, a manual data curation was done, correcting the existing data errors and deleting the doubtful entries [mainly because (i) they do not belong to an NRPS module or (ii) the exact correspondence with a substrate is not known]. This process results in a database with 1598 domains which have a known binding substrate. From these, 1578 sequences were used for training the classifier, because the rest bind to substrates that have <4 annotated sequences. Although the size is not too large, it is the biggest database that has been used to train a method which predicts NRPS substrates. Rausch et al. (2005) used a database with 394 entries, of which 300 were used to train the SVM classifier, and the recent update of this method used a database with 557 entries (the number of entries used to train the classifier was not described) (Rausch et al., 2005; Röttig et al., 2011). This means that our database has more than triple the size of previous ones. It is available online and the semi-automatic methods that have been developed allow its regular update. We expect that this resource will be a reference set for future research in this area.

This database was used to construct the classifier by means of the application of HMM profiles. The reliability of the classifier was measured using all data as both training and test set, and by a LOO method as well (Section 2). The error rate was 13.6 and 4.96% for LOO and whole training data, respectively. If low score results (highlighted in red in the web application) are considered as not available, the error rate decreases to 5.77 and 3.76%, respectively. The LOO sequences were analysed with NRPSpredictor2 and PKS/NRPS Analysis in order to estimate their error rate in similar terms. The NRPSpredictor2 test obtained an error rate of 22.6% taking into account all the predictions and 8.29% excluding null and unavailable predictions. Similarly, PKS/NRPS Analysis obtained error rates of 73.7 and 27.35%, respectively (Supplementary Table S1). This is a very promising result, indicating that the use of more comprehensive training data and HMMs achieves a more reliable predictor. In addition, the results obtained by classifying fungal proteins were studied separately. The error rate excluding low score results was 4.35% for LOO and 2.17% for whole training data, and if the low score results are not excluded, 35% for LOO and 4.10% for whole training data. An increase of the error rate is noticeable when low score results are not filtered. This increase is induced by the wide variety of fungal A-domains and their small number in the training set. Similar results have been obtained with NRPSpredictor2 and PKS/NRPS Analysis, whose predictions had a low coverage (around 30%) and an error rate of 36 and 9.1%, respectively (excluding null and non-available predictions, see Supplementary Table S2). However, coverage problems are expected to disappear as the number of fungal A-domains in the test set is increased, and this is a major objective which NRPSsp attempts to address with frequent updates.

NRPSsp is available via the website www.nrpssp.com. This website easily allows the analysis of a set of sequences, which are passed as parameters in FASTA format (Supplementary Fig. S3 shows an example). In addition, the website has a download section which contains the updated database that has been used to train the current classifier and the HMM profiles which have been built. It enables future studies in the area and the execution of the classifier in a stand-alone mode. In this way, the application is designed for use with NGS data, which is becoming common in biotechnological research, and allows a quick functional annotation of NRPS proteins and knowledge of the substrate specificity of their A-domains.

Funding: Agencia de Inversiones y Servicios de Castilla y León (record CCTT/10/LE/0001); Juan de la Cierva programme (JCI-2009-05444) of the Ministry of Science and Innovation (Spain) (to C.P.).

Conflict of Interest: none declared.

REFERENCES

Anand,S. et al. (2010) SBSPKS: structure based sequence analysis of polyketide synthases. Nucleic Acids Res., 38, W487–W496.
Ansari,M.Z. et al. (2004) NRPS-PKS: a knowledge-based resource for analysis of NRPS/PKS megasynthases. Nucleic Acids Res., 32, W405–W413.
Bachmann,B.O. and Ravel,J. (2009) Chapter 8. Methods for in silico prediction of microbial polyketide and nonribosomal peptide biosynthetic pathways from DNA sequence data. Methods Enzymol., 458, 181–217.
Caboche,S. et al. (2008) NORINE: a database of nonribosomal peptides. Nucleic Acids Res., 36, D326–D331.
Challis,G.L. et al. (2000) Predictive, structure-based model of amino acid recognition by nonribosomal peptide synthetase adenylation domains. Chem. Biol., 7, 211–224.
Conti,E. et al. (1997) Structural basis for the activation of phenylalanine in the non-ribosomal biosynthesis of gramicidin S. EMBO J., 16, 4174–4183.
Jenke-Kodama,H. and Dittmann,E. (2009) Bioinformatic perspectives on NRPS/PKS megasynthases: advances and challenges. Nat. Prod. Rep., 26, 874–883.
Khurana,P. et al. (2010) Genome scale prediction of substrate specificity for acyl adenylate superfamily of enzymes based on active site residue profiles. BMC Bioinformatics, 11, 57.
Li,M.H. et al. (2009) Automated genome mining for natural products. BMC Bioinformatics, 10, 185.
Rausch,C. et al. (2005) Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs). Nucleic Acids Res., 33, 5799–5808.
Röttig,M. et al. (2011) NRPSpredictor2–a web server for predicting NRPS adenylation domain specificity. Nucleic Acids Res., 39, W362–W367.
Schwarzer,D. and Marahiel,M.A. (2001) Multimodular biocatalysts for natural product assembly. Naturwissenschaften, 88, 93–101.
Stachelhaus,T. et al. (1999) The specificity-conferring code of adenylation domains in nonribosomal peptide synthetases. Chem. Biol., 6, 493–505.
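
Note on the reported error rates. The Results give each error rate twice: once over all predictions and once with low score (or null) results treated as not available, and they report coverage for the fungal comparison. The Python sketch below illustrates that bookkeeping only. It is not the NRPSsp code: the `Result` record, the `error_rate` function, the score cutoff of 50 and the five example rows are all hypothetical, and counting a missing prediction as an error in the unfiltered case is an assumption.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Result(NamedTuple):
    """One test sequence (hypothetical record, not the NRPSsp data format)."""
    actual: str                # annotated substrate of the A-domain
    predicted: Optional[str]   # predicted substrate, None if no prediction
    score: float               # score of the best hit (arbitrary units here)


def error_rate(results: Sequence[Result],
               low_score_cutoff: Optional[float] = None) -> Tuple[float, float]:
    """Return (error rate, coverage) for a set of predictions.

    Without a cutoff every sequence is counted, and a missing prediction
    counts as an error (an assumption). With a cutoff, missing predictions
    and predictions scoring below it are treated as not available and are
    left out of the error rate, which lowers the coverage.
    """
    if low_score_cutoff is None:
        counted = list(results)
    else:
        counted = [r for r in results
                   if r.predicted is not None and r.score >= low_score_cutoff]
    if not counted:
        return float("nan"), 0.0
    errors = sum(r.predicted != r.actual for r in counted)
    return errors / len(counted), len(counted) / len(results)


# Made-up example rows, for illustration only.
results = [
    Result("Phe", "Phe", 120.0),
    Result("Leu", "Leu", 95.0),
    Result("Val", "Ile", 12.0),   # wrong, but with a low score
    Result("Ala", "Ala", 80.0),
    Result("Gly", "Ser", 88.0),   # wrong with a high score
]

print(error_rate(results))                          # all predictions counted
print(error_rate(results, low_score_cutoff=50.0))   # low scores not available
```

In this toy set, filtering removes one wrong low score prediction, so the error rate falls while the coverage drops below 100%. This is the same trade-off that the Results describe for fungal A-domains.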

Publisher
Oxford University Press
Copyright
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/btr659
pmid
22130593


Journal

Bioinformatics, Oxford University Press

Published: Nov 29, 2011
