J. Gorodkin (2004)
Comparing two K-category assignments by a K-category correlation coefficient. Computational biology and chemistry, 28(5-6)
Ulrich Bodenhofer, Andreas Kothmeier, S. Hochreiter (2011)
APCluster: an R package for affinity propagation clustering. Bioinformatics, 27(17)
Carsten Mahrenholz, Ingrid Abfalter, Ulrich Bodenhofer, R. Volkmer, S. Hochreiter (2011)
Complex Networks Govern Coiled-Coil Oligomerization – Predicting and Profiling by Means of a Machine Learning Approach. Molecular & Cellular Proteomics: MCP, 10
Chih-Chung Chang, Chih-Jen Lin (2011)
LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2
S. Sonnenburg, G. Rätsch, S. Henschel, Christian Widmer, Jonas Behr, A. Zien, F. Bona, Alexander Binder, Christian Gehl, Vojtech Franc (2010)
The SHOGUN Machine Learning Toolbox. J. Mach. Learn. Res., 11
Alexandros Karatzoglou, A. Smola, K. Hornik, A. Zeileis (2004)
kernlab - An S4 Package for Kernel Methods in R. Journal of Statistical Software, 11
S. Hochreiter, K. Obermayer (2006)
Support Vector Machines for Dyadic Data. Neural Computation, 18
(1975)
Kernel Methods in Computational Biology, chapter Accurate Splice Site Prediction for Caenorhabditis elegans
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, Chih-Jen Lin (2008)
LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res., 9
D. Madden, D. Garboczi, D. Wiley (1993)
The antigenic identity of peptide-MHC complexes: A comparison of the conformations of five viral peptides presented by HLA-A2. Cell, 75
P. Meinicke, M. Tech, B. Morgenstern, R. Merkl (2004)
Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites. BMC Bioinformatics, 5
G. Rätsch, S. Sonnenburg, B. Schölkopf (2005)
RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl 1)
Michael Tipping (2001)
Sparse Bayesian Learning and the Relevance Vector Machine. J. Mach. Learn. Res., 1
A. Visel, M. Blow, Zirong Li, Tao Zhang, J. Akiyama, Amy Holt, I. Plajzer-Frick, Malak Shoukry, Crystal Wright, Feng Chen, Veena Afzal, B. Ren, E. Rubin, L. Pennacchio (2009)
ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature, 457
A. Ben-Hur, D. Brutlag (2003)
Remote homology detection: a motif based approach. Bioinformatics, 19(Suppl 1)
J. Suykens, J. Vandewalle (1999)
Least Squares Support Vector Machine Classifiers. Neural Processing Letters, 9
C. Leslie, E. Eskin, William Noble (2001)
The Spectrum Kernel: A String Kernel for SVM Protein Classification. Pacific Symposium on Biocomputing
Dongwon Lee, R. Karchin, Michael Beer (2011)
Discriminative prediction of mammalian enhancers from DNA sequence. Genome research, 21(12)
Bioinformatics Advance Access published January 22, 2004
(2003)
Mismatch string kernels for discriminative protein classification, 1
Ulrich Bodenhofer (2012)
PrOCoil — A Web Service and an R Package for Predicting the Oligomerization of Coiled Coil Proteins
Kirsten Roomp, I. Antes, Thomas Lengauer (2010)
Predicting MHC class I epitopes in large datasets. BMC Bioinformatics, 11
Ulrich Bodenhofer, Karin Schwarzbauer, M. Ionescu, S. Hochreiter (2009)
Modeling Position Specificity in Sequence Kernels by Fuzzy Equivalence Relations
P. Kuksa, Pai-Hsi Huang, V. Pavlovic (2008)
A fast, large-scale learning method for protein sequence classification
(2015)
KeBABS: an R package for kernel-based analysis of biological sequences
Summary: KeBABS provides a powerful, flexible and easy to use framework for kernel-based analysis of biological sequences in R. It includes efficient implementations of the most important sequence kernels, including variants that take sequence annotations and positional information into account. KeBABS seamlessly integrates three common support vector machine (SVM) implementations with a unified interface. It allows for hyperparameter selection by cross validation and nested cross validation, and also features grouped cross validation. The biological interpretation of SVM models is supported by (1) the computation of weights of sequence patterns and (2) prediction profiles that highlight the contributions of individual sequence positions or sections.

Availability and implementation: The R package kebabs is available via the Bioconductor project: http://bioconductor.org/packages/release/bioc/html/kebabs.html. Further information and the R code of the example in this paper are available at http://www.bioinf.jku.at/software/kebabs/.

Contact: [email protected] or [email protected]

© The Author 2015. Published by Oxford University Press. All rights reserved.

1 Introduction

The analysis of biological sequences is a fundamental task in bioinformatics. In the last two decades, kernel methods have been established as an important class of sequence analysis methods. For the classification of sequences, in particular, support vector machines (SVMs) have emerged as a sort of best practice. To apply SVMs for sequence analysis, it is necessary to either use a vectorial representation of the sequence data or to use sequence kernels, that is, positive semi-definite similarity measures for sequences. The use of sequence kernels, however, is not limited to sequence classification. For example, they can also be used for regression tasks and similarity-based clustering.

On the scientific computing platform R, which is widely used in bioinformatics, only the kernlab package (Karatzoglou et al., 2004) provides a limited selection of sequence kernels. This article presents KeBABS, an R/Bioconductor package for kernel-based sequence analysis that is primarily focused on biological applications. Compared with the SHOGUN Toolbox (Sonnenburg et al., 2010), KeBABS provides a wider selection of up-to-date sequence kernels and facilitates seamless interplay with R and Bioconductor's sequence data packages.

2 Package description

Sequence kernels are the core functionality of KeBABS. Four commonly used kernels are provided: spectrum kernel (Leslie et al., 2002), mismatch kernel (Leslie et al., 2003), gappy pair kernel (a subset of spatial sample kernels according to Kuksa et al., 2008) and motif kernel (Ben-Hur and Brutlag, 2003). These kernels consider occurrences of patterns regardless of their positions.

KeBABS also supports position-dependent variants for all its kernels except for the mismatch kernel: (i) position-specific variants only count occurrences of patterns if they appear at exactly the same position. (ii) Distance-weighted variants count occurrences of patterns if they appear at similar positions (Bodenhofer et al., 2009), where the positional similarity is determined by a distance weighting function. The package provides three built-in distance weighting functions. The spectrum kernel with Gaussian distance weights closely corresponds to the oligo kernel (Meinicke et al., 2004). The distance weighting used by the shifted weighted degree kernel (Rätsch et al., 2005) is available too, and users can also supply custom distance weights. Therefore, the package also includes the weighted degree kernel and the shifted weighted degree kernel (Rätsch et al., 2005), but without position weights.
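
The difference between the position-independent spectrum kernel and its position-dependent variants can be made concrete with a small sketch. The code below is plain Python written for illustration only; it is not the KeBABS implementation and does not use the package's API. All function names are hypothetical, the Gaussian weight is assumed to have the form exp(-d²/σ²), and the second peptide is a made-up variant of the peptide LLFGYPVYV from Section 3.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping k-mers of seq (explicit spectrum feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k):
    """Position-independent spectrum kernel: dot product of k-mer counts."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(n * cy[kmer] for kmer, n in cx.items())


def distance_weighted_spectrum_kernel(x, y, k, weight):
    """Sum weight(|i - j|) over all position pairs (i, j) at which
    x and y carry the same k-mer."""
    total = 0.0
    for i in range(len(x) - k + 1):
        for j in range(len(y) - k + 1):
            if x[i:i + k] == y[j:j + k]:
                total += weight(abs(i - j))
    return total


def position_specific(d):
    """Weight that only counts patterns at exactly the same position."""
    return 1.0 if d == 0 else 0.0


def gaussian_weight(sigma):
    """Gaussian distance weight; the exact functional form is an assumption."""
    return lambda d: math.exp(-(d * d) / (sigma * sigma))


def normalized(kernel, x, y, **params):
    """Scale a kernel value so that every sequence has similarity 1 to itself."""
    return kernel(x, y, **params) / math.sqrt(
        kernel(x, x, **params) * kernel(y, y, **params)
    )


if __name__ == "__main__":
    a = "LLFGYPVYV"  # peptide from the example in Section 3
    b = "ALFGYPVYV"  # hypothetical variant, differs at position 1 only
    print(spectrum_kernel(a, b, k=1))
    print(normalized(distance_weighted_spectrum_kernel, a, b,
                     k=1, weight=position_specific))
    print(distance_weighted_spectrum_kernel(a, b, k=1,
                                            weight=gaussian_weight(2.0)))
```

A constant weight of 1 reduces the distance-weighted variant to the ordinary spectrum kernel, while the position-specific weight is the other extreme, so the two kernel families in the text are endpoints of the same construction.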
As a unique new feature, KeBABS provides annotation-specific variants for all kernels except the mismatch kernel: each sequence can be accompanied by an aligned annotation sequence from a user-defined alphabet to enable the kernel to distinguish patterns if they appear in different annotation contexts. For example, gene sequences can be annotated with an 'e' for all exon positions and an 'i' for all intron positions. Then the kernel automatically distinguishes between exonic and intronic patterns. As another example, a coiled-coil sequence can be annotated with the heptad register. If used in conjunction with the gappy pair kernel, this corresponds to the coiled-coil kernel (Mahrenholz et al., 2011).

KeBABS can compute kernel matrices for all kernels both in dense and sparse formats. For position-independent kernels, KeBABS also facilitates explicit feature representations that can be stored to dense or sparse matrices. These sparse feature representations in conjunction with LiblineaR allow for analyzing up to hundreds of thousands of sequences.

SVM framework: KeBABS provides a unified interface to three SVM implementations: kernlab (Karatzoglou et al., 2004), LIBSVM (Chang and Lin, 2011; via the e1071 package) and LiblineaR (Fan et al., 2008). The SVM framework in KeBABS can be used for classification (binary and multi-class) and regression tasks. Cross validation and hyperparameter selection are supported with all interfaced SVMs. For hyperparameter selection, accuracy, balanced accuracy and the Matthews correlation coefficient can be selected as performance objectives (the area under the ROC curve is also available for Version 1.2.0 or newer).

Grouped cross validation: Apart from the standard k-fold cross validation, KeBABS also supports grouped cross validation, i.e. cross validation that keeps pre-specified groups together in the same folds. As an example, grouping sequences by their sequence identity is a common approach in protein structure prediction to assess whether a predictor is able to make use of sequence features beyond making a simple sequence identity-based prediction.

3 Example: Epitope-to-MHC binding

We analyzed the binding of protein fragments to the human MHC (major histocompatibility complex) class I molecule HLA-A*0201 based on epitope data from Roomp et al. (2010). The binding of protein fragments to the MHC molecule is an important step of the immune system to recognize proteins of questionable origin and trigger the immune system's reaction. Figure 1a illustrates the binding groove of the MHC molecule with the bound peptide fragment LLFGYPVYV (Madden et al., 1993).

Fig. 1. (a) Structure of the binding groove of human class I MHC HLA-A*0201 with bound peptide LLFGYPVYV in red (PDB ID: 1hhk); (b) position-specific feature weights; (c) prediction profiles for the peptide from (a) and a non-binding peptide; (d) prediction profiles of all samples sorted by increasing decision value. Positive contributions to the decision value are shown in blue and negative ones in red. All results are based on a normalized position-specific spectrum kernel with k = 1

For our analysis, we used the strong binder/clear non-binder subset with 549 strong binders and 503 clear non-binders. The analysis was performed with the C-SVC from package kernlab. Upon hyperparameter selection on 40% of the data, the position-specific spectrum kernel with k = 1 turned out to be the best choice, which indicates that individual positions are highly relevant for the binding behavior. This setting resulted in a cross validation accuracy of 94.3% (average of 10 runs, with σ = 0.453% and an average area under the ROC curve of 0.983).

Figure 1b shows the feature weights computed from the SVM model and the relevance of each amino acid at each position. The prediction profiles of two sequences in Figure 1c show the contribution of each sequence position to the prediction. The upper sequence (peptide from Fig. 1a) is a strong binder with high positive contributions for Leu at pos. 2 and Val at pos. 9. The lower sequence is a clear non-binder with negative contributions from all positions except the last one. The heatmap of the prediction profiles for all sequences in Figure 1d reveals that the second and the last position have high relevance for all binders. Non-binders in the lower half of the image show higher negative contributions from the first and the third position. Positions 4, 5 and 8 generally have little importance.

The importance of the anchor positions 2 and 9 for binding is well known (Madden et al., 1993). The irrelevance of positions 4, 5 and 8 corresponds to the Janus face characteristics of the peptide, with some of the positions facing toward the MHC molecule and some toward a possibly binding T-cell receptor. KeBABS, via the computation of feature weights and prediction profiles, allows for mining such biological knowledge from classification models based on SVMs and sequence kernels.

Conflict of Interest: none declared.

References

Ben-Hur,A. and Brutlag,D.L. (2003) Remote homology detection: a motif based approach. Bioinformatics, 19, 26–33.
Bodenhofer,U. et al. (2009) Modeling position specificity in sequence kernels by fuzzy equivalence relations. In: Carvalho,J.P. et al. (eds), Proceedings of the Joint 13th IFSA World Congress and 6th EUSFLAT Conference, Lisbon, Portugal, pp. 1376–1381.
Chang,C.-C. and Lin,C.-J. (2011) LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol., 2, 27:1–27:27.
Fan,R.-E. et al. (2008) LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res., 9, 1871–1874.
Karatzoglou,A. et al. (2004) Kernlab - an S4 package for kernel methods in R. J. Stat. Softw., 11, 1–20.
Kuksa,P. et al. (2008) A fast, large-scale learning method for protein sequence classification. In: 8th International Workshop on Data Mining in Bioinformatics, Las Vegas, NV, USA, Chapman & Hall/CRC Press, pp. 29–37.
Leslie,C.S. et al. (2002) The spectrum kernel: a string kernel for SVM protein classification. In: Altman,R.B. et al. (eds). Pacific Symposium on Biocomputing, World Scientific, Lihue, HI, USA, pp. 564–575.
Leslie,C.S. et al. (2003) Mismatch string kernels for discriminative protein classification. Bioinformatics, 1, 1–10.
Madden,D.R. et al. (1993) The antigenic identity of peptide-MHC complexes: a comparison of five viral peptides presented by HLA-A2. Cell, 75, 693–708.
Mahrenholz,C.C. et al. (2011) Complex networks govern coiled-coil oligomerizations – predicting and profiling by means of a machine learning approach. Mol. Cell Proteomics, 10, M110.004994.
Meinicke,P. et al. (2004) Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites. BMC Bioinformatics, 5, 169.
Rätsch,G. et al. (2005) RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl. 1), i369–i377.
Roomp,K. et al. (2010) Predicting MHC class I epitopes in large datasets. BMC Bioinformatics, 11, 90.
Sonnenburg,S. et al. (2010) The SHOGUN machine learning toolbox. J. Mach. Learn. Res., 11, 1799–1802.
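
Supplementary sketch: the grouped cross validation described in Section 2 requires that all samples sharing a group label end up in the same fold. The following plain Python sketch shows one simple way to build such folds. It is an illustration under stated assumptions, not the procedure used inside KeBABS: the function name `grouped_folds` is hypothetical, and the greedy balancing rule (largest group first, into the currently smallest fold) is a choice made for this example.

```python
from collections import defaultdict


def grouped_folds(groups, n_folds):
    """Assign sample indices to folds so that samples with the same
    group label always land in the same fold.

    groups  : sequence of group labels, one per sample
    n_folds : number of folds to create
    Returns a list of n_folds lists of sample indices.
    """
    # Collect the sample indices that belong to each group.
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing (an assumption of this sketch): place the largest
    # groups first, each into the fold that is currently smallest.
    order = sorted(members, key=lambda g: (-len(members[g]), str(g)))
    for label in order:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[label])
    return folds


if __name__ == "__main__":
    # Hypothetical group labels, e.g. clusters of similar sequences.
    labels = ["a", "a", "b", "c", "c", "c", "d"]
    for k, test_idx in enumerate(grouped_folds(labels, 2)):
        train_idx = [i for i in range(len(labels)) if i not in test_idx]
        print(k, "test:", test_idx, "train:", train_idx)
```

Because whole groups are held out together, a predictor that only exploits similarity between near-identical sequences cannot score well on the held-out fold, which is the purpose stated in the text.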
Bioinformatics – Oxford University Press
Published: Mar 25, 2015