Background: In plants, long non-protein coding RNAs are believed to have essential roles in development and stress responses. However, relative to advances on discerning biological roles for long non-protein coding RNAs in animal systems, this RNA class in plants is largely understudied. With comparatively few validated plant long non-coding RNAs, research on this potentially critical class of RNA is hindered by a lack of appropriate prediction tools and databases. Supervised learning models trained on data sets of mostly non-validated, non-coding transcripts have been previously used to identify this enigmatic RNA class with applications largely focused on animal systems. Our approach uses a training set comprised only of empirically validated long non-protein coding RNAs from plant, animal, and viral sources to predict and rank candidate long non-protein coding gene products for future functional validation. Results: Individual stochastic gradient boosting and random forest classifiers trained on only empirically validated long non-protein coding RNAs were constructed. In order to use the strengths of multiple classifiers, we combined multiple models into a single stacking meta-learner. This ensemble approach benefits from the diversity of several learners to effectively identify putative plant long non-coding RNAs from transcript sequence features. When the predicted genes identified by the ensemble classifier were compared to those listed in GreeNC, an established plant long non-coding RNA database, overlap for predicted genes from Arabidopsis thaliana, Oryza sativa and Eutrema salsugineum ranged from 51 to 83% with the highest agreement in Eutrema salsugineum. Most of the highest ranking predictions from Arabidopsis thaliana were annotated as potential natural antisense genes, pseudogenes, transposable elements, or simply computationally predicted hypothetical protein. Due to the nature of this tool, the model can be updated as new long non-protein coding transcripts are identified and functionally verified. Conclusions: This ensemble classifier is an accurate tool that can be used to rank long non-protein coding RNA predictions for use in conjunction with gene expression studies. Selection of plant transcripts with a high potential for regulatory roles as long non-protein coding RNAs will advance research in the elucidation of long non-protein coding RNA function. Keywords: lncRNA, Classifier, Machine learning, Ensemble, Transcript Background kingdoms of life. LncRNA transcripts often lack sequence Long non-protein coding RNAs (lncRNAs) represent a conservation within close relatives, and the evolution of diverse and functionally important class of RNAs , these transcripts remains poorly understood, but there and have been classically defined as transcripts longer exists growing evidence of positional and structural than 200 nucleotides with little protein-coding poten- conservation that may indicate selection on transcript tial . Previously thought to be transcriptional noise, function . there is now evidence of their involvement in the devel- Unlike other non-coding RNAs, the mechanisms and opment, disease, and stress responses of plants [3, 4]; functions of lncRNAs can range wildly – from epigenetic however, these transcripts are also found throughout all regulation, as exemplified by mouse Xist and human XIST [6, 7], to small RNA target mimics, as seen with IPS1 *Correspondence: email@example.com and ath-miR399 in Arabidopsis thaliana . COLDAIR,a Department of Biology, McMaster University, 1280 Main Street West, Hamilton, Canada © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Simopoulos et al. BMC Genomics (2018) 19:316 Page 2 of 11 lncRNA associated with flowering, functions by remod- differentiate between validated and predicted lncRNA eling chromatin and alters expression of the FLC locus transcripts. These biases can be seen in the popular . A recent review by Ma et al.  suggests that lncRNA databases, LNCipedia and NONCODE [24, 25]. most known lncRNAs regulate transcription, both in cis Outputs from lncRNA software often result in thou- and trans, while others can affect translation, splicing, sands of unranked predictions leaving the researcher to post-translational regulation or are classified as “other choose the most likely candidates for empirical valida- functional mechanisms.” Due to such a wide range of func- tion. In combination with an RNASeq experiment that can tionality, lncRNAs are typically classified by their position result in tens of thousands of transcripts, filtering through to protein coding genes as intergenic (also referred to as thousands of lncRNA predictions can be difficult and lincRNAs), natural antisense, or intronic [1, 10]. time consuming for a researcher. Objectively ranking pre- Notably, lncRNAs can not only be functional in their dictions in combination with gene expression estimates long RNA form, but also act as small RNA precur- can help researchers complete functional validation of sors and sources of small regulatory peptides [11–13] lncRNAs more efficiently. although extensive translation of lncRNAs has been dis- Recently, ensemble methods have become popular for puted . Adding to the complexity of these RNAs, approaching difficult biological problems typically solved some transcripts do not meet the arbitrary length cut- by machine learning [26, 27]. Ensemble models work by offs set by the classical definition for lncRNAs, such as combining multiple learners into a single model which BC1 in mice (152nt) . Even with recent developments helps to avoid over fitting and encourages generalization in sequencing technologies, lncRNAs remain difficult to of the classifier. In addition to improved classification, identify due to low, and condition-dependent and tissue- ensemble methods also remove the difficulty in choos- dependent expression levels . Demonstrating mini- ing the “best” model as all models can be used in a single mal homology with close relatives , current research classifier. Each individual classifier used in the construc- suggests these transcripts undergo fast and unclear evo- tion of the overall ensemble model will have its own lution making functional predictions challenging. This classification strengths, resulting in stronger and more lack of distinct rules for predicting and identifying lncR- accurate predictions when these classifiers are used in NAs is a likely contributor to the lack of validated combination. plant lncRNAs. Here we describe a lncRNA predictor constructed using Currently, many lncRNA prediction softwares that are an ensemble of machine learning models developed for available to researchers, such as PLEK , lncRScan- and tested on plant transcript sequences. We compared SVM , and COME , use machine learning methods accuracy of this meta-learner trained on multiple machine trained on data consisting of lncRNA transcripts yet to learning models to the prediction ability of individual ran- be empirically validated. Without empirical validation, dom forest and gradient boosting models making up the many of these predicted lncRNA transcripts could have no meta-learner. All models were trained on empirically val- regulatory function and could be produced due to spuri- idated lncRNAs to ensure only true lncRNA transcripts ous transcription because of the low fidelity of RNApolII were used in each model’s training sets. We found the . In addition, CPAT and CPC2 are popular most successful method to be a stacking meta-learner softwares used to identify non-coding transcripts. These constructed from eight stochastic gradient boosting mod- softwares are successful at quickly predicting the protein- els. This approach offers multiple advantages over those coding potential of mRNA sequences, but are not specific currently available as this machine learning approach to lncRNAs and are unsuitable for identifying those lncR- prevents predictions from being constrained to the arbi- NAs that may code for small peptides. Additionally, since trary classic definitions of lncRNAs, such as ignoring the majority of lncRNA research is on animals, software transcripts with high coding potential of small open packages for lncRNAs prediction often use only animal reading frames (ORFs). In addition, our method numer- training datasets. While the exact functions of most plant ically scores each prediction to help researchers focus and animal lncRNAs remain poorly understood, there are their validation efforts on highly ranked lncRNA predic- known differences in biogenesis and mechanisms of other tions. Finally, this approach uses the Diamond algorithm non-coding RNAs, such as miRNAs . As such, ignor-  that allows for efficient and fast sequence align- ing the few plant lncRNA transcripts with known func- ment in protein databases, an essential feature for lncRNA tion could hinder the potential of future plant lncRNA prediction. predictors. Depending on the source, lncRNA databases can also Methods fall victim to biases toward animal systems and non- Overview of classifiers validated transcripts as they are often model organ- Multiple machine learning approaches to lncRNA predic- ism specific with a preference for humans, and rarely tion were compared to find the most accurate plant tran- Simopoulos et al. BMC Genomics (2018) 19:316 Page 3 of 11 script classifier. Ensemble approaches were chosen due to on March 28, 2017, using search terms available in the diversity of RNAs in the lncRNA category as these Additional file 1 and were then removed from the dataset. approaches are ideal for heterogeneous data. Ensemble Eight different training sets with different combinations models typically follow three main approaches: bagging, of negative data from multiple species were used to con- boosting, and stacking. Bagging (bootstrap aggregating) struct eight different models and are described in Table 1. relies on creating n models on bootstrapped training Sets denoted “A” and “B” remained constant throughout data, and averages predictions of all models for a final the training sets and were randomly chosen from the tran- group prediction. This protocol is used in the ran- script sequences of each species. These training datasets dom forest method. With boosting, such as in gra- were used in both random forest and gradient boost- dient boosting, one iteratively trains n learners, with ing methods, for a total of 16 preliminary models. The each iteration attempting to reduce prediction error. variety of training datasets was used to maximize model The predictions are summed for a final classification. diversity, a requirement for the proceeding ensemble Finally, a stacking generalizer refers to training a new models. learner, for example by logistic regression, on the out- Feature extraction and selection put of multiple learners. This is commonly referred to as Eleven features were chosen for use in model construction: meta-learner. This study used all three approaches to ensemble meth- 1 mRNA length ods, firstly by evaluating the lncRNA prediction accuracy 2 ORF length of individual stochastic gradient boosting and random 3 GC% forest models. These individual models were then also 4 Fickett score combined into four ensemble classifiers explained fur- 5 hexamer score ther in the proceeding sections: 1. Arithmetic mean of 6 alignment identity in SwissProt database 7 length of alignment in SwissProt database scores, 2. Geometric mean of scores, 3. Majority vote, 8 proportion of alignment length and mRNA length 4. Logistic regression meta-learner, and were evaluated (alignment length:mRNA length) similarly. 9 proportion of alignment length and ORF length (alignment length:ORF) Individual stochastic gradient boosting and random forest 10 presence of transposable element models 11 sequence percent divergence from transposable Data element Positive data remained constant in each training set and consisted of a total of 436 unique, validated Features were extracted using a combination of custom lncRNA sequences downloaded from two separate Python scripts and known software (CPAT usedfor lncRNA databases: 1. lncRNAdb v2.0 (http://lncrnadb.org) features 4 and 5, Diamond  used for features 6, 7, 8, 9, on November 25, 2016 and 2. lncRNAdisease (http:// RepeatMasker  used for features 10 and 11). www.cuilab.cn/lncrnadisease) on February 15, 2017. These sources for lncRNA sequences include all avail- CPAT model creation and application As no publicly able validated lncRNAs, but are heavily populated by available plant CPAT model exists, two logit models animal systems and include only six plant lncRNA were built using coding and non-protein coding RNA sequences. sequences from A. thaliana and O. sativa. Non-coding Negative data for each training set consisted of lncRNA, miRNA, snRNA, and snoRNA sequences from sequences from four different species: Homo sapiens, each species were downloaded from the Plant Non-coding A. thaliana, Mus musculus,and Oryza sativa. H. sapiens RNA Database on September 26, 2016 (A. thaliana, and M. musculus sequences were included in the negative 5062 sequences total) and July 14, 2017 (O. sativa, data of the training set as these species are the source for 4718 sequences total) . Protein coding transcript the majority of validated lncRNAs. H. sapiens sequences sequences from each species were downloaded from were downloaded from Ensembl (http://www.ensembl. Phytozome v11  on August 3, 2016. In order to org) on December 19, 2016, A. thaliana from Araport supply a balanced training set, 5938 A. thaliana and v11 (https://araport-dev.tacc.utexas.edu)onDecember 5283 O. sativa protein coding sequences were ran- 16, 2016, M. musculus from Ensembl on March 28, 2017 domly selected for a total of 11,000 A. thaliana tran- and O. sativa from Ensembl on March 28, 2017. These scripts and 10,000 O. sativa transcripts for CPAT model data are made available in Additional file 2.Toensure construction. that lncRNA, tRNAs, and rRNAs were removed from A. thaliana CPAT models were used for predictions in the negative training data, these types of sequences were all species but A. thaliana itself, which used O. sativa downloaded from RNAcentral v6 (http://rnacentral.org) CPAT models. Fickett and hexamer values from CPAT Simopoulos et al. BMC Genomics (2018) 19:316 Page 4 of 11 Table 1 Negative training data sets in individual models, and corresponding accuracy, sensitivity, specificity and AUC values Training dataset Negative data AUC Accuracy Specificity Sensitivity GB RF GB RF GB RF GB RF 1 3000 H. sapiens (set A) 0.940 0.943 0.962 0.956 0.988 0.990 0.548 0.404 1000 M. musculus (set A) 3000 O. sativa (set A) 2 3000 H. sapiens (set A) 0.943 0.944 0.960 0.953 0.988 0.989 0.576 0.461 3000 O. sativa (set A) 3 3000 H. sapiens (set A) 0.961 0.962 0.973 0.970 0.990 0.992 0.693 0.592 1000 M. musculus (set A) 3000 A. thaliana (set A) 4 3000 H. sapiens (set A) 0.962 0.966 0.972 0.967 0.990 0.990 0.725 0.640 3000 A. thaliana (set A) 5 3000 H. sapiens (set B) 0.955 0.959 0.965 0.958 0.991 0.980 0.608 0.530 3000 A. thaliana (set B) 6 4500 H. sapiens (set A + 1500 seq) 0.961 0.967 0.979 0.979 0.995 0.995 0.633 0.571 4500 A. thaliana (set A + 1500 seq) 7 3000 H. sapiens (set A) 0.963 0.967 0.976 0.971 0.993 0.992 0.700 0.603 4500 A. thaliana (set A + 1500 seq) 8 2000 H. sapiens (2000 from set A) 0.964 0.965 0.968 0.965 0.988 0.990 0.695 0.619 1000 M. musculus (set A) 3000 A. thaliana (set A) Training datasets of random forest (RF) and gradient boosting (GB) individual models are described. The positive training dataset, 436 validated lncRNAs, remained constant throughout all training datasets. Specificity, sensitivity, accuracy and AUC values were found using 10-fold cross validation of all training data results were used as features in machine learning model and random forest approaches, for a total of 16 models construction. differing in negative training data or machine learning algorithm (Table 1). All transposable element related fea- tures were removed after performing recursive feature Diamond alignment in SwissProt database Diamond elimination as they were found to be uninformative and v0.8.34  was used to quantify transcript sequence reduced the accuracy of models. With the 9 remaining fea- alignments to curated protein sequences in the SwissProt tures, a nested 4-fold cross-validation grid search was per- database  downloaded February 1, 2017 from http:// formed for 30 trials in gradient boosting hyper-parameter www.uniprot.org/downloads. We ran Diamond in “more- selection with possible hyper-parameters: sensitive” mode as we aligned full transcript sequences to the SwissProt database rather than RNASeq reads. learning_rate: 0.02, 0.04, 0.06, 0.08, 0.1 Options for each Diamond run were as follows: -e max_depth:4,6,8,10 0.001, -k 5, -matrix BLOSUM62, -gapopen 11, subsample: 0.2, 0.4, 0.6, 0.8, 1 -gapextend:1, -f 6 qseqid pident length qframe qstart n_estimators: 100, 500, 1000 qend sstart send evalue bitscore. Random forest hyper-parameters remained constant through all models with the only change from default RepeatMasker RepeatMasker  was used to extract parameters being n_estimators = 5000 and information on transcription element related features. min_samples_leaf = 20. The software was run on transcript sequences using Models were evaluated by sensitivity, specificity, accu- default settings, and with -species set to Eukaryota. racy area under the curve (AUC) values using 10-fold Stochastic gradient boosting and random forest model cross validation and the caret Rpackage . construction and hyper-parameter selection Once features were extracted, models were constructed Ensemble learner construction As gradient boosting and random forest models 1-8 were using Python’s scikit-learn package . Eight separate trained using eight different negative training sets, 3000 models were constructed using both gradient boosting Simopoulos et al. BMC Genomics (2018) 19:316 Page 5 of 11 randomly selected Zea mays protein coding sequences largest vote. The final prediction is made depending were used as negative data in the construction and/or on the value of the majority vote score. testing of each ensemble model for consistency through 4 Logistic regression models. Z. mays was chosen as no training set contained This meta learner is trained on a training dataset of sequences from this species and the genome is well anno- 3000 known Z. mays protein coding sequences as tated. Z. mays transcripts were downloaded from Ensem- negative data and the 10-fold cross validation blPlants on April 27, 2017. Two separate values were used prediction outputs of known lncRNAs as positive for the creation of each ensemble model – scores s and data. ij predictions p where i represents model number and j ij Voting, arithmetic mean, and geometric mean ensem- transcript. Scores can take any number between 0 and 1, ble models were evaluated by directly comparing scores while predictions are binary and indicate if the transcript of predictions to the known outcomes of validated lncR- wasorwasnotpredictedasalncRNA.Ascoregreater NAs and 3000 Z. mays protein coding sequences. The than or equal to 0.5 would indicate the transcript is pre- logistic regression stacking generalizer was evaluated by dicted as a lncRNA and would have a prediction value of 10-fold cross validation. Accuracy, sensitivity, specificity, 1. Ensemble models were constructed for random forest Matthews correlation coefficient (MCC), and AUC values and gradient boosting models separately in order to avoid were calculated using a custom R script and the R package potential correlation of predictions. The four ensemble caret . approaches included both algebraic combiners and vot- ing methods as non-trainable methods, and a stacking Comparison of predicted lncRNAs to GreeNC and generalizer as a meta-learner. annotation exploration The four ensemble methods are described as follows Transcript sequences of O. sativa and Eutrema salsug- and are illustrated in Fig. 1: ineum were downloaded from Phytozome v10.3 and 1 Arithmetic Mean A. thaliana from TAIR10 for direct comparison to n GreeNC. LncRNAs predictions by GreeNC of A. thaliana, s (1) O. sativa and E. salsugineum were downloaded on June 19, ij i=1 2017. Annotations from each species were downloaded from Phytozome v12, with extra A. thaliana annotation Where n = 8, the number of individual models downloaded from Araport v11. combined into the ensemble approach. The ensemble decision is made from taking the arithmetic mean of each score s from models 1-8 for ij Results each gene j. The arithmetic mean of scores will act as Individual random forest and stochastic gradient boosting a new ensemble score, and prediction will be made as model construction described previously. Feature selection 2 Geometric mean Researchers have proposed that specific characters in transcript sequences can be useful in lncRNA classi- fication. For example, lncRNAs can be translated into s (2) ij short peptides [11–13], however most validated lncR- i=1 NAs remain functional in their RNA form with little Where n = 8, the number of individual models protein coding potential. The potential for a transcript combined into the ensemble approach. The to be translated into a protein can be predicted by ensemble decision is made from taking the geometric codon bias, often measured by Fickett score, and hex- mean for each score s from models 1-8 for each ij amer usage bias . Mammalian lncRNAs are known to gene j. The geometric mean of scores will act as a have a lower GC content than protein-coding RNAs , new ensemble score, and prediction will be made as andthisfeature hasbeenusedasadefining featurefor described previously. A. thaliana lncRNA prediction in the past . Trans- 3 Majority vote posable elements (TEs) are also known to be sources for plant lncRNAs . Based on these studies, 11 features p (3) ij were originally chosen for use in lncRNA classification: i=1 mRNA length, ORF length, GC%, Fickett score, hexamer Where n = 8, the number of individual models score, alignment identity in SwissProt database, length combined into the ensemble approach. The of alignment in SwissProt database, proportion of align- ensemble decision depends only on final predictions ment length and mRNA length (alignment length:mRNA and is decided on which label (0 or 1) receives the length), proportion of alignment length and ORF length Simopoulos et al. BMC Genomics (2018) 19:316 Page 6 of 11 Fig. 1 Illustration of ensemble methods. An illustrative example of all four ensemble methods: arithmetic mean, geometric mean, majority vote and the stacking generalizer. Real examples from three different genes are given: gene A represents AT5G44470 a predicted protein, gene B represents At43G09922.1 IPS1 a known lncRNA, and gene C represents At2G18130.1 a known protein coding gene, AtPAP11. Note the final stacking generalizer score of gene B compared to the individual model scores for the gene (alignment length:ORF), presence of transposable ele- parameters; all options were left as default other than ment, and sequence percent divergence from transposable n_estimators=5000 and min_samples_leaf = 20. element. Using recursive feature elimination as described After training calibrated models, gradient boosting in the “Methods” section, features that related to trans- and random forest models were evaluated individu- posable elements were removed since inclusion of these ally by 10-fold cross validation by accuracy, specificity, features in classifiers decreased prediction accuracy and sensitivity and AUC measures for model validation thus were deemed uninformative for this training data. (Table 1). All models performed at or above accu- After feature elimination, nine features were chosen for racy, specificity and AUC measures of 0.94, how- implementation in individual random forest and gradi- ever, sensitivity values ranged from 0.40 to 0.725 ent boosting models: mRNA length, ORF length, GC%, (Table 1). Because of this wide range of sensitivity Fickett score, hexamer score, alignment identity, length of alignment, alignment length:mRNA length, and align- ment length:ORF. Table 2 Gradient boosting hyper-parameters chosen by grid search for each model Individual model configuration and model evaluation Gradient boosting and random forest models were con- GB Model # Learning rate Maxdepth Subsample n estimators structed using eight different negative training datasets for 1 0.04 10 0.6 100 a total of sixteen models (Table 1). Empirically validated 2 0.04 10 0.6 100 lncRNA transcripts were downloaded from databases 3 0.04 10 0.6 100 as described in “Methods” section. To ensure optimal 4 0.02 8 0.6 100 performance of each gradient boosting classifier, proper 5 0.02 10 0.6 100 calibration of multiple hyper-parameters is required. As such, hyper-parameter tuning (learning_rate, 6 0.02 10 0.6 100 max_depth, subsample,and n_estimators)for 7 0.04 10 0.6 100 each gradient boosting model was completed by grid 8 0.04 10 0.6 100 search and 30 iterations of 4-fold nested cross valida- Hyper-parameters were chosen by grid search using 30 iterations of 4-fold nested tion with results summarized in Table 2.All random cross validation. The given hyper-parameters corresponded to models with the forest models were constructed with the same hyper- highest accuracy values of all given hyper-parameter combinations Simopoulos et al. BMC Genomics (2018) 19:316 Page 7 of 11 values, four alternative ensemble approaches using com- (Table 3). Using these values as methods of evaluation, bined random forest and gradient boosting models were the stacking model constructed from gradient boost- explored. ing model outputs was found to be the best perform- ing model and was used for the remainder of the Ensemble classifier construction study. To take advantage of the predictive strengths of each random forest and gradient boosting model, ensemble Comparison of meta-learner to GreeNC predictions learners for all random forest and all gradient boosting To assess the overlap of predictions to another plant models were constructed. As ensemble classifiers function lncRNA resource, the lncRNAs predicted by the stack- by combining “diverse” learners , only models con- ing generalizer were compared to an established lncRNA structed from different training sets were used in each database, GreeNC . This database uses a transcript fil- ensemble classifier to maintain diversity in predictors. tering method, rather than a machine learning approach, In other words, ensemble classifiers were constructed where transcripts must meet the criteria of a classic from all eight random forest models and a separate set lncRNA in order to be identified as putative lncRNAs. of ensemble classifiers were constructed from all eight To be considered a lncRNA in the GreeNC database, gradient boosting models. the transcript must: be larger than 200nt, have an ORF Four types of ensemble classifiers were constructed: a smaller than 120aa, not have a hit in the SwissProt majority vote model, arithmetic means of scores model, database or be considered non-coding by the Coding geometric means of scores model, and a stacking ensem- Potential Calculator , and not be already classi- ble model constructed from a logistic regression of model fied as another class of functional RNA as identified outputs (Fig. 1 and “Methods” section for details). by Rfam. A final training set comprised of 3000 known Z. mays Transcript sequences of O. sativa,and E. salsugineum protein coding genes and validated lncRNAs was cre- were downloaded from Phytozome v10.3 and A. thaliana ated. This Z. mays training data set was used for training sequences from TAIR10 to enable direct comparison to the logistic regression classifier because random forest the GreeNC protocol. In total, 1310, 856 and 198 lncR- and gradient boosting models were trained on differ- NAs were predicted from A. thaliana, O. sativa,and ent data sets (see “Methods” section). For consistency, E. salsugineum respectively, of which 872 (66.6%), 444 all four ensemble methods were also evaluated using (51.9%), and 164 (82.8%) have been previously predicted these data. The arithmetic mean, geometric mean, and by GreeNC (Fig. 2). Comparing number of predicted majority vote methods were evaluated by comparing lncRNAs using this method to GreeNC, 1700, 4381, ensemble method outputs to true labels, and 10-fold and 1471 fewer lncRNAs are identified in A. thaliana, cross validation scores were used to evaluate the logis- O. sativa and E. salsugineum using the stacking method. tic regression stacking model. Accuracy, specificity, and Another 438, 412 and 34 putative lncRNAs were iden- AUC values were similar for all ensemble approaches; tified using the stacking learner that have not been therefore, the best performing ensemble method was predicted by GreeNC in A. thaliana, O. sativa,and largely determined by both sensitivity and MCC measures E. salsugineum. Table 3 Evaluation measures of random forest (RF) and gradient boosting (GB) ensemble models ML model type Ensemble type AUC MCC Accuracy Sensitivity Specificity RF Vote 0.834 0.725 0.944 0.594 0.995 Arithmetic mean 0.963 0.661 0.941 0.562 0.996 Geometric mean 0.963 0.706 0.941 0.555 0.997 Logistic regression 0.835 0.765 0.952 0.665 0.994 GB Vote 0.887 0.797 0.958 0.702 0.995 Arithmetic mean 0.945 0.786 0.956 0.681 0.996 Geometric mean 0.940 0.750 0.949 0.601 0.999 Logistic regression 0.883 0.822 0.963 0.745 0.994 Statistics for vote, arithmetic mean, and geometric mean models were calculated using outputs of models compared to true labels. Logistic regression evaluation statistics were calculated using the scores found by 10-fold cross validation of O. sativa training data and validated lncRNA sequences Simopoulos et al. BMC Genomics (2018) 19:316 Page 8 of 11 lncRNAs from A. thaliana, 417 remain unannotated, with only seven predicted as potential proteins. Discussion Our approach to lncRNA prediction by stacking with logistic regression allows researchers to combine the strengths of various machine learning models with- out restricting predictions to arbitrary feature cutoffs of a classic lncRNA definition. The flexible nature of this lncRNA prediction tool allows the model to be updated when additional lncRNAs are validated, helping researchers focus on empirical validation of plant lncRNA transcripts. As lncRNA research has previously primar- ily focused on animal systems with a large emphasis on humans and mice, this tools’ training sets may have a human and mouse bias that is present out of necessity. Fig. 2 Counts of predicted lncRNAs in A. thaliana, E. salsugineum and When more plant lncRNAs are added to the tool’s train- O. sativa from the gradient boosting stacking generalizer method and ing set, the human and mouse lncRNA bias that may be GreeNC database. Counts of predicted lncRNAs in this work from all found in the model will be reduced. Acting as positive three species were also compared to predictions recorded in GreeNC. feedback, as more plant lncRNAs are added to the model, Overlapping predictions of the two methods are represented as shaded bars. The percentages above each bar represent the percent the predictions themselves will improve. of the total predictions by each method that are shared To help researchers choose the best lncRNAs for vali- dation, the predictions are ranked. While softwares that rank lncRNA predictions, such as COME , do exist, they are trained on a majority of non-empirically validated transcripts adding a potential bias towards non func- Current annotation of top ranking lncRNAs in A. thaliana, tional transcripts. A combination of ranked predictions E. salsugineum, and O. sativa and models trained only on true lncRNAs will help ensure Using the prediction scoring system of this stacking researchers focus on the most likely functional lncRNAs method, the current annotation of the highest ranking A lower number of identified lncRNAs in compari- lncRNAs from each species was explored. Due to the son to other prediction methods, such as GreeNC, was nature of a logistic regression-type ensemble method, expected. Using a machine learning classification method, transcripts with similar features will have identical predic- lncRNA predictions were not constrained to arbitrary tion scores. As such, multiple prediction score ties exist in criteria for this RNA classification. Instead, the classi- the top ranking transcripts of each species (See Additional fiers were trained on validated lncRNAs and are expected file 3 for distribution of lncRNA scores). Using a cut- to identify only true functional lncRNA transcripts. In off of the top three unique prediction scores, annotations other words, although transcripts were subjected to less of 256, 17 and 94 transcripts in A. thaliana, E. salsug- rules for lncRNA identification, the stacking method is ineum,and O. sativa were identified as “top scoring” due expected to have higher accuracy. Further, this work to these multiple ties. The majority of predicted lncRNAs in A. thaliana were annotated by TAIR as potential nat- ural antisense lncRNAs, pseudogenes, and transposable element related genes (Table 4). Only one transcript from Table 4 Number of transcripts in annotation categories of top ranking lncRNAs in the A. thaliana transcriptome E. salsugineum’s top predictions, and two transcripts from O. sativa’s top predictions have annotation in Phytozome Annotation category Number of annotations v12. Natural antisense lncRNA 64 Pseudogene 75 Novel lncRNAs identified by the stacking generalizer Transposable element gene 10 Annotation of the predicted lncRNAs not previously iden- tified by GreeNC from all three species were explored. Transposase 46 While all of the newly predicted lncRNAs from E. sal- miRNA primary transcript 4 sugineum and O. sativa were annotated as homologs of Hypothetical protein 5 A. thaliana genes, 10 of 34 novel lncRNAs from E. salsug- Protein 8 ineum and 11 of 412 novel lncRNAs from O. sativa were Other 8 annotated specifically as proteins. Of the newly predicted Simopoulos et al. BMC Genomics (2018) 19:316 Page 9 of 11 was tested only on sequence information available from model creation indicates that only 19 of the 436 (4.4%) val- Phytozome v10.3 in order to compare predictions directly idated lncRNAs show evidence of transposable element to GreeNC. Additional transcript sequences available in association. Of this minor group of transposable ele- public repositories, or from researchers’ own sequencing ment associated lncRNAs, none were from plant species. libraries, would add to the number of putative lncR- Nonetheless, the tool did not favour lncRNAs that are NAs and could be used to improve accuracy. Moreover, not associated with transposable elements, as the tool COOLAIR and COLDAIR, known A. thaliana lncRNAs, remained successful at identifying these types of tran- are not predicted by GreeNC because the database relies scripts. Additionally, as novel lncRNAs are validated and on transcript sequences provided by Phytozome and these added to this tool, an update to the models’ feature selec- transcript sequences were not available in the database at tion step may be required, and may lead to future inclusion the time of prediction. Our stacking generalizer method of transposable element associated characters. However, for lncRNA prediction is not restricted to a single data by not including transposable element information, the source, and allows researchers to calculate a lncRNA score computational time for data preprocessing before tran- from any transcript sequence, not solely those available script classification is significantly reduced to minutes from an online repository. from days as RepeatMasker is no longer needed. While we expect a lower number of putative lncRNAs Features of secondary RNA structure have previously than other protocols, of interest is the lower proportion been used in other RNA classifiers, such as nRC and of predicted lncRNAs E. salsugineum genome compared GraPPLE , that are used to classify RNAs into func- to O. sativa or A. thaliana. A reason for the low lncRNA tional categories. These classifications include RNAs such discovery rate in E. salsugineum, could potentially be as miRNAs, tRNAs, rRNA, ribozymes, and riboswitch that plants were not subjected to conditions sufficient domains, all of which have conserved secondary struc- for observable lncRNA expression. For example, IPS1  tures. Rather than using sequence homology, commonly and COLDAIR , two well studied A. thaliana lncRNAs, used with protein coding genes, structural homology has are induced by phosphate or cold-related stresses respec- previously been used in lncRNA functional prediction, tively. This hypothesis is supported by Derrien et al.  and identification . However, a lack of secondary struc- who found human lncRNA expression to be at low lev- ture conservation in animal lncRNAs with conserved els in a condition, tissue and developmental state specific sequences (e. g. HOTAIR, ncSRA and Xist)was recently manner. It is also possible that there exists natural vari- observed . As structural conservation may not be as ation in the numbers of putative lncRNAs in different pervasiveinlncRNAclassification as previously thought, species. Further investigation on the number of putative we did not include structural features in our ensemble lncRNA and their relationships to plant growth conditions learner. A lack of structural features allows the predictor for transcriptome sequencing of multiple plant species is to identify a wide variety of lncRNAs and does not limit currently underway. the predictor to the structures of the small number of Although the quantity of detected lncRNAs was low in validated plant lncRNAs available. An additional test was E. salsugineum, the quality of putative lncRNAs in all three completed to ensure our predictor, lacking structural fea- species is high, demonstrating that this tool can accurately tures, did not merely distinguish non-coding transcripts classify transcripts no matter size or quality of input tran- from protein coding genes. By comparing the results of the script sequence data. When exploring the annotations of ensemble learner to predicted CPAT protein coding prob- the top scoring predictions in A. thaliana, the majority abilities , our ensemble method was able distinguish of transcripts were annotated as potential natural anti- between other CPAT-predicted non-coding transcripts sense lncRNA, pseudogenes, transposable elements, small and likely lncRNAs (Additional file 4:Table S2). Apor- RNA primary transcripts, or remain computationally pre- tion of putative lncRNAs in all three plant species are dicted as hypothetical proteins (Table 4). Pseudogenes also predicted to be protein coding and may encode small remain poorly understood, however there is evidence of regulatory peptides. pseudogene derived lncRNAs regulating their parental High quality lncRNA predictions from this method genes , making pseudogene derived lncRNAs targets require sequences from fully processed transcripts and of potential regulatory interest. Transposable elements are cannot be predicted directly from genomic sequences. another known source of lncRNAs, particularly in verte- Nevertheless, potential lncRNA sequences of interest are brates  and long intergenic non-protein coding RNAs typically more accessible by transcriptome sequencing in plants . This study did not find evidence that fea- rather than complete genome sequencing, which remains tures related to transposable elements were helpful at technically challenging for crop plants with large and/or predicting plant lncRNAs as the addition of transposable polyploid genomes. This tool is flexible and can be used related features decreased the quality of lncRNA predic- to identify lncRNAs from all transcriptional units of an tions. However, exploration of the training data used for organism, or to check the lncRNA score of a single Simopoulos et al. BMC Genomics (2018) 19:316 Page 10 of 11 transcript. Furthermore, as mentioned in their summary, Funding This work was supported by grants from the Ontario Research Fund-Research Kang et al.  suggest that researchers should now con- Excellence and Natural Science and Engineering Research Council of Canada sider working on uncovering the biological implications of to EAW (RGPIN-2015-06530) and to GBG (RGPIN-2015-04477). The funding lncRNAs rather than solely using computational tools for bodies supported the publication costs of this manuscript. transcript classification. We agree that future work should Availability of data and materials centre around using software to also further knowledge The first version of code used in this work is available at www.github.com/ on these types of transcripts. Due to the diversity of gbgolding/crema. Random protein and lncRNA datasets used for machine these transcripts, there is increasing need for classifica- learning model training are attached as supplementary material. tion of lncRNAs into categories based on mechanism and Authors’ contributions function, as well as continuation of empirical validation, The idea of this article was conceived by all authors. CMAS developed the tool, particularly for plants. Once validated, not only can novel acquired the datasets, obtained and analyzed the results, and prepared the lncRNAs mechanisms be explored, but their features can manuscript. GBG and EAW supervised the analysis, and edited the manuscript. All authors have read and approved the manuscript. be added to this tool for further improvement in lncRNA prediction. Ethics approval and consent to participate Not applicable as this study did not use plant or animal material, and instead Conclusion genomic and transcriptomic data available from various databases. For this machine learning based tool for lncRNA predic- Competing interests tion, we have used only empirically validated lncRNAs for The authors declare that they have no competing interests. training. Although lncRNAs from multiple species were Publisher’s Note used, our tool identified putative plant lncRNAs with high Springer Nature remains neutral with regard to jurisdictional claims in scores. Ranking of lncRNA predictions should improve published maps and institutional affiliations. the confidence by which gene products meriting valida- Received: 23 November 2017 Accepted: 12 April 2018 tion are selected for empirical testing. The machine learn- ing structure and its open source availability allows for the flexible inclusion of validated lncRNAs as our knowledge References of this class of RNA improves. An important considera- 1. Kung JT, Colognori D, Lee JT. Long noncoding RNAs: past, present, and future. Genetics. 2013;193:651–9. https://doi.org/10.1534/genetics.112. tion of this tool is that it is not constrained by precon- ceived rules that may or may not appropriately classify 2. Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, lncRNA properties. As Kung et al.  suggest, setting rules Stadler PF, Hertel J, Hackermüller J, Hofacker IL, Bell I, Cheung E, Drenkow J, Dumais E, Patel S, Helt G, Ganesh M, Ghosh S, Piccolboni A, for the detection of these non-conforming transcripts Sementchenko V, Tammana H, Gingeras TR. RNA maps reveal new RNA could be detrimental due to the diversity in functional- classes and a possible function for pervasive transcription. Science. ity, structure, expression and mechanism of these tran- 2007;316:1484–8. https://doi.org/10.1126/science.1138341. scripts. Accordingly, our stacking generalizer model based 3. Wang D, Qu Z, Yang L, Zhang Q, Liu ZH, Do T, Adelson DL, Wang ZY, Searle I, Zhu JK. Transposable elements (TEs) contribute to stress-related on gradient boosting models will facilitate lncRNA iden- long intergenic noncoding RNAs in plants. Plant J. 2017;90:133–46. tification without imposing arbitrary rules for lncRNA https://doi.org/10.1111/tpj.13481. detection. 4. Xu Q, Song Z, Zhu C, Tao C, Kang L, Liu W, He F, Yan J, Sang T. Systematic comparison of lncRNAs with protein coding mRNAs in population expression and their response to environmental change. BMC Plant Biol. 2017;17:42. https://doi.org/10.1186/s12870-017-0984-8. Additional files 5. Hezroni H, Koppstein D, Schwartz MG, Avrutin A, Bartel DP, Ulitsky I. Principles of long noncoding RNA evolution derived from direct comparison of transcriptomes in 17 species. Cell Rep. 2015;11:1110–22. Additional file 1: Non-coding RNA search terms. Terms used to search for https://doi.org/10.1016/j.celrep.2015.04.023. organism specific non-coding sequences on RNA central. (ZIP 16500 kb) 6. Jeon Y, Lee JT. YY1 tethers Xist RNA to the inactive X nucleation center. Additional file 2: Random protein training data sets, lncRNA data sets. Cell. 2011;146:119–33. https://doi.org/10.1016/j.cell.2011.06.026. Fasta files of protein coding and lncRNA sequences in data sets used for 7. Zhao J, Sun BK, Erwin JA, Song JJ, Lee JT. Polycomb proteins targeted training machine learning classifiers. (TXT 0.07725 kb) by a short repeat RNA to the mouse X chromosome. Science. 2008;322: 750–6. https://doi.org/10.1126/science.1163045. Additional file 3: Distribution of predicted lncRNA scores. Figure and 8. Franco-Zorrilla JM, Valli A, Todesco M, Mateos I, Puga MI, table of distribution of scores. (PDF 58 kb) Rubio-Somoza I, Leyva A, Weigel D, Garcia JA, Paz-Ares J. Target Additional file 4: Comparison of predicted lncRNAs to CPAT results. Table mimicry provides a new mechanism for regulation of microRNA activity. of results and explanation of additional test. (PDF 31 kb) Nat Genet. 2007;39:1033–7. https://doi.org/10.1038/ng2079. 9. He C, Huang H, Xu L. Mechanisms guiding Polycomb activities during Abbreviations gene silencing in Arabidopsis thaliana. Front Plant Sci. 2013;4:454. https:// AUC: Area under the curve; lncRNA: Long non-protein coding RNA; MCC: doi.org/10.3389/fpls.2013.00454. Matthews correlation coefficient; ORF: Open reading frame; TE: Transposable 10. Ma L, Bajic VB, Zhang Z. On the classification of long non-coding RNAs. element RNA Biol. 2013;10:925–33. https://doi.org/10.4161/rna.24604. 11. Anderson DM, Anderson KM, Chang CL, Makarewich CA, Nelson BR, Acknowledgements McAnally JR, Kasaragod P, Shelton JM, Liou J, Bassel-Duby R, Olson EN. We thank two anonymous reviewers for their helpful questions and comments. A micropeptide encoded by a putative long noncoding RNA regulates Simopoulos et al. BMC Genomics (2018) 19:316 Page 11 of 11 muscle performance. Cell. 2015;160:595–606. https://doi.org/10.1016/j. 33. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, cell.2015.01.009. Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, 12. Ji Z, Song R, Regev A, Struhl K. Many lncRNAs, 5’UTRs, and pseudogenes Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: machine are translated and some are likely to express functional proteins. Elife. learning in Python. J Mach Learn Res. 2011;12:2825–30. 2015;4:08890. https://doi.org/10.7554/eLife.08890. 34. Jed Wing MKC, Weston S, Williams A, Keefer C, Engelhardt A, Cooper T, 13. Juntawong P, Girke T, Bazin J, Bailey-Serres J. Translational dynamics Mayer Z, Kenkel B, The R Core Team, Benesty M, Lescarbeau R, Ziem A, revealed by genome-wide profiling of ribosome footprints in Arabidopsis. Scrucca L, Tang Y, Candan C, Hunt T. Caret: Classification and Regression Proc Natl Acad Sci U S A. 2014;111:203–12. https://doi.org/10.1073/pnas. Training. 2017. R package version 6.0-76. https://CRAN.R-project.org/ 1317811111. package=caret. Accessed 1 Feb 2018. 14. Guttman M, Russell P, Ingolia NT, Weissman JS, Lander ES. Ribosome 35. Niazi F, Valadkhan S. Computational analysis of functional long profiling provides evidence that large noncoding RNAs do not encode noncoding RNAs reveals lack of peptide-coding capacity and parallels with proteins. Cell. 2013;154:240–51. https://doi.org/10.1016/j.cell.2013.06.009. 3’ UTRs. RNA. 2012;18:825–43. https://doi.org/10.1261/rna.029520.111. 15. DeChiara TM, Brosius J. Neural BC1 RNA: cDNA clones reveal nonrepetitive 36. Di C, Yuan J, Wu Y, Li J, Lin H, Hu L, Zhang T, Qi Y, Gerstein MB, Guo Y, sequence content. Proc Natl Acad Sci U S A. 1987;84:2624–8. Lu ZJ. Characterization of stress-responsive lncRNAs in Arabidopsis 16. Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, Guernec G, thaliana by integrating expression, epigenetic and structural features. Martin D, Merkel A, Knowles DG, Lagarde J, Veeravalli L, Ruan X, Ruan Y, Plant J. 2014;80:848–61. https://doi.org/10.1111/tpj.12679. Lassmann T, Carninci P, Brown JB, Lipovich L, Gonzalez JM, Thomas M, 37. Brown G, Wyatt J, Harris R, Yao X. Diversity creation methods: a survey Davis CA, Shiekhattar R, Gingeras TR, Hubbard TJ, Notredame C, Harrow J, and categorisation. Inf Fusion. 2005. https://doi.org/10.1016/j.inffus.2004. Guigo R. The GENCODE v7 catalog of human long noncoding RNAs: 04.004. analysis of their gene structure, evolution, and expression. Genome Res. 38. PaytuviGallart A, HermosoPulido A, AnzarMartinezdeLagran I, 2012;22:1775–89. https://doi.org/10.1101/gr.132159.111. Sanseverino W, AieseCigliano R. GREENC: a Wiki-based database of plant 17. Li A, Zhang J, Zhou Z. PLEK: a tool for predicting long non-coding RNAs lncRNAs. Nucleic Acids Res. 2016;44:1161–6. https://doi.org/10.1093/nar/ and messenger RNAs based on an improved k-mer scheme. BMC gkv1215. Bioinformatics. 2014;15:311. https://doi.org/10.1186/1471-2105-15-311. 39. Kong L, Zhang Y, Ye ZQ, Liu XQ, Zhao SQ, Wei L, Gao G. CPC: assess the 18. Sun L, Liu H, Zhang L, Meng J. lncRScan-SVM: A Tool for Predicting Long protein-coding potential of transcripts using sequence features and Non-Coding RNAs Using Support Vector Machine. PLoS ONE. 2015;10: support vector machine. Nucleic Acids Res. 2007;35:345–9. https://doi. 0139654. https://doi.org/10.1371/journal.pone.0139654. org/10.1093/nar/gkm391. 19. Hu L, Xu Z, Hu B, Lu ZJ. COME: a robust coding potential calculation tool 40. Milligan MJ, Lipovich L. Pseudogene-derived lncRNAs: emerging for lncRNA identification and characterization based on multiple features. regulators of gene expression. Front Genet. 2014;5:476. https://doi.org/ Nucleic Acids Res. 2017;45:2. https://doi.org/10.1093/nar/gkw798. 10.3389/fgene.2014.00476. 20. Struhl K. Transcriptional noise and the fidelity of initiation by RNA 41. Kapusta A, Kronenberg Z, Lynch VJ, Zhuo X, Ramsay L, Bourque G, polymerase II. Nat Struct Mol Biol. 2007;14:103–5. https://doi.org/10.1038/ Yandell M, Feschotte C. Transposable elements are major contributors to nsmb0207-103. the origin, diversification, and regulation of vertebrate long noncoding 21. Wang L, Park HJ, Dasari S, Wang S, Kocher JP, Li W. CPAT: Coding- RNAs. PLoS Genet. 2013;9:1003470. https://doi.org/10.1371/journal.pgen. Potential Assessment Tool using an alignment-free logistic regression 1003470. model. Nucleic Acids Res. 2013;41:74. https://doi.org/10.1093/nar/gkt006. 42. Fiannaca A, LaRosa M, LaPaglia L, Rizzo R, Urso A. nRC: non-coding RNA 22. Kang YJ, Yang DC, Kong L, Hou M, Meng YQ, Wei L, Gao G. CPC2: a fast Classifier based on structural features. BioData Min. 2017;10:27. https:// and accurate coding potential calculator based on sequence intrinsic doi.org/10.1186/s13040-017-0148-2. features. Nucleic Acids Res. 2017. https://doi.org/10.1093/nar/gkx428. 43. Childs L, Nikoloski Z, May P, Walther D. Identification and classification of 23. Axtell MJ, Westholm JO, Lai EC. Vive la difference: biogenesis and ncRNA molecules using graph properties. Nucleic Acids Res. 2009;37:66. evolution of microRNAs in plants and animals. Genome Biol. 2011;12:221. https://doi.org/10.1093/nar/gkp206. https://doi.org/10.1186/gb-2011-12-4-221. 44. Rivas E, Clements J, Eddy SR. A statistical test for conserved RNA 24. Volders PJ, Helsens K, Wang X, Menten B, Martens L, Gevaert K, structure shows lack of evidence for structure in lncRNAs. Nat Methods. Vandesompele J, Mestdagh P. LNCipedia: a database for annotated 2017;14:45–8. https://doi.org/10.1038/nmeth.4066. human lncRNA transcript sequences and structures. Nucleic Acids Res. 2013;41:246–51. https://doi.org/10.1093/nar/gks915. 25. Zhao Y, Li H, Fang S, Kang Y, Wu W, Hao Y, Li Z, Bu D, Sun N, Zhang MQ, Chen R. NONCODE 2016: an informative and valuable data source of long non-coding RNAs. Nucleic Acids Res. 2016;44:203–8. https://doi.org/10. 1093/nar/gkv1252. 26. Liu B, Wang S, Long R, Chou KC. iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics. 2017;33:35–41. https://doi.org/10.1093/bioinformatics/btw539. 27. You ZH, Lei YK, Zhu L, Xia J, Wang B. Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC Bioinformatics. 2013;14 Suppl 8:10. https://doi.org/10.1186/1471-2105-14-S8-S10. 28. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12:59–60. https://doi.org/10.1038/nmeth. 29. Smit AFA, Hubley R, Green P. Repeatmasker open-4.0. 2015. http://www. repeatmasker.org. 30. Yi X, Zhang Z, Ling Y, Xu W, Su Z. PNRD: a plant non-coding RNA database. Nucleic Acids Res. 2015;43:982–9. https://doi.org/10.1093/nar/ gku1162. 31. Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N, Rokhsar DS. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 2012;40:1178–86. https://doi.org/10.1093/nar/gkr944. 32. Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000;28:45–8.
– Springer Journals
Published: May 2, 2018