
PlantMiRNAPred: efficient classification of real and pseudo plant pre-miRNAs

Vol. 27 no. 10 2011, pages 1368-1376
doi:10.1093/bioinformatics/btr153
BIOINFORMATICS ORIGINAL PAPER: Structural bioinformatics
Advance Access publication March 26, 2011

Ping Xuan(1), Maozu Guo(1,*), Xiaoyan Liu(1), Yangchao Huang(1), Wenbin Li(2) and Yufei Huang(3,*)

(1) Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin 150001, (2) Soybean Research Institute (Key Laboratory of Soybean Biology of Chinese Education Ministry), Northeast Agricultural University, Harbin 150030, P.R. China and (3) Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, TX 78249-0669, USA

*To whom correspondence should be addressed.

Associate Editor: Ivo Hofacker

ABSTRACT

Motivation: MicroRNAs (miRNAs) are a set of short (21-24 nt) non-coding RNAs that play significant roles as post-transcriptional regulators in animals and plants. While some existing methods use comparative genomic approaches to identify plant precursor miRNAs (pre-miRNAs), others are based on the complementarity characteristics between miRNAs and their target mRNA sequences. However, they can only identify the homologous miRNAs or the limited complementary miRNAs. Furthermore, since the plant pre-miRNAs are quite different from the animal pre-miRNAs, the ab initio methods developed for animals cannot be applied to plants. Therefore, it is essential to develop a machine learning method to classify real plant pre-miRNAs and pseudo genome hairpins.

Results: A novel classification method based on the support vector machine (SVM) is proposed specifically for predicting plant pre-miRNAs. To make the prediction efficient, we extract pseudo hairpin sequences from the protein coding sequences of Arabidopsis thaliana and Glycine max, respectively; these pseudo pre-miRNAs are extracted in this study for the first time. A set of informative features is selected to improve the classification accuracy, and the training samples are selected according to their distributions in the high-dimensional sample space. Our classifier PlantMiRNAPred achieves >90% accuracy on the plant datasets from eight plant species, including A.thaliana, Oryza sativa, Populus trichocarpa, Physcomitrella patens, Medicago truncatula, Sorghum bicolor, Zea mays and G.max. The superior performance of the proposed classifier can be attributed to the extracted plant pseudo pre-miRNAs, the selected training dataset and the carefully selected features. The ability of PlantMiRNAPred to discern real and pseudo pre-miRNAs provides a viable method for discovering new non-homologous plant pre-miRNAs.

Availability: The web service of PlantMiRNAPred, the training datasets, the testing datasets and the selected features are freely available at http://nclab.hit.edu.cn/PlantMiRNAPred/.

Contact: maozuguo@hit.edu.cn; yufei.huang@utsa.edu

Received on October 29, 2010; revised on February 28, 2011; accepted on March 23, 2011

© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

1 INTRODUCTION

Derived from hairpin precursors (pre-miRNAs), mature microRNAs (miRNAs) are non-coding RNAs that play important roles in gene regulation by targeting mRNAs for cleavage or translational repression (Bartel, 2004; Chatterjee and Grosshans, 2009). MiRNAs are involved in many important biological processes including plant development, signal transduction and protein degradation (Zhang et al., 2006b).

However, systematically detecting miRNAs from a genome by experimental techniques is difficult (Bartel, 2004; Berezikov et al., 2006). As an alternative, computational prediction methods are used to analyze the genomic DNA and to obtain putative candidates for experimental verification. Since many miRNAs are evolutionarily conserved in multiple species, methods that use comparative genomics to identify putative miRNAs have been presented. MirFinder (Bonnet et al., 2004) predicted potential miRNAs in the Arabidopsis thaliana genome; it is based on the conservation of short sequences between the genomes of Arabidopsis and Oryza sativa. MicroHARVESTER (Dezulian et al., 2006) can identify candidate plant miRNA homologs for a given query miRNA; the approach is based on a sequence similarity search step followed by a set of structural filters. The miRNAs of Vigna unguiculata were identified through homology alignment of highly conserved miRNAs in multiple species (Lu and Yang, 2010). Although these methods are effective in identifying conserved miRNAs, they cannot discover novel miRNAs with less homology. On the other hand, since miRNAs in plants often have near perfect matches to their target mRNAs, methods based on these complementarity characteristics have also been proposed. MIRcheck (Jones-Rhoades and Bartel, 2004) searches for new miRNAs whose target sites are conserved in A.thaliana and O.sativa. FindMiRNA (Adai et al., 2005) predicts potential Arabidopsis miRNAs from candidate pre-miRNAs that have corresponding target sites within transcripts; to decrease the number of miRNA candidates, the predicted candidates are required to have orthologs in rice. MiMatcher (Lindow and Krogh, 2005) also predicts plant miRNAs by exploiting the complementarity of plant miRNAs. However, since the candidates that are complementary to target mRNAs are enormous within intergenic regions and introns, rigorous criteria such as conservation among multiple species are introduced to significantly reduce the number of candidates.
This practice also considerably reduces the chance of discovering new miRNAs. As an alternative, ab initio methods have been developed to distinguish real pre-miRNAs from pseudo pre-miRNAs. The real and pseudo pre-miRNAs are used to train classification models including support vector machines (SVM) (Batuwita and Palade, 2009; Ng and Mishra, 2007; Sewer et al., 2005; Xue et al., 2005), a probabilistic co-learning model (Nam et al., 2005), naive Bayes (Yousef et al., 2006), random forest (Jiang et al., 2007) and kernel density estimation (Chang et al., 2008). The trained models can then be used to classify real pre-miRNAs from pseudo pre-miRNAs. However, they are all trained on human pre-miRNAs and pseudo pre-miRNAs, and are mainly used to identify human pre-miRNAs. Triplet-SVM (Xue et al., 2005) is the only one that has been tested on the pre-miRNAs from A.thaliana and O.sativa; as the plant pre-miRNAs differ greatly from the animal pre-miRNAs, triplet-SVM cannot achieve high classification accuracy for plant species including A.thaliana and O.sativa.

Since nearly all the pre-miRNAs in plants and animals have stem-loop hairpin structures, this characteristic is widely used as an important feature in the ab initio methods. However, the plant pre-miRNAs usually have more complex secondary structures than the animal pre-miRNAs, and these structures have not been considered in existing methods. We propose a computational classification approach that considers the unique characteristics of plant pre-miRNAs. To construct a comprehensive training dataset, new pseudo plant pre-miRNAs are extracted from the protein coding regions of the A.thaliana and Glycine max genomes, based on which an efficient SVM classifier is constructed to classify real/pseudo plant pre-miRNAs.

Fig. 1. Arabidopsis pre-miRNA ath-miR166c and human pre-miRNA hsa-mir-95. Their secondary structures are obtained from miRBase. The nucleotides of the mature miRNA are displayed in bold.

2 METHODS

2.1 Features of plant pre-miRNAs

Recent research indicates that pre-miRNAs in animals and plants have many features in both primary sequence and secondary structure. These features can be used to classify real plant pre-miRNAs and pseudo hairpins with an ab initio method.

miPred (Ng and Mishra, 2007) extracted 29 global and intrinsic folding features from human real and pseudo pre-miRNAs. These features include: (i) 17 base composition variables, i.e. the 16 dinucleotide frequencies %XY, where X,Y ∈ {A,C,G,U}, and the %G+C content; (ii) six folding measures: adjusted base pairing propensity dP (Schultes et al., 1999), adjusted minimum free energy (MFE) of folding dG (Freyhult et al., 2005; Seffens and Digby, 1999), adjusted base pair distance dD (Freyhult et al., 2005; Moulton et al., 2000), adjusted Shannon entropy dQ (Freyhult et al., 2005), MFE Index 1 (MFEI1; Zhang et al., 2006a) and MFE Index 2 (MFEI2); (iii) one topological descriptor, the degree of compactness dF (Gan et al., 2004); and (iv) five normalized variants of dP, dG, dQ, dD and dF: zP, zG, zQ, zD and zF.

In addition to the 29 features listed above, microPred (Batuwita and Palade, 2009) extracted 19 new features: (i) two MFE-related features, MFE Index 3 (MFEI3) and MFE Index 4 (MFEI4); (ii) four RNAfold-related features: the normalized ensemble free energy (NEFE), the frequency of the MFE structure Freq, the structural diversity Diversity and a combined feature Diff; (iii) six thermodynamical features: structure entropy dS and dS/L, structure enthalpy dH and dH/L, and melting energy of the structure Tm and Tm/L, where L is the length of the pre-miRNA sequence; and (iv) seven base pair-related features: |A-U|/L, |G-C|/L, |G-U|/L, average base pairs per stem Avg_BP_Stem, %(A-U)/n_stems, %(G-C)/n_stems and %(G-U)/n_stems, where n_stems is the number of stems in the secondary structure.

The plant pre-miRNAs have more diversity than the animal pre-miRNAs. First, the pre-miRNAs in animals are typically 60-80 nt (Ambros et al., 2003), whereas the length of pre-miRNAs in plants ranges from 60 to >400 nt (Smalheiser and Torvik, 2005). Secondly, the molecular characteristics of plant pre-miRNAs are also different from those of animal pre-miRNAs: the former have great variety in secondary structure. For instance, like Homo sapiens miR-95, the central loops of the pre-miRNAs in animals are typically 3-20 nt in length (Nam et al., 2005), whereas the loops of plant pre-miRNAs vary greatly in length, such as the loop in A.thaliana miR166c shown in Figure 1. Further, some plant pre-miRNAs contain big bulges, e.g. G.max miR166b in Figure 2b. Moreover, there are big unmatched parts in some plant pre-miRNAs, e.g. Physcomitrella patens miR166i in Figure 2c.

Stems of plant pre-miRNAs are relatively stable and conserved, while central loops, big bulges and big unmatched parts have great diversity and are not conserved. Therefore, the latter are removed to obtain the stems. Since the stems are more stable, two novel MFE-related features are proposed and calculated for them: (i) MFE Index 5: MFEI5 = MFE/%G+C_S, where %G+C_S is the GC content of the stems; and (ii) MFE Index 6: MFEI6 = MFE/stem_tot_bases, where stem_tot_bases is the number of base pairs in the stems. A third new feature is (iii) the average number of mismatches per 21-nt window: Avg_mis_num = tot_mismatches/n_21nts, where tot_mismatches is the total number of mismatches in the 21-nt sliding windows (21 nt is roughly the length of a mature miRNA region, which naturally has fewer than four successive mismatches) and n_21nts is the number of sliding windows in a stem.

For the 48 existing features and 3 new features, the 17 dinucleotide frequency features describe the sequential characteristic, while the remaining 31 features are mainly related to the thermodynamics and stability of the secondary structures of the hairpins. Current research indicates that the structural characteristic is also significant for distinguishing the hairpins of real/pseudo pre-miRNAs. Therefore, 32 structured triplet composition features [frequencies of "U(((", "A((.", etc., which were defined in triplet-SVM (Xue et al., 2005)] are extracted from the pre-miRNAs. Since nearly all mature miRNAs are located in stems, these 32 features are extracted again from the stems and denoted as "U(((_S", "A((._S", etc.

The transformation of the secondary structure of a pre-miRNA is shown in Figure 2. First, loops, big bulges and big unmatched parts of the pre-miRNAs are removed, in order to capture the features of the stable stems. Then, the 5'-arm and 3'-arm are connected by a linker sequence, 'LLLLLLLL', which helps to unify the length of the loops in all the real/pseudo pre-miRNAs. Since 'L' is not an RNA nucleotide, it does not match any nucleotide and prevents the nucleotides in the 5'-arm and 3'-arm from binding with the linker in a sequence-specific way. Finally, the three new features (MFEI5, MFEI6 and Avg_mis_num) and the 32 structure-related features are extracted from the linked stems.

In total, 115 features are obtained from each real/pseudo plant pre-miRNA. They include redundant features that do not contribute to classification. A graph-based algorithm is presented to eliminate the redundant features; in the end, the discriminative feature subset is selected to achieve the best classification accuracy.

Fig. 2. Transforming the secondary structures of ath-miR166c, gma-miR166b and ppt-miR166i. (a) The big loop of ath-miR166c is removed and replaced with 'LLLLLLLL'. (b) Some pre-miRNAs in plants, such as gma-miR166b, have big bulges near the loops. The big bulges are removed and the loop is replaced with 'LLLLLLLL'. (c) The big unmatched part at the left end of the stem is removed.

Fig. 3. Construction of the SVM classifier based on feature selection and sample selection. Each circle/triangle represents a real/pseudo pre-miRNA.

2.3 Classification based on SVM

Both the features of the real/pseudo pre-miRNAs and the training samples are important for constructing an efficient SVM classifier. As shown in Figure 3, we propose a classification method based on SVM. All 2043 plant pre-miRNAs from miRBase 14 (covering 29 plant species) are collected as the positive dataset. The homologous pre-miRNAs among different species are gathered into the same miRNA gene family by Rfam (Gardner et al., 2009). Most of the pre-miRNAs in the same family are similar; therefore, representative pre-miRNAs are selected from the 128 plant miRNA families of miRBase 14 (including 1612 real pre-miRNAs) as positive training samples. Since the remaining 431 pre-miRNAs do not belong to any miRNA family, they are all used as positive training samples. The negative dataset consists of 17 groups, each composed of pseudo pre-miRNAs from the genome segments of A.thaliana and G.max (see Section 3 for details). (i) The 115 features are extracted from the primary sequence and secondary structure of the pre-miRNAs and their stems. (ii) The redundant features are eliminated and the informative feature subset is selected by calculating the information gain of the features and the similarity between any two features. (iii) The positive/negative training samples are selected according to their distribution in the high-dimensional sample space, the size of each family and the size of each negative sample group.
(iv) An SVM-based classifier named PlantMiRNAPred is trained using these samples to classify real and pseudo pre-miRNAs. The feature selection and sample selection modules are implemented in Java; the web service for classifying plant pre-miRNAs is developed in PHP on the Linux platform.

2.2 SVM

Due to the excellent generalization ability of SVM, we use it to classify real/pseudo plant pre-miRNAs with high-dimensional (115-dimensional) feature vectors. Given a training dataset S, each xi ∈ S (i = 1, ..., N) is the feature vector of a real/pseudo pre-miRNA with corresponding label zi (zi = +1 or -1 for a real or pseudo pre-miRNA, respectively). SVM constructs a decision function (for classification of an unknown sequence x with stem-loop structure):

    g(x) = sgn( Σ_{i=1..N} zi αi k(x, xi) + w0 )    (1)

where αi is a coefficient to be learned (0 ≤ αi ≤ C) and k is a kernel function. In our study, a radial basis function (RBF) kernel is used, where the parameter γ determines the similarity level of the features so that the classifier becomes optimal:

    k(x, xi) = exp(-γ ||x - xi||^2)    (2)

The penalty parameter C and the RBF kernel parameter γ are tuned on the training dataset using the grid search strategy in libSVM (version 2.9).

2.4 Feature subset selection

Feature selection aims to select a group of informative features which retain most of the information of the original data and can distinguish each sample in the dataset. Our feature selection method considers information gain and feature redundancy.

Information gain: since all the features of the pre-miRNAs are discrete, the discrimination ability of a feature is measured by information gain based on Shannon entropy. Suppose a feature of the pre-miRNAs is x and the entropy of x is denoted H(x). When the value of feature y is given, the conditional entropy is H(x|y). IG(c,x) is the information gain of feature x relative to the class attribute c (Quinlan, 1993).
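The decision rule of Equations (1) and (2) can be sketched numerically. This is a minimal illustration, not the paper's implementation: the two support vectors, the α values, w0 and γ below are made-up placeholders rather than coefficients learned by libSVM on the real training data.

```python
import math

def rbf_kernel(x, xi, gamma):
    # k(x, x_i) = exp(-gamma * ||x - x_i||^2)            (Equation 2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xi)))

def decision(x, support, z, alpha, w0, gamma):
    # g(x) = sgn(sum_i z_i * alpha_i * k(x, x_i) + w0)   (Equation 1)
    s = sum(zi * ai * rbf_kernel(x, xi, gamma)
            for xi, zi, ai in zip(support, z, alpha))
    return 1 if s + w0 >= 0 else -1

# two toy "support vectors": one real (+1) and one pseudo (-1) pre-miRNA,
# each reduced to two invented feature dimensions
support = [(1.0, 1.0), (-1.0, -1.0)]
z, alpha, w0, gamma = [+1, -1], [0.5, 0.5], 0.0, 0.5

print(decision((0.9, 1.1), support, z, alpha, w0, gamma))    # prints 1
print(decision((-0.9, -1.1), support, z, alpha, w0, gamma))  # prints -1
```

In the paper itself, C and γ are not hand-set as above but chosen by a cross-validated grid search over a (C, γ) lattice, as provided with libSVM.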
Since the classification of real versus pseudo pre-miRNAs is a binary classification problem, c is assigned 1 (real pre-miRNA) or -1 (pseudo pre-miRNA).

    IG(c,x) = H(c) - H(c|x) = Σ_{c,x} p(c,x) log[ p(c,x) / (p(c)p(x)) ]    (3)

Suppose that the complete feature set is X = {x1, x2, ..., x115} and the class attribute is c. The values of the 115 features are obtained from each real pre-miRNA (1906 real pre-miRNAs in total) with c set to 1, and from each pseudo hairpin (2122 pseudo hairpins in total) with c set to -1. The information gain of feature xi (1 ≤ i ≤ 115) is calculated and denoted IG(c,xi). Features with greater information gain are given higher preference. However, the 115 features still include redundant features, whose inclusion will not improve the classification accuracy. To identify the redundant features, feature similarity is used to measure the similarity between two features.

Feature similarity: let Sim(x,y) represent the similarity between features x and y, defined as

    Sim(x,y) = 2 IG(x,y) / (H(x) + H(y))    (4)

where IG(x,y) denotes the information gain of y with respect to x. Sim(x,y) ranges from 0 to 1: Sim(x,y) = 0 means that x and y are completely irrelevant, and Sim(x,y) = 1 means that x and y are completely relevant. When Sim(x,y) is greater than a threshold ε, x or y is a redundant feature, and keeping both does not improve the classification performance; in such a situation, the feature with the greater information gain is kept and the other is dropped. In order to effectively eliminate the redundant features when multiple features are correlated, a redundant feature graph G is constructed, and we propose an algorithm based on this graph. Suppose that the redundant feature graph is G = (V, E). Each node vi (vi ∈ V) represents the feature xi, and the weight of node vi is IG(c,xi).

Fig. 4. Eliminating redundant features based on the graph G.
If the similarity between two nodes vi and vj is more than the threshold ε (ε = 0.49), xi or xj is redundant, and a new edge is added to connect the two nodes; the weight of the edge between vi and vj is Sim(xi,xj). Here, ε is determined by the experiments and prior experience (Ng and Mishra, 2007).

The process of eliminating redundant features is illustrated by an example. As shown in Figure 4, a redundant feature graph G consists of eight features, x1, x2, ..., x8. Suppose that groups of redundant features exist, where features within a group are redundant to one another but are independent of the remaining features. If we assume that there are three groups, then the graph G is composed of three subgraphs: SG1, SG2 and SG3. The feature selection weight (FSW) of each feature is first calculated: supposing that k edges are adjacent to vi, the FSW of vi is defined as the sum of the weights of those k edges plus the weight of vi itself. The feature node with the greatest FSW is selected, and the nodes adjacent to it are redundant features that should be deleted. In the subgraph SG2, the FSW of x3 = (0.71 + 0.51) + 0.42 = 1.64, and the FSWs of x4, x5, x7 and x8 are 1.48, 0.8, 1.52 and 1.15, respectively. Therefore, x3 is selected, and the adjacent x5 and x7 are deleted. The bold nodes in Figure 4 are the selected feature nodes. Next, the FSWs of the remaining nodes x4 and x8 in the current SG2 are calculated again; since x8 then has the greater FSW, it is selected. In the same way, in SG1, x1 is selected and x2 is deleted. In SG3, x6 is an independent node; as no other node can represent it, x6 is also selected. At last, the feature subset {x1, x3, x6, x8} is obtained and the redundant features x2, x4, x5 and x7 are eliminated.

Fig. 5. Algorithm of eliminating redundant features.
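The measures of Equations (3)-(4) and the FSW elimination just illustrated can be sketched together. The class/feature vectors are toy values; the node and edge weights in the graph example are back-calculated so that the initial FSWs match the values quoted above (1.64, 1.48, 0.80, 1.52, 1.15), since the actual IG and Sim values are not listed in the text.

```python
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def info_gain(a, b):
    # IG(a, b) = H(a) - H(a|b) = H(a) + H(b) - H(a, b)   (Equation 3)
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def similarity(x, y):
    # Sim(x, y) = 2 * IG(x, y) / (H(x) + H(y))           (Equation 4)
    return 2 * info_gain(x, y) / (entropy(x) + entropy(y))

def eliminate(nodes, edges):
    """nodes: {feature: IG(c, x) weight}; edges: {(a, b): Sim(a, b)} for the
    pairs with Sim > epsilon. Returns the selected feature names."""
    selected, remaining = [], dict(nodes)
    while remaining:
        def fsw(v):  # FSW(v) = node weight + weights of incident edges
            return remaining[v] + sum(w for (a, b), w in edges.items()
                                      if v in (a, b)
                                      and a in remaining and b in remaining)
        # largest FSW wins; ties are broken by the node weight, as in the text
        best = max(remaining, key=lambda v: (fsw(v), remaining[v]))
        selected.append(best)
        for (a, b) in list(edges):          # drop best and all its neighbours
            if best in (a, b):
                remaining.pop(a, None)
                remaining.pop(b, None)
        remaining.pop(best, None)           # covers isolated nodes too
    return sorted(selected)

# SG2 example: x3 is selected first (FSW 1.64), then x8 after recalculation
nodes = {"x3": 0.42, "x4": 0.30, "x5": 0.09, "x7": 0.35, "x8": 0.63}
edges = {("x3", "x5"): 0.71, ("x3", "x7"): 0.51,
         ("x4", "x7"): 0.66, ("x4", "x8"): 0.52}
print(eliminate(nodes, edges))            # prints ['x3', 'x8']

c = [1, 1, 1, 1, -1, -1, -1, -1]          # class labels of eight toy samples
x = [0, 0, 0, 0, 1, 1, 1, 1]              # a feature that determines the class
y = [0, 1, 0, 1, 0, 1, 0, 1]              # a feature independent of x
print(info_gain(c, x))                    # prints 1.0
print(similarity(x, y))                   # prints 0.0
```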
In addition, if two nodes vy and vz both have the maximum FSW in a subgraph, the node weights of vy and vz are compared and the node with the greater weight is selected. The algorithm for eliminating redundant features is described in Figure 5.

Initially, the 115 features are categorized into three feature subsets: (i) the primary sequence-related subset S1 = {%G+C, %XY | X,Y ∈ {A,C,G,U}} (17 features); (ii) the secondary structure-related subset S2 = {"A(((", ..., "U...", "A(((_S", ..., "U..._S"} (64 features); and (iii) the energy- and thermodynamics-related subset S3 = {dP, dG, ..., zF, MFEI5, MFEI6, Avg_mis_num} (34 features). Supplementary Table S1 lists the names and the categorization of the 115 features.

The proposed algorithm is applied to eliminate redundant features among the 115 features. We found that two pairs of attributes are strongly correlated: dQ versus dD, and dG versus NEFE. This observation is consistent with the result in miPred, which indirectly confirms the result of eliminating features. In addition, we found two new strongly correlated pairs of attributes: dH versus dS, and dH/L versus dS/L; dQ, dG, dH and dH/L are retained due to their higher selection weights.

After eliminating the redundant features, the remaining features in each of the three subsets are sorted by information gain in descending order. The features with information gain greater than a threshold λ (λ1 = 0.136, λ2 = 0.083, λ3 = 0.0159 for the three subsets, respectively) are selected for the classification; λ is determined by the experiments and prior experience (Ng and Mishra, 2007). In the end, a total of 68 features are selected; they are listed in Section 3.

3 RESULTS AND DISCUSSION

3.1 Data collection

A classifier of pre-miRNAs should distinguish real plant pre-miRNAs from pseudo plant hairpins.
The positive dataset is composed of known plant pre-miRNAs, while the negative dataset is composed of both pseudo A.thaliana hairpins and pseudo G.max hairpins.

Positive dataset: there are 2043 known plant pre-miRNAs in the miRNA database miRBase 14 (http://www.mirbase.org/). Rfam (http://rfam.janelia.org/) groups the available real pre-miRNAs into a set of families by means of multiple sequence alignments; a miRNA gene family is composed of the homologous pre-miRNAs from different species. In total, 1612 pre-miRNAs belong to 128 miRNA families and 431 pre-miRNAs do not belong to any miRNA family. The 2043 real pre-miRNAs are chosen as the positive sample dataset. They are from A.thaliana, O.sativa, Populus trichocarpa, P.patens, Medicago truncatula, Sorghum bicolor, Zea mays, G.max and 21 other plant species. As shown in Figure 6, the top 22 families consist of 1066 plant pre-miRNAs. Supplementary Table S2 shows the pre-miRNA distribution and the species distribution in all 128 plant families; each family contains at least two pre-miRNAs and covers at least one plant species. After eliminating the special sequences with complex secondary structures, 1906 real pre-miRNAs remain in the positive dataset.

Fig. 6. The 128 families are ranked by the number of species that a miRNA gene family covers. The distribution of the top 22 miRNA families (containing 1066 pre-miRNAs) is shown. The names of the miRNA families are listed on the x-axis; the y-axis represents the number of species/miRNAs.

Negative dataset: the complete genome sequence of A.thaliana was released in 2000 (AGI, 2000), and the sequences of the 20 chromosomes of the G.max genome were released in 2010 (Schmutz et al., 2010). Arabidopsis thaliana is a typical model plant and G.max is one of the most important crops; the negative samples are extracted from these two species. Since almost all reported miRNAs are located in untranslated regions or intergenic regions, the pseudo hairpins are extracted from the protein coding sequences (CDSs) of A.thaliana and G.max. The CDSs of A.thaliana and G.max are downloaded from the plant database Phytozome 6 (http://www.phytozome.net/).

The ratios of pre-miRNAs of different lengths among the 2043 real pre-miRNAs are listed in Figure 7. It is found that the lengths of most known plant pre-miRNAs range from 60 to 220 nt. Thus, a sliding window of width ranging from 60 to 220 nt is used to scan the CDSs to produce sequence segments. The secondary structures of the sequence segments are predicted by RNAfold from the Vienna RNA package (Hofacker et al., 1994).

2.5 Training samples selection

The classification performance of SVM is highly dependent on the selection of the training dataset. First, we noted that the real pre-miRNAs from the same species and the same miRNA families are highly similar to one another; these redundant positive samples should be removed from the training samples to avoid over-fitting. Secondly, current research has shown that training an SVM classifier with imbalanced positive and negative datasets results in poor classification performance with respect to the minority class (Weiss, 2004). So we select an appropriate proportion of representative real/pseudo pre-miRNAs to construct the positive/negative training dataset.

The sample selection method (miSampleSelection) selects the positive/negative training samples according to the sample distribution in the positive families/negative groups. As shown in Figure 3, the real pre-miRNAs are selected based on the sample distribution in the 128 families. Since 68 informative features have been selected, each sample is denoted as a 68-dimensional feature vector. Suppose the feature vector of a sample is v and there are M (M = 128) families. The vector set of central points is C = {c1, c2, ..., cM}, where ci represents the feature vector of the central point of the i-th family. The sample selection process for the i-th family is as follows.

(1) Assume that the number of samples in the i-th family is Ni and vk is the feature vector corresponding to the k-th sample. ci is then calculated as

    ci = (1/Ni) Σ_{k=1..Ni} vk    (5)

(2) The distance between the k-th sample (real pre-miRNA) vk and the central point ci is denoted d_vk,ci, where vk^t denotes the transpose of vector vk. The radius of the i-th family is ri = max(d_vk,ci) (1 ≤ k ≤ Ni).

    d_vk,ci = 1 - (vk^t ci) / (vk^t vk + ci^t ci - vk^t ci)    (6)

(3) Suppose that the selection rate of the sample space is 1/n; that is, Ni/n samples of the i-th family are selected. The number of selected samples is denoted Pi = Ni/n.

(4) With ci as the center, draw two circles with radii 0 and (1/Pi)ri, respectively. The region between these two circles is denoted A0. The degree of coverage of each sample s in A0 is calculated and denoted C(s); C(s) is the number of samples in A0 whose nearest neighbor is s. The sample s with the greatest C(s) value is selected as a training sample.

(5) We set (1/Pi)ri as the step length and compute the degree of sample coverage in the region Ak between the two circles with radii (1/Pi)k ri and (1/Pi)(k+1)ri (1 ≤ k ≤ Pi - 1), respectively. The sample in Ak with the largest degree of coverage is selected.

The positive training dataset is composed of the samples selected from the 128 families, together with the 431 pre-miRNAs that do not belong to any miRNA family.

The process of selecting negative training samples is similar. The negative samples are composed of pseudo hairpins grouped by length: there are 17 groups in total, where the 60nt_Group refers to pseudo hairpins of length 60-69 nt, the 220nt_Group refers to pseudo hairpins of length 220-229 nt, etc. The 17 groups of the negative dataset correspond to the families of the positive dataset.
The negative training samples are selected in the same way as the positive samples.

Fig. 7. Ratio of plant pre-miRNAs of different lengths. The x-axis represents the length of the real pre-miRNAs; the y-axis represents the proportion of pre-miRNAs with lengths of 60-69 nt, 70-79 nt, ..., 220-229 nt among all real pre-miRNAs.

As noted above, the sequence segments produced by the sliding window should fold into stem-loop structures. Further, they should satisfy three criteria on the number of base pairs in the hairpin, the %G+C and the MFEI. The criteria are determined by observing real plant pre-miRNAs with lengths of 60-69 nt, 70-79 nt, etc., up to 220-229 nt. For instance, the criteria for selecting pseudo pre-miRNAs of length 60-69 nt are: a minimum of 19 base pairings in the hairpin structure, %G+C between 0.242 and 0.825, and MFEI between 0.522 and 1.39. The length of the sliding window changes randomly from 60 to 69. Supplementary Table S3 lists all the criteria for the different lengths. Therefore, the extracted pseudo pre-miRNAs are similar to the real pre-miRNAs.

The negative samples (pseudo pre-miRNAs) are collected according to the proportions of the real pre-miRNAs of different lengths. For example, suppose the ratio of real pre-miRNAs of lengths 70-79, 80-89, 90-99 and 100-109 nt is 0.02:0.08:0.12:0.20; then the negative samples of the different lengths are added to the negative dataset in the corresponding proportions. In total, 2122 pseudo pre-miRNAs are collected as the negative dataset.

Positive and negative training dataset: 980 real pre-miRNAs and 980 pseudo pre-miRNAs are selected by the sample selection algorithm. The final training dataset includes a total of 1960 samples and is denoted the 1960 training dataset.

Positive and negative testing dataset: A.thaliana, O.sativa, P.trichocarpa, P.patens and M.truncatula are typical model plants, and S.bicolor, Z.mays and G.max are important crops. To date, relatively more miRNAs have been identified from these eight species, so eight groups of testing datasets are created to evaluate our classifier. The first group is composed of all 180 A.thaliana (ath) pre-miRNAs and is referred to as the ath dataset. The 397 O.sativa (osa) pre-miRNAs, 233 P.trichocarpa (ptc) pre-miRNAs, 211 P.patens (ppt) pre-miRNAs, 106 M.truncatula (mtr) pre-miRNAs, 131 S.bicolor (sbi) pre-miRNAs, 97 Z.mays (zma) pre-miRNAs and 83 G.max (gma) pre-miRNAs are used to construct the osa, ptc, ppt, mtr, sbi, zma and gma datasets, respectively. The remaining 1142 of the 2122 pseudo pre-miRNAs (excluding the 980 used for training) are used as the 1142 negative testing dataset. The 191 A.lyrata pre-miRNAs (updated aly dataset) and 118 G.max pre-miRNAs (updated gma dataset) were newly reported in miRBase 15-16 when this work was nearly completed.

3.2 Evaluation method

The informative feature subset and the training samples were used to construct the classifier PlantMiRNAPred. The prediction result of PlantMiRNAPred can be one of four outcomes: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). The sensitivity (SE), specificity (SP), geometric mean (Gm) and total prediction accuracy (Acc) for assessment of the prediction system are as follows:

    SE = TP/(TP+FN),  SP = TN/(TN+FP),  Gm = sqrt(SE x SP),  Acc = (TP+TN)/(TP+FP+TN+FN)    (7)

where SE is the proportion of the positive samples (real pre-miRNAs) correctly classified, and SP is the proportion of the negative samples (pseudo pre-miRNAs) correctly classified.

3.3 Feature subset selection result

The 68 informative features chosen by the feature selection process and their corresponding information gain are listed in Table 1.
They are ranked by their normalized information gain.

Table 1. Selected features ranked by their information gain

Rank  AttrName        IG(c,attr)   Rank  AttrName        IG(c,attr)
1     Tm              1.0          35    G.((_S          0.1365
2     MFEI            0.6362       36    %GG             0.1358
3     MFEI            0.4837       37    zF              0.1325
4     MFEI            0.3959       38    G(.(            0.1302
5     MFEI            0.3759       39    %G+C            0.1223
6     dG              0.3598       40    G((.            0.1218
7     A(((            0.2964       41    G((._S          0.1148
8     A(((_S          0.2804       42    U...            0.1106
9     %(G-C)/stems    0.2653       43    G(.(_S          0.1061
10    C(((            0.2411       44    dP              0.1032
11    G(((            0.2274       45    A...            0.1014
12    U(((            0.2269       46    G(..            0.1009
13    U(((_S          0.2187       47    G..(            0.0989
14    C(((_S          0.2154       48    C(..            0.0982
15    G(((_S          0.2096       49    dH              0.0979
16    C((.            0.1979       50    zP              0.0971
17    MFEI            0.1948       51    Avg_mis_num     0.0967
18    C.((            0.1922       52    U..._S          0.0860
19    dH/L            0.1822       53    C..(            0.0855
20    C(.(            0.1822       54    G(.._S          0.0852
21    %GC             0.1769       55    U..(            0.0832
22    %UA             0.1767       56    MFEI            0.0774
23    %AU             0.1742       57    dQ              0.0662
24    %AA             0.1727       58    Avg_Bp_Stem     0.0662
25    %CG             0.1678       59    Diff            0.0618
26    C(.(_S          0.1663       60    Freq            0.0607
27    Tm/L            0.1618       61    Diversity       0.0606
28    zG              0.1592       62    |G-C|/L         0.0597
29    G.((            0.1558       63    zD              0.0498
30    C((._S          0.1556       64    dF              0.0376
31    %UU             0.1554       65    |G-U|/L         0.0374
32    %(A-U)/stems    0.1536       66    |A-U|/L         0.0305
33    %CC             0.1528       67    %(G-U)/stems    0.0171
34    C.((_S          0.1491       68    zQ              0.0159

It has been well studied that the stem–loop structures of plant pre-miRNAs are thermodynamically stable. Most of the selected features are related to the thermodynamic stability of the secondary structures, which indirectly confirms the effectiveness of the selected features. There are some features with the suffix _S and three new features (MFEI5, MFEI6, Avg_mis_num) in the selected feature subset, which shows the significance of extracting the new features for the stems. In addition, some features and the corresponding features obtained from the stems appear in pairs, such as A((( and A(((_S, U((( and U(((_S, etc. This indicates that there is an obvious difference between the two features in any such pair, and that both features are important for the classification of real/pseudo pre-miRNAs.

In order to validate the efficiency of the feature selection method, we tested the classification accuracies of the 68 features, 80 features (containing no features of stems), 51 features (containing no structural features) and all 115 features, respectively. For each feature subset, 980 real pre-miRNAs and 980 pseudo pre-miRNAs were selected by the sample selection method to train an SVM classifier. These four SVM classifiers were tested by performing five-fold cross-validation: all pre-miRNAs in the training dataset were divided into five equal subsets, four of which were used for training the classifier, while the left-out subset was used for validation. We performed 10 repeated evaluations for each testing dataset and averaged the results.

The classification results are summarized in Table 2. The classification performances of the 80 features and the 51 features are worse than that of the 115 features, which indicates that the stem-related features and the structural features are absolutely necessary. Obviously, the classifier trained with the selected 68 features achieves the best classification performance. This shows the importance of feature selection during the construction of an efficient classifier.

Table 2. Classification results on different feature subsets

Feature subset     SE (%)   SP (%)   Gm (%)   Acc (%)
68 features        91.93    97.84    94.84    94.39
80 features        89.08    92.82    90.93    90.63
51 features        88.78    92.96    90.85    90.51
All 115 features   90.31    94.54    92.40    92.06

3.4 Training sample selection result

The 980 positive samples and 980 negative samples with the 68 features were selected by our sample selection method miSampleSelection to construct the classifier PlantMiRNAPred. Moreover, an equal number of real/pseudo pre-miRNAs were randomly selected from the positive/negative datasets, referred to as the 1960 random dataset. The performance of PlantMiRNAPred was compared with the classifier trained with the 1960 random dataset. As shown in Table 3, five-fold cross-validation was performed on each training dataset.

Table 3. Classification results with different sample selection methods

Sample selection method   Dataset                 SE (%)   SP (%)   Gm (%)   Acc (%)
miSampleSelection         1960 training dataset   91.93    97.84    94.84    94.39
Random Selection          1960 random dataset     89.69    93.25    91.45    91.17

The classifier trained with the 1960 training dataset achieves much higher sensitivity and specificity, which demonstrates that miSampleSelection is effective for improving the classification accuracy. In addition, the classifier trained with the 1960 random dataset also achieves excellent classification accuracy, which further confirms that the selected 68 features are sufficient to ensure the classification performance.

3.5 Comparison with triplet-SVM and microPred

Triplet-SVM is the only ab initio method that has been tested with the pre-miRNAs from A.thaliana and O.sativa. Therefore, we compared PlantMiRNAPred with triplet-SVM. The program of triplet-SVM was downloaded from Xue's web site (http://bioinfo.au.tsinghua.edu.cn/mirnasvm/). The eight testing datasets composed of known pre-miRNAs from eight species were tested to evaluate the ability of identifying real pre-miRNAs. The 1142 negative testing dataset was tested to evaluate the ability of identifying pseudo hairpins. The updated datasets were also tested to observe the ability of discovering new plant pre-miRNAs.

We performed evaluations for all the testing datasets and list the results in Table 4. PlantMiRNAPred is nearly 18% better than triplet-SVM in overall accuracy; SE increased by 22.19% and SP increased by 12.25% on average. As many plant pre-miRNAs contain multiple loops, triplet-SVM cannot classify them correctly, whereas almost all the plant pre-miRNAs with multiple loops in the testing datasets are classified correctly by PlantMiRNAPred. This indicates that our method is sensitive enough to identify pre-miRNAs with multiple loops.

Table 4. Classification results on different testing datasets

                                              Accuracy (%)
Testing dataset                Type    Size   PlantMiRNAPred   Triplet-SVM   microPred
ath dataset                    Real    180    92.22            76.06         89.44
osa dataset                    Real    397    94.21            75.54         90.43
ptc dataset                    Real    233    91.85            75.21         84.98
ppt dataset                    Real    211    92.42            71.49         89.57
mtr dataset                    Real    106    100              80.18         95.28
sbi dataset                    Real    131    98.47            69.51         94.66
zma dataset                    Real    97     97.94            66.97         93.81
gma dataset                    Real    83     98.31            74.12         86.75
1142 negative testing dataset  Pseudo  1142   98.59            86.34         93.61
updated aly dataset            Real    191    97.91            70.98         91.62
updated gma dataset            Real    118    98.31            79.66         93.22

MicroPred is more similar to our approach, as it uses the same 48 features to classify pre-miRNAs. However, it was originally developed for human pre-miRNAs. The program of microPred can be downloaded from the web site (http://web.comlab.ox.ac.uk/people/manohara.rukshan.batuwita/microPred.htm). In order to compare with microPred, the classification model of microPred was rebuilt according to the plant pre-miRNA datasets. As shown in Table 4, PlantMiRNAPred is nearly 5% better than microPred in overall accuracy; SE increased by 5.19% on average and SP increased by 4.98%. The improvement is mainly due to the additional 32 structural features extracted from the plant pre-miRNAs and the 35 features extracted from the stems.

Twenty-three of the 397 O.sativa positive samples, 19 of the 233 P.trichocarpa positive samples and 2 of the 131 S.bicolor positive samples are classified as pseudo pre-miRNAs. However, in miRBase 16, 9 of the 23 O.sativa samples, 8 of the 19 P.trichocarpa samples and 2 of the 2 S.bicolor samples were obtained by computational identification and have not been verified by biological experiments. Despite this, the accuracies on the three testing datasets are 96.39, 95.11 and 100%, respectively.

Most of the newly reported pre-miRNAs in miRBase 15–16 were correctly predicted by PlantMiRNAPred, with an average accuracy of 98.11%. This shows that PlantMiRNAPred is powerful in discovering novel pre-miRNAs. In the two updated datasets, 109 of the 118 G.max pre-miRNAs and 74 of the 193 A.lyrata pre-miRNAs are lineage-specific pre-miRNAs. Therefore, PlantMiRNAPred is also shown to be able to achieve high performance in classifying lineage-specific pre-miRNAs.

In addition, 11 918 inverted repeats were extracted from Gm08 (the eighth chromosome of the G.max genome) by EINVERTED (Rice et al., 2000). One thousand inverted repeats (including eight real pre-miRNAs) were selected according to the proportion of the real pre-miRNAs of different lengths. Thirty-seven of the 1000 are classified by PlantMiRNAPred as putative real pre-miRNAs, covering the eight real pre-miRNAs; the FP rate of PlantMiRNAPred is 2.9%. MicroPred classified 89 inverted repeats as real pre-miRNAs, covering the eight real pre-miRNAs; its FP rate is 8.1%. Triplet-SVM classified 184 inverted repeats as real pre-miRNAs, covering six real pre-miRNAs; its FP rate is 17.8%. This indicates that PlantMiRNAPred discriminates real pre-miRNAs from genomic inverted repeats more reliably.

4 CONCLUSION

A new ab initio classifier (PlantMiRNAPred) was developed for predicting plant pre-miRNAs. We demonstrated the importance of careful feature extraction, feature selection and training sample selection in achieving efficient and effective classification results. In particular, according to the characteristics of plant pre-miRNAs, 115 features were extracted to distinguish the hairpins of real and pseudo pre-miRNAs. After eliminating redundant features, 68 informative features were selected. Each real/pseudo pre-miRNA was mapped into the 68-dimensional space, and 1960 positive/negative representative samples were selected as the training dataset.

PlantMiRNAPred has been compared with the existing pre-miRNA classification methods triplet-SVM and microPred. The results demonstrated that PlantMiRNAPred has higher classification performance. Further analysis indicated that the improvement in classification accuracy was due to the informative features and the representative training samples. PlantMiRNAPred will be useful in generating effective hypotheses for subsequent biological testing.

ACKNOWLEDGEMENTS

We appreciate Yingpeng Han and Yongxin Liu from the Soybean Research Institute of Northeast Agricultural University for valuable assistance.

Funding: Natural Science Foundation of China (60932008 and 60871092); Fundamental Research Funds for the Central Universities (HIT.ICRST.2010 022); Returned Scholar Foundation of the Educational Department of Heilongjiang Province (1154hz26); National Science Foundation (Grant CCF-0546345 to Y.H.).

Conflict of Interest: none declared.

REFERENCES

Adai,A. et al. (2005) Computational prediction of miRNAs in Arabidopsis thaliana. Genome Res., 15, 78–91.
Ambros,V. et al. (2003) A uniform system for microRNA annotation. RNA, 9, 277–279.
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408, 796–815.
Bartel,D.P. (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116, 281–297.
Batuwita,R. and Palade,V. (2009) microPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics, 25, 989–995.
Berezikov,E. et al. (2006) Approaches to microRNA discovery. Nat. Genet., 38, 2–7.
Bonnet,E. et al. (2004) Detection of 91 potential conserved plant microRNAs in Arabidopsis thaliana and Oryza sativa identifies important target genes. PNAS, 101, 11511–11516.
Chang,D.T. et al. (2008) Using a kernel density estimation based classifier to predict species-specific microRNA precursors. BMC Bioinformatics, 9(Suppl. 12), 2–12.
Chatterjee,S. and Grobhans,H. (2009) Active turnover modulates mature microRNA activity in Caenorhabditis elegans. Nature, 461, 546–549.
Dezulian,T. et al. (2006) Identification of plant microRNA homologs. Bioinformatics, 22, 359–360.
Freyhult,E. et al. (2005) A comparison of RNA folding measures. BMC Bioinformatics, 6, 241–248.
Gan,H.H. et al. (2004) RAG: RNA-As-Graphs database: concepts, analysis, and features. Bioinformatics, 20, 1285–1291.
Gardner,P.P. et al. (2009) Rfam: updates to the RNA families database. Nucleic Acids Res., 37(Suppl. 1), 136–140.
Hofacker,I.L. et al. (1994) Fast folding and comparison of RNA secondary structures. Monatsh. Chem., 125, 167–188.
Jiang,P. et al. (2007) MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res., 35(Suppl. 2), 339–344.
Jones-Rhoades,M.W. and Bartel,D.P. (2004) Computational identification of plant microRNAs and their targets, including a stress-induced miRNA. Mol. Cell, 14, 787–799.
Lindow,M. and Krogh,A. (2005) Computational evidence for hundreds of non-conserved plant microRNAs. BMC Genomics, 6, 119–127.
Lu,Y.Z. and Yang,X.Y. (2010) Computational identification of novel microRNAs and their targets in Vigna unguiculata. Comp. Funct. Genomics, 10, 128297–128313.
Moulton,V. et al. (2000) Metrics on RNA secondary structures. J. Comput. Biol., 7, 277–292.
Nam,J. et al. (2005) Human microRNA prediction through a probabilistic co-learning model of sequence and structure. Nucleic Acids Res., 33, 3570–3581.
Ng,K.L.S. and Mishra,S.K. (2007) De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics, 23, 1321–1330.
Quinlan,J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco.
Rice,P. et al. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet., 16, 276–277.
Schmutz,J. et al. (2010) Genome sequence of the palaeopolyploid soybean. Nature, 463, 178–183.
Schultes,E.A. et al. (1999) Estimating the contributions of selection and self-organization in RNA secondary structure. J. Mol. Evol., 49, 76–83.
Seffens,W. and Digby,D. (1999) mRNAs have greater negative folding free energies than shuffled or codon choice randomized sequences. Nucleic Acids Res., 1578–1584.
Sewer,A. et al. (2005) Identification of clustered microRNAs using an ab initio prediction method. BMC Bioinformatics, 6, 267–281.
Smalheiser,N.R. and Torvik,V.I. (2005) Mammalian microRNAs derived from genomic repeats. Trends Genet., 21, 322–326.
Weiss,G. (2004) Mining with rarity: a unifying framework. SIGKDD Expl., 6, 7–19.
Xue,C.H. et al. (2005) Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics, 6, 310–316.
Yousef,M. et al. (2006) Combining multi-species genomic data for microRNA identification using a naïve Bayes classifier. Bioinformatics, 22, 1325–1334.
Zhang,B.H. et al. (2006a) Evidence that miRNAs are different from other RNAs. Cell Mol. Life Sci., 63, 246–254.
Zhang,B.H. et al. (2006b) Plant microRNA: a small regulatory molecule with big impact. Dev. Biol., 289, 3–16.

Publisher: Oxford University Press
Copyright: © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
ISSN: 1367-4803    eISSN: 1460-2059
DOI: 10.1093/bioinformatics/btr153    PMID: 21441575
ABSTRACT

Motivation: MicroRNAs (miRNAs) are a set of short (21–24 nt) non-coding RNAs that play significant roles as post-transcriptional regulators in animals and plants. While some existing methods use comparative genomic approaches to identify plant precursor miRNAs (pre-miRNAs), others are based on the complementarity characteristics between miRNAs and their target mRNA sequences. However, they can only identify homologous miRNAs or the limited complementary miRNAs. Furthermore, since plant pre-miRNAs are quite different from animal pre-miRNAs, the ab initio methods developed for animals cannot be applied to plants. Therefore, it is essential to develop a machine learning method to classify real plant pre-miRNAs and pseudo genome hairpins.

Results: A novel classification method based on support vector machine (SVM) is proposed specifically for predicting plant pre-miRNAs. To make efficient prediction, we extract the pseudo hairpin sequences from the protein coding sequences of Arabidopsis thaliana and Glycine max, respectively. These pseudo pre-miRNAs are extracted in this study for the first time. A set of informative features is selected to improve the classification accuracy. The training samples are selected according to their distributions in the high-dimensional sample space. Our classifier PlantMiRNAPred achieves >90% accuracy on the plant datasets from eight plant species, including A.thaliana, Oryza sativa, Populus trichocarpa, Physcomitrella patens, Medicago truncatula, Sorghum bicolor, Zea mays and G.max. The superior performance of the proposed classifier can be attributed to the extracted plant pseudo pre-miRNAs, the selected training dataset and the carefully selected features. The ability of PlantMiRNAPred to discern real and pseudo pre-miRNAs provides a viable method for discovering new non-homologous plant pre-miRNAs.

Availability: The web service of PlantMiRNAPred, the training datasets, the testing datasets and the selected features are freely available at http://nclab.hit.edu.cn/PlantMiRNAPred/.

Contact: maozuguo@hit.edu.cn; yufei.huang@utsa.edu

*To whom correspondence should be addressed.

Received on October 29, 2010; revised on February 28, 2011; accepted on March 23, 2011.

1 INTRODUCTION

Derived from hairpin precursors (pre-miRNAs), mature microRNAs (miRNAs) are non-coding RNAs that play important roles in gene regulation by targeting mRNAs for cleavage or translational repression (Bartel, 2004; Chatterjee and Grobhans, 2009). MiRNAs are involved in many important biological processes including plant development, signal transduction and protein degradation (Zhang et al., 2006b).

However, systematically detecting miRNAs from a genome by experimental techniques is difficult (Bartel, 2004; Berezikov et al., 2006). As an alternative, computational prediction methods are used to analyze genomic DNA and obtain putative candidates for experimental verification. Since many miRNAs are evolutionarily conserved in multiple species, methods that use comparative genomics to identify putative miRNAs have been presented. MirFinder (Bonnet et al., 2004) predicted potential miRNAs in the Arabidopsis thaliana genome; it is based on the conservation of short sequences between the genomes of Arabidopsis and Oryza sativa. MicroHARVESTOR (Dezulian et al., 2006) can identify candidate plant miRNA homologs for a given query miRNA; the approach is based on a sequence similarity search followed by a set of structural filters. The miRNAs of Vigna unguiculata have been identified through homology alignment of highly conserved miRNAs in multiple species (Lu and Yang, 2010). Although these methods are effective in identifying conserved miRNAs, they cannot discover novel miRNAs with less homology. On the other hand, since miRNAs in plants often have near-perfect matches to their target mRNAs, methods based on complementarity characteristics have also been proposed. MIRcheck (Jones-Rhoades and Bartel, 2004) searches for new miRNAs whose target sites are conserved in A.thaliana and O.sativa. FindMiRNA (Adai et al., 2005) predicts potential Arabidopsis miRNAs from candidate pre-miRNAs that have corresponding target sites within transcripts; to decrease the number of miRNA candidates, the predicted candidates are required to have orthologs in rice. MiMatcher (Lindow and Krogh, 2005) also predicts plant miRNAs by exploiting the complementarity of plant miRNAs. However, since the candidates that are complementary to target mRNAs are enormous within intergenic regions and introns, rigorous criteria such as conservation among multiple species are introduced to significantly reduce the number of candidates.
This practice also considerable reduces the chance of discovering new miRNAs. As an alternative, the ab initio methods have been developed to distinguish real pre-miRNAs from pseudo pre-miRNAs. The real pre-miRNAs and pseudo pre-miRNAs are used to train the classification models including support vector machines (SVM) (Batuwita and Palade, 2009; Ng and Mishra, 2007; Sewer et al., 2005; Xue et al., 2005), probabilistic co-learning model (Nam et al., 2005), naïve Bayes (Yousef et al., 2006), random forest (Jiang et al., 2007) and kernel density estimation (Chang et al., 2008). These trained classification models can be then used to Fig. 1. Arabidopsis pre-miRNA ath-miR166c and human pre-miRNA classify real pre-miRNAs from pseudo pre-miRNAs. However, they hsa-mir-95. Their secondary structures are obtained from miRBase. The all are trained by the human pre-miRNAs and pseudo pre-miRNAs, nucleotides of the mature miRNA are displayed in bold. and mainly used to identify human pre-miRNAs. Triplet-SVM (Xue et al., 2005) is the only one that has been tested on the pre- miRNAs from A.thaliana and O.sativa. As the plant pre-miRNAs differ greatly from the animal pre-miRNAs, triplet-SVM cannot and %(G −U)/n_stems, where n_stems is the number of stems in the achieve high classification accuracy for the plant species including secondary structure. A.thaliana and O.sativa. The plant pre-miRNAs have more diversities than the animal pre- miRNAs. First, the pre-miRNAs in animals are typically 60–80 nt (Ambros Since nearly all the pre-miRNAs in plants and animals have the et al., 2003), whereas the length of pre-miRNAs in plants ranges from stem–loop hairpin structures, this characteristic is widely used as an 60 to >400 nt (Smalheiser and Torvik, 2005). Secondly, the molecular important feature in the ab initio methods. 
However, the plant pre- characteristics of plant pre-miRNAs are also different from the animal miRNAs usually have more complex secondary structures than the pre-miRNAs. The former have great varieties in secondary structures. For animal pre-miRNAs and the structures have not been considered instance, like Homo sapiens miR-95, the central loops of the pre-miRNAs in existing methods. We propose a computational classification in animals are typically 3–20 nt in length (Nam et al., 2005). The loops approach that considers the unique characteristics of plant pre- of plant pre-miRNAs have great varieties in length, such as the loop in miRNAs. To construct a comprehensive training dataset, the new A.thaliana miR166c. This is shown in Figure 1. Further, some plant pre- pseudo plant pre-miRNAs are extracted from the protein coding miRNAs contain the big bugles, e.g. G.max miR166b in Figure 2b. Moreover, regions of the A.thaliana and Glycine max genomes, based on which there are big unmatched parts in some plant pre-miRNAs, e.g. Physcomitrella patens miR166i in Figure 2c. an efficient SVM classifier is constructed to classify the real/pseudo Stems of plant pre-miRNAs are relatively stable and conserved. Central plant pre-miRNAs. loops, big bugles and big unmatched parts have great diversities and are not conserved. Therefore, they are removed to obtain the stems. Two novel MFE- related features are proposed and calculated for stems, since they are more 2 METHODS stable. (i) MFE Index 5: MFEI = MFE/%G+C_S, where %G+C_S is the GC 2.1 Features of plant pre-miRNAs content in the stems. (ii) MFE Index 6: MFEI = MFE/stem_tot_bases, where stem_tot_bases is the number of base pairs in the stems. (iii) Average number Recent research indicates that pre-miRNAs in animals and plants have many of mismatches per 21-nt window: Avg_mis_num = tot_mismatches/n_21nts, features in both primary sequence and secondary structure. 
These features where tot_mismatches is the total number of mismatches in the 21-nt sliding can be used to classify real plant pre-miRNAs and pseudo hairpins with an windows (which is roughly the length of a mature miRNA region and ab initio method. miPred (Ng and Mishra, 2007) extracted 29 global and intrinsic folding naturally has fewer than four successive mismatches) and n_21nts is the features from human real and pseudo pre-miRNAs. These features include number of sliding windows in a stem. (i) 17 base composition variables: 16 dinucleotide frequencies, that is %XY For the 48 existing features and 3 new features, the 17 dinucleotide where X,Y ∈{A,C,G,U} and %G +C content; (ii) six folding measures: frequencies features describe the sequential characteristic. The remaining adjusted base pairing propensity dP (Schultes et al., 1999), adjusted 31 features are mainly related to the thermodynamics and stability of the minimum free energy (MFE) of folding denoted as dG (Freyhult et al., 2005; secondary structures of the hairpins. The current research indicates that the Seffens and Digby, 1999), adjusted base pair distance dD (Freyhult et al, structural characteristic is also significant for distinguishing the hairpins 2005; Moulton et al., 2000), adjusted Shannon entropy dQ (Freyhult et al., of real/pseudo pre-miRNAs. Therefore, 32 structured triplet composition 2005), MFE Index 1 MFEI (Zhang et al., 2006a) and MFE Index 2 MFEI ; features [frequencies of "U(((","A((.", etc., which were defined in triplet- 1 2 (iii) one topological descriptor which is the degree of compactness dF (Gan SVM (Xue et al., 2005)] are extracted from the pre-miRNAs. Since nearly et al., 2004); (iv) five normalized variants of dP, dG, dQ, dD and dF: zP, all mature miRNAs are located in stems, these 32 features are extracted again zG, zQ, zD and zF. from stems and denoted as "U(((_S", "A((._S", etc. 
In addition to the 29 features listed above, microPred (Batuwita and The transformation of secondary structure of a pre-miRNA is shown in Palade, 2009) extracted 19 new features. These features are (i) two MFE- Figure 2. First, loops, big bugles and big unmatched parts of pre-miRNAs related features: MFE Index 3 MFEI and MFE Index 4 MFEI ; (ii) are removed, in order to capture the features of stable stems. Then, the 3 4 four RNAfold-related features: normalized ensemble free energy (NEFE), 5 -arm and 3 -arm are connected by a linker sequence, ‘LLLLLLLL’. It frequency of the MFE structure Freq, structural diversity denoted as Diversity is helpful to unify the length of loops in all the real/pseudo pre-miRNAs. and a combined feature Diff; (iii) six thermodynamical features: structure Since ‘L’ is not an RNA nucleotide, it does not match with any nucleotide entropy dS and dS/L, structure enthalpy dH and dH/L, melting energy of the and prevents nucleotides in 5 -arm and 3 -arm from binding with sequence- structure T and T /L, where L is the length of pre-miRNA sequence; (iv) specific linker sequences. Finally, the three new features (MFEI , MFEI m m 5 6 and Avg_mis_num) and the 32 structure-related features are extracted from seven base pair-related features: |A −U|/L, |G −C|/L, |G −U|/L, average the linked stems. base pairs per stem Avg_BP_Stem, %(A −U)/n_stems, %(G −C)/n_stems [15:38 19/4/2011 Bioinformatics-btr153.tex] Page: 1369 1368–1376 P.Xuan et al. Fig. 3. Construction of SVM classifier based on feature selection and sample selection. Each circle/triangle represents a real/pseudo pre-miRNA. 2.3 Classification based on SVM Both the features of real/pseudo pre-miRNAs and the training samples are important for constructing an efficient SVM classifier. As shown in Figure 3, we propose a classification method based on SVM. All the 2043 plant pre-miRNAs from miRBase 14 (covers 29 plant species) are collected as Fig. 2. 
Transforming secondary structures of ath-miR166c, gma-miR166b positive datasets. The homologous pre-miRNAs among different species are and ppt-miR166i. (a) The big loop of ath-miR166c is removed and replaced gathered into the same miRNA gene family by RFam (Gardner et al., 2009). with ‘LLLLLLLL’. (b) Some pre-miRNAs in plants, such as gma-miR166b, Most of the pre-miRNAs in the same families are similar. Therefore, the have big bugles which are near the loops. The big bugles are removed and representative pre-miRNAs are selected from the 128 plant miRNA families the loop is replaced with ‘LLLLLLLL’. (c) The big unmatched part in the of miRBase 14 (including 1612 real pre-miRNAs), as the positive training left end of the stem is removed. samples. Since the remaining 431 pre-miRNAs do not belong to any of miRNA families, they are used as the positive training samples. The negative dataset consists of the 17 groups. Each group is composed of pseudo pre- In total, 115 features are obtained from each real/pseudo plant pre- miRNAs from the genome segments of A.thaliana and G.max (see Section miRNAs. They include redundant features that do not contribute to 3 for details). (i) The 115 features are extracted from the primary sequence classification. The algorithm based on graph is presented to eliminate the and secondary structure of pre-miRNAs and their stems. (ii) The redundant redundant features. In the end, the discriminative feature subset is selected features are eliminated and the informative feature subset is selected through to achieve the best classification accuracy. calculating the information gain of features and the similarity between any two features. (iii) The positive/negative training samples are selected according to their distribution in the high-dimensional sample space, the size 2.2 SVM of each family and the size of each negative sample group. 
(iv) An SVM-based classifier named PlantMiRNAPred is trained by using these samples to classify real pre-miRNAs and pseudo pre-miRNAs. The feature selection and sample selection modules are implemented in Java. The web service for classifying plant pre-miRNAs is developed in PHP on the Linux platform.

2.2 SVM

Due to the excellent generalization ability of SVM, we use it to classify real/pseudo plant pre-miRNAs with high-dimensional (115-dimensional) feature vectors. Given a training dataset S, each x_i ∈ S (i = 1, ..., N) is the feature vector of a real/pseudo pre-miRNA with corresponding label z_i (z_i = +1 or −1 for a real or pseudo pre-miRNA, respectively). SVM constructs a decision function (for the classification of an unknown sequence x with stem–loop structure):

    g(x) = sgn( Σ_{i=1}^{N} z_i α_i k(x, x_i) + w_0 )    (1)

where α_i is the coefficient to be learned (0 ≤ α_i ≤ C) and k is a kernel function. In our study, a radial basis function (RBF) kernel is used, where the parameter γ determines the similarity level of the features so that the classifier becomes optimal:

    k(x, x_i) = exp(−γ ||x − x_i||²)    (2)

The penalty parameter C and the RBF kernel parameter γ are tuned on the training dataset using the grid search strategy in libSVM (version 2.9).

2.4 Feature subset selection

Feature selection aims to select a group of informative features which retain most of the information of the original data and can distinguish the samples in the dataset. The feature selection method considers information gain and feature redundancy.

Information gain: since all the features of pre-miRNAs are discrete, the discrimination ability of a feature is measured by its information gain, based on Shannon entropy. Suppose a feature of the pre-miRNAs is x; the entropy of x is denoted as H(x). When the value of feature y is given, the conditional entropy is H(x|y). IG(c,x) is the information gain of feature x relative to the class attribute c (Quinlan, 1993). Since the classification of real or pseudo pre-miRNAs is a binary classification problem, c is assigned to 1 (real pre-miRNA) or −1 (pseudo pre-miRNA):

    IG(c,x) = H(c) − H(c|x) = Σ_{c,x} p(c,x) log[ p(c,x) / (p(c)p(x)) ]    (3)

Suppose that the complete feature set is X = {x_1, x_2, ..., x_115} and the class attribute is c. The values of the 115 features are obtained from each real pre-miRNA (1906 real pre-miRNAs in total) and c is set to 1. Also, the values of the 115 features are obtained from each pseudo hairpin (2122 pseudo hairpins in total) and c is set to −1. The information gain of feature x_i (1 ≤ i ≤ 115) is calculated and denoted as IG(c, x_i). Features with greater information gain are given higher preference. However, the 115 features still include redundant features, whose inclusion will not improve the classification accuracy. To identify the redundant features, feature similarity is used to measure the similarity between two features.

Feature similarity: let Sim(x,y) represent the similarity between features x and y; it is defined as

    Sim(x,y) = 2 · IG(x,y) / (H(x) + H(y))    (4)

where IG(x,y) denotes the information gain of y with respect to x. Sim(x,y) ranges from 0 to 1: Sim(x,y) = 0 means that x and y are completely irrelevant, and Sim(x,y) = 1 means that x and y are completely correlated. When Sim(x,y) is greater than a threshold ε, x or y is a redundant feature, and keeping both features does not improve the classification performance. In such a situation, the feature with the greater information gain is kept and the other one is dropped.

Fig. 4. Eliminating redundant features based on the graph G.

In order to effectively eliminate the redundant features when multiple features are correlated, a redundant feature graph G is constructed and a graph-based algorithm is proposed. Suppose that the redundant feature graph is G = (V, E). Each node v_i (v_i ∈ V) represents the feature x_i, and the weight of node v_i is IG(c, x_i).
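A minimal sketch of Equations (1) and (2) in Section 2.2: the decision value is the signed, kernel-weighted sum over the support vectors. The two support vectors, the coefficients α_i and the bias w_0 below are hypothetical toy values; in the paper they result from libSVM training, with C and γ chosen by grid search.

```python
import math

def rbf_kernel(x, xi, gamma):
    """Equation (2): k(x, x_i) = exp(-gamma * ||x - x_i||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-gamma * sq_dist)

def decision(x, support, z, alpha, w0, gamma):
    """Equation (1): g(x) = sgn(sum_i z_i * alpha_i * k(x, x_i) + w_0)."""
    s = sum(zi * ai * rbf_kernel(x, xi, gamma)
            for xi, zi, ai in zip(support, z, alpha)) + w0
    return 1 if s >= 0 else -1

# Toy 2-D stand-in for the 115-dimensional feature vectors: one support
# vector per class, with equal hypothetical coefficients (0 <= alpha_i <= C).
support = [(0.0, 0.0), (2.0, 2.0)]
z = [1, -1]          # +1 = real pre-miRNA, -1 = pseudo pre-miRNA
alpha = [1.0, 1.0]
print(decision((0.2, 0.1), support, z, alpha, 0.0, gamma=1.0))  # 1
print(decision((1.9, 2.2), support, z, alpha, 0.0, gamma=1.0))  # -1
```

The grid search mentioned above simply trains and cross-validates over a grid of (C, γ) pairs and keeps the best-scoring pair.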
If the similarity between two nodes v_i and v_j is greater than the threshold ε (ε = 0.49), x_i or x_j is redundant, and a new edge is added to connect the two nodes. The weight of the edge between v_i and v_j is Sim(x_i, x_j). Here, ε is determined by the experiments and prior experience (Ng and Mishra, 2007).

The process of eliminating redundant features is illustrated by an example. As shown in Figure 4, a redundant feature graph G consists of eight features, x_1, x_2, ..., x_8. Suppose that groups of redundant features exist, where the features within a group are redundant to one another but independent of the remaining features. If we assume that there are three groups, then the graph G is composed of three subgraphs: SG_1, SG_2 and SG_3. The feature selection weight (FSW) of each feature is first calculated. Suppose that k edges are adjacent to v_i; the FSW of v_i is defined as the sum of the weights of the k edges connected with v_i plus the weight of v_i itself. The feature node with the largest FSW is selected, and the nodes adjacent to it are redundant features that should be deleted. In the subgraph SG_2, the FSW of x_3 = (0.71 + 0.51) + 0.42 = 1.64, and the FSWs of x_4, x_5, x_7 and x_8 are 1.48, 0.8, 1.52 and 1.15, respectively. Therefore, x_3 is selected, and the adjacent x_5 and x_7 are deleted. The bolded nodes in Figure 4 are the selected feature nodes. Next, the FSWs of the remaining nodes x_4 and x_8 in the current SG_2 are calculated again; since x_8 has the greater FSW, it is selected. In the same way, in SG_1, x_1 is selected and x_2 is deleted. In SG_3, x_6 is an independent node; as no other node can represent an independent node, x_6 is also selected. At last, the feature subset {x_1, x_3, x_6, x_8} is obtained, and the redundant features x_2, x_4, x_5 and x_7 are eliminated.

Fig. 5. Algorithm of eliminating redundant features.
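The elimination above can be sketched as a simplified greedy pass over the redundant feature graph: node weights are IG(c, x_i), edge weights are Sim(x_i, x_j) for pairs above ε, and FSW(v_i) is the node weight plus the weights of its incident edges. The numbers for x_3 (node weight 0.42, edge weights 0.71 and 0.51) come from the Figure 4 example; the node weights of x_5 and x_7 are hypothetical, since the text quotes only their FSWs.

```python
def fsw(node, weights, edges):
    """Feature selection weight: node weight + weights of incident edges."""
    return weights[node] + sum(w for (a, b), w in edges.items()
                               if node in (a, b))

def eliminate(weights, edges):
    """Greedy sketch: repeatedly keep the node with the largest FSW (ties
    broken by the larger node weight) and drop its redundant neighbours."""
    remaining = set(weights)
    live = dict(edges)
    selected = []
    while remaining:
        best = max(remaining, key=lambda n: (fsw(n, weights, live), weights[n]))
        selected.append(best)
        neighbours = {b if a == best else a
                      for (a, b) in live if best in (a, b)}
        remaining -= neighbours | {best}
        live = {e: w for e, w in live.items()
                if e[0] in remaining and e[1] in remaining}
    return sorted(selected)

weights = {"x3": 0.42, "x5": 0.30, "x7": 0.40}    # IG(c, x_i); x5, x7 hypothetical
edges = {("x3", "x5"): 0.71, ("x3", "x7"): 0.51}  # Sim(x_i, x_j) > 0.49
print(round(fsw("x3", weights, edges), 2))  # 1.64, as in the worked example
print(eliminate(weights, edges))            # ['x3']: x5 and x7 are dropped
```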
In addition, if two nodes (v_y and v_z) both have the maximum FSW in a subgraph, the node weights of v_y and v_z are compared and the node with the greater weight is selected. The algorithm for eliminating redundant features is described in Figure 5.

The proposed algorithm is applied to eliminate redundant features among the 115 features. We found that two pairs of attributes are strongly correlated: dQ versus dD, and dG versus NEFE. This observation is consistent with the result in miPred, which indirectly confirms the result of eliminating features. In addition, we found two new strongly correlated pairs of attributes: dH versus dS, and dH/L versus dS/L; dQ, dG, dH and dH/L are selected due to their higher selection weights.

Initially, the 115 features are categorized into three feature subsets: (i) the primary sequence-related feature subset S_1 = {%G+C, %XY | X,Y ∈ {A,C,G,U}} (17 features); (ii) the secondary structure-related feature subset S_2 = {"A(((", ..., "U...", "A(((_S", ..., "U..._S"} (64 features); (iii) the energy- and thermodynamics-related feature subset S_3 = {dP, dG, ..., zF, MFEI_5, MFEI_6, Avg_mis_num} (34 features). Supplementary Table S1 lists the names and the classification of the 115 features. After eliminating the redundant features, the corresponding reduced subsets are obtained from S_1, S_2 and S_3. For each subset, the remaining features are sorted by information gain in descending order, and the features with information gain greater than a threshold λ (λ_1 = 0.136, λ_2 = 0.083, λ_3 = 0.0159) are selected for the classification. λ is determined by the experiments and prior experience (Ng and Mishra, 2007). In the end, a total of 68 features are selected; they are listed in Section 3.

3 RESULTS AND DISCUSSION

3.1 Data collection

A classifier of pre-miRNAs should distinguish real plant pre-miRNAs from pseudo plant hairpins.
The positive dataset is composed of known plant pre-miRNAs, while the negative dataset is composed of both pseudo A.thaliana hairpins and pseudo G.max hairpins.

2.5 Training samples selection

The classification performance of SVM is highly dependent on the selection of the training dataset. First, we noted that real pre-miRNAs from the same species and the same miRNA families are highly similar to one another; these redundant positive samples should be removed from the training samples to avoid over-fitting. Secondly, current research has shown that training an SVM classifier with an imbalanced positive and negative dataset results in poor classification performance with respect to the minority class (Weiss, 2004). So we select an appropriate proportion of representative real/pseudo pre-miRNAs to construct the positive/negative training dataset.

The sample selection method (miSampleSelection) selects the positive/negative training samples according to the sample distribution in the positive families/negative groups. As shown in Figure 3, the real pre-miRNAs are selected based on the sample distribution in the 128 families. Since 68 informative features have been selected, each sample is denoted as a 68-dimensional feature vector. Suppose the feature vector of a sample is v and there are M (M = 128) families. The vector set of central points is C = {c_1, c_2, ..., c_M}, where c_i represents the feature vector of the central point of the i-th family. The sample selection process for the i-th family is as follows.

Positive dataset: there are 2043 known plant pre-miRNAs in the miRNA database miRBase 14 (http://www.mirbase.org/). Rfam (http://rfam.janelia.org/) grouped the available real pre-miRNAs into a set of families by means of multiple sequence alignments. A miRNA gene family is composed of the homologous pre-miRNAs from different species. One thousand six hundred and twelve pre-miRNAs belong to 128 miRNA families, and 431 pre-miRNAs do not belong to any miRNA family. The 2043 real pre-miRNAs are chosen as the positive sample dataset. They are from A.thaliana, O.sativa, Populus trichocarpa, P.patens, Medicago truncatula, Sorghum bicolor, Zea mays, G.max and 21 other plant species.
As shown in Figure 6, the top 22 families consist of 1066 plant pre-miRNAs. Supplementary Table S2 shows the pre-miRNA distribution and the species distribution over all 128 plant families. Each family contains at least two pre-miRNAs and covers at least one plant species. After eliminating the special sequences with complex secondary structures, 1906 real pre-miRNAs remain in the positive dataset.

Negative dataset: the complete genome sequence of A.thaliana was released in 2000 (AGI, 2000), and the sequences of the 20 chromosomes of the G.max genome were released in 2010 (Schmutz et al., 2010). Arabidopsis thaliana is a typical model plant and G.max is one of the most important crops, so the negative samples are extracted from these two species. Since almost all reported miRNAs are located in untranslated regions or intergenic regions, the pseudo hairpins are extracted from the protein coding sequences (CDSs) of A.thaliana and G.max. The CDSs of A.thaliana and G.max are downloaded from the plant database Phytozome 6 (http://www.phytozome.net/).

(1) Assume that the number of samples in the i-th family is N_i and that v_k is the feature vector corresponding to the k-th sample. The central point c_i is then calculated as

    c_i = (1/N_i) Σ_{k=1}^{N_i} v_k    (5)

(2) The distance between the k-th sample (real pre-miRNA) v_k and the central point c_i is denoted as d_{v_k c_i}, where v^t means the transpose of vector v:

    d_{v_k c_i} = 1 − (v_k^t · c_i) / (v_k^t · v_k + c_i^t · c_i − v_k^t · c_i)    (6)

The radius of the i-th family is r_i, where r_i = max(d_{v_k c_i}) (1 ≤ k ≤ N_i).

(3) Suppose that the selection rate of the sample space is 1/n.
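Equations (5) and (6) in code: the central point is the component-wise mean of a family's feature vectors, and the distance is a Tanimoto-style measure built from dot products. The 2-D family below is a hypothetical stand-in for the 68-dimensional vectors.

```python
def centroid(vectors):
    """Equation (5): c_i is the component-wise mean of the family vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def distance(v, c):
    """Equation (6): d = 1 - v.c / (v.v + c.c - v.c)."""
    vc = dot(v, c)
    return 1.0 - vc / (dot(v, v) + dot(c, c) - vc)

family = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical 3-sample family
c = centroid(family)
radius = max(distance(v, c) for v in family)     # r_i: max distance to centre
print([round(x, 4) for x in c], round(radius, 4))  # [0.6667, 0.6667] 0.4545
```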
That is, N_i/n samples in the i-th family are selected; the number of selected samples is denoted as P_i = N_i/n.

(4) With c_i as the center of a circle, draw two circles with radii 0·r_i and (1/P_i)·r_i, respectively. The region between these two circles is denoted as A_0. The degree of coverage for each sample s in A_0 is calculated and denoted as C(s); C(s) represents the number of samples in A_0 whose nearest neighbor sample is s. The sample s with the greatest C(s) value is selected as a training sample.

(5) We set (1/P_i)·r_i as the step length and compute the degree of sample coverage in the region A_k between the two circles with radii (1/P_i)·k·r_i and (1/P_i)·(k+1)·r_i (1 ≤ k ≤ P_i − 1), respectively. The sample in A_k with the largest degree of coverage is selected.

The positive training dataset is composed of the samples selected from the 128 families. The 431 pre-miRNAs that do not belong to any miRNA family are also added into the positive training dataset. The process of selecting the negative training samples is similar. The negative samples are composed of pseudo hairpins grouped by length. There are 17 groups in total, where the 60nt_Group refers to pseudo hairpins of length 60–69 nt, the 220nt_Group refers to pseudo hairpins of length 220–229 nt, etc. The 17 groups of the negative dataset correspond to the families of the positive dataset.

Fig. 6. The 128 families are ranked by the number of species that a miRNA gene family covers. The distribution of the top 22 miRNA families (containing 1066 pre-miRNAs) is shown. The names of the miRNA families are listed on the x-axis; the y-axis represents the number of species/miRNAs.

The ratios of pre-miRNAs of different lengths among the 2043 real pre-miRNAs are shown in Figure 7. It is found that the lengths of most known plant pre-miRNAs range from 60 to 220 nt. Thus, a sliding window of width ranging from 60 to 220 nt is used to scan the CDSs to produce sequence segments. The secondary structures of the sequence segments are predicted by RNAfold from the Vienna package (Hofacker et al., 1994).
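Steps (4) and (5) amount to splitting a family into P_i concentric rings of equal width around its central point and keeping, in each ring, the sample with the highest degree of coverage C(s). A simplified sketch with Euclidean distance in place of Equation (6) and hypothetical 2-D samples:

```python
def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def coverage(s, ring, dist):
    """C(s): number of other ring members whose nearest neighbour is s."""
    hits = 0
    for t in ring:
        if t is s:
            continue
        others = [u for u in ring if u is not t]
        if min(others, key=lambda u: dist(t, u)) is s:
            hits += 1
    return hits

def select_representatives(samples, centre, dist, p):
    """Steps (4)-(5): p rings of width r/p around the centre; keep the
    highest-coverage sample of each non-empty ring."""
    radii = {id(s): dist(s, centre) for s in samples}
    r = max(radii.values())
    step = r / p
    chosen = []
    for k in range(p):
        ring = [s for s in samples
                if k * step <= radii[id(s)] < (k + 1) * step
                or (k == p - 1 and radii[id(s)] == r)]
        if ring:
            chosen.append(max(ring, key=lambda s: coverage(s, ring, dist)))
    return chosen

# Hypothetical family: a tight cluster near the centre plus a far pair.
samples = [[0.2, 0.0], [0.21, 0.01], [0.19, -0.01], [0.8, 0.0], [0.79, 0.02]]
print(select_representatives(samples, [0.0, 0.0], euclid, p=2))
# [[0.2, 0.0], [0.8, 0.0]]: one representative per ring
```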
The negative training samples are selected in the same way as the positive samples.

Fig. 7. Ratio of plant pre-miRNAs of different lengths. The x-axis represents the length of the real pre-miRNAs; the y-axis represents the proportion of pre-miRNAs of length 60–69 nt, 70–79 nt, ..., 220–229 nt among all real pre-miRNAs.

The sequence segments should be folded into stem–loop structures. Further, they should satisfy three criteria on the number of base pairs in the hairpin, %G+C and MFEI. The criteria are determined by observing real plant pre-miRNAs of length 60–69 nt, 70–79 nt, etc., up to 220–229 nt. For instance, the criteria for selecting pseudo pre-miRNAs of length 60–69 nt are: a minimum of 19 base pairings in the hairpin structure, %G+C > 0.242 and < 0.825, and MFEI > 0.522 and < 1.39. The length of the sliding window changes from 60 to 69 randomly. Supplementary Table S3 lists all the criteria for the different lengths. Therefore, the extracted pseudo pre-miRNAs are similar to the real pre-miRNAs.

The negative samples (pseudo pre-miRNAs) are collected according to the proportion of real pre-miRNAs of different lengths. For example, suppose the ratio of real pre-miRNAs of lengths 70–79, 80–89, 90–99 and 100–109 nt is 0.02:0.08:0.12:0.20; then the negative samples of the different lengths are added into the negative dataset in the corresponding proportions. In total, 2122 pseudo pre-miRNAs are collected as the negative dataset.

Positive and negative training dataset: 980 real pre-miRNAs and 980 pseudo pre-miRNAs are selected by the sample selection algorithm. The final training dataset includes a total of 1960 samples and is denoted as the 1960 training dataset.

Positive and negative testing dataset: A.thaliana, O.sativa, P.trichocarpa, P.patens and M.truncatula are typical model plants, and S.bicolor, Z.mays and G.max are important crops. To date, relatively more miRNAs have been identified from these eight species, so eight groups of testing datasets are created to evaluate our classifier. The first group is composed of all 180 A.thaliana (ath) pre-miRNAs and is referred to as the ath dataset. The 397 O.sativa (osa) pre-miRNAs, 233 P.trichocarpa (ptc) pre-miRNAs, 211 P.patens (ppt) pre-miRNAs, 106 M.truncatula (mtr) pre-miRNAs, 131 S.bicolor (sbi) pre-miRNAs, 97 Z.mays (zma) pre-miRNAs and 83 G.max (gma) pre-miRNAs are used to construct the osa, ptc, ppt, mtr, sbi, zma and gma datasets, respectively. The 1142 pseudo pre-miRNAs remaining from the 2122 pseudo pre-miRNAs (after excluding the 980 used for training) are used as the 1142 negative testing dataset. The 191 A.lyrata pre-miRNAs (updated aly dataset) and 118 G.max pre-miRNAs (updated gma dataset) were newly reported in miRBase 15–16 when this work was nearly completed.

3.2 Evaluation method

The informative feature subset and the training samples were used to construct the classifier PlantMiRNAPred. The prediction result of PlantMiRNAPred can be one of the following four outcomes: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). The sensitivity (SE), specificity (SP), geometric mean (Gm) and total prediction accuracy (Acc) for the assessment of the prediction system are as follows:

    SE = TP/(TP+FN),  SP = TN/(TN+FP),  Gm = √(SE × SP),
    Acc = (TP+TN)/(TP+FP+TN+FN)    (7)

where SE is the proportion of positive samples (real pre-miRNAs) correctly classified and SP is the proportion of negative samples (pseudo pre-miRNAs) correctly classified.

3.3 Feature subset selection

The 68 informative features selected by the feature selection process and their corresponding information gain are listed in Table 1.
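The negative-sample construction of Section 3.1 reduces to sliding a window over the CDSs and keeping only segments that fold into hairpins passing the per-length criteria. A sketch using the quoted 60-69 nt thresholds; the folding itself (done by RNAfold in the paper) is assumed to happen elsewhere, so the filter takes precomputed properties.

```python
def accept_60_69nt(n_base_pairs, gc_content, mfei):
    """Quoted criteria for 60-69 nt windows: at least 19 base pairings in
    the hairpin, 0.242 < %G+C < 0.825 and 0.522 < MFEI < 1.39; other
    lengths use the thresholds of Supplementary Table S3."""
    return (n_base_pairs >= 19
            and 0.242 < gc_content < 0.825
            and 0.522 < mfei < 1.39)

def windows(cds, width):
    """Slide a window of the given width over a coding sequence."""
    for i in range(len(cds) - width + 1):
        yield cds[i:i + width]

print(accept_60_69nt(21, 0.45, 0.90))          # True
print(accept_60_69nt(15, 0.45, 0.90))          # False: too few base pairs
print(sum(1 for _ in windows("A" * 100, 60)))  # 41 windows in a 100 nt CDS
```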
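The four measures of Equation (7) in code form, with hypothetical confusion counts:

```python
import math

def metrics(tp, fp, tn, fn):
    """Equation (7): sensitivity, specificity, geometric mean and accuracy."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    gm = math.sqrt(se * sp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    return se, sp, gm, acc

se, sp, gm, acc = metrics(tp=90, fp=5, tn=95, fn=10)  # hypothetical counts
print(f"SE={se:.2f} SP={sp:.2f} Gm={gm:.3f} Acc={acc:.3f}")
# SE=0.90 SP=0.95 Gm=0.925 Acc=0.925
```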
They are ranked by their normalized information gain.

Table 1. Selected features ranked by their information gain

    Rank  AttrName         IG(c,attr)   Rank  AttrName         IG(c,attr)
    1     Tm               1.0          35    G.((_S           0.1365
    2     MFEI             0.6362       36    %GG              0.1358
    3     MFEI             0.4837       37    zF               0.1325
    4     MFEI             0.3959       38    G(.(             0.1302
    5     MFEI             0.3759       39    %G+C             0.1223
    6     dG               0.3598       40    G((.             0.1218
    7     A(((             0.2964       41    G((._S           0.1148
    8     A(((_S           0.2804       42    U...             0.1106
    9     %(G-C)/stems     0.2653       43    G(.(_S           0.1061
    10    C(((             0.2411       44    dP               0.1032
    11    G(((             0.2274       45    A...             0.1014
    12    U(((             0.2269       46    G(..             0.1009
    13    U(((_S           0.2187       47    G..(             0.0989
    14    C(((_S           0.2154       48    C(..             0.0982
    15    G(((_S           0.2096       49    dH               0.0979
    16    C((.             0.1979       50    zP               0.0971
    17    MFEI             0.1948       51    Avg_mis_num      0.0967
    18    C.((             0.1922       52    U..._S           0.0860
    19    dH/L             0.1822       53    C..(             0.0855
    20    C(.(             0.1822       54    G(.._S           0.0852
    21    %GC              0.1769       55    U..(             0.0832
    22    %UA              0.1767       56    MFEI             0.0774
    23    %AU              0.1742       57    dQ               0.0662
    24    %AA              0.1727       58    Avg_Bp_Stem      0.0662
    25    %CG              0.1678       59    Diff             0.0618
    26    C(.(_S           0.1663       60    Freq             0.0607
    27    Tm/L             0.1618       61    Diversity        0.0606
    28    zG               0.1592       62    |G-C|/L          0.0597
    29    G.((             0.1558       63    zD               0.0498
    30    C((._S           0.1556       64    dF               0.0376
    31    %UU              0.1554       65    |G-U|/L          0.0374
    32    %(A-U)/stems     0.1536       66    |A-U|/L          0.0305
    33    %CC              0.1528       67    %(G-U)/stems     0.0171
    34    C.((_S           0.1491       68    zQ               0.0159

It has been well studied that the stem–loop structures of plant pre-miRNAs are thermodynamically stable. Most of the selected features are related to the thermodynamic stability of the secondary structures, which indirectly confirms the effectiveness of the selected features. There are some features with the suffix _S and three new features (MFEI_5, MFEI_6, Avg_mis_num) in the selected feature subset. It shows the significance of extracting the new features for the stems. In addition, some features and the corresponding features obtained from the stems appear in pairs, such as A((( and A(((_S, U((( and U(((_S, etc. It indicates that there is an obvious difference between the two features in any such pair, and that both features are important for the classification of real/pseudo pre-miRNAs.

In order to validate the efficiency of the feature selection method, we tested the classification accuracies of the 68 features, 80 features (containing no features of the stems), 51 features (containing no structural features) and all 115 features, respectively. For each feature subset, 980 real pre-miRNAs and 980 pseudo pre-miRNAs were selected by the sample selection method to train an SVM classifier. These four SVM classifiers were tested by performing five-fold cross-validation: all pre-miRNAs in the training dataset were divided into five equal subsets, four of which were used for training the classifier, while the left-out subset was used for validation. We performed 10 repeated evaluations for each testing dataset and averaged the results.

Table 2. Classification results on different feature subsets

    Feature subset     SE (%)   SP (%)   Gm (%)   Acc (%)
    68 features        91.93    97.84    94.84    94.39
    80 features        89.08    92.82    90.93    90.63
    51 features        88.78    92.96    90.85    90.51
    All 115 features   90.31    94.54    92.40    92.06

The classification results are summarized in Table 2. The classification performances of the 80 features and the 51 features are worse than that of the 115 features, which indicates that the stem-related features and the structural features are absolutely necessary. Obviously, the classifier trained with the selected 68 features achieves the best classification performance. It shows the importance of feature selection during the construction of an efficient classifier.

3.4 Training sample selection result

The 980 positive samples and 980 negative samples with 68 features were selected by our sample selection method miSampleSelection to construct the classifier PlantMiRNAPred. Moreover, an equal number of real/pseudo pre-miRNAs were randomly selected from the positive/negative dataset, referred to as the 1960 random dataset. The performance of PlantMiRNAPred was compared with that of the classifier trained with the 1960 random dataset. As shown in Table 3, five-fold cross-validation was performed on each training dataset.

Table 3. Classification results with different sample selection methods

    Sample selection method   Dataset                 SE (%)   SP (%)   Gm (%)   Acc (%)
    miSampleSelection         1960 training dataset   91.93    97.84    94.84    94.39
    Random selection          1960 random dataset     89.69    93.25    91.45    91.17

The classifier trained with the 1960 training dataset achieves much higher sensitivity and specificity, which demonstrates that miSampleSelection is effective for improving the classification accuracy. In addition, the classifier trained with the 1960 random dataset still achieves excellent classification accuracy, which further confirms that the selected 68 features are sufficient to ensure the classification performance.

3.5 Comparison with triplet-SVM and microPred

Triplet-SVM is the only ab initio method that has been tested with the pre-miRNAs from A.thaliana and O.sativa. Therefore, we compared PlantMiRNAPred with triplet-SVM. The program of triplet-SVM was downloaded from Xue's web site (http://bioinfo.au.tsinghua.edu.cn/mirnasvm/). The eight testing datasets composed of known pre-miRNAs from the eight species were tested to evaluate the ability of identifying real pre-miRNAs.
The 1142 negative testing dataset was tested to evaluate the ability of identifying the pseudo hairpins. The updated datasets were also tested to observe the ability of discovering new plant pre-miRNAs. We performed evaluations for all the testing datasets and illustrate the results in Table 4.

Table 4. Classification results on different testing datasets

    Testing dataset                Type    Size   Accuracy (%)
                                                  PlantMiRNAPred   Triplet-SVM   microPred
    ath dataset                    Real    180    92.22            76.06         89.44
    osa dataset                    Real    397    94.21            75.54         90.43
    ptc dataset                    Real    233    91.85            75.21         84.98
    ppt dataset                    Real    211    92.42            71.49         89.57
    mtr dataset                    Real    106    100              80.18         95.28
    sbi dataset                    Real    131    98.47            69.51         94.66
    zma dataset                    Real    97     97.94            66.97         93.81
    gma dataset                    Real    83     98.31            74.12         86.75
    1142 negative testing dataset  Pseudo  1142   98.59            86.34         93.61
    updated aly dataset            Real    191    97.91            70.98         91.62
    updated gma dataset            Real    118    98.31            79.66         93.22

PlantMiRNAPred is nearly 18% better than triplet-SVM in overall accuracy: SE increased by 22.19% and SP increased by 12.25% on average. As many plant pre-miRNAs contain multiple loops, triplet-SVM cannot classify them correctly, whereas almost all the plant pre-miRNAs with multiple loops in the testing datasets are classified correctly by PlantMiRNAPred. This indicates that our method is sensitive enough to identify pre-miRNAs with multiple loops.

MicroPred is more similar to our approach, as it uses the same 48 features to classify pre-miRNAs; however, it was originally developed for human pre-miRNAs. The program of microPred can be downloaded from the web site (http://web.comlab.ox.ac.uk/people/manohara.rukshan.batuwita/microPred.htm). In order to compare with microPred, the classification model of microPred was adapted to the plant pre-miRNA datasets. As shown in Table 4, PlantMiRNAPred is nearly 5% better than microPred in overall accuracy: SE increased by 5.19% on average and SP increased by 4.98%. The improvement is mainly due to the additional 32 structural features extracted from the plant pre-miRNAs and the 35 features extracted from the stems.

Twenty-three of the 397 O.sativa-positive samples, 19 of the 233 P.patens-positive samples and two of the 131 S.bicolor-positive samples are classified as pseudo pre-miRNAs. However, in miRBase 16, 9 of the 23 O.sativa-positive samples, 8 of the 19 P.patens-positive samples and 2 of the 2 S.bicolor-positive samples were obtained by a computational identification method and are not verified by biological experiments. Despite this, the accuracies on the three testing datasets are 96.39, 95.11 and 100%, respectively.

Most of the newly reported pre-miRNAs in miRBase 15–16 were correctly predicted by PlantMiRNAPred, with an average accuracy of 98.11%. This shows that PlantMiRNAPred is powerful in discovering novel pre-miRNAs. In the two updated datasets, 109 of the 118 G.max pre-miRNAs and 74 of the 193 A.lyrata pre-miRNAs are lineage-specific pre-miRNAs. Therefore, PlantMiRNAPred is also shown to be able to achieve high performance in classifying lineage-specific pre-miRNAs.

In addition, 11918 inverted repeats were extracted from Gm08 (the eighth chromosome of the G.max genome) by EINVERTED (Rice et al., 2000). One thousand inverted repeats (including eight real pre-miRNAs) were selected according to the proportion of real pre-miRNAs of different lengths. Thirty-seven of the 1000 are classified by PlantMiRNAPred as putative real pre-miRNAs, covering the eight real pre-miRNAs; the FP rate of PlantMiRNAPred is 2.9%. MicroPred classified 89 inverted repeats as real pre-miRNAs, covering the eight real pre-miRNAs; the FP rate of microPred is 8.1%. Triplet-SVM classified 184 inverted repeats as real pre-miRNAs, covering six real pre-miRNAs; the FP rate of triplet-SVM is 17.8%. It indicates that PlantMiRNAPred is more sensitive to the inverted repeats from the genome.

4 CONCLUSION

A new ab initio classifier (PlantMiRNAPred) was developed for predicting plant pre-miRNAs. We demonstrated the importance of careful feature extraction, feature selection and training sample selection in achieving an efficient and effective classification result. In particular, according to the characteristics of plant pre-miRNAs, 115 features were extracted to distinguish the hairpins of real pre-miRNAs and pseudo pre-miRNAs. After eliminating redundant features, 68 informative features were selected. Each real/pseudo pre-miRNA was mapped into the 68-dimensional space, and 1960 representative positive/negative samples were selected as the training dataset.

PlantMiRNAPred has been compared with the existing pre-miRNA classification methods, triplet-SVM and microPred. The results demonstrated that PlantMiRNAPred has higher classification performance. Further analysis indicated that the improvement in classification accuracy is due to the informative features and the representative training samples. PlantMiRNAPred will be useful in generating effective hypotheses for subsequent biological testing.

ACKNOWLEDGEMENTS

We appreciate Yingpeng Han and Yongxin Liu from the Soybean Research Institute of Northeast Agricultural University for their valuable assistance.

Funding: Natural Science Foundation of China (60932008 and 60871092); Fundamental Research Funds for the Central Universities (HIT.ICRST.2010 022); Returned Scholar Foundation of Educational Department of Heilongjiang Province (1154hz26); National Science Foundation (Grant CCF-0546345 to Y.H.).

Conflict of Interest: none declared.

REFERENCES

Adai,A. et al. (2005) Computational prediction of miRNAs in Arabidopsis thaliana. Genome Res., 15, 78–91.
Ambros,V. et al. (2003) A uniform system for microRNA annotation. RNA, 9, 277–279.
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408, 796–815.
Bartel,D.P. (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116, 281–297.
Batuwita,R. and Palade,V. (2009) microPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics, 25, 989–995.
Berezikov,E. et al. (2006) Approaches to microRNA discovery. Nat. Genet., 38, 2–7.
Bonnet,E. et al. (2004) Detection of 91 potential conserved plant microRNAs in Arabidopsis thaliana and Oryza sativa identifies important target genes. PNAS, 101, 11511–11516.
Chang,D.T. et al. (2008) Using a kernel density estimation based classifier to predict species-specific microRNA precursors. BMC Bioinformatics, 9(Suppl. 12), 2–12.
Chatterjee,S. and Grosshans,H. (2009) Active turnover modulates mature microRNA activity in Caenorhabditis elegans. Nature, 461, 546–549.
Dezulian,T. et al. (2006) Identification of plant microRNA homologs. Bioinformatics, 22, 359–360.
Freyhult,E. et al. (2005) A comparison of RNA folding measures. BMC Bioinformatics, 6, 241–248.
Gan,H.H. et al. (2004) RAG: RNA-as-graphs database—concepts, analysis, and features. Bioinformatics, 20, 1285–1291.
Gardner,P.P. et al. (2009) Rfam: updates to the RNA families database. Nucleic Acids Res., 37(Suppl. 1), 136–140.
Hofacker,I.L. et al. (1994) Fast folding and comparison of RNA secondary structures. Monatsh. Chem., 125, 167–188.
Jiang,P. et al. (2007) MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res., 35(Suppl. 2), 339–344.
Jones-Rhoades,M.W. and Bartel,D.P. (2004) Computational identification of plant microRNAs and their targets, including a stress-induced miRNA. Mol. Cell, 14, 787–799.
Lindow,M. and Krogh,A. (2005) Computational evidence for hundreds of non-conserved plant microRNAs. BMC Genomics, 6, 119–127.
Lu,Y.Z. and Yang,X.Y. (2010) Computational identification of novel microRNAs and their targets in Vigna unguiculata. Comp. Funct. Genomics, 10, 128297–128313.
Moulton,V. et al. (2000) Metrics on RNA secondary structures. J. Comput. Biol., 7, 277–292.
Nam,J. et al. (2005) Human microRNA prediction through a probabilistic co-learning model of sequence and structure. Nucleic Acids Res., 33, 3570–3581.
Ng,K.L.S. and Mishra,S.K. (2007) De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics, 23, 1321–1330.
Quinlan,J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco.
Rice,P. et al. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet., 16, 276–277.
Schmutz,J. et al. (2010) Genome sequence of the palaeopolyploid soybean. Nature, 463, 178–183.
Schultes,E.A. et al. (1999) Estimating the contributions of selection and self-organization in RNA secondary structure. J. Mol. Evol., 49, 76–83.
Seffens,W. and Digby,D. (1999) mRNAs have greater negative folding free energies than shuffled or codon choice randomized sequences. Nucleic Acids Res., 1578–1584.
Sewer,A. et al. (2005) Identification of clustered microRNAs using an ab initio prediction method. BMC Bioinformatics, 6, 267–281.
Smalheiser,N.R. and Torvik,V.I. (2005) Mammalian microRNAs derived from genomic repeats. Trends Genet., 21, 322–326.
Weiss,G. (2004) Mining with rarity: a unifying framework. SIGKDD Explor., 6, 7–19.
Xue,C.H. et al. (2005) Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics, 6, 310–316.
Yousef,M. et al. (2006) Combining multi-species genomic data for microRNA identification using a naïve Bayes classifier. Bioinformatics, 22, 1325–1334.
Zhang,B.H. et al. (2006a) Evidence that miRNAs are different from other RNAs. Cell Mol. Life Sci., 63, 246–254.
Zhang,B.H. et al. (2006b) Plant microRNA: a small regulatory molecule with big impact. Dev. Biol., 289, 3–16.
