K      2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics

Jie Lin; Donald A Adjeroh; Bing-Hua Jiang; Yue Jiang

doi:10.1093/bioinformatics/btx809

K 2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics

Lin, Jie; Adjeroh, Donald A; Jiang, Bing-Hua; Jiang, Yue 2017-12-15 00:00:00 Motivation: Alignment-free sequence comparison methods can compute the pairwise similarity between a huge number of sequences much faster than sequence-alignment based methods. Results: We propose a new non-parametric alignment-free sequence comparison method, called K , based on the Kendall statistics. Comparing to the other state-of-the-art alignment-free comparison methods, K demonstrates competitive performance in generating the phylogenetic tree, in evaluat- ing functionally related regulatory sequences, and in computing the edit distance (similarity/dissimi- larity) between sequences. Furthermore, the K approach is much faster than the other methods. An improved method, K , is also proposed, which is able to determine the appropriate algorithmic par- ameter (length) automatically, without ﬁrst considering different values. Comparative analysis with the state-of-the-art alignment-free sequence similarity methods demonstrates the superiority of the proposed approaches, especially with increasing sequence length, or increasing dataset sizes. Availability and implementation: The K and K approaches are implemented in the R language as a package and is freely available for open access (http://community.wvu.edu/daadjeroh/projects/ K2/K2_1.0.tar.gz). Contact: yueljiang@163.com Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction include compression and efficient storage of the rapidly expanding genomic datasets (Beal et al., 2016a, b; Deorowicz and Grabowski, Evaluating the similarity between two sequences is a classical prob- 2013; Giancarlo et al., 2012), and resequencing a set of strings given lem that has long been studied in computer science, primarily from a target string (Kuo et al., 2015), an important step in efficient gen- the view point of string pattern matching (Adjeroh et al., 2008; ome assembly. Gusfield, 1997). Such similarity measurement has applications in Alignment-free sequence comparison methods can compute the various areas in computational biology, e.g. sequence alignment similarity between a large number of sequences much faster than (Smith and Waterman, 1981), in comparative genomics (Aach et al., alignment-based methods (Vinga and Almeida, 2003; Vinga, 2014). 2001), genomic evolution and phylogenetic tree construction and Word analysis of k-length substrings (also called k-mers, k-grams, analysis (Cao et al., 1998; Reyes et al., 2000), analysis of regulatory or k-tuple) from sequences is one approach to improved sequence functions (Kantorovitz et al., 2007), rapid search in huge biological comparision (Bonham-Carter et al., 2014). Words can be extracted sequences (Wandelt and Leser, 2013). Other recent applications V The Author(s) 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com 1682 * K and K 1683 2 2 in different ways, and with varying lengths. The most common is to (typically, with k 7). This places the proposed K statistic among use sliding windows from length 2 to n – 1, where n is the length of the best non-alignment based similarity measures, especially with sequence (Bauer et al., 2008; Dai et al., 2011; Liu et al., 2006; Qi increasing sequence lengths (n, m), or increasing size of the k-mer. et al., 2004). Some methods divide a sequence into several even parts Based on K , we further propose an improved method, named K , (Zhao et al., 2011), while some others have used fixed length sub- which is able to determine a suitable value for k, the k-gram param- strings, e.g. k ¼ 2 (2-mer) (Shi and Huang, 2012). After extracting eter automatically with competitive performance. We have imple- the words, different statistical methods can be applied to analyze mented K and K in the R statistical and graphics environment, and two sequences for similarity (Li and Wang, 2005; Wang and Zheng, the codes are freely available for open access. 2008). DMk (Wei et al., 2012) and Category-Position-Frequency (CPF)(Bao et al., 2014) incorporate positions and frequencies of k-mers into feature vectors. DV (Zhao et al., 2011) utilizes distribu- 2 Materials and methods tion vectors from k-mers. Shi (Shi and Huang, 2012) maps a DNA 2.1 Kendall statistic primary sequence into three symbolic sequences and groups these se- The Kendall statistic is a nonparametric method which makes no as- quences into a twelve-component vector. Wavelet Feature Vector sumption about the probability distribution of the variables being (WFV) converts a sequence into a L-length feature vector by wavelet assessed. The Kendall statistic estimates the correlation between two transform (Bao and Yuan, 2015). sets of random variables X and Y, represented using the pairs Our approach is more closely related to the D statistic, another ðÞ X ; YðÞ X ; Y .. .ðÞ X ; Y . The Kendall correlation, s, is then 1 1 2 2 n n popular approach for measuring the similarity (or dissimilarity) be- defined as follows (Kendall, 1938). tween two sequences (Bonham-Carter et al., 2014; Song et al.,2014). It was first proposed by Blaisdell (1986). Since then, many variants sðÞ X; Y ¼ Pf X X Y Y > 0g Pf X X Y Y < 0g j i j i j i j i and improvements have been proposed, such as D (Kantorovitz 2 (1) sh z et al.,2007), D (Reinert et al., 2009)and D (Wan et al., 2010). D 2 2 2 In this study, we compute the Kendall correlation by using the fol- (Kantorovitz et al.,2007)normalizesthe D statistic using its mean lowing formula to approximate s (Kendall, 1938; Marden et al., and standard deviation to improve its detection power (Song et al., sh 1992): 2014). D and D are two other normalization improvement meth- 2 2 ods which were proposed in Reinert et al. (2009) and Wan et al. n n c d b s ¼ (2) sh S nðÞ n1 (2010). D [also denoted D in the literature (Reinert et al.,2009; 2 2 Song et al.,2014)] uses an approach based on Shepp (1964). sh According to a recent review (Song et al., 2014), D and its variant where n is the number of distinct k-grams for the concatenated se- are generally the best D statistical methods for alignment-free com- quence S¼T$P$, n is the number of concordant k-gram pairs 2 c parison of genomic sequences, especially with increasing sequence X X Y Y > 0, with 0 < i < j n; and n is the number j i j i d length. More detailed discussion of the D -statistic family of algo- of discordant k-gram pairs X X Y Y < 0, with 2 j i j i rithms can be founded in Section 6 of the Supplementary Material. 0 < i < j n. In general, the D -statistic family of algorithms have a general problem of requiring a quadratic or cubic time complexity, with re- 2.2 Optimized computation of Kendall statistics spect to n or m, the length of the sequences, and k, the size of the sub- The time cost to compute b s, the approximation to the Kendall cor- strings being considered. Also, the D family of statistics generally relation statistic is On , including time to compare each pair be- makes some assumptions on the distribution of the sequences, for in- tween (X ; X ) and (Y ; Y ), i 6¼ j, where n is the number of pairs in i j i j stance, most assumed either a uniform distribution, or a normal distri- X and Y. Christensen (2005) showed an algorithm to calculate b s in bution, for the symbols in the sequences. This parametric nature of the OnðÞ log n time complexity. It was implemented in Pascal. Lin et al. statistics obviously limits their practical applicability, since practical (2017) recently introduced an algorithm for the related problem of data, especially for biological sequences (e.g. complete genomes for in- weighted Kendall correlation. In this work, we propose data dividuals of the same species, or for related organisms) rarely follow structures and a new algorithm to compute b s. Our algorithm also these theoretical distributions. A non-parametric approach to the meas- runs in OnðÞ log n time, but uses a different approach to compute urement of sequence similarity is required, one that does not make any the Kendall statistics. We then apply the algorithm to analyze simi- assumption on the distribution of the sequences under consideration, larity between a given pair of sequences. More detailed discussion and one that is efficient enough to handle the rapidly increasing com- on the improved algorithm for the Kendall Statistics can be found in plexity and data sizes of available biological sequence data. Section 3.1 of the Supplementary Material. In this work, we propose a nonparametric approach, K , which uses the Kendall correlation statistic to estimate the similarity be- tween sequences. The Kendall correlation is a non-parametric 2.3 The K approach method to calculate the correlation between two sets of random Here, we propose the K statistic as a new method for rapid and ef- variables. We adopt this to measure the similarity among sequences. ficient measurement of biological sequence similarity, without When compared to the other state-of-the-art alignment-free se- requiring an initial sequence alignment step. The K statistic makes sh z quence similarity methods, (e.g. D , D ; D ; D , DMk, DV, CPF, use of the above optimized method for computing the Kendall’s s 2 2 2 Shi and WFV), K demonstrates an improved power in detecting re- correlation between two sequences. Here, the correlation is com- latedness between sequences, as measured by its ability to generate puted based on the k-mer count statistics ðX and Y Þ between the w w the correct phylogenetic tree, and to identify functionally related two sequences. The counts are obtained in OðÞ jSj time using the regulatory sequences. The K also showed significant correlation suffix array data structure (Adjeroh et al., 2008; Gusfield, 1997; with the edit distance, the standard, though time consuming, meas- Manber and Myers, 1993), where jSj is the length ofinputse- ure of (similarity/dissimilarity) between sequences. Further, the K quence S ¼ T$P$. We describe the steps of the algorithm in the approach is faster than most of the other methods when k is large, following. 1684 J.Lin et al. 1. Given two sequences T and P, combine them into one sequence, We can observe the connection between the above relation for k and S ¼ T$P$, after appending an ‘$’ at the end of each sequence. the longest common prefix (LCP) between suffixes in S. For an arbi- The concatenated sequence S is of length jSj. trary sequence Q with symbols from the alphabet R, it is known 2. Build the sufﬁx array (SA) from the combined sequence S ¼ T$P$. that, on average, the length of the longest common prefix between And for a given parameter k,readall k-grams from SA. suffixes in Q is in O log ðÞ jQj . See Karlin et al. (1983) and jRj 3. Compute the frequency for each k-gram using the SA. Here, we Le ´ onard et al. (2012). Thus, for an arbitrary sequence, our suggested use X , and Y to denote the frequency of the k-gram w in se- value for k is essentially in the same order as this expected maximal w w quences T and P, respectively. Notice that, both X and Y will LCP value. This makes sense, in that, the maximal length k-gram w w be found at essentially the same time, using the SA of the con- should be close to the expected maximal LCP length, since if we catenated sequence, S. have k values much larger than the average maximal LCP length, we 4. Order all the (X ; Y ) frequencies of k-gram pairs by grouping may not be able to observe some repeated k-grams. On the other w w them according to Y , and then X . We get pairs hand, if we use k values much smaller than the average maximal w w {ðÞ X ; Y ;ðÞ X ; Y ; ...ðÞ X ; Y ; ...ðÞ X ; Y }, where n is the num- LCP length, we will be double-counting some smaller repeated sub- 1 1 2 2 i i n n ber of distinct k-grams from the concatenated sequence strings. Thus, operating with k values far from the expected max- S ¼ T$P$, and (X , Y ) is the frequency pair of ith ranked imal LCP length could lead to either underestimating or i i k-gram from sequences T and P. Thus, (1) Y Y and i < n overestimating the frequency for the k-grams that capture the major i iþ1 and (2) X X when Y ¼ Y and i < n. variations in the sequence. i iþ1 i iþ1 5. Compute n , the number of concordant pairs, and n the number c d of discordant pairs, for the ranked frequency pairs from se- 2.5 Comparative complexity analysis quences T and P. The number of concordant pairs n is the sum ðÞ The proposed K algorithm runs in O jSj log jSj time, which is a of the number pairs in one of these two conditions: (1) x < x i j significant improvement in complexity, when compared with the and y < y ; (2) x > x and y > y , where 0 i < j < n. k i j i j i j OkjSjjRj required for computing D and other related statistics, Similarly, the number of disconcordant pairs n , is the sum of or even with the observed improvement that reduces the time the number of pairs in one of the following two conditions: (1) 2 to OkjSj . K requires just a one-time run of K , using the auto- x < x and y > y ; (2) x > x and y < y , where 0 i < j < n. i j i j i j i j matically computed k-parameter. This will be practically faster than 6. Calculate the Kendall correlation using the formula: using K , however, the time complexity of K still remains the same n n c d OðÞ jSj log jSj as in K . More detailed discussion can be found in b s ¼ : nðÞ n1 Section 3.2 of the Supplementary Material. 7. Return b s which is the K similarity between sequences T and P. 2.6 Experimental design To test the proposed methods, we performed some experiments The last three steps are based on the optimized Kendall algorithm using three different datasets. We also compared our experimental introduced previously (Section 2.2). results with those from state-of-the-art alignment-free sequence 2.4 K : improved K with automated k value similarity measurement algorithms. Similar to the alignment-free methods from the D family, the pro- posed K approach depends critically on the length parameter, k. 2.6.1 Datasets and environment Here, we propose a method to determine the k parameter automatic- We use three sets of biological sequence data for the experiments in ally, without needing to test with all possible values. this study. The first dataset used is the complete mtDNA sequences Given the alphabet jRj and the length parameter k, there are at k from Cao et al. (1998)and Reyes et al. (2000) containing data on 12 most jRj possible k-grams, independent of the sequence lengths n proteins encoded in the H strand of mtDNA in 20 eutherian species. and m. These are the unique k-grams, given the alphabet. Given the The sequence lengths ranged from 16 300 to 17 080 symbols. This concatenated sequence S ¼ T$P$ with length of jSj, the k-grams are dataset is often used to evaluate the similarity of different species, espe- simply k-length substrings of S. Thus, we can have at most jSj k cially using phylogenetic trees. We call this the ‘mtDNA20’ dataset. þ1 number of k-grams from S. These may not be unique, since they The second dataset is 23 whole mitochondrial DNA genomes may include repeated k-grams, depending on the nature of the se- from different Eukaryotic fish species of the suborder Labroidei, taken quences T and P. At the same time, we need the k-grams to capture from Fischer et al. (2013). We could not locate the sequences for two most of the variations in the input sequences (now contained in S), of the species, namely, P.trewavasae and T.moorii. Thus, though the while avoiding k-grams that are repeated inside other k-grams. That original work in Fischer et al. (2013) used 25 species, our dataset con- is, we want the maximal length k-grams that capture the variations tained only 23 of the 25 species. The sequence lengths ranged from in S, without missing out on the smaller k-grams, especially those 16 440 to 17 040 symbols. We call this dataset the ‘Fish23’ dataset. that did not occur inside the longer k-grams. These shorter k-grams The third dataset used is the set containing cis-regulatory modules are likely to be more numerous, and can also provide important in- (CRMs) used by Kantorovitz et al. (2007) in their work on identifica- formation about the sequences. To satisfy the above competing con- tion of functional relationships between cis-regulatory sequences. ditions, the choice of k should meet the following criterion: There are seven sets including 185 CRM sequences, taken from k k1 jRj jSj k þ 1 > jRj (3) Drosophila melanogaster and Homo sapiens. We call this the ‘CRM185’ dataset. This dataset is available for download at http:// where jSj¼ m þ n þ 2 is the length of the concatenated sequences S. veda.cs.uiuc.edu/d2z/publicdata.tar.gz. Following the above, the value of k can be approximated as: The experiments were performed in a PC environment, running Intel i5, 4 cores, with 16 GB RAM and 1 TB HD. K and K were k ¼d log ðÞ jSje (4) 2 jRj * K and K 1685 2 2 written using the R Language. For comparison purposes, we also (2017)]. The Robinson-Foulds distance measures the topological tested several other state-of-the-art alignment-free methods using distance between two labeled trees essentially by counting the min- the same datasets. The algorithm for D was from Song et al. imum number of elementary operations needed to transform one sh (2014), D was from Wan et al. (2010), and D was from Reinert tree to the other. 2 2 et al. (2009). They all were implemented using the C language. The For the experiments on the mtDNA20 dataset, and we used the method D was developed in Perl in the original work of tree published by Cao et al. (1998) as the reference. See also Otu and Kantorovitz et al. (2007). We implemented the methods for DMk Sayood (2003). For phylogenetic analysis using the Fish23 dataset, we (Wei et al., 2012), CPF (Bao et al., 2014), DV (Zhao et al., 2011) used the tree published by Fischer et al. (2013) as the reference tree. and Shi (Shi and Huang, 2012) in R, according to descriptions pro- vided in the respective papers. The codes for WFV, developed in 3.1.1 mtDNA20 dataset Python in their original work (Bao and Yuan, 2015), were kindly Table 1 shows the Robinson-Foulds distance between each tree and provided by the authors. In our experiments, the parameter k corres- the reference tree. Each column contains distances of a given ponds to the length L ¼ 4 in their work. alignment-free method with parameter k varied from 2–9. The re- sults of three methods without parameter k are shown in the last 2.6.2 Experiment 1 row. The minimum distance in this table is 12. This minimum was The first experiment aimed at analyzing the general performance obtained with the K method, and it is also present in the column of each alignment-free method studied. The experiment sh for K with parameter k¼8, 9, and for D with parameter k¼7, 8. z 2 sh compared eleven alignment-free methods, namely, D ; D ; D ; D , 2 2 2 The remaining 8 methods are unable to achieve the minimum (best) DMk; DV; CPF; Shi and WFV and our two proposed methods, K distance. However, D and CPF are able to take the second place and K . The experiment was performed on mtDNA20 and Fish23 with minimum RF distance of 14. D and DMk can obtain the min- two datasets. imum RF distance of 16. The distances reported by the other meth- To evaluate the performance of the algorithms, we consider three ods, D ; DV; Shi and WFV were far from the minimum distance, performance measures: (i) the Robinson-Foulds (RF) distance hence, were ranked lower. On this dataset, the methods K ; K and (Robinson and Foulds, 1981) which measures the topological dis- sh D performed generally better than the others. However, the fact tance between the golden reference phylogenetic tree and the phylo- that K does not need to try all the possible k values from 2–9, gives genetic tree constructed using a given alignment-free method; (ii) the it an advantage over the others. correlation of the similarity/distance values as determined by the Figure 1 shows the reference phylogenetic tree from Cao et al. alignment-free method with the standard edit distance; (iii) the com- (1998), and the corresponding tree generated by the proposed K ap- putation time required. These performance measures need to be con- proach. Detailed figures for the other methods are presented in the sidered both individually and jointly in evaluating algorithms for Supplementary Material. To compare different methods, we show the sequence similarity measurement. phylogenetic trees constructed using each of the methods. Methods sh D ; D ; D , DMk; CPF and K depend on the input parameter k. 2 2 2 2 2.6.3 Experiment 2 For each of these methods, the Supplementary Figure S1 shows the The second experiment investigated how well the results from the corresponding phylogenetic tree that resulted in the minimum proposed alignment-free methods can capture the similarity between Robinson-Foulds distance with the reference tree. For the K method, sequences with similar functional roles. For this experiment, we the k value is automatically computed, so, only one tree is generated. used the related regulatory sequences in the CRM185 dataset, our The phylogenetic trees from D ; Shi; WFV and DV are not shown in third dataset. The ‘positive’ set is the set of CRMs that are in the the Supplementary Figure S1 because these trees are far away from same tissue and/or same developmental stage. The ‘negative’ set is the reference tree. See also the RF distances shown in Table 1. the set chosen from non-coding sequences, which are expected to be Looking at these figures, we can see that the trees are generally unrelated with respect to function. This experiment is designed to similar to the reference tree, though with some variations. We can predict whether or not any two given sequences are in the ‘positive’ set, using alignment-free methods. First, we compute the similarity Table 1. The Robinson-Foulds distance between the reference between pairwise sequences using alignment-free methods. Next, we phylogenetic tree and phylogenetic trees generated using different rank these pairs based on their similarity, and determine the number alignment-free statistical methods (with k ¼ 2; 3; ... ; 9) of positive pairs and return the accuracy ratio. sh z kD D D D K DMk CPF WFV 2 2 2 2 2 222 26 26 36 26 18 24 26 3 Results and discussion 324 26 28 34 22 20 22 24 422 20 22 26 22 16 18 24 3.1 Phylogenetic tree analysis 522 20 16 26 20 16 16 22 One way to evaluate the performance of the alignment-free methods 624 16 16 24 18 18 16 24 is to compare the phylogenetic trees generated using the distance 718 14 12 20 14 16 14 24 matrix against the known correct (reference) phylogenetic tree for 818 16 12 20 12 16 14 24 the species in the dataset. In this case, methods that generate trees 916 14 14 — 12 18 16 24 that have more similarity in structure with the reference tree will be K 12 DV 20 Shi 22 taken to be of better performance. To compare the similarity/dissimilarity between two trees, we Note: Results are based on the mtDNA20 dataset (Cao et al., 1998). K use the Robinson-Foulds(RF) distance (Robinson and Foulds, 1981). having automatically determined k values, DV and Shi without varied k par- The Robinson-Foulds distance (also called the symmetric difference ameter, they are all reported in the last row for brevity. D generated an error metric) is a well-known approach for measuring the similarity be- at k ¼ 9. The bold value 12 here indicates the minimal RF distance. The tween two trees. [See for example Bansal et al., (2010) and Lu et al., smaller the RF distance is, the better a method performs. 1686 J.Lin et al. Table 2. The Robinson-Foulds distance between the reference phylogenetic tree and phylogenetic trees generated using different alignment-free statistical methods (with k ¼ 2; 3; ... 9) sh z k D D D D K DMk CPF WFV 2 2 2 2 2 232 34 36 40 36 30 32 36 330 30 28 40 26 28 30 30 426 26 30 36 24 22 24 26 524 20 22 38 20 20 20 26 614 10 20 36 12 10 12 32 714 8 14 34 8 12 12 34 8 88 8 34 8 12 14 34 9 8 10 14 — 10 14 16 34 K 8 DV 32 Shi 34 Note: Results are based on the Fish23 dataset (Fischer et al., 2013). For brevity, the results for K (with automatically determined k value), and DV and Shi (both with ﬁxed k parameters), are reported in the last row. D gener- ated an error at k ¼ 9. The bold value 8 here indicates the minimal RF dis- tance. The smaller the RF distance is, the better a method performs. DV are worse than the others. Among these methods, only K is able to automatically determine an appropriate k value. From these results, we conclude that with respect to phylogenetic trees, the K is the best amongst all the tested alignment-free methods. The Supplementary Figure S2a shows the phylogenetic tree re- ported by Fischer et al. (2013) in their original paper using the Fish23 dataset. Similar to the ‘mtDNA20’ experiment, we show the phylogenetic trees generated by the alignment-free methods: sh D ; D ; D , DMk; CPF, K and K . The Supplementary Figure 2 2 2 2 2 S2(b–h) show the phylogenetic trees with the minimum Robinson- Foulds distance for each method. Supplementary Figure S2a is the reference tree. For our experiments, since we did not have the se- quences for P.trewavasae and T.moorii, the pairs N.brichardi, T.duboisi will become neighbors, with parent at node 16 in the ori- ginal reference tree. Fig. 1. Reference phylogenetic tree from Cao et al. (1998), and the corres- Fish Dataset demonstrates similar trends to the mtDNA20 data- ponding tree generated using the proposed K alignment-free sequence com- set, see more details in Supplementary Material, Section 4.3. parison method, using the mtDNA20 dataset observe that D and D placed horse and white rhinoceros close to 3.2 Correlation with the edit distance each other as expected, however, their parent nodes were wrongly 3.2.1 mtDNA20 dataset placed, making them much further from say cow than in the reference Table 3 shows the Pearson correlation coefficients between the simi- tree. Also, D wrongly placed wallaroo very close to mouse and rat, larity measurements from the different alignment-free methods and while D had cow much closer to rat and mouse than the reference the edit distance, using the mtDNA20 dataset. From the table, one sh tree. D provided a better result than D and D , but it also incor- 2 2 sh can observe that D achieved the best result 0:92 when k ¼ 6or rectly placed platypus much closer to rat and mouse. Methods K and k ¼ 7:K achieve the best result (q ¼0:95) when k ¼ 9:K can K seem to avoid these problems. One quick way to access the per- reach q ¼0:94 which is close to the best of K .In a word, the K 2 2 formance of the methods is to compare the minimum number of hops method can reach the best accuracy, and K is quite competitive. A needed to go from one given leaf node (representing a species) to an- key advantage of the K method is that it is able to select parameter k other leaf node on a given tree. The Supplementary Table S3 shows automatically and quickly. However, considering the K may need to the number of hops for two pairs of species. The Supplementary try all possible k values to determine the best k (9 in this case), the Figure S1 and Table S3 suggest that the proposed methods K and K slight performance disadvantage (q ¼ 0:94 versus q ¼ 0:95) of K be- work better than the other methods on the mtDNA20 dataset. comes even less significant, especially when data volume is huge. See more detailed analysis in Supplementary Material, Section 4.1.1. 3.1.2 Fish23 dataset Similar results were observed using the Fish23 dataset. These Table 2 shows the Robinson-Foulds distances for the Fish23 dataset. have been included in Section 4.1.2 of Supplementary Material. Each column shows distances of one alignment-free method with parameter k varied from 2 to 9. The last row shows the results for three methods that did not use varying k parameters. The minimum 3.3 Practical running time distance in this table is 8 (shown in boldface in the table). Methods We compare the running time of eleven methods, [9 earlier sh sh z K ; K ; D ; D and D are able to achieve the minimum distance. approaches (D ; D ; D ; D ; DMk; CPF; WFV; DV and Shi) and 2 2 2 2 2 2 2 2 2 As with the previous experiment on ‘mtDNA20’, D ; WFV; Shi and the two proposed methods (K and K )]. 2 2 * K and K 1687 2 2 Table 3. Pearson correlation coefﬁcient between the similarity/dis- tance measure from different alignment-free statistical methods sh and the edit distance sh z k D D D D K DMk CPF WFV 2 2 2 2 2 DMk CPF 2 0.45 0.51 0.55 0.02 0.56 0.67 0.62 0.57 WFV 3 0.48 0.60 0.74 0.10 0.73 0.68 0.66 0.62 4 0.53 0.71 0.86 0.74 0.82 0.70 0.71 0.63 5 0.61 0.79 0.91 0.81 0.89 0.78 0.77 0.72 6 0.77 0.87 0.92 0.83 0.92 0.84 0.86 0.68 7 0.87 0.91 0.92 0.84 0.92 0.87 0.89 0.68 8 0.90 0.92 0.91 0.84 0.93 0.85 0.89 0.66 9 0.91 0.91 0.91 — 0.95 0.85 0.87 0.67 K 0.94 DV 0.70 Shi 0.68 Note: Reports are for the mtDNA20 dataset. K having automatically deter- mined k values, DV and Shi without varied k parameter, they are all reported in the last row for brevity. D generated an error at k ¼ 9. The bold values indicate the biggest absolute value of Pearson correlation coefﬁcient for differ- ent k values. The bigger an absolute value, the better a method performs. sh z Fig. 2. Time cost comparison for D ; D ; D ; D ; DMk; CPF; WFV and K 2 2 2 2 2 with parameter k varying from 2 to 9, using the mtDNA20 dataset. Results for Table 4. Practical running time (in seconds) using alignment-free K ¼3.07 s, DV¼2.33 s and Shi¼1.37 s are not shown in the ﬁgure for clarity methods on the mtDNA20 dataset sh z with 2.33 and 1.37 s respectively. However, K generated a much kD D D D K DMk CPF WFV 2 2 2 2 2 lower RF distance—see Table 1. K is slower than the other methods sh 2 0.02 0.05 0.05 1.55 0.41 3.64 13.23 0.004 (i.e. D ; D ; D ) with k ¼ 2; 3; 4; 5; 6, and faster than the other 2 2 3 0.03 0.05 0.07 1.56 0.45 4.90 14.02 0.008 methods with k ¼ 7; 8; 9. We can also observe from the results dis- 4 0.08 0.11 0.15 1.61 0.57 5.91 15.63 0.020 cussed earlier that, for this dataset, the best performance for the 5 0.20 0.34 0.5 1.76 1.94 7.06 16.82 0.088 other methods were recorded at k 6. See Supplementary Figure S1 6 0.56 1.29 2 2.35 2.22 9.78 16.58 0.884 and Table 1. Clearly, since K does not need to search for the best k 7 1.26 4.91 7 5.38 3.17 18.09 16.43 7.768 value (i.e. it is executed for just one k value), it is overall faster than 8 2.40 18.18 25 19.19 3.63 40.13 16.82 38.3 the other methods, without degrading the accuracy. This is import- 9 4.58 70.28 99 — 4.34 92.10 17.05 347.1 ant, considering the increasingly huge volumes of data involved in K 3.07 DV 2.33 Shi 1.37 most applications of these techniques. In fact, the primary motiv- Note:Results for K with automatically determined k values, DV and Shi with 2 ation for the alignment-free methods is their rapid processing speed, ﬁxed k values, are reported in the last row. D generated an error at k ¼ 9. The bold when compared with alignment-based methods. values shown the smallest running time (the fastest method) for different k values. Results on running time using the Fish23 dataset is provided in the Supplementary Material. With respect to running time, we can identify two key points from 3.3.1 mtDNA20 dataset sh z our experiments: (i) the running time for D ; D , DMk and WFV in- 2 2 Table 4 shows a comparison of the running time for eleven methods creases rapidly with increasing k. The running time for K is approxi- using the first dataset (mtDNA20 dataset) from Cao et al. (1998). mately linear with respect to the sequence length. (ii) Comparing K Figure 2 plots the corresponding running times. The time for K is and K ; K is more practical, since it can determine the k value auto- 2 2 3.07 s, time for DV¼2.33 s and time for Shi¼1.37 s which are not matically, and has a competitive performance. plotted in the figure. When k ¼ 9; D generates a runtime error, thus, we could not obtain a result for this case. 3.4 Evaluation on functionally related regulatory First, consider the methods that use varied k values. When k < 6, the WFV approach is the fastest among all methods. When the par- sequences ameter k increases, the running time of WFV increases rapidly, much While the alignment-free methods could be generally fast, an im- quicker than all the others. When k ¼ 7; 8; 9; WFV requires approxi- portant consideration is whether they can identify similarities be- mately 2.45, 10.55 and 109.5-fold time increases, respectively, when tween sequences that are functionally related. Of course, this can compared with K . Therefore, in terms of running time, K is the bet- only be possible if the sequences share some similar patterns. To 2 2 ter choice than the other methods, with less running time and higher evaluate this aspect of performance, we consider to what extent the accuracy when k > 6. The WFV method with RF distances (26, 24 alignment-free similarity measures are able to capture the similar- and 22) shown in Table 1 did not perform well. ities between sequences from the same anatomic regions of the same sh Consider D and K , the two methods that achieved the best results species. For this experiment, we used the third dataset—CRM185 sh with RF distance¼ 12 in Table 1. D reaches its best performance dataset, the regulatory sequences from Kantorovitz et al. (2007).We z sh when k ¼ 7; 8:K reaches its best performance when k ¼ 8; 9. When k compare our proposed methods K and K against D ; D ; D and 2 2 2 2 2 2 sh ¼ 7; 8; 9; D requires approximately 2, 8 and 25 fold time increases, D , DMk; DV; CPF; Shi and WFV. Table 5 shows the results. In 2 2 respectively, when compared with K . Therefore, in terms of combining the table, the result for D is taken from the original work of sh z sh with running time and accuracy, K is the better choice than D . Kantorovitz et al. (2007). For D ; D ; D and D , the table shows 2 2 2 2 2 2 Now consider K ; DV and Shi which do not use varying k val- the best results with k values in the range 2 k 7. For K ues. K requires 3.07 s to execute. DV and Shi are relatively faster method, we also tested with 2 k 7. Running Time (s) 0 20406080 100 1688 J.Lin et al. Table 5. The performance of popular alignment-free sequence similarity methods in capturing functional relatedness z sh Species Dataset D K D D D K DMk CPF WFV DV Shi 2 2 2 2 2 2 Fly Blastoderm 0.73 0.92(4) 0.85(2) 0.82(2) 0.82(6) 0.79 0.83(3) 0.84(4) 0.79(5) 0.72 0.7 Fly PNS 0.62 0.60(5) 0.63(3) 0.64(4) 0.64(3) 0.56 0.62(4) 0.61(4) 0.63(3) 0.58 0.55 Fly Tracheal 0.75 0.75(4) 0.72(4) 0.69(4) 0.69(4) 0.75 0.73(3) 0.75(4) 0.70(5) 0.7 0.71 Fly Eye 0.58 0.69(3) 0.61(2) 0.63(3) 0.60(3) 0.69 0.63(5) 0.63(4) 0.64(3) 0.62 0.59 Human Muscle 0.83 0.88(4) 0.83(5) 0.83(5) 0.86(6) 0.81 0.84(3) 0.82(4) 0.81(5) 0.76 0.79 Human Liver 0.69 0.83(2) 0.88(2) 0.78(6) 0.73(4) 0.69 0.82(2) 0.84(5) 0.80(3) 0.8 0.76 Human HBB 0.64 0.65(3) 0.58(3) 0.53(2) 0.59(4) 0.66 0.57(2) 0.60(3) 0.61(4) 0.56 0.52 Note: Numbers in brackets indicate the k value that produced the best results for the given method. Results are based on the CRM185 dataset. The bold values shown the best methods on a data set without considering K * and with considering K *. 2 2 The bold items are the best results on the dataset comparing differ- References ent methods, while excluding K .From Table 5, K reported five best Aach,J. et al. (2001) Computational comparison of two draft sequences of the z sh results out of seven using the CRM185 dataset. D ; D ; D and D 2 2 2 human genome. Nature, 26, 5–14. reported one best result each. K demonstrates competitive perform- Adjeroh,D. et al. (2008) The Burrows-Wheeler Transform: Data ance with the other methods. When we take K into consideration, we 2 Compression, Sufﬁx Arrays, and Pattern Matching, 1st edn. Springer can observe that it gets three best results out of seven datasets. D and Publishing Company, Berlin, German. sh D get one best result of seven cases, D and D are best on two Bansal,M.S. et al. (2010) Robinson foulds supertrees. Algorithms Mol. Biol., 2 2 5, 1–12. cases, and K was best on four cases. In general, the proposed K and 2 2 Bao,J. et al. (2014) An improved alignment-free model for DNA sequence K methods provide the overall best performance on this problem. similarity metric. BMC Bioinformatics, 15, 1–15. Bao,J.P. and Yuan,R.Y. (2015) A wavelet-based feature vector model for DNA clustering. Genet. Mol. Res. GMR, 14, 19163. 4 Conclusion Bauer,M. et al. (2008) The average mutual information proﬁle as a genomic The problem of sequence similarity measurement is critical to sev- signature. BMC Bioinformatics, 9, 48. eral important applications in huge volume genomic sequence ana- Beal,R. et al. (2016a) Compressing genome resequencing data via the maximal lysis. We proposed a novel non-parametric algorithm K for longest factor. In: IEEE International Conference on Bioinformatics alignment-free measurement of relatedness between sequences, using and Biomedicine, BIBM 2016, Shenzhen, China, December 15–18, 2016, the statistics of k-grams in the sequences. K is a non-parametric ap- pp. 92–97. Beal,R. et al. (2016b) A new algorithm for the LCS problem with application proach based on the Kendall correlation statistic to estimate the dis- in compressing genome resequencing data. BMC Genomics, 17, 544. similarity(/similarity) of sequences. Blaisdell,B. (1986) A measure of the similarity of sets of sequences not requiring Compared with other state-of-the-art alignment-free comparison sequence alignment. Proc. Natl. Acad. Sci. USA, 83, 5155–5519. sh z methods (D ; D ; D ; D ; DMk; CPF; WFV; DV and Shi), K 2 2 2 2 2 Bonham-Carter,O. et al. (2014) Alignment-free genetic sequence comparisons: demonstrates comparable or better performance, in phylogenetic a review of recent approaches by word analysis. Brief. Bioinf., 15, 890–905. analysis, in generating (similarity/dissimilarity) measures that correl- Cao,Y. et al. (1998) Conﬂict among individual mitochondrial proteins in ate with the edit distance among a large number of sequences, and resolving the phylogeny of eutherian orders. J. Mol. Evol., 47, 307–322. in capturing functional relatedness between sequences. Further, the Christensen,D. (2005) Fast algorithms for the calculation of Kendall’s tau. K approach is faster than the other methods when k 7: 2 Comput. Stat., 20, 51–62. We also introduced K , an improved version of K that is able to Dai,Q. et al. (2011) Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison. J. Theor. automatically determine the suitable k value, thus eliminating the Biol., 276, 174–180. need to search for all possible k values (for the k-grams), potentially Deorowicz,S. and Grabowski,S. (2013) Data compression for sequencing from k ¼ 2to k ¼ n :K produced the best overall results, with re- data. Algorithms Mol. Biol., 8, 25. spect to both efficiency and accuracy. Along with K competitive per- Fischer,C. et al. (2013) Complete mitochondrial DNA sequences of the thread- formance in measuring the similarity between sequences, its speed ﬁn cichlid (Petrochromis trewavasae) and the blunthead cichlid (Tropheus makes it practical, an important consideration given the increasingly moorii) and patterns of mitochondrial genome evolution in cichlid ﬁshes. huge datasets in various applications of alignment-free methods. PLoS One, 8, e67048. Giancarlo,R. et al. (2012) Textual data compression in computational biology: Algorithmic techniques. Comput. Sci. Rev., 6, 1–25. Acknowledgement Gusﬁeld,D. (1997) Algorithms on Strings, Trees and Sequences: Computer A shorter version of this paper was presented at IEEE BIBM’2016, Lin et al. Science and Computational Biology. Cambridge University Press, (2016). Cambridge, England. Kantorovitz,M. et al. (2007) A statistical method for alignment-free compari- son of regulatory sequences. Bioinformatics, 23, i249–i255. Funding Karlin,S. et al. (1983) New approaches for computer analysis of nucleic acid sequences. Proc. Natl. Acad. Sci. USA, 80, 5660–5664. This work was supported by the Chinese National Natural Science Foundation Kendall,M.G. (1938) A new measure of rank correlation. Biometrika, 30, (Grant No. 61472082), Natural Science Foundation of Fujian Province of 81–93. China (Grant No. 2014J01220), the US National Science Foundation (Grant Kuo,C.-E. et al. (2015) Resequencing a set of strings based on a target string. No. IIS-1552860) and Scientiﬁc Research Innovation Team Construction Algorithmica, 72, 430–449. Program of Fujian Normal University (Grant No. IRTL1702). Li,C. and Wang,J. (2005) Relative entropy of DNA and its application. Phys. Conﬂict of Interest: none declared. A Stat. Mech. Appl., 347, 465–471. * K and K 1689 2 2 Lin,J. et al. (2016) K : Efﬁcient alignment-free sequence similarity measure- Shepp,L. (1964) Normal functions of normal random variables. SIAM Rev., 6, ment using the Kendall statistic. In: IEEE International Conference on 459–460. Bioinformatics and Biomedicine, pp. 1128–1132. Shi,L. and Huang,H. (2012) DNA sequences analysis based on classiﬁcations Lin,J. et al. (2017) fastwkendall: an efﬁcient algorithm for weighted Kendall of nucleotide bases. In: Luo J. (ed.) Affective Computing and Intelligent correlation. accepted by Comput. Stat. Interaction. Springer, Berlin, Heidelberg, pp. 379–384. Liu,L. et al. (2006) Clustering DNA sequences by feature vectors. Mol. Smith,T.F. and Waterman,M.S. (1981) Identiﬁcation of common molecular Phylogenet. Evol., 41, 64. subsequences. J. Mol. Biol., 147, 195–197. Le ´ onard,M. et al. (2012) On the number of elements to reorder when updating Song,K. et al. (2014) New developments of alignment-free sequence compari- a sufﬁx array. J. Discret. Algorithms, 11, 87–99. son: measures, statistics and next-generation sequencing. Brief. Bioinf., 15, Lu,B. et al. (2017) A program to compute the soft Robinson–Foulds distance 343–353. between phylogenetic networks. BMC Genomics, 18, 111. Vinga,S. (2014) Information theory applications for biological sequence ana- Manber,U. and Myers,G. (1993) Sufﬁx arrays: a new method for on-line string lysis. Brief. Bioinf., 15, 376–389. searches. SIAM J. Comput., 22, 935–938. Vinga,S. and Almeida,J. (2003) Alignment-free sequence comparison: a re- Marden,J.I. et al. (1992) Rank correlation methods (5th ed.). J. Am. Stat. view. Bioinformatics, 19, 513–523. Assoc., 87, 249. Wan,L. et al. (2010) Alignment-free sequence comparison (II): theoretical Otu,H.H. and Sayood,K. (2003) A new sequence distance measure for phylo- power of comparison statistics. J. Comput. Biol., 17, 1467–1490. genetic tree construction. Bioinformatics, 19, 2122–2130. Wandelt,S. and Leser,U. (2013) FRESCO: referential compression of highly Qi,J. et al. (2004) Whole proteome prokaryote phylogeny without sequence similar sequences. IEEE/ACM Trans. Comput. Biol. Bioinf., 10, alignment: a K-string composition approach. J. Mol. Evol., 58, 1–11. 1275–1288. Reinert,G. et al. (2009) Alignment-free sequence comparison (I): statistics and Wang,J. and Zheng,X. (2008) WSE, a new sequence distance measure based power. J. Comput. Biol., 16, 1615–1634. on word frequencies. Math. Biosci., 215, 78–83. Reyes,A. et al. (2000) Where do rodents ﬁt? Evidence from the complete mito- Wei,D. et al. (2012) A novel hierarchical clustering algorithm for gene se- chondrial genome of Sciurus vulgaris. Mol. Biol. Evol., 17, 979–983. quences. BMC Bioinformatics, 13, 1–15. Robinson,D.F. and Foulds,L.R. (1981) Comparison of phylogenetic trees. Zhao,B. et al. (2011) A new distribution vector and its application in genome Math. Biosci., 53, 131–147. clustering. Mol. Phylogenet. Evol., 59, 438–443. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/k-2-and-k2-efficient-alignment-free-sequence-similarity-measurement-A2YAggT3CB

Loading next page...

References (55)

Z. Govindarajulu (2012)
Rank Correlation Methods (5th ed.)
Technometrics
We can observe the connection between the above relation for the longest common prefix (LCP) between suffixes in S . For
Qi Dai, Xiaoqing Liu, Yuhua Yao, Fukun Zhao (2011)
Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison.
Journal of theoretical biology, 276 1
Jie Lin, D. Adjeroh, Bing-Hua Jiang, Yue Jiang (2018)
fastWKendall: an efficient algorithm for weighted Kendall correlation
Computational Statistics, 33
David Christensen (2005)
Fast algorithms for the calculation of Kendall’s τ
Computational Statistics, 20
Chih-En Kuo, Yue-Li Wang, J. Liu, M. Ko (2015)
Resequencing a Set of Strings Based on a Target String
Algorithmica, 72
J. Bao, R. Yuan (2015)
A wavelet-based feature vector model for DNA clustering.
Genetics and molecular research : GMR, 14 4
Build the sufﬁx array (SA) from the combined sequence S ¼ T $ P $. And for a given parameter k , read all k -grams from SA
S. Karlin, G. Ghandour, F. Ost, S. Tavaré, L. Korn (1983)
New approaches for computer analysis of nucleic acid sequences.
Proceedings of the National Academy of Sciences of the United States of America, 80 18
Compute the frequency for each k -gram using the SA.
C. Fischer, S. Koblmüller, C. Gülly, C. Schlötterer, C. Sturmbauer, G. Thallinger (2013)
Complete Mitochondrial DNA Sequences of the Threadfin Cichlid (Petrochromis trewavasae) and the Blunthead Cichlid (Tropheus moorii) and Patterns of Mitochondrial Genome Evolution in Cichlid Fishes
PLoS ONE, 8
H. Otu, Khalid Sayood (2003)
A new sequence distance measure for phylogenetic tree construction
Bioinformatics, 19 16
Bingxin Lu, Louxin Zhang, H. Leong (2017)
A program to compute the soft Robinson–Foulds distance between phylogenetic networks
BMC Genomics, 18
Oliver Bonham-Carter, J. Steele, D. Bastola (2014)
Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis
Briefings in bioinformatics, 15 6
R. Giancarlo, Davide Scaturro, F. Utro (2012)
Textual data compression in computational biology: Algorithmic techniques
Comput. Sci. Rev., 6
M. Kantorovitz, G. Robinson, S. Sinha (2007)
A statistical method for alignment-free comparison of regulatory sequences
Bioinformatics, 23 13
Mukul Bansal, J. Burleigh, O. Eulenstein, David Fernández-Baca (2010)
Robinson-Foulds Supertrees
Algorithms for Molecular Biology : AMB, 5
Richard Beal, T. Afrin, A. Farheen, D. Adjeroh (2016)
A new algorithm for “the LCS problem” with application in compressing genome resequencing data
BMC Genomics, 17
J. Aach, M. Bulyk, G. Church, J. Comander, A. Derti, J. Shendure (2001)
Computational comparison of two draft sequences of the human genome
Nature, 409
Martine Léonard, L. Mouchard, Mikaël Salson (2012)
On the number of elements to reorder when updating a suffix array
J. Discrete Algorithms, 11
Richard Beal, A. Farheen, D. Adjeroh (2016)
Compressing genome resequencing data via the Maximal Longest Factor
2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
J. Reid (1987)
Normal functions of normal random variables
Computers & Mathematics With Applications, 14
Bao Jp, Yuan Ry (2015)
A wavelet-based feature vector model for DNA clustering.
Genetics and Molecular Research, 14
Dan Wei, Q. Jiang, Yanjie Wei, Shengrui Wang (2012)
A novel hierarchical clustering algorithm for gene sequences
BMC Bioinformatics, 13
Long Shi, Hailan Huang (2012)
DNA Sequences Analysis Based on Classifications of Nucleotide Bases
Sebastian Deorowicz, S. Grabowski (2013)
Data compression for sequencing data
Algorithms for Molecular Biology : AMB, 8
Kai Song, Jie Ren, G. Reinert, Minghua Deng, M. Waterman, Fengzhu Sun (2014)
New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing
Briefings in bioinformatics, 15 3
Aurelio Reyes, C. Gissi, G. Pesole, F. Catzeflis, Cecilia Saccone (2000)
Where do rodents fit? Evidence from the complete mitochondrial genome of Sciurus vulgaris.
Molecular biology and evolution, 17 6
L. Bevington (1882)
A Primer of Art
Nature, 26
Lin Wan, G. Reinert, Fengzhu Sun, M. Waterman (2010)
Alignment-Free Sequence Comparison (II): Theoretical Power of Comparison Statistics
Journal of computational biology : a journal of computational molecular cell biology, 17 11
M. Bauer, S. Schuster, K. Sayood (2008)
The Average Mutual Information Profile as a Genomic Signature
BMC Bioinformatics, 9
J. Bao, R. Yuan, Zhenzhong Bao (2014)
An improved alignment-free model for dna sequence similarity metric
BMC Bioinformatics, 15
S. Vinga, Jonas Almeida (2003)
Alignment-free sequence comparison-a review
Bioinformatics, 19 4
Ying Cao, A. Janke, P. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Pääbo, M. Hasegawa (1998)
Conflict Among Individual Mitochondrial Proteins in Resolving the Phylogeny of Eutherian Orders
Journal of Molecular Evolution, 47
Jie Lin, D. Adjeroh, Bing-Hua Jiang, Yue Jiang (2016)
K2: Efficient alignment-free sequence similarity measurement using the Kendall statistic
2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
Jun Wang, Xiaoqi Zheng (2008)
WSE, a new sequence distance measure based on word frequencies
Mathematical Biosciences, 215
Xiuwei Zhang, Bernard Moret (2010)
Refining transcriptional regulatory networks using network evolutionary models and gene histories
Algorithms for Molecular Biology : AMB, 5
Aach (2001)
5
Nature, 26
B. Blaisdell (1986)
A measure of the similarity of sets of sequences not requiring sequence alignment.
Proceedings of the National Academy of Sciences of the United States of America, 83 14
Lizhi Cui, Josiah Poon, S. Poon, Hao Chen, Junbin Gao, P. Kwan, Kei Fan, Zhihao Ling (2014)
An improved independent component analysis model for 3D chromatogram separation and its solution by multi-areas genetic algorithm
BMC Bioinformatics, 15
S. Wandelt, U. Leser (2013)
FRESCO: Referential Compression of Highly Similar Sequences
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10
Given two sequences T and P , combine them into one sequence, S ¼ T $ P $, after appending an ‘$’ at the end of each sequence. The concatenated sequence S is of length
U. Manber, E. Myers (1993)
Suffix arrays: a new method for on-line string searches
SIAM J. Comput., 22
D. Gusfield (1997)
Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology
D. Robinson, L. Foulds (1981)
Comparison of phylogenetic trees
Bellman Prize in Mathematical Biosciences, 53
Temple Smith, M. Waterman (1981)
Identification of common molecular subsequences.
Journal of molecular biology, 147 1
Libin Liu, Yee-kin Ho, S. Yau (2006)
Clustering DNA sequences by feature vectors.
Molecular phylogenetics and evolution, 41 1
Chun Li, Chun Li, Jun Wang (2005)
Relative entropy of DNA and its application
Physica A-statistical Mechanics and Its Applications, 347
M. Kendall (1938)
A NEW MEASURE OF RANK CORRELATION
Biometrika, 30
D. Adjeroh, Tim Bell, Amar Mukherjee (2008)
The Burrows-Wheeler Transform:: Data Compression, Suffix Arrays, and Pattern Matching
J. Qi, Bin Wang, Bainian Hao (2003)
Whole Proteome Prokaryote Phylogeny Without Sequence Alignment: A K-String Composition Approach
Journal of Molecular Evolution, 58
Dongxia Yao, Qiang Wei, Wenying Xu, Ryan Syrenne, Joshua Yuan, Zhen Su (2012)
Comparative genomic analysis of NAC transcriptional factors to dissect the regulatory mechanisms for cell wall biosynthesis
BMC Bioinformatics, 13
G. Reinert, David Chew, Fengzhu Sun, M. Waterman (2009)
Alignment-Free Sequence Comparison (I): Statistics and Power
Journal of computational biology : a journal of computational molecular cell biology, 16 12
Bo Zhao, R. He, S. Yau (2011)
A new distribution vector and its application in genome clustering.
Molecular phylogenetics and evolution, 59 2
S. Vinga (2013)
Information theory applications for biological sequence analysis
Briefings in Bioinformatics, 15

Publisher: Oxford University Press
Copyright: © The Author(s) 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/btx809
Publisher site: See Article on Publisher Site

Abstract

Motivation: Alignment-free sequence comparison methods can compute the pairwise similarity between a huge number of sequences much faster than sequence-alignment based methods. Results: We propose a new non-parametric alignment-free sequence comparison method, called K , based on the Kendall statistics. Comparing to the other state-of-the-art alignment-free comparison methods, K demonstrates competitive performance in generating the phylogenetic tree, in evaluat- ing functionally related regulatory sequences, and in computing the edit distance (similarity/dissimi- larity) between sequences. Furthermore, the K approach is much faster than the other methods. An improved method, K , is also proposed, which is able to determine the appropriate algorithmic par- ameter (length) automatically, without ﬁrst considering different values. Comparative analysis with the state-of-the-art alignment-free sequence similarity methods demonstrates the superiority of the proposed approaches, especially with increasing sequence length, or increasing dataset sizes. Availability and implementation: The K and K approaches are implemented in the R language as a package and is freely available for open access (http://community.wvu.edu/daadjeroh/projects/ K2/K2_1.0.tar.gz). Contact: yueljiang@163.com Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction include compression and efficient storage of the rapidly expanding genomic datasets (Beal et al., 2016a, b; Deorowicz and Grabowski, Evaluating the similarity between two sequences is a classical prob- 2013; Giancarlo et al., 2012), and resequencing a set of strings given lem that has long been studied in computer science, primarily from a target string (Kuo et al., 2015), an important step in efficient gen- the view point of string pattern matching (Adjeroh et al., 2008; ome assembly. Gusfield, 1997). Such similarity measurement has applications in Alignment-free sequence comparison methods can compute the various areas in computational biology, e.g. sequence alignment similarity between a large number of sequences much faster than (Smith and Waterman, 1981), in comparative genomics (Aach et al., alignment-based methods (Vinga and Almeida, 2003; Vinga, 2014). 2001), genomic evolution and phylogenetic tree construction and Word analysis of k-length substrings (also called k-mers, k-grams, analysis (Cao et al., 1998; Reyes et al., 2000), analysis of regulatory or k-tuple) from sequences is one approach to improved sequence functions (Kantorovitz et al., 2007), rapid search in huge biological comparision (Bonham-Carter et al., 2014). Words can be extracted sequences (Wandelt and Leser, 2013). Other recent applications V The Author(s) 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com 1682 * K and K 1683 2 2 in different ways, and with varying lengths. The most common is to (typically, with k 7). This places the proposed K statistic among use sliding windows from length 2 to n – 1, where n is the length of the best non-alignment based similarity measures, especially with sequence (Bauer et al., 2008; Dai et al., 2011; Liu et al., 2006; Qi increasing sequence lengths (n, m), or increasing size of the k-mer. et al., 2004). Some methods divide a sequence into several even parts Based on K , we further propose an improved method, named K , (Zhao et al., 2011), while some others have used fixed length sub- which is able to determine a suitable value for k, the k-gram param- strings, e.g. k ¼ 2 (2-mer) (Shi and Huang, 2012). After extracting eter automatically with competitive performance. We have imple- the words, different statistical methods can be applied to analyze mented K and K in the R statistical and graphics environment, and two sequences for similarity (Li and Wang, 2005; Wang and Zheng, the codes are freely available for open access. 2008). DMk (Wei et al., 2012) and Category-Position-Frequency (CPF)(Bao et al., 2014) incorporate positions and frequencies of k-mers into feature vectors. DV (Zhao et al., 2011) utilizes distribu- 2 Materials and methods tion vectors from k-mers. Shi (Shi and Huang, 2012) maps a DNA 2.1 Kendall statistic primary sequence into three symbolic sequences and groups these se- The Kendall statistic is a nonparametric method which makes no as- quences into a twelve-component vector. Wavelet Feature Vector sumption about the probability distribution of the variables being (WFV) converts a sequence into a L-length feature vector by wavelet assessed. The Kendall statistic estimates the correlation between two transform (Bao and Yuan, 2015). sets of random variables X and Y, represented using the pairs Our approach is more closely related to the D statistic, another ðÞ X ; YðÞ X ; Y .. .ðÞ X ; Y . The Kendall correlation, s, is then 1 1 2 2 n n popular approach for measuring the similarity (or dissimilarity) be- defined as follows (Kendall, 1938). tween two sequences (Bonham-Carter et al., 2014; Song et al.,2014). It was first proposed by Blaisdell (1986). Since then, many variants sðÞ X; Y ¼ Pf X X Y Y > 0g Pf X X Y Y < 0g j i j i j i j i and improvements have been proposed, such as D (Kantorovitz 2 (1) sh z et al.,2007), D (Reinert et al., 2009)and D (Wan et al., 2010). D 2 2 2 In this study, we compute the Kendall correlation by using the fol- (Kantorovitz et al.,2007)normalizesthe D statistic using its mean lowing formula to approximate s (Kendall, 1938; Marden et al., and standard deviation to improve its detection power (Song et al., sh 1992): 2014). D and D are two other normalization improvement meth- 2 2 ods which were proposed in Reinert et al. (2009) and Wan et al. n n c d b s ¼ (2) sh S nðÞ n1 (2010). D [also denoted D in the literature (Reinert et al.,2009; 2 2 Song et al.,2014)] uses an approach based on Shepp (1964). sh According to a recent review (Song et al., 2014), D and its variant where n is the number of distinct k-grams for the concatenated se- are generally the best D statistical methods for alignment-free com- quence S¼T$P$, n is the number of concordant k-gram pairs 2 c parison of genomic sequences, especially with increasing sequence X X Y Y > 0, with 0 < i < j n; and n is the number j i j i d length. More detailed discussion of the D -statistic family of algo- of discordant k-gram pairs X X Y Y < 0, with 2 j i j i rithms can be founded in Section 6 of the Supplementary Material. 0 < i < j n. In general, the D -statistic family of algorithms have a general problem of requiring a quadratic or cubic time complexity, with re- 2.2 Optimized computation of Kendall statistics spect to n or m, the length of the sequences, and k, the size of the sub- The time cost to compute b s, the approximation to the Kendall cor- strings being considered. Also, the D family of statistics generally relation statistic is On , including time to compare each pair be- makes some assumptions on the distribution of the sequences, for in- tween (X ; X ) and (Y ; Y ), i 6¼ j, where n is the number of pairs in i j i j stance, most assumed either a uniform distribution, or a normal distri- X and Y. Christensen (2005) showed an algorithm to calculate b s in bution, for the symbols in the sequences. This parametric nature of the OnðÞ log n time complexity. It was implemented in Pascal. Lin et al. statistics obviously limits their practical applicability, since practical (2017) recently introduced an algorithm for the related problem of data, especially for biological sequences (e.g. complete genomes for in- weighted Kendall correlation. In this work, we propose data dividuals of the same species, or for related organisms) rarely follow structures and a new algorithm to compute b s. Our algorithm also these theoretical distributions. A non-parametric approach to the meas- runs in OnðÞ log n time, but uses a different approach to compute urement of sequence similarity is required, one that does not make any the Kendall statistics. We then apply the algorithm to analyze simi- assumption on the distribution of the sequences under consideration, larity between a given pair of sequences. More detailed discussion and one that is efficient enough to handle the rapidly increasing com- on the improved algorithm for the Kendall Statistics can be found in plexity and data sizes of available biological sequence data. Section 3.1 of the Supplementary Material. In this work, we propose a nonparametric approach, K , which uses the Kendall correlation statistic to estimate the similarity be- tween sequences. The Kendall correlation is a non-parametric 2.3 The K approach method to calculate the correlation between two sets of random Here, we propose the K statistic as a new method for rapid and ef- variables. We adopt this to measure the similarity among sequences. ficient measurement of biological sequence similarity, without When compared to the other state-of-the-art alignment-free se- requiring an initial sequence alignment step. The K statistic makes sh z quence similarity methods, (e.g. D , D ; D ; D , DMk, DV, CPF, use of the above optimized method for computing the Kendall’s s 2 2 2 Shi and WFV), K demonstrates an improved power in detecting re- correlation between two sequences. Here, the correlation is com- latedness between sequences, as measured by its ability to generate puted based on the k-mer count statistics ðX and Y Þ between the w w the correct phylogenetic tree, and to identify functionally related two sequences. The counts are obtained in OðÞ jSj time using the regulatory sequences. The K also showed significant correlation suffix array data structure (Adjeroh et al., 2008; Gusfield, 1997; with the edit distance, the standard, though time consuming, meas- Manber and Myers, 1993), where jSj is the length ofinputse- ure of (similarity/dissimilarity) between sequences. Further, the K quence S ¼ T$P$. We describe the steps of the algorithm in the approach is faster than most of the other methods when k is large, following. 1684 J.Lin et al. 1. Given two sequences T and P, combine them into one sequence, We can observe the connection between the above relation for k and S ¼ T$P$, after appending an ‘$’ at the end of each sequence. the longest common prefix (LCP) between suffixes in S. For an arbi- The concatenated sequence S is of length jSj. trary sequence Q with symbols from the alphabet R, it is known 2. Build the sufﬁx array (SA) from the combined sequence S ¼ T$P$. that, on average, the length of the longest common prefix between And for a given parameter k,readall k-grams from SA. suffixes in Q is in O log ðÞ jQj . See Karlin et al. (1983) and jRj 3. Compute the frequency for each k-gram using the SA. Here, we Le ´ onard et al. (2012). Thus, for an arbitrary sequence, our suggested use X , and Y to denote the frequency of the k-gram w in se- value for k is essentially in the same order as this expected maximal w w quences T and P, respectively. Notice that, both X and Y will LCP value. This makes sense, in that, the maximal length k-gram w w be found at essentially the same time, using the SA of the con- should be close to the expected maximal LCP length, since if we catenated sequence, S. have k values much larger than the average maximal LCP length, we 4. Order all the (X ; Y ) frequencies of k-gram pairs by grouping may not be able to observe some repeated k-grams. On the other w w them according to Y , and then X . We get pairs hand, if we use k values much smaller than the average maximal w w {ðÞ X ; Y ;ðÞ X ; Y ; ...ðÞ X ; Y ; ...ðÞ X ; Y }, where n is the num- LCP length, we will be double-counting some smaller repeated sub- 1 1 2 2 i i n n ber of distinct k-grams from the concatenated sequence strings. Thus, operating with k values far from the expected max- S ¼ T$P$, and (X , Y ) is the frequency pair of ith ranked imal LCP length could lead to either underestimating or i i k-gram from sequences T and P. Thus, (1) Y Y and i < n overestimating the frequency for the k-grams that capture the major i iþ1 and (2) X X when Y ¼ Y and i < n. variations in the sequence. i iþ1 i iþ1 5. Compute n , the number of concordant pairs, and n the number c d of discordant pairs, for the ranked frequency pairs from se- 2.5 Comparative complexity analysis quences T and P. The number of concordant pairs n is the sum ðÞ The proposed K algorithm runs in O jSj log jSj time, which is a of the number pairs in one of these two conditions: (1) x < x i j significant improvement in complexity, when compared with the and y < y ; (2) x > x and y > y , where 0 i < j < n. k i j i j i j OkjSjjRj required for computing D and other related statistics, Similarly, the number of disconcordant pairs n , is the sum of or even with the observed improvement that reduces the time the number of pairs in one of the following two conditions: (1) 2 to OkjSj . K requires just a one-time run of K , using the auto- x < x and y > y ; (2) x > x and y < y , where 0 i < j < n. i j i j i j i j matically computed k-parameter. This will be practically faster than 6. Calculate the Kendall correlation using the formula: using K , however, the time complexity of K still remains the same n n c d OðÞ jSj log jSj as in K . More detailed discussion can be found in b s ¼ : nðÞ n1 Section 3.2 of the Supplementary Material. 7. Return b s which is the K similarity between sequences T and P. 2.6 Experimental design To test the proposed methods, we performed some experiments The last three steps are based on the optimized Kendall algorithm using three different datasets. We also compared our experimental introduced previously (Section 2.2). results with those from state-of-the-art alignment-free sequence 2.4 K : improved K with automated k value similarity measurement algorithms. Similar to the alignment-free methods from the D family, the pro- posed K approach depends critically on the length parameter, k. 2.6.1 Datasets and environment Here, we propose a method to determine the k parameter automatic- We use three sets of biological sequence data for the experiments in ally, without needing to test with all possible values. this study. The first dataset used is the complete mtDNA sequences Given the alphabet jRj and the length parameter k, there are at k from Cao et al. (1998)and Reyes et al. (2000) containing data on 12 most jRj possible k-grams, independent of the sequence lengths n proteins encoded in the H strand of mtDNA in 20 eutherian species. and m. These are the unique k-grams, given the alphabet. Given the The sequence lengths ranged from 16 300 to 17 080 symbols. This concatenated sequence S ¼ T$P$ with length of jSj, the k-grams are dataset is often used to evaluate the similarity of different species, espe- simply k-length substrings of S. Thus, we can have at most jSj k cially using phylogenetic trees. We call this the ‘mtDNA20’ dataset. þ1 number of k-grams from S. These may not be unique, since they The second dataset is 23 whole mitochondrial DNA genomes may include repeated k-grams, depending on the nature of the se- from different Eukaryotic fish species of the suborder Labroidei, taken quences T and P. At the same time, we need the k-grams to capture from Fischer et al. (2013). We could not locate the sequences for two most of the variations in the input sequences (now contained in S), of the species, namely, P.trewavasae and T.moorii. Thus, though the while avoiding k-grams that are repeated inside other k-grams. That original work in Fischer et al. (2013) used 25 species, our dataset con- is, we want the maximal length k-grams that capture the variations tained only 23 of the 25 species. The sequence lengths ranged from in S, without missing out on the smaller k-grams, especially those 16 440 to 17 040 symbols. We call this dataset the ‘Fish23’ dataset. that did not occur inside the longer k-grams. These shorter k-grams The third dataset used is the set containing cis-regulatory modules are likely to be more numerous, and can also provide important in- (CRMs) used by Kantorovitz et al. (2007) in their work on identifica- formation about the sequences. To satisfy the above competing con- tion of functional relationships between cis-regulatory sequences. ditions, the choice of k should meet the following criterion: There are seven sets including 185 CRM sequences, taken from k k1 jRj jSj k þ 1 > jRj (3) Drosophila melanogaster and Homo sapiens. We call this the ‘CRM185’ dataset. This dataset is available for download at http:// where jSj¼ m þ n þ 2 is the length of the concatenated sequences S. veda.cs.uiuc.edu/d2z/publicdata.tar.gz. Following the above, the value of k can be approximated as: The experiments were performed in a PC environment, running Intel i5, 4 cores, with 16 GB RAM and 1 TB HD. K and K were k ¼d log ðÞ jSje (4) 2 jRj * K and K 1685 2 2 written using the R Language. For comparison purposes, we also (2017)]. The Robinson-Foulds distance measures the topological tested several other state-of-the-art alignment-free methods using distance between two labeled trees essentially by counting the min- the same datasets. The algorithm for D was from Song et al. imum number of elementary operations needed to transform one sh (2014), D was from Wan et al. (2010), and D was from Reinert tree to the other. 2 2 et al. (2009). They all were implemented using the C language. The For the experiments on the mtDNA20 dataset, and we used the method D was developed in Perl in the original work of tree published by Cao et al. (1998) as the reference. See also Otu and Kantorovitz et al. (2007). We implemented the methods for DMk Sayood (2003). For phylogenetic analysis using the Fish23 dataset, we (Wei et al., 2012), CPF (Bao et al., 2014), DV (Zhao et al., 2011) used the tree published by Fischer et al. (2013) as the reference tree. and Shi (Shi and Huang, 2012) in R, according to descriptions pro- vided in the respective papers. The codes for WFV, developed in 3.1.1 mtDNA20 dataset Python in their original work (Bao and Yuan, 2015), were kindly Table 1 shows the Robinson-Foulds distance between each tree and provided by the authors. In our experiments, the parameter k corres- the reference tree. Each column contains distances of a given ponds to the length L ¼ 4 in their work. alignment-free method with parameter k varied from 2–9. The re- sults of three methods without parameter k are shown in the last 2.6.2 Experiment 1 row. The minimum distance in this table is 12. This minimum was The first experiment aimed at analyzing the general performance obtained with the K method, and it is also present in the column of each alignment-free method studied. The experiment sh for K with parameter k¼8, 9, and for D with parameter k¼7, 8. z 2 sh compared eleven alignment-free methods, namely, D ; D ; D ; D , 2 2 2 The remaining 8 methods are unable to achieve the minimum (best) DMk; DV; CPF; Shi and WFV and our two proposed methods, K distance. However, D and CPF are able to take the second place and K . The experiment was performed on mtDNA20 and Fish23 with minimum RF distance of 14. D and DMk can obtain the min- two datasets. imum RF distance of 16. The distances reported by the other meth- To evaluate the performance of the algorithms, we consider three ods, D ; DV; Shi and WFV were far from the minimum distance, performance measures: (i) the Robinson-Foulds (RF) distance hence, were ranked lower. On this dataset, the methods K ; K and (Robinson and Foulds, 1981) which measures the topological dis- sh D performed generally better than the others. However, the fact tance between the golden reference phylogenetic tree and the phylo- that K does not need to try all the possible k values from 2–9, gives genetic tree constructed using a given alignment-free method; (ii) the it an advantage over the others. correlation of the similarity/distance values as determined by the Figure 1 shows the reference phylogenetic tree from Cao et al. alignment-free method with the standard edit distance; (iii) the com- (1998), and the corresponding tree generated by the proposed K ap- putation time required. These performance measures need to be con- proach. Detailed figures for the other methods are presented in the sidered both individually and jointly in evaluating algorithms for Supplementary Material. To compare different methods, we show the sequence similarity measurement. phylogenetic trees constructed using each of the methods. Methods sh D ; D ; D , DMk; CPF and K depend on the input parameter k. 2 2 2 2 2.6.3 Experiment 2 For each of these methods, the Supplementary Figure S1 shows the The second experiment investigated how well the results from the corresponding phylogenetic tree that resulted in the minimum proposed alignment-free methods can capture the similarity between Robinson-Foulds distance with the reference tree. For the K method, sequences with similar functional roles. For this experiment, we the k value is automatically computed, so, only one tree is generated. used the related regulatory sequences in the CRM185 dataset, our The phylogenetic trees from D ; Shi; WFV and DV are not shown in third dataset. The ‘positive’ set is the set of CRMs that are in the the Supplementary Figure S1 because these trees are far away from same tissue and/or same developmental stage. The ‘negative’ set is the reference tree. See also the RF distances shown in Table 1. the set chosen from non-coding sequences, which are expected to be Looking at these figures, we can see that the trees are generally unrelated with respect to function. This experiment is designed to similar to the reference tree, though with some variations. We can predict whether or not any two given sequences are in the ‘positive’ set, using alignment-free methods. First, we compute the similarity Table 1. The Robinson-Foulds distance between the reference between pairwise sequences using alignment-free methods. Next, we phylogenetic tree and phylogenetic trees generated using different rank these pairs based on their similarity, and determine the number alignment-free statistical methods (with k ¼ 2; 3; ... ; 9) of positive pairs and return the accuracy ratio. sh z kD D D D K DMk CPF WFV 2 2 2 2 2 222 26 26 36 26 18 24 26 3 Results and discussion 324 26 28 34 22 20 22 24 422 20 22 26 22 16 18 24 3.1 Phylogenetic tree analysis 522 20 16 26 20 16 16 22 One way to evaluate the performance of the alignment-free methods 624 16 16 24 18 18 16 24 is to compare the phylogenetic trees generated using the distance 718 14 12 20 14 16 14 24 matrix against the known correct (reference) phylogenetic tree for 818 16 12 20 12 16 14 24 the species in the dataset. In this case, methods that generate trees 916 14 14 — 12 18 16 24 that have more similarity in structure with the reference tree will be K 12 DV 20 Shi 22 taken to be of better performance. To compare the similarity/dissimilarity between two trees, we Note: Results are based on the mtDNA20 dataset (Cao et al., 1998). K use the Robinson-Foulds(RF) distance (Robinson and Foulds, 1981). having automatically determined k values, DV and Shi without varied k par- The Robinson-Foulds distance (also called the symmetric difference ameter, they are all reported in the last row for brevity. D generated an error metric) is a well-known approach for measuring the similarity be- at k ¼ 9. The bold value 12 here indicates the minimal RF distance. The tween two trees. [See for example Bansal et al., (2010) and Lu et al., smaller the RF distance is, the better a method performs. 1686 J.Lin et al. Table 2. The Robinson-Foulds distance between the reference phylogenetic tree and phylogenetic trees generated using different alignment-free statistical methods (with k ¼ 2; 3; ... 9) sh z k D D D D K DMk CPF WFV 2 2 2 2 2 232 34 36 40 36 30 32 36 330 30 28 40 26 28 30 30 426 26 30 36 24 22 24 26 524 20 22 38 20 20 20 26 614 10 20 36 12 10 12 32 714 8 14 34 8 12 12 34 8 88 8 34 8 12 14 34 9 8 10 14 — 10 14 16 34 K 8 DV 32 Shi 34 Note: Results are based on the Fish23 dataset (Fischer et al., 2013). For brevity, the results for K (with automatically determined k value), and DV and Shi (both with ﬁxed k parameters), are reported in the last row. D gener- ated an error at k ¼ 9. The bold value 8 here indicates the minimal RF dis- tance. The smaller the RF distance is, the better a method performs. DV are worse than the others. Among these methods, only K is able to automatically determine an appropriate k value. From these results, we conclude that with respect to phylogenetic trees, the K is the best amongst all the tested alignment-free methods. The Supplementary Figure S2a shows the phylogenetic tree re- ported by Fischer et al. (2013) in their original paper using the Fish23 dataset. Similar to the ‘mtDNA20’ experiment, we show the phylogenetic trees generated by the alignment-free methods: sh D ; D ; D , DMk; CPF, K and K . The Supplementary Figure 2 2 2 2 2 S2(b–h) show the phylogenetic trees with the minimum Robinson- Foulds distance for each method. Supplementary Figure S2a is the reference tree. For our experiments, since we did not have the se- quences for P.trewavasae and T.moorii, the pairs N.brichardi, T.duboisi will become neighbors, with parent at node 16 in the ori- ginal reference tree. Fig. 1. Reference phylogenetic tree from Cao et al. (1998), and the corres- Fish Dataset demonstrates similar trends to the mtDNA20 data- ponding tree generated using the proposed K alignment-free sequence com- set, see more details in Supplementary Material, Section 4.3. parison method, using the mtDNA20 dataset observe that D and D placed horse and white rhinoceros close to 3.2 Correlation with the edit distance each other as expected, however, their parent nodes were wrongly 3.2.1 mtDNA20 dataset placed, making them much further from say cow than in the reference Table 3 shows the Pearson correlation coefficients between the simi- tree. Also, D wrongly placed wallaroo very close to mouse and rat, larity measurements from the different alignment-free methods and while D had cow much closer to rat and mouse than the reference the edit distance, using the mtDNA20 dataset. From the table, one sh tree. D provided a better result than D and D , but it also incor- 2 2 sh can observe that D achieved the best result 0:92 when k ¼ 6or rectly placed platypus much closer to rat and mouse. Methods K and k ¼ 7:K achieve the best result (q ¼0:95) when k ¼ 9:K can K seem to avoid these problems. One quick way to access the per- reach q ¼0:94 which is close to the best of K .In a word, the K 2 2 formance of the methods is to compare the minimum number of hops method can reach the best accuracy, and K is quite competitive. A needed to go from one given leaf node (representing a species) to an- key advantage of the K method is that it is able to select parameter k other leaf node on a given tree. The Supplementary Table S3 shows automatically and quickly. However, considering the K may need to the number of hops for two pairs of species. The Supplementary try all possible k values to determine the best k (9 in this case), the Figure S1 and Table S3 suggest that the proposed methods K and K slight performance disadvantage (q ¼ 0:94 versus q ¼ 0:95) of K be- work better than the other methods on the mtDNA20 dataset. comes even less significant, especially when data volume is huge. See more detailed analysis in Supplementary Material, Section 4.1.1. 3.1.2 Fish23 dataset Similar results were observed using the Fish23 dataset. These Table 2 shows the Robinson-Foulds distances for the Fish23 dataset. have been included in Section 4.1.2 of Supplementary Material. Each column shows distances of one alignment-free method with parameter k varied from 2 to 9. The last row shows the results for three methods that did not use varying k parameters. The minimum 3.3 Practical running time distance in this table is 8 (shown in boldface in the table). Methods We compare the running time of eleven methods, [9 earlier sh sh z K ; K ; D ; D and D are able to achieve the minimum distance. approaches (D ; D ; D ; D ; DMk; CPF; WFV; DV and Shi) and 2 2 2 2 2 2 2 2 2 As with the previous experiment on ‘mtDNA20’, D ; WFV; Shi and the two proposed methods (K and K )]. 2 2 * K and K 1687 2 2 Table 3. Pearson correlation coefﬁcient between the similarity/dis- tance measure from different alignment-free statistical methods sh and the edit distance sh z k D D D D K DMk CPF WFV 2 2 2 2 2 DMk CPF 2 0.45 0.51 0.55 0.02 0.56 0.67 0.62 0.57 WFV 3 0.48 0.60 0.74 0.10 0.73 0.68 0.66 0.62 4 0.53 0.71 0.86 0.74 0.82 0.70 0.71 0.63 5 0.61 0.79 0.91 0.81 0.89 0.78 0.77 0.72 6 0.77 0.87 0.92 0.83 0.92 0.84 0.86 0.68 7 0.87 0.91 0.92 0.84 0.92 0.87 0.89 0.68 8 0.90 0.92 0.91 0.84 0.93 0.85 0.89 0.66 9 0.91 0.91 0.91 — 0.95 0.85 0.87 0.67 K 0.94 DV 0.70 Shi 0.68 Note: Reports are for the mtDNA20 dataset. K having automatically deter- mined k values, DV and Shi without varied k parameter, they are all reported in the last row for brevity. D generated an error at k ¼ 9. The bold values indicate the biggest absolute value of Pearson correlation coefﬁcient for differ- ent k values. The bigger an absolute value, the better a method performs. sh z Fig. 2. Time cost comparison for D ; D ; D ; D ; DMk; CPF; WFV and K 2 2 2 2 2 with parameter k varying from 2 to 9, using the mtDNA20 dataset. Results for Table 4. Practical running time (in seconds) using alignment-free K ¼3.07 s, DV¼2.33 s and Shi¼1.37 s are not shown in the ﬁgure for clarity methods on the mtDNA20 dataset sh z with 2.33 and 1.37 s respectively. However, K generated a much kD D D D K DMk CPF WFV 2 2 2 2 2 lower RF distance—see Table 1. K is slower than the other methods sh 2 0.02 0.05 0.05 1.55 0.41 3.64 13.23 0.004 (i.e. D ; D ; D ) with k ¼ 2; 3; 4; 5; 6, and faster than the other 2 2 3 0.03 0.05 0.07 1.56 0.45 4.90 14.02 0.008 methods with k ¼ 7; 8; 9. We can also observe from the results dis- 4 0.08 0.11 0.15 1.61 0.57 5.91 15.63 0.020 cussed earlier that, for this dataset, the best performance for the 5 0.20 0.34 0.5 1.76 1.94 7.06 16.82 0.088 other methods were recorded at k 6. See Supplementary Figure S1 6 0.56 1.29 2 2.35 2.22 9.78 16.58 0.884 and Table 1. Clearly, since K does not need to search for the best k 7 1.26 4.91 7 5.38 3.17 18.09 16.43 7.768 value (i.e. it is executed for just one k value), it is overall faster than 8 2.40 18.18 25 19.19 3.63 40.13 16.82 38.3 the other methods, without degrading the accuracy. This is import- 9 4.58 70.28 99 — 4.34 92.10 17.05 347.1 ant, considering the increasingly huge volumes of data involved in K 3.07 DV 2.33 Shi 1.37 most applications of these techniques. In fact, the primary motiv- Note:Results for K with automatically determined k values, DV and Shi with 2 ation for the alignment-free methods is their rapid processing speed, ﬁxed k values, are reported in the last row. D generated an error at k ¼ 9. The bold when compared with alignment-based methods. values shown the smallest running time (the fastest method) for different k values. Results on running time using the Fish23 dataset is provided in the Supplementary Material. With respect to running time, we can identify two key points from 3.3.1 mtDNA20 dataset sh z our experiments: (i) the running time for D ; D , DMk and WFV in- 2 2 Table 4 shows a comparison of the running time for eleven methods creases rapidly with increasing k. The running time for K is approxi- using the first dataset (mtDNA20 dataset) from Cao et al. (1998). mately linear with respect to the sequence length. (ii) Comparing K Figure 2 plots the corresponding running times. The time for K is and K ; K is more practical, since it can determine the k value auto- 2 2 3.07 s, time for DV¼2.33 s and time for Shi¼1.37 s which are not matically, and has a competitive performance. plotted in the figure. When k ¼ 9; D generates a runtime error, thus, we could not obtain a result for this case. 3.4 Evaluation on functionally related regulatory First, consider the methods that use varied k values. When k < 6, the WFV approach is the fastest among all methods. When the par- sequences ameter k increases, the running time of WFV increases rapidly, much While the alignment-free methods could be generally fast, an im- quicker than all the others. When k ¼ 7; 8; 9; WFV requires approxi- portant consideration is whether they can identify similarities be- mately 2.45, 10.55 and 109.5-fold time increases, respectively, when tween sequences that are functionally related. Of course, this can compared with K . Therefore, in terms of running time, K is the bet- only be possible if the sequences share some similar patterns. To 2 2 ter choice than the other methods, with less running time and higher evaluate this aspect of performance, we consider to what extent the accuracy when k > 6. The WFV method with RF distances (26, 24 alignment-free similarity measures are able to capture the similar- and 22) shown in Table 1 did not perform well. ities between sequences from the same anatomic regions of the same sh Consider D and K , the two methods that achieved the best results species. For this experiment, we used the third dataset—CRM185 sh with RF distance¼ 12 in Table 1. D reaches its best performance dataset, the regulatory sequences from Kantorovitz et al. (2007).We z sh when k ¼ 7; 8:K reaches its best performance when k ¼ 8; 9. When k compare our proposed methods K and K against D ; D ; D and 2 2 2 2 2 2 sh ¼ 7; 8; 9; D requires approximately 2, 8 and 25 fold time increases, D , DMk; DV; CPF; Shi and WFV. Table 5 shows the results. In 2 2 respectively, when compared with K . Therefore, in terms of combining the table, the result for D is taken from the original work of sh z sh with running time and accuracy, K is the better choice than D . Kantorovitz et al. (2007). For D ; D ; D and D , the table shows 2 2 2 2 2 2 Now consider K ; DV and Shi which do not use varying k val- the best results with k values in the range 2 k 7. For K ues. K requires 3.07 s to execute. DV and Shi are relatively faster method, we also tested with 2 k 7. Running Time (s) 0 20406080 100 1688 J.Lin et al. Table 5. The performance of popular alignment-free sequence similarity methods in capturing functional relatedness z sh Species Dataset D K D D D K DMk CPF WFV DV Shi 2 2 2 2 2 2 Fly Blastoderm 0.73 0.92(4) 0.85(2) 0.82(2) 0.82(6) 0.79 0.83(3) 0.84(4) 0.79(5) 0.72 0.7 Fly PNS 0.62 0.60(5) 0.63(3) 0.64(4) 0.64(3) 0.56 0.62(4) 0.61(4) 0.63(3) 0.58 0.55 Fly Tracheal 0.75 0.75(4) 0.72(4) 0.69(4) 0.69(4) 0.75 0.73(3) 0.75(4) 0.70(5) 0.7 0.71 Fly Eye 0.58 0.69(3) 0.61(2) 0.63(3) 0.60(3) 0.69 0.63(5) 0.63(4) 0.64(3) 0.62 0.59 Human Muscle 0.83 0.88(4) 0.83(5) 0.83(5) 0.86(6) 0.81 0.84(3) 0.82(4) 0.81(5) 0.76 0.79 Human Liver 0.69 0.83(2) 0.88(2) 0.78(6) 0.73(4) 0.69 0.82(2) 0.84(5) 0.80(3) 0.8 0.76 Human HBB 0.64 0.65(3) 0.58(3) 0.53(2) 0.59(4) 0.66 0.57(2) 0.60(3) 0.61(4) 0.56 0.52 Note: Numbers in brackets indicate the k value that produced the best results for the given method. Results are based on the CRM185 dataset. The bold values shown the best methods on a data set without considering K * and with considering K *. 2 2 The bold items are the best results on the dataset comparing differ- References ent methods, while excluding K .From Table 5, K reported five best Aach,J. et al. (2001) Computational comparison of two draft sequences of the z sh results out of seven using the CRM185 dataset. D ; D ; D and D 2 2 2 human genome. Nature, 26, 5–14. reported one best result each. K demonstrates competitive perform- Adjeroh,D. et al. (2008) The Burrows-Wheeler Transform: Data ance with the other methods. When we take K into consideration, we 2 Compression, Sufﬁx Arrays, and Pattern Matching, 1st edn. Springer can observe that it gets three best results out of seven datasets. D and Publishing Company, Berlin, German. sh D get one best result of seven cases, D and D are best on two Bansal,M.S. et al. (2010) Robinson foulds supertrees. Algorithms Mol. Biol., 2 2 5, 1–12. cases, and K was best on four cases. In general, the proposed K and 2 2 Bao,J. et al. (2014) An improved alignment-free model for DNA sequence K methods provide the overall best performance on this problem. similarity metric. BMC Bioinformatics, 15, 1–15. Bao,J.P. and Yuan,R.Y. (2015) A wavelet-based feature vector model for DNA clustering. Genet. Mol. Res. GMR, 14, 19163. 4 Conclusion Bauer,M. et al. (2008) The average mutual information proﬁle as a genomic The problem of sequence similarity measurement is critical to sev- signature. BMC Bioinformatics, 9, 48. eral important applications in huge volume genomic sequence ana- Beal,R. et al. (2016a) Compressing genome resequencing data via the maximal lysis. We proposed a novel non-parametric algorithm K for longest factor. In: IEEE International Conference on Bioinformatics alignment-free measurement of relatedness between sequences, using and Biomedicine, BIBM 2016, Shenzhen, China, December 15–18, 2016, the statistics of k-grams in the sequences. K is a non-parametric ap- pp. 92–97. Beal,R. et al. (2016b) A new algorithm for the LCS problem with application proach based on the Kendall correlation statistic to estimate the dis- in compressing genome resequencing data. BMC Genomics, 17, 544. similarity(/similarity) of sequences. Blaisdell,B. (1986) A measure of the similarity of sets of sequences not requiring Compared with other state-of-the-art alignment-free comparison sequence alignment. Proc. Natl. Acad. Sci. USA, 83, 5155–5519. sh z methods (D ; D ; D ; D ; DMk; CPF; WFV; DV and Shi), K 2 2 2 2 2 Bonham-Carter,O. et al. (2014) Alignment-free genetic sequence comparisons: demonstrates comparable or better performance, in phylogenetic a review of recent approaches by word analysis. Brief. Bioinf., 15, 890–905. analysis, in generating (similarity/dissimilarity) measures that correl- Cao,Y. et al. (1998) Conﬂict among individual mitochondrial proteins in ate with the edit distance among a large number of sequences, and resolving the phylogeny of eutherian orders. J. Mol. Evol., 47, 307–322. in capturing functional relatedness between sequences. Further, the Christensen,D. (2005) Fast algorithms for the calculation of Kendall’s tau. K approach is faster than the other methods when k 7: 2 Comput. Stat., 20, 51–62. We also introduced K , an improved version of K that is able to Dai,Q. et al. (2011) Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison. J. Theor. automatically determine the suitable k value, thus eliminating the Biol., 276, 174–180. need to search for all possible k values (for the k-grams), potentially Deorowicz,S. and Grabowski,S. (2013) Data compression for sequencing from k ¼ 2to k ¼ n :K produced the best overall results, with re- data. Algorithms Mol. Biol., 8, 25. spect to both efficiency and accuracy. Along with K competitive per- Fischer,C. et al. (2013) Complete mitochondrial DNA sequences of the thread- formance in measuring the similarity between sequences, its speed ﬁn cichlid (Petrochromis trewavasae) and the blunthead cichlid (Tropheus makes it practical, an important consideration given the increasingly moorii) and patterns of mitochondrial genome evolution in cichlid ﬁshes. huge datasets in various applications of alignment-free methods. PLoS One, 8, e67048. Giancarlo,R. et al. (2012) Textual data compression in computational biology: Algorithmic techniques. Comput. Sci. Rev., 6, 1–25. Acknowledgement Gusﬁeld,D. (1997) Algorithms on Strings, Trees and Sequences: Computer A shorter version of this paper was presented at IEEE BIBM’2016, Lin et al. Science and Computational Biology. Cambridge University Press, (2016). Cambridge, England. Kantorovitz,M. et al. (2007) A statistical method for alignment-free compari- son of regulatory sequences. Bioinformatics, 23, i249–i255. Funding Karlin,S. et al. (1983) New approaches for computer analysis of nucleic acid sequences. Proc. Natl. Acad. Sci. USA, 80, 5660–5664. This work was supported by the Chinese National Natural Science Foundation Kendall,M.G. (1938) A new measure of rank correlation. Biometrika, 30, (Grant No. 61472082), Natural Science Foundation of Fujian Province of 81–93. China (Grant No. 2014J01220), the US National Science Foundation (Grant Kuo,C.-E. et al. (2015) Resequencing a set of strings based on a target string. No. IIS-1552860) and Scientiﬁc Research Innovation Team Construction Algorithmica, 72, 430–449. Program of Fujian Normal University (Grant No. IRTL1702). Li,C. and Wang,J. (2005) Relative entropy of DNA and its application. Phys. Conﬂict of Interest: none declared. A Stat. Mech. Appl., 347, 465–471. * K and K 1689 2 2 Lin,J. et al. (2016) K : Efﬁcient alignment-free sequence similarity measure- Shepp,L. (1964) Normal functions of normal random variables. SIAM Rev., 6, ment using the Kendall statistic. In: IEEE International Conference on 459–460. Bioinformatics and Biomedicine, pp. 1128–1132. Shi,L. and Huang,H. (2012) DNA sequences analysis based on classiﬁcations Lin,J. et al. (2017) fastwkendall: an efﬁcient algorithm for weighted Kendall of nucleotide bases. In: Luo J. (ed.) Affective Computing and Intelligent correlation. accepted by Comput. Stat. Interaction. Springer, Berlin, Heidelberg, pp. 379–384. Liu,L. et al. (2006) Clustering DNA sequences by feature vectors. Mol. Smith,T.F. and Waterman,M.S. (1981) Identiﬁcation of common molecular Phylogenet. Evol., 41, 64. subsequences. J. Mol. Biol., 147, 195–197. Le ´ onard,M. et al. (2012) On the number of elements to reorder when updating Song,K. et al. (2014) New developments of alignment-free sequence compari- a sufﬁx array. J. Discret. Algorithms, 11, 87–99. son: measures, statistics and next-generation sequencing. Brief. Bioinf., 15, Lu,B. et al. (2017) A program to compute the soft Robinson–Foulds distance 343–353. between phylogenetic networks. BMC Genomics, 18, 111. Vinga,S. (2014) Information theory applications for biological sequence ana- Manber,U. and Myers,G. (1993) Sufﬁx arrays: a new method for on-line string lysis. Brief. Bioinf., 15, 376–389. searches. SIAM J. Comput., 22, 935–938. Vinga,S. and Almeida,J. (2003) Alignment-free sequence comparison: a re- Marden,J.I. et al. (1992) Rank correlation methods (5th ed.). J. Am. Stat. view. Bioinformatics, 19, 513–523. Assoc., 87, 249. Wan,L. et al. (2010) Alignment-free sequence comparison (II): theoretical Otu,H.H. and Sayood,K. (2003) A new sequence distance measure for phylo- power of comparison statistics. J. Comput. Biol., 17, 1467–1490. genetic tree construction. Bioinformatics, 19, 2122–2130. Wandelt,S. and Leser,U. (2013) FRESCO: referential compression of highly Qi,J. et al. (2004) Whole proteome prokaryote phylogeny without sequence similar sequences. IEEE/ACM Trans. Comput. Biol. Bioinf., 10, alignment: a K-string composition approach. J. Mol. Evol., 58, 1–11. 1275–1288. Reinert,G. et al. (2009) Alignment-free sequence comparison (I): statistics and Wang,J. and Zheng,X. (2008) WSE, a new sequence distance measure based power. J. Comput. Biol., 16, 1615–1634. on word frequencies. Math. Biosci., 215, 78–83. Reyes,A. et al. (2000) Where do rodents ﬁt? Evidence from the complete mito- Wei,D. et al. (2012) A novel hierarchical clustering algorithm for gene se- chondrial genome of Sciurus vulgaris. Mol. Biol. Evol., 17, 979–983. quences. BMC Bioinformatics, 13, 1–15. Robinson,D.F. and Foulds,L.R. (1981) Comparison of phylogenetic trees. Zhao,B. et al. (2011) A new distribution vector and its application in genome Math. Biosci., 53, 131–147. clustering. Mol. Phylogenet. Evol., 59, 438–443.

Journal

Bioinformatics – Oxford University Press

Published: Dec 15, 2017

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

K 2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics

K 2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

K 2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics

K 2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics

References (55)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies