A fast and accurate enumeration-based algorithm for haplotyping a triploid individual

A fast and accurate enumeration-based algorithm for haplotyping a triploid individual Background: Haplotype assembly, reconstructing haplotypes from sequence data, is one of the major computa- tional problems in bioinformatics. Most of the current methodologies for haplotype assembly are designed for diploid individuals. In recent years, genomes having more than two sets of homologous chromosomes have attracted many research groups that are interested in the genomics of disease, phylogenetics, botany and evolution. However, there is still a lack of methods for reconstructing polyploid haplotypes. Results: In this work, the minimum error correction with genotype information (MEC/GI) model, an important com- binatorial model for haplotyping a single individual, is used to study the triploid individual haplotype reconstruction problem. A fast and accurate enumeration-based algorithm enumeration haplotyping triploid with least difference (EHTLD) is proposed for solving the MEC/GI model. The EHTLD algorithm tries to reconstruct the three haplotypes according to the order of single nucleotide polymorphism (SNP) loci along them. When reconstructing a given SNP site, the EHTLD algorithm enumerates three kinds of SNP values in terms of the corresponding site’s genotype value, and chooses the one, which leads to the minimum difference between the reconstructed haplotypes and the sequenced fragments covering that SNP site, to fill the SNP loci being reconstructed. Conclusion: Extensive experimental comparisons were performed between the EHTLD algorithm and the well known HapCompass and HapTree. Compared with algorithms HapCompass and HapTree, the EHTLD algorithm can reconstruct more accurate haplotypes, which were proven by a number of experiments. Keywords: Bioinformatics, Sequence analysis, Single nucleotide polymorphism (SNP), Triploid, Haplotype, Minimum error correction with genotype information (MEC/GI), Genotype, Algorithm [1]. Therefore, computational methods to infer haplo - Background types are needed, for determining haplotypes is both As a large number of sequencing data are available, the time consuming and expensive by direct using biological investigation of genetic variations has become one of the experiments. In recent decade, the presented compu- main topics in bioinformatics. Single nucleotide poly- tational haplotyping algorithms generally fall into three morphism (SNP), the most widespread form of variation, categories [2]: (1) population haplotyping with genotype is believed to be the major genetic cause to phenotypic data [3, 4]; (2) population haplotyping with fragment data variability. A sequence of SNPs along a chromosome is [5]; (3) individual haplotyping with fragment data [6]. In referred to as a haplotype, which is more important for this paper, individual haplotyping problem is studied for complete comprehending the complex genetic polymor- a triploid individual. phisms than isolated SNPs. Increasing evidence shows The problem of individual haplotyping is also called as that haplotypes play a crucial role in studying the varia- haplotype assembly problem or haplotype reconstruc- tions relating to diseases prediction and gene expression tion problem. It has received extensive study in the recent decade. Most of the existing research results are regard- *Correspondence: wjlhappy@mailbox.gxnu.edu.cn ing diploid individuals [1, 7, 8], and there is still a lack of Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi research studies for reconstructing triploid ones. Sev- Normal University, Guilin 541004, China Full list of author information is available at the end of the article eral algorithms for assembling K-individual haplotypes © The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 2 of 10 were proposed. Based on the minimum error correction and the HapTree algorithms. The results proved that the (MEC) model and the minimum error correction with performance of algorithm EHTLD was superior to those genotype information (MEC/GI) model, Wang et  al. of algorithms HapCompass and HapTree. The rest of this [9] and Qian et  al. [10] respectively proposed a genetic paper is arranged as follows. “Definitions and notations” algorithm and a particle swarm optimization algorithm section provides definitions and notations used later. to reconstruct diploid individual haplotypes, both of “Algorithm EHTLD” section introduces the EHTLD algo- which can be adapted to reconstructing the K-individual rithm. “Experimental results” section presents the experi- ones. The code length of the two algorithms is very long mental results of the EHTLD, the HapCompass and the in practical applications, for it is equal to the number of HapTree algorithms. Some conclusions are drawn in the sequencing SNP fragments. This brings huge solution last section. space to these two algorithms and negatively affects the performance of them. Based on the minimum fragment Definitions and notations removal (MFR) model [11], an exact exponential algo- Triploid somatic cells contain three sets of chromo- rithm was introduced by Li et  al. [11]. The time com - somes, i.e., a triploid organism has three copies of each 2t 2 (K+1)t K+1 plexity of which is O(2 m n + 2 m ) , where m chromosome. Since haplotype consists of the sequence of denotes the number of SNP fragments, n denotes the all SNPs along a chromosome, a triploid individual owns number of SNP sites, and t is the max number of holes three haplotypes. It is commonly regarded that a SNP covered by a fragment. The algorithm can not perform locus shows merely two possible alleles, hence the major well with large m, n and t. In 2013, Aguiar et  al. [12] allele can be represented as ‘0’ and the minor one can be introduced the HapCompass model and the minimum represented as ‘1’ . A haplotype can be encoded as a string weighted edge removal (MWER) optimization for hap- over a 2-letter alphabet {0, 1} instead of four real bases lotyping polyploid genomes. Algorithm HapCompass {A,T,C,G}. A genotype is the conflation of three haplo - aims to remove a minimal weighted set of edges from types on the homologous chromosomes. When three the compass graph such that a unique phasing may be alleles at a SNP site have identical values, this SNP site is constructed. The HapCompass algorithm performs on called a homozygous site, otherwise it is called a heterozy- T T the spanning-tree cycle basis of the compass graph to gous site. For example, (000) or (111) represents the iteratively delete errors. However, in the same conflict genotype value at a homozygous SNP site, while (001) cycle basis, there may be more than one edge having the or (011) represents the genotype value at a heterozy- same absolute value of weight. It may lead to the wrong gous SNP site. Suppose that m aligned SNP fragments, SNP phasing to select the removed edge randomly. In coming from three haplotypes of length n, are gener- 2014, Berger et al. [13] described a maximum-likelihood ated by DNA sequencing experiments. Let M denote an estimation framework HapTree for haplotyping a sin- m × n SNP matrix over the alphabet {0, 1, −} (− denotes gle polyploid. It can obtain better performance than the value is null). As shown in Fig.  1a, each row repre- the HapCompass algorithm [13]. In 2014, based on the sents a SNP fragment, each column represents a SNP MEC model, Wu et al. [14] presented a genetic algorithm site, and each entry M[i, j] denotes the SNP allele of the GTIHR for reconstructing triploid haplotypes. Since the ith fragment at the jth SNP site. Let G = (g , g , . . . , g ) 1 2 n code length of algorithm GTIHR equals to the number denote the genotype matrix corresponding to M, where of heterozygous sites in haplotype, the performance of g = (g , g , g ) (g ∈{0, 1}, k = 1, 2, 3, j = 1, 2, . .. , n) j j1 j2 j3 jk the GTIHR algorithm is negatively affected by haplotype length and heterozygous rate. In this paper, the triploid individual haplotype assembly problem is studied based on the MEC/GI model. An enumeration-based algorithm enumeration haplotyping triploid with least difference (EHTLD) is proposed for solving it. Algorithm EHTLD reconstructs the three haplotypes according to the order of SNP loci along them. For reconstructing the three alleles of a given site, it enumerates three kinds of SNP values by using the site’s genotype, and chooses the kind of value resulting in the minimum difference between the (a) (b) reconstructed haplotypes and the sequenced fragments Fig. 1 An example of SNP fragment matrix and genotype matrix. a covering that SNP site. The experimental comparisons SNP fragment matrix M , b genotype matrix G were performed between the EHTLD, the HapCompass Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 3 of 10 denotes the genotype value at the jth SNP site. Figure 1b (j = 2, 3,…,n) SNP site in terms of its genotype, and chooses shows an example of the genotype matrix. the one leading to the minimum difference between the Given a column M[−, j] (j = 1, 2, . . . , n) of the matrix reconstructed haplotypes and the fragments covering the M, define r(j) as the set of fragments that cover the jth jth site. After this iteration process is completed, three ′ ′ ′ ′ column. Given a row M[i, −] (i = 1, 2, . . . ,m) of the haplotypes h = (h , h h ) having only heterozygous SNP 1 2 3 matrix M, let l(i) indicate the index of the leftmost sites are built, for only heterozygous SNPs are remained SNP j (j = 1, 2, . . . , n) such that M[i, j] �=− . Given two in the preprocessed matrices. Finally, h is augmented by strings X = x , x , . . . , x and Y = y , y , . . . , y , where inserting the SNPs discarded in preprocessing step and 1 2 n 1 2 n x , y ∈{0, 1, −} (j = 1, 2, . . . , n) , the distance metric the final haplotypes h is obtained. Some key steps of the j j HD(X,  Y,  s,  e) is defined as Formula  (1). Take fragment EHTLD algorithm will be introduced in detail as follows. f (10 − 011) and fragment f (01010−) in Fig.  1a for 1 2 example. HD(f , f ,2,5) = 3. 1 2 Preprocessing Since homozygous sites do not contribute to haplotype reconstruction, they are deleted from matrices M and HD(X, Y , s, e) = d(x , y ), (1 ≤ s ≤ e ≤ n) j j G to improve the efficiency of assembly. Drop column j=s j(j = 1, 2, . . . , n) from G where g = g = g , and the (1) j1 j2 j3 corresponding column is also dropped from matrix M. where The deleted column j ( j = 1, 2, . . . , n ) is recorded as g . j1 After dropping columns from matrix M, some SNP frag- 1 if x �= −,y �= −, and x �= y j j j j d(x , y ) = j j ments with only—elements are also deleted, for they are 0 otherwise. also redundant information. The remained SNPs are all (2) heterozygous sites. For convenience of description, the Let the strings X and Y be regarded as two SNP frag- preprocessed matrices are still denoted by M and G. Sort ments, they are said to compatible if HD(X, Y,  1, n) = 0. the rows of M by their l(.) values in ascending order. For The larger HD(X,  Y,  1,  n) is, the greater the probability each column j ( j = 1, 2, . . . , n ) of M, calculate set r(j) of fragments X and Y coming from different chromo - which contains the rows covering the jth column. some copies or having sequencing errors is. If there are no errors in the data, the rows of M can be divided Enumerating and computing into three classes of compatible fragments. Three hap - The EHTLD algorithm iteratively reconstructs each hete - lotypes can be reconstructed by assembling the frag- ′ ′ ′ ′ rozygous site of haplotypes h = (h , h , h ). Each step 1 2 3 ments in the three classes. In this situation, the SNP concerns reconstructing the current empty site, starting matrix M is called feasible or error-free. Given haplotype from the left first site. Suppose that the first j − 1 sites of h = (h , h , h )(h = (h , h , . . . , h ), k = 1, 2, 3) and 1 2 3 k k1 k2 kn ′ ′ the three haplotypes h have already been filled, i.e., ( h , T k1 genotype G = (g , g , ... , g )(g = (g , g , g ) , j = 1, 2, ... , n) , 1 2 n j j1 j2 j3 ′ ′ h ,…, h ) (k = 1, 2, 3, j = 2, 3,…, n) has been assem- 3 3 k2 kj−1 if h = g (j = 1, 2, . . . , n), h and G are kj jk k=1 k=1 regarded as compatible. bled, and the jth site is under consideration. The calculat - Based on the above mentioned concepts for the haplo- ing method comprises the following two steps. type reconstruction problem, the MEC/GI model can be described as follows [9]: 1. Enumerating three kinds of possible values according MEC/GI: Given a SNP matrix M and a genotype matrix to g : G, correct the minimum number of entries in M (0 into 1 and vice versa) so that the resulting matrix is feasible, and a. if g = 1, the three kinds of values are jk k=1 the three reconstructed haplotypes are compatible with ′ ′ ′ ′ ′ ′ ( h = 0, h = 0, h = 1 ), ( h = 0, h = 1, h = 0 ) the genotype G. 1j 2j 3j 1j 2j 3j ′ ′ ′ and ( h = 1, h = 0, h = 0). 1j 2j 3j Algorithm EHTLD b. if g = 2, the three kinds of values are jk k=1 ′ ′ ′ ′ ′ ′ In this section, the EHTLD algorithm is described. The ( h = 0, h = 1, h = 1), ( h = 1, h = 0, h = 1) 1j 2j 3j 1j 2j 3j input consists of a SNP matrix M and a genotype matrix ′ ′ ′ and ( h = 1, h = 1, h = 0). 1j 2j 3j G. The output is three assembled haplotypes h = (h , h , h ) 1 2 3 of length n. In the first step of this algorithm, the matrices ′ ′ ′ ′ ′ ′ 2. Given the jth site value ( h , h ,h ), let D(h , h , h ) 1j 2j 3j 1j 2j 3j M and G are preprocessed by removing the homozygous measure the difference between the reconstructed SNPs, which do not play a role in assembling haplotypes. haplotypes and the fragments covering the jth site, as Subsequently, enumerates three kinds of values for the jth Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 4 of 10 defined in Formula  (3). From the three kinds of val - Algorithm EHTLD ues enumerated in step (1), choose the one with the Input: a SNP matrix M ,a genotype matrix G m×n Output: three reconstructed haplotypes h=(h ,h ,h ) 1 2 3 minimum D(.) value. 1. preprocess M and G 2. h =g (k=1,2,3) 1k k1 3. for j=2,3,... ,n do ′ ′ ′ ′ 4. mindif=m×n //initialize mindif as themaximum value D(h , h , h ) = min{HD(h , M[i, −], l(i), j)|k = 1, 2, 3} 1j 2j 3j k 5. if ( g =1) then jk k=1 i∈r(j) 6. if (D(0, 0, 1) <mindif) then (3) 7. h =h =0, h =1, mindif=D(0,0,1) 1j 2j 3j 8. if (D(0, 1, 0) <mindif) then In the following, we give an example for enumerating and 9. h =h =0, h =1, mindif=D(0,1,0) 1j 3j 2j 10. if (D(1, 0, 0) <mindif) then computing by using the matrices in Fig.  1. As shown in 11. h =h =0, h =1, mindif=D(1,0,0) 2j 3j 1j Fig. 2, assume that the first three sites of the three haplo - 12. elseif( g =2) then jk k=1 ′ ′ ′ ′ ′ ′ types h = (h , h , h ) have been reconstructed, i.e., h = (h 13. if(D(0,1,1)<mindif) then 1 2 3 1 ′ ′ 14. h =h =1, h =0, mindif=D(0,1,1) (011), h (010), h (100)), and the fourth site is under 2j 3j 1j 2 3 15. if(D(1,0,1)<mindif) then reconstruction. The genotype of the fourth SNP site is 16. h =h =1, h =0, mindif=D(1,0,1) 1j 3j 2j T ′ ′ ′ ′ (101) , hence haplotypes h = (h , h , h ) have the fol- 17. if(D(1,1,0)<mindif) then 1 2 3 18. h =h =1, h =0, mindif=D(1,1,0) 1j 2j 3j lowing three kinds of possible values on the fourth SNP 14. Augment h =(h ,h ,h ), andget thefinal result h=(h ,h ,h ) 1 2 3 1 2 3 ′ ′ ′ ′ ′ ′ site: ( h = 0, h = 1, h  = 1), ( h  = 1, h  = 0, h  = 1) 15. output h 14 24 34 14 24 34 ′ ′ ′ and ( h   =  1, h   =  1, h   =  0). The values of D(0,1,1), Fig. 3 Algorithm EHTLD 14 24 34 D(1,0,1) and D(1,1,0) are computed respectively accord- ing to the fragments in Fig.  1a and the three haplotypes ′ ′ ′ ′ h  = (h , h , h ). D(0,1,1) = 3, D(1,0,1) = 2, D(1,1,0) = 1. 1 2 3 ′ ′ Now the time complexity of the EHTLD algorithm is Because D(1,1,0) is the smallest, ( h   =  1, h   =  1, 14 24 ′ ′ ′ ′ ′ discussed. In preprocessing, dropping redundant infor- h  = 0) is chosen, and h  = (h (0111), h (0101), h (1000)) 34 1 2 3 mation and calculating set r(·) take time O(m × n) , are reconstructed. sorting the rows of M takes time O(m × logm) . D uring enumerating and computing, three haplotypes with only Augmenting heterozygous SNP sites are reconstructed, which takes The homozygous SNPs that are deleted by preprocess - time O(c × n × len) , here c denotes the fragments cover- ing must be reinserted. The reconstructed haplotypes ′ ′ ′ ′ age, and len represents the average length of fragments. h  = (h , h , h ) are augmented by the bits of the columns 1 2 3 In augmenting, the discarded columns can be reinserted removed, and h = (h , h , h ) are built. For a given posi- 1 2 3 ′ ′ ′ by scanning the columns only once, which takes time tion j, haplotypes h , h and h are inserted with g when j1 1 2 3 O(n). In summary, the time complexity of the algorithm the discarded column j is recorded as g . Based on the j1 is O(m × n + m × logm + c × n × len). above mentioned steps, the EHTLD algorithm for assem- bling triploid haplotypes is depicted in Fig. 3. Experimental results In this section, the EHTLD algorithm is compared with two state-of-the-art algorithms, i.e., the HapCom- pass [12] and the HapTree [13] algorithms. Algorithms EHTLD and HapCompass were implemented on a Win- dows 7 and the compiler was Microsoft Visual C# 2012. The Python program HapTree (v0.1), downloaded from http://group s.csail .mit.edu/cb/haptr ee/, was imple- mented on a Linux system. All the tests below are con- ducted on a 64 bit PC with Intel Core i5 2.50GHz CPU and 6GB RAM. One hundred data sets were generated for each parameter setting. The average over 100 runs at each parameter setting was calculated and presented. The vector error (VE) [13], the reconstruction rate (RR) [1, 9, 15] and the minimum error correction (MEC) score [12] were used to measure the performance of the algo- rithms. The vector error, generalized from switch error, is a special kind of measurement for evaluating the accu- Fig. 2 An example for enumerating and computing racy of polyploid phasing. Given three reconstructed Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 5 of 10 haplotypes, the vector error is equal to the minimum fragments and m mate-pair SNP fragments were gener- number of segments on them for which a switch must ated. A single fragment had a length ranging from f to min occur to correspond with the three true haplotypes, i.e., f , and a mate-pair fragment had a length of n/10. The max the minimum number of segments a reconstructed phase coverage was c/2 for both kinds of fragments, and the and the true phase have in common [13]. total coverage was c. Reading errors were planted into the The reconstruction rate (RR), which measures the simi - fragments with probability p . In practical applications larity degree between the pair of true haplotypes and of shotgun sequencing, the values of f and f are 3 min max the pair of reconstructed ones, is a widely adopted index and 7, respectively, c ranges from 5 to 10, and p ranges to evaluate diploid phasing [1, 9, 15]. For triploid phas- between 2 and 5% [2, 18]. In the following tables, algo- ing, we generalized it to calculate the similarity degree rithms EHTLD, HapCompass and HapTree are abbrevi- between the three true haplotypes and the three recon- ated to EH, HC and HT, respectively. structed haplotypes. Assuming that h  = (h , h , h ) are In Table  1, 12 sets of parameters were set in deal- 1 2 3 ˆ ˆ ˆ ˆ the original haplotypes, and h = (h , h , h ) are the recon- ing with error rate p , where c =  10, f  =  3, f  =  7, 1 2 3 s min max structed haplotypes. RR is defined as the proportion of n =  100 and d =  0.3. It can be seen from this table that nucleotides that are reconstructed correctly, as shown in algorithm EHTLD can achieve much higher reconstruc- Formula (4). tion rates, smaller vector errors and MEC scores than 3 3 3 min{ r |i , j ∈{1, 2, 3}, i = j = 6} i j k k k k k=1 k k k=1 k=1 ˆ (4) RR(h, h) = 1 − , 3n where r  = HD(h , h , 1, n). the HapCompass and the HapTree algorithms in every p i j i j s k k k k The minimum error correction (MEC) score measures setting. When p  = 0, algorithm EHTLD achieves recon- the minimum number of mismatches between the recon- struction rate of 0.97, which is higher than both Hap- ˆ ˆ ˆ ˆ structed haplotypes h  = (h , h , h ) and the SNP matrix Compass and HapTree algorithms by about 9.0%, and 1 2 3 M, as shown in Formula (5). vector error of 3, which is less than them by 8 times or so. In particular, the MEC score obtained by algorithm EHTLD reaches zero, while those of the other two algo- ˆ ˆ MEC(M, h) = min{HD(M[i, −], h , 1, n)|k = 1, 2, 3}. rithms are 126 and 57. Although the increase of p plays i=1 (5) stronger negative effect on algorithm EHTLD than on algorithms HapCompass and HapTree, the EHTLD algo- To the best of our knowledge, the real triploid haplotype rithm still obtains better performance than algorithms data are not available in the public domain, Aguiar et al. HapCompass and HapTree with high error rate. When [12] and Berger et al. [13] used computer-generated sim- p   =  0.2, the RRs of algorithms EHTLD, HapCompass ulated data. Therefore, simulated data were also used in and HapTree are 0.92, 0.89 and 0.88, the vector errors of our experiments. Three simulation haplotypes h  = (h , them are 14, 31 and 26, and the MEC scores of them are h , h ) of length n were created by using the following 335, 407 and 364, respectively. The three algorithms all 2 3 method. h was generated at random firstly. h was gener- execute very efficiently when p ranges from 0 to 0.2. 1 2 s ated by flipping each bit of h randomly so that the ham- In Table  2, nine sets of parameters were set in deal- ming distance between h and h was equal to a given ing with coverage c, where n  =  100, f   =  3, f   =  7, 1 2 min max parameter d. h also had the same length and h was p  = 0.05 and d = 0.3. From Table 2 we observe that algo- 3 3j s set to h or h (j = 1, 2, . . . , n) with uniform probabil- rithm EHTLD still obtains the highest reconstruction 1j 2j ity. With regard to fragment data, two kinds of sequenc- rate and the smallest vector error and MEC score under ing simulators, CELSIM [16] and MetaSim [17], were different coverage settings. When the coverage is 2, the adopted to generate simulation fragments, and the test- RRs of algorithms EHTLD, HapCompass and HapTree ing datasets were called as CELSIM instances and Meta- are 0.94, 0.89 and 0.86, the vector errors of them are 10, Sim instances, respectively. 30 and 29, and the MEC scores of them are 16, 40 and 19. When the coverage increases, the RR of algorithm EHTLD increases gradually, while that of algorithm Hap- CELSIM instances Compass fluctuates between 0.89 and 0.90, and that of In this section, the evaluation of the EHTLD, the Hap- algorithm HapTree varies between 0.85 and 0.91. Gener- Compass and the HapTree algorithms is described ally, the increase of coverage plays a positive role in the by using CELSIM instances. CELSIM was invoked to improvement of algorithm performance, for much more simulate shotgun sequencing platform. m single SNP 1 Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 6 of 10 original fragment information can be utilized. How- from this table, algorithm EHTLD still obtains superior ever, it is not apparent for algorithms HapCompass and results to the other two algorithms under each param- HapTree. eter setting. With the increase of haplotype length, the Table  3 compares the performance of the three algo- three algorithms experience a gradual degradation in rithms with different haplotype lengths n, where c  =  10, the performance. When n is 100, the RR of algorithm f  = 3, f  = 7, p  = 0.05 and d = 0.3. As can be seen EHTLD is 0.97, which is higher than both HapCompass min max s Table 1 Comparison with different error rates (CELSIM instance) p RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 0 0.97 0.89 0.89 3 29 27 0 126 57 0.01 0.02 0.01 0.01 0.97 0.89 0.88 3 31 28 17 147 64 0.01 0.03 0.01 0.02 0.97 0.89 0.88 4 30 27 34 152 79 0.01 0.03 0.01 0.03 0.97 0.89 0.90 4 31 26 51 180 83 0.01 0.03 0.01 0.04 0.97 0.90 0.89 4 29 27 69 179 96 0.01 0.03 0.01 0.05 0.97 0.90 0.89 4 29 26 85 194 117 0.01 0.03 0.01 0.06 0.96 0.89 0.88 5 31 26 102 210 132 0.01 0.03 0.01 0.07 0.96 0.89 0.88 5 30 26 122 225 157 0.01 0.03 0.01 0.08 0.96 0.90 0.88 6 29 24 138 238 173 0.01 0.03 0.01 0.09 0.95 0.89 0.87 6 30 28 154 254 181 0.01 0.03 0.01 0.1 0.95 0.90 0.89 7 29 25 172 263 206 0.01 0.03 0.01 0.2 0.92 0.89 0.88 14 31 26 335 407 364 0.01 0.03 0.01 Table 2 Comparison with different coverages (CELSIM instance) c RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 2 0.94 0.89 0.86 10 30 29 16 40 19 0.01 0.01 0.01 3 0.95 0.89 0.85 9 31 28 25 60 30 0.01 0.02 0.01 4 0.95 0.89 0.91 7 31 28 34 79 38 0.01 0.02 0.01 5 0.96 0.89 0.87 6 30 26 41 96 46 0.01 0.02 0.01 6 0.96 0.89 0.87 5 30 25 52 120 58 0.01 0.02 0.01 7 0.96 0.90 0.89 5 29 26 58 133 65 0.01 0.02 0.01 8 0.96 0.89 0.89 5 30 25 67 157 75 0.01 0.02 0.01 9 0.96 0.89 0.89 4 30 25 75 175 84 0.01 0.03 0.01 10 0.97 0.90 0.89 4 29 26 85 194 117 0.01 0.03 0.01 Table 3 Comparison with different haplotype lengths (CELSIM instance) n RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 100 0.97 0.90 0.89 4 29 26 85 194 117 0.01 0.03 0.01 200 0.96 0.89 0.90 12 61 57 136 305 169 0.04 0.16 0.04 300 0.95 0.89 0.88 29 92 90 181 387 230 0.11 0.20 0.09 500 0.93 0.88 0.87 57 156 148 271 573 337 0.56 0.83 0.72 800 0.92 0.88 0.86 100 256 242 398 855 492 2.21 2.59 2.30 1000 0.92 0.88 0.86 136 322 314 479 1029 595 4.36 4.67 4.51 Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 7 of 10 and HapTree algorithms by about 7.8%, the vector error the MEC scores drop to 65, 86 and 76, respectively. The of algorithm EHTLD is 4, which is less than algorithms decrease of the MEC scores explain the shorter the frag- HapCompass and HapTree by about 86 and 85%, the ments are, the more probability the fragments agree with MEC score of algorithm EHTLD is 85, which is less the reconstructed haplotypes. The change of single frag - than algorithms HapCompass and HapTree by about 56 ment length plays little effect on the running time of the and 21%, respectively. When n is 1000, the RRs of them three algorithms. decrease to 0.92, 0.88 and 0.86, the vector errors of them Table  5 compares the three algorithms with different increase to 136, 322 and 314, and the MEC scores of hamming distances d, where f  =  3, f  =  7, c =  10, min max them go up to 479, 1029 and 595, respectively. The run - p   =  0.05 and n  =  100. It can be seen from this table ning time of the three algorithms increases significantly that the performance of algorithm EHTLD remains rel- with the increase of n. When n = 100, the running time atively stable under different d, while that of algorithms of algorithms EHTLD, HapCompass and HapTree is 0.01, HapCompass and HapTree suffers strong negative influ - 0.03 and 0.01 s, respectively, while n = 1000, it increases ence with the increase of hamming distance. For exam- to 4.36, 4.67 and 4.51 s, respectively. ple, when d varies from 0.1 to 1.0, the RR of the EHTLD In Table 4, three groups of parameters were set in deal- algorithm fluctuates between 0.97 and 1.0, while those ing with single fragment length range [f , f ] , where of the HapCompass and the HapTree algorithms achieve min max c  =  10, p   =  0.05, n  =  100 and d  =  0.3. As shown in decrease rate up to 35 and 20%, respectively. Table  4, algorithm EHTLD still performs the best under different parameter settings. When [f , f ]= [3, 7] , MetaSim instances min max the RRs of algorithms EHTLD, HapCompass and Hap- MetaSim was used to simulate 454 sequencing plat- Tree are 0.97, 0.90 and 0.89, the vector errors of them form. m SNP fragments, including m = (1 − p ) × m 1 m are 4, 29 and 26, and the MEC scores of them are 85, 194 single ones and m   =  p × m mate-pair ones, were 2 m and 117, respectively. With the decrease of the length of generated, where p denoted the probability of mate- single fragment, the decline of fragments overlap might pair fragments and was set to 0.25 in the experiments. be disadvantageous for haplotype reconstruction. When A single fragment had an expected length of f _len , [f , f ]= [1, 2] , the RRs decrease to 0.94, 0.90 and and a mate-pair fragment had a length of 3 × f _len . min max 0.88, the vector errors increase to 14, 30 and 28, and Since each mate-pair fragment consists of two single Table 4 Comparison with different single fragment length ranges (CELSIM instance) f , f RR VE MEC Running time (s) min max EH HC HT EH HC HT EH HC HT EH HC HT [3, 7] 0.97 0.90 0.89 4 29 26 85 194 117 0.01 0.03 0.01 [2, 4] 0.95 0.89 0.88 8 30 28 67 129 90 0.01 0.03 0.01 [1, 2] 0.94 0.90 0.88 14 30 28 65 86 76 0.02 0.03 0.01 Table 5 Comparison with different hamming distances (CELSIM instance) d RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 0.1 0.99 0.97 0.96 5 9 8 88 109 92 0.01 0.02 0.01 0.2 0.97 0.93 0.94 6 17 16 86 140 106 0.01 0.03 0.01 0.3 0.97 0.90 0.89 4 28 26 85 194 117 0.01 0.03 0.01 0.4 0.97 0.86 0.85 3 38 35 86 260 145 0.02 0.06 0.02 0.5 0.97 0.82 0.84 1 46 42 88 326 201 0.04 0.08 0.03 0.6 0.97 0.79 0.82 1 57 49 89 393 263 0.05 0.10 0.05 0.7 0.98 0.74 0.81 0 71 63 92 478 346 0.07 0.16 0.06 0.8 0.98 0.71 0.80 0 78 72 90 553 421 0.10 0.20 0.08 0.9 0.99 0.67 0.78 0 89 78 90 633 507 0.12 0.25 0.10 1.0 1.00 0.63 0.77 0 94 85 91 722 589 0.15 0.29 0.12 Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 8 of 10 fragments of the same haplotype, the coverage c equals to HapCompass and the HapTree algorithms under differ - [(m + 2 × m ) × f _len]/3 × n. ent c, n, f _len and d settings. 1 2 Table  6 gives the comparisons with coverage ranging from 5 to 50, where n =  100, f _len = 5, and d = 0.3. In Conclusion Table  7, six sets of experimental results under different The minimum error correction with genotype infor - haplotype length settings are displayed, where c  =  20, mation (MEC/GI) model is one of the important com- f _len = 5, and d = 0.3. In Table  8, three instances were putational models for solving single individual SNP generated in dealing with single fragment length f _len , haplotyping problem. In this paper, an enumeration- where n  =  100, c  =  20, and d  =  0.3. In Table  9, the test based algorithm EHTLD is presented for haplotyp- results under different parameter d are shown, where ing a triploid single individual by using this model. n = 100, c = 20, and f _len = 5. The experimental results Algorithm EHTLD reconstructs the three haplotypes obtained from MetaSim instances indicate that algo- according to the order of SNP loci along them. For a rithm EHTLD still obtain much higher reconstruction SNP site being reconstructed, the EHTLD algorithm rates, smaller vector errors and MEC scores than the enumerates three possible values in terms of the site’s Table 6 Comparison with different coverages (MetaSim instance) c RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 5 0.94 0.89 0.88 10 30 26 80 134 99 0.01 0.02 0.01 10 0.94 0.90 0.88 8 29 26 144 252 189 0.01 0.02 0.01 15 0.95 0.90 0.88 8 30 26 249 405 287 0.01 0.03 0.01 20 0.95 0.90 0.88 7 30 26 299 513 343 0.02 0.04 0.02 25 0.95 0.89 0.89 7 28 25 383 648 446 0.03 0.05 0.03 30 0.95 0.90 0.89 7 30 25 458 775 531 0.03 0.07 0.03 35 0.95 0.89 0.89 7 30 26 532 903 623 0.05 0.09 0.04 40 0.95 0.90 0.89 7 28 26 600 1028 711 0.07 0.12 0.06 45 0.95 0.90 0.90 7 29 24 700 1193 807 0.10 0.16 0.09 50 0.95 0.90 0.90 7 28 25 758 1293 879 0.13 0.19 0.11 Table 7 Comparison with different haplotype lengths (MetaSim instance) n RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 100 0.95 0.90 0.88 7 30 26 299 513 343 0.02 0.04 0.02 200 0.93 0.89 0.90 17 62 57 550 985 620 0.24 0.31 0.21 300 0.93 0.89 0.89 25 93 79 788 1441 896 0.56 0.63 0.60 500 0.93 0.88 0.89 45 156 142 1284 2392 1440 2.43 2.83 2.64 800 0.92 0.88 0.89 75 254 238 1998 3791 2254 9.65 10.50 9.89 1000 0.92 0.88 0.88 91 319 305 2484 4731 2804 17.43 17.95 17.47 Table 8 Comparison with different single fragment lengths (MetaSim instance) f_len RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 10 0.96 0.90 0.88 5 30 26 248 404 307 0.01 0.04 0.01 5 0.95 0.90 0.88 7 30 26 299 513 343 0.02 0.04 0.02 3 0.93 0.89 0.89 14 31 28 129 246 201 0.03 0.06 0.03 Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 9 of 10 Table 9 Comparison with different hamming distances (MetaSim instance) d RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 0.1 0.98 0.97 0.87 4 10 28 347 387 375 0.01 0.01 0.01 0.2 0.96 0.93 0.88 6 18 26 309 427 353 0.01 0.03 0.01 0.3 0.95 0.90 0.88 7 30 26 299 513 343 0.02 0.04 0.02 0.4 0.95 0.86 0.89 6 35 25 294 629 330 0.05 0.11 0.02 0.5 0.94 0.82 0.88 5 48 25 289 752 325 0.09 0.16 0.03 0.6 0.94 0.78 0.88 3 60 26 288 895 320 0.12 0.26 0.03 0.7 0.95 0.75 0.88 3 69 26 296 1060 336 0.18 0.37 0.04 0.8 0.95 0.71 0.88 5 77 26 321 1227 366 0.22 0.42 0.06 0.9 0.96 0.67 0.88 3 86 25 290 1359 333 0.27 0.50 0.09 1.0 0.96 0.63 0.88 2 94 25 277 1523 321 0.35 0.61 0.11 Competing interests genotype, and chooses the one leading to the mini- The authors declare that they have no competing interests. mum difference between the reconstructed haplotypes and the fragments covering that SNP site. The recon - Availability of data and materials Not applicable. structed alleles of a SNP site mainly depend on the fragments which cover the site, and are little affected by Consent for publication other former reconstructed alleles. Therefore, the for - Not applicable. mer wrongly reconstructed SNP alleles would not affect Ethics approval and consent to participate the latter reconstructed SNP alleles, i.e., reconstructed Not applicable. errors on the former SNP alleles would not spread to Funding the latter ones. The kind of enumeration strategy can This research is supported by the National Natural Science Foundation of also be apply to reconstruct haplotypes of other ploidy, China under Grant Nos. 61363035, 61762015 and 61502111, Guangxi Natural which will be studied in the future. Science Foundation under Grant No. 2015GXNSFAA139288, Research Fund of Guangxi Key Lab of Multi-source Information Mining & Security Nos. 14-A-03- Compared with algorithms HapCompass and Hap- 02 and 15-A-03-02, “Bagui Scholar” Project Special Funds. Tree by using two kinds of simulated sequencing data, the EHTLD algorithm can get the highest reconstruc- Publisher’s Note tion rates, the smallest vector errors and MEC scores, Springer Nature remains neutral with regard to jurisdictional claims in pub- which was tested by a number of experiments. In addi- lished maps and institutional affiliations. tion, algorithm EHTLD still achieves satisfying per- Received: 9 January 2017 Accepted: 24 May 2018 formance even with high error rate, low fragment coverage, or long haplotype length. All of these advan- tages may contribute to the practical application of the EHTLD algorithm when haplotyping a triploid single References individual. 1. Geraci F. A comparison of several algorithms for the single indi- vidual SNP haplotyping reconstruction problem. Bioinformatics. Authors’ contributions 2010;26(18):2217–25. JW designed and implemented the algorithms and methods, QZ contributed 2. Wu JL, Liang BB. A fast and accurate algorithm for diploid individual on experimental design and data processing. JW wrote the most part of the haplotype reconstruction. J Bioinform Comput Biol. 2013;11(4):1350010. manuscript. QZ helped in data preparing and modifying the manuscript. 3. Clark AG. Inference of haplotypes from PCR-amplified samples of diploid All the work was guided by JW in the whole process. Both authors read and populations. Mol Biol Evol. 1990;7:111–22. approved the final manuscript. 4. Gusfield D. Inference of haplotypes from samples of diploid populations: complexity and algorithms. J Comput Biol. 2001;8:305–23. Author details 5. O’Neil ST, Emrich SJ. Haplotype and minimum-chimerism consensus Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi determination using short sequence data. BMC Genomics. 2012;13(Suppl Normal University, Guilin 541004, China. College of Computer Science 2):S4. and Information Technology, Guangxi Normal University, Guilin 541004, China. 6. Lancia G, Bafna V, Istrail S, Lippert R, Schwartz R. SNPs problems, complex- ity and algorithms. In: Heide FM, editor. Proceeding on 9th European Acknowledgements symposium on algorithms, Aarhus, Denmark; 2001. p. 182–93. The authors are grateful to anonymous referees for their helpful comments 7. Chen ZZ, Deng F, Wang LS. Exact algorithms for haplotype assembly from and to Professor Gene Myers for his kindly providing the source codes of CEL- whole-genome sequence data. Bioinformatics. 2013;29(16):1938–45. SIM. This research is supported by Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing. Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 10 of 10 8. Mazrouee S, Wang W. FastHap: fast and accurate single individual 14. Wu JL, Chen XX, Li XC. Haplotyping a single triploid individual based on haplotype reconstruction using fuzzy conflict graphs. Bioinformatics. genetic algorithm. Bio-Med Mater Eng. 2014;24(6):3753–62. 2014;30(ECCB):i371–8. 15. Xie MZ, Wang JX, Chen JE. A high accurate model of the individual hap- 9. Wang RS, Wu LY, Li ZP, Zhang XS. Haplotype reconstruction from lotyping problem based on weighted SNP fragments and genotype with SNP fragments by minimum error correction. Bioinformatics. errors. Bioinformatics. 2008;24(ISMB):i105–13. 2005;21(10):2456–62. 16. Myers G. A dataset generator for whole genome shotgun sequencing. In: 10. Qian WY, Yang YJ, Yang NN, Chun L. Particle swarm optimization for snp Proceedings of the 7th international conference on intelligent systems haplotype reconstruction problem. Appl Math Comput. 2008;196:266–72. for molecular biology; 1999. p. 202–10. 11. Li ZP, Wu LY, Zhao YY, Zhang XS. A dynamic programming algorithm 17. Richter DC, Ott F, Auch AF, Schmid R, Huson DH. MetaSim—a sequencing for the k-haplotyping problem. Acta Math Sinic English Series. simulator for genomics and metagenomics. PLOS ONE. 2008;3(10):e3373. 2006;22:405–12. 18. Panconesi A, Sozio M. Fast hare: a fast heuristic for single individual SNP 12. Aguiar D, Istrail S. Haplotype assembly in polyploid genomes and identi- haplotype reconstruction. In: Proceedings of 4th workshop on algorithms cal by descent shared tracts. Bioinformatics. 2013;29(13):i352–60. in bioinformatics; 2004. p. 266–77. 13. Berger E, Yorukoglu D, Peng J, Berger B. HapTree: a novel Bayesian frame- work for single individual polyplotyping using NGS data. PLoS Comput Biol. 2014;10(3):e1003502. Ready to submit your research ? Choose BMC and benefit from: fast, convenient online submission thorough peer review by experienced researchers in your field rapid publication on acceptance support for research data, including large and complex data types • gold Open Access which fosters wider collaboration and increased citations maximum visibility for your research: over 100M website views per year At BMC, research is always in progress. Learn more biomedcentral.com/submissions http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Algorithms for Molecular Biology Springer Journals

A fast and accurate enumeration-based algorithm for haplotyping a triploid individual

Free
10 pages

Loading next page...
 
/lp/springer_journal/a-fast-and-accurate-enumeration-based-algorithm-for-haplotyping-a-io8FkH0SLe
Publisher
Springer Journals
Copyright
Copyright © 2018 by The Author(s)
Subject
Life Sciences; Bioinformatics; Computational Biology/Bioinformatics; Physiological, Cellular and Medical Topics; Algorithms
eISSN
1748-7188
D.O.I.
10.1186/s13015-018-0129-0
Publisher site
See Article on Publisher Site

Abstract

Background: Haplotype assembly, reconstructing haplotypes from sequence data, is one of the major computa- tional problems in bioinformatics. Most of the current methodologies for haplotype assembly are designed for diploid individuals. In recent years, genomes having more than two sets of homologous chromosomes have attracted many research groups that are interested in the genomics of disease, phylogenetics, botany and evolution. However, there is still a lack of methods for reconstructing polyploid haplotypes. Results: In this work, the minimum error correction with genotype information (MEC/GI) model, an important com- binatorial model for haplotyping a single individual, is used to study the triploid individual haplotype reconstruction problem. A fast and accurate enumeration-based algorithm enumeration haplotyping triploid with least difference (EHTLD) is proposed for solving the MEC/GI model. The EHTLD algorithm tries to reconstruct the three haplotypes according to the order of single nucleotide polymorphism (SNP) loci along them. When reconstructing a given SNP site, the EHTLD algorithm enumerates three kinds of SNP values in terms of the corresponding site’s genotype value, and chooses the one, which leads to the minimum difference between the reconstructed haplotypes and the sequenced fragments covering that SNP site, to fill the SNP loci being reconstructed. Conclusion: Extensive experimental comparisons were performed between the EHTLD algorithm and the well known HapCompass and HapTree. Compared with algorithms HapCompass and HapTree, the EHTLD algorithm can reconstruct more accurate haplotypes, which were proven by a number of experiments. Keywords: Bioinformatics, Sequence analysis, Single nucleotide polymorphism (SNP), Triploid, Haplotype, Minimum error correction with genotype information (MEC/GI), Genotype, Algorithm [1]. Therefore, computational methods to infer haplo - Background types are needed, for determining haplotypes is both As a large number of sequencing data are available, the time consuming and expensive by direct using biological investigation of genetic variations has become one of the experiments. In recent decade, the presented compu- main topics in bioinformatics. Single nucleotide poly- tational haplotyping algorithms generally fall into three morphism (SNP), the most widespread form of variation, categories [2]: (1) population haplotyping with genotype is believed to be the major genetic cause to phenotypic data [3, 4]; (2) population haplotyping with fragment data variability. A sequence of SNPs along a chromosome is [5]; (3) individual haplotyping with fragment data [6]. In referred to as a haplotype, which is more important for this paper, individual haplotyping problem is studied for complete comprehending the complex genetic polymor- a triploid individual. phisms than isolated SNPs. Increasing evidence shows The problem of individual haplotyping is also called as that haplotypes play a crucial role in studying the varia- haplotype assembly problem or haplotype reconstruc- tions relating to diseases prediction and gene expression tion problem. It has received extensive study in the recent decade. Most of the existing research results are regard- *Correspondence: wjlhappy@mailbox.gxnu.edu.cn ing diploid individuals [1, 7, 8], and there is still a lack of Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi research studies for reconstructing triploid ones. Sev- Normal University, Guilin 541004, China Full list of author information is available at the end of the article eral algorithms for assembling K-individual haplotypes © The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 2 of 10 were proposed. Based on the minimum error correction and the HapTree algorithms. The results proved that the (MEC) model and the minimum error correction with performance of algorithm EHTLD was superior to those genotype information (MEC/GI) model, Wang et  al. of algorithms HapCompass and HapTree. The rest of this [9] and Qian et  al. [10] respectively proposed a genetic paper is arranged as follows. “Definitions and notations” algorithm and a particle swarm optimization algorithm section provides definitions and notations used later. to reconstruct diploid individual haplotypes, both of “Algorithm EHTLD” section introduces the EHTLD algo- which can be adapted to reconstructing the K-individual rithm. “Experimental results” section presents the experi- ones. The code length of the two algorithms is very long mental results of the EHTLD, the HapCompass and the in practical applications, for it is equal to the number of HapTree algorithms. Some conclusions are drawn in the sequencing SNP fragments. This brings huge solution last section. space to these two algorithms and negatively affects the performance of them. Based on the minimum fragment Definitions and notations removal (MFR) model [11], an exact exponential algo- Triploid somatic cells contain three sets of chromo- rithm was introduced by Li et  al. [11]. The time com - somes, i.e., a triploid organism has three copies of each 2t 2 (K+1)t K+1 plexity of which is O(2 m n + 2 m ) , where m chromosome. Since haplotype consists of the sequence of denotes the number of SNP fragments, n denotes the all SNPs along a chromosome, a triploid individual owns number of SNP sites, and t is the max number of holes three haplotypes. It is commonly regarded that a SNP covered by a fragment. The algorithm can not perform locus shows merely two possible alleles, hence the major well with large m, n and t. In 2013, Aguiar et  al. [12] allele can be represented as ‘0’ and the minor one can be introduced the HapCompass model and the minimum represented as ‘1’ . A haplotype can be encoded as a string weighted edge removal (MWER) optimization for hap- over a 2-letter alphabet {0, 1} instead of four real bases lotyping polyploid genomes. Algorithm HapCompass {A,T,C,G}. A genotype is the conflation of three haplo - aims to remove a minimal weighted set of edges from types on the homologous chromosomes. When three the compass graph such that a unique phasing may be alleles at a SNP site have identical values, this SNP site is constructed. The HapCompass algorithm performs on called a homozygous site, otherwise it is called a heterozy- T T the spanning-tree cycle basis of the compass graph to gous site. For example, (000) or (111) represents the iteratively delete errors. However, in the same conflict genotype value at a homozygous SNP site, while (001) cycle basis, there may be more than one edge having the or (011) represents the genotype value at a heterozy- same absolute value of weight. It may lead to the wrong gous SNP site. Suppose that m aligned SNP fragments, SNP phasing to select the removed edge randomly. In coming from three haplotypes of length n, are gener- 2014, Berger et al. [13] described a maximum-likelihood ated by DNA sequencing experiments. Let M denote an estimation framework HapTree for haplotyping a sin- m × n SNP matrix over the alphabet {0, 1, −} (− denotes gle polyploid. It can obtain better performance than the value is null). As shown in Fig.  1a, each row repre- the HapCompass algorithm [13]. In 2014, based on the sents a SNP fragment, each column represents a SNP MEC model, Wu et al. [14] presented a genetic algorithm site, and each entry M[i, j] denotes the SNP allele of the GTIHR for reconstructing triploid haplotypes. Since the ith fragment at the jth SNP site. Let G = (g , g , . . . , g ) 1 2 n code length of algorithm GTIHR equals to the number denote the genotype matrix corresponding to M, where of heterozygous sites in haplotype, the performance of g = (g , g , g ) (g ∈{0, 1}, k = 1, 2, 3, j = 1, 2, . .. , n) j j1 j2 j3 jk the GTIHR algorithm is negatively affected by haplotype length and heterozygous rate. In this paper, the triploid individual haplotype assembly problem is studied based on the MEC/GI model. An enumeration-based algorithm enumeration haplotyping triploid with least difference (EHTLD) is proposed for solving it. Algorithm EHTLD reconstructs the three haplotypes according to the order of SNP loci along them. For reconstructing the three alleles of a given site, it enumerates three kinds of SNP values by using the site’s genotype, and chooses the kind of value resulting in the minimum difference between the (a) (b) reconstructed haplotypes and the sequenced fragments Fig. 1 An example of SNP fragment matrix and genotype matrix. a covering that SNP site. The experimental comparisons SNP fragment matrix M , b genotype matrix G were performed between the EHTLD, the HapCompass Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 3 of 10 denotes the genotype value at the jth SNP site. Figure 1b (j = 2, 3,…,n) SNP site in terms of its genotype, and chooses shows an example of the genotype matrix. the one leading to the minimum difference between the Given a column M[−, j] (j = 1, 2, . . . , n) of the matrix reconstructed haplotypes and the fragments covering the M, define r(j) as the set of fragments that cover the jth jth site. After this iteration process is completed, three ′ ′ ′ ′ column. Given a row M[i, −] (i = 1, 2, . . . ,m) of the haplotypes h = (h , h h ) having only heterozygous SNP 1 2 3 matrix M, let l(i) indicate the index of the leftmost sites are built, for only heterozygous SNPs are remained SNP j (j = 1, 2, . . . , n) such that M[i, j] �=− . Given two in the preprocessed matrices. Finally, h is augmented by strings X = x , x , . . . , x and Y = y , y , . . . , y , where inserting the SNPs discarded in preprocessing step and 1 2 n 1 2 n x , y ∈{0, 1, −} (j = 1, 2, . . . , n) , the distance metric the final haplotypes h is obtained. Some key steps of the j j HD(X,  Y,  s,  e) is defined as Formula  (1). Take fragment EHTLD algorithm will be introduced in detail as follows. f (10 − 011) and fragment f (01010−) in Fig.  1a for 1 2 example. HD(f , f ,2,5) = 3. 1 2 Preprocessing Since homozygous sites do not contribute to haplotype reconstruction, they are deleted from matrices M and HD(X, Y , s, e) = d(x , y ), (1 ≤ s ≤ e ≤ n) j j G to improve the efficiency of assembly. Drop column j=s j(j = 1, 2, . . . , n) from G where g = g = g , and the (1) j1 j2 j3 corresponding column is also dropped from matrix M. where The deleted column j ( j = 1, 2, . . . , n ) is recorded as g . j1 After dropping columns from matrix M, some SNP frag- 1 if x �= −,y �= −, and x �= y j j j j d(x , y ) = j j ments with only—elements are also deleted, for they are 0 otherwise. also redundant information. The remained SNPs are all (2) heterozygous sites. For convenience of description, the Let the strings X and Y be regarded as two SNP frag- preprocessed matrices are still denoted by M and G. Sort ments, they are said to compatible if HD(X, Y,  1, n) = 0. the rows of M by their l(.) values in ascending order. For The larger HD(X,  Y,  1,  n) is, the greater the probability each column j ( j = 1, 2, . . . , n ) of M, calculate set r(j) of fragments X and Y coming from different chromo - which contains the rows covering the jth column. some copies or having sequencing errors is. If there are no errors in the data, the rows of M can be divided Enumerating and computing into three classes of compatible fragments. Three hap - The EHTLD algorithm iteratively reconstructs each hete - lotypes can be reconstructed by assembling the frag- ′ ′ ′ ′ rozygous site of haplotypes h = (h , h , h ). Each step 1 2 3 ments in the three classes. In this situation, the SNP concerns reconstructing the current empty site, starting matrix M is called feasible or error-free. Given haplotype from the left first site. Suppose that the first j − 1 sites of h = (h , h , h )(h = (h , h , . . . , h ), k = 1, 2, 3) and 1 2 3 k k1 k2 kn ′ ′ the three haplotypes h have already been filled, i.e., ( h , T k1 genotype G = (g , g , ... , g )(g = (g , g , g ) , j = 1, 2, ... , n) , 1 2 n j j1 j2 j3 ′ ′ h ,…, h ) (k = 1, 2, 3, j = 2, 3,…, n) has been assem- 3 3 k2 kj−1 if h = g (j = 1, 2, . . . , n), h and G are kj jk k=1 k=1 regarded as compatible. bled, and the jth site is under consideration. The calculat - Based on the above mentioned concepts for the haplo- ing method comprises the following two steps. type reconstruction problem, the MEC/GI model can be described as follows [9]: 1. Enumerating three kinds of possible values according MEC/GI: Given a SNP matrix M and a genotype matrix to g : G, correct the minimum number of entries in M (0 into 1 and vice versa) so that the resulting matrix is feasible, and a. if g = 1, the three kinds of values are jk k=1 the three reconstructed haplotypes are compatible with ′ ′ ′ ′ ′ ′ ( h = 0, h = 0, h = 1 ), ( h = 0, h = 1, h = 0 ) the genotype G. 1j 2j 3j 1j 2j 3j ′ ′ ′ and ( h = 1, h = 0, h = 0). 1j 2j 3j Algorithm EHTLD b. if g = 2, the three kinds of values are jk k=1 ′ ′ ′ ′ ′ ′ In this section, the EHTLD algorithm is described. The ( h = 0, h = 1, h = 1), ( h = 1, h = 0, h = 1) 1j 2j 3j 1j 2j 3j input consists of a SNP matrix M and a genotype matrix ′ ′ ′ and ( h = 1, h = 1, h = 0). 1j 2j 3j G. The output is three assembled haplotypes h = (h , h , h ) 1 2 3 of length n. In the first step of this algorithm, the matrices ′ ′ ′ ′ ′ ′ 2. Given the jth site value ( h , h ,h ), let D(h , h , h ) 1j 2j 3j 1j 2j 3j M and G are preprocessed by removing the homozygous measure the difference between the reconstructed SNPs, which do not play a role in assembling haplotypes. haplotypes and the fragments covering the jth site, as Subsequently, enumerates three kinds of values for the jth Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 4 of 10 defined in Formula  (3). From the three kinds of val - Algorithm EHTLD ues enumerated in step (1), choose the one with the Input: a SNP matrix M ,a genotype matrix G m×n Output: three reconstructed haplotypes h=(h ,h ,h ) 1 2 3 minimum D(.) value. 1. preprocess M and G 2. h =g (k=1,2,3) 1k k1 3. for j=2,3,... ,n do ′ ′ ′ ′ 4. mindif=m×n //initialize mindif as themaximum value D(h , h , h ) = min{HD(h , M[i, −], l(i), j)|k = 1, 2, 3} 1j 2j 3j k 5. if ( g =1) then jk k=1 i∈r(j) 6. if (D(0, 0, 1) <mindif) then (3) 7. h =h =0, h =1, mindif=D(0,0,1) 1j 2j 3j 8. if (D(0, 1, 0) <mindif) then In the following, we give an example for enumerating and 9. h =h =0, h =1, mindif=D(0,1,0) 1j 3j 2j 10. if (D(1, 0, 0) <mindif) then computing by using the matrices in Fig.  1. As shown in 11. h =h =0, h =1, mindif=D(1,0,0) 2j 3j 1j Fig. 2, assume that the first three sites of the three haplo - 12. elseif( g =2) then jk k=1 ′ ′ ′ ′ ′ ′ types h = (h , h , h ) have been reconstructed, i.e., h = (h 13. if(D(0,1,1)<mindif) then 1 2 3 1 ′ ′ 14. h =h =1, h =0, mindif=D(0,1,1) (011), h (010), h (100)), and the fourth site is under 2j 3j 1j 2 3 15. if(D(1,0,1)<mindif) then reconstruction. The genotype of the fourth SNP site is 16. h =h =1, h =0, mindif=D(1,0,1) 1j 3j 2j T ′ ′ ′ ′ (101) , hence haplotypes h = (h , h , h ) have the fol- 17. if(D(1,1,0)<mindif) then 1 2 3 18. h =h =1, h =0, mindif=D(1,1,0) 1j 2j 3j lowing three kinds of possible values on the fourth SNP 14. Augment h =(h ,h ,h ), andget thefinal result h=(h ,h ,h ) 1 2 3 1 2 3 ′ ′ ′ ′ ′ ′ site: ( h = 0, h = 1, h  = 1), ( h  = 1, h  = 0, h  = 1) 15. output h 14 24 34 14 24 34 ′ ′ ′ and ( h   =  1, h   =  1, h   =  0). The values of D(0,1,1), Fig. 3 Algorithm EHTLD 14 24 34 D(1,0,1) and D(1,1,0) are computed respectively accord- ing to the fragments in Fig.  1a and the three haplotypes ′ ′ ′ ′ h  = (h , h , h ). D(0,1,1) = 3, D(1,0,1) = 2, D(1,1,0) = 1. 1 2 3 ′ ′ Now the time complexity of the EHTLD algorithm is Because D(1,1,0) is the smallest, ( h   =  1, h   =  1, 14 24 ′ ′ ′ ′ ′ discussed. In preprocessing, dropping redundant infor- h  = 0) is chosen, and h  = (h (0111), h (0101), h (1000)) 34 1 2 3 mation and calculating set r(·) take time O(m × n) , are reconstructed. sorting the rows of M takes time O(m × logm) . D uring enumerating and computing, three haplotypes with only Augmenting heterozygous SNP sites are reconstructed, which takes The homozygous SNPs that are deleted by preprocess - time O(c × n × len) , here c denotes the fragments cover- ing must be reinserted. The reconstructed haplotypes ′ ′ ′ ′ age, and len represents the average length of fragments. h  = (h , h , h ) are augmented by the bits of the columns 1 2 3 In augmenting, the discarded columns can be reinserted removed, and h = (h , h , h ) are built. For a given posi- 1 2 3 ′ ′ ′ by scanning the columns only once, which takes time tion j, haplotypes h , h and h are inserted with g when j1 1 2 3 O(n). In summary, the time complexity of the algorithm the discarded column j is recorded as g . Based on the j1 is O(m × n + m × logm + c × n × len). above mentioned steps, the EHTLD algorithm for assem- bling triploid haplotypes is depicted in Fig. 3. Experimental results In this section, the EHTLD algorithm is compared with two state-of-the-art algorithms, i.e., the HapCom- pass [12] and the HapTree [13] algorithms. Algorithms EHTLD and HapCompass were implemented on a Win- dows 7 and the compiler was Microsoft Visual C# 2012. The Python program HapTree (v0.1), downloaded from http://group s.csail .mit.edu/cb/haptr ee/, was imple- mented on a Linux system. All the tests below are con- ducted on a 64 bit PC with Intel Core i5 2.50GHz CPU and 6GB RAM. One hundred data sets were generated for each parameter setting. The average over 100 runs at each parameter setting was calculated and presented. The vector error (VE) [13], the reconstruction rate (RR) [1, 9, 15] and the minimum error correction (MEC) score [12] were used to measure the performance of the algo- rithms. The vector error, generalized from switch error, is a special kind of measurement for evaluating the accu- Fig. 2 An example for enumerating and computing racy of polyploid phasing. Given three reconstructed Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 5 of 10 haplotypes, the vector error is equal to the minimum fragments and m mate-pair SNP fragments were gener- number of segments on them for which a switch must ated. A single fragment had a length ranging from f to min occur to correspond with the three true haplotypes, i.e., f , and a mate-pair fragment had a length of n/10. The max the minimum number of segments a reconstructed phase coverage was c/2 for both kinds of fragments, and the and the true phase have in common [13]. total coverage was c. Reading errors were planted into the The reconstruction rate (RR), which measures the simi - fragments with probability p . In practical applications larity degree between the pair of true haplotypes and of shotgun sequencing, the values of f and f are 3 min max the pair of reconstructed ones, is a widely adopted index and 7, respectively, c ranges from 5 to 10, and p ranges to evaluate diploid phasing [1, 9, 15]. For triploid phas- between 2 and 5% [2, 18]. In the following tables, algo- ing, we generalized it to calculate the similarity degree rithms EHTLD, HapCompass and HapTree are abbrevi- between the three true haplotypes and the three recon- ated to EH, HC and HT, respectively. structed haplotypes. Assuming that h  = (h , h , h ) are In Table  1, 12 sets of parameters were set in deal- 1 2 3 ˆ ˆ ˆ ˆ the original haplotypes, and h = (h , h , h ) are the recon- ing with error rate p , where c =  10, f  =  3, f  =  7, 1 2 3 s min max structed haplotypes. RR is defined as the proportion of n =  100 and d =  0.3. It can be seen from this table that nucleotides that are reconstructed correctly, as shown in algorithm EHTLD can achieve much higher reconstruc- Formula (4). tion rates, smaller vector errors and MEC scores than 3 3 3 min{ r |i , j ∈{1, 2, 3}, i = j = 6} i j k k k k k=1 k k k=1 k=1 ˆ (4) RR(h, h) = 1 − , 3n where r  = HD(h , h , 1, n). the HapCompass and the HapTree algorithms in every p i j i j s k k k k The minimum error correction (MEC) score measures setting. When p  = 0, algorithm EHTLD achieves recon- the minimum number of mismatches between the recon- struction rate of 0.97, which is higher than both Hap- ˆ ˆ ˆ ˆ structed haplotypes h  = (h , h , h ) and the SNP matrix Compass and HapTree algorithms by about 9.0%, and 1 2 3 M, as shown in Formula (5). vector error of 3, which is less than them by 8 times or so. In particular, the MEC score obtained by algorithm EHTLD reaches zero, while those of the other two algo- ˆ ˆ MEC(M, h) = min{HD(M[i, −], h , 1, n)|k = 1, 2, 3}. rithms are 126 and 57. Although the increase of p plays i=1 (5) stronger negative effect on algorithm EHTLD than on algorithms HapCompass and HapTree, the EHTLD algo- To the best of our knowledge, the real triploid haplotype rithm still obtains better performance than algorithms data are not available in the public domain, Aguiar et al. HapCompass and HapTree with high error rate. When [12] and Berger et al. [13] used computer-generated sim- p   =  0.2, the RRs of algorithms EHTLD, HapCompass ulated data. Therefore, simulated data were also used in and HapTree are 0.92, 0.89 and 0.88, the vector errors of our experiments. Three simulation haplotypes h  = (h , them are 14, 31 and 26, and the MEC scores of them are h , h ) of length n were created by using the following 335, 407 and 364, respectively. The three algorithms all 2 3 method. h was generated at random firstly. h was gener- execute very efficiently when p ranges from 0 to 0.2. 1 2 s ated by flipping each bit of h randomly so that the ham- In Table  2, nine sets of parameters were set in deal- ming distance between h and h was equal to a given ing with coverage c, where n  =  100, f   =  3, f   =  7, 1 2 min max parameter d. h also had the same length and h was p  = 0.05 and d = 0.3. From Table 2 we observe that algo- 3 3j s set to h or h (j = 1, 2, . . . , n) with uniform probabil- rithm EHTLD still obtains the highest reconstruction 1j 2j ity. With regard to fragment data, two kinds of sequenc- rate and the smallest vector error and MEC score under ing simulators, CELSIM [16] and MetaSim [17], were different coverage settings. When the coverage is 2, the adopted to generate simulation fragments, and the test- RRs of algorithms EHTLD, HapCompass and HapTree ing datasets were called as CELSIM instances and Meta- are 0.94, 0.89 and 0.86, the vector errors of them are 10, Sim instances, respectively. 30 and 29, and the MEC scores of them are 16, 40 and 19. When the coverage increases, the RR of algorithm EHTLD increases gradually, while that of algorithm Hap- CELSIM instances Compass fluctuates between 0.89 and 0.90, and that of In this section, the evaluation of the EHTLD, the Hap- algorithm HapTree varies between 0.85 and 0.91. Gener- Compass and the HapTree algorithms is described ally, the increase of coverage plays a positive role in the by using CELSIM instances. CELSIM was invoked to improvement of algorithm performance, for much more simulate shotgun sequencing platform. m single SNP 1 Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 6 of 10 original fragment information can be utilized. How- from this table, algorithm EHTLD still obtains superior ever, it is not apparent for algorithms HapCompass and results to the other two algorithms under each param- HapTree. eter setting. With the increase of haplotype length, the Table  3 compares the performance of the three algo- three algorithms experience a gradual degradation in rithms with different haplotype lengths n, where c  =  10, the performance. When n is 100, the RR of algorithm f  = 3, f  = 7, p  = 0.05 and d = 0.3. As can be seen EHTLD is 0.97, which is higher than both HapCompass min max s Table 1 Comparison with different error rates (CELSIM instance) p RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 0 0.97 0.89 0.89 3 29 27 0 126 57 0.01 0.02 0.01 0.01 0.97 0.89 0.88 3 31 28 17 147 64 0.01 0.03 0.01 0.02 0.97 0.89 0.88 4 30 27 34 152 79 0.01 0.03 0.01 0.03 0.97 0.89 0.90 4 31 26 51 180 83 0.01 0.03 0.01 0.04 0.97 0.90 0.89 4 29 27 69 179 96 0.01 0.03 0.01 0.05 0.97 0.90 0.89 4 29 26 85 194 117 0.01 0.03 0.01 0.06 0.96 0.89 0.88 5 31 26 102 210 132 0.01 0.03 0.01 0.07 0.96 0.89 0.88 5 30 26 122 225 157 0.01 0.03 0.01 0.08 0.96 0.90 0.88 6 29 24 138 238 173 0.01 0.03 0.01 0.09 0.95 0.89 0.87 6 30 28 154 254 181 0.01 0.03 0.01 0.1 0.95 0.90 0.89 7 29 25 172 263 206 0.01 0.03 0.01 0.2 0.92 0.89 0.88 14 31 26 335 407 364 0.01 0.03 0.01 Table 2 Comparison with different coverages (CELSIM instance) c RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 2 0.94 0.89 0.86 10 30 29 16 40 19 0.01 0.01 0.01 3 0.95 0.89 0.85 9 31 28 25 60 30 0.01 0.02 0.01 4 0.95 0.89 0.91 7 31 28 34 79 38 0.01 0.02 0.01 5 0.96 0.89 0.87 6 30 26 41 96 46 0.01 0.02 0.01 6 0.96 0.89 0.87 5 30 25 52 120 58 0.01 0.02 0.01 7 0.96 0.90 0.89 5 29 26 58 133 65 0.01 0.02 0.01 8 0.96 0.89 0.89 5 30 25 67 157 75 0.01 0.02 0.01 9 0.96 0.89 0.89 4 30 25 75 175 84 0.01 0.03 0.01 10 0.97 0.90 0.89 4 29 26 85 194 117 0.01 0.03 0.01 Table 3 Comparison with different haplotype lengths (CELSIM instance) n RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 100 0.97 0.90 0.89 4 29 26 85 194 117 0.01 0.03 0.01 200 0.96 0.89 0.90 12 61 57 136 305 169 0.04 0.16 0.04 300 0.95 0.89 0.88 29 92 90 181 387 230 0.11 0.20 0.09 500 0.93 0.88 0.87 57 156 148 271 573 337 0.56 0.83 0.72 800 0.92 0.88 0.86 100 256 242 398 855 492 2.21 2.59 2.30 1000 0.92 0.88 0.86 136 322 314 479 1029 595 4.36 4.67 4.51 Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 7 of 10 and HapTree algorithms by about 7.8%, the vector error the MEC scores drop to 65, 86 and 76, respectively. The of algorithm EHTLD is 4, which is less than algorithms decrease of the MEC scores explain the shorter the frag- HapCompass and HapTree by about 86 and 85%, the ments are, the more probability the fragments agree with MEC score of algorithm EHTLD is 85, which is less the reconstructed haplotypes. The change of single frag - than algorithms HapCompass and HapTree by about 56 ment length plays little effect on the running time of the and 21%, respectively. When n is 1000, the RRs of them three algorithms. decrease to 0.92, 0.88 and 0.86, the vector errors of them Table  5 compares the three algorithms with different increase to 136, 322 and 314, and the MEC scores of hamming distances d, where f  =  3, f  =  7, c =  10, min max them go up to 479, 1029 and 595, respectively. The run - p   =  0.05 and n  =  100. It can be seen from this table ning time of the three algorithms increases significantly that the performance of algorithm EHTLD remains rel- with the increase of n. When n = 100, the running time atively stable under different d, while that of algorithms of algorithms EHTLD, HapCompass and HapTree is 0.01, HapCompass and HapTree suffers strong negative influ - 0.03 and 0.01 s, respectively, while n = 1000, it increases ence with the increase of hamming distance. For exam- to 4.36, 4.67 and 4.51 s, respectively. ple, when d varies from 0.1 to 1.0, the RR of the EHTLD In Table 4, three groups of parameters were set in deal- algorithm fluctuates between 0.97 and 1.0, while those ing with single fragment length range [f , f ] , where of the HapCompass and the HapTree algorithms achieve min max c  =  10, p   =  0.05, n  =  100 and d  =  0.3. As shown in decrease rate up to 35 and 20%, respectively. Table  4, algorithm EHTLD still performs the best under different parameter settings. When [f , f ]= [3, 7] , MetaSim instances min max the RRs of algorithms EHTLD, HapCompass and Hap- MetaSim was used to simulate 454 sequencing plat- Tree are 0.97, 0.90 and 0.89, the vector errors of them form. m SNP fragments, including m = (1 − p ) × m 1 m are 4, 29 and 26, and the MEC scores of them are 85, 194 single ones and m   =  p × m mate-pair ones, were 2 m and 117, respectively. With the decrease of the length of generated, where p denoted the probability of mate- single fragment, the decline of fragments overlap might pair fragments and was set to 0.25 in the experiments. be disadvantageous for haplotype reconstruction. When A single fragment had an expected length of f _len , [f , f ]= [1, 2] , the RRs decrease to 0.94, 0.90 and and a mate-pair fragment had a length of 3 × f _len . min max 0.88, the vector errors increase to 14, 30 and 28, and Since each mate-pair fragment consists of two single Table 4 Comparison with different single fragment length ranges (CELSIM instance) f , f RR VE MEC Running time (s) min max EH HC HT EH HC HT EH HC HT EH HC HT [3, 7] 0.97 0.90 0.89 4 29 26 85 194 117 0.01 0.03 0.01 [2, 4] 0.95 0.89 0.88 8 30 28 67 129 90 0.01 0.03 0.01 [1, 2] 0.94 0.90 0.88 14 30 28 65 86 76 0.02 0.03 0.01 Table 5 Comparison with different hamming distances (CELSIM instance) d RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 0.1 0.99 0.97 0.96 5 9 8 88 109 92 0.01 0.02 0.01 0.2 0.97 0.93 0.94 6 17 16 86 140 106 0.01 0.03 0.01 0.3 0.97 0.90 0.89 4 28 26 85 194 117 0.01 0.03 0.01 0.4 0.97 0.86 0.85 3 38 35 86 260 145 0.02 0.06 0.02 0.5 0.97 0.82 0.84 1 46 42 88 326 201 0.04 0.08 0.03 0.6 0.97 0.79 0.82 1 57 49 89 393 263 0.05 0.10 0.05 0.7 0.98 0.74 0.81 0 71 63 92 478 346 0.07 0.16 0.06 0.8 0.98 0.71 0.80 0 78 72 90 553 421 0.10 0.20 0.08 0.9 0.99 0.67 0.78 0 89 78 90 633 507 0.12 0.25 0.10 1.0 1.00 0.63 0.77 0 94 85 91 722 589 0.15 0.29 0.12 Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 8 of 10 fragments of the same haplotype, the coverage c equals to HapCompass and the HapTree algorithms under differ - [(m + 2 × m ) × f _len]/3 × n. ent c, n, f _len and d settings. 1 2 Table  6 gives the comparisons with coverage ranging from 5 to 50, where n =  100, f _len = 5, and d = 0.3. In Conclusion Table  7, six sets of experimental results under different The minimum error correction with genotype infor - haplotype length settings are displayed, where c  =  20, mation (MEC/GI) model is one of the important com- f _len = 5, and d = 0.3. In Table  8, three instances were putational models for solving single individual SNP generated in dealing with single fragment length f _len , haplotyping problem. In this paper, an enumeration- where n  =  100, c  =  20, and d  =  0.3. In Table  9, the test based algorithm EHTLD is presented for haplotyp- results under different parameter d are shown, where ing a triploid single individual by using this model. n = 100, c = 20, and f _len = 5. The experimental results Algorithm EHTLD reconstructs the three haplotypes obtained from MetaSim instances indicate that algo- according to the order of SNP loci along them. For a rithm EHTLD still obtain much higher reconstruction SNP site being reconstructed, the EHTLD algorithm rates, smaller vector errors and MEC scores than the enumerates three possible values in terms of the site’s Table 6 Comparison with different coverages (MetaSim instance) c RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 5 0.94 0.89 0.88 10 30 26 80 134 99 0.01 0.02 0.01 10 0.94 0.90 0.88 8 29 26 144 252 189 0.01 0.02 0.01 15 0.95 0.90 0.88 8 30 26 249 405 287 0.01 0.03 0.01 20 0.95 0.90 0.88 7 30 26 299 513 343 0.02 0.04 0.02 25 0.95 0.89 0.89 7 28 25 383 648 446 0.03 0.05 0.03 30 0.95 0.90 0.89 7 30 25 458 775 531 0.03 0.07 0.03 35 0.95 0.89 0.89 7 30 26 532 903 623 0.05 0.09 0.04 40 0.95 0.90 0.89 7 28 26 600 1028 711 0.07 0.12 0.06 45 0.95 0.90 0.90 7 29 24 700 1193 807 0.10 0.16 0.09 50 0.95 0.90 0.90 7 28 25 758 1293 879 0.13 0.19 0.11 Table 7 Comparison with different haplotype lengths (MetaSim instance) n RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 100 0.95 0.90 0.88 7 30 26 299 513 343 0.02 0.04 0.02 200 0.93 0.89 0.90 17 62 57 550 985 620 0.24 0.31 0.21 300 0.93 0.89 0.89 25 93 79 788 1441 896 0.56 0.63 0.60 500 0.93 0.88 0.89 45 156 142 1284 2392 1440 2.43 2.83 2.64 800 0.92 0.88 0.89 75 254 238 1998 3791 2254 9.65 10.50 9.89 1000 0.92 0.88 0.88 91 319 305 2484 4731 2804 17.43 17.95 17.47 Table 8 Comparison with different single fragment lengths (MetaSim instance) f_len RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 10 0.96 0.90 0.88 5 30 26 248 404 307 0.01 0.04 0.01 5 0.95 0.90 0.88 7 30 26 299 513 343 0.02 0.04 0.02 3 0.93 0.89 0.89 14 31 28 129 246 201 0.03 0.06 0.03 Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 9 of 10 Table 9 Comparison with different hamming distances (MetaSim instance) d RR VE MEC Running time (s) EH HC HT EH HC HT EH HC HT EH HC HT 0.1 0.98 0.97 0.87 4 10 28 347 387 375 0.01 0.01 0.01 0.2 0.96 0.93 0.88 6 18 26 309 427 353 0.01 0.03 0.01 0.3 0.95 0.90 0.88 7 30 26 299 513 343 0.02 0.04 0.02 0.4 0.95 0.86 0.89 6 35 25 294 629 330 0.05 0.11 0.02 0.5 0.94 0.82 0.88 5 48 25 289 752 325 0.09 0.16 0.03 0.6 0.94 0.78 0.88 3 60 26 288 895 320 0.12 0.26 0.03 0.7 0.95 0.75 0.88 3 69 26 296 1060 336 0.18 0.37 0.04 0.8 0.95 0.71 0.88 5 77 26 321 1227 366 0.22 0.42 0.06 0.9 0.96 0.67 0.88 3 86 25 290 1359 333 0.27 0.50 0.09 1.0 0.96 0.63 0.88 2 94 25 277 1523 321 0.35 0.61 0.11 Competing interests genotype, and chooses the one leading to the mini- The authors declare that they have no competing interests. mum difference between the reconstructed haplotypes and the fragments covering that SNP site. The recon - Availability of data and materials Not applicable. structed alleles of a SNP site mainly depend on the fragments which cover the site, and are little affected by Consent for publication other former reconstructed alleles. Therefore, the for - Not applicable. mer wrongly reconstructed SNP alleles would not affect Ethics approval and consent to participate the latter reconstructed SNP alleles, i.e., reconstructed Not applicable. errors on the former SNP alleles would not spread to Funding the latter ones. The kind of enumeration strategy can This research is supported by the National Natural Science Foundation of also be apply to reconstruct haplotypes of other ploidy, China under Grant Nos. 61363035, 61762015 and 61502111, Guangxi Natural which will be studied in the future. Science Foundation under Grant No. 2015GXNSFAA139288, Research Fund of Guangxi Key Lab of Multi-source Information Mining & Security Nos. 14-A-03- Compared with algorithms HapCompass and Hap- 02 and 15-A-03-02, “Bagui Scholar” Project Special Funds. Tree by using two kinds of simulated sequencing data, the EHTLD algorithm can get the highest reconstruc- Publisher’s Note tion rates, the smallest vector errors and MEC scores, Springer Nature remains neutral with regard to jurisdictional claims in pub- which was tested by a number of experiments. In addi- lished maps and institutional affiliations. tion, algorithm EHTLD still achieves satisfying per- Received: 9 January 2017 Accepted: 24 May 2018 formance even with high error rate, low fragment coverage, or long haplotype length. All of these advan- tages may contribute to the practical application of the EHTLD algorithm when haplotyping a triploid single References individual. 1. Geraci F. A comparison of several algorithms for the single indi- vidual SNP haplotyping reconstruction problem. Bioinformatics. Authors’ contributions 2010;26(18):2217–25. JW designed and implemented the algorithms and methods, QZ contributed 2. Wu JL, Liang BB. A fast and accurate algorithm for diploid individual on experimental design and data processing. JW wrote the most part of the haplotype reconstruction. J Bioinform Comput Biol. 2013;11(4):1350010. manuscript. QZ helped in data preparing and modifying the manuscript. 3. Clark AG. Inference of haplotypes from PCR-amplified samples of diploid All the work was guided by JW in the whole process. Both authors read and populations. Mol Biol Evol. 1990;7:111–22. approved the final manuscript. 4. Gusfield D. Inference of haplotypes from samples of diploid populations: complexity and algorithms. J Comput Biol. 2001;8:305–23. Author details 5. O’Neil ST, Emrich SJ. Haplotype and minimum-chimerism consensus Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi determination using short sequence data. BMC Genomics. 2012;13(Suppl Normal University, Guilin 541004, China. College of Computer Science 2):S4. and Information Technology, Guangxi Normal University, Guilin 541004, China. 6. Lancia G, Bafna V, Istrail S, Lippert R, Schwartz R. SNPs problems, complex- ity and algorithms. In: Heide FM, editor. Proceeding on 9th European Acknowledgements symposium on algorithms, Aarhus, Denmark; 2001. p. 182–93. The authors are grateful to anonymous referees for their helpful comments 7. Chen ZZ, Deng F, Wang LS. Exact algorithms for haplotype assembly from and to Professor Gene Myers for his kindly providing the source codes of CEL- whole-genome sequence data. Bioinformatics. 2013;29(16):1938–45. SIM. This research is supported by Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing. Wu and Zhang Algorithms Mol Biol (2018) 13:10 Page 10 of 10 8. Mazrouee S, Wang W. FastHap: fast and accurate single individual 14. Wu JL, Chen XX, Li XC. Haplotyping a single triploid individual based on haplotype reconstruction using fuzzy conflict graphs. Bioinformatics. genetic algorithm. Bio-Med Mater Eng. 2014;24(6):3753–62. 2014;30(ECCB):i371–8. 15. Xie MZ, Wang JX, Chen JE. A high accurate model of the individual hap- 9. Wang RS, Wu LY, Li ZP, Zhang XS. Haplotype reconstruction from lotyping problem based on weighted SNP fragments and genotype with SNP fragments by minimum error correction. Bioinformatics. errors. Bioinformatics. 2008;24(ISMB):i105–13. 2005;21(10):2456–62. 16. Myers G. A dataset generator for whole genome shotgun sequencing. In: 10. Qian WY, Yang YJ, Yang NN, Chun L. Particle swarm optimization for snp Proceedings of the 7th international conference on intelligent systems haplotype reconstruction problem. Appl Math Comput. 2008;196:266–72. for molecular biology; 1999. p. 202–10. 11. Li ZP, Wu LY, Zhao YY, Zhang XS. A dynamic programming algorithm 17. Richter DC, Ott F, Auch AF, Schmid R, Huson DH. MetaSim—a sequencing for the k-haplotyping problem. Acta Math Sinic English Series. simulator for genomics and metagenomics. PLOS ONE. 2008;3(10):e3373. 2006;22:405–12. 18. Panconesi A, Sozio M. Fast hare: a fast heuristic for single individual SNP 12. Aguiar D, Istrail S. Haplotype assembly in polyploid genomes and identi- haplotype reconstruction. In: Proceedings of 4th workshop on algorithms cal by descent shared tracts. Bioinformatics. 2013;29(13):i352–60. in bioinformatics; 2004. p. 266–77. 13. Berger E, Yorukoglu D, Peng J, Berger B. HapTree: a novel Bayesian frame- work for single individual polyplotyping using NGS data. PLoS Comput Biol. 2014;10(3):e1003502. Ready to submit your research ? Choose BMC and benefit from: fast, convenient online submission thorough peer review by experienced researchers in your field rapid publication on acceptance support for research data, including large and complex data types • gold Open Access which fosters wider collaboration and increased citations maximum visibility for your research: over 100M website views per year At BMC, research is always in progress. Learn more biomedcentral.com/submissions

Journal

Algorithms for Molecular BiologySpringer Journals

Published: Jun 1, 2018

References

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off