Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study

Abstract

Haplotypes are the units of inheritance in an organism, and many genetic analyses depend on their precise determination. Methods for haplotyping single individuals use the phasing information available in next-generation sequencing reads, by matching overlapping single-nucleotide polymorphisms while penalizing post hoc nucleotide corrections. Haplotyping diploids is relatively easy, but the complexity of the problem increases drastically for polyploid genomes, which are found in both model organisms and in economically relevant plant and animal species. Although a number of tools are available for haplotyping polyploids, the effects of the genomic makeup and the sequencing strategy on the accuracy of these methods have hitherto not been thoroughly evaluated. We developed the simulation pipeline haplosim to evaluate the performance of three haplotype estimation algorithms for polyploids: HapCompass, HapTree and SDhaP, in settings varying in sequencing approach, ploidy level and genomic diversity, using tetraploid potato as the model. Our results show that sequencing depth is the major determinant of haplotype estimation quality, that 1 kb PacBio circular consensus sequencing reads and Illumina reads with large insert-sizes are competitive, and that all methods fail to produce good haplotypes when ploidy levels increase. Comparing the three methods, HapTree produces the most accurate estimates, but also consumes the most resources. There is clearly room for improvement in polyploid haplotyping algorithms.

Keywords: next-generation sequencing, haplotyping, polyploids, benchmarking

Introduction

The advent of sequencing technology has had a tremendous impact on genomics and genetics over the past years.
The human (Homo sapiens) genome, published in 2003 [1], formed the basis for large-scale efforts to catalogue sequence variation found between individual genomes, in particular single-nucleotide polymorphisms (SNPs) [2]. Subsequently, the international HapMap project [3] sequenced a large number of individuals to discover haplotype blocks, i.e. genomic regions containing co-segregating SNPs. These haplotype blocks and their so-called haplotype tagging SNP markers, heterozygous SNPs whose alleles predict the presence of a certain haplotype, formed the basis for the development of high-density SNP arrays [4, 5], capable of determining rare or less-frequent genotypes in a population, which were used in a large number of studies relating SNPs to phenotypes such as ecological traits, diseases and disorders [6]. Following the success of the human genome project, many animal and plant species were sequenced and genotyped, notably mouse-ear cress (Arabidopsis thaliana) [7], fruit fly (Drosophila) [8], chicken (Gallus gallus) [9], pig (Sus scrofa) [10], potato (Solanum tuberosum) [11], tomato (Solanum lycopersicum) [12] and hot pepper (Capsicum annuum) [13]; the last three being species within the Solanaceae family. This has not only impacted fundamental genetics research but has also revolutionized the fields of animal and plant breeding, by relating thousands of previously unknown genomic variants to physiological, morphological and economically important traits such as yield per generation and disease resistance [14, 15]. The introduction of high-throughput, relatively cheap and reliable next-generation sequencing (NGS) technologies made it possible to determine most of the variants directly within a single genome, rather than using a predefined set of marker variants as proxies for the other variants [16]. Efficient tools have been developed to call variants based on NGS data, e.g.
FreeBayes [17] and GATK [18], and also to link variants on the same homologous chromosome, so-called haplotype phasing. However, while phasing of nearby variants occurring within the average NGS read length is relatively straightforward, long-range haplotyping using NGS data remains a challenge. Nevertheless, haplotyping is important in many areas: in fundamental biology, to improve our understanding of genome structure, recombination and evolution [19]; in medicine, to obtain a full picture of the genetic variation in a population potentially linked to diseases and traits [20] and to investigate the effect of compound heterozygosity [21]; and in animal and plant breeding, to move from phenotype-based to genotype-based crossing and selection of individuals [22, 23]. Moreover, knowledge of haplotypes can help reveal the linkage disequilibrium pattern in a population and hence increase the power and coverage of genetic analysis, by allowing the imputation of a large number of alleles using a limited set of genotyped loci [24, 25]. In the diploid case, haplotyping algorithms aim to divide the aligned reads into two complementary sets, each covering a specific region of, say, n heterozygous sites, so that the nucleotides are the same at the overlapping sites of the reads within each set, but different between the sets. The algorithmic challenge then is to take the occurrence of sequencing and variant calling errors into account [26–28]. Minimum Error Correction (MEC), the most prevalent approach, assigns reads that conflict with both read sets, presumably because of sequencing or variant calling errors, to one of them by flipping single bases; the aim is to find a configuration that requires a minimal number of such base flips [27, 29]. This strategy is the basis of several diploid haplotyping algorithms [30–37].
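To make the MEC objective concrete, the following is a minimal brute-force sketch for the all-heterozygous, bi-allelic diploid case. This is an illustration only, not one of the published algorithms, which use far more efficient search strategies; the fragment encoding (a mapping from site index to observed allele) is an assumption for this sketch.

```python
from itertools import product

def mec_score(hap, fragments):
    """MEC score of a diploid phasing: each fragment is charged the
    minimum number of base flips needed to match either the candidate
    haplotype or its complement."""
    comp = tuple(1 - a for a in hap)
    score = 0
    for frag in fragments:  # frag: {site_index: allele (0/1)}
        m_h = sum(frag[s] != hap[s] for s in frag)
        m_c = sum(frag[s] != comp[s] for s in frag)
        score += min(m_h, m_c)
    return score

def brute_force_mec(n_sites, fragments):
    """Exhaustively search all 2^n candidate haplotypes (feasible only
    for tiny n); returns the best (haplotype, score) pair."""
    best = min((tuple(h) for h in product((0, 1), repeat=n_sites)),
               key=lambda h: mec_score(h, fragments))
    return best, mec_score(best, fragments)

# Four error-free fragments consistent with the haplotype pair 0101/1010:
frags = [{0: 0, 1: 1}, {1: 1, 2: 0}, {0: 1, 1: 0, 2: 1}, {2: 1, 3: 0}]
hap, score = brute_force_mec(4, frags)  # score 0: no corrections needed
```

With error-free fragments the optimal score is 0; introducing a sequencing error into one fragment would raise the minimum achievable score by one flip.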
Recently, diploid-aware genome assemblers based on long reads, produced for instance by PacBio and Nanopore technologies [38], have also been developed. These algorithms use the overlap-layout-consensus approach to construct the assembly graph and obtain the primary contigs from the raw reads, and try to resolve the haplotypes by calling heterozygous SNPs within the contigs. These SNPs are used to separate the long reads into two groups, the so-called ‘haplotigs’, from which consensus sequences are obtained to determine the phasing [39]. For polyploid genomes, the problem could be formulated as dividing the reads into k > 2 groups, but the generalization from the diploid case is not straightforward. Specifically, polyploids can be classified as allopolyploids, autopolyploids or mixture types, i.e. so-called segmental allopolyploids. In the allopolyploid case, the constituent sub-genomes of the polyploid are derived from sufficiently distant diploid ancestors that do not usually recombine with each other, a situation observed in several species in the plant kingdom, such as tetraploid and hexaploid wheat [40]. Under specific circumstances, one may treat the sub-genome haplotypes as separate diploids, and an ad hoc phasing solution could still be achievable using the algorithms developed for diploids. As an example, this strategy has been successfully applied to pasta wheat (Triticum turgidum) [41], a self-fertilizing allotetraploid for which the ancestral diploid genomes are known, to determine the variation on each sub-genome by dividing the transcriptome reads into two groups using HapCut [33]. In the case of (partial) autopolyploids, however, recombination is observed between homologues belonging to different sub-genomes, and unlike in the diploid case, knowledge of one haplotype does not automatically determine the phasing of the others.
In addition, some haplotypes may be (locally) identical, and thus several configurations can have the same MEC score. Moreover, the computational complexity of haplotype reconstruction increases rapidly with the ploidy [42, 43]. The diploid approaches are therefore in general not applicable to polyploids. Still, haplotype assembly for polyploids is highly relevant, as many interesting organisms have polyploid genomes, and haplotyping will help unravel the range of complex recombinations allowed by such genomes. Within the animal kingdom, triploidy and tetraploidy are observed in the African clawed frog (Xenopus laevis) [44] and zebrafish (Danio rerio) [45], both important model organisms in evolutionary biology. Moreover, many economically important crops and ornamentals are polyploid, including tetraploid alfalfa (Medicago sativa), triploid banana (Musa acuminata × Musa balbisiana), tetraploid leek (Allium ampeloprasum), tetraploid potato (S. tuberosum), tetraploid hard wheat (Triticum durum), hexaploid bread wheat (Triticum aestivum), tetraploid, hexaploid and octoploid strawberry species including Fragaria moupinensis (k = 4), Fragaria moschata (k = 6) and Fragaria × ananassa (k = 8), and several hybrid cotton (Gossypium, tetraploid or hexaploid) and rose (Rosa, tetraploid) species. Here, we review three state-of-the-art haplotyping algorithms applicable to polyploids: HapCompass [42, 46], HapTree [47] and SDhaP [43], and evaluate their accuracy through extensive simulations of random genomes and NGS reads. Using the highly heterozygous tetraploid potato (S. tuberosum) as a model, we generated random genomes using a realistic stochastic model, with the SNP density and the distribution of SNP dosages, i.e. the number of alternative alleles at each SNP site, derived from a recent genomic study of potato [48].
In addition, we simulated genomes at higher levels of ploidy with the same SNP density, as well as tetraploid genomes with different SNP densities and haplotype dosages, to investigate the effects of genome characteristics on the estimation. Moreover, we considered various sequencing depths, paired-end insert-sizes and sequencing technologies to quantify the impact of these parameters on the haplotyping. We provide guidelines for applying the haplotyping methods in practice, and show the characteristics of each method in various situations. The pipeline used is available as a software package called haplosim, which allows simulation for various sequencing approaches, genomic characteristics and variation models.

Material and methods

Although several studies have used experimental data, e.g. the human haplotype panels and sequence reads, to evaluate the efficiency of diploid haplotyping algorithms [49], experimentally obtained haplotypes are often not available for polyploids at a scale enabling insightful statistical comparison. Therefore, the evaluation of polyploid haplotyping algorithms has been based on artificial data sets [42, 43, 46, 47]. Here, we also rely on simulation to evaluate the performance of these methods. Compared with previous studies, our approach has the advantage of simulating all parts of a practical haplotyping pipeline, encompassing the careful simulation of genomes and sequence reads based on real data and the application of standard software for read alignment and genotype calling. In contrast, previous studies relied on the direct simulation of a SNP-fragment matrix using simplifying assumptions. An additional advantage of our simulation approach is that it allows us to investigate the effects of SNP density, similarity between homologues, ploidy level, sequencing depth, sequencing technology and DNA library size on the quality of haplotype estimates, which is usually not feasible using real data.
We developed a multistage pipeline, haplosim, to simulate polyploid individuals with various genomic characteristics that are sequenced in silico. After SNP detection and dosage estimation, the haplotypes of each individual are estimated separately by each of the available algorithms: HapCompass [42, 46], HapTree [47] and SDhaP [43]. In the last step of the pipeline, the estimated haplotypes are compared with the originally simulated haplotypes using quantitative measures (Figure 1).

Figure 1. Haplosim pipeline to generate, estimate and evaluate haplotypes. Random genomes and haplotypes are produced by haplogenerator, from which NGS reads are simulated and mapped back to the reference. SNPs are then called and the haplotypes estimated by HapTree, HapCompass and SDhaP. The estimates are compared with the original haplotypes by hapcompare.

Figure 2. The schematic diagram of simulation for each considered scenario. In total, 25 references of length 20 kb are chosen from the draft sequence of chromosome 5 [11]; from each, 10 polyploid genomes are simulated containing bi-allelic SNPs randomly distributed according to the lognormal distance model, to obtain data sets of size 250.

For the first step (Figure 1A), polyploid genomes are produced from a reference DNA sequence by introducing heterozygous regions containing bi-allelic SNPs, using the command-line tool haplogenerator that we developed for this purpose. Next, NGS reads are simulated for each individual using ART [50] and PBSIM [51] for Illumina and PacBio reads, respectively, and the reads are mapped back to the reference genome using bwa-mem [52] (with the settings recommended in its manual for Illumina and PacBio reads). The alignments are preprocessed to generate BAM files and remove duplicates using SAMtools [53] and Picard tools [54], after which SNPs are called using FreeBayes [17] (Figure 1B). The processed alignments, the reference and the VCF files are used in the haplotyping step by HapCompass [42, 46], HapTree [47] and SDhaP [43] (Figure 1C), and the obtained haplotype estimates are compared with the original haplotypes, using several measures of estimation quality, by the command-line tool hapcompare that we developed (Figure 1D). These steps are explained in detail below.

Polyploid haplotyping software

Currently, three optimization-based haplotyping algorithms are available for polyploids that make use of the sequence reads of a single sample: HapCompass [42, 46], HapTree [47] and SDhaP [43] (Table 1). Among these three, HapTree and SDhaP have separate software releases for diploids and polyploids, which may have major performance differences not discussed here. We explain each method assuming a genomic region containing l heterozygous SNP sites s1, s2, …, sl, a ploidy level k > 2 and an NGS data set containing m (paired-end) reads.
We define a ‘fragment’ as the sequence of the determined alleles at the heterozygous sites within a (paired-end) read. For simplicity, we focus on the most prevalent type of SNPs, bi-allelic SNPs, for which the alleles can be represented by ‘0’ (the reference) and ‘1’ (the alternative).

Table 1. Summary of the polyploid haplotype estimation algorithms using sequence reads

Algorithm | Principle | Release version | Separate diploid release
HapCompass (Aguiar and Istrail 2013) | Graph based, using a weighted minimum edge reduction criterion | HapCompass v0.8.2 | No
HapTree (Berger et al. 2014) | Bayesian probability tree with a maximum relative likelihood criterion | HapTree_v0.1 | Yes
SDhaP (Das and Vikalo 2015) | Minimum error correction criterion with semi-definite approximation | SDhaP_poly | Yes

HapCompass

Aguiar and Istrail (2013) extended their graphical haplotype estimation approach for diploids [42] by constructing the polyploid compass graph, which has k nodes for each variant site si in a k-ploid, corresponding to the k alleles at that site [46]. To each SNP pair (si, sj) covered by at least one of the m fragments, the phasing with the largest likelihood is assigned by a polyploid likelihood model, conditional on the covering fragments and assuming a fixed base-calling error rate. Accordingly, k edges are added to the compass graph between the nodes at sites si and sj, representing the k homologues covering si and sj, weighted by their likelihoods. A global minimum weighted edge removal (MWER) algorithm is applied to detect and eliminate edges with conflicting phasing information from the compass graph, with the aid of auxiliary chain graphs. Each chain graph takes a set of variants that form a cycle in the compass graph and detects phasing conflicts within the cycle, if present. The MWER algorithm then tries to resolve the conflicts by eliminating a number of edges with minimum total weight, corresponding to the least likely phasings. Finally, the most likely haplotypes over the full set of SNPs are found from the conflict-free compass graph by finding k disjoint maximum spanning trees through an efficient greedy algorithm, corresponding to the k most likely homologues covering the l SNP sites.

HapTree

The HapTree algorithm, developed by Berger et al.
(2014) [47], builds a tree representing a subset of likely phasing solutions for the l heterozygous sites, and tries to find the most probable path from s1 to sl in the tree as the best phasing, by calculating the probability of each phasing conditional on the m fragments, assuming a fixed base-calling error rate. As an exhaustive search over the full tree of all solutions would be computationally prohibitive, the tree is built and extended site by site, with branching and pruning to greedily eliminate (relatively) low-probability paths from the final tree. In doing so, HapTree calculates the relative probabilities of the haplotypes at each extension using the relative probabilities of the haplotypes at the previous extension that survived the branching and pruning, taking the error model into account.

SDhaP

The third algorithm, polyploid SDhaP, designed by Das and Vikalo (2015), is a semi-definite programming approach that aims to find an approximate MEC solution by a greedy search of the space of all possible phasings from s1 to sl [43]. The algorithm starts with random initial haplotypes and tries to find the MEC solution by making changes to these initial haplotypes according to a gradient descent method. To this end, the MEC problem is reformulated as a semi-definite optimization task, and preliminary solutions are obtained in polynomial time by exploiting the sparseness of overlaps between fragments for efficient implementation. These preliminary estimates are subsequently refined by greedy flipping of the alleles in the estimated homologues to further reduce the MEC, if possible. Through this flipping, SDhaP allows changes to the dosages of the alternative alleles estimated during variant calling, which can sometimes lower the error correction score. Therefore, the dosages of corresponding SNPs in the SDhaP estimates and the original haplotypes can differ, in contrast to the estimates produced by HapTree and HapCompass.
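The polyploid MEC objective that SDhaP approximates can be stated compactly: each fragment is charged the number of base flips needed to match its best-fitting homologue, and the phasing minimizing the total charge is sought. A minimal sketch of this scoring function for a k-ploid (an illustration of the criterion only, not SDhaP's semi-definite machinery; the fragment and haplotype encodings are assumptions):

```python
def polyploid_mec(haps, fragments):
    """MEC objective for a k-ploid phasing: each fragment is assigned to
    the homologue it matches best, and the score sums the remaining
    mismatches (the base flips an exact MEC solver would minimize)."""
    score = 0
    for frag in fragments:  # frag: {site_index: allele (0/1)}
        score += min(sum(frag[s] != h[s] for s in frag) for h in haps)
    return score

# Tetraploid toy example: four homologues over three SNP sites.
haps = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
reads = [{0: 0, 1: 1}, {1: 0, 2: 1}, {0: 1, 1: 1, 2: 1}]
score = polyploid_mec(haps, reads)  # 1: the third read fits no homologue exactly
```

A greedy refinement in the spirit of SDhaP's last stage would repeatedly flip single alleles in `haps` and keep any flip that lowers this score.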
Simulation of polyploid genomes and NGS reads

Haplotype generation

We developed the command-line tool haplogenerator for generating artificial genomes and their haplotypes with desired characteristics. Given an indexed FASTA file as input, haplogenerator applies random insertions, deletions and mutations to the input sequence according to a chosen stochastic model, producing a modified FASTA file for each of the k genomes of a k-ploid individual. A separate haplotype file is also made, containing the phasing of the generated variants. In the haplotype file, the reference and alternative alleles are numerically coded, assigning 0 to the nucleotide present in the input reference and the following integers to the alternative alleles. The coordinate of each variant on its contig is also specified within the haplotype file. The random indel and mutation sites are scattered across the input genome according to a selected built-in stochastic model for the distance between consecutive variants, or alternatively by sampling with replacement from a given empirical distribution for this distance. At each variant site si, possible alternative alleles are generated: for indels, an inserted or deleted nucleotide; for mutations, nucleotides other than the reference (or just one nucleotide to obtain bi-allelic SNPs). Next, a dosage di is assigned to the alternative allele at si according to user-specified probabilities for di = 1 to di = k, and di of the k homologues are selected randomly to have their allele at si changed to the alternative. To account for linkage disequilibrium, we imposed an additional step after this dosage assignment to reassign the alternative alleles at each si+1, i = 1, 2, …, n−1, to the homologues containing the alternative alleles at si (as far as the dosages at si and si+1 allow), with the reassignment decision made independently for each site from s2 to sn with an arbitrary probability set to 0.4.
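The variant-placement logic described above can be sketched as follows. This is a simplified illustration of the lognormal distance model with dosage assignment and the LD reassignment step; the function name, parameters and carrier-selection details are assumptions for this sketch, not haplogenerator's actual implementation.

```python
import random

def simulate_haplotypes(genome_len, k, mu, sigma, dosage_probs,
                        ld_prob=0.4, seed=1):
    """Scatter bi-allelic SNPs with lognormally distributed inter-SNP
    distances, draw a dosage for each site, and (with probability
    ld_prob) reassign alternative alleles to the homologues that carry
    the alternative allele at the previous site, mimicking LD."""
    rng = random.Random(seed)
    positions, pos = [], 0
    while True:
        pos += max(1, round(rng.lognormvariate(mu, sigma)))
        if pos >= genome_len:
            break
        positions.append(pos)
    haps = [[0] * len(positions) for _ in range(k)]
    prev_carriers = []
    for i in range(len(positions)):
        d = rng.choices(range(1, k + 1), weights=dosage_probs)[0]
        if prev_carriers and rng.random() < ld_prob:
            # Favour homologues already carrying the previous alternative
            # allele, topping up with random homologues if d is larger.
            extra = rng.sample([h for h in range(k) if h not in prev_carriers],
                               max(0, d - len(prev_carriers)))
            carriers = (prev_carriers + extra)[:d]
        else:
            carriers = rng.sample(range(k), d)
        for h in carriers:
            haps[h][i] = 1
        prev_carriers = carriers
    return positions, haps

# Tetraploid, 20 kb region, log-distance parameters and dosage
# proportions taken from the values quoted in the text.
pos, haps = simulate_haplotypes(20_000, 4, 3.0349, 1.293,
                                [0.50, 0.23, 0.14, 0.13])
```

Every site ends up with a dosage between 1 and k, and consecutive sites tend to share carrier homologues, as in the LD step described above.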
Simulation of NGS reads

We used the technology-specific simulator ART [50] to generate paired-end reads from Illumina MiSeq and HiSeq 2500 technologies [38, 55], and PBSIM [51] to simulate circular consensus sequencing (CCS) and continuous long reads (CLR) from Pacific Biosciences [55, 56]. The average length of single Illumina reads was set to the maximum allowed by ART (125 and 250 bp for HiSeq 2500 and MiSeq, respectively). For PBSIM, the average read length was set to 1 and 5 kb with CCS, and to 10 kb with CLR. Given these averages, both simulators generate reads with random lengths following built-in distributions derived from empirical data for each technology. Each homologue was ‘sequenced’ separately, and the reads were combined to simulate the output of a real sequencing apparatus. Average sequencing depths were specified for each homologue to obtain the desired average total depth, equal to k times the per-homologue depth, with k being the ploidy. Both ART and PBSIM draw the depth at each position from a discrete uniform distribution with the user-set mean, and hence the SD of the total depth depends on the average depth per homologue, c, and is equal to √(kc(c+2)/12) for a k-ploid. The choice of the sequencing strategies in our study was based on the efficiency and performance of the available techniques [55], as well as their practical convenience. In particular, we did not consider single-end Illumina reads, as a preliminary assessment showed that they produce low-quality estimates with a large number of gaps in the solution.

Simulation of polyploid data sets

To simulate realistic polyploid genomes, we chose tetraploid potato (S. tuberosum) as the model organism, because of the availability of a reference genome [11] as well as NGS data capturing the genomic variation of 83 diverse cultivars [48]. The sequence of chromosome 5 from the PGSC v4.03 DM draft genome [11] was used as the template sequence for haplogenerator.
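The total-depth SD quoted under 'Simulation of NGS reads' follows from summing k independent per-homologue depths, each with variance c(c+2)/12; that variance corresponds to a discrete uniform distribution over c+1 consecutive integers with mean c (e.g. from c/2 to 3c/2 for even c; this particular range is our assumption, as the simulators only expose the mean). A quick numerical check of the formula:

```python
import math

def total_depth_sd(k, c):
    """SD of the total depth for a k-ploid when each homologue's depth
    has variance c*(c+2)/12, as for a discrete uniform distribution
    over c+1 consecutive integers with mean c."""
    return math.sqrt(k * c * (c + 2) / 12.0)

# Per-homologue check for c = 10: depths uniform over 5..15 (c+1 = 11 values).
vals = range(5, 16)
var = sum((v - 10) ** 2 for v in vals) / len(vals)  # exactly c*(c+2)/12 = 10
sd_total = total_depth_sd(4, 10)  # tetraploid at 10x per homologue (40x total)
```

For a tetraploid at 10× per homologue, the total depth of 40× thus comes with an SD of √40 ≈ 6.3×.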
We selected random contiguous regions of 20 kb length from this template to be used as references for simulating genomes. In selecting the references, we rejected 20 kb regions of the template that contained >20% undetermined nucleotides, and omitted the undetermined sites (denoted by ‘N’ in the PGSC v4.03 DM sequence) before introducing mutations in the accepted regions. As the length of many genes falls below 20 kb, choosing references of this size allows us to evaluate haplotype estimation for amplicon sequences covering a complete gene. Random bi-allelic SNPs were introduced into each reference to produce synthetic tetraploid genomes according to the built-in lognormal model of haplogenerator, with the mean and the SD of the log-distance between the SNPs set to 3.0349 and 1.293, respectively, corresponding to an expected SNP frequency of 1 per 21 bp with a SD of 27 bp. The distribution of the dosages, di, was likewise set equal to that from [48], with percentages equal to 50, 23, 14 and 13% for simplex (di = 1), duplex (di = 2), triplex (di = 3) and quadruplex (di = 4) SNPs, respectively. To investigate the effect of library preparation, we considered various insert-sizes for paired-end Illumina reads, namely end-to-end insert-sizes of 235, 300, 400, 500, 600 and 800 bp with HiSeq 2500, and 400, 450, 500, 600 and 800 bp with MiSeq. For evaluation of the effect of sequencing depth on haplotyping, 2×, 5×, 8×, 10×, 20×, 22×, 25×, 28×, 30× and 35× average coverages were considered per homologue for each of these insert-sizes. To investigate the effects of genome characteristics, i.e. the ploidy level, the dosage of different homologues and the SNP density, on the quality of haplotype estimation, additional genomes were generated in a similar way by haplogenerator. Considering the same proportion of simplex and duplex SNPs, i.e.
SNPs with dosages equal to 1 and 2, respectively, as in [48], and considering equal proportions for the dosages >2, we simulated genomes with 3n, 4n, 6n, 8n, 10n and 12n ploidy levels to investigate the effect of the ploidy, and simulated modified tetraploid genomes that contained only two distinct homologues, with simplex and triplex dosages, to investigate the effect of similarity between the homologues on haplotype estimation. Although these scenarios assume a SNP-density model valid for S. tuberosum, they can still show the pattern by which ploidy level and similarity between the homologues influence the quality of haplotyping. Finally, tetraploid genomes with SNP densities lower than that of the highly heterozygous S. tuberosum [48] were simulated to observe the effect of SNP density, with average frequencies of 1 per 22 bp to 1 per 110 bp. In total, 250 individuals were simulated for each of the above-mentioned scenarios by choosing 25 random references from the template and generating 10 genomes with randomly distributed bi-allelic SNPs for each selected reference (Figure 2).

Figure 3. Quantile plots for the distances between successive SNPs obtained from simulation by haplogenerator using the lognormal distance model (horizontal axis) versus those obtained from the data of 83 potato cultivars of Uitdewilligen et al. (2013) [48]. The two distributions match well, though a heavier tail is observed for the data of Uitdewilligen et al. (2013), accounting for <2% of the SNPs.
Evaluation of the estimated haplotypes

As several types of error occurring in different steps of the haplotyping pipeline could cause differences between the actual haplotypes and their estimates, we needed several measures of consistency to capture all of them, as summarized in Table 2. These errors include the absence or wrong dosage of original SNPs in the estimates, the presence of spurious SNPs, discontinuity of the estimated haplotypes, i.e. the presence of gaps between estimated haplotype blocks, and finally wrong extension of homologues leading to incorrect phasing. We also included an extra measure, the failure rate (FR) for each algorithm, regardless of the quality of haplotype estimation, as the haplotyping tools sometimes failed to produce any estimate at all for some of the individuals.

Table 2. Summary of the measures used to assess the quality of haplotyping

Measure | Description | Limitation
PAR | The average accuracy of phasing for any possible SNP pair | Only SNPs present in both original haplotypes and estimates; no account of gaps
VER* | Number of wrong extensions of homologues with (parts of) other homologues | Only SNPs present in both original haplotypes and estimates with the same dosages; no account of gaps
SMR | Proportion of original SNPs missing in the estimates | No account of phasing, false-positive SNPs in the estimates or dosages
IDR | Proportion of SNPs with an incorrect dosage in the estimated haplotypes | Only SNPs present in both original haplotypes and estimates; no account of phasing
PPV | Proportion of true SNPs among the estimated SNPs | No account of phasing, missing SNPs or wrong dosages
NGPS* | Number of gaps (interruptions) introduced in the estimated haplotypes | No account of phasing, missing SNPs, wrong dosages or false positives
FR | Rate of failure for an algorithm | No account of the quality of the estimates

Note. *: measure normalized by the size of the original haplotypes, i.e. the number of original SNPs.

Some of the SNPs originally simulated were absent from the output set of each simulation, either because of mapping and variant calling errors or because they were not phased by the algorithms, and therefore could not be included in the comparisons of true and estimated phasings. Instead, their proportion was calculated as the SNP missing rate (SMR) and considered as the first measure of estimation quality. Similarly, spurious SNPs in the estimates were not included in the comparisons, and the positive predictive value (PPV), or precision, of the estimated SNPs was calculated as the proportion of genuine SNPs among the estimated SNPs as the next measure of estimation accuracy. The incorrect dosage rate (IDR) was calculated as the proportion of SNPs with an incorrect dosage in the estimated haplotypes, within the set of SNPs common to the original haplotypes and the estimates.
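To make these three SNP-level definitions concrete, they can be computed from a pair of SNP sets as in the minimal sketch below; the {position: dosage} representation and the function name are illustrative assumptions, not the pipeline's actual data structures:

```python
# Sketch of the per-individual SNP-level quality measures (SMR, PPV, IDR).
# Haplotypes are assumed (hypothetically) to be summarized as dictionaries
# mapping SNP position -> dosage of the alternative allele.

def snp_level_measures(true_dosages, est_dosages):
    """Return (SMR, PPV, IDR) for one simulated individual."""
    true_pos = set(true_dosages)
    est_pos = set(est_dosages)
    common = true_pos & est_pos

    # SMR: proportion of original SNPs absent from the estimates
    smr = 1 - len(common) / len(true_pos)

    # PPV: proportion of genuine SNPs among the estimated SNPs
    ppv = len(common) / len(est_pos)

    # IDR: proportion of shared SNPs with a wrong alternative-allele dosage
    wrong = sum(1 for p in common if true_dosages[p] != est_dosages[p])
    idr = wrong / len(common)

    return smr, ppv, idr

# Toy example: 4 simulated SNPs, 4 estimated SNPs
# (one original SNP missing, one spurious SNP, one wrong dosage)
true_h = {100: 1, 250: 2, 400: 3, 900: 1}
est_h = {100: 1, 250: 1, 400: 3, 555: 2}
print(snp_level_measures(true_h, est_h))  # (0.25, 0.75, ~0.33)
```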
Having excluded the missing and spurious SNPs, the pairwise phasing accuracy rate (PAR) [57] was computed as the proportion of all heterozygous SNP pairs for which the estimated phasing was correct. This measure captures errors caused by chimeric elongation of the homologues during haplotype estimation, i.e. the elongation of a homologue by (part of) another homologue, as well as errors caused by incorrect dosage estimation. One way to calculate the accuracy of phasing for more than just two SNPs is to consider the phasing accuracy rate for groups of three SNPs, four SNPs, etc. However, the phasing accuracies are no longer independent for groups of SNPs that have more than one SNP in common, leading to biased estimates of the accuracy rates. Instead, we calculated the vector error rate (VER), also called the switch error rate, defined as the number of times a homologue is erroneously extended by part of another homologue [47]. Such erroneous extensions are also called switches between homologues; for diploids, the measure equals twice the number of wrong phasings for pairs of consecutive SNPs. For polyploids, the measure is calculated by finding the minimum number of crossing-overs needed to reconstruct the true haplotypes from the estimates [47]. To be able to compare VER across different ploidy levels, genome lengths and SNP densities, we normalized it by the number of originally simulated SNPs as well as by the ploidy level of each individual. SNPs with a wrongly estimated dosage of the alternative allele were omitted before applying this measure, as otherwise the true haplotypes could not be reconstructed by simple switching of the estimated homologues without also considering allele flips from 0 to 1 or vice versa.
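The pairwise comparison behind PAR can be sketched as follows. The sketch assumes, hypothetically, that truth and estimate are given as lists of homologues over the same heterozygous SNP positions with correct dosages, so that only the phasing can differ; because the ordering of homologues in an estimate is arbitrary, the phasing of a SNP pair is compared as the multiset of per-homologue allele pairs:

```python
# Minimal sketch of the pairwise phasing accuracy rate (PAR): the fraction of
# SNP pairs whose multiset of per-homologue allele pairs matches the truth.
from itertools import combinations

def pairwise_accuracy(true_haps, est_haps):
    """true_haps/est_haps: lists of k homologues, each {position: allele (0/1)}."""
    shared = sorted(set(true_haps[0]) & set(est_haps[0]))
    correct = total = 0
    for i, j in combinations(shared, 2):
        # Sorting makes the comparison invariant to homologue ordering
        true_pairs = sorted((h[i], h[j]) for h in true_haps)
        est_pairs = sorted((h[i], h[j]) for h in est_haps)
        total += 1
        correct += (true_pairs == est_pairs)
    return correct / total

# Tetraploid toy example over three SNP positions: the estimate carries a
# single switch between the first two homologues after position 1
truth = [{1: 0, 2: 0, 3: 1}, {1: 0, 2: 1, 3: 0}, {1: 1, 2: 0, 3: 0}, {1: 1, 2: 1, 3: 1}]
est   = [{1: 0, 2: 1, 3: 1}, {1: 0, 2: 0, 3: 0}, {1: 1, 2: 0, 3: 0}, {1: 1, 2: 1, 3: 1}]
print(pairwise_accuracy(truth, est))  # 2 of 3 pairs phased consistently
```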
The last measure of estimation quality that we used was the number of gaps per SNP (NGPS) in the estimates, as the simulated continuous haplotypes can be broken into several disconnected blocks, causing gaps in the estimated haplotypes. This happens when the connection between SNPs is lost because of low sequencing coverage or sequencing/variant calling errors at certain sites. We therefore calculated the number of break points, i.e. gaps, in the estimates (equal to the number of disjoint blocks minus one), and normalized it by the total number of simulated SNPs for each individual, for the same reasons as for VER. When gaps were present in the estimates, we calculated the other measures separately for each estimated block and reported the weighted average of the block-specific measures, weighted by the number of compared SNPs in each estimated block (or the number of possible pairwise phasings in the case of PAR). Finally, the computational complexity of each haplotyping algorithm was considered as a function of sequencing coverage, insert-size, ploidy, SNP density and homologue dosages. The haplotyping methods applied are memory-intensive, consuming increasing amounts of system resources over time, sometimes up to tens of gigabytes of virtual memory. To run the methods on a system with shared resources, a time limit of 900 s (2000 s for the analyses with various ploidy levels) was therefore imposed on each haplotyping algorithm, after which the algorithms were externally halted and the estimation was considered a failure. This amount of time was deemed reasonable considering the 20 kb length of the simulated genomic regions. The number of time-out events was added to the number of times each algorithm failed to estimate any haplotypes because of internal errors, and the total FRs are reported for each haplotyping scenario.
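Both the gap measure and the block-weighted averaging described above amount to a few lines of arithmetic; the sketch below uses illustrative function names and inputs, not the pipeline's actual interfaces:

```python
# Sketch of NGPS and of the block-weighted averaging used when an estimate is
# split into disjoint haplotype blocks. Inputs are hypothetical summaries.

def ngps(n_blocks, n_true_snps):
    """Gaps per simulated SNP: break points (blocks - 1) over the original SNP count."""
    return (n_blocks - 1) / n_true_snps

def weighted_measure(block_values, block_weights):
    """Average a per-block quality measure, weighted by the number of compared
    SNPs in each block (or by the number of pairwise phasings, for PAR)."""
    return sum(v * w for v, w in zip(block_values, block_weights)) / sum(block_weights)

# An estimate broken into 3 blocks over 50 simulated SNPs: 2 gaps
print(ngps(n_blocks=3, n_true_snps=50))        # 0.04
# A measure of 0.9 on a 40-SNP block and 0.5 on a 10-SNP block
print(weighted_measure([0.9, 0.5], [40, 10]))  # 0.82
```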
Comparison of haplotyping algorithms

To compare the overall performance of the three haplotype estimation methods, we built linear regression models with the quality measures described above as response and the haplotyping method as predictor, considering sequencing depth, sequencing technology and the (paired-end) library size as covariates. As each simulated genome was haplotyped by all three estimation methods, the effect of the genome on the estimation quality was incorporated as a random effect in the model. Similarly, as 10 genomes were generated from each of the 25 randomly selected references, the effect of the common reference was added as a second random component. For each quality measure, a complete-case analysis was performed, including only the results of those simulations for which all three estimation methods reported a value. The models were fitted by restricted maximum likelihood [58] using the lmer function from the package lme4 [59] in R 3.2.2 [60].

Results and discussion

Haplogenerator produces realistic genomes

To investigate the compatibility of the simulated 20 kb S. tuberosum genomic regions with the real regions sequenced by Uitdewilligen et al. (2013) [48] in terms of the density of bi-allelic SNPs, we obtained quantile-quantile (QQ) plots of the distances between consecutive SNPs generated by the applied lognormal model versus the distances between consecutive bi-allelic SNPs (within the same RNA-capture region) in the combined data of 83 diverse cultivars from [48]. As shown in Figure 3, the two empirical distributions match well, although the distribution of real SNPs seems to have a heavier tail than lognormal (accounting for <2% of the total number of real bi-allelic SNPs).
This heavier tail is plausibly explained by the presence of highly conserved regions in real genomes, subject to natural and artificial selection pressure, as well as by the use of genome-wide RNA baits to reduce genome complexity in [48], which can exclude some SNPs in target regions with poor capture success. The proportions of simplex to quadruplex SNPs, i.e. the dosage proportions, were also almost identical to those obtained from Uitdewilligen et al. (2013) [48] (Section Simulation of polyploid data sets).

Sequencing depth is the major determinant of haplotyping quality

An important goal of the simulations was to observe to what extent sequencing strategies influence haplotyping results, because of the practical importance for setting up experiments and choosing a technology. As different technologies rely on different library preparation and nucleotide calling methods, their output often differs in terms of average read length and sequencing error profile. Moreover, with the same technology, the sequencing depth, average read length and paired-end insert-size can vary according to the user's requirements. We found that the performance of all three haplotyping methods was considerably affected by the sequencing strategy, most notably by the sequencing depth. Regardless of the sequencing technology and insert-size used, a sequencing depth between 5× and 20× per homologue is required to obtain satisfactory results in terms of haplotype accuracy (PAR) and completeness (SMR) (Figure 4A and B). Both improve continuously with sequencing depth, but flatten out at 15×. A notable exception is HapTree, whose increased FR at higher depths is reflected in worse completeness (increasing SMR). Other quality measures (VER, PPV, IDR and NGPS) were not substantially influenced by sequencing depth at depths >5× per homologue. Figure 4.
Plots of haplotype estimation quality measures: (A) PAR, (B) SMR and (C) FR as a function of sequencing depth per homologue using HapCompass (black), HapTree (gray) and SDhaP (light gray), for simulated 20 kb tetraploid S. tuberosum genomes. Sequencing was performed in silico for paired-end MiSeq (triangle) and HiSeq 2500 (diamond) with 800 bp insert-size, as well as for PacBio-CCS of 1 kb length (circle).

HapTree's FR (Figure 4C) was rather high at both low and high sequencing depths. At lower depths, <2× per homologue, there is not enough information available for effective branching and pruning of the solution tree, and time-out errors result in failures. In contrast, at high sequencing depths the relative likelihood values often become small because of the presence of many terms in the likelihoods of partial haplotypes, making a meaningful comparison impossible. This problem is discussed further below. As sequencing depth is an important factor in the total cost of sequencing, these results show that extra cost can be avoided by choosing a moderate sequencing depth without sacrificing considerable haplotyping accuracy.

Large-insert paired-end reads are competitive with long reads

In addition to sequencing depth, the insert-size of paired-end reads and the sequencing technology used can have an impact on estimation quality. These are also important factors to specify when designing a sequencing experiment, as they influence cost, throughput and quality.
To quantify their effects, we simulated NGS reads for HiSeq 2500, MiSeq and CCS technologies at each sequencing depth, and simulated paired-end reads with various insert-sizes for HiSeq 2500 and MiSeq (Section Simulation of NGS reads). Our results show that at the same sequencing depth, increasing the insert-size of paired-end reads did not markedly influence the overall quality of haplotyping (Supplementary Figure S1), except for the number of gaps, which, as expected, was reduced with larger inserts. Moreover, similar estimation qualities were obtained using the 1 kb CCS reads and paired-end Illumina reads with a large insert-size (800 bp) (Figure 4). At the same depth, the paired-end reads contain essentially the same phasing information as the long reads. Although libraries with large inserts are costly and difficult to obtain, they may still be easier to generate than long continuous reads and can therefore be a competitive option when designing haplotyping experiments.

HapTree is the most accurate method, but often fails

Different haplotyping algorithms yield different estimates for the same individual. With simulated individuals it is possible to compare the quality of these estimates, as the haplotypes are known a priori. To this end, we used linear regression models relating the performance measures to the algorithms used (Section Comparison of haplotyping algorithms). Because of the substantial difference between the estimation results using CLR and those of the other sequencing methods (see below), the CLR results were excluded from the regression analysis. Table 3 shows the 99% confidence intervals for the effects of estimation method and sequencing technology on haplotyping accuracy for the tetraploid genomes, with HapCompass on CCS data taken as the baseline. HapTree is significantly more accurate (higher PAR and lower VER) than the other methods, but less complete (higher SMR) because of its frequent failure (Figure 4C).
SDhaP yields slightly, but significantly, worse dosage estimates (higher IDR). There was no significant relation between the method used and the continuity of the haplotype estimates (NGPS) or the precision of the SNPs (PPV).

Table 3. Point estimates and 99% confidence intervals for the effects of haplotyping and sequencing methods on the haplotyping quality measures

Parameter | PAR | VER | SMR | IDR | PPV | NGPS
Intercept* | 0.33 (0.308; 0.356) | 0.25 (0.206; 0.285) | 0.06 (0.022; 0.090) | 0.20 (0.178; 0.222) | 0.97 (0.970; 0.971) | 0.01 (0.007; 0.013)
HapTree | 0.23* (0.226; 0.230) | −0.1* (−0.1; −0.09) | 0.27* (0.263; 0.270) | 0.00 (0.003; 0.005) | 0.00 (0.000; 0.000) | 0.00 (−0.002; −0.001)
SDhaP | 0.08* (0.079; 0.082) | −0.01* (−0.014; −0.005) | 0.00 (−0.003; 0.004) | 0.01* (0.012; 0.013) | 0.00 (0.000; 0.000) | 0.00 (0.000; 0.000)
HiSeq 2500 | −0.005 (−0.026; 0.016) | 0.06* (0.024; 0.1) | 0.04* (0.011; 0.072) | 0.01 (−0.011; 0.028) | 0.03* (0.025; 0.026) | 0.01* (0.008; 0.013)
MiSeq | −0.002 (−0.023; 0.019) | −0.02 (−0.051; 0.02) | 0.02 (−0.008; 0.052) | 0.01 (−0.013; 0.026) | 0.02* (0.023; 0.025) | 0.00 (0.001; 0.006)

* Statistically significant at α = 0.01. Point estimates and 99% Wald-type confidence intervals for the effects of the haplotyping methods HapTree, SDhaP and HapCompass (the reference) and the sequencing technologies MiSeq, HiSeq 2500 and CCS (the reference) on six measures of haplotyping quality: pairwise phasing accuracy rate (PAR), vector error rate (VER), SNP missing rate (SMR), incorrect dosage rate (IDR), positive predictive value of the called SNPs (PPV) and number of gaps in the estimates per SNP (NGPS).

Finally, the results were not significantly different between Illumina MiSeq reads and PacBio CCS, except for PPV, which was slightly higher with MiSeq. HiSeq 2500 reads, on the other hand, resulted in significantly higher VER, SMR and NGPS at α = 0.01 compared with the other sequencing methods, with the most noticeable effect on the number of gaps, as expected considering the short single-read length of HiSeq. Overall, these results confirm that HapTree is the most accurate method when it does not fail, and that Illumina and PacBio reads offer similar performance.

Similarity between homologues eases haplotyping with Illumina

Similarity between homologues can have a large effect on haplotyping. Such similarity often occurs when random mating is violated, e.g. in inbred or isolated populations. To investigate this, we simulated simplex–triplex individuals, i.e. tetraploid individuals consisting of two different genomes with dosages of 1 and 3. We generated paired-end MiSeq and HiSeq 2500 reads (800 bp insert-size), as well as 1 kb CCS reads of PacBio, and evaluated the estimated haplotypes. On these data, the performance of HapTree (with Illumina reads) and HapCompass improves over the original simulation, whereas the performance of SDhaP deteriorates significantly (Figure 5).
In particular, the similarity between homologues resulted in decreased accuracy for SDhaP (Figure 5A, PAR around 0.2) and incorrect dosage estimates for more than half of the SNPs (Figure 5C, IDR of 0.55), regardless of the sequencing method.

Figure 5. Plots of haplotype estimation quality measures: (A) PAR, (B) VER, (C) IDR, (D) SMR and (E) FR as a function of sequencing depth using HapCompass (black), HapTree (gray) and SDhaP (light gray), for simulated 20 kb simplex–triplex tetraploid genomes (circle) compared with genomes with random haplotype dosages (square). Sequencing was performed in silico for paired-end HiSeq 2500 reads with 800 bp insert-size.

These results demonstrate the differences between the MEC approach to haplotyping and other approaches. MEC is sensitive to (local) similarities between homologues, as these lead to approximately identical MEC scores for several different phasings, causing SDhaP to report a suboptimal solution. In contrast, the performance of HapCompass (MWER approach) and HapTree (relative likelihood approach) improves, at least when using Illumina sequencing (Figure 5A and B). Having more similar fragments simplifies construction of the maximum spanning tree in the compass graph and makes the branching and pruning of HapTree's solution tree more accurate by enhancing the relative likelihoods of correct partial phasings.
No improvement was observed, however, with HapTree using CCS reads, because of increasing FRs (Figure 5E) caused by time-out errors. These results show that the underlying algorithms differ in their sensitivity to homologue similarity, with MEC-based approaches yielding incorrect results and the other methods demanding increasing computation time.

SNP density mainly influences continuity of haplotype estimates

In genomes with a lower SNP density than the highly heterozygous potato, S. tuberosum, fragments will overlap less often, which can influence the quality of haplotyping. To determine the effect of SNP density, we simulated tetraploid genomes with average SNP densities ranging from 1 SNP per 22 base pairs, the average density for potato, to 1 SNP per 110 base pairs, and estimated the haplotypes using Illumina paired-end reads with an insert-size of 800 bp, as well as 1 kb CCS reads of PacBio, at a sequencing depth of 15× per homologue. Increasing numbers of gaps (NGPS, Figure 6A) and decreasing completeness (SMR, Figure 6B) were observed in the estimated haplotypes at lower densities for all three haplotyping methods. The effect of SNP density on the other haplotyping quality measures was, however, not manifest (Supplementary Figure S5).

Figure 6. Plots of haplotype estimation quality measures: (A) NGPS, (B) SMR as a function of SNP density (at logarithmic distance scale) using HapCompass (black), HapTree (gray) and SDhaP (light gray), for simulated 20 kb tetraploid S. tuberosum genomes. Sequencing was performed in silico for paired-end MiSeq (triangle) and HiSeq 2500 (diamond) with 800 bp insert-size, as well as for PacBio-CCS of 1 kb length (circle), at a depth of 15×.

At higher ploidy levels, HapCompass is the best method to use

To investigate whether our findings for tetraploid genomes hold for other ploidy levels, we performed simulations with ploidy levels of 3–12 (Section Simulation of polyploid data sets). We simulated paired-end HiSeq 2500 reads with an insert-size of 800 bp, as this gave high-quality estimates in tetraploids and was more practical than the competing sequencing options, at 5×, 15× and 20× sequencing depths per homologue. The accuracy of HapTree and SDhaP decreased markedly with increasing ploidy level, by up to 30% for 12n (PAR, Figure 7A), whereas the performance of HapCompass remained stable. Likewise, the completeness of HapTree decreased (SMR, Figure 7B) and the FRs of both HapTree and SDhaP increased (Figure 7C). Although the performance of the methods at each ploidy level was relatively better at higher sequencing depths, the deterioration of haplotype estimation quality followed a similar pattern with increasing ploidy, regardless of the depth.

Figure 7. Plots of haplotype estimation quality measures: (A) PAR, (B) SMR and (C) FR as a function of ploidy level using HapCompass (black), HapTree (gray) and SDhaP (light gray), for simulated 20 kb 3n, 4n, 6n, 8n, 10n and 12n genomes. Sequencing was performed in silico for paired-end HiSeq 2500 with 600 bp insert-size. Three sequencing depths were used per homologue: 5× (circle), 15× (triangle) and 20× (diamond).

Overall, none of the haplotyping methods is equipped to deal with high levels of ploidy: they either break down (HapTree and SDhaP) or are inaccurate (HapCompass).

SDhaP yields best results using long reads

Among the three tested haplotyping methods, SDhaP is the only one relying on MEC to select the best estimate. Although MEC is an efficient criterion for diploid haplotyping, it may not be able to distinguish between estimates in the presence of more than two homologues, as several estimates can have the same error correction score. This situation occurs especially when the homologues have local similarities and the SNP-fragment lengths are small. HapTree and HapCompass try to surmount this problem by considering more complex criteria that result in more accurate estimates with short Illumina reads and even 1 kb CCS reads. Nevertheless, these criteria lose their advantage with long sequence reads, 5 kb CCS and 10 kb CLR in our study. These longer reads also increase the failure frequency of HapTree. The MEC criterion of SDhaP, in contrast, performs well when the reads are long enough to distinguish the homologues, requiring the least computation time (Figure 9A) and providing accurate results using 5 kb CCS (Figure 8A).

Figure 8. Plots of haplotype estimation quality measures: (A) PAR, (B) VER, (C) IDR, (D) PPV, (E) SMR and (F) FR as a function of sequencing depth per homologue using HapCompass (left), HapTree (middle) and SDhaP (right), for simulated PacBio reads: CCS 1 kb (black), CCS 5 kb (gray) and CLR 10 kb (light gray).

Figure 9. (A) Computation time (in seconds) and (B) physical memory consumption (in megabytes) as a function of sequencing depth per homologue with three haplotype estimation software tools: HapCompass (black circle), HapTree (gray square) and SDhaP (light gray rhombus), using 10 kb continuous long reads of PacBio for tetraploid genomes of length 20 kb.

Erroneous long reads lead to low accuracy, high SMR and many false SNPs in the estimates

Recently, the generation of long reads spanning tens of thousands of nucleotides has become a reality using technologies such as Oxford Nanopore and PacBio, at the expense of base-calling precision [38]. Such lengthy reads are potentially ideal candidates for haplotyping, as they can cover many variants and provide enough overlap for accurate haplotype reconstruction [42]. However, our simulations using 10 kb CLR reads of PacBio, with an average accuracy of 82%, show that these reads lead to inferior estimates for polyploids compared with paired-end Illumina reads and CCS reads. In particular, many spurious SNPs will be present (Figure 8D), and many of the original SNPs will be missing in the estimates (Figure 8E).
In addition, wrong dosages abound in the estimated haplotypes (Figure 8C). Although increasing the coverage improves the estimation to some extent, especially with SDhaP (Figure 8A, C and E), our results do not encourage the use of erroneous long reads for the estimation of polyploid haplotypes, as achieving the extremely high coverage needed is usually not practical.

HapTree requires most resources

Computational efficiency is an important feature of every complex algorithm, such as the haplotyping algorithms discussed in this article. We therefore measured the memory and time consumption of each algorithm for various ploidy levels, sequencing coverages and genome lengths. Using HiSeq 2500 and paired-end libraries with an insert-size of 800 bp for the simulation of sequencing, we tested the effect of sequencing depth with tetraploid individuals and genomes of length 10 kb, and the effect of genome length with tetraploid individuals sequenced at an average depth of 10× per homologue. Other settings were the same as for S. tuberosum (Section Simulation of polyploid genomes and NGS reads), generating for each condition 50 individuals from 50 randomly selected regions, i.e. one individual per region, with a time limit of 7200 s. Fixing the depth at 10× per homologue and the genome length at 10 kb, the effect of ploidy was investigated in a similar manner. The analyses were run on multicore 2.6 GHz Intel Xeon processors. For each run, the total CPU time and physical memory consumption were measured using the Unix getrusage routine. HapTree clearly consumed the most time and memory (Figure 10), increasing with genome length (Figure 10A and B) and ploidy level (Figure 10C and D). This increase was much smaller for HapCompass and SDhaP. Figure 10.
Plots of physical memory consumption (in megabytes) against length of genome (in base pairs) (A), ploidy (C) and sequencing depth (E), and plots of computation time (in seconds) against length of genome (in base pairs) (B), ploidy (D) and sequencing depth per homologue (F), for three haplotype estimation software tools: HapCompass (black circle), HapTree (gray square) and SDhaP (light gray diamond). Sequencing was based on HiSeq 2500 paired-end technology with 800 bp insert-size.

Although sequencing depth was less influential, HapTree used much more time and memory at the lowest sequencing depth, 5× per homologue, falling rapidly with an increase in depth up to 10× per homologue and remaining almost constant thereafter up to a depth of 20× per homologue (Figure 10E and F). At depths >20× per homologue, the computation time fell rapidly again because of the premature failure of the algorithm, as discussed in the section on the influence of sequencing depth.

Conclusion

We evaluated three algorithms for single-individual haplotype estimation in polyploids: HapCompass, HapTree and SDhaP, and investigated the effects of SNP density, similarity between homologues, ploidy level, sequencing technology, sequencing depth and DNA library size on the estimation quality, using several measures of quality (Table 2) and extensive simulation experiments. This yielded insight into the performance of haplotype estimation methods in practical situations.
For this purpose, we have developed a realistic pipeline that can be used as a basis for benchmarking single-individual haplotyping software in the future. Our results show that HapTree can produce the best triploid and tetraploid haplotype estimates, followed by SDhaP and HapCompass. HapCompass is the best method to use at ploidy levels above six, although its performance is not good in an absolute sense. We showed that sequencing depth is the most important factor determining the quality of haplotype estimation, and that Illumina paired-end short reads with a large insert-size can perform as well as the long PacBio CCS reads of the same total size that are now possible. For accurate haplotyping, we therefore suggest an average depth of between 5× and 20× per homologue with paired-end reads and an insert-size of 600–800 bp. In addition to estimation quality, we investigated the computation time and memory consumption of each algorithm under various settings to compare their efficiency. We showed that, on average, HapTree requires the most computation time and memory, and that its resource use depends strongly on the length of the genomic region, the ploidy level and the sequencing depth. Combined with its frequent failure to complete the estimation, this raises difficulties for applying HapTree to practical problems where the aim is to reconstruct long-range haplotypes. Our findings show that while state-of-the-art single-individual haplotype estimation algorithms produce promising results for triploid and tetraploid organisms over a limited genomic region, their performance decreases rapidly at higher ploidy levels, and their resource use prohibits application to large genomic regions. The probability-based algorithm of HapTree produces the most accurate estimates but also requires the most computation time and memory. We believe it is worth investigating whether the HapTree approach can be made robust when faced with larger problems while maintaining its accuracy, e.g.
using a divide-and-conquer approach or by adjusting the branching and pruning parameters according to the length of the genome, the ploidy level and the sequencing coverage. The variant calling error model could also be upgraded to be specific to the sequencing strategy and technology applied. Finally, the performance of haplotyping methods on individual organisms could be greatly improved if they could also incorporate parental and sibling information when available, e.g. in mapping populations relevant to plant and animal breeding studies. Although the evaluated algorithms ignore this information, it can be extremely helpful to increase the precision of genotype calling when the average sequencing depth is low, or to favor/disfavor some of the haplotypes a priori based on their expected frequency in the population. Such enhancements will prove essential to help understand the complex genetics found in many polyploid organisms and, in the long run, to better understand the rules governing genome organization.

Software
The simulation pipeline and its components can be downloaded from the Software section at http://bif.wur.nl/

Key Points
- Haplotyping accuracy for polyploids depends mostly on sequencing depth, with 5–20× per homologue giving optimal results.
- Paired-end Illumina reads with large insert-sizes are competitive with PacBio CCS reads, at reduced cost, where quality of haplotype estimation is concerned.
- For ploidy levels up to six, HapTree is the most accurate method and can produce high-quality estimates; above that, HapCompass is the best method to use, although its estimation quality is not high in an absolute sense.
- With long reads generated by recent sequencing technologies, SDhaP proves to be the most accurate and efficient.
- Despite its high accuracy, HapTree may break down for long genomic regions, high ploidy levels and long reads of recent technologies, sometimes failing to converge within a reasonable amount of time.
- Improvements are needed to allow accurate haplotyping in polyploids in large genomic regions, and may be found in adapting parameters to the problem at hand and in using sibship and parental information if available.

Ehsan Motazedi is a PhD student at Wageningen UR Plant Breeding and the Bioinformatics Group, both at Wageningen University and Research. His project focuses on computational methods for studying the genetics of polyploid crops. Richard Finkers is a staff scientist at Wageningen Research, Plant Breeding. His research interests include using genomics technologies to enable precision breeding in crops with complex genomes. Chris Maliepaard is an associate professor at Wageningen University, Plant Breeding, where he leads the research group ‘Quantitative aspects of Plant Breeding’. His research interests include quantitative genetics with a focus on genetic analysis of polyploid species. Dick de Ridder is professor in bioinformatics and head of the Bioinformatics Group at Wageningen University. He works on learning-based algorithms, with applications in the analysis of complex genomes and in systems and synthetic biology.

Supplementary Data
Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements
The authors wish to thank Herman J. van Eck for his input and valuable discussions regarding the data from Uitdewilligen et al. [48], Emily Berger for sharing HapTree code, and Vikas Bansal and Derek Aguiar for discussion on HapCut and HapCompass, respectively.

Funding
Wageningen University and Research Centre through the graduate school Experimental Plant Sciences (EPS).

References
1. Collins FS, Morgan M, Patrinos A. The human genome project: lessons from large-scale biology. Science 2003;300(5617):286–90.
2. Sachidanandam R, Weissman D, Schmidt SC, et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms.
Nature 2001;409(6822):928–33.
3. Gibbs RA, Belmont JW, Hardenbol P, et al. The international HapMap project. Nature 2003;426(6968):789–96.
4. Ganal MW, Altmann T, Röder MS. SNP identification in crop plants. Curr Opin Plant Biol 2009;12(2):211–7.
5. LaFramboise T. Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances. Nucleic Acids Res 2009;37(13):4181–93.
6. Visscher PM, Brown MA, McCarthy MI, et al. Five years of GWAS discovery. Am J Hum Genet 2012;90(1):7–24.
7. Kaul S, Koo HL, Jenkins J, et al. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000;408(6814):796–815.
8. Clark AG, Eisen MB, Smith DR, et al. Evolution of genes and genomes on the Drosophila phylogeny. Nature 2007;450(7167):203–18.
9. Hillier LW, Miller W, Birney E, et al. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 2004;432(7018):695–716.
10. Groenen MA, Archibald AL, Uenishi H, et al. Analyses of pig genomes provide insight into porcine demography and evolution. Nature 2012;491(7424):393–8.
11. Potato Genome Sequencing Consortium; Xu X, Pan S, et al. Genome sequence and analysis of the tuber crop potato. Nature 2011;475(7355):189–95.
12. Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature 2012;485(7400):635–41.
13. Kim S, Park M, Yeom SI, et al. Genome sequence of the hot pepper provides insights into the evolution of pungency in Capsicum species. Nat Genet 2014;46(3):270–8.
14. Goddard ME, Hayes BJ. Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat Rev Genet 2009;10(6):381–91.
15. Mammadov J, Aggarwal R, Buyyarapu R, et al. SNP markers and their impact on plant breeding. Int J Plant Genomics 2012;2012(3):728398.
16. Birney E, Soranzo N. Human genomics: the end of the start for population sequencing. Nature 2015;526(7571):52–3.
17. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN], 2012.
18. McKenna A, Hanna M, Banks E, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20(9):1297–303.
19. Castiglione C, Deinard A, Speed W, et al. Evolution of haplotypes at the DRD2 locus. Am J Hum Genet 1995;57(6):1445.
20. Glusman G, Cox HC, Roach JC. Whole-genome haplotyping approaches and genomic medicine. Genome Med 2014;6(9):73.
21. Caputo M, Rivolta CM, Gutnisky VJ, et al. Recurrence of the p.R277X/p.R1511X compound heterozygous mutation in the thyroglobulin gene in unrelated families with congenital goiter and hypothyroidism: haplotype analysis using intragenic thyroglobulin polymorphisms. J Endocrinol 2007;195(1):167–77.
22. Hamblin MT, Jannink JL. Factors affecting the power of haplotype markers in association studies. Plant Genome 2011;4(2):145–53.
23. Lorenz AJ, Hamblin MT, Jannink JL, et al. Performance of single nucleotide polymorphisms versus haplotypes for genome-wide association analysis in barley. PLoS One 2010;5(11):e14079.
24. Howie B, Fuchsberger C, Stephens M, et al. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 2012;44(8):955–9.
25. Hickey JM, Gorjanc G, Varshney RK, et al. Imputation of single nucleotide polymorphism genotypes in biparental, backcross, and topcross populations with a hidden Markov model. Crop Science 2015;55(5):1934–46.
26. Lancia G, Bafna V, Istrail S, et al. SNPs problems, complexity, and algorithms. In: Algorithms–ESA. Springer-Verlag, Heidelberg, 2001, 182–93.
27. Rizzi R, Bafna V, Istrail S, et al. Practical algorithms and fixed-parameter tractability for the single individual SNP haplotyping problem. In: Algorithms in Bioinformatics. Springer-Verlag, Heidelberg, 2002, 29–43.
28. Lippert R, Schwartz R, Lancia G, et al. Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Brief Bioinform 2002;3(1):23–31.
29. Wang RS, Wu LY, Li ZP, et al. Haplotype reconstruction from SNP fragments by minimum error correction. Bioinformatics 2005;21(10):2456–62.
30. Zhao YY, Wu LY, Zhang JH, et al. Haplotype assembly from aligned weighted SNP fragments. Comput Biol Chem 2005;29(4):281–7.
31. Wang RS, Wu LY, Zhang XS, et al. A Markov chain model for haplotype assembly from SNP fragments. Genome Inform 2006;17(2):162–71.
32. Wang Y, Feng E, Wang R.
A clustering algorithm based on two distance functions for MEC model. Comput Biol Chem 2007;31(2):148–50.
33. Bansal V, Bafna V. HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics 2008;24(16):i153–9.
34. Wu LY, Li Z, Wang RS, et al. Self-organizing map approaches for the haplotype assembly problem. Math Comput Simul 2009;79(10):3026–37.
35. Bansal V, Halpern AL, Axelrod N, et al. An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Res 2008;18(8):1336–46.
36. Wu J, Wang J, Chen J. A practical algorithm based on particle swarm optimization for haplotype reconstruction. Appl Math Comput 2009;208(2):363–72.
37. Kuleshov V. Probabilistic single-individual haplotyping. Bioinformatics 2014;30(17):i379–85.
38. Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet 2010;11(1):31–46.
39. Chin CS, Peluso P, Sedlazeck FJ, et al. Phased diploid genome assembly with single molecule real-time sequencing. Nat Methods 2016.
40. Ozkan H, Levy AA, Feldman M. Allopolyploidy-induced rapid genome evolution in the wheat (Aegilops–Triticum) group. Plant Cell 2001;13(8):1735–47.
41. Krasileva KV, Buffalo V, Bailey P, et al. Separating homeologs by phasing in the tetraploid wheat transcriptome. Genome Biol 2013;14(6):R66.
42. Aguiar D, Istrail S. Haplotype assembly in polyploid genomes and identical by descent shared tracts. Bioinformatics 2013;29(13):i352–60.
43. Das S, Vikalo H.
SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming. BMC Genomics 2015;16(1):260.
44. Vrijenhoek RC. Polyploid hybrids: multiple origins of a treefrog species. Curr Biol 2006;16(7):R245–7.
45. Yabe T, Ge X, Pelegri F. The zebrafish maternal-effect gene cellular atoll encodes the centriolar component sas-6 and defects in its paternal function promote whole genome duplication. Dev Biol 2007;312(1):44–60.
46. Aguiar D, Wong WS, Istrail S. Tumor haplotype assembly algorithms for cancer genomics. Pac Symp Biocomput 2014;19:3–14.
47. Berger E, Yorukoglu D, Peng J, et al. HapTree: a novel Bayesian framework for single individual polyplotyping using NGS data. PLoS Comput Biol 2014;10(3):e1003502.
48. Uitdewilligen JG, Wolters AMA, D’hoop BB, et al. A next-generation sequencing method for genotyping-by-sequencing of highly heterozygous autotetraploid potato. PLoS One 2013;8(5):e62355.
49. Aguiar D, Istrail S. HapCompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data. J Comput Biol 2012;19(6):577–90.
50. Huang W, Li L, Myers JR, et al. ART: a next-generation sequencing read simulator. Bioinformatics 2012;28(4):593–4.
51. Ono Y, Asai K, Hamada M. PBSIM: PacBio reads simulator – toward accurate genome assembly. Bioinformatics 2013;29(1):119–21.
52. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997, 2013.
53. Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25(16):2078–9.
54. Picard tools home page. http://broadinstitute.github.io/picard (13 January 2016, date last accessed).
55. Quail MA, Smith M, Coupland P, et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 2012;13(1):341.
56. Liu L, Li Y, Li S, et al. Comparison of next-generation sequencing systems. J Biomed Biotechnol 2012;2012:251364.
57. Browning SR, Browning BL. Haplotype phasing: existing methods and new developments. Nat Rev Genet 2011;12(10):703–14.
58. Thompson R. Maximum likelihood estimation of variance components. Statistics 1980;11(4):545–61.
59. Bates D, Sarkar D, Bates MD, et al. The lme4 package. R package version 2(1), 2007.
60. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, 2015.

© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
Briefings in Bioinformatics, Oxford University Press

ISSN: 1467-5463; eISSN: 1477-4054. DOI: 10.1093/bib/bbw126. Publisher: Oxford University Press.

Efficient tools have been developed to call variants based on NGS data, e.g. FreeBayes [17] and GATK [18], and also to link variants on the same homologous chromosome, so-called haplotype phasing. However, while phasing of nearby variants occurring within the average NGS read length is relatively straightforward, long-range haplotyping using NGS data remains a challenge.
Nevertheless, haplotyping is important in many areas: in fundamental biology, to improve our understanding of genome structure, recombination and evolution [19]; in medicine, to obtain a full picture of the genetic variation in a population potentially linked to diseases and traits [20] and to investigate the effect of compound heterozygosity [21]; and in animal and plant breeding, to move from phenotype-based to genotype-based crossing and selection of individuals [22, 23]. Moreover, knowledge of haplotypes can help reveal the linkage disequilibrium pattern in a population and hence increase the power and coverage of genetic analyses, by allowing the imputation of a large number of alleles from a limited set of genotyped loci [24, 25]. In the diploid case, haplotyping algorithms aim to divide the aligned reads into two complementary sets, each covering a specific region of, say, n heterozygous sites, such that the nucleotides are the same at the overlapping sites of the reads within each set, but differ between the sets. The algorithmic challenge is to take the occurrence of sequencing and variant calling errors into account [26–28]. Minimum error correction (MEC), the most prevalent approach, assigns reads that conflict with both read sets, presumably because of sequencing or variant calling errors, to one of the two sets by flipping single bases; the aim is to find a configuration that requires a minimal number of such base flips [27, 29]. This strategy is the basis of several diploid haplotyping algorithms [30–37]. Recently, diploid-aware genome assemblers based on long reads, produced for instance by PacBio and Nanopore technologies [38], have also been developed. These algorithms use the overlap-layout-consensus approach to construct the assembly graph and obtain the primary contigs from the raw reads, and try to resolve the haplotypes by calling heterozygous SNPs within the contigs.
These SNPs are used to separate the long reads into two groups, the so-called ‘haplotigs’, from which consensus sequences are obtained to determine the phasing [39]. For polyploid genomes, the problem could be formulated as dividing the reads into k > 2 groups, but the generalization from the diploid case is not straightforward. Specifically, polyploids can be classified as allopolyploids, autopolyploids or mixed types, i.e. so-called segmental allopolyploids. In the allopolyploid case, the constituent sub-genomes of the polyploid are derived from sufficiently distant diploid ancestors that do not usually recombine with each other, a situation observed in several species in the plant kingdom such as tetraploid and hexaploid wheat [40]. Under specific circumstances, one may treat the sub-genome haplotypes as separate diploids, and an ad hoc phasing solution may still be achievable using the algorithms developed for diploids. As an example, this strategy has been successfully applied to pasta wheat (Triticum turgidum) [41], a self-fertilizing allotetraploid for which the ancestral diploid genomes are known, to determine the variation on each sub-genome by dividing the transcriptome reads into two groups using HapCut [33]. In the case of (partial) autopolyploids, however, recombination occurs between homologues belonging to different sub-genomes, and unlike in the diploid case, knowledge of one haplotype does not automatically determine the phasing of the others. Besides, some haplotypes may be (locally) identical, so that several configurations can have the same MEC score. Moreover, the computational complexity of haplotype reconstruction increases rapidly with ploidy [42, 43]. The diploid approaches are therefore in general not applicable to polyploids.
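To make the MEC criterion described above concrete, the toy sketch below scores a candidate phasing against a set of read fragments; all names and data are invented for illustration, and for a k-ploid the same score is computed against k candidate homologues instead of two:

```python
def mec_score(fragments, haplotypes, assignment):
    """Total number of base flips needed so that every fragment agrees with
    the homologue it is assigned to ('-' marks sites a fragment does not cover)."""
    flips = 0
    for frag, hap_index in zip(fragments, assignment):
        hap = haplotypes[hap_index]
        flips += sum(1 for f, h in zip(frag, hap) if f != "-" and f != h)
    return flips

# Hypothetical data: 4 fragments over 4 heterozygous sites, a candidate
# diploid phasing and an assignment of each fragment to a homologue.
fragments  = ["01--", "-10-", "--10", "10-1"]
haplotypes = ["0101", "1010"]
assignment = [0, 0, 1, 1]   # fragment index -> homologue index
print(mec_score(fragments, haplotypes, assignment))  # prints 1 (one base flip needed)
```

Haplotyping under MEC then amounts to searching over candidate phasings and fragment assignments for the configuration minimizing this score, which is what makes the problem computationally hard.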
Still, haplotype assembly for polyploids is highly relevant, as many interesting organisms have polyploid genomes, and haplotyping will help unravel the range of complex recombinations allowed by such genomes. Within the animal kingdom, triploidy and tetraploidy are observed in frog species such as the African clawed frog (Xenopus laevis) [44] and in zebrafish (Danio rerio) [45], both important model organisms in evolutionary biology. Moreover, many economically important crops and ornamentals are polyploid, including tetraploid alfalfa (Medicago sativa), triploid banana (Musa acuminata × Musa balbisiana), tetraploid leek (Allium ampeloprasum), tetraploid potato (S. tuberosum), tetraploid hard wheat (Triticum durum), hexaploid bread wheat (Triticum aestivum), tetraploid, hexaploid and octoploid strawberry species including Fragaria moupinensis (k = 4), Fragaria moschata (k = 6) and Fragaria × ananassa (k = 8), and several hybrid cotton (Gossypium, tetraploid or hexaploid) and rose (Rosa, tetraploid) species. Here, we review three state-of-the-art haplotyping algorithms applicable to polyploids, HapCompass [42, 46], HapTree [47] and SDhaP [43], and evaluate their accuracy through extensive simulations of random genomes and NGS reads. Using the highly heterozygous tetraploid potato (S. tuberosum) as a model, we generated random genomes using a realistic stochastic model whose parameters, the SNP density and the distribution of SNP dosages, i.e. the number of alternative alleles at each SNP site, were derived from a recent genomic study of potato [48]. In addition, we simulated genomes at higher levels of ploidy with the same SNP density, as well as tetraploid genomes with different SNP densities and haplotype dosages, to investigate the effects of genome characteristics on the estimation. Moreover, we considered various sequencing depths, paired-end insert-sizes and sequencing technologies to quantify the impact of these parameters on the haplotyping.
We provide guidelines for applying the haplotyping methods in practice, and show the characteristics of each method in various situations. The pipeline used is available as a software package called haplosim, which allows simulation under various sequencing approaches, genomic characteristics and variation models.

Material and methods
Although several studies have used experimental data, e.g. the human haplotype panels and sequence reads, to evaluate the efficiency of diploid haplotyping algorithms [49], experimentally obtained haplotypes are often not available for polyploids at a scale enabling insightful statistical comparison. Therefore, the evaluation of polyploid haplotyping algorithms has been based on artificial data sets [42, 43, 46, 47]. Here, we also rely on simulation to evaluate the performance of these methods. Compared with previous studies, our approach has the advantage of simulating all parts of a practical haplotyping pipeline, encompassing the careful simulation of genomes and sequence reads based on real data and the application of standard software for read alignment and genotype calling; in contrast, previous studies relied on the direct simulation of a SNP-fragment matrix using simplifying assumptions. An additional advantage of our simulation approach is that it allows us to investigate the effects of SNP density, similarity between homologues, ploidy level, sequencing depth, sequencing technology and DNA library size on the quality of haplotype estimates, which is usually not feasible with real data. We developed a multistage pipeline, haplosim, to simulate polyploid individuals with various genomic characteristics that are sequenced in silico. After individual SNP detection and dosage estimation, the haplotypes are estimated in subsequent steps, separately for each individual, by the available algorithms: HapCompass [42, 46], HapTree [47] and SDhaP [43].
In the last step of the pipeline, the estimated haplotypes are compared with the originally simulated haplotypes using quantitative measures (Figure 1).

Figure 1. The haplosim pipeline to generate, estimate and evaluate haplotypes. Random genomes and haplotypes are produced by haplogenerator, from which NGS reads are simulated and mapped back to the reference. SNPs are called in the next step and the haplotypes estimated by HapTree, HapCompass and SDhaP. The estimates are compared with the original haplotypes by hapcompare.

Figure 2. Schematic diagram of the simulation for each considered scenario. In total, 10 references of length 20 kb are chosen from the draft sequence of chromosome 5 [11], from each of which 10 polyploid genomes are simulated containing bi-allelic SNPs randomly distributed according to the lognormal distance model, to obtain data sets of size 250.

For the first step (Figure 1A), polyploid genomes are produced from a reference DNA sequence by introducing heterozygous regions containing bi-allelic SNPs, using the command-line tool haplogenerator that we developed for this purpose.
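The genome-simulation step can be sketched as follows: place bi-allelic SNPs with lognormally distributed inter-SNP distances, then draw a random dosage per site. The function name and all parameter values below are illustrative placeholders, not haplogenerator's actual interface or defaults (which were derived from the potato data [48]):

```python
import random

def simulate_haplotypes(ref_len, ploidy, mean_log_dist=5.0, sd_log_dist=1.0, seed=1):
    """Place bi-allelic SNPs along a reference with lognormally distributed
    inter-SNP distances, then draw a random dosage (1..ploidy-1 alternative
    alleles) per SNP and scatter the '1' alleles over the homologues."""
    rng = random.Random(seed)
    positions, pos = [], 0
    while True:
        pos += max(1, int(rng.lognormvariate(mean_log_dist, sd_log_dist)))
        if pos >= ref_len:
            break
        positions.append(pos)
    # One allele string per homologue: '0' = reference, '1' = alternative.
    haplotypes = [["0"] * len(positions) for _ in range(ploidy)]
    for site in range(len(positions)):
        dosage = rng.randint(1, ploidy - 1)      # heterozygous by construction
        for hom in rng.sample(range(ploidy), dosage):
            haplotypes[hom][site] = "1"
    return positions, ["".join(h) for h in haplotypes]

positions, haps = simulate_haplotypes(ref_len=20000, ploidy=4)
```

Simulated reads are then drawn from these homologues, so that the true phasing is known when scoring the estimates.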
Next, NGS reads are simulated for each produced individual using ART [50] and PBSIM [51] for Illumina and PacBio, respectively, and the reads are mapped back to their reference genome using bwa-mem [52] (with the settings recommended in its manual for Illumina and PacBio reads). The alignments are preprocessed to generate BAM files and remove duplicates with SAMtools [53] and Picard tools [54], after which SNPs are called using FreeBayes [17] (Figure 1B). The processed alignments, the reference and the VCF files are used in the haplotyping step by HapCompass [42, 46], HapTree [47] and SDhaP [43] (Figure 1C), and the obtained haplotype estimates are compared with the original haplotypes, using several measures of estimation quality, by the command-line tool hapcompare that we developed (Figure 1D). These steps are explained in detail below.

Polyploid haplotyping software
Currently, three optimization-based haplotyping algorithms are available for polyploids that make use of the sequence reads of a single sample: HapCompass [42, 46], HapTree [47] and SDhaP [43] (Table 1). Among these, HapTree and SDhaP have separate software releases for diploids and polyploids, which may have major performance differences not discussed here. We explain each method assuming a genomic region containing l heterozygous SNP sites s1, s2,…,sl, a ploidy level k > 2 and an NGS data set containing m (paired-end) reads. We define a ‘fragment’ as the sequence of the determined alleles at the heterozygous sites within a (paired-end) read. For simplicity, we focus on the most prevalent type of SNPs, bi-allelic SNPs, for which the alleles can be represented by ‘0’ (the reference) and ‘1’ (the alternative).
Table 1.
Summary of the polyploid haplotype estimation algorithms using sequence reads

  Algorithm                              Principle                                                                 Release version     Separate diploid release
  HapCompass (Aguiar and Istrail 2013)   Graph based, using a weighted minimum edge reduction criterion            HapCompass v0.8.2   No
  HapTree (Berger et al. 2014)           Bayesian probability tree with a maximum relative likelihood criterion    HapTree_v0.1        Yes
  SDhaP (Das and Vikalo 2015)            Minimum error correction criterion with semi-definite approximation       SDhaP_poly          Yes

HapCompass
Aguiar and Istrail (2013) extended their graphical haplotype estimation approach for diploids [42] by constructing the polyploid compass graph, which has k nodes for each variant site si in a k-ploid, corresponding to the k alleles at that site [46]. To each SNP pair (si, sj) covered by at least one of the m fragments, the phasing with the largest likelihood is assigned by a polyploid likelihood model, conditional on the covering fragments and assuming a fixed base-calling error rate. Accordingly, k edges are added to the compass graph between the nodes at sites si and sj, representing the k homologues covering si and sj and weighted by their likelihoods. A global minimum weighted edge removal (MWER) algorithm is applied to detect and eliminate edges with conflicting phasing information from the compass graph, with the aid of auxiliary chain graphs. Each chain graph takes a set of variants that form a cycle in the compass graph and detects phasing conflicts within the cycle, if present. The MWER algorithm then tries to resolve the conflicts by eliminating a number of edges with minimum total weight, corresponding to the least likely phasings. Finally, the most likely haplotypes over the full set of SNPs are obtained from the conflict-free compass graph by finding k disjoint maximum spanning trees through an efficient greedy algorithm, corresponding to the k most likely homologues covering the l SNP sites.

HapTree
The HapTree algorithm, developed by Berger et al.
(2014) [47], builds a tree representing a subset of likely phasing solutions for the l heterozygous sites and tries to find the most probable path from s1 to sl in the tree as the best phasing, by calculating the probability of each phasing conditional on the m fragments, assuming a fixed base-calling error rate. However, as an exhaustive search over the full tree of all solutions would be computationally prohibitive, the tree is built and extended site by site, with branching and pruning to greedily eliminate the (relatively) low probability paths from the final tree. In doing so, HapTree calculates the relative probabilities of the haplotypes at each extension using the relative probabilities of the haplotypes at the previous extension that survived the branching and pruning, taking the error model into account.

SDhaP
The third algorithm, polyploid SDhaP, designed by Das and Vikalo (2015), is a semi-definite programming approach that aims to find an approximate MEC solution by a greedy search of the space of all possible phasings from s1 to sl [43]. The algorithm starts with random initial haplotypes and tries to find the MEC solution by making changes to these initial haplotypes according to a gradient descent method. To this end, the MEC problem is reformulated as a semi-definite optimization task, and preliminary solutions are obtained in polynomial time by exploiting the sparseness of overlaps between fragments for efficient implementation. These preliminary estimates are subsequently refined by greedy flipping of the alleles in the estimated homologues to further reduce the MEC, if possible. Through this flipping, SDhaP allows changes to the dosages of the alternative alleles estimated during variant calling, which can sometimes lower the error correction score. Therefore, the dosages of corresponding SNPs in the SDhaP estimates and the original haplotypes can differ, in contrast to the estimates produced by HapTree and HapCompass.
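Of the three strategies, HapTree's branch-and-prune idea lends itself to a compact illustration. The following is a toy diploid sketch, not HapTree's actual implementation (which handles arbitrary ploidy and a more general relative-likelihood model): partial phasings are extended one SNP at a time, scored by the likelihood of the observed fragments under a fixed base-calling error rate, and all but the highest-scoring candidates are pruned:

```python
import math
from typing import Dict, List, Tuple

def beam_phase(fragments: List[Dict[int, str]], n_sites: int,
               err: float = 0.01, beam: int = 8) -> Tuple[str, str]:
    """Toy diploid illustration of HapTree-style branch-and-prune phasing.
    Each fragment maps a site index to an observed allele ('0'/'1').
    The first homologue is grown site by site; the second is its complement
    (valid for heterozygous bi-allelic sites in a diploid)."""
    def comp(h: Tuple[str, ...]) -> Tuple[str, ...]:
        return tuple('1' if a == '0' else '0' for a in h)

    def loglik(h1: Tuple[str, ...]) -> float:
        # log P(fragments | phasing), averaging over the (unknown)
        # homologue of origin of each fragment
        total = 0.0
        for frag in fragments:
            covered = [(s, a) for s, a in frag.items() if s < len(h1)]
            if not covered:
                continue
            probs = []
            for h in (h1, comp(h1)):
                p = 1.0
                for s, a in covered:
                    p *= (1.0 - err) if h[s] == a else err
                probs.append(p)
            total += math.log(0.5 * sum(probs))
        return total

    candidates: List[Tuple[str, ...]] = [('0',)]  # fix s1 to break the swap symmetry
    for _ in range(1, n_sites):
        extended = [c + (b,) for c in candidates for b in '01']
        extended.sort(key=loglik, reverse=True)
        candidates = extended[:beam]  # prune low-probability partial phasings
    best = max(candidates, key=loglik)
    return ''.join(best), ''.join(comp(best))
```

With six noiseless fragments covering consecutive SNP pairs of the pair "0101"/"1010", the beam recovers the true phasing; the pruning step is what keeps the search tractable where full enumeration would grow as 2^l.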
Simulation of polyploid genomes and NGS reads

Haplotype generation
We developed the command-line tool haplogenerator for generating artificial genomes and their haplotypes with desired characteristics. Given an indexed FASTA file as input, haplogenerator applies random insertions, deletions and mutations to the input sequence according to a chosen stochastic model, producing a modified FASTA file for each of the k genomes of a k-ploid individual. A separate haplotype file is also made containing the phasing of the generated variants. In the haplotype file, the reference and alternative alleles are numerically coded, assigning 0 to the nucleotide present in the input reference and the following integers to the alternative alleles. The coordinate of each variant on its contig is also specified within the haplotype file. The random indel and mutation sites are scattered across the input genome according to a selected built-in stochastic model for the distance between consecutive variations, or alternatively by sampling with replacement from a given empirical distribution of this distance. At each site si, possible alternative alleles are generated: for indels, an inserted or deleted nucleotide; for mutations, nucleotides other than the reference (or just one nucleotide to obtain bi-allelic SNPs). Next, a dosage di is assigned to the alternative allele at si, according to user-specified probabilities for di = 1 to di = k, and di of the k homologues are selected randomly to have their allele at si changed to the alternative. To account for linkage disequilibrium, we imposed an additional step after this dosage assignment: the alternative alleles at each si+1, i = 1, 2, …, n−1, are reassigned to the homologues containing the alternative alleles at si (as far as the numbers of alternative alleles, i.e. the dosages, at si and si+1 allow), with the reassignment decision made independently for each site from s2 to sn with an arbitrary probability set to 0.4.
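A minimal sketch of this variant-placement logic (lognormal inter-SNP distances, random dosage assignment and the 0.4-probability LD reassignment) is given below. The function and its bookkeeping are simplified illustrations of the scheme just described, not the real haplogenerator:

```python
import random

def simulate_haplotypes(ref_len: int, ploidy: int,
                        dosage_probs: dict,            # e.g. {1: .50, 2: .23, 3: .14, 4: .13}
                        mu: float = 3.0349, sigma: float = 1.293,
                        ld_prob: float = 0.4, seed: int = 1):
    """Sketch of haplogenerator's variant placement: inter-SNP distances are
    lognormal, each SNP gets a dosage d_i of the alternative allele, and with
    probability ld_prob the alternative alleles at s_{i+1} are reassigned to
    homologues already carrying the alternative at s_i (linkage disequilibrium).
    Returns SNP positions and a ploidy x n_snps matrix of 0/1 alleles."""
    rng = random.Random(seed)
    # SNP positions from lognormal inter-SNP distances
    positions, pos = [], 0
    while True:
        pos += max(1, round(rng.lognormvariate(mu, sigma)))
        if pos >= ref_len:
            break
        positions.append(pos)
    dosage_vals = sorted(dosage_probs)
    haplos = [[0] * len(positions) for _ in range(ploidy)]
    prev_carriers: list = []
    for i in range(len(positions)):
        d = rng.choices(dosage_vals, [dosage_probs[v] for v in dosage_vals])[0]
        carriers = rng.sample(range(ploidy), d)
        if i > 0 and rng.random() < ld_prob:
            # keep the alternative allele on homologues that carried it at
            # the previous SNP, as far as the two dosages allow
            shared = prev_carriers[:d]
            rest = [h for h in range(ploidy) if h not in shared]
            carriers = shared + rng.sample(rest, d - len(shared))
        for h in carriers:
            haplos[h][i] = 1
        prev_carriers = carriers
    return positions, haplos
```

The default mu and sigma reproduce the lognormal log-distance parameters used in the study (3.0349 and 1.293); the returned matrix can be inspected column by column to verify the dosage distribution.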
Simulation of NGS reads
We used the technology-specific simulator ART [50] to generate paired-end reads from Illumina MiSeq and HiSeq 2500 technologies [38, 55], and PBSIM [51] to simulate circular consensus sequencing (CCS) and continuous long reads (CLR) from Pacific Biosciences [55, 56]. The average length of single Illumina reads was set to the maximum allowed by ART (125 and 250 bp for HiSeq 2500 and MiSeq, respectively). For PBSIM, the average read length was set to 1 and 5 kb with CCS, and to 10 kb with CLR. With these averages, both simulators generated reads with random lengths following built-in distributions derived from empirical data for each technology. Each homologue was 'sequenced' separately, and the reads were combined to simulate the output of a real sequencing run. Average sequencing depths were specified per homologue to obtain a desired average total depth equal to k times the per-homologue depth, with k being the ploidy. Both ART and PBSIM consider a discrete uniform distribution with the user-set mean for the depth at each position, and hence the SD of the total depth depended on the average depth per homologue, c, and was equal to √(kc(c+2)/12) for a k-ploid. The choice of the sequencing strategies in our study was based on the efficiency and performance of the available techniques [55], as well as their practical convenience. In particular, we did not consider single-end Illumina reads, as a preliminary assessment showed that they produce low-quality estimates with a large number of gaps in the solution.

Simulation of polyploid data sets
To simulate realistic polyploid genomes, we chose tetraploid potato (S. tuberosum) as the model organism, because of the availability of a reference genome [11] as well as NGS data containing genomic variation of 83 diverse cultivars [48]. The sequence of chromosome 5 from the PGSC v4.03 DM draft genome [11] was used as the template sequence for haplogenerator.
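The total-depth SD quoted in the read-simulation paragraph can be reproduced analytically if one assumes the intended expression is √(kc(c+2)/12) and that each homologue's depth is discrete uniform over the c + 1 integer values centered on the mean c; this reading of the simulators' behaviour is our assumption, not documented behaviour of ART or PBSIM:

```python
import math

def total_depth_sd(k: int, c: int) -> float:
    """SD of total depth for a k-ploid with mean depth c per homologue,
    assuming each homologue's depth is discrete uniform over the c + 1
    integers c/2 ... 3c/2 (an assumption about the simulators)."""
    n = c + 1                       # number of equally likely depth values
    var_one = (n * n - 1) / 12.0    # variance of a discrete uniform with n values
    return math.sqrt(k * var_one)   # homologues are sequenced independently

def quoted_sd(k: int, c: int) -> float:
    """The closed form quoted in the text."""
    return math.sqrt(k * c * (c + 2) / 12.0)
```

Since (c + 1)^2 − 1 = c(c + 2), the two expressions agree exactly; e.g. for a tetraploid at 10× per homologue the total-depth SD is √40 ≈ 6.3×.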
We selected random contiguous regions of 20 kb length from this template to be used as references for simulating genomes. In selecting the references, we rejected 20 kb regions of the template that contained >20% undetermined nucleotides, and omitted these undetermined sites (denoted by 'N' in the PGSC v4.03 DM sequence) before introducing mutations in the accepted regions. As the length of many genes falls below 20 kb, choosing references of this size allows us to evaluate haplotype estimation for amplicon sequences covering a complete gene. Random bi-allelic SNPs were introduced in each reference to produce synthetic tetraploid genomes according to the built-in lognormal model of haplogenerator, with the mean and the SD of the log-distance between the SNPs set to 3.0349 and 1.293, respectively, corresponding to an expected SNP frequency of 1 per 21 bp with a SD of 27 bp. The distribution of the dosages, di, was likewise set equal to that from [48], with percentages of 50, 23, 14 and 13% for simplex (di = 1), duplex (di = 2), triplex (di = 3) and quadruplex (di = 4) SNPs, respectively. To investigate the effect of library preparation, we considered various insert-sizes for paired-end Illumina reads, namely, end-to-end insert-sizes of 235, 300, 400, 500, 600 and 800 bp with HiSeq 2500, and 400, 450, 500, 600 and 800 bp with MiSeq. To evaluate the effect of sequencing depth on haplotyping, 2×, 5×, 8×, 10×, 20×, 22×, 25×, 28×, 30× and 35× average coverages were considered per homologue for each of these insert-sizes. To investigate the effects of genome characteristics, i.e. the ploidy level, the dosage of different homologues and the SNP density, on the quality of haplotype estimation, additional genomes were generated in a similar way by haplogenerator. Considering the same proportion of simplex and duplex SNPs, i.e.
SNPs with dosages equal to 1 and 2, respectively, as in [48], and equal proportions for the dosages >2, we simulated genomes with 3n, 4n, 6n, 8n, 10n and 12n ploidy levels to investigate the effect of the ploidy, and simulated modified tetraploid genomes that contained only two distinct homologues with simplex and triplex dosages to investigate the effect of similarity between the homologues on haplotype estimation. Although these scenarios assume a SNP-density model valid for S. tuberosum, they can still show the pattern by which ploidy level and similarity between the homologues influence the quality of haplotyping. Finally, tetraploid genomes with SNP densities lower than that of the highly heterozygous S. tuberosum [48] were simulated to observe the effect of SNP density, with average frequencies of 1 per 22 bp to 1 per 110 bp. In total, 250 individuals were simulated for each of the above-mentioned scenarios by choosing 25 random references from the template and generating 10 genomes with randomly distributed bi-allelic SNPs for each selected reference (Figure 2).

Figure 3. Quantile plots of the distances between successive SNPs obtained from simulation by haplogenerator using the lognormal distance model (horizontal axis) versus those obtained from the data of 83 potato cultivars of Uitdewilligen et al. (2013) [48]. The two distributions match well, though a heavier tail is observed for the data of Uitdewilligen et al. (2013), accounting for <2% of the SNPs.
Evaluation of the estimated haplotypes
As several types of error occurring in different steps of the haplotyping pipeline could cause differences between the actual haplotypes and their estimates, we needed several measures of consistency to capture all of them, as summarized in Table 2. These errors include the absence or wrong dosage of original SNPs in the estimates, the presence of spurious SNPs, discontinuity of the estimated haplotypes, i.e. presence of gaps between estimated haplotype blocks, and finally wrong extension of homologues leading to incorrect phasing. We also included an extra measure, the failure rate (FR) of each algorithm, regardless of the quality of haplotype estimation, as it could happen that the haplotyping tools failed to produce any estimate for some of the individuals.

Table 2. Summary of the measures used to assess the quality of haplotyping

  Measure   Description                                                            Limitation
  PAR       The average accuracy of phasing for any possible SNP pair              SNPs present in both original haplotypes and estimates, no account of gaps
  VER*      Number of wrong extensions of homologues with (parts of) other homologues   SNPs present in both original haplotypes and estimates with same dosages, no account of gaps
  SMR       Proportion of original SNPs missing in the estimates                   No account of phasing, false-positive SNPs in estimates and dosages
  IDR       Proportion of SNPs with an incorrect dosage in estimated haplotypes    SNPs present in both original haplotypes and estimates, no account of phasing
  PPV       Proportion of true SNPs in the estimated SNPs                          No account of phasing, missing SNPs and wrong dosages
  NGPS*     Number of gaps (interruptions) introduced in the estimated haplotype   No account of phasing, missing, wrong dosages and false positives
  FR        Rate of failure for an algorithm                                       No account of the quality of the estimates

Note. *: measure normalized by the size of the original haplotypes, i.e. the number of original SNPs.

Some of the SNPs originally simulated were absent in the output set of each simulation, either because of mapping and variant calling errors or because of their not being phased by the algorithms, and therefore could not be included in the comparisons of true and estimated phasings. Instead, their proportion was calculated as the SNP missing rate (SMR) and considered the first measure of estimation quality. Similarly, spurious SNPs in the estimates were not included in the comparisons, and the positive predictive value (PPV), or the precision of the estimated SNPs, was calculated as the proportion of genuine SNPs among the estimated SNPs as the next measure of estimation accuracy. The incorrect dosage rate (IDR) was also calculated as the proportion of SNPs that had an incorrect dosage in the estimated haplotypes, among the set of SNPs common between the original haplotypes and the estimates.
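The three SNP-level measures just defined (SMR, PPV and IDR) reduce to simple set operations on position-to-dosage maps. The helper below is an illustrative sketch of these definitions, not the hapcompare implementation:

```python
def snp_metrics(true_snps: dict, est_snps: dict):
    """SMR, PPV and IDR from position -> dosage maps of the true and
    estimated SNPs (illustrative helper, not the hapcompare code)."""
    common = set(true_snps) & set(est_snps)
    smr = 1.0 - len(common) / len(true_snps)           # true SNPs absent from the estimate
    ppv = len(common) / len(est_snps)                  # precision of the called SNPs
    idr = (sum(true_snps[p] != est_snps[p] for p in common) / len(common)
           if common else float('nan'))                # wrong dosage among shared SNPs
    return smr, ppv, idr
```

For example, with three true SNPs of which two are recovered (one with the wrong dosage) and one spurious call, SMR = 1/3, PPV = 2/3 and IDR = 1/2.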
Having excluded the missing and spurious SNPs, the pairwise phasing accuracy rate (PAR) [57] was computed as the proportion of all heterozygous SNP pairs for which the estimated phasing was correct. This measure captures the errors caused by chimeric elongation of the homologues during haplotype estimation, i.e. the elongation of a homologue by (part of) another homologue, as well as errors caused by incorrect dosage estimation. One way to calculate the accuracy of phasing for more than just two SNPs is to consider the phasing accuracy rate for groups of three SNPs, four SNPs, etc. However, the phasing accuracies will no longer be independent for groups of SNPs that have more than one SNP in common, leading to biased estimates of the accuracy rates. Instead, we calculated the vector error rate (VER), also called the switch error rate, defined as the number of times a homologue is erroneously extended by part of another homologue [47]. Such erroneous extensions are also called switches between homologues, and for diploids the measure is equal to two times the number of wrong phasings for pairs of consecutive SNPs. For polyploids, the measure is calculated by finding the minimum number of crossing-overs needed to reconstruct the true haplotypes from the estimates [47]. To be able to compare VER across different ploidy levels, genome lengths and SNP densities, we normalized it by the number of originally simulated SNPs as well as by the ploidy level for each individual. The SNPs with a wrongly estimated dosage of the alternative allele were omitted before applying this measure, as otherwise the true haplotypes could not be reconstructed by simple switching of the estimated homologues without also considering allele flips from 0 to 1 or vice versa.
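For PAR, correctness of a pair's phasing in a k-ploid can be checked by comparing, between truth and estimate, the multiset of two-SNP haplotypes carried by the k homologues. The sketch below illustrates this definition (again, it is not the hapcompare code):

```python
from collections import Counter
from itertools import combinations

def pairwise_accuracy_rate(truth, estimate):
    """PAR for a k-ploid: the fraction of SNP pairs whose phasing -- the
    multiset of two-SNP haplotypes across the k homologues -- matches
    between the true and estimated haplotypes.
    truth/estimate: k allele strings of equal length (one per homologue)."""
    n = len(truth[0])
    pairs = list(combinations(range(n), 2))
    correct = 0
    for i, j in pairs:
        true_pairs = Counter((h[i], h[j]) for h in truth)      # multiset of phasings
        est_pairs = Counter((h[i], h[j]) for h in estimate)
        correct += true_pairs == est_pairs
    return correct / len(pairs)
```

For a diploid with true haplotypes "0101"/"1010" and an estimate with a single switch after the second SNP ("0110"/"1001"), only the pairs (1,2) and (3,4) remain correctly phased, giving PAR = 2/6 = 1/3.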
The last measure of estimation quality that we used was the number of gaps per SNP (NGPS) in the estimates, as the simulated continuous haplotypes can be broken into several disconnected blocks, causing gaps in the estimated haplotypes. This happens when the connection between SNPs is lost because of low sequencing coverage or sequencing/variant calling errors at certain sites. We therefore calculated the number of break points, i.e. gaps, in the estimates (equal to the number of disjoint blocks minus one), and normalized it by the total number of simulated SNPs for each individual, for the same reasons as for VER. When gaps were present in the estimates, we calculated the other measures separately for each estimated block and reported the weighted average of the block-specific measures, weighted by the number of compared SNPs in each estimated block, or by the number of possible pairwise phasings in the case of PAR. Finally, the computational complexity of each of the haplotyping algorithms was considered as a function of sequencing coverage, insert-size, ploidy, SNP density and homologue dosages. The haplotyping methods applied are memory-intensive, consuming increasing amounts of system resources over time, sometimes up to tens of gigabytes of virtual memory. To run the methods on a system with shared resources, a time limit of 900 s (2000 s for the analysis with various levels of ploidy) was imposed on each haplotyping algorithm, after which the algorithm was externally halted and the estimation considered a failure. This amount of time was deemed reasonable considering the 20 kb length of the simulated genomic regions, and the number of time-out events was added to the number of times each algorithm failed to estimate any haplotypes because of internal errors. Total FRs are thus reported for each haplotyping scenario.
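The gap handling described above, block-specific measures combined by a weighted average together with NGPS, can be sketched as follows. For PAR the natural weight of a block of n_b SNPs is its number of pairwise phasings C(n_b, 2); the helper itself is a hypothetical illustration:

```python
from math import comb

def gap_adjusted_par(block_pars, block_sizes, n_true_snps):
    """Combine per-block PARs into a single figure, weighting each block by
    its number of pairwise phasings C(n_b, 2), and report NGPS as
    (number of blocks - 1) / number of originally simulated SNPs.
    Blocks with fewer than two SNPs contribute no pairwise phasings."""
    weights = [comb(n, 2) for n in block_sizes]
    par = sum(w * p for w, p in zip(weights, block_pars)) / sum(weights)
    ngps = (len(block_sizes) - 1) / n_true_snps
    return par, ngps
```

For instance, a fully correct block of 3 SNPs (weight 3) and a half-correct block of 2 SNPs (weight 1) out of 10 simulated SNPs give a weighted PAR of 0.875 and an NGPS of 0.1.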
Comparison of haplotyping algorithms
To compare the overall performance of the three haplotype estimation methods, we built linear regression models with the mentioned quality measures as response and the haplotyping method as predictor, considering sequencing depth, sequencing technology and the (paired-end) library size as covariates. As each of the simulated genomes was haplotyped by all three estimation methods, the effect of the genome on the estimation quality was incorporated as a random effect in the model. Similarly, as 10 genomes were generated from each of the 25 randomly selected references, the effect of the common reference was added as a second random component to the model. For each quality measure, a complete-case analysis was performed, including only the results of those simulations for which all three estimation methods reported some value. The models were estimated by restricted maximum likelihood [58] using the lmer function from the package lme4 [59] in R 3.2.2 [60].

Results and discussion

Haplogenerator produces realistic genomes
To investigate the compatibility of the simulated 20 kb S. tuberosum genomic regions with the real regions sequenced by Uitdewilligen et al. (2013) [48] in terms of the density of bi-allelic SNPs, we obtained quantile plots (QQ-plots) of the distances between consecutive SNPs si, si+1 generated by the applied lognormal model versus the distances between consecutive bi-allelic SNPs (within the same RNA-capture region) in the combined data of 83 diverse cultivars from [48]. As shown in Figure 3, the two empirical distributions match well, although the distribution of real SNPs has a heavier tail than lognormal (accounting for <2% of the total number of real bi-allelic SNPs).
This heavier tail is plausibly explained by the presence of highly conserved regions in real genomes, subject to natural and artificial selection pressure, as well as by the use of genome-wide RNA baits to reduce the complexity of the genome in [48], which can result in the exclusion of some SNPs in target regions with poor capture success. The proportions of simplex to quadruplex SNPs, i.e. the dosage proportions, were also almost identical to those obtained from Uitdewilligen et al. (2013) [48] (Section Simulation of polyploid data sets).

Sequencing depth is the major determinant of haplotyping quality
An important goal of the simulations was to observe to what extent sequencing strategies influence haplotyping results, given their practical importance in setting up experiments and choosing a technology. As different technologies rely on different library preparation and nucleotide calling methods, their output often differs in average read length and sequencing error profile. Besides, the sequencing depth, average read length and paired-end insert-size can vary according to the user's requirements within the same technology. We found that the performance of all three haplotyping methods was considerably affected by the sequencing strategy, most notably by the sequencing depth. Regardless of the sequencing technology and insert-size used, a sequencing depth between 5× and 20× per homologue is required to obtain satisfactory results in terms of haplotype accuracy (PAR) and completeness (SMR) (Figure 4A and B). Both improve continuously with sequencing depth, but flatten out at 15×. A notable exception is HapTree, whose increased FR at higher depths is reflected in worse completeness (increasing SMR). Other quality measures (VER, PPV, IDR and NGPS) were not substantially influenced by sequencing depth at depths >5× per homologue. Figure 4.
Plots of haplotype estimation quality measures: (A) PAR, (B) SMR and (C) FR as a function of sequencing depth per homologue using HapCompass (black), HapTree (gray) and SDhaP (light gray), for simulated 20 kb tetraploid S. tuberosum genomes. Sequencing was performed in silico for paired-end MiSeq (triangle) and HiSeq 2500 (diamond) with 800 bp insert-size, as well as for PacBio-CCS reads of 1 kb length (circle).

HapTree's FR (Figure 4C) was rather high at both low and high sequencing depths. At lower depths, <2× per homologue, there is not enough information available for effective branching and pruning of the solution tree, and time-out errors result in failures. In contrast, at high sequencing depths, the relative likelihood values often become small because of the presence of many terms in the likelihoods of partial haplotypes, making a meaningful comparison impossible. This problem is discussed further below. As sequencing depth is an important factor in determining the total cost of sequencing, these results show that extra cost can be avoided by choosing a moderate sequencing depth without sacrificing considerable haplotyping accuracy.

Large-insert paired-end reads are competitive with long reads
In addition to sequencing depth, the insert-size of paired-end reads and the sequencing technology used can have an impact on the estimation quality. These are also important factors to specify when designing a sequencing experiment, as they influence cost, throughput and quality.
To quantify their effects, we simulated NGS reads for HiSeq 2500, MiSeq and CCS technologies at each sequencing depth, and simulated paired-end reads with various insert-sizes for HiSeq 2500 and MiSeq (Section Simulation of NGS reads). Our results show that at the same sequencing depth, increasing the insert-size of paired-end reads did not markedly influence the overall quality of haplotyping (Supplementary Figure S1), except for the number of gaps, which, as expected, was reduced with larger inserts. Moreover, similar estimation qualities were obtained using the 1 kb CCS reads and paired-end Illumina reads with a large insert-size (800 bp) (Figure 4). At the same depth, the paired-end reads contain essentially the same phasing information as the long reads. Although libraries with large inserts are costly and difficult to obtain, they may still be easier to generate than long continuous reads and can therefore be a competitive option for designing haplotyping experiments.

HapTree is the most accurate method, but often fails
Different haplotyping algorithms yield different estimates for the same individual. With simulated individuals, it is possible to compare the quality of these estimates, as the haplotypes are known a priori. To this end, we used linear regression models relating the performance measures to the algorithms used (Section Comparison of haplotyping algorithms). Because of the substantial difference between the estimation results using CLR compared with the other sequencing methods (see below), those results were excluded from the regression analysis. Table 3 shows the 99% confidence intervals for the effects of estimation method and sequencing technology on the haplotyping accuracy for the tetraploid genomes, with HapCompass on CCS data taken as the baseline. HapTree is significantly more accurate (higher PAR and lower VER) than the other methods, but less complete (higher SMR) because of its frequent failure (Figure 4C).
SDhaP yields slightly, but significantly, worse dosage estimates (higher IDR). There was no significant relation between the method used and the continuity of the haplotype estimates (NGPS) or the precision of the SNPs (PPV).

Table 3. Point estimates and 99% confidence intervals for the effects of haplotyping and sequencing methods on the haplotyping quality measures

  Parameter    PAR                      VER                       SMR                     IDR                     PPV                    NGPS
  Intercept*   0.33 (0.308; 0.356)      0.25 (0.206; 0.285)       0.06 (0.022; 0.090)     0.20 (0.178; 0.222)     0.97 (0.970; 0.971)    0.01 (0.007; 0.013)
  HapTree      0.23* (0.226; 0.230)     −0.1* (−0.1; −0.09)       0.27* (0.263; 0.270)    0.00 (0.003; 0.005)     0.00 (0.000; 0.000)    0.00 (−0.002; −0.001)
  SDhaP        0.08* (0.079; 0.082)     −0.01* (−0.014; −0.005)   0.00 (−0.003; 0.004)    0.01* (0.012; 0.013)    0.00 (0.000; 0.000)    0.00 (0.000; 0.000)
  HiSeq 2500   −0.005 (−0.026; 0.016)   0.06* (0.024; 0.1)        0.04* (0.011; 0.072)    0.01 (−0.011; 0.028)    0.03* (0.025; 0.026)   0.01* (0.008; 0.013)
  MiSeq        −0.002 (−0.023; 0.019)   −0.02 (−0.051; 0.02)      0.02 (−0.008; 0.052)    0.01 (−0.013; 0.026)    0.02* (0.023; 0.025)   0.00 (0.001; 0.006)

* Statistically significant at α = 0.01.
Point estimates and 99% Wald-type confidence intervals for the effects of haplotyping methods: HapTree, SDhaP and HapCompass (the reference) and the sequencing technologies: MiSeq, HiSeq2500 and CCS (the reference), on five measures of haplotyping quality: pairwise-phasing accuracy rate (PAR), vector error rate (VER), SNP missing rate (SMR), incorrect dosage rate (IDR), positive predictive value of the called SNPs (PPV) and the number of gaps in estimates per SNP (NGPS). Table 3. Point estimates and 99% confidence intervals for the effects of haplotyping and sequencing methods on the haplotyping quality measures Parameter\Quality measure PAR VER SMR IDR PPV NGPS Intercept* 0.33 (0.308; 0.356) 0.25 (0.206; 0.285) 0.06 (0.022; 0.090) 0.20 (0.178; 0.222) 0.97 (0.970; 0.971) 0.01 (0.007;0.013) HapTree 0.23* (0.226; 0.230) 0.1`* (−0.1; −0.09) 0.27* (0.263; 0.270) 0.00 (0.003; 0.005) 0.00 (0.000; 0.000) 0.00 (−0.002; −0.001) SDhaP 0.08* (0.079; 0.082) −0.01* (−0.014; −0.005) 0.00 (−0.003; 0.004) 0.01* (0.012; 0.013) 0.00 (0.000; 0.000) 0.00 (0.000;0.000) HiSeq 2500 −0.005 (−0.026; 0.016) 0.06* (0.024; 0.1) 0.04* (0.011; 0.072) 0.01 (−0.011; 0.028) 0.03* (0.025; 0.026) 0.01* (0.008;0.013) MiSeq −0.002 (−0.023; 0.019) −0.02 (−0.051; 0.02) 0.02 (−0.008; 0.052) 0.01 (−0.013; 0.026) 0.02* (0.023; 0.025) 0.00 (0.001;0.006) Parameter\Quality measure PAR VER SMR IDR PPV NGPS Intercept* 0.33 (0.308; 0.356) 0.25 (0.206; 0.285) 0.06 (0.022; 0.090) 0.20 (0.178; 0.222) 0.97 (0.970; 0.971) 0.01 (0.007;0.013) HapTree 0.23* (0.226; 0.230) 0.1`* (−0.1; −0.09) 0.27* (0.263; 0.270) 0.00 (0.003; 0.005) 0.00 (0.000; 0.000) 0.00 (−0.002; −0.001) SDhaP 0.08* (0.079; 0.082) −0.01* (−0.014; −0.005) 0.00 (−0.003; 0.004) 0.01* (0.012; 0.013) 0.00 (0.000; 0.000) 0.00 (0.000;0.000) HiSeq 2500 −0.005 (−0.026; 0.016) 0.06* (0.024; 0.1) 0.04* (0.011; 0.072) 0.01 (−0.011; 0.028) 0.03* (0.025; 0.026) 0.01* (0.008;0.013) MiSeq −0.002 (−0.023; 0.019) −0.02 (−0.051; 0.02) 0.02 (−0.008; 0.052) 0.01 (−0.013; 
0.026) 0.02* (0.023; 0.025) 0.00 (0.001;0.006) * Statistically significant at α=0.01. Point estimates and 99% Wald-type confidence intervals for the effects of haplotyping methods: HapTree, SDhaP and HapCompass (the reference) and the sequencing technologies: MiSeq, HiSeq2500 and CCS (the reference), on five measures of haplotyping quality: pairwise-phasing accuracy rate (PAR), vector error rate (VER), SNP missing rate (SMR), incorrect dosage rate (IDR), positive predictive value of the called SNPs (PPV) and the number of gaps in estimates per SNP (NGPS). Finally, the results were not significantly different using Illumina MiSeq reads and PacBio CCS, except for PPV, which was slightly higher with MiSeq. On the other hand, HiSeq 2500 reads resulted in significantly higher VER, SMR and NGPS at α = 0.01 compared with the other sequencing methods, with the most noticeable effect being on the number of gaps, which was expected considering the short single-read length of HiSeq. Overall, these results confirm that HapTree is the most accurate method when it does not fail, and that Illumina and PacBio reads offer similar performance. Similarity between homologues eases haplotyping with Illumina Similarity between homologues can have a large effect on haplotyping. This similarity often occurs when random mating is violated, e.g. in inbred or isolated populations. To investigate this, we simulated simplex–triplex individuals, i.e. tetraploid individuals consisting of two different genomes with dosages of 1 and 3. We generated paired-end MiSeq and HiSeq 2500 reads (800 bp insert-size), as well as 1 kb CCS read of PacBio, and evaluated the estimated haplotypes. On these data, the performances of HapTree (with Illumina reads) and HapCompass improve over the original simulation, whereas the performance of SDhaP deteriorates significantly (Figure 5). 
In particular, the similarity between homologues resulted in a decreased accuracy for SDhaP (Figure 5A, PAR around 0.2) and incorrect dosage estimates for more than half of the SNPs (Figure 5C, IDR of 0.55), regardless of the sequencing method.

Figure 5. Plots of haplotype estimation quality measures: (A) PAR, (B) VER, (C) IDR, (D) SMR and (E) FR as a function of sequencing depth using HapCompass (black), HapTree (gray) and SDhaP (light gray), for simulated 20 kb simplex–triplex tetraploid genomes (circle) compared with genomes with random haplotype dosages (square). Sequencing was performed in silico for paired-end HiSeq 2500 reads with 800 bp insert-size.

These results demonstrate the differences between the MEC approach to haplotyping and other approaches. MEC is sensitive to (local) similarities between homologues, as these lead to approximately identical MEC scores for several different phasings, causing SDhaP to report a suboptimal solution. In contrast, the performances of HapCompass (MWER approach) and HapTree (relative likelihood approach) improve, at least when using Illumina sequencing (Figure 5A and B). Having more similar fragments simplifies construction of the maximum spanning tree in the compass graph and makes the branching and pruning of the solution tree of HapTree more accurate by enhancing the relative likelihoods of correct partial phasings.
No improvement was observed, however, with HapTree using CCS reads, because of increasing FRs (Figure 5E) caused by time-out errors. These results show that the underlying algorithms lead to different sensitivities to homologue similarity, with MEC-based approaches yielding incorrect results and other methods demanding increasing computation time.

SNP density mainly influences continuity of haplotype estimates

In genomes with a lower SNP density than the highly heterozygous potato, S. tuberosum, fragments will overlap less often, which can influence the quality of haplotyping. To determine the effect of SNP density, we simulated tetraploid genomes with average SNP densities ranging from 1 SNP per 22 base pairs, the average density for potato, to 1 SNP per 110 base pairs, and estimated the haplotypes using Illumina paired-end reads with an insert-size of 800 bp, as well as 1 kb CCS reads of PacBio, at a sequencing depth of 15× per homologue. Increasing numbers of gaps (NGPS, Figure 6A) and a decrease in completeness (SMR, Figure 6B) were observed in the estimated haplotypes at lower densities for all three haplotyping methods. The effect of SNP density was, however, not manifest on the other haplotyping quality measures (Supplementary Figure S5).

Figure 6. Plots of haplotype estimation quality measures: (A) NGPS and (B) SMR as a function of SNP density (at logarithmic distance scale) using HapCompass (black), HapTree (gray) and SDhaP (light gray), for simulated 20 kb tetraploid S. tuberosum genomes. Sequencing was performed in silico for paired-end MiSeq (triangle) and HiSeq 2500 (diamond) with 800 bp insert-size, as well as for PacBio-CCS of 1 kb length (circle), at a depth of 15×.

At higher ploidy levels, HapCompass is the best method to use

To investigate whether our findings for tetraploid genomes hold for other ploidy levels, we performed simulations with ploidy levels of 3–12 (Section Simulation of polyploid data sets). We simulated paired-end HiSeq 2500 reads with an insert-size of 800 bp, as this gave high-quality estimates in tetraploids and was more practical than the competitive sequencing options, at 5×, 15× and 20× sequencing depths per homologue. The accuracy of HapTree and SDhaP decreased markedly with increasing ploidy level, by up to 30% for 12n (PAR, Figure 7A), whereas the performance of HapCompass remained stable. Likewise, the completeness of HapTree decreased (SMR, Figure 7B) and FRs for both HapTree and SDhaP increased (Figure 7C). Although the performance of the methods at each ploidy level was relatively better at higher sequencing depths, the deterioration of the haplotype estimation quality followed a similar pattern with increasing ploidy, regardless of the depth.

Figure 7. Plots of haplotype estimation quality measures: (A) PAR, (B) SMR and (C) FR as a function of ploidy level using HapCompass (black), HapTree (gray) and SDhaP (light gray), for simulated 20 kb 3n, 4n, 6n, 8n, 10n and 12n genomes. Sequencing was performed in silico for paired-end HiSeq 2500 with 600 bp insert-size. Three sequencing depths were used per homologue: 5× (circle), 15× (triangle) and 20× (diamond).

Overall, none of the haplotyping methods is equipped to deal with high levels of ploidy: they either break down (HapTree and SDhaP) or are inaccurate (HapCompass).

SDhaP yields best results using long reads

Among the three tested haplotyping methods, SDhaP is the only one relying on MEC to select the best estimate. Although MEC is an efficient criterion for diploid haplotyping, it may not be able to distinguish between estimates in the presence of more than two homologues, as several estimates can have the same error correction score. This situation can especially occur when the homologues have local similarities and the SNP-fragment lengths are small. HapTree and HapCompass try to surmount this problem by considering more complex criteria, which result in more accurate estimates with short Illumina reads and even with 1 kb CCS reads. Nevertheless, these criteria lose their advantage with long sequence reads, 5 kb CCS and 10 kb CLR in our study; these longer reads also increase the failure frequency of HapTree. The MEC criterion of SDhaP, in contrast, performs well when the reads are long enough to distinguish the homologues, requiring the least computation time (Figure 9A) and providing accurate results using 5 kb CCS (Figure 8A).

Figure 8. Plots of haplotype estimation quality measures: (A) PAR, (B) VER, (C) IDR, (D) PPV, (E) SMR and (F) FR as a function of sequencing depth per homologue using HapCompass (left), HapTree (middle) and SDhaP (right), for simulated PacBio reads: CCS 1 kb (black), CCS 5 kb (gray) and CLR 10 kb (light gray).
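The MEC tie described for SDhaP is easy to reproduce. The sketch below is a toy example (not SDhaP's actual implementation): candidate tetraploid phasings are scored by minimum error correction, i.e. each fragment is charged the number of mismatches against its best-matching homologue. Because all four homologues carry the same allele at the middle SNP, no fragment informatively links the flanking SNPs, and two genuinely different phasings tie at a MEC score of 0.

```python
# Toy minimum error correction (MEC) scoring for polyploid phasing.
# Haplotypes and SNP-fragments are strings over {'0', '1'};
# '-' marks a site not covered by the fragment.

def mec_score(haplotypes, fragments):
    """Sum over fragments of the mismatch count to the best-fitting homologue."""
    def mismatches(hap, frag):
        return sum(1 for h, f in zip(hap, frag) if f != '-' and f != h)
    return sum(min(mismatches(h, f) for h in haplotypes) for f in fragments)

# Fragments from a tetraploid with three SNPs; every homologue is '0' at the
# middle SNP, so no fragment can bridge SNP 1 and SNP 3 informatively.
frags = ["00-", "10-", "-01", "-00"]

phasing_a = ["001", "000", "100", "101"]   # one candidate solution
phasing_b = ["000", "000", "101", "101"]   # a genuinely different one

# Both phasings fit every fragment perfectly, so MEC cannot separate them.
print(mec_score(phasing_a, frags), mec_score(phasing_b, frags))  # 0 0
```

A purely MEC-driven solver may therefore return either phasing; criteria that weigh dosages or relative likelihoods, as HapCompass and HapTree do, can break such ties.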
Figure 9. (A) Computation time (in seconds) and (B) physical memory consumption (in megabytes) as a function of sequencing depth per homologue for the three haplotype estimation programs: HapCompass (black circle), HapTree (gray square) and SDhaP (light gray rhombus), using 10 kb continuous long reads of PacBio for tetraploid genomes of length 20 kb.

Erroneous long reads lead to low accuracy, high SMR and many false SNPs in the estimates

Recently, the generation of long reads spanning tens of thousands of nucleotides has become a reality using technologies such as Oxford Nanopore and PacBio, at the expense of the precision of base calling [38]. Such lengthy reads are potentially ideal candidates for haplotyping, as they can cover many variants and provide enough overlaps for accurate haplotype reconstruction [42]. However, our simulations using 10 kb CLR reads of PacBio, with an average accuracy of 82%, show that these reads lead to inferior estimates for polyploids compared with paired-end Illumina reads and CCS reads. In particular, many spurious SNPs will be present (Figure 8D), and many of the original SNPs will be missing in the estimates (Figure 8E).
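The impact of CLR base-calling errors on phasing information can be gauged with back-of-the-envelope arithmetic (the simulations use a more detailed error model; this assumes independent errors at a flat per-base accuracy of 0.82): the probability that a fragment reports all of its k covered SNP alleles correctly decays as 0.82^k, so fragments covering many SNPs are almost never error-free.

```python
# Probability that all k SNP alleles in a fragment are read correctly,
# assuming independent errors at a CLR-like per-base accuracy of 0.82.
acc = 0.82
for k in (1, 5, 10, 20):
    print(f"k={k:2d}  P(all alleles correct) = {acc ** k:.4f}")

# Expected raw base errors in a single 10 kb CLR read:
print("expected errors per 10 kb read:", round((1 - acc) * 10_000))
```

At the SNP density of potato (roughly 1 SNP per 22 bp), a 10 kb CLR fragment covers hundreds of SNPs, which is why very high coverage is needed before the haplotyping signal outweighs the noise.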
In addition, wrong dosages abound in the estimated haplotypes (Figure 8C). Although increasing the coverage helps improve the estimation to some extent, especially with SDhaP (Figure 8A, C and E), our results do not encourage the use of erroneous long reads for the estimation of polyploid haplotypes, as achieving the extremely high coverage needed is usually not practical.

HapTree requires most resources

Computational efficiency is an important feature of every complex algorithm, such as the haplotyping algorithms discussed in this article. We therefore measured the memory and time consumption of each algorithm for various ploidy levels, sequencing coverages and genome lengths. Using HiSeq 2500 paired-end libraries with an insert-size of 800 bp for the simulation of sequencing, we tested the effect of sequencing depth with tetraploid individuals and genomes of length 10 kb, and the effect of genome length with tetraploid individuals sequenced at an average depth of 10× per homologue. Other settings were the same as for S. tuberosum (Section Simulation of polyploid genomes and NGS reads); for each condition we generated 50 individuals from 50 randomly selected regions, i.e. one individual per region, with a time limit of 7200 s. Fixing the depth to 10× per homologue and the genome length to 10 kb, the effect of ploidy was investigated in a similar manner. The analyses were run on multicore 2.6 GHz Intel Xeon processors. For each run, the total CPU time and physical memory consumption were measured using the Unix getrusage routine. HapTree clearly consumed the most time and memory (Figure 10), increasing with genome length (Figure 10A and B) and ploidy level (Figure 10C and D); this increase was much smaller for HapCompass and SDhaP.

Figure 10. Plots of physical memory consumption (in megabytes) against length of genome (in base pairs) (A), ploidy (C) and sequencing depth (E), and plots of computation time (in seconds) against length of genome (in base pairs) (B), ploidy (D) and sequencing depth per homologue (F), for the three haplotype estimation programs: HapCompass (black circle), HapTree (gray square) and SDhaP (light gray diamond). Sequencing was based on HiSeq 2500 paired-end technology with 800 bp insert-size.

Although sequencing depth was less influential, HapTree used much more time and memory at the lowest sequencing depth, 5× per homologue, falling rapidly with an increase in depth up to 10× per homologue and remaining almost constant up to a depth of 20× per homologue (Figure 10E and F). At depths of >20× per homologue, the computation time fell rapidly again because of the premature failure of the algorithm, as discussed in the section on the influence of sequencing depth.

Conclusion

We evaluated three algorithms for single-individual haplotype estimation in polyploids: HapCompass, HapTree and SDhaP, and investigated the effects of SNP density, similarity between homologues, ploidy level, sequencing technology, sequencing depth and DNA library size on the estimation quality, using several measures of quality (Table 2), through extensive simulation experiments. This yielded insight into the performance of haplotype estimation methods in practical situations.
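Per-run resource accounting of the kind used in the efficiency comparison (total CPU time and peak physical memory via the Unix getrusage routine) can be reproduced in a few lines; the sketch below is illustrative (the wrapper and workload are our own, not the pipeline's actual code, which measured each external haplotyping process).

```python
import resource  # Unix-only: thin wrapper around getrusage(2)

def measure(fn, *args):
    """Run fn and report cumulative CPU time (s) and peak resident memory."""
    result = fn(*args)
    ru = resource.getrusage(resource.RUSAGE_SELF)
    cpu = ru.ru_utime + ru.ru_stime   # user + system CPU seconds
    peak = ru.ru_maxrss               # peak RSS: KiB on Linux, bytes on macOS
    return result, cpu, peak

# Example workload standing in for one haplotyping run:
res, cpu, peak = measure(sum, range(10 ** 6))
print(f"result={res}  cpu={cpu:.3f}s  peak_rss={peak}")
```

Note that getrusage reports cumulative figures for the calling process; benchmarking separate tools requires querying RUSAGE_CHILDREN or wrapping each tool invocation in its own process.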
For this purpose, we have developed a realistic pipeline that can be used as a basis for the benchmarking of single-individual haplotyping software in the future. Our results show that HapTree can produce the best triploid and tetraploid haplotype estimates, followed by SDhaP and HapCompass. HapCompass is the best method to use with ploidy levels over 6n, although its performance is not good in an absolute sense. We showed that sequencing depth was the most important factor determining the quality of haplotype estimation, and that paired-end short reads of Illumina with a large insert can perform as well as long CCS reads of the same total size now possible with PacBio. For accurate haplotyping, we therefore suggest an average depth of between 5× and 20× per homologue with paired-end reads and an insert-size of 600–800 bp. In addition to the estimation quality, we investigated the computation time and memory consumption of each algorithm under various settings to compare their efficiency. We showed that on average, HapTree requires the most computation time and memory, and that its use of resources is highly dependent on the length of the genomic region, the ploidy level and the sequencing depth. Combined with its frequent failure to complete the estimation, this raises difficulties for applying HapTree to practical problems where the aim is to reconstruct long-range haplotypes. Our findings show that while state-of-the-art single-individual haplotype estimation algorithms produce promising results for triploid and tetraploid organisms over a limited genomic region, their performance rapidly decreases at higher ploidy levels, and their resource use prohibits application to large genomic regions. The probability-based algorithm of HapTree produces the most accurate estimates but also requires the most computation time and memory. We believe it is worth investigating whether the HapTree approach can be made robust to larger problems while maintaining its accuracy, e.g. using a divide-and-conquer approach or by adjusting the branching and pruning parameters according to the length of the genome, the ploidy level and the sequencing coverage. The variant calling error model could also be upgraded to be specific to the applied sequencing strategy and technology. Finally, the performance of haplotyping methods on individual organisms could be greatly improved if they could also incorporate parental and sib information when available, e.g. in mapping populations relevant to plant and animal breeding studies. Although the evaluated algorithms ignore this information, it can be extremely helpful to increase the precision of genotype calling when the average sequencing depth is low, or to favor/disfavor some of the haplotypes a priori based on their expected frequency in the population. Such enhancements will prove essential to help understand the complex genetics found in many polyploid organisms and, in the long run, to better understand the rules governing genome organization.

Software

The simulation pipeline and its components can be downloaded from the Software section at http://bif.wur.nl/

Key Points

- Haplotyping accuracy for polyploids depends mostly on sequencing depth, with 5–20× giving optimal results.
- Paired-end Illumina reads with large insert-sizes are competitive with PacBio CCS reads at reduced cost where quality of haplotype estimation is concerned.
- For ploidy levels up to six, HapTree is the most accurate method and can produce high-quality estimates; above that, HapCompass is the best method to use, although its estimation quality is not high in an absolute sense.
- With long reads generated by recent sequencing technologies, SDhaP proves to be the most accurate and efficient.
- Despite its high accuracy, HapTree may break down for long genomic regions, high ploidy levels and long reads of recent technologies, sometimes failing to converge within a reasonable amount of time.
- Improvements are needed to allow accurate haplotyping of large genomic regions in polyploids; these may be found in adapting parameters to the problem at hand and in using sibship and parental information when available.

Ehsan Motazedi is a PhD student at Wageningen UR Plant Breeding and the Bioinformatics Group, both at Wageningen University and Research. His project focuses on computational methods studying the genetics of polyploid crops.

Richard Finkers is a staff scientist at Wageningen Research, Plant Breeding. His research interests include using genomics technologies to enable precision breeding in crops with complex genomes.

Chris Maliepaard is an associate professor at Wageningen University, Plant Breeding, where he leads the research group ‘Quantitative aspects of Plant Breeding’. His research interests include quantitative genetics with a focus on genetic analysis of polyploid species.

Dick de Ridder is professor in bioinformatics and head of the Bioinformatics Group at Wageningen University. He works on learning-based algorithms, with applications in the analysis of complex genomes and systems and synthetic biology.

Supplementary Data

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

The authors wish to thank Herman J. van Eck for his input and valuable discussions regarding the data from Uitdewilligen et al. [48], Emily Berger for sharing HapTree code, and Vikas Bansal and Derek Aguiar for discussion on HapCut and HapCompass, respectively.

Funding

Wageningen University and Research Centre through the graduate school Experimental Plant Sciences (EPS).

References

1 Collins FS, Morgan M, Patrinos A. The human genome project: lessons from large-scale biology. Science 2003;300(5617):286–90.
2 Sachidanandam R, Weissman D, Schmidt SC, et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 2001;409(6822):928–33.
3 Gibbs RA, Belmont JW, Hardenbol P, et al. The international HapMap project. Nature 2003;426(6968):789–96.
4 Ganal MW, Altmann T, Röder MS. SNP identification in crop plants. Curr Opin Plant Biol 2009;12(2):211–7.
5 LaFramboise T. Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances. Nucleic Acids Res 2009;37(13):4181–93.
6 Visscher PM, Brown MA, McCarthy MI, et al. Five years of GWAS discovery. Am J Hum Genet 2012;90(1):7–24.
7 Kaul S, Koo HL, Jenkins J, et al. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000;408(6814):796–815.
8 Clark AG, Eisen MB, Smith DR, et al. Evolution of genes and genomes on the Drosophila phylogeny. Nature 2007;450(7167):203–18.
9 Hillier LW, Miller W, Birney E, et al. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 2004;432(7018):695–716.
10 Groenen MA, Archibald AL, Uenishi H, et al. Analyses of pig genomes provide insight into porcine demography and evolution. Nature 2012;491(7424):393–8.
11 Potato Genome Sequencing Consortium; Xu X, Pan S, et al. Genome sequence and analysis of the tuber crop potato. Nature 2011;475(7355):189–95.
12 Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature 2012;485(7400):635–41.
13 Kim S, Park M, Yeom SI, et al. Genome sequence of the hot pepper provides insights into the evolution of pungency in Capsicum species. Nat Genet 2014;46(3):270–8.
14 Goddard ME, Hayes BJ. Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat Rev Genet 2009;10(6):381–91.
15 Mammadov J, Aggarwal R, Buyyarapu R, et al. SNP markers and their impact on plant breeding. Int J Plant Genomics 2012;2012(3):728398.
16 Birney E, Soranzo N. Human genomics: the end of the start for population sequencing. Nature 2015;526(7571):52–3.
17 Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN], 2012.
18 McKenna A, Hanna M, Banks E, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20(9):1297–303.
19 Castiglione C, Deinard A, Speed W, et al. Evolution of haplotypes at the DRD2 locus. Am J Hum Genet 1995;57(6):1445.
20 Glusman G, Cox HC, Roach JC. Whole-genome haplotyping approaches and genomic medicine. Genome Med 2014;6(9):73.
21 Caputo M, Rivolta CM, Gutnisky VJ, et al. Recurrence of the p.R277X/p.R1511X compound heterozygous mutation in the thyroglobulin gene in unrelated families with congenital goiter and hypothyroidism: haplotype analysis using intragenic thyroglobulin polymorphisms. J Endocrinol 2007;195(1):167–77.
22 Hamblin MT, Jannink JL. Factors affecting the power of haplotype markers in association studies. Plant Genome 2011;4(2):145–53.
23 Lorenz AJ, Hamblin MT, Jannink JL, et al. Performance of single nucleotide polymorphisms versus haplotypes for genome-wide association analysis in barley. PLoS One 2010;5(11):e14079.
24 Howie B, Fuchsberger C, Stephens M, et al. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 2012;44(8):955–9.
25 Hickey JM, Gorjanc G, Varshney RK, et al. Imputation of single nucleotide polymorphism genotypes in biparental, backcross, and topcross populations with a hidden Markov model. Crop Science 2015;55(5):1934–46.
26 Lancia G, Bafna V, Istrail S, et al. SNPs problems, complexity, and algorithms. In: Algorithms–ESA. Springer-Verlag, Heidelberg, 2001, 182–93.
27 Rizzi R, Bafna V, Istrail S, et al. Practical algorithms and fixed-parameter tractability for the single individual SNP haplotyping problem. In: Algorithms in Bioinformatics. Springer-Verlag, Heidelberg, 2002, 29–43.
28 Lippert R, Schwartz R, Lancia G, et al. Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Brief Bioinform 2002;3(1):23–31.
29 Wang RS, Wu LY, Li ZP, et al. Haplotype reconstruction from SNP fragments by minimum error correction. Bioinformatics 2005;21(10):2456–62.
30 Zhao YY, Wu LY, Zhang JH, et al. Haplotype assembly from aligned weighted SNP fragments. Comput Biol Chem 2005;29(4):281–7.
31 Wang RS, Wu LY, Zhang XS, et al. A Markov chain model for haplotype assembly from SNP fragments. Genome Inform 2006;17(2):162–71.
32 Wang Y, Feng E, Wang R. A clustering algorithm based on two distance functions for MEC model. Comput Biol Chem 2007;31(2):148–50.
33 Bansal V, Bafna V. HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics 2008;24(16):i153–9.
34 Wu LY, Li Z, Wang RS, et al. Self-organizing map approaches for the haplotype assembly problem. Math Comput Simul 2009;79(10):3026–37.
35 Bansal V, Halpern AL, Axelrod N, et al. An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Res 2008;18(8):1336–46.
36 Wu J, Wang J, Chen J. A practical algorithm based on particle swarm optimization for haplotype reconstruction. Appl Math Comput 2009;208(2):363–72.
37 Kuleshov V. Probabilistic single-individual haplotyping. Bioinformatics 2014;30(17):i379–85.
38 Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet 2010;11(1):31–46.
39 Chin CS, Peluso P, Sedlazeck FJ, et al. Phased diploid genome assembly with single molecule real-time sequencing. Nat Methods 2016.
40 Ozkan H, Levy AA, Feldman M. Allopolyploidy-induced rapid genome evolution in the wheat (Aegilops–Triticum) group. Plant Cell 2001;13(8):1735–47.
41 Krasileva KV, Buffalo V, Bailey P, et al. Separating homeologs by phasing in the tetraploid wheat transcriptome. Genome Biol 2013;14(6):R66.
42 Aguiar D, Istrail S. Haplotype assembly in polyploid genomes and identical by descent shared tracts. Bioinformatics 2013;29(13):i352–60.
43 Das S, Vikalo H. SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming. BMC Genomics 2015;16(1):260.
44 Vrijenhoek RC. Polyploid hybrids: multiple origins of a treefrog species. Curr Biol 2006;16(7):R245–7.
45 Yabe T, Ge X, Pelegri F. The zebrafish maternal-effect gene cellular atoll encodes the centriolar component sas-6 and defects in its paternal function promote whole genome duplication. Dev Biol 2007;312(1):44–60.
46 Aguiar D, Wong WS, Istrail S. Tumor haplotype assembly algorithms for cancer genomics. Pac Symp Biocomput 2014;19:3–14.
47 Berger E, Yorukoglu D, Peng J, et al. HapTree: a novel Bayesian framework for single individual polyplotyping using NGS data. PLoS Comput Biol 2014;10(3):e1003502.
48 Uitdewilligen JG, Wolters AMA, D’hoop BB, et al. A next-generation sequencing method for genotyping-by-sequencing of highly heterozygous autotetraploid potato. PLoS One 2013;8(5):e62355.
49 Aguiar D, Istrail S. HapCompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data. J Comput Biol 2012;19(6):577–90.
50 Huang W, Li L, Myers JR, et al. ART: a next-generation sequencing read simulator. Bioinformatics 2012;28(4):593–4.
51 Ono Y, Asai K, Hamada M. PBSIM: PacBio reads simulator – toward accurate genome assembly. Bioinformatics 2013;29(1):119–21.
52 Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997, 2013.
53 Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25(16):2078–9.
54 Picard tools home page. http://broadinstitute.github.io/picard (13 January 2016, date last accessed).
55 Quail MA, Smith M, Coupland P, et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 2012;13(1):341.
56 Liu L, Li Y, Li S, et al. Comparison of next-generation sequencing systems. J Biomed Biotechnol 2012;2012:251364.
57 Browning SR, Browning BL. Haplotype phasing: existing methods and new developments. Nat Rev Genet 2011;12(10):703–14.
58 Thompson R. Maximum likelihood estimation of variance components. Statistics 1980;11(4):545–61.
59 Bates D, Sarkar D, Bates MD, et al. The lme4 package. R package version 2(1), 2007.
60 R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, 2015.

© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).

Journal: Briefings in Bioinformatics (Oxford University Press)

Published: Jan 8, 2017
